Changeset 9a72c4d for doc


Ignore:
Timestamp:
Jun 4, 2018, 6:24:01 PM (6 years ago)
Author:
Peter A. Buhr <pabuhr@…>
Branches:
ADT, aaron-thesis, arm-eh, ast-experimental, cleanup-dtors, deferred_resn, demangler, enum, forall-pointer-decay, jacob/cs343-translation, jenkins-sandbox, master, new-ast, new-ast-unique-expr, new-env, no_list, persistent-indexer, pthread-emulation, qualifiedEnum, with_gc
Children:
6e3eaa57
Parents:
f77dbc0
Message:

updates

Location:
doc/papers
Files:
2 edited

Legend:

Unmodified
Added
Removed
  • doc/papers/concurrency/Paper.tex

    rf77dbc0 r9a72c4d  
    288288\CFA is an extension of ISO-C, and hence, supports all C paradigms.
    289289%It is a non-object-oriented system-language, meaning most of the major abstractions have either no runtime overhead or can be opted out easily.
    290 Like C, the basics of \CFA revolve around structures and functions.
     290Like C, the basics of \CFA revolve around structures and routines.
    291291Virtually all of the code generated by the \CFA translator respects C memory layouts and calling conventions.
    292292While \CFA is not an object-oriented language, lacking the concept of a receiver (\eg @this@) and nominal inheritance-relationships, C does have a notion of objects: ``region of data storage in the execution environment, the contents of which can represent values''~\cite[3.15]{C11}.
     
    349349        'with' '(' $\emph{expression-list}$ ')' $\emph{compound-statement}$
    350350\end{cfa}
    351 and may appear as the body of a function or nested within a function body.
     351and may appear as the body of a routine or nested within a routine body.
    352352Each expression in the expression-list provides a type and object.
    353353The type must be an aggregate type.
     
    360360
    361361\CFA maximizes the ability to reuse names via overloading to aggressively address the naming problem.
    362 Both variables and functions may be overloaded, where selection is based on types, and number of returns (as in Ada~\cite{Ada}) and arguments.
     362Both variables and routines may be overloaded, where selection is based on types, and number of returns (as in Ada~\cite{Ada}) and arguments.
    363363\begin{cquote}
    364364\vspace*{-\baselineskip}%???
     
    416416Overloading is important for \CFA concurrency since the runtime system relies on creating different types to represent concurrency objects.
    417417Therefore, overloading is necessary to prevent the need for long prefixes and other naming conventions to prevent name clashes.
    418 As seen in Section~\ref{basics}, function @main@ is heavily overloaded.
     418As seen in Section~\ref{basics}, routine @main@ is heavily overloaded.
    419419
    420420Variable overloading is useful in the parallel semantics of the @with@ statement for fields with the same name:
     
    438438
    439439Overloading also extends to operators.
    440 Operator-overloading syntax names a routine with the operator symbol and question marks for the operands:
     440Operator-overloading syntax creates a routine name with an operator symbol and question marks for the operands:
    441441\begin{cquote}
    442442\lstDeleteShortInline@%
     
    477477\label{s:ParametricPolymorphism}
    478478
    479 The signature feature of \CFA is parametric-polymorphic functions~\cite{} with functions generalized using a @forall@ clause (giving the language its name), which allow separately compiled routines to support generic usage over multiple types.
    480 For example, the following sum function works for any type that supports construction from 0 and addition:
     479The signature feature of \CFA is parametric-polymorphic routines~\cite{} with routines generalized using a @forall@ clause (giving the language its name), which allow separately compiled routines to support generic usage over multiple types.
     480For example, the following sum routine works for any type that supports construction from 0 and addition:
    481481\begin{cfa}
    482482forall( otype T | { void `?{}`( T *, zero_t ); T `?+?`( T, T ); } ) // constraint type, 0 and +
     
    491491\end{cfa}
    492492
    493 \CFA provides \newterm{traits} to name a group of type assertions, where the trait name allows specifying the same set of assertions in multiple locations, preventing repetition mistakes at each function declaration:
     493\CFA provides \newterm{traits} to name a group of type assertions, where the trait name allows specifying the same set of assertions in multiple locations, preventing repetition mistakes at each routine declaration:
    494494\begin{cfa}
    495495trait `sumable`( otype T ) {
     
    735735`coroutine` Fib { int fn; };
    736736\end{cfa}
    737 which provides communication, @fn@, for the \newterm{coroutine main}, @main@, which runs on the coroutine stack, and possibly multiple interface functions, @next@.
     737which provides communication, @fn@, for the \newterm{coroutine main}, @main@, which runs on the coroutine stack, and possibly multiple interface routines @next@.
    738738Like the structure in Figure~\ref{f:ExternalState}, the coroutine type allows multiple instances, where instances of this type are passed to the (overloaded) coroutine main.
    739739The coroutine main's stack holds the state for the next generation, @f1@ and @f2@, and the code has the three suspend points, representing the three states in the Fibonacci formula, to context switch back to the caller's resume.
    740 The interface function, @next@, takes a Fibonacci instance and context switches to it using @resume@;
     740The interface routine @next@, takes a Fibonacci instance and context switches to it using @resume@;
    741741on restart, the Fibonacci field, @fn@, contains the next value in the sequence, which is returned.
    742742The first @resume@ is special because it cocalls the coroutine at its coroutine main and allocates the stack;
     
    847847\end{figure}
    848848
    849 The previous examples are \newterm{asymmetric (semi) coroutine}s because one coroutine always calls a resuming function for another coroutine, and the resumed coroutine always suspends back to its last resumer, similar to call/return for normal functions.
     849The previous examples are \newterm{asymmetric (semi) coroutine}s because one coroutine always calls a resuming routine for another coroutine, and the resumed coroutine always suspends back to its last resumer, similar to call/return for normal routines
    850850However, there is no stack growth because @resume@/@suspend@ context switch to existing stack-frames rather than create new ones.
    851 \newterm{Symmetric (full) coroutine}s have a coroutine call a resuming function for another coroutine, which eventually forms a resuming-call cycle.
     851\newterm{Symmetric (full) coroutine}s have a coroutine call a resuming routine for another coroutine, which eventually forms a resuming-call cycle.
    852852(The trivial cycle is a coroutine resuming itself.)
    853853This control flow is similar to recursion for normal routines, but again there is no stack growth from the context switch.
     
    931931Figure~\ref{f:ProdCons} shows a producer/consumer symmetric-coroutine performing bi-directional communication.
    932932Since the solution involves a full-coroutining cycle, the program main creates one coroutine in isolation, passes this coroutine to its partner, and closes the cycle at the call to @start@.
    933 The @start@ function communicates both the number of elements to be produced and the consumer into the producer's coroutine structure.
     933The @start@ routine communicates both the number of elements to be produced and the consumer into the producer's coroutine structure.
    934934Then the @resume@ to @prod@ creates @prod@'s stack with a frame for @prod@'s coroutine main at the top, and context switches to it.
    935935@prod@'s coroutine main starts, creates local variables that are retained between coroutine activations, and executes $N$ iterations, each generating two random values, calling the consumer to deliver the values, and printing the status returned from the consumer.
     
    980980However, there is nothing preventing wrong placement or multiple declarations.
    981981
    982 For coroutines as for threads, many implementations are based on routine pointers or function objects~\cite{Butenhof97, C++14, MS:VisualC++, BoostCoroutines15}.
     982For coroutines as for threads, many implementations are based on routine pointers or routine objects~\cite{Butenhof97, C++14, MS:VisualC++, BoostCoroutines15}.
    983983For example, Boost implements coroutines in terms of four functor object-types:
    984984\begin{cfa}
     
    988988symmetric_coroutine<>::yield_type
    989989\end{cfa}
    990 Similarly, the canonical threading paradigm is often based on function pointers, \eg @pthread@~\cite{pthreads}, \Csharp~\cite{Csharp}, Go~\cite{Go}, and Scala~\cite{Scala}.
     990Similarly, the canonical threading paradigm is often based on routine pointers, \eg @pthread@~\cite{pthreads}, \Csharp~\cite{Csharp}, Go~\cite{Go}, and Scala~\cite{Scala}.
    991991However, the generic thread-handle (identifier) is limited (few operations), unless it is wrapped in a custom type.
    992992\begin{cfa}
     
    10021002\end{cfa}
    10031003Since the custom type is simple to write in \CFA and solves several issues, added support for routine/lambda-based coroutines adds very little.
     1004
     1005Note, the type @coroutine_t@ must be an abstract handle to the coroutine, because the coroutine descriptor and its stack are non-copyable.
     1006Copying the coroutine descriptor results in copies being out of date with the current state of the stack.
     1007Correspondingly, copying the stack results is copies being out of date with coroutine descriptor, and pointers in the stack being out of date to data on the stack.
     1008(There is no mechanism in C to find all stack-specific pointers and update them as part of a copy.)
    10041009
    10051010The selected approach is to use language support by introducing a new kind of aggregate (structure):
     
    10161021The reserved keyword eases use for the common cases.
    10171022
    1018 Part of the mechanism to generalize coroutines is using a \CFA trait, which defines a coroutine as anything satisfying the trait @is_coroutine@, and this trait is used to restrict coroutine-manipulation functions:
    1019 \begin{cfa}
    1020 trait is_coroutine( dtype T ) {
    1021       void main( T & this );
    1022       coroutine_desc * get_coroutine( T & this );
     1023Part of the mechanism to generalize coroutines is using a \CFA trait, which defines a coroutine as anything satisfying the trait @is_coroutine@, and this trait is used to restrict coroutine-manipulation routines:
     1024\begin{cfa}
     1025trait is_coroutine( `dtype` T ) {
     1026        void main( T & );
     1027        coroutine_desc * get_coroutine( T & );
    10231028};
    1024 forall( dtype T | is_coroutine(T) ) void get_coroutine( T & );
    1025 forall( dtype T | is_coroutine(T) ) void suspend( T & );
    1026 forall( dtype T | is_coroutine(T) ) void resume( T & );
    1027 \end{cfa}
    1028 This definition ensures there is a statically-typed @main@ function that is the starting point (first stack frame) of a coroutine.
    1029 No return value or additional parameters are necessary for this function, because the coroutine type allows an arbitrary number of interface functions with corresponding arbitrary typed input/output values.
    1030 As well, any object passed to @suspend@ and @resume@ is a coroutine since it must satisfy the @is_coroutine@ trait to compile.
    1031 The advantage of this approach is that users can easily create different types of coroutines, for example, changing the memory layout of a coroutine is trivial when implementing the @get_coroutine@ routine.
     1029forall( `dtype` T | is_coroutine(T) ) void suspend( T & );
     1030forall( `dtype` T | is_coroutine(T) ) void resume( T & );
     1031\end{cfa}
     1032The @dtype@ property of the trait ensures the coroutine descriptor is non-copyable, so all coroutines must be passed by reference (pointer).
     1033The routine definitions ensures there is a statically-typed @main@ routine that is the starting point (first stack frame) of a coroutine, and a mechanism to get (read) the currently executing coroutine handle.
     1034The @main@ routine has no return value or additional parameters because the coroutine type allows an arbitrary number of interface routines with corresponding arbitrary typed input/output values versus fixed ones.
     1035The generic routines @suspend@ and @resume@ can be redefined, but any object passed to them is a coroutine since it must satisfy the @is_coroutine@ trait to compile.
     1036The advantage of this approach is that users can easily create different types of coroutines, for example, changing the memory layout of a coroutine is trivial when implementing the @get_coroutine@ routine, and possibly redefining @suspend@ and @resume@.
    10321037The \CFA keyword @coroutine@ implicitly implements the getter and forward declarations required for implementing the coroutine main:
    10331038\begin{cquote}
     
    10731078Like coroutines and for the same design reasons, the selected approach for user threads is to use language support by introducing a new kind of aggregate (structure) and a \CFA trait:
    10741079\begin{cquote}
    1075 \begin{tabular}{@{}c@{\hspace{2\parindentlnth}}c@{}}
     1080\begin{tabular}{@{}c@{\hspace{3\parindentlnth}}c@{}}
    10761081\begin{cfa}
    10771082thread myThread {
     
    10831088&
    10841089\begin{cfa}
    1085 trait is_thread( dtype T ) {
    1086       void main( T & this );
    1087       thread_desc * get_thread( T & this );
    1088       void ^?{}( T & `mutex` this );
     1090trait is_thread( `dtype` T ) {
     1091      void main( T & );
     1092      thread_desc * get_thread( T & );
     1093      void ^?{}( T & `mutex` );
    10891094};
    10901095\end{cfa}
     
    10921097\end{cquote}
    10931098(The qualifier @mutex@ for the destructor parameter is discussed in Section~\ref{s:Monitors}.)
    1094 Like a coroutine, the statically-typed @main@ function is the starting point (first stack frame) of a user thread.
     1099Like a coroutine, the statically-typed @main@ routine is the starting point (first stack frame) of a user thread.
    10951100The difference is that a coroutine borrows a thread from its caller, so the first thread resuming a coroutine creates an instance of @main@;
    10961101whereas, a user thread receives its own thread from the runtime system, which starts in @main@ as some point after the thread constructor is run.\footnote{
    1097 The \lstinline@main@ function is already a special routine in C (where the program begins), so it is a natural extension of the semantics to use overloading to declare mains for different coroutines/threads (the normal main being the main of the initial thread).}
    1098 No return value or additional parameters are necessary for this function, because the task type allows an arbitrary number of interface functions with corresponding arbitrary typed input/output values.
     1102The \lstinline@main@ routine is already a special routine in C (where the program begins), so it is a natural extension of the semantics to use overloading to declare mains for different coroutines/threads (the normal main being the main of the initial thread).}
     1103No return value or additional parameters are necessary for this routine because the task type allows an arbitrary number of interface routines with corresponding arbitrary typed input/output values.
    10991104
    11001105\begin{comment} % put in appendix with coroutine version ???
     
    11091114
    11101115In this example, threads of type @foo@ start execution in the @void main(foo &)@ routine, which prints @"Hello World!".@ While this paper encourages this approach to enforce strongly typed programming, users may prefer to use the routine-based thread semantics for the sake of simplicity.
    1111 With the static semantics it is trivial to write a thread type that takes a function pointer as a parameter and executes it on its stack asynchronously.
    1112 \begin{cfa}
    1113 typedef void (*voidFunc)(int);
    1114 
    1115 thread FuncRunner {
    1116         voidFunc func;
     1116With the static semantics it is trivial to write a thread type that takes a routine pointer as a parameter and executes it on its stack asynchronously.
     1117\begin{cfa}
     1118typedef void (*voidRtn)(int);
     1119
     1120thread RtnRunner {
     1121        voidRtn func;
    11171122        int arg;
    11181123};
    11191124
    1120 void ?{}(FuncRunner & this, voidFunc inFunc, int arg) {
    1121         this.func = inFunc;
     1125void ?{}(RtnRunner & this, voidRtn inRtn, int arg) {
     1126        this.func = inRtn;
    11221127        this.arg  = arg;
    11231128}
    11241129
    1125 void main(FuncRunner & this) {
    1126         // thread starts here and runs the function
     1130void main(RtnRunner & this) {
     1131        // thread starts here and runs the routine
    11271132        this.func( this.arg );
    11281133}
     
    11331138
    11341139int main() {
    1135         FuncRunner f = {hello, 42};
     1140        RtnRunner f = {hello, 42};
    11361141        return 0?
    11371142}
     
    12101215
    12111216
    1212 \section{Synchronization / Mutual Exclusion}
     1217\section{Mutual Exclusion / Synchronization}
    12131218
    12141219Uncontrolled non-deterministic execution is meaningless.
    1215 To reestablish meaningful execution requires mechanisms to reintroduce determinism (control non-determinism), called synchronization and mutual exclusion, where synchronization is a timing relationship among threads and mutual exclusion is an access-control mechanism on data shared by threads.
    1216 Since many deterministic challenges appear with the use of mutable shared state, some languages/libraries disallow it (Erlang~\cite{Erlang}, Haskell~\cite{Haskell}, Akka~\cite{Akka} (Scala)).
     1220To reestablish meaningful execution requires mechanisms to reintroduce determinism (\ie restrict non-determinism), called mutual exclusion and synchronization, where mutual exclusion is an access-control mechanism on data shared by threads, and synchronization is a timing relationship among threads~\cite[\S~4]{Buhr05a}.
     1221Since many deterministic challenges appear with the use of mutable shared state, some languages/libraries disallow it, \eg Erlang~\cite{Erlang}, Haskell~\cite{Haskell}, Akka~\cite{Akka} (Scala).
    12171222In these paradigms, interaction among concurrent objects is performed by stateless message-passing~\cite{Thoth,Harmony,V-Kernel} or other paradigms closely relate to networking concepts (\eg channels~\cite{CSP,Go}).
    1218 However, in call/return-based languages, these approaches force a clear distinction (\ie introduce a new programming paradigm) between non-concurrent and concurrent computation (\ie function call versus message passing).
    1219 This distinction means a programmers needs to learn two sets of design patterns.
     1223However, in call/return-based languages, these approaches force a clear distinction (\ie introduce a new programming paradigm) between regular and concurrent computation (\ie routine call versus message passing).
     1224Hence, a programmer must learn and manipulate two sets of design patterns.
    12201225While this distinction can be hidden away in library code, effective use of the library still has to take both paradigms into account.
    12211226In contrast, approaches based on statefull models more closely resemble the standard call/return programming-model, resulting in a single programming paradigm.
    12221227
    1223 At the lowest level, concurrent control is implemented as atomic operations, upon which different kinds of locks mechanism are constructed, \eg semaphores~\cite{Dijkstra68b} and path expressions~\cite{Campbell74}.
     1228At the lowest level, concurrent control is implemented by atomic operations, upon which different kinds of locks mechanism are constructed, \eg semaphores~\cite{Dijkstra68b}, barriers, and path expressions~\cite{Campbell74}.
    12241229However, for productivity it is always desirable to use the highest-level construct that provides the necessary efficiency~\cite{Hochstein05}.
    1225 A newer approach is transactional memory~\cite{Herlihy93}.
     1230A newer approach for restricting non-determinism is transactional memory~\cite{Herlihy93}.
    12261231While this approach is pursued in hardware~\cite{Nakaike15} and system languages, like \CC~\cite{Cpp-Transactions}, the performance and feature set is still too restrictive to be the main concurrency paradigm for system languages, which is why it was rejected as the core paradigm for concurrency in \CFA.
    12271232
    1228 One of the most natural, elegant, and efficient mechanisms for synchronization and mutual exclusion for shared-memory systems is the \emph{monitor}.
    1229 Monitors were first proposed by Brinch Hansen~\cite{Hansen73} and later described and extended by C.A.R.~Hoare~\cite{Hoare74}.
    1230 Many programming languages -- \eg Concurrent Pascal~\cite{ConcurrentPascal}, Mesa~\cite{Mesa}, Modula~\cite{Modula-2}, Turing~\cite{Turing:old}, Modula-3~\cite{Modula-3}, NeWS~\cite{NeWS}, Emerald~\cite{Emerald}, \uC~\cite{Buhr92a} and Java~\cite{Java} -- provide monitors as explicit language constructs.
    1231 In addition, operating-system kernels and device drivers have a monitor-like structure, although they often use lower-level primitives such as semaphores or locks to simulate monitors.
    1232 For these reasons, this project proposes monitors as the core concurrency construct, upon which even higher-level approaches can be easily constructed..
     1233One of the most natural, elegant, and efficient mechanisms for mutual exclusion and synchronization for shared-memory systems is the \emph{monitor}.
     1234First proposed by Brinch Hansen~\cite{Hansen73} and later described and extended by C.A.R.~Hoare~\cite{Hoare74}, many concurrent programming-languages provide monitors as an explicit language construct: \eg Concurrent Pascal~\cite{ConcurrentPascal}, Mesa~\cite{Mesa}, Modula~\cite{Modula-2}, Turing~\cite{Turing:old}, Modula-3~\cite{Modula-3}, NeWS~\cite{NeWS}, Emerald~\cite{Emerald}, \uC~\cite{Buhr92a} and Java~\cite{Java}.
     1235In addition, operating-system kernels and device drivers have a monitor-like structure, although they often use lower-level primitives such as mutex locks or semaphores to simulate monitors.
     1236For these reasons, \CFA selected monitors as the core high-level concurrency-construct, upon which higher-level approaches can be easily constructed.
    12331237
    12341238
     
    12361240
    12371241A group of instructions manipulating a specific instance of shared data that must be performed atomically is called an (individual) \newterm{critical-section}~\cite{Dijkstra65}.
    1238 A generalization is a \newterm{group critical-section}~\cite{Joung00}, where multiple tasks with the same session may use the resource simultaneously, but different sessions may not use the resource simultaneously.
     1242The generalization is called a \newterm{group critical-section}~\cite{Joung00}, where multiple tasks with the same session may use the resource simultaneously, but different sessions may not use the resource simultaneously.
    12391243The readers/writer problem~\cite{Courtois71} is an instance of a group critical-section, where readers have the same session and all writers have a unique session.
    1240 \newterm{Mutual exclusion} enforces the correction number of threads are using a critical section at the same time.
     1244\newterm{Mutual exclusion} enforces that the correct kind and number of threads are using a critical section.
    12411245
    12421246However, many solutions exist for mutual exclusion, which vary in terms of performance, flexibility and ease of use.
    12431247Methods range from low-level locks, which are fast and flexible but require significant attention for correctness, to higher-level concurrency techniques, which sacrifice some performance to improve ease of use.
    12441248Ease of use comes by either guaranteeing some problems cannot occur (\eg deadlock free), or by offering a more explicit coupling between shared data and critical section.
    1245 For example, the \CC @std::atomic<T>@ offers an easy way to express mutual-exclusion on a restricted set of operations (\eg reading/writing large types atomically).
    1246 However, a significant challenge with (low-level) locks is composability because it takes careful organization for multiple locks to be used while preventing deadlock.
    1247 Easing composability is another feature higher-level mutual-exclusion mechanisms offer.
     1249For example, the \CC @std::atomic<T>@ offers an easy way to express mutual-exclusion on a restricted set of operations (\eg reading/writing) for numerical types.
     1250However, a significant challenge with locks is composability because it takes careful organization for multiple locks to be used while preventing deadlock.
     1251Easing composability is another feature higher-level mutual-exclusion mechanisms can offer.
    12481252
    12491253
     
    12511255
    12521256Synchronization enforces relative ordering of execution, and synchronization tools provide numerous mechanisms to establish these timing relationships.
    1253 Low-level synchronization primitives offer good performance and flexibility at the cost of ease of use.
    1254 Higher-level mechanisms often simplify usage by adding better coupling between synchronization and data (\eg message passing), or offering a simpler solution to otherwise involved challenges, \eg barrier lock.
    1255 As mentioned above, synchronization can be expressed as guaranteeing that event \textit{X} always happens before \textit{Y}.
    1256 Often synchronization is used to order access to a critical section, \eg ensuring the next kind of thread to enter a critical section is a reader thread
    1257 If a writer thread is scheduled for next access, but another reader thread acquires the critical section first, the reader has \newterm{barged}.
    1258 Barging can result in staleness/freshness problems, where a reader barges ahead of a write and reads temporally stale data, or a writer barges ahead of another writer overwriting data with a fresh value preventing the previous value from having an opportunity to be read.
     1257Low-level synchronization primitives offer good performance and flexibility at the cost of ease of use;
     1258higher-level mechanisms often simplify usage by adding better coupling between synchronization and data (\eg message passing), or offering a simpler solution to otherwise involved challenges, \eg barrier lock.
     1259Often synchronization is used to order access to a critical section, \eg ensuring a reader thread is the next kind of thread to enter a critical section.
     1260If a writer thread is scheduled for next access, but another reader thread acquires the critical section first, that reader has \newterm{barged}.
     1261Barging can result in staleness/freshness problems, where a reader barges ahead of a writer and reads temporally stale data, or a writer barges ahead of another writer overwriting data with a fresh value preventing the previous value from ever being read (lost computation).
    12591262Preventing or detecting barging is an involved challenge with low-level locks, which can be made much easier by higher-level constructs.
    1260 This challenge is often split into two different approaches, barging avoidance and barging prevention.
    1261 Algorithms that allow a barger but divert it until later are avoiding the barger, while algorithms that preclude a barger from entering during synchronization in the critical section prevent the barger completely.
    1262 baton-pass locks~\cite{Andrews89} between threads instead of releasing the locks are said to be using barging prevention.
     1263This challenge is often split into two different approaches: barging avoidance and barging prevention.
     1264Algorithms that allow a barger, but divert it until later using current synchronization state (flags), are avoiding the barger;
     1265algorithms that preclude a barger from entering during synchronization in the critical section prevent barging completely.
     1266Techniques like baton-pass locks~\cite{Andrews89} between threads instead of unconditionally releasing locks is an example of barging prevention.
    12631267
    12641268
     
    12661270\label{s:Monitors}
    12671271
    1268 A \textbf{monitor} is a set of routines that ensure mutual-exclusion when accessing shared state.
    1269 More precisely, a monitor is a programming technique that associates mutual-exclusion to routine scopes, as opposed to mutex locks, where mutual-exclusion is defined by lock/release calls independently of any scoping of the calling routine.
    1270 This strong association eases readability and maintainability, at the cost of flexibility.
    1271 Note that both monitors and mutex locks, require an abstract handle to identify them.
    1272 This concept is generally associated with object-oriented languages like Java~\cite{Java} or \uC~\cite{uC++book} but does not strictly require OO semantics.
    1273 The only requirement is the ability to declare a handle to a shared object and a set of routines that act on it:
    1274 \begin{cfa}
    1275 typedef /*some monitor type*/ monitor;
    1276 int f(monitor & m);
    1277 
    1278 int main() {
    1279         monitor m;  // Handle m
    1280         f(m);       // Routine using handle
    1281 }
    1282 \end{cfa}
    1283 
    1284 % ======================================================================
    1285 % ======================================================================
    1286 \subsection{Call Semantics} \label{call}
    1287 % ======================================================================
    1288 % ======================================================================
    1289 The above monitor example displays some of the intrinsic characteristics.
    1290 First, it is necessary to use pass-by-reference over pass-by-value for monitor routines.
    1291 This semantics is important, because at their core, monitors are implicit mutual-exclusion objects (locks), and these objects cannot be copied.
    1292 Therefore, monitors are non-copy-able objects (@dtype@).
    1293 
    1294 Another aspect to consider is when a monitor acquires its mutual exclusion.
    1295 For example, a monitor may need to be passed through multiple helper routines that do not acquire the monitor mutual-exclusion on entry.
    1296 Passthrough can occur for generic helper routines (@swap@, @sort@, \etc) or specific helper routines like the following to implement an atomic counter:
    1297 
    1298 \begin{cfa}
    1299 monitor counter_t { /*...see section $\ref{data}$...*/ };
    1300 
    1301 void ?{}(counter_t & nomutex this); // constructor
    1302 size_t ++?(counter_t & mutex this); // increment
    1303 
    1304 // need for mutex is platform dependent
    1305 void ?{}(size_t * this, counter_t & mutex cnt); // conversion
    1306 \end{cfa}
    1307 This counter is used as follows:
    1308 \begin{center}
    1309 \begin{tabular}{c @{\hskip 0.35in} c @{\hskip 0.35in} c}
    1310 \begin{cfa}
    1311 // shared counter
    1312 counter_t cnt1, cnt2;
    1313 
    1314 // multiple threads access counter
    1315 thread 1 : cnt1++; cnt2++;
    1316 thread 2 : cnt1++; cnt2++;
    1317 thread 3 : cnt1++; cnt2++;
    1318         ...
    1319 thread N : cnt1++; cnt2++;
    1320 \end{cfa}
    1321 \end{tabular}
    1322 \end{center}
    1323 Notice how the counter is used without any explicit synchronization and yet supports thread-safe semantics for both reading and writing, which is similar in usage to the \CC template @std::atomic@.
    1324 
    1325 Here, the constructor (@?{}@) uses the @nomutex@ keyword to signify that it does not acquire the monitor mutual-exclusion when constructing.
    1326 This semantics is because an object not yet constructed should never be shared and therefore does not require mutual exclusion.
    1327 Furthermore, it allows the implementation greater freedom when it initializes the monitor locking.
    1328 The prefix increment operator uses @mutex@ to protect the incrementing process from race conditions.
    1329 Finally, there is a conversion operator from @counter_t@ to @size_t@.
    1330 This conversion may or may not require the @mutex@ keyword depending on whether or not reading a @size_t@ is an atomic operation.
    1331 
    1332 For maximum usability, monitors use \textbf{multi-acq} semantics, which means a single thread can acquire the same monitor multiple times without deadlock.
    1333 For example, listing \ref{fig:search} uses recursion and \textbf{multi-acq} to print values inside a binary tree.
    1334 \begin{figure}
    1335 \begin{cfa}[caption={Recursive printing algorithm using \textbf{multi-acq}.},label={fig:search}]
    1336 monitor printer { ... };
    1337 struct tree {
    1338         tree * left, right;
    1339         char * value;
     1272A \textbf{monitor} is a set of routines that ensure mutual exclusion when accessing shared state.
     1273More precisely, a monitor is a programming technique that binds mutual exclusion to routine scope, as opposed to locks, where mutual-exclusion is defined by acquire/release calls, independent of lexical context (analogous to block and heap storage allocation).
     1274The strong association with the call/return paradigm eases programmability, readability and maintainability, at a slight cost in flexibility and efficiency.
     1275
     1276Note, like coroutines/threads, both locks and monitors require an abstract handle to reference them, because at their core, both mechanisms are manipulating non-copyable shared state.
     1277Copying a lock is insecure because it is possible to copy an open lock and then use the open copy when the original lock is closed to simultaneously access the shared data.
     1278Copying a monitor is secure because both the lock and shared data are copies, but copying the shared data is meaningless because it no longer represents a unique entity.
     1279As for coroutines/tasks, a non-copyable (@dtype@) trait is used to capture this requirement, so all locks/monitors must be passed by reference (pointer).
     1280\begin{cfa}
     1281trait is_monitor( `dtype` T ) {
     1282        monitor_desc * get_monitor( T & );
     1283        void ^?{}( T & mutex );
    13401284};
    1341 void print(printer & mutex p, char * v);
    1342 
    1343 void print(printer & mutex p, tree * t) {
    1344         print(p, t->value);
    1345         print(p, t->left );
    1346         print(p, t->right);
    1347 }
    1348 \end{cfa}
    1349 \end{figure}
    1350 
    1351 Having both @mutex@ and @nomutex@ keywords can be redundant, depending on the meaning of a routine having neither of these keywords.
    1352 For example, it is reasonable that it should default to the safest option (@mutex@) when given a routine without qualifiers @void foo(counter_t & this)@, whereas assuming @nomutex@ is unsafe and may cause subtle errors.
    1353 On the other hand, @nomutex@ is the ``normal'' parameter behaviour, it effectively states explicitly that ``this routine is not special''.
    1354 Another alternative is making exactly one of these keywords mandatory, which provides the same semantics but without the ambiguity of supporting routines with neither keyword.
    1355 Mandatory keywords would also have the added benefit of being self-documented but at the cost of extra typing.
    1356 While there are several benefits to mandatory keywords, they do bring a few challenges.
    1357 Mandatory keywords in \CFA would imply that the compiler must know without doubt whether or not a parameter is a monitor or not.
     1285\end{cfa}
     1286
     1287
     1288\subsection{Mutex Acquisition}
     1289\label{s:MutexAcquisition}
     1290
     1291While correctness implicitly implies a monitor's mutual exclusion is acquired and released, there are implementation options about when and where the locking/unlocking occurs.
     1292(Much of this discussion also applies to basic locks.)
     1293For example, a monitor may need to be passed through multiple helper routines before it becomes necessary to acquire the monitor mutual-exclusion.
     1294\begin{cfa}[morekeywords=nomutex]
     1295monitor Aint { int cnt; };                                      $\C{// atomic integer counter}$
     1296void ?{}( Aint & `nomutex` this ) with( this ) { cnt = 0; } $\C{// constructor}$
     1297int ?=?( Aint & `mutex`$\(_{opt}\)$ lhs, int rhs ) with( lhs ) { cnt = rhs; } $\C{// conversions}$
     1298void ?{}( int & this, Aint & `mutex`$\(_{opt}\)$ v ) { this = v.cnt; }
     1299int ?=?( int & lhs, Aint & `mutex`$\(_{opt}\)$ rhs ) with( rhs ) { lhs = cnt; }
     1300int ++?( Aint & `mutex`$\(_{opt}\)$ this ) with( this ) { return ++cnt; } $\C{// increment}$
     1301\end{cfa}
     1302The @Aint@ constructor, @?{}@, uses the \lstinline[morekeywords=nomutex]@nomutex@ qualifier indicating mutual exclusion is unnecessary during construction because an object is inaccessible (private) until after it is initialized.
     1303(While a constructor may publish its address into a global variable, doing so generates a race-condition.)
     1304The conversion operators for initializing and assigning with a normal integer only need @mutex@, if reading/writing the implementation type is not atomic.
     1305Finally, the prefix increment operato, @++?@, is normally @mutex@ to protect the incrementing from race conditions, unless there is an atomic increment instruction for the implementation type.
     1306
     1307The atomic counter is used without any explicit mutual-exclusion and provides thread-safe semantics, which is similar to the \CC template @std::atomic@.
     1308\begin{cfa}
     1309Aint x, y, z;
     1310++x; ++y; ++z;                                                          $\C{// safe increment by multiple threads}$
     1311x = 2; y = 2; z = 2;                                            $\C{// conversions}$
     1312int i = x, j = y, k = z;
     1313i = x; j = y; k = z;
     1314\end{cfa}
     1315
     1316For maximum usability, monitors have \newterm{multi-acquire} semantics allowing a thread to acquire it multiple times without deadlock.
     1317For example, atomically printing the contents of a binary tree:
     1318\begin{cfa}
     1319monitor Tree {
     1320        Tree * left, right;
     1321        // value
     1322};
     1323void print( Tree & mutex tree ) {                       $\C{// prefix traversal}$
     1324        // write value
     1325        print( tree->left );                                    $\C{// multiply acquire monitor lock on each recursion}$
     1326        print( tree->right );
     1327}
     1328\end{cfa}
     1329
     1330Mandatory monitor qualifiers have the benefit of being self-documented, but requiring both @mutex@ and \lstinline[morekeywords=nomutex]@nomutex@ for all monitor parameter is redundant.
     1331Instead, one of qualifier semantics can be the default, and the other required.
     1332For example, assume the safe @mutex@ option for a monitor parameter because assuming \lstinline[morekeywords=nomutex]@nomutex@ may cause subtle errors.
     1333On the other hand, assuming \lstinline[morekeywords=nomutex]@nomutex@ is the \emph{normal} parameter behaviour, stating explicitly ``this parameter is not special''.
     1334Providing a default qualifier implies knowing whether a parameter is a monitor.
    13581335Since \CFA relies heavily on traits as an abstraction mechanism, the distinction between a type that is a monitor and a type that looks like a monitor can become blurred.
    1359 For this reason, \CFA only has the @mutex@ keyword and uses no keyword to mean @nomutex@.
    1360 
    1361 The next semantic decision is to establish when @mutex@ may be used as a type qualifier.
    1362 Consider the following declarations:
    1363 \begin{cfa}
    1364 int f1(monitor & mutex m);
    1365 int f2(const monitor & mutex m);
    1366 int f3(monitor ** mutex m);
    1367 int f4(monitor * mutex m []);
    1368 int f5(graph(monitor *) & mutex m);
    1369 \end{cfa}
    1370 The problem is to identify which object(s) should be acquired.
    1371 Furthermore, each object needs to be acquired only once.
    1372 In the case of simple routines like @f1@ and @f2@ it is easy to identify an exhaustive list of objects to acquire on entry.
    1373 Adding indirections (@f3@) still allows the compiler and programmer to identify which object is acquired.
    1374 However, adding in arrays (@f4@) makes it much harder.
    1375 Array lengths are not necessarily known in C, and even then, making sure objects are only acquired once becomes none-trivial.
    1376 This problem can be extended to absurd limits like @f5@, which uses a graph of monitors.
    1377 To make the issue tractable, this project imposes the requirement that a routine may only acquire one monitor per parameter and it must be the type of the parameter with at most one level of indirection (ignoring potential qualifiers).
    1378 Also note that while routine @f3@ can be supported, meaning that monitor @**m@ is acquired, passing an array to this routine would be type-safe and yet result in undefined behaviour because only the first element of the array is acquired.
    1379 However, this ambiguity is part of the C type-system with respects to arrays.
    1380 For this reason, @mutex@ is disallowed in the context where arrays may be passed:
    1381 \begin{cfa}
    1382 int f1(monitor & mutex m);    // Okay : recommended case
    1383 int f2(monitor * mutex m);    // Not Okay : Could be an array
    1384 int f3(monitor mutex m []);  // Not Okay : Array of unknown length
    1385 int f4(monitor ** mutex m);   // Not Okay : Could be an array
    1386 int f5(monitor * mutex m []); // Not Okay : Array of unknown length
    1387 \end{cfa}
    1388 Note that not all array functions are actually distinct in the type system.
    1389 However, even if the code generation could tell the difference, the extra information is still not sufficient to extend meaningfully the monitor call semantic.
    1390 
    1391 Unlike object-oriented monitors, where calling a mutex member \emph{implicitly} acquires mutual-exclusion of the receiver object, \CFA uses an explicit mechanism to specify the object that acquires mutual-exclusion.
    1392 A consequence of this approach is that it extends naturally to multi-monitor calls.
    1393 \begin{cfa}
    1394 int f(MonitorA & mutex a, MonitorB & mutex b);
    1395 
    1396 MonitorA a;
    1397 MonitorB b;
    1398 f(a,b);
    1399 \end{cfa}
    1400 While OO monitors could be extended with a mutex qualifier for multiple-monitor calls, no example of this feature could be found.
    1401 The capability to acquire multiple locks before entering a critical section is called \emph{\textbf{bulk-acq}}.
    1402 In practice, writing multi-locking routines that do not lead to deadlocks is tricky.
     1336For this reason, \CFA requires programmers to identify the kind of parameter with the @mutex@ keyword and uses no keyword to mean \lstinline[morekeywords=nomutex]@nomutex@.
     1337
     1338The next semantic decision is establishing which parameter \emph{types} may be qualified with @mutex@.
     1339Given:
     1340\begin{cfa}
     1341monitor M { ... }
     1342int f1( M & mutex m );
     1343int f2( M * mutex m );
     1344int f3( M * mutex m[] );
     1345int f4( stack( M * ) & mutex m );
     1346\end{cfa}
     1347the issue is that some of these parameter types are composed of multiple objects.
     1348For @f1@, there is only a single parameter object.
     1349Adding indirection in @f2@ still identifies a single object.
     1350However, the matrix in @f3@ introduces multiple objects.
     1351While shown shortly, multiple acquisition is possible;
     1352however array lengths are often unknown in C.
     1353This issue is exacerbated in @f4@, where the data structure must be safely traversed to acquire all of its elements.
     1354
     1355To make the issue tractable, \CFA only acquires one monitor per parameter with at most one level of indirection.
     1356However, the C type-system has an ambiguity with respects to arrays.
     1357Is the argument for @f2@ a single object or an array of objects?
     1358If it is an array, only the first element of the array is acquired, which seems unsafe;
     1359hence, @mutex@ is disallowed for array parameters.
     1360\begin{cfa}
     1361int f1( M & mutex m );                                          $\C{// allowed: recommended case}$
     1362int f2( M * mutex m );                                          $\C{// disallowed: could be an array}$
     1363int f3( M mutex m[$\,$] );                                      $\C{// disallowed: array length unknown}$
     1364int f4( M ** mutex m );                                         $\C{// disallowed: could be an array}$
     1365int f5( M * mutex m[$\,$] );                            $\C{// disallowed: array length unknown}$
     1366\end{cfa}
     1367% Note, not all array routines have distinct types: @f2@ and @f3@ have the same type, as do @f4@ and @f5@.
     1368% However, even if the code generation could tell the difference, the extra information is still not sufficient to extend meaningfully the monitor call semantic.
     1369
     1370For object-oriented monitors, calling a mutex member \emph{implicitly} acquires mutual exclusion of the receiver object, @`rec`.foo(...)@.
     1371\CFA has no receiver, and hence, must use an explicit mechanism to specify which object has mutual exclusion acquired.
     1372A positive consequence of this design decision is the ability to support multi-monitor routines.
     1373\begin{cfa}
     1374int f( M & mutex x, M & mutex y );              $\C{// multiple monitor parameter of any type}$
     1375M m1, m2;
     1376f( m1, m2 );
     1377\end{cfa}
     1378(While object-oriented monitors can be extended with a mutex qualifier for multiple-monitor members, no prior example of this feature could be found.)
     1379In practice, writing multi-locking routines that do not deadlocks is tricky.
    14031380Having language support for such a feature is therefore a significant asset for \CFA.
    1404 In the case presented above, \CFA guarantees that the order of acquisition is consistent across calls to different routines using the same monitors as arguments.
    1405 This consistent ordering means acquiring multiple monitors is safe from deadlock when using \textbf{bulk-acq}.
    1406 However, users can still force the acquiring order.
    1407 For example, notice which routines use @mutex@/@nomutex@ and how this affects acquiring order:
    1408 \begin{cfa}
    1409 void foo(A& mutex a, B& mutex b) { // acquire a & b
    1410         ...
    1411 }
    1412 
    1413 void bar(A& mutex a, B& /*nomutex*/ b) { // acquire a
    1414         ... foo(a, b); ... // acquire b
    1415 }
    1416 
    1417 void baz(A& /*nomutex*/ a, B& mutex b) { // acquire b
    1418         ... foo(a, b); ... // acquire a
    1419 }
    1420 \end{cfa}
    1421 The \textbf{multi-acq} monitor lock allows a monitor lock to be acquired by both @bar@ or @baz@ and acquired again in @foo@.
    1422 In the calls to @bar@ and @baz@ the monitors are acquired in opposite order.
    1423 
    1424 However, such use leads to lock acquiring order problems.
    1425 In the example above, the user uses implicit ordering in the case of function @foo@ but explicit ordering in the case of @bar@ and @baz@.
    1426 This subtle difference means that calling these routines concurrently may lead to deadlock and is therefore undefined behaviour.
    1427 As shown~\cite{Lister77}, solving this problem requires:
    1428 \begin{enumerate}
    1429         \item Dynamically tracking the monitor-call order.
    1430         \item Implement rollback semantics.
    1431 \end{enumerate}
    1432 While the first requirement is already a significant constraint on the system, implementing a general rollback semantics in a C-like language is still prohibitively complex~\cite{Dice10}.
    1433 In \CFA, users simply need to be careful when acquiring multiple monitors at the same time or only use \textbf{bulk-acq} of all the monitors.
    1434 While \CFA provides only a partial solution, most systems provide no solution and the \CFA partial solution handles many useful cases.
    1435 
    1436 For example, \textbf{multi-acq} and \textbf{bulk-acq} can be used together in interesting ways:
    1437 \begin{cfa}
    1438 monitor bank { ... };
    1439 
    1440 void deposit( bank & mutex b, int deposit );
    1441 
    1442 void transfer( bank & mutex mybank, bank & mutex yourbank, int me2you) {
    1443         deposit( mybank, -me2you );
    1444         deposit( yourbank, me2you );
     1381
     1382The capability to acquire multiple locks before entering a critical section is called \newterm{bulk acquire}.
     1383In previous example, \CFA guarantees the order of acquisition is consistent across calls to different routines using the same monitors as arguments.
     1384This consistent ordering means acquiring multiple monitors is safe from deadlock.
     1385However, users can force the acquiring order.
     1386For example, notice the use of @mutex@/\lstinline[morekeywords=nomutex]@nomutex@ and how this affects the acquiring order:
     1387\begin{cfa}
     1388void foo( M & mutex m1, M & mutex m2 );         $\C{// acquire m1 and m2}$
     1389void bar( M & mutex m1, M & /* nomutex */ m2 ) { $\C{// acquire m1}$
     1390        ... foo( m1, m2 ); ...                                  $\C{// acquire m2}$
     1391}
     1392void baz( M & /* nomutex */ m1, M & mutex m2 ) { $\C{// acquire m2}$
     1393        ... foo( m1, m2 ); ...                                  $\C{// acquire m1}$
     1394}
     1395\end{cfa}
     1396The multi-acquire semantics allows @bar@ or @baz@ to acquire a monitor lock and reacquire it in @foo@.
     1397In the calls to @bar@ and @baz@, the monitors are acquired in opposite order.
     1398
     1399However, such use leads to lock acquiring order problems resulting in deadlock~\cite{Lister77}, where detecting it requires dynamically tracking of monitor calls, and dealing with it requires implement rollback semantics~\cite{Dice10}.
     1400In \CFA, safety is guaranteed by using bulk acquire of all monitors to shared objects, whereas other monitor systems provide no aid.
     1401While \CFA provides only a partial solution, the \CFA partial solution handles many useful cases.
     1402\begin{cfa}
     1403monitor Bank { ... };
     1404void deposit( Bank & `mutex` b, int deposit );
     1405void transfer( Bank & `mutex` mybank, Bank & `mutex` yourbank, int me2you) {
     1406        deposit( mybank, `-`me2you );                   $\C{// debit}$
     1407        deposit( yourbank, me2you );                    $\C{// credit}$
    14451408}
    14461409\end{cfa}
    14471410This example shows a trivial solution to the bank-account transfer problem~\cite{BankTransfer}.
    1448 Without \textbf{multi-acq} and \textbf{bulk-acq}, the solution to this problem is much more involved and requires careful engineering.
     1411Without multi- and bulk acquire, the solution to this problem requires careful engineering.
    14491412
    14501413
    14511414\subsection{\protect\lstinline|mutex| statement} \label{mutex-stmt}
    14521415
    1453 The call semantics discussed above have one software engineering issue: only a routine can acquire the mutual-exclusion of a set of monitor. \CFA offers the @mutex@ statement to work around the need for unnecessary names, avoiding a major software engineering problem~\cite{2FTwoHardThings}.
    1454 Table \ref{f:mutex-stmt} shows an example of the @mutex@ statement, which introduces a new scope in which the mutual-exclusion of a set of monitor is acquired.
    1455 Beyond naming, the @mutex@ statement has no semantic difference from a routine call with @mutex@ parameters.
    1456 
    1457 \begin{table}
    1458 \begin{center}
    1459 \begin{tabular}{|c|c|}
    1460 function call & @mutex@ statement \\
    1461 \hline
    1462 \begin{cfa}[tabsize=3]
     1416The monitor call-semantics associate all locking semantics to routines.
     1417Like Java, \CFA offers an alternative @mutex@ statement to reduce refactoring and naming.
     1418\begin{cquote}
     1419\begin{tabular}{@{}c|@{\hspace{\parindentlnth}}c@{}}
     1420routine call & @mutex@ statement \\
     1421\begin{cfa}
    14631422monitor M {};
    14641423void foo( M & mutex m1, M & mutex m2 ) {
    14651424        // critical section
    14661425}
    1467 
    14681426void bar( M & m1, M & m2 ) {
    14691427        foo( m1, m2 );
    14701428}
    1471 \end{cfa}&\begin{cfa}[tabsize=3]
    1472 monitor M {};
     1429\end{cfa}
     1430&
     1431\begin{cfa}
     1432
    14731433void bar( M & m1, M & m2 ) {
    1474         mutex(m1, m2) {
     1434        mutex( m1, m2 ) {       // remove refactoring and naming
    14751435                // critical section
    14761436        }
    14771437}
    14781438
    1479 
    14801439\end{cfa}
    14811440\end{tabular}
    1482 \end{center}
    1483 \caption{Regular call semantics vs. \protect\lstinline|mutex| statement}
    1484 \label{f:mutex-stmt}
    1485 \end{table}
    1486 
    1487 % ======================================================================
    1488 % ======================================================================
    1489 \subsection{Data semantics} \label{data}
    1490 % ======================================================================
    1491 % ======================================================================
    1492 Once the call semantics are established, the next step is to establish data semantics.
    1493 Indeed, until now a monitor is used simply as a generic handle but in most cases monitors contain shared data.
    1494 This data should be intrinsic to the monitor declaration to prevent any accidental use of data without its appropriate protection.
    1495 For example, here is a complete version of the counter shown in section \ref{call}:
    1496 \begin{cfa}
    1497 monitor counter_t {
    1498         int value;
    1499 };
    1500 
    1501 void ?{}(counter_t & this) {
    1502         this.cnt = 0;
    1503 }
    1504 
    1505 int ?++(counter_t & mutex this) {
    1506         return ++this.value;
    1507 }
    1508 
    1509 // need for mutex is platform dependent here
    1510 void ?{}(int * this, counter_t & mutex cnt) {
    1511         *this = (int)cnt;
    1512 }
    1513 \end{cfa}
    1514 
    1515 Like threads and coroutines, monitors are defined in terms of traits with some additional language support in the form of the @monitor@ keyword.
    1516 The monitor trait is:
    1517 \begin{cfa}
    1518 trait is_monitor(dtype T) {
    1519         monitor_desc * get_monitor( T & );
    1520         void ^?{}( T & mutex );
    1521 };
    1522 \end{cfa}
    1523 Note that the destructor of a monitor must be a @mutex@ routine to prevent deallocation while a thread is accessing the monitor.
    1524 As with any object, calls to a monitor, using @mutex@ or otherwise, is undefined behaviour after the destructor has run.
    1525 
    1526 % ======================================================================
    1527 % ======================================================================
    1528 \section{Internal Scheduling} \label{intsched}
    1529 % ======================================================================
    1530 % ======================================================================
    1531 In addition to mutual exclusion, the monitors at the core of \CFA's concurrency can also be used to achieve synchronization.
    1532 With monitors, this capability is generally achieved with internal or external scheduling as in~\cite{Hoare74}.
    1533 With \textbf{scheduling} loosely defined as deciding which thread acquires the critical section next, \textbf{internal scheduling} means making the decision from inside the critical section (\ie with access to the shared state), while \textbf{external scheduling} means making the decision when entering the critical section (\ie without access to the shared state).
    1534 Since internal scheduling within a single monitor is mostly a solved problem, this paper concentrates on extending internal scheduling to multiple monitors.
    1535 Indeed, like the \textbf{bulk-acq} semantics, internal scheduling extends to multiple monitors in a way that is natural to the user but requires additional complexity on the implementation side.
     1441\end{cquote}
     1442
     1443
     1444\section{Internal Scheduling}
     1445\label{s:InternalScheduling}
     1446
     1447While monitor mutual-exclusion provides safe access to shared data, the monitor data may indicate that a thread accessing it cannot proceed, \eg a bounded buffer, Figure~\ref{f:BoundedBuffer}, may be full/empty so produce/consumer threads must block.
     1448Leaving the monitor and trying again (busy waiting) is impractical for high-level programming.
     1449Monitors eliminate busy waiting by providing internal synchronization to schedule threads needing access to the shared data, where the synchronization is blocking (threads are parked) versus spinning.
     1450The synchronization is generally achieved with internal~\cite{Hoare74} or external~\cite[\S~2.9.2]{uC++} scheduling, where \newterm{scheduling} is defined as indicating which thread acquires the critical section next.
     1451\newterm{Internal scheduling} is characterized by each thread entering the monitor and making an individual decision about proceeding or blocking, while \newterm{external scheduling} is characterized by an entering thread making a decision about proceeding for itself and behalf of other threads attempting entry.
     1452
     1453Figure~\ref{f:BBInt} shows a \CFA bounded-buffer with internal scheduling, where producers/consumers enter the monitor, see the buffer is full/empty, and block on an appropriate condition lock, @full@/@empty@.
     1454The @wait@ routine atomically blocks the calling thread and implicitly releases the monitor lock(s) for all monitors in the routine's parameter list.
     1455The appropriate condition lock is signalled to unblock an opposite kind of thread after an element is inserted/removed from the buffer.
     1456Signalling is unconditional, because signalling an empty condition lock does nothing.
     1457Signalling semantics cannot have the signaller and signalled thread in the monitor simultaneously, which means:
     1458\begin{enumerate}
     1459\item
     1460The signalling thread leaves immediately, and the signalled thread continues.
     1461\item
     1462The signalling thread continues and the signalled thread is marked for urgent unblocking at subsequent scheduling points (exit/wait).
     1463\item
     1464The signalling thread blocks but is marked for urgrent unblocking and the signalled thread continues.
     1465\end{enumerate}
     1466The first approach is too restrictive, as it precludes solving a reasonable class of problems (\eg dating service).
     1467\CFA supports the next two semantics as both are useful.
     1468Finally, while it is common to store a @condition@ as a field of the monitor, in \CFA, a @condition@ variable can be created/stored independently.
     1469
     1470\begin{figure}
     1471\centering
     1472\newbox\myboxA
     1473\begin{lrbox}{\myboxA}
     1474\begin{lstlisting}[aboveskip=0pt,belowskip=0pt]
     1475forall( otype T ) { // distribute forall
     1476        monitor Buffer {
     1477                `condition` full, empty;
     1478                int front, back, count;
     1479                T elements[10];
     1480        };
     1481        void ?{}( Buffer(T) & buffer ) with(buffer) {
     1482                [front, back, count] = 0;
     1483        }
     1484
     1485        void insert( Buffer(T) & mutex buffer, T elem )
     1486                                with(buffer) {
     1487                if ( count == 10 ) `wait( empty )`;
     1488                // insert elem into buffer
     1489                `signal( full )`;
     1490        }
     1491        T remove( Buffer(T) & mutex buffer ) with(buffer) {
     1492                if ( count == 0 ) `wait( full )`;
     1493                // remove elem from buffer
     1494                `signal( empty )`;
     1495                return elem;
     1496        }
     1497}
     1498\end{lstlisting}
     1499\end{lrbox}
     1500
     1501\newbox\myboxB
     1502\begin{lrbox}{\myboxB}
     1503\begin{lstlisting}[aboveskip=0pt,belowskip=0pt]
     1504forall( otype T ) { // distribute forall
     1505        monitor Buffer {
     1506
     1507                int front, back, count;
     1508                T elements[10];
     1509        };
     1510        void ?{}( Buffer(T) & buffer ) with(buffer) {
     1511                [front, back, count] = 0;
     1512        }
     1513        T remove( Buffer(T) & mutex buffer ); // forward
     1514        void insert( Buffer(T) & mutex buffer, T elem )
     1515                                with(buffer) {
     1516                if ( count == 10 ) `waitfor( remove, buffer )`;
     1517                // insert elem into buffer
     1518
     1519        }
     1520        T remove( Buffer(T) & mutex buffer ) with(buffer) {
     1521                if ( count == 0 ) `waitfor( insert, buffer )`;
     1522                // remove elem from buffer
     1523
     1524                return elem;
     1525        }
     1526}
     1527\end{lstlisting}
     1528\end{lrbox}
     1529
     1530\subfloat[Internal Scheduling]{\label{f:BBInt}\usebox\myboxA}
     1531%\qquad
     1532\subfloat[External Scheduling]{\label{f:BBExt}\usebox\myboxB}
     1533\caption{Generic Bounded-Buffer}
     1534\label{f:BoundedBuffer}
     1535\end{figure}
     1536
     1537Figure~\ref{f:BBExt} shows a \CFA bounded-buffer with external scheduling, where producers/consumers detecting a full/empty buffer block and prevent more producers/consumers from entering the monitor until the buffer has a free/empty slot.
     1538External scheduling is controlled by the @waitfor@ statement, which atomically blocks the calling thread, releases the monitor lock, and restricts the routine calls that can next acquire mutual exclusion.
     1539If the buffer is full, only calls to @remove@ can acquire the buffer, and if the buffer is empty, only calls to @insert@ can acquire the buffer.
     1540Threads making calls to routines that are currently excluded wait outside (externally) of the monitor on a calling queue.
     1541
     1542An important aspect of monitor implementation is barging, \ie can calling threads barge ahead of signalled threads?
     1543If barging is allowed, synchronization between a singller and signallee is difficult, often requiring multiple unblock/block cycles (looping around a wait rechecking if a condition is met).
     1544\CFA scheduling does \emph{not} have barging, which simplifies synchronization among threads in the monitor.
     1545Supporting barging prevention as well as extending internal scheduling to multiple monitors is the main source of complexity in the design and implementation of \CFA concurrency.
     1546
     1547Indeed, like the bulk acquire semantics, internal scheduling extends to multiple monitors in a way that is natural to the user but requires additional complexity on the implementation side.
    15361548
    15371549First, here is a simple example of internal scheduling:
     
    15561568}
    15571569\end{cfa}
    1558 There are two details to note here.
    1559 First, @signal@ is a delayed operation; it only unblocks the waiting thread when it reaches the end of the critical section.
    1560 This semantics is needed to respect mutual-exclusion, \ie the signaller and signalled thread cannot be in the monitor simultaneously.
    1561 The alternative is to return immediately after the call to @signal@, which is significantly more restrictive.
    1562 Second, in \CFA, while it is common to store a @condition@ as a field of the monitor, a @condition@ variable can be stored/created independently of a monitor.
    1563 Here routine @foo@ waits for the @signal@ from @bar@ before making further progress, ensuring a basic ordering.
    1564 
    1565 An important aspect of the implementation is that \CFA does not allow barging, which means that once function @bar@ releases the monitor, @foo@ is guaranteed to be the next thread to acquire the monitor (unless some other thread waited on the same condition).
    1566 This guarantee offers the benefit of not having to loop around waits to recheck that a condition is met.
    1567 The main reason \CFA offers this guarantee is that users can easily introduce barging if it becomes a necessity but adding barging prevention or barging avoidance is more involved without language support.
    1568 Supporting barging prevention as well as extending internal scheduling to multiple monitors is the main source of complexity in the design and implementation of \CFA concurrency.
    15691570
    15701571% ======================================================================
     
    16001601This semantic is a logical requirement for barging prevention.
    16011602
    1602 A direct extension of the previous example is a \textbf{bulk-acq} version:
     1603A direct extension of the previous example is a bulk acquire version:
    16031604\begin{multicols}{2}
    16041605\begin{cfa}
     
    16141615\end{cfa}
    16151616\end{multicols}
    1616 \noindent This version uses \textbf{bulk-acq} (denoted using the {\sf\&} symbol), but the presence of multiple monitors does not add a particularly new meaning.
     1617\noindent This version uses bulk acquire (denoted using the {\sf\&} symbol), but the presence of multiple monitors does not add a particularly new meaning.
    16171618Synchronization happens between the two threads in exactly the same way and order.
    16181619The only difference is that mutual exclusion covers a group of monitors.
     
    16751676% ======================================================================
    16761677
    1677 A larger example is presented to show complex issues for \textbf{bulk-acq} and its implementation options are analyzed.
    1678 Figure~\ref{f:int-bulk-cfa} shows an example where \textbf{bulk-acq} adds a significant layer of complexity to the internal signalling semantics, and listing \ref{f:int-bulk-cfa} shows the corresponding \CFA code to implement the cfa-code in listing \ref{f:int-bulk-cfa}.
     1678A larger example is presented to show complex issues for bulk acquire and its implementation options are analyzed.
     1679Figure~\ref{f:int-bulk-cfa} shows an example where bulk acquire adds a significant layer of complexity to the internal signalling semantics, and listing \ref{f:int-bulk-cfa} shows the corresponding \CFA code to implement the cfa-code in listing \ref{f:int-bulk-cfa}.
    16791680For the purpose of translating the given cfa-code into \CFA-code, any method of introducing a monitor is acceptable, \eg @mutex@ parameters, global variables, pointer parameters, or using locals with the @mutex@ statement.
    16801681
     
    17071708\end{cfa}
    17081709\end{multicols}
    1709 \begin{cfa}[caption={Internal scheduling with \textbf{bulk-acq}},label={f:int-bulk-cfa}]
     1710\begin{cfa}[caption={Internal scheduling with bulk acquire},label={f:int-bulk-cfa}]
    17101711\end{cfa}
    17111712\begin{center}
     
    17731774
    17741775The complexity begins at code sections 4 and 8 in listing \ref{f:int-bulk-cfa}, which are where the existing semantics of internal scheduling needs to be extended for multiple monitors.
    1775 The root of the problem is that \textbf{bulk-acq} is used in a context where one of the monitors is already acquired, which is why it is important to define the behaviour of the previous cfa-code.
     1776The root of the problem is that bulk acquire is used in a context where one of the monitors is already acquired, which is why it is important to define the behaviour of the previous cfa-code.
    17761777When the signaller thread reaches the location where it should ``release @A & B@'' (listing \ref{f:int-bulk-cfa} line \ref{line:releaseFirst}), it must actually transfer ownership of monitor @B@ to the waiting thread.
    17771778This ownership transfer is required in order to prevent barging into @B@ by another thread, since both the signalling and signalled threads still need monitor @A@.
     
    21102111}% subfloat
    21112112\qquad
    2112 \subfloat[\textbf{bulk-acq} Monitor] {
     2113\subfloat[bulk acquire Monitor] {
    21132114\label{fig:BulkMonitor}
    21142115{\resizebox{0.45\textwidth}{!}{\input{ext_monitor}}}
     
    21272128Here, the mutex routine called is associated with a thread on the entry queue while a list of acceptable routines is kept separate.
    21282129Generating a mask dynamically means that the storage for the mask information can vary between calls to @waitfor@, allowing for more flexibility and extensions.
    2129 Storing an array of accepted function pointers replaces the single instruction bitmask comparison with dereferencing a pointer followed by a linear search.
     2130Storing an array of accepted routine pointers replaces the single instruction bitmask comparison with dereferencing a pointer followed by a linear search.
    21302131Furthermore, supporting nested external scheduling (\eg listing \ref{f:nest-ext}) may now require additional searches for the @waitfor@ statement to check if a routine is already queued.
    21312132
     
    21452146\end{figure}
    21462147
    2147 Note that in the right picture, tasks need to always keep track of the monitors associated with mutex routines, and the routine mask needs to have both a function pointer and a set of monitors, as is discussed in the next section.
     2148Note that in the right picture, tasks need to always keep track of the monitors associated with mutex routines, and the routine mask needs to have both a routine pointer and a set of monitors, as is discussed in the next section.
    21482149These details are omitted from the picture for the sake of simplicity.
    21492150
     
    22332234% ======================================================================
    22342235
    2235 Syntactically, the @waitfor@ statement takes a function identifier and a set of monitors.
    2236 While the set of monitors can be any list of expressions, the function name is more restricted because the compiler validates at compile time the validity of the function type and the parameters used with the @waitfor@ statement.
    2237 It checks that the set of monitors passed in matches the requirements for a function call.
     2236Syntactically, the @waitfor@ statement takes a routine identifier and a set of monitors.
     2237While the set of monitors can be any list of expressions, the routine name is more restricted because the compiler validates at compile time the validity of the routine type and the parameters used with the @waitfor@ statement.
     2238It checks that the set of monitors passed in matches the requirements for a routine call.
    22382239Figure~\ref{f:waitfor} shows various usages of the waitfor statement and which are acceptable.
    2239 The choice of the function type is made ignoring any non-@mutex@ parameter.
     2240The choice of the routine type is made ignoring any non-@mutex@ parameter.
    22402241One limitation of the current implementation is that it does not handle overloading, but overloading is possible.
    22412242\begin{figure}
     
    22632264        waitfor(f2, a1, a2); // Incorrect : Mutex arguments don't match
    22642265        waitfor(f1, 1);      // Incorrect : 1 not a mutex argument
    2265         waitfor(f9, a1);     // Incorrect : f9 function does not exist
     2266        waitfor(f9, a1);     // Incorrect : f9 routine does not exist
    22662267        waitfor(*fp, a1 );   // Incorrect : fp not an identifier
    22672268        waitfor(f4, a1);     // Incorrect : f4 ambiguous
     
    22732274
    22742275Finally, for added flexibility, \CFA supports constructing a complex @waitfor@ statement using the @or@, @timeout@ and @else@.
    2275 Indeed, multiple @waitfor@ clauses can be chained together using @or@; this chain forms a single statement that uses baton pass to any function that fits one of the function+monitor set passed in.
    2276 To enable users to tell which accepted function executed, @waitfor@s are followed by a statement (including the null statement @;@) or a compound statement, which is executed after the clause is triggered.
    2277 A @waitfor@ chain can also be followed by a @timeout@, to signify an upper bound on the wait, or an @else@, to signify that the call should be non-blocking, which checks for a matching function call already arrived and otherwise continues.
     2276Indeed, multiple @waitfor@ clauses can be chained together using @or@; this chain forms a single statement that uses baton pass to any routine that fits one of the routine+monitor set passed in.
     2277To enable users to tell which accepted routine executed, @waitfor@s are followed by a statement (including the null statement @;@) or a compound statement, which is executed after the clause is triggered.
     2278A @waitfor@ chain can also be followed by a @timeout@, to signify an upper bound on the wait, or an @else@, to signify that the call should be non-blocking, which checks for a matching routine call already arrived and otherwise continues.
    22782279Any and all of these clauses can be preceded by a @when@ condition to dynamically toggle the accept clauses on or off based on some current state.
    22792280Figure~\ref{f:waitfor2} demonstrates several complex masks and some incorrect ones.
     
    24412442\section{Behind the Scenes}
    24422443There are several challenges specific to \CFA when implementing concurrency.
    2443 These challenges are a direct result of \textbf{bulk-acq} and loose object definitions.
     2444These challenges are a direct result of bulk acquire and loose object definitions.
    24442445These two constraints are the root cause of most design decisions in the implementation.
    24452446Furthermore, to avoid contention from dynamically allocating memory in a concurrent environment, the internal-scheduling design is (almost) entirely free of mallocs.
     
    24492450The main memory concern for concurrency is queues.
    24502451All blocking operations are made by parking threads onto queues and all queues are designed with intrusive nodes, where each node has pre-allocated link fields for chaining, to avoid the need for memory allocation.
    2451 Since several concurrency operations can use an unbound amount of memory (depending on \textbf{bulk-acq}), statically defining information in the intrusive fields of threads is insufficient.The only way to use a variable amount of memory without requiring memory allocation is to pre-allocate large buffers of memory eagerly and store the information in these buffers.
     2452Since several concurrency operations can use an unbound amount of memory (depending on bulk acquire), statically defining information in the intrusive fields of threads is insufficient.The only way to use a variable amount of memory without requiring memory allocation is to pre-allocate large buffers of memory eagerly and store the information in these buffers.
    24522453Conveniently, the call stack fits that description and is easy to use, which is why it is used heavily in the implementation of internal scheduling, particularly variable-length arrays.
    24532454Since stack allocation is based on scopes, the first step of the implementation is to identify the scopes that are available to store the information, and which of these can have a variable-length array.
    24542455The threads and the condition both have a fixed amount of memory, while @mutex@ routines and blocking calls allow for an unbound amount, within the stack size.
    24552456
    2456 Note that since the major contributions of this paper are extending monitor semantics to \textbf{bulk-acq} and loose object definitions, any challenges that are not resulting of these characteristics of \CFA are considered as solved problems and therefore not discussed.
     2457Note that since the major contributions of this paper are extending monitor semantics to bulk acquire and loose object definitions, any challenges that are not resulting of these characteristics of \CFA are considered as solved problems and therefore not discussed.
    24572458
    24582459% ======================================================================
     
    25642565void foo(T * mutex t);
    25652566
    2566 // Correct: this function only works on monitors (any monitor)
     2567// Correct: this routine only works on monitors (any monitor)
    25672568forall(dtype T | is_monitor(T))
    25682569void bar(T * mutex t));
     
    25712572Both entry point and \textbf{callsite-locking} are feasible implementations.
    25722573The current \CFA implementation uses entry-point locking because it requires less work when using \textbf{raii}, effectively transferring the burden of implementation to object construction/destruction.
    2573 It is harder to use \textbf{raii} for call-site locking, as it does not necessarily have an existing scope that matches exactly the scope of the mutual exclusion, \ie the function body.
     2574It is harder to use \textbf{raii} for call-site locking, as it does not necessarily have an existing scope that matches exactly the scope of the mutual exclusion, \ie the routine body.
    25742575For example, the monitor call can appear in the middle of an expression.
    25752576Furthermore, entry-point locking requires less code generation since any useful routine is called multiple times but there is only one entry point for many call sites.
     
    26092610\subsection{Context Switching}
    26102611As mentioned in section \ref{coroutine}, coroutines are a stepping stone for implementing threading, because they share the same mechanism for context-switching between different stacks.
    2611 To improve performance and simplicity, context-switching is implemented using the following assumption: all context-switches happen inside a specific function call.
     2612To improve performance and simplicity, context-switching is implemented using the following assumption: all context-switches happen inside a specific routine call.
    26122613This assumption means that the context-switch only has to copy the callee-saved registers onto the stack and then switch the stack registers with the ones of the target coroutine/thread.
    2613 Note that the instruction pointer can be left untouched since the context-switch is always inside the same function.
     2614Note that the instruction pointer can be left untouched since the context-switch is always inside the same routine
    26142615Threads, however, do not context-switch between each other directly.
    26152616They context-switch to the scheduler.
     
    26482649As a result, a signal handler can start on one kernel thread and terminate on a second kernel thread (but the same user thread).
    26492650It is important to note that signal handlers save and restore signal masks because user-thread migration can cause a signal mask to migrate from one kernel thread to another.
    2650 This behaviour is only a problem if all kernel threads, among which a user thread can migrate, differ in terms of signal masks\footnote{Sadly, official POSIX documentation is silent on what distinguishes ``async-signal-safe'' functions from other functions.}.
     2651This behaviour is only a problem if all kernel threads, among which a user thread can migrate, differ in terms of signal masks\footnote{Sadly, official POSIX documentation is silent on what distinguishes ``async-signal-safe'' routines from other routines}.
    26512652However, since the kernel thread handling preemption requires a different signal mask, executing user threads on the kernel-alarm thread can cause deadlocks.
    26522653For this reason, the alarm thread is in a tight loop around a system call to @sigwaitinfo@, requiring very little CPU time for preemption.
     
    26782679
    26792680For \CFA, this picture does not have support for blocking multiple monitors on a single condition.
    2680 To support \textbf{bulk-acq} two changes to this picture are required.
     2681To support bulk acquire two changes to this picture are required.
    26812682First, it is no longer helpful to attach the condition to \emph{a single} monitor.
    26822683Secondly, the thread waiting on the condition has to be separated across multiple monitors, seen in figure \ref{fig:monitor_cfa}.
     
    27272728\end{figure}
    27282729
    2729 The solution discussed in \ref{intsched} can be seen in the exit routine of listing \ref{f:entry2}.
     2730The solution discussed in \ref{s:InternalScheduling} can be seen in the exit routine of listing \ref{f:entry2}.
    27302731Basically, the solution boils down to having a separate data structure for the condition queue and the AS-stack, and unconditionally transferring ownership of the monitors but only unblocking the thread when the last monitor has transferred ownership.
    27312732This solution is deadlock safe as well as preventing any potential barging.
     
    29632964}
    29642965\end{cfa}
    2965 This function is called by the kernel to fetch the default preemption rate, where 0 signifies an infinite time-slice, \ie no preemption.
     2966This routine is called by the kernel to fetch the default preemption rate, where 0 signifies an infinite time-slice, \ie no preemption.
    29662967However, once clusters are fully implemented, it will be possible to create fibers and \textbf{uthread} in the same system, as in listing \ref{f:fiber-uthread}
    29672968\begin{figure}
     
    31483149For monitors, the simplest approach is to measure how long it takes to enter and leave a monitor routine.
    31493150Figure~\ref{f:mutex} shows the code for \CFA.
    3150 To put the results in context, the cost of entering a non-inline function and the cost of acquiring and releasing a @pthread_mutex@ lock is also measured.
     3151To put the results in context, the cost of entering a non-inline routine and the cost of acquiring and releasing a @pthread_mutex@ lock is also measured.
    31513152The results can be shown in table \ref{tab:mutex}.
    31523153
     
    33993400Therefore, there is still significant work to improve performance.
    34003401Many of the data structures and algorithms may change in the future to more efficient versions.
    3401 For example, the number of monitors in a single \textbf{bulk-acq} is only bound by the stack size, this is probably unnecessarily generous.
     3402For example, the number of monitors in a single bulk acquire is only bound by the stack size, this is probably unnecessarily generous.
    34023403It may be possible that limiting the number helps increase performance.
    34033404However, it is not obvious that the benefit would be significant.
  • doc/papers/general/Paper.tex

    rf77dbc0 r9a72c4d  
    201201
    202202\abstract[Summary]{
    203 The C programming language is a foundational technology for modern computing with millions of lines of code implementing everything from commercial operating-systems to hobby projects.
     203The C programming language is a foundational technology for modern computing with millions of lines of code implementing everything from hobby projects to commercial operating-systems.
    204204This installation base and the programmers producing it represent a massive software-engineering investment spanning decades and likely to continue for decades more.
    205205Nevertheless, C, first standardized almost forty years ago, lacks many features that make programming in more modern languages safer and more productive.
     
    224224\section{Introduction}
    225225
    226 The C programming language is a foundational technology for modern computing with millions of lines of code implementing everything from commercial operating-systems to hobby projects.
     226The C programming language is a foundational technology for modern computing with millions of lines of code implementing everything from hobby projects to commercial operating-systems.
    227227This installation base and the programmers producing it represent a massive software-engineering investment spanning decades and likely to continue for decades more.
    228228The TIOBE~\cite{TIOBE} ranks the top 5 most \emph{popular} programming languages as: Java 15\%, \Textbf{C 12\%}, \Textbf{\CC 5.5\%}, Python 5\%, \Csharp 4.5\% = 42\%, where the next 50 languages are less than 4\% each with a long tail.
Note: See TracChangeset for help on using the changeset viewer.