Timestamp:
Jun 30, 2018, 8:35:13 AM
Branches:
aaron-thesis, arm-eh, cleanup-dtors, deferred_resn, demangler, enum, forall-pointer-decay, jacob/cs343-translation, jenkins-sandbox, master, new-ast, new-ast-unique-expr, no_list, persistent-indexer, pthread-emulation, qualifiedEnum
Children:
2ebcb28, 35a9e41
Parents:
1f133dc
Message:

final corrections

Location:
doc/papers/concurrency
Files:
2 edited

\abstract[Summary]{
\CFA is a modern, polymorphic, \emph{non-object-oriented} extension of the C programming language.
This paper discusses the design of the concurrency and parallelism features in \CFA, and its concurrent runtime-system.
These features are created from scratch as ISO C lacks concurrency, relying largely on the pthreads library.
Coroutines and lightweight (user) threads are introduced into \CFA;
as well, monitors are added as a high-level mechanism for mutual exclusion and synchronization.
A unique contribution of this work is allowing multiple monitors to be safely acquired \emph{simultaneously}.
All features respect the expectations of C programmers, while being fully integrated with the \CFA polymorphic type-system and other language features.
Experimental results show comparable performance of the new features with similar mechanisms in other concurrent programming-languages.
}%

While the focus of this discussion is concurrency and parallelism, it is important to address coroutines, which are a significant building block of a concurrency system (but not concurrent among themselves).
Coroutines are generalized routines allowing execution to be temporarily suspended and later resumed.
Hence, unlike a normal routine, a coroutine may not terminate when it returns to its caller, allowing it to be restarted with the values and execution location present at the point of suspension.
This capability is accomplished via the coroutine's stack, where suspend/resume context switch among stacks.
Because threading design-challenges are present in coroutines, their design effort is relevant, and this effort can be easily exposed to programmers as a useful new programming paradigm, because a coroutine handles the class of problems that need to retain state between calls, \eg plugins, device drivers, and finite-state machines.
Therefore, the two fundamental features of the core \CFA coroutine-API are independent call-stacks and @suspend@/@resume@ operations.

Consider a formatting coroutine that groups input characters into fixed-size blocks and a fixed number of blocks per line.
The example takes advantage of resuming a coroutine in the constructor to prime the loops so the first character sent for formatting appears inside the nested loops.
The destructor provides a newline, if formatted text ends with a full line.
Figure~\ref{f:CFmt} shows the C equivalent formatter, where the loops of the coroutine are flattened (linearized) and rechecked on each call because execution location is not retained between calls.
(Linearized code is the bane of device drivers.)
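A sketch of such a formatting coroutine follows;
the I/O operations (@sout@) and the block/group sizes (4 characters, 5 blocks) are illustrative assumptions rather than the original figure.
\begin{cfa}
coroutine Fmt { char ch;  int g, b; };		$\C{// sketch: communication variables}$
void main( Fmt & fmt ) with( fmt ) {
	for ( ;; ) {							$\C{// for as many characters}$
		for ( g = 0; g < 5; g += 1 ) {		$\C{// groups of 5 blocks per line}$
			for ( b = 0; b < 4; b += 1 ) {	$\C{// blocks of 4 characters}$
				suspend();					$\C{// wait for next character}$
				sout | ch;					$\C{// print character}$
			}
			sout | "  ";					$\C{// block separator}$
		}
		sout | endl;						$\C{// group separator}$
	}
}
void ?{}( Fmt & fmt ) { resume( fmt ); }	$\C{// prime the loops}$
void ^?{}( Fmt & fmt ) with( fmt ) {
	if ( g != 0 || b != 0 ) sout | endl;	$\C{// newline for partial last line}$
}
\end{cfa}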
A programming language (and possibly its tool set, \eg debugger) may need to understand @baseCoroutine@ because of the stack.
Furthermore, the execution of constructors/destructors is in the wrong order for certain operations.
For example, for threads, if the thread is implicitly started, it must start \emph{after} all constructors, because the thread relies on a completely initialized object, but the inherited constructor runs \emph{before} the derived.

The core \CFA coroutine-API is captured in a trait:
\begin{cfa}
trait is_coroutine( dtype T ) {
	void main( T & );
	coroutine_desc * get_coroutine( T & );
};
forall( dtype T | is_coroutine(T) ) void suspend( T & );
forall( dtype T | is_coroutine(T) ) void resume( T & );
\end{cfa}
The @dtype@ property provides no implicit copying operations and the @is_coroutine@ trait provides no explicit copying operations, so all coroutines must be passed by reference (pointer).
The routine definitions ensure there is a statically-typed @main@ routine that is the starting point (first stack frame) of a coroutine, and a mechanism to get (read) the currently executing coroutine handle.
The @main@ routine has no return value or additional parameters because the coroutine type allows an arbitrary number of interface routines with corresponding arbitrarily typed input/output values versus fixed ones.

The allocation/deallocation pattern appears unusual because allocated objects are immediately deallocated without any intervening code.
However, for threads, the deletion provides implicit synchronization, which is the intervening code.
While the subtotals are added in linear order rather than completion order, which slightly inhibits concurrency, the computation is restricted by the critical-path thread (\ie the thread that takes the longest), and so any inhibited concurrency is very small as totalling the subtotals is trivial.

\begin{figure}
\begin{cfa}
thread Adder { int * row, cols, & subtotal; };	$\C{// communication variables}$
void ?{}( Adder & adder, int row[], int cols, int & subtotal ) {
	adder.[ row, cols, &subtotal ] = [ row, cols, &subtotal ];
}
void main( Adder & adder ) with( adder ) {		$\C{// thread body}$
	subtotal = 0;
	for ( int c = 0; c < cols; c += 1 ) subtotal += row[c];
}
int main() {
	const int rows = 10, cols = 1000;
	int matrix[rows][cols], subtotals[rows], total = 0;
	// read matrix
	Adder * adders[rows];
	for ( int r = 0; r < rows; r += 1 ) {		$\C{// start threads to sum rows}$
		adders[r] = new( matrix[r], cols, &subtotals[r] );
	}
	for ( int r = 0; r < rows; r += 1 ) {		$\C{// wait for threads to finish}$
		delete( adders[r] );					$\C{// termination join}$
		total += subtotals[r];					$\C{// total subtotal}$
	}
}
\end{cfa}
\caption{Concurrent matrix summation}
\end{figure}

To reestablish meaningful execution requires mechanisms to reintroduce determinism, \ie restrict non-determinism, called mutual exclusion and synchronization, where mutual exclusion is an access-control mechanism on data shared by threads, and synchronization is a timing relationship among threads~\cite[\S~4]{Buhr05a}.
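As a minimal illustration of the need for mutual exclusion, the following sketch (type and loop bound are illustrative) runs two threads incrementing a shared counter;
because each increment is a non-atomic read-modify-write, concurrent increments can interleave and lose updates.
\begin{cfa}
int counter = 0;						$\C{// shared, unprotected}$
thread Worker {};						$\C{// illustrative thread type}$
void main( Worker & ) {
	for ( int i = 0; i < 100000; i += 1 )
		counter += 1;					$\C{// race: load, add, store can interleave}$
}
int main() {
	{
		Worker workers[2];				$\C{// threads implicitly start after construction}$
	}									$\C{// implicit join at block exit}$
	// counter is often less than 200000 because increments are lost
}
\end{cfa}
Monitors, discussed below, provide the mutual exclusion needed to make such updates safe.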
Since many deterministic challenges appear with the use of mutable shared state, some languages/libraries disallow it, \eg Erlang~\cite{Erlang}, Haskell~\cite{Haskell}, Akka~\cite{Akka} (Scala).
In these paradigms, interaction among concurrent objects is performed by stateless message-passing~\cite{Thoth,Harmony,V-Kernel} or other paradigms closely related to networking concepts, \eg channels~\cite{CSP,Go}.
However, in call/return-based languages, these approaches force a clear distinction, \ie introduce a new programming paradigm between regular and concurrent computation, \eg routine call versus message passing.
Hence, a programmer must learn and manipulate two sets of design patterns.
While this distinction can be hidden away in library code, effective use of the library still has to take both paradigms into account.
In contrast, approaches based on stateful models more closely resemble the standard call/return programming-model, resulting in a single programming paradigm.

At the lowest level, concurrent control is implemented by atomic operations, upon which different kinds of locking mechanisms are constructed, \eg semaphores~\cite{Dijkstra68b}, barriers, and path expressions~\cite{Campbell74}.
Algorithms that allow a barger, but divert it until later using current synchronization state (flags), are avoiding the barger;
algorithms that preclude a barger from entering during synchronization in the critical section prevent barging completely.
Techniques like baton-passing locks~\cite{Andrews89} between threads, instead of unconditionally releasing locks, are examples of barging prevention.

Copying a lock is insecure because it is possible to copy an open lock and then use the open copy when the original lock is closed to simultaneously access the shared data.
Copying a monitor is secure because both the lock and shared data are copies, but copying the shared data is meaningless because it no longer represents a unique entity.
As for coroutines/tasks, the @dtype@ property provides no implicit copying operations and the @is_monitor@ trait provides no explicit copying operations, so all locks/monitors must be passed by reference (pointer).
\begin{cfa}
trait is_monitor( dtype T ) {
	monitor_desc * get_monitor( T & );
	void ^?{}( T & mutex );
};
\end{cfa}

\subsection{Mutex Acquisition}
\label{s:MutexAcquisition}

While correctness implies a monitor's mutual exclusion is acquired and released, there are implementation options about when and where the locking/unlocking occurs.
(Much of this discussion also applies to basic locks.)
For example, a monitor may need to be passed through multiple helper routines before it becomes necessary to acquire the monitor mutual-exclusion.
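For instance, assuming the convention that \lstinline[morekeywords=nomutex]@nomutex@ parameters perform no locking, deferred acquisition through a helper might look like the following sketch (names are illustrative):
\begin{cfa}
monitor M { int data; };
void work( M & mutex m ) { m.data += 1; }	$\C{// acquire mutual exclusion on entry}$
void helper( M & nomutex m ) {				$\C{// no acquisition}$
	// argument validation, logging, etc., needing no mutual exclusion
	work( m );								$\C{// acquisition deferred until needed}$
}
\end{cfa}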
Similarly, a monitor lock can be acquired multiple times by the same thread through recursion:
\begin{cfa}
monitor Tree {
	Tree * left, * right;
	// value
};
void print( Tree & mutex tree ) {			$\C{// prefix traversal}$
	// write value
	print( tree->left );					$\C{// multiply acquire monitor lock for tree on each recursion}$
	print( tree->right );
}
\end{cfa}

The benefit of mandatory monitor qualifiers is self-documentation, but requiring both @mutex@ and \lstinline[morekeywords=nomutex]@nomutex@ for all monitor parameters is redundant.
Instead, the semantics have one qualifier as the default, and the other required.
For example, make the safe @mutex@ qualifier the default because assuming \lstinline[morekeywords=nomutex]@nomutex@ may cause subtle errors.
Alternatively, make the unsafe \lstinline[morekeywords=nomutex]@nomutex@ qualifier the default because it is the \emph{normal} parameter semantics while @mutex@ parameters are rare.
Providing a default qualifier implies knowing whether a parameter is a monitor.
Since \CFA relies heavily on traits as an abstraction mechanism, the distinction between a type that is a monitor and a type that looks like a monitor can become blurred.
Having language support for such a feature is therefore a significant asset for \CFA.

The capability to acquire multiple locks before entering a critical section is called \newterm{bulk acquire} (see Section~\ref{s:Implementation} for implementation details).
In the previous example, \CFA guarantees the order of acquisition is consistent across calls to different routines using the same monitors as arguments.
This consistent ordering means acquiring multiple monitors is safe from deadlock.
However, such use leads to lock acquiring order problems resulting in deadlock~\cite{Lister77}, where detecting it requires dynamic tracking of monitor calls, and dealing with it requires rollback semantics~\cite{Dice10}.
In \CFA, a safety aid is provided by using bulk acquire of all monitors to shared objects, whereas other monitor systems provide no aid.
While \CFA provides only a partial solution, it handles many useful cases, \eg:
\begin{cfa}
monitor BankAccount { ... };
void deposit( BankAccount & mutex b, int deposit );
void transfer( BankAccount & mutex my, BankAccount & mutex your, int me2you ) {
	deposit( my, -me2you );					$\C{// debit}$
	deposit( your, me2you );				$\C{// credit}$
}
\end{cfa}
This example shows a trivial solution to the bank-account transfer problem, where both accounts are acquired simultaneously regardless of argument order.

For example, Figure~\ref{f:GenericBoundedBuffer} shows a bounded buffer that may be full/empty so producer/consumer threads must block.
Leaving the monitor and trying again (busy waiting) is impractical for high-level programming.
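The following sketch shows the shape of such a bounded buffer using internal scheduling;
the condition and field names are illustrative assumptions rather than the exact figure.
\begin{cfa}
forall( otype T ) {							$\C{// distribute forall}$
	monitor Buffer {
		condition full, empty;				$\C{// item available / space available}$
		int front, back, count;
		T elements[10];
	};
	void ?{}( Buffer(T) & buffer ) with( buffer ) { [front, back, count] = 0; }
	void insert( Buffer(T) & mutex buffer, T elem ) with( buffer ) {
		if ( count == 10 ) wait( empty );	$\C{// block producer until space}$
		elements[back] = elem;  back = ( back + 1 ) % 10;  count += 1;
		signal( full );						$\C{// unblock a waiting consumer}$
	}
	T remove( Buffer(T) & mutex buffer ) with( buffer ) {
		if ( count == 0 ) wait( full );		$\C{// block consumer until item}$
		T elem = elements[front];  front = ( front + 1 ) % 10;  count -= 1;
		signal( empty );					$\C{// unblock a waiting producer}$
		return elem;
	}
}
\end{cfa}
Note, there is no loop rechecking the condition around @wait@;
as discussed later, \CFA precludes barging, so the condition established by the signaller still holds when the waiter unblocks.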
Monitors eliminate busy waiting by providing synchronization to schedule threads needing access to the shared data, where threads block versus spinning.
Synchronization is generally achieved with internal~\cite{Hoare74} or external~\cite[\S~2.9.2]{uC++} scheduling, where \newterm{scheduling} defines which thread acquires the critical section next.
\newterm{Internal scheduling} is characterized by each thread entering the monitor and making an individual decision about proceeding or blocking, while \newterm{external scheduling} is characterized by an entering thread making a decision about proceeding for itself and on behalf of other threads attempting entry.

External scheduling is controlled by the @waitfor@ statement, which atomically blocks the calling thread, releases the monitor lock, and restricts the routine calls that can next acquire mutual exclusion.
If the buffer is full, only calls to @remove@ can acquire the buffer, and if the buffer is empty, only calls to @insert@ can acquire the buffer.
Threads making calls to routines that are currently excluded, block outside of (external to) the monitor on a calling queue, versus blocking on condition queues inside of (internal to) the monitor.
External scheduling allows users to wait for events from other threads without concern of unrelated events occurring.
The mechanism can be expressed in terms of control flow, \eg Ada @accept@ or \uC @_Accept@, or in terms of data, \eg Go channels.

Consider a dating service, where girls and boys register with a compatibility code and exchange phone numbers when a matching partner arrives.
A thread blocks until an appropriate partner arrives.
The complexity is exchanging phone numbers in the monitor because of the mutual-exclusion property.
For signal scheduling, the @exchange@ condition is necessary to block the thread finding the match, while the matcher unblocks to take the opposite number, post its phone number, and unblock the partner.
For signal-block scheduling, the implicit urgent-queue replaces the explicit @exchange@-condition and @signal_block@ puts the finding thread on the urgent condition and unblocks the matcher.
The @girl@ routine, using signal scheduling, has the following form (the @boy@ routine is symmetric):
\begin{cfa}
int girl( DatingService & mutex ds, int phNo, int ccode ) with( ds ) {
	if ( is_empty( Boys[ccode] ) ) {		$\C{// no compatible boy waiting?}$
		wait( Girls[ccode] );				$\C{// wait for boy}$
		GirlPhNo = phNo;					$\C{// make phone number available}$
		signal( exchange );					$\C{// wake boy from chair}$
	} else {
		GirlPhNo = phNo;					$\C{// make phone number available}$
		signal( Boys[ccode] );				$\C{// wake boy}$
		wait( exchange );					$\C{// sit in chair}$
	} // if
	return BoyPhNo;
}
\end{cfa}

For internal scheduling with multiple monitors, a signalling thread must have acquired at least the same locks as the waiting thread signalled from the condition queue.
Similarly, for @waitfor( rtn )@, the default semantics is to atomically block the acceptor and release all acquired mutex types in the parameter list, \ie @waitfor( rtn, m1, m2 )@.
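The following sketch (monitor and routine names are illustrative) shows these default multi-monitor semantics: the @wait@ atomically blocks and releases both acquired monitors, and the signaller holds the same monitor set.
\begin{cfa}
monitor M1 {} m1;
monitor M2 {} m2;
condition c;
void waiter( M1 & mutex a, M2 & mutex b ) {
	wait( c );							$\C{// implicitly wait( c, a, b ): block and release both}$
}
void signaller( M1 & mutex a, M2 & mutex b ) {	$\C{// acquires the same monitor set}$
	signal( c );						$\C{// waiter eventually reacquires a and b}$
}
\end{cfa}
A @wait@ can also name a subset of the acquired monitors, \eg @wait( c, a )@, releasing only @a@ while continuing to hold @b@.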
The ability to release a subset of acquired monitors can result in a \newterm{nested monitor}~\cite{Lister77} deadlock.
\begin{cfa}
void foo( M & mutex m1, M & mutex m2 ) {
	... wait( c, m1 ); ...					$\C{// release only m1, continuing to hold m2}$
}
void bar( M & mutex m1, M & mutex m2 ) {	$\C{// must acquire m1 and m2}$
	... signal( c ); ...					$\C{// cannot acquire m2 held by the waiter in foo}$
}
\end{cfa}
The @wait@ in @foo@ only releases @m1@, so a signaller needing @m2@ blocks forever and the waiter is never unblocked.

Finally, an important aspect of monitor implementation is barging, \ie can calling threads barge ahead of signalled threads?
If barging is allowed, synchronization between a signaller and signallee is difficult, often requiring multiple unblock/block cycles (looping around a wait rechecking if a condition is met).
In fact, signals-as-hints is completely opposite from that proposed by Hoare in the seminal paper on monitors:
\begin{quote}
However, we decree that a signal operation be followed immediately by resumption of a waiting program, without possibility of an intervening procedure call from yet a third program.
\end{quote}
\CFA scheduling \emph{precludes} barging, which simplifies synchronization among threads in the monitor and increases correctness.
For example, there are no loops in either bounded buffer solution in Figure~\ref{f:GenericBoundedBuffer}.
Supporting barging prevention as well as extending internal scheduling to multiple monitors is the main source of complexity in the design and implementation of \CFA concurrency.
Furthermore, there is no spurious wakeup~\cite[\S~9]{Buhr05a} in \CFA, which eliminates an implicit form of barging.

In an object-oriented programming-language, a class includes an exhaustive list of operations.
However, new members can be added via static inheritance or dynamic members, \eg JavaScript~\cite{JavaScript}.
Similarly, monitor routines can be added at any time in \CFA, making it less clear for programmers and more difficult to implement.
\begin{cfa}
monitor M { ... };
void f( M & mutex m );
void f( M & mutex m, int );				$\C{// different f}$
void h( M & mutex m ) { waitfor( f ); }	$\C{// unclear which f}$
\end{cfa}
Hence, the cfa-code for entering a monitor looks like:
\begin{cfa}
if ( $\textrm{\textit{monitor is free}}$ )	$\LstCommentStyle{// \color{red}enter}$
else if ( $\textrm{\textit{already own monitor}}$ )	$\LstCommentStyle{// \color{red}continue}$
else if ( $\textrm{\textit{monitor accepts call}}$ )	$\LstCommentStyle{// \color{red}enter}$
else	$\LstCommentStyle{// \color{red}block}$
\end{cfa}
Again, the set of monitors passed to the @waitfor@ statement must be entirely contained in the set of monitors already acquired by the accepting routine.
Also, the order of the monitors in a @waitfor@ statement is unimportant.
Figure~\ref{f:UnmatchedMutexSets} shows an example where, for internal and external scheduling with multiple monitors, a signalling or accepting thread must match exactly, \ie partial matching results in waiting.
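The containment rule above can be sketched as follows (monitor and routine names are illustrative):
\begin{cfa}
monitor M {} a;
void f( M & mutex m );
void f( M & mutex m1, M & mutex m2 );
void g( M & mutex m1, M & mutex m2 ) {	$\C{// acquires m1 and m2}$
	waitfor( f, m1, m2 );				$\C{// ok: both acquired monitors}$
	waitfor( f, m1 );					$\C{// ok: subset of acquired monitors}$
	// waitfor( f, a );					$\C{// error: a is not a mutex parameter of g}$
}
\end{cfa}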
\begin{figure}
\lstDeleteShortInline@%
\begin{tabular}{@{}l@{\hspace{\parindentlnth}}|@{\hspace{\parindentlnth}}l@{}}
\begin{cfa}
monitor M1 {} m11, m12;
monitor M2 {} m2;
condition c;
void f( M1 & mutex m1, M2 & mutex m2 ) {
	signal( c );
}
void g( M1 & mutex m1, M2 & mutex m2 ) {
	wait( c );
}
g( m11, m2 ); // block on wait
f( m12, m2 ); // cannot fulfil
\end{cfa}
&
\begin{cfa}
monitor M1 {} m11, m12;
monitor M2 {} m2;

void f( M1 & mutex m1, M2 & mutex m2 ) {
}
void g( M1 & mutex m1, M2 & mutex m2 ) {
	waitfor( f, m1, m2 );
}
g( m11, m2 ); // block on accept
f( m12, m2 ); // cannot fulfil
\end{cfa}
\end{tabular}
\lstMakeShortInline@%
For both internal and external scheduling, the set of monitors is disjoint so unblocking is impossible.
\caption{Unmatched \protect\lstinline@mutex@ sets}
\label{f:UnmatchedMutexSets}
\end{figure}

\subsection{Extended \protect\lstinline@waitfor@}

Figure~\ref{f:ExtendedWaitfor} shows the extended form of the @waitfor@ statement to conditionally accept one of a group of mutex routines, with a specific action to be performed \emph{after} the mutex routine finishes.
\begin{figure}
\begin{cfa}
when ( $\emph{conditional-expression}$ )	$\C{// optional guard}$
	waitfor( $\emph{mutex-member-name}$ ) $\emph{statement}$	$\C{// action after call}$
or when ( $\emph{conditional-expression}$ )	$\C{// optional guard}$
	waitfor( $\emph{mutex-member-name}$ ) $\emph{statement}$	$\C{// action after call}$
or ...
when ( $\emph{conditional-expression}$ )	$\C{// optional guard}$
	timeout $\emph{statement}$				$\C{// action after timeout}$
when ( $\emph{conditional-expression}$ )	$\C{// optional guard}$
	else $\emph{statement}$					$\C{// action when no immediate calls}$
\end{cfa}
For a @waitfor@ clause to be executed, its @when@ must be true and an outstanding call to its corresponding member(s) must exist.
The \emph{conditional-expression} of a @when@ may call a routine, but the routine must not block or context switch.
If there are several mutex calls that can be accepted, selection occurs top-to-bottom in the @waitfor@ clauses versus non-deterministically.
If some accept guards are true and there are no outstanding calls to these members, the acceptor is accept-blocked until a call to one of these members is made.
If all the accept guards are false, the statement does nothing, unless there is a terminating @else@ clause with a true guard, which is executed instead.
Hence, the terminating @else@ clause allows a conditional attempt to accept a call without blocking.
If there is a @timeout@ clause, it provides an upper bound on waiting, and can only appear with a conditional @else@, otherwise the timeout cannot be triggered.
In all cases, the statement following is executed \emph{after} a clause is executed to know which of the clauses executed.
\caption{Extended \protect\lstinline@waitfor@}
\label{f:ExtendedWaitfor}
\end{figure}

Note, a group of conditional @waitfor@ clauses is \emph{not} the same as a group of @if@ statements: an @if@ evaluates its conditions eagerly and commits to a single @waitfor@, whereas @when@ guards enable several @waitfor@ clauses to wait simultaneously.
For example, the following @insert@ waits on a full buffer until either a @remove@ call occurs or the buffer is deallocated:
\begin{cfa}
void insert( Buffer(T) & mutex buffer, T elem ) with( buffer ) {
	if ( count == 10 )
		waitfor( remove, buffer ) {
			elements[back] = elem;  back = ( back + 1 ) % 10;
			count += 1;						// insert elem into buffer
		} or waitfor( ^?{}, buffer )
			throw insertFail;
}
\end{cfa}

\label{s:fibers}
A variant of user thread is \newterm{fibers}, which removes preemption, \eg Go~\cite{Go} @goroutine@s.
Like functional programming, which removes mutation and its associated problems, removing preemption from concurrency reduces nondeterminism, making race and deadlock errors more difficult to generate.
However, preemption is necessary for concurrency that relies on spinning, so there is a class of problems that cannot be programmed without preemption.

\section{Implementation}
\label{s:Implementation}

Currently, \CFA has fixed-sized stacks, where the stack size can be set at coroutine/thread creation but with no subsequent growth.
Schemes exist for dynamic stack-growth, such as stack copying and chained stacks.
However, stack copying requires pointer adjustment to items on the stack, which is impossible without some form of garbage collection.
As well, chained stacks require all modules be recompiled to use this feature, which breaks backward compatibility with existing C libraries.
In the long term, it is likely C libraries will migrate to stack chaining to support concurrency, at only a minimal cost to sequential programs.

In \CFA, ordering of monitor acquisition relies on memory ordering to prevent deadlock~\cite{Havender68}, because all objects have distinct non-overlapping memory layouts, and mutual-exclusion for a monitor is only defined for its lifetime.
When a mutex call is made, pointers to the concerned monitors are aggregated into a variable-length array and sorted.
This array persists for the entire duration of the mutual-exclusion and is used extensively for synchronization operations.
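A sketch of this technique, in terms of the underlying locks and with illustrative helper names, sorts the monitor addresses so every bulk acquisition follows the same global order, precluding cyclic waiting:
\begin{cfa}
void bulk_acquire( monitor_desc * monitors[], int cnt ) {
	for ( int i = 1; i < cnt; i += 1 ) {	$\C{// insertion sort by address; cnt is small}$
		monitor_desc * key = monitors[i];
		int j;
		for ( j = i - 1; j >= 0 && monitors[j] > key; j -= 1 )
			monitors[j + 1] = monitors[j];
		monitors[j + 1] = key;
	}
	for ( int i = 0; i < cnt; i += 1 )
		lock( monitors[i] );				$\C{// acquire in ascending address order}$
}
\end{cfa}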
To improve performance and simplicity, context switching occurs inside a routine call, so only callee-saved registers are copied onto the stack and then the stack register is switched;
the corresponding registers are then restored for the other context.
Note, the instruction pointer is untouched since the context switch is always inside the same routine.
This method is a 2-step context-switch and provides a clear distinction between user and kernel code, where scheduling and other system operations happen.
The alternative 1-step context-switch uses the \emph{from} thread's stack to schedule and then context-switches directly to the \emph{to} thread's stack.
Experimental results (not presented) show the performance of these two approaches is virtually equivalent, because both approaches are dominated by locking to prevent a race condition.
All kernel threads (@pthreads@) are created with a stack.

Finally, an important aspect for a complete threading system is preemption, which introduces extra non-determinism via transparent interleaving, rather than relying on cooperation among threads for proper scheduling and processor fairness with long-running threads.
Because preemption frequency is usually long (1 millisecond), the performance cost is negligible.
Preemption is normally handled by setting a count-down timer on each virtual processor.
When the timer expires, an interrupt is delivered, and the interrupt handler resets the count-down timer.
If the virtual processor is executing in user code, the signal handler performs a user-level context-switch;
if executing in the language runtime-kernel, the preemption is ignored or rolled forward to the point where the runtime kernel context switches back to user code.

\section{Conclusion}

This paper demonstrates a concurrency API that is simple, efficient, and able to build higher-level concurrency features.
The approach provides concurrency based on a preemptive M:N user-level threading-system, executing in clusters, which encapsulate scheduling of work on multiple kernel threads providing parallelism.
The M:N model is judged to be efficient and to provide greater flexibility than a 1:1 threading model.

\label{futur:tools}
While monitors offer flexible and powerful concurrency for \CFA, other concurrency tools are also necessary for a complete multi-paradigm concurrency package.
Examples of such tools can include futures and promises~\cite{promises}, executors and actors.
These additional features are useful when monitors offer a level of abstraction that is inadequate for certain tasks.