Jun 22, 2018, 1:51:56 PM (4 years ago)
Peter A. Buhr <pabuhr@…>
aaron-thesis, arm-eh, cleanup-dtors, deferred_resn, demangler, enum, forall-pointer-decay, jacob/cs343-translation, jenkins-sandbox, master, new-ast, new-ast-unique-expr, new-env, no_list, persistent-indexer, with_gc

more updates

1 edited


  • doc/papers/concurrency/Paper.tex

    r999c700 r6d43cc57  
    15371537External scheduling allows users to wait for events from other threads without concern of unrelated events occurring.
    15381538The mechnaism can be done in terms of control flow, \eg Ada @accept@ or \uC @_Accept@, or in terms of data, \eg Go channels.
    1539 Of course, both of these paradigms have their own strengths and weaknesses, but for this project, control-flow semantics was chosen to stay consistent with the rest of the languages semantics.
    1540 Two challenges specific to \CFA arise when trying to add external scheduling with loose object definitions and multiple-monitor routines.
    1541 The previous example shows a simple use @_Accept@ versus @wait@/@signal@ and its advantages.
    1542 Note that while other languages often use @accept@/@select@ as the core external scheduling keyword, \CFA uses @waitfor@ to prevent name collisions with existing socket \textbf{api}s.
     1539While both mechanisms have strengths and weaknesses, this project uses a control-flow mechanism to stay consistent with other language semantics.
     1540Two challenges specific to \CFA for external scheduling are loose object-definitions (see Section~\ref{s:LooseObjectDefinitions}) and multiple-monitor routines (see Section~\ref{s:Multi-MonitorScheduling}).
    15441542For internal scheduling, non-blocking signalling (as in the producer/consumer example) is used when the signaller is providing the cooperation for a waiting thread;
    16551653must have acquired monitor locks that are greater than or equal to the number of locks for the waiting thread signalled from the condition queue.
    1656 {\color{red}In general, the signaller does not know the order of waiting threads, so in general, it must acquire the maximum number of mutex locks for the worst-case waiting thread.}
    16581655Similarly, for @waitfor( rtn )@, the default semantics is to atomically block the acceptor and release all acquired mutex types in the parameter list, \ie @waitfor( rtn, m1, m2 )@.
    18541851The extra challenge is that this dependency graph is effectively post-mortem, but the runtime system needs to be able to build and solve these graphs as the dependencies unfold.
    18551852Resolving dependency graphs being a complex and expensive endeavour, this solution is not the preferred one.
    1857 \subsubsection{Partial Signalling} \label{partial-sig}
    19331928\subsection{Loose Object Definitions}
    1935 In \uC, a monitor class declaration includes an exhaustive list of monitor operations.
    1936 Since \CFA is not object oriented, monitors become both more difficult to implement and less clear for a user:
    1938 \begin{cfa}
    1939 monitor A {};
    1941 void f(A & mutex a);
    1942 void g(A & mutex a) {
    1943         waitfor(f); // Obvious which f() to wait for
    1944 }
    1946 void f(A & mutex a, int); // New different F added in scope
    1947 void h(A & mutex a) {
    1948         waitfor(f); // Less obvious which f() to wait for
    1949 }
    1950 \end{cfa}
    1952 Furthermore, external scheduling is an example where implementation constraints become visible from the interface.
    1953 Here is the cfa-code for the entering phase of a monitor:
    1954 \begin{center}
    1955 \begin{tabular}{l}
    1956 \begin{cfa}
    1957         if monitor is free
    1958                 enter
    1959         elif already own the monitor
    1960                 continue
    1961         elif monitor accepts me
    1962                 enter
    1963         else
    1964                 block
    1965 \end{cfa}
    1966 \end{tabular}
    1967 \end{center}
     1931In an object-oriented programming-language, a class includes an exhaustive list of operations.
     1932However, new members can be added via static inheritance or dynaic members, \eg JavaScript~\cite{JavaScript}.
     1933Similarly, monitor routines can be added at any time in \CFA, making it less clear for programmers and more difficult to implement.
     1935monitor M {};
     1936void `f`( M & mutex m );
     1937void g( M & mutex m ) { waitfor( `f` ); }       $\C{// clear which f}$
     1938void `f`( M & mutex m, int );                           $\C{// different f}$
     1939void h( M & mutex m ) { waitfor( `f` ); }       $\C{// unclear which f}$
     1941Hence, the cfa-code for the entering a monitor looks like:
     1943if ( $\textrm{\textit{monitor is free}}$ ) $\LstCommentStyle{// \color{red}enter}$
     1944else if ( $\textrm{\textit{already own monitor}}$ ) $\LstCommentStyle{// \color{red}continue}$
     1945else if ( $\textrm{\textit{monitor accepts me}}$ ) $\LstCommentStyle{// \color{red}enter}$
     1946else $\LstCommentStyle{// \color{red}block}$
    19681948For the first two conditions, it is easy to implement a check that can evaluate the condition in a few instructions.
    1969 However, a fast check for @monitor accepts me@ is much harder to implement depending on the constraints put on the monitors.
    1970 Indeed, monitors are often expressed as an entry queue and some acceptor queue as in Figure~\ref{fig:ClassicalMonitor}.
     1949However, a fast check for \emph{monitor accepts me} is much harder to implement depending on the constraints put on the monitors.
     1950Figure~\ref{fig:ClassicalMonitor} shows monitors are often expressed as an entry (calling) queue, some acceptor queues, and an urgent stack/queue.
    1974 \subfloat[Classical Monitor] {
     1954\subfloat[Classical monitor] {
    1976 {\resizebox{0.45\textwidth}{!}{\input{monitor}}}
    19771957}% subfloat
    1978 \qquad
    1979 \subfloat[bulk acquire Monitor] {
     1959\subfloat[Bulk acquire monitor] {
    1981 {\resizebox{0.45\textwidth}{!}{\input{ext_monitor}}}
    19821962}% subfloat
    1983 \caption{External Scheduling Monitor}
     1963\caption{Monitor Implementation}
    1986 There are other alternatives to these pictures, but in the case of the left picture, implementing a fast accept check is relatively easy.
    1987 Restricted to a fixed number of mutex members, N, the accept check reduces to updating a bitmask when the acceptor queue changes, a check that executes in a single instruction even with a fairly large number (\eg 128) of mutex members.
    1988 This approach requires a unique dense ordering of routines with an upper-bound and that ordering must be consistent across translation units.
    1989 For OO languages these constraints are common, since objects only offer adding member routines consistently across translation units via inheritance.
    1990 However, in \CFA users can extend objects with mutex routines that are only visible in certain translation unit.
    1991 This means that establishing a program-wide dense-ordering among mutex routines can only be done in the program linking phase, and still could have issues when using dynamically shared objects.
    1993 The alternative is to alter the implementation as in Figure~\ref{fig:BulkMonitor}.
    1994 Here, the mutex routine called is associated with a thread on the entry queue while a list of acceptable routines is kept separate.
    1995 Generating a mask dynamically means that the storage for the mask information can vary between calls to @waitfor@, allowing for more flexibility and extensions.
    1996 Storing an array of accepted routine pointers replaces the single instruction bitmask comparison with dereferencing a pointer followed by a linear search.
    1997 Furthermore, supporting nested external scheduling (\eg listing \ref{f:nest-ext}) may now require additional searches for the @waitfor@ statement to check if a routine is already queued.
     1967For a fixed (small) number of mutex routines (\eg 128), the accept check reduces to a bitmask of allowed callers, which can be checked with a single instruction.
     1968This approach requires a unique dense ordering of routines with a small upper-bound and the ordering must be consistent across translation units.
     1969For object-oriented languages these constraints are common, but \CFA mutex routines can be added in any scope and are only visible in certain translation unit, precluding program-wide dense-ordering among mutex routines.
     1971Figure~\ref{fig:BulkMonitor} shows the \CFA monitor implementation.
     1972The mutex routine called is associated with each thread on the entry queue, while a list of acceptable routines is kept separately.
     1973The accepted list is a variable-sized array of accepted routine pointers, so the single instruction bitmask comparison is replaced by dereferencing a pointer followed by a linear search.
    20001977\begin{cfa}[caption={Example of nested external scheduling},label={f:nest-ext}]
    20201997In the end, the most flexible approach has been chosen since it allows users to write programs that would otherwise be  hard to write.
    20211998This decision is based on the assumption that writing fast but inflexible locks is closer to a solved problem than writing locks that are as flexible as external scheduling in \CFA.
    2023 % ======================================================================
    2024 % ======================================================================
    20252002\subsection{Multi-Monitor Scheduling}
    2026 % ======================================================================
    2027 % ======================================================================
    20292005External scheduling, like internal scheduling, becomes significantly more complex when introducing multi-monitor syntax.
    2030 Even in the simplest possible case, some new semantics needs to be established:
     2006Even in the simplest possible case, new semantics needs to be established:
    20322008monitor M {};
    2034 void f(M & mutex a);
    2036 void g(M & mutex b, M & mutex c) {
    2037         waitfor(f); // two monitors M => unknown which to pass to f(M & mutex)
    2038 }
    2039 \end{cfa}
    2040 The obvious solution is to specify the correct monitor as follows:
     2009void f( M & mutex m1 );
     2010void g( M & mutex m1, M & mutex m2 ) {
     2011        waitfor( f );                                                   $\C{// pass m1 or m2 to f?}$
     2014The solution is for the programmer to disambiguate:
     2016        waitfor( f, m2 );                                               $\C{// wait for call to f with argument m2}$
     2018Routine @g@ has acquired both locks, so when routine @f@ is called, the lock for monitor @m2@ is passed from @g@ to @f@ (while @g@ still holds lock @m1@).
     2019This behaviour can be extended to the multi-monitor @waitfor@ statement.
    20432021monitor M {};
    2045 void f(M & mutex a);
    2047 void g(M & mutex a, M & mutex b) {
    2048         // wait for call to f with argument b
    2049         waitfor(f, b);
    2050 }
    2051 \end{cfa}
    2052 This syntax is unambiguous.
    2053 Both locks are acquired and kept by @g@.
    2054 When routine @f@ is called, the lock for monitor @b@ is temporarily transferred from @g@ to @f@ (while @g@ still holds lock @a@).
    2055 This behaviour can be extended to the multi-monitor @waitfor@ statement as follows.
    2057 \begin{cfa}
    2058 monitor M {};
    2060 void f(M & mutex a, M & mutex b);
    2062 void g(M & mutex a, M & mutex b) {
    2063         // wait for call to f with arguments a and b
    2064         waitfor(f, a, b);
    2065 }
    2066 \end{cfa}
    2068 Note that the set of monitors passed to the @waitfor@ statement must be entirely contained in the set of monitors already acquired in the routine. @waitfor@ used in any other context is undefined behaviour.
     2022void f( M & mutex m1, M & mutex m2 );
     2023void g( M & mutex m1, M & mutex m2 ) {
     2024        waitfor( f, m1, m2 );                                   $\C{// wait for call to f with arguments m1 and m2}$
     2027Again, the set of monitors passed to the @waitfor@ statement must be entirely contained in the set of monitors already acquired by accepting routine.
    20702029An important behaviour to note is when a set of monitors only match partially:
    20732031mutex struct A {};
    20752032mutex struct B {};
    2077 void g(A & mutex a, B & mutex b) {
    2078         waitfor(f, a, b);
    2079 }
     2033void g( A & mutex m1, B & mutex m2 ) {
     2034        waitfor( f, m1, m2 );
    20812036A a1, a2;
    20822037B b;
    20842038void foo() {
    2085         g(a1, b); // block on accept
    2086 }
     2039        g( a1, b ); // block on accept
    20882041void bar() {
    2089         f(a2, b); // fulfill cooperation
     2042        f( a2, b ); // fulfill cooperation
    20942047It is also important to note that in the case of external scheduling the order of parameters is irrelevant; @waitfor(f,a,b)@ and @waitfor(f,b,a)@ are indistinguishable waiting condition.
    2096 % ======================================================================
    2097 % ======================================================================
    20982050\subsection{\protect\lstinline|waitfor| Semantics}
    2099 % ======================================================================
    2100 % ======================================================================
    21022052Syntactically, the @waitfor@ statement takes a routine identifier and a set of monitors.
    2199 % ======================================================================
    2200 % ======================================================================
    22012150\subsection{Waiting For The Destructor}
    2202 % ======================================================================
    2203 % ======================================================================
    22042152An interesting use for the @waitfor@ statement is destructor semantics.
    22052153Indeed, the @waitfor@ statement can accept any @mutex@ routine, which includes the destructor (see section \ref{data}).
    2230 % ######     #    ######     #    #       #       ####### #       ###  #####  #     #
    2231 % #     #   # #   #     #   # #   #       #       #       #        #  #     # ##   ##
    2232 % #     #  #   #  #     #  #   #  #       #       #       #        #  #       # # # #
    2233 % ######  #     # ######  #     # #       #       #####   #        #   #####  #  #  #
    2234 % #       ####### #   #   ####### #       #       #       #        #        # #     #
    2235 % #       #     # #    #  #     # #       #       #       #        #  #     # #     #
    2236 % #       #     # #     # #     # ####### ####### ####### ####### ###  #####  #     #
    22382180Historically, computer performance was about processor speeds and instruction counts.
    22392181However, with heat dissipation being a direct consequence of speed increase, parallelism has become the new source for increased performance~\cite{Sutter05, Sutter05b}.
    22452187While there are many variations of the presented paradigms, most of these variations do not actually change the guarantees or the semantics, they simply move costs in order to achieve better performance for certain workloads.
    22482193\subsection{User-Level Threads}
    22492195A direct improvement on the \textbf{kthread} approach is to use \textbf{uthread}.
    22502196These threads offer most of the same features that the operating system already provides but can be used on a much larger scale.
    22552201Examples of languages that support \textbf{uthread} are Erlang~\cite{Erlang} and \uC~\cite{uC++book}.
    22572204\subsection{Fibers : User-Level Threads Without Preemption} \label{fibers}
    22582206A popular variant of \textbf{uthread} is what is often referred to as \textbf{fiber}.
    22592207However, \textbf{fiber} do not present meaningful semantic differences with \textbf{uthread}.
    22642212An example of a language that uses fibers is Go~\cite{Go}
    22662215\subsection{Jobs and Thread Pools}
    22672217An approach on the opposite end of the spectrum is to base parallelism on \textbf{pool}.
    22682218Indeed, \textbf{pool} offer limited flexibility but at the benefit of a simpler user interface.
    22752225The gold standard of this implementation is Intel's TBB library~\cite{TBB}.
    22772228\subsection{Paradigm Performance}
    22782230While the choice between the three paradigms listed above may have significant performance implications, it is difficult to pin down the performance implications of choosing a model at the language level.
    22792231Indeed, in many situations one of these paradigms may show better performance but it all strongly depends on the workload.
    22832235Finally, if the units of uninterrupted work are large, enough the paradigm choice is largely amortized by the actual work done.
    22852238\section{The \protect\CFA\ Kernel : Processors, Clusters and Threads}\label{kernel}
    22862240A \textbf{cfacluster} is a group of \textbf{kthread} executed in isolation. \textbf{uthread} are scheduled on the \textbf{kthread} of a given \textbf{cfacluster}, allowing organization between \textbf{uthread} and \textbf{kthread}.
    22872241It is important that \textbf{kthread} belonging to a same \textbf{cfacluster} have homogeneous settings, otherwise migrating a \textbf{uthread} from one \textbf{kthread} to the other can cause issues.
    22912245Currently \CFA only supports one \textbf{cfacluster}, the initial one.
    22932248\subsection{Future Work: Machine Setup}\label{machine}
    22942250While this was not done in the context of this paper, another important aspect of clusters is affinity.
    22952251While many common desktop and laptop PCs have homogeneous CPUs, other devices often have more heterogeneous setups.
    22972253OS support for CPU affinity is now common~\cite{affinityLinux, affinityWindows, affinityFreebsd, affinityNetbsd, affinityMacosx}, which means it is both possible and desirable for \CFA to offer an abstraction mechanism for portable CPU affinity.
    23002258Given these building blocks, it is possible to reproduce all three of the popular paradigms.
    23012259Indeed, \textbf{uthread} is the default paradigm in \CFA.
    23082265\section{Behind the Scenes}
    23092267There are several challenges specific to \CFA when implementing concurrency.
    23102268These challenges are a direct result of bulk acquire and loose object definitions.
    23232281Note that since the major contributions of this paper are extending monitor semantics to bulk acquire and loose object definitions, any challenges that are not resulting of these characteristics of \CFA are considered as solved problems and therefore not discussed.
    2325 % ======================================================================
    2326 % ======================================================================
    23272284\section{Mutex Routines}
    2328 % ======================================================================
    2329 % ======================================================================
    23312286The first step towards the monitor implementation is simple @mutex@ routines.
    23642320\subsection{Details: Interaction with polymorphism}
    23652322Depending on the choice of semantics for when monitor locks are acquired, interaction between monitors and \CFA's concept of polymorphism can be more complex to support.
    23662323However, it is shown that entry-point locking solves most of the issues.
    24422399Furthermore, entry-point locking requires less code generation since any useful routine is called multiple times but there is only one entry point for many call sites.
    2444 % ======================================================================
    2445 % ======================================================================
    24462402\section{Threading} \label{impl:thread}
    2447 % ======================================================================
    2448 % ======================================================================
    24502404Figure \ref{fig:system1} shows a high-level picture if the \CFA runtime system in regards to concurrency.
    24622418Parallelism in \CFA is built around using processors to specify how much parallelism is desired. \CFA processors are object wrappers around kernel threads, specifically @pthread@s in the current implementation of \CFA.
    24632419Indeed, any parallelism must go through operating-system libraries.
    24672423Processors internally use coroutines to take advantage of the existing context-switching semantics.
    24692426\subsection{Stack Management}
    24702428One of the challenges of this system is to reduce the footprint as much as possible.
    24712429Specifically, all @pthread@s created also have a stack created with them, which should be used as much as possible.
    24742432In order to respect C user expectations, the stack of the initial kernel thread, the main stack of the program, is used by the main user thread rather than the main processor, which can grow very large.
    24762435\subsection{Context Switching}
    24772437As mentioned in section \ref{coroutine}, coroutines are a stepping stone for implementing threading, because they share the same mechanism for context-switching between different stacks.
    24782438To improve performance and simplicity, context-switching is implemented using the following assumption: all context-switches happen inside a specific routine call.
    24882448This option is not currently present in \CFA, but the changes required to add it are strictly additive.
    24902451\subsection{Preemption} \label{preemption}
    24912453Finally, an important aspect for any complete threading system is preemption.
    24922454As mentioned in section \ref{basics}, preemption introduces an extra degree of uncertainty, which enables users to have multiple threads interleave transparently, rather than having to cooperate among threads for proper scheduling and CPU distribution.
    25222484Indeed, @sigwait@ can differentiate signals sent from @pthread_sigqueue@ from signals sent from alarms or the kernel.
    25252488Finally, an aspect that was not mentioned yet is the scheduling algorithm.
    25272490Further discussion on scheduling is present in section \ref{futur:sched}.
    2529 % ======================================================================
    2530 % ======================================================================
    25312493\section{Internal Scheduling} \label{impl:intsched}
    2532 % ======================================================================
    2533 % ======================================================================
    25342495The following figure is the traditional illustration of a monitor (repeated from page~\pageref{fig:ClassicalMonitor} for convenience):
    2538 {\resizebox{0.4\textwidth}{!}{\input{monitor}}}
    25402501\caption{Traditional illustration of a monitor}
Note: See TracChangeset for help on using the changeset viewer.