Ignore:
Timestamp:
Oct 26, 2016, 5:49:40 PM (6 years ago)
Author:
Peter A. Buhr <pabuhr@…>
Branches:
aaron-thesis, arm-eh, cleanup-dtors, deferred_resn, demangler, enum, forall-pointer-decay, jacob/cs343-translation, jenkins-sandbox, master, new-ast, new-ast-unique-expr, new-env, no_list, persistent-indexer, pthread-emulation, qualifiedEnum, resolv-new, with_gc
Children:
03d416f, 25f49f4, 47a8d17
Parents:
1b29996 (diff), fe7b281 (diff)
Note: this is a merge changeset, the changes displayed below correspond to the merge itself.
Use the (diff) links above to see all the changes relative to each parent.
Message:

Merge branch 'master' of plg2:software/cfa/cfa-cc

File:
1 edited

Legend:

Unmodified
Added
Removed
  • doc/proposals/concurrency/concurrency.tex

    r1b29996 r6d7c3df  
    6262\newcommand{\cit}{\textsuperscript{[Citation Needed]}\xspace}
    6363\newcommand{\code}[1]{\lstinline{#1}}
     64\newcommand{\pseudo}[1]{\lstinline[language=Pseudo]{#1}}
    6465
    6566\input{glossary}
     
    510511\begin{center}
    511512\begin{tabular}{l}
    512 \begin{lstlisting}
    513         ¶if¶ critical section is free :
     513\begin{lstlisting}[language=Pseudo]
     514        if monitor is free :
    514515                enter
    515         elif critical section accepts me :
     516        elif monitor accepts me :
    516517                enter
    517         ¶else¶ :
     518        else :
    518519                block
    519520\end{lstlisting}
     
    521522\end{center}
    522523
    523 For the \code{critical section is free} condition it is easy to implement a check that can evaluate the condition in a few instruction. However, a fast check for \code{critical section accepts me} is much harder to implement depending on the constraints put on the monitors. Indeed, monitors are often expressed as an entry queue and some acceptor queue as in the following figure :
    524 
    525 \begin{center}
    526 {\resizebox{0.5\textwidth}{!}{\input{monitor}}}
    527 \end{center}
    528 
    529 There are other alternatives to these pictures but in the case of this picture implementing a fast accept check is relatively easy. Indeed simply updating a bitmask when the acceptor queue changes is enough to have a check that executes in a single instruction, even with a fairly large number of acceptor. However, this requires all the acceptable routines to be declared with the monitor declaration. For OO languages this doesn't compromise much since monitors already have an exhaustive list of member routines. However, for \CFA this isn't the case, routines can be added to a type anywhere after its declaration. A more flexible
    530 
    531 
    532 At this point we must make a decision between flexibility and performance. Many design decisions in \CFA achieve both flexibility and performance, for example polymorphic routines add significant flexibility but inlining them means the optimizer can easily remove any runtime cost.
    533 
    534 This approach leads to the \uC example being translated to :
    535 \begin{lstlisting}
    536         accept( void g(mutex struct A & mutex a) )
    537         mutex struct A {};
    538 
    539         void f(A & mutex a) { accept(g); }
    540         void g(A & mutex a);
    541 \end{lstlisting}
    542 
    543 This syntax is the most consistent with the language since it somewhat mimics the \code{forall} declarations. However, the fact that it comes before the struct declaration does means the type needs to be forward declared (done inline in the example). Here are a few alternatives to this syntax : \\
    544 \begin{tabular}[t]{l l}
     524For the \pseudo{monitor is free} condition it is easy to implement a check that can evaluate the condition in a few instruction. However, a fast check for \pseudo{monitor accepts me} is much harder to implement depending on the constraints put on the monitors. Indeed, monitors are often expressed as an entry queue and some acceptor queue as in the following figure :
     525
     526\begin{center}
     527{\resizebox{0.4\textwidth}{!}{\input{monitor}}}
     528\end{center}
     529
     530There are other alternatives to these pictures but in the case of this picture implementing a fast accept check is relatively easy. Indeed simply updating a bitmask when the acceptor queue changes is enough to have a check that executes in a single instruction, even with a fairly large number of acceptor. However, this relies on the fact that all the acceptable routines are declared with the monitor type. For OO languages this doesn't compromise much since monitors already have an exhaustive list of member routines. However, for \CFA this isn't the case, routines can be added to a type anywhere after its declaration. Its important to note that the bitmask approach does not actually require an exhaustive list of routines, but it requires a dense unique ordering of routines with an upper-bound and that ordering must be consistent across translation units.
     531The alternative would be to have a picture more like this one:
     532
     533\begin{center}
     534{\resizebox{0.4\textwidth}{!}{\input{ext_monitor}}}
     535\end{center}
     536
     537Not storing the queues inside the monitor means that the storage can vary between routines, allowing for more flexibility and extensions. Storing an array of function-pointers would solve the issue of uniquely identifying acceptable routines. However, the single instruction bitmask compare has been replaced by dereferencing a pointer followed by a linear search. Furthermore, supporting nested external scheduling may now require additionnal searches on calls to accept to check if a routine is already queued in.
     538
     539At this point we must make a decision between flexibility and performance. Many design decisions in \CFA achieve both flexibility and performance, for example polymorphic routines add significant flexibility but inlining them means the optimizer can easily remove any runtime cost. Here however, the cost of flexibility cannot be trivially removed.
     540
     541In either cases here are a few alternatives for the different syntaxes this syntax : \\
     542\begin{center}
     543{\renewcommand{\arraystretch}{1.5}
     544\begin{tabular}[t]{l @{\hskip 0.35in} l}
     545\hline
     546\multicolumn{2}{ c }{\code{accept} on type}\\
     547\hline
    545548Alternative 1 & Alternative 2 \\
    546549\begin{lstlisting}
    547550mutex struct A
    548 accept( void g(A & mutex a) )
     551accept( void f(A & mutex a) )
    549552{};
    550553\end{lstlisting} &\begin{lstlisting}
    551554mutex struct A {}
    552 accept( void g(A & mutex a) );
     555accept( void f(A & mutex a) );
    553556
    554557\end{lstlisting} \\
     
    556559\begin{lstlisting}
    557560mutex struct A {
    558         accept( void g(A & mutex a) )
     561        accept( void f(A & mutex a) )
    559562};
    560563
     
    562565mutex struct A {
    563566        accept :
    564                 void g(A & mutex a) );
     567                void f(A & mutex a) );
    565568};
    566 \end{lstlisting}
     569\end{lstlisting}\\
     570\hline
     571\multicolumn{2}{ c }{\code{accept} on routine}\\
     572\hline
     573\begin{lstlisting}
     574mutex struct A {};
     575
     576void f(A & mutex a)
     577
     578accept( void f(A & mutex a) )
     579void g(A & mutex a) {
     580        /*...*/
     581}
     582\end{lstlisting}&\\
    567583\end{tabular}
    568 
     584}
     585\end{center}
    569586
    570587An other aspect to consider is what happens if multiple overloads of the same routine are used. For the time being it is assumed that multiple overloads of the same routine should be scheduled regardless of the overload used. However, this could easily be extended in the future.
     
    597614\end{lstlisting}
    598615
    599 This is unambiguous. The both locks will be acquired and kept, when routine \code{f} is called the lock for monitor \code{a} will be temporarily transferred from \code{g} to \code{f} (while \code{g} still holds lock \code{b}). This behavior can be extended to multi-monitor accept statment as follows.
     616This is unambiguous. Both locks will be acquired and kept, when routine \code{f} is called the lock for monitor \code{a} will be temporarily transferred from \code{g} to \code{f} (while \code{g} still holds lock \code{b}). This behavior can be extended to multi-monitor accept statment as follows.
    600617
    601618\begin{lstlisting}
     
    613630
    614631\subsubsection{Implementation Details: External scheduling queues}
    615 To support multi-monitor external scheduling means that some kind of entry-queues must be used that is aware of both monitors. However, acceptable routines must be aware of the entry queues which means they most be stored inside at least one of the monitors that will be acquired. This in turn adds the requirement a systematic algorithm of disambiguating which queue is relavant regardless of user ordering. The proposed algorithm is to fall back on monitors lock ordering and specify that the monitor that is acquired first is the lock with the relevant entry queue. This assumes that the lock acquiring order is static for the lifetime of all concerned objects gut that is a reasonnable contraint. This algorithm choice has two consequences, the ofthe highest priority monitor is no longer a true FIFO queue and the queue of the lowest priority monitor is both required and probably unused. The queue can no longer be a FIFO queue because instead of simply containing the waiting threads in order arrival, they also contain the second mutex. Therefore, another thread with the same highest priority monitor but a different lowest priority monitor may arrive first but enter the critical section after a thread with the correct pairing. Secondly, since it may not be known at compile time which monitor will be the lowest priority monitor, every monitor needs to have the correct queues even though it is probably that half the multi-monitor queues will go unused for the entire duration of the program.
     632To support multi-monitor external scheduling means that some kind of entry-queues must be used that is aware of both monitors. However, acceptable routines must be aware of the entry queues which means they must be stored inside at least one of the monitors that will be acquired. This in turn adds the requirement a systematic algorithm of disambiguating which queue is relavant regardless of user ordering. The proposed algorithm is to fall back on monitors lock ordering and specify that the monitor that is acquired first is the lock with the relevant entry queue. This assumes that the lock acquiring order is static for the lifetime of all concerned objects but that is a reasonnable constraint. This algorithm choice has two consequences, the entry queue of the highest priority monitor is no longer a true FIFO queue and the queue of the lowest priority monitor is both required and probably unused. The queue can no longer be a FIFO queue because instead of simply containing the waiting threads in order arrival, they also contain the second mutex. Therefore, another thread with the same highest priority monitor but a different lowest priority monitor may arrive first but enter the critical section after a thread with the correct pairing. Secondly, since it may not be known at compile time which monitor will be the lowest priority monitor, every monitor needs to have the correct queues even though it is probable that half the multi-monitor queues will go unused for the entire duration of the program.
    616633
    617634\subsection{Other concurrency tools}
    618 
     635TO BE CONTINUED...
     636
     637\newpage
    619638\section{Parallelism}
    620 Historically, computer performance was about processor speeds and instructions count. However, with heat dissipaction being an ever growing challenge, parallelism has become the new source of greatest performance \cite{Sutter05, Sutter05b}. In this decade, it is not longer reasonnable create high-performance application without caring about parallelism. Indeed, parallelism an important aspect of performance and more specifically throughput and hardware utilization. The lowest level approach parallelism is to use \glspl{kthread}. However since these have significant costs and limitations, \glspl{kthread} are now mostly used as an implementation tool rather than a user oriented one. There are several alternatives to solve these issues which all have strengths and weaknesses.
     639Historically, computer performance was about processor speeds and instructions count. However, with heat dissipation being an ever growing challenge, parallelism has become the new source of greatest performance \cite{Sutter05, Sutter05b}. In this decade, it is not longer reasonnable to create high-performance application without caring about parallelism. Indeed, parallelism is an important aspect of performance and more specifically throughput and hardware utilization. The lowest level approach of parallelism is to use \glspl{kthread}. However since these have significant costs and limitations \glspl{kthread} are now mostly used as an implementation tool rather than a user oriented one. There are several alternatives to solve these issues which all have strengths and weaknesses.
    621640
    622641\subsection{User-level threads}
     
    624643
    625644Examples of languages that support are Java\cite{Java}, Haskell\cite{Haskell} and \uC\cite{uC++book}.
     645
    626646\subsection{Jobs and thread pools}
    627 The opposite approach is to base parallelism on \glspl{job}. Indeed, \glspl{job} offer limited flexibility but at the benefit of a simpler user interface. In \gls{job} based systems users express parallelism as units of work and the dependency graph (either explicit or implicit) that tie them together. This means users need not to worry about concurrency but significantly limits the interaction that can occur between different jobs. Indeed, any \gls{job} that blocks also blocks the underlying \gls{kthread}, this effectively mean the CPU utilization, and therefore throughput, will suffer noticeably. The golden standard of this implementation is Intel's TBB library\cite{TBB}.
     647The approach on the opposite end of the spectrum is to base parallelism on \glspl{job}. Indeed, \glspl{job} offer limited flexibility but at the benefit of a simpler user interface. In \gls{job} based systems users express parallelism as units of work and the dependency graph (either explicit or implicit) that tie them together. This means users need not to worry about concurrency but significantly limits the interaction that can occur between different jobs. Indeed, any \gls{job} that blocks also blocks the underlying \gls{kthread}, this effectively mean the CPU utilization, and therefore throughput, will suffer noticeably.
     648The golden standard of this implementation is Intel's TBB library\cite{TBB}.
    628649
    629650\subsection{Fibers : user-level threads without preemption}
    630651Finally, in the middle of the flexibility versus complexity spectrum lay \glspl{fiber} which offer \glspl{uthread} without the complexity of preemption. This means users don't have to worry about other \glspl{fiber} suddenly executing between two instructions which signficantly reduces complexity. However, any call to IO or other concurrency primitives can lead to context switches. Furthermore, users can also block \glspl{fiber} in the middle of their execution without blocking a full processor core. This means users still have to worry about mutual exclusion, deadlocks and race conditions in their code, raising the complexity significantly.
    631 \cite{Go}
     652An example of a language that uses fibers is Go\cite{Go}
    632653
    633654\subsection{Paradigm performance}
    634 While the choice between the three paradigms listed above may have significant performance implication, it is difficult to pin the performance implications of chosing a model at the language level. Indeed, in many situations own of these paradigms will show better performance but it all strongly depends on the usage. Having mostly indepent units of work to execute almost guarantess that the \gls{job} based system will have the best performance. However, add interactions between jobs and the processor utilisation might suffer. User-level threads may allow maximum ressource utilisation but context switches will be more expansive and it is also harder for users to get perfect tunning. As with every example, fibers sit somewhat in the middle of the spectrum. Furthermore, if the units of uninterrupted work are large enough the paradigm choice will be fully armoticised by the actual work done.
     655While the choice between the three paradigms listed above may have significant performance implication, it is difficult to pin the performance implications of chosing a model at the language level. Indeed, in many situations one of these paradigms will show better performance but it all strongly depends on the usage. Having mostly indepent units of work to execute almost guarantess that the \gls{job} based system will have the best performance. However, add interactions between jobs and the processor utilisation might suffer. User-level threads may allow maximum ressource utilisation but context switches will be more expansive and it is also harder for users to get perfect tunning. As with every example, fibers sit somewhat in the middle of the spectrum. Furthermore, if the units of uninterrupted work are large enough the paradigm choice will be largely amorticised by the actual work done.
    635656
    636657\section{\CFA 's Thread Building Blocks}
     
    687708\end{lstlisting}
    688709
    689 In this example \code{func} is a function pointer stored in \acrfull{tls}, which is \CFA is both easy to use and completly typesafe.
    690 
    691 Of course for threads to be useful, it must be possible to start and stop threads and wait for them to complete execution. While using \acrshort{api} such as \code{fork} and \code{join} is relatively common in the literature, such an interface is not needed. Indeed, the simplest approach is to use \acrshort{raii} principles and have threads \code{fork} once the constructor has completed and \code{join} before the destructor runs.
     710% In this example \code{func} is a function pointer stored in \acrfull{tls}, which is \CFA is both easy to use and completly typesafe.
     711
     712Of course for threads to be useful, it must be possible to start and stop threads and wait for them to complete execution. While using an \acrshort{api} such as \code{fork} and \code{join} is relatively common in the literature, such an interface is not needed. Indeed, the simplest approach is to use \acrshort{raii} principles and have threads \code{fork} once the constructor has completed and \code{join} before the destructor runs.
    692713\begin{lstlisting}
    693714thread struct FuncRunner; //FuncRunner declared above
     
    733754\end{lstlisting}
    734755
     756\newpage
     757\large{\textbf{WORK IN PROGRESS}}
    735758\subsection{The \CFA Kernel : Processors, Clusters and Threads}\label{kernel}
    736759
Note: See TracChangeset for help on using the changeset viewer.