Changeset a9aab60 for doc/proposals


Timestamp:
Oct 7, 2016, 4:41:17 PM
Author:
Thierry Delisle <tdelisle@…>
Branches:
ADT, aaron-thesis, arm-eh, ast-experimental, cleanup-dtors, deferred_resn, demangler, enum, forall-pointer-decay, jacob/cs343-translation, jenkins-sandbox, master, new-ast, new-ast-unique-expr, new-env, no_list, persistent-indexer, pthread-emulation, qualifiedEnum, resolv-new, with_gc
Children:
9b4343e
Parents:
d95f565
Message:

some progress on parallelism

Location:
doc/proposals/concurrency
Files:
2 edited

  • doc/proposals/concurrency/concurrency.tex

    rd95f565 ra9aab60  
    2424\usepackage{graphicx}
    2525\usepackage{tabularx}
    26 \usepackage{glossaries}
     26\usepackage[acronym]{glossaries}
    2727\usepackage{varioref}                                                           % extended references
    2828\usepackage{inconsolata}
     
    3838\usepackage{breakurl}
    3939
     40\usepackage{tikz}
     41\def\checkmark{\tikz\fill[scale=0.4](0,.35) -- (.25,0) -- (1,.7) -- (.25,.15) -- cycle;}
     42
    4043\renewcommand{\UrlFont}{\small\sf}
    4144
     
    597600
    598601\subsection{Paradigm performance}
    599 While the choice between the three paradigms listed above can have significant performance implications, it is difficult to pin down the performance implications of choosing a model at the language level. Indeed, in many situations one of these paradigms will show better performance, but it all depends on the usage.
    600 Having mostly independent units of work to execute almost guarantees that the \gls{job}-based system will have the best performance. However, if interactions between jobs are added, processor utilisation might suffer. User-level threads may allow maximum resource utilisation, but context switches will be more expensive and it is also harder for users to get perfect tuning. As with every example, fibers sit somewhere in the middle of the spectrum.
    601 
    602 \section{Parallelism in \CFA}
    603 As a system-level language, \CFA should offer both performance and flexibility as its primary goals, with simplicity and user-friendliness being a secondary concern. Therefore, the core of parallelism in \CFA should prioritize power and efficiency.
    604 
    605 \subsection{Kernel core}\label{kernel}
    606 At the ro
    607 \subsubsection{Threads}
    608 \CFA threads have all the characteristics of
    609 
    610 \subsection{High-level options}\label{tasks}
    611 
    612 \subsubsection{Thread interface}
    613 constructors destructors
    614         initializer lists
    615 monitors
    616 
    617 \subsubsection{Futures}
    618 
    619 \subsubsection{Implicit threading}
    620 Finally, simpler applications can benefit greatly from having implicit parallelism. That is, parallelism that does not rely on the user to write concurrency. This type of parallelism can be achieved both at the language level and at the system level.
    621 
     602While the choice between the three paradigms listed above may have significant performance implications, it is difficult to pin down the performance implications of choosing a model at the language level. Indeed, in many situations one of these paradigms will show better performance, but it all strongly depends on the usage. Having mostly independent units of work to execute almost guarantees that the \gls{job}-based system will have the best performance. However, if interactions between jobs are added, processor utilisation might suffer. User-level threads may allow maximum resource utilisation, but context switches will be more expensive and it is also harder for users to get perfect tuning. As with every example, fibers sit somewhere in the middle of the spectrum. Furthermore, if the units of uninterrupted work are large enough, the paradigm choice is fully amortised by the actual work done.
     603
     604\section{\CFA 's Thread Building Blocks}
     605As a system-level language, \CFA should offer both performance and flexibility as its primary goals, with simplicity and user-friendliness being a secondary concern. Therefore, the core of parallelism in \CFA should prioritize power and efficiency. With this said, it is possible to deconstruct the three paradigms detailed above in order to get simple building blocks. Here is a table showing the core characteristics of the mentioned paradigms:
    622606\begin{center}
    623 \begin{tabular}[t]{|c|c|c|}
    624 Sequential & System Parallel & Language Parallel \\
    625 \begin{lstlisting}
    626 void big_sum(int* a, int* b,
    627                  int* out,
    628                  size_t length)
    629 {
    630         for(int i = 0; i < length; ++i ) {
    631                 out[i] = a[i] + b[i];
    632         }
    633 }
    634 
    635 
    636 
    637 
    638 
    639 int* a[10000];
    640 int* b[10000];
    641 int* c[10000];
    642 //... fill in a and b ...
    643 big_sum(a, b, c, 10000);
    644 \end{lstlisting} &\begin{lstlisting}
    645 void big_sum(int* a, int* b,
    646                  int* out,
    647                  size_t length)
    648 {
    649         range ar(a, a + length);
    650         range br(b, b + length);
    651         range or(out, out + length);
    652         parfor( ai, bi, oi,
    653         [](int* ai, int* bi, int* oi) {
    654                 oi = ai + bi;
    655         });
    656 }
    657 
    658 int* a[10000];
    659 int* b[10000];
    660 int* c[10000];
    661 //... fill in a and b ...
    662 big_sum(a, b, c, 10000);
    663 \end{lstlisting}&\begin{lstlisting}
    664 void big_sum(int* a, int* b,
    665                  int* out,
    666                  size_t length)
    667 {
    668         for (ai, bi, oi) in (a, b, out) {
    669                 oi = ai + bi;
    670         }
    671 }
    672 
    673 
    674 
    675 
    676 
    677 int* a[10000];
    678 int* b[10000];
    679 int* c[10000];
    680 //... fill in a and b ...
    681 big_sum(a, b, c, 10000);
    682 \end{lstlisting}
     607\begin{tabular}[t]{| r | c | c |}
     608\cline{2-3}
     609\multicolumn{1}{ c| }{} & Has a stack & Preemptive \\
     610\hline
     611\Glspl{job} & X & X \\
     612\hline
     613\Glspl{fiber} & \checkmark & X \\
     614\hline
     615\Glspl{uthread} & \checkmark & \checkmark \\
     616\hline
    683617\end{tabular}
    684618\end{center}
    685619
    686 \subsection{Machine setup}\label{machine}
    687 Threads are all good and well but we still need some OS support to fully utilize available hardware.
    688 
    689 \textbf{\large{Work in progress...}} Do we need something beyond specifying the number of kernel threads?
     620As shown in section \ref{cfaparadigms}, with these different blocks available in \CFA, it is trivial to reproduce any of these paradigms.
     621
     622\subsection{Thread Interface}
     623The basic building blocks of \CFA are \glspl{cfathread}. By default these are implemented as \glspl{uthread}, and as such offer a flexible and lightweight threading interface (lightweight compared to \glspl{kthread}). A thread can be declared using a struct declaration prefixed with the \code{thread} keyword as follows:
     624
     625\begin{lstlisting}
     626        thread struct foo {};
     627\end{lstlisting}
     628
     629Obviously, for this thread implementation to be useful it must run some user code. Several other threading interfaces use a function pointer representation as the interface of threads (for example: \Csharp \cite{Csharp} and Scala \cite{Scala}). However, we consider that statically tying a \code{main} routine to a thread supersedes this approach. Since the \code{main} routine is definitely a special function in \CFA, we can reuse the existing syntax for declaring routines with unordinary names, i.e., operator overloading. As such, the \code{main} routine of a thread can be defined as follows:
     630\begin{lstlisting}
     631        thread struct foo {};
     632
     633        void ?main(thread foo* this) {
     634                /*... Some useful code ...*/
     635        }
     636\end{lstlisting}
     637
     638With these semantics it is trivial to write a thread type that takes a function pointer as a parameter and executes it on its stack asynchronously:
     639\begin{lstlisting}
     640        typedef void (*voidFunc)(void);
     641
     642        thread struct FuncRunner {
     643                voidFunc func;
     644        };
     645
     646        //ctor
     647        void ?{}(thread FuncRunner* this, voidFunc inFunc) {
     648		this->func = inFunc;
     649        }
     650
     651        //main
     652        void ?main(thread FuncRunner* this) {
     653                this->func();
     654        }
     655\end{lstlisting}
     656
     657In this example, \code{func} is a function pointer stored in \acrfull{tls}, which in \CFA is both easy to use and completely typesafe.
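
As a brief illustration of the typesafety (the routines \code{compatible} and \code{takesArg} are hypothetical, added only for this sketch), passing a routine with the wrong signature to \code{FuncRunner} is rejected at compile time because no matching constructor exists:
\begin{lstlisting}
	void compatible(void) {}          //matches voidFunc
	int takesArg(int x) { return x; } //does not match voidFunc

	void example() {
		FuncRunner ok = {compatible};  //accepted : signature matches voidFunc
		FuncRunner bad = {takesArg};   //rejected : no constructor takes int(*)(int)
	}
\end{lstlisting}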
     658
     659Of course, for threads to be useful, it must be possible to start and stop threads and wait for them to complete execution. While using an \acrshort{api} such as \code{fork} and \code{join} is relatively common in the literature, such an interface is not needed. Indeed, the simplest approach is to use \acrshort{raii} principles and have threads \code{fork} once the constructor has completed and \code{join} before the destructor runs.
     660\begin{lstlisting}
     661thread struct FuncRunner; //FuncRunner declared above
     662
     663void world() {
     664        sout | "World!" | endl;
     665}
     666
     667void main() {
     668        FuncRunner run = {world};
     669        //Thread run forks here
     670
     671	//"Hello " and "World!" are printed concurrently
     672        sout | "Hello " | endl;
     673
     674        //Implicit join at end of scope
     675}
     676\end{lstlisting}
     677This semantic has several advantages over explicit semantics: typesafety is guaranteed, any thread is always started and stopped exactly once, and users cannot make the corresponding programming errors. Furthermore, it naturally follows the memory-allocation semantics, which means users do not need to learn multiple semantics.
     678
     679These semantics also naturally scale to multiple threads, meaning basic synchronisation is very simple:
     680\begin{lstlisting}
     681        thread struct MyThread {
     682                //...
     683        };
     684
     685        //ctor
     686        void ?{}(thread MyThread* this) {}
     687
     688        //main
     689        void ?main(thread MyThread* this) {
     690                //...
     691        }
     692
     693        void foo() {
     694                MyThread thrds[10];
     695                //Start 10 threads at the beginning of the scope
     696
     697                DoStuff();
     698
     699                //Wait for the 10 threads to finish
     700        }
     701\end{lstlisting}
     702
     703\subsection{The \CFA Kernel: Processors, Clusters and Threads}\label{kernel}
     704
     705
     706\subsection{Paradigms}\label{cfaparadigms}
     707Given these building blocks, we can then reproduce all three of the popular paradigms. Indeed, we get \glspl{uthread} as the default paradigm in \CFA. However, disabling \glspl{preemption} on a \gls{cfacluster} means \glspl{cfathread} effectively become \glspl{fiber}. Since several \glspl{cfacluster} with different scheduling policies can coexist in the same application, this allows \glspl{fiber} and \glspl{uthread} to coexist in the runtime of an application.
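
A minimal sketch of this flexibility follows; note that this proposal does not yet fix the cluster interface, so the names \code{cluster} and \code{set_preemption} and the declaration syntax are assumptions made only for illustration:
\begin{lstlisting}
	//Hypothetical interface : placeholder names, not a finalized API
	cluster fiber_cluster;                  //create a second cluster
	set_preemption(&fiber_cluster, false);  //without preemption, threads act as fibers

	thread struct Worker {};
	void ?main(thread Worker* this) {
		//runs cooperatively when its cluster has preemption disabled
	}
\end{lstlisting}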
     708
     709% \subsection{High-level options}\label{tasks}
     710%
     711% \subsubsection{Thread interface}
     712% constructors destructors
     713%       initializer lists
     714% monitors
     715%
     716% \subsubsection{Futures}
     717%
     718% \subsubsection{Implicit threading}
     719% Finally, simpler applications can benefit greatly from having implicit parallelism. That is, parallelism that does not rely on the user to write concurrency. This type of parallelism can be achieved both at the language level and at the system level.
     720%
     721% \begin{center}
     722% \begin{tabular}[t]{|c|c|c|}
     723% Sequential & System Parallel & Language Parallel \\
     724% \begin{lstlisting}
     725% void big_sum(int* a, int* b,
     726%                int* out,
     727%                size_t length)
     728% {
     729%       for(int i = 0; i < length; ++i ) {
     730%               out[i] = a[i] + b[i];
     731%       }
     732% }
     733%
     734%
     735%
     736%
     737%
     738% int* a[10000];
     739% int* b[10000];
     740% int* c[10000];
     741% //... fill in a and b ...
     742% big_sum(a, b, c, 10000);
     743% \end{lstlisting} &\begin{lstlisting}
     744% void big_sum(int* a, int* b,
     745%                int* out,
     746%                size_t length)
     747% {
     748%       range ar(a, a + length);
     749%       range br(b, b + length);
     750%       range or(out, out + length);
     751%       parfor( ai, bi, oi,
     752%       [](int* ai, int* bi, int* oi) {
     753%               oi = ai + bi;
     754%       });
     755% }
     756%
     757% int* a[10000];
     758% int* b[10000];
     759% int* c[10000];
     760% //... fill in a and b ...
     761% big_sum(a, b, c, 10000);
     762% \end{lstlisting}&\begin{lstlisting}
     763% void big_sum(int* a, int* b,
     764%                int* out,
     765%                size_t length)
     766% {
     767%       for (ai, bi, oi) in (a, b, out) {
     768%               oi = ai + bi;
     769%       }
     770% }
     771%
     772%
     773%
     774%
     775%
     776% int* a[10000];
     777% int* b[10000];
     778% int* c[10000];
     779% //... fill in a and b ...
     780% big_sum(a, b, c, 10000);
     781% \end{lstlisting}
     782% \end{tabular}
     783% \end{center}
     784%
     785% \subsection{Machine setup}\label{machine}
     786% Threads are all good and well but we still need some OS support to fully utilize available hardware.
     787%
     788% \textbf{\large{Work in progress...}} Do we need something beyond specifying the number of kernel threads?
     789
     790\section{Putting it all together}
    690791
    691792\section{Future work}
     
    696797
    697798\clearpage
     799\printglossary[type=\acronymtype]
    698800\printglossary
    699801
  • doc/proposals/concurrency/glossary.tex

    rd95f565 ra9aab60  
    3131\textit{Synonyms : Tasks.}
    3232}
     33
     34\longnewglossaryentry{cfacluster}
     35{name={cluster}}
     36{
     37TBD...
     38
     39\textit{Synonyms : None.}
     40}
     41
     42\longnewglossaryentry{cfacpu}
     43{name={processor}}
     44{
     45TBD...
     46
     47\textit{Synonyms : None.}
     48}
     49
     50\longnewglossaryentry{cfathread}
     51{name={thread}}
     52{
     53TBD...
     54
     55\textit{Synonyms : None.}
     56}
     57
     58\longnewglossaryentry{preemption}
     59{name={preemption}}
     60{
     61TBD...
     62
     63\textit{Synonyms : None.}
     64}
     65
     66\newacronym{tls}{TLS}{Thread Local Storage}
     67\newacronym{api}{API}{Application Program Interface}
     68\newacronym{raii}{RAII}{Resource Acquisition Is Initialization}