Changeset a9aab60 for doc/proposals

Timestamp:

Oct 7, 2016, 4:41:17 PM (8 years ago)

Author:

Thierry Delisle <tdelisle@…>

Branches:

ADT, aaron-thesis, arm-eh, ast-experimental, cleanup-dtors, deferred_resn, demangler, enum, forall-pointer-decay, jacob/cs343-translation, jenkins-sandbox, master, new-ast, new-ast-unique-expr, new-env, no_list, persistent-indexer, pthread-emulation, qualifiedEnum, resolv-new, with_gc

Children:

Parents:

Message:

some progress on parallelism

Location:

doc/proposals/concurrency

Files:

: 2 edited

concurrency.tex (modified) (4 diffs)
glossary.tex (modified) (1 diff)

doc/proposals/concurrency/concurrency.tex

-                      rd95f565
+                      ra9aab60
 \usepackage{graphicx}
 \usepackage{tabularx}
 \usepackage{glossaries}
+\usepackage[acronym]{glossaries}
 \usepackage{varioref}                                                           % extended references
 \usepackage{inconsolata}
 …
 \usepackage{breakurl}
+\usepackage{tikz}
+\def\checkmark{\tikz\fill[scale=0.4](0,.35) -- (.25,0) -- (1,.7) -- (.25,.15) -- cycle;}
 \renewcommand{\UrlFont}{\small\sf}
 …
 \subsection{Paradigm performance}
+While the choice between the three paradigms listed above can have significant performance implications, it is difficult to pin down the performance implications of choosing a model at the language level. Indeed, in many situations one of these paradigms will show better performance, but it all depends on the usage.
+Having mostly independent units of work to execute almost guarantees that the \gls{job} based system will have the best performance. However, add interactions between jobs and the processor utilisation might suffer. User-level threads may allow maximum resource utilisation, but context switches will be more expensive and it is also harder for users to get perfect tuning. As with every example, fibers sit somewhat in the middle of the spectrum.
+\section{Parallelism in \CFA}
+As a system level language, \CFA should offer both performance and flexibility as its primary goals, simplicity and user-friendliness being a secondary concern. Therefore, the core of parallelism in \CFA should prioritize power and efficiency.
+\subsection{Kernel core}\label{kernel}
+At the ro
+\subsubsection{Threads}
+\CFA threads have all the characteristics of
+\subsection{High-level options}\label{tasks}
+\subsubsection{Thread interface}
+constructors destructors
+        initializer lists
+monitors
+\subsubsection{Futures}
+\subsubsection{Implicit threading}
+Finally, simpler applications can benefit greatly from having implicit parallelism. That is, parallelism that does not rely on the user to write concurrency. This type of parallelism can be achieved both at the language level and at the system level.
+While the choice between the three paradigms listed above may have significant performance implications, it is difficult to pin down the performance implications of choosing a model at the language level. Indeed, in many situations one of these paradigms will show better performance, but it all strongly depends on the usage. Having mostly independent units of work to execute almost guarantees that the \gls{job} based system will have the best performance. However, add interactions between jobs and the processor utilisation might suffer. User-level threads may allow maximum resource utilisation, but context switches will be more expensive and it is also harder for users to get perfect tuning. As with every example, fibers sit somewhat in the middle of the spectrum. Furthermore, if the units of uninterrupted work are large enough, the cost of the paradigm choice will be fully amortised by the actual work done.
+\section{\CFA 's Thread Building Blocks}
+As a system level language, \CFA should offer both performance and flexibility as its primary goals, simplicity and user-friendliness being a secondary concern. Therefore, the core of parallelism in \CFA should prioritize power and efficiency. With this said, it is possible to deconstruct the three paradigms detailed above in order to get simple building blocks. Here is a table showing the core characteristics of the mentioned paradigms:
 \begin{center}
+\begin{tabular}[t]{|c|c|c|}
+Sequential & System Parallel & Language Parallel \\
+\begin{lstlisting}
+void big_sum(int* a, int* b,
+                 int* out,
+                 size_t length)
+{
+        for(int i = 0; i < length; ++i ) {
+                out[i] = a[i] + b[i];
+        }
+}
+int a[10000];
+int b[10000];
+int c[10000];
+//... fill in a and b ...
+big_sum(a, b, c, 10000);
+\end{lstlisting} &\begin{lstlisting}
+void big_sum(int* a, int* b,
+                 int* out,
+                 size_t length)
+{
+        range ar(a, a + length);
+        range br(b, b + length);
+        range or(out, out + length);
+        parfor( ar, br, or,
+        [](int* ai, int* bi, int* oi) {
+                *oi = *ai + *bi;
+        });
+}
+int a[10000];
+int b[10000];
+int c[10000];
+//... fill in a and b ...
+big_sum(a, b, c, 10000);
+\end{lstlisting}&\begin{lstlisting}
+void big_sum(int* a, int* b,
+                 int* out,
+                 size_t length)
+{
+        for (ai, bi, oi) in (a, b, out) {
+                oi = ai + bi;
+        }
+}
+int a[10000];
+int b[10000];
+int c[10000];
+//... fill in a and b ...
+big_sum(a, b, c, 10000);
+\end{lstlisting}
+\begin{tabular}[t]{| r | c | c |}
+\cline{2-3}
+\multicolumn{1}{ c| }{} & Has a stack & Preemptive \\
+\hline
+\Glspl{job} & X & X \\
+\hline
+\Glspl{fiber} & \checkmark & X \\
+\hline
+\Glspl{uthread} & \checkmark & \checkmark \\
+\hline
 \end{tabular}
 \end{center}
+\subsection{Machine setup}\label{machine}
+Threads are all well and good, but we still need some OS support to fully utilize the available hardware.
+\textbf{\large{Work in progress...}} Do we need something beyond specifying the number of kernel threads?
+As shown in section \ref{cfaparadigms}, with these different blocks available in \CFA, it is trivial to reproduce any of these paradigms.
+\subsection{Thread Interface}
+The basic building blocks of \CFA are \glspl{cfathread}. By default these are implemented as \glspl{uthread} and as such offer a flexible and lightweight threading interface (lightweight compared to \glspl{kthread}). A thread can be declared using a struct declaration prefixed with \code{thread} as follows:
+\begin{lstlisting}
+        thread struct foo {};
+\end{lstlisting}
+Obviously, for this thread implementation to be useful it must run some user code. Several other threading interfaces use some function pointer representation as the interface of threads (for example: \Csharp \cite{Csharp} and Scala \cite{Scala}). However, we consider that statically tying a \code{main} routine to a thread supersedes this approach. Since the \code{main} routine is definitely a special function in \CFA, we can reuse the existing syntax for declaring routines with unusual names, i.e. operator overloading. The \code{main} routine of a thread can therefore be defined as follows:
+\begin{lstlisting}
+        thread struct foo {};
+        void ?main(thread foo* this) {
+                /*... Some useful code ...*/
+        }
+\end{lstlisting}
+With these semantics it is trivial to write a thread type that takes a function pointer as a parameter and executes it on its stack asynchronously:
+\begin{lstlisting}
+        typedef void (*voidFunc)(void);
+        thread struct FuncRunner {
+                voidFunc func;
+        };
+        //ctor
+        void ?{}(thread FuncRunner* this, voidFunc inFunc) {
+                this->func = inFunc;
+        }
+        //main
+        void ?main(thread FuncRunner* this) {
+                this->func();
+        }
+\end{lstlisting}
+In this example \code{func} is a function pointer stored in \acrfull{tls}, which in \CFA is both easy to use and completely typesafe.
+Of course for threads to be useful, it must be possible to start and stop threads and wait for them to complete execution. While an \acrshort{api} based on \code{fork} and \code{join} is relatively common in the literature, such an interface is not needed. Indeed, the simplest approach is to use \acrshort{raii} principles and have threads \code{fork} once the constructor has completed and \code{join} before the destructor runs.
+\begin{lstlisting}
+thread struct FuncRunner; //FuncRunner declared above
+void world() {
+        sout | "World!" | endl;
+}
+void main() {
+        FuncRunner run = {world};
+        //Thread run forks here
+        //Printing "Hello " and "World!" will run concurrently
+        sout | "Hello " | endl;
+        //Implicit join at end of scope
+}
+\end{lstlisting}
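For readers more familiar with explicit threading, the fork-on-construction and join-on-destruction behaviour described above can be approximated in plain C. The following is only a sketch under stated assumptions: it uses POSIX kernel threads rather than the user-level threads the proposal intends, and the names `func_runner_ctor`, `func_runner_dtor` and `run_hello_world` are hypothetical helpers that spell out by hand what the \CFA constructor and destructor are proposed to do implicitly.

```c
#include <assert.h>
#include <pthread.h>

typedef void (*voidFunc)(void);

/* Hypothetical C stand-in for the proposal's `thread struct FuncRunner`. */
typedef struct {
    pthread_t handle;  /* kernel thread here; the proposal defaults to user-level threads */
    voidFunc  func;    /* routine the thread will run */
} FuncRunner;

/* Plays the role of the thread's main routine: run the stored function. */
static void *func_runner_main(void *arg) {
    FuncRunner *self = (FuncRunner *)arg;
    self->func();
    return NULL;
}

/* Plays the role of the constructor: store the function, then fork. */
void func_runner_ctor(FuncRunner *self, voidFunc inFunc) {
    self->func = inFunc;
    int rc = pthread_create(&self->handle, NULL, func_runner_main, self);
    assert(rc == 0);
    (void)rc;
}

/* Plays the role of the destructor: join before the storage goes away. */
void func_runner_dtor(FuncRunner *self) {
    int rc = pthread_join(self->handle, NULL);
    assert(rc == 0);
    (void)rc;
}

/* Counter used only to observe that the thread body ran. */
static int world_calls = 0;
void world(void) { world_calls += 1; }

/* Mirrors the scope in the listing above: fork at declaration, join at exit. */
int run_hello_world(void) {
    FuncRunner run;
    func_runner_ctor(&run, world);  /* thread forks here */
    /* ... the caller's own work would run concurrently here ... */
    func_runner_dtor(&run);         /* explicit stand-in for the implicit join */
    return world_calls;             /* safe to read: the join orders the write before this read */
}
```

In C the two calls must be written out and can be forgotten or repeated, which is precisely the class of mistake the scoped semantics are meant to rule out.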
+These semantics have several advantages over explicit fork and join: type safety is guaranteed, every thread is started and stopped exactly once, and users cannot forget to start or join a thread. Furthermore, they naturally follow the memory allocation semantics, which means users do not need to learn multiple semantics.
+These semantics also scale naturally to multiple threads, meaning basic synchronisation is very simple:
+\begin{lstlisting}
+        thread struct MyThread {
+                //...
+        };
+        //ctor
+        void ?{}(thread MyThread* this) {}
+        //main
+        void ?main(thread MyThread* this) {
+                //...
+        }
+        void foo() {
+                MyThread thrds[10];
+                //Start 10 threads at the beginning of the scope
+                DoStuff();
+                //Wait for the 10 threads to finish
+        }
+\end{lstlisting}
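The array case can be sketched the same way in plain C. This is an illustration under assumptions, not the proposed runtime: it again uses POSIX threads, and the `index` and `result` fields, the squaring workload and the `run_all` function are hypothetical additions made only so that the effect of the join is observable.

```c
#include <assert.h>
#include <pthread.h>

enum { THREAD_COUNT = 10 };

/* Hypothetical C stand-in for `thread struct MyThread`. */
typedef struct {
    pthread_t handle;
    int index;   /* input owned by this thread */
    int result;  /* output written by this thread only */
} MyThread;

/* Plays the role of the thread's main routine. */
static void *my_thread_main(void *arg) {
    MyThread *self = (MyThread *)arg;
    self->result = self->index * self->index;
    return NULL;
}

/* Mirrors foo() above: start all threads, do other work, wait for all. */
int run_all(void) {
    MyThread thrds[THREAD_COUNT];

    /* "Start 10 threads at the beginning of the scope" */
    for (int i = 0; i < THREAD_COUNT; ++i) {
        thrds[i].index = i;
        thrds[i].result = 0;
        int rc = pthread_create(&thrds[i].handle, NULL, my_thread_main, &thrds[i]);
        assert(rc == 0);
        (void)rc;
    }

    /* ... DoStuff() would run here, concurrently with the threads ... */

    /* "Wait for the 10 threads to finish", then collect their results */
    int sum = 0;
    for (int i = 0; i < THREAD_COUNT; ++i) {
        int rc = pthread_join(thrds[i].handle, NULL);
        assert(rc == 0);
        (void)rc;
        sum += thrds[i].result;
    }
    return sum;  /* sum of squares of 0..9 */
}
```

Each thread writes only its own slot, so the join at the end of the scope is the only synchronisation needed, which is the point the listing above makes.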
+\subsection{The \CFA Kernel : Processors, Clusters and Threads}\label{kernel}
+\subsection{Paradigms}\label{cfaparadigms}
+Given these building blocks we can then reproduce all three of the popular paradigms. Indeed, we get \glspl{uthread} as the default paradigm in \CFA. However, disabling \glspl{preemption} on the \gls{cfacluster} means \glspl{cfathread} effectively become \glspl{fiber}. Since several \glspl{cfacluster} with different scheduling policies can coexist in the same application, this allows \glspl{fiber} and \glspl{uthread} to coexist in the runtime of an application.
+% \subsection{High-level options}\label{tasks}
+%
+% \subsubsection{Thread interface}
+% constructors destructors
+%       initializer lists
+% monitors
+%
+% \subsubsection{Futures}
+%
+% \subsubsection{Implicit threading}
+% Finally, simpler applications can benefit greatly from having implicit parallelism. That is, parallelism that does not rely on the user to write concurrency. This type of parallelism can be achieved both at the language level and at the system level.
+%
+% \begin{center}
+% \begin{tabular}[t]{|c|c|c|}
+% Sequential & System Parallel & Language Parallel \\
+% \begin{lstlisting}
+% void big_sum(int* a, int* b,
+%                int* out,
+%                size_t length)
+% {
+%       for(int i = 0; i < length; ++i ) {
+%               out[i] = a[i] + b[i];
+%       }
+% }
+%
+%
+%
+%
+%
+% int* a[10000];
+% int* b[10000];
+% int* c[10000];
+% //... fill in a and b ...
+% big_sum(a, b, c, 10000);
+% \end{lstlisting} &\begin{lstlisting}
+% void big_sum(int* a, int* b,
+%                int* out,
+%                size_t length)
+% {
+%       range ar(a, a + length);
+%       range br(b, b + length);
+%       range or(out, out + length);
+%       parfor( ai, bi, oi,
+%       [](int* ai, int* bi, int* oi) {
+%               oi = ai + bi;
+%       });
+% }
+%
+% int* a[10000];
+% int* b[10000];
+% int* c[10000];
+% //... fill in a and b ...
+% big_sum(a, b, c, 10000);
+% \end{lstlisting}&\begin{lstlisting}
+% void big_sum(int* a, int* b,
+%                int* out,
+%                size_t length)
+% {
+%       for (ai, bi, oi) in (a, b, out) {
+%               oi = ai + bi;
+%       }
+% }
+%
+%
+%
+%
+%
+% int* a[10000];
+% int* b[10000];
+% int* c[10000];
+% //... fill in a and b ...
+% big_sum(a, b, c, 10000);
+% \end{lstlisting}
+% \end{tabular}
+% \end{center}
+%
+% \subsection{Machine setup}\label{machine}
+% Threads are all well and good, but we still need some OS support to fully utilize the available hardware.
+%
+% \textbf{\large{Work in progress...}} Do we need something beyond specifying the number of kernel threads?
+\section{Putting it all together}
 \section{Future work}
 …
 \clearpage
+\printglossary[type=\acronymtype]
 \printglossary

doc/proposals/concurrency/glossary.tex

-                      rd95f565
+                      ra9aab60
 \textit{Synonyms : Tasks.}
+}
+\longnewglossaryentry{cfacluster}
+{name={cluster}}
+{
+TBD...
+\textit{Synonyms : None.}
+}
+\longnewglossaryentry{cfacpu}
+{name={processor}}
+{
+TBD...
+\textit{Synonyms : None.}
+}
+\longnewglossaryentry{cfathread}
+{name={thread}}
+{
+TBD...
+\textit{Synonyms : None.}
+}
+\longnewglossaryentry{preemption}
+{name={preemption}}
+{
+TBD...
+\textit{Synonyms : None.}
+}
+\newacronym{tls}{TLS}{Thread Local Storage}
+\newacronym{api}{API}{Application Programming Interface}
+\newacronym{raii}{RAII}{Resource Acquisition Is Initialization}
