Timestamp:
Sep 26, 2017, 11:27:38 PM
Author:
Peter A. Buhr <pabuhr@…>
Branches:
ADT, aaron-thesis, arm-eh, ast-experimental, cleanup-dtors, deferred_resn, demangler, enum, forall-pointer-decay, jacob/cs343-translation, jenkins-sandbox, master, new-ast, new-ast-unique-expr, new-env, no_list, persistent-indexer, pthread-emulation, qualifiedEnum, resolv-new, with_gc
Children:
5dc26f5
Parents:
201aeb9
Message:

merge

File:
1 edited

  • doc/proposals/concurrency/text/parallelism.tex

\section{Paradigm}
\subsection{User-level threads}
A direct improvement on the \gls{kthread} approach is to use \glspl{uthread}. These threads offer most of the same features as the ones the operating system already provides, but can be used on a much larger scale. This approach is the most powerful solution as it allows all the features of multi-threading, while removing several of the more expensive costs of kernel threads. The downside is that almost none of the low-level threading problems are hidden; users still have to think about data races, deadlocks and synchronization issues. These issues can be somewhat alleviated by a concurrency toolkit with strong guarantees, but the parallelism toolkit itself offers very little to reduce complexity.

Examples of languages that support \glspl{uthread} are Erlang~\cite{Erlang} and \uC~\cite{uC++book}.

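For a sense of the intended scale, the following is a minimal sketch using the \CFA thread declarations introduced elsewhere in this proposal; the type name and thread count are illustrative only:
\begin{lstlisting}
// Hypothetical example: `Worker' and the count are illustrative.
thread Worker {};                   // a user-level thread type
void main(Worker & this) {
	// unit of concurrent work; data races and deadlocks
	// remain the user's responsibility
}

int main() {
	Worker workers[10000];          // 10 000 user-level threads start here
}                                   // all joined before leaving scope
\end{lstlisting}
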
\subsection{Fibers: user-level threads without preemption}
A popular variant of \glspl{uthread} is what is often referred to as \glspl{fiber}. However, \glspl{fiber} do not present meaningful semantic differences with \glspl{uthread}. Advocates of \glspl{fiber} list their high performance and ease of implementation as major strengths, but the performance difference between \glspl{uthread} and \glspl{fiber} is controversial, and the ease of implementation, while true, is a weak argument in the context of language design. Therefore this proposal largely ignores fibers.

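The ease-of-implementation argument is easy to appreciate: a cooperative context switch needs no timer or signal machinery. The following sketch builds one fiber with the POSIX \texttt{ucontext} facilities; error handling and stack sizing are elided concerns:
\begin{lstlisting}
#include <stdio.h>
#include <ucontext.h>

static ucontext_t main_ctx, fiber_ctx;

static void fiber_body(void) {
	printf("fiber: first slice\n");
	swapcontext(&fiber_ctx, &main_ctx);   // yield back; no preemption involved
	printf("fiber: second slice\n");
}                                             // returning resumes uc_link

int main(void) {
	static char stack[64 * 1024];
	getcontext(&fiber_ctx);
	fiber_ctx.uc_stack.ss_sp = stack;
	fiber_ctx.uc_stack.ss_size = sizeof(stack);
	fiber_ctx.uc_link = &main_ctx;        // where to go when the fiber ends
	makecontext(&fiber_ctx, fiber_body, 0);

	swapcontext(&main_ctx, &fiber_ctx);   // run fiber until it yields
	printf("main: fiber yielded\n");
	swapcontext(&main_ctx, &fiber_ctx);   // resume fiber to completion
	printf("main: fiber done\n");
}
\end{lstlisting}
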
An example of a language that uses fibers is Go~\cite{Go}.

\subsection{Jobs and thread pools}
The approach on the opposite end of the spectrum is to base parallelism on \glspl{pool}. Indeed, \glspl{pool} offer limited flexibility but at the benefit of a simpler user interface. In \gls{pool}-based systems, users express parallelism as units of work, called jobs, and a dependency graph (either explicit or implicit) that ties them together. This approach means users need not worry about concurrency, but it significantly limits the interactions that can occur among jobs. Indeed, any \gls{job} that blocks also blocks the underlying worker, which effectively means CPU utilization, and therefore throughput, suffers noticeably. It can be argued that a solution to this problem is to use more workers than available cores. However, unless the number of jobs and the number of workers are comparable, having a significant number of blocked jobs always results in idle cores.

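To make the blocking argument concrete, the following sketch shows the core of a hypothetical \gls{pool}-based runtime; the names and layout are illustrative, not any particular library's API:
\begin{lstlisting}
// Hypothetical job-queue sketch.
typedef struct job {
	void (*work)(void *);   // unit of work expressed by the user
	void *arg;
	struct job *next;       // implicit dependency: run-after ordering
} job;

// Each worker is one kernel thread looping over its queue.
void worker_loop(job *queue) {
	for (job *j = queue; j != NULL; j = j->next) {
		j->work(j->arg);    // if this call blocks (e.g., on I/O),
	}                       // the whole worker, hence one core, idles
}
\end{lstlisting}
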
The gold standard of this approach is Intel's TBB library~\cite{TBB}.

\subsection{Paradigm performance}
While the choice between the three paradigms listed above may have significant performance implications, it is difficult to pin down the performance implications of choosing a model at the language level. Indeed, in many situations one of these paradigms may show better performance, but it all strongly depends on the workload. Having a large number of mostly independent units of work to execute almost guarantees that the \gls{pool}-based system has the best performance, thanks to the lower memory overhead (i.e., no thread stack per job). However, interactions among jobs can easily exacerbate contention. User-level threads allow fine-grain context switching, which results in better resource utilisation, but context switches are more expensive and the extra control means users need to tweak more variables to get the desired performance. Finally, if the units of uninterrupted work are large enough, the paradigm choice is largely amortised by the actual work done.

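A back-of-the-envelope calculation illustrates the memory-overhead gap, assuming a hypothetical but representative 8~KB minimum stack per user-level thread versus a 32-byte job descriptor:
\[
	100\,000 \mbox{ units} \times 8\mbox{~KB} = 800\mbox{~MB of stacks}
	\qquad \mbox{versus} \qquad
	100\,000 \times 32\mbox{~B} \approx 3\mbox{~MB of job descriptors}.
\]
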
\TODO

\section{The \protect\CFA\ Kernel: Processors, Clusters and Threads}\label{kernel}

\subsubsection{Future Work: Machine setup}\label{machine}
While this was not done in the context of this proposal, another important aspect of clusters is affinity. While many common desktop and laptop PCs have homogeneous CPUs, other devices often have more heterogeneous setups. For example, systems using \acrshort{numa} configurations may benefit from users being able to tie clusters and/or kernel threads to certain CPU cores. OS support for CPU affinity is now common \cit, which means it is both possible and desirable for \CFA to offer an abstraction mechanism for portable CPU affinity.

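As an example of the existing OS support, the following pins the calling kernel thread to a given core on Linux; this is a GNU-specific call, exactly the kind of platform detail a portable \CFA abstraction would hide:
\begin{lstlisting}
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

// Returns 0 on success, an error number otherwise.
int pin_self_to_core(int core) {
	cpu_set_t set;
	CPU_ZERO(&set);
	CPU_SET(core, &set);
	return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}
\end{lstlisting}
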
\subsection{Paradigms}\label{cfaparadigms}
Given these building blocks, it is possible to reproduce all three of the popular paradigms. Indeed, \glspl{uthread} are the default paradigm in \CFA. However, disabling \gls{preemption} on a \gls{cfacluster} means \glspl{cfathread} effectively become \glspl{fiber}. Since several \glspl{cfacluster} with different scheduling policies can coexist in the same application, \glspl{fiber} and \glspl{uthread} can coexist in the runtime of an application. Finally, it is possible to build executors for thread pools from \glspl{uthread} or \glspl{fiber}.
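
As an illustration of the intended usage, the following sketch is hypothetical: only the semantics, that a \gls{cfacluster} without \gls{preemption} schedules its \glspl{cfathread} as \glspl{fiber}, come from the text above; the cluster-construction syntax is not fixed here.
\begin{lstlisting}
// Hypothetical sketch: cluster construction syntax is illustrative.
cluster fibers;                 // assume preemption disabled on this cluster

thread Looper {};
void main(Looper & this) {
	// without preemption, other threads on `fibers' only run
	// when this thread blocks or yields cooperatively
}
\end{lstlisting}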