Changeset a9aab60 for doc/proposals
- Timestamp:
- Oct 7, 2016, 4:41:17 PM
- Branches:
- ADT, aaron-thesis, arm-eh, ast-experimental, cleanup-dtors, deferred_resn, demangler, enum, forall-pointer-decay, jacob/cs343-translation, jenkins-sandbox, master, new-ast, new-ast-unique-expr, new-env, no_list, persistent-indexer, pthread-emulation, qualifiedEnum, resolv-new, with_gc
- Children:
- 9b4343e
- Parents:
- d95f565
- Location:
- doc/proposals/concurrency
- Files:
- 2 edited
doc/proposals/concurrency/concurrency.tex
In the preamble, \usepackage[acronym]{glossaries} replaces \usepackage{glossaries} so that an acronym list can be printed separately, and tikz is added to draw the checkmarks used in the new building-blocks table:

	\usepackage{tikz}
	\def\checkmark{\tikz\fill[scale=0.4](0,.35) -- (.25,0) -- (1,.7) -- (.25,.15) -- cycle;}

In the body, the "Paradigm performance" subsection is reworded, and the unfinished "Parallelism in \CFA" section is replaced by a new section, "\CFA's Thread Building Blocks". Two truncated stubs are dropped outright ("Kernel core", whose text broke off at "At the ro", and "Threads", which broke off at "\CFA threads have all the caracteristiques of"); the old section's introductory paragraph survives verbatim as the opening of the new section, and the remaining replaced material (the "High-level options", "Futures", and "Implicit threading" stubs, the three-way big_sum comparison table, and the "Machine setup" note) is kept, commented out, at the end of the new section (reproduced below). The reworded performance paragraph ("can have" becomes "may have", "depends" becomes "strongly depends", and a closing sentence on amortisation is new) now reads:

	\subsection{Paradigm performance}
	While the choice between the three paradigms listed above may have significant performance implications, it is difficult to pin down the performance implications of choosing a model at the language level. Indeed, in many situations one of these paradigms will show better performance, but it all strongly depends on the usage. Having mostly independent units of work to execute almost guarantees that the \gls{job}-based system will have the best performance. However, once interactions between jobs are added, processor utilisation might suffer. User-level threads may allow maximum resource utilisation, but context switches will be more expensive and it is also harder for users to get the tuning right. As with every example, fibers sit somewhere in the middle of the spectrum. Furthermore, if the units of uninterrupted work are large enough, the choice of paradigm is fully amortised by the actual work done.
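To make the closing amortisation claim concrete (a back-of-envelope illustration added here, not text from the changeset): if each uninterrupted unit of work runs for time $w$ and a context switch or job dispatch costs $c$, the scheduling overhead is the fraction $c / (w + c)$, which tends to zero as $w$ grows, whichever paradigm is chosen; the paradigms differ mainly in how large $c$ is and how often it is paid.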
The new section then begins:

	\section{\CFA's Thread Building Blocks}
	As a system-level language, \CFA should offer both performance and flexibility as its primary goals, with simplicity and user-friendliness as a secondary concern. Therefore, the core of parallelism in \CFA should prioritize power and efficiency. With this said, it is possible to deconstruct the three paradigms detailed above into simple building blocks. Here is a table showing the core characteristics of the mentioned paradigms:

	\begin{center}
	\begin{tabular}[t]{| r | c | c |}
	\cline{2-3}
	\multicolumn{1}{c|}{} & Has a stack & Preemptive \\
	\hline
	\Glspl{job} & X & X \\
	\hline
	\Glspl{fiber} & \checkmark & X \\
	\hline
	\Glspl{uthread} & \checkmark & \checkmark \\
	\hline
	\end{tabular}
	\end{center}

	As shown in section \ref{cfaparadigms}, with these different blocks available in \CFA, it is trivial to reproduce any of these paradigms.

	\subsection{Thread Interface}
	The basic building blocks of \CFA are \glspl{cfathread}. By default these are implemented as \glspl{uthread}, and as such offer a flexible and lightweight threading interface (lightweight compared to \glspl{kthread}). A thread can be declared by prefixing a struct declaration with the \code{thread} keyword, as follows:

	\begin{lstlisting}
	thread struct foo {};
	\end{lstlisting}

	Obviously, for this thread implementation to be useful it must run some user code. Several other threading interfaces use a function-pointer representation as the interface of threads (for example, \Csharp \cite{Csharp} and Scala \cite{Scala}). However, we consider that statically tying a \code{main} routine to a thread supersedes this approach. Since the \code{main} routine is already a special routine in \CFA, we can reuse the existing syntax for declaring routines with unordinary names, i.e. operator overloading. As such, the \code{main} routine of a thread can be defined as:

	\begin{lstlisting}
	thread struct foo {};

	void ?main(thread foo* this) {
		/*... Some useful code ...*/
	}
	\end{lstlisting}

	With these semantics it is trivial to write a thread type that takes a function pointer as a parameter and executes it on its stack asynchronously:

	\begin{lstlisting}
	typedef void (*voidFunc)(void);

	thread struct FuncRunner {
		voidFunc func;
	};

	//ctor
	void ?{}(thread FuncRunner* this, voidFunc inFunc) {
		this->func = inFunc;
	}

	//main
	void ?main(thread FuncRunner* this) {
		this->func();
	}
	\end{lstlisting}

	In this example, \code{func} is a function pointer stored in \acrfull{tls}, which in \CFA is both easy to use and completely typesafe.

	Of course, for threads to be useful, it must be possible to start and stop them and to wait for them to complete execution. While \acrshort{api}s such as \code{fork} and \code{join} are relatively common in the literature, such an interface is not needed. Indeed, the simplest approach is to use \acrshort{raii} principles and have threads \code{fork} once the constructor has completed and \code{join} before the destructor runs.

	\begin{lstlisting}
	thread struct FuncRunner; //FuncRunner declared above

	void world() {
		sout | "World!" | endl;
	}

	void main() {
		FuncRunner run = {world};
		//Thread run forks here

		//The prints of "Hello " and "World!" run concurrently
		sout | "Hello " | endl;

		//Implicit join at end of scope
	}
	\end{lstlisting}

	This semantic has several advantages over explicit semantics: typesafety is guaranteed, every thread is started and stopped exactly once, and whole classes of programming errors (e.g., forgetting to join) become impossible. Furthermore, it naturally follows the memory-allocation semantics, which means users do not need to learn multiple semantics.
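Since thread lifetime is tied to object lifetime here, the same semantics should extend naturally to dynamically allocated threads. A minimal sketch (not part of the changeset; it assumes the ?{} constructor and a matching ^?{} destructor can be called explicitly, as ordinary routines, on heap storage):

	// Sketch only: explicit ctor/dtor calls on malloc'd storage are assumed;
	// names follow the FuncRunner declarations above.
	void dynamic_hello() {
		FuncRunner* run = (FuncRunner*)malloc(sizeof(FuncRunner));
		?{}(run, world);        //construction completes, thread forks
		sout | "Hello " | endl; //runs concurrently with world()
		^?{}(run);              //destruction joins the thread
		free(run);
	}

This would give the fork and join points of an explicit interface while keeping the construct/destruct guarantees of the scoped version.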
The added text continues:

	These semantics also naturally scale to multiple threads, meaning basic synchronisation is very simple:

	\begin{lstlisting}
	thread struct MyThread {
		//...
	};

	//ctor
	void ?{}(thread MyThread* this) {}

	//main
	void ?main(thread MyThread* this) {
		//...
	}

	void foo() {
		MyThread thrds[10];
		//Start 10 threads at the beginning of the scope

		DoStuff();

		//Wait for the 10 threads to finish
	}
	\end{lstlisting}

	\subsection{The \CFA Kernel: Processors, Clusters and Threads}\label{kernel}

	\subsection{Paradigms}\label{cfaparadigms}
	Given these building blocks, we can then reproduce all three of the popular paradigms. Indeed, we get \glspl{uthread} as the default paradigm in \CFA. However, disabling \glspl{preemption} on a \gls{cfacluster} means \glspl{cfathread} effectively become \glspl{fiber}. Since several \glspl{cfacluster} with different scheduling policies can coexist in the same application, this allows \glspl{fiber} and \glspl{uthread} to coexist in the runtime of an application.

(The "Kernel" subsection above is left empty in this changeset.)
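To illustrate the coexistence claim, a sketch of what mixing the two might look like (nothing here is defined by the changeset: the cluster type, its no-preemption constructor argument, and the thread-placement mechanism are all assumed):

	// Sketch only: cluster API and thread placement are assumed, not proposed.
	cluster fibers = { no_preemption };  //assumed: a cluster whose scheduler never preempts

	thread struct worker {};

	void ?{}(thread worker* this) {
		//assumed: some mechanism places this thread on the `fibers` cluster
	}

	void ?main(thread worker* this) {
		//runs until it blocks or yields: behaves as a fiber
	}

	void example() {
		worker w;                //on `fibers`: cooperative scheduling only
		FuncRunner r = {world};  //on the default cluster: a preemptive uthread
		//a fiber and a uthread now coexist in the same application
	}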
The replaced material is retained at this point in the file, commented out:

	% \subsection{High-level options}\label{tasks}
	%
	% \subsubsection{Thread interface}
	% constructors destructors
	% initializer lists
	% monitors
	%
	% \subsubsection{Futures}
	%
	% \subsubsection{Implicit threading}
	% Finally, simpler applications can benefit greatly from having implicit parallelism. That is, parallelism that does not rely on the user to write concurrency. This type of parallelism can be achieved both at the language level and at the system level.
	%
	% \begin{center}
	% \begin{tabular}[t]{|c|c|c|}
	% Sequential & System Parallel & Language Parallel \\
	% \begin{lstlisting}
	% void big_sum(int* a, int* b,
	%              int* out,
	%              size_t length)
	% {
	% 	for(int i = 0; i < length; ++i ) {
	% 		out[i] = a[i] + b[i];
	% 	}
	% }
	%
	% int* a[10000];
	% int* b[10000];
	% int* c[10000];
	% //... fill in a and b ...
	% big_sum(a, b, c, 10000);
	% \end{lstlisting} &\begin{lstlisting}
	% void big_sum(int* a, int* b,
	%              int* out,
	%              size_t length)
	% {
	% 	range ar(a, a + length);
	% 	range br(b, b + length);
	% 	range or(out, out + length);
	% 	parfor( ai, bi, oi,
	% 		[](int* ai, int* bi, int* oi) {
	% 			oi = ai + bi;
	% 		});
	% }
	%
	% int* a[10000];
	% int* b[10000];
	% int* c[10000];
	% //... fill in a and b ...
	% big_sum(a, b, c, 10000);
	% \end{lstlisting}&\begin{lstlisting}
	% void big_sum(int* a, int* b,
	%              int* out,
	%              size_t length)
	% {
	% 	for (ai, bi, oi) in (a, b, out) {
	% 		oi = ai + bi;
	% 	}
	% }
	%
	% int* a[10000];
	% int* b[10000];
	% int* c[10000];
	% //... fill in a and b ...
	% big_sum(a, b, c, 10000);
	% \end{lstlisting}
	% \end{tabular}
	% \end{center}
	%
	% \subsection{Machine setup}\label{machine}
	% Threads are all good and well but we still need some OS support to fully utilize the available hardware.
	%
	% \textbf{\large{Work in progress...}} Do we need something beyond specifying the number of kernel threads?

Two empty section headers follow the commented block:

	\section{Putting it all together}

	\section{Future work}

Unchanged material is elided; at the end of the file, the acronym list is now printed ahead of the main glossary:

	\clearpage
	\printglossary[type=\acronymtype]
	\printglossary
doc/proposals/concurrency/glossary.tex
Four glossary entries and three acronyms are added after the existing "task" entry (the one whose synonyms line reads \textit{Synonyms : Tasks.}):

	\longnewglossaryentry{cfacluster}
	{name={cluster}}
	{
	TBD...

	\textit{Synonyms : None.}
	}

	\longnewglossaryentry{cfacpu}
	{name={processor}}
	{
	TBD...

	\textit{Synonyms : None.}
	}

	\longnewglossaryentry{cfathread}
	{name={thread}}
	{
	TBD...

	\textit{Synonyms : None.}
	}

	\longnewglossaryentry{preemption}
	{name={preemption}}
	{
	TBD...

	\textit{Synonyms : None.}
	}

	\newacronym{tls}{TLS}{Thread Local Storage}
	\newacronym{api}{API}{Application Program Interface}
	\newacronym{raii}{RAII}{Resource Acquisition Is Initialization}