Changeset c69adb7 for doc/proposals/concurrency
- Timestamp: Oct 5, 2016, 11:47:40 AM
- Branches: ADT, aaron-thesis, arm-eh, ast-experimental, cleanup-dtors, deferred_resn, demangler, enum, forall-pointer-decay, jacob/cs343-translation, jenkins-sandbox, master, new-ast, new-ast-unique-expr, new-env, no_list, persistent-indexer, pthread-emulation, qualifiedEnum, resolv-new, with_gc
- Children: a7976d79
- Parents: 7e10773
- Location: doc/proposals/concurrency
- Files: 1 added, 2 edited, 1 moved
doc/proposals/concurrency/Makefile
@@ -1 +1 @@
 ## Define the appropriate configuration variables.
 
-TeXLIB = .:../../LaTeXmacros:../../LaTeXmacros/listings:../../LaTeXmacros/enumitem:
+TeXLIB = .:../../LaTeXmacros:../../LaTeXmacros/listings:../../LaTeXmacros/enumitem:~/bibliographies:
 LaTeX = TEXINPUTS=${TeXLIB} && export TEXINPUTS && latex -halt-on-error
 BibTeX = BIBINPUTS=${TeXLIB} && export BIBINPUTS && bibtex
@@ -32 +32 @@
 
 clean :
-	rm -f *.bbl *.aux *.dvi *.idx *.ilg *.ind *.brf *.out *.log *.toc *.blg *.pstex_t *.cf \
+	rm -f *.bbl *.aux *.dvi *.idx *.ilg *.ind *.brf *.out *.log *.toc *.blg *.pstex_t *.cf *.glg *.glo *.gls *.ist \
 	${FIGURES} ${PICTURES} ${PROGRAMS} ${GRAPHS} ${basename ${DOCUMENT}}.ps ${DOCUMENT}
@@ -54 +54 @@
 	-${BibTeX} ${basename $@}
 	# Make index from *.aux entries and input index at end of document
-	#makeindex -s ../../LaTeXmacros/indexstyle ${basename $@}.idx
+	makeglossaries ${basename $@}
 	#${LaTeX} ${basename $@}.tex
 	# Run again to get index title into table of contents
 	${LaTeX} ${basename $@}.tex
+	${LaTeX} ${basename $@}.tex
 
 predefined :
doc/proposals/concurrency/concurrency.tex
@@ -1 +1 @@
 % requires tex packages: texlive-base texlive-latex-base tex-common texlive-humanities texlive-latex-extra texlive-fonts-recommended
 
-% inline code ©...©(copyright symbol) emacs: C-q M-)
-% red highlighting ®...®(registered trademark symbol) emacs: C-q M-.
-% blue highlighting ß...ß(sharp s symbol) emacs: C-q M-_
-% green highlighting ¢...¢(cent symbol) emacs: C-q M-"
-% LaTex escape §...§(section symbol) emacs: C-q M-'
-% keyword escape ¶...¶(pilcrow symbol) emacs: C-q M-^
+% inline code ©...© (copyright symbol) emacs: C-q M-)
+% red highlighting ®...® (registered trademark symbol) emacs: C-q M-.
+% blue highlighting ß...ß (sharp s symbol) emacs: C-q M-_
+% green highlighting ¢...¢ (cent symbol) emacs: C-q M-"
+% LaTex escape §...§ (section symbol) emacs: C-q M-'
+% keyword escape ¶...¶ (pilcrow symbol) emacs: C-q M-^
 % math escape $...$ (dollar symbol)
@@ -24 +24 @@
 \usepackage{graphicx}
 \usepackage{tabularx}
+\usepackage{glossaries}
 \usepackage{varioref}	% extended references
 \usepackage{inconsolata}
@@ -36 +37 @@
 \usepackage[dvips,plainpages=false,pdfpagelabels,pdfpagemode=UseNone,colorlinks=true,pagebackref=true,linkcolor=blue,citecolor=blue,urlcolor=blue,pagebackref=true,breaklinks=true]{hyperref}
 \usepackage{breakurl}
+
 \renewcommand{\UrlFont}{\small\sf}
@@ -57 +59 @@
 \newcommand{\code}[1]{\lstinline{#1}}
 
+\input{glossary}
 
 \newsavebox{\LstBox}
@@ -81 +84 @@
 Indeed, for highly productive parallel programming, high-level approaches are much more popular\cite{HPP:Study}. Examples are task-based parallelism, message passing and implicit threading.
 
-There are actually two problems that need to be solved in the design of concurrency for a language: which concurrency tools and which parallelism tools are available to users. While these two concepts are often seen together, they are in fact distinct concepts that require different sorts of tools\cite{Myths}. Concurrency tools need to handle mutual exclusion and synchronization, while parallelism tools are more about performance, cost and resource utilization.
+There are actually two problems that need to be solved in the design of concurrency for a language: which concurrency tools and which parallelism tools are available to users. While these two concepts are often seen together, they are in fact distinct concepts that require different sorts of tools\cite{Buhr05a}. Concurrency tools need to handle mutual exclusion and synchronization, while parallelism tools are more about performance, cost and resource utilization.
 
 \section{Concurrency}
-Several tools can be used to solve concurrency challenges. Since these challenges always appear with the use of mutable shared state, some languages and libraries simply disallow mutable shared state completely (Erlang, Haskell, Akka (Scala))\cit. In these paradigms, interaction between concurrent objects relies on message passing or other paradigms that often closely relate to networking concepts. However, in imperative or OO languages these approaches entail a clear distinction between concurrent and non-concurrent paradigms, which in turn means that programmers need to learn two sets of design patterns to be effective at their jobs. Approaches based on shared memory are more closely related to non-concurrent paradigms, since they often rely on non-concurrent constructs like routine calls and objects. At a lower level these can be implemented as locks and atomic operations. However, for productivity reasons it is desirable to have a higher-level construct as the core concurrency paradigm\cite{HPP:Study}. This paper proposes monitors\cit as the core concurrency construct.
+Several tools can be used to solve concurrency challenges. Since these challenges always appear with the use of mutable shared state, some languages and libraries simply disallow mutable shared state completely (Erlang\cite{Erlang}, Haskell\cite{Haskell}, Akka (Scala)\cit). In these paradigms, interaction between concurrent objects relies on message passing or other paradigms that often closely relate to networking concepts. However, in imperative or OO languages these approaches entail a clear distinction between concurrent and non-concurrent paradigms, which in turn means that programmers need to learn two sets of design patterns to be effective at their jobs. Approaches based on shared memory are more closely related to non-concurrent paradigms, since they often rely on non-concurrent constructs like routine calls and objects. At a lower level these can be implemented as locks and atomic operations. However, for productivity reasons it is desirable to have a higher-level construct as the core concurrency paradigm\cite{HPP:Study}. This paper proposes monitors\cit as the core concurrency construct.
 
-Finally, an approach that is worth mentioning because it is gaining in popularity is transactional memory\cit. However, the performance and feature set is currently too restrictive to make it possible to add such a paradigm to a language like C or \CC\cit, which is why it was rejected as the core paradigm for concurrency in \CFA.
+Finally, an approach that is worth mentioning because it is gaining in popularity is transactional memory\cite{Dice10}. However, the performance and feature set is currently too restrictive to make it possible to add such a paradigm to a language like C or \CC\cit, which is why it was rejected as the core paradigm for concurrency in \CFA.
 
 \section{Monitors}
-A monitor is a set of routines that ensure mutual exclusion when accessing shared state. This concept is generally associated with object-oriented languages like Java\cit or \uC\cite{uCPP:Book} but does not strictly require OOP semantics. The only requirement is to be able to declare a handle to a shared object and a set of routines that act on it:
+A monitor is a set of routines that ensure mutual exclusion when accessing shared state. This concept is generally associated with object-oriented languages like Java\cite{Java} or \uC\cite{uC++book} but does not strictly require OOP semantics. The only requirement is to be able to declare a handle to a shared object and a set of routines that act on it:
 \begin{lstlisting}
 typedef /*some monitor type*/ monitor;
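For illustration, a minimal sketch of a complete monitor in this style, assuming the \code{mutex struct} syntax that appears later in the diff (the \code{counter} names are hypothetical):
\begin{lstlisting}
// a shared counter protected by a monitor
mutex struct counter { int value; };

// the mutex qualifier acquires the monitor lock on entry
// and releases it on exit, giving mutual exclusion
void increment(counter & mutex c) { c.value += 1; }
int  get(counter & mutex c) { return c.value; }
\end{lstlisting}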
@@ -442 +445 @@
 Regardless of the option chosen for wait semantics, signal must be symmetrical. In all cases, signal only needs a single parameter: the condition variable that needs to be signalled. But \code{signal} needs to be called from the same monitor(s) as the call to \code{wait}; otherwise, mutual exclusion cannot be properly transferred back to the waiting monitor.
 
+Finally, an additional semantic which can be very useful is the \code{signalBlock} routine. This routine behaves like signal for all of the semantics discussed above, but with the subtlety that mutual exclusion is transferred to the waiting task immediately rather than waiting for the end of the critical section.
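For illustration, a minimal bounded-buffer sketch contrasting the two routines, assuming the \code{condition}, \code{wait} and \code{signal} constructs discussed above (the \code{buffer} names are hypothetical):
\begin{lstlisting}
mutex struct buffer { condition notEmpty; int count; };

void put(buffer & mutex b) {
	b.count += 1;
	signal(b.notEmpty);         // waiter resumes only once put exits
	// signalBlock(b.notEmpty); // waiter would run immediately here,
	//                          // and put would resume afterwards
}

void take(buffer & mutex b) {
	while (b.count == 0) wait(b.notEmpty);
	b.count -= 1;
}
\end{lstlisting}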
+
 \subsection{External scheduling} \label{extsched}
-\textbf{\large{Work in progress...}}
-As one might expect, the alternative to Internal scheduling is to use external scheduling instead. The goal of external scheduling is to be able to have the same scheduling power as internal scheduling without the requirement that any thread can acquire the monitor lock. This method is somewhat more robust to deadlocks since one of the threads keeps a relatively tight control on scheduling. External scheduling can generally be done either in terms of control flow (see \uC) or in terms of data (see Go). Of course, both of these paradigms have their own strenghts and weaknesses but for this project control flow semantics where chosen to stay consistent with the reset of the languages semantics. Two challenges specific to \CFA arise when trying to add external scheduling which is loose object definitions and multi-monitor routines.
+As one might expect, the alternative to internal scheduling is to use external scheduling instead. This method is somewhat more robust to deadlocks, since one of the threads keeps relatively tight control over scheduling. Indeed, as the following examples demonstrate, external scheduling allows users to wait for events from other threads without the concern of unrelated events occurring. External scheduling can generally be done either in terms of control flow (see \uC) or in terms of data (see Go). Of course, both of these paradigms have their own strengths and weaknesses, but for this project control-flow semantics were chosen to stay consistent with the rest of the language's semantics. Two challenges specific to \CFA arise when trying to add external scheduling: loose object definitions and multi-monitor routines. The following example shows a simple use of \code{accept} versus \code{wait}/\code{signal} and its advantages.
+
+\begin{center}
+\begin{tabular}{|c|c|}
+Internal Scheduling & External Scheduling \\
+\hline
+\begin{lstlisting}
+_Monitor blarg {
+	condition c;
+public:
+	void f();
+	void g() { signal(c); }
+	void h() { wait(c); }
+private:
+};
+\end{lstlisting} & \begin{lstlisting}
+_Monitor blarg {
+
+public:
+	void f();
+	void g();
+	void h() { _Accept(g); }
+private:
+};
+\end{lstlisting}
+\end{tabular}
+\end{center}
+
+In the case of internal scheduling, the call to \code{wait} only guarantees that \code{g} was the last routine to access the monitor. This entails that routine \code{f} may have acquired mutual exclusion several times while routine \code{h} was waiting. On the other hand, external scheduling guarantees that while routine \code{h} was waiting, no routine other than \code{g} could acquire the monitor.
 
 \subsubsection{Loose object definitions}
-In \uC monitor definitions include an exhaustive list of monitor operations :
-\begin{lstlisting}
-_Monitor blarg {
-public:
-	void f() { _Accept(g); }
-	void g();
-private:
-}
-\end{lstlisting}
-
-Since \CFA is not an object oriented it becomes much more difficult to implement but also much less clear for the user :
+In \uC, monitor definitions include an exhaustive list of monitor operations. Since \CFA is not an object-oriented language, this becomes much more difficult to implement and also much less clear for the user:
 
 \begin{lstlisting}
 mutex struct A {};
 
-void f(A & mutex a) { accept(g); }
+void f(A & mutex a);
 void g(A & mutex a);
+void h(A & mutex a) { accept(g); }
 \end{lstlisting}
@@ -556 +579 @@
 To support multi-monitor external scheduling, some kind of entry queue must be used that is aware of both monitors. However, acceptable routines must be aware of the entry queues, which means they must be stored inside at least one of the monitors that will be acquired. This in turn adds the requirement of a systematic algorithm for disambiguating which queue is relevant, regardless of user ordering. The proposed algorithm is to fall back on the monitors' lock ordering and specify that the monitor acquired first holds the relevant entry queue. This assumes the lock-acquiring order is static for the lifetime of all concerned objects, but that is a reasonable constraint. This algorithm choice has two consequences: the queue of the highest-priority monitor is no longer a true FIFO queue, and the queue of the lowest-priority monitor is both required and probably unused. The queue can no longer be FIFO because, instead of simply containing the waiting threads in order of arrival, entries also contain the second mutex; therefore, another thread with the same highest-priority monitor but a different lowest-priority monitor may arrive first yet enter the critical section after a thread with the correct pairing. Secondly, since it may not be known at compile time which monitor will be the lowest-priority one, every monitor needs to have the correct queues, even though it is probable that half the multi-monitor queues will go unused for the entire duration of the program.
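For illustration, a rough sketch of the proposed selection rule (the \code{monitor_desc} and \code{lock_order} names are hypothetical):
\begin{lstlisting}
// given the monitors acquired by an accepting routine, find the one
// acquired first under the global lock ordering; its entry queue is
// the relevant one for external scheduling
monitor_desc * relevant_queue(monitor_desc * monitors[], size_t count) {
	monitor_desc * first = monitors[0];
	for (size_t i = 1; i < count; i += 1) {
		if (lock_order(monitors[i]) < lock_order(first)) {
			first = monitors[i];
		}
	}
	return first; // the highest-priority monitor holds the entry queue
}
\end{lstlisting}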
 
-\section{Parrallelism}
-
-\section{Tasks}
-
-\section{Naming}
+\subsection{Other concurrency tools}
+
+\section{Parallelism}
+Historically, computer performance was about processor speed and instruction count. However, with heat dissipation being an ever-growing challenge, parallelism has become the new source of greater performance~\cite{Sutter05, Sutter05b}. In this decade, it is no longer reasonable to create a high-performance application without caring about parallelism. Indeed, parallelism is an important aspect of performance, more specifically throughput and hardware utilization. The lowest-level approach to parallelism is to use \glspl{kthread}; however, since these have significant costs and limitations, \glspl{kthread} are now mostly used as an implementation tool rather than a user-oriented one. There are several alternatives that address these issues, each with its own strengths and weaknesses.
+
+\subsection{User-level threads}
+A direct improvement on the \gls{kthread} approach is to use \glspl{uthread}. These threads offer most of the same features the operating system already provides, but can be used on a much larger scale. This is the most powerful solution, as it allows all the features of multi-threading while removing several of the more expensive costs of kernel threads. The downside is that almost none of the low-level threading complexities are hidden: users still have to think about data races, deadlocks and synchronization issues. This can be somewhat alleviated by a concurrency toolkit with strong guarantees, but the parallelism toolkit itself offers very little to reduce complexity.
+
+Examples of languages that support \glspl{uthread} are Java\cite{Java}, Haskell\cite{Haskell} and \uC\cite{uC++book}.
+
+\subsection{Jobs and thread pools}
+The opposite approach is to base parallelism on \glspl{job}. Indeed, \glspl{job} offer limited flexibility but with the benefit of a simpler user interface. In \gls{job}-based systems, users express parallelism as units of work and the dependency graph (either explicit or implicit) that ties them together. This means users need not worry about concurrency, but it significantly limits the interaction that can occur between different jobs. Indeed, any \gls{job} that blocks also blocks the underlying \gls{kthread}, which effectively means that CPU utilization, and therefore throughput, will suffer noticeably. The gold standard of this approach is Intel's TBB library\cite{TBB}.
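For illustration, a minimal sketch of the \gls{job} model (the pool interface below is hypothetical, loosely in the spirit of TBB):
\begin{lstlisting}
// users express parallelism as independent units of work;
// the pool maps them onto a fixed set of kernel threads
typedef struct { void (*run)(void *); void * arg; } job;

void pool_submit(job j);   // enqueue a unit of work
void pool_wait_all(void);  // wait until all submitted jobs are done

void process(void * arg) { /* an independent unit of work */ }

int main(void) {
	for (int i = 0; i < 100; i += 1) {
		pool_submit((job){ process, 0 });
	}
	pool_wait_all(); // caveat: a job that blocks stalls its kernel thread
	return 0;
}
\end{lstlisting}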
+
+\subsection{Fibers: user-level threads without preemption}
+Finally, in the middle of the flexibility-versus-complexity spectrum lie \glspl{fiber}, which offer \glspl{uthread} without the complexity of preemption. This means users do not have to worry about other \glspl{fiber} suddenly executing between two instructions, which significantly reduces complexity. However, any call to IO or other concurrency primitives can lead to context switches. Furthermore, users can also block \glspl{fiber} in the middle of their execution without blocking a full processor core. This means users still have to worry about mutual exclusion, deadlocks and race conditions in their code, raising the complexity significantly\cite{Go}.
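For illustration, a small sketch of what the absence of preemption buys, assuming a hypothetical fiber and channel interface:
\begin{lstlisting}
int hits = 0; // shared between fibers

void worker(channel * c) {
	for (;;) {
		int v = take(c); // may suspend this fiber only; the kernel
		                 // thread simply switches to another fiber
		hits += v;       // no blocking call between read and write,
		                 // so no other fiber can interleave here
	}
}
\end{lstlisting}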
+
+\subsection{Paradigm performance}
+While the choice between the three paradigms listed above can have significant performance implications, it is difficult to pin down the performance impact of choosing a model at the language level. Indeed, in many situations one of these paradigms will show better performance, but it all depends on the usage. Having mostly independent units of work to execute almost guarantees that a \gls{job}-based system will have the best performance; however, add interactions between jobs and processor utilization might suffer. User-level threads may allow maximum resource utilization, but context switches are more expensive and it is also harder for users to get the tuning right. As with every example, fibers sit somewhere in the middle of the spectrum.
+
+\section{Parallelism in \CFA}
+As a system-level language, \CFA should offer both performance and flexibility as its primary goals, with simplicity and user-friendliness being a secondary concern. Therefore, the core of parallelism in \CFA should prioritize power and efficiency.
+
+\subsection{Kernel core}\label{kernel}
+At the ro
+\subsubsection{Threads}
+\CFA threads have all the characteristics of
+
+\subsection{High-level options}\label{tasks}
+
+\subsubsection{Thread interface}
+constructors destructors
+initializer lists
+monitors
+
+\subsubsection{Futures}
+
+\subsubsection{Implicit threading}
+Finally, simpler applications can benefit greatly from having implicit parallelism, that is, parallelism that does not rely on the user to write concurrency. This type of parallelism can be achieved both at the language level and at the system level:
+
+\begin{center}
+\begin{tabular}[t]{|c|c|c|}
+Sequential & System Parallel & Language Parallel \\
+\begin{lstlisting}
+void big_sum(int* a, int* b,
+             int* out,
+             size_t length)
+{
+	for(size_t i = 0; i < length; ++i) {
+		out[i] = a[i] + b[i];
+	}
+}
+
+int a[10000];
+int b[10000];
+int c[10000];
+//... fill in a and b ...
+big_sum(a, b, c, 10000);
+\end{lstlisting} & \begin{lstlisting}
+void big_sum(int* a, int* b,
+             int* out,
+             size_t length)
+{
+	range ar(a, a + length);
+	range br(b, b + length);
+	range or(out, out + length);
+	parfor(ai, bi, oi,
+	       [](int* ai, int* bi, int* oi) {
+		*oi = *ai + *bi;
+	});
+}
+
+int a[10000];
+int b[10000];
+int c[10000];
+//... fill in a and b ...
+big_sum(a, b, c, 10000);
+\end{lstlisting} & \begin{lstlisting}
+void big_sum(int* a, int* b,
+             int* out,
+             size_t length)
+{
+	for (ai, bi, oi) in (a, b, out) {
+		oi = ai + bi;
+	}
+}
+
+int a[10000];
+int b[10000];
+int c[10000];
+//... fill in a and b ...
+big_sum(a, b, c, 10000);
+\end{lstlisting}
+\end{tabular}
+\end{center}
+
+\subsection{Machine setup}\label{machine}
+Threads are all fine and well, but we still need some OS support to fully utilize available hardware.
+
+\textbf{\large{Work in progress...}} Do we need something beyond specifying the number of kernel threads?
 
 \section{Future work}
+Concurrency and parallelism is still a very active field that strongly benefits from hardware advances. As such, certain features that are not yet mature enough in their current state could become relevant over the lifetime of \CFA.
+\subsection{Transactions}
 
 \section*{Acknowledgements}
 
 \clearpage
+\printglossary
+
+\clearpage
 \bibliographystyle{plain}
-\bibliography{citations}
+\bibliography{pl,local}
doc/proposals/concurrency/local.bib
@@ -22 +22 @@
 
 
-@mastersthesis{Bilson:CFA,
-	keywords    = {Cforall, Overloading, Polymorphism},
-	author      = {Richard C. Bilson},
-	title       = {Implementing Overloading and Polymorphism in Cforall},
-	school      = {University of Waterloo},
-	year        = {2003},
-}
-
 @article{HPP:Study,
 	keywords    = {Parallel, Productivity},
 	author      = {Lorin Hochstein and Jeff Carver and Forrest Shull and Sima Asgari and Victor Basili and Jeffrey K. Hollingsworth and Marvin V. Zelkowitz},
 	title       = {Parallel Programmer Productivity: A Case Study of Novice Parallel Programmers},
-}
-@article{CFA:Refrat,
-	keywords    = {Cforall, refrat},
-	author      = {Glen Ditchfield},
-	title       = {Cforall Reference Manual and Rationale},
-	month       = jan,
-	year        = {2003},
 }
@@ -50 +35 @@
 }
 
-@article{Myths,
-	author      = {Peter A. Buhr and Ashif S. Harji},
-	title       = {Concurrent Urban Legends},
-	year        = {2005},
+@article{TBB,
+	keywords    = {Intel, TBB},
+	title       = {Intel Thread Building Blocks},
 }
-
-@article{uCPP:Book,
-	keywords    = {uC++, manual, book},
-	author      = {Peter A. Buhr},
-	title       = {Understanding Control Flow with Concurrent Programming using $\mu${C}{\kern-.1em\hbox{\large\texttt{+\kern-.25em+}}}},
-	month       = aug,
-	year        = {2014},
-}
-
-@techreport{ISO:Ada,
-	type        = {International Standard},
-	key         = {ISO/IEC 8652:1995},
-	year        = {1995},
-	title       = {Ada},
-	volume      = {1995},
-	institution = {International Organization for Standardization},
-}