Changeset 5ff188f for doc/papers/concurrency/Paper.tex
- Timestamp:
- Jan 31, 2018, 5:49:36 PM (7 years ago)
- Branches:
- ADT, aaron-thesis, arm-eh, ast-experimental, cleanup-dtors, deferred_resn, demangler, enum, forall-pointer-decay, jacob/cs343-translation, jenkins-sandbox, master, new-ast, new-ast-unique-expr, new-env, no_list, persistent-indexer, pthread-emulation, qualifiedEnum, resolv-new, with_gc
- Children:
- 281806b
- Parents:
- 633a642
- File:
-
- 1 edited
Legend:
- Unmodified
- Added
- Removed
-
doc/papers/concurrency/Paper.tex
r633a642 r5ff188f 1 % requires tex packages: texlive-base texlive-latex-base tex-common texlive-humanities texlive-latex-extra texlive-fonts-recommended 2 3 % inline code �...� (copyright symbol) emacs: C-q M-) 4 % red highlighting �...� (registered trademark symbol) emacs: C-q M-. 5 % blue highlighting �...� (sharp s symbol) emacs: C-q M-_ 6 % green highlighting �...� (cent symbol) emacs: C-q M-" 7 % LaTex escape �...� (section symbol) emacs: C-q M-' 8 % keyword escape �...� (pilcrow symbol) emacs: C-q M-^ 1 % inline code ©...© (copyright symbol) emacs: C-q M-) 2 % red highlighting ®...® (registered trademark symbol) emacs: C-q M-. 3 % blue highlighting ß...ß (sharp s symbol) emacs: C-q M-_ 4 % green highlighting ¢...¢ (cent symbol) emacs: C-q M-" 5 % LaTex escape §...§ (section symbol) emacs: C-q M-' 6 % keyword escape ¶...¶ (pilcrow symbol) emacs: C-q M-^ 9 7 % math escape $...$ (dollar symbol) 10 8 … … 20 18 \usepackage{epic,eepic} 21 19 \usepackage{upquote} % switch curled `'" to straight 22 \usepackage{dirtytalk}23 20 \usepackage{calc} 24 21 \usepackage{xspace} … … 46 43 \urlstyle{rm} 47 44 48 \usepackage{tikz}49 \def\checkmark{\tikz\fill[scale=0.4](0,.35) -- (.25,0) -- (1,.7) -- (.25,.15) -- cycle;}50 51 45 \setlength{\topmargin}{-0.45in} % move running title into header 52 46 \setlength{\headsep}{0.25in} … … 78 72 79 73 \title{Concurrency in \CFA} 80 \author{Thierry Delisle , Waterloo, Ontario, Canada, 2018}74 \author{Thierry Delisle and Peter A. Buhr, Waterloo, Ontario, Canada} 81 75 82 76 … … 85 79 86 80 \begin{abstract} 87 \CFA is a modern, non-object-oriented extension of the C programming language. This thesis serves as a definition and an implementation for the concurrency and parallelism \CFA offers. These features are created from scratch due to the lack of concurrency in ISO C. Lightweight threads are introduced into the language. In addition, monitors are introduced as a high-level tool for control-flow based synchronization and mutual-exclusion. The main contributions of this thesis are two-fold: it extends the existing semantics of monitors introduce by~\cite{Hoare74} to handle monitors in groups and also details the engineering effort needed to introduce these features as core language features. Indeed, these features are added with respect to expectations of C programmers, and integrate with the \CFA type-system and other language features. 81 \CFA is a modern, \emph{non-object-oriented} extension of the C programming language. 82 This paper serves as a definition and an implementation for the concurrency and parallelism \CFA offers. These features are created from scratch due to the lack of concurrency in ISO C. Lightweight threads are introduced into the language. In addition, monitors are introduced as a high-level tool for control-flow based synchronization and mutual-exclusion. The main contributions of this paper are two-fold: it extends the existing semantics of monitors introduce by~\cite{Hoare74} to handle monitors in groups and also details the engineering effort needed to introduce these features as core language features. Indeed, these features are added with respect to expectations of C programmers, and integrate with the \CFA type-system and other language features. 88 83 \end{abstract} 89 84 … … 95 90 \section{Introduction} 96 91 % ====================================================================== 97 This thesis provides a minimal concurrency \textbf{api} that is simple, efficient and can be reused to build higher-level features. The simplest possible concurrency system is a thread and a lock but this low-level approach is hard to master. An easier approach for users is to support higher-level constructs as the basis of concurrency. Indeed, for highly productive concurrent programming, high-level approaches are much more popular~\cite{HPP:Study}. Examples are task based, message passing and implicit threading. The high-level approach and its minimal \textbf{api} are tested in a dialect of C, called \CFA. Furthermore, the proposed \textbf{api} doubles as an early definition of the \CFA language and library. This thesis also provides an implementation of the concurrency library for \CFA as well as all the required language features added to the source-to-source translator. 92 93 This paper provides a minimal concurrency \textbf{api} that is simple, efficient and can be reused to build higher-level features. The simplest possible concurrency system is a thread and a lock but this low-level approach is hard to master. An easier approach for users is to support higher-level constructs as the basis of concurrency. Indeed, for highly productive concurrent programming, high-level approaches are much more popular~\cite{HPP:Study}. Examples are task based, message passing and implicit threading. The high-level approach and its minimal \textbf{api} are tested in a dialect of C, called \CFA. Furthermore, the proposed \textbf{api} doubles as an early definition of the \CFA language and library. This paper also provides an implementation of the concurrency library for \CFA as well as all the required language features added to the source-to-source translator. 98 94 99 95 There are actually two problems that need to be solved in the design of concurrency for a programming language: which concurrency and which parallelism tools are available to the programmer. While these two concepts are often combined, they are in fact distinct, requiring different tools~\cite{Buhr05a}. Concurrency tools need to handle mutual exclusion and synchronization, while parallelism tools are about performance, cost and resource utilization. 100 96 101 In the context of this thesis, a \textbf{thread} is a fundamental unit of execution that runs a sequence of code, generally on a program stack. Having multiple simultaneous threads gives rise to concurrency and generally requires some kind of locking mechanism to ensure proper execution. Correspondingly, \textbf{concurrency} is defined as the concepts and challenges that occur when multiple independent (sharing memory, timing dependencies, etc.) concurrent threads are introduced. Accordingly, \textbf{locking} (and by extension locks) are defined as a mechanism that prevents the progress of certain threads in order to avoid problems due to concurrency. Finally, in this thesis\textbf{parallelism} is distinct from concurrency and is defined as running multiple threads simultaneously. More precisely, parallelism implies \emph{actual} simultaneous execution as opposed to concurrency which only requires \emph{apparent} simultaneous execution. As such, parallelism is only observable in the differences in performance or, more generally, differences in timing.97 In the context of this paper, a \textbf{thread} is a fundamental unit of execution that runs a sequence of code, generally on a program stack. Having multiple simultaneous threads gives rise to concurrency and generally requires some kind of locking mechanism to ensure proper execution. Correspondingly, \textbf{concurrency} is defined as the concepts and challenges that occur when multiple independent (sharing memory, timing dependencies, etc.) concurrent threads are introduced. Accordingly, \textbf{locking} (and by extension locks) are defined as a mechanism that prevents the progress of certain threads in order to avoid problems due to concurrency. Finally, in this paper \textbf{parallelism} is distinct from concurrency and is defined as running multiple threads simultaneously. More precisely, parallelism implies \emph{actual} simultaneous execution as opposed to concurrency which only requires \emph{apparent} simultaneous execution. As such, parallelism is only observable in the differences in performance or, more generally, differences in timing. 102 98 103 99 % ====================================================================== … … 113 109 114 110 % ====================================================================== 115 \s ection{References}111 \subsection{References} 116 112 117 113 Like \CC, \CFA introduces rebind-able references providing multiple dereferencing as an alternative to pointers. In regards to concurrency, the semantic difference between pointers and references are not particularly relevant, but since this document uses mostly references, here is a quick overview of the semantics: … … 132 128 133 129 % ====================================================================== 134 \s ection{Overloading}130 \subsection{Overloading} 135 131 136 132 Another important feature of \CFA is function overloading as in Java and \CC, where routines with the same name are selected based on the number and type of the arguments. As well, \CFA uses the return type as part of the selection criteria, as in Ada~\cite{Ada}. For routines with multiple parameters and returns, the selection is complex. … … 153 149 154 150 % ====================================================================== 155 \s ection{Operators}151 \subsection{Operators} 156 152 Overloading also extends to operators. The syntax for denoting operator-overloading is to name a routine with the symbol of the operator and question marks where the arguments of the operation appear, e.g.: 157 153 \begin{cfacode} … … 173 169 174 170 % ====================================================================== 175 \s ection{Constructors/Destructors}171 \subsection{Constructors/Destructors} 176 172 Object lifetime is often a challenge in concurrency. \CFA uses the approach of giving concurrent meaning to object lifetime as a means of synchronization and/or mutual exclusion. Since \CFA relies heavily on the lifetime of objects, constructors and destructors is a core feature required for concurrency and parallelism. \CFA uses the following syntax for constructors and destructors: 177 173 \begin{cfacode} … … 208 204 209 205 % ====================================================================== 210 \s ection{Parametric Polymorphism}206 \subsection{Parametric Polymorphism} 211 207 \label{s:ParametricPolymorphism} 212 208 Routines in \CFA can also be reused for multiple types. This capability is done using the \code{forall} clauses, which allow separately compiled routines to support generic usage over multiple types. For example, the following sum function works for any type that supports construction from 0 and addition: … … 241 237 242 238 % ====================================================================== 243 \s ection{with Clause/Statement}239 \subsection{with Clause/Statement} 244 240 Since \CFA lacks the concept of a receiver, certain functions end up needing to repeat variable names often. To remove this inconvenience, \CFA provides the \code{with} statement, which opens an aggregate scope making its fields directly accessible (like Pascal). 245 241 \begin{cfacode} … … 286 282 287 283 \section{\protect\CFA's Thread Building Blocks} 288 One of the important features that are missing in C is threading\footnote{While the C11 standard defines a ``threads.h'' header, it is minimal and defined as optional. As such, library support for threading is far from widespread. At the time of writing the thesis, neither \texttt{gcc} nor \texttt{clang} support ``threads.h'' in their respective standard libraries.}. On modern architectures, a lack of threading is unacceptable~\cite{Sutter05, Sutter05b}, and therefore modern programming languages must have the proper tools to allow users to write efficient concurrent programs to take advantage of parallelism. As an extension of C, \CFA needs to express these concepts in a way that is as natural as possible to programmers familiar with imperative languages. And being a system-level language means programmers expect to choose precisely which features they need and which cost they are willing to pay.284 One of the important features that are missing in C is threading\footnote{While the C11 standard defines a ``threads.h'' header, it is minimal and defined as optional. As such, library support for threading is far from widespread. At the time of writing the paper, neither \texttt{gcc} nor \texttt{clang} support ``threads.h'' in their respective standard libraries.}. On modern architectures, a lack of threading is unacceptable~\cite{Sutter05, Sutter05b}, and therefore modern programming languages must have the proper tools to allow users to write efficient concurrent programs to take advantage of parallelism. As an extension of C, \CFA needs to express these concepts in a way that is as natural as possible to programmers familiar with imperative languages. And being a system-level language means programmers expect to choose precisely which features they need and which cost they are willing to pay. 289 285 290 286 \section{Coroutines: A Stepping Stone}\label{coroutine} … … 664 660 \end{cfacode} 665 661 666 In this example, threads of type \code{foo} start execution in the \code{void main(foo &)} routine, which prints \code{"Hello World!".} While this thesisencourages this approach to enforce strongly typed programming, users may prefer to use the routine-based thread semantics for the sake of simplicity. With the static semantics it is trivial to write a thread type that takes a function pointer as a parameter and executes it on its stack asynchronously.662 In this example, threads of type \code{foo} start execution in the \code{void main(foo &)} routine, which prints \code{"Hello World!".} While this paper encourages this approach to enforce strongly typed programming, users may prefer to use the routine-based thread semantics for the sake of simplicity. With the static semantics it is trivial to write a thread type that takes a function pointer as a parameter and executes it on its stack asynchronously. 667 663 \begin{cfacode} 668 664 typedef void (*voidFunc)(int); … … 999 995 % ====================================================================== 1000 996 % ====================================================================== 1001 In addition to mutual exclusion, the monitors at the core of \CFA's concurrency can also be used to achieve synchronization. With monitors, this capability is generally achieved with internal or external scheduling as in~\cite{Hoare74}. With \textbf{scheduling} loosely defined as deciding which thread acquires the critical section next, \textbf{internal scheduling} means making the decision from inside the critical section (i.e., with access to the shared state), while \textbf{external scheduling} means making the decision when entering the critical section (i.e., without access to the shared state). Since internal scheduling within a single monitor is mostly a solved problem, this thesisconcentrates on extending internal scheduling to multiple monitors. Indeed, like the \textbf{bulk-acq} semantics, internal scheduling extends to multiple monitors in a way that is natural to the user but requires additional complexity on the implementation side.997 In addition to mutual exclusion, the monitors at the core of \CFA's concurrency can also be used to achieve synchronization. With monitors, this capability is generally achieved with internal or external scheduling as in~\cite{Hoare74}. With \textbf{scheduling} loosely defined as deciding which thread acquires the critical section next, \textbf{internal scheduling} means making the decision from inside the critical section (i.e., with access to the shared state), while \textbf{external scheduling} means making the decision when entering the critical section (i.e., without access to the shared state). Since internal scheduling within a single monitor is mostly a solved problem, this paper concentrates on extending internal scheduling to multiple monitors. Indeed, like the \textbf{bulk-acq} semantics, internal scheduling extends to multiple monitors in a way that is natural to the user but requires additional complexity on the implementation side. 1002 998 1003 999 First, here is a simple example of internal scheduling: … … 1800 1796 A \textbf{cfacluster} is a group of \textbf{kthread} executed in isolation. \textbf{uthread} are scheduled on the \textbf{kthread} of a given \textbf{cfacluster}, allowing organization between \textbf{uthread} and \textbf{kthread}. It is important that \textbf{kthread} belonging to a same \textbf{cfacluster} have homogeneous settings, otherwise migrating a \textbf{uthread} from one \textbf{kthread} to the other can cause issues. A \textbf{cfacluster} also offers a pluggable scheduler that can optimize the workload generated by the \textbf{uthread}. 1801 1797 1802 \textbf{cfacluster} have not been fully implemented in the context of this thesis. Currently \CFA only supports one \textbf{cfacluster}, the initial one.1798 \textbf{cfacluster} have not been fully implemented in the context of this paper. Currently \CFA only supports one \textbf{cfacluster}, the initial one. 1803 1799 1804 1800 \subsection{Future Work: Machine Setup}\label{machine} 1805 While this was not done in the context of this thesis, another important aspect of clusters is affinity. While many common desktop and laptop PCs have homogeneous CPUs, other devices often have more heterogeneous setups. For example, a system using \textbf{numa} configurations may benefit from users being able to tie clusters and/or kernel threads to certain CPU cores. OS support for CPU affinity is now common~\cite{affinityLinux, affinityWindows, affinityFreebsd, affinityNetbsd, affinityMacosx}, which means it is both possible and desirable for \CFA to offer an abstraction mechanism for portable CPU affinity.1801 While this was not done in the context of this paper, another important aspect of clusters is affinity. While many common desktop and laptop PCs have homogeneous CPUs, other devices often have more heterogeneous setups. For example, a system using \textbf{numa} configurations may benefit from users being able to tie clusters and/or kernel threads to certain CPU cores. OS support for CPU affinity is now common~\cite{affinityLinux, affinityWindows, affinityFreebsd, affinityNetbsd, affinityMacosx}, which means it is both possible and desirable for \CFA to offer an abstraction mechanism for portable CPU affinity. 1806 1802 1807 1803 \subsection{Paradigms}\label{cfaparadigms} … … 1815 1811 The main memory concern for concurrency is queues. All blocking operations are made by parking threads onto queues and all queues are designed with intrusive nodes, where each node has pre-allocated link fields for chaining, to avoid the need for memory allocation. Since several concurrency operations can use an unbound amount of memory (depending on \textbf{bulk-acq}), statically defining information in the intrusive fields of threads is insufficient.The only way to use a variable amount of memory without requiring memory allocation is to pre-allocate large buffers of memory eagerly and store the information in these buffers. Conveniently, the call stack fits that description and is easy to use, which is why it is used heavily in the implementation of internal scheduling, particularly variable-length arrays. Since stack allocation is based on scopes, the first step of the implementation is to identify the scopes that are available to store the information, and which of these can have a variable-length array. The threads and the condition both have a fixed amount of memory, while \code{mutex} routines and blocking calls allow for an unbound amount, within the stack size. 1816 1812 1817 Note that since the major contributions of this thesisare extending monitor semantics to \textbf{bulk-acq} and loose object definitions, any challenges that are not resulting of these characteristics of \CFA are considered as solved problems and therefore not discussed.1813 Note that since the major contributions of this paper are extending monitor semantics to \textbf{bulk-acq} and loose object definitions, any challenges that are not resulting of these characteristics of \CFA are considered as solved problems and therefore not discussed. 1818 1814 1819 1815 % ====================================================================== … … 2615 2611 2616 2612 \section{Conclusion} 2617 This thesis has achieved a minimal concurrency \textbf{api} that is simple, efficient and usable as the basis for higher-level features. The approach presented is based on a lightweight thread-system for parallelism, which sits on top of clusters of processors. This M:N model is judged to be both more efficient and allow more flexibility for users. Furthermore, this document introduces monitors as the main concurrency tool for users. This thesisalso offers a novel approach allowing multiple monitors to be accessed simultaneously without running into the Nested Monitor Problem~\cite{Lister77}. It also offers a full implementation of the concurrency runtime written entirely in \CFA, effectively the largest \CFA code base to date.2613 This paper has achieved a minimal concurrency \textbf{api} that is simple, efficient and usable as the basis for higher-level features. The approach presented is based on a lightweight thread-system for parallelism, which sits on top of clusters of processors. This M:N model is judged to be both more efficient and allow more flexibility for users. Furthermore, this document introduces monitors as the main concurrency tool for users. This paper also offers a novel approach allowing multiple monitors to be accessed simultaneously without running into the Nested Monitor Problem~\cite{Lister77}. It also offers a full implementation of the concurrency runtime written entirely in \CFA, effectively the largest \CFA code base to date. 2618 2614 2619 2615 … … 2625 2621 2626 2622 \subsection{Performance} \label{futur:perf} 2627 This thesispresents a first implementation of the \CFA concurrency runtime. Therefore, there is still significant work to improve performance. Many of the data structures and algorithms may change in the future to more efficient versions. For example, the number of monitors in a single \textbf{bulk-acq} is only bound by the stack size, this is probably unnecessarily generous. It may be possible that limiting the number helps increase performance. However, it is not obvious that the benefit would be significant.2623 This paper presents a first implementation of the \CFA concurrency runtime. Therefore, there is still significant work to improve performance. Many of the data structures and algorithms may change in the future to more efficient versions. For example, the number of monitors in a single \textbf{bulk-acq} is only bound by the stack size, this is probably unnecessarily generous. It may be possible that limiting the number helps increase performance. However, it is not obvious that the benefit would be significant. 2628 2624 2629 2625 \subsection{Flexible Scheduling} \label{futur:sched} … … 2729 2725 \section{Acknowledgements} 2730 2726 2731 I would like to thank my supervisor, Professor Peter Buhr, for his guidance through my degree as well as the editing of this document. 2732 2733 I would like to thank Professors Martin Karsten and Gregor Richards, for reading my thesis and providing helpful feedback. 2734 2735 Thanks to Aaron Moss, Rob Schluntz and Andrew Beach for their work on the \CFA project as well as all the discussions which have helped me concretize the ideas in this thesis. 2736 2737 Finally, I acknowledge that this has been possible thanks to the financial help offered by the David R. Cheriton School of Computer Science and the corporate partnership with Huawei Ltd. 2727 Thanks to Aaron Moss, Rob Schluntz and Andrew Beach for their work on the \CFA project as well as all the discussions which helped concretize the ideas in this paper. 2728 Partial funding was supplied by the Natural Sciences and Engineering Research Council of Canada and a corporate partnership with Huawei Ltd. 2738 2729 2739 2730 … … 2744 2735 2745 2736 \end{document} 2737 2738 % Local Variables: % 2739 % tab-width: 4 % 2740 % fill-column: 120 % 2741 % compile-command: "make" % 2742 % End: %
Note: See TracChangeset
for help on using the changeset viewer.