Changeset b3ffb61 for doc/proposals/concurrency
- Timestamp: Nov 14, 2017, 1:35:12 PM
- Branches: ADT, aaron-thesis, arm-eh, ast-experimental, cleanup-dtors, deferred_resn, demangler, enum, forall-pointer-decay, jacob/cs343-translation, jenkins-sandbox, master, new-ast, new-ast-unique-expr, new-env, no_list, persistent-indexer, pthread-emulation, qualifiedEnum, resolv-new, with_gc
- Children: 20632a2, a5b7905
- Parents: 20ffcf3
- Location: doc/proposals/concurrency
- Files: 8 edited
doc/proposals/concurrency/Makefile
 	dvips $< -o $@
 
-build/${basename ${DOCUMENT}}.dvi : Makefile ${GRAPHS} ${PROGRAMS} ${PICTURES} ${FIGURES} ${SOURCES} ${basename ${DOCUMENT}}.tex ../../LaTeXmacros/common.tex ../../LaTeXmacros/indexstyle
+build/${basename ${DOCUMENT}}.dvi : Makefile ${GRAPHS} ${PROGRAMS} ${PICTURES} ${FIGURES} ${SOURCES} ${basename ${DOCUMENT}}.tex ../../LaTeXmacros/common.tex ../../LaTeXmacros/indexstyle annex/local.bib
 
 	@ if [ ! -r ${basename $@}.ind ] ; then touch ${basename $@}.ind ; fi # Conditionally create an empty *.ind (index) file for inclusion until makeindex is run.
…
 	@ -${BibTeX} ${basename $@}
 	@ echo "Glossary"
-	makeglossaries -q -s ${basename $@}.ist ${basename $@} # Make index from *.aux entries and input index at end of document
+	@ makeglossaries -q -s ${basename $@}.ist ${basename $@} # Make index from *.aux entries and input index at end of document
 	@ echo ".dvi generation"
 	@ -build/bump_ver.sh
doc/proposals/concurrency/annex/local.bib
 	year		= 2017
 }
+
+@manual{Cpp-Transactions,
+	keywords	= {C++, Transactional Memory},
+	title		= {Technical Specification for C++ Extensions for Transactional Memory},
+	organization	= {International Standard ISO/IEC TS 19841:2015},
+	publisher	= {American National Standards Institute},
+	address		= {http://www.iso.org},
+	year		= 2015,
+}
+
+@article{BankTransfer,
+	keywords	= {Bank Transfer},
+	title		= {Bank Account Transfer Problem},
+	publisher	= {Wiki Wiki Web},
+	address		= {http://wiki.c2.com},
+	year		= 2010
+}
+
+@misc{2FTwoHardThings,
+	keywords	= {Hard Problem},
+	title		= {TwoHardThings},
+	author		= {Martin Fowler},
+	address		= {https://martinfowler.com/bliki/TwoHardThings.html},
+	year		= 2009
+}
+
+@article{IntrusiveData,
+	title		= {Intrusive Data Structures},
+	author		= {Jiri Soukup},
+	journal		= {CppReport},
+	year		= 1998,
+	month		= May,
+	volume		= {10/No5.},
+	page		= 22
+}
+
+@misc{affinityLinux,
+	title		= "{Linux man page - sched\_setaffinity(2)}"
+}
+
+@misc{affinityWindows,
+	title		= "{Windows (vs.85) - SetThreadAffinityMask function}"
+}
+
+@misc{affinityFreebsd,
+	title		= "{FreeBSD General Commands Manual - CPUSET(1)}"
+}
+
+@misc{affinityNetbsd,
+	title		= "{NetBSD Library Functions Manual - AFFINITY(3)}"
+}
+
+@misc{affinityMacosx,
+	title		= "{Affinity API Release Notes for OS X v10.5}"
+}
doc/proposals/concurrency/text/basics.tex
 Execution with a single thread and multiple stacks where the thread is self-scheduling deterministically across the stacks is called coroutining. Execution with a single and multiple stacks but where the thread is scheduled by an oracle (non-deterministic from the thread perspective) across the stacks is called concurrency.
 
-Therefore, a minimal concurrency system can be achieved by creating coroutines, which instead of context switching among each other, always ask an oracle where to context switch next. While coroutines can execute on the caller's stack-frame, stackfull coroutines allow full generality and are sufficient as the basis for concurrency. The aforementioned oracle is a scheduler and the whole system now follows a cooperative threading-model \cit. The oracle/scheduler can either be a stackless or stackfull entity and correspondingly require one or two context switches to run a different coroutine. In any case, a subset of concurrency related challenges start to appear. For the complete set of concurrency challenges to occur, the only feature missing is preemption.
-
-A scheduler introduces order of execution uncertainty, while preemption introduces uncertainty about where context-switches occur. Mutual-exclusion and synchronisation are ways of limiting non-determinism in a concurrent system. Now it is important to understand that uncertainty is desireable; uncertainty can be used by runtime systems to significantly increase performance and is often the basis of giving a user the illusion that tasks are running in parallel. Optimal performance in concurrent applications is often obtained by having as much non-determinism as correctness allows \cit.
+Therefore, a minimal concurrency system can be achieved by creating coroutines, which instead of context switching among each other, always ask an oracle where to context switch next. While coroutines can execute on the caller's stack-frame, stackfull coroutines allow full generality and are sufficient as the basis for concurrency. The aforementioned oracle is a scheduler and the whole system now follows a cooperative threading-model (a.k.a non-preemptive scheduling). The oracle/scheduler can either be a stackless or stackfull entity and correspondingly require one or two context switches to run a different coroutine. In any case, a subset of concurrency related challenges start to appear. For the complete set of concurrency challenges to occur, the only feature missing is preemption.
+
+A scheduler introduces order of execution uncertainty, while preemption introduces uncertainty about where context-switches occur. Mutual-exclusion and synchronisation are ways of limiting non-determinism in a concurrent system. Now it is important to understand that uncertainty is desireable; uncertainty can be used by runtime systems to significantly increase performance and is often the basis of giving a user the illusion that tasks are running in parallel. Optimal performance in concurrent applications is often obtained by having as much non-determinism as correctness allows.
 
 \section{\protect\CFA 's Thread Building Blocks}
…
 \subsection{Alternative: Lamda Objects}
 
-For coroutines as for threads, many implementations are based on routine pointers or function objects\cit. For example, Boost implements coroutines in terms of four functor object types:
+For coroutines as for threads, many implementations are based on routine pointers or function objects\cite{Butenhof97, ANSI14:C++, MS:VisualC++, BoostCoroutines15}. For example, Boost implements coroutines in terms of four functor object types:
 \begin{cfacode}
 asymmetric_coroutine<>::pull_type
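The cooperative threading-model this hunk describes — stackfull coroutines that never switch to each other directly, but always defer to a scheduling oracle — can be sketched in plain C with POSIX ucontext. This is a minimal, illustrative sketch, not \CFA's runtime code; all names (task, yield_to_scheduler, the round-robin loop) are assumptions made for the example:

#include <stdio.h>
#include <stdlib.h>
#include <ucontext.h>

#define STACK_SIZE (64 * 1024)

static ucontext_t scheduler_ctx;   /* the oracle's own context */
static ucontext_t task_ctx[2];     /* one stackfull context per coroutine */
static int done[2];

static void yield_to_scheduler(int self) {
	/* cooperative: one context switch into the oracle, one back out */
	swapcontext(&task_ctx[self], &scheduler_ctx);
}

static void task(int self) {
	for (int i = 0; i < 3; i++) {
		printf("task %d, step %d\n", self, i);
		yield_to_scheduler(self);   /* ask the oracle where to run next */
	}
	done[self] = 1;                 /* returning resumes uc_link (the oracle) */
}

int main(void) {
	for (int t = 0; t < 2; t++) {
		getcontext(&task_ctx[t]);
		task_ctx[t].uc_stack.ss_sp = malloc(STACK_SIZE);
		task_ctx[t].uc_stack.ss_size = STACK_SIZE;
		task_ctx[t].uc_link = &scheduler_ctx;   /* return to oracle on completion */
		makecontext(&task_ctx[t], (void (*)(void))task, 1, t);
	}
	/* a deterministic round-robin oracle; choosing non-deterministically here
	   is exactly the step from coroutining to concurrency described above */
	for (int t = 0; !(done[0] && done[1]); t = 1 - t)
		if (!done[t]) swapcontext(&scheduler_ctx, &task_ctx[t]);
	for (int t = 0; t < 2; t++) free(task_ctx[t].uc_stack.ss_sp);
	return 0;
}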
doc/proposals/concurrency/text/concurrency.tex
 Approaches based on shared memory are more closely related to non-concurrent paradigms since they often rely on basic constructs like routine calls and shared objects. At the lowest level, concurrent paradigms are implemented as atomic operations and locks. Many such mechanisms have been proposed, including semaphores~\cite{Dijkstra68b} and path expressions~\cite{Campbell74}. However, for productivity reasons it is desireable to have a higher-level construct be the core concurrency paradigm~\cite{HPP:Study}.
 
-An approach that is worth mentioning because it is gaining in popularity is transactionnal memory~\cite{Dice10}[Check citation]. While this approach is even pursued by system languages like \CC\cit, the performance and feature set is currently too restrictive to be the main concurrency paradigm for systems language, which is why it was rejected as the core paradigm for concurrency in \CFA.
+An approach that is worth mentioning because it is gaining in popularity is transactionnal memory~\cite{Dice10}[Check citation]. While this approach is even pursued by system languages like \CC\cite{Cpp-Transactions}, the performance and feature set is currently too restrictive to be the main concurrency paradigm for systems language, which is why it was rejected as the core paradigm for concurrency in \CFA.
 
 One of the most natural, elegant, and efficient mechanisms for synchronization and communication, especially for shared-memory systems, is the \emph{monitor}. Monitors were first proposed by Brinch Hansen~\cite{Hansen73} and later described and extended by C.A.R.~Hoare~\cite{Hoare74}. Many programming languages---e.g., Concurrent Pascal~\cite{ConcurrentPascal}, Mesa~\cite{Mesa}, Modula~\cite{Modula-2}, Turing~\cite{Turing:old}, Modula-3~\cite{Modula-3}, NeWS~\cite{NeWS}, Emerald~\cite{Emerald}, \uC~\cite{Buhr92a} and Java~\cite{Java}---provide monitors as explicit language constructs. In addition, operating-system kernels and device drivers have a monitor-like structure, although they often use lower-level primitives such as semaphores or locks to simulate monitors. For these reasons, this project proposes monitors as the core concurrency-construct.
…
 The \gls{multi-acq} monitor lock allows a monitor lock to be acquired by both \code{bar} or \code{baz} and acquired again in \code{foo}. In the calls to \code{bar} and \code{baz} the monitors are acquired in opposite order.
 
-However, such use leads to the lock acquiring order problem. In the example above, the user uses implicit ordering in the case of function \code{foo} but explicit ordering in the case of \code{bar} and \code{baz}. This subtle mistake means that calling these routines concurrently may lead to deadlock and is therefore undefined behavior. As shown\cit, solving this problem requires:
+However, such use leads to the lock acquiring order problem. In the example above, the user uses implicit ordering in the case of function \code{foo} but explicit ordering in the case of \code{bar} and \code{baz}. This subtle mistake means that calling these routines concurrently may lead to deadlock and is therefore undefined behavior. As shown\cite{Lister77}, solving this problem requires:
 \begin{enumerate}
 	\item Dynamically tracking of the monitor-call order.
 	\item Implement rollback semantics.
 \end{enumerate}
-While the first requirement is already a significant constraint on the system, implementing a general rollback semantics in a C-like language is prohibitively complex \cit. In \CFA, users simply need to be carefull when acquiring multiple monitors at the same time or only use \gls{bulk-acq} of all the monitors. While \CFA provides only a partial solution, many system provide no solution and the \CFA partial solution handles many useful cases.
+While the first requirement is already a significant constraint on the system, implementing a general rollback semantics in a C-like language is still prohibitively complex \cite{Dice10}. In \CFA, users simply need to be carefull when acquiring multiple monitors at the same time or only use \gls{bulk-acq} of all the monitors. While \CFA provides only a partial solution, many system provide no solution and the \CFA partial solution handles many useful cases.
 
 For example, \gls{multi-acq} and \gls{bulk-acq} can be used together in interesting ways:
…
 	}
 \end{cfacode}
-This example shows a trivial solution to the bank-account transfer-problem\cit. Without \gls{multi-acq} and \gls{bulk-acq}, the solution to this problem is much more involved and requires carefull engineering.
+This example shows a trivial solution to the bank-account transfer-problem\cite{BankTransfer}. Without \gls{multi-acq} and \gls{bulk-acq}, the solution to this problem is much more involved and requires carefull engineering.
 
 \subsection{\code{mutex} statement} \label{mutex-stmt}
 
-The call semantics discussed aboved have one software engineering issue, only a named routine can acquire the mutual-exclusion of a set of monitor. \CFA offers the \code{mutex} statement to workaround the need for unnecessary names, avoiding a major software engineering problem\cit. Listing \ref{lst:mutex-stmt} shows an example of the \code{mutex} statement, which introduces a new scope in which the mutual-exclusion of a set of monitor is acquired. Beyond naming, the \code{mutex} statement has no semantic difference from a routine call with \code{mutex} parameters.
+The call semantics discussed aboved have one software engineering issue, only a named routine can acquire the mutual-exclusion of a set of monitor. \CFA offers the \code{mutex} statement to workaround the need for unnecessary names, avoiding a major software engineering problem\cite{2FTwoHardThings}. Listing \ref{lst:mutex-stmt} shows an example of the \code{mutex} statement, which introduces a new scope in which the mutual-exclusion of a set of monitor is acquired. Beyond naming, the \code{mutex} statement has no semantic difference from a routine call with \code{mutex} parameters.
 
 \begin{figure}
…
 % ======================================================================
 % ======================================================================
-In addition to mutual exclusion, the monitors at the core of \CFA's concurrency can also be used to achieve synchronisation. With monitors, this capability is generally achieved with internal or external scheduling as in \cit. Since internal scheduling within a single monitor is mostly a solved problem, this thesis concentrates on extending internal scheduling to multiple monitors. Indeed, like the \gls{bulk-acq} semantics, internal scheduling extends to multiple monitors in a way that is natural to the user but requires additional complexity on the implementation side.
+In addition to mutual exclusion, the monitors at the core of \CFA's concurrency can also be used to achieve synchronisation. With monitors, this capability is generally achieved with internal or external scheduling as in \cite{Hoare74}. Since internal scheduling within a single monitor is mostly a solved problem, this thesis concentrates on extending internal scheduling to multiple monitors. Indeed, like the \gls{bulk-acq} semantics, internal scheduling extends to multiple monitors in a way that is natural to the user but requires additional complexity on the implementation side.
 
 First, here is a simple example of such a technique:
…
 This version uses \gls{bulk-acq} (denoted using the {\sf\&} symbol), but the presence of multiple monitors does not add a particularly new meaning. Synchronization happens between the two threads in exactly the same way and order. The only difference is that mutual exclusion covers more monitors. On the implementation side, handling multiple monitors does add a degree of complexity as the next few examples demonstrate.
 
-While deadlock issues can occur when nesting monitors, these issues are only a symptom of the fact that locks, and by extension monitors, are not perfectly composable. For monitors, a well known deadlock problem is the Nested Monitor Problem \cit, which occurs when a \code{wait} is made by a thread that holds more than one monitor. For example, the following pseudo-code runs into the nested-monitor problem:
+While deadlock issues can occur when nesting monitors, these issues are only a symptom of the fact that locks, and by extension monitors, are not perfectly composable. For monitors, a well known deadlock problem is the Nested Monitor Problem \cite{Lister77}, which occurs when a \code{wait} is made by a thread that holds more than one monitor. For example, the following pseudo-code runs into the nested-monitor problem:
 \begin{multicols}{2}
 \begin{pseudo}
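The lock-acquiring-order problem and the bank-account transfer example in these hunks have a classic manual workaround that \gls{bulk-acq} automates: take every needed lock in one global order. A hedged sketch in C with pthreads, ordering the two accounts by address (the same sort-by-pointer trick the internals chapter describes); the types and names are illustrative, not from \CFA:

#include <pthread.h>
#include <stdint.h>

typedef struct {
	pthread_mutex_t lock;
	long balance;
} account_t;

void transfer(account_t *from, account_t *to, long amount) {
	if (from == to) return;   /* avoid locking the same mutex twice */
	/* order the two acquisitions by address so that concurrent
	   transfer(a, b) and transfer(b, a) cannot deadlock */
	account_t *first  = (uintptr_t)from < (uintptr_t)to ? from : to;
	account_t *second = (uintptr_t)from < (uintptr_t)to ? to   : from;
	pthread_mutex_lock(&first->lock);
	pthread_mutex_lock(&second->lock);   /* both held: the critical section */
	from->balance -= amount;
	to->balance   += amount;
	pthread_mutex_unlock(&second->lock);
	pthread_mutex_unlock(&first->lock);
}

With this discipline the transfers contend but cannot deadlock; the cost is that every call site must remember it, which is exactly the engineering burden \gls{bulk-acq} removes by sorting the monitors inside the runtime.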
doc/proposals/concurrency/text/future.tex
 \section{Non-Blocking IO} \label{futur:nbio}
 While most of the parallelism tools
-However, many modern workloads are not bound on computation but on IO operations, an common case being webservers and XaaS (anything as a service). These type of workloads often require significant engineering around amortising costs of blocking IO operations. While improving throughtput of these operations is outside what \CFA can do as a language, it can help users to make better use of the CPU time otherwise spent waiting on IO operations. The current trend is to use asynchronous programming using tools like callbacks and/or futurs and promises\cit. However, while these are valid solutions, they lead to code that is harder to read and maintain because it is much less linear
+However, many modern workloads are not bound on computation but on IO operations, an common case being webservers and XaaS (anything as a service). These type of workloads often require significant engineering around amortising costs of blocking IO operations. While improving throughtput of these operations is outside what \CFA can do as a language, it can help users to make better use of the CPU time otherwise spent waiting on IO operations. The current trend is to use asynchronous programming using tools like callbacks and/or futurs and promises\cite. However, while these are valid solutions, they lead to code that is harder to read and maintain because it is much less linear
 
 \section{Other concurrency tools} \label{futur:tools}
-While monitors offer a flexible and powerful concurent core for \CFA, other concurrency tools are also necessary for a complete multi-paradigm concurrency package. Example of such tools can include simple locks and condition variables, futures and promises, and executors. These additional features are useful when monitors offer a level of abstraction which is indaquate for certain tasks.
+While monitors offer a flexible and powerful concurent core for \CFA, other concurrency tools are also necessary for a complete multi-paradigm concurrency package. Example of such tools can include simple locks and condition variables, futures and promises\cite{promises}, and executors. These additional features are useful when monitors offer a level of abstraction which is indaquate for certain tasks.
 
 \section{Implicit threading} \label{futur:implcit}
-Simpler applications can benefit greatly from having implicit parallelism. That is, parallelism that does not rely on the user to write concurrency. This type of parallelism can be achieved both at the language level and at the library level. The cannonical example of implcit parallelism is parallel for loops, which are the simplest example of a divide and conquer algorithm\cit. Listing \ref{lst:parfor} shows three different code examples that accomplish pointwise sums of large arrays. Note that none of these example explicitly declare any concurrency or parallelism objects.
+Simpler applications can benefit greatly from having implicit parallelism. That is, parallelism that does not rely on the user to write concurrency. This type of parallelism can be achieved both at the language level and at the library level. The cannonical example of implcit parallelism is parallel for loops, which are the simplest example of a divide and conquer algorithm\cite{uC++book}. Listing \ref{lst:parfor} shows three different code examples that accomplish pointwise sums of large arrays. Note that none of these example explicitly declare any concurrency or parallelism objects.
 
 \begin{figure}
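As a concrete reference point for the parallel-for style of implicit parallelism this hunk cites, the pointwise array sum can be written in C with OpenMP, where the pragma is the only concurrency-related text. This is an illustrative sketch, not one of the listing's actual examples; it assumes compilation with -fopenmp (without the flag the pragma is ignored and the loop runs sequentially):

#include <stddef.h>

/* Pointwise sum out[i] = a[i] + b[i]; the runtime divides the iteration
   space among threads, so no thread or lock objects are declared. */
void pointwise_sum(const double *a, const double *b, double *out, size_t n) {
	#pragma omp parallel for
	for (size_t i = 0; i < n; i++)
		out[i] = a[i] + b[i];
}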
doc/proposals/concurrency/text/internals.tex
 There are several challenges specific to \CFA when implementing concurrency. These challenges are a direct result of \gls{bulk-acq} and loose object-definitions. These two constraints are the root cause of most design decisions in the implementation. Furthermore, to avoid contention from dynamically allocating memory in a concurrent environment, the internal-scheduling design is (almost) entirely free of mallocs. This is to avoid the chicken and egg problem \cite{Chicken} of having a memory allocator that relies on the threading system and a threading system that relies on the runtime. This extra goal, means that memory management is a constant concern in the design of the system.
 
-The main memory concern for concurrency is queues. All blocking operations are made by parking threads onto queues. The queue design needs to be intrinsic\cit to avoid the need for memory allocation, which entails that all the nodes need specific fields to keep track of all needed information. Since many concurrency operations can use an unbound amount of memory (depending on \gls{bulk-acq}), statically defining information in the intrusive fields of threads is insufficient. The only variable sized container that does not require memory allocation is the callstack, which is heavily used in the implementation of internal scheduling. Particularly variable length arrays, which are used extensively.
+The main memory concern for concurrency is queues. All blocking operations are made by parking threads onto queues. The queue design needs to be intrusive\cite{IntrusiveData} to avoid the need for memory allocation, which entails that all the nodes need specific fields to keep track of all needed information. Since many concurrency operations can use an unbound amount of memory (depending on \gls{bulk-acq}), statically defining information in the intrusive fields of threads is insufficient. The only variable sized container that does not require memory allocation is the callstack, which is heavily used in the implementation of internal scheduling. Particularly variable length arrays, which are used extensively.
 
 Since stack allocation is based around scope, the first step of the implementation is to identify the scopes that are available to store the information, and which of these can have a variable length. The threads and the condition both allow a fixed amount of memory to be stored, while mutex-routines and the actual blocking call allow for an unbound amount (though the later is preferable in terms of performance).
…
 % ======================================================================
 
-The first step towards the monitor implementation is simple mutex-routines using monitors. In the single monitor case, this is done using the entry/exit procedure highlighted in listing \ref{lst:entry1}. This entry/exit procedure does not actually have to be extended to support multiple monitors, indeed it is sufficient to enter/leave monitors one-by-one as long as the order is correct to prevent deadlocks\cit. In \CFA, ordering of monitor relies on memory ordering, this is sufficient because all objects are guaranteed to have distinct non-overlaping memory layouts and mutual-exclusion for a monitor is only defined for its lifetime, meaning that destroying a monitor while it is acquired is undefined behavior. When a mutex call is made, the concerned monitors are agregated into a variable-length pointer array and sorted based on pointer values. This array presists for the entire duration of the mutual-exclusion and its ordering reused extensively.
+The first step towards the monitor implementation is simple mutex-routines using monitors. In the single monitor case, this is done using the entry/exit procedure highlighted in listing \ref{lst:entry1}. This entry/exit procedure does not actually have to be extended to support multiple monitors, indeed it is sufficient to enter/leave monitors one-by-one as long as the order is correct to prevent deadlocks\cite{Havender68}. In \CFA, ordering of monitor relies on memory ordering, this is sufficient because all objects are guaranteed to have distinct non-overlaping memory layouts and mutual-exclusion for a monitor is only defined for its lifetime, meaning that destroying a monitor while it is acquired is undefined behavior. When a mutex call is made, the concerned monitors are agregated into a variable-length pointer array and sorted based on pointer values. This array presists for the entire duration of the mutual-exclusion and its ordering reused extensively.
 
 \begin{figure}
 \begin{multicols}{2}
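An intrusive queue of the kind this hunk now cites (\cite{IntrusiveData}) keeps the link field inside the blocked thread's own descriptor, so parking and unparking never allocate. A minimal sketch in C, with illustrative names rather than \CFA's actual runtime types:

#include <stddef.h>

typedef struct thread_desc {
	struct thread_desc *next;   /* intrusive link: owned by whichever queue
	                               the thread is currently parked on */
	/* ... stack pointer, state, and other runtime fields ... */
} thread_desc;

typedef struct {
	thread_desc *head;
	thread_desc **tail;         /* points at the last next-field: O(1) append */
} intrusive_queue;

void queue_init(intrusive_queue *q) { q->head = NULL; q->tail = &q->head; }

void queue_push(intrusive_queue *q, thread_desc *t) {
	t->next = NULL;
	*q->tail = t;               /* no malloc: the node is the thread itself */
	q->tail = &t->next;
}

thread_desc *queue_pop(intrusive_queue *q) {
	thread_desc *t = q->head;
	if (t) {
		q->head = t->next;
		if (!q->head) q->tail = &q->head;
	}
	return t;
}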
doc/proposals/concurrency/text/parallelism.tex
 
 \subsection{Future Work: Machine setup}\label{machine}
-While this was not done in the context of this thesis, another important aspect of clusters is affinity. While many common desktop and laptop PCs have homogeneous CPUs, other devices often have more heteregenous setups. For example, system using \acrshort{numa} configurations may benefit from users being able to tie clusters and /or kernel threads to certains CPU cores. OS support for CPU affinity is now common \cit, which means it is both possible and desirable for \CFA to offer an abstraction mechanism for portable CPU affinity.
+While this was not done in the context of this thesis, another important aspect of clusters is affinity. While many common desktop and laptop PCs have homogeneous CPUs, other devices often have more heteregenous setups. For example, system using \acrshort{numa} configurations may benefit from users being able to tie clusters and\/or kernel threads to certains CPU cores. OS support for CPU affinity is now common \cite{affinityLinux, affinityWindows, affinityFreebsd, affinityNetbsd, affinityMacosx} which means it is both possible and desirable for \CFA to offer an abstraction mechanism for portable CPU affinity.
 
-\subsection{Paradigms}\label{cfaparadigms}
-Given these building blocks, it is possible to reproduce all three of the popular paradigms. Indeed, \glspl{uthread} is the default paradigm in \CFA. However, disabling \gls{preemption} on the \gls{cfacluster} means \glspl{cfathread} effectively become \glspl{fiber}. Since several \glspl{cfacluster} with different scheduling policy can coexist in the same application, this allows \glspl{fiber} and \glspl{uthread} to coexist in the runtime of an application. Finally, it is possible to build executors for thread pools from \glspl{uthread} or \glspl{fiber}.
+% \subsection{Paradigms}\label{cfaparadigms}
+% Given these building blocks, it is possible to reproduce all three of the popular paradigms. Indeed, \glspl{uthread} is the default paradigm in \CFA. However, disabling \gls{preemption} on the \gls{cfacluster} means \glspl{cfathread} effectively become \glspl{fiber}. Since several \glspl{cfacluster} with different scheduling policy can coexist in the same application, this allows \glspl{fiber} and \glspl{uthread} to coexist in the runtime of an application. Finally, it is possible to build executors for thread pools from \glspl{uthread} or \glspl{fiber}.
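For reference, the Linux flavour of the affinity support cited in this hunk (sched\_setaffinity(2) and its pthread wrapper) looks roughly as follows. A hedged sketch, Linux-specific and using only the glibc extension pthread_setaffinity_np; the very fact that every OS cited above needs a different call is why a portable abstraction is attractive:

#define _GNU_SOURCE
#include <sched.h>
#include <pthread.h>

/* Pin the calling kernel thread to a single CPU core; returns 0 on success. */
int pin_to_core(int core) {
	cpu_set_t set;
	CPU_ZERO(&set);
	CPU_SET(core, &set);
	return pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &set);
}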
doc/proposals/concurrency/version
-0.11.95
+0.11.129