Changeset 27125d0 for doc/papers/concurrency/Paper.tex
- Timestamp: Jun 6, 2020, 4:59:19 PM
- Branches: arm-eh, enum, forall-pointer-decay, jacob/cs343-translation, master, new-ast, new-ast-unique-expr, pthread-emulation, qualifiedEnum
- Children: 393d59a
- Parents: 9246ec6
- File: 1 edited
Legend:
- Unmodified: no prefix
- Added: prefixed with +
- Removed: prefixed with -
doc/papers/concurrency/Paper.tex
r9246ec6 → r27125d0

  Section~\ref{s:Monitor} shows how both mutual exclusion and synchronization are safely embedded in the @monitor@ and @thread@ constructs.
  Section~\ref{s:CFARuntimeStructure} describes the large-scale mechanism to structure threads and virtual processors (kernel threads).
- Section~\ref{s:Performance} uses microbenchmarks to compare \CFA threading with pthreads, Java 11.0.6, Go 1.12.6, Rust 1.37.0, Python 3.7.6, Node.js 12.14.1, and \uC 7.0.0.
+ Section~\ref{s:Performance} uses microbenchmarks to compare \CFA threading with pthreads, Java 11.0.6, Go 1.12.6, Rust 1.37.0, Python 3.7.6, Node.js v12.18.0, and \uC 7.0.0.

…

  With respect to safety, we believe static analysis can discriminate persistent generator state from temporary generator-main state and raise a compile-time error for temporary usage spanning suspend points.
  Our experience using generators is that the problems have simple data state, including local state, but complex execution state, so the burden of creating the generator type is small.
- As well, C programmers are not afraid of this kind of semantic programming requirement, if it results in very small, fast generators.
+ As well, C programmers are not afraid of this kind of semantic programming requirement, if it results in very small and fast generators.

  Figure~\ref{f:CFAFormatGen} shows an asymmetric \newterm{input generator}, @Fmt@, for restructuring text into groups of characters of fixed-size blocks, \ie the input on the left is reformatted into the output on the right, where newlines are ignored.

…

  The destructor provides a newline, if formatted text ends with a full line.
  Figure~\ref{f:CFormatGenImpl} shows the C implementation of the \CFA input generator with one additional field and the computed @goto@.
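Since neither figure is reproduced in this changeset, the following is a rough Python sketch of such an input formatter, illustrative only: the function name, group size, and block size are assumptions, not the paper's @Fmt@ code.

```python
import io

def fmt(out, g=4, b=5):
    """Input generator: characters are send()'ed in; writes groups of g
    characters, b groups per output line, ignoring newlines in the input."""
    try:
        while True:
            for _ in range(b):            # b groups per output line
                for _ in range(g):        # g characters per group
                    ch = yield
                    while ch == '\n':     # ignore newlines
                        ch = yield
                    out.write(ch)
                out.write(' ')            # separate groups
            out.write('\n')               # end the line after b groups
    except GeneratorExit:                 # "destructor" analogue: flush a partial line
        out.write('\n')

buf = io.StringIO()
f = fmt(buf)
next(f)                                   # prime: run to the first yield
for ch in "abcdefghijkl\nmnopqrstuvwxyz":
    f.send(ch)
f.close()
print(buf.getvalue(), end="")
```

As in the paper's description, the nested loops carry the execution state between suspensions, so no explicit state variables are needed.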
- For contrast, Figure~\ref{f:PythonFormatter} shows the equivalent Python format generator with the same properties as the format generator.
+ For contrast, Figure~\ref{f:PythonFormatter} shows the equivalent Python format generator with the same properties as the \CFA format generator.

  % https://dl-acm-org.proxy.lib.uwaterloo.ca/

- Figure~\ref{f:DeviceDriverGen} shows an important application for an asymmetric generator, a device-driver, because device drivers are a significant source of operating-system errors: 85\% in Windows XP~\cite[p.~78]{Swift05} and 51.6\% in Linux~\cite[p.~1358]{Xiao19}. %\cite{Palix11}
+ An important application for the asymmetric generator is a device-driver, because device drivers are a significant source of operating-system errors: 85\% in Windows XP~\cite[p.~78]{Swift05} and 51.6\% in Linux~\cite[p.~1358]{Xiao19}. %\cite{Palix11}
  Swift \etal~\cite[p.~86]{Swift05} restructure device drivers using the Extension Procedure Call (XPC) within the kernel via functions @nooks_driver_call@ and @nooks_kernel_call@, which have coroutine properties context switching to separate stacks with explicit hand-off calls;
  however, the calls do not retain execution state, and hence always start from the top.

…

  However, Adya \etal~\cite{Adya02} argue against stack ripping in Section 3.2 and suggest a hybrid approach in Section 4 using cooperatively scheduled \emph{fibers}, which is coroutining.

- As an example, the following protocol:
+ Figure~\ref{f:DeviceDriverGen} shows the generator advantages in implementing a simple network device-driver with the following protocol:
  \begin{center}
  \ldots\, STX \ldots\, message \ldots\, ESC ETX \ldots\, message \ldots\, ETX 2-byte crc \ldots
  \end{center}
- is for a simple network message beginning with the control character STX, ending with an ETX, and followed by a 2-byte cyclic-redundancy check.
+ where the network message begins with the control character STX, ends with an ETX, and is followed by a 2-byte cyclic-redundancy check.
  Control characters may appear in a message if preceded by an ESC.
  When a message byte arrives, it triggers an interrupt, and the operating system services the interrupt by calling the device driver with the byte read from a hardware register.
- The device driver returns a status code of its current state, and when a complete message is obtained, the operating system read the message accumulated in the supplied buffer.
+ The device driver returns a status code of its current state, and when a complete message is obtained, the operating system reads the message accumulated in the supplied buffer.
  Hence, the device driver is an input/output generator, where the cost of resuming the device-driver generator is the same as call and return, so performance in an operating-system kernel is excellent.
  The key benefits of using a generator are correctness, safety, and maintenance because the execution states are transcribed directly into the programming language rather than table lookup or stack ripping.
- The conclusion is that FSMs are complex and occur in important domains, so direct generator support is important in a system programming language.
+ % The conclusion is that FSMs are complex and occur in important domains, so direct generator support is important in a system programming language.
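As a sketch of this idea, here is a hypothetical Python input/output generator for the protocol; the status codes, names, and the unchecked CRC are illustrative assumptions, not the paper's @DeviceDriverGen@ figure. The protocol state is encoded in the generator's code position rather than an explicit state variable.

```python
STX, ETX, ESC = 0x02, 0x03, 0x1B

def driver(msg):
    """Input/output generator: the OS send()s each byte as it arrives and
    receives a status code; msg is the caller-supplied message buffer."""
    byte = yield                    # receive the first byte
    while True:
        while byte != STX:          # scan for the start of a message
            byte = yield "CONT"
        msg.clear()
        byte = yield "CONT"         # consume STX
        while byte != ETX:
            if byte == ESC:         # escaped: store the next byte verbatim
                byte = yield "CONT"
            msg.append(byte)
            byte = yield "CONT"
        byte = yield "CONT"         # consume ETX; receive crc byte 1
        byte = yield "CONT"         # consume crc 1; receive crc byte 2
        byte = yield "MSG"          # complete message now available in msg

msg = []
d = driver(msg)
next(d)                             # prime: run to the first yield
wire = [0x41, STX, ord('h'), ESC, ord('i'), ord('!'), ETX, 0xAB, 0xCD]
status = [d.send(b) for b in wire]
print(status[-1], bytes(msg).decode())   # MSG hi!
```

Each @send@ plays the role of resuming the generator from the interrupt handler, which is the same structure the paper attributes to the \CFA driver.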

  \begin{figure}

…

  \end{figure}

- Figure~\ref{f:CFAPingPongGen} shows a symmetric generator, where the generator resumes another generator, forming a resume/resume cycle.
+ Generators can also have symmetric activation using resume/resume to create control-flow cycles among generators.
  (The trivial cycle is a generator resuming itself.)
  This control flow is similar to recursion for functions but without stack growth.
- Figure~\ref{f:PingPongFullCoroutineSteps} shows the steps for symmetric control-flow are creating, executing, and terminating the cycle.
+ Figure~\ref{f:PingPongFullCoroutineSteps} shows the steps of symmetric control-flow for the ping/pong program in Figure~\ref{f:CFAPingPongGen}.
+ The program starts by creating the generators, @ping@ and @pong@, and then assigns the partners that form the cycle.
  Constructing the cycle must deal with definition-before-use to close the cycle, \ie, the first generator must know about the last generator, which is not within scope.
  (This issue occurs for any cyclic data structure.)
- The example creates the generators, @ping@ and @pong@, and then assigns the partners that form the cycle.
  % (Alternatively, the constructor can assign the partners as they are declared, except the first, and the first-generator partner is set after the last generator declaration to close the cycle.)
- Once the cycle is formed, the program main resumes one of the generators, @ping@, and the generators can then traverse an arbitrary cycle using @resume@ to activate partner generator(s).
+ Once the cycle is formed, the program main resumes one of the generators, @ping@, and the generators can then traverse an arbitrary number of cycles using @resume@ to activate partner generator(s).
  Terminating the cycle is accomplished by @suspend@ or @return@, both of which go back to the stack frame that started the cycle (program main in the example).
  Note, the creator and starter may be different, \eg if the creator calls another function that starts the cycle.

…

  Also, since local variables are not retained in the generator function, there are no objects with destructors to be called, so the cost is the same as a function return.
  Destructor cost occurs when the generator instance is deallocated by the creator.
+
+ \begin{figure}
+ \centering
+ \input{FullCoroutinePhases.pstex_t}
+ \vspace*{-10pt}
+ \caption{Symmetric coroutine steps: Ping / Pong}
+ \label{f:PingPongFullCoroutineSteps}
+ \end{figure}

  \begin{figure}

…

  \end{figure}

- \begin{figure}
- \centering
- \input{FullCoroutinePhases.pstex_t}
- \vspace*{-10pt}
- \caption{Symmetric coroutine steps: Ping / Pong}
- \label{f:PingPongFullCoroutineSteps}
- \end{figure}
-
  Figure~\ref{f:CPingPongSim} shows the C implementation of the \CFA symmetric generator, where there is still only one additional field, @restart@, but @resume@ is more complex because it does a forward rather than backward jump.
  Before the jump, the parameter for the next call @partner@ is placed into the register used for the first parameter, @rdi@, and the remaining registers are reset for a return.

…

  \label{s:Coroutine}

- Stackful coroutines (Table~\ref{t:ExecutionPropertyComposition} case 5) extend generator semantics, \ie there is an implicit closure and @suspend@ may appear in a helper function called from the coroutine main.
- A coroutine is specified by replacing @generator@ with @coroutine@ for the type.
+ Stackful coroutines (Table~\ref{t:ExecutionPropertyComposition} case 5) extend generator semantics with an implicit closure and @suspend@ may appear in a helper function called from the coroutine main because of the separate stack.
+ Note, simulating coroutines with stacks of generators, \eg Python with @yield from@, cannot handle symmetric control-flow.
+ Furthermore, all stack components must be generators, so it is impossible to call a library function passing a generator that yields.
+ Creating a generator copy of the library function may be impossible because the library function is opaque.
+
+ A \CFA coroutine is specified by replacing @generator@ with @coroutine@ for the type.
  Coroutine generality results in higher cost for creation, due to dynamic stack allocation, for execution, due to context switching among stacks, and for terminating, due to possible stack unwinding and dynamic stack deallocation.
  A series of different kinds of coroutines and their implementations demonstrate how coroutines extend generators.

  First, the previous generator examples are converted to their coroutine counterparts, allowing local-state variables to be moved from the generator type into the coroutine main.
+ Now the coroutine type only contains communication variables between interface functions and the coroutine main.
  \begin{center}
  \begin{tabular}{@{}l|l|l|l@{}}

…

  }
  \end{cfa}
- A call to this function is placed at the end of the driver's coroutine-main.
+ A call to this function is placed at the end of the device driver's coroutine-main.
  For complex finite-state machines, refactoring is part of normal program abstraction, especially when code is used in multiple places.
  Again, this complexity is usually associated with execution state rather than data state.
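The @yield from@ limitation can be seen in standard Python (a generic illustration, not code from the paper): suspension only propagates through frames that are themselves generators and explicitly delegate, whereas a stackful coroutine can suspend from any ordinary helper call.

```python
def helper(n):
    # Must itself be a generator: an ordinary function called from a
    # generator cannot suspend on the generator's behalf.
    for i in range(n):
        yield i

def main_gen():
    yield "start"
    yield from helper(3)   # every frame in the "stack" must delegate like this
    yield "done"

print(list(main_gen()))    # ['start', 0, 1, 2, 'done']
```

If @helper@ were a plain function containing @yield@, it would become a generator itself and calling it would merely build an unused generator object, which is the opacity problem the text describes for library functions.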
…

  The \CFA @dtype@ property provides no \emph{implicit} copying operations and the @is_coroutine@ trait provides no \emph{explicit} copying operations, so all coroutines must be passed by reference or pointer.
  The function definitions ensure there is a statically typed @main@ function that is the starting point (first stack frame) of a coroutine, and a mechanism to read the coroutine descriptor from its handle.
- The @main@ function has no return value or additional parameters because the coroutine type allows an arbitrary number of interface functions with corresponding arbitrary typed input and output values versus fixed ones.
+ The @main@ function has no return value or additional parameters because the coroutine type allows an arbitrary number of interface functions with arbitrary typed input and output values versus fixed ones.
  The advantage of this approach is that users can easily create different types of coroutines, \eg changing the memory layout of a coroutine is trivial when implementing the @get_coroutine@ function, and possibly redefining \textsf{suspend} and @resume@.

…

  The coroutine descriptor contains all implicit declarations needed by the runtime, \eg @suspend@/@resume@, and can be part of the coroutine handle or separate.
  The coroutine stack can appear in a number of locations and be fixed or variable sized.
- Hence, the coroutine's stack could be a variable-length structure (VLS)\footnote{
- We are examining VLSs, where fields can be variable-sized structures or arrays.
- Once allocated, a VLS is fixed sized.}
+ Hence, the coroutine's stack could be a variable-length structure (VLS)
+ % \footnote{
+ % We are examining VLSs, where fields can be variable-sized structures or arrays.
+ % Once allocated, a VLS is fixed sized.}
  on the allocating stack, provided the allocating stack is large enough.
  For a VLS stack allocation and deallocation is an inexpensive adjustment of the stack pointer, modulo any stack constructor costs to initial frame setup.

…

  \label{s:threads}

- Threading (Table~\ref{t:ExecutionPropertyComposition} case 11) needs the ability to start a thread and wait for its completion.
- A common API for this ability is @fork@ and @join@.
+ Threading (Table~\ref{t:ExecutionPropertyComposition} case 11) needs the ability to start a thread and wait for its completion, where a common API is @fork@ and @join@.
  \vspace{4pt}
  \par\noindent

…

  For these reasons, \CFA selected monitors as the core high-level concurrency construct, upon which higher-level approaches can be easily constructed.

- Figure~\ref{f:AtomicCounter} compares a \CFA and Java monitor implementing an atomic counter.\footnote{
- Like other concurrent programming languages, \CFA and Java have performant specializations for the basic types using atomic instructions.}
+ Figure~\ref{f:AtomicCounter} compares a \CFA and Java monitor implementing an atomic counter.
+ (Like other concurrent programming languages, \CFA and Java have performant specializations for the basic types using atomic instructions.)
  A \newterm{monitor} is a set of functions that ensure mutual exclusion when accessing shared state.
  (Note, in \CFA, @monitor@ is short-hand for @mutex struct@.)

…

  The total time is divided by @N@ to obtain the average time for a benchmark.
  Each benchmark experiment is run 13 times and the average appears in the table.
- All omitted tests for other languages are functionally identical to the \CFA tests and available online~\cite{CforallBenchMarks}.
+ All omitted tests for other languages are functionally identical to the \CFA tests and available online~\cite{CforallConcurrentBenchmarks}.
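For readers without Figure~\ref{f:AtomicCounter}, a rough lock-based analogue of the atomic-counter monitor can be sketched in Python (illustrative only: Python has no monitor construct, so the mutual exclusion is an explicit lock around every function that touches the shared state):

```python
import threading

class Counter:
    """Monitor-style counter: each member function acquires the object's
    lock, so at most one thread accesses the shared state at a time."""
    def __init__(self):
        self._lock = threading.Lock()
        self._value = 0

    def inc(self):
        with self._lock:          # analogous to a mutex member function
            self._value += 1
            return self._value

    def get(self):
        with self._lock:
            return self._value

c = Counter()
threads = [threading.Thread(target=lambda: [c.inc() for _ in range(1000)])
           for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(c.get())                    # 4000: no increments lost
```

The point of the monitor abstraction in the paper is that this lock management is implicit in the type, rather than repeated by hand as here.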
  % tar --exclude-ignore=exclude -cvhf benchmark.tar benchmark
+ % cp -p benchmark.tar /u/cforall/public_html/doc/concurrent_benchmark.tar


  \paragraph{Creation}

…

  \uC thread & 523.4 & 523.9 & 7.7 \\
  Python generator & 123.2 & 124.3 & 4.1 \\
- Node.js generator & 32.3 & 32.2 & 0.3 \\
+ Node.js generator & 33.4 & 33.5 & 0.3 \\
  Goroutine thread & 751.0 & 750.5 & 3.1 \\
  Rust tokio thread & 1860.0 & 1881.1 & 37.6 \\

…

  Python generator & 40.9 & 41.3 & 1.5 \\
  Node.js await & 1852.2 & 1854.7 & 16.4 \\
- Node.js generator & 32.6 & 32.2 & 1.0 \\
+ Node.js generator & 33.3 & 33.4 & 0.3 \\
  Goroutine thread & 143.0 & 143.3 & 1.1 \\
  Rust async await & 32.0 & 32.0 & 0.0 \\

…

  While control flow in \CFA has a strong start, development is still underway to complete a number of missing features.

- \vspace{-5pt}
- \paragraph{Flexible Scheduling}
- \label{futur:sched}
-
+ \medskip
+ \textbf{Flexible Scheduling:}
  An important part of concurrency is scheduling.
  Different scheduling algorithms can affect performance, both in terms of average and variation.

…

  Currently, the \CFA pluggable scheduler is too simple to handle complex scheduling, \eg quality of service and real-time, where the scheduler must interact with mutex objects to deal with issues like priority inversion~\cite{Buhr00b}.

- \vspace{-5pt}
- \paragraph{Non-Blocking I/O}
- \label{futur:nbio}
+ \smallskip
+ \textbf{Non-Blocking I/O:}
  Many modern workloads are not bound by computation but IO operations, common cases being web servers and XaaS~\cite{XaaS} (anything as a service).
  These types of workloads require significant engineering to amortize the costs of blocking IO operations.

…

  A non-blocking I/O library is currently under development for \CFA.

- \vspace{-5pt}
- \paragraph{Other Concurrency Tools}
- \label{futur:tools}
-
+ \smallskip
+ \textbf{Other Concurrency Tools:}
  While monitors offer flexible and powerful concurrency for \CFA, other concurrency tools are also necessary for a complete multi-paradigm concurrency package.
  Examples of such tools can include futures and promises~\cite{promises}, executors, and actors.

…

  As well, new \CFA extensions should make it possible to create a uniform interface for virtually all mutual exclusion, including monitors and low-level locks.

- \vspace{-5pt}
- \paragraph{Implicit Threading}
- \label{futur:implcit}
-
- Basic \emph{embarrassingly parallel} applications can benefit greatly from implicit concurrency, where sequential programs are converted to concurrent, possibly with some help from pragmas to guide the conversion.
+ \smallskip
+ \textbf{Implicit Threading:}
+ Basic \emph{embarrassingly parallel} applications can benefit greatly from implicit concurrency, where sequential programs are converted to concurrent, with some help from pragmas to guide the conversion.
  This type of concurrency can be achieved both at the language level and at the library level.
  The canonical example of implicit concurrency is concurrent nested @for@ loops, which are amenable to divide and conquer algorithms~\cite{uC++book}.