Changeset 27125d0 for doc/papers/concurrency

- Timestamp: Jun 6, 2020, 4:59:19 PM
- Branches: ADT, arm-eh, ast-experimental, enum, forall-pointer-decay, jacob/cs343-translation, master, new-ast, new-ast-unique-expr, pthread-emulation, qualifiedEnum
- Children: 393d59a
- Parents: 9246ec6
- Location: doc/papers/concurrency
- Files: 2 edited
doc/papers/concurrency/Paper.tex
r9246ec6 → r27125d0

@@ 386-389 @@
  Section~\ref{s:Monitor} shows how both mutual exclusion and synchronization are safely embedded in the @monitor@ and @thread@ constructs.
  Section~\ref{s:CFARuntimeStructure} describes the large-scale mechanism to structure threads and virtual processors (kernel threads).
- Section~\ref{s:Performance} uses microbenchmarks to compare \CFA threading with pthreads, Java 11.0.6, Go 1.12.6, Rust 1.37.0, Python 3.7.6, Node.js 12.14.1, and \uC 7.0.0.
+ Section~\ref{s:Performance} uses microbenchmarks to compare \CFA threading with pthreads, Java 11.0.6, Go 1.12.6, Rust 1.37.0, Python 3.7.6, Node.js v12.18.0, and \uC 7.0.0.

@@ 893-897 @@
  With respect to safety, we believe static analysis can discriminate persistent generator state from temporary generator-main state and raise a compile-time error for temporary usage spanning suspend points.
  Our experience using generators is that the problems have simple data state, including local state, but complex execution state, so the burden of creating the generator type is small.
- As well, C programmers are not afraid of this kind of semantic programming requirement, if it results in very small, fast generators.
+ As well, C programmers are not afraid of this kind of semantic programming requirement, if it results in very small and fast generators.

  Figure~\ref{f:CFAFormatGen} shows an asymmetric \newterm{input generator}, @Fmt@, for restructuring text into groups of characters of fixed-size blocks, \ie the input on the left is reformatted into the output on the right, where newlines are ignored.

@@ 915-916 @@
  The destructor provides a newline, if formatted text ends with a full line.
  Figure~\ref{f:CFormatGenImpl} shows the C implementation of the \CFA input generator with one additional field and the computed @goto@.
@@ 917 @@
- For contrast, Figure~\ref{f:PythonFormatter} shows the equivalent Python format generator with the same properties as the format generator.
+ For contrast, Figure~\ref{f:PythonFormatter} shows the equivalent Python format generator with the same properties as the \CFA format generator.

  % https://dl-acm-org.proxy.lib.uwaterloo.ca/

@@ 921-925 @@
- Figure~\ref{f:DeviceDriverGen} shows an important application for an asymmetric generator, a device-driver, because device drivers are a significant source of operating-system errors: 85\% in Windows XP~\cite[p.~78]{Swift05} and 51.6\% in Linux~\cite[p.~1358]{Xiao19}. %\cite{Palix11}
+ An important application for the asymmetric generator is a device-driver, because device drivers are a significant source of operating-system errors: 85\% in Windows XP~\cite[p.~78]{Swift05} and 51.6\% in Linux~\cite[p.~1358]{Xiao19}. %\cite{Palix11}
  Swift \etal~\cite[p.~86]{Swift05} restructure device drivers using the Extension Procedure Call (XPC) within the kernel via functions @nooks_driver_call@ and @nooks_kernel_call@, which have coroutine properties context switching to separate stacks with explicit hand-off calls;
  however, the calls do not retain execution state, and hence always start from the top.
  However, Adya \etal~\cite{Adya02} argue against stack ripping in Section 3.2 and suggest a hybrid approach in Section 4 using cooperatively scheduled \emph{fibers}, which is coroutining.
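The \CFA and Python format generators referenced in this hunk live in the paper's figures, not in the changeset. As a self-contained stand-in, a Python input generator with the same shape can group characters it receives via @send@ into fixed-size blocks; the 4-character/5-block sizes here are illustrative assumptions, not necessarily the figure's exact values:

```python
def fmt(out, group=4, blocks=5):
    """Input generator: each send() delivers one character, which is
    reformatted into groups of `group` chars, `blocks` groups per line."""
    while True:
        for _ in range(blocks):
            for _ in range(group):
                out.append((yield))   # suspend until the next char arrives
            out.append(' ')           # separator after a full group
        out.append('\n')              # newline after a full line

out = []
f = fmt(out)
next(f)                               # prime: run to the first yield
for ch in "abcdefghijklmnopqrstuvwxyz":
    f.send(ch)
print(''.join(out))
```

As in the paper's generators, the execution state (position within group and line) lives in the suspended frame rather than in explicit counters checked on every call.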
@@ 926-937 @@

- As an example, the following protocol:
+ Figure~\ref{f:DeviceDriverGen} shows the generator advantages in implementing a simple network device-driver with the following protocol:
  \begin{center}
  \ldots\, STX \ldots\, message \ldots\, ESC ETX \ldots\, message \ldots\, ETX 2-byte crc \ldots
  \end{center}
- is for a simple network message beginning with the control character STX, ending with an ETX, and followed by a 2-byte cyclic-redundancy check.
+ where the network message begins with the control character STX, ends with an ETX, and is followed by a 2-byte cyclic-redundancy check.
  Control characters may appear in a message if preceded by an ESC.
  When a message byte arrives, it triggers an interrupt, and the operating system services the interrupt by calling the device driver with the byte read from a hardware register.
- The device driver returns a status code of its current state, and when a complete message is obtained, the operating system read the message accumulated in the supplied buffer.
+ The device driver returns a status code of its current state, and when a complete message is obtained, the operating system reads the message accumulated in the supplied buffer.
  Hence, the device driver is an input/output generator, where the cost of resuming the device-driver generator is the same as call and return, so performance in an operating-system kernel is excellent.
  The key benefits of using a generator are correctness, safety, and maintenance because the execution states are transcribed directly into the programming language rather than table lookup or stack ripping.
- The conclusion is that FSMs are complex and occur in important domains, so direct generator support is important in a system programming language.
+ % The conclusion is that FSMs are complex and occur in important domains, so direct generator support is important in a system programming language.
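The STX/ESC/ETX protocol in this hunk can be sketched as a Python input generator, mirroring the FSM structure of the figure's driver. The 2-byte check is modelled as a plain 16-bit sum of the message bytes, an assumption made here because the CRC polynomial is not specified in this excerpt:

```python
STX, ETX, ESC = 0x02, 0x03, 0x1B

def driver(msgs):
    """Device-driver sketch as an input generator: each send() delivers one
    byte; complete, verified message bodies are appended to msgs."""
    while True:
        byte = yield
        if byte != STX:                  # scan for start of message
            continue
        msg = []
        while True:                      # accumulate message body
            byte = yield
            if byte == ESC:              # escape: take the next byte literally
                byte = yield
            elif byte == ETX:            # end of message: 2-byte check follows
                hi, lo = (yield), (yield)
                if (hi << 8 | lo) == sum(msg) & 0xFFFF:
                    msgs.append(bytes(msg))
                break
            msg.append(byte)

msgs = []
d = driver(msgs)
next(d)                                  # prime the generator
body = b"hello"
for b in bytes([STX]) + body + bytes([ETX]) + (sum(body)).to_bytes(2, "big"):
    d.send(b)
print(msgs)
```

Each protocol state is a position in the code, exactly the "execution states transcribed directly into the programming language" benefit named above; no state variable or transition table is needed.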
@@ 938-939 @@

  \begin{figure}

@@ 992-1002 @@
  \end{figure}

- Figure~\ref{f:CFAPingPongGen} shows a symmetric generator, where the generator resumes another generator, forming a resume/resume cycle.
+ Generators can also have symmetric activation using resume/resume to create control-flow cycles among generators.
  (The trivial cycle is a generator resuming itself.)
  This control flow is similar to recursion for functions but without stack growth.
- Figure~\ref{f:PingPongFullCoroutineSteps} shows the steps for symmetric control-flow are creating, executing, and terminating the cycle.
+ Figure~\ref{f:PingPongFullCoroutineSteps} shows the steps of symmetric control-flow for the ping/pong program in Figure~\ref{f:CFAPingPongGen}.
+ The program starts by creating the generators, @ping@ and @pong@, and then assigns the partners that form the cycle.
  Constructing the cycle must deal with definition-before-use to close the cycle, \ie, the first generator must know about the last generator, which is not within scope.
  (This issue occurs for any cyclic data structure.)
- The example creates the generators, @ping@ and @pong@, and then assigns the partners that form the cycle.
  % (Alternatively, the constructor can assign the partners as they are declared, except the first, and the first-generator partner is set after the last generator declaration to close the cycle.)
- Once the cycle is formed, the program main resumes one of the generators, @ping@, and the generators can then traverse an arbitrary cycle using @resume@ to activate partner generator(s).
+ Once the cycle is formed, the program main resumes one of the generators, @ping@, and the generators can then traverse an arbitrary number of cycles using @resume@ to activate partner generator(s).
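Python generators have no symmetric resume, so the closest sketch of the ping/pong cycle described in this hunk is a trampoline: each generator yields the partner that should be activated next, and a driver loop performs the activation. The names and repetition counts are illustrative, and the cycle is closed after creation, mirroring the definition-before-use issue noted above:

```python
partners = {}                       # filled in after creation to close the cycle

def player(name, n, log):
    """One member of the cycle: records its name n times, each time
    yielding the partner the trampoline should activate next."""
    while n > 0:
        log.append(name)
        n -= 1
        yield partners[name]        # "resume" the partner via the trampoline

log = []
ping = player("ping", 3, log)
pong = player("pong", 3, log)
partners["ping"] = pong             # definition-before-use: the cycle is
partners["pong"] = ping             # closed only after both generators exist

g = ping                            # program main "resumes" ping
try:
    while True:
        g = next(g)                 # trampoline drives the resume/resume cycle
except StopIteration:               # a member returning terminates the cycle
    pass
print(log)
```

Note the extra indirection: control always bounces through the trampoline frame, whereas the \CFA symmetric generator jumps directly from partner to partner.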
@@ 1003-1004 @@
  Terminating the cycle is accomplished by @suspend@ or @return@, both of which go back to the stack frame that started the cycle (program main in the example).
  Note, the creator and starter may be different, \eg if the creator calls another function that starts the cycle.

@@ 1006-1017 @@
  Also, since local variables are not retained in the generator function, there are no objects with destructors to be called, so the cost is the same as a function return.
  Destructor cost occurs when the generator instance is deallocated by the creator.

+ \begin{figure}
+ \centering
+ \input{FullCoroutinePhases.pstex_t}
+ \vspace*{-10pt}
+ \caption{Symmetric coroutine steps: Ping / Pong}
+ \label{f:PingPongFullCoroutineSteps}
+ \end{figure}
+
  \begin{figure}

@@ 1069-1080 @@
  \end{figure}

- \begin{figure}
- \centering
- \input{FullCoroutinePhases.pstex_t}
- \vspace*{-10pt}
- \caption{Symmetric coroutine steps: Ping / Pong}
- \label{f:PingPongFullCoroutineSteps}
- \end{figure}
-
  Figure~\ref{f:CPingPongSim} shows the C implementation of the \CFA symmetric generator, where there is still only one additional field, @restart@, but @resume@ is more complex because it does a forward rather than backward jump.
  Before the jump, the parameter for the next call @partner@ is placed into the register used for the first parameter, @rdi@, and the remaining registers are reset for a return.

@@ 1100-1103 @@
  \label{s:Coroutine}

- Stackful coroutines (Table~\ref{t:ExecutionPropertyComposition} case 5) extend generator semantics, \ie there is an implicit closure and @suspend@ may appear in a helper function called from the coroutine main.
- A coroutine is specified by replacing @generator@ with @coroutine@ for the type.
+ Stackful coroutines (Table~\ref{t:ExecutionPropertyComposition} case 5) extend generator semantics with an implicit closure, and @suspend@ may appear in a helper function called from the coroutine main because of the separate stack.
+ Note, simulating coroutines with stacks of generators, \eg Python with @yield from@, cannot handle symmetric control-flow.
+ Furthermore, all stack components must be generators, so it is impossible to call a library function passing a generator that yields.
+ Creating a generator copy of the library function may be impossible because the library function is opaque.
+
+ A \CFA coroutine is specified by replacing @generator@ with @coroutine@ for the type.
  Coroutine generality results in higher cost for creation, due to dynamic stack allocation, for execution, due to context switching among stacks, and for terminating, due to possible stack unwinding and dynamic stack deallocation.
  A series of different kinds of coroutines and their implementations demonstrate how coroutines extend generators.

  First, the previous generator examples are converted to their coroutine counterparts, allowing local-state variables to be moved from the generator type into the coroutine main.
+ Now the coroutine type only contains communication variables between interface functions and the coroutine main.
  \begin{center}
  \begin{tabular}{@{}l|l|l|l@{}}

@@ 1147-1156 @@
  }
  \end{cfa}
- A call to this function is placed at the end of the driver's coroutine-main.
+ A call to this function is placed at the end of the device driver's coroutine-main.
  For complex finite-state machines, refactoring is part of normal program abstraction, especially when code is used in multiple places.
  Again, this complexity is usually associated with execution state rather than data state.
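The @yield from@ restriction noted in this hunk can be made concrete in Python: a helper may suspend on behalf of its caller only when every intervening frame is itself a generator and explicitly delegates, which is exactly the "all stack components must be generators" constraint:

```python
def helper(n):
    # The helper suspends for its caller, but only because the caller
    # delegates with `yield from`; an ordinary function call could not
    # suspend the caller's frame.
    for i in range(n):
        yield i

def coro_main():
    yield "start"
    yield from helper(3)   # every frame in the chain must be a generator
    yield "done"

values = list(coro_main())
print(values)
```

A stackful coroutine removes this constraint: the helper would be a plain function on the coroutine's stack and could still suspend the whole stack.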
@@ 1478-1486 @@
  The \CFA @dtype@ property provides no \emph{implicit} copying operations and the @is_coroutine@ trait provides no \emph{explicit} copying operations, so all coroutines must be passed by reference or pointer.
  The function definitions ensure there is a statically typed @main@ function that is the starting point (first stack frame) of a coroutine, and a mechanism to read the coroutine descriptor from its handle.
- The @main@ function has no return value or additional parameters because the coroutine type allows an arbitrary number of interface functions with corresponding arbitrary typed input and output values versus fixed ones.
+ The @main@ function has no return value or additional parameters because the coroutine type allows an arbitrary number of interface functions with arbitrary typed input and output values versus fixed ones.
  The advantage of this approach is that users can easily create different types of coroutines, \eg changing the memory layout of a coroutine is trivial when implementing the @get_coroutine@ function, and possibly redefining \textsf{suspend} and @resume@.

@@ 1523-1534 @@
  The coroutine descriptor contains all implicit declarations needed by the runtime, \eg @suspend@/@resume@, and can be part of the coroutine handle or separate.
  The coroutine stack can appear in a number of locations and be fixed or variable sized.
- Hence, the coroutine's stack could be a variable-length structure (VLS)\footnote{
- We are examining VLSs, where fields can be variable-sized structures or arrays.
- Once allocated, a VLS is fixed sized.}
+ Hence, the coroutine's stack could be a variable-length structure (VLS)
+ % \footnote{
+ % We are examining VLSs, where fields can be variable-sized structures or arrays.
+ % Once allocated, a VLS is fixed sized.}
  on the allocating stack, provided the allocating stack is large enough.
  For a VLS stack allocation and deallocation is an inexpensive adjustment of the stack pointer, modulo any stack constructor costs to initial frame setup.

@@ 1570-1580 @@
  \label{s:threads}

- Threading (Table~\ref{t:ExecutionPropertyComposition} case 11) needs the ability to start a thread and wait for its completion.
- A common API for this ability is @fork@ and @join@.
+ Threading (Table~\ref{t:ExecutionPropertyComposition} case 11) needs the ability to start a thread and wait for its completion, where a common API is @fork@ and @join@.
  \vspace{4pt}
  \par\noindent

@@ 1725-1735 @@
  For these reasons, \CFA selected monitors as the core high-level concurrency construct, upon which higher-level approaches can be easily constructed.

- Figure~\ref{f:AtomicCounter} compares a \CFA and Java monitor implementing an atomic counter.\footnote{
- Like other concurrent programming languages, \CFA and Java have performant specializations for the basic types using atomic instructions.}
+ Figure~\ref{f:AtomicCounter} compares a \CFA and Java monitor implementing an atomic counter.
+ (Like other concurrent programming languages, \CFA and Java have performant specializations for the basic types using atomic instructions.)
  A \newterm{monitor} is a set of functions that ensure mutual exclusion when accessing shared state.
  (Note, in \CFA, @monitor@ is short-hand for @mutex struct@.)

@@ 2986-2993 @@
  The total time is divided by @N@ to obtain the average time for a benchmark.
  Each benchmark experiment is run 13 times and the average appears in the table.
- All omitted tests for other languages are functionally identical to the \CFA tests and available online~\cite{CforallBenchMarks}.
+ All omitted tests for other languages are functionally identical to the \CFA tests and available online~\cite{CforallConcurrentBenchmarks}.
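The fork/join and monitor notions in these hunks compose naturally; as an illustrative stand-in for the \CFA/Java atomic-counter figure (the class and method names here are assumptions, not the paper's interface), Python threads can fork, increment a monitor-style counter, and join:

```python
import threading

class Counter:
    """Monitor-style counter: every method acquires the object's lock,
    giving mutual exclusion on the shared state."""
    def __init__(self):
        self._lock = threading.Lock()
        self._value = 0
    def inc(self):
        with self._lock:
            self._value += 1
    def get(self):
        with self._lock:
            return self._value

counter = Counter()
def work():
    for _ in range(1000):
        counter.inc()

threads = [threading.Thread(target=work) for _ in range(4)]
for t in threads: t.start()   # fork
for t in threads: t.join()    # join: wait for completion
print(counter.get())
```

The lock-per-object discipline is the monitor idea in miniature: callers never touch @_value@ directly, so mutual exclusion is bundled with the data rather than scattered across call sites.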
  % tar --exclude-ignore=exclude -cvhf benchmark.tar benchmark
+ % cp -p benchmark.tar /u/cforall/public_html/doc/concurrent_benchmark.tar

  \paragraph{Creation}

@@ 3028-3038 @@
  \uC thread & 523.4 & 523.9 & 7.7 \\
  Python generator & 123.2 & 124.3 & 4.1 \\
- Node.js generator & 32.3 & 32.2 & 0.3 \\
+ Node.js generator & 33.4 & 33.5 & 0.3 \\
  Goroutine thread & 751.0 & 750.5 & 3.1 \\
  Rust tokio thread & 1860.0 & 1881.1 & 37.6 \\

@@ 3233-3243 @@
  Python generator & 40.9 & 41.3 & 1.5 \\
  Node.js await & 1852.2 & 1854.7 & 16.4 \\
- Node.js generator & 32.6 & 32.2 & 1.0 \\
+ Node.js generator & 33.3 & 33.4 & 0.3 \\
  Goroutine thread & 143.0 & 143.3 & 1.1 \\
  Rust async await & 32.0 & 32.0 & 0.0 \\

@@ 3271-3282 @@
  While control flow in \CFA has a strong start, development is still underway to complete a number of missing features.

- \vspace{-5pt}
- \paragraph{Flexible Scheduling}
- \label{futur:sched}
-
+ \medskip
+ \textbf{Flexible Scheduling:}
  An important part of concurrency is scheduling.
  Different scheduling algorithms can affect performance, both in terms of average and variation.

@@ 3282-3296 @@
  Currently, the \CFA pluggable scheduler is too simple to handle complex scheduling, \eg quality of service and real-time, where the scheduler must interact with mutex objects to deal with issues like priority inversion~\cite{Buhr00b}.

- \vspace{-5pt}
- \paragraph{Non-Blocking I/O}
- \label{futur:nbio}
+ \smallskip
+ \textbf{Non-Blocking I/O:}
  Many modern workloads are not bound by computation but IO operations, common cases being web servers and XaaS~\cite{XaaS} (anything as a service).
  These types of workloads require significant engineering to amortizing costs of blocking IO-operations.

  A non-blocking I/O library is currently under development for \CFA.
@@ 3295-3299 @@

- \vspace{-5pt}
- \paragraph{Other Concurrency Tools}
- \label{futur:tools}
-
+ \smallskip
+ \textbf{Other Concurrency Tools:}
  While monitors offer flexible and powerful concurrency for \CFA, other concurrency tools are also necessary for a complete multi-paradigm concurrency package.
  Examples of such tools can include futures and promises~\cite{promises}, executors and actors.

@@ 3302-3309 @@
  As well, new \CFA extensions should make it possible to create a uniform interface for virtually all mutual exclusion, including monitors and low-level locks.

- \vspace{-5pt}
- \paragraph{Implicit Threading}
- \label{futur:implcit}
-
- Basic \emph{embarrassingly parallel} applications can benefit greatly from implicit concurrency, where sequential programs are converted to concurrent, possibly with some help from pragmas to guide the conversion.
+ \smallskip
+ \textbf{Implicit Threading:}
+ Basic \emph{embarrassingly parallel} applications can benefit greatly from implicit concurrency, where sequential programs are converted to concurrent, with some help from pragmas to guide the conversion.
  This type of concurrency can be achieved both at the language level and at the library level.
  The canonical example of implicit concurrency is concurrent nested @for@ loops, which are amenable to divide and conquer algorithms~\cite{uC++book}.
doc/papers/concurrency/response2
r9246ec6 → r27125d0

@@ 113-127 @@
  OOPSLA 2017.

- Of our testing languages, only Java is JITTED. To ensure the Java test-programs
- correctly measured the specific feature, we consulted with Dave Dice at Oracle
- who works directly on the development of the Oracle JVM Just-in-Time
- Compiler. We modified our test programs based on his advise, and he validated
- our programs as correctly measuring the specified language feature. Hence, we
- have taken into account all issues related to performing benchmarks in JITTED
- languages. Dave's help is recognized in the Acknowledgment section. Also, all
- the benchmark programs are publicly available for independent verification.
+ Of our testing languages, only Java and Node.js are JITTED. To ensure the Java
+ test-programs correctly measured the specific feature, we consulted with Dave
+ Dice at Oracle who works directly on the development of the Oracle JVM
+ Just-in-Time Compiler. We modified our test programs based on his advice, and
+ he validated our programs as correctly measuring the specified language
+ feature. Hence, we have taken into account all issues related to performing
+ benchmarks in JITTED languages. Dave's help is recognized in the
+ Acknowledgment section. Also, all the benchmark programs are publicly available
+ for independent verification.
+
+ Similarly, we verified our Node.js programs with Gregor Richards, an expert in
+ just-in-time compilation for dynamic typing.

@@ 286-294 @@
  critical section); threads can arrive at any time in any order, where the
  mutual exclusion for the critical section ensures one thread executes at a
- time. Interestingly, Reed & Kanodia's mutual exclusion is Habermann's
- communication not mutual exclusion. These papers only buttress our contention
+ time. Interestingly, Reed & Kanodia's critical region is Habermann's
+ communication not critical section. These papers only buttress our contention
  about the confusion of these terms in the literature.

@@ 561-573 @@
  I think we may be differing on the meaning of stack. You may be imagining a
- modern stack that grows and shrink dynamically. Whereas early Fortran
+ modern stack that grows and shrinks dynamically. Whereas early Fortran
  preallocated a stack frame for each function, like Python allocates a frame for
  a generator. Within each preallocated Fortran function is a frame for local
  variables and a pointer to store the return value for a call. The Fortran
- call/return mechanism than uses these frames to build a traditional call stack
+ call/return mechanism uses these frames to build a traditional call stack
  linked by the return pointer. The only restriction is that a function stack
  frame can only be used once, implying no direct or indirect recursion. Hence,

@@ 631-643 @@
  build coroutines by stacks of generators invoking one another.

- As we point out, coroutines built from stacks of generators have problems, such
- as no symmetric control-flow. Furthermore, stacks of generators have a problem
- with the following programming pattern. logger is a library function called
- from a function or a coroutine, where the doit for the coroutine suspends. With
- stacks of generators, there has to be a function and generator version of
- logger to support these two scenarios. If logger is a library function, it may
- be impossible to create the generator logger because the logger function is
- opaque.
+ Coroutines built from stacks of generators have problems, such as no symmetric
+ control-flow. Furthermore, stacks of generators have a problem with the
+ following programming pattern. logger is a library function called from a
+ function or a coroutine, where the doit for the coroutine suspends.
+ With stacks of generators, there has to be a function and generator version of
+ logger to support these two scenarios. If logger is a library function, it may
+ be impossible to create the generator logger because the logger function is
+ opaque.

@@ 668-676 @@
  }

+ Additional text has been added to the start of Section 3.2 to address this
+ comment.

@@ 681-699 @@
  * prefer generators for simple computations that yield up many values,

- This description does not cover output or matching generators that do not yield
- many or any values. For example, the output generator Fmt yields no value; the
- device driver yields a value occasionally once a message is found. Furthermore,
- real device drivers are not simple; there can have hundreds of states and
- transitions. Imagine the complex event-engine for a web-server written as a
- generator.
-
  * prefer coroutines for more complex processes that have significant
  internal structure,

- As for generators, complexity is not the criterion for selection. A coroutine
- brings generality to the implementation because of the addition stack, whereas
- generators have restrictions on standard software-engining practises: variable
- placement, no helper functions without creating an explicit generator stack,
- and no symmetric control-flow. Whereas, the producer/consumer example in Figure
- 7 uses stack variable placement, helpers, and simple ping/pong-style symmetric
- control-flow.
+ We do not believe the number of values yielded is an important factor in
+ choosing between a generator or coroutine; either form can receive (input) or
+ return (output) millions of values. As well, simple versus complex computations
+ is also not a criterion for selection as both forms can be very
+ sophisticated.
+ As stated in the paper, a coroutine brings generality to the
+ implementation because of the additional stack, whereas generators have
+ restrictions on standard software-engineering practices: variable placement, no
+ helper functions without creating an explicit generator stack, and no symmetric
+ control-flow.

@@ 720-727 @@
  in purpose.

- Given that you asked about this it before, I believe other readers might also
- ask the same question because async-await is very popular. So I think this
- section does help to position the work in the paper among other work, and
- hence, it is appropriate to keep it in the paper.
+ Given that you asked about this before, I believe other readers might also ask
+ the same question because async-await is very popular. So I think this section
+ does help to position the work in the paper among other work, and hence, it is
+ appropriate to keep it in the paper.

@@ 966-971 @@
  To handle the other 5% of the cases, there is a trivial Cforall pattern
- providing Java-style start/join. The additional cost for this pattern is 2
- light-weight thread context-switches.
+ providing Java-style start/join. The additional cost for this pattern is small
+ in comparison to the work performed between the start and join.

  thread T {};