- File:
-
- 1 edited
-
doc/papers/concurrency/Paper.tex (modified) (77 diffs)
Legend:
- Unmodified
- Added
- Removed
-
doc/papers/concurrency/Paper.tex
r6d43cc57 r251454a0 56 56 \newcommand{\Textbf}[2][red]{{\color{#1}{\textbf{#2}}}} 57 57 \newcommand{\Emph}[2][red]{{\color{#1}\textbf{\emph{#2}}}} 58 \newcommand{\R}[1]{\Textbf{#1}} 59 \newcommand{\B}[1]{{\Textbf[blue]{#1}}} 60 \newcommand{\G}[1]{{\Textbf[OliveGreen]{#1}}} 58 61 \newcommand{\uC}{$\mu$\CC} 59 \newcommand{\TODO}[1]{{\Textbf{#1}}} 62 \newcommand{\cit}{\textsuperscript{[Citation Needed]}\xspace} 63 \newcommand{\TODO}{{\Textbf{TODO}}} 60 64 61 65 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% … … 254 258 \section{Introduction} 255 259 256 This paper provides a minimal concurrency \newterm{A pplicationProgram Interface} (API) that is simple, efficient and can be used to build other concurrency features.260 This paper provides a minimal concurrency \newterm{Abstract Program Interface} (API) that is simple, efficient and can be used to build other concurrency features. 257 261 While the simplest concurrency system is a thread and a lock, this low-level approach is hard to master. 258 262 An easier approach for programmers is to support higher-level constructs as the basis of concurrency. … … 271 275 Hence, there are two problems to be solved: concurrency and parallelism. 272 276 While these two concepts are often combined, they are distinct, requiring different tools~\cite[\S~2]{Buhr05a}. 273 Concurrency tools handle mutual exclusion and synchronization, while parallelism tools handle performance, cost,and resource utilization.277 Concurrency tools handle synchronization and mutual exclusion, while parallelism tools handle performance, cost and resource utilization. 274 278 275 279 The proposed concurrency API is implemented in a dialect of C, called \CFA. … … 282 286 Extended versions and explanation of the following code examples are available at the \CFA website~\cite{Cforall} or in Moss~\etal~\cite{Moss18}. 283 287 284 \CFA is a non-object-orientedextension of ISO-C, and hence, supports all C paradigms.288 \CFA is an extension of ISO-C, and hence, supports all C paradigms. 285 289 %It is a non-object-oriented system-language, meaning most of the major abstractions have either no runtime overhead or can be opted out easily. 286 Like C, the b uilding blocks of \CFA are structures and routines.290 Like C, the basics of \CFA revolve around structures and functions. 287 291 Virtually all of the code generated by the \CFA translator respects C memory layouts and calling conventions. 288 292 While \CFA is not an object-oriented language, lacking the concept of a receiver (\eg @this@) and nominal inheritance-relationships, C does have a notion of objects: ``region of data storage in the execution environment, the contents of which can represent values''~\cite[3.15]{C11}. … … 296 300 int x = 1, y = 2, z = 3; 297 301 int * p1 = &x, ** p2 = &p1, *** p3 = &p2, $\C{// pointers to x}$ 298 `&` r1 = x, `&&` r2 = r1,`&&&` r3 = r2; $\C{// references to x}$302 `&` r1 = x, `&&` r2 = r1, `&&&` r3 = r2; $\C{// references to x}$ 299 303 int * p4 = &z, `&` r4 = z; 300 304 … … 345 349 'with' '(' $\emph{expression-list}$ ')' $\emph{compound-statement}$ 346 350 \end{cfa} 347 and may appear as the body of a routine or nested within a routinebody.351 and may appear as the body of a function or nested within a function body. 348 352 Each expression in the expression-list provides a type and object. 349 353 The type must be an aggregate type. … … 356 360 357 361 \CFA maximizes the ability to reuse names via overloading to aggressively address the naming problem. 358 Both variables and routines may be overloaded, where selection is based on types, and number of returns (as in Ada~\cite{Ada}) and arguments.362 Both variables and functions may be overloaded, where selection is based on types, and number of returns (as in Ada~\cite{Ada}) and arguments. 359 363 \begin{cquote} 360 364 \vspace*{-\baselineskip}%??? … … 411 415 \end{cquote} 412 416 Overloading is important for \CFA concurrency since the runtime system relies on creating different types to represent concurrency objects. 413 Therefore, overloading eliminates long prefixes and other naming conventions to prevent name clashes. 414 As seen in Section~\ref{basics}, routine @main@ is heavily overloaded. 415 For example, variable overloading is useful in the parallel semantics of the @with@ statement for fields with the same name: 417 Therefore, overloading is necessary to prevent the need for long prefixes and other naming conventions to prevent name clashes. 418 As seen in Section~\ref{basics}, function @main@ is heavily overloaded. 419 420 Variable overloading is useful in the parallel semantics of the @with@ statement for fields with the same name: 416 421 \begin{cfa} 417 422 struct S { int `i`; int j; double m; } s; … … 427 432 } 428 433 \end{cfa} 429 For parallel semantics, both @s.i@ and @t.i@ are visible withthe same type, so only @i@ is ambiguous without qualification.434 For parallel semantics, both @s.i@ and @t.i@ are visible the same type, so only @i@ is ambiguous without qualification. 430 435 431 436 … … 433 438 434 439 Overloading also extends to operators. 435 Operator-overloading syntax creates a routine name with anoperator symbol and question marks for the operands:440 Operator-overloading syntax names a routine with the operator symbol and question marks for the operands: 436 441 \begin{cquote} 437 442 \lstDeleteShortInline@% … … 467 472 \end{cquote} 468 473 While concurrency does not use operator overloading directly, it provides an introduction for the syntax of constructors. 474 475 476 \subsection{Parametric Polymorphism} 477 \label{s:ParametricPolymorphism} 478 479 The signature feature of \CFA is parametric-polymorphic functions~\cite{} with functions generalized using a @forall@ clause (giving the language its name), which allow separately compiled routines to support generic usage over multiple types. 480 For example, the following sum function works for any type that supports construction from 0 and addition: 481 \begin{cfa} 482 forall( otype T | { void `?{}`( T *, zero_t ); T `?+?`( T, T ); } ) // constraint type, 0 and + 483 T sum( T a[$\,$], size_t size ) { 484 `T` total = { `0` }; $\C{// initialize by 0 constructor}$ 485 for ( size_t i = 0; i < size; i += 1 ) 486 total = total `+` a[i]; $\C{// select appropriate +}$ 487 return total; 488 } 489 S sa[5]; 490 int i = sum( sa, 5 ); $\C{// use S's 0 construction and +}$ 491 \end{cfa} 492 493 \CFA provides \newterm{traits} to name a group of type assertions, where the trait name allows specifying the same set of assertions in multiple locations, preventing repetition mistakes at each function declaration: 494 \begin{cfa} 495 trait `sumable`( otype T ) { 496 void `?{}`( T &, zero_t ); $\C{// 0 literal constructor}$ 497 T `?+?`( T, T ); $\C{// assortment of additions}$ 498 T ?+=?( T &, T ); 499 T ++?( T & ); 500 T ?++( T & ); 501 }; 502 forall( otype T `| sumable( T )` ) $\C{// use trait}$ 503 T sum( T a[$\,$], size_t size ); 504 \end{cfa} 505 506 Assertions can be @otype@ or @dtype@. 507 @otype@ refers to a ``complete'' object, \ie an object has a size, default constructor, copy constructor, destructor and an assignment operator. 508 @dtype@ only guarantees an object has a size and alignment. 509 510 Using the return type for discrimination, it is possible to write a type-safe @alloc@ based on the C @malloc@: 511 \begin{cfa} 512 forall( dtype T | sized(T) ) T * alloc( void ) { return (T *)malloc( sizeof(T) ); } 513 int * ip = alloc(); $\C{// select type and size from left-hand side}$ 514 double * dp = alloc(); 515 struct S {...} * sp = alloc(); 516 \end{cfa} 517 where the return type supplies the type/size of the allocation, which is impossible in most type systems. 469 518 470 519 … … 495 544 \CFA also provides @new@ and @delete@, which behave like @malloc@ and @free@, in addition to constructing and destructing objects: 496 545 \begin{cfa} 497 { 498 ... struct S s = {10}; ... $\C{// allocation, call constructor}$546 { struct S s = {10}; $\C{// allocation, call constructor}$ 547 ... 499 548 } $\C{// deallocation, call destructor}$ 500 549 struct S * s = new(); $\C{// allocation, call constructor}$ … … 502 551 delete( s ); $\C{// deallocation, call destructor}$ 503 552 \end{cfa} 504 \CFA concurrency uses object lifetime as a means of mutual exclusion and/or synchronization. 505 506 507 \subsection{Parametric Polymorphism} 508 \label{s:ParametricPolymorphism} 509 510 The signature feature of \CFA is parametric-polymorphic routines~\cite{} with routines generalized using a @forall@ clause (giving the language its name), which allow separately compiled routines to support generic usage over multiple types. 511 For example, the following sum routine works for any type that supports construction from 0 and addition: 512 \begin{cfa} 513 forall( otype T | { void `?{}`( T *, zero_t ); T `?+?`( T, T ); } ) // constraint type, 0 and + 514 T sum( T a[$\,$], size_t size ) { 515 `T` total = { `0` }; $\C{// initialize by 0 constructor}$ 516 for ( size_t i = 0; i < size; i += 1 ) 517 total = total `+` a[i]; $\C{// select appropriate +}$ 518 return total; 519 } 520 S sa[5]; 521 int i = sum( sa, 5 ); $\C{// use S's 0 construction and +}$ 522 \end{cfa} 523 The builtin type @zero_t@ (and @one_t@) overload constant 0 (and 1) for a new types, where both 0 and 1 have special meaning in C. 524 525 \CFA provides \newterm{traits} to name a group of type assertions, where the trait name allows specifying the same set of assertions in multiple locations, preventing repetition mistakes at each routine declaration: 526 \begin{cfa} 527 trait `sumable`( otype T ) { 528 void `?{}`( T &, zero_t ); $\C{// 0 literal constructor}$ 529 T `?+?`( T, T ); $\C{// assortment of additions}$ 530 T ?+=?( T &, T ); 531 T ++?( T & ); 532 T ?++( T & ); 533 }; 534 forall( otype T `| sumable( T )` ) $\C{// use trait}$ 535 T sum( T a[$\,$], size_t size ); 536 \end{cfa} 537 538 Assertions can be @otype@ or @dtype@. 539 @otype@ refers to a ``complete'' object, \ie an object has a size, default constructor, copy constructor, destructor and an assignment operator. 540 @dtype@ only guarantees an object has a size and alignment. 541 542 Using the return type for discrimination, it is possible to write a type-safe @alloc@ based on the C @malloc@: 543 \begin{cfa} 544 forall( dtype T | sized(T) ) T * alloc( void ) { return (T *)malloc( sizeof(T) ); } 545 int * ip = alloc(); $\C{// select type and size from left-hand side}$ 546 double * dp = alloc(); 547 struct S {...} * sp = alloc(); 548 \end{cfa} 549 where the return type supplies the type/size of the allocation, which is impossible in most type systems. 553 \CFA concurrency uses object lifetime as a means of synchronization and/or mutual exclusion. 550 554 551 555 … … 580 584 \subsection{\protect\CFA's Thread Building Blocks} 581 585 582 An important missing feature in C is threading\footnote{While the C11 standard defines a \protect\lstinline@threads.h@header, it is minimal and defined as optional.586 An important missing feature in C is threading\footnote{While the C11 standard defines a ``threads.h'' header, it is minimal and defined as optional. 583 587 As such, library support for threading is far from widespread. 584 At the time of writing the paper, neither \protect\lstinline @gcc@ nor \protect\lstinline@clang@ support \protect\lstinline@threads.h@in their standard libraries.}.585 In modern programming languages, a lack of threading is unacceptable~\cite{Sutter05, Sutter05b}, and therefore existing and new programming languages must have tools for writing efficient concurrent programs to take advantage of parallelism.588 At the time of writing the paper, neither \protect\lstinline|gcc| nor \protect\lstinline|clang| support ``threads.h'' in their standard libraries.}. 589 On modern architectures, a lack of threading is unacceptable~\cite{Sutter05, Sutter05b}, and therefore existing and new programming languages must have tools for writing efficient concurrent programs to take advantage of parallelism. 586 590 As an extension of C, \CFA needs to express these concepts in a way that is as natural as possible to programmers familiar with imperative languages. 587 591 Furthermore, because C is a system-level language, programmers expect to choose precisely which features they need and which cost they are willing to pay. … … 621 625 \newbox\myboxA 622 626 \begin{lrbox}{\myboxA} 623 \begin{ cfa}[aboveskip=0pt,belowskip=0pt]627 \begin{lstlisting}[aboveskip=0pt,belowskip=0pt] 624 628 `int f1, f2, state = 1;` // single global variables 625 629 int fib() { … … 638 642 } 639 643 } 640 \end{ cfa}644 \end{lstlisting} 641 645 \end{lrbox} 642 646 643 647 \newbox\myboxB 644 648 \begin{lrbox}{\myboxB} 645 \begin{ cfa}[aboveskip=0pt,belowskip=0pt]649 \begin{lstlisting}[aboveskip=0pt,belowskip=0pt] 646 650 #define FIB_INIT `{ 0, 1 }` 647 651 typedef struct { int f2, f1; } Fib; … … 660 664 } 661 665 } 662 \end{ cfa}666 \end{lstlisting} 663 667 \end{lrbox} 664 668 … … 673 677 \newbox\myboxA 674 678 \begin{lrbox}{\myboxA} 675 \begin{ cfa}[aboveskip=0pt,belowskip=0pt]679 \begin{lstlisting}[aboveskip=0pt,belowskip=0pt] 676 680 `coroutine` Fib { int fn; }; 677 681 void main( Fib & fib ) with( fib ) { … … 693 697 } 694 698 } 695 \end{ cfa}699 \end{lstlisting} 696 700 \end{lrbox} 697 701 \newbox\myboxB 698 702 \begin{lrbox}{\myboxB} 699 \begin{ cfa}[aboveskip=0pt,belowskip=0pt]703 \begin{lstlisting}[aboveskip=0pt,belowskip=0pt] 700 704 `coroutine` Fib { int ret; }; 701 705 void main( Fib & f ) with( fib ) { … … 717 721 718 722 719 \end{ cfa}723 \end{lstlisting} 720 724 \end{lrbox} 721 725 \subfloat[3 States, internal variables]{\label{f:Coroutine3States}\usebox\myboxA} … … 727 731 728 732 Using a coroutine, it is possible to express the Fibonacci formula directly without any of the C problems. 729 Figure~\ref{f:Coroutine3States} creates a @coroutine@ type, @`coroutine` Fib { int fn; }@, which provides communication, @fn@, for the \newterm{coroutine main}, @main@, which runs on the coroutine stack, and possibly multiple interface routines, \eg @next@. 733 Figure~\ref{f:Coroutine3States} creates a @coroutine@ type: 734 \begin{cfa} 735 `coroutine` Fib { int fn; }; 736 \end{cfa} 737 which provides communication, @fn@, for the \newterm{coroutine main}, @main@, which runs on the coroutine stack, and possibly multiple interface functions, @next@. 730 738 Like the structure in Figure~\ref{f:ExternalState}, the coroutine type allows multiple instances, where instances of this type are passed to the (overloaded) coroutine main. 731 The coroutine main's stack holds the state for the next generation, @f1@ and @f2@, and the code has the three suspend points, representing the three states in the Fibonacci formula, to context switch back to the caller's @resume@.732 The interface routine@next@, takes a Fibonacci instance and context switches to it using @resume@;739 The coroutine main's stack holds the state for the next generation, @f1@ and @f2@, and the code has the three suspend points, representing the three states in the Fibonacci formula, to context switch back to the caller's resume. 740 The interface function, @next@, takes a Fibonacci instance and context switches to it using @resume@; 733 741 on restart, the Fibonacci field, @fn@, contains the next value in the sequence, which is returned. 734 742 The first @resume@ is special because it cocalls the coroutine at its coroutine main and allocates the stack; … … 761 769 \newbox\myboxA 762 770 \begin{lrbox}{\myboxA} 763 \begin{ cfa}[aboveskip=0pt,belowskip=0pt]771 \begin{lstlisting}[aboveskip=0pt,belowskip=0pt] 764 772 `coroutine` Format { 765 773 char ch; // used for communication … … 793 801 } 794 802 } 795 \end{ cfa}803 \end{lstlisting} 796 804 \end{lrbox} 797 805 798 806 \newbox\myboxB 799 807 \begin{lrbox}{\myboxB} 800 \begin{ cfa}[aboveskip=0pt,belowskip=0pt]808 \begin{lstlisting}[aboveskip=0pt,belowskip=0pt] 801 809 struct Format { 802 810 char ch; … … 830 838 format( &fmt ); 831 839 } 832 \end{ cfa}840 \end{lstlisting} 833 841 \end{lrbox} 834 842 \subfloat[\CFA Coroutine]{\label{f:CFAFmt}\usebox\myboxA} … … 839 847 \end{figure} 840 848 841 The previous examples are \newterm{asymmetric (semi) coroutine}s because one coroutine always calls a resuming routine for another coroutine, and the resumed coroutine always suspends back to its last resumer, similar to call/return for normal routines.842 However, @resume@/@suspend@ context switch to existing stack-frames rather than create new ones so there is no stack growth.843 \newterm{Symmetric (full) coroutine}s have a coroutine call a resuming routinefor another coroutine, which eventually forms a resuming-call cycle.849 The previous examples are \newterm{asymmetric (semi) coroutine}s because one coroutine always calls a resuming function for another coroutine, and the resumed coroutine always suspends back to its last resumer, similar to call/return for normal functions. 850 However, there is no stack growth because @resume@/@suspend@ context switch to existing stack-frames rather than create new ones. 851 \newterm{Symmetric (full) coroutine}s have a coroutine call a resuming function for another coroutine, which eventually forms a resuming-call cycle. 844 852 (The trivial cycle is a coroutine resuming itself.) 845 853 This control flow is similar to recursion for normal routines, but again there is no stack growth from the context switch. … … 923 931 Figure~\ref{f:ProdCons} shows a producer/consumer symmetric-coroutine performing bi-directional communication. 924 932 Since the solution involves a full-coroutining cycle, the program main creates one coroutine in isolation, passes this coroutine to its partner, and closes the cycle at the call to @start@. 925 The @start@ routinecommunicates both the number of elements to be produced and the consumer into the producer's coroutine structure.933 The @start@ function communicates both the number of elements to be produced and the consumer into the producer's coroutine structure. 926 934 Then the @resume@ to @prod@ creates @prod@'s stack with a frame for @prod@'s coroutine main at the top, and context switches to it. 927 935 @prod@'s coroutine main starts, creates local variables that are retained between coroutine activations, and executes $N$ iterations, each generating two random values, calling the consumer to deliver the values, and printing the status returned from the consumer. … … 929 937 The producer call to @delivery@ transfers values into the consumer's communication variables, resumes the consumer, and returns the consumer status. 930 938 For the first resume, @cons@'s stack is initialized, creating local variables retained between subsequent activations of the coroutine. 931 The consumer iterates until the @done@ flag is set, prints the values delivered by the producer, increments status, and calls back to the producer via @payment@, and on return from @payment@, prints the receipt from the producer and increments @money@ (inflation).939 The consumer iterates until the @done@ flag is set, prints, increments status, and calls back to the producer via @payment@, and on return from @payment@, prints the receipt from the producer and increments @money@ (inflation). 932 940 The call from the consumer to the @payment@ introduces the cycle between producer and consumer. 933 941 When @payment@ is called, the consumer copies values into the producer's communication variable and a resume is executed. … … 959 967 \end{cfa} 960 968 and the programming language (and possibly its tool set, \eg debugger) may need to understand @baseCoroutine@ because of the stack. 961 Furthermore, the execution of constructs/destructors is in the wrong order for certain operations .962 For example, for threadsif the thread is implicitly started, it must start \emph{after} all constructors, because the thread relies on a completely initialized object, but the inherited constructor runs \emph{before} the derived.969 Furthermore, the execution of constructs/destructors is in the wrong order for certain operations, \eg for threads; 970 \eg, if the thread is implicitly started, it must start \emph{after} all constructors, because the thread relies on a completely initialized object, but the inherited constructor runs \emph{before} the derived. 963 971 964 972 An alternatively is composition: … … 972 980 However, there is nothing preventing wrong placement or multiple declarations. 973 981 974 For coroutines as for threads, many implementations are based on routine pointers or routineobjects~\cite{Butenhof97, C++14, MS:VisualC++, BoostCoroutines15}.982 For coroutines as for threads, many implementations are based on routine pointers or function objects~\cite{Butenhof97, C++14, MS:VisualC++, BoostCoroutines15}. 975 983 For example, Boost implements coroutines in terms of four functor object-types: 976 984 \begin{cfa} … … 980 988 symmetric_coroutine<>::yield_type 981 989 \end{cfa} 982 Similarly, the canonical threading paradigm is often based on routine pointers, \eg @pthreads@~\cite{pthreads}, \Csharp~\cite{Csharp}, Go~\cite{Go}, and Scala~\cite{Scala}.990 Similarly, the canonical threading paradigm is often based on function pointers, \eg @pthread@~\cite{pthreads}, \Csharp~\cite{Csharp}, Go~\cite{Go}, and Scala~\cite{Scala}. 983 991 However, the generic thread-handle (identifier) is limited (few operations), unless it is wrapped in a custom type. 984 992 \begin{cfa} … … 994 1002 \end{cfa} 995 1003 Since the custom type is simple to write in \CFA and solves several issues, added support for routine/lambda-based coroutines adds very little. 996 997 Note, the type @coroutine_t@ must be an abstract handle to the coroutine, because the coroutine descriptor and its stack are non-copyable.998 Copying the coroutine descriptor results in copies being out of date with the current state of the stack.999 Correspondingly, copying the stack results is copies being out of date with the coroutine descriptor, and pointers in the stack being out of date to data on the stack.1000 (There is no mechanism in C to find all stack-specific pointers and update them as part of a copy.)1001 1004 1002 1005 The selected approach is to use language support by introducing a new kind of aggregate (structure): … … 1011 1014 Furthermore, implementing coroutines without language supports also displays the power of a programming language. 1012 1015 While this is ultimately the option used for idiomatic \CFA code, coroutines and threads can still be constructed without using the language support. 1013 The reserved keyword simplyeases use for the common cases.1014 1015 Part of the mechanism to generalize coroutines is using a \CFA trait, which defines a coroutine as anything satisfying the trait @is_coroutine@, and this trait is used to restrict coroutine-manipulation routines:1016 \begin{cfa} 1017 trait is_coroutine( `dtype`T ) {1018 void main( T &);1019 coroutine_desc * get_coroutine( T &);1016 The reserved keyword eases use for the common cases. 1017 1018 Part of the mechanism to generalize coroutines is using a \CFA trait, which defines a coroutine as anything satisfying the trait @is_coroutine@, and this trait is used to restrict coroutine-manipulation functions: 1019 \begin{cfa} 1020 trait is_coroutine( dtype T ) { 1021 void main( T & this ); 1022 coroutine_desc * get_coroutine( T & this ); 1020 1023 }; 1021 forall( `dtype` T | is_coroutine(T) ) void suspend( T & );1022 forall( `dtype` T | is_coroutine(T) ) void resume( T & );1023 \end{cfa} 1024 The @dtype@ property of the trait ensures the coroutine descriptor is non-copyable, so all coroutines must be passed by reference (pointer). 1025 Th e routine definitions ensures there is a statically-typed @main@ routine that is the starting point (first stack frame) of a coroutine, and a mechanism to get (read) the currently executing coroutine handle.1026 The @main@ routine has no return value or additional parameters because the coroutine type allows an arbitrary number of interface routines with corresponding arbitrary typed input/output values versus fixed ones.1027 The generic routines @suspend@ and @resume@ can be redefined, but any object passed to themis a coroutine since it must satisfy the @is_coroutine@ trait to compile.1028 The advantage of this approach is that users can easily create different types of coroutines, \eg changing the memory layout of a coroutine is trivial when implementing the @get_coroutine@ routine, and possibly redefining @suspend@ and @resume@.1024 forall( dtype T | is_coroutine(T) ) void get_coroutine( T & ); 1025 forall( dtype T | is_coroutine(T) ) void suspend( T & ); 1026 forall( dtype T | is_coroutine(T) ) void resume( T & ); 1027 \end{cfa} 1028 This definition ensures there is a statically-typed @main@ function that is the starting point (first stack frame) of a coroutine. 1029 No return value or additional parameters are necessary for this function, because the coroutine type allows an arbitrary number of interface functions with corresponding arbitrary typed input/output values. 1030 As well, any object passed to @suspend@ and @resume@ is a coroutine since it must satisfy the @is_coroutine@ trait to compile. 1031 The advantage of this approach is that users can easily create different types of coroutines, for example, changing the memory layout of a coroutine is trivial when implementing the @get_coroutine@ routine. 1029 1032 The \CFA keyword @coroutine@ implicitly implements the getter and forward declarations required for implementing the coroutine main: 1030 1033 \begin{cquote} … … 1036 1039 }; 1037 1040 \end{cfa} 1038 & 1039 {\Large $\Rightarrow$} 1040 & 1041 & {\Large $\Rightarrow$} & 1041 1042 \begin{tabular}{@{}ccc@{}} 1042 1043 \begin{cfa} … … 1072 1073 Like coroutines and for the same design reasons, the selected approach for user threads is to use language support by introducing a new kind of aggregate (structure) and a \CFA trait: 1073 1074 \begin{cquote} 1074 \begin{tabular}{@{}c@{\hspace{ 3\parindentlnth}}c@{}}1075 \begin{tabular}{@{}c@{\hspace{2\parindentlnth}}c@{}} 1075 1076 \begin{cfa} 1076 1077 thread myThread { … … 1082 1083 & 1083 1084 \begin{cfa} 1084 trait is_thread( `dtype`T ) {1085 void main( T & );1086 thread_desc * get_thread( T & );1087 void ^?{}( T & `mutex` );1085 trait is_thread( dtype T ) { 1086 void main( T & this ); 1087 thread_desc * get_thread( T & this ); 1088 void ^?{}( T & `mutex` this ); 1088 1089 }; 1089 1090 \end{cfa} … … 1091 1092 \end{cquote} 1092 1093 (The qualifier @mutex@ for the destructor parameter is discussed in Section~\ref{s:Monitors}.) 1093 Like a coroutine, the statically-typed @main@ routineis the starting point (first stack frame) of a user thread.1094 Like a coroutine, the statically-typed @main@ function is the starting point (first stack frame) of a user thread. 1094 1095 The difference is that a coroutine borrows a thread from its caller, so the first thread resuming a coroutine creates an instance of @main@; 1095 1096 whereas, a user thread receives its own thread from the runtime system, which starts in @main@ as some point after the thread constructor is run.\footnote{ 1096 The \lstinline@main@ routine is already a special routine in C, \ie where the program's initial thread begins, so it is a natural extension of this semantics to use overloading to declare \lstinline@main@s for user coroutines and threads.}1097 No return value or additional parameters are necessary for this routine because the task type allows an arbitrary number of interface routines with corresponding arbitrary typed input/output values.1097 The \lstinline@main@ function is already a special routine in C (where the program begins), so it is a natural extension of the semantics to use overloading to declare mains for different coroutines/threads (the normal main being the main of the initial thread).} 1098 No return value or additional parameters are necessary for this function, because the task type allows an arbitrary number of interface functions with corresponding arbitrary typed input/output values. 1098 1099 1099 1100 \begin{comment} % put in appendix with coroutine version ??? … … 1108 1109 1109 1110 In this example, threads of type @foo@ start execution in the @void main(foo &)@ routine, which prints @"Hello World!".@ While this paper encourages this approach to enforce strongly typed programming, users may prefer to use the routine-based thread semantics for the sake of simplicity. 1110 With the static semantics it is trivial to write a thread type that takes a routinepointer as a parameter and executes it on its stack asynchronously.1111 \begin{cfa} 1112 typedef void (*void Rtn)(int);1113 1114 thread RtnRunner {1115 void Rtnfunc;1111 With the static semantics it is trivial to write a thread type that takes a function pointer as a parameter and executes it on its stack asynchronously. 1112 \begin{cfa} 1113 typedef void (*voidFunc)(int); 1114 1115 thread FuncRunner { 1116 voidFunc func; 1116 1117 int arg; 1117 1118 }; 1118 1119 1119 void ?{}( RtnRunner & this, voidRtn inRtn, int arg) {1120 this.func = in Rtn;1120 void ?{}(FuncRunner & this, voidFunc inFunc, int arg) { 1121 this.func = inFunc; 1121 1122 this.arg = arg; 1122 1123 } 1123 1124 1124 void main( RtnRunner & this) {1125 // thread starts here and runs the routine1125 void main(FuncRunner & this) { 1126 // thread starts here and runs the function 1126 1127 this.func( this.arg ); 1127 1128 } … … 1132 1133 1133 1134 int main() { 1134 RtnRunner f = {hello, 42};1135 FuncRunner f = {hello, 42}; 1135 1136 return 0? 1136 1137 } 1137 1138 \end{cfa} 1138 A consequence of the strongly typed approach to main is that memory layout of parameters and return values to/from a thread are now explicitly specified in the \textbf{ API}.1139 A consequence of the strongly typed approach to main is that memory layout of parameters and return values to/from a thread are now explicitly specified in the \textbf{api}. 1139 1140 \end{comment} 1140 1141 … … 1185 1186 void main( Adder & adder ) with( adder ) { 1186 1187 subtotal = 0; 1187 for ( int c = 0; c < cols; c += 1 ) { subtotal += row[c]; } 1188 for ( int c = 0; c < cols; c += 1 ) { 1189 subtotal += row[c]; 1190 } 1188 1191 } 1189 1192 int main() { … … 1207 1210 1208 1211 1209 \section{ Mutual Exclusion / Synchronization}1212 \section{Synchronization / Mutual Exclusion} 1210 1213 1211 1214 Uncontrolled non-deterministic execution is meaningless. 1212 To reestablish meaningful execution requires mechanisms to reintroduce determinism , \ie restrict non-determinism, called mutual exclusion and synchronization, where mutual exclusion is an access-control mechanism on data shared by threads, and synchronization is a timing relationship among threads~\cite[\S~4]{Buhr05a}.1213 Since many deterministic challenges appear with the use of mutable shared state, some languages/libraries disallow it , \eg Erlang~\cite{Erlang}, Haskell~\cite{Haskell}, Akka~\cite{Akka} (Scala).1214 In these paradigms, interaction among concurrent objects is performed by stateless message-passing~\cite{Thoth,Harmony,V-Kernel} or other paradigms closely relate to networking concepts , \eg channels~\cite{CSP,Go}.1215 However, in call/return-based languages, these approaches force a clear distinction , \ie introduce a new programming paradigm, between regular and concurrent computation, \eg routine call versus message passing.1216 Hence, a programmer must learn and manipulatetwo sets of design patterns.1215 To reestablish meaningful execution requires mechanisms to reintroduce determinism (control non-determinism), called synchronization and mutual exclusion, where synchronization is a timing relationship among threads and mutual exclusion is an access-control mechanism on data shared by threads. 1216 Since many deterministic challenges appear with the use of mutable shared state, some languages/libraries disallow it (Erlang~\cite{Erlang}, Haskell~\cite{Haskell}, Akka~\cite{Akka} (Scala)). 1217 In these paradigms, interaction among concurrent objects is performed by stateless message-passing~\cite{Thoth,Harmony,V-Kernel} or other paradigms closely relate to networking concepts (\eg channels~\cite{CSP,Go}). 1218 However, in call/return-based languages, these approaches force a clear distinction (\ie introduce a new programming paradigm) between non-concurrent and concurrent computation (\ie function call versus message passing). 1219 This distinction means a programmers needs to learn two sets of design patterns. 1217 1220 While this distinction can be hidden away in library code, effective use of the library still has to take both paradigms into account. 1218 1221 In contrast, approaches based on statefull models more closely resemble the standard call/return programming-model, resulting in a single programming paradigm. 1219 1222 1220 At the lowest level, concurrent control is implemented by atomic operations, upon which different kinds of locks mechanism are constructed, \eg semaphores~\cite{Dijkstra68b}, barriers,and path expressions~\cite{Campbell74}.1223 At the lowest level, concurrent control is implemented as atomic operations, upon which different kinds of locks mechanism are constructed, \eg semaphores~\cite{Dijkstra68b} and path expressions~\cite{Campbell74}. 1221 1224 However, for productivity it is always desirable to use the highest-level construct that provides the necessary efficiency~\cite{Hochstein05}. 1222 A newer approach for restricting non-determinismis transactional memory~\cite{Herlihy93}.1225 A newer approach is transactional memory~\cite{Herlihy93}. 1223 1226 While this approach is pursued in hardware~\cite{Nakaike15} and system languages, like \CC~\cite{Cpp-Transactions}, the performance and feature set is still too restrictive to be the main concurrency paradigm for system languages, which is why it was rejected as the core paradigm for concurrency in \CFA. 1224 1227 1225 One of the most natural, elegant, and efficient mechanisms for mutual exclusion and synchronization for shared-memory systems is the \emph{monitor}. 1226 First proposed by Brinch Hansen~\cite{Hansen73} and later described and extended by C.A.R.~Hoare~\cite{Hoare74}, many concurrent programming-languages provide monitors as an explicit language construct: \eg Concurrent Pascal~\cite{ConcurrentPascal}, Mesa~\cite{Mesa}, Modula~\cite{Modula-2}, Turing~\cite{Turing:old}, Modula-3~\cite{Modula-3}, NeWS~\cite{NeWS}, Emerald~\cite{Emerald}, \uC~\cite{Buhr92a} and Java~\cite{Java}. 1227 In addition, operating-system kernels and device drivers have a monitor-like structure, although they often use lower-level primitives such as mutex locks or semaphores to simulate monitors. 1228 For these reasons, \CFA selected monitors as the core high-level concurrency-construct, upon which higher-level approaches can be easily constructed. 1228 One of the most natural, elegant, and efficient mechanisms for synchronization and mutual exclusion for shared-memory systems is the \emph{monitor}. 1229 Monitors were first proposed by Brinch Hansen~\cite{Hansen73} and later described and extended by C.A.R.~Hoare~\cite{Hoare74}. 1230 Many programming languages -- \eg Concurrent Pascal~\cite{ConcurrentPascal}, Mesa~\cite{Mesa}, Modula~\cite{Modula-2}, Turing~\cite{Turing:old}, Modula-3~\cite{Modula-3}, NeWS~\cite{NeWS}, Emerald~\cite{Emerald}, \uC~\cite{Buhr92a} and Java~\cite{Java} -- provide monitors as explicit language constructs. 1231 In addition, operating-system kernels and device drivers have a monitor-like structure, although they often use lower-level primitives such as semaphores or locks to simulate monitors. 1232 For these reasons, this project proposes monitors as the core concurrency construct, upon which even higher-level approaches can be easily constructed.. 1229 1233 1230 1234 … … 1232 1236 1233 1237 A group of instructions manipulating a specific instance of shared data that must be performed atomically is called an (individual) \newterm{critical-section}~\cite{Dijkstra65}. 1234 The generalization is calleda \newterm{group critical-section}~\cite{Joung00}, where multiple tasks with the same session may use the resource simultaneously, but different sessions may not use the resource simultaneously.1238 A generalization is a \newterm{group critical-section}~\cite{Joung00}, where multiple tasks with the same session may use the resource simultaneously, but different sessions may not use the resource simultaneously. 1235 1239 The readers/writer problem~\cite{Courtois71} is an instance of a group critical-section, where readers have the same session and all writers have a unique session. 1236 \newterm{Mutual exclusion} enforces th at the correct kind and number of threads are using a critical section.1240 \newterm{Mutual exclusion} enforces the correction number of threads are using a critical section at the same time. 1237 1241 1238 1242 However, many solutions exist for mutual exclusion, which vary in terms of performance, flexibility and ease of use. 1239 1243 Methods range from low-level locks, which are fast and flexible but require significant attention for correctness, to higher-level concurrency techniques, which sacrifice some performance to improve ease of use. 1240 Ease of use comes by either guaranteeing some problems cannot occur , \eg deadlock free, or by offering a more explicit coupling between shared data and critical section.1241 For example, the \CC @std::atomic<T>@ offers an easy way to express mutual-exclusion on a restricted set of operations , \eg reading/writing, for numerical types.1242 However, a significant challenge with locks is composability because it takes careful organization for multiple locks to be used while preventing deadlock.1243 Easing composability is another feature higher-level mutual-exclusion mechanisms canoffer.1244 Ease of use comes by either guaranteeing some problems cannot occur (\eg deadlock free), or by offering a more explicit coupling between shared data and critical section. 1245 For example, the \CC @std::atomic<T>@ offers an easy way to express mutual-exclusion on a restricted set of operations (\eg reading/writing large types atomically). 1246 However, a significant challenge with (low-level) locks is composability because it takes careful organization for multiple locks to be used while preventing deadlock. 1247 Easing composability is another feature higher-level mutual-exclusion mechanisms offer. 1244 1248 1245 1249 … … 1247 1251 1248 1252 Synchronization enforces relative ordering of execution, and synchronization tools provide numerous mechanisms to establish these timing relationships. 1249 Low-level synchronization primitives offer good performance and flexibility at the cost of ease of use; 1250 higher-level mechanisms often simplify usage by adding better coupling between synchronization and data, \eg message passing, or offering a simpler solution to otherwise involved challenges, \eg barrier lock. 1251 Often synchronization is used to order access to a critical section, \eg ensuring a reader thread is the next kind of thread to enter a critical section. 1252 If a writer thread is scheduled for next access, but another reader thread acquires the critical section first, that reader has \newterm{barged}. 1253 Barging can result in staleness/freshness problems, where a reader barges ahead of a writer and reads temporally stale data, or a writer barges ahead of another writer overwriting data with a fresh value preventing the previous value from ever being read (lost computation). 1253 Low-level synchronization primitives offer good performance and flexibility at the cost of ease of use. 1254 Higher-level mechanisms often simplify usage by adding better coupling between synchronization and data (\eg message passing), or offering a simpler solution to otherwise involved challenges, \eg barrier lock. 1255 As mentioned above, synchronization can be expressed as guaranteeing that event \textit{X} always happens before \textit{Y}. 1256 Often synchronization is used to order access to a critical section, \eg ensuring the next kind of thread to enter a critical section is a reader thread 1257 If a writer thread is scheduled for next access, but another reader thread acquires the critical section first, the reader has \newterm{barged}. 1258 Barging can result in staleness/freshness problems, where a reader barges ahead of a write and reads temporally stale data, or a writer barges ahead of another writer overwriting data with a fresh value preventing the previous value from having an opportunity to be read. 1254 1259 Preventing or detecting barging is an involved challenge with low-level locks, which can be made much easier by higher-level constructs. 1255 This challenge is often split into two different approaches: barging avoidance and barging prevention. 1256 Algorithms that allow a barger, but divert it until later using current synchronization state (flags), are avoiding the barger; 1257 algorithms that preclude a barger from entering during synchronization in the critical section prevent barging completely. 1258 Techniques like baton-pass locks~\cite{Andrews89} between threads instead of unconditionally releasing locks is an example of barging prevention. 1260 This challenge is often split into two different approaches, barging avoidance and barging prevention. 1261 Algorithms that allow a barger but divert it until later are avoiding the barger, while algorithms that preclude a barger from entering during synchronization in the critical section prevent the barger completely. 1262 baton-pass locks~\cite{Andrews89} between threads instead of releasing the locks are said to be using barging prevention. 1259 1263 1260 1264 … … 1262 1266 \label{s:Monitors} 1263 1267 1264 A \textbf{monitor} is a set of routines that ensure mutual exclusion when accessing shared state. 1265 More precisely, a monitor is a programming technique that binds mutual exclusion to routine scope, as opposed to locks, where mutual-exclusion is defined by acquire/release calls, independent of lexical context (analogous to block and heap storage allocation). 1266 The strong association with the call/return paradigm eases programmability, readability and maintainability, at a slight cost in flexibility and efficiency. 1267 1268 Note, like coroutines/threads, both locks and monitors require an abstract handle to reference them, because at their core, both mechanisms are manipulating non-copyable shared-state. 1269 Copying a lock is insecure because it is possible to copy an open lock and then use the open copy when the original lock is closed to simultaneously access the shared data. 1270 Copying a monitor is secure because both the lock and shared data are copies, but copying the shared data is meaningless because it no longer represents a unique entity. 1271 As for coroutines/tasks, a non-copyable (@dtype@) trait is used to capture this requirement, so all locks/monitors must be passed by reference (pointer). 1272 \begin{cfa} 1273 trait is_monitor( `dtype` T ) { 1268 A \textbf{monitor} is a set of routines that ensure mutual-exclusion when accessing shared state. 1269 More precisely, a monitor is a programming technique that associates mutual-exclusion to routine scopes, as opposed to mutex locks, where mutual-exclusion is defined by lock/release calls independently of any scoping of the calling routine. 1270 This strong association eases readability and maintainability, at the cost of flexibility. 1271 Note that both monitors and mutex locks, require an abstract handle to identify them. 1272 This concept is generally associated with object-oriented languages like Java~\cite{Java} or \uC~\cite{uC++book} but does not strictly require OO semantics. 1273 The only requirement is the ability to declare a handle to a shared object and a set of routines that act on it: 1274 \begin{cfa} 1275 typedef /*some monitor type*/ monitor; 1276 int f(monitor & m); 1277 1278 int main() { 1279 monitor m; // Handle m 1280 f(m); // Routine using handle 1281 } 1282 \end{cfa} 1283 1284 % ====================================================================== 1285 % ====================================================================== 1286 \subsection{Call Semantics} \label{call} 1287 % ====================================================================== 1288 % ====================================================================== 1289 The above monitor example displays some of the intrinsic characteristics. 1290 First, it is necessary to use pass-by-reference over pass-by-value for monitor routines. 1291 This semantics is important, because at their core, monitors are implicit mutual-exclusion objects (locks), and these objects cannot be copied. 1292 Therefore, monitors are non-copy-able objects (@dtype@). 1293 1294 Another aspect to consider is when a monitor acquires its mutual exclusion. 1295 For example, a monitor may need to be passed through multiple helper routines that do not acquire the monitor mutual-exclusion on entry. 1296 Passthrough can occur for generic helper routines (@swap@, @sort@, \etc) or specific helper routines like the following to implement an atomic counter: 1297 1298 \begin{cfa} 1299 monitor counter_t { /*...see section $\ref{data}$...*/ }; 1300 1301 void ?{}(counter_t & nomutex this); // constructor 1302 size_t ++?(counter_t & mutex this); // increment 1303 1304 // need for mutex is platform dependent 1305 void ?{}(size_t * this, counter_t & mutex cnt); // conversion 1306 \end{cfa} 1307 This counter is used as follows: 1308 \begin{center} 1309 \begin{tabular}{c @{\hskip 0.35in} c @{\hskip 0.35in} c} 1310 \begin{cfa} 1311 // shared counter 1312 counter_t cnt1, cnt2; 1313 1314 // multiple threads access counter 1315 thread 1 : cnt1++; cnt2++; 1316 thread 2 : cnt1++; cnt2++; 1317 thread 3 : cnt1++; cnt2++; 1318 ... 1319 thread N : cnt1++; cnt2++; 1320 \end{cfa} 1321 \end{tabular} 1322 \end{center} 1323 Notice how the counter is used without any explicit synchronization and yet supports thread-safe semantics for both reading and writing, which is similar in usage to the \CC template @std::atomic@. 1324 1325 Here, the constructor (@?{}@) uses the @nomutex@ keyword to signify that it does not acquire the monitor mutual-exclusion when constructing. 1326 This semantics is because an object not yet constructed should never be shared and therefore does not require mutual exclusion. 1327 Furthermore, it allows the implementation greater freedom when it initializes the monitor locking. 1328 The prefix increment operator uses @mutex@ to protect the incrementing process from race conditions. 1329 Finally, there is a conversion operator from @counter_t@ to @size_t@. 1330 This conversion may or may not require the @mutex@ keyword depending on whether or not reading a @size_t@ is an atomic operation. 1331 1332 For maximum usability, monitors use \textbf{multi-acq} semantics, which means a single thread can acquire the same monitor multiple times without deadlock. 1333 For example, listing \ref{fig:search} uses recursion and \textbf{multi-acq} to print values inside a binary tree. 1334 \begin{figure} 1335 \begin{cfa}[caption={Recursive printing algorithm using \textbf{multi-acq}.},label={fig:search}] 1336 monitor printer { ... }; 1337 struct tree { 1338 tree * left, right; 1339 char * value; 1340 }; 1341 void print(printer & mutex p, char * v); 1342 1343 void print(printer & mutex p, tree * t) { 1344 print(p, t->value); 1345 print(p, t->left ); 1346 print(p, t->right); 1347 } 1348 \end{cfa} 1349 \end{figure} 1350 1351 Having both @mutex@ and @nomutex@ keywords can be redundant, depending on the meaning of a routine having neither of these keywords. 1352 For example, it is reasonable that it should default to the safest option (@mutex@) when given a routine without qualifiers @void foo(counter_t & this)@, whereas assuming @nomutex@ is unsafe and may cause subtle errors. 1353 On the other hand, @nomutex@ is the ``normal'' parameter behaviour, it effectively states explicitly that ``this routine is not special''. 1354 Another alternative is making exactly one of these keywords mandatory, which provides the same semantics but without the ambiguity of supporting routines with neither keyword. 1355 Mandatory keywords would also have the added benefit of being self-documented but at the cost of extra typing. 1356 While there are several benefits to mandatory keywords, they do bring a few challenges. 1357 Mandatory keywords in \CFA would imply that the compiler must know without doubt whether or not a parameter is a monitor or not. 1358 Since \CFA relies heavily on traits as an abstraction mechanism, the distinction between a type that is a monitor and a type that looks like a monitor can become blurred. 1359 For this reason, \CFA only has the @mutex@ keyword and uses no keyword to mean @nomutex@. 1360 1361 The next semantic decision is to establish when @mutex@ may be used as a type qualifier. 1362 Consider the following declarations: 1363 \begin{cfa} 1364 int f1(monitor & mutex m); 1365 int f2(const monitor & mutex m); 1366 int f3(monitor ** mutex m); 1367 int f4(monitor * mutex m []); 1368 int f5(graph(monitor *) & mutex m); 1369 \end{cfa} 1370 The problem is to identify which object(s) should be acquired. 1371 Furthermore, each object needs to be acquired only once. 1372 In the case of simple routines like @f1@ and @f2@ it is easy to identify an exhaustive list of objects to acquire on entry. 1373 Adding indirections (@f3@) still allows the compiler and programmer to identify which object is acquired. 1374 However, adding in arrays (@f4@) makes it much harder. 1375 Array lengths are not necessarily known in C, and even then, making sure objects are only acquired once becomes none-trivial. 1376 This problem can be extended to absurd limits like @f5@, which uses a graph of monitors. 1377 To make the issue tractable, this project imposes the requirement that a routine may only acquire one monitor per parameter and it must be the type of the parameter with at most one level of indirection (ignoring potential qualifiers). 1378 Also note that while routine @f3@ can be supported, meaning that monitor @**m@ is acquired, passing an array to this routine would be type-safe and yet result in undefined behaviour because only the first element of the array is acquired. 1379 However, this ambiguity is part of the C type-system with respects to arrays. 1380 For this reason, @mutex@ is disallowed in the context where arrays may be passed: 1381 \begin{cfa} 1382 int f1(monitor & mutex m); // Okay : recommended case 1383 int f2(monitor * mutex m); // Not Okay : Could be an array 1384 int f3(monitor mutex m []); // Not Okay : Array of unknown length 1385 int f4(monitor ** mutex m); // Not Okay : Could be an array 1386 int f5(monitor * mutex m []); // Not Okay : Array of unknown length 1387 \end{cfa} 1388 Note that not all array functions are actually distinct in the type system. 1389 However, even if the code generation could tell the difference, the extra information is still not sufficient to extend meaningfully the monitor call semantic. 1390 1391 Unlike object-oriented monitors, where calling a mutex member \emph{implicitly} acquires mutual-exclusion of the receiver object, \CFA uses an explicit mechanism to specify the object that acquires mutual-exclusion. 1392 A consequence of this approach is that it extends naturally to multi-monitor calls. 1393 \begin{cfa} 1394 int f(MonitorA & mutex a, MonitorB & mutex b); 1395 1396 MonitorA a; 1397 MonitorB b; 1398 f(a,b); 1399 \end{cfa} 1400 While OO monitors could be extended with a mutex qualifier for multiple-monitor calls, no example of this feature could be found. 1401 The capability to acquire multiple locks before entering a critical section is called \emph{\textbf{bulk-acq}}. 1402 In practice, writing multi-locking routines that do not lead to deadlocks is tricky. 1403 Having language support for such a feature is therefore a significant asset for \CFA. 1404 In the case presented above, \CFA guarantees that the order of acquisition is consistent across calls to different routines using the same monitors as arguments. 1405 This consistent ordering means acquiring multiple monitors is safe from deadlock when using \textbf{bulk-acq}. 1406 However, users can still force the acquiring order. 1407 For example, notice which routines use @mutex@/@nomutex@ and how this affects acquiring order: 1408 \begin{cfa} 1409 void foo(A& mutex a, B& mutex b) { // acquire a & b 1410 ... 1411 } 1412 1413 void bar(A& mutex a, B& /*nomutex*/ b) { // acquire a 1414 ... foo(a, b); ... // acquire b 1415 } 1416 1417 void baz(A& /*nomutex*/ a, B& mutex b) { // acquire b 1418 ... foo(a, b); ... // acquire a 1419 } 1420 \end{cfa} 1421 The \textbf{multi-acq} monitor lock allows a monitor lock to be acquired by both @bar@ or @baz@ and acquired again in @foo@. 1422 In the calls to @bar@ and @baz@ the monitors are acquired in opposite order. 1423 1424 However, such use leads to lock acquiring order problems. 1425 In the example above, the user uses implicit ordering in the case of function @foo@ but explicit ordering in the case of @bar@ and @baz@. 1426 This subtle difference means that calling these routines concurrently may lead to deadlock and is therefore undefined behaviour. 1427 As shown~\cite{Lister77}, solving this problem requires: 1428 \begin{enumerate} 1429 \item Dynamically tracking the monitor-call order. 1430 \item Implement rollback semantics. 1431 \end{enumerate} 1432 While the first requirement is already a significant constraint on the system, implementing a general rollback semantics in a C-like language is still prohibitively complex~\cite{Dice10}. 1433 In \CFA, users simply need to be careful when acquiring multiple monitors at the same time or only use \textbf{bulk-acq} of all the monitors. 1434 While \CFA provides only a partial solution, most systems provide no solution and the \CFA partial solution handles many useful cases. 1435 1436 For example, \textbf{multi-acq} and \textbf{bulk-acq} can be used together in interesting ways: 1437 \begin{cfa} 1438 monitor bank { ... }; 1439 1440 void deposit( bank & mutex b, int deposit ); 1441 1442 void transfer( bank & mutex mybank, bank & mutex yourbank, int me2you) { 1443 deposit( mybank, -me2you ); 1444 deposit( yourbank, me2you ); 1445 } 1446 \end{cfa} 1447 This example shows a trivial solution to the bank-account transfer problem~\cite{BankTransfer}. 1448 Without \textbf{multi-acq} and \textbf{bulk-acq}, the solution to this problem is much more involved and requires careful engineering. 1449 1450 1451 \subsection{\protect\lstinline|mutex| statement} \label{mutex-stmt} 1452 1453 The call semantics discussed above have one software engineering issue: only a routine can acquire the mutual-exclusion of a set of monitor. \CFA offers the @mutex@ statement to work around the need for unnecessary names, avoiding a major software engineering problem~\cite{2FTwoHardThings}. 1454 Table \ref{f:mutex-stmt} shows an example of the @mutex@ statement, which introduces a new scope in which the mutual-exclusion of a set of monitor is acquired. 1455 Beyond naming, the @mutex@ statement has no semantic difference from a routine call with @mutex@ parameters. 1456 1457 \begin{table} 1458 \begin{center} 1459 \begin{tabular}{|c|c|} 1460 function call & @mutex@ statement \\ 1461 \hline 1462 \begin{cfa}[tabsize=3] 1463 monitor M {}; 1464 void foo( M & mutex m1, M & mutex m2 ) { 1465 // critical section 1466 } 1467 1468 void bar( M & m1, M & m2 ) { 1469 foo( m1, m2 ); 1470 } 1471 \end{cfa}&\begin{cfa}[tabsize=3] 1472 monitor M {}; 1473 void bar( M & m1, M & m2 ) { 1474 mutex(m1, m2) { 1475 // critical section 1476 } 1477 } 1478 1479 1480 \end{cfa} 1481 \end{tabular} 1482 \end{center} 1483 \caption{Regular call semantics vs. \protect\lstinline|mutex| statement} 1484 \label{f:mutex-stmt} 1485 \end{table} 1486 1487 % ====================================================================== 1488 % ====================================================================== 1489 \subsection{Data semantics} \label{data} 1490 % ====================================================================== 1491 % ====================================================================== 1492 Once the call semantics are established, the next step is to establish data semantics. 1493 Indeed, until now a monitor is used simply as a generic handle but in most cases monitors contain shared data. 1494 This data should be intrinsic to the monitor declaration to prevent any accidental use of data without its appropriate protection. 1495 For example, here is a complete version of the counter shown in section \ref{call}: 1496 \begin{cfa} 1497 monitor counter_t { 1498 int value; 1499 }; 1500 1501 void ?{}(counter_t & this) { 1502 this.cnt = 0; 1503 } 1504 1505 int ?++(counter_t & mutex this) { 1506 return ++this.value; 1507 } 1508 1509 // need for mutex is platform dependent here 1510 void ?{}(int * this, counter_t & mutex cnt) { 1511 *this = (int)cnt; 1512 } 1513 \end{cfa} 1514 1515 Like threads and coroutines, monitors are defined in terms of traits with some additional language support in the form of the @monitor@ keyword. 1516 The monitor trait is: 1517 \begin{cfa} 1518 trait is_monitor(dtype T) { 1274 1519 monitor_desc * get_monitor( T & ); 1275 1520 void ^?{}( T & mutex ); 1276 1521 }; 1277 1522 \end{cfa} 1278 1279 1280 \subsection{Mutex Acquisition} 1281 \label{s:MutexAcquisition} 1282 1283 While correctness implicitly implies a monitor's mutual exclusion is acquired and released, there are implementation options about when and where the locking/unlocking occurs. 1284 (Much of this discussion also applies to basic locks.) 1285 For example, a monitor may need to be passed through multiple helper routines before it becomes necessary to acquire the monitor mutual-exclusion. 1286 \begin{cfa}[morekeywords=nomutex] 1287 monitor Aint { int cnt; }; $\C{// atomic integer counter}$ 1288 void ?{}( Aint & `nomutex` this ) with( this ) { cnt = 0; } $\C{// constructor}$ 1289 int ?=?( Aint & `mutex`$\(_{opt}\)$ lhs, int rhs ) with( lhs ) { cnt = rhs; } $\C{// conversions}$ 1290 void ?{}( int & this, Aint & `mutex`$\(_{opt}\)$ v ) { this = v.cnt; } 1291 int ?=?( int & lhs, Aint & `mutex`$\(_{opt}\)$ rhs ) with( rhs ) { lhs = cnt; } 1292 int ++?( Aint & `mutex`$\(_{opt}\)$ this ) with( this ) { return ++cnt; } $\C{// increment}$ 1293 \end{cfa} 1294 The @Aint@ constructor, @?{}@, uses the \lstinline[morekeywords=nomutex]@nomutex@ qualifier indicating mutual exclusion is unnecessary during construction because an object is inaccessible (private) until after it is initialized. 1295 (While a constructor may publish its address into a global variable, doing so generates a race-condition.) 1296 The conversion operators for initializing and assigning with a normal integer only need @mutex@, if reading/writing the implementation type is not atomic. 1297 Finally, the prefix increment operato, @++?@, is normally @mutex@ to protect the incrementing from race conditions, unless there is an atomic increment instruction for the implementation type. 1298 1299 The atomic counter is used without any explicit mutual-exclusion and provides thread-safe semantics, which is similar to the \CC template @std::atomic@. 1300 \begin{cfa} 1301 Aint x, y, z; 1302 ++x; ++y; ++z; $\C{// safe increment by multiple threads}$ 1303 x = 2; y = 2; z = 2; $\C{// conversions}$ 1304 int i = x, j = y, k = z; 1305 i = x; j = y; k = z; 1306 \end{cfa} 1307 1308 For maximum usability, monitors have \newterm{multi-acquire} semantics allowing a thread to acquire it multiple times without deadlock. 1309 For example, atomically printing the contents of a binary tree: 1310 \begin{cfa} 1311 monitor Tree { 1312 Tree * left, right; 1313 // value 1314 }; 1315 void print( Tree & mutex tree ) { $\C{// prefix traversal}$ 1316 // write value 1317 print( tree->left ); $\C{// multiply acquire monitor lock on each recursion}$ 1318 print( tree->right ); 1319 } 1320 \end{cfa} 1321 1322 Mandatory monitor qualifiers have the benefit of being self-documented, but requiring both @mutex@ and \lstinline[morekeywords=nomutex]@nomutex@ for all monitor parameter is redundant. 1323 Instead, one of qualifier semantics can be the default, and the other required. 1324 For example, assume the safe @mutex@ option for a monitor parameter because assuming \lstinline[morekeywords=nomutex]@nomutex@ may cause subtle errors. 1325 On the other hand, assuming \lstinline[morekeywords=nomutex]@nomutex@ is the \emph{normal} parameter behaviour, stating explicitly ``this parameter is not special''. 1326 Providing a default qualifier implies knowing whether a parameter is a monitor. 1327 Since \CFA relies heavily on traits as an abstraction mechanism, the distinction between a type that is a monitor and a type that looks like a monitor can become blurred. 1328 For this reason, \CFA requires programmers to identify the kind of parameter with the @mutex@ keyword and uses no keyword to mean \lstinline[morekeywords=nomutex]@nomutex@. 1329 1330 The next semantic decision is establishing which parameter \emph{types} may be qualified with @mutex@. 1331 Given: 1332 \begin{cfa} 1333 monitor M { ... } 1334 int f1( M & mutex m ); 1335 int f2( M * mutex m ); 1336 int f3( M * mutex m[] ); 1337 int f4( stack( M * ) & mutex m ); 1338 \end{cfa} 1339 the issue is that some of these parameter types are composed of multiple objects. 1340 For @f1@, there is only a single parameter object. 1341 Adding indirection in @f2@ still identifies a single object. 1342 However, the matrix in @f3@ introduces multiple objects. 1343 While shown shortly, multiple acquisition is possible; 1344 however array lengths are often unknown in C. 1345 This issue is exacerbated in @f4@, where the data structure must be safely traversed to acquire all of its elements. 1346 1347 To make the issue tractable, \CFA only acquires one monitor per parameter with at most one level of indirection. 1348 However, the C type-system has an ambiguity with respects to arrays. 1349 Is the argument for @f2@ a single object or an array of objects? 1350 If it is an array, only the first element of the array is acquired, which seems unsafe; 1351 hence, @mutex@ is disallowed for array parameters. 1352 \begin{cfa} 1353 int f1( M & mutex m ); $\C{// allowed: recommended case}$ 1354 int f2( M * mutex m ); $\C{// disallowed: could be an array}$ 1355 int f3( M mutex m[$\,$] ); $\C{// disallowed: array length unknown}$ 1356 int f4( M ** mutex m ); $\C{// disallowed: could be an array}$ 1357 int f5( M * mutex m[$\,$] ); $\C{// disallowed: array length unknown}$ 1358 \end{cfa} 1359 % Note, not all array routines have distinct types: @f2@ and @f3@ have the same type, as do @f4@ and @f5@. 1360 % However, even if the code generation could tell the difference, the extra information is still not sufficient to extend meaningfully the monitor call semantic. 1361 1362 For object-oriented monitors, calling a mutex member \emph{implicitly} acquires mutual exclusion of the receiver object, @`rec`.foo(...)@. 1363 \CFA has no receiver, and hence, must use an explicit mechanism to specify which object has mutual exclusion acquired. 1364 A positive consequence of this design decision is the ability to support multi-monitor routines. 1365 \begin{cfa} 1366 int f( M & mutex x, M & mutex y ); $\C{// multiple monitor parameter of any type}$ 1367 M m1, m2; 1368 f( m1, m2 ); 1369 \end{cfa} 1370 (While object-oriented monitors can be extended with a mutex qualifier for multiple-monitor members, no prior example of this feature could be found.) 1371 In practice, writing multi-locking routines that do not deadlock is tricky. 1372 Having language support for such a feature is therefore a significant asset for \CFA. 1373 1374 The capability to acquire multiple locks before entering a critical section is called \newterm{bulk acquire}. 1375 In the previous example, \CFA guarantees the order of acquisition is consistent across calls to different routines using the same monitors as arguments. 1376 This consistent ordering means acquiring multiple monitors is safe from deadlock. 1377 However, users can force the acquiring order. 1378 For example, notice the use of @mutex@/\lstinline[morekeywords=nomutex]@nomutex@ and how this affects the acquiring order: 1379 \begin{cfa} 1380 void foo( M & mutex m1, M & mutex m2 ); $\C{// acquire m1 and m2}$ 1381 void bar( M & mutex m1, M & /* nomutex */ m2 ) { $\C{// acquire m1}$ 1382 ... foo( m1, m2 ); ... $\C{// acquire m2}$ 1383 } 1384 void baz( M & /* nomutex */ m1, M & mutex m2 ) { $\C{// acquire m2}$ 1385 ... foo( m1, m2 ); ... $\C{// acquire m1}$ 1386 } 1387 \end{cfa} 1388 The multi-acquire semantics allows @bar@ or @baz@ to acquire a monitor lock and reacquire it in @foo@. 1389 In the calls to @bar@ and @baz@, the monitors are acquired in opposite order. 1390 1391 However, such use leads to lock acquiring order problems resulting in deadlock~\cite{Lister77}, where detecting it requires dynamically tracking of monitor calls, and dealing with it requires rollback semantics~\cite{Dice10}. 1392 In \CFA, safety is guaranteed by using bulk acquire of all monitors to shared objects, whereas other monitor systems provide no aid. 1393 While \CFA provides only a partial solution, the \CFA partial solution handles many useful cases. 1394 \begin{cfa} 1395 monitor Bank { ... }; 1396 void deposit( Bank & `mutex` b, int deposit ); 1397 void transfer( Bank & `mutex` mybank, Bank & `mutex` yourbank, int me2you) { 1398 deposit( mybank, `-`me2you ); $\C{// debit}$ 1399 deposit( yourbank, me2you ); $\C{// credit}$ 1400 } 1401 \end{cfa} 1402 This example shows a trivial solution to the bank-account transfer problem~\cite{BankTransfer}. 1403 Without multi- and bulk acquire, the solution to this problem requires careful engineering. 1404 1405 1406 \subsection{\protect\lstinline|mutex| statement} \label{mutex-stmt} 1407 1408 The monitor call-semantics associate all locking semantics to routines. 1409 Like Java, \CFA offers an alternative @mutex@ statement to reduce refactoring and naming. 1410 \begin{cquote} 1411 \begin{tabular}{@{}c|@{\hspace{\parindentlnth}}c@{}} 1412 routine call & @mutex@ statement \\ 1413 \begin{cfa} 1414 monitor M {}; 1415 void foo( M & mutex m1, M & mutex m2 ) { 1416 // critical section 1417 } 1418 void bar( M & m1, M & m2 ) { 1419 foo( m1, m2 ); 1420 } 1421 \end{cfa} 1422 & 1423 \begin{cfa} 1424 1425 void bar( M & m1, M & m2 ) { 1426 mutex( m1, m2 ) { // remove refactoring and naming 1427 // critical section 1523 Note that the destructor of a monitor must be a @mutex@ routine to prevent deallocation while a thread is accessing the monitor. 1524 As with any object, calls to a monitor, using @mutex@ or otherwise, is undefined behaviour after the destructor has run. 1525 1526 % ====================================================================== 1527 % ====================================================================== 1528 \section{Internal Scheduling} \label{intsched} 1529 % ====================================================================== 1530 % ====================================================================== 1531 In addition to mutual exclusion, the monitors at the core of \CFA's concurrency can also be used to achieve synchronization. 1532 With monitors, this capability is generally achieved with internal or external scheduling as in~\cite{Hoare74}. 1533 With \textbf{scheduling} loosely defined as deciding which thread acquires the critical section next, \textbf{internal scheduling} means making the decision from inside the critical section (\ie with access to the shared state), while \textbf{external scheduling} means making the decision when entering the critical section (\ie without access to the shared state). 1534 Since internal scheduling within a single monitor is mostly a solved problem, this paper concentrates on extending internal scheduling to multiple monitors. 1535 Indeed, like the \textbf{bulk-acq} semantics, internal scheduling extends to multiple monitors in a way that is natural to the user but requires additional complexity on the implementation side. 1536 1537 First, here is a simple example of internal scheduling: 1538 1539 \begin{cfa} 1540 monitor A { 1541 condition e; 1542 } 1543 1544 void foo(A& mutex a1, A& mutex a2) { 1545 ... 1546 // Wait for cooperation from bar() 1547 wait(a1.e); 1548 ... 1549 } 1550 1551 void bar(A& mutex a1, A& mutex a2) { 1552 // Provide cooperation for foo() 1553 ... 1554 // Unblock foo 1555 signal(a1.e); 1556 } 1557 \end{cfa} 1558 There are two details to note here. 1559 First, @signal@ is a delayed operation; it only unblocks the waiting thread when it reaches the end of the critical section. 1560 This semantics is needed to respect mutual-exclusion, \ie the signaller and signalled thread cannot be in the monitor simultaneously. 1561 The alternative is to return immediately after the call to @signal@, which is significantly more restrictive. 1562 Second, in \CFA, while it is common to store a @condition@ as a field of the monitor, a @condition@ variable can be stored/created independently of a monitor. 1563 Here routine @foo@ waits for the @signal@ from @bar@ before making further progress, ensuring a basic ordering. 1564 1565 An important aspect of the implementation is that \CFA does not allow barging, which means that once function @bar@ releases the monitor, @foo@ is guaranteed to be the next thread to acquire the monitor (unless some other thread waited on the same condition). 1566 This guarantee offers the benefit of not having to loop around waits to recheck that a condition is met. 1567 The main reason \CFA offers this guarantee is that users can easily introduce barging if it becomes a necessity but adding barging prevention or barging avoidance is more involved without language support. 1568 Supporting barging prevention as well as extending internal scheduling to multiple monitors is the main source of complexity in the design and implementation of \CFA concurrency. 1569 1570 % ====================================================================== 1571 % ====================================================================== 1572 \subsection{Internal Scheduling - Multi-Monitor} 1573 % ====================================================================== 1574 % ====================================================================== 1575 It is easy to understand the problem of multi-monitor scheduling using a series of pseudo-code examples. 1576 Note that for simplicity in the following snippets of pseudo-code, waiting and signalling is done using an implicit condition variable, like Java built-in monitors. 1577 Indeed, @wait@ statements always use the implicit condition variable as parameters and explicitly name the monitors (A and B) associated with the condition. 1578 Note that in \CFA, condition variables are tied to a \emph{group} of monitors on first use (called branding), which means that using internal scheduling with distinct sets of monitors requires one condition variable per set of monitors. 1579 The example below shows the simple case of having two threads (one for each column) and a single monitor A. 1580 1581 \begin{multicols}{2} 1582 thread 1 1583 \begin{cfa} 1584 acquire A 1585 wait A 1586 release A 1587 \end{cfa} 1588 1589 \columnbreak 1590 1591 thread 2 1592 \begin{cfa} 1593 acquire A 1594 signal A 1595 release A 1596 \end{cfa} 1597 \end{multicols} 1598 One thread acquires before waiting (atomically blocking and releasing A) and the other acquires before signalling. 1599 It is important to note here that both @wait@ and @signal@ must be called with the proper monitor(s) already acquired. 1600 This semantic is a logical requirement for barging prevention. 1601 1602 A direct extension of the previous example is a \textbf{bulk-acq} version: 1603 \begin{multicols}{2} 1604 \begin{cfa} 1605 acquire A & B 1606 wait A & B 1607 release A & B 1608 \end{cfa} 1609 \columnbreak 1610 \begin{cfa} 1611 acquire A & B 1612 signal A & B 1613 release A & B 1614 \end{cfa} 1615 \end{multicols} 1616 \noindent This version uses \textbf{bulk-acq} (denoted using the {\sf\&} symbol), but the presence of multiple monitors does not add a particularly new meaning. 1617 Synchronization happens between the two threads in exactly the same way and order. 1618 The only difference is that mutual exclusion covers a group of monitors. 1619 On the implementation side, handling multiple monitors does add a degree of complexity as the next few examples demonstrate. 1620 1621 While deadlock issues can occur when nesting monitors, these issues are only a symptom of the fact that locks, and by extension monitors, are not perfectly composable. 1622 For monitors, a well-known deadlock problem is the Nested Monitor Problem~\cite{Lister77}, which occurs when a @wait@ is made by a thread that holds more than one monitor. 1623 For example, the following cfa-code runs into the nested-monitor problem: 1624 \begin{multicols}{2} 1625 \begin{cfa} 1626 acquire A 1627 acquire B 1628 wait B 1629 release B 1630 release A 1631 \end{cfa} 1632 1633 \columnbreak 1634 1635 \begin{cfa} 1636 acquire A 1637 acquire B 1638 signal B 1639 release B 1640 release A 1641 \end{cfa} 1642 \end{multicols} 1643 \noindent The @wait@ only releases monitor @B@ so the signalling thread cannot acquire monitor @A@ to get to the @signal@. 1644 Attempting release of all acquired monitors at the @wait@ introduces a different set of problems, such as releasing monitor @C@, which has nothing to do with the @signal@. 1645 1646 However, for monitors as for locks, it is possible to write a program using nesting without encountering any problems if nesting is done correctly. 1647 For example, the next cfa-code snippet acquires monitors {\sf A} then {\sf B} before waiting, while only acquiring {\sf B} when signalling, effectively avoiding the Nested Monitor Problem~\cite{Lister77}. 1648 1649 \begin{multicols}{2} 1650 \begin{cfa} 1651 acquire A 1652 acquire B 1653 wait B 1654 release B 1655 release A 1656 \end{cfa} 1657 1658 \columnbreak 1659 1660 \begin{cfa} 1661 1662 acquire B 1663 signal B 1664 release B 1665 1666 \end{cfa} 1667 \end{multicols} 1668 1669 \noindent However, this simple refactoring may not be possible, forcing more complex restructuring. 1670 1671 % ====================================================================== 1672 % ====================================================================== 1673 \subsection{Internal Scheduling - In Depth} 1674 % ====================================================================== 1675 % ====================================================================== 1676 1677 A larger example is presented to show complex issues for \textbf{bulk-acq} and its implementation options are analyzed. 1678 Figure~\ref{f:int-bulk-cfa} shows an example where \textbf{bulk-acq} adds a significant layer of complexity to the internal signalling semantics, and listing \ref{f:int-bulk-cfa} shows the corresponding \CFA code to implement the cfa-code in listing \ref{f:int-bulk-cfa}. 1679 For the purpose of translating the given cfa-code into \CFA-code, any method of introducing a monitor is acceptable, \eg @mutex@ parameters, global variables, pointer parameters, or using locals with the @mutex@ statement. 1680 1681 \begin{figure} 1682 \begin{multicols}{2} 1683 Waiting thread 1684 \begin{cfa}[numbers=left] 1685 acquire A 1686 // Code Section 1 1687 acquire A & B 1688 // Code Section 2 1689 wait A & B 1690 // Code Section 3 1691 release A & B 1692 // Code Section 4 1693 release A 1694 \end{cfa} 1695 \columnbreak 1696 Signalling thread 1697 \begin{cfa}[numbers=left, firstnumber=10,escapechar=|] 1698 acquire A 1699 // Code Section 5 1700 acquire A & B 1701 // Code Section 6 1702 |\label{line:signal1}|signal A & B 1703 // Code Section 7 1704 |\label{line:releaseFirst}|release A & B 1705 // Code Section 8 1706 |\label{line:lastRelease}|release A 1707 \end{cfa} 1708 \end{multicols} 1709 \begin{cfa}[caption={Internal scheduling with \textbf{bulk-acq}},label={f:int-bulk-cfa}] 1710 \end{cfa} 1711 \begin{center} 1712 \begin{cfa}[xleftmargin=.4\textwidth] 1713 monitor A a; 1714 monitor B b; 1715 condition c; 1716 \end{cfa} 1717 \end{center} 1718 \begin{multicols}{2} 1719 Waiting thread 1720 \begin{cfa} 1721 mutex(a) { 1722 // Code Section 1 1723 mutex(a, b) { 1724 // Code Section 2 1725 wait(c); 1726 // Code Section 3 1428 1727 } 1429 } 1430 1431 \end{cfa} 1432 \end{tabular} 1433 \end{cquote} 1434 1435 1436 \section{Scheduling} 1437 \label{s:Scheduling} 1438 1439 While monitor mutual-exclusion provides safe access to shared data, the monitor data may indicate that a thread accessing it cannot proceed. 1440 For example, Figure~\ref{f:GenericBoundedBuffer} shows a bounded buffer that may be full/empty so produce/consumer threads must block. 1441 Leaving the monitor and trying again (busy waiting) is impractical for high-level programming. 1442 Monitors eliminate busy waiting by providing internal synchronization to schedule threads needing access to the shared data, where the synchronization is blocking (threads are parked) versus spinning. 1443 Synchronization is generally achieved with internal~\cite{Hoare74} or external~\cite[\S~2.9.2]{uC++} scheduling, where \newterm{scheduling} defines which thread acquires the critical section next. 1444 \newterm{Internal scheduling} is characterized by each thread entering the monitor and making an individual decision about proceeding or blocking, while \newterm{external scheduling} is characterized by an entering thread making a decision about proceeding for itself and on behalf of other threads attempting entry. 1445 1446 Figure~\ref{f:BBInt} shows a \CFA bounded-buffer with internal scheduling, where producers/consumers enter the monitor, see the buffer is full/empty, and block on an appropriate condition lock, @full@/@empty@. 1447 The @wait@ routine atomically blocks the calling thread and implicitly releases the monitor lock(s) for all monitors in the routine's parameter list. 1448 The appropriate condition lock is signalled to unblock an opposite kind of thread after an element is inserted/removed from the buffer. 1449 Signalling is unconditional, because signalling an empty condition lock does nothing. 1450 1451 Signalling semantics cannot have the signaller and signalled thread in the monitor simultaneously, which means: 1452 \begin{enumerate} 1453 \item 1454 The signalling thread returns immediately, and the signalled thread continues. 1455 \item 1456 The signalling thread continues and the signalled thread is marked for urgent unblocking at the next scheduling point (exit/wait). 1457 \item 1458 The signalling thread blocks but is marked for urgrent unblocking at the next scheduling point and the signalled thread continues. 1459 \end{enumerate} 1460 The first approach is too restrictive, as it precludes solving a reasonable class of problems, \eg dating service. 1461 \CFA supports the next two semantics as both are useful. 1462 Finally, while it is common to store a @condition@ as a field of the monitor, in \CFA, a @condition@ variable can be created/stored independently. 1463 Furthermore, a condition variable is tied to a \emph{group} of monitors on first use (called \newterm{branding}), which means that using internal scheduling with distinct sets of monitors requires one condition variable per set of monitors. 1464 1465 \begin{figure} 1466 \centering 1467 \newbox\myboxA 1468 \begin{lrbox}{\myboxA} 1469 \begin{cfa}[aboveskip=0pt,belowskip=0pt] 1470 forall( otype T ) { // distribute forall 1471 monitor Buffer { 1472 `condition` full, empty; 1473 int front, back, count; 1474 T elements[10]; 1475 }; 1476 void ?{}( Buffer(T) & buffer ) with(buffer) { 1477 [front, back, count] = 0; 1728 // Code Section 4 1729 } 1730 \end{cfa} 1731 \columnbreak 1732 Signalling thread 1733 \begin{cfa} 1734 mutex(a) { 1735 // Code Section 5 1736 mutex(a, b) { 1737 // Code Section 6 1738 signal(c); 1739 // Code Section 7 1478 1740 } 1479 1480 void insert( Buffer(T) & mutex buffer, T elem ) 1481 with(buffer) { 1482 if ( count == 10 ) `wait( empty )`; 1483 // insert elem into buffer 1484 `signal( full )`; 1485 } 1486 T remove( Buffer(T) & mutex buffer ) with(buffer) { 1487 if ( count == 0 ) `wait( full )`; 1488 // remove elem from buffer 1489 `signal( empty )`; 1490 return elem; 1491 } 1492 } 1493 \end{cfa} 1494 \end{lrbox} 1495 1496 \newbox\myboxB 1497 \begin{lrbox}{\myboxB} 1498 \begin{cfa}[aboveskip=0pt,belowskip=0pt] 1499 forall( otype T ) { // distribute forall 1500 monitor Buffer { 1501 1502 int front, back, count; 1503 T elements[10]; 1504 }; 1505 void ?{}( Buffer(T) & buffer ) with(buffer) { 1506 [front, back, count] = 0; 1507 } 1508 T remove( Buffer(T) & mutex buffer ); // forward 1509 void insert( Buffer(T) & mutex buffer, T elem ) 1510 with(buffer) { 1511 if ( count == 10 ) `waitfor( remove, buffer )`; 1512 // insert elem into buffer 1513 1514 } 1515 T remove( Buffer(T) & mutex buffer ) with(buffer) { 1516 if ( count == 0 ) `waitfor( insert, buffer )`; 1517 // remove elem from buffer 1518 1519 return elem; 1520 } 1521 } 1522 \end{cfa} 1523 \end{lrbox} 1524 1525 \subfloat[Internal Scheduling]{\label{f:BBInt}\usebox\myboxA} 1526 %\qquad 1527 \subfloat[External Scheduling]{\label{f:BBExt}\usebox\myboxB} 1528 \caption{Generic Bounded-Buffer} 1529 \label{f:GenericBoundedBuffer} 1741 // Code Section 8 1742 } 1743 \end{cfa} 1744 \end{multicols} 1745 \begin{cfa}[caption={Equivalent \CFA code for listing \ref{f:int-bulk-cfa}},label={f:int-bulk-cfa}] 1746 \end{cfa} 1747 \begin{multicols}{2} 1748 Waiter 1749 \begin{cfa}[numbers=left] 1750 acquire A 1751 acquire A & B 1752 wait A & B 1753 release A & B 1754 release A 1755 \end{cfa} 1756 1757 \columnbreak 1758 1759 Signaller 1760 \begin{cfa}[numbers=left, firstnumber=6,escapechar=|] 1761 acquire A 1762 acquire A & B 1763 signal A & B 1764 release A & B 1765 |\label{line:secret}|// Secretly keep B here 1766 release A 1767 // Wakeup waiter and transfer A & B 1768 \end{cfa} 1769 \end{multicols} 1770 \begin{cfa}[caption={Figure~\ref{f:int-bulk-cfa}, with delayed signalling comments},label={f:int-secret}] 1771 \end{cfa} 1530 1772 \end{figure} 1531 1773 1532 Figure~\ref{f:BBExt} shows a \CFA bounded-buffer with external scheduling, where producers/consumers detecting a full/empty buffer block and prevent more producers/consumers from entering the monitor until the buffer has a free/empty slot. 1533 External scheduling is controlled by the @waitfor@ statement, which atomically blocks the calling thread, releases the monitor lock, and restricts the routine calls that can next acquire mutual exclusion. 1534 If the buffer is full, only calls to @remove@ can acquire the buffer, and if the buffer is empty, only calls to @insert@ can acquire the buffer. 1535 Threads making calls to routines that are currently excluded block outside (external) of the monitor on a calling queue, versus blocking on condition queues inside (internal) of the monitor. 1536 % External scheduling is more constrained and explicit, which helps programmers reduce the non-deterministic nature of concurrency. 1537 External scheduling allows users to wait for events from other threads without concern of unrelated events occurring. 1538 The mechnaism can be done in terms of control flow, \eg Ada @accept@ or \uC @_Accept@, or in terms of data, \eg Go channels. 1539 While both mechanisms have strengths and weaknesses, this project uses a control-flow mechanism to stay consistent with other language semantics. 1540 Two challenges specific to \CFA for external scheduling are loose object-definitions (see Section~\ref{s:LooseObjectDefinitions}) and multiple-monitor routines (see Section~\ref{s:Multi-MonitorScheduling}). 1541 1542 For internal scheduling, non-blocking signalling (as in the producer/consumer example) is used when the signaller is providing the cooperation for a waiting thread; 1543 the signaller enters the monitor and changes state, detects a waiting threads that can use the state, performs a non-blocking signal on the condition queue for the waiting thread, and exits the monitor to run concurrently. 1544 The waiter unblocks next, uses/takes the state, and exits the monitor. 1545 Blocking signalling is the reverse, where the waiter is providing the cooperation for the signalling thread; 1546 the signaller enters the monitor, detects a waiting thread providing the necessary state, performs a blocking signal to place it on the urgent queue and unblock the waiter. 1547 The waiter changes state and exits the monitor, and the signaller unblocks next from the urgent queue to use/take the state. 1548 1549 Figure~\ref{f:DatingService} shows a dating service demonstrating the two forms of signalling: non-blocking and blocking. 1550 The dating service matches girl and boy threads with matching compatibility codes so they can exchange phone numbers. 1551 A thread blocks until an appropriate partner arrives. 1552 The complexity is exchanging phone number in the monitor because the monitor mutual-exclusion property prevents exchanging numbers. 1553 For internal scheduling, the @exchange@ condition is necessary to block the thread finding the match, while the matcher unblocks to take the oppose number, post its phone number, and unblock the partner. 1554 For external scheduling, the implicit urgent-condition replaces the explict @exchange@-condition and @signal_block@ puts the finding thread on the urgent condition and unblocks the matcher.. 1555 1556 The dating service is an example of a monitor that cannot be written using external scheduling because it requires knowledge of calling parameters to make scheduling decisions, and parameters of waiting threads are unavailable; 1557 as well, an arriving thread may not find a partner and must wait, which requires a condition variable, and condition variables imply internal scheduling. 1558 1559 \begin{figure} 1560 \centering 1561 \newbox\myboxA 1562 \begin{lrbox}{\myboxA} 1563 \begin{cfa}[aboveskip=0pt,belowskip=0pt] 1564 enum { CCodes = 20 }; 1565 monitor DS { 1566 int GirlPhNo, BoyPhNo; 1567 condition Girls[CCodes], Boys[CCodes]; 1568 condition exchange; 1569 }; 1570 int girl( DS & mutex ds, int phNo, int ccode ) { 1571 if ( is_empty( Boys[ccode] ) ) { 1572 wait( Girls[ccode] ); 1573 GirlPhNo = phNo; 1574 exchange.signal(); 1575 } else { 1576 GirlPhNo = phNo; 1577 signal( Boys[ccode] ); 1578 exchange.wait(); 1579 } // if 1580 return BoyPhNo; 1581 } 1582 int boy( DS & mutex ds, int phNo, int ccode ) { 1583 // as above with boy/girl interchanged 1584 } 1585 \end{cfa} 1586 \end{lrbox} 1587 1588 \newbox\myboxB 1589 \begin{lrbox}{\myboxB} 1590 \begin{cfa}[aboveskip=0pt,belowskip=0pt] 1591 1592 monitor DS { 1593 int GirlPhNo, BoyPhNo; 1594 condition Girls[CCodes], Boys[CCodes]; 1595 1596 }; 1597 int girl( DS & mutex ds, int phNo, int ccode ) { 1598 if ( is_empty( Boys[ccode] ) ) { // no compatible 1599 wait( Girls[ccode] ); // wait for boy 1600 GirlPhNo = phNo; // make phone number available 1601 1602 } else { 1603 GirlPhNo = phNo; // make phone number available 1604 signal_block( Boys[ccode] ); // restart boy 1605 1606 } // if 1607 return BoyPhNo; 1608 } 1609 int boy( DS & mutex ds, int phNo, int ccode ) { 1610 // as above with boy/girl interchanged 1611 } 1612 \end{cfa} 1613 \end{lrbox} 1614 1615 \subfloat[\lstinline@signal@]{\label{f:DatingSignal}\usebox\myboxA} 1616 \qquad 1617 \subfloat[\lstinline@signal_block@]{\label{f:DatingSignalBlock}\usebox\myboxB} 1618 \caption{Dating service. } 1619 \label{f:DatingService} 1620 \end{figure} 1621 1622 Both internal and external scheduling extend to multiple monitors in a natural way. 1623 \begin{cquote} 1624 \begin{tabular}{@{}l@{\hspace{3\parindentlnth}}l@{}} 1625 \begin{cfa} 1626 monitor M { `condition e`; ... }; 1627 void foo( M & mutex m1, M & mutex m2 ) { 1628 ... wait( `e` ); ... // wait( e, m1, m2 ) 1629 ... wait( `e, m1` ); ... 1630 ... wait( `e, m2` ); ... 1631 } 1632 \end{cfa} 1633 & 1634 \begin{cfa} 1635 void rtn$\(_1\)$( M & mutex m1, M & mutex m2 ); 1636 void rtn$\(_2\)$( M & mutex m1 ); 1637 void bar( M & mutex m1, M & mutex m2 ) { 1638 ... waitfor( `rtn` ); ... // $\LstCommentStyle{waitfor( rtn\(_1\), m1, m2 )}$ 1639 ... waitfor( `rtn, m1` ); ... // $\LstCommentStyle{waitfor( rtn\(_2\), m1 )}$ 1640 } 1641 \end{cfa} 1642 \end{tabular} 1643 \end{cquote} 1644 For @wait( e )@, the default semantics is to atomically block the signaller and release all acquired mutex types in the parameter list, \ie @wait( e, m1, m2 )@. 1645 To override the implicit multi-monitor wait, specific mutex parameter(s) can be specified, \eg @wait( e, m1 )@. 1646 Wait statically verifies the released monitors are the acquired mutex-parameters so unconditional release is safe. 1647 Finally, a signaller, 1648 \begin{cfa} 1649 void baz( M & mutex m1, M & mutex m2 ) { 1650 ... signal( e ); ... 1651 } 1652 \end{cfa} 1653 must have acquired monitor locks that are greater than or equal to the number of locks for the waiting thread signalled from the condition queue. 1654 1655 Similarly, for @waitfor( rtn )@, the default semantics is to atomically block the acceptor and release all acquired mutex types in the parameter list, \ie @waitfor( rtn, m1, m2 )@. 1656 To override the implicit multi-monitor wait, specific mutex parameter(s) can be specified, \eg @waitfor( rtn, m1 )@. 1657 Waitfor statically verifies the released monitors are the same as the acquired mutex-parameters of the given routine or routine pointer. 1658 To statically verify the released monitors match with the accepted routine's mutex parameters, the routine (pointer) prototype must be accessible. 1659 1660 Given the ability to release a subset of acquired monitors can result in a \newterm{nested monitor}~\cite{Lister77} deadlock. 1661 \begin{cfa} 1662 void foo( M & mutex m1, M & mutex m2 ) { 1663 ... wait( `e, m1` ); ... $\C{// release m1, keeping m2 acquired )}$ 1664 void bar( M & mutex m1, M & mutex m2 ) { $\C{// must acquire m1 and m2 )}$ 1665 ... signal( `e` ); ... 1666 \end{cfa} 1667 The @wait@ only releases @m1@ so the signalling thread cannot acquire both @m1@ and @m2@ to enter @bar@ to get to the @signal@. 1668 While deadlock issues can occur with multiple/nesting acquisition, this issue results from the fact that locks, and by extension monitors, are not perfectly composable. 1669 1670 Finally, an important aspect of monitor implementation is barging, \ie can calling threads barge ahead of signalled threads? 1671 If barging is allowed, synchronization between a singller and signallee is difficult, often requiring multiple unblock/block cycles (looping around a wait rechecking if a condition is met). 1672 \begin{quote} 1673 However, we decree that a signal operation be followed immediately by resumption of a waiting program, without possibility of an intervening procedure call from yet a third program. 1674 It is only in this way that a waiting program has an absolute guarantee that it can acquire the resource just released by the signalling program without any danger that a third program will interpose a monitor entry and seize the resource instead.~\cite[p.~550]{Hoare74} 1675 \end{quote} 1676 \CFA scheduling \emph{precludes} barging, which simplifies synchronization among threads in the monitor and increases correctness. 1677 For example, there are no loops in either bounded buffer solution in Figure~\ref{f:GenericBoundedBuffer}. 1678 Supporting barging prevention as well as extending internal scheduling to multiple monitors is the main source of complexity in the design and implementation of \CFA concurrency. 1679 1680 1681 \subsection{Barging Prevention} 1682 1683 Figure~\ref{f:BargingPrevention} shows \CFA code where bulk acquire adds complexity to the internal-signalling semantics. 1684 The complexity begins at the end of the inner @mutex@ statement, where the semantics of internal scheduling need to be extended for multiple monitors. 1685 The problem is that bulk acquire is used in the inner @mutex@ statement where one of the monitors is already acquired. 1686 When the signalling thread reaches the end of the inner @mutex@ statement, it should transfer ownership of @m1@ and @m2@ to the waiting threads to prevent barging into the outer @mutex@ statement by another thread. 1687 However, both the signalling and waiting thread W1 still need monitor @m1@. 1688 1689 \begin{figure} 1690 \newbox\myboxA 1691 \begin{lrbox}{\myboxA} 1692 \begin{cfa}[aboveskip=0pt,belowskip=0pt] 1693 monitor M m1, m2; 1694 condition c; 1695 mutex( m1 ) { // $\LstCommentStyle{\color{red}outer}$ 1696 ... 1697 mutex( m1, m2 ) { // $\LstCommentStyle{\color{red}inner}$ 1698 ... `signal( c )`; ... 1699 // m1, m2 acquired 1700 } // $\LstCommentStyle{\color{red}release m2}$ 1701 // m1 acquired 1702 } // release m1 1703 \end{cfa} 1704 \end{lrbox} 1705 1706 \newbox\myboxB 1707 \begin{lrbox}{\myboxB} 1708 \begin{cfa}[aboveskip=0pt,belowskip=0pt] 1709 1710 1711 mutex( m1 ) { 1712 ... 1713 mutex( m1, m2 ) { 1714 ... `wait( c )`; // block and release m1, m2 1715 // m1, m2 acquired 1716 } // $\LstCommentStyle{\color{red}release m2}$ 1717 // m1 acquired 1718 } // release m1 1719 \end{cfa} 1720 \end{lrbox} 1721 1722 \newbox\myboxC 1723 \begin{lrbox}{\myboxC} 1724 \begin{cfa}[aboveskip=0pt,belowskip=0pt] 1725 1726 1727 mutex( m2 ) { 1728 ... `wait( c )`; ... 1729 // m2 acquired 1730 } // $\LstCommentStyle{\color{red}release m2}$ 1731 1732 1733 1734 1735 \end{cfa} 1736 \end{lrbox} 1737 1738 \begin{cquote} 1739 \subfloat[Signalling Thread]{\label{f:SignallingThread}\usebox\myboxA} 1740 \hspace{2\parindentlnth} 1741 \subfloat[Waiting Thread (W1)]{\label{f:WaitingThread}\usebox\myboxB} 1742 \hspace{2\parindentlnth} 1743 \subfloat[Waiting Thread (W2)]{\label{f:OtherWaitingThread}\usebox\myboxC} 1744 \end{cquote} 1745 \caption{Barging Prevention} 1746 \label{f:BargingPrevention} 1747 \end{figure} 1748 1749 One scheduling solution is for the signaller to keep ownership of all locks until the last lock is ready to be transferred, because this semantics fits most closely to the behaviour of single-monitor scheduling. 1750 However, Figure~\ref{f:OtherWaitingThread} shows this solution is complex depending on other waiters, resulting is choices when the signaller finishes the inner mutex-statement. 1751 The singaller can retain @m2@ until completion of the outer mutex statement and pass the locks to waiter W1, or it can pass @m2@ to waiter W2 after completing the inner mutex-statement, while continuing to hold @m1@. 1752 In the latter case, waiter W2 must eventually pass @m2@ to waiter W1, which is complex because W1 may have waited before W2, so W2 is unaware of it. 1753 Furthermore, there is an execution sequence where the signaller always finds waiter W2, and hence, waiter W1 starves. 1754 1755 While a number of approaches were examined~\cite[\S~4.3]{Delisle18}, the solution chosen for \CFA is a novel techique called \newterm{partial signalling}. 1756 Signalled threads are moved to an urgent queue and the waiter at the front defines the set of monitors necessary for it to unblock. 1757 Partial signalling transfers ownership of monitors to the front waiter. 1758 When the signaller thread exits or waits in the monitor the front waiter is unblocked if all its monitors are released. 1759 This solution has the benefit that complexity is encapsulated into only two actions: passing monitors to the next owner when they should be released and conditionally waking threads if all conditions are met. 1760 1761 \begin{comment} 1774 The complexity begins at code sections 4 and 8 in listing \ref{f:int-bulk-cfa}, which are where the existing semantics of internal scheduling needs to be extended for multiple monitors. 1775 The root of the problem is that \textbf{bulk-acq} is used in a context where one of the monitors is already acquired, which is why it is important to define the behaviour of the previous cfa-code. 1776 When the signaller thread reaches the location where it should ``release @A & B@'' (listing \ref{f:int-bulk-cfa} line \ref{line:releaseFirst}), it must actually transfer ownership of monitor @B@ to the waiting thread. 1777 This ownership transfer is required in order to prevent barging into @B@ by another thread, since both the signalling and signalled threads still need monitor @A@. 1778 There are three options: 1779 1780 \subsubsection{Delaying Signals} 1781 The obvious solution to the problem of multi-monitor scheduling is to keep ownership of all locks until the last lock is ready to be transferred. 1782 It can be argued that that moment is when the last lock is no longer needed, because this semantics fits most closely to the behaviour of single-monitor scheduling. 1783 This solution has the main benefit of transferring ownership of groups of monitors, which simplifies the semantics from multiple objects to a single group of objects, effectively making the existing single-monitor semantic viable by simply changing monitors to monitor groups. 1784 This solution releases the monitors once every monitor in a group can be released. 1785 However, since some monitors are never released (\eg the monitor of a thread), this interpretation means a group might never be released. 1786 A more interesting interpretation is to transfer the group until all its monitors are released, which means the group is not passed further and a thread can retain its locks. 1787 1788 However, listing \ref{f:int-secret} shows this solution can become much more complicated depending on what is executed while secretly holding B at line \ref{line:secret}, while avoiding the need to transfer ownership of a subset of the condition monitors. 1762 1789 Figure~\ref{f:dependency} shows a slightly different example where a third thread is waiting on monitor @A@, using a different condition variable. 1763 1790 Because the third thread is signalled when secretly holding @B@, the goal becomes unreachable. … … 1773 1800 In both cases, the threads need to be able to distinguish, on a per monitor basis, which ones need to be released and which ones need to be transferred, which means knowing when to release a group becomes complex and inefficient (see next section) and therefore effectively precludes this approach. 1774 1801 1775 1776 1802 \subsubsection{Dependency graphs} 1803 1777 1804 1778 1805 \begin{figure} … … 1851 1878 The extra challenge is that this dependency graph is effectively post-mortem, but the runtime system needs to be able to build and solve these graphs as the dependencies unfold. 1852 1879 Resolving dependency graphs being a complex and expensive endeavour, this solution is not the preferred one. 1853 \end{comment} 1854 1855 1856 \begin{comment} 1880 1881 \subsubsection{Partial Signalling} \label{partial-sig} 1882 Finally, the solution that is chosen for \CFA is to use partial signalling. 1883 Again using listing \ref{f:int-bulk-cfa}, the partial signalling solution transfers ownership of monitor @B@ at lines \ref{line:signal1} to the waiter but does not wake the waiting thread since it is still using monitor @A@. 1884 Only when it reaches line \ref{line:lastRelease} does it actually wake up the waiting thread. 1885 This solution has the benefit that complexity is encapsulated into only two actions: passing monitors to the next owner when they should be released and conditionally waking threads if all conditions are met. 1886 This solution has a much simpler implementation than a dependency graph solving algorithms, which is why it was chosen. 1887 Furthermore, after being fully implemented, this solution does not appear to have any significant downsides. 1888 1889 Using partial signalling, listing \ref{f:dependency} can be solved easily: 1890 \begin{itemize} 1891 \item When thread $\gamma$ reaches line \ref{line:release-ab} it transfers monitor @B@ to thread $\alpha$ and continues to hold monitor @A@. 1892 \item When thread $\gamma$ reaches line \ref{line:release-a} it transfers monitor @A@ to thread $\beta$ and wakes it up. 1893 \item When thread $\beta$ reaches line \ref{line:release-aa} it transfers monitor @A@ to thread $\alpha$ and wakes it up. 1894 \end{itemize} 1895 1896 % ====================================================================== 1897 % ====================================================================== 1898 \subsection{Signalling: Now or Later} 1899 % ====================================================================== 1900 % ====================================================================== 1901 \begin{table} 1902 \begin{tabular}{|c|c|} 1903 @signal@ & @signal_block@ \\ 1904 \hline 1905 \begin{cfa}[tabsize=3] 1906 monitor DatingService { 1907 // compatibility codes 1908 enum{ CCodes = 20 }; 1909 1910 int girlPhoneNo 1911 int boyPhoneNo; 1912 }; 1913 1914 condition girls[CCodes]; 1915 condition boys [CCodes]; 1916 condition exchange; 1917 1918 int girl(int phoneNo, int cfa) { 1919 // no compatible boy ? 1920 if(empty(boys[cfa])) { 1921 wait(girls[cfa]); // wait for boy 1922 girlPhoneNo = phoneNo; // make phone number available 1923 signal(exchange); // wake boy from chair 1924 } else { 1925 girlPhoneNo = phoneNo; // make phone number available 1926 signal(boys[cfa]); // wake boy 1927 wait(exchange); // sit in chair 1928 } 1929 return boyPhoneNo; 1930 } 1931 int boy(int phoneNo, int cfa) { 1932 // same as above 1933 // with boy/girl interchanged 1934 } 1935 \end{cfa}&\begin{cfa}[tabsize=3] 1936 monitor DatingService { 1937 1938 enum{ CCodes = 20 }; // compatibility codes 1939 1940 int girlPhoneNo; 1941 int boyPhoneNo; 1942 }; 1943 1944 condition girls[CCodes]; 1945 condition boys [CCodes]; 1946 // exchange is not needed 1947 1948 int girl(int phoneNo, int cfa) { 1949 // no compatible boy ? 1950 if(empty(boys[cfa])) { 1951 wait(girls[cfa]); // wait for boy 1952 girlPhoneNo = phoneNo; // make phone number available 1953 signal(exchange); // wake boy from chair 1954 } else { 1955 girlPhoneNo = phoneNo; // make phone number available 1956 signal_block(boys[cfa]); // wake boy 1957 1958 // second handshake unnecessary 1959 1960 } 1961 return boyPhoneNo; 1962 } 1963 1964 int boy(int phoneNo, int cfa) { 1965 // same as above 1966 // with boy/girl interchanged 1967 } 1968 \end{cfa} 1969 \end{tabular} 1970 \caption{Dating service example using \protect\lstinline|signal| and \protect\lstinline|signal_block|. } 1971 \label{tbl:datingservice} 1972 \end{table} 1973 An important note is that, until now, signalling a monitor was a delayed operation. 1974 The ownership of the monitor is transferred only when the monitor would have otherwise been released, not at the point of the @signal@ statement. 1975 However, in some cases, it may be more convenient for users to immediately transfer ownership to the thread that is waiting for cooperation, which is achieved using the @signal_block@ routine. 1976 1977 The example in table \ref{tbl:datingservice} highlights the difference in behaviour. 1978 As mentioned, @signal@ only transfers ownership once the current critical section exits; this behaviour requires additional synchronization when a two-way handshake is needed. 1979 To avoid this explicit synchronization, the @condition@ type offers the @signal_block@ routine, which handles the two-way handshake as shown in the example. 1980 This feature removes the need for a second condition variables and simplifies programming. 1981 Like every other monitor semantic, @signal_block@ uses barging prevention, which means mutual-exclusion is baton-passed both on the front end and the back end of the call to @signal_block@, meaning no other thread can acquire the monitor either before or after the call. 1982 1983 % ====================================================================== 1984 % ====================================================================== 1857 1985 \section{External scheduling} \label{extsched} 1858 1986 % ====================================================================== 1987 % ====================================================================== 1988 An alternative to internal scheduling is external scheduling (see Table~\ref{tbl:sched}). 1859 1989 \begin{table} 1860 1990 \begin{tabular}{|c|c|c|} … … 1920 2050 \label{tbl:sched} 1921 2051 \end{table} 2052 This method is more constrained and explicit, which helps users reduce the non-deterministic nature of concurrency. 2053 Indeed, as the following examples demonstrate, external scheduling allows users to wait for events from other threads without the concern of unrelated events occurring. 2054 External scheduling can generally be done either in terms of control flow (\eg Ada with @accept@, \uC with @_Accept@) or in terms of data (\eg Go with channels). 2055 Of course, both of these paradigms have their own strengths and weaknesses, but for this project, control-flow semantics was chosen to stay consistent with the rest of the languages semantics. 2056 Two challenges specific to \CFA arise when trying to add external scheduling with loose object definitions and multiple-monitor routines. 2057 The previous example shows a simple use @_Accept@ versus @wait@/@signal@ and its advantages. 2058 Note that while other languages often use @accept@/@select@ as the core external scheduling keyword, \CFA uses @waitfor@ to prevent name collisions with existing socket \textbf{api}s. 1922 2059 1923 2060 For the @P@ member above using internal scheduling, the call to @wait@ only guarantees that @V@ is the last routine to access the monitor, allowing a third routine, say @isInUse()@, acquire mutual exclusion several times while routine @P@ is waiting. 1924 2061 On the other hand, external scheduling guarantees that while routine @P@ is waiting, no other routine than @V@ can acquire the monitor. 1925 \end{comment} 1926 1927 2062 2063 % ====================================================================== 2064 % ====================================================================== 1928 2065 \subsection{Loose Object Definitions} 1929 \label{s:LooseObjectDefinitions} 1930 1931 In an object-oriented programming-language, a class includes an exhaustive list of operations. 1932 However, new members can be added via static inheritance or dynaic members, \eg JavaScript~\cite{JavaScript}. 1933 Similarly, monitor routines can be added at any time in \CFA, making it less clear for programmers and more difficult to implement. 1934 \begin{cfa} 1935 monitor M {}; 1936 void `f`( M & mutex m ); 1937 void g( M & mutex m ) { waitfor( `f` ); } $\C{// clear which f}$ 1938 void `f`( M & mutex m, int ); $\C{// different f}$ 1939 void h( M & mutex m ) { waitfor( `f` ); } $\C{// unclear which f}$ 1940 \end{cfa} 1941 Hence, the cfa-code for the entering a monitor looks like: 1942 \begin{cfa} 1943 if ( $\textrm{\textit{monitor is free}}$ ) $\LstCommentStyle{// \color{red}enter}$ 1944 else if ( $\textrm{\textit{already own monitor}}$ ) $\LstCommentStyle{// \color{red}continue}$ 1945 else if ( $\textrm{\textit{monitor accepts me}}$ ) $\LstCommentStyle{// \color{red}enter}$ 1946 else $\LstCommentStyle{// \color{red}block}$ 1947 \end{cfa} 2066 % ====================================================================== 2067 % ====================================================================== 2068 In \uC, a monitor class declaration includes an exhaustive list of monitor operations. 2069 Since \CFA is not object oriented, monitors become both more difficult to implement and less clear for a user: 2070 2071 \begin{cfa} 2072 monitor A {}; 2073 2074 void f(A & mutex a); 2075 void g(A & mutex a) { 2076 waitfor(f); // Obvious which f() to wait for 2077 } 2078 2079 void f(A & mutex a, int); // New different F added in scope 2080 void h(A & mutex a) { 2081 waitfor(f); // Less obvious which f() to wait for 2082 } 2083 \end{cfa} 2084 2085 Furthermore, external scheduling is an example where implementation constraints become visible from the interface. 2086 Here is the cfa-code for the entering phase of a monitor: 2087 \begin{center} 2088 \begin{tabular}{l} 2089 \begin{cfa} 2090 if monitor is free 2091 enter 2092 elif already own the monitor 2093 continue 2094 elif monitor accepts me 2095 enter 2096 else 2097 block 2098 \end{cfa} 2099 \end{tabular} 2100 \end{center} 1948 2101 For the first two conditions, it is easy to implement a check that can evaluate the condition in a few instructions. 1949 However, a fast check for \emph{monitor accepts me}is much harder to implement depending on the constraints put on the monitors.1950 Figure~\ref{fig:ClassicalMonitor} shows monitors are often expressed as an entry (calling) queue, some acceptor queues, and an urgent stack/queue.2102 However, a fast check for @monitor accepts me@ is much harder to implement depending on the constraints put on the monitors. 2103 Indeed, monitors are often expressed as an entry queue and some acceptor queue as in Figure~\ref{fig:ClassicalMonitor}. 1951 2104 1952 2105 \begin{figure} 1953 2106 \centering 1954 \subfloat[Classical monitor] {2107 \subfloat[Classical Monitor] { 1955 2108 \label{fig:ClassicalMonitor} 1956 {\resizebox{0.45\textwidth}{!}{\input{monitor .pstex_t}}}2109 {\resizebox{0.45\textwidth}{!}{\input{monitor}}} 1957 2110 }% subfloat 1958 \q uad1959 \subfloat[ Bulk acquire monitor] {2111 \qquad 2112 \subfloat[\textbf{bulk-acq} Monitor] { 1960 2113 \label{fig:BulkMonitor} 1961 {\resizebox{0.45\textwidth}{!}{\input{ext_monitor .pstex_t}}}2114 {\resizebox{0.45\textwidth}{!}{\input{ext_monitor}}} 1962 2115 }% subfloat 1963 \caption{Monitor Implementation} 1964 \label{f:MonitorImplementation} 2116 \caption{External Scheduling Monitor} 1965 2117 \end{figure} 1966 2118 1967 For a fixed (small) number of mutex routines (\eg 128), the accept check reduces to a bitmask of allowed callers, which can be checked with a single instruction. 1968 This approach requires a unique dense ordering of routines with a small upper-bound and the ordering must be consistent across translation units. 1969 For object-oriented languages these constraints are common, but \CFA mutex routines can be added in any scope and are only visible in certain translation unit, precluding program-wide dense-ordering among mutex routines. 1970 1971 Figure~\ref{fig:BulkMonitor} shows the \CFA monitor implementation. 1972 The mutex routine called is associated with each thread on the entry queue, while a list of acceptable routines is kept separately. 1973 The accepted list is a variable-sized array of accepted routine pointers, so the single instruction bitmask comparison is replaced by dereferencing a pointer followed by a linear search. 1974 1975 \begin{comment} 2119 There are other alternatives to these pictures, but in the case of the left picture, implementing a fast accept check is relatively easy. 2120 Restricted to a fixed number of mutex members, N, the accept check reduces to updating a bitmask when the acceptor queue changes, a check that executes in a single instruction even with a fairly large number (\eg 128) of mutex members. 2121 This approach requires a unique dense ordering of routines with an upper-bound and that ordering must be consistent across translation units. 2122 For OO languages these constraints are common, since objects only offer adding member routines consistently across translation units via inheritance. 2123 However, in \CFA users can extend objects with mutex routines that are only visible in certain translation unit. 2124 This means that establishing a program-wide dense-ordering among mutex routines can only be done in the program linking phase, and still could have issues when using dynamically shared objects. 2125 2126 The alternative is to alter the implementation as in Figure~\ref{fig:BulkMonitor}. 2127 Here, the mutex routine called is associated with a thread on the entry queue while a list of acceptable routines is kept separate. 2128 Generating a mask dynamically means that the storage for the mask information can vary between calls to @waitfor@, allowing for more flexibility and extensions. 2129 Storing an array of accepted function pointers replaces the single instruction bitmask comparison with dereferencing a pointer followed by a linear search. 2130 Furthermore, supporting nested external scheduling (\eg listing \ref{f:nest-ext}) may now require additional searches for the @waitfor@ statement to check if a routine is already queued. 2131 1976 2132 \begin{figure} 1977 2133 \begin{cfa}[caption={Example of nested external scheduling},label={f:nest-ext}] … … 1989 2145 \end{figure} 1990 2146 1991 Note that in the right picture, tasks need to always keep track of the monitors associated with mutex routines, and the routine mask needs to have both a routinepointer and a set of monitors, as is discussed in the next section.2147 Note that in the right picture, tasks need to always keep track of the monitors associated with mutex routines, and the routine mask needs to have both a function pointer and a set of monitors, as is discussed in the next section. 1992 2148 These details are omitted from the picture for the sake of simplicity. 1993 2149 … … 1997 2153 In the end, the most flexible approach has been chosen since it allows users to write programs that would otherwise be hard to write. 1998 2154 This decision is based on the assumption that writing fast but inflexible locks is closer to a solved problem than writing locks that are as flexible as external scheduling in \CFA. 1999 \end{comment} 2000 2001 2155 2156 % ====================================================================== 2157 % ====================================================================== 2002 2158 \subsection{Multi-Monitor Scheduling} 2003 \label{s:Multi-MonitorScheduling} 2159 % ====================================================================== 2160 % ====================================================================== 2004 2161 2005 2162 External scheduling, like internal scheduling, becomes significantly more complex when introducing multi-monitor syntax. 2006 Even in the simplest possible case, new semantics needs to be established:2163 Even in the simplest possible case, some new semantics needs to be established: 2007 2164 \begin{cfa} 2008 2165 monitor M {}; 2009 void f( M & mutex m1 ); 2010 void g( M & mutex m1, M & mutex m2 ) { 2011 waitfor( f ); $\C{// pass m1 or m2 to f?}$ 2012 } 2013 \end{cfa} 2014 The solution is for the programmer to disambiguate: 2015 \begin{cfa} 2016 waitfor( f, m2 ); $\C{// wait for call to f with argument m2}$ 2017 \end{cfa} 2018 Routine @g@ has acquired both locks, so when routine @f@ is called, the lock for monitor @m2@ is passed from @g@ to @f@ (while @g@ still holds lock @m1@). 2019 This behaviour can be extended to the multi-monitor @waitfor@ statement. 2166 2167 void f(M & mutex a); 2168 2169 void g(M & mutex b, M & mutex c) { 2170 waitfor(f); // two monitors M => unknown which to pass to f(M & mutex) 2171 } 2172 \end{cfa} 2173 The obvious solution is to specify the correct monitor as follows: 2174 2020 2175 \begin{cfa} 2021 2176 monitor M {}; 2022 void f( M & mutex m1, M & mutex m2 ); 2023 void g( M & mutex m1, M & mutex m2 ) { 2024 waitfor( f, m1, m2 ); $\C{// wait for call to f with arguments m1 and m2}$ 2025 } 2026 \end{cfa} 2027 Again, the set of monitors passed to the @waitfor@ statement must be entirely contained in the set of monitors already acquired by accepting routine. 2177 2178 void f(M & mutex a); 2179 2180 void g(M & mutex a, M & mutex b) { 2181 // wait for call to f with argument b 2182 waitfor(f, b); 2183 } 2184 \end{cfa} 2185 This syntax is unambiguous. 2186 Both locks are acquired and kept by @g@. 2187 When routine @f@ is called, the lock for monitor @b@ is temporarily transferred from @g@ to @f@ (while @g@ still holds lock @a@). 2188 This behaviour can be extended to the multi-monitor @waitfor@ statement as follows. 2189 2190 \begin{cfa} 2191 monitor M {}; 2192 2193 void f(M & mutex a, M & mutex b); 2194 2195 void g(M & mutex a, M & mutex b) { 2196 // wait for call to f with arguments a and b 2197 waitfor(f, a, b); 2198 } 2199 \end{cfa} 2200 2201 Note that the set of monitors passed to the @waitfor@ statement must be entirely contained in the set of monitors already acquired in the routine. @waitfor@ used in any other context is undefined behaviour. 2028 2202 2029 2203 An important behaviour to note is when a set of monitors only match partially: 2204 2030 2205 \begin{cfa} 2031 2206 mutex struct A {}; 2207 2032 2208 mutex struct B {}; 2033 void g( A & mutex m1, B & mutex m2 ) { 2034 waitfor( f, m1, m2 ); 2035 } 2209 2210 void g(A & mutex a, B & mutex b) { 2211 waitfor(f, a, b); 2212 } 2213 2036 2214 A a1, a2; 2037 2215 B b; 2216 2038 2217 void foo() { 2039 g( a1, b ); // block on accept 2040 } 2218 g(a1, b); // block on accept 2219 } 2220 2041 2221 void bar() { 2042 f( a2, b); // fulfill cooperation2222 f(a2, b); // fulfill cooperation 2043 2223 } 2044 2224 \end{cfa} … … 2047 2227 It is also important to note that in the case of external scheduling the order of parameters is irrelevant; @waitfor(f,a,b)@ and @waitfor(f,b,a)@ are indistinguishable waiting condition. 2048 2228 2049 2229 % ====================================================================== 2230 % ====================================================================== 2050 2231 \subsection{\protect\lstinline|waitfor| Semantics} 2051 2052 Syntactically, the @waitfor@ statement takes a routine identifier and a set of monitors. 2053 While the set of monitors can be any list of expressions, the routine name is more restricted because the compiler validates at compile time the validity of the routine type and the parameters used with the @waitfor@ statement. 2054 It checks that the set of monitors passed in matches the requirements for a routine call. 2232 % ====================================================================== 2233 % ====================================================================== 2234 2235 Syntactically, the @waitfor@ statement takes a function identifier and a set of monitors. 2236 While the set of monitors can be any list of expressions, the function name is more restricted because the compiler validates at compile time the validity of the function type and the parameters used with the @waitfor@ statement. 2237 It checks that the set of monitors passed in matches the requirements for a function call. 2055 2238 Figure~\ref{f:waitfor} shows various usages of the waitfor statement and which are acceptable. 2056 The choice of the routinetype is made ignoring any non-@mutex@ parameter.2239 The choice of the function type is made ignoring any non-@mutex@ parameter. 2057 2240 One limitation of the current implementation is that it does not handle overloading, but overloading is possible. 2058 2241 \begin{figure} … … 2080 2263 waitfor(f2, a1, a2); // Incorrect : Mutex arguments don't match 2081 2264 waitfor(f1, 1); // Incorrect : 1 not a mutex argument 2082 waitfor(f9, a1); // Incorrect : f9 routinedoes not exist2265 waitfor(f9, a1); // Incorrect : f9 function does not exist 2083 2266 waitfor(*fp, a1 ); // Incorrect : fp not an identifier 2084 2267 waitfor(f4, a1); // Incorrect : f4 ambiguous … … 2090 2273 2091 2274 Finally, for added flexibility, \CFA supports constructing a complex @waitfor@ statement using the @or@, @timeout@ and @else@. 2092 Indeed, multiple @waitfor@ clauses can be chained together using @or@; this chain forms a single statement that uses baton pass to any routine that fits one of the routine+monitor set passed in.2093 To enable users to tell which accepted routineexecuted, @waitfor@s are followed by a statement (including the null statement @;@) or a compound statement, which is executed after the clause is triggered.2094 A @waitfor@ chain can also be followed by a @timeout@, to signify an upper bound on the wait, or an @else@, to signify that the call should be non-blocking, which checks for a matching routinecall already arrived and otherwise continues.2275 Indeed, multiple @waitfor@ clauses can be chained together using @or@; this chain forms a single statement that uses baton pass to any function that fits one of the function+monitor set passed in. 2276 To enable users to tell which accepted function executed, @waitfor@s are followed by a statement (including the null statement @;@) or a compound statement, which is executed after the clause is triggered. 2277 A @waitfor@ chain can also be followed by a @timeout@, to signify an upper bound on the wait, or an @else@, to signify that the call should be non-blocking, which checks for a matching function call already arrived and otherwise continues. 2095 2278 Any and all of these clauses can be preceded by a @when@ condition to dynamically toggle the accept clauses on or off based on some current state. 2096 2279 Figure~\ref{f:waitfor2} demonstrates several complex masks and some incorrect ones. … … 2147 2330 \end{figure} 2148 2331 2149 2332 % ====================================================================== 2333 % ====================================================================== 2150 2334 \subsection{Waiting For The Destructor} 2151 2335 % ====================================================================== 2336 % ====================================================================== 2152 2337 An interesting use for the @waitfor@ statement is destructor semantics. 2153 2338 Indeed, the @waitfor@ statement can accept any @mutex@ routine, which includes the destructor (see section \ref{data}). … … 2176 2361 2177 2362 2363 % ###### # ###### # # # ####### # ### ##### # # 2364 % # # # # # # # # # # # # # # # ## ## 2365 % # # # # # # # # # # # # # # # # # # 2366 % ###### # # ###### # # # # ##### # # ##### # # # 2367 % # ####### # # ####### # # # # # # # # 2368 % # # # # # # # # # # # # # # # # 2369 % # # # # # # # ####### ####### ####### ####### ### ##### # # 2178 2370 \section{Parallelism} 2179 2180 2371 Historically, computer performance was about processor speeds and instruction counts. 2181 2372 However, with heat dissipation being a direct consequence of speed increase, parallelism has become the new source for increased performance~\cite{Sutter05, Sutter05b}. … … 2187 2378 While there are many variations of the presented paradigms, most of these variations do not actually change the guarantees or the semantics, they simply move costs in order to achieve better performance for certain workloads. 2188 2379 2189 2190 2380 \section{Paradigms} 2191 2192 2193 2381 \subsection{User-Level Threads} 2194 2195 2382 A direct improvement on the \textbf{kthread} approach is to use \textbf{uthread}. 2196 2383 These threads offer most of the same features that the operating system already provides but can be used on a much larger scale. … … 2201 2388 Examples of languages that support \textbf{uthread} are Erlang~\cite{Erlang} and \uC~\cite{uC++book}. 2202 2389 2203 2204 2390 \subsection{Fibers : User-Level Threads Without Preemption} \label{fibers} 2205 2206 2391 A popular variant of \textbf{uthread} is what is often referred to as \textbf{fiber}. 2207 2392 However, \textbf{fiber} do not present meaningful semantic differences with \textbf{uthread}. … … 2212 2397 An example of a language that uses fibers is Go~\cite{Go} 2213 2398 2214 2215 2399 \subsection{Jobs and Thread Pools} 2216 2217 2400 An approach on the opposite end of the spectrum is to base parallelism on \textbf{pool}. 2218 2401 Indeed, \textbf{pool} offer limited flexibility but at the benefit of a simpler user interface. … … 2225 2408 The gold standard of this implementation is Intel's TBB library~\cite{TBB}. 2226 2409 2227 2228 2410 \subsection{Paradigm Performance} 2229 2230 2411 While the choice between the three paradigms listed above may have significant performance implications, it is difficult to pin down the performance implications of choosing a model at the language level. 2231 2412 Indeed, in many situations one of these paradigms may show better performance but it all strongly depends on the workload. … … 2235 2416 Finally, if the units of uninterrupted work are large, enough the paradigm choice is largely amortized by the actual work done. 2236 2417 2237 2238 2418 \section{The \protect\CFA\ Kernel : Processors, Clusters and Threads}\label{kernel} 2239 2240 2419 A \textbf{cfacluster} is a group of \textbf{kthread} executed in isolation. \textbf{uthread} are scheduled on the \textbf{kthread} of a given \textbf{cfacluster}, allowing organization between \textbf{uthread} and \textbf{kthread}. 2241 2420 It is important that \textbf{kthread} belonging to a same \textbf{cfacluster} have homogeneous settings, otherwise migrating a \textbf{uthread} from one \textbf{kthread} to the other can cause issues. … … 2245 2424 Currently \CFA only supports one \textbf{cfacluster}, the initial one. 2246 2425 2247 2248 2426 \subsection{Future Work: Machine Setup}\label{machine} 2249 2250 2427 While this was not done in the context of this paper, another important aspect of clusters is affinity. 2251 2428 While many common desktop and laptop PCs have homogeneous CPUs, other devices often have more heterogeneous setups. … … 2253 2430 OS support for CPU affinity is now common~\cite{affinityLinux, affinityWindows, affinityFreebsd, affinityNetbsd, affinityMacosx}, which means it is both possible and desirable for \CFA to offer an abstraction mechanism for portable CPU affinity. 2254 2431 2255 2256 2432 \subsection{Paradigms}\label{cfaparadigms} 2257 2258 2433 Given these building blocks, it is possible to reproduce all three of the popular paradigms. 2259 2434 Indeed, \textbf{uthread} is the default paradigm in \CFA. … … 2263 2438 2264 2439 2440 2265 2441 \section{Behind the Scenes} 2266 2267 2442 There are several challenges specific to \CFA when implementing concurrency. 2268 These challenges are a direct result of bulk acquireand loose object definitions.2443 These challenges are a direct result of \textbf{bulk-acq} and loose object definitions. 2269 2444 These two constraints are the root cause of most design decisions in the implementation. 2270 2445 Furthermore, to avoid contention from dynamically allocating memory in a concurrent environment, the internal-scheduling design is (almost) entirely free of mallocs. … … 2274 2449 The main memory concern for concurrency is queues. 2275 2450 All blocking operations are made by parking threads onto queues and all queues are designed with intrusive nodes, where each node has pre-allocated link fields for chaining, to avoid the need for memory allocation. 2276 Since several concurrency operations can use an unbound amount of memory (depending on bulk acquire), statically defining information in the intrusive fields of threads is insufficient.The only way to use a variable amount of memory without requiring memory allocation is to pre-allocate large buffers of memory eagerly and store the information in these buffers.2451 Since several concurrency operations can use an unbound amount of memory (depending on \textbf{bulk-acq}), statically defining information in the intrusive fields of threads is insufficient.The only way to use a variable amount of memory without requiring memory allocation is to pre-allocate large buffers of memory eagerly and store the information in these buffers. 2277 2452 Conveniently, the call stack fits that description and is easy to use, which is why it is used heavily in the implementation of internal scheduling, particularly variable-length arrays. 2278 2453 Since stack allocation is based on scopes, the first step of the implementation is to identify the scopes that are available to store the information, and which of these can have a variable-length array. 2279 2454 The threads and the condition both have a fixed amount of memory, while @mutex@ routines and blocking calls allow for an unbound amount, within the stack size. 2280 2455 2281 Note that since the major contributions of this paper are extending monitor semantics to bulk acquire and loose object definitions, any challenges that are not resulting of these characteristics of \CFA are considered as solved problems and therefore not discussed. 2282 2283 2456 Note that since the major contributions of this paper are extending monitor semantics to \textbf{bulk-acq} and loose object definitions, any challenges that are not resulting of these characteristics of \CFA are considered as solved problems and therefore not discussed. 2457 2458 % ====================================================================== 2459 % ====================================================================== 2284 2460 \section{Mutex Routines} 2461 % ====================================================================== 2462 % ====================================================================== 2285 2463 2286 2464 The first step towards the monitor implementation is simple @mutex@ routines. … … 2317 2495 \end{figure} 2318 2496 2319 2320 2497 \subsection{Details: Interaction with polymorphism} 2321 2322 2498 Depending on the choice of semantics for when monitor locks are acquired, interaction between monitors and \CFA's concept of polymorphism can be more complex to support. 2323 2499 However, it is shown that entry-point locking solves most of the issues. … … 2388 2564 void foo(T * mutex t); 2389 2565 2390 // Correct: this routineonly works on monitors (any monitor)2566 // Correct: this function only works on monitors (any monitor) 2391 2567 forall(dtype T | is_monitor(T)) 2392 2568 void bar(T * mutex t)); … … 2395 2571 Both entry point and \textbf{callsite-locking} are feasible implementations. 2396 2572 The current \CFA implementation uses entry-point locking because it requires less work when using \textbf{raii}, effectively transferring the burden of implementation to object construction/destruction. 2397 It is harder to use \textbf{raii} for call-site locking, as it does not necessarily have an existing scope that matches exactly the scope of the mutual exclusion, \ie the routinebody.2573 It is harder to use \textbf{raii} for call-site locking, as it does not necessarily have an existing scope that matches exactly the scope of the mutual exclusion, \ie the function body. 2398 2574 For example, the monitor call can appear in the middle of an expression. 2399 2575 Furthermore, entry-point locking requires less code generation since any useful routine is called multiple times but there is only one entry point for many call sites. 2400 2576 2401 2577 % ====================================================================== 2578 % ====================================================================== 2402 2579 \section{Threading} \label{impl:thread} 2580 % ====================================================================== 2581 % ====================================================================== 2403 2582 2404 2583 Figure \ref{fig:system1} shows a high-level picture if the \CFA runtime system in regards to concurrency. … … 2413 2592 \end{figure} 2414 2593 2415 2416 2594 \subsection{Processors} 2417 2418 2595 Parallelism in \CFA is built around using processors to specify how much parallelism is desired. \CFA processors are object wrappers around kernel threads, specifically @pthread@s in the current implementation of \CFA. 2419 2596 Indeed, any parallelism must go through operating-system libraries. … … 2423 2600 Processors internally use coroutines to take advantage of the existing context-switching semantics. 2424 2601 2425 2426 2602 \subsection{Stack Management} 2427 2428 2603 One of the challenges of this system is to reduce the footprint as much as possible. 2429 2604 Specifically, all @pthread@s created also have a stack created with them, which should be used as much as possible. … … 2432 2607 In order to respect C user expectations, the stack of the initial kernel thread, the main stack of the program, is used by the main user thread rather than the main processor, which can grow very large. 2433 2608 2434 2435 2609 \subsection{Context Switching} 2436 2437 2610 As mentioned in section \ref{coroutine}, coroutines are a stepping stone for implementing threading, because they share the same mechanism for context-switching between different stacks. 2438 To improve performance and simplicity, context-switching is implemented using the following assumption: all context-switches happen inside a specific routinecall.2611 To improve performance and simplicity, context-switching is implemented using the following assumption: all context-switches happen inside a specific function call. 2439 2612 This assumption means that the context-switch only has to copy the callee-saved registers onto the stack and then switch the stack registers with the ones of the target coroutine/thread. 2440 Note that the instruction pointer can be left untouched since the context-switch is always inside the same routine2613 Note that the instruction pointer can be left untouched since the context-switch is always inside the same function. 2441 2614 Threads, however, do not context-switch between each other directly. 2442 2615 They context-switch to the scheduler. … … 2448 2621 This option is not currently present in \CFA, but the changes required to add it are strictly additive. 2449 2622 2450 2451 2623 \subsection{Preemption} \label{preemption} 2452 2453 2624 Finally, an important aspect for any complete threading system is preemption. 2454 2625 As mentioned in section \ref{basics}, preemption introduces an extra degree of uncertainty, which enables users to have multiple threads interleave transparently, rather than having to cooperate among threads for proper scheduling and CPU distribution. … … 2477 2648 As a result, a signal handler can start on one kernel thread and terminate on a second kernel thread (but the same user thread). 2478 2649 It is important to note that signal handlers save and restore signal masks because user-thread migration can cause a signal mask to migrate from one kernel thread to another. 2479 This behaviour is only a problem if all kernel threads, among which a user thread can migrate, differ in terms of signal masks\footnote{Sadly, official POSIX documentation is silent on what distinguishes ``async-signal-safe'' routines from other routines}.2650 This behaviour is only a problem if all kernel threads, among which a user thread can migrate, differ in terms of signal masks\footnote{Sadly, official POSIX documentation is silent on what distinguishes ``async-signal-safe'' functions from other functions.}. 2480 2651 However, since the kernel thread handling preemption requires a different signal mask, executing user threads on the kernel-alarm thread can cause deadlocks. 2481 2652 For this reason, the alarm thread is in a tight loop around a system call to @sigwaitinfo@, requiring very little CPU time for preemption. … … 2484 2655 Indeed, @sigwait@ can differentiate signals sent from @pthread_sigqueue@ from signals sent from alarms or the kernel. 2485 2656 2486 2487 2657 \subsection{Scheduler} 2488 2658 Finally, an aspect that was not mentioned yet is the scheduling algorithm. … … 2490 2660 Further discussion on scheduling is present in section \ref{futur:sched}. 2491 2661 2492 2662 % ====================================================================== 2663 % ====================================================================== 2493 2664 \section{Internal Scheduling} \label{impl:intsched} 2494 2665 % ====================================================================== 2666 % ====================================================================== 2495 2667 The following figure is the traditional illustration of a monitor (repeated from page~\pageref{fig:ClassicalMonitor} for convenience): 2496 2668 2497 2669 \begin{figure} 2498 2670 \begin{center} 2499 {\resizebox{0.4\textwidth}{!}{\input{monitor .pstex_t}}}2671 {\resizebox{0.4\textwidth}{!}{\input{monitor}}} 2500 2672 \end{center} 2501 2673 \caption{Traditional illustration of a monitor} … … 2506 2678 2507 2679 For \CFA, this picture does not have support for blocking multiple monitors on a single condition. 2508 To support bulk acquiretwo changes to this picture are required.2680 To support \textbf{bulk-acq} two changes to this picture are required. 2509 2681 First, it is no longer helpful to attach the condition to \emph{a single} monitor. 2510 2682 Secondly, the thread waiting on the condition has to be separated across multiple monitors, seen in figure \ref{fig:monitor_cfa}. … … 2555 2727 \end{figure} 2556 2728 2557 The solution discussed in \ref{ s:InternalScheduling} can be seen in the exit routine of listing \ref{f:entry2}.2729 The solution discussed in \ref{intsched} can be seen in the exit routine of listing \ref{f:entry2}. 2558 2730 Basically, the solution boils down to having a separate data structure for the condition queue and the AS-stack, and unconditionally transferring ownership of the monitors but only unblocking the thread when the last monitor has transferred ownership. 2559 2731 This solution is deadlock safe as well as preventing any potential barging. … … 2791 2963 } 2792 2964 \end{cfa} 2793 This routineis called by the kernel to fetch the default preemption rate, where 0 signifies an infinite time-slice, \ie no preemption.2965 This function is called by the kernel to fetch the default preemption rate, where 0 signifies an infinite time-slice, \ie no preemption. 2794 2966 However, once clusters are fully implemented, it will be possible to create fibers and \textbf{uthread} in the same system, as in listing \ref{f:fiber-uthread} 2795 2967 \begin{figure} … … 2976 3148 For monitors, the simplest approach is to measure how long it takes to enter and leave a monitor routine. 2977 3149 Figure~\ref{f:mutex} shows the code for \CFA. 2978 To put the results in context, the cost of entering a non-inline routineand the cost of acquiring and releasing a @pthread_mutex@ lock is also measured.3150 To put the results in context, the cost of entering a non-inline function and the cost of acquiring and releasing a @pthread_mutex@ lock is also measured. 2979 3151 The results can be shown in table \ref{tab:mutex}. 2980 3152 … … 3227 3399 Therefore, there is still significant work to improve performance. 3228 3400 Many of the data structures and algorithms may change in the future to more efficient versions. 3229 For example, the number of monitors in a single bulk acquireis only bound by the stack size, this is probably unnecessarily generous.3401 For example, the number of monitors in a single \textbf{bulk-acq} is only bound by the stack size, this is probably unnecessarily generous. 3230 3402 It may be possible that limiting the number helps increase performance. 3231 3403 However, it is not obvious that the benefit would be significant.
Note:
See TracChangeset
for help on using the changeset viewer.