Changeset aae9c17
- Timestamp: Sep 5, 2023, 5:48:31 PM (18 months ago)
- Branches: master
- Children: 1f10959, efe39bb
- Parents: f54e6ec
- Location: doc/theses/colby_parsons_MMAth/text
- Files: 9 edited

Legend:
  Unmodified (no marker)
+ Added
- Removed
… Elided unmodified text
TabularUnified doc/theses/colby_parsons_MMAth/text/CFA_concurrency.tex ¶
  A thread terminates by returning from the main routine where it starts.

+ \section{Existing Concurrency Features}
+ \CFA currently provides a suite of concurrency features including futures, locks, condition variables, generators, coroutines, and monitors.
+ Examples of these features are omitted as most of them are the same as their counterparts in other languages.
+ It is worthwhile to note that all concurrency features added to \CFA are made to be compatible with each other.
+ The laundry list of features above and the ones introduced in this thesis can be used in the same program without issue.
+
+ Solving concurrent problems requires a diverse toolkit.
+ This work aims to flesh out \CFA's concurrent toolkit by filling in its gaps.
+ Futures are used when a one-shot result needs to be safely delivered concurrently, and are especially useful when the receiver needs to block until the result is ready.
+ When multiple values have to be sent, or more synchronization is needed, futures are not powerful enough, which introduces the need for channels.
+ A close cousin of channels is the actor system, which takes message passing a step further and goes beyond channels to provide a level of abstraction that allows for easy scalability and separation of concerns.
+ The @waituntil@ and @mutex@ statements provide utilities allowing for easier use of the existing features.
+ All the contributions of this thesis provide the ability to solve concurrent problems that formerly would require a user to either implement a similar feature themselves or construct an ad-hoc solution.

  % Local Variables: %
  % tab-width: 4 %
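The future-versus-channel distinction drawn in the added section above can be sketched concretely. The following is a minimal Go sketch (not \CFA; the function name asyncSquare is hypothetical), using a one-element buffered channel as a one-shot future and an ordinary channel for a stream of values:

package main

import "fmt"

// asyncSquare returns a one-shot "future": a channel that delivers exactly
// one result; the receiver blocks until the result is ready.
func asyncSquare(x int) <-chan int {
	result := make(chan int, 1) // buffer of 1 so the producer never blocks
	go func() { result <- x * x }()
	return result
}

func main() {
	fut := asyncSquare(6)
	fmt.Println(<-fut) // block until the one-shot result arrives

	// When multiple values must be sent, a channel carries the whole stream.
	stream := make(chan int, 4)
	go func() {
		for i := 0; i < 4; i++ {
			stream <- i
		}
		close(stream) // no more values
	}()
	for v := range stream {
		fmt.Println(v)
	}
}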
TabularUnified doc/theses/colby_parsons_MMAth/text/CFA_intro.tex ¶
  However, \CFA stays true to the C programming style, with most code revolving around @struct@'s and routines, and respects the same rules as C.
  \CFA is not object oriented as it has no notion of @this@ (receiver) and no structures with methods, but supports some object oriented ideas including constructors, destructors, and limited nominal inheritance.
- While \CFA is rich with interesting features, only the subset pertinent to this work is discussed.
+ While \CFA is rich with interesting features, only the subset pertinent to this work is discussed here.

  \section{References}
  References in \CFA are similar to references in \CC; however \CFA references are \emph{rebindable}, and support multi-level referencing.
- References in \CFA are a layer of syntactic sugar over pointers to reduce the number of ref/deref operations needed with pointer usage.
- Another difference is the use of @0p@ instead of C's @NULL@ or \CC's @nullptr@.
+ References in \CFA are a layer of syntactic sugar over pointers to reduce the number of syntactic ref/deref operations needed with pointer usage.
+ Pointers in \CFA differ from C and \CC in their use of @0p@ instead of C's @NULL@ or \CC's @nullptr@.
  Examples of references are shown in \VRef[Listing]{l:cfa_ref}.
…
  \end{cfa}

+ The operator '@|@' is used for \CFA stream I/O, similar to how \CC uses the operators '@<<@' and '@>>@' \VRef[Listing]{l:cfa_stream}.
+
+ \begin{cfa}[caption={Example of \CFA stream I/O},label={l:cfa_stream}]
+ char c; int i; double d;
+ sin | c | i | d;    $\C{ // read into c, i, and d }$
+ sout | c | i | d;   $\C{ // output c, i, and d }$
+ x 27 2.3    $\C{ // implicit separation between values and auto newline }$
+ \end{cfa}

  \section{Constructors and Destructors}
…
  \section{Polymorphism}\label{s:poly}
- C supports limited polymorphism, often requiring users to implement polymorphism using a @void *@ (type erasure) approach.
- \CFA extends C with generalized overloading polymorphism (see \VRef{s:Overloading}), as well as, parametric polymorphism and limited nominal inheritance.
+ C supports limited polymorphism, often requiring users to implement polymorphism using a @void *@ (explicit type erasure) approach.
+ \CFA extends C with generalized overloading polymorphism (see \VRef{s:Overloading}), as well as parametric polymorphism and limited inclusion polymorphism (nominal inheritance).

  \subsection{Parametric Polymorphism}
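For readers more familiar with mainstream languages, the parametric polymorphism this subsection introduces is analogous to, for example, Go generics: one definition parameterized over any suitable type. A hedged sketch, not \CFA syntax, assuming Go 1.21's standard cmp package (max2 is a hypothetical name chosen to avoid Go's builtin max):

package main

import (
	"cmp"
	"fmt"
)

// max2 is written once but works for any ordered type T,
// the essence of parametric polymorphism.
func max2[T cmp.Ordered](a, b T) T {
	if a > b {
		return a
	}
	return b
}

func main() {
	fmt.Println(max2(3, 7))         // instantiated at int
	fmt.Println(max2("abc", "abd")) // instantiated at string
}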
TabularUnified doc/theses/colby_parsons_MMAth/text/actors.tex ¶
  Hence, actors are in the realm of \gls{impl_concurrency}, where programmers write concurrent code without dealing with explicit thread creation or interaction.
  Actor message-passing is similar to channels, but with more abstraction, so there is no shared data to protect, making actors amenable in a distributed environment.
+ Actors are often used for high-performance computing and other data-centric problems, where the ease of use and scalability of an actor system provides an advantage over channels.
+
  The study of actors can be broken into two concepts, the \gls{actor_model}, which describes the model of computation, and the \gls{actor_system}, which refers to the implementation of the model.
  Before discussing \CFA's actor system in detail, it is important to first describe the actor model, and the classic approach to implementing an actor system.
…
  An actor is composed of a \Newterm{mailbox} (message queue) and a set of \Newterm{behaviours} that receive from the mailbox to perform work.
  Actors execute asynchronously upon receiving a message and can modify their own state, make decisions, spawn more actors, and send messages to other actors.
+ Conceptually, actor systems can be thought of in terms of channels, where each actor's mailbox is a channel.
+ However, a mailbox behaves like an unbounded channel, which differs from the fixed-size channels discussed in the previous chapter.
  Because the actor model is implicit concurrency, its strength is that it abstracts away many details and concerns needed in other concurrent paradigms.
  For example, mutual exclusion and locking are rarely relevant concepts in an actor model, as actors typically only operate on local state.
…
  % The arrows from the message queues to the actors in the diagram indicate interleaved messages addressed to each actor.

- The actor system in \CFA uses a message-centric design, adopts several features from my prior actor work in \uC~\cite{Buhr22} but implemented in \CFA, and adds the following new \CFA contributions:
+ The actor system in \CFA uses a message-centric design and adopts several features from my prior actor work in \uC~\cite{Buhr22} but is implemented in \CFA. My contributions to the prior actor work include introducing queue gulping, developing an actor benchmark suite, and extending promise support for actors. Furthermore, I improved the design and implementation of the \uC actor system to greatly increase its performance. As such, the actor system in \CFA started as a copy of the \uC implementation, which was then refined. This work adds the following new \CFA contributions:
  \begin{enumerate}[topsep=5pt,itemsep=3pt,parsep=0pt]
  \item
…
  tells the executor to mark the respective actor as finished executing, but not call the object's destructor nor deallocate the object.
  This status is used when actors or messages are global or stack allocated, or a programmer wants to manage deallocation themselves.
- Note, for messages, there is no difference between allocations @Nodelete@ and @Finished@ because both tell the executor to do nothing to the message.
+ Note that for messages there is no difference between allocations @Nodelete@ and @Finished@ because both tell the executor to do nothing to the message.
  Hence, @Finished@ is implicitly changed to @Nodelete@ in a message constructor, and @Nodelete@ is used internally for message error-checking \see{Section~\ref{s:SafetyProductivity}}.
  Therefore, reading a message's allocation status after setting to @Finished@ may be either @Nodelete@ (after construction) or @Finished@ (after explicitly setting using @set_allocation@).
…
  After an actor is terminated, it is erroneous to send messages to it.
  Similarly, after a message is terminated, it cannot be sent to an actor.
- Note, it is safe to construct an actor or message with a status other than @Nodelete@, since the executor only examines the allocation action \emph{after} a behaviour returns.
+ Note that it is safe to construct an actor or message with a status other than @Nodelete@, since the executor only examines the allocation action \emph{after} a behaviour returns.

  \subsection{Actor Envelopes}\label{s:envelope}
…
  Poison-pill messages are common across actor systems, including Akka and ProtoActor~\cite{Akka,ProtoActor}, to suggest or force actor termination.
  For example, in Figure~\ref{f:CFAActor}, the builtin @finished_msg@ message and receive are used to terminate the actor because the actor is allocated on the stack, so no deallocation actions are performed by the executor.
- Note, assignment is used to initialize these messages rather than constructors because the constructor changes the allocation to @Nodelete@ for error checking
+ Note that assignment is used to initialize these messages rather than constructors because the constructor changes the allocation to @Nodelete@ for error checking.

  \begin{figure}
…
  After the receive routine is done, the executor must clean up the actor and message according to their allocation status.
  If the allocation status is @Delete@ or @Destroy@, the appropriate destructor must be called by the executor.
- This requirement poses a problem;
- the derived type of the actor or message is not available to the executor, but it needs to call the derived destructor.
+ This requirement poses a problem: the derived type of the actor or message is not available to the executor, but it needs to call the derived destructor.
  This requires downcasting from the base type to the derived type, which requires a virtual system.
  To accomplish the downcast, a rudimentary destructor-only virtual system was implemented in \CFA as part of this work.
…
  The only extra overhead is each executor cycling (usually round-robin) through its $M$/$N$ queues.
  The goal is to achieve better performance and scalability for certain kinds of actor applications by reducing executor locking.
- Note, lock-free queues do not help because busy waiting on any atomic instruction is the source of the slowdown whether it is a lock or lock-free.
+ Note that lock-free queues do not help because busy waiting on any atomic instruction is the source of the slowdown whether it is a lock or lock-free.

  \begin{figure}
…
  Each executor thread iterates over its own message queues until it finds one with messages.
  At this point, the executor thread atomically \gls{gulp}s the queue, meaning it moves the contents of the message queue to a local queue of the executor thread.
- An example of the queue gulping operation is shown in the right side of Figure \ref{f:gulp}, where a executor thread gulps queue 0 and begins to process it locally.
+ An example of the queue gulping operation is shown in the right side of Figure \ref{f:gulp}, where an executor thread gulps queue 0 and begins to process it locally.
  This step allows the executor thread to process the local queue without any atomics until the next gulp.
  Other executor threads can continue adding to the ends of the executor thread's message queues.
…
  Instead, an index is stored in the copy-queue data-structure that keeps track of which element to pop next, allowing pop to be $O(1)$.
  Push operations are amortized $O(1)$ since pushes may cause doubling reallocations of the underlying dynamic-sized array (like \CC @vector@).
+ Note that the copy queue is similar to a circular buffer, but has a key difference.
+ After all elements are popped, the end of the queue is set back to index zero, whereas a standard circular buffer would leave the end in the same location.
+ Using a standard circular buffer would incur an additional $O(n)$ cost of fixing the buffer after a doubling reallocation.

  Since the copy queue is an array, envelopes are allocated first on the stack and then copied into the copy queue to persist until they are no longer needed.
…
  To steal, a thief takes work from a victim's ready queue, so work stealing always results in a potential increase in contention on ready queues between the victim gulping from a queue and the thief stealing the queue.
  This contention can reduce the victim's throughput.
- Note, the data structure used for the ready queue is not discussed since the largest cost is the mutual exclusion and its duration for safely performing the queue operations.
+ Note that the data structure used for the ready queue is not discussed since the largest cost is the mutual exclusion and its duration for safely performing the queue operations.

  The stealing mechanism in this work differs from most work-stealing actor-systems because of the message-centric (inverted) actor-system.
…
  Achieving this goal requires work stealing to have minimal (practically no) effect on the performance of the victim.
  The implication is that thieves cannot contend with a victim, and that a victim should perform no stealing-related work unless it becomes a thief.
- In theory, this goal is not achievable, but practical results show the goal is virtually achieved.
+ In theory, this goal is not achievable, but practical results show the goal is nearly achieved.

  One important lesson learned while working on \uC actors~\cite{Buhr22} and through discussions with fellow student Thierry Delisle, who examined work-stealing for user-threads in his Ph.D.~\cite{Delisle22}, is \emph{not} to aggressively steal.
  With reasonable workloads, being a thief should be a temporary state, \ie eventually work appears on the thief's ready-queues and it returns to normal operation.
- Furthermore, the act of \emph{looking} to find work is invasive (Heisenberg uncertainty principle), possibly disrupting multiple victims.
+ Furthermore, the act of \emph{looking} to find work is invasive, possibly disrupting multiple victims.
  Therefore, stealing should be done lazily in case work appears for the thief and to minimize disruption of victims.
- Note, the cost of stealing is not crucial for the thief because it does not have anything else to do except poll or block.
+ Note that the cost of stealing is not crucial for the thief because it does not have anything else to do except poll or block.
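Returning to the copy queue added above: a minimal Go sketch of the idea, with a hypothetical envelope type standing in for the real one. Pop advances an index in O(1), push appends with amortized-O(1) doubling, and a drained queue resets both ends to index zero, avoiding the O(n) fix-up a wrapped circular buffer would need after a doubling reallocation:

package main

type envelope struct{ payload int } // stand-in for a message envelope

type copyQueue struct {
	buf  []envelope
	next int // index of the next element to pop
}

// push appends; append may double the backing array (amortized O(1)).
func (q *copyQueue) push(e envelope) { q.buf = append(q.buf, e) }

// pop returns the next element in O(1); once drained, the queue resets
// to index zero rather than wrapping like a circular buffer.
func (q *copyQueue) pop() (envelope, bool) {
	if q.next == len(q.buf) {
		q.buf = q.buf[:0] // drained: reset the end back to index zero
		q.next = 0
		return envelope{}, false
	}
	e := q.buf[q.next]
	q.next++
	return e, true
}

func main() {
	var q copyQueue
	q.push(envelope{1})
	q.push(envelope{2})
	for e, ok := q.pop(); ok; e, ok = q.pop() {
		println(e.payload)
	}
}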
  The outline for lazy-stealing by a thief is: select a victim, scan its queues once, and return immediately if a queue is stolen.
…
  % The swap races to interchange the appropriate victim's mail-queue pointer with an empty mail-queue pointer in the thief's @worker_queues@ range.
  This swap can fail if another thief steals the queue, which is discussed further in Section~\ref{s:swap}.
- % Note, a thief never exceeds its $M$/$N$ worker range because it is always exchanging queues with other workers.
+ % Note that a thief never exceeds its $M$/$N$ worker range because it is always exchanging queues with other workers.
  If no appropriate victim mailbox is found, no swap is attempted.
…
  Figure~\ref{f:qpcasImpl} shows the \CFA pseudocode for the \gls{qpcas}.
  In detail, a thief performs the following steps to swap two pointers:
- \begin{enumerate}[start=0]
- \item
- stores local copies of the two pointers to be swapped.
- \item
- verifies the stored copy of the victim queue pointer, @vic_queue@, is valid.
+ \begin{enumerate}
+ \item
+ Stores local copies of the two pointers to be swapped.
+ \item
+ Verifies the stored copy of the victim queue pointer, @vic_queue@, is valid.
  If @vic_queue@ is null, then the victim queue is part of another swap so the operation fails.
  No state has changed at this point so the thief just returns.
- Note, thieves only set their own queue pointers to null when stealing, and queue pointers are not set to null anywhere else.
+ Note that thieves only set their own queue pointers to null when stealing, and queue pointers are not set to null anywhere else.
  As such, it is impossible for @my_queue@ to be null since each worker owns a disjoint range of the queue array.
  Hence, only @vic_queue@ is checked for null.
  \item
- attempts to atomically set the thief's queue pointer to null.
+ Attempts to atomically set the thief's queue pointer to null.
  The @CAS@ only fails if the thief's queue pointer is no longer equal to @my_queue@, which implies this thief has become a victim and its queue has been stolen.
  At this point, the thief-turned-victim fails, and since it has not changed any state, it just returns false.
…
  \item
- attempts to atomically set the victim's queue pointer to @my_queue@.
+ Attempts to atomically set the victim's queue pointer to @my_queue@.
  If the @CAS@ succeeds, the victim's queue pointer has been set and the swap can no longer fail.
  If the @CAS@ fails, the thief's queue pointer must be restored to its previous value before returning.
  \item
- sets the thief's queue pointer to @vic_queue@ completing the swap.
+ Sets the thief's queue pointer to @vic_queue@, completing the swap.
  \end{enumerate}

…
  \begin{cfa}
  bool try_swap_queues( worker & this, uint victim_idx, uint my_idx ) with(this) {
- 	// Step 0: mailboxes is the shared array of all sharded queues
+ 	// Step 1: mailboxes is the shared array of all sharded queues
  	work_queue * my_queue = mailboxes[my_idx];
  	work_queue * vic_queue = mailboxes[victim_idx];

- 	// Step 1 If the victim queue is 0p then they are in the process of stealing so return false
+ 	// Step 2 If the victim queue is 0p then they are in the process of stealing so return false
  	// 0p is Cforall's equivalent of C++'s nullptr
  	if ( vic_queue == 0p ) return false;

- 	// Step 2 Try to set our own (thief's) queue ptr to be 0p.
+ 	// Step 3 Try to set our own (thief's) queue ptr to be 0p.
  	// If this CAS fails someone stole our (thief's) queue so return false
  	if ( !CAS( &mailboxes[my_idx], &my_queue, 0p ) )
  		return false;

- 	// Step 3: Try to set victim queue ptr to be our (thief's) queue ptr.
+ 	// Step 4: Try to set victim queue ptr to be our (thief's) queue ptr.
  	// If it fails someone stole the other queue, so fix up then return false
  	if ( !CAS( &mailboxes[victim_idx], &vic_queue, my_queue ) ) {
…
  		return false;
  	}
- 	// Step 4: Successfully swapped.
+ 	// Step 5: Successfully swapped.
  	// Thief's ptr is 0p so no one touches it
  	// Write back without CAS is safe
…
  \end{theorem}
  To verify sequential correctness, Figure~\ref{f:seqSwap} shows a simplified \gls{qpcas}.
- Step 1 is missing in the sequential example since it only matters in the concurrent context.
+ Step 2 is missing in the sequential example since it only matters in the concurrent context.
  By inspection, the sequential swap copies each pointer being swapped, and then the original values of each pointer are reset using the copy of the other pointer.

…
  \begin{cfa}
  void swap( uint victim_idx, uint my_idx ) {
- 	// Step 0:
+ 	// Step 1:
  	work_queue * my_queue = mailboxes[my_idx];
  	work_queue * vic_queue = mailboxes[victim_idx];
- 	// Step 2:
+ 	// Step 3:
  	mailboxes[my_idx] = 0p;
- 	// Step 3:
+ 	// Step 4:
  	mailboxes[victim_idx] = my_queue;
- 	// Step 4:
+ 	// Step 5:
  	mailboxes[my_idx] = vic_queue;
  }
…
  \begin{itemize}
  \item
- Step 0 and 1 do not write, and as such, they cannot invalidate the invariant of any other thieves.
- \item
- In step 2, a thief attempts to write @0p@ to one of their queue pointers.
+ Steps 1 and 2 do not write, and as such, they cannot invalidate the invariant of any other thieves.
+ \item
+ In step 3, a thief attempts to write @0p@ to one of their queue pointers.
  This queue pointer cannot be @0p@.
  As stated above, @my_queue@ is never equal to @0p@ since thieves only write @0p@ to queue pointers from their own queue range and all workers' queue ranges are disjoint.
- As such, step 2 upholds the invariant, since in a failure case no write occurs, and in the success case, the value of the queue pointer is guaranteed to not be 0p.
- \item
- In step 3, the thief attempts to write @my_queue@ to the victim's queue pointer.
- If the current value of the victim's queue pointer is @0p@, then the @CAS@ fails since @vic_queue@ cannot be equal to @0p@ because of the check in step 1.
+ As such, step 3 upholds the invariant, since in a failure case no write occurs, and in the success case, the value of the queue pointer is guaranteed to not be 0p.
+ \item
+ In step 4, the thief attempts to write @my_queue@ to the victim's queue pointer.
+ If the current value of the victim's queue pointer is @0p@, then the @CAS@ fails since @vic_queue@ cannot be equal to @0p@ because of the check in step 2.
  Therefore, when the @CAS@ succeeds, the value of the victim's queue pointer must not be @0p@.
- As such, the write never overwrites a value of @0p@, hence the invariant is held in the @CAS@ of step 3.
- \item
- The write back to the thief's queue pointer that happens in the failure case of step 3 and in step 4 hold the invariant since they are the subsequent write to a @0p@ queue pointer and are being set by the same thief that set the pointer to @0p@.
+ As such, the write never overwrites a value of @0p@, hence the invariant is held in the @CAS@ of step 4.
+ \item
+ The writes back to the thief's queue pointer that happen in the failure case of step 4 and in step 5 uphold the invariant since they are subsequent writes to a @0p@ queue pointer and are performed by the same thief that set the pointer to @0p@.
  \end{itemize}

  Given this informal proof of invariance, it can be shown that the successful swap is correct.
- Once a thief atomically sets their queue pointer to be @0p@ in step 2, the invariant guarantees that that pointer does not change.
- In the success case of step 3, it is known the value of the victim's queue-pointer, which is not overwritten, must be @vic_queue@ due to the use of @CAS@.
+ Once a thief atomically sets their queue pointer to be @0p@ in step 3, the invariant guarantees that that pointer does not change.
+ In the success case of step 4, it is known the value of the victim's queue-pointer, which is not overwritten, must be @vic_queue@ due to the use of @CAS@.
  Given that the pointers all have unique memory locations (a pointer is never swapped with itself), this first write of the successful swap is correct since it can only occur when the pointer has not changed.
  By the invariant, the write back in the successful case is correct since no other worker can write to the @0p@ pointer.
- In the failed case of step 3, the outcome is correct in steps 1 and 2 since no writes have occurred so the program state is unchanged.
- Therefore, the program state is safely restored to the state it had prior to the @0p@ write in step 2, because the invariant makes the write back to the @0p@ pointer safe.
+ In the failed case of step 4, the outcome is correct in steps 2 and 3 since no writes have occurred so the program state is unchanged.
+ Therefore, the program state is safely restored to the state it had prior to the @0p@ write in step 3, because the invariant makes the write back to the @0p@ pointer safe.
  Note that the pointers having unique memory locations prevents the ABA problem.

…
  As such, the victim here is not also a thief.
  Stepping through the code in \ref{f:qpcasImpl}, for all thieves, steps 1-2 succeed since the victim is not stealing and has no queue pointers set to be @0p@.
- Similarly, for all thieves, step 2 succeeds since no one is stealing from any of the thieves.
- In step 3, the first thief to @CAS@ wins the race and successfully swaps the queue pointer.
+ Similarly, for all thieves, step 3 succeeds since no one is stealing from any of the thieves.
+ In step 4, the first thief to @CAS@ wins the race and successfully swaps the queue pointer.
  Since it is the first one to @CAS@ and @CAS@ is atomic, there is no way for the @CAS@ to fail since no other thief could have written to the victim's queue pointer and the victim did not write to the pointer since they aren't stealing.
  Hence at least one swap is guaranteed to succeed in this case.
…
  This is a proof by contradiction.
  Assume no swaps occur.
- Then all thieves must have failed at step 1, step 2 or step 3.
- For a given thief $b$ to fail at step 1, thief $b + 1$ must have succeeded at step 2 before $b$ executes step 0.
- Hence, not all thieves can fail at step 1.
- Furthermore if a thief $b$ fails at step 1 it logically splits the chain into two subchains $0 <- b$ and $b + 1 <- M - 1$, where $b$ has become solely a victim since its swap has failed and it did not modify any state.
- There must exist at least one chain containing two or more queues after since it is impossible for a split to occur both before and after a thief, since that requires failing at step 1 and succeeding at step 2.
- Hence, without loss of generality, whether thieves succeed or fail at step 1, this proof can proceed inductively.
+ Then all thieves must have failed at step 2, step 3 or step 4.
+ For a given thief $b$ to fail at step 2, thief $b + 1$ must have succeeded at step 3 before $b$ executes step 1.
+ Hence, not all thieves can fail at step 2.
+ Furthermore, if a thief $b$ fails at step 2 it logically splits the chain into two subchains $0 <- b$ and $b + 1 <- M - 1$, where $b$ has become solely a victim since its swap has failed and it did not modify any state.
+ There must exist at least one chain containing two or more queues after the split, since it is impossible for a split to occur both before and after a thief, as that requires failing at step 2 and succeeding at step 3.
+ Hence, without loss of generality, whether thieves succeed or fail at step 2, this proof can proceed inductively.

- For a given thief $i$ to fail at step 2, it means that another thief $j$ had to have written to $i$'s queue pointer between $i$'s step 0 and step 2.
- The only way for $j$ to write to $i$'s queue pointer would be if $j$ was stealing from $i$ and had successfully finished step 3.
- If $j$ finished step 3 then at least one swap was successful.
- Therefore all thieves did not fail at step 2.
- Hence all thieves must successfully complete step 2 and fail at step 3.
+ For a given thief $i$ to fail at step 3, it means that another thief $j$ had to have written to $i$'s queue pointer between $i$'s step 1 and step 3.
+ The only way for $j$ to write to $i$'s queue pointer would be if $j$ was stealing from $i$ and had successfully finished step 4.
+ If $j$ finished step 4, then at least one swap was successful.
+ Therefore all thieves did not fail at step 3.
+ Hence all thieves must successfully complete step 3 and fail at step 4.
  However, since the first worker, thief $0$, is solely a victim and not a thief, it does not change the state of any of its queue pointers.
- Hence, in this case thief $1$ always succeeds in step 3 if all thieves succeed in step 2.
+ Hence, in this case thief $1$ always succeeds in step 4 if all thieves succeed in step 3.
  Thus, by contradiction with the earlier assumption that no swaps occur, at least one swap must succeed.
…
  The forward direction is proven by contradiction in a similar fashion to \ref{t:vic_chain}.
  Assume no swaps occur.
- Similar to \ref{t:vic_chain}, this graph can be inductively split into subgraphs of the same type by failure at step 1, so the proof proceeds without loss of generality.
- Similar to \ref{t:vic_chain} the conclusion is drawn that all thieves must successfully complete step 2 for no swaps to occur, since for step 2 to fail, a different thief has to successfully complete step 3, which would imply a successful swap.
- Hence, the only way forward is to assume all thieves successfully complete step 2.
- Hence for there to be no swaps all thieves must fail step 3.
+ Similar to \ref{t:vic_chain}, this graph can be inductively split into subgraphs of the same type by failure at step 2, so the proof proceeds without loss of generality.
+ Similar to \ref{t:vic_chain}, the conclusion is drawn that all thieves must successfully complete step 3 for no swaps to occur, since for step 3 to fail, a different thief has to successfully complete step 4, which would imply a successful swap.
+ Hence, the only way forward is to assume all thieves successfully complete step 3.
+ Hence, for there to be no swaps, all thieves must fail step 4.
  However, since $A$ has no outgoing edges and the graph is connected, there must be some $K$ such that $K < M - 1$ thieves are attempting to swap with $A$.
- Since all $K$ thieves have passed step 2, similar to \ref{t:one_vic} the first one of the $K$ thieves to attempt step 3 is guaranteed to succeed.
+ Since all $K$ thieves have passed step 3, similar to \ref{t:one_vic} the first one of the $K$ thieves to attempt step 4 is guaranteed to succeed.
  Thus, by contradiction with the earlier assumption that no swaps occur, if the graph does not contain a cycle, at least one swap must succeed.
…
  Thus, by induction all vertices in the graph must have exactly one outgoing edge.
  Hence all vertices are thief queues.
- Now consider the case where all thieves successfully complete step 0-1, and then they all complete step 2.
+ Now consider the case where all thieves successfully complete steps 1-2, and then they all complete step 3.
  At this point all thieves are attempting to swap with a queue pointer whose value has changed to @0p@.
  If all thieves attempt the @CAS@ before any write backs, then they all fail.
…
  There is no one selection heuristic known to be best for all workloads.
  Recent work focuses on locality-aware scheduling in actor systems~\cite{barghi18,wolke17}.
- However, while locality-aware scheduling provides good performance on some workloads, sometime randomized selection performs better on other workloads~\cite{barghi18}.
+ However, while locality-aware scheduling provides good performance on some workloads, randomized selection performs better on other workloads~\cite{barghi18}.
  Since locality-aware scheduling has been explored recently, this work introduces a heuristic called \Newterm{longest victim} and compares it to randomized work stealing.
…
  \CFA's actor system comes with a suite of safety and productivity features.
- Most of these features are only present in \CFA's debug mode, and hence, have zero-cost in no debug mode.
+ Most of these features are only present in \CFA's debug mode, and hence, have zero cost in no-debug mode.
  The suite of features includes the following.
  \begin{itemize}
…
  The copy queue both reduces and consolidates allocations, heavily reducing contention on the memory allocator.
  Additionally, due to the static typing in \CFA's actor system, there are no expensive dynamic (RTTI) casts, which occur in \uC to discriminate message types.
- Note, while dynamic cast is relatively inexpensive, the remaining send cost in both \uC and \CFA is small;
+ Note that while dynamic cast is relatively inexpensive, the remaining send cost in both \uC and \CFA is small;
  hence, the relative cost for the RTTI in \uC is significant.

…
  Figures~\ref{f:MatrixAMD} and \ref{f:MatrixIntel} show the matrix-multiply results.
  There are two groupings, with Akka and ProtoActor being slightly slower than \uC, \CFA, and CAF.
- On the Intel, there is an unknown divergence between \uC and \CFA/CAF at 24 cores.
+ On the Intel, there is an unexplained divergence between \uC and \CFA/CAF at 24 cores.
  Given that the bottleneck of this benchmark is the computation of the result matrix, all executors perform well on this embarrassingly parallel application.
  Hence, the results are tightly clustered across all actor systems.
…
  Two pathological unbalanced cases are created, and compared using vanilla and randomized work stealing in \CFA.
- These benchmarks adversarially takes advantage of the round-robin assignment of actors to workers by loading actors only on specific cores (there is one worker per core).
+ These benchmarks adversarially take advantage of the round-robin assignment of actors to workers by loading actors only on specific cores (there is one worker per core).
  The workload on the loaded cores is the same as the executor benchmark described in \ref{s:executorPerf}, but with fewer rounds.

  Given this layout, the ideal speedup of work stealing in the balance-one case should be $N / (N - 1)$ where $N$ is the number of threads;
  in the balance-multi case, the ideal speedup is 0.5.
- Note, in the balance-one benchmark, the workload is fixed so decreasing runtime is expected;
+ Note that in the balance-one benchmark, the workload is fixed so decreasing runtime is expected;
  in the balance-multi experiment, the workload increases with the number of cores so an increasing or constant runtime is expected.
…
  On Intel in Figure~\ref{f:BalanceOneIntel}, above 32 cores the performance gets worse for all variants due to hyperthreading.
  Here, the longest-victim and random heuristics are the same.
- Note, the non-stealing variation of balance-one slows down slightly (no decrease in graph) as the cores increase, since a few \emph{dummy} actors are created for each of the extra cores beyond the first to adversarially layout all loaded actors on the first core.
+ Note that the non-stealing variation of balance-one slows down slightly (no decrease in graph) as the cores increase, since a few \emph{dummy} actors are created for each of the extra cores beyond the first to adversarially lay out all loaded actors on the first core.

  For the balance-multi benchmark in Figures~\ref{f:BalanceMultiAMD} and~\ref{f:BalanceMultiIntel}, the random heuristic outperforms the longest victim.
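To make the swap algorithm discussed throughout this chapter concrete outside of \CFA, here is a hedged Go analogue of the \gls{qpcas} using the standard sync/atomic package; the mailboxes array and workQueue type are illustrative assumptions, not the thesis's actual data structures:

package main

import (
	"sync/atomic"
	"unsafe"
)

type workQueue struct{} // stand-in for a sharded mail queue

var mailboxes []unsafe.Pointer // one slot per sharded queue; nil while being swapped

func trySwapQueues(victimIdx, myIdx int) bool {
	// Step 1: snapshot both pointers.
	myQueue := atomic.LoadPointer(&mailboxes[myIdx])
	vicQueue := atomic.LoadPointer(&mailboxes[victimIdx])
	// Step 2: a nil victim slot means it is part of another in-flight swap.
	if vicQueue == nil {
		return false
	}
	// Step 3: claim our own slot by nilling it; failure means our queue was stolen.
	if !atomic.CompareAndSwapPointer(&mailboxes[myIdx], myQueue, nil) {
		return false
	}
	// Step 4: install our queue in the victim's slot; on failure, restore ours.
	if !atomic.CompareAndSwapPointer(&mailboxes[victimIdx], vicQueue, myQueue) {
		atomic.StorePointer(&mailboxes[myIdx], myQueue)
		return false
	}
	// Step 5: our slot is still nil, so a plain store safely completes the swap.
	atomic.StorePointer(&mailboxes[myIdx], vicQueue)
	return true
}

func main() {
	q0, q1 := &workQueue{}, &workQueue{}
	mailboxes = []unsafe.Pointer{unsafe.Pointer(q0), unsafe.Pointer(q1)}
	println(trySwapQueues(0, 1)) // true: slots 0 and 1 exchanged
}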
TabularUnified doc/theses/colby_parsons_MMAth/text/channels.tex ¶
  Zero sized implies the communication is synchronous, \ie the producer must wait for the consumer to arrive or vice versa for a value to be communicated.
  \item
- Fixed sized (bounded) implies the communication is asynchronous, \ie the producer can proceed up to the buffer size and vice versa for the consumer with respect to removal.
+ Fixed sized (bounded) implies the communication is mostly asynchronous, \ie the producer can proceed up to the buffer size and vice versa for the consumer with respect to removal, at which point the producer/consumer would wait.
  \item
  Infinite sized (unbounded) implies the communication is asynchronous, \ie the producer never waits but the consumer waits when the buffer is empty.
…
  \section{Channel Implementation}\label{s:chan_impl}
- Currently, only the Go programming language provides user-level threading where the primary communication mechanism is channels.
- Experiments were conducted that varied the producer-consumer algorithm and lock type used inside the channel.
+ Currently, only the Go and Erlang programming languages provide user-level threading where the primary communication mechanism is channels.
+ Both Go and Erlang have user-level threading and preemptive scheduling, and both use channels for communication.
+ Go provides multiple homogeneous channels; each has a single associated type.
+ Erlang, which is closely related to actor systems, provides one heterogeneous channel per thread (mailbox) with a typed receive pattern.
+ Go encourages users to communicate via channels, but provides them as an optional language feature.
+ On the other hand, Erlang's single heterogeneous channel is a fundamental part of the threading system design; using it is unavoidable.
+ Similar to Go, \CFA's channels are offered as an optional language feature.
+
+ While iterating on the channel implementation, experiments were conducted that varied the producer-consumer algorithm and lock type used inside the channel.
  With the exception of non-\gls{fcfs} or non-\gls{fifo} algorithms, no algorithm or lock usage in the channel implementation was found to be consistently more performant than Go's choice of algorithm and lock implementation.
  Performance of channels can be improved by sharding the underlying buffer \cite{Dice11}.
…
  As a result, contention on the internal lock is decreased; only entering threads compete for the lock since unblocking threads do not reacquire the lock.
  The other advantage of Go's wait-morphing approach is that it eliminates the bottleneck of waiting for signalled threads to run.
- Note, the property of acquiring/releasing the lock only once can also be achieved with a different form of cooperation, called \Newterm{baton passing}.
+ Note that the property of acquiring/releasing the lock only once can also be achieved with a different form of cooperation, called \Newterm{baton passing}.
  Baton passing occurs when one thread acquires a lock but does not release it, and instead signals a thread inside the critical section, conceptually ``passing'' the mutual exclusion from the signalling thread to the signalled thread.
  The baton-passing approach has threads cooperate to pass mutual exclusion without additional lock acquires or releases;
…
  Thus, improperly handled \gls{toctou} issues with channels often result in deadlocks as threads performing the termination may end up unexpectedly blocking in their attempt to help other threads exit the system.

- \paragraph{Go channels} provide a set of tools to help with concurrent shutdown~\cite{go:chan} using a @close@ operation in conjunction with the \Go{select} statement.
+ Go channels provide a set of tools to help with concurrent shutdown~\cite{go:chan} using a @close@ operation in conjunction with the \Go{select} statement.
  The \Go{select} statement is discussed in \ref{s:waituntil}, where \CFA's @waituntil@ statement is compared with the Go \Go{select} statement.
…
  These design choices fit Go's paradigm of error management, where users are expected to explicitly check for errors, rather than letting errors occur and catching them.
  If errors need to occur in Go, return codes are used to pass error information up call levels.
- Note, panics in Go can be caught, but it is not the idiomatic way to write Go programs.
+ Note that panics in Go can be caught, but it is not the idiomatic way to write Go programs.

  While Go's channel-closing semantics are powerful enough to perform any concurrent termination needed by a program, their lack of ease of use leaves much to be desired.
…
  The results of the benchmark are shown in Figure~\ref{f:chanPerf}.
  The performance of Go and \CFA channels on this microbenchmark is comparable.
- Note, the performance should decline as the number of cores increases as the channel operations occur in a critical section, so increasing cores results in higher contention with no increase in parallelism.
+ Note that the performance should decline as the number of cores increases, as the channel operations occur in a critical section, so increasing cores results in higher contention with no increase in parallelism.
+
+ The performance of \CFA and Go's shutdown mechanisms is not measured, as shutdown is an exceptional case that does not occur frequently in most programs.
+ Additionally, it is difficult to measure channel shutdown performance; threads need to be synchronized between each subsequent shutdown, which is likely more expensive than the shutdown mechanism itself.

  \begin{figure}
  \centering
- \subfloat[AMD \CFA Channel Benchmark]{
+ \subfloat[AMD Channel Benchmark]{
  	\resizebox{0.5\textwidth}{!}{\input{figures/nasus_Channel_Contention.pgf}}
  	\label{f:chanAMD}
  }
- \subfloat[Intel \CFA Channel Benchmark]{
+ \subfloat[Intel Channel Benchmark]{
  	\resizebox{0.5\textwidth}{!}{\input{figures/pyke_Channel_Contention.pgf}}
  	\label{f:chanIntel}
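A small Go sketch of the close-based shutdown idiom discussed above (channel names are hypothetical), showing the comma-ok receive that distinguishes a real value from a closed, drained channel:

package main

import "fmt"

func main() {
	work := make(chan int, 8)
	done := make(chan struct{})

	go func() { // consumer
		for {
			v, ok := <-work
			if !ok { // channel closed and drained: terminate
				close(done)
				return
			}
			fmt.Println("got", v)
		}
	}()

	for i := 0; i < 3; i++ {
		work <- i
	}
	close(work) // initiate shutdown; any further send would panic
	<-done      // wait for the consumer to finish
}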
TabularUnified doc/theses/colby_parsons_MMAth/text/conclusion.tex ¶
  The @waituntil@ statement aids in writing concurrent programs in both the message passing and shared memory paradigms of concurrency.
  Furthermore, no other language provides a synchronous multiplexing tool polymorphic over resources like \CFA's @waituntil@.

+ These features are commonly used in conjunction to solve concurrent problems.
+ The @waituntil@ statement, the @mutex@ statement, and channels will all likely see use in a program where a thread operates as an administrator or server that accepts and distributes work among channels based on some shared state.
+ The @mutex@ statement sees use across almost all concurrent code in \CFA, since it is used with the stream operator @sout@ to provide thread-safe output.
+ While not yet implemented, the polymorphic support of the @waituntil@ statement could see use in conjunction with the actor system to enable user threads outside the actor system to wait for work to be done, or for actors to finish.
+ A user of \CFA does not have to subscribe solely to the message passing or shared memory concurrent paradigm.
+ As such, channels in \CFA are often used to pass pointers to shared memory that may still need mutual exclusion, requiring the @mutex@ statement to also be used.

  An overview of the contributions in this thesis includes the following:
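As one illustration of the administrator pattern mentioned in the added conclusion text, a hedged Go sketch (all names hypothetical) of a server thread that multiplexes requests and shutdown, distributing work round-robin over per-worker channels:

package main

// administrator accepts requests and distributes them among worker channels,
// using a select to also react to a shutdown signal.
func administrator(requests <-chan int, workers []chan<- int, quit <-chan struct{}) {
	next := 0 // shared state used to choose the next worker
	for {
		select {
		case r := <-requests:
			workers[next%len(workers)] <- r // distribute work round-robin
			next++
		case <-quit:
			for _, w := range workers {
				close(w) // tell each worker to finish
			}
			return
		}
	}
}

func main() {
	requests := make(chan int)
	quit := make(chan struct{})
	workers := []chan<- int{make(chan int, 4), make(chan int, 4)}
	go administrator(requests, workers, quit)
	requests <- 1 // guaranteed received before the shutdown below
	close(quit)
}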
TabularUnified doc/theses/colby_parsons_MMAth/text/frontpgs.tex ¶
  \begin{center}\textbf{Abstract}\end{center}

- Concurrent programs are notoriously hard to program and even harder to debug. Furthermore concurrent programs must be performant, as the introduction of concurrency into a program is often done to achieve some form of speedup.
+ Concurrent programs are notoriously hard to write and even harder to debug. Furthermore, concurrent programs must be performant, as the introduction of concurrency into a program is often done to achieve some form of speedup.

  This thesis presents a suite of high-level concurrent-language features in the new programming language \CFA, all of which are implemented with the aim of improving the performance, productivity, and safety of concurrent programs.
TabularUnified doc/theses/colby_parsons_MMAth/text/intro.tex ¶
  The features include a @mutex@ statement, channels, a @waituntil@ statement, and an actor system.
  All of these features exist in other programming languages in some shape or form; however, this thesis extends the original ideas by improving performance, productivity, and safety.

+ The first chapter of this thesis aims to familiarize the reader with the language \CFA.
+ In this chapter, the syntax and features of the \CFA language that appear in this work are discussed.
+ The next chapter briefly discusses prior concurrency work in \CFA and how this work builds on top of existing features.
+ The remaining chapters each introduce a concurrent language feature, discuss prior related work, and present contributions which are then benchmarked against other languages and systems.
+ The first of these chapters discusses the @mutex@ statement, a language feature that improves the ease of use and safety of lock usage.
+ The @mutex@ statement is compared both in terms of safety and performance with similar tools in \CC and Java.
+ The following chapter discusses channels, a message passing concurrency primitive that provides an avenue for safe synchronous and asynchronous communication across threads.
+ Channels in \CFA are compared to Go, which popularized the use of channels in modern concurrent programs.
+ The following chapter discusses the \CFA actor system.
+ The \CFA actor system is a close cousin of channels, as it also belongs to the message passing paradigm of concurrency.
+ However, the actor system provides a greater degree of abstraction and ease of scalability, making it useful for a different range of problems than channels.
+ The actor system in \CFA is compared with a variety of other systems on a suite of benchmarks, where it achieves significant performance gains over other systems due to its design.
+ The final chapter discusses the \CFA @waituntil@ statement, which provides the ability to synchronize while waiting for a resource, such as acquiring a lock, accessing a future, or writing to a channel.
+ The @waituntil@ statement presented provides greater flexibility and expressibility than similar features in other languages.
+ All in all, the features presented aim to fill in gaps in the current \CFA concurrent language support, and enable users to write a wider range of complex concurrent programs with ease.
TabularUnified doc/theses/colby_parsons_MMAth/text/mutex_stmt.tex ¶
  Figure~\ref{f:AtomicCounter} shows a \CFA and Java monitor implementing an atomic counter.
  A \Newterm{monitor} is a programming technique that implicitly binds mutual exclusion to static function scope by call and return.
- In contrast, lock mutual exclusion, defined by acquire/release calls, is independent of lexical context (analogous to block versus heap storage allocation).
+ In contrast, lock mutual exclusion, defined by acquire/release calls, is independent of lexical context (analogous to stack versus heap storage allocation).
  Restricting acquire and release points in a monitor eases programming, comprehension, and maintenance, at a slight cost in flexibility and efficiency.
  Ultimately, a monitor is implemented using a combination of basic locks and atomic instructions.
…
  }
  \end{cfa}
- The \CFA monitor implementation ensures multi-lock acquisition is done in a deadlock-free manner regardless of the number of MX parameters and monitor arguments.
+ The \CFA monitor implementation ensures multi-lock acquisition is done in a deadlock-free manner regardless of the number of MX parameters and monitor arguments. It is important to note that \CFA monitors do not attempt to solve the nested monitor problem~\cite{Lister77}.

  \section{\lstinline{mutex} statement}
…
  \VRef[Figure]{f:ReadersWriter} shows the outline of a reader/writer lock written as a \CFA monitor and mutex statements.
  (The exact lock implementation is irrelevant.)
- The @read@ and @write@ functions are called with a reader/write lock and any arguments to perform reading or writing.
+ The @read@ and @write@ functions are called with a reader/writer lock and any arguments to perform reading or writing.
  The @read@ function is not MX because multiple readers can read simultaneously.
  MX is acquired within @read@ by calling the (nested) helper functions @StartRead@ and @EndRead@ or executing the mutex statements.
…
  \section{\CFA implementation}
  The \CFA mutex statement takes some ideas from both the Java and \CC features.
- Like Java, \CFA introduces a new statement rather than building from existing language features.
- (\CFA has sufficient language features to mimic \CC RAII locking.)
+ Like Java, \CFA introduces a new statement rather than building from existing language features, although \CFA has sufficient language features to mimic \CC RAII locking.
  This syntactic choice makes MX explicit rather than implicit via object declarations.
  Hence, it is easier for programmers and language tools to identify MX points in a program, \eg scan for all @mutex@ parameters and statements in a body of code.
…
  \end{cquote}
  Comparatively, if the @scoped_lock@ is used and the same locks are acquired elsewhere, there is no concern of the @scoped_lock@ deadlocking, due to its avoidance scheme, but it may livelock.
- The convenience and safety of the @mutex@ statement, \eg guaranteed lock release with exceptions, should encourage programmers to always use it for locking, mitigating any deadlock scenario versus combining manual locking with the mutex statement.
+ The convenience and safety of the @mutex@ statement, \ie guaranteed lock release with exceptions, should encourage programmers to always use it for locking, mitigating any deadlock scenario versus combining manual locking with the mutex statement.
+ Neither \CC nor \CFA provides deadlock guarantees for nested @scoped_lock@s or @mutex@ statements.
+ To do so would require solving the nested monitor problem~\cite{Lister77}, which currently does not have any practical solutions.

  \section{Performance}
…
  In Figures~\ref{f:mutex_bench8_AMD} and \ref{f:mutex_bench8_Intel} there is the counter-intuitive result of the mutex statement performing better than the baseline.
  At 7 locks and above, the mutex statement switches from a hard-coded sort to insertion sort, which should decrease performance.
+ The hard-coded sort is branch-free and constant-time, and was verified to be faster than insertion sort for 6 or fewer locks.
  It is likely the increase in throughput compared to baseline is due to the delay spent in the insertion sort, which decreases contention on the locks.

  \begin{figure}
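The deadlock-free multi-lock acquisition described in this chapter rests on acquiring locks in a canonical order. A hedged Go sketch of that idea (the sort key here is the lock's address; \CFA's actual ordering criterion may differ):

package main

import (
	"fmt"
	"sort"
	"sync"
	"unsafe"
)

// lockInOrder sorts the locks into a canonical (address) order before
// acquiring them, so two threads locking the same set cannot deadlock.
func lockInOrder(locks ...*sync.Mutex) {
	sort.Slice(locks, func(i, j int) bool {
		return uintptr(unsafe.Pointer(locks[i])) < uintptr(unsafe.Pointer(locks[j]))
	})
	for _, l := range locks {
		l.Lock()
	}
}

func unlockAll(locks ...*sync.Mutex) {
	for _, l := range locks {
		l.Unlock()
	}
}

func main() {
	var a, b, c sync.Mutex
	lockInOrder(&c, &a, &b) // same acquisition order regardless of argument order
	fmt.Println("in critical section")
	unlockAll(&a, &b, &c)
}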
TabularUnified doc/theses/colby_parsons_MMAth/text/waituntil.tex ¶
  When discussing \gls{synch_multiplex} implementations, the resource being multiplexed is important.
- While CSP waits on channels, the earliest known implementation of synch\-ronous multiplexing is Unix's @select@~\cite{linux:select}, multiplexing over file descriptors.
+ While CSP waits on channels, the earliest known implementation of synch\-ronous multiplexing is Unix's @select@~\cite{linux:select}, multiplexing over file descriptors, which conceptually differ from channels in name only.
  The @select@ system-call is passed three sets of file descriptors (read, write, exceptional) to wait on and an optional timeout.
  @select@ blocks until either some subset of file descriptors is available or the timeout expires.
…
  Figure~\ref{f:BB_Ada} shows the outline of a bounded buffer implemented with an Ada task.
- Note, a task method is associated with the \lstinline[language=ada]{accept} clause of the \lstinline[language=ada]{select} statement, rather than being a separate routine.
+ Note that a task method is associated with the \lstinline[language=ada]{accept} clause of the \lstinline[language=ada]{select} statement, rather than being a separate routine.
  The thread executing the loop in the task body blocks at the \lstinline[language=ada]{select} until a call occurs to @insert@ or @remove@.
  Then the appropriate \lstinline[language=ada]{accept} method is run with the called arguments.
…
  The @waituntil@ statement supports guard clauses, both @or@ and @and@ semantics, and timeout and @else@ for asynchronous multiplexing.
  Figure~\ref{f:wu_example} shows a \CFA @waituntil@ usage, which is waiting for either @Lock@ to be available \emph{or} for a value to be read from @Channel@ into @i@ \emph{and} for @Future@ to be fulfilled \emph{or} a timeout of one second.
- Note, the expression inside a @waituntil@ clause is evaluated once at the start of the @waituntil@ algorithm.
+ Note that the expression inside a @waituntil@ clause is evaluated once at the start of the @waituntil@ algorithm.

  \section{Waituntil Semantics}
…
  If the @and@ operator waited for all clauses to be available before running, it is the same as just acquiring those resources consecutively by a sequence of @waituntil@ statements.

- As for normal C expressions, the @and@ operator binds more tightly than the @or@.
+ As with normal C expressions, the @and@ operator binds more tightly than the @or@.
  To give an @or@ operator higher precedence, parentheses are used.
  For example, the following @waituntil@ unconditionally waits for @C@ and one of either @A@ or @B@, since the @or@ is given higher precedence via parentheses.
…
  \end{cfa}
  If only @C2@ is satisfied, \emph{both} timeout code-blocks trigger because 1 second occurs before 3 seconds.
- Note, the \CFA @waitfor@ statement only provides a single @timeout@ clause because it only supports @or@ semantics.
+ Note that the \CFA @waitfor@ statement only provides a single @timeout@ clause because it only supports @or@ semantics.

  The \lstinline[deletekeywords={timeout}]{timeout} routine is different from UNIX @sleep@, which blocks for the specified duration and returns the amount of time elapsed since the call started.
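For comparison with the timeout clause described above, a rough Go analogue: Go's select provides only or-semantics, with time.After standing in for the timeout clause:

package main

import (
	"fmt"
	"time"
)

func main() {
	ch := make(chan int)
	go func() {
		time.Sleep(500 * time.Millisecond)
		ch <- 42
	}()

	select {
	case v := <-ch: // channel clause
		fmt.Println("received", v)
	case <-time.After(time.Second): // timeout clause
		fmt.Println("timed out")
	}
}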
…
  Correspondingly, if thread 2 wins the race and sees that @A@ has a waiting removal, then thread 1 must execute the first clause of its @waituntil@ and thread 2 execute its second.
  Any other execution scenario is incorrect for exclusive-or semantics.
- Note, that priority execution of multiple satisfied @waituntil@ causes (\ie top to bottom) is not violated because, in this scenario, there is only one satisfied clause for either thread.
+ Note that priority execution of multiple satisfied @waituntil@ clauses (\ie top to bottom) is not violated because, in this scenario, there is only one satisfied clause for either thread.

  The Go @select@ solves this problem by acquiring all the internal locks of the channels before registering the @select@ on the channels.
…
  The waiting condition of the @waituntil@ statement is implemented as the complete predicate over the resources, joined by the @waituntil@ operators, where a resource is @true@ if it is available, and @false@ otherwise.
  This complete predicate is used as the mechanism to check if a thread is done waiting on a @waituntil@.
- Leveraging the compiler, a predicate routine is generated per @waituntil@ that, when passed the statuses of the resources, returns @true@ when the @waituntil@ is done, and false otherwise.
+ Leveraging the compiler, a predicate routine is generated per @waituntil@.
+ This predicate routine accepts the statuses of the resources being waited on as arguments.
+ A resource status is an integer that indicates whether a resource is either not available, available, or has already run its associated code block.
+ The predicate routine returns @true@ when the @waituntil@ is done, \ie enough resources have run their associated code blocks to satisfy the @waituntil@'s predicate, and @false@ otherwise.
  To support guards on the \CFA @waituntil@ statement, the status of a resource disabled by a guard is set to a boolean value that ensures that the predicate function behaves as if that resource is no longer part of the predicate.
  The generated code allows the predicate that is checked with each iteration to be simplified to not check guard values.
…
  It would require some way to ensure channels used with internal operators are modified, if and only if, the corresponding code block is run.
  However, that is not feasible due to reasons described in the exclusive-or portion of Section~\ref{s:wu_chans}.
- Ultimately, I did not deem this feature to be crucial enough when taking into account the complexity and runtime cost.
+ Ultimately, this feature was not considered crucial after taking into account the complexity and runtime cost.

  \subsection{The full \lstinline{waituntil} picture}
…
  The channel multiplexing benchmarks compare \CFA's @waituntil@ and Go's \lstinline[language=Go]{select}, where the resource being waited on is a set of channels.
+ Although Unix's select, poll, and epoll multiplex over file descriptors, which are effectively channels, these utilities are not included in the benchmarks.
+ Reading or writing to a file descriptor requires a system call, which is much more expensive than operating on a user-space channel.
+ As such, it is infeasible to compare Unix's select, poll, and epoll with \CFA's @waituntil@ and Go's \lstinline[language=Go]{select}.
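Returning to the generated predicate routine described earlier in this chunk: a hypothetical hand-written Go analogue for the clause expression A and (B or C), where each status records whether a resource's associated code block has run (the status names and encoding are illustrative assumptions):

package main

const (
	notAvail = iota // resource not yet available
	avail           // resource available, code block not yet run
	hasRun          // resource's associated code block has executed
)

// predicate reports whether the waituntil "A && (B || C)" is complete,
// given the current status of each resource clause.
func predicate(a, b, c int) bool {
	return a == hasRun && (b == hasRun || c == hasRun)
}

func main() {
	println(predicate(hasRun, avail, hasRun)) // true: A and C have run
}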
  The basic structure of the benchmark has the number of cores split evenly between producer and consumer threads, \ie with 8 cores there are 4 producer and 4 consumer threads.
  The number of resource clauses $C$ is also varied across 2, 4, and 8 clauses, where each clause has a different channel that it waits on.
…
  Both Figures~\ref{f:select_contend_bench} and~\ref{f:select_spin_bench} have similar results when comparing \lstinline[language=Go]{select} and @waituntil@.
  In the AMD benchmarks (left column), the performance is very similar as the number of cores scales.
- The AMD machine has a high-caching contention cost because of its \emph{chicklet} L3 cache (\ie many L3 caches servicing a small number of cores), which creates a bottleneck on the channel locks and dominates the shape of the performance curve for both \CFA and Go.
+ The AMD machine has a high caching-contention cost because of its \emph{chiplet} L3 cache (\ie many L3 caches servicing a small number of cores), which creates a bottleneck on the channel locks and dominates the shape of the performance curve for both \CFA and Go.
  Hence, it is difficult to detect differences in the \gls{synch_multiplex}, except at low cores, where Go has significantly better performance, due to an optimization in its scheduler.
  Go heavily optimizes thread handoffs on the local run-queue, which can result in very good performance for low numbers of threads parking/unparking each other~\cite{go:sched}.
…
  \end{cquote}
  In Go, this setup results in significantly worse performance since @P2@ and @C2@ cannot operate in parallel with @P1@ and @C1@ due to all locks being acquired.
- Interesting, this case may not be as pathological as it seems.
+ Interestingly, this case may not be as pathological as it seems.
  If the set of channels belonging to a \lstinline[language=Go]{select} have channels that overlap with the set of another \lstinline[language=Go]{select}, these statements lose the ability to operate in parallel.
  The implementation in \CFA only holds a single lock at a time, resulting in better locking granularity, and hence, more parallelism.