Changeset f519bd8 for doc/theses
- Timestamp:
- Jul 3, 2023, 4:13:34 PM (2 years ago)
- Branches:
- master
- Children:
- 3397eed, adb67cf3
- Parents:
- 96ea77a
- Location:
- doc/theses/colby_parsons_MMAth
- Files:
- 
      - 4 edited
 
 - 
          
  Makefile (modified) (1 diff)
- 
          
  glossary.tex (modified) (1 diff)
- 
          
  local.bib (modified) (1 diff)
- 
          
  text/actors.tex (modified) (7 diffs)
 
Legend:
- Unmodified
- Added
- Removed
- 
      doc/theses/colby_parsons_MMAth/Makefiler96ea77a rf519bd8 83 83 diagrams/acyclic_swap \ 84 84 diagrams/cyclic_swap \ 85 diagrams/steal \ 85 86 } 86 87 
- 
      doc/theses/colby_parsons_MMAth/glossary.texr96ea77a rf519bd8 71 71 name=synchronous multiplexing, 72 72 first={\Newterm{synchronous multiplexing}}, 73 description={synchronization onsome subset of a set of resources.}73 description={synchronization waiting for some subset of a set of resources.} 74 74 } 
- 
      doc/theses/colby_parsons_MMAth/local.bibr96ea77a rf519bd8 196 196 } 197 197 198 @inproceedings{Doherty04, 199 title={DCAS is not a silver bullet for nonblocking algorithm design}, 200 author={Doherty, Simon and Detlefs, David L and Groves, Lindsay and Flood, Christine H and Luchangco, Victor and Martin, Paul A and Moir, Mark and Shavit, Nir and Steele Jr, Guy L}, 201 booktitle={Proceedings of the sixteenth annual ACM symposium on Parallelism in algorithms and architectures}, 202 pages={216--224}, 203 year={2004} 204 } 
- 
      doc/theses/colby_parsons_MMAth/text/actors.texr96ea77a rf519bd8 529 529 530 530 \section{Work Stealing}\label{s:steal} 531 Work stealing is a scheduling strategy to provide \Newterm{load balanc e}.531 Work stealing is a scheduling strategy to provide \Newterm{load balancing}. 532 532 The goal is to increase resource utilization by having an idle thread steal work from a working thread. 533 While there are multiple parts in a work-stealing scheduler, t he two important components are the stealing mechanism and victim selection.533 While there are multiple parts in a work-stealing scheduler, two important components are the stealing mechanism and victim selection. 534 534 535 535 \subsection{Stealing Mechanism} 536 In work stealing, the stealing worker is called the \Newterm{thief} and the stolen workeris called the \Newterm{victim}.536 In work stealing, the stealing worker is called the \Newterm{thief} and the worker being stolen from is called the \Newterm{victim}. 537 537 % Workers consume actors from their ready queue and execute their behaviours. 538 538 % Through these behaviours, a worker inserts messages onto its own and other worker ready-queues. 539 To steal, a thie ve takes one or more actorsfrom a victim's ready queue, so work stealing always results in a potential increase in contention on ready queues between the victim gulping from a queue and the thief stealing the queue.539 To steal, a thief takes work from a victim's ready queue, so work stealing always results in a potential increase in contention on ready queues between the victim gulping from a queue and the thief stealing the queue. 540 540 This contention can slows down the victim's throughput. 541 541 Note, the data structure used for the ready queue is not discussed since the largest cost is the mutual exclusion and its duration for safely performing the queue operations. 542 543 % C_TODO: maybe insert stealing diagram544 542 545 543 The stealing mechanism in this work differs from most work-stealing actor-systems because of the message-centric (inverted) actor-system. … … 548 546 In \CFA, the actor work-stealing implementation is unique because of the message-centric system. 549 547 With this approach, it is impractical to steal actors because an actor's messages are distributed in temporal order along the message queue. 550 To ensure sequential actor execution and \gls{fifo} message delivery , actorstealing requires finding and removing \emph{all} of an actor's messages, and inserting them consecutively in another message queue.548 To ensure sequential actor execution and \gls{fifo} message delivery in a message-centric system, stealing requires finding and removing \emph{all} of an actor's messages, and inserting them consecutively in another message queue. 551 549 This operation is $O(N)$ with a non-trivial constant. 552 The only way for work stealing to become practical is to shard each worker's message queue, which also reduces contention, and steal queues to eliminate queue searching. 553 554 Given queue stealing, my goal is to have an essentially zero-contention-cost stealing mechanism. 555 This goal means work stealing has minimal affect on the performance of the victim. 550 The only way for work stealing to become practical is to shard each worker's message queue, which also reduces contention, and to steal queues to eliminate queue searching. 551 552 \begin{figure} 553 \begin{center} 554 \input{diagrams/steal.tikz} 555 \end{center} 556 \caption{Queue Stealing Mechanism} 557 \label{f:steal} 558 \end{figure} 559 560 Given queue stealing, the goal of the presented work stealing implementation is to have an essentially zero-contention-cost stealing mechanism. 561 Achieving this goal requires work stealing to have minimal (practically no) effect on the performance of the victim. 556 562 The implication is that thieves cannot contend with a victim, and that a victim should perform no stealing related work unless it becomes a thief. 557 563 In theory, this goal is not achievable, but practical results show the goal is virtually achieved. 558 564 559 One important lesson I learned working on \uC actors~\cite{} and talkingwith fellow student Thierry Delisle, who examined work-stealing for user-threads in his Ph.D., is \emph{not} to aggressively steal.565 One important lesson learned while working on \uC actors~\cite{} and through discussions with fellow student Thierry Delisle, who examined work-stealing for user-threads in his Ph.D., is \emph{not} to aggressively steal. 560 566 With reasonable workloads, being a thief should be a temporary state, \ie eventually work appears on the thief's ready-queues and it returns to normal operation. 561 567 Furthermore, the act of \emph{looking} to find work is invasive (Heisenberg uncertainty principle), possibly disrupting multiple victims. … … 563 569 Note, the cost of stealing is not crucial for the thief because it does not have anything else to do except poll or block. 564 570 565 My lazy-stealing approach is to select a victim, scan its queues once, and return immediately if a queue is stolen. 566 Then perform a regular scan looking for work, where stolen work is placed at the end of the scan. 567 Hence, only one victim is effected and there is a reasonable delay between stealing events by scanning all the thief's ready queue looking for its own work. 568 If no work is found in the thief's queues because a steal is unsuccessful, it performs another steal from a different victim. 571 As such a lazy-stealing approach is used; a thief select a victim, scan its queues once, and return immediately if a queue is stolen. 572 The thief then returns to normal operation and conducts a regular scan over its own queues looking for work, where stolen work is placed at the end of the scan. 573 Hence, only one victim is affected and there is a reasonable delay between stealing events by scanning all the thief's ready queue looking for its own work. 569 574 This lazy examination by the thief has a low perturbation cost for victims, while still finding work in a moderately loaded system. 570 575 In all work-stealing algorithms, there is the pathological case where there is too little work and too many workers; … … 572 577 This case is not examined in this work. 573 578 574 In detail, the \CFA work-stealing algorithm begins by iterating over its message queues twice without finding any work before it tries to steal a queue from another worker.579 In more detail, the \CFA work-stealing algorithm begins by iterating over its message queues twice without finding any work before it tries to steal a queue from another worker. 575 580 Stealing a queue is done wait-free (\ie no busy waiting) with a few atomic instructions that only create contention with other stealing workers, not the victim. 576 581 The complexity in the implementation is that victim gulping does not take the mailbox queue; … … 580 585 otherwise there are two threads simultaneously running messages on actors in the two parts of the mailbox queue. 581 586 To solve this problem, an atomic gulp also marks the mailbox queue as subdivided, making it ineligible for stealing. 582 Hence, a thief checks for ineligible and non-empty before attempting an atomic steal of a queue, 583 584 587 Hence, a thief checks if a queue is eligible and non-empty before attempting an atomic steal of a queue. 588 589 Internally, the mailbox queues are accessed through two arrays of size $N$, that are shared among all workers. 590 There is an array of mailbox queue objects @mailboxes@, and an array of pointers to the mailbox objects, @worker_queues@: 591 \begin{cfa} 592 work_queue * mailboxes; // master array of work request queues 593 work_queue ** worker_queues; // secondary array of work queues to allow for swapping 594 \end{cfa} 595 A diagram of the queue architecture and stealing mechanism is shown in Figure~\ref{f:steal}. 596 597 When a send happens, it inserts a request directly into one of the @mailboxes@, but when a steal happens, two pointers in @worker_queues@ are swapped. 598 Conceptually, @worker_queues@ represents the ownership relation between mailboxes and workers. 599 Given $M$ workers and $N$ mailboxes, each worker owns a contiguous $M$/$N$ block of pointers in @worker_queues@. 600 When a worker gulps, it dereferences one of the pointers in its section of @worker_queues@ and then gulps from the queue it points at. 601 To transfer ownership of a mailbox from one worker to another, a pointer from each of the workers' ranges are swapped. 602 This structure provides near-complete separation of stealing and gulping/send operations. 603 As such, operations can happen on the queues independent of stealing, which avoids almost all contention between thief threads and victim threads. 585 604 586 605 To steal a queue, a thief does the following: 587 606 \begin{enumerate}[topsep=5pt,itemsep=3pt,parsep=0pt] 588 607 \item 589 chooses a victim , which is trivial because there is an array of $N$ pointers to the mailboxes (copy queues) that is subdivided into $M$/$N$ worker ranges \see{Section~\ref{s:executor}}.590 591 \item 592 scan the victim's mailbox-range and test each mail queue in that range for a non-empty queue by performing a test-for-non-empty and swap.593 The conditional check reduces the number of atomic operations.594 The swap races to interchange the non-empty mail-queue pointer from the victim's range with an empty mail-queue pointer in the thief'srange.595 Regardless, either a thief swaps or the victim gulps the mail queue; 596 correctness is guaranteed, no matter which thread wins the race, because both threads are cycling through appropriate lists.597 Note, a thief can never exceeds its $M$/$N$ worker range because it is always exchanging queues with other workers.598 599 \item 600 stops searching after a non-empty queue steal or all queues in the random worker's range are examined.601 The thief then re turns to its scheduler and iterates over its messages queues, because new messages may have arrived during stealing, including astolen queue.602 Stealing is only repeated after t wo consecutive iterations over its owned queues without finding work.608 chooses a victim. Victim selection heuristics are independent of the stealing mechanism and will be discussed in Section~\ref{s:victimSelect}. 609 610 \item 611 scan the victim's $M$/$N$ range of @worker_queues@ and non-atomically checks each mail queue to see if it is eligible and non-empty. 612 If one such mail queue is found, the thief attempts to steal the queue by swapping two pointers in @worker_queues@. 613 The swap races to interchange the non-empty mail-queue pointer from the victim's @worker_queues@ range with an empty mail-queue pointer in the thief's @worker_queues@ range. 614 This swap can fail if another thief steals the queue, which will be discussed further in Section~\ref{s:swap}. 615 Note, a thief never exceeds its $M$/$N$ worker range because it is always exchanging queues with other workers. 616 If no such mail queue is found, no swap is attempted. 617 618 \item 619 stops searching after a successful queue steal, a failed queue steal or all queues in the victim's range are examined. 620 The thief then resumes normal execution and ceases being a thief; it returns to its own queues and iterates over them, because new messages may have arrived during stealing, including ones in the potentially stolen queue. 621 Stealing is only repeated after the worker completes two consecutive iterations over its owned queues without finding work. 603 622 \end{enumerate} 604 This approach largely eliminates contention among thieves and victims except for the rare moment when a thief and victim attempt to steal or gulp the same queue. 605 As well, pushes to the queues by other workers can happen concurrently during the swap since pushes happen via the actor queue references. 606 607 608 609 \begin{enumerate}[topsep=5pt,itemsep=3pt,parsep=0pt] 610 \item 611 selects a random worker's mailbox-range and tests each mail queue in that range for a non-empty queue. 612 613 A queue can be stolen and gulped at the same time since they are independent operations. 614 However, two workers cannot gulp from the same queue at the same time since that would violate the ordering and mutual exclusion guarantees of actors, so an avail flag is used on queues to stop this from happening. 615 If a worker tries to gulp a queue and sees an avail flag up, the gulp fails, and it will move on to other queues and try to gulp it again later. 616 stops searching after a non-empty queue steal or all queues in the random worker's range are examined. 617 It stops searching after any queue steal (even an empty one) or all queues are examined. 618 619 Stealing is only repeated after two consecutive iterations over its owned queues without finding work. 620 Note, a thief can never exceeds its $M$/$N$ worker range because it is always exchanging queues with other workers. 621 \end{enumerate} 622 This approach largely eliminates contention among thieves and victims except for the rare moment when a thief and victim attempt to steal or gulp the same queue. 623 The first and last sentence here are correct. I'm not sure I understand the second sentence. 624 625 626 627 As an aside I think I need a diagram here since it is clear that the algorithm is not clear through my writing. 628 I made a very quick one that I'll use here to explain, but I will work on one to add to the chapter. 629 630 631 To give some accompanying code there are two arrays, the array of copy queues, and the array of pointers to the copy queues: 632 \begin{cfa} 633 work_queue * request_queues; // master array of work request queues 634 work_queue ** worker_req_queues; // secondary array of work queues to allow for swapping 635 \end{cfa} 636 \includegraphics[width=4in]{../figures/steal_diagram.eps} 637 When a send happens, it inserts a request directly into one of the queues in @request_queues@, when a steal happens two pointers in @worker_req_queues@ are swapped. 638 Each worker owns a section of @worker_req_queues@. 639 When a worker gulps, it dereferences one of the pointers in its section of @worker_req_queues@ and then gulps from the queue it points at. 640 As such, operations can happen on the queues directly independent of stealing, which avoids almost all contention between stealing threads and busy threads. 641 642 If you want to poke some of the code, in @actor.hfa@ @steal_work()@ selects the victim, @choose_queue()@ iterates over the queues looking for non-empty queues and @try_swap_queues()@ performs the thread safe pointer swap. 643 644 % The first key to this is that actors and workers maintain two distinct arrays of references to queues. 645 % Actors will always receive messages via the same queues. 646 % Workers, on the other hand will swap the pointers to queues in their shared array and operate on queues in the range of that array that they own. 647 % Swapping queues is a matter of atomically swapping two pointers in the worker array. 648 % As such pushes to the queues can happen concurrently during the swap since pushes happen via the actor queue references. 649 650 % Gulping can also occur during queue swapping, but the implementation requires more nuance than the pushes. 651 % When a worker is not stealing it iterates across its own range of queues and gulps them one by one. 652 653 Correctness of gulping is slightly more complex with stealing. 654 When a worker operates on a queue, it first copies the queue pointer from the mailbox array to a local variable. 623 624 This approach largely eliminates contention among thieves and victims except for the rare moment when a thief steals a queue and the victim and thief concurrently attempt to gulp the same queue. 625 When a victim operates on a queue, it first copies the queue pointer from @worker_queues@ to a local variable. 655 626 It then uses that local variable for all queue operations until it moves to the next index of its range of the queue array. 656 627 This ensures that any swaps do not interrupt gulping operations, however this introduces a correctness issue. 628 There is a window for a race condition between the victim and a thief. 629 Once a victim copies the queue pointer from @worker_queues@, a thief could steal that pointer and both may try to gulp from the same queue. 630 These two gulps cannot be allowed to happen concurrently. 657 631 If any behaviours from a queue are run by two workers at a time it violates both mutual exclusion and the actor ordering guarantees. 658 As such this must be avoided. 659 To avoid this each queue has a @being_processed@ flag that is atomically set to @true@ when a queue is gulped. 660 The flag indicates that a queue is being processed locally and is set back to @false@ once the local processing is finished. 661 If a worker attempts to gulp a queue and finds that the @being_processed@ flag is @true@, it does not gulp the queue and moves on to the next queue in its range. 632 To avoid concurrent gulps, each queue stores a @being_processed@ flag that is atomically set to @true@ when a queue is gulped. 633 The flag indicates that a queue is being processed by a worker and is set back to @false@ once the local processing is finished. 634 If a worker, either victim or thief, attempts to gulp a queue and finds that the @being_processed@ flag is @true@, it does not gulp the queue and moves on to the next queue in its range. 662 635 This is a source of contention between victims and thieves since a thief may steal a queue and set @being_processed@ to @true@ between a victim saving a pointer to a queue and gulping it. 663 636 However, the window for this race is very small, making this contention rare. 664 This is why th e claim is made that this mechanism is zero-victim-cost in practice but not in theory.665 By collecting statistics on failed gulps due to the @being_processed@ flag, it is found that this contention occurs ~0.05\% of the time when a gulp occurs .637 This is why this mechanism is zero-victim-cost in practice but not in theory. 638 By collecting statistics on failed gulps due to the @being_processed@ flag, it is found that this contention occurs ~0.05\% of the time when a gulp occurs (1 in every 2000 gulps). 666 639 Hence, the claim is made that this stealing mechanism has zero-victim-cost in practice. 667 640 668 669 \subsection{Queue Swap Correctness} 641 \subsection{Queue Pointer Swap}\label{s:swap} 642 Two atomically swap two pointers in @worker_queues@, a novel wait-free swap algorithm is used. 643 The novel wait-free swap is effectively a special case of a DCAS. 644 DCAS stands for Double Compare-And-Swap, which is a more general version of the Compare-And-Swap (CAS) atomic operation~\cite{Doherty04}. 645 CAS compares is a read-modify-write operation available on most modern architectures which atomically compares an expected value with a memory location. 646 If the expected value and value in memory are equal it then writes a new value into the memory location. 647 A sample software implemention of CAS follows. 648 \begin{cfa} 649 // assume this routine executes atomically 650 bool CAS( val * ptr, val expected, val new ) { 651 if ( *ptr != expected ) 652 return false; 653 *ptr = new; 654 return true; 655 } 656 \end{cfa} 657 As shown CAS only operates on one memory location. 658 Where CAS operates on a single memory location and some values, DCAS operates on two memory locations. 659 \begin{cfa} 660 // assume this routine executes atomically 661 bool DCAS( val * addr1, val * addr2, val old1, val old2, val new1, val new2 ) { 662 if ( ( *addr1 == old1 ) && ( *addr2 == old2 ) ) { 663 *addr1 = new1; 664 *addr2 = new2; 665 return true; 666 } 667 return false; 668 } 669 \end{cfa} 670 The DCAS implemented in this work is special cased in two ways. 671 First of all, a DCAS is more powerful than what is needed to swap two pointers. 672 A double atomic swap (DAS) is all that is needed for this work. 673 The atomic swap provided by most modern hardware requires that at least one operand is a register. 674 A DAS would relax that restriction so that both operands of the swap could be memory locations. 675 As such a DAS can be written in terms of the DCAS above as follows. 676 \begin{cfa} 677 bool DAS( val * addr1, val * addr2 ) { 678 return DCAS( addr1, addr2, *addr1, *addr2, *addr2, *addr1 ); 679 } 680 \end{cfa} 681 The other special case is that the values being swapped will never null pointers. 682 This allows the DAS implementation presented to use null pointers as intermediate values during the swap. 683 670 684 Given the wait-free swap used is novel, it is important to show that it is correct. 671 Firstly, it is clear to show that the swap is wait-free since all workers will fail or succeed in swapping the queues in a finite number of steps since there are no locks or looping.685 It is clear to show that the swap is wait-free since all thieves will fail or succeed in swapping the queues in a finite number of steps since there are no locks or looping. 672 686 There is no retry mechanism in the case of a failed swap, since a failed swap either means the work was already stolen, or that work was stolen from the thief. 673 In both cases it is apropos for a thief to give nup on stealing.687 In both cases it is apropos for a thief to give up on stealing. 674 688 \CFA-style pseudocode for the queue swap is presented below. 675 689 The swap uses compare-and-swap (@CAS@) which is just pseudocode for C's @__atomic_compare_exchange_n@. … … 681 695 void swap( uint victim_idx, uint my_idx ) { 682 696 // Step 0: 683 work_queue * my_queue = request_queues[my_idx];684 work_queue * vic_queue = request_queues[victim_idx];697 work_queue * my_queue = mailboxes[my_idx]; 698 work_queue * vic_queue = mailboxes[victim_idx]; 685 699 // Step 2: 686 request_queues[my_idx] = 0p;700 mailboxes[my_idx] = 0p; 687 701 // Step 3: 688 request_queues[victim_idx] = my_queue;702 mailboxes[victim_idx] = my_queue; 689 703 // Step 4: 690 request_queues[my_idx] = vic_queue;691 } 692 \end{cfa} 693 694 Step 1 is missing in the sequential example since i n only matterin the concurrent context presented later.704 mailboxes[my_idx] = vic_queue; 705 } 706 \end{cfa} 707 708 Step 1 is missing in the sequential example since it only matters in the concurrent context presented later. 695 709 By looking at the sequential swap it is easy to see that it is correct. 696 710 Temporary copies of each pointer being swapped are stored, and then the original values of each pointer are set using the copy of the other pointer. 697 711 698 712 \begin{cfa} 699 // This routine is atomic700 bool CAS( work_queue ** ptr, work_queue ** old, work_queue * new ) {701 if ( *ptr != *old )702 return false;703 *ptr = new;704 return true;705 }706 713 707 714 bool try_swap_queues( worker & this, uint victim_idx, uint my_idx ) with(this) { 708 715 // Step 0: 709 // request_queues is the shared array of all sharded queues710 work_queue * my_queue = request_queues[my_idx];711 work_queue * vic_queue = request_queues[victim_idx];716 // mailboxes is the shared array of all sharded queues 717 work_queue * my_queue = mailboxes[my_idx]; 718 work_queue * vic_queue = mailboxes[victim_idx]; 712 719 713 720 // Step 1: 714 // If either queue is 0p then they are in the process of being stolen721 // If the victim queue is 0p then they are in the process of stealing so return false 715 722 // 0p is Cforall's equivalent of C++'s nullptr 716 723 if ( vic_queue == 0p ) return false; 717 724 718 725 // Step 2: 719 // Try to set thief'squeue ptr to be 0p.720 // If this CAS fails someone stole thief'squeue so return false721 if ( !CAS( & request_queues[my_idx], &my_queue, 0p ) )726 // Try to set our own (thief's) queue ptr to be 0p. 727 // If this CAS fails someone stole our (thief's) queue so return false 728 if ( !CAS( &mailboxes[my_idx], &my_queue, 0p ) ) 722 729 return false; 723 730 724 731 // Step 3: 725 // Try to set victim queue ptr to be thief'squeue ptr.732 // Try to set victim queue ptr to be our (thief's) queue ptr. 726 733 // If it fails someone stole the other queue, so fix up then return false 727 if ( !CAS( & request_queues[victim_idx], &vic_queue, my_queue ) ) {728 request_queues[my_idx] = my_queue; // reset queue ptr back to prev val734 if ( !CAS( &mailboxes[victim_idx], &vic_queue, my_queue ) ) { 735 mailboxes[my_idx] = my_queue; // reset queue ptr back to prev val 729 736 return false; 730 737 } … … 734 741 // Thief's ptr is 0p so no one will touch it 735 742 // Write back without CAS is safe 736 request_queues[my_idx] = vic_queue;743 mailboxes[my_idx] = vic_queue; 737 744 return true; 738 745 } 
  Note:
 See   TracChangeset
 for help on using the changeset viewer.
  