Changeset f519bd8


Ignore:
Timestamp:
Jul 3, 2023, 4:13:34 PM (17 months ago)
Author:
caparsons <caparson@…>
Branches:
master
Children:
3397eed, adb67cf3
Parents:
96ea77a
Message:

reworked actor steal section

Location:
doc/theses/colby_parsons_MMAth
Files:
4 edited

Legend:

Unmodified
Added
Removed
  • doc/theses/colby_parsons_MMAth/Makefile

    r96ea77a rf519bd8  
    8383        diagrams/acyclic_swap \
    8484        diagrams/cyclic_swap \
     85        diagrams/steal \
    8586}
    8687
  • doc/theses/colby_parsons_MMAth/glossary.tex

    r96ea77a rf519bd8  
    7171name=synchronous multiplexing,
    7272first={\Newterm{synchronous multiplexing}},
    73 description={synchronization on some subset of a set of resources.}
     73description={synchronization waiting for some subset of a set of resources.}
    7474}
  • doc/theses/colby_parsons_MMAth/local.bib

    r96ea77a rf519bd8  
    196196}
    197197
     198@inproceedings{Doherty04,
     199  title={DCAS is not a silver bullet for nonblocking algorithm design},
     200  author={Doherty, Simon and Detlefs, David L and Groves, Lindsay and Flood, Christine H and Luchangco, Victor and Martin, Paul A and Moir, Mark and Shavit, Nir and Steele Jr, Guy L},
     201  booktitle={Proceedings of the sixteenth annual ACM symposium on Parallelism in algorithms and architectures},
     202  pages={216--224},
     203  year={2004}
     204}
  • doc/theses/colby_parsons_MMAth/text/actors.tex

    r96ea77a rf519bd8  
    529529
    530530\section{Work Stealing}\label{s:steal}
    531 Work stealing is a scheduling strategy to provide \Newterm{load balance}.
     531Work stealing is a scheduling strategy to provide \Newterm{load balancing}.
    532532The goal is to increase resource utilization by having an idle thread steal work from a working thread.
    533 While there are multiple parts in a work-stealing scheduler, the two important components are the stealing mechanism and victim selection.
     533While there are multiple parts in a work-stealing scheduler, two important components are the stealing mechanism and victim selection.
    534534
    535535\subsection{Stealing Mechanism}
    536 In work stealing, the stealing worker is called the \Newterm{thief} and the stolen worker is called the \Newterm{victim}.
     536In work stealing, the stealing worker is called the \Newterm{thief} and the worker being stolen from is called the \Newterm{victim}.
    537537% Workers consume actors from their ready queue and execute their behaviours.
    538538% Through these behaviours, a worker inserts messages onto its own and other worker ready-queues.
    539 To steal, a thieve takes one or more actors from a victim's ready queue, so work stealing always results in a potential increase in contention on ready queues between the victim gulping from a queue and the thief stealing the queue.
     539To steal, a thief takes work from a victim's ready queue, so work stealing always results in a potential increase in contention on ready queues between the victim gulping from a queue and the thief stealing the queue.
    540540This contention can slows down the victim's throughput.
    541541Note, the data structure used for the ready queue is not discussed since the largest cost is the mutual exclusion and its duration for safely performing the queue operations.
    542 
    543 % C_TODO: maybe insert stealing diagram
    544542
    545543The stealing mechanism in this work differs from most work-stealing actor-systems because of the message-centric (inverted) actor-system.
     
    548546In \CFA, the actor work-stealing implementation is unique because of the message-centric system.
    549547With this approach, it is impractical to steal actors because an actor's messages are distributed in temporal order along the message queue.
    550 To ensure sequential actor execution and \gls{fifo} message delivery, actor stealing requires finding and removing \emph{all} of an actor's messages, and inserting them consecutively in another message queue.
     548To ensure sequential actor execution and \gls{fifo} message delivery in a message-centric system, stealing requires finding and removing \emph{all} of an actor's messages, and inserting them consecutively in another message queue.
    551549This operation is $O(N)$ with a non-trivial constant.
    552 The only way for work stealing to become practical is to shard each worker's message queue, which also reduces contention, and steal queues to eliminate queue searching.
    553 
    554 Given queue stealing, my goal is to have an essentially zero-contention-cost stealing mechanism.
    555 This goal means work stealing has minimal affect on the performance of the victim.
     550The only way for work stealing to become practical is to shard each worker's message queue, which also reduces contention, and to steal queues to eliminate queue searching.
     551
     552\begin{figure}
     553\begin{center}
     554\input{diagrams/steal.tikz}
     555\end{center}
     556\caption{Queue Stealing Mechanism}
     557\label{f:steal}
     558\end{figure}
     559
     560Given queue stealing, the goal of the presented work stealing implementation is to have an essentially zero-contention-cost stealing mechanism.
     561Achieving this goal requires work stealing to have minimal (practically no) effect on the performance of the victim.
    556562The implication is that thieves cannot contend with a victim, and that a victim should perform no stealing related work unless it becomes a thief.
    557563In theory, this goal is not achievable, but practical results show the goal is virtually achieved.
    558564
    559 One important lesson I learned working on \uC actors~\cite{} and talking with fellow student Thierry Delisle, who examined work-stealing for user-threads in his Ph.D., is \emph{not} to aggressively steal.
     565One important lesson learned while working on \uC actors~\cite{} and through discussions with fellow student Thierry Delisle, who examined work-stealing for user-threads in his Ph.D., is \emph{not} to aggressively steal.
    560566With reasonable workloads, being a thief should be a temporary state, \ie eventually work appears on the thief's ready-queues and it returns to normal operation.
    561567Furthermore, the act of \emph{looking} to find work is invasive (Heisenberg uncertainty principle), possibly disrupting multiple victims.
     
    563569Note, the cost of stealing is not crucial for the thief because it does not have anything else to do except poll or block.
    564570
    565 My lazy-stealing approach is to select a victim, scan its queues once, and return immediately if a queue is stolen.
    566 Then perform a regular scan looking for work, where stolen work is placed at the end of the scan.
    567 Hence, only one victim is effected and there is a reasonable delay between stealing events by scanning all the thief's ready queue looking for its own work.
    568 If no work is found in the thief's queues because a steal is unsuccessful, it performs another steal from a different victim.
     571As such a lazy-stealing approach is used; a thief select a victim, scan its queues once, and return immediately if a queue is stolen.
     572The thief then returns to normal operation and conducts a regular scan over its own queues looking for work, where stolen work is placed at the end of the scan.
     573Hence, only one victim is affected and there is a reasonable delay between stealing events by scanning all the thief's ready queue looking for its own work.
    569574This lazy examination by the thief has a low perturbation cost for victims, while still finding work in a moderately loaded system.
    570575In all work-stealing algorithms, there is the pathological case where there is too little work and too many workers;
     
    572577This case is not examined in this work.
    573578
    574 In detail, the \CFA work-stealing algorithm begins by iterating over its message queues twice without finding any work before it tries to steal a queue from another worker.
     579In more detail, the \CFA work-stealing algorithm begins by iterating over its message queues twice without finding any work before it tries to steal a queue from another worker.
    575580Stealing a queue is done wait-free (\ie no busy waiting) with a few atomic instructions that only create contention with other stealing workers, not the victim.
    576581The complexity in the implementation is that victim gulping does not take the mailbox queue;
     
    580585otherwise there are two threads simultaneously running messages on actors in the two parts of the mailbox queue.
    581586To solve this problem, an atomic gulp also marks the mailbox queue as subdivided, making it ineligible for stealing.
    582 Hence, a thief checks for ineligible and non-empty before attempting an atomic steal of a queue,
    583 
    584 
     587Hence, a thief checks if a queue is eligible and non-empty before attempting an atomic steal of a queue.
     588
     589Internally, the mailbox queues are accessed through two arrays of size $N$, that are shared among all workers.
     590There is an array of mailbox queue objects @mailboxes@, and an array of pointers to the mailbox objects, @worker_queues@:
     591\begin{cfa}
     592work_queue * mailboxes;                   // master array of work request queues
     593work_queue ** worker_queues;                // secondary array of work queues to allow for swapping
     594\end{cfa}
     595A diagram of the queue architecture and stealing mechanism is shown in Figure~\ref{f:steal}.
     596
     597When a send happens, it inserts a request directly into one of the @mailboxes@, but when a steal happens, two pointers in @worker_queues@ are swapped.
     598Conceptually, @worker_queues@ represents the ownership relation between mailboxes and workers.
     599Given $M$ workers and $N$ mailboxes, each worker owns a contiguous $M$/$N$ block of pointers in @worker_queues@.
     600When a worker gulps, it dereferences one of the pointers in its section of @worker_queues@ and then gulps from the queue it points at.
     601To transfer ownership of a mailbox from one worker to another, a pointer from each of the workers' ranges are swapped.
     602This structure provides near-complete separation of stealing and gulping/send operations.
     603As such, operations can happen on the queues independent of stealing, which avoids almost all contention between thief threads and victim threads.
    585604
    586605To steal a queue, a thief does the following:
    587606\begin{enumerate}[topsep=5pt,itemsep=3pt,parsep=0pt]
    588607\item
    589 chooses a victim, which is trivial because there is an array of $N$ pointers to the mailboxes (copy queues) that is subdivided into $M$/$N$ worker ranges \see{Section~\ref{s:executor}}.
    590 
    591 \item
    592 scan the victim's mailbox-range and test each mail queue in that range for a non-empty queue by performing a test-for-non-empty and swap.
    593 The conditional check reduces the number of atomic operations.
    594 The swap races to interchange the non-empty mail-queue pointer from the victim's range with an empty mail-queue pointer in the thief's range.
    595 Regardless, either a thief swaps or the victim gulps the mail queue;
    596 correctness is guaranteed, no matter which thread wins the race, because both threads are cycling through appropriate lists.
    597 Note, a thief can never exceeds its $M$/$N$ worker range because it is always exchanging queues with other workers.
    598 
    599 \item
    600 stops searching after a non-empty queue steal or all queues in the random worker's range are examined.
    601 The thief then returns to its scheduler and iterates over its messages queues, because new messages may have arrived during stealing, including a stolen queue.
    602 Stealing is only repeated after two consecutive iterations over its owned queues without finding work.
     608chooses a victim. Victim selection heuristics are independent of the stealing mechanism and will be discussed in Section~\ref{s:victimSelect}.
     609
     610\item
     611scan the victim's $M$/$N$ range of @worker_queues@ and non-atomically checks each mail queue to see if it is eligible and non-empty.
     612If one such mail queue is found, the thief attempts to steal the queue by swapping two pointers in @worker_queues@.
     613The swap races to interchange the non-empty mail-queue pointer from the victim's @worker_queues@ range with an empty mail-queue pointer in the thief's @worker_queues@ range.
     614This swap can fail if another thief steals the queue, which will be discussed further in Section~\ref{s:swap}.
     615Note, a thief never exceeds its $M$/$N$ worker range because it is always exchanging queues with other workers.
     616If no such mail queue is found, no swap is attempted.
     617
     618\item
     619stops searching after a successful queue steal, a failed queue steal or all queues in the victim's range are examined.
     620The thief then resumes normal execution and ceases being a thief; it returns to its own queues and iterates over them, because new messages may have arrived during stealing, including ones in the potentially stolen queue.
     621Stealing is only repeated after the worker completes two consecutive iterations over its owned queues without finding work.
    603622\end{enumerate}
    604 This approach largely eliminates contention among thieves and victims except for the rare moment when a thief and victim attempt to steal or gulp the same queue.
    605 As well, pushes to the queues by other workers can happen concurrently during the swap since pushes happen via the actor queue references.
    606 
    607 
    608 
    609 \begin{enumerate}[topsep=5pt,itemsep=3pt,parsep=0pt]
    610 \item
    611 selects a random worker's mailbox-range and tests each mail queue in that range for a non-empty queue.
    612 
    613 A queue can be stolen and gulped at the same time since they are independent operations.
    614 However, two workers cannot gulp from the same queue at the same time since that would violate the ordering and mutual exclusion guarantees of actors, so an avail flag is used on queues to stop this from happening.
    615 If a worker tries to gulp a queue and sees an avail flag up, the gulp fails, and it will move on to other queues and try to gulp it again later.
    616 stops searching after a non-empty queue steal or all queues in the random worker's range are examined.
    617 It stops searching after any queue steal (even an empty one) or all queues are examined.
    618 
    619 Stealing is only repeated after two consecutive iterations over its owned queues without finding work.
    620 Note, a thief can never exceeds its $M$/$N$ worker range because it is always exchanging queues with other workers.
    621 \end{enumerate}
    622 This approach largely eliminates contention among thieves and victims except for the rare moment when a thief and victim attempt to steal or gulp the same queue.
    623 The first and last sentence here are correct. I'm not sure I understand the second sentence.
    624 
    625 
    626 
    627 As an aside I think I need a diagram here since it is clear that the algorithm is not clear through my writing.
    628 I made a very quick one that I'll use here to explain, but I will work on one to add to the chapter.
    629 
    630 
    631 To give some accompanying code there are two arrays, the array of copy queues, and the array of pointers to the copy queues:
    632 \begin{cfa}
    633 work_queue * request_queues;                   // master array of work request queues
    634 work_queue ** worker_req_queues;                // secondary array of work queues to allow for swapping
    635 \end{cfa}
    636 \includegraphics[width=4in]{../figures/steal_diagram.eps}
    637 When a send happens, it inserts a request directly into one of the queues in @request_queues@, when a steal happens two pointers in @worker_req_queues@ are swapped.
    638 Each worker owns a section of @worker_req_queues@.
    639 When a worker gulps, it dereferences one of the pointers in its section of @worker_req_queues@ and then gulps from the queue it points at.
    640 As such, operations can happen on the queues directly independent of stealing, which avoids almost all contention between stealing threads and busy threads.
    641 
    642 If you want to poke some of the code, in @actor.hfa@ @steal_work()@ selects the victim, @choose_queue()@ iterates over the queues looking for non-empty queues and @try_swap_queues()@ performs the thread safe pointer swap.
    643 
    644 % The first key to this is that actors and workers maintain two distinct arrays of references to queues.
    645 % Actors will always receive messages via the same queues.
    646 % Workers, on the other hand will swap the pointers to queues in their shared array and operate on queues in the range of that array that they own.
    647 % Swapping queues is a matter of atomically swapping two pointers in the worker array.
    648 % As such pushes to the queues can happen concurrently during the swap since pushes happen via the actor queue references.
    649 
    650 % Gulping can also occur during queue swapping, but the implementation requires more nuance than the pushes.
    651 % When a worker is not stealing it iterates across its own range of queues and gulps them one by one.
    652 
    653 Correctness of gulping is slightly more complex with stealing.
    654 When a worker operates on a queue, it first copies the queue pointer from the mailbox array to a local variable.
     623
     624This approach largely eliminates contention among thieves and victims except for the rare moment when a thief steals a queue and the victim and thief concurrently attempt to gulp the same queue.
     625When a victim operates on a queue, it first copies the queue pointer from @worker_queues@ to a local variable.
    655626It then uses that local variable for all queue operations until it moves to the next index of its range of the queue array.
    656627This ensures that any swaps do not interrupt gulping operations, however this introduces a correctness issue.
     628There is a window for a race condition between the victim and a thief.
     629Once a victim copies the queue pointer from @worker_queues@, a thief could steal that pointer and both may try to gulp from the same queue.
     630These two gulps cannot be allowed to happen concurrently.
    657631If any behaviours from a queue are run by two workers at a time it violates both mutual exclusion and the actor ordering guarantees.
    658 As such this must be avoided.
    659 To avoid this each queue has a @being_processed@ flag that is atomically set to @true@ when a queue is gulped.
    660 The flag indicates that a queue is being processed locally and is set back to @false@ once the local processing is finished.
    661 If a worker attempts to gulp a queue and finds that the @being_processed@ flag is @true@, it does not gulp the queue and moves on to the next queue in its range.
     632To avoid concurrent gulps, each queue stores a @being_processed@ flag that is atomically set to @true@ when a queue is gulped.
     633The flag indicates that a queue is being processed by a worker and is set back to @false@ once the local processing is finished.
     634If a worker, either victim or thief, attempts to gulp a queue and finds that the @being_processed@ flag is @true@, it does not gulp the queue and moves on to the next queue in its range.
    662635This is a source of contention between victims and thieves since a thief may steal a queue and set @being_processed@ to @true@ between a victim saving a pointer to a queue and gulping it.
    663636However, the window for this race is very small, making this contention rare.
    664 This is why the claim is made that this mechanism is zero-victim-cost in practice but not in theory.
    665 By collecting statistics on failed gulps due to the @being_processed@ flag, it is found that this contention occurs ~0.05\% of the time when a gulp occurs.
     637This is why this mechanism is zero-victim-cost in practice but not in theory.
     638By collecting statistics on failed gulps due to the @being_processed@ flag, it is found that this contention occurs ~0.05\% of the time when a gulp occurs (1 in every 2000 gulps).
    666639Hence, the claim is made that this stealing mechanism has zero-victim-cost in practice.
    667640
    668 
    669 \subsection{Queue Swap Correctness}
     641\subsection{Queue Pointer Swap}\label{s:swap}
     642Two atomically swap two pointers in @worker_queues@, a novel wait-free swap algorithm is used.
     643The novel wait-free swap is effectively a special case of a DCAS.
     644DCAS stands for Double Compare-And-Swap, which is a more general version of the Compare-And-Swap (CAS) atomic operation~\cite{Doherty04}.
     645CAS compares is a read-modify-write operation available on most modern architectures which atomically compares an expected value with a memory location.
     646If the expected value and value in memory are equal it then writes a new value into the memory location.
     647A sample software implemention of CAS follows.
     648\begin{cfa}
     649// assume this routine executes atomically
     650bool CAS( val * ptr, val expected, val new ) {
     651        if ( *ptr != expected )
     652                return false;
     653        *ptr = new;
     654        return true;
     655}
     656\end{cfa}
     657As shown CAS only operates on one memory location.
     658Where CAS operates on a single memory location and some values, DCAS operates on two memory locations.
     659\begin{cfa}
     660// assume this routine executes atomically
     661bool DCAS( val * addr1, val * addr2, val old1, val old2, val new1, val new2 ) {
     662        if ( ( *addr1 == old1 ) && ( *addr2 == old2 ) ) {
     663        *addr1 = new1;
     664        *addr2 = new2;
     665        return true;
     666    }
     667    return false;
     668}
     669\end{cfa}
     670The DCAS implemented in this work is special cased in two ways.
     671First of all, a DCAS is more powerful than what is needed to swap two pointers.
     672A double atomic swap (DAS) is all that is needed for this work.
     673The atomic swap provided by most modern hardware requires that at least one operand is a register.
     674A DAS would relax that restriction so that both operands of the swap could be memory locations.
     675As such a DAS can be written in terms of the DCAS above as follows.
     676\begin{cfa}
     677bool DAS( val * addr1, val * addr2 ) {
     678    return DCAS( addr1, addr2, *addr1, *addr2, *addr2, *addr1 );
     679}
     680\end{cfa}
     681The other special case is that the values being swapped will never null pointers.
     682This allows the DAS implementation presented to use null pointers as intermediate values during the swap.
     683
    670684Given the wait-free swap used is novel, it is important to show that it is correct.
    671 Firstly, it is clear to show that the swap is wait-free since all workers will fail or succeed in swapping the queues in a finite number of steps since there are no locks or looping.
     685It is clear to show that the swap is wait-free since all thieves will fail or succeed in swapping the queues in a finite number of steps since there are no locks or looping.
    672686There is no retry mechanism in the case of a failed swap, since a failed swap either means the work was already stolen, or that work was stolen from the thief.
    673 In both cases it is apropos for a thief to given up on stealing.
     687In both cases it is apropos for a thief to give up on stealing.
    674688\CFA-style pseudocode for the queue swap is presented below.
    675689The swap uses compare-and-swap (@CAS@) which is just pseudocode for C's @__atomic_compare_exchange_n@.
     
    681695void swap( uint victim_idx, uint my_idx ) {
    682696        // Step 0:
    683         work_queue * my_queue = request_queues[my_idx];
    684         work_queue * vic_queue = request_queues[victim_idx];
     697        work_queue * my_queue = mailboxes[my_idx];
     698        work_queue * vic_queue = mailboxes[victim_idx];
    685699        // Step 2:
    686         request_queues[my_idx] = 0p;
     700        mailboxes[my_idx] = 0p;
    687701        // Step 3:
    688         request_queues[victim_idx] = my_queue;
     702        mailboxes[victim_idx] = my_queue;
    689703        // Step 4:
    690         request_queues[my_idx] = vic_queue;
    691 }
    692 \end{cfa}
    693 
    694 Step 1 is missing in the sequential example since in only matter in the concurrent context presented later.
     704        mailboxes[my_idx] = vic_queue;
     705}
     706\end{cfa}
     707
     708Step 1 is missing in the sequential example since it only matters in the concurrent context presented later.
    695709By looking at the sequential swap it is easy to see that it is correct.
    696710Temporary copies of each pointer being swapped are stored, and then the original values of each pointer are set using the copy of the other pointer.
    697711
    698712\begin{cfa}
    699 // This routine is atomic
    700 bool CAS( work_queue ** ptr, work_queue ** old, work_queue * new ) {
    701         if ( *ptr != *old )
    702                 return false;
    703         *ptr = new;
    704         return true;
    705 }
    706713
    707714bool try_swap_queues( worker & this, uint victim_idx, uint my_idx ) with(this) {
    708715        // Step 0:
    709         // request_queues is the shared array of all sharded queues
    710         work_queue * my_queue = request_queues[my_idx];
    711         work_queue * vic_queue = request_queues[victim_idx];
     716        // mailboxes is the shared array of all sharded queues
     717        work_queue * my_queue = mailboxes[my_idx];
     718        work_queue * vic_queue = mailboxes[victim_idx];
    712719
    713720        // Step 1:
    714         // If either queue is 0p then they are in the process of being stolen
     721        // If the victim queue is 0p then they are in the process of stealing so return false
    715722        // 0p is Cforall's equivalent of C++'s nullptr
    716723        if ( vic_queue == 0p ) return false;
    717724
    718725        // Step 2:
    719         // Try to set thief's queue ptr to be 0p.
    720         // If this CAS fails someone stole thief's queue so return false
    721         if ( !CAS( &request_queues[my_idx], &my_queue, 0p ) )
     726        // Try to set our own (thief's) queue ptr to be 0p.
     727        // If this CAS fails someone stole our (thief's) queue so return false
     728        if ( !CAS( &mailboxes[my_idx], &my_queue, 0p ) )
    722729                return false;
    723730
    724731        // Step 3:
    725         // Try to set victim queue ptr to be thief's queue ptr.
     732        // Try to set victim queue ptr to be our (thief's) queue ptr.
    726733        // If it fails someone stole the other queue, so fix up then return false
    727         if ( !CAS( &request_queues[victim_idx], &vic_queue, my_queue ) ) {
    728                 request_queues[my_idx] = my_queue; // reset queue ptr back to prev val
     734        if ( !CAS( &mailboxes[victim_idx], &vic_queue, my_queue ) ) {
     735                mailboxes[my_idx] = my_queue; // reset queue ptr back to prev val
    729736                return false;
    730737        }
     
    734741        // Thief's ptr is 0p so no one will touch it
    735742        // Write back without CAS is safe
    736         request_queues[my_idx] = vic_queue;
     743        mailboxes[my_idx] = vic_queue;
    737744        return true;
    738745}
Note: See TracChangeset for help on using the changeset viewer.