Ignore:
Timestamp:
May 18, 2022, 3:59:14 PM (3 years ago)
Author:
Thierry Delisle <tdelisle@…>
Branches:
ADT, ast-experimental, master, pthread-emulation, qualifiedEnum
Children:
288927f
Parents:
fa2a3b1
Message:

A whole lot of results and some text section done

Location:
doc/theses/thierry_delisle_PhD/thesis/text
Files:
3 edited

Legend:

Unmodified
Added
Removed
  • doc/theses/thierry_delisle_PhD/thesis/text/eval_macro.tex

    rfa2a3b1 r622a358  
    77Networked ZIPF
    88
     9Nginx : 5Gb still good, 4Gb starts to suffer
     10
     11Cforall : 10Gb too high, 4 Gb too low
     12
    913\section{Memcached}
    1014
    11 In Memory
     15\subsection{Benchmark Environment}
     16These experiments are run on a cluster of homogenous Supermicro SYS-6017R-TDF compute nodes with the following characteristics:
     17The server runs Ubuntu 20.04.3 LTS on top of Linux Kernel 5.11.0-34.
     18Each node has 2 Intel(R) Xeon(R) CPU E5-2620 v2 running at 2.10GHz.
     19These CPUs have 6 cores per CPUs and 2 \glspl{hthrd} per core, for a total of 24 \glspl{hthrd}.
     20The cpus each have 384 KB, 3 MB and 30 MB of L1, L2 and L3 caches respectively.
     21Each node is connected to the network through a Mellanox 10 Gigabit Ethernet port.
     22The network route uses 1 Mellanox SX1012 10/40 Gigabit Ethernet cluster switch.
    1223
    13 Networked
     24
     25
     26\begin{figure}
     27        \centering
     28        \input{result.memcd.updt.qps.pstex_t}
     29        \caption[Churn Benchmark : Throughput on Intel]{Churn Benchmark : Throughput on Intel\smallskip\newline Description}
     30        \label{fig:memcd:updt:qps}
     31\end{figure}
     32
     33\begin{figure}
     34        \centering
     35        \input{result.memcd.updt.lat.pstex_t}
     36        \caption[Churn Benchmark : Throughput on Intel]{Churn Benchmark : Throughput on Intel\smallskip\newline Description}
     37        \label{fig:memcd:updt:lat}
     38\end{figure}
     39
     40\begin{figure}
     41        \centering
     42        \input{result.memcd.rate.qps.pstex_t}
     43        \caption[Churn Benchmark : Throughput on Intel]{Churn Benchmark : Throughput on Intel\smallskip\newline Description}
     44        \label{fig:memcd:rate:qps}
     45\end{figure}
     46
     47\begin{figure}
     48        \centering
     49        \input{result.memcd.rate.99th.pstex_t}
     50        \caption[Churn Benchmark : Throughput on Intel]{Churn Benchmark : Throughput on Intel\smallskip\newline Description}
     51        \label{fig:memcd:rate:tail}
     52\end{figure}
  • doc/theses/thierry_delisle_PhD/thesis/text/eval_micro.tex

    rfa2a3b1 r622a358  
    66\section{Benchmark Environment}
    77All of these benchmarks are run on two distinct hardware environment, an AMD and an INTEL machine.
     8
     9For all benchmarks, \texttt{taskset} is used to limit the experiment to 1 NUMA Node with no hyper threading.
     10If more \glspl{hthrd} are needed, then 1 NUMA Node with hyperthreading is used.
     11If still more \glspl{hthrd} are needed then the experiment is limited to as few NUMA Nodes as needed.
     12
    813
    914\paragraph{AMD} The AMD machine is a server with two AMD EPYC 7662 CPUs and 256GB of DDR4 RAM.
     
    2328
    2429\section{Cycling latency}
     30\begin{figure}
     31        \centering
     32        \input{cycle.pstex_t}
     33        \caption[Cycle benchmark]{Cycle benchmark\smallskip\newline Each \gls{at} unparks the next \gls{at} in the cycle before parking itself.}
     34        \label{fig:cycle}
     35\end{figure}
    2536The most basic evaluation of any ready queue is to evaluate the latency needed to push and pop one element from the ready-queue.
    2637Since these two operation also describe a \texttt{yield} operation, many systems use this as the most basic benchmark.
     
    4253Note that this problem is only present on SMP machines and is significantly mitigated by the fact that there are multiple rings in the system.
    4354
    44 \begin{figure}
    45         \centering
    46         \input{cycle.pstex_t}
    47         \caption[Cycle benchmark]{Cycle benchmark\smallskip\newline Each \gls{at} unparks the next \gls{at} in the cycle before parking itself.}
    48         \label{fig:cycle}
    49 \end{figure}
    50 
    5155To avoid this benchmark from being dominated by the idle sleep handling, the number of rings is kept at least as high as the number of \glspl{proc} available.
    5256Beyond this point, adding more rings serves to mitigate even more the idle sleep handling.
     
    5458
    5559The actual benchmark is more complicated to handle termination, but that simply requires using a binary semphore or a channel instead of raw \texttt{park}/\texttt{unpark} and carefully picking the order of the \texttt{P} and \texttt{V} with respect to the loop condition.
    56 
    57 \begin{lstlisting}
    58         Thread.main() {
    59                 count := 0
    60                 for {
    61                         wait()
    62                         this.next.wake()
    63                         count ++
    64                         if must_stop() { break }
    65                 }
    66                 global.count += count
    67         }
    68 \end{lstlisting}
    69 
    70 \begin{figure}
    71         \centering
    72         \input{result.cycle.jax.ops.pstex_t}
    73         \vspace*{-10pt}
    74         \label{fig:cycle:ns:jax}
    75 \end{figure}
     60Figure~\ref{fig:cycle:code} shows pseudo code for this benchmark.
     61
     62\begin{figure}
     63        \begin{lstlisting}
     64                Thread.main() {
     65                        count := 0
     66                        for {
     67                                wait()
     68                                this.next.wake()
     69                                count ++
     70                                if must_stop() { break }
     71                        }
     72                        global.count += count
     73                }
     74        \end{lstlisting}
     75        \caption[Cycle Benchmark : Pseudo Code]{Cycle Benchmark : Pseudo Code}
     76        \label{fig:cycle:code}
     77\end{figure}
     78
     79
     80
     81\subsection{Results}
     82\begin{figure}
     83        \subfloat[][Throughput, 100 \ats per \proc]{
     84                \resizebox{0.5\linewidth}{!}{
     85                        \input{result.cycle.jax.ops.pstex_t}
     86                }
     87                \label{fig:cycle:jax:ops}
     88        }
     89        \subfloat[][Throughput, 1 \ats per \proc]{
     90                \resizebox{0.5\linewidth}{!}{
     91                        \input{result.cycle.low.jax.ops.pstex_t}
     92                }
     93                \label{fig:cycle:jax:low:ops}
     94        }
     95
     96        \subfloat[][Latency, 100 \ats per \proc]{
     97                \resizebox{0.5\linewidth}{!}{
     98                        \input{result.cycle.jax.ns.pstex_t}
     99                }
     100
     101        }
     102        \subfloat[][Latency, 1 \ats per \proc]{
     103                \resizebox{0.5\linewidth}{!}{
     104                        \input{result.cycle.low.jax.ns.pstex_t}
     105                }
     106                \label{fig:cycle:jax:low:ns}
     107        }
     108        \caption[Cycle Benchmark on Intel]{Cycle Benchmark on Intel\smallskip\newline Throughput as a function of \proc count, using 100 cycles per \proc, 5 \ats per cycle.}
     109        \label{fig:cycle:jax}
     110\end{figure}
     111Figure~\ref{fig:cycle:jax} shows the throughput as a function of \proc count, with the following constants:
     112Each run uses 100 cycles per \proc, 5 \ats per cycle.
     113
     114\todo{results discussion}
    76115
    77116\section{Yield}
     
    81120Its only interesting variable is the number of \glspl{at} per \glspl{proc}, where ratios close to 1 means the ready queue(s) could be empty.
    82121This sometimes puts more strain on the idle sleep handling, compared to scenarios where there is clearly plenty of work to be done.
    83 
    84 \todo{code, setup, results}
    85 
    86 \begin{lstlisting}
    87         Thread.main() {
    88                 count := 0
    89                 while !stop {
    90                         yield()
    91                         count ++
    92                 }
    93                 global.count += count
    94         }
    95 \end{lstlisting}
     122Figure~\ref{fig:yield:code} shows pseudo code for this benchmark, the ``wait/wake-next'' is simply replaced by a yield.
     123
     124\begin{figure}
     125        \begin{lstlisting}
     126                Thread.main() {
     127                        count := 0
     128                        for {
     129                                yield()
     130                                count ++
     131                                if must_stop() { break }
     132                        }
     133                        global.count += count
     134                }
     135        \end{lstlisting}
     136        \caption[Yield Benchmark : Pseudo Code]{Yield Benchmark : Pseudo Code}
     137        \label{fig:yield:code}
     138\end{figure}
     139
     140\subsection{Results}
     141\begin{figure}
     142        \subfloat[][Throughput, 100 \ats per \proc]{
     143                \resizebox{0.5\linewidth}{!}{
     144                        \input{result.yield.jax.ops.pstex_t}
     145                }
     146                \label{fig:yield:jax:ops}
     147        }
     148        \subfloat[][Throughput, 1 \ats per \proc]{
     149                \resizebox{0.5\linewidth}{!}{
     150                \input{result.yield.low.jax.ops.pstex_t}
     151                }
     152                \label{fig:yield:jax:low:ops}
     153        }
     154
     155        \subfloat[][Latency, 100 \ats per \proc]{
     156                \resizebox{0.5\linewidth}{!}{
     157                \input{result.yield.jax.ns.pstex_t}
     158                }
     159                \label{fig:yield:jax:ns}
     160        }
     161        \subfloat[][Latency, 1 \ats per \proc]{
     162                \resizebox{0.5\linewidth}{!}{
     163                \input{result.yield.low.jax.ns.pstex_t}
     164                }
     165                \label{fig:yield:jax:low:ns}
     166        }
     167        \caption[Yield Benchmark on Intel]{Yield Benchmark on Intel\smallskip\newline Throughput as a function of \proc count, using 1 \ats per \proc.}
     168        \label{fig:yield:jax}
     169\end{figure}
     170Figure~\ref{fig:yield:ops:jax} shows the throughput as a function of \proc count, with the following constants:
     171Each run uses 100 \ats per \proc.
     172
     173\todo{results discussion}
    96174
    97175
     
    105183In either case, this benchmark aims to highlight how each scheduler handles these cases, since both cases can lead to performance degradation if they are not handled correctly.
    106184
    107 To achieve this the benchmark uses a fixed size array of \newterm{chair}s, where a chair is a data structure that holds a single blocked \gls{at}.
    108 When a \gls{at} attempts to block on the chair, it must first unblocked the \gls{at} currently blocked on said chair, if any.
    109 This creates a flow where \glspl{at} push each other out of the chairs before being pushed out themselves.
    110 For this benchmark to work however, the number of \glspl{at} must be equal or greater to the number of chairs plus the number of \glspl{proc}.
     185To achieve this the benchmark uses a fixed size array of semaphores.
     186Each \gls{at} picks a random semaphore, \texttt{V}s it to unblock a \at waiting and then \texttt{P}s on the semaphore.
     187This creates a flow where \glspl{at} push each other out of the semaphores before being pushed out themselves.
     188For this benchmark to work however, the number of \glspl{at} must be equal or greater to the number of semaphores plus the number of \glspl{proc}.
     189Note that the nature of these semaphores mean the counter can go beyond 1, which could lead to calls to \texttt{P} not blocking.
    111190
    112191\todo{code, setup, results}
     
    116195                for {
    117196                        r := random() % len(spots)
    118                         next := xchg(spots[r], this)
    119                         if next { next.wake() }
    120                         wait()
     197                        spots[r].V()
     198                        spots[r].P()
    121199                        count ++
    122200                        if must_stop() { break }
     
    125203        }
    126204\end{lstlisting}
     205
     206\begin{figure}
     207        \subfloat[][Throughput, 100 \ats per \proc]{
     208                \resizebox{0.5\linewidth}{!}{
     209                        \input{result.churn.jax.ops.pstex_t}
     210                }
     211                \label{fig:churn:jax:ops}
     212        }
     213        \subfloat[][Throughput, 1 \ats per \proc]{
     214                \resizebox{0.5\linewidth}{!}{
     215                        \input{result.churn.low.jax.ops.pstex_t}
     216                }
     217                \label{fig:churn:jax:low:ops}
     218        }
     219
     220        \subfloat[][Latency, 100 \ats per \proc]{
     221                \resizebox{0.5\linewidth}{!}{
     222                        \input{result.churn.jax.ns.pstex_t}
     223                }
     224
     225        }
     226        \subfloat[][Latency, 1 \ats per \proc]{
     227                \resizebox{0.5\linewidth}{!}{
     228                        \input{result.churn.low.jax.ns.pstex_t}
     229                }
     230                \label{fig:churn:jax:low:ns}
     231        }
     232        \caption[Churn Benchmark on Intel]{\centering Churn Benchmark on Intel\smallskip\newline Throughput and latency of the Churn on the benchmark on the Intel machine. Throughput is the total operation per second across all cores. Latency is the duration of each opeartion.}
     233        \label{fig:churn:jax}
     234\end{figure}
    127235
    128236\section{Locality}
  • doc/theses/thierry_delisle_PhD/thesis/text/practice.tex

    rfa2a3b1 r622a358  
    77More precise \CFA supports adding \procs using the RAII object @processor@.
    88These objects can be created at any time and can be destroyed at any time.
    9 They are normally create as automatic stack variables, but this is not a requirement.
     9They are normally created as automatic stack variables, but this is not a requirement.
    1010
    1111The consequence is that the scheduler and \io subsystems must support \procs comming in and out of existence.
    1212
    1313\section{Manual Resizing}
    14 The consequence of dynamically changing the number of \procs is that all internal arrays that are sized based on the number of \procs neede to be \texttt{realloc}ed.
    15 This also means that any references into these arrays, pointers or indexes, may need to be fixed when shrinking\footnote{Indexes may still need fixing because there is no guarantee the \proc causing the shrink had the highest index. Therefore indexes need to be reassigned to preserve contiguous indexes.}.
    16 
    17 There are no performance requirements, within reason, for resizing since this is usually considered as part of setup and teardown.
     14Manual resizing is expected to be a rare operation.
     15Programmers are mostly expected to resize clusters on startup or teardown.
     16Therefore dynamically changing the number of \procs is an appropriate moment to allocate or free resources to match the new state.
     17As such all internal arrays that are sized based on the number of \procs need to be \texttt{realloc}ed.
     18This also means that any references into these arrays, pointers or indexes, may need to be fixed when shrinking\footnote{Indexes may still need fixing when shrinkingbecause some indexes are expected to refer to dense contiguous resources and there is no guarantee the resource being removed has the highest index.}.
     19
     20There are no performance requirements, within reason, for resizing since it is expected to be rare.
    1821However, this operation has strict correctness requirements since shrinking and idle sleep can easily lead to deadlocks.
    1922It should also avoid as much as possible any effect on performance when the number of \procs remain constant.
    20 This later requirement prehibits simple solutions, like simply adding a global lock to these arrays.
     23This later requirement prohibits naive solutions, like simply adding a global lock to the ready-queue arrays.
    2124
    2225\subsection{Read-Copy-Update}
     
    2427In this pattern, resizing is done by creating a copy of the internal data strucures, updating the copy with the desired changes, and then attempt an Idiana Jones Switch to replace the original witht the copy.
    2528This approach potentially has the advantage that it may not need any synchronization to do the switch.
    26 The switch definitely implies a race where \procs could still use the previous, original, data structure after the copy was switched in.
    27 The important question then becomes whether or not this race can be recovered from.
    28 If the changes that arrived late can be transferred from the original to the copy then this solution works.
    29 
    30 For linked-lists, dequeing is somewhat of a problem.
     29However, there is a race where \procs could still use the previous, original, data structure after the copy was switched in.
     30This race not only requires some added memory reclamation scheme, it also requires that operations made on the stale original version be eventually moved to the copy.
     31
     32For linked-lists, enqueing is only somewhat problematic, \ats enqueued to the original queues need to be transferred to the new, which might not preserve ordering.
     33Dequeing is more challenging.
    3134Dequeing from the original will not necessarily update the copy which could lead to multiple \procs dequeing the same \at.
    32 Fixing this requires making the array contain pointers to subqueues rather than the subqueues themselves.
     35Fixing this requires more synchronization or more indirection on the queues.
    3336
    3437Another challenge is that the original must be kept until all \procs have witnessed the change.
     
    97100In addition to users manually changing the number of \procs, it is desireable to support ``removing'' \procs when there is not enough \ats for all the \procs to be useful.
    98101While manual resizing is expected to be rare, the number of \ats is expected to vary much more which means \procs may need to be ``removed'' for only short periods of time.
    99 Furthermore, race conditions that spuriously lead to the impression no \ats are ready are actually common in practice.
    100 Therefore \procs should not be actually \emph{removed} but simply put into an idle state where the \gls{kthrd} is blocked until more \ats become ready.
     102Furthermore, race conditions that spuriously lead to the impression that no \ats are ready are actually common in practice.
     103Therefore resources associated with \procs should not be freed but \procs simply put into an idle state where the \gls{kthrd} is blocked until more \ats become ready.
    101104This state is referred to as \newterm{Idle-Sleep}.
    102105
     
    110113The \CFA scheduler simply follows the ``Race-to-Idle'\cit{https://doi.org/10.1137/1.9781611973099.100}' approach where a sleeping \proc is woken any time an \at becomes ready and \procs go to idle sleep anytime they run out of work.
    111114
     115\section{Sleeping}
     116As usual, the corner-stone of any feature related to the kernel is the choice of system call.
     117In terms of blocking a \gls{kthrd} until some event occurs the linux kernel has many available options:
     118
     119\paragraph{\texttt{pthread\_mutex}/\texttt{pthread\_cond}}
     120The most classic option is to use some combination of \texttt{pthread\_mutex} and \texttt{pthread\_cond}.
     121These serve as straight forward mutual exclusion and synchronization tools and allow a \gls{kthrd} to wait on a \texttt{pthread\_cond} until signalled.
     122While this approach is generally perfectly appropriate for \glspl{kthrd} waiting after eachother, \io operations do not signal \texttt{pthread\_cond}s.
     123For \io results to wake a \proc waiting on a \texttt{pthread\_cond} means that a different \glspl{kthrd} must be woken up first, and then the \proc can be signalled.
     124
     125\subsection{\texttt{io\_uring} and Epoll}
     126An alternative is to flip the problem on its head and block waiting for \io, using \texttt{io\_uring} or even \texttt{epoll}.
     127This creates the inverse situation, where \io operations directly wake sleeping \procs but waking \proc from a running \gls{kthrd} must use an indirect scheme.
     128This generally takes the form of creating a file descriptor, \eg, a dummy file, a pipe or an event fd, and using that file descriptor when \procs need to wake eachother.
     129This leads to additional complexity because there can be a race between these artificial \io operations and genuine \io operations.
     130If not handled correctly, this can lead to the artificial files going out of sync.
     131
     132\subsection{Event FDs}
     133Another interesting approach is to use an event file descriptor\cit{eventfd}.
     134This is a Linux feature that is a file descriptor that behaves like \io, \ie, uses \texttt{read} and \texttt{write}, but also behaves like a semaphore.
     135Indeed, all read and writes must use 64bits large values\footnote{On 64-bit Linux, a 32-bit Linux would use 32 bits values.}.
     136Writes add their values to the buffer, that is arithmetic addition and not buffer append, and reads zero out the buffer and return the buffer values so far\footnote{This is without the \texttt{EFD\_SEMAPHORE} flag. This flags changes the behavior of \texttt{read} but is not needed for this work.}.
     137If a read is made while the buffer is already 0, the read blocks until a non-0 value is added.
     138What makes this feature particularly interesting is that \texttt{io\_uring} supports the \texttt{IORING\_REGISTER\_EVENTFD} command, to register an event fd to a particular instance.
     139Once that instance is registered, any \io completion will result in \texttt{io\_uring} writing to the event FD.
     140This means that a \proc waiting on the event FD can be \emph{directly} woken up by either other \procs or incomming \io.
     141
     142\begin{figure}
     143        \centering
     144        \input{idle1.pstex_t}
     145        \caption[Basic Idle Sleep Data Structure]{Basic Idle Sleep Data Structure \smallskip\newline Each idle \proc is put unto a doubly-linked stack protected by a lock.
     146        Each \proc has a private event FD.}
     147        \label{fig:idle1}
     148\end{figure}
     149
    112150
    113151\section{Tracking Sleepers}
    114152Tracking which \procs are in idle sleep requires a data structure holding all the sleeping \procs, but more importantly it requires a concurrent \emph{handshake} so that no \at is stranded on a ready-queue with no active \proc.
    115153The classic challenge is when a \at is made ready while a \proc is going to sleep, there is a race where the new \at may not see the sleeping \proc and the sleeping \proc may not see the ready \at.
    116 
    117 Furthermore, the ``Race-to-Idle'' approach means that there is some
    118 
    119 \section{Sleeping}
    120 
    121 \subsection{Event FDs}
    122 
    123 \subsection{Epoll}
    124 
    125 \subsection{\texttt{io\_uring}}
    126 
    127 \section{Reducing Latency}
     154Since \ats can be made ready by timers, \io operations or other events outside a clusre, this race can occur even if the \proc going to sleep is the only \proc awake.
     155As a result, improper handling of this race can lead to all \procs going to sleep and the system deadlocking.
     156
     157Furthermore, the ``Race-to-Idle'' approach means that there may be contention on the data structure tracking sleepers.
     158Contention slowing down \procs attempting to sleep or wake-up can be tolerated.
     159These \procs are not doing useful work and therefore not contributing to overall performance.
     160However, notifying, checking if a \proc must be woken-up and doing so if needed, can significantly affect overall performance and must be low cost.
     161
     162\subsection{Sleepers List}
     163Each cluster maintains a list of idle \procs, organized as a stack.
     164This ordering hopefully allows \proc at the tail to stay in idle sleep for extended period of times.
     165Because of these unbalanced performance requirements, the algorithm tracking sleepers is designed to have idle \proc handle as much of the work as possible.
     166The idle \procs maintain the of sleepers among themselves and notifying a sleeping \proc takes as little work as possible.
     167This approach means that maintaining the list is fairly straightforward.
     168The list can simply use a single lock per cluster and only \procs that are getting in and out of idle state will contend for that lock.
     169
     170This approach also simplifies notification.
     171Indeed, \procs need to be notify when a new \at is readied, but they also must be notified during resizing, so the \gls{kthrd} can be joined.
     172This means that whichever entity removes idle \procs from the sleeper list must be able to do so in any order.
     173Using a simple lock over this data structure makes the removal much simpler than using a lock-free data structure.
     174The notification process then simply needs to wake-up the desired idle \proc, using \texttt{pthread\_cond\_signal}, \texttt{write} on an fd, etc., and the \proc will handle the rest.
     175
     176\subsection{Reducing Latency}
     177As mentioned in this section, \procs going idle for extremely short periods of time is likely in certain common scenarios.
     178Therefore, the latency of doing a system call to read from and writing to the event fd can actually negatively affect overall performance in a notable way.
     179Is it important to reduce latency and contention of the notification as much as possible.
     180Figure~\ref{fig:idle1} shoes the basic idle sleep data structure.
     181For the notifiers, this data structure can cause contention on the lock and the event fd syscall can cause notable latency.
     182
     183\begin{figure}
     184        \centering
     185        \input{idle2.pstex_t}
     186        \caption[Improved Idle Sleep Data Structure]{Improved Idle Sleep Data Structure \smallskip\newline An atomic pointer is added to the list, pointing to the Event FD of the first \proc on the list.}
     187        \label{fig:idle2}
     188\end{figure}
     189
     190The contention is mostly due to the lock on the list needing to be held to get to the head \proc.
     191That lock can be contended by \procs attempting to go to sleep, \procs waking or notification attempts.
     192The contentention from the \procs attempting to go to sleep can be mitigated slightly by using \texttt{try\_acquire} instead, so the \procs simply continue searching for \ats if the lock is held.
     193This trick cannot be used for waking \procs since they are not in a state where they can run \ats.
     194However, it is worth nothing that notification does not strictly require accessing the list or the head \proc.
     195Therefore, contention can be reduced notably by having notifiers avoid the lock entirely and adding a pointer to the event fd of the first idle \proc, as in Figure~\ref{fig:idle2}.
     196To avoid contention between the notifiers, instead of simply reading the atomic pointer, notifiers atomically exchange it to \texttt{null} so only only notifier will contend on the system call.
     197
     198\begin{figure}
     199        \centering
     200        \input{idle_state.pstex_t}
     201        \caption[Improved Idle Sleep Data Structure]{Improved Idle Sleep Data Structure \smallskip\newline An atomic pointer is added to the list, pointing to the Event FD of the first \proc on the list.}
     202        \label{fig:idle:state}
     203\end{figure}
     204
     205The next optimization that can be done is to avoid the latency of the event fd when possible.
     206This can be done by adding what is effectively a benaphore\cit{benaphore} in front of the event fd.
     207A simple three state flag is added beside the event fd to avoid unnecessary system calls, as shown in Figure~\ref{fig:idle:state}.
     208The flag starts in state \texttt{SEARCH}, while the \proc is searching for \ats to run.
     209The \proc then confirms the sleep by atomically swaping the state to \texttt{SLEEP}.
     210If the previous state was still \texttt{SEARCH}, then the \proc does read the event fd.
     211Meanwhile, notifiers atomically exchange the state to \texttt{AWAKE} state.
     212if the previous state was \texttt{SLEEP}, then the notifier must write to the event fd.
     213However, if the notify arrives almost immediately after the \proc marks itself idle, then both reads and writes on the event fd can be omitted, which reduces latency notably.
     214This leads to the final data structure shown in Figure~\ref{fig:idle}.
     215
     216\begin{figure}
     217        \centering
     218        \input{idle.pstex_t}
     219        \caption[Low-latency Idle Sleep Data Structure]{Low-latency Idle Sleep Data Structure \smallskip\newline Each idle \proc is put unto a doubly-linked stack protected by a lock.
     220        Each \proc has a private event FD with a benaphore in front of it.
     221        The list also has an atomic pointer to the event fd and benaphore of the first \proc on the list.}
     222        \label{fig:idle}
     223\end{figure}
Note: See TracChangeset for help on using the changeset viewer.