Apr 11, 2022, 5:51:28 PM (6 months ago)
Thierry Delisle <tdelisle@…>
enum, master, pthread-emulation, qualifiedEnum
05e33f5, 29b3692

First complete draft of chapter 3.
Still very rough but probably worth reading.

  • doc/theses/thierry_delisle_PhD/thesis/text/core.tex

    r8d76f2b r13888c0  
Therefore this unprotected read of the timestamp and average satisfies the limited correctness that is required.
\begin{figure}
	\centering
	\input{cache-share.pstex_t}
	\caption[CPU design with wide L3 sharing]{CPU design with wide L3 sharing \smallskip\newline A very simple CPU with 4 \glspl{hthrd}. L1 and L2 are private to each \gls{hthrd} but the L3 is shared across the entire CPU.}
	\label{fig:cache-share}
\end{figure}
\begin{figure}
	\centering
	\input{cache-noshare.pstex_t}
	\caption[CPU design with narrower L3 sharing]{CPU design with narrower L3 sharing \smallskip\newline A different CPU design, still with 4 \glspl{hthrd}. L1 and L2 are still private to each \gls{hthrd}, but the L3 is shared by only some of the \glspl{hthrd}: there are two distinct L3 instances.}
	\label{fig:cache-noshare}
\end{figure}
With redundant timestamps, this scheduling algorithm achieves both the fairness and performance requirements, on some machines.
The problem is that the cost of polling and helping is not necessarily consistent across each \gls{hthrd}.
For example, on machines with multiple CPUs on the motherboard, a cache miss can be satisfied from a cache that belongs to the CPU that missed, the \emph{local} CPU, or by a different CPU, a \emph{remote} one.
Cache misses satisfied by a remote CPU have higher latency than ones satisfied by the local CPU.
However, this problem is not specific to systems with multiple CPUs.
Depending on the cache structure, cache misses can have different latency on the same CPU.
The AMD EPYC 7662 CPU described in Chapter~\ref{microbench} is an example of this.
Figures~\ref{fig:cache-share} and~\ref{fig:cache-noshare} show two different cache topologies that highlight this difference.
In Figure~\ref{fig:cache-share}, all cache instances are either private to a \gls{hthrd} or shared across the entire system, so cache-miss latencies are likely fairly consistent.
By comparison, in Figure~\ref{fig:cache-noshare}, misses in the L2 cache can be satisfied by a hit in either instance of the L3.
However, the memory access latency to the remote L3 instance is notably higher than the memory access latency to the local L3.
The impact of these different designs on this algorithm is that scheduling scales very well on architectures similar to Figure~\ref{fig:cache-share}, but scales notably worse on architectures with many narrower L3 instances, like Figure~\ref{fig:cache-noshare}.
This is simply because, as the number of L3 instances grows, so too does the chance that random helping incurs significant latency.
The solution is to make the scheduler aware of the cache topology.
\subsection{Per CPU Sharding}
Building a scheduler that is aware of cache topology poses two main challenges: discovering the cache topology and matching \procs to cache instances.
Sadly, there is no standard portable way to discover cache topology in C.
Therefore, while this is a significant portability challenge, designing a cross-platform cache discovery mechanism is outside the scope of this thesis.
The rest of this work assumes the cache topology is discovered from Linux's \texttt{/sys/devices/system/cpu} directory.
This leaves the challenge of matching \procs to cache instances, or more precisely identifying which subqueues of the ready queue are local to which cache instance.
Once this matching is available, the helping algorithm can be changed to add a bias so that \procs more often help subqueues local to the same cache instance%
\footnote{Note that, like the other biases mentioned in this section, the actual bias value does not appear to need precise tuning.}.
The obvious approach to mapping cache instances to subqueues is to statically tie subqueues to CPUs.
Instead of having each subqueue local to a specific \proc, the system is initialized with one subqueue per \gls{hthrd} up front.
Then \procs dequeue and enqueue by first querying which CPU they are currently running on, in order to identify which subqueues are the local ones.
\Glspl{proc} can get the CPU id from \texttt{sched\_getcpu} or \texttt{librseq}.
This approach solves the performance problem on systems with topologies similar to Figure~\ref{fig:cache-noshare}.
However, it actually causes subtle fairness problems on some systems, specifically systems with few \procs and many \glspl{hthrd}.
In these cases, the large number of subqueues and the bias against subqueues tied to different cache instances make it very unlikely that any single subqueue is picked.
To make things worse, the small number of \procs means that few helping attempts are made.
This combination of few attempts and low chances means a \at stranded on a subqueue that is not actively dequeued from may wait a very long time before it is randomly helped.
On a system with 2 \procs, 256 \glspl{hthrd} with narrow cache sharing, and a 100:1 bias, it can actually take multiple seconds for a \at to be dequeued from a remote subqueue.
Therefore, a more dynamic matching of subqueues to cache instances is needed.
\subsection{Topological Work Stealing}
The approach used in the \CFA scheduler is to have per-\proc subqueues, but with an explicit data structure that tracks which cache instance each subqueue is tied to.
This requires some finesse, because reading this data structure must lead to fewer cache misses than not having the data structure in the first place.
A key element, however, is that, like the timestamps used for helping, reading the cache instance mapping only needs to give the correct result \emph{often enough}.
Therefore, the algorithm can be built as follows: before enqueuing or dequeuing a \at, each \proc queries the CPU id and the corresponding cache instance.
Since subqueues are tied to \procs, each \proc can then update the cache instance mapped to its local subqueue(s).
To avoid unnecessary cache line invalidation, the map is only written to if the mapping changes.