Changeset 13888c0
- Timestamp:
- Apr 11, 2022, 5:51:28 PM (2 years ago)
- Branches:
- ADT, ast-experimental, enum, master, pthread-emulation, qualifiedEnum
- Children:
- 05e33f5, 29b3692
- Parents:
- 8d76f2b
- Location:
- doc/theses/thierry_delisle_PhD/thesis
- Files:
-
- 2 added
- 2 edited
doc/theses/thierry_delisle_PhD/thesis/Makefile
r8d76f2b r13888c0

	base \
	base_avg \
+	cache-share \
+	cache-noshare \
	empty \
	emptybit \
doc/theses/thierry_delisle_PhD/thesis/text/core.tex
r8d76f2b r13888c0

Therefore, this unprotected read of the timestamp and average satisfies the limited correctness that is required.

\begin{figure}
	\centering
	\input{cache-share.pstex_t}
	\caption[CPU design with wide L3 sharing]{CPU design with wide L3 sharing \smallskip\newline A very simple CPU with 4 \glspl{hthrd}. L1 and L2 are private to each \gls{hthrd}, but the L3 is shared across the entire CPU.}
	\label{fig:cache-share}
\end{figure}

\begin{figure}
	\centering
	\input{cache-noshare.pstex_t}
	\caption[CPU design with a narrower L3 sharing]{CPU design with narrower L3 sharing \smallskip\newline A different CPU design, still with 4 \glspl{hthrd}. L1 and L2 are still private to each \gls{hthrd}, but the L3 is shared by only some of them, so there are two distinct L3 instances.}
	\label{fig:cache-noshare}
\end{figure}

With redundant timestamps, this scheduling algorithm achieves both the fairness and performance requirements, on some machines.
The problem is that the cost of polling and helping is not necessarily consistent across each \gls{hthrd}.
For example, on machines where the motherboard holds multiple CPUs, cache misses can be satisfied from a cache that belongs to the CPU that missed, the \emph{local} CPU, or by a different CPU, a \emph{remote} one.
Cache misses satisfied by a remote CPU have higher latency than those satisfied by the local CPU.
However, this is not specific to systems with multiple CPUs.
Depending on the cache structure, cache misses can have different latency on the same CPU.
The AMD EPYC 7662 CPU described in Chapter~\ref{microbench} is an example of this.
Figure~\ref{fig:cache-share} and Figure~\ref{fig:cache-noshare} show two different cache topologies that highlight this difference.
In Figure~\ref{fig:cache-share}, all cache instances are either private to a \gls{hthrd} or shared by the entire system, which means the latency of cache misses is likely fairly consistent.
By comparison, in Figure~\ref{fig:cache-noshare}, misses in the L2 cache can be satisfied by a hit in either instance of the L3.
However, the memory-access latency to the remote L3 instance is notably higher than the memory-access latency to the local L3.
The impact of these different designs on this algorithm is that scheduling scales very well on architectures similar to Figure~\ref{fig:cache-share}, but has notably worse scaling with many narrower L3 instances.
This is simply because, as the number of L3 instances grows, so too does the chance that random helping causes significant latency.
The solution is to have the scheduler be aware of the cache topology.

\subsection{Per CPU Sharding}
Building a scheduler that is aware of cache topology poses two main challenges: discovering the cache topology and matching \procs to cache instances.
Sadly, there is no standard portable way to discover cache topology in C.
Therefore, while this is a significant portability challenge, designing a cross-platform cache-discovery mechanism is outside the scope of this thesis.
The rest of this work assumes the cache topology is discovered from Linux's \texttt{/sys/devices/system/cpu} directory.
This leaves the challenge of matching \procs to cache instances, or more precisely, identifying which subqueues of the ready queue are local to which cache instance.
Once this matching is available, the helping algorithm can be changed to add bias so that \procs more often help subqueues local to the same cache instance%
\footnote{Note that, like the other biases mentioned in this section, the actual bias value does not appear to need precise tuning.}.
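The sysfs-based discovery the thesis assumes can be sketched in plain C. The sketch below is illustrative only, not the \CFA implementation: the function name \texttt{l3\_instance\_of\_cpu} is a hypothetical helper, and it assumes cache \texttt{index3} corresponds to the L3 and that the kernel exposes the \texttt{id} file (absent on some older kernels, in which case the sketch reports failure).

```c
// Illustrative sketch: find which L3 cache instance a CPU belongs to,
// using Linux's /sys/devices/system/cpu cache description.
// Assumptions (not from the thesis): cache index3 is the L3, and the
// kernel provides the per-cache "id" file.
#include <stdio.h>

// Returns the L3 instance id for `cpu`, or -1 if it cannot be
// determined (non-Linux system, missing sysfs file, parse failure).
static int l3_instance_of_cpu(int cpu) {
	char path[128];
	snprintf(path, sizeof(path),
	         "/sys/devices/system/cpu/cpu%d/cache/index3/id", cpu);
	FILE * f = fopen(path, "r");
	if (!f) return -1;
	int id = -1;
	if (fscanf(f, "%d", &id) != 1) id = -1;
	fclose(f);
	return id;
}
```

Two \procs share an L3 instance exactly when this returns the same id for their CPUs, which is the matching needed to bias helping toward local subqueues.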
The obvious approach to mapping cache instances to subqueues is to statically tie subqueues to CPUs.
Instead of having each subqueue local to a specific \proc, the system is initialized with subqueues for each \gls{hthrd} up front.
Then \procs dequeue and enqueue by first querying their CPU id, in order to identify which subqueues are the local ones.
\Glspl{proc} can get the CPU id from \texttt{sched\_getcpu} or \texttt{librseq}.

This approach solves the performance problems on systems with topologies similar to Figure~\ref{fig:cache-noshare}.
However, it actually causes subtle fairness problems on some systems, specifically systems with few \procs and many \glspl{hthrd}.
In these cases, the large number of subqueues and the bias against subqueues tied to different cache instances make it very unlikely that any single subqueue is picked.
To make things worse, the small number of \procs means that few helping attempts are made.
This combination of few attempts and low chances means a \at stranded on a subqueue that is not actively dequeued from may wait a very long time before it is randomly helped.
On a system with 2 \procs, 256 \glspl{hthrd} with narrow cache sharing, and a 100:1 bias, it can actually take multiple seconds for a \at to get dequeued from a remote queue.
Therefore, a more dynamic matching of subqueues to cache instances is needed.

\subsection{Topological Work Stealing}
The approach used in the \CFA scheduler is to have per-\proc subqueues, but with an explicit data structure tracking which cache instance each subqueue is tied to.
This requires some finesse, because reading this data structure must lead to fewer cache misses than not having the data structure in the first place.
A key element, however, is that, like the timestamps used for helping, reading the cache-instance mapping only needs to give the correct result \emph{often enough}.
Therefore, the algorithm can be built as follows: before enqueuing or dequeuing a \at, each \proc queries the CPU id and the corresponding cache instance.
Since subqueues are tied to \procs, each \proc can then update the cache instance mapped to the local subqueue(s).
To avoid unnecessary cache-line invalidation, the map is only written to if the mapping changes.
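The update step just described can be sketched in C. This is a sketch under stated assumptions, not the \CFA code: the names \texttt{cpu\_to\_cache}, \texttt{queue\_cache}, and \texttt{update\_queue\_mapping} are hypothetical, the CPU-to-cache table is assumed to be filled once at startup from sysfs, and relaxed C11 atomics stand in for whatever the real runtime uses, since stale reads are tolerable here.

```c
// Sketch of the topological mapping update (illustrative; the names
// cpu_to_cache, queue_cache, and update_queue_mapping are assumptions,
// not the CFA implementation).
#define _GNU_SOURCE
#include <sched.h>        // sched_getcpu (Linux-specific)
#include <stdatomic.h>

#define MAX_CPUS 256

// Assumed to be filled once at startup from /sys/devices/system/cpu.
static int cpu_to_cache[MAX_CPUS];

// One entry per subqueue: the cache instance that last used it.
// Readers tolerate stale values, so relaxed atomics suffice.
static _Atomic int queue_cache[MAX_CPUS];

// Called by a proc before enqueuing or dequeuing on a local subqueue.
static void update_queue_mapping(int subqueue) {
	int cpu = sched_getcpu();
	if (cpu < 0 || cpu >= MAX_CPUS) return;   // defensive: unknown CPU
	int cache = cpu_to_cache[cpu];
	// Write only on change: an unconditional store would invalidate the
	// cache line holding the map in every other proc's caches.
	if (atomic_load_explicit(&queue_cache[subqueue],
	                         memory_order_relaxed) != cache)
		atomic_store_explicit(&queue_cache[subqueue], cache,
		                      memory_order_relaxed);
}
```

The conditional store is the crux: the map stays read-mostly, so \procs consulting it to bias their helping rarely take a coherence miss.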