Index: doc/theses/thierry_delisle_PhD/thesis/Makefile
===================================================================
--- doc/theses/thierry_delisle_PhD/thesis/Makefile	(revision 6978468d141a18b7c22313b2afea3c90eaddd399)
+++ doc/theses/thierry_delisle_PhD/thesis/Makefile	(revision 29b36923507eaa8acfad777d130426ba63a3d494)
@@ -30,4 +30,6 @@
 	base \
 	base_avg \
+	cache-share \
+	cache-noshare \
 	empty \
 	emptybit \
Index: doc/theses/thierry_delisle_PhD/thesis/fig/cache-noshare.fig
===================================================================
--- doc/theses/thierry_delisle_PhD/thesis/fig/cache-noshare.fig	(revision 29b36923507eaa8acfad777d130426ba63a3d494)
+++ doc/theses/thierry_delisle_PhD/thesis/fig/cache-noshare.fig	(revision 29b36923507eaa8acfad777d130426ba63a3d494)
@@ -0,0 +1,99 @@
+#FIG 3.2  Produced by xfig version 3.2.7b
+Landscape
+Center
+Inches
+Letter
+100.00
+Single
+-2
+1200 2
+1 3 0 1 0 7 50 -1 -1 0.000 1 0.0000 2550 2550 456 456 2550 2550 2100 2475
+1 3 0 1 0 7 50 -1 -1 0.000 1 0.0000 3750 2550 456 456 3750 2550 3300 2475
+1 3 0 1 0 7 50 -1 -1 0.000 1 0.0000 4950 2550 456 456 4950 2550 4500 2475
+1 3 0 1 0 7 50 -1 -1 0.000 1 0.0000 6150 2550 456 456 6150 2550 5700 2475
+2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
+	 2100 3300 3000 3300 3000 3600 2100 3600 2100 3300
+2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
+	 2100 3900 3000 3900 3000 4500 2100 4500 2100 3900
+2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
+	 3300 3300 4200 3300 4200 3600 3300 3600 3300 3300
+2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
+	 3300 3900 4200 3900 4200 4500 3300 4500 3300 3900
+2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
+	 4500 3300 5400 3300 5400 3600 4500 3600 4500 3300
+2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
+	 4500 3900 5400 3900 5400 4500 4500 4500 4500 3900
+2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
+	 5700 3300 6600 3300 6600 3600 5700 3600 5700 3300
+2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
+	 5700 3900 6600 3900 6600 4500 5700 4500 5700 3900
+2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
+	 2100 4800 4200 4800 4200 5700 2100 5700 2100 4800
+2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
+	 4500 4800 6600 4800 6600 5700 4500 5700 4500 4800
+2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 1 2
+	1 1 1.00 60.00 45.00
+	1 1 1.00 60.00 45.00
+	 2550 3000 2550 3300
+2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 1 2
+	1 1 1.00 60.00 45.00
+	1 1 1.00 60.00 45.00
+	 6150 3000 6150 3300
+2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 1 2
+	1 1 1.00 60.00 45.00
+	1 1 1.00 60.00 45.00
+	 6150 3600 6150 3900
+2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 1 2
+	1 1 1.00 60.00 45.00
+	1 1 1.00 60.00 45.00
+	 3750 3000 3750 3300
+2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 1 2
+	1 1 1.00 60.00 45.00
+	1 1 1.00 60.00 45.00
+	 4950 3000 4950 3300
+2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 1 2
+	1 1 1.00 60.00 45.00
+	1 1 1.00 60.00 45.00
+	 4950 3600 4950 3900
+2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 1 2
+	1 1 1.00 60.00 45.00
+	1 1 1.00 60.00 45.00
+	 3750 3600 3750 3900
+2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 1 2
+	1 1 1.00 60.00 45.00
+	1 1 1.00 60.00 45.00
+	 2550 3600 2550 3900
+2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 1 2
+	1 1 1.00 60.00 45.00
+	1 1 1.00 60.00 45.00
+	 2550 4500 2550 4800
+2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 1 2
+	1 1 1.00 60.00 45.00
+	1 1 1.00 60.00 45.00
+	 3750 4500 3750 4800
+2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 1 2
+	1 1 1.00 60.00 45.00
+	1 1 1.00 60.00 45.00
+	 4950 4500 4950 4800
+2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 1 2
+	1 1 1.00 60.00 45.00
+	1 1 1.00 60.00 45.00
+	 6150 4500 6150 4800
+2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 1 2
+	1 1 1.00 60.00 45.00
+	1 1 1.00 60.00 45.00
+	 4200 5250 4500 5250
+4 0 0 50 -1 0 11 0.0000 2 135 360 4725 2625 CPU2\001
+4 0 0 50 -1 0 11 0.0000 2 135 360 2325 2625 CPU0\001
+4 0 0 50 -1 0 11 0.0000 2 135 360 5925 2625 CPU3\001
+4 0 0 50 -1 0 11 0.0000 2 135 360 3525 2625 CPU1\001
+4 0 0 50 -1 0 11 0.0000 2 135 180 2475 3525 L1\001
+4 0 0 50 -1 0 11 0.0000 2 135 180 4875 3525 L1\001
+4 0 0 50 -1 0 11 0.0000 2 135 180 6075 3525 L1\001
+4 0 0 50 -1 0 11 0.0000 2 135 180 2400 4275 L2\001
+4 0 0 50 -1 0 11 0.0000 2 135 180 4875 4275 L2\001
+4 0 0 50 -1 0 11 0.0000 2 135 180 3675 4275 L2\001
+4 0 0 50 -1 0 11 0.0000 2 135 180 6075 4275 L2\001
+4 0 0 50 -1 0 11 0.0000 2 135 180 3675 3525 L1\001
+4 0 0 50 -1 0 11 0.0000 2 135 180 3000 5250 L3\001
+4 0 0 50 -1 0 11 0.0000 2 135 180 5475 5250 L3\001
Index: doc/theses/thierry_delisle_PhD/thesis/fig/cache-share.fig
===================================================================
--- doc/theses/thierry_delisle_PhD/thesis/fig/cache-share.fig	(revision 29b36923507eaa8acfad777d130426ba63a3d494)
+++ doc/theses/thierry_delisle_PhD/thesis/fig/cache-share.fig	(revision 29b36923507eaa8acfad777d130426ba63a3d494)
@@ -0,0 +1,92 @@
+#FIG 3.2  Produced by xfig version 3.2.7b
+Landscape
+Center
+Inches
+Letter
+100.00
+Single
+-2
+1200 2
+1 3 0 1 0 7 50 -1 -1 0.000 1 0.0000 2550 2550 456 456 2550 2550 2100 2475
+1 3 0 1 0 7 50 -1 -1 0.000 1 0.0000 3750 2550 456 456 3750 2550 3300 2475
+1 3 0 1 0 7 50 -1 -1 0.000 1 0.0000 4950 2550 456 456 4950 2550 4500 2475
+1 3 0 1 0 7 50 -1 -1 0.000 1 0.0000 6150 2550 456 456 6150 2550 5700 2475
+2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
+	 2100 3300 3000 3300 3000 3600 2100 3600 2100 3300
+2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
+	 2100 3900 3000 3900 3000 4500 2100 4500 2100 3900
+2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
+	 3300 3300 4200 3300 4200 3600 3300 3600 3300 3300
+2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
+	 3300 3900 4200 3900 4200 4500 3300 4500 3300 3900
+2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
+	 4500 3300 5400 3300 5400 3600 4500 3600 4500 3300
+2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
+	 4500 3900 5400 3900 5400 4500 4500 4500 4500 3900
+2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
+	 5700 3300 6600 3300 6600 3600 5700 3600 5700 3300
+2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
+	 5700 3900 6600 3900 6600 4500 5700 4500 5700 3900
+2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
+	 2100 4800 6600 4800 6600 5775 2100 5775 2100 4800
+2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 1 2
+	1 1 1.00 60.00 45.00
+	1 1 1.00 60.00 45.00
+	 2550 3000 2550 3300
+2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 1 2
+	1 1 1.00 60.00 45.00
+	1 1 1.00 60.00 45.00
+	 3750 3000 3750 3300
+2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 1 2
+	1 1 1.00 60.00 45.00
+	1 1 1.00 60.00 45.00
+	 4950 3000 4950 3300
+2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 1 2
+	1 1 1.00 60.00 45.00
+	1 1 1.00 60.00 45.00
+	 6150 3000 6150 3300
+2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 1 2
+	1 1 1.00 60.00 45.00
+	1 1 1.00 60.00 45.00
+	 6150 3600 6150 3900
+2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 1 2
+	1 1 1.00 60.00 45.00
+	1 1 1.00 60.00 45.00
+	 4950 3600 4950 3900
+2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 1 2
+	1 1 1.00 60.00 45.00
+	1 1 1.00 60.00 45.00
+	 3750 3600 3750 3900
+2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 1 2
+	1 1 1.00 60.00 45.00
+	1 1 1.00 60.00 45.00
+	 2550 3600 2550 3900
+2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 1 2
+	1 1 1.00 60.00 45.00
+	1 1 1.00 60.00 45.00
+	 2550 4500 2550 4800
+2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 1 2
+	1 1 1.00 60.00 45.00
+	1 1 1.00 60.00 45.00
+	 3750 4500 3750 4800
+2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 1 2
+	1 1 1.00 60.00 45.00
+	1 1 1.00 60.00 45.00
+	 4950 4500 4950 4800
+2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 1 2
+	1 1 1.00 60.00 45.00
+	1 1 1.00 60.00 45.00
+	 6150 4500 6150 4800
+4 0 0 50 -1 0 11 0.0000 2 135 360 4725 2625 CPU2\001
+4 0 0 50 -1 0 11 0.0000 2 135 360 2325 2625 CPU0\001
+4 0 0 50 -1 0 11 0.0000 2 135 360 5925 2625 CPU3\001
+4 0 0 50 -1 0 11 0.0000 2 135 360 3525 2625 CPU1\001
+4 0 0 50 -1 0 11 0.0000 2 135 180 2475 3525 L1\001
+4 0 0 50 -1 0 11 0.0000 2 135 180 4875 3525 L1\001
+4 0 0 50 -1 0 11 0.0000 2 135 180 6075 3525 L1\001
+4 0 0 50 -1 0 11 0.0000 2 135 180 2400 4275 L2\001
+4 0 0 50 -1 0 11 0.0000 2 135 180 4875 4275 L2\001
+4 0 0 50 -1 0 11 0.0000 2 135 180 3675 4275 L2\001
+4 0 0 50 -1 0 11 0.0000 2 135 180 6075 4275 L2\001
+4 0 0 50 -1 0 11 0.0000 2 135 180 3675 3525 L1\001
+4 0 0 50 -1 0 11 0.0000 2 135 180 4275 5325 L3\001
Index: doc/theses/thierry_delisle_PhD/thesis/text/core.tex
===================================================================
--- doc/theses/thierry_delisle_PhD/thesis/text/core.tex	(revision 6978468d141a18b7c22313b2afea3c90eaddd399)
+++ doc/theses/thierry_delisle_PhD/thesis/text/core.tex	(revision 29b36923507eaa8acfad777d130426ba63a3d494)
@@ -223,7 +223,61 @@
 Therefore this unprotected read of the timestamp and average satisfy the limited correctness that is required.
 
+\begin{figure}
+	\centering
+	\input{cache-share.pstex_t}
+	\caption[CPU design with wide L3 sharing]{CPU design with wide L3 sharing \smallskip\newline A very simple CPU with 4 \glspl{hthrd}. L1 and L2 are private to each \gls{hthrd}, but the L3 is shared across the entire CPU.}
+	\label{fig:cache-share}
+\end{figure}
+
+\begin{figure}
+	\centering
+	\input{cache-noshare.pstex_t}
+	\caption[CPU design with a narrower L3 sharing]{CPU design with a narrower L3 sharing \smallskip\newline A different CPU design, still with 4 \glspl{hthrd}. L1 and L2 are still private to each \gls{hthrd}, but the L3 is now shared by only some of the \glspl{hthrd}, resulting in two distinct L3 instances.}
+	\label{fig:cache-noshare}
+\end{figure}
+
+With redundant timestamps, this scheduling algorithm achieves both the fairness and performance requirements on some machines.
+The problem is that the cost of polling and helping is not necessarily consistent across each \gls{hthrd}.
+For example, on machines where the motherboard holds multiple CPUs, a cache miss can be satisfied either by a cache belonging to the CPU that missed, the \emph{local} CPU, or by a cache on a different CPU, a \emph{remote} one.
+Cache misses satisfied by a remote CPU have higher latency than those satisfied by the local CPU.
+However, this is not specific to systems with multiple CPUs.
+Depending on the cache structure, cache misses can have different latencies even within a single CPU.
+The AMD EPYC 7662 CPU described in Chapter~\ref{microbench} is an example.
+Figure~\ref{fig:cache-share} and Figure~\ref{fig:cache-noshare} show two different cache topologies that highlight this difference.
+In Figure~\ref{fig:cache-share}, all cache instances are either private to a \gls{hthrd} or shared across the entire system, so cache-miss latencies are likely fairly consistent.
+By comparison, in Figure~\ref{fig:cache-noshare}, misses in the L2 cache can be satisfied by a hit in either instance of the L3.
+However, the access latency to the remote L3 instance is notably higher than the access latency to the local L3.
+The impact of these differing designs on this algorithm is that scheduling scales very well on architectures similar to Figure~\ref{fig:cache-share}, but scales notably worse on architectures with many narrower L3 instances.
+This is simply because, as the number of L3 instances grows, so does the chance that random helping incurs significant latency.
+The solution is to make the scheduler aware of the cache topology.
+
 \subsection{Per CPU Sharding}
+Building a scheduler that is aware of cache topology poses two main challenges: discovering the cache topology and matching \procs to cache instances.
+Sadly, there is no standard portable way to discover cache topology in C.
+Therefore, while this is a significant portability challenge, designing a cross-platform cache-discovery mechanism is outside the scope of this thesis.
+The rest of this work assumes the cache topology is discovered from Linux's \texttt{/sys/devices/system/cpu} directory.
+This leaves the challenge of matching \procs to cache instances, or more precisely identifying which subqueues of the ready queue are local to which cache instance.
+Once this matching is available, the helping algorithm can be changed to add a bias so that \procs more often help subqueues local to the same cache instance%
+\footnote{Note that, like other biases mentioned in this section, the actual bias value does not appear to need precise tuning.}.
+
+The obvious approach to mapping cache instances to subqueues is to statically tie subqueues to CPUs.
+Instead of having each subqueue local to a specific \proc, the system is initialized with a subqueue for each \gls{hthrd} up front.
+Then \procs dequeue and enqueue by first asking which CPU id they are running on, in order to identify which subqueues are the local ones.
+\Glspl{proc} can get the CPU id from \texttt{sched\_getcpu} or \texttt{librseq}.
+
+This approach solves the performance problems on systems with topologies similar to Figure~\ref{fig:cache-noshare}.
+However, it causes subtle fairness problems on some systems, specifically systems with few \procs and many \glspl{hthrd}.
+In these cases, the large number of subqueues and the bias against subqueues tied to different cache instances make it very unlikely that any single subqueue is picked.
+To make things worse, the small number of \procs means that few helping attempts are made.
+This combination of few attempts and low probability means a \at stranded on a subqueue that is not actively dequeued from may wait a very long time before it is randomly helped.
+On a system with 2 \procs, 256 \glspl{hthrd} with narrow cache sharing, and a 100:1 bias, it can take multiple seconds for a \at to be dequeued from a remote queue.
+Therefore, a more dynamic matching of subqueues to cache instances is needed.
 
 \subsection{Topological Work Stealing}
-
-
+The approach used in the \CFA scheduler is to keep per-\proc subqueues, but to have an explicit data structure track which cache instance each subqueue is tied to.
+This requires some finesse, because reading this data structure must lead to fewer cache misses than not having the data structure in the first place.
+A key element, however, is that, like the timestamps for helping, reading the cache-instance mapping only needs to give the correct result \emph{often enough}.
+Therefore the algorithm can be built as follows: before enqueuing or dequeuing a \at, each \proc queries its CPU id and the corresponding cache instance.
+Since subqueues are tied to \procs, each \proc can then update the cache instance mapped to its local subqueue(s).
+To avoid unnecessary cache-line invalidation, the map is only written to if the mapping changes.
+
