Changeset ffec1bf for doc/theses

doc/theses/mike_brooks_MMath/array.tex

r9e23b446	rffec1bf
182	182	\CFA's array is also the first extension of C to use its tracked bounds to generate the pointer arithmetic implied by advanced allocation patterns. Other bound-tracked extensions of C either forbid certain C patterns entirely, or address the problem of \emph{verifying} that the user's provided pointer arithmetic is self-consistent. The \CFA array, applied to accordion structures [TODO: cross-reference], \emph{implies} the necessary pointer arithmetic, which is generated automatically and does not appear at all in a user's program.
183	183
184		\subsction{Safety in a padded room}
	184	\subsection{Safety in a padded room}
185	185
186	186	Java's array [todo:cite] is a straightforward example of assuring safety against undefined behaviour, at a cost of expressiveness for more applied properties. Consider the array parameter declarations in:

doc/theses/thierry_delisle_PhD/thesis/.gitignore

r9e23b446	rffec1bf
1	1	back_text/
	2	SAVE.fig

doc/theses/thierry_delisle_PhD/thesis/Makefile

-              r9e23b446
+              rffec1bf
         base \
         base_avg \
+        base_ts2 \
         cache-share \
         cache-noshare \
 …
         emptytls \
         emptytree \
+        executionStates \
         fairness \
         idle \
 …
         io_uring \
         pivot_ring \
+        MQMS \
+        MQMSG \
         system \
         cycle \
 …
         result.memcd.rate.qps \
         result.memcd.rate.99th \
+        SQMS \
+}

doc/theses/thierry_delisle_PhD/thesis/fig/base.fig

-              r9e23b446
+              rffec1bf
 3 0 1 0 0 50 -1 20 0.000 1 0.0000 6975 4200 20 20 6975 4200 6995 4200
 -6
 6375 5100 6675 5250
 3 0 1 0 0 50 -1 20 0.000 1 0.0000 6450 5175 20 20 6450 5175 6470 5175
 3 0 1 0 0 50 -1 20 0.000 1 0.0000 6525 5175 20 20 6525 5175 6545 5175
 3 0 1 0 0 50 -1 20 0.000 1 0.0000 6600 5175 20 20 6600 5175 6620 5175
+6450 5025 6750 5175
+3 0 1 0 0 50 -1 20 0.000 1 0.0000 6525 5100 20 20 6525 5100 6545 5100
+3 0 1 0 0 50 -1 20 0.000 1 0.0000 6600 5100 20 20 6600 5100 6620 5100
+3 0 1 0 0 50 -1 20 0.000 1 0.0000 6675 5100 20 20 6675 5100 6695 5100
 -6
 3 0 1 0 7 50 -1 -1 0.000 1 0.0000 3900 2400 300 300 3900 2400 4200 2400
 …
 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
 2475 3000 2475
-3 0 1 0 7 50 -1 -1 0.000 0 0 0 0 0 7
-5210 3150 4950 2850 4950 2700 5210 2850 5470 3150 5470
-5210
-3 0 1 0 7 50 -1 -1 0.000 0 0 0 0 0 7
-5210 4350 4950 4050 4950 3900 5210 4050 5470 4350 5470
-5210
-3 0 1 0 7 50 -1 -1 0.000 0 0 0 0 0 7
-5210 5550 4950 5250 4950 5100 5210 5250 5470 5550 5470
-5210
 1 1 1 0 7 50 -1 -1 4.000 0 0 -1 0 0 2
 5700 3600 1200
+5400 3600 1200
 1 1 1 0 7 50 -1 -1 4.000 0 0 -1 0 0 2
 5700 4800 1200
+5400 4800 1200
 1 1 1 0 7 50 -1 -1 4.000 0 0 -1 0 0 2
+5700 6000 1200
+2 -1 50 -1 0 12 0.0000 2 135 630 2100 3075 Threads\001
+2 -1 50 -1 0 12 0.0000 2 165 450 2100 2850 Ready\001
+1 -1 50 -1 0 11 0.0000 2 135 180 2700 4450 TS\001
+2 -1 50 -1 0 12 0.0000 2 165 720 2100 4200 Array of\001
+2 -1 50 -1 0 12 0.0000 2 150 540 2100 4425 Queues\001
+1 -1 50 -1 0 11 0.0000 2 135 180 2700 3550 TS\001
+1 -1 50 -1 0 11 0.0000 2 135 180 2700 2650 TS\001
+2 -1 50 -1 0 12 0.0000 2 135 900 2100 5175 Processors\001
+5400 6000 1200
+2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
+4800 3300 4800 3300 5400 2700 5400 2700 4800
+2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
+4800 4500 4800 4500 5400 3900 5400 3900 4800
+2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
+4800 5700 4800 5700 5400 5100 5400 5100 4800
+2 -1 50 -1 0 12 0.0000 2 135 645 2100 3075 Threads\001
+2 -1 50 -1 0 12 0.0000 2 180 525 2100 2850 Ready\001
+1 -1 50 -1 0 11 0.0000 2 120 210 2700 4450 TS\001
+2 -1 50 -1 0 12 0.0000 2 180 660 2100 4200 Array of\001
+2 -1 50 -1 0 12 0.0000 2 165 600 2100 4425 Queues\001
+1 -1 50 -1 0 11 0.0000 2 120 210 2700 3550 TS\001
+2 -1 50 -1 0 12 0.0000 2 135 840 2100 5175 Processors\001

doc/theses/thierry_delisle_PhD/thesis/fig/base_avg.fig

-              r9e23b446
+              rffec1bf
 3 0 1 0 0 50 -1 20 0.000 1 0.0000 6975 4200 20 20 6975 4200 6995 4200
 -6
 6375 5100 6675 5250
 3 0 1 0 0 50 -1 20 0.000 1 0.0000 6450 5175 20 20 6450 5175 6470 5175
 3 0 1 0 0 50 -1 20 0.000 1 0.0000 6525 5175 20 20 6525 5175 6545 5175
 3 0 1 0 0 50 -1 20 0.000 1 0.0000 6600 5175 20 20 6600 5175 6620 5175
+6450 5025 6750 5175
+3 0 1 0 0 50 -1 20 0.000 1 0.0000 6525 5100 20 20 6525 5100 6545 5100
+3 0 1 0 0 50 -1 20 0.000 1 0.0000 6600 5100 20 20 6600 5100 6620 5100
+3 0 1 0 0 50 -1 20 0.000 1 0.0000 6675 5100 20 20 6675 5100 6695 5100
 -6
 3 0 1 0 7 50 -1 -1 0.000 1 0.0000 3900 2400 300 300 3900 2400 4200 2400
 …
 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 0 2
 1 1.00 45.00 90.00
 3975 3900 3600
+4200 3900 3600
 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 0 2
 1 1.00 45.00 90.00
 …
 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 0 2
 1 1.00 45.00 90.00
 3975 5100 3600
+4200 5100 3600
 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 0 2
 1 1.00 45.00 90.00
 …
 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 0 2
 1 1.00 45.00 90.00
 3975 6300 3600
+4200 6300 3600
 1 0 1 -1 7 50 -1 -1 0.000 0 0 -1 1 0 2
 1 1.00 45.00 90.00
 …
 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 0 2
 1 1.00 45.00 90.00
 3975 4500 3600
+4200 4500 3600
 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
 3375 3000 3375
 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
 2475 3000 2475
-3 0 1 0 7 50 -1 -1 0.000 0 0 0 0 0 7
-5210 3150 4950 2850 4950 2700 5210 2850 5470 3150 5470
-5210
-3 0 1 0 7 50 -1 -1 0.000 0 0 0 0 0 7
-5210 4350 4950 4050 4950 3900 5210 4050 5470 4350 5470
-5210
-3 0 1 0 7 50 -1 -1 0.000 0 0 0 0 0 7
-5210 5550 4950 5250 4950 5100 5210 5250 5470 5550 5470
-5210
 1 1 1 0 7 50 -1 -1 4.000 0 0 -1 0 0 2
 5700 3600 1200
+5400 3600 1200
 1 1 1 0 7 50 -1 -1 4.000 0 0 -1 0 0 2
 5700 4800 1200
+5400 4800 1200
 1 1 1 0 7 50 -1 -1 4.000 0 0 -1 0 0 2
+5700 6000 1200
+5400 6000 1200
+2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
+4800 3300 4800 3300 5400 2700 5400 2700 4800
+2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
+4800 4500 4800 4500 5400 3900 5400 3900 4800
+2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
+4800 5700 4800 5700 5400 5100 5400 5100 4800
 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
 4050 3000 4050
+2 -1 50 -1 0 12 0.0000 2 135 630 2100 3075 Threads\001
+2 -1 50 -1 0 12 0.0000 2 165 450 2100 2850 Ready\001
+1 -1 50 -1 0 11 0.0000 2 135 180 2700 4450 MA\001
+2 -1 50 -1 0 12 0.0000 2 165 720 2100 4200 Array of\001
+2 -1 50 -1 0 12 0.0000 2 150 540 2100 4425 Queues\001
+1 -1 50 -1 0 11 0.0000 2 135 180 2700 3550 TS\001
+1 -1 50 -1 0 11 0.0000 2 135 180 2700 2650 TS\001
+2 -1 50 -1 0 12 0.0000 2 135 900 2100 5175 Processors\001
+1 -1 50 -1 0 11 0.0000 2 135 180 2700 4200 TS\001
+2 -1 50 -1 0 12 0.0000 2 135 645 2100 3075 Threads\001
+2 -1 50 -1 0 12 0.0000 2 180 525 2100 2850 Ready\001
+1 -1 50 -1 0 11 0.0000 2 120 300 2700 4450 MA\001
+2 -1 50 -1 0 12 0.0000 2 180 660 2100 4200 Array of\001
+2 -1 50 -1 0 12 0.0000 2 165 600 2100 4425 Queues\001
+1 -1 50 -1 0 11 0.0000 2 120 210 2700 3550 TS\001
+2 -1 50 -1 0 12 0.0000 2 135 840 2100 5175 Processors\001
+1 -1 50 -1 0 11 0.0000 2 120 210 2700 4225 TS\001

doc/theses/thierry_delisle_PhD/thesis/fig/cache-noshare.fig

-              r9e23b446
+              rffec1bf
 -2
 2
 3 0 1 0 7 50 -1 -1 0.000 1 0.0000 2550 2550 456 456 2550 2550 2100 2475
 3 0 1 0 7 50 -1 -1 0.000 1 0.0000 3750 2550 456 456 3750 2550 3300 2475
 3 0 1 0 7 50 -1 -1 0.000 1 0.0000 4950 2550 456 456 4950 2550 4500 2475
 3 0 1 0 7 50 -1 -1 0.000 1 0.0000 6150 2550 456 456 6150 2550 5700 2475
+3 0 1 0 7 50 -1 -1 0.000 1 0.0000 1650 1650 456 456 1650 1650 1200 1575
+3 0 1 0 7 50 -1 -1 0.000 1 0.0000 2850 1650 456 456 2850 1650 2400 1575
+3 0 1 0 7 50 -1 -1 0.000 1 0.0000 4050 1650 456 456 4050 1650 3600 1575
+3 0 1 0 7 50 -1 -1 0.000 1 0.0000 5250 1650 456 456 5250 1650 4800 1575
 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
 3300 3000 3300 3000 3600 2100 3600 2100 3300
+2400 2100 2400 2100 2700 1200 2700 1200 2400
 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
 3900 3000 3900 3000 4500 2100 4500 2100 3900
+3000 2100 3000 2100 3600 1200 3600 1200 3000
 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
 3300 4200 3300 4200 3600 3300 3600 3300 3300
+2400 3300 2400 3300 2700 2400 2700 2400 2400
 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
 3900 4200 3900 4200 4500 3300 4500 3300 3900
+3000 3300 3000 3300 3600 2400 3600 2400 3000
 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
 3300 5400 3300 5400 3600 4500 3600 4500 3300
+2400 4500 2400 4500 2700 3600 2700 3600 2400
 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
 3900 5400 3900 5400 4500 4500 4500 4500 3900
+3000 4500 3000 4500 3600 3600 3600 3600 3000
 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
 3300 6600 3300 6600 3600 5700 3600 5700 3300
+2400 5700 2400 5700 2700 4800 2700 4800 2400
 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
 3900 6600 3900 6600 4500 5700 4500 5700 3900
+3000 5700 3000 5700 3600 4800 3600 4800 3000
 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
 4800 4200 4800 4200 5700 2100 5700 2100 4800
+3900 3300 3900 3300 4800 1200 4800 1200 3900
 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
 4800 6600 4800 6600 5700 4500 5700 4500 4800
+3900 5700 3900 5700 4800 3600 4800 3600 3900
 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 1 2
 1 1.00 60.00 45.00
 1 1.00 60.00 45.00
 3000 2550 3300
+2100 1650 2400
 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 1 2
 1 1.00 60.00 45.00
 1 1.00 60.00 45.00
 3000 6150 3300
+2100 5250 2400
 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 1 2
 1 1.00 60.00 45.00
 1 1.00 60.00 45.00
 3600 6150 3900
+2700 5250 3000
 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 1 2
 1 1.00 60.00 45.00
 1 1.00 60.00 45.00
 3000 3750 3300
+2100 2850 2400
 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 1 2
 1 1.00 60.00 45.00
 1 1.00 60.00 45.00
 3000 4950 3300
+2100 4050 2400
 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 1 2
 1 1.00 60.00 45.00
 1 1.00 60.00 45.00
 3600 4950 3900
+2700 4050 3000
 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 1 2
 1 1.00 60.00 45.00
 1 1.00 60.00 45.00
 3600 3750 3900
+2700 1650 3000
 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 1 2
 1 1.00 60.00 45.00
 1 1.00 60.00 45.00
 3600 2550 3900
+3600 1650 3900
 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 1 2
 1 1.00 60.00 45.00
 1 1.00 60.00 45.00
 4500 2550 4800
+3600 2850 3900
 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 1 2
 1 1.00 60.00 45.00
 1 1.00 60.00 45.00
 4500 3750 4800
+3600 4050 3900
 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 1 2
 1 1.00 60.00 45.00
 1 1.00 60.00 45.00
 4500 4950 4800
+3600 5250 3900
 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 1 2
 1 1.00 60.00 45.00
 1 1.00 60.00 45.00
 4500 6150 4800
+4350 3600 4350
 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 1 2
 1 1.00 60.00 45.00
 1 1.00 60.00 45.00
 5250 4500 5250
 0 0 50 -1 0 11 0.0000 2 135 360 4725 2625 CPU2\001
 0 0 50 -1 0 11 0.0000 2 135 360 2325 2625 CPU0\001
 0 0 50 -1 0 11 0.0000 2 135 360 5925 2625 CPU3\001
 0 0 50 -1 0 11 0.0000 2 135 360 3525 2625 CPU1\001
 0 0 50 -1 0 11 0.0000 2 135 180 2475 3525 L1\001
 0 0 50 -1 0 11 0.0000 2 135 180 4875 3525 L1\001
 0 0 50 -1 0 11 0.0000 2 135 180 6075 3525 L1\001
 0 0 50 -1 0 11 0.0000 2 135 180 2400 4275 L2\001
 0 0 50 -1 0 11 0.0000 2 135 180 4875 4275 L2\001
 0 0 50 -1 0 11 0.0000 2 135 180 3675 4275 L2\001
 0 0 50 -1 0 11 0.0000 2 135 180 6075 4275 L2\001
 0 0 50 -1 0 11 0.0000 2 135 180 3675 3525 L1\001
 0 0 50 -1 0 11 0.0000 2 135 180 3000 5250 L3\001
 0 0 50 -1 0 11 0.0000 2 135 180 5475 5250 L3\001
+2700 2850 3000
+1 0 50 -1 0 12 0.0000 2 165 945 1650 1725 CORE$_0$\001
+1 0 50 -1 0 12 0.0000 2 135 225 2250 4425 L3\001
+1 0 50 -1 0 12 0.0000 2 135 225 4650 4425 L3\001
+1 0 50 -1 0 12 0.0000 2 135 225 5250 3375 L2\001
+1 0 50 -1 0 12 0.0000 2 135 225 4050 3375 L2\001
+1 0 50 -1 0 12 0.0000 2 135 225 2850 3375 L2\001
+1 0 50 -1 0 12 0.0000 2 135 225 1650 3375 L2\001
+1 0 50 -1 0 12 0.0000 2 135 225 1650 2625 L1\001
+1 0 50 -1 0 12 0.0000 2 135 225 2850 2625 L1\001
+1 0 50 -1 0 12 0.0000 2 135 225 4050 2625 L1\001
+1 0 50 -1 0 12 0.0000 2 135 225 5250 2625 L1\001
+1 0 50 -1 0 12 0.0000 2 165 945 2850 1725 CORE$_1$\001
+1 0 50 -1 0 12 0.0000 2 165 945 4050 1725 CORE$_2$\001
+1 0 50 -1 0 12 0.0000 2 165 945 5250 1725 CORE$_3$\001

doc/theses/thierry_delisle_PhD/thesis/fig/cache-share.fig

-              r9e23b446
+              rffec1bf
 -2
 2
 3 0 1 0 7 50 -1 -1 0.000 1 0.0000 2550 2550 456 456 2550 2550 2100 2475
 3 0 1 0 7 50 -1 -1 0.000 1 0.0000 3750 2550 456 456 3750 2550 3300 2475
 3 0 1 0 7 50 -1 -1 0.000 1 0.0000 4950 2550 456 456 4950 2550 4500 2475
 3 0 1 0 7 50 -1 -1 0.000 1 0.0000 6150 2550 456 456 6150 2550 5700 2475
+3 0 1 0 7 50 -1 -1 0.000 1 0.0000 1650 1650 456 456 1650 1650 1200 1575
+3 0 1 0 7 50 -1 -1 0.000 1 0.0000 4050 1650 456 456 4050 1650 3600 1575
+3 0 1 0 7 50 -1 -1 0.000 1 0.0000 5250 1650 456 456 5250 1650 4800 1575
+3 0 1 0 7 50 -1 -1 0.000 1 0.0000 2850 1650 456 456 2850 1650 2400 1575
 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
 3300 3000 3300 3000 3600 2100 3600 2100 3300
+2400 2100 2400 2100 2700 1200 2700 1200 2400
 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
 3900 3000 3900 3000 4500 2100 4500 2100 3900
+3000 2100 3000 2100 3600 1200 3600 1200 3000
 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
 3300 4200 3300 4200 3600 3300 3600 3300 3300
+2400 3300 2400 3300 2700 2400 2700 2400 2400
 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
 3900 4200 3900 4200 4500 3300 4500 3300 3900
+3000 3300 3000 3300 3600 2400 3600 2400 3000
 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
 3300 5400 3300 5400 3600 4500 3600 4500 3300
+2400 4500 2400 4500 2700 3600 2700 3600 2400
 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
 3900 5400 3900 5400 4500 4500 4500 4500 3900
+3000 4500 3000 4500 3600 3600 3600 3600 3000
 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
 3300 6600 3300 6600 3600 5700 3600 5700 3300
+2400 5700 2400 5700 2700 4800 2700 4800 2400
 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
+3900 6600 3900 6600 4500 5700 4500 5700 3900
+2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
+4800 6600 4800 6600 5775 2100 5775 2100 4800
+3000 5700 3000 5700 3600 4800 3600 4800 3000
 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 1 2
 1 1.00 60.00 45.00
 1 1.00 60.00 45.00
 3000 2550 3300
+2100 1650 2400
 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 1 2
 1 1.00 60.00 45.00
 1 1.00 60.00 45.00
 3000 3750 3300
+2100 2850 2400
 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 1 2
 1 1.00 60.00 45.00
 1 1.00 60.00 45.00
 3000 4950 3300
+2100 4050 2400
 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 1 2
 1 1.00 60.00 45.00
 1 1.00 60.00 45.00
 3000 6150 3300
+2100 5250 2400
 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 1 2
 1 1.00 60.00 45.00
 1 1.00 60.00 45.00
 3600 6150 3900
+2700 5250 3000
 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 1 2
 1 1.00 60.00 45.00
 1 1.00 60.00 45.00
 3600 4950 3900
+2700 4050 3000
 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 1 2
 1 1.00 60.00 45.00
 1 1.00 60.00 45.00
 3600 3750 3900
+2700 2850 3000
 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 1 2
 1 1.00 60.00 45.00
 1 1.00 60.00 45.00
 3600 2550 3900
+2700 1650 3000
 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 1 2
 1 1.00 60.00 45.00
 1 1.00 60.00 45.00
 4500 2550 4800
+3600 1650 3900
 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 1 2
 1 1.00 60.00 45.00
 1 1.00 60.00 45.00
 4500 3750 4800
+3600 2850 3900
 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 1 2
 1 1.00 60.00 45.00
 1 1.00 60.00 45.00
 4500 4950 4800
+3600 4050 3900
 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 1 2
 1 1.00 60.00 45.00
 1 1.00 60.00 45.00
+4500 6150 4800
+0 0 50 -1 0 11 0.0000 2 135 360 4725 2625 CPU2\001
+0 0 50 -1 0 11 0.0000 2 135 360 2325 2625 CPU0\001
+0 0 50 -1 0 11 0.0000 2 135 360 5925 2625 CPU3\001
+0 0 50 -1 0 11 0.0000 2 135 360 3525 2625 CPU1\001
+0 0 50 -1 0 11 0.0000 2 135 180 2475 3525 L1\001
+0 0 50 -1 0 11 0.0000 2 135 180 4875 3525 L1\001
+0 0 50 -1 0 11 0.0000 2 135 180 6075 3525 L1\001
+0 0 50 -1 0 11 0.0000 2 135 180 2400 4275 L2\001
+0 0 50 -1 0 11 0.0000 2 135 180 4875 4275 L2\001
+0 0 50 -1 0 11 0.0000 2 135 180 3675 4275 L2\001
+0 0 50 -1 0 11 0.0000 2 135 180 6075 4275 L2\001
+0 0 50 -1 0 11 0.0000 2 135 180 3675 3525 L1\001
+0 0 50 -1 0 11 0.0000 2 135 180 4275 5325 L3\001
+3600 5250 3900
+2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
+3900 5700 3900 5700 4800 1200 4800 1200 3900
+1 0 50 -1 0 12 0.0000 2 135 225 3450 4425 L3\001
+1 0 50 -1 0 12 0.0000 2 135 225 1650 3375 L2\001
+1 0 50 -1 0 12 0.0000 2 135 225 2850 3375 L2\001
+1 0 50 -1 0 12 0.0000 2 135 225 4050 3375 L2\001
+1 0 50 -1 0 12 0.0000 2 135 225 5250 3375 L2\001
+1 0 50 -1 0 12 0.0000 2 135 225 5250 2625 L1\001
+1 0 50 -1 0 12 0.0000 2 135 225 4050 2625 L1\001
+1 0 50 -1 0 12 0.0000 2 135 225 2850 2625 L1\001
+1 0 50 -1 0 12 0.0000 2 135 225 1650 2625 L1\001
+1 0 50 -1 0 12 0.0000 2 165 945 1650 1725 CORE$_0$\001
+1 0 50 -1 0 12 0.0000 2 165 945 2850 1725 CORE$_1$\001
+1 0 50 -1 0 12 0.0000 2 165 945 4050 1725 CORE$_2$\001
+1 0 50 -1 0 12 0.0000 2 165 945 5250 1725 CORE$_3$\001

doc/theses/thierry_delisle_PhD/thesis/fig/cycle.fig

-              r9e23b446
+              rffec1bf
 -2
 2
 1 0 1 0 7 50 -1 -1 0.000 0 1 1 0 3144.643 2341.072 3525 2250 3375 2025 3150 1950
 0 1.00 60.00 120.00
 1 0 1 0 7 50 -1 -1 0.000 0 1 1 0 1955.357 2341.072 1950 1950 1725 2025 1575 2250
 0 1.00 60.00 120.00
 1 0 1 0 7 50 -1 -1 0.000 0 1 1 0 3637.500 3487.500 3750 3750 3900 3600 3900 3375
 0 1.00 60.00 120.00
 1 0 1 0 7 50 -1 -1 0.000 0 1 1 0 2587.500 4087.500 2325 4500 2550 4575 2850 4500
 0 1.00 60.00 120.00
 1 0 1 0 7 50 -1 -1 0.000 0 1 1 0 1612.500 3487.500 1200 3375 1200 3600 1350 3825
 0 1.00 60.00 120.00
 3 0 1 0 7 50 -1 -1 0.000 1 0.0000 3675 2850 586 586 3675 2850 4125 3225
 3 0 1 0 7 50 -1 -1 0.000 1 0.0000 3300 4125 586 586 3300 4125 3750 4500
 3 0 1 0 7 50 -1 -1 0.000 1 0.0000 1875 4125 586 586 1875 4125 2325 4500
 3 0 1 0 7 50 -1 -1 0.000 1 0.0000 1425 2850 586 586 1425 2850 1875 3225
 3 0 1 0 7 50 -1 -1 0.000 1 0.0000 2550 1950 586 586 2550 1950 3000 2325
 0 0 50 -1 0 11 0.0000 2 135 720 1125 2925 Thread 2\001
 2 0 50 -1 0 11 0.0000 2 165 540 1650 1950 Unpark\001
 0 0 50 -1 0 11 0.0000 2 165 540 4050 3600 Unpark\001
 2 0 50 -1 0 11 0.0000 2 165 540 1125 3750 Unpark\001
 2 0 50 -1 0 11 0.0000 2 165 540 2850 4800 Unpark\001
 0 0 50 -1 0 11 0.0000 2 135 720 2250 2025 Thread 1\001
 0 0 50 -1 0 11 0.0000 2 135 720 3000 4200 Thread 4\001
 0 0 50 -1 0 11 0.0000 2 135 720 1575 4200 Thread 3\001
 0 0 50 -1 0 11 0.0000 2 165 540 3525 2025 Unpark\001
 0 0 50 -1 0 11 0.0000 2 135 720 3375 2925 Thread 5\001
+1 0 1 0 7 50 -1 -1 0.000 0 1 1 0 3150.000 4012.500 2850 4575 3150 4650 3450 4575
+1 1.00 60.00 120.00
+1 0 1 0 7 50 -1 -1 0.000 0 0 0 1 2268.750 3450.000 1950 3825 1800 3600 1800 3300
+1 1.00 60.00 120.00
+1 0 1 0 7 50 -1 -1 0.000 0 1 1 0 4031.250 3450.000 4350 3825 4500 3600 4500 3300
+1 1.00 60.00 120.00
+1 0 1 0 7 50 -1 -1 0.000 0 0 0 1 3675.000 2250.000 3750 1725 4050 1875 4200 2175
+1 1.00 60.00 120.00
+1 0 1 0 7 50 -1 -1 0.000 0 1 1 0 2625.000 2250.000 2550 1725 2250 1875 2100 2175
+1 1.00 60.00 120.00
+3 0 1 0 7 50 -1 -1 0.000 1 0.0000 3150 1800 600 600 3150 1800 3750 1800
+3 0 1 0 7 50 -1 -1 0.000 1 0.0000 1875 2700 600 600 1875 2700 2475 2700
+3 0 1 0 7 50 -1 -1 0.000 1 0.0000 2400 4200 600 600 2400 4200 3000 4200
+3 0 1 0 7 50 -1 -1 0.000 1 0.0000 3900 4200 600 600 3900 4200 4500 4200
+3 0 1 0 7 50 -1 -1 0.000 1 0.0000 4425 2700 600 600 4425 2700 5025 2700
+1 0 50 -1 0 11 0.0000 2 165 855 2400 4275 Thread$_3$\001
+1 0 50 -1 0 11 0.0000 2 165 855 3900 4275 Thread$_4$\001
+1 0 50 -1 0 11 0.0000 2 165 855 1875 2775 Thread$_2$\001
+1 0 50 -1 0 11 0.0000 2 165 855 3150 1875 Thread$_1$\001
+1 0 50 -1 0 11 0.0000 2 165 855 4425 2775 Thread$_5$\001
+1 0 50 -1 0 11 0.0000 2 180 540 3150 4875 Unpark\001
+0 0 50 -1 0 11 0.0000 2 180 540 4650 3675 Unpark\001
+2 0 50 -1 0 11 0.0000 2 180 540 1650 3600 Unpark\001
+2 0 50 -1 0 11 0.0000 2 180 540 2100 1875 Unpark\001
+0 0 50 -1 0 11 0.0000 2 180 540 4200 1875 Unpark\001

doc/theses/thierry_delisle_PhD/thesis/fig/idle.fig

-              r9e23b446
+              rffec1bf
 -2
 2
+5919 5250 6375 5775
+1 0 1 0 7 50 -1 -1 0.000 0 0 0 0 6147.000 5409.011 6102 5410 6147 5364 6192 5410
+1 0 1 0 7 50 -1 -1 0.000 0 0 0 0 6147.000 5410.000 6010 5410 6147 5273 6284 5410
+1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 8
+5410 6010 5501 5919 5501 5919 5775 6375 5775 6375 5501
+5501 6284 5410
+1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 4
+5410 6102 5501 6192 5501 6192 5410
+-6
+7442 6525 7875 6900
+1 0 1 0 7 50 -1 -1 0.000 0 1 1 1 3376.136 2169.318 2250 2625 2775 3225 3525 3375
+1 1.00 60.00 120.00
+1 1.00 60.00 60.00
+3466 2774 3899 3149
 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
 6584 7442 6900
+2833 3466 3149
 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
 6584 7836 6703
+2833 3860 2952
 2 0 1 0 7 50 -1 -1 0.000 0 0 0 4
 6703 7599 6663 7737 6722 7836 6703
+2952 3623 2912 3761 2971 3860 2952
 .000 -0.500 -0.500 0.000
 2 0 1 0 7 50 -1 -1 0.000 0 0 0 4
 6579 7621 6540 7759 6599 7857 6579
+2828 3645 2789 3783 2848 3881 2828
 .000 -0.500 -0.500 0.000
 -6
 7575 6825 7950 7325
+3599 3074 3974 3574
 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 8
 6950 7700 6825 7950 6825 7950 7325 7575 7325 7575 6950
 6950 7700 6825
+3199 3724 3074 3974 3074 3974 3574 3599 3574 3599 3199
+3199 3724 3074
 -6
 9092 6525 9525 6900
+5116 2774 5549 3149
 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
 6584 9092 6900
+2833 5116 3149
 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
 6584 9486 6703
+2833 5510 2952
 2 0 1 0 7 50 -1 -1 0.000 0 0 0 4
 6703 9249 6663 9387 6722 9486 6703
+2952 5273 2912 5411 2971 5510 2952
 .000 -0.500 -0.500 0.000
 2 0 1 0 7 50 -1 -1 0.000 0 0 0 4
 6579 9271 6540 9409 6599 9507 6579
+2828 5295 2789 5433 2848 5531 2828
 .000 -0.500 -0.500 0.000
 -6
 9225 6825 9600 7325
+5249 3074 5625 3574
 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 8
 6950 9350 6825 9600 6825 9600 7325 9225 7325 9225 6950
 6950 9350 6825
+3199 5374 3074 5625 3074 5625 3574 5249 3574 5249 3199
+3199 5374 3074
 -6
 10742 6525 11175 6900
+6766 2774 7199 3149
 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
 6584 10742 6900
+2833 6766 3149
 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
 6584 11136 6703
+2833 7160 2952
 2 0 1 0 7 50 -1 -1 0.000 0 0 0 4
 6703 10899 6663 11037 6722 11136 6703
+2952 6923 2912 7061 2971 7160 2952
 .000 -0.500 -0.500 0.000
 2 0 1 0 7 50 -1 -1 0.000 0 0 0 4
 6579 10921 6540 11059 6599 11157 6579
+2828 6945 2789 7083 2848 7181 2828
 .000 -0.500 -0.500 0.000
 -6
 10875 6825 11250 7325
+6899 3074 7274 3574
 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 8
+6950 11000 6825 11250 6825 11250 7325 10875 7325 10875 6950
+6950 11000 6825
+3199 7024 3074 7274 3074 7274 3574 6899 3574 6899 3199
+3199 7024 3074
+-6
+1875 1500 2331 2025
+1 0 1 0 7 50 -1 -1 0.000 0 0 0 0 2104.000 1660.011 2058 1660 2103 1614 2148 1660
+1 0 1 0 7 50 -1 -1 0.000 0 0 0 0 2104.000 1661.000 1966 1660 2103 1523 2240 1660
+1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 8
+1660 1966 1751 1875 1751 1875 2025 2331 2025 2331 1751
+1751 2240 1660
+1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 4
+1660 2058 1751 2148 1751 2148 1660
 -6
 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
+6150 6675 6150
+2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
+5250 6675 5250 6675 6600 5850 6600 5850 5250
+1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 1 2
+1 1.00 60.00 120.00
+0 1.00 60.00 60.00
+6150 7725 6525
+1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 1 2
+1 1.00 60.00 120.00
+0 1.00 60.00 60.00
+6150 9375 6525
+1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 1 2
+1 1.00 60.00 120.00
+0 1.00 60.00 60.00
+6150 11025 6525
+3 0 1 0 7 50 -1 -1 0.000 0 0 0 0 0 7
+5854 10763 6308 11288 6308 11550 5854 11288 5400 10763 5400
+5854
+3 0 1 0 7 50 -1 -1 0.000 0 0 0 0 0 7
+5854 9113 6308 9638 6308 9900 5854 9638 5400 9113 5400
+5854
+3 0 1 0 7 50 -1 -1 0.000 0 0 0 0 0 7
+5854 7463 6308 7988 6308 8250 5854 7988 5400 7463 5400
+5854
+2400 2699 2399
 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 1 2
 1 1.00 60.00 120.00
 1 1.00 60.00 60.00
 5925 7275 5925
+2399 3749 2774
 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 1 2
 1 1.00 60.00 120.00
 1 1.00 60.00 60.00
 5925 8925 5925
+2399 5399 2774
 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 1 2
 1 1.00 60.00 120.00
 1 1.00 60.00 60.00
 5925 10575 5925
+2175 3299 2174
 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 1 2
 1 1.00 60.00 120.00
 1 1.00 60.00 60.00
 5775 9825 5775
+2174 4949 2174
 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 1 2
 1 1.00 60.00 120.00
 1 1.00 60.00 60.00
 5775 8175 5775
+2 0 1 0 7 50 -1 -1 0.000 0 1 1 4
+2174 6599 2174
+1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 1 2
 1 1.00 60.00 120.00
 1 1.00 60.00 60.00
+6375 6375 6825 6750 7050 7350 6975
+.000 -0.500 -0.500 0.000
+0 0 50 -1 0 11 0.0000 2 135 810 5925 5175 Idle List\001
+0 0 50 -1 0 11 0.0000 2 135 810 5175 5550 Idle List\001
+0 0 50 -1 0 11 0.0000 2 135 360 5325 5700 Lock\001
+0 0 50 -1 0 11 0.0000 2 135 540 5775 6900 Atomic\001
+0 0 50 -1 0 11 0.0000 2 135 630 5775 7125 Pointer\001
+0 0 50 -1 0 11 0.0000 2 165 810 7950 6675 Benaphore\001
+0 0 50 -1 0 11 0.0000 2 135 720 8025 7125 Event FD\001
+0 0 50 -1 0 11 0.0000 2 135 1260 7275 5325 Idle Processor\001
+0 0 50 -1 0 11 0.0000 2 165 810 9600 6675 Benaphore\001
+0 0 50 -1 0 11 0.0000 2 135 720 9675 7125 Event FD\001
+0 0 50 -1 0 11 0.0000 2 135 1260 8925 5325 Idle Processor\001
+0 0 50 -1 0 11 0.0000 2 165 810 11250 6675 Benaphore\001
+0 0 50 -1 0 11 0.0000 2 135 720 11325 7125 Event FD\001
+0 0 50 -1 0 11 0.0000 2 135 1260 10575 5325 Idle Processor\001
+2024 5849 2024
+1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 1 2
+1 1.00 60.00 120.00
+1 1.00 60.00 60.00
+2024 4199 2024
+2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
+1499 2699 1499 2699 2850 1800 2850 1800 1499
+2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
+1650 5850 1650 5850 2550 4950 2550 4950 1650
+2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
+1650 4200 1650 4200 2550 3300 2550 3300 1650
+2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
+1650 7500 1650 7500 2550 6600 2550 6600 1650
+1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 1 2
+1 1.00 60.00 120.00
+1 1.00 60.00 60.00
+2399 7049 2774
+0 0 50 -1 0 11 0.0000 2 120 525 1799 3149 Atomic\001
+0 0 50 -1 0 11 0.0000 2 120 510 1799 3374 Pointer\001
+0 0 50 -1 0 11 0.0000 2 180 765 3974 2924 Benaphore\001
+0 0 50 -1 0 11 0.0000 2 120 690 4049 3374 Event FD\001
+0 0 50 -1 0 11 0.0000 2 180 765 5625 2924 Benaphore\001
+0 0 50 -1 0 11 0.0000 2 120 690 5699 3374 Event FD\001
+0 0 50 -1 0 11 0.0000 2 180 765 7274 2924 Benaphore\001
+0 0 50 -1 0 11 0.0000 2 120 690 7349 3374 Event FD\001
+2 0 50 -1 0 11 0.0000 2 135 585 1725 1800 Idle List\001
+2 0 50 -1 0 11 0.0000 2 135 360 1725 1950 Lock\001
+1 0 50 -1 0 11 0.0000 2 135 585 2250 1425 Idle List\001
+1 0 50 -1 0 11 0.0000 2 135 1020 3750 1575 Idle Processor\001
+1 0 50 -1 0 11 0.0000 2 135 1020 5400 1575 Idle Processor\001
+1 0 50 -1 0 11 0.0000 2 135 1020 7050 1575 Idle Processor\001

doc/theses/thierry_delisle_PhD/thesis/fig/idle1.fig

-              r9e23b446
+              rffec1bf
 -2
 2
 5919 5250 6375 5775
 1 0 1 0 7 50 -1 -1 0.000 0 0 0 0 6147.000 5409.011 6102 5410 6147 5364 6192 5410
 1 0 1 0 7 50 -1 -1 0.000 0 0 0 0 6147.000 5410.000 6010 5410 6147 5273 6284 5410
+1875 1500 2331 2025
+1 0 1 0 7 50 -1 -1 0.000 0 0 0 0 2104.000 1660.011 2058 1660 2103 1614 2148 1660
+1 0 1 0 7 50 -1 -1 0.000 0 0 0 0 2104.000 1661.000 1966 1660 2103 1523 2240 1660
 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 8
 5410 6010 5501 5919 5501 5919 5775 6375 5775 6375 5501
 5501 6284 5410
+1660 1966 1751 1875 1751 1875 2025 2331 2025 2331 1751
+1751 2240 1660
 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 4
 5410 6102 5501 6192 5501 6192 5410
+1660 2058 1751 2148 1751 2148 1660
 -6
 7575 6525 7950 7025
+3599 2774 3974 3274
 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 8
 6650 7700 6525 7950 6525 7950 7025 7575 7025 7575 6650
 6650 7700 6525
+2899 3724 2774 3974 2774 3974 3274 3599 3274 3599 2899
+2899 3724 2774
 -6
 9225 6525 9600 7025
+5249 2774 5625 3274
 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 8
 6650 9350 6525 9600 6525 9600 7025 9225 7025 9225 6650
 6650 9350 6525
+2899 5374 2774 5625 2774 5625 3274 5249 3274 5249 2899
+2899 5374 2774
 -6
 10875 6525 11250 7025
+6899 2774 7274 3274
 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 8
 6650 11000 6525 11250 6525 11250 7025 10875 7025 10875 6650
 6650 11000 6525
+2899 7024 2774 7274 2774 7274 3274 6899 3274 6899 2899
+2899 7024 2774
 -6
 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 1 2
 1 1.00 60.00 120.00
+0 1.00 60.00 60.00
+6150 7725 6525
+1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 1 2
+1 1.00 60.00 120.00
+0 1.00 60.00 60.00
+6150 9375 6525
+1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 1 2
+1 1.00 60.00 120.00
+0 1.00 60.00 60.00
+6150 11025 6525
+3 0 1 0 7 50 -1 -1 0.000 0 0 0 0 0 7
+5854 10763 6308 11288 6308 11550 5854 11288 5400 10763 5400
+5854
+3 0 1 0 7 50 -1 -1 0.000 0 0 0 0 0 7
+5854 9113 6308 9638 6308 9900 5854 9638 5400 9113 5400
+5854
+3 0 1 0 7 50 -1 -1 0.000 0 0 0 0 0 7
+5854 7463 6308 7988 6308 8250 5854 7988 5400 7463 5400
+5854
+1 1.00 60.00 60.00
+2399 3749 2774
 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 1 2
 1 1.00 60.00 120.00
 1 1.00 60.00 60.00
 5925 7275 5925
+2399 5399 2774
 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 1 2
 1 1.00 60.00 120.00
 1 1.00 60.00 60.00
 5925 8925 5925
+2399 7049 2774
 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 1 2
 1 1.00 60.00 120.00
 1 1.00 60.00 60.00
 5925 10575 5925
+2175 3299 2174
 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 1 2
 1 1.00 60.00 120.00
 1 1.00 60.00 60.00
 5775 9825 5775
+2174 4949 2174
 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 1 2
 1 1.00 60.00 120.00
 1 1.00 60.00 60.00
+5775 8175 5775
+2174 6599 2174
+1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 1 2
+1 1.00 60.00 120.00
+1 1.00 60.00 60.00
+2024 5849 2024
+1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 1 2
+1 1.00 60.00 120.00
+1 1.00 60.00 60.00
+2024 4199 2024
 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
+5250 6675 5250 6675 6075 5850 6075 5850 5250
+0 0 50 -1 0 11 0.0000 2 135 810 5925 5175 Idle List\001
+0 0 50 -1 0 11 0.0000 2 135 810 5175 5550 Idle List\001
+0 0 50 -1 0 11 0.0000 2 135 360 5325 5700 Lock\001
+0 0 50 -1 0 11 0.0000 2 135 1260 7275 5325 Idle Processor\001
+0 0 50 -1 0 11 0.0000 2 135 1260 8925 5325 Idle Processor\001
+0 0 50 -1 0 11 0.0000 2 135 1260 10575 5325 Idle Processor\001
+0 0 50 -1 0 11 0.0000 2 135 720 8025 6825 Event FD\001
+0 0 50 -1 0 11 0.0000 2 135 720 9675 6825 Event FD\001
+0 0 50 -1 0 11 0.0000 2 135 720 11325 6825 Event FD\001
+1650 5850 1650 5850 2550 4950 2550 4950 1650
+2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
+1650 4200 1650 4200 2550 3300 2550 3300 1650
+2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
+1650 7500 1650 7500 2550 6600 2550 6600 1650
+2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
+1499 2699 1499 2699 2400 1800 2400 1800 1499
+2 0 50 -1 0 11 0.0000 2 135 585 1725 1800 Idle List\001
+2 0 50 -1 0 11 0.0000 2 135 360 1725 1950 Lock\001
+1 0 50 -1 0 11 0.0000 2 135 585 2250 1425 Idle List\001
+1 0 50 -1 0 11 0.0000 2 135 1020 3750 1575 Idle Processor\001
+1 0 50 -1 0 11 0.0000 2 135 1020 5400 1575 Idle Processor\001
+1 0 50 -1 0 11 0.0000 2 135 1020 7050 1575 Idle Processor\001
+0 0 50 -1 0 11 0.0000 2 120 690 4049 3074 Event FD\001
+0 0 50 -1 0 11 0.0000 2 120 690 5699 3074 Event FD\001
+0 0 50 -1 0 11 0.0000 2 120 690 7349 3074 Event FD\001

doc/theses/thierry_delisle_PhD/thesis/fig/idle2.fig

-              r9e23b446
+              rffec1bf
 -2
 2
+5919 5250 6375 5775
+1 0 1 0 7 50 -1 -1 0.000 0 0 0 0 6147.000 5409.011 6102 5410 6147 5364 6192 5410
+1 0 1 0 7 50 -1 -1 0.000 0 0 0 0 6147.000 5410.000 6010 5410 6147 5273 6284 5410
+1 0 1 0 7 50 -1 -1 0.000 0 1 1 1 3150.000 2106.250 2250 2625 2775 3075 3525 3075
+1 1.00 60.00 120.00
+1 1.00 60.00 60.00
+1875 1500 2331 2025
+1 0 1 0 7 50 -1 -1 0.000 0 0 0 0 2104.000 1660.011 2058 1660 2103 1614 2148 1660
+1 0 1 0 7 50 -1 -1 0.000 0 0 0 0 2104.000 1661.000 1966 1660 2103 1523 2240 1660
 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 8
 5410 6010 5501 5919 5501 5919 5775 6375 5775 6375 5501
 5501 6284 5410
+1660 1966 1751 1875 1751 1875 2025 2331 2025 2331 1751
+1751 2240 1660
 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 4
 5410 6102 5501 6192 5501 6192 5410
+1660 2058 1751 2148 1751 2148 1660
 -6
 7575 6525 7950 7025
+3599 2774 3974 3274
 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 8
 6650 7700 6525 7950 6525 7950 7025 7575 7025 7575 6650
 6650 7700 6525
+2899 3724 2774 3974 2774 3974 3274 3599 3274 3599 2899
+2899 3724 2774
 -6
 9225 6525 9600 7025
+5249 2774 5625 3274
 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 8
 6650 9350 6525 9600 6525 9600 7025 9225 7025 9225 6650
 6650 9350 6525
+2899 5374 2774 5625 2774 5625 3274 5249 3274 5249 2899
+2899 5374 2774
 -6
 10875 6525 11250 7025
+6899 2774 7274 3274
 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 8
 6650 11000 6525 11250 6525 11250 7025 10875 7025 10875 6650
 6650 11000 6525
+2899 7024 2774 7274 2774 7274 3274 6899 3274 6899 2899
+2899 7024 2774
 -6
 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
+6150 6675 6150
+2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
+5250 6675 5250 6675 6600 5850 6600 5850 5250
+1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 1 2
+1 1.00 60.00 120.00
+0 1.00 60.00 60.00
+6150 7725 6525
+1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 1 2
+1 1.00 60.00 120.00
+0 1.00 60.00 60.00
+6150 9375 6525
+1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 1 2
+1 1.00 60.00 120.00
+0 1.00 60.00 60.00
+6150 11025 6525
+3 0 1 0 7 50 -1 -1 0.000 0 0 0 0 0 7
+5854 10763 6308 11288 6308 11550 5854 11288 5400 10763 5400
+5854
+3 0 1 0 7 50 -1 -1 0.000 0 0 0 0 0 7
+5854 9113 6308 9638 6308 9900 5854 9638 5400 9113 5400
+5854
+3 0 1 0 7 50 -1 -1 0.000 0 0 0 0 0 7
+5854 7463 6308 7988 6308 8250 5854 7988 5400 7463 5400
+5854
+2400 2699 2399
 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 1 2
 1 1.00 60.00 120.00
 1 1.00 60.00 60.00
 5925 7275 5925
+2399 3749 2774
 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 1 2
 1 1.00 60.00 120.00
 1 1.00 60.00 60.00
 5925 8925 5925
+2399 5399 2774
 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 1 2
 1 1.00 60.00 120.00
 1 1.00 60.00 60.00
 5925 10575 5925
+2399 7049 2774
 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 1 2
 1 1.00 60.00 120.00
 1 1.00 60.00 60.00
 5775 9825 5775
+2175 3299 2174
 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 1 2
 1 1.00 60.00 120.00
 1 1.00 60.00 60.00
 5775 8175 5775
+2 0 1 0 7 50 -1 -1 0.000 0 1 1 4
+2174 4949 2174
+1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 1 2
 1 1.00 60.00 120.00
 1 1.00 60.00 60.00
+6375 6375 6825 6900 6975 7500 6750
+.000 -0.500 -0.500 0.000
+0 0 50 -1 0 11 0.0000 2 135 810 5925 5175 Idle List\001
+0 0 50 -1 0 11 0.0000 2 135 810 5175 5550 Idle List\001
+0 0 50 -1 0 11 0.0000 2 135 360 5325 5700 Lock\001
+0 0 50 -1 0 11 0.0000 2 135 540 5775 6900 Atomic\001
+0 0 50 -1 0 11 0.0000 2 135 630 5775 7125 Pointer\001
+0 0 50 -1 0 11 0.0000 2 135 1260 7275 5325 Idle Processor\001
+0 0 50 -1 0 11 0.0000 2 135 1260 8925 5325 Idle Processor\001
+0 0 50 -1 0 11 0.0000 2 135 1260 10575 5325 Idle Processor\001
+0 0 50 -1 0 11 0.0000 2 135 720 8025 6825 Event FD\001
+0 0 50 -1 0 11 0.0000 2 135 720 9675 6825 Event FD\001
+0 0 50 -1 0 11 0.0000 2 135 720 11325 6825 Event FD\001
+2174 6599 2174
+1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 1 2
+1 1.00 60.00 120.00
+1 1.00 60.00 60.00
+2024 5849 2024
+1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 1 2
+1 1.00 60.00 120.00
+1 1.00 60.00 60.00
+2024 4199 2024
+2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
+1499 2699 1499 2699 2850 1800 2850 1800 1499
+2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
+1650 5850 1650 5850 2550 4950 2550 4950 1650
+2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
+1650 4200 1650 4200 2550 3300 2550 3300 1650
+2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
+1650 7500 1650 7500 2550 6600 2550 6600 1650
+0 0 50 -1 0 11 0.0000 2 120 525 1799 3149 Atomic\001
+0 0 50 -1 0 11 0.0000 2 120 510 1799 3374 Pointer\001
+2 0 50 -1 0 11 0.0000 2 135 585 1725 1800 Idle List\001
+2 0 50 -1 0 11 0.0000 2 135 360 1725 1950 Lock\001
+1 0 50 -1 0 11 0.0000 2 135 585 2250 1425 Idle List\001
+1 0 50 -1 0 11 0.0000 2 135 1020 3750 1575 Idle Processor\001
+1 0 50 -1 0 11 0.0000 2 135 1020 5400 1575 Idle Processor\001
+1 0 50 -1 0 11 0.0000 2 135 1020 7050 1575 Idle Processor\001
+0 0 50 -1 0 11 0.0000 2 120 690 4049 3074 Event FD\001
+0 0 50 -1 0 11 0.0000 2 120 690 5699 3074 Event FD\001
+0 0 50 -1 0 11 0.0000 2 120 690 7349 3074 Event FD\001

doc/theses/thierry_delisle_PhD/thesis/fig/idle_state.fig

-              r9e23b446
+              rffec1bf
 -2
 2
 3 0 1 0 7 50 -1 -1 0.000 1 0.0000 3900 3600 571 571 3900 3600 3375 3375
 3 0 1 0 7 50 -1 -1 0.000 1 0.0000 6300 3600 605 605 6300 3600 5775 3300
 3 0 1 0 7 50 -1 -1 0.000 1 0.0000 5100 5400 600 600 5100 5400 4500 5400
+3 0 1 0 7 50 -1 -1 0.000 1 0.0000 3000 3600 600 600 3000 3600 2400 3600
+3 0 1 0 7 50 -1 -1 0.000 1 0.0000 1800 1800 600 600 1800 1800 1200 1800
+3 0 1 0 7 50 -1 -1 0.000 1 0.0000 4205 1800 600 600 4205 1800 3605 1800
 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 0 2
 0 1.00 60.00 120.00
 4125 4725 4950
+1 1.00 60.00 120.00
+2325 2625 3150
 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 0 2
 0 1.00 60.00 120.00
 3600 5700 3600
+1 1.00 60.00 120.00
+1800 3600 1800
 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 0 2
 0 1.00 60.00 120.00
 4125 5475 4875
 1 0 50 -1 0 11 0.0000 2 135 450 5100 5475 AWAKE\001
 1 0 50 -1 0 11 0.0000 2 135 450 6300 3675 SLEEP\001
 1 0 50 -1 0 11 0.0000 2 135 540 3900 3675 SEARCH\001
 0 0 50 -1 0 11 0.0000 2 135 360 5775 4650 WAKE\001
 2 0 50 -1 0 11 0.0000 2 135 540 4350 4650 CANCEL\001
 1 0 50 -1 0 11 0.0000 2 135 630 5025 3450 CONFIRM\001
+1 1.00 60.00 120.00
+2325 3375 3150
+1 0 50 -1 0 11 0.0000 2 120 675 3000 3675 AWAKE\001
+1 0 50 -1 0 11 0.0000 2 120 525 4200 1875 SLEEP\001
+1 0 50 -1 0 11 0.0000 2 120 720 1800 1875 SEARCH\001
+2 0 50 -1 0 11 0.0000 2 120 720 2250 2850 CANCEL\001
+1 0 50 -1 0 11 0.0000 2 120 840 2925 1650 CONFIRM\001
+0 0 50 -1 0 11 0.0000 2 120 540 3750 2850 WAKE\001

doc/theses/thierry_delisle_PhD/thesis/fig/io_uring.fig

-              r9e23b446
+              rffec1bf
 -2
 2
 180 3240 2025 3510
+675 3105 2520 3375
 1 0 1 0 7 40 -1 -1 0.000 0 0 -1 0 0 2
 3240 720 3510
+3105 1215 3375
 1 0 1 0 7 40 -1 -1 0.000 0 0 -1 0 0 2
 3240 450 3510
+3105 945 3375
 2 0 1 0 7 45 -1 20 0.000 0 0 -1 0 0 5
 3240 1260 3240 1260 3510 180 3510 180 3240
+3105 1755 3105 1755 3375 675 3375 675 3105
 1 0 1 0 7 40 -1 -1 0.000 0 0 -1 0 0 2
 3240 990 3510
 0 0 40 -1 0 12 0.0000 2 165 990 1035 3420 {\\small S3}\001
 0 0 40 -1 0 12 0.0000 2 165 990 765 3420 {\\small S2}\001
 0 0 40 -1 0 12 0.0000 2 165 990 225 3420 {\\small S0}\001
 0 0 40 -1 0 12 0.0000 2 165 990 495 3420 {\\small S1}\001
+3105 1485 3375
+0 0 40 -1 0 12 0.0000 2 165 930 1530 3285 {\\small S3}\001
+0 0 40 -1 0 12 0.0000 2 165 930 1260 3285 {\\small S2}\001
+0 0 40 -1 0 12 0.0000 2 165 930 720 3285 {\\small S0}\001
+0 0 40 -1 0 12 0.0000 2 165 930 990 3285 {\\small S1}\001
 -6
 1530 2610 3240 4140
 1 0 1 0 7 35 -1 -1 0.000 0 1 1 0 2455.714 3375.000 1890 2700 1575 3375 1890 4050
+2025 2475 3735 4005
+1 0 1 0 7 35 -1 -1 0.000 0 1 1 0 2950.714 3240.000 2385 2565 2070 3240 2385 3915
 1 1.00 60.00 120.00
 3 0 1 0 7 40 -1 20 0.000 1 0.0000 2475 3375 315 315 2475 3375 2790 3375
 3 0 1 0 7 50 -1 20 0.000 1 0.0000 2475 3375 765 765 2475 3375 3240 3375
+3 0 1 0 7 40 -1 20 0.000 1 0.0000 2970 3240 315 315 2970 3240 3285 3240
+3 0 1 0 7 50 -1 20 0.000 1 0.0000 2970 3240 765 765 2970 3240 3735 3240
 1 0 1 0 7 45 -1 -1 0.000 0 0 -1 0 0 2
 3375 2133 2690
+3240 2628 2555
 1 0 1 0 7 45 -1 -1 4.000 0 0 -1 0 0 2
 3375 1769 3093
+3240 2264 2958
 1 0 1 0 7 45 -1 -1 4.000 0 0 -1 0 0 2
 3375 1769 3661
+3240 2264 3526
 1 0 1 0 7 45 -1 -1 4.000 0 0 -1 0 0 2
 3375 2133 4057
+3240 2628 3922
 1 1 1 0 7 35 -1 0 4.000 0 0 -1 0 0 2
 3375 2745 3375
+3240 3240 3240
 -6
 585 2250 1485 2610
 2 0 50 -1 0 12 0.0000 2 135 900 1485 2385 Submission\001
 2 0 50 -1 0 12 0.0000 2 165 360 1485 2580 Ring\001
+1080 2115 1980 2475
+2 0 50 -1 0 12 0.0000 2 135 945 1980 2250 Submission\001
+2 0 50 -1 0 12 0.0000 2 180 405 1980 2445 Ring\001
 -6
 3600 2610 5265 4140
 1 0 1 0 7 35 -1 -1 0.000 0 1 1 0 4384.000 3375.000 4950 4050 5265 3375 4950 2700
+4095 2475 5760 4005
+1 0 1 0 7 35 -1 -1 0.000 0 1 1 0 4879.000 3240.000 5445 3915 5760 3240 5445 2565
 1 1.00 60.00 120.00
 3 0 1 0 7 40 -1 20 0.000 1 3.1416 4365 3375 315 315 4365 3375 4050 3375
 3 0 1 0 7 50 -1 20 0.000 1 3.1416 4365 3375 765 765 4365 3375 3600 3375
+3 0 1 0 7 40 -1 20 0.000 1 3.1416 4860 3240 315 315 4860 3240 4545 3240
+3 0 1 0 7 50 -1 20 0.000 1 3.1416 4860 3240 765 765 4860 3240 4095 3240
 1 0 1 0 7 45 -1 -1 0.000 0 0 -1 0 0 2
 3375 4707 4060
+3240 5202 3925
 1 0 1 0 7 45 -1 -1 4.000 0 0 -1 0 0 2
 3375 5071 3657
+3240 5566 3522
 1 0 1 0 7 45 -1 -1 4.000 0 0 -1 0 0 2
 3375 5071 3089
+3240 5566 2954
 1 0 1 0 7 45 -1 -1 4.000 0 0 -1 0 0 2
 3375 4707 2693
+3240 5202 2558
 1 1 1 0 7 35 -1 0 4.000 0 0 -1 0 0 2
 3375 4095 3375
+3240 4590 3240
 -6
 5355 2250 6255 2610
 0 0 50 -1 0 12 0.0000 2 165 360 5355 2580 Ring\001
 0 0 50 -1 0 12 0.0000 2 165 900 5355 2385 Completion\001
+5850 2115 6750 2475
+0 0 50 -1 0 12 0.0000 2 180 405 5850 2445 Ring\001
+0 0 50 -1 0 12 0.0000 2 180 975 5850 2250 Completion\001
 -6
 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 0 2
 1 1.00 60.00 120.00
 2025 2550 2486
+1890 3045 2351
 1 0 1 0 7 50 -1 -1 4.000 0 0 -1 1 0 2
 1 1.00 60.00 120.00
 2475 3825 2025
+2340 4320 1890
 1 0 1 0 7 50 -1 -1 4.000 0 0 -1 1 0 2
 1 1.00 60.00 120.00
 4268 3066 4538
+4095 3600 4410
 1 0 1 0 7 50 -1 -1 4.000 0 0 -1 1 0 2
 1 1.00 60.00 120.00
 4545 4275 4230
+4410 4770 4095
 1 1 1 0 7 55 -1 -1 4.000 0 0 -1 0 0 2
 3375 6255 3375
 0 0 35 -1 0 12 0.0000 2 165 1170 1845 3060 {\\small \\&S2}\001
 0 0 35 -1 0 12 0.0000 2 165 1170 1755 3420 {\\small \\&S3}\001
 0 0 35 -1 0 12 0.0000 2 165 1170 1890 3735 {\\small \\&S0}\001
 0 0 50 -1 0 12 0.0000 6 135 360 2790 2565 Push\001
 0 0 50 -1 0 12 0.0000 6 165 270 2880 4230 Pop\001
 0 0 50 -1 0 12 0.0000 6 135 360 2025 4275 Head\001
 0 0 50 -1 0 12 0.0000 6 135 360 2025 2565 Tail\001
 0 0 35 -1 0 12 0.0000 2 165 990 4635 3060 {\\small C0}\001
 0 0 35 -1 0 12 0.0000 2 165 990 4815 3420 {\\small C1}\001
 0 0 35 -1 0 12 0.0000 2 165 990 4635 3780 {\\small C2}\001
 0 0 50 -1 0 12 0.0000 4 135 360 4725 4275 Tail\001
 0 0 50 -1 0 12 0.0000 6 135 360 4590 2565 Head\001
 0 0 50 -1 0 12 0.0000 2 135 990 5535 3285 Kernel Line\001
 1 0 50 -1 0 12 0.0000 2 180 1350 3375 4815 {\\Large Kernel}\001
 1 0 50 -1 0 12 0.0000 2 180 1800 3375 1845 {\\Large Application}\001
 0 0 50 -1 0 12 0.0000 6 165 270 3690 2565 Pop\001
 0 0 50 -1 0 12 0.0000 4 135 360 3465 4230 Push\001
 0 0 50 -1 0 12 0.0000 2 135 90 0 3285 S\001
+3240 6750 3240
+0 0 35 -1 0 12 0.0000 2 165 1140 2340 2925 {\\small \\&S2}\001
+0 0 50 -1 0 12 0.0000 6 135 390 3285 2430 Push\001
+0 0 50 -1 0 12 0.0000 6 135 330 2520 2430 Tail\001
+0 0 35 -1 0 12 0.0000 2 165 960 5130 2925 {\\small C0}\001
+0 0 35 -1 0 12 0.0000 2 165 960 5310 3285 {\\small C1}\001
+0 0 35 -1 0 12 0.0000 2 165 960 5130 3645 {\\small C2}\001
+0 0 50 -1 0 12 0.0000 4 135 330 5220 4140 Tail\001
+0 0 50 -1 0 12 0.0000 6 135 420 5085 2430 Head\001
+0 0 50 -1 0 12 0.0000 2 135 960 6030 3150 Kernel Line\001
+0 0 50 -1 0 12 0.0000 2 135 105 495 3150 S\001
+0 0 35 -1 0 12 0.0000 2 165 1140 2385 3645 {\\small \\&S0}\001
+0 0 50 -1 0 12 0.0000 6 135 420 2340 4140 Head\001
+0 0 35 -1 0 12 0.0000 2 165 1140 2250 3285 {\\small \\&S3}\001
+2 0 50 -1 0 12 0.0000 4 135 390 4500 4140 Push\001
+1 0 50 -1 0 12 0.0000 2 180 1290 3915 4680 {\\Large Kernel}\001
+0 0 50 -1 0 12 0.0000 6 180 315 3285 4140 Pop\001
+1 0 50 -1 0 12 0.0000 2 180 1725 3915 1755 {\\Large Application}\001
+2 0 50 -1 0 12 0.0000 6 180 315 4545 2430 Pop\001

doc/theses/thierry_delisle_PhD/thesis/fig/system.fig

-              r9e23b446
+              rffec1bf
 3750 8025 3750
 -6
+4125 4725 4950 4950
+3 0 1 -1 -1 0 0 -1 0.000 1 0.0000 4250 4838 100 100 4250 4838 4350 4838
+0 -1 0 0 0 12 0.0000 2 135 510 4425 4875 thread\001
+-6
+5175 4725 6300 4950
+2 0 1 -1 -1 0 0 -1 0.000 0 0 0 0 0 5
+4950 5400 4725 5175 4725 5175 4950 5400 4950
+0 -1 0 0 0 12 0.0000 2 135 765 5475 4875 processor\001
+-6
+6600 4725 7500 4950
+2 1 1 -1 -1 0 0 -1 3.000 0 0 0 0 0 5
+4950 6600 4950 6600 4725 6825 4725 6825 4950
+0 -1 0 0 0 12 0.0000 2 135 540 6900 4875 cluster\001
+-6
+2175 4725 3975 4950
+3 0 1 0 0 0 0 0 0.000 1 0.0000 2250 4830 30 30 2250 4830 2280 4830
+0 -1 0 0 0 12 0.0000 2 180 1605 2325 4875 generator/coroutine\001
+-6
+1575 2550 2775 3900
+2 0 1 -1 -1 0 0 -1 0.000 0 0 0 0 0 5
+3450 2400 3000 1950 3000 1950 3450 2400 3450
+1 -1 0 0 0 12 0.0000 2 135 1170 2175 2700 Discrete-event\001
+1 -1 0 0 0 12 0.0000 2 180 720 2175 2925 Manager\001
+1 -1 0 0 0 12 0.0000 2 180 930 2175 3675 preemption\001
+1 -1 0 0 0 12 0.0000 2 135 630 2175 3900 timeout\001
+-6
 3 0 1 -1 -1 0 0 -1 0.000 1 0.0000 5550 2625 150 150 5550 2625 5700 2625
 3 0 1 -1 -1 0 0 -1 0.000 1 0.0000 5550 3225 150 150 5550 3225 5700 3225
 …
 3 0 1 -1 -1 0 0 -1 0.000 1 0.0000 3975 2850 150 150 3975 2850 4125 2850
 3 0 1 -1 -1 0 0 -1 0.000 1 0.0000 7200 2775 150 150 7200 2775 7350 2775
-3 0 1 0 0 0 0 0 0.000 1 0.0000 2250 4830 30 30 2250 4830 2280 4830
 3 0 1 0 0 0 0 0 0.000 1 0.0000 7200 2775 30 30 7200 2775 7230 2805
 3 0 1 -1 -1 0 0 -1 0.000 1 0.0000 3525 3600 150 150 3525 3600 3675 3600
-3 0 1 -1 -1 0 0 -1 0.000 1 0.0000 4625 4838 100 100 4625 4838 4725 4838
-2 0 1 -1 -1 0 0 -1 0.000 0 0 0 0 0 5
-4200 2400 3750 1950 3750 1950 4200 2400 4200
 2 1 1 -1 -1 0 0 -1 4.000 0 0 0 0 0 5
 4500 6300 1800 3000 1800 3000 4500 6300 4500
 …
 1 1.00 45.00 90.00
 3750 7875 2325 7200 2325 7200 2550
+2 1 1 -1 -1 0 0 -1 3.000 0 0 0 0 0 5
+4950 6750 4950 6750 4725 6975 4725 6975 4950
+2 0 1 -1 -1 0 0 -1 0.000 0 0 0 0 0 5
+4950 5850 4725 5625 4725 5625 4950 5850 4950
+1 -1 0 0 0 10 0.0000 2 135 900 5550 4425 Processors\001
+1 -1 0 0 0 10 0.0000 2 165 1170 4200 3975 Ready Threads\001
+1 -1 0 0 0 10 0.0000 2 165 1440 7350 1725 Other Cluster(s)\001
+1 -1 0 0 0 10 0.0000 2 135 1080 4650 1725 User Cluster\001
+1 -1 0 0 0 10 0.0000 2 165 630 2175 3675 Manager\001
+1 -1 0 0 0 10 0.0000 2 135 1260 2175 3525 Discrete-event\001
+1 -1 0 0 0 10 0.0000 2 150 900 2175 4350 preemption\001
+0 -1 0 0 0 10 0.0000 2 135 630 7050 4875 cluster\001
+1 -1 0 0 0 10 0.0000 2 135 1350 4200 3225 Blocked Threads\001
+0 -1 0 0 0 10 0.0000 2 135 540 4800 4875 thread\001
+0 -1 0 0 0 10 0.0000 2 120 810 5925 4875 processor\001
+0 -1 0 0 0 10 0.0000 2 165 1710 2325 4875 generator/coroutine\001
+1 -1 0 0 0 12 0.0000 2 135 840 5550 4425 Processors\001
+1 -1 0 0 0 12 0.0000 2 180 1215 4200 3975 Ready Threads\001
+1 -1 0 0 0 12 0.0000 2 165 1275 7350 1725 Other Cluster(s)\001
+1 -1 0 0 0 12 0.0000 2 135 990 4650 1725 User Cluster\001
+1 -1 0 0 0 12 0.0000 2 135 1380 4200 3225 Blocked Threads\001

doc/theses/thierry_delisle_PhD/thesis/local.bib

-              r9e23b446
+              rffec1bf
 % Cforall
 @misc{cfa:frontpage,
   url = {https://cforall.uwaterloo.ca/}
+  howpublished = {\href{https://cforall.uwaterloo.ca}{https://\-cforall.uwaterloo.ca}}
+}
 @article{cfa:typesystem,
 …
 @misc{MAN:linux/cfs,
   title = {{CFS} Scheduler - The Linux Kernel documentation},
   url = {https://www.kernel.org/doc/html/latest/scheduler/sched-design-CFS.html}
+  howpublished = {\href{https://www.kernel.org/doc/html/latest/scheduler/sched-design-CFS.html}{https://\-www.kernel.org/\-doc/\-html/\-latest/\-scheduler/\-sched-design-CFS.html}}
+}
 …
   year = {2019},
   month = {February},
   url = {https://opensource.com/article/19/2/fair-scheduling-linux}
+  howpublished = {\href{https://opensource.com/article/19/2/fair-scheduling-linux}{https://\-opensource.com/\-article/\-19/2\-/\-fair-scheduling-linux}}
+}
 …
+}
 @article{MAN:linux/cfs/balancing,
+@misc{MAN:linux/cfs/balancing,
   title={Reworking {CFS} load balancing},
+  journal={LWN article, available at: https://lwn.net/Articles/793427/},
+  year={2013}
+  journal={LWN article},
+  year={2019},
+  howpublished = {\href{https://lwn.net/Articles/793427}{https://\-lwn.net/\-Articles/\-793427}},
+}
 …
   title = {Mach Scheduling and Thread Interfaces - Kernel Programming Guide},
   organization = {Apple Inc.},
   url = {https://developer.apple.com/library/archive/documentation/Darwin/Conceptual/KernelProgramming/scheduler/scheduler.html}
+  howpublished = {\href{https://developer.apple.com/library/archive/documentation/Darwin/Conceptual/KernelProgramming/scheduler/scheduler.html}{https://developer.apple.com/library/archive/documentation/Darwin/Conceptual/KernelProgramming/scheduler/scheduler.html}}
+}
 …
   month = {June},
   series = {Developer Reference},
   url = {https://www.microsoftpressstore.com/articles/article.aspx?p=2233328&seqNum=7#:~:text=Overview\%20of\%20Windows\%20Scheduling,a\%20phenomenon\%20called\%20processor\%20affinity}
+}
 @online{GITHUB:go,
+  howpublished = {\href{https://www.microsoftpressstore.com/articles/article.aspx?p=2233328&seqNum=7#:~:text=Overview\%20of\%20Windows\%20Scheduling,a\%20phenomenon\%20called\%20processor\%20affinity}{https://\-www.microsoftpressstore.com/\-articles/\-article.aspx?p=2233328&seqNum=7#:~:text=Overview\%20of\%20Windows\%20Scheduling,a\%20phenomenon\%20called\%20processor\%20affinity}}
+}
+@misc{GITHUB:go,
   title = {GitHub - The Go Programming Language},
   author = {The Go Programming Language},
   url = {https://github.com/golang/go},
+  howpublished = {\href{https://github.com/golang/go}{https://\-github.com/\-golang/\-go}},
   version = {Change-Id: If07f40b1d73b8f276ee28ffb8b7214175e56c24d}
+}
 …
   year = {2019},
   booktitle = {Hydra},
   url = {https://www.youtube.com/watch?v=-K11rY57K7k&ab_channel=Hydra}
+  howpublished = {\href{https://www.youtube.com/watch?v=-K11rY57K7k&ab_channel=Hydra}{https://\-www.youtube.com/\-watch?v=-K11rY57K7k&ab_channel=Hydra}}
+}
 …
   year = {2008},
   booktitle = {Erlang User Conference},
+  url = {http://www.erlang.se/euc/08/euc_smp.pdf}
+}
+  howpublished = {\href{http://www.erlang.se/euc/08/euc_smp.pdf}{http://\-www.erlang.se/\-euc/\-08/\-euc_smp.pdf}}
+}
 @manual{MAN:tbb/scheduler,
   title = {Scheduling Algorithm - Intel{\textregistered} Threading Building Blocks Developer Reference},
   organization = {Intel{\textregistered}},
   url = {https://www.threadingbuildingblocks.org/docs/help/reference/task_scheduler/scheduling_algorithm.html}
+  howpublished = {\href{https://www.threadingbuildingblocks.org/docs/help/reference/task_scheduler/scheduling_algorithm.html}{https://\-www.threadingbuildingblocks.org/\-docs/\-help/\-reference/\-task\_scheduler/\-scheduling\_algorithm.html}}
+}
 …
   title = {Quasar Core - Quasar User Manual},
   organization = {Parallel Universe},
   url = {https://docs.paralleluniverse.co/quasar/}
+  howpublished = {\href{https://docs.paralleluniverse.co/quasar}{https://\-docs.paralleluniverse.co/\-quasar}}
+}
 @misc{MAN:project-loom,
   url = {https://www.baeldung.com/openjdk-project-loom}
+  howpublished = {\href{https://www.baeldung.com/openjdk-project-loom}{https://\-www.baeldung.com/\-openjdk-project-loom}}
+}
 @misc{MAN:java/fork-join,
   url = {https://www.baeldung.com/java-fork-join}
+  howpublished = {\href{https://www.baeldung.com/java-fork-join}{https://\-www.baeldung.com/\-java-fork-join}}
+}
 …
   month   = "March",
   version = {0,4},
   howpublished = {\url{https://kernel.dk/io_uring.pdf}}
+  howpublished = {\href{https://kernel.dk/io_uring.pdf}{https://\-kernel.dk/\-io\_uring.pdf}}
+}
 …
   title = "Control theory --- {W}ikipedia{,} The Free Encyclopedia",
   year = "2020",
   url = "https://en.wikipedia.org/wiki/Task_parallelism",
+  howpublished = {\href{https://en.wikipedia.org/wiki/Task_parallelism}{https://\-en.wikipedia.org/\-wiki/\-Task\_parallelism}},
   note = "[Online; accessed 22-October-2020]"
+}
 …
   title = "Task parallelism --- {W}ikipedia{,} The Free Encyclopedia",
   year = "2020",
   url = "https://en.wikipedia.org/wiki/Control_theory",
+  howpublished = "\href{https://en.wikipedia.org/wiki/Control_theory}{https://\-en.wikipedia.org/\-wiki/\-Control\_theory}",
   note = "[Online; accessed 22-October-2020]"
+}
 …
   title = "Implicit parallelism --- {W}ikipedia{,} The Free Encyclopedia",
   year = "2020",
   url = "https://en.wikipedia.org/wiki/Implicit_parallelism",
+  howpublished = "\href{https://en.wikipedia.org/wiki/Implicit_parallelism}{https://\-en.wikipedia.org/\-wiki/\-Implicit\_parallelism}",
   note = "[Online; accessed 23-October-2020]"
+}
 …
   title = "Explicit parallelism --- {W}ikipedia{,} The Free Encyclopedia",
   year = "2017",
   url = "https://en.wikipedia.org/wiki/Explicit_parallelism",
+  howpublished = "\href{https://en.wikipedia.org/wiki/Explicit_parallelism}{https://\-en.wikipedia.org/\-wiki/\-Explicit\_parallelism}",
   note = "[Online; accessed 23-October-2020]"
+}
 …
   title = "Linear congruential generator --- {W}ikipedia{,} The Free Encyclopedia",
   year = "2020",
   url = "https://en.wikipedia.org/wiki/Linear_congruential_generator",
+  howpublished = "\href{https://en.wikipedia.org/wiki/Linear_congruential_generator}{https://en.wikipedia.org/wiki/Linear\_congruential\_generator}",
   note = "[Online; accessed 2-January-2021]"
+}
 …
   title = "Futures and promises --- {W}ikipedia{,} The Free Encyclopedia",
   year = "2020",
   url = "https://en.wikipedia.org/wiki/Futures_and_promises",
+  howpublished = "\href{https://en.wikipedia.org/wiki/Futures_and_promises}{https://\-en.wikipedia.org/\-wiki/Futures\_and\_promises}",
   note = "[Online; accessed 9-February-2021]"
+}
 …
   title = "Read-copy-update --- {W}ikipedia{,} The Free Encyclopedia",
   year = "2022",
   url = "https://en.wikipedia.org/wiki/Read-copy-update",
+  howpublished = "\href{https://en.wikipedia.org/wiki/Read-copy-update}{https://\-en.wikipedia.org/\-wiki/\-Read-copy-update}",
   note = "[Online; accessed 12-April-2022]"
+}
 …
   title = "Readers-writer lock --- {W}ikipedia{,} The Free Encyclopedia",
   year = "2021",
   url = "https://en.wikipedia.org/wiki/Readers%E2%80%93writer_lock",
+  howpublished = "\href{https://en.wikipedia.org/wiki/Readers-writer_lock}{https://\-en.wikipedia.org/\-wiki/\-Readers-writer\_lock}",
   note = "[Online; accessed 12-April-2022]"
+}
+@misc{wiki:binpak,
+  author = "{Wikipedia contributors}",
+  title = "Bin packing problem --- {W}ikipedia{,} The Free Encyclopedia",
+  year = "2022",
+  howpublished = "\href{https://en.wikipedia.org/wiki/Bin_packing_problem}{https://\-en.wikipedia.org/\-wiki/\-Bin\_packing\_problem}",
+  note = "[Online; accessed 29-June-2022]"
+}
 …
 % [05/04, 12:36] Trevor Brown
 %     i don't know where rmr complexity was first introduced, but there are many many many papers that use the term and define it
 % [05/04, 12:37] Trevor Brown
+% [05/04, 12:37] Trevor Brown
 %     here's one paper that uses the term a lot and links to many others that use it... might trace it to something useful there https://drops.dagstuhl.de/opus/volltexte/2021/14832/pdf/LIPIcs-DISC-2021-30.pdf
 % [05/04, 12:37] Trevor Brown
+% [05/04, 12:37] Trevor Brown
 %     another option might be to cite a textbook
 % [05/04, 12:42] Trevor Brown
+% [05/04, 12:42] Trevor Brown
 %     but i checked two textbooks in the area i'm aware of and i don't see a definition of rmr complexity in either
 % [05/04, 12:42] Trevor Brown
+% [05/04, 12:42] Trevor Brown
 %     this one has a nice statement about the prevelance of rmr complexity, as well as some rough definition
 % [05/04, 12:42] Trevor Brown
+% [05/04, 12:42] Trevor Brown
 %     https://dl.acm.org/doi/pdf/10.1145/3465084.3467938
 …
+%
 % https://doi.org/10.1137/1.9781611973099.100
+@misc{AIORant,
+  author = "Linus Torvalds",
+  title = "Re: [PATCH 09/13] aio: add support for async openat()",
+  year = "2016",
+  month = jan,
+  howpublished = "\href{https://lwn.net/Articles/671657}{https://\-lwn.net/\-Articles/671657}",
+  note = "[Online; accessed 6-June-2022]"
+}
+@misc{apache,
+  key = {Apache Software Foundation},
+  title = {{T}he {A}pache Web Server},
+  howpublished = {\href{http://httpd.apache.org}{http://\-httpd.apache.org}},
+  note = "[Online; accessed 6-June-2022]"
+}
+@misc{SeriallyReusable,
+    author      = {IBM},
+    title       = {Serially reusable programs},
+    month       = mar,
+    howpublished= {\href{https://www.ibm.com/docs/en/ztpf/1.1.0.15?topic=structures-serially-reusable-programs}{https://www.ibm.com/\-docs/\-en/\-ztpf/\-1.1.0.15?\-topic=structures\--serially\--reusable-programs}},
+    year        = 2021,
+}
+@inproceedings{Albers12,
+    author      = {Susanne Albers and Antonios Antoniadis},
+    title       = {Race to Idle: New Algorithms for Speed Scaling with a Sleep State},
+    booktitle   = {Proceedings of the 2012  Annual ACM-SIAM Symposium on Discrete Algorithms (SODA)},
+    doi         = {10.1137/1.9781611973099.100},
+    URL         = {https://epubs.siam.org/doi/abs/10.1137/1.9781611973099.100},
+    eprint      = {https://epubs.siam.org/doi/pdf/10.1137/1.9781611973099.100},
+    year        = 2012,
+    month       = jan,
+    pages       = {1266-1285},
+}

doc/theses/thierry_delisle_PhD/thesis/text/core.tex

-              r9e23b446
+              rffec1bf
 \chapter{Scheduling Core}\label{core}
+Before discussing scheduling in general, where it is important to address systems that are changing states, this document discusses scheduling in a somewhat ideal scenario, where the system has reached a steady state. For this purpose, a steady state is loosely defined as a state where there are always \glspl{thrd} ready to run and the system has the resources necessary to accomplish the work, \eg, enough workers. In short, the system is neither overloaded nor underloaded.
+It is important to discuss the steady state first because it is the easiest case to handle and, relatedly, the case in which the best performance is to be expected. As such, when the system is either overloaded or underloaded, a common approach is to try to adapt the system to this new load and return to the steady state, \eg, by adding or removing workers. Therefore, flaws in scheduling the steady state tend to be pervasive in all states.
+Before discussing scheduling in general, where it is important to address systems that are changing states, this document discusses scheduling in a somewhat ideal scenario, where the system has reached a steady state.
+For this purpose, a steady state is loosely defined as a state where there are always \glspl{thrd} ready to run and the system has the resources necessary to accomplish the work, \eg, enough workers.
+In short, the system is neither overloaded nor underloaded.
+It is important to discuss the steady state first because it is the easiest case to handle and, relatedly, the case in which the best performance is to be expected.
+As such, when the system is either overloaded or underloaded, a common approach is to try to adapt the system to this new load and return to the steady state, \eg, by adding or removing workers.
+Therefore, flaws in scheduling the steady state tend to be pervasive in all states.
 \section{Design Goals}
+As with most of the design decisions behind \CFA, an important goal is to match the expectation of the programmer according to their execution mental-model. To match expectations, the design must offer the programmer sufficient guarantees so that, as long as they respect the execution mental-model, the system also respects this model.
+As with most of the design decisions behind \CFA, an important goal is to match the expectation of the programmer according to their execution mental-model.
+To match expectations, the design must offer the programmer sufficient guarantees so that, as long as they respect the execution mental-model, the system also respects this model.
 For threading, a simple and common execution mental-model is the ``Ideal multi-tasking CPU'' :
 …
 Applied to threads, this model states that every ready \gls{thrd} immediately runs in parallel with all other ready \glspl{thrd}. While a strict implementation of this model is not feasible, programmers still have expectations about scheduling that come from this model.
+In general, the expectation at the center of this model is that ready \glspl{thrd} do not interfere with each other but simply share the hardware. This assumption makes it easier to reason about threading because ready \glspl{thrd} can be thought of in isolation and the effect of the scheduler can be virtually ignored. This expectation of \gls{thrd} independence means the scheduler is expected to offer two guarantees:
+In general, the expectation at the center of this model is that ready \glspl{thrd} do not interfere with each other but simply share the hardware.
+This assumption makes it easier to reason about threading because ready \glspl{thrd} can be thought of in isolation and the effect of the scheduler can be virtually ignored.
+This expectation of \gls{thrd} independence means the scheduler is expected to offer two guarantees:
 \begin{enumerate}
         \item A fairness guarantee: a \gls{thrd} that is ready to run is not prevented from running by another thread.
 …
 \end{enumerate}
+It is important to note that these guarantees are expected only up to a point. \Glspl{thrd} that are ready to run should not be prevented to do so, but they still share the limited hardware resources. Therefore, the guarantee is considered respected if a \gls{thrd} gets access to a \emph{fair share} of the hardware resources, even if that share is very small.
+Similarly the performance guarantee, the lack of interference among threads, is only relevant up to a point. Ideally, the cost of running and blocking should be constant regardless of contention, but the guarantee is considered satisfied if the cost is not \emph{too high} with or without contention. How much is an acceptable cost is obviously highly variable. For this document, the performance experimentation attempts to show the cost of scheduling is at worst equivalent to existing algorithms used in popular languages. This demonstration can be made by comparing applications built in \CFA to applications built with other languages or other models. Recall programmer expectation is that the impact of the scheduler can be ignored. Therefore, if the cost of scheduling is compatitive to other popular languages, the guarantee will be consider achieved.
+It is important to note that these guarantees are expected only up to a point.
+\Glspl{thrd} that are ready to run should not be prevented from doing so, but they still share the limited hardware resources.
+Therefore, the guarantee is considered respected if a \gls{thrd} gets access to a \emph{fair share} of the hardware resources, even if that share is very small.
+Similarly, the performance guarantee, \ie, the lack of interference among threads, is only relevant up to a point.
+Ideally, the cost of running and blocking should be constant regardless of contention, but the guarantee is considered satisfied if the cost is not \emph{too high} with or without contention.
+What constitutes an acceptable cost is obviously highly variable.
+For this document, the performance experimentation attempts to show the cost of scheduling is at worst equivalent to existing algorithms used in popular languages.
+This demonstration can be made by comparing applications built in \CFA to applications built with other languages or other models.
+Recall that the programmer's expectation is that the impact of the scheduler can be ignored.
+Therefore, if the cost of scheduling is competitive with other popular languages, the guarantee is considered achieved.
 More precisely the scheduler should be:
 \begin{itemize}
 …
 \subsection{Fairness Goals}
 For this work fairness will be considered as having two strongly related requirements: true starvation freedom and ``fast'' load balancing.
 \paragraph{True starvation freedom} is more easily defined: As long as at least one \proc continues to dequeue \ats, all read \ats should be able to run eventually.
 In any running system, \procs can stop dequeing \ats if they start running a \at that will simply never park.
 Traditional workstealing schedulers do not have starvation freedom in these cases.
+For this work, fairness is considered to have two strongly related requirements: true starvation freedom and ``fast'' load balancing.
+\paragraph{True starvation freedom} means as long as at least one \proc continues to dequeue \ats, all ready \ats should be able to run eventually, \ie, eventual progress.
+In any running system, a \proc can stop dequeuing \ats if it starts running a \at that never blocks.
+Without preemption, traditional work-stealing schedulers do not have starvation freedom in this case.
 This requirement raises the question: what about preemption?
 Generally speaking, preemption happens on the timescale of several milliseconds, which brings us to the next requirement: ``fast'' load balancing.
 \paragraph{Fast load balancing} means that load balancing should happen faster than preemption would normally allow.
 For interactive applications that need to run at 60, 90, 120 frames per second, \ats having to wait for several millseconds to run are effectively starved.
+For interactive applications that need to run at 60, 90, 120 frames per second, \ats having to wait for several milliseconds to run are effectively starved.
 Therefore load-balancing should be done at a faster pace, one that can detect starvation at the microsecond scale.
 With that said, this is a much fuzzier requirement since it depends on the number of \procs, the number of \ats and the general load of the system.
 \subsection{Fairness vs Scheduler Locality} \label{fairnessvlocal}
+An important performance factor in modern architectures is cache locality. Waiting for data at lower levels or not present in the cache can have a major impact on performance. Having multiple \glspl{hthrd} writing to the same cache lines also leads to cache lines that must be waited on. It is therefore preferable to divide data among each \gls{hthrd}\footnote{This partitioning can be an explicit division up front or using data structures where different \glspl{hthrd} are naturally routed to different cache lines.}.
+For a scheduler, having good locality\footnote{This section discusses \emph{internal locality}, \ie, the locality of the data used by the scheduler versus \emph{external locality}, \ie, how the data used by the application is affected by scheduling. External locality is a much more complicated subject and is discussed in the next section.}, \ie, having the data local to each \gls{hthrd}, generally conflicts with fairness. Indeed, good locality often requires avoiding the movement of cache lines, while fairness requires dynamically moving a \gls{thrd}, and as consequence cache lines, to a \gls{hthrd} that is currently available.
+However, I claim that in practice it is possible to strike a balance between fairness and performance because these goals do not necessarily overlap temporally, where Figure~\ref{fig:fair} shows a visual representation of this behaviour. As mentioned, some unfairness is acceptable; therefore it is desirable to have an algorithm that prioritizes cache locality as long as thread delay does not exceed the execution mental-model.
+An important performance factor in modern architectures is cache locality.
+Waiting for data at lower levels or not present in the cache can have a major impact on performance.
+Having multiple \glspl{hthrd} writing to the same cache lines also leads to cache lines that must be waited on.
+It is therefore preferable to divide data among each \gls{hthrd}\footnote{This partitioning can be an explicit division up front or using data structures where different \glspl{hthrd} are naturally routed to different cache lines.}.
+For a scheduler, having good locality, \ie, having the data local to each \gls{hthrd}, generally conflicts with fairness.
+Indeed, good locality often requires avoiding the movement of cache lines, while fairness requires dynamically moving a \gls{thrd}, and as a consequence cache lines, to a \gls{hthrd} that is currently available.
+Note that this section discusses \emph{internal locality}, \ie, the locality of the data used by the scheduler versus \emph{external locality}, \ie, how the data used by the application is affected by scheduling.
+External locality is a much more complicated subject and is discussed in the next section.
+However, I claim that in practice it is possible to strike a balance between fairness and performance because these goals do not necessarily overlap temporally.
+Figure~\ref{fig:fair} shows a visual representation of this behaviour.
+As mentioned, some unfairness is acceptable; therefore it is desirable to have an algorithm that prioritizes cache locality as long as thread delay does not exceed the execution mental-model.
 \begin{figure}
 …
         \input{fairness.pstex_t}
         \vspace*{-10pt}
+        \caption[Fairness vs Locality graph]{Rule of thumb Fairness vs Locality graph \smallskip\newline The importance of Fairness and Locality while a ready \gls{thrd} awaits running is shown as the time the ready \gls{thrd} waits increases, Ready Time, the chances that its data is still in cache, Locality, decreases. At the same time, the need for fairness increases since other \glspl{thrd} may have the chance to run many times, breaking the fairness model. Since the actual values and curves of this graph can be highly variable, the graph is an idealized representation of the two opposing goals.}
+        \caption[Fairness vs Locality graph]{Rule of thumb Fairness vs Locality graph \smallskip\newline The importance of Fairness and Locality while a ready \gls{thrd} awaits running: as the time the ready \gls{thrd} waits (Ready Time) increases, the chance that its data is still in cache (Locality) decreases.
+        At the same time, the need for fairness increases since other \glspl{thrd} may have the chance to run many times, breaking the fairness model.
+        Since the actual values and curves of this graph can be highly variable, the graph is an idealized representation of the two opposing goals.}
         \label{fig:fair}
 \end{figure}
 \subsection{Performance Challenges}\label{pref:challenge}
+While there exists a multitude of potential scheduling algorithms, they generally always have to contend with the same performance challenges. Since these challenges are recurring themes in the design of a scheduler it is relevant to describe the central ones here before looking at the design.
+While there exists a multitude of potential scheduling algorithms, they generally always have to contend with the same performance challenges.
+Since these challenges are recurring themes in the design of a scheduler, it is relevant to describe the central ones here before looking at the design.
 \subsubsection{Scalability}
 …
 Given a large number of \procs and an even larger number of \ats, scalability measures how fast \procs can enqueue and dequeue \ats.
 One could expect that doubling the number of \procs would double the rate at which \ats are dequeued, but contention on the internal data structure of the scheduler can lead to worse improvements.
 While the ready-queue itself can be sharded to alleviate the main source of contention, auxillary scheduling features, \eg counting ready \ats, can also be sources of contention.
+While the ready-queue itself can be sharded to alleviate the main source of contention, auxiliary scheduling features, \eg counting ready \ats, can also be sources of contention.
 \subsubsection{Migration Cost}
 Another important source of latency in scheduling is migration.
 An \at is said to have migrated if it is executed by two different \proc consecutively, which is the process discussed in \ref{fairnessvlocal}.
 Migrations can have many different causes, but it certain programs it can be all but impossible to limit migrations.
 Chapter~\ref{microbench} for example, has a benchmark where any \at can potentially unblock any other \at, which can leat to \ats migrating more often than not.
 Because of this it is important to design the internal data structures of the scheduler to limit the latency penalty from migrations.
+Another important source of scheduling latency is migration.
+A \at migrates if it executes on two different \procs consecutively, which is the process discussed in \ref{fairnessvlocal}.
+Migrations can have many different causes, but in certain programs, it can be impossible to limit migration.
+Chapter~\ref{microbench} has a benchmark where any \at can potentially unblock any other \at, which can lead to \ats migrating frequently.
+Hence, it is important to design the internal data structures of the scheduler to limit any latency penalty from migrations.
 \section{Inspirations}
+In general, a na\"{i}ve \glsxtrshort{fifo} ready-queue does not scale with increased parallelism from \glspl{hthrd}, resulting in decreased performance. The problem is adding/removing \glspl{thrd} is a single point of contention. As shown in the evaluation sections, most production schedulers do scale when adding \glspl{hthrd}. The solution to this problem is to shard the ready-queue : create multiple sub-ready-queues that multiple \glspl{hthrd} can access and modify without interfering.
+Before going into the design of \CFA's scheduler proper, it is relevant to discuss two sharding solutions which served as the inspiration scheduler in this thesis.
+In general, a na\"{i}ve \glsxtrshort{fifo} ready-queue does not scale with increased parallelism from \glspl{hthrd}, resulting in decreased performance.
+The problem is a single point of contention when adding/removing \ats.
+As shown in the evaluation sections, most production schedulers do scale when adding \glspl{hthrd}.
+The solution to this problem is to shard the ready-queue: create multiple \emph{subqueues} forming the logical ready-queue and the subqueues are accessed by multiple \glspl{hthrd} without interfering.
+Before going into the design of \CFA's scheduler, it is relevant to discuss two sharding solutions that served as the inspiration for the scheduler in this thesis.
 \subsection{Work-Stealing}
 As mentioned in \ref{existing:workstealing}, a popular pattern shard the ready-queue is work-stealing.
 In this pattern each \gls{proc} has its own local ready-queue and \glspl{proc} only access each other's ready-queue if they run out of work on their local ready-queue.
 The interesting aspect of workstealing happen in easier scheduling cases, \ie enough work for everyone but no more and no load balancing needed.
 In these cases, work-stealing is close to optimal scheduling: it can achieve perfect locality and have no contention.
+As mentioned in \ref{existing:workstealing}, a popular sharding approach for the ready-queue is work-stealing.
+In this approach, each \gls{proc} has its own local subqueue and \glspl{proc} only access each other's subqueue if they run out of work on their local subqueue.
+The interesting aspect of work stealing happens in the steady-state scheduling case, \ie all \glspl{proc} have work and no load balancing is needed.
+In this case, work stealing is close to optimal scheduling: it can achieve perfect locality and have no contention.
 On the other hand, work-stealing schedulers only attempt to do load-balancing when a \gls{proc} runs out of work.
 This means that the scheduler never balances unfair loads unless they result in a \gls{proc} running out of work.
+Chapter~\ref{microbench} shows that in pathological cases this problem can lead to indefinite starvation.
+Based on these observation, the conclusion is that a \emph{perfect} scheduler should behave very similarly to work-stealing in the easy cases, but should have more proactive load-balancing if the need arises.
+\subsection{Relaxed-Fifo}
+An entirely different scheme is to create a ``relaxed-FIFO'' queue as in \todo{cite Trevor's paper}. This approach forgos any ownership between \gls{proc} and ready-queue, and simply creates a pool of ready-queues from which the \glspl{proc} can pick from.
+\Glspl{proc} choose ready-queus at random, but timestamps are added to all elements of the queue and dequeues are done by picking two queues and dequeing the oldest element.
+All subqueues are protected by TryLocks and \procs simply pick a different subqueue if they fail to acquire the TryLock.
+The result is a queue that has both decent scalability and sufficient fairness.
+The lack of ownership means that as long as one \gls{proc} is still able to repeatedly dequeue elements, it is unlikely that any element will stay on the queue for much longer than any other element.
+This contrasts with work-stealing, where \emph{any} \gls{proc} busy for an extended period of time results in all the elements on its local queue to have to wait. Unless another \gls{proc} runs out of work.
+Chapter~\ref{microbench} shows that in pathological cases work stealing can lead to indefinite starvation.
+Based on these observations, the conclusion is that a \emph{perfect} scheduler should behave similarly to work-stealing in the steady-state case, but load balance proactively when the need arises.
+\subsection{Relaxed-FIFO}
+A different scheduling approach is to create a ``relaxed-FIFO'' queue, as in \todo{cite Trevor's paper}.
+This approach forgoes any ownership between \gls{proc} and subqueue, and simply creates a pool of subqueues from which \glspl{proc} pick.
+Scheduling is performed as follows:
+\begin{itemize}
+\item
+All subqueues are protected by TryLocks.
+\item
+Timestamps are added to each element of a subqueue.
+\item
+A \gls{proc} randomly tests ready queues until it has acquired one or two queues.
+\item
+If two queues are acquired, the older of the two \ats at the front of the acquired queues is dequeued.
+\item
+Otherwise, the \at at the front of the single queue is dequeued.
+\end{itemize}
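The steps above can be sketched in C as follows. This is only an illustrative sketch under stated assumptions, not the implementation from the cited work: the names \texttt{subqueue}, \texttt{sq\_push} and \texttt{relaxed\_pop} are hypothetical, the TryLock is modelled with a C11 \texttt{atomic\_flag}, and the two random indices are drawn by the caller so the selection logic can be read in isolation.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical types: one timestamped element and one lock-protected subqueue. */
typedef struct node { struct node *next; uint64_t ts; } node;
typedef struct { atomic_flag lock; node *head, *tail; } subqueue;

void sq_init(subqueue *q) { atomic_flag_clear(&q->lock); q->head = q->tail = NULL; }

/* TryLock: returns true only if the lock was free and is now held by the caller. */
bool sq_try_lock(subqueue *q) {
    return !atomic_flag_test_and_set_explicit(&q->lock, memory_order_acquire);
}
void sq_unlock(subqueue *q) { atomic_flag_clear_explicit(&q->lock, memory_order_release); }

/* Enqueue at the tail, stamping the element with the enqueue time.
   This sketch spins on failure; the scheme described above would instead
   pick a different subqueue. */
void sq_push(subqueue *q, node *n, uint64_t now) {
    while (!sq_try_lock(q)) {}
    n->ts = now;
    n->next = NULL;
    if (q->tail) q->tail->next = n; else q->head = n;
    q->tail = n;
    sq_unlock(q);
}

/* Dequeue given two randomly drawn indices i and j (drawn by the caller).
   Returns NULL if no lock was acquired or the acquired subqueues are empty,
   in which case the caller draws new indices and retries. */
node *relaxed_pop(subqueue *qs, unsigned i, unsigned j) {
    bool li = sq_try_lock(&qs[i]);
    bool lj = sq_try_lock(&qs[j]);   /* fails when j == i, leaving a single queue */
    subqueue *pick = NULL;
    if (li && qs[i].head) pick = &qs[i];
    if (lj && qs[j].head && (!pick || qs[j].head->ts < pick->head->ts)) pick = &qs[j];
    node *n = NULL;
    if (pick) {                       /* remove the older front element */
        n = pick->head;
        pick->head = n->next;
        if (!pick->head) pick->tail = NULL;
    }
    if (lj) sq_unlock(&qs[j]);
    if (li) sq_unlock(&qs[i]);
    return n;
}
```

Passing the same index twice degrades naturally to the single-queue case, because the second TryLock fails on the lock the caller already holds.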
+The result is a queue that has both good scalability and sufficient fairness.
+The lack of ownership ensures that as long as one \gls{proc} is still able to repeatedly dequeue elements, it is unlikely any element will delay longer than any other element.
+This guarantee contrasts with work-stealing, where a \gls{proc} with a long subqueue results in unfairness for its \ats in comparison to a \gls{proc} with a short subqueue.
+This unfairness persists until a \gls{proc} runs out of work and steals.
 An important aspect of this scheme's fairness approach is that the timestamps make it possible to evaluate how long elements have been on the queue.
 However, another major aspect is that \glspl{proc} will eagerly search for these older elements instead of focusing on specific queues.
 While the fairness, of this scheme is good, it does suffer in terms of performance.
 It requires very wide sharding, \eg at least 4 queues per \gls{hthrd}, and finding non-empty queues can be difficult if there are too few ready \ats.
+However, \glspl{proc} eagerly search for these older elements instead of focusing on specific queues, which negatively affects locality.
+While this scheme has good fairness, its performance suffers.
+It requires wide sharding, \eg at least 4 queues per \gls{hthrd}, and finding non-empty queues is difficult when there are few ready \ats.
 \section{Relaxed-FIFO++}
+Since it has inherent fairness quelities and decent performance in the presence of many \ats, the relaxed-FIFO queue appears as a good candidate to form the basis of a scheduler.
+The most obvious problems is for workloads where the number of \ats is barely greater than the number of \procs.
+In these situations, the wide sharding means most of the sub-queues from which the relaxed queue is formed will be empty.
+The consequence is that when a dequeue operations attempts to pick a sub-queue at random, it is likely that it picks an empty sub-queue and will have to pick again.
+This problem can repeat an unbounded number of times.
+The inherent fairness and good performance with many \ats make the relaxed-FIFO queue a good candidate to form the basis of a new scheduler.
+The problem case is workloads where the number of \ats is barely greater than the number of \procs.
+In these situations, the wide sharding of the ready queue means most of its subqueues are empty.
+Furthermore, the non-empty subqueues are unlikely to hold more than one item.
+The consequence is that a random dequeue operation is likely to pick an empty subqueue, resulting in an unbounded number of selections.
+This state is generally unstable: each subqueue is likely to frequently toggle between being empty and nonempty.
+Indeed, when the number of \ats is \emph{equal} to the number of \procs, every pop operation is expected to empty a subqueue and every push is expected to add to an empty subqueue.
+In the worst case, a check of the subqueues sees all are empty or full.
 As this is the most obvious challenge, it is worth addressing first.
+The obvious solution is to supplement each subqueue with some sharded data structure that keeps track of which subqueues are empty.
+This data structure can take many forms, for example simple bitmask or a binary tree that tracks which branch are empty.
+Following a binary tree on each pick has fairly good Big O complexity and many modern architectures have powerful bitmask manipulation instructions.
+However, precisely tracking which sub-queues are empty is actually fundamentally problematic.
+The reason is that each subqueues are already a form of sharding and the sharding width has presumably already chosen to avoid contention.
+However, tracking which ready queue is empty is only useful if the tracking mechanism uses denser sharding than the sub queues, then it will invariably create a new source of contention.
+But if the tracking mechanism is not denser than the sub-queues, then it will generally not provide useful because reading this new data structure risks being as costly as simply picking a sub-queue at random.
+Early experiments with this approach have shown that even with low success rates, randomly picking a sub-queue can be faster than a simple tree walk.
+The obvious solution is to supplement each sharded subqueue with data that indicates if the queue is empty/nonempty to simplify finding nonempty queues, \ie ready \glspl{at}.
+This sharded data can be organized in different forms, \eg a bitmask or a binary tree that tracks the nonempty subqueues.
+Specifically, many modern architectures have powerful bitmask manipulation instructions, and searching a binary tree has good Big-O complexity.
+However, precisely tracking nonempty subqueues is problematic.
+The reason is that the subqueues are initially sharded with a width presumably chosen to avoid contention.
+However, tracking which ready queue is nonempty is only useful if the tracking data is dense, \ie denser than the sharded subqueues.
+Otherwise, it does not provide useful information because reading this new data structure risks being as costly as simply picking a subqueue at random.
+But if the tracking mechanism \emph{is} denser than the sharded subqueues, then constant updates invariably create a new source of contention.
+Early experiments with this approach showed that randomly picking, even with low success rates, is often faster than bit manipulations or tree walks.
 The exception to this rule is using local tracking.
 If each \proc keeps track locally of which sub-queue is empty, then this can be done with a very dense data structure without introducing a new source of contention.
 The consequence of local tracking however, is that the information is not complete.
 Each \proc is only aware of the last state it saw each subqueues but does not have any information about freshness.
 Even on systems with low \gls{hthrd} count, \eg 4 or 8, this can quickly lead to the local information being no better than the random pick.
 This is due in part to the cost of this maintaining this information and its poor quality.
 However, using a very low cost approach to local tracking may actually be beneficial.
 If the local tracking is no more costly than the random pick, than \emph{any} improvement to the succes rate, however low it is, would lead to a performance benefits.
 This leads to the following approach:
+If each \proc locally keeps track of empty subqueues, then this can be done with a very dense data structure without introducing a new source of contention.
+However, the consequence of local tracking is that the information is incomplete.
+Each \proc is only aware of the last state it saw about each subqueue so this information quickly becomes stale.
+Even on systems with low \gls{hthrd} count, \eg 4 or 8, this approach can quickly lead to the local information being no better than the random pick.
+This result is due in part to the cost of maintaining information and its poor quality.
+However, using a very low cost but inaccurate approach for local tracking can actually be beneficial.
+If the local tracking is no more costly than a random pick, then \emph{any} improvement to the success rate, however low it is, leads to a performance benefit.
+This suggests the following approach:
 \subsection{Dynamic Entropy}\cit{https://xkcd.com/2318/}
 The Relaxed-FIFO approach can be made to handle the case of mostly empty sub-queues by tweaking the \glsxtrlong{prng}.
 The \glsxtrshort{prng} state can be seen as containing a list of all the future sub-queues that will be accessed.
 While this is not particularly useful on its own, the consequence is that if the \glsxtrshort{prng} algorithm can be run \emph{backwards}, then the state also contains a list of all the subqueues that were accessed.
 Luckily, bidirectional \glsxtrshort{prng} algorithms do exist, for example some Linear Congruential Generators\cit{https://en.wikipedia.org/wiki/Linear\_congruential\_generator} support running the algorithm backwards while offering good quality and performance.
+The Relaxed-FIFO approach can be made to handle the case of mostly empty subqueues by tweaking the \glsxtrlong{prng}.
+The \glsxtrshort{prng} state can be seen as containing a list of all the future subqueues that will be accessed.
+While this concept is not particularly useful on its own, the consequence is that if the \glsxtrshort{prng} algorithm can be run \emph{backwards}, then the state also contains a list of all the subqueues that were accessed.
+Luckily, bidirectional \glsxtrshort{prng} algorithms do exist, \eg some Linear Congruential Generators\cit{https://en.wikipedia.org/wiki/Linear\_congruential\_generator} support running the algorithm backwards while offering good quality and performance.
 This particular \glsxtrshort{prng} can be used as follows:
+Each \proc maintains two \glsxtrshort{prng} states, which whill be refered to as \texttt{F} and \texttt{B}.
+When a \proc attempts to dequeue a \at, it picks the subqueues by running the \texttt{B} backwards.
+When a \proc attempts to enqueue a \at, it runs \texttt{F} forward to pick to subqueue to enqueue to.
+If the enqueue is successful, the state \texttt{B} is overwritten with the content of \texttt{F}.
+The result is that each \proc will tend to dequeue \ats that it has itself enqueued.
+When most sub-queues are empty, this technique increases the odds of finding \ats at very low cost, while also offering an improvement on locality in many cases.
+However, while this approach does notably improve performance in many cases, this algorithm is still not competitive with work-stealing algorithms.
+\begin{itemize}
+\item
+Each \proc maintains two \glsxtrshort{prng} states, referred to as $F$ and $B$.
+\item
+When a \proc attempts to dequeue a \at, it picks a subqueue by running $B$ backwards.
+\item
+When a \proc attempts to enqueue a \at, it runs $F$ forward picking a subqueue to enqueue to.
+If the enqueue is successful, the state $B$ is overwritten with the content of $F$.
+\end{itemize}
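As an illustration of the $F$/$B$ scheme above, the following C sketch uses a 64-bit Linear Congruential Generator, which can be run backwards because an odd multiplier is invertible modulo $2^{64}$. The constants (Knuth's MMIX multiplier and increment), the names \texttt{proc\_rng}, \texttt{pick\_enqueue} and \texttt{pick\_dequeue}, and the use of the upper state bits to choose a subqueue are all assumptions made for this example; the thesis does not specify them. The sketch also assumes every enqueue succeeds, so $B$ is always overwritten with $F$.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed constants: Knuth's MMIX LCG parameters, modulus 2^64. */
#define LCG_A 6364136223846793005ULL
#define LCG_C 1442695040888963407ULL

/* Multiplicative inverse of an odd value modulo 2^64, by Newton iteration. */
uint64_t inverse_mod_2_64(uint64_t a) {
    assert(a & 1);              /* only odd values are invertible modulo 2^64 */
    uint64_t x = a;             /* a*a == 1 (mod 8), so x is correct to 3 bits */
    for (int i = 0; i < 5; i++)
        x *= 2 - a * x;         /* each step doubles the number of correct bits */
    return x;
}

/* One step forward: s' = a*s + c (mod 2^64). */
uint64_t lcg_forward(uint64_t s) { return LCG_A * s + LCG_C; }

/* One step backward: s = a^{-1} * (s' - c) (mod 2^64). */
uint64_t lcg_backward(uint64_t s) { return inverse_mod_2_64(LCG_A) * (s - LCG_C); }

/* Per-processor generator states F (forward) and B (backward). */
typedef struct { uint64_t F, B; } proc_rng;

/* Enqueue side: run F forward, derive a subqueue index, then copy F into B.
   The copy is unconditional because this sketch assumes the enqueue succeeds. */
unsigned pick_enqueue(proc_rng *r, unsigned nqueues) {
    r->F = lcg_forward(r->F);
    unsigned idx = (unsigned)((r->F >> 32) % nqueues);  /* upper bits are higher quality */
    r->B = r->F;
    return idx;
}

/* Dequeue side: derive an index from B, then run B one step backwards,
   so successive dequeues revisit the most recent enqueue targets in reverse. */
unsigned pick_dequeue(proc_rng *r, unsigned nqueues) {
    unsigned idx = (unsigned)((r->B >> 32) % nqueues);
    r->B = lcg_backward(r->B);
    return idx;
}
```

After three enqueues, three dequeues visit the same three subqueues in reverse order, which is the property that lets a \proc tend to find the \ats it enqueued itself.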
+The result is that each \proc tends to dequeue \ats that it has itself enqueued.
+When most subqueues are empty, this technique increases the odds of finding \ats at very low cost, while also offering an improvement on locality in many cases.
+Tests showed this approach performs better than relaxed-FIFO in many cases.
+However, it is still not competitive with work-stealing algorithms.
 The fundamental problem is that the constant randomness limits how much locality the scheduler offers.
+This becomes problematic both because the scheduler is likely to get cache misses on internal data-structures and because migration become very frequent.
+Therefore since the approach of modifying to relaxed-FIFO algorithm to behave more like work stealing does not seem to pan out, the alternative is to do it the other way around.
+This becomes problematic both because the scheduler is likely to get cache misses on internal data-structures and because migrations become frequent.
+Therefore, the attempt to modify the relaxed-FIFO algorithm to behave more like work stealing did not pan out.
+The alternative is to do it the other way around.
 \section{Work Stealing++}
 To add stronger fairness guarantees to workstealing a few changes.
+To add stronger fairness guarantees to work stealing, a few changes are needed.
 First, the relaxed-FIFO algorithm has fundamentally better fairness because each \proc always monitors all subqueues.
+Therefore the workstealing algorithm must be prepended with some monitoring.
+Before attempting to dequeue from a \proc's local queue, the \proc must make some effort to make sure remote queues are not being neglected.
+To make this possible, \procs must be able to determie which \at has been on the ready-queue the longest.
+Which is the second aspect that much be added.
+The relaxed-FIFO approach uses timestamps for each \at and this is also what is done here.
+Therefore, the work-stealing algorithm must be prepended with some monitoring.
+Before attempting to dequeue from a \proc's subqueue, the \proc must make some effort to ensure other subqueues are not being neglected.
+To make this possible, \procs must be able to determine which \at has been on the ready queue the longest.
+Second, the relaxed-FIFO approach needs timestamps for each \at to make this possible.
 \begin{figure}
         \centering
         \input{base.pstex_t}
+        \caption[Base \CFA design]{Base \CFA design \smallskip\newline A Pool of sub-ready queues offers the sharding, two per \glspl{proc}. Each \gls{proc} have local subqueues, however \glspl{proc} can access any of the sub-queues. Each \at is timestamped when enqueued.}
+        \caption[Base \CFA design]{Base \CFA design \smallskip\newline A pool of subqueues offers the sharding, two per \glspl{proc}.
+        Each \gls{proc} can access all of the subqueues.
+        Each \at is timestamped when enqueued.}
         \label{fig:base}
 \end{figure}
+The algorithm is structure as shown in Figure~\ref{fig:base}.
+This is very similar to classic workstealing except the local queues are placed in an array so \procs can access eachother's queue in constant time.
+Sharding width can be adjusted based on need.
+When a \proc attempts to dequeue a \at, it first picks a random remote queue and compares its timestamp to the timestamps of the local queue(s), dequeue from the remote queue if needed.
+Implemented as as naively state above, this approach has some obvious performance problems.
+Figure~\ref{fig:base} shows the algorithm structure.
+This structure is similar to classic work-stealing except the subqueues are placed in an array so \procs can access them in constant time.
+Sharding width can be adjusted based on contention.
+Note, as an optimization, the TS of a \at is stored in the \at in front of it, so the first TS is in the array and the last \at has no TS.
+This organization keeps the highly accessed front TSs directly in the array.
+When a \proc attempts to dequeue a \at, it first picks a random remote subqueue and compares its timestamp to the timestamps of its local subqueue(s).
+The oldest waiting \at is dequeued to provide global fairness.
+However, this na\"ive implementation has performance problems.
 First, it is necessary to have some damping effect on helping.
 Random effects like cache misses and preemption can add spurious but short bursts of latency for which helping is not helpful, pun intended.
 The effect of these bursts would be to cause more migrations than needed and make this workstealing approach slowdown to the match the relaxed-FIFO approach.
+Random effects like cache misses and preemption can add spurious but short bursts of latency negating the attempt to help.
+These bursts can cause increased migrations and make this work-stealing approach slow down to the level of relaxed-FIFO.
 \begin{figure}
 …
 \end{figure}
 A simple solution to this problem is to compare an exponential moving average\cit{https://en.wikipedia.org/wiki/Moving\_average\#Exponential\_moving\_average} instead if the raw timestamps, shown in Figure~\ref{fig:base-ma}.
 Note that this is slightly more complex than it sounds because since the \at at the head of a subqueue is still waiting, its wait time has not ended.
 Therefore the exponential moving average is actually an exponential moving average of how long each already dequeued \at have waited.
 To compare subqueues, the timestamp at the head must be compared to the current time, yielding the bestcase wait time for the \at at the head of the queue.
+A simple solution to this problem is to use an exponential moving average\cit{https://en.wikipedia.org/wiki/Moving\_average\#Exponential\_moving\_average} (MA) instead of raw timestamps, shown in Figure~\ref{fig:base-ma}.
+Note, this is more complex because the \at at the head of a subqueue is still waiting, so its wait time has not ended.
+Therefore, the exponential moving average is actually an exponential moving average of how long each dequeued \at has waited.
+To compare subqueues, the timestamp at the head must be compared to the current time, yielding the best-case wait-time for the \at at the head of the queue.
 This new wait time is averaged with the stored average.
+To limit even more the amount of unnecessary migration, a bias can be added to the local queue, where a remote queue is helped only if its moving average is more than \emph{X} times the local queue's average.
+None of the experimentation that I have run with these scheduler seem to indicate that the choice of the weight for the moving average or the choice of bis is particularly important.
+Weigths and biases of similar \emph{magnitudes} have similar effects.
+With these additions to workstealing, scheduling can be made as fair as the relaxed-FIFO approach, well avoiding the majority of unnecessary migrations.
+Unfortunately, the performance of this approach does suffer in the cases with no risks of starvation.
+The problem is that the constant polling of remote subqueues generally entail a cache miss.
+To make things worst, remote subqueues that are very active, \ie \ats are frequently enqueued and dequeued from them, the higher the chances are that polling will incurr a cache-miss.
+Conversly, the active subqueues do not benefit much from helping since starvation is already a non-issue.
+This puts this algorithm in an akward situation where it is paying for a cost, but the cost itself suggests the operation was unnecessary.
+To further limit migration, a bias can be added to a local subqueue, where a remote subqueue is helped only if its moving average is more than $X$ times the local subqueue's average.
+Tests for this approach indicate the choice of the weight for the moving average or the bias is not important, \ie weights and biases of similar \emph{magnitudes} have similar effects.
+With these additions to work stealing, scheduling can be made as fair as the relaxed-FIFO approach, avoiding the majority of unnecessary migrations.
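A minimal sketch of this helping decision is given below, under assumptions that are not taken from the thesis: the names \texttt{sq\_stats}, \texttt{ema\_update}, \texttt{ema\_estimate} and \texttt{should\_help} are hypothetical, the moving-average weight is fixed at $1/4$ via a shift, the bias $X$ is fixed at 4, and both subqueues are assumed to be nonempty so that a head timestamp exists.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-subqueue data: timestamp of the front thread and the
   moving average of how long already-dequeued threads waited. */
typedef struct { uint64_t head_ts; uint64_t avg_wait; } sq_stats;

/* Assumed tuning values: a new sample has weight 1/2^EMA_SHIFT, and a remote
   subqueue must look HELP_BIAS times worse than the local one to be helped. */
enum { EMA_SHIFT = 2, HELP_BIAS = 4 };

/* Blend one wait-time sample into an average: avg*(1 - w) + sample*w. */
static uint64_t ema_blend(uint64_t avg, uint64_t sample) {
    return avg - (avg >> EMA_SHIFT) + (sample >> EMA_SHIFT);
}

/* On dequeue: fold the completed wait time of the dequeued thread into the average. */
void ema_update(sq_stats *s, uint64_t enq_ts, uint64_t now) {
    s->avg_wait = ema_blend(s->avg_wait, now - enq_ts);
}

/* On comparison: the front thread is still waiting, so its best-case wait
   (now - head_ts) is averaged with the stored average without being stored. */
uint64_t ema_estimate(const sq_stats *s, uint64_t now) {
    return ema_blend(s->avg_wait, now - s->head_ts);
}

/* Help the remote subqueue only if its estimate exceeds the biased local estimate. */
bool should_help(const sq_stats *local, const sq_stats *remote, uint64_t now) {
    return ema_estimate(remote, now) > HELP_BIAS * ema_estimate(local, now);
}
```

With identical statistics on both sides, the bias keeps the \proc on its local subqueue; only a remote subqueue whose front \at has waited disproportionately long triggers helping.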
+Unfortunately, the work to achieve fairness has a performance cost, especially when the workload is inherently fair, and hence, there is only short-term or no starvation.
+The problem is that the constant polling, \ie reads, of remote subqueues generally entails a cache miss because the TSs are constantly being updated, \ie, writes.
+To make things worse, remote subqueues that are very active, \ie \ats are frequently enqueued and dequeued from them, lead to higher chances that polling will incur a cache-miss.
+Conversely, the active subqueues do not benefit much from helping since starvation is already a non-issue.
+This puts this algorithm in the awkward situation of paying for a cost that is largely unnecessary.
 The good news is that this problem can be mitigated.
 \subsection{Redundant Timestamps}
+The problem with polling remote queues is due to a tension between the consistency requirement on the subqueue.
+For the subqueues, correctness is critical. There must be a consensus among \procs on which subqueues hold which \ats.
+Since the timestamps are use for fairness, it is alco important to have consensus and which \at is the oldest.
+However, when deciding if a remote subqueue is worth polling, correctness is much less of a problem.
+Since the only need is that a subqueue will eventually be polled, some data staleness can be acceptable.
+This leads to a tension where stale timestamps are only problematic in some cases.
+Furthermore, stale timestamps can be somewhat desirable since lower freshness requirements means less tension on the cache coherence protocol.
+\begin{figure}
+        \centering
+        % \input{base_ts2.pstex_t}
+        \caption[\CFA design with Redundant Timestamps]{\CFA design with Redundant Timestamps \smallskip\newline A array is added containing a copy of the timestamps. These timestamps are written to with relaxed atomics, without fencing, leading to fewer cache invalidations.}
+        \label{fig:base-ts2}
+\end{figure}
+A solution to this is to create a second array containing a copy of the timestamps and average.
+The problem with polling remote subqueues is that correctness is critical.
+There must be a consensus among \procs on which subqueues hold which \ats, as the \ats are in constant motion.
+Furthermore, since timestamps are used for fairness, it is critical to have consensus on which \at is the oldest.
+However, when deciding if a remote subqueue is worth polling, correctness is less of a problem.
+Since the only requirement is that a subqueue is eventually polled, some data staleness is acceptable.
+This leads to a situation where stale timestamps are only problematic in some cases.
+Furthermore, stale timestamps can be desirable since lower freshness requirements mean fewer cache invalidations.
+Figure~\ref{fig:base-ts2} shows a solution with a second array containing a copy of the timestamps and average.
 This copy is updated \emph{after} the subqueue's critical sections using relaxed atomics.
 \Glspl{proc} now check if polling is needed by comparing the copy of the remote timestamp instead of the actual timestamp.
+The result is that since there is no fencing, the writes can be buffered and cause fewer cache invalidations.
+The correctness argument here is somewhat subtle.
+The result is that since there is no fencing, the writes can be buffered in the hardware and cause fewer cache invalidations.
+\begin{figure}
+        \centering
+        \input{base_ts2.pstex_t}
+        \caption[\CFA design with Redundant Timestamps]{\CFA design with Redundant Timestamps \smallskip\newline An array is added containing a copy of the timestamps.
+        These timestamps are written to with relaxed atomics, so there is no order among concurrent memory accesses, leading to fewer cache invalidations.}
+        \label{fig:base-ts2}
+\end{figure}
+The correctness argument is somewhat subtle.
 The data used for deciding whether or not to poll a queue can be stale as long as it does not cause starvation.
+Therefore, it is acceptable if stale data make queues appear older than they really are but not fresher.
+For the timestamps, this means that missing writes to the timestamp is acceptable since they will make the head \at look older.
+For the moving average, as long as the operation are RW-safe, the average is guaranteed to yield a value that is between the oldest and newest values written.
+Therefore this unprotected read of the timestamp and average satisfy the limited correctness that is required.
+Therefore, it is acceptable if stale data makes queues appear older than they really are but appearing fresher can be a problem.
+For the timestamps, this means missing writes to the timestamp are acceptable since they make the head \at look older.
+For the moving average, as long as the operations are just atomic reads/writes, the average is guaranteed to yield a value that is between the oldest and newest values written.
+Therefore, this unprotected read of the timestamp and average satisfies the limited correctness that is required.
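The redundant-timestamp idea can be sketched with C11 atomics as follows. This is only an illustration under stated assumptions: the array size `NQUEUES` and the functions `publish_ts` and `looks_older` are hypothetical names, timestamps are assumed to increase monotonically so a smaller value means an older head \at, and the moving average is omitted.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define NQUEUES 4  /* hypothetical number of subqueues */

/* Second array holding a copy of each subqueue's head timestamp. */
static _Atomic uint64_t ts_copy[NQUEUES];

/* Called by the owner after leaving the subqueue's critical section.
 * The relaxed store imposes no ordering, so the hardware may buffer it. */
void publish_ts(unsigned q, uint64_t head_ts) {
    assert(q < NQUEUES);
    atomic_store_explicit(&ts_copy[q], head_ts, memory_order_relaxed);
}

/* Called by any processor deciding whether subqueue q is worth polling.
 * A stale copy holds an earlier, smaller timestamp, so staleness can only
 * make q look older, never fresher. */
bool looks_older(unsigned q, uint64_t local_head_ts) {
    assert(q < NQUEUES);
    uint64_t remote = atomic_load_explicit(&ts_copy[q], memory_order_relaxed);
    return remote < local_head_ts;
}
```

The actual timestamp inside the subqueue is still read under the subqueue's own synchronization once polling is attempted; the copy only gates the decision to poll.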
+With redundant timestamps, this scheduling algorithm achieves both the fairness and performance requirements on most machines.
+The problem is that the cost of polling and helping is not necessarily consistent across each \gls{hthrd}.
+For example, on machines with a CPU containing multiple hyperthreads and cores and multiple CPU sockets, cache misses can be satisfied from the caches on the same (local) CPU, or by a CPU on a different (remote) socket.
+Cache misses satisfied by a remote CPU have significantly higher latency than from the local CPU.
+However, these delays are not specific to systems with multiple CPUs.
+Depending on the cache structure, cache misses can have different latency on the same CPU, \eg the AMD EPYC 7662 CPUs used in Chapter~\ref{microbench}.
 \begin{figure}
         \centering
         \input{cache-share.pstex_t}
         \caption[CPU design with wide L3 sharing]{CPU design with wide L3 sharing \smallskip\newline A very simple CPU with 4 \glspl{hthrd}. L1 and L2 are private to each \gls{hthrd} but the L3 is shared across to entire core.}
+        \caption[CPU design with wide L3 sharing]{CPU design with wide L3 sharing \smallskip\newline A CPU with 4 cores, where caches L1 and L2 are private to each core, and the L3 cache is shared across all cores.}
         \label{fig:cache-share}
+\end{figure}
+\begin{figure}
+        \centering
+        \vspace{25pt}
         \input{cache-noshare.pstex_t}
         \caption[CPU design with a narrower L3 sharing]{CPU design with a narrower L3 sharing \smallskip\newline A different CPU design, still with 4 \glspl{hthrd}. L1 and L2 are still private to each \gls{hthrd} but the L3 is shared some of the CPU but there is still two distinct L3 instances.}
+        \caption[CPU design with a narrower L3 sharing]{CPU design with a narrow L3 sharing \smallskip\newline A CPU with 4 cores, where caches L1 and L2 are private to each core, and the L3 cache is shared across a pair of cores.}
         \label{fig:cache-noshare}
 \end{figure}
+With redundant tiemstamps this scheduling algorithm achieves both the fairness and performance requirements, on some machines.
+The problem is that the cost of polling and helping is not necessarily consistent across each \gls{hthrd}.
+For example, on machines where the motherboard holds multiple CPU, cache misses can be satisfied from a cache that belongs to the CPU that missed, the \emph{local} CPU, or by a different CPU, a \emph{remote} one.
+Cache misses that are satisfied by a remote CPU will have higher latency than if it is satisfied by the local CPU.
+However, this is not specific to systems with multiple CPUs.
+Depending on the cache structure, cache-misses can have different latency for the same CPU.
+The AMD EPYC 7662 CPUs that is described in Chapter~\ref{microbench} is an example of that.
+Figure~\ref{fig:cache-share} and Figure~\ref{fig:cache-noshare} show two different cache topologies with highlight this difference.
+In Figure~\ref{fig:cache-share}, all cache instances are either private to a \gls{hthrd} or shared to the entire system, this means latency due to cache-misses are likely fairly consistent.
+By comparison, in Figure~\ref{fig:cache-noshare} misses in the L2 cache can be satisfied by a hit in either instance of the L3.
+However, the memory access latency to the remote L3 instance will be notably higher than the memory access latency to the local L3.
+The impact of these different design on this algorithm is that scheduling will scale very well on architectures similar to Figure~\ref{fig:cache-share}, both will have notably worst scalling with many narrower L3 instances.
+This is simply because as the number of L3 instances grow, so two does the chances that the random helping will cause significant latency.
+The solution is to have the scheduler be aware of the cache topology.
+Figures~\ref{fig:cache-share} and~\ref{fig:cache-noshare} show two different cache topologies that highlight this difference.
+In Figure~\ref{fig:cache-share}, all caches are either private to a core or shared across all cores.
+This means latency due to cache misses is fairly consistent.
+In contrast, in Figure~\ref{fig:cache-noshare} misses in the L2 cache can be satisfied by either instance of L3 cache.
+However, the memory-access latency to the remote L3 is higher than the memory-access latency to the local L3.
+The impact of these different designs on this algorithm is that scheduling only scales well on architectures with a wide L3 cache, similar to Figure~\ref{fig:cache-share}, and less well on architectures with many narrower L3 cache instances, similar to Figure~\ref{fig:cache-noshare}.
+Hence, as the number of L3 instances grows, so too does the chance that the random helping causes significant cache latency.
+The solution is for the scheduler to be aware of the cache topology.
 \subsection{Per CPU Sharding}
+Building a scheduler that is aware of cache topology poses two main challenges: discovering cache topology and matching \procs to cache instance.
+Sadly, there is no standard portable way to discover cache topology in C.
+Therefore, while this is a significant portability challenge, it is outside the scope of this thesis to design a cross-platform cache discovery mechanisms.
+The rest of this work assumes discovering the cache topology based on Linux's \texttt{/sys/devices/system/cpu} directory.
+This leaves the challenge of matching \procs to cache instance, or more precisely identifying which subqueues of the ready queue are local to which cache instance.
+Once this matching is available, the helping algorithm can be changed to add bias so that \procs more often help subqueues local to the same cache instance
+\footnote{Note that like other biases mentioned in this section, the actual bias value does not appear to need precise tuinng.}.
+The obvious approach to mapping cache instances to subqueues is to statically tie subqueues to CPUs.
+Instead of having each subqueue local to a specific \proc, the system is initialized with subqueues for each \glspl{hthrd} up front.
+Then \procs dequeue and enqueue by first asking which CPU id they are local to, in order to identify which subqueues are the local ones.
+\Glspl{proc} can get the CPU id from \texttt{sched\_getcpu} or \texttt{librseq}.
+This approach solves the performance problems on systems with topologies similar to Figure~\ref{fig:cache-noshare}.
+However, it actually causes some subtle fairness problems in some systems, specifically systems with few \procs and many \glspl{hthrd}.
+In these cases, the large number of subqueues and the bias agains subqueues tied to different cache instances make it so it is very unlikely any single subqueue is picked.
+To make things worst, the small number of \procs mean that few helping attempts will be made.
+This combination of few attempts and low chances make it so a \at stranded on a subqueue that is not actively dequeued from may wait very long before it gets randomly helped.
+Building a scheduler that is cache aware poses two main challenges: discovering the cache topology and matching \procs to this cache structure.
+Unfortunately, there is no portable way to discover cache topology, and it is outside the scope of this thesis to solve this problem.
+This work uses the cache topology information from Linux's @/sys/devices/system/cpu@ directory.
+This leaves the challenge of matching \procs to cache structure, or more precisely identifying which subqueues of the ready queue are local to which subcomponents of the cache structure.
+Once a matching is generated, the helping algorithm is changed to add bias so that \procs more often help subqueues local to the same cache substructure.\footnote{
+Note that like other biases mentioned in this section, the actual bias value does not appear to need precise tuning.}
+The simplest approach for mapping subqueues to cache structure is to statically tie subqueues to CPUs.
+Instead of having each subqueue local to a specific \proc, the system is initialized with subqueues for each hardware hyperthread/core up front.
+Then \procs dequeue and enqueue by first asking which CPU id they are executing on, in order to identify which subqueues are the local ones.
+\Glspl{proc} can get the CPU id from @sched_getcpu@ or @librseq@.
+This approach solves the performance problems on systems with narrow L3 caches, similar to Figure~\ref{fig:cache-noshare}.
+However, it can still cause some subtle fairness problems in systems with few \procs and many \glspl{hthrd}.
+In this case, the large number of subqueues and the bias against subqueues tied to different cache substructures make it unlikely that any particular subqueue is picked.
+To make things worse, the small number of \procs means that few helping attempts are made.
+This combination of low selection and few helping attempts allows a \at to become stranded on a subqueue for a long time until it gets randomly helped.
 On a system with 2 \procs, 256 \glspl{hthrd} with narrow cache sharing, and a 100:1 bias, it can actually take multiple seconds for a \at to get dequeued from a remote queue.
 Therefore, a more dynamic matching of subqueues to cache instance is needed.
 \subsection{Topological Work Stealing}
+The approach that is used in the \CFA scheduler is to have per-\proc subqueue, but have an excplicit data-structure track which cache instance each subqueue is tied to.
+This is requires some finess because reading this data structure must lead to fewer cache misses than not having the data structure in the first place.
+\label{s:TopologicalWorkStealing}
+Therefore, the approach used in the \CFA scheduler is to have per-\proc subqueues, but have an explicit data-structure track which cache substructure each subqueue is tied to.
+This tracking requires some finesse because reading this data structure must lead to fewer cache misses than not having the data structure in the first place.
 A key element however is that, like the timestamps for helping, reading the cache instance mapping only needs to give the correct result \emph{often enough}.
 Therefore the algorithm can be built as follows: Before enqueuing or dequeing a \at, each \proc queries the CPU id and the corresponding cache instance.
+Therefore the algorithm can be built as follows: before enqueueing or dequeuing a \at, each \proc queries the CPU id and the corresponding cache instance.
 Since subqueues are tied to \procs, each \proc can then update the cache instance mapped to the local subqueue(s).
 To avoid unnecessary cache line invalidation, the map is only written to if the mapping changes.
+This scheduler is used in the remainder of the thesis for managing CPU execution, but additional scheduling is needed to handle long-term blocking and unblocking, such as I/O.
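The write-only-on-change rule for the cache-instance map can be sketched in C as below. The names `cache_of`, `NSUBQUEUES` and `update_mapping` are hypothetical. The cache id passed in is assumed to have been derived beforehand from the CPU id (\eg from @sched_getcpu@) and a CPU-to-cache table built from the Linux topology information; that lookup is not shown.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NSUBQUEUES 8  /* hypothetical number of subqueues */

/* Map from subqueue index to the cache instance it was last used from. */
static _Atomic int cache_of[NSUBQUEUES];

/* Called by the owning processor before enqueueing or dequeuing.
 * Returns true if the map was written, false if it was already up to date.
 * Skipping the store when nothing changed keeps the cache line shared
 * among readers and avoids invalidating their copies. */
bool update_mapping(unsigned q, int cache_id) {
    assert(q < NSUBQUEUES);
    if (atomic_load_explicit(&cache_of[q], memory_order_relaxed) == cache_id)
        return false;
    atomic_store_explicit(&cache_of[q], cache_id, memory_order_relaxed);
    return true;
}
```

Relaxed accesses are used here on the assumption, stated in the text, that the mapping only needs to be correct often enough.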

doc/theses/thierry_delisle_PhD/thesis/text/eval_micro.tex

-              r9e23b446
+              rffec1bf
 \chapter{Micro-Benchmarks}\label{microbench}
 The first step of evaluation is always to test-out small controlled cases, to ensure that the basics are working properly.
 This sections presents five different experimental setup, evaluating some of the basic features of \CFA's scheduler.
+The first step in evaluating this work is to test out small controlled cases to ensure the basics work properly.
+This chapter presents five different experimental setups, evaluating some of the basic features of \CFA's scheduler.
 \section{Benchmark Environment}
+All of these benchmarks are run on two distinct hardware environment, an AMD and an INTEL machine.
+For all benchmarks, \texttt{taskset} is used to limit the experiment to 1 NUMA Node with no hyper threading.
+All benchmarks are run on two distinct hardware platforms.
+\begin{description}
+\item[AMD] is a server with two AMD EPYC 7662 CPUs and 256GB of DDR4 RAM.
+The EPYC CPU has 64 cores with 2 \glspl{hthrd} per core, for 128 \glspl{hthrd} per socket with 2 sockets for a total of 256 \glspl{hthrd}.
+Each CPU has 4 MB, 64 MB and 512 MB of L1, L2 and L3 caches, respectively.
+Each L1 and L2 instance is only shared by \glspl{hthrd} on a given core, but each L3 instance is shared by 4 cores, therefore 8 \glspl{hthrd}.
+The server runs Ubuntu 20.04.2 LTS on top of Linux Kernel 5.8.0-55.
+\item[Intel] is a server with four Intel Xeon Platinum 8160 CPUs and 384GB of DDR4 RAM.
+The Xeon CPU has 24 cores with 2 \glspl{hthrd} per core, for 48 \glspl{hthrd} per socket with 4 sockets for a total of 192 \glspl{hthrd}.
+Each CPU has 3 MB, 96 MB and 132 MB of L1, L2 and L3 caches respectively.
+Each L1 and L2 instance is only shared by \glspl{hthrd} on a given core, but each L3 instance is shared across the entire CPU, therefore 48 \glspl{hthrd}.
+The server runs Ubuntu 20.04.2 LTS on top of Linux Kernel 5.8.0-55.
+\end{description}
+For all benchmarks, @taskset@ is used to limit the experiment to 1 NUMA Node with no hyperthreading.
 If more \glspl{hthrd} are needed, then 1 NUMA Node with hyperthreading is used.
+If still more \glspl{hthrd} are needed then the experiment is limited to as few NUMA Nodes as needed.
+\paragraph{AMD} The AMD machine is a server with two AMD EPYC 7662 CPUs and 256GB of DDR4 RAM.
+The server runs Ubuntu 20.04.2 LTS on top of Linux Kernel 5.8.0-55.
+These EPYCs have 64 cores per CPUs and 2 \glspl{hthrd} per core, for a total of 256 \glspl{hthrd}.
+The cpus each have 4 MB, 64 MB and 512 MB of L1, L2 and L3 caches respectively.
+Each L1 and L2 instance are only shared by \glspl{hthrd} on a given core, but each L3 instance is shared by 4 cores, therefore 8 \glspl{hthrd}.
+\paragraph{Intel} The Intel machine is a server with four Intel Xeon Platinum 8160 CPUs and 384GB of DDR4 RAM.
+The server runs Ubuntu 20.04.2 LTS on top of Linux Kernel 5.8.0-55.
+These Xeon Platinums have 24 cores per CPUs and 2 \glspl{hthrd} per core, for a total of 192 \glspl{hthrd}.
+The cpus each have 3 MB, 96 MB and 132 MB of L1, L2 and L3 caches respectively.
+Each L1 and L2 instance are only shared by \glspl{hthrd} on a given core, but each L3 instance is shared across the entire CPU, therefore 48 \glspl{hthrd}.
+This limited sharing of the last level cache on the AMD machine is markedly different than the Intel machine. Indeed, while on both architectures L2 cache misses that are served by L3 caches on a different cpu incurr a significant latency, on AMD it is also the case that cache misses served by a different L3 instance on the same cpu still incur high latency.
+If still more \glspl{hthrd} are needed, then the experiment is limited to as few NUMA Nodes as needed.
+The limited sharing of the last-level cache on the AMD machine is markedly different than the Intel machine.
+Indeed, while on both architectures L2 cache misses that are served by L3 caches on a different CPU incur a significant latency, on the AMD it is also the case that cache misses served by a different L3 instance on the same CPU still incur high latency.
 …
         \label{fig:cycle}
 \end{figure}
+The most basic evaluation of any ready queue is to evaluate the latency needed to push and pop one element from the ready-queue.
+Since these two operation also describe a \texttt{yield} operation, many systems use this as the most basic benchmark.
+However, yielding can be treated as a special case, since it also carries the information that the number of the ready \glspl{at} will not change.
+Not all systems use this information, but those which do may appear to have better performance than they would for disconnected push/pop pairs.
+For this reason, I chose a different first benchmark, which I call the Cycle Benchmark.
+This benchmark arranges many \glspl{at} into multiple rings of \glspl{at}.
+Each ring is effectively a circular singly-linked list.
+The most basic evaluation of any ready queue is to evaluate the latency needed to push and pop one element from the ready queue.
+Since these two operations also describe a @yield@ operation, many systems use this operation as the most basic benchmark.
+However, yielding can be treated as a special case by optimizing it away (dead code) since the number of ready \glspl{at} does not change.
+Not all systems perform this optimization, but those that do have an artificial performance benefit because the yield becomes a \emph{nop}.
+For this reason, I chose a different first benchmark, called \newterm{Cycle Benchmark}.
+This benchmark arranges a number of \glspl{at} into a ring, as seen in Figure~\ref{fig:cycle}, where the ring is a circular singly-linked list.
 At runtime, each \gls{at} unparks the next \gls{at} before parking itself.
+This corresponds to the desired pair of ready queue operations.
+Unparking the next \gls{at} requires pushing that \gls{at} onto the ready queue and the ensuing park will cause the runtime to pop a \gls{at} from the ready-queue.
+Figure~\ref{fig:cycle} shows a visual representation of this arrangement.
+The goal of this ring is that the underlying runtime cannot rely on the guarantee that the number of ready \glspl{at} will stay constant over the duration of the experiment.
+Unparking the next \gls{at} pushes that \gls{at} onto the ready queue, and the ensuing park causes the runtime to pop a \gls{at} from the ready queue.
+Hence, the underlying runtime cannot rely on the number of ready \glspl{at} staying constant over the duration of the experiment.
 In fact, the total number of \glspl{at} waiting on the ready queue is expected to vary because of the race between the next \gls{at} unparking and the current \gls{at} parking.
+The size of the cycle is also decided based on this race: cycles that are too small may see the chain of unparks go full circle before the first \gls{at} can park.
+While this would not be a correctness problem, every runtime system must handle that race, it could lead to pushes and pops being optimized away.
+Since silently omitting ready-queue operations would throw off the measuring of these operations, the ring of \glspl{at} must be big enough so the \glspl{at} have the time to fully park before they are unparked.
+Note that this problem is only present on SMP machines and is significantly mitigated by the fact that there are multiple rings in the system.
+To avoid this benchmark from being dominated by the idle sleep handling, the number of rings is kept at least as high as the number of \glspl{proc} available.
+Beyond this point, adding more rings serves to mitigate even more the idle sleep handling.
+This is to avoid the case where one of the \glspl{proc} runs out of work because of the variation on the number of ready \glspl{at} mentionned above.
+The actual benchmark is more complicated to handle termination, but that simply requires using a binary semphore or a channel instead of raw \texttt{park}/\texttt{unpark} and carefully picking the order of the \texttt{P} and \texttt{V} with respect to the loop condition.
+Figure~\ref{fig:cycle:code} shows pseudo code for this benchmark.
+\begin{figure}
+        \begin{lstlisting}
+                Thread.main() {
+                        count := 0
+                        for {
+                                wait()
+                                this.next.wake()
+                                count ++
+                                if must_stop() { break }
+                        }
+                        global.count += count
+                }
+        \end{lstlisting}
+        \caption[Cycle Benchmark : Pseudo Code]{Cycle Benchmark : Pseudo Code}
+        \label{fig:cycle:code}
+\end{figure}
+That is, the runtime cannot anticipate that the current task will immediately park.
+As well, the size of the cycle is also decided based on this race, \eg a small cycle may see the chain of unparks go full circle before the first \gls{at} parks because of time-slicing or multiple \procs.
+Every runtime system must handle this race and cannot optimize away the ready-queue pushes and pops.
+To prevent any attempt of silently omitting ready-queue operations, the ring of \glspl{at} is made big enough so the \glspl{at} have time to fully park before being unparked again.
+(Note, an unpark is like a V on a semaphore, so the subsequent park (P) may not block.)
+Finally, to further mitigate any underlying push/pop optimizations, especially on SMP machines, multiple rings are created in the experiment.
+To avoid this benchmark being affected by idle-sleep handling, the number of rings is multiple times greater than the number of \glspl{proc}.
+This design avoids the case where one of the \glspl{proc} runs out of work because of the variation on the number of ready \glspl{at} mentioned above.
+Figure~\ref{fig:cycle:code} shows the pseudo code for this benchmark.
+There is additional complexity to handle termination (not shown), which requires a binary semaphore or a channel instead of raw @park@/@unpark@ and carefully picking the order of the @P@ and @V@ with respect to the loop condition.
+\begin{figure}
+\begin{cfa}
+Thread.main() {
+        count := 0
+        for {
+                @wait()@
+                @this.next.wake()@
+                count ++
+                if must_stop() { break }
+        }
+        global.count += count
+}
+\end{cfa}
+\caption[Cycle Benchmark : Pseudo Code]{Cycle Benchmark : Pseudo Code}
+\label{fig:cycle:code}
+\end{figure}
 \subsection{Results}
+Figure~\ref{fig:cycle:jax} shows the throughput as a function of \proc count, where each run uses 100 cycles per \proc and 5 \ats per cycle.
 \begin{figure}
         \subfloat[][Throughput, 100 \ats per \proc]{
 …
                 \label{fig:cycle:jax:low:ns}
+        }
         \caption[Cycle Benchmark on Intel]{Cycle Benchmark on Intel\smallskip\newline Throughput as a function of \proc count, using 100 cycles per \proc, 5 \ats per cycle.}
+        \caption[Cycle Benchmark on Intel]{Cycle Benchmark on Intel\smallskip\newline Throughput as a function of \proc count with 100 cycles per \proc and 5 \ats per cycle.}
         \label{fig:cycle:jax}
 \end{figure}
-Figure~\ref{fig:cycle:jax} shows the throughput as a function of \proc count, with the following constants:
-Each run uses 100 cycles per \proc, 5 \ats per cycle.
 \todo{results discussion}
 \section{Yield}
 For completion, I also include the yield benchmark.
 This benchmark is much simpler than the cycle tests, it simply creates many \glspl{at} that call \texttt{yield}.
 As mentionned in the previous section, this benchmark may be less representative of usages that only make limited use of \texttt{yield}, due to potential shortcuts in the routine.
 Its only interesting variable is the number of \glspl{at} per \glspl{proc}, where ratios close to 1 means the ready queue(s) could be empty.
 This sometimes puts more strain on the idle sleep handling, compared to scenarios where there is clearly plenty of work to be done.
 Figure~\ref{fig:yield:code} shows pseudo code for this benchmark, the ``wait/wake-next'' is simply replaced by a yield.
 \begin{figure}
         \begin{lstlisting}
                 Thread.main() {
                         count := 0
                         for {
                                 yield()
                                 count ++
                                 if must_stop() { break }
+                        }
                         global.count += count
+                }
         \end{lstlisting}
         \caption[Yield Benchmark : Pseudo Code]{Yield Benchmark : Pseudo Code}
         \label{fig:yield:code}
+For completeness, the classic yield benchmark is included.
+This benchmark is simpler than the cycle test: it creates many \glspl{at} that call @yield@.
+As mentioned, this benchmark may not be representative because of optimization shortcuts in @yield@.
+The only interesting variable in this benchmark is the number of \glspl{at} per \glspl{proc}, where ratios close to 1 mean the ready queue(s) can be empty.
+This scenario can put a strain on the idle-sleep handling compared to scenarios where there is plenty of work.
+Figure~\ref{fig:yield:code} shows pseudo code for this benchmark, where the @wait/next.wake@ is replaced by @yield@.
+\begin{figure}
+\begin{cfa}
+Thread.main() {
+        count := 0
+        for {
+                @yield()@
+                count ++
+                if must_stop() { break }
+        }
+        global.count += count
+}
+\end{cfa}
+\caption[Yield Benchmark : Pseudo Code]{Yield Benchmark : Pseudo Code}
+\label{fig:yield:code}
 \end{figure}
 \subsection{Results}
+Figure~\ref{fig:yield:jax} shows the throughput as a function of \proc count, where each run uses 100 \ats per \proc.
 \begin{figure}
         \subfloat[][Throughput, 100 \ats per \proc]{
 …
         \label{fig:yield:jax}
 \end{figure}
-Figure~\ref{fig:yield:ops:jax} shows the throughput as a function of \proc count, with the following constants:
-Each run uses 100 \ats per \proc.
 \todo{results discussion}
 \section{Churn}
 The Cycle and Yield benchmark represents an ``easy'' scenario for a scheduler, \eg, an embarrassingly parallel application.
 In these benchmarks, \glspl{at} can be easily partitioned over the different \glspl{proc} up-front and none of the \glspl{at} communicate with each other.
 The Churn benchmark represents more chaotic usages, where there is no relation between the last \gls{proc} on which a \gls{at} ran and the \gls{proc} that unblocked it.
 When a \gls{at} is unblocked from a different \gls{proc} than the one on which it last ran, the unblocking \gls{proc} must either ``steal'' the \gls{at} or place it on a remote queue.
 This results can result in either contention on the remote queue or \glspl{rmr} on \gls{at} data structure.
 In either case, this benchmark aims to highlight how each scheduler handles these cases, since both cases can lead to performance degradation if they are not handled correctly.
 To achieve this the benchmark uses a fixed size array of semaphores.
 Each \gls{at} picks a random semaphore, \texttt{V}s it to unblock a \at waiting and then \texttt{P}s on the semaphore.
+The Cycle and Yield benchmarks represent an \emph{easy} scenario for a scheduler, \eg an embarrassingly parallel application.
+In these benchmarks, \glspl{at} can be easily partitioned over the different \glspl{proc} upfront and none of the \glspl{at} communicate with each other.
+The Churn benchmark represents more chaotic execution, where there is no relation between the last \gls{proc} on which a \gls{at} ran and blocked and the \gls{proc} that subsequently unblocks it.
+With processor-specific ready-queues, when a \gls{at} is unblocked by a different \gls{proc} that means the unblocking \gls{proc} must either ``steal'' the \gls{at} from another processor or find it on a global queue.
+This dequeuing results in contention on the remote queue, \glspl{rmr} on the \gls{at} data structure, or both.
+In either case, this benchmark aims to highlight how each scheduler handles these cases, since both cases can lead to performance degradation if not handled correctly.
+This benchmark uses a fixed-size array of counting semaphores.
+Each \gls{at} picks a random semaphore, @V@s it to unblock any \at waiting, and then @P@s on the semaphore.
 This creates a flow where \glspl{at} push each other out of the semaphores before being pushed out themselves.
+For this benchmark to work however, the number of \glspl{at} must be equal or greater to the number of semaphores plus the number of \glspl{proc}.
+Note that the nature of these semaphores mean the counter can go beyond 1, which could lead to calls to \texttt{P} not blocking.
+For this benchmark to work, the number of \glspl{at} must be greater than or equal to the number of semaphores plus the number of \glspl{proc}.
+Note, the nature of these semaphores means the counter can go beyond 1, which can lead to nonblocking calls to @P@.
+Figure~\ref{fig:churn:code} shows pseudo code for this benchmark, where the @yield@ is replaced by @V@ and @P@.
+\begin{figure}
+\begin{cfa}
+Thread.main() {
+        count := 0
+        for {
+                r := random() % len(spots)
+                @spots[r].V()@
+                @spots[r].P()@
+                count ++
+                if must_stop() { break }
+        }
+        global.count += count
+}
+\end{cfa}
+\caption[Churn Benchmark : Pseudo Code]{Churn Benchmark : Pseudo Code}
+\label{fig:churn:code}
+\end{figure}
+\subsection{Results}
+Figure~\ref{fig:churn:jax} shows the throughput as a function of \proc count, where each run uses 100 cycles per \proc and 5 \ats per cycle.
+\begin{figure}
+        \subfloat[][Throughput, 100 \ats per \proc]{
+                \resizebox{0.5\linewidth}{!}{
+                        \input{result.churn.jax.ops.pstex_t}
+                }
+                \label{fig:churn:jax:ops}
+        }
+        \subfloat[][Throughput, 1 \ats per \proc]{
+                \resizebox{0.5\linewidth}{!}{
+                        \input{result.churn.low.jax.ops.pstex_t}
+                }
+                \label{fig:churn:jax:low:ops}
+        }
+        \subfloat[][Latency, 100 \ats per \proc]{
+                \resizebox{0.5\linewidth}{!}{
+                        \input{result.churn.jax.ns.pstex_t}
+                }
+        }
+        \subfloat[][Latency, 1 \ats per \proc]{
+                \resizebox{0.5\linewidth}{!}{
+                        \input{result.churn.low.jax.ns.pstex_t}
+                }
+                \label{fig:churn:jax:low:ns}
+        }
+        \caption[Churn Benchmark on Intel]{\centering Churn Benchmark on Intel\smallskip\newline Throughput and latency of the Churn benchmark on the Intel machine.
+        Throughput is the total operations per second across all cores. Latency is the duration of each operation.}
+        \label{fig:churn:jax}
+\end{figure}
+\todo{results discussion}
+\section{Locality}
 \todo{code, setup, results}
-\begin{lstlisting}
-        Thread.main() {
-                count := 0
-                for {
-                        r := random() % len(spots)
-                        spots[r].V()
-                        spots[r].P()
-                        count ++
-                        if must_stop() { break }
+                }
-                global.count += count
+        }
-\end{lstlisting}
-\begin{figure}
-        \subfloat[][Throughput, 100 \ats per \proc]{
-                \resizebox{0.5\linewidth}{!}{
-                        \input{result.churn.jax.ops.pstex_t}
+                }
-                \label{fig:churn:jax:ops}
+        }
-        \subfloat[][Throughput, 1 \ats per \proc]{
-                \resizebox{0.5\linewidth}{!}{
-                        \input{result.churn.low.jax.ops.pstex_t}
+                }
-                \label{fig:churn:jax:low:ops}
+        }
-        \subfloat[][Latency, 100 \ats per \proc]{
-                \resizebox{0.5\linewidth}{!}{
-                        \input{result.churn.jax.ns.pstex_t}
+                }
+        }
-        \subfloat[][Latency, 1 \ats per \proc]{
-                \resizebox{0.5\linewidth}{!}{
-                        \input{result.churn.low.jax.ns.pstex_t}
+                }
-                \label{fig:churn:jax:low:ns}
+        }
-        \caption[Churn Benchmark on Intel]{\centering Churn Benchmark on Intel\smallskip\newline Throughput and latency of the Churn on the benchmark on the Intel machine. Throughput is the total operation per second across all cores. Latency is the duration of each opeartion.}
-        \label{fig:churn:jax}
-\end{figure}
-\section{Locality}
-\todo{code, setup, results}
 \section{Transfer}
 The last benchmark is more exactly characterized as an experiment than a benchmark.
 It tests the behavior of the schedulers for a particularly misbehaved workload.
+The last benchmark is more of an experiment than a benchmark.
+It tests the behaviour of the schedulers for a misbehaved workload.
 In this workload, one of the \glspl{at} is selected at random to be the leader.
 The leader then spins in a tight loop until it has observed that all other \glspl{at} have acknowledged its leadership.
 The leader \gls{at} then picks a new \gls{at} to be the ``spinner'' and the cycle repeats.
+The benchmark comes in two flavours for the behavior of the non-leader \glspl{at}:
+once they acknowledged the leader, they either block on a semaphore or yield repeatedly.
+This experiment is designed to evaluate the short term load balancing of the scheduler.
+Indeed, schedulers where the runnable \glspl{at} are partitioned on the \glspl{proc} may need to balance the \glspl{at} for this experiment to terminate.
+This is because the spinning \gls{at} is effectively preventing the \gls{proc} from running any other \glspl{thrd}.
+In the semaphore flavour, the number of runnable \glspl{at} will eventually dwindle down to only the leader.
+This is a simpler case to handle for schedulers since \glspl{proc} eventually run out of work.
+The benchmark comes in two flavours for the non-leader \glspl{at}:
+once they have acknowledged the leader, they either block on a semaphore or spin while yielding.
+The experiment is designed to evaluate the short-term load-balancing of a scheduler.
+Indeed, schedulers where the runnable \glspl{at} are partitioned on the \glspl{proc} may need to balance the \glspl{at} for this experiment to terminate.
+This problem occurs because the spinning \gls{at} is effectively preventing the \gls{proc} from running any other \glspl{thrd}.
+In the semaphore flavour, the number of runnable \glspl{at} eventually dwindles down to only the leader.
+This scenario is a simpler case to handle for schedulers since \glspl{proc} eventually run out of work.
 In the yielding flavour, the number of runnable \glspl{at} stays constant.
 This is a harder case to handle because corrective measures must be taken even if work is still available.
 Note that languages that have mandatory preemption do circumvent this problem by forcing the spinner to yield.
+This scenario is a harder case to handle because corrective measures must be taken even when work is available.
+Note, runtime systems with preemption circumvent this problem by forcing the spinner to yield.
 \todo{code, setup, results}
+\begin{figure}
+\begin{cfa}
+Thread.lead() {
+        this.idx_seen = ++lead_idx
+        if lead_idx > stop_idx {
+                done := true
+                return
+        }
+        // Wait for everyone to acknowledge my leadership
+        start := timeNow()
+        for t in threads {
+                while t.idx_seen != lead_idx {
+                        asm pause
+                        if (timeNow() - start) > 5 seconds { error() }
+                }
+        }
+        // pick next leader
+        leader := threads[ prng() % len(threads) ]
+        // wake every one
+        if ! exhaust {
+                for t in threads {
+                        if t != me { t.wake() }
+                }
+        }
+}
+Thread.wait() {
+        this.idx_seen := lead_idx
+        if exhaust { wait() }
+        else { yield() }
+}
+Thread.main() {
+        while !done  {
+                if leader == me { this.lead() }
+                else { this.wait() }
+        }
+}
+\end{cfa}
+\caption[Transfer Benchmark : Pseudo Code]{Transfer Benchmark : Pseudo Code}
+\label{fig:transfer:code}
+\end{figure}
+\subsection{Results}
+Figure~\ref{fig:transfer:jax} shows the throughput as a function of \proc count, where each run uses 100 cycles per \proc and 5 \ats per cycle.
+\todo{results discussion}

doc/theses/thierry_delisle_PhD/thesis/text/existing.tex

-              r9e23b446
+              rffec1bf
 \chapter{Previous Work}\label{existing}
+Scheduling is the process of assigning resources to incoming requests.
+A very common form of this is assigning available workers to work-requests.
+The need for scheduling is very common in Computer Science, \eg Operating Systems and Hypervisors schedule available CPUs, NICs schedule available bandwidth, but scheduling is also common in other fields.
+For example, in assembly lines, assigning parts in need of assembly to line workers is a form of scheduling.
+In all these cases, the choice of a scheduling algorithm generally depends first and foremost on how much information is available to the scheduler.
+Workloads that are well-known, consistent and homogeneous can benefit from a scheduler that is optimized to use this information, while ill-defined, inconsistent, heterogeneous workloads will require general algorithms.
+A secondary aspect to that is how much information can be gathered versus how much information must be given as part of the input.
+There is therefore a spectrum of scheduling algorithms, going from static schedulers that are well informed from the start, to schedulers that gather most of the information needed, to schedulers that can only rely on very limited information.
+Note that this description includes both information about each request, \eg time to complete or resources needed, and information about the relationships between requests, \eg whether or not some request must be completed before another request starts.
+Scheduling physical resources, for example in assembly lines, is generally amenable to using very well informed scheduling since information can be gathered much faster than the physical resources can be assigned and workloads are likely to stay stable for long periods of time.
+As stated, scheduling is the process of assigning resources to incoming requests, where the common example is assigning available workers to work requests or vice versa.
+Common scheduling examples in Computer Science are: operating systems and hypervisors schedule available CPUs, NICs schedule available bandwidth, virtual memory and memory allocators schedule available storage, \etc.
+Scheduling is also common in most other fields, \eg in assembly lines, assigning parts to line workers is a form of scheduling.
+In general, \emph{selecting} a scheduling algorithm depends on how much information is available to the scheduler.
+Workloads that are well-known, consistent, and homogeneous can benefit from a scheduler that is optimized to use this information, while ill-defined, inconsistent, heterogeneous workloads require general non-optimal algorithms.
+A secondary aspect is how much information can be gathered versus how much information must be given as part of the scheduler input.
+This information adds to the spectrum of scheduling algorithms, going from static schedulers that are well informed from the start, to schedulers that gather most of the information needed, to schedulers that can only rely on very limited information.
+Note, this description includes both information about each request, \eg time to complete or resources needed, and information about the relationships among requests, \eg whether or not some request must be completed before another request starts.
+Scheduling physical resources, \eg in an assembly line, is generally amenable to using well-informed scheduling, since information can be gathered much faster than the physical resources can be assigned and workloads are likely to stay stable for long periods of time.
 When a faster pace is needed and changes are much more frequent, gathering information on workloads, up-front or live, can become much more limiting, and more general schedulers are needed.
 \section{Naming Convention}
+Scheduling has been studied by various different communities concentrating on different incarnations of the same problems. As a result, there is no real naming convention for scheduling that is respected across these communities. For this document, I will use the term \newterm{\Gls{at}} to refer to the abstract objects being scheduled and the term \newterm{\Gls{proc}} to refer to the objects which will execute these \glspl{at}.
+Scheduling has been studied by various communities concentrating on different incarnations of the same problems.
+As a result, there are no standard naming conventions for scheduling that are respected across these communities.
+This document uses the term \newterm{\Gls{at}} to refer to the abstract objects being scheduled and the term \newterm{\Gls{proc}} to refer to the concrete objects executing these \ats.
 \section{Static Scheduling}
 Static schedulers require that \glspl{at} have their dependencies and costs explicitly and exhaustively specified prior to scheduling.
 The scheduler then processes this input ahead of time and produces a \newterm{schedule} to which the system can later adhere.
 This approach is generally popular in real-time systems since the need for strong guarantees justifies the cost of supplying this information.
 In general, static schedulers are less relevant to this project since they require input from the programmers that \CFA does not have as part of its concurrency semantics.
 Specifying this information explicitly can add a significant burden on the programmers and reduces flexibility; for this reason, the \CFA scheduler does not require this information.
+\newterm{Static schedulers} require \ats dependencies and costs be explicitly and exhaustively specified prior to scheduling.
+The scheduler then processes this input ahead of time and produces a \newterm{schedule} the system follows during execution.
+This approach is popular in real-time systems since the need for strong guarantees justifies the cost of determining and supplying this information.
+In general, static schedulers are less relevant to this project because they require input from the programmers that the programming language does not have as part of its concurrency semantics.
+Specifying this information explicitly adds a significant burden to the programmer and reduces flexibility.
+For this reason, the \CFA scheduler does not require this information.
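As a minimal sketch of the static approach described above, the following Python fragment computes a complete schedule before anything runs, given exhaustively specified costs and dependencies. The task names, the costs, the single-worker assumption, and the helper \texttt{static\_schedule} are all hypothetical and chosen only for illustration; they do not come from any scheduler discussed in this chapter.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def static_schedule(costs, deps):
    """Return [(task, start, end)] for a single worker.

    costs: task -> execution cost, known ahead of time (assumption).
    deps:  task -> set of tasks that must complete first (assumption).
    """
    # All dependencies are known up front, so a valid order can be
    # computed before execution starts.
    order = TopologicalSorter(deps).static_order()
    schedule, clock = [], 0
    for task in order:
        schedule.append((task, clock, clock + costs[task]))
        clock += costs[task]
    return schedule


# Hypothetical three-stage pipeline: parse -> check -> emit.
costs = {"parse": 2, "check": 3, "emit": 1}
deps = {"check": {"parse"}, "emit": {"check"}}
plan = static_schedule(costs, deps)
```

The burden mentioned above is visible in the inputs: both \texttt{costs} and \texttt{deps} must be complete and correct before the schedule can be produced.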
 \section{Dynamic Scheduling}
+It may be difficult to fulfill the requirements of a static scheduler if dependencies are conditional. In this case, it may be preferable to detect dependencies at runtime. This detection effectively takes the form of adding one or more new \gls{at}(s) to the system as their dependencies are resolved, as well as potentially halting or suspending a \gls{at} that dynamically detects unfulfilled dependencies. Each \gls{at} has the responsibility of adding the dependent \glspl{at} back in the system once completed. As a consequence, the scheduler may have an incomplete view of the system, seeing only \glspl{at} with no pending dependencies. Schedulers that support this detection at runtime are referred to as \newterm{Dynamic Schedulers}.
+\newterm{Dynamic schedulers} determine \ats dependencies and costs during scheduling, if at all.
+Hence, unlike static scheduling, \ats dependencies are conditional and detected at runtime.
+This detection takes the form of observing new \ats in the system and determining dependencies from their behaviour, including suspending or halting a \at that dynamically detects unfulfilled dependencies.
+Furthermore, each \at has the responsibility of adding dependent \ats back into the system once dependencies are fulfilled.
+As a consequence, the scheduler often has an incomplete view of the system, seeing only \ats with no pending dependencies.
 \subsection{Explicitly Informed Dynamic Schedulers}
+While dynamic schedulers do not have access to an exhaustive list of dependencies for a \gls{at}, they may require programmers to provide more or less information about each \gls{at}, including for example: expected duration, required resources, relative importance, etc. The scheduler can then use this information to direct the scheduling decisions. \cit{Examples of schedulers with more information} Precisely providing this information can be difficult for programmers, especially \emph{predicted} behaviour, and the scheduler may need to support some amount of imprecision in the provided information. For example, specifying that a \gls{at} takes approximately 5 seconds to complete, rather than exactly 5 seconds. User provided information can also become a significant burden depending on how the effort to provide the information scales with the number of \glspl{at} and their complexity. For example, providing an exhaustive list of files read by 5 \glspl{at} is an easier requirement than providing an exhaustive list of memory addresses accessed by 10'000 distinct \glspl{at}.
+Since the goal of this thesis is to provide a scheduler as a replacement for \CFA's existing \emph{uninformed} scheduler, Explicitly Informed schedulers are less relevant to this project. Nevertheless, some strategies are worth mentioning.
+\subsubsection{Priority Scheduling}
+A commonly used piece of information that schedulers use to direct the algorithm is priorities. Each Task is given a priority and higher-priority \glspl{at} are preferred to lower-priority ones. The simplest priority scheduling algorithm is to simply require that every \gls{at} have a distinct pre-established priority and always run the available \gls{at} with the highest priority. Asking programmers to provide an exhaustive set of unique priorities can be prohibitive when the system has a large number of \glspl{at}. It can therefore be desirable for schedulers to support \glspl{at} with identical priorities and/or automatically setting and adjusting priorities for \glspl{at}. The most common operating systems use some variation on priorities with overlaps and dynamic priority adjustments. For example, Microsoft Windows uses a pair of priorities
+While dynamic schedulers may not have an exhaustive list of dependencies for a \at, some information may be available about each \at, \eg expected duration, required resources, relative importance, \etc.
+When available, a scheduler can then use this information to direct the scheduling decisions. \cit{Examples of schedulers with more information}
+However, most programmers do not determine or even \emph{predict} this information;
+at best, the scheduler has only some imprecise information provided by the programmer, \eg, indicating a \at takes approximately 3--7 seconds to complete, rather than exactly 5 seconds.
+Providing this kind of information is a significant programmer burden, especially if the information does not scale with the number of \ats and their complexity.
+For example, providing an exhaustive list of files read by 5 \ats is an easier requirement than providing an exhaustive list of memory addresses accessed by 10,000 independent \ats.
+Since the goal of this thesis is to provide a scheduler as a replacement for \CFA's existing \emph{uninformed} scheduler, explicitly informed schedulers are less relevant to this project. Nevertheless, some strategies are worth mentioning.
+\subsubsection{Priority Scheduling}
+Priorities are a common form of information used by schedulers to direct their algorithm.
+Each \at is given a priority and higher-priority \ats are preferred to lower-priority ones.
+The simplest priority scheduling algorithm is to require that every \at have a distinct pre-established priority and always run the available \at with the highest priority.
+Asking programmers to provide an exhaustive set of unique priorities can be prohibitive when the system has a large number of \ats.
+It can therefore be desirable for schedulers to support \ats with identical priorities and/or to automatically set and adjust priorities for \ats.
+Most common operating systems use some variant on priorities with overlaps and dynamic priority adjustments.
+For example, Microsoft Windows uses a pair of priorities
 \cit{https://docs.microsoft.com/en-us/windows/win32/procthread/scheduling-priorities,https://docs.microsoft.com/en-us/windows/win32/taskschd/taskschedulerschema-priority-settingstype-element}, one specified by users out of ten possible options and one adjusted by the system.
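The priority discipline described above can be sketched with a binary heap. This is only an illustrative Python fragment under stated assumptions: the class \texttt{PriorityReadyQueue}, the task names, and the FIFO tie-break among identical priorities are hypothetical choices, not the behaviour of any particular operating system.

```python
import heapq
from itertools import count


class PriorityReadyQueue:
    """Ready queue that always hands out the highest-priority task."""

    def __init__(self):
        self._heap = []
        self._seq = count()  # insertion counter used to break ties FIFO

    def push(self, priority, task):
        # heapq is a min-heap, so the priority is negated to pop the
        # largest priority first; the sequence number orders equal priorities.
        heapq.heappush(self._heap, (-priority, next(self._seq), task))

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


rq = PriorityReadyQueue()
rq.push(1, "logger")
rq.push(5, "ui")
rq.push(5, "audio")  # same priority as "ui": allowed, served after it
rq.push(3, "net")
order = [rq.pop() for _ in range(len(rq))]
```

Supporting identical priorities removes the need for programmers to invent a total order over all tasks, at the cost of needing some secondary rule (here, arrival order).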
 \subsection{Uninformed and Self-Informed Dynamic Schedulers}
 Several scheduling algorithms do not require programmers to provide additional information on each \gls{at}, and instead make scheduling decisions based solely on internal state and/or information implicitly gathered by the scheduler.
+Several scheduling algorithms do not require programmers to provide additional information on each \at, and instead make scheduling decisions based solely on internal state and/or information implicitly gathered by the scheduler.
 \subsubsection{Feedback Scheduling}
+As mentioned, schedulers may also gather information about each \gls{at} to direct their decisions. This design effectively moves the scheduler to some extent into the realm of \newterm{Control Theory}\cite{wiki:controltheory}. This gathering does not generally involve programmers and as such does not increase programmer burden the same way explicitly provided information may. However, some feedback schedulers do offer the option to programmers to offer additional information on certain \glspl{at}, in order to direct scheduling decisions. The important distinction is whether or not the scheduler can function without this additional information.
+As mentioned, schedulers may also gather information about each \at to direct their decisions.
+This design effectively moves the scheduler into the realm of \newterm{Control Theory}~\cite{wiki:controltheory}.
+This information gathering does not generally involve programmers, and as such, does not increase programmer burden the same way explicitly provided information may.
+However, some feedback schedulers do allow programmers to offer additional information on certain \ats, in order to direct scheduling decisions.
+The important distinction is whether or not the scheduler can function without this additional information.
 \section{Work Stealing}\label{existing:workstealing}
+One of the most popular scheduling algorithms in practice (see~\ref{existing:prod}) is work-stealing. This idea, introduced by \cite{DBLP:conf/fpca/BurtonS81}, effectively has each worker work on its local \glspl{at} first, but allows the possibility for other workers to steal local \glspl{at} if they run out of \glspl{at}. \cite{DBLP:conf/focs/Blumofe94} introduced the more familiar incarnation of this, where each worker has a queue of \glspl{at} to accomplish and workers without \glspl{at} steal \glspl{at} from random workers. (The Burton and Sleep algorithm had trees of \glspl{at} and stole only among neighbours). Blumofe and Leiserson also prove worst case space and time requirements for well-structured computations.
+Many variations of this algorithm have been proposed over the years\cite{DBLP:journals/ijpp/YangH18}, both optimizations of existing implementations and approaches that account for new metrics.
+\paragraph{Granularity} A significant portion of early Work Stealing research was concentrating on \newterm{Implicit Parallelism}\cite{wiki:implicitpar}. Since the system was responsible to split the work, granularity is a challenge that cannot be left to the programmers (as opposed to \newterm{Explicit Parallelism}\cite{wiki:explicitpar} where the burden can be left to programmers). In general, fine granularity is better for load balancing and coarse granularity reduces communication overhead. The best performance generally means finding a middle ground between the two. Several methods can be employed, but I believe these are less relevant for threads, which are generally explicit and more coarse grained.
+\paragraph{Task Placement} Since modern computers rely heavily on cache hierarchies\cit{Do I need a citation for this}, migrating \glspl{at} from one core to another can be .  \cite{DBLP:journals/tpds/SquillanteL93}
+One of the most popular scheduling algorithms in practice (see~\ref{existing:prod}) is work stealing.
+This idea, introduced by \cite{DBLP:conf/fpca/BurtonS81}, effectively has each worker process its local \ats first, but allows the possibility for other workers to steal local \ats if they run out of \ats.
+\cite{DBLP:conf/focs/Blumofe94} introduced the more familiar incarnation of this, where each worker has a queue of \ats and workers without \ats steal \ats from random workers\footnote{The Burton and Sleep algorithm had trees of \ats and stole only among neighbours.}.
+Blumofe and Leiserson also prove worst-case space and time requirements for well-structured computations.
+Many variations of this algorithm have been proposed over the years~\cite{DBLP:journals/ijpp/YangH18}, both optimizations of existing implementations and approaches that account for new metrics.
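The random-victim scheme described above can be sketched as a small single-threaded simulation in Python. Everything here is an illustrative assumption: the function \texttt{run\_work\_stealing}, the round-robin stepping of workers, the choice to take local work from the newest end and stolen work from the oldest end, and the fixed random seed. A real implementation runs workers concurrently and needs synchronized queues.

```python
import random
from collections import deque


def run_work_stealing(queues, seed=0):
    """Simulate workers that run local tasks first and steal when idle.

    queues: one deque of tasks per worker (mutated in place).
    Returns (tasks executed by each worker, number of successful steals).
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    executed = [[] for _ in queues]
    steals = 0
    while any(queues):  # an empty deque is falsy
        for me, local in enumerate(queues):
            if local:
                # Local work is preferred: take from the newest end.
                executed[me].append(local.pop())
                continue
            # Out of local work: pick a random victim and try to steal
            # from the opposite (oldest) end of its queue.
            victim = rng.randrange(len(queues))
            if victim != me and queues[victim]:
                executed[me].append(queues[victim].popleft())
                steals += 1
    return executed, steals


# All work starts on worker 0; workers 1 and 2 can only progress by stealing.
queues = [deque(range(8)), deque(), deque()]
executed, steals = run_work_stealing(queues)
```

A failed steal attempt (an empty or self-selected victim) simply costs the idle worker one turn, which mirrors why steal-victim selection matters for performance.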
+\paragraph{Granularity} A significant portion of early work-stealing research concentrated on \newterm{Implicit Parallelism}~\cite{wiki:implicitpar}.
+Since the system is responsible for splitting the work, granularity is a challenge that cannot be left to programmers, as opposed to \newterm{Explicit Parallelism}~\cite{wiki:explicitpar} where the burden can be left to programmers.
+In general, fine granularity is better for load balancing and coarse granularity reduces communication overhead.
+The best performance generally means finding a middle ground between the two.
+Several methods can be employed, but I believe these are less relevant for threads, which are generally explicit and more coarse grained.
+\paragraph{Task Placement} Since modern computers rely heavily on cache hierarchies\cit{Do I need a citation for this}, migrating \ats from one core to another can be .  \cite{DBLP:journals/tpds/SquillanteL93}
 \todo{The survey is not great on this subject}
 \paragraph{Complex Machine Architecture} Another aspect that has been looked at is how well Work Stealing is applicable to different machine architectures.
+\paragraph{Complex Machine Architecture} Another aspect that has been examined is how well work stealing is applicable to different machine architectures.
 \subsection{Theoretical Results}
+There is also a large body of research on the theoretical aspects of work stealing. These evaluate, for example, the cost of migration\cite{DBLP:conf/sigmetrics/SquillanteN91,DBLP:journals/pe/EagerLZ86}, how affinity affects performance\cite{DBLP:journals/tpds/SquillanteL93,DBLP:journals/mst/AcarBB02,DBLP:journals/ipl/SuksompongLS16} and theoretical models for heterogeneous systems\cite{DBLP:journals/jpdc/MirchandaneyTS90,DBLP:journals/mst/BenderR02,DBLP:conf/sigmetrics/GastG10}. \cite{DBLP:journals/jacm/BlellochGM99} examine the space bounds of Work Stealing and \cite{DBLP:journals/siamcomp/BerenbrinkFG03} show that for underloaded systems, the scheduler will complete computations in finite time, \ie is \newterm{stable}. Others show that Work-Stealing is applicable to various scheduling contexts\cite{DBLP:journals/mst/AroraBP01,DBLP:journals/anor/TchiboukdjianGT13,DBLP:conf/isaac/TchiboukdjianGTRB10,DBLP:conf/ppopp/AgrawalLS10,DBLP:conf/spaa/AgrawalFLSSU14}. \cite{DBLP:conf/ipps/ColeR13} also studied how Randomized Work Stealing affects false sharing among \glspl{at}.
+However, as \cite{DBLP:journals/ijpp/YangH18} highlights, it is worth mentioning that this theoretical research has mainly focused on ``fully-strict'' computations, \ie workloads that can be fully represented with a Directed Acyclic Graph. It is unclear how well these distributions represent workloads in real world scenarios.
+There is also a large body of research on the theoretical aspects of work stealing. These evaluate, for example, the cost of migration~\cite{DBLP:conf/sigmetrics/SquillanteN91,DBLP:journals/pe/EagerLZ86}, how affinity affects performance~\cite{DBLP:journals/tpds/SquillanteL93,DBLP:journals/mst/AcarBB02,DBLP:journals/ipl/SuksompongLS16} and theoretical models for heterogeneous systems~\cite{DBLP:journals/jpdc/MirchandaneyTS90,DBLP:journals/mst/BenderR02,DBLP:conf/sigmetrics/GastG10}.
+\cite{DBLP:journals/jacm/BlellochGM99} examines the space bounds of work stealing and \cite{DBLP:journals/siamcomp/BerenbrinkFG03} shows that for under-loaded systems, the scheduler completes its computations in finite time, \ie is \newterm{stable}.
+Others show that work stealing is applicable to various scheduling contexts~\cite{DBLP:journals/mst/AroraBP01,DBLP:journals/anor/TchiboukdjianGT13,DBLP:conf/isaac/TchiboukdjianGTRB10,DBLP:conf/ppopp/AgrawalLS10,DBLP:conf/spaa/AgrawalFLSSU14}.
+\cite{DBLP:conf/ipps/ColeR13} also studied how randomized work-stealing affects false sharing among \ats.
+However, as \cite{DBLP:journals/ijpp/YangH18} highlights, it is worth mentioning that this theoretical research has mainly focused on ``fully-strict'' computations, \ie workloads that can be fully represented with a directed acyclic graph.
+It is unclear how well these distributions represent workloads in real world scenarios.
 \section{Preemption}
+One last aspect of scheduling worth mentioning is preemption since many schedulers rely on it for some of their guarantees. Preemption is the idea of interrupting \glspl{at} that have been running for too long, effectively injecting suspend points in the applications. There are multiple techniques to achieve this but they all aim to have the effect of guaranteeing that suspend points in a \gls{at} are never further apart than some fixed duration. While this helps schedulers guarantee that no \glspl{at} will unfairly monopolize a worker, preemption can effectively be added to any scheduler. Therefore, the only interesting aspect of preemption for the design of scheduling is whether or not to require it.
+\section{Schedulers in Production}\label{existing:prod}
+This section will show a quick overview of several schedulers which are generally available at the time of writing. While these schedulers don't necessarily represent the most recent advances in scheduling, they are what is generally accessible to programmers. As such, I believe that these schedulers are at least as relevant as those presented in published work. I chose both schedulers that operate in kernel space and in user space, as both can offer relevant insight for this project. However, I did not list any schedulers aimed for real-time applications, as these have constraints that are much stricter than what is needed for this project.
+One last aspect of scheduling is preemption since many schedulers rely on it for some of their guarantees.
+Preemption is the idea of interrupting \ats that have been running too long, effectively injecting suspend points into the application.
+There are multiple techniques to achieve this effect but they all aim to guarantee that the suspend points in a \at are never further apart than some fixed duration.
+While this helps schedulers guarantee that no \at unfairly monopolizes a worker, preemption can effectively be added to any scheduler.
+Therefore, the only interesting aspect of preemption for the design of scheduling is whether or not to require it.
+\section{Production Schedulers}\label{existing:prod}
+This section presents a quick overview of several current schedulers.
+While these schedulers do not necessarily represent the most recent advances in scheduling, they are what is generally accessible to programmers.
+As such, I believe these schedulers are at least as relevant as those presented in published work.
+Schedulers that operate in kernel space and in user space are considered, as both can offer relevant insight for this project.
+However, real-time schedulers are not considered, as these have constraints that are much stricter than what is needed for this project.
 \subsection{Operating System Schedulers}
+Operating System Schedulers tend to be fairly complex schedulers, they generally support some amount of real-time, aim to balance interactive and non-interactive \glspl{at} and support for multiple users sharing hardware without requiring these users to cooperate. Here are more details on a few schedulers used in the common operating systems: Linux, FreeBsd, Microsoft Windows and Apple's OS X. The information is less complete for operating systems behind closed source.
+Operating System Schedulers tend to be fairly complex as they generally support some amount of real-time, aim to balance interactive and non-interactive \ats and support multiple users sharing hardware without requiring these users to cooperate.
+Here are more details on a few schedulers used in the common operating systems: Linux, FreeBSD, Microsoft Windows and Apple's OS X.
+The information is less complete for operating systems with closed source.
 \paragraph{Linux's CFS}
+The default scheduler used by Linux (the Completely Fair Scheduler)\cite{MAN:linux/cfs,MAN:linux/cfs2} is a feedback scheduler based on CPU time. For each processor, it constructs a Red-Black tree of \glspl{at} waiting to run, ordering them by amount of CPU time spent. The scheduler schedules the \gls{at} that has spent the least CPU time. It also supports the concept of \newterm{Nice values}, which are effectively multiplicative factors on the CPU time spent. The ordering of \glspl{at} is also impacted by a group based notion of fairness, where \glspl{at} belonging to groups having spent less CPU time are preferred to \glspl{at} beloning to groups having spent more CPU time. Linux achieves load-balancing by regularly monitoring the system state\cite{MAN:linux/cfs/balancing} and using some heuristic on the load (currently CPU time spent in the last millisecond plus decayed version of the previous time slots\cite{MAN:linux/cfs/pelt}.).
+\cite{DBLP:conf/eurosys/LoziLFGQF16} shows that Linux's CFS also does work-stealing to balance the workload of each processors, but the paper argues this aspect can be improved significantly. The issues highlighted sem to stem from Linux's need to support fairness across \glspl{at} \emph{and} across users\footnote{Enforcing fairness across users means, for example, that given two users: one with a single \gls{at} and the other with one thousand \glspl{at}, the user with a single \gls{at} does not receive one one thousandth of the CPU time.}, increasing the complexity.
+Linux also offers a FIFO scheduler, a real-time schedulerwhich runs the highest-priority \gls{at}, and a round-robin scheduler, which is an extension of the fifo-scheduler that adds fixed time slices. \cite{MAN:linux/sched}
+The default scheduler used by Linux, the Completely Fair Scheduler~\cite{MAN:linux/cfs,MAN:linux/cfs2}, is a feedback scheduler based on CPU time.
+For each processor, it constructs a Red-Black tree of \ats waiting to run, ordering them by the amount of CPU time used.
+The \ats that has used the least CPU time is scheduled.
+It also supports the concept of \newterm{Nice values}, which are effectively multiplicative factors on the CPU time used.
+The ordering of \ats is also affected by a group based notion of fairness, where \ats belonging to groups having used less CPU time are preferred to \ats belonging to groups having used more CPU time.
+Linux achieves load-balancing by regularly monitoring the system state~\cite{MAN:linux/cfs/balancing} and using some heuristic on the load, currently CPU time used in the last millisecond plus a decayed version of the previous time slots~\cite{MAN:linux/cfs/pelt}.
+\cite{DBLP:conf/eurosys/LoziLFGQF16} shows that Linux's CFS also does work stealing to balance the workload of each processor, but the paper argues this aspect can be improved significantly.
+The issues highlighted stem from Linux's need to support fairness across \ats \emph{and} across users\footnote{Enforcing fairness across users means that given two users, one with a single \ats and the other with one thousand \ats, the user with a single \ats does not receive one thousandth of the CPU time.}, increasing the complexity.
+Linux also offers a FIFO scheduler, a real-time scheduler, which runs the highest-priority \ats, and a round-robin scheduler, which is an extension of the FIFO scheduler that adds fixed time slices~\cite{MAN:linux/sched}.
 \paragraph{FreeBSD}
+The ULE scheduler used in FreeBSD\cite{DBLP:conf/bsdcon/Roberson03} is a feedback scheduler similar to Linux's CFS. It uses different data structures and heuristics but also schedules according to some combination of CPU time spent and niceness values. It also periodically balances the load of the system(according to a different heuristic), but uses a simpler Work Stealing approach.
+The ULE scheduler used in FreeBSD~\cite{DBLP:conf/bsdcon/Roberson03} is a feedback scheduler similar to Linux's CFS.
+It uses different data structures and heuristics but also schedules according to some combination of CPU time used and niceness values.
+It also periodically balances the load of the system (according to a different heuristic), but uses a simpler work stealing approach.
 \paragraph{Windows(OS)}
+Microsoft's Operating System's Scheduler\cite{MAN:windows/scheduler} is a feedback scheduler with priorities. It supports 32 levels of priorities, some of which are reserved for real-time and prviliged applications. It schedules \glspl{at} based on the highest priorities (lowest number) and how much cpu time each \glspl{at} have used. The scheduler may also temporarily adjust priorities after certain effects like the completion of I/O requests.
+Microsoft's Operating System's Scheduler~\cite{MAN:windows/scheduler} is a feedback scheduler with priorities.
+It supports 32 levels of priorities, some of which are reserved for real-time and privileged applications.
+It schedules \ats based on the highest priorities (lowest number) and how much CPU time each \ats has used.
+The scheduler may also temporarily adjust priorities after certain effects like the completion of I/O requests.
 \todo{load balancing}
 …
 \subsection{User-Level Schedulers}
+By comparison, user level schedulers tend to be simpler, gathering fewer metrics and avoid complex notions of fairness. Part of the simplicity is due to the fact that all \glspl{at} have the same user, and therefore cooperation is both feasible and probable.
+\paragraph{Go}
+Go's scheduler uses a Randomized Work Stealing algorithm that has a global runqueue(\emph{GRQ}) and each processor(\emph{P}) has both a fixed-size runqueue(\emph{LRQ}) and a high-priority next ``chair'' holding a single element.\cite{GITHUB:go,YTUBE:go} Preemption is present, but only at function call boundaries.
+By comparison, user-level schedulers tend to be simpler, gathering fewer metrics and avoiding complex notions of fairness. Part of the simplicity is due to the fact that all \ats have the same user, and therefore cooperation is both feasible and probable.
+\paragraph{Go}\label{GoSafePoint}
+Go's scheduler uses a randomized work-stealing algorithm that has a global run-queue (\emph{GRQ}) and each processor (\emph{P}) has both a fixed-size run-queue (\emph{LRQ}) and a high-priority next ``chair'' holding a single element~\cite{GITHUB:go,YTUBE:go}.
+Preemption is present, but only at safe-points~\cit{https://go.dev/src/runtime/preempt.go}, which are detection code inserted at various frequent access boundaries.
 The algorithm is as follows:
 \begin{enumerate}
         \item Once out of 61 times, directly pick 1 element from the \emph{GRQ}.
+        \item Once out of 61 times, pick 1 element from the \emph{GRQ}.
         \item If there is an item in the ``chair'' pick it.
         \item Else pick an item from the \emph{LRQ}.
+        \item If it was empty steal (len(\emph{GRQ}) / \#of\emph{P}) + 1 items (max 256) from the \emph{GRQ}.
+        \item If it was empty steal \emph{half} the \emph{LRQ} of another \emph{P} chosen randomly.
+        \begin{itemize}
+        \item If it is empty steal (len(\emph{GRQ}) / \#of\emph{P}) + 1 items (max 256) from the \emph{GRQ}
+        \item and steal \emph{half} the \emph{LRQ} of another \emph{P} chosen randomly.
+        \end{itemize}
 \end{enumerate}
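The selection order listed above can be sketched in C as a pure decision function.
This is only an illustration of the steps as described here, not Go's runtime code: the names @pick_source@ and @grq_batch@ are hypothetical, the order between taking a batch from the \emph{GRQ} and stealing from another \emph{P} is an assumption, and so is clamping the batch to the current length of the \emph{GRQ}.

```c
#include <assert.h>
#include <stddef.h>

/* Where the next item comes from (hypothetical names, for illustration only). */
enum source { FROM_GRQ_FAIR, FROM_CHAIR, FROM_LRQ, FROM_GRQ_BATCH, FROM_STEAL, NOTHING };

/* Batch size taken from the GRQ when the LRQ is empty:
   (len(GRQ) / #P) + 1 items, at most 256, as stated in the list above.
   Clamping to the queue length is an assumption of this sketch. */
size_t grq_batch(size_t grq_len, size_t nprocs) {
    assert(nprocs > 0);
    size_t n = grq_len / nprocs + 1;
    if (n > 256) n = 256;
    if (n > grq_len) n = grq_len;
    return n;
}

/* tick counts scheduling decisions made by this P. */
enum source pick_source(unsigned tick, int chair_full,
                        size_t lrq_len, size_t grq_len, size_t victim_len) {
    if (tick % 61 == 0 && grq_len > 0) return FROM_GRQ_FAIR;  /* once out of 61 times */
    if (chair_full) return FROM_CHAIR;                        /* high-priority single slot */
    if (lrq_len > 0) return FROM_LRQ;                         /* local run-queue */
    if (grq_len > 0) return FROM_GRQ_BATCH;                   /* refill from the global queue */
    if (victim_len > 0) return FROM_STEAL;                    /* steal half of a random victim */
    return NOTHING;
}
```

For example, with 10 items in the \emph{GRQ} and 4 processors, @grq_batch@ returns 3.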
 \paragraph{Erlang}
+Erlang is a functionnal language that supports concurrency in the form of processes, threads that share no data. It seems to be some kind of Round-Robin Scheduler. It currently uses some mix of Work Sharing and Work Stealing to achieve load balancing\cite{:erlang}, where underloaded workers steal from other workers, but overloaded workers also push work to other workers. This migration logic seems to be directed by monitoring logic that evaluates the load a few times per seconds.
+Erlang is a functional language that supports concurrency in the form of processes: threads that share no data.
+It uses a kind of round-robin scheduler, with a mix of work sharing and stealing to achieve load balancing~\cite{:erlang}, where under-loaded workers steal from other workers, but overloaded workers also push work to other workers.
+This migration logic is directed by monitoring logic that evaluates the load a few times per second.
 \paragraph{Intel\textregistered ~Threading Building Blocks}
+\newterm{Thread Building Blocks}(TBB) is Intel's task parellelism\cite{wiki:taskparallel} framework. It runs \newterm{jobs}, uninterruptable \glspl{at}, schedulable objects that must always run to completion, on a pool of worker threads. TBB's scheduler is a variation of Randomized Work Stealing that also supports higher-priority graph-like dependencies\cite{MAN:tbb/scheduler}. It schedules \glspl{at} as follows (where \textit{t} is the last \gls{at} completed):
+\newterm{Threading Building Blocks} (TBB) is Intel's task parallelism~\cite{wiki:taskparallel} framework.
+It runs \newterm{jobs}, which are uninterruptible \ats that must always run to completion, on a pool of worker threads.
+TBB's scheduler is a variation of randomized work-stealing that also supports higher-priority graph-like dependencies~\cite{MAN:tbb/scheduler}.
+It schedules \ats as follows (where \textit{t} is the last \ats completed):
 \begin{displayquote}
         \begin{enumerate}
                 \item The task returned by \textit{t}\texttt{.execute()}
+                \item The task returned by \textit{t}@.execute()@
                 \item The successor of t if \textit{t} was its last completed predecessor.
                 \item A task popped from the end of the thread’s own deque.
+                \item A task popped from the end of the thread's own deque.
                 \item A task with affinity for the thread.
                 \item A task popped from approximately the beginning of the shared queue.
                 \item A task popped from the beginning of another randomly chosen thread’s deque.
+                \item A task popped from the beginning of another randomly chosen thread's deque.
         \end{enumerate}
 …
 \paragraph{Quasar/Project Loom}
+Java has two projects that are attempting to introduce lightweight threading into java in the form of Fibers, Quasar\cite{MAN:quasar} and Project Loom\cite{MAN:project-loom}\footnote{It is unclear to me if these are distinct projects or not}. Both projects seem to be based on the \texttt{ForkJoinPool} in Java which appears to be a simple incarnation of Randomized Work Stealing\cite{MAN:java/fork-join}.
+Java has two projects, Quasar~\cite{MAN:quasar} and Project Loom~\cite{MAN:project-loom}\footnote{It is unclear if these are distinct projects.}, that are attempting to introduce lightweight thread\-ing in the form of Fibers.
+Both projects seem to be based on the @ForkJoinPool@ in Java, which appears to be a simple incarnation of randomized work-stealing~\cite{MAN:java/fork-join}.
 \paragraph{Grand Central Dispatch}
+This is an API produce by Apple\cit{Official GCD source} that offers task parellelism\cite{wiki:taskparallel}. Its distinctive aspect is that it uses multiple ``Dispatch Queues'', some of which are created by programmers. These queues each have their own local ordering guarantees, \eg \glspl{at} on queue $A$ are executed in \emph{FIFO} order.
+Grand Central Dispatch is an Apple~\cit{Official GCD source} API that offers task parallelism~\cite{wiki:taskparallel}.
+Its distinctive aspect is multiple ``Dispatch Queues'', some of which are created by programmers.
+Each queue has its own local ordering guarantees, \eg \ats on queue $A$ are executed in \emph{FIFO} order.
 \todo{load balancing and scheduling}
 …
 % http://web.archive.org/web/20090920043909/http://images.apple.com/macosx/technology/docs/GrandCentral_TB_brief_20090903.pdf
 In terms of semantics, the Dispatch Queues seem to be very similar in semantics to Intel\textregistered ~TBB \texttt{execute()} and predecessor semantics. Where it would be possible to convert from one to the other.
+In terms of semantics, the Dispatch Queues seem to be very similar to Intel\textregistered ~TBB @execute()@ and predecessor semantics.
 \paragraph{LibFibre}
+LibFibre\cite{DBLP:journals/pomacs/KarstenB20} is a light-weight user-level threading framework developt at the University of Waterloo. Similarly to Go, it uses a variation of Work Stealing with a global queue that is higher priority than stealing. Unlock Go it does not have the high-priority next ``chair'' and does not use Randomized Work Stealing.
+LibFibre~\cite{DBLP:journals/pomacs/KarstenB20} is a light-weight user-level threading framework developed at the University of Waterloo.
+Similarly to Go, it uses a variation of work stealing with a global queue that is higher priority than stealing.
+Unlike Go, it does not have the high-priority next ``chair'' and does not use randomized work-stealing.

doc/theses/thierry_delisle_PhD/thesis/text/intro.tex

-              r9e23b446
+              rffec1bf
 \chapter*{Introduction}\label{intro}
 \todo{A proper intro}
+\chapter{Introduction}\label{intro}
+\section{\CFA programming language}
+The C programming language~\cite{C11}
+The \CFA programming language~\cite{cfa:frontpage,cfa:typesystem} extends the C programming language by adding modern safety and productivity features, while maintaining backwards compatibility.
+Among its productivity features, \CFA supports user-level threading~\cite{Delisle21} allowing programmers to write modern concurrent and parallel programs.
+My previous master's thesis on concurrency in \CFA focused on features and interfaces.
+This Ph.D.\ thesis focuses on performance, introducing \glsxtrshort{api} changes only when required by performance considerations.
+Specifically, this work concentrates on scheduling and \glsxtrshort{io}.
+Prior to this work, the \CFA runtime used a strict \glsxtrshort{fifo} \gls{rQ} and no \glsxtrshort{io} capabilities at the user-thread level\footnote{C supports \glsxtrshort{io} capabilities at the kernel level, which means blocking operations block kernel threads where blocking user-level threads would be more appropriate for \CFA.}.
+The \CFA programming language~\cite{cfa:frontpage,cfa:typesystem} extends the C programming language by adding modern safety and productivity features, while maintaining backwards compatibility. Among its productivity features, \CFA supports user-level threading~\cite{Delisle21} allowing programmers to write modern concurrent and parallel programs.
+My previous master's thesis on concurrency in \CFA focused on features and interfaces.
+This Ph.D.\ thesis focuses on performance, introducing \glsxtrshort{api} changes only when required by performance considerations. Specifically, this work concentrates on scheduling and \glsxtrshort{io}. Prior to this work, the \CFA runtime used a strict \glsxtrshort{fifo} \gls{rQ} and no non-blocking I/O capabilities at the user-thread level.
+As a research project, this work builds exclusively on newer versions of the Linux operating-system and gcc/clang compilers.
+While \CFA is released, supporting older versions of Linux ($<$~Ubuntu 16.04) and gcc/clang compilers ($<$~gcc 6.0) is not a goal of this work.
+As a research project, this work builds exclusively on newer versions of the Linux operating-system and gcc/clang compilers. While \CFA is released, supporting older versions of Linux ($<$~Ubuntu 16.04) and gcc/clang compilers ($<$~gcc 6.0) is not a goal of this work.
+\section{Scheduling}
+Computer systems share multiple resources across many threads of execution, even on single user computers like laptops or smartphones.
+On a computer system with multiple processors and work units, there exists the problem of mapping work onto processors in an efficient manner, called \newterm{scheduling}.
+These systems are normally \newterm{open}, meaning new work arrives from an external source or is spawned from an existing work unit.
+On a computer system, the scheduler takes a sequence of work requests in the form of threads and attempts to complete the work, subject to performance objectives, such as resource utilization.
+A general-purpose dynamic-scheduler for an open system cannot anticipate future work requests, so its performance is rarely optimal.
+Even with complete knowledge of arrival order and work, creating an optimal solution still effectively requires solving the bin-packing problem~\cite{wiki:binpak}.
+However, optimal solutions are often not required.
+Schedulers do produce excellent solutions, without needing optimality, by taking advantage of regularities in work patterns.
+Scheduling occurs at discrete points when there are transitions in a system.
+For example, a thread cycles through the following transitions during its execution.
+\begin{center}
+\input{executionStates.pstex_t}
+\end{center}
+These \newterm{state transition}s are initiated in response to events (\Index{interrupt}s):
+\begin{itemize}
+\item
+entering the system (new $\rightarrow$ ready)
+\item
+timer alarm for preemption (running $\rightarrow$ ready)
+\item
+long term delay versus spinning (running $\rightarrow$ blocked)
+\item
+blocking ends, \eg network or I/O completion (blocked $\rightarrow$ ready)
+\item
+normal completion or error, \eg segmentation fault (running $\rightarrow$ halted)
+\item
+scheduler assigns a thread to a resource (ready $\rightarrow$ running)
+\end{itemize}
+Key to scheduling is that a thread cannot bypass the ``ready'' state during a transition so the scheduler maintains complete control of the system.
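The transition list above can be encoded as a small lookup table, which makes the ``no bypassing ready'' property easy to check.
The following C sketch is illustrative only; the names @state@, @allowed@ and @valid_transition@ are hypothetical and do not come from the \CFA runtime.

```c
#include <assert.h>
#include <stdbool.h>

/* Thread states named in the transition list above. */
enum state { NEW, READY, RUNNING, BLOCKED, HALTED, NUM_STATES };

/* allowed[from][to] is true only for the six transitions listed above;
   every entry not named here defaults to false. */
static const bool allowed[NUM_STATES][NUM_STATES] = {
    [NEW]     = { [READY]   = true },
    [READY]   = { [RUNNING] = true },
    [RUNNING] = { [READY] = true, [BLOCKED] = true, [HALTED] = true },
    [BLOCKED] = { [READY]   = true },
};

bool valid_transition(enum state from, enum state to) {
    assert(from < NUM_STATES && to < NUM_STATES);
    return allowed[from][to];
}
```

In this table, the only way into @RUNNING@ is from @READY@, so a blocked or new thread always passes through the scheduler before it runs.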
+When the workload exceeds the capacity of the processors, \ie work cannot be executed immediately, it is placed on a queue for subsequent service, called a \newterm{ready queue}.
+Ready queues organize threads for scheduling, which indirectly organizes the work to be performed.
+The structure of ready queues can take many different forms.
+Simple examples include the single-queue multi-server (SQMS) and the multi-queue multi-server (MQMS) structures.
+\begin{center}
+\begin{tabular}{l|l}
+\multicolumn{1}{c|}{\textbf{SQMS}} & \multicolumn{1}{c}{\textbf{MQMS}} \\
+\hline
+\raisebox{0.5\totalheight}{\input{SQMS.pstex_t}} & \input{MQMSG.pstex_t}
+\end{tabular}
+\end{center}
+Beyond these two schedulers are a host of options, \eg adding an optional global, shared queue to MQMS.
+The three major optimization criteria for a scheduler are:
+\begin{enumerate}[leftmargin=*]
+\item
+\newterm{load balancing}: available work is distributed so no processor is idle when work is available.
+\noindent
+Eventual progress for each work unit is often an important consideration, \ie no starvation.
+\item
+\newterm{affinity}: processors access state through a complex memory hierarchy, so it is advantageous to keep a work unit's state on a single or closely bound set of processors.
+\noindent
+Essentially, all multi-processor computers have non-uniform memory access (NUMA), with one or more quantized steps to access data at different levels in the memory hierarchy.
+When a system has a large number of independently executing threads, affinity becomes difficult because of \newterm{thread churn}.
+That is, threads must be scheduled on multiple processors to obtain high processor utilization because the number of threads $\ggg$ processors.
+\item
+\newterm{contention}: safe access of shared objects by multiple processors requires mutual exclusion in some form, generally locking\footnote{
+Lock-free data-structures do not involve locking but incur similar costs to achieve mutual exclusion.}.
+\noindent
+Mutual exclusion cost and latency increase significantly with the number of processors accessing a shared object.
+\end{enumerate}
+Nevertheless, schedulers are a series of compromises, occasionally with some static or dynamic tuning parameters to enhance specific patterns.
+Scheduling is a zero-sum game as computer processors normally have a fixed, maximum number of cycles per unit time\footnote{Frequency scaling and turbo boost add a degree of complexity that can be ignored in this discussion without loss of generality.}.
+SQMS has perfect load-balancing but poor affinity and high contention by the processors, because of the single queue.
+MQMS has poor load-balancing but perfect affinity and no contention, because each processor has its own queue.
+Significant research effort has also looked at load sharing/stealing among queues, when a ready queue is too long or short, respectively.
+These approaches attempt to perform better load-balancing at the cost of affinity and contention.
+Load sharing/stealing schedulers attempt to push/pull work units to/from other ready queues.
+Note however that while any change comes at a cost, hence the zero-sum game, not all compromises are necessarily equivalent.
+Some schedulers can perform very well only in very specific workload scenarios, while others might offer acceptable performance but be applicable to a wider range of workloads.
+Since \CFA attempts to improve the safety and productivity of C, the scheduler presented in this thesis attempts to achieve the same goals.
+More specifically, safety and productivity for scheduling means supporting a wide range of workloads so that programmers can rely on progress guarantees (safety) and more easily achieve acceptable performance (productivity).
+\section{Contributions}\label{s:Contributions}
+This work provides the following contributions in the area of user-level scheduling in an advanced programming-language runtime-system:
+\begin{enumerate}[leftmargin=*]
+\item
+A scalable scheduling algorithm that offers progress guarantees.
+\item
+An algorithm for load-balancing and idle sleep of processors, including NUMA awareness.
+\item
+Support for user-level \glsxtrshort{io} capabilities based on Linux's @io_uring@.
+\end{enumerate}

doc/theses/thierry_delisle_PhD/thesis/text/io.tex

-              r9e23b446
+              rffec1bf
 \chapter{User Level \io}
 As mentioned in Section~\ref{prev:io}, User-Level \io requires multiplexing the \io operations of many \glspl{thrd} onto fewer \glspl{proc} using asynchronous \io operations.
+As mentioned in Section~\ref{prev:io}, user-level \io requires multiplexing the \io operations of many \glspl{thrd} onto fewer \glspl{proc} using asynchronous \io operations.
 Different operating systems offer various forms of asynchronous operations and, as mentioned in Chapter~\ref{intro}, this work is exclusively focused on the Linux operating-system.
 \section{Kernel Interface}
 Since this work fundamentally depends on operating-system support, the first step of any design is to discuss the available interfaces and pick one (or more) as the foundations of the non-blocking \io subsystem.
+Since this work fundamentally depends on operating-system support, the first step of this design is to discuss the available interfaces and pick one (or more) as the foundation for the non-blocking \io subsystem in this work.
 \subsection{\lstinline{O_NONBLOCK}}
 …
 In this mode, ``Neither the @open()@ nor any subsequent \io operations on the [opened file descriptor] will cause the calling process to wait''~\cite{MAN:open}.
 This feature can be used as the foundation for the non-blocking \io subsystem.
 However, for the subsystem to know when an \io operation completes, @O_NONBLOCK@ must be use in conjunction with a system call that monitors when a file descriptor becomes ready, \ie, the next \io operation on it does not cause the process to wait
 \footnote{In this context, ready means \emph{some} operation can be performed without blocking.
+However, for the subsystem to know when an \io operation completes, @O_NONBLOCK@ must be used in conjunction with a system call that monitors when a file descriptor becomes ready, \ie, the next \io operation on it does not cause the process to wait.\footnote{
+In this context, ready means \emph{some} operation can be performed without blocking.
 It does not mean an operation returning \lstinline{EAGAIN} succeeds on the next try.
 For example, a ready read may only return a subset of bytes and the read must be issues again for the remaining bytes, at which point it may return \lstinline{EAGAIN}.}.
+For example, a ready read may only return a subset of requested bytes and the read must be issued again for the remaining bytes, at which point it may return \lstinline{EAGAIN}.}
 This mechanism is also crucial in determining when all \glspl{thrd} are blocked and the application \glspl{kthrd} can now block.
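As a minimal sketch of the behaviour described above, the following C code puts a descriptor into non-blocking mode and reports when a read would have blocked.
The helper names @set_nonblocking@ and @try_read@, and the return-code convention, are hypothetical; only @fcntl@, @read@, @O_NONBLOCK@ and @EAGAIN@ are existing interfaces.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Put fd in non-blocking mode; returns 0 on success, -1 on error. */
int set_nonblocking(int fd) {
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1) return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1 ? -1 : 0;
}

/* Attempt one read without waiting.
   Returns the number of bytes read (possibly fewer than len), 0 on end-of-file,
   -1 if the read would have blocked, -2 on any other error. */
long try_read(int fd, void *buf, size_t len) {
    ssize_t n = read(fd, buf, len);
    if (n >= 0) return (long)n;
    return (errno == EAGAIN || errno == EWOULDBLOCK) ? -1 : -2;
}
```

A result of @-1@ only says that no data is available now; it gives no indication of when to retry, which is why a readiness-monitoring call is still needed.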
 There are three options to monitor file descriptors in Linux
 \footnote{For simplicity, this section omits \lstinline{pselect} and \lstinline{ppoll}.
+There are three options to monitor file descriptors in Linux:\footnote{
+For simplicity, this section omits \lstinline{pselect} and \lstinline{ppoll}.
 The difference between these system calls and \lstinline{select} and \lstinline{poll}, respectively, is not relevant for this discussion.},
 @select@~\cite{MAN:select}, @poll@~\cite{MAN:poll} and @epoll@~\cite{MAN:epoll}.
 All three of these options offer a system call that blocks a \gls{kthrd} until at least one of many file descriptors becomes ready.
+The group of file descriptors being waited is called the \newterm{interest set}.
+\paragraph{\lstinline{select}} is the oldest of these options, it takes as an input a contiguous array of bits, where each bits represent a file descriptor of interest.
+On return, it modifies the set in place to identify which of the file descriptors changed status.
+This destructive change means that calling select in a loop requires re-initializing the array each time and the number of file descriptors supported has a hard limit.
+Another limit of @select@ is that once the call is started, the interest set can no longer be modified.
+Monitoring a new file descriptor generally requires aborting any in progress call to @select@
+\footnote{Starting a new call to \lstinline{select} is possible but requires a distinct kernel thread, and as a result is not an acceptable multiplexing solution when the interest set is large and highly dynamic unless the number of parallel calls to \lstinline{select} can be strictly bounded.}.
+\paragraph{\lstinline{poll}} is an improvement over select, which removes the hard limit on the number of file descriptors and the need to re-initialize the input on every call.
+It works using an array of structures as an input rather than an array of bits, thus allowing a more compact input for small interest sets.
+Like @select@, @poll@ suffers from the limitation that the interest set cannot be changed while the call is blocked.
+\paragraph{\lstinline{epoll}} further improves these two functions by allowing the interest set to be dynamically added to and removed from while a \gls{kthrd} is blocked on an @epoll@ call.
+The group of file descriptors being waited on is called the \newterm{interest set}.
+\paragraph{\lstinline{select}} is the oldest of these options, and takes as input a contiguous array of bits, where each bit represents a file descriptor of interest.
+Hence, the array length must be as long as the largest FD currently of interest.
+On return, it outputs the set in place to identify which of the file descriptors changed state.
+This destructive change means selecting in a loop requires re-initializing the array for each iteration.
+Another limit of @select@ is that calls from different \glspl{kthrd} sharing FDs are independent.
+Hence, if one \gls{kthrd} is managing the select calls, other threads can only add/remove to/from the manager's interest set through synchronized calls to update the interest set.
+However, these changes are only reflected when the manager makes its next call to @select@.
+Note, it is possible for the manager thread to never unblock if its current interest set never changes, \eg the sockets/pipes/ttys it is waiting on never get data again.
+Often the I/O manager has a timeout, polls, or is sent a signal on changes to mitigate this problem.
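The destructive interface of @select@ can be illustrated with a short C sketch that rebuilds the bit array on every call.
The wrapper name @wait_readable_select@ and its return convention are hypothetical, and the sketch monitors a single descriptor for readability only.

```c
#include <stddef.h>
#include <sys/select.h>

/* Returns 1 if fd is readable within timeout_ms, 0 on timeout, -1 on error. */
int wait_readable_select(int fd, int timeout_ms) {
    if (fd < 0 || fd >= FD_SETSIZE) return -1;  /* the bit array has a fixed capacity */

    fd_set readfds;
    struct timeval tv;

    FD_ZERO(&readfds);                /* select overwrites the set in place, */
    FD_SET(fd, &readfds);             /* so it is rebuilt before every call  */
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    /* The first argument is the highest descriptor of interest plus one. */
    int n = select(fd + 1, &readfds, NULL, NULL, &tv);
    if (n < 0) return -1;
    return FD_ISSET(fd, &readfds) ? 1 : 0;
}
```

The timeout argument plays the role of the mitigation mentioned above: a manager that passes a finite timeout eventually returns and can pick up changes to its interest set.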
+\begin{comment}
+From: Tim Brecht <brecht@uwaterloo.ca>
+Subject: Re: FD sets
+Date: Wed, 6 Jul 2022 00:29:41 +0000
+Large number of open files
+--------------------------
+In order to be able to use more than the default number of open file
+descriptors you may need to:
+o increase the limit on the total number of open files /proc/sys/fs/file-max
+  (on Linux systems)
+o increase the size of FD_SETSIZE
+  - the way I often do this is to figure out which include file __FD_SETSIZE
+    is defined in, copy that file into an appropriate directory in ./include,
+    and then modify it so that if you use -DBIGGER_FD_SETSIZE the larger size
+    gets used
+  For example on a RH 9.0 distribution I've copied
+  /usr/include/bits/typesizes.h into ./include/i386-linux/bits/typesizes.h
+  Then I modify typesizes.h to look something like:
+  #ifdef BIGGER_FD_SETSIZE
+  #define __FD_SETSIZE            32767
+  #else
+  #define __FD_SETSIZE            1024
+  #endif
+  Note that the since I'm moving and testing the userver on may different
+  machines the Makefiles are set up to use -I ./include/$(HOSTTYPE)
+  This way if you redefine the FD_SETSIZE it will get used instead of the
+  default original file.
+\end{comment}
+\paragraph{\lstinline{poll}} is the next oldest option, and takes as input an array of structures containing the FD numbers rather than their position in an array of bits, allowing a more compact input for interest sets that contain widely spaced FDs.
+For small interest sets with densely packed FDs, the @select@ bit mask can take less storage, and hence, copy less information into the kernel.
+Furthermore, @poll@ is non-destructive, so the array of structures does not have to be re-initialized on every call.
+Like @select@, @poll@ suffers from the limitation that the interest set cannot be changed by other \glspl{kthrd} while a manager thread is blocked in @poll@.
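A comparable C sketch for @poll@ shows the non-destructive interface: the requested events stay in the @events@ field and results are reported separately in @revents@, so the same array can be passed again unchanged.
The function name @readable_count@ is hypothetical.

```c
#include <poll.h>

/* Poll a caller-owned interest set once.
   Returns the number of entries reporting POLLIN, or -1 on error. */
int readable_count(struct pollfd *fds, nfds_t nfds, int timeout_ms) {
    int n = poll(fds, nfds, timeout_ms);   /* writes only the revents fields */
    if (n < 0) return -1;

    int readable = 0;
    for (nfds_t i = 0; i < nfds; i++)
        if (fds[i].revents & POLLIN) readable++;
    return readable;
}
```

Each entry names its descriptor explicitly, so an interest set containing descriptors 3 and 30000 needs two structures rather than a bit array spanning the whole range.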
+\paragraph{\lstinline{epoll}} follows after @poll@, and places the interest set in the kernel rather than the application, where it is managed by an internal \gls{kthrd}.
+There are two separate functions: one to add to the interest set and another to check for FDs with state changes.
 This dynamic capability is accomplished by creating an \emph{epoll instance} with a persistent interest set, which is used across multiple calls.
+This capability significantly reduces synchronization overhead on the part of the caller (in this case the \io subsystem), since the interest set can be modified when adding or removing file descriptors without having to synchronize with other \glspl{kthrd} potentially calling @epoll@.
+However, all three of these system calls have limitations.
+As the interest set is augmented, the changes become implicitly part of the interest set for a blocked manager \gls{kthrd}.
+This capability significantly reduces synchronization between \glspl{kthrd} and the manager calling @epoll@.
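The split between updating the interest set and waiting on it can be sketched in C as follows.
The wrapper names (@interest_create@, @interest_add@, @interest_remove@, @interest_wait_one@) are hypothetical; the underlying calls @epoll_create1@, @epoll_ctl@ and @epoll_wait@ are the Linux interface, and the sketch uses the default level-triggered mode and readability only.

```c
#include <stddef.h>
#include <sys/epoll.h>

/* Create an epoll instance holding a persistent, kernel-side interest set. */
int interest_create(void) {
    return epoll_create1(0);
}

/* Add fd to the interest set; this may be called while another
   kernel thread is blocked in epoll_wait on the same instance. */
int interest_add(int epfd, int fd) {
    struct epoll_event ev = { .events = EPOLLIN, .data = { .fd = fd } };
    return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}

/* Remove fd from the interest set. */
int interest_remove(int epfd, int fd) {
    return epoll_ctl(epfd, EPOLL_CTL_DEL, fd, NULL);
}

/* Wait up to timeout_ms for one ready descriptor.
   Returns that descriptor, or -1 on timeout or error. */
int interest_wait_one(int epfd, int timeout_ms) {
    struct epoll_event ev;
    int n = epoll_wait(epfd, &ev, 1, timeout_ms);
    return n == 1 ? ev.data.fd : -1;
}
```

Unlike the @select@ and @poll@ sketches, nothing describing the interest set is passed to the waiting call; it only receives space for the results.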
+However, all three of these I/O systems have limitations.
 The @man@ page for @O_NONBLOCK@ mentions that ``[@O_NONBLOCK@] has no effect for regular files and block devices'', which means none of these three system calls are viable multiplexing strategies for these types of \io operations.
 Furthermore, @epoll@ has been shown to have problems with pipes and ttys~\cit{Peter's examples in some fashion}.
 …
 It also supports batching multiple operations in a single system call.
 AIO offers two different approach to polling: @aio_error@ can be used as a spinning form of polling, returning @EINPROGRESS@ until the operation is completed, and @aio_suspend@ can be used similarly to @select@, @poll@ or @epoll@, to wait until one or more requests have completed.
+AIO offers two different approaches to polling: @aio_error@ can be used as a spinning form of polling, returning @EINPROGRESS@ until the operation is completed, and @aio_suspend@ can be used similarly to @select@, @poll@ or @epoll@, to wait until one or more requests have completed.
 For the purpose of \io multiplexing, @aio_suspend@ is the best interface.
 However, even if AIO requests can be submitted concurrently, @aio_suspend@ suffers from the same limitation as @select@ and @poll@, \ie, the interest set cannot be dynamically changed while a call to @aio_suspend@ is in progress.
 …
         \begin{flushright}
                 -- Linus Torvalds\cit{https://lwn.net/Articles/671657/}
+                -- Linus Torvalds~\cite{AIORant}
         \end{flushright}
 \end{displayquote}
 …
 A very recent addition to Linux, @io_uring@~\cite{MAN:io_uring}, is a framework that aims to solve many of the problems listed in the above interfaces.
 Like AIO, it represents \io operations as entries added to a queue.
 But like @epoll@, new requests can be submitted while a blocking call waiting for requests to complete is already in progress.
+But like @epoll@, new requests can be submitted, while a blocking call waiting for requests to complete, is already in progress.
 The @io_uring@ interface uses two ring buffers (referred to simply as rings) at its core: a submit ring to which programmers push \io requests and a completion ring from which programmers poll for completion.
 …
 In the worst case, where all \glspl{thrd} are consistently blocking on \io, it devolves into 1-to-1 threading.
 However, regardless of the frequency of \io operations, it achieves the fundamental goal of not blocking \glspl{proc} when \glspl{thrd} are ready to run.
 This approach is used by languages like Go\cit{Go} and frameworks like libuv\cit{libuv}, since it has the advantage that it can easily be used across multiple operating systems.
+This approach is used by languages like Go\cit{Go}, frameworks like libuv\cit{libuv}, and web servers like Apache~\cite{apache} and Nginx~\cite{nginx}, since it has the advantage that it can easily be used across multiple operating systems.
 This advantage is especially relevant for languages like Go, which offer a homogeneous \glsxtrshort{api} across all platforms.
 As opposed to C, which has a very limited standard api for \io, \eg, the C standard library has no networking.
 …
 \section{Event-Engine}
 An event engine's responsibility is to use the kernel interface to multiplex many \io operations onto few \glspl{kthrd}.
 In concrete terms, this means \glspl{thrd} enter the engine through an interface, the event engines then starts the operation and parks the calling \glspl{thrd}, returning control to the \gls{proc}.
+In concrete terms, this means \glspl{thrd} enter the engine through an interface, the event engine then starts an operation and parks the calling \glspl{thrd}, returning control to the \gls{proc}.
 The parked \glspl{thrd} are then rescheduled by the event engine once the desired operation has completed.
 …
 \begin{enumerate}
 \item
 An SQE is allocated from the pre-allocated array (denoted \emph{S} in Figure~\ref{fig:iouring}).
+An SQE is allocated from the pre-allocated array \emph{S}.
 This array is created at the same time as the @io_uring@ instance, is in kernel-locked memory visible by both the kernel and the application, and has a fixed size determined at creation.
+How these entries are allocated is not important for the functioning of @io_uring@, the only requirement is that no entry is reused before the kernel has consumed it.
+How these entries are allocated is not important for the functioning of @io_uring@;
+the only requirement is that no entry is reused before the kernel has consumed it.
 \item
 The SQE is filled according to the desired operation.
+This step is straight forward, the only detail worth mentioning is that SQEs have a @user_data@ field that must be filled in order to match submission and completion entries.
+This step is straightforward.
+The only detail worth mentioning is that SQEs have a @user_data@ field that must be filled in order to match submission and completion entries.
 \item
 The SQE is submitted to the submission ring by appending the index of the SQE to the ring following regular ring buffer steps: \lstinline{buffer[head] = item; head++}.
 Since the head is visible to the kernel, some memory barriers may be required to prevent the compiler from reordering these operations.
 Since the submission ring is a regular ring buffer, more than one SQE can be added at once and the head is updated only after all entries are updated.
+Note, SQEs can be filled and submitted in any order, \eg, in Figure~\ref{fig:iouring} the submission order is S0, S3, S2, and S1 has not been submitted.
 \item
 The kernel is notified of the change to the ring using the system call @io_uring_enter@.
 …
 The @io_uring_enter@ system call is protected by a lock inside the kernel.
 This protection means that concurrent calls to @io_uring_enter@ using the same instance are possible, but there is no performance gained from parallel calls to @io_uring_enter@.
+It is possible to do the first three submission steps in parallel, however, doing so requires careful synchronization.
+It is possible to do the first three submission steps in parallel;
+however, doing so requires careful synchronization.
 @io_uring@ also introduces constraints on the number of simultaneous operations that can be ``in flight''.
 Obviously, SQEs are allocated from a fixed-size array, meaning that there is a hard limit to how many SQEs can be submitted at once.
 In addition, the @io_uring_enter@ system call can fail because ``The  kernel [...] ran out of resources to handle [a request]'' or ``The application is attempting to overcommit the number of requests it can  have  pending.''.
+First, SQEs are allocated from a fixed-size array, meaning that there is a hard limit to how many SQEs can be submitted at once.
+Second, the @io_uring_enter@ system call can fail because ``The  kernel [...] ran out of resources to handle [a request]'' or ``The application is attempting to overcommit the number of requests it can have pending.''.
 This restriction means \io request bursts may have to be subdivided and submitted in chunks at a later time.
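The ring-buffer step of submission (append the SQE indices, then advance the index visible to the kernel) can be sketched in plain C11. This is a single-producer simulation under stated assumptions, not the @io_uring@ memory layout: the type @sub_ring@, the constant @RING_SIZE@, the field @consumed@ and the function @ring_push@ are all hypothetical, and the published index is called @head@ only to follow the wording of the text.

```c
#include <stdatomic.h>
#include <stdint.h>

#define RING_SIZE 8u   // hypothetical size, must be a power of two

struct sub_ring {
	uint32_t buffer[RING_SIZE];   // holds SQE indices, not the SQEs themselves
	_Atomic uint32_t head;        // producer index, read by the consumer
	_Atomic uint32_t consumed;    // consumer index, read by the producer
};

// Append count SQE indices, then publish all of them with one release store,
// so the consumer never observes an index before the entry behind it is written.
// Returns count on success, or 0 if the ring lacks space for the whole batch.
uint32_t ring_push(struct sub_ring *r, const uint32_t idx[], uint32_t count) {
	uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
	uint32_t consumed = atomic_load_explicit(&r->consumed, memory_order_acquire);
	if (RING_SIZE - (head - consumed) < count) return 0;
	for (uint32_t i = 0; i < count; i++)
		r->buffer[(head + i) & (RING_SIZE - 1)] = idx[i];
	atomic_store_explicit(&r->head, head + count, memory_order_release);
	return count;
}
```

The release store plays the role of the memory barrier mentioned in the submission steps: it orders the writes to @buffer@ before the index update, and it is performed once per batch rather than once per entry.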
 \subsection{Multiplexing \io: Submission}
 The submission side is the most complicated aspect of @io_uring@ and the completion side effectively follows from the design decisions made in the submission side.
+While it is possible to do the first steps of submission in parallel, the duration of the system call scales with number of entries submitted.
+While there is freedom in designing the submission side, there are some realities of @io_uring@ that must be taken into account.
+It is possible to do the first steps of submission in parallel;
+however, the duration of the system call scales with the number of entries submitted.
 The consequence is that the amount of parallelism used to prepare submissions for the next system call is limited.
 Beyond this limit, the length of the system call is the throughput limiting factor.
+I concluded from early experiments that preparing submissions seems to take at most as long as the system call itself, which means that with a single @io_uring@ instance, there is no benefit in terms of \io throughput to having more than two \glspl{hthrd}.
+Therefore the design of the submission engine must manage multiple instances of @io_uring@ running in parallel, effectively sharding @io_uring@ instances.
+Similarly to scheduling, this sharding can be done privately, \ie, one instance per \glspl{proc}, in decoupled pools, \ie, a pool of \glspl{proc} use a pool of @io_uring@ instances without one-to-one coupling between any given instance and any given \gls{proc}, or some mix of the two.
+Since completions are sent to the instance where requests were submitted, all instances with pending operations must be polled continously
+\footnote{As will be described in Chapter~\ref{practice}, this does not translate into constant cpu usage.}.
+I concluded from early experiments that preparing submissions seems to take almost as long as the system call itself, which means that with a single @io_uring@ instance, there is no benefit in terms of \io throughput to having more than two \glspl{hthrd}.
+Therefore, the design of the submission engine must manage multiple instances of @io_uring@ running in parallel, effectively sharding @io_uring@ instances.
+Since completions are sent to the instance where requests were submitted, all instances with pending operations must be polled continuously\footnote{
+As described in Chapter~\ref{practice}, this does not translate into constant CPU usage.}.
 Note that once an operation completes, there is nothing that ties it to the @io_uring@ instance that handled it.
 There is nothing preventing a new operation with, for example, the same file descriptors to a different @io_uring@ instance.
+There is nothing preventing a new operation with, \eg, the same file descriptors from being submitted to a different @io_uring@ instance.
 A complicating aspect of submission is @io_uring@'s support for chains of operations, where the completion of an operation triggers the submission of the next operation on the link.
 SQEs forming a chain must be allocated from the same instance and must be contiguous in the Submission Ring (see Figure~\ref{fig:iouring}).
+The consequence of this feature is that filling SQEs can be arbitrarly complex and therefore users may need to run arbitrary code between allocation and submission.
+Supporting chains is a requirement of the \io subsystem, but it is still valuable.
+Support for this feature can be fulfilled simply to supporting arbitrary user code between allocation and submission.
+\subsubsection{Public Instances}
+One approach is to have multiple shared instances.
+\Glspl{thrd} attempting \io operations pick one of the available instances and submit operations to that instance.
+Since there is no coupling between \glspl{proc} and @io_uring@ instances in this approach, \glspl{thrd} running on more than one \gls{proc} can attempt to submit to the same instance concurrently.
+Since @io_uring@ effectively sets the amount of sharding needed to avoid contention on its internal locks, performance in this approach is based on two aspects: the synchronization needed to submit does not induce more contention than @io_uring@ already does and the scheme to route \io requests to specific @io_uring@ instances does not introduce contention.
+This second aspect has an oversized importance because it comes into play before the sharding of instances, and as such, all \glspl{hthrd} can contend on the routing algorithm.
+Allocation in this scheme can be handled fairly easily.
+Free SQEs, \ie, SQEs that aren't currently being used to represent a request, can be written to safely and have a field called @user_data@ which the kernel only reads to copy to @cqe@s.
+Allocation also requires no ordering guarantee as all free SQEs are interchangeable.
+This requires a simple concurrent bag.
+The only added complexity is that the number of SQEs is fixed, which means allocation can fail.
+Allocation failures need to be pushed up to a routing algorithm: \glspl{thrd} attempting \io operations must not be directed to @io_uring@ instances without sufficient SQEs available.
+Furthermore, the routing algorithm should block operations up-front if none of the instances have available SQEs.
+Once an SQE is allocated, \glspl{thrd} can fill them normally, they simply need to keep track of the SQE index and which instance it belongs to.
+Once an SQE is filled in, what needs to happen is that the SQE must be added to the submission ring buffer, an operation that is not thread-safe on itself, and the kernel must be notified using the @io_uring_enter@ system call.
+The submission ring buffer is the same size as the pre-allocated SQE buffer, therefore pushing to the ring buffer cannot fail
+\footnote{This is because it is invalid to have the same \lstinline{sqe} multiple times in the ring buffer.}.
+However, as mentioned, the system call itself can fail with the expectation that it will be retried once some of the already submitted operations complete.
+Since multiple SQEs can be submitted to the kernel at once, it is important to strike a balance between batching and latency.
+Operations that are ready to be submitted should be batched together in few system calls, but at the same time, operations should not be left pending for long period of times before being submitted.
+This can be handled by either designating one of the submitting \glspl{thrd} as the being responsible for the system call for the current batch of SQEs or by having some other party regularly submitting all ready SQEs, \eg, the poller \gls{thrd} mentioned later in this section.
+In the case of designating a \gls{thrd}, ideally, when multiple \glspl{thrd} attempt to submit operations to the same @io_uring@ instance, all requests would be batched together and one of the \glspl{thrd} would do the system call on behalf of the others, referred to as the \newterm{submitter}.
+In practice however, it is important that the \io requests are not left pending indefinitely and as such, it may be required to have a ``next submitter'' that guarentees everything that is missed by the current submitter is seen by the next one.
+Indeed, as long as there is a ``next'' submitter, \glspl{thrd} submitting new \io requests can move on, knowing that some future system call will include their request.
+Once the system call is done, the submitter must also free SQEs so that the allocator can reused them.
+Finally, the completion side is much simpler since the @io_uring@ system call enforces a natural synchronization point.
+Polling simply needs to regularly do the system call, go through the produced CQEs and communicate the result back to the originating \glspl{thrd}.
+Since CQEs only own a signed 32 bit result, in addition to the copy of the @user_data@ field, all that is needed to communicate the result is a simple future~\cite{wiki:future}.
+If the submission side does not designate submitters, polling can also submit all SQEs as it is polling events.
+A simple approach to polling is to allocate a \gls{thrd} per @io_uring@ instance and simply let the poller \glspl{thrd} poll their respective instances when scheduled.
+With this pool of instances approach, the big advantage is that it is fairly flexible.
+It does not impose restrictions on what \glspl{thrd} submitting \io operations can and cannot do between allocations and submissions.
+It also can gracefully handle running out of ressources, SQEs or the kernel returning @EBUSY@.
+The down side to this is that many of the steps used for submitting need complex synchronization to work properly.
+The routing and allocation algorithm needs to keep track of which ring instances have available SQEs, block incoming requests if no instance is available, prevent barging if \glspl{thrd} are already queued up waiting for SQEs and handle SQEs being freed.
+The submission side needs to safely append SQEs to the ring buffer, correctly handle chains, make sure no SQE is dropped or left pending forever, notify the allocation side when SQEs can be reused and handle the kernel returning @EBUSY@.
+All this synchronization may have a significant cost and, compared to the next approach presented, this synchronization is entirely overhead.
+The consequence of this feature is that filling SQEs can be arbitrarily complex, and therefore, users may need to run arbitrary code between allocation and submission.
+Supporting chains is not a requirement of the \io subsystem, but it is still valuable.
+Support for this feature can be fulfilled simply by supporting arbitrary user code between allocation and submission.
+Similar to scheduling, sharding @io_uring@ instances can be done privately, \ie, one instance per \glspl{proc}, in decoupled pools, \ie, a pool of \glspl{proc} use a pool of @io_uring@ instances without one-to-one coupling between any given instance and any given \gls{proc}, or some mix of the two.
+These three sharding approaches are analyzed.
 \subsubsection{Private Instances}
+Another approach is to simply create one ring instance per \gls{proc}.
+This alleviates the need for synchronization on the submissions, requiring only that \glspl{thrd} are not interrupted in between two submission steps.
+This is effectively the same requirement as using @thread_local@ variables.
+Since SQEs that are allocated must be submitted to the same ring, on the same \gls{proc}, this effectively forces the application to submit SQEs in allocation order
+\footnote{The actual requirement is that \glspl{thrd} cannot context switch between allocation and submission.
+This requirement means that from the subsystem's point of view, the allocation and submission are sequential.
+To remove this requirement, a \gls{thrd} would need the ability to ``yield to a specific \gls{proc}'', \ie, park with the promise that it will be run next on a specific \gls{proc}, the \gls{proc} attached to the correct ring.}
+, greatly simplifying both allocation and submission.
+In this design, allocation and submission form a partitionned ring buffer as shown in Figure~\ref{fig:pring}.
+Once added to the ring buffer, the attached \gls{proc} has a significant amount of flexibility with regards to when to do the system call.
+Possible options are: when the \gls{proc} runs out of \glspl{thrd} to run, after running a given number of \glspl{thrd}, etc.
+The private approach creates one ring instance per \gls{proc}, \ie one-to-one coupling.
+This alleviates the need for synchronization on the submissions, requiring only that \glspl{thrd} are not time-sliced during submission steps.
+This requirement is the same as accessing @thread_local@ variables, where a \gls{thrd} is accessing kernel-thread data, is time-sliced, and continues execution on another kernel thread but is now accessing the wrong data.
+This failure is the serially reusable problem~\cite{SeriallyReusable}.
+Hence, allocated SQEs must be submitted to the same ring on the same \gls{proc}, which effectively forces the application to submit SQEs in allocation order.\footnote{
+To remove this requirement, a \gls{thrd} needs the ability to ``yield to a specific \gls{proc}'', \ie, park with the guarantee it unparks on a specific \gls{proc}, \ie the \gls{proc} attached to the correct ring.}
+From the subsystem's point of view, the allocation and submission are sequential, greatly simplifying both.
+In this design, allocation and submission form a partitioned ring buffer as shown in Figure~\ref{fig:pring}.
+Once added to the ring buffer, the attached \gls{proc} has a significant amount of flexibility with regards to when to perform the system call.
+Possible options are: when the \gls{proc} runs out of \glspl{thrd} to run, after running a given number of \glspl{thrd}, \etc.
 \begin{figure}
 …
 \end{figure}
+This approach has the advantage that it does not require much of the synchronization needed in the shared approach.
+This comes at the cost that \glspl{thrd} submitting \io operations have less flexibility, they cannot park or yield, and several exceptional cases are handled poorly.
+Instances running out of SQEs cannot run \glspl{thrd} wanting to do \io operations, in such a case the \gls{thrd} needs to be moved to a different \gls{proc}, the only current way of achieving this would be to @yield()@ hoping to be scheduled on a different \gls{proc}, which is not guaranteed.
+A more involved version of this approach can seem to solve most of these problems, using a pattern called \newterm{helping}.
+\Glspl{thrd} that wish to submit \io operations but cannot do so
+\footnote{either because of an allocation failure or because they were migrate to a different \gls{proc} between allocation and submission}
+create an object representing what they wish to achieve and add it to a list somewhere.
+For this particular problem, one solution would be to have a list of pending submissions per \gls{proc} and a list of pending allocations, probably per cluster.
+The problem with these ``solutions'' is that they are still bound by the strong coupling between \glspl{proc} and @io_uring@ instances.
+These data structures would allow moving \glspl{thrd} to a specific \gls{proc} when the current \gls{proc} cannot fulfill the \io request.
+Imagine a simple case with two \glspl{thrd} on two \glspl{proc}, one \gls{thrd} submits an \io operation and then sets a flag, the other \gls{thrd} spins until the flag is set.
+If the first \gls{thrd} is preempted between allocation and submission and moves to the other \gls{proc}, the original \gls{proc} could start running the spinning \gls{thrd}.
+If this happens, the helping ``solution'' is for the \io \gls{thrd}to added append an item to the submission list of the \gls{proc} where the allocation was made.
+This approach has the advantage that it does not require much of the synchronization needed in a shared approach.
+However, this benefit means \glspl{thrd} submitting \io operations have less flexibility: they cannot park or yield, and several exceptional cases are handled poorly.
+Instances running out of SQEs cannot run \glspl{thrd} wanting to do \io operations.
+In this case, the \io \gls{thrd} needs to be moved to a different \gls{proc}, and the only current way of achieving this is to @yield()@ hoping to be scheduled on a different \gls{proc} with free SQEs, which is not guaranteed.
+A more involved version of this approach tries to solve these problems using a pattern called \newterm{helping}.
+\Glspl{thrd} that cannot submit \io operations, either because of an allocation failure or migration to a different \gls{proc} between allocation and submission, create an \io object and add it to a list of pending submissions per \gls{proc} and a list of pending allocations, probably per cluster.
+While there is still the strong coupling between \glspl{proc} and @io_uring@ instances, these data structures allow moving \glspl{thrd} to a specific \gls{proc}, when the current \gls{proc} cannot fulfill the \io request.
+Imagine a simple scenario with two \glspl{thrd} on two \glspl{proc}, where one \gls{thrd} submits an \io operation and then sets a flag, while the other \gls{thrd} spins until the flag is set.
+Assume both \glspl{thrd} are running on the same \gls{proc}, and the \io \gls{thrd} is preempted between allocation and submission, moved to the second \gls{proc}, and the original \gls{proc} starts running the spinning \gls{thrd}.
+In this case, the helping solution has the \io \gls{thrd} append an \io object to the submission list of the first \gls{proc}, where the allocation was made.
 No other \gls{proc} can help the \gls{thrd} since @io_uring@ instances are strongly coupled to \glspl{proc}.
+However, in this case, the \gls{proc} is unable to help because it is executing the spinning \gls{thrd} mentioned when first expression this case
+\footnote{This particular example is completely artificial, but in the presence of many more \glspl{thrd}, it is not impossible that this problem would arise ``in the wild''.
+Furthermore, this pattern is difficult to reliably detect and avoid.}
+resulting in a deadlock.
+Once in this situation, the only escape is to interrupted the execution of the \gls{thrd}, either directly or due to regular preemption, only then can the \gls{proc} take the time to handle the pending request to help.
+Interrupting \glspl{thrd} for this purpose is far from desireable, the cost is significant and the situation may be hard to detect.
+However, a more subtle reason why interrupting the \gls{thrd} is not a satisfying solution is that the \gls{proc} is not actually using the instance it is tied to.
+If it were to use it, then helping could be done as part of the usage.
+However, the \io \gls{proc} is unable to help because it is executing the spinning \gls{thrd}, resulting in a deadlock.
+While this example is artificial, in the presence of many \glspl{thrd}, it is possible for this problem to arise ``in the wild''.
+Furthermore, this pattern is difficult to reliably detect and avoid.
+Once in this situation, the only escape is to interrupt the spinning \gls{thrd}, either directly or via some regular preemption, \eg time slicing.
+Having to interrupt \glspl{thrd} for this purpose is costly, the latency can be large between interrupts, and the situation may be hard to detect.
 Interrupts are needed here entirely because the \gls{proc} is tied to an instance it is not using.
+Therefore a more satisfying solution would be for the \gls{thrd} submitting the operation to simply notice that the instance is unused and simply go ahead and use it.
+This is the approach presented next.
+Therefore, a more satisfying solution is for the \gls{thrd} submitting the operation to notice that the instance is unused and simply go ahead and use it.
+This approach is presented shortly.
+\subsubsection{Public Instances}
+The public approach creates decoupled pools of @io_uring@ instances and processors, \ie without one-to-one coupling.
+\Glspl{thrd} attempting an \io operation pick one of the available instances and submit the operation to that instance.
+Since there is no coupling between @io_uring@ instances and \glspl{proc} in this approach, \glspl{thrd} running on more than one \gls{proc} can attempt to submit to the same instance concurrently.
+Because @io_uring@ effectively sets the amount of sharding needed to avoid contention on its internal locks, performance in this approach is based on two aspects:
+\begin{itemize}
+\item
+The synchronization needed to submit does not induce more contention than @io_uring@ already does.
+\item
+The scheme to route \io requests to specific @io_uring@ instances does not introduce contention.
+This aspect has an oversized importance because it comes into play before the sharding of instances, and as such, all \glspl{hthrd} can contend on the routing algorithm.
+\end{itemize}
+Allocation in this scheme is fairly easy.
+Free SQEs, \ie, SQEs that are not currently being used to represent a request, can be written to safely and have a field called @user_data@ that the kernel only reads to copy to @cqe@s.
+Allocation also requires no ordering guarantee as all free SQEs are interchangeable.
+The only added complexity is that the number of SQEs is fixed, which means allocation can fail.
+Allocation failures need to be pushed to a routing algorithm: \glspl{thrd} attempting \io operations must not be directed to @io_uring@ instances without sufficient SQEs available.
+Furthermore, the routing algorithm should block operations up-front, if none of the instances have available SQEs.
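Since free SQEs are interchangeable and allocation may fail, an unordered concurrent container suffices. The following is a minimal sketch of one possibility, an atomic bit mask, and is not claimed to be the allocator used by the \io subsystem. The names @sqe_pool@, @pool_init@, @sqe_alloc@ and @sqe_free@ and the constant @NUM_SQES@ are hypothetical, and @__builtin_ctzll@ is a GCC/Clang builtin rather than standard C.

```c
#include <stdatomic.h>
#include <stdint.h>

#define NUM_SQES 4   // hypothetical fixed pool size, at most 63 in this sketch

struct sqe_pool { _Atomic uint64_t free_mask; };   // bit i set means SQE i is free

void pool_init(struct sqe_pool *p) {
	atomic_init(&p->free_mask, (UINT64_C(1) << NUM_SQES) - 1);
}

// Claim any free SQE. Returns its index, or -1 when the pool is exhausted,
// in which case the caller must try another instance or block.
int sqe_alloc(struct sqe_pool *p) {
	uint64_t mask = atomic_load(&p->free_mask);
	while (mask != 0) {
		int idx = __builtin_ctzll(mask);            // lowest set bit (GCC/Clang builtin)
		uint64_t claimed = mask & ~(UINT64_C(1) << idx);
		if (atomic_compare_exchange_weak(&p->free_mask, &mask, claimed))
			return idx;
		// the failed exchange reloaded mask, so retry with the fresh value
	}
	return -1;
}

// Return an SQE to the pool once the kernel has consumed it.
void sqe_free(struct sqe_pool *p, int idx) {
	atomic_fetch_or(&p->free_mask, UINT64_C(1) << idx);
}
```

The failure return is the signal that has to be propagated to the routing algorithm, so that \glspl{thrd} are not directed to an instance without available SQEs.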
+Once an SQE is allocated, \glspl{thrd} insert the \io request information, and keep track of the SQE index and the instance it belongs to.
+Once an SQE is filled in, it is added to the submission ring buffer, an operation that is not thread-safe, and then the kernel must be notified using the @io_uring_enter@ system call.
+The submission ring buffer is the same size as the pre-allocated SQE buffer; therefore, pushing to the ring buffer cannot fail, because overflowing it would require the same \lstinline{sqe} to appear multiple times in the ring buffer, which is undefined behaviour.
+However, as mentioned, the system call itself can fail with the expectation that it can be retried once some submitted operations complete.
+Since multiple SQEs can be submitted to the kernel at once, it is important to strike a balance between batching and latency.
+Operations that are ready to be submitted should be batched together in few system calls, but at the same time, operations should not be left pending for long periods of time before being submitted.
+Balancing submission can be handled by either designating one of the submitting \glspl{thrd} as being responsible for the system call for the current batch of SQEs or by having some other party regularly submit all ready SQEs, \eg, the poller \gls{thrd} mentioned later in this section.
+Ideally, when multiple \glspl{thrd} attempt to submit operations to the same @io_uring@ instance, all requests should be batched together and one of the \glspl{thrd} is designated to do the system call on behalf of the others, called the \newterm{submitter}.
+However, in practice, \io requests must be handled promptly, so there is a need to guarantee everything missed by the current submitter is seen by the next one.
+Indeed, as long as there is a ``next'' submitter, \glspl{thrd} submitting new \io requests can move on, knowing that some future system call includes their request.
+Once the system call is done, the submitter must also free SQEs so that the allocator can reuse them.
+Finally, the completion side is much simpler since the @io_uring@ system-call enforces a natural synchronization point.
+Polling simply needs to regularly do the system call, go through the produced CQEs and communicate the result back to the originating \glspl{thrd}.
+Since CQEs only own a signed 32-bit result, in addition to the copy of the @user_data@ field, all that is needed to communicate the result is a simple future~\cite{wiki:future}.
+If the submission side does not designate submitters, polling can also submit all SQEs as it is polling events.
+A simple approach to polling is to allocate a \gls{thrd} per @io_uring@ instance and simply let the poller \glspl{thrd} poll their respective instances when scheduled.
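Matching a completion to its originating \gls{thrd} through @user_data@ can be sketched as follows. This is an illustration under stated assumptions rather than the runtime's implementation: @io_future@, @fake_cqe@, @future_to_user_data@, @complete@ and @future_poll@ are hypothetical names, @fake_cqe@ only mirrors the two CQE fields mentioned in the text, and a real waiter would park instead of polling a flag.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

// Hypothetical future carrying the signed 32-bit result of one operation.
struct io_future {
	_Atomic bool done;
	int32_t result;
};

// Stand-in for a completion entry: only the two fields discussed in the text.
struct fake_cqe {
	uint64_t user_data;
	int32_t res;
};

// Submission side: the value to store in the SQE's user_data field.
uint64_t future_to_user_data(struct io_future *f) {
	return (uint64_t)(uintptr_t)f;
}

// Poller side: recover the future from user_data and fulfil it.
void complete(const struct fake_cqe *cqe) {
	struct io_future *f = (struct io_future *)(uintptr_t)cqe->user_data;
	f->result = cqe->res;
	atomic_store_explicit(&f->done, true, memory_order_release);
}

// Waiter side: non-blocking check; writes the result to *out once available.
bool future_poll(struct io_future *f, int32_t *out) {
	if (!atomic_load_explicit(&f->done, memory_order_acquire)) return false;
	*out = f->result;
	return true;
}
```

Because the kernel copies @user_data@ verbatim from the SQE to the CQE, the poller needs no lookup table: the pointer stored at submission is the only state required to route the result.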
+With the pool of @io_uring@ instances approach, the big advantage is that it is fairly flexible.
+It does not impose restrictions on what \glspl{thrd} submitting \io operations can and cannot do between allocations and submissions.
+It also can gracefully handle running out of resources, SQEs or the kernel returning @EBUSY@.
+The down side to this approach is that many of the steps used for submitting need complex synchronization to work properly.
+The routing and allocation algorithm needs to keep track of which ring instances have available SQEs, block incoming requests if no instance is available, prevent barging if \glspl{thrd} are already queued up waiting for SQEs and handle SQEs being freed.
+The submission side needs to safely append SQEs to the ring buffer, correctly handle chains, make sure no SQE is dropped or left pending forever, notify the allocation side when SQEs can be reused, and handle the kernel returning @EBUSY@.
+All this synchronization has a significant cost, and compared to the private-instance approach, this synchronization is entirely overhead.
 \subsubsection{Instance borrowing}
+Both of the approaches presented above have undesirable aspects that stem from too loose or too tight coupling between @io_uring@ and \glspl{proc}.
+In the first approach, loose coupling meant that all operations have synchronization overhead that a tighter coupling can avoid.
+The second approach on the other hand suffers from tight coupling causing problems when the \gls{proc} do not benefit from the coupling.
+While \glspl{proc} are continously issuing \io operations tight coupling is valuable since it avoids synchronization costs.
+However, in unlikely failure cases or when \glspl{proc} are not making use of their instance, tight coupling is no longer advantageous.
+A compromise between these approaches would be to allow tight coupling but have the option to revoke this coupling dynamically when failure cases arise.
+I call this approach ``instance borrowing''\footnote{While it looks similar to work-sharing and work-stealing, I think it is different enough from either to warrant a different verb to avoid confusion.}.
+In this approach, each cluster owns a pool of @io_uring@ instances managed by an arbiter.
+Both of the prior approaches have undesirable aspects that stem from tight or loose coupling between @io_uring@ and \glspl{proc}.
+The first approach suffers from tight coupling causing problems when a \gls{proc} does not benefit from the coupling.
+The second approach suffers from loose coupling causing operations to have synchronization overhead, which tighter coupling avoids.
+When \glspl{proc} are continuously issuing \io operations, tight coupling is valuable since it avoids synchronization costs.
+However, in unlikely failure cases or when \glspl{proc} are not using their instances, tight coupling is no longer advantageous.
+A compromise between these approaches is to allow tight coupling but have the option to revoke the coupling dynamically when failure cases arise.
+I call this approach \newterm{instance borrowing}.\footnote{
+While instance borrowing looks similar to work sharing and stealing, I think it is different enough to warrant a different verb to avoid confusion.}
+In this approach, each cluster, see Figure~\ref{fig:system}, owns a pool of @io_uring@ instances managed by an \newterm{arbiter}.
 When a \gls{thrd} attempts to issue an \io operation, it asks for an instance from the arbiter and issues requests to that instance.
 However, in doing so it ties to the instance to the \gls{proc} it is currently running on.
 This coupling is kept until the arbiter decides to revoke it, taking back the instance and reverting the \gls{proc} to its initial state with respect to \io.
 This tight coupling means that synchronization can be minimal since only one \gls{proc} can use the instance at any given time, akin to the private instances approach.
 However, where it differs is that revocation from the arbiter means this approach does not suffer from the deadlock scenario described above.
+This instance is now bound to the \gls{proc} the \gls{thrd} is running on.
+This binding is kept until the arbiter decides to revoke it, taking back the instance and reverting the \gls{proc} to its initial state with respect to \io.
+This tight coupling means that synchronization can be minimal since only one \gls{proc} can use the instance at a time, akin to the private instances approach.
+However, it differs in that revocation by the arbiter means this approach does not suffer from the deadlock scenario described above.
 Arbitration is needed in the following cases:
 \begin{enumerate}
         \item The current \gls{proc} does not currently hold an instance.
+        \item The current \gls{proc} does not hold an instance.
         \item The current instance does not have sufficient SQEs to satisfy the request.
+        \item The current \gls{proc} has the wrong instance, this happens if the submitting \gls{thrd} context-switched between allocation and submission.
+        I will refer to these as \newterm{External Submissions}.
+        \item The current \gls{proc} has the wrong instance, which happens if the submitting \gls{thrd} context-switched between allocation and submission; these cases are called \newterm{external submissions}.
 \end{enumerate}
 However, even when the arbiter is not directly needed, \glspl{proc} need to make sure that their ownership of the instance is not being revoked.
+This can be accomplished by a lock-less handshake\footnote{Note that the handshake is not Lock-\emph{Free} since it lacks the proper progress guarantee.}.
+However, even when the arbiter is not directly needed, \glspl{proc} need to make sure that their instance ownership is not being revoked, which is accomplished by a lock-\emph{less} handshake.\footnote{
+Note the handshake is not lock \emph{free} since it lacks the proper progress guarantee.}
 A \gls{proc} raises a local flag before using its borrowed instance and checks if the instance is marked as revoked or if the arbiter has raised its flag.
 If not it proceeds, otherwise it delegates the operation to the arbiter.
+If not, it proceeds, otherwise it delegates the operation to the arbiter.
 Once the operation is completed, the \gls{proc} lowers its local flag.
 Correspondingly, before revoking an instance the arbiter marks the instance and then waits for the \gls{proc} using it to lower its local flag.
+Correspondingly, before revoking an instance, the arbiter marks the instance and then waits for the \gls{proc} using it to lower its local flag.
 Only then does it reclaim the instance and potentially assign it to another \gls{proc}.
 …
 \paragraph{External Submissions} are handled by the arbiter by revoking the appropriate instance and adding the submission to the submission ring.
 There is no need to immediately revoke the instance however.
+However, there is no need to immediately revoke the instance.
 External submissions must simply be added to the ring before the next system call, \ie, when the submission ring is flushed.
+This means that whoever is responsible for the system call first checks if the instance has any external submissions.
+If it is the case, it asks the arbiter to revoke the instance and add the external submissions to the ring.
+\paragraph{Pending Allocations} can be more complicated to handle.
+If the arbiter has available instances, the arbiter can attempt to directly hand over the instance and satisfy the request.
+Otherwise it must hold onto the list of threads until SQEs are made available again.
+This handling becomes that much more complex if pending allocation require more than one SQE, since the arbiter must make a decision between statisfying requests in FIFO ordering or satisfy requests for fewer SQEs first.
+While this arbiter has the potential to solve many of the problems mentionned in above, it also introduces a significant amount of complexity.
+This means whoever is responsible for the system call first checks if the instance has any external submissions.
+If so, it asks the arbiter to revoke the instance and add the external submissions to the ring.
+\paragraph{Pending Allocations} are handled by the arbiter when it has available instances and can directly hand over the instance and satisfy the request.
+Otherwise, it must hold onto the list of threads until SQEs are made available again.
+This handling is more complex when an allocation requires multiple SQEs, since the arbiter must decide between satisfying requests in FIFO order or satisfying requests for fewer SQEs first.
+While an arbiter has the potential to solve many of the problems mentioned above, it also introduces a significant amount of complexity.
 Tracking which processors are borrowing which instances and which instances have SQEs available ends up adding a significant synchronization prelude to any I/O operation.
 Any submission must start with a handshake that pins the currently borrowed instance, if available.
 An attempt to allocate is then made, but the arbiter can concurrently be attempting to allocate from the same instance from a different \gls{hthrd}.
 Once the allocation is completed, the submission must still check that the instance is still burrowed before attempt to flush.
 These extra synchronization steps end-up having a similar cost to the multiple shared instances approach.
+Once the allocation is completed, the submission must check that the instance is still borrowed before attempting to flush.
+These synchronization steps turn out to have a similar cost to the multiple shared-instances approach.
 Furthermore, if the number of instances does not match the number of processors actively submitting I/O, the system can fall into a state where instances are constantly being revoked and end up cycling the processors, which leads to significant cache deterioration.
 Because of these reasons, this approach, which sounds promising on paper, does not improve on the private instance approach in practice.
+For these reasons, this approach, which sounds promising on paper, does not improve on the private instance approach in practice.
 \subsubsection{Private Instances V2}
 % Verbs of this design
 % Allocation: obtaining an sqe from which to fill in the io request, enforces the io instance to use since it must be the one which provided the sqe. Must interact with the arbiter if the instance does not have enough sqe for the allocation. (Typical allocation will ask for only one sqe, but chained sqe must be allocated from the same context so chains of sqe must be allocated in bulks)
 % Submition: simply adds the sqe(s) to some data structure to communicate that they are ready to go. This operation can't fail because there are as many spots in the submit buffer than there are sqes. Must interact with the arbiter only if the thread was moved between the allocation and the submission.
+% Submission: simply adds the sqe(s) to some data structure to communicate that they are ready to go. This operation can't fail because there are as many spots in the submit buffer than there are sqes. Must interact with the arbiter only if the thread was moved between the allocation and the submission.
 % Flushing: Taking all the sqes that were submitted and making them visible to the kernel, also counting them in order to figure out what to_submit should be. Must be thread-safe with submission. Has to interact with the Arbiter if there are external submissions. Can't simply use a protected queue because adding to the array is not safe if the ring is still available for submitters. Flushing must therefore: check if there are external pending requests if so, ask the arbiter to flush otherwise use the fast flush operation.
 …
 % Handle: process all the produced cqe. No need to interact with any of the submission operations or the arbiter.
 …
 \section{Interface}
+Finally, the last important part of the \io subsystem is it's interface. There are multiple approaches that can be offered to programmers, each with advantages and disadvantages. The new \io subsystem can replace the C runtime's API or extend it. And in the later case the interface can go from very similar to vastly different. The following sections discuss some useful options using @read@ as an example. The standard Linux interface for C is :
+@ssize_t read(int fd, void *buf, size_t count);@
+The last important part of the \io subsystem is its interface.
+There are multiple approaches that can be offered to programmers, each with advantages and disadvantages.
+The new \io subsystem can replace the C runtime API or extend it, and in the latter case, the interface can go from very similar to vastly different.
+The following sections discuss some useful options using @read@ as an example.
+The standard Linux interface for C is:
+\begin{cfa}
+ssize_t read(int fd, void *buf, size_t count);
+\end{cfa}
 \subsection{Replacement}
 Replacing the C \glsxtrshort{api} is the more intrusive and draconian approach.
 The goal is to convince the compiler and linker to replace any calls to @read@ to direct them to the \CFA implementation instead of glibc's.
 This has the advantage of potentially working transparently and supporting existing binaries without needing recompilation.
+This rerouting has the advantage of working transparently and supporting existing binaries without needing recompilation.
 It also offers a, presumably, well known and familiar API that C programmers can simply continue to work with.
 However, this approach also entails a plethora of subtle technical challenges which generally boils down to making a perfect replacement.
+However, this approach also entails a plethora of subtle technical challenges, which generally boil down to making a perfect replacement.
 If the \CFA interface replaces only \emph{some} of the calls to glibc, then this can easily lead to esoteric concurrency bugs.
 Since the gcc ecosystems does not offer a scheme for such perfect replacement, this approach was rejected as being laudable but infeasible.
+Since the gcc ecosystem does not offer a scheme for perfect replacement, this approach was rejected as being laudable but infeasible.
 \subsection{Synchronous Extension}
+An other interface option is to simply offer an interface that is different in name only. For example:
+@ssize_t cfa_read(int fd, void *buf, size_t count);@
+\noindent This is much more feasible but still familiar to C programmers.
+It comes with the caveat that any code attempting to use it must be recompiled, which can be a big problem considering the amount of existing legacy C binaries.
+Another interface option is to offer an interface different in name only.
+For example:
+\begin{cfa}
+ssize_t cfa_read(int fd, void *buf, size_t count);
+\end{cfa}
+This approach is feasible and still familiar to C programmers.
+It comes with the caveat that any code attempting to use it must be recompiled, which is a problem considering the amount of existing legacy C binaries.
 However, it has the advantage of implementation simplicity.
+Finally, there is a certain irony to using a blocking synchronous interface for a feature often referred to as ``non-blocking'' \io.
 \subsection{Asynchronous Extension}
+It is important to mention that there is a certain irony to using only synchronous, therefore blocking, interfaces for a feature often referred to as ``non-blocking'' \io.
+A fairly traditional way of doing this is using futures\cit{wikipedia futures}.
+As simple way of doing so is as follows:
+@future(ssize_t) read(int fd, void *buf, size_t count);@
+\noindent Note that this approach is not necessarily the most idiomatic usage of futures.
+The definition of read above ``returns'' the read content through an output parameter which cannot be synchronized on.
+A more classical asynchronous API could look more like:
+@future([ssize_t, void *]) read(int fd, size_t count);@
+\noindent However, this interface immediately introduces memory lifetime challenges since the call must effectively allocate a buffer to be returned.
+Because of the performance implications of this, the first approach is considered preferable as it is more familiar to C programmers.
+\subsection{Interface directly to \lstinline{io_uring}}
+Finally, an other interface that can be relevant is to simply expose directly the underlying \texttt{io\_uring} interface. For example:
+@array(SQE, want) cfa_io_allocate(int want);@
+@void cfa_io_submit( const array(SQE, have) & );@
+\noindent This offers more flexibility to users wanting to fully use all of the \texttt{io\_uring} features.
+A fairly traditional way of providing asynchronous interactions is using a future mechanism~\cite{multilisp}, \eg:
+\begin{cfa}
+future(ssize_t) read(int fd, void *buf, size_t count);
+\end{cfa}
+where the generic @future@ is fulfilled when the read completes and it contains the number of bytes read, which may be less than the number of bytes requested.
+The data read is placed in @buf@.
+The problem is that both the bytes read and data form the synchronization object, not just the bytes read.
+Hence, the buffer cannot be reused until the operation completes but the synchronization does not cover the buffer.
+A classical asynchronous API is:
+\begin{cfa}
+future([ssize_t, void *]) read(int fd, size_t count);
+\end{cfa}
+where the future tuple covers the components that require synchronization.
+However, this interface immediately introduces memory lifetime challenges since the call must effectively allocate a buffer to be returned.
+Because of the performance implications of this API, the first approach is considered preferable as it is more familiar to C programmers.
+\subsection{Direct \lstinline{io_uring} Interface}
+The last interface directly exposes the underlying @io_uring@ interface, \eg:
+\begin{cfa}
+array(SQE, want) cfa_io_allocate(int want);
+void cfa_io_submit( const array(SQE, have) & );
+\end{cfa}
+where the generic @array@ contains an array of SQEs with a size that may be less than the request.
+This offers more flexibility to users wanting to fully utilize all of the @io_uring@ features.
 However, it is not the most user-friendly option.
+It obviously imposes a strong dependency between user code and \texttt{io\_uring} but at the same time restricting users to usages that are compatible with how \CFA internally uses \texttt{io\_uring}.
+It obviously imposes a strong dependency between user code and @io_uring@ while at the same time restricting users to usages that are compatible with how \CFA internally uses @io_uring@.

doc/theses/thierry_delisle_PhD/thesis/text/practice.tex

-              r9e23b446
+              rffec1bf
 \chapter{Scheduling in practice}\label{practice}
 The scheduling algorithm discribed in Chapter~\ref{core} addresses scheduling in a stable state.
 However, it does not address problems that occur when the system changes state.
+The scheduling algorithm described in Chapter~\ref{core} addresses scheduling in a stable state.
+This chapter addresses problems that occur when the system state changes.
 Indeed, the \CFA runtime supports expanding and shrinking the number of \procs, both manually and, to some extent, automatically.
+This entails that the scheduling algorithm must support these transitions.
+More precise \CFA supports adding \procs using the RAII object @processor@.
+These objects can be created at any time and can be destroyed at any time.
+They are normally created as automatic stack variables, but this is not a requirement.
+The consequence is that the scheduler and \io subsystems must support \procs comming in and out of existence.
+These changes affect the scheduling algorithm, which must dynamically alter its behaviour.
+In detail, \CFA supports adding \procs using the type @processor@, in both RAII and heap coding scenarios.
+\begin{cfa}
+{
+        processor p[4]; // 4 new kernel threads
+        ... // execute on 4 processors
+        processor * dp = new( processor, 6 ); // 6 new kernel threads
+        ... // execute on 10 processors
+        delete( dp );   // delete 6 kernel threads
+        ... // execute on 4 processors
+} // delete 4 kernel threads
+\end{cfa}
+Dynamically allocated processors can be deleted at any time, \ie their lifetime exceeds the block of creation.
+The consequence is that the scheduler and \io subsystems must know when these \procs come in and out of existence and roll them into the appropriate scheduling algorithms.
 \section{Manual Resizing}
 Manual resizing is expected to be a rare operation.
 Programmers are mostly expected to resize clusters on startup or teardown.
 Therefore dynamically changing the number of \procs is an appropriate moment to allocate or free resources to match the new state.
 As such all internal arrays that are sized based on the number of \procs need to be \texttt{realloc}ed.
 This also means that any references into these arrays, pointers or indexes, may need to be fixed when shrinking\footnote{Indexes may still need fixing when shrinkingbecause some indexes are expected to refer to dense contiguous resources and there is no guarantee the resource being removed has the highest index.}.
+Programmers normally create/delete processors on a cluster at startup/teardown.
+Therefore, dynamically changing the number of \procs is an appropriate moment to allocate or free resources to match the new state.
+As such, all internal scheduling arrays that are sized based on the number of \procs need to be @realloc@ed.
+This requirement also means any references into these arrays, \eg pointers or indexes, may need to be updated if elements are moved for compaction or for any other reason.
 There are no performance requirements, within reason, for resizing since it is expected to be rare.
 However, this operation has strict correctness requirements since shrinking and idle sleep can easily lead to deadlocks.
+However, this operation has strict correctness requirements since updating and idle sleep can easily lead to deadlocks.
 It should also avoid as much as possible any effect on performance when the number of \procs remains constant.
 This latter requirement prohibits naive solutions, like simply adding a global lock to the ready-queue arrays.
 \subsection{Read-Copy-Update}
+One solution is to use the Read-Copy-Update\cite{wiki:rcu} pattern.
+In this pattern, resizing is done by creating a copy of the internal data strucures, updating the copy with the desired changes, and then attempt an Idiana Jones Switch to replace the original witht the copy.
+This approach potentially has the advantage that it may not need any synchronization to do the switch.
+However, there is a race where \procs could still use the previous, original, data structure after the copy was switched in.
+This race not only requires some added memory reclamation scheme, it also requires that operations made on the stale original version be eventually moved to the copy.
+For linked-lists, enqueing is only somewhat problematic, \ats enqueued to the original queues need to be transferred to the new, which might not preserve ordering.
+Dequeing is more challenging.
+Dequeing from the original will not necessarily update the copy which could lead to multiple \procs dequeing the same \at.
+Fixing this requires more synchronization or more indirection on the queues.
+Another challenge is that the original must be kept until all \procs have witnessed the change.
+This is a straight forward memory reclamation challenge but it does mean that every operation will need \emph{some} form of synchronization.
+If each of these operation does need synchronization then it is possible a simpler solution achieves the same performance.
+Because in addition to the classic challenge of memory reclamation, transferring the original data to the copy before reclaiming it poses additional challenges.
+One solution is to use the Read-Copy-Update pattern~\cite{wiki:rcu}.
+In this pattern, resizing is done by creating a copy of the internal data structures, \eg see Figure~\ref{fig:base-ts2}, updating the copy with the desired changes, and then attempting an Indiana Jones Switch to replace the original with the copy.
+This approach has the advantage that it may not need any synchronization to do the switch.
+However, there is a race where \procs still use the original data structure after the copy is switched.
+This race not only requires adding a memory-reclamation scheme, it also requires that operations made on the stale original version are eventually moved to the copy.
+Specifically, the original data structure must be kept until all \procs have witnessed the change.
+This requirement is the \newterm{memory reclamation challenge} and means every operation needs \emph{some} form of synchronization.
+If all operations need synchronization, then the overall cost of this technique is likely to be similar to an uncontended lock approach.
+In addition to the classic challenge of memory reclamation, transferring the original data to the copy before reclaiming it poses additional challenges.
 Especially merging subqueues while having a minimal impact on fairness and locality.
+\subsection{Read-Writer Lock}
+A simpler approach would be to use a \newterm{Readers-Writer Lock}\cite{wiki:rwlock} where the resizing requires acquiring the lock as a writer while simply enqueing/dequeing \ats requires acquiring the lock as a reader.
+For example, given a linked-list, having a node enqueued onto the original and new list is not necessarily a problem depending on the chosen list structure.
+If the list supports arbitrary insertions, then inconsistencies in the tail pointer do not break the list;
+however, ordering may not be preserved.
+Furthermore, nodes enqueued to the original queues eventually need to be uniquely transferred to the new queues, which may further perturb ordering.
+Dequeuing is more challenging when nodes appear on both lists because of pending reclamation: dequeuing a node from one list does not remove it from the other nor is that node in the same place on the other list.
+This situation can lead to multiple \procs dequeuing the same \at.
+Fixing these challenges requires more synchronization or more indirection to the queues, plus coordinated searching to ensure unique elements.
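The copy-and-switch step of the Read-Copy-Update pattern can be sketched as follows. This is a minimal illustration under stated assumptions, not the \CFA implementation: @struct ready_queues@, @queues_new@ and @queues_resize@ are hypothetical names, the lanes are plain integers rather than queues of \ats, and a single resizer is assumed. The sketch deliberately stops at the switch, because deciding when the returned original can be freed is exactly the memory-reclamation challenge discussed above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>
#include <string.h>

// Hypothetical stand-in for the internal scheduling arrays.
struct ready_queues {
    size_t count;  // number of lanes, sized from the number of procs
    int *lanes;    // placeholder payload, one slot per lane
};

// Pointer that readers load to find the current version.
static _Atomic(struct ready_queues *) current_queues;

// Allocate a zero-initialized version with `count` lanes (count > 0).
static struct ready_queues *queues_new(size_t count) {
    struct ready_queues *q = malloc(sizeof *q);
    assert(q != NULL);
    q->count = count;
    q->lanes = calloc(count, sizeof *q->lanes);
    assert(q->lanes != NULL);
    return q;
}

// Copy the current version, apply the resize to the copy, then switch.
// Returns the stale original; it must NOT be freed until every reader
// has witnessed the change, which this sketch does not track.
static struct ready_queues *queues_resize(size_t new_count) {
    struct ready_queues *orig = atomic_load(&current_queues);
    struct ready_queues *copy = queues_new(new_count);
    size_t keep = orig->count < new_count ? orig->count : new_count;
    memcpy(copy->lanes, orig->lanes, keep * sizeof *copy->lanes);
    return atomic_exchange(&current_queues, copy);  // the switch itself
}
```

A reader that loaded @current_queues@ just before the exchange keeps operating on the stale version, which is the race described above.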
+\subsection{Readers-Writer Lock}
+A simpler approach is to use a \newterm{Readers-Writer Lock}~\cite{wiki:rwlock}, where the resizing requires acquiring the lock as a writer while simply enqueueing/dequeuing \ats requires acquiring the lock as a reader.
 Using a Readers-Writer lock solves the problem of dynamically resizing and leaves the challenge of finding or building a lock with sufficiently good read-side performance.
+Since this is not a very complex challenge and an ad-hoc solution is perfectly acceptable, building a Readers-Writer lock was the path taken.
+To maximize reader scalability, the readers should not contend with eachother when attempting to acquire and release the critical sections.
+This effectively requires that each reader have its own piece of memory to mark as locked and unlocked.
+Reades then acquire the lock wait for writers to finish the critical section and then acquire their local spinlocks.
+Writers acquire the global lock, so writers have mutual exclusion among themselves, and then acquires each of the local reader locks.
+Acquiring all the local locks guarantees mutual exclusion between the readers and the writer, while the wait on the read side prevents readers from continously starving the writer.
+\todo{reference listings}
+\begin{lstlisting}
+Since this approach is not a very complex challenge and an ad-hoc solution is perfectly acceptable, building a Readers-Writer lock was the path taken.
+To maximize reader scalability, readers should not contend with each other when attempting to acquire and release a critical section.
+Achieving this goal requires each reader to have its own memory to mark as locked and unlocked.
+The read acquire possibly waits for a writer to finish the critical section and then acquires a reader's local spinlock.
+The write acquire acquires the global lock, guaranteeing mutual exclusion among writers, and then acquires each of the local reader locks.
+Acquiring all the local read locks guarantees mutual exclusion among the readers and the writer, while the wait on the read side prevents readers from continuously starving the writer.
+Figure~\ref{f:SpecializedReadersWriterLock} shows the outline for this specialized readers-writer lock.
+The lock is nonblocking, so both readers and writers spin while the lock is held.
+\todo{finish explanation}
+\begin{figure}
+\begin{cfa}
 void read_lock() {
         // Step 1 : make sure no writers in
         while write_lock { Pause(); }
-        // May need fence here
         // Step 2 : acquire our local lock
+        while atomic_xchg( tls.lock ) {
+                Pause();
+        }
+}
+        while atomic_xchg( tls.lock ) { Pause(); }
+}
 void read_unlock() {
         tls.lock = false;
+}
-\end{lstlisting}
-\begin{lstlisting}
 void write_lock()  {
         // Step 1 : lock global lock
+        while atomic_xchg( write_lock ) {
+                Pause();
+        }
+        while atomic_xchg( write_lock ) { Pause(); }
         // Step 2 : lock per-proc locks
         for t in all_tls {
+                while atomic_xchg( t.lock ) {
+                        Pause();
+                }
+                while atomic_xchg( t.lock ) { Pause(); }
+        }
+}
 void write_unlock() {
         // Step 1 : release local locks
+        for t in all_tls {
+                t.lock = false;
+        }
+        for t in all_tls { t.lock = false; }
         // Step 2 : release global lock
         write_lock = false;
+}
+\end{lstlisting}
+\section{Idle-Sleep}
+In addition to users manually changing the number of \procs, it is desireable to support ``removing'' \procs when there is not enough \ats for all the \procs to be useful.
+While manual resizing is expected to be rare, the number of \ats is expected to vary much more which means \procs may need to be ``removed'' for only short periods of time.
+Furthermore, race conditions that spuriously lead to the impression that no \ats are ready are actually common in practice.
+Therefore resources associated with \procs should not be freed but \procs simply put into an idle state where the \gls{kthrd} is blocked until more \ats become ready.
+This state is referred to as \newterm{Idle-Sleep}.
+\end{cfa}
+\caption{Specialized Readers-Writer Lock}
+\label{f:SpecializedReadersWriterLock}
+\end{figure}
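The outline in the figure can be turned into compilable C11 along the following lines. This is a sketch under assumptions rather than the runtime's code: the thread-local lock is replaced by an array indexed by a hypothetical processor id, @NPROCS@ is a fixed illustrative bound, @writer_flag@ stands in for the global writer lock, and @cpu_pause@ is an empty placeholder for the pause instruction.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NPROCS 4  // hypothetical fixed number of readers (procs)

static atomic_bool writer_flag;         // global lock taken by writers
static atomic_bool local_lock[NPROCS];  // one lock per reader

// Placeholder for a spin-wait hint such as the x86 pause instruction.
static void cpu_pause(void) {}

void read_lock(int id) {
    assert(id >= 0 && id < NPROCS);
    // Step 1: wait until no writer holds or is acquiring the lock,
    // which prevents readers from continuously starving a writer.
    while (atomic_load(&writer_flag)) { cpu_pause(); }
    // Step 2: acquire this reader's local lock.
    while (atomic_exchange(&local_lock[id], true)) { cpu_pause(); }
}

void read_unlock(int id) {
    assert(id >= 0 && id < NPROCS);
    atomic_store(&local_lock[id], false);
}

void write_lock(void) {
    // Step 1: acquire the global lock, excluding other writers.
    while (atomic_exchange(&writer_flag, true)) { cpu_pause(); }
    // Step 2: acquire every local lock, excluding all readers.
    for (int i = 0; i < NPROCS; i++) {
        while (atomic_exchange(&local_lock[i], true)) { cpu_pause(); }
    }
}

void write_unlock(void) {
    // Step 1: release the local locks.
    for (int i = 0; i < NPROCS; i++) { atomic_store(&local_lock[i], false); }
    // Step 2: release the global lock.
    atomic_store(&writer_flag, false);
}
```

In a real implementation each local lock would likely sit on its own cache line so that readers do not contend through false sharing; the dense array here is only for brevity.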
+\section{Idle-Sleep}\label{idlesleep}
+While manual resizing of \procs is expected to be rare, the number of \ats can vary significantly over an application's lifetime, which means there are times when there are too few or too many \procs.
+For this work, it is the programmer's responsibility to manually create \procs, so if there are too few \procs, the application must address this issue.
+This leaves too many \procs when there are not enough \ats for all the \procs to be useful.
+These idle \procs cannot be removed because their lifetime is controlled by the application, and only the application knows when the number of \ats may increase or decrease.
+While idle \procs can spin until work appears, this approach wastes energy, unnecessarily produces heat and prevents other applications from using the processor.
+Therefore, idle \procs are put into an idle state, called \newterm{Idle-Sleep}, where the \gls{kthrd} is blocked until the scheduler deems it is needed.
 Idle sleep effectively encompasses several challenges.
+First some data structure needs to keep track of all \procs that are in idle sleep.
+Because of idle sleep can be spurious, this data structure has strict performance requirements in addition to the strict correctness requirements.
+Next, some tool must be used to block kernel threads \glspl{kthrd}, \eg \texttt{pthread\_cond\_wait}, pthread semaphores.
+The complexity here is to support \at parking and unparking, timers, \io operations and all other \CFA features with minimal complexity.
+Finally, idle sleep also includes a heuristic to determine the appropriate number of \procs to be in idle sleep an any given time.
+This third challenge is however outside the scope of this thesis because developping a general heuristic is involved enough to justify its own work.
+The \CFA scheduler simply follows the ``Race-to-Idle'\cit{https://doi.org/10.1137/1.9781611973099.100}' approach where a sleeping \proc is woken any time an \at becomes ready and \procs go to idle sleep anytime they run out of work.
+First, a data structure needs to keep track of all \procs that are in idle sleep.
+Because idle sleep is spurious, this data structure has strict performance requirements, in addition to strict correctness requirements.
+Next, some mechanism is needed to block \glspl{kthrd}, \eg @pthread_cond_wait@ or a pthread semaphore.
+The complexity here is to support \at parking and unparking, user-level locking, timers, \io operations, and all other \CFA features with minimal complexity.
+Finally, the scheduler needs a heuristic to determine when to block and unblock an appropriate number of \procs.
+However, this third challenge is outside the scope of this thesis because developing a general heuristic is complex enough to justify its own work.
+Therefore, the \CFA scheduler simply follows the ``Race-to-Idle''~\cite{Albers12} approach where a sleeping \proc is woken any time a \at becomes ready and \procs go to idle sleep anytime they run out of work.
+An interesting sub-part of this heuristic is what to do with bursts of \ats that become ready.
+Since waking up a sleeping \proc can have notable latency, it is possible multiple \ats become ready while a single \proc is waking up.
+This fact raises the question: if many \procs are available, how many should be woken?
+If the ready \ats will run longer than the wake-up latency, waking one \proc per \at will offer maximum parallelisation.
+If the ready \ats will run for a very short time, waking many \procs may be wasteful.
+As mentioned, a heuristic to handle these complex cases is outside the scope of this thesis, so the behaviour of the scheduler in this particular case is left unspecified.
 \section{Sleeping}
 As usual, the corner-stone of any feature related to the kernel is the choice of system call.
+In terms of blocking a \gls{kthrd} until some event occurs the linux kernel has many available options:
+\paragraph{\texttt{pthread\_mutex}/\texttt{pthread\_cond}}
+The most classic option is to use some combination of \texttt{pthread\_mutex} and \texttt{pthread\_cond}.
+These serve as straightforward mutual exclusion and synchronization tools and allow a \gls{kthrd} to wait on a \texttt{pthread\_cond} until signalled.
+While this approach is generally perfectly appropriate for \glspl{kthrd} waiting after each other, \io operations do not signal \texttt{pthread\_cond}s.
+For \io results to wake a \proc waiting on a \texttt{pthread\_cond} means that a different \gls{kthrd} must be woken up first, and then the \proc can be signalled.
+\subsection{\texttt{io\_uring} and Epoll}
+An alternative is to flip the problem on its head and block waiting for \io, using \texttt{io\_uring} or even \texttt{epoll}.
+This creates the inverse situation, where \io operations directly wake sleeping \procs but waking a \proc from a running \gls{kthrd} must use an indirect scheme.
+This generally takes the form of creating a file descriptor, \eg, a dummy file, a pipe or an event fd, and using that file descriptor when \procs need to wake each other.
+This leads to additional complexity because there can be a race between these artificial \io operations and genuine \io operations.
+If not handled correctly, this can lead to the artificial files going out of sync.
+In terms of blocking a \gls{kthrd} until some event occurs, the Linux kernel has many available options.
+\subsection{\lstinline{pthread_mutex}/\lstinline{pthread_cond}}
+The classic option is to use some combination of the pthread mutual exclusion and synchronization locks, allowing a safe park/unpark of a \gls{kthrd} to/from a @pthread_cond@.
+While this approach works for \glspl{kthrd} waiting among themselves, \io operations do not provide a mechanism to signal @pthread_cond@s.
+For \io results to wake a \proc waiting on a @pthread_cond@ means a different \gls{kthrd} must be woken up first, which then signals the \proc.
+\subsection{\lstinline{io_uring} and Epoll}
+An alternative is to flip the problem on its head and block waiting for \io, using @io_uring@ or @epoll@.
+This creates the inverse situation, where \io operations directly wake sleeping \procs but waking blocked \procs must use an indirect scheme.
+This generally takes the form of creating a file descriptor, \eg, dummy file, pipe, or event fd, and using that file descriptor when \procs need to wake each other.
+This leads to additional complexity because there can be a race between these artificial \io and genuine \io operations.
+If not handled correctly, this can lead to artificial files getting delayed too long behind genuine files, resulting in longer latency.
 \subsection{Event FDs}
 Another interesting approach is to use an event file descriptor\cit{eventfd}.
+This is a Linux feature that is a file descriptor that behaves like \io, \ie, uses \texttt{read} and \texttt{write}, but also behaves like a semaphore.
+Indeed, all reads and writes must use 64-bit values\footnote{On 64-bit Linux; a 32-bit Linux would use 32-bit values.}.
+Writes add their values to the buffer, that is arithmetic addition and not buffer append, and reads zero out the buffer and return the buffer values so far\footnote{This is without the \texttt{EFD\_SEMAPHORE} flag. This flag changes the behavior of \texttt{read} but is not needed for this work.}.
+This Linux feature is a file descriptor that behaves like \io, \ie, uses @read@ and @write@, but also behaves like a semaphore.
+Indeed, all reads and writes must use 8-byte values, \ie an unsigned 64-bit integer.
+Writes \emph{add} their values to a buffer using arithmetic addition rather than buffer append, and reads zero out the buffer and return the buffer value so far.\footnote{
+This behaviour is without the \lstinline{EFD_SEMAPHORE} flag, which changes the behaviour of \lstinline{read} but is not needed for this work.}
 If a read is made while the buffer is already 0, the read blocks until a non-0 value is added.
+What makes this feature particularly interesting is that \texttt{io\_uring} supports the \texttt{IORING\_REGISTER\_EVENTFD} command, to register an event fd to a particular instance.
+Once that instance is registered, any \io completion will result in \texttt{io\_uring} writing to the event FD.
+This means that a \proc waiting on the event FD can be \emph{directly} woken up by either other \procs or incoming \io.
+What makes this feature particularly interesting is that @io_uring@ supports the @IORING_REGISTER_EVENTFD@ command to register an event @fd@ to a particular instance.
+Once that instance is registered, any \io completion results in @io_uring@ writing to the event @fd@.
+This means that a \proc waiting on the event @fd@ can be \emph{directly} woken up by either other \procs or incoming \io.
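The accumulate-then-drain behaviour of an event @fd@ described above can be illustrated with a minimal sketch, assuming Linux and the default (non-@EFD_SEMAPHORE@) mode. The helper names @efd_post@, @efd_wait@ and @efd_demo@ are hypothetical and are not part of the \CFA runtime; only @eventfd@, @read@, @write@ and @close@ are real system interfaces.

```c
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Hypothetical helper: add `value` to the event fd counter.
 * Returns 0 on success and -1 on error. */
static int efd_post(int fd, uint64_t value) {
    return write(fd, &value, sizeof(value)) == (ssize_t)sizeof(value) ? 0 : -1;
}

/* Hypothetical helper: block until the counter is non-zero, then return
 * the accumulated value; the kernel resets the counter to zero. */
static uint64_t efd_wait(int fd) {
    uint64_t total = 0;
    if (read(fd, &total, sizeof(total)) != (ssize_t)sizeof(total)) return 0;
    return total;
}

/* Two writes are summed, and a single read consumes both of them. */
static uint64_t efd_demo(void) {
    int fd = eventfd(0, 0);          /* counter starts at 0, no flags */
    if (fd < 0) return 0;
    efd_post(fd, 2);
    efd_post(fd, 3);
    uint64_t seen = efd_wait(fd);    /* does not block: counter is 5 */
    close(fd);
    return seen;
}
```

Because one read drains every earlier write, several wake-up requests aimed at the same sleeper collapse into a single wake-up.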
+\section{Tracking Sleepers}
+Tracking which \procs are in idle sleep requires a data structure holding all the sleeping \procs, but more importantly it requires a concurrent \emph{handshake} so that no \at is stranded on a ready-queue with no active \proc.
+The classic challenge occurs when a \at is made ready while a \proc is going to sleep: there is a race where the new \at may not see the sleeping \proc and the sleeping \proc may not see the ready \at.
+Since \ats can be made ready by timers, \io operations, or other events outside a cluster, this race can occur even if the \proc going to sleep is the only \proc awake.
+As a result, improper handling of this race leads to all \procs going to sleep when there are ready \ats and the system deadlocks.
+The handshake closing the race is done with both the notifier and the idle \proc executing two ordered steps.
+The notifier first makes sure the newly ready \at is visible to \procs searching for \ats, and then attempts to notify an idle \proc.
+On the other side, \procs make themselves visible as idle \procs and then search for any \ats they may have missed.
+Unlike regular work-stealing, this search must be exhaustive to make sure that no pre-existing \at is missed.
+These steps from both sides ensure that if the search misses a newly ready \at, then the notifier is guaranteed to see at least one idle \proc.
+Conversely, if the notifier does not see any idle \proc, then a \proc is guaranteed to find the new \at in its exhaustive search.
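The ordering of the two steps on each side can be sketched as follows. This is a deliberately simplified model, not the \CFA implementation: the ready-queues and the idle list are reduced to two hypothetical sequentially-consistent counters, and the names @cluster_state@, @notifier_must_wake@ and @proc_may_sleep@ are assumptions made for illustration.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a cluster: counts instead of real queues. */
struct cluster_state {
    atomic_int ready_count;  /* ready threads visible to searching procs */
    atomic_int idle_count;   /* procs that have announced they are idle  */
};

/* Notifier side: publish the ready thread first, then look for sleepers.
 * Returns true when an idle proc must be notified. */
static bool notifier_must_wake(struct cluster_state *c) {
    atomic_fetch_add(&c->ready_count, 1);     /* step 1: make work visible */
    return atomic_load(&c->idle_count) > 0;   /* step 2: check for sleepers */
}

/* Proc side: announce idleness first, then search exhaustively.
 * Returns true when the proc may go to sleep. */
static bool proc_may_sleep(struct cluster_state *c) {
    atomic_fetch_add(&c->idle_count, 1);      /* step 1: become visible */
    if (atomic_load(&c->ready_count) > 0) {   /* step 2: exhaustive search */
        atomic_fetch_sub(&c->idle_count, 1);  /* found work: no longer idle */
        return false;
    }
    return true;
}

/* Proc goes idle first: the later notifier sees it (returns 1). */
static int handshake_sleeper_first(void) {
    struct cluster_state c;
    atomic_init(&c.ready_count, 0);
    atomic_init(&c.idle_count, 0);
    bool slept = proc_may_sleep(&c);
    bool wake = notifier_must_wake(&c);
    return slept && wake;
}

/* Notifier publishes first: the proc finds the work and stays awake. */
static int handshake_notifier_first(void) {
    struct cluster_state c;
    atomic_init(&c.ready_count, 0);
    atomic_init(&c.idle_count, 0);
    bool wake = notifier_must_wake(&c);
    bool slept = proc_may_sleep(&c);
    return !wake && !slept;
}
```

Since each side writes its own counter before reading the other, the two sides cannot both miss each other under sequential consistency, which is the guarantee the handshake relies on.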
+Furthermore, the ``Race-to-Idle'' approach means that there may be contention on the data structure tracking sleepers.
+Contention can be tolerated for \procs attempting to sleep or wake-up because these \procs are not doing useful work, and therefore, not contributing to overall performance.
+However, notifying, checking if a \proc must be woken-up, and doing so if needed, can significantly affect overall performance and must be low cost.
+\subsection{Sleepers List}
+Each cluster maintains a list of idle \procs, organized as a stack.
+This ordering allows \procs at the tail to stay in idle sleep for extended periods of time while those at the head of the list wake up for bursts of activity.
+Because of unbalanced performance requirements, the algorithm tracking sleepers is designed to have idle \procs handle as much of the work as possible.
+The idle \procs maintain the stack of sleepers among themselves and notifying a sleeping \proc takes as little work as possible.
+This approach means that maintaining the list is fairly straightforward.
+The list can simply use a single lock per cluster and only \procs that are getting in and out of the idle state contend for that lock.
+This approach also simplifies notification.
+Indeed, \procs not only need to be notified when a new \at is readied, but also must be notified during manual resizing, so the \gls{kthrd} can be joined.
+These requirements mean whichever entity removes idle \procs from the sleeper list must be able to do so in any order.
+Using a simple lock over this data structure makes the removal much simpler than using a lock-free data structure.
+The single lock also means the notification process simply needs to wake-up the desired idle \proc, using @pthread_cond_signal@, @write@ on an @fd@, \etc, and the \proc handles the rest.
+\subsection{Reducing Latency}
+As mentioned in this section, \procs going to sleep for extremely short periods of time is likely in certain scenarios.
+Therefore, the latency of doing a system call to read from and write to an event @fd@ can negatively affect overall performance in a notable way.
+Hence, it is important to reduce latency and contention of the notification as much as possible.
+Figure~\ref{fig:idle1} shows the basic idle-sleep data structure.
+For the notifiers, this data structure can cause contention on the lock and the event @fd@ syscall can cause notable latency.
 \begin{figure}
 …
         \input{idle1.pstex_t}
         \caption[Basic Idle Sleep Data Structure]{Basic Idle Sleep Data Structure \smallskip\newline Each idle \proc is put onto a doubly-linked stack protected by a lock.
         Each \proc has a private event FD.}
+        Each \proc has a private event \lstinline{fd}.}
         \label{fig:idle1}
 \end{figure}
+\section{Tracking Sleepers}
+Tracking which \procs are in idle sleep requires a data structure holding all the sleeping \procs, but more importantly it requires a concurrent \emph{handshake} so that no \at is stranded on a ready-queue with no active \proc.
+The classic challenge occurs when a \at is made ready while a \proc is going to sleep: there is a race where the new \at may not see the sleeping \proc and the sleeping \proc may not see the ready \at.
+Since \ats can be made ready by timers, \io operations or other events outside a cluster, this race can occur even if the \proc going to sleep is the only \proc awake.
+As a result, improper handling of this race can lead to all \procs going to sleep and the system deadlocking.
+Furthermore, the ``Race-to-Idle'' approach means that there may be contention on the data structure tracking sleepers.
+Contention slowing down \procs attempting to sleep or wake-up can be tolerated.
+These \procs are not doing useful work and therefore not contributing to overall performance.
+However, notifying, checking if a \proc must be woken-up and doing so if needed, can significantly affect overall performance and must be low cost.
+\subsection{Sleepers List}
+Each cluster maintains a list of idle \procs, organized as a stack.
+This ordering hopefully allows \procs at the tail to stay in idle sleep for extended periods of time.
+Because of these unbalanced performance requirements, the algorithm tracking sleepers is designed to have idle \procs handle as much of the work as possible.
+The idle \procs maintain the stack of sleepers among themselves and notifying a sleeping \proc takes as little work as possible.
+This approach means that maintaining the list is fairly straightforward.
+The list can simply use a single lock per cluster and only \procs that are getting in and out of idle state will contend for that lock.
+This approach also simplifies notification.
+Indeed, \procs need to be notified when a new \at is readied, but they also must be notified during resizing, so the \gls{kthrd} can be joined.
+This means that whichever entity removes idle \procs from the sleeper list must be able to do so in any order.
+Using a simple lock over this data structure makes the removal much simpler than using a lock-free data structure.
+The notification process then simply needs to wake-up the desired idle \proc, using \texttt{pthread\_cond\_signal}, \texttt{write} on an fd, etc., and the \proc will handle the rest.
+\subsection{Reducing Latency}
+As mentioned in this section, \procs going idle for extremely short periods of time is likely in certain common scenarios.
+Therefore, the latency of doing a system call to read from and write to the event fd can actually negatively affect overall performance in a notable way.
+It is important to reduce latency and contention of the notification as much as possible.
+Figure~\ref{fig:idle1} shows the basic idle sleep data structure.
+For the notifiers, this data structure can cause contention on the lock and the event fd syscall can cause notable latency.
+\begin{figure}
+Contention occurs because the idle-list lock must be held to access the idle list, \eg by \procs attempting to go to sleep, \procs waking, or notification attempts.
+The contention from the \procs attempting to go to sleep can be mitigated slightly by using @try_acquire@, so the \procs simply busy wait again searching for \ats if the lock is held.
+This trick cannot be used when waking \procs since the waker needs to return immediately to what it was doing.
+Interestingly, general notification, \ie waking any idle processor versus a specific one, does not strictly require modifying the list.
+Here, contention can be reduced notably by having notifiers avoid the lock entirely by adding a pointer to the event @fd@ of the first idle \proc, as in Figure~\ref{fig:idle2}.
+To avoid contention among notifiers, notifiers atomically exchange the pointer with @NULL@.
+The first notifier succeeds on the exchange and obtains the @fd@ of an idle \proc;
+hence, only one notifier contends on the system call.
+This notifier writes to the @fd@ to wake a \proc.
+The woken \proc then updates the atomic pointer, while it is updating the head of the list, as it removes itself from the list.
+Notifiers that obtained a @NULL@ in the exchange simply move on knowing that another notifier is already waking a \proc.
+This behaviour is equivalent to having multiple notifiers write to the @fd@ since reads consume all previous writes.
+Note that with and without this atomic pointer, bursts of notification can lead to an unspecified number of \procs being woken up, depending on how the arrival of notifications compares with the latency of \procs waking up.
+As mentioned in section~\ref{idlesleep}, there is no optimal approach to handle these bursts.
+It is therefore difficult to justify the cost of any extra synchronization here.
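The lock-free notification path described above can be sketched with C11 atomics. The structure @idle_list@, its field @first_fd@, and the functions @notify_one@ and @notify_demo@ are hypothetical names chosen for this illustration; the locked portion of the list, and the woken \proc restoring the pointer, are omitted.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Hypothetical per-cluster idle list; only the lock-free part is shown. */
struct idle_list {
    _Atomic(int *) first_fd;  /* event fd of the first idle proc, or NULL */
};

/* Returns 1 if this caller performed the wake-up and 0 if another
 * notifier already claimed it (or no proc is idle). */
static int notify_one(struct idle_list *list) {
    int *fd = atomic_exchange(&list->first_fd, (int *)NULL);
    if (fd == NULL) return 0;        /* someone else is waking a proc */
    uint64_t one = 1;
    return write(*fd, &one, sizeof(one)) == (ssize_t)sizeof(one);
}

/* Two notifiers race for one idle proc: only the first one writes. */
static int notify_demo(void) {
    int fd = eventfd(0, 0);
    if (fd < 0) return -1;
    struct idle_list list;
    atomic_init(&list.first_fd, &fd);
    int first = notify_one(&list);   /* wins the exchange, writes once */
    int second = notify_one(&list);  /* sees NULL, skips the system call */
    uint64_t count = 0;
    if (read(fd, &count, sizeof(count)) != (ssize_t)sizeof(count)) count = 0;
    close(fd);
    return first == 1 && second == 0 && count == 1;
}
```

The exchange makes the claim and the clearing of the pointer a single step, so losing notifiers pay for one atomic instruction and no system call.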
+\begin{figure}[t]
         \centering
         \input{idle2.pstex_t}
         \caption[Improved Idle Sleep Data Structure]{Improved Idle Sleep Data Structure \smallskip\newline An atomic pointer is added to the list, pointing to the Event FD of the first \proc on the list.}
+        \caption[Improved Idle-Sleep Data Structure]{Improved Idle-Sleep Data Structure \smallskip\newline An atomic pointer is added to the list pointing to the Event FD of the first \proc on the list.}
         \label{fig:idle2}
 \end{figure}
+The contention is mostly due to the lock on the list needing to be held to get to the head \proc.
+That lock can be contended by \procs attempting to go to sleep, \procs waking or notification attempts.
+The contention from the \procs attempting to go to sleep can be mitigated slightly by using \texttt{try\_acquire} instead, so the \procs simply continue searching for \ats if the lock is held.
+This trick cannot be used for waking \procs since they are not in a state where they can run \ats.
+However, it is worth noting that notification does not strictly require accessing the list or the head \proc.
+Therefore, contention can be reduced notably by having notifiers avoid the lock entirely and adding a pointer to the event fd of the first idle \proc, as in Figure~\ref{fig:idle2}.
+To avoid contention between the notifiers, instead of simply reading the atomic pointer, notifiers atomically exchange it to \texttt{null} so only one notifier will contend on the system call.
+The next optimization is to avoid the latency of the event @fd@, which can be done by adding what is effectively a binary benaphore\cit{benaphore} in front of the event @fd@.
+The benaphore over the event @fd@ logically provides a three-state flag to avoid unnecessary system calls, where the states are expressed explicitly in Figure~\ref{fig:idle:state}.
+A \proc begins its idle sleep by adding itself to the idle list before searching for an \at.
+In the process of adding itself to the idle list, it sets the state flag to @SEARCH@.
+If no \ats can be found during the search, the \proc then confirms it is going to sleep by atomically swapping the state to @SLEEP@.
+If the previous state is still @SEARCH@, then the \proc reads the event @fd@.
+Meanwhile, notifiers atomically exchange the state to @AWAKE@.
+If the previous state is @SLEEP@, then the notifier must write to the event @fd@.
+However, if the notification arrives almost immediately after the \proc marks itself idle, then both reads and writes on the event @fd@ can be omitted, which reduces latency notably.
+These extensions lead to the final data structure shown in Figure~\ref{fig:idle}.
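The three-state protocol can be sketched as two atomic exchanges whose return values decide whether a system call is needed. This is a minimal model under stated assumptions: the names @idle_flag@, @confirm_sleep@, @notify@ and the two demo functions are hypothetical, and the actual reads and writes on the event @fd@ are replaced by a count of the system calls that would be issued.

```c
#include <stdatomic.h>
#include <stdbool.h>

enum idle_state { SEARCH, SLEEP, AWAKE };

/* Hypothetical flag placed in front of a proc's event fd. */
struct idle_flag { _Atomic int state; };

/* Proc side, after an unsuccessful search: returns true when the proc
 * must block by reading its event fd. */
static bool confirm_sleep(struct idle_flag *f) {
    return atomic_exchange(&f->state, SLEEP) == SEARCH;
}

/* Notifier side: returns true when the notifier must write to the
 * event fd because the proc is (or is about to be) blocked in read. */
static bool notify(struct idle_flag *f) {
    return atomic_exchange(&f->state, AWAKE) == SLEEP;
}

/* Notification arrives while the proc is still searching. */
static int syscalls_early_notify(void) {
    struct idle_flag f;
    atomic_init(&f.state, SEARCH);   /* proc joined the idle list */
    int calls = 0;
    if (notify(&f)) calls++;         /* previous state SEARCH: no write */
    if (confirm_sleep(&f)) calls++;  /* previous state AWAKE: no read */
    return calls;
}

/* Notification arrives after the proc committed to sleeping. */
static int syscalls_late_notify(void) {
    struct idle_flag f;
    atomic_init(&f.state, SEARCH);
    int calls = 0;
    if (confirm_sleep(&f)) calls++;  /* previous state SEARCH: read needed */
    if (notify(&f)) calls++;         /* previous state SLEEP: write needed */
    return calls;
}
```

In the early case neither side touches the event @fd@, while in the late case each side pays for exactly one system call, matching the behaviour described in the text.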
 \begin{figure}
         \centering
         \input{idle_state.pstex_t}
         \caption[Improved Idle Sleep Data Structure]{Improved Idle Sleep Data Structure \smallskip\newline An atomic pointer is added to the list, pointing to the Event FD of the first \proc on the list.}
+        \caption[Improved Idle-Sleep Latency]{Improved Idle-Sleep Latency \smallskip\newline A three-state flag is added to the event \lstinline{fd}.}
         \label{fig:idle:state}
 \end{figure}
-The next optimization that can be done is to avoid the latency of the event fd when possible.
-This can be done by adding what is effectively a benaphore\cit{benaphore} in front of the event fd.
-A simple three state flag is added beside the event fd to avoid unnecessary system calls, as shown in Figure~\ref{fig:idle:state}.
-The flag starts in state \texttt{SEARCH}, while the \proc is searching for \ats to run.
-The \proc then confirms the sleep by atomically swaping the state to \texttt{SLEEP}.
-If the previous state was still \texttt{SEARCH}, then the \proc does read the event fd.
-Meanwhile, notifiers atomically exchange the state to \texttt{AWAKE} state.
-if the previous state was \texttt{SLEEP}, then the notifier must write to the event fd.
-However, if the notify arrives almost immediately after the \proc marks itself idle, then both reads and writes on the event fd can be omitted, which reduces latency notably.
-This leads to the final data structure shown in Figure~\ref{fig:idle}.
 \begin{figure}
 …
         \input{idle.pstex_t}
         \caption[Low-latency Idle Sleep Data Structure]{Low-latency Idle Sleep Data Structure \smallskip\newline Each idle \proc is put onto a doubly-linked stack protected by a lock.
         Each \proc has a private event FD with a benaphore in front of it.
         The list also has an atomic pointer to the event fd and benaphore of the first \proc on the list.}
+        Each \proc has a private event \lstinline{fd} with a benaphore in front of it.
+        The list also has an atomic pointer to the event \lstinline{fd} and benaphore of the first \proc on the list.}
         \label{fig:idle}
 \end{figure}

doc/theses/thierry_delisle_PhD/thesis/text/runtime.tex

-              r9e23b446
+              rffec1bf
 This chapter presents an overview of the capabilities of the \CFA runtime prior to this thesis work.
+\Celeven introduced threading features, such as the @_Thread_local@ storage class, and libraries @stdatomic.h@ and @threads.h@. Interestingly, almost a decade after the \Celeven standard, the most recent versions of gcc, clang, and msvc do not support the \Celeven include @threads.h@, indicating no interest in the C11 concurrency approach (possibly because of the recent effort to add concurrency to \CC). While the \Celeven standard does not state a threading model, the historical association with pthreads suggests implementations would adopt kernel-level threading (1:1)~\cite{ThreadModel}, as for \CC. This model uses \glspl{kthrd} to achieve parallelism and concurrency. In this model, every thread of computation maps to an object in the kernel. The kernel then has the responsibility of managing these threads, \eg creating, scheduling, blocking. This also entails that the kernel has a perfect view of every thread executing in the system\footnote{This is not completely true due to primitives like \lstinline|futex|es, which have a significant portion of their logic in user space.}.
+\section{C Threading}
+\Celeven introduced threading features, such as the @_Thread_local@ storage class, and libraries @stdatomic.h@ and @threads.h@.
+Interestingly, almost a decade after the \Celeven standard, the most recent versions of gcc, clang, and msvc do not support the \Celeven include @threads.h@, indicating no interest in the C11 concurrency approach (possibly because of the recent effort to add concurrency to \CC).
+While the \Celeven standard does not state a threading model, the historical association with pthreads suggests implementations would adopt kernel-level threading (1:1)~\cite{ThreadModel}, as for \CC.
+This model uses \glspl{kthrd} to achieve parallelism and concurrency. In this model, every thread of computation maps to an object in the kernel.
+The kernel then has the responsibility of managing these threads, \eg creating, scheduling, blocking.
+A consequence of this approach is that the kernel has a perfect view of every thread executing in the system\footnote{This is not completely true due to primitives like \lstinline|futex|es, which have a significant portion of their logic in user space.}.
 \section{M:N Threading}\label{prev:model}
 …
 Threading in \CFA is based on \Gls{uthrding}, where \glspl{thrd} are the representation of a unit of work. As such, \CFA programmers should expect these units to be fairly inexpensive, \ie programmers should be able to create a large number of \glspl{thrd} and switch among \glspl{thrd} liberally without many concerns for performance.
+The \CFA M:N threading model is implemented using many user-level threads mapped onto fewer \glspl{kthrd}. The user-level threads have the same semantic meaning as a \gls{kthrd} in the 1:1 model: they represent an independent thread of execution with its own stack. The difference is that user-level threads do not have a corresponding object in the kernel; they are handled by the runtime in user space and scheduled onto \glspl{kthrd}, referred to as \glspl{proc} in this document. A \gls{proc} runs a \gls{thrd} until it context switches out; the \gls{proc} then chooses a different \gls{thrd} to run.
+The \CFA M:N threading model is implemented using many user-level threads mapped onto fewer \glspl{kthrd}.
+The user-level threads have the same semantic meaning as a \gls{kthrd} in the 1:1 model: they represent an independent thread of execution with its own stack.
+The difference is that user-level threads do not have a corresponding object in the kernel; they are handled by the runtime in user space and scheduled onto \glspl{kthrd}, referred to as \glspl{proc} in this document. A \gls{proc} runs a \gls{thrd} until it context switches out; the \gls{proc} then chooses a different \gls{thrd} to run.
 \section{Clusters}
+\CFA allows the option to group user-level threading, in the form of clusters. Both \glspl{thrd} and \glspl{proc} belong to a specific cluster. \Glspl{thrd} are only scheduled onto \glspl{proc} in the same cluster and scheduling is done independently of other clusters. Figure~\ref{fig:system} shows an overview of the \CFA runtime, which allows programmers to tightly control parallelism. It also opens the door to handling effects like NUMA, by pinning clusters to a specific NUMA node\footnote{This is not currently implemented in \CFA, but the only hurdle left is creating a generic interface for cpu masks.}.
+\CFA allows the option to group user-level threading, in the form of clusters.
+Both \glspl{thrd} and \glspl{proc} belong to a specific cluster.
+\Glspl{thrd} are only scheduled onto \glspl{proc} in the same cluster and scheduling is done independently of other clusters.
+Figure~\ref{fig:system} shows an overview of the \CFA runtime, which allows programmers to tightly control parallelism.
+It also opens the door to handling effects like NUMA, by pinning clusters to a specific NUMA node\footnote{This capability is not currently implemented in \CFA, but the only hurdle left is creating a generic interface for CPU masks.}.
 \begin{figure}
 …
                 \input{system.pstex_t}
         \end{center}
         \caption[Overview of the \CFA runtime]{Overview of the \CFA runtime \newline \Glspl{thrd} are scheduled inside a particular cluster, where it only runs on the \glspl{proc} which belong to the cluster. The discrete-event manager, which handles preemption and timeout, is a \gls{kthrd} which lives outside any cluster and does not run \glspl{thrd}.}
+        \caption[Overview of the \CFA runtime]{Overview of the \CFA runtime \newline \Glspl{thrd} are scheduled inside a particular cluster and run on the \glspl{proc} that belong to the cluster. The discrete-event manager, which handles preemption and timeout, is a \gls{proc} that lives outside any cluster and does not run \glspl{thrd}.}
         \label{fig:system}
 \end{figure}
 …
 \begin{quote}
+Given a simple network program with 2 \glspl{thrd} and a single \gls{proc}, one \gls{thrd} sends network requests to a server and the other \gls{thrd} waits for a response from the server. If the second \gls{thrd} races ahead, it may wait for responses to requests that have not been sent yet. In theory, this should not be a problem, even if the second \gls{thrd} waits, because the first \gls{thrd} is still ready to run and should be able to get CPU time to send the request. With M:N threading, while the first \gls{thrd} is ready, the lone \gls{proc} \emph{cannot} run the first \gls{thrd} if it is blocked in the \glsxtrshort{io} operation of the second \gls{thrd}. If this happens, the system is in a synchronization deadlock\footnote{In this example, the deadlock could be resolved if the server sends unprompted messages to the client. However, this solution is not general and may not be appropriate even in this simple case.}.
+Given a simple network program with 2 \glspl{thrd} and a single \gls{proc}, one \gls{thrd} sends network requests to a server and the other \gls{thrd} waits for a response from the server.
+If the second \gls{thrd} races ahead, it may wait for responses to requests that have not been sent yet.
+In theory, this should not be a problem, even if the second \gls{thrd} waits, because the first \gls{thrd} is still ready to run and should be able to get CPU time to send the request.
+With M:N threading, while the first \gls{thrd} is ready, the lone \gls{proc} \emph{cannot} run the first \gls{thrd} if it is blocked in the \glsxtrshort{io} operation of the second \gls{thrd}.
+If this happens, the system is in a synchronization deadlock\footnote{In this example, the deadlock could be resolved if the server sends unprompted messages to the client.
+However, this solution is neither general nor appropriate even in this simple case.}.
 \end{quote}
+Therefore, one of the objectives of this work is to introduce \emph{User-Level \glsxtrshort{io}}, which, like \glslink{uthrding}{User-Level \emph{Threading}}, blocks \glspl{thrd} rather than \glspl{proc} when doing \glsxtrshort{io} operations, which entails multiplexing the \glsxtrshort{io} operations of many \glspl{thrd} onto fewer \glspl{proc}. This multiplexing requires that a single \gls{proc} be able to execute multiple \glsxtrshort{io} operations in parallel. This requirement cannot be met with operations that block \glspl{proc}, \ie \glspl{kthrd}, since the first operation would prevent starting new operations for its blocking duration. Executing \glsxtrshort{io} operations in parallel requires \emph{asynchronous} \glsxtrshort{io}, sometimes referred to as \emph{non-blocking}, since the \gls{kthrd} does not block.
+Therefore, one of the objectives of this work is to introduce \emph{User-Level \glsxtrshort{io}}, which, like \glslink{uthrding}{User-Level \emph{Threading}}, blocks \glspl{thrd} rather than \glspl{proc} when doing \glsxtrshort{io} operations.
+This feature entails multiplexing the \glsxtrshort{io} operations of many \glspl{thrd} onto fewer \glspl{proc}.
+The multiplexing requires a single \gls{proc} to execute multiple \glsxtrshort{io} operations in parallel.
+This requirement cannot be done with operations that block \glspl{proc}, \ie \glspl{kthrd}, since the first operation would prevent starting new operations for its blocking duration.
+Executing \glsxtrshort{io} operations in parallel requires \emph{asynchronous} \glsxtrshort{io}, sometimes referred to as \emph{non-blocking}, since the \gls{kthrd} does not block.
 \section{Interoperating with \texttt{C}}
+\section{Interoperating with C}
 While \glsxtrshort{io} operations are the classical example of operations that block \glspl{kthrd}, the non-blocking challenge extends to all blocking system-calls. The POSIX standard states~\cite[\S~2.9.1]{POSIX17}:
 \begin{quote}
 All functions defined by this volume of POSIX.1-2017 shall be thread-safe, except that the following functions1 need not be thread-safe. ... (list of 70+ potentially excluded functions)
+All functions defined by this volume of POSIX.1-2017 shall be thread-safe, except that the following functions need not be thread-safe. ... (list of 70+ excluded functions)
 \end{quote}
 Only UNIX @man@ pages identify whether or not a library function is thread safe, and hence, may block on a pthread lock or system call; hence interoperability with UNIX library functions is a challenge for an M:N threading model.
+Only UNIX @man@ pages identify whether or not a library function is thread safe, and hence, may block on a pthreads lock or system call; hence interoperability with UNIX library functions is a challenge for an M:N threading model.
 Languages like Go and Java, which have strict interoperability with C\cit{JNI, GoLang with C}, can control operations in C by ``sandboxing'' them, \eg a blocking function may be delegated to a \gls{kthrd}. Sandboxing may help towards guaranteeing that the kind of deadlock mentioned above does not occur.
 …
 \begin{enumerate}
         \item Precisely identifying blocking C calls is difficult.
         \item Introducing control points code can have a significant impact on general performance.
+        \item Introducing safe-point code (see Go~page~\pageref{GoSafePoint}) can have a significant impact on general performance.
 \end{enumerate}
+Because of these consequences, this work does not attempt to ``sandbox'' calls to C. Therefore, it is possible calls from an unidentified library will block a \gls{kthrd} leading to deadlocks in \CFA's M:N threading model, which would not occur in a traditional 1:1 threading model. Currently, all M:N thread systems interacting with UNIX without sandboxing suffer from this problem but manage to work very well in the majority of applications. Therefore, a complete solution to this problem is outside the scope of this thesis.
+Because of these consequences, this work does not attempt to ``sandbox'' calls to C.
+Therefore, it is possible calls to an unknown library function can block a \gls{kthrd} leading to deadlocks in \CFA's M:N threading model, which would not occur in a traditional 1:1 threading model.
+Currently, all M:N thread systems interacting with UNIX without sandboxing suffer from this problem but manage to work very well in the majority of applications.
+Therefore, a complete solution to this problem is outside the scope of this thesis.\footnote{\CFA does provide a pthreads emulation, so any library function using embedded pthreads locks is redirected to \CFA user-level locks. This capability further reduces the chances of blocking a \gls{kthrd}.}

doc/theses/thierry_delisle_PhD/thesis/thesis.tex

-              r9e23b446
+              rffec1bf
 \usepackage{graphicx} % For including graphics
 \usepackage{subcaption}
+\usepackage{comment} % Removes large sections of the document.
 % Hyperlinks make it very easy to navigate an electronic document.
 …
         citecolor=OliveGreen,   % color of links to bibliography
         filecolor=magenta,      % color of file links
+        urlcolor=cyan           % color of external links
+        urlcolor=blue,           % color of external links
+        breaklinks=true
+}
 \ifthenelse{\boolean{PrintVersion}}{   % for improved print quality, change some hyperref options
