Timestamp:
Jul 18, 2022, 8:06:18 AM (2 years ago)
Author:
Peter A. Buhr <pabuhr@…>
Branches:
ADT, ast-experimental, master, pthread-emulation, qualifiedEnum
Children:
6a896b0, d677355
Parents:
4f3807d
Message:

proofread chapter text/io.tex, and updates in other chapters

Location:
doc/theses/thierry_delisle_PhD/thesis
Files:
9 edited

  • doc/theses/thierry_delisle_PhD/thesis/fig/io_uring.fig

    r4f3807d r847bb6f  
    [xfig source diff: the io_uring ring diagram is shifted and its labels repositioned; the recoverable labels are Submission Ring, Completion Ring, Application, Kernel, Kernel Line, Push, Pop, Head, Tail, S, S0-S3, and C0-C2.]
  • doc/theses/thierry_delisle_PhD/thesis/local.bib

    r4f3807d r847bb6f  
    22% Cforall
    33@misc{cfa:frontpage,
    4   url = {https://cforall.uwaterloo.ca/}
     4  howpublished = {\href{https://cforall.uwaterloo.ca}{https://\-cforall.uwaterloo.ca}}
    55}
    66@article{cfa:typesystem,
     
    481481@misc{MAN:linux/cfs,
    482482  title = {{CFS} Scheduler - The Linux Kernel documentation},
    483   url = {https://www.kernel.org/doc/html/latest/scheduler/sched-design-CFS.html}
     483  howpublished = {\href{https://www.kernel.org/doc/html/latest/scheduler/sched-design-CFS.html}{https://\-www.kernel.org/\-doc/\-html/\-latest/\-scheduler/\-sched-design-CFS.html}}
    484484}
    485485
     
    489489  year = {2019},
    490490  month = {February},
    491   url = {https://opensource.com/article/19/2/fair-scheduling-linux}
     491  howpublished = {\href{https://opensource.com/article/19/2/fair-scheduling-linux}{https://\-opensource.com/\-article/\-19/2\-/\-fair-scheduling-linux}}
    492492}
    493493
     
    523523  title = {Mach Scheduling and Thread Interfaces - Kernel Programming Guide},
    524524  organization = {Apple Inc.},
    525   url = {https://developer.apple.com/library/archive/documentation/Darwin/Conceptual/KernelProgramming/scheduler/scheduler.html}
      525  howpublished = {\href{https://developer.apple.com/library/archive/documentation/Darwin/Conceptual/KernelProgramming/scheduler/scheduler.html}{https://developer.apple.com/library/archive/documentation/Darwin/Conceptual/KernelProgramming/scheduler/scheduler.html}}
    526526}
    527527
     
    536536  month = {June},
    537537  series = {Developer Reference},
    538   url = {https://www.microsoftpressstore.com/articles/article.aspx?p=2233328&seqNum=7#:~:text=Overview\%20of\%20Windows\%20Scheduling,a\%20phenomenon\%20called\%20processor\%20affinity}
     538  howpublished = {\href{https://www.microsoftpressstore.com/articles/article.aspx?p=2233328&seqNum=7#:~:text=Overview\%20of\%20Windows\%20Scheduling,a\%20phenomenon\%20called\%20processor\%20affinity}{https://\-www.microsoftpressstore.com/\-articles/\-article.aspx?p=2233328&seqNum=7#:~:text=Overview\%20of\%20Windows\%20Scheduling,a\%20phenomenon\%20called\%20processor\%20affinity}}
    539539}
    540540
     
    542542  title = {GitHub - The Go Programming Language},
    543543  author = {The Go Programming Language},
    544   url = {https://github.com/golang/go},
     544  howpublished = {\href{https://github.com/golang/go}{https://\-github.com/\-golang/\-go}},
    545545  version = {Change-Id: If07f40b1d73b8f276ee28ffb8b7214175e56c24d}
    546546}
     
    551551  year = {2019},
    552552  booktitle = {Hydra},
    553   url = {https://www.youtube.com/watch?v=-K11rY57K7k&ab_channel=Hydra}
     553  howpublished = {\href{https://www.youtube.com/watch?v=-K11rY57K7k&ab_channel=Hydra}{https://\-www.youtube.com/\-watch?v=-K11rY57K7k&ab_channel=Hydra}}
    554554}
    555555
     
    559559  year = {2008},
    560560  booktitle = {Erlang User Conference},
    561   url = {http://www.erlang.se/euc/08/euc_smp.pdf}
     561  howpublished = {\href{http://www.erlang.se/euc/08/euc_smp.pdf}{http://\-www.erlang.se/\-euc/\-08/\-euc_smp.pdf}}
    562562}
    563563
     
    567567  title = {Scheduling Algorithm - Intel{\textregistered} Threading Building Blocks Developer Reference},
    568568  organization = {Intel{\textregistered}},
    569   url = {https://www.threadingbuildingblocks.org/docs/help/reference/task_scheduler/scheduling_algorithm.html}
     569  howpublished = {\href{https://www.threadingbuildingblocks.org/docs/help/reference/task_scheduler/scheduling_algorithm.html}{https://\-www.threadingbuildingblocks.org/\-docs/\-help/\-reference/\-task\_scheduler/\-scheduling\_algorithm.html}}
    570570}
    571571
     
    573573  title = {Quasar Core - Quasar User Manual},
    574574  organization = {Parallel Universe},
    575   url = {https://docs.paralleluniverse.co/quasar/}
     575  howpublished = {\href{https://docs.paralleluniverse.co/quasar}{https://\-docs.paralleluniverse.co/\-quasar}}
    576576}
    577577@misc{MAN:project-loom,
    578   url = {https://www.baeldung.com/openjdk-project-loom}
     578  howpublished = {\href{https://www.baeldung.com/openjdk-project-loom}{https://\-www.baeldung.com/\-openjdk-project-loom}}
    579579}
    580580
    581581@misc{MAN:java/fork-join,
    582   url = {https://www.baeldung.com/java-fork-join}
     582  howpublished = {\href{https://www.baeldung.com/java-fork-join}{https://\-www.baeldung.com/\-java-fork-join}}
    583583}
    584584
     
    633633  month   = "March",
    634634  version = {0,4},
    635   howpublished = {\url{https://kernel.dk/io_uring.pdf}}
     635  howpublished = {\href{https://kernel.dk/io_uring.pdf}{https://\-kernel.dk/\-io\_uring.pdf}}
    636636}
    637637
     
    642642  title = "Control theory --- {W}ikipedia{,} The Free Encyclopedia",
    643643  year = "2020",
    644   url = "https://en.wikipedia.org/wiki/Task_parallelism",
     644  howpublished = {\href{https://en.wikipedia.org/wiki/Task_parallelism}{https://\-en.wikipedia.org/\-wiki/\-Task\_parallelism}},
    645645  note = "[Online; accessed 22-October-2020]"
    646646}
     
    650650  title = "Task parallelism --- {W}ikipedia{,} The Free Encyclopedia",
    651651  year = "2020",
    652   url = "https://en.wikipedia.org/wiki/Control_theory",
     652  howpublished = "\href{https://en.wikipedia.org/wiki/Control_theory}{https://\-en.wikipedia.org/\-wiki/\-Control\_theory}",
    653653  note = "[Online; accessed 22-October-2020]"
    654654}
     
    658658  title = "Implicit parallelism --- {W}ikipedia{,} The Free Encyclopedia",
    659659  year = "2020",
    660   url = "https://en.wikipedia.org/wiki/Implicit_parallelism",
     660  howpublished = "\href{https://en.wikipedia.org/wiki/Implicit_parallelism}{https://\-en.wikipedia.org/\-wiki/\-Implicit\_parallelism}",
    661661  note = "[Online; accessed 23-October-2020]"
    662662}
     
    666666  title = "Explicit parallelism --- {W}ikipedia{,} The Free Encyclopedia",
    667667  year = "2017",
    668   url = "https://en.wikipedia.org/wiki/Explicit_parallelism",
     668  howpublished = "\href{https://en.wikipedia.org/wiki/Explicit_parallelism}{https://\-en.wikipedia.org/\-wiki/\-Explicit\_parallelism}",
    669669  note = "[Online; accessed 23-October-2020]"
    670670}
     
    674674  title = "Linear congruential generator --- {W}ikipedia{,} The Free Encyclopedia",
    675675  year = "2020",
    676   url = "https://en.wikipedia.org/wiki/Linear_congruential_generator",
     676  howpublished = "\href{https://en.wikipedia.org/wiki/Linear_congruential_generator}{https://en.wikipedia.org/wiki/Linear\_congruential\_generator}",
    677677  note = "[Online; accessed 2-January-2021]"
    678678}
     
    682682  title = "Futures and promises --- {W}ikipedia{,} The Free Encyclopedia",
    683683  year = "2020",
    684   url = "https://en.wikipedia.org/wiki/Futures_and_promises",
     684  howpublished = "\href{https://en.wikipedia.org/wiki/Futures_and_promises}{https://\-en.wikipedia.org/\-wiki/Futures\_and\_promises}",
    685685  note = "[Online; accessed 9-February-2021]"
    686686}
     
    690690  title = "Read-copy-update --- {W}ikipedia{,} The Free Encyclopedia",
    691691  year = "2022",
    692   url = "https://en.wikipedia.org/wiki/Linear_congruential_generator",
     692  howpublished = "\href{https://en.wikipedia.org/wiki/Linear_congruential_generator}{https://\-en.wikipedia.org/\-wiki/\-Linear\_congruential\_generator}",
    693693  note = "[Online; accessed 12-April-2022]"
    694694}
     
    698698  title = "Readers-writer lock --- {W}ikipedia{,} The Free Encyclopedia",
    699699  year = "2021",
    700   url = "https://en.wikipedia.org/wiki/Readers%E2%80%93writer_lock",
     700  howpublished = "\href{https://en.wikipedia.org/wiki/Readers-writer_lock}{https://\-en.wikipedia.org/\-wiki/\-Readers-writer\_lock}",
    701701  note = "[Online; accessed 12-April-2022]"
    702702}
     
    705705  title = "Bin packing problem --- {W}ikipedia{,} The Free Encyclopedia",
    706706  year = "2022",
    707   url = "https://en.wikipedia.org/wiki/Bin_packing_problem",
     707  howpublished = "\href{https://en.wikipedia.org/wiki/Bin_packing_problem}{https://\-en.wikipedia.org/\-wiki/\-Bin\_packing\_problem}",
    708708  note = "[Online; accessed 29-June-2022]"
    709709}
     
    712712% [05/04, 12:36] Trevor Brown
    713713%     i don't know where rmr complexity was first introduced, but there are many many many papers that use the term and define it
    714 % [05/04, 12:37] Trevor Brown
     714% [05/04, 12:37] Trevor Brown
    715715%     here's one paper that uses the term a lot and links to many others that use it... might trace it to something useful there https://drops.dagstuhl.de/opus/volltexte/2021/14832/pdf/LIPIcs-DISC-2021-30.pdf
    716 % [05/04, 12:37] Trevor Brown
     716% [05/04, 12:37] Trevor Brown
    717717%     another option might be to cite a textbook
    718 % [05/04, 12:42] Trevor Brown
     718% [05/04, 12:42] Trevor Brown
    719719%     but i checked two textbooks in the area i'm aware of and i don't see a definition of rmr complexity in either
    720 % [05/04, 12:42] Trevor Brown
     720% [05/04, 12:42] Trevor Brown
    721721%     this one has a nice statement about the prevelance of rmr complexity, as well as some rough definition
    722 % [05/04, 12:42] Trevor Brown
     722% [05/04, 12:42] Trevor Brown
    723723%     https://dl.acm.org/doi/pdf/10.1145/3465084.3467938
    724724
     
    728728%
    729729% https://doi.org/10.1137/1.9781611973099.100
     730
     731
     732@misc{AIORant,
     733  author = "Linus Torvalds",
     734  title = "Re: [PATCH 09/13] aio: add support for async openat()",
     735  year = "2016",
     736  month = jan,
     737  howpublished = "\href{https://lwn.net/Articles/671657}{https://\-lwn.net/\-Articles/671657}",
     738  note = "[Online; accessed 6-June-2022]"
     739}
     740
     741@misc{apache,
     742  key = {Apache Software Foundation},
     743  title = {{T}he {A}pache Web Server},
     744  howpublished = {\href{http://httpd.apache.org}{http://\-httpd.apache.org}},
     745  note = "[Online; accessed 6-June-2022]"
     746}
     747
     748@misc{SeriallyReusable,
     749    author      = {IBM},
     750    title       = {Serially reusable programs},
     751    month       = mar,
     752    howpublished= {\href{https://www.ibm.com/docs/en/ztpf/1.1.0.15?topic=structures-serially-reusable-programs}{https://www.ibm.com/\-docs/\-en/\-ztpf/\-1.1.0.15?\-topic=structures\--serially\--reusable-programs}},
     753    year        = 2021,
     754}
     755
  • doc/theses/thierry_delisle_PhD/thesis/text/core.tex

    r4f3807d r847bb6f  
    322322Building a scheduler that is cache aware poses two main challenges: discovering the cache topology and matching \procs to this cache structure.
    323323Unfortunately, there is no portable way to discover cache topology, and it is outside the scope of this thesis to solve this problem.
    324 This work uses the cache topology information from Linux's \texttt{/sys/devices/system/cpu} directory.
     324This work uses the cache topology information from Linux's @/sys/devices/system/cpu@ directory.
    325325This leaves the challenge of matching \procs to cache structure, or more precisely identifying which subqueues of the ready queue are local to which subcomponents of the cache structure.
    326326Once a matching is generated, the helping algorithm is changed to add bias so that \procs more often help subqueues local to the same cache substructure.\footnote{
     
    330330Instead of having each subqueue local to a specific \proc, the system is initialized with subqueues for each hardware hyperthread/core up front.
    331331Then \procs dequeue and enqueue by first asking which CPU id they are executing on, in order to identify which subqueues are the local ones.
    332 \Glspl{proc} can get the CPU id from \texttt{sched\_getcpu} or \texttt{librseq}.
     332\Glspl{proc} can get the CPU id from @sched_getcpu@ or @librseq@.
    333333
    334334This approach solves the performance problems on systems with topologies with narrow L3 caches, similar to Figure \ref{fig:cache-noshare}.
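A minimal C sketch of the CPU-id lookup and sysfs topology read described in these hunks; the helper names and the one-subqueue-per-hardware-thread mapping are illustrative assumptions, not the actual \CFA runtime code:

        #define _GNU_SOURCE
        #include <sched.h>      /* sched_getcpu */
        #include <stdio.h>

        /* Illustrative: map the executing hardware thread to a subqueue index. */
        int local_subqueue( void ) {
                int cpu = sched_getcpu();       /* CPU id this kernel thread is currently running on */
                return cpu < 0 ? 0 : cpu;       /* assume one subqueue per hardware thread */
        }

        /* Illustrative: which CPUs share cpu0's L3 cache (index3 is the L3 on most machines). */
        void print_l3_siblings( void ) {
                char buf[256];
                FILE * f = fopen( "/sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list", "r" );
                if ( f && fgets( buf, sizeof(buf), f ) ) printf( "cpu0 shares its L3 with: %s", buf );
                if ( f ) fclose( f );
        }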
  • doc/theses/thierry_delisle_PhD/thesis/text/eval_micro.tex

    r4f3807d r847bb6f  
    77All of these benchmarks are run on two distinct hardware environment, an AMD and an INTEL machine.
    88
    9 For all benchmarks, \texttt{taskset} is used to limit the experiment to 1 NUMA Node with no hyper threading.
     9For all benchmarks, @taskset@ is used to limit the experiment to 1 NUMA Node with no hyper threading.
    1010If more \glspl{hthrd} are needed, then 1 NUMA Node with hyperthreading is used.
    1111If still more \glspl{hthrd} are needed then the experiment is limited to as few NUMA Nodes as needed.
     
    3535\end{figure}
    3636The most basic evaluation of any ready queue is to evaluate the latency needed to push and pop one element from the ready-queue.
    37 Since these two operation also describe a \texttt{yield} operation, many systems use this as the most basic benchmark.
      37Since these two operations also describe a @yield@ operation, many systems use this as the most basic benchmark.
    3838However, yielding can be treated as a special case, since it also carries the information that the number of the ready \glspl{at} will not change.
    3939Not all systems use this information, but those which do may appear to have better performance than they would for disconnected push/pop pairs.
     
    5757This is to avoid the case where one of the \glspl{proc} runs out of work because of the variation on the number of ready \glspl{at} mentionned above.
    5858
    59 The actual benchmark is more complicated to handle termination, but that simply requires using a binary semphore or a channel instead of raw \texttt{park}/\texttt{unpark} and carefully picking the order of the \texttt{P} and \texttt{V} with respect to the loop condition.
      59The actual benchmark is more complicated to handle termination, but that simply requires using a binary semaphore or a channel instead of raw @park@/@unpark@ and carefully picking the order of the @P@ and @V@ with respect to the loop condition.
    6060Figure~\ref{fig:cycle:code} shows pseudo code for this benchmark.
    6161
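A rough sketch of the cycle idea using POSIX threads and semaphores in place of \CFA \glspl{at} and @park@/@unpark@; the ring size, loop count, and the seeding of a single token are illustrative, and the thesis's actual pseudo code is the one in Figure~\ref{fig:cycle:code}:

        #include <pthread.h>
        #include <semaphore.h>

        enum { RING = 5, LOOPS = 100000 };              /* illustrative sizes */
        sem_t sems[RING];                               /* one semaphore per node, standing in for park/unpark */

        void * node( void * arg ) {
                long id = (long)arg;
                for ( int i = 0; i < LOOPS; i++ ) {
                        sem_wait( &sems[id] );                  /* "park" until the predecessor wakes this node   */
                        sem_post( &sems[(id + 1) % RING] );     /* "unpark" the successor, keeping the ring going */
                }
                return NULL;
        }

        int main( void ) {
                pthread_t t[RING];
                for ( long i = 0; i < RING; i++ ) sem_init( &sems[i], 0, 0 );
                sem_post( &sems[0] );                           /* seed one ready "token" in the cycle */
                for ( long i = 0; i < RING; i++ ) pthread_create( &t[i], NULL, node, (void *)i );
                for ( long i = 0; i < RING; i++ ) pthread_join( t[i], NULL );
        }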
     
    116116\section{Yield}
    117117For completion, I also include the yield benchmark.
    118 This benchmark is much simpler than the cycle tests, it simply creates many \glspl{at} that call \texttt{yield}.
    119 As mentionned in the previous section, this benchmark may be less representative of usages that only make limited use of \texttt{yield}, due to potential shortcuts in the routine.
      118This benchmark is much simpler than the cycle tests: it simply creates many \glspl{at} that call @yield@.
      119As mentioned in the previous section, this benchmark may be less representative of usages that only make limited use of @yield@, due to potential shortcuts in the routine.
    120120Its only interesting variable is the number of \glspl{at} per \glspl{proc}, where ratios close to 1 means the ready queue(s) could be empty.
    121121This sometimes puts more strain on the idle sleep handling, compared to scenarios where there is clearly plenty of work to be done.
     
    184184
    185185To achieve this the benchmark uses a fixed size array of semaphores.
    186 Each \gls{at} picks a random semaphore, \texttt{V}s it to unblock a \at waiting and then \texttt{P}s on the semaphore.
     186Each \gls{at} picks a random semaphore, @V@s it to unblock a \at waiting and then @P@s on the semaphore.
    187187This creates a flow where \glspl{at} push each other out of the semaphores before being pushed out themselves.
    188188For this benchmark to work however, the number of \glspl{at} must be equal or greater to the number of semaphores plus the number of \glspl{proc}.
    189 Note that the nature of these semaphores mean the counter can go beyond 1, which could lead to calls to \texttt{P} not blocking.
      189Note that the nature of these semaphores means the counter can go beyond 1, which could lead to calls to @P@ not blocking.
    190190
    191191\todo{code, setup, results}
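A fragment sketching the V-then-P pattern of this benchmark with POSIX semaphores; the array size and loop count are illustrative, and setup plus the termination handling the text alludes to are elided:

        #include <semaphore.h>
        #include <stdlib.h>

        enum { NSEMS = 100, LOOPS = 100000 };   /* illustrative sizes */
        sem_t sems[NSEMS];                      /* fixed-size semaphore array; initialized to 0 elsewhere */

        void * churner( void * arg ) {
                unsigned seed = (unsigned)(long)arg;
                for ( int i = 0; i < LOOPS; i++ ) {
                        int r = rand_r( &seed ) % NSEMS;   /* pick a random semaphore */
                        sem_post( &sems[r] );              /* V: unblock some at waiting there, or bump the counter */
                        sem_wait( &sems[r] );              /* P: may not block if the counter is above zero */
                }
                return NULL;
        }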
  • doc/theses/thierry_delisle_PhD/thesis/text/existing.tex

    r4f3807d r847bb6f  
    178178\begin{displayquote}
    179179        \begin{enumerate}
    180                 \item The task returned by \textit{t}\texttt{.execute()}
     180                \item The task returned by \textit{t}@.execute()@
    181181                \item The successor of t if \textit{t} was its last completed predecessor.
    182182                \item A task popped from the end of the thread's own deque.
     
    193193\paragraph{Quasar/Project Loom}
    194194Java has two projects, Quasar~\cite{MAN:quasar} and Project Loom~\cite{MAN:project-loom}\footnote{It is unclear if these are distinct projects.}, that are attempting to introduce lightweight thread\-ing in the form of Fibers.
    195 Both projects seem to be based on the \texttt{ForkJoinPool} in Java, which appears to be a simple incarnation of randomized work-stealing~\cite{MAN:java/fork-join}.
     195Both projects seem to be based on the @ForkJoinPool@ in Java, which appears to be a simple incarnation of randomized work-stealing~\cite{MAN:java/fork-join}.
    196196
    197197\paragraph{Grand Central Dispatch}
     
    204204% http://web.archive.org/web/20090920043909/http://images.apple.com/macosx/technology/docs/GrandCentral_TB_brief_20090903.pdf
    205205
    206 In terms of semantics, the Dispatch Queues seem to be very similar to Intel\textregistered ~TBB \texttt{execute()} and predecessor semantics.
     206In terms of semantics, the Dispatch Queues seem to be very similar to Intel\textregistered ~TBB @execute()@ and predecessor semantics.
    207207
    208208\paragraph{LibFibre}
  • doc/theses/thierry_delisle_PhD/thesis/text/intro.tex

    r4f3807d r847bb6f  
    103103An algorithm for load-balancing and idle sleep of processors, including NUMA awareness.
    104104\item
    105 Support for user-level \glsxtrshort{io} capabilities based on Linux's \texttt{io\_uring}.
     105Support for user-level \glsxtrshort{io} capabilities based on Linux's @io_uring@.
    106106\end{enumerate}
  • doc/theses/thierry_delisle_PhD/thesis/text/io.tex

    r4f3807d r847bb6f  
    11\chapter{User Level \io}
    2 As mentioned in Section~\ref{prev:io}, User-Level \io requires multiplexing the \io operations of many \glspl{thrd} onto fewer \glspl{proc} using asynchronous \io operations.
      2As mentioned in Section~\ref{prev:io}, user-level \io requires multiplexing the \io operations of many \glspl{thrd} onto fewer \glspl{proc} using asynchronous \io operations.
    33Different operating systems offer various forms of asynchronous operations and, as mentioned in Chapter~\ref{intro}, this work is exclusively focused on the Linux operating-system.
    44
    55\section{Kernel Interface}
    6 Since this work fundamentally depends on operating-system support, the first step of any design is to discuss the available interfaces and pick one (or more) as the foundations of the non-blocking \io subsystem.
     6Since this work fundamentally depends on operating-system support, the first step of this design is to discuss the available interfaces and pick one (or more) as the foundation for the non-blocking \io subsystem in this work.
    77
    88\subsection{\lstinline{O_NONBLOCK}}
     
    1010In this mode, ``Neither the @open()@ nor any subsequent \io operations on the [opened file descriptor] will cause the calling process to wait''~\cite{MAN:open}.
    1111This feature can be used as the foundation for the non-blocking \io subsystem.
    12 However, for the subsystem to know when an \io operation completes, @O_NONBLOCK@ must be use in conjunction with a system call that monitors when a file descriptor becomes ready, \ie, the next \io operation on it does not cause the process to wait
    13 \footnote{In this context, ready means \emph{some} operation can be performed without blocking.
     12However, for the subsystem to know when an \io operation completes, @O_NONBLOCK@ must be used in conjunction with a system call that monitors when a file descriptor becomes ready, \ie, the next \io operation on it does not cause the process to wait.\footnote{
     13In this context, ready means \emph{some} operation can be performed without blocking.
    1414It does not mean an operation returning \lstinline{EAGAIN} succeeds on the next try.
    15 For example, a ready read may only return a subset of bytes and the read must be issues again for the remaining bytes, at which point it may return \lstinline{EAGAIN}.}.
      15For example, a ready read may only return a subset of requested bytes and the read must be issued again for the remaining bytes, at which point it may return \lstinline{EAGAIN}.}
    1616This mechanism is also crucial in determining when all \glspl{thrd} are blocked and the application \glspl{kthrd} can now block.
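A small C sketch of the @O_NONBLOCK@/@EAGAIN@ pattern described above; @wait_until_readable@ is a hypothetical stand-in for whichever readiness mechanism (@select@, @poll@, or @epoll@) the subsystem uses:

        #include <fcntl.h>
        #include <unistd.h>
        #include <errno.h>

        void wait_until_readable( int fd );   /* hypothetical: parks the thrd until fd is ready */

        /* read on a descriptor opened with open(path, O_RDONLY | O_NONBLOCK) */
        ssize_t nb_read( int fd, void * buf, size_t len ) {
                for ( ;; ) {
                        ssize_t r = read( fd, buf, len );
                        if ( r >= 0 ) return r;                                   /* possibly a short read */
                        if ( errno != EAGAIN && errno != EWOULDBLOCK ) return -1; /* real error */
                        wait_until_readable( fd );            /* ready does not mean complete: retry after readiness */
                }
        }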
    1717
    18 There are three options to monitor file descriptors in Linux
    19 \footnote{For simplicity, this section omits \lstinline{pselect} and \lstinline{ppoll}.
     18There are three options to monitor file descriptors in Linux:\footnote{
     19For simplicity, this section omits \lstinline{pselect} and \lstinline{ppoll}.
    2020The difference between these system calls and \lstinline{select} and \lstinline{poll}, respectively, is not relevant for this discussion.},
    2121@select@~\cite{MAN:select}, @poll@~\cite{MAN:poll} and @epoll@~\cite{MAN:epoll}.
    2222All three of these options offer a system call that blocks a \gls{kthrd} until at least one of many file descriptors becomes ready.
    23 The group of file descriptors being waited is called the \newterm{interest set}.
    24 
    25 \paragraph{\lstinline{select}} is the oldest of these options, it takes as an input a contiguous array of bits, where each bits represent a file descriptor of interest.
    26 On return, it modifies the set in place to identify which of the file descriptors changed status.
    27 This destructive change means that calling select in a loop requires re-initializing the array each time and the number of file descriptors supported has a hard limit.
    28 Another limit of @select@ is that once the call is started, the interest set can no longer be modified.
    29 Monitoring a new file descriptor generally requires aborting any in progress call to @select@
    30 \footnote{Starting a new call to \lstinline{select} is possible but requires a distinct kernel thread, and as a result is not an acceptable multiplexing solution when the interest set is large and highly dynamic unless the number of parallel calls to \lstinline{select} can be strictly bounded.}.
    31 
    32 \paragraph{\lstinline{poll}} is an improvement over select, which removes the hard limit on the number of file descriptors and the need to re-initialize the input on every call.
    33 It works using an array of structures as an input rather than an array of bits, thus allowing a more compact input for small interest sets.
    34 Like @select@, @poll@ suffers from the limitation that the interest set cannot be changed while the call is blocked.
    35 
    36 \paragraph{\lstinline{epoll}} further improves these two functions by allowing the interest set to be dynamically added to and removed from while a \gls{kthrd} is blocked on an @epoll@ call.
     23The group of file descriptors being waited on is called the \newterm{interest set}.
     24
     25\paragraph{\lstinline{select}} is the oldest of these options, and takes as input a contiguous array of bits, where each bit represents a file descriptor of interest.
     26Hence, the array length must be as long as the largest FD currently of interest.
     27On return, it outputs the set in place to identify which of the file descriptors changed state.
     28This destructive change means selecting in a loop requires re-initializing the array for each iteration.
     29Another limit of @select@ is that calls from different \glspl{kthrd} sharing FDs are independent.
     30Hence, if one \gls{kthrd} is managing the select calls, other threads can only add/remove to/from the manager's interest set through synchronized calls to update the interest set.
     31However, these changes are only reflected when the manager makes its next call to @select@.
     32Note, it is possible for the manager thread to never unblock if its current interest set never changes, \eg the sockets/pipes/ttys it is waiting on never get data again.
     33Often the I/O manager has a timeout, polls, or is sent a signal on changes to mitigate this problem.
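A sketch of a @select@-based manager loop showing the destructive @fd_set@ being rebuilt on every iteration, with a timeout guarding against a never-changing interest set; the interest-set storage and the ready-handling are illustrative:

        #include <sys/select.h>

        void manager_loop( int fds[], int nfds ) {
                for ( ;; ) {
                        fd_set rset;
                        FD_ZERO( &rset );                          /* re-initialize the bit array each iteration */
                        int maxfd = -1;
                        for ( int i = 0; i < nfds; i++ ) {
                                FD_SET( fds[i], &rset );
                                if ( fds[i] > maxfd ) maxfd = fds[i];
                        }
                        struct timeval tv = { .tv_sec = 1, .tv_usec = 0 };   /* timeout mitigates a stale interest set */
                        int n = select( maxfd + 1, &rset, NULL, NULL, &tv );
                        for ( int i = 0; n > 0 && i < nfds; i++ )
                                if ( FD_ISSET( fds[i], &rset ) ) { /* fds[i] is ready: unpark the waiting thrd */ }
                        /* changes made by other threads to fds[]/nfds are only seen on the next iteration */
                }
        }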
     34
     35\begin{comment}
     36From: Tim Brecht <brecht@uwaterloo.ca>
     37Subject: Re: FD sets
     38Date: Wed, 6 Jul 2022 00:29:41 +0000
     39
     40Large number of open files
     41--------------------------
     42
     43In order to be able to use more than the default number of open file
     44descriptors you may need to:
     45
     46o increase the limit on the total number of open files /proc/sys/fs/file-max
     47  (on Linux systems)
     48
     49o increase the size of FD_SETSIZE
     50  - the way I often do this is to figure out which include file __FD_SETSIZE
     51    is defined in, copy that file into an appropriate directory in ./include,
     52    and then modify it so that if you use -DBIGGER_FD_SETSIZE the larger size
     53    gets used
     54
     55  For example on a RH 9.0 distribution I've copied
     56  /usr/include/bits/typesizes.h into ./include/i386-linux/bits/typesizes.h
     57
     58  Then I modify typesizes.h to look something like:
     59
     60  #ifdef BIGGER_FD_SETSIZE
     61  #define __FD_SETSIZE            32767
     62  #else
     63  #define __FD_SETSIZE            1024
     64  #endif
     65
     66  Note that the since I'm moving and testing the userver on may different
     67  machines the Makefiles are set up to use -I ./include/$(HOSTTYPE)
     68
     69  This way if you redefine the FD_SETSIZE it will get used instead of the
     70  default original file.
     71\end{comment}
     72
     73\paragraph{\lstinline{poll}} is the next oldest option, and takes as input an array of structures containing the FD numbers rather than their position in an array of bits, allowing a more compact input for interest sets that contain widely spaced FDs.
     74(For small interest sets with densely packed FDs, the @select@ bit mask can take less storage, and hence, copy less information into the kernel.)
      75Furthermore, @poll@ is non-destructive, so the array of structures does not have to be re-initialized on every call.
      76Like @select@, @poll@ suffers from the limitation that the interest set cannot be changed by other \glspl{kthrd} while a manager thread is blocked in @poll@.
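The equivalent @poll@ sketch: FDs are named by number in an array of structures, and the array survives the call unchanged; again, the ready-handling is illustrative:

        #include <poll.h>

        void manager_loop( struct pollfd pfds[], nfds_t nfds ) {
                /* setup elsewhere: pfds[i].fd = ...; pfds[i].events = POLLIN; */
                for ( ;; ) {
                        int n = poll( pfds, nfds, 1000 );          /* non-destructive; 1 second timeout */
                        for ( nfds_t i = 0; n > 0 && i < nfds; i++ )
                                if ( pfds[i].revents & POLLIN ) { /* ready: unpark the waiting thrd */ }
                        /* like select, the interest set is fixed while the manager is blocked */
                }
        }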
     77
     78\paragraph{\lstinline{epoll}} follows after @poll@, and places the interest set in the kernel rather than the application, where it is managed by an internal \gls{kthrd}.
     79There are two separate functions: one to add to the interest set and another to check for FDs with state changes.
    3780This dynamic capability is accomplished by creating an \emph{epoll instance} with a persistent interest set, which is used across multiple calls.
    38 This capability significantly reduces synchronization overhead on the part of the caller (in this case the \io subsystem), since the interest set can be modified when adding or removing file descriptors without having to synchronize with other \glspl{kthrd} potentially calling @epoll@.
    39 
    40 However, all three of these system calls have limitations.
     81As the interest set is augmented, the changes become implicitly part of the interest set for a blocked manager \gls{kthrd}.
     82This capability significantly reduces synchronization between \glspl{kthrd} and the manager calling @epoll@.
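An @epoll@ sketch of the same manager, where the interest set lives in the kernel so other \glspl{kthrd} can grow it while the manager is blocked; error handling is omitted:

        #include <sys/epoll.h>

        int epfd;                                  /* shared instance: epfd = epoll_create1(0); */

        void add_interest( int fd ) {              /* callable from any kthrd, even while the manager is blocked */
                struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };
                epoll_ctl( epfd, EPOLL_CTL_ADD, fd, &ev );
        }

        void manager_loop( void ) {
                struct epoll_event evs[64];
                for ( ;; ) {
                        int n = epoll_wait( epfd, evs, 64, -1 );   /* newly added FDs implicitly join the wait */
                        for ( int i = 0; i < n; i++ ) { /* evs[i].data.fd is ready: unpark its thrd */ }
                }
        }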
     83
     84However, all three of these I/O systems have limitations.
    4185The @man@ page for @O_NONBLOCK@ mentions that ``[@O_NONBLOCK@] has no effect for regular files and block devices'', which means none of these three system calls are viable multiplexing strategies for these types of \io operations.
    4286Furthermore, @epoll@ has been shown to have problems with pipes and ttys~\cit{Peter's examples in some fashion}.
     
    5397It also supports batching multiple operations in a single system call.
    5498
    55 AIO offers two different approach to polling: @aio_error@ can be used as a spinning form of polling, returning @EINPROGRESS@ until the operation is completed, and @aio_suspend@ can be used similarly to @select@, @poll@ or @epoll@, to wait until one or more requests have completed.
     99AIO offers two different approaches to polling: @aio_error@ can be used as a spinning form of polling, returning @EINPROGRESS@ until the operation is completed, and @aio_suspend@ can be used similarly to @select@, @poll@ or @epoll@, to wait until one or more requests have completed.
    56100For the purpose of \io multiplexing, @aio_suspend@ is the best interface.
    57101However, even if AIO requests can be submitted concurrently, @aio_suspend@ suffers from the same limitation as @select@ and @poll@, \ie, the interest set cannot be dynamically changed while a call to @aio_suspend@ is in progress.
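A minimal POSIX AIO sketch of the @aio_suspend@ style of polling discussed above; a single request, error handling trimmed, and older glibc requires linking with -lrt:

        #include <aio.h>
        #include <errno.h>
        #include <string.h>
        #include <sys/types.h>

        ssize_t read_via_aio( int fd, void * buf, size_t len, off_t off ) {
                struct aiocb cb;
                memset( &cb, 0, sizeof(cb) );
                cb.aio_fildes = fd;  cb.aio_buf = buf;  cb.aio_nbytes = len;  cb.aio_offset = off;
                if ( aio_read( &cb ) != 0 ) return -1;            /* submit the asynchronous read */
                const struct aiocb * list[1] = { &cb };
                while ( aio_error( &cb ) == EINPROGRESS )
                        aio_suspend( list, 1, NULL );             /* block until a listed request completes */
                return aio_return( &cb );                         /* final result of the completed operation */
        }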
     
    70114
    71115        \begin{flushright}
    72                 -- Linus Torvalds\cit{https://lwn.net/Articles/671657/}
     116                -- Linus Torvalds~\cite{AIORant}
    73117        \end{flushright}
    74118\end{displayquote}
     
    85129A very recent addition to Linux, @io_uring@~\cite{MAN:io_uring}, is a framework that aims to solve many of the problems listed in the above interfaces.
    86130Like AIO, it represents \io operations as entries added to a queue.
    87 But like @epoll@, new requests can be submitted while a blocking call waiting for requests to complete is already in progress.
      131But like @epoll@, new requests can be submitted while a blocking call waiting for requests to complete is already in progress.
    88132The @io_uring@ interface uses two ring buffers (referred to simply as rings) at its core: a submit ring to which programmers push \io requests and a completion ring from which programmers poll for completion.
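A sketch of the two-ring flow using the liburing helper library; the thesis works against the raw mmap'd rings, so this is only shorthand for the submit-ring push, the @io_uring_enter@ call, and the completion-ring poll (setup once with @io_uring_queue_init(64, &ring, 0)@):

        #include <liburing.h>

        int read_via_uring( struct io_uring * ring, int fd, void * buf, unsigned len ) {
                struct io_uring_sqe * sqe = io_uring_get_sqe( ring );  /* allocate an SQE from the array         */
                io_uring_prep_read( sqe, fd, buf, len, 0 );            /* fill it for a read at offset 0         */
                io_uring_sqe_set_data( sqe, buf );                     /* user_data matches the SQE to its CQE   */
                io_uring_submit( ring );                               /* push onto the submit ring, io_uring_enter */

                struct io_uring_cqe * cqe;
                io_uring_wait_cqe( ring, &cqe );                       /* wait on the completion ring            */
                int res = cqe->res;                                    /* bytes read, or -errno                  */
                io_uring_cqe_seen( ring, cqe );                        /* consume the CQE                        */
                return res;
        }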
    89133
     
    97141In the worst case, where all \glspl{thrd} are consistently blocking on \io, it devolves into 1-to-1 threading.
    98142However, regardless of the frequency of \io operations, it achieves the fundamental goal of not blocking \glspl{proc} when \glspl{thrd} are ready to run.
    99 This approach is used by languages like Go\cit{Go} and frameworks like libuv\cit{libuv}, since it has the advantage that it can easily be used across multiple operating systems.
     143This approach is used by languages like Go\cit{Go}, frameworks like libuv\cit{libuv}, and web servers like Apache~\cite{apache} and Nginx~\cite{nginx}, since it has the advantage that it can easily be used across multiple operating systems.
    100144This advantage is especially relevant for languages like Go, which offer a homogeneous \glsxtrshort{api} across all platforms.
    101145As opposed to C, which has a very limited standard api for \io, \eg, the C standard library has no networking.
     
    111155\section{Event-Engine}
    112156An event engine's responsibility is to use the kernel interface to multiplex many \io operations onto few \glspl{kthrd}.
    113 In concrete terms, this means \glspl{thrd} enter the engine through an interface, the event engines then starts the operation and parks the calling \glspl{thrd}, returning control to the \gls{proc}.
     157In concrete terms, this means \glspl{thrd} enter the engine through an interface, the event engine then starts an operation and parks the calling \glspl{thrd}, returning control to the \gls{proc}.
    114158The parked \glspl{thrd} are then rescheduled by the event engine once the desired operation has completed.
    115159
     
    134178\begin{enumerate}
    135179\item
    136 An SQE is allocated from the pre-allocated array (denoted \emph{S} in Figure~\ref{fig:iouring}).
     180An SQE is allocated from the pre-allocated array \emph{S}.
    137181This array is created at the same time as the @io_uring@ instance, is in kernel-locked memory visible by both the kernel and the application, and has a fixed size determined at creation.
    138 How these entries are allocated is not important for the functioning of @io_uring@, the only requirement is that no entry is reused before the kernel has consumed it.
     182How these entries are allocated is not important for the functioning of @io_uring@;
     183the only requirement is that no entry is reused before the kernel has consumed it.
    139184\item
    140185The SQE is filled according to the desired operation.
    141 This step is straight forward, the only detail worth mentioning is that SQEs have a @user_data@ field that must be filled in order to match submission and completion entries.
      186This step is straightforward.
     187The only detail worth mentioning is that SQEs have a @user_data@ field that must be filled in order to match submission and completion entries.
    142188\item
    143189The SQE is submitted to the submission ring by appending the index of the SQE to the ring following regular ring buffer steps: \lstinline{buffer[head] = item; head++}.
    144190Since the head is visible to the kernel, some memory barriers may be required to prevent the compiler from reordering these operations.
    145191Since the submission ring is a regular ring buffer, more than one SQE can be added at once and the head is updated only after all entries are updated.
      192Note, SQEs can be filled and submitted in any order, \eg in Figure~\ref{fig:iouring} the submission order is S0, S3, S2 and S1 has not been submitted.
    146193\item
    147194The kernel is notified of the change to the ring using the system call @io_uring_enter@.
     
    161208The @io_uring_enter@ system call is protected by a lock inside the kernel.
    162209This protection means that concurrent call to @io_uring_enter@ using the same instance are possible, but there is no performance gained from parallel calls to @io_uring_enter@.
    163 It is possible to do the first three submission steps in parallel, however, doing so requires careful synchronization.
     210It is possible to do the first three submission steps in parallel;
     211however, doing so requires careful synchronization.
    164212
    165213@io_uring@ also introduces constraints on the number of simultaneous operations that can be ``in flight''.
    166 Obviously, SQEs are allocated from a fixed-size array, meaning that there is a hard limit to how many SQEs can be submitted at once.
    167 In addition, the @io_uring_enter@ system call can fail because ``The  kernel [...] ran out of resources to handle [a request]'' or ``The application is attempting to overcommit the number of requests it can  have pending.''.
     214First, SQEs are allocated from a fixed-size array, meaning that there is a hard limit to how many SQEs can be submitted at once.
     215Second, the @io_uring_enter@ system call can fail because ``The  kernel [...] ran out of resources to handle [a request]'' or ``The application is attempting to overcommit the number of requests it can have pending.''.
    168216This restriction means \io request bursts may have to be subdivided and submitted in chunks at a later time.
    169217
    170218\subsection{Multiplexing \io: Submission}
     219
    171220The submission side is the most complicated aspect of @io_uring@ and the completion side effectively follows from the design decisions made in the submission side.
    172 While it is possible to do the first steps of submission in parallel, the duration of the system call scales with number of entries submitted.
     221While there is freedom in designing the submission side, there are some realities of @io_uring@ that must be taken into account.
     222It is possible to do the first steps of submission in parallel;
     223however, the duration of the system call scales with the number of entries submitted.
    173224The consequence is that the amount of parallelism used to prepare submissions for the next system call is limited.
    174225Beyond this limit, the length of the system call is the throughput limiting factor.
    175 I concluded from early experiments that preparing submissions seems to take at most as long as the system call itself, which means that with a single @io_uring@ instance, there is no benefit in terms of \io throughput to having more than two \glspl{hthrd}.
    176 Therefore the design of the submission engine must manage multiple instances of @io_uring@ running in parallel, effectively sharding @io_uring@ instances.
    177 Similarly to scheduling, this sharding can be done privately, \ie, one instance per \glspl{proc}, in decoupled pools, \ie, a pool of \glspl{proc} use a pool of @io_uring@ instances without one-to-one coupling between any given instance and any given \gls{proc}, or some mix of the two.
    178 Since completions are sent to the instance where requests were submitted, all instances with pending operations must be polled continously
    179 \footnote{As will be described in Chapter~\ref{practice}, this does not translate into constant cpu usage.}.
     226I concluded from early experiments that preparing submissions seems to take almost as long as the system call itself, which means that with a single @io_uring@ instance, there is no benefit in terms of \io throughput to having more than two \glspl{hthrd}.
     227Therefore, the design of the submission engine must manage multiple instances of @io_uring@ running in parallel, effectively sharding @io_uring@ instances.
     228Since completions are sent to the instance where requests were submitted, all instances with pending operations must be polled continuously\footnote{
     229As described in Chapter~\ref{practice}, this does not translate into constant CPU usage.}.
    180230Note that once an operation completes, there is nothing that ties it to the @io_uring@ instance that handled it.
    181 There is nothing preventing a new operation with, for example, the same file descriptors to a different @io_uring@ instance.
      231There is nothing preventing a new operation, \eg with the same file descriptors, from being submitted to a different @io_uring@ instance.
    182232
    183233A complicating aspect of submission is @io_uring@'s support for chains of operations, where the completion of an operation triggers the submission of the next operation on the link.
    184234SQEs forming a chain must be allocated from the same instance and must be contiguous in the Submission Ring (see Figure~\ref{fig:iouring}).
    185 The consequence of this feature is that filling SQEs can be arbitrarly complex and therefore users may need to run arbitrary code between allocation and submission.
    186 Supporting chains is a requirement of the \io subsystem, but it is still valuable.
    187 Support for this feature can be fulfilled simply to supporting arbitrary user code between allocation and submission.
    188 
    189 \subsubsection{Public Instances}
    190 One approach is to have multiple shared instances.
    191 \Glspl{thrd} attempting \io operations pick one of the available instances and submit operations to that instance.
    192 Since there is no coupling between \glspl{proc} and @io_uring@ instances in this approach, \glspl{thrd} running on more than one \gls{proc} can attempt to submit to the same instance concurrently.
    193 Since @io_uring@ effectively sets the amount of sharding needed to avoid contention on its internal locks, performance in this approach is based on two aspects: the synchronization needed to submit does not induce more contention than @io_uring@ already does and the scheme to route \io requests to specific @io_uring@ instances does not introduce contention.
    194 This second aspect has an oversized importance because it comes into play before the sharding of instances, and as such, all \glspl{hthrd} can contend on the routing algorithm.
    195 
    196 Allocation in this scheme can be handled fairly easily.
    197 Free SQEs, \ie, SQEs that aren't currently being used to represent a request, can be written to safely and have a field called @user_data@ which the kernel only reads to copy to @cqe@s.
    198 Allocation also requires no ordering guarantee as all free SQEs are interchangeable.
    199 This requires a simple concurrent bag.
    200 The only added complexity is that the number of SQEs is fixed, which means allocation can fail.
    201 
    202 Allocation failures need to be pushed up to a routing algorithm: \glspl{thrd} attempting \io operations must not be directed to @io_uring@ instances without sufficient SQEs available.
    203 Furthermore, the routing algorithm should block operations up-front if none of the instances have available SQEs.
    204 
    205 Once an SQE is allocated, \glspl{thrd} can fill them normally, they simply need to keep track of the SQE index and which instance it belongs to.
    206 
    207 Once an SQE is filled in, what needs to happen is that the SQE must be added to the submission ring buffer, an operation that is not thread-safe on itself, and the kernel must be notified using the @io_uring_enter@ system call.
    208 The submission ring buffer is the same size as the pre-allocated SQE buffer, therefore pushing to the ring buffer cannot fail
    209 \footnote{This is because it is invalid to have the same \lstinline{sqe} multiple times in the ring buffer.}.
    210 However, as mentioned, the system call itself can fail with the expectation that it will be retried once some of the already submitted operations complete.
    211 Since multiple SQEs can be submitted to the kernel at once, it is important to strike a balance between batching and latency.
    212 Operations that are ready to be submitted should be batched together in few system calls, but at the same time, operations should not be left pending for long period of times before being submitted.
    213 This can be handled by either designating one of the submitting \glspl{thrd} as the being responsible for the system call for the current batch of SQEs or by having some other party regularly submitting all ready SQEs, \eg, the poller \gls{thrd} mentioned later in this section.
    214 
    215 In the case of designating a \gls{thrd}, ideally, when multiple \glspl{thrd} attempt to submit operations to the same @io_uring@ instance, all requests would be batched together and one of the \glspl{thrd} would do the system call on behalf of the others, referred to as the \newterm{submitter}.
    216 In practice however, it is important that the \io requests are not left pending indefinitely and as such, it may be required to have a ``next submitter'' that guarentees everything that is missed by the current submitter is seen by the next one.
    217 Indeed, as long as there is a ``next'' submitter, \glspl{thrd} submitting new \io requests can move on, knowing that some future system call will include their request.
    218 Once the system call is done, the submitter must also free SQEs so that the allocator can reused them.
    219 
    220 Finally, the completion side is much simpler since the @io_uring@ system call enforces a natural synchronization point.
    221 Polling simply needs to regularly do the system call, go through the produced CQEs and communicate the result back to the originating \glspl{thrd}.
    222 Since CQEs only own a signed 32 bit result, in addition to the copy of the @user_data@ field, all that is needed to communicate the result is a simple future~\cite{wiki:future}.
    223 If the submission side does not designate submitters, polling can also submit all SQEs as it is polling events.
    224 A simple approach to polling is to allocate a \gls{thrd} per @io_uring@ instance and simply let the poller \glspl{thrd} poll their respective instances when scheduled.
    225 
    226 With this pool of instances approach, the big advantage is that it is fairly flexible.
    227 It does not impose restrictions on what \glspl{thrd} submitting \io operations can and cannot do between allocations and submissions.
    228 It also can gracefully handle running out of ressources, SQEs or the kernel returning @EBUSY@.
    229 The down side to this is that many of the steps used for submitting need complex synchronization to work properly.
    230 The routing and allocation algorithm needs to keep track of which ring instances have available SQEs, block incoming requests if no instance is available, prevent barging if \glspl{thrd} are already queued up waiting for SQEs and handle SQEs being freed.
    231 The submission side needs to safely append SQEs to the ring buffer, correctly handle chains, make sure no SQE is dropped or left pending forever, notify the allocation side when SQEs can be reused and handle the kernel returning @EBUSY@.
    232 All this synchronization may have a significant cost and, compared to the next approach presented, this synchronization is entirely overhead.
     235The consequence of this feature is that filling SQEs can be arbitrarily complex, and therefore, users may need to run arbitrary code between allocation and submission.
     236Supporting chains is not a requirement of the \io subsystem, but it is still valuable.
     237Support for this feature can be fulfilled simply by supporting arbitrary user code between allocation and submission.
     238
     239Similar to scheduling, sharding @io_uring@ instances can be done privately, \ie, one instance per \glspl{proc}, in decoupled pools, \ie, a pool of \glspl{proc} use a pool of @io_uring@ instances without one-to-one coupling between any given instance and any given \gls{proc}, or some mix of the two.
     240These three sharding approaches are analyzed.
    233241
    234242\subsubsection{Private Instances}
    235 Another approach is to simply create one ring instance per \gls{proc}.
    236 This alleviates the need for synchronization on the submissions, requiring only that \glspl{thrd} are not interrupted in between two submission steps.
    237 This is effectively the same requirement as using @thread_local@ variables.
    238 Since SQEs that are allocated must be submitted to the same ring, on the same \gls{proc}, this effectively forces the application to submit SQEs in allocation order
    239 \footnote{The actual requirement is that \glspl{thrd} cannot context switch between allocation and submission.
    240 This requirement means that from the subsystem's point of view, the allocation and submission are sequential.
    241 To remove this requirement, a \gls{thrd} would need the ability to ``yield to a specific \gls{proc}'', \ie, park with the promise that it will be run next on a specific \gls{proc}, the \gls{proc} attached to the correct ring.}
    242 , greatly simplifying both allocation and submission.
    243 In this design, allocation and submission form a partitionned ring buffer as shown in Figure~\ref{fig:pring}.
    244 Once added to the ring buffer, the attached \gls{proc} has a significant amount of flexibility with regards to when to do the system call.
     243The private approach creates one ring instance per \gls{proc}, \ie one-to-one coupling.
     244This alleviates the need for synchronization on the submissions, requiring only that \glspl{thrd} are not time-sliced during submission steps.
     245This requirement is the same as accessing @thread_local@ variables, where a \gls{thrd} is accessing kernel-thread data, is time-sliced, and continues execution on another kernel thread but is now accessing the wrong data.
     246This failure is the serially reusable problem~\cite{SeriallyReusable}.
     247Hence, allocated SQEs must be submitted to the same ring on the same \gls{proc}, which effectively forces the application to submit SQEs in allocation order.\footnote{
     248To remove this requirement, a \gls{thrd} needs the ability to ``yield to a specific \gls{proc}'', \ie, park with the guarantee it unparks on a specific \gls{proc}, \ie the \gls{proc} attached to the correct ring.}
     249From the subsystem's point of view, the allocation and submission are sequential, greatly simplifying both.
     250In this design, allocation and submission form a partitioned ring buffer as shown in Figure~\ref{fig:pring}.
     251Once added to the ring buffer, the attached \gls{proc} has a significant amount of flexibility with regards to when to perform the system call.
    245252Possible options are: when the \gls{proc} runs out of \glspl{thrd} to run, after running a given number of \glspl{thrd}, etc.
    246253
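As a rough illustration of this design (the structure and helper names below are hypothetical, not the actual \CFA runtime code, and error handling is elided), the partitioned ring needs no locks because only the owning \gls{proc} ever touches it:
\begin{lstlisting}
#include <linux/io_uring.h>
// Hypothetical sketch of a per-proc partitioned ring: [head, ready) is flushed to the
// kernel, [ready, tail) is allocated but not yet submitted.
struct pring {
	struct io_uring_sqe * sqes;        // fixed-size SQE array shared with the kernel
	unsigned head, ready, tail, mask;
};
struct io_uring_sqe * pring_alloc( struct pring * r ) {   // no locks: only this proc uses r
	return &r->sqes[ (r->tail++) & r->mask ];
}
void pring_submit( struct pring * r ) {                   // still no locks
	r->ready = r->tail;                // the proc later flushes [head, ready) with
}                                      // io_uring_enter(), e.g. when it runs out of threads
\end{lstlisting}
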
     
    254261\end{figure}
    255262
    256 This approach has the advantage that it does not require much of the synchronization needed in the shared approach.
    257 This comes at the cost that \glspl{thrd} submitting \io operations have less flexibility, they cannot park or yield, and several exceptional cases are handled poorly.
    258 Instances running out of SQEs cannot run \glspl{thrd} wanting to do \io operations, in such a case the \gls{thrd} needs to be moved to a different \gls{proc}, the only current way of achieving this would be to @yield()@ hoping to be scheduled on a different \gls{proc}, which is not guaranteed.
    259 
    260 A more involved version of this approach can seem to solve most of these problems, using a pattern called \newterm{helping}.
    261 \Glspl{thrd} that wish to submit \io operations but cannot do so
    262 \footnote{either because of an allocation failure or because they were migrate to a different \gls{proc} between allocation and submission}
    263 create an object representing what they wish to achieve and add it to a list somewhere.
    264 For this particular problem, one solution would be to have a list of pending submissions per \gls{proc} and a list of pending allocations, probably per cluster.
    265 The problem with these ``solutions'' is that they are still bound by the strong coupling between \glspl{proc} and @io_uring@ instances.
    266 These data structures would allow moving \glspl{thrd} to a specific \gls{proc} when the current \gls{proc} cannot fulfill the \io request.
    267 
    268 Imagine a simple case with two \glspl{thrd} on two \glspl{proc}, one \gls{thrd} submits an \io operation and then sets a flag, the other \gls{thrd} spins until the flag is set.
    269 If the first \gls{thrd} is preempted between allocation and submission and moves to the other \gls{proc}, the original \gls{proc} could start running the spinning \gls{thrd}.
    270 If this happens, the helping ``solution'' is for the \io \gls{thrd}to added append an item to the submission list of the \gls{proc} where the allocation was made.
     263This approach has the advantage that it does not require much of the synchronization needed in a shared approach.
     264However, this benefit means \glspl{thrd} submitting \io operations have less flexibility: they cannot park or yield, and several exceptional cases are handled poorly.
     265Instances running out of SQEs cannot run \glspl{thrd} wanting to do \io operations.
     266In this case, the \io \gls{thrd} needs to be moved to a different \gls{proc}, and the only current way of achieving this is to @yield()@ hoping to be scheduled on a different \gls{proc} with free SQEs, which is not guaranteed.
     267
     268A more involved version of this approach tries to solve these problems using a pattern called \newterm{helping}.
     269\Glspl{thrd} that cannot submit \io operations, either because of an allocation failure or migration to a different \gls{proc} between allocation and submission, create an \io object and add it to a list of pending submissions per \gls{proc} and a list of pending allocations, probably per cluster.
     270While there is still the strong coupling between \glspl{proc} and @io_uring@ instances, these data structures allow moving \glspl{thrd} to a specific \gls{proc}, when the current \gls{proc} cannot fulfill the \io request.
     271
     272Imagine a simple scenario with two \glspl{thrd} on two \glspl{proc}, where one \gls{thrd} submits an \io operation and then sets a flag, while the other \gls{thrd} spins until the flag is set.
     273Assume both \glspl{thrd} are running on the same \gls{proc}, and the \io \gls{thrd} is preempted between allocation and submission, moved to the second \gls{proc}, and the original \gls{proc} starts running the spinning \gls{thrd}.
     274In this case, the helping solution has the \io \gls{thrd} append an \io object to the submission list of the first \gls{proc}, where the allocation was made.
    271275No other \gls{proc} can help the \gls{thrd} since @io_uring@ instances are strongly coupled to \glspl{proc}.
    272 However, in this case, the \gls{proc} is unable to help because it is executing the spinning \gls{thrd} mentioned when first expression this case
    273 \footnote{This particular example is completely artificial, but in the presence of many more \glspl{thrd}, it is not impossible that this problem would arise ``in the wild''.
    274 Furthermore, this pattern is difficult to reliably detect and avoid.}
    275 resulting in a deadlock.
    276 Once in this situation, the only escape is to interrupted the execution of the \gls{thrd}, either directly or due to regular preemption, only then can the \gls{proc} take the time to handle the pending request to help.
    277 Interrupting \glspl{thrd} for this purpose is far from desireable, the cost is significant and the situation may be hard to detect.
    278 However, a more subtle reason why interrupting the \gls{thrd} is not a satisfying solution is that the \gls{proc} is not actually using the instance it is tied to.
    279 If it were to use it, then helping could be done as part of the usage.
     276However, the \io \gls{proc} is unable to help because it is executing the spinning \gls{thrd} resulting in a deadlock.
     277While this example is artificial, in the presence of many \glspl{thrd}, it is possible for this problem to arise ``in the wild''.
     278Furthermore, this pattern is difficult to reliably detect and avoid.
      279Once in this situation, the only escape is to interrupt the spinning \gls{thrd}, either directly or via regular preemption (\eg time slicing).
      280Having to interrupt \glspl{thrd} for this purpose is costly, the latency between interrupts can be large, and the situation may be hard to detect.
     281% However, a more important reason why interrupting the \gls{thrd} is not a satisfying solution is that the \gls{proc} is using the instance it is tied to.
     282% If it were to use it, then helping could be done as part of the usage.
    280283Interrupts are needed here entirely because the \gls{proc} is tied to an instance it is not using.
    281 Therefore a more satisfying solution would be for the \gls{thrd} submitting the operation to simply notice that the instance is unused and simply go ahead and use it.
    282 This is the approach presented next.
     284Therefore, a more satisfying solution is for the \gls{thrd} submitting the operation to notice that the instance is unused and simply go ahead and use it.
     285This approach is presented shortly.
     286
     287\subsubsection{Public Instances}
     288The public approach creates decoupled pools of @io_uring@ instances and processors, \ie without one-to-one coupling.
     289\Glspl{thrd} attempting an \io operation pick one of the available instances and submit the operation to that instance.
     290Since there is no coupling between @io_uring@ instances and \glspl{proc} in this approach, \glspl{thrd} running on more than one \gls{proc} can attempt to submit to the same instance concurrently.
     291Because @io_uring@ effectively sets the amount of sharding needed to avoid contention on its internal locks, performance in this approach is based on two aspects:
     292\begin{itemize}
     293\item
     294The synchronization needed to submit does not induce more contention than @io_uring@ already does.
     295\item
     296The scheme to route \io requests to specific @io_uring@ instances does not introduce contention.
     297This aspect has an oversized importance because it comes into play before the sharding of instances, and as such, all \glspl{hthrd} can contend on the routing algorithm.
     298\end{itemize}
     299
     300Allocation in this scheme is fairly easy.
     301Free SQEs, \ie, SQEs that are not currently being used to represent a request, can be written to safely and have a field called @user_data@ that the kernel only reads to copy to @cqe@s.
     302Allocation also requires no ordering guarantee as all free SQEs are interchangeable.
     303% This requires a simple concurrent bag.
     304The only added complexity is that the number of SQEs is fixed, which means allocation can fail.
     305
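For example, assuming a 64-entry instance and C11 atomics (a sketch only, not the actual implementation), allocation reduces to claiming a bit in a free mask and simply reports failure when the mask is empty:
\begin{lstlisting}
#include <stdatomic.h>
// Illustrative sketch: claim a free SQE index from a 64-entry instance, or fail.
static _Atomic unsigned long free_mask = ~0ul;       // bit i set => SQE i is free

int sqe_alloc( void ) {                              // returns an index, or -1 if none free
	unsigned long avail = atomic_load( &free_mask );
	while ( avail != 0 ) {
		int idx = __builtin_ctzl( avail );           // lowest free slot
		if ( atomic_compare_exchange_weak( &free_mask, &avail, avail & ~(1ul << idx) ) )
			return idx;                              // claimed; any free SQE is interchangeable
	}
	return -1;                                       // instance full: route elsewhere or block
}
void sqe_free( int idx ) { atomic_fetch_or( &free_mask, 1ul << idx ); }
\end{lstlisting}
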
     306Allocation failures need to be pushed to a routing algorithm: \glspl{thrd} attempting \io operations must not be directed to @io_uring@ instances without sufficient SQEs available.
     307Furthermore, the routing algorithm should block operations up-front, if none of the instances have available SQEs.
     308
     309Once an SQE is allocated, \glspl{thrd} insert the \io request information, and keep track of the SQE index and the instance it belongs to.
     310
     311Once an SQE is filled in, it is added to the submission ring buffer, an operation that is not thread-safe, and then the kernel must be notified using the @io_uring_enter@ system call.
     312The submission ring buffer is the same size as the pre-allocated SQE buffer, therefore pushing to the ring buffer cannot fail because it is invalid to have the same \lstinline{sqe} multiple times in a ring buffer.
     313However, as mentioned, the system call itself can fail with the expectation that it can be retried once some submitted operations complete.
     314
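As a sketch (the ring layout and field names are simplified, and the raw system call is shown through @syscall@ since glibc offers no wrapper), the two steps look as follows:
\begin{lstlisting}
#include <errno.h>
#include <pthread.h>
#include <sys/syscall.h>
#include <unistd.h>
// Illustrative sketch: append one allocated SQE index to the submission ring, then
// notify the kernel.
struct subring {
	pthread_mutex_t lock;                      // appending is not thread-safe
	unsigned * karray, * ktail, * kmask;       // pointers into the mmap()ed SQ ring
};
void ring_push_and_enter( struct subring * r, int ring_fd, unsigned sqe_idx ) {
	pthread_mutex_lock( &r->lock );
	unsigned tail = *r->ktail;
	r->karray[ tail & *r->kmask ] = sqe_idx;   // cannot overflow: one ring slot per SQE
	__atomic_store_n( r->ktail, tail + 1, __ATOMIC_RELEASE );
	pthread_mutex_unlock( &r->lock );

	if ( syscall( __NR_io_uring_enter, ring_fd, 1, 0, 0, NULL, 0 ) < 0 && errno == EBUSY ) {
		// too many in-flight operations: the SQE stays in the ring and the
		// system call is retried once some completions have been reaped
	}
}
\end{lstlisting}
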
     315Since multiple SQEs can be submitted to the kernel at once, it is important to strike a balance between batching and latency.
      316Operations that are ready to be submitted should be batched together in few system calls, but at the same time, operations should not be left pending for long periods of time before being submitted.
      317Balancing submission can be handled either by designating one of the submitting \glspl{thrd} as being responsible for the system call for the current batch of SQEs or by having some other party regularly submit all ready SQEs, \eg, the poller \gls{thrd} mentioned later in this section.
     318
     319Ideally, when multiple \glspl{thrd} attempt to submit operations to the same @io_uring@ instance, all requests should be batched together and one of the \glspl{thrd} is designated to do the system call on behalf of the others, called the \newterm{submitter}.
      320However, in practice, \io requests must be handled promptly, so there is a need to guarantee that everything missed by the current submitter is seen by the next one.
     321Indeed, as long as there is a ``next'' submitter, \glspl{thrd} submitting new \io requests can move on, knowing that some future system call includes their request.
      322Once the system call is done, the submitter must also free SQEs so that the allocator can reuse them.
     323
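One possible way to provide this guarantee, sketched below with C11 atomics (the @do_enter@ wrapper and the names are illustrative), is a pending counter combined with a submitter flag, so a \gls{thrd} that fails to become the submitter can rely on the current submitter's re-check:
\begin{lstlisting}
#include <stdatomic.h>
// Illustrative sketch of the ``next submitter'' guarantee; do_enter() wraps the system call.
extern void do_enter( int ring_fd, unsigned to_submit );
static _Atomic unsigned pending = 0;                 // pushed to the ring, not yet submitted
static atomic_flag submitting = ATOMIC_FLAG_CLEAR;

void after_push( int ring_fd ) {
	atomic_fetch_add( &pending, 1 );
	while ( ! atomic_flag_test_and_set( &submitting ) ) {  // try to become the submitter
		unsigned n = atomic_exchange( &pending, 0 );
		if ( n > 0 ) do_enter( ring_fd, n );
		atomic_flag_clear( &submitting );
		if ( atomic_load( &pending ) == 0 ) break;   // nothing missed; otherwise try again
	}   // if the flag was already set, the current submitter's re-check covers this request
}
\end{lstlisting}
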
     324Finally, the completion side is much simpler since the @io_uring@ system-call enforces a natural synchronization point.
     325Polling simply needs to regularly do the system call, go through the produced CQEs and communicate the result back to the originating \glspl{thrd}.
      326Since a CQE contains only a signed 32-bit result, in addition to a copy of the @user_data@ field, all that is needed to communicate the result is a simple future~\cite{wiki:future}.
     327If the submission side does not designate submitters, polling can also submit all SQEs as it is polling events.
     328A simple approach to polling is to allocate a \gls{thrd} per @io_uring@ instance and simply let the poller \glspl{thrd} poll their respective instances when scheduled.
     329
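A single poller pass over one instance can then be sketched as follows (the @cring@ layout and @future_fulfil@ are illustrative; the @user_data@ and @res@ fields are the CQE fields described above):
\begin{lstlisting}
#include <stdint.h>
#include <linux/io_uring.h>
// Illustrative sketch of one poller pass over one instance's completion ring.
struct future;                                               // the waiting thread's future
extern void future_fulfil( struct future *, int result );   // hypothetical helper
struct cring { unsigned * khead, * ktail, * kmask; struct io_uring_cqe * cqes; };

void poll_completions( struct cring * cq ) {
	unsigned head = *cq->khead;
	unsigned tail = __atomic_load_n( cq->ktail, __ATOMIC_ACQUIRE );
	for ( ; head != tail; head++ ) {
		struct io_uring_cqe * cqe = &cq->cqes[ head & *cq->kmask ];
		// user_data was set at submission to point at the originating thread's future
		future_fulfil( (struct future *)(uintptr_t)cqe->user_data, cqe->res );
	}
	__atomic_store_n( cq->khead, head, __ATOMIC_RELEASE );  // hand the slots back to the kernel
}
\end{lstlisting}
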
      330The big advantage of this pool-of-instances approach is that it is fairly flexible.
     331It does not impose restrictions on what \glspl{thrd} submitting \io operations can and cannot do between allocations and submissions.
      332It can also gracefully handle running out of resources, either SQEs or the kernel returning @EBUSY@.
      333The downside of this approach is that many of the steps needed for submission require complex synchronization to work properly.
     334The routing and allocation algorithm needs to keep track of which ring instances have available SQEs, block incoming requests if no instance is available, prevent barging if \glspl{thrd} are already queued up waiting for SQEs and handle SQEs being freed.
     335The submission side needs to safely append SQEs to the ring buffer, correctly handle chains, make sure no SQE is dropped or left pending forever, notify the allocation side when SQEs can be reused, and handle the kernel returning @EBUSY@.
     336All this synchronization has a significant cost, and compared to the private-instance approach, this synchronization is entirely overhead.
    283337
    284338\subsubsection{Instance borrowing}
    285 Both of the approaches presented above have undesirable aspects that stem from too loose or too tight coupling between @io_uring@ and \glspl{proc}.
    286 In the first approach, loose coupling meant that all operations have synchronization overhead that a tighter coupling can avoid.
    287 The second approach on the other hand suffers from tight coupling causing problems when the \gls{proc} do not benefit from the coupling.
    288 While \glspl{proc} are continously issuing \io operations tight coupling is valuable since it avoids synchronization costs.
    289 However, in unlikely failure cases or when \glspl{proc} are not making use of their instance, tight coupling is no longer advantageous.
    290 A compromise between these approaches would be to allow tight coupling but have the option to revoke this coupling dynamically when failure cases arise.
    291 I call this approach ``instance borrowing''\footnote{While it looks similar to work-sharing and work-stealing, I think it is different enough from either to warrant a different verb to avoid confusion.}.
    292 
    293 In this approach, each cluster owns a pool of @io_uring@ instances managed by an arbiter.
     339Both of the prior approaches have undesirable aspects that stem from tight or loose coupling between @io_uring@ and \glspl{proc}.
     340The first approach suffers from tight coupling causing problems when a \gls{proc} does not benefit from the coupling.
     341The second approach suffers from loose coupling causing operations to have synchronization overhead, which tighter coupling avoids.
     342When \glspl{proc} are continuously issuing \io operations, tight coupling is valuable since it avoids synchronization costs.
     343However, in unlikely failure cases or when \glspl{proc} are not using their instances, tight coupling is no longer advantageous.
     344A compromise between these approaches is to allow tight coupling but have the option to revoke the coupling dynamically when failure cases arise.
     345I call this approach \newterm{instance borrowing}.\footnote{
     346While instance borrowing looks similar to work sharing and stealing, I think it is different enough to warrant a different verb to avoid confusion.}
     347
     348In this approach, each cluster (see Figure~\ref{fig:system}) owns a pool of @io_uring@ instances managed by an \newterm{arbiter}.
     294349When a \gls{thrd} attempts to issue an \io operation, it asks for an instance from the arbiter and issues requests to that instance.
    295 However, in doing so it ties to the instance to the \gls{proc} it is currently running on.
    296 This coupling is kept until the arbiter decides to revoke it, taking back the instance and reverting the \gls{proc} to its initial state with respect to \io.
    297 This tight coupling means that synchronization can be minimal since only one \gls{proc} can use the instance at any given time, akin to the private instances approach.
    298 However, where it differs is that revocation from the arbiter means this approach does not suffer from the deadlock scenario described above.
     350This instance is now bound to the \gls{proc} the \gls{thrd} is running on.
     351This binding is kept until the arbiter decides to revoke it, taking back the instance and reverting the \gls{proc} to its initial state with respect to \io.
     352This tight coupling means that synchronization can be minimal since only one \gls{proc} can use the instance at a time, akin to the private instances approach.
     353However, it differs in that revocation by the arbiter (an interrupt) means this approach does not suffer from the deadlock scenario described above.
    299354
    300355Arbitration is needed in the following cases:
    301356\begin{enumerate}
    302         \item The current \gls{proc} does not currently hold an instance.
     357        \item The current \gls{proc} does not hold an instance.
    303358        \item The current instance does not have sufficient SQEs to satisfy the request.
    304         \item The current \gls{proc} has the wrong instance, this happens if the submitting \gls{thrd} context-switched between allocation and submission.
    305         I will refer to these as \newterm{External Submissions}.
      359        \item The current \gls{proc} has the wrong instance, which happens if the submitting \gls{thrd} context-switched between allocation and submission; such requests are called \newterm{external submissions}.
    306360\end{enumerate}
    307 However, even when the arbiter is not directly needed, \glspl{proc} need to make sure that their ownership of the instance is not being revoked.
    308 This can be accomplished by a lock-less handshake\footnote{Note that the handshake is not Lock-\emph{Free} since it lacks the proper progress guarantee.}.
     361However, even when the arbiter is not directly needed, \glspl{proc} need to make sure that their instance ownership is not being revoked, which is accomplished by a lock-\emph{less} handshake.\footnote{
     362Note the handshake is not lock \emph{free} since it lacks the proper progress guarantee.}
    309363A \gls{proc} raises a local flag before using its borrowed instance and checks if the instance is marked as revoked or if the arbiter has raised its flag.
    310 If not it proceeds, otherwise it delegates the operation to the arbiter.
     364If not, it proceeds, otherwise it delegates the operation to the arbiter.
    311365Once the operation is completed, the \gls{proc} lowers its local flag.
    312366
    313 Correspondingly, before revoking an instance the arbiter marks the instance and then waits for the \gls{proc} using it to lower its local flag.
     367Correspondingly, before revoking an instance, the arbiter marks the instance and then waits for the \gls{proc} using it to lower its local flag.
     314368Only then does it reclaim the instance and potentially assign it to another \gls{proc}.
    315369
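A minimal sketch of this handshake, using C11 atomics with their default sequentially consistent ordering (the names are illustrative, not the actual runtime code), is:
\begin{lstlisting}
#include <stdatomic.h>
#include <stdbool.h>
// Illustrative sketch of the revocation handshake.
struct borrowed {
	_Atomic bool in_use;      // raised by the proc around each use
	_Atomic bool revoked;     // raised by the arbiter to reclaim the instance
};
bool try_use( struct borrowed * b ) {                 // proc side
	atomic_store( &b->in_use, true );
	if ( atomic_load( &b->revoked ) ) {               // arbiter wants the instance back
		atomic_store( &b->in_use, false );
		return false;                                 // delegate the operation to the arbiter
	}
	/* ... allocate / submit on the borrowed instance ... */
	atomic_store( &b->in_use, false );
	return true;
}
void revoke( struct borrowed * b ) {                  // arbiter side
	atomic_store( &b->revoked, true );
	while ( atomic_load( &b->in_use ) ) /* spin */ ;  // wait for any in-flight use to finish
	/* ... reclaim the instance, possibly hand it to another proc ... */
}
\end{lstlisting}
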
     
    323377
    324378\paragraph{External Submissions} are handled by the arbiter by revoking the appropriate instance and adding the submission to the submission ring.
    325 There is no need to immediately revoke the instance however.
     379However,  there is no need to immediately revoke the instance.
    326380External submissions must simply be added to the ring before the next system call, \ie, when the submission ring is flushed.
    327 This means that whoever is responsible for the system call first checks if the instance has any external submissions.
    328 If it is the case, it asks the arbiter to revoke the instance and add the external submissions to the ring.
    329 
    330 \paragraph{Pending Allocations} can be more complicated to handle.
    331 If the arbiter has available instances, the arbiter can attempt to directly hand over the instance and satisfy the request.
    332 Otherwise it must hold onto the list of threads until SQEs are made available again.
    333 This handling becomes that much more complex if pending allocation require more than one SQE, since the arbiter must make a decision between statisfying requests in FIFO ordering or satisfy requests for fewer SQEs first.
    334 
    335 While this arbiter has the potential to solve many of the problems mentionned in above, it also introduces a significant amount of complexity.
      381This means that whoever is responsible for the system call first checks whether the instance has any external submissions.
     382If so, it asks the arbiter to revoke the instance and add the external submissions to the ring.
     383
     384\paragraph{Pending Allocations} are handled by the arbiter when it has available instances and can directly hand over the instance and satisfy the request.
     385Otherwise, it must hold onto the list of threads until SQEs are made available again.
     386This handling is more complex when an allocation requires multiple SQEs, since the arbiter must make a decision between satisfying requests in FIFO ordering or for fewer SQEs.
     387
     388While an arbiter has the potential to solve many of the problems mentioned above, it also introduces a significant amount of complexity.
     336389Tracking which processors are borrowing which instances and which instances have SQEs available ends up adding a significant synchronization prelude to any I/O operation.
    337390Any submission must start with a handshake that pins the currently borrowed instance, if available.
     338391An attempt to allocate is then made, but the arbiter, running on a different \gls{hthrd}, can concurrently be attempting to allocate from the same instance.
    339 Once the allocation is completed, the submission must still check that the instance is still burrowed before attempt to flush.
    340 These extra synchronization steps end-up having a similar cost to the multiple shared instances approach.
      392Once the allocation is completed, the submission must check that the instance is still borrowed before attempting to flush.
      393These synchronization steps turn out to have a cost similar to the multiple shared-instances approach.
     341394Furthermore, if the number of instances does not match the number of processors actively submitting I/O, the system can fall into a state where instances are constantly being revoked and end up cycling the processors, which leads to significant cache deterioration.
    342 Because of these reasons, this approach, which sounds promising on paper, does not improve on the private instance approach in practice.
     395For these reasons, this approach, which sounds promising on paper, does not improve on the private instance approach in practice.
    343396
    344397\subsubsection{Private Instances V2}
    345398
    346 
    347 
    348399% Verbs of this design
    349400
    350401% Allocation: obtaining an sqe from which to fill in the io request, enforces the io instance to use since it must be the one which provided the sqe. Must interact with the arbiter if the instance does not have enough sqe for the allocation. (Typical allocation will ask for only one sqe, but chained sqe must be allocated from the same context so chains of sqe must be allocated in bulks)
    351402
    352 % Submition: simply adds the sqe(s) to some data structure to communicate that they are ready to go. This operation can't fail because there are as many spots in the submit buffer than there are sqes. Must interact with the arbiter only if the thread was moved between the allocation and the submission.
     403% Submission: simply adds the sqe(s) to some data structure to communicate that they are ready to go. This operation can't fail because there are as many spots in the submit buffer than there are sqes. Must interact with the arbiter only if the thread was moved between the allocation and the submission.
    353404
    354405% Flushing: Taking all the sqes that were submitted and making them visible to the kernel, also counting them in order to figure out what to_submit should be. Must be thread-safe with submission. Has to interact with the Arbiter if there are external submissions. Can't simply use a protected queue because adding to the array is not safe if the ring is still available for submitters. Flushing must therefore: check if there are external pending requests if so, ask the arbiter to flush otherwise use the fast flush operation.
     
    357408
    358409% Handle: process all the produced cqe. No need to interact with any of the submission operations or the arbiter.
    359 
    360 
    361410
    362411
     
    404453
    405454\section{Interface}
    406 Finally, the last important part of the \io subsystem is it's interface. There are multiple approaches that can be offered to programmers, each with advantages and disadvantages. The new \io subsystem can replace the C runtime's API or extend it. And in the later case the interface can go from very similar to vastly different. The following sections discuss some useful options using @read@ as an example. The standard Linux interface for C is :
    407 
    408 @ssize_t read(int fd, void *buf, size_t count);@
     455
     456The last important part of the \io subsystem is its interface.
     457There are multiple approaches that can be offered to programmers, each with advantages and disadvantages.
      458The new \io subsystem can replace the C runtime API or extend it, and in the latter case, the interface can go from very similar to vastly different.
     459The following sections discuss some useful options using @read@ as an example.
      460The standard Linux interface for C is:
     461\begin{lstlisting}
     462ssize_t read(int fd, void *buf, size_t count);
     463\end{lstlisting}
    409464
    410465\subsection{Replacement}
     411466Replacing the C \glsxtrshort{api} is the most intrusive and draconian approach.
    412467The goal is to convince the compiler and linker to replace any calls to @read@ to direct them to the \CFA implementation instead of glibc's.
    413 This has the advantage of potentially working transparently and supporting existing binaries without needing recompilation.
     468This rerouting has the advantage of working transparently and supporting existing binaries without needing recompilation.
     414469It also offers a presumably well-known and familiar API that C programmers can simply continue to work with.
    415 However, this approach also entails a plethora of subtle technical challenges which generally boils down to making a perfect replacement.
      470However, this approach also entails a plethora of subtle technical challenges, which generally boil down to making a perfect replacement.
    416471If the \CFA interface replaces only \emph{some} of the calls to glibc, then this can easily lead to esoteric concurrency bugs.
    417 Since the gcc ecosystems does not offer a scheme for such perfect replacement, this approach was rejected as being laudable but infeasible.
      472Since the gcc ecosystem does not offer a scheme for perfect replacement, this approach was rejected as being laudable but infeasible.
    418473
    419474\subsection{Synchronous Extension}
    420 An other interface option is to simply offer an interface that is different in name only. For example:
    421 
    422 @ssize_t cfa_read(int fd, void *buf, size_t count);@
    423 
    424 \noindent This is much more feasible but still familiar to C programmers.
    425 It comes with the caveat that any code attempting to use it must be recompiled, which can be a big problem considering the amount of existing legacy C binaries.
     475Another interface option is to offer an interface different in name only.
     476For example:
     477\begin{lstlisting}
     478ssize_t cfa_read(int fd, void *buf, size_t count);
     479\end{lstlisting}
     480This approach is feasible and still familiar to C programmers.
     481It comes with the caveat that any code attempting to use it must be recompiled, which is a problem considering the amount of existing legacy C binaries.
    426482However, it has the advantage of implementation simplicity.
      483Finally, there is a certain irony to using a blocking synchronous interface for a feature often referred to as ``non-blocking'' \io.
    427484
    428485\subsection{Asynchronous Extension}
    429 It is important to mention that there is a certain irony to using only synchronous, therefore blocking, interfaces for a feature often referred to as ``non-blocking'' \io.
    430 A fairly traditional way of doing this is using futures\cit{wikipedia futures}.
    431 As simple way of doing so is as follows:
    432 
    433 @future(ssize_t) read(int fd, void *buf, size_t count);@
    434 
    435 \noindent Note that this approach is not necessarily the most idiomatic usage of futures.
    436 The definition of read above ``returns'' the read content through an output parameter which cannot be synchronized on.
    437 A more classical asynchronous API could look more like:
    438 
    439 @future([ssize_t, void *]) read(int fd, size_t count);@
    440 
    441 \noindent However, this interface immediately introduces memory lifetime challenges since the call must effectively allocate a buffer to be returned.
    442 Because of the performance implications of this, the first approach is considered preferable as it is more familiar to C programmers.
    443 
    444 \subsection{Interface directly to \lstinline{io_uring}}
    445 Finally, an other interface that can be relevant is to simply expose directly the underlying \texttt{io\_uring} interface. For example:
    446 
    447 @array(SQE, want) cfa_io_allocate(int want);@
    448 
    449 @void cfa_io_submit( const array(SQE, have) & );@
    450 
    451 \noindent This offers more flexibility to users wanting to fully use all of the \texttt{io\_uring} features.
     486A fairly traditional way of providing asynchronous interactions is using a future mechanism~\cite{multilisp}, \eg:
     487\begin{lstlisting}
     488future(ssize_t) read(int fd, void *buf, size_t count);
     489\end{lstlisting}
     490where the generic @future@ is fulfilled when the read completes and it contains the number of bytes read, which may be less than the number of bytes requested.
     491The data read is placed in @buf@.
      492The problem is that both the byte count and the buffer data should form the synchronization object, not just the byte count.
     493Hence, the buffer cannot be reused until the operation completes but the synchronization does not cover the buffer.
     494A classical asynchronous API is:
     495\begin{lstlisting}
     496future([ssize_t, void *]) read(int fd, size_t count);
     497\end{lstlisting}
     498where the future tuple covers the components that require synchronization.
     499However, this interface immediately introduces memory lifetime challenges since the call must effectively allocate a buffer to be returned.
     500Because of the performance implications of this API, the first approach is considered preferable as it is more familiar to C programmers.
     501
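For example, a hypothetical use of the preferred form is as follows, where @wait@ stands in for whatever synchronization operation the future provides:
\begin{lstlisting}
char buf[4096];
future(ssize_t) f = read( fd, buf, sizeof(buf) );  // start the I/O without blocking
do_other_work();                                   // overlap computation with the I/O
ssize_t n = wait( f );    // hypothetical wait; buf must not be reused before this point
\end{lstlisting}
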
     502\subsection{Direct \lstinline{io_uring} Interface}
     503The last interface directly exposes the underlying @io_uring@ interface, \eg:
     504\begin{lstlisting}
     505array(SQE, want) cfa_io_allocate(int want);
     506void cfa_io_submit( const array(SQE, have) & );
     507\end{lstlisting}
     508where the generic @array@ contains an array of SQEs with a size that may be less than the request.
     509This offers more flexibility to users wanting to fully utilize all of the @io_uring@ features.
    452510However, it is not the most user-friendly option.
    453 It obviously imposes a strong dependency between user code and \texttt{io\_uring} but at the same time restricting users to usages that are compatible with how \CFA internally uses \texttt{io\_uring}.
    454 
    455 
      511It obviously imposes a strong dependency between user code and @io_uring@, while at the same time restricting users to usages that are compatible with how \CFA internally uses @io_uring@.
  • doc/theses/thierry_delisle_PhD/thesis/text/practice.tex

    r4f3807d r847bb6f  
    1515Programmers are mostly expected to resize clusters on startup or teardown.
    1616Therefore dynamically changing the number of \procs is an appropriate moment to allocate or free resources to match the new state.
    17 As such all internal arrays that are sized based on the number of \procs need to be \texttt{realloc}ed.
     17As such all internal arrays that are sized based on the number of \procs need to be @realloc@ed.
     1818This also means that any references into these arrays, pointers or indexes, may need to be fixed when shrinking\footnote{Indexes may still need fixing when shrinking because some indexes are expected to refer to dense contiguous resources and there is no guarantee the resource being removed has the highest index.}.
    1919
     
     107107First, some data structure needs to keep track of all \procs that are in idle sleep.
     108108Because idle sleep can be spurious, this data structure has strict performance requirements, in addition to the strict correctness requirements.
    109 Next, some tool must be used to block kernel threads \glspl{kthrd}, \eg \texttt{pthread\_cond\_wait}, pthread semaphores.
     109Next, some tool must be used to block kernel threads \glspl{kthrd}, \eg @pthread_cond_wait@, pthread semaphores.
     110110The challenge here is to support \at parking and unparking, timers, \io operations, and all other \CFA features with minimal complexity.
     111111Finally, idle sleep also includes a heuristic to determine the appropriate number of \procs to be in idle sleep at any given time.
     
     117117In terms of blocking a \gls{kthrd} until some event occurs, the Linux kernel has many available options:
    118118
    119 \paragraph{\texttt{pthread\_mutex}/\texttt{pthread\_cond}}
    120 The most classic option is to use some combination of \texttt{pthread\_mutex} and \texttt{pthread\_cond}.
    121 These serve as straight forward mutual exclusion and synchronization tools and allow a \gls{kthrd} to wait on a \texttt{pthread\_cond} until signalled.
    122 While this approach is generally perfectly appropriate for \glspl{kthrd} waiting after eachother, \io operations do not signal \texttt{pthread\_cond}s.
    123 For \io results to wake a \proc waiting on a \texttt{pthread\_cond} means that a different \glspl{kthrd} must be woken up first, and then the \proc can be signalled.
    124 
    125 \subsection{\texttt{io\_uring} and Epoll}
    126 An alternative is to flip the problem on its head and block waiting for \io, using \texttt{io\_uring} or even \texttt{epoll}.
     119\paragraph{\lstinline{pthread_mutex}/\lstinline{pthread_cond}}
     120The most classic option is to use some combination of @pthread_mutex@ and @pthread_cond@.
      121These serve as straightforward mutual exclusion and synchronization tools and allow a \gls{kthrd} to wait on a @pthread_cond@ until signalled.
      122While this approach is generally perfectly appropriate for \glspl{kthrd} waiting on each other, \io operations do not signal @pthread_cond@s.
      123For \io results to wake a \proc waiting on a @pthread_cond@, a different \gls{kthrd} must be woken up first, and only then can the \proc be signalled.
     124
     125\subsection{\lstinline{io_uring} and Epoll}
     126An alternative is to flip the problem on its head and block waiting for \io, using @io_uring@ or even @epoll@.
     127127This creates the inverse situation, where \io operations directly wake sleeping \procs, but waking a \proc from a running \gls{kthrd} must use an indirect scheme.
     128128This generally takes the form of creating a file descriptor, \eg, a dummy file, a pipe or an event fd, and using that file descriptor when \procs need to wake each other.
     
    132132\subsection{Event FDs}
    133133Another interesting approach is to use an event file descriptor\cit{eventfd}.
    134 This is a Linux feature that is a file descriptor that behaves like \io, \ie, uses \texttt{read} and \texttt{write}, but also behaves like a semaphore.
      134This Linux feature is a file descriptor that behaves like \io, \ie, uses @read@ and @write@, but also behaves like a semaphore.
     135135Indeed, all reads and writes must use 64-bit values\footnote{On 64-bit Linux; a 32-bit Linux uses 32-bit values.}.
    136 Writes add their values to the buffer, that is arithmetic addition and not buffer append, and reads zero out the buffer and return the buffer values so far\footnote{This is without the \texttt{EFD\_SEMAPHORE} flag. This flags changes the behavior of \texttt{read} but is not needed for this work.}.
      136Writes add their values to the buffer, that is arithmetic addition and not buffer append, and reads zero out the buffer and return its accumulated value\footnote{
      137This is without the \lstinline{EFD_SEMAPHORE} flag. This flag changes the behavior of \lstinline{read} but is not needed for this work.}.
    137138If a read is made while the buffer is already 0, the read blocks until a non-0 value is added.
    138 What makes this feature particularly interesting is that \texttt{io\_uring} supports the \texttt{IORING\_REGISTER\_EVENTFD} command, to register an event fd to a particular instance.
    139 Once that instance is registered, any \io completion will result in \texttt{io\_uring} writing to the event FD.
     139What makes this feature particularly interesting is that @io_uring@ supports the @IORING_REGISTER_EVENTFD@ command, to register an event fd to a particular instance.
      140Once that instance is registered, any \io completion results in @io_uring@ writing to the event FD.
     140141This means that a \proc waiting on the event FD can be \emph{directly} woken up by either other \procs or incoming \io.
    141142
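The following sketch shows the intended use (error handling elided; the registration is made directly through @syscall@, and the helper names are illustrative):
\begin{lstlisting}
#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/io_uring.h>
// Illustrative sketch: one event fd per proc, also registered with its io_uring instance.
int setup( int ring_fd ) {
	int evfd = eventfd( 0, 0 );                 // counter starts at 0
	// after registration, every I/O completion makes the kernel write to evfd
	syscall( __NR_io_uring_register, ring_fd, IORING_REGISTER_EVENTFD, &evfd, 1 );
	return evfd;
}
void notify( int evfd ) {                       // another proc wakes the sleeper
	uint64_t one = 1;
	write( evfd, &one, sizeof(one) );           // arithmetic add, not append
}
void idle_sleep( int evfd ) {                   // blocks while the counter is 0
	uint64_t val;
	read( evfd, &val, sizeof(val) );            // returns and zeroes the counter
}
\end{lstlisting}
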
     
    172173This means that whichever entity removes idle \procs from the sleeper list must be able to do so in any order.
    173174Using a simple lock over this data structure makes the removal much simpler than using a lock-free data structure.
    174 The notification process then simply needs to wake-up the desired idle \proc, using \texttt{pthread\_cond\_signal}, \texttt{write} on an fd, etc., and the \proc will handle the rest.
      175The notification process then simply needs to wake up the desired idle \proc, using @pthread_cond_signal@, @write@ on an fd, etc., and the \proc handles the rest.
    175176
    176177\subsection{Reducing Latency}
     
    190191The contention is mostly due to the lock on the list needing to be held to get to the head \proc.
    191192That lock can be contended by \procs attempting to go to sleep, \procs waking or notification attempts.
    192 The contentention from the \procs attempting to go to sleep can be mitigated slightly by using \texttt{try\_acquire} instead, so the \procs simply continue searching for \ats if the lock is held.
      193The contention from the \procs attempting to go to sleep can be mitigated slightly by using @try_acquire@ instead, so the \procs simply continue searching for \ats if the lock is held.
    193194This trick cannot be used for waking \procs since they are not in a state where they can run \ats.
     194195However, it is worth noting that notification does not strictly require accessing the list or the head \proc.
    195196Therefore, contention can be reduced notably by having notifiers avoid the lock entirely and adding a pointer to the event fd of the first idle \proc, as in Figure~\ref{fig:idle2}.
    196 To avoid contention between the notifiers, instead of simply reading the atomic pointer, notifiers atomically exchange it to \texttt{null} so only only notifier will contend on the system call.
      197To avoid contention between the notifiers, instead of simply reading the atomic pointer, notifiers atomically exchange it to @null@, so only one notifier contends on the system call.
    197198
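A sketch of this notification fast path (field names illustrative, not the actual runtime code) is:
\begin{lstlisting}
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>
// Illustrative sketch: notifiers bypass the idle-list lock entirely.
static _Atomic(int *) first_idle_evfd = NULL;       // event fd of the head idle proc, or NULL

void notify_one( void ) {
	int * evfd = atomic_exchange( &first_idle_evfd, (int *)NULL );  // only one notifier wins
	if ( evfd ) {
		uint64_t one = 1;
		write( *evfd, &one, sizeof(one) );          // wake the idle proc
	}   // otherwise no proc is idle, or another notifier is already doing the wake-up
}
\end{lstlisting}
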
    198199\begin{figure}
     
    206207This can be done by adding what is effectively a benaphore\cit{benaphore} in front of the event fd.
    207208A simple three state flag is added beside the event fd to avoid unnecessary system calls, as shown in Figure~\ref{fig:idle:state}.
    208 The flag starts in state \texttt{SEARCH}, while the \proc is searching for \ats to run.
    209 The \proc then confirms the sleep by atomically swaping the state to \texttt{SLEEP}.
    210 If the previous state was still \texttt{SEARCH}, then the \proc does read the event fd.
    211 Meanwhile, notifiers atomically exchange the state to \texttt{AWAKE} state.
    212 if the previous state was \texttt{SLEEP}, then the notifier must write to the event fd.
     209The flag starts in state @SEARCH@, while the \proc is searching for \ats to run.
      210The \proc then confirms the sleep by atomically swapping the state to @SLEEP@.
      211If the previous state was still @SEARCH@, then the \proc reads the event fd.
      212Meanwhile, notifiers atomically exchange the state to @AWAKE@.
      213If the previous state was @SLEEP@, then the notifier must write to the event fd.
     213214However, if the notification arrives almost immediately after the \proc marks itself idle, then both reads and writes on the event fd can be omitted, which reduces latency notably.
    214215This leads to the final data structure shown in Figure~\ref{fig:idle}.
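
The resulting protocol, sketched below with C11 atomics (names illustrative; spurious wake-ups are tolerated, as discussed above), pairs the three-state flag with the event fd:
\begin{lstlisting}
#include <stdatomic.h>
#include <stdint.h>
#include <unistd.h>
enum { SEARCH, SLEEP, AWAKE };
// Illustrative sketch of the three-state flag in front of the event fd.
void go_idle( _Atomic int * state, int evfd ) {            // proc side
	atomic_store( state, SEARCH );
	/* ... search for threads to run; give up ... */
	if ( atomic_exchange( state, SLEEP ) == SEARCH ) {     // no notification raced in
		uint64_t val;
		read( evfd, &val, sizeof(val) );                   // actually block
	}                                                      // else: already notified, skip the read
}
void wake_idle( _Atomic int * state, int evfd ) {          // notifier side
	if ( atomic_exchange( state, AWAKE ) == SLEEP ) {      // proc has confirmed the sleep
		uint64_t one = 1;
		write( evfd, &one, sizeof(one) );
	}                                                      // else: no system call needed
}
\end{lstlisting}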
  • doc/theses/thierry_delisle_PhD/thesis/thesis.tex

    r4f3807d r847bb6f  
    108108        citecolor=OliveGreen,   % color of links to bibliography
    109109        filecolor=magenta,      % color of file links
    110         urlcolor=cyan           % color of external links
     110        urlcolor=blue,           % color of external links
     111        breaklinks=true
    111112}
    112113\ifthenelse{\boolean{PrintVersion}}{   % for improved print quality, change some hyperref options