Changeset 72d1118


Ignore:
Timestamp:
Sep 20, 2022, 2:42:15 PM (21 months ago)
Author:
Thierry Delisle <tdelisle@…>
Branches:
ADT, ast-experimental, master, pthread-emulation
Children:
8f1e035
Parents:
aa9f215
Message:

Finished final read

Location:
doc
Files:
5 edited

Legend:

Unmodified
Added
Removed
  • doc/bibliography/pl.bib

    raa9f215 r72d1118  
    188188        Unstructured {\it sends\/} and {\it accepts\/} are forbidden.  To
    189189        this the mechanisms of {\it delegation\/} and {\it delay queues\/}
    190         are added to enable switching and triggering of activities. 
    191         Concurrent subactivities and atomic actions are provided for 
     190        are added to enable switching and triggering of activities.
     191        Concurrent subactivities and atomic actions are provided for
    192192        compactness and simplicity.  We show how solutions to many important
    193193        concurrent problems [sic], such as pipelining, constraint management
    194         and ``administration'' can be compactly expressed using these 
     194        and ``administration'' can be compactly expressed using these
    195195        mechanisms.
    196196   },
     
    529529        like ``c is a collection with element type e'', but how such things
    530530        are used isn't explained.
    531        
     531
    532532        For each descriptive class used in a parameter list, an implicit
    533533        parameter is created that is passed a vector of procedures.
     
    11721172@techreport{Prokopec11,
    11731173    keywords    = {ctrie, concurrent map},
    1174     contributer = {a3moss@uwaterloo.ca}, 
     1174    contributer = {a3moss@uwaterloo.ca},
    11751175    title       = {Cache-aware lock-free concurrent hash tries},
    11761176    author      = {Prokopec, Aleksandar and Bagwell, Phil and Odersky, Martin},
     
    15001500    year        = 2001,
    15011501    url         = {http://citeseer.ist.psu.edu/berger01composing.html}
    1502 } 
     1502}
    15031503
    15041504@article{Andrews83,
     
    15451545        We give a rationale for our decisions and compare Concurrent C
    15461546        extensions with the concurrent programming facilities in Ada.
    1547         Concurrent C has been implemented on the UNIX system running on a 
     1547        Concurrent C has been implemented on the UNIX system running on a
    15481548        single processor.  A distributed version of Concurrent C is being
    15491549        implemented.
     
    18141814    keywords    = {objects, concurrency},
    18151815    contributer = {gjditchfield@plg},
    1816     author      = {P. A. Buhr and G. J. Ditchfield and B. M. Younger and C. R. Zarnke}, 
     1816    author      = {P. A. Buhr and G. J. Ditchfield and B. M. Younger and C. R. Zarnke},
    18171817    title       = {Concurrency in {C}{\kern-.1em\hbox{\large\texttt{+\kern-.25em+}}}},
    18181818    institution = {Department of Computer Science, University of Waterloo},
     
    20442044    series      = {Lecture Notes in Computer Science, Ed. by G. Goos and J. Hartmanis}
    20452045}
    2046  
     2046
    20472047@article{Wang71,
    20482048    keywords    = {coroutines},
     
    20562056    pages       = {425-449},
    20572057}
    2058  
     2058
    20592059@article{Castagna95,
    20602060    keywords    = {type-systems, covariance, contravariance},
     
    23902390    year        = 1996,
    23912391}
    2392  
     2392
    23932393@article{Richardson93,
    23942394    keywords    = {C++, persistence, database},
     
    24732473    publisher   = {ACM},
    24742474    address     = {New York, NY, USA},
    2475 } 
     2475}
    24762476
    24772477@article{design,
     
    27002700    publisher   = {ACM},
    27012701    address     = {New York, NY, USA},
    2702 } 
     2702}
    27032703
    27042704@book{Eiffel,
     
    33573357    publisher   = {ACM},
    33583358    address     = {New York, NY, USA},
    3359 } 
     3359}
    33603360
    33613361@manual{Fortran95,
     
    37403740    keywords    = {processes, distributed computing},
    37413741    contributer = {pabuhr@plg},
    3742     author      = {Robert E. Strom and David F. Bacon and Arthur P. Goldberg and Andy Lowry and Daniel M. Yellin and Shaula Alexander Yemini}, 
     3742    author      = {Robert E. Strom and David F. Bacon and Arthur P. Goldberg and Andy Lowry and Daniel M. Yellin and Shaula Alexander Yemini},
    37433743    title       = {Hermes: A Language for Distributed Computing},
    37443744    institution = {IBM T. J. Watson Research Center},
     
    37513751    keywords    = {processes, distributed computing},
    37523752    contributer = {pabuhr@plg},
    3753     author      = {Robert E. Strom and David F. Bacon and Arthur P. Goldberg and Andy Lowry and Daniel M. Yellin and Shaula Alexander Yemini}, 
     3753    author      = {Robert E. Strom and David F. Bacon and Arthur P. Goldberg and Andy Lowry and Daniel M. Yellin and Shaula Alexander Yemini},
    37543754    title       = {Hermes: A Language for Distributed Computing},
    37553755    publisher   = {Prentice-Hall},
     
    43024302    pages       = {85-103}
    43034303}
    4304    
     4304
    43054305@article{Murer96,
    43064306    keywords    = {interators, generators, cursors},
     
    43304330
    43314331% J
    4332                  
     4332
    43334333@book{Java,
    43344334    keywords    = {Java},
     
    46274627    publisher   = {ACM},
    46284628    address     = {New York, NY, USA},
    4629 } 
     4629}
    46304630
    46314631@article{Dice15,
     
    49784978    number      = 31
    49794979}
    4980  
     4980
    49814981@article{Dueck90,
    49824982    keywords    = {attribute grammars},
     
    51075107    keywords    = {multiple inheritance},
    51085108    contributer = {pabuhr@plg},
    5109     author      = {Harry Bretthauer and Thomas Christaller and J\"{u}rgen Kopp}, 
     5109    author      = {Harry Bretthauer and Thomas Christaller and J\"{u}rgen Kopp},
    51105110    title       = {Multiple vs. Single Inheritance in Object-oriented Programming Languages. What do we really want?},
    51115111    institution = {Gesellschaft F\"{u}r Mathematik und Datenverarbeitung mbH},
     
    56505650    publisher   = {ACM},
    56515651    address     = {New York, NY, USA},
    5652 } 
     5652}
    56535653
    56545654@book{Deitel04,
     
    58275827    year        = 1980,
    58285828    month       = nov, volume = 15, number = 11, pages = {47-56},
    5829     note        = {Proceedings of the ACM-SIGPLAN Symposium on the {Ada} Programming Language}, 
     5829    note        = {Proceedings of the ACM-SIGPLAN Symposium on the {Ada} Programming Language},
    58305830    comment     = {
    58315831        The two-pass (bottom-up, then top-down) algorithm, with a proof
     
    59575957        Given a base typed lambda calculus with function types, type
    59585958        abstractions, and a recursive expression \(\mbox{fix } x:t.e\),
    5959         then type inference for the partially typed language 
     5959        then type inference for the partially typed language
    59605960        \begin{eqnarray}
    59615961        \lambda x:\tau.e        &\Rightarrow& \lambda x.e       \\
     
    66036603        manner.  The paper also discusses efficient composition of
    66046604        sequences of asynchronous calls to different locations in a
    6605         network. 
     6605        network.
    66066606    }
    66076607}
     
    66166616    volume      = 32, number = 4, pages = {305-311},
    66176617    abstract    = {
    6618        
     6618
    66196619    }
    66206620}
     
    69346934        partitioning switch statements into dense tables.  It also
    69356935        implements target-independent function tracing and expression-level
    6936         profiling. 
     6936        profiling.
    69376937    }
    69386938}
     
    71507150    publisher   = {ACM},
    71517151    address     = {New York, NY, USA},
    7152 } 
     7152}
    71537153
    71547154@inproceedings{Leissa14,
     
    72687268    keywords    = {Smalltalk, abstract class, protocol},
    72697269    contributer = {gjditchfield@plg},
    7270     author      = {A. Goldberg and D. Robson}, 
     7270    author      = {A. Goldberg and D. Robson},
    72717271    title       = {Smalltalk-80: The Language and its Implementation},
    72727272    publisher   = {Addison-Wesley},
     
    78457845    title       = {Thread (computing)},
    78467846    author      = {{Threading Model}},
    7847     howpublished= {\href{https://en.wikipedia.org/wiki/Thread_(computing)}{https://\-en.wikipedia.org/\-wiki/\-Thread\_(computing)}},
     7847    howpublished= {\href{https://en.wikipedia.org/wiki/Thread_(computing)}{https://\-en.wikipedia.org/\-wiki/\-Thread\_\-(computing)}},
    78487848}
    78497849
     
    78897889    note        = {Lecture Notes in Computer Science, v. 19},
    78907890    abstract    = {
    7891        
     7891
    78927892    }
    78937893}
     
    80088008    publisher   = {USENIX Association},
    80098009    address     = {Berkeley, CA, USA},
    8010 } 
     8010}
    80118011
    80128012@article{Leroy00,
     
    83548354    author      = {Bjarne Stroustrup},
    83558355    title       = {What is ``Object-Oriented Programming''?},
    8356     booktitle   = {Proceedings of the First European Conference on Object Oriented Programming}, 
     8356    booktitle   = {Proceedings of the First European Conference on Object Oriented Programming},
    83578357    month       = jun,
    83588358    year        = 1987
     
    83968396    publisher   = {ACM},
    83978397    address     = {New York, NY, USA},
    8398 } 
     8398}
    83998399
    84008400% X
  • doc/theses/thierry_delisle_PhD/thesis/local.bib

    raa9f215 r72d1118  
    583583  title={Per-entity load tracking},
    584584  author={Corbet, Jonathan},
    585   journal={LWN article, available at: https://lwn.net/Articles/531853},
     585  journal={LWN article, available at: {\href{https://lwn.net/Articles/531853}{https://\-lwn.net/\-Articles/\-531853}}},
    586586  year={2013}
    587587}
     
    717717  title = {Scheduling Benchmarks},
    718718  author = {Thierry Delisle},
    719   howpublished = {\href{https://github.com/cforall/SchedulingBenchmarks_PhD22}{https://\-github.com/\-cforall/\-SchedulingBenchmarks\_\-PhD22}},
     719  howpublished = {\href{https://github.com/cforall/SchedulingBenchmarks_PhD22}{https://\-github.com/\-cforall/\-Scheduling\-Benchmarks\_\-PhD22}},
    720720}
    721721
     
    832832  title      = "eventfd(2) Linux User's Manual",
    833833  year       = "2019",
    834   month      = "MArch",
     834  month      = "March",
    835835}
    836836
     
    10601060  year = "2020",
    10611061  month = "June",
    1062   howpublished = "\href{https://xkcd.com/2318/}",
     1062  howpublished = "\href{https://xkcd.com/2318/}{https://\-xkcd.com/\-2318/}",
    10631063  note = "[Online; accessed 10-June-2020]"
    10641064}
     
    10691069  year = "2011",
    10701070  month = "June",
    1071   howpublished = "\href{https://xkcd.com/908/}",
     1071  howpublished = "\href{https://xkcd.com/908/}{https://\-xkcd.com/\-908/}",
    10721072  note = "[Online; accessed 25-August-2022]"
    10731073}
  • doc/theses/thierry_delisle_PhD/thesis/text/eval_micro.tex

    raa9f215 r72d1118  
    187187Looking at the left column on AMD, Figures~\ref{fig:cycle:nasus:ops} and \ref{fig:cycle:nasus:ns} all 4 runtimes achieve very similar throughput and scalability.
    188188However, as the number of \procs grows higher, the results on AMD show notably more variability than on Intel.
    189 The different performance improvements and plateaus are due to cache topology and appear at the expected: \proc counts of 64, 128 and 192, for the same reasons as on Intel.
     189The different performance improvements and plateaus are due to cache topology and appear at the expected \proc counts of 64, 128 and 192, for the same reasons as on Intel.
    190190Looking next at the right column on AMD, Figures~\ref{fig:cycle:nasus:low:ops} and \ref{fig:cycle:nasus:low:ns}, Tokio and Go have the same throughput performance, while \CFA is slightly slower.
    191191This result is different than on Intel, where Tokio behaved like \CFA rather than behaving like Go.
     
    491491\section{Locality}
    492492
    493 As mentioned in the churn benchmark, when \glslink{atsched}{unparking} a \at, it is possible to either \unpark to the local or remote ready-queue.\footnote{
    494 It is also possible to \unpark to a third unrelated ready-queue, but without additional knowledge about the situation, it is likely to degrade performance.}
     493As mentioned in the churn benchmark, when \glslink{atsched}{unparking} a \at, it is possible to either \unpark to the local or remote sub-queue.\footnote{
     494It is also possible to \unpark to a third unrelated sub-queue, but without additional knowledge about the situation, it is likely to degrade performance.}
    495495The locality experiment includes two variations of the churn benchmark, where a data array is added.
    496496In both variations, before @V@ing the semaphore, each \at calls a @work@ function which increments random cells inside the data array.
     
    720720
    721721In both variations, the experiment effectively measures how long it takes for all \ats to run once after a given synchronization point.
    722 In an ideal scenario where the scheduler is strictly FIFO, every thread would run once after the synchronization and therefore the delay between leaders would be given by, $(CSL + SL) / (NP - 1)$,
    723 where $CSL$ is the context-switch latency, $SL$ is the cost for enqueueing and dequeuing a \at, and $NP$ is the number of \procs.
     722In an ideal scenario where the scheduler is strictly FIFO, every thread would run once after the synchronization and therefore the delay between leaders would be given by, $NT(CSL + SL) / (NP - 1)$,
     723where $CSL$ is the context-switch latency, $SL$ is the cost for enqueueing and dequeuing a \at, $NT$ is the number of \ats, and $NP$ is the number of \procs.
    724724However, if the scheduler allows \ats to run many times before other \ats can run once, this delay increases.
    725725The semaphore version is an approximation of strictly FIFO scheduling, where none of the \ats \emph{attempt} to run more than once.
     
    734734
    735735\begin{table}
    736 \caption[Transfer Benchmark on Intel and AMD]{Transfer Benchmark on Intel and AMD\smallskip\newline Average measurement of how long it takes for all \ats to acknowledge the leader \at.
    737 DNC stands for ``did not complete'', meaning that after 5 seconds of a new leader being decided, some \ats still had not acknowledged the new leader.}
    738 \label{fig:transfer:res}
    739736\setlength{\extrarowheight}{2pt}
    740737\setlength{\tabcolsep}{5pt}
     
    753750\end{tabular}
    754751\end{centering}
     752\caption[Transfer Benchmark on Intel and AMD]{Transfer Benchmark on Intel and AMD\smallskip\newline Average measurement of how long it takes for all \ats to acknowledge the leader \at.
     753DNC stands for ``did not complete'', meaning that after 5 seconds of a new leader being decided, some \ats still had not acknowledged the new leader.}
     754\label{fig:transfer:res}
    755755\end{table}
    756756
     
    768768
    769769Looking at the next two columns, the results for the yield variation on Intel, the story is very different.
    770 \CFA achieves better latencies, presumably due to no synchronization with the yield.
     770\CFA achieves better latencies, presumably due to the lack of synchronization with the yield.
    771771Go does complete the experiment, but with drastically higher latency:
    772772latency at 2 \procs is $350\times$ higher than \CFA and $70\times$ higher at 192 \procs.
    773 This difference is because Go has a classic work-stealing scheduler, but it adds coarse-grain preemption
    774 , which interrupts the spinning leader after a period.
     773This difference is because Go has a classic work-stealing scheduler, but it adds coarse-grain preemption, which interrupts the spinning leader after a period.
    775774Neither Libfibre nor Tokio complete the experiment.
    776775Both runtimes also use classical work-stealing scheduling without preemption, and therefore, none of the work queues are ever emptied so no load balancing occurs.
  • doc/theses/thierry_delisle_PhD/thesis/text/io.tex

    raa9f215 r72d1118  
    2525\paragraph{\lstinline{select}} is the oldest of these options, and takes as input a contiguous array of bits, where each bit represents a file descriptor of interest.
    2626Hence, the array length must be as long as the largest FD currently of interest.
    27 On return, it outputs the set in place to identify which of the file descriptors changed state.
     27On return, it outputs the set motified in place to identify which of the file descriptors changed state.
    2828This destructive change means selecting in a loop requires re-initializing the array for each iteration.
    2929Another limit of @select@ is that calls from different \glspl{kthrd} sharing FDs are independent.
     
    3535\paragraph{\lstinline{poll}} is the next oldest option, and takes as input an array of structures containing the FD numbers rather than their position in an array of bits, allowing a more compact input for interest sets that contain widely spaced FDs.
    3636For small interest sets with densely packed FDs, the @select@ bit mask can take less storage, and hence, copy less information into the kernel.
    37 Furthermore, @poll@ is non-destructive, so the array of structures does not have to be re-initialized on every call.
     37However, @poll@ is non-destructive, so the array of structures does not have to be re-initialized on every call.
    3838Like @select@, @poll@ suffers from the limitation that the interest set cannot be changed by other \glspl{kthrd}, while a manager thread is blocked in @poll@.
    3939
     
    314314A simple approach to polling is to allocate a \at per @io_uring@ instance and simply let the poller \ats poll their respective instances when scheduled.
    315315
    316 With the pool of SEQ instances approach, the big advantage is that it is fairly flexible.
     316With the pool of SQE instances approach, the big advantage is that it is fairly flexible.
    317317It does not impose restrictions on what \ats submitting \io operations can and cannot do between allocations and submissions.
    318318It also can gracefully handle running out of resources, SQEs or the kernel returning @EBUSY@.
     
    320320The routing and allocation algorithm needs to keep track of which ring instances have available SQEs, block incoming requests if no instance is available, prevent barging if \ats are already queued up waiting for SQEs and handle SQEs being freed.
    321321The submission side needs to safely append SQEs to the ring buffer, correctly handle chains, make sure no SQE is dropped or left pending forever, notify the allocation side when SQEs can be reused, and handle the kernel returning @EBUSY@.
    322 Compared to the private-instance approach, all this synchronization has a significant cost this synchronization is entirely overhead.
     322Compared to the private-instance approach, all this synchronization has a significant cost and this synchronization is entirely overhead.
    323323
    324324\subsubsection{Instance borrowing}
  • doc/theses/thierry_delisle_PhD/thesis/text/practice.tex

    raa9f215 r72d1118  
    101101This leaves too many \procs when there are not enough \ats for all the \procs to be useful.
    102102These idle \procs cannot be removed because their lifetime is controlled by the application, and only the application knows when the number of \ats may increase or decrease.
    103 While idle \procs can spin until work appears, this approach wastes energy, unnecessarily produces heat and prevents other applications from using the processor.
     103While idle \procs can spin until work appears, this approach wastes energy, unnecessarily produces heat and prevents other applications from using the \gls{hthrd}.
    104104Therefore, idle \procs are put into an idle state, called \newterm{Idle-Sleep}, where the \gls{kthrd} is blocked until the scheduler deems it is needed.
    105105
     
    107107First, a data structure needs to keep track of all \procs that are in idle sleep.
    108108Because idle sleep is spurious, this data structure has strict performance requirements, in addition to strict correctness requirements.
    109 Next, some mechanism is needed to block \glspl{kthrd}, \eg @pthread_cond_wait@ on a pthread semaphore.
     109Next, some mechanism is needed to block \glspl{kthrd}, \eg @pthread_cond_wait@ or a pthread semaphore.
    110110The complexity here is to support \at \glslink{atblock}{parking} and \glslink{atsched}{unparking}, user-level locking, timers, \io operations, and all other \CFA features with minimal complexity.
    111111Finally, the scheduler needs a heuristic to determine when to block and unblock an appropriate number of \procs.
Note: See TracChangeset for help on using the changeset viewer.