Changeset 87f572e for doc


Ignore:
Timestamp:
Mar 9, 2020, 11:09:41 AM (21 months ago)
Author:
Thierry Delisle <tdelisle@…>
Branches:
arm-eh, jacob/cs343-translation, jenkins-sandbox, master, new-ast, new-ast-unique-expr
Children:
5b544a6
Parents:
331eacb (diff), 3d3cbd0 (diff)
Note: this is a merge changeset, the changes displayed below correspond to the merge itself.
Use the (diff) links above to see all the changes relative to each parent.
Message:

Merge branch 'master' of plg.uwaterloo.ca:software/cfa/cfa-cc

Location:
doc
Files:
6 added
6 edited

Legend:

Unmodified
Added
Removed
  • doc/theses/andrew_beach_MMath/Makefile

    r331eacb r87f572e  
    55TEXSRC=$(wildcard *.tex)
    66BIBSRC=$(wildcard *.bib)
     7STYSRC=$(wildcard *.sty)
     8CLSSRC=$(wildcard *.cls)
    79TEXLIB= .:${BUILD}:
    810BIBLIB= .:../../bibliography
     
    2426all: ${DOC}
    2527
    26 ${BUILD}/${DOC}: ${TEXSRC} ${BIBSRC} Makefile | ${BUILD}
     28${BUILD}/${DOC}: ${TEXSRC} ${BIBSRC} ${STYSRC} ${CLSSRC} Makefile | ${BUILD}
    2729        ${LATEX} ${BASE}
    2830        ${BIBTEX} ${BUILD}/${BASE}
  • doc/theses/andrew_beach_MMath/thesis.tex

    r331eacb r87f572e  
    1 % uWaterloo Thesis Template for LaTeX
    2 % Last Updated June 14, 2017 by Stephen Carr, IST Client Services
    3 % FOR ASSISTANCE, please send mail to rt-IST-CSmathsci@ist.uwaterloo.ca
    4 
    5 % Effective October 2006, the University of Waterloo
    6 % requires electronic thesis submission. See the uWaterloo thesis regulations at
    7 % https://uwaterloo.ca/graduate-studies/thesis.
    8 
    9 % DON'T FORGET TO ADD YOUR OWN NAME AND TITLE in the "hyperref" package
    10 % configuration. THIS INFORMATION GETS EMBEDDED IN THE FINAL PDF DOCUMENT.
    11 % You can view the information if you view Properties of the PDF document.
    12 
    13 % Many faculties/departments also require one or more printed
    14 % copies. This template attempts to satisfy both types of output.
    15 % It is based on the standard "book" document class which provides all
    16 % necessary sectioning structures and allows multi-part theses.
    17 
    18 % DISCLAIMER
    19 % To the best of our knowledge, this template satisfies the current uWaterloo
    20 % requirements. However, it is your responsibility to assure that you have met
    21 % all  requirements of the University and your particular department.
    22 % Many thanks for the feedback from many graduates that assisted the
    23 % development of this template.
    24 
    25 % -----------------------------------------------------------------------
    26 
    27 % By default, output is produced that is geared toward generating a PDF
    28 % version optimized for viewing on an electronic display, including
    29 % hyperlinks within the PDF.
    30  
    31 % E.g. to process a thesis called "mythesis.tex" based on this template, run:
    32 
    33 % pdflatex mythesis     -- first pass of the pdflatex processor
    34 % bibtex mythesis       -- generates bibliography from .bib data file(s)
    35 % makeindex         -- should be run only if an index is used
    36 % pdflatex mythesis     -- fixes numbering in cross-references,
    37 % pdflatex mythesis --   bibliographic references, glossaries, index, etc.
    38 
    39 % N.B. The "pdftex" program allows graphics in the following formats to be
    40 % included with the "\includegraphics" command: PNG, PDF, JPEG, TIFF
    41 % Tip 1: Generate your figures and photos in the size you want them to appear
    42 % in your thesis, rather than scaling them with \includegraphics options.
    43 % Tip 2: Any drawings you do should be in scalable vector graphic formats:
    44 % SVG, PNG, WMF, EPS and then converted to PNG or PDF, so they are scalable in
    45 % the final PDF as well.
    46 % Tip 3: Photographs should be cropped and compressed so as not to be too large.
    47 
    48 % To create a PDF output that is optimized for double-sided printing:
    49 %
    50 % 1) comment-out the \documentclass statement in the preamble below, and
    51 % un-comment the second \documentclass line.
    52 %
    53 % 2) change the value assigned below to the boolean variable
    54 % "PrintVersion" from "false" to "true".
    55 
    56 % --------------------- Start of Document Preamble -----------------------
    57 
    58 % Specify the document class, default style attributes, and page dimensions
    59 % For hyperlinked PDF, suitable for viewing on a computer, use this:
    60 \documentclass[letterpaper,12pt,titlepage,oneside,final]{book}
    61  
    62 % For PDF, suitable for double-sided printing, change the PrintVersion
    63 % variable below to "true" and use this \documentclass line instead of the one
    64 % above:
    65 %\documentclass[letterpaper,12pt,titlepage,openright,twoside,final]{book}
    66 
    67 % Some LaTeX commands I define for my own nomenclature.
    68 % If you have to, it's better to change nomenclature once here than in a
    69 % million places throughout your thesis!
    70 \newcommand{\package}[1]{\textbf{#1}} % package names in bold text
    71 \newcommand{\cmmd}[1]{\textbackslash\texttt{#1}} % command name in tt font
    72 \newcommand{\href}[1]{#1} % does nothing, but defines the command so the
    73   % print-optimized version will ignore \href tags (redefined by hyperref pkg).
    74 %\newcommand{\texorpdfstring}[2]{#1} % does nothing, but defines the command
    75 % Anything defined here may be redefined by packages added below...
    76 
    77 % This package allows if-then-else control structures.
    78 \usepackage{ifthen}
    79 \newboolean{PrintVersion}
    80 \setboolean{PrintVersion}{false}
    81 % CHANGE THIS VALUE TO "true" as necessary, to improve printed results for
    82 % hard copies by overriding some options of the hyperref package below.
     1% Main tex file for thesis document.
     2\documentclass[digital]{uw-ethesis}
     3
     4% Commands used in documenting how to use the template. To remove.
     5\newcommand{\package}[1]{\textbf{#1}}
     6\newcommand{\cmmd}[1]{\textbackslash\texttt{#1}}
     7\newcommand{\href}[1]{#1}
    838
    849% For a nomenclature (optional; available from ctan.org)
     
    8611% Lots of math symbols and environments
    8712\usepackage{amsmath,amssymb,amstext}
    88 % For including graphics N.B. pdftex graphics driver
     13% For including graphics, sets the pdftex graphics driver.
    8914\usepackage[pdftex]{graphicx}
    9015
    91 % I believe the general index function is covered by the glossaries.
    92 % \usepackage{makeidx}
    93 % \makeindex
    94 
    95 % Hyperlinks make it very easy to navigate an electronic document.
    96 % In addition, this is where you should specify the thesis title
    97 % and author as they appear in the properties of the PDF document.
    98 % Use the "hyperref" package
    99 % N.B. HYPERREF MUST BE THE LAST PACKAGE LOADED; ADD ADDITIONAL PKGS ABOVE
    100 % N.B. pagebackref=true provides links back from the References to the body
    101 % text. This can cause trouble for printing.
    102 \usepackage[pdftex,pagebackref=false]{hyperref} % with basic options
    103 \hypersetup{
    104     plainpages=false,       % needed if Roman numbers in frontpages
    105     unicode=false,          % non-Latin characters in Acrobat’s bookmarks
    106     pdftoolbar=true,        % show Acrobat’s toolbar?
    107     pdfmenubar=true,        % show Acrobat’s menu?
    108     pdffitwindow=false,     % window fit to page when opened
    109     pdfstartview={FitH},    % fits the width of the page to the window
    110     pdftitle={uWaterloo\ LaTeX\ Thesis\ Template},    % title: CHANGE THIS TEXT!
    111 %    pdfauthor={Author},    % author: CHANGE THIS TEXT! and uncomment this line
    112 %    pdfsubject={Subject},  % subject: CHANGE THIS TEXT! and uncomment this line
    113 %    pdfkeywords={keyword1} {key2} {key3}, % list of keywords, and uncomment this line if desired
    114     pdfnewwindow=true,      % links in new window
    115     colorlinks=true,        % false: boxed links; true: colored links
    116     linkcolor=blue,         % color of internal links
    117     citecolor=green,        % color of links to bibliography
    118     filecolor=magenta,      % color of file links
    119     urlcolor=cyan           % color of external links
    120 }
    121 \ifthenelse{\boolean{PrintVersion}}{
    122   % for improved print quality, override some hyperref options
    123 \hypersetup{
    124 %    colorlinks,%
    125     citecolor=black,%
    126     filecolor=black,%
    127     linkcolor=black,%
    128     urlcolor=black}
    129 }{} % end of ifthenelse (no else)
    130 
    131 \usepackage[toc,abbreviations]{glossaries-extra} % Exception to the
    132 % rule of hyperref being the last add-on package. If glossaries-extra is not
    133 % in your LaTeX distribution, get it from CTAN
    134 % (http://ctan.org/pkg/glossaries-extra).
    135 
    136 % Setting up the page margins...
    137 % uWaterloo thesis requirements specify a minimum of 1 inch (72pt) margin at
    138 % the top, bottom, and outside page edges and a 1.125 in. (81pt) gutter
    139 % margin (on binding side). While this is not an issue for electronic
    140 % viewing, a PDF may be printed, and so we have the same page layout for
    141 % both printed and electronic versions, we leave the gutter margin in.
    142 % Set margins to minimum permitted by uWaterloo thesis regulations:
    143 \setlength{\marginparwidth}{0pt} % width of margin notes
    144 % N.B. If margin notes are used, you must adjust \textwidth, \marginparwidth
    145 % and \marginparsep so that the space left between the margin notes and page
    146 % edge is less than 15 mm (0.6 in.)
    147 % Width of space between body text and margin notes.
    148 \setlength{\marginparsep}{0pt}
    149 \setlength{\evensidemargin}{0.125in} % Adds 1/8 in. to binding side of all
    150 % even-numbered pages when the "twoside" printing option is selected
    151 \setlength{\oddsidemargin}{0.125in} % Adds 1/8 in. to the left of all pages
    152 % when "oneside" printing is selected, and to the left of all odd-numbered
    153 % pages when "twoside" printing is selected
    154 % Assuming US letter paper (8.5 in. x 11 in.) and side margins as above.
    155 \setlength{\textwidth}{6.375in}
    156 \raggedbottom
    157 
    158 % The following statement specifies the amount of space between paragraphs.
    159 % Other reasonable specifications are \bigskipamount and \smallskipamount.
    160 \setlength{\parskip}{\medskipamount}
    161 
    162 % The following statement controls the line spacing.  The default
    163 % spacing corresponds to good typographic conventions and only slight
    164 % changes (e.g., perhaps "1.2"), if any, should be made.
    165 \renewcommand{\baselinestretch}{1} % this is the default line space setting
    166 
    167 % By default, each chapter will start on a recto (right-hand side)
    168 % page.  We also force each section of the front pages to start on
    169 % a recto page by inserting \cleardoublepage commands.
    170 % In many cases, this will require that the verso page be
    171 % blank and, while it should be counted, a page number should not be
    172 % printed.  The following statements ensure a page number is not
    173 % printed on an otherwise blank verso page.
    174 \let\origdoublepage\cleardoublepage
    175 \newcommand{\clearemptydoublepage}{%
    176   \clearpage{\pagestyle{empty}\origdoublepage}}
    177 \let\cleardoublepage\clearemptydoublepage
    178 
    179 % Define Glossary terms (This is properly done here, in the preamble.
    180 % Could be \input{} from a file...)
     16\usehyperrefpackage[pdftex,pagebackref=false]{
     17    pdftitle={Exception Handling in CFA},
     18    pdfauthor={Andrew James Beach},
     19    pdfsubject={Programming Languages},
     20    pdfkeywords={exceptions,implementation},
     21}
     22
     23% The \phantomsection is used to help the hyperref package create links.
     24
     25% Maybe only package that should be loaded after the hyperref package.
     26% From http://ctan.org/pkg/glossaries-extra, extends glossaries which replaces
     27% glossary and builds off of the makeindex system.
     28\usepackage[toc,abbreviations]{glossaries-extra}
     29
    18130% Main glossary entries -- definitions of relevant terminology
    18231\newglossaryentry{computer}
     
    20958description={Random vector: a location in n-dimensional Cartesian space, where each dimensional component is determined by a random process}
    21059}
    211  
     60
     61% Generate the glossaries defined above.
    21262\makeglossaries
    21363
    214 %======================================================================
    215 %   L O G I C A L    D O C U M E N T -- the content of your thesis
    216 %======================================================================
    21764\begin{document}
    21865
    219 % For a large document, it is a good idea to divide your thesis
    220 % into several files, each one containing one chapter.
    221 % To illustrate this idea, the "front pages" (i.e., title page,
    222 % declaration, borrowers' page, abstract, acknowledgements,
    223 % dedication, table of contents, list of tables, list of figures,
    224 % nomenclature) are contained within the file "uw-ethesis-frontpgs.tex" which
    225 % is included into the document by the following statement.
    22666%----------------------------------------------------------------------
    22767% FRONT MATERIAL
     
    23272% MAIN BODY
    23373%----------------------------------------------------------------------
    234 % Because this is a short document, and to reduce the number of files
    235 % needed for this template, the chapters are not separate
    236 % documents as suggested above, but you get the idea. If they were
    237 % separate documents, they would each start with the \chapter command, i.e, do
    238 % not contain \documentclass or \begin{document} and \end{document} commands.
    23974%======================================================================
    24075\chapter{Introduction}
     
    327162%----------------------------------------------------------------------
    328163
    329 % B I B L I O G R A P H Y
    330 % -----------------------
    331 
    332 % The following statement selects the style to use for references. It controls
    333 % the sort order of the entries in the bibliography and also the formatting
    334 % for the in-text labels.
    335 \bibliographystyle{plain}
    336 % This specifies the location of the file containing the bibliographic
    337 % information. It assumes you're using BibTeX (if not, why not?).
    338 
    339 % This is needed if the book class is used, to place the anchor in the correct
    340 % page, because the bibliography will start on its own page.
     164%----------------------------------------------------------------------
     165% BIBLIOGRAPHY
     166%----------------------------------------------------------------------
     167
    341168% Use \clearpage instead if the document class uses the "oneside" argument.
    342169\cleardoublepage
    343 % With hyperref package, enables hyperlinking from the table of contents to
    344 % bibliography
    345170\phantomsection
    346171
    347 % The following statement causes the title "References" to be used for the
    348 % bibliography section:
    349 \renewcommand*{\bibname}{References}
    350 
    351 % Add the References to the Table of Contents
    352 \addcontentsline{toc}{chapter}{\textbf{References}}
    353 
    354 % Tip 5: You can create multiple .bib files to organize your references. Just
    355 % list them all in the \bibliogaphy command, separated by commas (no spaces).
     172% Bibliography setup and creation, renamed to References.
     173\addcontentsline{toc}{chapter}{\textbf{\bibname}}
     174\bibliographystyle{plain}
    356175\bibliography{thesis}
    357176
    358 % The following statement causes the specified references to be added to the
    359 % bibliography even if they were not cited in the text. The asterisk is a
    360 % wildcard that causes all entries in the bibliographic database to be
    361 % included (optional).
     177% Include all uncited entries in the bibliography.
    362178\nocite{*}
    363179
    364 % The \appendix statement indicates the beginning of the appendices.
     180% Begin the appendix, add a title and table of contents entry.
    365181\appendix
    366 % Add a title page before the appendices and a line in the Table of Contents
    367182\chapter*{APPENDICES}
    368183\addcontentsline{toc}{chapter}{APPENDICES}
  • doc/theses/thierry_delisle_PhD/.gitignore

    r331eacb r87f572e  
    88
    99comp_II/build/
     10comp_II/img/*.fig.bak
    1011comp_II/comp_II.pdf
    1112comp_II/comp_II.ps
  • doc/theses/thierry_delisle_PhD/comp_II/Makefile

    r331eacb r87f572e  
    1818
    1919FIGURES = ${addsuffix .tex, \
     20        base \
     21        empty \
     22        emptybit \
     23        emptytree \
     24        resize \
    2025}
    2126
     
    7075        mkdir -p ${Build}
    7176
    72 %.tex : %.fig ${Build}
     77%.tex : img/%.fig ${Build}
    7378        fig2dev -L eepic $< > ${Build}/$@
    7479
    75 %.ps : %.fig | ${Build}
     80%.ps : img/%.fig | ${Build}
    7681        fig2dev -L ps $< > ${Build}/$@
    7782
    78 %.pstex : %.fig | ${Build}
     83%.pstex : img/%.fig | ${Build}
    7984        fig2dev -L pstex $< > ${Build}/$@
    8085        fig2dev -L pstex_t -p ${Build}/$@ $< > ${Build}/$@_t
  • doc/theses/thierry_delisle_PhD/comp_II/comp_II.tex

    r331eacb r87f572e  
    1 \documentclass[11pt,fullpage]{article}
     1\documentclass[11pt]{article}
     2\usepackage{fullpage}
    23\usepackage[T1]{fontenc}
    34\usepackage[utf8]{inputenc}
     
    67\usepackage{xcolor}
    78\usepackage{graphicx}
    8 \usepackage[hidelinks]{hyperref}
     9\usepackage{epic,eepic}
    910\usepackage{glossaries}
    1011\usepackage{textcomp}
    11 \usepackage{geometry}
     12\usepackage[hidelinks]{hyperref}
     13%\usepackage[margin=1in]{geometry}
     14%\usepackage{float}
    1215
    1316% cfa macros used in the document
     
    5154\section{Introduction}
    5255\subsection{\CFA and the \CFA concurrency package}
    53 \CFA\cit is a modern, polymorphic, non-object-oriented, backwards-compatible extension of the C programming language. It aims to add high productivity features while maintaning the predictible performance of C. As such concurrency in \CFA\cit aims to offer simple and safe high-level tools while still allowing performant code. Concurrent code is written in the syncrhonous programming paradigm but uses \glspl{uthrd} in order to achieve the simplicity and maintainability of synchronous programming without sacrificing the efficiency of asynchronous programing. As such the \CFA scheduler is a user-level scheduler that maps \glspl{uthrd} onto \glspl{kthrd}.
    54 
    55 The goal of this research is to produce a scheduler that is simple to use and offers acceptable performance in all cases. Here simplicity does not refer to the API but to how much scheduling concerns programmers need to take into account when using the \CFA concurrency package. Therefore, the main goal of this proposal is as follows :
     56\CFA\cit is a modern, polymorphic, non-object-oriented, backwards-compatible extension of the C programming language. It aims to add high-productivity features while maintaning the predictible performance of C. As such, concurrency in \CFA\cit aims to offer simple and safe high-level tools while still allowing performant code. \CFA concurrrent code is written in the synchronous programming paradigm but uses \glspl{uthrd} in order to achieve the simplicity and maintainability of synchronous programming without sacrificing the efficiency of asynchronous programing. As such, the \CFA \emph{scheduler} is a preemptive user-level scheduler that maps \glspl{uthrd} onto \glspl{kthrd}.
     57
     58Scheduling occurs when execution switches from one thread to another, where the second thread is implicitly chosen by the scheduler. This scheduling is an indirect handoff, as opposed to generators and coroutines which explicitly switch to the next generator and coroutine respectively. The cost of switching between two threads for an indirect handoff has two components : the cost of actually context-switching, i.e., changing the relevant registers to move execution from one thread to the other, and the cost of scheduling, i.e., deciding which thread to run next among all the threads ready to run. The first cost is generally constant and fixed, while the scheduling cost can vary based on the system state\footnote{Affecting the context-switch cost is whether it is done in one step, after the scheduling, or in two steps, context-switching to a fixed third-thread before scheduling.}. Adding multiple \glspl{kthrd} does not fundamentally change the scheduler semantics or requirements, it simply adds new correctness requirements, i.e. \textit{linearizability}, and a new dimension to performance: scalability, where scheduling cost now also depends on contention.
     59
     60The more threads switch, the more the administrating cost of scheduling becomes noticeable. It is therefore important to build a scheduler with the lowest possible cost and latency. Another important consideration is \emph{fairness}. In principle, scheduling should give the illusion of perfect fairness, where all threads ready to run are running \emph{simultaneously}. While the illusion of simultaneity is easier to reason about, it can break down if the scheduler allows to much unfairness. Therefore, the scheduler should offer as much fairness as needed to guarantee eventual progress, but use unfairness to help performance. In practice, threads must wait in turn but there can be advantages to unfair scheduling, e.g., the express cash register at a grocery store.
     61
     62The goal of this research is to produce a scheduler that is simple for programmers to understand and offers good performance. Here understandability does not refer to the API but to how much scheduling concerns programmers need to take into account when writing a \CFA concurrent package. Therefore, the main goal of this proposal is :
    5663\begin{quote}
    57 The \CFA scheduler should be \emph{viable} for any workload.
     64The \CFA scheduler should be \emph{viable} for \emph{any} workload.
    5865\end{quote}
    5966
    60 This objective includes producing a scheduling strategy with minimal fairness guarantees, creating an abstraction layer over the operating system to handle kernel-threads spinning unnecessarily and hide blocking I/O operations and, writing sufficient library tools to allow developpers to properly use the scheduler.
     67For a general purpose scheduler, it is impossible to produce an optimal algorithm as it would require knowledge of the future behaviour of threads. As such, scheduling performance is generally either defined by the best case scenario, a workload to which the scheduler is tailored, or the worst case scenario, i.e., the scheduler behaves no worst than \emph{X}. For this proposal, the performance is evaluated using the second approach to allow \CFA programmers to rely on scheduling performance. A solution to this impossibility is to allow programmers to write their own scheduler, that is not the subject of this proposal, which considers only the default scheduler. As such, it is important that only programmers with exceptionally high performance requirements should need to write their own scheduler and replace the scheduler in this proposal.
     68
     69This objective includes producing a scheduling strategy with sufficient fairness guarantees, creating an abstraction layer over the operating system to handle kernel-threads spinning unnecessarily and hide blocking I/O operations, and writing sufficient library tools to allow developers to indirectly use the scheduler.
    6170
    6271% ===============================================================================
     
    6473
    6574\section{Scheduling for \CFA}
    66 While the \CFA concurrency package doesn't have any particular scheduling needs beyond those of any concurrency package which uses \glspl{uthrd}, it is important that the default \CFA Scheduler be viable in general. Indeed, since the \CFA Scheduler does not target any specific workloads, it is unrealistic to demand that it use the best scheduling strategy in all cases. However, it should offer a viable ``out of the box'' solution for most scheduling problems so that programmers can quickly write performant concurrent without needed to think about which scheduling strategy is more appropriate for their workload. Indeed, only programmers with exceptionnaly high performance requirements should need to write their own scheduler. More specifically, two broad types of schedulering strategies should be avoided in order to avoid penalizing certain types of workloads : feedback-based and priority schedulers.
     75While the \CFA concurrency package does not have any particular scheduling requirements beyond supporting \glspl{uthrd}. Therefore, the detailed requirements of the \CFA scheduler are :
     76
     77\paragraph{Correctness} As with any other concurrent data structure or algorithm, the correctness requirement is paramount. The scheduler cannot allow threads to be dropped from the ready-queue, i.e., scheduled but never run, or be executed multiple times when only being scheduled once. Since \CFA concurrency has no spurious wakeup, this definition of correctness also means the scheduler should have no spurious wakeup. The \CFA scheduler must be correct.
     78
     79\paragraph{Performance} The performance of a scheduler can generally be mesured in terms of scheduling cost, scalability and latency. Scheduling cost is the cost to switch from one thread to another, as mentioned above. For simple applications where a single kernel thread does most of the scheduling, it is generally the dominating cost. When adding many kernel threads, scalability can become an issue, effectively increasing the cost of context-switching when contention is high. Finally, a third axis of performance is tail latency. This measurement is related to fairness and mesures how long is needed for a thread to be run once scheduled but evaluated in the worst cases. The \CFA scheduler should offer good performance in all three metrics.
     80
     81\paragraph{Fairness} Like performance, this requirements has several aspect : eventual progress, predictability and performance reliablility. As a hard requirement, the \CFA scheduler must guarantee eventual progress, i.e., prevent starvation, otherwise the above mentioned illusion of simultaneous execution is broken and the scheduler becomes much more complext to reason about. Beyond this requirement, performance should be predictible and reliable, which means similar workloads achieve similar performance and programmer intuition is respected. An example of this is : a thread that yields agressively should not run more often then other tasks. While this is intuitive, it does not hold true for many work-stealing or feedback based schedulers. The \CFA scheduler must guarantee eventual progress and should be predictible and offer reliable performance.
     82
     83\paragraph{Efficiency} Finally, efficient usage of CPU resources is also an important requirement. This issue is discussed more in depth towards the end of this proposal. It effectively refers to avoiding using CPU power when there are no threads to run, and conversely, use all CPUs available when the workload can benefit from it. Balancing these two states is where the complexity lies. The \CFA scheduler should be efficient with respect to the underlying (shared) computer.
     84
     85To achieve these requirements, I can reject two broad types of scheduling strategies : feedback-based and priority schedulers.
    6786
    6887\subsection{Feedback-Based Schedulers}
    69 Many operating systems use schedulers based on feadback loops in some form, they measure how much CPU a particular thread has used\footnote{Different metrics can be used to here but it is not relevant to the discussion.} and schedule threads based on this metric. These strategies are sensible for operating systems but rely on two assumptions on the workload :
     88Many operating systems use schedulers based on feedback in some form, e.g., measuring how much CPU a particular thread has used\footnote{Different metrics can measured here but it is not relevant to the discussion.} and schedule threads based on this metric. These strategies are sensible for operating systems but rely on two assumptions on the workload :
    7089
    7190\begin{enumerate}
    72         \item Threads live long enough to be scheduled many times.
    73         \item Cooperation among all threads is not simply infeasible, it is a security risk.
     91        \item Threads live long enough for useful feedback information to be to gathered.
     92        \item Threads belong to multiple users so fairness across threads is insufficient.
    7493\end{enumerate}
    7594
    76 While these two assumptions generally hold for operating systems, they may not for \CFA programs. In fact, \CFA uses \glspl{uthrd} which have the explicit goal of reducing the cost of threading primitives to allow many smaller threads. This can naturally lead to have threads with much shorter lifetime and only being scheduled a few times. Scheduling strategies based on feadback loops cannot be effective in these cases because they will not have the opportunity to measure the metrics that underlay the algorithm. Note that the problem of feadback loop convergence (reacting too slowly to scheduling events) is not specific to short lived threads but can also occur with threads that show drastic changes in scheduling event, e.g., threads running for long periods of time and then suddenly blocking and unblocking quickly and repeatedly.
    77 
    78 In the context of operating systems, these concerns can be overshadowed by a more pressing concern : security. When multiple users are involved, it is possible that some users are malevolent and try to exploit the scheduling strategy in order to achieve some nefarious objective. Security concerns mean that more precise and robust fairness metrics must be used. In the case of the \CFA scheduler, every thread runs in the same user-space and are controlled from the same user. It is then possible to safely ignore the possibility that threads are malevolent and assume that all threads will ignore or cooperate with each other. This allows for a much simpler fairness metric and in this proposal ``fairness'' will be considered as equal opportunities to run once scheduled.
    79 
    80 Since feadback is not necessarily feasible within the lifetime of all threads and a simple fairness metric can be used, the scheduling strategy proposed for the \CFA runtime does not user per-threads feedback. Feedback loops in general are not rejected for secondary concerns like idle sleep, but no feedback loop is used to decide which thread to run next.
     95While these two assumptions generally hold for operating systems, they may not for user-level threading. Since \CFA has the explicit goal of allowing many smaller threads, this can naturally lead to threads with much shorter lifetime, only being scheduled a few times. Scheduling strategies based on feedback cannot be effective in these cases because they do not have the opportunity to measure the metrics that underlie the algorithm. Note that the problem of feedback convergence (reacting too slowly to scheduling events) is not specific to short lived threads but can also occur with threads that show drastic changes in scheduling, e.g., threads running for long periods of time and then suddenly blocking and unblocking quickly and repeatedly.
     96
     97In the context of operating systems, these concerns can be overshadowed by a more pressing concern : security. When multiple users are involved, it is possible that some users are malevolent and try to exploit the scheduling strategy in order to achieve some nefarious objective. Security concerns mean that more precise and robust fairness metrics must be used to guarantee fairness across processes created by users as well as threads created within a process. In the case of the \CFA scheduler, every thread runs in the same user-space and are controlled by the same user. Fairness across users is therefore a given and it is then possible to safely ignore the possibility that threads are malevolent. This approach allows for a much simpler fairness metric and in this proposal ``fairness'' is considered as follows : when multiple threads are cycling through the system, the total ordering of threads being scheduled, i.e., pushed onto the ready-queue, should not differ much from the total ordering of threads being executed, i.e., popped from the ready-queue.
     98
     99Since feedback is not necessarily feasible within the lifetime of all threads and a simple fairness metric can be used, the scheduling strategy proposed for the \CFA runtime does not use per-threads feedback. Feedback in general is not rejected for secondary concerns like idle sleep, but no feedback is used to decide which thread to run next.
    81100
    82101\subsection{Priority Schedulers}
    83 Another broad category of schedulers are priority schedulers. In these scheduling strategies threads have priorities and the runtime schedules the threads with the highest priority before scheduling other threads. Threads with equal priority are scheduled using a secondary strategy, often something simple like round-robin or FIFO. These priority mean that, as long as there is a thread with a higher priority that desires to run, a thread with a lower priority will not run. This possible starving of threads can dramatically increase programming complexity since starving threads and priority inversion (prioritising a lower priority thread) can both lead to serious problems, leaving programmers between a rock and a hard place.
    84 
    85 An important observation to make is that threads do not need to have explicit priorities for problems to be possible. Indeed, any system with multiple ready-queues and attempts to exhaust one queue before accessing the other queues, could encounter starvation problems. A popular scheduling strategy that suffers from implicit priorities is work-stealing. Work-stealing is generally presented as follows :
    86 
    87 \begin{itemize}
    88         \item Each processor has a list of threads.
    89 \end{itemize}
     102Another broad category of schedulers are priority schedulers. In these scheduling strategies, threads have priorities and the runtime schedules the threads with the highest priority before scheduling other threads. Threads with equal priority are scheduled using a secondary strategy, often something simple like round-robin or FIFO. These priority mean that, as long as there is a thread with a higher priority that desires to run, a thread with a lower priority does not run. This possible starving of threads can dramatically increase programming complexity since starving threads and priority inversion (prioritizing a lower priority thread) can both lead to serious problems.
     103
     104An important observation to make is that threads do not need to have explicit priorities for problems to occur. Indeed, any system with multiple ready-queues and attempts to exhaust one queue before accessing the other queues, can encounter starvation problems. A popular scheduling strategy that suffers from implicit priorities is work-stealing. Work-stealing is generally presented as follows, each processor has a list of ready threads.
    90105\begin{enumerate}
    91106        \item Run threads from ``this'' processor's list.
     
    93108\end{enumerate}
    94109
    95 In a loaded system\footnote{A loaded system is a system where threads are being run at the same rate they are scheduled}, if a thread does not yield or block for an extended period of time, threads on the same processor list will starve if no other processors can exhaust their list.
     110In a loaded system\footnote{A loaded system is a system where threads are being run at the same rate they are scheduled.}, if a thread does not yield, block or preempt for an extended period of time, threads on the same processor list starve if no other processors exhaust their list.
    96111
    97112Since priorities can be complex to handle for programmers, the scheduling strategy proposed for the \CFA runtime does not use a strategy with either implicit or explicit thread priorities.
    98113
    99 \subsection{Schedulers without feadback or priorities}
    100 I claim that the ideal default scheduler for the \CFA runtime is a scheduler that offers good scalability and a simple fairness guarantee that is easy for programmers to reason about. The simplest fairness guarantee is to guarantee FIFO ordering, i.e., threads scheduled first will run first. However, enforcing FIFO ordering generally conflicts with scalability across multiple processors because of the additionnal synchronization. Thankfully, strict FIFO is not needed for scheduling. Since concurrency is inherently non-deterministic, fairness concerns in scheduling are only a problem if a thread repeatedly runs before another thread can run\footnote{This is because the non-determinism means that programmers must already handle ordering problems in order to produce correct code and already must rely on weak guarantees, for example that a specific thread will \emph{eventually} run.}. This need for unfairness to persist before problems occur means that the FIFO fairness guarantee can be significantly relaxed without causing problems. For this proposal, the target guarantee is that the \CFA scheduler guarantees \emph{probable} FIFO ordering, which is defined as follows :
     114\subsection{Schedulers without feedback or priorities}
     115This proposal conjectures that is is possible to construct a default scheduler for the \CFA runtime that offers good scalability and a simple fairness guarantee that is easy for programmers to reason about. The simplest fairness guarantee is FIFO ordering, i.e., threads scheduled first run first. However, enforcing FIFO ordering generally conflicts with scalability across multiple processors because of the additionnal synchronization. Thankfully, strict FIFO is not needed for sufficient fairness. Since concurrency is inherently non-deterministic, fairness concerns in scheduling are only a problem if a thread repeatedly runs before another thread can run. This is because the non-determinism means that programmers must already handle ordering problems in order to produce correct code and already must rely on weak guarantees, for example that a specific thread will \emph{eventually} run. Since some reordering does not break correctness, the FIFO fairness guarantee can be significantly relaxed without causing problems. For this proposal, the target guarantee is that the \CFA scheduler provides \emph{probable} FIFO ordering, which allows reordering but makes it improbable that threads are reordered far from their position in total ordering.
     116
     117Scheduling is defined as follows :
    101118\begin{itemize}
    102         \item Given two threads $X$ and $Y$, the odds that thread $X$ runs $N$ times \emph{after} thread $Y$ is scheduled but \emph{before} it is run, decreases exponentially with regards to $N$.
     119        \item Given two threads $X$ and $Y$, the odds that thread $X$ runs $N$ times \emph{after} thread $Y$ is scheduled but \emph{before} it is run, decreases exponentially with regard to $N$.
    103120\end{itemize}
    104121
    105 While this is not a strong guarantee, the probability that problems persist for long period of times decreases exponentially, making persisting problems virtually impossible.
    106 
    107 \subsection{Real-Time}
    108 While the objective of this proposed scheduler is similar to the objective of real-time scheduling, this proposal is not a proposal for real-time scheduler and as such makes no attempt to offer either soft or hard guarantees on scheduling delays.
     122While this is not a bounded guarantee, the probability that unfairness persist for long periods of times decreases exponentially, making persisting unfairness virtually impossible.
    109123
    110124% ===============================================================================
     
    112126\section{Proposal}
    113127
    114 \subsection{Ready-Queue}
    115 Using trevor's paper\cit as basis, it is simple to build a relaxed FIFO list that is fast and scalable for loaded or overloaded systems. The described queue uses an array of underlying strictly FIFO queue. Pushing new data is done by selecting one of these underlying queues at random, recording a timestamp for the push and pushing to the selected queue. Popping is done by selecting two queues at random and popping from the queue for which the head has the oldest timestamp. In loaded or overloaded systems, it is higly likely that the queues is far from empty, e.i., several tasks are on each of the underlying queues. This means that selecting a queue at random to pop from is higly likely to yield a queue that is not empty.
    116 
    117 When the ready queue is "more empty", i.e., several of the inner queues are empty, selecting a random queue for popping is less likely to yield a valid selection and more attempts need to be made, resulting in a performance degradation. In cases, with few elements on the ready queue and few processors running, performance can be improved by adding information to help processors find which inner queues are used. Preliminary performance tests indicate that with few processors, a bitmask can be used to identify which inner queues are currently in use. This is especially effective in the single-thread case, where the bitmask will always be up-to-date. Furthermore, modern x86 CPUs have a BMI2 extension which allow using the bitmask with very little overhead over directly accessing the readyqueue offerring decent performance even in cases with many empty inner queues. This technique does not solve the problem completely, it randomly attempts to find a block of 64 queues where at least one is used, instead of attempting to find a used queue. For systems with a large number of cores this does not completely solve the problem, but it is a fixed improvement. The size of the blocks are limited by the maximum size atomic instruction can operate on, therefore atomic instructions on large words would increase the 64 queues per block limit.
    118 
    119 \TODO double check the next sentence
    120 Preliminary result indicate that the bitmask approach with the BMI2 extension can lead to multi-threaded performance that is contention agnostic in the worst case.
    121 This result suggests that the contention penalty and the increase performance for additionnal thread cancel each other exactly. This may indicate that a relatively small reduction in contention may tip the performance into positive scalling even for the worst case. It can be noted that in cases of high-contention, the use of the bitmask to find queues that are not empty is much less reliable. Indeed, if contention on the bitmask is high, it means it probably changes significantly between the moment it is read and the actual operation on the queues it represents. Furthermore, the objective of the bitmask is to avoid probing queues that are empty. Therefore, in cases where the bitmask is highly contented, it may be preferrable to probe queues randomly, either until contention decreases or until a prior prefetch of the bitmask completes. Ideally, the scheduler would be able to observe that the bitmask is highly contented and adjust its behaviour appropriately. However, I am not aware of any mechanism to query whether a cacheline is in cache or to run other instructions until a cacheline is fetch without blocking on the cacheline. As such, an alternative that may have a similar impact would be for each thread to have their own bitmask, which would be updated both after each scheduler action and after a certain number of failed probing. If the bitmask has little contention, the local bitmask will be mostly up-to-date and several threads won't need to contend as much on the global bitmask. If the bitmask has significant contention, then fetching it becomes more expensive and threads may as well probe randomly. This solution claims that probing randomly or against an out-of-date bitmask is equivalent.
    122 
    123 In cases where this is insufficient, another approach is to use a hiearchical data structure. Creating a tree of nodes to reduce contention has been shown to work in similar cases\cit(SNZI: Scalable NonZero Indicators)\footnote{This particular paper seems to be patented in the US. How does that affect \CFA? Can I use it in my work?}. However, this approach may lead to poorer single-threaded performance due to the inherent pointer chasing, as such, it was not considered as the first approach but as a fallback in case the bitmask approach does not satisfy the performance goals.
    124 
    125 Part of this performance relies on contention being low when there are few threads on the readyqueue. However, this can be assumed reliably if the system handles putting idle processors to sleep, which is addressed in section \ref{sleep}.
     128\subsection{Ready-Queue} \label{sec:queue}
     129A simple ready-queue can be built from a FIFO queue, user-threads are pushed onto the queue when they are ready to run and processors (kernel-threads acting as virtual processors) pop the user-threads from the queue and execute them. Using Trevor's paper\cit as basis, it is simple to build a relaxed FIFO list that is fast and scalable for loaded or overloaded systems. The described queue uses an array of underlying strictly FIFO queues as shown in Figure~\ref{fig:base}\footnote{For this section, the number of underlying queues is assumed to be constant, Section~\ref{sec:resize} will discuss resizing the array.}. Pushing new data is done by selecting one of these underlying queues at random, recording a timestamp for the push and pushing to the selected queue. Popping is done by selecting two queues at random and popping from the queue for which the head has the oldest timestamp. A higher number of underlying queues leads to less contention on each queue and therefore better performance. In a loaded system, it is higly likely the queues are non-empty, i.e., several tasks are on each of the underlying queues. This means that selecting a queue at random to pop from is higly likely to yield a queue with available items. In Figure~\ref{fig:base}, ignoring the ellipsis, the chances of getting an empty queue is 2/7 per pick, meaning two randoms pick will yield an item approximately 9 times out of 10.
     130
     131\begin{figure}
     132        \begin{center}
     133%               {\resizebox{0.8\textwidth}{!}{\input{base}}}
     134                \input{base}
     135        \end{center}
     136        \caption{Relaxed FIFO list at the base of the scheduler: an array of strictly FIFO lists. }
     137        \label{fig:base}
     138\end{figure}
     139
     140\begin{figure}
     141        \begin{center}
     142%               {\resizebox{0.8\textwidth}{!}{\input{empty}}}
     143                \input{empty}
     144        \end{center}
     145        \caption{``More empty'' state of the queue: the array contains many empty cells.}
     146        \label{fig:empty}
     147\end{figure}
     148
     149When the ready queue is "more empty", i.e., several of the inner queues are empty, selecting a random queue for popping is less likely to yield a valid selection and more attempts need to be made, resulting in a performance degradation. Figure~\ref{fig:empty} shows an example with fewer elements where the chances of getting an empty queue is 5/7 per pick, meaning two randoms pick will yield an item only half the time. Since the overarching queue is not empty, the pop operation \emph{must} find an element before returning and therefore must retry. Overall performance is therefore influenced by the contention on the underlying queues and pop performance is influenced by the items density. This leads to four performance cases, as depicted in Table~\ref{tab:perfcases}.
     150
     151\begin{table}
     152        \begin{center}
     153                \begin{tabular}{|r|l|l|}
     154                        \cline{2-3}
     155                        \multicolumn{1}{r|}{} & \multicolumn{1}{c|}{Many Processors} & \multicolumn{1}{c|}{Few Processors} \\
     156                        \hline
     157                        Many Threads & A: good performance & B: good performance \\
     158                        \hline
     159                        Few Threads  & C: poor performance & D: poor performance \\
     160                        \hline
     161                \end{tabular}
     162        \end{center}
     163        \caption{Performance of the relaxed FIFO list in different cases. The number of processors (many or few) refers to the number of kernel-threads \emph{actively} attempting to pop user-threads from the queues, not the total number of kernel-threads. The number of threads (many of few) refers to the number of user-threads ready to be run. Many threads means they outnumber processors significantly and most underlying queues have items, few threads means there are barely more threads than processors and most underlying queues are empty. Cases with fewer threads than processors are discussed in Section~\ref{sec:sleep}.}
     164        \label{tab:perfcases}
     165\end{table}
     166
     167Performance can be improved in case~D (Table~\ref{tab:perfcases}) by adding information to help processors find which inner queues are used. This aims to avoid the cost of retrying the pop operation but does not affect contention on the underlying queues and can incur some management cost for both push and pop operations.
     168
     169A bitmask can be used to identify which inner queues are currently in use, as shown in Figure~\ref{fig:emptybit}. This means that processors can often find user-threads in constant time, regardless of how many underlying queues are empty. Furthermore, modern x86 CPUs have an extension (BMI2) which allow using the bitmask with very little overhead compared to a filled readyqueue, offerring decent performance even in cases with many empty inner queues. However, this technique has its limits, with a single word\footnote{Word refers here to however many bits can be written atomicly.} bitmask, the total number of underlying queues in the overarching queue is limited to the number of bits in the word. With a multi-word bitmask, this maximum limit can be increased arbitrarily, but it is not possible to check if the queue is empty by reading the bitmask atomicly. A dense bitmap, i.e., either a single word bitmask or a multi word bitmask where all words are densely packed, also causes additionnal problems in case~C (Table~\ref{tab:perfcases}), which the increased contention on the bitmask both causes new performance degradation and means the accuracy of the bitmask is less reliable due to more hardware threads potentially racing to read and/or update that information.
     170
     171\begin{figure}
     172        \begin{center}
     173                {\resizebox{0.8\textwidth}{!}{\input{emptybit}}}
     174        \end{center}
     175        \caption{``More empty'' queue with added bitmask to indicate which array cells have items.}
     176        \label{fig:emptybit}
     177\end{figure}
     178
     179Another approach is to use a hiearchical data structure, for example Figure~\ref{fig:emptytree}. Creating a tree of nodes to reduce contention has been shown to work in similar cases\cit(SNZI: Scalable NonZero Indicators)\footnote{This particular paper seems to be patented in the US. How does that affect \CFA? Can I use it in my work?}. However, this approach may lead to poorer performance in case~B (Table~\ref{tab:perfcases}) due to the inherent pointer chasing cost and already low contention cost in that case.
     180
     181\begin{figure}
     182        \begin{center}
     183                {\resizebox{0.8\textwidth}{!}{\input{emptytree}}}
     184        \end{center}
     185        \caption{``More empty'' queue with added binary search tree indicate which array cells have items.}
     186        \label{fig:emptytree}
     187\end{figure}
     188
     189Finally, a third approach is to use dense information, similar to the bitmap, but have each thread keep its own independant copies of it. While this approach can offer good scalability \emph{and} low latency, the livelyness of the information can become a problem. In the simple cases, local copies of which underlying queues are empty can become stale and end-up not be useful when for the pop operation. A more serious problem is that reliable information is necessary for some parts of this algorithm to be correct. As mentionned in this section, processors must know \emph{reliably} whether the list is empty or not to decide if they can return \texttt{NULL} or if they must keep looking during a pop operation. Section~\ref{sec:sleep} discusses an other case where reliable information is required for the algorithm to be correct.
     190
     191There is a fundamental tradeoff among these approach. Dense global information about empty underlying queues will help zero-contention cases at the cost of high-contention case. Sparse global information will help high-contention cases but increase latency in zero-contention-cases, to read and ``aggregate'' the information\footnote{Hiearchical structures, e.g., binary search tree, effectively aggregate information but following pointer chains, learning information for each node. Similarly, other sparse schemes would need to read multiple cachelines to acquire all the information needed.}. Finally, dense local information has both the advantages of low latency in zero-contention cases and scalability in high-contention cases, however the information can become stale making it difficult to use to ensure correctness. The fact that these solutions have these fundamental limits suggest that that a more solution that combines these solutions in an interesting ways. The lock discussed in Section~\ref{sec:resize} also allows for solutions that adapt to the number of processors, which couls also prove useful.
    126192
    127193\paragraph{Objectives and Existing Work}
    128 How much scalability is actually needed is highly debatable, libfibre\cit is has compared favorably to other schedulers in webserver tests\cit and uses a single atomic counter in its scheduling algorithm similarly to the proposed bitmask. As such the single atomic instruction on a shared cacheline may be sufficiently performant.
    129 
    130 I have built a prototype of this ready-queue (including the bitmask and BMI2 usage, but not the sharded bitmask) and ran performance experiments on it but it is difficult to compare this prototype to a thread scheduler as the prototype is used as a data-queue. I have also integrated this prototype into the \CFA runtime, but have not yet created performance experiments to compare results. I believe that the bitmask approach is currently one of the larger risks of the proposal, early tests lead me to believe it may work but it is not clear that the contention problem can be overcome. The worst-case scenario is a case where the number of processors and the number of ready threads are similar, yet scheduling events are very frequent. Fewer threads should lead to the Idle Sleep mechanism reducing contention while having many threads ready leads to optimal performance. It is difficult to evaluate the likeliness of this worst-case scenario in real workloads. I believe, frequent scheduling events suggest a more ``bursty'' workload where new work is finely divided among many threads which race to completion. This type of workload would only see a peek of contention close to the end of the work, but no sustained contention. Very fine-grained pipelines are less ``bursty'', these may lead to more sustained contention. However, they could also easily benefit from a direct hand-off strategy which would circumvent the problem entirely.
    131 
    132 \subsection{Dynamic Resizing}
    133 The \CFA runtime system currently handles dynamically adding and removing processors from clusters at any time. Since this is part of the existing design, the proposed scheduler must also support this behaviour. However, dynamicly resizing the clusters is considered a rare event associated with setup, teardown and major configuration changes. This assumptions is made both in the design of the proposed scheduler as well as in the original design of the \CFA runtime system. As such, the proposed scheduler must honor the correctness of these behaviour but does not have any performance objectives with regards to resizing a cluster. How long adding or removing processors take and how much this disrupts the performance of other threads is considered a secondary concern since it should be amortized over long period of times. This description effectively matches with te description of a Reader-Writer lock, in frequent but invasive updates among frequent (mostly) read operations. In the case of the Ready-Queue described above, read operations are operations that push or pop from the ready-queue but do not invalidate any references to the ready queue data structures. Writes on the other-hand would add or remove inner queues, invalidating references to the array of inner queues in the process. Therefore, the current proposed approach to this problem is the add a per-cluster Reader Writer lock around the ready queue to prevent restructuring of the ready-queue data structure while threads are being pushed or popped.
     194
     195How much scalability is actually needed is highly debatable, libfibre\cit has compared favorably to other schedulers in webserver tests\cit and uses a single atomic counter in its scheduling algorithm similarly to the proposed bitmask. As such, the single atomic instruction on a shared cacheline may be sufficiently performant.
     196
     197I have built a prototype of this ready-queue (including the bitmask and BMI2 usage, but not the sharded bitmask) and ran performance experiments on it but it is difficult to compare this prototype to a thread scheduler as the prototype is used as a data-queue. I have also integrated this prototype into the \CFA runtime, but have not yet created performance experiments to compare results. I believe that the bitmask approach is currently one of the larger risks of the proposal, early tests lead me to believe it may work but it is not clear that the contention problem can be overcome. The worst-case scenario is a case where the number of processors and the number of ready threads are similar, yet scheduling events are very frequent. Fewer threads should lead to the Idle Sleep mechanism discussed in Section~\ref{sec:sleep} to reduce contention while having many threads ready leads to optimal performance. It is difficult to evaluate the likeliness of this worst-case scenario in real workloads. I believe, frequent scheduling events suggest a more ``bursty'' workload where new work is finely divided among many threads which race to completion. This type of workload would only see a peek of contention close to the end of the work, but no sustained contention. Very fine-grained pipelines are less ``bursty'', these may lead to more sustained contention. However, they could also easily benefit from a direct hand-off strategy which would circumvent the problem entirely.
     198
     199\subsection{Dynamic Resizing} \label{sec:resize}
     200The \CFA runtime system currently handles dynamically adding and removing processors from clusters at any time. Since this is part of the existing design, the proposed scheduler must also support this behaviour. However, dynamicly resizing the clusters is considered a rare event associated with setup, teardown and major configuration changes. This assumptions is made both in the design of the proposed scheduler as well as in the original design of the \CFA runtime system. As such, the proposed scheduler must honor the correctness of these behaviour but does not have any performance objectives with regards to resizing a cluster. How long adding or removing processors take and how much this disrupts the performance of other threads is considered a secondary concern since it should be amortized over long period of times. However, as mentionned in Section~\ref{sec:queue}, contention on the underlying queues can have a direct impact on performance, the number of underlying queues must therefore be adjusted as the number of processors grows or shrinks. Since the underlying queues are stored in a dense array, changing the number of queues requires resizing the array and therefore moving it. This can introduce memory reclamation problems if not done correctly.
     201
     202\begin{figure}
     203        \begin{center}
     204%               {\resizebox{0.8\textwidth}{!}{\input{resize}}}
     205                \input{resize}
     206        \end{center}
     207        \caption{Copy of data structure shown in Figure~\ref{fig:base}. The cells of the array can be modified concurrently but resizing the array, which requires moving it, is not safe to do concurrently. This can also be true of the accompanying data structures used to find non-empty queues.}
     208        \label{fig:base2}
     209\end{figure}
     210
     211It is important to note that how the array is used in this case. While the array cells are modified by every push and pop operation, the array itself, i.e., the pointer that would change when resized, is only read during these operations. Therefore the use is this pointer can be described as frequent reads and in frequent writes. This description effectively matches with the description of a Reader-Writer lock, infrequent but invasive updates among frequent read operations. In the case of the Ready-Queue described above, read operations are operations that push or pop from the ready-queue but do not invalidate any references to the ready queue data structures. Writes on the other-hand would add or remove inner queues, invalidating references to the array of inner queues in the process. Therefore, the current proposed approach to this problem is the add a per-cluster Reader Writer lock around the ready queue to prevent restructuring of the ready-queue data structure while threads are being pushed or popped.
    134212
    135213There are possible alternatives to the Reader Writer lock solution. This problem is effectively a memory reclamation problem and as such there is a large body of research on the subject. However, the RWlock solution is simple and can be leveraged to solve other problems (e.g. processor ordering and memory reclamation of threads) which makes it an attractive solution.
    136214
    137215\paragraph{Objectives and Existing Work}
    138 The lock must offer scalability and performance on par with the actual ready-queue in order not to introduce a new bottle neck. I have already built a lock that fits the desired requirements and preliminary testing show scalability and performance that exceed the target. As such, I do not consider this lock to be a risk on this project.
    139 
    140 \subsection{Idle Sleep} \label{sleep}
    141 As mentionned above, idle sleep is the process of putting processors to sleep while they do not have threads to execute. In this context processors are kernel-threads and sleeping refers to asking the kernel to block a thread. This can be achieved with either thread synchronization operations like pthread\_cond\_wait or using signal operations like sigsuspend.
     216The lock must offer scalability and performance on par with the actual ready-queue in order not to introduce a new bottleneck. I have already built a lock that fits the desired requirements and preliminary testing show scalability and performance that exceed the target. As such, I do not consider this lock to be a risk on this project.
     217
     218\subsection{Idle Sleep} \label{sec:sleep}
     219As mentionned above, idle sleep is the process of putting processors to sleep while they do not have threads to execute. In this context, processors are kernel-threads and sleeping refers to asking the kernel to block a thread. This can be achieved with either thread synchronization operations like pthread\_cond\_wait or using signal operations like sigsuspend. The goal of putting idle processors to sleep is two-fold, it reduces energy consumption in cases where more idle kernel-threads translate to idle hardware threads, and reduces contention on the ready queue, since the otherwise idle processors generally contend trying to pop items from the queue. Since energy efficiency is a growing concern in many industry sectors\cit, there is not real need to solve the contention problem without using idle sleep.
    142220
    143221Support for idle sleep broadly involves calling the operating system to block the kernel thread but also handling the race between the sleeping and the waking up, and handling which kernel thread should sleep or wake-up.
    144222
    145 When a processor decides to sleep, there is a race that occurs between it signalling that it will go to sleep (so other processors can find sleeping processors) and actually blocking the kernel thread. This is equivalent to the classic problem of missing signals when using condition variables, the ``sleepy'' processor indicates that it will sleep but has not yet gone to sleep, if another processor attempts to wake it up, the waking-up operation may claim nothing needs to be done and the signal will have been missed. In cases where threads are scheduled from processors on the current cluster, loosing signals is not necessarily critical, because at least some processors on the cluster are awake. Individual processors always finish shceduling threads before looking for new work, which means that the last processor to go to sleep cannot miss threads scheduled from inside the cluster (if they do, that demonstrates the ready-queue is not linearizable). However, this guarantee does not hold if threads are shceduled from outside the cluster, either due to an external event like timers and I/O, or due to a thread migrating from a different cluster. In this case, missed signals can lead to the cluster deadlocking where it should not\footnote{Clusters ``should'' never deadlock, but for this proposal, cases where \CFA users \emph{actually} wrote \CFA code that leads to a deadlock it is considered as a deadlock that ``should'' happen. }. Therefore, it is important that the scheduling of threads include a mechanism where signals \emph{cannot} be missed. For performance reasons, it can be advantageous to have a secondary mechanism that allows signals to be missed in cases where it cannot lead to a deadlock. To be safe, this process must include a ``handshake'' where it is guaranteed that either~: the sleepy processor notices that a thread was scheduled after it signalled its intent to block or code scheduling threads well see the intent to sleep before scheduling and be able to wake-up the processor. This matter is complicated by the fact that pthread offers few tools to implement this solution and offers no guarantee of ordering of threads waking up for most of these tools.
    146 
    147 Another issues is trying to avoid kernel sleeping and waking frequently. A possible partial solution is to order the processors so that the one which most recently went to sleep is woken up. This allows other sleeping processors to reach deeper sleep state (when these are available) while keeping ``hot'' processors warmer. Note that while this generally means organising the processors in a stack, I believe that the unique index provided by the ReaderWriter lock can be reused to strictly order the waking order of processors, causing a LIFO like waking order. While a strict LIFO stack is probably better, using the processor index could proove useful and offer a sufficiently LIFO ordering.
     223When a processor decides to sleep, there is a race that occurs between it signalling that it will go to sleep (so other processors can find sleeping processors) and actually blocking the kernel thread. This is equivalent to the classic problem of missing signals when using condition variables, the ``sleepy'' processor indicates that it will sleep but has not yet gone to sleep, if another processor attempts to wake it up, the waking-up operation may claim nothing needs to be done and the signal will have been missed. In cases where threads are scheduled from processors on the current cluster, loosing signals is not necessarily critical, because at least some processors on the cluster are awake. Individual processors always finish scheduling threads before looking for new work, which means that the last processor to go to sleep cannot miss threads scheduled from inside the cluster (if they do, that demonstrates the ready-queue is not linearizable). However, this guarantee does not hold if threads are scheduled from outside the cluster, either due to an external event like timers and I/O, or due to a thread migrating from a different cluster. In this case, missed signals can lead to the cluster deadlocking where it should not\footnote{Clusters ``should'' never deadlock, but for this proposal, cases where \CFA users \emph{actually} wrote \CFA code that leads to a deadlock it is considered as a deadlock that ``should'' happen. }. Therefore, it is important that the scheduling of threads include a mechanism where signals \emph{cannot} be missed. For performance reasons, it can be advantageous to have a secondary mechanism that allows signals to be missed in cases where it cannot lead to a deadlock. To be safe, this process must include a ``handshake'' where it is guaranteed that either~: the sleepy processor notices that a thread was scheduled after it signalled its intent to block or code scheduling threads sees the intent to sleep before scheduling and be able to wake-up the processor. This matter is complicated by the fact that pthread offers few tools to implement this solution and offers no guarantee of ordering of threads waking up for most of these tools.
     224
     225Another issues is trying to avoid kernel sleeping and waking frequently. A possible partial solution is to order the processors so that the one which most recently went to sleep is woken up. This allows other sleeping processors to reach deeper sleep state (when these are available) while keeping ``hot'' processors warmer. Note that while this generally means organising the processors in a stack, I believe that the unique index provided by the ReaderWriter lock can be reused to strictly order the waking order of processors, causing a LIFO like waking order. While a strict LIFO stack is probably better, using the processor index could prove useful and offer a sufficiently LIFO ordering.
    148226
    149227Finally, another important aspect of Idle Sleep is when should processors make the decision to sleep and when it is appropriate for sleeping processors to be woken up. Processors that are unnecessarily awake lead to unnecessary contention and power consumption, while too many sleeping processors can lead to sub-optimal throughput. Furthermore, transitions from sleeping to awake and vice-versa also add unnecessary latency. There is already a wealth of research on the subject and I do not plan to implement a novel idea for the Idle Sleep heuristic in this project.
     
    153231
    154232\paragraph{OS Abstraction}
    155 One of the fundamental part of this converting blocking I/O operations into non-blocking ones. This relies on having an underlying asynchronous I/O interface to which to direct the I/O operations. While there exists many different APIs for asynchronous I/O, it is not part of this proposal to create a novel API, simply to use an existing one that is sufficient. uC++ uses the \texttt{select} as its interface, which handles pipes and sockets. It entails significant complexity and has performances problems which make it a less interesting alternative. Another interface which is becoming popular recently\cit is \texttt{epoll}. However, epoll also does not handle file system and seems to have problem to linux pipes and \texttt{TTY}s\cit. A very recent alternative that must still be investigated is \texttt{io\_uring}. It claims to address some of the issues with \texttt{epoll} but is too recent to be confident that it does. Finally, a popular cross-platform alternative is \texttt{libuv}, which offers asynchronous sockets and asynchronous file system operations (among other features). However, as a full-featured library it includes much more than what is needed and could conflict with other features of \CFA unless significant efforts are made to merge them together.
     233One of the fundamental part of converting blocking I/O operations into non-blocking ones is having an underlying asynchronous I/O interface to direct the I/O operations. While there exists many different APIs for asynchronous I/O, it is not part of this proposal to create a novel API, simply to use an existing one that is sufficient. uC++ uses the \texttt{select} as its interface, which handles ttys, pipes and sockets, but not disk. It entails significant complexity and is being replaced which make it a less interesting alternative. Another interface which is becoming popular recently\cit is \texttt{epoll}, which is supposed to be cheaper than \texttt{select}. However, epoll also does not handle file system and seems to have problem to linux pipes and \texttt{TTY}s\cit. A very recent alternative that must still be investigated is \texttt{io\_uring}. It claims to address some of the issues with \texttt{epoll} but is too recent to be confident that it does. Finally, a popular cross-platform alternative is \texttt{libuv}, which offers asynchronous sockets and asynchronous file system operations (among other features). However, as a full-featured library it includes much more than what is needed and could conflict with other features of \CFA unless significant efforts are made to merge them together.
    156234
    157235\paragraph{Event-Engine}
     
    159237
    160238\paragraph{Interface}
    161 Finally, for these components to be available, it is necessary to expose them through a synchronous interface. This can be a novel interface but it is preferrable to attempt to intercept the existing POSIX interface in order to be compatible with existing code. This will allow C programs written using this interface to be transparently converted to \CFA with minimal effeort. Where this is not applicable, a novel interface will be created to fill the gaps.
     239Finally, for these components to be available, it is necessary to expose them through a synchronous interface. This can be a novel interface but it is preferrable to attempt to intercept the existing POSIX interface in order to be compatible with existing code. This allows C programs written using this interface to be transparently converted to \CFA with minimal effeort. Where this is not applicable, a novel interface will be created to fill the gaps.
    162240
    163241
     
    171249\section{Timeline}
    172250
    173 
    174 \cleardoublepage
    175251
    176252% B I B L I O G R A P H Y
    177253% -----------------------------
    178 \addcontentsline{toc}{chapter}{Bibliography}
     254\cleardoublepage
     255\phantomsection         % allows hyperref to link to the correct page
     256\addcontentsline{toc}{section}{\refname}
    179257\bibliographystyle{plain}
    180258\bibliography{pl,local}
     259
     260% G L O S S A R Y
     261% -----------------------------
    181262\cleardoublepage
    182263\phantomsection         % allows hyperref to link to the correct page
    183 
    184 % G L O S S A R Y
    185 % -----------------------------
    186 \addcontentsline{toc}{chapter}{Glossary}
     264\addcontentsline{toc}{section}{Glossary}
    187265\printglossary
    188 \cleardoublepage
    189 \phantomsection         % allows hyperref to link to the correct page
    190266
    191267\end{document}
  • doc/user/user.tex

    r331eacb r87f572e  
    1111%% Created On       : Wed Apr  6 14:53:29 2016
    1212%% Last Modified By : Peter A. Buhr
    13 %% Last Modified On : Thu Mar  5 12:09:42 2020
    14 %% Update Count     : 3885
     13%% Last Modified On : Fri Mar  6 13:34:52 2020
     14%% Update Count     : 3924
    1515%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    1616
     
    66216621An array may be filled, resized, or aligned.
    66226622\end{description}
    6623 \VRef[Table]{t:AllocationVersusCapabilities} shows allocation routines supporting different combinations of storage-management capabilities:
     6623\VRef[Table]{t:AllocationVersusCapabilities} shows allocation routines supporting different combinations of storage-management capabilities.
    66246624\begin{table}
    66256625\centering
     6626\begin{minipage}{0.75\textwidth}
    66266627\begin{tabular}{@{}r|l|l|l|l|l@{}}
    66276628\multicolumn{1}{c}{}&           & \multicolumn{1}{c|}{fill}     & resize        & alignment     & array \\
     
    66316632                & ©realloc©                     & copy                  & yes           & no            & no    \\
    66326633                & ©memalign©            & no                    & no            & yes           & no    \\
    6633                 & ©aligned_alloc©       & no                    & no            & yes           & no    \\
     6634                & ©aligned_alloc©\footnote{Same as ©memalign© but size is an integral multiple of alignment, which is universally ignored.}
     6635                                                        & no                    & no            & yes           & no    \\
    66346636                & ©posix_memalign©      & no                    & no            & yes           & no    \\
     6637                & ©valloc©                      & no                    & no            & yes (page size)& no   \\
     6638                & ©pvalloc©\footnote{Same as ©valloc© but rounds size to multiple of page size.}
     6639                                                        & no                    & no            & yes (page size)& no   \\
    66356640\hline
    66366641\CFA    & ©cmemalign©           & yes (0 only)  & no            & yes           & yes   \\
    66376642                & ©realloc©                     & copy                  & yes           & yes           & no    \\
    6638                 & ©alloc©                       & no                    & no            & no            & no    \\
    6639                 & ©alloc©                       & copy                  & no/yes        & no            & yes   \\
    6640                 & ©alloc©                       & no/copy/yes   & no/yes        & no            & yes   \\
    6641                 & ©alloc_set©           & no/yes                & no            & yes           & yes   \\
    6642                 & ©alloc_align©         & no/yes                & no            & yes           & yes   \\
    6643                 & ©alloc_align_set©     & no/yes                & no            & yes           & yes   \\
     6643                & ©alloc©                       & no                    & yes           & no            & yes   \\
     6644                & ©alloc_set©           & yes                   & yes           & no            & yes   \\
     6645                & ©alloc_align©         & no                    & yes           & yes           & yes   \\
     6646                & ©alloc_align_set©     & yes                   & yes           & yes           & yes   \\
    66446647\end{tabular}
     6648\end{minipage}
    66456649\caption{Allocation Routines versus Storage-Management Capabilities}
    66466650\label{t:AllocationVersusCapabilities}
    66476651\end{table}
     6652
     6653\CFA memory management extends the type safety of all allocations by using the type of the left-hand-side type to determine the allocation size and return a matching type for the new storage.
     6654Type-safe allocation is provided for all C allocation routines and new \CFA allocation routines, \eg in
     6655\begin{cfa}
     6656int * ip = (int *)malloc( sizeof(int) );                §\C{// C}§
     6657int * ip = malloc();                                                    §\C{// \CFA type-safe version of C malloc}§
     6658int * ip = alloc();                                                             §\C{// \CFA type-safe uniform alloc}§
     6659\end{cfa}
     6660the latter two allocations determine the allocation size from the type of ©p© (©int©) and cast the pointer to the allocated storage to ©int *©.
     6661
     6662\CFA memory management extends allocation safety by implicitly honouring all alignment requirements, \eg in
     6663\begin{cfa}
     6664struct S { int i; } __attribute__(( aligned( 128 ) )); // cache-line alignment
     6665S * sp = malloc();                                                              §\C{// honour type alignment}§
     6666\end{cfa}
     6667the storage allocation is implicitly aligned to 128 rather than the default 16.
     6668The alignment check is performed at compile time so there is no runtime cost.
    66486669
    66496670\CFA memory management extends the resize capability with the notion of \newterm{sticky properties}.
     
    66526673Without sticky properties it is dangerous to use ©realloc©, resulting in an idiom of manually performing the reallocation to maintain correctness.
    66536674
     6675\CFA memory management extends allocation to support constructors for initialization of allocated storage, \eg in
     6676\begin{cfa}
     6677struct S { int i; };                                                    §\C{// cache-line aglinment}§
     6678void ?{}( S & s, int i ) { s.i = i; }
     6679// assume ?|? operator for printing an S
     6680
     6681S & sp = *®new®( 3 );                                                   §\C{// call constructor after allocation}§
     6682sout | sp.i;
     6683®delete®( &sp );
     6684
     6685S * spa = ®anew®( 10, 5 );                                              §\C{// allocate array and initialize each array element}§
     6686for ( i; 10 ) sout | spa[i] | nonl;
     6687sout | nl;
     6688®adelete®( 10, spa );
     6689\end{cfa}
     6690Allocation routines ©new©/©anew© allocate a variable/array and initialize storage using the allocated type's constructor.
     6691Note, the matching deallocation routines ©delete©/©adelete©.
    66546692
    66556693\leavevmode
Note: See TracChangeset for help on using the changeset viewer.