Changeset 33474e6 for doc/theses


Ignore:
Timestamp:
Oct 28, 2024, 11:31:39 AM (3 months ago)
Author:
Michael Brooks <mlbrooks@…>
Branches:
master
Children:
d0296db6
Parents:
720eec9 (diff), bf91d1d (diff)
Note: this is a merge changeset, the changes displayed below correspond to the merge itself.
Use the (diff) links above to see all the changes relative to each parent.
Message:

Merge branch 'master' of plg.uwaterloo.ca:software/cfa/cfa-cc

Location:
doc/theses/mike_brooks_MMath
Files:
2 edited

Legend:

Unmodified
Added
Removed
  • doc/theses/mike_brooks_MMath/intro.tex

    r720eec9 r33474e6  
    33All modern programming languages provide three high-level containers (collections): array, linked-list, and string.
    44Often array is part of the programming language, while linked-list is built from (recursive) pointer types, and string from a combination of array and linked-list.
    5 For all three types, languages supply varying degrees of high-level mechanism for manipulating these objects at the bulk level and at the component level, such as array copy, slicing and iterating using subscripting.
     5For all three types, languages and/or their libraries supply varying degrees of high-level mechanisms for manipulating these objects at the bulk and component level, such as copying, slicing, extracting, and iterating among elements.
    66
    77This work looks at extending these three foundational container types in the programming language \CFA, which is a new dialect of the C programming language.
     
    1010An equally important goal is balancing good performance with safety.
    1111
     12The thesis describes improvements made to the \CFA language design, both syntax and semantics, to support the container features, and the source code created within the \CFA compiler, libraries, and runtime system to implement these features.
     13This work leverages preexisting work within the compiler's type and runtime systems generated by prior graduate students working on the \CFA project.
     14
    1215
    1316\section{Array}
     17\label{s:ArrayIntro}
    1418
    1519An array provides a homogeneous container with $O(1)$ access to elements using subscripting.
    16 Higher-level operations like array slicing (single or multidimensional) may have significantly higher cost, but provides a better programmer experience.
     20Higher-level operations like array slicing (single or multidimensional) may have significantly higher cost, but provide a better programmer experience.
    1721The array size can be static, dynamic but fixed after creation, or dynamic and variable after creation.
    1822For static and dynamic-fixed, an array can be stack allocated, while dynamic-variable requires the heap.
    1923Because array layout has contiguous components, subscripting is a computation (some form of pointer arithmetic).
    2024
    21 C provides an array as a language feature.
     25C provides a simple array type as a language feature.
     26However, it adopts the controversial language position of treating pointer and array as duals, leading to multiple problems.
     27
    2228
    2329\section{Linked list}
    2430
    2531A linked-list provides a homogeneous container often with $O(log N)$ or $O(N)$ access to elements using successor and predecessor operations that normally involve pointer chasing.
    26 Subscripting by value is sometimes available, \eg hash table.
    27 Linked types are normally dynamically sized by adding/removing nodes using link fields internal or external to the elements (nodes).
    28 If a programming language allows pointer to stack storage, linked-list types can be allocated on the stack;
    29 otherwise, elements are heap allocated with explicitly/implicitly managed.
     32Subscripting by value (rather than position or location as for array) is sometimes available, \eg hash table.
     33Linked types are normally dynamically sized by adding and removing nodes using link fields internal or external to the elements (nodes).
     34If a programming language allows pointers to stack storage, linked-list nodes can be allocated on the stack and connected with stack addresses (pointers);
     35otherwise, elements are heap allocated with internal or external link fields.
    3036
    31 C does not provide a linked-list, either as an obvious commonly-accepted library, nor as an outright language feature.
    32 C programmers commonly build linked-list behaviours into their bespoke data structures, directly using its language feature of pointers.
     37C provides a few simple, polymorphic, library data-structures (@glibc@):
     38\begin{itemize}
     39\item
     40singly-linked lists, singly-linked tail queues, lists, and tail queues (@<sys/queue.h>@) \see{\VRef{s:PreexistingLinked-ListLibraries}}
     41\item
     42hash search table consisting of a key (string) with associated data (@<search.h>@)
     43\end{itemize}
     44Because these libraries are simple, awkward to use, and unsafe, C programmers commonly build linked-list behaviours into their bespoke data structures.
     45
    3346
    3447\section{String}
    3548
    3649A string provides a dynamic array of homogeneous elements, where the elements are (often) some form of human-readable characters.
    37 What differentiates a string from other types in that many of its operations work on groups of elements for scanning and changing, \eg @index@ and @substr@;
    38 subscripting individual elements is usually available, too.
    39 Therefore, the cost of a string operation is usually less important than the power of the operation to accomplish complex text manipulation, \eg search, analysing, composing, and decomposing.
     50What differentiates a string from other types in that many of its operations work on groups of elements for scanning and changing, \eg @index@ and @substr@.
     51While subscripting individual elements is usually available, working at the individual character level is considered poor practise, \ie underutilizing the powerful string operations.
     52Therefore, the cost of a string operation is usually less important than the power of the operation to accomplish complex text manipulation, \eg search, analysing, composing, and decomposing string text.
    4053The dynamic nature of a string means storage is normally heap allocated but often implicitly managed, even in unmanaged languages.
    4154In some cases, string management is separate from heap management, \ie strings roll their own heap.
    4255
    43 The C string type is a user-managed allocation of the language-provided character array,
    44 plus the convention of marking its (variable) length by placing the 0-valued control character at the end (``null-terminated'').
    45 The C standard library includes many operations for working on this representation.
     56The C string type requires user storage-management of a language-provided character array.
     57The character array uses the convention of marking its (variable) array length by placing the 0-valued control character at the end (null-terminated).
     58The C standard library includes a number of high-level operations for working with this representation.
     59
    4660
    4761\section{Iterator}
    4862
    49 As a side issue to working with complex structured types is iterating through them.
    50 Some thought has been given to \emph{general} versus specific iteration capabilities as part of of this work, but the general iteration work is only a sketch for others as future work.
     63As a side issue to complex structured-types is iterating through them.
     64Some thought has been given to \emph{general} versus specific iteration capabilities as part of of this work.
     65However, the general iteration work is only a sketch for others as future work.
    5166Nevertheless, sufficed work was done to write out the ideas that developed and how they should apply in the main context of this work.
    5267
     
    6176\item
    6277These three aspects of C cause the greatest safety issues because there are few or no safe guards when a programmer misunderstands or misuses these features~\cite{Elliott18, Blache19, Ruef19, Oorschot23}.
    63 Estimates suggest 50\%~\cite{Mendio24} of total reported open-source vulnerabilities occur in C resulting from errors using these facilities (memory errors), providing the major hacker attack-vectors.
     78Estimates suggest 50\%~\cite{Mendio24} of total reported open-source vulnerabilities occurring in C result from errors using these facilities (memory errors), providing the major hacker attack-vectors.
    6479\end{enumerate}
    6580Both White House~\cite{WhiteHouse24} and DARPA~\cite{DARPA24} recently released a recommendation to move away from C and \CC, because of cybersecurity threats exploiting vulnerabilities in these older languages.
     
    6883
    6984While multiple new languages purport to be systems languages replacing C, the reality is that rewriting massive C code-bases is impractical and a non-starter if the new runtime uses garage collection.
    70 Furthermore, these languages must still interact with the underlying C operating system through fragile, type-unsafe, interlanguage-communication.
    71 Switching to \CC is equally impractical as its complex and interdependent type-system (\eg objects, inheritance, templates) means idiomatic \CC code is difficult to use from C, and C programmers must expend significant effort learning \CC.
     85Furthermore, new languages must still interact with the underlying C operating system through fragile, type-unsafe, interlanguage-communication.
     86Switching to \CC is equally impractical as its complex and interdependent type-system (\eg objects, overloading, inheritance, templates) means idiomatic \CC code is difficult to use from C, and C programmers must expend significant effort learning \CC.
    7287Hence, rewriting and retraining costs for these languages can be prohibitive for companies with a large C software-base (Google, Apple, Microsoft, Amazon, AMD, Nvidia).
    7388
     
    7994When it comes to explaining how C works, the definitive source is the @gcc@ compiler, which is mimicked by other C compilers, such as Clang~\cite{clang}.
    8095Often other C compilers must mimic @gcc@ because a large part of the C library (runtime) system (@glibc@ on Linux) contains @gcc@ features.
    81 While some key aspects of C need to be explained and understood by quoting from the language reference manual, to illustrate actual program semantics, this thesis uses constructs a program whose behaviour exercises a particular point and then confirms the behaviour by running the program.
     96Some key aspects of C need to be explained and understood by quoting from the language reference manual.
     97However, to illustrate actual program semantics, this thesis constructs multiple small programs whose behaviour exercises a particular point and then confirms the behaviour by running the program using multiple @gcc@ compilers.
    8298These example programs show
    83 \begin{itemize}[leftmargin=*]
     99\begin{itemize}
    84100        \item if the compiler accepts or rejects certain syntax,
    85101        \item prints output to buttress a behavioural claim,
    86102        \item or executes without triggering any embedded assertions testing pre/post-assertions or invariants.
    87103\end{itemize}
    88 This work has been tested across @gcc@ versions 8--14 and clang versions 10--14 running on ARM, AMD, and Intel architectures.
    89 Any discovered anomalies among compilers or versions is discussed.
    90 In all case, it is never clear whether the \emph{truth} lies in the compiler(s) or the C standard.
     104These programs are tested across @gcc@ versions 8--14 and clang versions 10--14 running on ARM, AMD, and Intel architectures.
     105Any discovered anomalies among compilers, versions, or architectures is discussed.
     106In general, it is never clear whether the \emph{truth} lies in the C standard or the compiler(s), which may be true for other programming languages.
    91107
    92108
    93109\section{Contributions}
    94110
    95 This work has produced significant syntactic and semantic improvements to C's arrays, linked-lists and string types.
     111Overall, this work has produced significant syntactic and semantic improvements to C's arrays, linked-lists and string types.
    96112As well, a strong plan for general iteration has been sketched out.
     113The following are the detailed contributions, where performance and safety were always the motivating factors.
    97114
    98 This thesis mainly describes improvements made to the source code of the \CFA compiler and runtime system,
    99 available at TODO.
     115\subsection{Array}
    100116
    101 This work often leverages preexisting    xxxxxxxxxx
     117This work's array improvements are:
     118\begin{enumerate}[leftmargin=*]
     119\item
     120Introduce a small number of subtle changes to the typing rules for the C array, while still achieving significant backwards compatibility
     121\item
     122Create a new polymorphic mechanism into the \CFA @forall@ clause for specifying array dimension values, similar to a fixed-typed parameter in a \CC \lstinline[language=C++]{template}.
     123\item
     124Construct a new standard-library array-type, available through @#include <array.hfa>@.
     125\end{enumerate}
     126The new array type, enabled by prior features, defines an array with guaranteed runtime bound checks (often optimizer-removable) and implicit (guaranteed accurate) inter-function length communication.
     127Leveraging the preexisting \CFA type-system to achieve this length-related type checking is an original contribution.
     128Furthermore, the new array incorporates multidimensional capabilities typical of scientific/machine-learning languages, made to coexist with the C raw-memory-aware tradition in a novel way.
     129The thesis includes a performance evaluation that shows \CFA arrays perform comparably with C arrays in many programming cases.
     130
    102131
    103132\subsection{Linked list}
    104133
    105 The linked list is a new runtime library, available as @#include <dlist.hfa>@.
    106 The library offers a novel combination of the preexisting \CFA features of:
    107 references, inline inheritance, polymorphism and declaration scoping.
    108 A programmer's experience of the list datatype within the type system is novel.
    109 The list's runtime representation follows the established design pattern of intrusion.
    110 This thesis includes a performance evaluation that shows
    111 the list performing comparably with other C-family intrusive lists,
    112 and notably better than the \CC standard list.
     134The linked list is a new runtime library, available through @#include <dlist.hfa>@.
     135The library offers a novel combination of the preexisting \CFA features: references, inline inheritance, polymorphism, and declaration scoping.
     136A programmer's experience using the list data-types within the type system is novel.
     137The list's runtime representation follows the established design pattern of intrusive link-fields for performance reasons, especially in concurrent programs.
     138The thesis includes a performance evaluation that shows the list performing comparably with other C-family intrusive lists, and notably better than the \CC standard list.
    113139
    114 \subsection{Array}
    115 
    116 This work's array improvements are
    117 \begin{enumerate}[leftmargin=*]
    118 \item
    119 subtle changes to the typing rules for the C array,
    120 \item
    121 a new \CFA language construct for a type-system managed array length, and
    122 \item
    123 a new standard-library array type, available as @#include <array.hfa>@.
    124 \end{enumerate}
    125 The new array type, enabled by the earlier features, defines an array with
    126 guaranteed runtime bound checks (often optimizer-removable) and
    127 implicit (guaranteed accurate) inter-function length communication.
    128 Leveraging the preexisting \CFA type system to achieve
    129 this length-related type checking is an original contribution.
    130 
    131 Furthermore, the new array incorporates multidimensional capabilities
    132 typical of scientific/machine-learning languages,
    133 made to coexist with the C ``raw-memory-aware'' tradition in a novel way.
    134140
    135141\subsection{String}
    136142
    137 The string is a new runtime library, available as @#include <string.hfa>@.
    138 The library offers a type whose basic usage is comparable to the \CC @string@ type,
    139 including analogous coexistence with raw character pointers.
    140 Its implementation, however, follows different princliples,
    141 enabling programs to work with strings by value, without incurring excessive copying.
     143The string is a new runtime library, available through @#include <string.hfa>@.
     144The library offers a type where basic usage is comparable to the \CC @string@ type, including analogous coexistence with raw-character pointers.
     145Its implementation, however, follows different principles, enabling programs to work with strings by value, without incurring excessive copying.
    142146Mutability is retained.
    143147Substrings are supported, including the ability for overlapping ranges to share edits transparently.
    144 Ultimately, this implementation is a case of strings rolling their own heap.
     148Ultimately, the implementation is a set of strings rolled into their own specialized heap.
    145149
    146 The string library includes writing and reading strings via the preexsiting \CFA I/O stream library.
    147 Enabling transparent reading (of unknown length into a managed allocation) included revision of
    148 the stream libarary's existing handling of character arrays.
     150The string library includes writing and reading strings via the preexisting \CFA I/O stream library.
     151Enabling transparent reading (of unknown length into a managed allocation) included revision of the stream library's existing handling of character arrays.
     152All reasonable stream manipulators are supported, \eg width, justification (left/right), printing characters as their numeric values, and complex input scanning to isolate strings.
     153
    149154
    150155\subsection{Iterator}
    151156
    152 Some prototyping, available in the \CFA standard library,
    153 addresses the infamous \CC bug source of iterator invalidation.
    154 A family of iterator styles is presented, with cheap iterators that resist being misused,
    155 plus full-featured iterators that observe modifications without becoming invalid.
     157Some advanced iterator prototyping now exists, available in the \CFA standard library.
     158Specifically, a family of iterator styles is provided, with cheap iterators that are robust to being misused, plus full-featured iterators that observe structural modifications without becoming invalid.
     159Hence, the infamous \CC bug of iterator invalidation, where structural changes cause legitimate operations to fail, \eg iterator over a list and deleting each element.
    156160
    157 Further design within this thesis presents a home for Liskov-style iterators in \CFA.
    158 This design extends a preexisting proposal to adapt the \CFA for-each style loop to be more user-pluggable.
    159 It also builds upon preexisting \CFA coroutines.
    160 It simplifies the work a programmer must do to leverage the suspended-state abstraction.
     161Further design provides a home for Liskov-style iterators in \CFA.
     162This design extends a preexisting proposal to adapt the \CFA (fixed) for-each loop to be more user-pluggable, and builds upon preexisting \CFA coroutines.
     163Overall, it simplifies the work a programmer must do to leverage the suspended-state abstraction during iteration.
  • doc/theses/mike_brooks_MMath/programs/bkgd-carray-arrty.c

    r720eec9 r33474e6  
    5757        f( &ar );
    5858
    59         float fs[] = {3.14, 1.77};
    60         char cs[] = "hello";
     59        float fs@[]@ = {3.14, 1.77};
     60        char cs@[]@ = "hello";    // shorthand for 'h', 'e', 'l', 'l', 'o', '\0'
    6161        static_assert( sizeof(fs) == 2 * sizeof(float) );
    6262        static_assert( sizeof(cs) == 6 * sizeof(char) );  $\C{// 5 letters + 1 null terminator}$
    6363
    64         float fm[][2] = { {3.14, 1.77}, {12.4, 0.01}, {7.8, 1.23} };  $\C{// brackets define structuring}$
    65         char cm[][sizeof("hello")] = { "hello", "hello", "hello" };
     64        float fm[]@[2]@ = { {3.14, 1.77}, {12.4, 0.01}, {7.8, 1.23} };  $\C{// brackets define structuring}$
     65        char cm[]@[sizeof("hello")]@ = { "hello", "hello", "hello" };
    6666        static_assert( sizeof(fm) == 3 * 2 * sizeof(float) );
    6767        static_assert( sizeof(cm) == 3 * 6 * sizeof(char) );
Note: See TracChangeset for help on using the changeset viewer.