Changes in / [79ba50c:5e0b6657]
- Location:
- doc/theses/mike_brooks_MMath
- Files:
-
- 1 deleted
- 8 edited
-
background.tex (modified) (40 diffs)
-
intro.tex (modified) (4 diffs)
-
papers/Final-ONCD-Technical-Report.pdf (deleted)
-
programs/bkgd-c-tyerr.c (modified) (1 diff)
-
programs/bkgd-carray-decay.c (modified) (2 diffs)
-
programs/bkgd-carray-mdim.c (modified) (1 diff)
-
programs/lst-issues-multi-static.run.c (modified) (4 diffs)
-
uw-ethesis-frontpgs.tex (modified) (1 diff)
-
uw-ethesis.bib (modified) (3 diffs)
Legend:
- Unmodified
- Added
- Removed
-
doc/theses/mike_brooks_MMath/background.tex
r79ba50c r5e0b6657 7 7 8 8 C reports many ill-typed expressions as warnings. 9 For example, these attempts to assign @y@ to @x@ and vice-versa are ill-typed.9 For example, these attempts to assign @y@ to @x@ and vice-versa are obviously ill-typed. 10 10 \lstinput{12-15}{bkgd-c-tyerr.c} 11 11 with warnings: … … 21 21 note: expected 'void (*)(void)' but argument is of type 'float *' 22 22 \end{cfa} 23 with a segmentation fault at runtime. 23 24 Clearly, @gcc@ understands these ill-typed case, and yet allows the program to compile, which seems inappropriate. 24 25 Compiling with flag @-Werror@, which turns warnings into errors, is often too pervasive, because some warnings are just warnings, \eg an unused variable. … … 86 87 (\VRef{s:ArraysDecay} shows pointer decay allows the first form to be written @pa[i] += 1@, which is further syntax confusion.) 87 88 Again, if the priorities were reversed, the expressions become more intuitive: @*pa[i] += 1@ and @*(ap[i]) += 1@. 88 Note, a similar priority inversion exists between deference '@*@' and field selection '@.@'(period), so @*ps.f@ means @*(ps.f)@;89 Note, a similar priority inversion exists between deference @*@ and field selection @.@ (period), so @*ps.f@ means @*(ps.f)@; 89 90 this anomaly is \emph{fixed} with operator @->@, which performs the two operations in the more intuitive order: @sp->f@ $\Rightarrow$ @(*sp).f@. 90 91 \end{itemize} … … 127 128 As declaration size increases, it becomes corresponding difficult to read and understand the C form, whereas reading and understanding a \CFA declaration has linear complexity. 128 129 Note, writing declarations left to right is common in other programming languages, where the function return-type is often placed after the parameter declarations, \eg \CC \lstinline[language=C++]{auto f( int ) -> int}. 129 However, putting the return type at the end deviates from where the return value logically appears in an expression, @x = f(...)@ versus @f(...) = x@. 130 (Note, putting the return type at the end deviates from where the return value logically appears in an expression, @x = f(...)@ versus @f(...) = x@.) 130 131 Interestingly, programmers normally speak a declaration from left to right, regardless of how it is written. 131 132 (It is unclear if Hebrew or Arabic speakers, say declarations right to left.) 132 133 133 Specifically,\VRef[Table]{bkgd:ar:usr:avp} introduces the many layers of the C and \CFA array story, where the \CFA story is discussion in \VRef[Chapter]{c:Array}.134 \VRef[Table]{bkgd:ar:usr:avp} introduces the many layers of the C and \CFA array story, where the \CFA story is discussion in \VRef[Chapter]{c:Array}. 134 135 The \CFA-thesis column shows the new array declaration form, which is my contribution to safety and ergonomics. 135 136 The table shows there are multiple yet equivalent forms for the array types under discussion, and subsequent discussion shows interactions with orthogonal (but easily confused) language features. … … 137 138 The simplest occurrences of types distinguished in the preceding discussion are marked with $\triangleright$. 138 139 Removing the declared variable @x@, gives the type used for variable, structure field, cast, or error messages. 139 Unfortunately, parameter declarations have more syntactic forms and rules \see{Section~\ref{s:ArraysDecay}~\VPageref{p:ParameterDeclarations}}.140 Unfortunately, parameter declarations have more syntactic forms and rules. 140 141 141 142 \begin{table} … … 143 144 \caption{Syntactic Reference for Array vs Pointer. Includes interaction with \lstinline{const}ness.} 144 145 \label{bkgd:ar:usr:avp} 145 \setlength{\tabcolsep}{12pt} 146 \begin{tabular}{@{}ll|l|l|l@{}} 146 \begin{tabular}{ll|l|l|l} 147 147 & Description & \multicolumn{1}{c|}{C} & \multicolumn{1}{c|}{\CFA} & \multicolumn{1}{c}{\CFA-thesis} \\ 148 148 \hline … … 253 253 Using the type form yields the same results as the prior expression form. 254 254 \lstinput{46-49}{bkgd-carray-arrty.c} 255 The results are also the same when there is no allocation at all , \eg a parameter declaration.255 The results are also the same when there is no allocation at all. 256 256 This time, starting from a pointer-to-array type: 257 257 \lstinput{51-57}{bkgd-carray-arrty.c} … … 297 297 ``pointer to \emph{type}'' that points to the initial element of the array object~\cite[\S~6.3.2.1.3]{C11} 298 298 \end{quote} 299 This phenomenon is the famous (infamous)\newterm{pointer decay}, which is a decay of an array-typed expression into a pointer-typed one.299 This phenomenon is the famous \newterm{pointer decay}, which is a decay of an array-typed expression into a pointer-typed one. 300 300 It is worthy to note that the list of exceptional cases does not feature the occurrence of @ar@ in @ar[i]@. 301 301 Thus, subscripting happens on pointers not arrays. … … 305 305 \cite[\S~6.5.6.8]{C11} explains that the addition, of a pointer with an integer type, is defined only when the pointer refers to an element that is in an array, with a meaning of @i@ elements away from, which is valid if @ar@ is big enough and @i@ is small enough. 306 306 Finally, \cite[\S~6.5.3.2.4]{C11} explains that the @*@ operator's result is the referenced element. 307 Taken together, these rules illustrate that @ar[i]@ and @i[a]@ mean the same thing , as plus is commutative.307 Taken together, these rules illustrate that @ar[i]@ and @i[a]@ mean the same thing! 308 308 309 309 Subscripting a pointer when the target is standard-inappropriate is still practically well-defined. … … 318 318 But \emph{nothing} more is said about this pointer value, specifically that its referent might \emph{be} an array allowing subscripting. 319 319 320 Under this assumption, a pointer , @p@, being subscripted (or added to, then dereferenced) by any value (positive, zero, or negative), gives a view of the program's entire address space, centred around @p@'s address, divided into adjacent @sizeof(*p)@ chunks, each potentially (re)interpreted as @typeof(*p).t@.321 I call this phenomenon \emph{array diffraction}, which is a diffraction of a single-element pointer into the assumption that its target is conceptuallyin the middle of an array whose size is unlimited in both directions.320 Under this assumption, a pointer being subscripted (or added to, then dereferenced) by any value (positive, zero, or negative), gives a view of the program's entire address space, centred around the @p@ address, divided into adjacent @sizeof(*p)@ chunks, each potentially (re)interpreted as @typeof(*p)@. 321 I call this phenomenon \emph{array diffraction}, which is a diffraction of a single-element pointer into the assumption that its target is in the middle of an array whose size is unlimited in both directions. 322 322 No pointer is exempt from array diffraction. 323 323 No array shows its elements without pointer decay. 324 324 325 \label{p:ParameterDeclarations}326 325 A further pointer--array confusion, closely related to decay, occurs in parameter declarations. 327 326 \cite[\S~6.7.6.3.7]{C11} explains that when an array type is written for a parameter, … … 380 379 \label{bkgd:ar:usr:decay-parm} 381 380 \centering 382 \setlength{\tabcolsep}{10pt} 383 \begin{tabular}{@{}llllll@{}} 381 \begin{tabular}{llllll} 384 382 & Description & Type & Parameter Declaration & \CFA \\ 385 383 \hline … … 428 426 Unfortunately, there is significant misinformation about VLAs, \eg the stack size is limited (small), or VLAs cause stack failures or are inefficient. 429 427 VLAs exist as far back as Algol W~\cite[\S~5.2]{AlgolW} and are a sound and efficient data type. 430 Most shells limit the maximum stack size to guard against infinite recursion, but the stack size can be set to unlimited, just like the heap.431 428 For types with a dynamic-fixed stack, \eg coroutines or user-level threads, large VLAs can overflow the stack without appropriately sizing the stack, so heap allocation is used when the array size is unbounded. 432 429 … … 439 436 When working with arrays, \eg linear algebra, array dimensions are referred to as \emph{rows} and \emph{columns} for a matrix, adding \emph{planes} for a cube. 440 437 (There is little terminology for higher dimensional arrays.) 441 For example, an acrostic poem\footnote{A kind of poetry where the first, last or other letters in a line spell out aword or phrase in a vertical column.}438 For example, an acrostic poem\footnote{A type of poetry where the first, last or other letters in a line spell out a particular word or phrase in a vertical column.} 442 439 can be treated as a grid of characters, where the rows are the text and the columns are the embedded keyword(s). 443 440 Within a poem, there is the concept of a \newterm{slice}, \eg a row is a slice for the poem text, a column is a slice for a keyword. … … 513 510 For example, an array-of-array may be an array of row pointers to arrays of columns, so the rows may not be contiguous in memory nor even the same length (triangular matrix). 514 511 Regardless, there is usually a uniform subscripting syntax masking the memory layout, even though a language could differentiated between the two forms using subscript syntax, \eg @m[1,2]@ \vs @aa[1][2]@. 515 Nevertheless, controlling memory layout can make a difference in what operations are allowed and in performance (\eg caching/NUMA effects). 516 C also provides non-contiguous arrays-of-arrays: 512 Nevertheless, controlling memory layout can make a difference in what operations are allowed and in performance (caching/NUMA effects). 513 514 C also provides non-contiguous arrays-of-arrays. 517 515 \begin{cfa} 518 516 int m[5][5]; $\C{// contiguous}$ 519 517 int * aa[5]; $\C{// non-contiguous}$ 520 518 \end{cfa} 521 both with different memory layout using the same subscripting and different degrees of issues. 522 519 both with different memory layout using the same subscripting, and both with different degrees of issues. 523 520 The focus of this work is on the contiguous multidimensional arrays in C. 524 521 The reason is that programmers are often forced to use the more complex array-of-array form when a contiguous array would be simpler, faster, and safer. … … 527 524 \VRef[Figure]{f:ContiguousNon-contiguous} shows a powerful extension made in C99 for manipulating contiguous \vs non-contiguous arrays.\footnote{C90 also supported non-contiguous arrays.} 528 525 For contiguous-array arguments (including VLA), C99 conjoins one or more of the parameters as a downstream dimension(s), \eg @cols@, implicitly using this parameter to compute the row stride of @m@. 529 There is now sufficient information to support array copying and subscript checking along the columns to prevent changing the argument or buffer-overflow problems, \emph{but neither feature is provided}.526 There is now sufficient information to support array copying and subscript checking along the columns to prevent changing the argument or buffer-overflow problems, but neither feature is provided. 530 527 If the declaration of @fc@ is changed to: 531 528 \begin{cfa} … … 535 532 While this contiguous-array capability is a step forward, it is still the programmer's responsibility to manually manage the number of dimensions and their sizes, both at the function definition and call sites. 536 533 That is, the array does not automatically carry its structure and sizes for use in computing subscripts. 537 While the non-contiguous style in @faa@ looks very similar to @fc@, the compiler only understands the unknown-sized array of row pointers, and it relies on the programmer to traverse the columns in a row with a correctly bounded loop index.534 While the non-contiguous style in @faa@ looks very similar to @fc@, the compiler only understands the unknown-sized array of row pointers, and it relies on the programmer to traverse the columns in a row correctly with a correctly bounded loop index. 538 535 Specifically, there is no requirement that the rows are the same length, like a poem with different length lines. 539 536 … … 586 583 Again, an array and a point to each of its axes are different. 587 584 \lstinput{28-36}{bkgd-carray-mdim.c} 588 As well, there is pointer decay from each of the matrix axes to pointers, all havingthe same address.585 As well, there is pointer decay from each of the matrix axes to pointers, which all have the same address. 589 586 \lstinput{38-44}{bkgd-carray-mdim.c} 590 Finally, subscripting is allowedon a @malloc@ result, where the referent may or may not allow subscripting or have the right number of subscripts.587 Finally, subscripting on a @malloc@ result, where the referent may or may not allow subscripting or have the right number of subscripts. 591 588 592 589 … … 608 605 // caller's perspectives of foo and bar 609 606 \end{cfa} 610 From the caller's perspective, the parameter names are semantically optional, but provide useful comments.611 From the callee's perspective, parameter names are semantically essential.607 From the caller's perspective, the parameter names (by virtue of being optional) are (useful) comments; 608 From the callee's perspective, parameter names are semantically significant. 612 609 Array parameters introduce a further, subtle, semantic difference and considerable freedom to comment. 613 610 614 At the semantic level, there is no such thing as an array parameter, except for one case , @T [static 5]@,discussed shortly.615 Instead, there are only pointer parameters.611 At the semantic level, there is no such thing as an array parameter, except for one case (@T [static 5]@) discussed shortly. 612 Rather, there are only pointer parameters. 616 613 This fact probably shares considerable responsibility for the common sense of \emph{an array is just a pointer}, which has been refuted in non-parameter contexts. 617 614 This fact holds in both the caller's and callee's perspectives. 618 However, a parameter's type can include ``array of'', \eg the type ``pointer to array of 5 ints'' (@T (*)[5]@) , whichis a pointer type.619 This type is fully meaningful in the sense that its description does not contain any information that the type system ignores, and the type appears the same in the caller's andcallee's perspectives.615 However, a parameter's type can include ``array of'', \eg the type ``pointer to array of 5 ints'' (@T (*)[5]@) is a pointer type. 616 This type is fully meaningful in the sense that its description does not contain any information that the type system ignores, and the type appears the same in the caller's \vs callee's perspectives. 620 617 In fact, the outermost type constructor (syntactically first dimension) is really the one that determines the parameter flavour. 621 618 622 C ignores length information given in parameter declarations entirely, when determining a function's type. 623 For example, it accepts this pair of declarations 624 \begin{cfa} 625 void f( int, float[][42] ); $\C{// array of len-42 arrays}$ 626 void f( int n, float[][n] ); $\C{// array of VLAs}$ 627 \end{cfa} 628 as a repeat, without an error about conflicting types for @f@. 629 Yet, entirely different stride calculations would occur in a function body whose parameters were declared in each of the two styles. 619 \PAB{TODO: add examples of mycode/arrr/bugs/c-dependent/x.cfa:v5102,5103, 620 which are shocking how much C ignores.} 630 621 631 622 \begin{figure} 632 \setlength{\tabcolsep}{7pt}633 623 \begin{tabular}{@{}llll@{}} 634 624 \begin{cfa} … … 673 663 Yet, C allows array syntax for the outermost type constructor, from which comes the freedom to comment. 674 664 An array parameter declaration can specify the outermost dimension with a dimension value, @[10]@ (which is ignored), an empty dimension list, @[ ]@, or a pointer, @*@, as seen in \VRef[Figure]{f:ArParmEquivDecl}. 675 The rationale for rejecting the first invalid row follows shortly, while the second invalid row is nonsense, included to complete the pattern; its syntax hints at what the final row actually achieves.676 Note, in the leftmost style, the typechecker ignores the actual value , even fora dynamic expression.665 The rationale for rejecting the first invalid row follows shortly, while the second invalid row is simple nonsense, included to complete the pattern; its syntax hints at what the final row actually achieves. 666 Note, in the leftmost style, the typechecker ignores the actual value even in a dynamic expression. 677 667 \begin{cfa} 678 668 int N; … … 684 674 % So are @float[5]*@, @float[]*@ and @float (*)*@. These latter ones are simply nonsense, though they hint at ``1d array of pointers'', whose equivalent syntax options are, @float *[5]@, @float *[]@, and @float **@. 685 675 686 It is a matter of taste as to whether a programmer should use the left form to get the most out of commenting subscripting and dimension sizes, sticking to the right (avoiding false comfort from suggesting the typechecker is checking more than it is), or compromising in the middle (reducing unchecked information, yet clearly stating, ``I will subscript'').676 It is a matter of taste as to whether a programmer should use a form as far left as possible (getting the most out of possible subscripting and dimension sizes), sticking to the right (avoiding false comfort from suggesting the typechecker is checking more than it is), or compromising in the middle (reducing unchecked information, yet clearly stating, ``I will subscript''). 687 677 688 678 Note that this equivalence of pointer and array declarations is special to parameters. … … 690 680 \begin{cfa} 691 681 void f( float * a ) { 692 float * b = a; // ok693 float c[] = a; // reject694 float d[] = { 1.0, 2.0, 3.0 }; // ok682 float * b = a; // ok 683 float c[] = a; // reject 684 float d[] = { 1.0, 2.0, 3.0 }; // ok 695 685 static_assert( sizeof(b) == sizeof(float*) ); 696 686 static_assert( sizeof(d) != sizeof(float*) ); 697 687 } 698 688 \end{cfa} 699 The consequence of this equivalence isthat the type system does not help a caller get it right.689 Unfortunately, this equivalence has the consequence that the type system does not help a caller get it right. 700 690 \begin{cfa} 701 691 float sum( float v[] ); … … 725 715 void foo( int [static @3@] ); 726 716 int ar[@10@]; 727 foo( ar ); // check argument dimension 10 > 3717 foo( ar ); // check argument dimension 10 > 3 728 718 \end{cfa} 729 719 Here, the @static@ storage qualifier defines the minimum array size for its argument. … … 738 728 739 729 With multidimensional arrays, on dimensions after the first, a size is required and, is not ignored. 740 These sizes (strides)are required for the callee to be able to subscript.730 These sizes are required for the callee to be able to subscript. 741 731 \begin{cfa} 742 732 void f( float a[][10], float b[][100] ) { … … 749 739 The significance of an inner dimension's length is a fact of the callee's perspective. 750 740 In the caller's perspective, the type system is quite lax. 751 Here, there is (some, but) little checking of what is being passed matches the parameter.741 Here, there is (some, but) little checking that what is being passed, matches. 752 742 % void f( float [][10] ); 753 743 % int n = 100; … … 784 774 void foo( float a[][10] ) { ... } $\C{// definition}$ 785 775 \end{cfa} 786 Repeating it with the full context of a VLA is useful .776 Repeating it with the full context of a VLA is useful: 787 777 \begin{cfa} 788 778 void foo( int, float [][@*@] ); $\C{// declaration}$ … … 804 794 typedef long int jmp_buf[8]; 805 795 \end{cfa} 806 A ninstance of this array can be declared as a structure field.796 A instance of this array can be declared as a structure field. 807 797 \begin{cfa} 808 798 struct Jmp_Buf { … … 829 819 Since the data is opaque, list structures are often polymorphic over the data, which is often homogeneous. 830 820 831 The links organize nodes into a particular kind of data structure, \eg queue, tree, hash table, \etc, with operations specific to that kind.821 The links organize nodes into a particular format, \eg queue, tree, hash table, \etc, with operations specific to that format. 832 822 Because a node's existence is independent of the data structure that organizes it, all nodes are manipulated by address not value; 833 823 hence, all data structure routines take and return pointers to nodes and not the nodes themselves. … … 838 828 839 829 This thesis focuses on a reduced design space for linked lists that target \emph{system programmers}. 840 Within this restricted space, all design-issue discussions assume the following invariants. 830 Within this restricted space, all design-issue discussions assume the following invariants; 831 alternatives to the assumptions are discussed under Future Work (\VRef{toc:lst:futwork}). 841 832 \begin{itemize} 842 833 \item A doubly-linked list is being designed. … … 844 835 Circular \vs ordered linking is discussed under List identity (\VRef{toc:lst:issue:ident}). 845 836 \item Link fields are system-managed. 837 The user works with the system-provided API to query and modify list membership. 846 838 The system has freedom over how to represent these links. 847 The user works with the system-provided API to query and modify list membership.848 839 \item The user data must provide storage for the list link-fields. 849 840 Hence, a list node is \emph{statically} defined as data and links \vs a node that is \emph{dynamically} constructed from data and links \see{\VRef{toc:lst:issue:attach}}. 850 841 \end{itemize} 851 Alternatives to these assumptions are discussed under Future Work (\VRef{toc:lst:futwork}).852 842 853 843 … … 855 845 \label{s:PreexistingLinked-ListLibraries} 856 846 857 To show examples of the concepts being defined, two preexisting linked-list libraries are used throughout and further libraries are introduced as needed. 847 Two preexisting linked-list libraries are used throughout, to show examples of the concepts being defined, 848 and further libraries are introduced as needed. 858 849 \begin{enumerate} 859 850 \item Linux Queue library~\cite{lst:linuxq} (LQ) of @<sys/queue.h>@. … … 861 852 \end{enumerate} 862 853 %A general comparison of libraries' abilities is given under Related Work (\VRef{toc:lst:relwork}). 863 For the discussion, assume the type @req@ (request) is the user's payload in examples.864 Then the job of a list library is to help a user manage (organize) requests, \eg a request can be a network arrival-event processed by a web browser or a thread blocked/scheduled by the runtime.854 For the discussion, assume the fictional type @req@ (request) is the user's payload in examples. 855 As well, the list library is helping the user manage (organize) requests, \eg a request can be work on the level of handling a network arrival-event or scheduling a thread. 865 856 866 857 … … 875 866 The wrapped style distinguishes between wrapping a reference and wrapping a value, \eg @list<req *>@ or @list<req>@. 876 867 (For this discussion, @list<req &>@ is similar to @list<req *>@.) 877 This difference is one of user style and performance (copying), not framework capability.868 This difference is one of user style, not framework capability. 878 869 Library LQ is intrusive; STL is wrapped with reference and value. 879 870 … … 982 973 An example of an attribute on the link fields is cache alignment, possibly in conjunction with other @req@ fields, improving locality and/or avoiding false sharing. 983 974 For example, if a list is frequently traversed in the forward direction, and infrequently gets elements removed at random positions, then an ideal layout for cache locality puts the forward links, together with frequently-used payload data on one cache line, leaving the reverse links on a colder cache line. 984 In contrast, supplyinglink fields by inheritance makes them implicit and relies on compiler placement, such as the start or end of @req@, and no explicit attributes.975 Supplying the link fields by inheritance makes them implicit and relies on compiler placement, such as the start or end of @req@, and no explicit attributes. 985 976 Wrapped reference has no control over the link fields, but the separate data allows some control; 986 977 wrapped value has no control over data or links. 987 978 988 979 Another subtle advantage of intrusive arrangement is that a reference to a user-level item (@req@) is sufficient to navigate or manage the item's membership. 989 In LQ, the intrusive @req@ pointer is the correct argument type for operations @LIST_NEXT@ or @LIST_REMOVE@;980 In LQ, the intrusive @req@ pointer is the right argument type for operations @LIST_NEXT@ or @LIST_REMOVE@; 990 981 there is no distinguishing a @req@ from a @req@ in a list. 991 982 The same is not true of STL, wrapped reference or value. … … 998 989 Then, a novel use can put a @req@ in a list, without requiring any upstream change in the @req@ library. 999 990 In intrusive, the ability to be listed must be planned during the definition of @req@. 1000 Optimistically adding a couple links for future use is normally cheap because links are small and memory is big.1001 991 1002 992 \begin{figure} … … 1024 1014 An intrusive-primitive library like LQ lets users choose when to make this tradeoff. 1025 1015 A wrapped-primitive library like STL forces users to incur the costs of wrapping, whether or not they access its benefits. 1026 Like LQ,\CFA is capable of supporting a wrapped library, if need arose.1016 \CFA is capable of supporting a wrapped library, if need arose. 1027 1017 1028 1018 … … 1155 1145 \subsection{User Integration: Preprocessed \vs Type-System Mediated} 1156 1146 1157 While the syntax for LQ is reasonably succinct, it comes at the cost of using C preprocessor macros for generics, which are not part of the language type-system, unlike \CC templates.1147 While the syntax for LQ is reasonably succinct, it comes at the cost of using C preprocessor macros for generics, which are not part of the language type-system, like \CC templates. 1158 1148 Hence, small errors in macro arguments can lead to large substitution mistakes, as the arguments maybe textually written in many places and/or concatenated with other arguments/text to create new names and expressions. 1159 Hence, textual expansion can lead to a cascade of error messages that are confusing and difficult to debug. 1160 For example, argument errors like @a.b@{\Large\color{red},}@c@, comma instead of period, or @by@{\Large\color{red}-}@pri@, minus instead of underscore, can produce many error messages. 1161 Note, similar problems exist for \CC templates. 1162 1163 Instead, language function calls (even with inlining) handle argument mistakes locally at the call, giving very specific error message. 1164 \CC @concepts@ were introduced in @templates@ to deal with this problem. 1149 This can lead to a cascade of error messages that are confusing and difficult to debug. 1150 For example, argument errors like @a.b,c@, comma instead of period, or @by-pri@, minus instead of underscore, can produce many error messages. 1151 1152 Instead, language function calls (even with inlining) handled argument mistakes locally at the call, giving very specific error message. 1153 \CC @concepts@ were introduced in @templates@ to deal with just this problem. 1165 1154 1166 1155 % example of poor error message due to LQ's preprocessed integration … … 1177 1166 All examples so far use two distinct types for: 1178 1167 an item found in a list (type @req@ of variable @r1@, see \VRef[Figure]{fig:lst-issues-attach}), and the list (type @reql@ of variable @reqs_pri@, see \VRef[Figure]{fig:lst-issues-ident}). 1179 This kind of list is \newterm{headed}, where the empty list is just a head.1180 An alternate ad-hocapproach omits the header, where the empty list is no nodes.1168 This kind of list is ``headed'', where the empty list is just a head. 1169 An alternate ``ad-hoc'' approach omits the header, where the empty list is no nodes. 1181 1170 Here, a pointer to any node can traverse its link fields: right or left and around, depending on the data structure. 1182 1171 Note, a headed list is superset of an ad-hoc list, and can normally perform all of the ad-hoc operations. … … 1213 1202 Note, a list does not need to use links to denote its size; 1214 1203 it can use a node counter in the header, where $N$ node traversals indicates complete navigation of the list. 1215 However, managing the number of nodes is an additional cost, whereas the links must always be managed.1204 However, managing the number of nodes is an additional cost, as the links must always be managed. 1216 1205 1217 1206 The following discussion refers to the LQ representations, detailed in \VRef[Figure]{fig:lst-issues-end}, using a null pointer to mark end points. … … 1262 1251 1263 1252 A C character constant is an ASCII/Latin-1 character enclosed in single-quotes, \eg @'x'@, @'@\textsterling@'@. 1264 A wide C character constant is the same, except prefixed by the letter @L@, @u@, or @U@, \eg @u'\u25A0'@ (black square), where the @ '\u'@ identifies a universal character name.1265 A character can be formed from an escape sequence, which expresses a non-typable character @'\f'@ , form feed, a delimiter character @'\''@, embedded single quote, or a raw character @'\xa3'@,\textsterling.1253 A wide C character constant is the same, except prefixed by the letter @L@, @u@, or @U@, \eg @u'\u25A0'@ (black square), where the @\u@ identifies a universal character name. 1254 A character can be formed from an escape sequence, which expresses a non-typable character @'\f'@ form feed, a delimiter character @'\''@ embedded single quote, or a raw character @'\xa3'@ \textsterling. 1266 1255 1267 1256 A C character string is zero or more regular, wide, or escape characters enclosed in double-quotes @"xyz\n"@. 1268 The kind of characters in the string is denoted by a prefix: wide characters are prefixed by @L@, @u@, or @U@; UTF-8 characters are prefixed by @u8@. 1269 1257 The kind of characters in the string is denoted by a prefix: UTF-8 characters are prefixed by @u8@, wide characters are prefixed by @L@, @u@, or @U@. 1258 1259 For UTF-8 string literals, the array elements have type @char@ and are initialized with the characters of the multi-byte character sequences, \eg @u8"\xe1\x90\x87"@ (Canadian syllabic Y-Cree OO). 1270 1260 For wide string literals prefixed by the letter @L@, the array elements have type @wchar_t@ and are initialized with the wide characters corresponding to the multi-byte character sequence, \eg @L"abc@$\mu$@"@ and are read/printed using @wscanf@/@wprintf@. 1271 The value of a wide-character is implementation defined, usually a UTF-16 character.1261 The value of a wide-character is implementation-defined, usually a UTF-16 character. 1272 1262 For wide string literals prefixed by the letter @u@ or @U@, the array elements have type @char16_t@ or @char32_t@, respectively, and are initialized with wide characters corresponding to the multi-byte character sequence, \eg @u"abc@$\mu$@"@, @U"abc@$\mu$@"@. 1273 1263 The value of a @"u"@ character is an UTF-16 character; 1274 1264 the value of a @"U"@ character is an UTF-32 character. 1275 The value of a string literal containing a multi-byte character or escape sequence not represented in the execution character set is implementation defined. 1276 For UTF-8 string literals, the array elements have type @char@ and are initialized with the characters of the multi-byte character sequences, \eg @u8"\xe1\x90\x87"@ (Canadian syllabic Y-Cree OO). 1265 The value of a string literal containing a multi-byte character or escape sequence not represented in the execution character set is implementation-defined. 1277 1266 1278 1267 C strings are null-terminated rather than maintaining a separate string length. … … 1284 1273 This property is only preserved by the compiler with respect to character constants, \eg @"abc"@ is actually @"abc\0"@, \ie 4 characters rather than 3. 1285 1274 Otherwise, the compiler does not participate, making string operations both unsafe and inefficient. 1286 For example, it is common in C to: 1287 \begin{itemize} 1288 \item 1289 forget that a character constant is larger than it appears during manipulation, 1290 \item 1291 that extra storage is needed in a character array for the terminator, 1292 \item 1293 or that the terminator must be preserved during string operations, otherwise there are array overruns. 1294 \end{itemize} 1275 For example, it is common in C to: forget that a character constant is larger than it appears during manipulation, that extra storage is needed in a character array for the terminator, or that the terminator must be preserved during string operations, otherwise there are array overruns. 1295 1276 Finally, the need to repeatedly scan an entire string to determine its length can result in significant cost, as it is impossible to cache the length in many cases, \eg when a string is passed into another function. 1296 1277 -
doc/theses/mike_brooks_MMath/intro.tex
r79ba50c r5e0b6657 1 1 \chapter{Introduction} 2 2 3 All modern programming languages provide at least thesethree high-level containers (collections): array, linked-list, and string.4 Often , the array is part of the programming language, while linked lists are built from (recursive) pointer types, and strings from arrays and/or linked lists.5 For all three types, languages and/or their libraries supply varying degrees of high-level mechanisms for manipulating these objects at the bulk and component level s, such as copying, slicing, extracting, and iterating among elements.3 All modern programming languages provide three high-level containers (collections): array, linked-list, and string. 4 Often array is part of the programming language, while linked-lists are built from (recursive) pointer types, and strings from a combination of array and linked-list. 5 For all three types, languages and/or their libraries supply varying degrees of high-level mechanisms for manipulating these objects at the bulk and component level, such as copying, slicing, extracting, and iterating among elements. 6 6 7 Unfortunately, memory misuse under C-idiomatic programming causes a significant number oferrors~\cite{Oorschot23}.7 Unfortunately, these three aspects of C cause a significant number of memory errors~\cite{Oorschot23}. 8 8 Estimates suggest 50\%~\cite{Mendio24} of total reported open-source vulnerabilities occurring in C result from errors using these facilities (memory errors). 9 9 For operating system and browser vendors, who heavily use systems languages, 60\%--70\% of reported software vulnerabilities involved memory errors~\cite{Kehrer23}. 10 10 For Microsoft, 70\% of vulnerabilities addressed via security updates between 2006--2018 are memory safety issues~\cite[slide 10]{Miller19}. 11 In a study of software exploits in the U.S. National Vulnerability Database over 2013--2017, the top reported vulnerability is buffer errors (in the sense of misused array bounds), among 19 vulnerability categories~\cite{Cifuentes19}. 11 In a study of software exploits in the U.S. National Vulnerability Database over 2013--2017, the top reported vulnerability is (memory) buffer errors, among 19 vulnerability categories~\cite{Cifuentes19}. 12 Therefore, hardening these three C types goes a long way to make the majority of C programs safer and eliminating major hacker attack-vectors. 12 13 13 While these figures address memory errors, defined without limit to the three primary collections, these authors' explanations of memory errors, among other C analysis efforts, suggest improving on C's language/library options for working with these three collections would be widely beneficial. 14 Every one of the safety-statistic sources above includes array bounds in its first characterization of unsafe programming-language features. 15 Of the four studies that deal exclusively with memory safety (excluding~\cite{Cifuentes19}), all put @malloc@--@free@ upfront as well. 16 The one that specifically surveys C features, \cite{Oorschot23}, also has similarly prominent blame on both of: the null-terminated text idiom, with its support/mandate in the C standard library; the unprotected nature of C pointers. 17 An effort at translating C to safe Rust~\cite{Emre2022} has pointers as the stand-out difficult aspect to analyze.\footnote{ 18 \cite{Emre2022} demonstrates the difficulty of analyzing C pointers for safety (while tacking some of this difficult problem). 19 It first puts a corpus of C programs through a naive Rust translation, which ``succeeds'' by leaving many uses of Rust's \lstinline{unsafe} marker. 20 Then, it analyzes these uses and presents a technique for reducing those uses that hinge exclusively on referenced object lifetime. 21 Pointer dereference accounts for 80\% of the cases that the naive translation did not make safe (Table 3.2, p33). 22 Of these, 10\% are eligible for the work's own safety improvement technique, while this filtering leaves 75\% with no single cause of unsafety determined (Table 3.5, p39). 23 The presented technique succeeds at making 75\% of the eligible initially unsafe dereferences safe, by inferring the necessary lifetime annotations (Table 4.2, p83). 24 This result leaves, per 1000 LOC, 3.3 unremovable unsafe dereferences that are understood to evade lifetime analysis, among 220 gross unsafe dereferences (Tables 3.5, 4.2 and 3.1, p27). 25 % 3.3 = 1398/413,428 * 1000 26 % 220 ~= (99,101 [@tab3.5] - 9631 [@tab4.2] ) / 413,428 [@tab 3.1] * 1000 27 } 28 A tool for analyzing C code's use of linked data structures (including, \eg skip list and binary tree) found variations of the linked-list shape to be present in most of the programs in its corpus~\cite{White2016}. 29 So, C's array unsafety is infamous, its string pattern necessitates explicit storage management (also infamous), and user-written linked lists are an attrictively recognizable player in the arena of (infamously) unsafe raw pointer uses. 30 Therefore, hardening the three primary collections goes a long way to making the majority of C programs safer, and eliminating major hacker attack vectors. 31 32 This work looks at extending these foundational container types in the programming language \CFA, which is a new dialect of the C programming language. 14 This work looks at extending these three foundational container types in the programming language \CFA, which is a new dialect of the C programming language. 33 15 A primary goal of \CFA~\cite{Cforall} is 99\% backward compatibility with C, while maintaining a look and feel that matches with C programmer experience and intuition. 34 16 This goal requires ``thinking inside the box'' to engineer new features that ``work and play'' with C and its massive legacy code-base. … … 48 30 49 31 C provides a simple array type as a language feature. 50 However, it adopts the controversial position of treating pointer and array as twins, leading to multiple problems. 51 The way that arrays are typicaly passed around a program removes the information necessary to do bound checks. 32 However, it adopts the controversial position of treating pointer and array as duals, leading to multiple problems. 52 33 53 34 … … 60 41 otherwise, elements are heap allocated with internal or external link fields. 61 42 62 C provides a few simple, library data-structures (@glibc@):43 C provides a few simple, polymorphic, library data-structures (@glibc@): 63 44 \begin{itemize} 64 45 \item … … 67 48 hash search table consisting of a key (string) with associated data (@<search.h>@) 68 49 \end{itemize} 69 Because these container libraries can be restrictive or awkward to use, C programmers often build bespoke linked data-structures, which further increases the potential for memory errors.50 Because these container libraries can be restrictive, awkward to use, and unsafe, C programmers often build bespoke linked data-structures, which further increases the potential for memory errors. 70 51 71 52 -
doc/theses/mike_brooks_MMath/programs/bkgd-c-tyerr.c
r79ba50c r5e0b6657 16 16 17 17 float pi = 3.14; 18 void f( void (*g)(void) ) ;18 void f( void (*g)(void) ) { g(); } 19 19 @f( &pi );@ $\C{// wrong}$ 20 20 } -
doc/theses/mike_brooks_MMath/programs/bkgd-carray-decay.c
r79ba50c r5e0b6657 10 10 assert( sizeof(pa0x) != sizeof(ar) ); 11 11 12 void f( @float x[10]@, @float * y@) {12 void f( float x[10], float * y ) { 13 13 static_assert( sizeof(x) == sizeof(void *) ); 14 14 static_assert( sizeof(y) == sizeof(void *) ); … … 16 16 f( 0, 0 ); 17 17 18 // reusing local var iable float ar[10]18 // reusing local var `float a[10];`} 19 19 float v; 20 20 f( ar, ar ); $\C{// ok: two decays, one into an array spelling}$ -
doc/theses/mike_brooks_MMath/programs/bkgd-carray-mdim.c
r79ba50c r5e0b6657 38 38 // float (*b[3])[10]; 39 39 float * b[3]; 40 for ( int i = 0; i < 3; i ++) {41 b[i] = malloc( sizeof( float[10] ));40 for (int i = 0; i < 3; i ++) { 41 b[i] = malloc(sizeof(float[10])); 42 42 } 43 43 mx[2][3]; -
doc/theses/mike_brooks_MMath/programs/lst-issues-multi-static.run.c
r79ba50c r5e0b6657 20 20 struct req { 21 21 int pri, rqr; 22 LIST_ENTRY( req) by_pri;23 LIST_ENTRY( req) by_rqr;22 LIST_ENTRY(req) by_pri; 23 LIST_ENTRY(req) by_rqr; 24 24 }; 25 25 26 LIST_HEAD( reql, req);26 LIST_HEAD(reql, req); 27 27 28 28 struct reql reqs_pri; … … 31 31 struct reql reqs_rqr_99; 32 32 33 LIST_INIT( &reqs_pri);34 LIST_INIT( &reqs_rqr_42);35 LIST_INIT( &reqs_rqr_17);36 LIST_INIT( &reqs_rqr_99);33 LIST_INIT(&reqs_pri); 34 LIST_INIT(&reqs_rqr_42); 35 LIST_INIT(&reqs_rqr_17); 36 LIST_INIT(&reqs_rqr_99); 37 37 38 38 struct req … … 44 44 r99a = {3, 99}; 45 45 46 LIST_INSERT_HEAD( &reqs_pri, &r17c, by_pri);47 LIST_INSERT_HEAD( &reqs_pri, &r99a, by_pri);48 LIST_INSERT_HEAD( &reqs_pri, &r17b, by_pri);49 LIST_INSERT_HEAD( &reqs_pri, &r42b, by_pri);50 LIST_INSERT_HEAD( &reqs_pri, &r17a, by_pri);51 LIST_INSERT_HEAD( &reqs_pri, &r42a, by_pri);46 LIST_INSERT_HEAD(&reqs_pri, &r17c, by_pri); 47 LIST_INSERT_HEAD(&reqs_pri, &r99a, by_pri); 48 LIST_INSERT_HEAD(&reqs_pri, &r17b, by_pri); 49 LIST_INSERT_HEAD(&reqs_pri, &r42b, by_pri); 50 LIST_INSERT_HEAD(&reqs_pri, &r17a, by_pri); 51 LIST_INSERT_HEAD(&reqs_pri, &r42a, by_pri); 52 52 53 LIST_INSERT_HEAD( &reqs_rqr_42, &r42b, by_rqr);54 LIST_INSERT_HEAD( &reqs_rqr_42, &r42a, by_rqr);53 LIST_INSERT_HEAD(&reqs_rqr_42, &r42b, by_rqr); 54 LIST_INSERT_HEAD(&reqs_rqr_42, &r42a, by_rqr); 55 55 56 LIST_INSERT_HEAD( &reqs_rqr_17, &r17c, by_rqr);57 LIST_INSERT_HEAD( &reqs_rqr_17, &r17b, by_rqr);58 LIST_INSERT_HEAD( &reqs_rqr_17, &r17a, by_rqr);56 LIST_INSERT_HEAD(&reqs_rqr_17, &r17c, by_rqr); 57 LIST_INSERT_HEAD(&reqs_rqr_17, &r17b, by_rqr); 58 LIST_INSERT_HEAD(&reqs_rqr_17, &r17a, by_rqr); 59 59 60 LIST_INSERT_HEAD( &reqs_rqr_99, &r99a, by_rqr);60 LIST_INSERT_HEAD(&reqs_rqr_99, &r99a, by_rqr); 61 61 62 62 … … 67 67 68 68 struct req *cur; 69 LIST_FOREACH( cur, &reqs_pri, by_pri)70 printf( "{%d %d} ", cur->pri, cur->rqr);71 printf( "| ");72 LIST_FOREACH( cur, &reqs_rqr_42, by_rqr)73 printf( "{%d %d} ", cur->pri, cur->rqr);74 printf( "| ");75 LIST_FOREACH( cur, &reqs_rqr_17, by_rqr)76 printf( "{%d %d} ", cur->pri, cur->rqr);77 printf( "| ");78 LIST_FOREACH( cur, &reqs_rqr_99, by_rqr)79 printf( "{%d %d} ", cur->pri, cur->rqr);80 printf( "\n");69 LIST_FOREACH(cur, &reqs_pri, by_pri) 70 printf("{%d %d} ", cur->pri, cur->rqr); 71 printf("| "); 72 LIST_FOREACH(cur, &reqs_rqr_42, by_rqr) 73 printf("{%d %d} ", cur->pri, cur->rqr); 74 printf("| "); 75 LIST_FOREACH(cur, &reqs_rqr_17, by_rqr) 76 printf("{%d %d} ", cur->pri, cur->rqr); 77 printf("| "); 78 LIST_FOREACH(cur, &reqs_rqr_99, by_rqr) 79 printf("{%d %d} ", cur->pri, cur->rqr); 80 printf("\n"); 81 81 82 82 } -
doc/theses/mike_brooks_MMath/uw-ethesis-frontpgs.tex
r79ba50c r5e0b6657 131 131 \CFA strives to fix mistakes in C, chief among them, safety. 132 132 This thesis presents a significant step forward in \CFA's goal to remove unsafe pointer operations. 133 It describes improvements to the \CFA language designto support advanced container features.134 These features are implemented across the \CFA compiler and runtime libraries.135 The results maintain another \CFA goal of offering strong backwards compatibilitywith C.136 This work leverages preexisting \CFA contributiongs of prior students working on the \CFA project, particularly through novel applications of the compiler's type system.137 138 All modern programming languages provide at least thesethree high-level containers (collections): array, linked-list, and string.139 Often , the array is part of the programming language, while linked lists are built from (recursive) pointer types, and strings from arrays and/or linked lists.140 For all three types, languages and/or their libraries supply varying degrees of high-level mechanisms for manipulating these objects at the bulk and component level s, such as copying, slicing, extracting, and iterating among elements.141 Unfortunately, t ypical solutions for the the key types in C cause 60\%--70\% of the reported software vulnerabilities involving memory errors; 70\%--80\% of hacker attackvectors target these types.133 The thesis presents improvements to the \CFA language design, both syntax and semantics, to support advanced container features. 134 These features are implemented across the \CFA compiler, libraries, and runtime system. 135 The results maintain another \CFA goal of remaining 99\% backwards compatible with C. 136 This thesis leverages preexisting work within the compiler's type and runtime systems generated by prior students working on the \CFA project. 137 138 All modern programming languages provide three high-level containers (collections): array, linked-list, and string. 139 Often array is part of the programming language, while linked-lists are built from (recursive) pointer types, and strings from a combination of array and linked-list. 140 For all three types, languages and/or their libraries supply varying degrees of high-level mechanisms for manipulating these objects at the bulk and component level, such as copying, slicing, extracting, and iterating among elements. 141 Unfortunately, these three aspects of C cause 60\%--70\% of the reported software vulnerabilities involving memory errors, and 70\%--80\% of hacker attack-vectors target these types. 142 142 Therefore, hardening these three C types goes a long way to make the majority of C programs safer. 143 143 144 144 Specifically, an array utility is provided that tracks length internally, relieving the user of managing explicit length parameters and stopping buffer-overrun errors. 145 145 This feature requires augmenting the \CFA type system, making array length available at compile and runtime. 146 A linked-list utility is provided, which obviates many user-managed recursive pointers by catering directly to system-programming uses (intrusive linking, ad-hoc listing) for which a library solution is often dismissed.146 A linked-list utility is provided, which obviates many explicit recursive pointers by catering directly to system-programming uses (intrusive lists) for which a library solution is often dismissed. 147 147 Finally, a string utility is provided with implicit memory management of text in a specialized heap, relieving error-prone buffer management, including overrun, and providing a copy-on-write speed boost. 148 148 For all three utilities, performance is argued to be on-par with, and occasionally surpassing relevant comparators. -
doc/theses/mike_brooks_MMath/uw-ethesis.bib
r79ba50c r5e0b6657 94 94 } 95 95 96 % -------------------------------------------------97 % Safety98 99 96 @article{Blache19, 100 97 author = {Gunter Blache}, … … 125 122 } 126 123 127 @phdthesis{Emre2022,128 author = "Mehmet Emre",129 title = "Translating C to Safe Rust: Reasoning about Pointer Types and Lifetimes",130 school = "UC Santa Barbara",131 year = "2022"132 }133 134 @inproceedings{White2016,135 author = {White, David H. and Rupprecht, Thomas and L\"{u}ttgen, Gerald},136 title = {DSI: an evidence-based approach to identify dynamic data structures in C programs},137 year = {2016},138 isbn = {9781450343909},139 publisher = {Association for Computing Machinery},140 address = {New York, NY, USA},141 url = {https://doi.org/10.1145/2931037.2931071},142 doi = {10.1145/2931037.2931071},143 booktitle = {Proceedings of the 25th International Symposium on Software Testing and Analysis},144 pages = {259–269},145 numpages = {11},146 keywords = {Data structure identification, dynamic data structures, pointer programs, program comprehension},147 location = {Saarbr\"{u}cken, Germany},148 series = {ISSTA 2016}149 }150 151 % -----------------------------------------------152 % Misc153 154 124 @misc{RVO20, 155 125 contributer = {pabuhr@plg}, … … 179 149 howpublished= {\url{https://c-faq.com/decl/spiral.anderson.html}}, 180 150 } 181
Note:
See TracChangeset
for help on using the changeset viewer.