Changeset 8501107


Ignore:
Timestamp:
Apr 24, 2026, 5:12:09 PM (4 days ago)
Author:
Alvin Zhang <alvin.zhang@…>
Branches:
master
Children:
79f1285
Parents:
bf73608
Message:

staged modules proposal (updated)

File:
1 edited

Legend:

Unmodified
Added
Removed
  • doc/proposals/modules-alvin/2_staged_modules/staged_modules.md

    rbf73608 r8501107  
    1212export struct S1 s[N];
    1313export struct S { int i; };
     14export struct S1 *f() {
     15    return (struct S1 *) s;
     16}
    1417</pre>
    1518        </td>
     
    3942        <td>
    4043            <b>Stage 1:</b><br>
    41             &emsp;  Exported names<br>
     44            &emsp;  Top-level names<br>
    4245            &emsp;  (file only)<br>
    4346        </td>
    4447        <td>
    4548<pre>
    46 M
    47 M1
    48 S
    49 s
     49Module name:          M
     50Imports:              M1
     51Type definitions:     S
     52Variable definitions: s, f
    5053</pre>
    5154        </td>
     
    7174            <b>Stage 2:</b><br>
    7275            &emsp;  Scope resolution<br>
    73             &emsp;  and building a symbol table<br>
    7476            &emsp;  (file + imports)<br>
    7577        </td>
    7678        <td>
    7779<pre>
    78 M  -> S, s
     80M  -> S, s, f
    7981M1 -> S1, N
     82--------
     83Types: S, S1
     84Variables: s, f, N
    8085</pre>
    8186        </td>
     
    8388<pre>
    8489M1 -> S1, N
    85 M  -> S, s
     90M  -> S, s, f
    8691M2 -> S2
     92--------
     93Types: S, S1, S2
     94Variables: s, f, N
    8795</pre>
    8896        </td>
     
    9199M2 -> S2
    92100M1 -> S1, N
     101--------
     102Types: S1, S2
     103Variables: N
    93104</pre>
    94105        </td>
     
    97108        <td>
    98109            Stage 3:<br>
    99             &emsp;  Symbol resolution in expressions<br>
     110            &emsp;  Symbol resolution<br>
    100111            &emsp;  and resolving transitive dependencies<br>
    101112            &emsp;  (file + transitive import closure)<br>
     
    103114        <td>
    104115<pre>
     116"needs definition" graph:
    105117    S  <-\
    106118/-> S1 --/
    107119|-> N
    108120\-- s
     121    f
    109122</pre>
    110123        </td>
     
    133146To be a module, `module <identifier>;` must be the first declaration.
    134147
    135 `import <identifier>;` must be before any symbol definitions.
    136 
    137 `export` must be the first keyword of a symbol definition.
    138 
    139 Only symbol definitions are allowed at top-level in modules &mdash; all top-level symbols are already visible to each other, so `struct S1;` is unnecessary.
    140 
    141 Exports make symbols visible to importers.
     148`import <identifier>;` must be before all symbol definitions.
     149
     150If it is used, `export` must be the first keyword of a symbol definition.
     151
     152Only symbol definitions are allowed at top-level in modules &mdash; `struct S1;` is unnecessary, as all top-level symbols are already visible to each other.
     153
     154Exports make symbols visible to importers. A module implicitly imports itself. Inline functions cannot be exported (moved to a future extension).
    142155
    143156*Some of these restrictions can be modified through future extensions (see Language extensions).*
     
    145158### Stage 1:
    146159
    147 Stage 1 is about gathering useful information from analyzing a single module file in isolation.
    148 
    149 At this stage, there is no symbol table, so we're limited to just exported names.
    150 
    151 In the example, first line is module name, followed by import names, type names and functions/globals.
    152 
    153 Some additional information can be gathered at this stage (eg. `const`, field names), but it's not very useful at this point &mdash; wait until stage 2.
    154 
    155 *In order to parse the file without knowing the symbol table beforehand, the first name of a function/global definition is treated as a type.*
    156 
    157 *This means implicit int return types are disallowed here (and have been since C99). Note expression statements are disallowed at top-level.*
     160Stage 1 gathers useful information from a single module file. To support cyclic imports (M and M1 import each other), this stage cannot rely on other files.
     161
     162We lack a symbol table at this stage, so we can only parse context-free information. However, we can extract top-level names.
     163
     164Some additional information can be gathered at this stage (eg. `const`, field names), but it isn't required. We do note which top-level symbols are exported.
     165
     166*In order to parse the file without a symbol table, the first name of a function/global definition is treated as a type.*
     167
     168*This means we disallow implicit int return types (they have been disallowed since C99). Note expression statements are disallowed at top-level.*
    158169
    159170### Stage 2:
    160171
    161 Stage 2 is about what can be done with the stage 1 outputs of imported files (a module file implicitly imports itself).
    162 
    163 This information allows us to determine which symbols are in scope, allowing us to resolve names.
    164 
    165 This stage lacks the definitions of the imported symbols, so anything that requires those (eg. `x.y.z`) are left for stage 3 to resolve.
    166 
    167 *This means expressions are not parsed, so expression-dependent aspects of types are unresolved (eg. array sizes, `auto`, `typeof`, `decltype`, templates).*
    168 
    169 *Note: `auto` and `typeof` are not allowed at top-level in C23. Regardless, they can still be resolved in stage 3 by lazily loading their type information.*
     172Stage 2 gathers useful information from a module file and its direct imports. We gather import information from stage 1.
     173
     174Here, we determine which symbols are in scope (ie. build a symbol table).
     175
     176This stage lacks the ability to inspect symbol definitions, so expressions (eg. `x.y.z`) are left for stage 3 to resolve.
     177
     178Types are not fully resolvable due to array sizes, `decltype` and templates. The names of types can be resolved in C, but can be done in stage 3 instead.
     179
     180*Note: `auto` and `typeof` are not allowed at top-level in C23.*
    170181
    171182### Stage 3:
    172183
    173 Stage 3 allows access to stage 2 outputs of the transitive import closure (ie. all modules that are reachable by following imports recursively).
    174 
    175 This allows us to resolve expressions and other remaining details by loading symbol definitions as needed.
     184Stage 3 can look up symbol definitions by following imports recursively (ie. can access the transitive import closure). We use information from stage 2.
     185
     186Here, we can resolve the rest of the AST by loading symbol definitions as needed (eg. S1 needs the definition of S, but only needs declaration of S2).
    176187
    177188If a cycle is detected while loading symbol definitions, we have a type of infinite size or an unresolvable ambiguity &mdash; produce an error diagnostic.
     
    183194*The stitched modules approach is to output code that would be reparsed in the original compiler, which is simpler to implement.*
    184195
    185 *However, the module system does not parse expressions, so it would have to eagerly grab all reachable symbol definitions, which is inefficient.*
    186 
    187 *The point of breaking into stages is more about unlocking parallelization opportunities and avoiding duplicate work.*
    188 
    189 ## Notes
    190 
    191 ### Stage 3 does a lot of duplicated work
    192 
    193 For example, S1 needs the definition of S in all 3 modules. Ideally we would fetch from a cache instead.
    194 
    195 Another example is inline functions. Some thought would need to be put in to figure out exactly how the cache would function.
    196 
    197 This means stage 3 is grabbing information from a previous stage 3 run, so this is an optimization, not functionally required.
    198 
    199 ### Why not merge stages 2 and 3 together?
    200 
    201 Stage 2 determines which names are in scope, stage 3 takes it further.
    202 
    203 As a language becomes more powerful, the value of stage 2 diminishes because a lot of language features require having the symbol definition.
    204 
    205 Zig weaves both stages into a single pass. Names are resolved lazily and symbol definitions loaded lazily.
    206 
    207 So stage 2 can be thought of as a substage of stage 3.
    208 
    209 ### What build optimizations does this enable?
    210 
    211 Each file within a stage can be processed in parallel.
    212 
    213 If stage 1 doesn't change anything the later stages need, could skip recompiling importers.
    214 
    215 If stage 2 gets new symbols but doesn't use them, could also skip.
    216 
    217 This fine-grained analysis could be accomplished by hashing symbol definitions, similar to Rust incremental compilation.
    218 
    219 Contrast with Zig's top-down approach, which has a different set of advantages and disadvantages.
     196*However, if the module system does not parse expressions, it has to make a conservative guess on reachable symbol definitions, which is less efficient.*
    220197
    221198## Language extensions
     
    226203* export import: This allows us to avoid having to write out every import, by having one import expand into multiple imports.
    227204* module namespacing: We can follow the same ABI as C++20 modules (ie. `name@module`). This helps us avoid linker symbol clashes.
     205* export inline functions: Allow inline functions to be exported, which would expose the definition of the function to the importer.
    228206* import as module_name: Makes it so we qualify names as `module_name.S` for more control.
    229207* tagged exports: Some modules may need more implementation details than others. `export(tag)` means `import M(tag);` gets it.
     
    231209* alternative module names: What if we imported using the file path, similar to `#include` ?
    232210* constexpr: While not a module-specific feature, having this reduces our reliance on preprocessor directives.
     211* declaration diagnostics: If a user writes `struct S1;` in a module file, provide a diagnostic instead of erroring out.
     212* dependency files: similar to preprocessors, have the module system output a list of files it has analyzed.
    233213* ordering initializers: `import ordered M;` would mean M is initialized before this module.
    234214* escape hatches: There may be cases where the user needs to circumvent the module system. Let them.
     215* generating header files: Provide a way to generate .h files from a module file so plain C can use libraries written with modules.
    235216* migration tools: By handling circular imports, we can support much of what .c/.h files do. So build an automatic migration tool.
    236217* code analysis tools: Modules make the link between symbol names explicit, so we can perform more complex analyses and refactorings.
     218
     219## Notes
     220
     221### Stage 3 does a lot of duplicated work
     222
     223For example, S1 needs the definition of S in all 3 modules. Another example is if we had inline functions.
     224
     225This information could be cached. We would need to consider how to handle staleness, dependency tracking, race conditions and duplicate symbol definitions.
     226
     227### Stage 2 vs stage 3
     228
     229Consider a codebase as a graph (DAG), where nodes are the symbols and directed edges are dependencies (either "needs definition" or "needs declaration").
     230
     231Stage 2 builds the symbol table, which is used to determine these edges. Stage 3 resolves the graph by following the edges as necessary.
     232
     233### What build optimizations does this enable?
     234
     235Each file within a stage can be processed in parallel.
     236
     237If recompiling earlier stages generates new information that importers don't use, we can skip recompiling importers.
     238
     239This fine-grained analysis could be accomplished by hashing symbol definitions, similar to Rust incremental compilation.
     240
     241### Extracting top-level names
     242
     243Extracting top-level names can have multiple edge cases. For example, the following is valid in gcc and clang:
     244
     245```
     246struct Outer {
     247    int i;
     248    struct Inner {
     249        int j;
     250    } k;
     251};
     252
     253struct Inner i;
     254```
     255
     256Since C is not formally specified, it is infeasible to catch all cases (we aim to eventually produce a "semi-formalism" of the module system, however).
     257
     258Rather than trying to cover all cases, it is more practical to work towards a use-case &mdash; migrating existing C code to this module system.
     259
     260In that light, the edge case demonstrated above is a lesser priority, as it is used infrequently and is relatively easy to manually fix.
     261
     262Handling cases such as `export struct S { int i; } s;` (exports S and s) are more important in comparison, as this pattern is fairly common.
     263
     264### To reparse or not to reparse
     265
     266While Zig demonstrates that all 3 stages can be performed in a single pass, this likely requires rewriting large parts of the compiler.
     267
     268Instead, we can implement stages 1 and 2 as producing JSON objects, and have stages 2 and 3 reparse the module files instead of caching the tokens.
     269The outputs of previous stages can be passed via the command line for simplicity.
     270
     271We can consider optimizing this at a later point, and measure speedups. But at this stage, getting functionality is more important than execution speed.
     272
     273### Reusing existing code
     274
     275Taking the idea of reparsing files even further, we can take the stitched modules approach &mdash; output code to be reparsed in the original compiler.
     276This means the module system does not parse expressions.
     277
     278The naive approach of treating an expression like a black-box would force us to load all visible symbols, since we wouldn't know which symbols it uses.
     279This is highly likely to result in a lot of false cycles, which would severly hamper functionality.
     280
     281A better approach is to extract a set of identifiers from the token stream of the expression, since an expression only uses symbols that it specifies.
     282
     283### Dependency tracking
     284
     285In a large codebase, only a few files change between compilations. We would like to avoid recompiling the codebase every time.
     286
     287The C preprocessor can output a dependency file (.d) that lists the files that have been inspected, which is later used to determine when to recompile.
     288Similarly, each stage of the module system can also output a dependency file.
     289
     290Like Zig, we can go further by using hashes instead of just timestamps, and track dependencies at the granularity of top-level symbols instead of files.
     291
     292At this stage of implementation, we defer this work until later.
     293
     294### Comparison with Zig
     295
     296Zig is a systems language with cyclic imports. It performs all 3 stages in a single pass by lazily analyzing information.
     297
     298Building a module resolves expressions (stage 3), which needs a symbol table (stage 2), which means extracting information from imports (stage 1).
     299If an expression needs a symbol definition from an import, then to parse the symbol definition you need its symbol table. Keep going until resolved.
     300
     301Since all symbols get resolved, a library or executable file can be generated directly without having to call the linker afterwards.
     302This shows that having a module system allows a language to subsume responsibilities which would otherwise need to be delegated to a separate build system.
     303
     304Object files and intermediate representation are stored in a cache. Incremental compilation patches binaries with top-level declaration granularity.
     305Zig's build system also uses a DAG of steps that run concurrently, and track compilation inputs (eg. compiler options) to get deterministic builds.
     306
     307While this would be useful, creating a build system is a huge engineering effort and outside the scope of this proposal.
     308This proposal is concerned with adding language features that a module system would use, which may be leveraged by a build system in the future.
     309
     310*To highlight the challenge of creating a build system, many of Zig's features (eg. incremental compilation) are still considered experimental.*
     311*Rust has spent almost a decade refining its incremental compilation architecture.*
     312
     313*A quick note about Rust: it also has a module and build system. A crate is considered one translation unit, which is aggressively optimized.*
     314*This hampers incremental compilation. By contrast, Zig compiles multiple object files for a project, so it can do some form of separate compilation.*
Note: See TracChangeset for help on using the changeset viewer.