Changeset 8501107
- Timestamp:
- Apr 24, 2026, 5:12:09 PM (4 days ago)
- Branches:
- master
- Children:
- 79f1285
- Parents:
- bf73608
- File:
-
- 1 edited
Legend:
- Unmodified
- Added
- Removed
-
doc/proposals/modules-alvin/2_staged_modules/staged_modules.md
rbf73608 r8501107 12 12 export struct S1 s[N]; 13 13 export struct S { int i; }; 14 export struct S1 *f() { 15 return (struct S1 *) s; 16 } 14 17 </pre> 15 18 </td> … … 39 42 <td> 40 43 <b>Stage 1:</b><br> 41   Exportednames<br>44   Top-level names<br> 42 45   (file only)<br> 43 46 </td> 44 47 <td> 45 48 <pre> 46 M 47 M148 S49 s 49 Module name: M 50 Imports: M1 51 Type definitions: S 52 Variable definitions: s, f 50 53 </pre> 51 54 </td> … … 71 74 <b>Stage 2:</b><br> 72 75   Scope resolution<br> 73   and building a symbol table<br>74 76   (file + imports)<br> 75 77 </td> 76 78 <td> 77 79 <pre> 78 M -> S, s 80 M -> S, s, f 79 81 M1 -> S1, N 82 -------- 83 Types: S, S1 84 Variables: s, f, N 80 85 </pre> 81 86 </td> … … 83 88 <pre> 84 89 M1 -> S1, N 85 M -> S, s 90 M -> S, s, f 86 91 M2 -> S2 92 -------- 93 Types: S, S1, S2 94 Variables: s, f, N 87 95 </pre> 88 96 </td> … … 91 99 M2 -> S2 92 100 M1 -> S1, N 101 -------- 102 Types: S1, S2 103 Variables: N 93 104 </pre> 94 105 </td> … … 97 108 <td> 98 109 Stage 3:<br> 99   Symbol resolution in expressions<br>110   Symbol resolution<br> 100 111   and resolving transitive dependencies<br> 101 112   (file + transitive import closure)<br> … … 103 114 <td> 104 115 <pre> 116 "needs definition" graph: 105 117 S <-\ 106 118 /-> S1 --/ 107 119 |-> N 108 120 \-- s 121 f 109 122 </pre> 110 123 </td> … … 133 146 To be a module, `module <identifier>;` must be the first declaration. 134 147 135 `import <identifier>;` must be before a nysymbol definitions.136 137 `export` must be the first keyword of a symbol definition.138 139 Only symbol definitions are allowed at top-level in modules — all top-level symbols are already visible to each other, so `struct S1;` is unnecessary.140 141 Exports make symbols visible to importers. 148 `import <identifier>;` must be before all symbol definitions. 149 150 If it is used, `export` must be the first keyword of a symbol definition. 151 152 Only symbol definitions are allowed at top-level in modules — `struct S1;` is unnecessary, as all top-level symbols are already visible to each other. 153 154 Exports make symbols visible to importers. A module implicitly imports itself. Inline functions cannot be exported (moved to a future extension). 142 155 143 156 *Some of these restrictions can be modified through future extensions (see Language extensions).* … … 145 158 ### Stage 1: 146 159 147 Stage 1 is about gathering useful information from analyzing a single module file in isolation. 148 149 At this stage, there is no symbol table, so we're limited to just exported names. 150 151 In the example, first line is module name, followed by import names, type names and functions/globals. 152 153 Some additional information can be gathered at this stage (eg. `const`, field names), but it's not very useful at this point — wait until stage 2. 154 155 *In order to parse the file without knowing the symbol table beforehand, the first name of a function/global definition is treated as a type.* 156 157 *This means implicit int return types are disallowed here (and have been since C99). Note expression statements are disallowed at top-level.* 160 Stage 1 gathers useful information from a single module file. To support cyclic imports (M and M1 import each other), this stage cannot rely on other files. 161 162 We lack a symbol table at this stage, so we can only parse context-free information. However, we can extract top-level names. 163 164 Some additional information can be gathered at this stage (eg. `const`, field names), but it isn't required. We do note which top-level symbols are exported. 165 166 *In order to parse the file without a symbol table, the first name of a function/global definition is treated as a type.* 167 168 *This means we disallow implicit int return types (they have been disallowed since C99). Note expression statements are disallowed at top-level.* 158 169 159 170 ### Stage 2: 160 171 161 Stage 2 is about what can be done with the stage 1 outputs of imported files (a module file implicitly imports itself).162 163 This information allows us to determine which symbols are in scope, allowing us to resolve names.164 165 This stage lacks the definitions of the imported symbols, so anything that requires those(eg. `x.y.z`) are left for stage 3 to resolve.166 167 *This means expressions are not parsed, so expression-dependent aspects of types are unresolved (eg. array sizes, `auto`, `typeof`, `decltype`, templates).* 168 169 *Note: `auto` and `typeof` are not allowed at top-level in C23. Regardless, they can still be resolved in stage 3 by lazily loading their type information.*172 Stage 2 gathers useful information from a module file and its direct imports. We gather import information from stage 1. 173 174 Here, we determine which symbols are in scope (ie. build a symbol table). 175 176 This stage lacks the ability to inspect symbol definitions, so expressions (eg. `x.y.z`) are left for stage 3 to resolve. 177 178 Types are not fully resolvable due to array sizes, `decltype` and templates. The names of types can be resolved in C, but can be done in stage 3 instead. 179 180 *Note: `auto` and `typeof` are not allowed at top-level in C23.* 170 181 171 182 ### Stage 3: 172 183 173 Stage 3 allows access to stage 2 outputs of the transitive import closure (ie. all modules that are reachable by following imports recursively).174 175 This allows us to resolve expressions and other remaining details by loading symbol definitions as needed.184 Stage 3 can look up symbol definitions by following imports recursively (ie. can access the transitive import closure). We use information from stage 2. 185 186 Here, we can resolve the rest of the AST by loading symbol definitions as needed (eg. S1 needs the definition of S, but only needs declaration of S2). 176 187 177 188 If a cycle is detected while loading symbol definitions, we have a type of infinite size or an unresolvable ambiguity — produce an error diagnostic. … … 183 194 *The stitched modules approach is to output code that would be reparsed in the original compiler, which is simpler to implement.* 184 195 185 *However, the module system does not parse expressions, so it would have to eagerly grab all reachable symbol definitions, which is inefficient.* 186 187 *The point of breaking into stages is more about unlocking parallelization opportunities and avoiding duplicate work.* 188 189 ## Notes 190 191 ### Stage 3 does a lot of duplicated work 192 193 For example, S1 needs the definition of S in all 3 modules. Ideally we would fetch from a cache instead. 194 195 Another example is inline functions. Some thought would need to be put in to figure out exactly how the cache would function. 196 197 This means stage 3 is grabbing information from a previous stage 3 run, so this is an optimization, not functionally required. 198 199 ### Why not merge stages 2 and 3 together? 200 201 Stage 2 determines which names are in scope, stage 3 takes it further. 202 203 As a language becomes more powerful, the value of stage 2 diminishes because a lot of language features require having the symbol definition. 204 205 Zig weaves both stages into a single pass. Names are resolved lazily and symbol definitions loaded lazily. 206 207 So stage 2 can be thought of as a substage of stage 3. 208 209 ### What build optimizations does this enable? 210 211 Each file within a stage can be processed in parallel. 212 213 If stage 1 doesn't change anything the later stages need, could skip recompiling importers. 214 215 If stage 2 gets new symbols but doesn't use them, could also skip. 216 217 This fine-grained analysis could be accomplished by hashing symbol definitions, similar to Rust incremental compilation. 218 219 Contrast with Zig's top-down approach, which has a different set of advantages and disadvantages. 196 *However, if the module system does not parse expressions, it has to make a conservative guess on reachable symbol definitions, which is less efficient.* 220 197 221 198 ## Language extensions … … 226 203 * export import: This allows us to avoid having to write out every import, by having one import expand into multiple imports. 227 204 * module namespacing: We can follow the same ABI as C++20 modules (ie. `name@module`). This helps us avoid linker symbol clashes. 205 * export inline functions: Allow inline functions to be exported, which would expose the definition of the function to the importer. 228 206 * import as module_name: Makes it so we qualify names as `module_name.S` for more control. 229 207 * tagged exports: Some modules may need more implementation details than others. `export(tag)` means `import M(tag);` gets it. … … 231 209 * alternative module names: What if we imported using the file path, similar to `#include` ? 232 210 * constexpr: While not a module-specific feature, having this reduces our reliance on preprocessor directives. 211 * declaration diagnostics: If a user writes `struct S1;` in a module file, provide a diagnostic instead of erroring out. 212 * dependency files: similar to preprocessors, have the module system output a list of files it has analyzed. 233 213 * ordering initializers: `import ordered M;` would mean M is initialized before this module. 234 214 * escape hatches: There may be cases where the user needs to circumvent the module system. Let them. 215 * generating header files: Provide a way to generate .h files from a module file so plain C can use libraries written with modules. 235 216 * migration tools: By handling circular imports, we can support much of what .c/.h files do. So build an automatic migration tool. 236 217 * code analysis tools: Modules make the link between symbol names explicit, so we can perform more complex analyses and refactorings. 218 219 ## Notes 220 221 ### Stage 3 does a lot of duplicated work 222 223 For example, S1 needs the definition of S in all 3 modules. Another example is if we had inline functions. 224 225 This information could be cached. We would need to consider how to handle staleness, dependency tracking, race conditions and duplicate symbol definitions. 226 227 ### Stage 2 vs stage 3 228 229 Consider a codebase as a graph (DAG), where nodes are the symbols and directed edges are dependencies (either "needs definition" or "needs declaration"). 230 231 Stage 2 builds the symbol table, which is used to determine these edges. Stage 3 resolves the graph by following the edges as necessary. 232 233 ### What build optimizations does this enable? 234 235 Each file within a stage can be processed in parallel. 236 237 If recompiling earlier stages generates new information that importers don't use, we can skip recompiling importers. 238 239 This fine-grained analysis could be accomplished by hashing symbol definitions, similar to Rust incremental compilation. 240 241 ### Extracting top-level names 242 243 Extracting top-level names can have multiple edge cases. For example, the following is valid in gcc and clang: 244 245 ``` 246 struct Outer { 247 int i; 248 struct Inner { 249 int j; 250 } k; 251 }; 252 253 struct Inner i; 254 ``` 255 256 Since C is not formally specified, it is infeasible to catch all cases (we aim to eventually produce a "semi-formalism" of the module system, however). 257 258 Rather than trying to cover all cases, it is more practical to work towards a use-case — migrating existing C code to this module system. 259 260 In that light, the edge case demonstrated above is a lesser priority, as it is used infrequently and is relatively easy to manually fix. 261 262 Handling cases such as `export struct S { int i; } s;` (exports S and s) are more important in comparison, as this pattern is fairly common. 263 264 ### To reparse or not to reparse 265 266 While Zig demonstrates that all 3 stages can be performed in a single pass, this likely requires rewriting large parts of the compiler. 267 268 Instead, we can implement stages 1 and 2 as producing JSON objects, and have stages 2 and 3 reparse the module files instead of caching the tokens. 269 The outputs of previous stages can be passed via the command line for simplicity. 270 271 We can consider optimizing this at a later point, and measure speedups. But at this stage, getting functionality is more important than execution speed. 272 273 ### Reusing existing code 274 275 Taking the idea of reparsing files even further, we can take the stitched modules approach — output code to be reparsed in the original compiler. 276 This means the module system does not parse expressions. 277 278 The naive approach of treating an expression like a black-box would force us to load all visible symbols, since we wouldn't know which symbols it uses. 279 This is highly likely to result in a lot of false cycles, which would severly hamper functionality. 280 281 A better approach is to extract a set of identifiers from the token stream of the expression, since an expression only uses symbols that it specifies. 282 283 ### Dependency tracking 284 285 In a large codebase, only a few files change between compilations. We would like to avoid recompiling the codebase every time. 286 287 The C preprocessor can output a dependency file (.d) that lists the files that have been inspected, which is later used to determine when to recompile. 288 Similarly, each stage of the module system can also output a dependency file. 289 290 Like Zig, we can go further by using hashes instead of just timestamps, and track dependencies at the granularity of top-level symbols instead of files. 291 292 At this stage of implementation, we defer this work until later. 293 294 ### Comparison with Zig 295 296 Zig is a systems language with cyclic imports. It performs all 3 stages in a single pass by lazily analyzing information. 297 298 Building a module resolves expressions (stage 3), which needs a symbol table (stage 2), which means extracting information from imports (stage 1). 299 If an expression needs a symbol definition from an import, then to parse the symbol definition you need its symbol table. Keep going until resolved. 300 301 Since all symbols get resolved, a library or executable file can be generated directly without having to call the linker afterwards. 302 This shows that having a module system allows a language to subsume responsibilities which would otherwise need to be delegated to a separate build system. 303 304 Object files and intermediate representation are stored in a cache. Incremental compilation patches binaries with top-level declaration granularity. 305 Zig's build system also uses a DAG of steps that run concurrently, and track compilation inputs (eg. compiler options) to get deterministic builds. 306 307 While this would be useful, creating a build system is a huge engineering effort and outside the scope of this proposal. 308 This proposal is concerned with adding language features that a module system would use, which may be leveraged by a build system in the future. 309 310 *To highlight the challenge of creating a build system, many of Zig's features (eg. incremental compilation) are still considered experimental.* 311 *Rust has spent almost a decade refining its incremental compilation architecture.* 312 313 *A quick note about Rust: it also has a module and build system. A crate is considered one translation unit, which is aggressively optimized.* 314 *This hampers incremental compilation. By contrast, Zig compiles multiple object files for a project, so it can do some form of separate compilation.*
Note:
See TracChangeset
for help on using the changeset viewer.