source: doc/proposals/modules.md @ 2325b57

Last change on this file since 2325b57 was eaeba79, checked in by Peter A. Buhr <pabuhr@…>, 2 months ago
update module proposal with some emails on the topic

1	AJB \|
2	----'
3
4	Module System Proposal
5	======================
6
7	In this proposal we will be discussing modules. Although their exact nature changes between programming languages, modules are the smallest unit of code reuse between programs, or the base unit in separate compilation. Modules, and the extended module system, will be tied up in various stages of compilation and execution, with a particular focus on visibility between different parts of the program.
8
9	Note that terminology is not fixed across languages. For instance, some languages use the word package or library instead. Module was chosen as the generic term because it seems to have the least amount of other uses (for example, a package is sometimes a group of modules).
10
11	In C there is no formal definition of module, but informally modules are a pair of files, the body file (.c) and the header (.h). The header provides the interface and the body file gives the implementation. (A translation unit is a source file, usually a .c file, and all the recursively included files.) Some modules, like the main module, may only be the body and others may only be the header.
12
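As a rough illustration of the informal header/body pairing, the two files of a small module are shown here in one listing so it can be compiled as is. The module name `counter` and everything in it are invented for the example; in practice the first part would be `counter.h` and the second `counter.c`.

```c
/* ---- counter.h: the interface; everything here is visible to includers ---- */
#ifndef COUNTER_H
#define COUNTER_H
int counter_next(void);   /* declaration only, the definition lives in the body */
#endif

/* ---- counter.c: the implementation; would begin with #include "counter.h" ---- */
static int current = 0;               /* internal linkage: not visible to other translation units */
static int step(void) { return 1; }   /* helper that appears only in the body */

int counter_next(void) {
    current += step();
    return current;
}
```

Other translation units that include the header can call `counter_next`, but cannot name `current` or `step`.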
13	Uses of Modules
14	---------------
15	This section covers the features of a module system. What are the components that allow for separate compilation? And what are the common ways it is implemented, as a baseline?
16
17	Modules are often, but not always, how a language views and manages source files. There is an overlap in purpose, reuse and organization, between modules and files, so it is a natural fit.
18
19	In almost all languages, there is some kind of parity between modules and source files, with modules being mapped onto one (or a few) source files. Sometimes the module name is used to find the appropriate source files, requiring this parity to be enforced in the language. Other times the parity is just a convention or is enforced for other reasons.
20
21	If there is a universal feature of modules, it is information visibility. Modules decide what information within them is visible to other modules. Here visibility is used in the coarse-grained sense of "visible in another module for any purpose".
22
23	Visibility can be implemented in different ways, but most can be considered to be an explicit or implicit export list, that is, some way to list the declarations that are visible to other modules. This may be an explicit listing in the module, another direct system, such as specifiers on the declarations, or some implicit style that calls out certain declarations as visible.
24
25	Accessibility is the more fine-grained partner to visibility, allowing for information to be visible, but only usable for certain purposes. This includes privacy and friendship - only usable in certain parts of the program - or inlining information - only usable by the optimizer.
26
27	Accessibility is usually implemented by annotations as part of a declaration.
28
29	In languages that have namespacing/qualified-names, modules will often interact with namespaces. For example, each module may place its declarations in a namespace of the same name.
30
31	When this does happen, the module name is treated as a namespace name after the module is imported.
32
33	C Comparisons
34	.............
35	To be 99% compatible with C, Cforall pretty much has to use the C-preprocessor (or replace it with a Cforall-preprocessor, that is in turn backwards compatible). To this end, how well does the C-preprocessor operate in these areas?
36
37	Separate compilation can be broken into two parts, parallel compilation and reducing recompilation.
38	The C-preprocessor is almost ideal in terms of parallel compilation: it has no compilation order dependences between modules, regardless of the logical dependences between the modules.
39	Reducing recompilation is a bit more mixed. A file is recompiled if it, or any file it directly or indirectly includes, changes. This is a reasonable approximation, but has many false positives from generally irrelevant changes (ex. to a comment) or from recursive includes that are not needed in this particular module, only by other header information incidentally included.
40
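The recompilation rule can be sketched as a walk over the include graph. This is only a minimal model, not how any particular build tool works; the `File` structure, its `changed` flag and the function `needs_recompile` are names invented for the illustration.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_INCLUDES 4

/* Hypothetical model of one source or header file. */
typedef struct File {
    const char *name;
    bool changed;                         /* edited since the last build */
    bool visiting;                        /* guards against include cycles */
    struct File *includes[MAX_INCLUDES];  /* direct includes, NULL terminated */
} File;

/* A file must be recompiled if it, or anything it includes directly or
   indirectly, has changed. */
bool needs_recompile(File *f) {
    if (f->changed) return true;
    if (f->visiting) return false;        /* already on the current path */
    f->visiting = true;
    bool result = false;
    for (size_t i = 0; i < MAX_INCLUDES && f->includes[i] != NULL; i++) {
        if (needs_recompile(f->includes[i])) { result = true; break; }
    }
    f->visiting = false;
    return result;
}
```

Note that `changed` is set by any edit at all, so a comment-only change to a deeply nested header still forces every includer to be rebuilt, which is the false positive described above.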
41	Of the supporting features, C only uses visibility. All information in a header is visible and anything only in the body is not. Modules are erased (except for source locations used in error messages) and cannot be used after the preprocessor.
42
43	This leads to a very particular limitation of modules in C (and C++): no declaration remembers which module it came from. In early C this meant that nothing in a header could cause anything to be added to a translation unit; it could only help give meaning to declarations in bodies. Now there are cases where that is allowed, but only if redundant copies are allowed or the linker is able to remove all but one copy (and which one is left is the linker's decision).
44
45	Module Linkage Specification
46	----------------------------
47	A proposed solution is to keep track of code and whether or not it is in the module currently being compiled. This "is_in_module" linkage* is used in the compiler (and perhaps the preprocessor) to mark different declarations. Usually, only the original source file (the `.cfa` file) and its header (a `.hfa` file) are considered to be in the module.
48
49	Prelude definitions are never considered to be inside the current module, except when compiling the prelude itself.
50
51	* That is linkage in the sense of linkage specifier (like mangled, or overridable) not external/internal linkage (part of storage classes).
52
53	How to Specify the Module
54	-------------------------
55	Perhaps the trickiest issue is figuring out where the module is after the C-preprocessor has finished its work.
56
57	If we don't include the preprocessor in this (which has the distinct advantage of not needing to update the C-preprocessor), then the module needs to be blocked out in C code. This is fairly trivial in the source file; marking the end of the include statements is usually good enough.
58
59	Headers are harder because they are almost always mixed in with other includes, both in other files and their own. I have been able to think of two solutions that do not get caught up in these problems:
60	1. Mark out the header include in the source file (in addition to the source file body) and have the header escape all of its includes. This gives us start and stop points for the module.
61	2. Have the header mark its body in a way that mentions the source file. Most includes may have these blocks, but the non-matching ones can be discarded.
62
63	Using the preprocessor (or at least relying on the line marks/processed line directives) opens things up a bit more. With accurate knowledge of what original file a declaration came from, all that needs to be done is to map files onto modules. This is less flexible, but it covers the standard layout of headers, and even many of the unconventional layouts I have seen.
64
65	As for which files are part of the module, a source file is always part of its own module. The paired header (same path and name, except for the extension) could automatically be included in the module, but this might take away some needed flexibility. Allowing intermediate extensions (see the AST/Pass files for an example) would allow for slightly more flexibility. The other way would be to specify it in the source files themselves. Headers could say which modules they are a part of, but I think the more natural solution may be to have a file already in the module say what other files in the module it is including.
66
67	Within that, the marker could go alongside the include, be part of the include, or be a list of files in the source file. Any of these options should work.
68	> // With the include:
69	> #pragma module "filename.hfa"
70	> #include "filename.hfa"
71	>
72	> // Part of the include:
73	> #include_module "filename.hfa"
74	>
75	> // Listed Source Files:
76	> #pragma module "filename.hfa" "included-from-filename.hfa"
77	> #include "filename.hfa"
78	> // In the previous examples, the include in filename.hfa would be updated.
79
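The automatic pairing option (same path and name, with intermediate extensions allowed) can be sketched as a comparison of path stems. This assumes the compiler already knows the originating file of each declaration, for instance from the preprocessor's line markers; `stem_length` and `in_module` are hypothetical helper names, not part of any existing tool.

```c
#include <stdbool.h>
#include <string.h>

/* Length of the path up to (not including) the first '.' in its final component. */
static size_t stem_length(const char *path) {
    const char *slash = strrchr(path, '/');
    const char *base = slash ? slash + 1 : path;
    const char *dot = strchr(base, '.');
    return dot ? (size_t)(dot - path) : strlen(path);
}

/* True if `file` shares its directory and base name with the module's source
   file, ignoring every extension (so intermediate extensions are allowed). */
bool in_module(const char *module_source, const char *file) {
    size_t n = stem_length(module_source);
    return n == stem_length(file) && strncmp(module_source, file, n) == 0;
}
```

With this rule, a hypothetical `src/list.cfa` would claim `src/list.hfa` and `src/list.impl.hfa`, but not `lib/list.hfa`. The explicit `#pragma module` forms above would replace or override this guess.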
80	Uses of Module Linkage
81	----------------------
82	After we know what sections are in the module and which are not, how do we use this to actually support coding?
83
84	In the preprocessor, the simplest use is a conditional macro. It takes two arguments, and expands to one of them depending on whether the tokens were found in the module or not. This would require an implementation directly in the preprocessor.
85	> __MODULE__(if_inside_module, if_outside_module)
86
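The effect of such a macro can be approximated in plain C today, under the assumption that the build defines a symbol only when compiling the module's own source file. The names `IN_MODULE`, `MODULE_SELECT`, `next_id` and `take_id` are invented for this sketch. Unlike the proposed macro, the choice here depends on which translation unit is being compiled, not on where the tokens came from, so it cannot tell apart the different headers included by the module's source file.

```c
/* Assumption: the build defines IN_MODULE only for the module's own source
   file (for example with -DIN_MODULE); it is defined here so the sketch is
   self-contained. */
#define IN_MODULE 1

#ifdef IN_MODULE
#define MODULE_SELECT(inside, outside) inside
#else
#define MODULE_SELECT(inside, outside) outside
#endif

/* Would live in the header: a definition inside the module, a declaration elsewhere. */
MODULE_SELECT(int next_id = 1;, extern int next_id;)

int take_id(void) { return next_id++; }
```

Every other translation unit sees only `extern int next_id;`, so the storage is allocated exactly once even though the line sits in a shared header.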
87	In the compiler proper, the linkage can be checked on declarations to handle them in the compiler. A simple example is a function specifier that takes the module status into account. Say "module_inline", which becomes "inline" (if anything) in the module and "extern inline" elsewhere. This (using some GCC behaviour) allows every file to see the function definition and inline it, but only the module will keep a non-inlined copy. This ensures that there is only one translation unit with a copy without involving the linker.
88
89	This may also help solve other memory-allocated-in-header problems, as this memory can then only be allocated in the module.
90
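The "module_inline" behaviour can be imitated with a macro on GCC or Clang, which accept the `gnu_inline` function attribute to select GNU inline semantics regardless of the language standard in use. This is a sketch under that assumption, not portable C; `IN_MODULE`, `MODULE_INLINE` and `square` are illustrative names.

```c
/* Assumption: set by the build for the module's own source file only. */
#define IN_MODULE 1

#ifdef IN_MODULE
/* GNU inline semantics: plain inline also emits the one out-of-line copy. */
#define MODULE_INLINE inline __attribute__((gnu_inline))
#else
/* GNU inline semantics: extern inline is used for inlining only, no copy is emitted. */
#define MODULE_INLINE extern inline __attribute__((gnu_inline))
#endif

/* Would live in the header, visible to every translation unit. */
MODULE_INLINE int square(int x) { return x * x; }
```

Under ISO C99 inline semantics the roles of `inline` and `extern inline` are reversed, which is why the attribute is spelled out here.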
91	It may also be used to help implement visibility. The level of granularity is still module level, but private information can be included in the header and used by the compiler while being hidden from direct use in other modules. For example, you could mark the fields of a structure as private: the layout is known to the compiler, but other modules cannot perform field access and would have to use other provided functions to manipulate and read the type. (There are a few containers in the library that do this by convention.)
92
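For comparison, this is roughly what the convention-only version looks like in C today. The `Stack` type, the `private_` prefix and the accessor names are made up for the example; nothing stops another module from touching the fields, which is the gap a module-aware private specifier would close.

```c
#include <stddef.h>

/* Would live in the header: the layout is visible, so the compiler can size
   the type and inline the accessors. */
typedef struct {
    size_t private_length;   /* by convention, touched only inside the module */
    int private_data[8];
} Stack;

static inline void stack_init(Stack *s) { s->private_length = 0; }
static inline size_t stack_size(const Stack *s) { return s->private_length; }

/* Returns 0 on success, -1 if the stack is full. */
static inline int stack_push(Stack *s, int value) {
    if (s->private_length == 8) return -1;
    s->private_data[s->private_length++] = value;
    return 0;
}
```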
93	Remaining Issues
94	----------------
95	Not all of these have to be solved, but there are still some areas that could really use an improvement.
96
97	First, using modules as the visibility tool does lead to a major shortcoming. Because there is only "in-module" and "out-of-module", multiple things in the same header don't know that they are in the same module, which could prevent adding inline functions in the header.
98
99	Second, this does nothing to solve the oversized header issue. It does not reduce any requirements on what includes need to be used.
100
101	Alternate Solutions
102	-------------------
103	There are other ways C's modules could be improved in Cforall.
104
105	Explicit Module Blocks
106	......................
107	Instead of trying to map files to modules, modules could be declared explicitly, marking out the beginning and the end of a section of code as a module. If built on top of the body/header and include system, it might look like this.
108
109	> extern module NAME {
110	> BODY
111	> }
112	>
113	> module NAME {
114	> BODY
115	> }
116
117	The extern module goes in the header, the other module goes in the body. The basic usage is the forward declarations in the header module and the body contains the definitions. It can be used to check that the two sets match, but on its own it only replicates the current header/body divide with a bit more explicit syntax. However, it can be used as the base for a lot of features of the module linkage system. It does solve the "knowing two declarations came from the same other module" problem (and could work with namespaces) but is otherwise very similar, with a heavier syntax.
118
119	Compiled Headers
120	................
121	Most programming languages do not share source code between modules. Instead each module is compiled without looking at the source code in other modules. The result of compilation includes all the information required for later stages of compilation and information for compiling other modules.
122
123	This is a more popular pattern in more recent programming languages. It does have some advantages, such as reducing the number of times that a file will need to be processed, and it can cut out unneeded transitive information. Its downsides include adding dependences between modules and preventing any circular dependences between modules.
124
125	There is one other notable downside, and that is retrofitting this pattern on top of C. The problems with GCC precompiled headers and C++ modules give some indication of how tricky the situation is. The problem is the C-preprocessor: not only is it the tool by which modules are implemented, but headers also contain information for the preprocessor itself, such as macros. Macro definitions must also be applied to the text of source files and so must be preserved. This might be possible in cases with strict dependences from the included file, but there are more unusual uses where macros depend on their context (previous includes or a define before the include) in their definition and these would be almost impossible to translate over.
126
127	##########################################################################################
128
129	PAB \|
130	----'
131
132	Programming languages are divided into those embedded in an IDE, think Smalltalk and Lisp, Database, largely manipulating a symbol-table/abstract-symbol-tree, and those where the IDE is an external program largely manipulating program text.
133	Separate compilation in programming languages without an embedded IDE is the process of giving a compiler command a series of files that are read and processed as a whole.
134	The compiler output is placed in another set of files for execution loading or further processing.
135	Therefore, in languages without an embedded IDE, the translation unit is some combination of files, where files are defined by the underlying operating system.
136	I am unaware of a programming language where it is possible to say: within the following F files, only compile the following C components without compiling anything else.
137	I'm sure such a language exists somewhere, but I don't know of it.
138	For languages with non-embedded IDEs, there exist separate program configuration and management tools, like Make, Maven, etc.
139
140	Since C, and therefore CFA, is in the non-embedded IDE category, separate compilation is reading multiple translation units that are embedded in operating-system files.
141	In a file system where file-links can be embedded in data creating a tree, duplicate source code can be eliminated by generating a complex linking structure among the source files.
142	Without embedded file-links, dynamic embedding using #include/import is necessary to compose all the program components necessary for a compilation.
143
144	inlining?
145
146	I see two separate issues with respect to program structuring for controlling visibility and initializing a program.
147
148	Information hiding can occur locally and globally.
149
150	Local information hiding leverages lexical scoping to control visibility, such as public/private.
151
152	struct S {
153	private:
154	...
155	public:
156	...
157	}
158
159	In a non-OO language, like CFA, this might be accomplished with friendship.
160
161	struct S {
162	friend void foo( ... );
163	friend void bar( ... );
164	...
165	private:
166	... // friends only
167	public:
168	...
169	}
170
171
172	I'm assuming this might work with polymorphic routines, too, like friend templates.
173	I appreciate this is not 100% secure, as for C++ friendship.
174
175	Global information hiding is controlling imports/exports from a translation unit (file).
176	C++ namespace provides control of names but not information hiding (I think).
177	Modules provide name and information hiding.
178
179	module M using M1, M2 { // extra scope level => qualification
180	private:
181	...
182	public:
183	...
184	?( M & ){ ... } // module constructor
185	}
186
187	The "using" is defining module dependences, i.e., what include files have to be brought in.
188	The purpose of modules is to organize a collection of program components, like the link-list and string stuff, within the same translation unit, versus multiple separate TUs.
189	Hence, all of Mike's stuff is in the same translation unit, but nicely subdivided into multiple independent sections within that unit.
190	The module constructor runs any global initialization required to ensure its contents are in a sound state, like zeroing global state or running code.
191
192	At the linker level, an extra step is necessary to perform a transitive closure across module dependences, i.e., build a "using" graph to know what order to run the module constructors.
193	For example, the heap has to be initialized before any other code that uses it.
194
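The ordering step can be sketched as a depth-first topological sort over the "using" graph. This is only an illustration of the idea, not a description of what any linker does; the `Module` record, `visit` and `constructor_order` are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_DEPS 4

/* Hypothetical linker-side record for one module. */
typedef struct Module {
    const char *name;
    struct Module *uses[MAX_DEPS];  /* the "using" list, NULL terminated */
    int state;                      /* 0 = unvisited, 1 = in progress, 2 = ordered */
} Module;

/* Depth-first walk: append every module that m uses before m itself.
   Returns false if the "using" graph contains a cycle. */
static bool visit(Module *m, Module **order, size_t *count) {
    if (m->state == 2) return true;
    if (m->state == 1) return false;
    m->state = 1;
    for (size_t i = 0; i < MAX_DEPS && m->uses[i] != NULL; i++) {
        if (!visit(m->uses[i], order, count)) return false;
    }
    m->state = 2;
    order[(*count)++] = m;
    return true;
}

/* Fills order[] so that each module appears after the modules it uses. */
bool constructor_order(Module **modules, size_t n, Module **order, size_t *count) {
    *count = 0;
    for (size_t i = 0; i < n; i++) {
        if (!visit(modules[i], order, count)) return false;
    }
    return true;
}
```

Running the constructors in the resulting order means a module such as the heap, which others use, is initialized first. A cycle in the graph has no valid order and is reported as failure.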
195	=============================================================================
196
197	From: Andrew James Beach <ajbeach@uwaterloo.ca>
198	To: Peter Buhr <pabuhr@uwaterloo.ca>
199	Subject: Re: A Module Proposal
200	Date: Fri, 31 May 2024 20:32:49 +0000
201
202	Ada uses several constructs for what you described:
203
204	First the includes are handled by:
205	with MODULE_NAME;
206
207	(There is a separate "use" command that handles the name spacing part.)
208
209	In the header file (.ads) you declare:
210	package NAME is BODY end NAME;
211	(NAME is the possibly qualified name of this module, BODY all the contents of the module. Which I think is everything except the with/use commands and whatever comments you would like.)
212
213	In the source file (.adb) you declare:
214	package body NAME is BODY end NAME;
215	(Same rules on NAME and BODY.)
216
217	Of course I say same rules for BODY, but obviously it isn't quite the same. You have something like the declaration / definition divide C uses about what goes where. You do seem to have to repeat some information as well.
218
219	Anyways, I did some double checks, but mostly this is just me rattling off what I remember from earlier investigation of Ada.
220
221	Andrew
222	________________________________
223	From: Peter A. Buhr <pabuhr@uwaterloo.ca>
224	Sent: May 31, 2024 4:13 PM
225	To: Andrew James Beach <ajbeach@uwaterloo.ca>
226	Subject: Re: A Module Proposal
227
228	For the section on "file-links can be embedded in data creating a tree", I
229	don't know what that means.
230
231	Think of a file system like a database, where a table can have data and links to
232	other tables. For a program you might have.
233
234	for ( link to shared expression ) { link to shared for body }
235
236	In smalltalk, I believe they have this kind of structure.
237
238	class X {
239	link to some code in another file
240	code code code
241	link to some code in another file
242	}
243
244	Then we have two ways you can use modules in a language: Visibility and
245	initialization. I did say a few things about the first, but nothing on the
246	second.
247
248	Agreed.
249
250	Is there any interaction between modules and local information hiding? None seem to be called out.
251
252	It was just an outline. I need to look at Ada packages to get more details.
253
254	Now you have an example where you declare a module syntax:
255	What section of code does this wrap? Does this go in a header, a source file?
256	Is the module also a namespace?
257	Does the using clause actually trigger #include? How does it interact with #include directives?
258	Looking at the constructor: is the module a (real) type? If so what properties does it have?
259
260	Let's see what Ada does with packages. It has to be VERY similar to what CFA
261	needs.
262
263	How does this effect organization across translation units? You say it would
264	put Mike's work into one translation unit, how does it do that and what is
265	the gain there?
266
267	module link-lists {
268	link-list stuff
269	}
270	module arrays {
271	array stuff
272	}
273	module strings {
274	string stuff
275	}
276
277	It is possible to import arrays and not other modules, so the compiler
278	selectively reads the above translation unit for the modules it is looking for
279	and does not have to parse modules it does not need. The code is nicely grouped
280	and named.
281
282	"At the linker level", does this mean we also have to rewrite/wrap the compiler?
283
284	The linker has a whole language that allows you to write complex instructions on
285	what it is supposed to do. I don't know how powerful the link language is.
286
287	https://ftp.gnu.org/old-gnu/Manuals/ld-2.9.1/html_chapter/ld_3.html
288
289
290
291	From: Andrew James Beach <ajbeach@uwaterloo.ca>
292	To: Peter Buhr <pabuhr@uwaterloo.ca>
293	Subject: Re: A Module Proposal
294	Date: Sun, 2 Jun 2024 19:53:56 +0000
295
296	OK, down the list:
297
298	> Why does CFA have include files? Why don't we use the Smalltalk for modules?
299
300	It feels like the IDE is just the surface level of something else. After all, some people already develop Cforall embedded in an IDE (I think, it looks like that is what they are doing when they screen share). Maybe I'm misreading the situation, but it feels like we are talking about the amount of non-code configuration and/or a compile-time vs. runtime divide.
301
302	> Think of a file system like a database, where a table can have data and links to other tables. For a program you might have.
303
304	Do you mean includes and other uses of file paths?
305
306	> In smalltalk, I believe they have this kind of structure.
307
308	What kind of Smalltalk are we talking about? (Old Smalltalk as OS or something like the relatively modern GNU Smalltalk?)
309
310	> Let's see what Ada does with packages. It has to be VERY similar to what CFA needs.
311
312	I will not go over the whole thing again but Ada seems to have four constructs for this:
313	with NAME; imports a qualified module.
314	use NAME; can be used to rename a bunch of qualified names.
315	package NAME is BODY end NAME; declares a module (package) interface.
316	package body NAME is BODY end NAME; provides the implementation for the above.
317
318	> It is possible to import arrays and not other modules, so the compiler selectively reads the above translation unit for the modules it is looking for and does not have to parse modules it does not need. The code is nicely grouped and named.
319
320	Can't we already do that by putting them in separate files? Also, how do we get away with parsing only part of a file?
321
322	Andrew
323	________________________________
324	From: Andrew James Beach <ajbeach@uwaterloo.ca>
325	Sent: May 31, 2024 2:57 PM
326	To: Peter Buhr <pabuhr@uwaterloo.ca>
327	Subject: A Module Proposal
328
329	I am working on folding the two sections of the proposal and I have some questions.
330
331	The paragraph on IDE based languages. What is it for? C and Cforall are not that type of language and you never bring it up again as a comparison.
332
333	For the section on "file-links can be embedded in data creating a tree", I don't know what that means. For a bit I thought you meant includes, but you talk about those separately. Maybe module names using with import statements. Could you go into more detail?
334
335	Then we have two ways you can use modules in a language: Visibility and initialization. I did say a few things about the first, but nothing on the second.
336
337	Is there any interaction between modules and local information hiding? None seem to be called out.
338
339	Now you have an example where you declare a module syntax:
340	What section of code does this wrap? Does this go in a header, a source file?
341	Is the module also a namespace?
342	Does the using clause actually trigger #include? How does it interact with #include directives?
343	Looking at the constructor: is the module a (real) type? If so what properties does it have?
344
345	How does this effect organization across translation units? You say it would put Mike's work into one translation unit, how does it do that and what is the gain there?
346
347	"At the linker level", does this mean we also have to rewrite/wrap the compiler?
348
349
350
351	From: Michael Leslie Brooks <mlbrooks@uwaterloo.ca>
352	To: Peter Buhr <pabuhr@uwaterloo.ca>,
353	Andrew James Beach
354	<ajbeach@uwaterloo.ca>,
355	Fangren Yu <f37yu@uwaterloo.ca>, Jiada Liang
356	<j82liang@uwaterloo.ca>
357	Subject: Modules
358	Date: Wed, 26 Jun 2024 20:25:23 +0000
359
360	I wrote down some of what was said during our call with Bryan today...
361
362	Peter's modules' intro
363
364	fine - like cpp public-private
365	med - when bits of several sources mash into a translation unit
366	coarse - the translation unit
367
368
369	Bryan's remarks
370
371	Often a library will have
372	external headers - what others include
373	internal headers - what all the library's units need to know
374
375	A points to B; B can never get outside the library: opportunity for object
376	inlining
377
378	Problem with pimpl pattern is after you do it, the compiler can't see that this
379	is what you're doing, it only sees it's a another plain old object. If it
380	could benefit from a pragma-pimpl, be assured that the impl part can't leak
381	out, then it could inline the impl.
382
383	For a Friday discussion group, the team would be interested in an improvement
384	in what C++ can do.
385
386
