Opened 3 years ago

Last modified 7 months ago

#248 new enhancement

Lazily generate prelude functions on demande.

Reported by: Thierry Delisle Owned by:
Priority: major Component: prelude
Version: 1.0 Keywords:
Cc:

Description

Prelude functions are currenctly generated as regular C, except when used through assertions. In that case, the assertions require a function pointer so we have a function definition for each prelude function in libcfa.

This is a performance problem since it stops inlining of polymorphic function using basic types.

A solution to this would be to generate the functions for each source files as static inline.

However, doing this naively doesn't work because the definition of prelude functions is self-referential. The solution is to only insert the bodies of the needed prelude function on demande before code generation.

Change History (2)

comment:1 Changed 3 years ago by Thierry Delisle

Here is minimal example where this is a problem:

forall(T | { T ?+=?(T&, T); })
{
	struct wrap {
		T val;
	};

	static inline wrap(T) ?+?(wrap(T) lhs, const wrap(T) & rhs) {
		lhs.val += rhs.val;
		return lhs;
	}
}

wrap(int) a, b;
int foo() {
	return (a + b).val;
}

Looking at the output assembly for this example, compiled with -O3, we see calls to _X12_constructorFv_ii_intrinsic___1, _X19_operator_addassignFi_ii_intrinsic___1 and _X11_destructorFv_i_intrinsic___1. All three of these functions are trivial and we want them to be inlined.

comment:2 Changed 7 months ago by ajbeach

This issue came up again recently. Peter and I spent a bit more time digging into why this happens. I thought I would share some of that additional information, and why the simple (although not always cheap) fix is usually just to write a non-polymorphic (or less polymorphic) overload of the function, a specialization.

The for example, take the current implementation of min in the standard library:

static inline __attribute__((always_inline)) {
    forall( T | { int ?<?( T, T ); } )
    T min( T v1, T v2 ) { return v1 < v2 ? v1 : v2; }
    // ...
    // Specializations
    char min( char v1, char v2 ) { return v1 < v2 ? v1 : v2; }
    int min( int v1, int v2 ) { return v1 < v2 ? v1 : v2; }
    // ... including more specializations
}

(There is also code to handle 3 or more arguments, but that doesn't matter for the example.)

Now all of these mins will be inlined by GCC after we have done our code generation. However, min'T (the polymorphic min) will be much larger than the others. That is because when we generate, say min'int (the min for int), the intrinsic functions, like the constructor, destructor and ?<? are generated as the equivalent C code; something in the range of an operator to nothing. This is effectively inlining the intrinsic function call, and the code for min'int is very small. Then GCC comes along, inlines min'int into its caller and the entire thing almost disappears.

However, function pointers which are used to communicate assertions, cannot be unlined. So min'T is generated with a long list of parameters. GCC still manages to optimizes this but it does not know that the intrinsic function calls represent primitive C operations. So it calls each one manually, considering how small these functions are the function call overhead is very significant and you get the performance hit.

In short specializations allow for a lot more inlining and that improves performance.

Generating the prelude functions (when there are non-inlined uses of them) may allow GCC to also inline the intrinsic functions, this gets rid of the ordering problem. We just have to mark that the bodies of the functions are definitely C primitives and try and make sure any unused code is discarded.

Note: See TracTickets for help on using tickets.