## Light-weight Closures for Cforall ##

A core capability of the Cforall type system is the ability to use 
monomorphic specializations of polymorphic functions seamlessly and 
invisibly to the user programmer, primarily in type assertions:

    forall(otype T | { T ?+?(T, T); })
    T double( T x ) { return x + x; }

    forall(otype R | { R double(R); })
    R quadruple( R y ) { return double( double( y ) ); }

    void fred() {
        float magic = quadruple( 10.5f );
    }

In the example above, `R` and `T` are both bound to `float`, and the 
`double` type assertion on `quadruple` is satisfied by monomorphizing the 
polymorphic `double` function to `float` (using the builtin `float` addition 
operator). The existing implementation uses GCC nested functions to 
implement this monomorphization, as in the following (much simplified) 
generated code:

    void double(size_t _sizeof_T,  
                void (*_assign_T)(void*, void*),
                void (*_add_T)(void*, void*, void*),
                void *_rtn, 
                void *x ) {
        _add_T( _rtn, x, x );
    }

    void quadruple(size_t _sizeof_R,  
                   void (*_assign_R)(void*, void*),
                   void (*_double_R)(void*, void*), 
                   void *_rtn,
                   void *y ) {
        void *_tmp0 = alloca(_sizeof_R);
        _double_R( _rtn, (_double_R( _tmp0, y ), _tmp0) );
     }

     void fred() {
         // nested thunk to adapt double() to _double_R()
         void _thunk0( void *_rtn, void *x ) {
             double( sizeof(float), _builtin_assign_float, 
                     _builtin_add_float,
                     _rtn, x );
         }

         float magic;
         float _tmp1 = 10.5f;
         quadruple( sizeof(float), _builtin_assign_float, _thunk0,
                    &magic, &_tmp1 );
     }

Now, in the example above, `_thunk0` could be hoisted to static scope, as 
`sizeof(float)`, `_builtin_assign_float`, and `builtin_add_float` all exist 
in static scope. In general, however, these parameters which are used to 
monomorphize polymorphic functions could be local to the calling scope (e.g. 
if `fred()` was a polymorphic function itself, or had a local overload of 
`float` addition).

The crux of the issue is that these monomorphization thunks need a lexical 
link to their creation context, but C's standard calling convention provides 
no way to include such a lexical link. GCC fixes this for nested functions 
by placing an executable trampoline on the stack to modify the calling 
convention; this trampoline has the standard calling convention, and calls 
the nested function after setting up the lexical link. This prevents the 
stack from being marked as non-executable, opening a variety of potential 
security vulnerabilities. More to the point of this proposal, it also means 
that the thunk exists on the stack, and may go out of scope before it is 
used if it is copied somewhere else (for instance, to the root of a new 
stack in a coroutine).

The standard solution to this sort of problem is a *closure*; this proposal 
describes how to integrate a restricted sort of closure into Cforall that 
would be sufficiently powerful to monomorphize polymorphic functions. 

Monomorphization parameters in the current implementation fall into four 
categories:
  1. Size/alignment of types; a single unsigned integer
  2. Field offsets for generic types; a fixed length array of unsigned 
  3. Functions used to satisfy type assertions: a single function pointer
  4. Variables used to satisfy type assertions: a single void pointer

The gist of this proposal is to develop a copyable closure object (similar  
in concept to `std::function` in C++) that can encapsulate a function 
pointer and an arbitrary list of these monomorphization parameters and 
provide a function call operator that passes the appropriate parameters to 
the underlying function. In (very-)pseudo-Cforall, it might look something 
like the following:

    forall(ttype Rtns, ttype Args) struct Fn {
        [assertion...] closed;
        forall(closed) Rtns (*f)(Args...);
    };

    forall(ttype Rtns, ttype Args, [assertion...] Closed)
    void ?{}( Fn(Rtns, Args) *this, 
              forall(Closed) Rtns (*f)(Args), Closed closed ) {
        this->closed = closed;
        this->f = f;
    }
    // ^ function pointers convert implicitly to Fn, as they have an empty 
    // assertion list here

    forall(ttype Rtns, ttype Args)
    Rtns ?() ( Fn(Rtns, Args) fn, Args args ) {
        return fn.f( fn.closed, args );
    }

Using this `Fn` closure internally (even if it was never exposed to user 
programmers), the top example would codegen something like this, with `Fn` 
substituted for the implicit function pointers:

    void double(size_t _sizeof_T,  
                Fn(void, [void*, void*]) _assign_T,
                Fn(void, [void*, void*, void*]) _add_T,
                void *_rtn, 
                void *x ) {
        _add_T( _rtn, x, x );
    }

    void quadruple(size_t _sizeof_R,  
                   Fn(void, [void*, void*]) _assign_R,
                   Fn(void, [void*, void*]) _double_R, 
                   void *_rtn,
                   void *y ) {
        void *_tmp0 = alloca(_sizeof_R);
        _double_R( _rtn, (_double_R( _tmp0, y ), _tmp0) );
     }

     void fred() {
         // closure wrapper for static function
         Fn(void, [void*, void*]) _thunk0 = { _builtin_assign_float };
         // nested closure to adapt double() to _double_R()
         Fn(void, [void*, void*]) _thunk1 = { double, 
             [sizeof(float), _builtin_assign_float, _builtin_add_float] };
            
         float magic;
         float _tmp1 = 10.5f;
         quadruple( sizeof(float), _thunk0, _thunk1,
                    &magic, &_tmp1 );
     }

The main challenge with this approach is that the `Fn` closure is 
(necessarily) variable in size, as it can close over an arbitrary (but fixed 
at construction time) number of parameters. This will make memory management 
for it somewhat challenging, and writing the code in the translator to 
implement the function call operator passing a variable number of type 
assertions should also be non-trivial, but tractable.
