Opened 17 months ago
#278 new defect
Performance issue on ARM64 due to __aarch64_sync_cache_range calls
Reported by: | caparson | Owned by: | |
---|---|---|---|
Priority: | major | Component: | cfa-cc |
Version: | 1.0 | Keywords: | |
Cc: |
Description
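On ARM64, calls to __aarch64_sync_cache_range are emitted when a pointer to a nested function is stored. This is likely because GCC implements such pointers with trampolines, small code stubs written to the stack at run time. That is effectively self-modifying code, so the instruction cache has to be synchronized before the stub can run.
The following minimal repro generates a call to __aarch64_sync_cache_range: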
    int main() {
        inline void foo() {}
        void (*f)(void) = foo;
    }
Calls to __aarch64_sync_cache_range are very expensive: they perform data- and instruction-cache maintenance followed by barrier instructions, ending with an ISB (Instruction Synchronization Barrier).
ISB is used whenever instruction fetches must explicitly take place after a certain point in the program, for example after memory map updates or after writing code to be executed. (In practice, this means "throw away any prefetched instructions at this point".)
Currently we store pointers to nested functions in 2 cases: polymorphic adapters and destructors of polymorphic fields.
We believe that the adapter issue can be fixed by hoisting adapters to global scope since they do not need a closure.
The dtor issue may be fixable by changing the calling convention of destructors so that no function pointer is stored:
    __attribute__ ((cleanup(__destroy_Destructor))) struct __Destructor __memberDtor0 = { ((void *)_X1tS1A_Y1T__1), ((void (*)(void *__param_0))__cleanup_dtor20) }; // old way
    __attribute__ ((cleanup(__cleanup_dtor20))) void * nd = ((void *)_X1tS1A_Y1T__1); // fix
Attached is a program whose generated output exhibits both issues, along with a trimmed-down version of the output C code with the fixes implemented inline.
Attachments (2)
poly_dtor.cfa: Minimal reproduction of the issue
poly_dtor_fixed.c: Output C from the repro that has been modified with a potential fix for both issues
Change History (2)
Changed 17 months ago by
Attachment: poly_dtor.cfa added
Changed 17 months ago by
Attachment: poly_dtor_fixed.c added