[dac16a0] | 1 | \chapter{Performance} |
---|
| 2 | \label{c:performance} |
---|
| 3 | |
---|
[c9f9d4f] | 4 | Performance is of secondary importance for most of this project. |
---|
[262deda0] | 5 | Instead, the focus was to get the features working. The only performance |
---|
[c9f9d4f] | 6 | requirement is to ensure the tests for correctness run in a reasonable |
---|
[262deda0] | 7 | amount of time. Hence, a few basic performance tests were performed to |
---|
| 8 | check this requirement. |
---|
[b51e389c] | 9 | |
---|
| 10 | \section{Test Set-Up} |
---|
[c9f9d4f] | 11 | Tests were run in \CFA, C++, Java and Python. |
---|
[9698690] | 12 | In addition there are two sets of tests for \CFA, |
---|
[c9f9d4f] | 13 | one for termination and one for resumption exceptions. |
---|
[dac16a0] | 14 | |
---|
| 15 | C++ is the most comparable language because both it and \CFA use the same |
---|
| 16 | framework, libunwind. |
---|
[262deda0] | 17 | In fact, the comparison is almost entirely a quality of implementation. |
---|
| 18 | Specifically, \CFA's EHM has had significantly less time to be optimized and |
---|
[dac16a0] | 19 | does not generate its own assembly. It does have a slight advantage in that |
---|
[c9f9d4f] | 20 | there are some features it handles directly instead of through utility functions, |
---|
[262deda0] | 21 | but otherwise \Cpp should have a significant advantage. |
---|
[dac16a0] | 22 | |
---|
[262deda0] | 23 | Java is a popular language with similar termination semantics, but |
---|
| 24 | it is implemented in a very different environment, a virtual machine with |
---|
[029cbc0] | 25 | garbage collection. |
---|
[c9f9d4f] | 26 | It also implements the @finally@ clause on @try@ blocks allowing for a direct |
---|
[029cbc0] | 27 | feature-to-feature comparison. |
---|
[262deda0] | 28 | As with \Cpp, Java's implementation is mature, optimized |
---|
[c9f9d4f] | 29 | and has extra features. |
---|
[9698690] | 30 | |
---|
[262deda0] | 31 | Python is used as an alternative comparison because of the \CFA EHM's |
---|
[c9f9d4f] | 32 | current performance goals, which is not to be prohibitively slow while the |
---|
[9698690] | 33 | features are designed and examined. Python has similar performance goals for |
---|
| 34 | creating quick scripts and its wide use suggests it has achieved those goals. |
---|
| 35 | |
---|
[c9f9d4f] | 36 | Unfortunately, there are no notable modern programming languages with |
---|
| 37 | resumption exceptions. Even the older programming languages with resumption |
---|
| 38 | seem to be notable only for having resumption. |
---|
[262deda0] | 39 | So instead, resumption is compared to its simulation in other programming |
---|
| 40 | languages using fixup functions that are explicitly passed for correction or |
---|
| 41 | logging purposes. |
---|
| 42 | % So instead, resumption is compared to a less similar but much more familiar |
---|
| 43 | %feature, termination exceptions. |
---|
[dac16a0] | 44 | |
---|
[c9f9d4f] | 45 | All tests are run inside a main loop that repeatedly performs a test. |
---|
| 46 | This approach avoids start-up or tear-down time from |
---|
[dac16a0] | 47 | affecting the timing results. |
---|
[262deda0] | 48 | Each test is run a N times (configurable from the command line). |
---|
| 49 | The Java tests runs the main loop 1000 times before |
---|
| 50 | beginning the actual test to ``warm-up" the JVM. |
---|
[9698690] | 51 | |
---|
| 52 | Timing is done internally, with time measured immediately before and |
---|
[c9f9d4f] | 53 | after the test loop. The difference is calculated and printed. |
---|
[9698690] | 54 | The loop structure and internal timing means it is impossible to test |
---|
| 55 | unhandled exceptions in \Cpp and Java as that would cause the process to |
---|
| 56 | terminate. |
---|
| 57 | Luckily, performance on the ``give-up and kill the process" path is not |
---|
| 58 | critical. |
---|
[dac16a0] | 59 | |
---|
[c9f9d4f] | 60 | The exceptions used in these tests are always based off of |
---|
| 61 | a base exception. This requirement minimizes performance differences based |
---|
| 62 | on the object model used to represent the exception. |
---|
[9698690] | 63 | |
---|
[c9f9d4f] | 64 | All tests are designed to be as minimal as possible, while still preventing |
---|
| 65 | excessive optimizations. |
---|
[9698690] | 66 | For example, empty inline assembly blocks are used in \CFA and \Cpp to |
---|
| 67 | prevent excessive optimizations while adding no actual work. |
---|
[c9f9d4f] | 68 | Each test was run eleven times. The top three and bottom three results were |
---|
| 69 | discarded and the remaining five values are averaged. |
---|
| 70 | |
---|
| 71 | The tests are compiled with gcc-10 for \CFA and g++-10 for \Cpp. Java is |
---|
[262deda0] | 72 | compiled with version 11.0.11. Python with version 3.8. The tests were run on: |
---|
[c9f9d4f] | 73 | \begin{itemize}[nosep] |
---|
| 74 | \item |
---|
| 75 | ARM 2280 Kunpeng 920 48-core 2$\times$socket \lstinline{@} 2.6 GHz running Linux v5.11.0-25 |
---|
| 76 | \item |
---|
| 77 | AMD 6380 Abu Dhabi 16-core 4$\times$socket \lstinline{@} 2.5 GHz running Linux v5.11.0-25 |
---|
| 78 | \end{itemize} |
---|
[262deda0] | 79 | Two kinds of hardware architecture allows discriminating any implementation and |
---|
| 80 | architectural effects. |
---|
| 81 | |
---|
[dac16a0] | 82 | |
---|
[9698690] | 83 | % We don't use catch-alls but if we did: |
---|
| 84 | % Catch-alls are done by catching the root exception type (not using \Cpp's |
---|
| 85 | % \code{C++}{catch(...)}). |
---|
[dac16a0] | 86 | |
---|
[b51e389c] | 87 | \section{Tests} |
---|
[029cbc0] | 88 | The following tests were selected to test the performance of different |
---|
| 89 | components of the exception system. |
---|
[c9f9d4f] | 90 | They should provide a guide as to where the EHM's costs are found. |
---|
[029cbc0] | 91 | |
---|
[ea593a3] | 92 | \paragraph{Raise and Handle} |
---|
[262deda0] | 93 | This group measures the cost of a try statement when exceptions are raised and |
---|
| 94 | the stack is unwound (termination) or not unwound (resumption). Each test has |
---|
| 95 | has a repeating function like the following |
---|
| 96 | \begin{lstlisting}[language=CFA,{moredelim=**[is][\color{red}]{@}{@}}] |
---|
[c9f9d4f] | 97 | void unwind_empty(unsigned int frames) { |
---|
| 98 | if (frames) { |
---|
[262deda0] | 99 | @unwind_empty(frames - 1);@ // AUGMENTED IN OTHER EXPERIMENTS |
---|
[c9f9d4f] | 100 | } else throw (empty_exception){&empty_vt}; |
---|
| 101 | } |
---|
[262deda0] | 102 | \end{lstlisting} |
---|
| 103 | which is called N times, where each call recurses to a depth of R (configurable from the command line), an |
---|
[c9f9d4f] | 104 | exception is raised, the stack is a unwound, and the exception caught. |
---|
[9698690] | 105 | \begin{itemize}[nosep] |
---|
[c9f9d4f] | 106 | \item Empty: |
---|
[262deda0] | 107 | For termination, this test measures the cost of raising (stack walking) an |
---|
| 108 | exception through empty stack frames from the bottom of the recursion to an |
---|
| 109 | empty handler, and unwinding the stack. (see above code) |
---|
| 110 | |
---|
| 111 | \medskip |
---|
| 112 | For resumption, this test measures the same raising cost but does not unwind |
---|
| 113 | the stack. For languages without resumption, a fixup function is to the bottom |
---|
| 114 | of the recursion and called to simulate a fixup operation at that point. |
---|
| 115 | \begin{cfa} |
---|
| 116 | void nounwind_fixup(unsigned int frames, void (*raised_rtn)(int &)) { |
---|
| 117 | if (frames) { |
---|
| 118 | nounwind_fixup(frames - 1, raised_rtn); |
---|
| 119 | } else { |
---|
| 120 | int fixup = 17; |
---|
| 121 | raised_rtn(fixup); |
---|
| 122 | } |
---|
| 123 | } |
---|
| 124 | \end{cfa} |
---|
| 125 | where the passed fixup function is: |
---|
| 126 | \begin{cfa} |
---|
| 127 | void raised(int & fixup) { |
---|
| 128 | fixup = 42; |
---|
| 129 | } |
---|
| 130 | \end{cfa} |
---|
| 131 | For comparison, a \CFA version passing a function is also included. |
---|
[ea593a3] | 132 | \item Destructor: |
---|
[262deda0] | 133 | This test measures the cost of raising an exception through non-empty frames, |
---|
| 134 | where each frame has an object requiring destruction, from the bottom of the |
---|
| 135 | recursion to an empty handler. Hence, there are N destructor calls during |
---|
| 136 | unwinding. |
---|
[c9f9d4f] | 137 | |
---|
[262deda0] | 138 | \medskip |
---|
| 139 | This test is not meaningful for resumption because the stack is only unwound as |
---|
| 140 | the recursion returns. |
---|
[c9f9d4f] | 141 | \begin{cfa} |
---|
| 142 | WithDestructor object; |
---|
[262deda0] | 143 | unwind_destructor(frames - 1); |
---|
[c9f9d4f] | 144 | \end{cfa} |
---|
[ea593a3] | 145 | \item Finally: |
---|
[c9f9d4f] | 146 | This test measures the cost of establishing a try block with an empty finally |
---|
[262deda0] | 147 | clause on the front side of the recursion and running the empty finally clauses |
---|
| 148 | during stack unwinding from the bottom of the recursion to an empty handler. |
---|
[c9f9d4f] | 149 | \begin{cfa} |
---|
| 150 | try { |
---|
| 151 | unwind_finally(frames - 1); |
---|
| 152 | } finally {} |
---|
| 153 | \end{cfa} |
---|
[262deda0] | 154 | |
---|
| 155 | \medskip |
---|
| 156 | This test is not meaningful for resumption because the stack is only unwound as |
---|
| 157 | the recursion returns. |
---|
[ea593a3] | 158 | \item Other Handler: |
---|
[262deda0] | 159 | For termination, this test is like the finally test but the try block has a |
---|
| 160 | catch clause for an exception that is not raised, so catch matching is executed |
---|
| 161 | during stack unwinding but the match never successes until the catch at the |
---|
| 162 | bottom of the recursion. |
---|
[c9f9d4f] | 163 | \begin{cfa} |
---|
| 164 | try { |
---|
| 165 | unwind_other(frames - 1); |
---|
| 166 | } catch (not_raised_exception *) {} |
---|
| 167 | \end{cfa} |
---|
[262deda0] | 168 | |
---|
| 169 | \medskip |
---|
| 170 | For resumption, this test measures the same raising cost but does not unwind |
---|
| 171 | the stack. For languages without resumption, the same fixup function is passed |
---|
| 172 | and called. |
---|
[ea593a3] | 173 | \end{itemize} |
---|
| 174 | |
---|
[262deda0] | 175 | \paragraph{Try/Handle/Finally Statement} |
---|
| 176 | This group measures just the cost of executing a try statement so |
---|
[c9f9d4f] | 177 | \emph{there is no stack unwinding}. Hence, the program main loops N times |
---|
| 178 | around: |
---|
| 179 | \begin{cfa} |
---|
| 180 | try { |
---|
| 181 | } catch (not_raised_exception *) {} |
---|
| 182 | \end{cfa} |
---|
[9698690] | 183 | \begin{itemize}[nosep] |
---|
[ea593a3] | 184 | \item Handler: |
---|
[262deda0] | 185 | The try statement has a handler (catch/resume). |
---|
[ea593a3] | 186 | \item Finally: |
---|
[262deda0] | 187 | The try statement has a finally clause. |
---|
[ea593a3] | 188 | \end{itemize} |
---|
| 189 | |
---|
| 190 | \paragraph{Conditional Matching} |
---|
[262deda0] | 191 | This group measures the cost of conditional matching. |
---|
[ea593a3] | 192 | Only \CFA implements the language level conditional match, |
---|
[262deda0] | 193 | the other languages mimic with an ``unconditional" match (it still |
---|
| 194 | checks the exception's type) and conditional re-raise if it is not suppose |
---|
[9698690] | 195 | to handle that exception. |
---|
[c9f9d4f] | 196 | \begin{center} |
---|
| 197 | \begin{tabular}{ll} |
---|
| 198 | \multicolumn{1}{c}{\CFA} & \multicolumn{1}{c}{\Cpp, Java, Python} \\ |
---|
| 199 | \begin{cfa} |
---|
| 200 | try { |
---|
| 201 | throw_exception(); |
---|
| 202 | } catch (empty_exception * exc; |
---|
| 203 | should_catch) { |
---|
| 204 | } |
---|
| 205 | \end{cfa} |
---|
| 206 | & |
---|
| 207 | \begin{cfa} |
---|
| 208 | try { |
---|
| 209 | throw_exception(); |
---|
| 210 | } catch (EmptyException & exc) { |
---|
| 211 | if (!should_catch) throw; |
---|
| 212 | } |
---|
| 213 | \end{cfa} |
---|
| 214 | \end{tabular} |
---|
| 215 | \end{center} |
---|
[9698690] | 216 | \begin{itemize}[nosep] |
---|
| 217 | \item Match All: |
---|
[ea593a3] | 218 | The condition is always true. (Always matches or never re-raises.) |
---|
[9698690] | 219 | \item Match None: |
---|
[ea593a3] | 220 | The condition is always false. (Never matches or always re-raises.) |
---|
| 221 | \end{itemize} |
---|
[dac16a0] | 222 | |
---|
[262deda0] | 223 | \medskip |
---|
| 224 | \noindent |
---|
| 225 | All omitted test code for other languages is functionally identical to the \CFA |
---|
| 226 | tests or simulated, and available online~\cite{CforallExceptionBenchmarks}. |
---|
| 227 | |
---|
[dac16a0] | 228 | %\section{Cost in Size} |
---|
| 229 | %Using exceptions also has a cost in the size of the executable. |
---|
| 230 | %Although it is sometimes ignored |
---|
| 231 | % |
---|
| 232 | %There is a size cost to defining a personality function but the later problem |
---|
| 233 | %is the LSDA which will be generated for every function. |
---|
| 234 | % |
---|
| 235 | %(I haven't actually figured out how to compare this, probably using something |
---|
| 236 | %related to -fexceptions.) |
---|
[029cbc0] | 237 | |
---|
[9698690] | 238 | \section{Results} |
---|
[262deda0] | 239 | One result not directly related to \CFA but important to keep in |
---|
| 240 | mind is that, for exceptions, the standard intuition about which languages |
---|
| 241 | should go faster often does not hold. For example, there are a few cases where Python out-performs |
---|
| 242 | \CFA, \Cpp and Java. The most likely explanation is that, since exceptions are |
---|
| 243 | rarely considered to be the common case, the more optimized languages |
---|
| 244 | make that case expense. In addition, languages with high-level |
---|
| 245 | representations have a much easier time scanning the stack as there is less |
---|
| 246 | to decode. |
---|
[c9f9d4f] | 247 | |
---|
[262deda0] | 248 | Tables~\ref{t:PerformanceTermination} and~\ref{t:PerformanceResumption} show |
---|
| 249 | the test results for termination and resumption, respectively. In cases where |
---|
| 250 | a feature is not supported by a language, the test is skipped for that language |
---|
| 251 | (marked N/A). For some Java experiments it was impossible to measure certain |
---|
| 252 | effects because the JIT corrupted the test (marked N/C). No workaround was |
---|
| 253 | possible~\cite{Dice21}. To get experiments in the range of 1--100 seconds, the |
---|
| 254 | number of times an experiment is run (N) is varied (N is marked beside each |
---|
| 255 | experiment, e.g., 1M $\Rightarrow$ 1 million test iterations). |
---|
| 256 | |
---|
| 257 | An anomaly exists with gcc nested functions used as thunks for implementing |
---|
| 258 | much of the \CFA EHM. If a nested-function closure captures local variables in |
---|
| 259 | its lexical scope, performance dropped by a factor of 10. Specifically, in try |
---|
| 260 | statements of the form: |
---|
| 261 | \begin{cfa} |
---|
| 262 | try { |
---|
| 263 | unwind_other(frames - 1); |
---|
| 264 | } catch (not_raised_exception *) {} |
---|
| 265 | \end{cfa} |
---|
| 266 | the try block is hoisted into a nested function and the variable @frames@ is |
---|
| 267 | the local parameter to the recursive function, which triggers the anomaly. The |
---|
| 268 | workaround is to remove the recursion parameter and make it a global variable |
---|
| 269 | that is explicitly decremented outside of the try block (marked with a ``*''): |
---|
| 270 | \begin{cfa} |
---|
| 271 | frames -= 1; |
---|
| 272 | try { |
---|
| 273 | unwind_other(); |
---|
| 274 | } catch (not_raised_exception *) {} |
---|
| 275 | \end{cfa} |
---|
| 276 | To make comparisons fair, a dummy parameter is added and the dummy value passed |
---|
| 277 | in the recursion. Note, nested functions in gcc are rarely used (if not |
---|
| 278 | completely unknown) and must follow the C calling convention, unlike \Cpp |
---|
| 279 | lambdas, so it is not surprising if there are performance issues efficiently |
---|
| 280 | capturing closures. |
---|
| 281 | |
---|
| 282 | % Similarly, if a test does not change between resumption |
---|
| 283 | % and termination in \CFA, then only one test is written and the result |
---|
| 284 | % was put into the termination column. |
---|
[9698690] | 285 | |
---|
[0b67a19] | 286 | % Raw Data: |
---|
| 287 | % run-algol-a.sat |
---|
| 288 | % --------------- |
---|
| 289 | % Raise Empty & 82687046678 & 291616256 & 3252824847 & 15422937623 & 14736271114 \\ |
---|
| 290 | % Raise D'tor & 219933199603 & 297897792 & 223602799362 & N/A & N/A \\ |
---|
| 291 | % Raise Finally & 219703078448 & 298391745 & N/A & ... & 18923060958 \\ |
---|
| 292 | % Raise Other & 296744104920 & 2854342084 & 112981255103 & 15475924808 & 21293137454 \\ |
---|
| 293 | % Cross Handler & 9256648 & 13518430 & 769328 & 3486252 & 31790804 \\ |
---|
| 294 | % Cross Finally & 769319 & N/A & N/A & 2272831 & 37491962 \\ |
---|
| 295 | % Match All & 3654278402 & 47518560 & 3218907794 & 1296748192 & 624071886 \\ |
---|
| 296 | % Match None & 4788861754 & 58418952 & 9458936430 & 1318065020 & 625200906 \\ |
---|
| 297 | % |
---|
| 298 | % run-algol-thr-c |
---|
| 299 | % --------------- |
---|
| 300 | % Raise Empty & 3757606400 & 36472972 & 3257803337 & 15439375452 & 14717808642 \\ |
---|
| 301 | % Raise D'tor & 64546302019 & 102148375 & 223648121635 & N/A & N/A \\ |
---|
| 302 | % Raise Finally & 64671359172 & 103285005 & N/A & 15442729458 & 18927008844 \\ |
---|
| 303 | % Raise Other & 294143497130 & 2630130385 & 112969055576 & 15448220154 & 21279953424 \\ |
---|
| 304 | % Cross Handler & 9646462 & 11955668 & 769328 & 3453707 & 31864074 \\ |
---|
| 305 | % Cross Finally & 773412 & N/A & N/A & 2253825 & 37266476 \\ |
---|
| 306 | % Match All & 3719462155 & 43294042 & 3223004977 & 1286054154 & 623887874 \\ |
---|
| 307 | % Match None & 4971630929 & 55311709 & 9481225467 & 1310251289 & 623752624 \\ |
---|
[5438e41] | 308 | % |
---|
| 309 | % run-algol-04-a |
---|
| 310 | % -------------- |
---|
| 311 | % Raise Empty & 0.0 & 0.0 & 3250260945 & 0.0 & 0.0 \\ |
---|
| 312 | % Raise D'tor & 0.0 & 0.0 & 29017675113 & N/A & N/A \\ |
---|
| 313 | % Raise Finally & 0.0 & 0.0 & N/A & 0.0 & 0.0 \\ |
---|
| 314 | % Raise Other & 0.0 & 0.0 & 24411823773 & 0.0 & 0.0 \\ |
---|
| 315 | % Cross Handler & 0.0 & 0.0 & 769334 & 0.0 & 0.0 \\ |
---|
| 316 | % Cross Finally & 0.0 & N/A & N/A & 0.0 & 0.0 \\ |
---|
| 317 | % Match All & 0.0 & 0.0 & 3254283504 & 0.0 & 0.0 \\ |
---|
| 318 | % Match None & 0.0 & 0.0 & 9476060146 & 0.0 & 0.0 \\ |
---|
| 319 | |
---|
[0b67a19] | 320 | % run-plg7a-a.sat |
---|
| 321 | % --------------- |
---|
| 322 | % Raise Empty & 57169011329 & 296612564 & 2788557155 & 17511466039 & 23324548496 \\ |
---|
| 323 | % Raise D'tor & 150599858014 & 318443709 & 149651693682 & N/A & N/A \\ |
---|
| 324 | % Raise Finally & 148223145000 & 373325807 & N/A & ... & 29074552998 \\ |
---|
| 325 | % Raise Other & 189463708732 & 3017109322 & 85819281694 & 17584295487 & 32602686679 \\ |
---|
| 326 | % Cross Handler & 8001654 & 13584858 & 1555995 & 6626775 & 41927358 \\ |
---|
| 327 | % Cross Finally & 1002473 & N/A & N/A & 4554344 & 51114381 \\ |
---|
| 328 | % Match All & 3162460860 & 37315018 & 2649464591 & 1523205769 & 742374509 \\ |
---|
| 329 | % Match None & 4054773797 & 47052659 & 7759229131 & 1555373654 & 744656403 \\ |
---|
| 330 | % |
---|
| 331 | % run-plg7a-thr-a |
---|
| 332 | % --------------- |
---|
| 333 | % Raise Empty & 3604235388 & 29829965 & 2786931833 & 17576506385 & 23352975105 \\ |
---|
| 334 | % Raise D'tor & 46552380948 & 178709605 & 149834207219 & N/A & N/A \\ |
---|
| 335 | % Raise Finally & 46265157775 & 177906320 & N/A & 17493045092 & 29170962959 \\ |
---|
| 336 | % Raise Other & 195659245764 & 2376968982 & 86070431924 & 17552979675 & 32501882918 \\ |
---|
| 337 | % Cross Handler & 397031776 & 12503552 & 1451225 & 6658628 & 42304965 \\ |
---|
| 338 | % Cross Finally & 1136746 & N/A & N/A & 4468799 & 46155817 \\ |
---|
| 339 | % Match All & 3189512499 & 39124453 & 2667795989 & 1525889031 & 733785613 \\ |
---|
| 340 | % Match None & 4094675477 & 48749857 & 7850618572 & 1566713577 & 733478963 \\ |
---|
[5438e41] | 341 | % |
---|
| 342 | % run-plg7a-04-a |
---|
| 343 | % -------------- |
---|
| 344 | % 0.0 are unfilled. |
---|
| 345 | % Raise Empty & 0.0 & 0.0 & 2770781479 & 0.0 & 0.0 \\ |
---|
| 346 | % Raise D'tor & 0.0 & 0.0 & 23530084907 & N/A & N/A \\ |
---|
| 347 | % Raise Finally & 0.0 & 0.0 & N/A & 0.0 & 0.0 \\ |
---|
| 348 | % Raise Other & 0.0 & 0.0 & 23816827982 & 0.0 & 0.0 \\ |
---|
| 349 | % Cross Handler & 0.0 & 0.0 & 1422188 & 0.0 & 0.0 \\ |
---|
| 350 | % Cross Finally & 0.0 & N/A & N/A & 0.0 & 0.0 \\ |
---|
| 351 | % Match All & 0.0 & 0.0 & 2671989778 & 0.0 & 0.0 \\ |
---|
| 352 | % Match None & 0.0 & 0.0 & 7829059869 & 0.0 & 0.0 \\ |
---|
[0b67a19] | 353 | |
---|
[262deda0] | 354 | \begin{table} |
---|
| 355 | \centering |
---|
| 356 | \caption{Performance Results Termination (sec)} |
---|
| 357 | \label{t:PerformanceTermination} |
---|
| 358 | \begin{tabular}{|r|*{2}{|r r r r|}} |
---|
[0b67a19] | 359 | \hline |
---|
[262deda0] | 360 | & \multicolumn{4}{c||}{AMD} & \multicolumn{4}{c|}{ARM} \\ |
---|
| 361 | \cline{2-9} |
---|
| 362 | N\hspace{8pt} & \multicolumn{1}{c}{\CFA} & \multicolumn{1}{c}{\Cpp} & \multicolumn{1}{c}{Java} & \multicolumn{1}{c||}{Python} & |
---|
| 363 | \multicolumn{1}{c}{\CFA} & \multicolumn{1}{c}{\Cpp} & \multicolumn{1}{c}{Java} & \multicolumn{1}{c|}{Python} \\ |
---|
| 364 | \hline |
---|
| 365 | Throw Empty (1M) & 3.4 & 2.8 & 18.3 & 23.4 & 3.7 & 3.2 & 15.5 & 14.8 \\ |
---|
| 366 | Throw D'tor (1M) & 48.4 & 23.6 & N/A & N/A & 64.2 & 29.0 & N/A & N/A \\ |
---|
| 367 | Throw Finally (1M) & 3.4* & N/A & 17.9 & 29.0 & 4.1* & N/A & 15.6 & 19.0 \\ |
---|
| 368 | Throw Other (1M) & 3.6* & 23.2 & 18.2 & 32.7 & 4.0* & 24.5 & 15.5 & 21.4 \\ |
---|
| 369 | Try/Catch (100M) & 6.0 & 0.9 & N/C & 37.4 & 10.0 & 0.8 & N/C & 32.2 \\ |
---|
| 370 | Try/Finally (100M) & 0.9 & N/A & N/C & 44.1 & 0.8 & N/A & N/C & 37.3 \\ |
---|
| 371 | Match All (10M) & 32.9 & 20.7 & 13.4 & 4.9 & 36.2 & 24.5 & 12.0 & 3.1 \\ |
---|
| 372 | Match None (10M) & 32.7 & 50.3 & 11.0 & 5.1 & 36.3 & 71.9 & 12.3 & 4.2 \\ |
---|
[0b67a19] | 373 | \hline |
---|
[262deda0] | 374 | \end{tabular} |
---|
| 375 | \end{table} |
---|
| 376 | |
---|
| 377 | \begin{table} |
---|
| 378 | \centering |
---|
| 379 | \small |
---|
| 380 | \caption{Performance Results Resumption (sec)} |
---|
| 381 | \label{t:PerformanceResumption} |
---|
| 382 | \setlength{\tabcolsep}{5pt} |
---|
| 383 | \begin{tabular}{|r|*{2}{|r r r r|}} |
---|
| 384 | \hline |
---|
| 385 | & \multicolumn{4}{c||}{AMD} & \multicolumn{4}{c|}{ARM} \\ |
---|
| 386 | \cline{2-9} |
---|
| 387 | N\hspace{8pt} & \multicolumn{1}{c}{\CFA (R/F)} & \multicolumn{1}{c}{\Cpp} & \multicolumn{1}{c}{Java} & \multicolumn{1}{c||}{Python} & |
---|
| 388 | \multicolumn{1}{c}{\CFA (R/F)} & \multicolumn{1}{c}{\Cpp} & \multicolumn{1}{c}{Java} & \multicolumn{1}{c|}{Python} \\ |
---|
| 389 | \hline |
---|
| 390 | Resume Empty (10M) & 3.8/3.5 & 14.7 & 2.3 & 176.1 & 0.3/0.1 & 8.9 & 1.2 & 119.9 \\ |
---|
| 391 | Resume Other (10M) & 4.0*/0.1* & 21.9 & 6.2 & 381.0 & 0.3*/0.1* & 13.2 & 5.0 & 290.7 \\ |
---|
| 392 | Try/Resume (100M) & 8.8 & N/A & N/A & N/A & 12.3 & N/A & N/A & N/A \\ |
---|
| 393 | Match All (10M) & 0.3 & N/A & N/A & N/A & 0.3 & N/A & N/A & N/A \\ |
---|
| 394 | Match None (10M) & 0.3 & N/A & N/A & N/A & 0.4 & N/A & N/A & N/A \\ |
---|
[0b67a19] | 395 | \hline |
---|
| 396 | \end{tabular} |
---|
[262deda0] | 397 | \end{table} |
---|
[0b67a19] | 398 | |
---|
[262deda0] | 399 | As stated, the performance tests are not attempting to compare exception |
---|
| 400 | handling across languages. The only performance requirement is to ensure the |
---|
| 401 | \CFA EHM implementation runs in a reasonable amount of time, given its |
---|
| 402 | constraints. In general, the \CFA implement did very well. Each of the tests is |
---|
| 403 | analysed. |
---|
| 404 | \begin{description} |
---|
| 405 | \item[Throw/Resume Empty] |
---|
| 406 | For termination, \CFA is close to \Cpp, where other languages have a higher cost. |
---|
| 407 | |
---|
| 408 | For resumption, \CFA is better than the fixup simulations in the other languages, except Java. |
---|
| 409 | The \CFA results on the ARM computer for both resumption and function simulation are particularly low; |
---|
| 410 | I have no explanation for this anomaly, except the optimizer has managed to remove part of the experiment. |
---|
| 411 | Python has a high cost for passing the lambda during the recursion. |
---|
| 412 | |
---|
| 413 | \item[Throw D'tor] |
---|
| 414 | For termination, \CFA is twice the cost of \Cpp. |
---|
| 415 | The higher cost for \CFA must be related to how destructors are handled. |
---|
| 416 | |
---|
| 417 | \item[Throw Finally] |
---|
| 418 | \CFA is better than the other languages with a @finally@ clause, which is the |
---|
| 419 | same for termination and resumption. |
---|
| 420 | |
---|
| 421 | \item[Throw/Resume Other] |
---|
| 422 | For termination, \CFA is better than the other languages. |
---|
| 423 | |
---|
| 424 | For resumption, \CFA is equal to or better the other languages. |
---|
| 425 | Again, the \CFA results on the ARM computer for both resumption and function simulation are particularly low. |
---|
| 426 | Python has a high cost for passing the lambda during the recursion. |
---|
| 427 | |
---|
| 428 | \item[Try/Catch/Resume] |
---|
| 429 | For termination, installing a try statement is more expressive than \Cpp |
---|
| 430 | because the try components are hoisted into local functions. At runtime, these |
---|
| 431 | functions are than passed to libunwind functions to set up the try statement. |
---|
| 432 | \Cpp zero-cost try-entry accounts for its performance advantage. |
---|
| 433 | |
---|
| 434 | For resumption, there are similar costs to termination to set up the try |
---|
| 435 | statement but libunwind is not used. |
---|
| 436 | |
---|
| 437 | \item[Try/Finally] |
---|
| 438 | Setting up a try finally is less expensive in \CFA than setting up handlers, |
---|
| 439 | and is significantly less than other languages. |
---|
[0b67a19] | 440 | |
---|
[262deda0] | 441 | \item[Throw/Resume Match All] |
---|
| 442 | For termination, \CFA is close to the other language simulations. |
---|
| 443 | |
---|
| 444 | For resumption, the stack unwinding is much faster because it does not use |
---|
| 445 | libunwind. Instead resumption is just traversing a linked list with each node |
---|
| 446 | being the next stack frame with the try block. |
---|
| 447 | |
---|
| 448 | \item[Throw/Resume Match None] |
---|
| 449 | The same results as for Match All. |
---|
| 450 | \end{description} |
---|
| 451 | |
---|
| 452 | \begin{comment} |
---|
| 453 | This observation means that while \CFA does not actually keep up with Python in |
---|
| 454 | every case, it is usually no worse than roughly half the speed of \Cpp. This |
---|
| 455 | performance is good enough for the prototyping purposes of the project. |
---|
[0b67a19] | 456 | |
---|
[5438e41] | 457 | The test case where \CFA falls short is Raise Other, the case where the |
---|
| 458 | stack is unwound including a bunch of non-matching handlers. |
---|
[c9f9d4f] | 459 | This slowdown seems to come from missing optimizations. |
---|
| 460 | |
---|
[5438e41] | 461 | This suggests that the performance issue in Raise Other is just an |
---|
| 462 | optimization not being applied. Later versions of gcc may be able to |
---|
| 463 | optimize this case further, at least down to the half of \Cpp mark. |
---|
| 464 | A \CFA compiler that directly produced assembly could do even better as it |
---|
| 465 | would not have to work across some of \CFA's current abstractions, like |
---|
| 466 | the try terminate function. |
---|
[0b67a19] | 467 | |
---|
| 468 | Resumption exception handling is also incredibly fast. Often an order of |
---|
| 469 | magnitude or two better than the best termination speed. |
---|
[c9f9d4f] | 470 | There is a simple explanation for this; traversing a linked list is much |
---|
[0b67a19] | 471 | faster than examining and unwinding the stack. When resumption does not do as |
---|
[c9f9d4f] | 472 | well its when more try statements are used per raise. Updating the internal |
---|
| 473 | linked list is not very expensive but it does add up. |
---|
[0b67a19] | 474 | |
---|
| 475 | The relative speed of the Match All and Match None tests (within each |
---|
| 476 | language) can also show the effectiveness conditional matching as compared |
---|
| 477 | to catch and rethrow. |
---|
| 478 | \begin{itemize}[nosep] |
---|
| 479 | \item |
---|
| 480 | Java and Python get similar values in both tests. |
---|
[c9f9d4f] | 481 | Between the interpreted code, a higher level representation of the call |
---|
[0b67a19] | 482 | stack and exception reuse it it is possible the cost for a second |
---|
| 483 | throw can be folded into the first. |
---|
| 484 | % Is this due to optimization? |
---|
| 485 | \item |
---|
[c9f9d4f] | 486 | Both types of \CFA are slightly slower if there is not a match. |
---|
[0b67a19] | 487 | For termination this likely comes from unwinding a bit more stack through |
---|
| 488 | libunwind instead of executing the code normally. |
---|
| 489 | For resumption there is extra work in traversing more of the list and running |
---|
| 490 | more checks for a matching exceptions. |
---|
| 491 | % Resumption is a bit high for that but this is my best theory. |
---|
| 492 | \item |
---|
| 493 | Then there is \Cpp, which takes 2--3 times longer to catch and rethrow vs. |
---|
| 494 | just the catch. This is very high, but it does have to repeat the same |
---|
| 495 | process of unwinding the stack and may have to parse the LSDA of the function |
---|
| 496 | with the catch and rethrow twice, once before the catch and once after the |
---|
| 497 | rethrow. |
---|
| 498 | % I spent a long time thinking of what could push it over twice, this is all |
---|
| 499 | % I have to explain it. |
---|
| 500 | \end{itemize} |
---|
| 501 | The difference in relative performance does show that there are savings to |
---|
| 502 | be made by performing the check without catching the exception. |
---|
[262deda0] | 503 | \end{comment} |
---|
| 504 | |
---|
| 505 | |
---|
| 506 | \begin{comment} |
---|
| 507 | From: Dave Dice <dave.dice@oracle.com> |
---|
| 508 | To: "Peter A. Buhr" <pabuhr@uwaterloo.ca> |
---|
| 509 | Subject: Re: [External] : JIT |
---|
| 510 | Date: Mon, 16 Aug 2021 01:21:56 +0000 |
---|
| 511 | |
---|
| 512 | > On 2021-8-15, at 7:14 PM, Peter A. Buhr <pabuhr@uwaterloo.ca> wrote: |
---|
| 513 | > |
---|
| 514 | > My student is trying to measure the cost of installing a try block with a |
---|
| 515 | > finally clause in Java. |
---|
| 516 | > |
---|
| 517 | > We tried the random trick (see below). But if the try block is comment out, the |
---|
| 518 | > results are the same. So the program measures the calls to the random number |
---|
| 519 | > generator and there is no cost for installing the try block. |
---|
| 520 | > |
---|
| 521 | > Maybe there is no cost for a try block with an empty finally, i.e., the try is |
---|
| 522 | > optimized away from the get-go. |
---|
| 523 | |
---|
| 524 | There's quite a bit of optimization magic behind the HotSpot curtains for |
---|
| 525 | try-finally. (I sound like the proverbial broken record (:>)). |
---|
| 526 | |
---|
| 527 | In many cases we can determine that the try block can't throw any exceptions, |
---|
| 528 | so we can elide all try-finally plumbing. In other cases, we can convert the |
---|
| 529 | try-finally to normal if-then control flow, in the case where the exception is |
---|
| 530 | thrown into the same method. This makes exceptions _almost cost-free. If we |
---|
| 531 | actually need to "physically" rip down stacks, then things get expensive, |
---|
| 532 | impacting both the throw cost, and inhibiting other useful optimizations at the |
---|
| 533 | catch point. Such "true" throws are not just expensive, they're _very |
---|
| 534 | expensive. The extremely aggressive inlining used by the JIT helps, because we |
---|
| 535 | can convert cases where a heavy rip-down would normally needed back into simple |
---|
| 536 | control flow. |
---|
| 537 | |
---|
| 538 | Other quirks involve the thrown exception object. If it's never accessed then |
---|
| 539 | we're apply a nice set of optimizations to avoid its construction. If it's |
---|
| 540 | accessed but never escapes the catch frame (common) then we can also cheat. |
---|
| 541 | And if we find we're hitting lots of heavy rip-down cases, the JIT will |
---|
| 542 | consider recompilation - better inlining -- to see if we can merge the throw |
---|
| 543 | and catch into the same physical frame, and shift to simple branches. |
---|
| 544 | |
---|
| 545 | In your example below, System.out.print() can throw, I believe. (I could be |
---|
| 546 | wrong, but most IO can throw). Native calls that throw will "unwind" normally |
---|
| 547 | in C++ code until they hit the boundary where they reenter java emitted code, |
---|
| 548 | at which point the JIT-ed code checks for a potential pending exception. So in |
---|
| 549 | a sense the throw point is implicitly after the call to the native method, so |
---|
| 550 | we can usually make those cases efficient. |
---|
| 551 | |
---|
| 552 | Also, when we're running in the interpreter and warming up, we'll notice that |
---|
| 553 | the == 42 case never occurs, and so when we start to JIT the code, we elide the |
---|
| 554 | call to System.out.print(), replacing it (and anything else which appears in |
---|
| 555 | that if x == 42 block) with a bit of code we call an "uncommon trap". I'm |
---|
| 556 | presuming we encounter 42 rarely. So if we ever hit the x == 42 case, control |
---|
| 557 | hits the trap, which triggers synchronous recompilation of the method, this |
---|
| 558 | time with the call to System.out.print() and, because of that, we now to adapt |
---|
| 559 | the new code to handle any traps thrown by print(). This is tricky stuff, as |
---|
| 560 | we may need to rebuild stack frames to reflect the newly emitted method. And |
---|
| 561 | we have to construct a weird bit of "thunk" code that allows us to fall back |
---|
| 562 | directly into the newly emitted "if" block. So there's a large one-time cost |
---|
| 563 | when we bump into the uncommon trap and recompile, and subsequent execution |
---|
| 564 | might get slightly slower as the exception could actually be generated, whereas |
---|
| 565 | before we hit the trap, we knew the exception could never be raised. |
---|
| 566 | |
---|
| 567 | Oh, and things also get expensive if we need to actually fill in the stack |
---|
| 568 | trace associated with the exception object. Walking stacks is hellish. |
---|
| 569 | |
---|
| 570 | Quite a bit of effort was put into all this as some of the specjvm benchmarks |
---|
| 571 | showed significant benefit. |
---|
| 572 | |
---|
| 573 | It's hard to get sensible measurements as the JIT is working against you at |
---|
| 574 | every turn. What's good for the normal user is awful for anybody trying to |
---|
| 575 | benchmark. Also, all the magic results in fairly noisy and less reproducible |
---|
| 576 | results. |
---|
| 577 | |
---|
| 578 | Regards |
---|
| 579 | Dave |
---|
| 580 | |
---|
| 581 | p.s., I think I've mentioned this before, but throwing in C++ is grim as |
---|
| 582 | unrelated throws in different threads take common locks, so nothing scales as |
---|
| 583 | you might expect. |
---|
| 584 | \end{comment} |
---|