source: doc/theses/andrew_beach_MMath/performance.tex @ 6c2dc00

ADTast-experimentalenumforall-pointer-decayjacob/cs343-translationpthread-emulationqualifiedEnum
Last change on this file since 6c2dc00 was 262deda0, checked in by Peter A. Buhr <pabuhr@…>, 3 years ago

proofread performance chapter and add local bibliography entry

  • Property mode set to 100644
File size: 26.2 KB
RevLine 
[dac16a0]1\chapter{Performance}
2\label{c:performance}
3
[c9f9d4f]4Performance is of secondary importance for most of this project.
[262deda0]5Instead, the focus was to get the features working. The only performance
[c9f9d4f]6requirement is to ensure the tests for correctness run in a reasonable
[262deda0]7amount of time. Hence, a few basic performance tests were performed to
8check this requirement.
[b51e389c]9
10\section{Test Set-Up}
[c9f9d4f]11Tests were run in \CFA, C++, Java and Python.
[9698690]12In addition there are two sets of tests for \CFA,
[c9f9d4f]13one for termination and one for resumption exceptions.
[dac16a0]14
15C++ is the most comparable language because both it and \CFA use the same
16framework, libunwind.
[262deda0]17In fact, the comparison is almost entirely a quality of implementation.
18Specifically, \CFA's EHM has had significantly less time to be optimized and
[dac16a0]19does not generate its own assembly. It does have a slight advantage in that
[c9f9d4f]20there are some features it handles directly instead of through utility functions,
[262deda0]21but otherwise \Cpp should have a significant advantage.
[dac16a0]22
[262deda0]23Java is a popular language with similar termination semantics, but
24it is implemented in a very different environment, a virtual machine with
[029cbc0]25garbage collection.
[c9f9d4f]26It also implements the @finally@ clause on @try@ blocks allowing for a direct
[029cbc0]27feature-to-feature comparison.
[262deda0]28As with \Cpp, Java's implementation is mature, optimized
[c9f9d4f]29and has extra features.
[9698690]30
[262deda0]31Python is used as an alternative comparison because of the \CFA EHM's
[c9f9d4f]32current performance goals, which is not to be prohibitively slow while the
[9698690]33features are designed and examined. Python has similar performance goals for
34creating quick scripts and its wide use suggests it has achieved those goals.
35
[c9f9d4f]36Unfortunately, there are no notable modern programming languages with
37resumption exceptions. Even the older programming languages with resumption
38seem to be notable only for having resumption.
[262deda0]39So instead, resumption is compared to its simulation in other programming
40languages using fixup functions that are explicitly passed for correction or
41logging purposes.
42% So instead, resumption is compared to a less similar but much more familiar
43%feature, termination exceptions.
[dac16a0]44
[c9f9d4f]45All tests are run inside a main loop that repeatedly performs a test.
46This approach avoids start-up or tear-down time from
[dac16a0]47affecting the timing results.
[262deda0]48Each test is run a N times (configurable from the command line).
49The Java tests runs the main loop 1000 times before
50beginning the actual test to ``warm-up" the JVM.
[9698690]51
52Timing is done internally, with time measured immediately before and
[c9f9d4f]53after the test loop. The difference is calculated and printed.
[9698690]54The loop structure and internal timing means it is impossible to test
55unhandled exceptions in \Cpp and Java as that would cause the process to
56terminate.
57Luckily, performance on the ``give-up and kill the process" path is not
58critical.
[dac16a0]59
[c9f9d4f]60The exceptions used in these tests are always based off of
61a base exception. This requirement minimizes performance differences based
62on the object model used to represent the exception.
[9698690]63
[c9f9d4f]64All tests are designed to be as minimal as possible, while still preventing
65excessive optimizations.
[9698690]66For example, empty inline assembly blocks are used in \CFA and \Cpp to
67prevent excessive optimizations while adding no actual work.
[c9f9d4f]68Each test was run eleven times. The top three and bottom three results were
69discarded and the remaining five values are averaged.
70
71The tests are compiled with gcc-10 for \CFA and g++-10 for \Cpp. Java is
[262deda0]72compiled with version 11.0.11. Python with version 3.8. The tests were run on:
[c9f9d4f]73\begin{itemize}[nosep]
74\item
75ARM 2280 Kunpeng 920 48-core 2$\times$socket \lstinline{@} 2.6 GHz running Linux v5.11.0-25
76\item
77AMD 6380 Abu Dhabi 16-core 4$\times$socket \lstinline{@} 2.5 GHz running Linux v5.11.0-25
78\end{itemize}
[262deda0]79Two kinds of hardware architecture allows discriminating any implementation and
80architectural effects.
81
[dac16a0]82
[9698690]83% We don't use catch-alls but if we did:
84% Catch-alls are done by catching the root exception type (not using \Cpp's
85% \code{C++}{catch(...)}).
[dac16a0]86
[b51e389c]87\section{Tests}
[029cbc0]88The following tests were selected to test the performance of different
89components of the exception system.
[c9f9d4f]90They should provide a guide as to where the EHM's costs are found.
[029cbc0]91
[ea593a3]92\paragraph{Raise and Handle}
[262deda0]93This group measures the cost of a try statement when exceptions are raised and
94the stack is unwound (termination) or not unwound (resumption).  Each test has
95has a repeating function like the following
96\begin{lstlisting}[language=CFA,{moredelim=**[is][\color{red}]{@}{@}}]
[c9f9d4f]97void unwind_empty(unsigned int frames) {
98        if (frames) {
[262deda0]99                @unwind_empty(frames - 1);@ // AUGMENTED IN OTHER EXPERIMENTS
[c9f9d4f]100        } else throw (empty_exception){&empty_vt};
101}
[262deda0]102\end{lstlisting}
103which is called N times, where each call recurses to a depth of R (configurable from the command line), an
[c9f9d4f]104exception is raised, the stack is a unwound, and the exception caught.
[9698690]105\begin{itemize}[nosep]
[c9f9d4f]106\item Empty:
[262deda0]107For termination, this test measures the cost of raising (stack walking) an
108exception through empty stack frames from the bottom of the recursion to an
109empty handler, and unwinding the stack. (see above code)
110
111\medskip
112For resumption, this test measures the same raising cost but does not unwind
113the stack. For languages without resumption, a fixup function is to the bottom
114of the recursion and called to simulate a fixup operation at that point.
115\begin{cfa}
116void nounwind_fixup(unsigned int frames, void (*raised_rtn)(int &)) {
117        if (frames) {
118                nounwind_fixup(frames - 1, raised_rtn);
119        } else {
120                int fixup = 17;
121                raised_rtn(fixup);
122        }
123}
124\end{cfa}
125where the passed fixup function is:
126\begin{cfa}
127void raised(int & fixup) {
128        fixup = 42;
129}
130\end{cfa}
131For comparison, a \CFA version passing a function is also included.
[ea593a3]132\item Destructor:
[262deda0]133This test measures the cost of raising an exception through non-empty frames,
134where each frame has an object requiring destruction, from the bottom of the
135recursion to an empty handler. Hence, there are N destructor calls during
136unwinding.
[c9f9d4f]137
[262deda0]138\medskip
139This test is not meaningful for resumption because the stack is only unwound as
140the recursion returns.
[c9f9d4f]141\begin{cfa}
142        WithDestructor object;
[262deda0]143        unwind_destructor(frames - 1);
[c9f9d4f]144\end{cfa}
[ea593a3]145\item Finally:
[c9f9d4f]146This test measures the cost of establishing a try block with an empty finally
[262deda0]147clause on the front side of the recursion and running the empty finally clauses
148during stack unwinding from the bottom of the recursion to an empty handler.
[c9f9d4f]149\begin{cfa}
150        try {
151                unwind_finally(frames - 1);
152        } finally {}
153\end{cfa}
[262deda0]154
155\medskip
156This test is not meaningful for resumption because the stack is only unwound as
157the recursion returns.
[ea593a3]158\item Other Handler:
[262deda0]159For termination, this test is like the finally test but the try block has a
160catch clause for an exception that is not raised, so catch matching is executed
161during stack unwinding but the match never successes until the catch at the
162bottom of the recursion.
[c9f9d4f]163\begin{cfa}
164        try {
165                unwind_other(frames - 1);
166        } catch (not_raised_exception *) {}
167\end{cfa}
[262deda0]168
169\medskip
170For resumption, this test measures the same raising cost but does not unwind
171the stack. For languages without resumption, the same fixup function is passed
172and called.
[ea593a3]173\end{itemize}
174
[262deda0]175\paragraph{Try/Handle/Finally Statement}
176This group measures just the cost of executing a try statement so
[c9f9d4f]177\emph{there is no stack unwinding}.  Hence, the program main loops N times
178around:
179\begin{cfa}
180try {
181} catch (not_raised_exception *) {}
182\end{cfa}
[9698690]183\begin{itemize}[nosep]
[ea593a3]184\item Handler:
[262deda0]185The try statement has a handler (catch/resume).
[ea593a3]186\item Finally:
[262deda0]187The try statement has a finally clause.
[ea593a3]188\end{itemize}
189
190\paragraph{Conditional Matching}
[262deda0]191This group measures the cost of conditional matching.
[ea593a3]192Only \CFA implements the language level conditional match,
[262deda0]193the other languages mimic with an ``unconditional" match (it still
194checks the exception's type) and conditional re-raise if it is not suppose
[9698690]195to handle that exception.
[c9f9d4f]196\begin{center}
197\begin{tabular}{ll}
198\multicolumn{1}{c}{\CFA} & \multicolumn{1}{c}{\Cpp, Java, Python} \\
199\begin{cfa}
200try {
201        throw_exception();
202} catch (empty_exception * exc;
203                 should_catch) {
204}
205\end{cfa}
206&
207\begin{cfa}
208try {
209        throw_exception();
210} catch (EmptyException & exc) {
211        if (!should_catch) throw;
212}
213\end{cfa}
214\end{tabular}
215\end{center}
[9698690]216\begin{itemize}[nosep]
217\item Match All:
[ea593a3]218The condition is always true. (Always matches or never re-raises.)
[9698690]219\item Match None:
[ea593a3]220The condition is always false. (Never matches or always re-raises.)
221\end{itemize}
[dac16a0]222
[262deda0]223\medskip
224\noindent
225All omitted test code for other languages is functionally identical to the \CFA
226tests or simulated, and available online~\cite{CforallExceptionBenchmarks}.
227
[dac16a0]228%\section{Cost in Size}
229%Using exceptions also has a cost in the size of the executable.
230%Although it is sometimes ignored
231%
232%There is a size cost to defining a personality function but the later problem
233%is the LSDA which will be generated for every function.
234%
235%(I haven't actually figured out how to compare this, probably using something
236%related to -fexceptions.)
[029cbc0]237
[9698690]238\section{Results}
[262deda0]239One result not directly related to \CFA but important to keep in
240mind is that, for exceptions, the standard intuition about which languages
241should go faster often does not hold. For example, there are a few cases where Python out-performs
242\CFA, \Cpp and Java. The most likely explanation is that, since exceptions are
243rarely considered to be the common case, the more optimized languages
244make that case expense. In addition, languages with high-level
245representations have a much easier time scanning the stack as there is less
246to decode.
[c9f9d4f]247
[262deda0]248Tables~\ref{t:PerformanceTermination} and~\ref{t:PerformanceResumption} show
249the test results for termination and resumption, respectively.  In cases where
250a feature is not supported by a language, the test is skipped for that language
251(marked N/A).  For some Java experiments it was impossible to measure certain
252effects because the JIT corrupted the test (marked N/C). No workaround was
253possible~\cite{Dice21}.  To get experiments in the range of 1--100 seconds, the
254number of times an experiment is run (N) is varied (N is marked beside each
255experiment, e.g., 1M $\Rightarrow$ 1 million test iterations).
256
257An anomaly exists with gcc nested functions used as thunks for implementing
258much of the \CFA EHM. If a nested-function closure captures local variables in
259its lexical scope, performance dropped by a factor of 10.  Specifically, in try
260statements of the form:
261\begin{cfa}
262        try {
263                unwind_other(frames - 1);
264        } catch (not_raised_exception *) {}
265\end{cfa}
266the try block is hoisted into a nested function and the variable @frames@ is
267the local parameter to the recursive function, which triggers the anomaly. The
268workaround is to remove the recursion parameter and make it a global variable
269that is explicitly decremented outside of the try block (marked with a ``*''):
270\begin{cfa}
271        frames -= 1;
272        try {
273                unwind_other();
274        } catch (not_raised_exception *) {}
275\end{cfa}
276To make comparisons fair, a dummy parameter is added and the dummy value passed
277in the recursion. Note, nested functions in gcc are rarely used (if not
278completely unknown) and must follow the C calling convention, unlike \Cpp
279lambdas, so it is not surprising if there are performance issues efficiently
280capturing closures.
281
282% Similarly, if a test does not change between resumption
283% and termination in \CFA, then only one test is written and the result
284% was put into the termination column.
[9698690]285
[0b67a19]286% Raw Data:
287% run-algol-a.sat
288% ---------------
289% Raise Empty   &  82687046678 &  291616256 &   3252824847 & 15422937623 & 14736271114 \\
290% Raise D'tor   & 219933199603 &  297897792 & 223602799362 &         N/A &         N/A \\
291% Raise Finally & 219703078448 &  298391745 &          N/A &         ... & 18923060958 \\
292% Raise Other   & 296744104920 & 2854342084 & 112981255103 & 15475924808 & 21293137454 \\
293% Cross Handler &      9256648 &   13518430 &       769328 &     3486252 &    31790804 \\
294% Cross Finally &       769319 &        N/A &          N/A &     2272831 &    37491962 \\
295% Match All     &   3654278402 &   47518560 &   3218907794 &  1296748192 &   624071886 \\
296% Match None    &   4788861754 &   58418952 &   9458936430 &  1318065020 &   625200906 \\
297%
298% run-algol-thr-c
299% ---------------
300% Raise Empty   &   3757606400 &   36472972 &   3257803337 & 15439375452 & 14717808642 \\
301% Raise D'tor   &  64546302019 &  102148375 & 223648121635 &         N/A &         N/A \\
302% Raise Finally &  64671359172 &  103285005 &          N/A & 15442729458 & 18927008844 \\
303% Raise Other   & 294143497130 & 2630130385 & 112969055576 & 15448220154 & 21279953424 \\
304% Cross Handler &      9646462 &   11955668 &       769328 &     3453707 &    31864074 \\
305% Cross Finally &       773412 &        N/A &          N/A &     2253825 &    37266476 \\
306% Match All     &   3719462155 &   43294042 &   3223004977 &  1286054154 &   623887874 \\
307% Match None    &   4971630929 &   55311709 &   9481225467 &  1310251289 &   623752624 \\
[5438e41]308%
309% run-algol-04-a
310% --------------
311% Raise Empty   & 0.0 & 0.0 &  3250260945 & 0.0 & 0.0 \\
312% Raise D'tor   & 0.0 & 0.0 & 29017675113 & N/A & N/A \\
313% Raise Finally & 0.0 & 0.0 &         N/A & 0.0 & 0.0 \\
314% Raise Other   & 0.0 & 0.0 & 24411823773 & 0.0 & 0.0 \\
315% Cross Handler & 0.0 & 0.0 &      769334 & 0.0 & 0.0 \\
316% Cross Finally & 0.0 & N/A &         N/A & 0.0 & 0.0 \\
317% Match All     & 0.0 & 0.0 &  3254283504 & 0.0 & 0.0 \\
318% Match None    & 0.0 & 0.0 &  9476060146 & 0.0 & 0.0 \\
319
[0b67a19]320% run-plg7a-a.sat
321% ---------------
322% Raise Empty   &  57169011329 &  296612564 &   2788557155 & 17511466039 & 23324548496 \\
323% Raise D'tor   & 150599858014 &  318443709 & 149651693682 &         N/A &         N/A \\
324% Raise Finally & 148223145000 &  373325807 &          N/A &         ... & 29074552998 \\
325% Raise Other   & 189463708732 & 3017109322 &  85819281694 & 17584295487 & 32602686679 \\
326% Cross Handler &      8001654 &   13584858 &      1555995 &     6626775 &    41927358 \\
327% Cross Finally &      1002473 &        N/A &          N/A &     4554344 &    51114381 \\
328% Match All     &   3162460860 &   37315018 &   2649464591 &  1523205769 &   742374509 \\
329% Match None    &   4054773797 &   47052659 &   7759229131 &  1555373654 &   744656403 \\
330%
331% run-plg7a-thr-a
332% ---------------
333% Raise Empty   &   3604235388 &   29829965 &   2786931833 & 17576506385 & 23352975105 \\
334% Raise D'tor   &  46552380948 &  178709605 & 149834207219 &         N/A &         N/A \\
335% Raise Finally &  46265157775 &  177906320 &          N/A & 17493045092 & 29170962959 \\
336% Raise Other   & 195659245764 & 2376968982 &  86070431924 & 17552979675 & 32501882918 \\
337% Cross Handler &    397031776 &   12503552 &      1451225 &     6658628 &    42304965 \\
338% Cross Finally &      1136746 &        N/A &          N/A &     4468799 &    46155817 \\
339% Match All     &   3189512499 &   39124453 &   2667795989 &  1525889031 &   733785613 \\
340% Match None    &   4094675477 &   48749857 &   7850618572 &  1566713577 &   733478963 \\
[5438e41]341%
342% run-plg7a-04-a
343% --------------
344% 0.0 are unfilled.
345% Raise Empty   & 0.0 & 0.0 &  2770781479 & 0.0 & 0.0 \\
346% Raise D'tor   & 0.0 & 0.0 & 23530084907 & N/A & N/A \\
347% Raise Finally & 0.0 & 0.0 &         N/A & 0.0 & 0.0 \\
348% Raise Other   & 0.0 & 0.0 & 23816827982 & 0.0 & 0.0 \\
349% Cross Handler & 0.0 & 0.0 &     1422188 & 0.0 & 0.0 \\
350% Cross Finally & 0.0 & N/A &         N/A & 0.0 & 0.0 \\
351% Match All     & 0.0 & 0.0 &  2671989778 & 0.0 & 0.0 \\
352% Match None    & 0.0 & 0.0 &  7829059869 & 0.0 & 0.0 \\
[0b67a19]353
[262deda0]354\begin{table}
355\centering
356\caption{Performance Results Termination (sec)}
357\label{t:PerformanceTermination}
358\begin{tabular}{|r|*{2}{|r r r r|}}
[0b67a19]359\hline
[262deda0]360                        & \multicolumn{4}{c||}{AMD}             & \multicolumn{4}{c|}{ARM}      \\
361\cline{2-9}
362N\hspace{8pt}           & \multicolumn{1}{c}{\CFA} & \multicolumn{1}{c}{\Cpp} & \multicolumn{1}{c}{Java} & \multicolumn{1}{c||}{Python} &
363                          \multicolumn{1}{c}{\CFA} & \multicolumn{1}{c}{\Cpp} & \multicolumn{1}{c}{Java} & \multicolumn{1}{c|}{Python} \\
364\hline                                                                             
365Throw Empty (1M)        & 3.4   & 2.8   & 18.3  & 23.4          & 3.7   & 3.2   & 15.5  & 14.8  \\
366Throw D'tor (1M)        & 48.4  & 23.6  & N/A   & N/A           & 64.2  & 29.0  & N/A   & N/A   \\
367Throw Finally (1M)      & 3.4*  & N/A   & 17.9  & 29.0          & 4.1*  & N/A   & 15.6  & 19.0  \\
368Throw Other (1M)        & 3.6*  & 23.2  & 18.2  & 32.7          & 4.0*  & 24.5  & 15.5  & 21.4  \\
369Try/Catch (100M)        & 6.0   & 0.9   & N/C   & 37.4          & 10.0  & 0.8   & N/C   & 32.2  \\
370Try/Finally (100M)      & 0.9   & N/A   & N/C   & 44.1          & 0.8   & N/A   & N/C   & 37.3  \\
371Match All (10M)         & 32.9  & 20.7  & 13.4  & 4.9           & 36.2  & 24.5  & 12.0  & 3.1   \\
372Match None (10M)        & 32.7  & 50.3  & 11.0  & 5.1           & 36.3  & 71.9  & 12.3  & 4.2   \\
[0b67a19]373\hline
[262deda0]374\end{tabular}
375\end{table}
376
377\begin{table}
378\centering
379\small
380\caption{Performance Results Resumption (sec)}
381\label{t:PerformanceResumption}
382\setlength{\tabcolsep}{5pt}
383\begin{tabular}{|r|*{2}{|r r r r|}}
384\hline
385                        & \multicolumn{4}{c||}{AMD}             & \multicolumn{4}{c|}{ARM}      \\
386\cline{2-9}
387N\hspace{8pt}           & \multicolumn{1}{c}{\CFA (R/F)} & \multicolumn{1}{c}{\Cpp} & \multicolumn{1}{c}{Java} & \multicolumn{1}{c||}{Python} &
388                          \multicolumn{1}{c}{\CFA (R/F)} & \multicolumn{1}{c}{\Cpp} & \multicolumn{1}{c}{Java} & \multicolumn{1}{c|}{Python} \\
389\hline                                                                             
390Resume Empty (10M)      & 3.8/3.5       & 14.7  & 2.3   & 176.1 & 0.3/0.1       & 8.9   & 1.2   & 119.9 \\
391Resume Other (10M)      & 4.0*/0.1*     & 21.9  & 6.2   & 381.0 & 0.3*/0.1*     & 13.2  & 5.0   & 290.7 \\
392Try/Resume (100M)       & 8.8           & N/A   & N/A   & N/A   & 12.3          & N/A   & N/A   & N/A   \\
393Match All (10M)         & 0.3           & N/A   & N/A   & N/A   & 0.3           & N/A   & N/A   & N/A   \\
394Match None (10M)        & 0.3           & N/A   & N/A   & N/A   & 0.4           & N/A   & N/A   & N/A   \\
[0b67a19]395\hline
396\end{tabular}
[262deda0]397\end{table}
[0b67a19]398
[262deda0]399As stated, the performance tests are not attempting to compare exception
400handling across languages.  The only performance requirement is to ensure the
401\CFA EHM implementation runs in a reasonable amount of time, given its
402constraints. In general, the \CFA implement did very well. Each of the tests is
403analysed.
404\begin{description}
405\item[Throw/Resume Empty]
406For termination, \CFA is close to \Cpp, where other languages have a higher cost.
407
408For resumption, \CFA is better than the fixup simulations in the other languages, except Java.
409The \CFA results on the ARM computer for both resumption and function simulation are particularly low;
410I have no explanation for this anomaly, except the optimizer has managed to remove part of the experiment.
411Python has a high cost for passing the lambda during the recursion.
412
413\item[Throw D'tor]
414For termination, \CFA is twice the cost of \Cpp.
415The higher cost for \CFA must be related to how destructors are handled.
416
417\item[Throw Finally]
418\CFA is better than the other languages with a @finally@ clause, which is the
419same for termination and resumption.
420
421\item[Throw/Resume Other]
422For termination, \CFA is better than the other languages.
423
424For resumption, \CFA is equal to or better the other languages.
425Again, the \CFA results on the ARM computer for both resumption and function simulation are particularly low.
426Python has a high cost for passing the lambda during the recursion.
427
428\item[Try/Catch/Resume]
429For termination, installing a try statement is more expressive than \Cpp
430because the try components are hoisted into local functions.  At runtime, these
431functions are than passed to libunwind functions to set up the try statement.
432\Cpp zero-cost try-entry accounts for its performance advantage.
433
434For resumption, there are similar costs to termination to set up the try
435statement but libunwind is not used.
436
437\item[Try/Finally]
438Setting up a try finally is less expensive in \CFA than setting up handlers,
439and is significantly less than other languages.
[0b67a19]440
[262deda0]441\item[Throw/Resume Match All]
442For termination, \CFA is close to the other language simulations.
443
444For resumption, the stack unwinding is much faster because it does not use
445libunwind.  Instead resumption is just traversing a linked list with each node
446being the next stack frame with the try block.
447
448\item[Throw/Resume Match None]
449The same results as for Match All.
450\end{description}
451
452\begin{comment}
453This observation means that while \CFA does not actually keep up with Python in
454every case, it is usually no worse than roughly half the speed of \Cpp. This
455performance is good enough for the prototyping purposes of the project.
[0b67a19]456
[5438e41]457The test case where \CFA falls short is Raise Other, the case where the
458stack is unwound including a bunch of non-matching handlers.
[c9f9d4f]459This slowdown seems to come from missing optimizations.
460
[5438e41]461This suggests that the performance issue in Raise Other is just an
462optimization not being applied. Later versions of gcc may be able to
463optimize this case further, at least down to the half of \Cpp mark.
464A \CFA compiler that directly produced assembly could do even better as it
465would not have to work across some of \CFA's current abstractions, like
466the try terminate function.
[0b67a19]467
468Resumption exception handling is also incredibly fast. Often an order of
469magnitude or two better than the best termination speed.
[c9f9d4f]470There is a simple explanation for this; traversing a linked list is much   
[0b67a19]471faster than examining and unwinding the stack. When resumption does not do as
[c9f9d4f]472well its when more try statements are used per raise. Updating the internal
473linked list is not very expensive but it does add up.
[0b67a19]474
475The relative speed of the Match All and Match None tests (within each
476language) can also show the effectiveness conditional matching as compared
477to catch and rethrow.
478\begin{itemize}[nosep]
479\item
480Java and Python get similar values in both tests.
[c9f9d4f]481Between the interpreted code, a higher level representation of the call
[0b67a19]482stack and exception reuse it it is possible the cost for a second
483throw can be folded into the first.
484% Is this due to optimization?
485\item
[c9f9d4f]486Both types of \CFA are slightly slower if there is not a match.
[0b67a19]487For termination this likely comes from unwinding a bit more stack through
488libunwind instead of executing the code normally.
489For resumption there is extra work in traversing more of the list and running
490more checks for a matching exceptions.
491% Resumption is a bit high for that but this is my best theory.
492\item
493Then there is \Cpp, which takes 2--3 times longer to catch and rethrow vs.
494just the catch. This is very high, but it does have to repeat the same
495process of unwinding the stack and may have to parse the LSDA of the function
496with the catch and rethrow twice, once before the catch and once after the
497rethrow.
498% I spent a long time thinking of what could push it over twice, this is all
499% I have to explain it.
500\end{itemize}
501The difference in relative performance does show that there are savings to
502be made by performing the check without catching the exception.
[262deda0]503\end{comment}
504
505
506\begin{comment}
507From: Dave Dice <dave.dice@oracle.com>
508To: "Peter A. Buhr" <pabuhr@uwaterloo.ca>
509Subject: Re: [External] : JIT
510Date: Mon, 16 Aug 2021 01:21:56 +0000
511
512> On 2021-8-15, at 7:14 PM, Peter A. Buhr <pabuhr@uwaterloo.ca> wrote:
513>
514> My student is trying to measure the cost of installing a try block with a
515> finally clause in Java.
516>
517> We tried the random trick (see below). But if the try block is comment out, the
518> results are the same. So the program measures the calls to the random number
519> generator and there is no cost for installing the try block.
520>
521> Maybe there is no cost for a try block with an empty finally, i.e., the try is
522> optimized away from the get-go.
523
524There's quite a bit of optimization magic behind the HotSpot curtains for
525try-finally.  (I sound like the proverbial broken record (:>)).
526
527In many cases we can determine that the try block can't throw any exceptions,
528so we can elide all try-finally plumbing.  In other cases, we can convert the
529try-finally to normal if-then control flow, in the case where the exception is
530thrown into the same method.  This makes exceptions _almost cost-free.  If we
531actually need to "physically" rip down stacks, then things get expensive,
532impacting both the throw cost, and inhibiting other useful optimizations at the
533catch point.  Such "true" throws are not just expensive, they're _very
534expensive.  The extremely aggressive inlining used by the JIT helps, because we
535can convert cases where a heavy rip-down would normally needed back into simple
536control flow.
537
538Other quirks involve the thrown exception object.  If it's never accessed then
539we're apply a nice set of optimizations to avoid its construction.  If it's
540accessed but never escapes the catch frame (common) then we can also cheat.
541And if we find we're hitting lots of heavy rip-down cases, the JIT will
542consider recompilation - better inlining -- to see if we can merge the throw
543and catch into the same physical frame, and shift to simple branches.
544
545In your example below, System.out.print() can throw, I believe.  (I could be
546wrong, but most IO can throw).  Native calls that throw will "unwind" normally
547in C++ code until they hit the boundary where they reenter java emitted code,
548at which point the JIT-ed code checks for a potential pending exception.  So in
549a sense the throw point is implicitly after the call to the native method, so
550we can usually make those cases efficient.
551
552Also, when we're running in the interpreter and warming up, we'll notice that
553the == 42 case never occurs, and so when we start to JIT the code, we elide the
554call to System.out.print(), replacing it (and anything else which appears in
555that if x == 42 block) with a bit of code we call an "uncommon trap".  I'm
556presuming we encounter 42 rarely.  So if we ever hit the x == 42 case, control
557hits the trap, which triggers synchronous recompilation of the method, this
558time with the call to System.out.print() and, because of that, we now to adapt
559the new code to handle any traps thrown by print().  This is tricky stuff, as
560we may need to rebuild stack frames to reflect the newly emitted method.  And
561we have to construct a weird bit of "thunk" code that allows us to fall back
562directly into the newly emitted "if" block.  So there's a large one-time cost
563when we bump into the uncommon trap and recompile, and subsequent execution
564might get slightly slower as the exception could actually be generated, whereas
565before we hit the trap, we knew the exception could never be raised.
566
567Oh, and things also get expensive if we need to actually fill in the stack
568trace associated with the exception object.  Walking stacks is hellish.
569
570Quite a bit of effort was put into all this as some of the specjvm benchmarks
571showed significant benefit.
572
573It's hard to get sensible measurements as the JIT is working against you at
574every turn.  What's good for the normal user is awful for anybody trying to
575benchmark.  Also, all the magic results in fairly noisy and less reproducible
576results.
577
578Regards
579Dave
580
581p.s., I think I've mentioned this before, but throwing in C++ is grim as
582unrelated throws in different threads take common locks, so nothing scales as
583you might expect.
584\end{comment}
Note: See TracBrowser for help on using the repository browser.