\chapter{Performance}
\label{c:performance}

Performance is of secondary importance for most of this project.
Instead, the focus was to get the features working. The only performance
requirement is to ensure the tests for correctness run in a reasonable
amount of time. Hence, a few basic performance tests were performed to
check this requirement.

\section{Test Set-Up}
Tests were run in \CFA, C++, Java and Python.
In addition there are two sets of tests for \CFA,
one for termination and one for resumption exceptions.

C++ is the most comparable language because both it and \CFA use the same
framework, libunwind.
In fact, the comparison is almost entirely a quality of implementation.
Specifically, \CFA's EHM has had significantly less time to be optimized and
does not generate its own assembly. It does have a slight advantage in that
there are some features it handles directly instead of through utility functions,
but otherwise \Cpp should have a significant advantage.

Java is a popular language with similar termination semantics, but
it is implemented in a very different environment, a virtual machine with
garbage collection.
It also implements the @finally@ clause on @try@ blocks allowing for a direct
feature-to-feature comparison.
As with \Cpp, Java's implementation is mature, optimized
and has extra features.

Python is used as an alternative comparison because of the \CFA EHM's
current performance goals, which is not to be prohibitively slow while the
features are designed and examined. Python has similar performance goals for
creating quick scripts and its wide use suggests it has achieved those goals.

Unfortunately, there are no notable modern programming languages with
resumption exceptions. Even the older programming languages with resumption
seem to be notable only for having resumption.
So instead, resumption is compared to its simulation in other programming
languages using fixup functions that are explicitly passed for correction or
logging purposes.
% So instead, resumption is compared to a less similar but much more familiar
%feature, termination exceptions.

All tests are run inside a main loop that repeatedly performs a test.
This approach avoids start-up or tear-down time from
affecting the timing results.
Each test is run a N times (configurable from the command line).
The Java tests runs the main loop 1000 times before
beginning the actual test to ``warm-up" the JVM.

Timing is done internally, with time measured immediately before and
after the test loop. The difference is calculated and printed.
The loop structure and internal timing means it is impossible to test
unhandled exceptions in \Cpp and Java as that would cause the process to
terminate.
Luckily, performance on the ``give-up and kill the process" path is not
critical.

The exceptions used in these tests are always based off of
a base exception. This requirement minimizes performance differences based
on the object model used to represent the exception.

All tests are designed to be as minimal as possible, while still preventing
excessive optimizations.
For example, empty inline assembly blocks are used in \CFA and \Cpp to
prevent excessive optimizations while adding no actual work.
Each test was run eleven times. The top three and bottom three results were
discarded and the remaining five values are averaged.

The tests are compiled with gcc-10 for \CFA and g++-10 for \Cpp. Java is
compiled with version 11.0.11. Python with version 3.8. The tests were run on:
\begin{itemize}[nosep]
\item
ARM 2280 Kunpeng 920 48-core 2$\times$socket \lstinline{@} 2.6 GHz running Linux v5.11.0-25
\item
AMD 6380 Abu Dhabi 16-core 4$\times$socket \lstinline{@} 2.5 GHz running Linux v5.11.0-25
\end{itemize}
Two kinds of hardware architecture allows discriminating any implementation and
architectural effects.


% We don't use catch-alls but if we did:
% Catch-alls are done by catching the root exception type (not using \Cpp's
% \code{C++}{catch(...)}).

\section{Tests}
The following tests were selected to test the performance of different
components of the exception system.
They should provide a guide as to where the EHM's costs are found.

\paragraph{Raise and Handle}
This group measures the cost of a try statement when exceptions are raised and
the stack is unwound (termination) or not unwound (resumption).  Each test has
has a repeating function like the following
\begin{lstlisting}[language=CFA,{moredelim=**[is][\color{red}]{@}{@}}]
void unwind_empty(unsigned int frames) {
	if (frames) {
		@unwind_empty(frames - 1);@ // AUGMENTED IN OTHER EXPERIMENTS
	} else throw (empty_exception){&empty_vt};
}
\end{lstlisting}
which is called N times, where each call recurses to a depth of R (configurable from the command line), an
exception is raised, the stack is a unwound, and the exception caught.
\begin{itemize}[nosep]
\item Empty:
For termination, this test measures the cost of raising (stack walking) an
exception through empty stack frames from the bottom of the recursion to an
empty handler, and unwinding the stack. (see above code)

\medskip
For resumption, this test measures the same raising cost but does not unwind
the stack. For languages without resumption, a fixup function is to the bottom
of the recursion and called to simulate a fixup operation at that point.
\begin{cfa}
void nounwind_fixup(unsigned int frames, void (*raised_rtn)(int &)) {
	if (frames) {
		nounwind_fixup(frames - 1, raised_rtn);
	} else {
		int fixup = 17;
		raised_rtn(fixup);
	}
}
\end{cfa}
where the passed fixup function is:
\begin{cfa}
void raised(int & fixup) {
	fixup = 42;
}
\end{cfa}
For comparison, a \CFA version passing a function is also included.
\item Destructor:
This test measures the cost of raising an exception through non-empty frames,
where each frame has an object requiring destruction, from the bottom of the
recursion to an empty handler. Hence, there are N destructor calls during
unwinding.

\medskip
This test is not meaningful for resumption because the stack is only unwound as
the recursion returns.
\begin{cfa}
	WithDestructor object;
	unwind_destructor(frames - 1);
\end{cfa}
\item Finally:
This test measures the cost of establishing a try block with an empty finally
clause on the front side of the recursion and running the empty finally clauses
during stack unwinding from the bottom of the recursion to an empty handler.
\begin{cfa}
	try {
		unwind_finally(frames - 1);
	} finally {}
\end{cfa}

\medskip
This test is not meaningful for resumption because the stack is only unwound as
the recursion returns.
\item Other Handler:
For termination, this test is like the finally test but the try block has a
catch clause for an exception that is not raised, so catch matching is executed
during stack unwinding but the match never successes until the catch at the
bottom of the recursion.
\begin{cfa}
	try {
		unwind_other(frames - 1);
	} catch (not_raised_exception *) {}
\end{cfa}

\medskip
For resumption, this test measures the same raising cost but does not unwind
the stack. For languages without resumption, the same fixup function is passed
and called.
\end{itemize}

\paragraph{Try/Handle/Finally Statement}
This group measures just the cost of executing a try statement so
\emph{there is no stack unwinding}.  Hence, the program main loops N times
around:
\begin{cfa}
try {
} catch (not_raised_exception *) {}
\end{cfa}
\begin{itemize}[nosep]
\item Handler:
The try statement has a handler (catch/resume).
\item Finally:
The try statement has a finally clause.
\end{itemize}

\paragraph{Conditional Matching}
This group measures the cost of conditional matching.
Only \CFA implements the language level conditional match,
the other languages mimic with an ``unconditional" match (it still
checks the exception's type) and conditional re-raise if it is not suppose
to handle that exception.
\begin{center}
\begin{tabular}{ll}
\multicolumn{1}{c}{\CFA} & \multicolumn{1}{c}{\Cpp, Java, Python} \\
\begin{cfa}
try {
	throw_exception();
} catch (empty_exception * exc;
		 should_catch) {
}
\end{cfa}
&
\begin{cfa}
try {
	throw_exception();
} catch (EmptyException & exc) {
	if (!should_catch) throw;
}
\end{cfa}
\end{tabular}
\end{center}
\begin{itemize}[nosep]
\item Match All:
The condition is always true. (Always matches or never re-raises.)
\item Match None:
The condition is always false. (Never matches or always re-raises.)
\end{itemize}

\medskip
\noindent
All omitted test code for other languages is functionally identical to the \CFA
tests or simulated, and available online~\cite{CforallExceptionBenchmarks}.

%\section{Cost in Size}
%Using exceptions also has a cost in the size of the executable.
%Although it is sometimes ignored
%
%There is a size cost to defining a personality function but the later problem
%is the LSDA which will be generated for every function.
%
%(I haven't actually figured out how to compare this, probably using something
%related to -fexceptions.)

\section{Results}
One result not directly related to \CFA but important to keep in
mind is that, for exceptions, the standard intuition about which languages
should go faster often does not hold. For example, there are a few cases where Python out-performs
\CFA, \Cpp and Java. The most likely explanation is that, since exceptions are
rarely considered to be the common case, the more optimized languages
make that case expense. In addition, languages with high-level
representations have a much easier time scanning the stack as there is less
to decode.

Tables~\ref{t:PerformanceTermination} and~\ref{t:PerformanceResumption} show
the test results for termination and resumption, respectively.  In cases where
a feature is not supported by a language, the test is skipped for that language
(marked N/A).  For some Java experiments it was impossible to measure certain
effects because the JIT corrupted the test (marked N/C). No workaround was
possible~\cite{Dice21}.  To get experiments in the range of 1--100 seconds, the
number of times an experiment is run (N) is varied (N is marked beside each
experiment, e.g., 1M $\Rightarrow$ 1 million test iterations).

An anomaly exists with gcc nested functions used as thunks for implementing
much of the \CFA EHM. If a nested-function closure captures local variables in
its lexical scope, performance dropped by a factor of 10.  Specifically, in try
statements of the form:
\begin{cfa}
	try {
		unwind_other(frames - 1);
	} catch (not_raised_exception *) {}
\end{cfa}
the try block is hoisted into a nested function and the variable @frames@ is
the local parameter to the recursive function, which triggers the anomaly. The
workaround is to remove the recursion parameter and make it a global variable
that is explicitly decremented outside of the try block (marked with a ``*''):
\begin{cfa}
	frames -= 1;
	try {
		unwind_other();
	} catch (not_raised_exception *) {}
\end{cfa}
To make comparisons fair, a dummy parameter is added and the dummy value passed
in the recursion. Note, nested functions in gcc are rarely used (if not
completely unknown) and must follow the C calling convention, unlike \Cpp
lambdas, so it is not surprising if there are performance issues efficiently
capturing closures.

% Similarly, if a test does not change between resumption
% and termination in \CFA, then only one test is written and the result
% was put into the termination column.

% Raw Data:
% run-algol-a.sat
% ---------------
% Raise Empty   &  82687046678 &  291616256 &   3252824847 & 15422937623 & 14736271114 \\
% Raise D'tor   & 219933199603 &  297897792 & 223602799362 &         N/A &         N/A \\
% Raise Finally & 219703078448 &  298391745 &          N/A &         ... & 18923060958 \\
% Raise Other   & 296744104920 & 2854342084 & 112981255103 & 15475924808 & 21293137454 \\
% Cross Handler &      9256648 &   13518430 &       769328 &     3486252 &    31790804 \\
% Cross Finally &       769319 &        N/A &          N/A &     2272831 &    37491962 \\
% Match All     &   3654278402 &   47518560 &   3218907794 &  1296748192 &   624071886 \\
% Match None    &   4788861754 &   58418952 &   9458936430 &  1318065020 &   625200906 \\
%
% run-algol-thr-c
% ---------------
% Raise Empty   &   3757606400 &   36472972 &   3257803337 & 15439375452 & 14717808642 \\
% Raise D'tor   &  64546302019 &  102148375 & 223648121635 &         N/A &         N/A \\
% Raise Finally &  64671359172 &  103285005 &          N/A & 15442729458 & 18927008844 \\
% Raise Other   & 294143497130 & 2630130385 & 112969055576 & 15448220154 & 21279953424 \\
% Cross Handler &      9646462 &   11955668 &       769328 &     3453707 &    31864074 \\
% Cross Finally &       773412 &        N/A &          N/A &     2253825 &    37266476 \\
% Match All     &   3719462155 &   43294042 &   3223004977 &  1286054154 &   623887874 \\
% Match None    &   4971630929 &   55311709 &   9481225467 &  1310251289 &   623752624 \\
%
% run-algol-04-a
% --------------
% Raise Empty   & 0.0 & 0.0 &  3250260945 & 0.0 & 0.0 \\
% Raise D'tor   & 0.0 & 0.0 & 29017675113 & N/A & N/A \\
% Raise Finally & 0.0 & 0.0 &         N/A & 0.0 & 0.0 \\
% Raise Other   & 0.0 & 0.0 & 24411823773 & 0.0 & 0.0 \\
% Cross Handler & 0.0 & 0.0 &      769334 & 0.0 & 0.0 \\
% Cross Finally & 0.0 & N/A &         N/A & 0.0 & 0.0 \\
% Match All     & 0.0 & 0.0 &  3254283504 & 0.0 & 0.0 \\
% Match None    & 0.0 & 0.0 &  9476060146 & 0.0 & 0.0 \\

% run-plg7a-a.sat
% ---------------
% Raise Empty   &  57169011329 &  296612564 &   2788557155 & 17511466039 & 23324548496 \\
% Raise D'tor   & 150599858014 &  318443709 & 149651693682 &         N/A &         N/A \\
% Raise Finally & 148223145000 &  373325807 &          N/A &         ... & 29074552998 \\
% Raise Other   & 189463708732 & 3017109322 &  85819281694 & 17584295487 & 32602686679 \\
% Cross Handler &      8001654 &   13584858 &      1555995 &     6626775 &    41927358 \\
% Cross Finally &      1002473 &        N/A &          N/A &     4554344 &    51114381 \\
% Match All     &   3162460860 &   37315018 &   2649464591 &  1523205769 &   742374509 \\
% Match None    &   4054773797 &   47052659 &   7759229131 &  1555373654 &   744656403 \\
%
% run-plg7a-thr-a
% ---------------
% Raise Empty   &   3604235388 &   29829965 &   2786931833 & 17576506385 & 23352975105 \\
% Raise D'tor   &  46552380948 &  178709605 & 149834207219 &         N/A &         N/A \\
% Raise Finally &  46265157775 &  177906320 &          N/A & 17493045092 & 29170962959 \\
% Raise Other   & 195659245764 & 2376968982 &  86070431924 & 17552979675 & 32501882918 \\
% Cross Handler &    397031776 &   12503552 &      1451225 &     6658628 &    42304965 \\
% Cross Finally &      1136746 &        N/A &          N/A &     4468799 &    46155817 \\
% Match All     &   3189512499 &   39124453 &   2667795989 &  1525889031 &   733785613 \\
% Match None    &   4094675477 &   48749857 &   7850618572 &  1566713577 &   733478963 \\
%
% run-plg7a-04-a
% --------------
% 0.0 are unfilled.
% Raise Empty   & 0.0 & 0.0 &  2770781479 & 0.0 & 0.0 \\
% Raise D'tor   & 0.0 & 0.0 & 23530084907 & N/A & N/A \\
% Raise Finally & 0.0 & 0.0 &         N/A & 0.0 & 0.0 \\
% Raise Other   & 0.0 & 0.0 & 23816827982 & 0.0 & 0.0 \\
% Cross Handler & 0.0 & 0.0 &     1422188 & 0.0 & 0.0 \\
% Cross Finally & 0.0 & N/A &         N/A & 0.0 & 0.0 \\
% Match All     & 0.0 & 0.0 &  2671989778 & 0.0 & 0.0 \\
% Match None    & 0.0 & 0.0 &  7829059869 & 0.0 & 0.0 \\

\begin{table}
\centering
\caption{Performance Results Termination (sec)}
\label{t:PerformanceTermination}
\begin{tabular}{|r|*{2}{|r r r r|}}
\hline
			& \multicolumn{4}{c||}{AMD}		& \multicolumn{4}{c|}{ARM}	\\
\cline{2-9}
N\hspace{8pt}		& \multicolumn{1}{c}{\CFA} & \multicolumn{1}{c}{\Cpp} & \multicolumn{1}{c}{Java} & \multicolumn{1}{c||}{Python} &
			  \multicolumn{1}{c}{\CFA} & \multicolumn{1}{c}{\Cpp} & \multicolumn{1}{c}{Java} & \multicolumn{1}{c|}{Python} \\
\hline						                                    
Throw Empty (1M)	& 3.4	& 2.8	& 18.3	& 23.4		& 3.7	& 3.2	& 15.5	& 14.8	\\
Throw D'tor (1M)	& 48.4	& 23.6	& N/A	& N/A		& 64.2	& 29.0	& N/A	& N/A	\\
Throw Finally (1M)	& 3.4*	& N/A	& 17.9	& 29.0		& 4.1*	& N/A	& 15.6	& 19.0	\\
Throw Other (1M)	& 3.6*	& 23.2	& 18.2	& 32.7		& 4.0*	& 24.5	& 15.5	& 21.4	\\
Try/Catch (100M)	& 6.0	& 0.9	& N/C	& 37.4		& 10.0	& 0.8	& N/C	& 32.2	\\
Try/Finally (100M)	& 0.9	& N/A	& N/C	& 44.1		& 0.8	& N/A	& N/C	& 37.3	\\
Match All (10M)		& 32.9	& 20.7	& 13.4	& 4.9		& 36.2	& 24.5	& 12.0	& 3.1	\\
Match None (10M)	& 32.7	& 50.3	& 11.0	& 5.1		& 36.3	& 71.9	& 12.3	& 4.2	\\
\hline
\end{tabular}
\end{table}

\begin{table}
\centering
\small
\caption{Performance Results Resumption (sec)}
\label{t:PerformanceResumption}
\setlength{\tabcolsep}{5pt}
\begin{tabular}{|r|*{2}{|r r r r|}}
\hline
			& \multicolumn{4}{c||}{AMD}		& \multicolumn{4}{c|}{ARM}	\\
\cline{2-9}
N\hspace{8pt}		& \multicolumn{1}{c}{\CFA (R/F)} & \multicolumn{1}{c}{\Cpp} & \multicolumn{1}{c}{Java} & \multicolumn{1}{c||}{Python} &
			  \multicolumn{1}{c}{\CFA (R/F)} & \multicolumn{1}{c}{\Cpp} & \multicolumn{1}{c}{Java} & \multicolumn{1}{c|}{Python} \\
\hline						                                    
Resume Empty (10M)	& 3.8/3.5	& 14.7	& 2.3	& 176.1	& 0.3/0.1	& 8.9	& 1.2	& 119.9	\\
Resume Other (10M)	& 4.0*/0.1*	& 21.9	& 6.2	& 381.0	& 0.3*/0.1*	& 13.2	& 5.0	& 290.7	\\
Try/Resume (100M)	& 8.8		& N/A	& N/A	& N/A	& 12.3		& N/A	& N/A	& N/A	\\
Match All (10M)		& 0.3		& N/A	& N/A	& N/A	& 0.3		& N/A	& N/A	& N/A	\\
Match None (10M)	& 0.3		& N/A	& N/A	& N/A	& 0.4		& N/A	& N/A	& N/A	\\
\hline
\end{tabular}
\end{table}

As stated, the performance tests are not attempting to compare exception
handling across languages.  The only performance requirement is to ensure the
\CFA EHM implementation runs in a reasonable amount of time, given its
constraints. In general, the \CFA implement did very well. Each of the tests is
analysed.
\begin{description}
\item[Throw/Resume Empty]
For termination, \CFA is close to \Cpp, where other languages have a higher cost.

For resumption, \CFA is better than the fixup simulations in the other languages, except Java.
The \CFA results on the ARM computer for both resumption and function simulation are particularly low;
I have no explanation for this anomaly, except the optimizer has managed to remove part of the experiment.
Python has a high cost for passing the lambda during the recursion.

\item[Throw D'tor]
For termination, \CFA is twice the cost of \Cpp.
The higher cost for \CFA must be related to how destructors are handled.

\item[Throw Finally]
\CFA is better than the other languages with a @finally@ clause, which is the
same for termination and resumption.

\item[Throw/Resume Other]
For termination, \CFA is better than the other languages.

For resumption, \CFA is equal to or better the other languages.
Again, the \CFA results on the ARM computer for both resumption and function simulation are particularly low.
Python has a high cost for passing the lambda during the recursion.

\item[Try/Catch/Resume]
For termination, installing a try statement is more expressive than \Cpp
because the try components are hoisted into local functions.  At runtime, these
functions are than passed to libunwind functions to set up the try statement.
\Cpp zero-cost try-entry accounts for its performance advantage.

For resumption, there are similar costs to termination to set up the try
statement but libunwind is not used.

\item[Try/Finally]
Setting up a try finally is less expensive in \CFA than setting up handlers,
and is significantly less than other languages.

\item[Throw/Resume Match All]
For termination, \CFA is close to the other language simulations.

For resumption, the stack unwinding is much faster because it does not use
libunwind.  Instead resumption is just traversing a linked list with each node
being the next stack frame with the try block.

\item[Throw/Resume Match None]
The same results as for Match All.
\end{description}

\begin{comment}
This observation means that while \CFA does not actually keep up with Python in
every case, it is usually no worse than roughly half the speed of \Cpp. This
performance is good enough for the prototyping purposes of the project.

The test case where \CFA falls short is Raise Other, the case where the
stack is unwound including a bunch of non-matching handlers.
This slowdown seems to come from missing optimizations.

This suggests that the performance issue in Raise Other is just an
optimization not being applied. Later versions of gcc may be able to
optimize this case further, at least down to the half of \Cpp mark.
A \CFA compiler that directly produced assembly could do even better as it
would not have to work across some of \CFA's current abstractions, like
the try terminate function.

Resumption exception handling is also incredibly fast. Often an order of
magnitude or two better than the best termination speed.
There is a simple explanation for this; traversing a linked list is much   
faster than examining and unwinding the stack. When resumption does not do as
well its when more try statements are used per raise. Updating the internal
linked list is not very expensive but it does add up.

The relative speed of the Match All and Match None tests (within each
language) can also show the effectiveness conditional matching as compared
to catch and rethrow.
\begin{itemize}[nosep]
\item
Java and Python get similar values in both tests.
Between the interpreted code, a higher level representation of the call
stack and exception reuse it it is possible the cost for a second
throw can be folded into the first.
% Is this due to optimization?
\item
Both types of \CFA are slightly slower if there is not a match.
For termination this likely comes from unwinding a bit more stack through
libunwind instead of executing the code normally.
For resumption there is extra work in traversing more of the list and running
more checks for a matching exceptions.
% Resumption is a bit high for that but this is my best theory.
\item
Then there is \Cpp, which takes 2--3 times longer to catch and rethrow vs.
just the catch. This is very high, but it does have to repeat the same
process of unwinding the stack and may have to parse the LSDA of the function
with the catch and rethrow twice, once before the catch and once after the
rethrow.
% I spent a long time thinking of what could push it over twice, this is all
% I have to explain it.
\end{itemize}
The difference in relative performance does show that there are savings to
be made by performing the check without catching the exception.
\end{comment}


\begin{comment}
From: Dave Dice <dave.dice@oracle.com>
To: "Peter A. Buhr" <pabuhr@uwaterloo.ca>
Subject: Re: [External] : JIT
Date: Mon, 16 Aug 2021 01:21:56 +0000

> On 2021-8-15, at 7:14 PM, Peter A. Buhr <pabuhr@uwaterloo.ca> wrote:
> 
> My student is trying to measure the cost of installing a try block with a
> finally clause in Java.
> 
> We tried the random trick (see below). But if the try block is comment out, the
> results are the same. So the program measures the calls to the random number
> generator and there is no cost for installing the try block.
> 
> Maybe there is no cost for a try block with an empty finally, i.e., the try is
> optimized away from the get-go.

There's quite a bit of optimization magic behind the HotSpot curtains for
try-finally.  (I sound like the proverbial broken record (:>)).

In many cases we can determine that the try block can't throw any exceptions,
so we can elide all try-finally plumbing.  In other cases, we can convert the
try-finally to normal if-then control flow, in the case where the exception is
thrown into the same method.  This makes exceptions _almost cost-free.  If we
actually need to "physically" rip down stacks, then things get expensive,
impacting both the throw cost, and inhibiting other useful optimizations at the
catch point.  Such "true" throws are not just expensive, they're _very
expensive.  The extremely aggressive inlining used by the JIT helps, because we
can convert cases where a heavy rip-down would normally needed back into simple
control flow.

Other quirks involve the thrown exception object.  If it's never accessed then
we're apply a nice set of optimizations to avoid its construction.  If it's
accessed but never escapes the catch frame (common) then we can also cheat.
And if we find we're hitting lots of heavy rip-down cases, the JIT will
consider recompilation - better inlining -- to see if we can merge the throw
and catch into the same physical frame, and shift to simple branches.

In your example below, System.out.print() can throw, I believe.  (I could be
wrong, but most IO can throw).  Native calls that throw will "unwind" normally
in C++ code until they hit the boundary where they reenter java emitted code,
at which point the JIT-ed code checks for a potential pending exception.  So in
a sense the throw point is implicitly after the call to the native method, so
we can usually make those cases efficient.

Also, when we're running in the interpreter and warming up, we'll notice that
the == 42 case never occurs, and so when we start to JIT the code, we elide the
call to System.out.print(), replacing it (and anything else which appears in
that if x == 42 block) with a bit of code we call an "uncommon trap".  I'm
presuming we encounter 42 rarely.  So if we ever hit the x == 42 case, control
hits the trap, which triggers synchronous recompilation of the method, this
time with the call to System.out.print() and, because of that, we now to adapt
the new code to handle any traps thrown by print().  This is tricky stuff, as
we may need to rebuild stack frames to reflect the newly emitted method.  And
we have to construct a weird bit of "thunk" code that allows us to fall back
directly into the newly emitted "if" block.  So there's a large one-time cost
when we bump into the uncommon trap and recompile, and subsequent execution
might get slightly slower as the exception could actually be generated, whereas
before we hit the trap, we knew the exception could never be raised.

Oh, and things also get expensive if we need to actually fill in the stack
trace associated with the exception object.  Walking stacks is hellish.

Quite a bit of effort was put into all this as some of the specjvm benchmarks
showed significant benefit.

It's hard to get sensible measurements as the JIT is working against you at
every turn.  What's good for the normal user is awful for anybody trying to
benchmark.  Also, all the magic results in fairly noisy and less reproducible
results.

Regards
Dave

p.s., I think I've mentioned this before, but throwing in C++ is grim as
unrelated throws in different threads take common locks, so nothing scales as
you might expect.
\end{comment}