1 | \chapter{Performance}
|
---|
2 | \label{c:performance}
|
---|
3 |
|
---|
4 | Performance is of secondary importance for most of this project.
|
---|
5 | Instead, the focus was to get the features working. The only performance
|
---|
6 | requirement is to ensure the tests for correctness run in a reasonable
|
---|
7 | amount of time. Hence, a few basic performance tests were performed to
|
---|
8 | check this requirement.
|
---|
9 |
|
---|
10 | \section{Test Set-Up}
|
---|
11 | Tests were run in \CFA, C++, Java and Python.
|
---|
12 | In addition there are two sets of tests for \CFA,
|
---|
13 | one for termination and one for resumption exceptions.
|
---|
14 |
|
---|
15 | C++ is the most comparable language because both it and \CFA use the same
|
---|
16 | framework, libunwind.
|
---|
17 | In fact, the comparison is almost entirely a quality of implementation.
|
---|
18 | Specifically, \CFA's EHM has had significantly less time to be optimized and
|
---|
19 | does not generate its own assembly. It does have a slight advantage in that
|
---|
20 | there are some features it handles directly instead of through utility functions,
|
---|
21 | but otherwise \Cpp should have a significant advantage.
|
---|
22 |
|
---|
23 | Java is a popular language with similar termination semantics, but
|
---|
24 | it is implemented in a very different environment, a virtual machine with
|
---|
25 | garbage collection.
|
---|
26 | It also implements the @finally@ clause on @try@ blocks allowing for a direct
|
---|
27 | feature-to-feature comparison.
|
---|
28 | As with \Cpp, Java's implementation is mature, optimized
|
---|
29 | and has extra features.
|
---|
30 |
|
---|
31 | Python is used as an alternative comparison because of the \CFA EHM's
|
---|
32 | current performance goals, which is not to be prohibitively slow while the
|
---|
33 | features are designed and examined. Python has similar performance goals for
|
---|
34 | creating quick scripts and its wide use suggests it has achieved those goals.
|
---|
35 |
|
---|
36 | Unfortunately, there are no notable modern programming languages with
|
---|
37 | resumption exceptions. Even the older programming languages with resumption
|
---|
38 | seem to be notable only for having resumption.
|
---|
39 | So instead, resumption is compared to its simulation in other programming
|
---|
40 | languages using fixup functions that are explicitly passed for correction or
|
---|
41 | logging purposes.
|
---|
42 | % So instead, resumption is compared to a less similar but much more familiar
|
---|
43 | %feature, termination exceptions.
|
---|
44 |
|
---|
45 | All tests are run inside a main loop that repeatedly performs a test.
|
---|
46 | This approach avoids start-up or tear-down time from
|
---|
47 | affecting the timing results.
|
---|
48 | Each test is run a N times (configurable from the command line).
|
---|
49 | The Java tests runs the main loop 1000 times before
|
---|
50 | beginning the actual test to ``warm-up" the JVM.
|
---|
51 |
|
---|
52 | Timing is done internally, with time measured immediately before and
|
---|
53 | after the test loop. The difference is calculated and printed.
|
---|
54 | The loop structure and internal timing means it is impossible to test
|
---|
55 | unhandled exceptions in \Cpp and Java as that would cause the process to
|
---|
56 | terminate.
|
---|
57 | Luckily, performance on the ``give-up and kill the process" path is not
|
---|
58 | critical.
|
---|
59 |
|
---|
60 | The exceptions used in these tests are always based off of
|
---|
61 | a base exception. This requirement minimizes performance differences based
|
---|
62 | on the object model used to represent the exception.
|
---|
63 |
|
---|
64 | All tests are designed to be as minimal as possible, while still preventing
|
---|
65 | excessive optimizations.
|
---|
66 | For example, empty inline assembly blocks are used in \CFA and \Cpp to
|
---|
67 | prevent excessive optimizations while adding no actual work.
|
---|
68 | Each test was run eleven times. The top three and bottom three results were
|
---|
69 | discarded and the remaining five values are averaged.
|
---|
70 |
|
---|
71 | The tests are compiled with gcc-10 for \CFA and g++-10 for \Cpp. Java is
|
---|
72 | compiled with version 11.0.11. Python with version 3.8. The tests were run on:
|
---|
73 | \begin{itemize}[nosep]
|
---|
74 | \item
|
---|
75 | ARM 2280 Kunpeng 920 48-core 2$\times$socket \lstinline{@} 2.6 GHz running Linux v5.11.0-25
|
---|
76 | \item
|
---|
77 | AMD 6380 Abu Dhabi 16-core 4$\times$socket \lstinline{@} 2.5 GHz running Linux v5.11.0-25
|
---|
78 | \end{itemize}
|
---|
79 | Two kinds of hardware architecture allows discriminating any implementation and
|
---|
80 | architectural effects.
|
---|
81 |
|
---|
82 |
|
---|
83 | % We don't use catch-alls but if we did:
|
---|
84 | % Catch-alls are done by catching the root exception type (not using \Cpp's
|
---|
85 | % \code{C++}{catch(...)}).
|
---|
86 |
|
---|
87 | \section{Tests}
|
---|
88 | The following tests were selected to test the performance of different
|
---|
89 | components of the exception system.
|
---|
90 | They should provide a guide as to where the EHM's costs are found.
|
---|
91 |
|
---|
92 | \paragraph{Raise and Handle}
|
---|
93 | This group measures the cost of a try statement when exceptions are raised and
|
---|
94 | the stack is unwound (termination) or not unwound (resumption). Each test has
|
---|
95 | has a repeating function like the following
|
---|
96 | \begin{lstlisting}[language=CFA,{moredelim=**[is][\color{red}]{@}{@}}]
|
---|
97 | void unwind_empty(unsigned int frames) {
|
---|
98 | if (frames) {
|
---|
99 | @unwind_empty(frames - 1);@ // AUGMENTED IN OTHER EXPERIMENTS
|
---|
100 | } else throw (empty_exception){&empty_vt};
|
---|
101 | }
|
---|
102 | \end{lstlisting}
|
---|
103 | which is called N times, where each call recurses to a depth of R (configurable from the command line), an
|
---|
104 | exception is raised, the stack is a unwound, and the exception caught.
|
---|
105 | \begin{itemize}[nosep]
|
---|
106 | \item Empty:
|
---|
107 | For termination, this test measures the cost of raising (stack walking) an
|
---|
108 | exception through empty stack frames from the bottom of the recursion to an
|
---|
109 | empty handler, and unwinding the stack. (see above code)
|
---|
110 |
|
---|
111 | \medskip
|
---|
112 | For resumption, this test measures the same raising cost but does not unwind
|
---|
113 | the stack. For languages without resumption, a fixup function is to the bottom
|
---|
114 | of the recursion and called to simulate a fixup operation at that point.
|
---|
115 | \begin{cfa}
|
---|
116 | void nounwind_fixup(unsigned int frames, void (*raised_rtn)(int &)) {
|
---|
117 | if (frames) {
|
---|
118 | nounwind_fixup(frames - 1, raised_rtn);
|
---|
119 | } else {
|
---|
120 | int fixup = 17;
|
---|
121 | raised_rtn(fixup);
|
---|
122 | }
|
---|
123 | }
|
---|
124 | \end{cfa}
|
---|
125 | where the passed fixup function is:
|
---|
126 | \begin{cfa}
|
---|
127 | void raised(int & fixup) {
|
---|
128 | fixup = 42;
|
---|
129 | }
|
---|
130 | \end{cfa}
|
---|
131 | For comparison, a \CFA version passing a function is also included.
|
---|
132 | \item Destructor:
|
---|
133 | This test measures the cost of raising an exception through non-empty frames,
|
---|
134 | where each frame has an object requiring destruction, from the bottom of the
|
---|
135 | recursion to an empty handler. Hence, there are N destructor calls during
|
---|
136 | unwinding.
|
---|
137 |
|
---|
138 | \medskip
|
---|
139 | This test is not meaningful for resumption because the stack is only unwound as
|
---|
140 | the recursion returns.
|
---|
141 | \begin{cfa}
|
---|
142 | WithDestructor object;
|
---|
143 | unwind_destructor(frames - 1);
|
---|
144 | \end{cfa}
|
---|
145 | \item Finally:
|
---|
146 | This test measures the cost of establishing a try block with an empty finally
|
---|
147 | clause on the front side of the recursion and running the empty finally clauses
|
---|
148 | during stack unwinding from the bottom of the recursion to an empty handler.
|
---|
149 | \begin{cfa}
|
---|
150 | try {
|
---|
151 | unwind_finally(frames - 1);
|
---|
152 | } finally {}
|
---|
153 | \end{cfa}
|
---|
154 |
|
---|
155 | \medskip
|
---|
156 | This test is not meaningful for resumption because the stack is only unwound as
|
---|
157 | the recursion returns.
|
---|
158 | \item Other Handler:
|
---|
159 | For termination, this test is like the finally test but the try block has a
|
---|
160 | catch clause for an exception that is not raised, so catch matching is executed
|
---|
161 | during stack unwinding but the match never successes until the catch at the
|
---|
162 | bottom of the recursion.
|
---|
163 | \begin{cfa}
|
---|
164 | try {
|
---|
165 | unwind_other(frames - 1);
|
---|
166 | } catch (not_raised_exception *) {}
|
---|
167 | \end{cfa}
|
---|
168 |
|
---|
169 | \medskip
|
---|
170 | For resumption, this test measures the same raising cost but does not unwind
|
---|
171 | the stack. For languages without resumption, the same fixup function is passed
|
---|
172 | and called.
|
---|
173 | \end{itemize}
|
---|
174 |
|
---|
175 | \paragraph{Try/Handle/Finally Statement}
|
---|
176 | This group measures just the cost of executing a try statement so
|
---|
177 | \emph{there is no stack unwinding}. Hence, the program main loops N times
|
---|
178 | around:
|
---|
179 | \begin{cfa}
|
---|
180 | try {
|
---|
181 | } catch (not_raised_exception *) {}
|
---|
182 | \end{cfa}
|
---|
183 | \begin{itemize}[nosep]
|
---|
184 | \item Handler:
|
---|
185 | The try statement has a handler (catch/resume).
|
---|
186 | \item Finally:
|
---|
187 | The try statement has a finally clause.
|
---|
188 | \end{itemize}
|
---|
189 |
|
---|
190 | \paragraph{Conditional Matching}
|
---|
191 | This group measures the cost of conditional matching.
|
---|
192 | Only \CFA implements the language level conditional match,
|
---|
193 | the other languages mimic with an ``unconditional" match (it still
|
---|
194 | checks the exception's type) and conditional re-raise if it is not suppose
|
---|
195 | to handle that exception.
|
---|
196 | \begin{center}
|
---|
197 | \begin{tabular}{ll}
|
---|
198 | \multicolumn{1}{c}{\CFA} & \multicolumn{1}{c}{\Cpp, Java, Python} \\
|
---|
199 | \begin{cfa}
|
---|
200 | try {
|
---|
201 | throw_exception();
|
---|
202 | } catch (empty_exception * exc;
|
---|
203 | should_catch) {
|
---|
204 | }
|
---|
205 | \end{cfa}
|
---|
206 | &
|
---|
207 | \begin{cfa}
|
---|
208 | try {
|
---|
209 | throw_exception();
|
---|
210 | } catch (EmptyException & exc) {
|
---|
211 | if (!should_catch) throw;
|
---|
212 | }
|
---|
213 | \end{cfa}
|
---|
214 | \end{tabular}
|
---|
215 | \end{center}
|
---|
216 | \begin{itemize}[nosep]
|
---|
217 | \item Match All:
|
---|
218 | The condition is always true. (Always matches or never re-raises.)
|
---|
219 | \item Match None:
|
---|
220 | The condition is always false. (Never matches or always re-raises.)
|
---|
221 | \end{itemize}
|
---|
222 |
|
---|
223 | \medskip
|
---|
224 | \noindent
|
---|
225 | All omitted test code for other languages is functionally identical to the \CFA
|
---|
226 | tests or simulated, and available online~\cite{CforallExceptionBenchmarks}.
|
---|
227 |
|
---|
228 | %\section{Cost in Size}
|
---|
229 | %Using exceptions also has a cost in the size of the executable.
|
---|
230 | %Although it is sometimes ignored
|
---|
231 | %
|
---|
232 | %There is a size cost to defining a personality function but the later problem
|
---|
233 | %is the LSDA which will be generated for every function.
|
---|
234 | %
|
---|
235 | %(I haven't actually figured out how to compare this, probably using something
|
---|
236 | %related to -fexceptions.)
|
---|
237 |
|
---|
238 | \section{Results}
|
---|
239 | One result not directly related to \CFA but important to keep in
|
---|
240 | mind is that, for exceptions, the standard intuition about which languages
|
---|
241 | should go faster often does not hold. For example, there are a few cases where Python out-performs
|
---|
242 | \CFA, \Cpp and Java. The most likely explanation is that, since exceptions are
|
---|
243 | rarely considered to be the common case, the more optimized languages
|
---|
244 | make that case expense. In addition, languages with high-level
|
---|
245 | representations have a much easier time scanning the stack as there is less
|
---|
246 | to decode.
|
---|
247 |
|
---|
248 | Tables~\ref{t:PerformanceTermination} and~\ref{t:PerformanceResumption} show
|
---|
249 | the test results for termination and resumption, respectively. In cases where
|
---|
250 | a feature is not supported by a language, the test is skipped for that language
|
---|
251 | (marked N/A). For some Java experiments it was impossible to measure certain
|
---|
252 | effects because the JIT corrupted the test (marked N/C). No workaround was
|
---|
253 | possible~\cite{Dice21}. To get experiments in the range of 1--100 seconds, the
|
---|
254 | number of times an experiment is run (N) is varied (N is marked beside each
|
---|
255 | experiment, e.g., 1M $\Rightarrow$ 1 million test iterations).
|
---|
256 |
|
---|
257 | An anomaly exists with gcc nested functions used as thunks for implementing
|
---|
258 | much of the \CFA EHM. If a nested-function closure captures local variables in
|
---|
259 | its lexical scope, performance dropped by a factor of 10. Specifically, in try
|
---|
260 | statements of the form:
|
---|
261 | \begin{cfa}
|
---|
262 | try {
|
---|
263 | unwind_other(frames - 1);
|
---|
264 | } catch (not_raised_exception *) {}
|
---|
265 | \end{cfa}
|
---|
266 | the try block is hoisted into a nested function and the variable @frames@ is
|
---|
267 | the local parameter to the recursive function, which triggers the anomaly. The
|
---|
268 | workaround is to remove the recursion parameter and make it a global variable
|
---|
269 | that is explicitly decremented outside of the try block (marked with a ``*''):
|
---|
270 | \begin{cfa}
|
---|
271 | frames -= 1;
|
---|
272 | try {
|
---|
273 | unwind_other();
|
---|
274 | } catch (not_raised_exception *) {}
|
---|
275 | \end{cfa}
|
---|
276 | To make comparisons fair, a dummy parameter is added and the dummy value passed
|
---|
277 | in the recursion. Note, nested functions in gcc are rarely used (if not
|
---|
278 | completely unknown) and must follow the C calling convention, unlike \Cpp
|
---|
279 | lambdas, so it is not surprising if there are performance issues efficiently
|
---|
280 | capturing closures.
|
---|
281 |
|
---|
282 | % Similarly, if a test does not change between resumption
|
---|
283 | % and termination in \CFA, then only one test is written and the result
|
---|
284 | % was put into the termination column.
|
---|
285 |
|
---|
286 | % Raw Data:
|
---|
287 | % run-algol-a.sat
|
---|
288 | % ---------------
|
---|
289 | % Raise Empty & 82687046678 & 291616256 & 3252824847 & 15422937623 & 14736271114 \\
|
---|
290 | % Raise D'tor & 219933199603 & 297897792 & 223602799362 & N/A & N/A \\
|
---|
291 | % Raise Finally & 219703078448 & 298391745 & N/A & ... & 18923060958 \\
|
---|
292 | % Raise Other & 296744104920 & 2854342084 & 112981255103 & 15475924808 & 21293137454 \\
|
---|
293 | % Cross Handler & 9256648 & 13518430 & 769328 & 3486252 & 31790804 \\
|
---|
294 | % Cross Finally & 769319 & N/A & N/A & 2272831 & 37491962 \\
|
---|
295 | % Match All & 3654278402 & 47518560 & 3218907794 & 1296748192 & 624071886 \\
|
---|
296 | % Match None & 4788861754 & 58418952 & 9458936430 & 1318065020 & 625200906 \\
|
---|
297 | %
|
---|
298 | % run-algol-thr-c
|
---|
299 | % ---------------
|
---|
300 | % Raise Empty & 3757606400 & 36472972 & 3257803337 & 15439375452 & 14717808642 \\
|
---|
301 | % Raise D'tor & 64546302019 & 102148375 & 223648121635 & N/A & N/A \\
|
---|
302 | % Raise Finally & 64671359172 & 103285005 & N/A & 15442729458 & 18927008844 \\
|
---|
303 | % Raise Other & 294143497130 & 2630130385 & 112969055576 & 15448220154 & 21279953424 \\
|
---|
304 | % Cross Handler & 9646462 & 11955668 & 769328 & 3453707 & 31864074 \\
|
---|
305 | % Cross Finally & 773412 & N/A & N/A & 2253825 & 37266476 \\
|
---|
306 | % Match All & 3719462155 & 43294042 & 3223004977 & 1286054154 & 623887874 \\
|
---|
307 | % Match None & 4971630929 & 55311709 & 9481225467 & 1310251289 & 623752624 \\
|
---|
308 | %
|
---|
309 | % run-algol-04-a
|
---|
310 | % --------------
|
---|
311 | % Raise Empty & 0.0 & 0.0 & 3250260945 & 0.0 & 0.0 \\
|
---|
312 | % Raise D'tor & 0.0 & 0.0 & 29017675113 & N/A & N/A \\
|
---|
313 | % Raise Finally & 0.0 & 0.0 & N/A & 0.0 & 0.0 \\
|
---|
314 | % Raise Other & 0.0 & 0.0 & 24411823773 & 0.0 & 0.0 \\
|
---|
315 | % Cross Handler & 0.0 & 0.0 & 769334 & 0.0 & 0.0 \\
|
---|
316 | % Cross Finally & 0.0 & N/A & N/A & 0.0 & 0.0 \\
|
---|
317 | % Match All & 0.0 & 0.0 & 3254283504 & 0.0 & 0.0 \\
|
---|
318 | % Match None & 0.0 & 0.0 & 9476060146 & 0.0 & 0.0 \\
|
---|
319 |
|
---|
320 | % run-plg7a-a.sat
|
---|
321 | % ---------------
|
---|
322 | % Raise Empty & 57169011329 & 296612564 & 2788557155 & 17511466039 & 23324548496 \\
|
---|
323 | % Raise D'tor & 150599858014 & 318443709 & 149651693682 & N/A & N/A \\
|
---|
324 | % Raise Finally & 148223145000 & 373325807 & N/A & ... & 29074552998 \\
|
---|
325 | % Raise Other & 189463708732 & 3017109322 & 85819281694 & 17584295487 & 32602686679 \\
|
---|
326 | % Cross Handler & 8001654 & 13584858 & 1555995 & 6626775 & 41927358 \\
|
---|
327 | % Cross Finally & 1002473 & N/A & N/A & 4554344 & 51114381 \\
|
---|
328 | % Match All & 3162460860 & 37315018 & 2649464591 & 1523205769 & 742374509 \\
|
---|
329 | % Match None & 4054773797 & 47052659 & 7759229131 & 1555373654 & 744656403 \\
|
---|
330 | %
|
---|
331 | % run-plg7a-thr-a
|
---|
332 | % ---------------
|
---|
333 | % Raise Empty & 3604235388 & 29829965 & 2786931833 & 17576506385 & 23352975105 \\
|
---|
334 | % Raise D'tor & 46552380948 & 178709605 & 149834207219 & N/A & N/A \\
|
---|
335 | % Raise Finally & 46265157775 & 177906320 & N/A & 17493045092 & 29170962959 \\
|
---|
336 | % Raise Other & 195659245764 & 2376968982 & 86070431924 & 17552979675 & 32501882918 \\
|
---|
337 | % Cross Handler & 397031776 & 12503552 & 1451225 & 6658628 & 42304965 \\
|
---|
338 | % Cross Finally & 1136746 & N/A & N/A & 4468799 & 46155817 \\
|
---|
339 | % Match All & 3189512499 & 39124453 & 2667795989 & 1525889031 & 733785613 \\
|
---|
340 | % Match None & 4094675477 & 48749857 & 7850618572 & 1566713577 & 733478963 \\
|
---|
341 | %
|
---|
342 | % run-plg7a-04-a
|
---|
343 | % --------------
|
---|
344 | % 0.0 are unfilled.
|
---|
345 | % Raise Empty & 0.0 & 0.0 & 2770781479 & 0.0 & 0.0 \\
|
---|
346 | % Raise D'tor & 0.0 & 0.0 & 23530084907 & N/A & N/A \\
|
---|
347 | % Raise Finally & 0.0 & 0.0 & N/A & 0.0 & 0.0 \\
|
---|
348 | % Raise Other & 0.0 & 0.0 & 23816827982 & 0.0 & 0.0 \\
|
---|
349 | % Cross Handler & 0.0 & 0.0 & 1422188 & 0.0 & 0.0 \\
|
---|
350 | % Cross Finally & 0.0 & N/A & N/A & 0.0 & 0.0 \\
|
---|
351 | % Match All & 0.0 & 0.0 & 2671989778 & 0.0 & 0.0 \\
|
---|
352 | % Match None & 0.0 & 0.0 & 7829059869 & 0.0 & 0.0 \\
|
---|
353 |
|
---|
354 | \begin{table}
|
---|
355 | \centering
|
---|
356 | \caption{Performance Results Termination (sec)}
|
---|
357 | \label{t:PerformanceTermination}
|
---|
358 | \begin{tabular}{|r|*{2}{|r r r r|}}
|
---|
359 | \hline
|
---|
360 | & \multicolumn{4}{c||}{AMD} & \multicolumn{4}{c|}{ARM} \\
|
---|
361 | \cline{2-9}
|
---|
362 | N\hspace{8pt} & \multicolumn{1}{c}{\CFA} & \multicolumn{1}{c}{\Cpp} & \multicolumn{1}{c}{Java} & \multicolumn{1}{c||}{Python} &
|
---|
363 | \multicolumn{1}{c}{\CFA} & \multicolumn{1}{c}{\Cpp} & \multicolumn{1}{c}{Java} & \multicolumn{1}{c|}{Python} \\
|
---|
364 | \hline
|
---|
365 | Throw Empty (1M) & 3.4 & 2.8 & 18.3 & 23.4 & 3.7 & 3.2 & 15.5 & 14.8 \\
|
---|
366 | Throw D'tor (1M) & 48.4 & 23.6 & N/A & N/A & 64.2 & 29.0 & N/A & N/A \\
|
---|
367 | Throw Finally (1M) & 3.4* & N/A & 17.9 & 29.0 & 4.1* & N/A & 15.6 & 19.0 \\
|
---|
368 | Throw Other (1M) & 3.6* & 23.2 & 18.2 & 32.7 & 4.0* & 24.5 & 15.5 & 21.4 \\
|
---|
369 | Try/Catch (100M) & 6.0 & 0.9 & N/C & 37.4 & 10.0 & 0.8 & N/C & 32.2 \\
|
---|
370 | Try/Finally (100M) & 0.9 & N/A & N/C & 44.1 & 0.8 & N/A & N/C & 37.3 \\
|
---|
371 | Match All (10M) & 32.9 & 20.7 & 13.4 & 4.9 & 36.2 & 24.5 & 12.0 & 3.1 \\
|
---|
372 | Match None (10M) & 32.7 & 50.3 & 11.0 & 5.1 & 36.3 & 71.9 & 12.3 & 4.2 \\
|
---|
373 | \hline
|
---|
374 | \end{tabular}
|
---|
375 | \end{table}
|
---|
376 |
|
---|
377 | \begin{table}
|
---|
378 | \centering
|
---|
379 | \small
|
---|
380 | \caption{Performance Results Resumption (sec)}
|
---|
381 | \label{t:PerformanceResumption}
|
---|
382 | \setlength{\tabcolsep}{5pt}
|
---|
383 | \begin{tabular}{|r|*{2}{|r r r r|}}
|
---|
384 | \hline
|
---|
385 | & \multicolumn{4}{c||}{AMD} & \multicolumn{4}{c|}{ARM} \\
|
---|
386 | \cline{2-9}
|
---|
387 | N\hspace{8pt} & \multicolumn{1}{c}{\CFA (R/F)} & \multicolumn{1}{c}{\Cpp} & \multicolumn{1}{c}{Java} & \multicolumn{1}{c||}{Python} &
|
---|
388 | \multicolumn{1}{c}{\CFA (R/F)} & \multicolumn{1}{c}{\Cpp} & \multicolumn{1}{c}{Java} & \multicolumn{1}{c|}{Python} \\
|
---|
389 | \hline
|
---|
390 | Resume Empty (10M) & 3.8/3.5 & 14.7 & 2.3 & 176.1 & 0.3/0.1 & 8.9 & 1.2 & 119.9 \\
|
---|
391 | Resume Other (10M) & 4.0*/0.1* & 21.9 & 6.2 & 381.0 & 0.3*/0.1* & 13.2 & 5.0 & 290.7 \\
|
---|
392 | Try/Resume (100M) & 8.8 & N/A & N/A & N/A & 12.3 & N/A & N/A & N/A \\
|
---|
393 | Match All (10M) & 0.3 & N/A & N/A & N/A & 0.3 & N/A & N/A & N/A \\
|
---|
394 | Match None (10M) & 0.3 & N/A & N/A & N/A & 0.4 & N/A & N/A & N/A \\
|
---|
395 | \hline
|
---|
396 | \end{tabular}
|
---|
397 | \end{table}
|
---|
398 |
|
---|
399 | As stated, the performance tests are not attempting to compare exception
|
---|
400 | handling across languages. The only performance requirement is to ensure the
|
---|
401 | \CFA EHM implementation runs in a reasonable amount of time, given its
|
---|
402 | constraints. In general, the \CFA implement did very well. Each of the tests is
|
---|
403 | analysed.
|
---|
404 | \begin{description}
|
---|
405 | \item[Throw/Resume Empty]
|
---|
406 | For termination, \CFA is close to \Cpp, where other languages have a higher cost.
|
---|
407 |
|
---|
408 | For resumption, \CFA is better than the fixup simulations in the other languages, except Java.
|
---|
409 | The \CFA results on the ARM computer for both resumption and function simulation are particularly low;
|
---|
410 | I have no explanation for this anomaly, except the optimizer has managed to remove part of the experiment.
|
---|
411 | Python has a high cost for passing the lambda during the recursion.
|
---|
412 |
|
---|
413 | \item[Throw D'tor]
|
---|
414 | For termination, \CFA is twice the cost of \Cpp.
|
---|
415 | The higher cost for \CFA must be related to how destructors are handled.
|
---|
416 |
|
---|
417 | \item[Throw Finally]
|
---|
418 | \CFA is better than the other languages with a @finally@ clause, which is the
|
---|
419 | same for termination and resumption.
|
---|
420 |
|
---|
421 | \item[Throw/Resume Other]
|
---|
422 | For termination, \CFA is better than the other languages.
|
---|
423 |
|
---|
424 | For resumption, \CFA is equal to or better the other languages.
|
---|
425 | Again, the \CFA results on the ARM computer for both resumption and function simulation are particularly low.
|
---|
426 | Python has a high cost for passing the lambda during the recursion.
|
---|
427 |
|
---|
428 | \item[Try/Catch/Resume]
|
---|
429 | For termination, installing a try statement is more expressive than \Cpp
|
---|
430 | because the try components are hoisted into local functions. At runtime, these
|
---|
431 | functions are than passed to libunwind functions to set up the try statement.
|
---|
432 | \Cpp zero-cost try-entry accounts for its performance advantage.
|
---|
433 |
|
---|
434 | For resumption, there are similar costs to termination to set up the try
|
---|
435 | statement but libunwind is not used.
|
---|
436 |
|
---|
437 | \item[Try/Finally]
|
---|
438 | Setting up a try finally is less expensive in \CFA than setting up handlers,
|
---|
439 | and is significantly less than other languages.
|
---|
440 |
|
---|
441 | \item[Throw/Resume Match All]
|
---|
442 | For termination, \CFA is close to the other language simulations.
|
---|
443 |
|
---|
444 | For resumption, the stack unwinding is much faster because it does not use
|
---|
445 | libunwind. Instead resumption is just traversing a linked list with each node
|
---|
446 | being the next stack frame with the try block.
|
---|
447 |
|
---|
448 | \item[Throw/Resume Match None]
|
---|
449 | The same results as for Match All.
|
---|
450 | \end{description}
|
---|
451 |
|
---|
452 | \begin{comment}
|
---|
453 | This observation means that while \CFA does not actually keep up with Python in
|
---|
454 | every case, it is usually no worse than roughly half the speed of \Cpp. This
|
---|
455 | performance is good enough for the prototyping purposes of the project.
|
---|
456 |
|
---|
457 | The test case where \CFA falls short is Raise Other, the case where the
|
---|
458 | stack is unwound including a bunch of non-matching handlers.
|
---|
459 | This slowdown seems to come from missing optimizations.
|
---|
460 |
|
---|
461 | This suggests that the performance issue in Raise Other is just an
|
---|
462 | optimization not being applied. Later versions of gcc may be able to
|
---|
463 | optimize this case further, at least down to the half of \Cpp mark.
|
---|
464 | A \CFA compiler that directly produced assembly could do even better as it
|
---|
465 | would not have to work across some of \CFA's current abstractions, like
|
---|
466 | the try terminate function.
|
---|
467 |
|
---|
468 | Resumption exception handling is also incredibly fast. Often an order of
|
---|
469 | magnitude or two better than the best termination speed.
|
---|
470 | There is a simple explanation for this; traversing a linked list is much
|
---|
471 | faster than examining and unwinding the stack. When resumption does not do as
|
---|
472 | well its when more try statements are used per raise. Updating the internal
|
---|
473 | linked list is not very expensive but it does add up.
|
---|
474 |
|
---|
475 | The relative speed of the Match All and Match None tests (within each
|
---|
476 | language) can also show the effectiveness conditional matching as compared
|
---|
477 | to catch and rethrow.
|
---|
478 | \begin{itemize}[nosep]
|
---|
479 | \item
|
---|
480 | Java and Python get similar values in both tests.
|
---|
481 | Between the interpreted code, a higher level representation of the call
|
---|
482 | stack and exception reuse it it is possible the cost for a second
|
---|
483 | throw can be folded into the first.
|
---|
484 | % Is this due to optimization?
|
---|
485 | \item
|
---|
486 | Both types of \CFA are slightly slower if there is not a match.
|
---|
487 | For termination this likely comes from unwinding a bit more stack through
|
---|
488 | libunwind instead of executing the code normally.
|
---|
489 | For resumption there is extra work in traversing more of the list and running
|
---|
490 | more checks for a matching exceptions.
|
---|
491 | % Resumption is a bit high for that but this is my best theory.
|
---|
492 | \item
|
---|
493 | Then there is \Cpp, which takes 2--3 times longer to catch and rethrow vs.
|
---|
494 | just the catch. This is very high, but it does have to repeat the same
|
---|
495 | process of unwinding the stack and may have to parse the LSDA of the function
|
---|
496 | with the catch and rethrow twice, once before the catch and once after the
|
---|
497 | rethrow.
|
---|
498 | % I spent a long time thinking of what could push it over twice, this is all
|
---|
499 | % I have to explain it.
|
---|
500 | \end{itemize}
|
---|
501 | The difference in relative performance does show that there are savings to
|
---|
502 | be made by performing the check without catching the exception.
|
---|
503 | \end{comment}
|
---|
504 |
|
---|
505 |
|
---|
506 | \begin{comment}
|
---|
507 | From: Dave Dice <dave.dice@oracle.com>
|
---|
508 | To: "Peter A. Buhr" <pabuhr@uwaterloo.ca>
|
---|
509 | Subject: Re: [External] : JIT
|
---|
510 | Date: Mon, 16 Aug 2021 01:21:56 +0000
|
---|
511 |
|
---|
512 | > On 2021-8-15, at 7:14 PM, Peter A. Buhr <pabuhr@uwaterloo.ca> wrote:
|
---|
513 | >
|
---|
514 | > My student is trying to measure the cost of installing a try block with a
|
---|
515 | > finally clause in Java.
|
---|
516 | >
|
---|
517 | > We tried the random trick (see below). But if the try block is comment out, the
|
---|
518 | > results are the same. So the program measures the calls to the random number
|
---|
519 | > generator and there is no cost for installing the try block.
|
---|
520 | >
|
---|
521 | > Maybe there is no cost for a try block with an empty finally, i.e., the try is
|
---|
522 | > optimized away from the get-go.
|
---|
523 |
|
---|
524 | There's quite a bit of optimization magic behind the HotSpot curtains for
|
---|
525 | try-finally. (I sound like the proverbial broken record (:>)).
|
---|
526 |
|
---|
527 | In many cases we can determine that the try block can't throw any exceptions,
|
---|
528 | so we can elide all try-finally plumbing. In other cases, we can convert the
|
---|
529 | try-finally to normal if-then control flow, in the case where the exception is
|
---|
530 | thrown into the same method. This makes exceptions _almost cost-free. If we
|
---|
531 | actually need to "physically" rip down stacks, then things get expensive,
|
---|
532 | impacting both the throw cost, and inhibiting other useful optimizations at the
|
---|
533 | catch point. Such "true" throws are not just expensive, they're _very
|
---|
534 | expensive. The extremely aggressive inlining used by the JIT helps, because we
|
---|
535 | can convert cases where a heavy rip-down would normally needed back into simple
|
---|
536 | control flow.
|
---|
537 |
|
---|
538 | Other quirks involve the thrown exception object. If it's never accessed then
|
---|
539 | we're apply a nice set of optimizations to avoid its construction. If it's
|
---|
540 | accessed but never escapes the catch frame (common) then we can also cheat.
|
---|
541 | And if we find we're hitting lots of heavy rip-down cases, the JIT will
|
---|
542 | consider recompilation - better inlining -- to see if we can merge the throw
|
---|
543 | and catch into the same physical frame, and shift to simple branches.
|
---|
544 |
|
---|
545 | In your example below, System.out.print() can throw, I believe. (I could be
|
---|
546 | wrong, but most IO can throw). Native calls that throw will "unwind" normally
|
---|
547 | in C++ code until they hit the boundary where they reenter java emitted code,
|
---|
548 | at which point the JIT-ed code checks for a potential pending exception. So in
|
---|
549 | a sense the throw point is implicitly after the call to the native method, so
|
---|
550 | we can usually make those cases efficient.
|
---|
551 |
|
---|
552 | Also, when we're running in the interpreter and warming up, we'll notice that
|
---|
553 | the == 42 case never occurs, and so when we start to JIT the code, we elide the
|
---|
554 | call to System.out.print(), replacing it (and anything else which appears in
|
---|
555 | that if x == 42 block) with a bit of code we call an "uncommon trap". I'm
|
---|
556 | presuming we encounter 42 rarely. So if we ever hit the x == 42 case, control
|
---|
557 | hits the trap, which triggers synchronous recompilation of the method, this
|
---|
558 | time with the call to System.out.print() and, because of that, we now to adapt
|
---|
559 | the new code to handle any traps thrown by print(). This is tricky stuff, as
|
---|
560 | we may need to rebuild stack frames to reflect the newly emitted method. And
|
---|
561 | we have to construct a weird bit of "thunk" code that allows us to fall back
|
---|
562 | directly into the newly emitted "if" block. So there's a large one-time cost
|
---|
563 | when we bump into the uncommon trap and recompile, and subsequent execution
|
---|
564 | might get slightly slower as the exception could actually be generated, whereas
|
---|
565 | before we hit the trap, we knew the exception could never be raised.
|
---|
566 |
|
---|
567 | Oh, and things also get expensive if we need to actually fill in the stack
|
---|
568 | trace associated with the exception object. Walking stacks is hellish.
|
---|
569 |
|
---|
570 | Quite a bit of effort was put into all this as some of the specjvm benchmarks
|
---|
571 | showed significant benefit.
|
---|
572 |
|
---|
573 | It's hard to get sensible measurements as the JIT is working against you at
|
---|
574 | every turn. What's good for the normal user is awful for anybody trying to
|
---|
575 | benchmark. Also, all the magic results in fairly noisy and less reproducible
|
---|
576 | results.
|
---|
577 |
|
---|
578 | Regards
|
---|
579 | Dave
|
---|
580 |
|
---|
581 | p.s., I think I've mentioned this before, but throwing in C++ is grim as
|
---|
582 | unrelated throws in different threads take common locks, so nothing scales as
|
---|
583 | you might expect.
|
---|
584 | \end{comment}
|
---|