source: doc/theses/andrew_beach_MMath/performance.tex @ 9cb6514

ADTast-experimentalenumforall-pointer-decayjacob/cs343-translationpthread-emulationqualifiedEnum
Last change on this file since 9cb6514 was 9cb6514, checked in by Peter A. Buhr <pabuhr@…>, 3 years ago

proofread chapter performance.tex

  • Property mode set to 100644
File size: 19.1 KB
Line 
1\chapter{Performance}
2\label{c:performance}
3
4Performance is of secondary importance for most of this project.
5Instead, the focus was to get the features working. The only performance
6requirement is to ensure the tests for correctness run in a reasonable
7amount of time. Hence, a few basic performance tests were performed to
8check this requirement.
9
10\section{Test Set-Up}
11Tests were run in \CFA, C++, Java and Python.
12In addition there are two sets of tests for \CFA,
13one with termination and one with resumption.
14
15C++ is the most comparable language because both it and \CFA use the same
16framework, libunwind.
17In fact, the comparison is almost entirely in quality of implementation.
18Specifically, \CFA's EHM has had significantly less time to be optimized and
19does not generate its own assembly. It does have a slight advantage in that
20\Cpp has to do some extra bookkeeping to support its utility functions,
21but otherwise \Cpp should have a significant advantage.
22
23Java, a popular language with similar termination semantics, but
24it is implemented in a very different environment, a virtual machine with
25garbage collection.
26It also implements the finally clause on try blocks allowing for a direct
27feature-to-feature comparison.
28As with \Cpp, Java's implementation is mature, has more optimizations
29and extra features as compared to \CFA.
30
31Python is used as an alternative comparison because of the \CFA EHM's
32current performance goals, which is to not be prohibitively slow while the
33features are designed and examined. Python has similar performance goals for
34creating quick scripts and its wide use suggests it has achieved those goals.
35
36Unfortunately, there are no notable modern programming languages with
37resumption exceptions. Even the older programming languages with resumption
38seem to be notable only for having resumption.
39Instead, resumption is compared to its simulation in other programming
40languages: fixup functions that are explicitly passed into a function.
41
42All tests are run inside a main loop that repeatedly performs a test.
43This approach avoids start-up or tear-down time from
44affecting the timing results.
45The number of times the loop is run is configurable from the command line;
46the number used in the timing runs is given with the results per test.
47% Tests ran their main loop a million times.
48The Java tests run the main loop 1000 times before
49beginning the actual test to ``warm-up" the JVM.
50% All other languages are precompiled or interpreted.
51
52Timing is done internally, with time measured immediately before and
53after the test loop. The difference is calculated and printed.
54The loop structure and internal timing means it is impossible to test
55unhandled exceptions in \Cpp and Java as that would cause the process to
56terminate.
57Luckily, performance on the ``give-up and kill the process" path is not
58critical.
59
60The exceptions used in these tests are always based off of
61the base exception for the language.
62This requirement minimizes performance differences based
63on the object model used to represent the exception.
64
65All tests are designed to be as minimal as possible, while still preventing
66excessive optimizations.
67For example, empty inline assembly blocks are used in \CFA and \Cpp to
68prevent excessive optimizations while adding no actual work.
69
70% We don't use catch-alls but if we did:
71% Catch-alls are done by catching the root exception type (not using \Cpp's
72% \code{C++}{catch(...)}).
73
74When collecting data, each test is run eleven times. The top three and bottom
75three results are discarded and the remaining five values are averaged.
76The test are run with the latest (still pre-release) \CFA compiler,
77using gcc-10 as a backend.
78g++-10 is used for \Cpp.
79Java tests are complied and run with version 11.0.11.
80Python used version 3.8.
81The machines used to run the tests are:
82\todo{Get patch versions for python, gcc and g++.}
83\begin{itemize}[nosep]
84\item ARM 2280 Kunpeng 920 48-core 2$\times$socket
85      \lstinline{@} 2.6 GHz running Linux v5.11.0-25
86\item AMD 6380 Abu Dhabi 16-core 4$\times$socket
87      \lstinline{@} 2.5 GHz running Linux v5.11.0-25
88\end{itemize}
89Representing the two major families of hardware architecture.
90
91\section{Tests}
92The following tests were selected to test the performance of different
93components of the exception system.
94They should provide a guide as to where the EHM's costs are found.
95
96\paragraph{Stack Traversal}
97This group measures the cost of traversing the stack,
98(and in termination, unwinding it).
99Inside the main loop is a call to a recursive function.
100This function calls itself F times before raising an exception.
101F is configurable from the command line, but is usually 100.
102This builds up many stack frames, and any contents they may have,
103before the raise.
104The exception is always handled at the base of the stack.
105For example the Empty test for \CFA resumption looks like:
106\begin{cfa}
107void unwind_empty(unsigned int frames) {
108        if (frames) {
109                unwind_empty(frames - 1);
110        } else {
111                throwResume (empty_exception){&empty_vt};
112        }
113}
114\end{cfa}
115Other test cases have additional code around the recursive call adding
116something besides simple stack frames to the stack.
117Note that both termination and resumption have to traverse over
118the stack but only termination has to unwind it.
119\begin{itemize}[nosep]
120% \item None:
121% Reuses the empty test code (see below) except that the number of frames
122% is set to 0 (this is the only test for which the number of frames is not
123% 100). This isolates the start-up and shut-down time of a throw.
124\item Empty:
125The repeating function is empty except for the necessary control code.
126As other traversal tests add to this, it is the baseline for the group
127as the cost comes from traversing over and unwinding a stack frame
128that has no other interactions with the exception system.
129\item Destructor:
130The repeating function creates an object with a destructor before calling
131itself.
132Comparing this to the empty test gives the time to traverse over and
133unwind a destructor.
134\item Finally:
135The repeating function calls itself inside a try block with a finally clause
136attached.
137Comparing this to the empty test gives the time to traverse over and
138unwind a finally clause.
139\item Other Handler:
140The repeating function calls itself inside a try block with a handler that
141does not match the raised exception, but is of the same kind of handler.
142This means that the EHM has to check each handler, but continue
143over all of them until it reaches the base of the stack.
144Comparing this to the empty test gives the time to traverse over and
145unwind a handler.
146\end{itemize}
147
148\paragraph{Cross Try Statement}
149This group of tests measures the cost for setting up exception handling, if it is
150not used (because the exceptional case did not occur).
151Tests repeatedly cross (enter, execute, and leave) a try statement but never
152perform a raise.
153\begin{itemize}[nosep]
154\item Handler:
155The try statement has a handler (of the appropriate kind).
156\item Finally:
157The try statement has a finally clause.
158\end{itemize}
159
160\paragraph{Conditional Matching}
161This group measures the cost of conditional matching.
162Only \CFA implements the language level conditional match,
163the other languages mimic it with an ``unconditional" match (it still
164checks the exception's type) and conditional re-raise if it is not supposed
165to handle that exception.
166
167Here is the pattern shown in \CFA and \Cpp. Java and Python use the same
168pattern as \Cpp, but with their own syntax.
169
170\begin{minipage}{0.45\textwidth}
171\begin{cfa}
172try {
173        ...
174} catch (exception_t * e ;
175                should_catch(e)) {
176        ...
177}
178\end{cfa}
179\end{minipage}
180\begin{minipage}{0.55\textwidth}
181\begin{lstlisting}[language=C++]
182try {
183        ...
184} catch (std::exception & e) {
185        if (!should_catch(e)) throw;
186        ...
187}
188\end{lstlisting}
189\end{minipage}
190\begin{itemize}[nosep]
191\item Match All:
192The condition is always true. (Always matches or never re-raises.)
193\item Match None:
194The condition is always false. (Never matches or always re-raises.)
195\end{itemize}
196
197\paragraph{Resumption Simulation}
198A slightly altered version of the Empty Traversal test is used when comparing
199resumption to fix-up routines.
200The handler, the actual resumption handler or the fix-up routine,
201always captures a variable at the base of the loop,
202and receives a reference to a variable at the raise site, either as a
203field on the exception or an argument to the fix-up routine.
204% I don't actually know why that is here but not anywhere else.
205
206%\section{Cost in Size}
207%Using exceptions also has a cost in the size of the executable.
208%Although it is sometimes ignored
209%
210%There is a size cost to defining a personality function but the later problem
211%is the LSDA which will be generated for every function.
212%
213%(I haven't actually figured out how to compare this, probably using something
214%related to -fexceptions.)
215
216\section{Results}
217% First, introduce the tables.
218\autoref{t:PerformanceTermination},
219\autoref{t:PerformanceResumption}
220and~\autoref{t:PerformanceFixupRoutines}
221show the test results.
222In cases where a feature is not supported by a language, the test is skipped
223for that language and the result is marked N/A.
224There are also cases where the feature is supported but measuring its
225cost is impossible. This happened with Java, which uses a JIT that optimize
226away the tests and it cannot be stopped.\cite{Dice21}
227These tests are marked N/C.
228To get results in a consistent range (1 second to 1 minute is ideal,
229going higher is better than going low) N, the number of iterations of the
230main loop in each test, is varied between tests. It is also given in the
231results and has a value in the millions.
232
233An anomaly in some results came from \CFA's use of gcc nested functions.
234These nested functions are used to create closures that can access stack
235variables in their lexical scope.
236However, if they do so, then they can cause the benchmark's run-time to
237increase by an order of magnitude.
238The simplest solution is to make those values global variables instead
239of function local variables.
240% Do we know if editing a global inside nested function is a problem?
241Tests that had to be modified to avoid this problem have been marked
242with a ``*'' in the results.
243
244% Now come the tables themselves:
245% You might need a wider window for this.
246
247\begin{table}[htb]
248\centering
249\caption{Termination Performance Results (sec)}
250\label{t:PerformanceTermination}
251\begin{tabular}{|r|*{2}{|r r r r|}}
252\hline
253                       & \multicolumn{4}{c||}{AMD}         & \multicolumn{4}{c|}{ARM}  \\
254\cline{2-9}
255N\hspace{8pt}          & \multicolumn{1}{c}{\CFA} & \multicolumn{1}{c}{\Cpp} & \multicolumn{1}{c}{Java} & \multicolumn{1}{c||}{Python} &
256                         \multicolumn{1}{c}{\CFA} & \multicolumn{1}{c}{\Cpp} & \multicolumn{1}{c}{Java} & \multicolumn{1}{c|}{Python} \\
257\hline
258Empty Traversal (1M)   & 3.4   & 2.8   & 18.3  & 23.4      & 3.7   & 3.2   & 15.5  & 14.8  \\
259D'tor Traversal (1M)   & 48.4  & 23.6  & N/A   & N/A       & 64.2  & 29.0  & N/A   & N/A   \\
260Finally Traversal (1M) & 3.4*  & N/A   & 17.9  & 29.0      & 4.1*  & N/A   & 15.6  & 19.0  \\
261Other Traversal (1M)   & 3.6*  & 23.2  & 18.2  & 32.7      & 4.0*  & 24.5  & 15.5  & 21.4  \\
262Cross Handler (100M)   & 6.0   & 0.9   & N/C   & 37.4      & 10.0  & 0.8   & N/C   & 32.2  \\
263Cross Finally (100M)   & 0.9   & N/A   & N/C   & 44.1      & 0.8   & N/A   & N/C   & 37.3  \\
264Match All (10M)        & 32.9  & 20.7  & 13.4  & 4.9       & 36.2  & 24.5  & 12.0  & 3.1   \\
265Match None (10M)       & 32.7  & 50.3  & 11.0  & 5.1       & 36.3  & 71.9  & 12.3  & 4.2   \\
266\hline
267\end{tabular}
268\end{table}
269
270\begin{table}[htb]
271\centering
272\caption{Resumption Performance Results (sec)}
273\label{t:PerformanceResumption}
274\begin{tabular}{|r||r||r|}
275\hline
276N\hspace{8pt}
277                        & AMD     & ARM  \\
278\hline
279Empty Traversal (10M)   & 0.2     & 0.3  \\
280D'tor Traversal (10M)   & 1.8     & 1.0  \\
281Finally Traversal (10M) & 1.7     & 1.0  \\
282Other Traversal (10M)   & 22.6    & 25.9 \\
283Cross Handler (100M)    & 8.4     & 11.9 \\
284Match All (100M)        & 2.3     & 3.2  \\
285Match None (100M)       & 2.9     & 3.9  \\
286\hline
287\end{tabular}
288\end{table}
289
290\begin{table}[htb]
291\centering
292\small
293\caption{Resumption/Fixup Routine Comparison (sec)}
294\label{t:PerformanceFixupRoutines}
295\setlength{\tabcolsep}{5pt}
296\begin{tabular}{|r|*{2}{|r r r r r|}}
297\hline
298            & \multicolumn{5}{c||}{AMD}     & \multicolumn{5}{c|}{ARM}  \\
299\cline{2-11}
300N\hspace{8pt}       & \multicolumn{1}{c}{Raise} & \multicolumn{1}{c}{\CFA} & \multicolumn{1}{c}{\Cpp} & \multicolumn{1}{c}{Java} & \multicolumn{1}{c||}{Python} &
301              \multicolumn{1}{c}{Raise} & \multicolumn{1}{c}{\CFA} & \multicolumn{1}{c}{\Cpp} & \multicolumn{1}{c}{Java} & \multicolumn{1}{c|}{Python} \\
302\hline
303Resume Empty (10M)  & 3.8  & 3.5  & 14.7  & 2.3   & 176.1 & 0.3  & 0.1  & 8.9   & 1.2   & 119.9 \\
304%Resume Other (10M)  & 4.0* & 0.1* & 21.9  & 6.2   & 381.0 & 0.3* & 0.1* & 13.2  & 5.0   & 290.7 \\
305\hline
306\end{tabular}
307\end{table}
308
309% Now discuss the results in the tables.
310One result not directly related to \CFA but important to keep in mind is that,
311for exceptions, the standard intuition about which languages should go
312faster often does not hold.
313For example, there are a few cases where Python out-performs
314\CFA, \Cpp and Java.
315The most likely explanation is that, since exceptions
316are rarely considered to be the common case, the more optimized languages
317make that case expensive to improve other cases.
318In addition, languages with high-level representations have a much
319easier time scanning the stack as there is less to decode.
320
321As stated,
322the performance tests are not attempting to show \CFA has a new competitive
323way of implementing exception handling.
324The only performance requirement is to insure the \CFA EHM has reasonable
325performance for prototyping.
326Although that may be hard to exactly quantify, I believe it has succeeded
327in that regard.
328Details on the different test cases follow.
329
330\subsection{Termination, \autoref{t:PerformanceTermination}}
331
332\begin{description}
333\item[Empty Traversal]
334\CFA is slower than \Cpp, but is still faster than the other languages
335and closer to \Cpp than other languages.
336This result is to be expected as \CFA is closer to \Cpp than the other languages.
337
338\item[D'tor Traversal]
339Running destructors causes a huge slowdown in the two languages that support
340them. \CFA has a higher proportionate slowdown but it is similar to \Cpp's.
341Considering the amount of work done in destructors is virtually zero (asm instruction), the cost
342must come from the change of context required to trigger the destructor.
343
344\item[Finally Traversal]
345Performance is similar to Empty Traversal in all languages that support finally
346clauses. Only Python seems to have a larger than random noise change in
347its run-time and it is still not large.
348Despite the similarity between finally clauses and destructors,
349finally clauses seem to avoid the spike that run-time destructors have.
350Possibly some optimization removes the cost of changing contexts.
351\todo{OK, I think the finally clause may have been optimized out.}
352
353\item[Other Traversal]
354For \Cpp, stopping to check if a handler applies seems to be about as
355expensive as stopping to run a destructor.
356This results in a significant jump.
357
358Other languages experience a small increase in run-time.
359The small increase likely comes from running the checks,
360but they could avoid the spike by not having the same kind of overhead for
361switching to the check's context.
362
363\todo{Could revisit Other Traversal, after Finally Traversal.}
364
365\item[Cross Handler]
366Here \CFA falls behind \Cpp by a much more significant margin.
367This is likely due to the fact \CFA has to insert two extra function
368calls, while \Cpp does not have to execute any other instructions.
369Python is much further behind.
370
371\item[Cross Finally]
372\CFA's performance now matches \Cpp's from Cross Handler.
373If the code from the finally clause is being inlined,
374which is just an asm comment, than there are no additional instructions
375to execute again when exiting the try statement normally.
376
377\item[Conditional Match]
378Both of the conditional matching tests can be considered on their own.
379However for evaluating the value of conditional matching itself, the
380comparison of the two sets of results is useful.
381Consider the massive jump in run-time for \Cpp going from match all to match
382none, which none of the other languages have.
383Some strange interaction is causing run-time to more than double for doing
384twice as many raises.
385Java and Python avoid this problem and have similar run-time for both tests,
386possibly through resource reuse or their program representation.
387However \CFA is built like \Cpp and avoids the problem as well, this matches
388the pattern of the conditional match, which makes the two execution paths
389very similar.
390
391\end{description}
392
393\subsection{Resumption, \autoref{t:PerformanceResumption}}
394
395Moving on to resumption, there is one general note,
396resumption is \textit{fast}. The only test where it fell
397behind termination is Cross Handler.
398In every other case, the number of iterations had to be increased by a
399factor of 10 to get the run-time in an appropriate range
400and in some cases resumption still took less time.
401
402% I tried \paragraph and \subparagraph, maybe if I could adjust spacing
403% between paragraphs those would work.
404\begin{description}
405\item[Empty Traversal]
406See above for the general speed-up notes.
407This result is not surprising as resumption's linked-list approach
408means that traversing over stack frames without a resumption handler is
409$O(1)$.
410
411\item[D'tor Traversal]
412Resumption does have the same spike in run-time that termination has.
413The run-time is actually very similar to Finally Traversal.
414As resumption does not unwind the stack, both destructors and finally
415clauses are run while walking down the stack during the recursion returns.
416So it follows their performance is similar.
417
418\item[Finally Traversal]
419% The increase in run-time from Empty Traversal (once adjusted for
420% the number of iterations) is roughly the same as for termination.
421% This suggests that the
422See D'tor Traversal discussion.
423
424\item[Other Traversal]
425Traversing across handlers reduces resumption's advantage as it actually
426has to stop and check each one.
427Resumption still came out ahead (adjusting for iterations) but by much less
428than the other cases.
429
430\item[Cross Handler]
431The only test case where resumption could not keep up with termination,
432although the difference is not as significant as many other cases.
433It is simply a matter of where the costs come from. \PAB{What does this mean?
434Even if \CFA termination
435is not ``zero-cost", passing through an empty function still seems to be
436cheaper than updating global values.}
437
438\item[Conditional Match]
439Resumption shows a slight slowdown if the exception is not matched
440by the first handler, which follows from the fact the second handler now has
441to be checked. However the difference is not large.
442
443\end{description}
444
445\subsection{Resumption/Fixup, \autoref{t:PerformanceFixupRoutines}}
446
447Finally are the results of the resumption/fixup routine comparison.
448These results are surprisingly varied. It is possible that creating a closure
449has more to do with performance than passing the argument through layers of
450calls.
451Even with 100 stack frames though, resumption is only about as fast as
452manually passing a fixup routine.
453However, as the number of fixup routines is increased, the cost of passing them
454should make the resumption dynamic-search cheaper.
455So there is a cost for the additional power and flexibility exceptions
456provide.
Note: See TracBrowser for help on using the repository browser.