source: doc/theses/andrew_beach_MMath/performance.tex @ 6aa84e0

ADTast-experimentalenumforall-pointer-decaypthread-emulationqualifiedEnum
Last change on this file since 6aa84e0 was 6aa84e0, checked in by Andrew Beach <ajbeach@…>, 3 years ago

Andrew MMath: Removed (updated one) the remaining \todo items.

  • Property mode set to 100644
File size: 19.2 KB
RevLine 
[dac16a0]1\chapter{Performance}
2\label{c:performance}
3
[cfbab07]4Performance is of secondary importance for most of this project.
5Instead, the focus was to get the features working. The only performance
6requirement is to ensure the tests for correctness run in a reasonable
[9cdfa5fb]7amount of time. Hence, only a few basic performance tests were performed to
[cfbab07]8check this requirement.
[b51e389c]9
10\section{Test Set-Up}
[cfbab07]11Tests were run in \CFA, C++, Java and Python.
[9698690]12In addition there are two sets of tests for \CFA,
[0477127]13one with termination and one with resumption.
[dac16a0]14
[9cdfa5fb]15GCC C++ is the most comparable language because both it and \CFA use the same
[dac16a0]16framework, libunwind.
[cfbab07]17In fact, the comparison is almost entirely in quality of implementation.
18Specifically, \CFA's EHM has had significantly less time to be optimized and
[dac16a0]19does not generate its own assembly. It does have a slight advantage in that
[cfbab07]20\Cpp has to do some extra bookkeeping to support its utility functions,
21but otherwise \Cpp should have a significant advantage.
[dac16a0]22
[0477127]23Java, a popular language with similar termination semantics,
24is implemented in a very different environment, a virtual machine with
[029cbc0]25garbage collection.
[7372065]26It also implements the finally clause on try blocks allowing for a direct
[029cbc0]27feature-to-feature comparison.
[cfbab07]28As with \Cpp, Java's implementation is mature, has more optimizations
29and extra features as compared to \CFA.
[9698690]30
[cfbab07]31Python is used as an alternative comparison because of the \CFA EHM's
32current performance goals, which is to not be prohibitively slow while the
[9698690]33features are designed and examined. Python has similar performance goals for
34creating quick scripts and its wide use suggests it has achieved those goals.
35
[cfbab07]36Unfortunately, there are no notable modern programming languages with
37resumption exceptions. Even the older programming languages with resumption
38seem to be notable only for having resumption.
39Instead, resumption is compared to its simulation in other programming
[0477127]40languages: fixup functions that are explicitly passed into a function.
[dac16a0]41
[cfbab07]42All tests are run inside a main loop that repeatedly performs a test.
43This approach avoids start-up or tear-down time from
[dac16a0]44affecting the timing results.
[0477127]45The number of times the loop is run is configurable from the command line;
[cfbab07]46the number used in the timing runs is given with the results per test.
[0477127]47The Java tests run the main loop 1000 times before
[9cdfa5fb]48beginning the actual test to ``warm up" the JVM.
[cfbab07]49% All other languages are precompiled or interpreted.
[9698690]50
51Timing is done internally, with time measured immediately before and
[cfbab07]52after the test loop. The difference is calculated and printed.
[9698690]53The loop structure and internal timing means it is impossible to test
54unhandled exceptions in \Cpp and Java as that would cause the process to
55terminate.
[9cdfa5fb]56Luckily, performance on the ``give up and kill the process" path is not
[9698690]57critical.
[dac16a0]58
[cfbab07]59The exceptions used in these tests are always based off of
60the base exception for the language.
61This requirement minimizes performance differences based
62on the object model used to represent the exception.
[9698690]63
[cfbab07]64All tests are designed to be as minimal as possible, while still preventing
65excessive optimizations.
[9698690]66For example, empty inline assembly blocks are used in \CFA and \Cpp to
67prevent excessive optimizations while adding no actual work.
[dac16a0]68
[9698690]69% We don't use catch-alls but if we did:
70% Catch-alls are done by catching the root exception type (not using \Cpp's
71% \code{C++}{catch(...)}).
[dac16a0]72
[0477127]73When collecting data, each test is run eleven times. The top three and bottom
[cfbab07]74three results are discarded and the remaining five values are averaged.
[0477127]75The test are run with the latest (still pre-release) \CFA compiler,
[cd03b76d]76using gcc-10 10.3.0 as a backend.
77g++-10 10.3.0 is used for \Cpp.
[9cdfa5fb]78Java tests are complied and run with Oracle OpenJDK version 11.0.11.
79Python used CPython version 3.8.10.
[cfbab07]80The machines used to run the tests are:
81\begin{itemize}[nosep]
82\item ARM 2280 Kunpeng 920 48-core 2$\times$socket
83      \lstinline{@} 2.6 GHz running Linux v5.11.0-25
84\item AMD 6380 Abu Dhabi 16-core 4$\times$socket
85      \lstinline{@} 2.5 GHz running Linux v5.11.0-25
86\end{itemize}
[9cdfa5fb]87These represent the two major families of hardware architecture.
[cfbab07]88
[b51e389c]89\section{Tests}
[029cbc0]90The following tests were selected to test the performance of different
91components of the exception system.
[cfbab07]92They should provide a guide as to where the EHM's costs are found.
93
94\paragraph{Stack Traversal}
[9cdfa5fb]95This group of tests measures the cost of traversing the stack
[cfbab07]96(and in termination, unwinding it).
97Inside the main loop is a call to a recursive function.
98This function calls itself F times before raising an exception.
99F is configurable from the command line, but is usually 100.
100This builds up many stack frames, and any contents they may have,
101before the raise.
102The exception is always handled at the base of the stack.
103For example the Empty test for \CFA resumption looks like:
104\begin{cfa}
105void unwind_empty(unsigned int frames) {
106        if (frames) {
107                unwind_empty(frames - 1);
108        } else {
109                throwResume (empty_exception){&empty_vt};
110        }
111}
112\end{cfa}
[0477127]113Other test cases have additional code around the recursive call adding
[cfbab07]114something besides simple stack frames to the stack.
[0477127]115Note that both termination and resumption have to traverse over
[cfbab07]116the stack but only termination has to unwind it.
[9698690]117\begin{itemize}[nosep]
[cfbab07]118% \item None:
119% Reuses the empty test code (see below) except that the number of frames
120% is set to 0 (this is the only test for which the number of frames is not
121% 100). This isolates the start-up and shut-down time of a throw.
122\item Empty:
[7372065]123The repeating function is empty except for the necessary control code.
[0477127]124As other traversal tests add to this, it is the baseline for the group
[cfbab07]125as the cost comes from traversing over and unwinding a stack frame
126that has no other interactions with the exception system.
[ea593a3]127\item Destructor:
[7372065]128The repeating function creates an object with a destructor before calling
129itself.
[0477127]130Comparing this to the empty test gives the time to traverse over and
[cfbab07]131unwind a destructor.
[ea593a3]132\item Finally:
[7372065]133The repeating function calls itself inside a try block with a finally clause
134attached.
[0477127]135Comparing this to the empty test gives the time to traverse over and
[cfbab07]136unwind a finally clause.
[ea593a3]137\item Other Handler:
[7372065]138The repeating function calls itself inside a try block with a handler that
[0477127]139does not match the raised exception, but is of the same kind of handler.
140This means that the EHM has to check each handler, and continue
141over all of them until it reaches the base of the stack.
142Comparing this to the empty test gives the time to traverse over and
[cfbab07]143unwind a handler.
[ea593a3]144\end{itemize}
145
[7372065]146\paragraph{Cross Try Statement}
[0477127]147This group of tests measures the cost for setting up exception handling,
148if it is
[9cdfa5fb]149not used because the exceptional case did not occur.
[0477127]150Tests repeatedly cross (enter, execute and leave) a try statement but never
151perform a raise.
[9698690]152\begin{itemize}[nosep]
[ea593a3]153\item Handler:
[cfbab07]154The try statement has a handler (of the appropriate kind).
[ea593a3]155\item Finally:
[262deda0]156The try statement has a finally clause.
[ea593a3]157\end{itemize}
158
159\paragraph{Conditional Matching}
[cfbab07]160This group measures the cost of conditional matching.
[ea593a3]161Only \CFA implements the language level conditional match,
[cfbab07]162the other languages mimic it with an ``unconditional" match (it still
163checks the exception's type) and conditional re-raise if it is not supposed
[9698690]164to handle that exception.
[cfbab07]165
[0477127]166Here is the pattern shown in \CFA and \Cpp. Java and Python use the same
[cfbab07]167pattern as \Cpp, but with their own syntax.
168
169\begin{minipage}{0.45\textwidth}
170\begin{cfa}
171try {
172        ...
173} catch (exception_t * e ;
174                should_catch(e)) {
175        ...
176}
177\end{cfa}
178\end{minipage}
179\begin{minipage}{0.55\textwidth}
180\begin{lstlisting}[language=C++]
181try {
182        ...
183} catch (std::exception & e) {
184        if (!should_catch(e)) throw;
185        ...
186}
187\end{lstlisting}
188\end{minipage}
[9698690]189\begin{itemize}[nosep]
190\item Match All:
[ea593a3]191The condition is always true. (Always matches or never re-raises.)
[9698690]192\item Match None:
[ea593a3]193The condition is always false. (Never matches or always re-raises.)
194\end{itemize}
[dac16a0]195
[cfbab07]196\paragraph{Resumption Simulation}
197A slightly altered version of the Empty Traversal test is used when comparing
198resumption to fix-up routines.
199The handler, the actual resumption handler or the fix-up routine,
200always captures a variable at the base of the loop,
201and receives a reference to a variable at the raise site, either as a
202field on the exception or an argument to the fix-up routine.
203% I don't actually know why that is here but not anywhere else.
204
[dac16a0]205%\section{Cost in Size}
206%Using exceptions also has a cost in the size of the executable.
207%Although it is sometimes ignored
208%
209%There is a size cost to defining a personality function but the later problem
210%is the LSDA which will be generated for every function.
211%
212%(I haven't actually figured out how to compare this, probably using something
213%related to -fexceptions.)
[029cbc0]214
[9698690]215\section{Results}
[cfbab07]216% First, introduce the tables.
217\autoref{t:PerformanceTermination},
218\autoref{t:PerformanceResumption}
219and~\autoref{t:PerformanceFixupRoutines}
220show the test results.
221In cases where a feature is not supported by a language, the test is skipped
222for that language and the result is marked N/A.
223There are also cases where the feature is supported but measuring its
[9cdfa5fb]224cost is impossible. This happened with Java, which uses a JIT that optimizes
225away the tests and cannot be stopped.\cite{Dice21}
[cfbab07]226These tests are marked N/C.
227To get results in a consistent range (1 second to 1 minute is ideal,
228going higher is better than going low) N, the number of iterations of the
229main loop in each test, is varied between tests. It is also given in the
[0477127]230results and has a value in the millions.
[cfbab07]231
[9cdfa5fb]232An anomaly in some results came from \CFA's use of GCC nested functions.
[cfbab07]233These nested functions are used to create closures that can access stack
234variables in their lexical scope.
[9cdfa5fb]235However, if they do so, then they can cause the benchmark's run time to
[cfbab07]236increase by an order of magnitude.
237The simplest solution is to make those values global variables instead
[9cdfa5fb]238of function-local variables.
[cfbab07]239% Do we know if editing a global inside nested function is a problem?
240Tests that had to be modified to avoid this problem have been marked
241with a ``*'' in the results.
242
243% Now come the tables themselves:
244% You might need a wider window for this.
245
246\begin{table}[htb]
247\centering
248\caption{Termination Performance Results (sec)}
249\label{t:PerformanceTermination}
250\begin{tabular}{|r|*{2}{|r r r r|}}
[7372065]251\hline
[cfbab07]252                       & \multicolumn{4}{c||}{AMD}         & \multicolumn{4}{c|}{ARM}  \\
253\cline{2-9}
254N\hspace{8pt}          & \multicolumn{1}{c}{\CFA} & \multicolumn{1}{c}{\Cpp} & \multicolumn{1}{c}{Java} & \multicolumn{1}{c||}{Python} &
255                         \multicolumn{1}{c}{\CFA} & \multicolumn{1}{c}{\Cpp} & \multicolumn{1}{c}{Java} & \multicolumn{1}{c|}{Python} \\
[7372065]256\hline
[140eb16]257Empty Traversal (1M)   & 23.0  & 9.6   & 17.6  & 23.4      & 30.6  & 13.6  & 15.5  & 14.7  \\
258D'tor Traversal (1M)   & 48.1  & 23.5  & N/A   & N/A       & 64.2  & 29.2  & N/A   & N/A   \\
259Finally Traversal (1M) & 3.2*  & N/A   & 17.6  & 29.2      & 3.9*  & N/A   & 15.5  & 19.0  \\
260Other Traversal (1M)   & 3.3*  & 23.9  & 17.7  & 32.8      & 3.9*  & 24.5  & 15.5  & 21.6  \\
261Cross Handler (1B)     & 6.5   & 0.9   & N/C   & 38.0      & 9.6   & 0.8   & N/C   & 32.1  \\
262Cross Finally (1B)     & 0.8   & N/A   & N/C   & 44.6      & 0.6   & N/A   & N/C   & 37.3  \\
263Match All (10M)        & 30.5  & 20.6  & 11.2  & 3.9       & 36.9  & 24.6  & 10.7  & 3.1   \\
264Match None (10M)       & 30.6  & 50.9  & 11.2  & 5.0       & 36.9  & 71.9  & 10.7  & 4.1   \\
[7372065]265\hline
266\end{tabular}
[cfbab07]267\end{table}
[7372065]268
[cfbab07]269\begin{table}[htb]
270\centering
271\caption{Resumption Performance Results (sec)}
272\label{t:PerformanceResumption}
273\begin{tabular}{|r||r||r|}
[0b67a19]274\hline
[cfbab07]275N\hspace{8pt}
276                        & AMD     & ARM  \\
[0b67a19]277\hline
[140eb16]278Empty Traversal (10M)   & 1.4     & 1.2  \\
[cfbab07]279D'tor Traversal (10M)   & 1.8     & 1.0  \\
[140eb16]280Finally Traversal (10M) & 1.8     & 1.0  \\
281Other Traversal (10M)   & 22.6    & 25.8 \\
282Cross Handler (1B)      & 9.0     & 11.9 \\
[cfbab07]283Match All (100M)        & 2.3     & 3.2  \\
[140eb16]284Match None (100M)       & 3.0     & 3.8  \\
[0b67a19]285\hline
286\end{tabular}
[cfbab07]287\end{table}
[262deda0]288
[cfbab07]289\begin{table}[htb]
290\centering
291\small
292\caption{Resumption/Fixup Routine Comparison (sec)}
293\label{t:PerformanceFixupRoutines}
294\setlength{\tabcolsep}{5pt}
295\begin{tabular}{|r|*{2}{|r r r r r|}}
296\hline
297            & \multicolumn{5}{c||}{AMD}     & \multicolumn{5}{c|}{ARM}  \\
298\cline{2-11}
299N\hspace{8pt}       & \multicolumn{1}{c}{Raise} & \multicolumn{1}{c}{\CFA} & \multicolumn{1}{c}{\Cpp} & \multicolumn{1}{c}{Java} & \multicolumn{1}{c||}{Python} &
300              \multicolumn{1}{c}{Raise} & \multicolumn{1}{c}{\CFA} & \multicolumn{1}{c}{\Cpp} & \multicolumn{1}{c}{Java} & \multicolumn{1}{c|}{Python} \\
301\hline
[140eb16]302Resume Empty (10M)  & 1.4 & 1.4 & 15.4 & 2.3 & 178.0  & 1.2 & 1.2 & 8.9 & 1.2 & 118.4 \\
[cfbab07]303\hline
304\end{tabular}
305\end{table}
306
307% Now discuss the results in the tables.
308One result not directly related to \CFA but important to keep in mind is that,
[0477127]309for exceptions, the standard intuition about which languages should go
[cfbab07]310faster often does not hold.
311For example, there are a few cases where Python out-performs
312\CFA, \Cpp and Java.
[cd03b76d]313% To be exact, the Match All and Match None cases.
[6aa84e0]314The most likely explination is that,
315the generally faster languages have made ``common cases fast" at the expense
316of the rarer cases. Since exceptions are considered rare, they are made
317expensive to help speed up common actions, such as entering and leaving try
318statements.
319Python on the other hand, while generally slower than the other languages,
320uses exceptions more and has not scarified their performance.
[cfbab07]321In addition, languages with high-level representations have a much
322easier time scanning the stack as there is less to decode.
323
324As stated,
325the performance tests are not attempting to show \CFA has a new competitive
326way of implementing exception handling.
327The only performance requirement is to insure the \CFA EHM has reasonable
328performance for prototyping.
[0477127]329Although that may be hard to exactly quantify, I believe it has succeeded
[cfbab07]330in that regard.
331Details on the different test cases follow.
332
[0477127]333\subsection{Termination \texorpdfstring{(\autoref{t:PerformanceTermination})}{}}
334
[cfbab07]335\begin{description}
336\item[Empty Traversal]
337\CFA is slower than \Cpp, but is still faster than the other languages
338and closer to \Cpp than other languages.
[0477127]339This result is to be expected,
340as \CFA is closer to \Cpp than the other languages.
[cfbab07]341
342\item[D'tor Traversal]
[0477127]343Running destructors causes a huge slowdown in the two languages that support
[cfbab07]344them. \CFA has a higher proportionate slowdown but it is similar to \Cpp's.
[0477127]345Considering the amount of work done in destructors is effectively zero
346(an assembly comment), the cost
347must come from the change of context required to run the destructor.
[cfbab07]348
349\item[Finally Traversal]
[0477127]350Performance is similar to Empty Traversal in all languages that support finally
[cfbab07]351clauses. Only Python seems to have a larger than random noise change in
[9cdfa5fb]352its run time and it is still not large.
[cfbab07]353Despite the similarity between finally clauses and destructors,
[9cdfa5fb]354finally clauses seem to avoid the spike that run time destructors have.
[cfbab07]355Possibly some optimization removes the cost of changing contexts.
356
357\item[Other Traversal]
358For \Cpp, stopping to check if a handler applies seems to be about as
359expensive as stopping to run a destructor.
360This results in a significant jump.
361
[9cdfa5fb]362Other languages experience a small increase in run time.
[cfbab07]363The small increase likely comes from running the checks,
364but they could avoid the spike by not having the same kind of overhead for
365switching to the check's context.
366
367\item[Cross Handler]
[9cdfa5fb]368Here, \CFA falls behind \Cpp by a much more significant margin.
369This is likely due to the fact that \CFA has to insert two extra function
370calls, while \Cpp does not have to execute any other instructions.
[cfbab07]371Python is much further behind.
372
373\item[Cross Finally]
374\CFA's performance now matches \Cpp's from Cross Handler.
375If the code from the finally clause is being inlined,
[0477127]376which is just an asm comment, than there are no additional instructions
[cfbab07]377to execute again when exiting the try statement normally.
378
379\item[Conditional Match]
[0477127]380Both of the conditional matching tests can be considered on their own.
[9cdfa5fb]381However, for evaluating the value of conditional matching itself, the
[cfbab07]382comparison of the two sets of results is useful.
[9cdfa5fb]383Consider the massive jump in run time for \Cpp going from match all to match
[cfbab07]384none, which none of the other languages have.
[9cdfa5fb]385Some strange interaction is causing run time to more than double for doing
[cfbab07]386twice as many raises.
[9cdfa5fb]387Java and Python avoid this problem and have similar run time for both tests,
[cfbab07]388possibly through resource reuse or their program representation.
[9cdfa5fb]389However, \CFA is built like \Cpp, and avoids the problem as well.
390This matches
[0477127]391the pattern of the conditional match, which makes the two execution paths
392very similar.
[cfbab07]393
394\end{description}
395
[0477127]396\subsection{Resumption \texorpdfstring{(\autoref{t:PerformanceResumption})}{}}
397
[9cdfa5fb]398Moving on to resumption, there is one general note:
[0477127]399resumption is \textit{fast}. The only test where it fell
[cfbab07]400behind termination is Cross Handler.
401In every other case, the number of iterations had to be increased by a
[9cdfa5fb]402factor of 10 to get the run time in an appropriate range
[cfbab07]403and in some cases resumption still took less time.
404
405% I tried \paragraph and \subparagraph, maybe if I could adjust spacing
406% between paragraphs those would work.
407\begin{description}
408\item[Empty Traversal]
409See above for the general speed-up notes.
[0477127]410This result is not surprising as resumption's linked-list approach
[cfbab07]411means that traversing over stack frames without a resumption handler is
412$O(1)$.
413
414\item[D'tor Traversal]
[9cdfa5fb]415Resumption does have the same spike in run time that termination has.
416The run time is actually very similar to Finally Traversal.
[0477127]417As resumption does not unwind the stack, both destructors and finally
418clauses are run while walking down the stack during the recursive returns.
[cfbab07]419So it follows their performance is similar.
420
421\item[Finally Traversal]
[0477127]422Same as D'tor Traversal,
[9cdfa5fb]423except termination did not have a spike in run time on this test case.
[cfbab07]424
425\item[Other Traversal]
426Traversing across handlers reduces resumption's advantage as it actually
427has to stop and check each one.
428Resumption still came out ahead (adjusting for iterations) but by much less
429than the other cases.
430
431\item[Cross Handler]
432The only test case where resumption could not keep up with termination,
433although the difference is not as significant as many other cases.
[9cdfa5fb]434It is simply a matter of where the costs come from:
435both termination and resumption have some work to set up or tear down a
[0477127]436handler. It just so happens that resumption's work is slightly slower.
[cfbab07]437
438\item[Conditional Match]
439Resumption shows a slight slowdown if the exception is not matched
440by the first handler, which follows from the fact the second handler now has
[9cdfa5fb]441to be checked. However, the difference is not large.
[cfbab07]442
443\end{description}
444
[0477127]445\subsection{Resumption/Fixup \texorpdfstring{(\autoref{t:PerformanceFixupRoutines})}{}}
446
[cfbab07]447Finally are the results of the resumption/fixup routine comparison.
[0477127]448These results are surprisingly varied. It is possible that creating a closure
[cfbab07]449has more to do with performance than passing the argument through layers of
450calls.
[0477127]451At 100 stack frames, resumption and manual fixup routines have similar
452performance in \CFA.
453More experiments could try to tease out the exact trade-offs,
454but the prototype's only performance goal is to be reasonable.
[9cdfa5fb]455It is already in that range, and \CFA's fixup routine simulation is
[0477127]456one of the faster simulations as well.
[9cdfa5fb]457Plus, exceptions add features and remove syntactic overhead,
458so even at similar performance, resumptions have advantages
[0477127]459over fixup routines.
Note: See TracBrowser for help on using the repository browser.