source: doc/theses/andrew_beach_MMath/performance.tex @ 9411cf0

ADTast-experimentalenumforall-pointer-decaypthread-emulationqualifiedEnum
Last change on this file since 9411cf0 was 814f87d, checked in by Andrew Beach <ajbeach@…>, 3 years ago

Andrew MMath: Updated thesis with Yizhou Zhang's feedback.

  • Property mode set to 100644
File size: 19.5 KB
Line 
1\chapter{Performance}
2\label{c:performance}
3
4Performance is of secondary importance for most of this project.
5Instead, the focus was to get the features working. The only performance
6requirement is to ensure the tests for correctness run in a reasonable
7amount of time. Hence, only a few basic performance tests were performed to
8check this requirement.
9
10\section{Test Set-Up}
11Tests were run in \CFA, C++, Java and Python.
12In addition there are two sets of tests for \CFA,
13one with termination and one with resumption.
14
15GCC C++ is the most comparable language because both it and \CFA use the same
16framework, libunwind.
17In fact, the comparison is almost entirely in quality of implementation.
18Specifically, \CFA's EHM has had significantly less time to be optimized and
19does not generate its own assembly. It does have a slight advantage in that
20\Cpp has to do some extra bookkeeping to support its utility functions,
21but otherwise \Cpp should have a significant advantage.
22
23Java, a popular language with similar termination semantics,
24is implemented in a very different environment, a virtual machine with
25garbage collection.
26It also implements the finally clause on try blocks allowing for a direct
27feature-to-feature comparison.
28As with \Cpp, Java's implementation is mature, has more optimizations
29and extra features as compared to \CFA.
30
31Python is used as an alternative comparison because of the \CFA EHM's
32current performance goals, which is to not be prohibitively slow while the
33features are designed and examined. Python has similar performance goals for
34creating quick scripts and its wide use suggests it has achieved those goals.
35
36Unfortunately, there are no notable modern programming languages with
37resumption exceptions. Even the older programming languages with resumption
38seem to be notable only for having resumption.
39On the other hand, the functional equivalents to resumption are too new.
40There does not seem to be any standard implementations in well-known
41languages, so far they seem confined to extensions and research languages.
42% There was some maybe interesting comparison to an OCaml extension
43% but I'm not sure how to get that working if it is interesting.
44Instead, resumption is compared to its simulation in other programming
45languages: fixup functions that are explicitly passed into a function.
46
47All tests are run inside a main loop that repeatedly performs a test.
48This approach avoids start-up or tear-down time from
49affecting the timing results.
50The number of times the loop is run is configurable from the command line;
51the number used in the timing runs is given with the results per test.
52The Java tests run the main loop 1000 times before
53beginning the actual test to ``warm up" the JVM.
54% All other languages are precompiled or interpreted.
55
56Timing is done internally, with time measured immediately before and
57after the test loop. The difference is calculated and printed.
58The loop structure and internal timing means it is impossible to test
59unhandled exceptions in \Cpp and Java as that would cause the process to
60terminate.
61Luckily, performance on the ``give up and kill the process" path is not
62critical.
63
64The exceptions used in these tests are always based off of
65the base exception for the language.
66This requirement minimizes performance differences based
67on the object model used to represent the exception.
68
69All tests are designed to be as minimal as possible, while still preventing
70excessive optimizations.
71For example, empty inline assembly blocks are used in \CFA and \Cpp to
72prevent excessive optimizations while adding no actual work.
73
74% We don't use catch-alls but if we did:
75% Catch-alls are done by catching the root exception type (not using \Cpp's
76% \code{C++}{catch(...)}).
77
78When collecting data, each test is run eleven times. The top three and bottom
79three results are discarded and the remaining five values are averaged.
80The test are run with the latest (still pre-release) \CFA compiler,
81using gcc-10 10.3.0 as a backend.
82g++-10 10.3.0 is used for \Cpp.
83Java tests are complied and run with Oracle OpenJDK version 11.0.11.
84Python used CPython version 3.8.10.
85The machines used to run the tests are:
86\begin{itemize}[nosep]
87\item ARM 2280 Kunpeng 920 48-core 2$\times$socket
88      \lstinline{@} 2.6 GHz running Linux v5.11.0-25
89\item AMD 6380 Abu Dhabi 16-core 4$\times$socket
90      \lstinline{@} 2.5 GHz running Linux v5.11.0-25
91\end{itemize}
92These represent the two major families of hardware architecture.
93
94\section{Tests}
95The following tests were selected to test the performance of different
96components of the exception system.
97They should provide a guide as to where the EHM's costs are found.
98
99\paragraph{Stack Traversal}
100This group of tests measures the cost of traversing the stack
101(and in termination, unwinding it).
102Inside the main loop is a call to a recursive function.
103This function calls itself F times before raising an exception.
104F is configurable from the command line, but is usually 100.
105This builds up many stack frames, and any contents they may have,
106before the raise.
107The exception is always handled at the base of the stack.
108For example the Empty test for \CFA resumption looks like:
109\begin{cfa}
110void unwind_empty(unsigned int frames) {
111        if (frames) {
112                unwind_empty(frames - 1);
113        } else {
114                throwResume (empty_exception){&empty_vt};
115        }
116}
117\end{cfa}
118Other test cases have additional code around the recursive call adding
119something besides simple stack frames to the stack.
120Note that both termination and resumption have to traverse over
121the stack but only termination has to unwind it.
122\begin{itemize}[nosep]
123% \item None:
124% Reuses the empty test code (see below) except that the number of frames
125% is set to 0 (this is the only test for which the number of frames is not
126% 100). This isolates the start-up and shut-down time of a throw.
127\item Empty:
128The repeating function is empty except for the necessary control code.
129As other traversal tests add to this, it is the baseline for the group
130as the cost comes from traversing over and unwinding a stack frame
131that has no other interactions with the exception system.
132\item Destructor:
133The repeating function creates an object with a destructor before calling
134itself.
135Comparing this to the empty test gives the time to traverse over and
136unwind a destructor.
137\item Finally:
138The repeating function calls itself inside a try block with a finally clause
139attached.
140Comparing this to the empty test gives the time to traverse over and
141unwind a finally clause.
142\item Other Handler:
143The repeating function calls itself inside a try block with a handler that
144does not match the raised exception, but is of the same kind of handler.
145This means that the EHM has to check each handler, and continue
146over all of them until it reaches the base of the stack.
147Comparing this to the empty test gives the time to traverse over and
148unwind a handler.
149\end{itemize}
150
151\paragraph{Cross Try Statement}
152This group of tests measures the cost for setting up exception handling,
153if it is
154not used because the exceptional case did not occur.
155Tests repeatedly cross (enter, execute and leave) a try statement but never
156perform a raise.
157\begin{itemize}[nosep]
158\item Handler:
159The try statement has a handler (of the appropriate kind).
160\item Finally:
161The try statement has a finally clause.
162\end{itemize}
163
164\paragraph{Conditional Matching}
165This group measures the cost of conditional matching.
166Only \CFA implements the language level conditional match,
167the other languages mimic it with an ``unconditional" match (it still
168checks the exception's type) and conditional re-raise if it is not supposed
169to handle that exception.
170
171Here is the pattern shown in \CFA and \Cpp. Java and Python use the same
172pattern as \Cpp, but with their own syntax.
173
174\begin{minipage}{0.45\textwidth}
175\begin{cfa}
176try {
177        ...
178} catch (exception_t * e ;
179                should_catch(e)) {
180        ...
181}
182\end{cfa}
183\end{minipage}
184\begin{minipage}{0.55\textwidth}
185\begin{lstlisting}[language=C++]
186try {
187        ...
188} catch (std::exception & e) {
189        if (!should_catch(e)) throw;
190        ...
191}
192\end{lstlisting}
193\end{minipage}
194\begin{itemize}[nosep]
195\item Match All:
196The condition is always true. (Always matches or never re-raises.)
197\item Match None:
198The condition is always false. (Never matches or always re-raises.)
199\end{itemize}
200
201\paragraph{Resumption Simulation}
202A slightly altered version of the Empty Traversal test is used when comparing
203resumption to fix-up routines.
204The handler, the actual resumption handler or the fix-up routine,
205always captures a variable at the base of the loop,
206and receives a reference to a variable at the raise site, either as a
207field on the exception or an argument to the fix-up routine.
208% I don't actually know why that is here but not anywhere else.
209
210%\section{Cost in Size}
211%Using exceptions also has a cost in the size of the executable.
212%Although it is sometimes ignored
213%
214%There is a size cost to defining a personality function but the later problem
215%is the LSDA which will be generated for every function.
216%
217%(I haven't actually figured out how to compare this, probably using something
218%related to -fexceptions.)
219
220\section{Results}
221% First, introduce the tables.
222\autoref{t:PerformanceTermination},
223\autoref{t:PerformanceResumption}
224and~\autoref{t:PerformanceFixupRoutines}
225show the test results.
226In cases where a feature is not supported by a language, the test is skipped
227for that language and the result is marked N/A.
228There are also cases where the feature is supported but measuring its
229cost is impossible. This happened with Java, which uses a JIT that optimizes
230away the tests and cannot be stopped.\cite{Dice21}
231These tests are marked N/C.
232To get results in a consistent range (1 second to 1 minute is ideal,
233going higher is better than going low) N, the number of iterations of the
234main loop in each test, is varied between tests. It is also given in the
235results and has a value in the millions.
236
237An anomaly in some results came from \CFA's use of GCC nested functions.
238These nested functions are used to create closures that can access stack
239variables in their lexical scope.
240However, if they do so, then they can cause the benchmark's run time to
241increase by an order of magnitude.
242The simplest solution is to make those values global variables instead
243of function-local variables.
244% Do we know if editing a global inside nested function is a problem?
245Tests that had to be modified to avoid this problem have been marked
246with a ``*'' in the results.
247
248% Now come the tables themselves:
249% You might need a wider window for this.
250
251\begin{table}[htb]
252\centering
253\caption{Termination Performance Results (sec)}
254\label{t:PerformanceTermination}
255\begin{tabular}{|r|*{2}{|r r r r|}}
256\hline
257                       & \multicolumn{4}{c||}{AMD}         & \multicolumn{4}{c|}{ARM}  \\
258\cline{2-9}
259N\hspace{8pt}          & \multicolumn{1}{c}{\CFA} & \multicolumn{1}{c}{\Cpp} & \multicolumn{1}{c}{Java} & \multicolumn{1}{c||}{Python} &
260                         \multicolumn{1}{c}{\CFA} & \multicolumn{1}{c}{\Cpp} & \multicolumn{1}{c}{Java} & \multicolumn{1}{c|}{Python} \\
261\hline
262Empty Traversal (1M)   & 23.0  & 9.6   & 17.6  & 23.4      & 30.6  & 13.6  & 15.5  & 14.7  \\
263D'tor Traversal (1M)   & 48.1  & 23.5  & N/A   & N/A       & 64.2  & 29.2  & N/A   & N/A   \\
264Finally Traversal (1M) & 3.2*  & N/A   & 17.6  & 29.2      & 3.9*  & N/A   & 15.5  & 19.0  \\
265Other Traversal (1M)   & 3.3*  & 23.9  & 17.7  & 32.8      & 3.9*  & 24.5  & 15.5  & 21.6  \\
266Cross Handler (1B)     & 6.5   & 0.9   & N/C   & 38.0      & 9.6   & 0.8   & N/C   & 32.1  \\
267Cross Finally (1B)     & 0.8   & N/A   & N/C   & 44.6      & 0.6   & N/A   & N/C   & 37.3  \\
268Match All (10M)        & 30.5  & 20.6  & 11.2  & 3.9       & 36.9  & 24.6  & 10.7  & 3.1   \\
269Match None (10M)       & 30.6  & 50.9  & 11.2  & 5.0       & 36.9  & 71.9  & 10.7  & 4.1   \\
270\hline
271\end{tabular}
272\end{table}
273
274\begin{table}[htb]
275\centering
276\caption{Resumption Performance Results (sec)}
277\label{t:PerformanceResumption}
278\begin{tabular}{|r||r||r|}
279\hline
280N\hspace{8pt}
281                        & AMD     & ARM  \\
282\hline
283Empty Traversal (10M)   & 1.4     & 1.2  \\
284D'tor Traversal (10M)   & 1.8     & 1.0  \\
285Finally Traversal (10M) & 1.8     & 1.0  \\
286Other Traversal (10M)   & 22.6    & 25.8 \\
287Cross Handler (1B)      & 9.0     & 11.9 \\
288Match All (100M)        & 2.3     & 3.2  \\
289Match None (100M)       & 3.0     & 3.8  \\
290\hline
291\end{tabular}
292\end{table}
293
294\begin{table}[htb]
295\centering
296\small
297\caption{Resumption/Fixup Routine Comparison (sec)}
298\label{t:PerformanceFixupRoutines}
299\setlength{\tabcolsep}{5pt}
300\begin{tabular}{|r|*{2}{|r r r r r|}}
301\hline
302            & \multicolumn{5}{c||}{AMD}     & \multicolumn{5}{c|}{ARM}  \\
303\cline{2-11}
304N\hspace{8pt}       & \multicolumn{1}{c}{Raise} & \multicolumn{1}{c}{\CFA} & \multicolumn{1}{c}{\Cpp} & \multicolumn{1}{c}{Java} & \multicolumn{1}{c||}{Python} &
305              \multicolumn{1}{c}{Raise} & \multicolumn{1}{c}{\CFA} & \multicolumn{1}{c}{\Cpp} & \multicolumn{1}{c}{Java} & \multicolumn{1}{c|}{Python} \\
306\hline
307Resume Empty (10M)  & 1.4 & 1.4 & 15.4 & 2.3 & 178.0  & 1.2 & 1.2 & 8.9 & 1.2 & 118.4 \\
308\hline
309\end{tabular}
310\end{table}
311
312% Now discuss the results in the tables.
313One result not directly related to \CFA but important to keep in mind is that,
314for exceptions, the standard intuition about which languages should go
315faster often does not hold.
316For example, there are a few cases where Python out-performs
317\CFA, \Cpp and Java.
318% To be exact, the Match All and Match None cases.
319The most likely explination is that,
320the generally faster languages have made ``common cases fast" at the expense
321of the rarer cases. Since exceptions are considered rare, they are made
322expensive to help speed up common actions, such as entering and leaving try
323statements.
324Python on the other hand, while generally slower than the other languages,
325uses exceptions more and has not scarified their performance.
326In addition, languages with high-level representations have a much
327easier time scanning the stack as there is less to decode.
328
329As stated,
330the performance tests are not attempting to show \CFA has a new competitive
331way of implementing exception handling.
332The only performance requirement is to insure the \CFA EHM has reasonable
333performance for prototyping.
334Although that may be hard to exactly quantify, I believe it has succeeded
335in that regard.
336Details on the different test cases follow.
337
338\subsection{Termination \texorpdfstring{(\autoref{t:PerformanceTermination})}{}}
339
340\begin{description}
341\item[Empty Traversal]
342\CFA is slower than \Cpp, but is still faster than the other languages
343and closer to \Cpp than other languages.
344This result is to be expected,
345as \CFA is closer to \Cpp than the other languages.
346
347\item[D'tor Traversal]
348Running destructors causes a huge slowdown in the two languages that support
349them. \CFA has a higher proportionate slowdown but it is similar to \Cpp's.
350Considering the amount of work done in destructors is effectively zero
351(an assembly comment), the cost
352must come from the change of context required to run the destructor.
353
354\item[Finally Traversal]
355Performance is similar to Empty Traversal in all languages that support finally
356clauses. Only Python seems to have a larger than random noise change in
357its run time and it is still not large.
358Despite the similarity between finally clauses and destructors,
359finally clauses seem to avoid the spike that run time destructors have.
360Possibly some optimization removes the cost of changing contexts.
361
362\item[Other Traversal]
363For \Cpp, stopping to check if a handler applies seems to be about as
364expensive as stopping to run a destructor.
365This results in a significant jump.
366
367Other languages experience a small increase in run time.
368The small increase likely comes from running the checks,
369but they could avoid the spike by not having the same kind of overhead for
370switching to the check's context.
371
372\item[Cross Handler]
373Here, \CFA falls behind \Cpp by a much more significant margin.
374This is likely due to the fact that \CFA has to insert two extra function
375calls, while \Cpp does not have to execute any other instructions.
376Python is much further behind.
377
378\item[Cross Finally]
379\CFA's performance now matches \Cpp's from Cross Handler.
380If the code from the finally clause is being inlined,
381which is just an asm comment, than there are no additional instructions
382to execute again when exiting the try statement normally.
383
384\item[Conditional Match]
385Both of the conditional matching tests can be considered on their own.
386However, for evaluating the value of conditional matching itself, the
387comparison of the two sets of results is useful.
388Consider the massive jump in run time for \Cpp going from match all to match
389none, which none of the other languages have.
390Some strange interaction is causing run time to more than double for doing
391twice as many raises.
392Java and Python avoid this problem and have similar run time for both tests,
393possibly through resource reuse or their program representation.
394However, \CFA is built like \Cpp, and avoids the problem as well.
395This matches
396the pattern of the conditional match, which makes the two execution paths
397very similar.
398
399\end{description}
400
401\subsection{Resumption \texorpdfstring{(\autoref{t:PerformanceResumption})}{}}
402
403Moving on to resumption, there is one general note:
404resumption is \textit{fast}. The only test where it fell
405behind termination is Cross Handler.
406In every other case, the number of iterations had to be increased by a
407factor of 10 to get the run time in an appropriate range
408and in some cases resumption still took less time.
409
410% I tried \paragraph and \subparagraph, maybe if I could adjust spacing
411% between paragraphs those would work.
412\begin{description}
413\item[Empty Traversal]
414See above for the general speed-up notes.
415This result is not surprising as resumption's linked-list approach
416means that traversing over stack frames without a resumption handler is
417$O(1)$.
418
419\item[D'tor Traversal]
420Resumption does have the same spike in run time that termination has.
421The run time is actually very similar to Finally Traversal.
422As resumption does not unwind the stack, both destructors and finally
423clauses are run while walking down the stack during the recursive returns.
424So it follows their performance is similar.
425
426\item[Finally Traversal]
427Same as D'tor Traversal,
428except termination did not have a spike in run time on this test case.
429
430\item[Other Traversal]
431Traversing across handlers reduces resumption's advantage as it actually
432has to stop and check each one.
433Resumption still came out ahead (adjusting for iterations) but by much less
434than the other cases.
435
436\item[Cross Handler]
437The only test case where resumption could not keep up with termination,
438although the difference is not as significant as many other cases.
439It is simply a matter of where the costs come from:
440both termination and resumption have some work to set up or tear down a
441handler. It just so happens that resumption's work is slightly slower.
442
443\item[Conditional Match]
444Resumption shows a slight slowdown if the exception is not matched
445by the first handler, which follows from the fact the second handler now has
446to be checked. However, the difference is not large.
447
448\end{description}
449
450\subsection{Resumption/Fixup \texorpdfstring{(\autoref{t:PerformanceFixupRoutines})}{}}
451
452Finally are the results of the resumption/fixup routine comparison.
453These results are surprisingly varied. It is possible that creating a closure
454has more to do with performance than passing the argument through layers of
455calls.
456At 100 stack frames, resumption and manual fixup routines have similar
457performance in \CFA.
458More experiments could try to tease out the exact trade-offs,
459but the prototype's only performance goal is to be reasonable.
460It is already in that range, and \CFA's fixup routine simulation is
461one of the faster simulations as well.
462Plus, exceptions add features and remove syntactic overhead,
463so even at similar performance, resumptions have advantages
464over fixup routines.
Note: See TracBrowser for help on using the repository browser.