Context Navigation

source: doc/theses/andrew_beach_MMath/performance.tex @ 6aa84e0

ADTast-experimentalenumforall-pointer-decaypthread-emulationqualifiedEnum

Last change on this file since 6aa84e0 was 6aa84e0, checked in by Andrew Beach <ajbeach@…>, 3 years ago
Andrew MMath: Removed (updated one) the remaining \todo items.
Property mode set to `100644`
File size: 19.2 KB

Rev	Line
[dac16a0]	1	\chapter{Performance}
	2	\label{c:performance}
	3
[cfbab07]	4	Performance is of secondary importance for most of this project.
	5	Instead, the focus was to get the features working. The only performance
	6	requirement is to ensure the tests for correctness run in a reasonable
[9cdfa5fb]	7	amount of time. Hence, only a few basic performance tests were performed to
[cfbab07]	8	check this requirement.
[b51e389c]	9
	10	\section{Test Set-Up}
[cfbab07]	11	Tests were run in \CFA, C++, Java and Python.
[9698690]	12	In addition there are two sets of tests for \CFA,
[0477127]	13	one with termination and one with resumption.
[dac16a0]	14
[9cdfa5fb]	15	GCC C++ is the most comparable language because both it and \CFA use the same
[dac16a0]	16	framework, libunwind.
[cfbab07]	17	In fact, the comparison is almost entirely in quality of implementation.
	18	Specifically, \CFA's EHM has had significantly less time to be optimized and
[dac16a0]	19	does not generate its own assembly. It does have a slight advantage in that
[cfbab07]	20	\Cpp has to do some extra bookkeeping to support its utility functions,
	21	but otherwise \Cpp should have a significant advantage.
[dac16a0]	22
[0477127]	23	Java, a popular language with similar termination semantics,
	24	is implemented in a very different environment, a virtual machine with
[029cbc0]	25	garbage collection.
[7372065]	26	It also implements the finally clause on try blocks allowing for a direct
[029cbc0]	27	feature-to-feature comparison.
[cfbab07]	28	As with \Cpp, Java's implementation is mature, has more optimizations
	29	and extra features as compared to \CFA.
[9698690]	30
[cfbab07]	31	Python is used as an alternative comparison because of the \CFA EHM's
	32	current performance goals, which is to not be prohibitively slow while the
[9698690]	33	features are designed and examined. Python has similar performance goals for
	34	creating quick scripts and its wide use suggests it has achieved those goals.
	35
[cfbab07]	36	Unfortunately, there are no notable modern programming languages with
	37	resumption exceptions. Even the older programming languages with resumption
	38	seem to be notable only for having resumption.
	39	Instead, resumption is compared to its simulation in other programming
[0477127]	40	languages: fixup functions that are explicitly passed into a function.
[dac16a0]	41
[cfbab07]	42	All tests are run inside a main loop that repeatedly performs a test.
	43	This approach avoids start-up or tear-down time from
[dac16a0]	44	affecting the timing results.
[0477127]	45	The number of times the loop is run is configurable from the command line;
[cfbab07]	46	the number used in the timing runs is given with the results per test.
[0477127]	47	The Java tests run the main loop 1000 times before
[9cdfa5fb]	48	beginning the actual test to ``warm up" the JVM.
[cfbab07]	49	% All other languages are precompiled or interpreted.
[9698690]	50
	51	Timing is done internally, with time measured immediately before and
[cfbab07]	52	after the test loop. The difference is calculated and printed.
[9698690]	53	The loop structure and internal timing means it is impossible to test
	54	unhandled exceptions in \Cpp and Java as that would cause the process to
	55	terminate.
[9cdfa5fb]	56	Luckily, performance on the ``give up and kill the process" path is not
[9698690]	57	critical.
[dac16a0]	58
[cfbab07]	59	The exceptions used in these tests are always based off of
	60	the base exception for the language.
	61	This requirement minimizes performance differences based
	62	on the object model used to represent the exception.
[9698690]	63
[cfbab07]	64	All tests are designed to be as minimal as possible, while still preventing
	65	excessive optimizations.
[9698690]	66	For example, empty inline assembly blocks are used in \CFA and \Cpp to
	67	prevent excessive optimizations while adding no actual work.
[dac16a0]	68
[9698690]	69	% We don't use catch-alls but if we did:
	70	% Catch-alls are done by catching the root exception type (not using \Cpp's
	71	% \code{C++}{catch(...)}).
[dac16a0]	72
[0477127]	73	When collecting data, each test is run eleven times. The top three and bottom
[cfbab07]	74	three results are discarded and the remaining five values are averaged.
[0477127]	75	The test are run with the latest (still pre-release) \CFA compiler,
[cd03b76d]	76	using gcc-10 10.3.0 as a backend.
	77	g++-10 10.3.0 is used for \Cpp.
[9cdfa5fb]	78	Java tests are complied and run with Oracle OpenJDK version 11.0.11.
	79	Python used CPython version 3.8.10.
[cfbab07]	80	The machines used to run the tests are:
	81	\begin{itemize}[nosep]
	82	\item ARM 2280 Kunpeng 920 48-core 2$\times$socket
	83	\lstinline{@} 2.6 GHz running Linux v5.11.0-25
	84	\item AMD 6380 Abu Dhabi 16-core 4$\times$socket
	85	\lstinline{@} 2.5 GHz running Linux v5.11.0-25
	86	\end{itemize}
[9cdfa5fb]	87	These represent the two major families of hardware architecture.
[cfbab07]	88
[b51e389c]	89	\section{Tests}
[029cbc0]	90	The following tests were selected to test the performance of different
	91	components of the exception system.
[cfbab07]	92	They should provide a guide as to where the EHM's costs are found.
	93
	94	\paragraph{Stack Traversal}
[9cdfa5fb]	95	This group of tests measures the cost of traversing the stack
[cfbab07]	96	(and in termination, unwinding it).
	97	Inside the main loop is a call to a recursive function.
	98	This function calls itself F times before raising an exception.
	99	F is configurable from the command line, but is usually 100.
	100	This builds up many stack frames, and any contents they may have,
	101	before the raise.
	102	The exception is always handled at the base of the stack.
	103	For example the Empty test for \CFA resumption looks like:
	104	\begin{cfa}
	105	void unwind_empty(unsigned int frames) {
	106	if (frames) {
	107	unwind_empty(frames - 1);
	108	} else {
	109	throwResume (empty_exception){&empty_vt};
	110	}
	111	}
	112	\end{cfa}
[0477127]	113	Other test cases have additional code around the recursive call adding
[cfbab07]	114	something besides simple stack frames to the stack.
[0477127]	115	Note that both termination and resumption have to traverse over
[cfbab07]	116	the stack but only termination has to unwind it.
[9698690]	117	\begin{itemize}[nosep]
[cfbab07]	118	% \item None:
	119	% Reuses the empty test code (see below) except that the number of frames
	120	% is set to 0 (this is the only test for which the number of frames is not
	121	% 100). This isolates the start-up and shut-down time of a throw.
	122	\item Empty:
[7372065]	123	The repeating function is empty except for the necessary control code.
[0477127]	124	As other traversal tests add to this, it is the baseline for the group
[cfbab07]	125	as the cost comes from traversing over and unwinding a stack frame
	126	that has no other interactions with the exception system.
[ea593a3]	127	\item Destructor:
[7372065]	128	The repeating function creates an object with a destructor before calling
	129	itself.
[0477127]	130	Comparing this to the empty test gives the time to traverse over and
[cfbab07]	131	unwind a destructor.
[ea593a3]	132	\item Finally:
[7372065]	133	The repeating function calls itself inside a try block with a finally clause
	134	attached.
[0477127]	135	Comparing this to the empty test gives the time to traverse over and
[cfbab07]	136	unwind a finally clause.
[ea593a3]	137	\item Other Handler:
[7372065]	138	The repeating function calls itself inside a try block with a handler that
[0477127]	139	does not match the raised exception, but is of the same kind of handler.
	140	This means that the EHM has to check each handler, and continue
	141	over all of them until it reaches the base of the stack.
	142	Comparing this to the empty test gives the time to traverse over and
[cfbab07]	143	unwind a handler.
[ea593a3]	144	\end{itemize}
	145
[7372065]	146	\paragraph{Cross Try Statement}
[0477127]	147	This group of tests measures the cost for setting up exception handling,
	148	if it is
[9cdfa5fb]	149	not used because the exceptional case did not occur.
[0477127]	150	Tests repeatedly cross (enter, execute and leave) a try statement but never
	151	perform a raise.
[9698690]	152	\begin{itemize}[nosep]
[ea593a3]	153	\item Handler:
[cfbab07]	154	The try statement has a handler (of the appropriate kind).
[ea593a3]	155	\item Finally:
[262deda0]	156	The try statement has a finally clause.
[ea593a3]	157	\end{itemize}
	158
	159	\paragraph{Conditional Matching}
[cfbab07]	160	This group measures the cost of conditional matching.
[ea593a3]	161	Only \CFA implements the language level conditional match,
[cfbab07]	162	the other languages mimic it with an ``unconditional" match (it still
	163	checks the exception's type) and conditional re-raise if it is not supposed
[9698690]	164	to handle that exception.
[cfbab07]	165
[0477127]	166	Here is the pattern shown in \CFA and \Cpp. Java and Python use the same
[cfbab07]	167	pattern as \Cpp, but with their own syntax.
	168
	169	\begin{minipage}{0.45\textwidth}
	170	\begin{cfa}
	171	try {
	172	...
	173	} catch (exception_t * e ;
	174	should_catch(e)) {
	175	...
	176	}
	177	\end{cfa}
	178	\end{minipage}
	179	\begin{minipage}{0.55\textwidth}
	180	\begin{lstlisting}[language=C++]
	181	try {
	182	...
	183	} catch (std::exception & e) {
	184	if (!should_catch(e)) throw;
	185	...
	186	}
	187	\end{lstlisting}
	188	\end{minipage}
[9698690]	189	\begin{itemize}[nosep]
	190	\item Match All:
[ea593a3]	191	The condition is always true. (Always matches or never re-raises.)
[9698690]	192	\item Match None:
[ea593a3]	193	The condition is always false. (Never matches or always re-raises.)
	194	\end{itemize}
[dac16a0]	195
[cfbab07]	196	\paragraph{Resumption Simulation}
	197	A slightly altered version of the Empty Traversal test is used when comparing
	198	resumption to fix-up routines.
	199	The handler, the actual resumption handler or the fix-up routine,
	200	always captures a variable at the base of the loop,
	201	and receives a reference to a variable at the raise site, either as a
	202	field on the exception or an argument to the fix-up routine.
	203	% I don't actually know why that is here but not anywhere else.
	204
[dac16a0]	205	%\section{Cost in Size}
	206	%Using exceptions also has a cost in the size of the executable.
	207	%Although it is sometimes ignored
	208	%
	209	%There is a size cost to defining a personality function but the later problem
	210	%is the LSDA which will be generated for every function.
	211	%
	212	%(I haven't actually figured out how to compare this, probably using something
	213	%related to -fexceptions.)
[029cbc0]	214
[9698690]	215	\section{Results}
[cfbab07]	216	% First, introduce the tables.
	217	\autoref{t:PerformanceTermination},
	218	\autoref{t:PerformanceResumption}
	219	and~\autoref{t:PerformanceFixupRoutines}
	220	show the test results.
	221	In cases where a feature is not supported by a language, the test is skipped
	222	for that language and the result is marked N/A.
	223	There are also cases where the feature is supported but measuring its
[9cdfa5fb]	224	cost is impossible. This happened with Java, which uses a JIT that optimizes
	225	away the tests and cannot be stopped.\cite{Dice21}
[cfbab07]	226	These tests are marked N/C.
	227	To get results in a consistent range (1 second to 1 minute is ideal,
	228	going higher is better than going low) N, the number of iterations of the
	229	main loop in each test, is varied between tests. It is also given in the
[0477127]	230	results and has a value in the millions.
[cfbab07]	231
[9cdfa5fb]	232	An anomaly in some results came from \CFA's use of GCC nested functions.
[cfbab07]	233	These nested functions are used to create closures that can access stack
	234	variables in their lexical scope.
[9cdfa5fb]	235	However, if they do so, then they can cause the benchmark's run time to
[cfbab07]	236	increase by an order of magnitude.
	237	The simplest solution is to make those values global variables instead
[9cdfa5fb]	238	of function-local variables.
[cfbab07]	239	% Do we know if editing a global inside nested function is a problem?
	240	Tests that had to be modified to avoid this problem have been marked
	241	with a ``*'' in the results.
	242
	243	% Now come the tables themselves:
	244	% You might need a wider window for this.
	245
	246	\begin{table}[htb]
	247	\centering
	248	\caption{Termination Performance Results (sec)}
	249	\label{t:PerformanceTermination}
	250	\begin{tabular}{\|r\|*{2}{\|r r r r\|}}
[7372065]	251	\hline
[cfbab07]	252	& \multicolumn{4}{c\|\|}{AMD} & \multicolumn{4}{c\|}{ARM} \\
	253	\cline{2-9}
	254	N\hspace{8pt} & \multicolumn{1}{c}{\CFA} & \multicolumn{1}{c}{\Cpp} & \multicolumn{1}{c}{Java} & \multicolumn{1}{c\|\|}{Python} &
	255	\multicolumn{1}{c}{\CFA} & \multicolumn{1}{c}{\Cpp} & \multicolumn{1}{c}{Java} & \multicolumn{1}{c\|}{Python} \\
[7372065]	256	\hline
[140eb16]	257	Empty Traversal (1M) & 23.0 & 9.6 & 17.6 & 23.4 & 30.6 & 13.6 & 15.5 & 14.7 \\
	258	D'tor Traversal (1M) & 48.1 & 23.5 & N/A & N/A & 64.2 & 29.2 & N/A & N/A \\
	259	Finally Traversal (1M) & 3.2* & N/A & 17.6 & 29.2 & 3.9* & N/A & 15.5 & 19.0 \\
	260	Other Traversal (1M) & 3.3* & 23.9 & 17.7 & 32.8 & 3.9* & 24.5 & 15.5 & 21.6 \\
	261	Cross Handler (1B) & 6.5 & 0.9 & N/C & 38.0 & 9.6 & 0.8 & N/C & 32.1 \\
	262	Cross Finally (1B) & 0.8 & N/A & N/C & 44.6 & 0.6 & N/A & N/C & 37.3 \\
	263	Match All (10M) & 30.5 & 20.6 & 11.2 & 3.9 & 36.9 & 24.6 & 10.7 & 3.1 \\
	264	Match None (10M) & 30.6 & 50.9 & 11.2 & 5.0 & 36.9 & 71.9 & 10.7 & 4.1 \\
[7372065]	265	\hline
	266	\end{tabular}
[cfbab07]	267	\end{table}
[7372065]	268
[cfbab07]	269	\begin{table}[htb]
	270	\centering
	271	\caption{Resumption Performance Results (sec)}
	272	\label{t:PerformanceResumption}
	273	\begin{tabular}{\|r\|\|r\|\|r\|}
[0b67a19]	274	\hline
[cfbab07]	275	N\hspace{8pt}
	276	& AMD & ARM \\
[0b67a19]	277	\hline
[140eb16]	278	Empty Traversal (10M) & 1.4 & 1.2 \\
[cfbab07]	279	D'tor Traversal (10M) & 1.8 & 1.0 \\
[140eb16]	280	Finally Traversal (10M) & 1.8 & 1.0 \\
	281	Other Traversal (10M) & 22.6 & 25.8 \\
	282	Cross Handler (1B) & 9.0 & 11.9 \\
[cfbab07]	283	Match All (100M) & 2.3 & 3.2 \\
[140eb16]	284	Match None (100M) & 3.0 & 3.8 \\
[0b67a19]	285	\hline
	286	\end{tabular}
[cfbab07]	287	\end{table}
[262deda0]	288
[cfbab07]	289	\begin{table}[htb]
	290	\centering
	291	\small
	292	\caption{Resumption/Fixup Routine Comparison (sec)}
	293	\label{t:PerformanceFixupRoutines}
	294	\setlength{\tabcolsep}{5pt}
	295	\begin{tabular}{\|r\|*{2}{\|r r r r r\|}}
	296	\hline
	297	& \multicolumn{5}{c\|\|}{AMD} & \multicolumn{5}{c\|}{ARM} \\
	298	\cline{2-11}
	299	N\hspace{8pt} & \multicolumn{1}{c}{Raise} & \multicolumn{1}{c}{\CFA} & \multicolumn{1}{c}{\Cpp} & \multicolumn{1}{c}{Java} & \multicolumn{1}{c\|\|}{Python} &
	300	\multicolumn{1}{c}{Raise} & \multicolumn{1}{c}{\CFA} & \multicolumn{1}{c}{\Cpp} & \multicolumn{1}{c}{Java} & \multicolumn{1}{c\|}{Python} \\
	301	\hline
[140eb16]	302	Resume Empty (10M) & 1.4 & 1.4 & 15.4 & 2.3 & 178.0 & 1.2 & 1.2 & 8.9 & 1.2 & 118.4 \\
[cfbab07]	303	\hline
	304	\end{tabular}
	305	\end{table}
	306
	307	% Now discuss the results in the tables.
	308	One result not directly related to \CFA but important to keep in mind is that,
[0477127]	309	for exceptions, the standard intuition about which languages should go
[cfbab07]	310	faster often does not hold.
	311	For example, there are a few cases where Python out-performs
	312	\CFA, \Cpp and Java.
[cd03b76d]	313	% To be exact, the Match All and Match None cases.
[6aa84e0]	314	The most likely explination is that,
	315	the generally faster languages have made ``common cases fast" at the expense
	316	of the rarer cases. Since exceptions are considered rare, they are made
	317	expensive to help speed up common actions, such as entering and leaving try
	318	statements.
	319	Python on the other hand, while generally slower than the other languages,
	320	uses exceptions more and has not scarified their performance.
[cfbab07]	321	In addition, languages with high-level representations have a much
	322	easier time scanning the stack as there is less to decode.
	323
	324	As stated,
	325	the performance tests are not attempting to show \CFA has a new competitive
	326	way of implementing exception handling.
	327	The only performance requirement is to insure the \CFA EHM has reasonable
	328	performance for prototyping.
[0477127]	329	Although that may be hard to exactly quantify, I believe it has succeeded
[cfbab07]	330	in that regard.
	331	Details on the different test cases follow.
	332
[0477127]	333	\subsection{Termination \texorpdfstring{(\autoref{t:PerformanceTermination})}{}}
	334
[cfbab07]	335	\begin{description}
	336	\item[Empty Traversal]
	337	\CFA is slower than \Cpp, but is still faster than the other languages
	338	and closer to \Cpp than other languages.
[0477127]	339	This result is to be expected,
	340	as \CFA is closer to \Cpp than the other languages.
[cfbab07]	341
	342	\item[D'tor Traversal]
[0477127]	343	Running destructors causes a huge slowdown in the two languages that support
[cfbab07]	344	them. \CFA has a higher proportionate slowdown but it is similar to \Cpp's.
[0477127]	345	Considering the amount of work done in destructors is effectively zero
	346	(an assembly comment), the cost
	347	must come from the change of context required to run the destructor.
[cfbab07]	348
	349	\item[Finally Traversal]
[0477127]	350	Performance is similar to Empty Traversal in all languages that support finally
[cfbab07]	351	clauses. Only Python seems to have a larger than random noise change in
[9cdfa5fb]	352	its run time and it is still not large.
[cfbab07]	353	Despite the similarity between finally clauses and destructors,
[9cdfa5fb]	354	finally clauses seem to avoid the spike that run time destructors have.
[cfbab07]	355	Possibly some optimization removes the cost of changing contexts.
	356
	357	\item[Other Traversal]
	358	For \Cpp, stopping to check if a handler applies seems to be about as
	359	expensive as stopping to run a destructor.
	360	This results in a significant jump.
	361
[9cdfa5fb]	362	Other languages experience a small increase in run time.
[cfbab07]	363	The small increase likely comes from running the checks,
	364	but they could avoid the spike by not having the same kind of overhead for
	365	switching to the check's context.
	366
	367	\item[Cross Handler]
[9cdfa5fb]	368	Here, \CFA falls behind \Cpp by a much more significant margin.
	369	This is likely due to the fact that \CFA has to insert two extra function
	370	calls, while \Cpp does not have to execute any other instructions.
[cfbab07]	371	Python is much further behind.
	372
	373	\item[Cross Finally]
	374	\CFA's performance now matches \Cpp's from Cross Handler.
	375	If the code from the finally clause is being inlined,
[0477127]	376	which is just an asm comment, than there are no additional instructions
[cfbab07]	377	to execute again when exiting the try statement normally.
	378
	379	\item[Conditional Match]
[0477127]	380	Both of the conditional matching tests can be considered on their own.
[9cdfa5fb]	381	However, for evaluating the value of conditional matching itself, the
[cfbab07]	382	comparison of the two sets of results is useful.
[9cdfa5fb]	383	Consider the massive jump in run time for \Cpp going from match all to match
[cfbab07]	384	none, which none of the other languages have.
[9cdfa5fb]	385	Some strange interaction is causing run time to more than double for doing
[cfbab07]	386	twice as many raises.
[9cdfa5fb]	387	Java and Python avoid this problem and have similar run time for both tests,
[cfbab07]	388	possibly through resource reuse or their program representation.
[9cdfa5fb]	389	However, \CFA is built like \Cpp, and avoids the problem as well.
	390	This matches
[0477127]	391	the pattern of the conditional match, which makes the two execution paths
	392	very similar.
[cfbab07]	393
	394	\end{description}
	395
[0477127]	396	\subsection{Resumption \texorpdfstring{(\autoref{t:PerformanceResumption})}{}}
	397
[9cdfa5fb]	398	Moving on to resumption, there is one general note:
[0477127]	399	resumption is \textit{fast}. The only test where it fell
[cfbab07]	400	behind termination is Cross Handler.
	401	In every other case, the number of iterations had to be increased by a
[9cdfa5fb]	402	factor of 10 to get the run time in an appropriate range
[cfbab07]	403	and in some cases resumption still took less time.
	404
	405	% I tried \paragraph and \subparagraph, maybe if I could adjust spacing
	406	% between paragraphs those would work.
	407	\begin{description}
	408	\item[Empty Traversal]
	409	See above for the general speed-up notes.
[0477127]	410	This result is not surprising as resumption's linked-list approach
[cfbab07]	411	means that traversing over stack frames without a resumption handler is
	412	$O(1)$.
	413
	414	\item[D'tor Traversal]
[9cdfa5fb]	415	Resumption does have the same spike in run time that termination has.
	416	The run time is actually very similar to Finally Traversal.
[0477127]	417	As resumption does not unwind the stack, both destructors and finally
	418	clauses are run while walking down the stack during the recursive returns.
[cfbab07]	419	So it follows their performance is similar.
	420
	421	\item[Finally Traversal]
[0477127]	422	Same as D'tor Traversal,
[9cdfa5fb]	423	except termination did not have a spike in run time on this test case.
[cfbab07]	424
	425	\item[Other Traversal]
	426	Traversing across handlers reduces resumption's advantage as it actually
	427	has to stop and check each one.
	428	Resumption still came out ahead (adjusting for iterations) but by much less
	429	than the other cases.
	430
	431	\item[Cross Handler]
	432	The only test case where resumption could not keep up with termination,
	433	although the difference is not as significant as many other cases.
[9cdfa5fb]	434	It is simply a matter of where the costs come from:
	435	both termination and resumption have some work to set up or tear down a
[0477127]	436	handler. It just so happens that resumption's work is slightly slower.
[cfbab07]	437
	438	\item[Conditional Match]
	439	Resumption shows a slight slowdown if the exception is not matched
	440	by the first handler, which follows from the fact the second handler now has
[9cdfa5fb]	441	to be checked. However, the difference is not large.
[cfbab07]	442
	443	\end{description}
	444
[0477127]	445	\subsection{Resumption/Fixup \texorpdfstring{(\autoref{t:PerformanceFixupRoutines})}{}}
	446
[cfbab07]	447	Finally are the results of the resumption/fixup routine comparison.
[0477127]	448	These results are surprisingly varied. It is possible that creating a closure
[cfbab07]	449	has more to do with performance than passing the argument through layers of
	450	calls.
[0477127]	451	At 100 stack frames, resumption and manual fixup routines have similar
	452	performance in \CFA.
	453	More experiments could try to tease out the exact trade-offs,
	454	but the prototype's only performance goal is to be reasonable.
[9cdfa5fb]	455	It is already in that range, and \CFA's fixup routine simulation is
[0477127]	456	one of the faster simulations as well.
[9cdfa5fb]	457	Plus, exceptions add features and remove syntactic overhead,
	458	so even at similar performance, resumptions have advantages
[0477127]	459	over fixup routines.

Note: See TracBrowser for help on using the repository browser.

Download in other formats: