Context Navigation

performance.tex@ 7737c29

Visit:

ADT ast-experimental enum forall-pointer-decay jacob/cs343-translation pthread-emulation qualifiedEnum

Last change on this file since 7737c29 was 7737c29, checked in by Andrew Beach <ajbeach@…>, 4 years ago

Revert "proofread chapter performance.tex", updates have been saved.

This reverts commit 9cb6514c72783e65308b91bda966d51674a21e8a.

Property mode set to 100644

File size: 18.6 KB

Rev	Line
[dac16a0]	1	\chapter{Performance}
	2	\label{c:performance}
	3
[cfbab07]	4	Performance is of secondary importance for most of this project.
	5	Instead, the focus was to get the features working. The only performance
	6	requirement is to ensure the tests for correctness run in a reasonable
	7	amount of time. Hence, a few basic performance tests were performed to
	8	check this requirement.
[b51e389c]	9
	10	\section{Test Set-Up}
[cfbab07]	11	Tests were run in \CFA, C++, Java and Python.
[9698690]	12	In addition there are two sets of tests for \CFA,
[7737c29]	13	one for termination and once with resumption.
[dac16a0]	14
	15	C++ is the most comparable language because both it and \CFA use the same
	16	framework, libunwind.
[cfbab07]	17	In fact, the comparison is almost entirely in quality of implementation.
	18	Specifically, \CFA's EHM has had significantly less time to be optimized and
[dac16a0]	19	does not generate its own assembly. It does have a slight advantage in that
[cfbab07]	20	\Cpp has to do some extra bookkeeping to support its utility functions,
	21	but otherwise \Cpp should have a significant advantage.
[dac16a0]	22
[7737c29]	23	Java a popular language with similar termination semantics, but
[cfbab07]	24	it is implemented in a very different environment, a virtual machine with
[029cbc0]	25	garbage collection.
[7372065]	26	It also implements the finally clause on try blocks allowing for a direct
[029cbc0]	27	feature-to-feature comparison.
[cfbab07]	28	As with \Cpp, Java's implementation is mature, has more optimizations
	29	and extra features as compared to \CFA.
[9698690]	30
[cfbab07]	31	Python is used as an alternative comparison because of the \CFA EHM's
	32	current performance goals, which is to not be prohibitively slow while the
[9698690]	33	features are designed and examined. Python has similar performance goals for
	34	creating quick scripts and its wide use suggests it has achieved those goals.
	35
[cfbab07]	36	Unfortunately, there are no notable modern programming languages with
	37	resumption exceptions. Even the older programming languages with resumption
	38	seem to be notable only for having resumption.
	39	Instead, resumption is compared to its simulation in other programming
[7737c29]	40	languages: fixup functions that are explicity passed into a function.
[dac16a0]	41
[cfbab07]	42	All tests are run inside a main loop that repeatedly performs a test.
	43	This approach avoids start-up or tear-down time from
[dac16a0]	44	affecting the timing results.
[7737c29]	45	The number of times the loop is run is configurable from the command line,
[cfbab07]	46	the number used in the timing runs is given with the results per test.
[7737c29]	47	Tests ran their main loop a million times.
	48	The Java tests runs the main loop 1000 times before
[cfbab07]	49	beginning the actual test to ``warm-up" the JVM.
	50	% All other languages are precompiled or interpreted.
[9698690]	51
	52	Timing is done internally, with time measured immediately before and
[cfbab07]	53	after the test loop. The difference is calculated and printed.
[9698690]	54	The loop structure and internal timing means it is impossible to test
	55	unhandled exceptions in \Cpp and Java as that would cause the process to
	56	terminate.
	57	Luckily, performance on the ``give-up and kill the process" path is not
	58	critical.
[dac16a0]	59
[cfbab07]	60	The exceptions used in these tests are always based off of
	61	the base exception for the language.
	62	This requirement minimizes performance differences based
	63	on the object model used to represent the exception.
[9698690]	64
[cfbab07]	65	All tests are designed to be as minimal as possible, while still preventing
	66	excessive optimizations.
[9698690]	67	For example, empty inline assembly blocks are used in \CFA and \Cpp to
	68	prevent excessive optimizations while adding no actual work.
[dac16a0]	69
[9698690]	70	% We don't use catch-alls but if we did:
	71	% Catch-alls are done by catching the root exception type (not using \Cpp's
	72	% \code{C++}{catch(...)}).
[dac16a0]	73
[7737c29]	74	When collecting data each test is run eleven times. The top three and bottom
[cfbab07]	75	three results are discarded and the remaining five values are averaged.
[7737c29]	76	The test are run with the latest (still pre-release) \CFA compiler was used,
[cfbab07]	77	using gcc-10 as a backend.
	78	g++-10 is used for \Cpp.
	79	Java tests are complied and run with version 11.0.11.
	80	Python used version 3.8.
	81	The machines used to run the tests are:
	82	\todo{Get patch versions for python, gcc and g++.}
	83	\begin{itemize}[nosep]
	84	\item ARM 2280 Kunpeng 920 48-core 2$\times$socket
	85	\lstinline{@} 2.6 GHz running Linux v5.11.0-25
	86	\item AMD 6380 Abu Dhabi 16-core 4$\times$socket
	87	\lstinline{@} 2.5 GHz running Linux v5.11.0-25
	88	\end{itemize}
	89	Representing the two major families of hardware architecture.
	90
[b51e389c]	91	\section{Tests}
[029cbc0]	92	The following tests were selected to test the performance of different
	93	components of the exception system.
[cfbab07]	94	They should provide a guide as to where the EHM's costs are found.
	95
	96	\paragraph{Stack Traversal}
	97	This group measures the cost of traversing the stack,
	98	(and in termination, unwinding it).
	99	Inside the main loop is a call to a recursive function.
	100	This function calls itself F times before raising an exception.
	101	F is configurable from the command line, but is usually 100.
	102	This builds up many stack frames, and any contents they may have,
	103	before the raise.
	104	The exception is always handled at the base of the stack.
	105	For example the Empty test for \CFA resumption looks like:
	106	\begin{cfa}
	107	void unwind_empty(unsigned int frames) {
	108	if (frames) {
	109	unwind_empty(frames - 1);
	110	} else {
	111	throwResume (empty_exception){&empty_vt};
	112	}
	113	}
	114	\end{cfa}
[7737c29]	115	Other test cases have additional code around the recursive call add
[cfbab07]	116	something besides simple stack frames to the stack.
[7737c29]	117	Note that both termination and resumption will have to traverse over
[cfbab07]	118	the stack but only termination has to unwind it.
[9698690]	119	\begin{itemize}[nosep]
[cfbab07]	120	% \item None:
	121	% Reuses the empty test code (see below) except that the number of frames
	122	% is set to 0 (this is the only test for which the number of frames is not
	123	% 100). This isolates the start-up and shut-down time of a throw.
	124	\item Empty:
[7372065]	125	The repeating function is empty except for the necessary control code.
[7737c29]	126	As other traversal tests add to this, so it is the baseline for the group
[cfbab07]	127	as the cost comes from traversing over and unwinding a stack frame
	128	that has no other interactions with the exception system.
[ea593a3]	129	\item Destructor:
[7372065]	130	The repeating function creates an object with a destructor before calling
	131	itself.
[7737c29]	132	Comparing this to the empty test gives the time to traverse over and/or
[cfbab07]	133	unwind a destructor.
[ea593a3]	134	\item Finally:
[7372065]	135	The repeating function calls itself inside a try block with a finally clause
	136	attached.
[7737c29]	137	Comparing this to the empty test gives the time to traverse over and/or
[cfbab07]	138	unwind a finally clause.
[ea593a3]	139	\item Other Handler:
[7372065]	140	The repeating function calls itself inside a try block with a handler that
[7737c29]	141	will not match the raised exception, but is of the same kind of handler.
	142	This means that the EHM will have to check each handler, but will continue
	143	over all of the until it reaches the base of the stack.
	144	Comparing this to the empty test gives the time to traverse over and/or
[cfbab07]	145	unwind a handler.
[ea593a3]	146	\end{itemize}
	147
[7372065]	148	\paragraph{Cross Try Statement}
[7737c29]	149	This group of tests measures the cost setting up exception handling if it is
[cfbab07]	150	not used (because the exceptional case did not occur).
[7737c29]	151	Tests repeatedly cross (enter and leave, execute) a try statement but never
	152	preform a raise.
[9698690]	153	\begin{itemize}[nosep]
[ea593a3]	154	\item Handler:
[cfbab07]	155	The try statement has a handler (of the appropriate kind).
[ea593a3]	156	\item Finally:
[262deda0]	157	The try statement has a finally clause.
[ea593a3]	158	\end{itemize}
	159
	160	\paragraph{Conditional Matching}
[cfbab07]	161	This group measures the cost of conditional matching.
[ea593a3]	162	Only \CFA implements the language level conditional match,
[cfbab07]	163	the other languages mimic it with an ``unconditional" match (it still
	164	checks the exception's type) and conditional re-raise if it is not supposed
[9698690]	165	to handle that exception.
[cfbab07]	166
[7737c29]	167	There is the pattern shown in \CFA and \Cpp. Java and Python use the same
[cfbab07]	168	pattern as \Cpp, but with their own syntax.
	169
	170	\begin{minipage}{0.45\textwidth}
	171	\begin{cfa}
	172	try {
	173	...
	174	} catch (exception_t * e ;
	175	should_catch(e)) {
	176	...
	177	}
	178	\end{cfa}
	179	\end{minipage}
	180	\begin{minipage}{0.55\textwidth}
	181	\begin{lstlisting}[language=C++]
	182	try {
	183	...
	184	} catch (std::exception & e) {
	185	if (!should_catch(e)) throw;
	186	...
	187	}
	188	\end{lstlisting}
	189	\end{minipage}
[9698690]	190	\begin{itemize}[nosep]
	191	\item Match All:
[ea593a3]	192	The condition is always true. (Always matches or never re-raises.)
[9698690]	193	\item Match None:
[ea593a3]	194	The condition is always false. (Never matches or always re-raises.)
	195	\end{itemize}
[dac16a0]	196
[cfbab07]	197	\paragraph{Resumption Simulation}
	198	A slightly altered version of the Empty Traversal test is used when comparing
	199	resumption to fix-up routines.
	200	The handler, the actual resumption handler or the fix-up routine,
	201	always captures a variable at the base of the loop,
	202	and receives a reference to a variable at the raise site, either as a
	203	field on the exception or an argument to the fix-up routine.
	204	% I don't actually know why that is here but not anywhere else.
	205
[dac16a0]	206	%\section{Cost in Size}
	207	%Using exceptions also has a cost in the size of the executable.
	208	%Although it is sometimes ignored
	209	%
	210	%There is a size cost to defining a personality function but the later problem
	211	%is the LSDA which will be generated for every function.
	212	%
	213	%(I haven't actually figured out how to compare this, probably using something
	214	%related to -fexceptions.)
[029cbc0]	215
[9698690]	216	\section{Results}
[cfbab07]	217	% First, introduce the tables.
	218	\autoref{t:PerformanceTermination},
	219	\autoref{t:PerformanceResumption}
	220	and~\autoref{t:PerformanceFixupRoutines}
	221	show the test results.
	222	In cases where a feature is not supported by a language, the test is skipped
	223	for that language and the result is marked N/A.
	224	There are also cases where the feature is supported but measuring its
	225	cost is impossible. This happened with Java, which uses a JIT that optimize
	226	away the tests and it cannot be stopped.\cite{Dice21}
	227	These tests are marked N/C.
	228	To get results in a consistent range (1 second to 1 minute is ideal,
	229	going higher is better than going low) N, the number of iterations of the
	230	main loop in each test, is varied between tests. It is also given in the
[7737c29]	231	results and usually have a value in the millions.
[cfbab07]	232
	233	An anomaly in some results came from \CFA's use of gcc nested functions.
	234	These nested functions are used to create closures that can access stack
	235	variables in their lexical scope.
[7737c29]	236	However, if they do so then they can cause the benchmark's run-time to
[cfbab07]	237	increase by an order of magnitude.
	238	The simplest solution is to make those values global variables instead
	239	of function local variables.
	240	% Do we know if editing a global inside nested function is a problem?
	241	Tests that had to be modified to avoid this problem have been marked
	242	with a ``*'' in the results.
	243
	244	% Now come the tables themselves:
	245	% You might need a wider window for this.
	246
	247	\begin{table}[htb]
	248	\centering
	249	\caption{Termination Performance Results (sec)}
	250	\label{t:PerformanceTermination}
	251	\begin{tabular}{\|r\|*{2}{\|r r r r\|}}
[7372065]	252	\hline
[cfbab07]	253	& \multicolumn{4}{c\|\|}{AMD} & \multicolumn{4}{c\|}{ARM} \\
	254	\cline{2-9}
	255	N\hspace{8pt} & \multicolumn{1}{c}{\CFA} & \multicolumn{1}{c}{\Cpp} & \multicolumn{1}{c}{Java} & \multicolumn{1}{c\|\|}{Python} &
	256	\multicolumn{1}{c}{\CFA} & \multicolumn{1}{c}{\Cpp} & \multicolumn{1}{c}{Java} & \multicolumn{1}{c\|}{Python} \\
[7372065]	257	\hline
[cfbab07]	258	Empty Traversal (1M) & 3.4 & 2.8 & 18.3 & 23.4 & 3.7 & 3.2 & 15.5 & 14.8 \\
	259	D'tor Traversal (1M) & 48.4 & 23.6 & N/A & N/A & 64.2 & 29.0 & N/A & N/A \\
	260	Finally Traversal (1M) & 3.4* & N/A & 17.9 & 29.0 & 4.1* & N/A & 15.6 & 19.0 \\
	261	Other Traversal (1M) & 3.6* & 23.2 & 18.2 & 32.7 & 4.0* & 24.5 & 15.5 & 21.4 \\
	262	Cross Handler (100M) & 6.0 & 0.9 & N/C & 37.4 & 10.0 & 0.8 & N/C & 32.2 \\
	263	Cross Finally (100M) & 0.9 & N/A & N/C & 44.1 & 0.8 & N/A & N/C & 37.3 \\
	264	Match All (10M) & 32.9 & 20.7 & 13.4 & 4.9 & 36.2 & 24.5 & 12.0 & 3.1 \\
	265	Match None (10M) & 32.7 & 50.3 & 11.0 & 5.1 & 36.3 & 71.9 & 12.3 & 4.2 \\
[7372065]	266	\hline
	267	\end{tabular}
[cfbab07]	268	\end{table}
[7372065]	269
[cfbab07]	270	\begin{table}[htb]
	271	\centering
	272	\caption{Resumption Performance Results (sec)}
	273	\label{t:PerformanceResumption}
	274	\begin{tabular}{\|r\|\|r\|\|r\|}
[0b67a19]	275	\hline
[cfbab07]	276	N\hspace{8pt}
	277	& AMD & ARM \\
[0b67a19]	278	\hline
[cfbab07]	279	Empty Traversal (10M) & 0.2 & 0.3 \\
	280	D'tor Traversal (10M) & 1.8 & 1.0 \\
	281	Finally Traversal (10M) & 1.7 & 1.0 \\
	282	Other Traversal (10M) & 22.6 & 25.9 \\
	283	Cross Handler (100M) & 8.4 & 11.9 \\
	284	Match All (100M) & 2.3 & 3.2 \\
	285	Match None (100M) & 2.9 & 3.9 \\
[0b67a19]	286	\hline
	287	\end{tabular}
[cfbab07]	288	\end{table}
[262deda0]	289
[cfbab07]	290	\begin{table}[htb]
	291	\centering
	292	\small
	293	\caption{Resumption/Fixup Routine Comparison (sec)}
	294	\label{t:PerformanceFixupRoutines}
	295	\setlength{\tabcolsep}{5pt}
	296	\begin{tabular}{\|r\|*{2}{\|r r r r r\|}}
	297	\hline
	298	& \multicolumn{5}{c\|\|}{AMD} & \multicolumn{5}{c\|}{ARM} \\
	299	\cline{2-11}
	300	N\hspace{8pt} & \multicolumn{1}{c}{Raise} & \multicolumn{1}{c}{\CFA} & \multicolumn{1}{c}{\Cpp} & \multicolumn{1}{c}{Java} & \multicolumn{1}{c\|\|}{Python} &
	301	\multicolumn{1}{c}{Raise} & \multicolumn{1}{c}{\CFA} & \multicolumn{1}{c}{\Cpp} & \multicolumn{1}{c}{Java} & \multicolumn{1}{c\|}{Python} \\
	302	\hline
	303	Resume Empty (10M) & 3.8 & 3.5 & 14.7 & 2.3 & 176.1 & 0.3 & 0.1 & 8.9 & 1.2 & 119.9 \\
	304	%Resume Other (10M) & 4.0* & 0.1* & 21.9 & 6.2 & 381.0 & 0.3* & 0.1* & 13.2 & 5.0 & 290.7 \\
	305	\hline
	306	\end{tabular}
	307	\end{table}
	308
	309	% Now discuss the results in the tables.
	310	One result not directly related to \CFA but important to keep in mind is that,
[7737c29]	311	for exceptions the standard intuition about which languages should go
[cfbab07]	312	faster often does not hold.
	313	For example, there are a few cases where Python out-performs
	314	\CFA, \Cpp and Java.
	315	The most likely explanation is that, since exceptions
	316	are rarely considered to be the common case, the more optimized languages
	317	make that case expensive to improve other cases.
	318	In addition, languages with high-level representations have a much
	319	easier time scanning the stack as there is less to decode.
	320
	321	As stated,
	322	the performance tests are not attempting to show \CFA has a new competitive
	323	way of implementing exception handling.
	324	The only performance requirement is to insure the \CFA EHM has reasonable
	325	performance for prototyping.
[7737c29]	326	Although that may be hard to exactly quantify, we believe it has succeeded
[cfbab07]	327	in that regard.
	328	Details on the different test cases follow.
	329
	330	\begin{description}
	331	\item[Empty Traversal]
	332	\CFA is slower than \Cpp, but is still faster than the other languages
	333	and closer to \Cpp than other languages.
[7737c29]	334	This is to be expected as \CFA is closer to \Cpp than the other languages.
[cfbab07]	335
	336	\item[D'tor Traversal]
[7737c29]	337	Running destructors causes huge slowdown in every language that supports
[cfbab07]	338	them. \CFA has a higher proportionate slowdown but it is similar to \Cpp's.
[7737c29]	339	Considering the amount of work done in destructors is so low the cost
	340	likely comes from the change of context required to do that work.
[cfbab07]	341
	342	\item[Finally Traversal]
[7737c29]	343	Speed is similar to Empty Traversal in all languages that support finally
[cfbab07]	344	clauses. Only Python seems to have a larger than random noise change in
	345	its run-time and it is still not large.
	346	Despite the similarity between finally clauses and destructors,
[7737c29]	347	finally clauses seem to avoid the spike in run-time destructors have.
[cfbab07]	348	Possibly some optimization removes the cost of changing contexts.
	349	\todo{OK, I think the finally clause may have been optimized out.}
	350
	351	\item[Other Traversal]
	352	For \Cpp, stopping to check if a handler applies seems to be about as
	353	expensive as stopping to run a destructor.
	354	This results in a significant jump.
	355
[7737c29]	356	Other languages experiance a small increase in run-time.
[cfbab07]	357	The small increase likely comes from running the checks,
	358	but they could avoid the spike by not having the same kind of overhead for
	359	switching to the check's context.
	360
[7737c29]	361	\todo{Could revist Other Traversal, after Finally Traversal.}
[cfbab07]	362
	363	\item[Cross Handler]
	364	Here \CFA falls behind \Cpp by a much more significant margin.
	365	This is likely due to the fact \CFA has to insert two extra function
[7737c29]	366	calls while \Cpp doesn't have to do execute any other instructions.
[cfbab07]	367	Python is much further behind.
	368
	369	\item[Cross Finally]
	370	\CFA's performance now matches \Cpp's from Cross Handler.
	371	If the code from the finally clause is being inlined,
[7737c29]	372	which is just a asm comment, than there are no additional instructions
[cfbab07]	373	to execute again when exiting the try statement normally.
	374
	375	\item[Conditional Match]
[7737c29]	376	Both of the conditional matching tests can be considered on their own,
	377	however for evaluating the value of conditional matching itself the
[cfbab07]	378	comparison of the two sets of results is useful.
	379	Consider the massive jump in run-time for \Cpp going from match all to match
	380	none, which none of the other languages have.
	381	Some strange interaction is causing run-time to more than double for doing
	382	twice as many raises.
	383	Java and Python avoid this problem and have similar run-time for both tests,
	384	possibly through resource reuse or their program representation.
	385	However \CFA is built like \Cpp and avoids the problem as well, this matches
[7737c29]	386	the pattern of the conditional match which makes the two execution paths
	387	much more similar.
[cfbab07]	388
	389	\end{description}
	390
[7737c29]	391	Moving on to resumption there is one general note,
	392	resumption is \textit{fast}, the only test where it fell
[cfbab07]	393	behind termination is Cross Handler.
	394	In every other case, the number of iterations had to be increased by a
[7737c29]	395	factor of 10 to get the run-time in an approprate range
[cfbab07]	396	and in some cases resumption still took less time.
	397
	398	% I tried \paragraph and \subparagraph, maybe if I could adjust spacing
	399	% between paragraphs those would work.
	400	\begin{description}
	401	\item[Empty Traversal]
	402	See above for the general speed-up notes.
[7737c29]	403	This result is not surprising as resumption's link list approach
[cfbab07]	404	means that traversing over stack frames without a resumption handler is
	405	$O(1)$.
	406
	407	\item[D'tor Traversal]
	408	Resumption does have the same spike in run-time that termination has.
	409	The run-time is actually very similar to Finally Traversal.
[7737c29]	410	As resumption does not unwind the stack both destructors and finally
	411	clauses are run while walking down the stack normally.
[cfbab07]	412	So it follows their performance is similar.
	413
	414	\item[Finally Traversal]
[7737c29]	415	The increase in run-time fromm Empty Traversal (once adjusted for
	416	the number of iterations) roughly the same as for termination.
	417	This suggests that the
[cfbab07]	418
	419	\item[Other Traversal]
	420	Traversing across handlers reduces resumption's advantage as it actually
	421	has to stop and check each one.
	422	Resumption still came out ahead (adjusting for iterations) but by much less
	423	than the other cases.
	424
	425	\item[Cross Handler]
	426	The only test case where resumption could not keep up with termination,
	427	although the difference is not as significant as many other cases.
[7737c29]	428	It is simply a matter of where the costs come from. Even if \CFA termination
	429	is not ``zero-cost" passing through an empty function still seems to be
	430	cheaper than updating global values.
[cfbab07]	431
	432	\item[Conditional Match]
	433	Resumption shows a slight slowdown if the exception is not matched
	434	by the first handler, which follows from the fact the second handler now has
	435	to be checked. However the difference is not large.
	436
	437	\end{description}
	438
	439	Finally are the results of the resumption/fixup routine comparison.
[7737c29]	440	These results are surprisingly varied, it is possible that creating a closure
[cfbab07]	441	has more to do with performance than passing the argument through layers of
	442	calls.
	443	Even with 100 stack frames though, resumption is only about as fast as
	444	manually passing a fixup routine.
	445	So there is a cost for the additional power and flexibility exceptions
	446	provide.

Note: See TracBrowser for help on using the repository browser.

Context Navigation

source: doc/theses/andrew_beach_MMath/performance.tex@ 7737c29

Download in other formats: