source: doc/theses/mubeen_zulfiqar_MMath/benchmarks.tex @ 108f6c32

Last change on this file since 108f6c32 was 0f6d7871, checked in by Peter A. Buhr <pabuhr@…>, 2 years ago
add info on available random distributions, and M initial allocations for Churn micro-benchmark

1	\chapter{Benchmarks}
2	\label{s:Benchmarks}
3
4	%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
5	%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
6	%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Micro Benchmark Suite
7	%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
8	%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
9
10	There are two basic approaches for evaluating computer software: benchmarks and micro-benchmarks.
11	\begin{description}
12	\item[Benchmarks]
13	are a suite of application programs (SPEC CPU/WEB) that are exercised in a common way (inputs) to find differences among underlying software implementations associated with an application (compiler, memory allocator, web server, \etc).
14	The applications are supposed to represent common execution patterns that need to perform well with respect to an underlying software implementation.
15	Benchmarks are often criticized for having overlapping patterns, insufficient patterns, or extraneous code that masks patterns.
16	\item[Micro-Benchmarks]
17	attempt to extract the common execution patterns associated with an application and run the pattern independently.
18	This approach removes any masking from extraneous application code, allows the execution pattern to be very precise, and provides an opportunity for the execution pattern to have multiple independent tuning adjustments (knobs).
19	Micro-benchmarks are often criticized for inadequately representing real-world applications.
20	\end{description}
21
22	While some crucial software components have standard benchmarks, no standard benchmark exists for testing and comparing memory allocators.
23	In the past, an assortment of applications have been used for benchmarking allocators~\cite{Detlefs93,Berger00,Berger01,berger02reconsidering}: P2C, GS, Espresso/Espresso-2, CFRAC/CFRAC-2, GMake, GCC, Perl/Perl-2, Gawk/Gawk-2, XPDF/XPDF-2, ROBOOP, Lindsay.
24	As well, an assortment of micro-benchmarks have been used for benchmarking allocators~\cite{larson99memory,Berger00,streamflow}: threadtest, shbench, Larson, consume, false sharing.
25	Many of these benchmark applications and micro-benchmarks are old and may not reflect current application allocation patterns.
26
27	This thesis designs and examines a new set of micro-benchmarks for memory allocators that test a variety of allocation patterns, each with multiple tuning parameters.
28	The aim of the micro-benchmark suite is to create a set of programs that can evaluate a memory allocator based on the performance metrics described in (FIX ME: local cite).
29	% These programs can be taken as a standard to benchmark an allocator's basic goals.
30	These programs give details of an allocator's memory overhead and speed under certain allocation patterns.
31	The allocation patterns are configurable (adjustment knobs) to observe an allocator's performance across a spectrum of events for a desired allocation pattern, which is seldom possible with benchmark programs.
32	Each micro-benchmark program has multiple control knobs specified by command-line arguments.
33
34	The new micro-benchmark suite measures performance by allocating dynamic objects and measuring specific metrics.
35	An allocator's speed is benchmarked in different ways, as are issues like false sharing.
36
37
38	\section{Prior Multi-Threaded Micro-Benchmarks}
39
40	Modern memory allocators, such as llheap, must handle multi-threaded programs at the KT and UT level.
41	The following multi-threaded micro-benchmarks are presented to give a sense of prior work~\cite{Berger00} at the KT level.
42	None of the prior work addresses multi-threading at the UT level.
43
44
45	\subsection{threadtest}
46
47	This benchmark stresses the ability of the allocator to handle different threads allocating and deallocating independently.
48	There is no interaction among threads, \ie no object sharing.
49	Each thread repeatedly allocates 100,000 \emph{8-byte} objects then deallocates them in the order they were allocated.
50	The runtime of the benchmark is used to evaluate the allocator's efficiency.
51
52
53	\subsection{shbench}
54
55	This benchmark is similar to threadtest but each thread randomly allocates and frees a number of \emph{random-sized} objects.
56	It is a stress test that also uses runtime to determine efficiency of the allocator.
57
58
59	\subsection{Larson}
60
61	This benchmark simulates a server environment.
62	Multiple threads are created where each thread allocates and frees a number of random-sized objects within a size range.
63	Before the thread terminates, it passes its array of 10,000 objects to a new child thread to continue the process.
64	The number of thread generations varies depending on the thread speed.
65	It calculates memory operations per second as an indicator of the memory allocator's performance.
66
67
68	\section{New Multi-Threaded Micro-Benchmarks}
69
70	The following new benchmarks were created to assess multi-threaded programs at the KT and UT level.
71	For generating random values, two generators are supported: uniform~\cite{uniformPRNG} and fisher~\cite{fisherPRNG}.
72
73
74	\subsection{Churn Benchmark}
75	\label{s:ChurnBenchmark}
76
77	The churn benchmark measures the runtime speed of an allocator in a multi-threaded scenario, where each thread extensively allocates and frees dynamic memory.
78	Only @malloc@ and @free@ are used to eliminate any extra cost, such as zero filling in @calloc@ or copying (@memcpy@) in @realloc@.
79	Churn simulates a memory intensive program that can be tuned to create different scenarios.
80
81	\VRef[Figure]{fig:ChurnBenchFig} shows the pseudo code for the churn micro-benchmark.
82	This benchmark creates a buffer with M spots and an allocation in each spot, and then starts K threads.
83	Each thread picks a random spot in M, frees the object currently at that spot, and allocates a new object for that spot.
84	Each thread repeats this cycle N times.
85	The main thread measures the total time taken for the whole benchmark and that time is used to evaluate the memory allocator's performance.
86
87	\begin{figure}
88	\centering
89	\begin{lstlisting}
90	Main Thread
91		create worker threads
92		note time T1
93		...
94		note time T2
95		churn_speed = (T2 - T1)
96	Worker Thread
97		initialize variables
98		...
99		for ( N )
100			R = random spot in array
101			free R
102			allocate new object at R
103	\end{lstlisting}
104	%\includegraphics[width=1\textwidth]{figures/bench-churn.eps}
105	\caption{Churn Benchmark}
106	\label{fig:ChurnBenchFig}
107	\end{figure}
108
109	The adjustment knobs for churn are:
110	\begin{description}[itemsep=0pt,parsep=0pt]
111	\item[thread:]
112	number of threads (K).
113	\item[spots:]
114	number of spots for churn (M).
115	\item[obj:]
116	number of objects per thread (N).
117	\item[max:]
118	maximum object size.
119	\item[min:]
120	minimum object size.
121	\item[step:]
122	object size increment.
123	\item[distro:]
124	object size distribution.
125	\end{description}
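
The churn cycle described above can be sketched in C as follows.
This is a minimal illustration rather than the suite's actual source: the names @churn_knobs@, @churn_run@, @pick_size@ and @next_rand@ are hypothetical, the xorshift generator merely stands in for the suite's random generators, and each worker is assumed to own a private array of M spots so that no two threads free the same object.
Unlike the pseudo code, the measured interval here also covers thread creation, the M initial allocations, and the final cleanup.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical knob bundle mirroring the churn knobs listed above. */
typedef struct {
    size_t threads;        /* K: number of worker threads */
    size_t spots;          /* M: spots per worker (assumed private to the worker) */
    size_t objs;           /* N: free/allocate cycles per worker */
    size_t min, max, step; /* object sizes: min, min+step, ..., at most max */
} churn_knobs;

typedef struct { const churn_knobs *k; uint64_t seed; } churn_arg;

/* xorshift64, a stand-in PRNG for illustration; the seed must be nonzero. */
static uint64_t next_rand(uint64_t *s) {
    *s ^= *s << 13;
    *s ^= *s >> 7;
    *s ^= *s << 17;
    return *s;
}

/* Roughly uniform choice among the sizes allowed by min/max/step. */
static size_t pick_size(const churn_knobs *k, uint64_t *s) {
    size_t choices = (k->max - k->min) / k->step + 1;
    return k->min + (size_t)(next_rand(s) % choices) * k->step;
}

static void *churn_worker(void *p) {
    churn_arg *a = p;
    const churn_knobs *k = a->k;
    uint64_t s = a->seed;
    void **spot = malloc(k->spots * sizeof *spot);
    if (spot == NULL) return NULL;
    for (size_t i = 0; i < k->spots; i++)       /* M initial allocations */
        spot[i] = malloc(pick_size(k, &s));
    for (size_t i = 0; i < k->objs; i++) {      /* N churn cycles */
        size_t r = (size_t)(next_rand(&s) % k->spots);
        free(spot[r]);                          /* free object at random spot */
        spot[r] = malloc(pick_size(k, &s));     /* allocate its replacement */
    }
    for (size_t i = 0; i < k->spots; i++)
        free(spot[i]);
    free(spot);
    return NULL;
}

/* Runs K workers and returns the elapsed wall-clock time in seconds. */
double churn_run(const churn_knobs *k) {
    assert(k->threads > 0 && k->spots > 0 && k->step > 0);
    assert(k->min > 0 && k->min <= k->max);
    pthread_t *tid = malloc(k->threads * sizeof *tid);
    churn_arg *arg = malloc(k->threads * sizeof *arg);
    assert(tid != NULL && arg != NULL);
    struct timespec t1, t2;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    for (size_t i = 0; i < k->threads; i++) {
        arg[i].k = k;
        arg[i].seed = 0x9E3779B97F4A7C15ULL * (i + 1); /* distinct nonzero seeds */
        int rc = pthread_create(&tid[i], NULL, churn_worker, &arg[i]);
        assert(rc == 0);
        (void)rc;
    }
    for (size_t i = 0; i < k->threads; i++)
        pthread_join(tid[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &t2);
    free(tid);
    free(arg);
    return (double)(t2.tv_sec - t1.tv_sec) + (double)(t2.tv_nsec - t1.tv_nsec) / 1e9;
}
```

A caller would fill in a @churn_knobs@ value from the command-line arguments and report the result of @churn_run@; linking with a different allocator then allows the times to be compared.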
126
127
128	\subsection{Cache Thrash}
129	\label{sec:benchThrashSec}
130
131	The cache-thrash micro-benchmark measures allocator-induced active false-sharing as illustrated in \VRef{s:AllocatorInducedActiveFalseSharing}.
132	If memory is allocated for multiple threads on the same cache line, this can significantly slow down program performance.
133	When threads share a cache line, frequent reads/writes to their cache-line object cause cache misses, which cause escalating delays as cache distance increases.
134
135	Cache thrash tries to create a scenario that leads to false sharing, if the underlying memory allocator is allocating dynamic memory to multiple threads on the same cache lines.
136	Ideally, a memory allocator should distance the dynamic memory region of one thread from another.
137	Having multiple threads allocating small objects simultaneously can cause a memory allocator to allocate objects on the same cache line, if it is not distancing the memory among different threads.
138
139	\VRef[Figure]{fig:benchThrashFig} shows the pseudo code for the cache-thrash micro-benchmark.
140	First, it creates K worker threads.
141	Each worker thread allocates an object and intensively reads/writes it M times to possibly invalidate cache lines, which may interfere with other threads sharing the same cache line.
142	Each thread repeats this N times.
143	The main thread measures the total time taken for all worker threads to complete.
144	Worker threads sharing cache lines with each other will take longer.
145
146	\begin{figure}
147	\centering
148	\input{AllocInducedActiveFalseSharing}
149	\medskip
150	\begin{lstlisting}
151	Main Thread
152		create worker threads
153		...
154		signal workers to allocate
155		...
156		signal workers to free
157		...
158		print addresses from each $thread$
159	Worker Thread$\(_1\)$
160		allocate, write, read, free
161		warmup memory in chunks of 16 bytes
162		...
163		malloc N objects
164		...
165		free objects
166		return object address to Main Thread
167	Worker Thread$\(_2\)$
168		// same as Worker Thread$\(_1\)$
169	\end{lstlisting}
170	%\input{MemoryOverhead}
171	%\includegraphics[width=1\textwidth]{figures/bench-cache-thrash.eps}
172	\caption{Allocator-Induced Active False-Sharing Benchmark}
173	\label{fig:benchThrashFig}
174	\end{figure}
175
176	The adjustment knobs for cache access scenarios are:
177	\begin{description}[itemsep=0pt,parsep=0pt]
178	\item[thread:]
179	number of threads (K).
180	\item[iterations:]
181	iterations of cache benchmark (N).
182	\item[cacheRW:]
183	repetitions of reads/writes to object (M).
184	\item[size:]
185	object size.
186	\end{description}
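
As a rough illustration of the worker loop described above, the following C sketch has each of K threads allocate an object, read and write its first byte M times, free it, and repeat N times.
The names @thrash_knobs@ and @thrash_run@ are hypothetical, and the sketch omits the warmup and the allocate/free signalling shown in the figure, so it is an approximation under those assumptions and not the benchmark's real implementation.
The @volatile@ qualifier is there only to stop the compiler from removing the repeated accesses.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical knob bundle mirroring the cache-thrash knobs listed above. */
typedef struct {
    size_t iterations; /* N: allocate/access/free rounds per worker */
    size_t cache_rw;   /* M: reads/writes per allocated object */
    size_t size;       /* object size in bytes */
} thrash_knobs;

static void *thrash_worker(void *p) {
    const thrash_knobs *k = p;
    for (size_t i = 0; i < k->iterations; i++) {
        volatile unsigned char *obj = malloc(k->size);
        if (obj == NULL) break;
        obj[0] = 0;
        for (size_t j = 0; j < k->cache_rw; j++)
            obj[0] = (unsigned char)(obj[0] + 1); /* read then write the same line */
        free((void *)obj);
    }
    return NULL;
}

/* Runs K workers concurrently and returns the elapsed wall-clock seconds. */
double thrash_run(size_t threads, const thrash_knobs *k) {
    assert(threads > 0 && k->size > 0);
    pthread_t *tid = malloc(threads * sizeof *tid);
    assert(tid != NULL);
    struct timespec t1, t2;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    for (size_t i = 0; i < threads; i++) {
        int rc = pthread_create(&tid[i], NULL, thrash_worker, (void *)k);
        assert(rc == 0);
        (void)rc;
    }
    for (size_t i = 0; i < threads; i++)
        pthread_join(tid[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &t2);
    free(tid);
    return (double)(t2.tv_sec - t1.tv_sec) + (double)(t2.tv_nsec - t1.tv_nsec) / 1e9;
}
```

If the allocator places the objects of different workers on one cache line, the inner loop should become noticeably slower as the number of threads grows, which is the effect the elapsed time is meant to expose.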
187
188
189	\subsection{Cache Scratch}
190	\label{s:CacheScratch}
191
192	The cache-scratch micro-benchmark measures allocator-induced passive false-sharing as illustrated in \VRef{s:AllocatorInducedPassiveFalseSharing}.
193	As for cache thrash, if memory is allocated for multiple threads on the same cache line, this can significantly slow down program performance.
194	In this scenario, the false sharing is caused by the memory allocator, although it originates from the program sharing an object.
195
196	% An allocator can unintentionally induce false sharing depending upon its management of the freed objects.
197	% If thread Thread$_1$ allocates multiple objects together, they may be allocated on the same cache line by the memory allocator.
198	% If Thread$_1$ passes these object to thread Thread$_2$, then both threads may share the same cache line but this scenerio is not induced by the allocator;
199	% instead, the program induced this situation.
200	% Now if Thread$_2$ frees this object and then allocate an object of the same size, the allocator may return the same object, which is on a cache line shared with thread Thread$_1$.
201
202	Cache scratch tries to create a scenario in which program-induced false sharing is preserved by the memory allocator: if the allocator does not return a freed object to its owner thread and, instead, reuses it immediately, the false sharing persists.
203	An allocator using object ownership, as described in section \VRef{s:Ownership}, is less susceptible to allocator-induced passive false-sharing.
204	If the object is returned to the thread that owns it, then the thread that gets a new object is less likely to be on the same cache line.
205
206	\VRef[Figure]{fig:benchScratchFig} shows the pseudo code for the cache-scratch micro-benchmark.
207	First, it allocates K dynamic objects together, one for each of the K worker threads, possibly causing the memory allocator to allocate these objects on the same cache line.
208	Then it creates K worker threads and passes an object from the K allocated objects to each of the K threads.
209	Each worker thread frees the object passed by the main thread.
210	Then, it allocates an object and repeatedly reads/writes it M times, possibly causing frequent cache invalidations.
211	Each worker repeats this N times.
212
213	\begin{figure}
214	\centering
215	\input{AllocInducedPassiveFalseSharing}
216	\medskip
217	\begin{lstlisting}
218	Main Thread
219		malloc N objects $for$ each worker $thread$
220		create worker threads and pass N objects to each worker
221		...
222		signal workers to allocate
223		...
224		signal workers to free
225		...
226		print addresses from each $thread$
227	Worker Thread$\(_1\)$
228		allocate, write, read, free
229		warmup memory in chunks of 16 bytes
230		...
231		for ( N )
232			free an object passed by Main Thread
233			malloc new object
234		...
235		free objects
236		return new object addresses to Main Thread
237	Worker Thread$\(_2\)$
238		// same as Worker Thread$\(_1\)$
239	\end{lstlisting}
240	%\includegraphics[width=1\textwidth]{figures/bench-cache-scratch.eps}
241	\caption{Allocator-Induced Passive False-Sharing Benchmark}
242	\label{fig:benchScratchFig}
243	\end{figure}
244
245	After a worker frees the object passed by the main thread, its next allocation should return that same object, provided the allocator reuses the freed object locally instead of returning it to its owner (the main thread).
246	Then, intensive reads/writes on the shared cache line by multiple threads should slow down the worker threads due to frequent cache invalidations and misses.
247	The main thread measures the total time taken for all the workers to complete.
248
249	Similar to the cache-thrash benchmark in section \VRef{sec:benchThrashSec}, different cache access scenarios can be created using the following command-line arguments.
250	\begin{description}[itemsep=0pt,parsep=0pt]
251	\item[threads:]
252	number of threads (K).
253	\item[iterations:]
254	iterations of cache benchmark (N).
255	\item[cacheRW:]
256	repetitions of reads/writes to object (M).
257	\item[size:]
258	object size.
259	\end{description}
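
The hand-off at the heart of cache scratch can be sketched in C as shown next.
The main thread allocates one object per worker back to back, each worker frees the object it was given, allocates a replacement, and reads/writes it M times, repeating N times.
All names (@scratch_knobs@, @scratch_run@, @same_cache_line@) are hypothetical, the 64-byte cache line is an assumption about the hardware, and the figure's warmup and signalling steps are left out.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

#define CACHE_LINE 64 /* assumed cache-line size in bytes */

/* Hypothetical knob bundle: N rounds, M reads/writes, object size. */
typedef struct { size_t iterations, cache_rw, size; } scratch_knobs;
typedef struct { const scratch_knobs *k; void *obj; } scratch_arg;

/* True when both addresses fall into the same (assumed) cache line. */
static int same_cache_line(const void *a, const void *b) {
    return (uintptr_t)a / CACHE_LINE == (uintptr_t)b / CACHE_LINE;
}

static void *scratch_worker(void *p) {
    scratch_arg *a = p;
    void *obj = a->obj;               /* object allocated by the main thread */
    for (size_t i = 0; i < a->k->iterations; i++) {
        free(obj);                    /* first round frees the main thread's object */
        obj = malloc(a->k->size);     /* allocator may hand the same memory back */
        if (obj == NULL) break;
        volatile unsigned char *c = obj;
        c[0] = 0;
        for (size_t j = 0; j < a->k->cache_rw; j++)
            c[0] = (unsigned char)(c[0] + 1);
    }
    free(obj);
    return NULL;
}

/* Returns elapsed seconds; *shared_pairs counts neighbouring initial
   objects that started out on the same cache line. */
double scratch_run(size_t threads, const scratch_knobs *k, size_t *shared_pairs) {
    assert(threads > 0 && k->size > 0);
    pthread_t *tid = malloc(threads * sizeof *tid);
    scratch_arg *arg = malloc(threads * sizeof *arg);
    assert(tid != NULL && arg != NULL);
    *shared_pairs = 0;
    for (size_t i = 0; i < threads; i++) {   /* K objects allocated together */
        arg[i].k = k;
        arg[i].obj = malloc(k->size);
        if (i > 0 && same_cache_line(arg[i - 1].obj, arg[i].obj))
            (*shared_pairs)++;
    }
    struct timespec t1, t2;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    for (size_t i = 0; i < threads; i++) {
        int rc = pthread_create(&tid[i], NULL, scratch_worker, &arg[i]);
        assert(rc == 0);
        (void)rc;
    }
    for (size_t i = 0; i < threads; i++)
        pthread_join(tid[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &t2);
    free(tid);
    free(arg);
    return (double)(t2.tv_sec - t1.tv_sec) + (double)(t2.tv_nsec - t1.tv_nsec) / 1e9;
}
```

The count in @shared_pairs@ only indicates whether the starting condition for passive false sharing was met; whether the sharing persists depends on how the allocator handles the objects freed by the workers.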
260
261
262	\subsection{Speed Micro-Benchmark}
263	\label{s:SpeedMicroBenchmark}
264
265	The speed benchmark measures the runtime speed of individual and sequences of memory allocation routines:
266	\begin{enumerate}[itemsep=0pt,parsep=0pt]
267	\item malloc
268	\item realloc
269	\item free
270	\item calloc
271	\item malloc-free
272	\item realloc-free
273	\item calloc-free
274	\item malloc-realloc
275	\item calloc-realloc
276	\item malloc-realloc-free
277	\item calloc-realloc-free
278	\item malloc-realloc-free-calloc
279	\end{enumerate}
280
281	\VRef[Figure]{fig:SpeedBenchFig} shows the pseudo code for the speed micro-benchmark.
282	Each routine in the chain is called for N objects and then those allocated objects are used when calling the next routine in the allocation chain.
283	This tests the latency of the memory allocator when multiple routines are chained together, \eg the call sequence malloc-realloc-free-calloc gives a complete picture of the major allocation routines when combined together.
284	For each chain, the time is recorded to visualize performance of a memory allocator against each chain.
285
286	\begin{figure}
287	\centering
288	\begin{lstlisting}[morekeywords={foreach}]
289	Main Thread
290		create worker threads
291		foreach ( allocation chain )
292			note time T1
293			...
294			note time T2
295			chain_speed = (T2 - T1) / (number-of-worker-threads * N)
296	Worker Thread
297		initialize variables
298		...
299		foreach ( routine in allocation chain )
300			call routine N times
301	\end{lstlisting}
302	%\includegraphics[width=1\textwidth]{figures/bench-speed.eps}
303	\caption{Speed Benchmark}
304	\label{fig:SpeedBenchFig}
305	\end{figure}
306
307	The adjustment knobs for the speed benchmark are:
308	\begin{description}[itemsep=0pt,parsep=0pt]
309	\item[max:]
310	maximum object size.
311	\item[min:]
312	minimum object size.
313	\item[step:]
314	object size increment.
315	\item[distro:]
316	object size distribution.
317	\item[objects:]
318	number of objects per thread.
319	\item[workers:]
320	number of worker threads.
321	\end{description}
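
To make the chaining concrete, the sketch below times one allocation chain for a single worker: every routine in the chain is applied to all N objects before the next routine starts, and the elapsed time is divided by N.
The enumeration @alloc_op@ and the function @run_chain@ are hypothetical names, a fixed object size replaces the size-distribution knobs, and @realloc@ is assumed to double the object size; none of these choices are taken from the benchmark's source.
Chains are assumed to be well formed, \ie @malloc@ and @calloc@ are only applied to empty slots.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical encoding of the routines that can appear in a chain. */
typedef enum { OP_MALLOC, OP_CALLOC, OP_REALLOC, OP_FREE } alloc_op;

/* Applies one routine to one object slot. */
static void apply(alloc_op op, void **slot, size_t size) {
    switch (op) {
    case OP_MALLOC:
        *slot = malloc(size);
        break;
    case OP_CALLOC:
        *slot = calloc(1, size);
        break;
    case OP_REALLOC: {
        void *p = realloc(*slot, 2 * size);  /* assumed growth: double the size */
        if (p != NULL) *slot = p;
        break;
    }
    case OP_FREE:
        free(*slot);
        *slot = NULL;
        break;
    }
}

/* Times one chain over n objects; returns seconds per object, or -1.0
   if the slot array cannot be allocated. */
double run_chain(const alloc_op *chain, size_t len, size_t n, size_t size) {
    assert(len > 0 && n > 0 && size > 0);
    void **obj = calloc(n, sizeof *obj);     /* all slots start empty */
    if (obj == NULL) return -1.0;
    struct timespec t1, t2;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    for (size_t r = 0; r < len; r++)         /* foreach routine in chain */
        for (size_t i = 0; i < n; i++)       /* call routine N times */
            apply(chain[r], &obj[i], size);
    clock_gettime(CLOCK_MONOTONIC, &t2);
    for (size_t i = 0; i < n; i++)           /* untimed cleanup of leftovers */
        free(obj[i]);
    free(obj);
    double elapsed = (double)(t2.tv_sec - t1.tv_sec)
                   + (double)(t2.tv_nsec - t1.tv_nsec) / 1e9;
    return elapsed / (double)n;
}
```

For example, the chain malloc-realloc-free-calloc corresponds to the array @{ OP_MALLOC, OP_REALLOC, OP_FREE, OP_CALLOC }@; a multi-threaded version would run @run_chain@ in each worker and divide by the number of workers as well.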
322
323
324	\subsection{Memory Micro-Benchmark}
325	\label{s:MemoryMicroBenchmark}
326
327	The memory micro-benchmark measures the memory overhead of an allocator.
328	It allocates a number of dynamic objects and reads @/proc/self/maps@ to get the total memory requested by the allocator from the OS.
329	It calculates the memory overhead by computing the difference between the memory the allocator requests from the OS and the memory that the program allocates.
330	This micro-benchmark is like Larson and stresses the ability of an allocator to deal with object sharing.
331
332	\VRef[Figure]{fig:MemoryBenchFig} shows the pseudo code for the memory micro-benchmark.
333	It creates a producer-consumer scenario with K producer threads and each producer has M consumer threads.
334	A producer has a separate buffer for each consumer and allocates N objects of random sizes following a settable distribution for each consumer.
335	A consumer frees these objects.
336	Program memory usage is recorded after every memory operation.
337	This data is used to visualize the memory usage and consumption for the program.
338
339	\begin{figure}
340	\centering
341	\begin{lstlisting}
342	Main Thread
343		print memory snapshot
344		create producer threads
345	Producer Thread (K)
346		set free start
347		create consumer threads
348		for ( N )
349			allocate memory
350			print memory snapshot
351	Consumer Thread (M)
352		wait while ( allocations < free start )
353		for ( N )
354			free memory
355			print memory snapshot
356	\end{lstlisting}
357	%\includegraphics[width=1\textwidth]{figures/bench-memory.eps}
358	\caption{Memory Footprint Micro-Benchmark}
359	\label{fig:MemoryBenchFig}
360	\end{figure}
361
362	The global adjustment knobs for this micro-benchmark are:
363	\begin{description}[itemsep=0pt,parsep=0pt]
364	\item[producer (K):]
365	sets the number of producer threads.
366	\item[consumer (M):]
367	sets the number of consumer threads for each producer.
368	\item[round:]
369	sets production and consumption round size.
370	\end{description}
371
372	The adjustment knobs for object allocation are:
373	\begin{description}[itemsep=0pt,parsep=0pt]
374	\item[max:]
375	maximum object size.
376	\item[min:]
377	minimum object size.
378	\item[step:]
379	object size increment.
380	\item[distro:]
381	object size distribution.
382	\item[objects (N):]
383	number of objects per thread.
384	\end{description}
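
One way to take a memory snapshot on Linux is to sum the address ranges listed in @/proc/self/maps@, where each line starts with a hexadecimal @start-end@ range followed by the permission flags.
The C sketch below does this for writable mappings only; restricting the sum to writable mappings, and the names @map_line_size@ and @writable_mapped_bytes@, are assumptions made for illustration and may differ from how the micro-benchmark computes its totals.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Size in bytes of one maps entry, or 0 if the line does not parse.
   On success, perms receives the permission flags, e.g. "rw-p". */
static unsigned long map_line_size(const char *line, char perms[5]) {
    unsigned long start, end;
    if (sscanf(line, "%lx-%lx %4s", &start, &end, perms) != 3 || end < start)
        return 0;
    return end - start;
}

/* Total bytes in writable mappings of the calling process,
   or -1 if /proc/self/maps cannot be opened (e.g. not on Linux). */
long long writable_mapped_bytes(void) {
    FILE *f = fopen("/proc/self/maps", "r");
    if (f == NULL) return -1;
    char line[4352];                 /* room for a long path plus the fields */
    char perms[5];
    long long total = 0;
    while (fgets(line, sizeof line, f) != NULL) {
        unsigned long size = map_line_size(line, perms);
        if (size > 0 && perms[1] == 'w')
            total += (long long)size;
    }
    fclose(f);
    return total;
}
```

Under these assumptions, an overhead estimate is the growth of @writable_mapped_bytes()@ between two snapshots minus the number of bytes the program requested from the allocator in between.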
