\chapter{Benchmarks}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Micro Benchmark Suite
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

There are two basic approaches for evaluating computer software: benchmarks and micro-benchmarks.
\begin{description}
\item[Benchmarks]
are a suite of application programs (SPEC CPU/WEB) that are exercised in a common way (inputs) to find differences among underlying software implementations associated with an application (compiler, memory allocator, web server, \etc).
The applications are supposed to represent common execution patterns that need to perform well with respect to an underlying software implementation.
Benchmarks are often criticized for having overlapping patterns, insufficient patterns, or extraneous code that masks patterns.
\item[Micro-Benchmarks]
attempt to extract the common execution patterns associated with an application and run each pattern independently.
This approach removes any masking from extraneous application code, allows the execution pattern to be very precise, and provides an opportunity for the execution pattern to have multiple independent tuning adjustments (knobs).
Micro-benchmarks are often criticized for inadequately representing real-world applications.
\end{description}

While some crucial software components have standard benchmarks, no standard benchmark exists for testing and comparing memory allocators.
In the past, an assortment of applications has been used for benchmarking allocators~\cite{Detlefs93,Berger00,Berger01,berger02reconsidering}: P2C, GS, Espresso/Espresso-2, CFRAC/CFRAC-2, GMake, GCC, Perl/Perl-2, Gawk/Gawk-2, XPDF/XPDF-2, ROBOOP, Lindsay.
As well, an assortment of micro-benchmarks has been used for benchmarking allocators~\cite{larson99memory,Berger00,streamflow}: threadtest, shbench, Larson, consume, false sharing.
Many of these applications and micro-benchmarks are old and may not reflect current application allocation patterns.

This thesis designs and examines a new set of micro-benchmarks for memory allocators that test a variety of allocation patterns, each with multiple tuning parameters.
The aim of the micro-benchmark suite is to create a set of programs that can evaluate a memory allocator based on the performance metrics described in (FIX ME: local cite).
% These programs can be taken as a standard to benchmark an allocator's basic goals.
These programs give details of an allocator's memory overhead and speed under certain allocation patterns.
The allocation patterns are configurable (adjustment knobs) to observe an allocator's performance across a spectrum of events for a desired allocation pattern, which is seldom possible with benchmark programs.
The micro-benchmark programs control the knobs through command-line arguments.

The new micro-benchmark suite measures performance by allocating dynamic objects and measuring specific metrics.
An allocator's speed is benchmarked in different ways, as are issues like false sharing.


\section{Prior Multi-Threaded Micro-Benchmarks}

Modern memory allocators, such as llheap, must handle multi-threaded programs at the KT level.
The following traditional multi-threaded micro-benchmarks are presented to give a sense of prior work~\cite{Berger00}.


\subsection{threadtest}

This benchmark stresses the ability of the allocator to handle different threads allocating and deallocating independently.
There is no interaction among threads, \ie no object sharing.
Each thread repeatedly allocates 100,000 8-byte objects then deallocates them in the order they were allocated.
The total runtime of the benchmark is used to evaluate the allocator's efficiency.


\subsection{shbench}

Each thread randomly allocates and frees a number of random-sized objects.
It is a stress test that also uses runtime to determine the efficiency of the allocator.


\subsection{Larson}

Larson simulates a server environment.
Multiple threads are created, where each thread allocates and frees a number of random-sized objects within a size range.
Before terminating, each thread passes its array of 10,000 objects to a new child thread to continue the process.
The number of thread generations varies depending on the thread speed.
The benchmark reports memory operations per second as an indicator of the memory allocator's performance.


\section{New Multi-Threaded Micro-Benchmarks}

The following new benchmarks were created to assess multi-threaded programs at the KT level.


\subsection{Memory Micro-Benchmark}

The memory micro-benchmark measures the memory overhead of an allocator.
It allocates a number of dynamic objects.
Then, by reading @/proc/self/maps@, it obtains the total memory requested by the allocator from the OS.
It computes the memory overhead as the difference between the memory the allocator has requested from the OS and the memory the program has allocated.

\VRef[Figure]{fig:MemoryBenchFig} shows the pseudo-code for the memory micro-benchmark.
It creates a producer-consumer scenario with K producer threads, where each producer has M consumer threads.
A producer has a separate buffer for each consumer, and for each consumer it allocates N objects of random sizes following the given distribution.
A consumer frees these objects.
After every memory operation, the program's memory usage is recorded throughout the runtime.
This data is then used to visualize the program's memory usage and consumption.

\begin{figure}
\centering
\begin{lstlisting}
Main Thread
	print memory snapshot
	create producer threads
Producer Thread
	set free start
	create consumer threads
	for ( N )
		allocate memory
		print memory snapshot
Consumer Thread
	wait while ( allocations < free start )
	for ( N )
		free memory
		print memory snapshot
\end{lstlisting}
%\includegraphics[width=1\textwidth]{figures/bench-memory.eps}
\caption{Memory Overhead Benchmark (evaluate memory footprint)}
\label{fig:MemoryBenchFig}
\end{figure}

The adjustment knobs for this micro-benchmark are:
\begin{description}[itemsep=0pt,parsep=0pt]
\item[producer:]
sets the number of producer threads.
\item[round:]
sets the production and consumption round size.
\item[consumer:]
sets the number of consumer threads for each producer.
\end{description}

The adjustment knobs for the object allocation size are:
\begin{description}[itemsep=0pt,parsep=0pt]
\item[max:]
sets the maximum object size.
\item[min:]
sets the minimum object size.
\item[step:]
sets the object-size increment.
\item[distro:]
sets the object-size distribution.
\item[obj:]
sets the number of objects per thread.
\end{description}


\subsection{Speed Micro-Benchmark}

The speed benchmark measures the runtime speed of individual memory-allocation routines and of sequences (chains) of routines:
\begin{enumerate}[itemsep=0pt,parsep=0pt]
\item malloc
\item realloc
\item free
\item calloc
\item malloc-free
\item realloc-free
\item calloc-free
\item malloc-realloc
\item calloc-realloc
\item malloc-realloc-free
\item calloc-realloc-free
\item malloc-realloc-free-calloc
\end{enumerate}

\VRef[Figure]{fig:SpeedBenchFig} shows the pseudo-code for the speed micro-benchmark.
Each routine in a chain is called for N objects, and those objects are then passed to the next routine in the chain.
This measures the complete latency of the memory allocator when multiple routines are chained together; for example, the malloc-realloc-free-calloc chain gives the whole picture of the major allocation routines combined.
The time taken for each chain is recorded, which can then be used to visualize the performance of a memory allocator on each chain.

\begin{figure}
\centering
\begin{lstlisting}[morekeywords={foreach}]
Main Thread
	create worker threads
	foreach ( allocation chain )
		note time T1
		...
		note time T2
		chain_speed = (T2 - T1) / ( number-of-worker-threads * N )
Worker Thread
	initialize variables
	...
	foreach ( routine in allocation chain )
		call routine N times
\end{lstlisting}
%\includegraphics[width=1\textwidth]{figures/bench-speed.eps}
\caption{Speed Benchmark}
\label{fig:SpeedBenchFig}
\end{figure}

The adjustment knobs for memory usage are:
\begin{description}[itemsep=0pt,parsep=0pt]
\item[max:]
sets the maximum object size.
\item[min:]
sets the minimum object size.
\item[step:]
sets the object-size increment.
\item[distro:]
sets the object-size distribution.
\item[obj:]
sets the number of objects per thread.
\item[workers:]
sets the number of worker threads.
\end{description}


\subsection{Churn Benchmark}

The churn benchmark measures the overall runtime speed of an allocator in a multi-threaded scenario where each thread extensively allocates and frees dynamic memory.

\VRef[Figure]{fig:ChurnBenchFig} shows the pseudo-code for the churn micro-benchmark.
This benchmark creates a buffer with M spots and starts K threads.
Each thread randomly picks one of the M spots, frees the object currently at that spot, and allocates a new object for that spot.
Each thread repeats this cycle N times.
The main thread measures the total time taken for the whole benchmark, and that time is used to evaluate the memory allocator's performance.

\begin{figure}
\centering
\begin{lstlisting}
Main Thread
	create worker threads
	note time T1
	...
	note time T2
	churn_speed = (T2 - T1)
Worker Thread
	initialize variables
	...
	for ( N )
		R -> random spot in array
		free R
		allocate new object at R
\end{lstlisting}
%\includegraphics[width=1\textwidth]{figures/bench-churn.eps}
\caption{Churn Benchmark}
\label{fig:ChurnBenchFig}
\end{figure}

Only malloc and free are used to allocate and free objects, eliminating any extra cost, such as the memcpy in realloc.
Using only malloc/free measures the latency of memory allocation without paying any additional cost.
Churn simulates a memory-intensive program that can be tuned to create different scenarios.

The adjustment knobs for churn usage are:
\begin{description}[itemsep=0pt,parsep=0pt]
\item[thread:]
sets the number of threads, K.
\item[spots:]
sets the number of spots for churn, M.
\item[obj:]
sets the number of objects per thread, N.
\item[max:]
sets the maximum object size.
\item[min:]
sets the minimum object size.
\item[step:]
sets the object-size increment.
\item[distro:]
sets the object-size distribution.
\end{description}


\subsection{Cache Thrash}\label{sec:benchThrashSec}

The cache-thrash benchmark measures allocator-induced active false-sharing in an allocator, as illustrated in \VRef[Figure]{f:AllocatorInducedActiveFalseSharing}.
If a memory allocator places memory for multiple threads on the same cache line, program performance can degrade.
If two threads that share a cache line frequently and concurrently read/write their objects on that line, every access can cause a cache miss, because the other thread may have written to its location on the same cache line.

The cache-thrash benchmark creates a scenario that should lead to allocator-induced false sharing if the underlying memory allocator allocates dynamic memory for multiple threads on the same cache lines.
Ideally, a memory allocator should distance the dynamic memory region of one thread from those of other threads.
Having multiple threads allocate small objects simultaneously should cause the memory allocator to allocate objects for multiple threads on the same cache line, if it is not distancing the memory among different threads.

\VRef[Figure]{fig:benchThrashFig} shows the pseudo-code for the cache-thrash micro-benchmark.
First, it creates K worker threads.
Each worker thread allocates an object and intensively reads/writes it M times, frequently invalidating the cache line to slow down other threads that might be sharing it.
Each thread repeats this N times.
The main thread measures the total time taken for all worker threads to complete.
Worker threads sharing cache lines with each other will take longer.

\begin{figure}
\centering
\input{AllocInducedActiveFalseSharing}
\medskip
\begin{lstlisting}
Main Thread
	create worker threads
	...
	signal workers to allocate
	...
	signal workers to free
	...
	print addresses from each $thread$
Worker Thread$\(_1\)$
	allocate, write, read, free
	warmup memory in chunks of 16 bytes
	...
	malloc N objects
	...
	free objects
	return object address to Main Thread
Worker Thread$\(_2\)$
	// same as Worker Thread$\(_1\)$
\end{lstlisting}
%\input{MemoryOverhead}
%\includegraphics[width=1\textwidth]{figures/bench-cache-thrash.eps}
\caption{Allocator-Induced Active False-Sharing Benchmark}
\label{fig:benchThrashFig}
\end{figure}

The adjustment knobs for cache access scenarios are:
\begin{description}[itemsep=0pt,parsep=0pt]
\item[threadN:]
sets the number of threads, K.
\item[cacheIt:]
sets the number of iterations for the cache benchmark, N.
\item[cacheRep:]
sets the number of repetitions for the cache benchmark, M.
\item[cacheObj:]
sets the object size for the cache benchmark.
\end{description}


\subsection{Cache Scratch}

The cache-scratch benchmark measures allocator-induced passive false-sharing in an allocator.
An allocator can unintentionally induce false sharing depending upon its management of freed objects, as described in \VRef[Figure]{f:AllocatorInducedPassiveFalseSharing}.
If thread Thread$_1$ allocates multiple objects together, the memory allocator may place them on the same cache line.
If Thread$_1$ then passes these objects to thread Thread$_2$, both threads may share the same cache line, but this scenario is not induced by the allocator;
instead, the program induced this situation.
Now, if Thread$_2$ frees an object and then allocates an object of the same size, the allocator may return the same object, which is on a cache line shared with Thread$_1$.
This false sharing is now caused by the memory allocator, although it was started by the program.

The cache-scratch main-thread induces false sharing and creates a scenario that makes the memory allocator preserve the program-induced false sharing if the allocator does not return a freed object to its owner thread and, instead, reuses it instantly.
An allocator using object ownership, as described in \VRef[Section]{s:Ownership}, is less susceptible to allocator-induced passive false-sharing.
If the object is returned to the thread that owns or originally allocated it, then Thread$_2$ gets a new object that is less likely to be on the same cache line as Thread$_1$'s.

\VRef[Figure]{fig:benchScratchFig} shows the pseudo-code for the cache-scratch micro-benchmark.
First, the main thread allocates K dynamic objects together, one for each of the K worker threads, possibly causing the memory allocator to place these objects on the same cache line.
Then it creates the K worker threads and passes one of the K allocated objects to each thread.
Each worker thread frees the object passed by the main thread.
Then, it allocates an object and reads/writes it repeatedly M times, possibly causing frequent cache invalidations.
Each worker repeats this N times.

\begin{figure}
\centering
\input{AllocInducedPassiveFalseSharing}
\medskip
\begin{lstlisting}
Main Thread
	malloc N objects for each worker thread
	create worker threads and pass N objects to each worker
	...
	signal workers to allocate
	...
	signal workers to free
	...
	print addresses from each $thread$
Worker Thread$\(_1\)$
	allocate, write, read, free
	warmup memory in chunks of 16 bytes
	...
	for ( N )
		free an object passed by Main Thread
		malloc new object
	...
	free objects
	return new object addresses to Main Thread
Worker Thread$\(_2\)$
	// same as Worker Thread$\(_1\)$
\end{lstlisting}
%\includegraphics[width=1\textwidth]{figures/bench-cache-scratch.eps}
\caption{Program-Induced Passive False-Sharing Benchmark}
\label{fig:benchScratchFig}
\end{figure}

Each thread allocating an object after freeing the original object passed by the main thread should cause the memory allocator to return the same object that was initially allocated by the main thread, if the allocator does not return the initial object to its owner (the main thread).
Then, intensive read/write of the shared cache line by multiple threads should slow down the worker threads due to the high number of cache invalidations and misses.
The main thread measures the total time taken for all the workers to complete.

Similar to the cache-thrash benchmark in \VRef[Section]{sec:benchThrashSec}, different cache-access scenarios can be created using the following command-line arguments.
\begin{description}[itemsep=0pt,parsep=0pt]
\item[threads:]
sets the number of threads, K.
\item[iterations:]
sets the number of iterations for the cache benchmark, N.
\item[repetitions:]
sets the number of repetitions for the cache benchmark, M.
\item[objsize:]
sets the object size for the cache benchmark.
\end{description}
---|