source: doc/theses/mubeen_zulfiqar_MMath/benchmarks.tex @ cd1a5e8

Last change on this file since cd1a5e8 was cd1a5e8, checked in by Peter A. Buhr <pabuhr@…>, 2 years ago

reorder micro-benchmarks to match performance chapter

1\chapter{Benchmarks}
2\label{s:Benchmarks}
3
4%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
5%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
6%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Micro Benchmark Suite
7%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
8%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
9
10There are two basic approaches for evaluating computer software: benchmarks and micro-benchmarks.
11\begin{description}
12\item[Benchmarks]
13are a suite of application programs (SPEC CPU/WEB) that are exercised in a common way (inputs) to find differences among underlying software implementations associated with an application (compiler, memory allocator, web server, \etc).
The applications are supposed to represent common execution patterns that need to perform well with respect to an underlying software implementation.
15Benchmarks are often criticized for having overlapping patterns, insufficient patterns, or extraneous code that masks patterns.
16\item[Micro-Benchmarks]
17attempt to extract the common execution patterns associated with an application and run the pattern independently.
This approach removes any masking from extraneous application code, allows the execution pattern to be very precise, and provides an opportunity for the execution pattern to have multiple independent tuning adjustments (knobs).
19Micro-benchmarks are often criticized for inadequately representing real-world applications.
20\end{description}
21
22While some crucial software components have standard benchmarks, no standard benchmark exists for testing and comparing memory allocators.
In the past, an assortment of applications has been used for benchmarking allocators~\cite{Detlefs93,Berger00,Berger01,berger02reconsidering}: P2C, GS, Espresso/Espresso-2, CFRAC/CFRAC-2, GMake, GCC, Perl/Perl-2, Gawk/Gawk-2, XPDF/XPDF-2, ROBOOP, Lindsay.
As well, an assortment of micro-benchmarks has been used for benchmarking allocators~\cite{larson99memory,Berger00,streamflow}: threadtest, shbench, Larson, consume, false sharing.
25Many of these benchmark applications and micro-benchmarks are old and may not reflect current application allocation patterns.
26
27This thesis designs and examines a new set of micro-benchmarks for memory allocators that test a variety of allocation patterns, each with multiple tuning parameters.
The aim of the micro-benchmark suite is to create a set of programs that can evaluate a memory allocator based on the performance metrics described in (FIX ME: local cite).
29% These programs can be taken as a standard to benchmark an allocator's basic goals.
30These programs give details of an allocator's memory overhead and speed under certain allocation patterns.
31The allocation patterns are configurable (adjustment knobs) to observe an allocator's performance across a spectrum of events for a desired allocation pattern, which is seldom possible with benchmark programs.
32Each micro-benchmark program has multiple control knobs specified by command-line arguments.
33
The new micro-benchmark suite measures performance by allocating dynamic objects and measuring specific metrics.
35An allocator's speed is benchmarked in different ways, as are issues like false sharing.
36
37
38\section{Prior Multi-Threaded Micro-Benchmarks}
39
40Modern memory allocators, such as llheap, must handle multi-threaded programs at the KT and UT level.
41The following multi-threaded micro-benchmarks are presented to give a sense of prior work~\cite{Berger00} at the KT level.
None of the prior work addresses multi-threading at the UT level.
43
44
45\subsection{threadtest}
46
47This benchmark stresses the ability of the allocator to handle different threads allocating and deallocating independently.
48There is no interaction among threads, \ie no object sharing.
Each thread repeatedly allocates 100,000 \emph{8-byte} objects then deallocates them in the order they were allocated.
The runtime of the benchmark is used to evaluate the allocator's efficiency.
51
52
53\subsection{shbench}
54
This benchmark is similar to threadtest but each thread randomly allocates and frees a number of \emph{random-sized} objects.
56It is a stress test that also uses runtime to determine efficiency of the allocator.
57
58
59\subsection{Larson}
60
61This benchmark simulates a server environment.
62Multiple threads are created where each thread allocates and frees a number of random-sized objects within a size range.
63Before the thread terminates, it passes its array of 10,000 objects to a new child thread to continue the process.
64The number of thread generations varies depending on the thread speed.
It calculates memory operations per second as an indicator of the memory allocator's performance.
66
67
68\section{New Multi-Threaded Micro-Benchmarks}
69
70The following new benchmarks were created to assess multi-threaded programs at the KT and UT level.
71
72
73\subsection{Churn Benchmark}
74\label{s:ChurnBenchmark}
75
The churn benchmark measures the runtime speed of an allocator in a multi-threaded scenario, where each thread extensively allocates and frees dynamic memory.
Only @malloc@ and @free@ are used to eliminate any extra cost, such as zero filling in @calloc@ or @memcpy@ in @realloc@.
78Churn simulates a memory intensive program that can be tuned to create different scenarios.
79
80\VRef[Figure]{fig:ChurnBenchFig} shows the pseudo code for the churn micro-benchmark.
81This benchmark creates a buffer with M spots and starts K threads.
82Each thread picks a random spot in M, frees the object currently at that spot, and allocates a new object for that spot.
83Each thread repeats this cycle N times.
84The main thread measures the total time taken for the whole benchmark and that time is used to evaluate the memory allocator's performance.
85
86\begin{figure}
87\centering
88\begin{lstlisting}
89Main Thread
90        create worker threads
91        note time T1
92        ...
93        note time T2
94        churn_speed = (T2 - T1)
95Worker Thread
96        initialize variables
97        ...
98        for ( N )
99                R = random spot in array
100                free R
101                allocate new object at R
102\end{lstlisting}
103%\includegraphics[width=1\textwidth]{figures/bench-churn.eps}
104\caption{Churn Benchmark}
105\label{fig:ChurnBenchFig}
106\end{figure}
107
108The adjustment knobs for churn are:
109\begin{description}[itemsep=0pt,parsep=0pt]
110\item[thread:]
111number of threads (K).
112\item[spots:]
113number of spots for churn (M).
114\item[obj:]
115number of objects per thread (N).
116\item[max:]
117maximum object size.
118\item[min:]
119minimum object size.
120\item[step:]
121object size increment.
122\item[distro:]
object size distribution.
124\end{description}
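
As an illustration only, the worker loop of the churn benchmark can be sketched in C as follows.
The names @churn_args@, @churn_worker@ and @churn_run@ are hypothetical and are not taken from the benchmark source.
The sketch assumes POSIX threads, gives each worker its own array of M spots to avoid data races (the text above does not say whether the buffer is shared), draws sizes uniformly from min to max in increments of step instead of using a configurable distribution, and includes thread creation in the measured time.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical knob bundle; fields mirror the churn knobs. */
typedef struct {
    size_t spots;           /* M: spots in each worker's buffer (>= 1) */
    size_t ops;             /* N: free/allocate cycles per worker */
    size_t min, max, step;  /* sizes min, min+step, ... up to max (min, step >= 1) */
    unsigned seed;          /* per-worker seed for rand_r */
} churn_args;

static void *churn_worker(void *p) {
    churn_args *a = p;
    void **buf = calloc(a->spots, sizeof(void *));  /* every spot starts empty */
    assert(buf != NULL);
    size_t nsizes = (a->max - a->min) / a->step + 1;
    for (size_t i = 0; i < a->ops; i++) {
        size_t r = (size_t)rand_r(&a->seed) % a->spots;  /* random spot */
        free(buf[r]);                                     /* free(NULL) is a no-op */
        size_t size = a->min + ((size_t)rand_r(&a->seed) % nsizes) * a->step;
        buf[r] = malloc(size);                            /* new object at that spot */
        assert(buf[r] != NULL);
    }
    for (size_t i = 0; i < a->spots; i++)
        free(buf[i]);
    free(buf);
    return NULL;
}

/* Runs `threads` workers with the same knobs; returns elapsed seconds (T2 - T1). */
double churn_run(size_t threads, churn_args cfg) {
    pthread_t *tid = malloc(threads * sizeof *tid);
    churn_args *args = malloc(threads * sizeof *args);
    assert(tid != NULL && args != NULL);
    struct timespec t1, t2;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    for (size_t k = 0; k < threads; k++) {
        args[k] = cfg;
        args[k].seed = cfg.seed + (unsigned)k;  /* distinct random stream per worker */
        int rc = pthread_create(&tid[k], NULL, churn_worker, &args[k]);
        assert(rc == 0);
    }
    for (size_t k = 0; k < threads; k++)
        pthread_join(tid[k], NULL);
    clock_gettime(CLOCK_MONOTONIC, &t2);
    free(tid);
    free(args);
    return (double)(t2.tv_sec - t1.tv_sec) + (double)(t2.tv_nsec - t1.tv_nsec) / 1e9;
}
```

For example, @churn_run( 4, cfg )@ with 32 spots and sizes from 8 to 64 bytes in steps of 8 runs four workers and returns the elapsed time in seconds.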
125
126
127\subsection{Cache Thrash}
128\label{sec:benchThrashSec}
129
130The cache-thrash micro-benchmark measures allocator-induced active false-sharing as illustrated in \VRef{s:AllocatorInducedActiveFalseSharing}.
131If memory is allocated for multiple threads on the same cache line, this can significantly slow down program performance.
When threads share a cache line, frequent reads/writes to their objects on that line cause cache misses, which cause escalating delays as cache distance increases.
133
Cache thrash tries to create a scenario that leads to false sharing, if the underlying memory allocator is allocating dynamic memory to multiple threads on the same cache lines.
Ideally, a memory allocator should distance the dynamic memory region of one thread from another.
Having multiple threads allocating small objects simultaneously can cause a memory allocator to allocate objects on the same cache line, if it is not distancing the memory among different threads.
137
138\VRef[Figure]{fig:benchThrashFig} shows the pseudo code for the cache-thrash micro-benchmark.
139First, it creates K worker threads.
Each worker thread allocates an object and intensively reads/writes it M times to possibly invalidate cache lines, which may interfere with other threads sharing the same cache line.
Each thread repeats this N times.
The main thread measures the total time taken for all worker threads to complete.
143Worker threads sharing cache lines with each other will take longer.
144
145\begin{figure}
146\centering
147\input{AllocInducedActiveFalseSharing}
148\medskip
149\begin{lstlisting}
150Main Thread
151        create worker threads
152        ...
153        signal workers to allocate
154        ...
155        signal workers to free
156        ...
157        print addresses from each $thread$
158Worker Thread$\(_1\)$
159        allocate, write, read, free
160        warmup memory in chunkc of 16 bytes
161        ...
162        malloc N objects
163        ...
164        free objects
165        return object address to Main Thread
166Worker Thread$\(_2\)$
167        // same as Worker Thread$\(_1\)$
168\end{lstlisting}
169%\input{MemoryOverhead}
170%\includegraphics[width=1\textwidth]{figures/bench-cache-thrash.eps}
171\caption{Allocator-Induced Active False-Sharing Benchmark}
172\label{fig:benchThrashFig}
173\end{figure}
174
175The adjustment knobs for cache access scenarios are:
176\begin{description}[itemsep=0pt,parsep=0pt]
177\item[thread:]
178number of threads (K).
179\item[iterations:]
180iterations of cache benchmark (N).
181\item[cacheRW:]
182repetitions of reads/writes to object (M).
183\item[size:]
184object size.
185\end{description}
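
The following C sketch, given as an illustration under stated assumptions, follows the prose description of cache thrash: each worker allocates an object, reads/writes it M times, frees it, and repeats this N times.
The names @thrash_args@, @thrash_worker@, @thrash_run@ and @same_cache_line@ are hypothetical, POSIX threads are assumed, and the 64-byte cache line is an assumption that should be replaced by the value for the target machine.
Each worker writes its result field only once at the end, so the harness itself does not add false sharing between adjacent argument records.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

#define CACHE_LINE 64  /* assumed cache-line size in bytes */

/* Nonzero when both addresses fall on the same (assumed) cache line. */
int same_cache_line(const void *a, const void *b) {
    return (uintptr_t)a / CACHE_LINE == (uintptr_t)b / CACHE_LINE;
}

/* Hypothetical knob bundle; fields mirror the cache-thrash knobs. */
typedef struct {
    size_t iterations;  /* N: allocate/use/free rounds per worker */
    size_t cache_rw;    /* M: read/write repetitions per object */
    size_t size;        /* object size in bytes (>= 1) */
    uintptr_t last;     /* address of the worker's last object, set on exit */
} thrash_args;

static void *thrash_worker(void *p) {
    thrash_args *a = p;
    uintptr_t last = 0;
    for (size_t i = 0; i < a->iterations; i++) {
        volatile unsigned char *obj = malloc(a->size);
        assert(obj != NULL);
        obj[0] = 0;
        for (size_t j = 0; j < a->cache_rw; j++)
            obj[0] = (unsigned char)(obj[0] + 1);  /* one volatile read and write */
        last = (uintptr_t)obj;
        free((void *)obj);
    }
    a->last = last;  /* single write, after the measured loop */
    return NULL;
}

/* Runs one worker per element of args; returns elapsed seconds. */
double thrash_run(thrash_args *args, size_t threads) {
    pthread_t *tid = malloc(threads * sizeof *tid);
    assert(tid != NULL);
    struct timespec t1, t2;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    for (size_t k = 0; k < threads; k++) {
        int rc = pthread_create(&tid[k], NULL, thrash_worker, &args[k]);
        assert(rc == 0);
    }
    for (size_t k = 0; k < threads; k++)
        pthread_join(tid[k], NULL);
    clock_gettime(CLOCK_MONOTONIC, &t2);
    free(tid);
    return (double)(t2.tv_sec - t1.tv_sec) + (double)(t2.tv_nsec - t1.tv_nsec) / 1e9;
}
```

Comparing the reported addresses with @same_cache_line@ gives a rough indication of whether the allocator placed objects of different workers on one line; the elapsed time remains the primary measurement.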
186
187
188\subsection{Cache Scratch}
189\label{s:CacheScratch}
190
191The cache-scratch micro-benchmark measures allocator-induced passive false-sharing as illustrated in \VRef{s:AllocatorInducedPassiveFalseSharing}.
192As for cache thrash, if memory is allocated for multiple threads on the same cache line, this can significantly slow down program performance.
193In this scenario, the false sharing is being caused by the memory allocator although it is started by the program sharing an object.
194
195% An allocator can unintentionally induce false sharing depending upon its management of the freed objects.
196% If thread Thread$_1$ allocates multiple objects together, they may be allocated on the same cache line by the memory allocator.
197% If Thread$_1$ passes these object to thread Thread$_2$, then both threads may share the same cache line but this scenerio is not induced by the allocator;
198% instead, the program induced this situation.
199% Now if Thread$_2$ frees this object and then allocate an object of the same size, the allocator may return the same object, which is on a cache line shared with thread Thread$_1$.
200
Cache scratch tries to create a scenario in which the memory allocator preserves program-induced false sharing, which happens if the allocator does not return a freed object to its owner thread and, instead, immediately reuses it for the freeing thread.
202An allocator using object ownership, as described in section \VRef{s:Ownership}, is less susceptible to allocator-induced passive false-sharing.
203If the object is returned to the thread who owns it, then the thread that gets a new object is less likely to be on the same cache line.
204
205\VRef[Figure]{fig:benchScratchFig} shows the pseudo code for the cache-scratch micro-benchmark.
First, it allocates K dynamic objects together, one for each of the K worker threads, possibly causing the memory allocator to allocate these objects on the same cache line.
Then it creates K worker threads and passes an object from the K allocated objects to each of the K threads.
Each worker thread frees the object passed by the main thread.
Then, it allocates an object and reads/writes it repeatedly M times, possibly causing frequent cache invalidations.
210Each worker repeats this N times.
211
212\begin{figure}
213\centering
214\input{AllocInducedPassiveFalseSharing}
215\medskip
216\begin{lstlisting}
217Main Thread
218        malloc N objects $for$ each worker $thread$
219        create worker threads and pass N objects to each worker
220        ...
221        signal workers to allocate
222        ...
223        signal workers to free
224        ...
225        print addresses from each $thread$
226Worker Thread$\(_1\)$
227        allocate, write, read, free
        warmup memory in chunks of 16 bytes
229        ...
230        for ( N )
231                free an object passed by Main Thread
232                malloc new object
233        ...
234        free objects
235        return new object addresses to Main Thread
236Worker Thread$\(_2\)$
237        // same as Worker Thread$\(_1\)$
238\end{lstlisting}
239%\includegraphics[width=1\textwidth]{figures/bench-cache-scratch.eps}
240\caption{Program-Induced Passive False-Sharing Benchmark}
241\label{fig:benchScratchFig}
242\end{figure}
243
When a worker allocates an object after freeing the original object passed by the main thread, the memory allocator is expected to return the same object that was initially allocated by the main thread, provided the allocator does not return the freed object to its owner (the main thread).
Then, intensive reads/writes on the shared cache line by multiple threads should slow down the worker threads due to high cache invalidations and misses.
The main thread measures the total time taken for all the workers to complete.
247
248Similar to benchmark cache thrash in section \VRef{sec:benchThrashSec}, different cache access scenarios can be created using the following command-line arguments.
249\begin{description}[itemsep=0pt,parsep=0pt]
250\item[threads:]
251number of threads (K).
252\item[iterations:]
253iterations of cache benchmark (N).
254\item[cacheRW:]
255repetitions of reads/writes to object (M).
256\item[size:]
257object size.
258\end{description}
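
The steps above can be sketched in C as follows; this is an illustrative sketch, not the benchmark's implementation.
The names @scratch_args@, @scratch_worker@ and @scratch_run@ are hypothetical and POSIX threads are assumed.
Timing is omitted for brevity; instead, the sketch counts how many workers receive the address of their passed object back on their first allocation, which hints at whether the allocator reuses a freed object immediately instead of returning it to its owner.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical per-worker record; fields mirror the cache-scratch knobs. */
typedef struct {
    void *passed;       /* object allocated by the main thread */
    size_t iterations;  /* N: free/allocate/use rounds per worker */
    size_t cache_rw;    /* M: read/write repetitions per object */
    size_t size;        /* object size in bytes (>= 1) */
    int reused;         /* 1 if the first malloc returned the passed address */
} scratch_args;

static void *scratch_worker(void *p) {
    scratch_args *a = p;
    void *obj = a->passed;
    uintptr_t passed_addr = (uintptr_t)obj;  /* saved before the object is freed */
    int reused = 0;
    for (size_t i = 0; i < a->iterations; i++) {
        free(obj);               /* first round frees the main thread's object */
        obj = malloc(a->size);
        assert(obj != NULL);
        if (i == 0 && (uintptr_t)obj == passed_addr)
            reused = 1;
        volatile unsigned char *c = obj;
        c[0] = 0;
        for (size_t j = 0; j < a->cache_rw; j++)
            c[0] = (unsigned char)(c[0] + 1);  /* one volatile read and write */
    }
    free(obj);
    a->reused = reused;  /* single write, after the loop */
    return NULL;
}

/* Returns the number of workers whose first malloc returned the passed object. */
size_t scratch_run(size_t threads, size_t iterations, size_t cache_rw, size_t size) {
    pthread_t *tid = malloc(threads * sizeof *tid);
    scratch_args *args = malloc(threads * sizeof *args);
    assert(tid != NULL && args != NULL);
    for (size_t k = 0; k < threads; k++) {  /* K objects allocated back to back */
        args[k] = (scratch_args){ malloc(size), iterations, cache_rw, size, 0 };
        assert(args[k].passed != NULL);
    }
    for (size_t k = 0; k < threads; k++) {
        int rc = pthread_create(&tid[k], NULL, scratch_worker, &args[k]);
        assert(rc == 0);
    }
    size_t count = 0;
    for (size_t k = 0; k < threads; k++) {
        pthread_join(tid[k], NULL);
        count += (size_t)args[k].reused;
    }
    free(tid);
    free(args);
    return count;
}
```

A result of zero from @scratch_run@ does not prove the absence of passive false sharing, because a new object can land on the same cache line at a different address.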
259
260
261\subsection{Speed Micro-Benchmark}
262\label{s:SpeedMicroBenchmark}
263
264The speed benchmark measures the runtime speed of individual and sequences of memory allocation routines:
265\begin{enumerate}[itemsep=0pt,parsep=0pt]
266\item malloc
267\item realloc
268\item free
269\item calloc
270\item malloc-free
271\item realloc-free
272\item calloc-free
273\item malloc-realloc
274\item calloc-realloc
275\item malloc-realloc-free
276\item calloc-realloc-free
277\item malloc-realloc-free-calloc
278\end{enumerate}
279
280\VRef[Figure]{fig:SpeedBenchFig} shows the pseudo code for the speed micro-benchmark.
281Each routine in the chain is called for N objects and then those allocated objects are used when calling the next routine in the allocation chain.
282This tests the latency of the memory allocator when multiple routines are chained together, \eg the call sequence malloc-realloc-free-calloc gives a complete picture of the major allocation routines when combined together.
283For each chain, the time is recorded to visualize performance of a memory allocator against each chain.
284
285\begin{figure}
286\centering
287\begin{lstlisting}[morekeywords={foreach}]
288Main Thread
289        create worker threads
290        foreach ( allocation chain )
291                note time T1
292                ...
293                note time T2
294                chain_speed = (T2 - T1) / number-of-worker-threads * N )
295Worker Thread
296        initialize variables
297        ...
298        foreach ( routine in allocation chain )
299                call routine N times
300\end{lstlisting}
301%\includegraphics[width=1\textwidth]{figures/bench-speed.eps}
302\caption{Speed Benchmark}
303\label{fig:SpeedBenchFig}
304\end{figure}
305
The adjustment knobs for the speed benchmark are:
307\begin{description}[itemsep=0pt,parsep=0pt]
308\item[max:]
309maximum object size.
310\item[min:]
311minimum object size.
312\item[step:]
313object size increment.
314\item[distro:]
315object size distribution.
316\item[objects:]
317number of objects per thread.
318\item[workers:]
319number of worker threads.
320\end{description}
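
As a minimal sketch of one allocation chain, the following C function times malloc-realloc-free over N objects and divides the elapsed time by the number of workers times N.
The function name @chain_malloc_realloc_free@ is hypothetical, and the sketch assumes a single worker thread, a fixed object size, and a @realloc@ that doubles each object; the actual benchmark draws sizes from the configured distribution and runs multiple workers.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <time.h>

static double now_sec(void) {
    struct timespec t;
    clock_gettime(CLOCK_MONOTONIC, &t);
    return (double)t.tv_sec + (double)t.tv_nsec / 1e9;
}

/* Runs the chain malloc -> realloc -> free on n objects of `size` bytes.
   Each routine is called for all n objects before the next routine starts.
   Returns (T2 - T1) / (workers * n) in seconds, with workers fixed at 1. */
double chain_malloc_realloc_free(size_t n, size_t size) {
    assert(n > 0 && size > 0);
    const size_t workers = 1;  /* single-worker simplification */
    void **objs = malloc(n * sizeof *objs);
    assert(objs != NULL);

    double t1 = now_sec();
    for (size_t i = 0; i < n; i++) {  /* routine 1: malloc */
        objs[i] = malloc(size);
        assert(objs[i] != NULL);
    }
    for (size_t i = 0; i < n; i++) {  /* routine 2: realloc to twice the size */
        void *q = realloc(objs[i], size * 2);
        assert(q != NULL);
        objs[i] = q;
    }
    for (size_t i = 0; i < n; i++)    /* routine 3: free */
        free(objs[i]);
    double t2 = now_sec();

    free(objs);
    return (t2 - t1) / (double)(workers * n);
}
```

Other chains follow the same shape by swapping the routine called in each loop.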
321
322
323\subsection{Memory Micro-Benchmark}
324\label{s:MemoryMicroBenchmark}
325
326The memory micro-benchmark measures the memory overhead of an allocator.
It allocates a number of dynamic objects and reads @/proc/self/maps@ to get the total memory requested by the allocator from the OS.
328It calculates the memory overhead by computing the difference between the memory the allocator requests from the OS and the memory that the program allocates.
329This micro-benchmark is like Larson and stresses the ability of an allocator to deal with object sharing.
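
A rough sketch of the overhead calculation, assuming a Linux system where the mappings of the running process are listed in @/proc/self/maps@, is shown below.
The names @mapped_bytes@ and @alloc_overhead@ are hypothetical.
The sketch sums every mapping, including code and libraries, and takes the difference of two snapshots; the actual benchmark may restrict the sum to specific regions, and reading the file itself allocates a small amount of memory that perturbs the result.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/* Sums the sizes of all mappings in /proc/self/maps (Linux only).
   Returns 0 if the file cannot be opened. */
size_t mapped_bytes(void) {
    FILE *f = fopen("/proc/self/maps", "r");
    if (f == NULL)
        return 0;
    char *line = NULL;
    size_t cap = 0, total = 0;
    while (getline(&line, &cap, f) != -1) {
        unsigned long start, end;
        /* each line begins with "start-end" in hexadecimal */
        if (sscanf(line, "%lx-%lx", &start, &end) == 2 && end > start)
            total += end - start;
    }
    free(line);
    fclose(f);
    return total;
}

/* Allocates n objects of `size` bytes (n, size >= 1) and returns
   (growth in mapped memory) - (bytes requested by the program).
   The value can be negative when the allocator serves the requests
   from memory it had already obtained from the OS. */
long long alloc_overhead(size_t n, size_t size) {
    void **objs = malloc(n * sizeof *objs);
    assert(objs != NULL);
    size_t before = mapped_bytes();
    for (size_t i = 0; i < n; i++) {
        objs[i] = malloc(size);
        assert(objs[i] != NULL);
    }
    size_t after = mapped_bytes();
    for (size_t i = 0; i < n; i++)
        free(objs[i]);
    free(objs);
    return (long long)after - (long long)before - (long long)(n * size);
}
```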
330
331\VRef[Figure]{fig:MemoryBenchFig} shows the pseudo code for the memory micro-benchmark.
332It creates a producer-consumer scenario with K producer threads and each producer has M consumer threads.
333A producer has a separate buffer for each consumer and allocates N objects of random sizes following a settable distribution for each consumer.
334A consumer frees these objects.
Program memory usage is recorded after every memory operation.
336This data is used to visualize the memory usage and consumption for the program.
337
338\begin{figure}
339\centering
340\begin{lstlisting}
341Main Thread
342        print memory snapshot
343        create producer threads
344Producer Thread (K)
345        set free start
346        create consumer threads
347        for ( N )
348                allocate memory
349                print memory snapshot
350Consumer Thread (M)
351        wait while ( allocations < free start )
352        for ( N )
353                free memory
354                print memory snapshot
355\end{lstlisting}
356%\includegraphics[width=1\textwidth]{figures/bench-memory.eps}
357\caption{Memory Footprint Micro-Benchmark}
358\label{fig:MemoryBenchFig}
359\end{figure}
360
361The global adjustment knobs for this micro-benchmark are:
362\begin{description}[itemsep=0pt,parsep=0pt]
363\item[producer (K):]
364sets the number of producer threads.
365\item[consumer (M):]
sets the number of consumer threads for each producer.
367\item[round:]
368sets production and consumption round size.
369\end{description}
370
371The adjustment knobs for object allocation are:
372\begin{description}[itemsep=0pt,parsep=0pt]
373\item[max:]
374maximum object size.
375\item[min:]
376minimum object size.
377\item[step:]
378object size increment.
379\item[distro:]
380object size distribution.
381\item[objects (N):]
382number of objects per thread.
383\end{description}