\chapter{Benchmarks}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Micro Benchmark Suite
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

The aim of the micro benchmark suite is to create a set of programs that evaluate a memory allocator based on the
performance metrics described in (FIX ME: local cite). These programs can be taken as a standard for benchmarking an
allocator's basic goals. They report an allocator's memory overhead and speed under a certain allocation pattern. The
speed of the allocator is benchmarked in different ways, and, similarly, false sharing induced by an allocator is
measured in multiple ways. The benchmarks evaluate the allocator under a configurable allocation pattern that can be
changed using a few knobs, so an allocator's performance can be observed under a desired allocation pattern.

The micro benchmark suite benchmarks an allocator's performance by allocating dynamic objects and then measuring
specific metrics. The suite evaluates an allocator under a certain allocation pattern. The benchmarks have different
knobs that can be used to change the allocation pattern and evaluate an allocator under desired conditions. These knobs
are set by giving command-line arguments to the benchmark on execution.

\section{Current Benchmarks} There are multiple benchmarks that are built individually and evaluate different aspects of
a memory allocator. However, there is no single set of benchmarks that can be used to evaluate multiple aspects of
memory allocators.

\subsection{threadtest}(FIX ME: cite benchmark and hoard) Each thread repeatedly allocates and then deallocates 100,000
objects. The runtime of the benchmark evaluates its efficiency.

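The threadtest pattern can be sketched as a simple two-phase loop; the object count and size below are illustrative stand-ins, not the benchmark's actual parameters:

```c
#include <stdlib.h>
#include <stddef.h>

// Sketch of one threadtest-style round (illustrative parameters):
// allocate a batch of fixed-size objects, then deallocate them all.
enum { NOBJS = 100000, OBJ_SIZE = 8 };

size_t threadtest_round(void) {
    static void *objs[NOBJS];
    size_t allocated = 0;
    for (size_t i = 0; i < NOBJS; i++) {   // allocation phase
        objs[i] = malloc(OBJ_SIZE);
        if (objs[i]) allocated++;
    }
    for (size_t i = 0; i < NOBJS; i++)     // deallocation phase
        free(objs[i]);
    return allocated;
}
```

In the real benchmark, each of several threads runs this round concurrently and the total wall-clock time is reported.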
\subsection{shbench}(FIX ME: cite benchmark and hoard) Each thread allocates and randomly frees a number of random-sized
objects. It is a stress test that also uses runtime to determine the efficiency of the allocator.

\subsection{larson}(FIX ME: cite benchmark and hoard) Larson simulates a server environment. Multiple threads are
created, where each thread allocates and frees a number of objects within a size range. Some objects are passed from
threads to child threads to be freed. It calculates memory operations per second as an indicator of a memory
allocator's performance.

\section{Memory Benchmark} The memory benchmark measures the memory overhead of an allocator. It allocates a number of
dynamic objects. Then, by reading /proc/self/maps, it gets the total memory that the allocator has requested from the
OS. It calculates the memory overhead by taking the difference between the memory the allocator has requested from the
OS and the memory that the program has allocated.

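A minimal, Linux-specific sketch of reading /proc/self/maps is shown below; it sums every mapping, whereas the benchmark presumably restricts itself to heap and anonymous regions:

```c
#include <stdio.h>
#include <stdlib.h>

// Sketch (Linux-only, illustrative): total bytes of address space
// currently mapped by this process, parsed from /proc/self/maps.
// Each line starts with "start-end" hex addresses for one mapping.
size_t mapped_bytes(void) {
    FILE *maps = fopen("/proc/self/maps", "r");
    if (!maps) return 0;
    size_t total = 0;
    unsigned long start, end;
    char line[512];
    while (fgets(line, sizeof line, maps)) {
        if (sscanf(line, "%lx-%lx", &start, &end) == 2)
            total += end - start;   // size of this mapping
    }
    fclose(maps);
    return total;
}
```

The overhead figure is then this OS-level total minus the bytes the program itself requested via the allocation routines.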
\begin{figure}
\centering
\includegraphics[width=1\textwidth]{figures/bench-memory.eps}
\caption{Benchmark Memory Overhead}
\label{fig:benchMemoryFig}
\end{figure}

Figure \ref{fig:benchMemoryFig} gives the flow of the memory benchmark. It creates a producer-consumer scenario with K
producers, where each producer has M consumers. A producer has a separate buffer for each consumer and allocates N
objects of random sizes, following the given distribution, for each consumer. The consumer frees those objects. After
every memory operation, program memory usage is recorded throughout the runtime. This data can then be used to
visualize the memory usage and consumption of the program.

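One producer/consumer round can be sketched as follows; the counts, the size range, and the uniform size distribution are illustrative assumptions, and the real benchmark runs the two roles in separate threads:

```c
#include <stdlib.h>
#include <stddef.h>

// Illustrative single producer/consumer round: the producer fills one
// consumer's buffer with NPER objects of random size in [MIN_S, MAX_S]
// (uniform distribution assumed), and the consumer frees them.
enum { NPER = 1000, MIN_S = 16, MAX_S = 256 };

size_t produce(void *buf[NPER]) {
    size_t bytes = 0;
    for (int i = 0; i < NPER; i++) {
        size_t sz = MIN_S + rand() % (MAX_S - MIN_S + 1);
        buf[i] = malloc(sz);
        bytes += sz;        // total requested, for overhead accounting
    }
    return bytes;
}

void consume(void *buf[NPER]) {
    for (int i = 0; i < NPER; i++)
        free(buf[i]);
}
```

Tracking the returned byte total against the OS-level mapping total yields the overhead curve over the run.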
Different knobs can be adjusted to set a certain thread model.\\
-threadA : sets number of alloc threads (producers) for mem benchmark\\
-consumeS: sets production and consumption round size\\
-threadF : sets number of free threads (consumers) for each producer for mem benchmark

Object allocation size can be changed using the knobs:\\
-maxS : sets max object size\\
-minS : sets min object size\\
-stepS : sets object size increment\\
-distroS : sets object size distribution\\
-objN : sets number of objects per thread

\section{Speed Benchmark} The speed benchmark measures the runtime speed of an allocator (FIX ME: cite allocator routines).
It measures the runtime speed of individual memory allocation routines. It also considers different allocation chains,
measuring the performance of the allocator when multiple allocation routines are combined in a chain. It uses the
following chains and measures the allocator's runtime speed against them:
\begin{itemize}
\item malloc
\item realloc
\item free
\item calloc
\item malloc-free
\item realloc-free
\item calloc-free
\item malloc-realloc
\item calloc-realloc
\item malloc-realloc-free
\item calloc-realloc-free
\item malloc-realloc-free-calloc
\end{itemize}

\begin{figure}
\centering
\includegraphics[width=1\textwidth]{figures/bench-speed.eps}
\caption{Benchmark Speed}
\label{fig:benchSpeedFig}
\end{figure}

As laid out in figure \ref{fig:benchSpeedFig}, each chain is measured separately. Each routine in the chain is called
for N objects, and those allocated objects are then used when calling the next routine in the allocation chain. This
way, the complete latency of the memory allocator can be measured when multiple routines are chained together; e.g.,
malloc-realloc-free-calloc gives the whole picture of the major allocation routines combined in a chain.

For each chain, the time taken is recorded, which can then be used to visualize the performance of a memory allocator
against each chain.

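Timing one chain can be sketched as below for the malloc-free chain; the object count, the size, and the use of CLOCK_MONOTONIC are illustrative assumptions:

```c
#define _POSIX_C_SOURCE 199309L
#include <stdlib.h>
#include <time.h>

// Illustrative timing of the malloc-free chain over NOBJ objects:
// call the first routine for every object, then the next routine on
// the resulting objects, and report elapsed nanoseconds.
enum { NOBJ = 10000, SZ = 64 };

long long time_malloc_free_chain(void) {
    static void *objs[NOBJ];
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < NOBJ; i++) objs[i] = malloc(SZ); // first link
    for (int i = 0; i < NOBJ; i++) free(objs[i]);        // second link
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) * 1000000000LL
         + (t1.tv_nsec - t0.tv_nsec);
}
```

Longer chains such as malloc-realloc-free-calloc would add further passes between the two clock reads, one per routine in the chain.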
The following knobs can be adjusted to tune memory usage.\\
-maxS : sets max object size\\
-minS : sets min object size\\
-stepS : sets object size increment\\
-distroS : sets object size distribution\\
-objN : sets number of objects per thread\\
-threadN : sets number of worker threads

\section{Churn Benchmark} The churn benchmark measures the overall runtime speed of an allocator in a multi-threaded
scenario where each thread extensively allocates and frees dynamic memory.

\begin{figure}
\centering
\includegraphics[width=1\textwidth]{figures/bench-churn.eps}
\caption{Benchmark Churn}
\label{fig:benchChurnFig}
\end{figure}

Figure \ref{fig:benchChurnFig} illustrates the churn benchmark.
The benchmark creates a buffer with M spots and starts K threads. Each thread randomly picks a spot out of the M spots,
frees the object currently at that spot, and allocates a new object for that spot. Each thread repeats this cycle N
times. The main thread measures the total time taken for the whole benchmark, and that time is used to evaluate the
memory allocator's performance.

Only malloc and free are used to allocate and free an object, to eliminate any extra cost such as the memcpy in realloc.
Using only malloc/free measures the latency of memory allocation without paying any extra cost. Churn simulates a
memory-intensive program and can be tuned to create different scenarios.

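A single thread's churn loop can be sketched as follows; the spot count, cycle count, and object size are illustrative, and the real benchmark runs K such loops concurrently:

```c
#include <stdlib.h>
#include <stddef.h>

// Illustrative churn worker: a buffer of M spots; each cycle picks a
// random spot, frees whatever object is there, and installs a freshly
// malloc'd one. Only malloc/free are used, per the benchmark's design.
enum { M_SPOTS = 64, NCYCLES = 10000, CH_SZ = 32 };

size_t churn(void) {
    void *spots[M_SPOTS] = { 0 };
    size_t cycles = 0;
    for (int i = 0; i < NCYCLES; i++) {
        int s = rand() % M_SPOTS;  // pick a random spot
        free(spots[s]);            // free(NULL) is a harmless no-op
        spots[s] = malloc(CH_SZ);
        cycles++;
    }
    for (int s = 0; s < M_SPOTS; s++)  // release remaining objects
        free(spots[s]);
    return cycles;
}
```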
The following command-line arguments can be used to tune the benchmark.\\
-threadN : sets number of threads, K\\
-cSpots : sets number of spots for churn, M\\
-objN : sets number of objects per thread, N\\
-maxS : sets max object size\\
-minS : sets min object size\\
-stepS : sets object size increment\\
-distroS : sets object size distribution

\section{Cache Thrash}\label{sec:benchThrashSec} The cache thrash benchmark measures allocator-induced active false
sharing in an allocator, as illustrated in figure \ref{f:AllocatorInducedActiveFalseSharing}.
If a memory allocator allocates memory for multiple threads on the same cache line, program performance can suffer. If
both threads sharing a cache line frequently read/write their objects on that line concurrently, a cache miss occurs
every time a thread accesses its object, as the other thread may have written to its own memory location on the same
cache line.

\begin{figure}
\centering
\includegraphics[width=1\textwidth]{figures/bench-cache-thrash.eps}
\caption{Benchmark Allocator Induced Active False Sharing}
\label{fig:benchThrashFig}
\end{figure}

Cache thrash creates a scenario that should lead to allocator-induced false sharing if the underlying memory allocator
allocates dynamic memory for multiple threads on the same cache lines. Ideally, a memory allocator should distance the
dynamic memory region of one thread from those of other threads. Having multiple threads allocate small objects
simultaneously should cause the memory allocator to place objects for multiple threads on the same cache line if it is
not distancing the memory among different threads.

Figure \ref{fig:benchThrashFig} lays out the flow of the cache thrash benchmark.
It creates K worker threads. Each worker thread allocates an object and intensively reads/writes it M times,
invalidating cache lines frequently to slow down other threads that might be sharing a cache line with it. Each thread
repeats this N times. The main thread measures the total time taken for all worker threads to complete. Worker threads
sharing cache lines with each other will take longer.

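One worker's inner loop can be sketched as below; the iteration counts and object size are illustrative, and the slowdown only appears when several such workers run concurrently on objects sharing a cache line:

```c
#include <stdlib.h>

// Illustrative cache-thrash worker: allocate a small object, read/write
// it intensively REPS times (volatile keeps the accesses), free it, and
// repeat ITERS times. Concurrent workers whose objects share a cache
// line generate the invalidation traffic the benchmark measures.
enum { ITERS = 100, REPS = 1000, OBJ = 8 };

long thrash_worker(void) {
    long writes = 0;
    for (int n = 0; n < ITERS; n++) {
        volatile char *obj = malloc(OBJ);
        if (!obj) continue;
        for (int m = 0; m < REPS; m++) {
            obj[m % OBJ] = (char)m;   // intensive read/write phase
            writes++;
        }
        free((void *)obj);
    }
    return writes;
}
```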
Different cache access scenarios can be created using the following command-line arguments.\\
-threadN : sets number of threads, K\\
-cacheIt : iterations for cache benchmark, N\\
-cacheRep: repetitions for cache benchmark, M\\
-cacheObj: object size for cache benchmark

\section{Cache Scratch} The cache scratch benchmark measures allocator-induced passive false sharing in an allocator. An
allocator can unintentionally induce false sharing depending upon its management of freed objects, as described in
figure \ref{f:AllocatorInducedPassiveFalseSharing}. If a thread A allocates multiple objects together, they will
possibly be allocated on the same cache line by the memory allocator. If thread A then passes one of these objects to
another thread B, the two threads share the same cache line, but this scenario is not induced by the allocator;
the program induced it. Now, if thread B frees this object and then allocates an object of the same size, the
allocator may return the same object, which is on a cache line shared with thread A. This false sharing is caused by
the memory allocator, although it was started by the program.

\begin{figure}
\centering
\includegraphics[width=1\textwidth]{figures/bench-cache-scratch.eps}
\caption{Benchmark Program Induced Passive False Sharing}
\label{fig:benchScratchFig}
\end{figure}

The cache scratch main thread induces false sharing and creates a scenario that should make a memory allocator preserve
the program-induced false sharing if it does not return a freed object to its owner thread and, instead, re-uses it
instantly. An allocator using object ownership, as described in section \ref{s:Ownership}, is less susceptible to
allocator-induced passive false sharing: if the object is returned to the thread that owns it or originally allocated
it, then thread B will get a new object that is less likely to be on the same cache line as thread A's.

As in figure \ref{fig:benchScratchFig}, cache scratch allocates K dynamic objects together, one for each of the K worker
threads, possibly causing the memory allocator to allocate these objects on the same cache line. It then creates K
worker threads and passes one of the K allocated objects to each thread. Each worker thread frees the object passed by
the main thread. Then, it allocates an object and reads/writes it repetitively M times, causing frequent cache
invalidations. Each worker repeats this N times.

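The free-then-reallocate step at the heart of cache scratch can be sketched single-threaded as below; whether the allocator hands back the same address is exactly what separates an ownership-based allocator from one that re-uses freed memory instantly (the size is illustrative, and in the benchmark the free happens in a different thread from the original allocation):

```c
#include <stdlib.h>

// Illustrative core of the cache-scratch pattern: free the object that
// was handed over, then allocate one of the same size. An allocator
// without ownership may return the same address, preserving the
// program-started false sharing; returns 1 if the address was reused.
enum { SC_SZ = 8 };

int reuses_same_address(void) {
    void *handed = malloc(SC_SZ); // stands in for the main thread's object
    void *old = handed;
    free(handed);                 // worker frees the passed-in object
    void *fresh = malloc(SC_SZ);  // then allocates one of the same size
    int same = (fresh == old);
    free(fresh);
    return same;
}
```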
Each thread allocating an object after freeing the original object passed by the main thread should cause the memory
allocator to return the same object that was initially allocated by the main thread, if the allocator did not return
the initial object back to its owner (the main thread). Then, intensive read/write on the shared cache line by multiple
threads should slow down the worker threads due to high cache invalidations and misses. The main thread measures the
total time taken for all the workers to complete.

Similar to the cache thrash benchmark in section \ref{sec:benchThrashSec}, different cache access scenarios can be
created using the following command-line arguments.\\
-threadN : sets number of threads, K\\
-cacheIt : iterations for cache benchmark, N\\
-cacheRep: repetitions for cache benchmark, M\\
-cacheObj: object size for cache benchmark
---|