\chapter{Benchmarks}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Micro Benchmark Suite
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

The aim of the micro benchmark suite is to create a set of programs that evaluate a memory allocator based on the
performance metrics described in (FIX ME: local cite). These programs can be taken as a standard for benchmarking an
allocator's basic goals. They report an allocator's memory overhead and speed under a certain allocation pattern. The
speed of the allocator is benchmarked in different ways, and, similarly, false sharing induced by an allocator is
measured in multiple ways. The benchmarks evaluate the allocator under a configurable allocation pattern that can be
changed using a few knobs, so an allocator's performance can be observed under a desired allocation pattern.

The micro benchmark suite benchmarks an allocator's performance by allocating dynamic objects and then measuring
specific metrics. The suite evaluates an allocator under a certain allocation pattern. The benchmarks have different
knobs that can be used to change the allocation pattern and evaluate an allocator under desired conditions. These knobs
are set by giving command-line arguments to the benchmark on execution.

\section{Current Benchmarks} There are multiple benchmarks that are built individually and evaluate different aspects of
a memory allocator. However, there is no single set of benchmarks that can be used to evaluate multiple aspects of
memory allocators.

\subsection{threadtest}(FIX ME: cite benchmark and hoard) Each thread repeatedly allocates and then deallocates 100,000
objects. The runtime of the benchmark evaluates its efficiency.

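The threadtest pattern can be sketched as a simple two-phase loop; the object count and size below are illustrative stand-ins, not the benchmark's actual parameters:

```c
#include <stdlib.h>
#include <stddef.h>

// Sketch of one threadtest-style round (illustrative parameters):
// allocate a batch of fixed-size objects, then deallocate them all.
enum { NOBJS = 100000, OBJ_SIZE = 8 };

size_t threadtest_round(void) {
    static void *objs[NOBJS];
    size_t allocated = 0;
    for (size_t i = 0; i < NOBJS; i++) {   // allocation phase
        objs[i] = malloc(OBJ_SIZE);
        if (objs[i]) allocated++;
    }
    for (size_t i = 0; i < NOBJS; i++)     // deallocation phase
        free(objs[i]);
    return allocated;
}
```

In the real benchmark, each of several threads runs this round concurrently and the total wall-clock time is reported.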
\subsection{shbench}(FIX ME: cite benchmark and hoard) Each thread allocates and randomly frees a number of random-sized
objects. It is a stress test that also uses runtime to determine the efficiency of the allocator.

\subsection{larson}(FIX ME: cite benchmark and hoard) Larson simulates a server environment. Multiple threads are
created, where each thread allocates and frees a number of objects within a size range. Some objects are passed from
threads to child threads to be freed. It calculates memory operations per second as an indicator of a memory
allocator's performance.

\section{Memory Benchmark} The memory benchmark measures the memory overhead of an allocator. It allocates a number of
dynamic objects. Then, by reading /proc/self/maps, it gets the total memory that the allocator has requested from the
OS. It calculates the memory overhead by taking the difference between the memory the allocator has requested from the
OS and the memory that the program has allocated.

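A minimal, Linux-specific sketch of reading /proc/self/maps is shown below; it sums every mapping, whereas the benchmark presumably restricts itself to heap and anonymous regions:

```c
#include <stdio.h>
#include <stdlib.h>

// Sketch (Linux-only, illustrative): total bytes of address space
// currently mapped by this process, parsed from /proc/self/maps.
// Each line starts with "start-end" hex addresses for one mapping.
size_t mapped_bytes(void) {
    FILE *maps = fopen("/proc/self/maps", "r");
    if (!maps) return 0;
    size_t total = 0;
    unsigned long start, end;
    char line[512];
    while (fgets(line, sizeof line, maps)) {
        if (sscanf(line, "%lx-%lx", &start, &end) == 2)
            total += end - start;   // size of this mapping
    }
    fclose(maps);
    return total;
}
```

The overhead figure is then this OS-level total minus the bytes the program itself requested via the allocation routines.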
\begin{figure}
\centering
\includegraphics[width=1\textwidth]{figures/bench-memory.eps}
\caption{Benchmark Memory Overhead}
\label{fig:benchMemoryFig}
\end{figure}

Figure \ref{fig:benchMemoryFig} gives the flow of the memory benchmark. It creates a producer-consumer scenario with K
producers, where each producer has M consumers. A producer has a separate buffer for each consumer and allocates N
objects of random sizes, following the given distribution, for each consumer. The consumer frees those objects. After
every memory operation, program memory usage is recorded throughout the runtime. This data can then be used to
visualize the memory usage and consumption of the program.

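One producer/consumer round can be sketched as follows; the counts, the size range, and the uniform size distribution are illustrative assumptions, and the real benchmark runs the two roles in separate threads:

```c
#include <stdlib.h>
#include <stddef.h>

// Illustrative single producer/consumer round: the producer fills one
// consumer's buffer with NPER objects of random size in [MIN_S, MAX_S]
// (uniform distribution assumed), and the consumer frees them.
enum { NPER = 1000, MIN_S = 16, MAX_S = 256 };

size_t produce(void *buf[NPER]) {
    size_t bytes = 0;
    for (int i = 0; i < NPER; i++) {
        size_t sz = MIN_S + rand() % (MAX_S - MIN_S + 1);
        buf[i] = malloc(sz);
        bytes += sz;        // total requested, for overhead accounting
    }
    return bytes;
}

void consume(void *buf[NPER]) {
    for (int i = 0; i < NPER; i++)
        free(buf[i]);
}
```

Tracking the returned byte total against the OS-level mapping total yields the overhead curve over the run.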
Different knobs can be adjusted to set a certain thread model.\\
-threadA : sets number of alloc threads (producers) for mem benchmark\\
-consumeS: sets production and consumption round size\\
-threadF : sets number of free threads (consumers) for each producer for mem benchmark

Object allocation size can be changed using the knobs:\\
-maxS : sets max object size\\
-minS : sets min object size\\
-stepS : sets object size increment\\
-distroS : sets object size distribution\\
-objN : sets number of objects per thread

\section{Speed Benchmark} The speed benchmark measures the runtime speed of an allocator (FIX ME: cite allocator routines).
It measures the runtime speed of individual memory allocation routines. It also considers different allocation chains,
measuring the performance of the allocator when multiple allocation routines are combined in a chain. It uses the
following chains and measures the allocator's runtime speed against them:
\begin{itemize}
\item malloc
\item realloc
\item free
\item calloc
\item malloc-free
\item realloc-free
\item calloc-free
\item malloc-realloc
\item calloc-realloc
\item malloc-realloc-free
\item calloc-realloc-free
\item malloc-realloc-free-calloc
\end{itemize}

\begin{figure}
\centering
\includegraphics[width=1\textwidth]{figures/bench-speed.eps}
\caption{Benchmark Speed}
\label{fig:benchSpeedFig}
\end{figure}

As laid out in figure \ref{fig:benchSpeedFig}, each chain is measured separately. Each routine in the chain is called
for N objects, and those allocated objects are then used when calling the next routine in the allocation chain. This
way, the complete latency of the memory allocator can be measured when multiple routines are chained together; e.g.,
malloc-realloc-free-calloc gives the whole picture of the major allocation routines combined in a chain.

For each chain, the time taken is recorded, which can then be used to visualize the performance of a memory allocator
against each chain.

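Timing one chain can be sketched as below for the malloc-free chain; the object count, the size, and the use of CLOCK_MONOTONIC are illustrative assumptions:

```c
#define _POSIX_C_SOURCE 199309L
#include <stdlib.h>
#include <time.h>

// Illustrative timing of the malloc-free chain over NOBJ objects:
// call the first routine for every object, then the next routine on
// the resulting objects, and report elapsed nanoseconds.
enum { NOBJ = 10000, SZ = 64 };

long long time_malloc_free_chain(void) {
    static void *objs[NOBJ];
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < NOBJ; i++) objs[i] = malloc(SZ); // first link
    for (int i = 0; i < NOBJ; i++) free(objs[i]);        // second link
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) * 1000000000LL
         + (t1.tv_nsec - t0.tv_nsec);
}
```

Longer chains such as malloc-realloc-free-calloc would add further passes between the two clock reads, one per routine in the chain.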
The following knobs can be adjusted to tune memory usage.\\
-maxS : sets max object size\\
-minS : sets min object size\\
-stepS : sets object size increment\\
-distroS : sets object size distribution\\
-objN : sets number of objects per thread\\
-threadN : sets number of worker threads

\section{Churn Benchmark} The churn benchmark measures the overall runtime speed of an allocator in a multi-threaded
scenario where each thread extensively allocates and frees dynamic memory.

\begin{figure}
\centering
\includegraphics[width=1\textwidth]{figures/bench-churn.eps}
\caption{Benchmark Churn}
\label{fig:benchChurnFig}
\end{figure}

Figure \ref{fig:benchChurnFig} illustrates the churn benchmark.
The benchmark creates a buffer with M spots and starts K threads. Each thread randomly picks a spot out of the M spots,
frees the object currently at that spot, and allocates a new object for that spot. Each thread repeats this cycle N
times. The main thread measures the total time taken for the whole benchmark, and that time is used to evaluate the
memory allocator's performance.

Only malloc and free are used to allocate and free an object, to eliminate any extra cost such as the memcpy in realloc.
Using only malloc/free measures the latency of memory allocation without paying any extra cost. Churn simulates a
memory-intensive program and can be tuned to create different scenarios.

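A single thread's churn loop can be sketched as follows; the spot count, cycle count, and object size are illustrative, and the real benchmark runs K such loops concurrently:

```c
#include <stdlib.h>
#include <stddef.h>

// Illustrative churn worker: a buffer of M spots; each cycle picks a
// random spot, frees whatever object is there, and installs a freshly
// malloc'd one. Only malloc/free are used, per the benchmark's design.
enum { M_SPOTS = 64, NCYCLES = 10000, CH_SZ = 32 };

size_t churn(void) {
    void *spots[M_SPOTS] = { 0 };
    size_t cycles = 0;
    for (int i = 0; i < NCYCLES; i++) {
        int s = rand() % M_SPOTS;  // pick a random spot
        free(spots[s]);            // free(NULL) is a harmless no-op
        spots[s] = malloc(CH_SZ);
        cycles++;
    }
    for (int s = 0; s < M_SPOTS; s++)  // release remaining objects
        free(spots[s]);
    return cycles;
}
```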
The following command-line arguments can be used to tune the benchmark.\\
-threadN : sets number of threads, K\\
-cSpots : sets number of spots for churn, M\\
-objN : sets number of objects per thread, N\\
-maxS : sets max object size\\
-minS : sets min object size\\
-stepS : sets object size increment\\
-distroS : sets object size distribution

\section{Cache Thrash}\label{sec:benchThrashSec} The cache thrash benchmark measures allocator-induced active false
sharing in an allocator, as illustrated in figure \ref{f:AllocatorInducedActiveFalseSharing}.
If a memory allocator allocates memory for multiple threads on the same cache line, program performance can suffer. If
both threads sharing a cache line frequently read/write their objects on that line concurrently, a cache miss occurs
every time a thread accesses its object, as the other thread may have written to its own memory location on the same
cache line.

\begin{figure}
\centering
\includegraphics[width=1\textwidth]{figures/bench-cache-thrash.eps}
\caption{Benchmark Allocator Induced Active False Sharing}
\label{fig:benchThrashFig}
\end{figure}

Cache thrash creates a scenario that should lead to allocator-induced false sharing if the underlying memory allocator
allocates dynamic memory for multiple threads on the same cache lines. Ideally, a memory allocator should distance the
dynamic memory region of one thread from those of other threads. Having multiple threads allocate small objects
simultaneously should cause the memory allocator to place objects for multiple threads on the same cache line if it is
not distancing the memory among different threads.

Figure \ref{fig:benchThrashFig} lays out the flow of the cache thrash benchmark.
It creates K worker threads. Each worker thread allocates an object and intensively reads/writes it M times,
invalidating cache lines frequently to slow down other threads that might be sharing a cache line with it. Each thread
repeats this N times. The main thread measures the total time taken for all worker threads to complete. Worker threads
sharing cache lines with each other will take longer.

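One worker's inner loop can be sketched as below; the iteration counts and object size are illustrative, and the slowdown only appears when several such workers run concurrently on objects sharing a cache line:

```c
#include <stdlib.h>

// Illustrative cache-thrash worker: allocate a small object, read/write
// it intensively REPS times (volatile keeps the accesses), free it, and
// repeat ITERS times. Concurrent workers whose objects share a cache
// line generate the invalidation traffic the benchmark measures.
enum { ITERS = 100, REPS = 1000, OBJ = 8 };

long thrash_worker(void) {
    long writes = 0;
    for (int n = 0; n < ITERS; n++) {
        volatile char *obj = malloc(OBJ);
        if (!obj) continue;
        for (int m = 0; m < REPS; m++) {
            obj[m % OBJ] = (char)m;   // intensive read/write phase
            writes++;
        }
        free((void *)obj);
    }
    return writes;
}
```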
Different cache access scenarios can be created using the following command-line arguments.\\
-threadN : sets number of threads, K\\
-cacheIt : iterations for cache benchmark, N\\
-cacheRep: repetitions for cache benchmark, M\\
-cacheObj: object size for cache benchmark

\section{Cache Scratch} The cache scratch benchmark measures allocator-induced passive false sharing in an allocator. An
allocator can unintentionally induce false sharing depending upon its management of freed objects, as described in
figure \ref{f:AllocatorInducedPassiveFalseSharing}. If a thread A allocates multiple objects together, they will
possibly be allocated on the same cache line by the memory allocator. If thread A then passes one of these objects to
another thread B, the two threads share the same cache line, but this scenario is not induced by the allocator;
the program induced it. Now, if thread B frees this object and then allocates an object of the same size, the
allocator may return the same object, which is on a cache line shared with thread A. This false sharing is caused by
the memory allocator, although it was started by the program.

\begin{figure}
\centering
\includegraphics[width=1\textwidth]{figures/bench-cache-scratch.eps}
\caption{Benchmark Program Induced Passive False Sharing}
\label{fig:benchScratchFig}
\end{figure}

The cache scratch main thread induces false sharing and creates a scenario that should make a memory allocator preserve
the program-induced false sharing if it does not return a freed object to its owner thread and, instead, re-uses it
instantly. An allocator using object ownership, as described in section \ref{s:Ownership}, is less susceptible to
allocator-induced passive false sharing: if the object is returned to the thread that owns it or originally allocated
it, then thread B will get a new object that is less likely to be on the same cache line as thread A's.

As in figure \ref{fig:benchScratchFig}, cache scratch allocates K dynamic objects together, one for each of the K worker
threads, possibly causing the memory allocator to allocate these objects on the same cache line. It then creates K
worker threads and passes one of the K allocated objects to each thread. Each worker thread frees the object passed by
the main thread. Then, it allocates an object and reads/writes it repetitively M times, causing frequent cache
invalidations. Each worker repeats this N times.

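The free-then-reallocate step at the heart of cache scratch can be sketched single-threaded as below; whether the allocator hands back the same address is exactly what separates an ownership-based allocator from one that re-uses freed memory instantly (the size is illustrative, and in the benchmark the free happens in a different thread from the original allocation):

```c
#include <stdlib.h>

// Illustrative core of the cache-scratch pattern: free the object that
// was handed over, then allocate one of the same size. An allocator
// without ownership may return the same address, preserving the
// program-started false sharing; returns 1 if the address was reused.
enum { SC_SZ = 8 };

int reuses_same_address(void) {
    void *handed = malloc(SC_SZ); // stands in for the main thread's object
    void *old = handed;
    free(handed);                 // worker frees the passed-in object
    void *fresh = malloc(SC_SZ);  // then allocates one of the same size
    int same = (fresh == old);
    free(fresh);
    return same;
}
```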
Each thread allocating an object after freeing the original object passed by the main thread should cause the memory
allocator to return the same object that was initially allocated by the main thread, if the allocator did not return
the initial object back to its owner (the main thread). Then, intensive read/write on the shared cache line by multiple
threads should slow down the worker threads due to high cache invalidations and misses. The main thread measures the
total time taken for all the workers to complete.

Similar to the cache thrash benchmark in section \ref{sec:benchThrashSec}, different cache access scenarios can be
created using the following command-line arguments.\\
-threadN : sets number of threads, K\\
-cacheIt : iterations for cache benchmark, N\\
-cacheRep: repetitions for cache benchmark, M\\
-cacheObj: object size for cache benchmark
---|