Changeset e357efb
- Timestamp: Apr 25, 2022, 10:55:56 AM
- Branches: ADT, ast-experimental, master, pthread-emulation, qualifiedEnum
- Children: 69ec0fb
- Parents: 52a532ae
- File (1 edited):
doc/theses/mubeen_zulfiqar_MMath/benchmarks.tex
In the past, an assortment of applications have been used for benchmarking allocators~\cite{Detlefs93,Berger00,Berger01,berger02reconsidering}: P2C, GS, Espresso/Espresso-2, CFRAC/CFRAC-2, GMake, GCC, Perl/Perl-2, Gawk/Gawk-2, XPDF/XPDF-2, ROBOOP, Lindsay.
As well, an assortment of micro-benchmarks have been used for benchmarking allocators~\cite{larson99memory,Berger00,streamflow}: threadtest, shbench, Larson, consume, false sharing.
Many of these benchmark applications and micro-benchmarks are old and may not reflect current application allocation patterns.

This thesis designs and examines a new set of micro-benchmarks for memory allocators that test a variety of allocation patterns, each with multiple tuning parameters.
…
These programs give details of an allocator's memory overhead and speed under certain allocation patterns.
The allocation patterns are configurable (adjustment knobs) to observe an allocator's performance across a spectrum of events for a desired allocation pattern, which is seldom possible with benchmark programs.
Each micro-benchmark program has multiple control knobs specified by command-line arguments.

The new micro-benchmark suite measures performance by allocating dynamic objects and measuring specific metrics.
…

\section{Prior Multi-Threaded Micro-Benchmarks}

Modern memory allocators, such as llheap, must handle multi-threaded programs at the KT and UT level.
The following multi-threaded micro-benchmarks are presented to give a sense of prior work~\cite{Berger00} at the KT level.
None of the prior work addresses multi-threading at the UT level.


\subsection{threadtest}

This benchmark stresses the ability of the allocator to handle different threads allocating and deallocating independently.
There is no interaction among threads, \ie no object sharing.
Each thread repeatedly allocates 100,000 \emph{8-byte} objects and then deallocates them in the order they were allocated.
The runtime of the benchmark evaluates its efficiency.


\subsection{shbench}

This benchmark is similar to threadtest, but each thread randomly allocates and frees a number of \emph{random-sized} objects.
It is a stress test that also uses runtime to determine the efficiency of the allocator.


\subsection{Larson}

This benchmark simulates a server environment.
Multiple threads are created where each thread allocates and frees a number of random-sized objects within a size range.
Before the thread terminates, it passes its array of 10,000 objects to a new child thread to continue the process.
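To give a concrete sense of the Larson pattern, the following is a minimal C sketch of one worker chain; the size range, round limit, seeding, and handoff details are illustrative assumptions, not the benchmark's actual parameters.

\begin{lstlisting}[language=C]
#include <stdlib.h>
#include <pthread.h>

enum { OBJS = 10000, MIN_SIZE = 8, MAX_SIZE = 512, ROUNDS = 100 };

struct Pass { void * objs[OBJS]; int rounds; };	// array handed from parent to child

// A thread frees the objects inherited from its parent, allocates
// random-sized replacements, then hands the array to a new child thread.
static void * worker( void * arg ) {
	struct Pass * pass = arg;
	unsigned int seed = (unsigned int)pass->rounds;	// illustrative seeding
	for ( int i = 0; i < OBJS; i += 1 ) {
		free( pass->objs[i] );			// free object from parent thread
		pass->objs[i] = malloc( MIN_SIZE + rand_r( &seed ) % (MAX_SIZE - MIN_SIZE + 1) );
	}
	pass->rounds -= 1;
	if ( pass->rounds > 0 ) {			// continue the process in a child thread
		pthread_t child;
		pthread_create( &child, NULL, worker, pass );
		pthread_join( child, NULL );
	}
	return NULL;
}

int main() {
	static struct Pass pass = { .rounds = ROUNDS };
	for ( int i = 0; i < OBJS; i += 1 ) pass.objs[i] = malloc( MIN_SIZE );
	pthread_t first;
	pthread_create( &first, NULL, worker, &pass );
	pthread_join( first, NULL );
}
\end{lstlisting}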
…

\section{New Multi-Threaded Micro-Benchmarks}

The following new benchmarks were created to assess multi-threaded programs at the KT and UT level.

…
The memory micro-benchmark measures the memory overhead of an allocator.
It allocates a number of dynamic objects and reads @/proc/self/maps@ to get the total memory requested by the allocator from the OS.
It calculates the memory overhead by computing the difference between the memory the allocator requests from the OS and the memory that the program allocates.
This micro-benchmark is like Larson and stresses the ability of an allocator to deal with object sharing.
A sketch of taking such a memory snapshot is given after the figure below.

\VRef[Figure]{fig:MemoryBenchFig} shows the pseudo code for the memory micro-benchmark.
It creates a producer-consumer scenario with K producer threads, where each producer has M consumer threads.
A producer has a separate buffer for each consumer and allocates N objects of random sizes following a settable distribution for each consumer.
A consumer frees these objects.
After every memory operation, program memory usage is recorded throughout the runtime.
This data is used to visualize the memory usage and consumption for the program.

\begin{figure}
…
	print memory snapshot
	create producer threads
Producer Thread (K)
	set free start
	create consumer threads
	…
		allocate memory
		print memory snapshot
Consumer Thread (M)
	wait while ( allocations < free start )
	for ( N )
	…
\end{lstlisting}
%\includegraphics[width=1\textwidth]{figures/bench-memory.eps}
\caption{Memory Footprint Micro-Benchmark}
\label{fig:MemoryBenchFig}
\end{figure}
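The following is a minimal C sketch of one way such a snapshot can be taken on Linux by summing the address ranges listed in @/proc/self/maps@; the function name @memory_snapshot@ and the choice to count every mapping, rather than only allocator-related ones, are illustrative assumptions.

\begin{lstlisting}[language=C]
#include <stdio.h>
#include <stdlib.h>

// Sum the extents of all address ranges in /proc/self/maps (Linux).
// A real harness would filter for allocator-related mappings; counting
// everything is a simplification for illustration.
static size_t memory_snapshot( void ) {
	FILE * maps = fopen( "/proc/self/maps", "r" );
	if ( maps == NULL ) return 0;
	size_t total = 0;
	unsigned long start, end;
	char line[4096];
	while ( fgets( line, sizeof(line), maps ) != NULL ) {
		if ( sscanf( line, "%lx-%lx", &start, &end ) == 2 )
			total += end - start;		// extent of this mapping
	}
	fclose( maps );
	return total;					// bytes currently mapped
}

int main() {
	size_t before = memory_snapshot();
	void * p = malloc( 16 * 1024 * 1024 );	// force the allocator to request memory from the OS
	size_t after = memory_snapshot();
	printf( "mapped-memory growth: %zu bytes\n", after - before );
	free( p );
}
\end{lstlisting}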
The global adjustment knobs for this micro-benchmark are:
\begin{description}[itemsep=0pt,parsep=0pt]
\item[producer (K):]
sets the number of producer threads.
\item[consumer (M):]
sets the number of consumer threads for each producer.
\item[round:]
sets production and consumption round size.
\end{description}

The adjustment knobs for object allocation are:
\begin{description}[itemsep=0pt,parsep=0pt]
\item[max:]
maximum object size.
\item[min:]
minimum object size.
\item[step:]
object size increment.
\item[distro:]
object size distribution.
\item[objects (N):]
number of objects per thread.
\end{description}

…
\VRef[Figure]{fig:SpeedBenchFig} shows the pseudo code for the speed micro-benchmark.
Each routine in the chain is called for N objects and then those allocated objects are used when calling the next routine in the allocation chain.
This tests the latency of the memory allocator when multiple routines are chained together, \eg the call sequence malloc-realloc-free-calloc gives a complete picture of the major allocation routines when combined together.
For each chain, the time is recorded to visualize the performance of a memory allocator against each chain.
A sketch of timing one such chain is given after the knob list below.

\begin{figure}
…
\begin{description}[itemsep=0pt,parsep=0pt]
\item[max:]
maximum object size.
\item[min:]
minimum object size.
\item[step:]
object size increment.
\item[distro:]
object size distribution.
\item[objects:]
number of objects per thread.
\item[workers:]
number of worker threads.
\end{description}
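To make the chaining idea concrete, the following is a minimal C sketch that times a shortened malloc-realloc-free chain over N objects; the chain composition, sizes, and use of @clock_gettime@ are illustrative assumptions.

\begin{lstlisting}[language=C]
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

enum { N = 100000, SIZE = 64, BIGGER = 128 };

// Time one allocation chain: malloc N objects, realloc each object to a
// larger size, then free them all.  Each routine operates on the objects
// produced by the previous routine in the chain.
int main() {
	static void * objs[N];
	struct timespec start, end;
	clock_gettime( CLOCK_MONOTONIC, &start );
	for ( int i = 0; i < N; i += 1 ) objs[i] = malloc( SIZE );
	for ( int i = 0; i < N; i += 1 ) objs[i] = realloc( objs[i], BIGGER );
	for ( int i = 0; i < N; i += 1 ) free( objs[i] );
	clock_gettime( CLOCK_MONOTONIC, &end );
	double secs = (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) * 1e-9;
	printf( "malloc-realloc-free chain: %.6f seconds\n", secs );
}
\end{lstlisting}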
\subsection{Churn Benchmark}

The churn benchmark measures the runtime speed of an allocator in a multi-threaded scenario, where each thread extensively allocates and frees dynamic memory.
Only @malloc@ and @free@ are used to eliminate any extra cost, such as @memcpy@ in @calloc@ or @realloc@.
Churn simulates a memory intensive program that can be tuned to create different scenarios.

\VRef[Figure]{fig:ChurnBenchFig} shows the pseudo code for the churn micro-benchmark.
This benchmark creates a buffer with M spots and starts K threads.
Each thread picks a random spot in M, frees the object currently at that spot, and allocates a new object for that spot.
Each thread repeats this cycle N times.
The main thread measures the total time taken for the whole benchmark and that time is used to evaluate the memory allocator's performance.
A sketch of the per-thread churn cycle is given after the knob list below.

\begin{figure}
…
	...
	for ( N )
		R = random spot in array
		free R
		allocate new object at R
…
\end{figure}

The adjustment knobs for churn are:
\begin{description}[itemsep=0pt,parsep=0pt]
\item[thread:]
number of threads (K).
\item[spots:]
number of spots for churn (M).
\item[obj:]
number of objects per thread (N).
\item[max:]
maximum object size.
\item[min:]
minimum object size.
\item[step:]
object size increment.
\item[distro:]
object size distribution.
\end{description}
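The per-thread churn cycle can be sketched in C as follows; the constants and the atomic pointer swap, which keeps concurrent frees on the shared spots safe, are illustrative assumptions.

\begin{lstlisting}[language=C]
#include <stdlib.h>
#include <stdint.h>
#include <stdatomic.h>
#include <pthread.h>

enum { K = 4, M = 1024, N = 1000000, SIZE = 64 };	// threads, spots, cycles, object size

static _Atomic(void *) spots[M];	// shared buffer of M spots

// Each thread repeatedly replaces the object at a random spot with a
// freshly allocated one and frees whatever object was there before.
static void * churn( void * arg ) {
	unsigned int seed = (unsigned int)(uintptr_t)arg;
	for ( int i = 0; i < N; i += 1 ) {
		int r = rand_r( &seed ) % M;	// pick a random spot in the array
		void * old = atomic_exchange( &spots[r], malloc( SIZE ) );
		free( old );			// free object previously at the spot (NULL initially)
	}
	return NULL;
}

int main() {
	pthread_t workers[K];
	for ( intptr_t t = 0; t < K; t += 1 )
		pthread_create( &workers[t], NULL, churn, (void *)(t + 1) );
	for ( int t = 0; t < K; t += 1 )
		pthread_join( workers[t], NULL );
	for ( int s = 0; s < M; s += 1 )
		free( spots[s] );		// release the remaining objects
}
\end{lstlisting}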
\subsection{Cache Thrash}\label{sec:benchThrashSec}

The cache-thrash micro-benchmark measures allocator-induced active false-sharing as illustrated in \VRef{s:AllocatorInducedActiveFalseSharing}.
If memory is allocated for multiple threads on the same cache line, this can significantly slow down program performance.
When threads share a cache line, frequent reads/writes to their cache-line object cause cache misses, which cause escalating delays as cache distance increases.

Cache thrash tries to create a scenario that leads to false sharing, if the underlying memory allocator is allocating dynamic memory to multiple threads on the same cache lines.
Ideally, a memory allocator should distance the dynamic memory region of one thread from another.
Having multiple threads allocate small objects simultaneously can cause a memory allocator to allocate objects on the same cache line, if it is not distancing the memory among different threads.

\VRef[Figure]{fig:benchThrashFig} shows the pseudo code for the cache-thrash micro-benchmark.
First, it creates K worker threads.
Each worker thread allocates an object and intensively reads/writes it M times to possibly invalidate cache lines that may interfere with other threads sharing the same cache line.
Each thread repeats this N times.
The main thread measures the total time taken for all worker threads to complete.
Worker threads sharing cache lines with each other will take longer.
A sketch of the worker loop is given after the knob list below.

…
The adjustment knobs for cache access scenarios are:
\begin{description}[itemsep=0pt,parsep=0pt]
\item[thread:]
number of threads (K).
\item[iterations:]
iterations of cache benchmark (N).
\item[cacheRW:]
repetitions of reads/writes to object (M).
\item[size:]
object size.
\end{description}
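The worker loop can be sketched in C as follows; the constants and the @volatile@ byte updates, which keep the reads/writes from being optimized away, are illustrative assumptions.

\begin{lstlisting}[language=C]
#include <stdlib.h>
#include <pthread.h>

enum { K = 4, N = 1000, M = 100000, SIZE = 8 };	// threads, iterations, read/writes, object size

// Each worker repeatedly allocates a small object and hammers it with
// reads/writes.  If the allocator placed two workers' objects on the
// same cache line, each write invalidates the line for the other worker.
static void * thrash( void * arg ) {
	(void)arg;
	for ( int i = 0; i < N; i += 1 ) {
		volatile char * obj = malloc( SIZE );
		for ( int j = 0; j < M; j += 1 )
			obj[j % SIZE] += 1;	// intensive reads/writes of the object
		free( (void *)obj );
	}
	return NULL;
}

int main() {
	pthread_t workers[K];
	for ( int t = 0; t < K; t += 1 ) pthread_create( &workers[t], NULL, thrash, NULL );
	for ( int t = 0; t < K; t += 1 ) pthread_join( workers[t], NULL );
}
\end{lstlisting}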
\subsection{Cache Scratch}

The cache-scratch micro-benchmark measures allocator-induced passive false-sharing as illustrated in \VRef{s:AllocatorInducedPassiveFalseSharing}.
As for cache thrash, if memory is allocated for multiple threads on the same cache line, this can significantly slow down program performance.
In this scenario, the false sharing is caused by the memory allocator although it is started by the program sharing an object.

% An allocator can unintentionally induce false sharing depending upon its management of the freed objects.
% If thread Thread$_1$ allocates multiple objects together, they may be allocated on the same cache line by the memory allocator.
% If Thread$_1$ passes these objects to thread Thread$_2$, then both threads may share the same cache line but this scenario is not induced by the allocator;
% instead, the program induced this situation.
% Now if Thread$_2$ frees this object and then allocates an object of the same size, the allocator may return the same object, which is on a cache line shared with thread Thread$_1$.

Cache scratch tries to create a scenario that leads to false sharing and should make the memory allocator preserve the program-induced false sharing, if it does not return a freed object to its owner thread and, instead, re-uses it instantly.
An allocator using object ownership, as described in section \VRef{s:Ownership}, is less susceptible to allocator-induced passive false-sharing.
If the object is returned to the thread who owns it, then the thread that gets a new object is less likely to be on the same cache line.

\VRef[Figure]{fig:benchScratchFig} shows the pseudo code for the cache-scratch micro-benchmark.
First, it allocates K dynamic objects together, one for each of the K worker threads, possibly causing the memory allocator to allocate these objects on the same cache line.
Then it creates K worker threads and passes an object from the K allocated objects to each of the K threads.
Each worker thread frees the object passed by the main thread.
…
\begin{lstlisting}
Main Thread
	malloc N objects for each worker thread
	create worker threads and pass N objects to each worker
	...
…
The main thread measures the total time taken for all the workers to complete.
A sketch of this handoff pattern is given after the knob list below.

Similar to the cache-thrash benchmark in section \VRef{sec:benchThrashSec}, different cache access scenarios can be created using the following command-line arguments.
\begin{description}[itemsep=0pt,parsep=0pt]
\item[threads:]
number of threads (K).
\item[iterations:]
iterations of cache benchmark (N).
\item[cacheRW:]
repetitions of reads/writes to object (M).
\item[size:]
object size.
\end{description}
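The handoff that sets up passive false sharing can be sketched in C as follows; the constants and the immediate free-then-reallocate in each worker are illustrative assumptions.

\begin{lstlisting}[language=C]
#include <stdlib.h>
#include <pthread.h>

enum { K = 4, N = 1000, M = 100000, SIZE = 8 };	// threads, iterations, read/writes, object size

// A worker frees the object handed to it by the main thread, then allocates
// a same-sized replacement.  An allocator without ownership may return the
// just-freed storage, preserving the program-induced cache-line sharing.
static void * scratch( void * arg ) {
	free( arg );				// free object allocated by the main thread
	for ( int i = 0; i < N; i += 1 ) {
		volatile char * obj = malloc( SIZE );	// may reuse the freed storage
		for ( int j = 0; j < M; j += 1 )
			obj[j % SIZE] += 1;	// intensive reads/writes of the object
		free( (void *)obj );
	}
	return NULL;
}

int main() {
	void * objs[K];				// allocated together: may share cache lines
	for ( int t = 0; t < K; t += 1 ) objs[t] = malloc( SIZE );
	pthread_t workers[K];
	for ( int t = 0; t < K; t += 1 ) pthread_create( &workers[t], NULL, scratch, objs[t] );
	for ( int t = 0; t < K; t += 1 ) pthread_join( workers[t], NULL );
}
\end{lstlisting}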