Changeset cd1a5e8
- Timestamp:
- Apr 26, 2022, 8:31:46 AM (19 months ago)
- Branches:
- ADT, ast-experimental, master, pthread-emulation, qualifiedEnum
- Children:
- c9136d9
- Parents:
- c73213b
- File:
- 1 edited
doc/theses/mubeen_zulfiqar_MMath/benchmarks.tex
\chapter{Benchmarks}
\label{s:Benchmarks}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

…

\subsection{Churn Benchmark}
\label{s:ChurnBenchmark}

The churn benchmark measures the runtime speed of an allocator in a multi-threaded scenario, where each thread extensively allocates and frees dynamic memory.
Only @malloc@ and @free@ are used, to eliminate any extra cost, such as the @memcpy@ in @calloc@ or @realloc@.
Churn simulates a memory-intensive program that can be tuned to create different scenarios.

\VRef[Figure]{fig:ChurnBenchFig} shows the pseudo code for the churn micro-benchmark.
This benchmark creates a buffer with M spots and starts K threads.
Each thread picks a random spot in the buffer, frees the object currently at that spot, and allocates a new object for that spot.
Each thread repeats this cycle N times.
The main thread measures the total time taken for the whole benchmark, and that time is used to evaluate the memory allocator's performance.

\begin{figure}
\centering
\begin{lstlisting}
Main Thread
	create worker threads
	note time T1
	...
	note time T2
	churn_speed = (T2 - T1)
Worker Thread
	initialize variables
	...
	for ( N )
		R = random spot in array
		free R
		allocate new object at R
\end{lstlisting}
%\includegraphics[width=1\textwidth]{figures/bench-churn.eps}
\caption{Churn Benchmark}
\label{fig:ChurnBenchFig}
\end{figure}

The adjustment knobs for churn are:
\begin{description}[itemsep=0pt,parsep=0pt]
\item[thread:]
number of threads (K).
\item[spots:]
number of spots for churn (M).
\item[obj:]
number of objects per thread (N).
\item[max:]
maximum object size.
…
object size increment.
\item[distro:]
object size distribution.
\end{description}
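To make the worker cycle concrete, the following C sketch shows one possible realization of the churn loop; the per-spot locks, the fixed object size, and the names @spots@ and @worker@ are illustrative assumptions, not the benchmark's actual source.
\begin{lstlisting}
// Sketch of one churn worker (illustrative, not the benchmark source).
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

enum { M = 1024, N = 1000000 };		// spots, cycles per thread
static void * spots[M];				// shared buffer of objects
static pthread_mutex_t locks[M];	// assumed per-spot locks, initialized by the main thread

static void * worker( void * arg ) {
	unsigned int seed = (uintptr_t)arg;		// per-thread RNG state
	for ( int i = 0; i < N; i += 1 ) {
		int r = rand_r( &seed ) % M;		// pick a random spot
		pthread_mutex_lock( &locks[r] );
		free( spots[r] );					// free object currently at the spot (free(NULL) is a no-op)
		spots[r] = malloc( 32 );			// allocate a replacement object
		pthread_mutex_unlock( &locks[r] );
	}
	return NULL;
}
\end{lstlisting}
The main thread would record T1 before creating the K workers and T2 after joining them, so only @malloc@ and @free@ dominate the measured interval.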
\subsection{Cache Thrash}
\label{sec:benchThrashSec}

The cache-thrash micro-benchmark measures allocator-induced active false-sharing, as illustrated in \VRef{s:AllocatorInducedActiveFalseSharing}.
If memory for multiple threads is allocated on the same cache line, program performance can slow significantly.
When threads share a cache line, frequent reads/writes to their cache-line object cause cache misses, which cause escalating delays as cache distance increases.

Cache thrash tries to create a scenario that leads to false sharing, if the underlying memory allocator places dynamic memory for multiple threads on the same cache lines.
Ideally, a memory allocator should distance the dynamic-memory region of one thread from another.
Having multiple threads allocate small objects simultaneously can cause a memory allocator to place objects on the same cache line, if it is not distancing the memory among different threads.

\VRef[Figure]{fig:benchThrashFig} shows the pseudo code for the cache-thrash micro-benchmark.
First, it creates K worker threads.
Each worker thread allocates an object and intensively reads/writes it M times, to possibly invalidate cache lines that may interfere with other threads sharing the same cache line.
Each thread repeats this N times.
The main thread measures the total time taken for all worker threads to complete.
Worker threads sharing cache lines with each other take longer.

\begin{figure}
\centering
\input{AllocInducedActiveFalseSharing}
\medskip
\begin{lstlisting}
Main Thread
	create worker threads
	...
	signal workers to allocate
	...
	signal workers to free
	...
	print addresses from each $thread$
Worker Thread$\(_1\)$
	allocate, write, read, free
	warmup memory in chunks of 16 bytes
	...
	malloc N objects
	...
	free objects
	return object address to Main Thread
Worker Thread$\(_2\)$
	// same as Worker Thread$\(_1\)$
\end{lstlisting}
%\input{MemoryOverhead}
%\includegraphics[width=1\textwidth]{figures/bench-cache-thrash.eps}
\caption{Allocator-Induced Active False-Sharing Benchmark}
\label{fig:benchThrashFig}
\end{figure}

The adjustment knobs for cache access scenarios are:
\begin{description}[itemsep=0pt,parsep=0pt]
\item[thread:]
number of threads (K).
\item[iterations:]
iterations of cache benchmark (N).
\item[cacheRW:]
repetitions of reads/writes to object (M).
\item[size:]
object size.
\end{description}
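A minimal C sketch of the thrashing step follows, assuming @N@, @M@, and @size@ are set from the knobs above; the single-byte @volatile@ write is an illustrative way to force repeated cache-line traffic that the compiler cannot optimize away.
\begin{lstlisting}
// Sketch of a cache-thrash worker (illustrative, not the benchmark source).
#include <stdlib.h>

enum { N = 1000, M = 100000, size = 16 };	// illustrative knob settings

static void * worker( void * arg ) {
	for ( int i = 0; i < N; i += 1 ) {			// N benchmark iterations
		volatile char * obj = malloc( size );	// allocator may co-locate objects of different threads
		for ( int j = 0; j < M; j += 1 )		// M read/write repetitions
			obj[0] += 1;						// each write invalidates the line in other cores
		free( (void *)obj );
	}
	return NULL;
}
\end{lstlisting}
If the allocator places each thread's @obj@ on its own cache line, the inner loop runs out of private cache; if two workers' objects share a line, every write forces coherence traffic between cores.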
\subsection{Cache Scratch}
\label{s:CacheScratch}

The cache-scratch micro-benchmark measures allocator-induced passive false-sharing, as illustrated in \VRef{s:AllocatorInducedPassiveFalseSharing}.
As for cache thrash, if memory for multiple threads is allocated on the same cache line, program performance can slow significantly.
In this scenario, the false sharing is caused by the memory allocator, although it is started by the program sharing an object.

% An allocator can unintentionally induce false sharing depending upon its management of the freed objects.
% If thread Thread$_1$ allocates multiple objects together, they may be allocated on the same cache line by the memory allocator.
% If Thread$_1$ passes these objects to thread Thread$_2$, then both threads may share the same cache line but this scenario is not induced by the allocator;
% instead, the program induced this situation.
% Now if Thread$_2$ frees this object and then allocates an object of the same size, the allocator may return the same object, which is on a cache line shared with thread Thread$_1$.

Cache scratch creates a scenario that leads to program-induced false sharing, which the memory allocator preserves if it does not return a freed object to its owner thread and, instead, reuses it instantly.
An allocator using object ownership, as described in \VRef{s:Ownership}, is less susceptible to allocator-induced passive false-sharing.
If a freed object is returned to the thread that owns it, then a thread allocating a new object is less likely to receive one on the same cache line.

\VRef[Figure]{fig:benchScratchFig} shows the pseudo code for the cache-scratch micro-benchmark.
First, the main thread allocates K dynamic objects together, one for each of the K worker threads, possibly causing the memory allocator to place these objects on the same cache line.
Then it creates K worker threads and passes one of the K allocated objects to each thread.
Each worker thread frees the object passed by the main thread.
Then, it allocates an object and reads/writes it repetitively M times, possibly causing frequent cache invalidations.
Each worker repeats this N times.

\begin{figure}
\centering
\input{AllocInducedPassiveFalseSharing}
\medskip
\begin{lstlisting}
Main Thread
	malloc N objects $for$ each worker $thread$
	create worker threads and pass N objects to each worker
	...
	signal workers to allocate
	...
	signal workers to free
	...
	print addresses from each $thread$
Worker Thread$\(_1\)$
	allocate, write, read, free
	warmup memory in chunks of 16 bytes
	...
	for ( N )
		free an object passed by Main Thread
		malloc new object
	...
	free objects
	return new object addresses to Main Thread
Worker Thread$\(_2\)$
	// same as Worker Thread$\(_1\)$
\end{lstlisting}
%\includegraphics[width=1\textwidth]{figures/bench-cache-scratch.eps}
\caption{Program-Induced Passive False-Sharing Benchmark}
\label{fig:benchScratchFig}
\end{figure}

When a thread allocates a new object immediately after freeing the object passed by the main thread, an allocator that does not return the freed object to its owner (the main thread) is likely to hand back that same object, which lies on a cache line shared with the other workers.
Intensive reads/writes on this shared cache line by multiple threads then slow down the workers due to high cache invalidations and misses.
The main thread measures the total time taken for all the workers to complete.
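The following C sketch illustrates the free-then-allocate step, assuming a hypothetical @handoff@ array filled by the main thread and illustrative knob settings; without ownership, @malloc@ may return the just-freed object, preserving the false sharing.
\begin{lstlisting}
// Sketch of a cache-scratch worker (illustrative, not the benchmark source).
#include <stdint.h>
#include <stdlib.h>

enum { K = 4, N = 1000, M = 100000, size = 16 };	// illustrative knob settings
static void * handoff[K];		// K objects allocated together by the main thread

static void * worker( void * arg ) {
	void * obj = handoff[(intptr_t)arg];	// object passed by the main thread
	for ( int i = 0; i < N; i += 1 ) {
		free( obj );						// without ownership, the allocator may reuse this immediately
		obj = malloc( size );				// possibly the just-freed object, still on the shared line
		for ( int j = 0; j < M; j += 1 )
			((volatile char *)obj)[0] += 1;	// repeated writes thrash the shared cache line
	}
	free( obj );
	return NULL;
}
\end{lstlisting}
With object ownership, the freed @handoff@ object drains back to the main thread's free list, so each worker's subsequent allocations come from its own, separately placed storage.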
Similar to the cache-thrash benchmark in \VRef{sec:benchThrashSec}, different cache access scenarios can be created using the following command-line arguments.
\begin{description}[itemsep=0pt,parsep=0pt]
\item[threads:]
number of threads (K).
\item[iterations:]
iterations of cache benchmark (N).
\item[cacheRW:]
repetitions of reads/writes to object (M).
\item[size:]
object size.
\end{description}


\subsection{Speed Micro-Benchmark}
\label{s:SpeedMicroBenchmark}

The speed benchmark measures the runtime speed of individual and sequences of memory allocation routines:

…
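As a sketch of how timing an individual routine might look, the following C fragment measures a @malloc@/@free@ pair; the repetition count, object size, and the @volatile@ write (to keep the compiler from eliding the pair) are illustrative assumptions.
\begin{lstlisting}
// Sketch: timing a malloc/free pair (illustrative, not the benchmark source).
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main( void ) {
	enum { N = 10000000 };					// illustrative repetition count
	struct timespec t1, t2;
	clock_gettime( CLOCK_MONOTONIC, &t1 );
	for ( int i = 0; i < N; i += 1 ) {
		void * p = malloc( 64 );			// illustrative object size
		*(volatile char *)p = 0;			// touch the object so the pair is not optimized away
		free( p );
	}
	clock_gettime( CLOCK_MONOTONIC, &t2 );
	double ns = (t2.tv_sec - t1.tv_sec) * 1e9 + (t2.tv_nsec - t1.tv_nsec);
	printf( "%.1f ns per malloc/free pair\n", ns / N );
}
\end{lstlisting}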
\subsection{Memory Micro-Benchmark}
\label{s:MemoryMicroBenchmark}

The memory micro-benchmark measures the memory overhead of an allocator.
It allocates a number of dynamic objects and reads @/proc/self/maps@ to get the total memory requested by the allocator from the OS.
It calculates the memory overhead by computing the difference between the memory the allocator requests from the OS and the memory that the program allocates.
This micro-benchmark is like the Larson benchmark and stresses the ability of an allocator to deal with object sharing.

\VRef[Figure]{fig:MemoryBenchFig} shows the pseudo code for the memory micro-benchmark.
It creates a producer-consumer scenario with K producer threads, where each producer has M consumer threads.
A producer has a separate buffer for each consumer and allocates N objects of random sizes, following a settable distribution, for each consumer.
A consumer frees these objects.
Program memory usage is recorded after every memory operation throughout the runtime.
This data is used to visualize the memory usage and consumption of the program.

\begin{figure}
\centering
\begin{lstlisting}
Main Thread
	print memory snapshot
	create producer threads
Producer Thread (K)
	set free start
	create consumer threads
	for ( N )
		allocate memory
		print memory snapshot
Consumer Thread (M)
	wait while ( allocations < free start )
	for ( N )
		free memory
		print memory snapshot
\end{lstlisting}
%\includegraphics[width=1\textwidth]{figures/bench-memory.eps}
\caption{Memory Footprint Micro-Benchmark}
\label{fig:MemoryBenchFig}
\end{figure}

The global adjustment knobs for this micro-benchmark are:
\begin{description}[itemsep=0pt,parsep=0pt]
\item[producer (K):]
sets the number of producer threads.
\item[consumer (M):]
sets the number of consumer threads for each producer.
\item[round:]
sets the production and consumption round size.
\end{description}

The adjustment knobs for object allocation are:
\begin{description}[itemsep=0pt,parsep=0pt]
\item[max:]
maximum object size.
…
object size increment.
\item[distro:]
object size distribution.
\item[objects (N):]
number of objects per thread.
\end{description}
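A C sketch of the measurement step follows; it sums the extents of all mappings listed in @/proc/self/maps@, which over-approximates allocator-requested memory by including code and stack mappings, so a real measurement presumably subtracts a baseline snapshot or filters heap/anonymous regions.
\begin{lstlisting}
// Sketch: total bytes mapped into the process, read from /proc/self/maps.
#include <stdio.h>

size_t mapped_bytes( void ) {
	FILE * maps = fopen( "/proc/self/maps", "r" );
	char line[4096];
	size_t total = 0;
	unsigned long start, end;
	while ( fgets( line, sizeof(line), maps ) ) {	// one mapping per line: "start-end perms ..."
		if ( sscanf( line, "%lx-%lx", &start, &end ) == 2 )
			total += end - start;					// extent of this mapping
	}
	fclose( maps );
	return total;	// overhead = total - bytes currently allocated by the program
}
\end{lstlisting}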