Timestamp:
Jun 2, 2022, 3:11:21 PM (23 months ago)
Author:
caparsons <caparson@…>
Branches:
ADT, ast-experimental, master, pthread-emulation, qualifiedEnum
Children:
ced5e2a
Parents:
015925a (diff), fc134a48 (diff)
Note: this is a merge changeset, the changes displayed below correspond to the merge itself.
Use the (diff) links above to see all the changes relative to each parent.
Message:

Merge branch 'master' of plg.uwaterloo.ca:software/cfa/cfa-cc

File:
1 edited

  • doc/theses/mubeen_zulfiqar_MMath/benchmarks.tex

    r015925a re5d9274  
    1212\item[Benchmarks]
    1313are a suite of application programs (SPEC CPU/WEB) that are exercised in a common way (inputs) to find differences among underlying software implementations associated with an application (compiler, memory allocator, web server, \etc).
    14 The applications are suppose to represent common execution patterns that need to perform well with respect to an underlying software implementation.
     14The applications are supposed to represent common execution patterns that need to perform well with respect to an underlying software implementation.
    1515Benchmarks are often criticized for having overlapping patterns, insufficient patterns, or extraneous code that masks patterns.
    1616\item[Micro-Benchmarks]
     
    2626
    2727This thesis designs and examines a new set of micro-benchmarks for memory allocators that test a variety of allocation patterns, each with multiple tuning parameters.
    28 The aim of the micro-benchmark suite is to create a set of programs that can evaluate a memory allocator based on the key performance matrices such as speed, memory overhead, and cache performance.
     28The aim of the micro-benchmark suite is to create a set of programs that can evaluate a memory allocator based on the key performance metrics such as speed, memory overhead, and cache performance.
    2929% These programs can be taken as a standard to benchmark an allocator's basic goals.
    3030These programs give details of an allocator's memory overhead and speed under certain allocation patterns.
    31 The allocation patterns are configurable (adjustment knobs) to observe an allocator's performance across a spectrum of events for a desired allocation pattern, which is seldom possible with benchmark programs.
     31The allocation patterns are configurable (adjustment knobs) to observe an allocator's performance across a spectrum of allocation patterns, which is seldom possible with benchmark programs.
    3232Each micro-benchmark program has multiple control knobs specified by command-line arguments.
    3333
    34 The new micro-benchmark suite measures performance by allocating dynamic objects and measuring specific matrices.
     34The new micro-benchmark suite measures performance by allocating dynamic objects and measuring specific metrics.
    3535An allocator's speed is benchmarked in different ways, as are issues like false sharing.
    3636
     
    4040Modern memory allocators, such as llheap, must handle multi-threaded programs at the KT and UT level.
    4141The following multi-threaded micro-benchmarks are presented to give a sense of prior work~\cite{Berger00} at the KT level.
    42 None of the prior work address multi-threading at the UT level.
     42None of the prior work addresses multi-threading at the UT level.
    4343
    4444
     
    4747This benchmark stresses the ability of the allocator to handle different threads allocating and deallocating independently.
    4848There is no interaction among threads, \ie no object sharing.
    49 Each thread repeatedly allocate 100,000 \emph{8-byte} objects then deallocates them in the order they were allocated.
    50 Runtime of the benchmark evaluates its efficiency.
     49Each thread repeatedly allocates 100,000 \emph{8-byte} objects then deallocates them in the order they were allocated.
     50The execution time of the benchmark evaluates its efficiency.
    5151
    5252
     
    6363Before the thread terminates, it passes its array of 10,000 objects to a new child thread to continue the process.
    6464The number of thread generations varies depending on the thread speed.
    65 It calculates memory operations per second as an indicator of memory allocator's performance.
     65It calculates memory operations per second as an indicator of the memory allocator's performance.
    6666
    6767
     
    7575\label{s:ChurnBenchmark}
    7676
    77 The churn benchmark measures the runtime speed of an allocator in a multi-threaded scenerio, where each thread extensively allocates and frees dynamic memory.
     77The churn benchmark measures the runtime speed of an allocator in a multi-threaded scenario, where each thread extensively allocates and frees dynamic memory.
    7878Only @malloc@ and @free@ are used to eliminate any extra cost, such as @memcpy@ in @calloc@ or @realloc@.
    79 Churn simulates a memory intensive program that can be tuned to create different scenarios.
     79Churn simulates a memory-intensive program and can be tuned to create different scenarios.
    8080
    8181\VRef[Figure]{fig:ChurnBenchFig} shows the pseudo code for the churn micro-benchmark.
     
    133133When threads share a cache line, frequent reads/writes to their cache-line object cause cache misses, which cause escalating delays as cache distance increases.
    134134
    135 Cache thrash tries to create a scenerio that leads to false sharing, if the underlying memory allocator is allocating dynamic memory to multiple threads on the same cache lines.
     135Cache thrash tries to create a scenario that leads to false sharing if the underlying memory allocator allocates dynamic memory for multiple threads on the same cache lines.
    136136Ideally, a memory allocator should distance the dynamic memory region of one thread from another.
    137137Having multiple threads allocating small objects simultaneously can cause a memory allocator to allocate objects on the same cache line, if it is not distancing the memory among different threads.
     
    141141Each worker thread allocates an object and intensively reads/writes it M times to possibly invalidate cache lines that may interfere with other threads sharing the same cache line.
    142142Each thread repeats this for N times.
    143 The main thread measures the total time taken to for all worker threads to complete.
    144 Worker threads sharing cache lines with each other will take longer.
     143The main thread measures the total time taken for all worker threads to complete.
     144Worker threads sharing cache lines with each other are expected to take longer.
    145145
    146146\begin{figure}
     
    156156        signal workers to free
    157157        ...
    158         print addresses from each $thread$
    159158Worker Thread$\(_1\)$
    160         allocate, write, read, free
    161         warmup memory in chunkc of 16 bytes
    162         ...
    163         malloc N objects
    164         ...
    165         free objects
    166         return object address to Main Thread
     159        warm up memory in chunks of 16 bytes
     160        ...
     161        For N
     162                malloc an object
     163                read/write the object M times
     164                free the object
     165        ...
    167166Worker Thread$\(_2\)$
    168167        // same as Worker Thread$\(_1\)$
     
    191190
    192191The cache-scratch micro-benchmark measures allocator-induced passive false-sharing as illustrated in \VRef{s:AllocatorInducedPassiveFalseSharing}.
    193 As for cache thrash, if memory is allocated for multiple threads on the same cache line, this can significantly slow down program performance.
     192As with cache thrash, if memory is allocated for multiple threads on the same cache line, this can significantly slow down program performance.
    194193In this scenario, the false sharing is caused by the memory allocator, although it is initiated by the program sharing an object.
    195194
     
    202201Cache scratch tries to create a scenario that leads to false sharing and makes the memory allocator preserve the program-induced false sharing if it does not return a freed object to its owner thread and instead reuses it immediately.
    203202An allocator using object ownership, as described in section \VRef{s:Ownership}, is less susceptible to allocator-induced passive false-sharing.
    204 If the object is returned to the thread who owns it, then the thread that gets a new object is less likely to be on the same cache line.
     203If the object is returned to the thread that owns it, then the new object that the thread gets is less likely to be on the same cache line.
    205204
    206205\VRef[Figure]{fig:benchScratchFig} shows the pseudo code for the cache-scratch micro-benchmark.
     
    224223        signal workers to free
    225224        ...
    226         print addresses from each $thread$
    227225Worker Thread$\(_1\)$
    228         allocate, write, read, free
    229         warmup memory in chunkc of 16 bytes
    230         ...
    231         for ( N )
    232                 free an object passed by Main Thread
     226        warm up memory in chunks of 16 bytes
     227        ...
     228        free the object passed by the Main Thread
     229        For N
    233230                malloc new object
    234         ...
    235         free objects
    236         return new object addresses to Main Thread
     231                read/write the object M times
     232                free the object
     233        ...
    237234Worker Thread$\(_2\)$
    238235        // same as Worker Thread$\(_1\)$
     
    248245
    249246Similar to the cache-thrash benchmark in section \VRef{sec:benchThrashSec}, different cache-access scenarios can be created using the following command-line arguments.
    250 \begin{description}[itemsep=0pt,parsep=0pt]
     247\begin{description}[topsep=0pt,itemsep=0pt,parsep=0pt]
    251248\item[threads:]
    252249number of threads (K).
     
    262259\subsection{Speed Micro-Benchmark}
    263260\label{s:SpeedMicroBenchmark}
     261\vspace*{-4pt}
    264262
    265263The speed benchmark measures the runtime speed of individual and sequences of memory allocation routines:
    266 \begin{enumerate}[itemsep=0pt,parsep=0pt]
     264\begin{enumerate}[topsep=-5pt,itemsep=0pt,parsep=0pt]
    267265\item malloc
    268266\item realloc
     
    332330\VRef[Figure]{fig:MemoryBenchFig} shows the pseudo code for the memory micro-benchmark.
    333331It creates a producer-consumer scenario with K producer threads, where each producer has M consumer threads.
    334 A producer has a separate buffer for each consumer and allocates N objects of random sizes following a settable distribution for each consumer.
     332A producer has a separate buffer for each consumer and allocates N objects of random sizes following a configurable distribution for each consumer.
    335333A consumer frees these objects.
    336334Program memory usage is recorded after every memory operation throughout the run.