Index: doc/theses/mubeen_zulfiqar_MMath/benchmarks.tex
===================================================================
--- doc/theses/mubeen_zulfiqar_MMath/benchmarks.tex	(revision 16d397ad093e8d4fb6f1f865130ffe79f992d4af)
+++ doc/theses/mubeen_zulfiqar_MMath/benchmarks.tex	(revision a6e8f64afb15bd1b90d957d0d96b2c8ba7b16b46)
@@ -7,67 +7,132 @@
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 
-The aim of micro benchmark suite is to create a set of programs that can evaluate a memory allocator based on the
-performance matrices described in (FIX ME: local cite). These programs can be taken as a standard to benchmark an
-allocator's basic goals. These programs give details of an allocator's memory overhead and speed under a certain
-allocation pattern. The speed of the allocator is benchmarked in different ways. Similarly, false sharing happening in
-an allocator is also measured in multiple ways. These benchmarks evalute the allocator under a certain allocation
-pattern which is configurable and can be changed using a few knobs to benchmark observe an allocator's performance
-under a desired allocation pattern.
-
-Micro Benchmark Suite benchmarks an allocator's performance by allocating dynamic objects and, then, measuring specifc
-matrices. The benchmark suite evaluates an allocator with a certain allocation pattern. Bnechmarks have different knobs
-that can be used to change allocation pattern and evaluate an allocator under desired conditions. These can be set by
-giving commandline arguments to the benchmark on execution.
-
-\section{Current Benchmarks} There are multiple benchmarks that are built individually and evaluate different aspects of
- a memory allocator. But, there is not a set of benchamrks that can be used to evaluate multiple aspects of memory
- allocators.
-
-\subsection{threadtest}(FIX ME: cite benchmark and hoard) Each thread repeatedly allocates and then deallocates 100,000
- objects. Runtime of the benchmark evaluates its efficiency.
-
-\subsection{shbench}(FIX ME: cite benchmark and hoard) Each thread allocates and randomly frees a number of random-sized
- objects. It is a stress test that also uses runtime to determine efficiency of the allocator.
-
-\subsection{larson}(FIX ME: cite benchmark and hoard) Larson simulates a server environment. Multiple threads are
- created where each thread allocator and free a number of objects within a size range. Some objects are passed from
- threads to the child threads to free. It caluculates memory operations per second as an indicator of memory
- allocator's performance.
-
-\section{Memory Benchmark} Memory benchmark measures memory overhead of an allocator. It allocates a number of dynamic
- objects. Then, by reading /self/proc/maps, gets the total memory that the allocator has reuested from the OS. It
- calculates the memory head by taking the difference between the memory the allocator has requested from the OS and the
- memory that program has allocated.
-
-\begin{figure}
-\centering
-\includegraphics[width=1\textwidth]{figures/bench-memory.eps}
-\caption{Benchmark Memory Overhead}
-\label{fig:benchMemoryFig}
-\end{figure}
-
-Figure \ref{fig:benchMemoryFig} gives a flow of the memory benchmark. It creates a producer-consumer scenerio with K producers
- and each producer has M consumers. Producer has a separate buffer for each consumer. Producer allocates N objects of
- random sizes following the given distrubution for each consumer. Consumer frees those objects. After every memory
- operation, program memory usage is recorded throughout the runtime. This data then can be used to visualize the memory
- usage and consumption of the prigram.
-
-Different knobs can be adjusted to set certain thread model.\\
--threadA :  sets number of alloc threads (producers) for mem benchmark\\
--consumeS:  sets production and conumption round size\\
--threadF :  sets number of free threads (consumers) for each producer for mem benchmark
-
-Object allocation size can be changed using the knobs:\\
--maxS    :  sets max object size\\
--minS    :  sets min object size\\
--stepS   :  sets object size increment\\
--distroS :  sets object size distribution\\
--objN    :  sets number of objects per thread\\
-
-\section{Speed Benchmark} Speed benchmark measures the runtime speed of an allocator (FIX ME: cite allocator routines).
- Speed benchmark measures runtime speed of individual memory allocation routines. It also considers different
- allocation chains to measures the performance of the allocator by combining multiple allocation routines in a chain.
- It uses following chains and measures allocator runtime speed against them:
-\begin{itemize}
+There are two basic approaches for evaluating computer software: benchmarks and micro-benchmarks.
+\begin{description}
+\item[Benchmarks]
+are a suite of application programs (SPEC CPU/WEB) that are exercised in a common way (inputs) to find differences among underlying software implementations associated with an application (compiler, memory allocator, web server, \etc).
+The applications are supposed to represent common execution patterns that need to perform well with respect to an underlying software implementation.
+Benchmarks are often criticized for having overlapping patterns, insufficient patterns, or extraneous code that masks patterns.
+\item[Micro-Benchmarks]
+attempt to extract the common execution patterns associated with an application and run the pattern independently.
+This approach removes any masking from extraneous application code, allows the execution pattern to be very precise, and provides an opportunity for the execution pattern to have multiple independent tuning adjustments (knobs).
+Micro-benchmarks are often criticized for inadequately representing real-world applications.
+\end{description}
+
+While some crucial software components have standard benchmarks, no standard benchmark exists for testing and comparing memory allocators.
+In the past, an assortment of applications have been used for benchmarking allocators~\cite{Detlefs93,Berger00,Berger01,berger02reconsidering}: P2C, GS, Espresso/Espresso-2, CFRAC/CFRAC-2, GMake, GCC, Perl/Perl-2, Gawk/Gawk-2, XPDF/XPDF-2, ROBOOP, Lindsay.
+As well, an assortment of micro-benchmarks has been used for benchmarking allocators~\cite{larson99memory,Berger00,streamflow}: threadtest, shbench, Larson, consume, false sharing.
+Many of these applications and micro-benchmarks are old and may not reflect current application allocation patterns.
+
+This thesis designs and examines a new set of micro-benchmarks for memory allocators that test a variety of allocation patterns, each with multiple tuning parameters.
+The aim of the micro-benchmark suite is to create a set of programs that can evaluate a memory allocator based on the performance metrics described in (FIX ME: local cite).
+% These programs can be taken as a standard to benchmark an allocator's basic goals.
+These programs give details of an allocator's memory overhead and speed under certain allocation patterns.
+The allocation patterns are configurable (adjustment knobs) to observe an allocator's performance across a spectrum of events for a desired allocation pattern, which is seldom possible with benchmark programs.
+The micro-benchmark programs control knobs by command-line arguments.
+
+The new micro-benchmark suite measures performance by allocating dynamic objects and measuring specific metrics.
+An allocator's speed is benchmarked in different ways, as are issues like false sharing.
+
+
+\section{Prior Multi-Threaded Micro-Benchmarks}
+
+Modern memory allocators, such as llheap, must handle multi-threaded programs at the KT level.
+The following traditional multi-threaded micro-benchmarks are presented to give a sense of prior work~\cite{Berger00}.
+
+
+\subsection{threadtest}
+
+This benchmark stresses the ability of the allocator to handle different threads allocating and deallocating independently.
+There is no interaction among threads, \ie no object sharing.
+Each thread repeatedly allocates 100,000 8-byte objects and then deallocates them in the order they were allocated.
+Runtime of the benchmark evaluates its efficiency.
+
+
+\subsection{shbench}
+
+Each thread randomly allocates and frees a number of random-sized objects.
+It is a stress test that also uses runtime to determine efficiency of the allocator.
+
+
+\subsection{Larson}
+
+Larson simulates a server environment.
+Multiple threads are created where each thread allocates and frees a number of random-sized objects within a size range.
+Before the thread terminates, it passes its array of 10,000 objects to a new child thread to continue the process.
+The number of thread generations varies depending on the thread speed.
+It calculates memory operations per second as an indicator of the memory allocator's performance.
+
+
+\section{New Multi-Threaded Micro-Benchmarks}
+
+The following new benchmarks were created to assess multi-threaded programs at the KT level.
+
+
+\subsection{Memory Micro-Benchmark}
+
+The memory micro-benchmark measures the memory overhead of an allocator.
+It allocates a number of dynamic objects.
+Then, by reading @/proc/self/maps@, it gets the total memory requested by the allocator from the OS.
+It calculates the memory overhead by computing the difference between the memory the allocator has requested from the OS and the memory that the program has allocated.
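The overhead calculation described above can be sketched in C as follows. This is only an illustrative sketch, not the thesis implementation: the helper names `mapping_bytes`, `total_mapped_bytes`, and `overhead_bytes` are hypothetical, and summing every mapping in the file is an assumption (a real benchmark would likely restrict the sum to the mappings owned by the allocator).

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Bytes covered by one line in /proc/self/maps format, e.g.
 * "00400000-00452000 r-xp ...". Returns 0 if the line does not parse. */
size_t mapping_bytes(const char *line) {
    unsigned long start, end;
    if (sscanf(line, "%lx-%lx", &start, &end) != 2 || end < start) {
        return 0;
    }
    return (size_t)(end - start);
}

/* Sum the sizes of all mappings in an already opened maps file.
 * ASSUMPTION: every mapping is counted; no filtering by owner. */
size_t total_mapped_bytes(FILE *maps) {
    char line[512];
    size_t total = 0;
    int at_line_start = 1;              /* skip continuation chunks of long lines */
    while (fgets(line, sizeof line, maps) != NULL) {
        if (at_line_start) {
            total += mapping_bytes(line);
        }
        at_line_start = (strchr(line, '\n') != NULL);
    }
    return total;
}

/* Overhead as defined in the text: memory obtained from the OS minus
 * the memory the program itself requested. */
long long overhead_bytes(size_t from_os, size_t requested) {
    return (long long)from_os - (long long)requested;
}
```

On Linux, a caller would open `/proc/self/maps` with `fopen`, pass the stream to `total_mapped_bytes`, and subtract its own running total of requested bytes using `overhead_bytes`.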
+
+\VRef[Figure]{fig:MemoryBenchFig} shows the pseudo code for the memory micro-benchmark.
+It creates a producer-consumer scenario with K producer threads and each producer has M consumer threads.
+A producer has a separate buffer for each consumer and allocates N objects of random sizes following the given distribution for each consumer.
+A consumer frees these objects.
+After every memory operation, program memory usage is recorded throughout the runtime.
+This data is then used to visualize the memory usage and consumption of the program.
+
+\begin{figure}
+\centering
+\begin{lstlisting}
+Main Thread
+	print memory snapshot
+	create producer threads
+Producer Thread
+	set free start
+	create consumer threads
+	for ( N )
+		allocate memory
+		print memory snapshot
+Consumer Thread
+	wait while ( allocations < free start )
+	for ( N )
+		free memory
+		print memory snapshot
+\end{lstlisting}
+%\includegraphics[width=1\textwidth]{figures/bench-memory.eps}
+\caption{Memory Overhead Benchmark (evaluate memory footprint)}
+\label{fig:MemoryBenchFig}
+\end{figure}
+
+The adjustment knobs for this micro-benchmark are:
+\begin{description}[itemsep=0pt,parsep=0pt]
+\item[producer:]
+sets the number of producer threads.
+\item[round:]
+sets production and consumption round size.
+\item[consumer:]
+sets the number of consumer threads for each producer.
+\end{description}
+
+The adjustment knobs for the object allocation size are:
+\begin{description}[itemsep=0pt,parsep=0pt]
+\item[max:]
+sets max object size.
+\item[min:]
+sets min object size.
+\item[step:]
+sets object size increment.
+\item[distro:]
+sets object size distribution.
+\item[obj:]
+sets number of objects per thread.
+\end{description}
+
+
+\subsection{Speed Micro-Benchmark}
+
+The speed benchmark measures the runtime speed of individual and sequences of memory allocation routines:
+\begin{enumerate}[itemsep=0pt,parsep=0pt]
 \item malloc
 \item realloc
@@ -82,128 +147,231 @@
 \item calloc-realloc-free
 \item malloc-realloc-free-calloc
-\end{itemize}
-
-\begin{figure}
-\centering
-\includegraphics[width=1\textwidth]{figures/bench-speed.eps}
-\caption{Benchmark Speed}
-\label{fig:benchSpeedFig}
-\end{figure}
-
-As laid out in figure \ref{fig:benchSpeedFig}, each chain is measured separately. Each routine in the chain is called for N objects and then
- those allocated objects are used when call the next routine in the allocation chain. This way we can measure the
- complete latency of memory allocator when multiple routines are chained together e.g. malloc-realloc-free-calloc gives
- us the whole picture of the major allocation routines when combined together in a chain.
-
-For each chain, time taken is recorded which then can be used to visualize performance of a memory allocator against
-each chain.
-
-Following knobs can be adjusted to tune memory usage.\\
--maxS    :  sets max object size\\
--minS    :  sets min object size\\
--stepS   :  sets object size increment\\
--distroS :  sets object size distribution\\
--objN    :  sets number of objects per thread\\
--threadN :  sets number of worker threads\\
-
-\section{Churn Benchmark} Churn benchmark measures the overall runtime speed of an allocator in a multi-threaded
- scenerio where each thread extinsevly allocates and frees dynamic memory.
-
-\begin{figure}
-\centering
-\includegraphics[width=1\textwidth]{figures/bench-churn.eps}
-\caption{Benchmark Churn}
-\label{fig:benchChurnFig}
-\end{figure}
-
-Figure \ref{fig:benchChurnFig} illustrates churn benchmark.
- This benchmark creates a buffer with M spots and starts K threads. Each thread randomly picks a
- spot out of M spots, it frees the object currently at that spot and allocates a new object for that spot. Each thread
- repeats this cycle for N times. Main threads measures the total time taken for the whole benchmark and that time is
- used to evaluate memory allocator's performance.
+\end{enumerate}
+
+\VRef[Figure]{fig:SpeedBenchFig} shows the pseudo code for the speed micro-benchmark.
+Each routine in the chain is called for N objects, and the objects it allocates are then passed to the next routine in the allocation chain.
+This approach measures the complete latency of the memory allocator when multiple routines are chained together, e.g., malloc-realloc-free-calloc gives the whole picture of the major allocation routines combined in one chain.
+For each chain, the time taken is recorded, which can then be used to visualize the performance of a memory allocator against each chain.
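As an illustration of timing one allocation chain, the following C sketch measures a malloc-realloc-free chain over N objects in a single thread. The function name `chain_ns_per_object`, the choice to double each object's size in the realloc step, and the use of `clock_gettime` are assumptions made for this example; the benchmark's actual worker and timing code is not shown in the text.

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

/* Average time in nanoseconds per object for the chain malloc-realloc-free.
 * Returns a negative value if n is 0 or the bookkeeping array cannot be allocated. */
double chain_ns_per_object(size_t n, size_t size) {
    if (n == 0) {
        return -1.0;
    }
    void **objs = malloc(n * sizeof *objs);   /* bookkeeping, outside the timed region */
    if (objs == NULL) {
        return -1.0;
    }
    struct timespec t1, t2;
    clock_gettime(CLOCK_MONOTONIC, &t1);      /* note time T1 */
    for (size_t i = 0; i < n; i++) {          /* routine 1: malloc, N times */
        objs[i] = malloc(size);
    }
    for (size_t i = 0; i < n; i++) {          /* routine 2: realloc the same objects */
        void *p = realloc(objs[i], 2 * size);
        if (p != NULL) {
            objs[i] = p;
        }
    }
    for (size_t i = 0; i < n; i++) {          /* routine 3: free the same objects */
        free(objs[i]);
    }
    clock_gettime(CLOCK_MONOTONIC, &t2);      /* note time T2 */
    free(objs);
    double ns = (double)(t2.tv_sec - t1.tv_sec) * 1e9
              + (double)(t2.tv_nsec - t1.tv_nsec);
    return ns / (double)n;
}
```

A multi-threaded version would run this loop in each worker and divide the elapsed time by the number of workers times N.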
+
+\begin{figure}
+\centering
+\begin{lstlisting}[morekeywords={foreach}]
+Main Thread
+	create worker threads
+	foreach ( allocation chain )
+		note time T1
+		...
+		note time T2
+		chain_speed = (T2 - T1) / (number-of-worker-threads * N)
+Worker Thread
+	initialize variables
+	...
+	foreach ( routine in allocation chain )
+		call routine N times
+\end{lstlisting}
+%\includegraphics[width=1\textwidth]{figures/bench-speed.eps}
+\caption{Speed Benchmark}
+\label{fig:SpeedBenchFig}
+\end{figure}
+
+The adjustment knobs for memory usage are:
+\begin{description}[itemsep=0pt,parsep=0pt]
+\item[max:]
+sets max object size.
+\item[min:]
+sets min object size.
+\item[step:]
+sets object size increment.
+\item[distro:]
+sets object size distribution.
+\item[obj:]
+sets number of objects per thread.
+\item[workers:]
+sets number of worker threads.
+\end{description}
+
+
+\subsection{Churn Benchmark}
+
+The churn benchmark measures the overall runtime speed of an allocator in a multi-threaded scenario where each thread extensively allocates and frees dynamic memory.
+
+\VRef[Figure]{fig:ChurnBenchFig} shows the pseudo code for the churn micro-benchmark.
+This benchmark creates a buffer with M spots and starts K threads.
+Each thread randomly picks one of the M spots, frees the object currently at that spot, and allocates a new object for that spot.
+Each thread repeats this cycle N times.
+The main thread measures the total time taken for the whole benchmark, and that time is used to evaluate the memory allocator's performance.
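The churn cycle of one worker can be sketched in C as below. This is a minimal single-worker sketch under stated assumptions: the name `churn_worker` is hypothetical, the worker is assumed to have exclusive access to its array of spots (how the real benchmark shares or partitions the buffer among threads is not specified here), and a simple linear congruential generator stands in for whatever random source the benchmark uses.

```c
#include <stddef.h>
#include <stdlib.h>

/* Run n churn cycles over an array of m spots.
 * Each cycle picks a random spot, frees its current object, and allocates
 * a new object of the given size there.
 * Returns the number of successful allocations. */
size_t churn_worker(void **spots, size_t m, size_t n, size_t size, unsigned seed) {
    size_t allocated = 0;
    if (m == 0) {
        return 0;
    }
    for (size_t i = 0; i < n; i++) {
        seed = seed * 1103515245u + 12345u;   /* ASSUMPTION: simple LCG as random source */
        size_t r = (seed >> 16) % m;          /* R -> random spot in array */
        free(spots[r]);                       /* free(NULL) is a no-op for an empty spot */
        spots[r] = malloc(size);              /* allocate new object at R */
        if (spots[r] != NULL) {
            allocated++;
        }
    }
    return allocated;
}
```

The caller is expected to zero-initialize the spots (for example with `calloc`) and to free any objects still held in the array after the run.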
+
+\begin{figure}
+\centering
+\begin{lstlisting}
+Main Thread
+	create worker threads
+	note time T1
+	...
+	note time T2
+	churn_speed = (T2 - T1)
+Worker Thread
+	initialize variables
+	...
+	for ( N )
+		R -> random spot in array
+		free R
+		allocate new object at R
+\end{lstlisting}
+%\includegraphics[width=1\textwidth]{figures/bench-churn.eps}
+\caption{Churn Benchmark}
+\label{fig:ChurnBenchFig}
+\end{figure}
 
 Only malloc and free are used to allocate and free an object to eliminate any extra cost such as memcpy in realloc etc.
-Malloc/free allows us to measure latency of memory allocation only without paying any extra cost. Churn simulates a
-memory intensive program that can be tuned to create different scenerios.
-
-Following commandline arguments can be used to tune the benchmark.\\
--threadN :  sets number of threads, K\\
--cSpots  :  sets number of spots for churn, M\\
--objN    :  sets number of objects per thread, N\\
--maxS    :  sets max object size\\
--minS    :  sets min object size\\
--stepS   :  sets object size increment\\
--distroS :  sets object size distribution
-
-\section{Cache Thrash}\label{sec:benchThrashSec} Cache Thrash benchmark measures allocator induced active false sharing
- in an allocator as illustrated in figure \ref{f:AllocatorInducedActiveFalseSharing}.
- If memory allocator allocates memory for multiple threads on
- same cache line, this can slow down the program performance. If both threads, who share one cache line, frequently
- read/write to their object on the cache line concurrently then this will cause cache miss everytime a thread accesse
- the object as the other thread might have written something at their memory location on the same cache line.
-
-\begin{figure}
-\centering
-\includegraphics[width=1\textwidth]{figures/bench-cache-thrash.eps}
-\caption{Benchmark Allocator Induced Active False Sharing}
+Malloc/free allows us to measure latency of memory allocation only without paying any extra cost.
+Churn simulates a memory intensive program that can be tuned to create different scenarios.
+
+The adjustment knobs for churn usage are:
+\begin{description}[itemsep=0pt,parsep=0pt]
+\item[thread:]
+sets number of threads, K.
+\item[spots:]
+sets number of spots for churn, M.
+\item[obj:]
+sets number of objects per thread, N.
+\item[max:]
+sets max object size.
+\item[min:]
+sets min object size.
+\item[step:]
+sets object size increment.
+\item[distro:]
+sets object size distribution.
+\end{description}
+
+
+\section{Cache Thrash}\label{sec:benchThrashSec}
+
+The cache-thrash benchmark measures allocator-induced active false sharing, as illustrated in \VRef[Figure]{f:AllocatorInducedActiveFalseSharing}.
+If a memory allocator allocates memory for multiple threads on the same cache line, program performance can degrade.
+If two threads that share a cache line frequently read and write their own objects on that line concurrently, each access can cause a cache miss, because the other thread may have written to its memory location on the same cache line in the meantime.
+
+Cache thrash tries to create a scenario that should lead to allocator-induced false sharing if the underlying memory allocator is allocating dynamic memory to multiple threads on the same cache lines.
+Ideally, a memory allocator should distance the dynamic memory region of one thread from those of other threads.
+Having multiple threads allocating small objects simultaneously should cause the memory allocator to allocate objects for multiple threads on the same cache line if it is not distancing the memory among different threads.
+
+\VRef[Figure]{fig:benchThrashFig} shows the pseudo code for the cache-thrash micro-benchmark.
+First, it creates K worker threads.
+Each worker thread allocates an object and intensively reads and writes it M times, invalidating cache lines frequently and slowing down other threads that might be sharing the cache line with it.
+Each thread repeats this N times.
+The main thread measures the total time taken for all worker threads to complete.
+Worker threads sharing cache lines with each other will take longer.
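The worker loop described above can be sketched in C as follows. This is only an illustration: the name `thrash_worker` is hypothetical, only the first byte of each object is touched, and the threading and timing done by the main thread are left out. With several such workers running concurrently, a slowdown would be expected only when the allocator places their objects on a common cache line.

```c
#include <stddef.h>
#include <stdlib.h>

/* One cache-thrash worker: n times, allocate an object, read and write it
 * m times, then free it. The accumulated value is returned so the accesses
 * have an observable result. */
unsigned long thrash_worker(size_t n, size_t m, size_t obj_size) {
    unsigned long sum = 0;
    if (obj_size == 0) {
        return 0;                              /* nothing to read or write */
    }
    for (size_t i = 0; i < n; i++) {
        /* volatile keeps the compiler from removing the repeated accesses */
        volatile unsigned char *obj = malloc(obj_size);
        if (obj == NULL) {
            break;
        }
        obj[0] = 0;
        for (size_t j = 0; j < m; j++) {
            obj[0] = (unsigned char)(obj[0] + 1);   /* read, then write */
        }
        sum += obj[0];
        free((void *)obj);
    }
    return sum;
}
```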
+
+\begin{figure}
+\centering
+\input{AllocInducedActiveFalseSharing}
+\medskip
+\begin{lstlisting}
+Main Thread
+	create worker threads
+	...
+	signal workers to allocate
+	...
+	signal workers to free
+	...
+	print addresses from each $thread$
+Worker Thread$\(_1\)$
+	allocate, write, read, free
+	warmup memory in chunks of 16 bytes
+	...
+	malloc N objects
+	...
+	free objects
+	return object address to Main Thread
+Worker Thread$\(_2\)$
+	// same as Worker Thread$\(_1\)$
+\end{lstlisting}
+%\input{MemoryOverhead}
+%\includegraphics[width=1\textwidth]{figures/bench-cache-thrash.eps}
+\caption{Allocator-Induced Active False-Sharing Benchmark}
 \label{fig:benchThrashFig}
 \end{figure}
 
-Cache thrash tries to create a scenerio that should lead to allocator induced false sharing if the underlying memory
-allocator is allocating dynamic memory to multiple threads on same cache lines. Ideally, a memory allocator should
-distance dynamic memory region of one thread from other threads'. Having multiple threads allocating small objects
-simultanously should cause the memory allocator to allocate objects for multiple objects on the same cache line if its
-not distancing the memory among different threads.
-
-Figure \ref{fig:benchThrashFig} lays out flow of the cache thrash benchmark.
- It creates K worker threads. Each worker thread allocates an object and intensively read/write
- it for M times to invalidate cache lines frequently to slow down other threads who might be sharing this cache line
- with it. Each thread repeats this for N times. Main thread measures the total time taken to for all worker threads to
- complete. Worker threads sharing cahche lines with each other will take longer.
-
-Different cache access scenerios can be created using the following commandline arguments.\\
--threadN :  sets number of threads, K\\
--cacheIt :  iterations for cache benchmark, N\\
--cacheRep:  repetations for cache benchmark, M\\
--cacheObj:  object size for cache benchmark
-
-\section{Cache Scratch} Cache Scratch benchmark measures allocator induced passive false sharing in an allocator. An
- allocator can unintentionally induce false sharing depending upon its management of the freed objects as described in
- figure \ref{f:AllocatorInducedPassiveFalseSharing}. If a thread A allocates multiple objects together then they will be
-  possibly allocated on the same cache line by the memory allocator. If the thread now passes this object to another
-  thread B then the two of them will sharing the same cache line but this scenerio is not induced by the allocator.
-  Instead, the program induced this situation. Now it might be possible that if thread B frees this object and then
-  allocate an object of the same size then the allocator may return the same object which is on a cache line shared
-  with thread A. Now this false sharing is being caused by the memory allocator although it was started by the
-  program.
-
-\begin{figure}
-\centering
-\includegraphics[width=1\textwidth]{figures/bench-cache-scratch.eps}
-\caption{Benchmark Program Induced Passive False Sharing}
+The adjustment knobs for cache access scenarios are:
+\begin{description}[itemsep=0pt,parsep=0pt]
+\item[threadN:]
+sets number of threads, K.
+\item[cacheIt:]
+iterations for cache benchmark, N.
+\item[cacheRep:]
+repetitions for cache benchmark, M.
+\item[cacheObj:]
+object size for cache benchmark.
+\end{description}
+
+
+\subsection{Cache Scratch}
+
+The cache scratch benchmark measures allocator induced passive false sharing in an allocator.
+An allocator can unintentionally induce false sharing depending upon its management of the freed objects as described in \VRef[Figure]{f:AllocatorInducedPassiveFalseSharing}.
+If thread Thread$_1$ allocates multiple objects together, they may be allocated on the same cache line by the memory allocator.
+If Thread$_1$ passes one of these objects to thread Thread$_2$, then both threads may share the same cache line, but this scenario is not induced by the allocator;
+instead, the program induced this situation.
+Now if Thread$_2$ frees this object and then allocates an object of the same size, the allocator may return the same object, which is on a cache line shared with thread Thread$_1$.
+Now this false sharing is being caused by the memory allocator although it was started by the program.
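The reuse behaviour at the heart of this scenario can be probed with a small C sketch. It is a deliberate simplification: it runs in a single thread, whereas the scenario above involves an object allocated by one thread and freed by another, and the name `reuses_freed_address` is hypothetical. Whether the freed address comes back depends entirely on the allocator, so the result is not predictable in general.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Returns 1 if the allocator immediately hands back the address that was
 * just freed, 0 if it returns a different address, and -1 if an allocation fails. */
int reuses_freed_address(size_t size) {
    void *first = malloc(size);
    if (first == NULL) {
        return -1;
    }
    uintptr_t old_address = (uintptr_t)first;  /* remember the address before freeing */
    free(first);
    void *second = malloc(size);               /* same size, as in the scenario */
    if (second == NULL) {
        return -1;
    }
    int same = ((uintptr_t)second == old_address);
    free(second);
    return same;
}
```

An allocator that returns 1 here in the cross-thread case would preserve the program-induced sharing, which is the behaviour the cache-scratch benchmark is designed to expose.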
+
+The cache-scratch main thread induces false sharing and creates a scenario that should make the memory allocator preserve the program-induced false sharing if it does not return a freed object to its owner thread and, instead, re-uses it instantly.
+An allocator using object ownership, as described in section \VRef{s:Ownership}, is less susceptible to allocator-induced passive false-sharing.
+If the object is returned to the thread who owns it or originally allocated it, then the thread Thread$_2$ gets a new object that is less likely to be on the same cache line as Thread$_1$.
+
+\VRef[Figure]{fig:benchScratchFig} shows the pseudo code for the cache-scratch micro-benchmark.
+First, it allocates K dynamic objects together, one for each of the K worker threads, possibly causing the memory allocator to allocate these objects on the same cache line.
+Then it creates K worker threads and passes one of the K allocated objects to each of the K threads.
+Each worker thread frees the object passed by the main thread.
+Then, it allocates an object and reads/writes it repetitively for M times possibly causing frequent cache invalidations.
+Each worker repeats this N times.
+
+\begin{figure}
+\centering
+\input{AllocInducedPassiveFalseSharing}
+\medskip
+\begin{lstlisting}
+Main Thread
+	malloc N objects for each worker thread
+	create worker threads and pass N objects to each worker
+	...
+	signal workers to allocate
+	...
+	signal workers to free
+	...
+	print addresses from each $thread$
+Worker Thread$\(_1\)$
+	allocate, write, read, free
+	warmup memory in chunks of 16 bytes
+	...
+	for ( N )
+		free an object passed by Main Thread
+		malloc new object
+	...
+	free objects
+	return new object addresses to Main Thread
+Worker Thread$\(_2\)$
+	// same as Worker Thread$\(_1\)$
+\end{lstlisting}
+%\includegraphics[width=1\textwidth]{figures/bench-cache-scratch.eps}
+\caption{Program-Induced Passive False-Sharing Benchmark}
 \label{fig:benchScratchFig}
 \end{figure}
 
-Cache scratch main thread induces false sharing and creates a scenerio that should make memory allocator preserve the
- program-induced false sharing if it does not retur a freed object to its owner thread and, instead, re-uses it
- instantly. An alloator using object ownership, as described in section \ref{s:Ownership}, would be less susceptible to allocator induced passive
- false sharing. If the object is returned to the thread who owns it or originally allocated it then the thread B will
- get a new object that will be less likely to be on the same cache line as thread A.
-
-As in figure \ref{fig:benchScratchFig}, cache Scratch allocates K dynamic objects together, one for each of the K worker threads,
- possibly causing memory allocator to allocate these objects on the same cache-line. Then it create K worker threads and passes
- an object from the K allocated objects to each of the K threads. Each worker thread frees the object passed by the main thread.
- Then, it allocates an object and reads/writes it repetitively for M times causing frequent cache invalidations. Each worker
- repeats this for N times.
-
-Each thread allocating an object after freeing the original object passed by the main thread should cause the memory
-allocator to return the same object that was initially allocated by the main thread if the allocator did not return the
-intial object bakc to its owner (main thread). Then, intensive read/write on the shared cache line by multiple threads
-should slow down worker threads due to to high cache invalidations and misses. Main thread measures the total time
-taken for all the workers to complete.
-
-Similar to bechmark cache thrash in section \ref{sec:benchThrashSec}, different cache access scenerios can be created using the following commandline arguments.\\
--threadN :  sets number of threads, K\\
--cacheIt :  iterations for cache benchmark, N\\
--cacheRep:  repetations for cache benchmark, M\\
--cacheObj:  object size for cache benchmark
+Each thread allocating an object after freeing the original object passed by the main thread should cause the memory allocator to return the same object that was initially allocated by the main thread, if the allocator did not return the initial object to its owner (the main thread).
+Then, intensive reading and writing on the shared cache line by multiple threads should slow down the worker threads due to frequent cache invalidations and misses.
+The main thread measures the total time taken for all the workers to complete.
+
+Similar to the cache-thrash benchmark in section \VRef{sec:benchThrashSec}, different cache access scenarios can be created using the following command-line arguments.
+\begin{description}[itemsep=0pt,parsep=0pt]
+\item[threads:]
+number of threads, K.
+\item[iterations:]
+iterations for cache benchmark, N.
+\item[repetitions:]
+repetitions for cache benchmark, M.
+\item[objsize:]
+object size for cache benchmark.
+\end{description}