Index: doc/theses/mubeen_zulfiqar_MMath/allocator.tex
===================================================================
--- doc/theses/mubeen_zulfiqar_MMath/allocator.tex	(revision 374cb11784dccbf21002ae7aee894790d8f79d65)
+++ doc/theses/mubeen_zulfiqar_MMath/allocator.tex	(revision 2686bc769528b67434ff02c007d4554d9630fa2f)
@@ -175,5 +175,5 @@
 More operating system support is required to make this model viable, but there is still the serially-reusable problem with user-level threading.
 Leaving the 1:1 model with no atomic actions along the fastpath and no special operating-system support required.
-The 1:1 model still has the serially-reusable problem with user-level threading, which is addressed in \VRef{}, and the greatest potential for heap blowup for certain allocation patterns.
+The 1:1 model still has the serially-reusable problem with user-level threading, which is addressed in \VRef{s:UserlevelThreadingSupport}, and the greatest potential for heap blowup for certain allocation patterns.
 
 
@@ -216,5 +216,5 @@
 To obtain $O(1)$ external latency means obtaining one large storage area from the operating system and subdividing it across all program allocations, which requires a good guess at the program storage high-watermark and potential large external fragmentation.
 Excluding real-time operating-systems, operating-system operations are unbounded, and hence some external latency is unavoidable.
-The mitigating factor is that operating-system calls can often be reduced if a programmer has a sense of the storage high-watermark and the allocator is capable of using this information (see @malloc_expansion@ \VRef{}).
+The mitigating factor is that operating-system calls can often be reduced if a programmer has a sense of the storage high-watermark and the allocator is capable of using this information (see @malloc_expansion@ \VPageref{p:malloc_expansion}).
 Furthermore, while operating-system calls are unbounded, many are now reasonably fast, so their latency is tolerable and infrequent.
 
@@ -504,5 +504,5 @@
 
 
-\section{Statistics and Debugging Modes}
+\section{Statistics and Debugging}
 
 llheap can be built to accumulate fast and largely contention-free allocation statistics to help understand allocation behaviour.
@@ -547,5 +547,5 @@
 There is an unfortunate problem in detecting unfreed storage because some library routines assume their allocations have life-time duration, and hence, do not free their storage.
 For example, @printf@ allocates a 1024 buffer on first call and never deletes this buffer.
-To prevent a false positive for unfreed storage, it is possible to specify an amount of storage that is never freed (see \VRef{}), and it is subtracted from the total allocate/free difference.
+To prevent a false positive for unfreed storage, it is possible to specify an amount of storage that is never freed (see @malloc_unfreed@ \VPageref{p:malloc_unfreed}), and it is subtracted from the total allocate/free difference.
 Determining the amount of never-freed storage is annoying, but once done, any warnings of unfreed storage are application related.
 
@@ -554,4 +554,5 @@
 
 \section{User-level Threading Support}
+\label{s:UserlevelThreadingSupport}
 
 The serially-reusable problem (see \VRef{s:AllocationFastpath}) occurs for kernel threads in the ``T:H model, H = number of CPUs'' model and for user threads in the ``1:1'' model, where llheap uses the ``1:1'' model.
@@ -670,5 +671,5 @@
 It is possible to zero fill or align an allocation but not both.
 \item
-It is \emph{only} possible to zero fill and array allocation.
+It is \emph{only} possible to zero fill an array allocation.
 \item
 It is not possible to resize a memory allocation without data copying.
@@ -687,6 +688,9 @@
 void free( void * ptr );
 void * memalign( size_t alignment, size_t size );
+void * aligned_alloc( size_t alignment, size_t size );
+int posix_memalign( void ** memptr, size_t alignment, size_t size );
 void * valloc( size_t size );
 void * pvalloc( size_t size );
+
 struct mallinfo mallinfo( void );
 int mallopt( int param, int val );
@@ -707,457 +711,337 @@
 Most allocators use @nullptr@ to indicate an allocation failure, specifically out of memory;
 hence the need to return an alternate value for a zero-sized allocation.
-The alternative is to abort a program when out of memory.
-In theory, notifying the programmer allows recovery;
-in practice, it is almost impossible to gracefully recover when out of memory, so the cheaper approach of returning @nullptr@ for a zero-sized allocation is chosen for llheap.
+A different approach allowed by the C API is to abort a program when out of memory and return @nullptr@ for a zero-sized allocation.
+In theory, notifying the programmer of memory failure allows recovery;
+in practice, it is almost impossible to gracefully recover when out of memory.
+Hence, the cheaper approach of returning @nullptr@ for a zero-sized allocation is chosen because no pseudo allocation is necessary.
 
 
 \subsection{C Interface}
 
-Within the C type-system, it is still possible to increase orthogonality and functionality of the dynamic-memory API to make the allocator more usable for programmers.
+For C, it is possible to increase functionality and orthogonality of the dynamic-memory API to make allocation better for programmers.
+
+For existing C allocation routines:
+\begin{itemize}
+\item
+@calloc@ sets the sticky zero-fill property.
+\item
+@memalign@, @aligned_alloc@, @posix_memalign@, @valloc@ and @pvalloc@ set the sticky alignment property.
+\item
+@realloc@ and @reallocarray@ preserve sticky properties.
+\end{itemize}
+
+The C dynamic-memory API is extended with the following routines:
 
 \paragraph{\lstinline{void * aalloc( size_t dim, size_t elemSize )}}
-@aalloc@ is an extension of malloc.
-It allows programmer to allocate a dynamic array of objects without calculating the total size of array explicitly.
-The only alternate of this routine in the other allocators is @calloc@ but @calloc@ also fills the dynamic memory with 0 which makes it slower for a programmer who only wants to dynamically allocate an array of objects without filling it with 0.
-\paragraph{Usage}
+extends @calloc@ for allocating a dynamic array of objects without explicitly calculating the total size of the array, but \emph{without} zero-filling the memory.
+@aalloc@ avoids the cost of zero filling, so it is faster than @calloc@, which is the only alternative.
+
+\noindent\textbf{Usage}
 @aalloc@ takes two parameters.
-
-\begin{itemize}
-\item
-@dim@: number of objects in the array
-\item
-@elemSize@: size of the object in the array.
-\end{itemize}
-It returns address of dynamic object allocated on heap that can contain dim number of objects of the size elemSize.
-On failure, it returns a @NULL@ pointer.
+\begin{itemize}
+\item
+@dim@: number of array objects
+\item
+@elemSize@: size of array object
+\end{itemize}
+It returns the address of the dynamic array or @NULL@ if either @dim@ or @elemSize@ is zero.
 
 \paragraph{\lstinline{void * resize( void * oaddr, size_t size )}}
-@resize@ is an extension of relloc.
-It allows programmer to reuse a currently allocated dynamic object with a new size requirement.
-Its alternate in the other allocators is @realloc@ but relloc also copy the data in old object to the new object which makes it slower for the programmer who only wants to reuse an old dynamic object for a new size requirement but does not want to preserve the data in the old object to the new object.
-\paragraph{Usage}
+extends @realloc@ for resizing an existing allocation \emph{without} copying previous data into the new allocation or preserving sticky properties.
+@resize@ avoids the cost of copying data, so it is faster than @realloc@, which is the only alternative.
+
+\noindent\textbf{Usage}
 @resize@ takes two parameters.
-
-\begin{itemize}
-\item
-@oaddr@: the address of the old object that needs to be resized.
-\item
-@size@: the new size requirement of the to which the old object needs to be resized.
-\end{itemize}
-It returns an object that is of the size given but it does not preserve the data in the old object.
-On failure, it returns a @NULL@ pointer.
+\begin{itemize}
+\item
+@oaddr@: address to be resized
+\item
+@size@: new allocation size (smaller or larger than previous)
+\end{itemize}
+It returns the address of the old or new storage with the specified new size or @NULL@ if @size@ is zero.
+
+\paragraph{\lstinline{void * amemalign( size_t alignment, size_t dim, size_t elemSize )}}
+extends @aalloc@ and @memalign@ for allocating an aligned dynamic array of objects.
+It sets the sticky alignment property.
+
+\noindent\textbf{Usage}
+@amemalign@ takes three parameters.
+\begin{itemize}
+\item
+@alignment@: alignment requirement
+\item
+@dim@: number of array objects
+\item
+@elemSize@: size of array object
+\end{itemize}
+It returns the address of the aligned dynamic array or @NULL@ if either @dim@ or @elemSize@ is zero.
+
+\paragraph{\lstinline{void * cmemalign( size_t alignment, size_t dim, size_t elemSize )}}
+extends @amemalign@ with zero fill and has the same usage as @amemalign@.
+It sets the sticky zero-fill and alignment properties.
+It returns the address of the aligned, zero-filled dynamic array or @NULL@ if either @dim@ or @elemSize@ is zero.
+
+\paragraph{\lstinline{size_t malloc_alignment( void * addr )}}
+returns the alignment of the dynamic object for use in aligning similar allocations.
+
+\noindent\textbf{Usage}
+@malloc_alignment@ takes one parameter.
+\begin{itemize}
+\item
+@addr@: address of an allocated object.
+\end{itemize}
+It returns the alignment of the given object, where objects not allocated with alignment return the minimal allocation alignment.
+
+\paragraph{\lstinline{bool malloc_zero_fill( void * addr )}}
+returns true if the object has the zero-fill sticky property for use in zero filling similar allocations.
+
+\noindent\textbf{Usage}
+@malloc_zero_fill@ takes one parameter.
+
+\begin{itemize}
+\item
+@addr@: address of an allocated object.
+\end{itemize}
+It returns true if the zero-fill sticky property is set and false otherwise.
+
+\paragraph{\lstinline{size_t malloc_size( void * addr )}}
+returns the request size of the dynamic object (updated when an object is resized) for use in similar allocations.
+See also @malloc_usable_size@.
+
+\noindent\textbf{Usage}
+@malloc_size@ takes one parameter.
+\begin{itemize}
+\item
+@addr@: address of an allocated object.
+\end{itemize}
+It returns the request size or zero if @addr@ is @NULL@.
+
+\paragraph{\lstinline{int malloc_stats_fd( int fd )}}
+changes the file descriptor where @malloc_stats@ writes statistics (default @stdout@).
+
+\noindent\textbf{Usage}
+@malloc_stats_fd@ takes one parameter.
+\begin{itemize}
+\item
+@fd@: file descriptor.
+\end{itemize}
+It returns the previous file descriptor.
+
+\paragraph{\lstinline{size_t malloc_expansion()}}
+\label{p:malloc_expansion}
+sets the amount (bytes) to extend the heap when there is insufficient free storage to service an allocation request.
+It returns the heap extension size used throughout a program, \ie called once at heap initialization.
+
+\paragraph{\lstinline{size_t malloc_mmap_start()}}
+sets the crossover between allocations occurring in the @sbrk@ area or separately mapped.
+It returns the crossover point used throughout a program, \ie called once at heap initialization.
+
+\paragraph{\lstinline{size_t malloc_unfreed()}}
+\label{p:malloc_unfreed}
+sets the amount subtracted to adjust for unfreed program storage (debug only).
+It returns the new subtraction amount and is called by @malloc_stats@.
+
+
+\subsection{\CC Interface}
+
+The following extensions take advantage of overload polymorphism in the \CC type-system.
 
 \paragraph{\lstinline{void * resize( void * oaddr, size_t nalign, size_t size )}}
-This @resize@ is an extension of the above @resize@ (FIX ME: cite above resize).
-In addition to resizing the size of of an old object, it can also realign the old object to a new alignment requirement.
-\paragraph{Usage}
-This resize takes three parameters.
-It takes an additional parameter of nalign as compared to the above resize (FIX ME: cite above resize).
-
-\begin{itemize}
-\item
-@oaddr@: the address of the old object that needs to be resized.
-\item
-@nalign@: the new alignment to which the old object needs to be realigned.
-\item
-@size@: the new size requirement of the to which the old object needs to be resized.
-\end{itemize}
-It returns an object with the size and alignment given in the parameters.
-On failure, it returns a @NULL@ pointer.
-
-\paragraph{\lstinline{void * amemalign( size_t alignment, size_t dim, size_t elemSize )}}
-amemalign is a hybrid of memalign and aalloc.
-It allows programmer to allocate an aligned dynamic array of objects without calculating the total size of the array explicitly.
-It frees the programmer from calculating the total size of the array.
-\paragraph{Usage}
-amemalign takes three parameters.
-
-\begin{itemize}
-\item
-@alignment@: the alignment to which the dynamic array needs to be aligned.
-\item
-@dim@: number of objects in the array
-\item
-@elemSize@: size of the object in the array.
-\end{itemize}
-It returns a dynamic array of objects that has the capacity to contain dim number of objects of the size of elemSize.
-The returned dynamic array is aligned to the given alignment.
-On failure, it returns a @NULL@ pointer.
-
-\paragraph{\lstinline{void * cmemalign( size_t alignment, size_t dim, size_t elemSize )}}
-cmemalign is a hybrid of amemalign and calloc.
-It allows programmer to allocate an aligned dynamic array of objects that is 0 filled.
-The current way to do this in other allocators is to allocate an aligned object with memalign and then fill it with 0 explicitly.
-This routine provides both features of aligning and 0 filling, implicitly.
-\paragraph{Usage}
-cmemalign takes three parameters.
-
-\begin{itemize}
-\item
-@alignment@: the alignment to which the dynamic array needs to be aligned.
-\item
-@dim@: number of objects in the array
-\item
-@elemSize@: size of the object in the array.
-\end{itemize}
-It returns a dynamic array of objects that has the capacity to contain dim number of objects of the size of elemSize.
-The returned dynamic array is aligned to the given alignment and is 0 filled.
-On failure, it returns a @NULL@ pointer.
-
-\paragraph{\lstinline{size_t malloc_alignment( void * addr )}}
-@malloc_alignment@ returns the alignment of a currently allocated dynamic object.
-It allows the programmer in memory management and personal bookkeeping.
-It helps the programmer in verifying the alignment of a dynamic object especially in a scenario similar to producer-consumer where a producer allocates a dynamic object and the consumer needs to assure that the dynamic object was allocated with the required alignment.
-\paragraph{Usage}
-@malloc_alignment@ takes one parameters.
-
-\begin{itemize}
-\item
-@addr@: the address of the currently allocated dynamic object.
-\end{itemize}
-@malloc_alignment@ returns the alignment of the given dynamic object.
-On failure, it return the value of default alignment of the llheap allocator.
-
-\paragraph{\lstinline{bool malloc_zero_fill( void * addr )}}
-@malloc_zero_fill@ returns whether a currently allocated dynamic object was initially zero filled at the time of allocation.
-It allows the programmer in memory management and personal bookkeeping.
-It helps the programmer in verifying the zero filled property of a dynamic object especially in a scenario similar to producer-consumer where a producer allocates a dynamic object and the consumer needs to assure that the dynamic object was zero filled at the time of allocation.
-\paragraph{Usage}
-@malloc_zero_fill@ takes one parameters.
-
-\begin{itemize}
-\item
-@addr@: the address of the currently allocated dynamic object.
-\end{itemize}
-@malloc_zero_fill@ returns true if the dynamic object was initially zero filled and return false otherwise.
-On failure, it returns false.
-
-\paragraph{\lstinline{size_t malloc_size( void * addr )}}
-@malloc_size@ returns the request size of a currently allocated dynamic object.
-It allows the programmer in memory management and personal bookkeeping.
-It helps the programmer in verifying the alignment of a dynamic object especially in a scenario similar to producer-consumer where a producer allocates a dynamic object and the consumer needs to assure that the dynamic object was allocated with the required size.
-Its current alternate in the other allocators is @malloc_usable_size@.
-But, @malloc_size@ is different from @malloc_usable_size@ as @malloc_usabe_size@ returns the total data capacity of dynamic object including the extra space at the end of the dynamic object.
-On the other hand, @malloc_size@ returns the size that was given to the allocator at the allocation of the dynamic object.
-This size is updated when an object is realloced, resized, or passed through a similar allocator routine.
-\paragraph{Usage}
-@malloc_size@ takes one parameters.
-
-\begin{itemize}
-\item
-@addr@: the address of the currently allocated dynamic object.
-\end{itemize}
-@malloc_size@ returns the request size of the given dynamic object.
-On failure, it return zero.
-
-
-\subsection{\CC Interface}
+extends @resize@ with an alignment re\-quirement.
+
+\noindent\textbf{Usage}
+This @resize@ takes three parameters.
+\begin{itemize}
+\item
+@oaddr@: address to be resized
+\item
+@nalign@: alignment requirement
+\item
+@size@: new allocation size (smaller or larger than previous)
+\end{itemize}
+It returns the address of the old or new storage with the specified new size and alignment, or @NULL@ if @size@ is zero.
 
 \paragraph{\lstinline{void * realloc( void * oaddr, size_t nalign, size_t size )}}
-This @realloc@ is an extension of the default @realloc@ (FIX ME: cite default @realloc@).
-In addition to reallocating an old object and preserving the data in old object, it can also realign the old object to a new alignment requirement.
-\paragraph{Usage}
-This @realloc@ takes three parameters.
-It takes an additional parameter of nalign as compared to the default @realloc@.
-
-\begin{itemize}
-\item
-@oaddr@: the address of the old object that needs to be reallocated.
-\item
-@nalign@: the new alignment to which the old object needs to be realigned.
-\item
-@size@: the new size requirement of the to which the old object needs to be resized.
-\end{itemize}
-It returns an object with the size and alignment given in the parameters that preserves the data in the old object.
-On failure, it returns a @NULL@ pointer.
+extends @realloc@ with an alignment re\-quirement and has the same usage as aligned @resize@.
 
 
 \subsection{\CFA Interface}
-We added some routines to the @malloc@ interface of \CFA.
-These routines can only be used in \CFA and not in our stand-alone llheap allocator as these routines use some features that are only provided by \CFA and not by C.
-It makes the allocator even more usable to the programmers.
-\CFA provides the liberty to know the returned type of a call to the allocator.
-So, mainly in these added routines, we removed the object size parameter from the routine as allocator can calculate the size of the object from the returned type.
-
-\subsection{\lstinline{T * malloc( void )}}
-This @malloc@ is a simplified polymorphic form of default @malloc@ (FIX ME: cite malloc).
-It does not take any parameter as compared to default @malloc@ that takes one parameter.
-\paragraph{Usage}
-This @malloc@ takes no parameters.
-It returns a dynamic object of the size of type @T@.
-On failure, it returns a @NULL@ pointer.
-
-\subsection{\lstinline{T * aalloc( size_t dim )}}
-This @aalloc@ is a simplified polymorphic form of above @aalloc@ (FIX ME: cite aalloc).
-It takes one parameter as compared to the above @aalloc@ that takes two parameters.
-\paragraph{Usage}
-aalloc takes one parameters.
-
-\begin{itemize}
-\item
-@dim@: required number of objects in the array.
-\end{itemize}
-It returns a dynamic object that has the capacity to contain dim number of objects, each of the size of type @T@.
-On failure, it returns a @NULL@ pointer.
-
-\subsection{\lstinline{T * calloc( size_t dim )}}
-This @calloc@ is a simplified polymorphic form of default @calloc@ (FIX ME: cite calloc).
-It takes one parameter as compared to the default @calloc@ that takes two parameters.
-\paragraph{Usage}
-This @calloc@ takes one parameter.
-
-\begin{itemize}
-\item
-@dim@: required number of objects in the array.
-\end{itemize}
-It returns a dynamic object that has the capacity to contain dim number of objects, each of the size of type @T@.
-On failure, it returns a @NULL@ pointer.
-
-\subsection{\lstinline{T * resize( T * ptr, size_t size )}}
-This resize is a simplified polymorphic form of above resize (FIX ME: cite resize with alignment).
-It takes two parameters as compared to the above resize that takes three parameters.
-It frees the programmer from explicitly mentioning the alignment of the allocation as \CFA provides gives allocator the liberty to get the alignment of the returned type.
-\paragraph{Usage}
-This resize takes two parameters.
-
-\begin{itemize}
-\item
-@ptr@: address of the old object.
-\item
-@size@: the required size of the new object.
-\end{itemize}
-It returns a dynamic object of the size given in parameters.
-The returned object is aligned to the alignment of type @T@.
-On failure, it returns a @NULL@ pointer.
-
-\subsection{\lstinline{T * realloc( T * ptr, size_t size )}}
-This @realloc@ is a simplified polymorphic form of default @realloc@ (FIX ME: cite @realloc@ with align).
-It takes two parameters as compared to the above @realloc@ that takes three parameters.
-It frees the programmer from explicitly mentioning the alignment of the allocation as \CFA provides gives allocator the liberty to get the alignment of the returned type.
-\paragraph{Usage}
-This @realloc@ takes two parameters.
-
-\begin{itemize}
-\item
-@ptr@: address of the old object.
-\item
-@size@: the required size of the new object.
-\end{itemize}
-It returns a dynamic object of the size given in parameters that preserves the data in the given object.
-The returned object is aligned to the alignment of type @T@.
-On failure, it returns a @NULL@ pointer.
-
-\subsection{\lstinline{T * memalign( size_t align )}}
-This memalign is a simplified polymorphic form of default memalign (FIX ME: cite memalign).
-It takes one parameters as compared to the default memalign that takes two parameters.
-\paragraph{Usage}
-memalign takes one parameters.
-
-\begin{itemize}
-\item
-@align@: the required alignment of the dynamic object.
-\end{itemize}
-It returns a dynamic object of the size of type @T@ that is aligned to given parameter align.
-On failure, it returns a @NULL@ pointer.
-
-\subsection{\lstinline{T * amemalign( size_t align, size_t dim )}}
-This amemalign is a simplified polymorphic form of above amemalign (FIX ME: cite amemalign).
-It takes two parameter as compared to the above amemalign that takes three parameters.
-\paragraph{Usage}
-amemalign takes two parameters.
-
-\begin{itemize}
-\item
-@align@: required alignment of the dynamic array.
-\item
-@dim@: required number of objects in the array.
-\end{itemize}
-It returns a dynamic object that has the capacity to contain dim number of objects, each of the size of type @T@.
-The returned object is aligned to the given parameter align.
-On failure, it returns a @NULL@ pointer.
-
-\subsection{\lstinline{T * cmemalign( size_t align, size_t dim  )}}
-This cmemalign is a simplified polymorphic form of above cmemalign (FIX ME: cite cmemalign).
-It takes two parameter as compared to the above cmemalign that takes three parameters.
-\paragraph{Usage}
-cmemalign takes two parameters.
-
-\begin{itemize}
-\item
-@align@: required alignment of the dynamic array.
-\item
-@dim@: required number of objects in the array.
-\end{itemize}
-It returns a dynamic object that has the capacity to contain dim number of objects, each of the size of type @T@.
-The returned object is aligned to the given parameter align and is zero filled.
-On failure, it returns a @NULL@ pointer.
-
-\subsection{\lstinline{T * aligned_alloc( size_t align )}}
-This @aligned_alloc@ is a simplified polymorphic form of default @aligned_alloc@ (FIX ME: cite @aligned_alloc@).
-It takes one parameter as compared to the default @aligned_alloc@ that takes two parameters.
-\paragraph{Usage}
-This @aligned_alloc@ takes one parameter.
-
-\begin{itemize}
-\item
-@align@: required alignment of the dynamic object.
-\end{itemize}
-It returns a dynamic object of the size of type @T@ that is aligned to the given parameter.
-On failure, it returns a @NULL@ pointer.
-
-\subsection{\lstinline{int posix_memalign( T ** ptr, size_t align )}}
-This @posix_memalign@ is a simplified polymorphic form of default @posix_memalign@ (FIX ME: cite @posix_memalign@).
-It takes two parameters as compared to the default @posix_memalign@ that takes three parameters.
-\paragraph{Usage}
-This @posix_memalign@ takes two parameter.
-
-\begin{itemize}
-\item
-@ptr@: variable address to store the address of the allocated object.
-\item
-@align@: required alignment of the dynamic object.
-\end{itemize}
-
-It stores address of the dynamic object of the size of type @T@ in given parameter ptr.
-This object is aligned to the given parameter.
-On failure, it returns a @NULL@ pointer.
-
-\subsection{\lstinline{T * valloc( void )}}
-This @valloc@ is a simplified polymorphic form of default @valloc@ (FIX ME: cite @valloc@).
-It takes no parameters as compared to the default @valloc@ that takes one parameter.
-\paragraph{Usage}
-@valloc@ takes no parameters.
-It returns a dynamic object of the size of type @T@ that is aligned to the page size.
-On failure, it returns a @NULL@ pointer.
-
-\subsection{\lstinline{T * pvalloc( void )}}
-\paragraph{Usage}
-@pvalloc@ takes no parameters.
-It returns a dynamic object of the size that is calculated by rounding the size of type @T@.
-The returned object is also aligned to the page size.
-On failure, it returns a @NULL@ pointer.
-
-\subsection{Alloc Interface}
-In addition to improve allocator interface both for \CFA and our stand-alone allocator llheap in C.
-We also added a new alloc interface in \CFA that increases usability of dynamic memory allocation.
-This interface helps programmers in three major ways.
-
-\begin{itemize}
-\item
-Routine Name: alloc interface frees programmers from remembering different routine names for different kind of dynamic allocations.
-\item
-Parameter Positions: alloc interface frees programmers from remembering parameter positions in call to routines.
-\item
-Object Size: alloc interface does not require programmer to mention the object size as \CFA allows allocator to determine the object size from returned type of alloc call.
-\end{itemize}
-
-Alloc interface uses polymorphism, backtick routines (FIX ME: cite backtick) and ttype parameters of \CFA (FIX ME: cite ttype) to provide a very simple dynamic memory allocation interface to the programmers.
-The new interface has just one routine name alloc that can be used to perform a wide range of dynamic allocations.
-The parameters use backtick functions to provide a similar-to named parameters feature for our alloc interface so that programmers do not have to remember parameter positions in alloc call except the position of dimension (dim) parameter.
-
-\subsection{Routine: \lstinline{T * alloc( ...
-)}}
-Call to alloc without any parameter returns one object of size of type @T@ allocated dynamically.
-Only the dimension (dim) parameter for array allocation has the fixed position in the alloc routine.
-If programmer wants to allocate an array of objects that the required number of members in the array has to be given as the first parameter to the alloc routine.
-alloc routine accepts six kinds of arguments.
-Using different combinations of than parameters, different kind of allocations can be performed.
-Any combination of parameters can be used together except @`realloc@ and @`resize@ that should not be used simultaneously in one call to routine as it creates ambiguity about whether to reallocate or resize a currently allocated dynamic object.
-If both @`resize@ and @`realloc@ are used in a call to alloc then the latter one will take effect or unexpected resulted might be produced.
-
-\paragraph{Dim}
-This is the only parameter in the alloc routine that has a fixed-position and it is also the only parameter that does not use a backtick function.
-It has to be passed at the first position to alloc call in-case of an array allocation of objects of type @T@.
-It represents the required number of members in the array allocation as in \CFA's @aalloc@ (FIX ME: cite aalloc).
-This parameter should be of type @size_t@.
-
-Example: @int a = alloc( 5 )@
-This call will return a dynamic array of five integers.
-
-\paragraph{Align}
-This parameter is position-free and uses a backtick routine align (@`align@).
-The parameter passed with @`align@ should be of type @size_t@.
-If the alignment parameter is not a power of two or is less than the default alignment of the allocator (that can be found out using routine libAlign in \CFA) then the passed alignment parameter will be rejected and the default alignment will be used.
-
-Example: @int b = alloc( 5 , 64`align )@
-This call will return a dynamic array of five integers.
-It will align the allocated object to 64.
-
-\paragraph{Fill}
-This parameter is position-free and uses a backtick routine fill (@`fill@).
-In case of @realloc@, only the extra space after copying the data in the old object will be filled with given parameter.
-Three types of parameters can be passed using `fill.
-
-\begin{itemize}
-\item
-@char@: A char can be passed with @`fill@ to fill the whole dynamic allocation with the given char recursively till the end of required allocation.
-\item
-Object of returned type: An object of type of returned type can be passed with @`fill@ to fill the whole dynamic allocation with the given object recursively till the end of required allocation.
-\item
-Dynamic object of returned type: A dynamic object of type of returned type can be passed with @`fill@ to fill the dynamic allocation with the given dynamic object.
-In this case, the allocated memory is not filled recursively till the end of allocation.
-The filling happen until the end object passed to @`fill@ or the end of requested allocation reaches.
-\end{itemize}
-
-Example: @int b = alloc( 5 , 'a'`fill )@
-This call will return a dynamic array of five integers.
-It will fill the allocated object with character 'a' recursively till the end of requested allocation size.
-
-Example: @int b = alloc( 5 , 4`fill )@
-This call will return a dynamic array of five integers.
-It will fill the allocated object with integer 4 recursively till the end of requested allocation size.
-
-Example: @int b = alloc( 5 , a`fill )@ where @a@ is a pointer of int type
-This call will return a dynamic array of five integers.
-It will copy data in a to the returned object non-recursively until end of a or the newly allocated object is reached.
-
-\paragraph{Resize}
-This parameter is position-free and uses a backtick routine resize (@`resize@).
-It represents the old dynamic object (oaddr) that the programmer wants to
-\begin{itemize}
-\item
-resize to a new size.
-\item
-realign to a new alignment
-\item
-fill with something.
-\end{itemize}
-The data in old dynamic object will not be preserved in the new object.
-The type of object passed to @`resize@ and the returned type of alloc call can be different.
-
-Example: @int b = alloc( 5 , a`resize )@
-This call will resize object a to a dynamic array that can contain 5 integers.
-
-Example: @int b = alloc( 5 , a`resize , 32`align )@
-This call will resize object a to a dynamic array that can contain 5 integers.
-The returned object will also be aligned to 32.
-
-Example: @int b = alloc( 5 , a`resize , 32`align , 2`fill )@
-This call will resize object a to a dynamic array that can contain 5 integers.
-The returned object will also be aligned to 32 and will be filled with 2.
-
-\paragraph{Realloc}
-This parameter is position-free and uses a backtick routine @realloc@ (@`realloc@).
-It represents the old dynamic object (oaddr) that the programmer wants to
-\begin{itemize}
-\item
-realloc to a new size.
-\item
-realign to a new alignment
-\item
-fill with something.
-\end{itemize}
-The data in old dynamic object will be preserved in the new object.
-The type of object passed to @`realloc@ and the returned type of alloc call cannot be different.
-
-Example: @int b = alloc( 5 , a`realloc )@
-This call will realloc object a to a dynamic array that can contain 5 integers.
-
-Example: @int b = alloc( 5 , a`realloc , 32`align )@
-This call will realloc object a to a dynamic array that can contain 5 integers.
-The returned object will also be aligned to 32.
-
-Example: @int b = alloc( 5 , a`realloc , 32`align , 2`fill )@
-This call will resize object a to a dynamic array that can contain 5 integers.
-The returned object will also be aligned to 32.
-The extra space after copying data of a to the returned object will be filled with 2.
+
+The following extensions take advantage of overload polymorphism in the \CFA type-system.
+The key safety advantage of the \CFA type system is using the return type to select overloads;
+hence, a polymorphic routine knows the returned type and its size.
+This capability is used to remove the object size parameter and correctly cast the return storage to match the result type.
+For example, the following is the \CFA wrapper for C @malloc@:
+\begin{cfa}
+forall( T & | sized(T) ) {
+	T * malloc( void ) {
+		if ( _Alignof(T) <= libAlign() ) return @(T *)@malloc( @sizeof(T)@ ); // C allocation
+		else return @(T *)@memalign( @_Alignof(T)@, @sizeof(T)@ ); // C allocation
+	} // malloc
+\end{cfa}
+and is used as follows:
+\begin{lstlisting}
+int * i = malloc();
+double * d = malloc();
+struct Spinlock { ... } __attribute__(( aligned(128) ));
+Spinlock * sl = malloc();
+\end{lstlisting}
+where each @malloc@ call provides the return type as @T@, which is used with @sizeof@, @_Alignof@, and casting the storage to the correct type.
+This interface removes many of the common allocation errors in C programs.
+\VRef[Figure]{f:CFADynamicAllocationAPI} shows the \CFA wrappers for the equivalent C/\CC allocation routines with the same semantic behaviour.
+
+\begin{figure}
+\begin{lstlisting}
+T * malloc( void );
+T * aalloc( size_t dim );
+T * calloc( size_t dim );
+T * resize( T * ptr, size_t size );
+T * realloc( T * ptr, size_t size );
+T * memalign( size_t align );
+T * amemalign( size_t align, size_t dim );
+T * cmemalign( size_t align, size_t dim  );
+T * aligned_alloc( size_t align );
+int posix_memalign( T ** ptr, size_t align );
+T * valloc( void );
+T * pvalloc( void );
+\end{lstlisting}
+\caption{\CFA C-Style Dynamic-Allocation API}
+\label{f:CFADynamicAllocationAPI}
+\end{figure}
+
+In addition to the \CFA C-style allocator interface, a new allocator interface is provided to further increase orthogonality and usability of dynamic-memory allocation.
+This interface helps programmers in three ways.
+\begin{itemize}
+\item
+naming: \CFA regular and @ttype@ polymorphism is used to encapsulate a wide range of allocation functionality into a single routine name, so programmers do not have to remember multiple routine names for different kinds of dynamic allocations.
+\item
+named arguments: individual allocation properties are specified using postfix function call, so programmers do not have to remember parameter positions in allocation calls.
+\item
+object size: like the \CFA C-style interface, programmers do not have to specify object size or cast allocation results.
+\end{itemize}
+Note, postfix function call is an alternative call syntax, using backtick @`@, where the argument appears before the function name, \eg
+\begin{cfa}
+duration ?@`@h( int h );		// ? denotes the position of the function operand
+duration ?@`@m( int m );
+duration ?@`@s( int s );
+duration dur = 3@`@h + 42@`@m + 17@`@s;
+\end{cfa}
+@ttype@ polymorphism is similar to \CC variadic templates.
+
+\paragraph{\lstinline{T * alloc( ... )} or \lstinline{T * alloc( size_t dim, ... )}}
+is overloaded with a variable number of specific allocation routines, or an integer dimension parameter followed by a variable number of specific allocation routines.
+A call without parameters returns a dynamically allocated object of type @T@ (@malloc@).
+A call with only the dimension (dim) parameter returns a dynamically allocated array of objects of type @T@ (@aalloc@).
+The variable number of arguments consists of allocation properties, which can be combined to produce different kinds of allocations.
+The only restriction is for properties @realloc@ and @resize@, which cannot be combined.
+
+The allocation property functions are:
+\subparagraph{\lstinline{T_align ?`align( size_t alignment )}}
+to align the allocation.
+The alignment parameter must be $\ge$ the default alignment (@libAlign()@ in \CFA) and a power of two, \eg:
+\begin{cfa}
+int * i0 = alloc( @4096`align@ );  sout | i0 | nl;
+int * i1 = alloc( 3, @4096`align@ );  sout | i1; for (i; 3 ) sout | &i1[i]; sout | nl;
+
+0x555555572000
+0x555555574000 0x555555574000 0x555555574004 0x555555574008
+\end{cfa}
+returns a dynamic object and object array aligned on a 4096-byte boundary.
+
+\subparagraph{\lstinline{S_fill(T) ?`fill ( /* various types */ )}}
+to initialize storage.
+There are three ways to fill storage:
+\begin{enumerate}
+\item
+A char fills each byte of each object.
+\item
+An object of the returned type fills each object.
+\item
+An object array pointer fills some or all of the corresponding object array.
+\end{enumerate}
+For example:
+\begin{cfa}[numbers=left]
+int * i0 = alloc( @0n`fill@ );  sout | *i0 | nl;  // disambiguate 0
+int * i1 = alloc( @5`fill@ );  sout | *i1 | nl;
+int * i2 = alloc( @'\xfe'`fill@ ); sout | hex( *i2 ) | nl;
+int * i3 = alloc( 5, @5`fill@ );  for ( i; 5 ) sout | i3[i]; sout | nl;
+int * i4 = alloc( 5, @0xdeadbeefN`fill@ );  for ( i; 5 ) sout | hex( i4[i] ); sout | nl;
+int * i5 = alloc( 5, @i3`fill@ );  for ( i; 5 ) sout | i5[i]; sout | nl;
+int * i6 = alloc( 5, @[i3, 3]`fill@ );  for ( i; 5 ) sout | i6[i]; sout | nl;
+\end{cfa}
+\begin{lstlisting}[numbers=left]
+0
+5
+0xfefefefe
+5 5 5 5 5
+0xdeadbeef 0xdeadbeef 0xdeadbeef 0xdeadbeef 0xdeadbeef
+5 5 5 5 5
+5 5 5 -555819298 -555819298  // two undefined values
+\end{lstlisting}
+Examples 1 to 3 fill an object with a value or characters.
+Examples 4 to 7 fill an array of objects with values, another array, or part of an array.
+
+\subparagraph{\lstinline{S_resize(T) ?`resize( void * oaddr )}}
+used to resize, realign, and fill, where the old object data is not copied to the new object.
+The old object type may be different from the new object type, since the values are not used.
+For example:
+\begin{cfa}[numbers=left]
+int * i = alloc( @5`fill@ );  sout | i | *i;
+i = alloc( @i`resize@, @256`align@, @7`fill@ );  sout | i | *i;
+double * d = alloc( @i`resize@, @4096`align@, @13.5`fill@ );  sout | d | *d;
+\end{cfa}
+\begin{lstlisting}[numbers=left]
+0x55555556d5c0 5
+0x555555570000 7
+0x555555571000 13.5
+\end{lstlisting}
+Examples 2 to 3 change the alignment, fill, and size for the initial storage of @i@.
+
+\begin{cfa}[numbers=left]
+int * ia = alloc( 5, @5`fill@ );  for ( i; 5 ) sout | ia[i]; sout | nl;
+ia = alloc( 10, @ia`resize@, @7`fill@ ); for ( i; 10 ) sout | ia[i]; sout | nl;
+sout | ia; ia = alloc( 5, @ia`resize@, @512`align@, @13`fill@ ); sout | ia; for ( i; 5 ) sout | ia[i]; sout | nl;
+ia = alloc( 3, @ia`resize@, @4096`align@, @2`fill@ );  sout | ia; for ( i; 3 ) sout | &ia[i] | ia[i]; sout | nl;
+\end{cfa}
+\begin{lstlisting}[numbers=left]
+5 5 5 5 5
+7 7 7 7 7 7 7 7 7 7
+0x55555556d560 0x555555571a00 13 13 13 13 13
+0x555555572000 0x555555572000 2 0x555555572004 2 0x555555572008 2
+\end{lstlisting}
+Examples 2 to 4 change the array size, alignment and fill for the initial storage of @ia@.
+
+\subparagraph{\lstinline{S_realloc(T) ?`realloc( T * a )}}
+used to resize, realign, and fill, where the old object data is copied to the new object.
+The old object type must be the same as the new object type, since the values are used.
+Note, for @fill@, only the extra space after copying the data from the old object is filled with the given parameter.
+For example:
+\begin{cfa}[numbers=left]
+int * i = alloc( @5`fill@ );  sout | i | *i;
+i = alloc( @i`realloc@, @256`align@ );  sout | i | *i;
+i = alloc( @i`realloc@, @4096`align@, @13`fill@ );  sout | i | *i;
+\end{cfa}
+\begin{lstlisting}[numbers=left]
+0x55555556d5c0 5
+0x555555570000 5
+0x555555571000 5
+\end{lstlisting}
+Examples 2 to 3 change the alignment for the initial storage of @i@.
+The @13`fill@ for example 3 does nothing because no extra space is added.
+
+\begin{cfa}[numbers=left]
+int * ia = alloc( 5, @5`fill@ );  for ( i; 5 ) sout | ia[i]; sout | nl;
+ia = alloc( 10, @ia`realloc@, @7`fill@ ); for ( i; 10 ) sout | ia[i]; sout | nl;
+sout | ia; ia = alloc( 1, @ia`realloc@, @512`align@, @13`fill@ ); sout | ia; for ( i; 1 ) sout | ia[i]; sout | nl;
+ia = alloc( 3, @ia`realloc@, @4096`align@, @2`fill@ );  sout | ia; for ( i; 3 ) sout | &ia[i] | ia[i]; sout | nl;
+\end{cfa}
+\begin{lstlisting}[numbers=left]
+5 5 5 5 5
+5 5 5 5 5 7 7 7 7 7
+0x55555556c560 0x555555570a00 5
+0x555555571000 0x555555571000 5 0x555555571004 2 0x555555571008 2
+\end{lstlisting}
+Examples 2 to 4 change the array size, alignment and fill for the initial storage of @ia@.
+The @13`fill@ for example 3 does nothing because no extra space is added.
+
+These \CFA allocation features are used extensively in the development of the \CFA runtime.
Index: doc/theses/mubeen_zulfiqar_MMath/archive.tex
===================================================================
--- doc/theses/mubeen_zulfiqar_MMath/archive.tex	(revision 2686bc769528b67434ff02c007d4554d9630fa2f)
+++ doc/theses/mubeen_zulfiqar_MMath/archive.tex	(revision 2686bc769528b67434ff02c007d4554d9630fa2f)
@@ -0,0 +1,90 @@
+----> benchmarks.tex
+
+\section{Performance Metrics of Memory Allocators}
+
+When it comes to memory allocators, there are no set standards of performance. Performance of a memory allocator depends highly on the usage pattern of the application. A memory allocator that is the best performer for a certain application X might be the worst for some other application that has a completely different memory usage pattern. It is extremely difficult to make one universally best memory allocator that outperforms every other memory allocator for every usage pattern. So, there is a lack of a set of standard benchmarks that are used to evaluate a memory allocator's performance.
+
+If we break down the goals of a memory allocator, there are two basic metrics on which a memory allocator's performance is evaluated.
+\begin{enumerate}
+\item
+Memory Overhead
+\item
+Speed
+\end{enumerate}
+
+\subsection{Memory Overhead}
+Memory overhead is the extra memory that a memory allocator takes from the OS beyond what the application requests. Ideally, an allocator obtains just enough memory from the OS to fulfill the application's requests and returns this memory to the OS as soon as the application frees it. In practice, allocators retain more memory than the application has asked for, which causes memory overhead. Memory overhead can happen for various reasons.
+
+\subsubsection{Fragmentation}
+Fragmentation is one of the major reasons behind memory overhead. Fragmentation happens because of situations that are either necessary for proper functioning of the allocator, such as internal memory management and book-keeping, or are out of the allocator's control, such as the application's usage pattern.
+
+\paragraph{Internal Fragmentation}
+For internal book-keeping, allocators divide raw memory given by the OS into chunks, blocks, or lists that can fulfill the application's requested size. Allocators use memory given by the OS for creating headers, footers, etc.\ to store information about these chunks, blocks, or lists. This book-keeping increases memory usage beyond the memory requested by the application. This extra usage of memory for the allocator's own book-keeping is called Internal Fragmentation. Although it causes memory overhead, this overhead is necessary for an allocator's proper functioning.
+
+*** FIX ME: Insert a figure of internal fragmentation with explanation
+
+\paragraph{External Fragmentation}
+External fragmentation is the free bits of memory between or around chunks of memory that are currently in use by the application. Segmentation of memory due to the application's usage pattern causes external fragmentation. The memory that is part of external fragmentation is completely free, as it is used neither by the allocator's internal book-keeping nor by the application. Ideally, an allocator should return a segment of memory back to the OS as soon as the application frees it. But, this is not always the case. Allocators get memory from the OS in one of two ways.
+
+\begin{itemize}
+\item
+MMap: an allocator can ask the OS for whole pages in the mmap area. Then, the allocator segments the page internally and fulfills the application's request.
+\item
+Heap: an allocator can ask the OS for memory in the heap area using system calls such as sbrk. The heap area grows downwards and shrinks upwards.
+\begin{itemize}
+\item
+If an allocator uses the mmap area, it can only return extra memory back to the OS if the whole page is free, i.e., no chunk on the page is in use by the application. Even if only one chunk on the whole page is currently in use by the application, the allocator has to retain the whole page.
+\item
+If an allocator uses the heap area, it can only return the contiguous free memory at the end of the heap area that is currently in the allocator's possession, as the heap area shrinks upwards. If there are free bits of memory in between chunks of memory that are currently in use by the application, the allocator cannot return these free bits.
+
+*** FIX ME: Insert a figure of above scenario with explanation
+\item
+Even if the entire heap area is free except one small chunk at the end of the heap area that is being used by the application, the allocator cannot return the free heap area back to the OS as it is not a contiguous region at the end of the heap area.
+
+*** FIX ME: Insert a figure of above scenario with explanation
+
+\item
+Such scenarios cause external fragmentation, but it is out of the allocator's control and depends on the application's usage pattern.
+\end{itemize}
+\end{itemize}
+
+\subsubsection{Internal Memory Management}
+Allocators such as je-malloc (FIX ME: insert reference) pro-actively get some memory from the OS and divide it into chunks of certain sizes that can be used in the future to fulfill the application's requests. This causes memory overhead as these chunks are made before the application's request. There is also the possibility that an application may not even request memory of these sizes during its whole lifetime.
+
+*** FIX ME: Insert a figure of above scenario with explanation
+
+Allocators such as rp-malloc (FIX ME: insert reference) maintain lists or blocks of sized memory segments that are freed by the application for future use. These lists are maintained without any guarantee that the application will even request these sizes again.
+
+Such tactics are usually used to gain speed, as the allocator does not have to get raw memory from the OS and manage it at the time of the application's request, but they do cause memory overhead.
+
+Fragmentation and managed sized chunks of free memory can lead to Heap Blowup as the allocator may not be able to use the fragments or sized chunks of free memory to fulfill the application's requests of other sizes.
+
+\subsection{Speed}
+When it comes to performance evaluation of any piece of software, its runtime is usually the first thing that is evaluated. The same is true for memory allocators but, in the case of memory allocators, speed does not only mean the runtime of the memory allocator's routines; there are other factors too.
+
+\subsubsection{Runtime Speed}
+Low runtime is the main goal of a memory allocator when it comes to proving its speed. Runtime is the time that it takes for a routine of the memory allocator to complete its execution. As mentioned in (FIX ME: reference to routines' list), there are four basic routines that are used in memory allocation. Ideally, each routine of a memory allocator should be fast. Some memory allocator designs use pro-active measures (FIX ME: local reference) to gain speed when allocating memory to the application. Some memory allocators do memory allocation faster than memory freeing (FIX ME: graph reference) while others show similar speed whether memory is allocated or freed.
+
+\subsubsection{Memory Access Speed}
+Runtime speed is not the only speed metric for memory allocators. The memory that a memory allocator has allocated to the application also needs to be accessible as quickly as possible. The application should be able to read/write allocated memory quickly. The allocation method of a memory allocator may introduce some delays when it comes to memory access speed, which is especially important in concurrent applications. Ideally, a memory allocator should allocate all memory on a cache-line to only one thread and no cache-line should be shared among multiple threads. If a memory allocator allocates memory to multiple threads on the same cache-line, then the cache may get invalidated more frequently when two different threads running on two different processors try to read/write the same memory region. On the other hand, if one cache-line is used by only one thread, then the cache may get invalidated less frequently. This sharing of one cache-line among multiple threads is called false sharing (FIX ME: cite wasik).
+
+\paragraph{Active False Sharing}
+Active false sharing is the sharing of one cache-line among multiple threads that is caused by the memory allocator. It happens when two threads request memory from the memory allocator and the allocator allocates memory to both of them on the same cache-line. After that, if the threads are running on different processors that have their own caches and both threads start reading/writing the allocated memory simultaneously, their caches start getting invalidated every time the other thread writes something to the memory. This causes the application to slow down as each processor has to reload its cache much more frequently.
+
+*** FIX ME: Insert a figure of above scenario with explanation
+
+\paragraph{Passive False Sharing}
+Passive false sharing is the kind of false sharing that is caused by the application and not the memory allocator. The memory allocator may preserve passive false sharing in the future instead of eradicating it. But, passive false sharing is initiated by the application.
+
+\subparagraph{Program Induced Passive False Sharing}
+Program induced false sharing is completely out of the memory allocator's control and is purely caused by the application. It starts when a thread in the application creates multiple objects in the dynamic area and the allocator allocates memory for these objects on the same cache-line, as the objects are created by the same thread. Passive false sharing occurs if this thread passes one of these objects to another thread but retains the rest of these objects, or passes some/all of the remaining objects to some third thread(s). Now, one cache-line is shared among multiple threads, but it is caused by the application and not the allocator. It is out of the allocator's control and has a similar performance impact as Active False Sharing (FIX ME: cite local) if the threads sharing the same cache-line start reading/writing the given objects simultaneously.
+
+*** FIX ME: Insert a figure of above scenario 1 with explanation
+
+*** FIX ME: Insert a figure of above scenario 2 with explanation
+
+\subparagraph{Program Induced Allocator Preserved Passive False Sharing}
+Program induced allocator preserved passive false sharing is another interesting case of passive false sharing. Both the application and the allocator are partially responsible for it. It starts the same as Program Induced False Sharing (FIX ME: cite local): an application thread creates multiple dynamic objects on the same cache-line and distributes these objects among multiple threads, causing sharing of one cache-line among multiple threads (Program Induced Passive False Sharing). This kind of false sharing occurs when one of these threads, which got an object on the shared cache-line, frees the passed object and then allocates another object, but the allocator returns the same object (on the shared cache-line) that this thread just freed. Although the application caused the false sharing in the first place, to prevent further false sharing, the allocator should have returned the new object on some other cache-line that is only used by the allocating thread. When it comes to performance impact, this passive false sharing slows down the application just like any other kind of false sharing if the threads sharing the cache-line start reading/writing the objects simultaneously.
+
+
+*** FIX ME: Insert a figure of above scenario with explanation
Index: doc/theses/mubeen_zulfiqar_MMath/background.tex
===================================================================
--- doc/theses/mubeen_zulfiqar_MMath/background.tex	(revision 374cb11784dccbf21002ae7aee894790d8f79d65)
+++ doc/theses/mubeen_zulfiqar_MMath/background.tex	(revision 2686bc769528b67434ff02c007d4554d9630fa2f)
@@ -757,2 +757,10 @@
 Implementing lock-free operations for more complex data-structures (queue~\cite{Valois94}/deque~\cite{Sundell08}) is correspondingly more complex.
 Michael~\cite{Michael04} and Gidenstam \etal \cite{Gidenstam05} have created lock-free variations of the Hoard allocator.
+
+
+\subsubsection{Speed Workload}
+The workload method uses the opposite approach. It calls the allocator's routines for a specific amount of time and measures how much work is done during that time. Then, similar to the time method, it divides the time by the workload done during that time to calculate the average time taken by the allocator's routine.
+*** FIX ME: Insert a figure of above benchmark with description
+
+\paragraph{Knobs}
+*** FIX ME: Insert Knobs
Index: doc/theses/mubeen_zulfiqar_MMath/benchmarks.tex
===================================================================
--- doc/theses/mubeen_zulfiqar_MMath/benchmarks.tex	(revision 374cb11784dccbf21002ae7aee894790d8f79d65)
+++ doc/theses/mubeen_zulfiqar_MMath/benchmarks.tex	(revision 2686bc769528b67434ff02c007d4554d9630fa2f)
@@ -1,217 +1,205 @@
 \chapter{Benchmarks}
 
-\noindent
-====================
-
-Writing Points:
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Micro Benchmark Suite
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+
+The aim of the micro benchmark suite is to create a set of programs that can evaluate a memory allocator based on the
+performance metrics described in (FIX ME: local cite). These programs can be taken as a standard to benchmark an
+allocator's basic goals. These programs give details of an allocator's memory overhead and speed under a certain
+allocation pattern. The speed of the allocator is benchmarked in different ways. Similarly, false sharing happening in
+an allocator is also measured in multiple ways. These benchmarks evaluate the allocator under a certain allocation
+pattern, which is configurable and can be changed using a few knobs to observe an allocator's performance
+under a desired allocation pattern.
+
+The Micro Benchmark Suite benchmarks an allocator's performance by allocating dynamic objects and then measuring specific
+metrics. The benchmark suite evaluates an allocator with a certain allocation pattern. Benchmarks have different knobs
+that can be used to change the allocation pattern and evaluate an allocator under desired conditions. These can be set by
+giving command-line arguments to the benchmark on execution.
+
+\section{Current Benchmarks} There are multiple benchmarks that are built individually and evaluate different aspects of
+ a memory allocator. But, there is not a set of benchmarks that can be used to evaluate multiple aspects of memory
+ allocators.
+
+\subsection{threadtest}(FIX ME: cite benchmark and hoard) Each thread repeatedly allocates and then deallocates 100,000
+ objects. Runtime of the benchmark evaluates its efficiency.
+
+\subsection{shbench}(FIX ME: cite benchmark and hoard) Each thread allocates and randomly frees a number of random-sized
+ objects. It is a stress test that also uses runtime to determine efficiency of the allocator.
+
+\subsection{larson}(FIX ME: cite benchmark and hoard) Larson simulates a server environment. Multiple threads are
+ created where each thread allocates and frees a number of objects within a size range. Some objects are passed from
+ threads to the child threads to free. It calculates memory operations per second as an indicator of the memory
+ allocator's performance.
+
+\section{Memory Benchmark} Memory benchmark measures memory overhead of an allocator. It allocates a number of dynamic
+ objects. Then, by reading /proc/self/maps, gets the total memory that the allocator has requested from the OS. It
+ calculates the memory overhead by taking the difference between the memory the allocator has requested from the OS and the
+ memory that the program has allocated.
+
+\begin{figure}
+\centering
+\includegraphics[width=1\textwidth]{figures/bench-memory.eps}
+\caption{Benchmark Memory Overhead}
+\label{fig:benchMemoryFig}
+\end{figure}
+
+Figure \ref{fig:benchMemoryFig} gives a flow of the memory benchmark. It creates a producer-consumer scenario with K producers
+ and each producer has M consumers. A producer has a separate buffer for each consumer. A producer allocates N objects of
+ random sizes following the given distribution for each consumer. The consumer frees those objects. After every memory
+ operation, program memory usage is recorded throughout the runtime. This data can then be used to visualize the memory
+ usage and consumption of the program.
+
+Different knobs can be adjusted to set a certain thread model.\\
+-threadA :  sets number of alloc threads (producers) for mem benchmark\\
+-consumeS:  sets production and consumption round size\\
+-threadF :  sets number of free threads (consumers) for mem benchmark
+
+Object allocation size can be changed using the knobs:\\
+-maxS    :  sets max object size\\
+-minS    :  sets min object size\\
+-stepS   :  sets object size increment\\
+-distroS :  sets object size distribution\\
+-objN    :  sets number of objects per thread
+
+\section{Speed Benchmark} Speed benchmark measures the runtime speed of an allocator (FIX ME: cite allocator routines).
+ Speed benchmark measures runtime speed of individual memory allocation routines. It also considers different
+ allocation chains to measure the performance of the allocator by combining multiple allocation routines in a chain.
+ It uses the following chains and measures allocator runtime speed against them:
 \begin{itemize}
-\item
-Performance matrices of memory allocation.
-\item
-Aim of micro benchmark suite.
-
------ SHOULD WE GIVE IMPLEMENTATION DETAILS HERE? -----
-
-\PAB{For the benchmarks, yes.}
-\item
-A complete list of benchmarks in micro benchmark suite.
-\item
-One detailed section for each benchmark in micro benchmark suite including:
-
-\begin{itemize}
-\item
-The introduction of the benchmark.
-\item
-Figure.
-\item
-Results with popular memory allocators.
+\item malloc 0
+\item free NULL
+\item malloc
+\item realloc
+\item free
+\item calloc
+\item malloc-free
+\item realloc-free
+\item calloc-free
+\item malloc-realloc
+\item calloc-realloc
+\item malloc-realloc-free
+\item calloc-realloc-free
+\item malloc-realloc-free-calloc
 \end{itemize}
 
-\item
-Summarize performance of current memory allocators.
-\end{itemize}
-
-\noindent
-====================
-
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Performance Matrices
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-
-
-\section{Benchmarks}
-There are multiple benchmarks that are built individually and evaluate different aspects of a memory allocator. But, there is not standard set of benchamrks that can be used to evaluate multiple aspects of memory allocators.
-
-\paragraph{threadtest}
-(FIX ME: cite benchmark and hoard) Each thread repeatedly allocates and then deallocates 100,000 objects. Runtime of the benchmark evaluates its efficiency.
-
-\paragraph{shbench}
-(FIX ME: cite benchmark and hoard) Each thread allocates and randomly frees a number of random-sized objects. It is a stress test that also uses runtime to determine efficiency of the allocator.
-
-\paragraph{larson}
-(FIX ME: cite benchmark and hoard) Larson simulates a server environment. Multiple threads are created where each thread allocator and free a number of objects within a size range. Some objects are passed from threads to the child threads to free. It caluculates memory operations per second as an indicator of memory allocator's performance.
-
-
-\section{Performance Matrices of Memory Allocators}
-
-When it comes to memory allocators, there are no set standards of performance. Performance of a memory allocator depends highly on the usage pattern of the application. A memory allocator that is the best performer for a certain application X might be the worst for some other application which has completely different memory usage pattern compared to the application X. It is extremely difficult to make one universally best memory allocator which will outperform every other memory allocator for every usage pattern. So, there is a lack of a set of standard benchmarks that are used to evaluate a memory allocators's performance.
-
-If we breakdown the goals of a memory allocator, there are two basic matrices on which a memory allocator's performance is evaluated.
-\begin{enumerate}
-\item
-Memory Overhead
-\item
-Speed
-\end{enumerate}
-
-\subsection{Memory Overhead}
-Memory overhead is the extra memory that a memory allocator takes from OS which is not requested by the application. Ideally, an allocator should get just enough memory from OS that can fulfill application's request and should return this memory to OS as soon as applications frees it. But, allocators retain more memory compared to what application has asked for which causes memory overhead. Memory overhead can happen for various reasons.
-
-\subsubsection{Fragmentation}
-Fragmentation is one of the major reasons behind memory overhead. Fragmentation happens because of situations that are either necassary for proper functioning of the allocator such as internal memory management and book-keeping or are out of allocator's control such as application's usage pattern.
-
-\paragraph{Internal Fragmentation}
-For internal book-keeping, allocators divide raw memory given by OS into chunks, blocks, or lists that can fulfill application's requested size. Allocators use memory given by OS for creating headers, footers etc. to store information about these chunks, blocks, or lists. This increases usage of memory in-addition to the memory requested by application as the allocators need to store their book-keeping information. This extra usage of memory for allocator's own book-keeping is called Internal Fragmentation. Although it cases memory overhead but this overhead is necassary for an allocator's proper funtioning.
-
-*** FIX ME: Insert a figure of internal fragmentation with explanation
-
-\paragraph{External Fragmentation}
-External fragmentation is the free bits of memory between or around chunks of memory that are currently in-use of the application. Segmentation in memory due to application's usage pattern causes external fragmentation. The memory which is part of external fragmentation is completely free as it is neither used by allocator's internal book-keeping nor by the application. Ideally, an allocator should return a segment of memory back to the OS as soon as application frees it. But, this is not always the case. Allocators get memory from OS in one of the two ways.
-
-\begin{itemize}
-\item
-MMap: an allocator can ask OS for whole pages in mmap area. Then, the allocator segments the page internally and fulfills application's request.
-\item
-Heap: an allocator can ask OS for memory in heap area using system calls such as sbrk. Heap are grows downwards and shrinks upwards.
-\begin{itemize}
-\item
-If an allocator uses mmap area, it can only return extra memory back to OS if the whole page is free i.e. no chunk on the page is in-use of the application. Even if one chunk on the whole page is currently in-use of the application, the allocator has to retain the whole page.
-\item
-If an allocator uses the heap area, it can only return the continous free memory at the end of the heap area that is currently in allocator's possession as heap area shrinks upwards. If there are free bits of memory in-between chunks of memory that are currently in-use of the application, the allocator can not return these free bits.
-
-*** FIX ME: Insert a figure of above scenrio with explanation
-\item
-Even if the entire heap area is free except one small chunk at the end of heap area that is being used by the application, the allocator cannot return the free heap area back to the OS as it is not a continous region at the end of heap area.
-
-*** FIX ME: Insert a figure of above scenrio with explanation
-
-\item
-Such scenerios cause external fragmentation but it is out of the allocator's control and depend on application's usage pattern.
-\end{itemize}
-\end{itemize}
-
-\subsubsection{Internal Memory Management}
-Allocators such as je-malloc (FIX ME: insert reference) pro-actively get some memory from the OS and divide it into chunks of certain sizes that can be used in-future to fulfill application's request. This causes memory overhead as these chunks are made before application's request. There is also the possibility that an application may not even request memory of these sizes during their whole life-time.
-
-*** FIX ME: Insert a figure of above scenrio with explanation
-
-Allocators such as rp-malloc (FIX ME: insert reference) maintain lists or blocks of sized memory segments that is freed by the application for future use. These lists are maintained without any guarantee that application will even request these sizes again.
-
-Such tactics are usually used to gain speed as allocator will not have to get raw memory from OS and manage it at the time of application's request but they do cause memory overhead.
-
-Fragmentation and managed sized chunks of free memory can lead to Heap Blowup as the allocator may not be able to use the fragments or sized free chunks of free memory to fulfill application's requests of other sizes.
-
-\subsection{Speed}
-When it comes to performance evaluation of any piece of software, its runtime is usually the first thing that is evaluated. The same is true for memory allocators but, in case of memory allocators, speed does not only mean the runtime of memory allocator's routines but there are other factors too.
-
-\subsubsection{Runtime Speed}
-Low runtime is the main goal of a memory allocator when it comes it proving its speed. Runtime is the time that it takes for a routine of memory allocator to complete its execution. As mentioned in (FIX ME: refernce to routines' list), there four basic routines that are used in memory allocation. Ideally, each routine of a memory allocator should be fast. Some memory allocator designs use pro-active measures (FIX ME: local refernce) to gain speed when allocating some memory to the application. Some memory allocators do memory allocation faster than memory freeing (FIX ME: graph refernce) while others show similar speed whether memory is allocated or freed.
-
-\subsubsection{Memory Access Speed}
-Runtime speed is not the only speed matrix in memory allocators. The memory that a memory allocator has allocated to the application also needs to be accessible as quick as possible. The application should be able to read/write allocated memory quickly. The allocation method of a memory allocator may introduce some delays when it comes to memory access speed, which is specially important in concurrent applications. Ideally, a memory allocator should allocate all memory on a cache-line to only one thread and no cache-line should be shared among multiple threads. If a memory allocator allocates memory to multple threads on a same cache line, then cache may get invalidated more frequesntly when two different threads running on two different processes will try to read/write the same memory region. On the other hand, if one cache-line is used by only one thread then the cache may get invalidated less frequently. This sharing of one cache-line among multiple threads is called false sharing (FIX ME: cite wasik).
-
-\paragraph{Active False Sharing}
-Active false sharing is the sharing of one cache-line among multiple threads that is caused by memory allocator. It happens when two threads request memory from memory allocator and the allocator allocates memory to both of them on the same cache-line. After that, if the threads are running on different processes who have their own caches and both threads start reading/writing the allocated memory simultanously, their caches will start getting invalidated every time the other thread writes something to the memory. This will cause the application to slow down as the process has to load cache much more frequently.
-
-*** FIX ME: Insert a figure of above scenrio with explanation
-
-\paragraph{Passive False Sharing}
-Passive false sharing is the kind of false sharing which is caused by the application and not the memory allocator. The memory allocator may preservce passive false sharing in future instead of eradicating it. But, passive false sharing is initiated by the application.
-
-\subparagraph{Program Induced Passive False Sharing}
-Program induced false sharing is completely out of memory allocator's control and is purely caused by the application. When a thread in the application creates multiple objects in the dynamic area and allocator allocates memory for these objects on the same cache-line as the objects are created by the same thread. Passive false sharing will occur if this thread passes one of these objects to another thread but it retains the rest of these objects or it passes some/all of the remaining objects to some third thread(s). Now, one cache-line is shared among multiple threads but it is caused by the application and not the allocator. It is out of allocator's control and has the similar performance impact as Active False Sharing (FIX ME: cite local) if these threads, who are sharing the same cache-line, start reading/writing the given objects simultanously.
-
-*** FIX ME: Insert a figure of above scenrio 1 with explanation
-
-*** FIX ME: Insert a figure of above scenrio 2 with explanation
-
-\subparagraph{Program Induced Allocator Preserved Passive False Sharing}
-Program induced allocator preserved passive false sharing is another interesting case of passive false sharing. Both the application and the allocator are partially responsible for it. It starts the same as Program Induced False Sharing (FIX ME: cite local). Once, an application thread has created multiple dynamic objects on the same cache-line and ditributed these objects among multiple threads causing sharing of one cache-line among multiple threads (Program Induced Passive False Sharing). This kind of false sharing occurs when one of these threads, which got the object on the shared cache-line, frees the passed object then re-allocates another object but the allocator returns the same object (on the shared cache-line) that this thread just freed. Although, the application caused the false sharing to happen in the frst place however, to prevent furthur false sharing, the allocator should have returned the new object on some other cache-line which is only shared by the allocating thread. When it comes to performnce impact, this passive false sharing will slow down the application just like any other kind of false sharing if the threads sharing the cache-line start reading/writing the objects simultanously.
-
-
-*** FIX ME: Insert a figure of above scenrio with explanation
-
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Micro Benchmark Suite
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-
-\section{Micro Benchmark Suite}
-The aim of micro benchmark suite is to create a set of programs that can evaluate a memory allocator based on the performance matrices described in (FIX ME: local cite). These programs can be taken as a standard to benchmark an allocator's basic goals. These programs give details of an allocator's memory overhead and speed under a certain allocation pattern. The speed of the allocator is benchmarked in different ways. Similarly, false sharing happening in an allocator is also measured in multiple ways. These benchmarks evalute the allocator under a certain allocation pattern which is configurable and can be changed using a few knobs to benchmark observe an allocator's performance under a desired allocation pattern.
-
-Micro Benchmark Suite benchmarks an allocator's performance by allocating dynamic objects and, then, measuring specifc matrices. The benchmark suite evaluates an allocator with a certain allocation pattern. Bnechmarks have different knobs that can be used to change allocation pattern and evaluate an allocator under desired conditions. These can be set by giving commandline arguments to the benchmark on execution.
-
-Following is the list of avalable knobs.
-
-*** FIX ME: Add knobs items after finalize
-
-\subsection{Memory Benchmark}
-Memory benchmark measures memory overhead of an allocator. It allocates a number of dynamic objects. Then, by reading /self/proc/maps, gets the total memory that the allocator has reuested from the OS. Finally, it calculates the memory head by taking the difference between the memory the allocator has requested from the OS and the memory that program has allocated.
-*** FIX ME: Insert a figure of above benchmark with description
-
-\paragraph{Relevant Knobs}
-*** FIX ME: Insert Relevant Knobs
-
-\subsection{Speed Benchmark}
-Speed benchmark calculates the runtime speed of an allocator's functions (FIX ME: cite allocator routines). It does by measuring the runtime of allocator routines in two different ways.
-
-\subsubsection{Speed Time}
-The time method does a certain amount of work by calling each routine of the allocator (FIX ME: cite allocator routines) a specific time. It calculates the total time it took to perform this workload. Then, it divides the time it took by the workload and calculates the average time taken by the allocator's routine.
-*** FIX ME: Insert a figure of above benchmark with description
-
-\paragraph{Relevant Knobs}
-*** FIX ME: Insert Relevant Knobs
-
-\subsubsection{Speed Workload}
-The worload method uses the opposite approach. It calls the allocator's routines for a specific amount of time and measures how much work was done during that time. Then, similar to the time method, it divides the time by the workload done during that time and calculates the average time taken by the allocator's routine.
-*** FIX ME: Insert a figure of above benchmark with description
-
-\paragraph{Relevant Knobs}
-*** FIX ME: Insert Relevant Knobs
-
-\subsection{Cache Scratch}
-Cache Scratch benchmark measures program induced allocator preserved passive false sharing (FIX ME CITE) in an allocator. It does so in two ways.
-
-\subsubsection{Cache Scratch Time}
-Cache Scratch Time allocates dynamic objects. Then, it benchmarks program induced allocator preserved passive false sharing (FIX ME CITE) in an allocator by measuring the time it takes to read/write these objects.
-*** FIX ME: Insert a figure of above benchmark with description
-
-\paragraph{Relevant Knobs}
-*** FIX ME: Insert Relevant Knobs
-
-\subsubsection{Cache Scratch Layout}
-Cache Scratch Layout also allocates dynamic objects. Then, it benchmarks program induced allocator preserved passive false sharing (FIX ME CITE) by using heap addresses returned by the allocator. It calculates how many objects were allocated to different threads on the same cache line.
-*** FIX ME: Insert a figure of above benchmark with description
-
-\paragraph{Relevant Knobs}
-*** FIX ME: Insert Relevant Knobs
-
-\subsection{Cache Thrash}
-Cache Thrash benchmark measures allocator induced passive false sharing (FIX ME CITE) in an allocator. It also does so in two ways.
-
-\subsubsection{Cache Thrash Time}
-Cache Thrash Time allocates dynamic objects. Then, it benchmarks allocator induced false sharing (FIX ME CITE) in an allocator by measuring the time it takes to read/write these objects.
-*** FIX ME: Insert a figure of above benchmark with description
-
-\paragraph{Relevant Knobs}
-*** FIX ME: Insert Relevant Knobs
-
-\subsubsection{Cache Thrash Layout}
-Cache Thrash Layout also allocates dynamic objects. Then, it benchmarks allocator induced false sharing (FIX ME CITE) by using heap addresses returned by the allocator. It calculates how many objects were allocated to different threads on the same cache line.
-*** FIX ME: Insert a figure of above benchmark with description
-
-\paragraph{Relevant Knobs}
-*** FIX ME: Insert Relevant Knobs
+\begin{figure}
+\centering
+\includegraphics[width=1\textwidth]{figures/bench-speed.eps}
+\caption{Benchmark Speed}
+\label{fig:benchSpeedFig}
+\end{figure}
+
+As laid out in figure \ref{fig:benchSpeedFig}, each chain is measured separately. Each routine in the chain is called for N objects, and
+ the allocated objects are then used when calling the next routine in the allocation chain. This way we can measure the
+ complete latency of the memory allocator when multiple routines are chained together, e.g., malloc-realloc-free-calloc gives
+ us the whole picture of the major allocation routines when combined together in a chain.
+
+For each chain, the time taken is recorded, which can then be used to visualize the performance of a memory allocator on
+each chain.
+
+The number of worker threads can be adjusted using the command-line argument -threadN.
+
+\section{Churn Benchmark} The churn benchmark measures the overall runtime speed of an allocator in a multi-threaded
+ scenario where each thread extensively allocates and frees dynamic memory.
+
+\begin{figure}
+\centering
+\includegraphics[width=1\textwidth]{figures/bench-churn.eps}
+\caption{Benchmark Churn}
+\label{fig:benchChurnFig}
+\end{figure}
+
+Figure \ref{fig:benchChurnFig} illustrates the churn benchmark.
+ This benchmark creates a buffer with M spots and starts K threads. Each thread randomly picks a
+ spot out of the M spots, frees the object currently at that spot, and allocates a new object for that spot. Each thread
+ repeats this cycle N times. The main thread measures the total time taken for the whole benchmark, and that time is
+ used to evaluate the memory allocator's performance.
+
+Only malloc and free are used to allocate and free an object, to eliminate any extra cost such as the memcpy in realloc.
+Using malloc/free alone allows us to measure the latency of memory allocation without paying any extra cost. Churn simulates a
+memory-intensive program that can be tuned to create different scenarios.
+
+The following command-line arguments can be used to tune the benchmark.\\
+-threadN :  sets number of threads, K\\
+-cSpots  :  sets number of spots for churn, M\\
+-objN    :  sets number of objects per thread, N\\
+-maxS    :  sets max object size\\
+-minS    :  sets min object size\\
+-stepS   :  sets object size increment\\
+-distroS :  sets object size distribution
+
+\section{Cache Thrash}\label{sec:benchThrashSec} The cache thrash benchmark measures allocator induced active false sharing
+ in an allocator, as illustrated in figure \ref{f:AllocatorInducedActiveFalseSharing}.
+ If the memory allocator allocates memory for multiple threads on
+ the same cache line, this can slow down the program. If two threads that share one cache line frequently
+ read/write their objects on that cache line concurrently, this can cause a cache miss every time a thread accesses
+ its object, as the other thread might have written to its own memory location on the same cache line.
+
+\begin{figure}
+\centering
+\includegraphics[width=1\textwidth]{figures/bench-cache-thrash.eps}
+\caption{Benchmark Allocator Induced Active False Sharing}
+\label{fig:benchThrashFig}
+\end{figure}
+
+Cache thrash tries to create a scenario that should lead to allocator induced false sharing if the underlying memory
+allocator is allocating dynamic memory to multiple threads on the same cache lines. Ideally, a memory allocator should
+distance the dynamic memory region of one thread from those of other threads. Having multiple threads allocate small objects
+simultaneously should cause the memory allocator to allocate objects for multiple threads on the same cache line if it is
+not distancing the memory among different threads.
+
+Figure \ref{fig:benchThrashFig} lays out the flow of the cache thrash benchmark.
+ It creates K worker threads. Each worker thread allocates an object and intensively reads/writes
+ it M times to invalidate cache lines frequently, slowing down other threads that might be sharing this cache line
+ with it. Each thread repeats this N times. The main thread measures the total time taken for all worker threads to
+ complete. Worker threads sharing cache lines with each other take longer.
+
+Different cache access scenarios can be created using the following command-line arguments.\\
+-threadN :  sets number of threads, K\\
+-cacheIt :  iterations for cache benchmark, N\\
+-cacheRep:  repetitions for cache benchmark, M\\
+-cacheObj:  object size for cache benchmark
+
+\section{Cache Scratch} The cache scratch benchmark measures allocator induced passive false sharing in an allocator. An
+ allocator can unintentionally induce false sharing depending upon its management of freed objects, as described in
+ figure \ref{f:AllocatorInducedPassiveFalseSharing}. If a thread A allocates multiple objects together, they are
+  possibly allocated on the same cache line by the memory allocator. If thread A now passes one of these objects to another
+  thread B, then the two threads share the same cache line, but this scenario is not induced by the allocator.
+  Instead, the program induced this situation. If thread B then frees this object and
+  allocates an object of the same size, the allocator may return the same object, which is on a cache line shared
+  with thread A. This false sharing is now caused by the memory allocator, although it was started by the
+  program.
+
+\begin{figure}
+\centering
+\includegraphics[width=1\textwidth]{figures/bench-cache-scratch.eps}
+\caption{Benchmark Program Induced Passive False Sharing}
+\label{fig:benchScratchFig}
+\end{figure}
+
+The cache scratch main thread induces false sharing and creates a scenario that should make the memory allocator preserve the
+ program-induced false sharing if it does not return a freed object to its owner thread and, instead, re-uses it
+ instantly. An allocator using object ownership, as described in section \ref{s:Ownership}, would be less susceptible to allocator induced passive
+ false sharing. If the object is returned to the thread that owns it or originally allocated it, then thread B
+ gets a new object that is less likely to be on the same cache line as thread A.
+
+As in figure \ref{fig:benchScratchFig}, cache scratch allocates K dynamic objects together, one for each of the K worker threads,
+ possibly causing the memory allocator to allocate these objects on the same cache line. Then it creates K worker threads and passes
+ one of the K allocated objects to each of the K threads. Each worker thread frees the object passed by the main thread.
+ Then, it allocates an object and reads/writes it repeatedly M times, causing frequent cache invalidations. Each worker
+ repeats this N times.
+
+Each thread allocating an object after freeing the original object passed by the main thread should cause the memory
+allocator to return the same object that was initially allocated by the main thread, if the allocator did not return the
+initial object back to its owner (the main thread). Then, intensive reads/writes on the shared cache line by multiple threads
+should slow down the worker threads due to high cache invalidations and misses. The main thread measures the total time
+taken for all the workers to complete.
+
+Similar to the cache thrash benchmark in section \ref{sec:benchThrashSec}, different cache access scenarios can be created using the following command-line arguments.\\
+-threadN :  sets number of threads, K\\
+-cacheIt :  iterations for cache benchmark, N\\
+-cacheRep:  repetitions for cache benchmark, M\\
+-cacheObj:  object size for cache benchmark
Index: doc/theses/mubeen_zulfiqar_MMath/dofree.tex
===================================================================
--- doc/theses/mubeen_zulfiqar_MMath/dofree.tex	(revision 2686bc769528b67434ff02c007d4554d9630fa2f)
+++ doc/theses/mubeen_zulfiqar_MMath/dofree.tex	(revision 2686bc769528b67434ff02c007d4554d9630fa2f)
@@ -0,0 +1,34 @@
+Algorithm~\ref{alg:heapObjectFreeOwn} shows how a free request is fulfilled if object ownership is turned on. Algorithm~\ref{alg:heapObjectFreeNoOwn} shows how the same free request is fulfilled without object ownership.
+
+\begin{algorithm}
+\caption{Dynamic object free at address A with object ownership}\label{alg:heapObjectFreeOwn}
+\begin{algorithmic}[1]
+\If {$\textit{A was mmap-ed}$}
+	\State $\text{return A's dynamic memory to system using system call munmap}$
+\Else
+	\State $\text{B} \gets \textit{A's owner}$
+	\If {$\textit{B is thread-local heap's bucket}$}
+		\State $\text{push A to B's free-list}$
+	\Else
+		\State $\text{push A to B's away-list}$
+	\EndIf
+\EndIf
+\end{algorithmic}
+\end{algorithm}
+
+\begin{algorithm}
+\caption{Dynamic object free at address A without object ownership}\label{alg:heapObjectFreeNoOwn}
+\begin{algorithmic}[1]
+\If {$\textit{A was mmap-ed}$}
+	\State $\text{return A's dynamic memory to system using system call munmap}$
+\Else
+	\State $\text{B} \gets \textit{A's owner}$
+	\If {$\textit{B is thread-local heap's bucket}$}
+		\State $\text{push A to B's free-list}$
+	\Else
+		\State $\text{C} \gets \textit{thread local heap's bucket with same size as B}$
+		\State $\text{push A to C's free-list}$
+	\EndIf
+\EndIf
+\end{algorithmic}
+\end{algorithm}
Index: doc/theses/mubeen_zulfiqar_MMath/performance.tex
===================================================================
--- doc/theses/mubeen_zulfiqar_MMath/performance.tex	(revision 374cb11784dccbf21002ae7aee894790d8f79d65)
+++ doc/theses/mubeen_zulfiqar_MMath/performance.tex	(revision 2686bc769528b67434ff02c007d4554d9630fa2f)
@@ -1,22 +1,4 @@
 \chapter{Performance}
 \label{c:Performance}
-
-\noindent
-====================
-
-Writing Points:
-\begin{itemize}
-\item
-Machine Specification
-\item
-Allocators and their details
-\item
-Benchmarks and their details
-\item
-Results
-\end{itemize}
-
-\noindent
-====================
 
 \section{Machine Specification}
@@ -25,99 +7,440 @@
 \begin{itemize}
 \item
-AMD EPYC 7662, 64-core socket $\times$ 2, 2.0 GHz
+{\bf Nasus} AMD EPYC 7662, 64-core socket $\times$ 2, 2.0 GHz, GCC version 9.3.0
 \item
-Huawei ARM TaiShan 2280 V2 Kunpeng 920, 24-core socket $\times$ 4, 2.6 GHz
-\item
-Intel Xeon Gold 5220R, 48-core socket $\times$ 2, 2.20GHz
+{\bf Algol} Huawei ARM TaiShan 2280 V2 Kunpeng 920, 24-core socket $\times$ 4, 2.6 GHz, GCC version 9.4.0
 \end{itemize}
 
 
-\section{Existing Memory Allocators}
+\section{Existing Memory Allocators}\label{sec:curAllocatorSec}
 With dynamic allocation being an important feature of C, there are many stand-alone memory allocators that have been designed for different purposes. For this thesis, we chose 7 of the most popular and widely used memory allocators.
 
-\paragraph{dlmalloc}
-dlmalloc (FIX ME: cite allocator) is a thread-safe allocator that is single threaded and single heap. dlmalloc maintains free-lists of different sizes to store freed dynamic memory. (FIX ME: cite wasik)
-
-\paragraph{hoard}
+\subsection{dlmalloc}
+dlmalloc (FIX ME: cite allocator with download link) is a thread-safe allocator that is single threaded and single heap. dlmalloc maintains free-lists of different sizes to store freed dynamic memory. (FIX ME: cite wasik)
+\\
+\\
+{\bf Version:} 2.8.6\\
+{\bf Configuration:} Compiled with the preprocessor flag USE\_LOCKS.\\
+{\bf Compilation command:}\\
+cc -g3 -O3 -Wall -Wextra -fno-builtin-malloc -fno-builtin-calloc -fno-builtin-realloc -fno-builtin-free -fPIC -shared -DUSE\_LOCKS -o libdlmalloc.so malloc-2.8.6.c
+
+\subsection{hoard}
 Hoard (FIX ME: cite allocator) is a thread-safe allocator that is multi-threaded and using a heap layer framework. It has per-thread heaps that have thread-local free-lists, and a global shared heap. (FIX ME: cite wasik)
-
-\paragraph{jemalloc}
+\\
+\\
+{\bf Version:} 3.13\\
+{\bf Configuration:} Compiled with hoard's default configurations and Makefile.\\
+{\bf Compilation command:}\\
+make all
+
+\subsection{jemalloc}
 jemalloc (FIX ME: cite allocator) is a thread-safe allocator that uses multiple arenas. Each thread is assigned an arena. Each arena has chunks that contain contagious memory regions of same size. An arena has multiple chunks that contain regions of multiple sizes.
-
-\paragraph{ptmalloc}
-ptmalloc (FIX ME: cite allocator) is a modification of dlmalloc. It is a thread-safe multi-threaded memory allocator that uses multiple heaps. ptmalloc heap has similar design to dlmalloc's heap.
-
-\paragraph{rpmalloc}
+\\
+\\
+{\bf Version:} 5.2.1\\
+{\bf Configuration:} Compiled with jemalloc's default configurations and Makefile.\\
+{\bf Compilation command:}\\
+./autogen.sh\\
+./configure\\
+make\\
+make install
+
+\subsection{pt3malloc}
+pt3malloc (FIX ME: cite allocator) is a modification of dlmalloc. It is a thread-safe multi-threaded memory allocator that uses multiple heaps. The pt3malloc heap has a design similar to dlmalloc's heap.
+\\
+\\
+{\bf Version:} 1.8\\
+{\bf Configuration:} Compiled with pt3malloc's Makefile using option ``linux-shared''.\\
+{\bf Compilation command:}\\
+make linux-shared
+
+\subsection{rpmalloc}
 rpmalloc (FIX ME: cite allocator) is a thread-safe allocator that is multi-threaded and uses per-thread heap. Each heap has multiple size-classes and each size-class contains memory regions of the relevant size.
-
-\paragraph{tbb malloc}
+\\
+\\
+{\bf Version:} 1.4.1\\
+{\bf Configuration:} Compiled with rpmalloc's default configurations and ninja build system.\\
+{\bf Compilation command:}\\
+python3 configure.py\\
+ninja
+
+\subsection{tbb malloc}
 tbb malloc (FIX ME: cite allocator) is a thread-safe allocator that is multi-threaded and uses private heap for each thread. Each private-heap has multiple bins of different sizes. Each bin contains free regions of the same size.
-
-\paragraph{tc malloc}
-tcmalloc (FIX ME: cite allocator) is a thread-safe allocator. It uses per-thread cache to store free objects that prevents contention on shared resources in multi-threaded application. A central free-list is used to refill per-thread cache when it gets empty.
-
-
-\section{Memory Allocators}
-For these experiments, we used 7 memory allocators excluding our standalone memory allocator uHeapLmmm.
-
-\begin{tabularx}{0.8\textwidth} {
-	| >{\raggedright\arraybackslash}X
-	| >{\centering\arraybackslash}X
-	| >{\raggedleft\arraybackslash}X |
-}
-\hline
-Memory Allocator & Version     & Configurations \\
-\hline
-dl               &             &  \\
-\hline
-hoard            &             &  \\
-\hline
-je               &             &  \\
-\hline
-pt3              &             &  \\
-\hline
-rp               &             &  \\
-\hline
-tbb              &             &  \\
-\hline
-tc               &             &  \\
-\end{tabularx}
-
-%(FIX ME: complete table)
+\\
+\\
+{\bf Version:} intel tbb 2020 update 2, tbb\_interface\_version == 11102\\
+{\bf Configuration:} Compiled with tbbmalloc's default configurations and Makefile.\\
+{\bf Compilation command:}\\
+make
 
 \section{Experiment Environment}
-We conducted these experiments ... (FIX ME: what machine and which specifications to add).
-
-We used our micro becnhmark suite (FIX ME: cite mbench) to evaluate other memory allocators (FIX ME: cite above memory allocators) and our uHeapLmmm.
+We used our micro benchmark suite (FIX ME: cite mbench) to evaluate these memory allocators (section \ref{sec:curAllocatorSec}) and our own memory allocator uHeap (section \ref{sec:allocatorSec}).
 
 \section{Results}
+FIX ME: add experiment, knobs, graphs, description+analysis
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+%% CHURN
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+
+\subsection{Churn Benchmark}
+
+The churn benchmark tested memory allocators for speed under intensive dynamic-memory usage.
+
+This experiment was run with the following configuration:
+
+-maxS		 : 500
+
+-minS		 : 50
+
+-stepS		 : 50
+
+-distroS	 : fisher
+
+-objN		 : 100000
+
+-cSpots		 : 16
+
+-threadN	 : \{ 1, 2, 4, 8, 16 \} *
+
+* Each allocator was tested across different numbers of threads. The experiment was repeated for each allocator with 1, 2, 4, 8, and 16 threads by setting the -threadN option.
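The options above suggest how a churn loop is shaped: a fixed number of spots is reused for many more objects, so every new allocation forces a deallocation. The following is only a minimal single-thread sketch under that assumption. The function name churn and its body are hypothetical, not the suite's implementation, and sizes simply cycle from minS to maxS in steps of stepS instead of following the fisher distribution.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical per-thread churn loop (not the benchmark's actual code).
 * objN allocations cycle through cSpots slots; each new allocation first
 * frees the object that previously occupied the slot.
 * Returns the number of successful allocations. */
size_t churn(size_t objN, size_t cSpots, size_t minS, size_t maxS, size_t stepS) {
    assert(cSpots > 0 && minS > 0 && minS <= maxS);
    void **spots = calloc(cSpots, sizeof(void *));   /* all slots start empty (NULL) */
    if (spots == NULL) return 0;

    size_t allocated = 0;
    size_t size = minS;
    for (size_t i = 0; i < objN; i += 1) {
        size_t s = i % cSpots;               /* next slot to recycle */
        free(spots[s]);                      /* free(NULL) is a no-op on the first pass */
        spots[s] = malloc(size);
        if (spots[s] != NULL) {
            ((char *)spots[s])[0] = 1;       /* touch the storage */
            allocated += 1;
        }
        size += stepS;                       /* assumed size schedule: linear sweep */
        if (size > maxS) size = minS;
    }

    for (size_t s = 0; s < cSpots; s += 1) free(spots[s]);
    free(spots);
    return allocated;
}
```

With the configuration listed above, one thread would call churn(100000, 16, 50, 500, 50); a multi-threaded run would start -threadN threads that each execute the same loop.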
+
+Results are shown in Figure~\ref{fig:churn} for both algol and nasus.
+The x-axis shows the number of threads, and each allocator's performance at each thread count is shown in a different color.
+The y-axis shows the total time the experiment took to finish.
+
+\begin{figure}
+\centering
+    \subfigure[Algol]{ \includegraphics[width=0.9\textwidth]{evaluations/algol-perf-eps/churn} }
+    \subfigure[Nasus]{ \includegraphics[width=0.9\textwidth]{evaluations/nasus-perf-eps/churn} }
+\caption{Churn}
+\label{fig:churn}
+\end{figure}
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+%% THRASH
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+
+\subsection{Cache Thrash}
+
+The cache-thrash benchmark tested memory allocators for active false sharing.
+
+This experiment was run with the following configuration:
+
+-cacheIt 	: 1000
+
+-cacheRep	: 1000000
+
+-cacheObj	: 1
+
+-threadN 	: \{ 1, 2, 4, 8, 16 \} *
+
+* Each allocator was tested across different numbers of threads. The experiment was repeated for each allocator with 1, 2, 4, 8, and 16 threads by setting the -threadN option.
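Active false sharing arises when the allocator itself places objects requested by different threads in the same cache line. As a rough illustration only, the sketch below lets two threads each allocate a one-byte object and write to it repeatedly, then reports whether the two objects landed in one line. The names Work, thrash_worker, same_line, and thrash_pair, and the 64-byte line size, are assumptions for this sketch and are not taken from the benchmark suite.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

enum { CACHE_LINE = 64 };                    /* assumed cache-line size in bytes */

typedef struct {
    volatile unsigned char *obj;             /* object allocated by the thread itself */
    long reps;                               /* number of writes to perform */
} Work;

static void *thrash_worker(void *arg) {
    Work *w = arg;
    w->obj = malloc(1);                      /* small object from this thread's allocation */
    assert(w->obj != NULL);
    for (long r = 0; r < w->reps; r += 1) *w->obj += 1;   /* repeated writes */
    return NULL;
}

/* Returns 1 if both addresses fall in the same cache line, else 0. */
int same_line(const volatile void *a, const volatile void *b) {
    return (uintptr_t)a / CACHE_LINE == (uintptr_t)b / CACHE_LINE;
}

/* Runs two writer threads; returns 1 if their objects share a cache line,
 * i.e., the allocator exposed them to active false sharing. */
int thrash_pair(long reps) {
    Work w[2] = { { NULL, reps }, { NULL, reps } };
    pthread_t t[2];
    for (int i = 0; i < 2; i += 1) {
        int rc = pthread_create(&t[i], NULL, thrash_worker, &w[i]);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < 2; i += 1) pthread_join(t[i], NULL);
    int shared = same_line(w[0].obj, w[1].obj);
    free((void *)w[0].obj);
    free((void *)w[1].obj);
    return shared;
}
```

Whether thrash_pair reports sharing depends on the allocator under test; a timing-based benchmark would instead observe the slowdown caused by the contended line.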
+
+Results are shown in Figure~\ref{fig:cacheThrash} for both algol and nasus.
+The x-axis shows the number of threads, and each allocator's performance at each thread count is shown in a different color.
+The y-axis shows the total time the experiment took to finish.
+
+\begin{figure}
+\centering
+    \subfigure[Algol]{ \includegraphics[width=0.9\textwidth]{evaluations/algol-perf-eps/cache-time-0-thrash} }
+    \subfigure[Nasus]{ \includegraphics[width=0.9\textwidth]{evaluations/nasus-perf-eps/cache-time-0-thrash} }
+\caption{Cache Thrash}
+\label{fig:cacheThrash}
+\end{figure}
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+%% SCRATCH
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+
+\subsection{Cache Scratch}
+
+The cache-scratch benchmark tested memory allocators for program-induced, allocator-preserved passive false sharing.
+
+This experiment was run with the following configuration:
+
+-cacheIt 	: 1000
+
+-cacheRep	: 1000000
+
+-cacheObj	: 1
+
+-threadN 	: \{ 1, 2, 4, 8, 16 \} *
+
+* Each allocator was tested across different numbers of threads. The experiment was repeated for each allocator with 1, 2, 4, 8, and 16 threads by setting the -threadN option.
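In the passive case the program creates the sharing: one thread allocates neighbouring objects and hands one to each worker. The allocator preserves the sharing if a worker frees its object and the replacement allocation comes from the same cache line. The sketch below illustrates this sequence under stated assumptions. The names Slot, scratch_worker, and scratch_pair and the 64-byte line size are hypothetical, and the real benchmark measures time rather than inspecting addresses.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

enum { LINE = 64 };                          /* assumed cache-line size in bytes */

typedef struct {
    char *obj;                               /* object passed in by the program */
    uintptr_t before, after;                 /* address before free, after re-allocation */
} Slot;

static void *scratch_worker(void *arg) {
    Slot *s = arg;
    s->before = (uintptr_t)s->obj;           /* address handed over by the program */
    free(s->obj);                            /* worker frees the passed object ... */
    s->obj = malloc(1);                      /* ... and allocates a replacement */
    assert(s->obj != NULL);
    s->obj[0] = 1;                           /* touch the replacement */
    s->after = (uintptr_t)s->obj;
    return NULL;
}

/* Returns 1 if the two replacement objects still share a cache line,
 * i.e., the allocator preserved the program-induced sharing. */
int scratch_pair(void) {
    Slot s[2];
    pthread_t t[2];
    for (int i = 0; i < 2; i += 1) {         /* program allocates both objects in one thread */
        s[i].obj = malloc(1);
        assert(s[i].obj != NULL);
    }
    for (int i = 0; i < 2; i += 1) {
        int rc = pthread_create(&t[i], NULL, scratch_worker, &s[i]);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < 2; i += 1) pthread_join(t[i], NULL);
    int shared = s[0].after / LINE == s[1].after / LINE;
    free(s[0].obj);
    free(s[1].obj);
    return shared;
}
```

An allocator that returns freed storage to the freeing thread's own free list tends to hand the same address back, which keeps the two workers on one line; the result of scratch_pair therefore varies by allocator.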
+
+Results are shown in Figure~\ref{fig:cacheScratch} for both algol and nasus.
+The x-axis shows the number of threads, and each allocator's performance at each thread count is shown in a different color.
+The y-axis shows the total time the experiment took to finish.
+
+\begin{figure}
+\centering
+    \subfigure[Algol]{ \includegraphics[width=0.9\textwidth]{evaluations/algol-perf-eps/cache-time-0-scratch} }
+    \subfigure[Nasus]{ \includegraphics[width=0.9\textwidth]{evaluations/nasus-perf-eps/cache-time-0-scratch} }
+\caption{Cache Scratch}
+\label{fig:cacheScratch}
+\end{figure}
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+%% SPEED
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+
+\subsection{Speed Benchmark}
+
+The speed benchmark tested the speed of individual allocation routines (malloc, calloc, realloc, free) and of chains of these routines.
+
+This experiment was run with the following configuration options:
+
+-threadN :  sets number of threads, K\\
+-cSpots  :  sets number of spots for churn, M\\
+-objN    :  sets number of objects per thread, N\\
+-maxS    :  sets max object size\\
+-minS    :  sets min object size\\
+-stepS   :  sets object size increment\\
+-distroS :  sets object size distribution
+
+%speed-1-malloc-null.eps
+\begin{figure}
+\centering
+\includegraphics[width=1\textwidth]{evaluations/nasus-perf-eps/speed-1-malloc-null}
+\caption{speed-1-malloc-null}
+\label{fig:speed-1-malloc-null}
+\end{figure}
+
+%speed-2-free-null.eps
+\begin{figure}
+\centering
+\includegraphics[width=1\textwidth]{evaluations/nasus-perf-eps/speed-2-free-null}
+\caption{speed-2-free-null}
+\label{fig:speed-2-free-null}
+\end{figure}
+
+%speed-3-malloc.eps
+\begin{figure}
+\centering
+\includegraphics[width=1\textwidth]{evaluations/nasus-perf-eps/speed-3-malloc}
+\caption{speed-3-malloc}
+\label{fig:speed-3-malloc}
+\end{figure}
+
+%speed-4-realloc.eps
+\begin{figure}
+\centering
+\includegraphics[width=1\textwidth]{evaluations/nasus-perf-eps/speed-4-realloc}
+\caption{speed-4-realloc}
+\label{fig:speed-4-realloc}
+\end{figure}
+
+%speed-5-free.eps
+\begin{figure}
+\centering
+\includegraphics[width=1\textwidth]{evaluations/nasus-perf-eps/speed-5-free}
+\caption{speed-5-free}
+\label{fig:speed-5-free}
+\end{figure}
+
+%speed-6-calloc.eps
+\begin{figure}
+\centering
+\includegraphics[width=1\textwidth]{evaluations/nasus-perf-eps/speed-6-calloc}
+\caption{speed-6-calloc}
+\label{fig:speed-6-calloc}
+\end{figure}
+
+%speed-7-malloc-free.eps
+\begin{figure}
+\centering
+\includegraphics[width=1\textwidth]{evaluations/nasus-perf-eps/speed-7-malloc-free}
+\caption{speed-7-malloc-free}
+\label{fig:speed-7-malloc-free}
+\end{figure}
+
+%speed-8-realloc-free.eps
+\begin{figure}
+\centering
+\includegraphics[width=1\textwidth]{evaluations/nasus-perf-eps/speed-8-realloc-free}
+\caption{speed-8-realloc-free}
+\label{fig:speed-8-realloc-free}
+\end{figure}
+
+%speed-9-calloc-free.eps
+\begin{figure}
+\centering
+\includegraphics[width=1\textwidth]{evaluations/nasus-perf-eps/speed-9-calloc-free}
+\caption{speed-9-calloc-free}
+\label{fig:speed-9-calloc-free}
+\end{figure}
+
+%speed-10-malloc-realloc.eps
+\begin{figure}
+\centering
+\includegraphics[width=1\textwidth]{evaluations/nasus-perf-eps/speed-10-malloc-realloc}
+\caption{speed-10-malloc-realloc}
+\label{fig:speed-10-malloc-realloc}
+\end{figure}
+
+%speed-11-calloc-realloc.eps
+\begin{figure}
+\centering
+\includegraphics[width=1\textwidth]{evaluations/nasus-perf-eps/speed-11-calloc-realloc}
+\caption{speed-11-calloc-realloc}
+\label{fig:speed-11-calloc-realloc}
+\end{figure}
+
+%speed-12-malloc-realloc-free.eps
+\begin{figure}
+\centering
+\includegraphics[width=1\textwidth]{evaluations/nasus-perf-eps/speed-12-malloc-realloc-free}
+\caption{speed-12-malloc-realloc-free}
+\label{fig:speed-12-malloc-realloc-free}
+\end{figure}
+
+%speed-13-calloc-realloc-free.eps
+\begin{figure}
+\centering
+\includegraphics[width=1\textwidth]{evaluations/nasus-perf-eps/speed-13-calloc-realloc-free}
+\caption{speed-13-calloc-realloc-free}
+\label{fig:speed-13-calloc-realloc-free}
+\end{figure}
+
+%speed-14-{m,c,re}alloc-free.eps
+\begin{figure}
+\centering
+\includegraphics[width=1\textwidth]{evaluations/nasus-perf-eps/speed-14-{m,c,re}alloc-free}
+\caption{speed-14-\{m,c,re\}alloc-free}
+\label{fig:speed-14-{m,c,re}alloc-free}
+\end{figure}
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+%% MEMORY
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 
 \subsection{Memory Benchmark}
-FIX ME: add experiment, knobs, graphs, and description
-
-\subsection{Speed Benchmark}
-FIX ME: add experiment, knobs, graphs, and description
-
-\subsubsection{Speed Time}
-FIX ME: add experiment, knobs, graphs, and description
-
-\subsubsection{Speed Workload}
-FIX ME: add experiment, knobs, graphs, and description
-
-\subsection{Cache Scratch}
-FIX ME: add experiment, knobs, graphs, and description
-
-\subsubsection{Cache Scratch Time}
-FIX ME: add experiment, knobs, graphs, and description
-
-\subsubsection{Cache Scratch Layout}
-FIX ME: add experiment, knobs, graphs, and description
-
-\subsection{Cache Thrash}
-FIX ME: add experiment, knobs, graphs, and description
-
-\subsubsection{Cache Thrash Time}
-FIX ME: add experiment, knobs, graphs, and description
-
-\subsubsection{Cache Thrash Layout}
-FIX ME: add experiment, knobs, graphs, and description
+
+%mem-1-prod-1-cons-100-cfa.eps
+\begin{figure}
+\centering
+\includegraphics[width=1\textwidth]{evaluations/nasus-perf-eps/mem-1-prod-1-cons-100-cfa}
+\caption{mem-1-prod-1-cons-100-cfa}
+\label{fig:mem-1-prod-1-cons-100-cfa}
+\end{figure}
+
+%mem-1-prod-1-cons-100-dl.eps
+\begin{figure}
+\centering
+\includegraphics[width=1\textwidth]{evaluations/nasus-perf-eps/mem-1-prod-1-cons-100-dl}
+\caption{mem-1-prod-1-cons-100-dl}
+\label{fig:mem-1-prod-1-cons-100-dl}
+\end{figure}
+
+%mem-1-prod-1-cons-100-glc.eps
+\begin{figure}
+\centering
+\includegraphics[width=1\textwidth]{evaluations/nasus-perf-eps/mem-1-prod-1-cons-100-glc}
+\caption{mem-1-prod-1-cons-100-glc}
+\label{fig:mem-1-prod-1-cons-100-glc}
+\end{figure}
+
+%mem-1-prod-1-cons-100-hrd.eps
+\begin{figure}
+\centering
+\includegraphics[width=1\textwidth]{evaluations/nasus-perf-eps/mem-1-prod-1-cons-100-hrd}
+\caption{mem-1-prod-1-cons-100-hrd}
+\label{fig:mem-1-prod-1-cons-100-hrd}
+\end{figure}
+
+%mem-1-prod-1-cons-100-je.eps
+\begin{figure}
+\centering
+\includegraphics[width=1\textwidth]{evaluations/nasus-perf-eps/mem-1-prod-1-cons-100-je}
+\caption{mem-1-prod-1-cons-100-je}
+\label{fig:mem-1-prod-1-cons-100-je}
+\end{figure}
+
+%mem-1-prod-1-cons-100-pt3.eps
+\begin{figure}
+\centering
+\includegraphics[width=1\textwidth]{evaluations/nasus-perf-eps/mem-1-prod-1-cons-100-pt3}
+\caption{mem-1-prod-1-cons-100-pt3}
+\label{fig:mem-1-prod-1-cons-100-pt3}
+\end{figure}
+
+%mem-1-prod-1-cons-100-rp.eps
+\begin{figure}
+\centering
+\includegraphics[width=1\textwidth]{evaluations/nasus-perf-eps/mem-1-prod-1-cons-100-rp}
+\caption{mem-1-prod-1-cons-100-rp}
+\label{fig:mem-1-prod-1-cons-100-rp}
+\end{figure}
+
+%mem-1-prod-1-cons-100-tbb.eps
+\begin{figure}
+\centering
+\includegraphics[width=1\textwidth]{evaluations/nasus-perf-eps/mem-1-prod-1-cons-100-tbb}
+\caption{mem-1-prod-1-cons-100-tbb}
+\label{fig:mem-1-prod-1-cons-100-tbb}
+\end{figure}
+
+%mem-4-prod-4-cons-100-cfa.eps
+\begin{figure}
+\centering
+\includegraphics[width=1\textwidth]{evaluations/nasus-perf-eps/mem-4-prod-4-cons-100-cfa}
+\caption{mem-4-prod-4-cons-100-cfa}
+\label{fig:mem-4-prod-4-cons-100-cfa}
+\end{figure}
+
+%mem-4-prod-4-cons-100-dl.eps
+\begin{figure}
+\centering
+\includegraphics[width=1\textwidth]{evaluations/nasus-perf-eps/mem-4-prod-4-cons-100-dl}
+\caption{mem-4-prod-4-cons-100-dl}
+\label{fig:mem-4-prod-4-cons-100-dl}
+\end{figure}
+
+%mem-4-prod-4-cons-100-glc.eps
+\begin{figure}
+\centering
+\includegraphics[width=1\textwidth]{evaluations/nasus-perf-eps/mem-4-prod-4-cons-100-glc}
+\caption{mem-4-prod-4-cons-100-glc}
+\label{fig:mem-4-prod-4-cons-100-glc}
+\end{figure}
+
+%mem-4-prod-4-cons-100-hrd.eps
+\begin{figure}
+\centering
+\includegraphics[width=1\textwidth]{evaluations/nasus-perf-eps/mem-4-prod-4-cons-100-hrd}
+\caption{mem-4-prod-4-cons-100-hrd}
+\label{fig:mem-4-prod-4-cons-100-hrd}
+\end{figure}
+
+%mem-4-prod-4-cons-100-je.eps
+\begin{figure}
+\centering
+\includegraphics[width=1\textwidth]{evaluations/nasus-perf-eps/mem-4-prod-4-cons-100-je}
+\caption{mem-4-prod-4-cons-100-je}
+\label{fig:mem-4-prod-4-cons-100-je}
+\end{figure}
+
+%mem-4-prod-4-cons-100-pt3.eps
+\begin{figure}
+\centering
+\includegraphics[width=1\textwidth]{evaluations/nasus-perf-eps/mem-4-prod-4-cons-100-pt3}
+\caption{mem-4-prod-4-cons-100-pt3}
+\label{fig:mem-4-prod-4-cons-100-pt3}
+\end{figure}
+
+%mem-4-prod-4-cons-100-rp.eps
+\begin{figure}
+\centering
+\includegraphics[width=1\textwidth]{evaluations/nasus-perf-eps/mem-4-prod-4-cons-100-rp}
+\caption{mem-4-prod-4-cons-100-rp}
+\label{fig:mem-4-prod-4-cons-100-rp}
+\end{figure}
+
+%mem-4-prod-4-cons-100-tbb.eps
+\begin{figure}
+\centering
+\includegraphics[width=1\textwidth]{evaluations/nasus-perf-eps/mem-4-prod-4-cons-100-tbb}
+\caption{mem-4-prod-4-cons-100-tbb}
+\label{fig:mem-4-prod-4-cons-100-tbb}
+\end{figure}