Index: doc/theses/mubeen_zulfiqar_MMath/background.tex
===================================================================
--- doc/theses/mubeen_zulfiqar_MMath/background.tex	(revision 65781a8cee69fed1056bc74e14bf540907b365dc)
+++ doc/theses/mubeen_zulfiqar_MMath/background.tex	(revision a1147436cd21266c5ee817bc0f3d65c60ff97316)
@@ -34,5 +34,5 @@
 \VRef[Figure]{f:AllocatorComponents} shows the two important data components for a memory allocator, management and storage, collectively called the \newterm{heap}.
 The \newterm{management data} is a data structure located at a known memory address and contains all information necessary to manage the storage data.
-The management data starts with fixed-sized information in the static-data memory that flows into the dynamic-allocation memory.
+The management data starts with fixed-sized information in the static-data memory that references components in the dynamic-allocation memory.
 The \newterm{storage data} is composed of allocated and freed objects, and \newterm{reserved memory}.
 Allocated objects (white) are variable sized, and allocated and maintained by the program;
@@ -44,5 +44,5 @@
 \label{f:AllocatorComponents}
 \end{figure}
-Freed objects (light grey) are memory deallocated by the program, which are linked into one or more lists facilitating easy location for new allocations.
+Freed objects (light grey) represent memory deallocated by the program, which are linked into one or more lists facilitating easy location of new allocations.
 Often the free list is chained internally so it does not consume additional storage, \ie the link fields are placed at known locations in the unused memory blocks.
 Reserved memory (dark grey) is one or more blocks of memory obtained from the operating system but not yet allocated to the program;
@@ -81,5 +81,5 @@
 Fragmentation is memory requested from the operating system but not used by the program;
 hence, allocated objects are not fragmentation.
-\VRef[Figure]{f:InternalExternalFragmentation}) shows fragmentation is divided into two forms: internal or external.
+\VRef[Figure]{f:InternalExternalFragmentation} shows fragmentation is divided into two forms: internal or external.
 
 \begin{figure}
@@ -96,5 +96,5 @@
 An allocator should strive to keep internal management information to a minimum.
 
-\newterm{External fragmentation} is all memory space reserved from the operating system but not allocated to the program~\cite{Wilson95,Lim98,Siebert00}, which includes freed objects, all external management data, and reserved memory.
+\newterm{External fragmentation} is all memory space reserved from the operating system but not allocated to the program~\cite{Wilson95,Lim98,Siebert00}, which includes all external management data, freed objects, and reserved memory.
 This memory is problematic in two ways: heap blowup and highly fragmented memory.
 \newterm{Heap blowup} occurs when memory freed by the program is not reused for future allocations leading to potentially unbounded external fragmentation growth~\cite{Berger00}.
@@ -125,5 +125,5 @@
 \end{figure}
 
-For a single-threaded memory allocator, three basic approaches for controlling fragmentation have been identified~\cite{Johnstone99}.
+For a single-threaded memory allocator, three basic approaches for controlling fragmentation are identified~\cite{Johnstone99}.
 The first approach is a \newterm{sequential-fit algorithm} with one list of free objects that is searched for a block large enough to fit a requested object size.
 Different search policies determine the free object selected, \eg the first free object large enough or closest to the requested size.
@@ -132,5 +132,5 @@
 
 The second approach is a \newterm{segregated} or \newterm{binning algorithm} with a set of lists for different sized freed objects.
-When an object is allocated, the requested size is rounded up to the nearest bin-size, possibly with spacing after the object.
+When an object is allocated, the requested size is rounded up to the nearest bin-size, often leading to spacing after the object.
 A binning algorithm is fast at finding free memory of the appropriate size and allocating it, since the first free object on the free list is used.
 The fewer bin-sizes, the fewer lists need to be searched and maintained;
@@ -158,5 +158,5 @@
 Temporal locality commonly occurs during an iterative computation with a fix set of disjoint variables, while spatial locality commonly occurs when traversing an array.
 
-Hardware takes advantage of temporal and spatial locality through multiple levels of caching (\ie memory hierarchy).
+Hardware takes advantage of temporal and spatial locality through multiple levels of caching, \ie memory hierarchy.
 When an object is accessed, the memory physically located around the object is also cached with the expectation that the current and nearby objects will be referenced within a short period of time.
 For example, entire cache lines are transferred between memory and cache and entire virtual-memory pages are transferred between disk and memory.
@@ -171,5 +171,5 @@
 
 There are a number of ways a memory allocator can degrade locality by increasing the working set.
-For example, a memory allocator may access multiple free objects before finding one to satisfy an allocation request (\eg sequential-fit algorithm).
+For example, a memory allocator may access multiple free objects before finding one to satisfy an allocation request, \eg sequential-fit algorithm.
 If there are a (large) number of objects accessed in very different areas of memory, the allocator may perturb the program's memory hierarchy causing multiple cache or page misses~\cite{Grunwald93}.
 Another way locality can be degraded is by spatially separating related data.
@@ -181,5 +181,5 @@
 
 A multi-threaded memory-allocator does not run any threads itself, but is used by a multi-threaded program.
-In addition to single-threaded design issues of locality and fragmentation, a multi-threaded allocator may be simultaneously accessed by multiple threads, and hence, must deal with concurrency issues such as mutual exclusion, false sharing, and additional forms of heap blowup.
+In addition to single-threaded design issues of fragmentation and locality, a multi-threaded allocator is simultaneously accessed by multiple threads, and hence, must deal with concurrency issues such as mutual exclusion, false sharing, and additional forms of heap blowup.
 
 
@@ -192,8 +192,13 @@
 Second is when multiple threads contend for a shared resource simultaneously, and hence, some threads must wait until the resource is released.
 Contention can be reduced in a number of ways:
+\begin{itemize}[itemsep=0pt]
+\item
 using multiple fine-grained locks versus a single lock, spreading the contention across a number of locks;
+\item
 using trylock and generating new storage if the lock is busy, yielding a classic space versus time tradeoff;
+\item
 using one of the many lock-free approaches for reducing contention on basic data-structure operations~\cite{Oyama99}.
-However, all of these approaches have degenerate cases where contention occurs.
+\end{itemize}
+However, all of these approaches have degenerate cases where program contention is high, which occurs outside of the allocator.
 
 
@@ -275,5 +280,5 @@
 \label{s:MultipleHeaps}
 
-A single-threaded allocator has at most one thread and heap, while a multi-threaded allocator has potentially multiple threads and heaps.
+A multi-threaded allocator has potentially multiple threads and heaps.
 The multiple threads cause complexity, and multiple heaps are a mechanism for dealing with the complexity.
 The spectrum ranges from multiple threads using a single heap, denoted as T:1 (see \VRef[Figure]{f:SingleHeap}), to multiple threads sharing multiple heaps, denoted as T:H (see \VRef[Figure]{f:SharedHeaps}), to one thread per heap, denoted as 1:1 (see \VRef[Figure]{f:PerThreadHeap}), which is almost back to a single-threaded allocator.
@@ -339,5 +344,5 @@
 An alternative implementation is for all heaps to share one reserved memory, which requires a separate lock for the reserved storage to ensure mutual exclusion when acquiring new memory.
 Because multiple threads can allocate/free/reallocate adjacent storage, all forms of false sharing may occur.
-Other storage-management options are to use @mmap@ to set aside (large) areas of virtual memory for each heap and suballocate each heap's storage within that area.
+Other storage-management options are to use @mmap@ to set aside (large) areas of virtual memory for each heap and suballocate each heap's storage within that area, pushing part of the storage management complexity back to the operating system.
 
 \begin{figure}
@@ -368,5 +373,5 @@
 
 
-\paragraph{1:1 model (thread heaps)} where each thread has its own heap, which eliminates most contention and locking because threads seldom accesses another thread's heap (see ownership in \VRef{s:Ownership}).
+\paragraph{1:1 model (thread heaps)} where each thread has its own heap eliminating most contention and locking because threads seldom access another thread's heap (see ownership in \VRef{s:Ownership}).
 An additional benefit of thread heaps is improved locality due to better memory layout.
 As each thread only allocates from its heap, all objects for a thread are consolidated in the storage area for that heap, better utilizing each CPUs cache and accessing fewer pages.
@@ -380,5 +385,5 @@
 Second is to place the thread heap on a list of available heaps and reuse it for a new thread in the future.
 Destroying the thread heap immediately may reduce external fragmentation sooner, since all free objects are freed to the global heap and may be reused by other threads.
-Alternatively, reusing thread heaps may improve performance if the inheriting thread makes similar allocation requests as the thread that previously held the thread heap.
+Alternatively, reusing thread heaps may improve performance if the inheriting thread makes similar allocation requests as the thread that previously held the thread heap because any unfreed storage is immediately accessible..
 
 
@@ -388,5 +393,5 @@
 However, an important goal of user-level threading is for fast operations (creation/termination/context-switching) by not interacting with the operating system, which allows the ability to create large numbers of high-performance interacting threads ($>$ 10,000).
 It is difficult to retain this goal, if the user-threading model is directly involved with the heap model.
-\VRef[Figure]{f:UserLevelKernelHeaps} shows that virtually all user-level threading systems use whatever kernel-level heap-model provided by the language runtime.
+\VRef[Figure]{f:UserLevelKernelHeaps} shows that virtually all user-level threading systems use whatever kernel-level heap-model is provided by the language runtime.
 Hence, a user thread allocates/deallocates from/to the heap of the kernel thread on which it is currently executing.
 
@@ -400,6 +405,5 @@
 Adopting this model results in a subtle problem with shared heaps.
 With kernel threading, an operation that is started by a kernel thread is always completed by that thread.
-For example, if a kernel thread starts an allocation/deallocation on a shared heap, it always completes that operation with that heap even if preempted.
-Any correctness locking associated with the shared heap is preserved across preemption.
+For example, if a kernel thread starts an allocation/deallocation on a shared heap, it always completes that operation with that heap even if preempted, \ie any locking correctness associated with the shared heap is preserved across preemption.
 
 However, this correctness property is not preserved for user-level threading.
@@ -409,5 +413,5 @@
 However, eagerly disabling/enabling time-slicing on the allocation/deallocation fast path is expensive, because preemption is rare (10--100 milliseconds).
 Instead, techniques exist to lazily detect this case in the interrupt handler, abort the preemption, and return to the operation so it can complete atomically.
-Occasionally ignoring a preemption should be benign.
+Occasionally ignoring a preemption should be benign, but a persistent lack of preemption can result in both short and long term starvation.
 
 
@@ -430,6 +434,6 @@
 
 \newterm{Ownership} defines which heap an object is returned-to on deallocation.
-If a thread returns an object to the heap it was originally allocated from, the heap has ownership of its objects.
-Alternatively, a thread can return an object to the heap it is currently allocating from, which can be any heap accessible during a thread's lifetime.
+If a thread returns an object to the heap it was originally allocated from, a heap has ownership of its objects.
+Alternatively, a thread can return an object to the heap it is currently associated with, which can be any heap accessible during a thread's lifetime.
 \VRef[Figure]{f:HeapsOwnership} shows an example of multiple heaps (minus the global heap) with and without ownership.
 Again, the arrows indicate the direction memory conceptually moves for each kind of operation.
@@ -539,4 +543,5 @@
 Only with the 1:1 model and ownership is active and passive false-sharing avoided (see \VRef{s:Ownership}).
 Passive false-sharing may still occur, if delayed ownership is used.
+Finally, a completely free container can become reserved storage and be reset to allocate objects of a new size or freed to the global heap.
 
 \begin{figure}
@@ -553,10 +558,4 @@
 \caption{Free-list Structure with Container Ownership}
 \end{figure}
-
-A fragmented heap has multiple containers that may be partially or completely free.
-A completely free container can become reserved storage and be reset to allocate objects of a new size.
-When a heap reaches a threshold of free objects, it moves some free storage to the global heap for reuse to prevent heap blowup.
-Without ownership, when a heap frees objects to the global heap, individual objects must be passed, and placed on the global-heap's free-list.
-Containers cannot be freed to the global heap unless completely free because
 
 When a container changes ownership, the ownership of all objects within it change as well.
@@ -569,5 +568,5 @@
 Note, once the object is freed by Task$_1$, no more false sharing can occur until the container changes ownership again.
 To prevent this form of false sharing, container movement may be restricted to when all objects in the container are free.
-One implementation approach that increases the freedom to return a free container to the operating system involves allocating containers using a call like @mmap@, which allows memory at an arbitrary address to be returned versus only storage at the end of the contiguous @sbrk@ area.
+One implementation approach that increases the freedom to return a free container to the operating system involves allocating containers using a call like @mmap@, which allows memory at an arbitrary address to be returned versus only storage at the end of the contiguous @sbrk@ area, again pushing storage management complexity back to the operating system.
 
 \begin{figure}
@@ -700,5 +699,5 @@
 \end{figure}
 
-As mentioned, an implementation may have only one heap deal with the global heap, so the other heap can be simplified.
+As mentioned, an implementation may have only one heap interact with the global heap, so the other heap can be simplified.
 For example, if only the private heap interacts with the global heap, the public heap can be reduced to a lock-protected free-list of objects deallocated by other threads due to ownership, called a \newterm{remote free-list}.
 To avoid heap blowup, the private heap allocates from the remote free-list when it reaches some threshold or it has no free storage.
@@ -721,11 +720,11 @@
 An allocation buffer is reserved memory (see~\VRef{s:AllocatorComponents}) not yet allocated to the program, and is used for allocating objects when the free list is empty.
 That is, rather than requesting new storage for a single object, an entire buffer is requested from which multiple objects are allocated later.
-Both any heap may use an allocation buffer, resulting in allocation from the buffer before requesting objects (containers) from the global heap or operating system, respectively.
+Any heap may use an allocation buffer, resulting in allocation from the buffer before requesting objects (containers) from the global heap or operating system, respectively.
 The allocation buffer reduces contention and the number of global/operating-system calls.
 For coalescing, a buffer is split into smaller objects by allocations, and recomposed into larger buffer areas during deallocations.
 
-Allocation buffers are useful initially when there are no freed objects in a heap because many allocations usually occur when a thread starts.
+Allocation buffers are useful initially when there are no freed objects in a heap because many allocations usually occur when a thread starts (simple bump allocation).
 Furthermore, to prevent heap blowup, objects should be reused before allocating a new allocation buffer.
-Thus, allocation buffers are often allocated more frequently at program/thread start, and then their use often diminishes.
+Thus, allocation buffers are often allocated more frequently at program/thread start, and then allocations often diminish.
 
 Using an allocation buffer with a thread heap avoids active false-sharing, since all objects in the allocation buffer are allocated to the same thread.
@@ -746,13 +745,13 @@
 \label{s:LockFreeOperations}
 
-A lock-free algorithm guarantees safe concurrent-access to a data structure, so that at least one thread can make progress in the system, but an individual task has no bound to execution, and hence, may starve~\cite[pp.~745--746]{Herlihy93}.
-% A wait-free algorithm puts a finite bound on the number of steps any thread takes to complete an operation, so an individual task cannot starve
+A \newterm{lock-free algorithm} guarantees safe concurrent-access to a data structure, so that at least one thread makes progress, but an individual task has no execution bound and may starve~\cite[pp.~745--746]{Herlihy93}.
+(A \newterm{wait-free algorithm} puts a bound on the number of steps any thread takes to complete an operation to prevent starvation.)
 Lock-free operations can be used in an allocator to reduce or eliminate the use of locks.
-Locks are a problem for high contention or if the thread holding the lock is preempted and other threads attempt to use that lock.
-With respect to the heap, these situations are unlikely unless all threads makes extremely high use of dynamic-memory allocation, which can be an indication of poor design.
+While locks and lock-free data-structures often have equal performance, lock-free has the advantage of not holding a lock across preemption so other threads can continue to make progress.
+With respect to the heap, these situations are unlikely unless all threads make extremely high use of dynamic-memory allocation, which can be an indication of poor design.
 Nevertheless, lock-free algorithms can reduce the number of context switches, since a thread does not yield/block while waiting for a lock;
-on the other hand, a thread may busy-wait for an unbounded period.
+on the other hand, a thread may busy-wait for an unbounded period holding a processor.
 Finally, lock-free implementations have greater complexity and hardware dependency.
 Lock-free algorithms can be applied most easily to simple free-lists, \eg remote free-list, to allow lock-free insertion and removal from the head of a stack.
-Implementing lock-free operations for more complex data-structures (queue~\cite{Valois94}/deque~\cite{Sundell08}) is more complex.
+Implementing lock-free operations for more complex data-structures (queue~\cite{Valois94}/deque~\cite{Sundell08}) is correspondinglyy more complex.
 Michael~\cite{Michael04} and Gidenstam \etal \cite{Gidenstam05} have created lock-free variations of the Hoard allocator.
Index: doc/theses/mubeen_zulfiqar_MMath/intro.tex
===================================================================
--- doc/theses/mubeen_zulfiqar_MMath/intro.tex	(revision 65781a8cee69fed1056bc74e14bf540907b365dc)
+++ doc/theses/mubeen_zulfiqar_MMath/intro.tex	(revision a1147436cd21266c5ee817bc0f3d65c60ff97316)
@@ -48,5 +48,5 @@
 Attempts have been made to perform quasi garbage collection in C/\CC~\cite{Boehm88}, but it is a compromise.
 This thesis only examines dynamic memory-management with \emph{explicit} deallocation.
-While garbage collection and compaction are not part this work, many of the results are applicable to the allocation phase in any memory-management approach.
+While garbage collection and compaction are not part this work, many of the work's results are applicable to the allocation phase in any memory-management approach.
 
 Most programs use a general-purpose allocator, often the one provided implicitly by the programming-language's runtime.
@@ -65,14 +65,22 @@
 \begin{enumerate}[leftmargin=*]
 \item
-Implementation of a new stand-lone concurrent low-latency memory-allocator ($\approx$1,200 lines of code) for C/\CC programs using kernel threads (1:1 threading), and specialized versions of the allocator for programming languages \uC and \CFA using user-level threads running over multiple kernel threads (M:N threading).
-
-\item
-Adopt returning of @nullptr@ for a zero-sized allocation, rather than an actual memory address, both of which can be passed to @free@.
-
-\item
-Extended the standard C heap functionality by preserving with each allocation its original request size versus the amount allocated, if an allocation is zero fill, and the allocation alignment.
-
-\item
-Use the zero fill and alignment as \emph{sticky} properties for @realloc@, to realign existing storage, or preserve existing zero-fill and alignment when storage is copied.
+Implementation of a new stand-lone concurrent low-latency memory-allocator ($\approx$1,200 lines of code) for C/\CC programs using kernel threads (1:1 threading), and specialized versions of the allocator for the programming languages \uC and \CFA using user-level threads running over multiple kernel threads (M:N threading).
+
+\item
+Adopt @nullptr@ return for a zero-sized allocation, rather than an actual memory address, which can be passed to @free@.
+
+\item
+Extend the standard C heap functionality by preserving with each allocation:
+\begin{itemize}[itemsep=0pt]
+\item
+its request size plus the amount allocated,
+\item
+whether an allocation is zero fill,
+\item
+and allocation alignment.
+\end{itemize}
+
+\item
+Use the preserved zero fill and alignment as \emph{sticky} properties for @realloc@ to zero-fill and align when storage is extended or copied.
 Without this extension, it is unsafe to @realloc@ storage initially allocated with zero-fill/alignment as these properties are not preserved when copying.
 This silent generation of a problem is unintuitive to programmers and difficult to locate because it is transient.
@@ -86,5 +94,5 @@
 @resize( oaddr, alignment, size )@ re-purpose an old allocation with new alignment but \emph{without} preserving fill.
 \item
-@realloc( oaddr, alignment, size )@ same as previous @realloc@ but adding or changing alignment.
+@realloc( oaddr, alignment, size )@ same as @realloc@ but adding or changing alignment.
 \item
 @aalloc( dim, elemSize )@ same as @calloc@ except memory is \emph{not} zero filled.
@@ -96,5 +104,5 @@
 
 \item
-Provide additional heap wrapper functions in \CFA to provide a complete orthogonal set of allocation operations and properties.
+Provide additional heap wrapper functions in \CFA creating an orthogonal set of allocation operations and properties.
 
 \item
@@ -109,5 +117,5 @@
 @malloc_size( addr )@ returns the size of the memory allocation pointed-to by @addr@.
 \item
-@malloc_usable_size( addr )@ returns the usable size of the memory pointed-to by @addr@, i.e., the bin size containing the allocation, where @malloc_size( addr )@ $\le$ @malloc_usable_size( addr )@.
+@malloc_usable_size( addr )@ returns the usable (total) size of the memory pointed-to by @addr@, i.e., the bin size containing the allocation, where @malloc_size( addr )@ $\le$ @malloc_usable_size( addr )@.
 \end{itemize}