Index: doc/papers/llheap/Paper.tex
===================================================================
--- doc/papers/llheap/Paper.tex	(revision d69752711fac0e68cc2a2c774b76b636a3cff37c)
+++ doc/papers/llheap/Paper.tex	(revision 6b1c4f2b7e0190ae0c36b5f7fc2c04f4626f49f2)
@@ -305,5 +305,5 @@
 \item
 Provide additional heap operations to complete programmer expectation with respect to accessing different allocation properties.
-\begin{itemize}[itemsep=0pt,parsep=0pt]
+\begin{itemize}[topsep=3pt,itemsep=2pt,parsep=0pt]
 \item
 @resize( oaddr, size )@ re-purpose an old allocation for a new type \emph{without} preserving fill or alignment.
@@ -325,5 +325,5 @@
 \item
 Provide additional query operations to access information about an allocation:
-\begin{itemize}[itemsep=0pt]
+\begin{itemize}[topsep=3pt,itemsep=2pt,parsep=0pt]
 \item
 @malloc_alignment( addr )@ returns the alignment of the allocation pointed-to by @addr@.
@@ -339,5 +339,5 @@
 \item
 Provide complete, fast, and contention-free allocation statistics to help understand allocation behaviour:
-\begin{itemize}[itemsep=0pt]
+\begin{itemize}[topsep=3pt,itemsep=2pt,parsep=0pt]
 \item
 @malloc_stats()@ print memory-allocation statistics on the file-descriptor set by @malloc_stats_fd@.
@@ -894,23 +894,22 @@
 
 Additional restrictions may be applied to the movement of containers to prevent active false-sharing.
-For example, in Figure~\ref{f:ContainerFalseSharing1}, a container being used by Thread$_1$ changes ownership, through the global heap.
-In Figure~\ref{f:ContainerFalseSharing2}, when Thread$_2$ allocates an object from the newly acquired container it is actively false-sharing even though no objects are passed among threads.
-Note, once the object is freed by Thread$_1$, no more false sharing can occur until the container changes ownership again.
+For example, if a container changes ownership through the global heap, then a thread allocating an object from the newly acquired container is actively false-sharing, even though no objects are passed among threads.
+Note, once the thread frees the object, no more false sharing can occur until the container changes ownership again.
 To prevent this form of false sharing, container movement may be restricted to when all objects in the container are free.
 One implementation approach that increases the freedom to return a free container to the operating system involves allocating containers using a call like @mmap@, which allows memory at an arbitrary address to be returned versus only storage at the end of the contiguous @sbrk@ area, again pushing storage management complexity back to the operating system.
 
-\begin{figure}
-\centering
-\subfloat[]{
-	\input{ContainerFalseSharing1}
-	\label{f:ContainerFalseSharing1}
-} % subfloat
-\subfloat[]{
-	\input{ContainerFalseSharing2}
-	\label{f:ContainerFalseSharing2}
-} % subfloat
-\caption{Active False-Sharing using Containers}
-\label{f:ActiveFalseSharingContainers}
-\end{figure}
+% \begin{figure}
+% \centering
+% \subfloat[]{
+% 	\input{ContainerFalseSharing1}
+% 	\label{f:ContainerFalseSharing1}
+% } % subfloat
+% \subfloat[]{
+% 	\input{ContainerFalseSharing2}
+% 	\label{f:ContainerFalseSharing2}
+% } % subfloat
+% \caption{Active False-Sharing using Containers}
+% \label{f:ActiveFalseSharingContainers}
+% \end{figure}
 
 Using containers with ownership increases external fragmentation since a new container for a requested object size must be allocated separately for each thread requesting it.
@@ -975,29 +974,29 @@
 
 The container header allows an alternate approach for managing the heap's free-list.
-Rather than maintain a global free-list throughout the heap (see~Figure~\ref{f:GlobalFreeListAmongContainers}), the containers are linked through their headers and only the local free objects within a container are linked together (see~Figure~\ref{f:LocalFreeListWithinContainers}).
+Rather than maintain a global free-list throughout the heap, the containers are linked through their headers, and only the local free objects within a container are linked together.
 Note, maintaining free lists within a container assumes all free objects in the container are associated with the same heap;
 thus, this approach only applies to containers with ownership.
 
 This alternate free-list approach can greatly reduce the complexity of moving all freed objects belonging to a container to another heap.
-To move a container using a global free-list, as in Figure~\ref{f:GlobalFreeListAmongContainers}, the free list is first searched to find all objects within the container.
+To move a container using a global free-list, the free list is first searched to find all objects within the container.
 Each object is then removed from the free list and linked together to form a local free-list for the move to the new heap.
-With local free-lists in containers, as in Figure~\ref{f:LocalFreeListWithinContainers}, the container is simply removed from one heap's free list and placed on the new heap's free list.
+With local free-lists in containers, the container is simply removed from one heap's free list and placed on the new heap's free list.
 Thus, when using local free-lists, the operation of moving containers is reduced from $O(N)$ to $O(1)$.
 However, there is the additional storage cost in the header, which increases the header size, and therefore internal fragmentation.
 
-\begin{figure}
-\centering
-\subfloat[Global Free-List Among Containers]{
-	\input{FreeListAmongContainers}
-	\label{f:GlobalFreeListAmongContainers}
-} % subfloat
-\hspace{0.25in}
-\subfloat[Local Free-List Within Containers]{
-	\input{FreeListWithinContainers}
-	\label{f:LocalFreeListWithinContainers}
-} % subfloat
-\caption{Container Free-List Structure}
-\label{f:ContainerFreeListStructure}
-\end{figure}
+% \begin{figure}
+% \centering
+% \subfloat[Global Free-List Among Containers]{
+% 	\input{FreeListAmongContainers}
+% 	\label{f:GlobalFreeListAmongContainers}
+% } % subfloat
+% \hspace{0.25in}
+% \subfloat[Local Free-List Within Containers]{
+% 	\input{FreeListWithinContainers}
+% 	\label{f:LocalFreeListWithinContainers}
+% } % subfloat
+% \caption{Container Free-List Structure}
+% \label{f:ContainerFreeListStructure}
+% \end{figure}
 
 When all objects in the container are the same size, a single free-list is sufficient.
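The O(N)-to-O(1) container-move argument above can be made concrete with a small sketch. All names here (`Container`, `Heap`, `move_container`) are invented for illustration and are not llheap's actual types; the point is only that a container's local free-list travels with its header, so no heap-wide free-list scan is needed:

```c
#include <stddef.h>

typedef struct Object { struct Object * next; } Object;

typedef struct Container {
    struct Container * next;     // link in the owning heap's container list
    Object * free_list;          // local free-list of objects freed in this container
    /* object storage follows the header ... */
} Container;

typedef struct Heap { Container * containers; } Heap;

// Move a container between heaps: unlink from the source list, link into the
// destination list.  The container's free_list travels with it untouched.
// (With a doubly-linked container list even the unlink scan below disappears,
// making the whole move O(1); a global free-list would instead require an
// O(N) search for every free object belonging to the container.)
static void move_container( Heap * from, Heap * to, Container * c ) {
    Container ** p = &from->containers;
    while ( *p != c ) p = &(*p)->next;   // find-and-unlink
    *p = c->next;
    c->next = to->containers;
    to->containers = c;
}
```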
@@ -1116,83 +1115,140 @@
 \subsection{Design Choices}
 
-llheap's design was reviewed and changed multiple times throughout the work.
-Some of the rejected designs are discussed because they show the path to the final design (see discussion in Section~\ref{s:MultipleHeaps}).
-Note, a few simple tests for a design choice were compared with the current best allocators to determine the viability of a design.
-
-
-\subsubsection{Allocation Fastpath}
-\label{s:AllocationFastpath}
-
-These designs look at the allocation/free \newterm{fastpath}, \ie when an allocation can immediately return free storage or returned storage is not coalesced.
-\paragraph{T:1 model}
-Figure~\ref{f:T1SharedBuckets} shows one heap accessed by multiple kernel threads (KTs) using a bucket array, where smaller bucket sizes are shared among N KTs.
-This design leverages the fact that usually the allocation requests are less than 1024 bytes and there are only a few different request sizes.
-When KTs $\le$ N, the common bucket sizes are uncontented;
-when KTs $>$ N, the free buckets are contented and latency increases significantly.
-In all cases, a KT must acquire/release a lock, contented or uncontented, along the fast allocation path because a bucket is shared.
-Therefore, while threads are contending for a small number of buckets sizes, the buckets are distributed among them to reduce contention, which lowers latency;
-however, picking N is workload specific.
-
-\begin{figure}
-\centering
-\input{AllocDS1}
-\caption{T:1 with Shared Buckets}
-\label{f:T1SharedBuckets}
-\end{figure}
+% Some of the rejected designs are discussed because they show the path to the final design (see discussion in Section~\ref{s:MultipleHeaps}).
+% Note, a few simple tests for a design choice were compared with the current best allocators to determine the viability of a design.
+
+
+% \paragraph{T:1 model}
+% Figure~\ref{f:T1SharedBuckets} shows one heap accessed by multiple kernel threads (KTs) using a bucket array, where smaller bucket sizes are shared among N KTs.
+% This design leverages the fact that usually the allocation requests are less than 1024 bytes and there are only a few different request sizes.
+% When KTs $\le$ N, the common bucket sizes are uncontented;
+% when KTs $>$ N, the free buckets are contented and latency increases significantly.
+% In all cases, a KT must acquire/release a lock, contented or uncontented, along the fast allocation path because a bucket is shared.
+% Therefore, while threads are contending for a small number of buckets sizes, the buckets are distributed among them to reduce contention, which lowers latency;
+% however, picking N is workload specific.
+% 
+% \begin{figure}
+% \centering
+% \input{AllocDS1}
+% \caption{T:1 with Shared Buckets}
+% \label{f:T1SharedBuckets}
+% \end{figure}
+% 
+% Problems:
+% \begin{itemize}
+% \item
+% Need to know when a KT is created/destroyed to assign/unassign a shared bucket-number from the memory allocator.
+% \item
+% When no thread is assigned a bucket number, its free storage is unavailable.
+% \item
+% All KTs contend for the global-pool lock for initial allocations, before free-lists get populated.
+% \end{itemize}
+% Tests showed having locks along the allocation fast-path produced a significant increase in allocation costs and any contention among KTs produces a significant spike in latency.
+
+% \paragraph{T:H model}
+% Figure~\ref{f:THSharedHeaps} shows a fixed number of heaps (N), each a local free pool, where the heaps are sharded (distributed) across the KTs.
+% A KT can point directly to its assigned heap or indirectly through the corresponding heap bucket.
+% When KT $\le$ N, the heaps might be uncontented;
+% when KTs $>$ N, the heaps are contented.
+% In all cases, a KT must acquire/release a lock, contented or uncontented along the fast allocation path because a heap is shared.
+% By increasing N, this approach reduces contention but increases storage (time versus space);
+% however, picking N is workload specific.
+% 
+% \begin{figure}
+% \centering
+% \input{AllocDS2}
+% \caption{T:H with Shared Heaps}
+% \label{f:THSharedHeaps}
+% \end{figure}
+% 
+% Problems:
+% \begin{itemize}
+% \item
+% Need to know when a KT is created/destroyed to assign/unassign a heap from the memory allocator.
+% \item
+% When no thread is assigned to a heap, its free storage is unavailable.
+% \item
+% Ownership issues arise (see Section~\ref{s:Ownership}).
+% \item
+% All KTs contend for the local/global-pool lock for initial allocations, before free-lists get populated.
+% \end{itemize}
+% Tests showed having locks along the allocation fast-path produced a significant increase in allocation costs and any contention among KTs produces a significant spike in latency.
+
+% \paragraph{T:H model, H = number of CPUs}
+% This design is the T:H model but H is set to the number of CPUs on the computer or the number restricted to an application, \eg via @taskset@.
+% (See Figure~\ref{f:THSharedHeaps} but with a heap bucket per CPU.)
+% Hence, each CPU logically has its own private heap and local pool.
+% A memory operation is serviced from the heap associated with the CPU executing the operation.
+% This approach removes fastpath locking and contention, regardless of the number of KTs mapped across the CPUs, because only one KT is running on each CPU at a time (modulo operations on the global pool and ownership).
+% This approach is essentially an M:N approach where M is the number if KTs and N is the number of CPUs.
+% 
+% Problems:
+% \begin{itemize}
+% \item
+% Need to know when a CPU is added/removed from the @taskset@.
+% \item
+% Need a fast way to determine the CPU a KT is executing on to access the appropriate heap.
+% \item
+% Need to prevent preemption during a dynamic memory operation because of the \newterm{serially-reusable problem}.
+% \begin{quote}
+% A sequence of code that is guaranteed to run to completion before being invoked to accept another input is called serially-reusable code.~\cite{SeriallyReusable}\label{p:SeriallyReusable}
+% \end{quote}
+% If a KT is preempted during an allocation operation, the operating system can schedule another KT on the same CPU, which can begin an allocation operation before the previous operation associated with this CPU has completed, invalidating heap correctness.
+% Note, the serially-reusable problem can occur in sequential programs with preemption, if the signal handler calls the preempted function, unless the function is serially reusable.
+% Essentially, the serially-reusable problem is a race condition on an unprotected critical subsection, where the operating system is providing the second thread via the signal handler.
+% 
+% Library @librseq@~\cite{librseq} was used to perform a fast determination of the CPU and to ensure all memory operations complete on one CPU using @librseq@'s restartable sequences, which restart the critical subsection after undoing its writes, if the critical subsection is preempted.
+% \end{itemize}
+% Tests showed that @librseq@ can determine the particular CPU quickly but setting up the restartable critical-subsection along the allocation fast-path produced a significant increase in allocation costs.
+% Also, the number of undoable writes in @librseq@ is limited and restartable sequences cannot deal with user-level thread (UT) migration across KTs.
+% For example, UT$_1$ is executing a memory operation by KT$_1$ on CPU$_1$ and a time-slice preemption occurs.
+% The signal handler context switches UT$_1$ onto the user-level ready-queue and starts running UT$_2$ on KT$_1$, which immediately calls a memory operation.
+% Since KT$_1$ is still executing on CPU$_1$, @librseq@ takes no action because it assumes KT$_1$ is still executing the same critical subsection.
+% Then UT$_1$ is scheduled onto KT$_2$ by the user-level scheduler, and its memory operation continues in parallel with UT$_2$ using references into the heap associated with CPU$_1$, which corrupts CPU$_1$'s heap.
+% If @librseq@ had an @rseq_abort@ which:
+% \begin{enumerate}
+% \item
+% Marked the current restartable critical-subsection as cancelled so it restarts when attempting to commit.
+% \item
+% Do nothing if there is no current restartable critical subsection in progress.
+% \end{enumerate}
+% Then @rseq_abort@ could be called on the backside of a  user-level context-switching.
+% A feature similar to this idea might exist for hardware transactional-memory.
+% A significant effort was made to make this approach work but its complexity, lack of robustness, and performance costs resulted in its rejection.
+
+% \subsubsection{Allocation Fastpath}
+% \label{s:AllocationFastpath}
+
+llheap's design was reviewed and changed multiple times during its development.
+Only the final design choices are discussed in this paper.
+(See~\cite{Zulfiqar22} for a discussion of alternate choices and reasons for rejecting them.)
+All designs were analyzed for the allocation/free \newterm{fastpath}, \ie when an allocation can immediately return free storage or returned storage is not coalesced.
+The heap model chosen is 1:1, \ie the T:H model with T = H, where there is one thread-local heap for each KT, and no bucket or local-pool lock.
+Hence, a KT's heap is created immediately after the KT starts, and its heap is (logically) deleted just before the KT terminates.
+Heaps are uncontended for a KT's memory operations because every KT has its own thread-local heap, modulo operations on the global pool and ownership.
 
 Problems:
-\begin{itemize}
-\item
-Need to know when a KT is created/destroyed to assign/unassign a shared bucket-number from the memory allocator.
-\item
-When no thread is assigned a bucket number, its free storage is unavailable.
-\item
-All KTs contend for the global-pool lock for initial allocations, before free-lists get populated.
-\end{itemize}
-Tests showed having locks along the allocation fast-path produced a significant increase in allocation costs and any contention among KTs produces a significant spike in latency.
-
-\paragraph{T:H model}
-Figure~\ref{f:THSharedHeaps} shows a fixed number of heaps (N), each a local free pool, where the heaps are sharded (distributed) across the KTs.
-A KT can point directly to its assigned heap or indirectly through the corresponding heap bucket.
-When KT $\le$ N, the heaps might be uncontented;
-when KTs $>$ N, the heaps are contented.
-In all cases, a KT must acquire/release a lock, contented or uncontented along the fast allocation path because a heap is shared.
-By increasing N, this approach reduces contention but increases storage (time versus space);
-however, picking N is workload specific.
-
-\begin{figure}
-\centering
-\input{AllocDS2}
-\caption{T:H with Shared Heaps}
-\label{f:THSharedHeaps}
-\end{figure}
-
-Problems:
-\begin{itemize}
-\item
-Need to know when a KT is created/destroyed to assign/unassign a heap from the memory allocator.
-\item
-When no thread is assigned to a heap, its free storage is unavailable.
-\item
-Ownership issues arise (see Section~\ref{s:Ownership}).
-\item
-All KTs contend for the local/global-pool lock for initial allocations, before free-lists get populated.
-\end{itemize}
-Tests showed having locks along the allocation fast-path produced a significant increase in allocation costs and any contention among KTs produces a significant spike in latency.
-
-\paragraph{T:H model, H = number of CPUs}
-This design is the T:H model but H is set to the number of CPUs on the computer or the number restricted to an application, \eg via @taskset@.
-(See Figure~\ref{f:THSharedHeaps} but with a heap bucket per CPU.)
-Hence, each CPU logically has its own private heap and local pool.
-A memory operation is serviced from the heap associated with the CPU executing the operation.
-This approach removes fastpath locking and contention, regardless of the number of KTs mapped across the CPUs, because only one KT is running on each CPU at a time (modulo operations on the global pool and ownership).
-This approach is essentially an M:N approach where M is the number if KTs and N is the number of CPUs.
-
-Problems:
-\begin{itemize}
-\item
-Need to know when a CPU is added/removed from the @taskset@.
-\item
-Need a fast way to determine the CPU a KT is executing on to access the appropriate heap.
+\begin{itemize}[topsep=3pt,itemsep=2pt,parsep=0pt]
+\item
+Need to know when a KT starts/terminates to create/delete its heap.
+
+\noindent
+It is possible to leverage constructors/destructors for thread-local objects to get a general handle on when a KT starts/terminates.
+\item
+There is a classic \newterm{memory-reclamation} problem for ownership because storage passed to another thread can be returned to a terminated heap.
+
+\noindent
+The classic solution only deletes a heap after all referents are returned, which is complex.
+The cheap alternative is for heaps to persist for program duration to handle outstanding referent frees.
+If old referents return storage to a terminated heap, it is handled in the same way as an active heap.
+To prevent heap blowup, terminated heaps can be reused by new KTs, where a reused heap may be populated with free storage from a prior KT (external fragmentation).
+In most cases, heap blowup is not a problem because programs have a small allocation set-size, so the free storage from a prior KT is appropriate for a new KT.
+\item
+There can be significant external fragmentation as the number of KTs increases.
+
+\noindent
+In many concurrent applications, good performance is achieved with the number of KTs proportional to the number of CPUs.
+Since the number of CPUs is relatively small, and a heap is also relatively small, $\approx$10K bytes (not including any associated freed storage), the worst-case external fragmentation is still small compared to the RAM available on large servers with many CPUs.
 \item
 Need to prevent preemption during a dynamic memory operation because of the \newterm{serially-reusable problem}.
@@ -1205,52 +1261,6 @@
 
 Library @librseq@~\cite{librseq} was used to perform a fast determination of the CPU and to ensure all memory operations complete on one CPU using @librseq@'s restartable sequences, which restart the critical subsection after undoing its writes, if the critical subsection is preempted.
-\end{itemize}
-Tests showed that @librseq@ can determine the particular CPU quickly but setting up the restartable critical-subsection along the allocation fast-path produced a significant increase in allocation costs.
-Also, the number of undoable writes in @librseq@ is limited and restartable sequences cannot deal with user-level thread (UT) migration across KTs.
-For example, UT$_1$ is executing a memory operation by KT$_1$ on CPU$_1$ and a time-slice preemption occurs.
-The signal handler context switches UT$_1$ onto the user-level ready-queue and starts running UT$_2$ on KT$_1$, which immediately calls a memory operation.
-Since KT$_1$ is still executing on CPU$_1$, @librseq@ takes no action because it assumes KT$_1$ is still executing the same critical subsection.
-Then UT$_1$ is scheduled onto KT$_2$ by the user-level scheduler, and its memory operation continues in parallel with UT$_2$ using references into the heap associated with CPU$_1$, which corrupts CPU$_1$'s heap.
-If @librseq@ had an @rseq_abort@ which:
-\begin{enumerate}
-\item
-Marked the current restartable critical-subsection as cancelled so it restarts when attempting to commit.
-\item
-Do nothing if there is no current restartable critical subsection in progress.
-\end{enumerate}
-Then @rseq_abort@ could be called on the backside of a  user-level context-switching.
-A feature similar to this idea might exist for hardware transactional-memory.
-A significant effort was made to make this approach work but its complexity, lack of robustness, and performance costs resulted in its rejection.
-
-\paragraph{1:1 model}
-This design is the T:H model with T = H, where there is one thread-local heap for each KT.
-(See Figure~\ref{f:THSharedHeaps} but with a heap bucket per KT and no bucket or local-pool lock.)
-Hence, immediately after a KT starts, its heap is created and just before a KT terminates, its heap is (logically) deleted.
-Heaps are uncontended for a KTs memory operations as every KT has its own thread-local heap, modulo operations on the global pool and ownership.
-
-Problems:
-\begin{itemize}
-\item
-Need to know when a KT starts/terminates to create/delete its heap.
-
-\noindent
-It is possible to leverage constructors/destructors for thread-local objects to get a general handle on when a KT starts/terminates.
-\item
-There is a classic \newterm{memory-reclamation} problem for ownership because storage passed to another thread can be returned to a terminated heap.
-
-\noindent
-The classic solution only deletes a heap after all referents are returned, which is complex.
-The cheap alternative is for heaps to persist for program duration to handle outstanding referent frees.
-If old referents return storage to a terminated heap, it is handled in the same way as an active heap.
-To prevent heap blowup, terminated heaps can be reused by new KTs, where a reused heap may be populated with free storage from a prior KT (external fragmentation).
-In most cases, heap blowup is not a problem because programs have a small allocation set-size, so the free storage from a prior KT is apropos for a new KT.
-\item
-There can be significant external fragmentation as the number of KTs increases.
-
-\noindent
-In many concurrent applications, good performance is achieved with the number of KTs proportional to the number of CPUs.
-Since the number of CPUs is relatively small, and a heap is also relatively small, $\approx$10K bytes (not including any associated freed storage), the worst-case external fragmentation is still small compared to the RAM available on large servers with many CPUs.
-\item
-There is the same serially-reusable problem with UTs migrating across KTs.
+
+%There is the same serially-reusable problem with UTs migrating across KTs.
 \end{itemize}
 Tests showed this design produced the closest performance match with the best current allocators, and code inspection showed most of these allocators use different variations of this approach.
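One plausible realization of the two mechanisms above, leveraging destructors for thread-local objects to detect KT termination and persisting terminated heaps on a reuse list, is sketched below. All names (`Heap`, `my_heap`, `retire_heap`, `reuse_list`) are invented for illustration, and a POSIX thread-specific key stands in for whatever thread-local facility llheap actually uses:

```c
#include <pthread.h>
#include <stdlib.h>

typedef struct Heap { struct Heap * next; /* buckets, local pool ... */ } Heap;

static pthread_key_t heap_key;
static Heap * reuse_list = NULL;   // terminated heaps, persisting for reuse by new KTs
static pthread_mutex_t reuse_lock = PTHREAD_MUTEX_INITIALIZER;

// Key destructor: runs when a KT terminates, (logically) deleting its heap by
// retiring it onto the reuse list, where it keeps accepting referent frees.
static void retire_heap( void * h ) {
    pthread_mutex_lock( &reuse_lock );
    ((Heap *)h)->next = reuse_list;  reuse_list = (Heap *)h;
    pthread_mutex_unlock( &reuse_lock );
}

static void make_key( void ) { pthread_key_create( &heap_key, retire_heap ); }

// Fastpath: a thread-local lookup; the first memory operation on a KT lazily
// creates its heap, adopting a retired heap (and any leftover free storage,
// hence the external fragmentation noted above) when one is available.
static Heap * my_heap( void ) {
    static pthread_once_t once = PTHREAD_ONCE_INIT;
    pthread_once( &once, make_key );
    Heap * h = pthread_getspecific( heap_key );
    if ( h == NULL ) {
        pthread_mutex_lock( &reuse_lock );
        h = reuse_list;
        if ( h != NULL ) reuse_list = h->next;
        pthread_mutex_unlock( &reuse_lock );
        if ( h == NULL ) h = calloc( 1, sizeof(Heap) );
        pthread_setspecific( heap_key, h );
    }
    return h;
}
```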
@@ -1296,5 +1306,5 @@
 \subsubsection{Allocation Latency}
 
-A primary goal of llheap is low latency.
+A primary goal of llheap is low latency, hence the name low-latency heap (llheap).
 Two forms of latency are internal and external.
 Internal latency is the time to perform an allocation, while external latency is time to obtain/return storage from/to the operating system.
@@ -1315,24 +1325,13 @@
 
 Figure~\ref{f:llheapStructure} shows the design of llheap, which uses the following features:
-\begin{itemize}
-\item
 1:1 multiple-heap model to minimize the fastpath,
-\item
 can be built with or without heap ownership,
-\item
 headers per allocation versus containers,
-\item
 no coalescing to minimize latency,
-\item
 global heap memory (pool) obtained from the operating system using @mmap@ to create and reuse heaps needed by threads,
-\item
 local reserved memory (pool) per heap obtained from global pool,
-\item
 global reserved memory (pool) obtained from the operating system using @sbrk@ call,
-\item
 optional fast-lookup table for converting allocation requests into bucket sizes,
-\item
 optional statistic-counters table for accumulating counts of allocation operations.
-\end{itemize}
 
 \begin{figure}
@@ -1360,13 +1359,9 @@
 All objects in a bucket are of the same size.
 The number of buckets used is determined dynamically depending on the crossover point from @sbrk@ to @mmap@ allocation using @mallopt( M_MMAP_THRESHOLD )@, \ie small objects managed by the program and large objects managed by the operating system.
-Each free bucket of a specific size has the following two lists:
-\begin{itemize}
-\item
-A free stack used solely by the KT heap-owner, so push/pop operations do not require locking.
+Each free bucket of a specific size has two lists:
+1) A free stack used solely by the KT heap-owner, so push/pop operations do not require locking.
 The free objects are a stack so hot storage is reused first.
-\item
-For ownership, a shared away-stack for KTs to return storage allocated by other KTs, so push/pop operations require locking.
+2) For ownership, a shared away-stack for KTs to return storage allocated by other KTs, so push/pop operations require locking.
 When the free stack is empty, the entire away stack is removed and becomes the head of the corresponding free stack.
-\end{itemize}
 
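The two per-bucket lists can be sketched as follows. The names (`Bucket`, `owner_free`, `away_free`, `owner_alloc`) are invented, and a mutex stands in for whatever locking or lock-free push llheap actually uses on the away stack; the owner-only free stack needs no synchronization, and an empty free stack claims the whole away stack in one operation:

```c
#include <pthread.h>
#include <stddef.h>

typedef struct Object { struct Object * next; } Object;  // intrusive link in free storage

typedef struct Bucket {
    Object * free_stack;            // used solely by the owner KT: no locking
    Object * away_stack;            // other KTs return storage here: locked
    pthread_mutex_t away_lock;
} Bucket;

static void owner_free( Bucket * b, Object * o ) {  // owner KT returns storage
    o->next = b->free_stack;  b->free_stack = o;    // stack, so hot storage is reused first
}

static void away_free( Bucket * b, Object * o ) {   // non-owner KT returns storage
    pthread_mutex_lock( &b->away_lock );
    o->next = b->away_stack;  b->away_stack = o;
    pthread_mutex_unlock( &b->away_lock );
}

static Object * owner_alloc( Bucket * b ) {         // fastpath allocation
    if ( b->free_stack == NULL ) {                  // claim entire away stack at once
        pthread_mutex_lock( &b->away_lock );
        b->free_stack = b->away_stack;  b->away_stack = NULL;
        pthread_mutex_unlock( &b->away_lock );
    }
    Object * o = b->free_stack;
    if ( o != NULL ) b->free_stack = o->next;       // NULL => fall back to pools (not shown)
    return o;
}
```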
 Algorithm~\ref{alg:heapObjectAlloc} shows the allocation outline for an object of size $S$.
@@ -1378,17 +1373,10 @@
 The @char@ type restricts the number of bucket sizes to 256.
 For $S$ > 64K, a binary search is used.
-Then, the allocation storage is obtained from the following locations (in order), with increasing latency.
-\begin{enumerate}[topsep=0pt,itemsep=0pt,parsep=0pt]
-\item
+Then, the allocation storage is obtained from the following locations (in order), with increasing latency:
 bucket's free stack,
-\item
 bucket's away stack,
-\item
-heap's local pool
-\item
-global pool
-\item
-operating system (@sbrk@)
-\end{enumerate}
+heap's local pool,
+global pool,
+operating system (@sbrk@).
 
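The size-to-bucket step of the outline above can be sketched as a byte-sized lookup table with a binary-search fallback. The names and sizes here are invented and shrunk for illustration (llheap's table covers sizes up to 64K and its bucket count is set dynamically); storing indices in a `char` is what restricts the number of bucket sizes to 256:

```c
#include <stddef.h>

#define NUM_BUCKETS 8                    // illustrative; llheap sizes this dynamically
static const size_t bucket_sizes[NUM_BUCKETS] =
    { 16, 32, 48, 64, 96, 128, 192, 256 };
#define LOOKUP_MAX 128                   // llheap uses 64K; shrunk for the sketch
static unsigned char lookup[LOOKUP_MAX + 1];  // char index => at most 256 buckets

// Precompute, for every request size <= LOOKUP_MAX, the index of the smallest
// bucket whose size is >= the request, so the common case is one array read.
static void build_lookup( void ) {
    unsigned char b = 0;
    for ( size_t s = 0; s <= LOOKUP_MAX; s += 1 ) {
        while ( bucket_sizes[b] < s ) b += 1;
        lookup[s] = b;
    }
}

// Map a request size to its bucket index.  Requests above the largest bucket
// go to the operating system via mmap in llheap (not shown here).
static unsigned size_to_bucket( size_t request ) {
    if ( request <= LOOKUP_MAX ) return lookup[request];   // O(1) fastpath
    unsigned lo = 0, hi = NUM_BUCKETS - 1;                 // binary-search fallback
    while ( lo < hi ) {
        unsigned mid = (lo + hi) / 2;
        if ( bucket_sizes[mid] < request ) lo = mid + 1; else hi = mid;
    }
    return lo;
}
```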
 \begin{algorithm}
@@ -1808,5 +1796,5 @@
 
 For existing C allocation routines:
-\begin{itemize}
+\begin{itemize}[topsep=3pt,itemsep=2pt,parsep=0pt]
 \item
 @calloc@ sets the sticky zero-fill property.
@@ -1825,5 +1813,5 @@
 \noindent\textbf{Usage}
 @aalloc@ takes two parameters.
-\begin{itemize}
+\begin{itemize}[topsep=3pt,itemsep=2pt,parsep=0pt]
 \item
 @dim@: number of array objects
@@ -1839,5 +1827,5 @@
 \noindent\textbf{Usage}
 @resize@ takes two parameters.
-\begin{itemize}
+\begin{itemize}[topsep=3pt,itemsep=2pt,parsep=0pt]
 \item
 @oaddr@: address to be resized
@@ -1853,5 +1841,5 @@
 \noindent\textbf{Usage}
 @amemalign@ takes three parameters.
-\begin{itemize}
+\begin{itemize}[topsep=3pt,itemsep=2pt,parsep=0pt]
 \item
 @alignment@: alignment requirement
@@ -1873,5 +1861,5 @@
 \noindent\textbf{Usage}
 @malloc_alignment@ takes one parameter.
-\begin{itemize}
+\begin{itemize}[topsep=3pt,itemsep=2pt,parsep=0pt]
 \item
 @addr@: address of an allocated object.
@@ -1885,5 +1873,5 @@
 @malloc_zero_fill@ takes one parameter.
 
-\begin{itemize}
+\begin{itemize}[topsep=3pt,itemsep=2pt,parsep=0pt]
 \item
 @addr@: address of an allocated object.
@@ -1897,5 +1885,5 @@
 \noindent\textbf{Usage}
 @malloc_size@ takes one parameter.
-\begin{itemize}
+\begin{itemize}[topsep=3pt,itemsep=2pt,parsep=0pt]
 \item
 @addr@: address of an allocated object.
@@ -1908,5 +1896,5 @@
 \noindent\textbf{Usage}
 @malloc_stats_fd@ takes one parameter.
-\begin{itemize}
+\begin{itemize}[topsep=3pt,itemsep=2pt,parsep=0pt]
 \item
 @fd@: file descriptor.
@@ -1938,5 +1926,5 @@
 \noindent\textbf{Usage}
 takes three parameters.
-\begin{itemize}
+\begin{itemize}[topsep=3pt,itemsep=2pt,parsep=0pt]
 \item
 @oaddr@: address to be resized
@@ -1998,5 +1986,5 @@
 In addition to the \CFA C-style allocator interface, a new allocator interface is provided to further increase orthogonality and usability of dynamic-memory allocation.
 This interface helps programmers in three ways.
-\begin{itemize}
+\begin{itemize}[topsep=3pt,itemsep=2pt,parsep=0pt]
 \item
 naming: \CFA regular and @ttype@ polymorphism (@ttype@ polymorphism in \CFA is similar to \CC variadic templates) is used to encapsulate a wide range of allocation functionality into a single routine name, so programmers do not have to remember multiple routine names for different kinds of dynamic allocations.
@@ -2528,5 +2516,5 @@
 
 The performance experiments were run on two different multi-core architectures (x64 and ARM) to determine if there is consistency across platforms:
-\begin{itemize}
+\begin{itemize}[topsep=3pt,itemsep=2pt,parsep=0pt]
 \item
 \textbf{Algol} Huawei ARM TaiShan 2280 V2 Kunpeng 920, 24-core socket $\times$ 4, 2.6 GHz, GCC version 9.4.0
@@ -2793,5 +2781,5 @@
 Each allocator's performance for each thread is shown in different colors.
 
-\begin{itemize}
+\begin{itemize}[topsep=3pt,itemsep=2pt,parsep=0pt]
 \item Figure~\ref{fig:speed-3-malloc} shows results for chain: malloc
 \item Figure~\ref{fig:speed-4-realloc} shows results for chain: realloc
@@ -3030,5 +3018,5 @@
 Each figure has 2 graphs, one for each experiment environment.
 Each graph has following 5 subgraphs that show memory usage and statistics throughout the micro-benchmark's lifetime.
-\begin{itemize}
+\begin{itemize}[topsep=3pt,itemsep=2pt,parsep=0pt]
 \item \textit{\textbf{current\_req\_mem(B)}} shows the amount of dynamic memory requested and currently in-use of the benchmark.
 \item \textit{\textbf{heap}}* shows the memory requested by the program (allocator) from the system that lies in the heap (@sbrk@) area.
