\chapter{Performance}
\label{c:Performance}

This chapter uses the micro-benchmarks from \VRef[Chapter]{s:Benchmarks} to test a number of current memory allocators, including llheap.
The goal is to see if llheap is competitive with the currently popular memory allocators.


\section{Machine Specification}

The performance experiments were run on two different multi-core architectures (x64 and ARM) to determine if there is consistency across platforms:
\begin{itemize}
\item
\textbf{Algol} Huawei ARM TaiShan 2280 V2 Kunpeng 920, 24-core socket $\times$ 4, 2.6 GHz, GCC version 9.4.0
\item
\textbf{Nasus} AMD EPYC 7662, 64-core socket $\times$ 2, 2.0 GHz, GCC version 9.3.0
\end{itemize}


\section{Existing Memory Allocators}
\label{sec:curAllocatorSec}

With dynamic allocation being an important feature of C, there are many stand-alone memory allocators that have been designed for different purposes.
For this thesis, 7 of the most popular and widely used memory allocators were selected for comparison, along with llheap.

\paragraph{llheap (\textsf{llh})}
is the thread-safe allocator from \VRef[Chapter]{c:Allocator}
\\
\textbf{Version:} 1.0
\textbf{Configuration:} Compiled with dynamic linking, but without statistics or debugging.\\
\textbf{Compilation command:} @make@

\paragraph{glibc (\textsf{glc})}
\cite{glibc} is the default glibc thread-safe allocator.
\\
\textbf{Version:} Ubuntu GLIBC 2.31-0ubuntu9.7 2.31\\
\textbf{Configuration:} Compiled by Ubuntu 20.04.\\
\textbf{Compilation command:} N/A

\paragraph{dlmalloc (\textsf{dl})}
\cite{dlmalloc} is a thread-safe allocator that is single threaded and single heap.
It maintains free-lists of different sizes to store freed dynamic memory.
\\
\textbf{Version:} 2.8.6\\
\textbf{Configuration:} Compiled with preprocessor @USE_LOCKS@.\\
\textbf{Compilation command:} @gcc -g3 -O3 -Wall -Wextra -fno-builtin-malloc -fno-builtin-calloc@ @-fno-builtin-realloc -fno-builtin-free -fPIC -shared -DUSE_LOCKS -o libdlmalloc.so malloc-2.8.6.c@

\paragraph{hoard (\textsf{hrd})}
\cite{hoard} is a thread-safe allocator that is multi-threaded and uses a heap layer framework. It has per-thread heaps that have thread-local free-lists, and a global shared heap.
\\
\textbf{Version:} 3.13\\
\textbf{Configuration:} Compiled with hoard's default configurations and @Makefile@.\\
\textbf{Compilation command:} @make all@

\paragraph{jemalloc (\textsf{je})}
\cite{jemalloc} is a thread-safe allocator that uses multiple arenas. Each thread is assigned an arena.
Each arena has chunks that contain contagious memory regions of same size. An arena has multiple chunks that contain regions of multiple sizes.
\\
\textbf{Version:} 5.2.1\\
\textbf{Configuration:} Compiled with jemalloc's default configurations and @Makefile@.\\
\textbf{Compilation command:} @autogen.sh; configure; make; make install@

\paragraph{ptmalloc3 (\textsf{pt3})}
\cite{ptmalloc3} is a modification of dlmalloc.
It is a thread-safe multi-threaded memory allocator that uses multiple heaps.
ptmalloc3 heap has similar design to dlmalloc's heap.
\\
\textbf{Version:} 1.8\\
\textbf{Configuration:} Compiled with ptmalloc3's @Makefile@ using option ``linux-shared''.\\
\textbf{Compilation command:} @make linux-shared@

\paragraph{rpmalloc (\textsf{rp})}
\cite{rpmalloc} is a thread-safe allocator that is multi-threaded and uses per-thread heap.
Each heap has multiple size-classes and each size-class contains memory regions of the relevant size.
\\
\textbf{Version:} 1.4.1\\
\textbf{Configuration:} Compiled with rpmalloc's default configurations and ninja build system.\\
\textbf{Compilation command:} @python3 configure.py; ninja@

\paragraph{tbb malloc (\textsf{tbb})}
\cite{tbbmalloc} is a thread-safe allocator that is multi-threaded and uses a private heap for each thread.
Each private-heap has multiple bins of different sizes. Each bin contains free regions of the same size.
\\
\textbf{Version:} intel tbb 2020 update 2, tbb\_interface\_version == 11102\\
\textbf{Configuration:} Compiled with tbbmalloc's default configurations and @Makefile@.\\
\textbf{Compilation command:} @make@

% \section{Experiment Environment}
% We used our micro benchmark suite (FIX ME: cite mbench) to evaluate these memory allocators \ref{sec:curAllocatorSec} and our own memory allocator uHeap \ref{sec:allocatorSec}.

\section{Experiments}

Each micro-benchmark is configured and run with each of the allocators,
The less time an allocator takes to complete a benchmark the better so lower in the graphs is better, except for the Memory micro-benchmark graphs.
All graphs use log scale on the Y-axis, except for the Memory micro-benchmark (see \VRef{s:MemoryMicroBenchmark}).

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% CHURN
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\subsection{Churn Micro-Benchmark}

Churn tests allocators for speed under intensive dynamic memory usage (see \VRef{s:ChurnBenchmark}).
This experiment was run with following configurations:
\begin{description}[itemsep=0pt,parsep=0pt]
\item[thread:]
1, 2, 4, 8, 16, 32, 48
\item[spots:]
16
\item[obj:]
100,000
\item[max:]
500
\item[min:]
50
\item[step:]
50
\item[distro:]
fisher
\end{description}

% -maxS		 : 500
% -minS		 : 50
% -stepS		 : 50
% -distroS	 : fisher
% -objN		 : 100000
% -cSpots		 : 16
% -threadN	 : 1, 2, 4, 8, 16

\VRef[Figure]{fig:churn} shows the results for algol and nasus.
The X-axis shows the number of threads;
the Y-axis shows the total experiment time.
Each allocator's performance for each thread is shown in different colors.

\begin{figure}
\centering
    \subfigure[Algol]{ \includegraphics[width=0.95\textwidth]{evaluations/algol-perf-eps/churn} }
    \subfigure[Nasus]{ \includegraphics[width=0.95\textwidth]{evaluations/nasus-perf-eps/churn} }
\caption{Churn}
\label{fig:churn}
\end{figure}

\paragraph{Assessment}
All allocators did well in this micro-benchmark, except for \textsf{dl} on the ARM.
\textsf{dl}'s is the slowest, indicating some small bottleneck with respect to the other allocators.
\textsf{je} is the fastest, with only a small benefit over the other allocators.
% llheap is slightly slower because it uses ownership, where many of the allocations have remote frees, which requires locking.
% When llheap is compiled without ownership, its performance is the same as the other allocators (not shown).

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% THRASH
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\subsection{Cache Thrash}
\label{sec:cache-thrash-perf}

Thrash tests memory allocators for active false sharing (see \VRef{sec:benchThrashSec}).
This experiment was run with following configurations:
\begin{description}[itemsep=0pt,parsep=0pt]
\item[threads:]
1, 2, 4, 8, 16, 32, 48
\item[iterations:]
1,000
\item[cacheRW:]
1,000,000
\item[size:]
1
\end{description}

% * Each allocator was tested for its performance across different number of threads.
% Experiment was repeated for each allocator for 1, 2, 4, 8, and 16 threads by setting the configuration -threadN.

\VRef[Figure]{fig:cacheThrash} shows the results for algol and nasus.
The X-axis shows the number of threads;
the Y-axis shows the total experiment time.
Each allocator's performance for each thread is shown in different colors.

\begin{figure}
\centering
    \subfigure[Algol]{ \includegraphics[width=0.95\textwidth]{evaluations/algol-perf-eps/cache_thrash_0-thrash} }
    \subfigure[Nasus]{ \includegraphics[width=0.95\textwidth]{evaluations/nasus-perf-eps/cache_thrash_0-thrash} }
\caption{Cache Thrash}
\label{fig:cacheThrash}
\end{figure}

\paragraph{Assessment}
All allocators did well in this micro-benchmark, except for \textsf{dl} and \textsf{pt3}.
\textsf{dl} uses a single heap for all threads so it is understandable that it generates so much active false-sharing.
Requests from different threads are dealt with sequentially by the single heap (using a single lock), which can allocate objects to different threads on the same cache line.
\textsf{pt3} uses the T:H model, so multiple threads can use one heap, but the active false-sharing is less than \textsf{dl}.
The rest of the memory allocators generate little or no active false-sharing.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% SCRATCH
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\subsection{Cache Scratch}

Scratch tests memory allocators for program-induced allocator-preserved passive false-sharing (see \VRef{s:CacheScratch}).
This experiment was run with following configurations:
\begin{description}[itemsep=0pt,parsep=0pt]
\item[threads:]
1, 2, 4, 8, 16, 32, 48
\item[iterations:]
1,000
\item[cacheRW:]
1,000,000
\item[size:]
1
\end{description}

% * Each allocator was tested for its performance across different number of threads.
% Experiment was repeated for each allocator for 1, 2, 4, 8, and 16 threads by setting the configuration -threadN.

\VRef[Figure]{fig:cacheScratch} shows the results for algol and nasus.
The X-axis shows the number of threads;
the Y-axis shows the total experiment time.
Each allocator's performance for each thread is shown in different colors.

\begin{figure}
\centering
    \subfigure[Algol]{ \includegraphics[width=0.95\textwidth]{evaluations/algol-perf-eps/cache_scratch_0-scratch} }
    \subfigure[Nasus]{ \includegraphics[width=0.95\textwidth]{evaluations/nasus-perf-eps/cache_scratch_0-scratch} }
\caption{Cache Scratch}
\label{fig:cacheScratch}
\end{figure}

\paragraph{Assessment}
This micro-benchmark divides the allocators into two groups.
First is the high-performer group: \textsf{llh}, \textsf{je}, and \textsf{rp}.
These memory allocators generate little or no passive false-sharing and their performance difference is negligible.
Second is the low-performer group, which includes the rest of the memory allocators.
These memory allocators have significant program-induced passive false-sharing, where \textsf{hrd}'s is the worst performing allocator.
All of the allocators in this group are sharing heaps among threads at some level.

Interestingly, allocators such as \textsf{hrd} and \textsf{glc} performed well in micro-benchmark cache thrash (see \VRef{sec:cache-thrash-perf}), but, these allocators are among the low performers in the cache scratch.
It suggests these allocators do not actively produce false-sharing, but preserve program-induced passive false sharing.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% SPEED
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\subsection{Speed Micro-Benchmark}

Speed tests memory allocators for runtime latency (see \VRef{s:SpeedMicroBenchmark}).
This experiment was run with following configurations:
\begin{description}[itemsep=0pt,parsep=0pt]
\item[max:]
500
\item[min:]
50
\item[step:]
50
\item[distro:]
fisher
\item[objects:]
100,000
\item[workers:]
1, 2, 4, 8, 16, 32, 48
\end{description}

% -maxS    :  500
% -minS    :  50
% -stepS   :  50
% -distroS :  fisher
% -objN    :  1000000
% -threadN    : \{ 1, 2, 4, 8, 16 \} *

%* Each allocator was tested for its performance across different number of threads.
%Experiment was repeated for each allocator for 1, 2, 4, 8, and 16 threads by setting the configuration -threadN.

\VRefrange[Figures]{fig:speed-3-malloc}{fig:speed-14-malloc-calloc-realloc-free} show 12 figures, one figure for each chain of the speed benchmark.
The X-axis shows the number of threads;
the Y-axis shows the total experiment time.
Each allocator's performance for each thread is shown in different colors.

\begin{itemize}
\item \VRef[Figure]{fig:speed-3-malloc} shows results for chain: malloc
\item \VRef[Figure]{fig:speed-4-realloc} shows results for chain: realloc
\item \VRef[Figure]{fig:speed-5-free} shows results for chain: free
\item \VRef[Figure]{fig:speed-6-calloc} shows results for chain: calloc
\item \VRef[Figure]{fig:speed-7-malloc-free} shows results for chain: malloc-free
\item \VRef[Figure]{fig:speed-8-realloc-free} shows results for chain: realloc-free
\item \VRef[Figure]{fig:speed-9-calloc-free} shows results for chain: calloc-free
\item \VRef[Figure]{fig:speed-10-malloc-realloc} shows results for chain: malloc-realloc
\item \VRef[Figure]{fig:speed-11-calloc-realloc} shows results for chain: calloc-realloc
\item \VRef[Figure]{fig:speed-12-malloc-realloc-free} shows results for chain: malloc-realloc-free
\item \VRef[Figure]{fig:speed-13-calloc-realloc-free} shows results for chain: calloc-realloc-free
\item \VRef[Figure]{fig:speed-14-malloc-calloc-realloc-free} shows results for chain: malloc-realloc-free-calloc
\end{itemize}

\paragraph{Assessment}
This micro-benchmark divides the allocators into two groups: with and without @calloc@.
@calloc@ uses @memset@ to set the allocated memory to zero, which dominates the cost of the allocation chain (large performance increase) and levels performance across the allocators.
But the difference among the allocators in a @calloc@ chain still gives an idea of their relative performance.

All allocators did well in this micro-benchmark across all allocation chains, except for \textsf{dl}, \textsf{pt3}, and \textsf{hrd}.
Again, the low-performing allocators are sharing heaps among threads, so the contention causes performance increases with increasing numbers of threads.
Furthermore, chains with @free@ can trigger coalescing, which slows the fast path.
The high-performing allocators all illustrate low latency across the allocation chains, \ie there are no performance spikes as the chain lengths, that might be caused by contention and/or coalescing.
Low latency is important for applications that are sensitive to unknown execution delays.

%speed-3-malloc.eps
\begin{figure}
\centering
    \subfigure[Algol]{ \includegraphics[width=0.95\textwidth]{evaluations/algol-perf-eps/speed-3-malloc} }
    \subfigure[Nasus]{ \includegraphics[width=0.95\textwidth]{evaluations/nasus-perf-eps/speed-3-malloc} }
\caption{Speed benchmark chain: malloc}
\label{fig:speed-3-malloc}
\end{figure}

%speed-4-realloc.eps
\begin{figure}
\centering
    \subfigure[Algol]{ \includegraphics[width=0.95\textwidth]{evaluations/algol-perf-eps/speed-4-realloc} }
    \subfigure[Nasus]{ \includegraphics[width=0.95\textwidth]{evaluations/nasus-perf-eps/speed-4-realloc} }
\caption{Speed benchmark chain: realloc}
\label{fig:speed-4-realloc}
\end{figure}

%speed-5-free.eps
\begin{figure}
\centering
    \subfigure[Algol]{ \includegraphics[width=0.95\textwidth]{evaluations/algol-perf-eps/speed-5-free} }
    \subfigure[Nasus]{ \includegraphics[width=0.95\textwidth]{evaluations/nasus-perf-eps/speed-5-free} }
\caption{Speed benchmark chain: free}
\label{fig:speed-5-free}
\end{figure}

%speed-6-calloc.eps
\begin{figure}
\centering
    \subfigure[Algol]{ \includegraphics[width=0.95\textwidth]{evaluations/algol-perf-eps/speed-6-calloc} }
    \subfigure[Nasus]{ \includegraphics[width=0.95\textwidth]{evaluations/nasus-perf-eps/speed-6-calloc} }
\caption{Speed benchmark chain: calloc}
\label{fig:speed-6-calloc}
\end{figure}

%speed-7-malloc-free.eps
\begin{figure}
\centering
    \subfigure[Algol]{ \includegraphics[width=0.95\textwidth]{evaluations/algol-perf-eps/speed-7-malloc-free} }
    \subfigure[Nasus]{ \includegraphics[width=0.95\textwidth]{evaluations/nasus-perf-eps/speed-7-malloc-free} }
\caption{Speed benchmark chain: malloc-free}
\label{fig:speed-7-malloc-free}
\end{figure}

%speed-8-realloc-free.eps
\begin{figure}
\centering
    \subfigure[Algol]{ \includegraphics[width=0.95\textwidth]{evaluations/algol-perf-eps/speed-8-realloc-free} }
    \subfigure[Nasus]{ \includegraphics[width=0.95\textwidth]{evaluations/nasus-perf-eps/speed-8-realloc-free} }
\caption{Speed benchmark chain: realloc-free}
\label{fig:speed-8-realloc-free}
\end{figure}

%speed-9-calloc-free.eps
\begin{figure}
\centering
    \subfigure[Algol]{ \includegraphics[width=0.95\textwidth]{evaluations/algol-perf-eps/speed-9-calloc-free} }
    \subfigure[Nasus]{ \includegraphics[width=0.95\textwidth]{evaluations/nasus-perf-eps/speed-9-calloc-free} }
\caption{Speed benchmark chain: calloc-free}
\label{fig:speed-9-calloc-free}
\end{figure}

%speed-10-malloc-realloc.eps
\begin{figure}
\centering
    \subfigure[Algol]{ \includegraphics[width=0.95\textwidth]{evaluations/algol-perf-eps/speed-10-malloc-realloc} }
    \subfigure[Nasus]{ \includegraphics[width=0.95\textwidth]{evaluations/nasus-perf-eps/speed-10-malloc-realloc} }
\caption{Speed benchmark chain: malloc-realloc}
\label{fig:speed-10-malloc-realloc}
\end{figure}

%speed-11-calloc-realloc.eps
\begin{figure}
\centering
    \subfigure[Algol]{ \includegraphics[width=0.95\textwidth]{evaluations/algol-perf-eps/speed-11-calloc-realloc} }
    \subfigure[Nasus]{ \includegraphics[width=0.95\textwidth]{evaluations/nasus-perf-eps/speed-11-calloc-realloc} }
\caption{Speed benchmark chain: calloc-realloc}
\label{fig:speed-11-calloc-realloc}
\end{figure}

%speed-12-malloc-realloc-free.eps
\begin{figure}
\centering
    \subfigure[Algol]{ \includegraphics[width=0.95\textwidth]{evaluations/algol-perf-eps/speed-12-malloc-realloc-free} }
    \subfigure[Nasus]{ \includegraphics[width=0.95\textwidth]{evaluations/nasus-perf-eps/speed-12-malloc-realloc-free} }
\caption{Speed benchmark chain: malloc-realloc-free}
\label{fig:speed-12-malloc-realloc-free}
\end{figure}

%speed-13-calloc-realloc-free.eps
\begin{figure}
\centering
    \subfigure[Algol]{ \includegraphics[width=0.95\textwidth]{evaluations/algol-perf-eps/speed-13-calloc-realloc-free} }
    \subfigure[Nasus]{ \includegraphics[width=0.95\textwidth]{evaluations/nasus-perf-eps/speed-13-calloc-realloc-free} }
\caption{Speed benchmark chain: calloc-realloc-free}
\label{fig:speed-13-calloc-realloc-free}
\end{figure}

%speed-14-{m,c,re}alloc-free.eps
\begin{figure}
\centering
    \subfigure[Algol]{ \includegraphics[width=0.95\textwidth]{evaluations/algol-perf-eps/speed-14-m-c-re-alloc-free} }
    \subfigure[Nasus]{ \includegraphics[width=0.95\textwidth]{evaluations/nasus-perf-eps/speed-14-m-c-re-alloc-free} }
\caption{Speed benchmark chain: malloc-calloc-realloc-free}
\label{fig:speed-14-malloc-calloc-realloc-free}
\end{figure}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% MEMORY
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\newpage
\subsection{Memory Micro-Benchmark}
\label{s:MemoryMicroBenchmark}

This experiment is run with the following two configurations for each allocator.
The difference between the two configurations is the number of producers and consumers.
Configuration 1 has one producer and one consumer, and configuration 2 has 4 producers, where each producer has 4 consumers.

\noindent
Configuration 1:
\begin{description}[itemsep=0pt,parsep=0pt]
\item[producer (K):]
1
\item[consumer (M):]
1
\item[round:]
100,000
\item[max:]
500
\item[min:]
50
\item[step:]
50
\item[distro:]
fisher
\item[objects (N):]
100,000
\end{description}

% -threadA :  1
% -threadF :  1
% -maxS    :  500
% -minS    :  50
% -stepS   :  50
% -distroS :  fisher
% -objN    :  100000
% -consumeS:  100000

\noindent
Configuration 2:
\begin{description}[itemsep=0pt,parsep=0pt]
\item[producer (K):]
4
\item[consumer (M):]
4
\item[round:]
100,000
\item[max:]
500
\item[min:]
50
\item[step:]
50
\item[distro:]
fisher
\item[objects (N):]
100,000
\end{description}

% -threadA :  4
% -threadF :  4
% -maxS    :  500
% -minS    :  50
% -stepS   :  50
% -distroS :  fisher
% -objN    :  100000
% -consumeS:  100000

% \begin{table}[b]
% \centering
%     \begin{tabular}{ |c|c|c| }
%      \hline
%     Memory Allocator & Configuration 1 Result & Configuration 2 Result\\
%      \hline
%     llh & \VRef[Figure]{fig:mem-1-prod-1-cons-100-llh} & \VRef[Figure]{fig:mem-4-prod-4-cons-100-llh}\\
%      \hline
%     dl & \VRef[Figure]{fig:mem-1-prod-1-cons-100-dl} & \VRef[Figure]{fig:mem-4-prod-4-cons-100-dl}\\
%      \hline
%     glibc & \VRef[Figure]{fig:mem-1-prod-1-cons-100-glc} & \VRef[Figure]{fig:mem-4-prod-4-cons-100-glc}\\
%      \hline
%     hoard & \VRef[Figure]{fig:mem-1-prod-1-cons-100-hrd} & \VRef[Figure]{fig:mem-4-prod-4-cons-100-hrd}\\
%      \hline
%     je & \VRef[Figure]{fig:mem-1-prod-1-cons-100-je} & \VRef[Figure]{fig:mem-4-prod-4-cons-100-je}\\
%      \hline
%     pt3 & \VRef[Figure]{fig:mem-1-prod-1-cons-100-pt3} & \VRef[Figure]{fig:mem-4-prod-4-cons-100-pt3}\\
%      \hline
%     rp & \VRef[Figure]{fig:mem-1-prod-1-cons-100-rp} & \VRef[Figure]{fig:mem-4-prod-4-cons-100-rp}\\
%      \hline
%     tbb & \VRef[Figure]{fig:mem-1-prod-1-cons-100-tbb} & \VRef[Figure]{fig:mem-4-prod-4-cons-100-tbb}\\
%      \hline
%     \end{tabular}
% \caption{Memory benchmark results}
% \label{table:mem-benchmark-figs}
% \end{table}
% Table \ref{table:mem-benchmark-figs} shows the list of figures that contain memory benchmark results.

\VRefrange[Figures]{fig:mem-1-prod-1-cons-100-llh}{fig:mem-4-prod-4-cons-100-tbb} show 16 figures, two figures for each of the 8 allocators, one for each configuration.
Each figure has 2 graphs, one for each experiment environment.
Each graph has following 5 subgraphs that show memory usage and statistics throughout the micro-benchmark's lifetime.
\begin{itemize}
\item \textit{\textbf{current\_req\_mem(B)}} shows the amount of dynamic memory requested and currently in-use of the benchmark.
\item \textit{\textbf{heap}}* shows the memory requested by the program (allocator) from the system that lies in the heap (@sbrk@) area.
\item \textit{\textbf{mmap\_so}}* shows the memory requested by the program (allocator) from the system that lies in the @mmap@ area.
\item \textit{\textbf{mmap}}* shows the memory requested by the program (allocator or shared libraries) from the system that lies in the @mmap@ area.
\item \textit{\textbf{total\_dynamic}} shows the total usage of dynamic memory by the benchmark program, which is a sum of \textit{heap}, \textit{mmap}, and \textit{mmap\_so}.
\end{itemize}
* These statistics are gathered by monitoring a process's @/proc/self/maps@ file.

The X-axis shows the time when the memory information is polled.
The Y-axis shows the memory usage in bytes.

For this experiment, the difference between the memory requested by the benchmark (\textit{current\_req\_mem(B)}) and the memory that the process has received from system (\textit{heap}, \textit{mmap}) should be minimum.
This difference is the memory overhead caused by the allocator and shows the level of fragmentation in the allocator.

\paragraph{Assessment}
First, the differences in the shape of the curves between architectures (top ARM, bottom x64) is small, where the differences are in the amount of memory used.
Hence, it is possible to focus on either the top or bottom graph.

Second, the heap curve is 0 for four memory allocators: \textsf{hrd}, \textsf{je}, \textsf{pt3}, and \textsf{rp}, indicating these memory allocators only use @mmap@ to get memory from the system and ignore the @sbrk@ area.

The total dynamic memory is higher for \textsf{hrd} and \textsf{tbb} than the other allocators.
The main reason is the use of superblocks (see \VRef{s:ObjectContainers}) containing objects of the same size.
These superblocks are maintained throughout the life of the program.

\textsf{pt3} is the only memory allocator where the total dynamic memory goes down in the second half of the program lifetime when the memory is freed by the benchmark program.
It makes pt3 the only memory allocator that gives memory back to the operating system as it is freed by the program.

% FOR 1 THREAD

%mem-1-prod-1-cons-100-llh.eps
\begin{figure}
\centering
    \subfigure[Algol]{ \includegraphics[width=0.95\textwidth]{evaluations/algol-perf-eps/mem-1-prod-1-cons-100-llh} }
    \subfigure[Nasus]{ \includegraphics[width=0.95\textwidth]{evaluations/nasus-perf-eps/mem-1-prod-1-cons-100-llh} }
\caption{Memory benchmark results with Configuration-1 for llh memory allocator}
\label{fig:mem-1-prod-1-cons-100-llh}
\end{figure}

%mem-1-prod-1-cons-100-dl.eps
\begin{figure}
\centering
    \subfigure[Algol]{ \includegraphics[width=0.95\textwidth]{evaluations/algol-perf-eps/mem-1-prod-1-cons-100-dl} }
    \subfigure[Nasus]{ \includegraphics[width=0.95\textwidth]{evaluations/nasus-perf-eps/mem-1-prod-1-cons-100-dl} }
\caption{Memory benchmark results with Configuration-1 for dl memory allocator}
\label{fig:mem-1-prod-1-cons-100-dl}
\end{figure}

%mem-1-prod-1-cons-100-glc.eps
\begin{figure}
\centering
    \subfigure[Algol]{ \includegraphics[width=0.95\textwidth]{evaluations/algol-perf-eps/mem-1-prod-1-cons-100-glc} }
    \subfigure[Nasus]{ \includegraphics[width=0.95\textwidth]{evaluations/nasus-perf-eps/mem-1-prod-1-cons-100-glc} }
\caption{Memory benchmark results with Configuration-1 for glibc memory allocator}
\label{fig:mem-1-prod-1-cons-100-glc}
\end{figure}

%mem-1-prod-1-cons-100-hrd.eps
\begin{figure}
\centering
    \subfigure[Algol]{ \includegraphics[width=0.95\textwidth]{evaluations/algol-perf-eps/mem-1-prod-1-cons-100-hrd} }
    \subfigure[Nasus]{ \includegraphics[width=0.95\textwidth]{evaluations/nasus-perf-eps/mem-1-prod-1-cons-100-hrd} }
\caption{Memory benchmark results with Configuration-1 for hoard memory allocator}
\label{fig:mem-1-prod-1-cons-100-hrd}
\end{figure}

%mem-1-prod-1-cons-100-je.eps
\begin{figure}
\centering
    \subfigure[Algol]{ \includegraphics[width=0.95\textwidth]{evaluations/algol-perf-eps/mem-1-prod-1-cons-100-je} }
    \subfigure[Nasus]{ \includegraphics[width=0.95\textwidth]{evaluations/nasus-perf-eps/mem-1-prod-1-cons-100-je} }
\caption{Memory benchmark results with Configuration-1 for je memory allocator}
\label{fig:mem-1-prod-1-cons-100-je}
\end{figure}

%mem-1-prod-1-cons-100-pt3.eps
\begin{figure}
\centering
    \subfigure[Algol]{ \includegraphics[width=0.95\textwidth]{evaluations/algol-perf-eps/mem-1-prod-1-cons-100-pt3} }
    \subfigure[Nasus]{ \includegraphics[width=0.95\textwidth]{evaluations/nasus-perf-eps/mem-1-prod-1-cons-100-pt3} }
\caption{Memory benchmark results with Configuration-1 for pt3 memory allocator}
\label{fig:mem-1-prod-1-cons-100-pt3}
\end{figure}

%mem-1-prod-1-cons-100-rp.eps
\begin{figure}
\centering
    \subfigure[Algol]{ \includegraphics[width=0.95\textwidth]{evaluations/algol-perf-eps/mem-1-prod-1-cons-100-rp} }
    \subfigure[Nasus]{ \includegraphics[width=0.95\textwidth]{evaluations/nasus-perf-eps/mem-1-prod-1-cons-100-rp} }
\caption{Memory benchmark results with Configuration-1 for rp memory allocator}
\label{fig:mem-1-prod-1-cons-100-rp}
\end{figure}

%mem-1-prod-1-cons-100-tbb.eps
\begin{figure}
\centering
    \subfigure[Algol]{ \includegraphics[width=0.95\textwidth]{evaluations/algol-perf-eps/mem-1-prod-1-cons-100-tbb} }
    \subfigure[Nasus]{ \includegraphics[width=0.95\textwidth]{evaluations/nasus-perf-eps/mem-1-prod-1-cons-100-tbb} }
\caption{Memory benchmark results with Configuration-1 for tbb memory allocator}
\label{fig:mem-1-prod-1-cons-100-tbb}
\end{figure}

% FOR 4 THREADS

%mem-4-prod-4-cons-100-llh.eps
\begin{figure}
\centering
    \subfigure[Algol]{ \includegraphics[width=0.95\textwidth]{evaluations/algol-perf-eps/mem-4-prod-4-cons-100-llh} }
    \subfigure[Nasus]{ \includegraphics[width=0.95\textwidth]{evaluations/nasus-perf-eps/mem-4-prod-4-cons-100-llh} }
\caption{Memory benchmark results with Configuration-2 for llh memory allocator}
\label{fig:mem-4-prod-4-cons-100-llh}
\end{figure}

%mem-4-prod-4-cons-100-dl.eps
\begin{figure}
\centering
    \subfigure[Algol]{ \includegraphics[width=0.95\textwidth]{evaluations/algol-perf-eps/mem-4-prod-4-cons-100-dl} }
    \subfigure[Nasus]{ \includegraphics[width=0.95\textwidth]{evaluations/nasus-perf-eps/mem-4-prod-4-cons-100-dl} }
\caption{Memory benchmark results with Configuration-2 for dl memory allocator}
\label{fig:mem-4-prod-4-cons-100-dl}
\end{figure}

%mem-4-prod-4-cons-100-glc.eps
\begin{figure}
\centering
    \subfigure[Algol]{ \includegraphics[width=0.95\textwidth]{evaluations/algol-perf-eps/mem-4-prod-4-cons-100-glc} }
    \subfigure[Nasus]{ \includegraphics[width=0.95\textwidth]{evaluations/nasus-perf-eps/mem-4-prod-4-cons-100-glc} }
\caption{Memory benchmark results with Configuration-2 for glibc memory allocator}
\label{fig:mem-4-prod-4-cons-100-glc}
\end{figure}

%mem-4-prod-4-cons-100-hrd.eps
\begin{figure}
\centering
    \subfigure[Algol]{ \includegraphics[width=0.95\textwidth]{evaluations/algol-perf-eps/mem-4-prod-4-cons-100-hrd} }
    \subfigure[Nasus]{ \includegraphics[width=0.95\textwidth]{evaluations/nasus-perf-eps/mem-4-prod-4-cons-100-hrd} }
\caption{Memory benchmark results with Configuration-2 for hoard memory allocator}
\label{fig:mem-4-prod-4-cons-100-hrd}
\end{figure}

%mem-4-prod-4-cons-100-je.eps
\begin{figure}
\centering
    \subfigure[Algol]{ \includegraphics[width=0.95\textwidth]{evaluations/algol-perf-eps/mem-4-prod-4-cons-100-je} }
    \subfigure[Nasus]{ \includegraphics[width=0.95\textwidth]{evaluations/nasus-perf-eps/mem-4-prod-4-cons-100-je} }
\caption{Memory benchmark results with Configuration-2 for je memory allocator}
\label{fig:mem-4-prod-4-cons-100-je}
\end{figure}

%mem-4-prod-4-cons-100-pt3.eps
\begin{figure}
\centering
    \subfigure[Algol]{ \includegraphics[width=0.95\textwidth]{evaluations/algol-perf-eps/mem-4-prod-4-cons-100-pt3} }
    \subfigure[Nasus]{ \includegraphics[width=0.95\textwidth]{evaluations/nasus-perf-eps/mem-4-prod-4-cons-100-pt3} }
\caption{Memory benchmark results with Configuration-2 for pt3 memory allocator}
\label{fig:mem-4-prod-4-cons-100-pt3}
\end{figure}

%mem-4-prod-4-cons-100-rp.eps
\begin{figure}
\centering
    \subfigure[Algol]{ \includegraphics[width=0.95\textwidth]{evaluations/algol-perf-eps/mem-4-prod-4-cons-100-rp} }
    \subfigure[Nasus]{ \includegraphics[width=0.95\textwidth]{evaluations/nasus-perf-eps/mem-4-prod-4-cons-100-rp} }
\caption{Memory benchmark results with Configuration-2 for rp memory allocator}
\label{fig:mem-4-prod-4-cons-100-rp}
\end{figure}

%mem-4-prod-4-cons-100-tbb.eps
\begin{figure}
\centering
    \subfigure[Algol]{ \includegraphics[width=0.95\textwidth]{evaluations/algol-perf-eps/mem-4-prod-4-cons-100-tbb} }
    \subfigure[Nasus]{ \includegraphics[width=0.95\textwidth]{evaluations/nasus-perf-eps/mem-4-prod-4-cons-100-tbb} }
\caption{Memory benchmark results with Configuration-2 for tbb memory allocator}
\label{fig:mem-4-prod-4-cons-100-tbb}
\end{figure}