% Type: Paper
%
% Abstract
%
% A new C-based concurrent memory-allocator is presented, called llheap (ll => low latency). It supports C/C++ applications with multiple kernel threads, or it can be embedded into user-threading runtime-systems. llheap extends the C allocation API with new functions providing orthogonal access to allocation features; hence, programmers do not have to code missing combinations. llheap also extends the C allocation semantics by remembering multiple aspects of the initial allocation. These properties can be queried, allowing programmers to write safer programs by preserving these properties in future allocations. As well, realloc/reallocarray preserve initial zero-fill and alignment properties when adjusting storage size, again increasing future allocation safety. The allocator provides a contention-free statistics gathering mode, and a debugging mode for dynamically checking allocation pre/post conditions and invariants. These modes are invaluable for understanding and debugging a program's dynamic allocation behaviour, with low enough cost to be used in production code. An example is presented for condensing the allocation API using advanced type-systems, providing a single type-safe allocation routine using named arguments. Finally, performance results across a number of benchmarks show llheap is competitive with other modern memory allocators.
%
% Upload: llheap.pdf
%
% Computing Classification Systems
%
% Add
% 500 Software and its engineering > Software libraries and repositories
% Add
% 300 Computing methodologies > Concurrent programming languages
%
% Authors, submitter has to have an orcid
%
% Details & Comments
%
% cover letter
%
% Funding
% yes
% Government of Canada >
% Natural Sciences and Engineering Research Council of Canada
%
% Electronic Supplementary Materials No
% Are you submitting a conference paper extension: No
% X ACM uses CrossCheck, an automated service that checks for plagiarism. Any submission to ACM is subject to such a check. Confirm that you are familiar with the ACM Plagiarism Policy.
% To confirm that you have reviewed all title, author, and affiliation information in the submission form and the manuscript for accuracy, and approve its exact use in the final, published article, please check the box to the right. X

\documentclass[manuscript,screen,review]{acmart}

% Latex packages used in the document.

\usepackage{comment}
\usepackage{epic,eepic}
\usepackage{upquote} % switch curled `'" to straight
\usepackage{relsize}
\usepackage{xspace}
\usepackage{xcolor}
\usepackage{calc}
\usepackage{algorithm}
\usepackage{algorithmic}
\usepackage{enumitem}
\usepackage{tabularx} % allows \lstMakeShortInline@
\usepackage[scaled=0.88]{helvet} % decent Helvetica font, scaled to Times size
\usepackage[T1]{fontenc}
\usepackage{listings} % format program code
\usepackage[labelformat=simple,aboveskip=0pt,farskip=0pt,font={rm,md,up}]{subfig}
\renewcommand{\thesubfigure}{(\alph{subfigure})}

\hypersetup{breaklinks=true}
\usepackage{breakurl}

% \usepackage[pagewise]{lineno}
% \renewcommand{\linenumberfont}{\scriptsize\sffamily}

\usepackage{varioref} % extended references
% adjust varioref package with default "section" and "page" titles, and optional title with faraway page numbers
% \VRef{label} => Section 2.7, \VPageref{label} => page 17
% \VRef[Figure]{label} => Figure 3.4, \VPageref{label} => page 17
% \renewcommand{\reftextfaceafter}{\unskip}
% \renewcommand{\reftextfacebefore}{\unskip}
% \renewcommand{\reftextafter}{\unskip}
% \renewcommand{\reftextbefore}{\unskip}
% \renewcommand{\reftextfaraway}[1]{\unskip, p.~\pageref{#1}}
% \renewcommand{\reftextpagerange}[2]{\unskip, pp.~\pageref{#1}--\pageref{#2}}
% \newcommand{\VRef}[2][Section]{\ifx#1\@empty\else{#1}\nobreakspace\fi\vref{#2}}
% \newcommand{\VRefrange}[3][Sections]{\ifx#1\@empty\else{#1}\nobreakspace\fi\vrefrange{#2}{#3}}
% \newcommand{\VPageref}[2][page]{\ifx#1\@empty\else{#1}\nobreakspace\fi\pageref{#2}}
% \newcommand{\VPagerefrange}[3][pages]{\ifx#1\@empty\else{#1}\nobreakspace\fi\pageref{#2}{#3}}

\makeatletter
\newcommand{\abbrevFont}{\textit} % set empty for no italics
\newcommand{\CheckCommaColon}{\@ifnextchar{,}{}{\@ifnextchar{:}{}{,\xspace}}}
\newcommand{\CheckPeriod}{\@ifnextchar{.}{}{.\xspace}}
\newcommand{\EG}{\abbrevFont{e}.\abbrevFont{g}.}
\newcommand{\eg}{\EG\CheckCommaColon}
\newcommand{\IE}{\abbrevFont{i}.\abbrevFont{e}.}
\newcommand{\ie}{\IE\CheckCommaColon}
\newcommand{\ETC}{\abbrevFont{etc}}
\newcommand{\etc}{\ETC\CheckPeriod}
\newcommand{\VS}{\abbrevFont{vs}}
\newcommand{\vs}{\VS\CheckPeriod}

\newcommand{\newtermFont}{\emph}
\newcommand{\newterm}[1]{\newtermFont{#1}}

\newcommand{\CFAIcon}{\textsf{C}\raisebox{\depth}{\rotatebox{180}{\textsf{A}}}\xspace} % Cforall symbolic name
\newcommand{\CFA}{\protect\CFAIcon} % safe for section/caption
\newcommand{\CFL}{\textrm{Cforall}\xspace} % Cforall symbolic name
\newcommand{\CCIcon}{\textrm{C}\kern-.1em\hbox{+\kern-.25em+}} % C++ icon
\newcommand{\CC}[1][]{\protect\CCIcon{#1}\xspace} % C++ symbolic name
\newcommand{\uC}{$\mu$\CC}
\newcommand{\Csharp}{C\raisebox{-0.7ex}{\relsize{2}$^\sharp$}\xspace} % C# symbolic name

\newcommand{\LstBasicStyle}[1]{{\lst@basicstyle{#1}}}
\newcommand{\LstKeywordStyle}[1]{{\lst@basicstyle{\lst@keywordstyle{#1}}}}
\newcommand{\LstCommentStyle}[1]{{\lst@basicstyle{\lst@commentstyle{#1}}}}
\newcommand{\LstStringStyle}[1]{{\lst@basicstyle{\lst@stringstyle{#1}}}}

\newlength{\parindentlnth}
\setlength{\parindentlnth}{\parindent}
\newlength{\gcolumnposn} % temporary hack because lstlisting does not handle tabs correctly
\newlength{\columnposn}
\setlength{\gcolumnposn}{3.25in}
\setlength{\columnposn}{\gcolumnposn}
\renewcommand{\C}[2][\@empty]{\ifx#1\@empty\else\global\setlength{\columnposn}{#1}\global\columnposn=\columnposn\fi\hfill\makebox[\textwidth-\columnposn][l]{\lst@basicstyle{\LstCommentStyle{#2}}}}
\newcommand{\CRT}{\global\columnposn=\gcolumnposn}
\makeatother

\lstset{
columns=fullflexible,
basicstyle=\linespread{0.9}\sf, % reduce line spacing and use sans-serif font
stringstyle=\fontsize{9}{9}\selectfont\tt, % use typewriter font
tabsize=5, % N space tabbing
xleftmargin=\parindentlnth, % indent code to paragraph indentation
escapechar=\$, % LaTeX escape in CFA code
mathescape=false, % disable LaTeX math escape in CFA code $...$
keepspaces=true, %
showstringspaces=false, % do not show spaces with cup
showlines=true, % show blank lines at end of code
aboveskip=4pt, % spacing above/below code block
belowskip=2pt,
numberstyle=\footnotesize\sf, % numbering style
moredelim=**[is][\color{red}]{@}{@},
% replace/adjust listing characters that look bad in sans-serif
literate=
% {-}{\makebox[1ex][c]{\raisebox{0.4ex}{\rule{0.75ex}{0.1ex}}}}1
	{-}{\raisebox{0pt}{\ttfamily-}}1
	{^}{\raisebox{0.6ex}{\(\scriptstyle\land\,\)}}1
	{~}{\raisebox{0.3ex}{\(\scriptstyle\sim\,\)}}1
% {'}{\ttfamily'\hspace*{-0.4ex}}1
	{`}{\raisebox{-2pt}{\large\textasciigrave\hspace{-1pt}}}1
	{<-}{$\leftarrow$}2
	{=>}{$\Rightarrow$}2
% {->}{\raisebox{-1pt}{\texttt{-}}\kern-0.1ex\textgreater}2,
}% lstset

% CFA programming language, based on ANSI C (with some gcc additions)
\lstdefinelanguage{CFA}[ANSI]{C}{
	morekeywords={
	_Alignas, _Alignof, __alignof, __alignof__, and, asm, __asm, __asm__, _Atomic, __attribute, __attribute__,
	__auto_type, basetypeof, _Bool, catch, catchResume, choose, coerce, _Complex, __complex, __complex__, __const, __const__,
	coroutine, _Decimal32, _Decimal64, _Decimal128, disable, enable, exception, __extension__, fallthrough, finally, fixup,
	__float80, float80, __float128, float128, _Float16, _Float32, _Float32x, _Float64, _Float64x, _Float128, _Float128x,
	forall, fortran, generator, _Generic, _Imaginary, __imag, __imag__, inline, __inline, __inline__, int128, __int128, __int128_t,
	__label__, monitor, mutex, _Noreturn, __builtin_offsetof, one_t, or, recover, report, restrict, __restrict, __restrict__,
	__signed, __signed__, _Static_assert, suspend, thread, __thread, _Thread_local, throw, throwResume, timeout, trait, try,
	typeof, __typeof, __typeof__, typeid, __uint128_t, __builtin_va_arg, __builtin_va_list, virtual, __volatile, __volatile__,
	vtable, waitfor, waituntil, when, with, zero_t,
	},
	moredirectives={defined,include_next},
}

% uC++ programming language, based on ANSI C++
\lstdefinelanguage{uC++}[GNU]{C++}{
	morekeywords={
	_Accept, _AcceptReturn, _AcceptWait, _Actor, _At, _Catch, _CatchResume, _CorActor, _Cormonitor, _Coroutine,
	_Disable, _Else, _Enable, _Event, _Exception, _Finally, _Monitor, _Mutex, _Nomutex, _PeriodicTask, _RealTimeTask,
	_Resume, _ResumeTop, _Select, _SporadicTask, _Task, _Timeout, _When, _With, _Throw},
}

% Go programming language: https://github.com/julienc91/listings-golang/blob/master/listings-golang.sty
\lstdefinelanguage{Golang}{
	morekeywords=[1]{package,import,func,type,struct,return,defer,panic,recover,select,var,const,iota,},
	morekeywords=[2]{string,uint,uint8,uint16,uint32,uint64,int,int8,int16,int32,int64,
	bool,float32,float64,complex64,complex128,byte,rune,uintptr, error,interface},
	morekeywords=[3]{map,slice,make,new,nil,len,cap,copy,close,true,false,delete,append,real,imag,complex,chan,},
	morekeywords=[4]{for,break,continue,range,goto,switch,case,fallthrough,if,else,default,},
	morekeywords=[5]{Println,Printf,Error,},
	sensitive=true,
	morecomment=[l]{//},
	morecomment=[s]{/*}{*/},
	morestring=[b]',
	morestring=[b]",
	morestring=[s]{`}{`},
}

\lstnewenvironment{cfa}[1][]{\lstset{language=CFA,moredelim=**[is][\protect\color{red}]{@}{@}}\lstset{#1}}{}
\lstnewenvironment{C++}[1][]{\lstset{language=C++,moredelim=**[is][\protect\color{red}]{@}{@}}\lstset{#1}}{}
\lstnewenvironment{uC++}[1][]{\lstset{language=uC++,moredelim=**[is][\protect\color{red}]{@}{@}}\lstset{#1}}{}
\lstnewenvironment{Go}[1][]{\lstset{language=Golang,moredelim=**[is][\protect\color{red}]{@}{@}}\lstset{#1}}{}
\lstnewenvironment{python}[1][]{\lstset{language=python,moredelim=**[is][\protect\color{red}]{@}{@}}\lstset{#1}}{}
\lstnewenvironment{java}[1][]{\lstset{language=java,moredelim=**[is][\protect\color{red}]{@}{@}}\lstset{#1}}{}

\newsavebox{\myboxA}
\newsavebox{\myboxB}
\newsavebox{\myboxC}
\newsavebox{\myboxD}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\title{High-Performance Concurrent Memory Allocation}

\author{Mubeen Zulfiqar}
\email{m3zulfiq@uwaterloo.ca}
\author{Ayelet Wasik}
\email{aisraeli@plg.uwaterloo.ca}
\author{Peter A. Buhr}
\email{pabuhr@uwaterloo.ca}
\orcid{0000-0003-3747-9281}
\affiliation{%
	\institution{University of Waterloo}
	\city{Waterloo}
	\state{Ontario}
	\country{Canada}
}
\author{Dave Dice}
\email{dave.dice@oracle.com}
\orcid{0000-0001-9164-7747}
\affiliation{%
	\institution{Oracle Labs}
	\city{Burlington}
	\state{Massachusetts}
	\country{USA}
}
\author{Bryan Chan}
\email{bryan.chan@huawei.com}
\affiliation{%
	\institution{Huawei Compiler Lab}
	\city{Markham}
	\state{Ontario}
	\country{Canada}
}

\renewcommand{\shortauthors}{Zulfiqar et al.}

% inline code @...@
\lstMakeShortInline@%

\begin{document}

\begin{abstract}
A new C-based concurrent memory-allocator is presented, called llheap (ll $\Rightarrow$ low latency).
It supports C/\CC applications with multiple kernel threads, or it can be embedded into user-threading runtime-systems.
llheap extends the C allocation API with new functions providing orthogonal access to allocation features;
hence, programmers do not have to code missing combinations.
llheap also extends the C allocation semantics by remembering multiple aspects of the initial allocation.
These properties can be queried, allowing programmers to write safer programs by preserving these properties in future allocations.
As well, \lstinline{realloc}/\lstinline{reallocarray} preserve initial zero-fill and alignment properties when adjusting storage size, again increasing future allocation safety.
The allocator provides a contention-free statistics gathering mode, and a debugging mode for dynamically checking allocation pre/post conditions and invariants.
These modes are invaluable for understanding and debugging a program's dynamic allocation behaviour, with low enough cost to be used in production code.
An example is presented for condensing the allocation API using advanced type-systems, providing a single type-safe allocation routine using named arguments.
Finally, performance results across a number of benchmarks show llheap is competitive with other modern memory allocators.
\end{abstract}

\begin{CCSXML}
<ccs2012>
<concept>
<concept_id>10011007.10011006.10011072</concept_id>
<concept_desc>Software and its engineering~Software libraries and repositories</concept_desc>
<concept_significance>500</concept_significance>
</concept>
<concept>
<concept_id>10010147.10011777.10011014</concept_id>
<concept_desc>Computing methodologies~Concurrent programming languages</concept_desc>
<concept_significance>300</concept_significance>
</concept>
</ccs2012>
\end{CCSXML}

\ccsdesc[500]{Software and its engineering~Software libraries and repositories}
\ccsdesc[300]{Computing methodologies~Concurrent programming languages}

\keywords{memory allocation, (user-level) concurrency, type-safety, statistics, debugging, high performance}

\received{20 February 2007}
\received[revised]{12 March 2009}
\received[accepted]{5 June 2009}


\maketitle

\section{Introduction}

Memory management services a series of program allocation/deallocation requests and attempts to satisfy them from variable-sized blocks of memory while minimizing total memory usage.
A general-purpose memory allocator cannot anticipate storage requests, so its time and space performance cannot be optimal (bin packing).
Each allocator takes advantage of a subset of typical allocation patterns (heuristics) to produce excellent results, both in time and space (similar to LRU paging).
Nevertheless, allocators are a series of compromises, possibly with static or dynamic tuning parameters to optimize specific request patterns.


\subsection{Memory Structure}
\label{s:MemoryStructure}

Figure~\ref{f:ProgramAddressSpace} shows the typical layout of a program's address space (high addresses to low) divided into a number of zones, with free memory surrounding the dynamic code/data~\cite{memlayout}.
Static code and data are placed into memory at load time from the executable and are fixed-sized at runtime.
Dynamic code/data memory is managed by the dynamic loader for libraries loaded at runtime, which is complex especially in a multi-threaded program~\cite{Huang06}.
However, changes to the dynamic code/data space are typically infrequent, with most occurring at program startup, and are largely outside of a program's control.
Stack memory is managed by the program call/return mechanism using a LIFO technique.
For stackful coroutines and user threads, new stacks are commonly created in the dynamic-allocation memory.
The dynamic-allocation memory is often a contiguous area, which starts empty and grows/shrinks as the program creates/deletes variables with independent lifetime.
The language's runtime manages this area, where management complexity is a function of the mechanism for deleting variables.

\begin{figure}
\centering
\input{AddressSpace.pstex_t}
\vspace{-5pt}
\caption{Program Address Space Divided into Zones}
\label{f:ProgramAddressSpace}
\end{figure}


\subsection{Dynamic Memory-Management}
\label{s:DynamicMemoryManagement}

Modern programming languages provide two forms of storage management: managed or unmanaged.
Both forms have explicit allocation, but managed memory has implicit deallocation (garbage collection~\cite{Wilson92}, GC) and unmanaged memory has some form of explicit deallocation.
Managed memory sometimes also provides explicit deallocation hints.
Both forms attempt to reuse freed storage in the heap for new allocations.
Unmanaged languages have no information about allocated \newterm{objects}, and hence, use techniques during freeing to detect adjacent unused storage when coalescing.
Conservative GC attempts to find free objects in an unmanaged system by scanning memory and marking anything that \emph{looks} like a live object.
However, \emph{conservative} means some non-objects might be marked as live;
the goal is not to miss any live objects.
Managed languages maintain sufficient information to locate all live objects.
Precise GC is then able to mark just the live objects.
Both approaches then sweep through the unmarked objects looking for adjacent free storage to coalesce.
Precise GC has a further coalescing option of compacting used objects and adjusting the pointers used to find them to the new locations, resulting in a large area of contiguous free storage.
Languages such as Lisp~\cite{CommonLisp}, Java~\cite{Java}, Haskell~\cite{Haskell}, and Go~\cite{Go} are managed and normally implemented using precise GC.
(Both Go~\cite{Go1.3} and Netscape JavaScript~\cite{JavaScriptGC} switched from conservative to precise GC.)
Languages such as C~\cite{C}, \CC~\cite{C++}, Rust~\cite{Rust}, and Swift~\cite{swift} (because of explicit management of weak references) are unmanaged but could be used with conservative GC~\cite{Boehm88}.
This work only examines unmanaged memory with \emph{explicit} deallocation.
% While GC is not part of this work, some of the results are applicable to the allocation phase in any memory-management approach.

Most programs use a general-purpose allocator, usually the one provided by the programming-language's runtime.
In certain languages, programmers can write specialized allocators for specific needs.
POSIX~\cite{POSIX17} provides for replacement of the default memory allocator in C and \CC through a standard API.
Most industry JVMs provide multiple GCs, from which a user selects one for their workload.
%Jikes RVM MMTk~\cite{MMTk} provides a similar generalization for the Java virtual machine.
As well, new languages support concurrency (kernel/user threading), which must be safely handled by the allocator.
Hence, alternative allocators exist for C/\CC with the goal of scaling in multi-threaded programs~\cite{Berger00,mtmalloc,streamflow,tcmalloc}.
This work examines the design of high-performance allocators for use by kernel and user multi-threaded applications written in C/\CC.


\subsection{Contributions}
\label{s:Contributions}

This work provides the following contributions to the area of explicit concurrent dynamic-allocation.
\begin{enumerate}[leftmargin=18pt,topsep=3pt,itemsep=0pt]
\item
Implementation of a new stand-alone concurrent low-latency memory-allocator, called llheap~\cite{llheap}, ($\approx$1,500 lines of code) for C/\CC programs using kernel threads (1:1 threading), and specialized versions for the concurrent languages \uC~\cite{uC++} and \CFA~\cite{Moss18,Delisle21} using user-level threads running on multiple kernel threads (M:N threading).

\item
Extend the C allocation API with new functions @aalloc@, @amemalign@, @cmemalign@, @resize@, @aligned_resize@, @aligned_realloc@, and @aligned_reallocarray@ to make allocation properties orthogonally accessible (see the usage sketch after this list).

\item
Extend the C allocation semantics by preserving with each allocation its request size, the amount allocated, whether it is zero fill, and its alignment.

\item
Provide additional query operations @malloc_alignment@, @malloc_zero_fill@, and @malloc_size@ to access allocation information.

\item
Use the preserved zero fill and alignment as \emph{sticky} properties for @realloc@ and @reallocarray@ to zero-fill and align when storage is extended or copied.
Without this extension, it is unsafe to @realloc@ storage if the properties are not preserved when copying.
This silent problem is unintuitive to programmers and difficult to locate because it is transient.

\item
Provide optional extensive, fast, and contention-free allocation statistics to understand allocation behaviour.

\item
Provide runtime checks to validate allocation operations and identify the amount of unfreed storage at program termination.

\item
Build 8 different versions of the allocator: static or dynamic linking, with or without statistics or debugging.
A program may link to any of these 8 versions of the allocator often without recompilation (linking or @LD_PRELOAD@).

\item
Demonstrate how advanced programming-language type-systems can condense the allocation API providing a single type-safe allocation function using named arguments.

\item
Create a benchmark test-suite for comparing allocators, rather than relying on a suite of arbitrary programs.

\item
Run performance experiments using the new benchmark test-suite comparing llheap with six of the best allocators in use today.
The goal is to demonstrate that llheap's performance, both in time and space, is comparable to the best allocators in use today.
\end{enumerate}
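
For example, the following fragment sketches how the extended API and query routines compose; the calls follow the llheap API introduced above, but the program fragment is illustrative rather than taken from llheap's test suite.
\begin{C++}
// Illustrative use of the extended allocation API.
struct S { double d[4]; };
// aligned (64-byte), zero-filled array of 10 elements
struct S * s = cmemalign( 64, 10, sizeof( struct S ) );
// query the properties preserved with the allocation
assert( malloc_alignment( s ) == 64 && malloc_zero_fill( s ) );
// sticky properties: the extension stays 64-byte aligned and zero fills the new space
s = realloc( s, 20 * sizeof( struct S ) );
\end{C++}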


\section{Background}

The following is an overview of allocator design options that affect memory usage and performance (see~\cite{Zulfiqar22} for more details).
Dynamic allocation acquires and releases \newterm{object} storage via calls such as @malloc@/@new@ and @free@/@delete@ in C/\CC, respectively.
A \newterm{memory allocator} contains a complex data-structure and code that manages the layout of objects in the dynamic-allocation zone.
% The management goals are to make allocation/deallocation operations as fast as possible while densely packing objects to make efficient use of memory.
Since objects in C/\CC cannot be moved, only adjacent free storage can be \newterm{coalesced} into larger free areas.
The allocator grows or shrinks the dynamic-allocation zone to obtain storage for objects and reduce memory usage using \newterm{operating system} (OS) calls, such as @mmap@ or @sbrk@ in UNIX.


\vspace*{-7pt}
\subsection{Allocator Components}
\label{s:AllocatorComponents}

Figure~\ref{f:AllocatorComponents} shows the two important data components for a memory allocator, management and storage, collectively called the \newterm{heap}.
The \newterm{management data} is a data structure located at a known memory address and contains fixed-sized information in the static-data memory that references components in the dynamic-allocation memory.
For multi-threaded programs, additional management data may exist in \newterm{thread-local storage} (TLS) for each kernel thread executing the program.
The \newterm{storage data} is composed of allocated/freed objects, and \newterm{reserved memory}.
Allocated objects (white) are variable sized, and are allocated and maintained by the program;
\ie only the program knows the location of allocated storage.
Freed objects (light grey) represent memory deallocated by the program, which are linked into one or more lists facilitating location for new allocations.
Reserved memory (dark grey) is one or more blocks of memory obtained from the OS but not yet used by the program;
if there are multiple reserved blocks, they are normally linked together.

\begin{figure}
\centering
\input{AllocatorComponents}
\caption{Allocator Components (Heap)}
\label{f:AllocatorComponents}
\end{figure}

In many allocator designs, allocated objects and reserved blocks have management data embedded within them (see also Section~\ref{s:ObjectContainers}).
Figure~\ref{f:AllocatedObject} shows an allocated object with a header, trailer, and optional spacing around the object.
The header contains information about the object, \eg size, type, \etc.
The trailer may be used to simplify coalescing and/or for safety purposes to mark the end of an object.
An object may be preceded by padding to ensure proper alignment.
Some algorithms quantize allocation requests, resulting in additional space after an object.
When padding and spacing are necessary, neither can be used to satisfy a future allocation request while the current allocation exists.

A free object often contains management data, \eg size, pointers, \etc.
Often the free list is linked internally, so it does not consume additional storage, \ie the link fields are placed at known locations in the unused memory blocks.
For internal linking, the amount of management data for a free node defines the minimum allocation size, \eg if 16 bytes are needed for a free-list node, allocation requests less than 16 bytes are rounded up.
Often the minimum storage alignment and free-node size are the same.
When an object transitions from allocated to freed or vice versa, its contents are overwritten by management information or new program data, respectively.
For safety purposes, freed storage may be scrubbed (overwritten) to expose inadvertent bugs, such as assuming variables are zero initialized.
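
As a concrete illustration, a header and an internally linked free node might be declared as follows; the layout is hypothetical, not llheap's actual header.
\begin{C++}
#include <stddef.h>
// Hypothetical per-object header with an internally linked free node:
// the header precedes the object, and the link overlays the first object
// bytes once the object is freed.
struct Header {
	size_t size;                  // object size, possibly with flag bits
};
struct FreeNode {
	struct Header header;         // management data retained while freed
	struct FreeNode * next;       // link placed in the unused object storage
};
// The link field sets the minimum object size: smaller requests are rounded
// up so every freed object can hold the free-list link.
\end{C++}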

\begin{figure}
\centering
\input{AllocatedObject}
\caption{Allocated Object}
\label{f:AllocatedObject}

\bigskip

\input{IntExtFragmentation}
\caption{Internal and External Fragmentation}
\label{f:InternalExternalFragmentation}
\end{figure}


\subsection{Single-Threaded Memory-Allocator}
\label{s:SingleThreadedMemoryAllocator}

In a sequential (single-threaded) program, the program thread performs all allocation operations without direct concurrency issues.
However, interrupts introduce indirect concurrency if the signal handler performs allocation/deallocation (serially reusable problem~\cite{SeriallyReusable}).
In general, the primary issues in a single-threaded allocator are fragmentation and locality.


\subsubsection{Fragmentation}
\label{s:Fragmentation}

Fragmentation is unused memory requested from the OS.
Figure~\ref{f:InternalExternalFragmentation} shows fragmentation has two forms: \emph{internal} or \emph{external}.

\newterm{Internal fragmentation} is inaccessible \emph{allocated} memory, such as headers, trailers, \etc.
Internal fragmentation is problematic when management space approaches the object size, \eg for objects $<$16 bytes, memory usage doubles.

\newterm{External fragmentation} is memory not allocated in the program~\cite{Wilson95,Lim98,Siebert00}, which includes all external management data, freed objects, and reserved memory.
This memory is problematic, resulting in heap blowup and fragmented memory.
\newterm{Blowup} occurs when freed memory becomes a checkerboard of adjacent allocated and free areas, where the free blocks are too small to service requests, leading to unbounded external fragmentation growth~\cite{Berger00}.
Heap blowup is a fundamental problem in unmanaged languages without compaction.

Three basic approaches for controlling fragmentation are identified~\cite{Johnstone99}.
The first approach is \newterm{sequential-fit} with a list of free objects (possibly ordered by size) that is searched for a block large enough to fit a requested object.
Different search policies determine the free object selected, \eg the first free object large enough (first fit) or closest to the requested size (best fit).
Any storage larger than the request can become spacing after the object or split into a smaller free object.

The second approach is \newterm{segregation} or \newterm{binning} with a set of lists for different sized freed objects.
The request size is rounded up to the nearest bin size, often leading to internal fragmentation after the object.
A binning algorithm searches for the smallest bin that covers the request, and selects the first free object, if available.
Fewer bin sizes mean more internal fragmentation but increased reuse, as more request sizes match a bin size.
More bin sizes mean less internal fragmentation but more external fragmentation, as larger bins cannot service smaller requests.
Allowing larger bins to service smaller allocations means the freed object can be returned to the matching or larger bin (some advantages to either scheme).
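
As an example of binning, a common scheme quantizes requests to power-of-2 sizes; the following is a minimal sketch of the technique, not a specific allocator's size classes.
\begin{C++}
// Minimal power-of-2 binning sketch (illustrative size classes).
enum { MinSize = 16 };
static inline size_t bin_size( size_t request ) {
	size_t size = MinSize;
	while ( size < request ) size <<= 1;   // round up to the next bin
	return size;                           // internal fragmentation = size - request
}
// bin_size( 24 ) returns 32, giving 8 bytes of internal fragmentation.
\end{C++}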

The third approach is \newterm{splitting} and \newterm{coalescing}.
If there is no matching free storage for an allocation, a larger free object is split to satisfy the allocation and the smaller remainder is put back on the free list.
For example, in the \newterm{buddy system}, a block of free memory is split into equal chunks, and splitting continues until a minimal block is created that fits the allocation.
When an object is deallocated, it is coalesced with the objects immediately before/after it in memory, if they are free, creating a larger block.
Coalescing can be done eagerly at each deallocation or lazily when an allocation cannot be fulfilled.
However, coalescing increases latency (unbounded delays) for both allocation and deallocation.
While coalescing does not reduce external fragmentation, the coalesced blocks improve fragmentation quality so future allocations are less likely to cause heap blowup.
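
In a binary buddy system, a block's buddy is located with a single exclusive-or, which is what makes the coalescing check cheap; the following sketch assumes power-of-2 block sizes and a heap base aligned to the largest block.
\begin{C++}
#include <stdint.h>
// Binary-buddy address arithmetic sketch: assumes power-of-2 block sizes
// and a heap base aligned to the largest block size.
static inline void * buddy_of( char * heapBase, void * block, size_t size ) {
	uintptr_t offset = (uintptr_t)block - (uintptr_t)heapBase;
	return heapBase + (offset ^ size);   // flip the bit selecting the buddy pair
}
// If the buddy is free and the same size, the pair coalesces into a block of
// twice the size, and the check repeats at the next level.
\end{C++}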


\subsubsection{Locality}
\label{s:Locality}

The principle of locality recognizes that programs tend to reference a small set of data, called a \newterm{working set}, for a certain period of time, composed of temporal and spatial accesses~\cite{Denning05}.
% Temporal clustering implies a group of objects are accessed repeatedly within a short time period, while spatial clustering implies a group of objects physically close together (nearby addresses) are accessed repeatedly within a short time period.
% Temporal locality commonly occurs during an iterative computation with a fixed set of disjoint variables, while spatial locality commonly occurs when traversing an array.
Hardware takes advantage of the working set through multiple levels of caching and paging, \ie the memory hierarchy.
% When an object is accessed, the memory physically located around the object is also cached with the expectation that the current and nearby objects will be referenced within a short period of time.
% For example, entire cache lines are transferred between cache and memory, and entire virtual-memory pages are transferred between memory and disk.
% A program exhibiting good locality has better performance due to fewer cache misses and page faults\footnote{With the advent of large RAM memory, paging is becoming less of an issue in modern programming.}.

Temporal locality is largely controlled by program accesses to variables~\cite{Feng05}.
An allocator has only indirect influence on temporal locality but largely dictates spatial locality.
For temporal locality, an allocator tries to return recently freed storage for new allocations, as this memory is still \emph{warm} in the memory hierarchy.
For spatial locality, an allocator places objects used together close together in memory, so the working set of the program fits into the fewest possible cache lines and pages.
% However, usage patterns are different for every program as is the underlying hardware memory architecture;
% hence, no general-purpose memory-allocator can provide ideal locality for every program on every computer.

An allocator can easily degrade locality by increasing the working set.
For example, it can access an unbounded number of free objects when matching an allocation or coalescing, causing multiple cache or page misses~\cite{Grunwald93}.
An allocator can also spatially separate related data by binning free storage anywhere in memory, so related objects become highly separated.


\subsection{Multi-Threaded Memory-Allocator}
\label{s:MultiThreadedMemoryAllocator}

In a concurrent program, multiple kernel threads (KT) perform allocations, requiring some form of mutual exclusion.
Along with fragmentation and locality issues, a multi-threaded allocator must deal with false sharing and additional forms of heap blowup.


\subsubsection{Mutual Exclusion}
\label{s:MutualExclusion}

% \newterm{Mutual exclusion} provides sequential access to the shared-management data of the heap.
There are two performance issues for mutual exclusion.
First is the cost of performing atomic instructions every time a shared resource is accessed to provide mutual exclusion.
Solutions using any atomic fence, atomic instruction (lock free), or lock along the fast path result in significant slowdown, even with zero contention.
Second is \newterm{contention} on simultaneous access, where threads must wait until the resource is released.
Contention can be reduced by:
1) using multiple fine-grained locks versus a few coarse-grained locks to spread the contention;
2) using trylock and generating new storage if the lock is busy (classic space versus time tradeoff).
% 3) Using one of the many lock-free approaches for reducing contention on basic data-structure operations~\cite{Fatourou12}.
% However, all approaches have degenerate cases where program contention to the heap is high, which is beyond the allocator's control.
Figure~\ref{f:ThreadHeapRelationship} shows how a multi-threaded allocator reduces contention by subdividing a single heap into multiple heaps.

\begin{figure}
\centering
\subfloat[T:1]{
	\input{SingleHeap}
	\label{f:SingleHeap}
} % subfloat
\vrule
\subfloat[T:H]{
	\input{SharedHeaps}
	\label{f:SharedHeaps}
} % subfloat
\vrule
\subfloat[1:1]{
	\input{PerThreadHeap}
	\label{f:PerThreadHeap}
} % subfloat
\caption{Multiple Heaps, Thread:Heap Relationship}
\label{f:ThreadHeapRelationship}
\end{figure}

\begin{description}[leftmargin=*]
\item[T:1 model (Figure~\ref{f:SingleHeap})] is all threads (T) sharing a single heap (1).
% The arrows indicate memory movement for allocation/deallocation operations.
% Memory is obtained from freed objects, reserved memory, or the OS;
% freed memory can be returned to the OS.
To handle concurrency, a single lock is used for all heap operations or fine-grained (lock-free) locking if operations can be made independent.
As threads perform large numbers of allocations, a single heap becomes a significant source of contention.

\item[T:H model (Figure~\ref{f:SharedHeaps})] is multiple threads (T) sharing multiple heaps (H).
The allocator allocates/deallocates heaps and assigns threads to heaps, often based on dynamic contention pressure.
While locking is required for heap access, contention is (normally) reduced as access is spread across the heaps.
Locking can be reduced (eliminated) using the T:C variant, \ie each CPU has a heap, and a thread cannot migrate from the CPU if executing an allocator critical-section, implemented with restartable critical sections~\cite{Desnoyers19,Dice02} (see also Section~\ref{s:UserlevelThreadingSupport}).
% The goal is minimal heaps (storage) and contention per heap (time).
Multiple heaps increase external fragmentation as the ratio of heaps to threads increases, which can lead to heap blowup, where the worst-case scenario is more heaps than threads.
The external fragmentation is now multiplied by the number of heaps, since each heap manages its own free storage and allocates its own reserved memory.
When freeing, objects normally need to be returned to their original heap (see Section~\ref{s:Ownership}).
% Returning storage to the OS may be difficult or impossible, \eg the contiguous @sbrk@ area in Unix.
% In the worst case, a program in which objects are allocated from one heap but deallocated to another heap means these freed objects are never reused.

A shared \newterm{global heap} (G) is often introduced to manage the reserved memory among heaps and centralize interactions with the OS.
Instead of heaps making individual object allocations/deallocations through the global heap, resulting in locking and high contention, the global heap partitions the reserved memory into heap (allocation) buffers, which are given out to heaps for their own suballocations.
Hence, a heap's allocations are temporally and spatially accessed densely in a small set of buffers, rather than spread sparsely across the entire reserved memory.
Buffers are allocated at heap startup, after which allocation often reaches a steady state through free lists.
Allocation buffers may increase external fragmentation, since some memory may never be used.

\item[1:1 model (Figure~\ref{f:PerThreadHeap})] is each thread (1) having its own heap (1), eliminating most contention and locking if threads seldom access another thread's heap (see Section~\ref{s:Ownership}).
A thread's objects are consolidated in its heap, better utilizing the cache and paging during thread execution.
In contrast, the T:H model can spread thread objects over a larger area in different heaps.
%For example, assume page boundaries coincide with cache line boundaries, if a thread heap always acquires pages of memory then no two threads share a page or cache line unless pointers are passed among them.
When a thread terminates, it can free its heap objects to the global heap, or the thread heap is retained as-is and reused for a new thread in the future.
Destroying a heap can reduce external fragmentation sooner, since all free objects in the global heap are available for immediate reuse.
Alternatively, reusing a heap can aid the inheriting thread, if it has a similar allocation pattern, because the heap is primed with freed storage of the right sizes.
\end{description}



\subsubsection{False Sharing}
\label{s:FalseSharing}

False sharing occurs for read/write or write/write accesses among threads modifying different objects that share a cache line~\cite{Bolosky93}.
A write invalidates the other thread's cached copy of the line, even though each thread may be uninterested in the other's modified object.
False sharing can occur in three ways:
1) thread T$_1$ allocates objects O$_1$ and O$_2$ on the same cache line and passes O$_2$'s reference to thread T$_2$;
2) thread T$_1$ allocates object O$_1$ and thread T$_2$ allocates O$_2$, where O$_1$ and O$_2$ are on the same cache line;
3) T$_2$ deallocates O$_2$, T$_1$ allocates O$_1$ on the same cache line as O$_2$, and T$_2$ reallocates O$_2$ while T$_1$ is using O$_1$.
In all three cases, the false sharing is hidden and possibly transient (non-deterministic), making it extremely difficult to find and fix.
Case 1) occurs in all three allocator models, and is induced by program behaviour, not the allocator.
Cases 2) and 3) are allocator induced, and occur in the T:1 and T:H models due to heap sharing, but not in the 1:1 model with private heaps, except possibly at boundary points among heaps.
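
Case 1) is easy to produce inadvertently; the following sketch assumes 64-byte cache lines and uses C11 threads for illustration.
\begin{C++}
#include <threads.h>
#include <stdlib.h>
// Two consecutive small allocations often share one 64-byte cache line;
// handing the second to another thread causes false sharing when both write.
int worker( void * arg ) {                        // thread T2 writes *o2
	for ( int i = 0; i < 10000000; i += 1 ) *(long *)arg += 1;
	return 0;
}
int main() {
	long * o1 = malloc( sizeof( long ) );
	long * o2 = malloc( sizeof( long ) );         // often same cache line as o1
	thrd_t t2;
	thrd_create( &t2, worker, o2 );               // case 1): pass o2 to T2
	for ( int i = 0; i < 10000000; i += 1 ) *o1 += 1; // T1 writes *o1
	thrd_join( t2, NULL );                        // both loops false-share the line
}
\end{C++}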


\subsubsection{Object Containers}
\label{s:ObjectContainers}

Associating header data with every allocation can result in significant internal fragmentation, as shown in Figure~\ref{f:AllocatedObject}.
While the header and object are spatially together in memory, they are generally not accessed temporally together~\cite{Feng05}.
The result is poor cache usage, since only a portion of the cache line is holding useful data from the program's perspective.
% \eg an object is accessed by the program after it is allocated, while the header is accessed by the allocator after it is free.

\begin{figure}
\centering
\input{Container}
\caption{Object Container}
\label{f:ObjectContainer}
\end{figure}

The alternative approach factors common header data to a separate location in memory and organizes associated free storage into blocks called \newterm{object containers} (\newterm{superblocks}~\cite[\S~3]{Berger00}) suballocated from a heap's allocation buffers, as in Figure~\ref{f:ObjectContainer}.
A trailer may also be used at the end of the container.
To find the header from an allocation within the container, the container is aligned on a power of 2 boundary and the lower bits of the object address are truncated (or rounded up, minus the trailer size, to obtain the trailer address).
Container size is a tradeoff between internal and external fragmentation, as some portion of a container may not be used and this portion is unusable for other kinds of allocations.
A consequence of this tradeoff is its effect on spatial locality, which can produce positive or negative results depending on the program's access patterns.
Normally, heap ownership applies to its containers.
Without ownership, different objects in a container may be on different heap free-lists.
Finally, containers are linked together for management purposes, and should all objects in a container become free, the container can be repurposed for different sized objects or given to another heap through a global heap.
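
Because a container is aligned on a power-of-2 boundary, the header for any interior object is found with a single mask; a sketch, assuming a fixed container size:
\begin{C++}
#include <stdint.h>
// Header lookup sketch for containers of an assumed fixed power-of-2 size.
enum { ContainerSize = 64 * 1024 };    // hypothetical 64K containers
struct Container;                      // factored header for all objects within
static inline struct Container * container_of( void * obj ) {
	return (struct Container *)((uintptr_t)obj & ~(uintptr_t)(ContainerSize - 1));
}
\end{C++}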


\subsubsection{Ownership}
\label{s:Ownership}

Object \newterm{ownership} is defined as the heap to which an object is returned upon deallocation~\cite[\S~6.1]{Berger00}.
If a thread returns an object to its originating heap, a heap has ownership of its objects.
Containers force ownership of their internal contiguous objects, unless the entire container changes ownership after it becomes empty.
Alternatively, a thread can return an object to the heap it is currently associated with, which can be any heap accessible during a thread's lifetime.
The advantage of ownership is preventing heap blowup by returning storage for reuse by the owner heap.
Ownership prevents the problem of a producer thread allocating from one heap, passing the object to a consumer thread, and the consumer deallocating the object to another heap, hence draining the producer heap of storage.
The disadvantage of ownership is that deallocating to another thread's heap requires an atomic operation.

Object ownership can be immediate or delayed, meaning free objects may be batched on a separate free list either by the returning or receiving thread.
The returning thread batches objects to reduce contention by passing multiple objects at once;
however, batching across multiple allocation sizes and heaps is complex, and there is no obvious time when to push a batch back to the owner heap.
It is simpler for the returning thread to immediately return objects to the receiving thread's batch list, as the receiving thread has better knowledge of when to incorporate the batch list into its free pool.
The receiving thread often delays incorporating returned storage until its local storage is drained.


\subsubsection{User-Level Threading}

Any heap model can be used with user-level (M:N) threading.
However, an important goal of user threads (UT) is fast operations (creation/termination/context-switching) by not interacting with the OS, allowing large numbers of high-performance interacting threads ($>$ 10,000).
In general, UTs use whatever kernel-level heap-model is provided by the language runtime.
Hence, a UT allocates/deallocates from/to the heap of the KT on which it is executing.

However, there is a subtle concurrency problem with user threading and shared heaps.
With kernel threading, an operation started by a KT is always completed by that thread, even if preempted;
hence, any locking correctness associated with the shared heap is preserved.
However, this correctness property is not preserved for user-level threading.
A UT can start an allocation/deallocation on one KT, be preempted by user-level time slicing, and continue running on a different KT to complete the operation~\cite{Dice02}.
When the UT continues on the new KT, it may have pointers into the previous KT's heap and hold locks associated with it.
To get the same KT safety, time slicing must be disabled/\-enabled around these operations to prevent movement.
However, eagerly disabling time slicing on the allocation/deallocation fast path is expensive, especially as preemption is infrequent (millisecond intervals).
Instead, techniques exist to lazily detect this case in the interrupt handler, abort the preemption, and return to the operation so it completes atomically.
Occasionally ignoring a preemption is normally benign;
in the worst case, ignoring preemption results in starvation.
To mitigate starvation, techniques like rolling the preemption forward at the next context switch can be used.


\section{llheap}

This section presents our new stand-alone, concurrent, low-latency memory allocator, called llheap (low-latency heap), fulfilling the GNU C Library allocator API~\cite{GNUallocAPI} for C/\CC programs using KTs, with specialized versions for the programming languages \uC and \CFA using user-level threads running over multiple KTs (M:N threading).
The primary design objective for llheap is low latency across all allocator calls independent of application access-patterns and/or number of threads, \ie very seldom does the allocator delay during an allocator call.
Excluded from the low-latency objective are (large) allocations requiring initialization, \eg zero fill, and/or data copying, along with unbounded delays to acquire storage from the OS or OS scheduling, all of which are outside the allocator's purview.
A direct consequence of this objective is very simple or no storage coalescing;
hence, llheap's design is willing to use more storage to lower latency.
This objective is apropos because systems research and industrial applications are striving for low latency and modern computers have huge amounts of RAM.
Finally, llheap's performance must be comparable with current allocators, both in space and time (see performance comparison in Section~\ref{c:Performance}).


\subsection{llheap Design}

Figure~\ref{f:llheapDesign} shows the design of llheap, which uses the following features:
1:1 allocator model eliminating locking on the fast path,
separate small (@sbrk@) and large object management (@mmap@),
headers per allocation versus containers,
small object binning (buckets) forming lists for different sized freed objects,
optional fast-lookup table for converting allocation requests into bucket sizes,
no coalescing to minimize latency,
optional heap ownership (build time),
reserved memory (buffer pool) per heap obtained from a global pool,
global heap managing freed thread heaps and interacting with the OS to obtain storage,
and an optional statistic-counters table for accumulating counts of allocation operations plus a debugging version for testing (build time).

\begin{figure}
\centering
% \includegraphics[width=0.65\textwidth]{figures/NewHeapStructure.eps}
\input{llheap}
\caption{llheap Design}
\label{f:llheapDesign}
\end{figure}

llheap starts by creating an empty array for $N$ global heaps from storage obtained using @mmap@ that persists for the program duration, where $N$ is the number of computer cores.
There is a global last-array pointer and a bump-pointer within this array to locate the next free heap storage.
When an array's storage is exhausted, another empty array is allocated.
Terminated threads push their heap onto a global-stack top-pointer, where free heaps are intrusively linked.
When statistics are turned on, there is a global top pointer for an intrusive linked-list linking \emph{all} the heaps (not shown), which is traversed to accumulate statistics counters across heaps when @malloc_stats@ is called.

When a KT starts, it pops heap storage from the heap free-list, or if empty, gets the next free heap-storage.
When a KT terminates, its heap is pushed onto the heap free-list for reuse by a new KT, which prevents unbounded heap growth.
The free heaps are stored in a stack so hot storage is reused first.
Preserving all heaps created during the program lifetime solves the storage lifetime problem when ownership is used.
This approach wastes storage if a large number of KTs are created/terminated at program start and then the program continues sequentially, which is rare.

Each heap uses segregated free-buckets that have free objects distributed across 60 different sizes from 16 to 16M.
All objects in a bucket are the same size.
The number of buckets used is determined dynamically depending on the crossover point from @sbrk@ to @mmap@ allocation, which is specified by calling @mallopt( M_MMAP_THRESHOLD )@, where the crossover must be $\ge$ the page size and $\le$ the largest bucket (16M).
Each cache-aligned bucket has a stack of the same-sized freed objects, where a stack ensures hot storage is reused first.
llheap can be configured with object ownership, where an object is freed to the heap from which it is allocated, or object no-ownership, where an object is freed to the KT's current heap.
For ownership, a shared remote stack is added to the freelist structure, so push/pop operations require locking.
Pushes are eager on each remote free \vs batching; pops are lazy: when no cheap storage is available, the entire remote stack is gulped and added to the bucket's free list.
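
One common way to realize such a remote stack is a lock-free stack where non-owner threads push individually and the owner empties (gulps) the whole list with one atomic exchange; a minimal sketch using C11 atomics, which may differ from llheap's actual synchronization:
\begin{C++}
#include <stdatomic.h>
#include <stddef.h>
struct FreeNode { struct FreeNode * next; };
// Remote push: a non-owner thread returns one object to the owner's bucket.
void remote_push( _Atomic(struct FreeNode *) * top, struct FreeNode * n ) {
	n->next = atomic_load( top );
	while ( ! atomic_compare_exchange_weak( top, &n->next, n ) ) {}
}
// Gulp: the owner takes the entire remote stack in one atomic exchange and
// then links the nodes into its private free list without further locking.
// Only the owner removes nodes, which sidesteps the classic ABA problem.
struct FreeNode * gulp( _Atomic(struct FreeNode *) * top ) {
	return atomic_exchange( top, NULL );
}
\end{C++}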

Initial threads are assigned empty heaps from the heap array.
The first thread allocation causes a request for storage from the shared @sbrk@ area.
The size of this request is the maximum of the request size and the @sbrk@-extension-size / 16.
This heuristic means the @sbrk@ area is subdivided into separate heap buffers (HB) per thread, providing no contention and data locality.
A thread does bump allocation in its current buffer until it starts reusing freed storage, or there is insufficient storage and it obtains another buffer.
Thread buffers are not linked;
they are only logically connected to the thread through allocated and deallocated storage.
When a thread ends, its heap is returned to the heap array but no storage is released.
A new thread receiving a freed heap starts with it fully populated with freed storage.
The heuristic is that threads often do similar work, so the free storage in the heap is reusable, resulting in less internal fragmentation.
%The heuristic is that threads often do similar work so the free storage in the heap is immediately available.
%The downside is the risk of more external fragmentation, if the freed storage is never reused.
The downside is external fragmentation if the freed storage is never reused.

Algorithm~\ref{alg:heapObjectAlloc} shows the allocation outline for an object of size $S$.
The allocation is divided into small (@sbrk@) or large (@mmap@).
For small allocations, $S$ is quantized into a bucket size.
Quantizing is performed using a direct lookup for sizes $<$ 64K or a binary search over the ordered bucket array for $S \ge$ 64K.
Then, the allocation storage is obtained from the following locations, in order of increasing latency: the bucket's free stack, the heap's local buffer, the bucket's remote stack, the global buffer, and the OS (@sbrk@).
For large allocations, the storage is directly allocated from the OS using @mmap@.
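
A sketch of the two-level quantization, assuming a precomputed index table for small sizes and an ordered array of bucket sizes:
\begin{C++}
#include <stddef.h>
// Quantization sketch: direct lookup below 64K, binary search above.
enum { NBUCKETS = 60 };
extern const size_t bucketSizes[NBUCKETS];     // ordered bucket sizes
extern const unsigned char lookup[64 * 1024];  // size -> bucket index (precomputed)
static inline size_t quantize( size_t S ) {
	if ( S < 64 * 1024 ) return bucketSizes[ lookup[ S ] ]; // O(1) table lookup
	size_t lo = 0, hi = NBUCKETS - 1;          // find smallest bucket >= S
	while ( lo < hi ) {
		size_t mid = (lo + hi) / 2;
		if ( bucketSizes[ mid ] < S ) lo = mid + 1; else hi = mid;
	}
	return bucketSizes[ lo ];
}
\end{C++}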

\begin{algorithm}[t]
\caption{Dynamic object allocation of size $S$}
\label{alg:heapObjectAlloc}
\begin{algorithmic}
\STATE $S \gets S + \text{header-size}$
\IF {$S < \textit{mmap-threshold}$}
	\STATE $\textit{B} \gets \text{smallest free-bucket} \geq S$
	\IF {$\textit{B's free-list \(\neg\)empty}$}
		\STATE $\textit{O} \gets \text{pop an object from B's free-list}$
	\ELSIF {$\textit{heap's allocation buffer} \ge B$}
		\STATE $\textit{O} \gets \text{bump allocate object of size B from allocation buffer}$
	\ELSIF {$\textit{heap's remote-list \(\neg\)empty}$}
		\STATE $\textit{merge heap's remote-list into free-list}$
		\STATE $\textit{O} \gets \text{pop an object from B's free-list}$
	\ELSE
		\STATE $\textit{O} \gets \text{allocate an object of size B from global pool}$
	\ENDIF
\ELSE
	\STATE $\textit{O} \gets \text{allocate an object of size S using \lstinline{mmap} system-call}$
\ENDIF
\RETURN $\textit{O}$
\end{algorithmic}
\end{algorithm}

Algorithm~\ref{alg:heapObjectFreeOwn} shows the deallocation (free) outline for an object at address $A$ with ownership.
First, the address is divided into small (@sbrk@) or large (@mmap@).
For small allocations, the bucket associated with the request size is retrieved from the allocation header.
If the bucket is local to the thread, the allocation is pushed onto the thread's associated bucket.
If the bucket is not local to the thread, the allocation is pushed onto the owning thread's remote stack.
For large allocations, the storage is unmapped back to the OS.
Without object ownership, the algorithm is the same as for ownership except when the bucket is not local to the thread.
In that case, the corresponding bucket of the owner thread is computed by the deallocating thread, and the allocation is pushed onto the deallocating thread's corresponding bucket, \ie no search is required.

\begin{algorithm}[t]
\caption{Dynamic object free at address $A$ with object ownership}
\label{alg:heapObjectFreeOwn}
\begin{algorithmic}
\IF {$\textit{A heap allocation}$}
	\STATE $\text{B} \gets \textit{A's owner bucket}$
	\IF {$\textit{B's thread = current heap thread}$}
		\STATE $\text{push A to B's free-list}$
	\ELSE
		\STATE $\text{push A to B's remote-list}$
	\ENDIF
\ELSE
	\STATE $\text{return A to system using system call \lstinline{munmap}}$
\ENDIF
\end{algorithmic}
\end{algorithm}

\begin{comment}
\begin{algorithm}
\caption{Dynamic object free at address $A$ without object ownership}
\label{alg:heapObjectFreeNoOwn}
\begin{algorithmic}[1]
\IF {$\textit{A mapped allocation}$}
	\STATE $\text{return A's dynamic memory to system using system call \lstinline{munmap}}$
\ELSE
	\STATE $\text{B} \gets \textit{A's owner bucket}$
	\IF {$\textit{B is thread-local heap's bucket}$}
		\STATE $\text{push A to B's free-list}$
	\ELSE
		\STATE $\text{C} \gets \textit{thread local heap's bucket with same size as B}$
		\STATE $\text{push A to C's free-list}$
	\ENDIF
\ENDIF
\end{algorithmic}
\end{algorithm}
\end{comment}

Finally, the llheap design funnels all allocation/deallocation operations through the @malloc@ and @free@ routines, which are the only routines to directly access and manage the internal data structures of the heap.
Other allocation operations, \eg @calloc@, @memalign@, and @realloc@, are composed of calls to @malloc@ and possibly @free@, and may manipulate header information after storage is allocated.
This design simplifies heap-management code during development and maintenance.
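
For instance, a zero-filled allocation can be composed from the funnel routine plus post-processing of the header; a simplified sketch, not llheap's actual code, where the header update is a hypothetical helper:
\begin{C++}
#include <stdlib.h>
#include <string.h>
// Simplified composition of calloc from malloc: allocate, record the sticky
// zero-fill property in the header, and clear the storage.
void * calloc_sketch( size_t dim, size_t elemSize ) {
	size_t size = dim * elemSize;        // real code must check for overflow
	void * addr = malloc( size );
	if ( addr ) {
		mark_zero_fill( addr );          // hypothetical header update
		memset( addr, 0, size );
	}
	return addr;
}
\end{C++}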


\subsubsection{Bounded Allocation}

The llheap design results in bounded allocation.
For small allocations, once all the buckets have freed objects, storage is recycled.
For large allocations, the storage is directly recycled back to the OS.
When a thread terminates, its heap is recycled to the next new thread and the above process begins for that thread.
The pathological case is threads allocating a large amount of storage, freeing it, and then quiescing, which demonstrates that the bound constant can be large.
This pathological pattern occurs for \emph{immortal} threads, \eg I/O threads with program lifetime and bursts of activity performing many allocations/deallocations.
Hence, independent of external fragmentation in thread heaps, storage cannot grow unbounded unless the program does not free.


\subsubsection{Alignment}

The minimum storage alignment $M$ comes from the architecture application-binary-interface (ABI) based on hardware factors: bus width (32 or 64-bit), largest register (double, long double), largest atomic instruction (double compare-and-swap), or vector data (Intel MMX).
An access with a nonaligned address may be slow or cause an error.
A memory allocator must assume the largest hardware requirement because it is unaware of the data type occupying the allocation.
Often the minimum storage alignment is an 8/16-byte boundary on a 32/64-bit computer, respectively.
Alignments larger than $M$ are powers of 2, such as page alignment (4/8K).
Any alignment less than $M$ is raised to the minimal alignment.
830
831llheap aligns its allocation header on an $M$ boundary and its size is $M$, making the following data $M$ aligned.
832This pattern means there is no minimal alignment computation along the allocation fast path, \ie new storage and reused storage is always correctly aligned.
833An alignment $N$ greater than $M$ is accomplished with a \emph{pessimistic} request for storage that ensures \emph{both} the alignment and size request are satisfied.
834\begin{center}
835\input{Alignment2}
836\end{center}
837The amount of storage necessary is $alignment - M + size$, which ensures there is an address, $A$, after the storage returned from @malloc@, $P$, that is a multiple of $alignment$ followed by sufficient storage for the data object.
838The approach is pessimistic if $P$ happens to have the correct alignment $N$, and the initial allocation has requested sufficient space to move to the next multiple of $N$.
839In this case, there is $alignment - M$ bytes of unused storage after the data object, which could be used by @realloc@.
840Note, the address returned by the allocation is $A$, which is subsequently returned for deallocation.
841However, the deallocation requires the value $P$, which must be computable from $A$, from which $H$ can be computed $P - M$.
842Hence, there must be a mechanism to detect $P$ $\neq$ $A$ and compute $P$ from $A$.
843
844To detect and perform this computation, llheap uses two headers:
845the \emph{original} header $H$ associated with the allocation, and a \emph{fake} header $F$ within this storage before the alignment boundary $A$.
846\begin{center}
847\input{Alignment2Impl}
848\end{center}
849Since every allocation is aligned at $M$, $P$ $\neq$ $A$ only holds for alignments greater than $M$.
850When $P$ $\neq$ $A$, the minimum distance between $P$ and $A$ is $M$ bytes, due to the pessimistic storage allocation.
851Therefore, there is always room for an $M$-byte fake header before $A$.
852The fake header must supply an indicator to distinguish it from a normal header and the location of address $P$ generated by the allocation.
853This information is encoded as an offset from A to P and the initial alignment (discussed in Section~\ref{s:ReallocStickyProperties}).
854To distinguish a fake header from a normal header, the least-significant bit of the alignment is set to 1 because the offset participates in multiple calculations, while the alignment is just remembered data.
855\begin{center}
856\input{FakeHeader}
857\end{center}
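The computation can be sketched in C as follows, where @M@ is the minimal alignment, @N@ is a power of 2 greater than @M@, and @FakeHeader@ is a hypothetical layout for the fake header.
\begin{C++}
typedef struct { size_t offset, alignment; } FakeHeader; // hypothetical M-byte layout

void * memalign_sketch( size_t N, size_t size ) {
    char * P = (char *)malloc( N - M + size ); // pessimistic request
    char * A = (char *)(((uintptr_t)P + N - 1) & ~(uintptr_t)(N - 1)); // round up to multiple of N
    if ( A != P ) { // P not already N-aligned ?
        FakeHeader * F = (FakeHeader *)(A - M); // gap A - P >= M, so F fits before A
        F->offset = A - P; // recover P on deallocation, and H = P - M
        F->alignment = N | 1; // low bit distinguishes fake from normal header
    }
    return A;
}
\end{C++}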

Note, doing alignment with containers requires a separate container for the aligned fixed-sized objects, so there are more kinds of containers that must be managed.


\subsubsection{\lstinline{realloc} and Sticky Properties}
\label{s:ReallocStickyProperties}

The allocation routine @realloc@ provides a memory management pattern for shrinking/enlarging an existing allocation, while maintaining some or all of the object data.
The realloc pattern is simpler than the suboptimal manual steps.
\begin{flushleft}
\setlength{\tabcolsep}{10pt}
\begin{tabular}{ll}
\multicolumn{1}{c}{\textbf{realloc pattern}} & \multicolumn{1}{c}{\textbf{manual}} \\
\begin{C++}
T * naddr = realloc( oaddr, newSize );



\end{C++}
&
\begin{C++}
T * naddr = (T *)malloc( newSize ); $\C[2in]{// new storage}$
memcpy( naddr, oaddr, oldSize < newSize ? oldSize : newSize ); $\C{// copy old bytes}$
free( oaddr ); $\C{// free old storage}$
oaddr = naddr; $\C{// change pointer}\CRT$
\end{C++}
\end{tabular}
\end{flushleft}
The manual steps are suboptimal because they ignore any internal fragmentation at the end of the existing allocation due to bucket sizes.
If this fragmentation is sufficiently large for the size increase, @realloc@ can grow the allocation in place, eliminating a new allocation and copying.
Alternatively, when shrinking, there may be a reasonable crossover point where simply increasing the internal fragmentation eliminates a new allocation and copying.
Hence, using @realloc@ as often as possible can reduce storage management costs.
In fact, if @oaddr@ is @nullptr@, @realloc@ does a @malloc( newSize )@, and if @newSize@ is 0, @realloc@ does a @free( oaddr )@, so all allocation/deallocation can be done with @realloc@.

The hidden problem with this pattern is the effect of zero fill and alignment with respect to reallocation.
For safety, these properties must persist (be ``sticky'') when storage size changes.
Prior to llheap, allocation properties are not preserved across reallocation, nor is it possible to query an allocation to maintain these properties manually.
Hence, a random call to @realloc@ that reallocates storage may cause downstream errors, if allocation properties are needed.
This silent problem is unintuitive to programmers, can cause catastrophic failure, and is difficult to debug because it is transient.
To prevent these problems, llheap preserves initial allocation properties within an allocation, allowing them to be queried, and the semantics of @realloc@ preserve these properties on any storage change.
As a result, the realloc pattern is efficient and safe.
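For example, using the llheap query routine @malloc_zero_fill@ (see Section~\ref{s:ExtendedCAPI}):
\begin{C++}
int * ia = (int *)calloc( 10, sizeof( int ) ); // sets sticky zero-fill property
ia = (int *)realloc( ia, 20 * sizeof( int ) ); // 10 new elements are zero filled
assert( malloc_zero_fill( ia ) ); // sticky property persists and is queryable
\end{C++}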

Note, @realloc@ has a compile-time disadvantage \vs @malloc@, because @malloc@ simplifies optimization opportunities.
For @malloc@ the compiler knows the new storage address is not aliased, which is not true for @realloc@: the same storage can be returned.
The compiler uses this knowledge to optimize the region of code between the @malloc@ call and the point where the pointer escapes or it finds the matching @free@.
For @realloc@, the compiler must also analyse the code \emph{before} the call and this analysis may fail.

Finally, there is a flaw in @realloc@'s definition: if there is no memory to allocate new storage for an expansion, the original allocation is not freed or moved, @errno@ is set to @ENOMEM@, and a null pointer is returned.
This semantics preserves the original allocation so the data is not lost in a failure case.
However, most calls to @realloc@ are written: @p = realloc( p, size )@, so the original storage is leaked when pointer @p@ is overwritten with null, negating the benefit of not freeing the storage for recovery purposes.
Programmers can follow a coding pattern of:
\begin{C++}
char * p;
...
void * p1 = realloc( p, size );
if ( p1 ) p = (char *)p1;
else // release some storage
\end{C++}
However, most programmers ignore return codes.
A better alternative is to change @realloc@'s interface to be like @posix_memalign@, which returns two results, a return code and a storage address, so the error code is separate from the returned storage.
\begin{C++}
int retcode = realloc( (void **)&p, size );
\end{C++}
which returns 0 or @ENOMEM@, only changes @p@ for expansion, but requires an ugly cast on the call.


\subsubsection{Sticky Test}

Since sticky properties are an important safety feature for @realloc@, an ad-hoc @realloc@ test was created (not shown) to test whether a memory allocator preserves zero-fill from @calloc@ and/or alignment from @memalign@.
The first test @calloc@s a large array (zero fill), sets the array to 42, shortens it, and then enlarges it to the original size.
It does these steps 100 times attempting to get a reused large block of memory that is still set to 42, showing new storage does not preserve zero fill.
The second test @memalign@s storage and @realloc@s it multiple times making it larger until the current storage must be copied into new storage.
The alignment of each storage address returned from @realloc@ is verified with the original alignment.
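A minimal sketch of the zero-fill half of the test follows, where @N@ is a large array size and the details differ from the actual harness.
\begin{C++}
for ( int t = 0; t < 100; t += 1 ) { // repeat, attempting to get a reused block
    char * p = (char *)calloc( N, 1 ); // zero filled
    memset( p, 42, N ); // dirty the storage
    p = (char *)realloc( p, N / 2 ); // shorten
    p = (char *)realloc( p, N ); // enlarge to original size
    for ( size_t i = N / 2; i < N; i += 1 ) // enlarged area must be zero if sticky
        if ( p[i] == 42 ) printf( "zero fill not preserved\n" );
    free( p );
}
\end{C++}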

If a test fails, that sticky property is not provided;
if the test passes, that sticky property is provided in some form but not necessarily in all forms (test just got lucky).
If an allocator fails these tests, it is unnecessary to perform a manual inspection of the @realloc@ code for sticky properties.
Only llheap passes both tests, as its @realloc@ applies sticky properties.


\subsubsection{Header}

To preserve allocation properties requires storing additional information about an allocation.
Figure~\ref{f:llheapHeader} shows llheap captures this information in the per-object header, which has two fields (left/right) sized appropriately for 32/64-bit alignment requirements.

\begin{figure}
\centering
\input{Header}
\caption{llheap Header}
\label{f:llheapHeader}
\end{figure}

The left field is a union of three values:
\begin{description}[leftmargin=*,topsep=2pt,itemsep=2pt,parsep=0pt]
\item[bucket pointer]
is for deallocation and points back to the bucket associated with this storage request (see Figure~\ref{f:llheapDesign} for the fields accessible in a bucket).
\item[mapped size]
is for deallocation of mapped storage and is the storage size for unmapping.
\item[next free block]
is an intrusive pointer linking same-size free blocks onto a bucket's stack of free objects.
\end{description}
The low-order 3 bits of these fields are unused for any stored values, due to the minimum alignment of 8 bytes (even for 32-bit addressing).
The 3 unused bits are used to represent mapped allocation, zero filled, and alignment, respectively.
Note, the zero-filled/mapped bits are only used in the normal header and the alignment bit in the fake header.
This implementation allows a fast test of whether any of the lower 3 bits are set (@&@ and compare).
If no bits are on, it implies a basic allocation, which is handled quickly in the fast path for allocation and free;
otherwise, the bits are analysed and appropriate actions are taken for the complex cases.

The right field remembers the allocation request size versus the allocation (bucket) size, \eg a request of 42 bytes is rounded up to 64 bytes.
Since programmers think in request size rather than allocation size, the request size allows better generation of statistics or errors and also helps in memory management.
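A minimal C sketch of the header follows, where the field and flag names are hypothetical.
\begin{C++}
typedef struct Header {
    union { // left field, pointer sized
        struct Bucket * bucket; // allocated (small): owning bucket
        size_t mappedSize; // allocated (mmap): size for munmap
        struct Header * next; // freed: intrusive link on bucket free-stack
    } left;
    size_t requestSize; // right field: original request size
} Header;
enum { MAPPED = 1, ZEROFILL = 2, ALIGNED = 4 }; // 3 unused low-order bits; ALIGNED only in fake header
// fast-path test: no flag bits set => basic allocation
#define BASIC( H ) (((uintptr_t)(H)->left.bucket & 7) == 0)
\end{C++}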


\subsection{Statistics and Debugging}

llheap can be built to accumulate fast and largely contention-free allocation statistics to help understand dynamic memory behaviour.
Incrementing statistic counters necessarily occurs on the allocation fast path, so it must be cheap.
To make statistics performant enough for use on running systems, each heap has its own set of statistic counters, so statistic operations do not require slow atomic operations.
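A minimal sketch of the counter arrangement, with hypothetical names:
\begin{C++}
typedef struct HeapStats {
    uint64_t mallocCalls, mallocStorage; // counters for malloc
    // ... matching counters for calloc, realloc, free, mmap, etc.
} HeapStats;
struct Heap {
    HeapStats stats; // updated only by the owning thread
    struct Heap * next; // link for statistics traversal (see below)
    // ... buckets and remote stack
};
// fast path: plain, non-atomic increments
myHeap->stats.mallocCalls += 1;
myHeap->stats.mallocStorage += size;
\end{C++}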

To locate all statistic counters, heaps are linked together in statistics mode, and this list is locked and traversed to sum all counters across heaps.
Note, the list is locked to prevent errors traversing an active list, which may have nodes added or removed dynamically;
the statistics counters are not locked and can flicker during accumulation.
Hence, printing statistics during program execution is an approximation.
Figure~\ref{f:StatiticsOutput} shows an example of statistics output, which covers all allocation operations and information about deallocating storage not owned by a thread.
No other memory allocator provides such comprehensive statistical information.
These statistics were invaluable during the development of llheap for debugging and verifying correctness, and should be equally valuable to application developers.

\begin{figure}
\begin{C++}
PID: 2167216 Heap statistics: (storage request / allocation)
  malloc    >0 calls 19,938,000,110; 0 calls 2,064,000,000; storage 4,812,152,081,688 / 5,487,040,092,624 bytes
  aalloc    >0 calls 0; 0 calls 0; storage 0 / 0 bytes
  calloc    >0 calls 7; 0 calls 0; storage 1,040 / 1,152 bytes
  memalign  >0 calls 0; 0 calls 0; storage 0 / 0 bytes
  amemalign >0 calls 0; 0 calls 0; storage 0 / 0 bytes
  cmemalign >0 calls 0; 0 calls 0; storage 0 / 0 bytes
  resize    >0 calls 0; 0 calls 0; storage 0 / 0 bytes
  realloc   >0 calls 0; 0 calls 0; storage 0 / 0 bytes
            copies 0; smaller 0; alignment 0; 0 fill 0
  free      !null calls 19,938,000,092; null / 0 calls 4,064,000,004; storage 4,812,152,003,021 / 5,487,040,005,152 bytes
            remote pushes 4; pulls 0; storage 0 / 0 bytes
  sbrk      calls 1; storage 8,388,608 bytes
  mmap      calls 2,000,000; storage 2,097,152,000,000 / 2,105,344,000,000 bytes
  munmap    calls 2,000,000; storage 2,097,152,000,000 / 2,105,344,000,000 bytes
  remainder calls 0; storage 0 bytes
  threads   started 4; exited 4
  heaps     $new$ 4; reused 0
\end{C++}
\caption{Statistics Output}
\label{f:StatiticsOutput}
\end{figure}

llheap can also be built with debug checking, which inserts many asserts along all allocation paths.
These assertions detect incorrect allocation usage, like double frees, unfreed storage, or memory corruption where internal values (like header fields) are overwritten.
These checks are best effort as opposed to complete allocation checking as in @valgrind@~\cite{valgind}.
Nevertheless, the checks detect many allocation problems.
There is a problem in detecting unfreed storage because some library routines assume their allocations persist for the program's lifetime, and hence, do not free their storage.
For example, @printf@ might allocate a 1024-byte buffer on the first call and never delete this buffer.
To prevent a false positive for unfreed storage, it is possible to specify an amount of storage that is never freed (see @malloc_unfreed@ in Section~\ref{s:ExtendedCAPI}), and it is subtracted from the total allocate/free difference.
Determining the amount of never-freed storage is annoying, but once done, any warnings of unfreed storage are application related.
Debugging mode also scrubs each allocation with @0xff@, so assumptions about zero-filled objects generate errors.
Finally, if a program does segment-fault in debug mode, a stack backtrace is printed to help in debugging.

Tests indicate only a 30\% performance decrease when statistics \emph{and} debugging are enabled in programs with 10\% to 15\% allocation cost, and the latency cost for accumulating statistics from each heap is mitigated by limited calls, often only one at the end of the program.


% \subsection{Design Choices}
%
% llheap's design was reviewed and changed multiple times during its development.
% All designs focused on the allocation/free \newterm{fast path}, \ie the shortest code path for the most common operations.
% The model chosen is 1:1, giving one heap per thread for each kernel thread (KT).
% Hence, immediately after a KT starts, its heap is created and just before a KT terminates, its heap is (logically) deleted.
% Therefore, the majority of heap operations are uncontended, modulo operations on the global heap and ownership.
%
% Problems:
% \begin{itemize}[leftmargin=*,topsep=3pt,itemsep=2pt,parsep=0pt]
% \item
% Need to know when a KT starts/terminates to create/delete its heap.
%
% \noindent
% It is possible to leverage constructors/destructors for thread-local objects to get a general handle on when a KT starts/terminates.
% \item
% There is a classic \newterm{memory-reclamation} problem for ownership because storage passed to another thread can be returned to a terminated heap.
%
% \noindent
% The classic solution only deletes a heap after all referents are returned, which is complex.
% The cheap alternative is for heaps to persist for program duration to handle outstanding referent frees.
% If old referents return storage to a terminated heap, it is handled in the same way as an active heap.
% To prevent heap blowup, terminated heaps can be reused by new KTs, where a reused heap may be populated with free storage from a prior KT (external fragmentation).
% In most cases, heap blowup is not a problem because programs have a small allocation set-size, so the free storage from a prior KT is apropos for a new KT.
% \item
% There can be significant external fragmentation as the number of KTs increases.
%
% \noindent
% In many concurrent applications, good performance is achieved with the number of KTs proportional to the number of CPUs.
% Since the number of CPUs is relatively small, and a heap is also relatively small, $\approx$10K bytes (not including any associated freed storage), the worst-case external fragmentation is still small compared to the RAM available on large servers with many CPUs.
% \item
% Need to prevent preemption during a dynamic memory operation because of the \newterm{serially-reusable problem}.
% \begin{quote}
% A sequence of code that is guaranteed to run to completion before being invoked to accept another input is called serially-reusable code.~\cite{SeriallyReusable}\label{p:SeriallyReusable}
% \end{quote}
% If a KT is preempted during an allocation operation, the OS can schedule another KT on the same CPU, which can begin an allocation operation before the previous operation associated with this CPU has completed, invalidating heap correctness.
% Note, the serially-reusable problem can occur in sequential programs with preemption, if the signal handler calls the preempted function, unless the function is serially reusable.
% Essentially, the serially-reusable problem is a race condition on an unprotected critical subsection, where the OS is providing the second thread via the signal handler.

% There is the same serially-reusable problem with UTs migrating across KTs.
% \end{itemize}
% Tests showed this design produced the closest performance match with the best current allocators, and code inspection showed most of these allocators use different variations of this approach.


\subsection{Bootstrapping}

There are problems bootstrapping a memory allocator.
Programs can be statically or dynamically linked.
The order in which the linker schedules startup code is poorly supported, so it cannot be controlled entirely.
Knowing a KT's start and end independently from the KT code is also difficult.

For static linking, the allocator is loaded with the program.
Hence, allocation calls immediately invoke the allocator operation defined by the loaded allocation library and there is only one memory allocator used in the program.
This approach allows allocator substitution by placing an allocation library before any other in the linked/load path.

Allocator substitution is similar for dynamic linking.
However, the dynamic loader starts first and needs to perform dynamic allocations \emph{before} the substitution allocator is loaded.
As a result, the dynamic loader uses a default allocator until the substitution allocator is loaded, after which all allocation operations are handled by the substitution allocator, including those from the dynamic loader.
Hence, some part of the @sbrk@ area may be used by the default allocator, so the substitution allocator's statistics cannot be completely accurate.
Furthermore, dynamic linking uses an assembler trampoline to call the procedure linkage table resolver, so there is an additional cost along the allocator fast path for all allocation operations.
Testing showed up to a 5\% performance decrease with dynamic linking as compared to static linking, even when using @tls_model( "initial-exec" )@ to obtain tighter binding.

After the allocator is loaded, it needs to be initialized before the first allocation request.
Currently, the only mechanism to control initialization is via constructor routines (see below), each with an integer priority, where the linker calls the constructors in increasing order of priority.
However, there are few conventions for priorities amongst libraries, where constructors with equal priorities are called in arbitrary order.
(Only a transitive closure of references amongst library calls can establish an absolute initialization order.)
As a result, the first call to an allocation routine can occur without initialization.
To deal with this problem, it is necessary to have a global flag that is checked along the allocation fast path to trigger initialization (singleton pattern).

Along these lines, there is a subtle problem in defining when a program starts and ends.
For example, prologue/epilogue code outside of the program should not be considered in statistics as the application does not make these calls.
llheap establishes these two points using constructor/destructor routines with initialization priority 100, where system libraries use priorities $\le$ 100 and application programs have priorities $>$ 100.
\begin{flushleft}
\hspace*{\parindentlnth}
\setlength{\tabcolsep}{20pt}
\begin{tabular}{@{}ll@{}}
\begin{C++}
@__attribute__(( constructor( 100 ) ))@
static void startup( void ) {
    // clear statistic counters
    // reset allocUnfreed counter
}

\end{C++}
&
\begin{C++}
@__attribute__(( destructor( 100 ) ))@
static void shutdown( void ) {
    // sum allocUnfreed for all heaps
    // subtract global unfreed storage
    // if allocUnfreed > 0 then print warning message
}
\end{C++}
\end{tabular}
\end{flushleft}
Hence, @startup@ is called after the program prologue but before the application starts, and @shutdown@ is called after the program terminates but before the program epilogue.
By resetting counters in @startup@, prologue allocations are ignored, and checking unfreed storage in @shutdown@ checks only application memory management, ignoring the program epilogue.

Unfortunately, @startup@/@shutdown@ only apply to the program KT, not to any additional KTs created by the program.
However, it is essential for the allocator to know when each KT is started/terminated to initialize/de-initialize the KT's heap.
Initialization can be handled by making the global flag (above) thread-local and then the initialization check along the fast path covers the first allocation by a newly created thread.
De-initialization is handled by registering a destructor routine using @pthread_key_create@ in the initialization code triggered along the fast path, which subsequently calls the destructor at thread termination.
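A minimal sketch of this pattern follows, where @acquireHeap@ and @recycleHeap@ are hypothetical routines that create/reuse and retire a thread's heap.
\begin{C++}
static pthread_key_t heapKey; // one key shared by all threads
static pthread_once_t once = PTHREAD_ONCE_INIT;
static __thread Heap * myHeap; // zero-initialized in each new thread

static void recycleHeap( void * heap ) { /* retire heap for reuse */ } // at thread termination
static void makeKey( void ) { pthread_key_create( &heapKey, recycleHeap ); }

void * malloc( size_t size ) {
    if ( ! myHeap ) { // fast-path check: first allocation by this thread ?
        pthread_once( &once, makeKey );
        myHeap = acquireHeap(); // create or reuse a heap
        pthread_setspecific( heapKey, myHeap ); // trigger recycleHeap at termination
    }
    // ... fast-path allocation from myHeap
}
\end{C++}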


\subsection{User-level Threading Support}
\label{s:UserlevelThreadingSupport}

llheap is the underlying allocator in the user-threading programming languages \uC and \CFA.
These systems have preemptive scheduling, which requires management of timing events through a signal handler (@SIGALRM@).
The complexity in these systems is the serially-reusable problem (see Section~\ref{s:SingleThreadedMemoryAllocator}) when UTs are time sliced (language level) independently from KTs (OS level).
The solution is to prevent interrupts resulting in a CPU or KT change during critical operations, eliminating problems like starting a memory operation on one KT and completing it on another when the underlying heaps are different.
% For user-level threading, the serially-reusable problem occurs with time slicing for preemptable user-level scheduling, as the interrupted UT is unlikely to be restarted on the same KT.
However, without time slicing, a long running UT prevents the execution of other UTs (starvation).

The languages modify llheap using two techniques to prevent time slicing during non-interruptible allocation operations.
On the slow path, when executing expensive operations, time-slicing interrupts are disabled/enabled, so the operation completes atomically on the KT.
On the fast path, all non-interruptible allocation/deallocation routines are placed in a separate code segment.
The linker places this segment into a contiguous block of memory. %, but the order of routines within the block is unspecified.
Then the time-slice signal handler compares the program counter at the point of interrupt with the start/end address of the non-interruptible segment, and if executing within the segment, the signal handler returns without context switching.
The llheap funnel design simplifies this implementation so only a few funnel and statistics routines are located in the non-interruptible section.
% This technique is fragile as no mechanism exists to ensure all crucial code along the fast path is placed into the non-interruptible segment.
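A minimal sketch of the program-counter check for x86-64 Linux follows, assuming linker-provided symbols delimiting the non-interruptible segment.
\begin{C++}
extern char fastpath_start[], fastpath_end[]; // assumed symbols from a linker script

static void sigalrmHandler( int sig, siginfo_t * info, void * uctx ) {
    uintptr_t pc = ((ucontext_t *)uctx)->uc_mcontext.gregs[REG_RIP]; // interrupted PC
    if ( pc >= (uintptr_t)fastpath_start && pc < (uintptr_t)fastpath_end )
        return; // inside allocator fast path => defer the time slice
    // ... otherwise context switch the current user-level thread
}
\end{C++}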

Interestingly, marking non-interruptible operations by bracketing them with a set/reset of a thread-local flag fails, as read/write is not atomic on some machines.
For example, the ARM processor stores the thread-local pointer in a coprocessor register that cannot perform atomic base-displacement addressing.
Hence, there is a window between loading the kernel-thread-local pointer from the coprocessor register into a normal register and adding the displacement when a time slice can move a UT.
As well, switching to a T:C model with restartable critical sections using @librseq@~\cite{Desnoyers19} was examined (see Section~\ref{s:MutualExclusion}).
However, tests showed that while @librseq@ can determine the particular CPU quickly, setting up the restartable critical-section along the allocation fast-path produced a significant decrease in performance.
Also, the number of undoable writes in @librseq@ is limited and restartable sequences cannot deal with UT migration across KTs.
For example, UT$_1$ is executing an allocation by KT$_1$ on CPU$_1$ and a time-slice preemption occurs.
The signal handler context switches UT$_1$ onto the user-level ready-queue and starts running UT$_2$ on KT$_1$, which immediately performs an allocation.
Since KT$_1$ is still executing on CPU$_1$, @librseq@ takes no action because it assumes KT$_1$ is still executing the same critical section.
Then UT$_1$ is scheduled onto KT$_2$ by the user-level scheduler, and its allocation operation continues in parallel with UT$_2$ using references into the heap associated with CPU$_1$, which corrupts CPU$_1$'s heap.
If @librseq@ had an @rseq_abort@ which:
\begin{enumerate}[leftmargin=*,topsep=2pt,itemsep=0pt,parsep=0pt]
\item
marks the current restartable critical-section as cancelled so it restarts when attempting to commit.
\item
does nothing if there is no current restartable critical section in progress.
\end{enumerate}
then @rseq_abort@ could be called on the backside of a user-level context switch.
A feature similar to this idea might exist for hardware transactional memory.
A significant effort was made to make this approach work but its complexity, lack of robustness, and performance costs resulted in its rejection.


\subsection{C API}

Figure~\ref{f:CDynamicAllocationAPI} shows the C dynamic allocation API, which is neither orthogonal nor complete.
For example, it is possible to zero fill or align an allocation but not both, it is only possible to zero fill an array allocation, and it is not possible to resize a memory allocation without data copying.
As a result, programmers must provide missing alternatives, which is error prone and rightly blamed on C's poor allocation API.
Furthermore, newer programming languages have better type systems that can provide safer and more powerful APIs for memory allocation.
The following presents llheap API changes.

\begin{figure}
\hspace*{\parindentlnth}
\begin{tabular}{@{}l|l@{}}
\begin{C++}
void * malloc( size_t size );
void * calloc( size_t dimension, size_t size );
void * realloc( void * oaddr, size_t size );
void * reallocarray( void * oaddr, size_t dimension, size_t size );
void free( void * addr );
void * memalign( size_t alignment, size_t size );
void * aligned_alloc( size_t alignment, size_t size );
int posix_memalign( void ** memptr, size_t alignment, size_t size );
void * valloc( size_t size );
void * pvalloc( size_t size );
\end{C++}
&
\begin{C++}
int mallopt( int option, int value );
size_t malloc_usable_size( void * addr );
void malloc_stats( void );
int malloc_info( int options, FILE * fp );

// Unsupported
struct mallinfo mallinfo( void );
int malloc_trim( size_t );
void * malloc_get_state( void );
int malloc_set_state( void * );
\end{C++}
\end{tabular}
\caption{llheap support of C dynamic-allocation API}
\label{f:CDynamicAllocationAPI}
\end{figure}


\subsubsection{Extended C API}
\label{s:ExtendedCAPI}

llheap transparently augments the C dynamic memory API to increase functionality, orthogonality, and safety.
\begin{itemize}[leftmargin=*,topsep=3pt,itemsep=2pt,parsep=0pt]
\item
@malloc@ remembers the original allocation size separate from the actual allocation size.
\item
@calloc@ sets the sticky zero-fill property.
\item
@memalign@, @aligned_alloc@, @posix_memalign@, @valloc@ and @pvalloc@ set the sticky alignment property, remembering the specified alignment size.
\item
@realloc@ and @reallocarray@ preserve sticky properties across copying.
\item
@malloc_stats@ prints detailed statistics of allocation/free operations when linked with a statistic version.
\item
Existence of shell variable @MALLOC_STATS@ implicitly calls @malloc_stats@ at program termination, so precompiled programs do not have to be modified.
\end{itemize}

llheap extends the C dynamic-memory API with new allocation operations with APIs matching existing C counterparts.
\begin{itemize}[leftmargin=*,topsep=3pt,itemsep=1pt,parsep=0pt]
\item
@aalloc@ extends @calloc@ for dynamic array allocation \emph{without} zero-filling the memory (faster than @calloc@).
\item
@resize@ extends @realloc@ for resizing an allocation \emph{without} copying previous data or preserving sticky properties (faster than @realloc@).
\item
@resizearray@ extends @resize@ for an array allocation (faster than @reallocarray@).
\item
@amemalign@ extends @aalloc@ with alignment and sets sticky alignment property.
\item
@cmemalign@ extends @amemalign@ with zero fill and sets sticky zero-fill and alignment property.
\item
@aligned_resize@ extends @resize@ with an alignment.
\item
@aligned_resizearray@ extends @resizearray@ with alignment.
\item
@aligned_realloc@ extends @realloc@ with alignment.
\item
@aligned_reallocarray@ extends @reallocarray@ with alignment.
\end{itemize}

llheap extends the C dynamic memory API with new control operations.
The following routines are called \emph{once} during llheap startup to set specific limits \emph{before} an application starts.
Setting these values early is essential because allocations can occur from the dynamic loader and other libraries before application code executes.
To set a value, define a specific routine in an application and return the desired value, \eg
\begin{C++}
size_t malloc_extend() { return 16 * 1024 * 1024; }
\end{C++}
\begin{itemize}[leftmargin=*,topsep=0pt,itemsep=1pt,parsep=0pt]
\item
@malloc_extend@ returns the number of bytes to extend the @sbrk@ area when there is insufficient free storage to service an allocation request.
\item
@malloc_mmap_start@ returns the crossover allocation size from the @sbrk@ area to separate mapped areas, see also @mallopt( M_MMAP_THRESHOLD )@.
\item
@malloc_unfreed@ returns the amount subtracted from the global unfreed program storage to adjust for unreleased storage from routines like @printf@ (debug only).
\end{itemize}

llheap extends the C dynamic-memory API with functions to query object properties.
\begin{itemize}[leftmargin=*,topsep=3pt,itemsep=1pt,parsep=0pt]
\item
@malloc_size@ returns the requested size of a dynamic object, which is updated when an object is resized, similar to @malloc_usable_size@.
\item
@malloc_alignment@ returns the object alignment, where the minimal alignment is 16 bytes.
\item
@malloc_zero_fill@ returns true if the object is zero filled.
\item
@malloc_remote@ returns true if the object is from a remote heap (@OWNERSHIP@ only).
\end{itemize}
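For example, these queries allow an allocation's properties to be replicated manually:
\begin{C++}
void * q = memalign( 64, 100 ); // sticky alignment, request size 100
size_t align = malloc_alignment( q ); // 64
size_t size = malloc_size( q ); // 100, the request size
void * r = memalign( align, 2 * size ); // larger object with the same properties
\end{C++}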
1277
1278llheap extends the C dynamic-memory API with new statistics control.
1279\begin{itemize}[leftmargin=*,topsep=3pt,itemsep=1pt,parsep=0pt]
1280\item
1281@malloc_stats_fd@ sets the file descriptor for @malloc_stats@ writes (default @stdout@).
1282\item
1283@malloc_stats_clear@ clears the statistics counters for all thread heaps.
1284\item
1285@heap_stats@ extends @malloc_stats@ to only print statistics for the heap associated with the executing thread.
1286\end{itemize}
1287
1288
1289\subsubsection{Modern Allocation API}
1290
1291Modern programming languages have complex type systems that can be used to consolidate the panoply of memory allocation routines and features, providing a simpler programming experience and safety.
1292The \CFA language is used to demonstrate this capability, because llheap forms the memory allocator for this C variant, but other languages can provide similar APIs.
1293
1294\CFA polymorphism reduces the allocation API to two overloaded routines allocating a single object or an array of objects.
1295\begin{cfa}
1296forall( T & ) {
1297 T * alloc( /* list of property functions ... */ ) { ... } // singleton allocation
1298 T * alloc( size_t @dimension@, /* list of property functions ... */ ) { ... } // array allocation
1299}
1300\end{cfa}
1301Because the \CFA type system uses the return type to select overloads (like Ada), this capability is leveraged to remove the object-size parameter and return cast for regular calls to C @malloc@ or @memalign@.
1302\begin{cfa}
1303inline T * alloc( ... ) {
1304 if ( _Alignof(T) <= defaultAlign() ) return @(T *)@malloc( @sizeof(T)@ ); // C allocation
1305 else return @(T *)@memalign( @_Alignof(T)@, @sizeof(T)@ ); // C allocation
1306}
1307\end{cfa}
1308The calls to these two routine are now much safer than the C equivalents.
1309\begin{C++}
1310int * ip = alloc(); $\C[2.75in]{// T => int, sizeof => 4/8, alignment => default}$
1311double * dp = alloc(); $\C{// T => double, sizeof => 8, alignment => default}$
1312struct Spinlock { ... } [[aligned(128)]] * sp = alloc(); $\C{// T => Spinlock, sizeof => ..., alignment = 128}$
1313int * ia = alloc( 10 ); $\C{// T => int, sizeof => 4/8, alignment => default, dimension => 10}\CRT$
1314\end{C++}
1315At compile time, each call to @alloc@ extracts the return type @T@ from the left-hand side of the assignment, which is then used in @sizeof@, @_Alignof@, and casting the storage to the correct type.
1316The @inline@ and constant expression allow the compiler to remove the @if@ statement.
1317This interface removes all the common allocation-call errors in C and provides a uniform name covering all allocation reducing the cognitive burden.
1318
1319The property functions are a variable number of routines providing @alloc@ with management details and actions.
1320The functions are @align@, @fill@, @resize@, and @realloc@, and written in prefix versus postfix notation solely for aesthetic reasons, \eg @3`fill@ $\equiv$ @fill( 3 )@.
1321The examples are arrays but apply equally to singleton allocations.
1322\begin{cfa}
1323int * ip = alloc( 5, @4096`align@, @5`fill@ ); $\C[3in]{// start array on 4096 boundary and initialize elements with 5}$
1324int * ip2 = alloc( 10, @ip`fill@, @(malloc_alignment( ip ))`align@ ); $\C{// first 5 elements same as ip, same alignment as ip}$
1325_Complex double * cdp = alloc( 5, @(3.5+4.1i)`fill@ ); $\C{// initialize complex elements with 3.5+4.1i}$
1326struct S { int i, j; };
1327S * sp = alloc( 10, @((S){3, 4})`fill@ ); $\C{// initialize structure elements with {3, 4}}$
1328ip = alloc( 10, @ip`realloc@, @10`fill@ ); $\C{// make array ip larger and initialize new elements with 10}$
1329double * dp = alloc( 5, @ip2`resize@, @256`align@, @13.5`fill@ ); $\C{// reuse ip2 storage for something else}\CRT$
1330\end{cfa}
1331Finally, \CFA has constructors and destructors, like \CC, which are invoked when allocating with @new@ and @delete@.
1332\begin{cfa}
1333T * t = new( 3, 4, 5 ); $\C[3in]{// allocate T and call constructor T\{ 3, 4, 5 \}}$
1334W * w = new( 3.5 ); $\C{// allocate W and call constructor W\{ 3,5 \}}$
1335delete( t, w ); $\C{// call destructors and free t and w}\CRT$
1336\end{cfa}
1337The benefits of high-level API simplifications should not be underestimated with respect to programmer productivity and safety.
1338
1339
1340\section{Performance}
1341\label{c:Performance}
1342
1343This section uses a number of benchmarks to compare the behaviour of currently popular memory allocators with llheap.
1344The goal is to see if llheap is a competitive memory allocator;
1345no attempt is made to select a performance winner.
1346
1347
1348\subsection{Experimental Environment}
1349\label{s:ExperimentalEnvironment}
1350
1351The performance experiments are run on three different multi-core architectures, ARM, AMD, and Intel, covering memory models weak order (WO) and total store order (TSO), to determine if there is consistency across architectures:
1352\begin{description}[leftmargin=*,topsep=3pt,itemsep=2pt,parsep=0pt]
1353\item[ARM]
1354Gigabyte E252-P31 128-core socket 3.0 GHz, WO memory model
1355\item[AMD]
1356Supermicro AS--1125HS--TNR EPYC 9754 128--core socket, hyper-threading $\times$ 2 sockets (512 processing units) 2.25 GHz, TSO memory model
1357\item[Intel]
1358Supermicro SYS-121H-TNR Xeon Gold 6530 32--core, hyper-threading $\times$ 2 sockets (128 processing units) 2.1 GHz, TSO memory model
1359\end{description}
1360For the parallel experiments, threads are pinned to cores in a linear fashion, \ie from core $N$ to $N+M$, where $N$ is the start of a socket boundary.
1361This layout produces the best throughput, as there is little or no communication among threads in the benchmarks, so binding tightly to the cache layout is unnecessary;
1362hence, there is almost no OS or NUMA effects perturbing the benchmarks.
1363
1364The compilers are gcc/g++-14.2.0 and gfortran-14.2.0 running on the Linux v6.8.0-52-generic OS, with @LD_PRELOAD@ used to override the default allocator.
1365To prevent eliding certain code patterns, crucial parts of a test are wrapped by the function @pass@
1366\begin{uC++}
1367static inline void * pass( void * v ) { $\C[2.5in]{// prevent eliding, cheaper than volatile}$
1368 __asm__ __volatile__( "" : "+r"(v) ); return v;
1369}
1370void * vp = pass( malloc( 0 ) ); $\C{// wrap malloc call to prevent elision}\CRT$
1371\end{uC++}
1372The call to @pass@ can prevent a small number of compiler optimizations but this cost is the same for all allocators.
1373
1374
1375\subsection{Memory Allocators}
1376\label{s:MemoryAllocators}
1377
1378Historically, a number of C/\CC, stand-alone, general-purpose memory-allocators, \eg dlmalloc~\cite{dlmalloc}, have been written for use by programming languages providing unmanaged memory.
1379For this work, 6 of the popular, thread-safe memory-allocators are selected for comparison, along with llheap.
1380
1381\begin{description}[leftmargin=*,topsep=3pt,itemsep=2pt,parsep=0pt,listparindent=\parindent]
1382\item[glibc~\cite{glibc}] % https://sourceware.org/glibc/wiki/MallocInternals
1383is the default glibc allocator, derived from ptmalloc, derived from dlmalloc.
1384glibc has multiple threads sharing multiple heaps with a global shared heap, header per allocation, free-lists with different organizational criteria and searching, and coalescing of certain adjacent free-areas.
1385Version Ubuntu GLIBC 2.31-0ubuntu9.7 2.31 compiled by Ubuntu 24.04.
1386
1387\item[hoard~\cite{hoard}]
1388has multiple threads sharing multiple heaps with a global shared heap, where each heap is composed of superblocks containing fixed-sized objects, with each super-block having a single header for its objects and reuse of superblocks if empty.
1389Version 3.13.0, compiled with gcc-14.2.0, default configuration, using command @make@.
1390Over the past 5 years, hoard development has stopped;
1391it fails on the ARM architecture, possibly because of the WO memory model.
1392
1393\item[jemalloc~\cite{Evans06}]
1394has multiple threads sharing multiple heaps (arenas) composed of same-sized chunks subdivided into regions composed of pages where each page is a container of same-sized objects.
1395The components are organized into a number of data structures to facilitate allocations, freeing, and coalescing.
1396Large objects are allocated using @mmap@.
1397Version jemalloc-5.3.0~\cite{jemalloc}, built with the default configuration, using commands: @autogen.sh; configure; make; make install@.
1398
1399\item[mimalloc~\cite{Leijen19}]
1400has a heap per thread composed of a reserved area subdivided into 3-sized page buffers, where each page is a container of same-sized objects.
1401Each page manages its own internal free list and the free list is build when a page is created so there is no initial bump pointer.
1402Empty pages are coalesced for reuse.
1403Uses a fast freelist search for small allocation sizes.
1404Onwership is handled with a separate remote free-list, and remote frees are batched before pushing to the owner heap.
1405Version mimalloc-v2.1.2, built with the default configuration, using commands @cmake . ; make@.
1406
1407\item[tbbmalloc~{\cite[pp.~314--315]{Kukanov07}}] is the allocator shipped with Intel's Threading Building Blocks (TBB).
1408tbbmalloc has a heap per thread for small allocations, with large allocation handled using a single request.
1409There is a global heap to acquire and reuse space obtained from the OS;
1410its reserved space is divided into thread buffers (containers).
1411A thread heap is composed of linked containers, with binning used to manage the allocations/deallocations within the containers.
1412Small object space is not returned to the OS.
1413An allocation has to search its container list to find a partially filled one.
1414The search is mitigated by moving mostly-free containers to the start of the container list;
1415free containers are returned to the global heap.
1416Ownership is handled with a separate remote free-list.
1417Version @libtbbmalloc.so.2.11@, installed using @apt-get install libtbb-dev@.
1418
1419\item[tcmalloc~\cite{tcmalloc}] is the allocator shipped with Google's perftools.\footnote{
1420Currently, there are two versions of tcmalloc: Google's perftools and one experimental version available on GitHub, which is not an officially supported Google product.
1421We selected the perftools version because it is the most likely choice for users as it installs directly onto multiple OSs.}
1422tcmalloc has per CPU heaps for small allocations, with large allocation handled with a single request.
1423CPU heaps require a rollback mechanism, @rseq@, to prevent the serially-reusable problem.
1424There is a global heap to acquire and reuse space obtained from the OS;
1425its reserved space is divided into multi-page spans (containers) of fixed sized objects.
1426A CPU heap uses binning to manage the allocations/deallocations within the containers.
1427Free containers are returned to the OS.
1428Version @libtcmalloc_minimal.so.4@, installed using @apt-get install google-perftools@.
1429\end{description}
1430
1431Untested allocators:
1432\begin{description}[leftmargin=*,topsep=3pt,itemsep=2pt,parsep=0pt]
1433\item[ptmalloc3]
1434is 8 years old and already integrated into glibc.
1435\item[rpmalloc]
1436requires explicit insertion of initialization/finalization calls for handling concurrent kernel threads.
1437Having to augment programs, like SPEC CPU benchmarks, is deemed outside of normal programmer expectations.
1438% An allocator should just plugin and work.
1439\item[lock free] allocators guarantee allocation progress whether threads are delayed or killed using an atomic instruction, often CAS.
1440The original lock-free allocator~\cite{Michael04} is completely lock-free.
1441As stated, atomic instructions on the fast path result in a significant performance penalty.
1442Hence, new allocators are not completely lock free, switching to a combination of synchronization-free, \ie 1:1 allocator model, on the fast path and lock-free on the slow path(s) to manipulate shared data structures~\cite{rpmalloc}.
1443These allocators are better labelled as \newterm{hybrid locking} rather than lock free, as the lock-free aspect is not contributing to performance.
1444
1445% We observe that none of the pre-built standard malloc replacement libraries for ubuntu \url{https://launchpad.net/ubuntu/+search?text=malloc} are completely lock-free.
1446% 1:1 allocators can avoid synchronization (locks, or lock-free techniques with atomic instructions as well as cache coherence overheads) in their critical fast paths, but care must be taken to ensure the the amount of free memory captured in thread-local structures is bounded.
1447
1448% Another approach to synchronization for allocators is \newterm{Restartable Critical Sections} ~\cite {https://dl.acm.org/doi/10.1145/512429.512451, https://dl.acm.org/doi/pdf/10.5555/1698184, https://doi.org/10.1145/1064979.1064985}, which are available in linux as the \newterm{RSEQ} facility ~\cite{https://www.gnu.org/software/libc/manual/html_node/Restartable-Sequences.html}.
1449% Restartable Critical Sections provide obstruction-free progress by means of specially crafted transactions that will be rolled back if they happen to be interrupted by the kernel.
1450% Restartable Critical Sections transactions can only operate on CPU-specific data, however, which forces a T:C allocator configuration.
1451% Google's experimental tcmalloc \url{https://google.github.io/tcmalloc/rseq.html} uses RSEQ.
1452% SuperMalloc \url{ACM DL is dead at the moment, but it's in ISMM 2015} attempts to use hardware transactional memory for lock elision, but falls back to classic locking if the hardware facility is not present or when a given transactional attempt encounters repeated progress failures.
1453
1454
1455\end{description}
1456
1457Allocator size is an indirect indicator of complexity.
1458Lines-of-code are computed with command @cloc *.{h,c,cc,cpp}@, except for hoard:
1459@cloc --exclude-lang="Bourne Shell",SKILL,Markdown,Bazel Heap-Layers source include@.
1460\begin{center}
1461\setlength{\tabcolsep}{13pt}
1462\begin{tabular}{@{}rrrrrrrr@{}}
1463llheap & glibc & hoard & jemalloc & mimalloc & tbbmalloc & tcmalloc \\
14641,450 & 3,807 & 11,932 & 24,512 & 6,887 & 6,256 & 33,963 \\
1465\end{tabular}
1466\end{center}
1467
1468
1469\section{Benchmarks}
1470\label{s:Benchmarks}
1471
1472There are two basic approaches for evaluating computer software: benchmarks and micro-benchmarks.
1473\begin{description}[leftmargin=*,topsep=3pt,itemsep=2pt,parsep=0pt]
1474\item[Benchmarks]
1475are a suite of application programs (SPEC CPU/WEB) that are exercised in a common way (inputs) to find differences among underlying software implementations associated with an application (compiler, memory allocator, web server, \etc).
1476The applications are supposed to represent common execution patterns that need to perform well with respect to an underlying software implementation.
1477Benchmarks are often criticized for having overlapping patterns, insufficient patterns, or extraneous code that masks patterns, resulting in little or no information about why an application did or did perform well for the tested software.
1478\item[Micro-Benchmarks]
1479attempt to extract the common execution patterns associated with an application and run the pattern independently.
1480This approach removes any masking from extraneous application code, allows execution pattern to be very precise, and provides an opportunity for the execution pattern to have multiple independent tuning adjustments (knobs).
1481Micro-benchmarks are often criticized for inadequately representing real-world applications, but that is not their purpose.
1482\end{description}
1483
1484While some crucial software components have standard benchmarks, no standard benchmark exists for testing and comparing memory allocators.
1485In the past, an assortment of applications have been used for benchmarking allocators~\cite{Detlefs93,Berger00,Berger01,berger02reconsidering}: P2C, GS, Espresso/Espresso-2, CFRAC/CFRAC-2, GMake, GCC, Perl/Perl-2, Gawk/Gawk-2, XPDF/XPDF-2, ROBOOP, Lindsay.
1486As well, an assortment of micro-benchmark have been used for benchmarking allocators~\cite{larson99memory,Berger00,streamflow}: threadtest, shbench, Larson, consume, false sharing.
1487Many of these benchmark applications and micro-benchmarks are old and do not reflect current application allocation patterns.
1488
1489Except for the SPEC CPU benchmark, the other performance benchmarks used for testing are micro-benchmarks created for this paper.
1490All the benchmarks are used solely to extract differences among memory allocators.
1491The term benchmark in the following discussion means benchmark or micro-benchmark.
1492
1493
1494\subsection{SPEC CPU 2017}
1495
1496SPEC CPU 2017 is an industry-standardized suite for measuring and comparing performance of compute-intensive programs.
1497It contains integer and floating-point tests written in C, \CC, and Fortran, covering throughput and speed, where each test contains multiple benchmarks~\cite{SPECCPU2017}.
1498All the benchmarks perform dynamic allocation, from light to heavy.
1499However, the dynamic allocation is relatively small in comparison to the benchmark computation.
1500Therefore, differences among allocators should be small, unless a particular access pattern triggers a pathological case.
1501The reason for performing SPEC CPU across the allocators is to prove this hypothesis.
1502For allocator comparisons, we consider SPEC CPU differences of 5\% as equal and undetectable in general workloads and computing environments.
1503For compiler comparisons, small differences of 1\% or 2\% are considered significant.
1504
1505Table~\ref{t:SPEC-CPU-benchmark} shows the elapsed time (inverted throughput) of the SPEC CPU tests condensed to the geomean across the benchmarks for each of the four SPEC tests, intrate, intspeed, fprate, and fpspeed, covering integer and floating-point operations.
1506The tests are configured with size = ref, intrate/fprate: copies = 1, intspeed: threads = 1, fpspeed: threads = 16;
1507only fpspeed is concurrent using OpenMP.
1508Rigorous testing of SPEC CPU often runs many benchmark copies in parallel to completely load all computer cores.
1509However, these tests quickly run into architectural bottlenecks having little to do with an allocator's behaviour.
1510Runnning a single program bound to one core means the focus is strictly on allocator differences rather than conjoining transient OS and hardware differences.
1511The throughputs are ranked with {\color{red}red} lowest time and {\color{blue}blue} highest, where lower is best.
1512Hoard failed in multiple experiments on the ARM architecture, marked with {\color{purple}*Err*}, making it impossible to report the successful tests.
1513
1514The results show all allocators do well;
1515the average, median, and relative standard deviation (right column)\footnote{$rstd = \sigma / \mu \times 100$, where $\sigma =$ standard deviation and $\mu =$ average} proves our hypothesis that the performance difference, 0.6\% to 2.3\%, across allocators is small.
1516One implementation trend we observed is that two of the integer tests, @omnetpp@ and @xalancbmk@, had an execution pattern that exercised the cache.
1517For the three allocators using headers-per-allocation, glibc, llheap, and tbbmalloc, performance could be up to 40\% slower, between the best and worst allocator results.
1518The reason is that the headers consumed part of the cache line, resulting in more cache misses.
1519These two experiments, disproportionally increased the geomean for these allocators for both integral experiments on all architectures.
1520Hence, headers-per-allocation are disadvantaged for this specific execution pattern.
1521The floating-point tests show no trends among the allocators.
1522The goal for llheap in this experiment is to do well, which is established by it being close to the median result, meaning it is normally in the middle of the allocator results.
1523
1524\begin{table}
1525\centering
1526\caption{SPEC CPU benchmark, 3 hardware architectures, geomean per test in seconds, lower is better}
1527\label{t:SPEC-CPU-benchmark}
1528%\setlength{\tabcolsep}{6pt}
1529\begin{tabular}{@{}p{15pt}@{\hspace{15pt}}r|*{7}{r}|*{3}{r}@{}}
1530 & bench/alloc. & glibc & hoard & jemalloc & llheap & mimalloc & tbbmalloc & tcmalloc & avg & med & rstd \\
1531\cline{2-12}
1532 & intrate & {\color{blue}314.4} & {\color{violet}*Err*} & 300.3 & 309.9 & 302.6 & 313 & {\color{red}298.7} & 306.5 & 309.9 & 2\% \\
1533ARM & intspeed & {\color{blue}439.1} & {\color{violet}*Err*} & 417.6 & 431.1 & 419.9 & 436.2 & {\color{red}415.5} & 426.6 & 431.1 & 2.2\% \\
1534 & fprate & 347.6 & {\color{violet}*Err*} & {\color{red}333.9} & 352.2 & {\color{blue}356.6} & 345.9 & 344.5 & 346.8 & 347.6 & 2\% \\
1535 & fpspeed & 248.4 & {\color{violet}*Err*} & 245.3 & 245.7 & {\color{blue}250.9} & 246.6 & {\color{red}243.8} & 246.8 & 246.6 & 0.93\%
1536\end{tabular}
1537
1538\begin{comment}
1539\bigskip
1540\begin{tabular}{@{}p{15pt}@{\hspace{15pt}}r|*{7}{r}|*{3}{r}@{}}
1541 & bench/alloc. & glibc & hoard & jemalloc & llheap & mimalloc & tbbmalloc & tcmalloc & avg & med & rstd \\
1542\cline{2-12}
1543 & intrate & 251 & 242 & 239 & 249 & 240 & {\color{blue}251} & {\color{red}237} & 244 & 242 & 2.3\% \\
1544AMD & intspeed & 356 & 337 & 335 & 351 & 339 & {\color{blue}356} & {\color{red}333} & 344 & 339 & 2.7\% \\
1545 & fprate & 256 & 261 & {\color{red}250} & 257 & {\color{blue}270} & 256 & 254 & 258 & 256 & 2.3\% \\
1546 & fpspeed & 340 & {\color{blue}353} & {\color{red}326} & 338 & 348 & 341 & 328 & 339 & 340 & 2.7\%
1547\end{tabular}
1548\end{comment}
1549
1550\bigskip
1551\begin{tabular}{@{}p{15pt}@{\hspace{15pt}}r|*{7}{r}|*{3}{r}@{}}
1552 & bench/alloc. & glibc & hoard & jemalloc & llheap & mimalloc & tbbmalloc & tcmalloc & avg & med & rstd \\
1553\cline{2-12}
1554 & intrate & 251.2 & {\color{red}241.1} & 251.9 & 249.3 & 251.6 & 251.5 & {\color{blue}252.3} & 249.9 & 251.5 & 1.5\% \\
1555AMD & intspeed & {\color{blue}356.1} & {\color{red}337.1} & 355.4 & 351.7 & 355.5 & 355.8 & 355.9 & 352.5 & 355.5 & 1.8\% \\
1556 & fprate & {\color{red}253.9} & {\color{blue}259.9} & 254.4 & 255.8 & 254.5 & 254.4 & 254.7 & 255.4 & 254.5 & 0.75\% \\
1557 & fpspeed & 329.9 & {\color{blue}339.6} & 330.6 & {\color{red}327.2} & 329.9 & 329.8 & 329.5 & 330.9 & 329.9 & 1.1\%
1558\end{tabular}
1559
1560\bigskip
1561\begin{tabular}{@{}p{15pt}@{\hspace{15pt}}r|*{7}{r}|*{3}{r}@{}}
1562 & bench./alloc. & glibc & hoard & jemalloc & llheap & mimalloc & tbbmalloc & tcmalloc & avg & med & rstd \\
1563\cline{2-12}
1564 & intrate & 188.6 & 185.1 & 183.1 & 188.6 & 181.5 & {\color{blue}189.4} & {\color{red}181.2} & 185.4 & 185.1 & 1.8\% \\
1565Intel & intspeed & 271.6 & 264.6 & 263.5 & 270.2 & 261.2 & {\color{blue}272.1} & {\color{red}260.3} & 266.2 & 264.6 & 1.7\% \\
1566 & fprate & 202.7 & {\color{red}201.8} & 204.4 & 205.1 & {\color{blue}205.3} & 204.7 & 203.7 & 204 & 204.4 & 0.59\% \\
1567 & fpspeed & 237.3 & 235.3 & 234.5 & 235.6 & {\color{blue}244.5} & 236.1 & {\color{red}233.6} & 236.7 & 235.6 & 1.4\%
1568\end{tabular}
1569\end{table}


\subsection{Realloc Benchmark}

Some examination of @realloc@ is necessary to encourage its use.
Reallocation can be very efficient (both in space and time) when manipulating variable-sized objects, like strings, multi-precision numbers, or dynamically-sized arrays.
Both X11 (500+ calls) and glibc (300+ calls) use @realloc@ for various purposes.
For example, in \CC:
\begin{C++}
string s = "abc"; // initial allocation and copy new value
s = "gh"; // change size and copy new value
s = "l" + s + "r"; // change size and copy new value
s = s.substr(0,2); // reduce size
\end{C++}
variable @s@ changes size and value multiple times, plus temporary strings are created implicitly, \eg multiple concatenations, all of which require multiple allocations, copies, and deallocations.
@realloc@ can optimize some of these operations in two ways:
\begin{enumerate}[leftmargin=*]
\item
For decreasing size, Figure~\ref{f:ReallocOptDecreasing} shows a logical truncation of the existing object rather than creation of a new object, \ie a heuristic decides whether to perform the 3-step procedure (allocate, copy, and free), or to pretend the storage is decreased and return the old storage and value, performing zero work but increasing internal fragmentation.
For example, a request to decrease size from 96 to 75 bytes can be implemented either way.
With logical truncation, the 21 bytes of internal fragmentation at the end of the allocation may be unavailable, directly available if the allocator supports @malloc_usable_size@, or indirectly available if put back on the allocator free list.
\item
For increasing size, Figure~\ref{f:ReallocOptIncreasing} takes advantage of the fact that many memory allocators quantize request sizes (binning), often returning slightly more storage than requested (internal fragmentation).
For example, an initial request for 75 bytes may return 96 bytes of storage, giving 21 bytes of internal fragmentation.
For size increases of up to 21 bytes, @realloc@ can take advantage of this unused space rather than performing the 3-step procedure, which can also result in unused storage (a sketch of this grow in place follows the list).
\end{enumerate}
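A minimal sketch of observing grow in place from the application side, assuming the allocator provides the glibc extension @malloc_usable_size@ (the sizes are illustrative):
\begin{C++}
#include <malloc.h> // malloc_usable_size (glibc extension)
char * p = (char *)malloc( 75 ); // request 75 bytes
size_t usable = malloc_usable_size( p ); // e.g., 96 bytes after binning
char * q = (char *)realloc( p, usable ); // grow into the internal fragmentation
// expect q == p: no new allocation and no copy
\end{C++}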

\begin{figure}
\centering
\subfloat[Decreasing]{\label{f:ReallocOptDecreasing}\input{decreasing}}
\hspace*{5pt}
\vrule
\hspace*{5pt}
\subfloat[Increasing]{\label{f:ReallocOptIncreasing}\raisebox{0.38\totalheight}{\input{increasing}}}
\caption{Realloc Optimizations}
\label{f:ReallocOptimizations}
\end{figure}

Figure~\ref{f:reallocShrinkBenchmark} shows a benchmark to determine if an allocator takes advantage of the first optimization.
The benchmark takes a fixed-size allocation and reduces it by 10\%--90\% in steps of 10\%, checking the storage address at each reduction step to see if the same or new storage is returned.
The fixed-size allocation is varied between 64 bytes and 16K in powers of 2.
Hence, both small and large storage sizes are reduced.
The following table shows the approximate percentage point where storage is retained on shrinkage, \eg the storage reduction must be greater than 50\% of the prior allocation before a new allocation is performed for the smaller size, data is copied, and the prior storage released.
\begin{center}
\setlength{\tabcolsep}{15pt}
\begin{tabular}{@{}ccccccc@{}}
glibc & hoard & jemalloc & llheap & mimalloc & tbbmalloc & tcmalloc \\
90\% & 50\% & 20\% & 50\% & 50\% & 90\% & 50\%
\end{tabular}
\end{center}
The results show glibc and tbbmalloc effectively do not perform this optimization, while the other allocators do, with 50\% as the most popular crossover point.
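Such a crossover corresponds to a heuristic of the following form inside @realloc@; a hypothetical sketch, where @optr@/@osize@ are the old storage and size and @nsize@ is the requested smaller size:
\begin{C++}
if ( nsize >= osize / 2 ) return optr; // logical truncation: retain old storage
void * nptr = malloc( nsize ); // otherwise 3-step procedure: allocate
memcpy( nptr, optr, nsize ); // copy retained prefix
free( optr ); // free old storage
return nptr;
\end{C++}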

Figure~\ref{f:reallocGrowBenchmark} shows a benchmark to determine if an allocator takes advantage of the second optimization.
This benchmark creates an array of fixed-sized elements, increasing the array size by 1 from 1--10,000 elements.
The element size is varied over 32, 64, 128, and 256 bytes.
To prevent allocators from doing a bump allocation across the entire benchmark, a small perturbation is introduced, where storage is allocated, held, and then released at infrequent points across the experiment.
A companion experiment is a manual simulation of the @realloc@: @malloc@ new storage, copy old data, and free old storage.
Note, the @realloc@ simulation is performing an equivalent perturbation to the @realloc@ benchmark each time through the loop.
The experiment is repeated 10,000 times for @realloc@ and 100 times for the simulation to obtain similar timing ranges.
The performance difference between the @realloc@ and @realloc@-simulation experiments shows if @realloc@ is optimizing unused internal fragmentation at the end of its quantized bucket.

Figure~\ref{f:reallocGrowResults} shows the results for the @realloc@ and @realloc@-simulation benchmarks.
The difference between the benchmarks is two orders of magnitude, \ie all allocators are reusing some internal fragmentation to prevent a reallocation and copy as the array grows.
The large difference comes from the extra copying in the simulation case, which is expensive.
Within the @realloc@ benchmark, allocators glibc, hoard, jemalloc, and tbbmalloc have higher cost, while the remaining allocators have almost identical results.
Within the @realloc@-simulation benchmark, allocators glibc and tbbmalloc have higher cost, while the remaining allocators have almost identical results.
This benchmark confirms that @realloc@ can provide some level of performance benefit for dynamically growing data structures, \eg strings or arrays.
Therefore, encouraging its use is reasonable if, and only if, it is safe to do so.
Note, this encouragement is apt for container developers, where low-level storage management is performed internally for the benefit of application users.
\begin{figure}
\begin{C++}
for ( size_t p = 10; p <= 100; p += 10 ) { // percentage of original size
	for ( size_t s = 64; s <= 16 * 1024; s <<= 1 ) { // allocation sizes 64--16K
		void * prev = pass( malloc( s ) );
		void * curr = pass( realloc( prev, s * p / 100 ) ); // shrink
		bool reuse = prev == curr; // same storage returned?
		if ( reuse ) { /* print */ }
		free( curr );
	}
}
\end{C++}
\vspace*{-10pt}
\caption{\lstinline{realloc} Shrink Benchmark}
\label{f:reallocShrinkBenchmark}

\vspace*{10pt}

%\setlength{\tabcolsep}{15pt}
\begin{tabular}{@{}ll@{}}
\multicolumn{1}{c}{\lstinline{realloc}} & \multicolumn{1}{c}{\lstinline{realloc} simulation} \\
\begin{C++}
struct S { size_t ca[DIM]; }; // varied 32, 64, 128, 256
enum { Ssize = sizeof( S ) };
for ( size_t t = 0; t < @10$'$000@; t += 1 ) {
	S * sa = nullptr, * perturb = nullptr;
	for ( size_t i = 0, s = Ssize; i < 10$'$000; i += 1, s += Ssize ) {
		sa = (S *)@realloc( sa, s );@

		sa[i].ca[0] = i;
		if ( i % 1024 == 0 ) perturb = (S *)realloc( perturb, s );
	}
	free( sa );
	free( perturb );
}
\end{C++}
&
\begin{C++}
struct S { size_t ca[DIM]; }; // varied 32, 64, 128, 256
enum { Ssize = sizeof( S ) };
for ( size_t t = 0; t < @100@; t += 1 ) {
	S * sa = nullptr, * so = (S *)malloc( Ssize );
	for ( size_t i = 0, s = Ssize; i < 10$'$000; i += 1, s += Ssize ) {
		sa = (S *)@malloc( s )@; // simulate realloc
		memcpy( sa, so, s - Ssize ); // so one smaller
		sa[i].ca[0] = i;
		free( so );
		so = sa;
	}
	free( sa );
}
\end{C++}
\end{tabular}
\caption{\lstinline{realloc} Grow Benchmark}
\label{f:reallocGrowBenchmark}

\vspace*{20pt}

\hspace*{-17pt}
\setlength{\tabcolsep}{-13pt}
\begin{tabular}{@{}l@{\hspace*{-5pt}{\vrule height 1.05in}\hspace*{-5pt}}l@{}}
\begin{tabular}{@{}lll@{}}
\input{prolog.realloc.tex} & \input{swift.realloc.tex} & \input{java.realloc.tex}
\\
\multicolumn{3}{@{}c@{}}{\lstinline{realloc}, 10,000 repetitions}
\end{tabular}
&
\setlength{\tabcolsep}{-10pt}
\begin{tabular}{@{}lll@{}}
\input{prolog.reallocsim.tex} & \input{swift.reallocsim.tex} & \input{java.reallocsim.tex}
\\
\multicolumn{3}{@{}c@{}}{\lstinline{realloc} simulation, 100 repetitions}
\end{tabular}
\end{tabular}

\caption{\lstinline{realloc} Grow Results, x-axis in bytes, lower is better}
\label{f:reallocGrowResults}
\end{figure}


\subsection{Cache Benchmark}
\label{s:CacheBenchmark}

The cache benchmark attempts to look for false sharing (see Section~\ref{s:FalseSharing}).
Unfortunately, testing for allocator-induced false-sharing is difficult, because it is equivalent to searching for randomly conjoined allocations within a large storage space.
Figure~\ref{f:CacheBenchmark} shows a benchmark for program-induced false-sharing, where pointers are passed among threads.
As a side effect, this benchmark indirectly checks which allocator model is being used.
The program main runs the benchmark with 4, 8, 16, and 32 threads, passing each thread a separate array of @ASIZE@ elements dynamically allocated from the program main's heap.
Each thread then traverses the array, adding a value to each element (read and write).
The traversal is repeated @TIMES / ASIZE@ times.
Each thread frees its array at the end.
The experiment is run with a small and a medium sized array.
If there is any heap sharing, the small array has a higher probability of false sharing, \eg the first and last elements of different arrays can be juxtaposed in memory, and hence appear in the same cache line.

\begin{figure}
\begin{C++}
enum { TIMES = 10$'$000$'$000$'$000, ASIZE = 3 }; $\C{// repetitions, array size 3 or 30}$
void * worker( void * arg ) { $\C{// array passed from program main}$
	volatile size_t * arr = (size_t *)arg; $\C{// volatile prevents code elision}$
	for ( size_t t = 0; t < TIMES / ASIZE; t += 1 ) $\C{// repeat experiment N times}$
		for ( size_t r = 0; r < ASIZE; r += 1 ) $\C{// iterate through array}$
			arr[r] += r; $\C{// read/write array elements}$
	free( (void *)arr ); $\C{// cast away volatile}$
	return nullptr;
}
\end{C++}
\vspace*{-5pt}
\caption{Cache False-Sharing Benchmark}
\label{f:CacheBenchmark}
\end{figure}
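
For completeness, a minimal sketch of the driver creating the workers, assuming pthreads (@NTHREADS@ is illustrative; the harness runs with 4, 8, 16, and 32 threads):
\begin{C++}
int main() {
	enum { NTHREADS = 32 }; // varied 4, 8, 16, 32
	pthread_t workers[NTHREADS];
	for ( size_t t = 0; t < NTHREADS; t += 1 ) // arrays allocated by main => common heap
		pthread_create( &workers[t], nullptr, worker, calloc( ASIZE, sizeof( size_t ) ) );
	for ( size_t t = 0; t < NTHREADS; t += 1 ) // wait for traversals to finish
		pthread_join( workers[t], nullptr );
}
\end{C++}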

Figure~\ref{f:cacheResults} shows the results for the cache benchmark run with array sizes 3 and 30.
Allocators glibc, llheap, mimalloc, and tbbmalloc show little or no false-sharing issues at both 3 and 30 array sizes, \ie all generate virtually the same result.
Note, on the Intel, there is a rise at 32 cores because of an L3 cache shift at 16 cores; stepping to 32 cores introduces NUMA effects.
This result correlates with these allocators using a 1:1 allocator model.
Allocators hoard, jemalloc, and tcmalloc show false-sharing issues at both 3 and 30 array sizes, reducing performance by a factor of 2 at size 3.
The @perf@ performance analyzer shows a large number of cache misses for these allocators, indicating false sharing.
This result correlates with these allocators using some form of heap sharing.

\begin{figure}
\setlength{\tabcolsep}{-8pt}
\begin{tabular}{@{}l@{\hspace*{-5pt}{\vrule height 1.05in}\hspace*{-5pt}}l@{}}
\begin{tabular}{@{}lll@{}}
\input{prolog.cacheS.tex} & \input{swift.cacheS.tex} & \input{java.cacheS.tex}
\\
\multicolumn{3}{@{}c@{}}{3 Element Array}
\end{tabular}
&
\begin{tabular}{@{}lll@{}}
\input{prolog.cacheL.tex} & \input{swift.cacheL.tex} & \input{java.cacheL.tex}
\\
\multicolumn{3}{@{}c@{}}{30 Element Array}
\end{tabular}
\end{tabular}
\caption{Cache False-Sharing Results, x-axis in cores, lower is better}
\label{f:cacheResults}
\end{figure}


\subsection{Ownership Benchmark}

Historically, the Larson benchmark~\cite{larson99memory} is purported to test for ownership issues, but in actuality, the benchmark is a complex simulation of a server environment.
Multiple threads allocate and free a number of random-sized objects within a size range.
Each thread runs for a time period, and at termination, creates a child thread and passes its array of objects as an argument, which does not require synchronization.
The number of thread generations varies with thread speed.
Because the benchmark performs multiple kinds of tests, it is impossible to extract just the remote-free rate.

Therefore, a new benchmark is created to measure the asynchronous transfer cost from the deallocating to the allocating thread (remote free).
However, the allocating thread must first asynchronously transfer the allocations to the deallocating thread.
This cost needs to be mitigated so it does not mask the remote-free measurement.
To accomplish this, a thread batches its allocations (in lots of 100) and atomically exchanges this batch with a freeing thread, which then individually frees the batch components.
Hence, the cost of the asynchronous batch transfer is much less than the individual cost of the remote frees.

Figure~\ref{f:OwnershipBenchmark} shows the pseudo-code for the benchmark.
There is a global matrix of allocation addresses: one row for each thread and one column for each batch.
Each thread starts at a specific row and fills that row with two different sized allocations.
A thread then loops until it atomically exchanges its row pointer with another thread's row pointer.
The storage in the received batch is then remote freed, the batch row is refilled with new allocations, and the process repeats for a timed duration.
As well, after each allocation, an integer is written into the storage, and that integer is read before the deallocation.

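In Figure~\ref{f:OwnershipBenchmark}, @Fas@ denotes an atomic fetch-and-store (exchange); a minimal sketch, assuming C-style pointer arguments and a GCC builtin, not necessarily the benchmark's exact implementation:
\begin{C++}
// atomically store repl into *loc and return the prior value
void * * Fas( void * * volatile * loc, void * * repl ) {
	return __atomic_exchange_n( loc, repl, __ATOMIC_SEQ_CST );
}
\end{C++}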

Figure~\ref{f:Ownership} (a)--(c) shows the throughput of the ownership benchmark.
The results are divided into three groups.
glibc and tbbmalloc are slowest because of many system calls to @futex@. % and @nano_sleep@.
Figure~\ref{f:Ownership}~(d) shows the system time climbing during scaling on the AMD;
the other architectures are similar.
llheap and mimalloc are next, as these allocators do not batch remote frees, so every free requires locking.
jemalloc, hoard, and tcmalloc are fastest, as these allocators batch remote frees, reducing locking.
For 1:1 allocators, eager remote return makes sense, as the returned storage can be reused during the owning thread's lifetime.
For N:T allocators, lazy remote return using batching makes sense, as heaps outlive threads, so eventually returned storage can be used by any existing or new thread.
Batching is possible for 1:1 allocators, but results in complexity and external fragmentation, which is only warranted in certain cases.

\begin{figure}
\begin{cfa}
void * batches[MaxThread][MaxBatch]; $\C{// thread global}$
struct Aligned { CALIGN void * * col; };
volatile Aligned allocations[MaxThread];

Aligned batch = { batches[id] }; $\C{// thread local}$
size_t cnt = 0, a = 0;
for ( ; ! stop; ) { $\C{// loop for T seconds}$
	for ( ssize_t i = Batch - 1; i >= 0; i -= 1 ) { $\C{// allocations, opposite order to frees}$
		batch.col[i] = malloc( i & 1 ? 42 : 192 ); $\C{// two allocation sizes}$
		*(int *)batch.col[i] = 42; $\C{// write storage}$
	}
	Aligned obatch = batch;
	while ( (batch.col = Fas( allocations[a].col, batch.col )) == obatch.col || batch.col == nullptr ) { $\C{// atomic exchange}$
		if ( stop ) goto fini;
		a = (a + 1) % Threads; $\C{// try another batch}$
	}
	for ( size_t i = 0; i < Batch; i += 1 ) { $\C{// deallocations}$
		if ( *(int *)batch.col[i] != 42 ) abort(); $\C{// read storage check}$
		free( batch.col[i] ); $\C{// remote free}$
	}
	cnt += Batch; $\C{// sum allocations/frees}$
	a = (a + 1) % Threads; $\C{// try another batch}$
} fini: ;
\end{cfa}
\caption{Ownership Benchmark Outline}
\label{f:OwnershipBenchmark}
\end{figure}

\begin{figure}
\hspace*{-14pt}
\setlength{\tabcolsep}{-13pt}
\begin{tabular}{@{}lll@{\hspace*{-6pt}{\vrule height 2.05in}\hspace*{-6pt}}l@{}}
\input{prolog.ownership.tex}
&
\input{swift.ownership.tex}
&
\input{java.ownership.tex}
&
\input{swift.ownershipres.tex}
\end{tabular}
\caption{Ownership Results, x-axis is cores, (a)--(c) higher is better, (d) lower is better}
\label{f:Ownership}
\end{figure}


\subsection{Delay Benchmark}

The delay benchmark is a torture test of abrupt allocation patterns, looking for delays that increase latency.
A flat response across the tests means there are few or no allocator-induced pauses.
The test examines small and large requests, where small requests are handled by the heap (@sbrk@) and large requests are handled by the OS (@mmap@).
Putting large requests in the heap causes external fragmentation when freed, unless an allocator subdivides the space, leading to pauses.
The @mallopt@ function provides the option @M_MMAP_THRESHOLD@ to set the division point in bytes for requests that cannot be satisfied by an allocator's free list.
Each @sbrk@ test in this benchmark is repeated 5,000,000,000 times and each @mmap@ test is performed 1,000,000 times;
the different repetitions result from the high cost of the OS calls making the experiment run too long.
A \emph{long running} experiment, rather than short experiments with averaged results, is searching for blowup scenarios in time and/or space.
Finally, scaling is tested with 4, 8, 16, and 32 pinned threads, where the threads synchronize between tests using a @pthread@ barrier.
In all experiments, allocated storage has its first and last byte assigned a character to simulate usage.
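
For reference, a minimal sketch of configuring and exercising the threshold with the glibc API (allocators that do not support @mallopt@ simply ignore the request, as seen in the results below):
\begin{C++}
#include <malloc.h> // mallopt, M_MMAP_THRESHOLD (glibc)
mallopt( M_MMAP_THRESHOLD, 524288 ); // requests >= 512K go directly to mmap
char * big = (char *)malloc( 1048576 ); // expected to be mmap-ed
big[0] = big[1048575] = 'a'; // touch first and last byte
free( big ); // expected to be munmap-ed
\end{C++}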

The tests are performed in this order:
\begin{enumerate}[leftmargin=18pt,topsep=3pt,itemsep=2pt,parsep=0pt]
\item
@x = malloc( 0 ) / free( x )@:
handles the pathological case of a zero-sized allocation and free.
The POSIX standard allows two meanings for this case: return @NULL@ or a unique pointer, where both can be freed.
The fastest implementation is to return @NULL@, rather than create a fictitious allocation.
However, this overloads the @malloc@ return-value to mean error or a zero-sized allocation.
To comply with the POSIX standard, the check for running out of memory is:
\begin{uC++}
if ( malloc( 0 ) == NULL && errno == ENOMEM ) ... // no memory
\end{uC++}
Unfortunately, most programmers assume @NULL@ means an error, \eg two tests in the SPEC CPU benchmark fail if @NULL@ is returned for a zero-sized allocation.
Hence, returning @NULL@ for a zero-sized allocation is an impractical allocator option.

\item
@free( NULL )@: handles the pathological case of freeing a non-existent or zero-byte allocation.
Non-existent allocations occur as algorithm base-cases, such as an unused pointer set to @NULL@.
Having the allocator ignore this case eliminates checking for an erroneous @free@ call on a @NULL@ value.
This call should be fast.

\item
\label{expS}
@x = malloc( 42 ) / free( x )@:
handles a fixed-sized allocation and free.

\item
@x[0..100) = malloc( 42 ) / free( x[0..100) )@:
handles a group of fixed-sized allocations and group free (see the sketch following this list).

\item
@x[0..1000) = malloc( 42 ) / free( x[0..1000) )@:
handles a larger group of fixed-sized allocations and group free.

\item
@x[0..100) = malloc( 42 ) / free( x(100..0] )@:
handles a group of fixed-sized allocations and group free in reverse order.

\item
\label{expE}
@x[0..1000) = malloc( 42 ) / free( x(1000..0] )@:
handles a larger group of fixed-sized allocations and group free in reverse order.

\item
@x = malloc( [0..100) ) / free( x )@:
handles a variable-sized allocation and free.

\item
@x[0..100) = malloc( [0..100) ) / free( x[0..100) )@:
handles a group of variable-sized allocations and group free.

\item
@x[0..1000) = malloc( [0..1000) ) / free( x[0..1000) )@:
handles a larger group of variable-sized allocations and group free.

\item
@x[0..100) = malloc( [0..100) ) / free( x(100..0] )@:
handles a group of variable-sized allocations and group free in reverse order.

\item
@x[0..1000) = malloc( [0..1000) ) / free( x(1000..0] )@:
handles a larger group of variable-sized allocations and group free in reverse order.
\end{enumerate}
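The group tests have the following form, sketched here for 100 fixed-sized allocations freed in allocation order; the reverse-order variants free from the last element down to the first:
\begin{C++}
void * x[100];
for ( size_t i = 0; i < 100; i += 1 ) { // group of fixed-sized allocations
	x[i] = malloc( 42 );
	((char *)x[i])[0] = ((char *)x[i])[41] = 'a'; // touch first and last byte
}
for ( size_t i = 0; i < 100; i += 1 ) // group free in allocation order
	free( x[i] );
\end{C++}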
Experiments \ref{expS}--\ref{expE} are repeated with a fixed-sized allocation of 1,048,576 bytes, where @M_MMAP_THRESHOLD@ is set to 524,288 to force the use of @mmap@, resulting in 17 experiments.
Because the @mmap@ experiments test the operating-system memory-management rather than the allocators, the variable-sized @mmap@ experiments are deemed unnecessary.
A test with random-sized @sbrk@ allocations @malloc( [0..N) random )@ was performed, but the results are the same as the fixed-sized tests, because all the allocation sizes are quickly accessed over the large number of experiment repetitions.
That is, once the buckets or superblocks for the allocation sizes are created, access order is irrelevant.

Figures~\ref{f:LatencyExpARM}--\ref{f:LatencyExpIntel} show the results of the @sbrk@ and @mmap@ experiments across the seven allocators with parallel scaling.
The average of the N threads is graphed for each experiment and the standard deviation is the error bar.
For the @sbrk@ graphs, a good allocator result should be low (smaller is better), flat across scaling (cores), with no error bars (STD $\approx$ 0) indicating no jitter (pauses) among the threads.
The result patterns across the three hardware architectures are similar, with differences correlating to CPU speed and cache differences.

The key observation across the @sbrk@ graphs is that llheap and mimalloc are always at the bottom (lower is better) and flat with respect to scaling.
The only exception is on the Intel, where all allocators experience similar non-flat behaviour, because of the L3 cache shift at 16 cores.
Some anomalies are tcmalloc and hoard experiencing large jitter (see error bars) and scaling issues in some experiments, which correlates with poorer results;
jemalloc has significant scaling issues for experiments 5, 7, 10, and 12, resulting from large numbers of @futex@ calls, possibly related to @madvise@ for returning storage to the OS;
and glibc and tbbmalloc are often slower than the other allocators (symbols are on top of each other).

The key observation across the @mmap@ graphs is that only three allocators, glibc, llheap, and tbbmalloc, honoured the @mmap@ threshold request (symbols are on top of each other).
The other allocators made no @mmap@ calls, so their results are extremely low.
The exception is hoard, which did make @mmap@ calls that were uncorrelated with @M_MMAP_THRESHOLD@, and had significant jitter due to a large number of @futex@ calls.
For the allocators using @mmap@, there should be some scaling effect as more threads make more system calls.

\begin{figure}
\input{prolog.tex}
\vspace*{-20pt}
\caption{Delay Results, ARM, x-axis is cores, lower is better}
\label{f:LatencyExpARM}
\end{figure}

\begin{figure}
\input{swift.tex}
\vspace*{-20pt}
\caption{Delay Results, AMD, x-axis is cores, lower is better}
\label{f:LatencyExpAMD}
\end{figure}

\begin{figure}
\input{java.tex}
\vspace*{-20pt}
\caption{Delay Results, Intel, x-axis is cores, lower is better}
\label{f:LatencyExpIntel}
\end{figure}

Figures~\ref{f:LatencyResARM}--\ref{f:LatencyResIntel} show a time/space perspective across the entire experiment.
The user, system, and real times, along with the maximum memory usage, are presented for the @sbrk@ and @mmap@ experiments.
The result patterns across the three hardware architectures are similar.
If an allocator disappears in a graph, its result is less than 1 on a logarithmic scale.
Surprisingly, there are large (2 orders of magnitude) time differences among the allocators.

\begin{figure}
\hspace*{15pt}
\input{prolog2.tex}
\vspace*{-20pt}
\caption{Delay Results, ARM, x-axis is cores, lower is better}
\label{f:LatencyResARM}

\hspace*{15pt}
\input{swift2.tex}
\vspace*{-20pt}
\caption{Delay Results, AMD, x-axis is cores, lower is better}
\label{f:LatencyResAMD}

\hspace*{15pt}
\input{java2.tex}
\vspace*{-20pt}
\caption{Delay Results, Intel, x-axis is cores, lower is better}
\label{f:LatencyResIntel}
\end{figure}

For the @sbrk@ graphs, the user time should be high and scale with cores, the system time very low, the real time constant, and the maximum memory should scale with cores.
For user time, llheap and mimalloc are at the bottom (lower is better) and all allocators have linear scaling as cores increase.
The remaining allocators are slower by one to two orders of magnitude, which correlates with high results in the experiments.
For system time, jemalloc has non-trivial system time that scales with cores, caused by a large number of @futex@ calls.
The remaining allocators have virtually zero system time (not on graph).
The exception is a random anomaly where allocators had small amounts of system time, which appeared/disappeared on different experiment runs, as if something (OS?) slightly perturbs the experiment over its 20-hour run.
For real time, llheap and mimalloc take the least overall time and all allocators except jemalloc have flat performance.
For maximum memory, all allocators scale with cores, and there is a rough inverse correlation between user time and memory usage, \ie a time \vs space tradeoff.

For the @mmap@ graphs, where @mmap@ is only used by glibc, llheap, and tbbmalloc, the user time should be low and scale with cores, the system time should be high and scale with cores, the real time constant, and the maximum memory should scale with cores.
For user time, glibc, llheap, and tbbmalloc are at the bottom because there are no @sbrk@ requests.
The remaining allocators all use a non-trivial amount of time handling the large requests, except mimalloc, which handles a large request identically to a small request.
Interestingly, the amount of time varies by one to two orders of magnitude.
For system time, glibc, llheap, and tbbmalloc are at the top because of the OS calls to @mmap@.
Interestingly, the remaining allocators still incur system time varying by orders of magnitude, except mimalloc ($<$ 1 on the logarithmic scale, so invisible).
For real time, all allocators scale linearly with cores, except mimalloc, which is flat.
For maximum memory, all allocators scale with cores, and there is a rough inverse correlation between user time and memory usage, \ie a time \vs space tradeoff.


\subsection{Out of Memory Benchmark}

Figure~\ref{f:OutMemoryBenchmark} shows a \CC program with unbounded memory allocation.
The program is run in a shell with a restricted data size.
Hence, it quickly runs out of memory, causing @malloc@, which is called by \CC @new@, to return a @nullptr@ with @errno@ set to @ENOMEM@.
Routine @new@ sees the @nullptr@ and calls the handler routine set by @set_new_handler@, which prints a message, and resets the default handler to raise the @bad_alloc@ exception caught in the program main.
Note, raising an exception requires dynamic allocation, but \CC preallocates a few special exceptions, like @bad_alloc@, for such cases.

All allocators printed the correct output except hoard, mimalloc, and tcmalloc.
Hoard prints @MAP_FAILED@ and hangs, spinning on a spinlock in a complex call chain.
mimalloc aborts the program because it attempts to raise the @bad_alloc@ exception itself, which works if and only if it is compiled with \CC, whereas it is compiled with C.
The correct design is to return a @nullptr@ with @errno@ set to @ENOMEM@ to \CC @new@, which then raises the exception;
hence, the allocator can be compiled with either C or \CC.
tcmalloc prints the correct output but adds ``allocation failed'' messages.

\begin{figure}
\begin{tabular}{@{\hspace*{\parindentlnth}}l@{\hspace*{2\parindentlnth}}l@{}}
\begin{cfa}
static void handler() {
	cout << "Memory allocation failed\n";
	set_new_handler( nullptr );
}


\end{cfa}
&
\begin{cfa}
int main() {
	set_new_handler( handler );
	try {
		for ( ;; ) pass( new char[50] ); // unbounded allocation
	} catch( const bad_alloc & e ) { cout << e.what() << endl; }
}
\end{cfa}
\end{tabular}
\caption{Out of Memory Benchmark}
\label{f:OutMemoryBenchmark}
\end{figure}


\section{Conclusion}

The goal of this work is to build a full-featured, low-latency (or high-bandwidth) memory allocator for both KT and UT multi-threading systems that is competitive with the best current memory allocators, while extending the feature set of existing and new allocator routines.
The new llheap allocator achieves all of these goals, while maintaining and managing sticky allocation information \emph{without a performance loss}.
Hence, it is possible to use @realloc@ frequently as a safe operation, rather than just occasionally or not at all.
Furthermore, the ability to query sticky properties and other information allows programmers to write safer programs, as it is possible to dynamically match allocation styles from unknown library routines that return allocations.

Extending the C allocation API with @resize@, advanced @realloc@, @aalloc@, @amemalign@, @cmemalign@, and other alignment variations means programmers do not have to generate these allocation operations themselves.
The ability of the type systems in modern languages, \eg \CFA, to condense the allocation API to one routine with completely orthogonal allocation properties shows how far the allocation API can be advanced.
The result is increased safety and a cognitive reduction in performing dynamic allocation.
All of these extensions should eliminate common reasons for C programmers to roll their own memory allocator and/or allocation functions, which is a huge safety advantage.

The ability to compile llheap with static/dynamic linking and optional statistics/debugging provides programmers with multiple mechanisms to balance performance and safety.
These allocator versions are easy to use because they can be linked to an application without recompilation.
Providing comprehensive statistics for all allocation operations is invaluable in understanding and debugging a program's dynamic behaviour.
No other memory allocator provides such comprehensive statistics gathering.
This capability was used extensively during the development of llheap to verify its behaviour, and to verify the benchmarks developed for the paper.
As well, the debugging mode, where allocations are checked along with internal pre/post-conditions and invariants, is extremely useful, especially for students ($\approx$1,000 students have tested the \uC version of llheap).
While not as powerful as the @valgrind@ interpreter, llheap's debugging mode can detect a large number of allocation mistakes.
The contention-free statistics gathering and debugging have a low enough cost to be used in production code.
Finally, no other memory allocator addresses the needs of user-level threading, which is now available in many modern languages.

Creating a benchmark test-suite for comparing allocators, rather than relying on a suite of arbitrary programs, has been an interesting challenge.
The purpose of these performance tests is not to pick winners and losers among the allocators, because each allocator optimizes a particular set of allocation patterns: there is no optimal memory-allocator.
The goal is to demonstrate that llheap's performance, both in time and space, across some interesting allocation patterns, is comparable to the best allocators in use today.
Admittedly, there are pathological cases where llheap might use significant amounts of memory because it never coalesces or returns storage to the OS.
These pathological cases do not correlate with long-running applications, where llheap can perform very well.
In the small set of tested benchmarks, no heap blowup was observed, while some tests caused time blowups in other allocators.
Therefore, llheap is a viable drop-in replacement for many applications, and its ancillary features make it safer and more informative.


\subsection{Recommendations}

Substantial work has been put into building a new allocator and benchmarks, plus doing comprehensive performance tests among allocators.
Based on this work, we make two recommendations:
\begin{enumerate}[leftmargin=*, topsep=0pt,itemsep=0pt,parsep=0pt]
\item
Hoard is no longer maintained and did not do well (even broke) in some performance experiments.
We recommend that those doing memory-allocation research not use it.
\item
glibc did not perform as well as other allocators.
Given it is the default memory allocator for many academic and industry applications, this is unfortunate and skews performance results, so developers may draw incorrect conclusions.
As such, we recommend the adoption of a newer memory allocator for glibc.
We offer llheap for the reasons given above, but most importantly, for its small code base.
glibc maintainers come and go.
Therefore, it is crucial for a new maintainer to onboard quickly and have a thorough understanding of the code base within a month.
The llheap code base is small and can be learned quickly because of its simple design, making it an ideal choice as a substitute allocator.
\end{enumerate}


\section{Acknowledgements}

This research is funded by the NSERC/Waterloo-Huawei (\url{http://www.huawei.com}) Joint Innovation Lab. %, and Peter Buhr is partially funded by the Natural Sciences and Engineering Research Council of Canada.
% Special thanks to Trevor Brown for many helpful discussions.

\bibliographystyle{ACM-Reference-Format}
\bibliography{pl,local}

\end{document}
\endinput

% Local Variables: %
% tab-width: 4 %
% fill-column: 120 %
% compile-command: "make" %
% End: %