% Type: Paper
%
% Abstract
%
% A new C-based concurrent memory-allocator is presented, called llheap (ll => low latency). It supports C/C++ applications with multiple kernel threads, or it can be embedded into user-threading runtime-systems. llheap extends the C allocation API with new functions providing orthogonal access to allocation features; hence, programmers do not have to code missing combinations. llheap also extends the C allocation semantics by remembering multiple aspects of the initial allocation. These properties can be queried, allowing programmers to write safer programs by preserving these properties in future allocations. As well, realloc/reallocarray preserve initial zero-fill and alignment properties when adjusting storage size, again increasing future allocation safety. The allocator provides a contention-free statistics gathering mode, and a debugging mode for dynamically checking allocation pre/post conditions and invariants. These modes are invaluable for understanding and debugging a program's dynamic allocation behaviour, with low enough cost to be used in production code. An example is presented for condensing the allocation API using advanced type-systems, providing a single type-safe allocation routine using named arguments. Finally, performance results across a number of benchmarks show llheap is competitive with other modern memory allocators.
%
% Upload: llheap.pdf
%
% Computing Classification Systems
%
% Add
% 500 Software and its engineering > Software libraries and repositories
% Add
% 300 Computing methodologies > Concurrent programming languages
%
% Authors, submitter has to have an orcid
%
% Details & Comments
%
% cover letter
%
% Funding
% yes
% Government of Canada >
% Natural Sciences and Engineering Research Council of Canada
%
% Electronic Supplementary Materials No
% Are you submitting a conference paper extension: No
% X ACM uses CrossCheck, an automated service that checks for plagiarism. Any submission to ACM is subject to such a check. Confirm that you are familiar with the ACM Plagiarism Policy.
% To confirm that you have reviewed all title, author, and affiliation information in the submission form and the manuscript for accuracy, and approve its exact use in the final, published article, please check the box to the right. X

\documentclass[manuscript,screen,review]{acmart}

% Latex packages used in the document.

\usepackage{comment}
\usepackage{epic,eepic}
\usepackage{upquote} % switch curled `'" to straight
\usepackage{relsize}
\usepackage{xspace}
\usepackage{xcolor}
\usepackage{calc}
\usepackage{algorithm}
\usepackage{algorithmic}
\usepackage{enumitem}
\usepackage{tabularx} % allows \lstMakeShortInline@
\usepackage[scaled=0.88]{helvet} % decent Helvetica font, scaled to Times size
\usepackage[T1]{fontenc}
\usepackage{listings} % format program code
\usepackage[labelformat=simple,aboveskip=0pt,farskip=0pt,font={rm,md,up}]{subfig}
\renewcommand{\thesubfigure}{(\alph{subfigure})}

\hypersetup{breaklinks=true}
\usepackage{breakurl}

% \usepackage[pagewise]{lineno}
% \renewcommand{\linenumberfont}{\scriptsize\sffamily}

\usepackage{varioref} % extended references
% adjust varioref package with default "section" and "page" titles, and optional title with faraway page numbers
% \VRef{label} => Section 2.7, \VPageref{label} => page 17
% \VRef[Figure]{label} => Figure 3.4, \VPageref{label} => page 17
% \renewcommand{\reftextfaceafter}{\unskip}
% \renewcommand{\reftextfacebefore}{\unskip}
% \renewcommand{\reftextafter}{\unskip}
% \renewcommand{\reftextbefore}{\unskip}
% \renewcommand{\reftextfaraway}[1]{\unskip, p.~\pageref{#1}}
% \renewcommand{\reftextpagerange}[2]{\unskip, pp.~\pageref{#1}--\pageref{#2}}
% \newcommand{\VRef}[2][Section]{\ifx#1\@empty\else{#1}\nobreakspace\fi\vref{#2}}
% \newcommand{\VRefrange}[3][Sections]{\ifx#1\@empty\else{#1}\nobreakspace\fi\vrefrange{#2}{#3}}
% \newcommand{\VPageref}[2][page]{\ifx#1\@empty\else{#1}\nobreakspace\fi\pageref{#2}}
% \newcommand{\VPagerefrange}[3][pages]{\ifx#1\@empty\else{#1}\nobreakspace\fi\pageref{#2}{#3}}

\makeatletter
\newcommand{\abbrevFont}{\textit} % set empty for no italics
\newcommand{\CheckCommaColon}{\@ifnextchar{,}{}{\@ifnextchar{:}{}{,\xspace}}}
\newcommand{\CheckPeriod}{\@ifnextchar{.}{}{.\xspace}}
\newcommand{\EG}{\abbrevFont{e}.\abbrevFont{g}.}
\newcommand{\eg}{\EG\CheckCommaColon}
\newcommand{\IE}{\abbrevFont{i}.\abbrevFont{e}.}
\newcommand{\ie}{\IE\CheckCommaColon}
\newcommand{\ETC}{\abbrevFont{etc}}
\newcommand{\etc}{\ETC\CheckPeriod}
\newcommand{\VS}{\abbrevFont{vs}}
\newcommand{\vs}{\VS\CheckPeriod}

\newcommand{\newtermFont}{\emph}
\newcommand{\newterm}[1]{\newtermFont{#1}}

\newcommand{\CFAIcon}{\textsf{C}\raisebox{\depth}{\rotatebox{180}{\textsf{A}}}\xspace} % Cforall symbolic name
\newcommand{\CFA}{\protect\CFAIcon} % safe for section/caption
\newcommand{\CFL}{\textrm{Cforall}\xspace} % Cforall symbolic name
\newcommand{\CCIcon}{\textrm{C}\kern-.1em\hbox{+\kern-.25em+}} % C++ icon
\newcommand{\CC}[1][]{\protect\CCIcon{#1}\xspace} % C++ symbolic name
\newcommand{\uC}{$\mu$\CC}
\newcommand{\Csharp}{C\raisebox{-0.7ex}{\relsize{2}$^\sharp$}\xspace} % C# symbolic name

\newcommand{\LstBasicStyle}[1]{{\lst@basicstyle{#1}}}
\newcommand{\LstKeywordStyle}[1]{{\lst@basicstyle{\lst@keywordstyle{#1}}}}
\newcommand{\LstCommentStyle}[1]{{\lst@basicstyle{\lst@commentstyle{#1}}}}
\newcommand{\LstStringStyle}[1]{{\lst@basicstyle{\lst@stringstyle{#1}}}}

\newlength{\parindentlnth}
\setlength{\parindentlnth}{\parindent}
\newlength{\gcolumnposn} % temporary hack because lstlisting does not handle tabs correctly
\newlength{\columnposn}
\setlength{\gcolumnposn}{3.25in}
\setlength{\columnposn}{\gcolumnposn}
\renewcommand{\C}[2][\@empty]{\ifx#1\@empty\else\global\setlength{\columnposn}{#1}\global\columnposn=\columnposn\fi\hfill\makebox[\textwidth-\columnposn][l]{\lst@basicstyle{\LstCommentStyle{#2}}}}
\newcommand{\CRT}{\global\columnposn=\gcolumnposn}
\makeatother

\lstset{
columns=fullflexible,
basicstyle=\linespread{0.9}\sf, % reduce line spacing and use sans-serif font
stringstyle=\fontsize{9}{9}\selectfont\tt, % use typewriter font
tabsize=5, % N space tabbing
xleftmargin=\parindentlnth, % indent code to paragraph indentation
escapechar=\$, % LaTeX escape in CFA code
mathescape=false, % disable LaTeX math escape in CFA code $...$
keepspaces=true, %
showstringspaces=false, % do not show spaces with cup
showlines=true, % show blank lines at end of code
aboveskip=4pt, % spacing above/below code block
belowskip=2pt,
numberstyle=\footnotesize\sf, % numbering style
moredelim=**[is][\color{red}]{@}{@},
% replace/adjust listing characters that look bad in sans-serif
literate=
% {-}{\makebox[1ex][c]{\raisebox{0.4ex}{\rule{0.75ex}{0.1ex}}}}1
	{-}{\raisebox{0pt}{\ttfamily-}}1
	{^}{\raisebox{0.6ex}{\(\scriptstyle\land\,\)}}1
	{~}{\raisebox{0.3ex}{\(\scriptstyle\sim\,\)}}1
% {'}{\ttfamily'\hspace*{-0.4ex}}1
	{`}{\raisebox{-2pt}{\large\textasciigrave\hspace{-1pt}}}1
	{<-}{$\leftarrow$}2
	{=>}{$\Rightarrow$}2
% {->}{\raisebox{-1pt}{\texttt{-}}\kern-0.1ex\textgreater}2,
}% lstset

% CFA programming language, based on ANSI C (with some gcc additions)
\lstdefinelanguage{CFA}[ANSI]{C}{
	morekeywords={
	_Alignas, _Alignof, __alignof, __alignof__, and, asm, __asm, __asm__, _Atomic, __attribute, __attribute__,
	__auto_type, basetypeof, _Bool, catch, catchResume, choose, coerce, _Complex, __complex, __complex__, __const, __const__,
	coroutine, _Decimal32, _Decimal64, _Decimal128, disable, enable, exception, __extension__, fallthrough, finally, fixup,
	__float80, float80, __float128, float128, _Float16, _Float32, _Float32x, _Float64, _Float64x, _Float128, _Float128x,
	forall, fortran, generator, _Generic, _Imaginary, __imag, __imag__, inline, __inline, __inline__, int128, __int128, __int128_t,
	__label__, monitor, mutex, _Noreturn, __builtin_offsetof, one_t, or, recover, report, restrict, __restrict, __restrict__,
	__signed, __signed__, _Static_assert, suspend, thread, __thread, _Thread_local, throw, throwResume, timeout, trait, try,
	typeof, __typeof, __typeof__, typeid, __uint128_t, __builtin_va_arg, __builtin_va_list, virtual, __volatile, __volatile__,
	vtable, waitfor, waituntil, when, with, zero_t,
	},
	moredirectives={defined,include_next},
}

% uC++ programming language, based on ANSI C++
\lstdefinelanguage{uC++}[GNU]{C++}{
	morekeywords={
	_Accept, _AcceptReturn, _AcceptWait, _Actor, _At, _Catch, _CatchResume, _CorActor, _Cormonitor, _Coroutine,
	_Disable, _Else, _Enable, _Event, _Exception, _Finally, _Monitor, _Mutex, _Nomutex, _PeriodicTask, _RealTimeTask,
	_Resume, _ResumeTop, _Select, _SporadicTask, _Task, _Timeout, _When, _With, _Throw},
}

% Go programming language: https://github.com/julienc91/listings-golang/blob/master/listings-golang.sty
\lstdefinelanguage{Golang}{
	morekeywords=[1]{package,import,func,type,struct,return,defer,panic,recover,select,var,const,iota,},
	morekeywords=[2]{string,uint,uint8,uint16,uint32,uint64,int,int8,int16,int32,int64,
	bool,float32,float64,complex64,complex128,byte,rune,uintptr, error,interface},
	morekeywords=[3]{map,slice,make,new,nil,len,cap,copy,close,true,false,delete,append,real,imag,complex,chan,},
	morekeywords=[4]{for,break,continue,range,goto,switch,case,fallthrough,if,else,default,},
	morekeywords=[5]{Println,Printf,Error,},
	sensitive=true,
	morecomment=[l]{//},
	morecomment=[s]{/*}{*/},
	morestring=[b]',
	morestring=[b]",
	morestring=[s]{`}{`},
}

\lstnewenvironment{cfa}[1][]{\lstset{language=CFA,moredelim=**[is][\protect\color{red}]{@}{@}}\lstset{#1}}{}
\lstnewenvironment{C++}[1][]{\lstset{language=C++,moredelim=**[is][\protect\color{red}]{@}{@}}\lstset{#1}}{}
\lstnewenvironment{uC++}[1][]{\lstset{language=uC++,moredelim=**[is][\protect\color{red}]{@}{@}}\lstset{#1}}{}
\lstnewenvironment{Go}[1][]{\lstset{language=Golang,moredelim=**[is][\protect\color{red}]{@}{@}}\lstset{#1}}{}
\lstnewenvironment{python}[1][]{\lstset{language=python,moredelim=**[is][\protect\color{red}]{@}{@}}\lstset{#1}}{}
\lstnewenvironment{java}[1][]{\lstset{language=java,moredelim=**[is][\protect\color{red}]{@}{@}}\lstset{#1}}{}

\newsavebox{\myboxA}
\newsavebox{\myboxB}
\newsavebox{\myboxC}
\newsavebox{\myboxD}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\title{High-Performance Concurrent Memory Allocation}

\author{Mubeen Zulfiqar}
\email{m3zulfiq@uwaterloo.ca}
\author{Ayelet Wasik}
\email{aisraeli@plg.uwaterloo.ca}
\author{Peter A. Buhr}
\email{pabuhr@uwaterloo.ca}
\orcid{0000-0003-3747-9281}
\affiliation{%
	\institution{University of Waterloo}
	\city{Waterloo}
	\state{Ontario}
	\country{Canada}
}
\author{Dave Dice}
\email{dave.dice@oracle.com}
\orcid{0000-0001-9164-7747}
\affiliation{%
	\institution{Oracle Labs}
	\city{Burlington}
	\state{Massachusetts}
	\country{USA}
}
\author{Bryan Chan}
\email{bryan.chan@huawei.com}
\affiliation{%
	\institution{Huawei Compiler Lab}
	\city{Markham}
	\state{Ontario}
	\country{Canada}
}

\renewcommand{\shortauthors}{Zulfiqar et al.}

% inline code @...@
\lstMakeShortInline@%

\begin{document}

\begin{abstract}
A new C-based concurrent memory-allocator is presented, called llheap (ll $\Rightarrow$ low latency).
It supports C/\CC applications with multiple kernel threads, or it can be embedded into user-threading runtime-systems.
llheap extends the C allocation API with new functions providing orthogonal access to allocation features;
hence, programmers do not have to code missing combinations.
llheap also extends the C allocation semantics by remembering multiple aspects of the initial allocation.
These properties can be queried, allowing programmers to write safer programs by preserving these properties in future allocations.
As well, \lstinline{realloc}/\lstinline{reallocarray} preserve initial zero-fill and alignment properties when adjusting storage size, again increasing future allocation safety.
The allocator provides a contention-free statistics gathering mode, and a debugging mode for dynamically checking allocation pre/post conditions and invariants.
These modes are invaluable for understanding and debugging a program's dynamic allocation behaviour, with low enough cost to be used in production code.
An example is presented for condensing the allocation API using advanced type-systems, providing a single type-safe allocation routine using named arguments.
Finally, performance results across a number of benchmarks show llheap is competitive with other modern memory allocators.
\end{abstract}

\begin{CCSXML}
<ccs2012>
<concept>
<concept_id>10011007.10011006.10011072</concept_id>
<concept_desc>Software and its engineering~Software libraries and repositories</concept_desc>
<concept_significance>500</concept_significance>
</concept>
<concept>
<concept_id>10010147.10011777.10011014</concept_id>
<concept_desc>Computing methodologies~Concurrent programming languages</concept_desc>
<concept_significance>300</concept_significance>
</concept>
</ccs2012>
\end{CCSXML}

\ccsdesc[500]{Software and its engineering~Software libraries and repositories}
\ccsdesc[300]{Computing methodologies~Concurrent programming languages}

\keywords{memory allocation, (user-level) concurrency, type-safety, statistics, debugging, high performance}

\received{20 February 2007}
\received[revised]{12 March 2009}
\received[accepted]{5 June 2009}


\maketitle

\section{Introduction}

Memory management services a series of program allocation/deallocation requests and attempts to satisfy them from variable-sized blocks of memory while minimizing total memory usage.
A general-purpose memory allocator cannot anticipate storage requests, so its time and space performance cannot be optimal (bin packing).
Each allocator takes advantage of a subset of typical allocation patterns (heuristics) to produce excellent results, both in time and space (similar to LRU paging).
Nevertheless, allocators are a series of compromises, possibly with static or dynamic tuning parameters to optimize specific request patterns.


\subsection{Memory Structure}
\label{s:MemoryStructure}

Figure~\ref{f:ProgramAddressSpace} shows the typical layout of a program's address space (high addresses to low) divided into a number of zones, with free memory surrounding the dynamic code/data~\cite{memlayout}.
Static code and data are placed into memory at load time from the executable and are fixed-sized at runtime.
Dynamic code/data memory is managed by the dynamic loader for libraries loaded at runtime, which is complex especially in a multi-threaded program~\cite{Huang06}.
However, changes to the dynamic code/data space are typically infrequent, with most occurring at program startup, and are largely outside of a program's control.
Stack memory is managed by the program call/return mechanism using a LIFO technique.
For stackful coroutines and user threads, new stacks are commonly created in the dynamic-allocation memory.
The dynamic-allocation memory is often a contiguous area, which starts empty and grows/shrinks as the program creates/deletes variables with independent lifetime.
The language's runtime manages this area, where management complexity is a function of the mechanism for deleting variables.

\begin{figure}
\centering
\input{AddressSpace.pstex_t}
\vspace{-5pt}
\caption{Program Address Space Divided into Zones}
\label{f:ProgramAddressSpace}
\end{figure}


\subsection{Dynamic Memory-Management}
\label{s:DynamicMemoryManagement}

Modern programming languages provide two forms of storage management: managed or unmanaged.
Both forms have explicit allocation, but managed memory has implicit deallocation (garbage collection~\cite{Wilson92}, GC) and unmanaged memory has some form of explicit deallocation.
Managed memory sometimes also provides explicit deallocation hints.
Both forms attempt to reuse freed storage in the heap for new allocations.
Unmanaged languages have no information about allocated \newterm{objects}, and hence, use techniques during freeing to detect adjacent unused storage when coalescing.
Conservative GC attempts to find free objects in an unmanaged system by scanning memory and marking anything that \emph{looks} like a live object.
However, \emph{conservative} means some non-objects might be marked as live;
the goal is not to miss any live objects.
Managed languages maintain sufficient information to locate all live objects.
Precise GC is then able to mark just the live objects.
Both approaches then sweep through the unmarked objects looking for adjacent free storage to coalesce.
Precise GC has a further coalescing option of compacting used objects and adjusting the pointers used to find them to the new locations, resulting in a large area of contiguous free storage.
Languages such as Lisp~\cite{CommonLisp}, Java~\cite{Java}, Haskell~\cite{Haskell}, and Go~\cite{Go} are managed and normally implemented using precise GC.
(Both Go~\cite{Go1.3} and Netscape JavaScript~\cite{JavaScriptGC} switched from conservative to precise GC.)
Languages such as C~\cite{C}, \CC~\cite{C++}, Rust~\cite{Rust}, and Swift~\cite{swift} (because of explicit management of weak references) are unmanaged but could be used with conservative GC~\cite{Boehm88}.
This work only examines unmanaged memory with \emph{explicit} deallocation.
% While GC is not part of this work, some of the results are applicable to the allocation phase in any memory-management approach.

Most programs use a general-purpose allocator, usually the one provided by the programming-language's runtime.
In certain languages, programmers can write specialized allocators for specific needs.
POSIX~\cite{POSIX17} provides for replacement of the default memory allocator in C and \CC through a standard API.
Most industry JVMs provide multiple GCs, from which a user selects one for their workload.
%Jikes RVM MMTk~\cite{MMTk} provides a similar generalization for the Java virtual machine.
As well, new languages support concurrency (kernel/user threading), which must be safely handled by the allocator.
Hence, alternative allocators exist for C/\CC with the goal of scaling in multi-threaded programs~\cite{Berger00,mtmalloc,streamflow,tcmalloc}.
This work examines the design of high-performance allocators for use by kernel and user multi-threaded applications written in C/\CC.


\subsection{Contributions}
\label{s:Contributions}

This work provides the following contributions to the area of explicit concurrent dynamic-allocation.
\begin{enumerate}[leftmargin=18pt,topsep=3pt,itemsep=0pt]
\item
Implementation of a new stand-alone concurrent low-latency memory-allocator, called llheap~\cite{llheap}, ($\approx$1,500 lines of code) for C/\CC programs using kernel threads (1:1 threading), and specialized versions for the concurrent languages \uC~\cite{uC++} and \CFA~\cite{Moss18,Delisle21} using user-level threads running on multiple kernel threads (M:N threading).

\item
Extend the C allocation API with new functions @aalloc@, @amemalign@, @cmemalign@, @resize@, @aligned_resize@, @aligned_realloc@, and @aligned_reallocarray@ to make allocation properties orthogonally accessible (see the usage sketch after this list).

\item
Extend the C allocation semantics by preserving with each allocation its request size, the amount allocated, whether it is zero fill, and its alignment.

\item
Provide additional query operations @malloc_alignment@, @malloc_zero_fill@, and @malloc_size@ to access allocation information.

\item
Use the preserved zero fill and alignment as \emph{sticky} properties for @realloc@ and @reallocarray@ to zero-fill and align when storage is extended or copied.
Without this extension, it is unsafe to @realloc@ storage if the properties are not preserved when copying.
This silent problem is unintuitive to programmers and difficult to locate because it is transient.

\item
Provide optional extensive, fast, and contention-free allocation statistics to understand allocation behaviour.

\item
Provide runtime checks to validate allocation operations and identify the amount of unfreed storage at program termination.

\item
Build 8 different versions of the allocator: static or dynamic linking, with or without statistics or debugging.
A program may link to any of these 8 versions of the allocator often without recompilation (linking or @LD_PRELOAD@).

\item
Demonstrate how advanced programming-language type-systems can condense the allocation API providing a single type-safe allocation function using named arguments.

\item
Create a benchmark test-suite for comparing allocators, rather than relying on a suite of arbitrary programs.

\item
Run performance experiments using the new benchmark test-suite comparing llheap with six of the best allocators in use today.
The goal is to demonstrate that llheap's performance, both in time and space, is comparable to the best allocators in use today.
\end{enumerate}
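
For example, the following fragment sketches how the extended API and query routines compose; the calls follow the llheap API introduced above, but the program fragment is illustrative rather than taken from llheap's test suite.
\begin{C++}
// Illustrative use of the extended allocation API.
struct S { double d[4]; };
// aligned (64-byte), zero-filled array of 10 elements
struct S * s = cmemalign( 64, 10, sizeof( struct S ) );
// query the properties preserved with the allocation
assert( malloc_alignment( s ) == 64 && malloc_zero_fill( s ) );
// sticky properties: the extension stays 64-byte aligned and zero fills the new space
s = realloc( s, 20 * sizeof( struct S ) );
\end{C++}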


\section{Background}

The following is an overview of allocator design options that affect memory usage and performance (see~\cite{Zulfiqar22} for more details).
Dynamic allocation acquires and releases \newterm{object} storage via calls such as @malloc@/@new@ and @free@/@delete@ in C/\CC, respectively.
A \newterm{memory allocator} contains a complex data-structure and code that manages the layout of objects in the dynamic-allocation zone.
% The management goals are to make allocation/deallocation operations as fast as possible while densely packing objects to make efficient use of memory.
Since objects in C/\CC cannot be moved, only adjacent free storage can be \newterm{coalesced} into larger free areas.
The allocator grows or shrinks the dynamic-allocation zone to obtain storage for objects and reduce memory usage using \newterm{operating system} (OS) calls, such as @mmap@ or @sbrk@ in UNIX.


\vspace*{-7pt}
\subsection{Allocator Components}
\label{s:AllocatorComponents}

Figure~\ref{f:AllocatorComponents} shows the two important data components for a memory allocator, management and storage, collectively called the \newterm{heap}.
The \newterm{management data} is a data structure located at a known memory address and contains fixed-sized information in the static-data memory that references components in the dynamic-allocation memory.
For multi-threaded programs, additional management data may exist in \newterm{thread-local storage} (TLS) for each kernel thread executing the program.
The \newterm{storage data} is composed of allocated/freed objects, and \newterm{reserved memory}.
Allocated objects (white) are variable sized, and are allocated and maintained by the program;
\ie only the program knows the location of allocated storage.
Freed objects (light grey) represent memory deallocated by the program, which are linked into one or more lists facilitating location for new allocations.
Reserved memory (dark grey) is one or more blocks of memory obtained from the OS but not yet used by the program;
if there are multiple reserved blocks, they are normally linked together.

\begin{figure}
\centering
\input{AllocatorComponents}
\caption{Allocator Components (Heap)}
\label{f:AllocatorComponents}
\end{figure}

In many allocator designs, allocated objects and reserved blocks have management data embedded within them (see also Section~\ref{s:ObjectContainers}).
Figure~\ref{f:AllocatedObject} shows an allocated object with a header, trailer, and optional spacing around the object.
The header contains information about the object, \eg size, type, \etc.
The trailer may be used to simplify coalescing and/or for safety purposes to mark the end of an object.
An object may be preceded by padding to ensure proper alignment.
Some algorithms quantize allocation requests, resulting in additional space after an object.
When padding and spacing are necessary, neither can be used to satisfy a future allocation request while the current allocation exists.

A free object often contains management data, \eg size, pointers, \etc.
Often the free list is linked internally, so it does not consume additional storage, \ie the link fields are placed at known locations in the unused memory blocks.
For internal linking, the amount of management data for a free node defines the minimum allocation size, \eg if 16 bytes are needed for a free-list node, allocation requests less than 16 bytes are rounded up.
Often the minimum storage alignment and free-node size are the same.
When an object transitions from allocated to freed or vice versa, its contents are overwritten by management information or new program data, respectively.
For safety purposes, freed storage may be scrubbed (overwritten) to expose inadvertent bugs, such as assuming variables are zero initialized.
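
As a concrete illustration, a header and an internally linked free node might be declared as follows; the layout is hypothetical, not llheap's actual header.
\begin{C++}
#include <stddef.h>
// Hypothetical per-object header with an internally linked free node:
// the header precedes the object, and the link overlays the first object
// bytes once the object is freed.
struct Header {
	size_t size;                  // object size, possibly with flag bits
};
struct FreeNode {
	struct Header header;         // management data retained while freed
	struct FreeNode * next;       // link placed in the unused object storage
};
// The link field sets the minimum object size: smaller requests are rounded
// up so every freed object can hold the free-list link.
\end{C++}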

\begin{figure}
\centering
\input{AllocatedObject}
\caption{Allocated Object}
\label{f:AllocatedObject}

\bigskip

\input{IntExtFragmentation}
\caption{Internal and External Fragmentation}
\label{f:InternalExternalFragmentation}
\end{figure}


\subsection{Single-Threaded Memory-Allocator}
\label{s:SingleThreadedMemoryAllocator}

In a sequential (single-threaded) program, the program thread performs all allocation operations without direct concurrency issues.
However, interrupts introduce indirect concurrency if the signal handler performs allocation/deallocation (serially reusable problem~\cite{SeriallyReusable}).
In general, the primary issues in a single-threaded allocator are fragmentation and locality.


\subsubsection{Fragmentation}
\label{s:Fragmentation}

Fragmentation is unused memory requested from the OS.
Figure~\ref{f:InternalExternalFragmentation} shows fragmentation has two forms: \emph{internal} or \emph{external}.

\newterm{Internal fragmentation} is inaccessible \emph{allocated} memory, such as headers, trailers, \etc.
Internal fragmentation is problematic when management space approaches the object size, \eg for objects $<$16 bytes, memory usage doubles.

\newterm{External fragmentation} is memory not allocated in the program~\cite{Wilson95,Lim98,Siebert00}, which includes all external management data, freed objects, and reserved memory.
This memory is problematic, resulting in heap blowup and fragmented memory.
\newterm{Blowup} occurs when freed memory becomes a checkerboard of adjacent allocated and free areas, where the free blocks are too small to service requests, leading to unbounded external fragmentation growth~\cite{Berger00}.
Heap blowup is a fundamental problem in unmanaged languages without compaction.

Three basic approaches for controlling fragmentation are identified~\cite{Johnstone99}.
The first approach is \newterm{sequential-fit} with a list of free objects (possibly ordered by size) that is searched for a block large enough to fit a requested object.
Different search policies determine the free object selected, \eg the first free object large enough (first fit) or closest to the requested size (best fit).
Any storage larger than the request can become spacing after the object or split into a smaller free object.

The second approach is \newterm{segregation} or \newterm{binning} with a set of lists for different sized freed objects.
The request size is rounded up to the nearest bin size, often leading to internal fragmentation after the object.
A binning algorithm searches for the smallest bin that covers the request, and selects the first free object, if available.
Fewer bin sizes mean more internal fragmentation but increased reuse, as more request sizes match a bin size.
More bin sizes mean less internal fragmentation but more external fragmentation, as larger bins cannot service smaller requests.
Allowing larger bins to service smaller allocations means the freed object can be returned to the matching or larger bin (some advantages to either scheme).
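
As an example of binning, a common scheme quantizes requests to power-of-2 sizes; the following is a minimal sketch of the technique, not a specific allocator's size classes.
\begin{C++}
// Minimal power-of-2 binning sketch (illustrative size classes).
enum { MinSize = 16 };
static inline size_t bin_size( size_t request ) {
	size_t size = MinSize;
	while ( size < request ) size <<= 1;   // round up to the next bin
	return size;                           // internal fragmentation = size - request
}
// bin_size( 24 ) returns 32, giving 8 bytes of internal fragmentation.
\end{C++}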

The third approach is \newterm{splitting} and \newterm{coalescing}.
If there is no matching free storage for an allocation, a larger free object is split to satisfy the allocation and the smaller remainder is put back on the free list.
For example, in the \newterm{buddy system}, a block of free memory is split into equal chunks, and splitting continues until a minimal block is created that fits the allocation.
When an object is deallocated, it is coalesced with the objects immediately before/after it in memory, if they are free, creating a larger block.
Coalescing can be done eagerly at each deallocation or lazily when an allocation cannot be fulfilled.
However, coalescing increases latency (unbounded delays) for both allocation and deallocation.
While coalescing does not reduce external fragmentation, the coalesced blocks improve fragmentation quality so future allocations are less likely to cause heap blowup.
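
In a binary buddy system, a block's buddy is located with a single exclusive-or, which is what makes the coalescing check cheap; the following sketch assumes power-of-2 block sizes and a heap base aligned to the largest block.
\begin{C++}
#include <stdint.h>
// Binary-buddy address arithmetic sketch: assumes power-of-2 block sizes
// and a heap base aligned to the largest block size.
static inline void * buddy_of( char * heapBase, void * block, size_t size ) {
	uintptr_t offset = (uintptr_t)block - (uintptr_t)heapBase;
	return heapBase + (offset ^ size);   // flip the bit selecting the buddy pair
}
// If the buddy is free and the same size, the pair coalesces into a block of
// twice the size, and the check repeats at the next level.
\end{C++}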


\subsubsection{Locality}
\label{s:Locality}

The principle of locality recognizes that programs tend to reference a small set of data, called a \newterm{working set}, for a certain period of time, composed of temporal and spatial accesses~\cite{Denning05}.
% Temporal clustering implies a group of objects are accessed repeatedly within a short time period, while spatial clustering implies a group of objects physically close together (nearby addresses) are accessed repeatedly within a short time period.
% Temporal locality commonly occurs during an iterative computation with a fixed set of disjoint variables, while spatial locality commonly occurs when traversing an array.
Hardware takes advantage of the working set through multiple levels of caching and paging, \ie the memory hierarchy.
% When an object is accessed, the memory physically located around the object is also cached with the expectation that the current and nearby objects will be referenced within a short period of time.
% For example, entire cache lines are transferred between cache and memory, and entire virtual-memory pages are transferred between memory and disk.
% A program exhibiting good locality has better performance due to fewer cache misses and page faults\footnote{With the advent of large RAM memory, paging is becoming less of an issue in modern programming.}.

Temporal locality is largely controlled by program accesses to variables~\cite{Feng05}.
An allocator has only indirect influence on temporal locality but largely dictates spatial locality.
For temporal locality, an allocator tries to return recently freed storage for new allocations, as this memory is still \emph{warm} in the memory hierarchy.
For spatial locality, an allocator places objects used together close together in memory, so the working set of the program fits into the fewest possible cache lines and pages.
% However, usage patterns are different for every program as is the underlying hardware memory architecture;
% hence, no general-purpose memory-allocator can provide ideal locality for every program on every computer.

An allocator can easily degrade locality by increasing the working set.
For example, it can access an unbounded number of free objects when matching an allocation or coalescing, causing multiple cache or page misses~\cite{Grunwald93}.
An allocator can also spatially separate related data by binning free storage anywhere in memory, so related objects become highly separated.


\subsection{Multi-Threaded Memory-Allocator}
\label{s:MultiThreadedMemoryAllocator}

In a concurrent program, multiple kernel threads (KT) perform allocations, requiring some form of mutual exclusion.
Along with fragmentation and locality issues, a multi-threaded allocator must deal with false sharing and additional forms of heap blowup.


\subsubsection{Mutual Exclusion}
\label{s:MutualExclusion}

% \newterm{Mutual exclusion} provides sequential access to the shared-management data of the heap.
There are two performance issues for mutual exclusion.
First is the cost of performing atomic instructions every time a shared resource is accessed to provide mutual exclusion.
Solutions using any atomic fence, atomic instruction (lock free), or lock along the fast path result in significant slowdown, even with zero contention.
Second is \newterm{contention} on simultaneous access, where threads must wait until the resource is released.
Contention can be reduced by:
1) using multiple fine-grained locks versus a few coarse-grained locks to spread the contention;
2) using trylock and generating new storage if the lock is busy (classic space versus time tradeoff).
% 3) Using one of the many lock-free approaches for reducing contention on basic data-structure operations~\cite{Fatourou12}.
% However, all approaches have degenerate cases where program contention to the heap is high, which is beyond the allocator's control.
Figure~\ref{f:ThreadHeapRelationship} shows how a multi-threaded allocator reduces contention by subdividing a single heap into multiple heaps.

\begin{figure}
\centering
\subfloat[T:1]{
	\input{SingleHeap}
	\label{f:SingleHeap}
} % subfloat
\vrule
\subfloat[T:H]{
	\input{SharedHeaps}
	\label{f:SharedHeaps}
} % subfloat
\vrule
\subfloat[1:1]{
	\input{PerThreadHeap}
	\label{f:PerThreadHeap}
} % subfloat
\caption{Multiple Heaps, Thread:Heap Relationship}
\label{f:ThreadHeapRelationship}
\end{figure}

\begin{description}[leftmargin=*]
\item[T:1 model (Figure~\ref{f:SingleHeap})] is all threads (T) sharing a single heap (1).
% The arrows indicate memory movement for allocation/deallocation operations.
% Memory is obtained from freed objects, reserved memory, or the OS;
% freed memory can be returned to the OS.
To handle concurrency, a single lock is used for all heap operations or fine-grained (lock-free) locking if operations can be made independent.
As threads perform large numbers of allocations, a single heap becomes a significant source of contention.

\item[T:H model (Figure~\ref{f:SharedHeaps})] is multiple threads (T) sharing multiple heaps (H).
The allocator allocates/deallocates heaps and assigns threads to heaps, often based on dynamic contention pressure.
While locking is required for heap access, contention is (normally) reduced as access is spread across the heaps.
Locking can be reduced (eliminated) using the T:C variant, \ie each CPU has a heap, and a thread cannot migrate from the CPU if executing an allocator critical-section, implemented with restartable critical sections~\cite{Desnoyers19,Dice02} (see also Section~\ref{s:UserlevelThreadingSupport}).
% The goal is minimal heaps (storage) and contention per heap (time).
Multiple heaps increase external fragmentation as the ratio of heaps to threads increases, which can lead to heap blowup, where the worst-case scenario is more heaps than threads.
The external fragmentation is now multiplied by the number of heaps, since each heap manages its own free storage and allocates its own reserved memory.
When freeing, objects normally need to be returned to their original heap (see Section~\ref{s:Ownership}).
% Returning storage to the OS may be difficult or impossible, \eg the contiguous @sbrk@ area in Unix.
% In the worst case, a program in which objects are allocated from one heap but deallocated to another heap means these freed objects are never reused.

A shared \newterm{global heap} (G) is often introduced to manage the reserved memory among heaps and centralize interactions with the OS.
Instead of heaps making individual object allocations/deallocations through the global heap, resulting in locking and high contention, the global heap partitions the reserved memory into heap (allocation) buffers, which are given out to heaps for their own suballocations.
Hence, a heap's allocations are temporally and spatially accessed densely in a small set of buffers, rather than spread sparsely across the entire reserved memory.
Buffers are allocated at heap startup, after which allocation often reaches a steady state through free lists.
Allocation buffers may increase external fragmentation, since some memory may never be used.

\item[1:1 model (Figure~\ref{f:PerThreadHeap})] is each thread (1) having its own heap (1), eliminating most contention and locking if threads seldom access another thread's heap (see Section~\ref{s:Ownership}).
A thread's objects are consolidated in its heap, better utilizing the cache and paging during thread execution.
In contrast, the T:H model can spread thread objects over a larger area in different heaps.
%For example, assume page boundaries coincide with cache line boundaries, if a thread heap always acquires pages of memory then no two threads share a page or cache line unless pointers are passed among them.
When a thread terminates, it can free its heap objects to the global heap, or the thread heap is retained as-is and reused for a new thread in the future.
Destroying a heap can reduce external fragmentation sooner, since all free objects in the global heap are available for immediate reuse.
Alternatively, reusing a heap can aid the inheriting thread, if it has a similar allocation pattern, because the heap is primed with freed storage of the right sizes.
\end{description}



\subsubsection{False Sharing}
\label{s:FalseSharing}

False sharing occurs for read/write or write/write accesses among threads modifying different objects that share a cache line~\cite{Bolosky93}.
A write invalidates the other thread's cached copy of the line, even though each thread may be uninterested in the other's modified object.
False sharing can occur in three ways:
1) thread T$_1$ allocates objects O$_1$ and O$_2$ on the same cache line and passes O$_2$'s reference to thread T$_2$;
2) thread T$_1$ allocates object O$_1$ and thread T$_2$ allocates O$_2$, where O$_1$ and O$_2$ are on the same cache line;
3) T$_2$ deallocates O$_2$, T$_1$ allocates O$_1$ on the same cache line as O$_2$, and T$_2$ reallocates O$_2$ while T$_1$ is using O$_1$.
In all three cases, the false sharing is hidden and possibly transient (non-deterministic), making it extremely difficult to find and fix.
Case 1) occurs in all three allocator models, and is induced by program behaviour, not the allocator.
Cases 2) and 3) are allocator induced, and occur in the T:1 and T:H models due to heap sharing, but not in the 1:1 model with private heaps, except possibly at boundary points among heaps.
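
Case 1) is easy to produce inadvertently; the following sketch assumes 64-byte cache lines and uses C11 threads for illustration.
\begin{C++}
#include <threads.h>
#include <stdlib.h>
// Two consecutive small allocations often share one 64-byte cache line;
// handing the second to another thread causes false sharing when both write.
int worker( void * arg ) {                        // thread T2 writes *o2
	for ( int i = 0; i < 10000000; i += 1 ) *(long *)arg += 1;
	return 0;
}
int main() {
	long * o1 = malloc( sizeof( long ) );
	long * o2 = malloc( sizeof( long ) );         // often same cache line as o1
	thrd_t t2;
	thrd_create( &t2, worker, o2 );               // case 1): pass o2 to T2
	for ( int i = 0; i < 10000000; i += 1 ) *o1 += 1; // T1 writes *o1
	thrd_join( t2, NULL );                        // both loops false-share the line
}
\end{C++}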


\subsubsection{Object Containers}
\label{s:ObjectContainers}

Associating header data with every allocation can result in significant internal fragmentation, as shown in Figure~\ref{f:AllocatedObject}.
While the header and object are spatially together in memory, they are generally not accessed temporally together~\cite{Feng05}.
The result is poor cache usage, since only a portion of the cache line is holding useful data from the program's perspective.
% \eg an object is accessed by the program after it is allocated, while the header is accessed by the allocator after it is free.

\begin{figure}
\centering
\input{Container}
\caption{Object Container}
\label{f:ObjectContainer}
\end{figure}

The alternative approach factors common header data to a separate location in memory and organizes associated free storage into blocks called \newterm{object containers} (\newterm{superblocks}~\cite[\S~3]{Berger00}) suballocated from a heap's allocation buffers, as in Figure~\ref{f:ObjectContainer}.
A trailer may also be used at the end of the container.
To find the header from an allocation within the container, the container is aligned on a power of 2 boundary and the lower bits of the object address are truncated (or rounded up, minus the trailer size, to obtain the trailer address).
Container size is a tradeoff between internal and external fragmentation, as some portion of a container may not be used and this portion is unusable for other kinds of allocations.
A consequence of this tradeoff is its effect on spatial locality, which can produce positive or negative results depending on the program's access patterns.
Normally, heap ownership applies to its containers.
Without ownership, different objects in a container may be on different heap free-lists.
Finally, containers are linked together for management purposes, and should all objects in a container become free, the container can be repurposed for different sized objects or given to another heap through a global heap.
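
Because a container is aligned on a power-of-2 boundary, the header for any interior object is found with a single mask; a sketch, assuming a fixed container size:
\begin{C++}
#include <stdint.h>
// Header lookup sketch for containers of an assumed fixed power-of-2 size.
enum { ContainerSize = 64 * 1024 };    // hypothetical 64K containers
struct Container;                      // factored header for all objects within
static inline struct Container * container_of( void * obj ) {
	return (struct Container *)((uintptr_t)obj & ~(uintptr_t)(ContainerSize - 1));
}
\end{C++}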


\subsubsection{Ownership}
\label{s:Ownership}

Object \newterm{ownership} is defined as the heap to which an object is returned upon deallocation~\cite[\S~6.1]{Berger00}.
If a thread returns an object to its originating heap, a heap has ownership of its objects.
Containers force ownership of their internal contiguous objects, unless the entire container changes ownership after it becomes empty.
Alternatively, a thread can return an object to the heap it is currently associated with, which can be any heap accessible during a thread's lifetime.
The advantage of ownership is preventing heap blowup by returning storage for reuse by the owner heap.
Ownership prevents the problem of a producer thread allocating from one heap, passing the object to a consumer thread, and the consumer deallocating the object to another heap, hence draining the producer heap of storage.
The disadvantage of ownership is that deallocating to another thread's heap requires an atomic operation.

Object ownership can be immediate or delayed, meaning free objects may be batched on a separate free list either by the returning or receiving thread.
The returning thread batches objects to reduce contention by passing multiple objects at once;
however, batching across multiple allocation sizes and heaps is complex, and there is no obvious time when to push a batch back to the owner heap.
It is simpler for the returning thread to immediately return objects to the receiving thread's batch list, as the receiving thread has better knowledge of when to incorporate the batch list into its free pool.
The receiving thread often delays incorporating returned storage until its local storage is drained.


\subsubsection{User-Level Threading}

Any heap model can be used with user-level (M:N) threading.
However, an important goal of user threads (UT) is fast operations (creation/termination/context-switching) by not interacting with the OS, allowing large numbers of high-performance interacting threads ($>$ 10,000).
In general, UTs use whatever kernel-level heap-model is provided by the language runtime.
Hence, a UT allocates/deallocates from/to the heap of the KT on which it is executing.

However, there is a subtle concurrency problem with user threading and shared heaps.
With kernel threading, an operation started by a KT is always completed by that thread, even if preempted;
hence, any locking correctness associated with the shared heap is preserved.
However, this correctness property is not preserved for user-level threading.
A UT can start an allocation/deallocation on one KT, be preempted by user-level time slicing, and continue running on a different KT to complete the operation~\cite{Dice02}.
When the UT continues on the new KT, it may have pointers into the previous KT's heap and hold locks associated with it.
To get the same KT safety, time slicing must be disabled/\-enabled around these operations to prevent movement.
However, eagerly disabling time slicing on the allocation/deallocation fast path is expensive, especially as preemption is infrequent (millisecond intervals).
Instead, techniques exist to lazily detect this case in the interrupt handler, abort the preemption, and return to the operation so it completes atomically.
Occasionally ignoring a preemption is normally benign;
in the worst case, ignoring preemption results in starvation.
To mitigate starvation, techniques like rolling the preemption forward at the next context switch can be used.


\section{llheap}

This section presents our new stand-alone, concurrent, low-latency memory allocator, called llheap (low-latency heap), fulfilling the GNU C Library allocator API~\cite{GNUallocAPI} for C/\CC programs using KTs, with specialized versions for the programming languages \uC and \CFA using user-level threads running over multiple KTs (M:N threading).
The primary design objective for llheap is low latency across all allocator calls independent of application access-patterns and/or number of threads, \ie very seldom does the allocator delay during an allocator call.
Excluded from the low-latency objective are (large) allocations requiring initialization, \eg zero fill, and/or data copying, along with unbounded delays to acquire storage from the OS or OS scheduling, all of which are outside the allocator's purview.
A direct consequence of this objective is very simple or no storage coalescing;
hence, llheap's design is willing to use more storage to lower latency.
This objective is apropos because systems research and industrial applications are striving for low latency and modern computers have huge amounts of RAM.
Finally, llheap's performance must be comparable with current allocators, both in space and time (see performance comparison in Section~\ref{c:Performance}).


\subsection{llheap Design}

Figure~\ref{f:llheapDesign} shows the design of llheap, which uses the following features:
1:1 allocator model eliminating locking on the fast path,
separate small (@sbrk@) and large object management (@mmap@),
headers per allocation versus containers,
small object binning (buckets) forming lists for different sized freed objects,
optional fast-lookup table for converting allocation requests into bucket sizes,
no coalescing to minimize latency,
optional heap ownership (build time),
reserved memory (buffer pool) per heap obtained from a global pool,
global heap managing freed thread heaps and interacting with the OS to obtain storage,
and an optional statistic-counters table for accumulating counts of allocation operations plus a debugging version for testing (build time).

\begin{figure}
\centering
% \includegraphics[width=0.65\textwidth]{figures/NewHeapStructure.eps}
\input{llheap}
\caption{llheap Design}
\label{f:llheapDesign}
\end{figure}

llheap starts by creating an empty array for $N$ global heaps from storage obtained using @mmap@ that persists for the program duration, where $N$ is the number of computer cores.
There is a global last-array pointer and a bump-pointer within this array to locate the next free heap storage.
When an array's storage is exhausted, another empty array is allocated.
Terminated threads push their heap onto a global-stack top-pointer, where free heaps are intrusively linked.
When statistics are turned on, there is a global top pointer for an intrusive linked-list linking \emph{all} the heaps (not shown), which is traversed to accumulate statistics counters across heaps when @malloc_stats@ is called.

When a KT starts, it pops heap storage from the heap free-list, or if empty, gets the next free heap-storage.
When a KT terminates, its heap is pushed onto the heap free-list for reuse by a new KT, which prevents unbounded heap growth.
The free heaps are stored in a stack so hot storage is reused first.
Preserving all heaps created during the program lifetime solves the storage lifetime problem when ownership is used.
This approach wastes storage if a large number of KTs are created/terminated at program start and then the program continues sequentially, which is rare.

Each heap uses segregated free-buckets that have free objects distributed across 60 different sizes from 16 to 16M.
All objects in a bucket are the same size.
The number of buckets used is determined dynamically depending on the crossover point from @sbrk@ to @mmap@ allocation, which is specified by calling @mallopt( M_MMAP_THRESHOLD )@, where the crossover must be $\ge$ the page size and $\le$ the largest bucket (16M).
Each cache-aligned bucket has a stack of the same-sized freed objects, where a stack ensures hot storage is reused first.
llheap can be configured with object ownership, where an object is freed to the heap from which it is allocated, or object no-ownership, where an object is freed to the KT's current heap.
For ownership, a shared remote stack is added to the freelist structure, so push/pop operations require locking.
Pushes are eager on each remote free \vs batching; pops are lazy: when no cheap storage is available, the entire remote stack is gulped and added to the bucket's free list.
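
One common way to realize such a remote stack is a lock-free stack where non-owner threads push individually and the owner empties (gulps) the whole list with one atomic exchange; a minimal sketch using C11 atomics, which may differ from llheap's actual synchronization:
\begin{C++}
#include <stdatomic.h>
#include <stddef.h>
struct FreeNode { struct FreeNode * next; };
// Remote push: a non-owner thread returns one object to the owner's bucket.
void remote_push( _Atomic(struct FreeNode *) * top, struct FreeNode * n ) {
	n->next = atomic_load( top );
	while ( ! atomic_compare_exchange_weak( top, &n->next, n ) ) {}
}
// Gulp: the owner takes the entire remote stack in one atomic exchange and
// then links the nodes into its private free list without further locking.
// Only the owner removes nodes, which sidesteps the classic ABA problem.
struct FreeNode * gulp( _Atomic(struct FreeNode *) * top ) {
	return atomic_exchange( top, NULL );
}
\end{C++}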

Initial threads are assigned empty heaps from the heap array.
The first thread allocation causes a request for storage from the shared @sbrk@ area.
The size of this request is the maximum of the request size and the @sbrk@-extension-size / 16.
This heuristic means the @sbrk@ area is subdivided into separate heap buffers (HB) per thread, providing no contention and data locality.
A thread does bump allocation in its current buffer until it starts reusing freed storage, or there is insufficient storage and it obtains another buffer.
Thread buffers are not linked;
they are only logically connected to the thread through allocated and deallocated storage.
When a thread ends, its heap is returned to the heap array but no storage is released.
A new thread receiving a freed heap starts with it fully populated with freed storage.
The heuristic is that threads often do similar work, so the free storage in the heap is reusable, resulting in less internal fragmentation.
%The heuristic is that threads often do similar work so the free storage in the heap is immediately available.
%The downside is the risk of more external fragmentation, if the freed storage is never reused.
The downside is external fragmentation if the freed storage is never reused.

Algorithm~\ref{alg:heapObjectAlloc} shows the allocation outline for an object of size $S$.
The allocation is divided into small (@sbrk@) or large (@mmap@).
For small allocations, $S$ is quantized into a bucket size.
Quantizing is performed using a direct lookup for sizes $<$ 64K or a binary search over the ordered bucket array for $S \ge$ 64K.
Then, the allocation storage is obtained from the following locations, in order of increasing latency: the bucket's free stack, the heap's local buffer, the bucket's remote stack, the global buffer, and the OS (@sbrk@).
For large allocations, the storage is directly allocated from the OS using @mmap@.
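
A sketch of the two-level quantization, assuming a precomputed index table for small sizes and an ordered array of bucket sizes:
\begin{C++}
#include <stddef.h>
// Quantization sketch: direct lookup below 64K, binary search above.
enum { NBUCKETS = 60 };
extern const size_t bucketSizes[NBUCKETS];     // ordered bucket sizes
extern const unsigned char lookup[64 * 1024];  // size -> bucket index (precomputed)
static inline size_t quantize( size_t S ) {
	if ( S < 64 * 1024 ) return bucketSizes[ lookup[ S ] ]; // O(1) table lookup
	size_t lo = 0, hi = NBUCKETS - 1;          // find smallest bucket >= S
	while ( lo < hi ) {
		size_t mid = (lo + hi) / 2;
		if ( bucketSizes[ mid ] < S ) lo = mid + 1; else hi = mid;
	}
	return bucketSizes[ lo ];
}
\end{C++}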

\begin{algorithm}[t]
\caption{Dynamic object allocation of size $S$}
\label{alg:heapObjectAlloc}
\begin{algorithmic}
\STATE $S \gets S + \text{header-size}$
\IF {$S < \textit{mmap-threshold}$}
	\STATE $\textit{B} \gets \text{smallest free-bucket} \geq S$
	\IF {$\textit{B's free-list \(\neg\)empty}$}
		\STATE $\textit{O} \gets \text{pop an object from B's free-list}$
	\ELSIF {$\textit{heap's allocation buffer} \ge B$}
		\STATE $\textit{O} \gets \text{bump allocate object of size B from allocation buffer}$
	\ELSIF {$\textit{heap's remote-list \(\neg\)empty}$}
		\STATE $\textit{merge heap's remote-list into free-list}$
		\STATE $\textit{O} \gets \text{pop an object from B's free-list}$
	\ELSE
		\STATE $\textit{O} \gets \text{allocate an object of size B from global pool}$
	\ENDIF
\ELSE
	\STATE $\textit{O} \gets \text{allocate an object of size S using \lstinline{mmap} system-call}$
\ENDIF
\RETURN $\textit{O}$
\end{algorithmic}
\end{algorithm}

Algorithm~\ref{alg:heapObjectFreeOwn} shows the deallocation (free) outline for an object at address $A$ with ownership.
First, the address is divided into small (@sbrk@) or large (@mmap@).
For small allocations, the bucket associated with the request size is retrieved from the allocation header.
If the bucket is local to the thread, the allocation is pushed onto the thread's associated bucket.
If the bucket is not local to the thread, the allocation is pushed onto the owning thread's remote stack.
For large allocations, the storage is unmapped back to the OS.
Without object ownership, the algorithm is the same as for ownership except when the bucket is not local to the thread.
In that case, the corresponding bucket of the owner thread is computed by the deallocating thread, and the allocation is pushed onto the deallocating thread's corresponding bucket, \ie no search is required.

\begin{algorithm}[t]
\caption{Dynamic object free at address $A$ with object ownership}
\label{alg:heapObjectFreeOwn}
\begin{algorithmic}
\IF {$\textit{A heap allocation}$}
	\STATE $\text{B} \gets \textit{A's owner bucket}$
	\IF {$\textit{B's thread = current heap thread}$}
		\STATE $\text{push A to B's free-list}$
	\ELSE
		\STATE $\text{push A to B's remote-list}$
	\ENDIF
\ELSE
	\STATE $\text{return A to system using system call \lstinline{munmap}}$
\ENDIF
\end{algorithmic}
\end{algorithm}

\begin{comment}
\begin{algorithm}
\caption{Dynamic object free at address $A$ without object ownership}
\label{alg:heapObjectFreeNoOwn}
\begin{algorithmic}[1]
\IF {$\textit{A mapped allocation}$}
	\STATE $\text{return A's dynamic memory to system using system call \lstinline{munmap}}$
\ELSE
	\STATE $\text{B} \gets \textit{A's owner bucket}$
	\IF {$\textit{B is thread-local heap's bucket}$}
		\STATE $\text{push A to B's free-list}$
	\ELSE
		\STATE $\text{C} \gets \textit{thread local heap's bucket with same size as B}$
		\STATE $\text{push A to C's free-list}$
	\ENDIF
\ENDIF
\end{algorithmic}
\end{algorithm}
\end{comment}

Finally, the llheap design funnels all allocation/deallocation operations through the @malloc@ and @free@ routines, which are the only routines to directly access and manage the internal data structures of the heap.
Other allocation operations, \eg @calloc@, @memalign@, and @realloc@, are composed of calls to @malloc@ and possibly @free@, and may manipulate header information after storage is allocated.
This design simplifies heap-management code during development and maintenance.
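
For instance, a zero-filled allocation can be composed from the funnel routine plus post-processing of the header; a simplified sketch, not llheap's actual code, where the header update is a hypothetical helper:
\begin{C++}
#include <stdlib.h>
#include <string.h>
// Simplified composition of calloc from malloc: allocate, record the sticky
// zero-fill property in the header, and clear the storage.
void * calloc_sketch( size_t dim, size_t elemSize ) {
	size_t size = dim * elemSize;        // real code must check for overflow
	void * addr = malloc( size );
	if ( addr ) {
		mark_zero_fill( addr );          // hypothetical header update
		memset( addr, 0, size );
	}
	return addr;
}
\end{C++}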


\subsubsection{Bounded Allocation}

The llheap design results in bounded allocation.
For small allocations, once all the buckets have freed objects, storage is recycled.
For large allocations, the storage is directly recycled back to the OS.
When a thread terminates, its heap is recycled to the next new thread and the above process begins for that thread.
The pathological case is threads allocating a large amount of storage, freeing it, and then quiescing, which demonstrates that the bound constant can be large.
This pathological pattern occurs for \emph{immortal} threads, \eg I/O threads with program lifetime and bursts of activity performing many allocations/deallocations.
Hence, independent of external fragmentation in thread heaps, storage cannot grow unbounded unless the program does not free.


\subsubsection{Alignment}

The minimum storage alignment $M$ comes from the architecture application-binary-interface (ABI) based on hardware factors: bus width (32 or 64-bit), largest register (double, long double), largest atomic instruction (double compare-and-swap), or vector data (Intel MMX).
An access with a nonaligned address may be slow or cause an error.
A memory allocator must assume the largest hardware requirement because it is unaware of the data type occupying the allocation.
Often the minimum storage alignment is an 8/16-byte boundary on a 32/64-bit computer, respectively.
Alignments larger than $M$ are powers of 2, such as page alignment (4/8K).
Any alignment less than $M$ is raised to the minimal alignment.
830
831llheap aligns its allocation header on an $M$ boundary and its size is $M$, making the following data $M$ aligned.
832This pattern means there is no minimal alignment computation along the allocation fast path, \ie new storage and reused storage is always correctly aligned.
833An alignment $N$ greater than $M$ is accomplished with a \emph{pessimistic} request for storage that ensures \emph{both} the alignment and size request are satisfied.
834\begin{center}
835\input{Alignment2}
836\end{center}
837The amount of storage necessary is $alignment - M + size$, which ensures there is an address, $A$, after the storage returned from @malloc@, $P$, that is a multiple of $alignment$ followed by sufficient storage for the data object.
838The approach is pessimistic if $P$ happens to have the correct alignment $N$, and the initial allocation has requested sufficient space to move to the next multiple of $N$.
839In this case, there is $alignment - M$ bytes of unused storage after the data object, which could be used by @realloc@.
840Note, the address returned by the allocation is $A$, which is subsequently returned for deallocation.
841However, the deallocation requires the value $P$, which must be computable from $A$, from which $H$ can be computed $P - M$.
842Hence, there must be a mechanism to detect $P$ $\neq$ $A$ and compute $P$ from $A$.
843
844To detect and perform this computation, llheap uses two headers:
845the \emph{original} header $H$ associated with the allocation, and a \emph{fake} header $F$ within this storage before the alignment boundary $A$.
846\begin{center}
847\input{Alignment2Impl}
848\end{center}
849Since every allocation is aligned at $M$, $P$ $\neq$ $A$ only holds for alignments greater than $M$.
850When $P$ $\neq$ $A$, the minimum distance between $P$ and $A$ is $M$ bytes, due to the pessimistic storage allocation.
851Therefore, there is always room for an $M$-byte fake header before $A$.
852The fake header must supply an indicator to distinguish it from a normal header and the location of address $P$ generated by the allocation.
853This information is encoded as an offset from A to P and the initial alignment (discussed in Section~\ref{s:ReallocStickyProperties}).
854To distinguish a fake header from a normal header, the least-significant bit of the alignment is set to 1 because the offset participates in multiple calculations, while the alignment is just remembered data.
855\begin{center}
856\input{FakeHeader}
857\end{center}
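The computation can be sketched in C as follows, where @M@ is the minimal alignment, @N@ is a power of 2 greater than @M@, and @FakeHeader@ is a hypothetical layout for the fake header.
\begin{C++}
typedef struct { size_t offset, alignment; } FakeHeader; // hypothetical M-byte layout

void * memalign_sketch( size_t N, size_t size ) {
    char * P = (char *)malloc( N - M + size ); // pessimistic request
    char * A = (char *)(((uintptr_t)P + N - 1) & ~(uintptr_t)(N - 1)); // round up to multiple of N
    if ( A != P ) { // P not already N-aligned ?
        FakeHeader * F = (FakeHeader *)(A - M); // gap A - P >= M, so F fits before A
        F->offset = A - P; // recover P on deallocation, and H = P - M
        F->alignment = N | 1; // low bit distinguishes fake from normal header
    }
    return A;
}
\end{C++}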

Note, doing alignment with containers requires a separate container for the aligned fixed-sized objects, so there are more kinds of containers that must be managed.


\subsubsection{\lstinline{realloc} and Sticky Properties}
\label{s:ReallocStickyProperties}

The allocation routine @realloc@ provides a memory management pattern for shrinking/enlarging an existing allocation, while maintaining some or all of the object data.
The realloc pattern is simpler than the suboptimal manual steps.
\begin{flushleft}
\setlength{\tabcolsep}{10pt}
\begin{tabular}{ll}
\multicolumn{1}{c}{\textbf{realloc pattern}} & \multicolumn{1}{c}{\textbf{manual}} \\
\begin{C++}
T * naddr = realloc( oaddr, newSize );



\end{C++}
&
\begin{C++}
T * naddr = (T *)malloc( newSize ); $\C[2in]{// new storage}$
memcpy( naddr, oaddr, oldSize < newSize ? oldSize : newSize ); $\C{// copy old bytes}$
free( oaddr ); $\C{// free old storage}$
oaddr = naddr; $\C{// change pointer}\CRT$
\end{C++}
\end{tabular}
\end{flushleft}
The manual steps are suboptimal because they ignore any internal fragmentation at the end of the existing allocation due to bucket sizes.
If this fragmentation is sufficiently large for the size increase, @realloc@ can grow the allocation in place, eliminating a new allocation and copying.
Alternatively, when shrinking, there may be a reasonable crossover point where simply increasing the internal fragmentation eliminates a new allocation and copying.
Hence, using @realloc@ as often as possible can reduce storage management costs.
In fact, if @oaddr@ is @nullptr@, @realloc@ does a @malloc( newSize )@, and if @newSize@ is 0, @realloc@ does a @free( oaddr )@, so all allocation/deallocation can be done with @realloc@.

The hidden problem with this pattern is the effect of zero fill and alignment with respect to reallocation.
For safety, these properties must persist (be ``sticky'') when storage size changes.
Prior to llheap, allocation properties are not preserved across reallocation, nor is it possible to query an allocation to maintain these properties manually.
Hence, a random call to @realloc@ that reallocates storage may cause downstream errors, if allocation properties are needed.
This silent problem is unintuitive to programmers, can cause catastrophic failure, and is difficult to debug because it is transient.
To prevent these problems, llheap preserves initial allocation properties within an allocation, allowing them to be queried, and the semantics of @realloc@ preserve these properties on any storage change.
As a result, the realloc pattern is efficient and safe.
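For example, using the llheap query routine @malloc_zero_fill@ (see Section~\ref{s:ExtendedCAPI}):
\begin{C++}
int * ia = (int *)calloc( 10, sizeof( int ) ); // sets sticky zero-fill property
ia = (int *)realloc( ia, 20 * sizeof( int ) ); // 10 new elements are zero filled
assert( malloc_zero_fill( ia ) ); // sticky property persists and is queryable
\end{C++}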

Note, @realloc@ has a compile-time disadvantage \vs @malloc@, because @malloc@ simplifies optimization opportunities.
For @malloc@ the compiler knows the new storage address is not aliased, which is not true for @realloc@: the same storage can be returned.
The compiler uses this knowledge to optimize the region of code between the @malloc@ call and the point where the pointer escapes or it finds the matching @free@.
For @realloc@, the compiler must also analyse the code \emph{before} the call and this analysis may fail.

Finally, there is a flaw in @realloc@'s definition: if there is no memory to allocate new storage for an expansion, the original allocation is not freed or moved, @errno@ is set to @ENOMEM@, and a null pointer is returned.
This semantics preserves the original allocation so the data is not lost in a failure case.
However, most calls to @realloc@ are written: @p = realloc( p, size )@, so the original storage is leaked when pointer @p@ is overwritten with null, negating the benefit of not freeing the storage for recovery purposes.
Programmers can follow a coding pattern of:
\begin{C++}
char * p;
...
void * p1 = realloc( p, size );
if ( p1 ) p = (char *)p1;
else // release some storage
\end{C++}
However, most programmers ignore return codes.
A better alternative is to change @realloc@'s interface to be like @posix_memalign@, which returns two results, a return code and a storage address, so the error code is separate from the returned storage.
\begin{C++}
int retcode = realloc( (void **)&p, size );
\end{C++}
which returns 0 or @ENOMEM@, only changes @p@ for expansion, but requires an ugly cast on the call.


\subsubsection{Sticky Test}

Since sticky properties are an important safety feature for @realloc@, an ad-hoc @realloc@ test was created (not shown) to test whether a memory allocator preserves zero-fill from @calloc@ and/or alignment from @memalign@.
The first test @calloc@s a large array (zero fill), sets the array to 42, shortens it, and then enlarges it to the original size.
It does these steps 100 times attempting to get a reused large block of memory that is still set to 42, showing new storage does not preserve zero fill.
The second test @memalign@s storage and @realloc@s it multiple times making it larger until the current storage must be copied into new storage.
The alignment of each storage address returned from @realloc@ is verified with the original alignment.
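A minimal sketch of the zero-fill half of the test follows, where @N@ is a large array size and the details differ from the actual harness.
\begin{C++}
for ( int t = 0; t < 100; t += 1 ) { // repeat, attempting to get a reused block
    char * p = (char *)calloc( N, 1 ); // zero filled
    memset( p, 42, N ); // dirty the storage
    p = (char *)realloc( p, N / 2 ); // shorten
    p = (char *)realloc( p, N ); // enlarge to original size
    for ( size_t i = N / 2; i < N; i += 1 ) // enlarged area must be zero if sticky
        if ( p[i] == 42 ) printf( "zero fill not preserved\n" );
    free( p );
}
\end{C++}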

If a test fails, that sticky property is not provided;
if the test passes, that sticky property is provided in some form but not necessarily in all forms (test just got lucky).
If an allocator fails these tests, it is unnecessary to perform a manual inspection of the @realloc@ code for sticky properties.
Only llheap passes both tests, as its @realloc@ applies sticky properties.


\subsubsection{Header}

To preserve allocation properties requires storing additional information about an allocation.
Figure~\ref{f:llheapHeader} shows llheap captures this information in the per-object header, which has two fields (left/right) sized appropriately for 32/64-bit alignment requirements.

\begin{figure}
\centering
\input{Header}
\caption{llheap Header}
\label{f:llheapHeader}
\end{figure}

The left field is a union of three values:
\begin{description}[leftmargin=*,topsep=2pt,itemsep=2pt,parsep=0pt]
\item[bucket pointer]
is for deallocation and points back to the bucket associated with this storage request (see Figure~\ref{f:llheapDesign} for the fields accessible in a bucket).
\item[mapped size]
is for deallocation of mapped storage and is the storage size for unmapping.
\item[next free block]
is an intrusive pointer linking same-size free blocks onto a bucket's stack of free objects.
\end{description}
The low-order 3 bits of these fields are unused for any stored values, due to the minimum alignment of 8 bytes (even for 32-bit addressing).
The 3 unused bits are used to represent mapped allocation, zero filled, and alignment, respectively.
Note, the zero-filled/mapped bits are only used in the normal header and the alignment bit in the fake header.
This implementation allows a fast test of whether any of the lower 3 bits are set (@&@ and compare).
If no bits are on, it implies a basic allocation, which is handled quickly in the fast path for allocation and free;
otherwise, the bits are analysed and appropriate actions are taken for the complex cases.

The right field remembers the allocation request size versus the allocation (bucket) size, \eg a request of 42 bytes is rounded up to 64 bytes.
Since programmers think in request size rather than allocation size, the request size allows better generation of statistics or errors and also helps in memory management.
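A minimal C sketch of the header follows, where the field and flag names are hypothetical.
\begin{C++}
typedef struct Header {
    union { // left field, pointer sized
        struct Bucket * bucket; // allocated (small): owning bucket
        size_t mappedSize; // allocated (mmap): size for munmap
        struct Header * next; // freed: intrusive link on bucket free-stack
    } left;
    size_t requestSize; // right field: original request size
} Header;
enum { MAPPED = 1, ZEROFILL = 2, ALIGNED = 4 }; // 3 unused low-order bits; ALIGNED only in fake header
// fast-path test: no flag bits set => basic allocation
#define BASIC( H ) (((uintptr_t)(H)->left.bucket & 7) == 0)
\end{C++}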


\subsection{Statistics and Debugging}

llheap can be built to accumulate fast and largely contention-free allocation statistics to help understand dynamic memory behaviour.
Incrementing statistic counters necessarily occurs on the allocation fast path, so it must be cheap.
To make statistics performant enough for use on running systems, each heap has its own set of statistic counters, so statistic operations do not require slow atomic operations.
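A minimal sketch of the counter arrangement, with hypothetical names:
\begin{C++}
typedef struct HeapStats {
    uint64_t mallocCalls, mallocStorage; // counters for malloc
    // ... matching counters for calloc, realloc, free, mmap, etc.
} HeapStats;
struct Heap {
    HeapStats stats; // updated only by the owning thread
    struct Heap * next; // link for statistics traversal (see below)
    // ... buckets and remote stack
};
// fast path: plain, non-atomic increments
myHeap->stats.mallocCalls += 1;
myHeap->stats.mallocStorage += size;
\end{C++}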

To locate all statistic counters, heaps are linked together in statistics mode, and this list is locked and traversed to sum all counters across heaps.
Note, the list is locked to prevent errors traversing an active list, which may have nodes added or removed dynamically;
the statistics counters are not locked and can flicker during accumulation.
Hence, printing statistics during program execution is an approximation.
Figure~\ref{f:StatiticsOutput} shows an example of statistics output, which covers all allocation operations and information about deallocating storage not owned by a thread.
No other memory allocator provides such comprehensive statistical information.
These statistics were invaluable during the development of llheap for debugging and verifying correctness, and should be equally valuable to application developers.

\begin{figure}
\begin{C++}
PID: 2167216 Heap statistics: (storage request / allocation)
  malloc    >0 calls 19,938,000,110; 0 calls 2,064,000,000; storage 4,812,152,081,688 / 5,487,040,092,624 bytes
  aalloc    >0 calls 0; 0 calls 0; storage 0 / 0 bytes
  calloc    >0 calls 7; 0 calls 0; storage 1,040 / 1,152 bytes
  memalign  >0 calls 0; 0 calls 0; storage 0 / 0 bytes
  amemalign >0 calls 0; 0 calls 0; storage 0 / 0 bytes
  cmemalign >0 calls 0; 0 calls 0; storage 0 / 0 bytes
  resize    >0 calls 0; 0 calls 0; storage 0 / 0 bytes
  realloc   >0 calls 0; 0 calls 0; storage 0 / 0 bytes
            copies 0; smaller 0; alignment 0; 0 fill 0
  free      !null calls 19,938,000,092; null / 0 calls 4,064,000,004; storage 4,812,152,003,021 / 5,487,040,005,152 bytes
            remote pushes 4; pulls 0; storage 0 / 0 bytes
  sbrk      calls 1; storage 8,388,608 bytes
  mmap      calls 2,000,000; storage 2,097,152,000,000 / 2,105,344,000,000 bytes
  munmap    calls 2,000,000; storage 2,097,152,000,000 / 2,105,344,000,000 bytes
  remainder calls 0; storage 0 bytes
  threads   started 4; exited 4
  heaps     $new$ 4; reused 0
\end{C++}
\caption{Statistics Output}
\label{f:StatiticsOutput}
\end{figure}

llheap can also be built with debug checking, which inserts many asserts along all allocation paths.
These assertions detect incorrect allocation usage, like double frees, unfreed storage, or memory corruption where internal values (like header fields) are overwritten.
These checks are best effort as opposed to complete allocation checking as in @valgrind@~\cite{valgind}.
Nevertheless, the checks detect many allocation problems.
There is a problem in detecting unfreed storage because some library routines assume their allocations persist for the program's lifetime, and hence, do not free their storage.
For example, @printf@ might allocate a 1024-byte buffer on the first call and never delete this buffer.
To prevent a false positive for unfreed storage, it is possible to specify an amount of storage that is never freed (see @malloc_unfreed@ in Section~\ref{s:ExtendedCAPI}), and it is subtracted from the total allocate/free difference.
Determining the amount of never-freed storage is annoying, but once done, any warnings of unfreed storage are application related.
Debugging mode also scrubs each allocation with @0xff@, so assumptions about zero-filled objects generate errors.
Finally, if a program does segment-fault in debug mode, a stack backtrace is printed to help in debugging.

Tests indicate only a 30\% performance decrease when statistics \emph{and} debugging are enabled in programs with 10\% to 15\% allocation cost, and the latency cost for accumulating statistics from each heap is mitigated by limited calls, often only one at the end of the program.


% \subsection{Design Choices}
%
% llheap's design was reviewed and changed multiple times during its development.
% All designs focused on the allocation/free \newterm{fast path}, \ie the shortest code path for the most common operations.
% The model chosen is 1:1, giving one heap per thread for each kernel thread (KT).
% Hence, immediately after a KT starts, its heap is created and just before a KT terminates, its heap is (logically) deleted.
% Therefore, the majority of heap operations are uncontended, modulo operations on the global heap and ownership.
%
% Problems:
% \begin{itemize}[leftmargin=*,topsep=3pt,itemsep=2pt,parsep=0pt]
% \item
% Need to know when a KT starts/terminates to create/delete its heap.
%
% \noindent
% It is possible to leverage constructors/destructors for thread-local objects to get a general handle on when a KT starts/terminates.
% \item
% There is a classic \newterm{memory-reclamation} problem for ownership because storage passed to another thread can be returned to a terminated heap.
%
% \noindent
% The classic solution only deletes a heap after all referents are returned, which is complex.
% The cheap alternative is for heaps to persist for program duration to handle outstanding referent frees.
% If old referents return storage to a terminated heap, it is handled in the same way as an active heap.
% To prevent heap blowup, terminated heaps can be reused by new KTs, where a reused heap may be populated with free storage from a prior KT (external fragmentation).
% In most cases, heap blowup is not a problem because programs have a small allocation set-size, so the free storage from a prior KT is apropos for a new KT.
% \item
% There can be significant external fragmentation as the number of KTs increases.
%
% \noindent
% In many concurrent applications, good performance is achieved with the number of KTs proportional to the number of CPUs.
% Since the number of CPUs is relatively small, and a heap is also relatively small, $\approx$10K bytes (not including any associated freed storage), the worst-case external fragmentation is still small compared to the RAM available on large servers with many CPUs.
% \item
% Need to prevent preemption during a dynamic memory operation because of the \newterm{serially-reusable problem}.
% \begin{quote}
% A sequence of code that is guaranteed to run to completion before being invoked to accept another input is called serially-reusable code.~\cite{SeriallyReusable}\label{p:SeriallyReusable}
% \end{quote}
% If a KT is preempted during an allocation operation, the OS can schedule another KT on the same CPU, which can begin an allocation operation before the previous operation associated with this CPU has completed, invalidating heap correctness.
% Note, the serially-reusable problem can occur in sequential programs with preemption, if the signal handler calls the preempted function, unless the function is serially reusable.
% Essentially, the serially-reusable problem is a race condition on an unprotected critical subsection, where the OS is providing the second thread via the signal handler.

% There is the same serially-reusable problem with UTs migrating across KTs.
% \end{itemize}
% Tests showed this design produced the closest performance match with the best current allocators, and code inspection showed most of these allocators use different variations of this approach.


\subsection{Bootstrapping}

There are problems bootstrapping a memory allocator.
Programs can be statically or dynamically linked.
The order in which the linker schedules startup code is poorly supported, so it cannot be controlled entirely.
Knowing a KT's start and end independently from the KT code is also difficult.

For static linking, the allocator is loaded with the program.
Hence, allocation calls immediately invoke the allocator operation defined by the loaded allocation library and there is only one memory allocator used in the program.
This approach allows allocator substitution by placing an allocation library before any other in the linked/load path.

Allocator substitution is similar for dynamic linking.
However, the dynamic loader starts first and needs to perform dynamic allocations \emph{before} the substitution allocator is loaded.
As a result, the dynamic loader uses a default allocator until the substitution allocator is loaded, after which all allocation operations are handled by the substitution allocator, including those from the dynamic loader.
Hence, some part of the @sbrk@ area may be used by the default allocator, so the substitution allocator's statistics cannot be completely accurate.
Furthermore, dynamic linking uses an assembler trampoline to call the procedure linkage table resolver, so there is an additional cost along the allocator fast path for all allocation operations.
Testing showed up to a 5\% performance decrease with dynamic linking as compared to static linking, even when using @tls_model( "initial-exec" )@ to obtain tighter binding.

After the allocator is loaded, it needs to be initialized before the first allocation request.
Currently, the only mechanism to control initialization is via constructor routines (see below), each with an integer priority, where the linker calls the constructors in increasing order of priority.
However, there are few conventions for priorities amongst libraries, where constructors with equal priorities are called in arbitrary order.
(Only a transitive closure of references amongst library calls can establish an absolute initialization order.)
As a result, the first call to an allocation routine can occur without initialization.
To deal with this problem, it is necessary to have a global flag that is checked along the allocation fast path to trigger initialization (singleton pattern).

Along these lines, there is a subtle problem in defining when a program starts and ends.
For example, prologue/epilogue code outside of the program should not be considered in statistics as the application does not make these calls.
llheap establishes these two points using constructor/destructor routines with initialization priority 100, where system libraries use priorities $\le$ 100 and application programs have priorities $>$ 100.
\begin{flushleft}
\hspace*{\parindentlnth}
\setlength{\tabcolsep}{20pt}
\begin{tabular}{@{}ll@{}}
\begin{C++}
@__attribute__(( constructor( 100 ) ))@
static void startup( void ) {
    // clear statistic counters
    // reset allocUnfreed counter
}

\end{C++}
&
\begin{C++}
@__attribute__(( destructor( 100 ) ))@
static void shutdown( void ) {
    // sum allocUnfreed for all heaps
    // subtract global unfreed storage
    // if allocUnfreed > 0 then print warning message
}
\end{C++}
\end{tabular}
\end{flushleft}
Hence, @startup@ is called after the program prologue but before the application starts, and @shutdown@ is called after the program terminates but before the program epilogue.
By resetting counters in @startup@, prologue allocations are ignored, and checking unfreed storage in @shutdown@ checks only application memory management, ignoring the program epilogue.

Unfortunately, @startup@/@shutdown@ only apply to the program KT, not to any additional KTs created by the program.
However, it is essential for the allocator to know when each KT is started/terminated to initialize/de-initialize the KT's heap.
Initialization can be handled by making the global flag (above) thread-local and then the initialization check along the fast path covers the first allocation by a newly created thread.
De-initialization is handled by registering a destructor routine using @pthread_key_create@ in the initialization code triggered along the fast path, which subsequently calls the destructor at thread termination.
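A minimal sketch of this pattern follows, where @acquireHeap@ and @recycleHeap@ are hypothetical routines that create/reuse and retire a thread's heap.
\begin{C++}
static pthread_key_t heapKey; // one key shared by all threads
static pthread_once_t once = PTHREAD_ONCE_INIT;
static __thread Heap * myHeap; // zero-initialized in each new thread

static void recycleHeap( void * heap ) { /* retire heap for reuse */ } // at thread termination
static void makeKey( void ) { pthread_key_create( &heapKey, recycleHeap ); }

void * malloc( size_t size ) {
    if ( ! myHeap ) { // fast-path check: first allocation by this thread ?
        pthread_once( &once, makeKey );
        myHeap = acquireHeap(); // create or reuse a heap
        pthread_setspecific( heapKey, myHeap ); // trigger recycleHeap at termination
    }
    // ... fast-path allocation from myHeap
}
\end{C++}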


\subsection{User-level Threading Support}
\label{s:UserlevelThreadingSupport}

llheap is the underlying allocator in the user-threading programming languages \uC and \CFA.
These systems have preemptive scheduling, which requires management of timing events through a signal handler (@SIGALRM@).
The complexity in these systems is the serially-reusable problem (see Section~\ref{s:SingleThreadedMemoryAllocator}) when UTs are time sliced (language level) independently from KTs (OS level).
The solution is to prevent interrupts resulting in a CPU or KT change during critical operations, eliminating problems like starting a memory operation on one KT and completing it on another when the underlying heaps are different.
% For user-level threading, the serially-reusable problem occurs with time slicing for preemptable user-level scheduling, as the interrupted UT is unlikely to be restarted on the same KT.
However, without time slicing, a long running UT prevents the execution of other UTs (starvation).

The languages modify llheap using two techniques to prevent time slicing during non-interruptible allocation operations.
On the slow path, when executing expensive operations, time-slicing interrupts are disabled/enabled, so the operation completes atomically on the KT.
On the fast path, all non-interruptible allocation/deallocation routines are placed in a separate code segment.
The linker places this segment into a contiguous block of memory. %, but the order of routines within the block is unspecified.
Then the time-slice signal handler compares the program counter at the point of interrupt with the start/end address of the non-interruptible segment, and if executing within the segment, the signal handler returns without context switching.
The llheap funnel design simplifies this implementation so only a few funnel and statistics routines are located in the non-interruptible section.
% This technique is fragile as no mechanism exists to ensure all crucial code along the fast path is placed into the non-interruptible segment.
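A minimal sketch of the program-counter check for x86-64 Linux follows, assuming linker-provided symbols delimiting the non-interruptible segment.
\begin{C++}
extern char fastpath_start[], fastpath_end[]; // assumed symbols from a linker script

static void sigalrmHandler( int sig, siginfo_t * info, void * uctx ) {
    uintptr_t pc = ((ucontext_t *)uctx)->uc_mcontext.gregs[REG_RIP]; // interrupted PC
    if ( pc >= (uintptr_t)fastpath_start && pc < (uintptr_t)fastpath_end )
        return; // inside allocator fast path => defer the time slice
    // ... otherwise context switch the current user-level thread
}
\end{C++}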

Interestingly, marking non-interruptible operations by bracketing them with a set/reset of a thread-local flag fails, as read/write is not atomic on some machines.
For example, the ARM processor stores the thread-local pointer in a coprocessor register that cannot perform atomic base-displacement addressing.
Hence, there is a window between loading the kernel-thread-local pointer from the coprocessor register into a normal register and adding the displacement when a time slice can move a UT.
As well, switching to a T:C model with restartable critical sections using @librseq@~\cite{Desnoyers19} was examined (see Section~\ref{s:MutualExclusion}).
However, tests showed that while @librseq@ can determine the particular CPU quickly, setting up the restartable critical-section along the allocation fast-path produced a significant decrease in performance.
Also, the number of undoable writes in @librseq@ is limited and restartable sequences cannot deal with UT migration across KTs.
For example, UT$_1$ is executing an allocation by KT$_1$ on CPU$_1$ and a time-slice preemption occurs.
The signal handler context switches UT$_1$ onto the user-level ready-queue and starts running UT$_2$ on KT$_1$, which immediately performs an allocation.
Since KT$_1$ is still executing on CPU$_1$, @librseq@ takes no action because it assumes KT$_1$ is still executing the same critical section.
Then UT$_1$ is scheduled onto KT$_2$ by the user-level scheduler, and its allocation operation continues in parallel with UT$_2$ using references into the heap associated with CPU$_1$, which corrupts CPU$_1$'s heap.
If @librseq@ had an @rseq_abort@ which:
\begin{enumerate}[leftmargin=*,topsep=2pt,itemsep=0pt,parsep=0pt]
\item
marks the current restartable critical-section as cancelled so it restarts when attempting to commit.
\item
does nothing if there is no current restartable critical section in progress.
\end{enumerate}
then @rseq_abort@ could be called on the backside of a user-level context switch.
A feature similar to this idea might exist for hardware transactional memory.
A significant effort was made to make this approach work but its complexity, lack of robustness, and performance costs resulted in its rejection.


\subsection{C API}

Figure~\ref{f:CDynamicAllocationAPI} shows the C dynamic allocation API, which is neither orthogonal nor complete.
For example, it is possible to zero fill or align an allocation but not both, it is only possible to zero fill an array allocation, and it is not possible to resize a memory allocation without data copying.
As a result, programmers must provide missing alternatives, which is error prone and rightly blamed on C's poor allocation API.
Furthermore, newer programming languages have better type systems that can provide safer and more powerful APIs for memory allocation.
The following presents llheap API changes.

\begin{figure}
\hspace*{\parindentlnth}
\begin{tabular}{@{}l|l@{}}
\begin{C++}
void * malloc( size_t size );
void * calloc( size_t dimension, size_t size );
void * realloc( void * oaddr, size_t size );
void * reallocarray( void * oaddr, size_t dimension, size_t size );
void free( void * addr );
void * memalign( size_t alignment, size_t size );
void * aligned_alloc( size_t alignment, size_t size );
int posix_memalign( void ** memptr, size_t alignment, size_t size );
void * valloc( size_t size );
void * pvalloc( size_t size );
\end{C++}
&
\begin{C++}
int mallopt( int option, int value );
size_t malloc_usable_size( void * addr );
void malloc_stats( void );
int malloc_info( int options, FILE * fp );

// Unsupported
struct mallinfo mallinfo( void );
int malloc_trim( size_t );
void * malloc_get_state( void );
int malloc_set_state( void * );
\end{C++}
\end{tabular}
\caption{llheap support of C dynamic-allocation API}
\label{f:CDynamicAllocationAPI}
\end{figure}


\subsubsection{Extended C API}
\label{s:ExtendedCAPI}

llheap transparently augments the C dynamic memory API to increase functionality, orthogonality, and safety.
\begin{itemize}[leftmargin=*,topsep=3pt,itemsep=2pt,parsep=0pt]
\item
@malloc@ remembers the original allocation size separate from the actual allocation size.
\item
@calloc@ sets the sticky zero-fill property.
\item
@memalign@, @aligned_alloc@, @posix_memalign@, @valloc@ and @pvalloc@ set the sticky alignment property, remembering the specified alignment size.
\item
@realloc@ and @reallocarray@ preserve sticky properties across copying.
\item
@malloc_stats@ prints detailed statistics of allocation/free operations when linked with a statistic version.
\item
Existence of shell variable @MALLOC_STATS@ implicitly calls @malloc_stats@ at program termination, so precompiled programs do not have to be modified.
\end{itemize}

llheap extends the C dynamic-memory API with new allocation operations with APIs matching existing C counterparts.
\begin{itemize}[leftmargin=*,topsep=3pt,itemsep=1pt,parsep=0pt]
\item
@aalloc@ extends @calloc@ for dynamic array allocation \emph{without} zero-filling the memory (faster than @calloc@).
\item
@resize@ extends @realloc@ for resizing an allocation \emph{without} copying previous data or preserving sticky properties (faster than @realloc@).
\item
@resizearray@ extends @resize@ for an array allocation (faster than @reallocarray@).
\item
@amemalign@ extends @aalloc@ with alignment and sets sticky alignment property.
\item
@cmemalign@ extends @amemalign@ with zero fill and sets sticky zero-fill and alignment property.
\item
@aligned_resize@ extends @resize@ with an alignment.
\item
@aligned_resizearray@ extends @resizearray@ with alignment.
\item
@aligned_realloc@ extends @realloc@ with alignment.
\item
@aligned_reallocarray@ extends @reallocarray@ with alignment.
\end{itemize}

llheap extends the C dynamic memory API with new control operations.
The following routines are called \emph{once} during llheap startup to set specific limits \emph{before} an application starts.
Setting these values early is essential because allocations can occur from the dynamic loader and other libraries before application code executes.
To set a value, define a specific routine in an application and return the desired value, \eg
\begin{C++}
size_t malloc_extend() { return 16 * 1024 * 1024; }
\end{C++}
\begin{itemize}[leftmargin=*,topsep=0pt,itemsep=1pt,parsep=0pt]
\item
@malloc_extend@ returns the number of bytes to extend the @sbrk@ area when there is insufficient free storage to service an allocation request.
\item
@malloc_mmap_start@ returns the crossover allocation size from the @sbrk@ area to separate mapped areas, see also @mallopt( M_MMAP_THRESHOLD )@.
\item
@malloc_unfreed@ returns the amount subtracted from the global unfreed program storage to adjust for unreleased storage from routines like @printf@ (debug only).
\end{itemize}

llheap extends the C dynamic-memory API with functions to query object properties.
\begin{itemize}[leftmargin=*,topsep=3pt,itemsep=1pt,parsep=0pt]
\item
@malloc_size@ returns the requested size of a dynamic object, which is updated when an object is resized, similar to @malloc_usable_size@.
\item
@malloc_alignment@ returns the object alignment, where the minimal alignment is 16 bytes.
\item
@malloc_zero_fill@ returns true if the object is zero filled.
\item
@malloc_remote@ returns true if the object is from a remote heap (@OWNERSHIP@ only).
\end{itemize}
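For example, these queries allow an allocation's properties to be replicated manually:
\begin{C++}
void * q = memalign( 64, 100 ); // sticky alignment, request size 100
size_t align = malloc_alignment( q ); // 64
size_t size = malloc_size( q ); // 100, the request size
void * r = memalign( align, 2 * size ); // larger object with the same properties
\end{C++}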
1277
1278llheap extends the C dynamic-memory API with new statistics control.
1279\begin{itemize}[leftmargin=*,topsep=3pt,itemsep=1pt,parsep=0pt]
1280\item
1281@malloc_stats_fd@ sets the file descriptor for @malloc_stats@ writes (default @stdout@).
1282\item
1283@malloc_stats_clear@ clears the statistics counters for all thread heaps.
1284\item
1285@heap_stats@ extends @malloc_stats@ to only print statistics for the heap associated with the executing thread.
1286\end{itemize}
1287
1288
1289\subsubsection{Modern Allocation API}
1290
1291Modern programming languages have complex type systems that can be used to consolidate the panoply of memory allocation routines and features, providing a simpler programming experience and safety.
1292The \CFA language is used to demonstrate this capability, because llheap forms the memory allocator for this C variant, but other languages can provide similar APIs.
1293
1294\CFA polymorphism reduces the allocation API to two overloaded routines allocating a single object or an array of objects.
1295\begin{cfa}
1296forall( T & ) {
1297 T * alloc( /* list of property functions ... */ ) { ... } // singleton allocation
1298 T * alloc( size_t @dimension@, /* list of property functions ... */ ) { ... } // array allocation
1299}
1300\end{cfa}
1301Because the \CFA type system uses the return type to select overloads (like Ada), this capability is leveraged to remove the object-size parameter and return cast for regular calls to C @malloc@ or @memalign@.
1302\begin{cfa}
1303inline T * alloc( ... ) {
1304 if ( _Alignof(T) <= defaultAlign() ) return @(T *)@malloc( @sizeof(T)@ ); // C allocation
1305 else return @(T *)@memalign( @_Alignof(T)@, @sizeof(T)@ ); // C allocation
1306}
1307\end{cfa}
1308The calls to these two routine are now much safer than the C equivalents.
1309\begin{C++}
1310int * ip = alloc(); $\C[2.75in]{// T => int, sizeof => 4/8, alignment => default}$
1311double * dp = alloc(); $\C{// T => double, sizeof => 8, alignment => default}$
1312struct Spinlock { ... } [[aligned(128)]] * sp = alloc(); $\C{// T => Spinlock, sizeof => ..., alignment = 128}$
1313int * ia = alloc( 10 ); $\C{// T => int, sizeof => 4/8, alignment => default, dimension => 10}\CRT$
1314\end{C++}
1315At compile time, each call to @alloc@ extracts the return type @T@ from the left-hand side of the assignment, which is then used in @sizeof@, @_Alignof@, and casting the storage to the correct type.
1316The @inline@ and constant expression allow the compiler to remove the @if@ statement.
1317This interface removes all the common allocation-call errors in C and provides a uniform name covering all allocation reducing the cognitive burden.
1318
1319The property functions are a variable number of routines providing @alloc@ with management details and actions.
1320The functions are @align@, @fill@, @resize@, and @realloc@, and written in prefix versus postfix notation solely for aesthetic reasons, \eg @3`fill@ $\equiv$ @fill( 3 )@.
1321The examples are arrays but apply equally to singleton allocations.
1322\begin{cfa}
1323int * ip = alloc( 5, @4096`align@, @5`fill@ ); $\C[3in]{// start array on 4096 boundary and initialize elements with 5}$
1324int * ip2 = alloc( 10, @ip`fill@, @(malloc_alignment( ip ))`align@ ); $\C{// first 5 elements same as ip, same alignment as ip}$
1325_Complex double * cdp = alloc( 5, @(3.5+4.1i)`fill@ ); $\C{// initialize complex elements with 3.5+4.1i}$
1326struct S { int i, j; };
1327S * sp = alloc( 10, @((S){3, 4})`fill@ ); $\C{// initialize structure elements with {3, 4}}$
1328ip = alloc( 10, @ip`realloc@, @10`fill@ ); $\C{// make array ip larger and initialize new elements with 10}$
1329double * dp = alloc( 5, @ip2`resize@, @256`align@, @13.5`fill@ ); $\C{// reuse ip2 storage for something else}\CRT$
1330\end{cfa}
1331Finally, \CFA has constructors and destructors, like \CC, which are invoked when allocating with @new@ and @delete@.
1332\begin{cfa}
1333T * t = new( 3, 4, 5 ); $\C[3in]{// allocate T and call constructor T\{ 3, 4, 5 \}}$
1334W * w = new( 3.5 ); $\C{// allocate W and call constructor W\{ 3,5 \}}$
1335delete( t, w ); $\C{// call destructors and free t and w}\CRT$
1336\end{cfa}
1337The benefits of high-level API simplifications should not be underestimated with respect to programmer productivity and safety.
1338
1339
1340\section{Performance}
1341\label{c:Performance}
1342
1343This section uses a number of benchmarks to compare the behaviour of currently popular memory allocators with llheap.
1344The goal is to see if llheap is a competitive memory allocator;
1345no attempt is made to select a performance winner.
1346
1347
1348\subsection{Experimental Environment}
1349\label{s:ExperimentalEnvironment}
1350
1351The performance experiments are run on three different multi-core architectures, ARM, AMD, and Intel, covering memory models weak order (WO) and total store order (TSO), to determine if there is consistency across architectures:
1352\begin{description}[leftmargin=*,topsep=3pt,itemsep=2pt,parsep=0pt]
1353\item[ARM]
1354Gigabyte E252-P31 128-core socket 3.0 GHz, WO memory model
1355\item[AMD]
1356Supermicro AS--1125HS--TNR EPYC 9754 128--core socket, hyper-threading $\times$ 2 sockets (512 processing units) 2.25 GHz, TSO memory model
1357\item[Intel]
1358Supermicro SYS-121H-TNR Xeon Gold 6530 32--core, hyper-threading $\times$ 2 sockets (128 processing units) 2.1 GHz, TSO memory model
1359\end{description}
1360For the parallel experiments, threads are pinned to cores in a linear fashion, \ie from core $N$ to $N+M$, where $N$ is the start of a socket boundary.
1361This layout produces the best throughput, as there is little or no communication among threads in the benchmarks, so binding tightly to the cache layout is unnecessary;
1362hence, there is almost no OS or NUMA effects perturbing the benchmarks.
1363
1364The compilers are gcc/g++-14.2.0 and gfortran-14.2.0 running on the Linux v6.8.0-52-generic OS, with @LD_PRELOAD@ used to override the default allocator.
1365To prevent eliding certain code patterns, crucial parts of a test are wrapped by the function @pass@
1366\begin{uC++}
1367static inline void * pass( void * v ) { $\C[2.5in]{// prevent eliding, cheaper than volatile}$
1368 __asm__ __volatile__( "" : "+r"(v) ); return v;
1369}
1370void * vp = pass( malloc( 0 ) ); $\C{// wrap malloc call to prevent elision}\CRT$
1371\end{uC++}
1372The call to @pass@ can prevent a small number of compiler optimizations but this cost is the same for all allocators.
1373
1374
1375\subsection{Memory Allocators}
1376\label{s:MemoryAllocators}
1377
1378Historically, a number of C/\CC, stand-alone, general-purpose memory-allocators, \eg dlmalloc~\cite{dlmalloc}, have been written for use by programming languages providing unmanaged memory.
1379For this work, 6 of the popular, thread-safe memory-allocators are selected for comparison, along with llheap.
1380
1381\begin{description}[leftmargin=*,topsep=3pt,itemsep=2pt,parsep=0pt,listparindent=\parindent]
1382\item[glibc~\cite{glibc}] % https://sourceware.org/glibc/wiki/MallocInternals
1383is the default glibc allocator, derived from ptmalloc, derived from dlmalloc.
1384glibc has multiple threads sharing multiple heaps with a global shared heap, header per allocation, free-lists with different organizational criteria and searching, and coalescing of certain adjacent free-areas.
1385Version Ubuntu GLIBC 2.31-0ubuntu9.7 2.31 compiled by Ubuntu 24.04.
1386
1387\item[hoard~\cite{hoard}]
1388has multiple threads sharing multiple heaps with a global shared heap, where each heap is composed of superblocks containing fixed-sized objects, with each super-block having a single header for its objects and reuse of superblocks if empty.
1389Version 3.13.0, compiled with gcc-14.2.0, default configuration, using command @make@.
1390Over the past 5 years, hoard development has stopped;
1391it fails on the ARM architecture, possibly because of the WO memory model.
1392
1393\item[jemalloc~\cite{Evans06}]
1394has multiple threads sharing multiple heaps (arenas) composed of same-sized chunks subdivided into regions composed of pages where each page is a container of same-sized objects.
1395The components are organized into a number of data structures to facilitate allocations, freeing, and coalescing.
1396Large objects are allocated using @mmap@.
1397Version jemalloc-5.3.0~\cite{jemalloc}, built with the default configuration, using commands: @autogen.sh; configure; make; make install@.
1398
1399\item[mimalloc~\cite{Leijen19}]
1400has a heap per thread composed of a reserved area subdivided into 3-sized page buffers, where each page is a container of same-sized objects.
1401Each page manages its own internal free list and the free list is build when a page is created so there is no initial bump pointer.
1402Empty pages are coalesced for reuse.
1403Uses a fast freelist search for small allocation sizes.
1404Onwership is handled with a separate remote free-list, and remote frees are batched before pushing to the owner heap.
1405Version mimalloc-v2.1.2, built with the default configuration, using commands @cmake . ; make@.
1406
1407\item[tbbmalloc~{\cite[pp.~314--315]{Kukanov07}}] is the allocator shipped with Intel's Threading Building Blocks (TBB).
1408tbbmalloc has a heap per thread for small allocations, with large allocation handled using a single request.
1409There is a global heap to acquire and reuse space obtained from the OS;
1410its reserved space is divided into thread buffers (containers).
1411A thread heap is composed of linked containers, with binning used to manage the allocations/deallocations within the containers.
1412Small object space is not returned to the OS.
1413An allocation has to search its container list to find a partially filled one.
1414The search is mitigated by moving mostly-free containers to the start of the container list;
1415free containers are returned to the global heap.
1416Ownership is handled with a separate remote free-list.
1417Version @libtbbmalloc.so.2.11@, installed using @apt-get install libtbb-dev@.
1418
1419\item[tcmalloc~\cite{tcmalloc}] is the allocator shipped with Google's perftools.\footnote{
1420Currently, there are two versions of tcmalloc: Google's perftools and one experimental version available on GitHub, which is not an officially supported Google product.
1421We selected the perftools version because it is the most likely choice for users as it installs directly onto multiple OSs.}
1422tcmalloc has per CPU heaps for small allocations, with large allocation handled with a single request.
1423CPU heaps require a rollback mechanism, @rseq@, to prevent the serially-reusable problem.
1424There is a global heap to acquire and reuse space obtained from the OS;
1425its reserved space is divided into multi-page spans (containers) of fixed sized objects.
1426A CPU heap uses binning to manage the allocations/deallocations within the containers.
1427Free containers are returned to the OS.
1428Version @libtcmalloc_minimal.so.4@, installed using @apt-get install google-perftools@.
1429\end{description}
1430
1431Untested allocators:
1432\begin{description}[leftmargin=*,topsep=3pt,itemsep=2pt,parsep=0pt]
1433\item[ptmalloc3]
1434is 8 years old and already integrated into glibc.
1435\item[rpmalloc]
1436requires explicit insertion of initialization/finalization calls for handling concurrent kernel threads.
1437Having to augment programs, like SPEC CPU benchmarks, is deemed outside of normal programmer expectations.
1438% An allocator should just plugin and work.
1439\item[lock free] allocators guarantee allocation progress whether threads are delayed or killed using an atomic instruction, often CAS.
1440The original lock-free allocator~\cite{Michael04} is completely lock-free.
1441As stated, atomic instructions on the fast path result in a significant performance penalty.
1442Hence, new allocators are not completely lock free, switching to a combination of synchronization-free, \ie 1:1 allocator model, on the fast path and lock-free on the slow path(s) to manipulate shared data structures~\cite{rpmalloc}.
1443These allocators are better labelled as \newterm{hybrid locking} rather than lock free, as the lock-free aspect is not contributing to performance.
1444
1445% We observe that none of the pre-built standard malloc replacement libraries for ubuntu \url{https://launchpad.net/ubuntu/+search?text=malloc} are completely lock-free.
1446% 1:1 allocators can avoid synchronization (locks, or lock-free techniques with atomic instructions as well as cache coherence overheads) in their critical fast paths, but care must be taken to ensure the the amount of free memory captured in thread-local structures is bounded.
1447
1448% Another approach to synchronization for allocators is \newterm{Restartable Critical Sections} ~\cite {https://dl.acm.org/doi/10.1145/512429.512451, https://dl.acm.org/doi/pdf/10.5555/1698184, https://doi.org/10.1145/1064979.1064985}, which are available in linux as the \newterm{RSEQ} facility ~\cite{https://www.gnu.org/software/libc/manual/html_node/Restartable-Sequences.html}.
1449% Restartable Critical Sections provide obstruction-free progress by means of specially crafted transactions that will be rolled back if they happen to be interrupted by the kernel.
1450% Restartable Critical Sections transactions can only operate on CPU-specific data, however, which forces a T:C allocator configuration.
1451% Google's experimental tcmalloc \url{https://google.github.io/tcmalloc/rseq.html} uses RSEQ.
1452% SuperMalloc \url{ACM DL is dead at the moment, but it's in ISMM 2015} attempts to use hardware transactional memory for lock elision, but falls back to classic locking if the hardware facility is not present or when a given transactional attempt encounters repeated progress failures.
1453
1454
1455\end{description}
1456
1457Allocator size is an indirect indicator of complexity.
1458Lines-of-code are computed with command @cloc *.{h,c,cc,cpp}@, except for hoard:
1459@cloc --exclude-lang="Bourne Shell",SKILL,Markdown,Bazel Heap-Layers source include@.
1460\begin{center}
1461\setlength{\tabcolsep}{13pt}
1462\begin{tabular}{@{}rrrrrrrr@{}}
1463llheap & glibc & hoard & jemalloc & mimalloc & tbbmalloc & tcmalloc \\
14641,450 & 3,807 & 11,932 & 24,512 & 6,887 & 6,256 & 33,963 \\
1465\end{tabular}
1466\end{center}
1467
1468
1469\section{Benchmarks}
1470\label{s:Benchmarks}
1471
1472There are two basic approaches for evaluating computer software: benchmarks and micro-benchmarks.
1473\begin{description}[leftmargin=*,topsep=3pt,itemsep=2pt,parsep=0pt]
1474\item[Benchmarks]
1475are a suite of application programs (SPEC CPU/WEB) that are exercised in a common way (inputs) to find differences among underlying software implementations associated with an application (compiler, memory allocator, web server, \etc).
1476The applications are supposed to represent common execution patterns that need to perform well with respect to an underlying software implementation.
1477Benchmarks are often criticized for having overlapping patterns, insufficient patterns, or extraneous code that masks patterns, resulting in little or no information about why an application did or did perform well for the tested software.
1478\item[Micro-Benchmarks]
1479attempt to extract the common execution patterns associated with an application and run the pattern independently.
1480This approach removes any masking from extraneous application code, allows execution pattern to be very precise, and provides an opportunity for the execution pattern to have multiple independent tuning adjustments (knobs).
1481Micro-benchmarks are often criticized for inadequately representing real-world applications, but that is not their purpose.
1482\end{description}
1483
1484While some crucial software components have standard benchmarks, no standard benchmark exists for testing and comparing memory allocators.
1485In the past, an assortment of applications have been used for benchmarking allocators~\cite{Detlefs93,Berger00,Berger01,berger02reconsidering}: P2C, GS, Espresso/Espresso-2, CFRAC/CFRAC-2, GMake, GCC, Perl/Perl-2, Gawk/Gawk-2, XPDF/XPDF-2, ROBOOP, Lindsay.
1486As well, an assortment of micro-benchmark have been used for benchmarking allocators~\cite{larson99memory,Berger00,streamflow}: threadtest, shbench, Larson, consume, false sharing.
1487Many of these benchmark applications and micro-benchmarks are old and do not reflect current application allocation patterns.
1488
1489Except for the SPEC CPU benchmark, the other performance benchmarks used for testing are micro-benchmarks created for this paper.
1490All the benchmarks are used solely to extract differences among memory allocators.
1491The term benchmark in the following discussion means benchmark or micro-benchmark.
1492
1493
1494\subsection{SPEC CPU 2017}
1495
1496SPEC CPU 2017 is an industry-standardized suite for measuring and comparing performance of compute-intensive programs.
1497It contains integer and floating-point tests written in C, \CC, and Fortran, covering throughput and speed, where each test contains multiple benchmarks~\cite{SPECCPU2017}.
1498All the benchmarks perform dynamic allocation, from light to heavy.
1499However, the dynamic allocation is relatively small in comparison to the benchmark computation.
1500Therefore, differences among allocators should be small, unless a particular access pattern triggers a pathological case.
1501The reason for performing SPEC CPU across the allocators is to prove this hypothesis.
1502For allocator comparisons, we consider SPEC CPU differences of 5\% as equal and undetectable in general workloads and computing environments.
1503For compiler comparisons, small differences of 1\% or 2\% are considered significant.
1504
1505Table~\ref{t:SPEC-CPU-benchmark} shows the elapsed time (inverted throughput) of the SPEC CPU tests condensed to the geomean across the benchmarks for each of the four SPEC tests, intrate, intspeed, fprate, and fpspeed, covering integer and floating-point operations.
1506The tests are configured with size = ref, intrate/fprate: copies = 1, intspeed: threads = 1, fpspeed: threads = 16;
1507only fpspeed is concurrent using OpenMP.
1508Rigorous testing of SPEC CPU often runs many benchmark copies in parallel to completely load all computer cores.
1509However, these tests quickly run into architectural bottlenecks having little to do with an allocator's behaviour.
1510Runnning a single program bound to one core means the focus is strictly on allocator differences rather than conjoining transient OS and hardware differences.
1511The throughputs are ranked with {\color{red}red} lowest time and {\color{blue}blue} highest, where lower is best.
1512Hoard failed in multiple experiments on the ARM architecture, marked with {\color{purple}*Err*}, making it impossible to report the successful tests.
1513
1514The results show all allocators do well;
1515the average, median, and relative standard deviation (right column)\footnote{$rstd = \sigma / \mu \times 100$, where $\sigma =$ standard deviation and $\mu =$ average} proves our hypothesis that the performance difference, 0.6\% to 2.3\%, across allocators is small.
1516One implementation trend we observed is that two of the integer tests, @omnetpp@ and @xalancbmk@, had an execution pattern that exercised the cache.
1517For the three allocators using headers-per-allocation, glibc, llheap, and tbbmalloc, performance could be up to 40\% slower, between the best and worst allocator results.
1518The reason is that the headers consumed part of the cache line, resulting in more cache misses.
1519These two experiments, disproportionally increased the geomean for these allocators for both integral experiments on all architectures.
1520Hence, headers-per-allocation are disadvantaged for this specific execution pattern.
1521The floating-point tests show no trends among the allocators.
1522The goal for llheap in this experiment is to do well, which is established by it being close to the median result, meaning it is normally in the middle of the allocator results.
1523
1524\begin{table}
1525\centering
1526\caption{SPEC CPU benchmark, 3 hardware architectures, geomean per test in seconds, lower is better}
1527\label{t:SPEC-CPU-benchmark}
1528%\setlength{\tabcolsep}{6pt}
1529\begin{tabular}{@{}p{15pt}@{\hspace{15pt}}r|*{7}{r}|*{3}{r}@{}}
1530 & bench/alloc. & glibc & hoard & jemalloc & llheap & mimalloc & tbbmalloc & tcmalloc & avg & med & rstd \\
1531\cline{2-12}
1532 & intrate & {\color{blue}314.4} & {\color{violet}*Err*} & 300.3 & 309.9 & 302.6 & 313 & {\color{red}298.7} & 306.5 & 309.9 & 2\% \\
1533ARM & intspeed & {\color{blue}439.1} & {\color{violet}*Err*} & 417.6 & 431.1 & 419.9 & 436.2 & {\color{red}415.5} & 426.6 & 431.1 & 2.2\% \\
1534 & fprate & 347.6 & {\color{violet}*Err*} & {\color{red}333.9} & 352.2 & {\color{blue}356.6} & 345.9 & 344.5 & 346.8 & 347.6 & 2\% \\
1535 & fpspeed & 248.4 & {\color{violet}*Err*} & 245.3 & 245.7 & {\color{blue}250.9} & 246.6 & {\color{red}243.8} & 246.8 & 246.6 & 0.93\%
1536\end{tabular}
1537
1538\begin{comment}
1539\bigskip
1540\begin{tabular}{@{}p{15pt}@{\hspace{15pt}}r|*{7}{r}|*{3}{r}@{}}
1541 & bench/alloc. & glibc & hoard & jemalloc & llheap & mimalloc & tbbmalloc & tcmalloc & avg & med & rstd \\
1542\cline{2-12}
1543 & intrate & 251 & 242 & 239 & 249 & 240 & {\color{blue}251} & {\color{red}237} & 244 & 242 & 2.3\% \\
1544AMD & intspeed & 356 & 337 & 335 & 351 & 339 & {\color{blue}356} & {\color{red}333} & 344 & 339 & 2.7\% \\
1545 & fprate & 256 & 261 & {\color{red}250} & 257 & {\color{blue}270} & 256 & 254 & 258 & 256 & 2.3\% \\
1546 & fpspeed & 340 & {\color{blue}353} & {\color{red}326} & 338 & 348 & 341 & 328 & 339 & 340 & 2.7\%
1547\end{tabular}
1548\end{comment}
1549
1550\bigskip
1551\begin{tabular}{@{}p{15pt}@{\hspace{15pt}}r|*{7}{r}|*{3}{r}@{}}
1552 & bench/alloc. & glibc & hoard & jemalloc & llheap & mimalloc & tbbmalloc & tcmalloc & avg & med & rstd \\
1553\cline{2-12}
1554 & intrate & 251.2 & {\color{red}241.1} & 251.9 & 249.3 & 251.6 & 251.5 & {\color{blue}252.3} & 249.9 & 251.5 & 1.5\% \\
1555AMD & intspeed & {\color{blue}356.1} & {\color{red}337.1} & 355.4 & 351.7 & 355.5 & 355.8 & 355.9 & 352.5 & 355.5 & 1.8\% \\
1556 & fprate & {\color{red}253.9} & {\color{blue}259.9} & 254.4 & 255.8 & 254.5 & 254.4 & 254.7 & 255.4 & 254.5 & 0.75\% \\
1557 & fpspeed & 329.9 & {\color{blue}339.6} & 330.6 & {\color{red}327.2} & 329.9 & 329.8 & 329.5 & 330.9 & 329.9 & 1.1\%
1558\end{tabular}
1559
1560\bigskip
1561\begin{tabular}{@{}p{15pt}@{\hspace{15pt}}r|*{7}{r}|*{3}{r}@{}}
1562 & bench./alloc. & glibc & hoard & jemalloc & llheap & mimalloc & tbbmalloc & tcmalloc & avg & med & rstd \\
1563\cline{2-12}
1564 & intrate & 188.6 & 185.1 & 183.1 & 188.6 & 181.5 & {\color{blue}189.4} & {\color{red}181.2} & 185.4 & 185.1 & 1.8\% \\
1565Intel & intspeed & 271.6 & 264.6 & 263.5 & 270.2 & 261.2 & {\color{blue}272.1} & {\color{red}260.3} & 266.2 & 264.6 & 1.7\% \\
1566 & fprate & 202.7 & {\color{red}201.8} & 204.4 & 205.1 & {\color{blue}205.3} & 204.7 & 203.7 & 204 & 204.4 & 0.59\% \\
1567 & fpspeed & 237.3 & 235.3 & 234.5 & 235.6 & {\color{blue}244.5} & 236.1 & {\color{red}233.6} & 236.7 & 235.6 & 1.4\%
1568\end{tabular}
1569\end{table}


\subsection{Realloc Benchmark}

Some examination of @realloc@ is necessary to encourage its use.
Reallocation can be very efficient (both in space and time) when manipulating variable-sized objects, like strings, multi-precision numbers, or dynamically-sized arrays.
Both X11 (500+ calls) and glibc (300+ calls) use @realloc@ for various purposes.
For example, in \CC:
\begin{C++}
string s = "abc"; // initial allocation and copy new value
s = "gh"; // change size and copy new value
s = "l" + s + "r"; // change size and copy new value
s = s.substr(0,2); // reduce size
\end{C++}
variable @s@ changes size and value multiple times, plus temporary strings are created implicitly, \eg multiple concatenations, all of which require multiple allocations, copies, and deallocations.
@realloc@ can optimize some of these operations in two ways:
\begin{enumerate}[leftmargin=*]
\item
For decreasing size, Figure~\ref{f:ReallocOptDecreasing} shows a logical truncation of the existing object rather than creation of a new object, \ie a heuristic decides whether to perform the 3-step procedure (allocate, copy, and free), or to pretend the storage is decreased and return the old storage and value, performing zero work but increasing internal fragmentation.
For example, a request to decrease size from 96 to 75 bytes can be implemented either way.
With logical truncation, the 21 bytes of internal fragmentation at the end of the allocation may be unavailable, directly available if the allocator supports @malloc_usable_size@, or indirectly available if put back on the allocator free list.
\item
For increasing size, Figure~\ref{f:ReallocOptIncreasing} takes advantage of the fact that many memory allocators quantize request sizes (binning), often returning slightly more storage than requested (internal fragmentation).
For example, an initial request for 75 bytes may return 96 bytes of storage, giving 21 bytes of internal fragmentation.
For size increases of up to 21 bytes, @realloc@ can take advantage of this unused space rather than performing the 3-step procedure, which can also result in unused storage (a sketch of this grow in place follows the list).
\end{enumerate}
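A minimal sketch of observing grow in place from the application side, assuming the allocator provides the glibc extension @malloc_usable_size@ (the sizes are illustrative):
\begin{C++}
#include <malloc.h> // malloc_usable_size (glibc extension)
char * p = (char *)malloc( 75 ); // request 75 bytes
size_t usable = malloc_usable_size( p ); // e.g., 96 bytes after binning
char * q = (char *)realloc( p, usable ); // grow into the internal fragmentation
// expect q == p: no new allocation and no copy
\end{C++}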

\begin{figure}
\centering
\subfloat[Decreasing]{\label{f:ReallocOptDecreasing}\input{decreasing}}
\hspace*{5pt}
\vrule
\hspace*{5pt}
\subfloat[Increasing]{\label{f:ReallocOptIncreasing}\raisebox{0.38\totalheight}{\input{increasing}}}
\caption{Realloc Optimizations}
\label{f:ReallocOptimizations}
\end{figure}

Figure~\ref{f:reallocShrinkBenchmark} shows a benchmark to determine if an allocator takes advantage of the first optimization.
The benchmark takes a fixed-size allocation and reduces it by 10\%--90\% in steps of 10\%, checking the storage address at each reduction step to see if the same or new storage is returned.
The fixed-size allocation is varied between 64 bytes and 16K in powers of 2.
Hence, both small and large storage sizes are reduced.
The following table shows the approximate percentage point where storage is retained on shrinkage, \eg the storage reduction must be greater than 50\% of the prior allocation before a new allocation is performed for the smaller size, data is copied, and the prior storage released.
\begin{center}
\setlength{\tabcolsep}{15pt}
\begin{tabular}{@{}ccccccc@{}}
glibc & hoard & jemalloc & llheap & mimalloc & tbbmalloc & tcmalloc \\
90\% & 50\% & 20\% & 50\% & 50\% & 90\% & 50\%
\end{tabular}
\end{center}
The results show glibc and tbbmalloc effectively do not perform this optimization, while the other allocators do, with 50\% as the most popular crossover point.
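Such a crossover corresponds to a heuristic of the following form inside @realloc@; a hypothetical sketch, where @optr@/@osize@ are the old storage and size and @nsize@ is the requested smaller size:
\begin{C++}
if ( nsize >= osize / 2 ) return optr; // logical truncation: retain old storage
void * nptr = malloc( nsize ); // otherwise 3-step procedure: allocate
memcpy( nptr, optr, nsize ); // copy retained prefix
free( optr ); // free old storage
return nptr;
\end{C++}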

Figure~\ref{f:reallocGrowBenchmark} shows a benchmark to determine if an allocator takes advantage of the second optimization.
This benchmark creates an array of fixed-sized elements, increasing the array size by 1 from 1--10,000 elements.
The element size is varied over 32, 64, 128, and 256 bytes.
To prevent allocators from doing a bump allocation across the entire benchmark, a small perturbation is introduced, where storage is allocated, held, and then released at infrequent points across the experiment.
A companion experiment is a manual simulation of the @realloc@: @malloc@ new storage, copy old data, and free old storage.
Note, the @realloc@ simulation is performing an equivalent perturbation to the @realloc@ benchmark each time through the loop.
The experiment is repeated 10,000 times for @realloc@ and 100 times for the simulation to obtain similar timing ranges.
The performance difference between the @realloc@ and @realloc@-simulation experiments shows if @realloc@ is optimizing unused internal fragmentation at the end of its quantized bucket.

Figure~\ref{f:reallocGrowResults} shows the results for the @realloc@ and @realloc@-simulation benchmarks.
The difference between the benchmarks is two orders of magnitude, \ie all allocators are reusing some internal fragmentation to prevent a reallocation and copy as the array grows.
The large difference comes from the extra copying in the simulation case, which is expensive.
Within the @realloc@ benchmark, allocators glibc, hoard, jemalloc, and tbbmalloc have higher cost, while the remaining allocators have almost identical results.
Within the @realloc@-simulation benchmark, allocators glibc and tbbmalloc have higher cost, while the remaining allocators have almost identical results.
This benchmark confirms that @realloc@ can provide some level of performance benefit for dynamically growing data structures, \eg strings or arrays.
Therefore, encouraging its use is reasonable if, and only if, it is safe to do so.
Note, this encouragement is apt for container developers, where low-level storage management is performed internally for the benefit of application users.
\begin{figure}
\begin{C++}
for ( size_t p = 10; p <= 100; p += 10 ) { // percentage of original size
	for ( size_t s = 64; s <= 16 * 1024; s <<= 1 ) { // allocation sizes 64--16K
		void * prev = pass( malloc( s ) );
		void * curr = pass( realloc( prev, s * p / 100 ) ); // shrink
		bool reuse = prev == curr; // same storage returned?
		if ( reuse ) { /* print */ }
		free( curr );
	}
}
\end{C++}
\vspace*{-10pt}
\caption{\lstinline{realloc} Shrink Benchmark}
\label{f:reallocShrinkBenchmark}

\vspace*{10pt}

%\setlength{\tabcolsep}{15pt}
\begin{tabular}{@{}ll@{}}
\multicolumn{1}{c}{\lstinline{realloc}} & \multicolumn{1}{c}{\lstinline{realloc} simulation} \\
\begin{C++}
struct S { size_t ca[DIM]; }; // varied 32, 64, 128, 256
enum { Ssize = sizeof( S ) };
for ( size_t t = 0; t < @10$'$000@; t += 1 ) {
	S * sa = nullptr, * perturb = nullptr;
	for ( size_t i = 0, s = Ssize; i < 10$'$000; i += 1, s += Ssize ) {
		sa = (S *)@realloc( sa, s );@

		sa[i].ca[0] = i;
		if ( i % 1024 == 0 ) perturb = (S *)realloc( perturb, s );
	}
	free( sa );
	free( perturb );
}
\end{C++}
&
\begin{C++}
struct S { size_t ca[DIM]; }; // varied 32, 64, 128, 256
enum { Ssize = sizeof( S ) };
for ( size_t t = 0; t < @100@; t += 1 ) {
	S * sa = nullptr, * so = (S *)malloc( Ssize );
	for ( size_t i = 0, s = Ssize; i < 10$'$000; i += 1, s += Ssize ) {
		sa = (S *)@malloc( s )@; // simulate realloc
		memcpy( sa, so, s - Ssize ); // so one smaller
		sa[i].ca[0] = i;
		free( so );
		so = sa;
	}
	free( sa );
}
\end{C++}
\end{tabular}
\caption{\lstinline{realloc} Grow Benchmark}
\label{f:reallocGrowBenchmark}

\vspace*{20pt}

\hspace*{-17pt}
\setlength{\tabcolsep}{-13pt}
\begin{tabular}{@{}l@{\hspace*{-5pt}{\vrule height 1.05in}\hspace*{-5pt}}l@{}}
\begin{tabular}{@{}lll@{}}
\input{prolog.realloc.tex} & \input{swift.realloc.tex} & \input{java.realloc.tex}
\\
\multicolumn{3}{@{}c@{}}{\lstinline{realloc}, 10,000 repetitions}
\end{tabular}
&
\setlength{\tabcolsep}{-10pt}
\begin{tabular}{@{}lll@{}}
\input{prolog.reallocsim.tex} & \input{swift.reallocsim.tex} & \input{java.reallocsim.tex}
\\
\multicolumn{3}{@{}c@{}}{\lstinline{realloc} simulation, 100 repetitions}
\end{tabular}
\end{tabular}

\caption{\lstinline{realloc} Grow Results, x-axis in bytes, lower is better}
\label{f:reallocGrowResults}
\end{figure}


\subsection{Cache Benchmark}
\label{s:CacheBenchmark}

The cache benchmark attempts to look for false sharing (see Section~\ref{s:FalseSharing}).
Unfortunately, testing for allocator-induced false-sharing is difficult, because it is equivalent to searching for randomly conjoined allocations within a large storage space.
Figure~\ref{f:CacheBenchmark} shows a benchmark for program-induced false-sharing, where pointers are passed among threads.
As a side effect, this benchmark indirectly checks which allocator model is being used.
The program main runs the benchmark with 4, 8, 16, and 32 threads, passing each thread a separate array of @ASIZE@ elements dynamically allocated from the program main's heap.
Each thread then traverses the array, adding a value to each element (read and write).
The traversal is repeated @TIMES / ASIZE@ times.
Each thread frees its array at the end.
The experiment is run with a small and a medium sized array.
If there is any heap sharing, the small array has a higher probability of false sharing, \eg the first and last elements of different arrays can be juxtaposed in memory, and hence appear in the same cache line.

\begin{figure}
\begin{C++}
enum { TIMES = 10$'$000$'$000$'$000, ASIZE = 3 }; $\C{// repetitions, array size 3 or 30}$
void * worker( void * arg ) { $\C{// array passed from program main}$
	volatile size_t * arr = (size_t *)arg; $\C{// volatile prevents code elision}$
	for ( size_t t = 0; t < TIMES / ASIZE; t += 1 ) $\C{// repeat experiment N times}$
		for ( size_t r = 0; r < ASIZE; r += 1 ) $\C{// iterate through array}$
			arr[r] += r; $\C{// read/write array elements}$
	free( (void *)arr ); $\C{// cast away volatile}$
	return nullptr;
}
\end{C++}
\vspace*{-5pt}
\caption{Cache False-Sharing Benchmark}
\label{f:CacheBenchmark}
\end{figure}
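
For completeness, a minimal sketch of the driver creating the workers, assuming pthreads (@NTHREADS@ is illustrative; the harness runs with 4, 8, 16, and 32 threads):
\begin{C++}
int main() {
	enum { NTHREADS = 32 }; // varied 4, 8, 16, 32
	pthread_t workers[NTHREADS];
	for ( size_t t = 0; t < NTHREADS; t += 1 ) // arrays allocated by main => common heap
		pthread_create( &workers[t], nullptr, worker, calloc( ASIZE, sizeof( size_t ) ) );
	for ( size_t t = 0; t < NTHREADS; t += 1 ) // wait for traversals to finish
		pthread_join( workers[t], nullptr );
}
\end{C++}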

Figure~\ref{f:cacheResults} shows the results for the cache benchmark run with array sizes 3 and 30.
Allocators glibc, llheap, mimalloc, and tbbmalloc show little or no false-sharing issues at both 3 and 30 array sizes, \ie all generate virtually the same result.
Note, on the Intel, there is a rise at 32 cores because of an L3 cache shift at 16 cores; stepping to 32 cores introduces NUMA effects.
This result correlates with these allocators using a 1:1 allocator model.
Allocators hoard, jemalloc, and tcmalloc show false-sharing issues at both 3 and 30 array sizes, reducing performance by a factor of 2 at size 3.
The @perf@ performance analyzer shows a large number of cache misses for these allocators, indicating false sharing.
This result correlates with these allocators using some form of heap sharing.

\begin{figure}
\setlength{\tabcolsep}{-8pt}
\begin{tabular}{@{}l@{\hspace*{-5pt}{\vrule height 1.05in}\hspace*{-5pt}}l@{}}
\begin{tabular}{@{}lll@{}}
\input{prolog.cacheS.tex} & \input{swift.cacheS.tex} & \input{java.cacheS.tex}
\\
\multicolumn{3}{@{}c@{}}{3 Element Array}
\end{tabular}
&
\begin{tabular}{@{}lll@{}}
\input{prolog.cacheL.tex} & \input{swift.cacheL.tex} & \input{java.cacheL.tex}
\\
\multicolumn{3}{@{}c@{}}{30 Element Array}
\end{tabular}
\end{tabular}
\caption{Cache False-Sharing Results, x-axis in cores, lower is better}
\label{f:cacheResults}
\end{figure}


\subsection{Ownership Benchmark}

Historically, the Larson benchmark~\cite{larson99memory} is purported to test for ownership issues, but in actuality, the benchmark is a complex simulation of a server environment.
Multiple threads allocate and free a number of random-sized objects within a size range.
Each thread runs for a time period, and at termination, creates a child thread and passes its array of objects as an argument, which does not require synchronization.
The number of thread generations varies with thread speed.
Because the benchmark performs multiple kinds of tests, it is impossible to extract just the remote-free rate.

Therefore, a new benchmark is created to measure the asynchronous transfer cost from the deallocating to the allocating thread (remote free).
However, the allocating thread must first asynchronously transfer the allocations to the deallocating thread.
This cost needs to be mitigated so it does not mask the remote-free measurement.
To accomplish this, a thread batches its allocations (in lots of 100) and atomically exchanges this batch with a freeing thread, which then individually frees the batch components.
Hence, the cost of the asynchronous batch transfer is much less than the individual cost of the remote frees.

Figure~\ref{f:OwnershipBenchmark} shows the pseudo-code for the benchmark.
There is a global matrix of allocation addresses: one row for each thread and one column for each batch.
Each thread starts at a specific row and fills that row with two different sized allocations.
A thread then loops until it atomically exchanges its row pointer with another thread's row pointer.
The storage in the received batch is then remote freed, the batch row is refilled with new allocations, and the process repeats for a timed duration.
As well, after each allocation, an integer is written into the storage, and that integer is read before the deallocation.

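In Figure~\ref{f:OwnershipBenchmark}, @Fas@ denotes an atomic fetch-and-store (exchange); a minimal sketch, assuming C-style pointer arguments and a GCC builtin, not necessarily the benchmark's exact implementation:
\begin{C++}
// atomically store repl into *loc and return the prior value
void * * Fas( void * * volatile * loc, void * * repl ) {
	return __atomic_exchange_n( loc, repl, __ATOMIC_SEQ_CST );
}
\end{C++}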

Figure~\ref{f:Ownership} (a)--(c) shows the throughput of the ownership benchmark.
The results are divided into three groups.
glibc and tbbmalloc are slowest because of many system calls to @futex@. % and @nano_sleep@.
Figure~\ref{f:Ownership}~(d) shows the system time climbing during scaling on the AMD;
the other architectures are similar.
llheap and mimalloc are next, as these allocators do not batch remote frees, so every free requires locking.
jemalloc, hoard, and tcmalloc are fastest, as these allocators batch remote frees, reducing locking.
For 1:1 allocators, eager remote return makes sense, as the returned storage can be reused during the owning thread's lifetime.
For N:T allocators, lazy remote return using batching makes sense, as heaps outlive threads, so eventually returned storage can be used by any existing or new thread.
Batching is possible for 1:1 allocators, but results in complexity and external fragmentation, which is only warranted in certain cases.

\begin{figure}
\begin{cfa}
void * batches[MaxThread][MaxBatch]; $\C{// thread global}$
struct Aligned { CALIGN void * * col; };
volatile Aligned allocations[MaxThread];

Aligned batch = { batches[id] }; $\C{// thread local}$
size_t cnt = 0, a = 0;
for ( ; ! stop; ) { $\C{// loop for T seconds}$
	for ( ssize_t i = Batch - 1; i >= 0; i -= 1 ) { $\C{// allocations, opposite order to frees}$
		batch.col[i] = malloc( i & 1 ? 42 : 192 ); $\C{// two allocation sizes}$
		*(int *)batch.col[i] = 42; $\C{// write storage}$
	}
	Aligned obatch = batch;
	while ( (batch.col = Fas( allocations[a].col, batch.col )) == obatch.col || batch.col == nullptr ) { $\C{// atomic exchange}$
		if ( stop ) goto fini;
		a = (a + 1) % Threads; $\C{// try another batch}$
	}
	for ( size_t i = 0; i < Batch; i += 1 ) { $\C{// deallocations}$
		if ( *(int *)batch.col[i] != 42 ) abort(); $\C{// read storage check}$
		free( batch.col[i] ); $\C{// remote free}$
	}
	cnt += Batch; $\C{// sum allocations/frees}$
	a = (a + 1) % Threads; $\C{// try another batch}$
} fini: ;
\end{cfa}
\caption{Ownership Benchmark Outline}
\label{f:OwnershipBenchmark}
\end{figure}

\begin{figure}
\hspace*{-14pt}
\setlength{\tabcolsep}{-13pt}
\begin{tabular}{@{}lll@{\hspace*{-6pt}{\vrule height 2.05in}\hspace*{-6pt}}l@{}}
\input{prolog.ownership.tex}
&
\input{swift.ownership.tex}
&
\input{java.ownership.tex}
&
\input{swift.ownershipres.tex}
\end{tabular}
\caption{Ownership Results, x-axis is cores, (a)--(c) higher is better, (d) lower is better}
\label{f:Ownership}
\end{figure}


\subsection{Delay Benchmark}

The delay benchmark is a torture test of abrupt allocation patterns, looking for delays that increase latency.
A flat response across the tests means there are few or no allocator-induced pauses.
The test examines small and large requests, where small requests are handled by the heap (@sbrk@) and large requests are handled by the OS (@mmap@).
Putting large requests in the heap causes external fragmentation when freed, unless an allocator subdivides the space, leading to pauses.
The @mallopt@ function provides the option @M_MMAP_THRESHOLD@ to set the division point in bytes for requests that cannot be satisfied by an allocator's free list.
Each @sbrk@ test in this benchmark is repeated 5,000,000,000 times and each @mmap@ test is performed 1,000,000 times;
the different repetitions result from the high cost of the OS calls making the experiment run too long.
A \emph{long running} experiment, rather than short experiments with averaged results, is searching for blowup scenarios in time and/or space.
Finally, scaling is tested with 4, 8, 16, and 32 pinned threads, where the threads synchronize between tests using a @pthread@ barrier.
In all experiments, allocated storage has its first and last byte assigned a character to simulate usage.
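
For reference, a minimal sketch of configuring and exercising the threshold with the glibc API (allocators that do not support @mallopt@ simply ignore the request, as seen in the results below):
\begin{C++}
#include <malloc.h> // mallopt, M_MMAP_THRESHOLD (glibc)
mallopt( M_MMAP_THRESHOLD, 524288 ); // requests >= 512K go directly to mmap
char * big = (char *)malloc( 1048576 ); // expected to be mmap-ed
big[0] = big[1048575] = 'a'; // touch first and last byte
free( big ); // expected to be munmap-ed
\end{C++}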

The tests are performed in this order:
\begin{enumerate}[leftmargin=18pt,topsep=3pt,itemsep=2pt,parsep=0pt]
\item
@x = malloc( 0 ) / free( x )@:
handles the pathological case of a zero-sized allocation and free.
The POSIX standard allows two meanings for this case: return @NULL@ or a unique pointer, where both can be freed.
The fastest implementation is to return @NULL@, rather than create a fictitious allocation.
However, this overloads the @malloc@ return-value to mean error or a zero-sized allocation.
To comply with the POSIX standard, the check for running out of memory is:
\begin{uC++}
if ( malloc( 0 ) == NULL && errno == ENOMEM ) ... // no memory
\end{uC++}
Unfortunately, most programmers assume @NULL@ means an error, \eg two tests in the SPEC CPU benchmark fail if @NULL@ is returned for a zero-sized allocation.
Hence, returning @NULL@ for a zero-sized allocation is an impractical allocator option.

\item
@free( NULL )@: handles the pathological case of freeing a non-existent or zero-byte allocation.
Non-existent allocations occur as algorithm base-cases, such as an unused pointer set to @NULL@.
Having the allocator ignore this case eliminates checking for an erroneous @free@ call on a @NULL@ value.
This call should be fast.

\item
\label{expS}
@x = malloc( 42 ) / free( x )@:
handles a fixed-sized allocation and free.

\item
@x[0..100) = malloc( 42 ) / free( x[0..100) )@:
handles a group of fixed-sized allocations and group free (see the sketch following this list).

\item
@x[0..1000) = malloc( 42 ) / free( x[0..1000) )@:
handles a larger group of fixed-sized allocations and group free.

\item
@x[0..100) = malloc( 42 ) / free( x(100..0] )@:
handles a group of fixed-sized allocations and group free in reverse order.

\item
\label{expE}
@x[0..1000) = malloc( 42 ) / free( x(1000..0] )@:
handles a larger group of fixed-sized allocations and group free in reverse order.

\item
@x = malloc( [0..100) ) / free( x )@:
handles a variable-sized allocation and free.

\item
@x[0..100) = malloc( [0..100) ) / free( x[0..100) )@:
handles a group of variable-sized allocations and group free.

\item
@x[0..1000) = malloc( [0..1000) ) / free( x[0..1000) )@:
handles a larger group of variable-sized allocations and group free.

\item
@x[0..100) = malloc( [0..100) ) / free( x(100..0] )@:
handles a group of variable-sized allocations and group free in reverse order.

\item
@x[0..1000) = malloc( [0..1000) ) / free( x(1000..0] )@:
handles a larger group of variable-sized allocations and group free in reverse order.
\end{enumerate}
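The group tests have the following form, sketched here for 100 fixed-sized allocations freed in allocation order; the reverse-order variants free from the last element down to the first:
\begin{C++}
void * x[100];
for ( size_t i = 0; i < 100; i += 1 ) { // group of fixed-sized allocations
	x[i] = malloc( 42 );
	((char *)x[i])[0] = ((char *)x[i])[41] = 'a'; // touch first and last byte
}
for ( size_t i = 0; i < 100; i += 1 ) // group free in allocation order
	free( x[i] );
\end{C++}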
Experiments \ref{expS}--\ref{expE} are repeated with a fixed-sized allocation of 1,048,576 bytes, where @M_MMAP_THRESHOLD@ is set to 524,288 to force the use of @mmap@, resulting in 17 experiments.
Because the @mmap@ experiments test the operating-system memory-management rather than the allocators, the variable-sized @mmap@ experiments are deemed unnecessary.
A test with random-sized @sbrk@ allocations @malloc( [0..N) random )@ was performed, but the results are the same as the fixed-sized tests, because all the allocation sizes are quickly accessed over the large number of experiment repetitions.
That is, once the buckets or superblocks for the allocation sizes are created, access order is irrelevant.

Figures~\ref{f:LatencyExpARM}--\ref{f:LatencyExpIntel} show the results of the @sbrk@ and @mmap@ experiments across the seven allocators with parallel scaling.
The average of the N threads is graphed for each experiment and the standard deviation is the error bar.
For the @sbrk@ graphs, a good allocator result should be low (smaller is better), flat across scaling (cores), with no error bars (STD $\approx$ 0) indicating no jitter (pauses) among the threads.
The result patterns across the three hardware architectures are similar, with differences correlating to CPU speed and cache differences.

The key observation across the @sbrk@ graphs is that llheap and mimalloc are always at the bottom (lower is better) and flat with respect to scaling.
The only exception is on the Intel, where all allocators experience similar non-flat behaviour, because of the L3 cache shift at 16 cores.
Some anomalies are tcmalloc and hoard experiencing large jitter (see error bars) and scaling issues in some experiments, which correlates with poorer results;
jemalloc has significant scaling issues for experiments 5, 7, 10, and 12, resulting from large numbers of @futex@ calls, possibly related to @madvise@ for returning storage to the OS;
and glibc and tbbmalloc are often slower than the other allocators (symbols are on top of each other).

The key observation across the @mmap@ graphs is that only three allocators, glibc, llheap, and tbbmalloc, honoured the @mmap@ threshold request (symbols are on top of each other).
The other allocators made no @mmap@ calls, so their results are extremely low.
The exception is hoard, which did make @mmap@ calls that were uncorrelated with @M_MMAP_THRESHOLD@, and had significant jitter due to a large number of @futex@ calls.
For the allocators using @mmap@, there should be some scaling effect as more threads make more system calls.

\begin{figure}
\input{prolog.tex}
\vspace*{-20pt}
\caption{Delay Results, ARM, x-axis is cores, lower is better}
\label{f:LatencyExpARM}
\end{figure}

\begin{figure}
\input{swift.tex}
\vspace*{-20pt}
\caption{Delay Results, AMD, x-axis is cores, lower is better}
\label{f:LatencyExpAMD}
\end{figure}

\begin{figure}
\input{java.tex}
\vspace*{-20pt}
\caption{Delay Results, Intel, x-axis is cores, lower is better}
\label{f:LatencyExpIntel}
\end{figure}

Figures~\ref{f:LatencyResARM}--\ref{f:LatencyResIntel} show a time/space perspective across the entire experiment.
The user, system, and real times, along with the maximum memory usage, are presented for the @sbrk@ and @mmap@ experiments.
The result patterns across the three hardware architectures are similar.
If an allocator disappears in a graph, its result is less than 1 on a logarithmic scale.
Surprisingly, there are large (2 orders of magnitude) time differences among the allocators.

\begin{figure}
\hspace*{15pt}
\input{prolog2.tex}
\vspace*{-20pt}
\caption{Delay Results, ARM, x-axis is cores, lower is better}
\label{f:LatencyResARM}

\hspace*{15pt}
\input{swift2.tex}
\vspace*{-20pt}
\caption{Delay Results, AMD, x-axis is cores, lower is better}
\label{f:LatencyResAMD}

\hspace*{15pt}
\input{java2.tex}
\vspace*{-20pt}
\caption{Delay Results, Intel, x-axis is cores, lower is better}
\label{f:LatencyResIntel}
\end{figure}

For the @sbrk@ graphs, the user time should be high and scale with cores, the system time very low, the real time constant, and the maximum memory should scale with cores.
For user time, llheap and mimalloc are at the bottom (lower is better) and all allocators have linear scaling as cores increase.
The remaining allocators are slower by one to two orders of magnitude, which correlates with high results in the experiments.
For system time, jemalloc has non-trivial system time that scales with cores, caused by a large number of @futex@ calls.
The remaining allocators have virtually zero system time (not on graph).
The exception is a random anomaly where allocators had small amounts of system time, which appeared/disappeared on different experiment runs, as if something (OS?) slightly perturbs the experiment over its 20-hour run.
For real time, llheap and mimalloc take the least overall time and all allocators except jemalloc have flat performance.
For maximum memory, all allocators scale with cores, and there is a rough inverse correlation between user time and memory usage, \ie a time \vs space tradeoff.

For the @mmap@ graphs, where @mmap@ is only used by glibc, llheap, and tbbmalloc, the user time should be low and scale with cores, the system time should be high and scale with cores, the real time constant, and the maximum memory should scale with cores.
For user time, glibc, llheap, and tbbmalloc are at the bottom because there are no @sbrk@ requests.
The remaining allocators all use a non-trivial amount of time handling the large requests, except mimalloc, which handles a large request identically to a small request.
Interestingly, the amount of time varies by one to two orders of magnitude.
For system time, glibc, llheap, and tbbmalloc are at the top because of the OS calls to @mmap@.
Interestingly, the remaining allocators still incur system time varying by orders of magnitude, except mimalloc ($<$ 1 on the logarithmic scale, so invisible).
For real time, all allocators scale linearly with cores, except mimalloc, which is flat.
For maximum memory, all allocators scale with cores, and there is a rough inverse correlation between user time and memory usage, \ie a time \vs space tradeoff.


\subsection{Out of Memory Benchmark}

Figure~\ref{f:OutMemoryBenchmark} shows a \CC program with unbounded memory allocation.
The program is run in a shell with a restricted data size.
Hence, it quickly runs out of memory, causing @malloc@, which is called by \CC @new@, to return a @nullptr@ with @errno@ set to @ENOMEM@.
Routine @new@ sees the @nullptr@ and calls the handler routine set by @set_new_handler@, which prints a message, and resets the default handler to raise the @bad_alloc@ exception caught in the program main.
Note, raising an exception requires dynamic allocation, but \CC preallocates a few special exceptions, like @bad_alloc@, for such cases.

All allocators printed the correct output except hoard, mimalloc, and tcmalloc.
Hoard prints @MAP_FAILED@ and hangs, spinning on a spinlock in a complex call chain.
mimalloc aborts the program because it attempts to raise the @bad_alloc@ exception itself, which works if and only if it is compiled with \CC, whereas it is compiled with C.
The correct design is to return a @nullptr@ with @errno@ set to @ENOMEM@ to \CC @new@, which then raises the exception;
hence, the allocator can be compiled with either C or \CC.
tcmalloc prints the correct output but adds ``allocation failed'' messages.

\begin{figure}
\begin{tabular}{@{\hspace*{\parindentlnth}}l@{\hspace*{2\parindentlnth}}l@{}}
\begin{cfa}
static void handler() {
	cout << "Memory allocation failed\n";
	set_new_handler( nullptr );
}


\end{cfa}
&
\begin{cfa}
int main() {
	set_new_handler( handler );
	try {
		for ( ;; ) pass( new char[50] ); // unbounded allocation
	} catch( const bad_alloc & e ) { cout << e.what() << endl; }
}
\end{cfa}
\end{tabular}
\caption{Out of Memory Benchmark}
\label{f:OutMemoryBenchmark}
\end{figure}


\section{Conclusion}

The goal of this work is to build a full-featured, low-latency (or high-bandwidth) memory allocator for both KT and UT multi-threading systems that is competitive with the best current memory allocators, while extending the feature set of existing and new allocator routines.
The new llheap allocator achieves all of these goals, while maintaining and managing sticky allocation information \emph{without a performance loss}.
Hence, it is possible to use @realloc@ frequently as a safe operation, rather than just occasionally or not at all.
Furthermore, the ability to query sticky properties and other information allows programmers to write safer programs, as it is possible to dynamically match allocation styles from unknown library routines that return allocations.

Extending the C allocation API with @resize@, advanced @realloc@, @aalloc@, @amemalign@, @cmemalign@, and other alignment variations means programmers do not have to generate these allocation operations themselves.
The ability of the type systems in modern languages, \eg \CFA, to condense the allocation API to one routine with completely orthogonal allocation properties shows how far the allocation API can be advanced.
The result is increased safety and a cognitive reduction in performing dynamic allocation.
All of these extensions should eliminate common reasons for C programmers to roll their own memory allocator and/or allocation functions, which is a huge safety advantage.

The ability to compile llheap with static/dynamic linking and optional statistics/debugging provides programmers with multiple mechanisms to balance performance and safety.
These allocator versions are easy to use because they can be linked to an application without recompilation.
Providing comprehensive statistics for all allocation operations is invaluable in understanding and debugging a program's dynamic behaviour.
No other memory allocator provides such comprehensive statistics gathering.
This capability was used extensively during the development of llheap to verify its behaviour, and to verify the benchmarks developed for the paper.
As well, the debugging mode, where allocations are checked along with internal pre/post-conditions and invariants, is extremely useful, especially for students ($\approx$1,000 students have tested the \uC version of llheap).
While not as powerful as the @valgrind@ interpreter, llheap's debugging mode can detect a large number of allocation mistakes.
The contention-free statistics gathering and debugging have a low enough cost to be used in production code.
Finally, no other memory allocator addresses the needs of user-level threading, which is now available in many modern languages.

Creating a benchmark test-suite for comparing allocators, rather than relying on a suite of arbitrary programs, has been an interesting challenge.
The purpose of these performance tests is not to pick winners and losers among the allocators, because each allocator optimizes a particular set of allocation patterns: there is no optimal memory-allocator.
The goal is to demonstrate that llheap's performance, both in time and space, across some interesting allocation patterns, is comparable to the best allocators in use today.
Admittedly, there are pathological cases where llheap might use significant amounts of memory because it never coalesces or returns storage to the OS.
These pathological cases do not correlate with long-running applications, where llheap can perform very well.
In the small set of tested benchmarks, no heap blowup was observed, while some tests caused time blowups in other allocators.
Therefore, llheap is a viable drop-in replacement for many applications, and its ancillary features make it safer and more informative.


\subsection{Recommendations}

Substantial work has been put into building a new allocator and benchmarks, plus doing comprehensive performance tests among allocators.
Based on this work, we make two recommendations:
\begin{enumerate}[leftmargin=*, topsep=0pt,itemsep=0pt,parsep=0pt]
\item
Hoard is no longer maintained and did not do well (even broke) in some performance experiments.
We recommend that those doing memory-allocation research not use it.
\item
glibc did not perform as well as other allocators.
Given it is the default memory allocator for many academic and industry applications, this is unfortunate and skews performance results, so developers may draw incorrect conclusions.
As such, we recommend the adoption of a newer memory allocator for glibc.
We offer llheap for the reasons given above, but most importantly, for its small code base.
glibc maintainers come and go.
Therefore, it is crucial for a new maintainer to onboard quickly and have a thorough understanding of the code base within a month.
The llheap code base is small and can be learned quickly because of its simple design, making it an ideal choice as a substitute allocator.
\end{enumerate}


\section{Acknowledgements}

This research is funded by the NSERC/Waterloo-Huawei (\url{http://www.huawei.com}) Joint Innovation Lab. %, and Peter Buhr is partially funded by the Natural Sciences and Engineering Research Council of Canada.
% Special thanks to Trevor Brown for many helpful discussions.

\bibliographystyle{ACM-Reference-Format}
\bibliography{pl,local}

\end{document}
\endinput

% Local Variables: %
% tab-width: 4 %
% fill-column: 120 %
% compile-command: "make" %
% End: %