% Type: Paper
%
% Abstract
%
% A new C-based concurrent memory-allocator is presented, called llheap (ll => low latency). It supports C/C++ applications with multiple kernel threads, or it can be embedded into user-threading runtime-systems. llheap extends the C allocation API with new functions providing orthogonal access to allocation features; hence, programmers do not have to code missing combinations. llheap also extends the C allocation semantics by remembering multiple aspects of the initial allocation. These properties can be queried, allowing programmers to write safer programs by preserving these properties in future allocations. As well, realloc/reallocarray preserve initial zero-fill and alignment properties when adjusting storage size, again increasing future allocation safety. The allocator provides a contention-free statistics gathering mode, and a debugging mode for dynamically checking allocation pre/post conditions and invariants. These modes are invaluable for understanding and debugging a program's dynamic allocation behaviour, with low enough cost to be used in production code. An example is presented for condensing the allocation API using advanced type-systems, providing a single type-safe allocation routine using named arguments. Finally, performance results across a number of benchmarks show llheap is competitive with other modern memory allocators.
%
% Upload: llheap.pdf
%
% Computing Classification Systems
%
% Add
% 500 Software and its engineering > Software libraries and repositories
% Add
% 300 Computing methodologies > Concurrent programming languages
%
% Authors, submitter has to have an orcid
%
% Details & Comments
%
% cover letter
%
% Funding
% yes
% Government of Canada >
% Natural Sciences and Engineering Research Council of Canada
%
% Electronic Supplementary Materials No
% Are you submitting a conference paper extension: No
% X ACM uses CrossCheck, an automated service that checks for plagiarism. Any submission to ACM is subject to such a check. Confirm that you are familiar with the ACM Plagiarism Policy.
% To confirm that you have reviewed all title, author, and affiliation information in the submission form and the manuscript for accuracy, and approve its exact use in the final, published article, please check the box to the right. X

\documentclass[manuscript,screen,review]{acmart}

% Latex packages used in the document.

\usepackage{comment}
\usepackage{epic,eepic}
\usepackage{upquote} % switch curled `'" to straight
\usepackage{relsize}
\usepackage{xspace}
\usepackage{xcolor}
\usepackage{calc}
\usepackage{algorithm}
\usepackage{algorithmic}
\usepackage{enumitem}
\usepackage{tabularx} % allows \lstMakeShortInline@
\usepackage[scaled=0.88]{helvet} % decent Helvetica font, scaled to Times size
\usepackage[T1]{fontenc}
\usepackage{listings} % format program code
\usepackage[labelformat=simple,aboveskip=0pt,farskip=0pt,font={rm,md,up}]{subfig}
\renewcommand{\thesubfigure}{(\alph{subfigure})}

\hypersetup{breaklinks=true}
\usepackage{breakurl}

% \usepackage[pagewise]{lineno}
% \renewcommand{\linenumberfont}{\scriptsize\sffamily}

\usepackage{varioref} % extended references
% adjust varioref package with default "section" and "page" titles, and optional title with faraway page numbers
% \VRef{label} => Section 2.7, \VPageref{label} => page 17
% \VRef[Figure]{label} => Figure 3.4, \VPageref{label} => page 17
% \renewcommand{\reftextfaceafter}{\unskip}
% \renewcommand{\reftextfacebefore}{\unskip}
% \renewcommand{\reftextafter}{\unskip}
% \renewcommand{\reftextbefore}{\unskip}
% \renewcommand{\reftextfaraway}[1]{\unskip, p.~\pageref{#1}}
% \renewcommand{\reftextpagerange}[2]{\unskip, pp.~\pageref{#1}--\pageref{#2}}
% \newcommand{\VRef}[2][Section]{\ifx#1\@empty\else{#1}\nobreakspace\fi\vref{#2}}
% \newcommand{\VRefrange}[3][Sections]{\ifx#1\@empty\else{#1}\nobreakspace\fi\vrefrange{#2}{#3}}
% \newcommand{\VPageref}[2][page]{\ifx#1\@empty\else{#1}\nobreakspace\fi\pageref{#2}}
% \newcommand{\VPagerefrange}[3][pages]{\ifx#1\@empty\else{#1}\nobreakspace\fi\pageref{#2}{#3}}

\makeatletter
\newcommand{\abbrevFont}{\textit} % set empty for no italics
\newcommand{\CheckCommaColon}{\@ifnextchar{,}{}{\@ifnextchar{:}{}{,\xspace}}}
\newcommand{\CheckPeriod}{\@ifnextchar{.}{}{.\xspace}}
\newcommand{\EG}{\abbrevFont{e}.\abbrevFont{g}.}
\newcommand{\eg}{\EG\CheckCommaColon}
\newcommand{\IE}{\abbrevFont{i}.\abbrevFont{e}.}
\newcommand{\ie}{\IE\CheckCommaColon}
\newcommand{\ETC}{\abbrevFont{etc}}
\newcommand{\etc}{\ETC\CheckPeriod}
\newcommand{\VS}{\abbrevFont{vs}}
\newcommand{\vs}{\VS\CheckPeriod}

\newcommand{\newtermFont}{\emph}
\newcommand{\newterm}[1]{\newtermFont{#1}}

\newcommand{\CFAIcon}{\textsf{C}\raisebox{\depth}{\rotatebox{180}{\textsf{A}}}\xspace} % Cforall symbolic name
\newcommand{\CFA}{\protect\CFAIcon} % safe for section/caption
\newcommand{\CFL}{\textrm{Cforall}\xspace} % Cforall symbolic name
\newcommand{\CCIcon}{\textrm{C}\kern-.1em\hbox{+\kern-.25em+}} % C++ icon
\newcommand{\CC}[1][]{\protect\CCIcon{#1}\xspace} % C++ symbolic name
\newcommand{\uC}{$\mu$\CC}
\newcommand{\Csharp}{C\raisebox{-0.7ex}{\relsize{2}$^\sharp$}\xspace} % C# symbolic name

\newcommand{\LstBasicStyle}[1]{{\lst@basicstyle{#1}}}
\newcommand{\LstKeywordStyle}[1]{{\lst@basicstyle{\lst@keywordstyle{#1}}}}
\newcommand{\LstCommentStyle}[1]{{\lst@basicstyle{\lst@commentstyle{#1}}}}
\newcommand{\LstStringStyle}[1]{{\lst@basicstyle{\lst@stringstyle{#1}}}}

\newlength{\parindentlnth}
\setlength{\parindentlnth}{\parindent}
\newlength{\gcolumnposn} % temporary hack because lstlisting does not handle tabs correctly
\newlength{\columnposn}
\setlength{\gcolumnposn}{3.25in}
\setlength{\columnposn}{\gcolumnposn}
\renewcommand{\C}[2][\@empty]{\ifx#1\@empty\else\global\setlength{\columnposn}{#1}\global\columnposn=\columnposn\fi\hfill\makebox[\textwidth-\columnposn][l]{\lst@basicstyle{\LstCommentStyle{#2}}}}
\newcommand{\CRT}{\global\columnposn=\gcolumnposn}
\makeatother

\lstset{
columns=fullflexible,
basicstyle=\linespread{0.9}\sf, % reduce line spacing and use sanserif font
stringstyle=\fontsize{9}{9}\selectfont\tt, % use typewriter font
tabsize=5, % N space tabbing
xleftmargin=\parindentlnth, % indent code to paragraph indentation
escapechar=\$, % LaTeX escape in CFA code
mathescape=false, % disable LaTeX math escape in CFA code $...$
keepspaces=true, %
showstringspaces=false, % do not show spaces with cup
showlines=true, % show blank lines at end of code
aboveskip=4pt, % spacing above/below code block
belowskip=2pt,
numberstyle=\footnotesize\sf, % numbering style
moredelim=**[is][\color{red}]{@}{@},
% replace/adjust listing characters that look bad in sanserif
literate=
% {-}{\makebox[1ex][c]{\raisebox{0.4ex}{\rule{0.75ex}{0.1ex}}}}1
{-}{\raisebox{0pt}{\ttfamily-}}1
{^}{\raisebox{0.6ex}{\(\scriptstyle\land\,\)}}1
{~}{\raisebox{0.3ex}{\(\scriptstyle\sim\,\)}}1
% {'}{\ttfamily'\hspace*{-0.4ex}}1
{`}{\raisebox{-2pt}{\large\textasciigrave\hspace{-1pt}}}1
{<-}{$\leftarrow$}2
{=>}{$\Rightarrow$}2
% {->}{\raisebox{-1pt}{\texttt{-}}\kern-0.1ex\textgreater}2,
}% lstset

% CFA programming language, based on ANSI C (with some gcc additions)
\lstdefinelanguage{CFA}[ANSI]{C}{
morekeywords={
_Alignas, _Alignof, __alignof, __alignof__, and, asm, __asm, __asm__, _Atomic, __attribute, __attribute__,
__auto_type, basetypeof, _Bool, catch, catchResume, choose, coerce, _Complex, __complex, __complex__, __const, __const__,
coroutine, _Decimal32, _Decimal64, _Decimal128, disable, enable, exception, __extension__, fallthrough, finally, fixup,
__float80, float80, __float128, float128, _Float16, _Float32, _Float32x, _Float64, _Float64x, _Float128, _Float128x,
forall, fortran, generator, _Generic, _Imaginary, __imag, __imag__, inline, __inline, __inline__, int128, __int128, __int128_t,
__label__, monitor, mutex, _Noreturn, __builtin_offsetof, one_t, or, recover, report, restrict, __restrict, __restrict__,
__signed, __signed__, _Static_assert, suspend, thread, __thread, _Thread_local, throw, throwResume, timeout, trait, try,
typeof, __typeof, __typeof__, typeid, __uint128_t, __builtin_va_arg, __builtin_va_list, virtual, __volatile, __volatile__,
vtable, waitfor, waituntil, when, with, zero_t,
},
moredirectives={defined,include_next},
}

% uC++ programming language, based on ANSI C++
\lstdefinelanguage{uC++}[GNU]{C++}{
morekeywords={
_Accept, _AcceptReturn, _AcceptWait, _Actor, _At, _Catch, _CatchResume, _CorActor, _Cormonitor, _Coroutine,
_Disable, _Else, _Enable, _Event, _Exception, _Finally, _Monitor, _Mutex, _Nomutex, _PeriodicTask, _RealTimeTask,
_Resume, _ResumeTop, _Select, _SporadicTask, _Task, _Timeout, _When, _With, _Throw},
}

% Go programming language: https://github.com/julienc91/listings-golang/blob/master/listings-golang.sty
\lstdefinelanguage{Golang}{
morekeywords=[1]{package,import,func,type,struct,return,defer,panic,recover,select,var,const,iota,},
morekeywords=[2]{string,uint,uint8,uint16,uint32,uint64,int,int8,int16,int32,int64,
bool,float32,float64,complex64,complex128,byte,rune,uintptr, error,interface},
morekeywords=[3]{map,slice,make,new,nil,len,cap,copy,close,true,false,delete,append,real,imag,complex,chan,},
morekeywords=[4]{for,break,continue,range,goto,switch,case,fallthrough,if,else,default,},
morekeywords=[5]{Println,Printf,Error,},
sensitive=true,
morecomment=[l]{//},
morecomment=[s]{/*}{*/},
morestring=[b]',
morestring=[b]",
morestring=[s]{`}{`},
}

\lstnewenvironment{cfa}[1][]{\lstset{language=CFA,moredelim=**[is][\protect\color{red}]{@}{@}}\lstset{#1}}{}
\lstnewenvironment{C++}[1][]{\lstset{language=C++,moredelim=**[is][\protect\color{red}]{@}{@}}\lstset{#1}}{}
\lstnewenvironment{uC++}[1][]{\lstset{language=uC++,moredelim=**[is][\protect\color{red}]{@}{@}}\lstset{#1}}{}
\lstnewenvironment{Go}[1][]{\lstset{language=Golang,moredelim=**[is][\protect\color{red}]{@}{@}}\lstset{#1}}{}
\lstnewenvironment{python}[1][]{\lstset{language=python,moredelim=**[is][\protect\color{red}]{@}{@}}\lstset{#1}}{}
\lstnewenvironment{java}[1][]{\lstset{language=java,moredelim=**[is][\protect\color{red}]{@}{@}}\lstset{#1}}{}

\newsavebox{\myboxA}
\newsavebox{\myboxB}
\newsavebox{\myboxC}
\newsavebox{\myboxD}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\title{High-Performance Concurrent Memory Allocation}

\author{Mubeen Zulfiqar}
\email{m3zulfiq@uwaterloo.ca}
\author{Ayelet Wasik}
\email{aisraeli@plg.uwaterloo.ca}
\author{Peter A. Buhr}
\email{pabuhr@uwaterloo.ca}
\orcid{0000-0003-3747-9281}
\affiliation{%
\institution{University of Waterloo}
\city{Waterloo}
\state{Ontario}
\country{Canada}
}
\author{Dave Dice}
\email{dave.dice@oracle.com}
\orcid{0000-0001-9164-7747}
\affiliation{%
\institution{Oracle Labs}
\city{Burlington}
\state{Massachusetts}
\country{USA}
}
\author{Bryan Chan}
\email{bryan.chan@huawei.com}
\affiliation{%
\institution{Huawei Compiler Lab}
\city{Markham}
\state{Ontario}
\country{Canada}
}

\renewcommand{\shortauthors}{Zulfiqar et al.}

% inline code @...@
\lstMakeShortInline@%

\begin{document}

\begin{abstract}
A new C-based concurrent memory-allocator is presented, called llheap (ll $\Rightarrow$ low latency).
It supports C/\CC applications with multiple kernel threads, or it can be embedded into user-threading runtime-systems.
llheap extends the C allocation API with new functions providing orthogonal access to allocation features;
hence, programmers do not have to code missing combinations.
llheap also extends the C allocation semantics by remembering multiple aspects of the initial allocation.
These properties can be queried, allowing programmers to write safer programs by preserving these properties in future allocations.
As well, \lstinline{realloc}/\lstinline{reallocarray} preserve initial zero-fill and alignment properties when adjusting storage size, again increasing future allocation safety.
The allocator provides a contention-free statistics-gathering mode, and a debugging mode for dynamically checking allocation pre/post conditions and invariants.
These modes are invaluable for understanding and debugging a program's dynamic allocation behaviour, with low enough cost to be used in production code.
An example is presented for condensing the allocation API using advanced type-systems, providing a single type-safe allocation routine using named arguments.
Finally, performance results across a number of benchmarks show llheap is competitive with other modern memory allocators.
\end{abstract}

\begin{CCSXML}
<ccs2012>
<concept>
<concept_id>10011007.10011006.10011072</concept_id>
<concept_desc>Software and its engineering~Software libraries and repositories</concept_desc>
<concept_significance>500</concept_significance>
</concept>
<concept>
<concept_id>10010147.10011777.10011014</concept_id>
<concept_desc>Computing methodologies~Concurrent programming languages</concept_desc>
<concept_significance>300</concept_significance>
</concept>
</ccs2012>
\end{CCSXML}

\ccsdesc[500]{Software and its engineering~Software libraries and repositories}
\ccsdesc[300]{Computing methodologies~Concurrent programming languages}

\keywords{memory allocation, (user-level) concurrency, type-safety, statistics, debugging, high performance}

\received{20 February 2007}
\received[revised]{12 March 2009}
\received[accepted]{5 June 2009}


\maketitle

\section{Introduction}

Memory management services a series of program allocation/deallocation requests and attempts to satisfy them from variable-sized blocks of memory while minimizing total memory usage.
A general-purpose memory allocator cannot anticipate storage requests, so its time and space performance cannot be optimal (bin packing).
Each allocator takes advantage of a subset of typical allocation patterns (heuristics) to produce excellent results, both in time and space (similar to LRU paging).
Nevertheless, allocators are a series of compromises, possibly with static or dynamic tuning parameters to optimize specific request patterns.


\subsection{Memory Structure}
\label{s:MemoryStructure}

Figure~\ref{f:ProgramAddressSpace} shows the typical layout of a program's address space (high addresses to low) divided into a number of zones, with free memory surrounding the dynamic code/data~\cite{memlayout}.
Static code and data are placed into memory at load time from the executable and are fixed-sized at runtime.
Dynamic code/data memory is managed by the dynamic loader for libraries loaded at runtime, which is complex, especially in a multi-threaded program~\cite{Huang06}.
However, changes to the dynamic code/data space are typically infrequent, with most occurring at program startup, and are largely outside of a program's control.
Stack memory is managed by the program call/return mechanism using a LIFO technique.
For stackful coroutines and user threads, new stacks are commonly created in the dynamic-allocation memory.
The dynamic-allocation memory is often a contiguous area, which starts empty and grows/shrinks as the program creates/deletes variables with independent lifetime.
The language's runtime manages this area, where management complexity is a function of the mechanism for deleting variables.

\begin{figure}
\centering
\input{AddressSpace.pstex_t}
\vspace{-5pt}
\caption{Program Address Space Divided into Zones}
\label{f:ProgramAddressSpace}
\end{figure}


\subsection{Dynamic Memory-Management}
\label{s:DynamicMemoryManagement}

Modern programming languages provide two forms of storage management: managed or unmanaged.
Both forms have explicit allocation, but managed memory has implicit deallocation (garbage collection~\cite{Wilson92}, GC) and unmanaged memory has some form of explicit deallocation.
Managed languages sometimes provide explicit deallocation hints.
Both forms attempt to reuse freed storage in the heap for new allocations.
Unmanaged languages have no information about allocated \newterm{objects}, and hence, use techniques during freeing to detect adjacent unused storage when coalescing.
Conservative GC attempts to find free objects in an unmanaged system by scanning memory and marking anything that \emph{looks} like a live object.
However, \emph{conservative} means some non-objects might be marked as live;
the goal is not to miss any live objects.
Managed languages maintain sufficient information to locate all live objects.
Precise GC is then able to mark just the live objects.
Both approaches then sweep through the unmarked objects looking for adjacent free storage to coalesce.
Precise GC has a further coalescing option of compacting used objects and adjusting the pointers used to find them to the new locations, resulting in a large area of contiguous free storage.
Languages such as Lisp~\cite{CommonLisp}, Java~\cite{Java}, Haskell~\cite{Haskell}, and Go~\cite{Go} are managed and normally implemented using precise GC.
(Both Go~\cite{Go1.3} and Netscape JavaScript~\cite{JavaScriptGC} switched from conservative to precise GC.)
Languages such as C~\cite{C}, \CC~\cite{C++}, Rust~\cite{Rust}, and Swift~\cite{swift} (because of explicit management of weak references) are unmanaged but could be used with conservative GC~\cite{Boehm88}.
This work only examines unmanaged memory with \emph{explicit} deallocation.
% While GC is not part of this work, some of the results are applicable to the allocation phase in any memory-management approach.

Most programs use a general-purpose allocator, usually the one provided by the programming-language's runtime.
In certain languages, programmers can write specialized allocators for specific needs.
POSIX~\cite{POSIX17} provides for replacement of the default memory allocator in C and \CC through a standard API.
Most industry JVMs provide multiple GCs, from which a user selects one for their workload.
%Jikes RVM MMTk~\cite{MMTk} provides a similar generalization for the Java virtual machine.
As well, new languages support concurrency (kernel/user threading), which must be safely handled by the allocator.
Hence, alternative allocators exist for C/\CC with the goal of scaling in multi-threaded programs~\cite{Berger00,mtmalloc,streamflow,tcmalloc}.
This work examines the design of high-performance allocators for use by kernel and user multi-threaded applications written in C/\CC.



\subsection{Contributions}
\label{s:Contributions}

This work provides the following contributions to the area of explicit concurrent dynamic-allocation.
\begin{enumerate}[leftmargin=18pt,topsep=3pt,itemsep=0pt]
\item
Implementation of a new stand-alone concurrent low-latency memory-allocator, called llheap~\cite{llheap}, ($\approx$1,500 lines of code) for C/\CC programs using kernel threads (1:1 threading), and specialized versions for the concurrent languages \uC~\cite{uC++} and \CFA~\cite{Moss18,Delisle21} using user-level threads running on multiple kernel threads (M:N threading).

\item
Extend the C allocation API with new functions @aalloc@, @amemalign@, @cmemalign@, @resize@, @aligned_resize@, @aligned_realloc@, and @aligned_reallocarray@ to make allocation properties orthogonally accessible.

\item
Extend the C allocation semantics by preserving with each allocation its request size, the amount allocated, whether it is zero fill, and its alignment.

\item
Provide additional query operations @malloc_alignment@, @malloc_zero_fill@, and @malloc_size@ to access allocation information.

\item
Use the preserved zero fill and alignment as \emph{sticky} properties for @realloc@ and @reallocarray@ to zero-fill and align when storage is extended or copied.
Without this extension, it is unsafe to @realloc@ storage if the properties are not preserved when copying.
This silent problem is unintuitive to programmers and difficult to locate because it is transient.

\item
Provide optional extensive, fast, and contention-free allocation statistics to understand allocation behaviour.

\item
Provide runtime checks to validate allocation operations and identify the amount of unfreed storage at program termination.

\item
Build 8 different versions of the allocator: static or dynamic linking, with or without statistics or debugging.
A program may link to any of these 8 versions of the allocator often without recompilation (linking or @LD_PRELOAD@).

\item
Demonstrate how advanced programming-language type-systems can condense the allocation API providing a single type-safe allocation function using named arguments.

\item
Create a benchmark test-suite for comparing allocators, rather than relying on a suite of arbitrary programs.

\item
Run performance experiments using the new benchmark test-suite comparing llheap with six of the best allocators in use today.
The goal is to demonstrate that llheap's performance, both in time and space, is comparable to the best allocators in use today.
\end{enumerate}
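
To make the sticky zero-fill contribution concrete, the following sketch shows a hypothetical wrapper, @realloc_zeroed@, that preserves zero fill across a resize by clearing any extension bytes. It only illustrates the semantics; llheap does not require the caller to pass the old size, because it remembers the allocation's zero-fill flag and request size internally.

```c
#include <stdlib.h>
#include <string.h>
#include <assert.h>

// Hypothetical illustration (not llheap's implementation): preserve the
// zero-fill property across a resize by clearing the extension bytes.
void *realloc_zeroed( void *old, size_t old_size, size_t new_size ) {
	void *new = realloc( old, new_size );		// may move the object
	if ( new != NULL && new_size > old_size )
		memset( (char *)new + old_size, 0, new_size - old_size ); // zero the extension
	return new;
}
```

Without such preservation, extending a @calloc@'d buffer with plain @realloc@ silently leaves the extension bytes uninitialized, which is the transient failure described above.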


\section{Background}

The following is an overview of allocator design options that affect memory usage and performance (see~\cite{Zulfiqar22} for more details).
Dynamic acquires and releases obtain \newterm{object} storage via calls such as @malloc@/@new@ and @free@/@delete@ in C/\CC, respectively.
A \newterm{memory allocator} contains a complex data-structure and code that manages the layout of objects in the dynamic-allocation zone.
% The management goals are to make allocation/deallocation operations as fast as possible while densely packing objects to make efficient use of memory.
Since objects in C/\CC cannot be moved, only adjacent free storage can be \newterm{coalesced} into larger free areas.
The allocator grows or shrinks the dynamic-allocation zone to obtain storage for objects and reduce memory usage using \newterm{operating system} (OS) calls, such as @mmap@ or @sbrk@ in UNIX.


\vspace*{-7pt}
\subsection{Allocator Components}
\label{s:AllocatorComponents}

Figure~\ref{f:AllocatorComponents} shows the two important data components for a memory allocator, management and storage, collectively called the \newterm{heap}.
The \newterm{management data} is a data structure located at a known memory address and contains fixed-sized information in the static-data memory that references components in the dynamic-allocation memory.
For multi-threaded programs, additional management data may exist in \newterm{thread-local storage} (TLS) for each kernel thread executing the program.
The \newterm{storage data} is composed of allocated/freed objects, and \newterm{reserved memory}.
Allocated objects (white) are variable sized, and are allocated and maintained by the program;
\ie only the program knows the location of allocated storage.
Freed objects (light grey) represent memory deallocated by the program, which are linked into one or more lists facilitating location for new allocations.
Reserved memory (dark grey) is one or more blocks of memory obtained from the OS but not yet used by the program;
if there are multiple reserved blocks, they are normally linked together.

\begin{figure}
\centering
\input{AllocatorComponents}
\caption{Allocator Components (Heap)}
\label{f:AllocatorComponents}
\end{figure}

In many allocator designs, allocated objects and reserved blocks have management data embedded within them (see also Section~\ref{s:ObjectContainers}).
Figure~\ref{f:AllocatedObject} shows an allocated object with a header, trailer, and optional spacing around the object.
The header contains information about the object, \eg size, type, \etc.
The trailer may be used to simplify coalescing and/or for safety purposes to mark the end of an object.
An object may be preceded by padding to ensure proper alignment.
Some algorithms quantize allocation requests, resulting in additional space after an object.
When padding and spacing are necessary, neither can be used to satisfy a future allocation request while the current allocation exists.
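
The header/padding arithmetic can be made concrete with a small sketch: given a raw block address, a header size, and a power-of-two alignment requirement, it computes where the payload starts and hence how much storage the header plus padding consume (illustrative only; field layouts vary by allocator).

```c
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

// Round an address up to the next multiple of align (a power of two).
size_t align_up( size_t addr, size_t align ) {
	return ( addr + align - 1 ) & ~( align - 1 );
}

// Offset from the block start to an aligned payload placed after a header;
// the difference beyond header_size is padding (internal fragmentation).
size_t payload_offset( uintptr_t block, size_t header_size, size_t align ) {
	uintptr_t payload = align_up( block + header_size, align );
	return payload - block;
}
```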

A free object often contains management data, \eg size, pointers, \etc.
Often the free list is linked internally so it does not consume additional storage, \ie the link fields are placed at known locations in the unused memory blocks.
For internal linking, the amount of management data for a free node defines the minimum allocation size, \eg if 16 bytes are needed for a free-list node, allocation requests less than 16 bytes are rounded up.
Often the minimum storage alignment and free-node size are the same.
The information in an allocated or freed object is overwritten when it transitions from allocated to freed and vice versa, by new program data or management information, respectively.
For safety purposes, freed storage may be scrubbed (overwritten) to expose inadvertent bugs, such as assuming variables are zero initialized.
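
Internal linking can be sketched as follows: the link field is stored in the freed block's own bytes, so the free list consumes no extra storage, and @sizeof(FreeNode)@ bounds the minimum allocation size (a simplified single list, not llheap's structure).

```c
#include <stddef.h>
#include <assert.h>

// Sketch of internal (intrusive) free-list linking: the link lives inside
// the freed block itself.  No allocation can be smaller than sizeof(FreeNode).
typedef struct FreeNode {
	struct FreeNode *next;
} FreeNode;

static FreeNode *free_list = NULL;

void push_free( void *block ) {			// block must be >= sizeof(FreeNode)
	FreeNode *node = (FreeNode *)block;	// reuse the block's own bytes as the link
	node->next = free_list;
	free_list = node;
}

void *pop_free( void ) {
	if ( free_list == NULL ) return NULL;
	FreeNode *node = free_list;
	free_list = node->next;
	return node;						// link is overwritten by new program data
}
```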

\begin{figure}
\centering
\input{AllocatedObject}
\caption{Allocated Object}
\label{f:AllocatedObject}

\bigskip

\input{IntExtFragmentation}
\caption{Internal and External Fragmentation}
\label{f:InternalExternalFragmentation}
\end{figure}


\subsection{Single-Threaded Memory-Allocator}
\label{s:SingleThreadedMemoryAllocator}

In a sequential (single-threaded) program, the program thread performs all allocation operations without direct concurrency issues.
However, interrupts introduce indirect concurrency if a signal handler performs allocation/deallocation (the serially-reusable problem~\cite{SeriallyReusable}).
In general, the primary issues in a single-threaded allocator are fragmentation and locality.


\subsubsection{Fragmentation}
\label{s:Fragmentation}

Fragmentation is unused memory requested from the OS.
Figure~\ref{f:InternalExternalFragmentation} shows fragmentation has two forms: \emph{internal} or \emph{external}.

\newterm{Internal fragmentation} is inaccessible \emph{allocated} memory, such as headers, trailers, \etc.
Internal fragmentation is problematic when management space approaches the object size, \eg for objects $<$16 bytes, memory usage doubles.

\newterm{External fragmentation} is memory not allocated in the program~\cite{Wilson95,Lim98,Siebert00}, which includes all external management data, freed objects, and reserved memory.
This memory is problematic because it can result in heap blowup and fragmented memory.
\newterm{Blowup} occurs when freed memory becomes a checkerboard of adjacent allocated and free areas, where the free blocks are too small to service requests, leading to unbounded external fragmentation growth~\cite{Berger00}.
Heap blowup is a fundamental problem in unmanaged languages without compaction.

Three basic approaches for controlling fragmentation are identified~\cite{Johnstone99}.
The first approach is \newterm{sequential-fit}, with a list of free objects (possibly ordered by size) that is searched for a block large enough to fit a requested object.
Different search policies determine the free object selected, \eg the first free object large enough (first fit) or the one closest to the requested size (best fit).
Any storage larger than the request can become spacing after the object or be split into a smaller free object.

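A first-fit search over such a free list can be sketched as follows; the node layout and block sizes are illustrative, not taken from any particular allocator.

```c
#include <stddef.h>

// Node of a sequential-fit free list; each free block records its size.
struct free_block { size_t size; struct free_block * next; };

// First fit: return the first free block large enough for the request;
// best fit would instead scan the entire list for the closest size.
static struct free_block * first_fit( struct free_block * head, size_t request ) {
	for ( ; head != NULL; head = head->next )
		if ( head->size >= request ) return head;
	return NULL;								// no block fits
}
```

A best-fit variant trades a longer search for less leftover storage per allocation.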
The second approach is \newterm{segregation} or \newterm{binning}, with a set of lists for different sized freed objects.
The request size is rounded up to the nearest bin size, often leading to internal fragmentation after the object.
A binning algorithm searches for the smallest bin that covers the request, and selects the first free object, if available.
Fewer bin sizes mean more internal fragmentation but increased reuse, as more request sizes match a bin size.
More bin sizes mean less internal fragmentation but more external fragmentation, as larger bins cannot service smaller requests.
Allowing larger bins to service smaller allocations means a freed object can be returned to the matching or a larger bin (there are advantages to either scheme).

The third approach is \newterm{splitting} and \newterm{coalescing}.
If there is no matching free storage for an allocation, a larger free object is split to satisfy the allocation and the smaller remainder is put back on the free list.
For example, in the \newterm{buddy system}, a block of free memory is split into equal chunks; splitting continues until a minimal block is created that fits the allocation.
When an object is deallocated, it is coalesced with the objects immediately before/after it in memory, if they are free, creating a larger block.
Coalescing can be done eagerly at each deallocation or lazily when an allocation cannot be fulfilled.
However, coalescing increases latency (unbounded delays) for both allocation and deallocation.
While coalescing does not reduce external fragmentation, the coalesced blocks improve fragmentation quality, so future allocations are less likely to cause heap blowup.


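The buddy-system address arithmetic can be sketched as follows; it assumes power-of-2 block sizes with each block aligned on its own size, so a block's buddy is found by flipping one bit of its offset from the arena base (the function names are hypothetical):

```c
#include <stdint.h>

// Buddy of a block at 'offset' from the arena base: with power-of-2 sizes
// and size-aligned blocks, flipping the size bit yields the sibling block.
static uintptr_t buddy_of( uintptr_t offset, uintptr_t block_size ) {
	return offset ^ block_size;
}

// Number of splits to reduce a power-of-2 block of 'size' down to 'want'.
static int splits_needed( uintptr_t size, uintptr_t want ) {
	int n = 0;
	for ( ; size > want; size >>= 1 ) n += 1;
	return n;
}
```

The XOR identity is what makes coalescing cheap in a buddy allocator: the candidate partner address is computed in one instruction, at the cost of rounding all requests to powers of 2.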
\subsubsection{Locality}
\label{s:Locality}

The principle of locality recognizes that programs tend to reference a small set of data, called a \newterm{working set}, for a certain period of time, composed of temporal and spatial accesses~\cite{Denning05}.
% Temporal clustering implies a group of objects are accessed repeatedly within a short time period, while spatial clustering implies a group of objects physically close together (nearby addresses) are accessed repeatedly within a short time period.
% Temporal locality commonly occurs during an iterative computation with a fixed set of disjoint variables, while spatial locality commonly occurs when traversing an array.
Hardware takes advantage of the working set through multiple levels of caching and paging, \ie the memory hierarchy.
% When an object is accessed, the memory physically located around the object is also cached with the expectation that the current and nearby objects will be referenced within a short period of time.
% For example, entire cache lines are transferred between cache and memory, and entire virtual-memory pages are transferred between memory and disk.
% A program exhibiting good locality has better performance due to fewer cache misses and page faults\footnote{With the advent of large RAM memory, paging is becoming less of an issue in modern programming.}.

Temporal locality is largely controlled by program accesses to variables~\cite{Feng05}.
An allocator has only indirect influence on temporal locality but largely dictates spatial locality.
For temporal locality, an allocator tries to return recently freed storage for new allocations, as this memory is still \emph{warm} in the memory hierarchy.
For spatial locality, an allocator places objects used together close together in memory, so the working set of the program fits into the fewest possible cache lines and pages.
% However, usage patterns are different for every program as is the underlying hardware memory architecture;
% hence, no general-purpose memory-allocator can provide ideal locality for every program on every computer.

An allocator can easily degrade locality by increasing the working set.
For example, it can access an unbounded number of free objects when matching an allocation or coalescing, causing multiple cache or page misses~\cite{Grunwald93}.
An allocator can also spatially separate related data by binning free storage anywhere in memory, so related objects end up highly separated.


\subsection{Multi-Threaded Memory-Allocator}
\label{s:MultiThreadedMemoryAllocator}

In a concurrent program, multiple kernel threads (KT) perform allocations, requiring some form of mutual exclusion.
Along with fragmentation and locality issues, a multi-threaded allocator must deal with false sharing and additional forms of heap blowup.


\subsubsection{Mutual Exclusion}
\label{s:MutualExclusion}

% \newterm{Mutual exclusion} provides sequential access to the shared-management data of the heap.
There are two performance issues for mutual exclusion.
First is the cost of performing atomic instructions every time a shared resource is accessed to provide mutual exclusion.
Solutions using any atomic fence, atomic instruction (lock free), or lock along a fast path, even with zero contention, result in significant slowdown.
Second is \newterm{contention} on simultaneous access, where threads must wait until the resource is released.
Contention can be reduced by:
1) Using multiple fine-grained locks versus a few coarse-grained locks to spread the contention.
2) Using trylock and generating new storage if the lock is busy (classic space versus time tradeoff).
% 3) Using one of the many lock-free approaches for reducing contention on basic data-structure operations~\cite{Fatourou12}.
% However, all approaches have degenerate cases where program contention to the heap is high, which is beyond the allocator's control.
Figure~\ref{f:ThreadHeapRelationship} shows how a multi-threaded allocator reduces contention by subdividing a single heap into multiple heaps.

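Contention-reduction point 2) can be sketched with @pthread_mutex_trylock@; the heap structure, bump region, and fallback signal below are hypothetical stand-ins for allocator internals:

```c
#include <pthread.h>
#include <stddef.h>

// Hypothetical shared heap with a lock-guarded bump region.
struct heap {
	pthread_mutex_t lock;
	size_t reserve_next, reserve_end;
};

// Trylock allocation: if the shared heap lock is busy, signal the caller
// to use fallback storage instead of blocking (space versus time).
static size_t try_alloc( struct heap * h, size_t size, int * fell_back ) {
	if ( pthread_mutex_trylock( &h->lock ) != 0 ) {	// lock busy?
		*fell_back = 1;								// caller generates new storage
		return 0;
	}
	size_t p = 0;
	if ( h->reserve_end - h->reserve_next >= size ) {
		p = h->reserve_next;
		h->reserve_next += size;
	}
	pthread_mutex_unlock( &h->lock );
	*fell_back = 0;
	return p;
}
```

The fallback path allocates fresh storage rather than waiting, so latency stays bounded at the cost of extra memory.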
\begin{figure}
\centering
\subfloat[T:1]{
\input{SingleHeap}
\label{f:SingleHeap}
} % subfloat
\vrule
\subfloat[T:H]{
\input{SharedHeaps}
\label{f:SharedHeaps}
} % subfloat
\vrule
\subfloat[1:1]{
\input{PerThreadHeap}
\label{f:PerThreadHeap}
} % subfloat
\caption{Multiple Heaps, Thread:Heap Relationship}
\label{f:ThreadHeapRelationship}
\end{figure}

\begin{description}[leftmargin=*]
\item[T:1 model (Figure~\ref{f:SingleHeap})] is all threads (T) sharing a single heap (1).
% The arrows indicate memory movement for allocation/deallocation operations.
% Memory is obtained from freed objects, reserved memory, or the OS;
% freed memory can be returned to the OS.
To handle concurrency, a single lock is used for all heap operations, or fine-grained (lock-free) locking if operations can be made independent.
As threads perform large numbers of allocations, a single heap becomes a significant source of contention.

\item[T:H model (Figure~\ref{f:SharedHeaps})] is multiple threads (T) sharing multiple heaps (H).
The allocator allocates/deallocates heaps and assigns threads to heaps, often based on dynamic contention pressure.
While locking is required for heap access, contention is (normally) reduced as access is spread across the heaps.
Locking can be reduced (eliminated) using the T:C variant, \ie each CPU has a heap, and a thread cannot migrate from the CPU if executing an allocator critical-section, implemented with restartable critical sections~\cite{Desnoyers19,Dice02} (see also Section~\ref{s:UserlevelThreadingSupport}).
% The goal is minimal heaps (storage) and contention per heap (time).
Multiple heaps increase external fragmentation as the ratio of heaps to threads increases, which can lead to heap blowup, where the worst-case scenario is more heaps than threads.
The external fragmentation is now multiplied by the number of heaps, since each heap manages its own free storage and allocates its own reserved memory.
When freeing, objects normally need to be returned to their original heap (see Section~\ref{s:Ownership}).
% Returning storage to the OS may be difficult or impossible, \eg the contiguous @sbrk@ area in Unix.
% In the worst case, a program in which objects are allocated from one heap but deallocated to another heap means these freed objects are never reused.

A shared \newterm{global heap} (G) is often introduced to manage the reserved memory among heaps and centralize interactions with the OS.
Instead of heaps making individual object allocations/deallocations through the global heap, resulting in locking and high contention, the global heap partitions the reserved memory into heap (allocation) buffers, which are given out to heaps for their own suballocations.
Hence, a heap's allocations are temporally and spatially accessed densely in a small set of buffers, rather than spread sparsely across the entire reserved memory.
Buffers are allocated at heap startup, after which allocation often reaches a steady state through free lists.
Allocation buffers may increase external fragmentation, since some memory may never be used.

\item[1:1 model (Figure~\ref{f:PerThreadHeap})] is each thread (1) having its own heap (1), eliminating most contention and locking if threads seldom access another thread's heap (see Section~\ref{s:Ownership}).
A thread's objects are consolidated in its heap, better utilizing the cache and paging during thread execution.
In contrast, the T:H model can spread a thread's objects over a larger area in different heaps.
%For example, assume page boundaries coincide with cache line boundaries, if a thread heap always acquires pages of memory then no two threads share a page or cache line unless pointers are passed among them.
When a thread terminates, it can free its heap objects to the global heap, or the thread heap is retained as-is and reused for a new thread in the future.
Destroying a heap can reduce external fragmentation sooner, since all free objects in the global heap are available for immediate reuse.
Alternatively, reusing a heap can aid the inheriting thread, if it has a similar allocation pattern, because the heap is primed with freed storage of the right sizes.
\end{description}


\subsubsection{False Sharing}
\label{s:FalseSharing}

False sharing occurs for a read/write or write/write among threads modifying different memory sharing a cache line~\cite{Bolosky93}.
The write invalidates each thread's cache, even though the threads may be uninterested in the other modified object.
False sharing can occur three ways:
1) Thread T$_1$ allocates objects O$_1$ and O$_2$ on the same cache line and passes O$_2$'s reference to thread T$_2$.
2) Thread T$_1$ allocates object O$_1$ and thread T$_2$ allocates O$_2$, where objects O$_1$ and O$_2$ are on the same cache line.
3) T$_2$ deallocates O$_2$, T$_1$ allocates O$_1$ on the same cache line as O$_2$, and T$_2$ reallocates O$_2$ while T$_1$ is using O$_1$.
In all three cases, the false sharing is hidden and possibly transient (non-deterministic), making it extremely difficult to find and fix.
Case 1) occurs in all three allocator models, and is induced by program behaviour, not the allocator.
Cases 2) and 3) are allocator induced, and occur in the T:1 and T:H models due to heap sharing, but not in 1:1 with private heaps, except possibly at boundary points among heaps.

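The effect can be illustrated with cache-line padding, a common mitigation; the 64-byte line size is an assumption:

```c
#include <stddef.h>

#define CACHE_LINE 64	// assumption: 64-byte cache lines

// Unpadded: both counters share one cache line, so concurrent updates by
// two threads cause false sharing even though the data is disjoint.
struct counters_bad { long c1, c2; };

// Padded: each counter occupies its own cache line, so a write by one
// thread does not invalidate the line holding the other counter.
struct padded_long { long c; char pad[CACHE_LINE - sizeof(long)]; };
struct counters_good { struct padded_long c1, c2; };
```

Padding trades memory (internal fragmentation) for the elimination of allocator-visible false sharing; an allocator achieves the same separation by not handing out objects from the same cache line to different threads.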
\subsubsection{Object Containers}
\label{s:ObjectContainers}

Associating header data with every allocation can result in significant internal fragmentation, as shown in Figure~\ref{f:AllocatedObject}.
While the header and object are spatially together in memory, they are generally not accessed temporally together~\cite{Feng05}.
The result is poor cache usage, since only a portion of the cache line is holding useful data from the program's perspective.
% \eg an object is accessed by the program after it is allocated, while the header is accessed by the allocator after it is free.

\begin{figure}
\centering
\input{Container}
\caption{Object Container}
\label{f:ObjectContainer}
\end{figure}

The alternative approach factors common header data out to a separate location in memory and organizes associated free storage into blocks called \newterm{object containers} (\newterm{superblocks}~\cite[\S~3]{Berger00}) suballocated from a heap's allocation buffers, as in Figure~\ref{f:ObjectContainer}.
A trailer may also be used at the end of the container.
To find the header from an allocation within the container, the container is aligned on a power-of-2 boundary and the lower bits of the object address are truncated (or rounded up, minus the trailer size, to obtain the trailer address).
Container size is a tradeoff between internal and external fragmentation, as some portion of a container may not be used and this portion is unusable for other kinds of allocations.
A consequence of this tradeoff is its effect on spatial locality, which can produce positive or negative results depending on the program's access patterns.
Normally, a heap's ownership extends to its containers.
Without ownership, different objects in a container may be on different heaps' free-lists.
Finally, containers are linked together for management purposes, and should all objects in a container become free, the container can be repurposed for different sized objects or given to another heap through a global heap.


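The header lookup reduces to bit masking; the container size and trailer layout below are assumptions for illustration:

```c
#include <stdint.h>

#define CONTAINER_SIZE ((uintptr_t)64 * 1024)	// assumption: 64K, power of 2

// With containers aligned on a power-of-2 boundary, truncating the low
// bits of any object address inside the container yields the header.
static uintptr_t container_header( uintptr_t obj ) {
	return obj & ~(CONTAINER_SIZE - 1);
}

// The trailer sits at the container's end: round up to the next boundary,
// then step back by the trailer size.
static uintptr_t container_trailer( uintptr_t obj, uintptr_t trailer_size ) {
	return container_header( obj ) + CONTAINER_SIZE - trailer_size;
}
```

Because the lookup is pure address arithmetic, no per-object header access is needed on the deallocation fast path.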
\subsubsection{Ownership}
\label{s:Ownership}

Object \newterm{ownership} is defined as the heap to which an object is returned upon deallocation~\cite[\S~6.1]{Berger00}.
If a thread returns an object to its originating heap, that heap has ownership of its objects.
Containers force ownership of their internal contiguous objects, unless the entire container changes ownership after it becomes empty.
Alternatively, a thread can return an object to the heap it is currently associated with, which can be any heap accessible during a thread's lifetime.
The advantage of ownership is preventing heap blowup by returning storage for reuse by the owner heap.
Ownership prevents the problem of a producer thread allocating from one heap, passing the object to a consumer thread, and the consumer deallocating the object to another heap, hence draining the producer heap of storage.
The disadvantage of ownership is that deallocating to another thread's heap requires an atomic operation.

Object ownership can be immediate or delayed, meaning free objects may be batched on a separate free list either by the returning or receiving thread.
The returning thread batches objects to reduce contention by passing multiple objects at once;
however, batching across multiple allocation sizes and heaps is complex, and there is no obvious time to push a batch back to the owner heap.
It is simpler for the returning thread to immediately push to the receiving thread's batch list, as the receiving thread has better knowledge of when to incorporate the batch list into its free pool.
The receiving thread often delays incorporating returned storage until its local storage is drained.


\subsubsection{User-Level Threading}

Any heap model can be used with user-level (M:N) threading.
However, an important goal of user threads (UT) is fast operations (creation/termination/context-switching) by not interacting with the OS, allowing large numbers of high-performance interacting threads ($>$ 10,000).
In general, UTs use whatever kernel-level heap-model is provided by the language runtime.
Hence, a UT allocates/deallocates from/to the heap of the KT on which it is executing.

However, there is a subtle concurrency problem with user threading and shared heaps.
With kernel threading, an operation started by a KT is always completed by that thread, even if preempted;
hence, any locking correctness associated with the shared heap is preserved.
However, this correctness property is not preserved for user-level threading.
A UT can start an allocation/deallocation on one KT, be preempted by user-level time slicing, and continue running on a different KT to complete the operation~\cite{Dice02}.
When the UT continues on the new KT, it may have pointers into the previous KT's heap and hold locks associated with it.
To get the same KT safety, time slicing must be disabled/\-enabled around these operations to prevent movement.
However, eagerly disabling time slicing on the allocation/deallocation fast path is expensive, especially as preemption is infrequent (millisecond intervals).
Instead, techniques exist to lazily detect this case in the interrupt handler, abort the preemption, and return to the operation so it completes atomically.
Occasionally ignoring a preemption is normally benign;
in the worst case, ignoring preemption results in starvation.
To mitigate starvation, techniques like rolling the preemption forward at the next context switch can be used.


\section{llheap}

This section presents our new stand-alone, concurrent, low-latency memory allocator, called llheap (low-latency heap), fulfilling the GNU C Library allocator API~\cite{GNUallocAPI} for C/\CC programs using KTs, with specialized versions for the programming languages \uC and \CFA using user-level threads running over multiple KTs (M:N threading).
The primary design objective for llheap is low latency across all allocator calls independent of application access-patterns and/or number of threads, \ie very seldom does the allocator delay during an allocator call.
Excluded from the low-latency objective are (large) allocations requiring initialization, \eg zero fill, and/or data copying, along with unbounded delays to acquire storage from the OS or OS scheduling, all of which are outside the allocator's purview.
A direct consequence of this objective is very simple or no storage coalescing;
hence, llheap's design is willing to use more storage to lower latency.
This objective is apropos because systems research and industrial applications are striving for low latency and modern computers have huge amounts of RAM.
Finally, llheap's performance must be comparable with current allocators, both in space and time (see performance comparison in Section~\ref{c:Performance}).


\subsection{llheap Design}

Figure~\ref{f:llheapDesign} shows the design of llheap, which uses the following features:
1:1 allocator model eliminating locking on the fast path,
separate small (@sbrk@) and large (@mmap@) object management,
headers per allocation versus containers,
small object binning (buckets) forming lists for different sized freed objects,
optional fast-lookup table for converting allocation requests into bucket sizes,
no coalescing to minimize latency,
optional heap ownership (build time),
reserved memory (buffer pool) per heap obtained from a global pool,
global heap managing freed thread heaps and interacting with the OS to obtain storage,
optional statistics-counter table for accumulating counts of allocation operations and a debugging version for testing (build time).

\begin{figure}
\centering
% \includegraphics[width=0.65\textwidth]{figures/NewHeapStructure.eps}
\input{llheap}
\caption{llheap Design}
\label{f:llheapDesign}
\end{figure}

llheap starts by creating an empty array for $N$ global heaps from storage obtained using @mmap@ that persists for the program duration, where $N$ is the number of computer cores.
There is a global last-array pointer and a bump-pointer within this array to locate the next free heap storage.
When an array's storage is exhausted, another empty array is allocated.
Terminated threads push their heap onto a global-stack top-pointer, where free heaps are intrusively linked.
When statistics are turned on, there is a global top pointer for an intrusive linked-list linking \emph{all} the heaps (not shown), which is traversed to accumulate statistics counters across heaps when @malloc_stats@ is called.

When a KT starts, it pops heap storage from the heap free-list, or if empty, gets the next free heap-storage.
When a KT terminates, its heap is pushed onto the heap free-list for reuse by a new KT, which prevents unbounded heap growth.
The free heaps are stored in a stack so hot storage is reused first.
Preserving all heaps created during the program lifetime solves the storage lifetime problem when ownership is used.
This approach wastes storage if a large number of KTs are created/terminated at program start and then the program continues sequentially, which is rare.

Each heap uses segregated free-buckets with free objects distributed across 60 different sizes from 16 to 16M.
All objects in a bucket are the same size.
The number of buckets used is determined dynamically depending on the crossover point from @sbrk@ to @mmap@ allocation, which is specified by calling @mallopt( M_MMAP_THRESHOLD )@, where the crossover must be $\ge$ the page size and $\le$ the largest bucket (16M).
Each cache-aligned bucket has a stack of the same-sized freed objects, where a stack ensures hot storage is reused first.
llheap can be configured with object ownership, where an object is freed to the heap from which it is allocated, or object no-ownership, where an object is freed to the KT's current heap.
For ownership, a shared remote stack is added to the freelist structure, so push/pop operations require locking.
Pushes are eager on each remote free \vs batching, and pops are lazy: when no cheap storage is available, the entire remote stack is gulped and added to the bucket's free list.

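The remote-stack protocol can be sketched as follows; a mutex stands in for whatever locking llheap actually uses, and the names are illustrative:

```c
#include <pthread.h>
#include <stddef.h>

// Freed objects are intrusively linked through their own storage.
struct free_node { struct free_node * next; };

struct remote_stack {
	pthread_mutex_t lock;
	struct free_node * top;
};

// Eager remote free: a non-owning thread pushes one object back.
static void remote_push( struct remote_stack * s, struct free_node * n ) {
	pthread_mutex_lock( &s->lock );
	n->next = s->top;
	s->top = n;
	pthread_mutex_unlock( &s->lock );
}

// Lazy pop: the owner gulps the whole stack when its free list runs dry.
static struct free_node * remote_gulp( struct remote_stack * s ) {
	pthread_mutex_lock( &s->lock );
	struct free_node * all = s->top;
	s->top = NULL;
	pthread_mutex_unlock( &s->lock );
	return all;
}
```

Gulping amortizes the locking cost: the owner pays for one critical section to recover an arbitrary number of remotely freed objects.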
Initial threads are assigned empty heaps from the heap array.
The first thread allocation causes a request for storage from the shared @sbrk@ area.
The size of this request is the maximum of the request size or the @sbrk@-extension-size / 16.
This heuristic means the @sbrk@ area is subdivided into separate heap buffers (HB) per thread, providing no contention and data locality.
A thread does bump allocation in its current buffer until it starts reusing freed storage or there is insufficient storage, at which point it obtains another buffer.
Thread buffers are not linked;
they are only logically connected to the thread through allocated and deallocated storage.
When a thread ends, its heap is returned to the heap array but no storage is released.
A new thread receiving a freed heap starts with it fully populated with freed storage.
The heuristic is that threads often do similar work, so the free storage in the heap is reusable, resulting in less internal fragmentation.
%The heuristic is that threads often do similar work so the free storage in the heap is immediately available.
%The downside is the risk of more external fragmentation, if the freed storage is never reused.
The downside is external fragmentation if the freed storage is never reused.

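Bump allocation from a heap buffer can be sketched as follows (the structure and names are illustrative, not llheap's actual layout):

```c
#include <stddef.h>
#include <stdint.h>

// Per-thread heap buffer (HB) carved from the shared sbrk area.
struct heap_buffer { uintptr_t next, end; };

// Bump-allocate 'size' bytes; returns 0 when the buffer is exhausted, at
// which point the heap would request another buffer from the global pool.
static uintptr_t bump_alloc( struct heap_buffer * hb, size_t size ) {
	if ( hb->end - hb->next < size ) return 0;	// insufficient storage?
	uintptr_t p = hb->next;
	hb->next += size;							// advance the bump pointer
	return p;
}
```

The fast path is a comparison and an addition, which is why bump allocation dominates until freed storage starts being recycled.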
Algorithm~\ref{alg:heapObjectAlloc} shows the allocation outline for an object of size $S$.
The allocation is divided into small (@sbrk@) or large (@mmap@).
For small allocations, $S$ is quantized into a bucket size.
Quantizing is performed using a direct lookup for sizes $<$ 64K or a binary search over the ordered bucket array for sizes $\ge$ 64K.
Then, the allocation storage is obtained from the following locations, in order of increasing latency: the bucket's free stack, the heap's local buffer, the bucket's remote stack, the global buffer, the OS (@sbrk@).
For large allocations, the storage is directly allocated from the OS using @mmap@.

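The quantization step for large sizes can be sketched as a binary search for the smallest bucket covering the request; the bucket sizes here are a short stand-in for llheap's 60 buckets:

```c
#include <stddef.h>

// Ordered bucket sizes; llheap uses 60 sizes from 16 bytes to 16M.
static const size_t buckets[] = { 16, 32, 48, 64, 96, 128, 192, 256 };
enum { NBUCKETS = sizeof(buckets) / sizeof(buckets[0]) };

// Smallest bucket >= size, or -1 if the request exceeds the largest
// bucket (at which point the allocation would go to mmap).
static int bucket_search( size_t size ) {
	int lo = 0, hi = NBUCKETS - 1, ans = -1;
	while ( lo <= hi ) {
		int mid = lo + (hi - lo) / 2;
		if ( buckets[mid] >= size ) { ans = mid; hi = mid - 1; }
		else lo = mid + 1;
	}
	return ans;
}
```

For small sizes, the same answer is precomputed into a direct lookup table indexed by the request size, trading table space for constant-time quantization.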
\begin{algorithm}[t]
\caption{Dynamic object allocation of size $S$}
\label{alg:heapObjectAlloc}
\begin{algorithmic}
\STATE $S \gets S + \text{header-size}$
\IF {$S < \textit{mmap-threshold}$}
\STATE $\textit{B} \gets \text{smallest free-bucket} \geq S$
\IF {$\textit{B's free-list \(\neg\)empty}$}
\STATE $\textit{O} \gets \text{pop an object from B's free-list}$
\ELSIF {$\textit{heap's allocation buffer} \ge B$}
\STATE $\textit{O} \gets \text{bump allocate object of size B from allocation buffer}$
\ELSIF {$\textit{heap's remote-list \(\neg\)empty}$}
\STATE $\textit{merge heap's remote-list into free-list}$
\STATE $\textit{O} \gets \text{pop an object from B's free-list}$
\ELSE
\STATE $\textit{O} \gets \text{allocate an object of size B from global pool}$
\ENDIF
\ELSE
\STATE $\textit{O} \gets \text{allocate an object of size S using \lstinline{mmap} system-call}$
\ENDIF
\RETURN $\textit{O}$
\end{algorithmic}
\end{algorithm}

Algorithm~\ref{alg:heapObjectFreeOwn} shows the deallocation (free) outline for an object at address $A$ with ownership.
First, the address is divided into small (@sbrk@) or large (@mmap@).
For small allocations, the bucket associated with the request size is retrieved from the allocation header.
If the bucket is local to the thread, the allocation is pushed onto the thread's associated bucket.
If the bucket is not local to the thread, the allocation is pushed onto the owning thread's remote stack.
For large allocations, the storage is unmapped back to the OS.
Without object ownership, the algorithm is the same as for ownership except when the bucket is not local to the thread.
In that case, the deallocating thread computes its own bucket corresponding to the owner bucket's size, and pushes the allocation onto that bucket, \ie no search is required.

\begin{algorithm}[t]
\caption{Dynamic object free at address $A$ with object ownership}
\label{alg:heapObjectFreeOwn}
\begin{algorithmic}
\IF {$\textit{A heap allocation}$}
\STATE $\text{B} \gets \textit{A's owner bucket}$
\IF {$\textit{B's thread = current heap thread}$}
\STATE $\text{push A to B's free-list}$
\ELSE
\STATE $\text{push A to B's remote-list}$
\ENDIF
\ELSE
\STATE $\text{return A to system using system call \lstinline{munmap}}$
\ENDIF
\end{algorithmic}
\end{algorithm}

\begin{comment}
\begin{algorithm}
\caption{Dynamic object free at address $A$ without object ownership}
\label{alg:heapObjectFreeNoOwn}
\begin{algorithmic}[1]
\IF {$\textit{A mapped allocation}$}
\STATE $\text{return A's dynamic memory to system using system call \lstinline{munmap}}$
\ELSE
\STATE $\text{B} \gets \textit{O's owner}$
\IF {$\textit{B is thread-local heap's bucket}$}
\STATE $\text{push A to B's free-list}$
\ELSE
\STATE $\text{C} \gets \textit{thread local heap's bucket with same size as B}$
\STATE $\text{push A to C's free-list}$
\ENDIF
\ENDIF
\end{algorithmic}
\end{algorithm}
\end{comment}

Finally, the llheap design funnels all allocation/deallocation operations through the @malloc@ and @free@ routines, which are the only routines to directly access and manage the internal data structures of the heap.
Other allocation operations, \eg @calloc@, @memalign@, and @realloc@, are composed of calls to @malloc@ and possibly @free@, and may manipulate header information after storage is allocated.
This design simplifies heap-management code during development and maintenance.


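As an illustration of such composition, a calloc-style operation can be built from @malloc@ plus zero-fill; this is a generic sketch, not llheap's implementation, which additionally records the zero-fill property in the header:

```c
#include <stdlib.h>
#include <string.h>

// calloc composed from malloc: overflow-check dim * elem_size, allocate,
// then zero-fill the storage.
static void * calloc_sketch( size_t dim, size_t elem_size ) {
	if ( elem_size != 0 && dim > (size_t)-1 / elem_size ) return NULL; // overflow?
	size_t bytes = dim * elem_size;
	void * p = malloc( bytes );
	if ( p != NULL ) memset( p, 0, bytes );
	return p;
}
```

Funnelling through @malloc@ means every composed operation inherits the fast-path behaviour and statistics/debugging instrumentation of the core routine.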
\subsubsection{Bounded Allocation}

813 | The llheap design results in bounded allocation.
|
---|
814 | For small allocations, once all the buckets have freed objects, storage is recycled.
|
---|
815 | For large allocations, the storage is directly recycled back to the OS.
|
---|
816 | When a thread terminates, its heap is recycled to the next new thread and the above process begins for that thread.
|
---|
817 | The pathological case is threads allocating a large amount of storage, freeing it, and then quiescing, which demonstrates that the bound constant can be large.
|
---|
818 | This pathological pattern occurs for \emph{immortal} threads, \eg I/O threads with program lifetime and bursts of activity performing many allocations/deallocations.
|
---|
819 | Hence, independent of external fragmentation in thread heaps, storage cannot grow unbounded unless the program does not free.
|
---|
820 |
|
---|

\subsubsection{Alignment}

The minimum storage alignment $M$ comes from the architecture application-binary-interface (ABI) based on hardware factors: bus width (32 or 64-bit), largest register (double, long double), largest atomic instruction (double compare-and-swap), or vector data (Intel MMX).
An access with a nonaligned address may be slow or cause an error.
A memory allocator must assume the largest hardware requirement because it is unaware of the data type occupying the allocation.
Often the minimum storage alignment is an 8/16-byte boundary on a 32/64-bit computer, respectively.
Alignments larger than $M$ are powers of 2, such as page alignment (4/8K).
Any alignment less than $M$ is raised to the minimal alignment.

llheap aligns its allocation header on an $M$ boundary and its size is $M$, making the following data $M$ aligned.
This pattern means there is no minimal alignment computation along the allocation fast path, \ie new storage and reused storage are always correctly aligned.
An alignment $N$ greater than $M$ is accomplished with a \emph{pessimistic} request for storage that ensures \emph{both} the alignment and size request are satisfied.
\begin{center}
\input{Alignment2}
\end{center}
The amount of storage necessary is $alignment - M + size$, which ensures there is an address, $A$, after the storage returned from @malloc@, $P$, that is a multiple of $alignment$ followed by sufficient storage for the data object.
The approach is pessimistic because, if $P$ happens to have the correct alignment $N$, the allocation has still requested sufficient space to move to the next multiple of $N$.
In this case, there is $alignment - M$ bytes of unused storage after the data object, which could be used by @realloc@.
Note, the address returned by the allocation is $A$, which is subsequently returned for deallocation.
However, the deallocation requires the value $P$, which must be computable from $A$, and from which $H$ can be computed as $P - M$.
Hence, there must be a mechanism to detect $P$ $\neq$ $A$ and compute $P$ from $A$.
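The pessimistic computation can be sketched in C. The helper name @aligned_request@, the fixed @M@, and returning $P$ through an out-parameter are simplifications for illustration; llheap instead records the information needed to recover $P$ inside a header.

```c
#include <stdlib.h>
#include <stdint.h>

enum { M = 16 };   // assumed minimum alignment (and header size) for this sketch

// Illustrative sketch of the pessimistic aligned request: over-allocating by
// alignment - M guarantees some address A >= P is a multiple of alignment,
// with size usable bytes after it.  P is returned through Pout here only so
// the caller can free; llheap instead recovers P from a header.
void * aligned_request( size_t alignment, size_t size, void ** Pout ) {
	void * P = malloc( alignment - M + size );   // pessimistic request
	if ( P == NULL ) return NULL;
	uintptr_t A = ( (uintptr_t)P + alignment - 1 ) & ~( (uintptr_t)alignment - 1 );
	*Pout = P;        // needed for deallocation; header H is at P - M
	return (void *)A; // aligned address handed to the caller
}
```

Since @malloc@ returns $M$-aligned storage, the round-up distance $A - P$ is a multiple of $M$ and at most $alignment - M$, so the requested $size$ bytes always fit after $A$.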

To detect and perform this computation, llheap uses two headers:
the \emph{original} header $H$ associated with the allocation, and a \emph{fake} header $F$ within this storage before the alignment boundary $A$.
\begin{center}
\input{Alignment2Impl}
\end{center}
Since every allocation is aligned at $M$, $P$ $\neq$ $A$ only holds for alignments greater than $M$.
When $P$ $\neq$ $A$, the minimum distance between $P$ and $A$ is $M$ bytes, due to the pessimistic storage allocation.
Therefore, there is always room for an $M$-byte fake header before $A$.
The fake header must supply an indicator to distinguish it from a normal header and the location of address $P$ generated by the allocation.
This information is encoded as an offset from $A$ to $P$ and the initial alignment (discussed in Section~\ref{s:ReallocStickyProperties}).
To distinguish a fake header from a normal header, the least-significant bit of the alignment is set to 1, because the offset participates in multiple calculations, while the alignment is just remembered data.
\begin{center}
\input{FakeHeader}
\end{center}
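The fake-header decode on deallocation can be sketched as follows. The struct layout, field widths, and names are assumptions for illustration, not llheap's exact layout; only the encoding idea (offset plus alignment with its low bit set) comes from the description above.

```c
#include <stdint.h>

enum { FAKE_BIT = 0x1u };   // low bit of the stored alignment marks a fake header

// Illustrative sketch of the fake-header encoding: the words before the
// aligned address A hold the offset back to P and the original alignment,
// whose least-significant bit is set to distinguish fake from normal headers.
typedef struct {
	uint32_t offset;      // bytes from A back to P
	uint32_t alignment;   // original alignment; low bit set => fake header
} FakeHeader;

// Given the address the client frees, recover the original malloc address P.
void * original_address( void * addr ) {
	FakeHeader * f = (FakeHeader *)addr - 1;   // candidate fake header before A
	if ( f->alignment & FAKE_BIT )             // fake header present?
		return (char *)addr - f->offset;       // compute P from A
	return addr;                               // normal header: P == A
}
```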

Note, doing alignment with containers requires a separate container for the aligned fixed-sized objects, so there are more kinds of containers that must be managed.


\subsubsection{\lstinline{realloc} and Sticky Properties}
\label{s:ReallocStickyProperties}

The allocation routine @realloc@ provides a memory-management pattern for shrinking/enlarging an existing allocation, while maintaining some or all of the object data.
The realloc pattern is simpler than the suboptimal manual steps.
\begin{flushleft}
\setlength{\tabcolsep}{10pt}
\begin{tabular}{ll}
\multicolumn{1}{c}{\textbf{realloc pattern}} & \multicolumn{1}{c}{\textbf{manual}} \\
\begin{C++}
T * naddr = realloc( oaddr, newSize );



\end{C++}
&
\begin{C++}
T * naddr = (T *)malloc( newSize ); $\C[2in]{// new storage}$
memcpy( naddr, addr, oldSize ); $\C{// copy old bytes}$
free( addr ); $\C{// free old storage}$
addr = naddr; $\C{// change pointer}\CRT$
\end{C++}
\end{tabular}
\end{flushleft}

The manual steps are suboptimal because there may be internal fragmentation at the end of the allocation due to bucket sizes.
If this storage is sufficiently large, it eliminates a new allocation and copying.
Alternatively, if the storage is made smaller, there may be a reasonable crossover point, where just increasing the internal fragmentation eliminates a new allocation and copying.
Hence, using @realloc@ as often as possible can reduce storage-management costs.
In fact, if @oaddr@ is @nullptr@, @realloc@ does a @malloc( newSize )@, and if @newSize@ is 0, @realloc@ does a @free( oaddr )@, so all allocation/deallocation can be done with @realloc@.

The hidden problem with this pattern is the effect of zero fill and alignment with respect to reallocation.
For safety, these properties must persist (be ``sticky'') when storage size changes.
Prior to llheap, allocation properties are not preserved across reallocation, nor is it possible to query an allocation to maintain these properties manually.
Hence, a random call to @realloc@ that reallocates storage may cause downstream errors, if allocation properties are needed.
This silent problem is unintuitive to programmers, can cause catastrophic failure, and is difficult to debug because it is transient.
To prevent these problems, llheap preserves initial allocation properties within an allocation, allowing them to be queried, and the semantics of @realloc@ preserve these properties on any storage change.
As a result, the realloc pattern is efficient and safe.

Note, @realloc@ has a compile-time disadvantage \vs @malloc@, because @malloc@ simplifies optimization opportunities.
For @malloc@, the compiler knows the new storage address is not aliased, which is not true for @realloc@: the same storage can be returned.
The compiler uses this knowledge to optimize the region of code between the @malloc@ call and the point where the pointer escapes or it finds the matching @free@.
For @realloc@, the compiler must also analyse the code \emph{before} the call and this analysis may fail.

Finally, there is a flaw in @realloc@'s definition: if there is no memory to allocate new storage for an expansion, the original allocation is not freed or moved, @errno@ is set to @ENOMEM@, and a null pointer is returned.
This semantics preserves the original allocation so the data is not lost in a failure case.
However, most calls to @realloc@ are written: @p = realloc( p, size )@, so the original storage is leaked when pointer @p@ is overwritten with null, negating the benefit of not freeing the storage for recovery purposes.
Programmers can follow a coding pattern of:
\begin{C++}
char * p;
...
void * p1 = realloc( p, size );
if ( p1 ) p = (char *)p1;
else // release some storage
\end{C++}
However, most programmers ignore return codes.
A better alternative is to change @realloc@'s interface to be like @posix_memalign@, which returns two results, a return code and a storage address, so the error code is separate from the returned storage.
\begin{C++}
int retcode = realloc( (void **)&p, size );
\end{C++}
which returns 0 or @ENOMEM@, only changes @p@ on a successful expansion, but requires an ugly cast at the call.


\subsubsection{Sticky Test}

Since sticky properties are an important safety feature for @realloc@, an ad-hoc @realloc@ test was created (not shown) to check whether a memory allocator preserves zero fill from @calloc@ and/or alignment from @memalign@.
The first test @calloc@s a large array (zero fill), sets the array to 42, shortens it, and then enlarges it to the original size.
It does these steps 100 times attempting to get a reused large block of memory that is still set to 42, showing new storage does not preserve zero fill.
The second test @memalign@s storage and @realloc@s it multiple times making it larger until the current storage must be copied into new storage.
The alignment of each storage address returned from @realloc@ is verified against the original alignment.
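A minimal version of the first (zero-fill) test might look as follows; the constants and the simplified error handling are illustrative, not the actual test harness.

```c
#include <stdlib.h>
#include <string.h>

// Sketch of the first ad-hoc sticky test: after calloc (zero fill), dirty
// the storage with 42s, shrink, and regrow; if any regrown byte is still 42,
// a reused block was returned without re-establishing zero fill.
// Returns 1 if zero fill appears sticky, 0 if a stale 42 is observed.
int zero_fill_sticky( void ) {
	enum { N = 4096, TRIALS = 100 };
	for ( int t = 0; t < TRIALS; t += 1 ) {
		unsigned char * p = calloc( N, 1 );
		if ( p == NULL ) return 0;
		memset( p, 42, N );                     // dirty the zero-filled storage
		p = realloc( p, N / 4 );                // shrink (NULL checks elided)
		p = realloc( p, N );                    // regrow to the original size
		for ( int i = N / 4; i < N; i += 1 )
			if ( p[i] == 42 ) { free( p ); return 0; }  // stale byte: not sticky
		free( p );
	}
	return 1;
}
```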

If a test fails, that sticky property is not provided;
if the test passes, that sticky property is provided in some form but not necessarily in all forms (the test just got lucky).
If an allocator fails these tests, it is unnecessary to perform a manual inspection of the @realloc@ code for sticky properties.
Only llheap passes the tests, as its @realloc@ applies sticky properties.

\subsubsection{Header}

Preserving allocation properties requires storing additional information with each allocation.
Figure~\ref{f:llheapHeader} shows llheap captures this information in the per-object header, which has two fields (left/right) sized appropriately for 32/64-bit alignment requirements.

\begin{figure}
\centering
\input{Header}
\caption{llheap Header}
\label{f:llheapHeader}
\end{figure}

The left field is a union of three values:
\begin{description}[leftmargin=*,topsep=2pt,itemsep=2pt,parsep=0pt]
\item[bucket pointer]
is for deallocation and points back to the bucket associated with this storage request (see Figure~\ref{f:llheapDesign} for the fields accessible in a bucket).
\item[mapped size]
is for deallocation of mapped storage and is the storage size for unmapping.
\item[next free block]
is an intrusive pointer linking same-size free blocks onto a bucket's stack of free objects.
\end{description}
The low-order 3 bits of these fields are unused by any stored values, due to the minimum alignment of 8 bytes (even for 32-bit addressing).
The 3 unused bits are used to represent mapped allocation, zero filled, and alignment, respectively.
Note, the zero-filled/mapped bits are only used in the normal header and the alignment bit in the fake header.
This implementation allows a fast test of whether any of the lower 3 bits are on (@&@ and compare).
If no bits are on, it implies a basic allocation, which is handled quickly in the fast path for allocation and free;
otherwise, the bits are analysed and appropriate actions are taken for the complex cases.
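The header layout and fast-path bit test can be sketched as follows; the field widths, bit assignments, and names are assumptions based on the description above, not llheap's exact definitions.

```c
#include <stddef.h>
#include <stdint.h>

// Illustrative sketch of the two-field header (not llheap's exact layout).
enum {
	MAPPED_BIT    = 0x1u,   // normal header: storage came from mmap
	ZERO_FILL_BIT = 0x2u,   // normal header: storage was zero filled
	FLAGS_MASK    = 0x7u,   // the third bit is the fake header's alignment mark
};

typedef struct {
	union {                     // left field: low 3 bits free (8-byte alignment)
		uintptr_t bucket;       // pointer to the owning bucket, for free
		uintptr_t mappedSize;   // size for munmap of a mapped allocation
		uintptr_t nextFree;     // intrusive link on a bucket's free stack
	} left;
	size_t requestSize;         // right field: request size, not bucket size
} Header;

// Fast-path test: any low bit on means a complex (non-basic) allocation.
int is_complex( const Header * h ) {
	return ( h->left.bucket & FLAGS_MASK ) != 0;
}
```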

The right field remembers the allocation request size versus the allocation (bucket) size, \eg a request of 42 bytes is rounded up to 64 bytes.
Since programmers think in request size rather than allocation size, the request size allows better generation of statistics and errors, and also helps in memory management.


\subsection{Statistics and Debugging}

llheap can be built to accumulate fast and largely contention-free allocation statistics to help understand dynamic memory behaviour.
Incrementing statistics counters must occur on the allocation fast path.
To make statistics performant enough for use on running systems, each heap has its own set of statistics counters, so statistics operations do not require slow atomic operations.

To locate all statistics counters, heaps are linked together in statistics mode, and this list is locked and traversed to sum all counters across heaps.
Note, the list is locked to prevent errors traversing an active list, which may have nodes added or removed dynamically;
the statistics counters are not locked and can flicker during accumulation.
Hence, printing statistics during program execution is an approximation.
Figure~\ref{f:StatiticsOutput} shows an example of statistics output, which covers all allocation operations and information about deallocating storage not owned by a thread.
No other memory allocator provides as comprehensive statistical information.
These statistics were invaluable during the development of llheap for debugging and verifying correctness, and should be equally valuable to application developers.
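The contention-free counter scheme can be sketched as follows; the two-counter @Heap@ struct and routine names are assumptions for the sketch, not llheap's definitions.

```c
#include <stddef.h>
#include <pthread.h>

// Illustrative sketch of contention-free statistics: each heap owns plain
// non-atomic counters; only the heap list is locked during accumulation, so
// a concurrent snapshot is an approximation, as noted above.
typedef struct Heap {
	struct Heap * next;                 // link on the global heap list
	unsigned long long mallocCalls;
	unsigned long long mallocBytes;
} Heap;

static Heap * heapList = NULL;
static pthread_mutex_t heapListLock = PTHREAD_MUTEX_INITIALIZER;

// fast path: unsynchronized increment on the calling thread's own heap
void note_malloc( Heap * h, size_t bytes ) {
	h->mallocCalls += 1;
	h->mallocBytes += bytes;
}

// slow path: traverse the locked list, summing possibly flickering counters
void sum_stats( unsigned long long * calls, unsigned long long * bytes ) {
	*calls = 0; *bytes = 0;
	pthread_mutex_lock( &heapListLock );          // protects list structure only
	for ( Heap * h = heapList; h != NULL; h = h->next ) {
		*calls += h->mallocCalls;
		*bytes += h->mallocBytes;
	}
	pthread_mutex_unlock( &heapListLock );
}
```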

\begin{figure}
\begin{C++}
PID: 2167216 Heap statistics: (storage request / allocation)
malloc >0 calls 19,938,000,110; 0 calls 2,064,000,000; storage 4,812,152,081,688 / 5,487,040,092,624 bytes
aalloc >0 calls 0; 0 calls 0; storage 0 / 0 bytes
calloc >0 calls 7; 0 calls 0; storage 1,040 / 1,152 bytes
memalign >0 calls 0; 0 calls 0; storage 0 / 0 bytes
amemalign >0 calls 0; 0 calls 0; storage 0 / 0 bytes
cmemalign >0 calls 0; 0 calls 0; storage 0 / 0 bytes
resize >0 calls 0; 0 calls 0; storage 0 / 0 bytes
realloc >0 calls 0; 0 calls 0; storage 0 / 0 bytes
copies 0; smaller 0; alignment 0; 0 fill 0
free !null calls 19,938,000,092; null / 0 calls 4,064,000,004; storage 4,812,152,003,021 / 5,487,040,005,152 bytes
remote pushes 4; pulls 0; storage 0 / 0 bytes
sbrk calls 1; storage 8,388,608 bytes
mmap calls 2,000,000; storage 2,097,152,000,000 / 2,105,344,000,000 bytes
munmap calls 2,000,000; storage 2,097,152,000,000 / 2,105,344,000,000 bytes
remainder calls 0; storage 0 bytes
threads started 4; exited 4
heaps $new$ 4; reused 0
\end{C++}
\caption{Statistics Output}
\label{f:StatiticsOutput}
\end{figure}

llheap can also be built with debug checking, which inserts many asserts along all allocation paths.
These assertions detect incorrect allocation usage, like double frees, unfreed storage, or memory corruption because internal values (like header fields) are overwritten.
These checks are best effort as opposed to the complete allocation checking of @valgrind@~\cite{valgind}.
Nevertheless, the checks detect many allocation problems.
There is a problem in detecting unfreed storage because some library routines assume their allocations have program-lifetime duration, and hence, do not free their storage.
For example, @printf@ might allocate a 1024-byte buffer on the first call and never delete this buffer.
To prevent a false positive for unfreed storage, it is possible to specify an amount of storage that is never freed (see @malloc_unfreed@ in Section~\ref{s:ExtendedCAPI}), and it is subtracted from the total allocate/free difference.
Determining the amount of never-freed storage is annoying, but once done, any warnings of unfreed storage are application related.
Debugging mode also scrubs each allocation with @0xff@, so assumptions about zero-filled objects generate errors.
Finally, if a program does segment fault in debug mode, a stack backtrace is printed to help in debugging.

Tests indicate only a 30\% performance decrease when statistics \emph{and} debugging are enabled in programs with 10\% to 15\% allocation cost, and the latency cost for accumulating statistics from each heap is mitigated by limited calls, often only one at the end of the program.


% \subsection{Design Choices}
%
% llheap's design was reviewed and changed multiple times during its development.
% All designs focused on the allocation/free \newterm{fast path}, \ie the shortest code path for the most common operations.
% The model chosen is 1:1, giving one heap per thread for each kernel thread (KT).
% Hence, immediately after a KT starts, its heap is created and just before a KT terminates, its heap is (logically) deleted.
% Therefore, the majority of heap operations are uncontended, modulo operations on the global heap and ownership.
%
% Problems:
% \begin{itemize}[leftmargin=*,topsep=3pt,itemsep=2pt,parsep=0pt]
% \item
% Need to know when a KT starts/terminates to create/delete its heap.
%
% \noindent
% It is possible to leverage constructors/destructors for thread-local objects to get a general handle on when a KT starts/terminates.
% \item
% There is a classic \newterm{memory-reclamation} problem for ownership because storage passed to another thread can be returned to a terminated heap.
%
% \noindent
% The classic solution only deletes a heap after all referents are returned, which is complex.
% The cheap alternative is for heaps to persist for program duration to handle outstanding referent frees.
% If old referents return storage to a terminated heap, it is handled in the same way as an active heap.
% To prevent heap blowup, terminated heaps can be reused by new KTs, where a reused heap may be populated with free storage from a prior KT (external fragmentation).
% In most cases, heap blowup is not a problem because programs have a small allocation set-size, so the free storage from a prior KT is apropos for a new KT.
% \item
% There can be significant external fragmentation as the number of KTs increases.
%
% \noindent
% In many concurrent applications, good performance is achieved with the number of KTs proportional to the number of CPUs.
% Since the number of CPUs is relatively small, and a heap is also relatively small, $\approx$10K bytes (not including any associated freed storage), the worst-case external fragmentation is still small compared to the RAM available on large servers with many CPUs.
% \item
% Need to prevent preemption during a dynamic memory operation because of the \newterm{serially-reusable problem}.
% \begin{quote}
% A sequence of code that is guaranteed to run to completion before being invoked to accept another input is called serially-reusable code.~\cite{SeriallyReusable}\label{p:SeriallyReusable}
% \end{quote}
% If a KT is preempted during an allocation operation, the OS can schedule another KT on the same CPU, which can begin an allocation operation before the previous operation associated with this CPU has completed, invalidating heap correctness.
% Note, the serially-reusable problem can occur in sequential programs with preemption, if the signal handler calls the preempted function, unless the function is serially reusable.
% Essentially, the serially-reusable problem is a race condition on an unprotected critical subsection, where the OS is providing the second thread via the signal handler.
%
% There is the same serially-reusable problem with UTs migrating across KTs.
% \end{itemize}
% Tests showed this design produced the closest performance match with the best current allocators, and code inspection showed most of these allocators use different variations of this approach.


\subsection{Bootstrapping}

There are several problems bootstrapping a memory allocator.
Programs can be statically or dynamically linked.
The order in which the linker schedules startup code is poorly supported, so it cannot be controlled entirely.
Knowing a KT's start and end independently from the KT code is also difficult.

For static linking, the allocator is loaded with the program.
Hence, allocation calls immediately invoke the allocator operation defined by the loaded allocation library and there is only one memory allocator used in the program.
This approach allows allocator substitution by placing an allocation library before any other in the link/load path.

Allocator substitution is similar for dynamic linking.
However, the dynamic loader starts first and needs to perform dynamic allocations \emph{before} the substitution allocator is loaded.
As a result, the dynamic loader uses a default allocator until the substitution allocator is loaded, after which all allocation operations are handled by the substitution allocator, including those from the dynamic loader.
Hence, some part of the @sbrk@ area may be used by the default allocator and substitution-allocator statistics cannot be correct.
Furthermore, dynamic linking uses an assembler trampoline to call the procedure-linkage-table resolver, so there is an additional cost along the allocator fast path for all allocation operations.
Testing showed up to a 5\% performance decrease with dynamic linking as compared to static linking, even when using @tls_model( "initial-exec" )@ to obtain tighter binding.

After the allocator is loaded, it needs to be initialized before the first allocation request.
Currently, the only mechanism to control initialization is via constructor routines (see below), each with an integer priority, where the linker calls the constructors in increasing order of priority.
However, there are few conventions for priorities amongst libraries, and constructors with equal priorities are called in arbitrary order.
(Only a transitive closure of references amongst library calls can establish an absolute initialization order.)
As a result, the first call to an allocation routine can occur without initialization.
To deal with this problem, it is necessary to have a global flag that is checked along the allocation fast path to trigger initialization (singleton pattern).

Along these lines, there is a subtle problem in defining when a program starts and ends.
For example, prologue/epilogue code outside of the program should not be counted in statistics as the application does not make these calls.
llheap establishes these two points using constructor/destructor routines with initialization priority 100, where system libraries use priorities $\le$ 100 and application programs have priorities $>$ 100.
\begin{flushleft}
\hspace*{\parindentlnth}
\setlength{\tabcolsep}{20pt}
\begin{tabular}{@{}ll@{}}
\begin{C++}
@__attribute__(( constructor( 100 ) ))@
static void startup( void ) {
	// clear statistic counters
	// reset allocUnfreed counter
}

\end{C++}
&
\begin{C++}
@__attribute__(( destructor( 100 ) ))@
static void shutdown( void ) {
	// sum allocUnfreed for all heaps
	// subtract global unfreed storage
	// if allocUnfreed > 0 then print warning message
}
\end{C++}
\end{tabular}
\end{flushleft}
Hence, @startup@ is called after the program prologue but before the application starts, and @shutdown@ is called after the application terminates but before the program epilogue.
By resetting counters in @startup@, prologue allocations are ignored, and checking unfreed storage in @shutdown@ checks only application memory management, ignoring the program epilogue.

Unfortunately, @startup@/@shutdown@ only apply to the program KT, not to any additional KTs created by the program.
However, it is essential for the allocator to know when each KT is started/terminated to initialize/de-initialize the KT's heap.
Initialization can be handled by making the global flag (above) thread-local, so the initialization check along the fast path covers the first allocation by a newly created thread.
De-initialization is handled by registering a destructor routine using @pthread_key_create@ in the initialization code triggered along the fast path, which subsequently calls the destructor at thread termination.


\subsection{User-level Threading Support}
\label{s:UserlevelThreadingSupport}

llheap is the underlying allocator in the user-threading programming languages \uC and \CFA.
These systems have preemptive scheduling, which requires management of timing events through a signal handler (@SIGALRM@).
The complexity in these systems is the serially-reusable problem (see Section~\ref{s:SingleThreadedMemoryAllocator}) when UTs are time sliced (language level) independently from KTs (OS level).
The solution is to prevent interrupts resulting in a CPU or KT change during critical operations, eliminating problems like starting a memory operation on one KT and completing it on another when the underlying heaps are different.
% For user-level threading, the serially-reusable problem occurs with time slicing for preemptable user-level scheduling, as the interrupted UT is unlikely to be restarted on the same KT.
However, without time slicing, a long running UT prevents the execution of other UTs (starvation).

The languages modify llheap using two techniques to prevent time slicing during non-interruptible allocation operations.
On the slow path, when executing expensive operations, time-slicing interrupts are disabled/enabled, so the operation completes atomically on the KT.
On the fast path, all non-interruptible allocation/deallocation routines are placed in a separate code segment.
The linker places this segment into a contiguous block of memory. %, but the order of routines within the block is unspecified.
Then the time-slice signal handler compares the program counter at the point of interrupt with the start/end address of the non-interruptible segment, and if executing within the segment, the signal handler returns without context switching.
The llheap funnel design simplifies this implementation so only a few funnel and statistics routines are located in the non-interruptible section.
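The handler's decision can be sketched as a simple range check; on a real system the interrupted program counter would be read from the @ucontext@ passed to the signal handler, and the segment bounds would come from linker-provided section symbols, both elided here.

```c
#include <stdint.h>

// Illustrative sketch of the time-slice handler's test: given the interrupted
// program counter and the bounds of the non-interruptible code segment,
// decide whether a user-level context switch is safe.
int should_preempt( uintptr_t pc, uintptr_t seg_start, uintptr_t seg_end ) {
	if ( pc >= seg_start && pc < seg_end )
		return 0;   // inside the allocator fast path: return without switching
	return 1;       // safe to context switch the interrupted user thread
}
```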
% This technique is fragile as no mechanism exists to ensure all crucial code along the fast path is placed into the non-interruptible segment.

Interestingly, marking non-interruptible operations by bracketing them with a set/reset of a thread-local flag fails, as the read/write is not atomic on some machines.
For example, the ARM processor stores the thread-local pointer in a coprocessor register that cannot perform atomic base-displacement addressing.
Hence, there is a window, between loading the kernel-thread-local pointer from the coprocessor register into a normal register and adding the displacement, when a time slice can move a UT.
As well, switching to a T:C model with restartable critical sections using @librseq@~\cite{Desnoyers19} was examined (see Section~\ref{s:MutualExclusion}).
However, tests showed that while @librseq@ can determine the particular CPU quickly, setting up the restartable critical-section along the allocation fast-path produced a significant decrease in performance.
Also, the number of undoable writes in @librseq@ is limited and restartable sequences cannot deal with UT migration across KTs.
For example, UT$_1$ is executing an allocation by KT$_1$ on CPU$_1$ and a time-slice preemption occurs.
The signal handler context switches UT$_1$ onto the user-level ready-queue and starts running UT$_2$ on KT$_1$, which immediately performs an allocation.
Since KT$_1$ is still executing on CPU$_1$, @librseq@ takes no action because it assumes KT$_1$ is still executing the same critical section.
Then UT$_1$ is scheduled onto KT$_2$ by the user-level scheduler, and its allocation operation continues in parallel with UT$_2$ using references into the heap associated with CPU$_1$, which corrupts CPU$_1$'s heap.
If @librseq@ had an @rseq_abort@ that:
\begin{enumerate}[leftmargin=*,topsep=2pt,itemsep=0pt,parsep=0pt]
\item
marks the current restartable critical-section as cancelled so it restarts when attempting to commit, and
\item
does nothing if there is no current restartable critical section in progress,
\end{enumerate}
then @rseq_abort@ could be called on the back side of a user-level context switch.
A feature similar to this idea might exist for hardware transactional memory.
A significant effort was made to make this approach work, but its complexity, lack of robustness, and performance costs resulted in its rejection.


\subsection{C API}

Figure~\ref{f:CDynamicAllocationAPI} shows the C dynamic allocation API, which is neither orthogonal nor complete.
For example, it is possible to zero fill or align an allocation but not both, it is only possible to zero fill an array allocation, and it is not possible to resize a memory allocation without data copying.
As a result, programmers must code the missing alternatives themselves, which is error prone; hence, the C programming language is rightly blamed for a poor allocation API.
Furthermore, newer programming languages have better type systems that can provide safer and more powerful APIs for memory allocation.
The following presents the llheap API changes.

1175 | \begin{figure}
|
---|
1176 | \hspace*{\parindentlnth}
|
---|
1177 | \begin{tabular}{@{}l|l@{}}
|
---|
1178 | \begin{C++}
|
---|
1179 | void * malloc( size_t size );
|
---|
1180 | void * calloc( size_t dimension, size_t size );
|
---|
1181 | void * realloc( void * oaddr, size_t size );
|
---|
1182 | void * reallocarray( void * oaddr, size_t dimension, size_t size );
|
---|
1183 | void free( void * addr );
|
---|
1184 | void * memalign( size_t alignment, size_t size );
|
---|
1185 | void * aligned_alloc( size_t alignment, size_t size );
|
---|
1186 | int posix_memalign( void ** memptr, size_t alignment, size_t size );
|
---|
1187 | void * valloc( size_t size );
|
---|
1188 | void * pvalloc( size_t size );
|
---|
1189 | \end{C++}
|
---|
1190 | &
|
---|
1191 | \begin{C++}
|
---|
1192 | int mallopt( int option, int value );
|
---|
1193 | size_t malloc_usable_size( void * addr );
|
---|
1194 | void malloc_stats( void );
|
---|
1195 | int malloc_info( int options, FILE * fp );
|
---|
1196 |
|
---|
1197 | // Unsupported
|
---|
1198 | struct mallinfo mallinfo( void );
|
---|
1199 | int malloc_trim( size_t );
|
---|
1200 | void * malloc_get_state( void );
|
---|
1201 | int malloc_set_state( void * );
|
---|
1202 | \end{C++}
|
---|
1203 | \end{tabular}
|
---|
1204 | \caption{llheap support of C dynamic-allocation API}
|
---|
1205 | \label{f:CDynamicAllocationAPI}
|
---|
1206 | \end{figure}
|
---|
1207 |
|
---|
1208 |
|
---|
1209 | \subsubsection{Extended C API}
|
---|
1210 | \label{s:ExtendedCAPI}
|
---|
1211 |
|
---|
1212 | llheap transparently augments the C dynamic memory API to increase functionality, orthogonality, and safety.
|
---|
1213 | \begin{itemize}[leftmargin=*,topsep=3pt,itemsep=2pt,parsep=0pt]
|
---|
1214 | \item
|
---|
1215 | @malloc@ remembers the original allocation size separate from the actual allocation size.
|
---|
1216 | \item
|
---|
1217 | @calloc@ sets the sticky zero-fill property.
|
---|
1218 | \item
|
---|
1219 | @memalign@, @aligned_alloc@, @posix_memalign@, @valloc@ and @pvalloc@ set the sticky alignment property, remembering the specified alignment size.
|
---|
1220 | \item
|
---|
1221 | @realloc@ and @reallocarray@ preserve sticky properties across copying.
|
---|
1222 | \item
|
---|
1223 | @malloc_stats@ prints detailed statistics of allocation/free operations when linked with the statistics version.
|
---|
1224 | \item
|
---|
1225 | If the shell variable @MALLOC_STATS@ exists, @malloc_stats@ is called implicitly at program termination, so precompiled programs do not have to be modified.
|
---|
1226 | \end{itemize}
|
---|
1227 |
|
---|
1228 | llheap extends the C dynamic-memory API with new allocation operations whose APIs match their existing C counterparts.
|
---|
1229 | \begin{itemize}[leftmargin=*,topsep=3pt,itemsep=1pt,parsep=0pt]
|
---|
1230 | \item
|
---|
1231 | @aalloc@ extends @calloc@ for dynamic array allocation \emph{without} zero-filling the memory (faster than @calloc@).
|
---|
1232 | \item
|
---|
1233 | @resize@ extends @realloc@ for resizing an allocation \emph{without} copying previous data or preserving sticky properties (faster than @realloc@).
|
---|
1234 | \item
|
---|
1235 | @resizearray@ extends @resize@ for an array allocation (faster than @reallocarray@).
|
---|
1236 | \item
|
---|
1237 | @amemalign@ extends @aalloc@ with alignment and sets the sticky alignment property.
|
---|
1238 | \item
|
---|
1239 | @cmemalign@ extends @amemalign@ with zero fill and sets the sticky zero-fill and alignment properties.
|
---|
1240 | \item
|
---|
1241 | @aligned_resize@ extends @resize@ with an alignment.
|
---|
1242 | \item
|
---|
1243 | @aligned_resizearray@ extends @resizearray@ with alignment.
|
---|
1244 | \item
|
---|
1245 | @aligned_realloc@ extends @realloc@ with alignment.
|
---|
1246 | \item
|
---|
1247 | @aligned_reallocarray@ extends @reallocarray@ with alignment.
|
---|
1248 | \end{itemize}
|
---|
1249 |
|
---|
1250 | llheap extends the C dynamic memory API with new control operations.
|
---|
1251 | The following routines are called \emph{once} during llheap startup to set specific limits \emph{before} an application starts.
|
---|
1252 | Setting these values early is essential because allocations can occur from the dynamic loader and other libraries before application code executes.
|
---|
1253 | To set a value, define a specific routine in an application and return the desired value, \eg
|
---|
1254 | \begin{C++}
|
---|
1255 | size_t malloc_extend() { return 16 * 1024 * 1024; }
|
---|
1256 | \end{C++}
|
---|
1257 | \begin{itemize}[leftmargin=*,topsep=0pt,itemsep=1pt,parsep=0pt]
|
---|
1258 | \item
|
---|
1259 | @malloc_extend@ returns the number of bytes to extend the @sbrk@ area when there is insufficient free storage to service an allocation request.
|
---|
1260 | \item
|
---|
1261 | @malloc_mmap_start@ returns the crossover allocation size from the @sbrk@ area to separate mapped areas; see also @mallopt( M_MMAP_THRESHOLD )@.
|
---|
1262 | \item
|
---|
1263 | @malloc_unfreed@ returns the amount subtracted from the global unfreed program storage to adjust for unreleased storage from routines like @printf@ (debug only).
|
---|
1264 | \end{itemize}
|
---|
1265 |
|
---|
1266 | llheap extends the C dynamic-memory API with functions to query object properties.
|
---|
1267 | \begin{itemize}[leftmargin=*,topsep=3pt,itemsep=1pt,parsep=0pt]
|
---|
1268 | \item
|
---|
1269 | @malloc_size@ returns the requested size of a dynamic object, which is updated when an object is resized, similar to @malloc_usable_size@.
|
---|
1270 | \item
|
---|
1271 | @malloc_alignment@ returns the object alignment, where the minimal alignment is 16 bytes.
|
---|
1272 | \item
|
---|
1273 | @malloc_zero_fill@ returns true if the object is zero filled.
|
---|
1274 | \item
|
---|
1275 | @malloc_remote@ returns true if the object is from a remote heap (@OWNERSHIP@ only).
|
---|
1276 | \end{itemize}
|
---|
1277 |
|
---|
1278 | llheap extends the C dynamic-memory API with new statistics control.
|
---|
1279 | \begin{itemize}[leftmargin=*,topsep=3pt,itemsep=1pt,parsep=0pt]
|
---|
1280 | \item
|
---|
1281 | @malloc_stats_fd@ sets the file descriptor for @malloc_stats@ writes (default @stdout@).
|
---|
1282 | \item
|
---|
1283 | @malloc_stats_clear@ clears the statistics counters for all thread heaps.
|
---|
1284 | \item
|
---|
1285 | @heap_stats@ extends @malloc_stats@ to only print statistics for the heap associated with the executing thread.
|
---|
1286 | \end{itemize}
|
---|
1287 |
|
---|
1288 |
|
---|
1289 | \subsubsection{Modern Allocation API}
|
---|
1290 |
|
---|
1291 | Modern programming languages have complex type systems that can be used to consolidate the panoply of memory allocation routines and features, providing a simpler programming experience and safety.
|
---|
1292 | The \CFA language is used to demonstrate this capability, because llheap forms the memory allocator for this C variant, but other languages can provide similar APIs.
|
---|
1293 |
|
---|
1294 | \CFA polymorphism reduces the allocation API to two overloaded routines allocating a single object or an array of objects.
|
---|
1295 | \begin{cfa}
|
---|
1296 | forall( T & ) {
|
---|
1297 | T * alloc( /* list of property functions ... */ ) { ... } // singleton allocation
|
---|
1298 | T * alloc( size_t @dimension@, /* list of property functions ... */ ) { ... } // array allocation
|
---|
1299 | }
|
---|
1300 | \end{cfa}
|
---|
1301 | Because the \CFA type system uses the return type to select overloads (like Ada), this capability is leveraged to remove the object-size parameter and return cast for regular calls to C @malloc@ or @memalign@.
|
---|
1302 | \begin{cfa}
|
---|
1303 | inline T * alloc( ... ) {
|
---|
1304 | if ( _Alignof(T) <= defaultAlign() ) return @(T *)@malloc( @sizeof(T)@ ); // C allocation
|
---|
1305 | else return @(T *)@memalign( @_Alignof(T)@, @sizeof(T)@ ); // C allocation
|
---|
1306 | }
|
---|
1307 | \end{cfa}
|
---|
1308 | The calls to these two routines are now much safer than the C equivalents.
|
---|
1309 | \begin{C++}
|
---|
1310 | int * ip = alloc(); $\C[2.75in]{// T => int, sizeof => 4/8, alignment => default}$
|
---|
1311 | double * dp = alloc(); $\C{// T => double, sizeof => 8, alignment => default}$
|
---|
1312 | struct Spinlock { ... } [[aligned(128)]] * sp = alloc(); $\C{// T => Spinlock, sizeof => ..., alignment = 128}$
|
---|
1313 | int * ia = alloc( 10 ); $\C{// T => int, sizeof => 4/8, alignment => default, dimension => 10}\CRT$
|
---|
1314 | \end{C++}
|
---|
1315 | At compile time, each call to @alloc@ extracts the return type @T@ from the left-hand side of the assignment, which is then used in @sizeof@, @_Alignof@, and casting the storage to the correct type.
|
---|
1316 | The @inline@ qualifier and the constant condition allow the compiler to remove the @if@ statement.
|
---|
1317 | This interface removes all the common allocation-call errors in C and provides a uniform name covering all allocations, reducing the cognitive burden.
|
---|
1318 |
|
---|
1319 | The property functions are a variable number of routines providing @alloc@ with management details and actions.
|
---|
1320 | The functions are @align@, @fill@, @resize@, and @realloc@, written in postfix rather than prefix notation solely for aesthetic reasons, \eg @3`fill@ $\equiv$ @fill( 3 )@.
|
---|
1321 | The examples are arrays but apply equally to singleton allocations.
|
---|
1322 | \begin{cfa}
|
---|
1323 | int * ip = alloc( 5, @4096`align@, @5`fill@ ); $\C[3in]{// start array on 4096 boundary and initialize elements with 5}$
|
---|
1324 | int * ip2 = alloc( 10, @ip`fill@, @(malloc_alignment( ip ))`align@ ); $\C{// first 5 elements same as ip, same alignment as ip}$
|
---|
1325 | _Complex double * cdp = alloc( 5, @(3.5+4.1i)`fill@ ); $\C{// initialize complex elements with 3.5+4.1i}$
|
---|
1326 | struct S { int i, j; };
|
---|
1327 | S * sp = alloc( 10, @((S){3, 4})`fill@ ); $\C{// initialize structure elements with {3, 4}}$
|
---|
1328 | ip = alloc( 10, @ip`realloc@, @10`fill@ ); $\C{// make array ip larger and initialize new elements with 10}$
|
---|
1329 | double * dp = alloc( 5, @ip2`resize@, @256`align@, @13.5`fill@ ); $\C{// reuse ip2 storage for something else}\CRT$
|
---|
1330 | \end{cfa}
|
---|
1331 | Finally, \CFA has constructors and destructors, like \CC, which are invoked when allocating with @new@ and @delete@.
|
---|
1332 | \begin{cfa}
|
---|
1333 | T * t = new( 3, 4, 5 ); $\C[3in]{// allocate T and call constructor T\{ 3, 4, 5 \}}$
|
---|
1334 | W * w = new( 3.5 ); $\C{// allocate W and call constructor W\{ 3.5 \}}$
|
---|
1335 | delete( t, w ); $\C{// call destructors and free t and w}\CRT$
|
---|
1336 | \end{cfa}
|
---|
1337 | The benefits of high-level API simplifications should not be underestimated with respect to programmer productivity and safety.
|
---|
1338 |
|
---|
1339 |
|
---|
1340 | \section{Performance}
|
---|
1341 | \label{c:Performance}
|
---|
1342 |
|
---|
1343 | This section uses a number of benchmarks to compare the behaviour of currently popular memory allocators with llheap.
|
---|
1344 | The goal is to see if llheap is a competitive memory allocator;
|
---|
1345 | no attempt is made to select a performance winner.
|
---|
1346 |
|
---|
1347 |
|
---|
1348 | \subsection{Experimental Environment}
|
---|
1349 | \label{s:ExperimentalEnvironment}
|
---|
1350 |
|
---|
1351 | The performance experiments are run on three different multi-core architectures, ARM, AMD, and Intel, covering the weak-order (WO) and total-store-order (TSO) memory models, to determine if there is consistency across architectures:
|
---|
1352 | \begin{description}[leftmargin=*,topsep=3pt,itemsep=2pt,parsep=0pt]
|
---|
1353 | \item[ARM]
|
---|
1354 | Gigabyte E252-P31 128-core socket 3.0 GHz, WO memory model
|
---|
1355 | \item[AMD]
|
---|
1356 | Supermicro AS--1125HS--TNR EPYC 9754 128--core socket, hyper-threading $\times$ 2 sockets (512 processing units) 2.25 GHz, TSO memory model
|
---|
1357 | \item[Intel]
|
---|
1358 | Supermicro SYS-121H-TNR Xeon Gold 6530 32--core, hyper-threading $\times$ 2 sockets (128 processing units) 2.1 GHz, TSO memory model
|
---|
1359 | \end{description}
|
---|
1360 | For the parallel experiments, threads are pinned to cores in a linear fashion, \ie from core $N$ to $N+M$, where $N$ is the start of a socket boundary.
|
---|
1361 | This layout produces the best throughput, as there is little or no communication among threads in the benchmarks, so binding tightly to the cache layout is unnecessary;
|
---|
1362 | hence, there are almost no OS or NUMA effects perturbing the benchmarks.
|
---|
1363 |
|
---|
1364 | The compilers are gcc/g++-14.2.0 and gfortran-14.2.0 running on the Linux v6.8.0-52-generic OS, with @LD_PRELOAD@ used to override the default allocator.
|
---|
1365 | To prevent the compiler from eliding certain code patterns, crucial parts of a test are wrapped in the function @pass@:
|
---|
1366 | \begin{uC++}
|
---|
1367 | static inline void * pass( void * v ) { $\C[2.5in]{// prevent eliding, cheaper than volatile}$
|
---|
1368 | __asm__ __volatile__( "" : "+r"(v) ); return v;
|
---|
1369 | }
|
---|
1370 | void * vp = pass( malloc( 0 ) ); $\C{// wrap malloc call to prevent elision}\CRT$
|
---|
1371 | \end{uC++}
|
---|
1372 | The call to @pass@ can prevent a small number of compiler optimizations, but this cost is the same for all allocators.
|
---|
1373 |
|
---|
1374 |
|
---|
1375 | \subsection{Memory Allocators}
|
---|
1376 | \label{s:MemoryAllocators}
|
---|
1377 |
|
---|
1378 | Historically, a number of C/\CC, stand-alone, general-purpose memory-allocators, \eg dlmalloc~\cite{dlmalloc}, have been written for use by programming languages providing unmanaged memory.
|
---|
1379 | For this work, six popular, thread-safe memory-allocators are selected for comparison, along with llheap.
|
---|
1380 |
|
---|
1381 | \begin{description}[leftmargin=*,topsep=3pt,itemsep=2pt,parsep=0pt,listparindent=\parindent]
|
---|
1382 | \item[glibc~\cite{glibc}] % https://sourceware.org/glibc/wiki/MallocInternals
|
---|
1383 | is the default glibc allocator, derived from ptmalloc, derived from dlmalloc.
|
---|
1384 | glibc has multiple threads sharing multiple heaps with a global shared heap, header per allocation, free-lists with different organizational criteria and searching, and coalescing of certain adjacent free-areas.
|
---|
1385 | Version Ubuntu GLIBC 2.31-0ubuntu9.7, compiled by Ubuntu 24.04.
|
---|
1386 |
|
---|
1387 | \item[hoard~\cite{hoard}]
|
---|
1388 | has multiple threads sharing multiple heaps with a global shared heap, where each heap is composed of superblocks containing fixed-size objects, with each superblock having a single header for its objects and reuse of superblocks if empty.
|
---|
1389 | Version 3.13.0, compiled with gcc-14.2.0, default configuration, using command @make@.
|
---|
1390 | Over the past 5 years, hoard development has stopped;
|
---|
1391 | it fails on the ARM architecture, possibly because of the WO memory model.
|
---|
1392 |
|
---|
1393 | \item[jemalloc~\cite{Evans06}]
|
---|
1394 | has multiple threads sharing multiple heaps (arenas) composed of same-sized chunks subdivided into regions composed of pages where each page is a container of same-sized objects.
|
---|
1395 | The components are organized into a number of data structures to facilitate allocations, freeing, and coalescing.
|
---|
1396 | Large objects are allocated using @mmap@.
|
---|
1397 | Version jemalloc-5.3.0~\cite{jemalloc}, built with the default configuration, using commands: @autogen.sh; configure; make; make install@.
|
---|
1398 |
|
---|
1399 | \item[mimalloc~\cite{Leijen19}]
|
---|
1400 | has a heap per thread composed of a reserved area subdivided into page buffers of three sizes, where each page is a container of same-sized objects.
|
---|
1401 | Each page manages its own internal free list; the free list is built when a page is created, so there is no initial bump pointer.
|
---|
1402 | Empty pages are coalesced for reuse.
|
---|
1403 | A fast free-list search is used for small allocation sizes.
|
---|
1404 | Ownership is handled with a separate remote free-list, and remote frees are batched before being pushed to the owner heap.
|
---|
1405 | Version mimalloc-v2.1.2, built with the default configuration, using commands @cmake . ; make@.
|
---|
1406 |
|
---|
1407 | \item[tbbmalloc~{\cite[pp.~314--315]{Kukanov07}}] is the allocator shipped with Intel's Threading Building Blocks (TBB).
|
---|
1408 | tbbmalloc has a heap per thread for small allocations, with large allocations handled using a single request.
|
---|
1409 | There is a global heap to acquire and reuse space obtained from the OS;
|
---|
1410 | its reserved space is divided into thread buffers (containers).
|
---|
1411 | A thread heap is composed of linked containers, with binning used to manage the allocations/deallocations within the containers.
|
---|
1412 | Small object space is not returned to the OS.
|
---|
1413 | An allocation has to search its container list to find a partially filled one.
|
---|
1414 | The search is mitigated by moving mostly-free containers to the start of the container list;
|
---|
1415 | free containers are returned to the global heap.
|
---|
1416 | Ownership is handled with a separate remote free-list.
|
---|
1417 | Version @libtbbmalloc.so.2.11@, installed using @apt-get install libtbb-dev@.
|
---|
1418 |
|
---|
1419 | \item[tcmalloc~\cite{tcmalloc}] is the allocator shipped with Google's perftools.\footnote{
|
---|
1420 | Currently, there are two versions of tcmalloc: Google's perftools and one experimental version available on GitHub, which is not an officially supported Google product.
|
---|
1421 | We selected the perftools version because it is the most likely choice for users as it installs directly onto multiple OSs.}
|
---|
1422 | tcmalloc has per-CPU heaps for small allocations, with large allocations handled with a single request.
|
---|
1423 | CPU heaps require a rollback mechanism, @rseq@, to prevent the serially-reusable problem.
|
---|
1424 | There is a global heap to acquire and reuse space obtained from the OS;
|
---|
1425 | its reserved space is divided into multi-page spans (containers) of fixed-size objects.
|
---|
1426 | A CPU heap uses binning to manage the allocations/deallocations within the containers.
|
---|
1427 | Free containers are returned to the OS.
|
---|
1428 | Version @libtcmalloc_minimal.so.4@, installed using @apt-get install google-perftools@.
|
---|
1429 | \end{description}
|
---|
1430 |
|
---|
1431 | Untested allocators:
|
---|
1432 | \begin{description}[leftmargin=*,topsep=3pt,itemsep=2pt,parsep=0pt]
|
---|
1433 | \item[ptmalloc3]
|
---|
1434 | is 8 years old and already integrated into glibc.
|
---|
1435 | \item[rpmalloc]
|
---|
1436 | requires explicit insertion of initialization/finalization calls for handling concurrent kernel threads.
|
---|
1437 | Having to augment programs, like SPEC CPU benchmarks, is deemed outside of normal programmer expectations.
|
---|
1438 | % An allocator should just plugin and work.
|
---|
1439 | \item[lock free] allocators guarantee allocation progress, even if threads are delayed or killed, using an atomic instruction, often CAS.
|
---|
1440 | The original lock-free allocator~\cite{Michael04} is completely lock-free.
|
---|
1441 | As stated, atomic instructions on the fast path result in a significant performance penalty.
|
---|
1442 | Hence, new allocators are not completely lock free, switching to a combination of synchronization-free, \ie 1:1 allocator model, on the fast path and lock-free on the slow path(s) to manipulate shared data structures~\cite{rpmalloc}.
|
---|
1443 | These allocators are better labelled as \newterm{hybrid locking} rather than lock free, as the lock-free aspect is not contributing to performance.
|
---|
1444 |
|
---|
1445 | % We observe that none of the pre-built standard malloc replacement libraries for ubuntu \url{https://launchpad.net/ubuntu/+search?text=malloc} are completely lock-free.
|
---|
1446 | % 1:1 allocators can avoid synchronization (locks, or lock-free techniques with atomic instructions as well as cache coherence overheads) in their critical fast paths, but care must be taken to ensure the the amount of free memory captured in thread-local structures is bounded.
|
---|
1447 |
|
---|
1448 | % Another approach to synchronization for allocators is \newterm{Restartable Critical Sections} ~\cite {https://dl.acm.org/doi/10.1145/512429.512451, https://dl.acm.org/doi/pdf/10.5555/1698184, https://doi.org/10.1145/1064979.1064985}, which are available in linux as the \newterm{RSEQ} facility ~\cite{https://www.gnu.org/software/libc/manual/html_node/Restartable-Sequences.html}.
|
---|
1449 | % Restartable Critical Sections provide obstruction-free progress by means of specially crafted transactions that will be rolled back if they happen to be interrupted by the kernel.
|
---|
1450 | % Restartable Critical Sections transactions can only operate on CPU-specific data, however, which forces a T:C allocator configuration.
|
---|
1451 | % Google's experimental tcmalloc \url{https://google.github.io/tcmalloc/rseq.html} uses RSEQ.
|
---|
1452 | % SuperMalloc \url{ACM DL is dead at the moment, but it's in ISMM 2015} attempts to use hardware transactional memory for lock elision, but falls back to classic locking if the hardware facility is not present or when a given transactional attempt encounters repeated progress failures.
|
---|
1453 |
|
---|
1454 |
|
---|
1455 | \end{description}
|
---|
1456 |
|
---|
1457 | Allocator size is an indirect indicator of complexity.
|
---|
1458 | Lines-of-code are computed with command @cloc *.{h,c,cc,cpp}@, except for hoard:
|
---|
1459 | @cloc --exclude-lang="Bourne Shell",SKILL,Markdown,Bazel Heap-Layers source include@.
|
---|
1460 | \begin{center}
|
---|
1461 | \setlength{\tabcolsep}{13pt}
|
---|
1462 | \begin{tabular}{@{}rrrrrrrr@{}}
|
---|
1463 | llheap & glibc & hoard & jemalloc & mimalloc & tbbmalloc & tcmalloc \\
|
---|
1464 | 1,450 & 3,807 & 11,932 & 24,512 & 6,887 & 6,256 & 33,963 \\
|
---|
1465 | \end{tabular}
|
---|
1466 | \end{center}
|
---|
1467 |
|
---|
1468 |
|
---|
1469 | \section{Benchmarks}
|
---|
1470 | \label{s:Benchmarks}
|
---|
1471 |
|
---|
1472 | There are two basic approaches for evaluating computer software: benchmarks and micro-benchmarks.
|
---|
1473 | \begin{description}[leftmargin=*,topsep=3pt,itemsep=2pt,parsep=0pt]
|
---|
1474 | \item[Benchmarks]
|
---|
1475 | are a suite of application programs (SPEC CPU/WEB) that are exercised in a common way (inputs) to find differences among underlying software implementations associated with an application (compiler, memory allocator, web server, \etc).
|
---|
1476 | The applications are supposed to represent common execution patterns that need to perform well with respect to an underlying software implementation.
|
---|
1477 | Benchmarks are often criticized for having overlapping patterns, insufficient patterns, or extraneous code that masks patterns, resulting in little or no information about why an application did or did not perform well for the tested software.
|
---|
1478 | \item[Micro-Benchmarks]
|
---|
1479 | attempt to extract the common execution patterns associated with an application and run the pattern independently.
|
---|
1480 | This approach removes any masking from extraneous application code, allows the execution pattern to be very precise, and provides an opportunity for the execution pattern to have multiple independent tuning adjustments (knobs).
|
---|
1481 | Micro-benchmarks are often criticized for inadequately representing real-world applications, but that is not their purpose.
|
---|
1482 | \end{description}
|
---|
1483 |
|
---|
1484 | While some crucial software components have standard benchmarks, no standard benchmark exists for testing and comparing memory allocators.
|
---|
1485 | In the past, an assortment of applications have been used for benchmarking allocators~\cite{Detlefs93,Berger00,Berger01,berger02reconsidering}: P2C, GS, Espresso/Espresso-2, CFRAC/CFRAC-2, GMake, GCC, Perl/Perl-2, Gawk/Gawk-2, XPDF/XPDF-2, ROBOOP, Lindsay.
|
---|
1486 | As well, an assortment of micro-benchmarks have been used for benchmarking allocators~\cite{larson99memory,Berger00,streamflow}: threadtest, shbench, Larson, consume, false sharing.
|
---|
1487 | Many of these benchmark applications and micro-benchmarks are old and do not reflect current application allocation patterns.
|
---|
1488 |
|
---|
1489 | Except for the SPEC CPU benchmark, the other performance benchmarks used for testing are micro-benchmarks created for this paper.
|
---|
1490 | All the benchmarks are used solely to extract differences among memory allocators.
|
---|
1491 | The term benchmark in the following discussion means benchmark or micro-benchmark.
|
---|
1492 |
|
---|
1493 |
|
---|
1494 | \subsection{SPEC CPU 2017}
|
---|
1495 |
|
---|
1496 | SPEC CPU 2017 is an industry-standardized suite for measuring and comparing performance of compute-intensive programs.
|
---|
1497 | It contains integer and floating-point tests written in C, \CC, and Fortran, covering throughput and speed, where each test contains multiple benchmarks~\cite{SPECCPU2017}.
|
---|
1498 | All the benchmarks perform dynamic allocation, from light to heavy.
|
---|
1499 | However, the dynamic allocation is relatively small in comparison to the benchmark computation.
|
---|
1500 | Therefore, differences among allocators should be small, unless a particular access pattern triggers a pathological case.
|
---|
1501 | The reason for performing SPEC CPU across the allocators is to test this hypothesis.
|
---|
1502 | For allocator comparisons, we consider SPEC CPU differences within 5\% as equal and undetectable in general workloads and computing environments.
|
---|
1503 | For compiler comparisons, small differences of 1\% or 2\% are considered significant.
|
---|
1504 |
|
---|
1505 | Table~\ref{t:SPEC-CPU-benchmark} shows the elapsed time (inverted throughput) of the SPEC CPU tests condensed to the geomean across the benchmarks for each of the four SPEC tests, intrate, intspeed, fprate, and fpspeed, covering integer and floating-point operations.
|
---|
1506 | The tests are configured with size = ref, intrate/fprate: copies = 1, intspeed: threads = 1, fpspeed: threads = 16;
|
---|
1507 | only fpspeed is concurrent using OpenMP.
|
---|
1508 | Rigorous testing of SPEC CPU often runs many benchmark copies in parallel to completely load all computer cores.
|
---|
1509 | However, these tests quickly run into architectural bottlenecks having little to do with an allocator's behaviour.
|
---|
1510 | Running a single program bound to one core means the focus is strictly on allocator differences, rather than conflating them with transient OS and hardware differences.
|
---|
1511 | The throughputs are ranked with {\color{red}red} lowest time and {\color{blue}blue} highest, where lower is best.
|
---|
1512 | Hoard failed in multiple experiments on the ARM architecture, marked with {\color{purple}*Err*}, making it impossible to report the successful tests.
|
---|
1513 |
|
---|
1514 | The results show all allocators do well;
|
---|
1515 | the average, median, and relative standard deviation (right column)\footnote{$rstd = \sigma / \mu \times 100$, where $\sigma =$ standard deviation and $\mu =$ average} support our hypothesis that the performance difference, 0.6\% to 2.3\%, across allocators is small.
|
---|
1516 | One implementation trend we observed is that two of the integer tests, @omnetpp@ and @xalancbmk@, had an execution pattern that exercised the cache.
|
---|
1517 | For the three allocators using headers-per-allocation, glibc, llheap, and tbbmalloc, performance could be up to 40\% slower, between the best and worst allocator results.
|
---|
1518 | The reason is that the headers consumed part of the cache line, resulting in more cache misses.
|
---|
1519 | These two experiments disproportionately increased the geomean for these allocators in both integer experiments on all architectures.
|
---|
1520 | Hence, headers-per-allocation are disadvantaged for this specific execution pattern.
|
---|
1521 | The floating-point tests show no trends among the allocators.
|
---|
1522 | The goal for llheap in this experiment is to do well, which is established by its closeness to the median result, \ie it is normally in the middle of the allocator results.
|
---|
1523 |
|
---|
1524 | \begin{table}
|
---|
1525 | \centering
|
---|
1526 | \caption{SPEC CPU benchmark, 3 hardware architectures, geomean per test in seconds, lower is better}
|
---|
1527 | \label{t:SPEC-CPU-benchmark}
|
---|
1528 | %\setlength{\tabcolsep}{6pt}
|
---|
1529 | \begin{tabular}{@{}p{15pt}@{\hspace{15pt}}r|*{7}{r}|*{3}{r}@{}}
|
---|
1530 | & bench/alloc. & glibc & hoard & jemalloc & llheap & mimalloc & tbbmalloc & tcmalloc & avg & med & rstd \\
|
---|
1531 | \cline{2-12}
|
---|
1532 | & intrate & {\color{blue}314.4} & {\color{violet}*Err*} & 300.3 & 309.9 & 302.6 & 313 & {\color{red}298.7} & 306.5 & 309.9 & 2\% \\
|
---|
1533 | ARM & intspeed & {\color{blue}439.1} & {\color{violet}*Err*} & 417.6 & 431.1 & 419.9 & 436.2 & {\color{red}415.5} & 426.6 & 431.1 & 2.2\% \\
|
---|
1534 | & fprate & 347.6 & {\color{violet}*Err*} & {\color{red}333.9} & 352.2 & {\color{blue}356.6} & 345.9 & 344.5 & 346.8 & 347.6 & 2\% \\
|
---|
1535 | & fpspeed & 248.4 & {\color{violet}*Err*} & 245.3 & 245.7 & {\color{blue}250.9} & 246.6 & {\color{red}243.8} & 246.8 & 246.6 & 0.93\%
|
---|
1536 | \end{tabular}
|
---|
1537 |
|
---|
1538 | \begin{comment}
|
---|
1539 | \bigskip
|
---|
1540 | \begin{tabular}{@{}p{15pt}@{\hspace{15pt}}r|*{7}{r}|*{3}{r}@{}}
|
---|
1541 | & bench/alloc. & glibc & hoard & jemalloc & llheap & mimalloc & tbbmalloc & tcmalloc & avg & med & rstd \\
|
---|
1542 | \cline{2-12}
|
---|
1543 | & intrate & 251 & 242 & 239 & 249 & 240 & {\color{blue}251} & {\color{red}237} & 244 & 242 & 2.3\% \\
|
---|
AMD & intspeed & 356 & 337 & 335 & 351 & 339 & {\color{blue}356} & {\color{red}333} & 344 & 339 & 2.7\% \\
& fprate & 256 & 261 & {\color{red}250} & 257 & {\color{blue}270} & 256 & 254 & 258 & 256 & 2.3\% \\
& fpspeed & 340 & {\color{blue}353} & {\color{red}326} & 338 & 348 & 341 & 328 & 339 & 340 & 2.7\%
\end{tabular}
\end{comment}

\bigskip
\begin{tabular}{@{}p{15pt}@{\hspace{15pt}}r|*{7}{r}|*{3}{r}@{}}
& bench./alloc. & glibc & hoard & jemalloc & llheap & mimalloc & tbbmalloc & tcmalloc & avg & med & rstd \\
\cline{2-12}
& intrate & 251.2 & {\color{red}241.1} & 251.9 & 249.3 & 251.6 & 251.5 & {\color{blue}252.3} & 249.9 & 251.5 & 1.5\% \\
AMD & intspeed & {\color{blue}356.1} & {\color{red}337.1} & 355.4 & 351.7 & 355.5 & 355.8 & 355.9 & 352.5 & 355.5 & 1.8\% \\
& fprate & {\color{red}253.9} & {\color{blue}259.9} & 254.4 & 255.8 & 254.5 & 254.4 & 254.7 & 255.4 & 254.5 & 0.75\% \\
& fpspeed & 329.9 & {\color{blue}339.6} & 330.6 & {\color{red}327.2} & 329.9 & 329.8 & 329.5 & 330.9 & 329.9 & 1.1\%
\end{tabular}

\bigskip
\begin{tabular}{@{}p{15pt}@{\hspace{15pt}}r|*{7}{r}|*{3}{r}@{}}
& bench./alloc. & glibc & hoard & jemalloc & llheap & mimalloc & tbbmalloc & tcmalloc & avg & med & rstd \\
\cline{2-12}
& intrate & 188.6 & 185.1 & 183.1 & 188.6 & 181.5 & {\color{blue}189.4} & {\color{red}181.2} & 185.4 & 185.1 & 1.8\% \\
Intel & intspeed & 271.6 & 264.6 & 263.5 & 270.2 & 261.2 & {\color{blue}272.1} & {\color{red}260.3} & 266.2 & 264.6 & 1.7\% \\
& fprate & 202.7 & {\color{red}201.8} & 204.4 & 205.1 & {\color{blue}205.3} & 204.7 & 203.7 & 204.0 & 204.4 & 0.59\% \\
& fpspeed & 237.3 & 235.3 & 234.5 & 235.6 & {\color{blue}244.5} & 236.1 & {\color{red}233.6} & 236.7 & 235.6 & 1.4\%
\end{tabular}
\end{table}


\subsection{Realloc Benchmark}

Some examination of @realloc@ is necessary to encourage its use.
Reallocation can be very efficient (both in space and time) when manipulating variable-sized objects, like strings, multi-precision numbers, or dynamic-sized arrays.
Both X11 (500+ calls) and glibc (300+ calls) use @realloc@ for various purposes.
For example, in \CC:
\begin{C++}
string s = "abc"; // initial allocation and copy new value
s = "gh"; // change size and copy new value
s = "l" + s + "r"; // change size and copy new value
s = s.substr(0,2); // reduce size
\end{C++}
variable @s@ changes size and value multiple times, plus temporary strings are created implicitly, \eg for the multiple concatenations, all of which require multiple allocations, copying, and deallocations.
@realloc@ can optimize some of these operations in two ways:
\begin{enumerate}[leftmargin=*]
\item
For decreasing size, Figure~\ref{f:ReallocOptDecreasing} shows a logical truncation of the existing object rather than creating a new object, \ie use a heuristic to decide whether to perform the 3-step procedure (allocate, copy, and free), or pretend the storage is decreased and return the old storage and value, performing zero work but increasing internal fragmentation.
For example, a request to decrease size from 96 to 75 bytes can be implemented either way.
The 21 bytes of internal fragmentation at the end of the logical reallocation may be unavailable, directly available if the allocator supports @malloc_usable_size@, or indirectly available if put back on the allocator free list.
\item
For increasing size, Figure~\ref{f:ReallocOptIncreasing} takes advantage of the fact that many memory allocators quantize request sizes (binning), often returning slightly more storage than requested (internal fragmentation).
For example, an initial request for 75 bytes may return 96 bytes of storage, giving 21 bytes of internal fragmentation.
For size increases of up to 21 bytes, @realloc@ can take advantage of this unused space rather than performing the 3-step procedure, which can also result in unused storage.
\end{enumerate}
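The combined effect of these optimizations is easiest to see in a growable buffer.
The following sketch (names @buf@, @len@, and @append@ are illustrative, not part of the benchmarks) relies on @realloc@ to extend in place when the quantized bucket has room, and note @realloc( NULL, n )@ is equivalent to @malloc( n )@:
\begin{C++}
char * buf = NULL;	// growable buffer, illustrative only
size_t len = 0;
void append( const char * s, size_t n ) {	// error checking elided
	buf = (char *)realloc( buf, len + n );	// may extend in place within bucket
	memcpy( buf + len, s, n );	// copy only the new bytes
	len += n;
}
\end{C++}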

\begin{figure}
\centering
\subfloat[Decreasing]{\label{f:ReallocOptDecreasing}\input{decreasing}}
\hspace*{5pt}
\vrule
\hspace*{5pt}
\subfloat[Increasing]{\label{f:ReallocOptIncreasing}\raisebox{0.38\totalheight}{\input{increasing}}}
\caption{Realloc Optimizations}
\label{f:ReallocOptimizations}
\end{figure}

Figure~\ref{f:reallocShrinkBenchmark} shows a benchmark to determine if an allocator takes advantage of the first optimization.
The benchmark takes a fixed-size allocation and reduces it by 10\%--90\% in steps of 10\%, checking the storage address at each reduction step to see if the same or new storage is returned.
The fixed-size allocation is varied between sizes 64--16K in powers of 2.
Hence, both small and large sized storage are reduced.
The following table shows the approximate percentage point where storage is retained on shrinkage, \eg the storage reduction must be greater than 50\% of the prior allocation before a new allocation is performed for the smaller size, the data is copied, and the prior storage released.
\begin{center}
\setlength{\tabcolsep}{15pt}
\begin{tabular}{@{}ccccccc@{}}
glibc & hoard & jemalloc & llheap & mimalloc & tbbmalloc & tcmalloc \\
90\% & 50\% & 20\% & 50\% & 50\% & 90\% & 50\%
\end{tabular}
\end{center}
The results show glibc and tbbmalloc rarely perform this optimization (90\% crossover), while the other allocators do, with 50\% as the most popular crossover point.

Figure~\ref{f:reallocGrowBenchmark} shows a benchmark to determine if an allocator takes advantage of the second optimization.
This benchmark creates an array of fixed-sized elements, increasing the array size by 1 from 1--10,000 elements.
The element size is varied over 32, 64, 128, and 256 bytes.
To prevent allocators from doing a bump allocation across the entire benchmark, a small perturbation is introduced where storage is allocated, held, and then released at infrequent points across the experiment.
A companion experiment is a manual simulation of @realloc@: @malloc@ new storage, copy the old data, and free the old storage.
Note, the @realloc@ simulation performs an equivalent perturbation to the @realloc@ benchmark each time through the loop.
The experiment is repeated 10,000 times for @realloc@ and 100 times for the simulation to obtain similar timing ranges.
The performance difference between the @realloc@ and @realloc@-simulation experiments shows whether @realloc@ is exploiting unused internal fragmentation at the end of its quantized bucket.

Figure~\ref{f:reallocGrowResults} shows the results for the @realloc@ and @realloc@-simulation benchmarks.
The difference between the benchmarks is two orders of magnitude, \ie all allocators are reusing some internal fragmentation to prevent a reallocation and copy as the array grows.
The large difference is the extra copying in the simulation case, which is expensive.
Within the @realloc@ benchmark, allocators glibc, hoard, jemalloc, and tbbmalloc have higher cost, while the remaining allocators have almost identical results.
Within the @realloc@-simulation benchmark, allocators glibc and tbbmalloc have higher cost, while the remaining allocators have almost identical results.
This benchmark confirms that @realloc@ can provide some level of performance benefit for dynamically growing data structures, \eg strings or arrays.
Therefore, encouraging its use is reasonable, if and only if it is safe to do so.
Note, this encouragement is apt for container developers, where low-level storage management is performed internally for the benefit of application users.
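For such container developers, a typical idiom is geometric growth, sketched below with a hypothetical @Vec@ container and @push@ routine; geometric growth amortizes the copy cost, and @realloc@ may avoid the copy entirely when the current bucket has unused space:
\begin{C++}
struct Vec { int * data; size_t size, cap; };	// illustrative container
void push( Vec & v, int x ) {
	if ( v.size == v.cap ) {	// full => grow geometrically
		v.cap = v.cap == 0 ? 8 : v.cap * 2;
		v.data = (int *)realloc( v.data, v.cap * sizeof( int ) );	// may extend in place
	}
	v.data[v.size] = x; v.size += 1;
}
\end{C++}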

\begin{figure}
\begin{C++}
for ( size_t p = 10; p <= 100; p += 10 ) {
	for ( size_t s = 64; s < 16 * 1024; s <<= 1 ) {
		bool reuse = false;
		void * prev = pass( malloc( s ) );
		void * curr = pass( realloc( prev, s * p / 100 ) );
		if ( prev == curr ) { /* print */ }
		free( curr );
	}
}
\end{C++}
\vspace*{-10pt}
\caption{\lstinline{realloc} Shrink Benchmark}
\label{f:reallocShrinkBenchmark}

\vspace*{10pt}

%\setlength{\tabcolsep}{15pt}
\begin{tabular}{@{}ll@{}}
\multicolumn{1}{c}{\lstinline{realloc}} & \multicolumn{1}{c}{\lstinline{realloc} simulation} \\
\begin{C++}
struct S { size_t ca[DIM]; }; // varied 32, 64, 128, 256
enum { Ssize = sizeof( S ) };
for ( size_t t = 0; t < @10$'$000@; t += 1 ) {
	S * sa = nullptr, * perturb = nullptr;
	for ( size_t i = 0, s = Ssize; i < 10$'$000; i += 1, s += Ssize ) {
		sa = (S *)@realloc( sa, s );@

		sa[i].ca[0] = i;
		if ( i % 1024 == 0 ) perturb = (S *)realloc( perturb, s );
	}
	free( sa );
	free( perturb );
}
\end{C++}
&
\begin{C++}
struct S { size_t ca[DIM]; }; // varied 32, 64, 128, 256
enum { Ssize = sizeof( S ) };
for ( size_t t = 0; t < @100@; t += 1 ) {
	S * sa = nullptr, * so = (S *)malloc( Ssize );
	for ( size_t i = 0, s = Ssize; i < 10$'$000; i += 1, s += Ssize ) {
		sa = (S *)@malloc( s )@; // simulate realloc
		memcpy( sa, so, s - Ssize ); // so is one element smaller
		sa[i].ca[0] = i;
		free( so );
		so = sa;
	}
	free( sa );
}
\end{C++}
\end{tabular}
\caption{\lstinline{realloc} Grow Benchmark}
\label{f:reallocGrowBenchmark}

\vspace*{20pt}

\hspace*{-17pt}
\setlength{\tabcolsep}{-13pt}
\begin{tabular}{@{}l@{\hspace*{-5pt}{\vrule height 1.05in}\hspace*{-5pt}}l@{}}
\begin{tabular}{@{}lll@{}}
\input{prolog.realloc.tex} & \input{swift.realloc.tex} & \input{java.realloc.tex}
\\
\multicolumn{3}{@{}c@{}}{\lstinline{realloc}, 10,000 repetitions}
\end{tabular}
&
\setlength{\tabcolsep}{-10pt}
\begin{tabular}{@{}lll@{}}
\input{prolog.reallocsim.tex} & \input{swift.reallocsim.tex} & \input{java.reallocsim.tex}
\\
\multicolumn{3}{@{}c@{}}{\lstinline{realloc} simulation, 100 repetitions}
\end{tabular}
\end{tabular}

\caption{\lstinline{realloc} Grow Results, x-axis in bytes, lower is better}
\label{f:reallocGrowResults}
\end{figure}


\subsection{Cache Benchmark}
\label{s:CacheBenchmark}

The cache benchmark attempts to detect false sharing (see Section~\ref{s:FalseSharing}).
Unfortunately, testing for allocator-induced false-sharing is difficult, because it is equivalent to searching for randomly conjoined allocations within a large storage space.
Figure~\ref{f:CacheBenchmark} shows a benchmark for program-induced false-sharing, where pointers are passed among threads.
As a side effect, this benchmark indirectly checks which allocator model is being used.
The program main runs the benchmark with 4, 8, 16, and 32 threads, passing each thread a separate array of @ASIZE@ elements dynamically allocated from its common heap.
Each thread then traverses the array adding a value to each element (read and write).
The traversal is repeated T times.
Each thread frees the array at the end.
The experiment is run with a small and a medium sized array.
If there is any heap sharing, the small array has a higher probability of false sharing, \eg the first and last elements of different arrays can be juxtaposed in memory, and hence appear in the same cache line.
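Note, program-induced false sharing can also be avoided at the program level, independent of the allocator model, by padding shared data to the cache-line size; a minimal sketch, assuming 64-byte cache lines and an illustrative @NTHREADS@:
\begin{C++}
struct alignas(64) PerThread { volatile size_t counter; };	// one cache line per thread
PerThread * arr = (PerThread *)aligned_alloc( 64, NTHREADS * sizeof( PerThread ) );
\end{C++}
The cost of this safety is internal fragmentation in every padded element, which is why the benchmark instead measures what allocators provide by default.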

\begin{figure}
\begin{C++}
enum { TIMES = 10$'$000$'$000$'$000, ASIZE = 3 }; $\C{// repetitions, array size 3 or 30}$
void * worker( void * arg ) { $\C{// array passed from program main}$
	volatile size_t * arr = (size_t *)arg; $\C{// volatile prevents code elision}$
	for ( size_t t = 0; t < TIMES / ASIZE; t += 1 ) $\C{// repeat experiment N times}$
		for ( size_t r = 0; r < ASIZE; r += 1 ) $\C{// iterate through array}$
			arr[r] += r; $\C{// read/write array elements}$
	free( (void *)arr ); $\C{// cast away volatile}$
}
\end{C++}
\vspace*{-5pt}
\caption{Cache False-Sharing Benchmark}
\label{f:CacheBenchmark}
\end{figure}

Figure~\ref{f:cacheResults} shows the results for the cache benchmark run with array sizes 3 and 30.
Allocators glibc, llheap, mimalloc, and tbbmalloc show little or no false-sharing issues at both array sizes, \ie all generate virtually the same result.
Note, on the Intel, there is a rise at 32 cores, because of the L3-cache shift at 16 cores; stepping to 32 cores introduces NUMA effects.
This result correlates with these allocators using a 1:1 allocator model.
Allocators hoard, jemalloc, and tcmalloc show false-sharing issues at both array sizes, reducing performance by 2 times at size 3.
The @perf@ performance analyzer shows a large number of cache misses for these allocators, indicating false sharing.
This result correlates with these allocators using some form of heap sharing.

\begin{figure}
\setlength{\tabcolsep}{-8pt}
\begin{tabular}{@{}l@{\hspace*{-5pt}{\vrule height 1.05in}\hspace*{-5pt}}l@{}}
\begin{tabular}{@{}lll@{}}
\input{prolog.cacheS.tex} & \input{swift.cacheS.tex} & \input{java.cacheS.tex}
\\
\multicolumn{3}{@{}c@{}}{3 Element Array}
\end{tabular}
&
\begin{tabular}{@{}lll@{}}
\input{prolog.cacheL.tex} & \input{swift.cacheL.tex} & \input{java.cacheL.tex}
\\
\multicolumn{3}{@{}c@{}}{30 Element Array}
\end{tabular}
\end{tabular}
\caption{Cache False-Sharing Results, x-axis in cores, lower is better}
\label{f:cacheResults}
\end{figure}


\subsection{Ownership Benchmark}

% In multi-threaded allocators with H:T or 1:1 structure, one thread can allocate storage, send it to another thread, and the receiving thread deallocates it.
% This raises the question of where the storage is returned: the heap (area) from which it was allocated or a different heap;
% in some cases there is no choice, when storage is bound to its allocation area.
% If storage is returned to its allocation heap, there are concurrency issues if the allocation area is shared.
% If the storage is returned to another heap, there can still be concurrency issues, but the real problem is storage drain in the allocation heap and storage bloat in the deallocation heap, without a secondary mechanism to redistribute storage.
% This choice is the \newterm{ownership problem}.

Historically, the Larson benchmark~\cite{larson99memory} is purported to test for ownership issues, but in actuality, the benchmark is a complex simulation of a server environment.
Multiple threads allocate and free a number of random-sized objects within a size range.
Each thread runs for a time period, and at termination, creates a child thread and passes its array of objects as an argument, which does not require synchronization.
The number of thread generations varies with thread speed.
% It calculates memory operations per second as an indicator of the memory allocator's performance.
Because the benchmark performs multiple kinds of tests, it is impossible to extract just the remote-free rate.

Therefore, a new benchmark is created to measure the asynchronous transfer cost from the deallocating to the allocating thread (remote free).
However, the allocating thread must first asynchronously transfer the allocations to the deallocating thread.
This cost needs to be mitigated so it does not mask the remote-free measurement.
To accomplish this, a thread batches its allocations (in lots of 100), and atomically exchanges this batch with a freeing thread, which then individually frees the batch components.
Hence, the cost of the asynchronous allocation transfer is much less than the individual cost of the remote frees.

Figure~\ref{f:OwnershipBenchmark} shows the pseudo-code for the benchmark.
There is a global matrix of allocation addresses: one row for each thread and one column for each batch.
Each thread starts at a specific row and fills that row with two different sized allocations.
A thread then loops until it atomically exchanges its row pointer with another thread's row pointer.
The storage in the received batch is then remote freed, the batch row is refilled with new allocations, and the process repeats for a timed duration.
As well, after each allocation, an integer is written into the storage, and that integer is read before the deallocation.

Figure~\ref{f:Ownership} (a)--(c) shows the throughput of the ownership benchmark.
The results are divided into three groups.
glibc and tbbmalloc are slowest because of many system calls to @futex@. % and @nano_sleep@.
Figure~\ref{f:Ownership}~(d) shows the system time climbing during scaling on the AMD;
the other architectures are similar.
llheap and mimalloc are next, as these allocators do not batch remote frees, so every free requires locking.
jemalloc, hoard, and tcmalloc are fastest, as these allocators batch remote frees, reducing locking.
For 1:1 allocators, eager remote return makes sense as the returned storage can be reused during the owning thread's lifetime.
For N:T allocators, lazy remote return using batching makes sense as heaps outlive threads, so eventually returned storage can be used by any existing or new thread.
Batching is possible for 1:1 allocators, but results in complexity and external fragmentation, which is only warranted in certain cases.

\begin{figure}
\begin{cfa}
void * batches[MaxThread][MaxBatch]; $\C{// thread global}$
struct Aligned { CALIGN void * * col; };
volatile Aligned allocations[MaxThread];

Aligned batch = { batches[id] }; $\C{// thread local}$
size_t cnt = 0, a = 0;
for ( ; ! stop; ) { $\C{// loop for T seconds}$
	for ( ssize_t i = Batch - 1; i >= 0; i -= 1 ) { $\C{// allocations, opposite order from frees}$
		batch.col[i] = malloc( i & 1 ? 42 : 192 ); $\C{// two allocation sizes}$
		*(int *)batch.col[i] = 42; $\C{// write storage}$
	}
	Aligned obatch = batch;
	while ( (batch.col = Fas( allocations[a].col, batch.col )) == obatch.col || batch.col == nullptr ) { // atomic exchange
		if ( stop ) goto fini;
		a = (a + 1) % Threads; $\C{// try another batch}$
	}
	for ( size_t i = 0; i < Batch; i += 1 ) { $\C{// deallocations}$
		if ( *(int *)batch.col[i] != 42 ) abort(); $\C{// read storage check}$
		free( batch.col[i] ); $\C{// remote free}$
	}
	cnt += Batch; $\C{// sum allocations/frees}$
	a = (a + 1) % Threads; $\C{// try another batch}$
} fini: ;
\end{cfa}
\caption{Ownership Benchmark Outline}
\label{f:OwnershipBenchmark}
\end{figure}

\begin{figure}
\hspace*{-14pt}
\setlength{\tabcolsep}{-13pt}
\begin{tabular}{@{}lll@{\hspace*{-6pt}{\vrule height 2.05in}\hspace*{-6pt}}l@{}}
\input{prolog.ownership.tex}
&
\input{swift.ownership.tex}
&
\input{java.ownership.tex}
&
\input{swift.ownershipres.tex}
\end{tabular}
\caption{Ownership Results, x-axis is cores, (a)--(c) higher is better, (d) lower is better}
\label{f:Ownership}
\end{figure}


\subsection{Delay Benchmark}

The delay benchmark is a torture test of abrupt allocation patterns looking for delays that increase latency.
A flat response across the tests means there are few or no allocator-induced pauses.
The test examines small and large requests, where small requests are handled by the heap (@sbrk@) and large requests are handled by the OS (@mmap@).
Putting large requests in the heap causes external fragmentation when they are freed, unless an allocator subdivides the space, leading to pauses.
The @mallopt@ function provides the option @M_MMAP_THRESHOLD@ to set the division point in bytes for requests that cannot be satisfied by an allocator's free list.
Each @sbrk@ test in this benchmark is repeated 5,000,000,000 times and each @mmap@ test is performed 1,000,000 times;
the smaller repetition count compensates for the high cost of the OS calls, which would otherwise make the experiment run too long.
A \emph{long running} experiment, rather than short experiments with averaged results, searches for blowup scenarios in time and/or space.
Finally, scaling is tested with 4, 8, 16, and 32 pinned threads, where the threads synchronize between tests using a @pthread@ barrier.
In all experiments, allocated storage has its first and last byte assigned a character to simulate usage.
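For reference, the threshold is set once before the timed tests; with glibc-style @mallopt@, the call is simply:
\begin{C++}
#include <malloc.h>
mallopt( M_MMAP_THRESHOLD, 524288 ); $\C{// requests $\geq$ 512K go directly to mmap}$
\end{C++}
Note, an allocator is free to ignore this hint, as the @mmap@ results below show.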
|
---|
1877 |
|
---|
1878 | The tests are performed in this order:
|
---|
1879 | \begin{enumerate}[leftmargin=18pt,topsep=3pt,itemsep=2pt,parsep=0pt]
|
---|
1880 | \item
|
---|
1881 | @x = malloc( 0 ) / free( x )@:
|
---|
1882 | handles the pathological case of an zero-sized allocation and free.
|
---|
1883 | The POSIX standard allows two meanings for this case: return @NULL@ or a unique pointer, where both can be freed.
|
---|
1884 | The fastest implementation is to return @NULL@, rather than create a fictitious allocation.
|
---|
1885 | However, this overloads the @malloc@ return-value to mean error or a zero-sized allocation.
|
---|
1886 | To comply with the POSIX standard, the check for running out of memory is:
|
---|
1887 | \begin{uC++}
|
---|
1888 | if ( malloc( 0 ) == NULL && errno == ENOMEM ) ... // no memory
|
---|
1889 | \end{uC++}
|
---|
1890 | Unfortunately, most programmers assume @NULL@ means an error, \eg two tests in the SPEC CPU benchmark fail if @NULL@ is returned for a zero-sized allocation.
|
---|
1891 | Hence, returning @NULL@ for a zero-sized allocation is an impractical allocator option.
|
---|
1892 |
|
---|
1893 | \item
|
---|
1894 | @free( NULL )@: handles the pathological case of freeing a non-existent or zero-byte allocation.
|
---|
1895 | Non-existent allocations occur as algorithm base-cases, such as an unused pointer set to @NULL@.
|
---|
1896 | Having the allocator ignore this case eliminates checking for an erroneous @free@ call on a @NULL@ value.
|
---|
1897 | This call should be fast.
|
---|
1898 |
|
---|
1899 | \item
|
---|
1900 | \label{expS}
|
---|
1901 | @x = malloc( 42 ) / free( x )@:
|
---|
1902 | handles a fixed-sized allocation and free.
|
---|
1903 |
|
---|
1904 | \item
|
---|
1905 | @x[0..100) = malloc( 42 ) / free( x[0..100) )@:
|
---|
1906 | handles a group of fixed-sized allocations and group free.
|
---|
1907 |
|
---|
1908 | \item
|
---|
1909 | @x[0..1000) = malloc( 42 ) / free( x[0..1000) )@:
|
---|
1910 | handles a larger group of fixed-sized allocations and group free.
|
---|
1911 |
|
---|
1912 | \item
|
---|
1913 | @x[0..100) = malloc( 42 ) / free( x(100..0] )@:
|
---|
1914 | handles a group of fixed-sized allocations and group free in reverse order.
|
---|
1915 |
|
---|
1916 | \item
|
---|
1917 | \label{expE}
|
---|
1918 | @x[0..1000) = malloc( 42 ) / free( x(1000..0] )@:
|
---|
1919 | handles a larger group of fixed-sized allocations and group free in reverse order.
|
---|
1920 |
|
---|
1921 | \item
|
---|
1922 | @x = malloc( [0..100) ) / free( x )@:
|
---|
1923 | handles a variable-sized allocation and free.
|
---|
1924 |
|
---|
1925 | \item
|
---|
1926 | @x[0..100) = malloc( [0..100) ) / free( x[0..100) )@:
|
---|
1927 | handles a group of variable-sized allocations and group free.
|
---|
1928 |
|
---|
1929 | \item
|
---|
1930 | @x[0..1000) = malloc( [0..1000) ) / free( x[0..1000) )@:
|
---|
1931 | handles a larger group of variable-sized allocations and group free.
|
---|
1932 |
|
---|
1933 | \item
|
---|
1934 | @x[0..100) = malloc( [0..100) ) / free( x(100..0] )@:
|
---|
1935 | handles a group of variable-sized allocations and group free in reverse order.
|
---|
1936 |
|
---|
1937 | \item
|
---|
1938 | @x[0..1000) = malloc( [0..1000) ) / free( x(1000..0] )@:
|
---|
1939 | handles a larger group of variable-sized allocations and group free in reverse order.
|
---|
1940 | \end{enumerate}
|
---|
1941 | Experiments \ref{expS}--\ref{expE} are repeated with a fixed-sized allocation of 1,048,576, where @M_MMAP_THRESHOLD@ is set to 524,288 to force the use of @mmap@, resulting in 17 experiments.
|
---|
1942 | Because the @mmap@ experiments test the operating-system memory-management not the allocators, the variable-sized @mmap@ experiments are deemed unnecessary.
|
---|
1943 | A test with random-sized @sbrk@ allocations @malloc( [0..N) random )@ was performed, but the results are the same as fixed sized as all the allocation sizes are quickly accessed over the large number of experiment repetitions.
|
---|
1944 | That is, once the buckets or superblocks for the allocation sizes are created, access order is irrelevant.
|
---|
1945 |
|
---|
1946 | Figures~\ref{f:LatencyExpARM}--\ref{f:LatencyExpIntel} show the results of the @sbrk@ and @mmap@ experiments across the seven allocators with parallel scaling.
|
---|
1947 | The average of the N threads is graphed for each experiment and the standard deviation is the error bar.
|
---|
1948 | For the @sbrk@ graphs, a good allocator result should be low (smaller is better), flat across scaling (cores), with no error bars (STD $\approx$ 0) indicating no jitter (pauses) among the threads.
|
---|
1949 | The result patterns across the three hardware architectures are similar, with differences correlating to CPU speed and cache differences.
|
---|
1950 |
|
---|
1951 | The key observation across the @sbrk@ graphs is that llheap and mimalloc are always at the bottom (lower is better) and flat with respect to scaling.
|
---|
1952 | The only exception is on the Intel, where all allocators experienced similar non-flat behaviour, because of the L3 cache shift at 16 cores.
|
---|
1953 | Some anomalies are tcmalloc and hoard experiencing large jitter (see error bars) and scaling issues in some experiments, which is correlated with poorer results;
|
---|
1954 | jemalloc has significant scaling issues for experiments 5, 7, 10, and 12, resulting from large numbers of @futex@ calls, possibly related to @madvise@ for returning storage to the OS;
|
---|
1955 | and glibc and tbbmalloc are often slower than the other allocators (symbols are on top of each other).
|
---|
1956 |
|
---|
1957 | The key observation across the @mmap@ graphs is that only three allocators, glibc, llheap, and tbbmalloc honoured the @mmap@ threshold request (symbols are on top of each other).
|
---|
1958 | The other allocators made no @mmap@ calls, so their results are extremely low.
|
---|
1959 | The exception is hoard, which did make @mmap@ calls that were uncorrelated with @M_MMAP_THRESHOLD@, and had significant jitter due to a large number of @futex@ calls.
|
---|
1960 | For the allocators using @mmap@, there should be some scaling effect as more threads make more system calls.
|
---|
1961 |
|
---|
1962 | \begin{figure}
|
---|
1963 | \input{prolog.tex}
|
---|
1964 | \vspace*{-20pt}
|
---|
1965 | \caption{Delay Results, ARM, x-axis is cores, lower is better}
|
---|
1966 | \label{f:LatencyExpARM}
|
---|
1967 | \end{figure}
|
---|
1968 |
|
---|
1969 | \begin{figure}
|
---|
1970 | \input{swift.tex}
|
---|
1971 | \vspace*{-20pt}
|
---|
1972 | \caption{Delay Results, AMD, x-axis is cores, lower is better}
|
---|
1973 | \label{f:LatencyExpAMD}
|
---|
1974 | \end{figure}
|
---|
1975 |
|
---|
1976 | \begin{figure}
|
---|
1977 | \input{java.tex}
|
---|
1978 | \vspace*{-20pt}
|
---|
1979 | \caption{Delay Results, Intel, x-axis is cores, lower is better}
|
---|
1980 | \label{f:LatencyExpIntel}
|
---|
1981 | \end{figure}
|
---|
1982 |
|
---|
1983 | Figures~\ref{f:LatencyResARM}--\ref{f:LatencyResIntel} show a time/space perspective across the entire experiment.
|
---|
1984 | The user, system, and real times along with the maximum memory usage are presented for the @sbrk@ and @mmap@ experiments.
|
---|
1985 | The result patterns across the three hardware architectures are similar.
|
---|
1986 | If an allocator disappears in a graph, its result is less than 1 on a logarithmic scale.
|
---|
1987 | Surprisingly, there are large (2 orders of magnitude) time differences among the allocators.
|
---|
1988 |
|
---|
1989 | \begin{figure}
|
---|
1990 | \hspace*{15pt}
|
---|
1991 | \input{prolog2.tex}
|
---|
1992 | \vspace*{-20pt}
|
---|
1993 | \caption{Delay Results, ARM, x-axis is cores, lower is better}
|
---|
1994 | \label{f:LatencyResARM}
|
---|
1995 |
|
---|
1996 | \hspace*{15pt}
|
---|
1997 | \input{swift2.tex}
|
---|
1998 | \vspace*{-20pt}
|
---|
1999 | \caption{Delay Results, AMD, x-axis is cores, lower is better}
|
---|
2000 | \label{f:LatencyResAMD}
|
---|
2001 |
|
---|
2002 | \hspace*{15pt}
|
---|
2003 | \input{java2.tex}
|
---|
2004 | \vspace*{-20pt}
|
---|
2005 | \caption{Delay Results, Intel, x-axis is cores, lower is better}
|
---|
2006 | \label{f:LatencyResIntel}
|
---|
2007 | \end{figure}
|
---|
2008 |
|
---|
2009 | For @sbrk@ graphs, the user time should be high and scale with cores, the system time very low, the real time constant, and the maximum memory scales with cores.
|
---|
2010 | For user time, llheap and mimalloc, are at the bottom (lower is better) and all allocators have linear scaling as cores increase.
|
---|
2011 | The remaining allocators are slower by one to two orders of magnitude, which correlates with high results in the experiments.
|
---|
2012 | For system time jemalloc has non-trivial system time that scales with cores, caused by a large number of @futex@ calls.
|
---|
2013 | The remaining allocators have virtually zero system time (not on graph).
|
---|
2014 | The exception is a random anomaly where allocators had small amounts of system time, which appeared/disappeared on different experiment runs as if something slightly perturbs the experiment (OS?) over its 20 hour run.
|
---|
2015 | For real time, llheap and mimalloc, take the least overall time and all allocators except jemalloc have flat performance.
|
---|
2016 | For maximum memory, all allocators scale with cores, and there is a rough inverse correlation between user time and memory usage, \ie time \vs speed tradeoff.
|
---|

For the @mmap@ graphs, used only by glibc, llheap, and tbbmalloc, the user time should be low and scale with cores, the system time should be high and scale with cores, the real time should be constant, and the maximum memory should scale with cores.
For user time, glibc, llheap, and tbbmalloc are at the bottom because there are no @sbrk@ requests.
The remaining allocators all use a non-trivial amount of time handling the large requests, except mimalloc, which handles a large request identically to a small one.
Interestingly, the amount of time varies by one to two orders of magnitude.
For system time, glibc, llheap, and tbbmalloc are at the top because of the OS calls to @mmap@.
Interestingly, the remaining allocators still use orders of magnitude more system time, except mimalloc ($<$ 1, so invisible).
For real time, all allocators scale linearly with cores, except mimalloc, which is flat.
For maximum memory, all allocators scale with cores, and there is a rough inverse correlation between user time and memory usage, \ie a time \vs space tradeoff.


\subsection{Out of Memory Benchmark}

Figure~\ref{f:OutMemoryBenchmark} shows a \CC program with unbounded memory allocation.
The program is run in a shell with a restricted data size.
Hence, it quickly runs out of memory, causing @malloc@, which is called by \CC @new@, to return a @nullptr@ with @errno@ set to @ENOMEM@.
Routine @new@ sees the @nullptr@ and calls the handler routine set by @set_new_handler@, which prints a message and resets the default handler to raise the @bad_alloc@ exception caught in the program main.
Note, raising an exception normally requires dynamic allocation, but \CC preallocates a few special exceptions, like @bad_alloc@, for such cases.

All allocators printed the correct output except hoard, mimalloc, and tcmalloc.
Hoard prints @MAP_FAILED@ and hangs spinning on a spinlock in a complex call chain.
mimalloc aborts the program because it incorrectly attempts to raise the @bad_alloc@ exception itself, which works if and only if it is compiled with \CC, whereas it is compiled with C.
The correct design is to return a @nullptr@ with @errno@ set to @ENOMEM@ to \CC @new@, which then raises the exception;
hence, the allocator can be compiled with either C or \CC.
tcmalloc prints the correct output but adds extraneous ``allocation failed'' messages.

\begin{figure}
\begin{tabular}{@{\hspace*{\parindentlnth}}l@{\hspace*{2\parindentlnth}}l@{}@{}}
\begin{cfa}
static void handler() {
	cout << "Memory allocation failed\n";
	set_new_handler( nullptr );
}


\end{cfa}
&
\begin{cfa}
int main() {
	set_new_handler( handler );
	try {
		for ( ;; ) pass( new char[50] );	// unbounded allocation
	} catch( const bad_alloc & e ) { cout << e.what() << endl; }
}
\end{cfa}
\end{tabular}
\caption{Out of Memory Benchmark}
\label{f:OutMemoryBenchmark}
\end{figure}


\section{Conclusion}

The goal of this work is to build a full-featured, low-latency (or high-bandwidth) memory allocator for both KT and UT multi-threading systems that is competitive with the best current memory allocators, while extending the feature set of existing and new allocator routines.
The new llheap allocator achieves all of these goals, while maintaining and managing sticky allocation information \emph{without a performance loss}.
Hence, it is possible to use @realloc@ frequently as a safe operation, rather than just occasionally or not at all.
Furthermore, the ability to query sticky properties and other information allows programmers to write safer programs, as it is possible to dynamically match the allocation styles of unknown library routines that return allocations.

Extending the C allocation API with @resize@, an advanced @realloc@, @aalloc@, @amemalign@, @cmemalign@, and other alignment variations means programmers do not have to compose these allocation operations themselves.
The ability of the type systems in modern languages, \eg \CFA, to condense the allocation API into one routine with completely orthogonal allocation properties shows how far the allocation API can be advanced.
The result is increased safety and a cognitive reduction in performing dynamic allocation.
All of these extensions should eliminate common reasons for C programmers to roll their own memory allocator and/or allocation functions, which is a huge safety advantage.

The ability to compile llheap with static/dynamic linking and optional statistics/debugging provides programmers with multiple mechanisms to balance performance and safety.
These allocator versions are easy to use because they can be linked to an application without recompilation.
Providing comprehensive statistics for all allocation operations is invaluable in understanding and debugging a program's dynamic behaviour.
No other memory allocator provides such comprehensive statistics gathering.
This capability was used extensively during the development of llheap to verify its behaviour, and to verify the benchmarks developed for this paper.
As well, the debugging mode, where allocations are checked along with internal pre/post-conditions and invariants, is extremely useful, especially for students ($\approx$1,000 students have tested the \uC version of llheap).
While not as powerful as the @valgrind@ interpreter, llheap's debugging mode can detect a large number of allocation mistakes.
The contention-free statistics gathering and debugging have a low enough cost to be used in production code.
Finally, no other memory allocator addresses the needs of user-level threading, which is now available in many modern languages.

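Because the dynamic-linking versions require no recompilation, swapping allocators for an existing binary reduces to a loader directive; the install path, library names, and program name below are hypothetical, illustrative only.

```shell
# Run an unmodified, already-compiled program with the shared allocator
# swapped in via the dynamic linker (path is hypothetical):
LD_PRELOAD=/usr/local/lib/libllheap.so ./app

# Or link the desired variant at build time without changing source,
# e.g. a statistics or debugging build (library names are hypothetical):
# cc app.o -L/usr/local/lib -lllheap
```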
Creating a benchmark test-suite for comparing allocators, rather than relying on a suite of arbitrary programs, has been an interesting challenge.
The purpose of these performance tests is not to pick winners and losers among the allocators, because each allocator optimizes for a particular set of allocation patterns: there is no optimal memory-allocator.
The goal is to demonstrate that llheap's performance, both in time and space, across some interesting allocation patterns, is comparable to the best allocators in use today.
Admittedly, there are pathological cases where llheap might use significant amounts of memory because it never coalesces or returns storage to the OS.
These pathological cases do not correspond to long-running applications, where llheap can perform very well.
In the small set of tested benchmarks, no heap blowup was observed, while some tests caused time blowups in other allocators.
Therefore, llheap is a viable drop-in replacement for many applications, and its ancillary features make it safer and more informative.


\subsection{Recommendations}

Substantial work has been put into building a new allocator and benchmarks, plus doing comprehensive performance tests among allocators.
Based on this work, we make two recommendations:
\begin{enumerate}[leftmargin=*, topsep=0pt,itemsep=0pt,parsep=0pt]
\item
Hoard is no longer maintained and did not do well (and even broke) in some performance experiments.
We recommend that those doing memory-allocation research not use it.
\item
glibc did not perform as well as the other allocators.
Given it is the default memory allocator for many academic and industry applications, this is unfortunate and skews performance results, so developers may draw incorrect conclusions.
As such, we recommend the adoption of a newer memory allocator for glibc.
We offer llheap for the reasons given above, but most importantly, for its small code base.
glibc maintainers come and go.
Therefore, it is crucial for a new maintainer to on-board quickly and have a thorough understanding of the code base within a month.
The llheap code base is small and can be learned quickly because of its simple design, making it an ideal choice as a substitute allocator.
\end{enumerate}


\section{Acknowledgements}

This research is funded by the NSERC/Waterloo-Huawei (\url{http://www.huawei.com}) Joint Innovation Lab. %, and Peter Buhr is partially funded by the Natural Sciences and Engineering Research Council of Canada.
% Special thanks to Trevor Brown for many helpful discussions.

\bibliographystyle{ACM-Reference-Format}
\bibliography{pl,local}

\end{document}
\endinput

% Local Variables: %
% tab-width: 4 %
% fill-column: 120 %
% compile-command: "make" %
% End: %