Changeset fa59c40
- Timestamp:
- Jan 18, 2025, 3:32:44 PM (8 months ago)
- Branches:
- master
- Children:
- 8e90fd6
- Parents:
- 9ae4f5f
- File:
-
- 1 edited
doc/papers/llheap/Paper.tex
r9ae4f5f rfa59c40 187 187 \author[1]{Peter A. Buhr*} 188 188 \author[2]{Bryan Chan} 189 \author[3]{Dave Dice} 189 190 \authormark{ZULFIQAR \textsc{et al.}} 190 191 191 192 \address[1]{\orgdiv{Cheriton School of Computer Science}, \orgname{University of Waterloo}, \orgaddress{\state{Waterloo, ON}, \country{Canada}}} 192 193 \address[2]{\orgdiv{Huawei Compiler Lab}, \orgname{Huawei}, \orgaddress{\state{Markham, ON}, \country{Canada}}} 194 \address[3]{\orgdiv{Oracle Labs}, \orgname{Oracle}, \orgaddress{\state{Burlington, MA}, \country{USA}}} 195 193 196 194 197 \corres{*Peter A. Buhr, Cheriton School of Computer Science, University of Waterloo, 200 University Avenue West, Waterloo, ON N2L 3G1, Canada. \email{pabuhr{\char`\@}uwaterloo.ca}} … … 201 204 llheap extends the feature set of existing C allocation by remembering zero-filled (\lstinline{calloc}) and aligned properties (\lstinline{memalign}) in an allocation. 202 205 These properties can be queried, allowing programmers to write safer programs by preserving these properties in future allocations. 203 As well, \lstinline{realloc} preserves these properties when adjusting storage size, again increasing future allocation safety. 206 As well, \lstinline{realloc}/\lstinline{reallocarray} preserve these properties when adjusting storage size, again increasing future allocation safety. 204 207 llheap also extends the C allocation API with \lstinline{aalloc}, \lstinline{amemalign}, \lstinline{cmemalign}, \lstinline{resize}, and extended \lstinline{realloc}, providing orthogonal access to allocation features; 205 208 hence, programmers do not have to code missing combinations.
… … 226 229 \section{Introduction} 227 230 228 Memory management services a series of program allocation/deallocation requests and attempts to satisfy them from a variable-sized block of memory, while minimizing total memory usage. 229 A general-purpose dynamic-allocation algorithm cannot anticipate allocation requests so its time and space performance is rarely optimal. 230 However, allocators take advantage of regular allocation patterns in typical programs to produce excellent results, both in time and space (similar to LRU paging). 231 Allocators use a number of similar techniques, but each optimizes specific allocation patterns. 232 Nevertheless, allocators are a series of compromises, occasionally with some static or dynamic tuning parameters to optimize specific program-request patterns. 231 Memory management services a series of program allocation/deallocation requests and attempts to satisfy them from a variable-sized block(s) of memory, while minimizing total memory usage. 232 A general-purpose dynamic-allocation algorithm cannot anticipate allocation requests so its time and space performance is rarely optimal (bin packing). 233 However, allocators take advantage of allocation patterns in typical programs (heuristics) to produce excellent results, both in time and space (similar to LRU paging). 234 Allocators use similar techniques, but each optimizes specific allocation patterns. 235 Nevertheless, allocators are a series of compromises, occasionally with some static or dynamic tuning parameters to optimize specific request patterns.
233 236 234 237 … … 283 286 \begin{enumerate}[leftmargin=*,itemsep=0pt] 284 287 \item 285 Implementation of a new stand-alone concurrent low-latency memory-allocator ($\approx$1, 200 lines of code) for C/\CC programs using kernel threads (1:1 threading), and specialized versions for the concurrent languages \uC~\cite{uC++} and \CFA~\cite{Moss18,Delisle21} using user-level threads running on multiple kernel threads (M:N threading).288 Implementation of a new stand-alone concurrent low-latency memory-allocator ($\approx$1,400 lines of code) for C/\CC programs using kernel threads (1:1 threading), and specialized versions for the concurrent languages \uC~\cite{uC++} and \CFA~\cite{Moss18,Delisle21} using user-level threads running on multiple kernel threads (M:N threading). 286 289 287 290 \item … … 289 292 290 293 \item 291 Use the preserved zero fill and alignment as \emph{sticky} properties for @realloc@ to zero-fill and align when storage is extended or copied.294 Use the preserved zero fill and alignment as \emph{sticky} properties for @realloc@ and @reallocarray@ to zero-fill and align when storage is extended or copied. 292 295 Without this extension, it is unsafe to @realloc@ storage these allocations if the properties are not preserved when copying. 293 296 This silent problem is unintuitive to programmers and difficult to locate because it is transient. … … 295 298 \item 296 299 Provide additional heap operations to make allocation properties orthogonally accessible. 
297 \begin{itemize}[topsep= 2pt,itemsep=2pt,parsep=0pt]298 \item 299 @aalloc( dim , elemSize )@ same as @calloc@ except memory is \emph{not} zero filled.300 \item 301 @amemalign( alignment, dim , elemSize )@ same as @aalloc@ with memory alignment.302 \item 303 @cmemalign( alignment, dim , elemSize )@ same as @calloc@ with memory alignment.300 \begin{itemize}[topsep=0pt,itemsep=0pt,parsep=0pt] 301 \item 302 @aalloc( dimension, elemSize )@ same as @calloc@ except memory is \emph{not} zero filled, which is significantly faster than @calloc@. 303 \item 304 @amemalign( alignment, dimension, elemSize )@ same as @aalloc@ with memory alignment. 305 \item 306 @cmemalign( alignment, dimension, elemSize )@ same as @calloc@ with memory alignment. 304 307 \item 305 308 @resize( oaddr, size )@ re-purpose an old allocation for a new type \emph{without} preserving fill or alignment. 306 309 \item 307 @resize( oaddr, alignment, size )@ re-purpose an old allocation with new alignment but \emph{without} preserving fill. 308 \item 309 @realloc( oaddr, alignment, size )@ same as @realloc@ but adding or changing alignment. 310 @aligned_resize( oaddr, alignment, size )@ re-purpose an old allocation with new alignment but \emph{without} preserving fill. 311 \item 312 @aligned_realloc( oaddr, alignment, size )@ same as @realloc@ but adding or changing alignment. 313 \item 314 @aligned_reallocarray( oaddr, alignment, dimension, elemSize )@ same as @reallocarray@ but adding or changing alignment. 310 315 \end{itemize} 311 316 312 317 \item 313 318 Provide additional query operations to access information about an allocation: 314 \begin{itemize}[topsep= 3pt,itemsep=2pt,parsep=0pt]319 \begin{itemize}[topsep=0pt,itemsep=0pt,parsep=0pt] 315 320 \item 316 321 @malloc_alignment( addr )@ returns the alignment of the allocation. 317 322 If the allocation is not aligned or @addr@ is @NULL@, the minimal alignment is returned. 
318 323 \item 319 @malloc_zero_fill( addr )@ returns a boolean result indicating if the memory is allocated with zero fill, e.g.,by @calloc@/@cmemalign@.324 @malloc_zero_fill( addr )@ returns a boolean result indicating if the memory is allocated with zero fill, \eg by @calloc@/@cmemalign@. 320 325 \item 321 326 @malloc_size( addr )@ returns the size of the memory allocation. 322 327 \item 323 @malloc_usable_size( addr )@ returns the usable (total) size of the memory, i.e.,the bin size containing the allocation, where @malloc_size( addr )@ $\le$ @malloc_usable_size( addr )@.328 @malloc_usable_size( addr )@ returns the usable (total) size of the memory, \ie the bin size containing the allocation, where @malloc_size( addr )@ $\le$ @malloc_usable_size( addr )@. 324 329 \end{itemize} 325 330 326 331 \item 327 332 Provide optional extensive, fast, and contention-free allocation statistics to understand allocation behaviour, accessed by: 328 \begin{itemize}[topsep= 3pt,itemsep=2pt,parsep=0pt]333 \begin{itemize}[topsep=0pt,itemsep=0pt,parsep=0pt] 329 334 \item 330 335 @malloc_stats()@ print memory-allocation statistics on the file-descriptor set by @malloc_stats_fd@ (default @stderr@). … … 359 364 The management goals are to make allocation/deallocation operations as fast as possible while densely packing objects to make efficient use of memory. 360 365 Since objects in C/\CC cannot be moved to aid the packing process, only adjacent free storage can be \newterm{coalesced} into larger free areas. 361 The allocator grows or shrinks the dynamic-allocation zone to obtain storage for objects and reduce memory usage via operating-systemcalls, such as @mmap@ or @sbrk@ in UNIX.366 The allocator grows or shrinks the dynamic-allocation zone to obtain storage for objects and reduce memory usage via OS calls, such as @mmap@ or @sbrk@ in UNIX. 
362 367 363 368 … … 984 989 That is, rather than requesting new storage for a single object, an entire buffer is requested from which multiple objects are allocated later. 985 990 Any heap may use an allocation buffer, resulting in allocation from the buffer before requesting objects (containers) from the global heap or OS, respectively. 986 The allocation buffer reduces contention and the number of global/ operating-systemcalls.991 The allocation buffer reduces contention and the number of global/OS calls. 987 992 For coalescing, a buffer is split into smaller objects by allocations, and recomposed into larger buffer areas during deallocations. 988 993 … … 1021 1026 1022 1027 1023 \section{Allocator} 1024 \label{c:Allocator} 1025 1026 This section presents a new stand-alone concurrent low-latency memory-allocator ($\approx$1,200 lines of code), called llheap (low-latency heap), for C/\CC programs using kernel threads (1:1 threading), and specialized versions of the allocator for the programming languages \uC and \CFA using user-level threads running over multiple kernel threads (M:N threading). 1027 The new allocator fulfills the GNU C Library allocator API~\cite{GNUallocAPI}. 1028 1029 1030 \subsection{llheap} 1031 1032 The primary design objective for llheap is low-latency across all allocator calls independent of application access-patterns and/or number of threads, \ie very seldom does the allocator have a delay during an allocator call. 1028 \section{llheap} 1029 1030 This section presents our new stand-alone, concurrent, low-latency memory-allocator, called llheap (low-latency heap), fulfilling the GNU C Library allocator API~\cite{GNUallocAPI} for C/\CC programs using kernel threads (1:1 threading), with specialized versions for the programming languages \uC and \CFA using user-level threads running over multiple kernel threads (M:N threading). 
1031 The primary design objective for llheap is low-latency across all allocator calls independent of application access-patterns and/or number of threads, \ie very seldom does the allocator delay during an allocator call. 1033 1032 Excluded from the low-latency objective are (large) allocations requiring initialization, \eg zero fill, and/or data copying, which are outside the allocator's purview. 1034 1033 A direct consequence of this objective is very simple or no storage coalescing; 1035 1034 hence, llheap's design is willing to use more storage to lower latency. 1036 This objective is apropos because systems research and industrial applications are striving for low latency and computers have huge amounts of RAM memory.1035 This objective is apropos because systems research and industrial applications are striving for low latency and modern computers have huge amounts of RAM memory. 1037 1036 Finally, llheap's performance should be comparable with the current best allocators, both in space and time (see performance comparison in Section~\ref{c:Performance}). 1038 1037 1039 % The objective of llheap's new design was to fulfill following requirements:1040 % \begin{itemize}1041 % \item It should be concurrent and thread-safe for multi-threaded programs.1042 % \item It should avoid global locks, on resources shared across all threads, as much as possible.1043 % \item It's performance (FIX ME: cite performance benchmarks) should be comparable to the commonly used allocators (FIX ME: cite common allocators).1044 % \item It should be a lightweight memory allocator.1045 % \end{itemize}1046 1047 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%1048 1038 1049 1039 \subsection{Design Choices} 1050 1040 1051 % Some of the rejected designs are discussed because they show the path to the final design (see discussion in Section~\ref{s:MultipleHeaps}). 
1052 % Note, a few simple tests for a design choice were compared with the current best allocators to determine the viability of a design. 1053 1054 1055 % \paragraph{T:1 model} 1056 % Figure~\ref{f:T1SharedBuckets} shows one heap accessed by multiple kernel threads (KTs) using a bucket array, where smaller bucket sizes are shared among N KTs. 1057 % This design leverages the fact that usually the allocation requests are less than 1024 bytes and there are only a few different request sizes. 1058 % When KTs $\le$ N, the common bucket sizes are uncontented; 1059 % when KTs $>$ N, the free buckets are contented and latency increases significantly. 1060 % In all cases, a KT must acquire/release a lock, contented or uncontented, along the fast allocation path because a bucket is shared. 1061 % Therefore, while threads are contending for a small number of buckets sizes, the buckets are distributed among them to reduce contention, which lowers latency; 1062 % however, picking N is workload specific. 1063 % 1064 % \begin{figure} 1065 % \centering 1066 % \input{AllocDS1} 1067 % \caption{T:1 with Shared Buckets} 1068 % \label{f:T1SharedBuckets} 1069 % \end{figure} 1070 % 1071 % Problems: 1072 % \begin{itemize} 1073 % \item 1074 % Need to know when a KT is created/destroyed to assign/unassign a shared bucket-number from the memory allocator. 1075 % \item 1076 % When no thread is assigned a bucket number, its free storage is unavailable. 1077 % \item 1078 % All KTs contend for the global-pool lock for initial allocations, before free-lists get populated. 1079 % \end{itemize} 1080 % Tests showed having locks along the allocation fast-path produced a significant increase in allocation costs and any contention among KTs produces a significant spike in latency. 1081 1082 % \paragraph{T:H model} 1083 % Figure~\ref{f:THSharedHeaps} shows a fixed number of heaps (N), each a local free pool, where the heaps are sharded (distributed) across the KTs. 
1084 % A KT can point directly to its assigned heap or indirectly through the corresponding heap bucket. 1085 % When KT $\le$ N, the heaps might be uncontented; 1086 % when KTs $>$ N, the heaps are contented. 1087 % In all cases, a KT must acquire/release a lock, contented or uncontented along the fast allocation path because a heap is shared. 1088 % By increasing N, this approach reduces contention but increases storage (time versus space); 1089 % however, picking N is workload specific. 1090 % 1091 % \begin{figure} 1092 % \centering 1093 % \input{AllocDS2} 1094 % \caption{T:H with Shared Heaps} 1095 % \label{f:THSharedHeaps} 1096 % \end{figure} 1097 % 1098 % Problems: 1099 % \begin{itemize} 1100 % \item 1101 % Need to know when a KT is created/destroyed to assign/unassign a heap from the memory allocator. 1102 % \item 1103 % When no thread is assigned to a heap, its free storage is unavailable. 1104 % \item 1105 % Ownership issues arise (see Section~\ref{s:Ownership}). 1106 % \item 1107 % All KTs contend for the local/global-pool lock for initial allocations, before free-lists get populated. 1108 % \end{itemize} 1109 % Tests showed having locks along the allocation fast-path produced a significant increase in allocation costs and any contention among KTs produces a significant spike in latency. 1110 1111 % \paragraph{T:H model, H = number of CPUs} 1112 % This design is the T:H model but H is set to the number of CPUs on the computer or the number restricted to an application, \eg via @taskset@. 1113 % (See Figure~\ref{f:THSharedHeaps} but with a heap bucket per CPU.) 1114 % Hence, each CPU logically has its own private heap and local pool. 1115 % A memory operation is serviced from the heap associated with the CPU executing the operation. 1116 % This approach removes fastpath locking and contention, regardless of the number of KTs mapped across the CPUs, because only one KT is running on each CPU at a time (modulo operations on the global pool and ownership). 
1117 % This approach is essentially an M:N approach where M is the number if KTs and N is the number of CPUs. 1118 % 1119 % Problems: 1120 % \begin{itemize} 1121 % \item 1122 % Need to know when a CPU is added/removed from the @taskset@. 1123 % \item 1124 % Need a fast way to determine the CPU a KT is executing on to access the appropriate heap. 1125 % \item 1126 % Need to prevent preemption during a dynamic memory operation because of the \newterm{serially-reusable problem}. 1127 % \begin{quote} 1128 % A sequence of code that is guaranteed to run to completion before being invoked to accept another input is called serially-reusable code.~\cite{SeriallyReusable}\label{p:SeriallyReusable} 1129 % \end{quote} 1130 % If a KT is preempted during an allocation operation, the OS can schedule another KT on the same CPU, which can begin an allocation operation before the previous operation associated with this CPU has completed, invalidating heap correctness. 1131 % Note, the serially-reusable problem can occur in sequential programs with preemption, if the signal handler calls the preempted function, unless the function is serially reusable. 1132 % Essentially, the serially-reusable problem is a race condition on an unprotected critical subsection, where the OS is providing the second thread via the signal handler. 1133 % 1134 % Library @librseq@~\cite{librseq} was used to perform a fast determination of the CPU and to ensure all memory operations complete on one CPU using @librseq@'s restartable sequences, which restart the critical subsection after undoing its writes, if the critical subsection is preempted. 1135 % \end{itemize} 1136 % Tests showed that @librseq@ can determine the particular CPU quickly but setting up the restartable critical-subsection along the allocation fast-path produced a significant increase in allocation costs. 
1137 % Also, the number of undoable writes in @librseq@ is limited and restartable sequences cannot deal with user-level thread (UT) migration across KTs. 1138 % For example, UT$_1$ is executing a memory operation by KT$_1$ on CPU$_1$ and a time-slice preemption occurs. 1139 % The signal handler context switches UT$_1$ onto the user-level ready-queue and starts running UT$_2$ on KT$_1$, which immediately calls a memory operation. 1140 % Since KT$_1$ is still executing on CPU$_1$, @librseq@ takes no action because it assumes KT$_1$ is still executing the same critical subsection. 1141 % Then UT$_1$ is scheduled onto KT$_2$ by the user-level scheduler, and its memory operation continues in parallel with UT$_2$ using references into the heap associated with CPU$_1$, which corrupts CPU$_1$'s heap. 1142 % If @librseq@ had an @rseq_abort@ which: 1143 % \begin{enumerate} 1144 % \item 1145 % Marked the current restartable critical-subsection as cancelled so it restarts when attempting to commit. 1146 % \item 1147 % Do nothing if there is no current restartable critical subsection in progress. 1148 % \end{enumerate} 1149 % Then @rseq_abort@ could be called on the backside of a user-level context-switching. 1150 % A feature similar to this idea might exist for hardware transactional-memory. 1151 % A significant effort was made to make this approach work but its complexity, lack of robustness, and performance costs resulted in its rejection. 1152 1153 % \subsubsection{Allocation Fastpath} 1154 % \label{s:AllocationFastpath} 1155 1156 llheap's design was reviewed and changed multiple times during its development, with the final choices are discussed here. 1157 (See~\cite{Zulfiqar22} for a discussion of alternate choices and reasons for rejecting them.) 1158 All designs were analyzed for the allocation/free \newterm{fastpath}, \ie when an allocation can immediately return free storage or returned storage is not coalesced. 
1159 The heap model chosen is 1:1, which is the T:H model with T = H, where there is one thread-local heap for each KT. 1041 llheap's design was reviewed and changed multiple times during its development, with the final choices discussed here. 1042 All designs focused on the allocation/free \newterm{fastpath}, \ie the shortest code path for the most common operations, \eg when an allocation can immediately return free storage or returned storage is not coalesced. 1043 The model chosen is 1:1, so there is one thread-local heap for each KT. 1160 1044 (See Figure~\ref{f:THSharedHeaps} but with a heap bucket per KT and no bucket or local-pool lock.) 1161 1045 Hence, immediately after a KT starts, its heap is created and just before a KT terminates, its heap is (logically) deleted. 1162 Heaps are uncontended for a KTs memory operations as every KT has its own thread-local heap, modulo operations on the global pool and ownership.1046 Therefore, heaps are uncontended for a KTs memory operations as every KT has its own thread-local heap, modulo operations on the global pool and ownership. 1163 1047 1164 1048 Problems: … … 1205 1089 For the T:1 and T:H models, locking must exist along the allocation fastpath because the buckets or heaps might be shared by multiple threads, even when KTs $\le$ N. 1206 1090 For the T:H=CPU and 1:1 models, locking is eliminated along the allocation fastpath. 1207 However, T:H=CPU has poor operating-systemsupport to determine the CPU id (heap id) and prevent the serially-reusable problem for KTs.1091 However, T:H=CPU has poor OS support to determine the CPU id (heap id) and prevent the serially-reusable problem for KTs. 1208 1092 More OS support is required to make this model viable, but there is still the serially-reusable problem with user-level threading. 
1209 So the 1:1 model had no atomic actions along the fastpath and no special operating-systemsupport requirements.1093 So the 1:1 model had no atomic actions along the fastpath and no special OS support requirements. 1210 1094 The 1:1 model still has the serially-reusable problem with user-level threading, which is addressed in Section~\ref{s:UserlevelThreadingSupport}, and the greatest potential for heap blowup for certain allocation patterns. 1211 1095 … … 1241 1125 A primary goal of llheap is low latency, hence the name low-latency heap (llheap). 1242 1126 Two forms of latency are internal and external. 1243 Internal latency is the time to perform an allocation, while external latency is time to obtain /return storage from/to the OS.1127 Internal latency is the time to perform an allocation, while external latency is time to obtain or return storage from or to the OS. 1244 1128 Ideally latency is $O(1)$ with a small constant. 1245 1129 1246 To obtain $O(1)$ internal latency means no searching on the allocation fastpath and largely prohibits coalescing, which leads to external fragmentation. 1247 The mitigating factor is that most programs have well behaved allocation patterns, where the majority of allocation operations can be $O(1)$, and heap blowup does not occur without coalescing (although the allocation footprint may be slightly larger). 1248 1249 To obtain $O(1)$ external latency means obtaining one large storage area from the OS and subdividing it across all program allocations, which requires a good guess at the program storage high-watermark and potential large external fragmentation. 1250 Excluding real-time operating-systems, operating-system operations are unbounded, and hence some external latency is unavoidable. 
1251 The mitigating factor is that operating-system calls can often be reduced if a programmer has a sense of the storage high-watermark and the allocator is capable of using this information (see @malloc_expansion@ \pageref{p:malloc_expansion}). 1252 Furthermore, while operating-system calls are unbounded, many are now reasonably fast, so their latency is tolerable and infrequent. 1130 $O(1)$ internal latency means no open searching on the allocation fastpath, which largely prohibits coalescing. 1131 The mitigating factor is that most programs have a small, fixed, allocation pattern, where the majority of allocation operations can be $O(1)$ and heap blowup does not occur without coalescing (although the allocation footprint may be slightly larger). 1132 Modern computers have large memories so a slight increase in program footprint is not a problem. 1133 1134 $O(1)$ external latency means obtaining one large storage area from the OS and subdividing it across all program allocations, which requires a good guess at the program storage high-watermark and potential large external fragmentation. 1135 Excluding real-time OSs, OS operations are unbounded, and hence some external latency is unavoidable. 1136 The mitigating factor is that OS calls can often be reduced if a programmer has a sense of the storage high-watermark and the allocator is capable of using this information (see @malloc_expansion@ \pageref{p:malloc_expansion}). 1137 Furthermore, while OS calls are unbounded, many are now reasonably fast, so their latency is tolerable because it occurs infrequently. 1253 1138 1254 1139 … … 1392 1277 \subsubsection{Alignment} 1393 1278 1394 Most dynamic memory allocations have a minimum storage alignment for the contained object(s). 1395 Often the minimum memory alignment, M, is the bus width (32 or 64-bit) or the largest register (double, long double) or largest atomic instruction (DCAS) or vector data (MMMX). 
1396 In general, the minimum storage alignment is 8/16-byte boundary on 32/64-bit computers. 1397 For consistency, the object header is normally aligned at this same boundary. 1398 Larger alignments must be a power of 2, such as page alignment (4/8K). 1399 Any alignment request, N, $\le$ the minimum alignment is handled as a normal allocation with minimal alignment. 1400 1401 For alignments greater than the minimum, the obvious approach for aligning to address @A@ is: compute the next address that is a multiple of @N@ after the current end of the heap, @E@, plus room for the header before @A@ and the size of the allocation after @A@, moving the end of the heap to @E'@. 1402 \begin{center} 1403 \input{Alignment1} 1404 \end{center} 1405 The storage between @E@ and @H@ is chained onto the appropriate free list for future allocations. 1406 The same approach is used for sufficiently large free blocks, where @E@ is the start of the free block, and any unused storage before @H@ or after the allocated object becomes free storage. 1407 In this approach, the aligned address @A@ is the same as the allocated storage address @P@, \ie @P@ $=$ @A@ for all allocation routines, which simplifies deallocation. 1408 However, if there are a large number of aligned requests, this approach leads to memory fragmentation from the small free areas around the aligned object. 1409 As well, it does not work for large allocations, where many memory allocators switch from program @sbrk@ to operating-system @mmap@. 1410 The reason is that @mmap@ only starts on a page boundary, and it is difficult to reuse the storage before the alignment boundary for other requests. 1411 Finally, this approach is incompatible with allocator designs that funnel allocation requests through @malloc@ as it directly manipulates management information within the allocator to optimize the space/time of a request. 
1412 1413 Instead, llheap alignment is accomplished by making a \emph{pessimistic} allocation request for sufficient storage to ensure that \emph{both} the alignment and size request are satisfied, \eg: 1279 Allocators have a different minimum storage alignment from the hardware's basic types. 1280 Often the minimum allocator alignment, $M$, is the bus width (32 or 64-bit), the largest register (double, long double), largest atomic instruction (DCAS), or vector data (MMMX). 1281 The reason for this larger requirement is the lack of knowledge about the data type occupying the allocation. 1282 Hence, an allocator assumes the worst-case scenario for the start of data and the compiler correctly aligns items within this data because it knows their types. 1283 Often the minimum storage alignment is an 8/16-byte boundary on a 32/64-bit computer. 1284 Alignments larger than $M$ are normally a power of 2, such as page alignment (4/8K). 1285 Any alignment less than $M$ is raised to the minimal alignment. 1286 1287 llheap aligns its header at the $M$ boundary and its size is $M$; 1288 hence, data following the header is aligned at $M$. 1289 This pattern means there is no minimal alignment computation along the allocation fastpath, \ie new storage and reused storage is always correctly aligned. 1290 An alignment $N$ greater than $M$ is accomplished with a \emph{pessimistic} request for storage that ensures \emph{both} the alignment and size request are satisfied, \eg: 1414 1291 \begin{center} 1415 1292 \input{Alignment2} 1416 1293 \end{center} 1417 The amount of storage necessary is @alignment - M + size@, which ensures there is an address, @A@, after the storage returned from @malloc@, @P@, that is a multiple of @alignment@ followed by sufficient storage for the data object. 1418 The approach is pessimistic because if @P@ already has the correct alignment @N@, the initial allocation has already requested sufficient space to move to the next multiple of @N@. 
1419 For this special case, there is @alignment - M@ bytes of unused storage after the data object, which subsequently can be used by @realloc@. 1420 1421 Note, the address returned is @A@, which is subsequently returned to @free@. 1422 However, to correctly free the allocated object, the value @P@ must be computable, since that is the value generated by @malloc@ and returned within @memalign@. 1423 Hence, there must be a mechanism to detect when @P@ $\neq$ @A@ and how to compute @P@ from @A@. 1424 1425 The llheap approach uses two headers: 1426 the \emph{original} header associated with a memory allocation from @malloc@, and a \emph{fake} header within this storage before the alignment boundary @A@, which is returned from @memalign@, e.g.: 1294 The amount of storage necessary is $alignment - M + size$, which ensures there is an address, $A$, after the storage returned from @malloc@, $P$, that is a multiple of $alignment$ followed by sufficient storage for the data object. 1295 The approach is pessimistic because, if $P$ happens to have the correct alignment $N$, the initial allocation has nonetheless requested sufficient space to move to the next multiple of $N$. 1296 In this case, there are $alignment - M$ bytes of unused storage after the data object, which could be used by @realloc@. 1297 Note, the address returned by the allocation is $A$, which is subsequently returned to @free@. 1298 To correctly free the object, the value $P$ must be computable from $A$, since that is the actual start of the allocation, from which $H$ can be computed as $P - M$. 1299 Hence, there must be a mechanism to detect when $P$ $\neq$ $A$ and then compute $P$ from $A$.
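As a worked illustration of the pessimistic request, assume the values $M = 16$, $N = alignment = 64$, and $size = 100$ (chosen only for illustration, not taken from llheap's implementation).
The storage requested is
\[
alignment - M + size = 64 - 16 + 100 = 148 \text{ bytes.}
\]
If the allocation happens to start at $P = 4112$ (a multiple of 16 but not of 64), the next multiple of 64 is $A = 4160$, so $A - P = 48 = alignment - M$ and the object occupies $[4160, 4260)$, which ends exactly at $P + 148 = 4260$.
If instead $P = 4160$, then $A = P$ and the trailing $alignment - M = 48$ bytes are unused.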
1300 1301 To detect and perform this computation, llheap uses two headers: 1302 the \emph{original} header $H$ associated with the allocation, and a \emph{fake} header $F$ within this storage before the alignment boundary $A$, \eg: 1427 1303 \begin{center} 1428 1304 \input{Alignment2Impl} 1429 1305 \end{center} 1430 Since @malloc@ has a minimum alignment of @M@, @P@ $\neq$ @A@ only holds for alignments greater than @M@. 1431 When @P@ $\neq$ @A@, the minimum distance between @P@ and @A@ is @M@ bytes, due to the pessimistic storage-allocation. 1432 Therefore, there is always room for an @M@-byte fake header before @A@. 1433 1434 The fake header must supply an indicator to distinguish it from a normal header and the location of address @P@ generated by @malloc@. 1306 Since every allocation is aligned at $M$, $P$ $\neq$ $A$ only holds for alignments greater than $M$. 1307 When $P$ $\neq$ $A$, the minimum distance between $P$ and $A$ is $M$ bytes, due to the pessimistic storage-allocation. 1308 Therefore, there is always room for an $M$-byte fake header before $A$. 1309 The fake header must supply an indicator to distinguish it from a normal header and the location of address $P$ generated by the allocation. 1435 1310 This information is encoded as an offset from $A$ to $P$ and the initial alignment (discussed in Section~\ref{s:ReallocStickyProperties}). 1436 1311 To distinguish a fake header from a normal header, the least-significant bit of the alignment is used because the offset participates in multiple calculations, while the alignment is just remembered data. … … 1443 1318 \label{s:ReallocStickyProperties} 1444 1319 1445 The allocation routine @realloc@ provides a memory-management pattern for shrinking/enlarging an existing allocation, while maintaining some or all of the object data, rather than performing the following steps manually.
1320 The allocation routine @realloc@ provides a memory-management pattern for shrinking/enlarging an existing allocation, while maintaining some or all of the object data. 1321 The realloc pattern is simpler than the suboptimal manual steps. 1446 1322 \begin{flushleft} 1447 1323 \begin{tabular}{ll} … … 1455 1331 & 1456 1332 \begin{lstlisting} 1457 T * naddr = (T *)malloc( newSize ); $\C[2 .4in]{// new storage}$1333 T * naddr = (T *)malloc( newSize ); $\C[2in]{// new storage}$ 1458 1334 memcpy( naddr, addr, oldSize ); $\C{// copy old bytes}$ 1459 1335 free( addr ); $\C{// free old storage}$ … … 1462 1338 \end{tabular} 1463 1339 \end{flushleft} 1464 The realloc pattern leverages available storage at the end of an allocation due to bucket sizes, possibly eliminating a new allocation and copying. 1465 This pattern is not used enough to reduce storage management costs. 1466 In fact, if @oaddr@ is @nullptr@, @realloc@ does a @malloc@, so even the initial @malloc@ can be a @realloc@ for consistency in the allocation pattern. 1467 1468 The hidden problem for this pattern is the effect of zero fill and alignment with respect to reallocation. 1469 Are these properties transient or persistent (``sticky'')? 1470 For example, when memory is initially allocated by @calloc@ or @memalign@ with zero fill or alignment properties, respectively, what happens when those allocations are given to @realloc@ to change size? 1471 That is, if @realloc@ logically extends storage into unused bucket space or allocates new storage to satisfy a size change, are initial allocation properties preserved? 1472 Currently, allocation properties are not preserved, so subsequent use of @realloc@ storage may cause inefficient execution or errors due to lack of zero fill or alignment. 1473 This silent problem is unintuitive to programmers and difficult to locate because it is transient. 
1474 To prevent these problems, llheap preserves initial allocation properties for the lifetime of an allocation and the semantics of @realloc@ are augmented to preserve these properties, with additional query routines. 1475 This change makes the realloc pattern efficient and safe. 1340 The manual steps are suboptimal because there may be sufficient internal fragmentation at the end of the allocation due to bucket sizes. 1341 If this storage is large enough, it eliminates a new allocation and copying. 1342 Alternatively, if the storage is made smaller, there may be a reasonable crossover point, where just increasing the internal fragmentation eliminates a new allocation and copying. 1343 This pattern should be used more frequently to reduce storage management costs. 1344 In fact, if @oaddr@ is @nullptr@, @realloc@ does a @malloc( newSize )@, and if @newSize@ is 0, @realloc@ does a @free( oaddr )@, so all allocation/deallocation can be done with @realloc@. 1345 1346 The hidden problem with this pattern is the effect of zero fill and alignment with respect to reallocation. 1347 For safety, we argue these properties should be persistent (``sticky'') and not transient. 1348 For example, when memory is initially allocated by @calloc@ or @memalign@ with zero fill or alignment properties, any subsequent reallocations of this storage must preserve these properties. 1349 Currently, allocation properties are not preserved nor is it possible to query an allocation to maintain these properties manually. 1350 Hence, subsequent use of @realloc@ storage that assumes any initial properties may cause errors. 1351 This silent problem is unintuitive to programmers, can cause catastrophic failure, and is difficult to debug because it is transient. 1352 To prevent these problems, llheap preserves initial allocation properties within an allocation, allowing them to be queried, and the semantics of @realloc@ preserve these properties on any storage change. 
1353 As a result, the realloc pattern is efficient and safe. 1476 1354 1477 1355 1478 1356 \subsubsection{Header} 1479 1357 1480 To preserve allocation properties requires storing additional information with an allocation, 1481 The best available option is the header, where Figure~\ref{f:llheapNormalHeader} shows the llheap storage layout. 1482 The header has two data field sized appropriately for 32/64-bit alignment requirements. 1483 The first field is a union of three values: 1358 Preserving allocation properties requires storing additional information about an allocation. 1359 Figure~\ref{f:llheapHeader} shows that llheap captures this information in the header, which has two fields (left/right) sized appropriately for 32/64-bit alignment requirements. 1360 1361 \begin{figure} 1362 \centering 1363 \input{Header} 1364 \caption{llheap Header} 1365 \label{f:llheapHeader} 1366 \end{figure} 1367 1368 The left field is a union of three values: 1484 1369 \begin{description} 1485 1370 \item[bucket pointer] 1486 is for allocatedstorage and points back to the bucket associated with this storage requests (see Figure~\ref{f:llheapStructure} for the fields accessible in a bucket).1371 is for deallocation of heap storage and points back to the bucket associated with this storage request (see Figure~\ref{f:llheapStructure} for the fields accessible in a bucket). 1487 1372 \item[mapped size] 1488 is for mapped storage and is the storage size for use inunmapping.1373 is for deallocation of mapped storage and is the storage size for unmapping. 1489 1374 \item[next free block] 1490 is for free storage and is an intrusive pointer chaining same-size free blocks onto a bucket's free stack.1375 is for freed storage and is an intrusive pointer chaining same-size free blocks onto a bucket's stack of free objects. 1491 1376 \end{description} 1492 The second field remembers the request size versus the allocation (bucket) size, \eg request 42 bytes which is rounded up to 64 bytes. 
1377 The low-order 3-bits of this field are unused for any stored values as these values are at least 8-byte aligned. 1378 The 3 unused bits are used to represent mapped allocation, zero filled, and alignment, respectively. 1379 Note, the zero-filled/mapped bits are only used in the normal header and the alignment bit in the fake header. 1380 This implementation allows a fast test if any of the lower 3-bits are on (@&@ and compare). 1381 If no bits are on, it implies a basic allocation, which is handled quickly in the fastpath for allocation and free; 1382 otherwise, the bits are analysed and appropriate actions are taken for the complex cases. 1383 1384 The right field remembers the request size versus the allocation (bucket) size, \eg request of 42 bytes is rounded up to 64 bytes. 1493 1385 Since programmers think in request sizes rather than allocation sizes, the request size allows better generation of statistics or errors and also helps in memory management. 1494 1386 1495 \begin{figure}1496 \centering1497 \input{Header}1498 \caption{llheap Normal Header}1499 \label{f:llheapNormalHeader}1500 \end{figure}1501 1502 The low-order 3-bits of the first field are \emph{unused} for any stored values as these values are 16-byte aligned by default, whereas the second field may use all of its bits.1503 The 3 unused bits are used to represent mapped allocation, zero filled, and alignment, respectively.1504 Note, the alignment bit is not used in the normal header and the zero-filled/mapped bits are not used in the fake header.1505 This implementation allows a fast test if any of the lower 3-bits are on (@&@ and compare).1506 If no bits are on, it implies a basic allocation, which is handled quickly;1507 otherwise, the bits are analysed and appropriate actions are taken for the complex cases.1508 Since most allocations are basic, they will take significantly less time as the memory operations will be done along the allocation and free fastpath.1509 1510 1387 1511 1388 
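The flag encoding described for the left header field can be sketched as follows. This is only an illustration under stated assumptions: the type @Header@, the helper names, and the specific bit assigned to each property are hypothetical (the text states only that three low-order bits represent mapped, zero-filled, and aligned storage), and the left field is modelled as a single @uintptr_t@ rather than a union.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical bit assignment for the three low-order flag bits.
enum { MAPPED = 0x1, ZERO_FILL = 0x2, ALIGNED = 0x4, FLAG_MASK = 0x7 };

typedef struct {
	uintptr_t left;				// bucket pointer, mapped size, or next free block, plus flag bits
	size_t requestSize;			// right field: size requested by the programmer
} Header;

// Store a value that is at least 8-byte aligned together with its flag bits.
static inline void header_set( Header * h, uintptr_t value, unsigned flags ) {
	assert( ( value & FLAG_MASK ) == 0 );	// low 3 bits must be free for flags
	assert( ( flags & ~(unsigned)FLAG_MASK ) == 0 );
	h->left = value | flags;
}

// Fast test (mask and compare): no flag bits means a basic allocation.
static inline bool header_is_basic( const Header * h ) {
	return ( h->left & FLAG_MASK ) == 0;
}

// Recover the stored value by masking off the flag bits.
static inline uintptr_t header_value( const Header * h ) {
	return h->left & ~(uintptr_t)FLAG_MASK;
}

// Extract only the flag bits for the complex (non-fastpath) cases.
static inline unsigned header_flags( const Header * h ) {
	return (unsigned)( h->left & FLAG_MASK );
}
```

With this layout, a fastpath needs one mask-and-compare via @header_is_basic@, and only allocations with at least one bit set pay for decoding the individual properties.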
\subsection{Statistics and Debugging} 1512 1389 1513 llheap can be built to accumulate fast and largely contention-free allocation statistics to help understand allocationbehaviour.1390 llheap can be built to accumulate fast and largely contention-free allocation statistics to help understand dynamic-memory behaviour. 1514 1391 Incrementing statistic counters must appear on the allocation fastpath. 1515 1392 As noted, any atomic operation along the fastpath produces a significant increase in allocation costs. … … 1741 1618 1742 1619 \medskip\noindent 1743 \lstinline{void * aalloc( size_t dim , size_t elemSize )}1620 \lstinline{void * aalloc( size_t dimension, size_t elemSize )} 1744 1621 extends @calloc@ for allocating a dynamic array of objects with total size @dim@ $\times$ @elemSize@ but \emph{without} zero-filling the memory. 1745 1622 @aalloc@ is significantly faster than @calloc@, which is the only alternative given by the standard memory-allocation routines for array allocation. … … 1753 1630 1754 1631 \medskip\noindent 1755 \lstinline{void * amemalign( size_t alignment, size_t dim , size_t elemSize )}1632 \lstinline{void * amemalign( size_t alignment, size_t dimension, size_t elemSize )} 1756 1633 extends @aalloc@ and @memalign@ for allocating a dynamic array of objects with the starting address on the @alignment@ boundary. 1757 1634 Sets sticky alignment property. … … 1759 1636 1760 1637 \medskip\noindent 1761 \lstinline{void * cmemalign( size_t alignment, size_t dim , size_t elemSize )}1638 \lstinline{void * cmemalign( size_t alignment, size_t dimension, size_t elemSize )} 1762 1639 extends @amemalign@ with zero fill and has the same usage as @amemalign@. 1763 1640 Sets sticky zero-fill and alignment property. … … 1881 1758 1882 1759 \medskip\noindent 1883 \lstinline{T * alloc( ... )} or \lstinline{T * alloc( size_t dim , ... )}1760 \lstinline{T * alloc( ... )} or \lstinline{T * alloc( size_t dimension, ... 
)} 1884 1761 is overloaded with a variable number of specific allocation operations, or an integer dimension parameter followed by a variable number of specific allocation operations. 1885 1762 These allocation operations can be passed as named arguments when calling the \lstinline{alloc} routine.