Changeset 9eb7f07c for doc/papers
- Timestamp:
- May 11, 2023, 8:27:58 PM (21 months ago)
- Branches:
- ADT, ast-experimental, master
- Children:
- 2c24971
- Parents:
- d697527
- File:
- 1 edited
doc/papers/llheap/Paper.tex
rd697527 r9eb7f07c 305 305 \item 306 306 Provide additional heap operations to complete programmer expectation with respect to accessing different allocation properties. 307 \begin{itemize}[ itemsep=0pt,parsep=0pt]307 \begin{itemize}[topsep=3pt,itemsep=2pt,parsep=0pt] 308 308 \item 309 309 @resize( oaddr, size )@ re-purpose an old allocation for a new type \emph{without} preserving fill or alignment. … … 325 325 \item 326 326 Provide additional query operations to access information about an allocation: 327 \begin{itemize}[ itemsep=0pt]327 \begin{itemize}[topsep=3pt,itemsep=2pt,parsep=0pt] 328 328 \item 329 329 @malloc_alignment( addr )@ returns the alignment of the allocation pointed-to by @addr@. … … 339 339 \item 340 340 Provide complete, fast, and contention-free allocation statistics to help understand allocation behaviour: 341 \begin{itemize}[ itemsep=0pt]341 \begin{itemize}[topsep=3pt,itemsep=2pt,parsep=0pt] 342 342 \item 343 343 @malloc_stats()@ print memory-allocation statistics on the file-descriptor set by @malloc_stats_fd@. … … 894 894 895 895 Additional restrictions may be applied to the movement of containers to prevent active false-sharing. 896 For example, in Figure~\ref{f:ContainerFalseSharing1}, a container being used by Thread$_1$ changes ownership, through the global heap. 897 In Figure~\ref{f:ContainerFalseSharing2}, when Thread$_2$ allocates an object from the newly acquired container it is actively false-sharing even though no objects are passed among threads. 898 Note, once the object is freed by Thread$_1$, no more false sharing can occur until the container changes ownership again. 896 For example, if a container changes ownership through the global heap, then when a thread allocates an object from the newly acquired container it is actively false-sharing even though no objects are passed among threads. 897 Note, once the thread frees the object, no more false sharing can occur until the container changes ownership again. 
899 898 To prevent this form of false sharing, container movement may be restricted to when all objects in the container are free. 900 899 One implementation approach that increases the freedom to return a free container to the operating system involves allocating containers using a call like @mmap@, which allows memory at an arbitrary address to be returned versus only storage at the end of the contiguous @sbrk@ area, again pushing storage management complexity back to the operating system. 901 900 902 \begin{figure}903 \centering904 \subfloat[]{905 \input{ContainerFalseSharing1}906 \label{f:ContainerFalseSharing1}907 } % subfloat908 \subfloat[]{909 \input{ContainerFalseSharing2}910 \label{f:ContainerFalseSharing2}911 } % subfloat912 \caption{Active False-Sharing using Containers}913 \label{f:ActiveFalseSharingContainers}914 \end{figure}901 % \begin{figure} 902 % \centering 903 % \subfloat[]{ 904 % \input{ContainerFalseSharing1} 905 % \label{f:ContainerFalseSharing1} 906 % } % subfloat 907 % \subfloat[]{ 908 % \input{ContainerFalseSharing2} 909 % \label{f:ContainerFalseSharing2} 910 % } % subfloat 911 % \caption{Active False-Sharing using Containers} 912 % \label{f:ActiveFalseSharingContainers} 913 % \end{figure} 915 914 916 915 Using containers with ownership increases external fragmentation since a new container for a requested object size must be allocated separately for each thread requesting it. … … 975 974 976 975 The container header allows an alternate approach for managing the heap's free-list. 977 Rather than maintain a global free-list throughout the heap (see~Figure~\ref{f:GlobalFreeListAmongContainers}), the containers are linked through their headers and only the local free objects within a container are linked together (see~Figure~\ref{f:LocalFreeListWithinContainers}).976 Rather than maintain a global free-list throughout the heap the containers are linked through their headers and only the local free objects within a container are linked together. 
978 977 Note, maintaining free lists within a container assumes all free objects in the container are associated with the same heap; 979 978 thus, this approach only applies to containers with ownership. 980 979 981 980 This alternate free-list approach can greatly reduce the complexity of moving all freed objects belonging to a container to another heap. 982 To move a container using a global free-list, as in Figure~\ref{f:GlobalFreeListAmongContainers},the free list is first searched to find all objects within the container.981 To move a container using a global free-list, the free list is first searched to find all objects within the container. 983 982 Each object is then removed from the free list and linked together to form a local free-list for the move to the new heap. 984 With local free-lists in containers, as in Figure~\ref{f:LocalFreeListWithinContainers},the container is simply removed from one heap's free list and placed on the new heap's free list.983 With local free-lists in containers, the container is simply removed from one heap's free list and placed on the new heap's free list. 985 984 Thus, when using local free-lists, the operation of moving containers is reduced from $O(N)$ to $O(1)$. 986 985 However, there is the additional storage cost in the header, which increases the header size, and therefore internal fragmentation. 
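The O(1) container move that local free-lists enable can be sketched in C. This is an illustrative sketch, not llheap's actual code: the `Heap`, `Container`, and `moveContainer` names are assumptions, and a doubly-linked container list is assumed so the unlink itself is constant time. The point is that the local free-list travels with the container header, instead of each free object being searched out of a global free-list.

```c
#include <stddef.h>

/* Illustrative structures (not llheap's actual types): each container
   header carries its own local free-list, so free objects travel with
   the container when it changes heaps. */
typedef struct Object { struct Object * next; } Object;
typedef struct Container {
	struct Container * prev, * next;  /* doubly-linked list of containers in a heap */
	Object * freeList;                /* local free objects; stay with the container */
} Container;
typedef struct Heap { Container * containers; } Heap;

/* O(1) move: unlink from the old heap's container list, push onto the
   new heap's list.  With a global free-list this operation is O(N):
   every free object belonging to the container must be found, unlinked,
   and relinked individually. */
void moveContainer( Heap * from, Heap * to, Container * c ) {
	if ( c->prev != NULL ) c->prev->next = c->next;
	else from->containers = c->next;
	if ( c->next != NULL ) c->next->prev = c->prev;
	c->prev = NULL;                   /* push onto destination heap */
	c->next = to->containers;
	if ( to->containers != NULL ) to->containers->prev = c;
	to->containers = c;
}
```

The extra `prev`/`next` links illustrate the storage cost noted in the text: the header grows, so internal fragmentation increases in exchange for the O(1) move.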
987 986 988 \begin{figure}989 \centering990 \subfloat[Global Free-List Among Containers]{991 \input{FreeListAmongContainers}992 \label{f:GlobalFreeListAmongContainers}993 } % subfloat994 \hspace{0.25in}995 \subfloat[Local Free-List Within Containers]{996 \input{FreeListWithinContainers}997 \label{f:LocalFreeListWithinContainers}998 } % subfloat999 \caption{Container Free-List Structure}1000 \label{f:ContainerFreeListStructure}1001 \end{figure}987 % \begin{figure} 988 % \centering 989 % \subfloat[Global Free-List Among Containers]{ 990 % \input{FreeListAmongContainers} 991 % \label{f:GlobalFreeListAmongContainers} 992 % } % subfloat 993 % \hspace{0.25in} 994 % \subfloat[Local Free-List Within Containers]{ 995 % \input{FreeListWithinContainers} 996 % \label{f:LocalFreeListWithinContainers} 997 % } % subfloat 998 % \caption{Container Free-List Structure} 999 % \label{f:ContainerFreeListStructure} 1000 % \end{figure} 1002 1001 1003 1002 When all objects in the container are the same size, a single free-list is sufficient. … … 1116 1115 \subsection{Design Choices} 1117 1116 1118 llheap's design was reviewed and changed multiple times throughout the work. 1119 Some of the rejected designs are discussed because they show the path to the final design (see discussion in Section~\ref{s:MultipleHeaps}). 1120 Note, a few simple tests for a design choice were compared with the current best allocators to determine the viability of a design. 1121 1122 1123 \subsubsection{Allocation Fastpath} 1124 \label{s:AllocationFastpath} 1125 1126 These designs look at the allocation/free \newterm{fastpath}, \ie when an allocation can immediately return free storage or returned storage is not coalesced. 1127 \paragraph{T:1 model} 1128 Figure~\ref{f:T1SharedBuckets} shows one heap accessed by multiple kernel threads (KTs) using a bucket array, where smaller bucket sizes are shared among N KTs. 
1129 This design leverages the fact that usually the allocation requests are less than 1024 bytes and there are only a few different request sizes. 1130 When KTs $\le$ N, the common bucket sizes are uncontented; 1131 when KTs $>$ N, the free buckets are contented and latency increases significantly. 1132 In all cases, a KT must acquire/release a lock, contented or uncontented, along the fast allocation path because a bucket is shared. 1133 Therefore, while threads are contending for a small number of buckets sizes, the buckets are distributed among them to reduce contention, which lowers latency; 1134 however, picking N is workload specific. 1135 1136 \begin{figure} 1137 \centering 1138 \input{AllocDS1} 1139 \caption{T:1 with Shared Buckets} 1140 \label{f:T1SharedBuckets} 1141 \end{figure} 1117 % Some of the rejected designs are discussed because they show the path to the final design (see discussion in Section~\ref{s:MultipleHeaps}). 1118 % Note, a few simple tests for a design choice were compared with the current best allocators to determine the viability of a design. 1119 1120 1121 % \paragraph{T:1 model} 1122 % Figure~\ref{f:T1SharedBuckets} shows one heap accessed by multiple kernel threads (KTs) using a bucket array, where smaller bucket sizes are shared among N KTs. 1123 % This design leverages the fact that usually the allocation requests are less than 1024 bytes and there are only a few different request sizes. 1124 % When KTs $\le$ N, the common bucket sizes are uncontented; 1125 % when KTs $>$ N, the free buckets are contented and latency increases significantly. 1126 % In all cases, a KT must acquire/release a lock, contented or uncontented, along the fast allocation path because a bucket is shared. 1127 % Therefore, while threads are contending for a small number of buckets sizes, the buckets are distributed among them to reduce contention, which lowers latency; 1128 % however, picking N is workload specific. 
1129 % 1130 % \begin{figure} 1131 % \centering 1132 % \input{AllocDS1} 1133 % \caption{T:1 with Shared Buckets} 1134 % \label{f:T1SharedBuckets} 1135 % \end{figure} 1136 % 1137 % Problems: 1138 % \begin{itemize} 1139 % \item 1140 % Need to know when a KT is created/destroyed to assign/unassign a shared bucket-number from the memory allocator. 1141 % \item 1142 % When no thread is assigned a bucket number, its free storage is unavailable. 1143 % \item 1144 % All KTs contend for the global-pool lock for initial allocations, before free-lists get populated. 1145 % \end{itemize} 1146 % Tests showed having locks along the allocation fast-path produced a significant increase in allocation costs and any contention among KTs produces a significant spike in latency. 1147 1148 % \paragraph{T:H model} 1149 % Figure~\ref{f:THSharedHeaps} shows a fixed number of heaps (N), each a local free pool, where the heaps are sharded (distributed) across the KTs. 1150 % A KT can point directly to its assigned heap or indirectly through the corresponding heap bucket. 1151 % When KT $\le$ N, the heaps might be uncontented; 1152 % when KTs $>$ N, the heaps are contented. 1153 % In all cases, a KT must acquire/release a lock, contented or uncontented along the fast allocation path because a heap is shared. 1154 % By increasing N, this approach reduces contention but increases storage (time versus space); 1155 % however, picking N is workload specific. 1156 % 1157 % \begin{figure} 1158 % \centering 1159 % \input{AllocDS2} 1160 % \caption{T:H with Shared Heaps} 1161 % \label{f:THSharedHeaps} 1162 % \end{figure} 1163 % 1164 % Problems: 1165 % \begin{itemize} 1166 % \item 1167 % Need to know when a KT is created/destroyed to assign/unassign a heap from the memory allocator. 1168 % \item 1169 % When no thread is assigned to a heap, its free storage is unavailable. 1170 % \item 1171 % Ownership issues arise (see Section~\ref{s:Ownership}). 
1172 % \item 1173 % All KTs contend for the local/global-pool lock for initial allocations, before free-lists get populated. 1174 % \end{itemize} 1175 % Tests showed having locks along the allocation fast-path produced a significant increase in allocation costs and any contention among KTs produces a significant spike in latency. 1176 1177 % \paragraph{T:H model, H = number of CPUs} 1178 % This design is the T:H model but H is set to the number of CPUs on the computer or the number restricted to an application, \eg via @taskset@. 1179 % (See Figure~\ref{f:THSharedHeaps} but with a heap bucket per CPU.) 1180 % Hence, each CPU logically has its own private heap and local pool. 1181 % A memory operation is serviced from the heap associated with the CPU executing the operation. 1182 % This approach removes fastpath locking and contention, regardless of the number of KTs mapped across the CPUs, because only one KT is running on each CPU at a time (modulo operations on the global pool and ownership). 1183 % This approach is essentially an M:N approach where M is the number if KTs and N is the number of CPUs. 1184 % 1185 % Problems: 1186 % \begin{itemize} 1187 % \item 1188 % Need to know when a CPU is added/removed from the @taskset@. 1189 % \item 1190 % Need a fast way to determine the CPU a KT is executing on to access the appropriate heap. 1191 % \item 1192 % Need to prevent preemption during a dynamic memory operation because of the \newterm{serially-reusable problem}. 1193 % \begin{quote} 1194 % A sequence of code that is guaranteed to run to completion before being invoked to accept another input is called serially-reusable code.~\cite{SeriallyReusable}\label{p:SeriallyReusable} 1195 % \end{quote} 1196 % If a KT is preempted during an allocation operation, the operating system can schedule another KT on the same CPU, which can begin an allocation operation before the previous operation associated with this CPU has completed, invalidating heap correctness. 
1197 % Note, the serially-reusable problem can occur in sequential programs with preemption, if the signal handler calls the preempted function, unless the function is serially reusable. 1198 % Essentially, the serially-reusable problem is a race condition on an unprotected critical subsection, where the operating system is providing the second thread via the signal handler. 1199 % 1200 % Library @librseq@~\cite{librseq} was used to perform a fast determination of the CPU and to ensure all memory operations complete on one CPU using @librseq@'s restartable sequences, which restart the critical subsection after undoing its writes, if the critical subsection is preempted. 1201 % \end{itemize} 1202 % Tests showed that @librseq@ can determine the particular CPU quickly but setting up the restartable critical-subsection along the allocation fast-path produced a significant increase in allocation costs. 1203 % Also, the number of undoable writes in @librseq@ is limited and restartable sequences cannot deal with user-level thread (UT) migration across KTs. 1204 % For example, UT$_1$ is executing a memory operation by KT$_1$ on CPU$_1$ and a time-slice preemption occurs. 1205 % The signal handler context switches UT$_1$ onto the user-level ready-queue and starts running UT$_2$ on KT$_1$, which immediately calls a memory operation. 1206 % Since KT$_1$ is still executing on CPU$_1$, @librseq@ takes no action because it assumes KT$_1$ is still executing the same critical subsection. 1207 % Then UT$_1$ is scheduled onto KT$_2$ by the user-level scheduler, and its memory operation continues in parallel with UT$_2$ using references into the heap associated with CPU$_1$, which corrupts CPU$_1$'s heap. 1208 % If @librseq@ had an @rseq_abort@ which: 1209 % \begin{enumerate} 1210 % \item 1211 % Marked the current restartable critical-subsection as cancelled so it restarts when attempting to commit. 
1212 % \item 1213 % Do nothing if there is no current restartable critical subsection in progress. 1214 % \end{enumerate} 1215 % Then @rseq_abort@ could be called on the backside of a user-level context switch. 1216 % A feature similar to this idea might exist for hardware transactional-memory. 1217 % A significant effort was made to make this approach work but its complexity, lack of robustness, and performance costs resulted in its rejection. 1218 1219 % \subsubsection{Allocation Fastpath} 1220 % \label{s:AllocationFastpath} 1221 1222 llheap's design was reviewed and changed multiple times during its development. Only the final design choices are 1223 discussed in this paper. 1224 (See~\cite{Zulfiqar22} for a discussion of alternate choices and reasons for rejecting them.) 1225 All designs were analyzed for the allocation/free \newterm{fastpath}, \ie when an allocation can immediately return free storage or returned storage is not coalesced. 1226 The heap model chosen is 1:1, which is the T:H model with T = H, where there is one thread-local heap for each KT. 1227 (This is the T:H organization with a heap per KT and no bucket or local-pool lock.) 1228 Hence, immediately after a KT starts, its heap is created and just before a KT terminates, its heap is (logically) deleted. 1229 Heaps are uncontended for a KT's memory operations as every KT has its own thread-local heap, modulo operations on the global pool and ownership. 1142 1230 1143 1231 Problems: 1144 \begin{itemize} 1145 \item 1146 Need to know when a KT is created/destroyed to assign/unassign a shared bucket-number from the memory allocator. 1147 \item 1148 When no thread is assigned a bucket number, its free storage is unavailable. 1149 \item 1150 All KTs contend for the global-pool lock for initial allocations, before free-lists get populated. 
1151 \end{itemize} 1152 Tests showed having locks along the allocation fast-path produced a significant increase in allocation costs and any contention among KTs produces a significant spike in latency. 1153 1154 \paragraph{T:H model} 1155 Figure~\ref{f:THSharedHeaps} shows a fixed number of heaps (N), each a local free pool, where the heaps are sharded (distributed) across the KTs. 1156 A KT can point directly to its assigned heap or indirectly through the corresponding heap bucket. 1157 When KT $\le$ N, the heaps might be uncontented; 1158 when KTs $>$ N, the heaps are contented. 1159 In all cases, a KT must acquire/release a lock, contented or uncontented along the fast allocation path because a heap is shared. 1160 By increasing N, this approach reduces contention but increases storage (time versus space); 1161 however, picking N is workload specific. 1162 1163 \begin{figure} 1164 \centering 1165 \input{AllocDS2} 1166 \caption{T:H with Shared Heaps} 1167 \label{f:THSharedHeaps} 1168 \end{figure} 1169 1170 Problems: 1171 \begin{itemize} 1172 \item 1173 Need to know when a KT is created/destroyed to assign/unassign a heap from the memory allocator. 1174 \item 1175 When no thread is assigned to a heap, its free storage is unavailable. 1176 \item 1177 Ownership issues arise (see Section~\ref{s:Ownership}). 1178 \item 1179 All KTs contend for the local/global-pool lock for initial allocations, before free-lists get populated. 1180 \end{itemize} 1181 Tests showed having locks along the allocation fast-path produced a significant increase in allocation costs and any contention among KTs produces a significant spike in latency. 1182 1183 \paragraph{T:H model, H = number of CPUs} 1184 This design is the T:H model but H is set to the number of CPUs on the computer or the number restricted to an application, \eg via @taskset@. 1185 (See Figure~\ref{f:THSharedHeaps} but with a heap bucket per CPU.) 1186 Hence, each CPU logically has its own private heap and local pool. 
1187 A memory operation is serviced from the heap associated with the CPU executing the operation. 1188 This approach removes fastpath locking and contention, regardless of the number of KTs mapped across the CPUs, because only one KT is running on each CPU at a time (modulo operations on the global pool and ownership). 1189 This approach is essentially an M:N approach where M is the number if KTs and N is the number of CPUs. 1190 1191 Problems: 1192 \begin{itemize} 1193 \item 1194 Need to know when a CPU is added/removed from the @taskset@. 1195 \item 1196 Need a fast way to determine the CPU a KT is executing on to access the appropriate heap. 1232 \begin{itemize}[topsep=3pt,itemsep=2pt,parsep=0pt] 1233 \item 1234 Need to know when a KT starts/terminates to create/delete its heap. 1235 1236 \noindent 1237 It is possible to leverage constructors/destructors for thread-local objects to get a general handle on when a KT starts/terminates. 1238 \item 1239 There is a classic \newterm{memory-reclamation} problem for ownership because storage passed to another thread can be returned to a terminated heap. 1240 1241 \noindent 1242 The classic solution only deletes a heap after all referents are returned, which is complex. 1243 The cheap alternative is for heaps to persist for program duration to handle outstanding referent frees. 1244 If old referents return storage to a terminated heap, it is handled in the same way as an active heap. 1245 To prevent heap blowup, terminated heaps can be reused by new KTs, where a reused heap may be populated with free storage from a prior KT (external fragmentation). 1246 In most cases, heap blowup is not a problem because programs have a small allocation set-size, so the free storage from a prior KT is apropos for a new KT. 1247 \item 1248 There can be significant external fragmentation as the number of KTs increases. 
1249 1250 \noindent 1251 In many concurrent applications, good performance is achieved with the number of KTs proportional to the number of CPUs. 1252 Since the number of CPUs is relatively small, and a heap is also relatively small, $\approx$10K bytes (not including any associated freed storage), the worst-case external fragmentation is still small compared to the RAM available on large servers with many CPUs. 1197 1253 \item 1198 1254 Need to prevent preemption during a dynamic memory operation because of the \newterm{serially-reusable problem}. … … 1205 1261 1206 1262 Library @librseq@~\cite{librseq} was used to perform a fast determination of the CPU and to ensure all memory operations complete on one CPU using @librseq@'s restartable sequences, which restart the critical subsection after undoing its writes, if the critical subsection is preempted. 1207 \end{itemize} 1208 Tests showed that @librseq@ can determine the particular CPU quickly but setting up the restartable critical-subsection along the allocation fast-path produced a significant increase in allocation costs. 1209 Also, the number of undoable writes in @librseq@ is limited and restartable sequences cannot deal with user-level thread (UT) migration across KTs. 1210 For example, UT$_1$ is executing a memory operation by KT$_1$ on CPU$_1$ and a time-slice preemption occurs. 1211 The signal handler context switches UT$_1$ onto the user-level ready-queue and starts running UT$_2$ on KT$_1$, which immediately calls a memory operation. 1212 Since KT$_1$ is still executing on CPU$_1$, @librseq@ takes no action because it assumes KT$_1$ is still executing the same critical subsection. 1213 Then UT$_1$ is scheduled onto KT$_2$ by the user-level scheduler, and its memory operation continues in parallel with UT$_2$ using references into the heap associated with CPU$_1$, which corrupts CPU$_1$'s heap. 
1214 If @librseq@ had an @rseq_abort@ which: 1215 \begin{enumerate} 1216 \item 1217 Marked the current restartable critical-subsection as cancelled so it restarts when attempting to commit. 1218 \item 1219 Do nothing if there is no current restartable critical subsection in progress. 1220 \end{enumerate} 1221 Then @rseq_abort@ could be called on the backside of a user-level context-switching. 1222 A feature similar to this idea might exist for hardware transactional-memory. 1223 A significant effort was made to make this approach work but its complexity, lack of robustness, and performance costs resulted in its rejection. 1224 1225 \paragraph{1:1 model} 1226 This design is the T:H model with T = H, where there is one thread-local heap for each KT. 1227 (See Figure~\ref{f:THSharedHeaps} but with a heap bucket per KT and no bucket or local-pool lock.) 1228 Hence, immediately after a KT starts, its heap is created and just before a KT terminates, its heap is (logically) deleted. 1229 Heaps are uncontended for a KTs memory operations as every KT has its own thread-local heap, modulo operations on the global pool and ownership. 1230 1231 Problems: 1232 \begin{itemize} 1233 \item 1234 Need to know when a KT starts/terminates to create/delete its heap. 1235 1236 \noindent 1237 It is possible to leverage constructors/destructors for thread-local objects to get a general handle on when a KT starts/terminates. 1238 \item 1239 There is a classic \newterm{memory-reclamation} problem for ownership because storage passed to another thread can be returned to a terminated heap. 1240 1241 \noindent 1242 The classic solution only deletes a heap after all referents are returned, which is complex. 1243 The cheap alternative is for heaps to persist for program duration to handle outstanding referent frees. 1244 If old referents return storage to a terminated heap, it is handled in the same way as an active heap. 
1245 To prevent heap blowup, terminated heaps can be reused by new KTs, where a reused heap may be populated with free storage from a prior KT (external fragmentation). 1246 In most cases, heap blowup is not a problem because programs have a small allocation set-size, so the free storage from a prior KT is apropos for a new KT. 1247 \item 1248 There can be significant external fragmentation as the number of KTs increases. 1249 1250 \noindent 1251 In many concurrent applications, good performance is achieved with the number of KTs proportional to the number of CPUs. 1252 Since the number of CPUs is relatively small, and a heap is also relatively small, $\approx$10K bytes (not including any associated freed storage), the worst-case external fragmentation is still small compared to the RAM available on large servers with many CPUs. 1253 \item 1254 There is the same serially-reusable problem with UTs migrating across KTs. 1263 1264 %There is the same serially-reusable problem with UTs migrating across KTs. 1255 1265 \end{itemize} 1256 1266 Tests showed this design produced the closest performance match with the best current allocators, and code inspection showed most of these allocators use different variations of this approach. … 1296 1306 \subsubsection{Allocation Latency} 1297 1307 1298 A primary goal of llheap is low latency . 1308 A primary goal of llheap is low latency, hence the name low-latency heap (llheap). 1299 1309 Two forms of latency are internal and external. 1300 1310 Internal latency is the time to perform an allocation, while external latency is the time to obtain/return storage from/to the operating system. 
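The per-KT heap lifetime described above (create at KT start, logically delete at termination, persist and recycle to bound blowup) can be approximated with POSIX thread-specific-data destructors. This is a minimal sketch under stated assumptions: `Heap`, `heapAcquire`, and `heapRetire` are hypothetical names, not llheap's API, and `heapKey` is assumed to be created once (e.g. via `pthread_once`) before use.

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical per-KT heap record; heaps persist for program duration
   and are recycled for new KTs, as described in the text. */
typedef struct Heap { struct Heap * next; } Heap;

static Heap * freeHeaps = NULL;                 /* retired heaps, ready for reuse */
static pthread_mutex_t poolLock = PTHREAD_MUTEX_INITIALIZER;
static pthread_key_t heapKey;                   /* destructor fires at KT termination */
static int heapsCreated = 0;                    /* total heaps ever created */

/* Called automatically for a KT's non-NULL key value when the KT exits:
   the heap is retired, not deleted, so late frees from other threads
   are handled the same way as for an active heap. */
static void heapRetire( void * vh ) {
	pthread_mutex_lock( &poolLock );
	((Heap *)vh)->next = freeHeaps;
	freeHeaps = (Heap *)vh;
	pthread_mutex_unlock( &poolLock );
}

/* Called lazily on a KT's first allocation: reuse a retired heap if one
   exists (possibly pre-populated with a prior KT's free storage),
   otherwise create a new one. */
static Heap * heapAcquire( void ) {
	pthread_mutex_lock( &poolLock );
	Heap * h = freeHeaps;
	if ( h != NULL ) freeHeaps = h->next;
	else { h = malloc( sizeof(Heap) ); h->next = NULL; heapsCreated += 1; }
	pthread_mutex_unlock( &poolLock );
	pthread_setspecific( heapKey, h );          /* registers heapRetire for this KT */
	return h;
}
```

Because retired heaps are recycled, `heapsCreated` is bounded by the peak number of concurrent KTs, matching the text's argument that worst-case external fragmentation stays proportional to the number of CPUs rather than the total number of threads ever created.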
… … 1315 1325 1316 1326 Figure~\ref{f:llheapStructure} shows the design of llheap, which uses the following features: 1317 \begin{itemize}1318 \item1319 1327 1:1 multiple-heap model to minimize the fastpath, 1320 \item1321 1328 can be built with or without heap ownership, 1322 \item1323 1329 headers per allocation versus containers, 1324 \item1325 1330 no coalescing to minimize latency, 1326 \item1327 1331 global heap memory (pool) obtained from the operating system using @mmap@ to create and reuse heaps needed by threads, 1328 \item1329 1332 local reserved memory (pool) per heap obtained from global pool, 1330 \item1331 1333 global reserved memory (pool) obtained from the operating system using @sbrk@ call, 1332 \item1333 1334 optional fast-lookup table for converting allocation requests into bucket sizes, 1334 \item1335 1335 optional statistic-counters table for accumulating counts of allocation operations. 1336 \end{itemize}1337 1336 1338 1337 \begin{figure} … … 1360 1359 All objects in a bucket are of the same size. 1361 1360 The number of buckets used is determined dynamically depending on the crossover point from @sbrk@ to @mmap@ allocation using @mallopt( M_MMAP_THRESHOLD )@, \ie small objects managed by the program and large objects managed by the operating system. 1362 Each free bucket of a specific size has the following two lists: 1363 \begin{itemize} 1364 \item 1365 A free stack used solely by the KT heap-owner, so push/pop operations do not require locking. 1361 Each free bucket of a specific size has two lists. 1362 1) A free stack used solely by the KT heap-owner, so push/pop operations do not require locking. 1366 1363 The free objects are a stack so hot storage is reused first. 1367 \item 1368 For ownership, a shared away-stack for KTs to return storage allocated by other KTs, so push/pop operations require locking. 1364 2) For ownership, a shared away-stack for KTs to return storage allocated by other KTs, so push/pop operations require locking. 
1369 1365 When the free stack is empty, the entire ownership stack is removed and becomes the head of the corresponding free stack. 1370 \end{itemize}1371 1366 1372 1367 Algorithm~\ref{alg:heapObjectAlloc} shows the allocation outline for an object of size $S$. … … 1378 1373 The @char@ type restricts the number of bucket sizes to 256. 1379 1374 For $S$ > 64K, a binary search is used. 1380 Then, the allocation storage is obtained from the following locations (in order), with increasing latency. 1381 \begin{enumerate}[topsep=0pt,itemsep=0pt,parsep=0pt] 1382 \item 1375 Then, the allocation storage is obtained from the following locations (in order), with increasing latency: 1383 1376 bucket's free stack, 1384 \item1385 1377 bucket's away stack, 1386 \item 1387 heap's local pool 1388 \item 1389 global pool 1390 \item 1391 operating system (@sbrk@) 1392 \end{enumerate} 1378 heap's local pool, 1379 global pool, 1380 operating system (@sbrk@). 1393 1381 1394 1382 \begin{algorithm} … … 1808 1796 1809 1797 For existing C allocation routines: 1810 \begin{itemize} 1798 \begin{itemize}[topsep=3pt,itemsep=2pt,parsep=0pt] 1811 1799 \item 1812 1800 @calloc@ sets the sticky zero-fill property. … … 1825 1813 \noindent\textbf{Usage} 1826 1814 @aalloc@ takes two parameters. 1827 \begin{itemize} 1815 \begin{itemize}[topsep=3pt,itemsep=2pt,parsep=0pt] 1828 1816 \item 1829 1817 @dim@: number of array objects … … 1839 1827 \noindent\textbf{Usage} 1840 1828 @resize@ takes two parameters. 1841 \begin{itemize} 1829 \begin{itemize}[topsep=3pt,itemsep=2pt,parsep=0pt] 1842 1830 \item 1843 1831 @oaddr@: address to be resized … … 1853 1841 \noindent\textbf{Usage} 1854 1842 @amemalign@ takes three parameters. 1855 \begin{itemize} 1843 \begin{itemize}[topsep=3pt,itemsep=2pt,parsep=0pt] 1856 1844 \item 1857 1845 @alignment@: alignment requirement … … 1873 1861 \noindent\textbf{Usage} 1874 1862 @malloc_alignment@ takes one parameter. 
1875 \begin{itemize} 1863 \begin{itemize}[topsep=3pt,itemsep=2pt,parsep=0pt] 1876 1864 \item 1877 1865 @addr@: address of an allocated object. … 1885 1873 @malloc_zero_fill@ takes one parameter. 1886 1874 1887 \begin{itemize} 1875 \begin{itemize}[topsep=3pt,itemsep=2pt,parsep=0pt] 1888 1876 \item 1889 1877 @addr@: address of an allocated object. … 1897 1885 \noindent\textbf{Usage} 1898 1886 @malloc_size@ takes one parameter. 1899 \begin{itemize} 1887 \begin{itemize}[topsep=3pt,itemsep=2pt,parsep=0pt] 1900 1888 \item 1901 1889 @addr@: address of an allocated object. … 1908 1896 \noindent\textbf{Usage} 1909 1897 @malloc_stats_fd@ takes one parameter. 1910 \begin{itemize} 1898 \begin{itemize}[topsep=3pt,itemsep=2pt,parsep=0pt] 1911 1899 \item 1912 1900 @fd@: file descriptor. … 1938 1926 \noindent\textbf{Usage} 1939 1927 takes three parameters. 1940 \begin{itemize} 1928 \begin{itemize}[topsep=3pt,itemsep=2pt,parsep=0pt] 1941 1929 \item 1942 1930 @oaddr@: address to be resized … 1998 1986 In addition to the \CFA C-style allocator interface, a new allocator interface is provided to further increase orthogonality and usability of dynamic-memory allocation. 1999 1987 This interface helps programmers in three ways. 2000 \begin{itemize} 1988 \begin{itemize}[topsep=3pt,itemsep=2pt,parsep=0pt] 2001 1989 \item 2002 1990 naming: \CFA regular and @ttype@ polymorphism (@ttype@ polymorphism in \CFA is similar to \CC variadic templates) is used to encapsulate a wide range of allocation functionality into a single routine name, so programmers do not have to remember multiple routine names for different kinds of dynamic allocations. 
… … 2528 2516 2529 2517 The performance experiments were run on two different multi-core architectures (x64 and ARM) to determine if there is consistency across platforms: 2530 \begin{itemize} 2518 \begin{itemize}[topsep=3pt,itemsep=2pt,parsep=0pt] 2531 2519 \item 2532 2520 \textbf{Algol} Huawei ARM TaiShan 2280 V2 Kunpeng 920, 24-core socket $\times$ 4, 2.6 GHz, GCC version 9.4.0 … … 2793 2781 Each allocator's performance for each thread is shown in different colors. 2794 2782 2795 \begin{itemize} 2783 \begin{itemize}[topsep=3pt,itemsep=2pt,parsep=0pt] 2796 2784 \item Figure~\ref{fig:speed-3-malloc} shows results for chain: malloc 2797 2785 \item Figure~\ref{fig:speed-4-realloc} shows results for chain: realloc … … 3030 3018 Each figure has 2 graphs, one for each experiment environment. 3031 3019 Each graph has following 5 subgraphs that show memory usage and statistics throughout the micro-benchmark's lifetime. 3032 \begin{itemize} 3020 \begin{itemize}[topsep=3pt,itemsep=2pt,parsep=0pt] 3033 3021 \item \textit{\textbf{current\_req\_mem(B)}} shows the amount of dynamic memory requested and currently in-use of the benchmark. 3034 3022 \item \textit{\textbf{heap}}* shows the memory requested by the program (allocator) from the system that lies in the heap (@sbrk@) area.