Changeset 9eb7f07c for doc/papers


Timestamp:
May 11, 2023, 8:27:58 PM (12 months ago)
Author:
Peter A. Buhr <pabuhr@…>
Branches:
ADT, ast-experimental, master
Children:
2c24971
Parents:
d697527
Message:

more updates for llheap paper

File:
1 edited

  • doc/papers/llheap/Paper.tex

    rd697527 r9eb7f07c  
    305305\item
    306306Provide additional heap operations to complete programmer expectation with respect to accessing different allocation properties.
    307 \begin{itemize}[itemsep=0pt,parsep=0pt]
     307\begin{itemize}[topsep=3pt,itemsep=2pt,parsep=0pt]
    308308\item
    309309@resize( oaddr, size )@ re-purpose an old allocation for a new type \emph{without} preserving fill or alignment.
     
    325325\item
    326326Provide additional query operations to access information about an allocation:
    327 \begin{itemize}[itemsep=0pt]
     327\begin{itemize}[topsep=3pt,itemsep=2pt,parsep=0pt]
    328328\item
    329329@malloc_alignment( addr )@ returns the alignment of the allocation pointed-to by @addr@.
     
    339339\item
    340340Provide complete, fast, and contention-free allocation statistics to help understand allocation behaviour:
    341 \begin{itemize}[itemsep=0pt]
     341\begin{itemize}[topsep=3pt,itemsep=2pt,parsep=0pt]
    342342\item
    343343@malloc_stats()@ print memory-allocation statistics on the file-descriptor set by @malloc_stats_fd@.
     
    894894
    895895Additional restrictions may be applied to the movement of containers to prevent active false-sharing.
    896 For example, in Figure~\ref{f:ContainerFalseSharing1}, a container being used by Thread$_1$ changes ownership, through the global heap.
    897 In Figure~\ref{f:ContainerFalseSharing2}, when Thread$_2$ allocates an object from the newly acquired container it is actively false-sharing even though no objects are passed among threads.
    898 Note, once the object is freed by Thread$_1$, no more false sharing can occur until the container changes ownership again.
      896For example, if a container changes ownership through the global heap, then when a thread allocates an object from the newly acquired container, it is actively false-sharing even though no objects are passed among threads.
     897Note, once the thread frees the object, no more false sharing can occur until the container changes ownership again.
    899898To prevent this form of false sharing, container movement may be restricted to when all objects in the container are free.
    900899One implementation approach that increases the freedom to return a free container to the operating system is to allocate containers using a call like @mmap@, which allows memory at an arbitrary address to be returned, versus only storage at the end of the contiguous @sbrk@ area, again pushing storage-management complexity back to the operating system.
    901900
    902 \begin{figure}
    903 \centering
    904 \subfloat[]{
    905         \input{ContainerFalseSharing1}
    906         \label{f:ContainerFalseSharing1}
    907 } % subfloat
    908 \subfloat[]{
    909         \input{ContainerFalseSharing2}
    910         \label{f:ContainerFalseSharing2}
    911 } % subfloat
    912 \caption{Active False-Sharing using Containers}
    913 \label{f:ActiveFalseSharingContainers}
    914 \end{figure}
     901% \begin{figure}
     902% \centering
     903% \subfloat[]{
     904%       \input{ContainerFalseSharing1}
     905%       \label{f:ContainerFalseSharing1}
     906% } % subfloat
     907% \subfloat[]{
     908%       \input{ContainerFalseSharing2}
     909%       \label{f:ContainerFalseSharing2}
     910% } % subfloat
     911% \caption{Active False-Sharing using Containers}
     912% \label{f:ActiveFalseSharingContainers}
     913% \end{figure}
    915914
    916915Using containers with ownership increases external fragmentation since a new container for a requested object size must be allocated separately for each thread requesting it.
     
    975974
    976975The container header allows an alternate approach for managing the heap's free-list.
    977 Rather than maintain a global free-list throughout the heap (see~Figure~\ref{f:GlobalFreeListAmongContainers}), the containers are linked through their headers and only the local free objects within a container are linked together (see~Figure~\ref{f:LocalFreeListWithinContainers}).
      976Rather than maintain a global free-list throughout the heap, the containers are linked through their headers and only the local free objects within a container are linked together.
    978977Note, maintaining free lists within a container assumes all free objects in the container are associated with the same heap;
    979978thus, this approach only applies to containers with ownership.
    980979
    981980This alternate free-list approach can greatly reduce the complexity of moving all freed objects belonging to a container to another heap.
    982 To move a container using a global free-list, as in Figure~\ref{f:GlobalFreeListAmongContainers}, the free list is first searched to find all objects within the container.
     981To move a container using a global free-list, the free list is first searched to find all objects within the container.
    983982Each object is then removed from the free list and linked together to form a local free-list for the move to the new heap.
    984 With local free-lists in containers, as in Figure~\ref{f:LocalFreeListWithinContainers}, the container is simply removed from one heap's free list and placed on the new heap's free list.
     983With local free-lists in containers, the container is simply removed from one heap's free list and placed on the new heap's free list.
    985984Thus, when using local free-lists, the operation of moving containers is reduced from $O(N)$ to $O(1)$.
    986985However, there is the additional storage cost in the header, which increases the header size, and therefore internal fragmentation.
    987986
    988 \begin{figure}
    989 \centering
    990 \subfloat[Global Free-List Among Containers]{
    991         \input{FreeListAmongContainers}
    992         \label{f:GlobalFreeListAmongContainers}
    993 } % subfloat
    994 \hspace{0.25in}
    995 \subfloat[Local Free-List Within Containers]{
    996         \input{FreeListWithinContainers}
    997         \label{f:LocalFreeListWithinContainers}
    998 } % subfloat
    999 \caption{Container Free-List Structure}
    1000 \label{f:ContainerFreeListStructure}
    1001 \end{figure}
     987% \begin{figure}
     988% \centering
     989% \subfloat[Global Free-List Among Containers]{
     990%       \input{FreeListAmongContainers}
     991%       \label{f:GlobalFreeListAmongContainers}
     992% } % subfloat
     993% \hspace{0.25in}
     994% \subfloat[Local Free-List Within Containers]{
     995%       \input{FreeListWithinContainers}
     996%       \label{f:LocalFreeListWithinContainers}
     997% } % subfloat
     998% \caption{Container Free-List Structure}
     999% \label{f:ContainerFreeListStructure}
     1000% \end{figure}
    10021001
    10031002When all objects in the container are the same size, a single free-list is sufficient.
     
    11161115\subsection{Design Choices}
    11171116
    1118 llheap's design was reviewed and changed multiple times throughout the work.
    1119 Some of the rejected designs are discussed because they show the path to the final design (see discussion in Section~\ref{s:MultipleHeaps}).
    1120 Note, a few simple tests for a design choice were compared with the current best allocators to determine the viability of a design.
    1121 
    1122 
    1123 \subsubsection{Allocation Fastpath}
    1124 \label{s:AllocationFastpath}
    1125 
    1126 These designs look at the allocation/free \newterm{fastpath}, \ie when an allocation can immediately return free storage or returned storage is not coalesced.
    1127 \paragraph{T:1 model}
    1128 Figure~\ref{f:T1SharedBuckets} shows one heap accessed by multiple kernel threads (KTs) using a bucket array, where smaller bucket sizes are shared among N KTs.
    1129 This design leverages the fact that usually the allocation requests are less than 1024 bytes and there are only a few different request sizes.
    1130 When KTs $\le$ N, the common bucket sizes are uncontented;
    1131 when KTs $>$ N, the free buckets are contented and latency increases significantly.
    1132 In all cases, a KT must acquire/release a lock, contented or uncontented, along the fast allocation path because a bucket is shared.
    1133 Therefore, while threads are contending for a small number of buckets sizes, the buckets are distributed among them to reduce contention, which lowers latency;
    1134 however, picking N is workload specific.
    1135 
    1136 \begin{figure}
    1137 \centering
    1138 \input{AllocDS1}
    1139 \caption{T:1 with Shared Buckets}
    1140 \label{f:T1SharedBuckets}
    1141 \end{figure}
     1117% Some of the rejected designs are discussed because they show the path to the final design (see discussion in Section~\ref{s:MultipleHeaps}).
     1118% Note, a few simple tests for a design choice were compared with the current best allocators to determine the viability of a design.
     1119
     1120
     1121% \paragraph{T:1 model}
     1122% Figure~\ref{f:T1SharedBuckets} shows one heap accessed by multiple kernel threads (KTs) using a bucket array, where smaller bucket sizes are shared among N KTs.
     1123% This design leverages the fact that usually the allocation requests are less than 1024 bytes and there are only a few different request sizes.
      1124% When KTs $\le$ N, the common bucket sizes are uncontended;
      1125% when KTs $>$ N, the free buckets are contended and latency increases significantly.
      1126% In all cases, a KT must acquire/release a lock, contended or uncontended, along the fast allocation path because a bucket is shared.
     1127% Therefore, while threads are contending for a small number of buckets sizes, the buckets are distributed among them to reduce contention, which lowers latency;
     1128% however, picking N is workload specific.
     1129%
     1130% \begin{figure}
     1131% \centering
     1132% \input{AllocDS1}
     1133% \caption{T:1 with Shared Buckets}
     1134% \label{f:T1SharedBuckets}
     1135% \end{figure}
     1136%
     1137% Problems:
     1138% \begin{itemize}
     1139% \item
     1140% Need to know when a KT is created/destroyed to assign/unassign a shared bucket-number from the memory allocator.
     1141% \item
     1142% When no thread is assigned a bucket number, its free storage is unavailable.
     1143% \item
     1144% All KTs contend for the global-pool lock for initial allocations, before free-lists get populated.
     1145% \end{itemize}
     1146% Tests showed having locks along the allocation fast-path produced a significant increase in allocation costs and any contention among KTs produces a significant spike in latency.
     1147
     1148% \paragraph{T:H model}
     1149% Figure~\ref{f:THSharedHeaps} shows a fixed number of heaps (N), each a local free pool, where the heaps are sharded (distributed) across the KTs.
     1150% A KT can point directly to its assigned heap or indirectly through the corresponding heap bucket.
      1151% When KTs $\le$ N, the heaps might be uncontended;
      1152% when KTs $>$ N, the heaps are contended.
      1153% In all cases, a KT must acquire/release a lock, contended or uncontended, along the fast allocation path because a heap is shared.
     1154% By increasing N, this approach reduces contention but increases storage (time versus space);
     1155% however, picking N is workload specific.
     1156%
     1157% \begin{figure}
     1158% \centering
     1159% \input{AllocDS2}
     1160% \caption{T:H with Shared Heaps}
     1161% \label{f:THSharedHeaps}
     1162% \end{figure}
     1163%
     1164% Problems:
     1165% \begin{itemize}
     1166% \item
     1167% Need to know when a KT is created/destroyed to assign/unassign a heap from the memory allocator.
     1168% \item
     1169% When no thread is assigned to a heap, its free storage is unavailable.
     1170% \item
     1171% Ownership issues arise (see Section~\ref{s:Ownership}).
     1172% \item
     1173% All KTs contend for the local/global-pool lock for initial allocations, before free-lists get populated.
     1174% \end{itemize}
     1175% Tests showed having locks along the allocation fast-path produced a significant increase in allocation costs and any contention among KTs produces a significant spike in latency.
     1176
     1177% \paragraph{T:H model, H = number of CPUs}
     1178% This design is the T:H model but H is set to the number of CPUs on the computer or the number restricted to an application, \eg via @taskset@.
     1179% (See Figure~\ref{f:THSharedHeaps} but with a heap bucket per CPU.)
     1180% Hence, each CPU logically has its own private heap and local pool.
     1181% A memory operation is serviced from the heap associated with the CPU executing the operation.
     1182% This approach removes fastpath locking and contention, regardless of the number of KTs mapped across the CPUs, because only one KT is running on each CPU at a time (modulo operations on the global pool and ownership).
      1183% This approach is essentially an M:N approach where M is the number of KTs and N is the number of CPUs.
     1184%
     1185% Problems:
     1186% \begin{itemize}
     1187% \item
     1188% Need to know when a CPU is added/removed from the @taskset@.
     1189% \item
     1190% Need a fast way to determine the CPU a KT is executing on to access the appropriate heap.
     1191% \item
     1192% Need to prevent preemption during a dynamic memory operation because of the \newterm{serially-reusable problem}.
     1193% \begin{quote}
     1194% A sequence of code that is guaranteed to run to completion before being invoked to accept another input is called serially-reusable code.~\cite{SeriallyReusable}\label{p:SeriallyReusable}
     1195% \end{quote}
     1196% If a KT is preempted during an allocation operation, the operating system can schedule another KT on the same CPU, which can begin an allocation operation before the previous operation associated with this CPU has completed, invalidating heap correctness.
     1197% Note, the serially-reusable problem can occur in sequential programs with preemption, if the signal handler calls the preempted function, unless the function is serially reusable.
     1198% Essentially, the serially-reusable problem is a race condition on an unprotected critical subsection, where the operating system is providing the second thread via the signal handler.
     1199%
     1200% Library @librseq@~\cite{librseq} was used to perform a fast determination of the CPU and to ensure all memory operations complete on one CPU using @librseq@'s restartable sequences, which restart the critical subsection after undoing its writes, if the critical subsection is preempted.
     1201% \end{itemize}
     1202% Tests showed that @librseq@ can determine the particular CPU quickly but setting up the restartable critical-subsection along the allocation fast-path produced a significant increase in allocation costs.
     1203% Also, the number of undoable writes in @librseq@ is limited and restartable sequences cannot deal with user-level thread (UT) migration across KTs.
     1204% For example, UT$_1$ is executing a memory operation by KT$_1$ on CPU$_1$ and a time-slice preemption occurs.
     1205% The signal handler context switches UT$_1$ onto the user-level ready-queue and starts running UT$_2$ on KT$_1$, which immediately calls a memory operation.
     1206% Since KT$_1$ is still executing on CPU$_1$, @librseq@ takes no action because it assumes KT$_1$ is still executing the same critical subsection.
     1207% Then UT$_1$ is scheduled onto KT$_2$ by the user-level scheduler, and its memory operation continues in parallel with UT$_2$ using references into the heap associated with CPU$_1$, which corrupts CPU$_1$'s heap.
     1208% If @librseq@ had an @rseq_abort@ which:
     1209% \begin{enumerate}
     1210% \item
     1211% Marked the current restartable critical-subsection as cancelled so it restarts when attempting to commit.
     1212% \item
      1213% Did nothing if there is no current restartable critical subsection in progress.
     1214% \end{enumerate}
      1215% Then @rseq_abort@ could be called on the backside of a user-level context switch.
     1216% A feature similar to this idea might exist for hardware transactional-memory.
     1217% A significant effort was made to make this approach work but its complexity, lack of robustness, and performance costs resulted in its rejection.
     1218
     1219% \subsubsection{Allocation Fastpath}
     1220% \label{s:AllocationFastpath}
     1221
      1222llheap's design was reviewed and changed multiple times during its development.
      1223Only the final design choices are discussed in this paper.
      1224(See~\cite{Zulfiqar22} for a discussion of alternate choices and reasons for rejecting them.)
      1225All designs were analyzed for the allocation/free \newterm{fastpath}, \ie when an allocation can immediately return free storage or returned storage is not coalesced.
      1226The heap model chosen is 1:1, which is the T:H model with T = H, where there is one thread-local heap for each KT.
      1227(See Figure~\ref{f:THSharedHeaps} but with a heap bucket per KT and no bucket or local-pool lock.)
      1228Hence, immediately after a KT starts, its heap is created and just before a KT terminates, its heap is (logically) deleted.
      1229Heaps are uncontended for a KT's memory operations, as every KT has its own thread-local heap, modulo operations on the global pool and ownership.
    11421230
    11431231Problems:
    1144 \begin{itemize}
    1145 \item
    1146 Need to know when a KT is created/destroyed to assign/unassign a shared bucket-number from the memory allocator.
    1147 \item
    1148 When no thread is assigned a bucket number, its free storage is unavailable.
    1149 \item
    1150 All KTs contend for the global-pool lock for initial allocations, before free-lists get populated.
    1151 \end{itemize}
    1152 Tests showed having locks along the allocation fast-path produced a significant increase in allocation costs and any contention among KTs produces a significant spike in latency.
    1153 
    1154 \paragraph{T:H model}
    1155 Figure~\ref{f:THSharedHeaps} shows a fixed number of heaps (N), each a local free pool, where the heaps are sharded (distributed) across the KTs.
    1156 A KT can point directly to its assigned heap or indirectly through the corresponding heap bucket.
    1157 When KT $\le$ N, the heaps might be uncontented;
    1158 when KTs $>$ N, the heaps are contented.
    1159 In all cases, a KT must acquire/release a lock, contented or uncontented along the fast allocation path because a heap is shared.
    1160 By increasing N, this approach reduces contention but increases storage (time versus space);
    1161 however, picking N is workload specific.
    1162 
    1163 \begin{figure}
    1164 \centering
    1165 \input{AllocDS2}
    1166 \caption{T:H with Shared Heaps}
    1167 \label{f:THSharedHeaps}
    1168 \end{figure}
    1169 
    1170 Problems:
    1171 \begin{itemize}
    1172 \item
    1173 Need to know when a KT is created/destroyed to assign/unassign a heap from the memory allocator.
    1174 \item
    1175 When no thread is assigned to a heap, its free storage is unavailable.
    1176 \item
    1177 Ownership issues arise (see Section~\ref{s:Ownership}).
    1178 \item
    1179 All KTs contend for the local/global-pool lock for initial allocations, before free-lists get populated.
    1180 \end{itemize}
    1181 Tests showed having locks along the allocation fast-path produced a significant increase in allocation costs and any contention among KTs produces a significant spike in latency.
    1182 
    1183 \paragraph{T:H model, H = number of CPUs}
    1184 This design is the T:H model but H is set to the number of CPUs on the computer or the number restricted to an application, \eg via @taskset@.
    1185 (See Figure~\ref{f:THSharedHeaps} but with a heap bucket per CPU.)
    1186 Hence, each CPU logically has its own private heap and local pool.
    1187 A memory operation is serviced from the heap associated with the CPU executing the operation.
    1188 This approach removes fastpath locking and contention, regardless of the number of KTs mapped across the CPUs, because only one KT is running on each CPU at a time (modulo operations on the global pool and ownership).
    1189 This approach is essentially an M:N approach where M is the number if KTs and N is the number of CPUs.
    1190 
    1191 Problems:
    1192 \begin{itemize}
    1193 \item
    1194 Need to know when a CPU is added/removed from the @taskset@.
    1195 \item
    1196 Need a fast way to determine the CPU a KT is executing on to access the appropriate heap.
     1232\begin{itemize}[topsep=3pt,itemsep=2pt,parsep=0pt]
     1233\item
     1234Need to know when a KT starts/terminates to create/delete its heap.
     1235
     1236\noindent
     1237It is possible to leverage constructors/destructors for thread-local objects to get a general handle on when a KT starts/terminates.
     1238\item
     1239There is a classic \newterm{memory-reclamation} problem for ownership because storage passed to another thread can be returned to a terminated heap.
     1240
     1241\noindent
     1242The classic solution only deletes a heap after all referents are returned, which is complex.
     1243The cheap alternative is for heaps to persist for program duration to handle outstanding referent frees.
     1244If old referents return storage to a terminated heap, it is handled in the same way as an active heap.
     1245To prevent heap blowup, terminated heaps can be reused by new KTs, where a reused heap may be populated with free storage from a prior KT (external fragmentation).
     1246In most cases, heap blowup is not a problem because programs have a small allocation set-size, so the free storage from a prior KT is apropos for a new KT.
     1247\item
     1248There can be significant external fragmentation as the number of KTs increases.
     1249
     1250\noindent
     1251In many concurrent applications, good performance is achieved with the number of KTs proportional to the number of CPUs.
     1252Since the number of CPUs is relatively small, and a heap is also relatively small, $\approx$10K bytes (not including any associated freed storage), the worst-case external fragmentation is still small compared to the RAM available on large servers with many CPUs.
    11971253\item
    11981254Need to prevent preemption during a dynamic memory operation because of the \newterm{serially-reusable problem}.
     
    12051261
    12061262Library @librseq@~\cite{librseq} was used to perform a fast determination of the CPU and to ensure all memory operations complete on one CPU using @librseq@'s restartable sequences, which restart the critical subsection after undoing its writes, if the critical subsection is preempted.
    1207 \end{itemize}
    1208 Tests showed that @librseq@ can determine the particular CPU quickly but setting up the restartable critical-subsection along the allocation fast-path produced a significant increase in allocation costs.
    1209 Also, the number of undoable writes in @librseq@ is limited and restartable sequences cannot deal with user-level thread (UT) migration across KTs.
    1210 For example, UT$_1$ is executing a memory operation by KT$_1$ on CPU$_1$ and a time-slice preemption occurs.
    1211 The signal handler context switches UT$_1$ onto the user-level ready-queue and starts running UT$_2$ on KT$_1$, which immediately calls a memory operation.
    1212 Since KT$_1$ is still executing on CPU$_1$, @librseq@ takes no action because it assumes KT$_1$ is still executing the same critical subsection.
    1213 Then UT$_1$ is scheduled onto KT$_2$ by the user-level scheduler, and its memory operation continues in parallel with UT$_2$ using references into the heap associated with CPU$_1$, which corrupts CPU$_1$'s heap.
    1214 If @librseq@ had an @rseq_abort@ which:
    1215 \begin{enumerate}
    1216 \item
    1217 Marked the current restartable critical-subsection as cancelled so it restarts when attempting to commit.
    1218 \item
    1219 Do nothing if there is no current restartable critical subsection in progress.
    1220 \end{enumerate}
    1221 Then @rseq_abort@ could be called on the backside of a  user-level context-switching.
    1222 A feature similar to this idea might exist for hardware transactional-memory.
    1223 A significant effort was made to make this approach work but its complexity, lack of robustness, and performance costs resulted in its rejection.
    1224 
    1225 \paragraph{1:1 model}
    1226 This design is the T:H model with T = H, where there is one thread-local heap for each KT.
    1227 (See Figure~\ref{f:THSharedHeaps} but with a heap bucket per KT and no bucket or local-pool lock.)
    1228 Hence, immediately after a KT starts, its heap is created and just before a KT terminates, its heap is (logically) deleted.
    1229 Heaps are uncontended for a KTs memory operations as every KT has its own thread-local heap, modulo operations on the global pool and ownership.
    1230 
    1231 Problems:
    1232 \begin{itemize}
    1233 \item
    1234 Need to know when a KT starts/terminates to create/delete its heap.
    1235 
    1236 \noindent
    1237 It is possible to leverage constructors/destructors for thread-local objects to get a general handle on when a KT starts/terminates.
    1238 \item
    1239 There is a classic \newterm{memory-reclamation} problem for ownership because storage passed to another thread can be returned to a terminated heap.
    1240 
    1241 \noindent
    1242 The classic solution only deletes a heap after all referents are returned, which is complex.
    1243 The cheap alternative is for heaps to persist for program duration to handle outstanding referent frees.
    1244 If old referents return storage to a terminated heap, it is handled in the same way as an active heap.
    1245 To prevent heap blowup, terminated heaps can be reused by new KTs, where a reused heap may be populated with free storage from a prior KT (external fragmentation).
    1246 In most cases, heap blowup is not a problem because programs have a small allocation set-size, so the free storage from a prior KT is apropos for a new KT.
    1247 \item
    1248 There can be significant external fragmentation as the number of KTs increases.
    1249 
    1250 \noindent
    1251 In many concurrent applications, good performance is achieved with the number of KTs proportional to the number of CPUs.
    1252 Since the number of CPUs is relatively small, and a heap is also relatively small, $\approx$10K bytes (not including any associated freed storage), the worst-case external fragmentation is still small compared to the RAM available on large servers with many CPUs.
    1253 \item
    1254 There is the same serially-reusable problem with UTs migrating across KTs.
     1263
     1264%There is the same serially-reusable problem with UTs migrating across KTs.
    12551265\end{itemize}
    12561266Tests showed this design produced the closest performance match with the best current allocators, and code inspection showed most of these allocators use different variations of this approach.
     
    12961306\subsubsection{Allocation Latency}
    12971307
    1298 A primary goal of llheap is low latency.
     1308A primary goal of llheap is low latency, hence the name low-latency heap (llheap).
    12991309Two forms of latency are internal and external.
    13001310Internal latency is the time to perform an allocation, while external latency is time to obtain/return storage from/to the operating system.
     
    13151325
    13161326Figure~\ref{f:llheapStructure} shows the design of llheap, which uses the following features:
    1317 \begin{itemize}
    1318 \item
    131913271:1 multiple-heap model to minimize the fastpath,
    1320 \item
    13211328can be built with or without heap ownership,
    1322 \item
    13231329headers per allocation versus containers,
    1324 \item
    13251330no coalescing to minimize latency,
    1326 \item
    13271331global heap memory (pool) obtained from the operating system using @mmap@ to create and reuse heaps needed by threads,
    1328 \item
    13291332local reserved memory (pool) per heap obtained from global pool,
    1330 \item
    13311333global reserved memory (pool) obtained from the operating system using @sbrk@ call,
    1332 \item
    13331334optional fast-lookup table for converting allocation requests into bucket sizes,
    1334 \item
    13351335optional statistic-counters table for accumulating counts of allocation operations.
    1336 \end{itemize}
    13371336
    13381337\begin{figure}
     
    13601359All objects in a bucket are of the same size.
    13611360The number of buckets used is determined dynamically depending on the crossover point from @sbrk@ to @mmap@ allocation using @mallopt( M_MMAP_THRESHOLD )@, \ie small objects managed by the program and large objects managed by the operating system.
    1362 Each free bucket of a specific size has the following two lists:
    1363 \begin{itemize}
    1364 \item
    1365 A free stack used solely by the KT heap-owner, so push/pop operations do not require locking.
     1361Each free bucket of a specific size has two lists.
     13621) A free stack used solely by the KT heap-owner, so push/pop operations do not require locking.
    13661363The free objects are a stack so hot storage is reused first.
    1367 \item
    1368 For ownership, a shared away-stack for KTs to return storage allocated by other KTs, so push/pop operations require locking.
     13642) For ownership, a shared away-stack for KTs to return storage allocated by other KTs, so push/pop operations require locking.
    13691365When the free stack is empty, the entire ownership stack is removed and becomes the head of the corresponding free stack.
    1370 \end{itemize}
    13711366
    13721367Algorithm~\ref{alg:heapObjectAlloc} shows the allocation outline for an object of size $S$.
     
    13781373The @char@ type restricts the number of bucket sizes to 256.
    13791374For $S$ > 64K, a binary search is used.
    1380 Then, the allocation storage is obtained from the following locations (in order), with increasing latency.
    1381 \begin{enumerate}[topsep=0pt,itemsep=0pt,parsep=0pt]
    1382 \item
     1375Then, the allocation storage is obtained from the following locations (in order), with increasing latency:
    13831376bucket's free stack,
    1384 \item
    13851377bucket's away stack,
    1386 \item
    1387 heap's local pool
    1388 \item
    1389 global pool
    1390 \item
    1391 operating system (@sbrk@)
    1392 \end{enumerate}
     1378heap's local pool,
     1379global pool,
     1380operating system (@sbrk@).
    13931381
    13941382\begin{algorithm}
     
    18081796
    18091797For existing C allocation routines:
    1810 \begin{itemize}
     1798\begin{itemize}[topsep=3pt,itemsep=2pt,parsep=0pt]
    18111799\item
    18121800@calloc@ sets the sticky zero-fill property.
     
    18251813\noindent\textbf{Usage}
    18261814@aalloc@ takes two parameters.
    1827 \begin{itemize}
     1815\begin{itemize}[topsep=3pt,itemsep=2pt,parsep=0pt]
    18281816\item
    18291817@dim@: number of array objects
     
    18391827\noindent\textbf{Usage}
    18401828@resize@ takes two parameters.
    1841 \begin{itemize}
     1829\begin{itemize}[topsep=3pt,itemsep=2pt,parsep=0pt]
    18421830\item
    18431831@oaddr@: address to be resized
     
    18531841\noindent\textbf{Usage}
    18541842@amemalign@ takes three parameters.
    1855 \begin{itemize}
     1843\begin{itemize}[topsep=3pt,itemsep=2pt,parsep=0pt]
    18561844\item
    18571845@alignment@: alignment requirement
     
    18731861\noindent\textbf{Usage}
    18741862@malloc_alignment@ takes one parameter.
    1875 \begin{itemize}
     1863\begin{itemize}[topsep=3pt,itemsep=2pt,parsep=0pt]
    18761864\item
    18771865@addr@: address of an allocated object.
     
    18851873@malloc_zero_fill@ takes one parameter.
    18861874
    1887 \begin{itemize}
     1875\begin{itemize}[topsep=3pt,itemsep=2pt,parsep=0pt]
    18881876\item
    18891877@addr@: address of an allocated object.
     
    18971885\noindent\textbf{Usage}
    18981886@malloc_size@ takes one parameter.
    1899 \begin{itemize}
     1887\begin{itemize}[topsep=3pt,itemsep=2pt,parsep=0pt]
    19001888\item
    19011889@addr@: address of an allocated object.
     
    19081896\noindent\textbf{Usage}
    19091897@malloc_stats_fd@ takes one parameter.
    1910 \begin{itemize}
     1898\begin{itemize}[topsep=3pt,itemsep=2pt,parsep=0pt]
    19111899\item
    19121900@fd@: file descriptor.
     
    19381926\noindent\textbf{Usage}
    19391927takes three parameters.
    1940 \begin{itemize}
     1928\begin{itemize}[topsep=3pt,itemsep=2pt,parsep=0pt]
    19411929\item
    19421930@oaddr@: address to be resized
     
    19981986In addition to the \CFA C-style allocator interface, a new allocator interface is provided to further increase orthogonality and usability of dynamic-memory allocation.
    19991987This interface helps programmers in three ways.
    2000 \begin{itemize}
     1988\begin{itemize}[topsep=3pt,itemsep=2pt,parsep=0pt]
    20011989\item
    20021990naming: \CFA regular and @ttype@ polymorphism (@ttype@ polymorphism in \CFA is similar to \CC variadic templates) is used to encapsulate a wide range of allocation functionality into a single routine name, so programmers do not have to remember multiple routine names for different kinds of dynamic allocations.
     
    25282516
    25292517The performance experiments were run on two different multi-core architectures (x64 and ARM) to determine if there is consistency across platforms:
    2530 \begin{itemize}
     2518\begin{itemize}[topsep=3pt,itemsep=2pt,parsep=0pt]
    25312519\item
    25322520\textbf{Algol} Huawei ARM TaiShan 2280 V2 Kunpeng 920, 24-core socket $\times$ 4, 2.6 GHz, GCC version 9.4.0
     
    27932781Each allocator's performance for each thread is shown in different colors.
    27942782
    2795 \begin{itemize}
     2783\begin{itemize}[topsep=3pt,itemsep=2pt,parsep=0pt]
    27962784\item Figure~\ref{fig:speed-3-malloc} shows results for chain: malloc
    27972785\item Figure~\ref{fig:speed-4-realloc} shows results for chain: realloc
     
    30303018Each figure has 2 graphs, one for each experiment environment.
    30313019Each graph has the following five subgraphs that show memory usage and statistics throughout the micro-benchmark's lifetime.
    3032 \begin{itemize}
     3020\begin{itemize}[topsep=3pt,itemsep=2pt,parsep=0pt]
    30333021\item \textit{\textbf{current\_req\_mem(B)}} shows the amount of dynamic memory requested and currently in use by the benchmark.
    30343022\item \textit{\textbf{heap}}* shows the memory requested by the program (allocator) from the system that lies in the heap (@sbrk@) area.