1 | % ====================================================================== |
---|
2 | % ====================================================================== |
---|
3 | \chapter{Mutex Statement}\label{s:mutexstmt} |
---|
4 | % ====================================================================== |
---|
5 | % ====================================================================== |
---|
6 | |
---|
7 | The mutual exclusion problem was introduced by Dijkstra in 1965~\cite{Dijkstra65,Dijkstra65a}: |
---|
8 | there are several concurrent processes or threads that communicate by shared variables and from time to time need exclusive access to shared resources. |
---|
9 | A shared resource and code manipulating it form a pairing called a \Newterm{critical section (CS)}, which is a many-to-one relationship; |
---|
10 | \eg if multiple files are being written to by multiple threads, only the pairings of simultaneous writes to the same files are CSs. |
---|
11 | Regions of code where the thread is not interested in the resource are combined into the \Newterm{non-critical section (NCS)}. |
---|
12 | |
---|
13 | Exclusive access to a resource is provided by \Newterm{mutual exclusion (MX)}. |
---|
14 | MX is implemented by some form of \emph{lock}, where the CS is bracketed by lock procedures @acquire@ and @release@. |
---|
15 | Threads execute a loop of the form: |
---|
16 | \begin{cfa} |
---|
17 | loop of $thread$ p: |
---|
18 | NCS; |
---|
19 | acquire( lock ); CS; release( lock ); // protected critical section with MX |
---|
20 | end loop. |
---|
21 | \end{cfa} |
---|
22 | MX guarantees there is never more than one thread in the CS. |
---|
23 | MX must also guarantee eventual progress: when there are competing threads attempting access, eventually some competing thread succeeds, \ie acquires the CS, releases it, and returns to the NCS. |
---|
24 | % Lamport \cite[p.~329]{Lam86mx} extends this requirement to the exit protocol. |
---|
25 | A stronger constraint is that every thread that calls @acquire@ eventually succeeds after some reasonable bounded time. |
---|
26 | |
---|
27 | \section{Monitor} |
---|
28 | \CFA provides a high-level locking object, called a \Newterm{monitor}, an elegant and efficient abstraction for providing mutual exclusion and synchronization for shared-memory systems. |
---|
29 | First proposed by Brinch Hansen~\cite{Hansen73} and later described and extended by C.A.R.~Hoare~\cite{Hoare74}. |
---|
30 | Several concurrent programming languages provide monitors as an explicit language construct: \eg Concurrent Pascal~\cite{ConcurrentPascal}, Mesa~\cite{Mesa}, Turing~\cite{Turing:old}, Modula-3~\cite{Modula-3}, \uC~\cite{Buhr92a} and Java~\cite{Java}. |
---|
31 | In addition, operating-system kernels and device drivers have a monitor-like structure, although they often use lower-level primitives such as mutex locks or semaphores to manually implement a monitor. |
---|
32 | |
---|
33 | Figure~\ref{f:AtomicCounter} shows a \CFA and Java monitor implementing an atomic counter. |
---|
34 | A \Newterm{monitor} is a programming technique that implicitly binds mutual exclusion to static function scope by call and return. |
---|
35 | In contrast, lock mutual exclusion, defined by acquire/release calls, is independent of lexical context (analogous to block versus heap storage allocation). |
---|
36 | Restricting acquire and release points in a monitor eases programming, comprehension, and maintenance, at a slight cost in flexibility and efficiency. |
---|
37 | Ultimately, a monitor is implemented using a combination of basic locks and atomic instructions. |
---|
38 | |
---|
39 | \begin{figure} |
---|
40 | \centering |
---|
41 | |
---|
42 | \begin{lrbox}{\myboxA} |
---|
43 | \begin{cfa}[aboveskip=0pt,belowskip=0pt] |
---|
44 | @monitor@ Aint { |
---|
45 | int cnt; |
---|
46 | }; |
---|
47 | int ++?( Aint & @mutex@ m ) { return ++m.cnt; } |
---|
48 | int ?=?( Aint & @mutex@ l, int r ) { l.cnt = r; } |
---|
49 | int ?=?(int & l, Aint & r) { l = r.cnt; } |
---|
50 | |
---|
51 | int i = 0, j = 0; |
---|
52 | Aint x = { 0 }, y = { 0 }; $\C[1.5in]{// no mutex}$ |
---|
53 | ++x; ++y; $\C{// mutex}$ |
---|
54 | x = 2; y = i; $\C{// mutex}$ |
---|
55 | i = x; j = y; $\C{// no mutex}\CRT$ |
---|
56 | \end{cfa} |
---|
57 | \end{lrbox} |
---|
58 | |
---|
59 | \begin{lrbox}{\myboxB} |
---|
60 | \begin{java}[aboveskip=0pt,belowskip=0pt] |
---|
61 | class Aint { |
---|
62 | private int cnt; |
---|
63 | public Aint( int init ) { cnt = init; } |
---|
64 | @synchronized@ public int inc() { return ++cnt; } |
---|
65 | @synchronized@ public void set( int r ) {cnt = r;} |
---|
66 | public int get() { return cnt; } |
---|
67 | } |
---|
68 | int i = 0, j = 0; |
---|
69 | Aint x = new Aint( 0 ), y = new Aint( 0 ); |
---|
70 | x.inc(); y.inc(); |
---|
71 | x.set( 2 ); y.set( i ); |
---|
72 | i = x.get(); j = y.get(); |
---|
73 | \end{java} |
---|
74 | \end{lrbox} |
---|
75 | |
---|
76 | \subfloat[\CFA]{\label{f:AtomicCounterCFA}\usebox\myboxA} |
---|
77 | \hspace*{3pt} |
---|
78 | \vrule |
---|
79 | \hspace*{3pt} |
---|
80 | \subfloat[Java]{\label{f:AtomicCounterJava}\usebox\myboxB} |
---|
81 | \caption{Atomic integer counter} |
---|
82 | \label{f:AtomicCounter} |
---|
83 | \end{figure} |
---|
84 | |
---|
85 | Like Java, \CFA monitors have \Newterm{multi-acquire} semantics so the thread in the monitor may acquire it multiple times without deadlock, allowing recursion and calling other MX functions. |
---|
86 | For robustness, \CFA monitors ensure the monitor lock is released regardless of how an acquiring function ends, normal or exceptional, and returning a shared variable is safe via copying before the lock is released. |
---|
87 | Monitor objects can be passed through multiple helper functions without acquiring mutual exclusion, until a designated function associated with the object is called. |
---|
88 | \CFA functions are designated MX by one or more pointer/reference parameters having qualifier @mutex@. |
---|
89 | Java members are designated MX with \lstinline[language=java]{synchronized}, which applies only to the implicit receiver parameter. |
---|
90 | In the example, the increment and setter operations need mutual exclusion, while the read-only getter operation is not MX because reading an integer is atomic. |
---|
91 | |
---|
92 | As stated, the non-object-oriented nature of \CFA monitors allows a function to acquire multiple mutex objects. |
---|
93 | For example, the bank-transfer problem requires locking two bank accounts to safely debit and credit money between accounts. |
---|
94 | \begin{cfa} |
---|
95 | monitor BankAccount { |
---|
96 | int balance; |
---|
97 | }; |
---|
98 | void deposit( BankAccount & mutex b, int deposit ) with( b ) { |
---|
99 | balance += deposit; |
---|
100 | } |
---|
101 | void transfer( BankAccount & mutex my, BankAccount & mutex your, int me2you ) { |
---|
102 | deposit( my, -me2you ); $\C{// debit}$ |
---|
103 | deposit( your, me2you ); $\C{// credit}$ |
---|
104 | } |
---|
105 | \end{cfa} |
---|
106 | The \CFA monitor implementation ensures multi-lock acquisition is done in a deadlock-free manner regardless of the number of MX parameters and monitor arguments. |
---|
107 | |
---|
108 | |
---|
109 | \section{\lstinline{mutex} statement} |
---|
110 | Restricting implicit lock acquisition to function entry and exit can be awkward for certain problems. |
---|
111 | To increase locking flexibility, some languages introduce a mutex statement. |
---|
112 | \VRef[Figure]{f:ReadersWriter} shows the outline of a reader/writer lock written as a \CFA monitor and mutex statements. |
---|
113 | (The exact lock implementation is irrelevant.) |
---|
114 | The @read@ and @write@ functions are called with a reader/write lock and any arguments to perform reading or writing. |
---|
115 | The @read@ function is not MX because multiple readers can read simultaneously. |
---|
116 | MX is acquired within @read@ by calling the (nested) helper functions @StartRead@ and @EndRead@ or executing the mutex statements. |
---|
117 | Between the calls or statements, reads can execute simultaneous within the body of @read@. |
---|
118 | The @write@ function does not require refactoring because writing is a CS. |
---|
119 | The mutex-statement version is better because it has fewer names, less argument/parameter passing, and can possibly hold MX for a shorter duration. |
---|
120 | |
---|
121 | \begin{figure} |
---|
122 | \centering |
---|
123 | |
---|
124 | \begin{lrbox}{\myboxA} |
---|
125 | \begin{cfa}[aboveskip=0pt,belowskip=0pt] |
---|
126 | monitor RWlock { ... }; |
---|
127 | void read( RWlock & rw, ... ) { |
---|
128 | void StartRead( RWlock & @mutex@ rw ) { ... } |
---|
129 | void EndRead( RWlock & @mutex@ rw ) { ... } |
---|
130 | StartRead( rw ); |
---|
131 | ... // read without MX |
---|
132 | EndRead( rw ); |
---|
133 | } |
---|
134 | void write( RWlock & @mutex@ rw, ... ) { |
---|
135 | ... // write with MX |
---|
136 | } |
---|
137 | \end{cfa} |
---|
138 | \end{lrbox} |
---|
139 | |
---|
140 | \begin{lrbox}{\myboxB} |
---|
141 | \begin{cfa}[aboveskip=0pt,belowskip=0pt] |
---|
142 | |
---|
143 | void read( RWlock & rw, ... ) { |
---|
144 | |
---|
145 | |
---|
146 | @mutex@( rw ) { ... } |
---|
147 | ... // read without MX |
---|
148 | @mutex@{ rw ) { ... } |
---|
149 | } |
---|
150 | void write( RWlock & @mutex@ rw, ... ) { |
---|
151 | ... // write with MX |
---|
152 | } |
---|
153 | \end{cfa} |
---|
154 | \end{lrbox} |
---|
155 | |
---|
156 | \subfloat[monitor]{\label{f:RWmonitor}\usebox\myboxA} |
---|
157 | \hspace*{3pt} |
---|
158 | \vrule |
---|
159 | \hspace*{3pt} |
---|
160 | \subfloat[mutex statement]{\label{f:RWmutexstmt}\usebox\myboxB} |
---|
161 | \caption{Readers writer problem} |
---|
162 | \label{f:ReadersWriter} |
---|
163 | \end{figure} |
---|
164 | |
---|
165 | This work adds a mutex statement to \CFA, but generalizes it beyond implicit monitor locks. |
---|
166 | In detail, the mutex statement has a clause and statement block, similar to a conditional or loop statement. |
---|
167 | The clause accepts any number of lockable objects (like a \CFA MX function prototype), and locks them for the duration of the statement. |
---|
168 | The locks are acquired in a deadlock free manner and released regardless of how control-flow exits the statement. |
---|
169 | The mutex statement provides easy lock usage in the common case of lexically wrapping a CS. |
---|
170 | Examples of \CFA mutex statement are shown in \VRef[Listing]{l:cfa_mutex_ex}. |
---|
171 | |
---|
172 | \begin{cfa}[caption={\CFA mutex statement usage},label={l:cfa_mutex_ex}] |
---|
173 | owner_lock lock1, lock2, lock3; |
---|
174 | @mutex@( lock2, lock3 ) ...; $\C{// inline statement}$ |
---|
175 | @mutex@( lock1, lock2, lock3 ) { ... } $\C{// statement block}$ |
---|
176 | void transfer( BankAccount & my, BankAccount & your, int me2you ) { |
---|
177 | ... // check values, no MX |
---|
178 | @mutex@( my, your ) { // MX is shorter duration that function body |
---|
179 | deposit( my, -me2you ); $\C{// debit}$ |
---|
180 | deposit( your, me2you ); $\C{// credit}$ |
---|
181 | } |
---|
182 | } |
---|
183 | \end{cfa} |
---|
184 | |
---|
185 | \section{Other Languages} |
---|
186 | There are similar constructs to the mutex statement in other programming languages. |
---|
187 | Java has a feature called a synchronized statement, which looks like the \CFA's mutex statement, but only accepts a single object in the clause and only handles monitor locks. |
---|
188 | The \CC standard library has a @scoped_lock@, which is also similar to the mutex statement. |
---|
189 | The @scoped_lock@ takes any number of locks in its constructor, and acquires them in a deadlock-free manner. |
---|
190 | It then releases them when the @scoped_lock@ object is deallocated using \gls{raii}. |
---|
191 | An example of \CC @scoped_lock@ is shown in \VRef[Listing]{l:cc_scoped_lock}. |
---|
192 | |
---|
193 | \begin{cfa}[caption={\CC \lstinline{scoped_lock} usage},label={l:cc_scoped_lock}] |
---|
194 | struct BankAccount { |
---|
195 | @recursive_mutex m;@ $\C{// must be recursive}$ |
---|
196 | int balance = 0; |
---|
197 | }; |
---|
198 | void deposit( BankAccount & b, int deposit ) { |
---|
199 | @scoped_lock lock( b.m );@ $\C{// RAII acquire}$ |
---|
200 | b.balance += deposit; |
---|
201 | } $\C{// RAII release}$ |
---|
202 | void transfer( BankAccount & my, BankAccount & your, int me2you ) { |
---|
203 | @scoped_lock lock( my.m, your.m );@ $\C{// RAII acquire}$ |
---|
204 | deposit( my, -me2you ); $\C{// debit}$ |
---|
205 | deposit( your, me2you ); $\C{// credit}$ |
---|
206 | } $\C{// RAII release}$ |
---|
207 | \end{cfa} |
---|
208 | |
---|
209 | \section{\CFA implementation} |
---|
210 | The \CFA mutex statement takes some ideas from both the Java and \CC features. |
---|
211 | Like Java, \CFA introduces a new statement rather than building from existing language features. |
---|
212 | (\CFA has sufficient language features to mimic \CC RAII locking.) |
---|
213 | This syntactic choice makes MX explicit rather than implicit via object declarations. |
---|
214 | Hence, it is easier for programmers and language tools to identify MX points in a program, \eg scan for all @mutex@ parameters and statements in a body of code. |
---|
215 | Furthermore, concurrent safety is provided across an entire program for the complex operation of acquiring multiple locks in a deadlock-free manner. |
---|
216 | Unlike Java, \CFA's mutex statement and \CC's @scoped_lock@ both use parametric polymorphism to allow user defined types to work with this feature. |
---|
217 | In this case, the polymorphism allows a locking mechanism to acquire MX over an object without having to know the object internals or what kind of lock it is using. |
---|
218 | \CFA's provides and uses this locking trait: |
---|
219 | \begin{cfa} |
---|
220 | forall( L & | sized(L) ) |
---|
221 | trait is_lock { |
---|
222 | void lock( L & ); |
---|
223 | void unlock( L & ); |
---|
224 | }; |
---|
225 | \end{cfa} |
---|
226 | \CC @scoped_lock@ has this trait implicitly based on functions accessed in a template. |
---|
227 | @scoped_lock@ also requires @try_lock@ because of its technique for deadlock avoidance \see{\VRef{s:DeadlockAvoidance}}. |
---|
228 | |
---|
229 | The following shows how the @mutex@ statement is used with \CFA streams to eliminate unpredictable results when printing in a concurrent program. |
---|
230 | For example, if two threads execute: |
---|
231 | \begin{cfa} |
---|
232 | thread$\(_1\)$ : sout | "abc" | "def"; |
---|
233 | thread$\(_2\)$ : sout | "uvw" | "xyz"; |
---|
234 | \end{cfa} |
---|
235 | any of the outputs can appear, included a segment fault due to I/O buffer corruption: |
---|
236 | \begin{cquote} |
---|
237 | \small\tt |
---|
238 | \begin{tabular}{@{}l|l|l|l|l@{}} |
---|
239 | abc def & abc uvw xyz & uvw abc xyz def & abuvwc dexf & uvw abc def \\ |
---|
240 | uvw xyz & def & & yz & xyz |
---|
241 | \end{tabular} |
---|
242 | \end{cquote} |
---|
243 | The stream type for @sout@ is defined to satisfy the @is_lock@ trait, so the @mutex@ statement can be used to lock an output stream while producing output. |
---|
244 | From the programmer's perspective, it is sufficient to know an object can be locked and then any necessary MX is easily available via the @mutex@ statement. |
---|
245 | This ability improves safety and programmer productivity since it abstracts away the concurrent details. |
---|
246 | Hence, a programmer can easily protect cascaded I/O expressions: |
---|
247 | \begin{cfa} |
---|
248 | thread$\(_1\)$ : mutex( sout ) sout | "abc" | "def"; |
---|
249 | thread$\(_2\)$ : mutex( sout ) sout | "uvw" | "xyz"; |
---|
250 | \end{cfa} |
---|
251 | constraining the output to two different lines in either order: |
---|
252 | \begin{cquote} |
---|
253 | \small\tt |
---|
254 | \begin{tabular}{@{}l|l@{}} |
---|
255 | abc def & uvw xyz \\ |
---|
256 | uvw xyz & abc def |
---|
257 | \end{tabular} |
---|
258 | \end{cquote} |
---|
259 | where this level of safe nondeterministic output is acceptable. |
---|
260 | Alternatively, multiple I/O statements can be protected using the mutex statement block: |
---|
261 | \begin{cfa} |
---|
262 | mutex( sout ) { // acquire stream lock for sout for block duration |
---|
263 | sout | "abc"; |
---|
264 | mutex( sout ) sout | "uvw" | "xyz"; // OK because sout lock is recursive |
---|
265 | sout | "def"; |
---|
266 | } // implicitly release sout lock |
---|
267 | \end{cfa} |
---|
268 | The inner lock acquire is likely to occur through a function call that does a thread-safe print. |
---|
269 | |
---|
270 | \section{Deadlock Avoidance}\label{s:DeadlockAvoidance} |
---|
271 | The mutex statement uses the deadlock avoidance technique of lock ordering, where the circular-wait condition of a deadlock cannot occur if all locks are acquired in the same order. |
---|
272 | The @scoped_lock@ uses a deadlock avoidance algorithm where all locks after the first are acquired using @try_lock@ and if any of the lock attempts fail, all acquired locks are released. |
---|
273 | This repeats after selecting a new starting point in a cyclic manner until all locks are acquired successfully. |
---|
274 | This deadlock avoidance algorithm is shown in Figure~\ref{f:cc_deadlock_avoid}. |
---|
275 | The algorithm is taken directly from the source code of the @<mutex>@ header, with some renaming and comments for clarity. |
---|
276 | |
---|
277 | \begin{figure} |
---|
278 | \begin{cfa} |
---|
279 | int first = 0; // first lock to attempt to lock |
---|
280 | do { |
---|
281 | // locks is the array of locks to acquire |
---|
282 | locks[first].lock(); $\C{// lock first lock}$ |
---|
283 | for ( int i = 1; i < Num_Locks; i += 1 ) { $\C{// iterate over rest of locks}$ |
---|
284 | const int idx = (first + i) % Num_Locks; |
---|
285 | if ( ! locks[idx].try_lock() ) { $\C{// try lock each one}$ |
---|
286 | for ( int j = i; j != 0; j -= 1 ) $\C{// release all locks}$ |
---|
287 | locks[(first + j - 1) % Num_Locks].unlock(); |
---|
288 | first = idx; $\C{// rotate which lock to acquire first}$ |
---|
289 | break; |
---|
290 | } |
---|
291 | } |
---|
292 | // if first lock is still held then all have been acquired |
---|
293 | } while ( ! locks[first].owns_lock() ); $\C{// is first lock held?}$ |
---|
294 | \end{cfa} |
---|
295 | \caption{\CC \lstinline{scoped_lock} deadlock avoidance algorithm} |
---|
296 | \label{f:cc_deadlock_avoid} |
---|
297 | \end{figure} |
---|
298 | |
---|
299 | While this algorithm successfully avoids deadlock, there is a livelock scenario. |
---|
300 | Assume two threads, $A$ and $B$, create a @scoped_lock@ accessing two locks, $L1$ and $L2$. |
---|
301 | A livelock can form as follows. |
---|
302 | Thread $A$ creates a @scoped_lock@ with arguments $L1$, $L2$, and $B$ creates a scoped lock with the lock arguments in the opposite order $L2$, $L1$. |
---|
303 | Both threads acquire the first lock in their order and then fail the @try_lock@ since the other lock is held. |
---|
304 | Both threads then reset their starting lock to be their second lock and try again. |
---|
305 | This time $A$ has order $L2$, $L1$, and $B$ has order $L1$, $L2$, which is identical to the starting setup but with the ordering swapped between threads. |
---|
306 | If the threads perform this action in lock-step, they cycle indefinitely without entering the CS, \ie livelock. |
---|
307 | Hence, to use @scoped_lock@ safely, a programmer must manually construct and maintain a global ordering of lock arguments passed to @scoped_lock@. |
---|
308 | |
---|
309 | The lock ordering algorithm used in \CFA mutex functions and statements is deadlock and livelock free. |
---|
310 | The algorithm uses the lock memory addresses as keys, sorts the keys, and then acquires the locks in sorted order. |
---|
311 | For fewer than 7 locks ($2^3-1$), the sort is unrolled performing the minimum number of compare and swaps for the given number of locks; |
---|
312 | for 7 or more locks, insertion sort is used. |
---|
313 | Since it is extremely rare to hold more than 6 locks at a time, the algorithm is fast and executes in $O(1)$ time. |
---|
314 | Furthermore, lock addresses are unique across program execution, even for dynamically allocated locks, so the algorithm is safe across the entire program execution. |
---|
315 | |
---|
316 | The downside to the sorting approach is that it is not fully compatible with manual usages of the same locks outside the @mutex@ statement, \ie the lock are acquired without using the @mutex@ statement. |
---|
317 | The following scenario is a classic deadlock. |
---|
318 | \begin{cquote} |
---|
319 | \begin{tabular}{@{}l@{\hspace{30pt}}l@{}} |
---|
320 | \begin{cfa} |
---|
321 | lock L1, L2; // assume &L1 < &L2 |
---|
322 | $\textbf{thread\(_1\)}$ |
---|
323 | acquire( L2 ); |
---|
324 | acquire( L1 ); |
---|
325 | CS |
---|
326 | release( L1 ); |
---|
327 | release( L2 ); |
---|
328 | \end{cfa} |
---|
329 | & |
---|
330 | \begin{cfa} |
---|
331 | |
---|
332 | $\textbf{thread\(_2\)}$ |
---|
333 | mutex( L1, L2 ) { |
---|
334 | |
---|
335 | CS |
---|
336 | |
---|
337 | } |
---|
338 | \end{cfa} |
---|
339 | \end{tabular} |
---|
340 | \end{cquote} |
---|
341 | Comparatively, if the @scoped_lock@ is used and the same locks are acquired elsewhere, there is no concern of the @scoped_lock@ deadlocking, due to its avoidance scheme, but it may livelock. |
---|
342 | The convenience and safety of the @mutex@ statement, \eg guaranteed lock release with exceptions, should encourage programmers to always use it for locking, mitigating any deadlock scenario versus combining manual locking with the mutex statement. |
---|
343 | |
---|
344 | \section{Performance} |
---|
345 | Given the two multi-acquisition algorithms in \CC and \CFA, each with differing advantages and disadvantages, it interesting to compare their performance. |
---|
346 | Comparison with Java is not possible, since it only takes a single lock. |
---|
347 | |
---|
348 | The comparison starts with a baseline that acquires the locks directly without a mutex statement or @scoped_lock@ in a fixed ordering and then releases them. |
---|
349 | The baseline helps highlight the cost of the deadlock avoidance/prevention algorithms for each implementation. |
---|
350 | |
---|
351 | The benchmark used to evaluate the avoidance algorithms repeatedly acquires a fixed number of locks in a random order and then releases them. |
---|
352 | The pseudocode for the deadlock avoidance benchmark is shown in \VRef[Listing]{l:deadlock_avoid_pseudo}. |
---|
353 | To ensure the comparison exercises the implementation of each lock avoidance algorithm, an identical spinlock is implemented in each language using a set of builtin atomics available in both \CC and \CFA. |
---|
354 | The benchmarks are run for a fixed duration of 10 seconds and then terminate. |
---|
355 | The total number of times the group of locks is acquired is returned for each thread. |
---|
356 | Each variation is run 11 times on 2, 4, 8, 16, 24, 32 cores and with 2, 4, and 8 locks being acquired. |
---|
357 | The median is calculated and is plotted alongside the 95\% confidence intervals for each point. |
---|
358 | |
---|
359 | \begin{cfa}[caption={Deadlock avoidance benchmark pseudocode},label={l:deadlock_avoid_pseudo}] |
---|
360 | |
---|
361 | size_t n_locks; $\C{// number of locks}$ |
---|
362 | size_t n_thds; $\C{// number of threads}$ |
---|
363 | size_t n_gens; $\C{// number of random orderings (default 100)}$ |
---|
364 | size_t total = 0; $\C{// global throughput aggregator}$ |
---|
365 | volatile bool done = false; $\C{// termination flag}$ |
---|
366 | |
---|
367 | test_spinlock locks[n_locks]; |
---|
368 | size_t rands[n_thds][n_locks * n_gens]; $\C{// random ordering per thread}$ |
---|
369 | |
---|
370 | void main( worker & w ) with(w) { $\C{// thread main}$ |
---|
371 | size_t count = 0, idx = 0; |
---|
372 | while ( !done ) { |
---|
373 | idx = (count % n_locks * n_gens) * n_locks; $\C{// get start of next sequence}$ |
---|
374 | mutex(locks[rands[0]], ..., locks[rands[n_locks - 1]]){} $\C{// lock sequence of locks}$ |
---|
375 | count++; |
---|
376 | } |
---|
377 | __atomic_add_fetch(&total, count, __ATOMIC_SEQ_CST); $\C{// atomically add to total}$ |
---|
378 | } |
---|
379 | |
---|
380 | int main( int argc, char * argv[] ) { |
---|
381 | gen_orders(); $\C{// generate random orderings}$ |
---|
382 | { |
---|
383 | worker w[n_thds]; |
---|
384 | sleep( 10`s ); |
---|
385 | done = true; |
---|
386 | } |
---|
387 | printf( "%lu\n", total ); |
---|
388 | } |
---|
389 | |
---|
390 | \end{cfa} |
---|
391 | |
---|
392 | The performance experiments were run on the following multi-core hardware systems to determine differences across platforms: |
---|
393 | \begin{list}{\arabic{enumi}.}{\usecounter{enumi}\topsep=5pt\parsep=5pt\itemsep=0pt} |
---|
394 | % sudo dmidecode -t system |
---|
395 | \item |
---|
396 | Supermicro AS--1123US--TR4 AMD EPYC 7662 64--core socket, hyper-threading $\times$ 2 sockets (256 processing units) 2.0 GHz, TSO memory model, running Linux v5.8.0--55--generic, gcc--10 compiler |
---|
397 | \item |
---|
398 | Supermicro SYS--6029U--TR4 Intel Xeon Gold 5220R 24--core socket, hyper-threading $\times$ 2 sockets (48 processing units) 2.2GHz, TSO memory model, running Linux v5.8.0--59--generic, gcc--10 compiler |
---|
399 | \end{list} |
---|
400 | %The hardware architectures are different in threading (multithreading vs hyper), cache structure (MESI or MESIF), NUMA layout (QPI vs HyperTransport), memory model (TSO vs WO), and energy/thermal mechanisms (turbo-boost). |
---|
401 | %Software that runs well on one architecture may run poorly or not at all on another. |
---|
402 | |
---|
403 | Figure~\ref{f:mutex_bench} shows the results of the benchmark experiments. |
---|
404 | The baseline results for both languages are mostly comparable, except for the 8 locks results in \ref{f:mutex_bench8_AMD} and \ref{f:mutex_bench8_Intel}, where the \CFA baseline is slightly slower. |
---|
405 | The avoidance result for both languages is significantly different, where \CFA's mutex statement achieves throughput that is magnitudes higher than \CC's @scoped_lock@. |
---|
406 | The slowdown for @scoped_lock@ is likely due to its deadlock-avoidance implementation. |
---|
407 | Since it uses a retry based mechanism, it can take a long time for threads to progress. |
---|
408 | Additionally the potential for livelock in the algorithm can result in very little throughput under high contention. |
---|
409 | For example, on the AMD machine with 32 threads and 8 locks, the benchmarks would occasionally livelock indefinitely, with no threads making any progress for 3 hours before the experiment was terminated manually. |
---|
410 | It is likely that shorter bouts of livelock occurred in many of the experiments, which would explain large confidence intervals for some of the data points in the \CC data. |
---|
411 | In Figures~\ref{f:mutex_bench8_AMD} and \ref{f:mutex_bench8_Intel} there is the counter-intuitive result of the mutex statement performing better than the baseline. |
---|
412 | At 7 locks and above the mutex statement switches from a hard coded sort to insertion sort, which should decrease performance. |
---|
413 | It is likely the increase in throughput compared to baseline is due to the delay spent in the insertion sort, which decreases contention on the locks. |
---|
414 | |
---|
415 | \begin{figure} |
---|
416 | \centering |
---|
417 | \captionsetup[subfloat]{labelfont=footnotesize,textfont=footnotesize} |
---|
418 | \subfloat[AMD]{ |
---|
419 | \resizebox{0.5\textwidth}{!}{\input{figures/nasus_Aggregate_Lock_2.pgf}} |
---|
420 | } |
---|
421 | \subfloat[Intel]{ |
---|
422 | \resizebox{0.5\textwidth}{!}{\input{figures/pyke_Aggregate_Lock_2.pgf}} |
---|
423 | } |
---|
424 | \bigskip |
---|
425 | |
---|
426 | \subfloat[AMD]{ |
---|
427 | \resizebox{0.5\textwidth}{!}{\input{figures/nasus_Aggregate_Lock_4.pgf}} |
---|
428 | } |
---|
429 | \subfloat[Intel]{ |
---|
430 | \resizebox{0.5\textwidth}{!}{\input{figures/pyke_Aggregate_Lock_4.pgf}} |
---|
431 | } |
---|
432 | \bigskip |
---|
433 | |
---|
434 | \subfloat[AMD]{ |
---|
435 | \resizebox{0.5\textwidth}{!}{\input{figures/nasus_Aggregate_Lock_8.pgf}} |
---|
436 | \label{f:mutex_bench8_AMD} |
---|
437 | } |
---|
438 | \subfloat[Intel]{ |
---|
439 | \resizebox{0.5\textwidth}{!}{\input{figures/pyke_Aggregate_Lock_8.pgf}} |
---|
440 | \label{f:mutex_bench8_Intel} |
---|
441 | } |
---|
442 | \caption{The aggregate lock benchmark comparing \CC \lstinline{scoped_lock} and \CFA mutex statement throughput (higher is better).} |
---|
443 | \label{f:mutex_bench} |
---|
444 | \end{figure} |
---|
445 | |
---|
446 | % Local Variables: % |
---|
447 | % tab-width: 4 % |
---|
448 | % End: % |
---|