Changeset 6d43cc57 for doc/papers/concurrency/Paper.tex

- Timestamp: Jun 22, 2018, 1:51:56 PM (7 years ago)
- Branches: ADT, aaron-thesis, arm-eh, ast-experimental, cleanup-dtors, deferred_resn, demangler, enum, forall-pointer-decay, jacob/cs343-translation, jenkins-sandbox, master, new-ast, new-ast-unique-expr, new-env, no_list, persistent-indexer, pthread-emulation, qualifiedEnum, with_gc
- Children: 3d56d15b
- Parents: 999c700
- Files: 1 edited
doc/papers/concurrency/Paper.tex
r999c700 → r6d43cc57

@@ -1537 +1537 @@
  External scheduling allows users to wait for events from other threads without concern of unrelated events occurring.
  The mechanism can be done in terms of control flow, \eg Ada @accept@ or \uC @_Accept@, or in terms of data, \eg Go channels.
- Of course, both of these paradigms have their own strengths and weaknesses, but for this project, control-flow semantics was chosen to stay consistent with the rest of the languages semantics.
- Two challenges specific to \CFA arise when trying to add external scheduling with loose object definitions and multiple-monitor routines.
- The previous example shows a simple use @_Accept@ versus @wait@/@signal@ and its advantages.
- Note that while other languages often use @accept@/@select@ as the core external scheduling keyword, \CFA uses @waitfor@ to prevent name collisions with existing socket \textbf{api}s.
+ While both mechanisms have strengths and weaknesses, this project uses a control-flow mechanism to stay consistent with other language semantics.
+ Two challenges specific to \CFA for external scheduling are loose object-definitions (see Section~\ref{s:LooseObjectDefinitions}) and multiple-monitor routines (see Section~\ref{s:Multi-MonitorScheduling}).

  For internal scheduling, non-blocking signalling (as in the producer/consumer example) is used when the signaller is providing the cooperation for a waiting thread;
…
@@ -1654 +1652 @@
  \end{cfa}
  must have acquired monitor locks that are greater than or equal to the number of locks for the waiting thread signalled from the condition queue.
- {\color{red}In general, the signaller does not know the order of waiting threads, so in general, it must acquire the maximum number of mutex locks for the worst-case waiting thread.}

  Similarly, for @waitfor( rtn )@, the default semantics is to atomically block the acceptor and release all acquired mutex types in the parameter list, \ie @waitfor( rtn, m1, m2 )@.
…
@@ -1854 +1851 @@
  The extra challenge is that this dependency graph is effectively post-mortem, but the runtime system needs to be able to build and solve these graphs as the dependencies unfold.
  Resolving dependency graphs being a complex and expensive endeavour, this solution is not the preferred one.
-
- \subsubsection{Partial Signalling} \label{partial-sig}
  \end{comment}

…
@@ -1932 +1927 @@

  \subsection{Loose Object Definitions}
-
- In \uC, a monitor class declaration includes an exhaustive list of monitor operations.
- Since \CFA is not object oriented, monitors become both more difficult to implement and less clear for a user:
-
- \begin{cfa}
- monitor A {};
-
- void f(A & mutex a);
- void g(A & mutex a) {
- 	waitfor(f);	// Obvious which f() to wait for
- }
-
- void f(A & mutex a, int);	// New different F added in scope
- void h(A & mutex a) {
- 	waitfor(f);	// Less obvious which f() to wait for
- }
- \end{cfa}
-
- Furthermore, external scheduling is an example where implementation constraints become visible from the interface.
- Here is the cfa-code for the entering phase of a monitor:
- \begin{center}
- \begin{tabular}{l}
- \begin{cfa}
- if monitor is free
- 	enter
- elif already own the monitor
- 	continue
- elif monitor accepts me
- 	enter
- else
- 	block
- \end{cfa}
- \end{tabular}
- \end{center}
+ \label{s:LooseObjectDefinitions}
+
+ In an object-oriented programming-language, a class includes an exhaustive list of operations.
+ However, new members can be added via static inheritance or dynamic members, \eg JavaScript~\cite{JavaScript}.
+ Similarly, monitor routines can be added at any time in \CFA, making it less clear for programmers and more difficult to implement.
+ \begin{cfa}
+ monitor M {};
+ void `f`( M & mutex m );
+ void g( M & mutex m ) { waitfor( `f` ); }	$\C{// clear which f}$
+ void `f`( M & mutex m, int );	$\C{// different f}$
+ void h( M & mutex m ) { waitfor( `f` ); }	$\C{// unclear which f}$
+ \end{cfa}
+ Hence, the cfa-code for entering a monitor looks like:
+ \begin{cfa}
+ if ( $\textrm{\textit{monitor is free}}$ )	$\LstCommentStyle{// \color{red}enter}$
+ else if ( $\textrm{\textit{already own monitor}}$ )	$\LstCommentStyle{// \color{red}continue}$
+ else if ( $\textrm{\textit{monitor accepts me}}$ )	$\LstCommentStyle{// \color{red}enter}$
+ else	$\LstCommentStyle{// \color{red}block}$
+ \end{cfa}
  For the first two conditions, it is easy to implement a check that can evaluate the condition in a few instructions.
- However, a fast check for @monitor accepts me@ is much harder to implement depending on the constraints put on the monitors.
- Indeed, monitors are often expressed as an entry queue and some acceptor queue as in Figure~\ref{fig:ClassicalMonitor}.
+ However, a fast check for \emph{monitor accepts me} is much harder to implement depending on the constraints put on the monitors.
+ Figure~\ref{fig:ClassicalMonitor} shows monitors are often expressed as an entry (calling) queue, some acceptor queues, and an urgent stack/queue.

  \begin{figure}
  \centering
- \subfloat[Classical Monitor] {
+ \subfloat[Classical monitor] {
  \label{fig:ClassicalMonitor}
- {\resizebox{0.45\textwidth}{!}{\input{monitor}}}
+ {\resizebox{0.45\textwidth}{!}{\input{monitor.pstex_t}}}
  }% subfloat
- \qquad
- \subfloat[bulk acquire Monitor] {
+ \quad
+ \subfloat[Bulk acquire monitor] {
  \label{fig:BulkMonitor}
- {\resizebox{0.45\textwidth}{!}{\input{ext_monitor}}}
+ {\resizebox{0.45\textwidth}{!}{\input{ext_monitor.pstex_t}}}
  }% subfloat
- \caption{External Scheduling Monitor}
+ \caption{Monitor Implementation}
+ \label{f:MonitorImplementation}
  \end{figure}

- There are other alternatives to these pictures, but in the case of the left picture, implementing a fast accept check is relatively easy.
- Restricted to a fixed number of mutex members, N, the accept check reduces to updating a bitmask when the acceptor queue changes, a check that executes in a single instruction even with a fairly large number (\eg 128) of mutex members.
- This approach requires a unique dense ordering of routines with an upper-bound and that ordering must be consistent across translation units.
- For OO languages these constraints are common, since objects only offer adding member routines consistently across translation units via inheritance.
- However, in \CFA users can extend objects with mutex routines that are only visible in certain translation units.
- This means that establishing a program-wide dense-ordering among mutex routines can only be done in the program linking phase, and still could have issues when using dynamically shared objects.
-
- The alternative is to alter the implementation as in Figure~\ref{fig:BulkMonitor}.
- Here, the mutex routine called is associated with a thread on the entry queue while a list of acceptable routines is kept separate.
- Generating a mask dynamically means that the storage for the mask information can vary between calls to @waitfor@, allowing for more flexibility and extensions.
- Storing an array of accepted routine pointers replaces the single instruction bitmask comparison with dereferencing a pointer followed by a linear search.
- Furthermore, supporting nested external scheduling (\eg listing \ref{f:nest-ext}) may now require additional searches for the @waitfor@ statement to check if a routine is already queued.
-
+ For a fixed (small) number of mutex routines (\eg 128), the accept check reduces to a bitmask of allowed callers, which can be checked with a single instruction.
+ This approach requires a unique dense ordering of routines with a small upper-bound and the ordering must be consistent across translation units.
+ For object-oriented languages these constraints are common, but \CFA mutex routines can be added in any scope and are only visible in certain translation units, precluding program-wide dense-ordering among mutex routines.
+
+ Figure~\ref{fig:BulkMonitor} shows the \CFA monitor implementation.
+ The mutex routine called is associated with each thread on the entry queue, while a list of acceptable routines is kept separately.
+ The accepted list is a variable-sized array of accepted routine pointers, so the single instruction bitmask comparison is replaced by dereferencing a pointer followed by a linear search.
+
+ \begin{comment}
  \begin{figure}
  \begin{cfa}[caption={Example of nested external scheduling},label={f:nest-ext}]
…
@@ -2020 +1997 @@
  In the end, the most flexible approach has been chosen since it allows users to write programs that would otherwise be hard to write.
  This decision is based on the assumption that writing fast but inflexible locks is closer to a solved problem than writing locks that are as flexible as external scheduling in \CFA.
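The bitmask-versus-list trade-off discussed in this hunk can be sketched in plain C. This is an illustrative sketch, not code from the \CFA runtime: with a dense ordering of at most 128 mutex routines, the accept check is a shift and an AND; without one, the accept set is an array of routine pointers searched linearly.

```c
#include <stdint.h>
#include <string.h>

/* Fixed-ordering scheme: every mutex routine is assigned a dense index
   below 128, and a monitor's accept set is a 128-bit mask (two 64-bit
   words), so the "monitor accepts me" test is a shift and an AND. */
typedef struct { uint64_t w[2]; } accept_mask_t;

static inline void mask_clear( accept_mask_t * m ) { memset( m, 0, sizeof *m ); }
static inline void mask_add( accept_mask_t * m, unsigned idx ) {
	m->w[idx >> 6] |= (uint64_t)1 << (idx & 63);
}
static inline int mask_accepts( const accept_mask_t * m, unsigned idx ) {
	return (int)((m->w[idx >> 6] >> (idx & 63)) & 1);
}

/* Dynamic scheme: without a program-wide dense ordering, the accept set
   is a variable-sized array of routine pointers, checked by linear search. */
typedef void (*mutex_routine_t)( void );
static int list_accepts( mutex_routine_t const list[], int n, mutex_routine_t r ) {
	for ( int i = 0; i < n; i += 1 )
		if ( list[i] == r ) return 1;
	return 0;
}
```

The sketch makes the cost difference concrete: `mask_accepts` is constant time regardless of how many routines are accepted, while `list_accepts` grows with the size of the accept set.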
-
- % ======================================================================
- % ======================================================================
+ \end{comment}
+
+
  \subsection{Multi-Monitor Scheduling}
- % ======================================================================
- % ======================================================================
+ \label{s:Multi-MonitorScheduling}

  External scheduling, like internal scheduling, becomes significantly more complex when introducing multi-monitor syntax.
- Even in the simplest possible case, some new semantics needs to be established:
+ Even in the simplest possible case, new semantics needs to be established:
  \begin{cfa}
  monitor M {};
-
- void f(M & mutex a);
-
- void g(M & mutex b, M & mutex c) {
- 	waitfor(f);	// two monitors M => unknown which to pass to f(M & mutex)
- }
- \end{cfa}
- The obvious solution is to specify the correct monitor as follows:
-
+ void f( M & mutex m1 );
+ void g( M & mutex m1, M & mutex m2 ) {
+ 	waitfor( f );	$\C{// pass m1 or m2 to f?}$
+ }
+ \end{cfa}
+ The solution is for the programmer to disambiguate:
+ \begin{cfa}
+ 	waitfor( f, m2 );	$\C{// wait for call to f with argument m2}$
+ \end{cfa}
+ Routine @g@ has acquired both locks, so when routine @f@ is called, the lock for monitor @m2@ is passed from @g@ to @f@ (while @g@ still holds lock @m1@).
+ This behaviour can be extended to the multi-monitor @waitfor@ statement.
  \begin{cfa}
  monitor M {};
-
- void f(M & mutex a);
-
- void g(M & mutex a, M & mutex b) {
- 	// wait for call to f with argument b
- 	waitfor(f, b);
- }
- \end{cfa}
- This syntax is unambiguous.
- Both locks are acquired and kept by @g@.
- When routine @f@ is called, the lock for monitor @b@ is temporarily transferred from @g@ to @f@ (while @g@ still holds lock @a@).
- This behaviour can be extended to the multi-monitor @waitfor@ statement as follows.
-
- \begin{cfa}
- monitor M {};
-
- void f(M & mutex a, M & mutex b);
-
- void g(M & mutex a, M & mutex b) {
- 	// wait for call to f with arguments a and b
- 	waitfor(f, a, b);
- }
- \end{cfa}
-
- Note that the set of monitors passed to the @waitfor@ statement must be entirely contained in the set of monitors already acquired in the routine. @waitfor@ used in any other context is undefined behaviour.
+ void f( M & mutex m1, M & mutex m2 );
+ void g( M & mutex m1, M & mutex m2 ) {
+ 	waitfor( f, m1, m2 );	$\C{// wait for call to f with arguments m1 and m2}$
+ }
+ \end{cfa}
+ Again, the set of monitors passed to the @waitfor@ statement must be entirely contained in the set of monitors already acquired by the accepting routine.

  An important behaviour to note is when a set of monitors only match partially:
-
  \begin{cfa}
  mutex struct A {};
-
  mutex struct B {};
-
- void g(A & mutex a, B & mutex b) {
- 	waitfor(f, a, b);
- }
-
+ void g( A & mutex m1, B & mutex m2 ) {
+ 	waitfor( f, m1, m2 );
+ }
  A a1, a2;
  B b;
-
  void foo() {
- 	g(a1, b); // block on accept
- }
-
+ 	g( a1, b ); // block on accept
+ }
  void bar() {
- 	f(a2, b); // fulfill cooperation
+ 	f( a2, b ); // fulfill cooperation
  }
  \end{cfa}
…
@@ -2094 +2047 @@
  It is also important to note that in the case of external scheduling the order of parameters is irrelevant; @waitfor(f,a,b)@ and @waitfor(f,b,a)@ are indistinguishable waiting conditions.

- % ======================================================================
- % ======================================================================
+
  \subsection{\protect\lstinline|waitfor| Semantics}
- % ======================================================================
- % ======================================================================

  Syntactically, the @waitfor@ statement takes a routine identifier and a set of monitors.
…
@@ -2197 +2147 @@
  \end{figure}

- % ======================================================================
- % ======================================================================
+
  \subsection{Waiting For The Destructor}
- % ======================================================================
- % ======================================================================
+
  An interesting use for the @waitfor@ statement is destructor semantics.
  Indeed, the @waitfor@ statement can accept any @mutex@ routine, which includes the destructor (see section \ref{data}).
…
@@ -2228 +2176 @@


- % (seven-line ASCII-art banner comment spelling ``PARALLELISM'')
  \section{Parallelism}
+
  Historically, computer performance was about processor speeds and instruction counts.
  However, with heat dissipation being a direct consequence of speed increase, parallelism has become the new source for increased performance~\cite{Sutter05, Sutter05b}.
…
@@ -2245 +2187 @@
  While there are many variations of the presented paradigms, most of these variations do not actually change the guarantees or the semantics, they simply move costs in order to achieve better performance for certain workloads.

+
  \section{Paradigms}
+
+
  \subsection{User-Level Threads}
+
  A direct improvement on the \textbf{kthread} approach is to use \textbf{uthread}.
  These threads offer most of the same features that the operating system already provides but can be used on a much larger scale.
…
@@ -2255 +2201 @@
  Examples of languages that support \textbf{uthread} are Erlang~\cite{Erlang} and \uC~\cite{uC++book}.

+
  \subsection{Fibers : User-Level Threads Without Preemption} \label{fibers}
+
  A popular variant of \textbf{uthread} is what is often referred to as \textbf{fiber}.
  However, \textbf{fiber} do not present meaningful semantic differences with \textbf{uthread}.
…
@@ -2264 +2212 @@
  An example of a language that uses fibers is Go~\cite{Go}

+
  \subsection{Jobs and Thread Pools}
+
  An approach on the opposite end of the spectrum is to base parallelism on \textbf{pool}.
  Indeed, \textbf{pool} offer limited flexibility but at the benefit of a simpler user interface.
…
@@ -2275 +2225 @@
  The gold standard of this implementation is Intel's TBB library~\cite{TBB}.

+
  \subsection{Paradigm Performance}
+
  While the choice between the three paradigms listed above may have significant performance implications, it is difficult to pin down the performance implications of choosing a model at the language level.
  Indeed, in many situations one of these paradigms may show better performance but it all strongly depends on the workload.
…
@@ -2283 +2235 @@
  Finally, if the units of uninterrupted work are large enough, the paradigm choice is largely amortized by the actual work done.

+
  \section{The \protect\CFA\ Kernel : Processors, Clusters and Threads}\label{kernel}
+
  A \textbf{cfacluster} is a group of \textbf{kthread} executed in isolation. \textbf{uthread} are scheduled on the \textbf{kthread} of a given \textbf{cfacluster}, allowing organization between \textbf{uthread} and \textbf{kthread}.
  It is important that \textbf{kthread} belonging to a same \textbf{cfacluster} have homogeneous settings, otherwise migrating a \textbf{uthread} from one \textbf{kthread} to the other can cause issues.
…
@@ -2291 +2245 @@
  Currently \CFA only supports one \textbf{cfacluster}, the initial one.

+
  \subsection{Future Work: Machine Setup}\label{machine}
+
  While this was not done in the context of this paper, another important aspect of clusters is affinity.
  While many common desktop and laptop PCs have homogeneous CPUs, other devices often have more heterogeneous setups.
…
@@ -2297 +2253 @@
  OS support for CPU affinity is now common~\cite{affinityLinux, affinityWindows, affinityFreebsd, affinityNetbsd, affinityMacosx}, which means it is both possible and desirable for \CFA to offer an abstraction mechanism for portable CPU affinity.

+
  \subsection{Paradigms}\label{cfaparadigms}
+
  Given these building blocks, it is possible to reproduce all three of the popular paradigms.
  Indeed, \textbf{uthread} is the default paradigm in \CFA.
…
@@ -2305 +2263 @@


-
  \section{Behind the Scenes}
+
  There are several challenges specific to \CFA when implementing concurrency.
  These challenges are a direct result of bulk acquire and loose object definitions.
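The per-cluster affinity this subsection anticipates ultimately reduces to the OS calls cited (\eg the Linux interface in ~\cite{affinityLinux}). A minimal Linux-only sketch, with a hypothetical helper name, pins one kernel thread of a cluster to a single CPU:

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

// Pin the given kernel thread to one CPU (Linux-specific sketch; the
// other cited systems expose affinity through different interfaces).
// Returns 0 on success, an errno value on failure.
static int pin_kthread( pthread_t kt, int cpu ) {
	cpu_set_t set;
	CPU_ZERO( &set );
	CPU_SET( cpu, &set );
	return pthread_setaffinity_np( kt, sizeof(set), &set );
}
```

A portable \CFA abstraction would wrap per-platform variants of this call behind one cluster-configuration interface, which is exactly the future work the subsection describes.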
…
@@ -2323 +2281 @@
  Note that since the major contributions of this paper are extending monitor semantics to bulk acquire and loose object definitions, any challenges that are not resulting of these characteristics of \CFA are considered as solved problems and therefore not discussed.

- % ======================================================================
- % ======================================================================
+
  \section{Mutex Routines}
- % ======================================================================
- % ======================================================================

  The first step towards the monitor implementation is simple @mutex@ routines.
…
@@ -2362 +2317 @@
  \end{figure}

+
  \subsection{Details: Interaction with polymorphism}
+
  Depending on the choice of semantics for when monitor locks are acquired, interaction between monitors and \CFA's concept of polymorphism can be more complex to support.
  However, it is shown that entry-point locking solves most of the issues.
…
@@ -2442 +2399 @@
  Furthermore, entry-point locking requires less code generation since any useful routine is called multiple times but there is only one entry point for many call sites.

- % ======================================================================
- % ======================================================================
+
  \section{Threading} \label{impl:thread}
- % ======================================================================
- % ======================================================================

  Figure \ref{fig:system1} shows a high-level picture of the \CFA runtime system in regards to concurrency.
…
@@ -2459 +2413 @@
  \end{figure}

+
  \subsection{Processors}
+
  Parallelism in \CFA is built around using processors to specify how much parallelism is desired.
  \CFA processors are object wrappers around kernel threads, specifically @pthread@s in the current implementation of \CFA.
  Indeed, any parallelism must go through operating-system libraries.
…
@@ -2467 +2423 @@
  Processors internally use coroutines to take advantage of the existing context-switching semantics.

+
  \subsection{Stack Management}
+
  One of the challenges of this system is to reduce the footprint as much as possible.
  Specifically, all @pthread@s created also have a stack created with them, which should be used as much as possible.
…
@@ -2474 +2432 @@
  In order to respect C user expectations, the stack of the initial kernel thread, the main stack of the program, is used by the main user thread rather than the main processor, which can grow very large.

+
  \subsection{Context Switching}
+
  As mentioned in section \ref{coroutine}, coroutines are a stepping stone for implementing threading, because they share the same mechanism for context-switching between different stacks.
  To improve performance and simplicity, context-switching is implemented using the following assumption: all context-switches happen inside a specific routine call.
…
@@ -2488 +2448 @@
  This option is not currently present in \CFA, but the changes required to add it are strictly additive.

+
  \subsection{Preemption} \label{preemption}
+
  Finally, an important aspect for any complete threading system is preemption.
  As mentioned in section \ref{basics}, preemption introduces an extra degree of uncertainty, which enables users to have multiple threads interleave transparently, rather than having to cooperate among threads for proper scheduling and CPU distribution.
…
@@ -2522 +2484 @@
  Indeed, @sigwait@ can differentiate signals sent from @pthread_sigqueue@ from signals sent from alarms or the kernel.

+
  \subsection{Scheduler}
  Finally, an aspect that was not mentioned yet is the scheduling algorithm.
…
@@ -2527 +2490 @@
  Further discussion on scheduling is present in section \ref{futur:sched}.

- % ======================================================================
- % ======================================================================
+
  \section{Internal Scheduling} \label{impl:intsched}
- % ======================================================================
- % ======================================================================
+
  The following figure is the traditional illustration of a monitor (repeated from page~\pageref{fig:ClassicalMonitor} for convenience):

  \begin{figure}
  \begin{center}
- {\resizebox{0.4\textwidth}{!}{\input{monitor}}}
+ {\resizebox{0.4\textwidth}{!}{\input{monitor.pstex_t}}}
  \end{center}
  \caption{Traditional illustration of a monitor}