Changeset 2dcd80a
- Timestamp:
- Dec 14, 2022, 12:23:42 PM (3 years ago)
- Branches:
- ADT, ast-experimental, master
- Children:
- 441a6a7
- Parents:
- 7d9598d8 (diff), d8bdf13 (diff)
Note: this is a merge changeset, the changes displayed below correspond to the merge itself.
Use the (diff) links above to see all the changes relative to each parent.
- Files:
- 9 added
- 2 deleted
- 59 edited
- 8 moved
doc/theses/thierry_delisle_PhD/thesis/fig/base_ts2.fig
r7d9598d8 r2dcd80a: xfig source update only. The text-object font metrics are adjusted and the label ``Timestamps Copies'' is renamed to ``Timestamp Copies''; the remaining labels (Ready Threads, Array of Queues, Array of Timestamp Copies, Processors, TS, MA) are unchanged. -
doc/theses/thierry_delisle_PhD/thesis/local.bib
r7d9598d8 r2dcd80a: six bibliography entries are added (shown below cleaned of diff markup); the surrounding entries are unchanged.

@book{MAN:inteldev,
    key       = {Intel 64 and IA-32 Architectures Software Developer's Manual},
    title     = {Intel{\textregistered} 64 and IA-32 Architectures Software Developer's Manual},
    publisher = {Intel{\textregistered}},
    year      = {2016},
    volume    = {3B: System Programming Guide, Part 2},
    url       = {\href{https://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-vol-3b-part-2-manual.html}{https://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-vol-3b-part-2-manual.html}},
}

@misc{MAN:kotlin,
    howpublished = {\href{https://kotlinlang.org/docs/multiplatform-mobile-concurrency-and-coroutines.html}{https://kotlinlang.org/docs/multiplatform-mobile-concurrency-and-coroutines.html}},
}

@misc{wiki:rdtsc,
    author       = "{Wikipedia contributors}",
    title        = "Time Stamp Counter --- {W}ikipedia{,} The Free Encyclopedia",
    year         = "2022",
    howpublished = "\href{https://en.wikipedia.org/wiki/Time_Stamp_Counter}{https://en.wikipedia.org/wiki/Time\_Stamp\_Counter}",
    note         = "[Online; accessed 14-November-2022]",
}

@misc{wiki:lockfree,
    author       = "{Wikipedia contributors}",
    title        = "Non-blocking algorithm --- {W}ikipedia{,} The Free Encyclopedia",
    year         = "2022",
    howpublished = "\href{https://en.wikipedia.org/wiki/Non-blocking_algorithm}{https://en.wikipedia.org/wiki/Non-blocking\_algorithm}",
    note         = "[Online; accessed 22-November-2022]",
}

@misc{wiki:expected,
    author       = "{Wikipedia contributors}",
    title        = "Expected value --- {W}ikipedia{,} The Free Encyclopedia",
    year         = "2022",
    howpublished = "\href{https://en.wikipedia.org/wiki/Expected_value}{https://en.wikipedia.org/wiki/Expected\_value}",
    note         = "[Online; accessed 22-November-2022]",
}

@misc{wiki:softirq,
    author       = "{Wikipedia contributors}",
    title        = "Interrupt --- {W}ikipedia{,} The Free Encyclopedia",
    year         = "2022",
    howpublished = "\href{https://en.wikipedia.org/wiki/Interrupt}{https://en.wikipedia.org/wiki/Interrupt}",
    note         = "[Online; accessed 24-November-2022]",
} -
doc/theses/thierry_delisle_PhD/thesis/text/conclusion.tex
r7d9598d8 r2dcd80a 37 37 Spinning up internal kernel threads to handle blocking scenarios is what developers already do outside of the kernel, and managing these threads adds a significant burden to the system. 38 38 Nonblocking I/O should not be handled in this way. 39 Presumably, this is better handled by Windows's ``overlapped I/O'', however porting \CFA to Windows is far beyond the scope of this work. 39 40 40 41 \section{Goals} … … 58 59 59 60 As \CFA aims to increase productivity and safety of C, while maintaining its performance, this places a huge burden on the \CFA runtime to achieve these goals. 60 Productivity and safety manifest in removing scheduling pitfalls inthe efficient usage of the threading runtime.61 Productivity and safety manifest in removing scheduling pitfalls from the efficient usage of the threading runtime. 61 62 Performance manifests in making efficient use of the underlying kernel threads that provide indirect access to the CPUs. 62 63 … … 99 100 I am aware there is a host of low-power research that could be tapped here. 100 101 102 \subsection{CPU Workloads} 103 A performance consideration related to idle sleep is cpu utilization, \ie, how easy is it 104 CPU utilization generally becomes an issue for workloads that are compute bound but where the dependencies among \ats can prevent the scheduler from easily. 105 Examining such workloads in the context of scheduling would be interesting. 106 However, such workloads are inherently more complex than applications examined in this thesis, and as such warrant it's own work. 107 101 108 \subsection{Hardware} 102 109 One challenge that needed to be overcome for this thesis is that the modern x86-64 processors have very few tools to implement fairness. -
doc/theses/thierry_delisle_PhD/thesis/text/core.tex
r7d9598d8 r2dcd80a 24 24 In general, the expectation at the centre of this model is that ready \ats do not interfere with each other but simply share the hardware. 25 25 This assumption makes it easier to reason about threading because ready \ats can be thought of in isolation and the effect of the scheduler can be virtually ignored. 26 This expectation of \at independence means the scheduler is expected to offer two guarantees:26 This expectation of \at independence means the scheduler is expected to offer two features: 27 27 \begin{enumerate} 28 \item A fairness guarantee: a \at that is ready to run is not prevented by another thread .29 \item A performance g uarantee: a \at that wants to start or stop running is not prevented by other threads wanting to do the same.28 \item A fairness guarantee: a \at that is ready to run is not prevented by another thread indefinitely, \ie, starvation freedom. This is discussed further in the next section. 29 \item A performance goal: given a \at that wants to start running, other threads wanting to do the same do not interfere with it. 30 30 \end{enumerate} 31 31 32 It is important to note that these guarantees are expected only up to a point. 33 \Glspl{at} that are ready to run should not be prevented from doing so, but they still share the limited hardware resources. 34 Therefore, the guarantee is considered respected if a \at gets access to a \emph{fair share} of the hardware resources, even if that share is very small. 35 36 Similar to the performance guarantee, the lack of interference among threads is only relevant up to a point. 37 Ideally, the cost of running and blocking should be constant regardless of contention, but the guarantee is considered satisfied if the cost is not \emph{too high} with or without contention. 32 The performance goal, the lack of interference among threads, is only desired up to a point. 33 Ideally, the cost of running and blocking should be constant regardless of contention, but the goal is considered satisfied if the cost is not \emph{too high} with or without contention. 38 34 How much is an acceptable cost is obviously highly variable. 39 35 For this document, the performance experimentation attempts to show the cost of scheduling is at worst equivalent to existing algorithms used in popular languages. 40 36 This demonstration can be made by comparing applications built in \CFA to applications built with other languages or other models. 41 37 Recall programmer expectation is that the impact of the scheduler can be ignored. 42 Therefore, if the cost of scheduling is competitive with other popular languages, the g uarantee is considered achieved.38 Therefore, if the cost of scheduling is competitive with other popular languages, the goal is considered satisfied. 43 39 More precisely the scheduler should be: 44 40 \begin{itemize} 45 \item As fast as other schedulers that are less fair.46 \item Faster than other schedulers that have equal or better fairness.41 \item As fast as other schedulers without any fairness guarantee. 42 \item Faster than other schedulers that have equal or stronger fairness guarantees. 47 43 \end{itemize} 48 44 49 45 \subsection{Fairness Goals} 50 For this work, fairness is considered to have two strongly related requirements: true starvation freedom and ``fast'' load balancing. 51 52 \paragraph{True starvation freedom} means as long as at least one \proc continues to dequeue \ats, all ready \ats should be able to run eventually, \ie, eventual progress. 
46 For this work, fairness is considered to have two strongly related requirements: 47 48 \paragraph{Starvation freedom} means as long as at least one \proc continues to dequeue \ats, all ready \ats should be able to run eventually, \ie, eventual progress. 49 Starvation freedom can be bounded or unbounded. 50 In the bounded case, all \ats should be able to run within a fix bound, relative to its own enqueue. 51 Whereas unbounded starvation freedom only requires the \at to eventually run. 52 The \CFA scheduler aims to guarantee unbounded starvation freedom. 53 53 In any running system, a \proc can stop dequeuing \ats if it starts running a \at that never blocks. 54 Without preemption, traditional work-stealing schedulers do not have starvation freedom in this case.55 Now, this requirement begs the question, what about preemption?54 Without preemption, traditional work-stealing schedulers do not have starvation freedom, bounded or unbounded. 55 Now, this requirement raises the question, what about preemption? 56 56 Generally speaking, preemption happens on the timescale of several milliseconds, which brings us to the next requirement: ``fast'' load balancing. 57 57 58 \paragraph{Fast load balancing} means that load balancing should happen faster than preemption would normally allow. 59 For interactive applications that need to run at 60, 90 or 120 frames per second, \ats having to wait for several milliseconds to run are effectively starved. 60 Therefore load-balancing should be done at a faster pace, one that can detect starvation at the microsecond scale. 61 With that said, this is a much fuzzier requirement since it depends on the number of \procs, the number of \ats and the general \gls{load} of the system. 58 \paragraph{Fast load balancing} means that while eventual progress is guaranteed, it is important to mention on which timescale this progress is expected to happen. 59 Indeed, while a scheduler with bounded starvation freedom is beyond the scope of this work, offering a good expected bound in the mathematical sense~\cite{wiki:expected} is desirable. 60 The expected bound on starvation freedom should be tighter than what preemption normally allows. 61 For interactive applications that need to run at 60, 90 or 120 frames per second, \ats having to wait milliseconds to run are effectively starved. 62 Therefore load-balancing should be done at a faster pace: one that is expected to detect starvation at the microsecond scale. 62 63 63 64 \subsection{Fairness vs Scheduler Locality} \label{fairnessvlocal} … … 68 69 69 70 For a scheduler, having good locality, \ie, having the data local to each \gls{hthrd}, generally conflicts with fairness. 70 Indeed, good locality often requires avoiding the movement of cache lines, while fairness requires dynamically moving a \at, and as consequence cache lines, to a \gls{hthrd} that is currently available.71 Note that this section discusses \emph{internal locality}, \ie, the locality of the data used by the scheduler versus \emph{external locality}, \ie, how scheduling affects the locality of the application's data.71 Indeed, good locality often requires avoiding the movement of cache lines, while fairness requires dynamically moving a \at, and as a consequence cache lines, to a \gls{hthrd} that is currently available. 72 Note that this section discusses \emph{internal locality}, \ie, the locality of the data used by the scheduler, versus \emph{external locality}, \ie, how scheduling affects the locality of the application's data. 
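As an illustration of the distinction drawn above, the two flavours of starvation freedom can be written as follows, where $e_t$ and $d_t$ denote the enqueue and dequeue times of a ready \at $t$ and $B$ is a fixed bound; this notation is introduced here only for the example.
\[
\text{bounded:}\;\; \exists B,\ \forall t:\ d_t - e_t \le B
\qquad\qquad
\text{unbounded:}\;\; \forall t:\ d_t < \infty
\]
Both statements are conditional on at least one \proc continuing to dequeue \ats, and the \CFA scheduler aims for the unbounded form.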
72 73 External locality is a much more complicated subject and is discussed in the next section. 73 74 74 However, I claim that in practice it is possible to strike a balance between fairness and performance because these goals do not necessarily overlap temporally. 75 Figure~\ref{fig:fair} shows a visual representation of this behaviour. 76 As mentioned, some unfairness is acceptable; therefore it is desirable to have an algorithm that prioritizes cache locality as long as thread delay does not exceed the execution mental model. 75 However, I claim that in practice it is possible to strike a balance between fairness and performance because these requirements do not necessarily overlap temporally. 76 Figure~\ref{fig:fair} shows a visual representation of this effect. 77 As mentioned, some unfairness is acceptable; for example, once the bounded starvation guarantee is met, additional fairness will not satisfy it \emph{more}. 78 Inversely, once a \at's data is evicted from cache, its locality cannot worsen. 79 Therefore it is desirable to have an algorithm that prioritizes cache locality as long as the fairness guarantee is also satisfied. 77 80 78 81 \begin{figure} … … 88 91 \subsection{Performance Challenges}\label{pref:challenge} 89 92 While there exists a multitude of potential scheduling algorithms, they generally always have to contend with the same performance challenges. 90 Since these challenges are recurring themes in the design of a scheduler it is relevant to describe the central ones here before looking at the design. 93 Since these challenges are recurring themes in the design of a scheduler it is relevant to describe them here before looking at the scheduler's design. 94 95 \subsubsection{Latency} 96 The most basic performance metric of a scheduler is scheduling latency. 97 This measures the how long it takes for a \at to run once scheduled, including the cost of scheduling itself. 98 This measure include both the sequential cost of the operation itself, both also the scalability. 91 99 92 100 \subsubsection{Scalability} 93 The most basic performance challenge of a scheduler is scalability. 94 Given a large number of \procs and an even larger number of \ats, scalability measures how fast \procs can enqueue and dequeue \ats. 101 Given a large number of \procs and an even larger number of \ats, scalability measures how fast \procs can enqueue and dequeue \ats relative to the available parallelism. 95 102 One could expect that doubling the number of \procs would double the rate at which \ats are dequeued, but contention on the internal data structure of the scheduler can diminish the improvements. 96 103 While the ready queue itself can be sharded to alleviate the main source of contention, auxiliary scheduling features, \eg counting ready \ats, can also be sources of contention. 104 In the Chapter~\ref{microbench}, scalability is measured as $\# procs \times \frac{ns}{ops}$, \ie, number of \procs times total time over total operations. 105 Since the total number of operation should scale with the number of \procs, this gives a measure how much each additional \proc affects the other \procs. 97 106 98 107 \subsubsection{Migration Cost} … … 107 116 In general, a na\"{i}ve \glsxtrshort{fifo} ready-queue does not scale with increased parallelism from \glspl{hthrd}, resulting in decreased performance. 108 117 The problem is a single point of contention when adding/removing \ats. 109 As shown in the evaluation sections, most production schedulers do scale when adding \glspl{hthrd}. 
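To illustrate the scalability metric mentioned above with made-up numbers rather than measurements: if 16 \procs complete a total of $10^{8}$ operations in 2 seconds, the metric evaluates to
\[
\# procs \times \frac{ns}{ops} = 16 \times \frac{2 \times 10^{9}}{10^{8}} = 320,
\]
\ie 320~ns of aggregate \proc time per completed operation. Under perfect scaling this product stays constant as \procs are added, so any growth in it directly reflects the extra cost each additional \proc imposes on the others.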
110 The solution to this problem is to shard the ready queue: create multiple \emph{sub-queues} forming the logical ready-queue and the sub-queues are accessed by multiple \glspl{hthrd} without interfering. 111 112 Before going into the design of \CFA's scheduler, it is relevant to discuss two sharding solutions that served as the inspiration scheduler in this thesis. 118 As is shown in the evaluation sections, most production schedulers do scale when adding \glspl{hthrd}. 119 The solution to this problem is to shard the ready queue: create multiple \emph{sub-queues} forming the logical ready-queue. 120 The sub-queues are accessed by multiple \glspl{hthrd} without the need for communication. 121 122 Before going into the design of \CFA's scheduler, it is relevant to discuss two sharding solutions that served as the inspiration for the scheduler in this thesis. 113 123 114 124 \subsection{Work-Stealing} … … 116 126 As mentioned in \ref{existing:workstealing}, a popular sharding approach for the ready queue is work-stealing. 117 127 In this approach, each \gls{proc} has its own local sub-queue and \glspl{proc} only access each other's sub-queue if they run out of work on their local ready-queue. 118 The interesting aspect of work stealing happensin the steady-state scheduling case, \ie all \glspl{proc} have work and no load balancing is needed.119 In this case, work stealing is close to optimal scheduling : it can achieve perfect locality and have no contention.120 On the other hand, work-stealing schedulers only attemptto do load-balancing when a \gls{proc} runs out of work.128 The interesting aspect of work stealing manifests itself in the steady-state scheduling case, \ie all \glspl{proc} have work and no load balancing is needed. 129 In this case, work stealing is close to optimal scheduling latency: it can achieve perfect locality and have no contention. 130 On the other hand, work-stealing only attempts to do load-balancing when a \gls{proc} runs out of work. 121 131 This means that the scheduler never balances unfair loads unless they result in a \gls{proc} running out of work. 122 Chapter~\ref{microbench} shows that, in pathological cases, work stealing can lead to indefinitestarvation.123 124 Based on these observations, the conclusion is that a \emph{perfect} scheduler should behave similarly to work-stealing in the steady-state case, but load balance proactively when the need arises.132 Chapter~\ref{microbench} shows that, in pathological cases, work stealing can lead to unbounded starvation. 133 134 Based on these observations, the conclusion is that a \emph{perfect} scheduler should behave similarly to work-stealing in the steady-state case, \ie, avoid migrations in well balanced workloads, but load balance proactively when the need arises. 125 135 126 136 \subsection{Relaxed-FIFO} 127 A different scheduling approach is t o create a``relaxed-FIFO'' queue, as in \cite{alistarh2018relaxed}.137 A different scheduling approach is the ``relaxed-FIFO'' queue, as in \cite{alistarh2018relaxed}. 128 138 This approach forgoes any ownership between \gls{proc} and sub-queue, and simply creates a pool of sub-queues from which \glspl{proc} pick. 129 139 Scheduling is performed as follows: … … 134 144 Timestamps are added to each element of a sub-queue. 
135 145 \item 136 A \gls{proc} randomly tests sub-queues until it has acquired one or two queues .146 A \gls{proc} randomly tests sub-queues until it has acquired one or two queues, referred to as \newterm{randomly picking} or \newterm{randomly helping}. 137 147 \item 138 148 If two queues are acquired, the older of the two \ats is dequeued from the front of the acquired queues. … … 148 158 However, \glspl{proc} eagerly search for these older elements instead of focusing on specific queues, which negatively affects locality. 149 159 150 While this scheme has good fairness, its performance suffers. 151 It requires wide sharding, \eg at least 4 queues per \gls{hthrd}, and finding non-empty queues is difficult when there are few ready \ats. 160 While this scheme has good fairness, its performance can be improved. 161 Wide sharding is generally desired, \eg at least 4 queues per \proc, and randomly picking non-empty queues is difficult when there are few ready \ats. 162 The next sections describe improvements I made to this existing algorithm. 163 However, ultimately the ``relaxed-FIFO'' queue is not used as the basis of the \CFA scheduler. 152 164 153 165 \section{Relaxed-FIFO++} 154 The inherent fairness and goodperformance with many \ats make the relaxed-FIFO queue a good candidate to form the basis of a new scheduler.166 The inherent fairness and decent performance with many \ats make the relaxed-FIFO queue a good candidate to form the basis of a new scheduler. 155 167 The problem case is workloads where the number of \ats is barely greater than the number of \procs. 156 168 In these situations, the wide sharding of the ready queue means most of its sub-queues are empty. … … 162 174 163 175 As this is the most obvious challenge, it is worth addressing first. 164 The obvious solution is to supplement each sharded sub-queue with data that indicates if the queue is empty/nonempty to simplify finding nonempty queues, \ie ready \glspl{at}. 165 This sharded data can be organized in different forms, \eg a bitmask or a binary tree that tracks the nonempty sub-queues. 166 Specifically, many modern architectures have powerful bitmask manipulation instructions or searching a binary tree has good Big-O complexity. 176 The seemingly obvious solution is to supplement each sharded sub-queue with data that indicates whether the queue is empty/nonempty. 177 This simplifies finding nonempty queues, \ie ready \glspl{at}. 178 The sharded data can be organized in different forms, \eg a bitmask or a binary tree that tracks the nonempty sub-queues, using a bit or a node per sub-queue, respectively. 179 Specifically, many modern architectures have powerful bitmask manipulation instructions, and, searching a binary tree has good Big-O complexity. 167 180 However, precisely tracking nonempty sub-queues is problematic. 168 181 The reason is that the sub-queues are initially sharded with a width presumably chosen to avoid contention. 169 However, tracking which ready queue is nonempty is only useful if the tracking data is dense, \ie denser than the sharded sub-queues.182 However, tracking which ready queue is nonempty is only useful if the tracking data is dense, \ie tracks whether multiple sub-queues are empty. 170 183 Otherwise, it does not provide useful information because reading this new data structure risks being as costly as simply picking a sub-queue at random. 
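The dequeue procedure listed above can be sketched in C as follows; the sub-queue type and the helpers (@queue_at@, @try_lock@, @head_timestamp@, @pop_front@) are hypothetical stand-ins for the runtime's internals, not the actual \CFA code.
\begin{verbatim}
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct subqueue subqueue;             // hypothetical sub-queue type
extern unsigned   nqueues;                    // sharding width of the pool
extern subqueue * queue_at(unsigned i);       // i-th sub-queue
extern bool       try_lock(subqueue *);       // non-blocking acquire
extern void       unlock(subqueue *);
extern bool       is_empty(subqueue *);
extern uint64_t   head_timestamp(subqueue *); // TS of the thread at the front
extern void *     pop_front(subqueue *);      // dequeue the front thread

// Relaxed-FIFO dequeue: randomly acquire one or two sub-queues and take
// the thread whose head timestamp is the oldest (smallest).
void * relaxed_fifo_dequeue(void) {
    subqueue * a = NULL, * b = NULL;
    while (!a) {                              // random helping: keep testing
        a = queue_at(rand() % nqueues);
        if (!try_lock(a)) a = NULL;
    }
    b = queue_at(rand() % nqueues);
    if (b == a || !try_lock(b)) b = NULL;     // settle for a single queue

    subqueue * victim = a;
    if (b && !is_empty(b)
          && (is_empty(a) || head_timestamp(b) < head_timestamp(a)))
        victim = b;

    void * thrd = is_empty(victim) ? NULL : pop_front(victim);
    if (b) unlock(b);
    unlock(a);
    return thrd;
}
\end{verbatim}
Note that, as the surrounding text points out, the dequeued \at is only the older of the at most two sampled heads, not necessarily the globally oldest ready \at.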
171 184 But if the tracking mechanism \emph{is} denser than the shared sub-queues, then constant updates invariably create a new source of contention. … … 184 197 185 198 \subsection{Dynamic Entropy}\cite{xkcd:dynamicentropy} 186 The Relaxed-FIFO approach can be made to handle the case of mostly empty sub-queues by tweaking the \glsxtrlong{prng} .199 The Relaxed-FIFO approach can be made to handle the case of mostly empty sub-queues by tweaking the \glsxtrlong{prng} that drives the random picking of sub-queues. 187 200 The \glsxtrshort{prng} state can be seen as containing a list of all the future sub-queues that will be accessed. 188 201 While this concept is not particularly useful on its own, the consequence is that if the \glsxtrshort{prng} algorithm can be run \emph{backwards}, then the state also contains a list of all the sub-queues that were accessed. 189 Luckily, bidirectional \glsxtrshort{prng} algorithms do exist, \eg some Linear Congruential Generators \cite{wiki:lcg} support running the algorithm backwards while offering good quality and performance.202 Luckily, bidirectional \glsxtrshort{prng} algorithms do exist, \eg some Linear Congruential Generators~\cite{wiki:lcg} support running the algorithm backwards while offering good quality and performance. 190 203 This particular \glsxtrshort{prng} can be used as follows: 191 204 \begin{itemize} … … 193 206 Each \proc maintains two \glsxtrshort{prng} states, referred to as $F$ and $B$. 194 207 \item 195 When a \proc attempts to dequeue a \at, it picks a sub-queue by running $B$ backwards.196 \item 197 When a \proc attempts to enqueue a \at, it runs $F$ forward picking a sub-queue to enqueue to.198 If the enqueue is successful, state $B$ is overwritten with the content of$F$.208 When a \proc attempts to dequeue a \at, it picks a sub-queue by running its $B$ backwards. 209 \item 210 When a \proc attempts to enqueue a \at, it runs its $F$ forward picking a sub-queue to enqueue to. 211 If the enqueue is successful, state of its $B$ is overwritten with the content of its $F$. 199 212 \end{itemize} 200 213 The result is that each \proc tends to dequeue \ats that it has itself enqueued. 201 214 When most sub-queues are empty, this technique increases the odds of finding \ats at a very low cost, while also offering an improvement on locality in many cases. 202 215 203 Tests showed this approach performs better than relaxed-FIFO in many cases.216 My own tests showed this approach performs better than relaxed-FIFO in many cases. 204 217 However, it is still not competitive with work-stealing algorithms. 205 The fundamental problem is that the constantrandomness limits how much locality the scheduler offers.218 The fundamental problem is that the randomness limits how much locality the scheduler offers. 206 219 This becomes problematic both because the scheduler is likely to get cache misses on internal data structures and because migrations become frequent. 207 220 Therefore, the attempt to modify the relaxed-FIFO algorithm to behave more like work stealing did not pan out. … … 214 227 Before attempting to dequeue from a \proc's sub-queue, the \proc must make some effort to ensure other sub-queues are not being neglected. 215 228 To make this possible, \procs must be able to determine which \at has been on the ready queue the longest. 216 Second, the relaxed-FIFO approach needs timestamps for each \at to make this possible. 229 Second, the relaxed-FIFO approach uses timestamps, denoted TS, for each \at to make this possible. 
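A minimal C sketch of this forward/backward \glsxtrshort{prng} scheme is shown below, using a 64-bit linear congruential generator; the constants are the widely published MMIX parameters and the per-\proc bookkeeping functions are illustrative, not the actual \CFA implementation.
\begin{verbatim}
#include <stdint.h>

// 64-bit LCG: x' = A*x + C (mod 2^64).  Because A is odd it has a modular
// inverse, so the generator can also be stepped backwards.
#define LCG_A 6364136223846793005ULL
#define LCG_C 1442695040888963407ULL

static uint64_t lcg_forward(uint64_t * s)  { return *s = LCG_A * (*s) + LCG_C; }

static uint64_t lcg_backward(uint64_t * s) {
    uint64_t inv = LCG_A;                     // Newton iteration for A^-1 mod 2^64
    for (int i = 0; i < 5; i++) inv *= 2 - LCG_A * inv;
    return *s = inv * (*s - LCG_C);
}

// Per-processor state: F picks sub-queues for enqueues, B replays those
// picks, newest first, for dequeues.
struct proc_rng { uint64_t F, B; };

static unsigned pick_enqueue(struct proc_rng * r, unsigned nqueues) {
    return (unsigned)(lcg_forward(&r->F) % nqueues);
}
static void enqueue_succeeded(struct proc_rng * r) {
    r->B = r->F;                              // dequeues now retrace F's picks
}
static unsigned pick_dequeue(struct proc_rng * r, unsigned nqueues) {
    unsigned q = (unsigned)(r->B % nqueues);  // most recent enqueue pick
    lcg_backward(&r->B);                      // next call yields the one before
    return q;
}
\end{verbatim}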
230 Theses timestamps can be added to work stealing. 217 231 218 232 \begin{figure} 219 233 \centering 220 234 \input{base.pstex_t} 221 \caption[Base \CFA design]{Base \CFA design \smallskip\newline A pool of sub-queues offers the sharding, twoper \proc.235 \caption[Base \CFA design]{Base \CFA design \smallskip\newline It uses a pool of sub-queues, with a sharding of two sub-queue per \proc. 222 236 Each \gls{proc} can access all of the sub-queues. 223 237 Each \at is timestamped when enqueued.} … … 227 241 Figure~\ref{fig:base} shows the algorithm structure. 228 242 This structure is similar to classic work-stealing except the sub-queues are placed in an array so \procs can access them in constant time. 229 Sharding widthcan be adjusted based on contention.230 Note, as an optimization, the TSof a \at is stored in the \at in front of it, so the first TS is in the array and the last \at has no TS.243 Sharding can be adjusted based on contention. 244 As an optimization, the timestamp of a \at is stored in the \at in front of it, so the first TS is in the array and the last \at has no TS. 231 245 This organization keeps the highly accessed front TSs directly in the array. 232 246 When a \proc attempts to dequeue a \at, it first picks a random remote sub-queue and compares its timestamp to the timestamps of its local sub-queue(s). 233 The oldest waiting \at is dequeued to provide global fairness. 247 The oldest waiting of the compared \ats is dequeued. 248 In this document, picking from a remote sub-queue in this fashion is referred to as ``helping''. 249 250 The timestamps are measured using the CPU's hardware timestamp counters~\cite{wiki:rdtsc}. 251 These provide a 64-bit counter that tracks the number of cycles since the CPU was powered on. 252 Assuming the CPU runs at less than 5 GHz, this means that the 64-bit counter takes over a century before overflowing. 253 This is true even on 32-bit CPUs, where the counter is generally still 64-bit. 254 However, on many architectures, the instructions to read the counter do not have any particular ordering guarantees. 255 Since the counter does not depend on any data in the cpu pipeline, this means there is significant flexibility for the instruction to be read out of order, which limites the accuracy to a window of code. 256 Finally, another issue that can come up with timestamp counters is synchronization between \glspl{hthrd}. 257 This appears to be mostly a historical concern, as recent CPU offer more synchronization guarantees. 258 For example, Intel supports "Invariant TSC" \cite[\S~17.15.1]{MAN:inteldev} which is guaranteed to be synchronized across \glspl{hthrd}. 234 259 235 260 However, this na\"ive implementation has performance problems. 236 First, it is necessary to have some damping effect on helping.237 Random effects like cache misses and preemption can add spurious but short bursts of latency negating the attempt to help.238 These bursts can cause increased migrations and make this work-stealing approach slow down to the level of relaxed-FIFO.261 First, it is necessary to avoid helping when it does not improve fairness. 262 Random effects like cache misses and preemption can add unpredictable but short bursts of latency but do not warrant the cost of helping. 263 These bursts can cause increased migrations, at which point this same locality problems as in the relaxed-FIFO approach start to appear. 
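For reference, the counter described above can be read on x86-64 with the GCC/Clang intrinsics below; choosing between the unordered and the partially ordered variant is exactly the accuracy-versus-cost trade-off mentioned in the text, and this snippet is illustrative rather than the exact code used by the runtime.
\begin{verbatim}
#include <stdint.h>
#include <x86intrin.h>   // __rdtsc, __rdtscp (GCC/Clang, x86-64)

// Unordered read: cheap, but the instruction may be reordered within a
// window of surrounding instructions, limiting the timestamp's accuracy
// to that window.
static inline uint64_t ts_fast(void) {
    return __rdtsc();
}

// Partially ordered read: waits for prior instructions to finish and also
// returns the TSC_AUX value, which Linux initializes with the CPU number,
// making migrations between two reads detectable.
static inline uint64_t ts_ordered(unsigned * cpu) {
    return __rdtscp(cpu);
}
\end{verbatim}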
239 264 240 265 \begin{figure} … … 246 271 247 272 A simple solution to this problem is to use an exponential moving average\cite{wiki:ma} (MA) instead of a raw timestamp, as shown in Figure~\ref{fig:base-ma}. 248 Note that this is more complex because the \at at the head of a sub-queue is still waiting, so its wait time has not ended.273 Note that this is more complex than it can appear because the \at at the head of a sub-queue is still waiting, so its wait time has not ended. 249 274 Therefore, the exponential moving average is an average of how long each dequeued \at has waited. 250 275 To compare sub-queues, the timestamp at the head must be compared to the current time, yielding the best-case wait time for the \at at the head of the queue. 251 276 This new waiting is averaged with the stored average. 252 277 To further limit \glslink{atmig}{migrations}, a bias can be added to a local sub-queue, where a remote sub-queue is helped only if its moving average is more than $X$ times the local sub-queue's average. 253 Tests for this approach indicate the choice of the weight for the moving average or the bias isnot important, \ie weights and biases of similar \emph{magnitudes} have similar effects.254 255 With these additions to work stealing, scheduling can be made as fair as the relaxed-FIFO approach, avoiding the majority of unnecessary migrations.278 Tests for this approach indicate the precise values for the weight of the moving average and the bias are not important, \ie weights and biases of similar \emph{magnitudes} have similar effects. 279 280 With these additions to work stealing, scheduling can satisfy the starvation freedom guarantee while suffering much less from unnecessary migrations than the relaxed-FIFO approach. 256 281 Unfortunately, the work to achieve fairness has a performance cost, especially when the workload is inherently fair, and hence, there is only short-term unfairness or no starvation. 257 The problem is that the constant polling, \ie reads, of remote sub-queues generally entails cache misses because the TSs are constantly being updated , \ie, writes.258 To make things wors t, remote sub-queues that are very active, \ie \ats are frequently enqueued and dequeued from them, lead to higher chances that polling will incur a cache-miss.282 The problem is that the constant polling, \ie reads, of remote sub-queues generally entails cache misses because the TSs are constantly being updated. 283 To make things worse, remote sub-queues that are very active, \ie \ats are frequently enqueued and dequeued from them, lead to higher chances that polling will incur a cache-miss. 259 284 Conversely, the active sub-queues do not benefit much from helping since starvation is already a non-issue. 260 285 This puts this algorithm in the awkward situation of paying for a largely unnecessary cost. … … 264 289 The problem with polling remote sub-queues is that correctness is critical. 265 290 There must be a consensus among \procs on which sub-queues hold which \ats, as the \ats are in constant motion. 266 Furthermore, since timestamps are used for fairness, it is critical t o have a consensus on which \at is the oldest.291 Furthermore, since timestamps are used for fairness, it is critical that the oldest \ats eventually be recognized as such. 267 292 However, when deciding if a remote sub-queue is worth polling, correctness is less of a problem. 268 293 Since the only requirement is that a sub-queue is eventually polled, some data staleness is acceptable. 
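The helping decision just described can be sketched as follows; the smoothing weight @W@ and bias @X@ are placeholders whose exact values, as noted above, only matter to an order of magnitude, and the structure layout is assumed for the example.
\begin{verbatim}
#include <stdbool.h>
#include <stdint.h>

struct subq_stats {
    uint64_t head_ts;    // enqueue timestamp of the thread at the front
    uint64_t avg_wait;   // exponential moving average of completed waits
};

enum { W = 16 };         // smoothing weight (placeholder value)
enum { X = 2 };          // bias in favour of the local sub-queue (placeholder)

// Fold the wait of a just-dequeued thread into the average.
static void on_dequeue(struct subq_stats * q, uint64_t now) {
    uint64_t wait = now - q->head_ts;
    q->avg_wait = (q->avg_wait * (W - 1) + wait) / W;
}

// Help a remote sub-queue only if its average, including the best-case
// wait of its still-waiting head, exceeds X times the local average.
static bool should_help(const struct subq_stats * local,
                        const struct subq_stats * remote, uint64_t now) {
    uint64_t best_case  = now - remote->head_ts;
    uint64_t remote_avg = (remote->avg_wait * (W - 1) + best_case) / W;
    return remote_avg > X * local->avg_wait;
}
\end{verbatim}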
… … 278 303 \centering 279 304 \input{base_ts2.pstex_t} 280 \caption[\CFA design with Redundant Timestamps]{\CFA design with Redundant Timestamps \smallskip\newline An array is addedcontaining a copy of the timestamps.305 \caption[\CFA design with Redundant Timestamps]{\CFA design with Redundant Timestamps \smallskip\newline This design uses an array containing a copy of the timestamps. 281 306 These timestamps are written-to with relaxed atomics, so there is no order among concurrent memory accesses, leading to fewer cache invalidations.} 282 307 \label{fig:base-ts2} … … 285 310 The correctness argument is somewhat subtle. 286 311 The data used for deciding whether or not to poll a queue can be stale as long as it does not cause starvation. 287 Therefore, it is acceptable if stale data makes queues appear older than they are but appearing fresher can be a problem.312 Therefore, it is acceptable if stale data makes queues appear older than they are, but appearing fresher can be a problem. 288 313 For the timestamps, this means it is acceptable to miss writes to the timestamp since they make the head \at look older. 289 314 For the moving average, as long as the operations are just atomic reads/writes, the average is guaranteed to yield a value that is between the oldest and newest values written. … … 292 317 With redundant timestamps, this scheduling algorithm achieves both the fairness and performance requirements on most machines. 293 318 The problem is that the cost of polling and helping is not necessarily consistent across each \gls{hthrd}. 294 For example on machines with a CPU containing multiple hyper threads and cores and multiple CPU sockets, cache misses can be satisfied from the caches on the same (local) CPU, or by a CPU on a different (remote) socket.319 For example on machines with multiple CPUs, cache misses can be satisfied from the caches on the same (local) CPU, or by the caches on a different (remote) CPU. 295 320 Cache misses satisfied by a remote CPU have significantly higher latency than from the local CPU. 296 321 However, these delays are not specific to systems with multiple CPUs. … … 312 337 Figures~\ref{fig:cache-share} and~\ref{fig:cache-noshare} show two different cache topologies that highlight this difference. 313 338 In Figure~\ref{fig:cache-share}, all cache misses are either private to a CPU or shared with another CPU. 314 This means latency due to cache misses is fairly consistent.315 In contrast, in Figure~\ref{fig:cache-noshare} misses in the L2 cache can be satisfied by either instance of the L3 cache.339 This means that latency due to cache misses is fairly consistent. 340 In contrast, in Figure~\ref{fig:cache-noshare}, misses in the L2 cache can be satisfied by either instance of the L3 cache. 316 341 However, the memory-access latency to the remote L3 is higher than the memory-access latency to the local L3. 317 The impact of these different designs on this algorithm is that scheduling only scales well on architectures with a wide L3 cache, similar to Figure~\ref{fig:cache-share}, and less well on architectures with many narrower L3 cache instances, similar to Figure~\ref{fig:cache-noshare}.342 The impact of these different designs on this algorithm is that scheduling only scales well on architectures with the L3 cache shared across many \glspl{hthrd}, similar to Figure~\ref{fig:cache-share}, and less well on architectures with many L3 cache instances and less sharing, similar to Figure~\ref{fig:cache-noshare}. 
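The redundant copies can be maintained with relaxed atomics, for instance with the GCC/Clang @__atomic@ builtins as in the sketch below; the array name and its per-sub-queue layout are assumptions for illustration.
\begin{verbatim}
#include <stdint.h>

// One timestamp copy per sub-queue, kept in a dense array so that polling
// reads touch few cache lines.
extern uint64_t ts_copy[];

// Relaxed accesses impose no ordering among concurrent reads and writes.
// A reader may therefore see a stale copy, which is safe: a stale value
// only makes the head thread look older, prompting more helping, not less.
static inline void publish_ts(unsigned q, uint64_t ts) {
    __atomic_store_n(&ts_copy[q], ts, __ATOMIC_RELAXED);
}

static inline uint64_t peek_ts(unsigned q) {
    return __atomic_load_n(&ts_copy[q], __ATOMIC_RELAXED);
}
\end{verbatim}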
318 343 Hence, as the number of L3 instances grows, so too does the chance that the random helping causes significant cache latency. 319 344 The solution is for the scheduler to be aware of the cache topology. … … 323 348 Unfortunately, there is no portable way to discover cache topology, and it is outside the scope of this thesis to solve this problem. 324 349 This work uses the cache topology information from Linux's @/sys/devices/system/cpu@ directory. 325 This leaves the challenge of matching \procs to cache structure, or more precisely identifying which sub-queues of the ready queue are local to which subcomponents of the cache structure.350 This leaves the challenge of matching \procs to cache structure, or more precisely, identifying which sub-queues of the ready queue are local to which subcomponents of the cache structure. 326 351 Once a match is generated, the helping algorithm is changed to add bias so that \procs more often help sub-queues local to the same cache substructure.\footnote{ 327 Note that like other biases mentioned in this section, the actual bias value does not appear to need precise tuning .}352 Note that like other biases mentioned in this section, the actual bias value does not appear to need precise tuning beyond the order of magnitude.} 328 353 329 354 The simplest approach for mapping sub-queues to cache structure is to statically tie sub-queues to CPUs. … … 335 360 However, it can still cause some subtle fairness problems in systems with few \procs and many \glspl{hthrd}. 336 361 In this case, the large number of sub-queues and the bias against sub-queues tied to different cache substructures make it unlikely that every sub-queue is picked. 337 To make things wors t, the small number of \procs means that few helping attempts are made.362 To make things worse, the small number of \procs means that few helping attempts are made. 338 363 This combination of low selection and few helping attempts allow a \at to become stranded on a sub-queue for a long time until it gets randomly helped. 339 On a system with 2 \procs, 256 \glspl{hthrd} with narrow cache sharing, and a 100:1 bias, it can take multiple seconds for a \at to get dequeued from a remote queue. 364 On a system with 2 \procs, 256 \glspl{hthrd}, and a 100:1 bias, it can take multiple seconds for a \at to get dequeued from a remote queue. 365 In this scenario, where each \proc attempts to help on 50\% of dequeues, the probability that a remote sub-queue gets help is $\frac{1}{51200}$ and follows a geometric distribution. 366 Therefore the probability of the remote sub-queue gets help within the next 100'000 dequeues is only 85\%. 367 Assuming dequeues happen every 100ns, there is still 15\% chance a \at could starve for more than 10ms and a 1\% chance the \at starves for 33.33ms, the maximum latency tolerated for interactive applications. 368 If few \glspl{hthrd} share each cache instance, the probability that a \at is on a remote sub-queue becomes high. 340 369 Therefore, a more dynamic match of sub-queues to cache instances is needed. 341 370 … … 343 372 \label{s:TopologicalWorkStealing} 344 373 The approach used in the \CFA scheduler is to have per-\proc sub-queues, but have an explicit data structure to track which cache substructure each sub-queue is tied to. 
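The 85\% figure quoted above follows from the geometric distribution: restating the text's numbers, with a per-dequeue helping probability of $p = \frac{1}{51200}$ for the stranded sub-queue,
\[
P(\text{helped within } n \text{ dequeues}) = 1 - (1 - p)^{n},
\qquad
1 - \Bigl(1 - \tfrac{1}{51200}\Bigr)^{100\,000} \approx 1 - e^{-100000/51200} \approx 0.86,
\]
so roughly 15\% of such \ats are still waiting after $100\,000 \times 100\,ns = 10\,ms$.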
345 This tracking requires some finesse because reading this data structure must lead to fewer cache misses than not having the data structure in the first place.374 This tracking requires some finesse, because reading this data structure must lead to fewer cache misses than not having the data structure in the first place. 346 375 A key element, however, is that, like the timestamps for helping, reading the cache instance mapping only needs to give the correct result \emph{often enough}. 347 Therefore the algorithm can be built as follows: before enqueueing or dequeuing a \at, each\proc queries the CPU id and the corresponding cache instance.348 Since sub-queues are tied to \procs, each\proc can then update the cache instance mapped to the local sub-queue(s).376 Therefore the algorithm can be built as follows: before enqueueing or dequeuing a \at, a \proc queries the CPU id and the corresponding cache instance. 377 Since sub-queues are tied to \procs, a \proc can then update the cache instance mapped to the local sub-queue(s). 349 378 To avoid unnecessary cache line invalidation, the map is only written-to if the mapping changes. 350 379 -
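As a concrete illustration of the query involved, the C sketch below obtains the current CPU with @sched_getcpu@ and reads the identifier of its L3 instance from the @/sys/devices/system/cpu@ hierarchy; treating @index3@ as the L3 cache and using its @id@ file are assumptions about a typical Linux layout, not a description of the exact \CFA code.
\begin{verbatim}
#define _GNU_SOURCE
#include <sched.h>       // sched_getcpu
#include <stdio.h>

// Return the L3 cache-instance id for the CPU this thread is currently
// running on, or -1 on error.
static int current_l3_instance(void) {
    int cpu = sched_getcpu();
    if (cpu < 0) return -1;

    char path[128];
    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu%d/cache/index3/id", cpu);

    FILE * f = fopen(path, "r");
    if (!f) return -1;
    int id = -1;
    if (fscanf(f, "%d", &id) != 1) id = -1;
    fclose(f);
    return id;
}
\end{verbatim}
In the scheme described above, the result is stored in the mapping for the local sub-queue(s), and that mapping is only written when the value changes, avoiding needless cache-line invalidation.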
doc/theses/thierry_delisle_PhD/thesis/text/eval_macro.tex
r7d9598d8 r2dcd80a 1 1 \chapter{Macro-Benchmarks}\label{macrobench} 2 The previous chapter demonstrated th e \CFA scheduler achieves its equivalent performance goal in small and controlled \at-scheduling scenarios.2 The previous chapter demonstrated that the \CFA scheduler achieves its equivalent performance goal in small and controlled \at-scheduling scenarios. 3 3 The next step is to demonstrate performance stays true in more realistic and complete scenarios. 4 Therefore, this chapter exercises both \at and I/O scheduling using two flavours of web servers that demonstrate \CFA performs competitively compared to web servers used in production environments.4 Therefore, this chapter exercises both \at and I/O scheduling using two flavours of web servers that demonstrate that \CFA performs competitively compared to web servers used in production environments. 5 5 6 6 Web servers are chosen because they offer fairly simple applications that perform complex I/O, both network and disk, and are useful as standalone products. … … 10 10 As such, these experiments should highlight the overhead due to any \CFA fairness cost in realistic scenarios. 11 11 12 The most obvious performance metric for web servers is throughput. 13 This metric generally measures the speed at which the server answers and relatedly how fast clients can send requests before the server can no longer keep-up. 14 Another popular performance metric is \newterm{tail} latency, which indicates some notion of fairness among requests across the experiment, \ie do some requests wait longer than other requests for service? 15 Since many web applications rely on a combination of different queries made in parallel, the latency of the slowest response, \ie tail latency, can dictate a performance perception. 16 12 17 \section{Memcached} 13 18 Memcached~\cite{memcached} is an in-memory key-value store used in many production environments, \eg \cite{atikoglu2012workload}. … … 26 31 Each CPU has 6 cores and 2 \glspl{hthrd} per core, for a total of 24 \glspl{hthrd}. 27 32 \item 33 The machine is configured to run each servers on 12 dedicated \glspl{hthrd} and uses 6 of the remaining \glspl{hthrd} for the software interrupt handling~\cite{wiki:softirq}, resulting in maximum CPU utilization of 75\% (18 / 24 \glspl{hthrd}) 34 \item 28 35 A CPU has 384 KB, 3 MB and 30 MB of L1, L2 and L3 caches, respectively. 29 36 \item … … 47 54 \item 48 55 For UDP connections, all the threads listen to a single UDP socket for incoming requests. 49 Threads that are notcurrently dealing with another request ignore the incoming packet.56 Threads that are currently dealing with another request ignore the incoming packet. 50 57 One of the remaining, non-busy, threads reads the request and sends the response. 51 58 This implementation can lead to increased CPU \gls{load} as threads wake from sleep to potentially process the request. … … 79 86 \subsection{Throughput} \label{memcd:tput} 80 87 This experiment is done by having the clients establish 15,360 total connections, which persist for the duration of the experiment. 81 The clients then send read and write queries with only3\% writes (updates), attempting to follow a desired query rate, and the server responds to the desired rate as best as possible.88 The clients then send read and write queries with 3\% writes (updates), attempting to follow a desired query rate, and the server responds to the desired rate as best as possible. 
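For the tail-latency metric introduced above, the 99th percentile of a set of latency samples can be computed with a routine such as the following nearest-rank sketch; it is a generic illustration, not the measurement code used by the benchmark client.
\begin{verbatim}
#include <math.h>
#include <stdlib.h>

static int cmp_u64(const void * a, const void * b) {
    unsigned long long x = *(const unsigned long long *)a;
    unsigned long long y = *(const unsigned long long *)b;
    return (x > y) - (x < y);
}

// Nearest-rank percentile (0 < p <= 100); sorts the samples in place.
unsigned long long percentile(unsigned long long * lat, size_t n, double p) {
    qsort(lat, n, sizeof lat[0], cmp_u64);
    size_t rank = (size_t)ceil((p / 100.0) * (double)n);
    if (rank < 1) rank = 1;
    return lat[rank - 1];
}
\end{verbatim}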
82 89 Figure~\ref{fig:memcd:rate:qps} shows the 3 server versions at different client rates, ``Target \underline{Q}ueries \underline{P}er \underline{S}econd'', and the actual rate, ``Actual QPS'', for all three web servers. 83 90 … … 104 111 105 112 \subsection{Tail Latency} 106 Another popular performance metric is \newterm{tail} latency, which indicates some notion of fairness among requests across the experiment, \ie do some requests wait longer than other requests for service?107 Since many web applications rely on a combination of different queries made in parallel, the latency of the slowest response, \ie tail latency, can dictate a performance perception.108 113 Figure~\ref{fig:memcd:rate:tail} shows the 99th percentile latency results for the same Memcached experiment. 109 114 110 115 Again, each experiment is run 15 times with the median, maximum and minimum plotted with different lines. 111 As expected, the latency starts low and increases as the server gets close to saturation, at which point , the latency increases dramatically because the web servers cannot keep up with the connection rateso client requests are disproportionally delayed.116 As expected, the latency starts low and increases as the server gets close to saturation, at which point the latency increases dramatically because the web servers cannot keep up with the connection rate, so client requests are disproportionally delayed. 112 117 Because of this dramatic increase, the Y-axis is presented using a log scale. 113 118 Note that the graph shows the \emph{target} query rate, the actual response rate is given in Figure~\ref{fig:memcd:rate:qps} as this is the same underlying experiment. … … 186 191 web servers servicing dynamic requests, which read from multiple locations and construct a response, are not as interesting since creating the response takes more time and does not exercise the runtime in a meaningfully different way.} 187 192 The static web server experiment compares NGINX~\cite{nginx} with a custom \CFA-based web server developed for this experiment. 188 189 \subsection{NGINX threading}190 193 NGINX is a high-performance, \emph{full-service}, event-driven web server. 191 194 It can handle both static and dynamic web content, as well as serve as a reverse proxy and a load balancer~\cite{reese2008nginx}. 192 195 This wealth of capabilities comes with a variety of potential configurations, dictating available features and performance. 193 196 The NGINX server runs a master process that performs operations such as reading configuration files, binding to ports, and controlling worker processes. 194 When running as a static web server, it uses an event-driven architecture to service incoming requests. 197 In comparison, the custom \CFA web server was developed specifically with this experiment in mind. 198 However, nothing seems to indicate that NGINX suffers from the increased flexibility. 199 When tuned for performance, NGINX appears to achieve the performance that the underlying hardware can achieve. 200 201 \subsection{NGINX threading} 202 When running as a static web server, NGINX uses an event-driven architecture to service incoming requests. 195 203 Incoming connections are assigned a \emph{stackless} HTTP state machine and worker processes can handle thousands of these state machines. 196 204 For the following experiment, NGINX is configured to use @epoll@ to listen for events on these state machines and have each worker process independently accept new connections. 
197 Because of the realities of Linux, see Subsection~\ref{ononblock}, NGINX also maintains a pool of auxiliary threads to handle blocking \io.205 Because of the realities of Linux, (Subsection~\ref{ononblock}), NGINX also maintains a pool of auxiliary threads to handle blocking \io. 198 206 The configuration can set the number of worker processes desired, as well as the size of the auxiliary pool. 199 207 However, for the following experiments, NGINX is configured to let the master process decide the appropriate number of threads. … … 262 270 The computer is booted with only 8 CPUs enabled, which is sufficient to achieve line rate. 263 271 \item 272 Both servers are setup with enough parallelism to achieve 100\% CPU utilization, which happens at higher request rates. 273 \item 264 274 Each CPU has 64 KB, 256 KiB and 8 MB of L1, L2 and L3 caches respectively. 265 275 \item -
doc/theses/thierry_delisle_PhD/thesis/text/eval_micro.tex
r7d9598d8 r2dcd80a 4 4 This chapter presents five different experimental setups for evaluating the basic features of the \CFA, libfibre~\cite{libfibre}, Go, and Tokio~\cite{Tokio} schedulers. 5 5 All of these systems have a \gls{uthrding} model. 6 The goal of this chapter is to show that the \CFA scheduler obtains equivalent performance to other, less fair, schedulers through the different experiments.6 The goal of this chapter is to show, through the different experiments, that the \CFA scheduler obtains equivalent performance to other schedulers with lesser fairness guarantees. 7 7 Note that only the code of the \CFA tests is shown; 8 all tests in the other systems are functionally identical and available online~\cite{GITHUB:SchedulingBenchmarks}.8 all tests in the other systems are functionally identical and available both online~\cite{GITHUB:SchedulingBenchmarks} and submitted to UWSpace with the thesis itself. 9 9 10 10 \section{Benchmark Environment}\label{microenv} … … 129 129 \caption[Cycle Benchmark on Intel]{Cycle Benchmark on Intel\smallskip\newline Throughput and scalability as a function of \proc count, 5 \ats per cycle, and different cycle counts. 130 130 For throughput, higher is better, for scalability, lower is better. 131 Each series represent 15 independent runs, the dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.} 131 Each series represents 15 independent runs. 132 The dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.} 132 133 \label{fig:cycle:jax} 133 134 \end{figure} … … 161 162 \caption[Cycle Benchmark on AMD]{Cycle Benchmark on AMD\smallskip\newline Throughput and scalability as a function of \proc count, 5 \ats per cycle, and different cycle counts. 162 163 For throughput, higher is better, for scalability, lower is better. 163 Each series represent 15 independent runs, the dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.} 164 Each series represents 15 independent runs. 165 The dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.} 164 166 \label{fig:cycle:nasus} 165 167 \end{figure} … … 177 179 Looking next at the right column on Intel, Figures~\ref{fig:cycle:jax:low:ops} and \ref{fig:cycle:jax:low:ns} show the results for 1 cycle of 5 \ats for each \proc. 178 180 \CFA and Tokio obtain very similar results overall, but Tokio shows more variations in the results. 179 Go achieves slightly better performance than \CFA and Tokio, but all three display significantly wors tperformance compared to the left column.181 Go achieves slightly better performance than \CFA and Tokio, but all three display significantly worse performance compared to the left column. 180 182 This decrease in performance is likely due to the additional overhead of the idle-sleep mechanism. 181 183 This can either be the result of \procs actually running out of work or simply additional overhead from tracking whether or not there is work available. … … 185 187 Looking now at the results for the AMD architecture, Figure~\ref{fig:cycle:nasus}, the results are overall similar to the Intel results, but with close to double the performance, slightly increased variation, and some differences in the details. 186 188 Note the maximum of the Y-axis on Intel and AMD differ significantly. 
187 Looking at the left column on AMD, Figures~\ref{fig:cycle:nasus:ops} and \ref{fig:cycle:nasus:ns} all 4 runtimes achieve very similar throughput and scalability.189 Looking at the left column on AMD, Figures~\ref{fig:cycle:nasus:ops} and \ref{fig:cycle:nasus:ns}, all 4 runtimes achieve very similar throughput and scalability. 188 190 However, as the number of \procs grows higher, the results on AMD show notably more variability than on Intel. 189 191 The different performance improvements and plateaus are due to cache topology and appear at the expected \proc counts of 64, 128 and 192, for the same reasons as on Intel. … … 191 193 This result is different than on Intel, where Tokio behaved like \CFA rather than behaving like Go. 192 194 Again, the same performance increase for libfibre is visible when running fewer \ats. 193 Note,I did not investigate the libfibre performance boost for 1 cycle in this experiment.195 I did not investigate the libfibre performance boost for 1 cycle in this experiment. 194 196 195 197 The conclusion from both architectures is that all of the compared runtimes have fairly equivalent performance for this micro-benchmark. 196 198 Clearly, the pathological case with 1 cycle per \proc can affect fairness algorithms managing mostly idle processors, \eg \CFA, but only at high core counts. 197 199 In this case, \emph{any} helping is likely to cause a cascade of \procs running out of work and attempting to steal. 198 For this experiment, the \CFA scheduler has achieved the goal of obtaining equivalent performance to other , less fair, schedulers.200 For this experiment, the \CFA scheduler has achieved the goal of obtaining equivalent performance to other schedulers with lesser fairness guarantees. 199 201 200 202 \section{Yield} … … 250 252 \caption[Yield Benchmark on Intel]{Yield Benchmark on Intel\smallskip\newline Throughput and scalability as a function of \proc count. 251 253 For throughput, higher is better, for scalability, lower is better. 252 Each series represent 15 independent runs, the dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.} 254 Each series represents 15 independent runs. 255 The dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.} 253 256 \label{fig:yield:jax} 254 257 \end{figure} … … 309 312 \caption[Yield Benchmark on AMD]{Yield Benchmark on AMD\smallskip\newline Throughput and scalability as a function of \proc count. 310 313 For throughput, higher is better, for scalability, lower is better. 311 Each series represent 15 independent runs, the dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.} 314 Each series represents 15 independent runs. 315 The dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.} 312 316 \label{fig:yield:nasus} 313 317 \end{figure} … … 317 321 Looking at the left column first, Figures~\ref{fig:yield:nasus:ops} and \ref{fig:yield:nasus:ns}, \CFA achieves very similar throughput and scaling. 318 322 Libfibre still outpaces all other runtimes, but it encounters a performance hit at 64 \procs. 
319 This anomaly suggests some amount of communication between the \procs that the Intel machine is able to mask where the AMD is not once hyperthreading is needed.323 This anomaly suggests some amount of communication between the \procs that the Intel machine is able to mask where the AMD is not, once hyperthreading is needed. 320 324 Go and Tokio still display the same performance collapse as on Intel. 321 325 Looking next at the right column on AMD, Figures~\ref{fig:yield:nasus:low:ops} and \ref{fig:yield:nasus:low:ns}, all runtime systems effectively behave the same as they did on the Intel machine. … … 324 328 325 329 It is difficult to draw conclusions for this benchmark when runtime systems treat @yield@ so differently. 326 The win for \CFA is its consistency between the cycle and yield benchmarks making it simpler for programmers to use and understand, \ie the \CFA semantics match with programmer intuition.330 The win for \CFA is its consistency between the cycle and yield benchmarks, making it simpler for programmers to use and understand, \ie the \CFA semantics match with programmer intuition. 327 331 328 332 … … 333 337 334 338 The Churn benchmark represents more chaotic executions, where there is more communication among \ats but no relationship between the last \proc on which a \at ran and blocked, and the \proc that subsequently unblocks it. 335 With processor-specific ready-queues, when a \at is unblocked by a different \proc that means the unblocking \proc must either ``steal'' the \at from another processor or find it on a remote queue.339 With processor-specific ready-queues, when a \at is unblocked by a different \proc, that means the unblocking \proc must either ``steal'' the \at from another processor or find it on a remote queue. 336 340 This dequeuing results in either contention on the remote queue and/or \glspl{rmr} on the \at data structure. 337 Hence, this benchmark has performance dominated by the cache traffic as \procs are constantly accessing each other 'sdata.341 Hence, this benchmark has performance dominated by the cache traffic as \procs are constantly accessing each others' data. 338 342 In either case, this benchmark aims to measure how well a scheduler handles these cases since both cases can lead to performance degradation if not handled correctly. 339 343 … … 392 396 \caption[Churn Benchmark on Intel]{Churn Benchmark on Intel\smallskip\newline Throughput and scalability as a function of \proc count. 393 397 For throughput, higher is better, for scalability, lower is better. 394 Each series represent 15 independent runs, the dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.} 398 Each series represents 15 independent runs. 399 The dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.} 395 400 \label{fig:churn:jax} 396 401 \end{figure} … … 404 409 Tokio achieves very similar performance to \CFA, with the starting boost, scaling decently until 48 \procs, drops from 48 to 72 \procs, and starts increasing again to 192 \procs. 405 410 Libfibre obtains effectively the same results as Tokio with slightly less scaling, \ie the scaling curve is the same but with slightly lower values. 406 Finally, Go gets the most peculiar results, scaling wors tthan other runtimes until 48 \procs.411 Finally, Go gets the most peculiar results, scaling worse than other runtimes until 48 \procs. 
407 412 At 72 \procs, the results of the Go runtime vary significantly, sometimes scaling sometimes plateauing. 408 413 However, beyond this point Go keeps this level of variation but does not scale further in any of the runs. 409 414 410 Throughput and scalability are notably wors tfor all runtimes than the previous benchmarks since there is inherently more communication between processors.415 Throughput and scalability are notably worse for all runtimes than the previous benchmarks since there is inherently more communication between processors. 411 416 Indeed, none of the runtimes reach 40 million operations per second while in the cycle benchmark all but libfibre reached 400 million operations per second. 412 417 Figures~\ref{fig:churn:jax:ns} and \ref{fig:churn:jax:low:ns} show that for all \proc counts, all runtimes produce poor scaling. 413 However, once the number of \glspl{hthrd} goes beyond a single socket, at 48 \procs, scaling goes from bad to wors tand performance completely ceases to improve.418 However, once the number of \glspl{hthrd} goes beyond a single socket, at 48 \procs, scaling goes from bad to worse and performance completely ceases to improve. 414 419 At this point, the benchmark is dominated by inter-socket communication costs for all runtimes. 415 420 … … 457 462 \caption[Churn Benchmark on AMD]{Churn Benchmark on AMD\smallskip\newline Throughput and scalability as a function of \proc count. 458 463 For throughput, higher is better, for scalability, lower is better. 459 Each series represent 15 independent runs, the dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.} 464 Each series represents 15 independent runs. 465 The dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.} 460 466 \label{fig:churn:nasus} 461 467 \end{figure} … … 600 606 \caption[Locality Benchmark on Intel]{Locality Benchmark on Intel\smallskip\newline Throughput and scalability as a function of \proc count. 601 607 For throughput, higher is better, for scalability, lower is better. 602 Each series represent 15 independent runs, the dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.} 608 Each series represents 15 independent runs. 609 The dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.} 603 610 \label{fig:locality:jax} 604 611 \end{figure} … … 632 639 \caption[Locality Benchmark on AMD]{Locality Benchmark on AMD\smallskip\newline Throughput and scalability as a function of \proc count. 633 640 For throughput, higher is better, for scalability, lower is better. 634 Each series represent 15 independent runs, the dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.} 641 Each series represents 15 independent runs. 642 The dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.} 635 643 \label{fig:locality:nasus} 636 644 \end{figure} … … 648 656 Go still has the same poor performance as on Intel. 
649 657 650 Finally looking at the right column, Figures~\ref{fig:locality:nasus:noshare:ops} and \ref{fig:locality:nasus:noshare:ns}, like on Intel, the same performance inversion is present between libfibre and \CFA/Tokio.658 Finally, looking at the right column, Figures~\ref{fig:locality:nasus:noshare:ops} and \ref{fig:locality:nasus:noshare:ns}, like on Intel, the same performance inversion is present between libfibre and \CFA/Tokio. 651 659 Go still has the same poor performance. 652 660 … … 751 759 \end{centering} 752 760 \caption[Transfer Benchmark on Intel and AMD]{Transfer Benchmark on Intel and AMD\smallskip\newline Average measurement of how long it takes for all \ats to acknowledge the leader \at. 761 For each runtime, the average is calculated over 100'000 transfers, except for Go which only has 1000 transfer (due to the difference in transfer time). 753 762 DNC stands for ``did not complete'', meaning that after 5 seconds of a new leader being decided, some \ats still had not acknowledged the new leader.} 754 763 \label{fig:transfer:res} … … 764 773 765 774 The first two columns show the results for the semaphore variation on Intel. 766 While there are some differences in latencies, \CFA isconsistently the fastest and Tokio the slowest, all runtimes achieve fairly close results.767 Again, this experiment is meant to highlight major differences so latencies within $10\times$ of each other are considered equal.775 While there are some differences in latencies, with \CFA consistently the fastest and Tokio the slowest, all runtimes achieve fairly close results. 776 Again, this experiment is meant to highlight major differences, so latencies within $10\times$ of each other are considered equal. 768 777 769 778 Looking at the next two columns, the results for the yield variation on Intel, the story is very different. … … 780 789 Neither Libfibre nor Tokio complete the experiment. 781 790 782 This experiment clearly demonstrates that \CFA achieves significantly better fairness.791 This experiment clearly demonstrates that \CFA achieves a stronger fairness guarantee. 783 792 The semaphore variation serves as a control, where all runtimes are expected to transfer leadership fairly quickly. 784 793 Since \ats block after acknowledging the leader, this experiment effectively measures how quickly \procs can steal \ats from the \proc running the leader. … … 790 799 Without \procs stealing from the \proc running the leader, the experiment cannot terminate. 791 800 Go manages to complete the experiment because it adds preemption on top of classic work-stealing. 792 However, since preemption is fairly infrequent, it achieves significantly wors tperformance.801 However, since preemption is fairly infrequent, it achieves significantly worse performance. 793 802 In contrast, \CFA achieves equivalent performance in both variations, demonstrating very good fairness. 794 803 Interestingly \CFA achieves better delays in the yielding version than the semaphore version, however, that is likely due to fairness being equivalent but removing the cost of the semaphores and idle sleep. -
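To make the transfer pattern above concrete, the following is a minimal C sketch of the round structure only: one thread leads each round and spins until every other thread has acknowledged it, while the non-leaders acknowledge and then wait for the round to advance. It is not the thesis benchmark, which runs on the user-level \ats of each runtime rather than on kernel threads, and the names (@NTHREADS@, @ROUNDS@, @worker@) are invented for the illustration.
\begin{lstlisting}[language=C]
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdio.h>

#define NTHREADS 4
#define ROUNDS   1000

static atomic_int round_no = 0;  // current round; the leader of round r is r % NTHREADS
static atomic_int acks     = 0;  // acknowledgements received in the current round

static void *worker(void *arg) {
    int id = (int)(long)arg;
    for (int r = 0; r < ROUNDS; r++) {
        if (r % NTHREADS == id) {
            // Leader: spin until every other thread has acknowledged this round.
            while (atomic_load(&acks) != NTHREADS - 1) sched_yield();
            atomic_store(&acks, 0);
            atomic_fetch_add(&round_no, 1);  // release everyone into the next round
        } else {
            atomic_fetch_add(&acks, 1);      // acknowledge the current leader
            // "Yield" variation: keep yielding until the leader closes the round;
            // the "semaphore" variation would block here instead of spinning.
            while (atomic_load(&round_no) == r) sched_yield();
        }
    }
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];
    for (long i = 0; i < NTHREADS; i++) pthread_create(&tid[i], NULL, worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++) pthread_join(tid[i], NULL);
    printf("completed %d leadership transfers\n", ROUNDS);
    return 0;
}
\end{lstlisting}
The yielding non-leaders only make progress if the scheduler eventually runs every thread, which is the fairness property the benchmark probes; replacing the yield loop with a semaphore wait gives the control variation.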
doc/theses/thierry_delisle_PhD/thesis/text/existing.tex
r7d9598d8 r2dcd80a 2 2 As stated, scheduling is the process of assigning resources to incoming requests, where the common example is assigning available workers to work requests or vice versa. 3 3 Common scheduling examples in Computer Science are: operating systems and hypervisors schedule available CPUs, NICs schedule available bandwidth, virtual memory and memory allocator schedule available storage, \etc. 4 Scheduling is also common in most other fields ,\eg in assembly lines, assigning parts to line workers is a form of scheduling.4 Scheduling is also common in most other fields; \eg in assembly lines, assigning parts to line workers is a form of scheduling. 5 5 6 6 In general, \emph{selecting} a scheduling algorithm depends on how much information is available to the scheduler. … … 8 8 A secondary aspect is how much information can be gathered versus how much information must be given as part of the scheduler input. 9 9 This information adds to the spectrum of scheduling algorithms, going from static schedulers that are well informed from the start, to schedulers that gather most of the information needed, to schedulers that can only rely on very limited information. 10 Note, this description includes both information about each request, \eg time to complete or resources needed, and information about the relationships among requests, \eg whether some requests must be completed before another request starts.11 12 Scheduling physical resources, \eg in an assembly line, is generally amenable to using well-informed scheduling since information can be gathered much faster than the physical resources can be assigned and workloads are likely to stay stable for long periods.13 When a faster pace is needed and changes are much more frequent gathering information on workloads, up-front or live, can become much more limiting and more general schedulers are needed.10 This description includes both information about each request, \eg time to complete or resources needed, and information about the relationships among requests, \eg whether some requests must be completed before another request starts. 11 12 Scheduling physical resources, \eg in an assembly line, is generally amenable to using well-informed scheduling since information can be gathered much faster than the physical resources can be assigned, and workloads are likely to stay stable for long periods. 13 When a faster pace is needed and changes are much more frequent, then gathering information on workloads, up-front or live, can become much more limiting and more general schedulers are needed. 14 14 15 15 \section{Naming Convention} 16 Scheduling has been studied by various communities concentrating on different incarnations of the same problems.16 Scheduling has been studied by various communities, concentrating on different incarnations of the same problems. 17 17 As a result, there are no standard naming conventions for scheduling that are respected across these communities. 18 18 This document uses the term \newterm{\Gls{at}} to refer to the abstract objects being scheduled and the term \newterm{\Gls{proc}} to refer to the concrete objects executing these \ats. 19 19 20 20 \section{Static Scheduling} 21 \newterm{Static schedulers} require \at sdependencies and costs to be explicitly and exhaustively specified prior to scheduling.21 \newterm{Static schedulers} require \at dependencies and costs to be explicitly and exhaustively specified prior to scheduling. 
22 22 The scheduler then processes this input ahead of time and produces a \newterm{schedule} the system follows during execution. 23 23 This approach is popular in real-time systems since the need for strong guarantees justifies the cost of determining and supplying this information. … … 27 27 28 28 \section{Dynamic Scheduling} 29 \newterm{Dynamic schedulers} determine \at dependencies and costs during scheduling, if at all. 30 Hence, unlike static scheduling, \at dependencies are conditional and detected at runtime. 31 This detection takes the form of observing new \ats in the system and determining dependencies from their behaviour, including suspending or halting a \at that dynamically detects unfulfilled dependencies. 29 \newterm{Dynamic schedulers} detect \at dependencies and costs during scheduling, if at all. 30 This detection takes the form of observing new \ats in the system and determining dependencies from their behaviour, where a \at suspends or halts dynamically when it detects unfulfilled dependencies. 32 31 Furthermore, each \at has the responsibility of adding dependent \ats back into the system once dependencies are fulfilled. 33 32 As a consequence, the scheduler often has an incomplete view of the system, seeing only \ats with no pending dependencies. … … 35 34 \subsection{Explicitly Informed Dynamic Schedulers} 36 35 While dynamic schedulers may not have an exhaustive list of dependencies for a \at, some information may be available about each \at, \eg expected duration, required resources, relative importance, \etc. 37 When available, a scheduler can then use this information to direct thescheduling decisions.36 When available, a scheduler can then use this information to direct scheduling decisions. 38 37 For example, when scheduling in a cloud computing context, \ats will commonly have extra information that was manually entered, \eg caps on compute time or \io usage. 39 38 However, in the context of user-level threading, most programmers do not determine or even \emph{predict} this information; … … 45 44 46 45 \subsubsection{Priority Scheduling} 47 Common information used by schedulers to direct their algorithm is priorities.46 A common approach to direct the scheduling algorithm is to add information about \at priority. 48 47 Each \at is given a priority, and higher-priority \ats are preferred to lower-priority ones. 49 48 The simplest priority scheduling algorithm is to require that every \at have a distinct pre-established priority and always run the available \ats with the highest priority. … … 81 80 \paragraph{Task Placement} Another aspect of work stealing that has been studied extensively is the mapping between \at and \proc. 82 81 In its simplest form, work stealing assumes that all \procs are interchangeable and therefore the mapping between \at and \proc is not interesting. 83 However, in real-life architectures there are contexts where different \procs can have different characteristics, which makes some mappingmore interesting than others.82 However, in real-life architectures, there are contexts where different \procs can have different characteristics, which makes some mappings more interesting than others. 84 83 A common example where this is statically true is architectures with \glsxtrshort{numa}. 85 84 In these cases, it can be relevant to change the scheduler to be cognizant of the topology~\cite{vikranth2013topology,min2011hierarchical}. 
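As an illustration of the simplest priority approach mentioned above, the following C sketch keeps one FIFO per priority level and always dequeues from the highest non-empty level. It is only a sketch: the names (@prio_rq@, @task@, @NPRIO@) are invented, a small range of levels stands in for fully distinct priorities, and the synchronization a real shared ready-queue needs is omitted.
\begin{lstlisting}[language=C]
#include <stddef.h>

#define NPRIO 8  // number of priority levels, 0 is the highest

struct task {
    int priority;              // fixed, pre-established priority
    struct task *next;
};

struct prio_rq {
    struct task *head[NPRIO];  // one FIFO per priority level
    struct task *tail[NPRIO];
};

// Enqueue at the tail of the task's priority level.
static void rq_push(struct prio_rq *rq, struct task *t) {
    int p = t->priority;
    t->next = NULL;
    if (rq->tail[p]) rq->tail[p]->next = t; else rq->head[p] = t;
    rq->tail[p] = t;
}

// Dequeue the oldest task of the highest non-empty priority level.
static struct task *rq_pop(struct prio_rq *rq) {
    for (int p = 0; p < NPRIO; p++) {
        struct task *t = rq->head[p];
        if (t) {
            rq->head[p] = t->next;
            if (rq->head[p] == NULL) rq->tail[p] = NULL;
            return t;
        }
    }
    return NULL;               // no ready task at any priority
}

int main(void) {
    struct prio_rq rq = {0};
    struct task a = { .priority = 3 }, b = { .priority = 1 };
    rq_push(&rq, &a); rq_push(&rq, &b);
    return rq_pop(&rq)->priority;  // 1: the higher-priority task runs first
}
\end{lstlisting}
Even this tiny sketch shows the obvious hazard: a steady stream of higher-priority \ats keeps @rq_pop@ from ever reaching the lower levels.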
… … 88 87 \paragraph{Complex Machine Architecture} Another aspect that has been examined is how applicable work stealing is to different machine architectures. 89 88 This is arguably strongly related to Task Placement but extends into more heterogeneous architectures. 90 As \CFA offers no particular support for heterogeneous architecture , this is also an area that is less relevant tothis thesis. 91 Although it could be an interesting avenue for future work. 89 As \CFA offers no particular support for heterogeneous architectures, this is also an area that is not examined in this thesis. 90 However, support for concurrency across heterogeneous architectures is an interesting avenue for future work, at which point the literature on this topic and how it relates to scheduling will become relevant. 92 91 93 92 \subsection{Theoretical Results} 94 93 There is also a large body of research on the theoretical aspects of work stealing. These evaluate, for example, the cost of \glslink{atmig}{migration}~\cite{DBLP:conf/sigmetrics/SquillanteN91,DBLP:journals/pe/EagerLZ86}, how affinity affects performance~\cite{DBLP:journals/tpds/SquillanteL93,DBLP:journals/mst/AcarBB02,DBLP:journals/ipl/SuksompongLS16} and theoretical models for heterogeneous systems~\cite{DBLP:journals/jpdc/MirchandaneyTS90,DBLP:journals/mst/BenderR02,DBLP:conf/sigmetrics/GastG10}. 95 \cite{DBLP:journals/jacm/BlellochGM99} examines the space bounds of work stealing and \cite{DBLP:journals/siamcomp/BerenbrinkFG03} shows that for under-loaded systems, the scheduler completes its computations in finite time, \ie is \newterm{stable}. 94 Blelloch et al.~\cite{DBLP:journals/jacm/BlellochGM99} examine the space bounds of work stealing and \cite{DBLP:journals/siamcomp/BerenbrinkFG03} shows that for under-loaded systems, the scheduler completes its computations in finite time, \ie is \newterm{stable}. 96 95 Others show that work stealing applies to various scheduling contexts~\cite{DBLP:journals/mst/AroraBP01,DBLP:journals/anor/TchiboukdjianGT13,DBLP:conf/isaac/TchiboukdjianGTRB10,DBLP:conf/ppopp/AgrawalLS10,DBLP:conf/spaa/AgrawalFLSSU14}. 97 96 \cite{DBLP:conf/ipps/ColeR13} also studied how randomized work-stealing affects false sharing among \ats. … … 104 103 Preemption is the idea of interrupting \ats that have been running too long, effectively injecting suspend points into the application. 105 104 There are multiple techniques to achieve this effect, but they all aim to guarantee that the suspend points in a \at are never further apart than some fixed duration. 106 While this helps schedulers guarantee that no \ats unfairly monopolize a worker, preemption can effectively be added to any scheduler. 105 This helps schedulers guarantee that no \ats unfairly monopolize a worker. 106 Preemption can effectively be added to any scheduler. 107 107 Therefore, the only interesting aspect of preemption for the design of scheduling is whether to require it. 108 108 109 109 110 110 This section presents a quick overview of several current schedulers. 111 111 While these schedulers do not necessarily represent the most recent advances in scheduling, they are what is generally accessible to programmers. 112 As such, I believe these schedulers are at least as relevant as those presented in published work. 113 Both Schedulers that operate in kernel space and user space are considered, as both can offer relevant insight for this project.
114 However, real-time schedulers are not considered, as these have constraints that are much stricter than what is needed for this project. 112 As such, I believe these schedulers are as relevant as those presented in published work. 113 Both schedulers that operate in kernel space and user space are considered, as both can offer relevant insight for this project. 114 However, real-time schedulers aim to guarantee bounded compute time in order to meet deadlines. 115 These deadlines lead to constraints much stricter than the starvation freedom that is needed for this project. 116 As such real-time schedulers are not considered for this work. 115 117 116 118 \subsection{Operating System Schedulers} 117 Operating System Schedulers tend to be fairly complex as they generally support some amount of real time, aim to balance interactive and non-interactive \ats and support multiple users sharing hardware without requiring these users to cooperate.119 Operating System Schedulers tend to be fairly complex, as they generally support some amount of real time, aim to balance interactive and non-interactive \ats and support multiple users sharing hardware without requiring these users to cooperate. 118 120 Here are more details on a few schedulers used in the common operating systems: Linux, FreeBSD, Microsoft Windows and Apple's OS X. 119 121 The information is less complete for closed source operating systems. … … 137 139 It also periodically balances the load of the system (according to a different heuristic) but uses a simpler work stealing approach. 138 140 139 \paragraph{Windows (OS)}141 \paragraph{Windows (OS)} 140 142 Microsoft's Operating System's Scheduler~\cite{MAN:windows/scheduler} is a feedback scheduler with priorities. 141 143 It supports 32 levels of priorities, some of which are reserved for real-time and privileged applications. … … 143 145 The scheduler may also temporarily adjust priorities after certain effects like the completion of I/O requests. 144 146 145 In~\cite{russinovich2009windows}, Chapter 1 section 2.3 ``Processes, Threads, and Jobs'' discusses the scheduling policy more in-depth.147 The scheduling policy is discussed more in-depth in~\cite{russinovich2009windows}, Chapter 1 section 2.3 ``Processes, Threads, and Jobs''. 146 148 Multicore scheduling is based on a combination of priorities and \proc preference. 147 149 Each \at is assigned an initial processor using a round-robin policy, called the \at's \newterm{ideal} \proc. … … 168 170 \paragraph{Go}\label{GoSafePoint} 169 171 Go's scheduler uses a randomized work-stealing algorithm that has a global run-queue (\emph{GRQ}) and each processor (\emph{P}) has both a fixed-size run-queue (\emph{LRQ}) and a high-priority next ``chair'' holding a single element~\cite{GITHUB:go,YTUBE:go}. 170 Preemption is present, but only at safe points ,~\cite{go:safepoints}which are detection code inserted at various frequent access boundaries.172 Preemption is present, but only at safe points~\cite{go:safepoints}, which are detection code inserted at various frequent access boundaries. 171 173 172 174 The algorithm is as follows : 173 175 \begin{enumerate} 174 176 \item Once out of 61 times, pick 1 element from the \emph{GRQ}. 175 \item If there is an item in the ``chair'' pick it.177 \item Otherwise, if there is an item in the ``chair'' pick it. 176 178 \item Else pick an item from the \emph{LRQ}. 
177 179 \begin{itemize} … … 180 182 \end{itemize} 181 183 \end{enumerate} 184 185 Chapter~\ref{microbench} uses Go as one of its comparison points in this thesis's performance evaluation. 182 186 183 187 \paragraph{Erlang} … … 224 228 \paragraph{LibFibre} 225 229 LibFibre~\cite{DBLP:journals/pomacs/KarstenB20} is a lightweight user-level threading framework developed at the University of Waterloo. 226 Similarly to Go, it uses a variation of work stealing with a global queue that has a higher priority than stealing. 227 Unlike Go, it does not have the high-priority next ``chair'' and does not use randomized work-stealing. 230 It shares a very strong resemblance to Go: using a variation of work stealing with a global queue that has a higher priority than stealing. 231 Unlike Go, it does not have the high-priority next ``chair'' and its work-stealing is not randomized. 232 233 Chapter~\ref{microbench} uses LibFibre as one of its comparison points in this thesis's performance evaluation. -
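To make the pick order in the enumeration above easier to follow, here is a small, self-contained C sketch of the same decision sequence: an occasional check of the global queue, then the ``chair'', then the local queue, with refilling and stealing as fallbacks. The queue type, the @pick@ routine and the refill/steal details are simplified stand-ins invented for this sketch; the actual data structures and stealing policies of Go and libfibre differ.
\begin{lstlisting}[language=C]
#include <stdbool.h>
#include <stdio.h>

#define QCAP 256

// Trivial FIFO used as a stand-in for a run queue.
struct queue { int item[QCAP]; unsigned head, tail; };
static bool q_empty(const struct queue *q) { return q->head == q->tail; }
static void q_push(struct queue *q, int v) { q->item[q->tail++ % QCAP] = v; }
static int  q_pop(struct queue *q)         { return q->item[q->head++ % QCAP]; }

struct P {                 // one processor
    struct queue lrq;      // local run queue (LRQ)
    int chair;             // high-priority "next" slot, -1 when empty
    unsigned ticks;        // drives the occasional global check
};

// Move roughly half of the source queue into the destination queue.
static void grab_half(struct queue *dst, struct queue *src) {
    unsigned n = (src->tail - src->head + 1) / 2;
    for (unsigned i = 0; i < n; i++) q_push(dst, q_pop(src));
}

// Pick the next task id for processor p, or -1 if nothing is runnable.
static int pick(struct P *p, struct queue *grq, struct P *victim) {
    if (++p->ticks % 61 == 0 && !q_empty(grq)) return q_pop(grq);       // 1. once out of 61 times, the GRQ
    if (p->chair != -1) { int t = p->chair; p->chair = -1; return t; }  // 2. the "chair"
    if (!q_empty(&p->lrq)) return q_pop(&p->lrq);                       // 3. the LRQ
    if (!q_empty(grq)) { grab_half(&p->lrq, grq); return q_pop(&p->lrq); }                    // refill from the GRQ
    if (!q_empty(&victim->lrq)) { grab_half(&p->lrq, &victim->lrq); return q_pop(&p->lrq); }  // steal half of a victim's LRQ
    return -1;
}

int main(void) {
    struct queue grq = {0};
    struct P p0 = { .chair = -1 }, p1 = { .chair = -1 };
    for (int i = 0; i < 8; i++) q_push(&grq, i);  // some global work
    p1.chair = 100;                               // something parked in p1's chair
    for (int t; (t = pick(&p0, &grq, &p1)) != -1; ) printf("p0 runs task %d\n", t);
    return 0;
}
\end{lstlisting}
The occasional global-queue check exists so that work sitting in the \emph{GRQ} is not starved by a constantly refilled local queue.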
doc/theses/thierry_delisle_PhD/thesis/text/front.tex
r7d9598d8 r2dcd80a 88 88 \\ 89 89 Internal Member: \> Martin Karsten \\ 90 \> AssociateProfessor, School of Computer Science \\90 \> Professor, School of Computer Science \\ 91 91 \> University of Waterloo \\ 92 92 \end{tabbing} … … 108 108 % The following is a sample Declaration Page as provided by the GSO 109 109 % December 13th, 2006. It is designed for an electronic thesis. 110 \begin{center}\textbf{Author's Declaration}\end{center} 110 111 \noindent 111 112 I hereby declare that I am the sole author of this thesis. This is a true copy of the thesis, including any required final revisions, as accepted by my examiners. … … 127 128 Indeed, over-partitioning into small work-units with user threading significantly eases load bal\-ancing, while simultaneously providing advanced synchronization and mutual exclusion capabilities. 128 129 To manage these high levels of concurrency, the underlying runtime must efficiently schedule many user threads across a few kernel threads; 129 which begs the question of how many kernel threads are needed and should the number be dynamically reevaluated.130 which raises the question of how many kernel threads are needed and should the number be dynamically reevaluated. 130 131 Furthermore, scheduling must prevent kernel threads from blocking, otherwise user-thread parallelism drops. 131 When user-threading parallelism does drop, how and when should idle \glspl{kthrd} be put to sleep to avoid wasting CPU resources .132 When user-threading parallelism does drop, how and when should idle \glspl{kthrd} be put to sleep to avoid wasting CPU resources? 132 133 Finally, the scheduling system must provide fairness to prevent a user thread from monopolizing a kernel thread; 133 134 otherwise, other user threads can experience short/long term starvation or kernel threads can deadlock waiting for events to occur on busy kernel threads. … … 195 196 % GLOSSARIES (Lists of definitions, abbreviations, symbols, etc. provided by the glossaries-extra package) 196 197 % ----------------------------- 197 \printglossary[type=\acronymtype ]198 \printglossary[type=\acronymtype,title={List of Abbreviations}] 198 199 \cleardoublepage 199 200 \phantomsection % allows hyperref to link to the correct page -
doc/theses/thierry_delisle_PhD/thesis/text/intro.tex
r7d9598d8 r2dcd80a 1 1 \chapter{Introduction}\label{intro} 2 2 3 \Gls{uthrding} (M:N) is gaining popularity over kernel-level threading (1:1) in many programming languages .3 \Gls{uthrding} (M:N) is gaining popularity over kernel-level threading (1:1) in many programming languages, like Go~\cite{GITHUB:go}, Java's Project Loom~\cite{MAN:project-loom} and Kotlin~\cite{MAN:kotlin}. 4 4 The user threading approach is often a better mechanism to express complex concurrent applications by efficiently running 10,000+ threads on multicore systems. 5 5 Indeed, over-partitioning into small work units with user threading significantly eases load bal\-ancing, while simultaneously providing advanced synchronization and mutual exclusion capabilities. 6 6 To manage these high levels of concurrency, the underlying runtime must efficiently schedule many user threads across a few kernel threads; 7 which begs the question of how many kernel threads are needed and should the number be dynamically reevaluated.7 which raises the question of how many kernel threads are needed and should the number be dynamically reevaluated. 8 8 Furthermore, scheduling must prevent kernel threads from blocking, otherwise user-thread parallelism drops. 9 When user-threading parallelism does drop, how and when should idle kernel-threads be put to sleep to avoid wasting CPU resources .9 When user-threading parallelism does drop, how and when should idle kernel-threads be put to sleep to avoid wasting CPU resources? 10 10 Finally, the scheduling system must provide fairness to prevent a user thread from monopolizing a kernel thread; 11 11 otherwise, other user threads can experience short/long term starvation or kernel threads can deadlock waiting for events to occur on busy kernel threads. … … 16 16 Fairness is handled through preemption and/or ad hoc solutions, which leads to coarse-grained fairness with some pathological cases. 17 17 18 After examining, testing and selecting specific approaches to these scheduling issues, a completelynew scheduler was created and tested in the \CFA (C-for-all) user-threading runtime system.18 After examining, testing and selecting specific approaches to these scheduling issues, a new scheduler was created and tested in the \CFA (C-for-all) user-threading runtime system. 19 19 The goal of the new scheduler is to offer increased safety and productivity without sacrificing performance. 20 The quality of the new scheduler is demonstrated by comparing it with other user-threading work-stealing schedulers with ,the aim of showing equivalent or better performance while offering better fairness.20 The quality of the new scheduler is demonstrated by comparing it with other user-threading work-stealing schedulers with the aim of showing equivalent or better performance while offering better fairness. 21 21 22 22 Chapter~\ref{intro} defines scheduling and its general goals. … … 32 32 Computer systems share multiple resources across many threads of execution, even on single-user computers like laptops or smartphones. 33 33 On a computer system with multiple processors and work units (routines, coroutines, threads, programs, \etc), there exists the problem of mapping many different kinds of work units onto many different kinds of processors efficiently, called \newterm{scheduling}. 34 Scheduling systems are normally \newterm{open}, meaning new work arrives from an external source or is randomly spawned from an existing work unit. 
34 Scheduling systems are normally \newterm{open}, meaning new work arrives from an external source or is randomly spawned from an existing work unit\footnote{ 35 Open systems contrast with \newterm{closed} systems, where work never arrives from external sources. 36 This definition is an extension of open/closed systems in the field of thermodynamics. 37 }. 38 35 39 In general, work units without threads, like routines and coroutines, are self-scheduling, while work units with threads, like tasks and programs, are scheduled. 36 40 For scheduled work-units, a scheduler takes a sequence of threads and attempts to run them to completion, subject to shared resource restrictions and utilization. … … 94 98 \end{enumerate} 95 99 96 Scheduling is azero-sum game as computer processors normally have a fixed, maximum number of cycles per unit time.\footnote{ 100 At a high level, scheduling is considered a zero-sum game as computer processors normally have a fixed, maximum number of cycles per unit time.\footnote{ 97 101 Frequency scaling and turbo-boost add a degree of complexity that can be ignored in this discussion without loss of generality.} 98 Hence, schedulers are a series of compromises, occasionally with some static or dynamic tuning parameters to enhance specific workload patterns. 102 This almost invariably leads to schedulers needing to pick some \ats over others, opening the door to fairness problems. 103 However, at a lower level, schedulers can make inefficient or incorrect decisions leading to strictly worse outcomes than what the theoretical zero-sum game suggests. 104 Since it can be difficult to avoid these poor decisions, schedulers are generally a series of compromises, occasionally with some static or dynamic tuning parameters to enhance specific workload patterns. 99 105 For example, SQMS has perfect load-balancing but poor affinity and high contention by the processors, because of the single queue. 100 106 While MQMS has poor load-balancing but perfect affinity and no contention, because each processor has its own queue. … … 113 119 Specifically, this work concentrates on advanced thread and \glsxtrshort{io} scheduling. 114 120 Prior to this work, the \CFA runtime used a strict SQMS \gls{rQ} and provided no nonblocking \glsxtrshort{io} capabilities at the user-thread level.\footnote{ 115 C/\CC only support \glsxtrshort{io} capabilities at the kernel level, which means many \io operations block \glspl{kthrd} reducing parallelism at the user level.} 121 C/\CC only support \glsxtrshort{io} capabilities at the kernel level, which means many \io operations block \glspl{kthrd}, reducing parallelism at the user level.} 116 122 117 123 Since \CFA attempts to improve the safety and productivity of C, the new scheduler presented in this thesis attempts to achieve the same goals. 118 124 More specifically, safety and productivity for scheduling mean supporting a wide range of workloads, so that programmers can rely on progress guarantees (safety) and more easily achieve acceptable performance (productivity). 119 125 The new scheduler also includes support for implicit nonblocking \io, allowing applications to have more user-threads blocking on \io operations than there are \glspl{kthrd}.
120 126 To complete the scheduler, an idle sleep mechanism is implemented that significantly reduces wasted CPU cycles, which are then available outside the application. 121 127 122 As a research project, this work builds exclusively on newer versions of the Linux operating system and gcc/clang compilers.128 As a research project, this work runs exclusively on newer versions of the Linux operating system and gcc/clang compilers. 123 129 The new scheduler implementation uses several optimizations to successfully balance the cost of fairness against performance; 124 130 some of these optimizations rely on interesting hardware optimizations only present on modern CPUs. … … 138 144 An algorithm for load-balancing and idle sleep of processors, including NUMA awareness. 139 145 \item 140 A mechanism for adding fairness on top of MQMS algorithm through helping, used both for scalable scheduling algorithm and the user-level \glsxtrshort{io}.146 A mechanism for adding fairness on top of the MQMS algorithm through helping, used both for scalable scheduling algorithm and the user-level \glsxtrshort{io}. 141 147 \item 142 148 An optimization of the helping mechanism for load balancing to reduce scheduling costs. -
doc/theses/thierry_delisle_PhD/thesis/text/io.tex
r7d9598d8 r2dcd80a 1 1 \chapter{User Level \io}\label{userio} 2 2 As mentioned in Section~\ref{prev:io}, user-level \io requires multiplexing the \io operations of many \ats onto fewer \glspl{proc} using asynchronous \io operations. 3 I/O operations, among others, generally block the \gls{kthrd} when the operation needs to wait for unavailable resources. 4 When using \gls{uthrding}, this results in the \proc blocking rather than the \at, hindering parallelism and potentially causing deadlocks (see Chapter~\ref{prev:io}). 3 5 Different operating systems offer various forms of asynchronous operations and, as mentioned in Chapter~\ref{intro}, this work is exclusively focused on the Linux operating system. 4 6 … … 16 18 This mechanism is also crucial in determining when all \ats are blocked and the application \glspl{kthrd} can now block. 17 19 18 There are three options to monitor file descriptors in Linux:\footnote{20 There are three options to monitor file descriptors (FD) in Linux:\footnote{ 19 21 For simplicity, this section omits \lstinline{pselect} and \lstinline{ppoll}. 20 22 The difference between these system calls and \lstinline{select} and \lstinline{poll}, respectively, is not relevant for this discussion.} … … 25 27 \paragraph{\lstinline{select}} is the oldest of these options, and takes as input a contiguous array of bits, where each bit represents a file descriptor of interest. 26 28 Hence, the array length must be as long as the largest FD currently of interest. 27 On return, it outputs the set mo tified inplace to identify which of the file descriptors changed state.29 On return, it outputs the set modified in-place to identify which of the file descriptors changed state. 28 30 This destructive change means selecting in a loop requires re-initializing the array for each iteration. 29 Another limit of @select@ is that calls from different \glspl{kthrd} sharing FDs are independent.31 Another limitation of @select@ is that calls from different \glspl{kthrd} sharing FDs are independent. 30 32 Hence, if one \gls{kthrd} is managing the select calls, other threads can only add/remove to/from the manager's interest set through synchronized calls to update the interest set. 31 33 However, these changes are only reflected when the manager makes its next call to @select@. … … 46 48 However, all three of these I/O systems have limitations. 47 49 The @man@ page for @O_NONBLOCK@ mentions that ``[@O_NONBLOCK@] has no effect for regular files and block devices'', which means none of these three system calls are viable multiplexing strategies for these types of \io operations. 48 Furthermore, TTYs can also be tricky to use since they can take different forms based on how the command is executed.50 Furthermore, TTYs (FDs connect to a standard input and output) can also be tricky to use since they can take different forms based on how the command is executed. 49 51 For example, @epoll@ rejects FDs pointing to regular files or block devices, which includes @stdin@ when using shell redirections~\cite[\S~3.6]{MAN:bash}, but does not reject shell pipelines~\cite[\S~3.2.3]{MAN:bash}, which includes pipelines into @stdin@. 50 52 Finally, none of these are useful solutions for multiplexing \io operations that do not have a corresponding file descriptor and can be awkward for operations using multiple file descriptors. … … 52 54 \subsection{POSIX asynchronous I/O (AIO)} 53 55 An alternative to @O_NONBLOCK@ is the AIO interface. 
54 Its interface lets programmers enqueue operations to be performed asynchronously by the kernel. 55 Completions of these operations can be communicated in various ways: either by spawning a new \gls{kthrd}, sending a Linux signal, or polling for completion of one or more operations. 56 For this work, spawning a new \gls{kthrd} is counterproductive but a related solution is discussed in Section~\ref{io:morethreads}. 57 Using interrupt handlers can also lead to fairly complicated interactions between subsystems and has a non-trivial cost. 58 Leaving polling for completion, which is similar to the previous system calls. 59 AIO only supports read and write operations to file descriptors, it does not have the same limitation as @O_NONBLOCK@, \ie, the file descriptors can be regular files and blocked devices. 60 It also supports batching multiple operations in a single system call. 61 62 AIO offers two different approaches to polling: @aio_error@ can be used as a spinning form of polling, returning @EINPROGRESS@ until the operation is completed, and @aio_suspend@ can be used similarly to @select@, @poll@ or @epoll@, to wait until one or more requests have been completed. 63 For \io multiplexing, @aio_suspend@ is the best interface. 64 However, even if AIO requests can be submitted concurrently, @aio_suspend@ suffers from the same limitation as @select@ and @poll@, \ie, the interest set cannot be dynamically changed while a call to @aio_suspend@ is in progress. 56 Using AIO, programmers can enqueue operations which are to be performed 57 asynchronously by the kernel. 58 The kernel can communicate 59 completions of these operations in three ways: 60 it can spawn a new \gls{kthrd}; send a Linux signal; or 61 userspace can poll for completion of one or more operations. 62 Spawning a new \gls{kthrd} is not consistent with working at the user-level thread level, but Section~\ref{io:morethreads} discusses a related solution. 63 Signals and their associated interrupt handlers can also lead to fairly complicated 64 interactions between subsystems, and they have a non-trivial cost. 65 This leaves a single option: polling for completion---this is similar to the previously discussed 66 system calls. 67 While AIO only supports read and write operations to file descriptors; it does not have the same limitations as @O_NONBLOCK@, \ie, the file 68 descriptors can be regular files or block devices. 69 AIO also supports batching multiple operations in a single system call. 70 71 AIO offers two different approaches to polling: @aio_error@ can be used as a spinning form of polling, returning @EINPROGRESS@ until the operation is completed, while @aio_suspend@ can be used similarly to @select@, @poll@ or @epoll@, to wait until one or more requests have been completed. 72 Asynchronous interfaces normally handle more of the complexity than retry-based interfaces, which is convenient for \io multiplexing. 73 However, even if AIO requests can be submitted concurrently, @aio_suspend@ suffers from the same limitation as @select@ and @poll@: the interest set cannot be dynamically changed while a call to @aio_suspend@ is in progress. 65 74 AIO also suffers from the limitation of specifying which requests have been completed, \ie programmers have to poll each request in the interest set using @aio_error@ to identify the completed requests. 
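For concreteness, the following is a small sketch of this polling pattern using the POSIX AIO calls named above; the file name @data.bin@, the request count and the buffer size are placeholders, and error handling is minimal (older glibc versions also require linking with @-lrt@).
\begin{lstlisting}[language=C]
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

enum { NREQ = 2, BUFSZ = 4096 };

int main(void) {
    int fd = open("data.bin", O_RDONLY);  // placeholder input file
    if (fd < 0) { perror("open"); return 1; }

    static char buf[NREQ][BUFSZ];
    struct aiocb cb[NREQ];
    const struct aiocb *interest[NREQ];   // the interest set passed to aio_suspend
    memset(cb, 0, sizeof(cb));

    for (int i = 0; i < NREQ; i++) {
        cb[i].aio_fildes = fd;
        cb[i].aio_buf    = buf[i];
        cb[i].aio_nbytes = BUFSZ;
        cb[i].aio_offset = (off_t)i * BUFSZ;
        interest[i] = &cb[i];
        if (aio_read(&cb[i]) != 0) { perror("aio_read"); return 1; }  // enqueue the operation
    }

    for (int done = 0; done < NREQ; ) {
        // Block until at least one request in the interest set completes ...
        if (aio_suspend(interest, NREQ, NULL) != 0 && errno != EINTR) { perror("aio_suspend"); return 1; }
        // ... but aio_suspend does not say which one, so poll every request.
        for (int i = 0; i < NREQ; i++) {
            if (interest[i] != NULL && aio_error(&cb[i]) != EINPROGRESS) {
                ssize_t res = aio_return(&cb[i]);  // reap the result exactly once
                printf("request %d finished with %zd bytes\n", i, res);
                interest[i] = NULL;                // NULL entries are ignored by aio_suspend
                done++;
            }
        }
    }
    close(fd);
    return 0;
}
\end{lstlisting}
Even after @aio_suspend@ returns, the loop must call @aio_error@ on every outstanding request to discover which ones actually finished.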
66 75 This limitation means that, like @select@ and @poll@ but not @epoll@, the time needed to examine polling results increases based on the total number of requests monitored, not the number of completed requests. … … 92 101 A very recent addition to Linux, @io_uring@~\cite{MAN:io_uring}, is a framework that aims to solve many of the problems listed in the above interfaces. 93 102 Like AIO, it represents \io operations as entries added to a queue. 94 But like @epoll@, new requests can be submitted , while a blocking call waiting for requests to complete,is already in progress.95 The @io_uring@ interface uses two ring buffers (referred to simply as rings) at its core: a submit ring to which programmers push \io requests and a completion ringfrom which programmers poll for completion.103 But like @epoll@, new requests can be submitted while a blocking call waiting for requests to complete is already in progress. 104 The @io_uring@ interface uses two ring buffers (referred to simply as rings) at its core: a submit ring, to which programmers push \io requests, and a completion ring, from which programmers poll for completion. 96 105 97 106 One of the big advantages over the prior interfaces is that @io_uring@ also supports a much wider range of operations. 98 In addition to supporting reads and writes to any file descriptor like AIO, it supports other operationslike @open@, @close@, @fsync@, @accept@, @connect@, @send@, @recv@, @splice@, \etc.107 In addition to supporting reads and writes to any file descriptor like AIO, it also supports other operations, like @open@, @close@, @fsync@, @accept@, @connect@, @send@, @recv@, @splice@, \etc. 99 108 100 109 On top of these, @io_uring@ adds many extras like avoiding copies between the kernel and user space using shared memory, allowing different mechanisms to communicate with device drivers, and supporting chains of requests, \ie, requests that automatically trigger follow-up requests on completion. … … 106 115 This approach is used by languages like Go~\cite{GITHUB:go}, frameworks like libuv~\cite{libuv}, and web servers like Apache~\cite{apache} and NGINX~\cite{nginx}, since it has the advantage that it can easily be used across multiple operating systems. 107 116 This advantage is especially relevant for languages like Go, which offer a homogeneous \glsxtrshort{api} across all platforms. 108 As opposedto C, which has a very limited standard \glsxtrshort{api} for \io, \eg, the C standard library has no networking.117 Contrast this to C, which has a very limited standard \glsxtrshort{api} for \io, \eg, the C standard library has no networking. 109 118 110 119 \subsection{Discussion} 111 These options effectively fall into two broad camps: waiting for \io to be ready versus waiting for \io to complete.112 All operating systems that support asynchronous \io must offer an interface along one of these lines, but the details vary drastically.113 For example, Free BSD offers @kqueue@~\cite{MAN:bsd/kqueue}, which behaves similarly to @epoll@, but with some small quality of use improvements, while Windows (Win32)~\cite{win:overlap} offers ``overlapped I/O'', which handles submissions similarly to @O_NONBLOCK@ with extra flags on the synchronous system call, but waits for completion events, similarly to @io_uring@.114 115 For this project, I selected @io_uring@, in large part sbecause of its generality.120 These options effectively fall into two broad camps: waiting for \io to be ready, versus waiting for \io to complete. 
121 All operating systems that support asynchronous \io must offer an interface along at least one of these lines, but the details vary drastically. 122 For example, FreeBSD offers @kqueue@~\cite{MAN:bsd/kqueue}, which behaves similarly to @epoll@, but with some small quality of life improvements, while Windows (Win32)~\cite{win:overlap} offers ``overlapped I/O'', which handles submissions similarly to @O_NONBLOCK@ with extra flags on the synchronous system call, but waits for completion events, similarly to @io_uring@. 123 124 For this project, I selected @io_uring@, in large part because of its generality. 116 125 While @epoll@ has been shown to be a good solution for socket \io (\cite{Karsten20}), @io_uring@'s transparent support for files, pipes, and more complex operations, like @splice@ and @tee@, make it a better choice as the foundation for a general \io subsystem. 117 126 118 127 \section{Event-Engine} 119 128 An event engine's responsibility is to use the kernel interface to multiplex many \io operations onto few \glspl{kthrd}. 120 In concrete terms, this means \ats enter the engine through an interface, the event engine then starts an operation and parks the calling \ats, returningcontrol to the \gls{proc}.129 In concrete terms, this means \ats enter the engine through an interface, the event engine then starts an operation and parks the calling \ats, and then returns control to the \gls{proc}. 121 130 The parked \ats are then rescheduled by the event engine once the desired operation has been completed. 122 131 … … 125 134 Figure~\ref{fig:iouring} shows an overview of an @io_uring@ instance. 126 135 Two ring buffers are used to communicate with the kernel: one for submissions~(left) and one for completions~(right). 127 The submission ring contains entries,\newterm{Submit Queue Entries} (SQE), produced (appended) by the application when an operation starts and then consumed by the kernel.128 The completion ring contains entries,\newterm{Completion Queue Entries} (CQE), produced (appended) by the kernel when an operation completes and then consumed by the application.136 The submission ring contains \newterm{Submit Queue Entries} (SQE), produced (appended) by the application when an operation starts and then consumed by the kernel. 137 The completion ring contains \newterm{Completion Queue Entries} (CQE), produced (appended) by the kernel when an operation completes and then consumed by the application. 129 138 The submission ring contains indexes into the SQE array (denoted \emph{S} in the figure) containing entries describing the I/O operation to start; 130 139 the completion ring contains entries for the completed I/O operation. … … 134 143 \centering 135 144 \input{io_uring.pstex_t} 136 \caption[Overview of \lstinline{io_uring}]{Overview of \lstinline{io_uring} \smallskip\newline Two ring buffers are used to communicate with the kernel, one for completions~(right) and one for submissions~(left). The submission ring indexes into a pre-allocated array (denoted \emph{S}) instead.} 145 \caption[Overview of \lstinline{io_uring}]{Overview of \lstinline{io_uring} \smallskip\newline Two ring buffers are used to communicate with the kernel, one for completions~(right) and one for submissions~(left). 146 While the completion ring contains plain data, the submission ring contains only references. 
147 These references are indexes into an array (denoted \emph{S}), which is created at the same time as the two rings and is also readable by the kernel.} 137 148 \label{fig:iouring} 138 149 \end{figure} … … 153 164 Since the head is visible to the kernel, some memory barriers may be required to prevent the compiler from reordering these operations. 154 165 Since the submission ring is a regular ring buffer, more than one SQE can be added at once and the head is updated only after all entries are updated. 155 Note, SQE can be filled and submitted in any order, \eg in Figure~\ref{fig:iouring} the submission order is S0, S3, S2 andS1 has not been submitted.166 Note, SQE can be filled and submitted in any order, \eg in Figure~\ref{fig:iouring} the submission order is S0, S3, S2. S1 has not been submitted. 156 167 \item 157 168 The kernel is notified of the change to the ring using the system call @io_uring_enter@. 158 169 The number of elements appended to the submission ring is passed as a parameter and the number of elements consumed is returned. 159 The @io_uring@ instance can be constructed so this step is not required, but this requireselevated privilege.% and an early version of @io_uring@ had additional restrictions.170 The @io_uring@ instance can be constructed so this step is not required, but this feature requires that the process have elevated privilege.% and an early version of @io_uring@ had additional restrictions. 160 171 \end{enumerate} 161 172 … … 165 176 When operations do complete, the kernel appends a CQE to the completion ring and advances the head of the ring. 166 177 Each CQE contains the result of the operation as well as a copy of the @user_data@ field of the SQE that triggered the operation. 167 It is not necessary to call @io_uring_enter@ to get new events because the kernel can directly modify the completion ring.168 The system call is only needed if the application wants to block waiting for operations to complete.178 The @io_uring_enter@ system call is only needed if the application wants to block waiting for operations to complete or to flush the submission ring. 179 @io_uring@ supports option @IORING_SETUP_SQPOLL@ at creation, which can remove the need for the system call for submissions. 169 180 \end{sloppypar} 170 181 … … 179 190 This restriction means \io request bursts may have to be subdivided and submitted in chunks at a later time. 180 191 181 An important detail to keep in mind is that just like ``The cloud is just someone else's computer'' \cite{xkcd:cloud}, asynchronous operations are just operations using someone else's threads.192 An important detail to keep in mind is that just like ``The cloud is just someone else's computer''~\cite{xkcd:cloud}, asynchronous operations are just operations using someone else's threads. 182 193 Indeed, asynchronous operations can require computation time to complete, which means that if this time is not taken from the thread that triggered the asynchronous operation, it must be taken from some other threads. 183 194 In this case, the @io_uring@ operations that cannot be handled directly in the system call must be delegated to some other \gls{kthrd}. 184 195 To this end, @io_uring@ maintains multiple \glspl{kthrd} inside the kernel that are not exposed to the user. 185 Three kinds of operations that can need the \glspl{kthrd} :196 Three kinds of operations that can need the \glspl{kthrd} are: 186 197 187 198 \paragraph{Operations using} @IOSQE_ASYNC@. 
… … 190 201 \paragraph{Bounded operations.} 191 202 This is also a fairly simple case. As mentioned earlier in this chapter, [@O_NONBLOCK@] has no effect for regular files and block devices. 192 @io_uring@ must also take this reality into accountby delegating operations on regular files and block devices.203 Therefore, @io_uring@ handles this case by delegating operations on regular files and block devices. 193 204 In fact, @io_uring@ maintains a pool of \glspl{kthrd} dedicated to these operations, which are referred to as \newterm{bounded workers}. 194 205 195 206 \paragraph{Unbounded operations that must be retried.} 196 207 While operations like reads on sockets can return @EAGAIN@ instead of blocking the \gls{kthrd}, in the case these operations return @EAGAIN@ they must be retried by @io_uring@ once the data is available on the socket. 197 Since this retry cannot necessarily be done in the system call, @io_uring@ must delegate these calls to a \gls{kthrd}.208 Since this retry cannot necessarily be done in the system call, \ie, using the application's \gls{kthrd}, @io_uring@ must delegate these calls to \glspl{kthrd} in the kernel. 198 209 @io_uring@ maintains a separate pool for these operations. 199 210 The \glspl{kthrd} in this pool are referred to as \newterm{unbounded workers}. 211 Once unbounded operations are ready to be retried, one of the workers is woken up and it will handle the retry inside the kernel. 200 212 Unbounded workers are also responsible for handling operations using @IOSQE_ASYNC@. 201 213 … … 212 224 however, the duration of the system call scales with the number of entries submitted. 213 225 The consequence is that the amount of parallelism used to prepare submissions for the next system call is limited. 214 Beyond this limit, the length of the system call is the throughput 226 Beyond this limit, the length of the system call is the throughput-limiting factor. 215 227 I concluded from early experiments that preparing submissions seems to take almost as long as the system call itself, which means that with a single @io_uring@ instance, there is no benefit in terms of \io throughput to having more than two \glspl{hthrd}. 216 228 Therefore, the design of the submission engine must manage multiple instances of @io_uring@ running in parallel, effectively sharding @io_uring@ instances. 217 229 Since completions are sent to the instance where requests were submitted, all instances with pending operations must be polled continuously\footnote{ 218 As described in Chapter~\ref{practice}, this does not translate into constant CPU usage.}. 219 Note that once an operation completes, there is nothing that ties it to the @io_uring@ instance that handled it. 220 Nothing preventing a new operation, with for example the same file descriptor, to use a different @io_uring@ instance. 230 As described in Chapter~\ref{practice}, this does not translate into high CPU usage.}. 231 Note that once an operation completes, there is nothing that ties it to the @io_uring@ instance that handled it --- nothing prevents a new operation, with for example the same file descriptor, from using a different @io_uring@ instance. 221 232 222 233 A complicating aspect of submission is @io_uring@'s support for chains of operations, where the completion of an operation triggers the submission of the next operation on the link. 223 234 SQEs forming a chain must be allocated from the same instance and must be contiguous in the Submission Ring (see Figure~\ref{fig:iouring}). 
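As an illustration of such a chain, the sketch below links a write and an @fsync@ with @IOSQE_IO_LINK@, so the @fsync@ only starts once the write completes. For brevity it uses the liburing helper library and a recent kernel rather than manipulating the rings directly as described earlier; the file name is a placeholder and error handling is minimal.
\begin{lstlisting}[language=C]
#include <liburing.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    struct io_uring ring;
    if (io_uring_queue_init(8, &ring, 0) < 0) return 1;  // 8-entry submission/completion rings

    int fd = open("chain.log", O_WRONLY | O_CREAT | O_TRUNC, 0644);  // placeholder file
    if (fd < 0) { perror("open"); return 1; }
    static const char msg[] = "hello from a linked submission\n";

    // First SQE: the write.  IOSQE_IO_LINK chains it to the next SQE,
    // so the fsync below is only started after the write completes.
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_write(sqe, fd, msg, sizeof(msg) - 1, 0);
    sqe->flags |= IOSQE_IO_LINK;
    io_uring_sqe_set_data(sqe, (void *)"write");

    // Second SQE: the fsync, allocated from the same instance, contiguously.
    sqe = io_uring_get_sqe(&ring);
    io_uring_prep_fsync(sqe, fd, 0);
    io_uring_sqe_set_data(sqe, (void *)"fsync");

    io_uring_submit(&ring);  // a single system call covers both SQEs

    for (int i = 0; i < 2; i++) {  // reap both CQEs
        struct io_uring_cqe *cqe;
        if (io_uring_wait_cqe(&ring, &cqe) < 0) break;
        printf("%s completed with result %d\n", (const char *)io_uring_cqe_get_data(cqe), cqe->res);
        io_uring_cqe_seen(&ring, cqe);
    }

    close(fd);
    io_uring_queue_exit(&ring);
    return 0;
}
\end{lstlisting}
Both SQEs come from the same instance and are filled back-to-back before a single submission, matching the contiguity requirement above.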
224 235 The consequence of this feature is that filling SQEs can be arbitrarily complex, and therefore, users may need to run arbitrary code between allocation and submission. 225 Supporting chains is not a requirement of the\io subsystem, but it is still valuable.236 For this work, supporting chains is not a requirement of the \CFA \io subsystem, but it is still valuable. 226 237 Support for this feature can be fulfilled simply by supporting arbitrary user code between allocation and submission. 227 238 … … 237 248 To remove this requirement, a \at needs the ability to ``yield to a specific \gls{proc}'', \ie, \park with the guarantee it unparks on a specific \gls{proc}, \ie the \gls{proc} attached to the correct ring.} 238 249 From the subsystem's point of view, the allocation and submission are sequential, greatly simplifying both. 239 In this design, allocation and submission form a partitioned ring buffer as shown in Figure~\ref{fig:pring}.250 In this design, allocation and submission form a partitioned ring buffer, as shown in Figure~\ref{fig:pring}. 240 251 Once added to the ring buffer, the attached \gls{proc} has a significant amount of flexibility with regard to when to perform the system call. 241 252 Possible options are: when the \gls{proc} runs out of \ats to run, after running a given number of \ats, \etc. … … 244 255 \centering 245 256 \input{pivot_ring.pstex_t} 246 \caption[Partitioned ring buffer]{Partitioned ring buffer \smallskip\newline Allocated sqes are appended to the first partition.257 \caption[Partitioned ring buffer]{Partitioned ring buffer \smallskip\newline Allocated SQEs are appended to the first partition. 247 258 When submitting, the partition is advanced. 248 259 The kernel considers the partition as the head of the ring.} … … 253 264 However, this benefit means \ats submitting \io operations have less flexibility: they cannot \park or yield, and several exceptional cases are handled poorly. 254 265 Instances running out of SQEs cannot run \ats wanting to do \io operations. 255 In this case, the \io \at needs to be moved to a different \gls{proc}, and the only current way of achieving this is to @yield()@ hoping to be scheduled on a different \gls{proc} with free SQEs, which is not guaranteed .266 In this case, the \io \at needs to be moved to a different \gls{proc}, and the only current way of achieving this is to @yield()@ hoping to be scheduled on a different \gls{proc} with free SQEs, which is not guaranteed to ever occur. 256 267 257 268 A more involved version of this approach tries to solve these problems using a pattern called \newterm{helping}. 258 \ atsthat cannot submit \io operations, either because of an allocation failure or \glslink{atmig}{migration} to a different \gls{proc} between allocation and submission, create an \io object and add it to a list of pending submissions per \gls{proc} and a list of pending allocations, probably per cluster.269 \Glspl{at} that cannot submit \io operations, either because of an allocation failure or \glslink{atmig}{migration} to a different \gls{proc} between allocation and submission, create an \io object and add it to a list of pending submissions per \gls{proc} and a list of pending allocations, probably per cluster. 259 270 While there is still a strong coupling between \glspl{proc} and @io_uring@ instances, these data structures allow moving \ats to a specific \gls{proc}, when the current \gls{proc} cannot fulfill the \io request. 
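The partitioned ring of Figure~\ref{fig:pring} can be sketched in a few lines of C. This is not the \CFA implementation, only an illustration of the idea under the private-instance assumption that a single \gls{proc} owns the ring: allocation appends to the private partition, and submission merely advances the partition so that the kernel now owns those entries; the names (@partitioned_ring@, @ring_alloc@, @ring_submit@) are invented.
\begin{lstlisting}[language=C]
#include <stdio.h>

#define RING_SIZE 32u

struct partitioned_ring {
    unsigned slots[RING_SIZE];  // indexes into the instance's SQE array
    unsigned kernel_head;       // everything before this point was handed to the kernel
    unsigned alloc_tail;        // everything before this point was allocated locally
};

// Allocation: only the owning proc appends, so no synchronization is needed.
static int ring_alloc(struct partitioned_ring *r, unsigned sqe_idx) {
    if (r->alloc_tail - r->kernel_head == RING_SIZE) return -1;  // ring is full
    r->slots[r->alloc_tail % RING_SIZE] = sqe_idx;
    r->alloc_tail += 1;
    return 0;
}

// Submission: advance the partition and return how many new entries the
// kernel now owns, i.e., the count passed to the submission system call.
static unsigned ring_submit(struct partitioned_ring *r) {
    unsigned submitted = r->alloc_tail - r->kernel_head;
    r->kernel_head = r->alloc_tail;
    return submitted;
}

int main(void) {
    struct partitioned_ring r = {0};
    for (unsigned i = 0; i < 3; i++) ring_alloc(&r, i);  // allocate three SQEs
    printf("the kernel would be told about %u new entries\n", ring_submit(&r));
    return 0;
}
\end{lstlisting}
Because only the owning \gls{proc} ever touches @alloc_tail@, neither operation needs synchronization, which is where the performance benefit of the private-instance approach comes from.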
260 271 … … 263 274 In this case, the helping solution has the \io \at append an \io object to the submission list of the first \gls{proc}, where the allocation was made. 264 275 No other \gls{proc} can help the \at since @io_uring@ instances are strongly coupled to \glspl{proc}. 265 However, the \io \gls{proc} is unable to help because it is executing the spinning \at resulting in a deadlock. 276 However, the \io \gls{proc} is unable to help because it is executing the spinning \at. 277 This results in a deadlock. 266 278 While this example is artificial, in the presence of many \ats, this problem can arise ``in the wild''. 267 279 Furthermore, this pattern is difficult to reliably detect and avoid. … … 274 286 \subsubsection{Public Instances} 275 287 The public approach creates decoupled pools of @io_uring@ instances and processors, \ie without one-to-one coupling. 276 \ atsattempting an \io operation pick one of the available instances and submit the operation to that instance.288 \Glspl{at} attempting an \io operation pick one of the available instances and submit the operation to that instance. 277 289 Since there is no coupling between @io_uring@ instances and \glspl{proc} in this approach, \ats running on more than one \gls{proc} can attempt to submit to the same instance concurrently. 278 290 Because @io_uring@ effectively sets the amount of sharding needed to avoid contention on its internal locks, performance in this approach is based on two aspects: … … 282 294 \item 283 295 The scheme to route \io requests to specific @io_uring@ instances does not introduce contention. 284 This aspect has oversized importancebecause it comes into play before the sharding of instances, and as such, all \glspl{hthrd} can contend on the routing algorithm.296 This aspect is very important because it comes into play before the sharding of instances, and as such, all \glspl{hthrd} can contend on the routing algorithm. 285 297 \end{itemize} 286 298 287 299 Allocation in this scheme is fairly easy. 288 Free SQEs, \ie, SQEs that are not currently being used to represent a request, can be written-to safely and have a field called @user_data@ that the kernel only reads to copy to CQEs.300 Free SQEs, \ie, SQEs that are not currently being used to represent a request, can be written-to safely, and have a field called @user_data@ that the kernel only reads to copy to CQEs. 289 301 Allocation also does not require ordering guarantees as all free SQEs are interchangeable. 290 302 The only added complexity is that the number of SQEs is fixed, which means allocation can fail. … … 312 324 Since CQEs only own a signed 32-bit result, in addition to the copy of the @user_data@ field, all that is needed to communicate the result is a simple future~\cite{wiki:future}. 313 325 If the submission side does not designate submitters, polling can also submit all SQEs as it is polling events. 314 A simple approach to polling is to allocate a \at per @io_uring@ instance and simply let the poller \ats poll their respective instances when scheduled.315 316 With the pool of SQE instances approach, the big advantageis that it is fairly flexible.326 A simple approach to polling is to allocate a user-level \at per @io_uring@ instance and simply let the poller \ats poll their respective instances when scheduled. 327 328 The big advantage of the pool of SQE instances approach is that it is fairly flexible. 
317 329 It does not impose restrictions on what \ats submitting \io operations can and cannot do between allocations and submissions. 318 330 It also can gracefully handle running out of resources, SQEs or the kernel returning @EBUSY@. … … 320 332 The routing and allocation algorithm needs to keep track of which ring instances have available SQEs, block incoming requests if no instance is available, prevent barging if \ats are already queued up waiting for SQEs and handle SQEs being freed. 321 333 The submission side needs to safely append SQEs to the ring buffer, correctly handle chains, make sure no SQE is dropped or left pending forever, notify the allocation side when SQEs can be reused, and handle the kernel returning @EBUSY@. 322 Compared to the private-instance approach, all this synchronization has a significant cost and this synchronization is entirely overhead.334 All this synchronization has a significant cost, compared to the private-instance approach which does not have synchronization costs in most cases. 323 335 324 336 \subsubsection{Instance borrowing} 325 337 Both of the prior approaches have undesirable aspects that stem from tight or loose coupling between @io_uring@ and \glspl{proc}. 326 The first approach suffers from tight coupling causing problems when a \gls{proc} does not benefit from the coupling.327 The second approach suffers from loose couplings causing operations to have synchronization overhead, which tighter coupling avoids.338 The first approach suffers from tight coupling, causing problems when a \gls{proc} does not benefit from the coupling. 339 The second approach suffers from loose couplings, causing operations to have synchronization overhead, which tighter coupling avoids. 328 340 When \glspl{proc} are continuously issuing \io operations, tight coupling is valuable since it avoids synchronization costs. 329 341 However, in unlikely failure cases or when \glspl{proc} are not using their instances, tight coupling is no longer advantageous. … … 332 344 While instance borrowing looks similar to work sharing and stealing, I think it is different enough to warrant a different verb to avoid confusion.} 333 345 346 As mentioned later in this section, this approach is not ultimately used, but here is still an high-level outline of the algorithm. 334 347 In this approach, each cluster, see Figure~\ref{fig:system}, owns a pool of @io_uring@ instances managed by an \newterm{arbiter}. 335 When a \at attempts to issue an \io operation, it ask for an instance from the arbiterand issues requests to that instance.348 When a \at attempts to issue an \io operation, it asks for an instance from the arbiter, and issues requests to that instance. 336 349 This instance is now bound to the \gls{proc} the \at is running on. 337 350 This binding is kept until the arbiter decides to revoke it, taking back the instance and reverting the \gls{proc} to its initial \io state. … … 343 356 \item The current \gls{proc} does not hold an instance. 344 357 \item The current instance does not have sufficient SQEs to satisfy the request. 345 \item The current \gls{proc} has a wrong instance, this happens if the submitting \at context-switched between allocation and submission, called \newterm{external submissions}. 358 \item The current \gls{proc} has a wrong instance. 359 This happens if the submitting \at context-switched between allocation and submission: \newterm{external submissions}. 
346 360 \end{enumerate} 347 361 However, even when the arbiter is not directly needed, \glspl{proc} need to make sure that their instance ownership is not being revoked, which is accomplished by a lock-\emph{less} handshake.\footnote{ 348 Note the handshake is not lock-\emph{free} since it lacks the proper progress guarantee.}362 Note the handshake is not lock-\emph{free}~\cite{wiki:lockfree} since it lacks the proper progress guarantee.} 349 363 A \gls{proc} raises a local flag before using its borrowed instance and checks if the instance is marked as revoked or if the arbiter has raised its flag. 350 364 If not, it proceeds, otherwise it delegates the operation to the arbiter. … … 365 379 However, there is no need to immediately revoke the instance. 366 380 External submissions must simply be added to the ring before the next system call, \ie, when the submission ring is flushed. 367 This means whoever is responsible for the system call , first checks ifthe instance has any external submissions.381 This means whoever is responsible for the system call first checks whether the instance has any external submissions. 368 382 If so, it asks the arbiter to revoke the instance and add the external submissions to the ring. 369 383 … … 382 396 383 397 \section{Interface} 384 The last importantpart of the \io subsystem is its interface.398 The final part of the \io subsystem is its interface. 385 399 Multiple approaches can be offered to programmers, each with advantages and disadvantages. 386 The new \ io subsystem can replace the C runtime API or extend it, and in the latter case, the interface can go from very similar to vastly different.387 The following sections discuss some useful options using @read@ as an example.400 The new \CFA \io subsystem can replace the C runtime API or extend it, and in the latter case, the interface can go from very similar to vastly different. 401 The following sections discuss some useful options, using @read@ as an example. 388 402 The standard Linux interface for C is: 389 403 \begin{cfa} … … 392 406 393 407 \subsection{Replacement} 394 Replacing the C \ glsxtrshort{api}is the more intrusive and draconian approach.395 The goal is to convince the compiler and linker to replace any calls to @read@ to direct themto the \CFA implementation instead of glibc's.396 This rerouting has the advantage of working transparently and supporting existing binaries without ne eding recompilation.397 It also offers a , presumably,well known and familiar API that C programmers can simply continue to work with.398 However, this approach also entails a plethora of subtle technical challenges, which generally boilsdown to making a perfect replacement.399 If the \CFA interface replaces only \emph{some} of the calls to glibc, then this can easily lead to esoteric concurrency bugs.400 Since the gcc ecosystem does not offer a scheme for perfect replacement, this approach was rejected as being laudable but infeasible.408 Replacing the C \io subsystem is the more intrusive and draconian approach. 409 The goal is to convince the compiler and linker to replace any calls to @read@ by calls to the \CFA implementation instead of glibc's. 410 This rerouting has the advantage of working transparently and supporting existing binaries without necessarily needing recompilation. 411 It also offers a presumably well known and familiar API that C programmers can simply continue to work with. 
412 %However, this approach also entails a plethora of subtle technical challenges, which generally boil down to making a perfect replacement. 413 However, when using this approach, any and all calls to the C \io subsystem, since using a mix of the C and \CFA \io subsystems can easily lead to esoteric concurrency bugs. 414 This approach was rejected as being laudable but infeasible. 401 415 402 416 \subsection{Synchronous Extension} 403 417 Another interface option is to offer an interface different in name only. 418 In this approach, an alternative call is created for each supported system calls. 404 419 For example: 405 420 \begin{cfa} 406 421 ssize_t cfa_read(int fd, void *buf, size_t count); 407 422 \end{cfa} 423 The new @cfa_read@ would have the same interface behaviour and guarantee as the @read@ system call, but allow the runtime system to use user-level blocking instead of kernel-level blocking. 424 408 425 This approach is feasible and still familiar to C programmers. 409 It comes with the caveat that any code attempting to use it must be recompiled, which is a problem considering the amount of existing legacy C binaries.426 It comes with the caveat that any code attempting to use it must be modified, which is a problem considering the amount of existing legacy C binaries. 410 427 However, it has the advantage of implementation simplicity. 411 428 Finally, there is a certain irony to using a blocking synchronous interface for a feature often referred to as ``non-blocking'' \io. … … 416 433 future(ssize_t) read(int fd, void *buf, size_t count); 417 434 \end{cfa} 418 where the generic @future@ is fulfilled when the read completes and it contains the number of bytesread, which may be less than the number of bytes requested.435 where the generic @future@ is fulfilled when the read completes, with the count of bytes actually read, which may be less than the number of bytes requested. 419 436 The data read is placed in @buf@. 420 The problem is that both the bytes readand data form the synchronization object, not just the bytes read.421 Hence, the buffer cannot be reused until the operation completes but the synchronization does not cover the buffer.437 The problem is that both the bytes count and data form the synchronization object, not just the bytes read. 438 Hence, the buffer cannot be reused until the operation completes but the synchronization on the future does not enforce this. 422 439 A classical asynchronous API is: 423 440 \begin{cfa} … … 438 455 However, it is not the most user-friendly option. 439 456 It obviously imposes a strong dependency between user code and @io_uring@ but at the same time restricts users to usages that are compatible with how \CFA internally uses @io_uring@. 457 458 As of writting this document, \CFA offers both a synchronous extension and the first approach to the asynchronous extension: 459 \begin{cfa} 460 ssize_t cfa_read(int fd, void *buf, size_t count); 461 future(ssize_t) async_read(int fd, void *buf, size_t count); 462 \end{cfa} -
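To make the chosen interface concrete, the following plain-C sketch shows the behaviour a @cfa_read@-style call must provide on top of @io_uring@: allocate an SQE, fill it, submit it, and wait for the matching CQE, whose @res@ field carries the usual @read@ result. The sketch uses the liburing helper library and an illustrative wrapper name; the \CFA runtime manipulates the rings directly and parks the user-level \at on a future instead of blocking the \gls{kthrd} in @io_uring_wait_cqe@.
\begin{cfa}
// Illustration only: the semantics a synchronous cfa_read-style call provides,
// expressed with the liburing helper library (not used by the CFA runtime).
#include <liburing.h>
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

static ssize_t blocking_read_over_uring( struct io_uring * ring, int fd, void * buf, size_t count ) {
	struct io_uring_sqe * sqe = io_uring_get_sqe( ring );   // allocate an SQE; fails if the ring is full
	if ( ! sqe ) return -1;
	io_uring_prep_read( sqe, fd, buf, count, 0 );            // read `count` bytes from offset 0
	io_uring_submit( ring );                                 // system call: flush the submission ring

	struct io_uring_cqe * cqe;
	io_uring_wait_cqe( ring, &cqe );                         // block until the completion arrives
	ssize_t res = cqe->res;                                  // bytes read, or -errno
	io_uring_cqe_seen( ring, cqe );                          // hand the CQE back to the ring
	return res;
}

int main( void ) {
	struct io_uring ring;
	io_uring_queue_init( 8, &ring, 0 );
	char buf[128];
	int fd = open( "/etc/hostname", O_RDONLY );              // arbitrary example file
	ssize_t n = blocking_read_over_uring( &ring, fd, buf, sizeof(buf) );
	printf( "read %zd bytes\n", n );
	close( fd );
	io_uring_queue_exit( &ring );
}
\end{cfa}
In the \CFA runtime, the final wait is replaced by parking the \at on the future that a poller fulfils when it reaps the CQE, so only the user-level thread blocks.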
doc/theses/thierry_delisle_PhD/thesis/text/practice.tex
r7d9598d8 r2dcd80a 2 2 The scheduling algorithm described in Chapter~\ref{core} addresses scheduling in a stable state. 3 3 This chapter addresses problems that occur when the system state changes. 4 Indeed the \CFA runtime ,supports expanding and shrinking the number of \procs, both manually and, to some extent, automatically.4 Indeed the \CFA runtime supports expanding and shrinking the number of \procs, both manually and, to some extent, automatically. 5 5 These changes affect the scheduling algorithm, which must dynamically alter its behaviour. 6 6 7 In detail, \CFA supports adding \procs using the type @processor@, in both RAII and heap coding scenarios.7 Specifically, \CFA supports adding \procs using the type @processor@, in both RAII and heap coding scenarios. 8 8 \begin{cfa} 9 9 { … … 26 26 This requirement also means any references into these arrays, \eg pointers or indexes, may need to be updated if elements are moved for compaction or any other reason. 27 27 28 There are no performance requirements, within reason, for resizingsince it is expected to be rare.29 However, this operation has strict correctness requirements since updating and idle sleep can easily lead to deadlocks.30 It should also avoidas much as possible any effect on performance when the number of \procs remains constant.31 This la terrequirement prohibits naive solutions, like simply adding a global lock to the ready-queue arrays.28 There are no performance requirements, within reason, for act of resizing itself, since it is expected to be rare. 29 However, this operation has strict correctness requirements, since updating and idle sleep can easily lead to deadlocks. 30 The resizing mechanism should also avoid, as much as possible any effect on performance when the number of \procs remains constant. 31 This last requirement prohibits naive solutions, like simply adding a global lock to the ready-queue arrays. 32 32 33 33 \subsection{Read-Copy-Update} 34 34 One solution is to use the Read-Copy-Update pattern~\cite{wiki:rcu}. 35 This is a very common pattern that avoids large critical sections, which is why it is worth mentioning. 35 36 In this pattern, resizing is done by creating a copy of the internal data structures, \eg see Figure~\ref{fig:base-ts2}, updating the copy with the desired changes, and then attempting an Indiana Jones Switch to replace the original with the copy. 36 This approach has the advantage that it may not need any synchronization to do the switch .37 This approach has the advantage that it may not need any synchronization to do the switch, depending on how reclamation of the original copy is handled. 37 38 However, there is a race where \procs still use the original data structure after the copy is switched. 38 39 This race not only requires adding a memory-reclamation scheme, but it also requires that operations made on the stale original version are eventually moved to the copy. … … 63 64 Acquiring all the local read-locks guarantees mutual exclusion among the readers and the writer, while the wait on the read side prevents readers from continuously starving the writer. 64 65 Figure~\ref{f:SpecializedReadersWriterLock} shows the outline for this specialized readers-writer lock. 65 The lock i nnonblocking, so both readers and writers spin while the lock is held.66 The lock is nonblocking, so both readers and writers spin while the lock is held. 66 67 This very wide sharding strategy means that readers have very good locality since they only ever need to access two memory locations. 
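As a rough plain-C sketch of the specialized readers-writer lock described above (structure sizes, memory orderings, and the assumption of a single writer at a time are mine, not the runtime's): each reader raises only its own flag and then checks the single writer flag, backing off while a writer is waiting; the writer raises its flag and then waits for every local flag to clear.
\begin{cfa}
// Sketch only: per-processor read flags plus one writer flag, C11 atomics.
// Assumes at most one concurrent writer (e.g., resizing is already serialized).
#include <stdatomic.h>
#include <stdbool.h>
#include <sched.h>

#define NPROCS 64
struct local_lock { atomic_bool held; char pad[64 - sizeof(atomic_bool)]; };  // pad to avoid false sharing

static struct local_lock readers[NPROCS];
static atomic_bool writer_waiting;

static void read_lock( int id ) {                        // processor `id` touches only 2 memory locations
	for ( ;; ) {
		atomic_store( &readers[id].held, true );         // raise the local flag first (seq_cst)
		if ( ! atomic_load( &writer_waiting ) ) return;  // no writer waiting: read lock held
		atomic_store( &readers[id].held, false );        // back off so the writer is not starved
		while ( atomic_load( &writer_waiting ) ) sched_yield();
	}
}
static void read_unlock( int id ) {
	atomic_store_explicit( &readers[id].held, false, memory_order_release );
}

static void write_lock( void ) {
	atomic_store( &writer_waiting, true );               // announce the writer (seq_cst)
	for ( int i = 0; i < NPROCS; i += 1 )                // wait for each local flag to clear;
		while ( atomic_load( &readers[i].held ) )        // backed-off readers cannot retake them
			sched_yield();
}
static void write_unlock( void ) {
	atomic_store_explicit( &writer_waiting, false, memory_order_release );
}
\end{cfa}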
67 68 … … 98 99 \section{Idle-Sleep}\label{idlesleep} 99 100 While manual resizing of \procs is expected to be rare, the number of \ats can vary significantly over an application's lifetime, which means there are times when there are too few or too many \procs. 100 For this work, it is the programmer's responsibility to manually create \procs, so if there are too few \procs, the application must address this issue.101 For this work, it is the application programmer's responsibility to manually create \procs, so if there are too few \procs, the application must address this issue. 101 102 This leaves too many \procs when there are not enough \ats for all the \procs to be useful. 102 103 These idle \procs cannot be removed because their lifetime is controlled by the application, and only the application knows when the number of \ats may increase or decrease. … … 108 109 Because idle sleep is spurious, this data structure has strict performance requirements, in addition to strict correctness requirements. 109 110 Next, some mechanism is needed to block \glspl{kthrd}, \eg @pthread_cond_wait@ or a pthread semaphore. 110 The complexity here is to support \at \glslink{atblock}{parking} and \glslink{atsched}{unparking},user-level locking, timers, \io operations, and all other \CFA features with minimal complexity.111 The complexity here is to support user-level locking, timers, \io operations, and all other \CFA features with minimal complexity. 111 112 Finally, the scheduler needs a heuristic to determine when to block and unblock an appropriate number of \procs. 112 113 However, this third challenge is outside the scope of this thesis because developing a general heuristic is complex enough to justify its own work. … … 115 116 An interesting subpart of this heuristic is what to do with bursts of \ats that become ready. 116 117 Since waking up a sleeping \proc can have notable latency, multiple \ats may become ready while a single \proc is waking up. 117 This fact begs the question,if many \procs are available, how many should be woken?118 This fact raises the question: if many \procs are available, how many should be woken? 118 119 If the ready \ats will run longer than the wake-up latency, waking one \proc per \at will offer maximum parallelization. 119 120 If the ready \ats will run for a very short time, waking many \procs may be wasteful. 120 As mentioned, a heuristic to handle these complex cases is outside the scope of this thesis,the behaviour of the scheduler in this particular case is left unspecified.121 As mentioned, since a heuristic to handle these complex cases is outside the scope of this thesis, so the behaviour of the scheduler in this particular case is left unspecified. 121 122 122 123 \section{Sleeping} … … 156 157 The notifier first makes sure the newly ready \at is visible to \procs searching for \ats, and then attempts to notify an idle \proc. 157 158 On the other side, \procs make themselves visible as idle \procs and then search for any \ats they may have missed. 158 Unlike regular work-stealing, this search must be exhaustive to make sure that pre-existing \at is missed.159 Unlike regular work-stealing, this search must be exhaustive to make sure that no pre-existing \at is missed. 159 160 These steps from both sides guarantee that if the search misses a newly ready \at, then the notifier is guaranteed to see at least one idle \proc. 160 161 Conversely, if the notifier does not see any idle \proc, then a \proc is guaranteed to find the new \at in its exhaustive search. 
… … 188 189 \centering 189 190 \input{idle1.pstex_t} 190 \caption[Basic Idle Sleep Data Structure]{Basic Idle Sleep Data Structure \smallskip\newline Each idle \proc is put unto a doubly-linked stack protected by a lock.191 \caption[Basic Idle Sleep Data Structure]{Basic Idle Sleep Data Structure \smallskip\newline Each idle \proc is put onto a doubly-linked stack protected by a lock. 191 192 Each \proc has a private event \lstinline{fd}.} 192 193 \label{fig:idle1} -
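The sleep/wake mechanism in the figure can be sketched in plain C with an @eventfd@ per \proc: a \proc with no work blocks in a @read@ on its own @fd@, and the notifier wakes it with a @write@. The helper and thread names below are illustrative only.
\begin{cfa}
// Minimal sketch of per-processor sleep/wake with a private eventfd.
#include <sys/eventfd.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

static void proc_sleep( int efd ) {          // block while the eventfd counter is zero
	uint64_t val;
	read( efd, &val, sizeof(val) );
}
static void proc_wake( int efd ) {           // add to the counter, unblocking the sleeper
	uint64_t one = 1;
	write( efd, &one, sizeof(one) );
}

static void * notifier( void * arg ) {       // stands in for the thread that makes work ready
	usleep( 100000 );
	proc_wake( *(int *)arg );
	return NULL;
}

int main( void ) {
	int efd = eventfd( 0, 0 );
	pthread_t t;
	pthread_create( &t, NULL, notifier, &efd );
	proc_sleep( efd );                       // the "idle processor" parks here
	printf( "woken\n" );
	pthread_join( t, NULL );
	close( efd );
}
\end{cfa}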
doc/theses/thierry_delisle_PhD/thesis/text/runtime.tex
r7d9598d8 r2dcd80a 6 6 \Celeven introduced threading features, such as the @_Thread_local@ storage class, and libraries @stdatomic.h@ and @threads.h@. 7 7 Interestingly, almost a decade after the \Celeven standard, the most recent versions of gcc, clang, and msvc do not support the \Celeven include @threads.h@, indicating no interest in the C11 concurrency approach (possibly because of the recent effort to add concurrency to \CC). 8 While the \Celeven standard does not state a threading model, the historical association with pthreads suggests implementations would adopt kernel-level threading (1:1)~\cite{ThreadModel}, as for \CC.8 While the \Celeven standard does not state a threading model, the historical association with pthreads suggests implementations would adopt kernel-level threading (1:1)~\cite{ThreadModel}, as \CC does. 9 9 This model uses \glspl{kthrd} to achieve parallelism and concurrency. In this model, every thread of computation maps to an object in the kernel. 10 10 The kernel then has the responsibility of managing these threads, \eg creating, scheduling, blocking. … … 38 38 39 39 \section{\glsxtrshort{io}}\label{prev:io} 40 Prior to this work, the \CFA runtime did not addany particular support for \glsxtrshort{io} operations. While all \glsxtrshort{io} operations available in C are available in \CFA, \glsxtrshort{io} operations are designed for the POSIX threading model~\cite{pthreads}. Using these 1:1 threading operations in an M:N threading model means \glsxtrshort{io} operations block \glspl{proc} instead of \ats. While this can work in certain cases, it limits the number of concurrent operations to the number of \glspl{proc} rather than \ats. It also means deadlock can occur because all \glspl{proc} are blocked even if at least one \at is ready to run. A simple example of this type of deadlock would be as follows:40 Prior to this work, the \CFA runtime did not have any particular support for \glsxtrshort{io} operations. While all \glsxtrshort{io} operations available in C are available in \CFA, \glsxtrshort{io} operations are designed for the POSIX threading model~\cite{pthreads}. Using these 1:1 threading operations in an M:N threading model means \glsxtrshort{io} operations block \glspl{proc} instead of \ats. While this can work in certain cases, it limits the number of concurrent operations to the number of \glspl{proc} rather than \ats. It also means deadlock can occur because all \glspl{proc} are blocked even if at least one \at is ready to run. A simple example of this type of deadlock would be as follows: 41 41 42 42 \begin{quote} … … 49 49 \end{quote} 50 50 51 Therefore, one of the objectives of this work is to introduce \emph{User-Level \glsxtrshort{io}}, which like \glslink{uthrding}{User-Level \emph{Threading}}, blocks \ats rather than \glspl{proc} when doing \glsxtrshort{io} operations.51 Therefore, one of the objectives of this work is to introduce \emph{User-Level \glsxtrshort{io}}, which, like \glslink{uthrding}{User-Level \emph{Threading}}, blocks \ats rather than \glspl{proc} when doing \glsxtrshort{io} operations. 52 52 This feature entails multiplexing the \glsxtrshort{io} operations of many \ats onto fewer \glspl{proc}. 53 53 The multiplexing requires a single \gls{proc} to execute multiple \glsxtrshort{io} operations in parallel. … … 56 56 57 57 \section{Interoperating with C} 58 While \glsxtrshort{io} operations are the classical example of operations that block \glspl{kthrd}, the non-blocking challenge extends to all blocking system-calls. 
The POSIX standard states~\cite[\S~2.9.1]{POSIX17}: 58 While \glsxtrshort{io} operations are the classical example of operations that block \glspl{kthrd}, the non-blocking challenge extends to all blocking system-calls. 59 The POSIX standard states~\cite[\S~2.9.1]{POSIX17}: 59 60 \begin{quote} 60 61 All functions defined by this volume of POSIX.1-2017 shall be thread-safe, except that the following functions need not be thread-safe. ... (list of 70+ excluded functions) 61 62 \end{quote} 62 Only UNIX @man@ pages identify whether a library function is thread-safe, and hence, may block on a pthreads lock or system call; hence interoperability with UNIX library functions is a challenge for an M:N threading model.63 Only UNIX @man@ pages identify whether a library function is thread-safe, and hence, may block on a pthreads lock or system call; hence, interoperability with UNIX library functions is a challenge for an M:N threading model. 63 64 64 Languages like Go and Java, which have strict interoperability with C\cite{wiki:jni,go:cgo}, can control operations in C by ``sandboxing'' them, \eg a blocking function may be delegated to a \gls{kthrd}. Sandboxing may help towards guaranteeing that the kind of deadlock mentioned above does not occur. 65 Languages like Go and Java, which have strict interoperability with C\cite{wiki:jni,go:cgo}, can control operations in C by ``sandboxing'' them, \eg a blocking function may be delegated to a \gls{kthrd}. 66 Sandboxing may help towards guaranteeing that the kind of deadlock mentioned above does not occur. 65 67 66 As mentioned in Section~\ref{intro}, \CFA is binary compatible with C and, as such, must support all C library functions. Furthermore, interoperability can happen at the function-call level, inline code, or C and \CFA translation units linked together. This fine-grained interoperability between C and \CFA has two consequences: 68 As mentioned in Section~\ref{intro}, \CFA is binary compatible with C and, as such, must support all C library functions. 69 Furthermore, interoperability can happen at the function-call level, inline code, or C and \CFA translation units linked together. 70 This fine-grained interoperability between C and \CFA has two consequences: 67 71 \begin{enumerate} 68 72 \item Precisely identifying blocking C calls is difficult. … … 71 75 Because of these consequences, this work does not attempt to ``sandbox'' calls to C. 72 76 Therefore, it is possible calls to an unknown library function can block a \gls{kthrd} leading to deadlocks in \CFA's M:N threading model, which would not occur in a traditional 1:1 threading model. 77 Since the blocking call is not known to the runtime, it is not necessarily possible to distinguish whether or not a deadlock occurs. 73 78 Currently, all M:N thread systems interacting with UNIX without sandboxing suffer from this problem but manage to work very well in the majority of applications. 74 Therefore, a complete solution to this problem is outside the scope of this thesis.\footnote{\CFA does provide a pthreads emulation, so any library function using embedded pthreads locks is redirected to \CFA user-level locks. This capability further reduces the chances of blocking a \gls{kthrd}.} 79 Therefore, a complete solution to this problem is outside the scope of this thesis. 80 \footnote{\CFA does provide a pthreads emulation, so any library function using embedded pthreads locks is redirected to \CFA user-level locks. 
81 This capability further reduces the chances of blocking a \gls{kthrd}.} 82 Chapter~\ref{userio} discusses the form of C interoperability chosen and used for the evaluation in Chapter~\ref{macrobench}. -
libcfa/src/bits/random.hfa
r7d9598d8 r2dcd80a 10 10 // Created On : Fri Jan 14 07:18:11 2022 11 11 // Last Modified By : Peter A. Buhr 12 // Last Modified On : Fri Jan 14 07:18:58 202213 // Update Count : 1 12 // Last Modified On : Sun Dec 11 18:43:58 2022 13 // Update Count : 171 14 14 // 15 15 … … 18 18 #include <stdint.h> 19 19 20 // Pipelined to allow out-of-order overlap with reduced dependencies. Critically, the current random state is returned 21 // (copied), and then compute and store the next random value. 22 23 #if defined(__SIZEOF_INT128__) 24 //-------------------------------------------------- 20 #define GLUE2( x, y ) x##y 21 #define GLUE( x, y ) GLUE2( x, y ) 22 23 // Set default PRNG for architecture size. 24 #ifdef __x86_64__ // 64-bit architecture 25 // 64-bit generators 26 #define LEHMER64 27 //#define XORSHIFT_12_25_27 28 //#define XOSHIRO256PP 29 //#define KISS_64 30 31 // 32-bit generators 32 #define XORSHIFT_6_21_7 33 //#define XOSHIRO128PP 34 #else // 32-bit architecture 35 // 64-bit generators 36 #define XORSHIFT_13_7_17 37 38 // 32-bit generators 39 #define XORSHIFT_6_21_7 40 #endif // __x86_64__ 41 42 // Define C/CFA PRNG name and random-state. 43 44 // SKULLDUGGERY: typedefs name struct and typedef with the same name to deal with CFA typedef numbering problem. 45 46 #ifdef XOSHIRO256PP 47 #define PRNG_NAME_64 xoshiro256pp 48 #define PRNG_STATE_64_T GLUE(PRNG_NAME_64,_t) 49 typedef struct PRNG_STATE_64_T { uint64_t s[4]; } PRNG_STATE_64_T; 50 #endif // XOSHIRO256PP 51 52 #ifdef XOSHIRO128PP 53 #define PRNG_NAME_32 xoshiro128pp 54 #define PRNG_STATE_32_T GLUE(PRNG_NAME_32,_t) 55 typedef struct PRNG_STATE_32_T { uint32_t s[4]; } PRNG_STATE_32_T; 56 #endif // XOSHIRO128PP 57 58 #ifdef LEHMER64 59 #define PRNG_NAME_64 lehmer64 60 #define PRNG_STATE_64_T __uint128_t 61 #endif // LEHMER64 62 63 #ifdef WYHASH64 64 #define PRNG_NAME_64 wyhash64 65 #define PRNG_STATE_64_T uint64_t 66 #endif // LEHMER64 67 68 #ifdef XORSHIFT_13_7_17 69 #define PRNG_NAME_64 xorshift_13_7_17 70 #define PRNG_STATE_64_T uint64_t 71 #endif // XORSHIFT_13_7_17 72 73 #ifdef XORSHIFT_6_21_7 74 #define PRNG_NAME_32 xorshift_6_21_7 75 #define PRNG_STATE_32_T uint32_t 76 #endif // XORSHIFT_6_21_7 77 78 #ifdef XORSHIFT_12_25_27 79 #define PRNG_NAME_64 xorshift_12_25_27 80 #define PRNG_STATE_64_T uint64_t 81 #endif // XORSHIFT_12_25_27 82 83 #ifdef KISS_64 84 #define PRNG_NAME_64 kiss_64 85 #define PRNG_STATE_64_T GLUE(PRNG_NAME_64,_t) 86 typedef struct PRNG_STATE_64_T { uint64_t z, w, jsr, jcong; } PRNG_STATE_64_T; 87 #endif // KISS_^64 88 89 #ifdef XORWOW 90 #define PRNG_NAME_32 xorwow 91 #define PRNG_STATE_32_T GLUE(PRNG_NAME_32,_t) 92 typedef struct PRNG_STATE_32_T { uint32_t a, b, c, d, counter; } PRNG_STATE_32_T; 93 #endif // XOSHIRO128PP 94 95 #define PRNG_SET_SEED_64 GLUE(PRNG_NAME_64,_set_seed) 96 #define PRNG_SET_SEED_32 GLUE(PRNG_NAME_32,_set_seed) 97 98 99 // Default PRNG used by runtime. 100 #ifdef __x86_64__ // 64-bit architecture 101 #define PRNG_NAME PRNG_NAME_64 102 #define PRNG_STATE_T PRNG_STATE_64_T 103 #else // 32-bit architecture 104 #define PRNG_NAME PRNG_NAME_32 105 #define PRNG_STATE_T PRNG_STATE_32_T 106 #endif // __x86_64__ 107 108 #define PRNG_SET_SEED GLUE(PRNG_NAME,_set_seed) 109 110 111 // ALL PRNG ALGORITHMS ARE OPTIMIZED SO THAT THE PRNG LOGIC CAN HAPPEN IN PARALLEL WITH THE USE OF THE RESULT. 112 // Therefore, the set_seed routine primes the PRNG by calling it with the state so the seed is not return as the 113 // first random value. 
114 115 #ifdef __cforall // don't include in C code (invoke.h) 116 117 // https://prng.di.unimi.it/xoshiro256starstar.c 118 // 119 // This is xoshiro256++ 1.0, one of our all-purpose, rock-solid generators. It has excellent (sub-ns) speed, a state 120 // (256 bits) that is large enough for any parallel application, and it passes all tests we are aware of. 121 // 122 // For generating just floating-point numbers, xoshiro256+ is even faster. 123 // 124 // The state must be seeded so that it is not everywhere zero. If you have a 64-bit seed, we suggest to seed a 125 // splitmix64 generator and use its output to fill s. 126 127 #ifndef XOSHIRO256PP 128 typedef struct xoshiro256pp_t { uint64_t s[4]; } xoshiro256pp_t; 129 #endif // ! XOSHIRO256PP 130 131 static inline uint64_t xoshiro256pp( xoshiro256pp_t & rs ) with(rs) { 132 inline uint64_t rotl(const uint64_t x, int k) { 133 return (x << k) | (x >> (64 - k)); 134 } // rotl 135 136 const uint64_t result = rotl( s[0] + s[3], 23 ) + s[0]; 137 const uint64_t t = s[1] << 17; 138 139 s[2] ^= s[0]; 140 s[3] ^= s[1]; 141 s[1] ^= s[2]; 142 s[0] ^= s[3]; 143 s[2] ^= t; 144 s[3] = rotl( s[3], 45 ); 145 return result; 146 } // xoshiro256pp 147 148 static inline void xoshiro256pp_set_seed( xoshiro256pp_t & state, uint64_t seed ) { 149 state = (xoshiro256pp_t){ {seed, seed, seed, seed} }; 150 xoshiro256pp( state ); 151 } // xoshiro256pp_set_seed 152 153 // https://prng.di.unimi.it/xoshiro128plusplus.c 154 // 155 // This is xoshiro128++ 1.0, one of our 32-bit all-purpose, rock-solid generators. It has excellent speed, a state size 156 // (128 bits) that is large enough for mild parallelism, and it passes all tests we are aware of. 157 // 158 // For generating just single-precision (i.e., 32-bit) floating-point numbers, xoshiro128+ is even faster. 159 // 160 // The state must be seeded so that it is not everywhere zero. 161 162 #ifndef XOSHIRO128PP 163 typedef struct xoshiro128pp_t { uint32_t s[4]; } xoshiro128pp_t; 164 #endif // ! XOSHIRO128PP 165 166 static inline uint32_t xoshiro128pp( xoshiro128pp_t & rs ) with(rs) { 167 inline uint32_t rotl( const uint32_t x, int k ) { 168 return (x << k) | (x >> (32 - k)); 169 } // rotl 170 171 const uint32_t result = rotl( s[0] + s[3], 7 ) + s[0]; 172 const uint32_t t = s[1] << 9; 173 174 s[2] ^= s[0]; 175 s[3] ^= s[1]; 176 s[1] ^= s[2]; 177 s[0] ^= s[3]; 178 s[2] ^= t; 179 s[3] = rotl( s[3], 11 ); 180 return result; 181 } // xoshiro128pp 182 183 static inline void xoshiro128pp_set_seed( xoshiro128pp_t & state, uint32_t seed ) { 184 state = (xoshiro128pp_t){ {seed, seed, seed, seed} }; 185 xoshiro128pp( state ); // prime 186 } // xoshiro128pp_set_seed 187 188 #ifdef __SIZEOF_INT128__ 189 // Pipelined to allow out-of-order overlap with reduced dependencies. Critically, the current random state is 190 // returned (copied), and then compute and store the next random value. 
191 //-------------------------------------------------- 25 192 static inline uint64_t lehmer64( __uint128_t & state ) { 26 193 __uint128_t ret = state; 27 194 state *= 0xda942042e4dd58b5; 28 195 return ret >> 64; 29 } 30 31 //-------------------------------------------------- 196 } // lehmer64 197 198 static inline void lehmer64_set_seed( __uint128_t & state, uint64_t seed ) { 199 state = seed; 200 lehmer64( state ); 201 } // lehmer64_set_seed 202 203 //-------------------------------------------------- 32 204 static inline uint64_t wyhash64( uint64_t & state ) { 33 state += 0x60bee2bee120fc15; 205 uint64_t ret = state; 206 state += 0x_60be_e2be_e120_fc15; 34 207 __uint128_t tmp; 35 tmp = (__uint128_t) state * 0xa3b195354a39b70d;208 tmp = (__uint128_t) ret * 0x_a3b1_9535_4a39_b70d; 36 209 uint64_t m1 = (tmp >> 64) ^ tmp; 37 tmp = (__uint128_t)m1 * 0x 1b03738712fad5c9;210 tmp = (__uint128_t)m1 * 0x_1b03_7387_12fa_d5c9; 38 211 uint64_t m2 = (tmp >> 64) ^ tmp; 39 212 return m2; 40 } 41 #endif 213 } // wyhash64 214 215 static inline void wyhash64_set_seed( uint64_t & state, uint64_t seed ) { 216 state = seed; 217 wyhash64( state ); // prime 218 } // wyhash64_set_seed 219 #endif // __SIZEOF_INT128__ 42 220 43 221 //-------------------------------------------------- … … 48 226 state ^= state << 17; 49 227 return ret; 50 } 51 52 //-------------------------------------------------- 228 } // xorshift_13_7_17 229 230 static inline void xorshift_13_7_17_set_seed( uint64_t & state, uint64_t seed ) { 231 state = seed; 232 xorshift_13_7_17( state ); // prime 233 } // xorshift_13_7_17_set_seed 234 235 //-------------------------------------------------- 236 // Marsaglia shift-XOR PRNG with thread-local state 237 // Period is 4G-1 238 // 0 is absorbing and must be avoided 239 // Low-order bits are not particularly random 53 240 static inline uint32_t xorshift_6_21_7( uint32_t & state ) { 54 241 uint32_t ret = state; … … 59 246 } // xorshift_6_21_7 60 247 61 //-------------------------------------------------- 62 typedef struct { 63 uint32_t a, b, c, d; 64 uint32_t counter; 65 } xorwow__state_t; 66 67 // The state array must be initialized to not be all zero in the first four words. 68 static inline uint32_t xorwow( xorwow__state_t & state ) { 248 static inline void xorshift_6_21_7_set_seed( uint32_t & state, uint32_t seed ) { 249 state = seed; 250 xorshift_6_21_7( state ); // prime 251 } // xorshift_6_21_7_set_seed 252 253 //-------------------------------------------------- 254 // The state must be seeded with a nonzero value. 255 static inline uint64_t xorshift_12_25_27( uint64_t & state ) { 256 uint64_t ret = state; 257 state ^= state >> 12; 258 state ^= state << 25; 259 state ^= state >> 27; 260 return ret * 0x_2545_F491_4F6C_DD1D; 261 } // xorshift_12_25_27 262 263 static inline void xorshift_12_25_27_set_seed( uint64_t & state, uint64_t seed ) { 264 state = seed; 265 xorshift_12_25_27( state ); // prime 266 } // xorshift_12_25_27_set_seed 267 268 //-------------------------------------------------- 269 // The state must be seeded with a nonzero value. 270 #ifndef KISS_64 271 typedef struct kiss_64_t { uint64_t z, w, jsr, jcong; } kiss_64_t; 272 #endif // ! 
KISS_64 273 274 static inline uint64_t kiss_64( kiss_64_t & state ) with(state) { 275 kiss_64_t ret = state; 276 z = 36969 * (z & 65535) + (z >> 16); 277 w = 18000 * (w & 65535) + (w >> 16); 278 jsr ^= (jsr << 17); 279 jsr ^= (jsr << 13); 280 jsr ^= (jsr << 5); 281 jcong = 69069 * jcong + 1234567; 282 return (((ret.z << 16) + ret.w) ^ ret.jcong) + ret.jsr; 283 } // kiss_64 284 285 static inline void kiss_64_set_seed( kiss_64_t & state, uint64_t seed ) with(state) { 286 z = 1; w = 1; jsr = 4; jcong = seed; 287 kiss_64( state ); // prime 288 } // kiss_64_set_seed 289 290 //-------------------------------------------------- 291 // The state array must be initialized to non-zero in the first four words. 292 #ifndef XORWOW 293 typedef struct xorwow_t { uint32_t a, b, c, d, counter; } xorwow_t; 294 #endif // ! XORWOW 295 296 static inline uint32_t xorwow( xorwow_t & state ) with(state) { 69 297 // Algorithm "xorwow" from p. 5 of Marsaglia, "Xorshift RNGs". 70 uint32_t ret = state.a + state.counter;71 uint32_t t = state.d;72 73 uint32_t const s = state.a;74 state.d = state.c;75 state.c = state.b;76 state.b = s;298 uint32_t ret = a + counter; 299 uint32_t t = d; 300 301 uint32_t const s = a; 302 d = c; 303 c = b; 304 b = s; 77 305 78 306 t ^= t >> 2; 79 307 t ^= t << 1; 80 308 t ^= s ^ (s << 4); 81 state.a = t; 82 83 state.counter += 362437; 309 a = t; 310 counter += 362437; 84 311 return ret; 85 } 86 87 //-------------------------------------------------- 88 static inline uint32_t LCG( uint32_t & state ) { // linear congruential generator 89 uint32_t ret = state; 90 state = 36969 * (state & 65535) + (state >> 16); // 36969 is NOT prime! No not change it! 91 return ret; 92 } // LCG 93 94 //-------------------------------------------------- 312 } // xorwow 313 314 static inline void xorwow_set_seed( xorwow_t & state, uint32_t seed ) { 315 state = (xorwow_t){ seed, seed, seed, seed, 0 }; 316 xorwow( state ); // prime 317 } // xorwow_set_seed 318 319 //-------------------------------------------------- 320 // Used in __tls_rand_fwd 95 321 #define M (1_l64u << 48_l64u) 96 322 #define A (25214903917_l64u) … … 103 329 state = (A * state + C) & (M - 1); 104 330 return state >> D; 105 } 331 } // LCGBI_fwd 106 332 107 333 static inline uint32_t LCGBI_bck( uint64_t & state ) { … … 109 335 state = AI * (state - C) & (M - 1); 110 336 return r; 111 } 337 } // LCGBI_bck 112 338 113 339 #undef M … … 116 342 #undef C 117 343 #undef D 344 345 #endif // __cforall -
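For reference, here is a self-contained plain-C rendering of one of the generators above, @xorshift_13_7_17@, with the shift constants from its name. It shows the two conventions the header relies on: the current state is returned before the next state is computed, and seeding primes the generator once so the raw seed is never the first value returned.
\begin{cfa}
#include <stdint.h>
#include <stdio.h>

// Return the current state, then advance it, so use of the result can
// overlap with the update (the "pipelining" noted in the header).
static inline uint64_t xorshift_13_7_17( uint64_t * state ) {
	uint64_t ret = *state;
	*state ^= *state << 13;
	*state ^= *state >> 7;
	*state ^= *state << 17;
	return ret;
}

// Seeding primes the generator so the seed itself is not the first value.
static inline void xorshift_13_7_17_set_seed( uint64_t * state, uint64_t seed ) {
	*state = seed ? seed : 1;                // state must be nonzero
	xorshift_13_7_17( state );
}

int main( void ) {
	uint64_t state;
	xorshift_13_7_17_set_seed( &state, 1009 );
	for ( int i = 0; i < 5; i += 1 )
		printf( "%llu\n", (unsigned long long)xorshift_13_7_17( &state ) );
}
\end{cfa}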
libcfa/src/concurrency/invoke.h
r7d9598d8 r2dcd80a 10 10 // Created On : Tue Jan 17 12:27:26 2016 11 11 // Last Modified By : Peter A. Buhr 12 // Last Modified On : Sun Jan 9 19:06:45202213 // Update Count : 4812 // Last Modified On : Tue Nov 29 20:42:21 2022 13 // Update Count : 56 14 14 // 15 15 … … 17 17 #include "bits/defs.hfa" 18 18 #include "bits/locks.hfa" 19 #include "bits/random.hfa" 19 20 #include "kernel/fwd.hfa" 20 21 … … 222 223 struct processor * last_proc; 223 224 224 uint32_t random_state;// fast random numbers225 PRNG_STATE_T random_state; // fast random numbers 225 226 226 227 #if defined( __CFA_WITH_VERIFY__ ) -
libcfa/src/concurrency/kernel.cfa
r7d9598d8 r2dcd80a 10 10 // Created On : Tue Jan 17 12:27:26 2017 11 11 // Last Modified By : Peter A. Buhr 12 // Last Modified On : Mon Aug 31 07:08:20 202013 // Update Count : 7 112 // Last Modified On : Wed Nov 30 18:14:08 2022 13 // Update Count : 76 14 14 // 15 15 … … 151 151 // Because of a bug, we couldn't initialized the seed on construction 152 152 // Do it here 153 __cfaabi_tls.rand_seed ^= rdtscl();153 PRNG_SET_SEED( __cfaabi_tls.random_state, rdtscl() ); 154 154 __cfaabi_tls.ready_rng.fwd_seed = 25214903917_l64u * (rdtscl() ^ (uintptr_t)&runner); 155 155 __tls_rand_advance_bck(); -
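The seeding call above draws its seed from the time-stamp counter. A minimal plain-C sketch of the same idea, assuming x86-64 with GCC or Clang (@__rdtsc()@ from @x86intrin.h@; the variable and function names are mine):
\begin{cfa}
#include <stdint.h>
#include <x86intrin.h>                                   // __rdtsc()

static __thread uint64_t tls_rng_state;                  // per-kernel-thread PRNG state

static void tls_rng_seed( void ) {
	uint64_t seed = __rdtsc();                           // cycle counter: cheap and thread-distinct in practice
	if ( seed == 0 ) seed = 0x9e3779b97f4a7c15ULL;       // xorshift-style state must be nonzero
	tls_rng_state = seed;
}
\end{cfa}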
libcfa/src/concurrency/kernel/fwd.hfa
r7d9598d8 r2dcd80a 46 46 } preemption_state; 47 47 48 #if defined(__SIZEOF_INT128__) 49 __uint128_t rand_seed; 50 #else 51 uint64_t rand_seed; 52 #endif 48 PRNG_STATE_T random_state; 49 53 50 struct { 54 51 uint64_t fwd_seed; … … 57 54 58 55 struct __stats_t * volatile this_stats; 59 60 56 61 57 #ifdef __CFA_WITH_VERIFY__ … … 76 72 #define publicTLS_get( member ) ((typeof(__cfaabi_tls.member))__cfatls_get( __builtin_offsetof(KernelThreadData, member) )) 77 73 78 static inline uint64_t __tls_rand() { 79 return 80 #if defined(__SIZEOF_INT128__) 81 lehmer64( kernelTLS().rand_seed ); 82 #else 83 xorshift_13_7_17( kernelTLS().rand_seed ); 84 #endif 74 static inline 75 #ifdef __x86_64__ // 64-bit architecture 76 uint64_t 77 #else // 32-bit architecture 78 uint32_t 79 #endif // __x86_64__ 80 __tls_rand() { 81 return PRNG_NAME( kernelTLS().random_state ); 85 82 } 86 83 … … 120 117 121 118 // Yield: yield N times 122 static inline void yield( unsignedtimes ) {119 static inline void yield( size_t times ) { 123 120 for( times ) { 124 121 yield(); -
libcfa/src/concurrency/kernel/startup.cfa
r7d9598d8 r2dcd80a 39 39 #include "limits.hfa" 40 40 #include "math.hfa" 41 #include "bits/random.hfa" // prng 41 42 42 43 #define CFA_PROCESSOR_USE_MMAP 0 … … 107 108 extern void __wake_proc(processor *); 108 109 extern int cfa_main_returned; // from interpose.cfa 109 uint32_t __global_random_prime = 4_294_967_291u, __global_random_mask = false; 110 size_t __global_random_prime = 4_294_967_291u; 111 bool __global_random_mask = false; 110 112 111 113 //----------------------------------------------------------------------------- … … 133 135 // Global state 134 136 __thread struct KernelThreadData __cfaabi_tls __attribute__ ((tls_model ( "initial-exec" ))) @= { 135 NULL,// cannot use 0p136 NULL,137 false,138 { 1, false,false },139 0,140 { 0,0 },141 NULL,137 .this_thread : NULL, // cannot use 0p 138 .this_processor : NULL, 139 .sched_lock : false, 140 .preemption_state : { .disable_count : 1, .enabled : false, .in_progress : false }, 141 // random_state uninitialized 142 .ready_rng : { .fwd_seed : 0, .bck_seed : 0 }, 143 .this_stats : NULL, 142 144 #ifdef __CFA_WITH_VERIFY__ 143 false,144 0,145 .in_sched_lock : false, 146 .sched_id : 0, 145 147 #endif 146 148 }; … … 513 515 rdy_link.next = 0p; 514 516 rdy_link.ts = MAX; 517 user_link.next = 0p; 518 user_link.prev = 0p; 519 cltr_link.next = 0p; 520 cltr_link.prev = 0p; 515 521 preferred = ready_queue_new_preferred(); 516 522 last_proc = 0p; 517 random_state = __global_random_mask ? __global_random_prime : __global_random_prime ^ rdtscl();523 PRNG_SET_SEED( random_state, __global_random_mask ? __global_random_prime : __global_random_prime ^ rdtscl() ); 518 524 #if defined( __CFA_WITH_VERIFY__ ) 519 525 executing = 0p; -
libcfa/src/concurrency/stats.cfa
r7d9598d8 r2dcd80a 48 48 stats->io.submit.eagr = 0; 49 49 stats->io.submit.nblk = 0; 50 stats->io.submit.extr = 0; 50 51 stats->io.flush.external = 0; 52 stats->io.flush.signal = 0; 51 53 stats->io.flush.dirty = 0; 52 54 stats->io.flush.full = 0; … … 120 122 tally_one( &cltr->io.submit.eagr , &proc->io.submit.eagr ); 121 123 tally_one( &cltr->io.submit.nblk , &proc->io.submit.nblk ); 124 tally_one( &cltr->io.submit.extr , &proc->io.submit.extr ); 122 125 tally_one( &cltr->io.flush.external , &proc->io.flush.external ); 126 tally_one( &cltr->io.flush.signal , &proc->io.flush.signal ); 123 127 tally_one( &cltr->io.flush.dirty , &proc->io.flush.dirty ); 124 128 tally_one( &cltr->io.flush.full , &proc->io.flush.full ); … … 199 203 if(io.alloc.slow) { 200 204 double avgfasts = (100.0 * (double)io.submit.fast) / total_submits; 201 sstr | "fast," | eng3(io.submit.slow) | "slow (" | ws(3, 3, avgfasts) | "%) " | nonl;205 sstr | "fast," | eng3(io.submit.slow) | "slow (" | ws(3, 3, avgfasts) | "%)," | eng3(io.submit.extr) | "external" | nonl; 202 206 } 203 207 sstr | " - eager" | eng3(io.submit.eagr) | nonl; … … 217 221 | " - cmp " | eng3(io.calls.locked) | "locked, " | eng3(io.calls.helped) | "helped" 218 222 | " - " | eng3(io.calls.errors.busy) | " EBUSY"; 219 sstr | " - sub: " | eng3(io.flush.full) | "full, " | eng3(io.flush.dirty) | "drty, " | eng3(io.flush.idle) | "idle, " | eng3(io.flush.eager) | "eagr, " | eng3(io.flush.external) | "ext";223 sstr | " - sub: " | eng3(io.flush.full) | "full, " | eng3(io.flush.dirty) | "drty, " | eng3(io.flush.idle) | "idle, " | eng3(io.flush.eager) | "eagr, " | eng3(io.flush.external) | '/' | eng3(io.flush.signal) | "ext"; 220 224 sstr | "- ops blk: " 221 225 | " sk rd: " | eng3(io.ops.sockread) | "epll: " | eng3(io.ops.epllread) -
libcfa/src/concurrency/stats.hfa
r7d9598d8 r2dcd80a 94 94 volatile uint64_t eagr; 95 95 volatile uint64_t nblk; 96 volatile uint64_t extr; 96 97 } submit; 97 98 struct { 98 99 volatile uint64_t external; 100 volatile uint64_t signal; 99 101 volatile uint64_t dirty; 100 102 volatile uint64_t full; -
libcfa/src/concurrency/thread.cfa
r7d9598d8 r2dcd80a 10 10 // Created On : Tue Jan 17 12:27:26 2017 11 11 // Last Modified By : Peter A. Buhr 12 // Last Modified On : S at Feb 12 15:24:18202213 // Update Count : 6612 // Last Modified On : Sun Dec 11 20:56:54 2022 13 // Update Count : 102 14 14 // 15 15 … … 26 26 #include "invoke.h" 27 27 28 extern uint32_t __global_random_seed, __global_random_prime, __global_random_mask; 28 extern size_t __global_random_seed; 29 extern size_t __global_random_prime; 30 extern bool __global_random_mask; 29 31 30 32 #pragma GCC visibility push(default) … … 46 48 rdy_link.next = 0p; 47 49 rdy_link.ts = MAX; 50 user_link.next = 0p; 51 user_link.prev = 0p; 52 cltr_link.next = 0p; 53 cltr_link.prev = 0p; 48 54 preferred = ready_queue_new_preferred(); 49 55 last_proc = 0p; 50 random_state = __global_random_mask ? __global_random_prime : __global_random_prime ^ rdtscl();56 PRNG_SET_SEED( random_state, __global_random_mask ? __global_random_prime : __global_random_prime ^ rdtscl() ); 51 57 #if defined( __CFA_WITH_VERIFY__ ) 52 58 executing = 0p; … … 176 182 //----------------------------------------------------------------------------- 177 183 bool migrate( thread$ * thrd, struct cluster & cl ) { 178 179 184 monitor$ * tmon = get_monitor(thrd); 180 185 monitor$ * __monitors[] = { tmon }; 181 186 monitor_guard_t __guard = { __monitors, 1 }; 182 183 184 187 { 185 188 // if nothing needs to be done, return false … … 221 224 222 225 //----------------------------------------------------------------------------- 223 #define GENERATOR LCG 224 225 void set_seed( uint32_t seed ) { 226 uint32_t & state = active_thread()->random_state; 227 state = __global_random_seed = seed; 228 GENERATOR( state ); 229 __global_random_prime = state; 226 227 void set_seed( size_t seed ) { 228 PRNG_STATE_T & state = active_thread()->random_state; 229 PRNG_SET_SEED( state, seed ); 230 __global_random_seed = seed; 231 __global_random_prime = seed; 230 232 __global_random_mask = true; 231 233 } // set_seed 232 234 233 uint32_t prng( void ) { // [0,UINT_MAX] 234 uint32_t & state = active_thread()->random_state; 235 return GENERATOR( state ); 235 size_t prng( void ) { // [0,UINT_MAX] 236 return PRNG_NAME( active_thread()->random_state ); 236 237 } // prng 237 238 -
libcfa/src/concurrency/thread.hfa
r7d9598d8 r2dcd80a 10 10 // Created On : Tue Jan 17 12:27:26 2017 11 11 // Last Modified By : Peter A. Buhr 12 // Last Modified On : Fri Feb 11 16:34:07202213 // Update Count : 2012 // Last Modified On : Tue Nov 22 22:18:34 2022 13 // Update Count : 35 14 14 // 15 15 … … 23 23 #include "monitor.hfa" 24 24 #include "exception.hfa" 25 #include "bits/random.hfa" 25 26 26 27 //----------------------------------------------------------------------------- … … 141 142 //---------- 142 143 // prng 144 void set_seed( size_t seed ); 143 145 static inline { 144 uint32_t prng( thread$ & th ) __attribute__(( warn_unused_result )) { return LCG( th.random_state ); } // [0,UINT_MAX]145 uint32_t prng( thread$ & th, uint32_t u ) __attribute__(( warn_unused_result )) { return prng( th ) % u; } // [0,u)146 uint32_t prng( thread$ & th, uint32_t l, uint32_t u ) __attribute__(( warn_unused_result )) { return prng( th, u - l + 1 ) + l; } // [l,u]146 size_t prng( thread$ & th ) __attribute__(( warn_unused_result )) { return PRNG_NAME( th.random_state ); } // [0,UINT_MAX] 147 size_t prng( thread$ & th, size_t u ) __attribute__(( warn_unused_result )) { return prng( th ) % u; } // [0,u) 148 size_t prng( thread$ & th, size_t l, size_t u ) __attribute__(( warn_unused_result )) { return prng( th, u - l + 1 ) + l; } // [l,u] 147 149 forall( T & | is_thread(T) ) { 148 uint32_t prng( T & th ) __attribute__(( warn_unused_result )) { return prng( (thread &)th ); } // [0,UINT_MAX]149 uint32_t prng( T & th, uint32_t u ) __attribute__(( warn_unused_result )) { return prng( th ) % u; } // [0,u)150 uint32_t prng( T & th, uint32_t l, uint32_t u ) __attribute__(( warn_unused_result )) { return prng( th, u - l + 1 ) + l; } // [l,u]150 size_t prng( T & th ) __attribute__(( warn_unused_result )) { return prng( (thread &)th ); } // [0,UINT_MAX] 151 size_t prng( T & th, size_t u ) __attribute__(( warn_unused_result )) { return prng( th ) % u; } // [0,u) 152 size_t prng( T & th, size_t l, size_t u ) __attribute__(( warn_unused_result )) { return prng( th, u - l + 1 ) + l; } // [l,u] 151 153 } // distribution 152 154 } // distribution -
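The two range overloads above reduce the raw generator value by modulo. Their plain-C equivalents are shown below (function names are illustrative; plain modulo is slightly biased whenever the range does not evenly divide $2^{64}$):
\begin{cfa}
#include <stdint.h>

// [0,u): plain modulo, as in the overloads above.
static inline uint64_t prng_upto( uint64_t raw, uint64_t u ) { return raw % u; }

// [l,u]: shift a [0, u-l+1) draw up by l.
static inline uint64_t prng_between( uint64_t raw, uint64_t l, uint64_t u ) {
	return l + raw % (u - l + 1);
}
\end{cfa}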
libcfa/src/exception.h
r7d9598d8 r2dcd80a 143 143 } 144 144 145 forall(exceptT &, virtualT & | is_ exception(exceptT, virtualT))145 forall(exceptT &, virtualT & | is_termination_exception(exceptT, virtualT)) 146 146 static inline void defaultResumptionHandler(exceptT & except) { 147 147 throw except; -
libcfa/src/startup.cfa
r7d9598d8 r2dcd80a 10 10 // Created On : Tue Jul 24 16:21:57 2018 11 11 // Last Modified By : Peter A. Buhr 12 // Last Modified On : Thu Oct 6 13:51:57202213 // Update Count : 5712 // Last Modified On : Mon Dec 5 11:41:58 2022 13 // Update Count : 73 14 14 // 15 15 … … 18 18 #include <stdlib.h> // getenv 19 19 #include "bits/defs.hfa" // rdtscl 20 #include "bits/random.hfa" // rdtscl 20 21 #include "startup.hfa" 21 22 22 extern uint32_t __global_random_seed;// sequential/concurrent23 extern uint32_t __global_random_state;// sequential23 extern size_t __global_random_seed; // sequential/concurrent 24 extern PRNG_STATE_T __global_random_state; // sequential 24 25 25 26 extern "C" { … … 68 69 void __cfaabi_core_startup( void ) __attribute__(( constructor( STARTUP_PRIORITY_CORE ) )); 69 70 void __cfaabi_core_startup( void ) { 70 __global_random_state = __global_random_seed = rdtscl(); 71 __global_random_seed = rdtscl(); 72 PRNG_SET_SEED( __global_random_state, __global_random_seed ); 73 71 74 __cfaabi_interpose_startup(); 72 75 __cfaabi_device_startup(); -
libcfa/src/stdlib.cfa
r7d9598d8 r2dcd80a 10 10 // Created On : Thu Jan 28 17:10:29 2016 11 11 // Last Modified By : Peter A. Buhr 12 // Last Modified On : Thu Aug 25 22:41:14202213 // Update Count : 6 0412 // Last Modified On : Fri Dec 9 15:11:30 2022 13 // Update Count : 631 14 14 // 15 15 … … 225 225 //--------------------------------------- 226 226 227 #define GENERATOR LCG228 229 227 // would be cool to make hidden but it's needed for libcfathread 230 __attribute__((visibility("default"))) uint32_t __global_random_seed; // sequential/concurrent 231 __attribute__((visibility("hidden"))) uint32_t __global_random_state; // sequential only 232 233 void set_seed( PRNG & prng, uint32_t seed_ ) with( prng ) { state = seed = seed_; GENERATOR( state ); } // set seed 234 235 void set_seed( uint32_t seed ) { __global_random_state = __global_random_seed = seed; GENERATOR( __global_random_state ); } 236 uint32_t get_seed() { return __global_random_seed; } 237 uint32_t prng( void ) { return GENERATOR( __global_random_state ); } // [0,UINT_MAX] 228 __attribute__((visibility("default"))) size_t __global_random_seed; // sequential/concurrent 229 __attribute__((visibility("hidden"))) PRNG_STATE_T __global_random_state; // sequential only 230 231 void set_seed( size_t seed ) { 232 __global_random_seed = seed; 233 PRNG_SET_SEED( __global_random_state, seed ); 234 } // set_seed 235 236 size_t get_seed() { return __global_random_seed; } 237 size_t prng( void ) { return PRNG_NAME( __global_random_state ); } // [0,UINT_MAX] 238 238 239 239 //--------------------------------------- -
libcfa/src/stdlib.hfa
r7d9598d8 r2dcd80a 10 10 // Created On : Thu Jan 28 17:12:35 2016 11 11 // Last Modified By : Peter A. Buhr 12 // Last Modified On : Thu Aug 25 18:07:06202213 // Update Count : 64512 // Last Modified On : Sun Dec 11 18:25:53 2022 13 // Update Count : 765 14 14 // 15 15 … … 404 404 // calls( sprng ); 405 405 406 struct PRNG { 406 trait basic_prng( PRNG &, R ) { 407 void set_seed( PRNG & prng, R seed ); // set seed 408 R get_seed( PRNG & prng ); // get seed 409 R prng( PRNG & prng ); 410 void ?{}( PRNG & prng ); // random seed 411 void ?{}( PRNG & prng, R seed ); // fixed seed 412 }; // basic_prng 413 414 static inline forall( PRNG &, R | basic_prng( PRNG, R ) | { R ?%?( R, R ); } ) { 415 R prng( PRNG & prng, R u ) { return prng( prng ) % u; } // [0,u) 416 } 417 static inline forall( PRNG &, R | basic_prng( PRNG, R ) | { R ?+?( R, R ); R ?-?( R, R ); R ?%?( R, R ); void ?{}( R &, one_t ); } ) { 418 R prng( PRNG & prng, R l, R u ) { return prng( prng, u - l + (R){1} ) + l; } // [l,u] 419 } 420 421 struct PRNG32 { 407 422 uint32_t callcnt; // call count 408 423 uint32_t seed; // current seed 409 uint32_t state;// random state424 PRNG_STATE_32_T state; // random state 410 425 }; // PRNG 411 426 412 void set_seed( PRNG & prng, uint32_t seed_ ); 413 static inline { 414 void ?{}( PRNG & prng ) with( prng ) { callcnt = 0; set_seed( prng, rdtscl() ); } // random seed 415 void ?{}( PRNG & prng, uint32_t seed ) with( prng ) { callcnt = 0; set_seed( prng, seed ); } // fixed seed 416 uint32_t get_seed( PRNG & prng ) __attribute__(( warn_unused_result )) with( prng ) { return seed; } // get seed 417 uint32_t prng( PRNG & prng ) __attribute__(( warn_unused_result )) with( prng ) { callcnt += 1; return LCG( state ); } // [0,UINT_MAX] 418 uint32_t prng( PRNG & prng, uint32_t u ) __attribute__(( warn_unused_result )) { return prng( prng ) % u; } // [0,u) 419 uint32_t prng( PRNG & prng, uint32_t l, uint32_t u ) __attribute__(( warn_unused_result )) { return prng( prng, u - l + 1 ) + l; } // [l,u] 420 uint32_t calls( PRNG & prng ) __attribute__(( warn_unused_result )) with( prng ) { return callcnt; } 427 static inline { 428 void set_seed( PRNG32 & prng, uint32_t seed_ ) with( prng ) { seed = seed_; PRNG_SET_SEED_32( state, seed ); } 429 uint32_t get_seed( PRNG32 & prng ) __attribute__(( warn_unused_result )) with( prng ) { return seed; } 430 uint32_t prng( PRNG32 & prng ) __attribute__(( warn_unused_result )) with( prng ) { callcnt += 1; return PRNG_NAME_32( state ); } // [0,UINT_MAX] 431 uint32_t prng( PRNG32 & prng, uint32_t u ) __attribute__(( warn_unused_result )) { return prng( prng ) % u; } // [0,u) 432 uint32_t prng( PRNG32 & prng, uint32_t l, uint32_t u ) __attribute__(( warn_unused_result )) { return prng( prng, u - l + 1 ) + l; } // [l,u] 433 uint32_t calls( PRNG32 & prng ) __attribute__(( warn_unused_result )) with( prng ) { return callcnt; } 434 void ?{}( PRNG32 & prng ) with( prng ) { callcnt = 0; set_seed( prng, rdtscl() ); } // random seed 435 void ?{}( PRNG32 & prng, uint32_t seed ) with( prng ) { callcnt = 0; set_seed( prng, seed ); } // fixed seed 436 } // distribution 437 438 struct PRNG64 { 439 uint64_t callcnt; // call count 440 uint64_t seed; // current seed 441 PRNG_STATE_64_T state; // random state 442 }; // PRNG 443 444 static inline { 445 void set_seed( PRNG64 & prng, uint64_t seed_ ) with( prng ) { seed = seed_; PRNG_SET_SEED_64( state, seed ); } 446 uint64_t get_seed( PRNG64 & prng ) __attribute__(( warn_unused_result )) with( prng ) { return seed; } 447 uint64_t prng( PRNG64 & prng 
) __attribute__(( warn_unused_result )) with( prng ) { callcnt += 1; return PRNG_NAME_64( state ); } // [0,UINT_MAX] 448 uint64_t prng( PRNG64 & prng, uint64_t u ) __attribute__(( warn_unused_result )) { return prng( prng ) % u; } // [0,u) 449 uint64_t prng( PRNG64 & prng, uint64_t l, uint64_t u ) __attribute__(( warn_unused_result )) { return prng( prng, u - l + 1 ) + l; } // [l,u] 450 uint64_t calls( PRNG64 & prng ) __attribute__(( warn_unused_result )) with( prng ) { return callcnt; } 451 void ?{}( PRNG64 & prng ) with( prng ) { callcnt = 0; set_seed( prng, rdtscl() ); } // random seed 452 void ?{}( PRNG64 & prng, uint64_t seed ) with( prng ) { callcnt = 0; set_seed( prng, seed ); } // fixed seed 421 453 } // distribution 422 454 … … 435 467 // prng( 5, 21 ); 436 468 437 void set_seed( uint32_t seed_ ) OPTIONAL_THREAD; 438 uint32_t get_seed() __attribute__(( warn_unused_result )); 439 uint32_t prng( void ) __attribute__(( warn_unused_result )) OPTIONAL_THREAD; // [0,UINT_MAX] 440 static inline { 441 uint32_t prng( uint32_t u ) __attribute__(( warn_unused_result )) { return prng() % u; } // [0,u) 442 uint32_t prng( uint32_t l, uint32_t u ) __attribute__(( warn_unused_result )) { return prng( u - l + 1 ) + l; } // [l,u] 469 // Harmonize with concurrency/thread.hfa. 470 void set_seed( size_t seed_ ) OPTIONAL_THREAD; // set global seed 471 size_t get_seed() __attribute__(( warn_unused_result )); // get global seed 472 size_t prng( void ) __attribute__(( warn_unused_result )) OPTIONAL_THREAD; // [0,UINT_MAX] 473 static inline { 474 size_t prng( size_t u ) __attribute__(( warn_unused_result )) { return prng() % u; } // [0,u) 475 size_t prng( size_t l, size_t u ) __attribute__(( warn_unused_result )) { return prng( u - l + 1 ) + l; } // [l,u] 443 476 } // distribution 444 477 -
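Note on the new PRNG interface: the basic_prng trait generalizes the bounded overloads that used to be hard-coded for uint32_t. A raw draw is reduced to [0,u) with a modulo, and the inclusive range [l,u] is obtained by drawing from a range of width u - l + 1 and shifting by l, as in the prng( prng, l, u ) forwarders above. A minimal standalone sketch of that range mapping, written in C++ with a stand-in xorshift generator (the library's PRNG32/PRNG64 engines and seeding are not reproduced here):

#include <cassert>
#include <cstdint>

// Stand-in raw generator (xorshift64); the library plugs in its own PRNG state instead.
static uint64_t raw( uint64_t & state ) {
	state ^= state << 13;  state ^= state >> 7;  state ^= state << 17;
	return state;
}

// [0,u): reduce a full-width draw with modulo (slightly biased when u is not a power of two).
static uint64_t prng_upto( uint64_t & state, uint64_t u ) { return raw( state ) % u; }

// [l,u]: the range holds u - l + 1 values, so draw in [0, u - l + 1) and shift by l.
static uint64_t prng_range( uint64_t & state, uint64_t l, uint64_t u ) {
	return prng_upto( state, u - l + 1 ) + l;
}

int main() {
	uint64_t state = 1009;                      // fixed seed, mirroring set_seed
	for ( int i = 0; i < 1000; i += 1 ) {
		uint64_t x = prng_range( state, 5, 21 );
		assert( 5 <= x && x <= 21 );
	}
}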
src/AST/Convert.cpp
r7d9598d8 r2dcd80a 1764 1764 { old->linkage.val }, 1765 1765 GET_ACCEPT_1(base, Type), 1766 old->hide == EnumDecl::EnumHiding::Hide ? ast::EnumDecl::EnumHiding::Hide : ast::EnumDecl::EnumHiding::Visible, 1766 1767 old->enumValues 1767 1768 ); -
src/AST/Decl.cpp
r7d9598d8 r2dcd80a 125 125 } 126 126 127 std::ostream & operator<< ( std::ostream & out, const TypeD ecl::Data & data ) {127 std::ostream & operator<< ( std::ostream & out, const TypeData & data ) { 128 128 return out << data.kind << ", " << data.isComplete; 129 129 } -
src/AST/Decl.hpp
r7d9598d8 r2dcd80a 10 10 // Created On : Thu May 9 10:00:00 2019 11 11 // Last Modified By : Andrew Beach 12 // Last Modified On : Thu May 5 12:09:00 202213 // Update Count : 3 312 // Last Modified On : Thu Nov 24 9:44:00 2022 13 // Update Count : 34 14 14 // 15 15 … … 191 191 ptr<Type> init; 192 192 193 /// Data extracted from a type decl194 struct Data {195 Kind kind;196 bool isComplete;197 198 Data() : kind( NUMBER_OF_KINDS ), isComplete( false ) {}199 Data( const TypeDecl * d ) : kind( d->kind ), isComplete( d->sized ) {}200 Data( Kind k, bool c ) : kind( k ), isComplete( c ) {}201 Data( const Data & d1, const Data & d2 )202 : kind( d1.kind ), isComplete( d1.isComplete || d2.isComplete ) {}203 204 bool operator==( const Data & o ) const { return kind == o.kind && isComplete == o.isComplete; }205 bool operator!=( const Data & o ) const { return !(*this == o); }206 };207 208 193 TypeDecl( 209 194 const CodeLocation & loc, const std::string & name, Storage::Classes storage, … … 225 210 }; 226 211 227 std::ostream & operator<< ( std::ostream &, const TypeDecl::Data & ); 212 /// Data extracted from a TypeDecl. 213 struct TypeData { 214 TypeDecl::Kind kind; 215 bool isComplete; 216 217 TypeData() : kind( TypeDecl::NUMBER_OF_KINDS ), isComplete( false ) {} 218 TypeData( const TypeDecl * d ) : kind( d->kind ), isComplete( d->sized ) {} 219 TypeData( TypeDecl::Kind k, bool c ) : kind( k ), isComplete( c ) {} 220 TypeData( const TypeData & d1, const TypeData & d2 ) 221 : kind( d1.kind ), isComplete( d1.isComplete || d2.isComplete ) {} 222 223 bool operator==( const TypeData & o ) const { return kind == o.kind && isComplete == o.isComplete; } 224 bool operator!=( const TypeData & o ) const { return !(*this == o); } 225 }; 226 227 std::ostream & operator<< ( std::ostream &, const TypeData & ); 228 228 229 229 /// C-style typedef `typedef Foo Bar` … … 315 315 // enum (type_optional) Name {...} 316 316 ptr<Type> base; // if isTyped == true && base.get() == nullptr, it is a "void" type enum 317 318 EnumDecl( const CodeLocation& loc, const std::string& name, bool isTyped = false, 317 enum class EnumHiding { Visible, Hide } hide; 318 319 EnumDecl( const CodeLocation& loc, const std::string& name, bool isTyped = false, 319 320 std::vector<ptr<Attribute>>&& attrs = {}, Linkage::Spec linkage = Linkage::Cforall, 320 Type const * base = nullptr, 321 Type const * base = nullptr, EnumHiding hide = EnumHiding::Hide, 321 322 std::unordered_map< std::string, long long > enumValues = std::unordered_map< std::string, long long >() ) 322 : AggregateDecl( loc, name, std::move(attrs), linkage ), isTyped(isTyped), base(base), enumValues(enumValues) {}323 : AggregateDecl( loc, name, std::move(attrs), linkage ), isTyped(isTyped), base(base), hide(hide), enumValues(enumValues) {} 323 324 324 325 /// gets the integer value for this enumerator, returning true iff value found -
src/AST/Pass.impl.hpp
r7d9598d8 r2dcd80a 686 686 687 687 if ( __visit_children() ) { 688 // unlike structs, traits, and unions, enums inject their members into the global scope 689 maybe_accept( node, &EnumDecl::base ); 690 maybe_accept( node, &EnumDecl::params ); 691 maybe_accept( node, &EnumDecl::members ); 692 maybe_accept( node, &EnumDecl::attributes ); 688 if ( node->hide == ast::EnumDecl::EnumHiding::Hide ) { 689 guard_symtab guard { *this }; 690 maybe_accept( node, &EnumDecl::base ); 691 maybe_accept( node, &EnumDecl::params ); 692 maybe_accept( node, &EnumDecl::members ); 693 maybe_accept( node, &EnumDecl::attributes ); 694 } else { 695 maybe_accept( node, &EnumDecl::base ); 696 maybe_accept( node, &EnumDecl::params ); 697 maybe_accept( node, &EnumDecl::members ); 698 maybe_accept( node, &EnumDecl::attributes ); 699 } 693 700 } 694 701 -
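Note on the hiding change: when an enumeration is marked Hide, the pass now opens a guard_symtab scope around the members, so the enumerators live in a nested scope instead of being injected into the enclosing one; visible enums keep the old C-style behaviour. A toy, framework-free sketch of that scoping idea (hypothetical SymTab/ScopeGuard names, not the compiler's Pass or guard_symtab machinery):

#include <cassert>
#include <map>
#include <string>
#include <vector>

// Toy scoped symbol table: a stack of maps, lookup from the innermost scope outward.
struct SymTab {
	std::vector< std::map< std::string, long long > > scopes;
	SymTab() { scopes.emplace_back(); }
	void begin() { scopes.emplace_back(); }
	void end() { scopes.pop_back(); }
	void insert( const std::string & k, long long v ) { scopes.back()[k] = v; }
	bool lookup( const std::string & k ) const {
		for ( auto it = scopes.rbegin(); it != scopes.rend(); ++it )
			if ( it->count( k ) ) return true;
		return false;
	}
};

// RAII guard playing the role of guard_symtab: opens a scope, closes it on exit.
struct ScopeGuard {
	SymTab & tab;
	explicit ScopeGuard( SymTab & t ) : tab( t ) { tab.begin(); }
	~ScopeGuard() { tab.end(); }
};

// Visit an enum's members; hidden enums get their own scope for the enumerators.
void visitEnum( SymTab & tab, bool hide, const std::vector< std::string > & enumerators ) {
	if ( hide ) {
		ScopeGuard guard{ tab };                            // enumerators stay in the nested scope
		for ( auto & e : enumerators ) tab.insert( e, 0 );
	} else {
		for ( auto & e : enumerators ) tab.insert( e, 0 );  // C behaviour: inject into the enclosing scope
	}
}

int main() {
	SymTab tab;
	visitEnum( tab, /*hide=*/true, { "Red", "Green" } );
	visitEnum( tab, /*hide=*/false, { "Mon", "Tue" } );
	assert( ! tab.lookup( "Red" ) && tab.lookup( "Mon" ) );
}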
src/AST/Type.cpp
r7d9598d8 r2dcd80a 10 10 // Created On : Mon May 13 15:00:00 2019 11 11 // Last Modified By : Andrew Beach 12 // Last Modified On : Thu Jul 23 14:16:00 202013 // Update Count : 512 // Last Modified On : Thu Nov 24 9:49:00 2022 13 // Update Count : 6 14 14 // 15 15 … … 147 147 // --- TypeInstType 148 148 149 TypeInstType::TypeInstType( const TypeEnvKey & key ) 150 : BaseInstType(key.base->name), base(key.base), kind(key.base->kind), formal_usage(key.formal_usage), expr_id(key.expr_id) {} 151 149 152 bool TypeInstType::operator==( const TypeInstType & other ) const { 150 153 return base == other.base … … 164 167 bool TypeInstType::isComplete() const { return base->sized; } 165 168 166 std::string Type InstType::TypeEnvKey::typeString() const {169 std::string TypeEnvKey::typeString() const { 167 170 return std::string("_") + std::to_string(formal_usage) 168 171 + "_" + std::to_string(expr_id) + "_" + base->name; 169 172 } 170 173 171 bool Type InstType::TypeEnvKey::operator==(172 const Type InstType::TypeEnvKey & other ) const {174 bool TypeEnvKey::operator==( 175 const TypeEnvKey & other ) const { 173 176 return base == other.base 174 177 && formal_usage == other.formal_usage … … 176 179 } 177 180 178 bool Type InstType::TypeEnvKey::operator<(179 const Type InstType::TypeEnvKey & other ) const {181 bool TypeEnvKey::operator<( 182 const TypeEnvKey & other ) const { 180 183 // TypeEnvKey ordering is an arbitrary total ordering. 181 184 // It doesn't mean anything but allows for a sorting. -
src/AST/Type.hpp
r7d9598d8 r2dcd80a 10 10 // Created On : Thu May 9 10:00:00 2019 11 11 // Last Modified By : Andrew Beach 12 // Last Modified On : Wed Jul 14 15:54:00 202113 // Update Count : 712 // Last Modified On : Thu Nov 24 9:47:00 2022 13 // Update Count : 8 14 14 // 15 15 … … 390 390 }; 391 391 392 struct TypeEnvKey; 393 392 394 /// instance of named type alias (typedef or variable) 393 395 class TypeInstType final : public BaseInstType { … … 401 403 int expr_id = 0; 402 404 403 // compact representation used for map lookups.404 struct TypeEnvKey {405 const TypeDecl * base = nullptr;406 int formal_usage = 0;407 int expr_id = 0;408 409 TypeEnvKey() = default;410 TypeEnvKey(const TypeDecl * base, int formal_usage = 0, int expr_id = 0)411 : base(base), formal_usage(formal_usage), expr_id(expr_id) {}412 TypeEnvKey(const TypeInstType & inst)413 : base(inst.base), formal_usage(inst.formal_usage), expr_id(inst.expr_id) {}414 std::string typeString() const;415 bool operator==(const TypeEnvKey & other) const;416 bool operator<(const TypeEnvKey & other) const;417 };418 419 405 bool operator==(const TypeInstType & other) const; 420 406 … … 433 419 TypeInstType( const TypeInstType & o ) = default; 434 420 435 TypeInstType( const TypeEnvKey & key ) 436 : BaseInstType(key.base->name), base(key.base), kind(key.base->kind), formal_usage(key.formal_usage), expr_id(key.expr_id) {} 421 TypeInstType( const TypeEnvKey & key ); 437 422 438 423 /// sets `base`, updating `kind` correctly … … 453 438 TypeInstType * clone() const override { return new TypeInstType{ *this }; } 454 439 MUTATE_FRIEND 440 }; 441 442 /// Compact representation of TypeInstType used for map lookups. 443 struct TypeEnvKey { 444 const TypeDecl * base = nullptr; 445 int formal_usage = 0; 446 int expr_id = 0; 447 448 TypeEnvKey() = default; 449 TypeEnvKey(const TypeDecl * base, int formal_usage = 0, int expr_id = 0) 450 : base(base), formal_usage(formal_usage), expr_id(expr_id) {} 451 TypeEnvKey(const TypeInstType & inst) 452 : base(inst.base), formal_usage(inst.formal_usage), expr_id(inst.expr_id) {} 453 std::string typeString() const; 454 bool operator==(const TypeEnvKey & other) const; 455 bool operator<(const TypeEnvKey & other) const; 455 456 }; 456 457 … … 560 561 namespace std { 561 562 template<> 562 struct hash<typename ast::Type InstType::TypeEnvKey> {563 size_t operator() (const ast::Type InstType::TypeEnvKey & x) const {563 struct hash<typename ast::TypeEnvKey> { 564 size_t operator() (const ast::TypeEnvKey & x) const { 564 565 const size_t p = 1000007; 565 566 size_t res = reinterpret_cast<size_t>(x.base); -
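Note on the hoisted key: with TypeEnvKey at namespace scope, the std::hash specialization at the end of this diff lets the key be used directly in unordered containers, which is what OpenVarSet and TypeSubstitution's TypeMap do. A self-contained sketch of the same pattern with a toy key (hypothetical names; the combining steps below follow the prime-multiplier style started in the diff but are otherwise an assumption):

#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>

// Toy analogue of ast::TypeEnvKey: identity pointer plus two disambiguating integers.
struct ToyEnvKey {
	const void * base;
	int formal_usage;
	int expr_id;
	bool operator==( const ToyEnvKey & o ) const {
		return base == o.base && formal_usage == o.formal_usage && expr_id == o.expr_id;
	}
};

namespace std {
	template<> struct hash< ToyEnvKey > {
		size_t operator()( const ToyEnvKey & k ) const {
			const size_t p = 1000007;
			size_t res = reinterpret_cast<size_t>( k.base );
			res = p * res + static_cast<size_t>( k.formal_usage );
			res = p * res + static_cast<size_t>( k.expr_id );
			return res;
		}
	};
}

int main() {
	std::unordered_map< ToyEnvKey, std::string > bindings;   // cf. OpenVarSet / TypeMap
	int decl = 0;                                            // stands in for a TypeDecl node
	bindings[ ToyEnvKey{ &decl, 1, 0 } ] = "int";
	assert( bindings.count( ToyEnvKey{ &decl, 1, 0 } ) == 1 );
}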
src/AST/TypeEnvironment.cpp
r7d9598d8 r2dcd80a 82 82 } 83 83 84 const EqvClass * TypeEnvironment::lookup( const Type InstType::TypeEnvKey & var ) const {84 const EqvClass * TypeEnvironment::lookup( const TypeEnvKey & var ) const { 85 85 for ( ClassList::const_iterator i = env.begin(); i != env.end(); ++i ) { 86 86 if ( i->vars.find( var ) != i->vars.end() ) return &*i; … … 122 122 void TypeEnvironment::writeToSubstitution( TypeSubstitution & sub ) const { 123 123 for ( const auto & clz : env ) { 124 Type InstType::TypeEnvKey clzRep;124 TypeEnvKey clzRep; 125 125 bool first = true; 126 126 for ( const auto & var : clz.vars ) { … … 146 146 struct Occurs : public ast::WithVisitorRef<Occurs> { 147 147 bool result; 148 std::unordered_set< Type InstType::TypeEnvKey > vars;148 std::unordered_set< TypeEnvKey > vars; 149 149 const TypeEnvironment & tenv; 150 150 151 Occurs( const Type InstType::TypeEnvKey & var, const TypeEnvironment & env )151 Occurs( const TypeEnvKey & var, const TypeEnvironment & env ) 152 152 : result( false ), vars(), tenv( env ) { 153 153 if ( const EqvClass * clz = tenv.lookup( var ) ) { … … 170 170 171 171 /// true if `var` occurs in `ty` under `env` 172 bool occurs( const Type * ty, const Type InstType::TypeEnvKey & var, const TypeEnvironment & env ) {172 bool occurs( const Type * ty, const TypeEnvKey & var, const TypeEnvironment & env ) { 173 173 Pass<Occurs> occur{ var, env }; 174 174 maybe_accept( ty, occur ); … … 258 258 namespace { 259 259 /// true if the given type can be bound to the given type variable 260 bool tyVarCompatible( const TypeD ecl::Data & data, const Type * type ) {260 bool tyVarCompatible( const TypeData & data, const Type * type ) { 261 261 switch ( data.kind ) { 262 262 case TypeDecl::Dtype: … … 279 279 280 280 bool TypeEnvironment::bindVar( 281 const TypeInstType * typeInst, const Type * bindTo, const TypeD ecl::Data & data,281 const TypeInstType * typeInst, const Type * bindTo, const TypeData & data, 282 282 AssertionSet & need, AssertionSet & have, const OpenVarSet & open, WidenMode widen, 283 283 const SymbolTable & symtab … … 319 319 320 320 bool TypeEnvironment::bindVarToVar( 321 const TypeInstType * var1, const TypeInstType * var2, TypeD ecl::Data && data,321 const TypeInstType * var1, const TypeInstType * var2, TypeData && data, 322 322 AssertionSet & need, AssertionSet & have, const OpenVarSet & open, 323 323 WidenMode widen, const SymbolTable & symtab … … 457 457 } 458 458 459 TypeEnvironment::ClassList::iterator TypeEnvironment::internal_lookup( const Type InstType::TypeEnvKey & var ) {459 TypeEnvironment::ClassList::iterator TypeEnvironment::internal_lookup( const TypeEnvKey & var ) { 460 460 for ( ClassList::iterator i = env.begin(); i != env.end(); ++i ) { 461 461 if ( i->vars.count( var ) ) return i; -
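Note on the Occurs visitor touched above: it performs the classic occurs check from unification, refusing to bind a type variable to a type that contains that same variable (directly, or through the variable's equivalence class), which would otherwise create an infinite type. A toy, framework-free version of the check over a miniature type language (hypothetical names, not the compiler's Pass/visitor machinery):

#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Miniature type language: a type is either a variable or a constructor applied to arguments.
struct Ty {
	std::string head;                           // variable name, or a constructor such as "ptr"
	bool isVar;
	std::vector< std::shared_ptr<Ty> > args;
};
using TyPtr = std::shared_ptr<Ty>;
using Env = std::map< std::string, TyPtr >;     // current bindings: variable -> bound type

static TyPtr mkvar( std::string n ) { return std::make_shared<Ty>( Ty{ std::move(n), true, {} } ); }
static TyPtr mkapp( std::string c, std::vector<TyPtr> a ) { return std::make_shared<Ty>( Ty{ std::move(c), false, std::move(a) } ); }

// true if variable v occurs in ty, following existing bindings in env (the occurs check).
static bool occurs( const std::string & v, const TyPtr & ty, const Env & env ) {
	if ( ty->isVar ) {
		if ( ty->head == v ) return true;
		auto it = env.find( ty->head );         // chase what the variable is already bound to
		return it != env.end() && occurs( v, it->second, env );
	}
	for ( auto & a : ty->args ) if ( occurs( v, a, env ) ) return true;
	return false;
}

int main() {
	Env env;
	env["U"] = mkapp( "ptr", { mkvar("T") } );  // U |-> ptr(T)
	assert( occurs( "T", mkapp( "pair", { mkvar("U"), mkapp("int", {}) } ), env ) );  // T hides inside U
	assert( ! occurs( "T", mkapp( "int", {} ), env ) );
	// a unifier would refuse the binding T := pair(U, int) here, since it creates an infinite type
}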
src/AST/TypeEnvironment.hpp
r7d9598d8 r2dcd80a 79 79 80 80 /// Set of open variables 81 using OpenVarSet = std::unordered_map< Type InstType::TypeEnvKey, TypeDecl::Data >;81 using OpenVarSet = std::unordered_map< TypeEnvKey, TypeData >; 82 82 83 83 /// Merges one set of open vars into another … … 95 95 /// they bind to. 96 96 struct EqvClass { 97 std::unordered_set< Type InstType::TypeEnvKey > vars;97 std::unordered_set< TypeEnvKey > vars; 98 98 ptr<Type> bound; 99 99 bool allowWidening; 100 TypeD ecl::Data data;100 TypeData data; 101 101 102 102 EqvClass() : vars(), bound(), allowWidening( true ), data() {} … … 111 111 112 112 /// Singleton class constructor from substitution 113 EqvClass( const Type InstType::TypeEnvKey & v, const Type * b )113 EqvClass( const TypeEnvKey & v, const Type * b ) 114 114 : vars{ v }, bound( b ), allowWidening( false ), data( TypeDecl::Dtype, false ) {} 115 115 116 116 /// Single-var constructor (strips qualifiers from bound type) 117 EqvClass( const Type InstType::TypeEnvKey & v, const Type * b, bool w, const TypeDecl::Data & d )117 EqvClass( const TypeEnvKey & v, const Type * b, bool w, const TypeData & d ) 118 118 : vars{ v }, bound( b ), allowWidening( w ), data( d ) { 119 119 reset_qualifiers( bound ); … … 121 121 122 122 /// Double-var constructor 123 EqvClass( const Type InstType::TypeEnvKey & v, const TypeInstType::TypeEnvKey & u, bool w, const TypeDecl::Data & d )123 EqvClass( const TypeEnvKey & v, const TypeEnvKey & u, bool w, const TypeData & d ) 124 124 : vars{ v, u }, bound(), allowWidening( w ), data( d ) {} 125 125 … … 137 137 public: 138 138 /// Finds the equivalence class containing a variable; nullptr for none such 139 const EqvClass * lookup( const Type InstType::TypeEnvKey & var ) const;139 const EqvClass * lookup( const TypeEnvKey & var ) const; 140 140 141 141 /// Add a new equivalence class for each type variable … … 181 181 /// needed. Returns false on failure. 182 182 bool bindVar( 183 const TypeInstType * typeInst, const Type * bindTo, const TypeD ecl::Data & data,183 const TypeInstType * typeInst, const Type * bindTo, const TypeData & data, 184 184 AssertionSet & need, AssertionSet & have, const OpenVarSet & openVars, 185 185 ResolvExpr::WidenMode widen, const SymbolTable & symtab ); … … 188 188 /// classes if needed. Returns false on failure. 189 189 bool bindVarToVar( 190 const TypeInstType * var1, const TypeInstType * var2, TypeD ecl::Data && data,190 const TypeInstType * var1, const TypeInstType * var2, TypeData && data, 191 191 AssertionSet & need, AssertionSet & have, const OpenVarSet & openVars, 192 192 ResolvExpr::WidenMode widen, const SymbolTable & symtab ); … … 213 213 214 214 /// Private lookup API; returns array index of string, or env.size() for not found 215 ClassList::iterator internal_lookup( const Type InstType::TypeEnvKey & );215 ClassList::iterator internal_lookup( const TypeEnvKey & ); 216 216 }; 217 217 -
src/AST/TypeSubstitution.cpp
r7d9598d8 r2dcd80a 52 52 } 53 53 54 void TypeSubstitution::add( const Type InstType::TypeEnvKey & key, const Type * actualType) {54 void TypeSubstitution::add( const TypeEnvKey & key, const Type * actualType) { 55 55 typeMap[ key ] = actualType; 56 56 } … … 64 64 65 65 const Type *TypeSubstitution::lookup( 66 const Type InstType::TypeEnvKey & formalType ) const {66 const TypeEnvKey & formalType ) const { 67 67 TypeMap::const_iterator i = typeMap.find( formalType ); 68 68 … … 85 85 86 86 const Type *TypeSubstitution::lookup( const TypeInstType * formalType ) const { 87 return lookup( ast::Type InstType::TypeEnvKey( *formalType ) );87 return lookup( ast::TypeEnvKey( *formalType ) ); 88 88 } 89 89 -
src/AST/TypeSubstitution.hpp
r7d9598d8 r2dcd80a 72 72 73 73 void add( const TypeInstType * formalType, const Type *actualType ); 74 void add( const Type InstType::TypeEnvKey & key, const Type *actualType );74 void add( const TypeEnvKey & key, const Type *actualType ); 75 75 void add( const TypeSubstitution &other ); 76 76 void remove( const TypeInstType * formalType ); 77 const Type *lookup( const Type InstType::TypeEnvKey & formalType ) const;77 const Type *lookup( const TypeEnvKey & formalType ) const; 78 78 const Type *lookup( const TypeInstType * formalType ) const; 79 79 bool empty() const; … … 105 105 friend class Pass; 106 106 107 typedef std::unordered_map< Type InstType::TypeEnvKey, ptr<Type> > TypeMap;107 typedef std::unordered_map< TypeEnvKey, ptr<Type> > TypeMap; 108 108 TypeMap typeMap; 109 109 … … 184 184 int subCount = 0; 185 185 bool freeOnly; 186 typedef std::unordered_set< Type InstType::TypeEnvKey > BoundVarsType;186 typedef std::unordered_set< TypeEnvKey > BoundVarsType; 187 187 BoundVarsType boundVars; 188 188 -
src/CodeGen/CodeGenerator.cc
r7d9598d8 r2dcd80a 290 290 if ( obj->get_init() ) { 291 291 obj->get_init()->accept( *visitor ); 292 last_val = ((ConstantExpr *)(((SingleInit *)(obj->init))->value))->constant.get_ival(); 292 Expression* expr = ((SingleInit *)(obj->init))->value; 293 while ( auto temp = dynamic_cast<CastExpr *>(expr) ) { 294 expr = temp->arg; 295 } 296 last_val = ((ConstantExpr *)expr)->constant.get_ival(); 293 297 } else { 294 298 output << ++last_val; -
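Note on the enumerator fix above: the old code assumed the initializer was a bare ConstantExpr; the new loop first peels any CastExpr wrappers (for example when the literal has been wrapped by an implicit conversion) before reading the integer value. A toy standalone version of the peeling step (minimal stand-in classes, not the compiler's real expression hierarchy):

#include <cassert>

// Minimal stand-ins for the expression hierarchy.
struct Expression { virtual ~Expression() = default; };
struct ConstantExpr : Expression {
	long long ival;
	explicit ConstantExpr( long long v ) : ival( v ) {}
};
struct CastExpr : Expression {
	Expression * arg;
	explicit CastExpr( Expression * a ) : arg( a ) {}
	~CastExpr() override { delete arg; }
};

// Peel any number of casts, then read the constant underneath: the shape of the fix above.
static long long constantValue( Expression * expr ) {
	while ( auto cast = dynamic_cast<CastExpr *>( expr ) ) {
		expr = cast->arg;
	}
	return dynamic_cast<ConstantExpr &>( *expr ).ival;      // throws std::bad_cast if no constant underneath
}

int main() {
	Expression * wrapped = new CastExpr( new CastExpr( new ConstantExpr( 42 ) ) );
	assert( constantValue( wrapped ) == 42 );
	delete wrapped;
}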
src/GenPoly/Box.cc
r7d9598d8 r2dcd80a 37 37 #include "InitTweak/InitTweak.h" // for getFunctionName, isAssignment 38 38 #include "Lvalue.h" // for generalizedLvalue 39 #include "ResolvExpr/TypeEnvironment.h" // for EqvClass40 39 #include "ResolvExpr/typeops.h" // for typesCompatible 41 40 #include "ScopedSet.h" // for ScopedSet, ScopedSet<>::iter... … … 95 94 private: 96 95 /// Pass the extra type parameters from polymorphic generic arguments or return types into a function application 97 void passArgTypeVars( ApplicationExpr *appExpr, Type *parmType, Type *argBaseType, std::list< Expression *>::iterator &arg, const TyVarMap &exprTyVars, std::set< std::string > &seenTypes ); 96 /// Will insert 0, 2 or 3 more arguments. 97 std::list< Expression *>::iterator passArgTypeVars( ApplicationExpr *appExpr, Type *parmType, Type *argBaseType, std::list< Expression *>::iterator arg, const TyVarMap &exprTyVars, std::set< std::string > &seenTypes ); 98 98 /// passes extra type parameters into a polymorphic function application 99 99 /// Returns an iterator to the first argument after the added … … 488 488 makeTyVarMap( functionType, scopeTyVars ); 489 489 490 std::list< DeclarationWithType *> ¶mList = functionType->parameters;491 490 std::list< FunctionType const *> functions; 492 491 for ( TypeDecl * const tyVar : functionType->forall ) { … … 495 494 } // for 496 495 } // for 497 for ( DeclarationWithType * const arg : paramList) {496 for ( DeclarationWithType * const arg : functionType->parameters ) { 498 497 findFunction( arg->get_type(), functions, scopeTyVars, needsAdapter ); 499 498 } // for … … 532 531 } 533 532 534 void Pass1::passArgTypeVars( ApplicationExpr *appExpr, Type *parmType, Type *argBaseType, std::list< Expression *>::iterator &arg, const TyVarMap &exprTyVars, std::set< std::string > &seenTypes ) {533 std::list< Expression *>::iterator Pass1::passArgTypeVars( ApplicationExpr *appExpr, Type *parmType, Type *argBaseType, std::list< Expression *>::iterator arg, const TyVarMap &exprTyVars, std::set< std::string > &seenTypes ) { 535 534 Type *polyType = isPolyType( parmType, exprTyVars ); 536 535 if ( polyType && ! dynamic_cast< TypeInstType* >( polyType ) ) { 537 536 std::string typeName = mangleType( polyType ); 538 if ( seenTypes.count( typeName ) ) return ;537 if ( seenTypes.count( typeName ) ) return arg; 539 538 540 539 arg = appExpr->get_args().insert( arg, new SizeofExpr( argBaseType->clone() ) ); … … 556 555 seenTypes.insert( typeName ); 557 556 } 557 return arg; 558 558 } 559 559 … … 562 562 std::list< Expression *>::iterator arg = appExpr->args.begin(); 563 563 // pass size/align for type variables 564 // NOTE: This is iterating over a map. This means the sorting 565 // order of the keys changes behaviour, as the iteration order 566 // is visible outside the loop. - The order matches the orignal 567 // order because the vars have been renamed with numbers that, 568 // even when converted to strings, sort in the original order. 569 // (At least, that is the best explination I have.) 564 570 for ( std::pair<std::string, TypeDecl::Data> const & tyParam : exprTyVars ) { 565 ResolvExpr::EqvClass eqvClass; 566 if ( tyParam.second.isComplete ) { 567 Type *concrete = env->lookup( tyParam.first ); 568 // If there is an unbound type variable, it should have detected already. 
569 assertf( concrete, "Unbound type variable: %s in: %s", 570 toCString( tyParam.first ), toCString( *env ) ); 571 572 arg = appExpr->get_args().insert( arg, new SizeofExpr( concrete->clone() ) ); 573 arg++; 574 arg = appExpr->get_args().insert( arg, new AlignofExpr( concrete->clone() ) ); 575 arg++; 576 } // if 571 if ( !tyParam.second.isComplete ) continue; 572 Type *concrete = env->lookup( tyParam.first ); 573 // If there is an unbound type variable, it should have detected already. 574 assertf( concrete, "Unbound type variable: %s in: %s", 575 toCString( tyParam.first ), toCString( *env ) ); 576 577 arg = appExpr->get_args().insert( arg, new SizeofExpr( concrete->clone() ) ); 578 arg++; 579 arg = appExpr->get_args().insert( arg, new AlignofExpr( concrete->clone() ) ); 580 arg++; 577 581 } // for 578 582 … … 582 586 assert( funcType ); 583 587 584 // These iterators don't advance in unison.585 std::list< DeclarationWithType* >::const_iterator fnParm = funcType->get_parameters().begin();586 std::list< Expression* >::const_iterator fnArg = arg;587 std::set< std::string > seenTypes; ///< names for generic types we've seen588 // Iterator over the original function arguments. 589 std::list< Expression* >::const_iterator fnArg; 590 // Names for generic types we've seen. 591 std::set< std::string > seenTypes; 588 592 589 593 // a polymorphic return type may need to be added to the argument list 590 594 if ( polyRetType ) { 591 595 Type *concRetType = replaceWithConcrete( polyRetType, env ); 592 passArgTypeVars( appExpr, polyRetType, concRetType, arg, exprTyVars, seenTypes ); 593 ++fnArg; // skip the return parameter in the argument list 596 arg = passArgTypeVars( appExpr, polyRetType, concRetType, arg, exprTyVars, seenTypes ); 597 // Skip the return parameter in the argument list. 598 fnArg = arg + 1; 599 } else { 600 fnArg = arg; 594 601 } 595 602 596 603 // add type information args for presently unseen types in parameter list 604 std::list< DeclarationWithType* >::const_iterator fnParm = funcType->get_parameters().begin(); 597 605 for ( ; fnParm != funcType->get_parameters().end() && fnArg != appExpr->get_args().end(); ++fnParm, ++fnArg ) { 598 if ( ! (*fnArg)->get_result() ) continue;599 606 Type * argType = (*fnArg)->get_result(); 600 passArgTypeVars( appExpr, (*fnParm)->get_type(), argType, arg, exprTyVars, seenTypes ); 607 if ( ! argType ) continue; 608 arg = passArgTypeVars( appExpr, (*fnParm)->get_type(), argType, arg, exprTyVars, seenTypes ); 601 609 } 602 610 return arg; … … 680 688 Expression *Pass1::applyAdapter( ApplicationExpr *appExpr, FunctionType *function ) { 681 689 Expression *ret = appExpr; 682 // if ( ! 
function->get_returnVals().empty() && isPolyType( function->get_returnVals().front()->get_type(), tyVars ) ) {683 690 if ( isDynRet( function, scopeTyVars ) ) { 684 691 ret = addRetParam( appExpr, function->returnVals.front()->get_type() ); … … 772 779 773 780 void Pass1::addInferredParams( ApplicationExpr *appExpr, std::list< Expression *>::iterator arg, FunctionType *functionType, const TyVarMap &tyVars ) { 774 std::list< Expression *>::iterator cur = arg;775 781 for ( TypeDecl * const tyVar : functionType->forall ) { 776 782 for ( DeclarationWithType * const assert : tyVar->assertions ) { … … 779 785 Expression *newExpr = inferParam->second.expr->clone(); 780 786 boxParam( newExpr, assert->get_type(), tyVars ); 781 appExpr->get_args().insert( cur, newExpr ); 787 arg = appExpr->get_args().insert( arg, newExpr ); 788 ++arg; 782 789 } // for 783 790 } // for … … 922 929 // only attempt to create an adapter or pass one as a parameter if we haven't already done so for this 923 930 // pre-substitution parameter function type. 924 if ( adaptersDone.find( mangleName ) == adaptersDone.end() ) {925 adaptersDone.insert( adaptersDone.begin(), mangleName );931 // The second part of the insert result is "is the value new". 932 if ( adaptersDone.insert( mangleName ).second ) { 926 933 927 934 // apply substitution to type variables to figure out what the adapter's type should look like … … 1106 1113 1107 1114 Expression *ret = appExpr; 1115 // Save iterator to the first original parameter (works with lists). 1108 1116 std::list< Expression *>::iterator paramBegin = appExpr->get_args().begin(); 1109 1117 … … 1172 1180 1173 1181 void Pass1::premutate( AddressExpr * ) { visit_children = false; } 1182 1174 1183 Expression * Pass1::postmutate( AddressExpr * addrExpr ) { 1175 1184 assert( addrExpr->arg->result && ! addrExpr->arg->result->isVoid() ); … … 1232 1241 1233 1242 void Pass2::addAdapters( FunctionType *functionType ) { 1234 std::list< DeclarationWithType *> ¶mList = functionType->parameters;1235 1243 std::list< FunctionType const *> functions; 1236 1244 for ( DeclarationWithType * const arg : functionType->parameters ) { … … 1245 1253 std::string adapterName = makeAdapterName( mangleName ); 1246 1254 // adapter may not be used in body, pass along with unused attribute. 1247 paramList.push_front( new ObjectDecl( adapterName, Type::StorageClasses(), LinkageSpec::C, 0, new PointerType( Type::Qualifiers(), makeAdapterType( funType, scopeTyVars ) ), 0, { new Attribute( "unused" ) } ) ); 1255 functionType->parameters.push_front( 1256 new ObjectDecl( adapterName, Type::StorageClasses(), LinkageSpec::C, 0, new PointerType( Type::Qualifiers(), makeAdapterType( funType, scopeTyVars ) ), 0, { new Attribute( "unused" ) } ) ); 1248 1257 adaptersDone.insert( adaptersDone.begin(), mangleName ); 1249 1258 } 1250 1259 } 1251 // deleteAll( functions );1252 1260 } 1253 1261 -
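Note on the passArgTypeVars signature change above: because the helper may insert 0, 2 or 3 extra arguments, it now takes the insertion iterator by value and returns the updated position, so every call site advances explicitly instead of relying on a by-reference iterator being mutated elsewhere. A self-contained sketch of that returned-iterator idiom with std::list::insert (toy argument list, not the compiler's ApplicationExpr):

#include <cassert>
#include <list>
#include <string>

using Args = std::list< std::string >;

// Insert zero or more extra arguments before *pos and return the updated insertion point,
// mirroring the new passArgTypeVars contract (the names inserted here are just placeholders).
static Args::iterator passExtraArgs( Args & args, Args::iterator pos, bool needed ) {
	if ( ! needed ) return pos;                 // insert 0 arguments
	pos = args.insert( pos, "sizeof(T)" );      // iterator to the newly inserted element
	++pos;                                      // advance past it
	pos = args.insert( pos, "alignof(T)" );
	++pos;
	return pos;                                 // later insertions continue from here
}

int main() {
	Args args = { "a", "b" };
	Args::iterator pos = args.begin();          // extras go before the original arguments
	pos = passExtraArgs( args, pos, true );
	pos = passExtraArgs( args, pos, false );
	assert( (args == Args{ "sizeof(T)", "alignof(T)", "a", "b" }) );
	assert( *pos == "a" );                      // still pointing at the first original argument
}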
src/GenPoly/ErasableScopedMap.h
r7d9598d8 r2dcd80a 23 23 24 24 namespace GenPoly { 25 /// A map where the items are placed into nested scopes; 26 /// inserted items are placed into the innermost scope, lookup looks from the innermost scope outward; 27 /// erasing a key means that find() will no longer report any instance of the key in a scope further 28 /// out, but the erasure itself is scoped. Key erasure works by inserting a sentinal value into the 29 /// value field, and thus only works for Value types where a meaningful sentinal can be chosen. 30 template<typename Key, typename Value> 31 class ErasableScopedMap { 32 typedef std::map< Key, Value > Scope; 33 typedef std::vector< Scope > ScopeList; 34 35 ScopeList scopes; ///< scoped list of maps 36 Value erased; ///< sentinal value for erased keys 37 public: 38 typedef typename Scope::key_type key_type; 39 typedef typename Scope::mapped_type mapped_type; 40 typedef typename Scope::value_type value_type; 41 typedef typename ScopeList::size_type size_type; 42 typedef typename ScopeList::difference_type difference_type; 43 typedef typename Scope::reference reference; 44 typedef typename Scope::const_reference const_reference; 45 typedef typename Scope::pointer pointer; 46 typedef typename Scope::const_pointer const_pointer; 47 48 class iterator : public std::iterator< std::bidirectional_iterator_tag, 49 value_type > { 50 friend class ErasableScopedMap; 51 friend class const_iterator; 52 typedef typename std::map< Key, Value >::iterator wrapped_iterator; 53 typedef typename std::vector< std::map< Key, Value > > scope_list; 54 typedef typename scope_list::size_type size_type; 55 56 /// Checks if this iterator points to a valid item 57 bool is_valid() const { 58 return it != map->scopes[i].end() && it->second != map->erased; 25 26 /// A map where the items are placed into nested scopes. 27 /// Inserted items are placed into the innermost scope, lookup looks from the 28 /// innermost scope outward. Erasing a key means that find() will no longer 29 /// report any instance of the key in a scope further out, but the erasure 30 /// itself is scoped. Key erasure works by inserting a sentinal value into 31 /// the value field, and thus only works for Value types where a meaningful 32 /// sentinal can be chosen. 33 template<typename Key, typename Value> 34 class ErasableScopedMap { 35 typedef std::map< Key, Value > Scope; 36 typedef std::vector< Scope > ScopeList; 37 38 /// Scoped list of maps. 39 ScopeList scopes; 40 /// Sentinal value for erased keys. 41 Value erased; 42 public: 43 typedef typename Scope::key_type key_type; 44 typedef typename Scope::mapped_type mapped_type; 45 typedef typename Scope::value_type value_type; 46 typedef typename ScopeList::size_type size_type; 47 typedef typename ScopeList::difference_type difference_type; 48 typedef typename Scope::reference reference; 49 typedef typename Scope::const_reference const_reference; 50 typedef typename Scope::pointer pointer; 51 typedef typename Scope::const_pointer const_pointer; 52 53 // Both iterator types are complete bidirection iterators, defined below. 54 class iterator; 55 class const_iterator; 56 57 /// Starts a new scope 58 void beginScope() { 59 Scope scope; 60 scopes.push_back(scope); 61 } 62 63 /// Ends a scope; invalidates any iterators pointing to elements of that scope 64 void endScope() { 65 scopes.pop_back(); 66 assert( ! 
scopes.empty() ); 67 } 68 69 /// Default constructor initializes with one scope 70 ErasableScopedMap( const Value &erased_ ) : erased( erased_ ) { beginScope(); } 71 72 iterator begin() { return iterator(*this, scopes.back().begin(), scopes.size()-1).next_valid(); } 73 const_iterator begin() const { return const_iterator(*this, scopes.back().begin(), scopes.size()-1).next_valid(); } 74 const_iterator cbegin() const { return const_iterator(*this, scopes.back().begin(), scopes.size()-1).next_valid(); } 75 iterator end() { return iterator(*this, scopes[0].end(), 0); } 76 const_iterator end() const { return const_iterator(*this, scopes[0].end(), 0); } 77 const_iterator cend() const { return const_iterator(*this, scopes[0].end(), 0); } 78 79 /// Gets the index of the current scope (counted from 1) 80 size_type currentScope() const { return scopes.size(); } 81 82 /// Finds the given key in the outermost scope it occurs; returns end() for none such 83 iterator find( const Key &key ) { 84 for ( size_type i = scopes.size() - 1; ; --i ) { 85 typename Scope::iterator val = scopes[i].find( key ); 86 if ( val != scopes[i].end() ) { 87 return val->second == erased ? end() : iterator( *this, val, i ); 59 88 } 60 61 /// Increments on invalid 62 iterator& next_valid() { 63 if ( ! is_valid() ) { ++(*this); } 64 return *this; 89 if ( i == 0 ) break; 90 } 91 return end(); 92 } 93 const_iterator find( const Key &key ) const { 94 return const_iterator( const_cast< ErasableScopedMap< Key, Value >* >(this)->find( key ) ); 95 } 96 97 /// Finds the given key in the outermost scope inside the given scope where it occurs 98 iterator findNext( const_iterator &it, const Key &key ) { 99 if ( it.i == 0 ) return end(); 100 for ( size_type i = it.i - 1; ; --i ) { 101 typename Scope::iterator val = scopes[i].find( key ); 102 if ( val != scopes[i].end() ) { 103 return val->second == erased ? end() : iterator( *this, val, i ); 65 104 } 66 67 /// Decrements on invalid 68 iterator& prev_valid() { 69 if ( ! 
is_valid() ) { --(*this); } 70 return *this; 71 } 72 73 iterator(ErasableScopedMap< Key, Value > const &_map, const wrapped_iterator &_it, size_type _i) 74 : map(&_map), it(_it), i(_i) {} 75 76 public: 77 iterator(const iterator &that) : map(that.map), it(that.it), i(that.i) {} 78 iterator& operator= (const iterator &that) { 79 map = that.map; i = that.i; it = that.it; 80 return *this; 81 } 82 83 reference operator* () { return *it; } 84 pointer operator-> () { return it.operator->(); } 85 86 iterator& operator++ () { 87 if ( it == map->scopes[i].end() ) { 88 if ( i == 0 ) return *this; 89 --i; 90 it = map->scopes[i].begin(); 91 } else { 92 ++it; 93 } 94 return next_valid(); 95 } 96 iterator& operator++ (int) { iterator tmp = *this; ++(*this); return tmp; } 97 98 iterator& operator-- () { 99 // may fail if this is the begin iterator; allowed by STL spec 100 if ( it == map->scopes[i].begin() ) { 101 ++i; 102 it = map->scopes[i].end(); 103 } 104 --it; 105 return prev_valid(); 106 } 107 iterator& operator-- (int) { iterator tmp = *this; --(*this); return tmp; } 108 109 bool operator== (const iterator &that) { 110 return map == that.map && i == that.i && it == that.it; 111 } 112 bool operator!= (const iterator &that) { return !( *this == that ); } 113 114 private: 115 ErasableScopedMap< Key, Value > const *map; 116 wrapped_iterator it; 117 size_type i; 118 }; 119 120 class const_iterator : public std::iterator< std::bidirectional_iterator_tag, 121 value_type > { 122 friend class ErasableScopedMap; 123 typedef typename std::map< Key, Value >::iterator wrapped_iterator; 124 typedef typename std::map< Key, Value >::const_iterator wrapped_const_iterator; 125 typedef typename std::vector< std::map< Key, Value > > scope_list; 126 typedef typename scope_list::size_type size_type; 127 128 /// Checks if this iterator points to a valid item 129 bool is_valid() const { 130 return it != map->scopes[i].end() && it->second != map->erased; 131 } 132 133 /// Increments on invalid 134 const_iterator& next_valid() { 135 if ( ! is_valid() ) { ++(*this); } 136 return *this; 137 } 138 139 /// Decrements on invalid 140 const_iterator& prev_valid() { 141 if ( ! 
is_valid() ) { --(*this); } 142 return *this; 143 } 144 145 const_iterator(ErasableScopedMap< Key, Value > const &_map, const wrapped_const_iterator &_it, size_type _i) 146 : map(&_map), it(_it), i(_i) {} 147 public: 148 const_iterator(const iterator &that) : map(that.map), it(that.it), i(that.i) {} 149 const_iterator(const const_iterator &that) : map(that.map), it(that.it), i(that.i) {} 150 const_iterator& operator= (const iterator &that) { 151 map = that.map; i = that.i; it = that.it; 152 return *this; 153 } 154 const_iterator& operator= (const const_iterator &that) { 155 map = that.map; i = that.i; it = that.it; 156 return *this; 157 } 158 159 const_reference operator* () { return *it; } 160 const_pointer operator-> () { return it.operator->(); } 161 162 const_iterator& operator++ () { 163 if ( it == map->scopes[i].end() ) { 164 if ( i == 0 ) return *this; 165 --i; 166 it = map->scopes[i].begin(); 167 } else { 168 ++it; 169 } 170 return next_valid(); 171 } 172 const_iterator& operator++ (int) { const_iterator tmp = *this; ++(*this); return tmp; } 173 174 const_iterator& operator-- () { 175 // may fail if this is the begin iterator; allowed by STL spec 176 if ( it == map->scopes[i].begin() ) { 177 ++i; 178 it = map->scopes[i].end(); 179 } 180 --it; 181 return prev_valid(); 182 } 183 const_iterator& operator-- (int) { const_iterator tmp = *this; --(*this); return tmp; } 184 185 bool operator== (const const_iterator &that) { 186 return map == that.map && i == that.i && it == that.it; 187 } 188 bool operator!= (const const_iterator &that) { return !( *this == that ); } 189 190 private: 191 ErasableScopedMap< Key, Value > const *map; 192 wrapped_const_iterator it; 193 size_type i; 194 }; 195 196 /// Starts a new scope 197 void beginScope() { 198 Scope scope; 199 scopes.push_back(scope); 200 } 201 202 /// Ends a scope; invalidates any iterators pointing to elements of that scope 203 void endScope() { 204 scopes.pop_back(); 205 assert( ! scopes.empty() ); 206 } 207 208 /// Default constructor initializes with one scope 209 ErasableScopedMap( const Value &erased_ ) : erased( erased_ ) { beginScope(); } 210 211 iterator begin() { return iterator(*this, scopes.back().begin(), scopes.size()-1).next_valid(); } 212 const_iterator begin() const { return const_iterator(*this, scopes.back().begin(), scopes.size()-1).next_valid(); } 213 const_iterator cbegin() const { return const_iterator(*this, scopes.back().begin(), scopes.size()-1).next_valid(); } 214 iterator end() { return iterator(*this, scopes[0].end(), 0); } 215 const_iterator end() const { return const_iterator(*this, scopes[0].end(), 0); } 216 const_iterator cend() const { return const_iterator(*this, scopes[0].end(), 0); } 217 218 /// Gets the index of the current scope (counted from 1) 219 size_type currentScope() const { return scopes.size(); } 220 221 /// Finds the given key in the outermost scope it occurs; returns end() for none such 222 iterator find( const Key &key ) { 223 for ( size_type i = scopes.size() - 1; ; --i ) { 224 typename Scope::iterator val = scopes[i].find( key ); 225 if ( val != scopes[i].end() ) { 226 return val->second == erased ? 
end() : iterator( *this, val, i ); 227 } 228 if ( i == 0 ) break; 229 } 230 return end(); 231 } 232 const_iterator find( const Key &key ) const { 233 return const_iterator( const_cast< ErasableScopedMap< Key, Value >* >(this)->find( key ) ); 234 } 235 236 /// Finds the given key in the outermost scope inside the given scope where it occurs 237 iterator findNext( const_iterator &it, const Key &key ) { 238 if ( it.i == 0 ) return end(); 239 for ( size_type i = it.i - 1; ; --i ) { 240 typename Scope::iterator val = scopes[i].find( key ); 241 if ( val != scopes[i].end() ) { 242 return val->second == erased ? end() : iterator( *this, val, i ); 243 } 244 if ( i == 0 ) break; 245 } 246 return end(); 247 } 248 const_iterator findNext( const_iterator &it, const Key &key ) const { 249 return const_iterator( const_cast< ErasableScopedMap< Key, Value >* >(this)->findNext( it, key ) ); 250 } 251 252 /// Inserts the given key-value pair into the outermost scope 253 std::pair< iterator, bool > insert( const value_type &value ) { 254 std::pair< typename Scope::iterator, bool > res = scopes.back().insert( value ); 255 return std::make_pair( iterator(*this, res.first, scopes.size()-1), res.second ); 256 } 257 std::pair< iterator, bool > insert( const Key &key, const Value &value ) { return insert( std::make_pair( key, value ) ); } 258 259 /// Marks the given element as erased from this scope inward; returns 1 for erased an element, 0 otherwise 260 size_type erase( const Key &key ) { 261 typename Scope::iterator val = scopes.back().find( key ); 262 if ( val != scopes.back().end() ) { 263 val->second = erased; 264 return 1; 265 } else { 266 scopes.back().insert( val, std::make_pair( key, erased ) ); 267 return 0; 268 } 269 } 270 271 Value& operator[] ( const Key &key ) { 272 iterator slot = find( key ); 273 if ( slot != end() ) return slot->second; 274 return insert( key, Value() ).first->second; 275 } 276 }; 105 if ( i == 0 ) break; 106 } 107 return end(); 108 } 109 const_iterator findNext( const_iterator &it, const Key &key ) const { 110 return const_iterator( const_cast< ErasableScopedMap< Key, Value >* >(this)->findNext( it, key ) ); 111 } 112 113 /// Inserts the given key-value pair into the outermost scope 114 std::pair< iterator, bool > insert( const value_type &value ) { 115 std::pair< typename Scope::iterator, bool > res = scopes.back().insert( value ); 116 return std::make_pair( iterator(*this, res.first, scopes.size()-1), res.second ); 117 } 118 std::pair< iterator, bool > insert( const Key &key, const Value &value ) { return insert( std::make_pair( key, value ) ); } 119 120 /// Marks the given element as erased from this scope inward; returns 1 for erased an element, 0 otherwise 121 size_type erase( const Key &key ) { 122 typename Scope::iterator val = scopes.back().find( key ); 123 if ( val != scopes.back().end() ) { 124 val->second = erased; 125 return 1; 126 } else { 127 scopes.back().insert( val, std::make_pair( key, erased ) ); 128 return 0; 129 } 130 } 131 132 Value& operator[] ( const Key &key ) { 133 iterator slot = find( key ); 134 if ( slot != end() ) return slot->second; 135 return insert( key, Value() ).first->second; 136 } 137 }; 138 139 template<typename Key, typename Value> 140 class ErasableScopedMap<Key, Value>::iterator : 141 public std::iterator< std::bidirectional_iterator_tag, value_type > { 142 friend class ErasableScopedMap; 143 typedef typename std::map< Key, Value >::iterator wrapped_iterator; 144 typedef typename std::vector< std::map< Key, Value > > scope_list; 145 
typedef typename scope_list::size_type size_type; 146 147 /// Checks if this iterator points to a valid item 148 bool is_valid() const { 149 return it != map->scopes[i].end() && it->second != map->erased; 150 } 151 152 /// Increments on invalid 153 iterator& next_valid() { 154 if ( ! is_valid() ) { ++(*this); } 155 return *this; 156 } 157 158 /// Decrements on invalid 159 iterator& prev_valid() { 160 if ( ! is_valid() ) { --(*this); } 161 return *this; 162 } 163 164 iterator(ErasableScopedMap< Key, Value > const &_map, const wrapped_iterator &_it, size_type _i) 165 : map(&_map), it(_it), i(_i) {} 166 167 public: 168 iterator(const iterator &that) : map(that.map), it(that.it), i(that.i) {} 169 iterator& operator= (const iterator &that) { 170 map = that.map; i = that.i; it = that.it; 171 return *this; 172 } 173 174 reference operator* () { return *it; } 175 pointer operator-> () { return it.operator->(); } 176 177 iterator& operator++ () { 178 if ( it == map->scopes[i].end() ) { 179 if ( i == 0 ) return *this; 180 --i; 181 it = map->scopes[i].begin(); 182 } else { 183 ++it; 184 } 185 return next_valid(); 186 } 187 188 iterator& operator++ (int) { iterator tmp = *this; ++(*this); return tmp; } 189 190 iterator& operator-- () { 191 // may fail if this is the begin iterator; allowed by STL spec 192 if ( it == map->scopes[i].begin() ) { 193 ++i; 194 it = map->scopes[i].end(); 195 } 196 --it; 197 return prev_valid(); 198 } 199 iterator& operator-- (int) { iterator tmp = *this; --(*this); return tmp; } 200 201 bool operator== (const iterator &that) { 202 return map == that.map && i == that.i && it == that.it; 203 } 204 bool operator!= (const iterator &that) { return !( *this == that ); } 205 206 private: 207 ErasableScopedMap< Key, Value > const *map; 208 wrapped_iterator it; 209 size_type i; 210 }; 211 212 template<typename Key, typename Value> 213 class ErasableScopedMap<Key, Value>::const_iterator : 214 public std::iterator< std::bidirectional_iterator_tag, value_type > { 215 friend class ErasableScopedMap; 216 typedef typename std::map< Key, Value >::iterator wrapped_iterator; 217 typedef typename std::map< Key, Value >::const_iterator wrapped_const_iterator; 218 typedef typename std::vector< std::map< Key, Value > > scope_list; 219 typedef typename scope_list::size_type size_type; 220 221 /// Checks if this iterator points to a valid item 222 bool is_valid() const { 223 return it != map->scopes[i].end() && it->second != map->erased; 224 } 225 226 /// Increments on invalid 227 const_iterator& next_valid() { 228 if ( ! is_valid() ) { ++(*this); } 229 return *this; 230 } 231 232 /// Decrements on invalid 233 const_iterator& prev_valid() { 234 if ( ! 
is_valid() ) { --(*this); } 235 return *this; 236 } 237 238 const_iterator(ErasableScopedMap< Key, Value > const &_map, const wrapped_const_iterator &_it, size_type _i) 239 : map(&_map), it(_it), i(_i) {} 240 public: 241 const_iterator(const iterator &that) : map(that.map), it(that.it), i(that.i) {} 242 const_iterator(const const_iterator &that) : map(that.map), it(that.it), i(that.i) {} 243 const_iterator& operator= (const iterator &that) { 244 map = that.map; i = that.i; it = that.it; 245 return *this; 246 } 247 const_iterator& operator= (const const_iterator &that) { 248 map = that.map; i = that.i; it = that.it; 249 return *this; 250 } 251 252 const_reference operator* () { return *it; } 253 const_pointer operator-> () { return it.operator->(); } 254 255 const_iterator& operator++ () { 256 if ( it == map->scopes[i].end() ) { 257 if ( i == 0 ) return *this; 258 --i; 259 it = map->scopes[i].begin(); 260 } else { 261 ++it; 262 } 263 return next_valid(); 264 } 265 const_iterator& operator++ (int) { const_iterator tmp = *this; ++(*this); return tmp; } 266 267 const_iterator& operator-- () { 268 // may fail if this is the begin iterator; allowed by STL spec 269 if ( it == map->scopes[i].begin() ) { 270 ++i; 271 it = map->scopes[i].end(); 272 } 273 --it; 274 return prev_valid(); 275 } 276 const_iterator& operator-- (int) { const_iterator tmp = *this; --(*this); return tmp; } 277 278 bool operator== (const const_iterator &that) { 279 return map == that.map && i == that.i && it == that.it; 280 } 281 bool operator!= (const const_iterator &that) { return !( *this == that ); } 282 283 private: 284 ErasableScopedMap< Key, Value > const *map; 285 wrapped_const_iterator it; 286 size_type i; 287 }; 288 277 289 } // namespace GenPoly 278 290 -
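Note on the ErasableScopedMap restructuring: the iterator classes are moved out of the class body, but the behaviour documented in the header comment is unchanged: insertion goes to the innermost scope, lookup searches outward, and erase() only shadows a key (via the sentinel value) from the current scope inward. A short usage sketch of those semantics, assuming the header above resolves on the include path:

// Usage sketch; assumes src/GenPoly is on the include path so the header above is found.
#include "ErasableScopedMap.h"
#include <cassert>
#include <string>

int main() {
	// -1 is the sentinel value stored for "erased" keys.
	GenPoly::ErasableScopedMap< std::string, int > vars( -1 );

	vars.insert( "T", 1 );                      // goes into the outermost scope
	vars.beginScope();
	vars.insert( "U", 2 );                      // goes into the inner scope
	assert( vars.find( "T" ) != vars.end() );   // lookup searches outward
	vars.erase( "T" );                          // shadowed from this scope inward...
	assert( vars.find( "T" ) == vars.end() );
	vars.endScope();                            // ...but the erasure itself is scoped
	assert( vars.find( "T" ) != vars.end() && vars.find( "T" )->second == 1 );
	assert( vars.find( "U" ) == vars.end() );   // the inner scope is gone
}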
src/GenPoly/GenPoly.cc
r7d9598d8 r2dcd80a 783 783 const ast::FunctionType * function = getFunctionType( expr->func->result ); 784 784 assertf( function, "ApplicationExpr has non-function type: %s", toString( expr->func->result ).c_str() ); 785 TypeVarMap exprTyVars = { ast::TypeD ecl::Data() };785 TypeVarMap exprTyVars = { ast::TypeData() }; 786 786 makeTypeVarMap( function, exprTyVars ); 787 787 return needsBoxing( param, arg, exprTyVars, subst ); … … 793 793 794 794 void addToTypeVarMap( const ast::TypeInstType * type, TypeVarMap & typeVars ) { 795 typeVars.insert( *type, ast::TypeD ecl::Data( type->base ) );795 typeVars.insert( *type, ast::TypeData( type->base ) ); 796 796 } 797 797 -
src/GenPoly/GenPoly.h
r7d9598d8 r2dcd80a 20 20 21 21 #include "ErasableScopedMap.h" // for ErasableScopedMap 22 #include "AST/Decl.hpp" // for TypeDecl::Data22 #include "AST/Decl.hpp" // for AggregateDecl 23 23 #include "AST/Fwd.hpp" // for ApplicationExpr, BaseInstType, Func... 24 #include "AST/Type.hpp" // for TypeInstType::TypeEnvKey25 24 #include "SymTab/Mangler.h" // for Mangler 26 25 #include "SynTree/Declaration.h" // for TypeDecl::Data, AggregateDecl, Type... 27 26 #include "SynTree/SynTree.h" // for Visitor Nodes 28 27 28 namespace ast { 29 struct TypeEnvKey; 30 } 31 29 32 namespace GenPoly { 30 33 31 34 typedef ErasableScopedMap< std::string, TypeDecl::Data > TyVarMap; 32 using TypeVarMap = ErasableScopedMap< ast::Type InstType::TypeEnvKey, ast::TypeDecl::Data >;35 using TypeVarMap = ErasableScopedMap< ast::TypeEnvKey, ast::TypeData >; 33 36 34 37 /// Replaces a TypeInstType by its referrent in the environment, if applicable -
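Note on the include change above: swapping #include "AST/Type.hpp" for namespace ast { struct TypeEnvKey; } is only possible because the key is now a namespace-scope struct; a nested class (the old TypeInstType::TypeEnvKey) cannot be forward-declared from outside its enclosing class. A toy illustration of why the hoisting pays off for header dependencies (hypothetical names):

#include <map>
#include <string>

// What a light header can now do: forward-declare the key type...
namespace toy { struct EnvKey; }

// ...and still declare aliases and interfaces that mention it, without including the full definition.
namespace toy {
	template< typename K, typename V > class ScopedMap;     // likewise only declared
	using VarMap = ScopedMap< EnvKey, int >;                 // fine: nothing is instantiated yet
	std::string describe( const EnvKey & );                  // fine: parameter used by reference
}

// Full definitions, normally living in their own header / translation unit.
namespace toy {
	struct EnvKey { std::string name; int usage; };
	template< typename K, typename V > class ScopedMap { std::map< K, V > m; };
	std::string describe( const EnvKey & k ) { return k.name + "_" + std::to_string( k.usage ); }
}

int main() { return toy::describe( toy::EnvKey{ "T", 2 } ) == "T_2" ? 0 : 1; }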
src/Parser/DeclarationNode.cc
r7d9598d8 r2dcd80a 254 254 } // DeclarationNode::newAggregate 255 255 256 DeclarationNode * DeclarationNode::newEnum( const string * name, DeclarationNode * constants, bool body, bool typed, DeclarationNode * base ) {256 DeclarationNode * DeclarationNode::newEnum( const string * name, DeclarationNode * constants, bool body, bool typed, DeclarationNode * base, EnumHiding hiding ) { 257 257 DeclarationNode * newnode = new DeclarationNode; 258 258 newnode->type = new TypeData( TypeData::Enum ); … … 262 262 newnode->type->enumeration.anon = name == nullptr; 263 263 newnode->type->enumeration.typed = typed; 264 newnode->type->enumeration.hiding = hiding; 264 265 if ( base && base->type) { 265 266 newnode->type->base = base->type; -
src/Parser/ParseNode.h
r7d9598d8 r2dcd80a 239 239 static DeclarationNode * newFunction( const std::string * name, DeclarationNode * ret, DeclarationNode * param, StatementNode * body ); 240 240 static DeclarationNode * newAggregate( AggregateDecl::Aggregate kind, const std::string * name, ExpressionNode * actuals, DeclarationNode * fields, bool body ); 241 static DeclarationNode * newEnum( const std::string * name, DeclarationNode * constants, bool body, bool typed, DeclarationNode * base = nullptr );241 static DeclarationNode * newEnum( const std::string * name, DeclarationNode * constants, bool body, bool typed, DeclarationNode * base = nullptr, EnumHiding hiding = EnumHiding::Visible ); 242 242 static DeclarationNode * newEnumConstant( const std::string * name, ExpressionNode * constant ); 243 243 static DeclarationNode * newEnumValueGeneric( const std::string * name, InitializerNode * init ); -
src/Parser/TypeData.cc
r7d9598d8 r2dcd80a 923 923 buildList( td->enumeration.constants, ret->get_members() ); 924 924 list< Declaration * >::iterator members = ret->get_members().begin(); 925 ret->hide = td->enumeration.hiding == EnumHiding::Hide ? EnumDecl::EnumHiding::Hide : EnumDecl::EnumHiding::Visible; 925 926 for ( const DeclarationNode * cur = td->enumeration.constants; cur != nullptr; cur = dynamic_cast< DeclarationNode * >( cur->get_next() ), ++members ) { 926 927 if ( cur->enumInLine ) { -
src/Parser/TypeData.h
r7d9598d8 r2dcd80a 60 60 bool anon; 61 61 bool typed; 62 EnumHiding hiding; 62 63 }; 63 64 -
src/Parser/parser.yy
r7d9598d8 r2dcd80a 10 10 // Created On : Sat Sep 1 20:22:55 2001 11 11 // Last Modified By : Peter A. Buhr 12 // Last Modified On : Wed Nov 2 21:31:21202213 // Update Count : 58 1012 // Last Modified On : Mon Nov 21 22:34:30 2022 13 // Update Count : 5848 14 14 // 15 15 … … 383 383 %type<ifctl> conditional_declaration 384 384 %type<fctl> for_control_expression for_control_expression_list 385 %type<compop> up down updowneq downupdowneq385 %type<compop> upupeq updown updowneq downupdowneq 386 386 %type<en> subrange 387 387 %type<decl> asm_name_opt … … 489 489 %type<decl> type_parameter type_parameter_list type_initializer_opt 490 490 491 %type<en> type_parameters_opt type_list 491 %type<en> type_parameters_opt type_list array_type_list 492 492 493 493 %type<decl> type_qualifier type_qualifier_name forall type_qualifier_list_opt type_qualifier_list … … 551 551 552 552 %% 553 // ************************* Namespace Management ********************************553 // ************************ Namespace Management ******************************** 554 554 555 555 // The C grammar is not context free because it relies on the distinct terminal symbols "identifier" and "TYPEDEFname", … … 588 588 ; 589 589 590 // ************************* CONSTANTS ********************************590 // ************************ CONSTANTS ******************************** 591 591 592 592 constant: … … 634 634 ; 635 635 636 // ************************* EXPRESSIONS ********************************636 // ************************ EXPRESSIONS ******************************** 637 637 638 638 primary_expression: … … 1101 1101 ; 1102 1102 1103 // *************************** STATEMENTS *******************************1103 // ************************** STATEMENTS ******************************* 1104 1104 1105 1105 statement: … … 1758 1758 ; 1759 1759 1760 // ******************************* DECLARATIONS *********************************1760 // ****************************** DECLARATIONS ********************************* 1761 1761 1762 1762 declaration_list_opt: // used at beginning of switch statement … … 2558 2558 { typedefTable.makeTypedef( *$3 ); } 2559 2559 hide_opt '{' enumerator_list comma_opt '}' 2560 { $$ = DeclarationNode::newEnum( $3, $7, true, false)->addQualifiers( $2 ); }2560 { $$ = DeclarationNode::newEnum( $3, $7, true, false, nullptr, $5 )->addQualifiers( $2 ); } 2561 2561 | ENUM attribute_list_opt typedef_name // unqualified type name 2562 2562 hide_opt '{' enumerator_list comma_opt '}' 2563 { $$ = DeclarationNode::newEnum( $3->name, $6, true, false )->addQualifiers( $2 ); }2563 { $$ = DeclarationNode::newEnum( $3->name, $6, true, false, nullptr, $4 )->addQualifiers( $2 ); } 2564 2564 | ENUM '(' cfa_abstract_parameter_declaration ')' attribute_list_opt '{' enumerator_list comma_opt '}' 2565 2565 { … … 2580 2580 hide_opt '{' enumerator_list comma_opt '}' 2581 2581 { 2582 $$ = DeclarationNode::newEnum( $6, $11, true, true, $3 )->addQualifiers( $5 )->addQualifiers( $7 );2582 $$ = DeclarationNode::newEnum( $6, $11, true, true, $3, $9 )->addQualifiers( $5 )->addQualifiers( $7 ); 2583 2583 } 2584 2584 | ENUM '(' ')' attribute_list_opt identifier attribute_list_opt 2585 2585 hide_opt '{' enumerator_list comma_opt '}' 2586 2586 { 2587 $$ = DeclarationNode::newEnum( $5, $9, true, true, nullptr )->addQualifiers( $4 )->addQualifiers( $6 );2587 $$ = DeclarationNode::newEnum( $5, $9, true, true, nullptr, $7 )->addQualifiers( $4 )->addQualifiers( $6 ); 2588 2588 } 2589 2589 | ENUM '(' cfa_abstract_parameter_declaration ')' 
attribute_list_opt typedef_name attribute_list_opt 2590 2590 hide_opt '{' enumerator_list comma_opt '}' 2591 2591 { 2592 $$ = DeclarationNode::newEnum( $6->name, $10, true, true, $3 )->addQualifiers( $5 )->addQualifiers( $7 );2592 $$ = DeclarationNode::newEnum( $6->name, $10, true, true, $3, $8 )->addQualifiers( $5 )->addQualifiers( $7 ); 2593 2593 } 2594 2594 | ENUM '(' ')' attribute_list_opt typedef_name attribute_list_opt 2595 2595 hide_opt '{' enumerator_list comma_opt '}' 2596 2596 { 2597 $$ = DeclarationNode::newEnum( $5->name, $9, true, true, nullptr )->addQualifiers( $4 )->addQualifiers( $6 );2597 $$ = DeclarationNode::newEnum( $5->name, $9, true, true, nullptr, $7 )->addQualifiers( $4 )->addQualifiers( $6 ); 2598 2598 } 2599 2599 | enum_type_nobody … … 2991 2991 ; 2992 2992 2993 // ***************************** EXTERNAL DEFINITIONS *****************************2993 // **************************** EXTERNAL DEFINITIONS ***************************** 2994 2994 2995 2995 translation_unit: … … 3653 3653 | '[' ']' multi_array_dimension 3654 3654 { $$ = DeclarationNode::newArray( 0, 0, false )->addArray( $3 ); } 3655 | '[' push assignment_expression pop ',' comma_expression ']' 3655 // Cannot use constant_expression because of tuples => semantic check 3656 | '[' push assignment_expression pop ',' comma_expression ']' // CFA 3656 3657 { $$ = DeclarationNode::newArray( $3, 0, false )->addArray( DeclarationNode::newArray( $6, 0, false ) ); } 3657 3658 // { SemanticError( yylloc, "New array dimension is currently unimplemented." ); $$ = nullptr; } 3659 | '[' push array_type_list pop ']' // CFA 3660 { SemanticError( yylloc, "Type array dimension is currently unimplemented." ); $$ = nullptr; } 3658 3661 | multi_array_dimension 3659 3662 ; 3663 3664 array_type_list: 3665 basic_type_name 3666 { $$ = new ExpressionNode( new TypeExpr( maybeMoveBuildType( $1 ) ) ); } 3667 | type_name 3668 { $$ = new ExpressionNode( new TypeExpr( maybeMoveBuildType( $1 ) ) ); } 3669 | assignment_expression upupeq assignment_expression 3670 | array_type_list ',' basic_type_name 3671 { $$ = (ExpressionNode *)($1->set_last( new ExpressionNode( new TypeExpr( maybeMoveBuildType( $3 ) ) ) )); } 3672 | array_type_list ',' type_name 3673 { $$ = (ExpressionNode *)($1->set_last( new ExpressionNode( new TypeExpr( maybeMoveBuildType( $3 ) ) ) )); } 3674 | array_type_list ',' assignment_expression upupeq assignment_expression 3675 ; 3676 3677 upupeq: 3678 '~' 3679 { $$ = OperKinds::LThan; } 3680 | ErangeUpEq 3681 { $$ = OperKinds::LEThan; } 3682 ; 3660 3683 3661 3684 multi_array_dimension: … … 3990 4013 // declaration lists (not prototype-format parameter type and identifier declarators) is an obsolescent feature. 3991 4014 3992 // ************************* MISCELLANEOUS ********************************4015 // ************************ MISCELLANEOUS ******************************** 3993 4016 3994 4017 comma_opt: // redundant comma -
src/ResolvExpr/CandidateFinder.cpp
r7d9598d8 r2dcd80a 221 221 ) { 222 222 for ( auto & tyvar : type->forall ) { 223 unifiableVars[ *tyvar ] = ast::TypeD ecl::Data{ tyvar->base };223 unifiableVars[ *tyvar ] = ast::TypeData{ tyvar->base }; 224 224 } 225 225 for ( auto & assn : type->assertions ) { -
src/ResolvExpr/FindOpenVars.cc
r7d9598d8 r2dcd80a 113 113 if ( nextIsOpen ) { 114 114 for ( auto & decl : type->forall ) { 115 open[ *decl ] = ast::TypeD ecl::Data{ decl->base };115 open[ *decl ] = ast::TypeData{ decl->base }; 116 116 } 117 117 for ( auto & assert : type->assertions ) { … … 120 120 } else { 121 121 for ( auto & decl : type->forall ) { 122 closed[ *decl ] = ast::TypeD ecl::Data{ decl->base };122 closed[ *decl ] = ast::TypeData{ decl->base }; 123 123 } 124 124 for ( auto & assert : type->assertions ) { -
src/ResolvExpr/RenameVars.cc
r7d9598d8 r2dcd80a 42 42 int next_usage_id = 1; 43 43 ScopedMap< std::string, std::string > nameMap; 44 ScopedMap< std::string, ast::Type InstType::TypeEnvKey > idMap;44 ScopedMap< std::string, ast::TypeEnvKey > idMap; 45 45 public: 46 46 void reset() { … … 121 121 assert(false); 122 122 } 123 idMap[ td->name ] = ast::Type InstType::TypeEnvKey(*mut);124 123 idMap[ td->name ] = ast::TypeEnvKey( *mut ); 124 125 125 td = mut; 126 126 } -
src/ResolvExpr/Unify.cc
r7d9598d8 r2dcd80a 1166 1166 if ( entry1->second.kind != entry2->second.kind ) return false; 1167 1167 return env.bindVarToVar( 1168 var1, var2, ast::TypeD ecl::Data{ entry1->second, entry2->second }, need, have,1168 var1, var2, ast::TypeData{ entry1->second, entry2->second }, need, have, 1169 1169 open, widen, symtab ); 1170 1170 } else if ( isopen1 ) { -
src/SynTree/Declaration.h
r7d9598d8 r2dcd80a 340 340 bool isTyped; 341 341 Type * base; 342 enum EnumHiding { Visible, Hide } hide; 342 343 343 344 EnumDecl( const std::string & name, … … 345 346 bool isTyped = false, LinkageSpec::Spec linkage = LinkageSpec::Cforall, 346 347 Type * baseType = nullptr ) 347 : Parent( name, attributes, linkage ), isTyped(isTyped), base( baseType ) {}348 : Parent( name, attributes, linkage ), isTyped(isTyped), base( baseType ) {} 348 349 EnumDecl( const EnumDecl & other ) 349 350 : Parent( other ), isTyped( other.isTyped), base( other.base ) {} -
tests/Makefile.am
r7d9598d8 r2dcd80a 68 68 .PHONY: list .validate .test_makeflags 69 69 .INTERMEDIATE: .validate .validate.cfa .test_makeflags 70 EXTRA_PROGRAMS = avl_test linkonce .dummy_hack # build but do not install70 EXTRA_PROGRAMS = avl_test linkonce linking/mangling/anon .dummy_hack # build but do not install 71 71 EXTRA_DIST = test.py \ 72 72 pybin/__init__.py \ … … 101 101 avl_test_SOURCES = avltree/avl_test.cfa avltree/avl0.cfa avltree/avl1.cfa avltree/avl2.cfa avltree/avl3.cfa avltree/avl4.cfa avltree/avl-private.cfa 102 102 linkonce_SOURCES = link-once/main.cfa link-once/partner.cfa 103 linking_mangling_anon_SOURCES = linking/mangling/header.hfa linking/mangling/lib.cfa linking/mangling/main.cfa 103 104 # automake doesn't know we still need C/CPP rules so pretend like we have a C program 104 105 nodist__dummy_hack_SOURCES = .dummy_hack.c .dummy_hackxx.cpp -
tests/PRNG.cfa
r7d9598d8 r2dcd80a 8 8 // Created On : Wed Dec 29 09:38:12 2021 9 9 // Last Modified By : Peter A. Buhr 10 // Last Modified On : Sat Apr 9 15:21:14202211 // Update Count : 3 4410 // Last Modified On : Tue Nov 22 22:51:12 2022 11 // Update Count : 381 12 12 // 13 13 … … 22 22 #include <mutex_stmt.hfa> 23 23 24 #ifdef __x86_64__ // 64-bit architecture 25 #define PRNG PRNG64 26 #else // 32-bit architecture 27 #define PRNG PRNG32 28 #endif // __x86_64__ 29 24 30 #ifdef TIME // use -O2 -nodebug 25 31 #define STARTTIME start = timeHiRes() … … 54 60 55 61 56 u int32_t seed = 1009;62 unsigned int seed = 1009; 57 63 58 64 thread T1 {}; … … 158 164 #if 1 159 165 PRNG prng; 166 160 167 if ( seed != 0 ) set_seed( prng, seed ); 161 168 -
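The PRNG.cfa hunk above selects a 64-bit generator on 64-bit builds and a 32-bit one otherwise. Purely as a free-standing sketch of that selection pattern (the xorshift routines and names below are placeholders, not the PRNG32/PRNG64 types the test actually uses):

    /* Standalone sketch, not the test code: compile-time selection of the
       generator word width, mirroring the __x86_64__ check added to
       tests/PRNG.cfa. The xorshift generators are illustrative stand-ins. */
    #include <stdint.h>
    #include <stdio.h>

    #if defined(__x86_64__)                         /* 64-bit architecture */
    typedef uint64_t prng_word;
    static prng_word prng_next( prng_word * s ) {   /* xorshift64 */
        *s ^= *s << 13;  *s ^= *s >> 7;  *s ^= *s << 17;  return *s;
    }
    #else                                           /* 32-bit architecture */
    typedef uint32_t prng_word;
    static prng_word prng_next( prng_word * s ) {   /* xorshift32 */
        *s ^= *s << 13;  *s ^= *s >> 17;  *s ^= *s << 5;  return *s;
    }
    #endif

    int main( void ) {
        prng_word state = 1009;                     /* same seed value as the test */
        printf( "%llu\n", (unsigned long long)prng_next( &state ) );
        return 0;
    }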
tests/concurrent/barrier/generation.cfa
r7d9598d8 r2dcd80a 37 37 for(c; 'A' ~= 'Z') { 38 38 // Yield for chaos 39 yield( prng(this, 10));39 yield( prng(this, 10) ); 40 40 41 41 // Print the generation, no newline because … … 43 43 44 44 // Yield again for more chaos 45 yield( prng(this, 10));45 yield( prng(this, 10) ); 46 46 47 47 // Block on the barrier -
tests/concurrent/barrier/order.cfa
r7d9598d8 r2dcd80a 37 37 for(l; NUM_LAPS) { 38 38 // Yield for chaos 39 yield( prng(this, 10));39 yield( prng(this, 10) ); 40 40 41 41 // Block and what order we arrived -
tests/concurrent/once.cfa
r7d9598d8 r2dcd80a 30 30 31 31 // sometime yields 32 yield( prng(this, 3));32 yield( prng(this, 3) ); 33 33 } 34 34 } -
tests/concurrent/readyQ/leader_spin.cfa
r7d9598d8 r2dcd80a 26 26 } 27 27 28 PRNG lead_rng;28 PRNG64 lead_rng; 29 29 volatile unsigned leader; 30 30 volatile size_t lead_idx; 31 31 32 const u nsignednthreads = 17;33 const u nsignedstop_count = 327;32 const uint64_t nthreads = 17; 33 const uint64_t stop_count = 327; 34 34 35 35 thread$ * the_main; … … 50 50 for(i; nthreads) { 51 51 while( threads[i]->idx != lead_idx ) { 52 Pause();52 sched_yield(); 53 53 } 54 54 } -
tests/io/away_fair.cfa
r7d9598d8 r2dcd80a 41 41 42 42 if(last == curr) { 43 Pause();43 sched_yield(); 44 44 continue; 45 45 } -
tests/io/comp_fair.cfa
r7d9598d8 r2dcd80a 52 52 53 53 if(last == curr) { 54 Pause();54 sched_yield(); 55 55 continue; 56 56 } -
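The leader_spin.cfa, away_fair.cfa, and comp_fair.cfa hunks above all replace Pause() inside a spin-wait loop with sched_yield(), so a waiter gives up its kernel thread instead of spinning on a pause instruction. A rough, free-standing C sketch of that waiting pattern, with a made-up counter and function name:

    /* Free-standing sketch (not the test code): wait for another thread to
       advance a shared counter, yielding the kernel thread on every check
       instead of busy-waiting with a CPU pause instruction. */
    #include <sched.h>
    #include <stdatomic.h>

    static void wait_for_progress( atomic_ulong * counter, unsigned long last ) {
        while ( atomic_load( counter ) == last ) {
            sched_yield();   /* let other runnable threads, e.g. the leader, run */
        }
    }

Yielding matters when there are more spinning waiters than available processors: a pause instruction keeps the waiter on its CPU, whereas sched_yield() gives the thread being waited on a chance to be scheduled.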
tools/gdb/utils-gdb.py
r7d9598d8 r2dcd80a 23 23 gdb.execute('handle SIGUSR1 nostop noprint pass') 24 24 25 CfaTypes = collections.namedtuple('CfaTypes', 'cluster_ptr processor_ptr thread_ptr int_ptr thread_state yield_state')25 CfaTypes = collections.namedtuple('CfaTypes', 'cluster_ptr processor_ptr thread_ptr int_ptr uintptr thread_state yield_state') 26 26 27 27 class ThreadInfo: … … 55 55 thread_ptr = gdb.lookup_type('struct thread$').pointer(), 56 56 int_ptr = gdb.lookup_type('int').pointer(), 57 uintptr = gdb.lookup_type('uintptr_t'), 57 58 thread_state = gdb.lookup_type('enum __Coroutine_State'), 58 59 yield_state = gdb.lookup_type('enum __Preemption_Reason')) … … 89 90 return argv 90 91 92 def single_field(obj): 93 """ 94 If the struct only has one field return it, otherwise error 95 """ 96 97 _type = obj.type 98 if len(_type.fields()) != 1: 99 return None 100 101 return obj[_type.fields()[0].name] 102 103 104 def start_from_dlist(dlist): 105 fs = dlist.type.fields() 106 if len(fs) != 1: 107 print("Error, can't understand dlist type for", dlist, dlist.name, dlist.type) 108 return None 109 110 return dlist[fs[0]] 111 112 def fix_dlink(ptr): 113 """ 114 Remove the higher order bit from the pointer 115 """ 116 ptype = ptr.type 117 size = ptype.sizeof 118 if size == 8: 119 bit = 1 << ((size*8)-1) 120 mask = bit - 1 121 elif size == 4: 122 bit = 0 123 mask = 1 124 else: 125 print("Unexpected pointer size while fixing dlink", size) 126 127 cfa_t = get_cfa_types() 128 uptr = ptr.cast(cfa_t.uintptr) 129 return ptr if 0 == uptr & mask else gdb.Value(b'\x00'*size, ptype) 130 91 131 class ClusterIter: 92 132 def __init__(self, root): … … 139 179 def check(self): 140 180 # check if this is the last value 141 addr = int(self.curr) 142 mask = 1 << ((8 * int(gdb.parse_and_eval('sizeof(void*)'))) - 1) 143 if 0 != (mask & addr): 181 if not fix_dlink(self.curr): 144 182 raise StopIteration 145 183 … … 168 206 return self.curr 169 207 170 def start_from_dlist(dlist):171 fs = dlist.type.fields()172 if len(fs) != 1:173 print("Error, can't understand dlist type for", dlist, dlist.name, dlist.type)174 return None175 176 return dlist[fs[0]]177 178 208 def proc_list(cluster): 179 209 """ … … 183 213 proclist = cluster['_X5procsS19__cluster_proc_list_1'] 184 214 185 idle = start_from_dlist(proclist['_X5idlesS5dlist_S9processorS5dlink_S9processor___1']) 186 active = start_from_dlist(proclist['_X7activesS5dlist_S9processorS5dlink_S9processor___1']) 215 idle = start_from_dlist(proclist['_X5idlesS5dlist_S9processorS5dlink_S9processor___1'])['_X4nextPY13__tE_generic__1'] 216 active = start_from_dlist(proclist['_X7activesS5dlist_S9processorS5dlink_S9processor___1'])['_X4nextPY13__tE_generic__1'] 187 217 return ProcIter(active.cast(cfa_t.processor_ptr)), ProcIter(idle.cast(cfa_t.processor_ptr)) 188 218 … … 261 291 262 292 cfa_t = get_cfa_types() 263 root = cluster['_X7threadsS8__dllist_S7thread$__1']['_X4headPY15__TYPE_generic__1'].cast(cfa_t.thread_ptr) 293 head = single_field(cluster['_X7threadsS5dlist_S7thread$S18__thread_user_link__1']) 294 root = head['_X4nextPY13__tE_generic__1'].cast(cfa_t.thread_ptr) 264 295 265 296 if root == 0x0 or root.address == 0x0: … … 282 313 threads.append(t) 283 314 284 curr = curr['node']['next']315 curr = fix_dlink(single_field(curr['cltr_link'])['_X4nextPY13__tE_generic__1']).cast(cfa_t.thread_ptr) 285 316 if curr == root or curr == 0x0: 286 317 break … … 409 440 410 441 def print_formatted(self, marked, tid, name, state, address): 442 # print(marked, tid, name, state, address) 411 443 print('{:>1} 
{:>4} {:>20} {:>10} {:>20}'.format('*' if marked else ' ', tid, name, state, address)) 412 444 413 445 def print_thread(self, thread, tid, marked): 446 # print("print", thread, tid, marked) 414 447 cfa_t = get_cfa_types() 415 448 ys = str(thread['preempted'].cast(cfa_t.yield_state)) … … 439 472 440 473 self.print_formatted(False, '', 'Name', 'State', 'Address') 441 442 474 for t in threads: 443 475 if not t.is_system() or print_system:
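The fix_dlink helper added to utils-gdb.py above treats the high-order bit of an 8-byte link pointer as an end-of-list sentinel and maps such links to null before the thread and processor lists are walked. The same masking, written as a free-standing C sketch for the 64-bit case only (the function name is invented for illustration; the gdb helper uses a different mask for 4-byte pointers):

    /* Sketch of the sentinel masking performed by fix_dlink (helper name
       invented for this sketch): the top bit of a link pointer marks the end
       of the intrusive list, so a link with that bit set is reported as null
       instead of being followed. */
    #include <stddef.h>
    #include <stdint.h>

    static void * strip_list_sentinel( void * link ) {
        uintptr_t bits     = (uintptr_t)link;
        uintptr_t sentinel = (uintptr_t)1 << ( sizeof(void *) * 8 - 1 );  /* high-order bit */
        return ( bits & sentinel ) ? NULL : link;
    }

Keeping the end-of-list marker in an otherwise unused pointer bit appears to let the runtime's list links stay a single word while still distinguishing the list tail from a genuine null element, which is why the debugger script must strip it before casting and dereferencing.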