Changeset 2dcd80a


Timestamp:
Dec 14, 2022, 12:23:42 PM (3 years ago)
Author:
caparson <caparson@…>
Branches:
ADT, ast-experimental, master
Children:
441a6a7
Parents:
7d9598d8 (diff), d8bdf13 (diff)
Note: this is a merge changeset; the changes displayed below correspond to the merge itself.
Use the (diff) links above to see all the changes relative to each parent.
Message:

Merge branch 'master' of plg.uwaterloo.ca:software/cfa/cfa-cc

Files:
9 added
2 deleted
59 edited
8 moved

  • doc/theses/thierry_delisle_PhD/thesis/fig/base_ts2.fig

    r7d9598d8 r2dcd80a  
    1191192 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
    120120         2400 4950 3000 4950
    121 4 2 -1 50 -1 0 12 0.0000 2 135 645 2100 3075 Threads\001
    122 4 2 -1 50 -1 0 12 0.0000 2 180 525 2100 2850 Ready\001
    123 4 2 -1 50 -1 0 12 0.0000 2 180 660 2100 4200 Array of\001
    124 4 2 -1 50 -1 0 12 0.0000 2 165 600 2100 4425 Queues\001
    125 4 1 -1 50 -1 0 11 0.0000 2 120 210 2700 3550 TS\001
    126 4 2 -1 50 -1 0 12 0.0000 2 180 660 2100 5025 Array of\001
    127 4 1 -1 50 -1 0 11 0.0000 2 120 300 2700 4450 MA\001
    128 4 1 -1 50 -1 0 11 0.0000 2 120 210 2700 4225 TS\001
    129 4 1 -1 50 -1 0 11 0.0000 2 120 300 2700 5350 MA\001
    130 4 1 -1 50 -1 0 11 0.0000 2 120 210 2700 5125 TS\001
    131 4 2 -1 50 -1 0 12 0.0000 2 180 1590 2100 5250 Timestamps Copies\001
    132 4 2 -1 50 -1 0 12 0.0000 2 135 840 2100 6075 Processors\001
     1214 2 -1 50 -1 0 12 0.0000 2 135 630 2100 3075 Threads\001
     1224 2 -1 50 -1 0 12 0.0000 2 165 450 2100 2850 Ready\001
     1234 2 -1 50 -1 0 12 0.0000 2 165 720 2100 4200 Array of\001
     1244 2 -1 50 -1 0 12 0.0000 2 150 540 2100 4425 Queues\001
     1254 1 -1 50 -1 0 11 0.0000 2 135 180 2700 3550 TS\001
     1264 2 -1 50 -1 0 12 0.0000 2 165 720 2100 5025 Array of\001
     1274 1 -1 50 -1 0 11 0.0000 2 135 180 2700 4450 MA\001
     1284 1 -1 50 -1 0 11 0.0000 2 135 180 2700 4225 TS\001
     1294 1 -1 50 -1 0 11 0.0000 2 135 180 2700 5350 MA\001
     1304 1 -1 50 -1 0 11 0.0000 2 135 180 2700 5125 TS\001
     1314 2 -1 50 -1 0 12 0.0000 2 135 900 2100 6075 Processors\001
     1324 2 -1 50 -1 0 12 0.0000 2 165 1440 2100 5250 Timestamp Copies\001
  • doc/theses/thierry_delisle_PhD/thesis/local.bib

    r7d9598d8 r2dcd80a  
    615615}
    616616
     617@book{MAN:inteldev,
     618  key = {Intel 64 and IA-32 Architectures Software Developer’s Manual},
     619  title = {Intel® 64 and IA-32 Architectures Software Developer’s Manual},
     620  publisher = {Intel{\textregistered}},
     621  year = {2016},
     622  volume = {3B: System Programming Guide, Part 2},
     623  url = {\href{https://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-vol-3b-part-2-manual.html}{https://\-www.intel.com/\-content/\-www/\-us/\-en/\-architecture\--and\--technology/\-64\--ia\--32\--architectures\--software\--developer\--vol\--3b\--part\--2\--manual\-.html}},
     624}
     625
    617626@misc{MemcachedThreading,
    618627  author = {Oracle},
     
    673682}
    674683
     684@misc{MAN:kotlin,
     685  howpublished = {\href{https://kotlinlang.org/docs/multiplatform-mobile-concurrency-and-coroutines.html}{https://\-kotlinlang.org\-/docs\-/multiplatform\--mobile\--concurrency\--and\--coroutines.html}}
     686}
     687
    675688@misc{MAN:java/fork-join,
    676689  howpublished = {\href{https://www.baeldung.com/java-fork-join}{https://\-www.baeldung.com/\-java-fork-join}}
     
    965978  howpublished = "\href{https://en.wikipedia.org/wiki/Zipf%27s_law}{https://\-en.wikipedia.org/\-wiki/\-Zipf\%27s\-\_law}",
    966979  note = "[Online; accessed 7-September-2022]"
     980}
     981
     982@misc{wiki:rdtsc,
     983  author = "{Wikipedia contributors}",
     984  title = "Time Stamp Counter --- {W}ikipedia{,} The Free Encyclopedia",
     985  year = "2022",
     986  howpublished = "\href{https://en.wikipedia.org/wiki/Time\_Stamp\_Counter}{https://\-en.wikipedia.org/\-wiki/\-Time\-\_Stamp\-\_Counter}",
     987  note = "[Online; accessed 14-November-2022]"
     988}
     989
     990@misc{wiki:lockfree,
     991  author = "{Wikipedia contributors}",
     992  title = "Non-blocking algorithm --- {W}ikipedia{,} The Free Encyclopedia",
     993  year = "2022",
     994  howpublished = "\href{https://en.wikipedia.org/wiki/Non-blocking_algorithm}{https://en.wikipedia.org/\-wiki/Non\--blocking\-\_algorithm}",
     995  note = "[Online; accessed 22-November-2022]"
     996}
     997
     998@misc{wiki:expected,
     999  author = "{Wikipedia contributors}",
     1000  title = "Expected value --- {W}ikipedia{,} The Free Encyclopedia",
     1001  year = "2022",
     1002  howpublished = "\href{https://en.wikipedia.org/wiki/Expected_value}{https://en.wikipedia.org/\-wiki/\-Expected\-\_value}",
     1003  note = "[Online; accessed 22-November-2022]"
     1004}
     1005
     1006@misc{wiki:softirq,
     1007  author = "{Wikipedia contributors}",
     1008  title = "Interrupt --- {W}ikipedia{,} The Free Encyclopedia",
     1009  year = "2022",
     1010  howpublished = "\href{https://en.wikipedia.org/wiki/Interrupt}{https://en.wikipedia.org/\-wiki/\-Interrupt}",
     1011  note = "[Online; accessed 24-November-2022]"
    9671012}
    9681013
  • doc/theses/thierry_delisle_PhD/thesis/text/conclusion.tex

    r7d9598d8 r2dcd80a  
    3737Spinning up internal kernel threads to handle blocking scenarios is what developers already do outside of the kernel, and managing these threads adds a significant burden to the system.
    3838Nonblocking I/O should not be handled in this way.
     39Presumably, this is better handled by Windows's ``overlapped I/O''; however, porting \CFA to Windows is far beyond the scope of this work.
    3940
    4041\section{Goals}
     
    5859
    5960As \CFA aims to increase productivity and safety of C, while maintaining its performance, this places a huge burden on the \CFA runtime to achieve these goals.
    60 Productivity and safety manifest in removing scheduling pitfalls in the efficient usage of the threading runtime.
     61Productivity and safety manifest in removing scheduling pitfalls from the efficient usage of the threading runtime.
    6162Performance manifests in making efficient use of the underlying kernel threads that provide indirect access to the CPUs.
    6263
     
    99100I am aware there is a host of low-power research that could be tapped here.
    100101
     102\subsection{CPU Workloads}
     103A performance consideration related to idle sleep is CPU utilization, \ie, how much of the available processing capacity is actually used.
     104CPU utilization generally becomes an issue for workloads that are compute bound but where the dependencies among \ats prevent the scheduler from easily keeping every \proc busy.
     105Examining such workloads in the context of scheduling would be interesting.
     106However, such workloads are inherently more complex than the applications examined in this thesis, and as such warrant their own work.
     107
    101108\subsection{Hardware}
    102109One challenge that needed to be overcome for this thesis is that the modern x86-64 processors have very few tools to implement fairness.
  • doc/theses/thierry_delisle_PhD/thesis/text/core.tex

    r7d9598d8 r2dcd80a  
    2424In general, the expectation at the centre of this model is that ready \ats do not interfere with each other but simply share the hardware.
    2525This assumption makes it easier to reason about threading because ready \ats can be thought of in isolation and the effect of the scheduler can be virtually ignored.
    26 This expectation of \at independence means the scheduler is expected to offer two guarantees:
     26This expectation of \at independence means the scheduler is expected to offer two features:
    2727\begin{enumerate}
    28         \item A fairness guarantee: a \at that is ready to run is not prevented by another thread.
    29         \item A performance guarantee: a \at that wants to start or stop running is not prevented by other threads wanting to do the same.
     28        \item A fairness guarantee: a \at that is ready to run is not indefinitely prevented from running by another thread, \ie, starvation freedom. This is discussed further in the next section.
     29        \item A performance goal: given a \at that wants to start running, other threads wanting to do the same do not interfere with it.
    3030\end{enumerate}
    3131
    32 It is important to note that these guarantees are expected only up to a point.
    33 \Glspl{at} that are ready to run should not be prevented from doing so, but they still share the limited hardware resources.
    34 Therefore, the guarantee is considered respected if a \at gets access to a \emph{fair share} of the hardware resources, even if that share is very small.
    35 
    36 Similar to the performance guarantee, the lack of interference among threads is only relevant up to a point.
    37 Ideally, the cost of running and blocking should be constant regardless of contention, but the guarantee is considered satisfied if the cost is not \emph{too high} with or without contention.
     32The performance goal, the lack of interference among threads, is only desired up to a point.
     33Ideally, the cost of running and blocking should be constant regardless of contention, but the goal is considered satisfied if the cost is not \emph{too high} with or without contention.
    3834How much is an acceptable cost is obviously highly variable.
    3935For this document, the performance experimentation attempts to show the cost of scheduling is at worst equivalent to existing algorithms used in popular languages.
    4036This demonstration can be made by comparing applications built in \CFA to applications built with other languages or other models.
    4137Recall programmer expectation is that the impact of the scheduler can be ignored.
    42 Therefore, if the cost of scheduling is competitive with other popular languages, the guarantee is considered achieved.
     38Therefore, if the cost of scheduling is competitive with other popular languages, the goal is considered satisfied.
    4339More precisely the scheduler should be:
    4440\begin{itemize}
    45         \item As fast as other schedulers that are less fair.
    46         \item Faster than other schedulers that have equal or better fairness.
     41        \item As fast as other schedulers without any fairness guarantee.
     42        \item Faster than other schedulers that have equal or stronger fairness guarantees.
    4743\end{itemize}
    4844
    4945\subsection{Fairness Goals}
    50 For this work, fairness is considered to have two strongly related requirements: true starvation freedom and ``fast'' load balancing.
    51 
    52 \paragraph{True starvation freedom} means as long as at least one \proc continues to dequeue \ats, all ready \ats should be able to run eventually, \ie, eventual progress.
     46For this work, fairness is considered to have two strongly related requirements:
     47
     48\paragraph{Starvation freedom} means as long as at least one \proc continues to dequeue \ats, all ready \ats should be able to run eventually, \ie, eventual progress.
     49Starvation freedom can be bounded or unbounded.
     50In the bounded case, every \at should be able to run within a fixed bound relative to its own enqueue.
     51In the unbounded case, starvation freedom only requires that each \at eventually run.
     52The \CFA scheduler aims to guarantee unbounded starvation freedom.
    5353In any running system, a \proc can stop dequeuing \ats if it starts running a \at that never blocks.
    54 Without preemption, traditional work-stealing schedulers do not have starvation freedom in this case.
    55 Now, this requirement begs the question, what about preemption?
     54Without preemption, traditional work-stealing schedulers do not have starvation freedom, bounded or unbounded.
     55Now, this requirement raises the question, what about preemption?
    5656Generally speaking, preemption happens on the timescale of several milliseconds, which brings us to the next requirement: ``fast'' load balancing.
    5757
    58 \paragraph{Fast load balancing} means that load balancing should happen faster than preemption would normally allow.
    59 For interactive applications that need to run at 60, 90 or 120 frames per second, \ats having to wait for several milliseconds to run are effectively starved.
    60 Therefore load-balancing should be done at a faster pace, one that can detect starvation at the microsecond scale.
    61 With that said, this is a much fuzzier requirement since it depends on the number of \procs, the number of \ats and the general \gls{load} of the system.
     58\paragraph{Fast load balancing} means that, while eventual progress is guaranteed, the timescale on which this progress is expected to happen also matters.
     59Indeed, while a scheduler with bounded starvation freedom is beyond the scope of this work, offering a good expected bound in the mathematical sense~\cite{wiki:expected} is desirable.
     60The expected bound on starvation freedom should be tighter than what preemption normally allows.
     61For interactive applications that need to run at 60, 90 or 120 frames per second, \ats having to wait milliseconds to run are effectively starved.
     62Therefore load-balancing should be done at a faster pace: one that is expected to detect starvation at the microsecond scale.
    6263
    6364\subsection{Fairness vs Scheduler Locality} \label{fairnessvlocal}
     
    6869
    6970For a scheduler, having good locality, \ie, having the data local to each \gls{hthrd}, generally conflicts with fairness.
    70 Indeed, good locality often requires avoiding the movement of cache lines, while fairness requires dynamically moving a \at, and as consequence cache lines, to a \gls{hthrd} that is currently available.
    71 Note that this section discusses \emph{internal locality}, \ie, the locality of the data used by the scheduler versus \emph{external locality}, \ie, how scheduling affects the locality of the application's data.
     71Indeed, good locality often requires avoiding the movement of cache lines, while fairness requires dynamically moving a \at, and as a consequence cache lines, to a \gls{hthrd} that is currently available.
     72Note that this section discusses \emph{internal locality}, \ie, the locality of the data used by the scheduler, versus \emph{external locality}, \ie, how scheduling affects the locality of the application's data.
    7273External locality is a much more complicated subject and is discussed in the next section.
    7374
    74 However, I claim that in practice it is possible to strike a balance between fairness and performance because these goals do not necessarily overlap temporally.
    75 Figure~\ref{fig:fair} shows a visual representation of this behaviour.
    76 As mentioned, some unfairness is acceptable; therefore it is desirable to have an algorithm that prioritizes cache locality as long as thread delay does not exceed the execution mental model.
     75However, I claim that in practice it is possible to strike a balance between fairness and performance because these requirements do not necessarily overlap temporally.
     76Figure~\ref{fig:fair} shows a visual representation of this effect.
     77As mentioned, some unfairness is acceptable; for example, once the bounded starvation guarantee is met, additional fairness will not satisfy it \emph{more}.
     78Conversely, once a \at's data is evicted from cache, its locality cannot worsen.
     79Therefore it is desirable to have an algorithm that prioritizes cache locality as long as the fairness guarantee is also satisfied.
    7780
    7881\begin{figure}
     
    8891\subsection{Performance Challenges}\label{pref:challenge}
    8992While there exists a multitude of potential scheduling algorithms, they generally always have to contend with the same performance challenges.
    90 Since these challenges are recurring themes in the design of a scheduler it is relevant to describe the central ones here before looking at the design.
     93Since these challenges are recurring themes in the design of a scheduler, it is relevant to describe them here before looking at the scheduler's design.
     94
     95\subsubsection{Latency}
     96The most basic performance metric of a scheduler is scheduling latency.
     97This measures how long it takes for a \at to run once scheduled, including the cost of scheduling itself.
     98This measure includes both the sequential cost of the operation itself and also its scalability.
    9199
    92100\subsubsection{Scalability}
    93 The most basic performance challenge of a scheduler is scalability.
    94 Given a large number of \procs and an even larger number of \ats, scalability measures how fast \procs can enqueue and dequeue \ats.
     101Given a large number of \procs and an even larger number of \ats, scalability measures how fast \procs can enqueue and dequeue \ats relative to the available parallelism.
    95102One could expect that doubling the number of \procs would double the rate at which \ats are dequeued, but contention on the internal data structure of the scheduler can diminish the improvements.
    96103While the ready queue itself can be sharded to alleviate the main source of contention, auxiliary scheduling features, \eg counting ready \ats, can also be sources of contention.
     104In Chapter~\ref{microbench}, scalability is measured as $\# procs \times \frac{ns}{ops}$, \ie, the number of \procs times total time over total operations.
     105Since the total number of operations should scale with the number of \procs, this gives a measure of how much each additional \proc affects the other \procs.
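To illustrate the metric with made-up numbers (not measurements from the thesis): if 8 \procs complete 80 million operations in one second and 16 \procs complete 160 million operations in the same time, the metric stays constant, indicating perfect scaling; any contention shows up as the second value exceeding the first.
\[
8 \times \frac{10^9~ns}{8 \times 10^7~ops} = 100
\qquad\qquad
16 \times \frac{10^9~ns}{1.6 \times 10^8~ops} = 100
\]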
    97106
    98107\subsubsection{Migration Cost}
     
    107116In general, a na\"{i}ve \glsxtrshort{fifo} ready-queue does not scale with increased parallelism from \glspl{hthrd}, resulting in decreased performance.
    108117The problem is a single point of contention when adding/removing \ats.
    109 As shown in the evaluation sections, most production schedulers do scale when adding \glspl{hthrd}.
    110 The solution to this problem is to shard the ready queue: create multiple \emph{sub-queues} forming the logical ready-queue and the sub-queues are accessed by multiple \glspl{hthrd} without interfering.
    111 
    112 Before going into the design of \CFA's scheduler, it is relevant to discuss two sharding solutions that served as the inspiration scheduler in this thesis.
     118As is shown in the evaluation sections, most production schedulers do scale when adding \glspl{hthrd}.
     119The solution to this problem is to shard the ready queue: create multiple \emph{sub-queues} forming the logical ready-queue.
     120The sub-queues are accessed by multiple \glspl{hthrd} without the need for communication.
     121
     122Before going into the design of \CFA's scheduler, it is relevant to discuss two sharding solutions that served as the inspiration for the scheduler in this thesis.
    113123
    114124\subsection{Work-Stealing}
     
    116126As mentioned in \ref{existing:workstealing}, a popular sharding approach for the ready queue is work-stealing.
    117127In this approach, each \gls{proc} has its own local sub-queue and \glspl{proc} only access each other's sub-queue if they run out of work on their local ready-queue.
    118 The interesting aspect of work stealing happens in the steady-state scheduling case, \ie all \glspl{proc} have work and no load balancing is needed.
    119 In this case, work stealing is close to optimal scheduling: it can achieve perfect locality and have no contention.
    120 On the other hand, work-stealing schedulers only attempt to do load-balancing when a \gls{proc} runs out of work.
     128The interesting aspect of work stealing manifests itself in the steady-state scheduling case, \ie all \glspl{proc} have work and no load balancing is needed.
     129In this case, work stealing is close to optimal scheduling latency: it can achieve perfect locality and have no contention.
     130On the other hand, work-stealing only attempts to do load-balancing when a \gls{proc} runs out of work.
    121131This means that the scheduler never balances unfair loads unless they result in a \gls{proc} running out of work.
    122 Chapter~\ref{microbench} shows that, in pathological cases, work stealing can lead to indefinite starvation.
    123 
    124 Based on these observations, the conclusion is that a \emph{perfect} scheduler should behave similarly to work-stealing in the steady-state case, but load balance proactively when the need arises.
     132Chapter~\ref{microbench} shows that, in pathological cases, work stealing can lead to unbounded starvation.
     133
     134Based on these observations, the conclusion is that a \emph{perfect} scheduler should behave similarly to work-stealing in the steady-state case, \ie, avoid migrations in well balanced workloads, but load balance proactively when the need arises.
    125135
    126136\subsection{Relaxed-FIFO}
    127 A different scheduling approach is to create a ``relaxed-FIFO'' queue, as in \cite{alistarh2018relaxed}.
     137A different scheduling approach is the ``relaxed-FIFO'' queue, as in \cite{alistarh2018relaxed}.
    128138This approach forgoes any ownership between \gls{proc} and sub-queue, and simply creates a pool of sub-queues from which \glspl{proc} pick.
    129139Scheduling is performed as follows:
     
    134144Timestamps are added to each element of a sub-queue.
    135145\item
    136 A \gls{proc} randomly tests sub-queues until it has acquired one or two queues.
     146A \gls{proc} randomly tests sub-queues until it has acquired one or two queues, referred to as \newterm{randomly picking} or \newterm{randomly helping}.
    137147\item
    138148If two queues are acquired, the older of the two \ats is dequeued from the front of the acquired queues.
     
    148158However, \glspl{proc} eagerly search for these older elements instead of focusing on specific queues, which negatively affects locality.
    149159
    150 While this scheme has good fairness, its performance suffers.
    151 It requires wide sharding, \eg at least 4 queues per \gls{hthrd}, and finding non-empty queues is difficult when there are few ready \ats.
     160While this scheme has good fairness, its performance can be improved.
     161Wide sharding is generally desired, \eg at least 4 queues per \proc, and randomly picking non-empty queues is difficult when there are few ready \ats.
     162The next sections describe improvements I made to this existing algorithm.
     163However, ultimately the ``relaxed-FIFO'' queue is not used as the basis of the \CFA scheduler.
    152164
    153165\section{Relaxed-FIFO++}
    154 The inherent fairness and good performance with many \ats make the relaxed-FIFO queue a good candidate to form the basis of a new scheduler.
     166The inherent fairness and decent performance with many \ats make the relaxed-FIFO queue a good candidate to form the basis of a new scheduler.
    155167The problem case is workloads where the number of \ats is barely greater than the number of \procs.
    156168In these situations, the wide sharding of the ready queue means most of its sub-queues are empty.
     
    162174
    163175As this is the most obvious challenge, it is worth addressing first.
    164 The obvious solution is to supplement each sharded sub-queue with data that indicates if the queue is empty/nonempty to simplify finding nonempty queues, \ie ready \glspl{at}.
    165 This sharded data can be organized in different forms, \eg a bitmask or a binary tree that tracks the nonempty sub-queues.
    166 Specifically, many modern architectures have powerful bitmask manipulation instructions or searching a binary tree has good Big-O complexity.
     176The seemingly obvious solution is to supplement each sharded sub-queue with data that indicates whether the queue is empty/nonempty.
     177This simplifies finding nonempty queues, \ie ready \glspl{at}.
     178The sharded data can be organized in different forms, \eg a bitmask or a binary tree that tracks the nonempty sub-queues, using a bit or a node per sub-queue, respectively.
     179Specifically, many modern architectures have powerful bitmask manipulation instructions, and searching a binary tree has good Big-O complexity.
    167180However, precisely tracking nonempty sub-queues is problematic.
    168181The reason is that the sub-queues are initially sharded with a width presumably chosen to avoid contention.
    169 However, tracking which ready queue is nonempty is only useful if the tracking data is dense, \ie denser than the sharded sub-queues.
     182However, tracking which ready queue is nonempty is only useful if the tracking data is dense, \ie tracks whether multiple sub-queues are empty.
    170183Otherwise, it does not provide useful information because reading this new data structure risks being as costly as simply picking a sub-queue at random.
    171184But if the tracking mechanism \emph{is} denser than the shared sub-queues, then constant updates invariably create a new source of contention.
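To make the trade-off concrete, here is a minimal sketch (not the \CFA implementation) of the bitmask variant, assuming at most 64 sub-queues and GCC/Clang builtins; every enqueue/dequeue that flips a bit writes to the single shared word, which is exactly the contention discussed above.

    #include <stdatomic.h>
    #include <stdint.h>

    extern _Atomic uint64_t nonempty;   // hypothetical: one bit per sub-queue

    static inline void mark_nonempty( unsigned q ) {
        atomic_fetch_or_explicit( &nonempty, 1ull << q, memory_order_relaxed );
    }
    static inline void mark_empty( unsigned q ) {
        atomic_fetch_and_explicit( &nonempty, ~(1ull << q), memory_order_relaxed );
    }
    static inline int find_nonempty( void ) {   // index of some nonempty sub-queue, or -1
        uint64_t m = atomic_load_explicit( &nonempty, memory_order_relaxed );
        return m ? __builtin_ctzll( m ) : -1;
    }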
     
    184197
    185198\subsection{Dynamic Entropy}\cite{xkcd:dynamicentropy}
    186 The Relaxed-FIFO approach can be made to handle the case of mostly empty sub-queues by tweaking the \glsxtrlong{prng}.
     199The Relaxed-FIFO approach can be made to handle the case of mostly empty sub-queues by tweaking the \glsxtrlong{prng} that drives the random picking of sub-queues.
    187200The \glsxtrshort{prng} state can be seen as containing a list of all the future sub-queues that will be accessed.
    188201While this concept is not particularly useful on its own, the consequence is that if the \glsxtrshort{prng} algorithm can be run \emph{backwards}, then the state also contains a list of all the sub-queues that were accessed.
    189 Luckily, bidirectional \glsxtrshort{prng} algorithms do exist, \eg some Linear Congruential Generators\cite{wiki:lcg} support running the algorithm backwards while offering good quality and performance.
     202Luckily, bidirectional \glsxtrshort{prng} algorithms do exist, \eg some Linear Congruential Generators~\cite{wiki:lcg} support running the algorithm backwards while offering good quality and performance.
    190203This particular \glsxtrshort{prng} can be used as follows:
    191204\begin{itemize}
     
    193206Each \proc maintains two \glsxtrshort{prng} states, referred to as $F$ and $B$.
    194207\item
    195 When a \proc attempts to dequeue a \at, it picks a sub-queue by running $B$ backwards.
    196 \item
    197 When a \proc attempts to enqueue a \at, it runs $F$ forward picking a sub-queue to enqueue to.
    198 If the enqueue is successful, state $B$ is overwritten with the content of $F$.
     208When a \proc attempts to dequeue a \at, it picks a sub-queue by running its $B$ backwards.
     209\item
     210When a \proc attempts to enqueue a \at, it runs its $F$ forward picking a sub-queue to enqueue to.
     211If the enqueue is successful, the state of its $B$ is overwritten with the content of its $F$.
    199212\end{itemize}
    200213The result is that each \proc tends to dequeue \ats that it has itself enqueued.
    201214When most sub-queues are empty, this technique increases the odds of finding \ats at a very low cost, while also offering an improvement on locality in many cases.
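As a sketch of how such a bidirectional generator could be implemented (using Knuth's MMIX constants as a stand-in; the actual \CFA \glsxtrshort{prng} may differ), the multiplicative inverse of the multiplier undoes one forward step:

    #include <stdint.h>

    #define LCG_A 6364136223846793005ull   // any odd multiplier works modulo 2^64
    #define LCG_C 1442695040888963407ull

    static uint64_t lcg_inv( uint64_t a ) {               // inverse of an odd multiplier
        uint64_t x = a;                                    // correct to 3 low bits
        for ( int i = 0; i < 5; i += 1 ) x *= 2 - a * x;   // Newton step doubles the correct bits
        return x;
    }
    static inline uint64_t lcg_forward ( uint64_t * s ) { return *s = LCG_A * *s + LCG_C; }
    static inline uint64_t lcg_backward( uint64_t * s ) { return *s = lcg_inv( LCG_A ) * (*s - LCG_C); }

Each \proc would then advance its $F$ state with @lcg_forward@ on enqueue and rewind its $B$ state with @lcg_backward@ on dequeue, copying $F$ into $B$ after a successful enqueue.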
    202215
    203 Tests showed this approach performs better than relaxed-FIFO in many cases.
     216My own tests showed this approach performs better than relaxed-FIFO in many cases.
    204217However, it is still not competitive with work-stealing algorithms.
    205 The fundamental problem is that the constant randomness limits how much locality the scheduler offers.
     218The fundamental problem is that the randomness limits how much locality the scheduler offers.
    206219This becomes problematic both because the scheduler is likely to get cache misses on internal data structures and because migrations become frequent.
    207220Therefore, the attempt to modify the relaxed-FIFO algorithm to behave more like work stealing did not pan out.
     
    214227Before attempting to dequeue from a \proc's sub-queue, the \proc must make some effort to ensure other sub-queues are not being neglected.
    215228To make this possible, \procs must be able to determine which \at has been on the ready queue the longest.
    216 Second, the relaxed-FIFO approach needs timestamps for each \at to make this possible.
     229Second, the relaxed-FIFO approach uses timestamps, denoted TS, for each \at to make this possible.
     230These timestamps can be added to work stealing.
    217231
    218232\begin{figure}
    219233        \centering
    220234        \input{base.pstex_t}
    221         \caption[Base \CFA design]{Base \CFA design \smallskip\newline A pool of sub-queues offers the sharding, two per \proc.
     235        \caption[Base \CFA design]{Base \CFA design \smallskip\newline It uses a pool of sub-queues, with a sharding of two sub-queues per \proc.
    222236        Each \gls{proc} can access all of the sub-queues.
    223237        Each \at is timestamped when enqueued.}
     
    227241Figure~\ref{fig:base} shows the algorithm structure.
    228242This structure is similar to classic work-stealing except the sub-queues are placed in an array so \procs can access them in constant time.
    229 Sharding width can be adjusted based on contention.
    230 Note, as an optimization, the TS of a \at is stored in the \at in front of it, so the first TS is in the array and the last \at has no TS.
     243Sharding can be adjusted based on contention.
     244As an optimization, the timestamp of a \at is stored in the \at in front of it, so the first TS is in the array and the last \at has no TS.
    231245This organization keeps the highly accessed front TSs directly in the array.
    232246When a \proc attempts to dequeue a \at, it first picks a random remote sub-queue and compares its timestamp to the timestamps of its local sub-queue(s).
    233 The oldest waiting \at is dequeued to provide global fairness.
     247The oldest of the compared \ats is dequeued.
     248In this document, picking from a remote sub-queue in this fashion is referred to as ``helping''.
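The layout described above might look roughly as follows; the type and field names are hypothetical, not the actual \CFA runtime types:

    #include <stdint.h>

    struct thread_node {
        struct thread_node * next;
        uint64_t next_ts;        // timestamp of the thread queued behind this one
    };
    struct subqueue {
        struct thread_node * head, * tail;
        uint64_t head_ts;        // timestamp of the head thread, kept in the array
    };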
     249
     250The timestamps are measured using the CPU's hardware timestamp counters~\cite{wiki:rdtsc}.
     251These provide a 64-bit counter that tracks the number of cycles since the CPU was powered on.
     252Assuming the CPU runs at less than 5 GHz, this means that the 64-bit counter takes over a century before overflowing.
     253This is true even on 32-bit CPUs, where the counter is generally still 64-bit.
     254However, on many architectures, the instructions to read the counter do not have any particular ordering guarantees.
     255Since the counter does not depend on any data in the CPU pipeline, there is significant flexibility for the read to be executed out of order, which limits the accuracy of the timestamp to a window of code.
     256Finally, another issue that can come up with timestamp counters is synchronization between \glspl{hthrd}.
     257This appears to be mostly a historical concern, as recent CPUs offer more synchronization guarantees.
     258For example, Intel supports ``Invariant TSC''~\cite[\S~17.15.1]{MAN:inteldev}, which is guaranteed to be synchronized across \glspl{hthrd}.
    234259
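For reference, at 5 GHz the counter wraps after $2^{64} / (5 \times 10^9) \approx 3.7 \times 10^9$ seconds, roughly 117 years. A minimal sketch of reading the counter with GCC/Clang x86-64 intrinsics (the wrapper actually used by the runtime may differ):

    #include <stdint.h>
    #include <x86intrin.h>      // __rdtsc(), _mm_lfence()

    static inline uint64_t rdtsc_relaxed( void ) {
        return __rdtsc();       // cheap, but may execute out of order
    }
    static inline uint64_t rdtsc_ordered( void ) {
        _mm_lfence();           // narrow the reordering window at a small cost
        uint64_t ts = __rdtsc();
        _mm_lfence();
        return ts;
    }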
    235260However, this na\"ive implementation has performance problems.
    236 First, it is necessary to have some damping effect on helping.
    237 Random effects like cache misses and preemption can add spurious but short bursts of latency negating the attempt to help.
    238 These bursts can cause increased migrations and make this work-stealing approach slow down to the level of relaxed-FIFO.
     261First, it is necessary to avoid helping when it does not improve fairness.
     262Random effects like cache misses and preemption can add unpredictable but short bursts of latency, which do not warrant the cost of helping.
     263These bursts can cause increased migrations, at which point the same locality problems as in the relaxed-FIFO approach start to appear.
    239264
    240265\begin{figure}
     
    246271
    247272A simple solution to this problem is to use an exponential moving average\cite{wiki:ma} (MA) instead of a raw timestamp, as shown in Figure~\ref{fig:base-ma}.
    248 Note that this is more complex because the \at at the head of a sub-queue is still waiting, so its wait time has not ended.
     273Note that this is more complex than it may appear because the \at at the head of a sub-queue is still waiting, so its wait time has not ended.
    249274Therefore, the exponential moving average is an average of how long each dequeued \at has waited.
    250275To compare sub-queues, the timestamp at the head must be compared to the current time, yielding the best-case wait time for the \at at the head of the queue.
    252277This new wait time is averaged with the stored average.
    252277To further limit \glslink{atmig}{migrations}, a bias can be added to a local sub-queue, where a remote sub-queue is helped only if its moving average is more than $X$ times the local sub-queue's average.
    253 Tests for this approach indicate the choice of the weight for the moving average or the bias is not important, \ie weights and biases of similar \emph{magnitudes} have similar effects.
    254 
    255 With these additions to work stealing, scheduling can be made as fair as the relaxed-FIFO approach, avoiding the majority of unnecessary migrations.
     278Tests for this approach indicate the precise values for the weight of the moving average and the bias are not important, \ie weights and biases of similar \emph{magnitudes} have similar effects.
     279
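A sketch of the comparison just described, with hypothetical names and arbitrary weight/bias values (only their order of magnitude matters, per the tests above):

    #include <stdbool.h>
    #include <stdint.h>

    struct subqueue {
        uint64_t head_ts;    // timestamp of the still-waiting head thread
        uint64_t avg_wait;   // exponential moving average of completed waits
    };
    enum { WEIGHT = 16, BIAS = 4 };   // illustrative magnitudes only

    // best-case age of a queue: the head's wait so far, folded into the stored average
    static inline uint64_t queue_age( const struct subqueue * q, uint64_t now ) {
        return ( q->avg_wait * (WEIGHT - 1) + (now - q->head_ts) ) / WEIGHT;
    }
    // help a remote sub-queue only when it looks sufficiently older than the local one
    static inline bool should_help( const struct subqueue * local,
                                    const struct subqueue * remote, uint64_t now ) {
        return queue_age( remote, now ) > BIAS * queue_age( local, now );
    }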
     280With these additions to work stealing, scheduling can satisfy the starvation freedom guarantee while suffering much less from unnecessary migrations than the relaxed-FIFO approach.
    256281Unfortunately, the work to achieve fairness has a performance cost, especially when the workload is inherently fair, and hence, there is only short-term unfairness or no starvation.
    257 The problem is that the constant polling, \ie reads, of remote sub-queues generally entails cache misses because the TSs are constantly being updated, \ie, writes.
    258 To make things worst, remote sub-queues that are very active, \ie \ats are frequently enqueued and dequeued from them, lead to higher chances that polling will incur a cache-miss.
     282The problem is that the constant polling, \ie reads, of remote sub-queues generally entails cache misses because the TSs are constantly being updated.
     283To make things worse, remote sub-queues that are very active, \ie \ats are frequently enqueued and dequeued from them, lead to higher chances that polling will incur a cache-miss.
    259284Conversely, the active sub-queues do not benefit much from helping since starvation is already a non-issue.
    260285This puts this algorithm in the awkward situation of paying for a largely unnecessary cost.
     
    264289The problem with polling remote sub-queues is that correctness is critical.
    265290There must be a consensus among \procs on which sub-queues hold which \ats, as the \ats are in constant motion.
    266 Furthermore, since timestamps are used for fairness, it is critical to have a consensus on which \at is the oldest.
     291Furthermore, since timestamps are used for fairness, it is critical that the oldest \ats eventually be recognized as such.
    267292However, when deciding if a remote sub-queue is worth polling, correctness is less of a problem.
    268293Since the only requirement is that a sub-queue is eventually polled, some data staleness is acceptable.
     
    278303        \centering
    279304        \input{base_ts2.pstex_t}
    280         \caption[\CFA design with Redundant Timestamps]{\CFA design with Redundant Timestamps \smallskip\newline An array is added containing a copy of the timestamps.
     305        \caption[\CFA design with Redundant Timestamps]{\CFA design with Redundant Timestamps \smallskip\newline This design uses an array containing a copy of the timestamps.
    281306        These timestamps are written-to with relaxed atomics, so there is no order among concurrent memory accesses, leading to fewer cache invalidations.}
    282307        \label{fig:base-ts2}
     
    285310The correctness argument is somewhat subtle.
    286311The data used for deciding whether or not to poll a queue can be stale as long as it does not cause starvation.
    287 Therefore, it is acceptable if stale data makes queues appear older than they are but appearing fresher can be a problem.
     312Therefore, it is acceptable if stale data makes queues appear older than they are, but appearing fresher can be a problem.
    288313For the timestamps, this means it is acceptable to miss writes to the timestamp since they make the head \at look older.
    289314For the moving average, as long as the operations are just atomic reads/writes, the average is guaranteed to yield a value that is between the oldest and newest values written.
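In C11 terms, the accesses to the redundant copies could look like the following sketch (the \CFA runtime uses its own atomic wrappers); relaxed ordering suffices because a missed or stale write only makes the queue appear older:

    #include <stdatomic.h>
    #include <stdint.h>

    extern _Atomic uint64_t ts_copy[];   // hypothetical: one cached timestamp per sub-queue

    static inline void ts_publish( unsigned q, uint64_t ts ) {
        atomic_store_explicit( &ts_copy[q], ts, memory_order_relaxed );
    }
    static inline uint64_t ts_peek( unsigned q ) {   // used only to decide whether to poll
        return atomic_load_explicit( &ts_copy[q], memory_order_relaxed );
    }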
     
    292317With redundant timestamps, this scheduling algorithm achieves both the fairness and performance requirements on most machines.
    293318The problem is that the cost of polling and helping is not necessarily consistent across each \gls{hthrd}.
    294 For example on machines with a CPU containing multiple hyper threads and cores and multiple CPU sockets, cache misses can be satisfied from the caches on the same (local) CPU, or by a CPU on a different (remote) socket.
     319For example, on machines with multiple CPUs, cache misses can be satisfied from the caches on the same (local) CPU or from the caches on a different (remote) CPU.
    295320Cache misses satisfied by a remote CPU have significantly higher latency than from the local CPU.
    296321However, these delays are not specific to systems with multiple CPUs.
     
    312337Figures~\ref{fig:cache-share} and~\ref{fig:cache-noshare} show two different cache topologies that highlight this difference.
    313338In Figure~\ref{fig:cache-share}, all cache misses are either private to a CPU or shared with another CPU.
    314 This means latency due to cache misses is fairly consistent.
    315 In contrast, in Figure~\ref{fig:cache-noshare} misses in the L2 cache can be satisfied by either instance of the L3 cache.
     339This means that latency due to cache misses is fairly consistent.
     340In contrast, in Figure~\ref{fig:cache-noshare}, misses in the L2 cache can be satisfied by either instance of the L3 cache.
    316341However, the memory-access latency to the remote L3 is higher than the memory-access latency to the local L3.
    317 The impact of these different designs on this algorithm is that scheduling only scales well on architectures with a wide L3 cache, similar to Figure~\ref{fig:cache-share}, and less well on architectures with many narrower L3 cache instances, similar to Figure~\ref{fig:cache-noshare}.
     342The impact of these different designs on this algorithm is that scheduling only scales well on architectures with the L3 cache shared across many \glspl{hthrd}, similar to Figure~\ref{fig:cache-share}, and less well on architectures with many L3 cache instances and less sharing, similar to Figure~\ref{fig:cache-noshare}.
    318343Hence, as the number of L3 instances grows, so too does the chance that the random helping causes significant cache latency.
    319344The solution is for the scheduler to be aware of the cache topology.
     
    323348Unfortunately, there is no portable way to discover cache topology, and it is outside the scope of this thesis to solve this problem.
    324349This work uses the cache topology information from Linux's @/sys/devices/system/cpu@ directory.
    325 This leaves the challenge of matching \procs to cache structure, or more precisely identifying which sub-queues of the ready queue are local to which subcomponents of the cache structure.
     350This leaves the challenge of matching \procs to cache structure, or more precisely, identifying which sub-queues of the ready queue are local to which subcomponents of the cache structure.
    326351Once a match is generated, the helping algorithm is changed to add bias so that \procs more often help sub-queues local to the same cache substructure.\footnote{
    327 Note that like other biases mentioned in this section, the actual bias value does not appear to need precise tuning.}
     352Note that, like other biases mentioned in this section, the actual bias value does not appear to need precise tuning beyond the order of magnitude.}
    328353
    329354The simplest approach for mapping sub-queues to cache structure is to statically tie sub-queues to CPUs.
     
    335360However, it can still cause some subtle fairness problems in systems with few \procs and many \glspl{hthrd}.
    336361In this case, the large number of sub-queues and the bias against sub-queues tied to different cache substructures make it unlikely that every sub-queue is picked.
    337 To make things worst, the small number of \procs means that few helping attempts are made.
     362To make things worse, the small number of \procs means that few helping attempts are made.
    338363This combination of low selection and few helping attempts allow a \at to become stranded on a sub-queue for a long time until it gets randomly helped.
    339 On a system with 2 \procs, 256 \glspl{hthrd} with narrow cache sharing, and a 100:1 bias, it can take multiple seconds for a \at to get dequeued from a remote queue.
     364On a system with 2 \procs, 256 \glspl{hthrd}, and a 100:1 bias, it can take multiple seconds for a \at to get dequeued from a remote queue.
     365In this scenario, where each \proc attempts to help on 50\% of dequeues, the probability that a given remote sub-queue gets help on any particular dequeue is $\frac{1}{51200}$, and the number of dequeues until it is helped follows a geometric distribution.
     366Therefore the probability that the remote sub-queue gets help within the next 100,000 dequeues is only 85\%.
     367Assuming dequeues happen every 100ns, there is still a 15\% chance a \at could starve for more than 10ms and a 1\% chance the \at starves for 33.33ms, the maximum latency tolerated for interactive applications.
     368If few \glspl{hthrd} share each cache instance, the probability that a \at is on a remote sub-queue becomes high.
    340369Therefore, a more dynamic match of sub-queues to cache instances is needed.
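As a sketch of the arithmetic behind these probabilities (one plausible decomposition of the quoted $\frac{1}{51200}$: a 50\% helping rate, 256 sub-queues, and a 100:1 bias):
\[
p = \frac{1}{2 \times 256 \times 100} = \frac{1}{51200},
\qquad
P(\text{no help in } n \text{ dequeues}) = (1 - p)^{n} \approx e^{-np}
\]
With a dequeue every 100~ns, 10~ms corresponds to $n = 100\,000$, giving $e^{-100000/51200} \approx 0.14$, \ie the roughly 15\% chance of starving past 10~ms noted above.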
    341370
     
    343372\label{s:TopologicalWorkStealing}
    344373The approach used in the \CFA scheduler is to have per-\proc sub-queues, but have an explicit data structure to track which cache substructure each sub-queue is tied to.
    345 This tracking requires some finesse because reading this data structure must lead to fewer cache misses than not having the data structure in the first place.
     374This tracking requires some finesse, because reading this data structure must lead to fewer cache misses than not having the data structure in the first place.
    346375A key element, however, is that, like the timestamps for helping, reading the cache instance mapping only needs to give the correct result \emph{often enough}.
    347 Therefore the algorithm can be built as follows: before enqueueing or dequeuing a \at, each \proc queries the CPU id and the corresponding cache instance.
    348 Since sub-queues are tied to \procs, each \proc can then update the cache instance mapped to the local sub-queue(s).
     376Therefore the algorithm can be built as follows: before enqueueing or dequeuing a \at, a \proc queries the CPU id and the corresponding cache instance.
     377Since sub-queues are tied to \procs, a \proc can then update the cache instance mapped to the local sub-queue(s).
    349378To avoid unnecessary cache line invalidation, the map is only written-to if the mapping changes.
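A sketch of that query-and-update step on Linux, with hypothetical names (@cpu_to_cache@ would be filled once from @/sys/devices/system/cpu@):

    #define _GNU_SOURCE
    #include <sched.h>          // sched_getcpu()
    #include <stdatomic.h>

    extern unsigned cpu_to_cache[];           // CPU id -> cache-instance id, read-only after init
    extern _Atomic unsigned queue_cache[];    // sub-queue -> cache-instance id, read by helpers

    // called by a proc before enqueueing/dequeuing on its own sub-queue
    static inline void refresh_cache_map( unsigned my_queue ) {
        unsigned inst = cpu_to_cache[ sched_getcpu() ];
        // write only on change, so the cache line is not invalidated needlessly
        if ( atomic_load_explicit( &queue_cache[my_queue], memory_order_relaxed ) != inst )
            atomic_store_explicit( &queue_cache[my_queue], inst, memory_order_relaxed );
    }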
    350379
  • doc/theses/thierry_delisle_PhD/thesis/text/eval_macro.tex

    r7d9598d8 r2dcd80a  
    11\chapter{Macro-Benchmarks}\label{macrobench}
    2 The previous chapter demonstrated the \CFA scheduler achieves its equivalent performance goal in small and controlled \at-scheduling scenarios.
     2The previous chapter demonstrated that the \CFA scheduler achieves its equivalent performance goal in small and controlled \at-scheduling scenarios.
    33The next step is to demonstrate performance stays true in more realistic and complete scenarios.
    4 Therefore, this chapter exercises both \at and I/O scheduling using two flavours of web servers that demonstrate \CFA performs competitively compared to web servers used in production environments.
     4Therefore, this chapter exercises both \at and I/O scheduling using two flavours of web servers that demonstrate that \CFA performs competitively compared to web servers used in production environments.
    55
    66Web servers are chosen because they offer fairly simple applications that perform complex I/O, both network and disk, and are useful as standalone products.
     
    1010As such, these experiments should highlight the overhead due to any \CFA fairness cost in realistic scenarios.
    1111
     12The most obvious performance metric for web servers is throughput.
     13This metric generally measures the speed at which the server answers requests and, relatedly, how fast clients can send requests before the server can no longer keep up.
     14Another popular performance metric is \newterm{tail} latency, which indicates some notion of fairness among requests across the experiment, \ie do some requests wait longer than other requests for service?
     15Since many web applications rely on a combination of different queries made in parallel, the latency of the slowest response, \ie tail latency, can dictate a performance perception.
     16
    1217\section{Memcached}
    1318Memcached~\cite{memcached} is an in-memory key-value store used in many production environments, \eg \cite{atikoglu2012workload}.
     
    2631Each CPU has 6 cores and 2 \glspl{hthrd} per core, for a total of 24 \glspl{hthrd}.
    2732\item
     33The machine is configured to run each server on 12 dedicated \glspl{hthrd} and to use 6 of the remaining \glspl{hthrd} for software interrupt handling~\cite{wiki:softirq}, resulting in a maximum CPU utilization of 75\% (18 / 24 \glspl{hthrd}).
     34\item
    2835A CPU has 384 KB, 3 MB and 30 MB of L1, L2 and L3 caches, respectively.
    2936\item
     
    4754\item
    4855For UDP connections, all the threads listen to a single UDP socket for incoming requests.
    49 Threads that are not currently dealing with another request ignore the incoming packet.
     56Threads that are currently dealing with another request ignore the incoming packet.
    5057One of the remaining, non-busy, threads reads the request and sends the response.
    5158This implementation can lead to increased CPU \gls{load} as threads wake from sleep to potentially process the request.
     
    7986\subsection{Throughput} \label{memcd:tput}
    8087This experiment is done by having the clients establish 15,360 total connections, which persist for the duration of the experiment.
    81 The clients then send read and write queries with only 3\% writes (updates), attempting to follow a desired query rate, and the server responds to the desired rate as best as possible.
     88The clients then send read and write queries with 3\% writes (updates), attempting to follow a desired query rate, and the server responds to the desired rate as best it can.
    8289Figure~\ref{fig:memcd:rate:qps} shows the 3 server versions at different client rates, ``Target \underline{Q}ueries \underline{P}er \underline{S}econd'', and the actual rate, ``Actual QPS'', for all three web servers.
    8390
     
    104111
    105112\subsection{Tail Latency}
    106 Another popular performance metric is \newterm{tail} latency, which indicates some notion of fairness among requests across the experiment, \ie do some requests wait longer than other requests for service?
    107 Since many web applications rely on a combination of different queries made in parallel, the latency of the slowest response, \ie tail latency, can dictate a performance perception.
    108113Figure~\ref{fig:memcd:rate:tail} shows the 99th percentile latency results for the same Memcached experiment.
    109114
    110115Again, each experiment is run 15 times with the median, maximum and minimum plotted with different lines.
    111 As expected, the latency starts low and increases as the server gets close to saturation, at which point, the latency increases dramatically because the web servers cannot keep up with the connection rate so client requests are disproportionally delayed.
     116As expected, the latency starts low and increases as the server gets close to saturation, at which point the latency increases dramatically because the web servers cannot keep up with the connection rate, so client requests are disproportionally delayed.
    112117Because of this dramatic increase, the Y-axis is presented using a log scale.
    113118Note that the graph shows the \emph{target} query rate, the actual response rate is given in Figure~\ref{fig:memcd:rate:qps} as this is the same underlying experiment.
     
    186191web servers servicing dynamic requests, which read from multiple locations and construct a response, are not as interesting since creating the response takes more time and does not exercise the runtime in a meaningfully different way.}
    187192The static web server experiment compares NGINX~\cite{nginx} with a custom \CFA-based web server developed for this experiment.
    188 
    189 \subsection{NGINX threading}
    190193NGINX is a high-performance, \emph{full-service}, event-driven web server.
    191194It can handle both static and dynamic web content, as well as serve as a reverse proxy and a load balancer~\cite{reese2008nginx}.
    192195This wealth of capabilities comes with a variety of potential configurations, dictating available features and performance.
    193196The NGINX server runs a master process that performs operations such as reading configuration files, binding to ports, and controlling worker processes.
    194 When running as a static web server, it uses an event-driven architecture to service incoming requests.
     197In comparison, the custom \CFA web server was developed specifically with this experiment in mind.
     198However, nothing seems to indicate that NGINX suffers from its increased flexibility.
     199When tuned for performance, NGINX appears to deliver the performance that the underlying hardware can achieve.
     200
     201\subsection{NGINX threading}
     202When running as a static web server, NGINX uses an event-driven architecture to service incoming requests.
    195203Incoming connections are assigned a \emph{stackless} HTTP state machine and worker processes can handle thousands of these state machines.
    196204For the following experiment, NGINX is configured to use @epoll@ to listen for events on these state machines and have each worker process independently accept new connections.
    197 Because of the realities of Linux, see Subsection~\ref{ononblock}, NGINX also maintains a pool of auxiliary threads to handle blocking \io.
     205Because of the realities of Linux (Subsection~\ref{ononblock}), NGINX also maintains a pool of auxiliary threads to handle blocking \io.
    198206The configuration can set the number of worker processes desired, as well as the size of the auxiliary pool.
    199207However, for the following experiments, NGINX is configured to let the master process decide the appropriate number of threads.
     
    262270The computer is booted with only 8 CPUs enabled, which is sufficient to achieve line rate.
    263271\item
     272Both servers are set up with enough parallelism to achieve 100\% CPU utilization, which happens at higher request rates.
     273\item
    264274Each CPU has 64 KB, 256 KiB and 8 MB of L1, L2 and L3 caches respectively.
    265275\item
  • doc/theses/thierry_delisle_PhD/thesis/text/eval_micro.tex

    r7d9598d8 r2dcd80a  
    44This chapter presents five different experimental setups for evaluating the basic features of the \CFA, libfibre~\cite{libfibre}, Go, and Tokio~\cite{Tokio} schedulers.
    55All of these systems have a \gls{uthrding} model.
    6 The goal of this chapter is to show that the \CFA scheduler obtains equivalent performance to other, less fair, schedulers through the different experiments.
     6The goal of this chapter is to show, through the different experiments, that the \CFA scheduler obtains equivalent performance to other schedulers with lesser fairness guarantees.
    77Note that only the code of the \CFA tests is shown;
    8 all tests in the other systems are functionally identical and available online~\cite{GITHUB:SchedulingBenchmarks}.
     8all tests in the other systems are functionally identical and are both available online~\cite{GITHUB:SchedulingBenchmarks} and submitted to UWSpace with the thesis itself.
    99
    1010\section{Benchmark Environment}\label{microenv}
     
    129129        \caption[Cycle Benchmark on Intel]{Cycle Benchmark on Intel\smallskip\newline Throughput and scalability as a function of \proc count, 5 \ats per cycle, and different cycle counts.
    130130        For throughput, higher is better, for scalability, lower is better.
    131         Each series represent 15 independent runs, the dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.}
     131        Each series represents 15 independent runs.
     132        The dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.}
    132133        \label{fig:cycle:jax}
    133134\end{figure}
     
    161162        \caption[Cycle Benchmark on AMD]{Cycle Benchmark on AMD\smallskip\newline Throughput and scalability as a function of \proc count, 5 \ats per cycle, and different cycle counts.
    162163        For throughput, higher is better, for scalability, lower is better.
    163         Each series represent 15 independent runs, the dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.}
     164        Each series represents 15 independent runs.
     165        The dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.}
    164166        \label{fig:cycle:nasus}
    165167\end{figure}
     
    177179Looking next at the right column on Intel, Figures~\ref{fig:cycle:jax:low:ops} and \ref{fig:cycle:jax:low:ns} show the results for 1 cycle of 5 \ats for each \proc.
    178180\CFA and Tokio obtain very similar results overall, but Tokio shows more variations in the results.
    179 Go achieves slightly better performance than \CFA and Tokio, but all three display significantly worst performance compared to the left column.
     181Go achieves slightly better performance than \CFA and Tokio, but all three display significantly worse performance compared to the left column.
    180182This decrease in performance is likely due to the additional overhead of the idle-sleep mechanism.
    181183This can either be the result of \procs actually running out of work or simply additional overhead from tracking whether or not there is work available.
     
    185187Looking now at the results for the AMD architecture, Figure~\ref{fig:cycle:nasus}, the results are overall similar to the Intel results, but with close to double the performance, slightly increased variation, and some differences in the details.
    186188Note the maximum of the Y-axis on Intel and AMD differ significantly.
    187 Looking at the left column on AMD, Figures~\ref{fig:cycle:nasus:ops} and \ref{fig:cycle:nasus:ns} all 4 runtimes achieve very similar throughput and scalability.
     189Looking at the left column on AMD, Figures~\ref{fig:cycle:nasus:ops} and \ref{fig:cycle:nasus:ns}, all 4 runtimes achieve very similar throughput and scalability.
    188190However, as the number of \procs grows higher, the results on AMD show notably more variability than on Intel.
    189191The different performance improvements and plateaus are due to cache topology and appear at the expected \proc counts of 64, 128 and 192, for the same reasons as on Intel.
     
    191193This result is different than on Intel, where Tokio behaved like \CFA rather than behaving like Go.
    192194Again, the same performance increase for libfibre is visible when running fewer \ats.
    193 Note, I did not investigate the libfibre performance boost for 1 cycle in this experiment.
     195I did not investigate the libfibre performance boost for 1 cycle in this experiment.
    194196
    195197The conclusion from both architectures is that all of the compared runtimes have fairly equivalent performance for this micro-benchmark.
    196198Clearly, the pathological case with 1 cycle per \proc can affect fairness algorithms managing mostly idle processors, \eg \CFA, but only at high core counts.
    197199In this case, \emph{any} helping is likely to cause a cascade of \procs running out of work and attempting to steal.
    198 For this experiment, the \CFA scheduler has achieved the goal of obtaining equivalent performance to other, less fair, schedulers.
     200For this experiment, the \CFA scheduler has achieved the goal of obtaining equivalent performance to other schedulers with lesser fairness guarantees.
    199201
    200202\section{Yield}
     
    250252        \caption[Yield Benchmark on Intel]{Yield Benchmark on Intel\smallskip\newline Throughput and scalability as a function of \proc count.
    251253        For throughput, higher is better, for scalability, lower is better.
    252         Each series represent 15 independent runs, the dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.}
     254        Each series represents 15 independent runs.
     255        The dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.}
    253256        \label{fig:yield:jax}
    254257\end{figure}
     
    309312        \caption[Yield Benchmark on AMD]{Yield Benchmark on AMD\smallskip\newline Throughput and scalability as a function of \proc count.
    310313        For throughput, higher is better, for scalability, lower is better.
    311         Each series represent 15 independent runs, the dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.}
     314        Each series represents 15 independent runs.
     315        The dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.}
    312316        \label{fig:yield:nasus}
    313317\end{figure}
     
    317321Looking at the left column first, Figures~\ref{fig:yield:nasus:ops} and \ref{fig:yield:nasus:ns}, \CFA achieves very similar throughput and scaling.
    318322Libfibre still outpaces all other runtimes, but it encounters a performance hit at 64 \procs.
    319 This anomaly suggests some amount of communication between the \procs that the Intel machine is able to mask where the AMD is not once hyperthreading is needed.
     323This anomaly suggests some amount of communication between the \procs that the Intel machine is able to mask where the AMD is not, once hyperthreading is needed.
    320324Go and Tokio still display the same performance collapse as on Intel.
    321325Looking next at the right column on AMD, Figures~\ref{fig:yield:nasus:low:ops} and \ref{fig:yield:nasus:low:ns}, all runtime systems effectively behave the same as they did on the Intel machine.
     
    324328
    325329It is difficult to draw conclusions for this benchmark when runtime systems treat @yield@ so differently.
    326 The win for \CFA is its consistency between the cycle and yield benchmarks making it simpler for programmers to use and understand, \ie the \CFA semantics match with programmer intuition.
     330The win for \CFA is its consistency between the cycle and yield benchmarks, making it simpler for programmers to use and understand, \ie the \CFA semantics match with programmer intuition.
    327331
    328332
     
    333337
    334338The Churn benchmark represents more chaotic executions, where there is more communication among \ats but no relationship between the last \proc on which a \at ran and blocked, and the \proc that subsequently unblocks it.
    335 With processor-specific ready-queues, when a \at is unblocked by a different \proc that means the unblocking \proc must either ``steal'' the \at from another processor or find it on a remote queue.
     339With processor-specific ready-queues, when a \at is unblocked by a different \proc, that means the unblocking \proc must either ``steal'' the \at from another processor or find it on a remote queue.
    336340This dequeuing results in either contention on the remote queue and/or \glspl{rmr} on the \at data structure.
    337 Hence, this benchmark has performance dominated by the cache traffic as \procs are constantly accessing each other's data.
     341Hence, this benchmark has performance dominated by the cache traffic as \procs are constantly accessing each others' data.
     338342In either case, this benchmark aims to measure how well a scheduler handles these situations, since both can lead to performance degradation if not handled correctly.
    339343
     
    392396        \caption[Churn Benchmark on Intel]{Churn Benchmark on Intel\smallskip\newline Throughput and scalability as a function of \proc count.
    393397        For throughput, higher is better, for scalability, lower is better.
    394         Each series represent 15 independent runs, the dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.}
     398        Each series represents 15 independent runs.
     399        The dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.}
    395400        \label{fig:churn:jax}
    396401\end{figure}
     
     404409Tokio achieves very similar performance to \CFA, with the starting boost, scaling decently until 48 \procs, dropping from 48 to 72 \procs, and increasing again up to 192 \procs.
    405410Libfibre obtains effectively the same results as Tokio with slightly less scaling, \ie the scaling curve is the same but with slightly lower values.
    406 Finally, Go gets the most peculiar results, scaling worst than other runtimes until 48 \procs.
     411Finally, Go gets the most peculiar results, scaling worse than other runtimes until 48 \procs.
     407412At 72 \procs, the results of the Go runtime vary significantly, sometimes scaling, sometimes plateauing.
    408413However, beyond this point Go keeps this level of variation but does not scale further in any of the runs.
    409414
    410 Throughput and scalability are notably worst for all runtimes than the previous benchmarks since there is inherently more communication between processors.
     415Throughput and scalability are notably worse for all runtimes than the previous benchmarks since there is inherently more communication between processors.
     411416Indeed, none of the runtimes reach 40 million operations per second, while in the cycle benchmark all but libfibre reached 400 million operations per second.
    412417Figures~\ref{fig:churn:jax:ns} and \ref{fig:churn:jax:low:ns} show that for all \proc counts, all runtimes produce poor scaling.
    413 However, once the number of \glspl{hthrd} goes beyond a single socket, at 48 \procs, scaling goes from bad to worst and performance completely ceases to improve.
     418However, once the number of \glspl{hthrd} goes beyond a single socket, at 48 \procs, scaling goes from bad to worse and performance completely ceases to improve.
    414419At this point, the benchmark is dominated by inter-socket communication costs for all runtimes.
    415420
     
    457462        \caption[Churn Benchmark on AMD]{Churn Benchmark on AMD\smallskip\newline Throughput and scalability as a function of \proc count.
    458463        For throughput, higher is better, for scalability, lower is better.
    459         Each series represent 15 independent runs, the dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.}
     464        Each series represents 15 independent runs.
     465        The dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.}
    460466        \label{fig:churn:nasus}
    461467\end{figure}
     
    600606        \caption[Locality Benchmark on Intel]{Locality Benchmark on Intel\smallskip\newline Throughput and scalability as a function of \proc count.
    601607        For throughput, higher is better, for scalability, lower is better.
    602         Each series represent 15 independent runs, the dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.}
     608        Each series represents 15 independent runs.
     609        The dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.}
    603610        \label{fig:locality:jax}
    604611\end{figure}
     
    632639        \caption[Locality Benchmark on AMD]{Locality Benchmark on AMD\smallskip\newline Throughput and scalability as a function of \proc count.
    633640        For throughput, higher is better, for scalability, lower is better.
    634         Each series represent 15 independent runs, the dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.}
     641        Each series represents 15 independent runs.
     642        The dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.}
    635643        \label{fig:locality:nasus}
    636644\end{figure}
     
    648656Go still has the same poor performance as on Intel.
    649657
    650 Finally looking at the right column, Figures~\ref{fig:locality:nasus:noshare:ops} and \ref{fig:locality:nasus:noshare:ns}, like on Intel, the same performance inversion is present between libfibre and \CFA/Tokio.
     658Finally, looking at the right column, Figures~\ref{fig:locality:nasus:noshare:ops} and \ref{fig:locality:nasus:noshare:ns}, like on Intel, the same performance inversion is present between libfibre and \CFA/Tokio.
    651659Go still has the same poor performance.
    652660
     
    751759\end{centering}
    752760\caption[Transfer Benchmark on Intel and AMD]{Transfer Benchmark on Intel and AMD\smallskip\newline Average measurement of how long it takes for all \ats to acknowledge the leader \at.
      761For each runtime, the average is calculated over 100'000 transfers, except for Go, which only has 1000 transfers (due to the difference in transfer time).
    753762DNC stands for ``did not complete'', meaning that after 5 seconds of a new leader being decided, some \ats still had not acknowledged the new leader.}
    754763\label{fig:transfer:res}
     
    764773
    765774The first two columns show the results for the semaphore variation on Intel.
    766 While there are some differences in latencies, \CFA is consistently the fastest and Tokio the slowest, all runtimes achieve fairly close results.
    767 Again, this experiment is meant to highlight major differences so latencies within $10\times$ of each other are considered equal.
     775While there are some differences in latencies, with \CFA consistently the fastest and Tokio the slowest, all runtimes achieve fairly close results.
     776Again, this experiment is meant to highlight major differences, so latencies within $10\times$ of each other are considered equal.
    768777
    769778Looking at the next two columns, the results for the yield variation on Intel, the story is very different.
     
     780789Neither Libfibre nor Tokio completes the experiment.
    781790
    782 This experiment clearly demonstrates that \CFA achieves significantly better fairness.
     791This experiment clearly demonstrates that \CFA achieves a stronger fairness guarantee.
    783792The semaphore variation serves as a control, where all runtimes are expected to transfer leadership fairly quickly.
    784793Since \ats block after acknowledging the leader, this experiment effectively measures how quickly \procs can steal \ats from the \proc running the leader.
     
    790799Without \procs stealing from the \proc running the leader, the experiment cannot terminate.
    791800Go manages to complete the experiment because it adds preemption on top of classic work-stealing.
    792 However, since preemption is fairly infrequent, it achieves significantly worst performance.
     801However, since preemption is fairly infrequent, it achieves significantly worse performance.
    793802In contrast, \CFA achieves equivalent performance in both variations, demonstrating very good fairness.
     794803Interestingly, \CFA achieves better delays in the yielding version than in the semaphore version; however, that is likely because fairness is equivalent while the cost of the semaphores and idle sleep is removed.
  • doc/theses/thierry_delisle_PhD/thesis/text/existing.tex

    r7d9598d8 r2dcd80a  
    22As stated, scheduling is the process of assigning resources to incoming requests, where the common example is assigning available workers to work requests or vice versa.
    33Common scheduling examples in Computer Science are: operating systems and hypervisors schedule available CPUs, NICs schedule available bandwidth, virtual memory and memory allocator schedule available storage, \etc.
    4 Scheduling is also common in most other fields, \eg in assembly lines, assigning parts to line workers is a form of scheduling.
     4Scheduling is also common in most other fields; \eg in assembly lines, assigning parts to line workers is a form of scheduling.
    55
    66In general, \emph{selecting} a scheduling algorithm depends on how much information is available to the scheduler.
     
    88A secondary aspect is how much information can be gathered versus how much information must be given as part of the scheduler input.
    99This information adds to the spectrum of scheduling algorithms, going from static schedulers that are well informed from the start, to schedulers that gather most of the information needed, to schedulers that can only rely on very limited information.
    10 Note, this description includes both information about each request, \eg time to complete or resources needed, and information about the relationships among requests, \eg whether some requests must be completed before another request starts.
    11 
    12 Scheduling physical resources, \eg in an assembly line, is generally amenable to using well-informed scheduling since information can be gathered much faster than the physical resources can be assigned and workloads are likely to stay stable for long periods.
    13 When a faster pace is needed and changes are much more frequent gathering information on workloads, up-front or live, can become much more limiting and more general schedulers are needed.
     10This description includes both information about each request, \eg time to complete or resources needed, and information about the relationships among requests, \eg whether some requests must be completed before another request starts.
     11
     12Scheduling physical resources, \eg in an assembly line, is generally amenable to using well-informed scheduling since information can be gathered much faster than the physical resources can be assigned, and workloads are likely to stay stable for long periods.
     13When a faster pace is needed and changes are much more frequent, then gathering information on workloads, up-front or live, can become much more limiting and more general schedulers are needed.
    1414
    1515\section{Naming Convention}
    16 Scheduling has been studied by various communities concentrating on different incarnations of the same problems.
     16Scheduling has been studied by various communities, concentrating on different incarnations of the same problems.
    1717As a result, there are no standard naming conventions for scheduling that are respected across these communities.
    1818This document uses the term \newterm{\Gls{at}} to refer to the abstract objects being scheduled and the term \newterm{\Gls{proc}} to refer to the concrete objects executing these \ats.
    1919
    2020\section{Static Scheduling}
    21 \newterm{Static schedulers} require \ats dependencies and costs to be explicitly and exhaustively specified prior to scheduling.
     21\newterm{Static schedulers} require \at dependencies and costs to be explicitly and exhaustively specified prior to scheduling.
    2222The scheduler then processes this input ahead of time and produces a \newterm{schedule} the system follows during execution.
    2323This approach is popular in real-time systems since the need for strong guarantees justifies the cost of determining and supplying this information.
     
    2727
    2828\section{Dynamic Scheduling}
    29 \newterm{Dynamic schedulers} determine \at dependencies and costs during scheduling, if at all.
    30 Hence, unlike static scheduling, \at dependencies are conditional and detected at runtime.
    31 This detection takes the form of observing new \ats in the system and determining dependencies from their behaviour, including suspending or halting a \at that dynamically detects unfulfilled dependencies.
     29\newterm{Dynamic schedulers} detect \at dependencies and costs during scheduling, if at all.
     30This detection takes the form of observing new \ats in the system and determining dependencies from their behaviour, where a \at suspends or halts dynamically when it detects unfulfilled dependencies.
    3231Furthermore, each \at has the responsibility of adding dependent \ats back into the system once dependencies are fulfilled.
    3332As a consequence, the scheduler often has an incomplete view of the system, seeing only \ats with no pending dependencies.
     
    3534\subsection{Explicitly Informed Dynamic Schedulers}
    3635While dynamic schedulers may not have an exhaustive list of dependencies for a \at, some information may be available about each \at, \eg expected duration, required resources, relative importance, \etc.
    37 When available, a scheduler can then use this information to direct the scheduling decisions.
     36When available, a scheduler can then use this information to direct scheduling decisions.
    3837For example, when scheduling in a cloud computing context, \ats will commonly have extra information that was manually entered, \eg caps on compute time or \io usage.
    3938However, in the context of user-level threading, most programmers do not determine or even \emph{predict} this information;
     
    4544
    4645\subsubsection{Priority Scheduling}
    47 Common information used by schedulers to direct their algorithm is priorities.
     46A common approach to direct the scheduling algorithm is to add information about \at priority.
    4847Each \at is given a priority, and higher-priority \ats are preferred to lower-priority ones.
    4948The simplest priority scheduling algorithm is to require that every \at have a distinct pre-established priority and always run the available \ats with the highest priority.
     
    8180\paragraph{Task Placement} Another aspect of work stealing that has been studied extensively is the mapping between \at and \proc.
    8281In its simplest form, work stealing assumes that all \procs are interchangeable and therefore the mapping between \at and \proc is not interesting.
    83 However, in real-life architectures there are contexts where different \procs can have different characteristics, which makes some mapping more interesting than others.
     82However, in real-life architectures, there are contexts where different \procs can have different characteristics, which makes some mappings more interesting than others.
    8483A common example where this is statically true is architectures with \glsxtrshort{numa}.
    8584In these cases, it can be relevant to change the scheduler to be cognizant of the topology~\cite{vikranth2013topology,min2011hierarchical}.
     
    8887\paragraph{Complex Machine Architecture} Another aspect that has been examined is how applicable work stealing is to different machine architectures.
    8988This is arguably strongly related to Task Placement but extends into more heterogeneous architectures.
    90 As \CFA offers no particular support for heterogeneous architecture, this is also an area that is less relevant to this thesis.
    91 Although it could be an interesting avenue for future work.
     89As \CFA offers no particular support for heterogeneous architectures, this is also an area that is not examined in this thesis.
      90However, support for concurrency across heterogeneous architectures is an interesting avenue for future work, at which point the literature on this topic and how it relates to scheduling will become relevant.
    9291
    9392\subsection{Theoretical Results}
    9493There is also a large body of research on the theoretical aspects of work stealing. These evaluate, for example, the cost of \glslink{atmig}{migration}~\cite{DBLP:conf/sigmetrics/SquillanteN91,DBLP:journals/pe/EagerLZ86}, how affinity affects performance~\cite{DBLP:journals/tpds/SquillanteL93,DBLP:journals/mst/AcarBB02,DBLP:journals/ipl/SuksompongLS16} and theoretical models for heterogeneous systems~\cite{DBLP:journals/jpdc/MirchandaneyTS90,DBLP:journals/mst/BenderR02,DBLP:conf/sigmetrics/GastG10}.
    95 \cite{DBLP:journals/jacm/BlellochGM99} examines the space bounds of work stealing and \cite{DBLP:journals/siamcomp/BerenbrinkFG03} shows that for under-loaded systems, the scheduler completes its computations in finite time, \ie is \newterm{stable}.
      94Blelloch et al.~\cite{DBLP:journals/jacm/BlellochGM99} examine the space bounds of work stealing, and \cite{DBLP:journals/siamcomp/BerenbrinkFG03} shows that for under-loaded systems, the scheduler completes its computations in finite time, \ie is \newterm{stable}.
    9695Others show that work stealing applies to various scheduling contexts~\cite{DBLP:journals/mst/AroraBP01,DBLP:journals/anor/TchiboukdjianGT13,DBLP:conf/isaac/TchiboukdjianGTRB10,DBLP:conf/ppopp/AgrawalLS10,DBLP:conf/spaa/AgrawalFLSSU14}.
    9796\cite{DBLP:conf/ipps/ColeR13} also studied how randomized work-stealing affects false sharing among \ats.
     
    104103Preemption is the idea of interrupting \ats that have been running too long, effectively injecting suspend points into the application.
    105104There are multiple techniques to achieve this effect, but they all aim to guarantee that the suspend points in a \at are never further apart than some fixed duration.
    106 While this helps schedulers guarantee that no \ats unfairly monopolize a worker, preemption can effectively be added to any scheduler.
     105This helps schedulers guarantee that no \ats unfairly monopolize a worker.
     106Preemption can effectively be added to any scheduler.
    107107Therefore, the only interesting aspect of preemption for the design of scheduling is whether to require it.
    108108
     
    110110This section presents a quick overview of several current schedulers.
    111111While these schedulers do not necessarily represent the most recent advances in scheduling, they are what is generally accessible to programmers.
    112 As such, I believe these schedulers are at least as relevant as those presented in published work.
    113 Both Schedulers that operate in kernel space and user space are considered, as both can offer relevant insight for this project.
    114 However, real-time schedulers are not considered, as these have constraints that are much stricter than what is needed for this project.
     112As such, I believe these schedulers are as relevant as those presented in published work.
      113Both kernel-space and user-space schedulers are considered, as both can offer relevant insight for this project.
     114However, real-time schedulers aim to guarantee bounded compute time in order to meet deadlines.
     115These deadlines lead to constraints much stricter than the starvation freedom that is needed for this project.
      116As such, real-time schedulers are not considered for this work.
    115117
    116118\subsection{Operating System Schedulers}
    117 Operating System Schedulers tend to be fairly complex as they generally support some amount of real time, aim to balance interactive and non-interactive \ats and support multiple users sharing hardware without requiring these users to cooperate.
     119Operating System Schedulers tend to be fairly complex, as they generally support some amount of real time, aim to balance interactive and non-interactive \ats and support multiple users sharing hardware without requiring these users to cooperate.
    118120Here are more details on a few schedulers used in the common operating systems: Linux, FreeBSD, Microsoft Windows and Apple's OS X.
    119121The information is less complete for closed source operating systems.
     
    137139It also periodically balances the load of the system (according to a different heuristic) but uses a simpler work stealing approach.
    138140
    139 \paragraph{Windows(OS)}
     141\paragraph{Windows (OS)}
    140142Microsoft's Operating System's Scheduler~\cite{MAN:windows/scheduler} is a feedback scheduler with priorities.
    141143It supports 32 levels of priorities, some of which are reserved for real-time and privileged applications.
     
     143145The scheduler may also temporarily adjust priorities after certain events, like the completion of I/O requests.
    144146
    145 In~\cite{russinovich2009windows}, Chapter 1 section 2.3 ``Processes, Threads, and Jobs'' discusses the scheduling policy more in-depth.
     147The scheduling policy is discussed more in-depth in~\cite{russinovich2009windows}, Chapter 1 section 2.3 ``Processes, Threads, and Jobs''.
    146148Multicore scheduling is based on a combination of priorities and \proc preference.
    147149Each \at is assigned an initial processor using a round-robin policy, called the \at's \newterm{ideal} \proc.
     
    168170\paragraph{Go}\label{GoSafePoint}
    169171Go's scheduler uses a randomized work-stealing algorithm that has a global run-queue (\emph{GRQ}) and each processor (\emph{P}) has both a fixed-size run-queue (\emph{LRQ}) and a high-priority next ``chair'' holding a single element~\cite{GITHUB:go,YTUBE:go}.
    170 Preemption is present, but only at safe points,~\cite{go:safepoints} which are detection code inserted at various frequent access boundaries.
     172Preemption is present, but only at safe points~\cite{go:safepoints}, which are detection code inserted at various frequent access boundaries.
    171173
     172174The algorithm is as follows:
    173175\begin{enumerate}
    174176        \item Once out of 61 times, pick 1 element from the \emph{GRQ}.
    175         \item If there is an item in the ``chair'' pick it.
      177        \item Otherwise, if there is an item in the ``chair'', pick it.
    176178        \item Else pick an item from the \emph{LRQ}.
    177179        \begin{itemize}
     
    180182        \end{itemize}
    181183\end{enumerate}
     184
      185Chapter~\ref{microbench} uses Go as one of its comparison points in this thesis's performance evaluation.
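The following is illustrative pseudocode only, not Go's actual implementation: it restates the pick order above for a hypothetical per-\proc structure @P@ with a local run-queue, a one-element ``chair'', and a shared global run-queue; the type, field, and helper names are invented for illustration, and the fall-back steps when the local run-queue is empty are elided.
\begin{lstlisting}[language=C]
// Illustrative pseudocode only, not Go's source: P, schedtick, grq_pop and
// lrq_pop are hypothetical names standing in for the structures described above.
struct thread * pick( struct P * p ) {
	if ( ++p->schedtick % 61 == 0 ) {          // once out of 61 times, check the GRQ
		struct thread * t = grq_pop();
		if ( t ) return t;
	}
	if ( p->next ) {                           // otherwise, take the "chair" if occupied
		struct thread * t = p->next;
		p->next = NULL;
		return t;
	}
	return lrq_pop( p );                       // else pick from the local run-queue
	// (fall-back steps when the LRQ is empty are elided)
}
\end{lstlisting}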
    182186
    183187\paragraph{Erlang}
     
    224228\paragraph{LibFibre}
    225229LibFibre~\cite{DBLP:journals/pomacs/KarstenB20} is a lightweight user-level threading framework developed at the University of Waterloo.
    226 Similarly to Go, it uses a variation of work stealing with a global queue that has a higher priority than stealing.
    227 Unlike Go, it does not have the high-priority next ``chair'' and does not use randomized work-stealing.
      230It bears a strong resemblance to Go: it uses a variation of work stealing with a global queue that has a higher priority than stealing.
     231Unlike Go, it does not have the high-priority next ``chair'' and its work-stealing is not randomized.
     232
      233Chapter~\ref{microbench} uses LibFibre as one of its comparison points in this thesis's performance evaluation.
  • doc/theses/thierry_delisle_PhD/thesis/text/front.tex

    r7d9598d8 r2dcd80a  
    8888        \\
    8989        Internal Member: \> Martin Karsten \\
    90         \> Associate Professor, School of Computer Science \\
     90        \> Professor, School of Computer Science \\
    9191        \> University of Waterloo \\
    9292\end{tabbing}
     
    108108% The following is a sample Declaration Page as provided by the GSO
    109109% December 13th, 2006.  It is designed for an electronic thesis.
     110\begin{center}\textbf{Author's Declaration}\end{center}
    110111\noindent
    111112I hereby declare that I am the sole author of this thesis. This is a true copy of the thesis, including any required final revisions, as accepted by my examiners.
     
    127128Indeed, over-partitioning into small work-units with user threading significantly eases load bal\-ancing, while simultaneously providing advanced synchronization and mutual exclusion capabilities.
    128129To manage these high levels of concurrency, the underlying runtime must efficiently schedule many user threads across a few kernel threads;
    129 which begs the question of how many kernel threads are needed and should the number be dynamically reevaluated.
      130which raises the questions of how many kernel threads are needed and whether that number should be dynamically reevaluated.
    130131Furthermore, scheduling must prevent kernel threads from blocking, otherwise user-thread parallelism drops.
    131 When user-threading parallelism does drop, how and when should idle \glspl{kthrd} be put to sleep to avoid wasting CPU resources.
     132When user-threading parallelism does drop, how and when should idle \glspl{kthrd} be put to sleep to avoid wasting CPU resources?
    132133Finally, the scheduling system must provide fairness to prevent a user thread from monopolizing a kernel thread;
    133134otherwise, other user threads can experience short/long term starvation or kernel threads can deadlock waiting for events to occur on busy kernel threads.
     
    195196% GLOSSARIES (Lists of definitions, abbreviations, symbols, etc. provided by the glossaries-extra package)
    196197% -----------------------------
    197 \printglossary[type=\acronymtype]
     198\printglossary[type=\acronymtype,title={List of Abbreviations}]
    198199\cleardoublepage
    199200\phantomsection         % allows hyperref to link to the correct page
  • doc/theses/thierry_delisle_PhD/thesis/text/intro.tex

    r7d9598d8 r2dcd80a  
    11\chapter{Introduction}\label{intro}
    22
    3 \Gls{uthrding} (M:N) is gaining popularity over kernel-level threading (1:1) in many programming languages.
     3\Gls{uthrding} (M:N) is gaining popularity over kernel-level threading (1:1) in many programming languages, like Go~\cite{GITHUB:go}, Java's Project Loom~\cite{MAN:project-loom} and Kotlin~\cite{MAN:kotlin}.
    44The user threading approach is often a better mechanism to express complex concurrent applications by efficiently running 10,000+ threads on multicore systems.
    55Indeed, over-partitioning into small work units with user threading significantly eases load bal\-ancing, while simultaneously providing advanced synchronization and mutual exclusion capabilities.
    66To manage these high levels of concurrency, the underlying runtime must efficiently schedule many user threads across a few kernel threads;
    7 which begs the question of how many kernel threads are needed and should the number be dynamically reevaluated.
      7which raises the questions of how many kernel threads are needed and whether that number should be dynamically reevaluated.
    88Furthermore, scheduling must prevent kernel threads from blocking, otherwise user-thread parallelism drops.
    9 When user-threading parallelism does drop, how and when should idle kernel-threads be put to sleep to avoid wasting CPU resources.
     9When user-threading parallelism does drop, how and when should idle kernel-threads be put to sleep to avoid wasting CPU resources?
    1010Finally, the scheduling system must provide fairness to prevent a user thread from monopolizing a kernel thread;
    1111otherwise, other user threads can experience short/long term starvation or kernel threads can deadlock waiting for events to occur on busy kernel threads.
     
    1616Fairness is handled through preemption and/or ad hoc solutions, which leads to coarse-grained fairness with some pathological cases.
    1717
    18 After examining, testing and selecting specific approaches to these scheduling issues, a completely new scheduler was created and tested in the \CFA (C-for-all) user-threading runtime system.
     18After examining, testing and selecting specific approaches to these scheduling issues, a new scheduler was created and tested in the \CFA (C-for-all) user-threading runtime system.
    1919The goal of the new scheduler is to offer increased safety and productivity without sacrificing performance.
    20 The quality of the new scheduler is demonstrated by comparing it with other user-threading work-stealing schedulers with, the aim of showing equivalent or better performance while offering better fairness.
     20The quality of the new scheduler is demonstrated by comparing it with other user-threading work-stealing schedulers with the aim of showing equivalent or better performance while offering better fairness.
    2121
    2222Chapter~\ref{intro} defines scheduling and its general goals.
     
    3232Computer systems share multiple resources across many threads of execution, even on single-user computers like laptops or smartphones.
    3333On a computer system with multiple processors and work units (routines, coroutines, threads, programs, \etc), there exists the problem of mapping many different kinds of work units onto many different kinds of processors efficiently, called \newterm{scheduling}.
    34 Scheduling systems are normally \newterm{open}, meaning new work arrives from an external source or is randomly spawned from an existing work unit.
     34Scheduling systems are normally \newterm{open}, meaning new work arrives from an external source or is randomly spawned from an existing work unit\footnote{
      35Open systems contrast with \newterm{closed} systems, where work never arrives from external sources.
      36This definition is an extension of open/closed systems in the field of thermodynamics.
     37}.
     38
    3539In general, work units without threads, like routines and coroutines, are self-scheduling, while work units with threads, like tasks and programs, are scheduled.
    3640For scheduled work-units, a scheduler takes a sequence of threads and attempts to run them to completion, subject to shared resource restrictions and utilization.
     
    9498\end{enumerate}
    9599
    96 Scheduling is a zero-sum game as computer processors normally have a fixed, maximum number of cycles per unit time.\footnote{
      100At a high level, scheduling is considered a zero-sum game, as computer processors normally have a fixed, maximum number of cycles per unit time.\footnote{
    97101Frequency scaling and turbo-boost add a degree of complexity that can be ignored in this discussion without loss of generality.}
    98 Hence, schedulers are a series of compromises, occasionally with some static or dynamic tuning parameters to enhance specific workload patterns.
     102This almost invariably leads to schedulers needing to pick some \ats over others, opening the door to fairness problems.
      103However, at a lower level, schedulers can make inefficient or incorrect decisions, leading to strictly worse outcomes than what the theoretical zero-sum game suggests.
     104Since it can be difficult to avoid these poor decisions, schedulers are generally a series of compromises, occasionally with some static or dynamic tuning parameters to enhance specific workload patterns.
    99105For example, SQMS has perfect load-balancing but poor affinity and high contention by the processors, because of the single queue.
     100106In contrast, MQMS has poor load-balancing but perfect affinity and no contention, because each processor has its own queue.
     
    113119Specifically, this work concentrates on advanced thread and \glsxtrshort{io} scheduling.
    114120Prior to this work, the \CFA runtime used a strict SQMS \gls{rQ} and provided no nonblocking \glsxtrshort{io} capabilities at the user-thread level.\footnote{
    115 C/\CC only support \glsxtrshort{io} capabilities at the kernel level, which means many \io operations block \glspl{kthrd} reducing parallelism at the user level.}
     121C/\CC only support \glsxtrshort{io} capabilities at the kernel level, which means many \io operations block \glspl{kthrd}, reducing parallelism at the user level.}
    116122
    117123Since \CFA attempts to improve the safety and productivity of C, the new scheduler presented in this thesis attempts to achieve the same goals.
    118 More specifically, safety and productivity for scheduling mean supporting a wide range of workloads so that programmers can rely on progress guarantees (safety) and more easily achieve acceptable performance (productivity).
     124More specifically, safety and productivity for scheduling mean supporting a wide range of workloads, so that programmers can rely on progress guarantees (safety) and more easily achieve acceptable performance (productivity).
    119125The new scheduler also includes support for implicit nonblocking \io, allowing applications to have more user-threads blocking on \io operations than there are \glspl{kthrd}.
    120126To complete the scheduler, an idle sleep mechanism is implemented that significantly reduces wasted CPU cycles, which are then available outside the application.
    121127
    122 As a research project, this work builds exclusively on newer versions of the Linux operating system and gcc/clang compilers.
     128As a research project, this work runs exclusively on newer versions of the Linux operating system and gcc/clang compilers.
    123129The new scheduler implementation uses several optimizations to successfully balance the cost of fairness against performance;
    124130some of these optimizations rely on interesting hardware optimizations only present on modern CPUs.
     
    138144An algorithm for load-balancing and idle sleep of processors, including NUMA awareness.
    139145\item
    140 A mechanism for adding fairness on top of MQMS algorithm through helping, used both for scalable scheduling algorithm and the user-level \glsxtrshort{io}.
      146A mechanism for adding fairness on top of the MQMS algorithm through helping, used both for the scalable scheduling algorithm and for user-level \glsxtrshort{io}.
    141147\item
    142148An optimization of the helping mechanism for load balancing to reduce scheduling costs.
  • doc/theses/thierry_delisle_PhD/thesis/text/io.tex

    r7d9598d8 r2dcd80a  
    11\chapter{User Level \io}\label{userio}
    22As mentioned in Section~\ref{prev:io}, user-level \io requires multiplexing the \io operations of many \ats onto fewer \glspl{proc} using asynchronous \io operations.
     3I/O operations, among others, generally block the \gls{kthrd} when the operation needs to wait for unavailable resources.
     4When using \gls{uthrding}, this results in the \proc blocking rather than the \at, hindering parallelism and potentially causing deadlocks (see Chapter~\ref{prev:io}).
    35Different operating systems offer various forms of asynchronous operations and, as mentioned in Chapter~\ref{intro}, this work is exclusively focused on the Linux operating system.
    46
     
    1618This mechanism is also crucial in determining when all \ats are blocked and the application \glspl{kthrd} can now block.
    1719
    18 There are three options to monitor file descriptors in Linux:\footnote{
     20There are three options to monitor file descriptors (FD) in Linux:\footnote{
    1921For simplicity, this section omits \lstinline{pselect} and \lstinline{ppoll}.
    2022The difference between these system calls and \lstinline{select} and \lstinline{poll}, respectively, is not relevant for this discussion.}
     
    2527\paragraph{\lstinline{select}} is the oldest of these options, and takes as input a contiguous array of bits, where each bit represents a file descriptor of interest.
    2628Hence, the array length must be as long as the largest FD currently of interest.
    27 On return, it outputs the set motified in place to identify which of the file descriptors changed state.
      29On return, the set is modified in-place to identify which of the file descriptors changed state.
    2830This destructive change means selecting in a loop requires re-initializing the array for each iteration.
    29 Another limit of @select@ is that calls from different \glspl{kthrd} sharing FDs are independent.
     31Another limitation of @select@ is that calls from different \glspl{kthrd} sharing FDs are independent.
    3032Hence, if one \gls{kthrd} is managing the select calls, other threads can only add/remove to/from the manager's interest set through synchronized calls to update the interest set.
    3133However, these changes are only reflected when the manager makes its next call to @select@.
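To make this destructive interface concrete, the following is a minimal sketch (mine, not taken from the thesis code) of a @select@ loop over two hypothetical descriptors @fd1@ and @fd2@; because the kernel overwrites the set in place, the interest set must be rebuilt on every iteration.
\begin{lstlisting}[language=C]
#include <sys/select.h>
// Minimal sketch: monitor two hypothetical descriptors for readability.
void wait_loop( int fd1, int fd2 ) {
	for ( ;; ) {
		fd_set rfds;
		FD_ZERO( &rfds );                          // re-initialize the interest set on
		FD_SET( fd1, &rfds );                      // every iteration, since select
		FD_SET( fd2, &rfds );                      // modifies the set in place
		int nfds = (fd1 > fd2 ? fd1 : fd2) + 1;
		if ( select( nfds, &rfds, NULL, NULL, NULL ) < 0 ) break;
		if ( FD_ISSET( fd1, &rfds ) ) { /* fd1 is ready for reading */ }
		if ( FD_ISSET( fd2, &rfds ) ) { /* fd2 is ready for reading */ }
	}
}
\end{lstlisting}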
     
    4648However, all three of these I/O systems have limitations.
    4749The @man@ page for @O_NONBLOCK@ mentions that ``[@O_NONBLOCK@] has no effect for regular files and block devices'', which means none of these three system calls are viable multiplexing strategies for these types of \io operations.
    48 Furthermore, TTYs can also be tricky to use since they can take different forms based on how the command is executed.
      50Furthermore, TTYs (FDs connected to standard input and output) can also be tricky to use since they can take different forms based on how the command is executed.
    4951For example, @epoll@ rejects FDs pointing to regular files or block devices, which includes @stdin@ when using shell redirections~\cite[\S~3.6]{MAN:bash}, but does not reject shell pipelines~\cite[\S~3.2.3]{MAN:bash}, which includes pipelines into @stdin@.
    5052Finally, none of these are useful solutions for multiplexing \io operations that do not have a corresponding file descriptor and can be awkward for operations using multiple file descriptors.
     
    5254\subsection{POSIX asynchronous I/O (AIO)}
    5355An alternative to @O_NONBLOCK@ is the AIO interface.
    54 Its interface lets programmers enqueue operations to be performed asynchronously by the kernel.
    55 Completions of these operations can be communicated in various ways: either by spawning a new \gls{kthrd}, sending a Linux signal, or polling for completion of one or more operations.
    56 For this work, spawning a new \gls{kthrd} is counterproductive but a related solution is discussed in Section~\ref{io:morethreads}.
    57 Using interrupt handlers can also lead to fairly complicated interactions between subsystems and has a non-trivial cost.
    58 Leaving polling for completion, which is similar to the previous system calls.
    59 AIO only supports read and write operations to file descriptors, it does not have the same limitation as @O_NONBLOCK@, \ie, the file descriptors can be regular files and blocked devices.
    60 It also supports batching multiple operations in a single system call.
    61 
    62 AIO offers two different approaches to polling: @aio_error@ can be used as a spinning form of polling, returning @EINPROGRESS@ until the operation is completed, and @aio_suspend@ can be used similarly to @select@, @poll@ or @epoll@, to wait until one or more requests have been completed.
    63 For \io multiplexing, @aio_suspend@ is the best interface.
    64 However, even if AIO requests can be submitted concurrently, @aio_suspend@ suffers from the same limitation as @select@ and @poll@, \ie, the interest set cannot be dynamically changed while a call to @aio_suspend@ is in progress.
     56Using AIO, programmers can enqueue operations which are to be performed
     57asynchronously by the kernel.
     58The kernel can communicate
     59completions of these operations in three ways:
     60it can spawn a new \gls{kthrd}; send a Linux signal; or
     61userspace can poll for completion of one or more operations.
      62Spawning a new \gls{kthrd} is not consistent with working at the user-thread level, but Section~\ref{io:morethreads} discusses a related solution.
     63Signals and their associated interrupt handlers can also lead to fairly complicated
     64interactions between subsystems, and they have a non-trivial cost.
     65This leaves a single option: polling for completion---this is similar to the previously discussed
     66system calls.
      67While AIO only supports read and write operations to file descriptors, it does not have the same limitations as @O_NONBLOCK@, \ie, the file
     68descriptors can be regular files or block devices.
     69AIO also supports batching multiple operations in a single system call.
     70
     71AIO offers two different approaches to polling: @aio_error@ can be used as a spinning form of polling, returning @EINPROGRESS@ until the operation is completed, while @aio_suspend@ can be used similarly to @select@, @poll@ or @epoll@, to wait until one or more requests have been completed.
     72Asynchronous interfaces normally handle more of the complexity than retry-based interfaces, which is convenient for \io multiplexing.
     73However, even if AIO requests can be submitted concurrently, @aio_suspend@ suffers from the same limitation as @select@ and @poll@: the interest set cannot be dynamically changed while a call to @aio_suspend@ is in progress.
    6574AIO also suffers from the limitation of specifying which requests have been completed, \ie programmers have to poll each request in the interest set using @aio_error@ to identify the completed requests.
    6675This limitation means that, like @select@ and @poll@ but not @epoll@, the time needed to examine polling results increases based on the total number of requests monitored, not the number of completed requests.
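As an illustration of this polling pattern, the following minimal sketch (not from the thesis) enqueues a single read with @aio_read@ on a hypothetical descriptor @fd@, blocks with @aio_suspend@, and uses @aio_error@/@aio_return@ to detect and retrieve the completion; with many outstanding requests, it is the @aio_error@ scan over the whole interest set that grows with the number of monitored requests.
\begin{lstlisting}[language=C]
#include <aio.h>
#include <errno.h>
// Minimal sketch: one asynchronous read on a hypothetical descriptor fd.
ssize_t async_read( int fd, char * buf, size_t len ) {
	struct aiocb cb = { 0 };
	cb.aio_fildes = fd;
	cb.aio_buf = buf;
	cb.aio_nbytes = len;
	cb.aio_offset = 0;
	if ( aio_read( &cb ) < 0 ) return -1;          // enqueue the operation
	const struct aiocb * list[1] = { &cb };
	while ( aio_error( &cb ) == EINPROGRESS ) {    // poll this request for completion
		aio_suspend( list, 1, NULL );              // block until some request completes
	}
	return aio_return( &cb );                      // retrieve the result, like read()
}
\end{lstlisting}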
     
     92101A very recent addition to Linux, @io_uring@~\cite{MAN:io_uring}, is a framework that aims to solve many of the problems present in the above interfaces.
    93102Like AIO, it represents \io operations as entries added to a queue.
    94 But like @epoll@, new requests can be submitted, while a blocking call waiting for requests to complete, is already in progress.
    95 The @io_uring@ interface uses two ring buffers (referred to simply as rings) at its core: a submit ring to which programmers push \io requests and a completion ring from which programmers poll for completion.
     103But like @epoll@, new requests can be submitted while a blocking call waiting for requests to complete is already in progress.
     104The @io_uring@ interface uses two ring buffers (referred to simply as rings) at its core: a submit ring, to which programmers push \io requests, and a completion ring, from which programmers poll for completion.
    96105
    97106One of the big advantages over the prior interfaces is that @io_uring@ also supports a much wider range of operations.
    98 In addition to supporting reads and writes to any file descriptor like AIO, it supports other operations like @open@, @close@, @fsync@, @accept@, @connect@, @send@, @recv@, @splice@, \etc.
     107In addition to supporting reads and writes to any file descriptor like AIO, it also supports other operations, like @open@, @close@, @fsync@, @accept@, @connect@, @send@, @recv@, @splice@, \etc.
    99108
    100109On top of these, @io_uring@ adds many extras like avoiding copies between the kernel and user space using shared memory, allowing different mechanisms to communicate with device drivers, and supporting chains of requests, \ie, requests that automatically trigger follow-up requests on completion.
     
    106115This approach is used by languages like Go~\cite{GITHUB:go}, frameworks like libuv~\cite{libuv}, and web servers like Apache~\cite{apache} and NGINX~\cite{nginx}, since it has the advantage that it can easily be used across multiple operating systems.
    107116This advantage is especially relevant for languages like Go, which offer a homogeneous \glsxtrshort{api} across all platforms.
    108 As opposed to C, which has a very limited standard \glsxtrshort{api} for \io, \eg, the C standard library has no networking.
     117Contrast this to C, which has a very limited standard \glsxtrshort{api} for \io, \eg, the C standard library has no networking.
    109118
    110119\subsection{Discussion}
    111 These options effectively fall into two broad camps: waiting for \io to be ready versus waiting for \io to complete.
    112 All operating systems that support asynchronous \io must offer an interface along one of these lines, but the details vary drastically.
    113 For example, Free BSD offers @kqueue@~\cite{MAN:bsd/kqueue}, which behaves similarly to @epoll@, but with some small quality of use improvements, while Windows (Win32)~\cite{win:overlap} offers ``overlapped I/O'', which handles submissions similarly to @O_NONBLOCK@ with extra flags on the synchronous system call, but waits for completion events, similarly to @io_uring@.
    114 
    115 For this project, I selected @io_uring@, in large parts because of its generality.
     120These options effectively fall into two broad camps: waiting for \io to be ready, versus waiting for \io to complete.
     121All operating systems that support asynchronous \io must offer an interface along at least one of these lines, but the details vary drastically.
     122For example, FreeBSD offers @kqueue@~\cite{MAN:bsd/kqueue}, which behaves similarly to @epoll@, but with some small quality of life improvements, while Windows (Win32)~\cite{win:overlap} offers ``overlapped I/O'', which handles submissions similarly to @O_NONBLOCK@ with extra flags on the synchronous system call, but waits for completion events, similarly to @io_uring@.
     123
     124For this project, I selected @io_uring@, in large part because of its generality.
     116125While @epoll@ has been shown to be a good solution for socket \io (\cite{Karsten20}), @io_uring@'s transparent support for files, pipes, and more complex operations, like @splice@ and @tee@, makes it a better choice as the foundation for a general \io subsystem.
    117126
    118127\section{Event-Engine}
    119128An event engine's responsibility is to use the kernel interface to multiplex many \io operations onto few \glspl{kthrd}.
    120 In concrete terms, this means \ats enter the engine through an interface, the event engine then starts an operation and parks the calling \ats, returning control to the \gls{proc}.
      129In concrete terms, this means \ats enter the engine through an interface; the event engine then starts an operation, parks the calling \ats, and returns control to the \gls{proc}.
    121130The parked \ats are then rescheduled by the event engine once the desired operation has been completed.
    122131
     
    125134Figure~\ref{fig:iouring} shows an overview of an @io_uring@ instance.
    126135Two ring buffers are used to communicate with the kernel: one for submissions~(left) and one for completions~(right).
    127 The submission ring contains entries, \newterm{Submit Queue Entries} (SQE), produced (appended) by the application when an operation starts and then consumed by the kernel.
    128 The completion ring contains entries, \newterm{Completion Queue Entries} (CQE), produced (appended) by the kernel when an operation completes and then consumed by the application.
     136The submission ring contains \newterm{Submit Queue Entries} (SQE), produced (appended) by the application when an operation starts and then consumed by the kernel.
     137The completion ring contains \newterm{Completion Queue Entries} (CQE), produced (appended) by the kernel when an operation completes and then consumed by the application.
    129138The submission ring contains indexes into the SQE array (denoted \emph{S} in the figure) containing entries describing the I/O operation to start;
    130139the completion ring contains entries for the completed I/O operation.
     
    134143        \centering
    135144        \input{io_uring.pstex_t}
    136         \caption[Overview of \lstinline{io_uring}]{Overview of \lstinline{io_uring} \smallskip\newline Two ring buffers are used to communicate with the kernel, one for completions~(right) and one for submissions~(left). The submission ring indexes into a pre-allocated array (denoted \emph{S}) instead.}
     145        \caption[Overview of \lstinline{io_uring}]{Overview of \lstinline{io_uring} \smallskip\newline Two ring buffers are used to communicate with the kernel, one for completions~(right) and one for submissions~(left).
     146        While the completion ring contains plain data, the submission ring contains only references.
     147        These references are indexes into an array (denoted \emph{S}), which is created at the same time as the two rings and is also readable by the kernel.}
    137148        \label{fig:iouring}
    138149\end{figure}
     
    153164Since the head is visible to the kernel, some memory barriers may be required to prevent the compiler from reordering these operations.
    154165Since the submission ring is a regular ring buffer, more than one SQE can be added at once and the head is updated only after all entries are updated.
    155 Note, SQE can be filled and submitted in any order, \eg in Figure~\ref{fig:iouring} the submission order is S0, S3, S2 and S1 has not been submitted.
      166Note, SQEs can be filled and submitted in any order, \eg in Figure~\ref{fig:iouring} the submission order is S0, S3, S2. S1 has not been submitted.
    156167\item
    157168The kernel is notified of the change to the ring using the system call @io_uring_enter@.
    158169The number of elements appended to the submission ring is passed as a parameter and the number of elements consumed is returned.
    159 The @io_uring@ instance can be constructed so this step is not required, but this requires elevated privilege.% and an early version of @io_uring@ had additional restrictions.
     170The @io_uring@ instance can be constructed so this step is not required, but this feature requires that the process have elevated privilege.% and an early version of @io_uring@ had additional restrictions.
    160171\end{enumerate}
    161172
     
    165176When operations do complete, the kernel appends a CQE to the completion ring and advances the head of the ring.
    166177Each CQE contains the result of the operation as well as a copy of the @user_data@ field of the SQE that triggered the operation.
    167 It is not necessary to call @io_uring_enter@ to get new events because the kernel can directly modify the completion ring.
    168 The system call is only needed if the application wants to block waiting for operations to complete.
     178The @io_uring_enter@ system call is only needed if the application wants to block waiting for operations to complete or to flush the submission ring.
     179@io_uring@ supports option @IORING_SETUP_SQPOLL@ at creation, which can remove the need for the system call for submissions.
    169180\end{sloppypar}
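The following minimal sketch (not the \CFA event engine) illustrates this submission/completion flow using the liburing helper library rather than manipulating the rings by hand; the ring, descriptor and buffer are assumed to be set up elsewhere, and a single blocking @io_uring_wait_cqe@ stands in for the engine's completion polling.
\begin{lstlisting}[language=C]
#include <liburing.h>
// Minimal sketch: submit one read and wait for its completion.
ssize_t uring_read( struct io_uring * ring, int fd, char * buf, unsigned len ) {
	struct io_uring_sqe * sqe = io_uring_get_sqe( ring );  // claim a free SQE
	if ( sqe == NULL ) return -1;                          // submission ring is full
	io_uring_prep_read( sqe, fd, buf, len, 0 );            // fill in the operation
	io_uring_sqe_set_data( sqe, buf );                     // user_data, copied into the CQE
	io_uring_submit( ring );                               // publish the SQE and call io_uring_enter

	struct io_uring_cqe * cqe;
	io_uring_wait_cqe( ring, &cqe );                       // block until a CQE is available
	ssize_t res = cqe->res;                                // result of the operation
	io_uring_cqe_seen( ring, cqe );                        // consume the CQE
	return res;
}
\end{lstlisting}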
    170181
     
    179190This restriction means \io request bursts may have to be subdivided and submitted in chunks at a later time.
    180191
    181 An important detail to keep in mind is that just like ``The cloud is just someone else's computer''\cite{xkcd:cloud}, asynchronous operations are just operations using someone else's threads.
     192An important detail to keep in mind is that just like ``The cloud is just someone else's computer''~\cite{xkcd:cloud}, asynchronous operations are just operations using someone else's threads.
    182193Indeed, asynchronous operations can require computation time to complete, which means that if this time is not taken from the thread that triggered the asynchronous operation, it must be taken from some other threads.
    183194In this case, the @io_uring@ operations that cannot be handled directly in the system call must be delegated to some other \gls{kthrd}.
    184195To this end, @io_uring@ maintains multiple \glspl{kthrd} inside the kernel that are not exposed to the user.
    185 Three kinds of operations that can need the \glspl{kthrd}:
     196Three kinds of operations that can need the \glspl{kthrd} are:
    186197
    187198\paragraph{Operations using} @IOSQE_ASYNC@.
     
    190201\paragraph{Bounded operations.}
    191202This is also a fairly simple case. As mentioned earlier in this chapter, [@O_NONBLOCK@] has no effect for regular files and block devices.
    192 @io_uring@ must also take this reality into account by delegating operations on regular files and block devices.
     203Therefore, @io_uring@ handles this case by delegating operations on regular files and block devices.
    193204In fact, @io_uring@ maintains a pool of \glspl{kthrd} dedicated to these operations, which are referred to as \newterm{bounded workers}.
    194205
    195206\paragraph{Unbounded operations that must be retried.}
     196207Operations like reads on sockets can return @EAGAIN@ instead of blocking the \gls{kthrd}; when they do return @EAGAIN@, @io_uring@ must retry them once the data is available on the socket.
    197 Since this retry cannot necessarily be done in the system call, @io_uring@ must delegate these calls to a \gls{kthrd}.
     208Since this retry cannot necessarily be done in the system call, \ie, using the application's \gls{kthrd}, @io_uring@ must delegate these calls to \glspl{kthrd} in the kernel.
    198209@io_uring@ maintains a separate pool for these operations.
    199210The \glspl{kthrd} in this pool are referred to as \newterm{unbounded workers}.
      211Once unbounded operations are ready to be retried, one of these workers is woken up and handles the retry inside the kernel.
    200212Unbounded workers are also responsible for handling operations using @IOSQE_ASYNC@.
    201213
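
For illustration only, forcing an operation onto these workers amounts to setting a single flag on the SQE; the sketch below uses the liburing helpers, outside the \CFA runtime, and the function name is hypothetical.
\begin{cfa}
// Illustrative fragment: force an operation to be handled by the worker threads by
// setting IOSQE_ASYNC on the SQE, so the kernel never attempts it inline.
#include <liburing.h>

static void submit_forced_async_read( struct io_uring * ring, int fd, void * buf, unsigned len ) {
	struct io_uring_sqe * sqe = io_uring_get_sqe( ring );
	io_uring_prep_read( sqe, fd, buf, len, 0 );
	io_uring_sqe_set_flags( sqe, IOSQE_ASYNC );        // hand off directly to an unbounded worker
	io_uring_submit( ring );
}
\end{cfa}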
     
    212224however, the duration of the system call scales with the number of entries submitted.
    213225The consequence is that the amount of parallelism used to prepare submissions for the next system call is limited.
    214 Beyond this limit, the length of the system call is the throughput limiting factor.
     226Beyond this limit, the length of the system call is the throughput-limiting factor.
    215227I concluded from early experiments that preparing submissions seems to take almost as long as the system call itself, which means that with a single @io_uring@ instance, there is no benefit in terms of \io throughput to having more than two \glspl{hthrd}.
    216228Therefore, the design of the submission engine must manage multiple instances of @io_uring@ running in parallel, effectively sharding @io_uring@ instances.
    217229Since completions are sent to the instance where requests were submitted, all instances with pending operations must be polled continuously\footnote{
    218 As described in Chapter~\ref{practice}, this does not translate into constant CPU usage.}.
    219 Note that once an operation completes, there is nothing that ties it to the @io_uring@ instance that handled it.
    220 Nothing preventing a new operation, with for example the same file descriptor, to use a different @io_uring@ instance.
     230As described in Chapter~\ref{practice}, this does not translate into high CPU usage.}.
      231Note that once an operation completes, there is nothing that ties it to the @io_uring@ instance that handled it; nothing prevents a new operation, for example on the same file descriptor, from using a different @io_uring@ instance.
    221232
    222233A complicating aspect of submission is @io_uring@'s support for chains of operations, where the completion of an operation triggers the submission of the next operation on the link.
    223234SQEs forming a chain must be allocated from the same instance and must be contiguous in the Submission Ring (see Figure~\ref{fig:iouring}).
    224235The consequence of this feature is that filling SQEs can be arbitrarily complex, and therefore, users may need to run arbitrary code between allocation and submission.
    225 Supporting chains is not a requirement of the \io subsystem, but it is still valuable.
     236For this work, supporting chains is not a requirement of the \CFA \io subsystem, but it is still valuable.
    226237Support for this feature can be fulfilled simply by supporting arbitrary user code between allocation and submission.
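
Purely to illustrate what a chain looks like at the @io_uring@ level (again using the liburing helpers, outside the \CFA runtime), the following fragment links a read and a write so the write starts only after the read completes; the function name is hypothetical.
\begin{cfa}
// Illustrative fragment: a two-operation chain. Linked SQEs must come from the same
// instance and be contiguous in the submission ring, as discussed above.
#include <liburing.h>

static void copy_chunk( struct io_uring * ring, int src, int dst, void * buf, unsigned len ) {
	struct io_uring_sqe * rd = io_uring_get_sqe( ring );
	io_uring_prep_read( rd, src, buf, len, 0 );
	io_uring_sqe_set_flags( rd, IOSQE_IO_LINK );       // link this SQE to the next one
	struct io_uring_sqe * wr = io_uring_get_sqe( ring );
	io_uring_prep_write( wr, dst, buf, len, 0 );       // note: always writes len bytes, even on a short read
	io_uring_submit( ring );                           // both SQEs are submitted together
}
\end{cfa}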
    227238
     
    237248To remove this requirement, a \at needs the ability to ``yield to a specific \gls{proc}'', \ie, \park with the guarantee it unparks on a specific \gls{proc}, \ie the \gls{proc} attached to the correct ring.}
    238249From the subsystem's point of view, the allocation and submission are sequential, greatly simplifying both.
    239 In this design, allocation and submission form a partitioned ring buffer as shown in Figure~\ref{fig:pring}.
     250In this design, allocation and submission form a partitioned ring buffer, as shown in Figure~\ref{fig:pring}.
    240251Once added to the ring buffer, the attached \gls{proc} has a significant amount of flexibility with regard to when to perform the system call.
    241252Possible options are: when the \gls{proc} runs out of \ats to run, after running a given number of \ats, \etc.
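
The following is a simplified sketch of the index arithmetic behind such a partitioned ring buffer; the names and layout are illustrative rather than the actual \CFA data structure.
\begin{cfa}
// Simplified sketch: entries between `submitted' and `allocated' form the private partition
// being filled; advancing `submitted' makes them visible to the kernel, which treats the
// partition boundary as the head of the ring.
#include <stdbool.h>

struct partitioned_ring {
	unsigned consumed;     // oldest entry not yet consumed by the kernel
	unsigned submitted;    // partition boundary: entries before it are visible to the kernel
	unsigned allocated;    // next free slot
	unsigned capacity;     // assumed to be a power of two for the masking below
};

static inline bool ring_alloc( struct partitioned_ring * r, unsigned * idx ) {
	if ( r->allocated - r->consumed == r->capacity ) return false;   // ring is full
	*idx = r->allocated++ & (r->capacity - 1);                       // append to the private partition
	return true;
}

static inline void ring_submit_all( struct partitioned_ring * r ) {
	r->submitted = r->allocated;       // advance the partition; the system call can now flush it
}
\end{cfa}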
     
    244255        \centering
    245256        \input{pivot_ring.pstex_t}
    246         \caption[Partitioned ring buffer]{Partitioned ring buffer \smallskip\newline Allocated sqes are appended to the first partition.
     257        \caption[Partitioned ring buffer]{Partitioned ring buffer \smallskip\newline Allocated SQEs are appended to the first partition.
    247258        When submitting, the partition is advanced.
    248259        The kernel considers the partition as the head of the ring.}
     
    253264However, this benefit means \ats submitting \io operations have less flexibility: they cannot \park or yield, and several exceptional cases are handled poorly.
    254265Instances running out of SQEs cannot run \ats wanting to do \io operations.
    255 In this case, the \io \at needs to be moved to a different \gls{proc}, and the only current way of achieving this is to @yield()@ hoping to be scheduled on a different \gls{proc} with free SQEs, which is not guaranteed.
      266In this case, the \io \at needs to be moved to a different \gls{proc}, and the only current way of achieving this is to @yield()@, hoping to be scheduled on a different \gls{proc} with free SQEs, which is not guaranteed to ever occur.
    256267
    257268A more involved version of this approach tries to solve these problems using a pattern called \newterm{helping}.
    258 \ats that cannot submit \io operations, either because of an allocation failure or \glslink{atmig}{migration} to a different \gls{proc} between allocation and submission, create an \io object and add it to a list of pending submissions per \gls{proc} and a list of pending allocations, probably per cluster.
     269\Glspl{at} that cannot submit \io operations, either because of an allocation failure or \glslink{atmig}{migration} to a different \gls{proc} between allocation and submission, create an \io object and add it to a list of pending submissions per \gls{proc} and a list of pending allocations, probably per cluster.
    259270While there is still a strong coupling between \glspl{proc} and @io_uring@ instances, these data structures allow moving \ats to a specific \gls{proc}, when the current \gls{proc} cannot fulfill the \io request.
    260271
     
    263274In this case, the helping solution has the \io \at append an \io object to the submission list of the first \gls{proc}, where the allocation was made.
    264275No other \gls{proc} can help the \at since @io_uring@ instances are strongly coupled to \glspl{proc}.
    265 However, the \io \gls{proc} is unable to help because it is executing the spinning \at resulting in a deadlock.
     276However, the \io \gls{proc} is unable to help because it is executing the spinning \at.
     277This results in a deadlock.
    266278While this example is artificial, in the presence of many \ats, this problem can arise ``in the wild''.
    267279Furthermore, this pattern is difficult to reliably detect and avoid.
     
    274286\subsubsection{Public Instances}
    275287The public approach creates decoupled pools of @io_uring@ instances and processors, \ie without one-to-one coupling.
    276 \ats attempting an \io operation pick one of the available instances and submit the operation to that instance.
     288\Glspl{at} attempting an \io operation pick one of the available instances and submit the operation to that instance.
    277289Since there is no coupling between @io_uring@ instances and \glspl{proc} in this approach, \ats running on more than one \gls{proc} can attempt to submit to the same instance concurrently.
    278290Because @io_uring@ effectively sets the amount of sharding needed to avoid contention on its internal locks, performance in this approach is based on two aspects:
     
    282294\item
    283295The scheme to route \io requests to specific @io_uring@ instances does not introduce contention.
    284 This aspect has oversized importance because it comes into play before the sharding of instances, and as such, all \glspl{hthrd} can contend on the routing algorithm.
     296This aspect is very important because it comes into play before the sharding of instances, and as such, all \glspl{hthrd} can contend on the routing algorithm.
    285297\end{itemize}
    286298
    287299Allocation in this scheme is fairly easy.
    288 Free SQEs, \ie, SQEs that are not currently being used to represent a request, can be written-to safely and have a field called @user_data@ that the kernel only reads to copy to CQEs.
      300Free SQEs, \ie, SQEs that are not currently being used to represent a request, can be written to safely, and have a field called @user_data@ that the kernel only reads to copy to CQEs.
    289301Allocation also does not require ordering guarantees as all free SQEs are interchangeable.
    290302The only added complexity is that the number of SQEs is fixed, which means allocation can fail.
     
    312324Since CQEs only own a signed 32-bit result, in addition to the copy of the @user_data@ field, all that is needed to communicate the result is a simple future~\cite{wiki:future}.
    313325If the submission side does not designate submitters, polling can also submit all SQEs as it is polling events.
    314 A simple approach to polling is to allocate a \at per @io_uring@ instance and simply let the poller \ats poll their respective instances when scheduled.
    315 
    316 With the pool of SQE instances approach, the big advantage is that it is fairly flexible.
     326A simple approach to polling is to allocate a user-level \at per @io_uring@ instance and simply let the poller \ats poll their respective instances when scheduled.
     327
     328The big advantage of the pool of SQE instances approach is that it is fairly flexible.
    317329It does not impose restrictions on what \ats submitting \io operations can and cannot do between allocations and submissions.
    318330It also can gracefully handle running out of resources, SQEs or the kernel returning @EBUSY@.
     
     320332The routing and allocation algorithm needs to keep track of which ring instances have available SQEs, block incoming requests if no instance is available, prevent barging if \ats are already queued up waiting for SQEs, and handle SQEs being freed.
    321333The submission side needs to safely append SQEs to the ring buffer, correctly handle chains, make sure no SQE is dropped or left pending forever, notify the allocation side when SQEs can be reused, and handle the kernel returning @EBUSY@.
    322 Compared to the private-instance approach, all this synchronization has a significant cost and this synchronization is entirely overhead.
      334All this synchronization has a significant cost compared to the private-instance approach, which avoids synchronization costs in most cases.
    323335
    324336\subsubsection{Instance borrowing}
    325337Both of the prior approaches have undesirable aspects that stem from tight or loose coupling between @io_uring@ and \glspl{proc}.
    326 The first approach suffers from tight coupling causing problems when a \gls{proc} does not benefit from the coupling.
    327 The second approach suffers from loose couplings causing operations to have synchronization overhead, which tighter coupling avoids.
     338The first approach suffers from tight coupling, causing problems when a \gls{proc} does not benefit from the coupling.
      339The second approach suffers from loose coupling, causing operations to have synchronization overhead, which tighter coupling avoids.
    328340When \glspl{proc} are continuously issuing \io operations, tight coupling is valuable since it avoids synchronization costs.
    329341However, in unlikely failure cases or when \glspl{proc} are not using their instances, tight coupling is no longer advantageous.
     
    332344While instance borrowing looks similar to work sharing and stealing, I think it is different enough to warrant a different verb to avoid confusion.}
    333345
      346As mentioned later in this section, this approach is not ultimately used, but a high-level outline of the algorithm is still presented here.
    334347In this approach, each cluster, see Figure~\ref{fig:system}, owns a pool of @io_uring@ instances managed by an \newterm{arbiter}.
    335 When a \at attempts to issue an \io operation, it ask for an instance from the arbiter and issues requests to that instance.
     348When a \at attempts to issue an \io operation, it asks for an instance from the arbiter, and issues requests to that instance.
    336349This instance is now bound to the \gls{proc} the \at is running on.
    337350This binding is kept until the arbiter decides to revoke it, taking back the instance and reverting the \gls{proc} to its initial \io state.
     
    343356        \item The current \gls{proc} does not hold an instance.
    344357        \item The current instance does not have sufficient SQEs to satisfy the request.
    345         \item The current \gls{proc} has a wrong instance, this happens if the submitting \at context-switched between allocation and submission, called \newterm{external submissions}.
      358        \item The current \gls{proc} has the wrong instance.
      359        This happens if the submitting \at context-switched between allocation and submission, a case called \newterm{external submissions}.
    346360\end{enumerate}
    347361However, even when the arbiter is not directly needed, \glspl{proc} need to make sure that their instance ownership is not being revoked, which is accomplished by a lock-\emph{less} handshake.\footnote{
    348 Note the handshake is not lock-\emph{free} since it lacks the proper progress guarantee.}
     362Note the handshake is not lock-\emph{free}~\cite{wiki:lockfree} since it lacks the proper progress guarantee.}
    349363A \gls{proc} raises a local flag before using its borrowed instance and checks if the instance is marked as revoked or if the arbiter has raised its flag.
     350364If not, it proceeds; otherwise, it delegates the operation to the arbiter.
     
    365379However, there is no need to immediately revoke the instance.
    366380External submissions must simply be added to the ring before the next system call, \ie, when the submission ring is flushed.
    367 This means whoever is responsible for the system call, first checks if the instance has any external submissions.
     381This means whoever is responsible for the system call first checks whether the instance has any external submissions.
    368382If so, it asks the arbiter to revoke the instance and add the external submissions to the ring.
    369383
     
    382396
    383397\section{Interface}
    384 The last important part of the \io subsystem is its interface.
     398The final part of the \io subsystem is its interface.
    385399Multiple approaches can be offered to programmers, each with advantages and disadvantages.
    386 The new \io subsystem can replace the C runtime API or extend it, and in the latter case, the interface can go from very similar to vastly different.
    387 The following sections discuss some useful options using @read@ as an example.
     400The new \CFA \io subsystem can replace the C runtime API or extend it, and in the latter case, the interface can go from very similar to vastly different.
     401The following sections discuss some useful options, using @read@ as an example.
    388402The standard Linux interface for C is:
    389403\begin{cfa}
     
    392406
    393407\subsection{Replacement}
    394 Replacing the C \glsxtrshort{api} is the more intrusive and draconian approach.
    395 The goal is to convince the compiler and linker to replace any calls to @read@ to direct them to the \CFA implementation instead of glibc's.
    396 This rerouting has the advantage of working transparently and supporting existing binaries without needing recompilation.
    397 It also offers a, presumably, well known and familiar API that C programmers can simply continue to work with.
    398 However, this approach also entails a plethora of subtle technical challenges, which generally boils down to making a perfect replacement.
    399 If the \CFA interface replaces only \emph{some} of the calls to glibc, then this can easily lead to esoteric concurrency bugs.
    400 Since the gcc ecosystem does not offer a scheme for perfect replacement, this approach was rejected as being laudable but infeasible.
     408Replacing the C \io subsystem is the more intrusive and draconian approach.
     409The goal is to convince the compiler and linker to replace any calls to @read@ by calls to the \CFA implementation instead of glibc's.
     410This rerouting has the advantage of working transparently and supporting existing binaries without necessarily needing recompilation.
     411It also offers a presumably well known and familiar API that C programmers can simply continue to work with.
     412%However, this approach also entails a plethora of subtle technical challenges, which generally boil down to making a perfect replacement.
      413However, when using this approach, any and all calls to the C \io subsystem must be replaced, since using a mix of the C and \CFA \io subsystems can easily lead to esoteric concurrency bugs.
     414This approach was rejected as being laudable but infeasible.
    401415
    402416\subsection{Synchronous Extension}
    403417Another interface option is to offer an interface different in name only.
      418In this approach, an alternative call is created for each supported system call.
    404419For example:
    405420\begin{cfa}
    406421ssize_t cfa_read(int fd, void *buf, size_t count);
    407422\end{cfa}
      423The new @cfa_read@ would have the same interface, behaviour, and guarantees as the @read@ system call, but allow the runtime system to use user-level blocking instead of kernel-level blocking.
     424
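A hypothetical usage sketch follows; the call site is identical to @read@ apart from the name, and any blocking happens at user level inside the \CFA runtime rather than in the kernel. The helper function is illustrative only.
\begin{cfa}
// Hypothetical usage sketch: only the call name changes compared to read; the calling
// thread parks in the runtime while the underlying kernel thread keeps running other work.
#include <sys/types.h>
#include <stddef.h>

extern ssize_t cfa_read( int fd, void * buf, size_t count );

static ssize_t read_fully( int fd, char * buf, size_t len ) {
	size_t done = 0;
	while ( done < len ) {
		ssize_t n = cfa_read( fd, buf + done, len - done );
		if ( n <= 0 ) break;          // error (errno set, as for read) or end of file
		done += n;
	}
	return (ssize_t)done;
}
\end{cfa}
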
    408425This approach is feasible and still familiar to C programmers.
    409 It comes with the caveat that any code attempting to use it must be recompiled, which is a problem considering the amount of existing legacy C binaries.
     426It comes with the caveat that any code attempting to use it must be modified, which is a problem considering the amount of existing legacy C binaries.
    410427However, it has the advantage of implementation simplicity.
    411428Finally, there is a certain irony to using a blocking synchronous interface for a feature often referred to as ``non-blocking'' \io.
     
    416433future(ssize_t) read(int fd, void *buf, size_t count);
    417434\end{cfa}
    418 where the generic @future@ is fulfilled when the read completes and it contains the number of bytes read, which may be less than the number of bytes requested.
     435where the generic @future@ is fulfilled when the read completes, with the count of bytes actually read, which may be less than the number of bytes requested.
    419436The data read is placed in @buf@.
    420 The problem is that both the bytes read and data form the synchronization object, not just the bytes read.
    421 Hence, the buffer cannot be reused until the operation completes but the synchronization does not cover the buffer.
      437The problem is that both the byte count and the data form the synchronization object, not just the byte count.
      438Hence, the buffer cannot be reused until the operation completes, but the synchronization on the future does not enforce this.
    422439A classical asynchronous API is:
    423440\begin{cfa}
     
    438455However, it is not the most user-friendly option.
     439456It obviously imposes a strong dependency between user code and @io_uring@, but at the same time restricts users to usages that are compatible with how \CFA internally uses @io_uring@.
     457
      458As of writing this document, \CFA offers both a synchronous extension and the first approach to the asynchronous extension:
     459\begin{cfa}
     460ssize_t cfa_read(int fd, void *buf, size_t count);
     461future(ssize_t) async_read(int fd, void *buf, size_t count);
     462\end{cfa}
  • doc/theses/thierry_delisle_PhD/thesis/text/practice.tex

    r7d9598d8 r2dcd80a  
    22The scheduling algorithm described in Chapter~\ref{core} addresses scheduling in a stable state.
    33This chapter addresses problems that occur when the system state changes.
    4 Indeed the \CFA runtime, supports expanding and shrinking the number of \procs, both manually and, to some extent, automatically.
     4Indeed the \CFA runtime supports expanding and shrinking the number of \procs, both manually and, to some extent, automatically.
    55These changes affect the scheduling algorithm, which must dynamically alter its behaviour.
    66
    7 In detail, \CFA supports adding \procs using the type @processor@, in both RAII and heap coding scenarios.
     7Specifically, \CFA supports adding \procs using the type @processor@, in both RAII and heap coding scenarios.
    88\begin{cfa}
    99{
     
    2626This requirement also means any references into these arrays, \eg pointers or indexes, may need to be updated if elements are moved for compaction or any other reason.
    2727
    28 There are no performance requirements, within reason, for resizing since it is expected to be rare.
    29 However, this operation has strict correctness requirements since updating and idle sleep can easily lead to deadlocks.
    30 It should also avoid as much as possible any effect on performance when the number of \procs remains constant.
    31 This later requirement prohibits naive solutions, like simply adding a global lock to the ready-queue arrays.
      28There are no performance requirements, within reason, for the act of resizing itself, since it is expected to be rare.
     29However, this operation has strict correctness requirements, since updating and idle sleep can easily lead to deadlocks.
      30The resizing mechanism should also avoid, as much as possible, any effect on performance when the number of \procs remains constant.
     31This last requirement prohibits naive solutions, like simply adding a global lock to the ready-queue arrays.
    3232
    3333\subsection{Read-Copy-Update}
    3434One solution is to use the Read-Copy-Update pattern~\cite{wiki:rcu}.
     35This is a very common pattern that avoids large critical sections, which is why it is worth mentioning.
    3536In this pattern, resizing is done by creating a copy of the internal data structures, \eg see Figure~\ref{fig:base-ts2}, updating the copy with the desired changes, and then attempting an Indiana Jones Switch to replace the original with the copy.
    36 This approach has the advantage that it may not need any synchronization to do the switch.
     37This approach has the advantage that it may not need any synchronization to do the switch, depending on how reclamation of the original copy is handled.
    3738However, there is a race where \procs still use the original data structure after the copy is switched.
    3839This race not only requires adding a memory-reclamation scheme, but it also requires that operations made on the stale original version are eventually moved to the copy.
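
A minimal sketch of this copy-and-switch step, using C11 atomics and deliberately leaving out the reclamation and migration problems just mentioned (all names are illustrative):
\begin{cfa}
// Minimal sketch of the copy-and-switch step using C11 atomics. Reclamation of the old
// structure and migration of operations done on it are intentionally left out.
#include <stdatomic.h>
#include <stdlib.h>
#include <string.h>

struct ready_state { size_t nprocs; /* ... per-processor queue arrays ... */ };

static _Atomic(struct ready_state *) current_state;

static void resize_procs( size_t new_nprocs ) {
	struct ready_state * old = atomic_load( &current_state );
	struct ready_state * copy = malloc( sizeof(*copy) );
	memcpy( copy, old, sizeof(*copy) );                 // copy the internal data structures
	copy->nprocs = new_nprocs;                          // ... resize the arrays in the copy ...
	atomic_exchange( &current_state, copy );            // the switch itself
	// `old' cannot simply be freed: readers may still hold pointers into it, hence the
	// need for a memory-reclamation scheme and for moving stale operations to the copy.
}
\end{cfa}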
     
    6364Acquiring all the local read-locks guarantees mutual exclusion among the readers and the writer, while the wait on the read side prevents readers from continuously starving the writer.
    6465Figure~\ref{f:SpecializedReadersWriterLock} shows the outline for this specialized readers-writer lock.
    65 The lock in nonblocking, so both readers and writers spin while the lock is held.
     66The lock is nonblocking, so both readers and writers spin while the lock is held.
    6667This very wide sharding strategy means that readers have very good locality since they only ever need to access two memory locations.
    6768
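The following simplified sketch, in the spirit of that figure, shows one way such a widely sharded lock can be written with C11 atomics; it assumes a fixed bound on the number of \procs and is not the actual \CFA implementation.
\begin{cfa}
// Simplified sketch of a per-processor sharded readers-writer lock. Each reader only touches
// its own flag plus the global writer flag; the writer acquires every reader flag in turn.
#include <stdatomic.h>
#include <stdbool.h>
#define MAX_PROCS 128

static atomic_bool reader_flag[MAX_PROCS];   // padded to separate cache lines in a real implementation
static atomic_bool writer_flag;

static void read_lock( unsigned me ) {
	for ( ;; ) {
		atomic_store( &reader_flag[me], true );         // publish the read intent
		if ( ! atomic_load( &writer_flag ) ) return;    // no writer: proceed
		atomic_store( &reader_flag[me], false );        // writer present: back off to avoid starving it
		while ( atomic_load( &writer_flag ) ) {}        // spin until the writer is done
	}
}

static void read_unlock( unsigned me ) {
	atomic_store( &reader_flag[me], false );
}

static void write_lock( void ) {
	bool expected = false;                              // exclude other writers
	while ( ! atomic_compare_exchange_weak( &writer_flag, &expected, true ) ) expected = false;
	for ( unsigned i = 0; i < MAX_PROCS; i += 1 )       // acquire every reader flag
		while ( atomic_load( &reader_flag[i] ) ) {}
}

static void write_unlock( void ) {
	atomic_store( &writer_flag, false );
}
\end{cfa}
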
     
    9899\section{Idle-Sleep}\label{idlesleep}
    99100While manual resizing of \procs is expected to be rare, the number of \ats can vary significantly over an application's lifetime, which means there are times when there are too few or too many \procs.
    100 For this work, it is the programmer's responsibility to manually create \procs, so if there are too few \procs, the application must address this issue.
     101For this work, it is the application programmer's responsibility to manually create \procs, so if there are too few \procs, the application must address this issue.
    101102This leaves too many \procs when there are not enough \ats for all the \procs to be useful.
    102103These idle \procs cannot be removed because their lifetime is controlled by the application, and only the application knows when the number of \ats may increase or decrease.
     
    108109Because idle sleep is spurious, this data structure has strict performance requirements, in addition to strict correctness requirements.
    109110Next, some mechanism is needed to block \glspl{kthrd}, \eg @pthread_cond_wait@ or a pthread semaphore.
    110 The complexity here is to support \at \glslink{atblock}{parking} and \glslink{atsched}{unparking}, user-level locking, timers, \io operations, and all other \CFA features with minimal complexity.
      111The challenge here is to support user-level locking, timers, \io operations, and all other \CFA features with minimal complexity.
    111112Finally, the scheduler needs a heuristic to determine when to block and unblock an appropriate number of \procs.
    112113However, this third challenge is outside the scope of this thesis because developing a general heuristic is complex enough to justify its own work.
     
    115116An interesting subpart of this heuristic is what to do with bursts of \ats that become ready.
    116117Since waking up a sleeping \proc can have notable latency, multiple \ats may become ready while a single \proc is waking up.
    117 This fact begs the question, if many \procs are available, how many should be woken?
     118This fact raises the question: if many \procs are available, how many should be woken?
    118119If the ready \ats will run longer than the wake-up latency, waking one \proc per \at will offer maximum parallelization.
    119120If the ready \ats will run for a very short time, waking many \procs may be wasteful.
    120 As mentioned, a heuristic to handle these complex cases is outside the scope of this thesis, the behaviour of the scheduler in this particular case is left unspecified.
      121As mentioned, a heuristic to handle these complex cases is outside the scope of this thesis, so the behaviour of the scheduler in this particular case is left unspecified.
    121122
    122123\section{Sleeping}
     
    156157The notifier first makes sure the newly ready \at is visible to \procs searching for \ats, and then attempts to notify an idle \proc.
    157158On the other side, \procs make themselves visible as idle \procs and then search for any \ats they may have missed.
    158 Unlike regular work-stealing, this search must be exhaustive to make sure that pre-existing \at is missed.
     159Unlike regular work-stealing, this search must be exhaustive to make sure that no pre-existing \at is missed.
    159160These steps from both sides guarantee that if the search misses a newly ready \at, then the notifier is guaranteed to see at least one idle \proc.
    160161Conversely, if the notifier does not see any idle \proc, then a \proc is guaranteed to find the new \at in its exhaustive search.
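
The following sketch shows this two-sided ordering with C11 sequentially consistent atomics; the helper routines are assumed hooks rather than the actual \CFA runtime functions.
\begin{cfa}
// Sketch of the ``publish then check'' handshake. Sequential consistency guarantees that at
// least one side sees the other: either the notifier sees an idle processor, or the idle
// processor's exhaustive search finds the newly ready task.
#include <stdatomic.h>
#include <stddef.h>

extern void make_visible( void * task );       // assumed: push the ready task where searches can find it
extern void * exhaustive_search( void );       // assumed: search every queue once
extern void wake_one( void );                  // assumed: wake one idle processor
extern void block_until_woken( void );         // assumed: actually sleep

static atomic_int idle_count;

void notify_side( void * task ) {
	make_visible( task );                       // step 1: publish the newly ready task
	if ( atomic_load( &idle_count ) > 0 )       // step 2: only then look for idle processors
		wake_one();
}

void idle_side( void ) {
	atomic_fetch_add( &idle_count, 1 );         // step 1: advertise as idle
	void * found = exhaustive_search();         // step 2: only then search, exhaustively
	if ( found == NULL )
		block_until_woken();                    // nothing was missed and notifiers will see us
	atomic_fetch_sub( &idle_count, 1 );         // run `found' here if the search succeeded
}
\end{cfa}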
     
    188189        \centering
    189190        \input{idle1.pstex_t}
    190         \caption[Basic Idle Sleep Data Structure]{Basic Idle Sleep Data Structure \smallskip\newline Each idle \proc is put unto a doubly-linked stack protected by a lock.
     191        \caption[Basic Idle Sleep Data Structure]{Basic Idle Sleep Data Structure \smallskip\newline Each idle \proc is put onto a doubly-linked stack protected by a lock.
    191192        Each \proc has a private event \lstinline{fd}.}
    192193        \label{fig:idle1}
  • doc/theses/thierry_delisle_PhD/thesis/text/runtime.tex

    r7d9598d8 r2dcd80a  
    66\Celeven introduced threading features, such as the @_Thread_local@ storage class, and libraries @stdatomic.h@ and @threads.h@.
    77Interestingly, almost a decade after the \Celeven standard, the most recent versions of gcc, clang, and msvc do not support the \Celeven include @threads.h@, indicating no interest in the C11 concurrency approach (possibly because of the recent effort to add concurrency to \CC).
    8 While the \Celeven standard does not state a threading model, the historical association with pthreads suggests implementations would adopt kernel-level threading (1:1)~\cite{ThreadModel}, as for \CC.
     8While the \Celeven standard does not state a threading model, the historical association with pthreads suggests implementations would adopt kernel-level threading (1:1)~\cite{ThreadModel}, as \CC does.
    99This model uses \glspl{kthrd} to achieve parallelism and concurrency. In this model, every thread of computation maps to an object in the kernel.
    1010The kernel then has the responsibility of managing these threads, \eg creating, scheduling, blocking.
     
    3838
    3939\section{\glsxtrshort{io}}\label{prev:io}
    40 Prior to this work, the \CFA runtime did not add any particular support for \glsxtrshort{io} operations. While all \glsxtrshort{io} operations available in C are available in \CFA, \glsxtrshort{io} operations are designed for the POSIX threading model~\cite{pthreads}. Using these 1:1 threading operations in an M:N threading model means \glsxtrshort{io} operations block \glspl{proc} instead of \ats. While this can work in certain cases, it limits the number of concurrent operations to the number of \glspl{proc} rather than \ats. It also means deadlock can occur because all \glspl{proc} are blocked even if at least one \at is ready to run. A simple example of this type of deadlock would be as follows:
     40Prior to this work, the \CFA runtime did not have any particular support for \glsxtrshort{io} operations. While all \glsxtrshort{io} operations available in C are available in \CFA, \glsxtrshort{io} operations are designed for the POSIX threading model~\cite{pthreads}. Using these 1:1 threading operations in an M:N threading model means \glsxtrshort{io} operations block \glspl{proc} instead of \ats. While this can work in certain cases, it limits the number of concurrent operations to the number of \glspl{proc} rather than \ats. It also means deadlock can occur because all \glspl{proc} are blocked even if at least one \at is ready to run. A simple example of this type of deadlock would be as follows:
    4141
    4242\begin{quote}
     
    4949\end{quote}
    5050
    51 Therefore, one of the objectives of this work is to introduce \emph{User-Level \glsxtrshort{io}}, which like \glslink{uthrding}{User-Level \emph{Threading}}, blocks \ats rather than \glspl{proc} when doing \glsxtrshort{io} operations.
     51Therefore, one of the objectives of this work is to introduce \emph{User-Level \glsxtrshort{io}}, which, like \glslink{uthrding}{User-Level \emph{Threading}}, blocks \ats rather than \glspl{proc} when doing \glsxtrshort{io} operations.
    5252This feature entails multiplexing the \glsxtrshort{io} operations of many \ats onto fewer \glspl{proc}.
    5353The multiplexing requires a single \gls{proc} to execute multiple \glsxtrshort{io} operations in parallel.
     
    5656
    5757\section{Interoperating with C}
    58 While \glsxtrshort{io} operations are the classical example of operations that block \glspl{kthrd}, the non-blocking challenge extends to all blocking system-calls. The POSIX standard states~\cite[\S~2.9.1]{POSIX17}:
     58While \glsxtrshort{io} operations are the classical example of operations that block \glspl{kthrd}, the non-blocking challenge extends to all blocking system-calls.
     59The POSIX standard states~\cite[\S~2.9.1]{POSIX17}:
    5960\begin{quote}
    6061All functions defined by this volume of POSIX.1-2017 shall be thread-safe, except that the following functions need not be thread-safe. ... (list of 70+ excluded functions)
    6162\end{quote}
    62 Only UNIX @man@ pages identify whether a library function is thread-safe, and hence, may block on a pthreads lock or system call; hence interoperability with UNIX library functions is a challenge for an M:N threading model.
     63Only UNIX @man@ pages identify whether a library function is thread-safe, and hence, may block on a pthreads lock or system call; hence, interoperability with UNIX library functions is a challenge for an M:N threading model.
    6364
    64 Languages like Go and Java, which have strict interoperability with C\cite{wiki:jni,go:cgo}, can control operations in C by ``sandboxing'' them, \eg a blocking function may be delegated to a \gls{kthrd}. Sandboxing may help towards guaranteeing that the kind of deadlock mentioned above does not occur.
     65Languages like Go and Java, which have strict interoperability with C\cite{wiki:jni,go:cgo}, can control operations in C by ``sandboxing'' them, \eg a blocking function may be delegated to a \gls{kthrd}.
     66Sandboxing may help towards guaranteeing that the kind of deadlock mentioned above does not occur.
    6567
    66 As mentioned in Section~\ref{intro}, \CFA is binary compatible with C and, as such, must support all C library functions. Furthermore, interoperability can happen at the function-call level, inline code, or C and \CFA translation units linked together. This fine-grained interoperability between C and \CFA has two consequences:
     68As mentioned in Section~\ref{intro}, \CFA is binary compatible with C and, as such, must support all C library functions.
     69Furthermore, interoperability can happen at the function-call level, inline code, or C and \CFA translation units linked together.
     70This fine-grained interoperability between C and \CFA has two consequences:
    6771\begin{enumerate}
    6872        \item Precisely identifying blocking C calls is difficult.
     
    7175Because of these consequences, this work does not attempt to ``sandbox'' calls to C.
    7276Therefore, it is possible calls to an unknown library function can block a \gls{kthrd} leading to deadlocks in \CFA's M:N threading model, which would not occur in a traditional 1:1 threading model.
      77Since the blocking call is not known to the runtime, it is not always possible to determine whether a deadlock has occurred.
    7378Currently, all M:N thread systems interacting with UNIX without sandboxing suffer from this problem but manage to work very well in the majority of applications.
    74 Therefore, a complete solution to this problem is outside the scope of this thesis.\footnote{\CFA does provide a pthreads emulation, so any library function using embedded pthreads locks is redirected to \CFA user-level locks. This capability further reduces the chances of blocking a \gls{kthrd}.}
     79Therefore, a complete solution to this problem is outside the scope of this thesis.
     80\footnote{\CFA does provide a pthreads emulation, so any library function using embedded pthreads locks is redirected to \CFA user-level locks.
     81This capability further reduces the chances of blocking a \gls{kthrd}.}
     82Chapter~\ref{userio} discusses the interoperability with C chosen and used for the evaluation in Chapter~\ref{macrobench}.
  • libcfa/src/bits/random.hfa

    r7d9598d8 r2dcd80a  
    1010// Created On       : Fri Jan 14 07:18:11 2022
    1111// Last Modified By : Peter A. Buhr
    12 // Last Modified On : Fri Jan 14 07:18:58 2022
    13 // Update Count     : 1
     12// Last Modified On : Sun Dec 11 18:43:58 2022
     13// Update Count     : 171
    1414//
    1515
     
    1818#include <stdint.h>
    1919
    20 // Pipelined to allow out-of-order overlap with reduced dependencies. Critically, the current random state is returned
    21 // (copied), and then compute and store the next random value.
    22 
    23 #if defined(__SIZEOF_INT128__)
    24 //--------------------------------------------------
     20#define GLUE2( x, y ) x##y
     21#define GLUE( x, y ) GLUE2( x, y )
     22
     23// Set default PRNG for architecture size.
     24#ifdef __x86_64__                                                                               // 64-bit architecture
     25        // 64-bit generators
     26        #define LEHMER64
     27        //#define XORSHIFT_12_25_27
     28        //#define XOSHIRO256PP
     29        //#define KISS_64
     30
     31        // 32-bit generators
     32        #define XORSHIFT_6_21_7
     33        //#define XOSHIRO128PP
     34#else                                                                                                   // 32-bit architecture
     35        // 64-bit generators
     36        #define XORSHIFT_13_7_17
     37
     38        // 32-bit generators
     39        #define XORSHIFT_6_21_7
     40#endif // __x86_64__
     41
     42// Define C/CFA PRNG name and random-state.
     43
     44// SKULLDUGGERY: typedefs name struct and typedef with the same name to deal with CFA typedef numbering problem.
     45
     46#ifdef XOSHIRO256PP
     47#define PRNG_NAME_64 xoshiro256pp
     48#define PRNG_STATE_64_T GLUE(PRNG_NAME_64,_t)
     49typedef struct PRNG_STATE_64_T { uint64_t s[4]; } PRNG_STATE_64_T;
     50#endif // XOSHIRO256PP
     51
     52#ifdef XOSHIRO128PP
     53#define PRNG_NAME_32 xoshiro128pp
     54#define PRNG_STATE_32_T GLUE(PRNG_NAME_32,_t)
     55typedef struct PRNG_STATE_32_T { uint32_t s[4]; } PRNG_STATE_32_T;
     56#endif // XOSHIRO128PP
     57
     58#ifdef LEHMER64
     59#define PRNG_NAME_64 lehmer64
     60#define PRNG_STATE_64_T __uint128_t
     61#endif // LEHMER64
     62
     63#ifdef WYHASH64
     64#define PRNG_NAME_64 wyhash64
     65#define PRNG_STATE_64_T uint64_t
      66#endif // WYHASH64
     67
     68#ifdef XORSHIFT_13_7_17
     69#define PRNG_NAME_64 xorshift_13_7_17
     70#define PRNG_STATE_64_T uint64_t
     71#endif // XORSHIFT_13_7_17
     72
     73#ifdef XORSHIFT_6_21_7
     74#define PRNG_NAME_32 xorshift_6_21_7
     75#define PRNG_STATE_32_T uint32_t
     76#endif // XORSHIFT_6_21_7
     77
     78#ifdef XORSHIFT_12_25_27
     79#define PRNG_NAME_64 xorshift_12_25_27
     80#define PRNG_STATE_64_T uint64_t
     81#endif // XORSHIFT_12_25_27
     82
     83#ifdef KISS_64
     84#define PRNG_NAME_64 kiss_64
     85#define PRNG_STATE_64_T GLUE(PRNG_NAME_64,_t)
     86typedef struct PRNG_STATE_64_T { uint64_t z, w, jsr, jcong; } PRNG_STATE_64_T;
      87#endif // KISS_64
     88
     89#ifdef XORWOW
     90#define PRNG_NAME_32 xorwow
     91#define PRNG_STATE_32_T GLUE(PRNG_NAME_32,_t)
     92typedef struct PRNG_STATE_32_T { uint32_t a, b, c, d, counter; } PRNG_STATE_32_T;
      93#endif // XORWOW
     94
     95#define PRNG_SET_SEED_64 GLUE(PRNG_NAME_64,_set_seed)
     96#define PRNG_SET_SEED_32 GLUE(PRNG_NAME_32,_set_seed)
     97
     98
     99// Default PRNG used by runtime.
     100#ifdef __x86_64__                                                                               // 64-bit architecture
     101#define PRNG_NAME PRNG_NAME_64
     102#define PRNG_STATE_T PRNG_STATE_64_T
     103#else                                                                                                   // 32-bit architecture
     104#define PRNG_NAME PRNG_NAME_32
     105#define PRNG_STATE_T PRNG_STATE_32_T
     106#endif // __x86_64__
     107
     108#define PRNG_SET_SEED GLUE(PRNG_NAME,_set_seed)
     109
     110
     111// ALL PRNG ALGORITHMS ARE OPTIMIZED SO THAT THE PRNG LOGIC CAN HAPPEN IN PARALLEL WITH THE USE OF THE RESULT.
      112// Therefore, the set_seed routine primes the PRNG by calling it with the state so the seed is not returned as the
     113// first random value.
     114
     115#ifdef __cforall                                                                                // don't include in C code (invoke.h)
     116
     117// https://prng.di.unimi.it/xoshiro256starstar.c
     118//
     119// This is xoshiro256++ 1.0, one of our all-purpose, rock-solid generators.  It has excellent (sub-ns) speed, a state
     120// (256 bits) that is large enough for any parallel application, and it passes all tests we are aware of.
     121//
     122// For generating just floating-point numbers, xoshiro256+ is even faster.
     123//
     124// The state must be seeded so that it is not everywhere zero. If you have a 64-bit seed, we suggest to seed a
     125// splitmix64 generator and use its output to fill s.
     126
     127#ifndef XOSHIRO256PP
     128typedef struct xoshiro256pp_t { uint64_t s[4]; } xoshiro256pp_t;
     129#endif // ! XOSHIRO256PP
     130
     131static inline uint64_t xoshiro256pp( xoshiro256pp_t & rs ) with(rs) {
     132        inline uint64_t rotl(const uint64_t x, int k) {
     133                return (x << k) | (x >> (64 - k));
     134        } // rotl
     135
     136        const uint64_t result = rotl( s[0] + s[3], 23 ) + s[0];
     137        const uint64_t t = s[1] << 17;
     138
     139        s[2] ^= s[0];
     140        s[3] ^= s[1];
     141        s[1] ^= s[2];
     142        s[0] ^= s[3];
     143        s[2] ^= t;
     144        s[3] = rotl( s[3], 45 );
     145        return result;
     146} // xoshiro256pp
     147
     148static inline void xoshiro256pp_set_seed( xoshiro256pp_t & state,  uint64_t seed ) {
     149        state = (xoshiro256pp_t){ {seed, seed, seed, seed} };
     150        xoshiro256pp( state );
     151} // xoshiro256pp_set_seed
     152
     153// https://prng.di.unimi.it/xoshiro128plusplus.c
     154//
     155// This is xoshiro128++ 1.0, one of our 32-bit all-purpose, rock-solid generators. It has excellent speed, a state size
     156// (128 bits) that is large enough for mild parallelism, and it passes all tests we are aware of.
     157//
     158// For generating just single-precision (i.e., 32-bit) floating-point numbers, xoshiro128+ is even faster.
     159//
     160// The state must be seeded so that it is not everywhere zero.
     161
     162#ifndef XOSHIRO128PP
     163typedef struct xoshiro128pp_t { uint32_t s[4]; } xoshiro128pp_t;
     164#endif // ! XOSHIRO128PP
     165
     166static inline uint32_t xoshiro128pp( xoshiro128pp_t & rs ) with(rs) {
     167        inline uint32_t rotl( const uint32_t x, int k ) {
     168                return (x << k) | (x >> (32 - k));
     169        } // rotl
     170
     171        const uint32_t result = rotl( s[0] + s[3], 7 ) + s[0];
     172        const uint32_t t = s[1] << 9;
     173
     174        s[2] ^= s[0];
     175        s[3] ^= s[1];
     176        s[1] ^= s[2];
     177        s[0] ^= s[3];
     178        s[2] ^= t;
     179        s[3] = rotl( s[3], 11 );
     180        return result;
     181} // xoshiro128pp
     182
     183static inline void xoshiro128pp_set_seed( xoshiro128pp_t & state, uint32_t seed ) {
     184        state = (xoshiro128pp_t){ {seed, seed, seed, seed} };
     185        xoshiro128pp( state );                                                          // prime
     186} // xoshiro128pp_set_seed
     187
     188#ifdef __SIZEOF_INT128__
     189        // Pipelined to allow out-of-order overlap with reduced dependencies. Critically, the current random state is
     190        // returned (copied), and then compute and store the next random value.
     191        //--------------------------------------------------
    25192        static inline uint64_t lehmer64( __uint128_t & state ) {
    26193                __uint128_t ret = state;
    27194                state *= 0xda942042e4dd58b5;
    28195                return ret >> 64;
    29         }
    30 
    31 //--------------------------------------------------
     196        } // lehmer64
     197
     198        static inline void lehmer64_set_seed( __uint128_t & state, uint64_t seed ) {
     199                state = seed;
     200                lehmer64( state );
     201        } // lehmer64_set_seed
     202
     203        //--------------------------------------------------
    32204        static inline uint64_t wyhash64( uint64_t & state ) {
    33                 state += 0x60bee2bee120fc15;
     205                uint64_t ret = state;
     206                state += 0x_60be_e2be_e120_fc15;
    34207                __uint128_t tmp;
    35                 tmp = (__uint128_t) state * 0xa3b195354a39b70d;
     208                tmp = (__uint128_t) ret * 0x_a3b1_9535_4a39_b70d;
    36209                uint64_t m1 = (tmp >> 64) ^ tmp;
    37                 tmp = (__uint128_t)m1 * 0x1b03738712fad5c9;
     210                tmp = (__uint128_t)m1 * 0x_1b03_7387_12fa_d5c9;
    38211                uint64_t m2 = (tmp >> 64) ^ tmp;
    39212                return m2;
    40         }
    41 #endif
     213        } // wyhash64
     214
     215        static inline void wyhash64_set_seed( uint64_t & state, uint64_t seed ) {
     216                state = seed;
     217                wyhash64( state );                                                              // prime
     218        } // wyhash64_set_seed
     219#endif // __SIZEOF_INT128__
    42220
    43221//--------------------------------------------------
     
    48226        state ^= state << 17;
    49227        return ret;
    50 }
    51 
    52 //--------------------------------------------------
     228} // xorshift_13_7_17
     229
     230static inline void xorshift_13_7_17_set_seed( uint64_t & state, uint64_t seed ) {
     231        state = seed;
     232        xorshift_13_7_17( state );                                                      // prime
     233} // xorshift_13_7_17_set_seed
     234
     235//--------------------------------------------------
     236// Marsaglia shift-XOR PRNG with thread-local state
     237// Period is 4G-1
     238// 0 is absorbing and must be avoided
     239// Low-order bits are not particularly random
    53240static inline uint32_t xorshift_6_21_7( uint32_t & state ) {
    54241        uint32_t ret = state;
     
    59246} // xorshift_6_21_7
    60247
    61 //--------------------------------------------------
    62 typedef struct {
    63   uint32_t a, b, c, d;
    64   uint32_t counter;
    65 } xorwow__state_t;
    66 
    67 // The state array must be initialized to not be all zero in the first four words.
    68 static inline uint32_t xorwow( xorwow__state_t & state ) {
     248static inline void xorshift_6_21_7_set_seed( uint32_t & state, uint32_t seed ) {
     249        state = seed;
     250        xorshift_6_21_7( state );                                                       // prime
     251} // xorshift_6_21_7_set_seed
     252
     253//--------------------------------------------------
     254// The state must be seeded with a nonzero value.
     255static inline uint64_t xorshift_12_25_27( uint64_t & state ) {
     256        uint64_t ret = state;
     257        state ^= state >> 12;
     258        state ^= state << 25;
     259        state ^= state >> 27;
     260        return ret * 0x_2545_F491_4F6C_DD1D;
     261} // xorshift_12_25_27
     262
     263static inline void xorshift_12_25_27_set_seed( uint64_t & state, uint64_t seed ) {
     264        state = seed;
     265        xorshift_12_25_27( state );                                                     // prime
     266} // xorshift_12_25_27_set_seed
     267
     268//--------------------------------------------------
     269// The state must be seeded with a nonzero value.
     270#ifndef KISS_64
     271typedef struct kiss_64_t { uint64_t z, w, jsr, jcong; } kiss_64_t;
     272#endif // ! KISS_64
     273
     274static inline uint64_t kiss_64( kiss_64_t & state ) with(state) {
     275        kiss_64_t ret = state;
     276        z = 36969 * (z & 65535) + (z >> 16);
     277        w = 18000 * (w & 65535) + (w >> 16);
     278        jsr ^= (jsr << 17);
     279        jsr ^= (jsr << 13);
     280        jsr ^= (jsr << 5);
     281        jcong = 69069 * jcong + 1234567;
     282        return (((ret.z << 16) + ret.w) ^ ret.jcong) + ret.jsr;
     283} // kiss_64
     284
     285static inline void kiss_64_set_seed( kiss_64_t & state, uint64_t seed ) with(state) {
     286        z = 1; w = 1; jsr = 4; jcong = seed;
     287        kiss_64( state );                                                                       // prime
     288} // kiss_64_set_seed
     289
     290//--------------------------------------------------
     291// The state array must be initialized to non-zero in the first four words.
     292#ifndef XORWOW
     293typedef struct xorwow_t { uint32_t a, b, c, d, counter; } xorwow_t;
     294#endif // ! XORWOW
     295
     296static inline uint32_t xorwow( xorwow_t & state ) with(state) {
    69297        // Algorithm "xorwow" from p. 5 of Marsaglia, "Xorshift RNGs".
    70         uint32_t ret = state.a + state.counter;
    71         uint32_t t = state.d;
    72 
    73         uint32_t const s = state.a;
    74         state.d = state.c;
    75         state.c = state.b;
    76         state.b = s;
     298        uint32_t ret = a + counter;
     299        uint32_t t = d;
     300
     301        uint32_t const s = a;
     302        d = c;
     303        c = b;
     304        b = s;
    77305
    78306        t ^= t >> 2;
    79307        t ^= t << 1;
    80308        t ^= s ^ (s << 4);
    81         state.a = t;
    82 
    83         state.counter += 362437;
     309        a = t;
     310        counter += 362437;
    84311        return ret;
    85 }
    86 
    87 //--------------------------------------------------
    88 static inline uint32_t LCG( uint32_t & state ) {                // linear congruential generator
    89         uint32_t ret = state;
    90         state = 36969 * (state & 65535) + (state >> 16);        // 36969 is NOT prime! No not change it!
    91         return ret;
    92 } // LCG
    93 
    94 //--------------------------------------------------
     312} // xorwow
     313
     314static inline void xorwow_set_seed( xorwow_t & state, uint32_t seed ) {
     315        state = (xorwow_t){ seed, seed, seed, seed, 0 };
     316        xorwow( state );                                                                        // prime
     317} // xorwow_set_seed
     318
     319//--------------------------------------------------
     320// Used in __tls_rand_fwd
    95321#define M  (1_l64u << 48_l64u)
    96322#define A  (25214903917_l64u)
     
    103329        state = (A * state + C) & (M - 1);
    104330        return state >> D;
    105 }
     331} // LCGBI_fwd
    106332
    107333static inline uint32_t LCGBI_bck( uint64_t & state ) {
     
    109335        state = AI * (state - C) & (M - 1);
    110336        return r;
    111 }
     337} // LCGBI_bck
    112338
    113339#undef M
     
    116342#undef C
    117343#undef D
     344
     345#endif // __cforall
  • libcfa/src/concurrency/invoke.h

    r7d9598d8 r2dcd80a  
    1010// Created On       : Tue Jan 17 12:27:26 2016
    1111// Last Modified By : Peter A. Buhr
    12 // Last Modified On : Sun Jan  9 19:06:45 2022
    13 // Update Count     : 48
     12// Last Modified On : Tue Nov 29 20:42:21 2022
     13// Update Count     : 56
    1414//
    1515
     
    1717#include "bits/defs.hfa"
    1818#include "bits/locks.hfa"
     19#include "bits/random.hfa"
    1920#include "kernel/fwd.hfa"
    2021
     
    222223                struct processor * last_proc;
    223224
    224                 uint32_t random_state;                                                  // fast random numbers
     225                PRNG_STATE_T random_state;                                              // fast random numbers
    225226
    226227                #if defined( __CFA_WITH_VERIFY__ )
  • libcfa/src/concurrency/kernel.cfa

    r7d9598d8 r2dcd80a  
    1010// Created On       : Tue Jan 17 12:27:26 2017
    1111// Last Modified By : Peter A. Buhr
    12 // Last Modified On : Mon Aug 31 07:08:20 2020
    13 // Update Count     : 71
     12// Last Modified On : Wed Nov 30 18:14:08 2022
     13// Update Count     : 76
    1414//
    1515
     
     151151        // Because of a bug, we couldn't initialize the seed on construction
    152152        // Do it here
    153         __cfaabi_tls.rand_seed ^= rdtscl();
     153        PRNG_SET_SEED( __cfaabi_tls.random_state, rdtscl() );
    154154        __cfaabi_tls.ready_rng.fwd_seed = 25214903917_l64u * (rdtscl() ^ (uintptr_t)&runner);
    155155        __tls_rand_advance_bck();
  • libcfa/src/concurrency/kernel/fwd.hfa

    r7d9598d8 r2dcd80a  
    4646                        } preemption_state;
    4747
    48                         #if defined(__SIZEOF_INT128__)
    49                                 __uint128_t rand_seed;
    50                         #else
    51                                 uint64_t rand_seed;
    52                         #endif
     48                        PRNG_STATE_T random_state;
     49
    5350                        struct {
    5451                                uint64_t fwd_seed;
     
    5754
    5855                        struct __stats_t        * volatile this_stats;
    59 
    6056
    6157                        #ifdef __CFA_WITH_VERIFY__
     
    7672                #define publicTLS_get( member ) ((typeof(__cfaabi_tls.member))__cfatls_get( __builtin_offsetof(KernelThreadData, member) ))
    7773
    78                 static inline uint64_t __tls_rand() {
    79                         return
    80                         #if defined(__SIZEOF_INT128__)
    81                                 lehmer64( kernelTLS().rand_seed );
    82                         #else
    83                                 xorshift_13_7_17( kernelTLS().rand_seed );
    84                         #endif
     74                static inline
     75                        #ifdef __x86_64__                                                       // 64-bit architecture
     76                        uint64_t
     77                        #else                                                                           // 32-bit architecture
     78                        uint32_t
     79                        #endif // __x86_64__
     80                __tls_rand() {
     81                        return PRNG_NAME( kernelTLS().random_state );
    8582                }
    8683
     
    120117
    121118                // Yield: yield N times
    122                 static inline void yield( unsigned times ) {
     119                static inline void yield( size_t times ) {
    123120                        for( times ) {
    124121                                yield();
  • libcfa/src/concurrency/kernel/startup.cfa

    r7d9598d8 r2dcd80a  
    3939#include "limits.hfa"
    4040#include "math.hfa"
     41#include "bits/random.hfa"                                                              // prng
    4142
    4243#define CFA_PROCESSOR_USE_MMAP 0
     
    107108extern void __wake_proc(processor *);
    108109extern int cfa_main_returned;                                                   // from interpose.cfa
    109 uint32_t __global_random_prime = 4_294_967_291u, __global_random_mask = false;
     110size_t __global_random_prime = 4_294_967_291u;
     111bool __global_random_mask = false;
    110112
    111113//-----------------------------------------------------------------------------
     
    133135// Global state
    134136__thread struct KernelThreadData __cfaabi_tls __attribute__ ((tls_model ( "initial-exec" ))) @= {
    135         NULL,                                                                                           // cannot use 0p
    136         NULL,
    137         false,
    138         { 1, false, false },
    139         0,
    140         { 0, 0 },
    141         NULL,
     137        .this_thread : NULL,                                                            // cannot use 0p
     138        .this_processor : NULL,
     139        .sched_lock : false,
     140        .preemption_state : { .disable_count : 1, .enabled : false, .in_progress : false },
     141        // random_state uninitialized
     142        .ready_rng : { .fwd_seed : 0, .bck_seed : 0 },
     143        .this_stats : NULL,
    142144        #ifdef __CFA_WITH_VERIFY__
    143                 false,
    144                 0,
     145                .in_sched_lock : false,
     146                .sched_id : 0,
    145147        #endif
    146148};
     
    513515        rdy_link.next = 0p;
    514516        rdy_link.ts   = MAX;
     517        user_link.next = 0p;
     518        user_link.prev = 0p;
     519        cltr_link.next = 0p;
     520        cltr_link.prev = 0p;
    515521        preferred = ready_queue_new_preferred();
    516522        last_proc = 0p;
    517         random_state = __global_random_mask ? __global_random_prime : __global_random_prime ^ rdtscl();
     523        PRNG_SET_SEED( random_state,  __global_random_mask ? __global_random_prime : __global_random_prime ^ rdtscl() );
    518524        #if defined( __CFA_WITH_VERIFY__ )
    519525                executing = 0p;
  • libcfa/src/concurrency/stats.cfa

    r7d9598d8 r2dcd80a  
    4848                        stats->io.submit.eagr       = 0;
    4949                        stats->io.submit.nblk       = 0;
     50                        stats->io.submit.extr       = 0;
    5051                        stats->io.flush.external    = 0;
     52                        stats->io.flush.signal      = 0;
    5153                        stats->io.flush.dirty       = 0;
    5254                        stats->io.flush.full        = 0;
     
    120122                        tally_one( &cltr->io.submit.eagr      , &proc->io.submit.eagr       );
    121123                        tally_one( &cltr->io.submit.nblk      , &proc->io.submit.nblk       );
     124                        tally_one( &cltr->io.submit.extr      , &proc->io.submit.extr       );
    122125                        tally_one( &cltr->io.flush.external   , &proc->io.flush.external    );
     126                        tally_one( &cltr->io.flush.signal     , &proc->io.flush.signal      );
    123127                        tally_one( &cltr->io.flush.dirty      , &proc->io.flush.dirty       );
    124128                        tally_one( &cltr->io.flush.full       , &proc->io.flush.full        );
     
    199203                                if(io.alloc.slow) {
    200204                                        double avgfasts = (100.0 * (double)io.submit.fast) / total_submits;
    201                                         sstr | "fast," | eng3(io.submit.slow) | "slow (" | ws(3, 3, avgfasts) | "%)" | nonl;
     205                                        sstr | "fast," | eng3(io.submit.slow) | "slow (" | ws(3, 3, avgfasts) | "%)," | eng3(io.submit.extr) | "external" | nonl;
    202206                                }
    203207                                sstr | " - eager" | eng3(io.submit.eagr) | nonl;
     
    217221                                     | " - cmp " | eng3(io.calls.locked) | "locked, " | eng3(io.calls.helped) | "helped"
    218222                                     | " - " | eng3(io.calls.errors.busy) | " EBUSY";
    219                                 sstr | " - sub: " | eng3(io.flush.full) | "full, " | eng3(io.flush.dirty) | "drty, " | eng3(io.flush.idle) | "idle, " | eng3(io.flush.eager) | "eagr, " | eng3(io.flush.external) | "ext";
     223                                sstr | " - sub: " | eng3(io.flush.full) | "full, " | eng3(io.flush.dirty) | "drty, " | eng3(io.flush.idle) | "idle, " | eng3(io.flush.eager) | "eagr, " | eng3(io.flush.external) | '/' | eng3(io.flush.signal) | "ext";
    220224                                sstr | "- ops blk: "
    221225                                     |   " sk rd: " | eng3(io.ops.sockread)  | "epll: " | eng3(io.ops.epllread)
  • libcfa/src/concurrency/stats.hfa

    r7d9598d8 r2dcd80a  
    9494                                volatile uint64_t eagr;
    9595                                volatile uint64_t nblk;
     96                                volatile uint64_t extr;
    9697                        } submit;
    9798                        struct {
    9899                                volatile uint64_t external;
     100                                volatile uint64_t signal;
    99101                                volatile uint64_t dirty;
    100102                                volatile uint64_t full;
  • libcfa/src/concurrency/thread.cfa

    r7d9598d8 r2dcd80a  
    1010// Created On       : Tue Jan 17 12:27:26 2017
    1111// Last Modified By : Peter A. Buhr
    12 // Last Modified On : Sat Feb 12 15:24:18 2022
    13 // Update Count     : 66
     12// Last Modified On : Sun Dec 11 20:56:54 2022
     13// Update Count     : 102
    1414//
    1515
     
    2626#include "invoke.h"
    2727
    28 extern uint32_t __global_random_seed, __global_random_prime, __global_random_mask;
     28extern size_t __global_random_seed;
     29extern size_t __global_random_prime;
     30extern bool __global_random_mask;
    2931
    3032#pragma GCC visibility push(default)
     
    4648        rdy_link.next = 0p;
    4749        rdy_link.ts   = MAX;
     50        user_link.next = 0p;
     51        user_link.prev = 0p;
     52        cltr_link.next = 0p;
     53        cltr_link.prev = 0p;
    4854        preferred = ready_queue_new_preferred();
    4955        last_proc = 0p;
    50         random_state = __global_random_mask ? __global_random_prime : __global_random_prime ^ rdtscl();
     56        PRNG_SET_SEED( random_state, __global_random_mask ? __global_random_prime : __global_random_prime ^ rdtscl() );
    5157        #if defined( __CFA_WITH_VERIFY__ )
    5258                executing = 0p;
     
    176182//-----------------------------------------------------------------------------
    177183bool migrate( thread$ * thrd, struct cluster & cl ) {
    178 
    179184        monitor$ * tmon = get_monitor(thrd);
    180185        monitor$ * __monitors[] = { tmon };
    181186        monitor_guard_t __guard = { __monitors, 1 };
    182 
    183 
    184187        {
    185188                // if nothing needs to be done, return false
     
    221224
    222225//-----------------------------------------------------------------------------
    223 #define GENERATOR LCG
    224 
    225 void set_seed( uint32_t seed ) {
    226         uint32_t & state = active_thread()->random_state;
    227         state = __global_random_seed = seed;
    228         GENERATOR( state );
    229         __global_random_prime = state;
     226
     227void set_seed( size_t seed ) {
     228        PRNG_STATE_T & state = active_thread()->random_state;
     229        PRNG_SET_SEED( state, seed );
     230        __global_random_seed = seed;
     231        __global_random_prime = seed;
    230232        __global_random_mask = true;
    231233} // set_seed
    232234
    233 uint32_t prng( void ) {                                                                 // [0,UINT_MAX]
    234         uint32_t & state = active_thread()->random_state;
    235         return GENERATOR( state );
     235size_t prng( void ) {                                                                   // [0,UINT_MAX]
     236        return PRNG_NAME( active_thread()->random_state );
    236237} // prng
    237238
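
    The constructor and set_seed changes above implement a simple repeatability policy: until set_seed is called, each new thread seeds from __global_random_prime XOR-ed with the time-stamp counter so separate runs differ; once a user seeds, __global_random_mask is set and later threads reuse the stored value verbatim. A small C++ sketch of that decision, using the variable names from the diff but a stand-in for rdtscl:

        #include <chrono>
        #include <cstddef>
        #include <cstdint>

        static uint64_t rdtscl() {                               // stand-in for the hardware time-stamp counter
            return (uint64_t)std::chrono::steady_clock::now().time_since_epoch().count();
        }

        static size_t __global_random_prime = 4294967291u;       // default mixing value
        static bool   __global_random_mask  = false;             // true once a user has seeded

        // Mirrors the per-thread seeding expression in the thread constructor.
        static size_t pick_thread_seed() {
            return __global_random_mask ? __global_random_prime
                                        : __global_random_prime ^ rdtscl();
        }

        // Mirrors set_seed(): record the seed and switch to deterministic seeding.
        static void user_set_seed( size_t seed ) {
            __global_random_prime = seed;
            __global_random_mask  = true;
        }
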
  • libcfa/src/concurrency/thread.hfa

    r7d9598d8 r2dcd80a  
    1010// Created On       : Tue Jan 17 12:27:26 2017
    1111// Last Modified By : Peter A. Buhr
    12 // Last Modified On : Fri Feb 11 16:34:07 2022
    13 // Update Count     : 20
     12// Last Modified On : Tue Nov 22 22:18:34 2022
     13// Update Count     : 35
    1414//
    1515
     
    2323#include "monitor.hfa"
    2424#include "exception.hfa"
     25#include "bits/random.hfa"
    2526
    2627//-----------------------------------------------------------------------------
     
    141142//----------
    142143// prng
     144void set_seed( size_t seed );
    143145static inline {
    144         uint32_t prng( thread$ & th ) __attribute__(( warn_unused_result )) { return LCG( th.random_state ); } // [0,UINT_MAX]
    145         uint32_t prng( thread$ & th, uint32_t u ) __attribute__(( warn_unused_result )) { return prng( th ) % u; } // [0,u)
    146         uint32_t prng( thread$ & th, uint32_t l, uint32_t u ) __attribute__(( warn_unused_result )) { return prng( th, u - l + 1 ) + l; } // [l,u]
     146        size_t prng( thread$ & th ) __attribute__(( warn_unused_result )) { return PRNG_NAME( th.random_state ); } // [0,UINT_MAX]
     147        size_t prng( thread$ & th, size_t u ) __attribute__(( warn_unused_result )) { return prng( th ) % u; } // [0,u)
     148        size_t prng( thread$ & th, size_t l, size_t u ) __attribute__(( warn_unused_result )) { return prng( th, u - l + 1 ) + l; } // [l,u]
    147149        forall( T & | is_thread(T) ) {
    148                 uint32_t prng( T & th ) __attribute__(( warn_unused_result )) { return prng( (thread &)th ); } // [0,UINT_MAX]
    149                 uint32_t prng( T & th, uint32_t u ) __attribute__(( warn_unused_result )) { return prng( th ) % u; } // [0,u)
    150                 uint32_t prng( T & th, uint32_t l, uint32_t u ) __attribute__(( warn_unused_result )) { return prng( th, u - l + 1 ) + l; } // [l,u]
     150                size_t prng( T & th ) __attribute__(( warn_unused_result )) { return prng( (thread &)th ); } // [0,UINT_MAX]
     151                size_t prng( T & th, size_t u ) __attribute__(( warn_unused_result )) { return prng( th ) % u; } // [0,u)
     152                size_t prng( T & th, size_t l, size_t u ) __attribute__(( warn_unused_result )) { return prng( th, u - l + 1 ) + l; } // [l,u]
    151153        } // distribution
    152154} // distribution
  • libcfa/src/exception.h

    r7d9598d8 r2dcd80a  
    143143}
    144144
    145 forall(exceptT &, virtualT & | is_exception(exceptT, virtualT))
     145forall(exceptT &, virtualT & | is_termination_exception(exceptT, virtualT))
    146146static inline void defaultResumptionHandler(exceptT & except) {
    147147        throw except;
  • libcfa/src/startup.cfa

    r7d9598d8 r2dcd80a  
    1010// Created On       : Tue Jul 24 16:21:57 2018
    1111// Last Modified By : Peter A. Buhr
    12 // Last Modified On : Thu Oct  6 13:51:57 2022
    13 // Update Count     : 57
     12// Last Modified On : Mon Dec  5 11:41:58 2022
     13// Update Count     : 73
    1414//
    1515
     
    1818#include <stdlib.h>                                                                             // getenv
    1919#include "bits/defs.hfa"                                                                // rdtscl
     20#include "bits/random.hfa"                                                              // PRNG_STATE_T, PRNG_SET_SEED
    2021#include "startup.hfa"
    2122
    22 extern uint32_t __global_random_seed;                                   // sequential/concurrent
    23 extern uint32_t __global_random_state;                                  // sequential
     23extern size_t __global_random_seed;                                             // sequential/concurrent
     24extern PRNG_STATE_T __global_random_state;                              // sequential
    2425
    2526extern "C" {
     
    6869        void __cfaabi_core_startup( void ) __attribute__(( constructor( STARTUP_PRIORITY_CORE ) ));
    6970        void __cfaabi_core_startup( void ) {
    70                 __global_random_state = __global_random_seed = rdtscl();
     71                __global_random_seed = rdtscl();
     72                PRNG_SET_SEED( __global_random_state, __global_random_seed );
     73
    7174                __cfaabi_interpose_startup();
    7275                __cfaabi_device_startup();
  • libcfa/src/stdlib.cfa

    r7d9598d8 r2dcd80a  
    1010// Created On       : Thu Jan 28 17:10:29 2016
    1111// Last Modified By : Peter A. Buhr
    12 // Last Modified On : Thu Aug 25 22:41:14 2022
    13 // Update Count     : 604
     12// Last Modified On : Fri Dec  9 15:11:30 2022
     13// Update Count     : 631
    1414//
    1515
     
    225225//---------------------------------------
    226226
    227 #define GENERATOR LCG
    228 
    229227// would be cool to make hidden but it's needed for libcfathread
    230 __attribute__((visibility("default"))) uint32_t __global_random_seed;                                                   // sequential/concurrent
    231 __attribute__((visibility("hidden"))) uint32_t __global_random_state;                                                   // sequential only
    232 
    233 void set_seed( PRNG & prng, uint32_t seed_ ) with( prng ) { state = seed = seed_; GENERATOR( state ); } // set seed
    234 
    235 void set_seed( uint32_t seed ) { __global_random_state = __global_random_seed = seed; GENERATOR( __global_random_state ); }
    236 uint32_t get_seed() { return __global_random_seed; }
    237 uint32_t prng( void ) { return GENERATOR( __global_random_state ); } // [0,UINT_MAX]
     228__attribute__((visibility("default"))) size_t __global_random_seed; // sequential/concurrent
     229__attribute__((visibility("hidden"))) PRNG_STATE_T __global_random_state; // sequential only
     230
     231void set_seed( size_t seed ) {
     232        __global_random_seed = seed;
     233        PRNG_SET_SEED( __global_random_state, seed );
     234} // set_seed
     235
     236size_t get_seed() { return __global_random_seed; }
     237size_t prng( void ) { return PRNG_NAME( __global_random_state ); } // [0,UINT_MAX]
    238238
    239239//---------------------------------------
  • libcfa/src/stdlib.hfa

    r7d9598d8 r2dcd80a  
    1010// Created On       : Thu Jan 28 17:12:35 2016
    1111// Last Modified By : Peter A. Buhr
    12 // Last Modified On : Thu Aug 25 18:07:06 2022
    13 // Update Count     : 645
     12// Last Modified On : Sun Dec 11 18:25:53 2022
     13// Update Count     : 765
    1414//
    1515
     
    404404//   calls( sprng );
    405405
    406 struct PRNG {
     406trait basic_prng( PRNG &, R ) {
     407        void set_seed( PRNG & prng, R seed );                           // set seed
     408        R get_seed( PRNG & prng );                                                      // get seed
     409        R prng( PRNG & prng );
     410        void ?{}( PRNG & prng );                                                        // random seed
     411        void ?{}( PRNG & prng, R seed );                                        // fixed seed
     412}; // basic_prng
     413
     414static inline forall( PRNG &, R | basic_prng( PRNG, R ) | { R ?%?( R, R ); } ) {
     415        R prng( PRNG & prng, R u ) { return prng( prng ) % u; } // [0,u)
     416}
     417static inline forall( PRNG &, R | basic_prng( PRNG, R ) | { R ?+?( R, R ); R ?-?( R, R ); R ?%?( R, R ); void ?{}( R &, one_t ); } ) {
     418        R prng( PRNG & prng, R l, R u ) { return prng( prng, u - l + (R){1} ) + l; } // [l,u]
     419}
     420
     421struct PRNG32 {
    407422        uint32_t callcnt;                                                                       // call count
    408423        uint32_t seed;                                                                          // current seed
    409         uint32_t state;                                                                         // random state
     424        PRNG_STATE_32_T state;                                                          // random state
    410425}; // PRNG
    411426
    412 void set_seed( PRNG & prng, uint32_t seed_ );
    413 static inline {
    414         void ?{}( PRNG & prng ) with( prng ) { callcnt = 0; set_seed( prng, rdtscl() ); } // random seed
    415         void ?{}( PRNG & prng, uint32_t seed ) with( prng ) { callcnt = 0; set_seed( prng, seed ); } // fixed seed
    416         uint32_t get_seed( PRNG & prng ) __attribute__(( warn_unused_result )) with( prng ) { return seed; } // get seed
    417         uint32_t prng( PRNG & prng ) __attribute__(( warn_unused_result )) with( prng ) { callcnt += 1; return LCG( state ); } // [0,UINT_MAX]
    418         uint32_t prng( PRNG & prng, uint32_t u ) __attribute__(( warn_unused_result )) { return prng( prng ) % u; } // [0,u)
    419         uint32_t prng( PRNG & prng, uint32_t l, uint32_t u ) __attribute__(( warn_unused_result )) { return prng( prng, u - l + 1 ) + l; } // [l,u]
    420         uint32_t calls( PRNG & prng ) __attribute__(( warn_unused_result )) with( prng ) { return callcnt; }
     427static inline {
     428        void set_seed( PRNG32 & prng, uint32_t seed_ ) with( prng ) { seed = seed_; PRNG_SET_SEED_32( state, seed ); }
     429        uint32_t get_seed( PRNG32 & prng ) __attribute__(( warn_unused_result )) with( prng ) { return seed; }
     430        uint32_t prng( PRNG32 & prng ) __attribute__(( warn_unused_result )) with( prng ) { callcnt += 1; return PRNG_NAME_32( state ); } // [0,UINT_MAX]
     431        uint32_t prng( PRNG32 & prng, uint32_t u ) __attribute__(( warn_unused_result )) { return prng( prng ) % u; } // [0,u)
     432        uint32_t prng( PRNG32 & prng, uint32_t l, uint32_t u ) __attribute__(( warn_unused_result )) { return prng( prng, u - l + 1 ) + l; } // [l,u]
     433        uint32_t calls( PRNG32 & prng ) __attribute__(( warn_unused_result )) with( prng ) { return callcnt; }
     434        void ?{}( PRNG32 & prng ) with( prng ) { callcnt = 0; set_seed( prng, rdtscl() ); } // random seed
     435        void ?{}( PRNG32 & prng, uint32_t seed ) with( prng ) { callcnt = 0; set_seed( prng, seed ); } // fixed seed
     436} // distribution
     437
     438struct PRNG64 {
     439        uint64_t callcnt;                                                                       // call count
     440        uint64_t seed;                                                                          // current seed
     441        PRNG_STATE_64_T state;                                                          // random state
     442}; // PRNG
     443
     444static inline {
     445        void set_seed( PRNG64 & prng, uint64_t seed_ ) with( prng ) { seed = seed_; PRNG_SET_SEED_64( state, seed ); }
     446        uint64_t get_seed( PRNG64 & prng ) __attribute__(( warn_unused_result )) with( prng ) { return seed; }
     447        uint64_t prng( PRNG64 & prng ) __attribute__(( warn_unused_result )) with( prng ) { callcnt += 1; return PRNG_NAME_64( state ); } // [0,UINT_MAX]
     448        uint64_t prng( PRNG64 & prng, uint64_t u ) __attribute__(( warn_unused_result )) { return prng( prng ) % u; } // [0,u)
     449        uint64_t prng( PRNG64 & prng, uint64_t l, uint64_t u ) __attribute__(( warn_unused_result )) { return prng( prng, u - l + 1 ) + l; } // [l,u]
     450        uint64_t calls( PRNG64 & prng ) __attribute__(( warn_unused_result )) with( prng ) { return callcnt; }
     451        void ?{}( PRNG64 & prng ) with( prng ) { callcnt = 0; set_seed( prng, rdtscl() ); } // random seed
     452        void ?{}( PRNG64 & prng, uint64_t seed ) with( prng ) { callcnt = 0; set_seed( prng, seed ); } // fixed seed
    421453} // distribution
    422454
     
    435467//   prng( 5, 21 );
    436468
    437 void set_seed( uint32_t seed_ ) OPTIONAL_THREAD;
    438 uint32_t get_seed() __attribute__(( warn_unused_result ));
    439 uint32_t prng( void ) __attribute__(( warn_unused_result )) OPTIONAL_THREAD; // [0,UINT_MAX]
    440 static inline {
    441         uint32_t prng( uint32_t u ) __attribute__(( warn_unused_result )) { return prng() % u; } // [0,u)
    442         uint32_t prng( uint32_t l, uint32_t u ) __attribute__(( warn_unused_result )) { return prng( u - l + 1 ) + l; } // [l,u]
     469// Harmonize with concurrency/thread.hfa.
     470void set_seed( size_t seed_ ) OPTIONAL_THREAD;                  // set global seed
     471size_t get_seed() __attribute__(( warn_unused_result )); // get global seed
     472size_t prng( void ) __attribute__(( warn_unused_result )) OPTIONAL_THREAD; // [0,UINT_MAX]
     473static inline {
     474        size_t prng( size_t u ) __attribute__(( warn_unused_result )) { return prng() % u; } // [0,u)
     475        size_t prng( size_t l, size_t u ) __attribute__(( warn_unused_result )) { return prng( u - l + 1 ) + l; } // [l,u]
    443476} // distribution
    444477
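
    All of the bounded overloads above derive from the full-range draw: prng(g, u) reduces the result modulo u for [0,u), and prng(g, l, u) shifts a [0, u-l] draw by l to cover the closed interval [l,u]. A self-contained C++ sketch of the same arithmetic around a PRNG32-like object; the field layout follows the diff, while the generator body is only a placeholder:

        #include <cstdint>

        struct PRNG32 {                                          // mirrors the diff: call count, seed, state
            uint32_t callcnt = 0;
            uint32_t seed    = 0;
            uint32_t state   = 1;
        };

        static uint32_t prng( PRNG32 & g ) {                     // full range
            g.callcnt += 1;
            g.state = g.state * 1664525u + 1013904223u;          // placeholder LCG (Numerical Recipes constants)
            return g.state;
        }
        static uint32_t prng( PRNG32 & g, uint32_t u ) {         // [0, u)
            return prng( g ) % u;
        }
        static uint32_t prng( PRNG32 & g, uint32_t l, uint32_t u ) {   // [l, u]
            return prng( g, u - l + 1 ) + l;
        }

        // usage: PRNG32 g;  uint32_t d = prng( g, 5, 21 );      // some value in [5,21]
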
  • src/AST/Convert.cpp

    r7d9598d8 r2dcd80a  
    17641764                        { old->linkage.val },
    17651765                        GET_ACCEPT_1(base, Type),
     1766                        old->hide == EnumDecl::EnumHiding::Hide ? ast::EnumDecl::EnumHiding::Hide : ast::EnumDecl::EnumHiding::Visible,
    17661767                        old->enumValues
    17671768                );
  • src/AST/Decl.cpp

    r7d9598d8 r2dcd80a  
    125125}
    126126
    127 std::ostream & operator<< ( std::ostream & out, const TypeDecl::Data & data ) {
     127std::ostream & operator<< ( std::ostream & out, const TypeData & data ) {
    128128        return out << data.kind << ", " << data.isComplete;
    129129}
  • src/AST/Decl.hpp

    r7d9598d8 r2dcd80a  
    1010// Created On       : Thu May 9 10:00:00 2019
    1111// Last Modified By : Andrew Beach
    12 // Last Modified On : Thu May  5 12:09:00 2022
    13 // Update Count     : 33
     12// Last Modified On : Thu Nov 24  9:44:00 2022
     13// Update Count     : 34
    1414//
    1515
     
    191191        ptr<Type> init;
    192192
    193         /// Data extracted from a type decl
    194         struct Data {
    195                 Kind kind;
    196                 bool isComplete;
    197 
    198                 Data() : kind( NUMBER_OF_KINDS ), isComplete( false ) {}
    199                 Data( const TypeDecl * d ) : kind( d->kind ), isComplete( d->sized ) {}
    200                 Data( Kind k, bool c ) : kind( k ), isComplete( c ) {}
    201                 Data( const Data & d1, const Data & d2 )
    202                         : kind( d1.kind ), isComplete( d1.isComplete || d2.isComplete ) {}
    203 
    204                 bool operator==( const Data & o ) const { return kind == o.kind && isComplete == o.isComplete; }
    205                 bool operator!=( const Data & o ) const { return !(*this == o); }
    206         };
    207 
    208193        TypeDecl(
    209194                const CodeLocation & loc, const std::string & name, Storage::Classes storage,
     
    225210};
    226211
    227 std::ostream & operator<< ( std::ostream &, const TypeDecl::Data & );
     212/// Data extracted from a TypeDecl.
     213struct TypeData {
     214        TypeDecl::Kind kind;
     215        bool isComplete;
     216
     217        TypeData() : kind( TypeDecl::NUMBER_OF_KINDS ), isComplete( false ) {}
     218        TypeData( const TypeDecl * d ) : kind( d->kind ), isComplete( d->sized ) {}
     219        TypeData( TypeDecl::Kind k, bool c ) : kind( k ), isComplete( c ) {}
     220        TypeData( const TypeData & d1, const TypeData & d2 )
     221                : kind( d1.kind ), isComplete( d1.isComplete || d2.isComplete ) {}
     222
     223        bool operator==( const TypeData & o ) const { return kind == o.kind && isComplete == o.isComplete; }
     224        bool operator!=( const TypeData & o ) const { return !(*this == o); }
     225};
     226
     227std::ostream & operator<< ( std::ostream &, const TypeData & );
    228228
    229229/// C-style typedef `typedef Foo Bar`
     
    315315        // enum (type_optional) Name {...}
    316316        ptr<Type> base; // if isTyped == true && base.get() == nullptr, it is a "void" type enum
    317 
    318         EnumDecl( const CodeLocation& loc, const std::string& name, bool isTyped = false,
     317        enum class EnumHiding { Visible, Hide } hide;
     318
     319        EnumDecl( const CodeLocation& loc, const std::string& name, bool isTyped = false,
    319320                std::vector<ptr<Attribute>>&& attrs = {}, Linkage::Spec linkage = Linkage::Cforall,
    320                 Type const * base = nullptr,
     321                Type const * base = nullptr, EnumHiding hide = EnumHiding::Hide,
    321322                std::unordered_map< std::string, long long > enumValues = std::unordered_map< std::string, long long >() )
    322         : AggregateDecl( loc, name, std::move(attrs), linkage ), isTyped(isTyped), base(base), enumValues(enumValues) {}
     323        : AggregateDecl( loc, name, std::move(attrs), linkage ), isTyped(isTyped), base(base), hide(hide), enumValues(enumValues) {}
    323324
    324325        /// gets the integer value for this enumerator, returning true iff value found
  • src/AST/Pass.impl.hpp

    r7d9598d8 r2dcd80a  
    686686
    687687        if ( __visit_children() ) {
    688                 // unlike structs, traits, and unions, enums inject their members into the global scope
    689                 maybe_accept( node, &EnumDecl::base );
    690                 maybe_accept( node, &EnumDecl::params     );
    691                 maybe_accept( node, &EnumDecl::members    );
    692                 maybe_accept( node, &EnumDecl::attributes );
     688                if ( node->hide == ast::EnumDecl::EnumHiding::Hide ) {
     689                        guard_symtab guard { *this };
     690                        maybe_accept( node, &EnumDecl::base );
     691                        maybe_accept( node, &EnumDecl::params     );
     692                        maybe_accept( node, &EnumDecl::members    );
     693                        maybe_accept( node, &EnumDecl::attributes );
     694                } else {
     695                        maybe_accept( node, &EnumDecl::base );
     696                        maybe_accept( node, &EnumDecl::params     );
     697                        maybe_accept( node, &EnumDecl::members    );
     698                        maybe_accept( node, &EnumDecl::attributes );
     699                }
    693700        }
    694701
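
    The Hide branch above differs from the Visible one only in constructing a guard_symtab before visiting the members, which presumably opens a fresh symbol-table scope so a hidden enum no longer injects its enumerators into the enclosing scope. A generic sketch of that RAII scope-guard pattern, assuming a symbol table with explicit begin/end scope operations:

        // Hypothetical symbol table with explicit scope management.
        struct SymbolTable {
            int depth = 0;
            void beginScope() { ++depth; }
            void endScope()   { --depth; }
        };

        // RAII guard: opens a scope on construction and closes it on destruction,
        // so every return path out of the visit pops the scope exactly once.
        struct guard_symtab {
            SymbolTable & symtab;
            explicit guard_symtab( SymbolTable & s ) : symtab( s ) { symtab.beginScope(); }
            ~guard_symtab() { symtab.endScope(); }
        };

        // usage sketch:  { guard_symtab guard { symtab };  /* visit enum members */ }
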
  • src/AST/Type.cpp

    r7d9598d8 r2dcd80a  
    1010// Created On       : Mon May 13 15:00:00 2019
    1111// Last Modified By : Andrew Beach
    12 // Last Modified On : Thu Jul 23 14:16:00 2020
    13 // Update Count     : 5
     12// Last Modified On : Thu Nov 24  9:49:00 2022
     13// Update Count     : 6
    1414//
    1515
     
    147147// --- TypeInstType
    148148
     149TypeInstType::TypeInstType( const TypeEnvKey & key )
     150: BaseInstType(key.base->name), base(key.base), kind(key.base->kind), formal_usage(key.formal_usage), expr_id(key.expr_id) {}
     151
    149152bool TypeInstType::operator==( const TypeInstType & other ) const {
    150153        return base == other.base
     
    164167bool TypeInstType::isComplete() const { return base->sized; }
    165168
    166 std::string TypeInstType::TypeEnvKey::typeString() const {
     169std::string TypeEnvKey::typeString() const {
    167170        return std::string("_") + std::to_string(formal_usage)
    168171                + "_" + std::to_string(expr_id) + "_" + base->name;
    169172}
    170173
    171 bool TypeInstType::TypeEnvKey::operator==(
    172                 const TypeInstType::TypeEnvKey & other ) const {
     174bool TypeEnvKey::operator==(
     175                const TypeEnvKey & other ) const {
    173176        return base == other.base
    174177                && formal_usage == other.formal_usage
     
    176179}
    177180
    178 bool TypeInstType::TypeEnvKey::operator<(
    179                 const TypeInstType::TypeEnvKey & other ) const {
     181bool TypeEnvKey::operator<(
     182                const TypeEnvKey & other ) const {
    180183        // TypeEnvKey ordering is an arbitrary total ordering.
    181184        // It doesn't mean anything but allows for a sorting.
  • src/AST/Type.hpp

    r7d9598d8 r2dcd80a  
    1010// Created On       : Thu May 9 10:00:00 2019
    1111// Last Modified By : Andrew Beach
    12 // Last Modified On : Wed Jul 14 15:54:00 2021
    13 // Update Count     : 7
     12// Last Modified On : Thu Nov 24  9:47:00 2022
     13// Update Count     : 8
    1414//
    1515
     
    390390};
    391391
     392struct TypeEnvKey;
     393
    392394/// instance of named type alias (typedef or variable)
    393395class TypeInstType final : public BaseInstType {
     
    401403        int expr_id = 0;
    402404
    403         // compact representation used for map lookups.
    404         struct TypeEnvKey {
    405                 const TypeDecl * base = nullptr;
    406                 int formal_usage = 0;
    407                 int expr_id = 0;
    408 
    409                 TypeEnvKey() = default;
    410                 TypeEnvKey(const TypeDecl * base, int formal_usage = 0, int expr_id = 0)
    411                 : base(base), formal_usage(formal_usage), expr_id(expr_id) {}
    412                 TypeEnvKey(const TypeInstType & inst)
    413                 : base(inst.base), formal_usage(inst.formal_usage), expr_id(inst.expr_id) {}
    414                 std::string typeString() const;
    415                 bool operator==(const TypeEnvKey & other) const;
    416                 bool operator<(const TypeEnvKey & other) const;
    417         };
    418 
    419405        bool operator==(const TypeInstType & other) const;
    420406
     
    433419        TypeInstType( const TypeInstType & o ) = default;
    434420
    435         TypeInstType( const TypeEnvKey & key )
    436         : BaseInstType(key.base->name), base(key.base), kind(key.base->kind), formal_usage(key.formal_usage), expr_id(key.expr_id) {}
     421        TypeInstType( const TypeEnvKey & key );
    437422
    438423        /// sets `base`, updating `kind` correctly
     
    453438        TypeInstType * clone() const override { return new TypeInstType{ *this }; }
    454439        MUTATE_FRIEND
     440};
     441
     442/// Compact representation of TypeInstType used for map lookups.
     443struct TypeEnvKey {
     444        const TypeDecl * base = nullptr;
     445        int formal_usage = 0;
     446        int expr_id = 0;
     447
     448        TypeEnvKey() = default;
     449        TypeEnvKey(const TypeDecl * base, int formal_usage = 0, int expr_id = 0)
     450        : base(base), formal_usage(formal_usage), expr_id(expr_id) {}
     451        TypeEnvKey(const TypeInstType & inst)
     452        : base(inst.base), formal_usage(inst.formal_usage), expr_id(inst.expr_id) {}
     453        std::string typeString() const;
     454        bool operator==(const TypeEnvKey & other) const;
     455        bool operator<(const TypeEnvKey & other) const;
    455456};
    456457
     
    560561namespace std {
    561562        template<>
    562         struct hash<typename ast::TypeInstType::TypeEnvKey> {
    563                 size_t operator() (const ast::TypeInstType::TypeEnvKey & x) const {
     563        struct hash<typename ast::TypeEnvKey> {
     564                size_t operator() (const ast::TypeEnvKey & x) const {
    564565                        const size_t p = 1000007;
    565566                        size_t res = reinterpret_cast<size_t>(x.base);
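
    The std::hash specialization above is cut off by the hunk after seeding the result with the base pointer; the remaining fields are presumably folded in with the same prime. Purely as an illustration of such a combine, not the actual in-tree body:

        #include <cstddef>
        #include <functional>

        struct KeySketch {                                       // stand-in for ast::TypeEnvKey's three fields
            const void * base = nullptr;
            int formal_usage = 0;
            int expr_id = 0;
        };

        static size_t hash_key( const KeySketch & x ) {
            const size_t p = 1000007;                            // same prime as the specialization above
            size_t res = reinterpret_cast<size_t>( x.base );
            res = p * res + std::hash<int>{}( x.formal_usage );
            res = p * res + std::hash<int>{}( x.expr_id );
            return res;
        }
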
  • src/AST/TypeEnvironment.cpp

    r7d9598d8 r2dcd80a  
    8282}
    8383
    84 const EqvClass * TypeEnvironment::lookup( const TypeInstType::TypeEnvKey & var ) const {
     84const EqvClass * TypeEnvironment::lookup( const TypeEnvKey & var ) const {
    8585        for ( ClassList::const_iterator i = env.begin(); i != env.end(); ++i ) {
    8686                if ( i->vars.find( var ) != i->vars.end() ) return &*i;
     
    122122void TypeEnvironment::writeToSubstitution( TypeSubstitution & sub ) const {
    123123        for ( const auto & clz : env ) {
    124                 TypeInstType::TypeEnvKey clzRep;
     124                TypeEnvKey clzRep;
    125125                bool first = true;
    126126                for ( const auto & var : clz.vars ) {
     
    146146        struct Occurs : public ast::WithVisitorRef<Occurs> {
    147147                bool result;
    148                 std::unordered_set< TypeInstType::TypeEnvKey > vars;
     148                std::unordered_set< TypeEnvKey > vars;
    149149                const TypeEnvironment & tenv;
    150150
    151                 Occurs( const TypeInstType::TypeEnvKey & var, const TypeEnvironment & env )
     151                Occurs( const TypeEnvKey & var, const TypeEnvironment & env )
    152152                : result( false ), vars(), tenv( env ) {
    153153                        if ( const EqvClass * clz = tenv.lookup( var ) ) {
     
    170170
    171171        /// true if `var` occurs in `ty` under `env`
    172         bool occurs( const Type * ty, const TypeInstType::TypeEnvKey & var, const TypeEnvironment & env ) {
     172        bool occurs( const Type * ty, const TypeEnvKey & var, const TypeEnvironment & env ) {
    173173                Pass<Occurs> occur{ var, env };
    174174                maybe_accept( ty, occur );
     
    258258namespace {
    259259        /// true if the given type can be bound to the given type variable
    260         bool tyVarCompatible( const TypeDecl::Data & data, const Type * type ) {
     260        bool tyVarCompatible( const TypeData & data, const Type * type ) {
    261261                switch ( data.kind ) {
    262262                  case TypeDecl::Dtype:
     
    279279
    280280bool TypeEnvironment::bindVar(
    281                 const TypeInstType * typeInst, const Type * bindTo, const TypeDecl::Data & data,
     281                const TypeInstType * typeInst, const Type * bindTo, const TypeData & data,
    282282                AssertionSet & need, AssertionSet & have, const OpenVarSet & open, WidenMode widen,
    283283                const SymbolTable & symtab
     
    319319
    320320bool TypeEnvironment::bindVarToVar(
    321                 const TypeInstType * var1, const TypeInstType * var2, TypeDecl::Data && data,
     321                const TypeInstType * var1, const TypeInstType * var2, TypeData && data,
    322322                AssertionSet & need, AssertionSet & have, const OpenVarSet & open,
    323323                WidenMode widen, const SymbolTable & symtab
     
    457457}
    458458
    459 TypeEnvironment::ClassList::iterator TypeEnvironment::internal_lookup( const TypeInstType::TypeEnvKey & var ) {
     459TypeEnvironment::ClassList::iterator TypeEnvironment::internal_lookup( const TypeEnvKey & var ) {
    460460        for ( ClassList::iterator i = env.begin(); i != env.end(); ++i ) {
    461461                if ( i->vars.count( var ) ) return i;
  • src/AST/TypeEnvironment.hpp

    r7d9598d8 r2dcd80a  
    7979
    8080/// Set of open variables
    81 using OpenVarSet = std::unordered_map< TypeInstType::TypeEnvKey, TypeDecl::Data >;
     81using OpenVarSet = std::unordered_map< TypeEnvKey, TypeData >;
    8282
    8383/// Merges one set of open vars into another
     
    9595/// they bind to.
    9696struct EqvClass {
    97         std::unordered_set< TypeInstType::TypeEnvKey > vars;
     97        std::unordered_set< TypeEnvKey > vars;
    9898        ptr<Type> bound;
    9999        bool allowWidening;
    100         TypeDecl::Data data;
     100        TypeData data;
    101101
    102102        EqvClass() : vars(), bound(), allowWidening( true ), data() {}
     
    111111
    112112        /// Singleton class constructor from substitution
    113         EqvClass( const TypeInstType::TypeEnvKey & v, const Type * b )
     113        EqvClass( const TypeEnvKey & v, const Type * b )
    114114        : vars{ v }, bound( b ), allowWidening( false ), data( TypeDecl::Dtype, false ) {}
    115115
    116116        /// Single-var constructor (strips qualifiers from bound type)
    117         EqvClass( const TypeInstType::TypeEnvKey & v, const Type * b, bool w, const TypeDecl::Data & d )
     117        EqvClass( const TypeEnvKey & v, const Type * b, bool w, const TypeData & d )
    118118        : vars{ v }, bound( b ), allowWidening( w ), data( d ) {
    119119                reset_qualifiers( bound );
     
    121121
    122122        /// Double-var constructor
    123         EqvClass( const TypeInstType::TypeEnvKey & v, const TypeInstType::TypeEnvKey & u, bool w, const TypeDecl::Data & d )
     123        EqvClass( const TypeEnvKey & v, const TypeEnvKey & u, bool w, const TypeData & d )
    124124        : vars{ v, u }, bound(), allowWidening( w ), data( d ) {}
    125125
     
    137137public:
    138138        /// Finds the equivalence class containing a variable; nullptr for none such
    139         const EqvClass * lookup( const TypeInstType::TypeEnvKey & var ) const;
     139        const EqvClass * lookup( const TypeEnvKey & var ) const;
    140140
    141141        /// Add a new equivalence class for each type variable
     
    181181        /// needed. Returns false on failure.
    182182        bool bindVar(
    183                 const TypeInstType * typeInst, const Type * bindTo, const TypeDecl::Data & data,
     183                const TypeInstType * typeInst, const Type * bindTo, const TypeData & data,
    184184                AssertionSet & need, AssertionSet & have, const OpenVarSet & openVars,
    185185                ResolvExpr::WidenMode widen, const SymbolTable & symtab );
     
    188188        /// classes if needed. Returns false on failure.
    189189        bool bindVarToVar(
    190                 const TypeInstType * var1, const TypeInstType * var2, TypeDecl::Data && data,
     190                const TypeInstType * var1, const TypeInstType * var2, TypeData && data,
    191191                AssertionSet & need, AssertionSet & have, const OpenVarSet & openVars,
    192192                ResolvExpr::WidenMode widen, const SymbolTable & symtab );
     
    213213
    214214        /// Private lookup API; returns array index of string, or env.size() for not found
    215         ClassList::iterator internal_lookup( const TypeInstType::TypeEnvKey & );
     215        ClassList::iterator internal_lookup( const TypeEnvKey & );
    216216};
    217217
  • src/AST/TypeSubstitution.cpp

    r7d9598d8 r2dcd80a  
    5252}
    5353
    54 void TypeSubstitution::add( const TypeInstType::TypeEnvKey & key, const Type * actualType) {
     54void TypeSubstitution::add( const TypeEnvKey & key, const Type * actualType) {
    5555        typeMap[ key ] = actualType;
    5656}
     
    6464
    6565const Type *TypeSubstitution::lookup(
    66                 const TypeInstType::TypeEnvKey & formalType ) const {
     66                const TypeEnvKey & formalType ) const {
    6767        TypeMap::const_iterator i = typeMap.find( formalType );
    6868
     
    8585
    8686const Type *TypeSubstitution::lookup( const TypeInstType * formalType ) const {
    87         return lookup( ast::TypeInstType::TypeEnvKey( *formalType ) );
     87        return lookup( ast::TypeEnvKey( *formalType ) );
    8888}
    8989
  • src/AST/TypeSubstitution.hpp

    r7d9598d8 r2dcd80a  
    7272
    7373        void add( const TypeInstType * formalType, const Type *actualType );
    74         void add( const TypeInstType::TypeEnvKey & key, const Type *actualType );
     74        void add( const TypeEnvKey & key, const Type *actualType );
    7575        void add( const TypeSubstitution &other );
    7676        void remove( const TypeInstType * formalType );
    77         const Type *lookup( const TypeInstType::TypeEnvKey & formalType ) const;
     77        const Type *lookup( const TypeEnvKey & formalType ) const;
    7878        const Type *lookup( const TypeInstType * formalType ) const;
    7979        bool empty() const;
     
    105105        friend class Pass;
    106106
    107         typedef std::unordered_map< TypeInstType::TypeEnvKey, ptr<Type> > TypeMap;
     107        typedef std::unordered_map< TypeEnvKey, ptr<Type> > TypeMap;
    108108        TypeMap typeMap;
    109109
     
    184184                int subCount = 0;
    185185                bool freeOnly;
    186                 typedef std::unordered_set< TypeInstType::TypeEnvKey > BoundVarsType;
     186                typedef std::unordered_set< TypeEnvKey > BoundVarsType;
    187187                BoundVarsType boundVars;
    188188
  • src/CodeGen/CodeGenerator.cc

    r7d9598d8 r2dcd80a  
    290290                                        if ( obj->get_init() ) {
    291291                                                obj->get_init()->accept( *visitor );
    292                                                 last_val = ((ConstantExpr *)(((SingleInit *)(obj->init))->value))->constant.get_ival();
     292                                                Expression* expr = ((SingleInit *)(obj->init))->value;
     293                                                while ( auto temp = dynamic_cast<CastExpr *>(expr) ) {
     294                                                        expr = temp->arg;
     295                                                }
     296                                                last_val = ((ConstantExpr *)expr)->constant.get_ival();
    293297                                        } else {
    294298                                                output << ++last_val;
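
    The change above handles typed enumerators whose initializing expression arrives wrapped in one or more implicit casts: the generator now peels CastExpr nodes until only the underlying expression remains before reading its constant value. The same peel-the-wrappers loop in a stand-alone C++ form, with simplified stand-ins for the compiler's node classes:

        struct Expression { virtual ~Expression() = default; };
        struct CastExpr : Expression {
            Expression * arg;
            explicit CastExpr( Expression * a ) : arg( a ) {}
        };
        struct ConstantExpr : Expression {
            long long value;
            explicit ConstantExpr( long long v ) : value( v ) {}
        };

        // Strip any number of nested casts to reach the expression they wrap.
        static Expression * strip_casts( Expression * expr ) {
            while ( auto cast = dynamic_cast<CastExpr *>( expr ) ) {
                expr = cast->arg;
            }
            return expr;
        }
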
  • src/GenPoly/Box.cc

    r7d9598d8 r2dcd80a  
    3737#include "InitTweak/InitTweak.h"         // for getFunctionName, isAssignment
    3838#include "Lvalue.h"                      // for generalizedLvalue
    39 #include "ResolvExpr/TypeEnvironment.h"  // for EqvClass
    4039#include "ResolvExpr/typeops.h"          // for typesCompatible
    4140#include "ScopedSet.h"                   // for ScopedSet, ScopedSet<>::iter...
     
    9594                  private:
    9695                        /// Pass the extra type parameters from polymorphic generic arguments or return types into a function application
    97                         void passArgTypeVars( ApplicationExpr *appExpr, Type *parmType, Type *argBaseType, std::list< Expression *>::iterator &arg, const TyVarMap &exprTyVars, std::set< std::string > &seenTypes );
     96                        /// Will insert 0, 2 or 3 more arguments.
     97                        std::list< Expression *>::iterator passArgTypeVars( ApplicationExpr *appExpr, Type *parmType, Type *argBaseType, std::list< Expression *>::iterator arg, const TyVarMap &exprTyVars, std::set< std::string > &seenTypes );
    9898                        /// passes extra type parameters into a polymorphic function application
    9999                        /// Returns an iterator to the first argument after the added
     
    488488                                makeTyVarMap( functionType, scopeTyVars );
    489489
    490                                 std::list< DeclarationWithType *> &paramList = functionType->parameters;
    491490                                std::list< FunctionType const *> functions;
    492491                                for ( TypeDecl * const tyVar : functionType->forall ) {
     
    495494                                        } // for
    496495                                } // for
    497                                 for ( DeclarationWithType * const arg : paramList ) {
     496                                for ( DeclarationWithType * const arg : functionType->parameters ) {
    498497                                        findFunction( arg->get_type(), functions, scopeTyVars, needsAdapter );
    499498                                } // for
     
    532531                }
    533532
    534                 void Pass1::passArgTypeVars( ApplicationExpr *appExpr, Type *parmType, Type *argBaseType, std::list< Expression *>::iterator &arg, const TyVarMap &exprTyVars, std::set< std::string > &seenTypes ) {
     533                std::list< Expression *>::iterator Pass1::passArgTypeVars( ApplicationExpr *appExpr, Type *parmType, Type *argBaseType, std::list< Expression *>::iterator arg, const TyVarMap &exprTyVars, std::set< std::string > &seenTypes ) {
    535534                        Type *polyType = isPolyType( parmType, exprTyVars );
    536535                        if ( polyType && ! dynamic_cast< TypeInstType* >( polyType ) ) {
    537536                                std::string typeName = mangleType( polyType );
    538                                 if ( seenTypes.count( typeName ) ) return;
     537                                if ( seenTypes.count( typeName ) ) return arg;
    539538
    540539                                arg = appExpr->get_args().insert( arg, new SizeofExpr( argBaseType->clone() ) );
     
    556555                                seenTypes.insert( typeName );
    557556                        }
     557                        return arg;
    558558                }
    559559
     
    562562                        std::list< Expression *>::iterator arg = appExpr->args.begin();
    563563                        // pass size/align for type variables
     564                        // NOTE: This is iterating over a map. This means the sorting
     565                        // order of the keys changes behaviour, as the iteration order
     566                        // is visible outside the loop. - The order matches the original
     567                        // order because the vars have been renamed with numbers that,
     568                        // even when converted to strings, sort in the original order.
     569                        // (At least, that is the best explanation I have.)
    564570                        for ( std::pair<std::string, TypeDecl::Data> const & tyParam : exprTyVars ) {
    565                                 ResolvExpr::EqvClass eqvClass;
    566                                 if ( tyParam.second.isComplete ) {
    567                                         Type *concrete = env->lookup( tyParam.first );
    568                                         // If there is an unbound type variable, it should have detected already.
    569                                         assertf( concrete, "Unbound type variable: %s in: %s",
    570                                                 toCString( tyParam.first ), toCString( *env ) );
    571 
    572                                         arg = appExpr->get_args().insert( arg, new SizeofExpr( concrete->clone() ) );
    573                                         arg++;
    574                                         arg = appExpr->get_args().insert( arg, new AlignofExpr( concrete->clone() ) );
    575                                         arg++;
    576                                 } // if
     571                                if ( !tyParam.second.isComplete ) continue;
     572                                Type *concrete = env->lookup( tyParam.first );
     573                                // If there is an unbound type variable, it should have detected already.
     574                                assertf( concrete, "Unbound type variable: %s in: %s",
     575                                        toCString( tyParam.first ), toCString( *env ) );
     576
     577                                arg = appExpr->get_args().insert( arg, new SizeofExpr( concrete->clone() ) );
     578                                arg++;
     579                                arg = appExpr->get_args().insert( arg, new AlignofExpr( concrete->clone() ) );
     580                                arg++;
    577581                        } // for
    578582
     
    582586                        assert( funcType );
    583587
    584                         // These iterators don't advance in unison.
    585                         std::list< DeclarationWithType* >::const_iterator fnParm = funcType->get_parameters().begin();
    586                         std::list< Expression* >::const_iterator fnArg = arg;
    587                         std::set< std::string > seenTypes; ///< names for generic types we've seen
     588                        // Iterator over the original function arguments.
     589                        std::list< Expression* >::const_iterator fnArg;
     590                        // Names for generic types we've seen.
     591                        std::set< std::string > seenTypes;
    588592
    589593                        // a polymorphic return type may need to be added to the argument list
    590594                        if ( polyRetType ) {
    591595                                Type *concRetType = replaceWithConcrete( polyRetType, env );
    592                                 passArgTypeVars( appExpr, polyRetType, concRetType, arg, exprTyVars, seenTypes );
    593                                 ++fnArg; // skip the return parameter in the argument list
     596                                arg = passArgTypeVars( appExpr, polyRetType, concRetType, arg, exprTyVars, seenTypes );
     597                                // Skip the return parameter in the argument list.
     598                                fnArg = arg + 1;
     599                        } else {
     600                                fnArg = arg;
    594601                        }
    595602
    596603                        // add type information args for presently unseen types in parameter list
     604                        std::list< DeclarationWithType* >::const_iterator fnParm = funcType->get_parameters().begin();
    597605                        for ( ; fnParm != funcType->get_parameters().end() && fnArg != appExpr->get_args().end(); ++fnParm, ++fnArg ) {
    598                                 if ( ! (*fnArg)->get_result() ) continue;
    599606                                Type * argType = (*fnArg)->get_result();
    600                                 passArgTypeVars( appExpr, (*fnParm)->get_type(), argType, arg, exprTyVars, seenTypes );
     607                                if ( ! argType ) continue;
     608                                arg = passArgTypeVars( appExpr, (*fnParm)->get_type(), argType, arg, exprTyVars, seenTypes );
    601609                        }
    602610                        return arg;
     
    680688                Expression *Pass1::applyAdapter( ApplicationExpr *appExpr, FunctionType *function ) {
    681689                        Expression *ret = appExpr;
    682 //                      if ( ! function->get_returnVals().empty() && isPolyType( function->get_returnVals().front()->get_type(), tyVars ) ) {
    683690                        if ( isDynRet( function, scopeTyVars ) ) {
    684691                                ret = addRetParam( appExpr, function->returnVals.front()->get_type() );
     
    772779
    773780                void Pass1::addInferredParams( ApplicationExpr *appExpr, std::list< Expression *>::iterator arg, FunctionType *functionType, const TyVarMap &tyVars ) {
    774                         std::list< Expression *>::iterator cur = arg;
    775781                        for ( TypeDecl * const tyVar : functionType->forall ) {
    776782                                for ( DeclarationWithType * const assert : tyVar->assertions ) {
     
    779785                                        Expression *newExpr = inferParam->second.expr->clone();
    780786                                        boxParam( newExpr, assert->get_type(), tyVars );
    781                                         appExpr->get_args().insert( cur, newExpr );
     787                                        arg = appExpr->get_args().insert( arg, newExpr );
     788                                        ++arg;
    782789                                } // for
    783790                        } // for
     
    922929                                // only attempt to create an adapter or pass one as a parameter if we haven't already done so for this
    923930                                // pre-substitution parameter function type.
    924                                 if ( adaptersDone.find( mangleName ) == adaptersDone.end() ) {
    925                                         adaptersDone.insert( adaptersDone.begin(), mangleName );
     931                                // The second part of the insert result is "is the value new".
     932                                if ( adaptersDone.insert( mangleName ).second ) {
    926933
    927934                                        // apply substitution to type variables to figure out what the adapter's type should look like
     
    11061113
    11071114                        Expression *ret = appExpr;
     1115                        // Save iterator to the first original parameter (works with lists).
    11081116                        std::list< Expression *>::iterator paramBegin = appExpr->get_args().begin();
    11091117
     
    11721180
    11731181                void Pass1::premutate( AddressExpr * ) { visit_children = false; }
     1182
    11741183                Expression * Pass1::postmutate( AddressExpr * addrExpr ) {
    11751184                        assert( addrExpr->arg->result && ! addrExpr->arg->result->isVoid() );
     
    12321241
    12331242                void Pass2::addAdapters( FunctionType *functionType ) {
    1234                         std::list< DeclarationWithType *> &paramList = functionType->parameters;
    12351243                        std::list< FunctionType const *> functions;
    12361244                        for ( DeclarationWithType * const arg : functionType->parameters ) {
     
    12451253                                        std::string adapterName = makeAdapterName( mangleName );
    12461254                                        // adapter may not be used in body, pass along with unused attribute.
    1247                                         paramList.push_front( new ObjectDecl( adapterName, Type::StorageClasses(), LinkageSpec::C, 0, new PointerType( Type::Qualifiers(), makeAdapterType( funType, scopeTyVars ) ), 0, { new Attribute( "unused" ) } ) );
     1255                                        functionType->parameters.push_front(
     1256                                                new ObjectDecl( adapterName, Type::StorageClasses(), LinkageSpec::C, 0, new PointerType( Type::Qualifiers(), makeAdapterType( funType, scopeTyVars ) ), 0, { new Attribute( "unused" ) } ) );
    12481257                                        adaptersDone.insert( adaptersDone.begin(), mangleName );
    12491258                                }
    12501259                        }
    1251 //  deleteAll( functions );
    12521260                }
    12531261
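
    One small idiom introduced above: the adapter bookkeeping now relies on the return value of std::set::insert, whose .second member is true only when the element was not already present, folding the previous find-then-insert pair into a single lookup. A tiny stand-alone example of the idiom:

        #include <iostream>
        #include <set>
        #include <string>

        int main() {
            std::set<std::string> adaptersDone;
            for ( const char * name : { "f", "g", "f" } ) {
                if ( adaptersDone.insert( name ).second ) {      // true only on first insertion
                    std::cout << "generate adapter for " << name << '\n';
                }
            }
        }
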
  • src/GenPoly/ErasableScopedMap.h

    r7d9598d8 r2dcd80a  
    2323
    2424namespace GenPoly {
    25         /// A map where the items are placed into nested scopes;
    26         /// inserted items are placed into the innermost scope, lookup looks from the innermost scope outward;
    27         /// erasing a key means that find() will no longer report any instance of the key in a scope further
    28         /// out, but the erasure itself is scoped. Key erasure works by inserting a sentinal value into the
    29         /// value field, and thus only works for Value types where a meaningful sentinal can be chosen.
    30         template<typename Key, typename Value>
    31         class ErasableScopedMap {
    32                 typedef std::map< Key, Value > Scope;
    33                 typedef std::vector< Scope > ScopeList;
    34 
    35                 ScopeList scopes; ///< scoped list of maps
    36                 Value erased;     ///< sentinal value for erased keys
    37         public:
    38                 typedef typename Scope::key_type key_type;
    39                 typedef typename Scope::mapped_type mapped_type;
    40                 typedef typename Scope::value_type value_type;
    41                 typedef typename ScopeList::size_type size_type;
    42                 typedef typename ScopeList::difference_type difference_type;
    43                 typedef typename Scope::reference reference;
    44                 typedef typename Scope::const_reference const_reference;
    45                 typedef typename Scope::pointer pointer;
    46                 typedef typename Scope::const_pointer const_pointer;
    47 
    48                 class iterator : public std::iterator< std::bidirectional_iterator_tag,
    49                                                        value_type > {
    50                 friend class ErasableScopedMap;
    51                 friend class const_iterator;
    52                         typedef typename std::map< Key, Value >::iterator wrapped_iterator;
    53                         typedef typename std::vector< std::map< Key, Value > > scope_list;
    54                         typedef typename scope_list::size_type size_type;
    55 
    56                         /// Checks if this iterator points to a valid item
    57                         bool is_valid() const {
    58                                 return it != map->scopes[i].end() && it->second != map->erased;
     25
     26/// A map where the items are placed into nested scopes.
     27/// Inserted items are placed into the innermost scope; lookup proceeds from the
     28/// innermost scope outward. Erasing a key means that find() will no longer
     29/// report any instance of the key in a scope further out, but the erasure
     30/// itself is scoped. Key erasure works by inserting a sentinel value into
     31/// the value field, and thus only works for Value types where a meaningful
     32/// sentinel can be chosen.
     33template<typename Key, typename Value>
     34class ErasableScopedMap {
     35        typedef std::map< Key, Value > Scope;
     36        typedef std::vector< Scope > ScopeList;
     37
     38        /// Scoped list of maps.
     39        ScopeList scopes;
     40        /// Sentinel value for erased keys.
     41        Value erased;
     42public:
     43        typedef typename Scope::key_type key_type;
     44        typedef typename Scope::mapped_type mapped_type;
     45        typedef typename Scope::value_type value_type;
     46        typedef typename ScopeList::size_type size_type;
     47        typedef typename ScopeList::difference_type difference_type;
     48        typedef typename Scope::reference reference;
     49        typedef typename Scope::const_reference const_reference;
     50        typedef typename Scope::pointer pointer;
     51        typedef typename Scope::const_pointer const_pointer;
     52
     53        // Both iterator types are complete bidirectional iterators, defined below.
     54        class iterator;
     55        class const_iterator;
     56
     57        /// Starts a new scope
     58        void beginScope() {
     59                Scope scope;
     60                scopes.push_back(scope);
     61        }
     62
     63        /// Ends a scope; invalidates any iterators pointing to elements of that scope
     64        void endScope() {
     65                scopes.pop_back();
     66                assert( ! scopes.empty() );
     67        }
     68
     69        /// Default constructor initializes with one scope
     70        ErasableScopedMap( const Value &erased_ ) : erased( erased_ ) { beginScope(); }
     71
     72        iterator begin() { return iterator(*this, scopes.back().begin(), scopes.size()-1).next_valid(); }
     73        const_iterator begin() const { return const_iterator(*this, scopes.back().begin(), scopes.size()-1).next_valid(); }
     74        const_iterator cbegin() const { return const_iterator(*this, scopes.back().begin(), scopes.size()-1).next_valid(); }
     75        iterator end() { return iterator(*this, scopes[0].end(), 0); }
     76        const_iterator end() const { return const_iterator(*this, scopes[0].end(), 0); }
     77        const_iterator cend() const { return const_iterator(*this, scopes[0].end(), 0); }
     78
     79        /// Gets the index of the current scope (counted from 1)
     80        size_type currentScope() const { return scopes.size(); }
     81
     82        /// Finds the given key in the outermost scope it occurs; returns end() for none such
     83        iterator find( const Key &key ) {
     84                for ( size_type i = scopes.size() - 1; ; --i ) {
     85                        typename Scope::iterator val = scopes[i].find( key );
     86                        if ( val != scopes[i].end() ) {
     87                                return val->second == erased ? end() : iterator( *this, val, i );
    5988                        }
    60 
    61                         /// Increments on invalid
    62                         iterator& next_valid() {
    63                                 if ( ! is_valid() ) { ++(*this); }
    64                                 return *this;
     89                        if ( i == 0 ) break;
     90                }
     91                return end();
     92        }
     93        const_iterator find( const Key &key ) const {
     94                return const_iterator( const_cast< ErasableScopedMap< Key, Value >* >(this)->find( key ) );
     95        }
     96
     97        /// Finds the given key in the outermost scope inside the given scope where it occurs
     98        iterator findNext( const_iterator &it, const Key &key ) {
     99                if ( it.i == 0 ) return end();
     100                for ( size_type i = it.i - 1; ; --i ) {
     101                        typename Scope::iterator val = scopes[i].find( key );
     102                        if ( val != scopes[i].end() ) {
     103                                return val->second == erased ? end() : iterator( *this, val, i );
    65104                        }
    66 
    67                         /// Decrements on invalid
    68                         iterator& prev_valid() {
    69                                 if ( ! is_valid() ) { --(*this); }
    70                                 return *this;
    71                         }
    72                        
    73                         iterator(ErasableScopedMap< Key, Value > const &_map, const wrapped_iterator &_it, size_type _i)
    74                                         : map(&_map), it(_it), i(_i) {}
    75                        
    76                 public:
    77                         iterator(const iterator &that) : map(that.map), it(that.it), i(that.i) {}
    78                         iterator& operator= (const iterator &that) {
    79                                 map = that.map; i = that.i; it = that.it;
    80                                 return *this;
    81                         }
    82 
    83                         reference operator* () { return *it; }
    84                         pointer operator-> () { return it.operator->(); }
    85 
    86                         iterator& operator++ () {
    87                                 if ( it == map->scopes[i].end() ) {
    88                                         if ( i == 0 ) return *this;
    89                                         --i;
    90                                         it = map->scopes[i].begin();
    91                                 } else {
    92                                         ++it;
    93                                 }
    94                                 return next_valid();
    95                         }
    96                         iterator& operator++ (int) { iterator tmp = *this; ++(*this); return tmp; }
    97 
    98                         iterator& operator-- () {
    99                                 // may fail if this is the begin iterator; allowed by STL spec
    100                                 if ( it == map->scopes[i].begin() ) {
    101                                         ++i;
    102                                         it = map->scopes[i].end();
    103                                 }
    104                                 --it;
    105                                 return prev_valid();
    106                         }
    107                         iterator& operator-- (int) { iterator tmp = *this; --(*this); return tmp; }
    108 
    109                         bool operator== (const iterator &that) {
    110                                 return map == that.map && i == that.i && it == that.it;
    111                         }
    112                         bool operator!= (const iterator &that) { return !( *this == that ); }
    113 
    114                 private:
    115                         ErasableScopedMap< Key, Value > const *map;
    116                         wrapped_iterator it;
    117                         size_type i;
    118                 };
    119 
    120                 class const_iterator : public std::iterator< std::bidirectional_iterator_tag,
    121                                                              value_type > {
    122                 friend class ErasableScopedMap;
    123                         typedef typename std::map< Key, Value >::iterator wrapped_iterator;
    124                         typedef typename std::map< Key, Value >::const_iterator wrapped_const_iterator;
    125                         typedef typename std::vector< std::map< Key, Value > > scope_list;
    126                         typedef typename scope_list::size_type size_type;
    127 
    128                         /// Checks if this iterator points to a valid item
    129                         bool is_valid() const {
    130                                 return it != map->scopes[i].end() && it->second != map->erased;
    131                         }
    132 
    133                         /// Increments on invalid
    134                         const_iterator& next_valid() {
    135                                 if ( ! is_valid() ) { ++(*this); }
    136                                 return *this;
    137                         }
    138 
    139                         /// Decrements on invalid
    140                         const_iterator& prev_valid() {
    141                                 if ( ! is_valid() ) { --(*this); }
    142                                 return *this;
    143                         }
    144                        
    145                         const_iterator(ErasableScopedMap< Key, Value > const &_map, const wrapped_const_iterator &_it, size_type _i)
    146                                         : map(&_map), it(_it), i(_i) {}
    147                 public:
    148                         const_iterator(const iterator &that) : map(that.map), it(that.it), i(that.i) {}
    149                         const_iterator(const const_iterator &that) : map(that.map), it(that.it), i(that.i) {}
    150                         const_iterator& operator= (const iterator &that) {
    151                                 map = that.map; i = that.i; it = that.it;
    152                                 return *this;
    153                         }
    154                         const_iterator& operator= (const const_iterator &that) {
    155                                 map = that.map; i = that.i; it = that.it;
    156                                 return *this;
    157                         }
    158 
    159                         const_reference operator* () { return *it; }
    160                         const_pointer operator-> () { return it.operator->(); }
    161 
    162                         const_iterator& operator++ () {
    163                                 if ( it == map->scopes[i].end() ) {
    164                                         if ( i == 0 ) return *this;
    165                                         --i;
    166                                         it = map->scopes[i].begin();
    167                                 } else {
    168                                         ++it;
    169                                 }
    170                                 return next_valid();
    171                         }
    172                         const_iterator& operator++ (int) { const_iterator tmp = *this; ++(*this); return tmp; }
    173 
    174                         const_iterator& operator-- () {
    175                                 // may fail if this is the begin iterator; allowed by STL spec
    176                                 if ( it == map->scopes[i].begin() ) {
    177                                         ++i;
    178                                         it = map->scopes[i].end();
    179                                 }
    180                                 --it;
    181                                 return prev_valid();
    182                         }
    183                         const_iterator& operator-- (int) { const_iterator tmp = *this; --(*this); return tmp; }
    184 
    185                         bool operator== (const const_iterator &that) {
    186                                 return map == that.map && i == that.i && it == that.it;
    187                         }
    188                         bool operator!= (const const_iterator &that) { return !( *this == that ); }
    189 
    190                 private:
    191                         ErasableScopedMap< Key, Value > const *map;
    192                         wrapped_const_iterator it;
    193                         size_type i;
    194                 };
    195 
    196                 /// Starts a new scope
    197                 void beginScope() {
    198                         Scope scope;
    199                         scopes.push_back(scope);
    200                 }
    201 
    202                 /// Ends a scope; invalidates any iterators pointing to elements of that scope
    203                 void endScope() {
    204                         scopes.pop_back();
    205                         assert( ! scopes.empty() );
    206                 }
    207 
    208                 /// Default constructor initializes with one scope
    209                 ErasableScopedMap( const Value &erased_ ) : erased( erased_ ) { beginScope(); }
    210 
    211                 iterator begin() { return iterator(*this, scopes.back().begin(), scopes.size()-1).next_valid(); }
    212                 const_iterator begin() const { return const_iterator(*this, scopes.back().begin(), scopes.size()-1).next_valid(); }
    213                 const_iterator cbegin() const { return const_iterator(*this, scopes.back().begin(), scopes.size()-1).next_valid(); }
    214                 iterator end() { return iterator(*this, scopes[0].end(), 0); }
    215                 const_iterator end() const { return const_iterator(*this, scopes[0].end(), 0); }
    216                 const_iterator cend() const { return const_iterator(*this, scopes[0].end(), 0); }
    217 
    218                 /// Gets the index of the current scope (counted from 1)
    219                 size_type currentScope() const { return scopes.size(); }
    220 
    221                 /// Finds the given key in the outermost scope it occurs; returns end() for none such
    222                 iterator find( const Key &key ) {
    223                         for ( size_type i = scopes.size() - 1; ; --i ) {
    224                                 typename Scope::iterator val = scopes[i].find( key );
    225                                 if ( val != scopes[i].end() ) {
    226                                         return val->second == erased ? end() : iterator( *this, val, i );
    227                                 }
    228                                 if ( i == 0 ) break;
    229                         }
    230                         return end();
    231                 }
    232                 const_iterator find( const Key &key ) const {
    233                                 return const_iterator( const_cast< ErasableScopedMap< Key, Value >* >(this)->find( key ) );
    234                 }
    235 
    236                 /// Finds the given key in the outermost scope inside the given scope where it occurs
    237                 iterator findNext( const_iterator &it, const Key &key ) {
    238                         if ( it.i == 0 ) return end();
    239                         for ( size_type i = it.i - 1; ; --i ) {
    240                                 typename Scope::iterator val = scopes[i].find( key );
    241                                 if ( val != scopes[i].end() ) {
    242                                         return val->second == erased ? end() : iterator( *this, val, i );
    243                                 }
    244                                 if ( i == 0 ) break;
    245                         }
    246                         return end();
    247                 }
    248                 const_iterator findNext( const_iterator &it, const Key &key ) const {
    249                                 return const_iterator( const_cast< ErasableScopedMap< Key, Value >* >(this)->findNext( it, key ) );
    250                 }
    251 
    252                 /// Inserts the given key-value pair into the outermost scope
    253                 std::pair< iterator, bool > insert( const value_type &value ) {
    254                         std::pair< typename Scope::iterator, bool > res = scopes.back().insert( value );
    255                         return std::make_pair( iterator(*this, res.first, scopes.size()-1), res.second );
    256                 }
    257                 std::pair< iterator, bool > insert( const Key &key, const Value &value ) { return insert( std::make_pair( key, value ) ); }
    258 
    259                 /// Marks the given element as erased from this scope inward; returns 1 for erased an element, 0 otherwise
    260                 size_type erase( const Key &key ) {
    261                         typename Scope::iterator val = scopes.back().find( key );
    262                         if ( val != scopes.back().end() ) {
    263                                 val->second = erased;
    264                                 return 1;
    265                         } else {
    266                                 scopes.back().insert( val, std::make_pair( key, erased ) );
    267                                 return 0;
    268                         }
    269                 }
    270 
    271                 Value& operator[] ( const Key &key ) {
    272                         iterator slot = find( key );
    273                         if ( slot != end() ) return slot->second;
    274                         return insert( key, Value() ).first->second;
    275                 }
    276         };
     105                        if ( i == 0 ) break;
     106                }
     107                return end();
     108        }
     109        const_iterator findNext( const_iterator &it, const Key &key ) const {
     110                return const_iterator( const_cast< ErasableScopedMap< Key, Value >* >(this)->findNext( it, key ) );
     111        }
     112
     113        /// Inserts the given key-value pair into the outermost scope
     114        std::pair< iterator, bool > insert( const value_type &value ) {
     115                std::pair< typename Scope::iterator, bool > res = scopes.back().insert( value );
     116                return std::make_pair( iterator(*this, res.first, scopes.size()-1), res.second );
     117        }
     118        std::pair< iterator, bool > insert( const Key &key, const Value &value ) { return insert( std::make_pair( key, value ) ); }
     119
     120        /// Marks the given element as erased from this scope inward; returns 1 if an element was erased, 0 otherwise
     121        size_type erase( const Key &key ) {
     122                typename Scope::iterator val = scopes.back().find( key );
     123                if ( val != scopes.back().end() ) {
     124                        val->second = erased;
     125                        return 1;
     126                } else {
     127                        scopes.back().insert( val, std::make_pair( key, erased ) );
     128                        return 0;
     129                }
     130        }
     131
     132        Value& operator[] ( const Key &key ) {
     133                iterator slot = find( key );
     134                if ( slot != end() ) return slot->second;
     135                return insert( key, Value() ).first->second;
     136        }
     137};
     138
     139template<typename Key, typename Value>
     140class ErasableScopedMap<Key, Value>::iterator :
     141                public std::iterator< std::bidirectional_iterator_tag, value_type > {
     142        friend class ErasableScopedMap;
     143        typedef typename std::map< Key, Value >::iterator wrapped_iterator;
     144        typedef typename std::vector< std::map< Key, Value > > scope_list;
     145        typedef typename scope_list::size_type size_type;
     146
     147        /// Checks if this iterator points to a valid item
     148        bool is_valid() const {
     149                return it != map->scopes[i].end() && it->second != map->erased;
     150        }
     151
     152        /// Increments on invalid
     153        iterator& next_valid() {
     154                if ( ! is_valid() ) { ++(*this); }
     155                return *this;
     156        }
     157
     158        /// Decrements on invalid
     159        iterator& prev_valid() {
     160                if ( ! is_valid() ) { --(*this); }
     161                return *this;
     162        }
     163
     164        iterator(ErasableScopedMap< Key, Value > const &_map, const wrapped_iterator &_it, size_type _i)
     165                        : map(&_map), it(_it), i(_i) {}
     166
     167public:
     168        iterator(const iterator &that) : map(that.map), it(that.it), i(that.i) {}
     169        iterator& operator= (const iterator &that) {
     170                map = that.map; i = that.i; it = that.it;
     171                return *this;
     172        }
     173
     174        reference operator* () { return *it; }
     175        pointer operator-> () { return it.operator->(); }
     176
     177        iterator& operator++ () {
     178                if ( it == map->scopes[i].end() ) {
     179                        if ( i == 0 ) return *this;
     180                        --i;
     181                        it = map->scopes[i].begin();
     182                } else {
     183                        ++it;
     184                }
     185                return next_valid();
     186        }
     187
     188        iterator& operator++ (int) { iterator tmp = *this; ++(*this); return tmp; }
     189
     190        iterator& operator-- () {
     191                // may fail if this is the begin iterator; allowed by STL spec
     192                if ( it == map->scopes[i].begin() ) {
     193                        ++i;
     194                        it = map->scopes[i].end();
     195                }
     196                --it;
     197                return prev_valid();
     198        }
     199        iterator& operator-- (int) { iterator tmp = *this; --(*this); return tmp; }
     200
     201        bool operator== (const iterator &that) {
     202                return map == that.map && i == that.i && it == that.it;
     203        }
     204        bool operator!= (const iterator &that) { return !( *this == that ); }
     205
     206private:
     207        ErasableScopedMap< Key, Value > const *map;
     208        wrapped_iterator it;
     209        size_type i;
     210};
     211
     212template<typename Key, typename Value>
     213class ErasableScopedMap<Key, Value>::const_iterator :
     214                public std::iterator< std::bidirectional_iterator_tag, value_type > {
     215        friend class ErasableScopedMap;
     216        typedef typename std::map< Key, Value >::iterator wrapped_iterator;
     217        typedef typename std::map< Key, Value >::const_iterator wrapped_const_iterator;
     218        typedef typename std::vector< std::map< Key, Value > > scope_list;
     219        typedef typename scope_list::size_type size_type;
     220
     221        /// Checks if this iterator points to a valid item
     222        bool is_valid() const {
     223                return it != map->scopes[i].end() && it->second != map->erased;
     224        }
     225
     226        /// Increments on invalid
     227        const_iterator& next_valid() {
     228                if ( ! is_valid() ) { ++(*this); }
     229                return *this;
     230        }
     231
     232        /// Decrements on invalid
     233        const_iterator& prev_valid() {
     234                if ( ! is_valid() ) { --(*this); }
     235                return *this;
     236        }
     237
     238        const_iterator(ErasableScopedMap< Key, Value > const &_map, const wrapped_const_iterator &_it, size_type _i)
     239                        : map(&_map), it(_it), i(_i) {}
     240public:
     241        const_iterator(const iterator &that) : map(that.map), it(that.it), i(that.i) {}
     242        const_iterator(const const_iterator &that) : map(that.map), it(that.it), i(that.i) {}
     243        const_iterator& operator= (const iterator &that) {
     244                map = that.map; i = that.i; it = that.it;
     245                return *this;
     246        }
     247        const_iterator& operator= (const const_iterator &that) {
     248                map = that.map; i = that.i; it = that.it;
     249                return *this;
     250        }
     251
     252        const_reference operator* () { return *it; }
     253        const_pointer operator-> () { return it.operator->(); }
     254
     255        const_iterator& operator++ () {
     256                if ( it == map->scopes[i].end() ) {
     257                        if ( i == 0 ) return *this;
     258                        --i;
     259                        it = map->scopes[i].begin();
     260                } else {
     261                        ++it;
     262                }
     263                return next_valid();
     264        }
     265        const_iterator& operator++ (int) { const_iterator tmp = *this; ++(*this); return tmp; }
     266
     267        const_iterator& operator-- () {
     268                // may fail if this is the begin iterator; allowed by STL spec
     269                if ( it == map->scopes[i].begin() ) {
     270                        ++i;
     271                        it = map->scopes[i].end();
     272                }
     273                --it;
     274                return prev_valid();
     275        }
     276        const_iterator& operator-- (int) { const_iterator tmp = *this; --(*this); return tmp; }
     277
     278        bool operator== (const const_iterator &that) {
     279                return map == that.map && i == that.i && it == that.it;
     280        }
     281        bool operator!= (const const_iterator &that) { return !( *this == that ); }
     282
     283private:
     284        ErasableScopedMap< Key, Value > const *map;
     285        wrapped_const_iterator it;
     286        size_type i;
     287};
     288
    277289} // namespace GenPoly
    278290
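    The scoped-erasure behaviour documented above is easiest to see in a small usage example. The following sketch is not part of the changeset: the include path, the use of an empty string as the "erased" sentinel, and the variable names are assumptions made purely for illustration.

        // Minimal usage sketch (not from this changeset). Assumes the header is
        // reachable as "GenPoly/ErasableScopedMap.h" and that the empty string is
        // never a legitimate mapped value, so it can serve as the erased sentinel.
        #include <cassert>
        #include <string>
        #include "GenPoly/ErasableScopedMap.h"

        int main() {
                // The constructor argument is the sentinel stored for erased keys.
                GenPoly::ErasableScopedMap< std::string, std::string > vars( "" );

                vars.insert( "x", "int" );      // lands in the initial scope
                vars.beginScope();              // open a nested scope
                vars.insert( "y", "float" );    // visible only until endScope()
                vars.erase( "x" );              // hides "x" from this scope inward

                assert( vars.find( "y" ) != vars.end() );
                assert( vars.find( "x" ) == vars.end() );  // erased, so not reported

                vars.endScope();                // the erasure itself was scoped...
                assert( vars.find( "x" ) != vars.end() );  // ...so "x" is visible again
                assert( vars.find( "y" ) == vars.end() );  // "y"'s scope is gone
        }

    The last two assertions are the point: because erasure is recorded as a sentinel entry in the current scope rather than by deleting the outer binding, popping the scope restores visibility of the outer binding.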
  • src/GenPoly/GenPoly.cc

    r7d9598d8 r2dcd80a  
    783783        const ast::FunctionType * function = getFunctionType( expr->func->result );
    784784        assertf( function, "ApplicationExpr has non-function type: %s", toString( expr->func->result ).c_str() );
    785         TypeVarMap exprTyVars = { ast::TypeDecl::Data() };
     785        TypeVarMap exprTyVars = { ast::TypeData() };
    786786        makeTypeVarMap( function, exprTyVars );
    787787        return needsBoxing( param, arg, exprTyVars, subst );
     
    793793
    794794void addToTypeVarMap( const ast::TypeInstType * type, TypeVarMap & typeVars ) {
    795         typeVars.insert( *type, ast::TypeDecl::Data( type->base ) );
     795        typeVars.insert( *type, ast::TypeData( type->base ) );
    796796}
    797797
  • src/GenPoly/GenPoly.h

    r7d9598d8 r2dcd80a  
    2020
    2121#include "ErasableScopedMap.h"    // for ErasableScopedMap
    22 #include "AST/Decl.hpp"           // for TypeDecl::Data
     22#include "AST/Decl.hpp"           // for AggregateDecl
    2323#include "AST/Fwd.hpp"            // for ApplicationExpr, BaseInstType, Func...
    24 #include "AST/Type.hpp"           // for TypeInstType::TypeEnvKey
    2524#include "SymTab/Mangler.h"       // for Mangler
    2625#include "SynTree/Declaration.h"  // for TypeDecl::Data, AggregateDecl, Type...
    2726#include "SynTree/SynTree.h"      // for Visitor Nodes
    2827
     28namespace ast {
     29        struct TypeEnvKey;
     30}
     31
    2932namespace GenPoly {
    3033
    3134        typedef ErasableScopedMap< std::string, TypeDecl::Data > TyVarMap;
    32         using TypeVarMap = ErasableScopedMap< ast::TypeInstType::TypeEnvKey, ast::TypeDecl::Data >;
     35        using TypeVarMap = ErasableScopedMap< ast::TypeEnvKey, ast::TypeData >;
    3336
    3437        /// Replaces a TypeInstType by its referent in the environment, if applicable
  • src/Parser/DeclarationNode.cc

    r7d9598d8 r2dcd80a  
    254254} // DeclarationNode::newAggregate
    255255
    256 DeclarationNode * DeclarationNode::newEnum( const string * name, DeclarationNode * constants, bool body, bool typed, DeclarationNode * base) {
     256DeclarationNode * DeclarationNode::newEnum( const string * name, DeclarationNode * constants, bool body, bool typed, DeclarationNode * base, EnumHiding hiding ) {
    257257        DeclarationNode * newnode = new DeclarationNode;
    258258        newnode->type = new TypeData( TypeData::Enum );
     
    262262        newnode->type->enumeration.anon = name == nullptr;
    263263        newnode->type->enumeration.typed = typed;
     264        newnode->type->enumeration.hiding = hiding;
    264265        if ( base && base->type)  {
    265266                newnode->type->base = base->type;
  • src/Parser/ParseNode.h

    r7d9598d8 r2dcd80a  
    239239        static DeclarationNode * newFunction( const std::string * name, DeclarationNode * ret, DeclarationNode * param, StatementNode * body );
    240240        static DeclarationNode * newAggregate( AggregateDecl::Aggregate kind, const std::string * name, ExpressionNode * actuals, DeclarationNode * fields, bool body );
    241         static DeclarationNode * newEnum( const std::string * name, DeclarationNode * constants, bool body, bool typed, DeclarationNode * base = nullptr );
     241        static DeclarationNode * newEnum( const std::string * name, DeclarationNode * constants, bool body, bool typed, DeclarationNode * base = nullptr, EnumHiding hiding = EnumHiding::Visible );
    242242        static DeclarationNode * newEnumConstant( const std::string * name, ExpressionNode * constant );
    243243        static DeclarationNode * newEnumValueGeneric( const std::string * name, InitializerNode * init );
  • src/Parser/TypeData.cc

    r7d9598d8 r2dcd80a  
    923923        buildList( td->enumeration.constants, ret->get_members() );
    924924        list< Declaration * >::iterator members = ret->get_members().begin();
     925        ret->hide = td->enumeration.hiding == EnumHiding::Hide ? EnumDecl::EnumHiding::Hide : EnumDecl::EnumHiding::Visible;
    925926        for ( const DeclarationNode * cur = td->enumeration.constants; cur != nullptr; cur = dynamic_cast< DeclarationNode * >( cur->get_next() ), ++members ) {
    926927                if ( cur->enumInLine ) {
  • src/Parser/TypeData.h

    r7d9598d8 r2dcd80a  
    6060                bool anon;
    6161                bool typed;
     62                EnumHiding hiding;
    6263        };
    6364
  • src/Parser/parser.yy

    r7d9598d8 r2dcd80a  
    1010// Created On       : Sat Sep  1 20:22:55 2001
    1111// Last Modified By : Peter A. Buhr
    12 // Last Modified On : Wed Nov  2 21:31:21 2022
    13 // Update Count     : 5810
     12// Last Modified On : Mon Nov 21 22:34:30 2022
     13// Update Count     : 5848
    1414//
    1515
     
    383383%type<ifctl> conditional_declaration
    384384%type<fctl> for_control_expression              for_control_expression_list
    385 %type<compop> updown updowneq downupdowneq
     385%type<compop> upupeq updown updowneq downupdowneq
    386386%type<en> subrange
    387387%type<decl> asm_name_opt
     
    489489%type<decl> type_parameter type_parameter_list type_initializer_opt
    490490
    491 %type<en> type_parameters_opt type_list
     491%type<en> type_parameters_opt type_list array_type_list
    492492
    493493%type<decl> type_qualifier type_qualifier_name forall type_qualifier_list_opt type_qualifier_list
     
    551551
    552552%%
    553 //************************* Namespace Management ********************************
     553// ************************ Namespace Management ********************************
    554554
    555555// The C grammar is not context free because it relies on the distinct terminal symbols "identifier" and "TYPEDEFname",
     
    588588        ;
    589589
    590 //************************* CONSTANTS ********************************
     590// ************************ CONSTANTS ********************************
    591591
    592592constant:
     
    634634        ;
    635635
    636 //************************* EXPRESSIONS ********************************
     636// ************************ EXPRESSIONS ********************************
    637637
    638638primary_expression:
     
    11011101        ;
    11021102
    1103 //*************************** STATEMENTS *******************************
     1103// ************************** STATEMENTS *******************************
    11041104
    11051105statement:
     
    17581758        ;
    17591759
    1760 //******************************* DECLARATIONS *********************************
     1760// ****************************** DECLARATIONS *********************************
    17611761
    17621762declaration_list_opt:                                                                   // used at beginning of switch statement
     
    25582558                { typedefTable.makeTypedef( *$3 ); }
    25592559          hide_opt '{' enumerator_list comma_opt '}'
    2560           { $$ = DeclarationNode::newEnum( $3, $7, true, false )->addQualifiers( $2 ); }
     2560                { $$ = DeclarationNode::newEnum( $3, $7, true, false, nullptr, $5 )->addQualifiers( $2 ); }
    25612561        | ENUM attribute_list_opt typedef_name                          // unqualified type name
    25622562          hide_opt '{' enumerator_list comma_opt '}'
    2563                 { $$ = DeclarationNode::newEnum( $3->name, $6, true, false )->addQualifiers( $2 ); }
     2563                { $$ = DeclarationNode::newEnum( $3->name, $6, true, false, nullptr, $4 )->addQualifiers( $2 ); }
    25642564        | ENUM '(' cfa_abstract_parameter_declaration ')' attribute_list_opt '{' enumerator_list comma_opt '}'
    25652565                {
     
    25802580          hide_opt '{' enumerator_list comma_opt '}'
    25812581                {
    2582                         $$ = DeclarationNode::newEnum( $6, $11, true, true, $3 )->addQualifiers( $5 )->addQualifiers( $7 );
     2582                        $$ = DeclarationNode::newEnum( $6, $11, true, true, $3, $9 )->addQualifiers( $5 )->addQualifiers( $7 );
    25832583                }
    25842584        | ENUM '(' ')' attribute_list_opt identifier attribute_list_opt
    25852585          hide_opt '{' enumerator_list comma_opt '}'
    25862586                {
    2587                         $$ = DeclarationNode::newEnum( $5, $9, true, true, nullptr )->addQualifiers( $4 )->addQualifiers( $6 );
     2587                        $$ = DeclarationNode::newEnum( $5, $9, true, true, nullptr, $7 )->addQualifiers( $4 )->addQualifiers( $6 );
    25882588                }
    25892589        | ENUM '(' cfa_abstract_parameter_declaration ')' attribute_list_opt typedef_name attribute_list_opt
    25902590          hide_opt '{' enumerator_list comma_opt '}'
    25912591                {
    2592                         $$ = DeclarationNode::newEnum( $6->name, $10, true, true, $3 )->addQualifiers( $5 )->addQualifiers( $7 );
     2592                        $$ = DeclarationNode::newEnum( $6->name, $10, true, true, $3, $8 )->addQualifiers( $5 )->addQualifiers( $7 );
    25932593                }
    25942594        | ENUM '(' ')' attribute_list_opt typedef_name attribute_list_opt
    25952595          hide_opt '{' enumerator_list comma_opt '}'
    25962596                {
    2597                         $$ = DeclarationNode::newEnum( $5->name, $9, true, true, nullptr )->addQualifiers( $4 )->addQualifiers( $6 );
     2597                        $$ = DeclarationNode::newEnum( $5->name, $9, true, true, nullptr, $7 )->addQualifiers( $4 )->addQualifiers( $6 );
    25982598                }
    25992599        | enum_type_nobody
     
    29912991        ;
    29922992
    2993 //***************************** EXTERNAL DEFINITIONS *****************************
     2993// **************************** EXTERNAL DEFINITIONS *****************************
    29942994
    29952995translation_unit:
     
    36533653        | '[' ']' multi_array_dimension
    36543654                { $$ = DeclarationNode::newArray( 0, 0, false )->addArray( $3 ); }
    3655         | '[' push assignment_expression pop ',' comma_expression ']'
     3655                // Cannot use constant_expression because of tuples => semantic check
     3656        | '[' push assignment_expression pop ',' comma_expression ']' // CFA
    36563657                { $$ = DeclarationNode::newArray( $3, 0, false )->addArray( DeclarationNode::newArray( $6, 0, false ) ); }
    36573658                // { SemanticError( yylloc, "New array dimension is currently unimplemented." ); $$ = nullptr; }
     3659        | '[' push array_type_list pop ']'                                      // CFA
     3660                { SemanticError( yylloc, "Type array dimension is currently unimplemented." ); $$ = nullptr; }
    36583661        | multi_array_dimension
    36593662        ;
     3663
     3664array_type_list:
     3665        basic_type_name
     3666                { $$ = new ExpressionNode( new TypeExpr( maybeMoveBuildType( $1 ) ) ); }
     3667        | type_name
     3668                { $$ = new ExpressionNode( new TypeExpr( maybeMoveBuildType( $1 ) ) ); }
     3669        | assignment_expression upupeq assignment_expression
     3670        | array_type_list ',' basic_type_name
     3671                { $$ = (ExpressionNode *)($1->set_last( new ExpressionNode( new TypeExpr( maybeMoveBuildType( $3 ) ) ) )); }
     3672        | array_type_list ',' type_name
     3673                { $$ = (ExpressionNode *)($1->set_last( new ExpressionNode( new TypeExpr( maybeMoveBuildType( $3 ) ) ) )); }
     3674        | array_type_list ',' assignment_expression upupeq assignment_expression
     3675        ;
     3676
     3677upupeq:
     3678        '~'
     3679                { $$ = OperKinds::LThan; }
     3680        | ErangeUpEq
     3681                { $$ = OperKinds::LEThan; }
     3682        ;
    36603683
    36613684multi_array_dimension:
     
    39904013//    declaration lists (not prototype-format parameter type and identifier declarators) is an obsolescent feature.
    39914014
    3992 //************************* MISCELLANEOUS ********************************
     4015// ************************ MISCELLANEOUS ********************************
    39934016
    39944017comma_opt:                                                                                              // redundant comma
  • src/ResolvExpr/CandidateFinder.cpp

    r7d9598d8 r2dcd80a  
    221221        ) {
    222222                for ( auto & tyvar : type->forall ) {
    223                         unifiableVars[ *tyvar ] = ast::TypeDecl::Data{ tyvar->base };
     223                        unifiableVars[ *tyvar ] = ast::TypeData{ tyvar->base };
    224224                }
    225225                for ( auto & assn : type->assertions ) {
  • src/ResolvExpr/FindOpenVars.cc

    r7d9598d8 r2dcd80a  
    113113                                if ( nextIsOpen ) {
    114114                                        for ( auto & decl : type->forall ) {
    115                                                 open[ *decl ] = ast::TypeDecl::Data{ decl->base };
     115                                                open[ *decl ] = ast::TypeData{ decl->base };
    116116                                        }
    117117                                        for ( auto & assert : type->assertions ) {
     
    120120                                } else {
    121121                                        for ( auto & decl : type->forall ) {
    122                                                 closed[ *decl ] = ast::TypeDecl::Data{ decl->base };   
     122                                                closed[ *decl ] = ast::TypeData{ decl->base };
    123123                                        }
    124124                                        for ( auto & assert : type->assertions ) {
  • src/ResolvExpr/RenameVars.cc

    r7d9598d8 r2dcd80a  
    4242                int next_usage_id = 1;
    4343                ScopedMap< std::string, std::string > nameMap;
    44                 ScopedMap< std::string, ast::TypeInstType::TypeEnvKey > idMap;
     44                ScopedMap< std::string, ast::TypeEnvKey > idMap;
    4545        public:
    4646                void reset() {
     
    121121                                        assert(false);
    122122                                }
    123                                 idMap[ td->name ] = ast::TypeInstType::TypeEnvKey(*mut);
    124                                
     123                                idMap[ td->name ] = ast::TypeEnvKey( *mut );
     124
    125125                                td = mut;
    126126                        }
  • src/ResolvExpr/Unify.cc

    r7d9598d8 r2dcd80a  
    11661166                        if ( entry1->second.kind != entry2->second.kind ) return false;
    11671167                        return env.bindVarToVar(
    1168                                 var1, var2, ast::TypeDecl::Data{ entry1->second, entry2->second }, need, have,
     1168                                var1, var2, ast::TypeData{ entry1->second, entry2->second }, need, have,
    11691169                                open, widen, symtab );
    11701170                } else if ( isopen1 ) {
  • src/SynTree/Declaration.h

    r7d9598d8 r2dcd80a  
    340340        bool isTyped;
    341341        Type * base;
     342        enum EnumHiding { Visible, Hide } hide;
    342343
    343344        EnumDecl( const std::string & name,
     
    345346          bool isTyped = false, LinkageSpec::Spec linkage = LinkageSpec::Cforall,
    346347          Type * baseType = nullptr )
    347           : Parent( name, attributes, linkage ),isTyped(isTyped), base( baseType ) {}
     348          : Parent( name, attributes, linkage ), isTyped(isTyped), base( baseType ) {}
    348349        EnumDecl( const EnumDecl & other )
    349350          : Parent( other ), isTyped( other.isTyped), base( other.base ) {}
  • tests/Makefile.am

    r7d9598d8 r2dcd80a  
    6868.PHONY: list .validate .test_makeflags
    6969.INTERMEDIATE: .validate .validate.cfa .test_makeflags
    70 EXTRA_PROGRAMS = avl_test linkonce .dummy_hack # build but do not install
     70EXTRA_PROGRAMS = avl_test linkonce linking/mangling/anon .dummy_hack # build but do not install
    7171EXTRA_DIST = test.py \
    7272        pybin/__init__.py \
     
    101101avl_test_SOURCES = avltree/avl_test.cfa avltree/avl0.cfa avltree/avl1.cfa avltree/avl2.cfa avltree/avl3.cfa avltree/avl4.cfa avltree/avl-private.cfa
    102102linkonce_SOURCES = link-once/main.cfa link-once/partner.cfa
     103linking_mangling_anon_SOURCES = linking/mangling/header.hfa linking/mangling/lib.cfa linking/mangling/main.cfa
    103104# automake doesn't know we still need C/CPP rules so pretend like we have a C program
    104105nodist__dummy_hack_SOURCES = .dummy_hack.c .dummy_hackxx.cpp
  • tests/PRNG.cfa

    r7d9598d8 r2dcd80a  
    88// Created On       : Wed Dec 29 09:38:12 2021
    99// Last Modified By : Peter A. Buhr
    10 // Last Modified On : Sat Apr  9 15:21:14 2022
    11 // Update Count     : 344
     10// Last Modified On : Tue Nov 22 22:51:12 2022
     11// Update Count     : 381
    1212//
    1313
     
    2222#include <mutex_stmt.hfa>
    2323
     24#ifdef __x86_64__                                                                               // 64-bit architecture
     25#define PRNG PRNG64
     26#else                                                                                                   // 32-bit architecture
     27#define PRNG PRNG32
     28#endif // __x86_64__
     29
    2430#ifdef TIME                                                                                             // use -O2 -nodebug
    2531#define STARTTIME start = timeHiRes()
     
    5460
    5561
    56 uint32_t seed = 1009;
     62unsigned int seed = 1009;
    5763
    5864thread T1 {};
     
    158164#if 1
    159165        PRNG prng;
     166
    160167        if ( seed != 0 ) set_seed( prng, seed );
    161168
  • tests/concurrent/barrier/generation.cfa

    r7d9598d8 r2dcd80a  
    3737                for(c; 'A' ~= 'Z') {
    3838                        // Yield for chaos
    39                         yield(prng(this, 10));
     39                        yield( prng(this, 10) );
    4040
    4141                        // Print the generation, no newline because
     
    4343
    4444                        // Yield again for more chaos
    45                         yield(prng(this, 10));
     45                        yield( prng(this, 10) );
    4646
    4747                        // Block on the barrier
  • tests/concurrent/barrier/order.cfa

    r7d9598d8 r2dcd80a  
    3737        for(l; NUM_LAPS) {
    3838                // Yield for chaos
    39                 yield(prng(this, 10));
     39                yield( prng(this, 10) );
    4040
    4141                // Block and what order we arrived
  • tests/concurrent/once.cfa

    r7d9598d8 r2dcd80a  
    3030
    3131                // sometime yields
    32                 yield(prng(this, 3));
     32                yield( prng(this, 3) );
    3333        }
    3434}
  • tests/concurrent/readyQ/leader_spin.cfa

    r7d9598d8 r2dcd80a  
    2626}
    2727
    28 PRNG lead_rng;
     28PRNG64 lead_rng;
    2929volatile unsigned leader;
    3030volatile size_t lead_idx;
    3131
    32 const unsigned nthreads = 17;
    33 const unsigned stop_count = 327;
     32const uint64_t nthreads = 17;
     33const uint64_t stop_count = 327;
    3434
    3535thread$ * the_main;
     
    5050        for(i; nthreads) {
    5151                while( threads[i]->idx != lead_idx ) {
    52                         Pause();
     52                        sched_yield();
    5353                }
    5454        }
  • tests/io/away_fair.cfa

    r7d9598d8 r2dcd80a  
    4141
    4242                if(last == curr) {
    43                         Pause();
     43                        sched_yield();
    4444                        continue;
    4545                }
  • tests/io/comp_fair.cfa

    r7d9598d8 r2dcd80a  
    5252
    5353                if(last == curr) {
    54                         Pause();
     54                        sched_yield();
    5555                        continue;
    5656                }
  • tools/gdb/utils-gdb.py

    r7d9598d8 r2dcd80a  
    2323gdb.execute('handle SIGUSR1 nostop noprint pass')
    2424
    25 CfaTypes = collections.namedtuple('CfaTypes', 'cluster_ptr processor_ptr thread_ptr int_ptr thread_state yield_state')
     25CfaTypes = collections.namedtuple('CfaTypes', 'cluster_ptr processor_ptr thread_ptr int_ptr uintptr thread_state yield_state')
    2626
    2727class ThreadInfo:
     
    5555                thread_ptr = gdb.lookup_type('struct thread$').pointer(),
    5656                int_ptr = gdb.lookup_type('int').pointer(),
     57                uintptr = gdb.lookup_type('uintptr_t'),
    5758                thread_state = gdb.lookup_type('enum __Coroutine_State'),
    5859                yield_state = gdb.lookup_type('enum __Preemption_Reason'))
     
    8990        return argv
    9091
     92def single_field(obj):
     93        """
     94        If the struct has exactly one field, return it; otherwise return None.
     95        """
     96
     97        _type = obj.type
     98        if len(_type.fields()) != 1:
     99                return None
     100
     101        return obj[_type.fields()[0].name]
     102
     103
     104def start_from_dlist(dlist):
     105        fs = dlist.type.fields()
     106        if len(fs) != 1:
     107                print("Error, can't understand dlist type for", dlist, dlist.name, dlist.type)
     108                return None
     109
     110        return dlist[fs[0]]
     111
     112def fix_dlink(ptr):
     113        """
     114        Remove the higher order bit from the pointer
     115        """
     116        ptype = ptr.type
     117        size = ptype.sizeof
     118        if size == 8:
     119                bit = 1 << ((size*8)-1)
     120                mask = bit - 1
     121        elif size == 4:
     122                bit = 0
     123                mask = 1
     124        else:
     125                print("Unexpected pointer size while fixing dlink", size)
     126
     127        cfa_t = get_cfa_types()
     128        uptr = ptr.cast(cfa_t.uintptr)
     129        return ptr if 0 == uptr & mask else gdb.Value(b'\x00'*size, ptype)
     130
    91131class ClusterIter:
    92132        def __init__(self, root):
     
    139179        def check(self):
    140180                # check if this is the last value
    141                 addr = int(self.curr)
    142                 mask = 1 << ((8 * int(gdb.parse_and_eval('sizeof(void*)'))) - 1)
    143                 if 0 != (mask & addr):
     181                if not fix_dlink(self.curr):
    144182                        raise StopIteration
    145183
     
    168206                return self.curr
    169207
    170 def start_from_dlist(dlist):
    171         fs = dlist.type.fields()
    172         if len(fs) != 1:
    173                 print("Error, can't understand dlist type for", dlist, dlist.name, dlist.type)
    174                 return None
    175 
    176         return dlist[fs[0]]
    177 
    178208def proc_list(cluster):
    179209        """
     
    183213        proclist = cluster['_X5procsS19__cluster_proc_list_1']
    184214
    185         idle = start_from_dlist(proclist['_X5idlesS5dlist_S9processorS5dlink_S9processor___1'])
    186         active = start_from_dlist(proclist['_X7activesS5dlist_S9processorS5dlink_S9processor___1'])
     215        idle = start_from_dlist(proclist['_X5idlesS5dlist_S9processorS5dlink_S9processor___1'])['_X4nextPY13__tE_generic__1']
     216        active = start_from_dlist(proclist['_X7activesS5dlist_S9processorS5dlink_S9processor___1'])['_X4nextPY13__tE_generic__1']
    187217        return ProcIter(active.cast(cfa_t.processor_ptr)), ProcIter(idle.cast(cfa_t.processor_ptr))
    188218
     
    261291
    262292                cfa_t = get_cfa_types()
    263                 root = cluster['_X7threadsS8__dllist_S7thread$__1']['_X4headPY15__TYPE_generic__1'].cast(cfa_t.thread_ptr)
     293                head = single_field(cluster['_X7threadsS5dlist_S7thread$S18__thread_user_link__1'])
     294                root = head['_X4nextPY13__tE_generic__1'].cast(cfa_t.thread_ptr)
    264295
    265296                if root == 0x0 or root.address == 0x0:
     
    282313                        threads.append(t)
    283314
    284                         curr = curr['node']['next']
     315                        curr = fix_dlink(single_field(curr['cltr_link'])['_X4nextPY13__tE_generic__1']).cast(cfa_t.thread_ptr)
    285316                        if curr == root or curr == 0x0:
    286317                                break
     
    409440
    410441        def print_formatted(self, marked, tid, name, state, address):
     442                # print(marked, tid, name, state, address)
    411443                print('{:>1}  {:>4}  {:>20}  {:>10}  {:>20}'.format('*' if marked else ' ', tid, name, state, address))
    412444
    413445        def print_thread(self, thread, tid, marked):
     446                # print("print", thread, tid, marked)
    414447                cfa_t = get_cfa_types()
    415448                ys = str(thread['preempted'].cast(cfa_t.yield_state))
     
    439472
    440473                self.print_formatted(False, '', 'Name', 'State', 'Address')
    441 
    442474                for t in threads:
    443475                        if not t.is_system() or print_system: