Changeset 0e34a14


Timestamp:
Aug 14, 2022, 4:15:10 PM (4 months ago)
Author:
Thierry Delisle <tdelisle@…>
Branches:
master, pthread-emulation
Children:
41a6a78
Parents:
2ae6a99
Message:

Not fully finished but readable

File:
1 edited

  • doc/theses/thierry_delisle_PhD/thesis/text/eval_micro.tex

    r2ae6a99 r0e34a14  
    406406Beyond that, performance starts to suffer from increased caching costs.
    407407
    408         Indeed on Figures~\ref{fig:churn:jax:ops} and \ref{fig:churn:jax:ns} show that with 100 \ats per \proc, \CFA, libfibre, and tokio achieve effectively equivalent performance for most \proc count.
    409         Interestingly, Go starts with better scaling at very low \proc counts but then performance quickly plateaus, resulting in worse performance at higher \proc counts.
    410         This performance difference disappears in Figures~\ref{fig:churn:jax:low:ops} and \ref{fig:churn:jax:low:ns}, where the performance of all runtimes is equivalent.
    411 
    412         Figure~\ref{fig:churn:nasus} again shows a similar story.
    413         \CFA, libfibre, and tokio achieve effectively equivalent performance for most \proc count.
    414         Go still shows different scaling than the other 3 runtimes.
    415         The distinction is that on AMD the difference between Go and the other runtime is more significant.
    416         Indeed, even with only 1 \at per \proc, Go achieves notably different scaling than the other runtimes.
    417 
    418         One possible explanation for this difference is that since Go has very few available concurrent primitives, a channel was used instead of a semaphore.
    419         On paper a semaphore can be replaced by a channel and with zero-sized objects passed along equivalent performance could be expected.
    420         However, in practice there can be implementation difference between the two.
    421         This is especially true if the semaphore count can get somewhat high.
    422         Note that this replacement is also made in the cycle benchmark, however in that context it did not seem to have a notable impact.
     408Indeed, Figures~\ref{fig:churn:jax:ops} and \ref{fig:churn:jax:ns} show that with 1 and 100 \ats per \proc, \CFA, libfibre, Go, and tokio achieve effectively equivalent performance for most \proc counts.
     409
     410However, Figure~\ref{fig:churn:nasus} again shows a somewhat different story on AMD.
     411While \CFA, libfibre, and tokio achieve effectively equivalent performance for most \proc counts, Go starts with better scaling at very low \proc counts, but then its performance quickly plateaus, resulting in worse performance at higher \proc counts.
     412This performance difference is visible at both high and low \at counts.
     413
     414One possible explanation for this difference is that, since Go has very few concurrency primitives available, a channel was used instead of a semaphore.
     415On paper, a semaphore can be replaced by a channel over zero-sized objects, and equivalent performance could be expected.
     416However, in practice there can be implementation differences between the two.
     417This is especially true if the semaphore count can get somewhat high.
     418Note that this replacement is also made in the cycle benchmark; however, in that context it did not seem to have a notable impact.
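The replacement described above can be sketched as follows. This is a minimal illustration, not the benchmark's actual code: the type and method names are hypothetical, chosen only to show how a buffered channel of zero-sized values can stand in for a counting semaphore.

```go
package main

import "fmt"

// chanSemaphore mimics a counting semaphore with a buffered channel of
// zero-sized values; the channel's buffer holds the semaphore's tokens.
type chanSemaphore chan struct{}

func newChanSemaphore(capacity int) chanSemaphore {
	return make(chanSemaphore, capacity)
}

// V (release): deposit a token; blocks only if the buffer is full.
func (s chanSemaphore) V() { s <- struct{}{} }

// P (acquire): withdraw a token; blocks while no token is available.
func (s chanSemaphore) P() { <-s }

func main() {
	sem := newChanSemaphore(2)
	sem.V()
	sem.V()
	sem.P()
	sem.P()
	fmt.Println("balanced P/V completed")
}
```

Because the elements are zero-sized, no payload is copied; any cost difference comes from the channel's internal locking and buffer management rather than data movement, which is where implementation differences against a plain semaphore could show up.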
     419
     420A second possible explanation is that Go may sometimes allocate variables on the heap, based on the result of escape analysis of the code.
     421It is possible that variables that should be placed on the stack are instead placed on the heap.
     422This could cause extra pointer chasing in the benchmark, heightening locality effects.
     423Depending on how the heap is structured, this could also lead to false sharing.
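The escape-analysis behaviour mentioned above can be shown with a small sketch. The two functions are hypothetical examples, not taken from the benchmark: one keeps its variable local, the other returns a pointer to it, which forces the Go compiler to move that variable to the heap (its decisions can be inspected with `go build -gcflags=-m`).

```go
package main

import "fmt"

// stackSum's accumulator never outlives the call, so escape analysis
// can keep it on the stack.
func stackSum(n int) int {
	total := 0
	for i := 0; i < n; i++ {
		total += i
	}
	return total
}

// heapInt returns the address of a local variable; its lifetime exceeds
// the call, so the compiler must allocate it on the heap.
func heapInt(v int) *int {
	x := v
	return &x
}

func main() {
	fmt.Println(stackSum(5), *heapInt(7)) // prints: 10 7
}
```

In a benchmark, every value that escapes this way adds a pointer dereference to reach it, which is the extra pointer chasing and the potential false sharing the text refers to.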
    423424
    424425The objective of this benchmark is to demonstrate that unparking \ats from remote \procs does not cause too much contention on the local queues.
     
    541542In both cases, the graphs on the left column show the results for the @share@ variation and the graphs on the right column show the results for the @noshare@.
    542543
    543 that the results somewhat follow the expectation.
    544 On the left of the figure showing the results for the shared variation, where \CFA and tokio outperform libfibre as expected.
     544On Intel, Figure~\ref{fig:locality:jax} shows Go trailing behind the three other runtimes.
     545On the left of the figure, showing the results for the shared variation, \CFA and tokio slightly outperform libfibre, as expected.
    545546And correspondingly on the right, we see the expected performance inversion where libfibre now outperforms \CFA and tokio.
    546547Otherwise the results are similar to the churn benchmark, with lower throughput due to the array processing.
    547 It is unclear why Go's performance is notably worst than the other runtimes.
     548Presumably, the reasons why Go trails behind are the same as in Figure~\ref{fig:churn:nasus}.
    548549
    549550Figure~\ref{fig:locality:nasus} shows the same experiment on AMD.