Index: doc/theses/thierry_delisle_PhD/thesis/text/eval_micro.tex
===================================================================
--- doc/theses/thierry_delisle_PhD/thesis/text/eval_micro.tex	(revision e378c7303388d370d17f556517643de79ea041ea)
+++ doc/theses/thierry_delisle_PhD/thesis/text/eval_micro.tex	(revision 0e34a143acbc54c5601c223ddac0cac229d6e477)
@@ -406,19 +406,20 @@
 Beyond that performance starts to suffer from increased caching costs.
 
-	Indeed on Figures~\ref{fig:churn:jax:ops} and \ref{fig:churn:jax:ns} show that with 100 \ats per \proc, \CFA, libfibre, and tokio achieve effectively equivalent performance for most \proc count.
-	Interestingly, Go starts with better scaling at very low \proc counts but then performance quickly plateaus, resulting in worse performance at higher \proc counts.
-	This performance difference disappears in Figures~\ref{fig:churn:jax:low:ops} and \ref{fig:churn:jax:low:ns}, where the performance of all runtimes is equivalent.
-
-	Figure~\ref{fig:churn:nasus} again shows a similar story.
-	\CFA, libfibre, and tokio achieve effectively equivalent performance for most \proc count.
-	Go still shows different scaling than the other 3 runtimes.
-	The distinction is that on AMD the difference between Go and the other runtime is more significant.
-	Indeed, even with only 1 \at per \proc, Go achieves notably different scaling than the other runtimes.
-
-	One possible explanation for this difference is that since Go has very few available concurrent primitives, a channel was used instead of a semaphore.
-	On paper a semaphore can be replaced by a channel and with zero-sized objects passed along equivalent performance could be expected.
-	However, in practice there can be implementation difference between the two.
-	This is especially true if the semaphore count can get somewhat high.
-	Note that this replacement is also made in the cycle benchmark, however in that context it did not seem to have a notable impact.
+Indeed, Figures~\ref{fig:churn:jax:ops} and \ref{fig:churn:jax:ns} show that with 1 and 100 \ats per \proc, \CFA, libfibre, Go and tokio achieve effectively equivalent performance for most \proc counts.
+
+However, Figure~\ref{fig:churn:nasus} tells a somewhat different story on AMD.
+While \CFA, libfibre, and tokio achieve effectively equivalent performance for most \proc counts, Go starts with better scaling at very low \proc counts but its performance quickly plateaus, resulting in worse performance at higher \proc counts.
+This performance difference is visible at both high and low \at counts.
+
+One possible explanation for this difference is that Go offers very few concurrency primitives, so a channel is used instead of a semaphore.
+On paper, a semaphore can be replaced by a channel along which zero-sized objects are passed, and equivalent performance could be expected.
+However, in practice there can be implementation differences between the two.
+This is especially true if the semaphore count can become somewhat high.
+Note that this replacement is also made in the cycle benchmark; however, in that context it does not seem to have a notable impact.
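+
+For illustration, the following sketch shows one way a counting semaphore can be emulated with a channel of zero-sized objects in Go; it is not the benchmark code itself, and the names and buffer size are arbitrary example values.
+\begin{lstlisting}[language=Go]
+package main
+
+// Hypothetical sketch: a counting semaphore emulated with a buffered
+// channel of zero-sized values, where the number of queued values is
+// the current count.
+type semaphore chan struct{}
+
+func (s semaphore) V() { s <- struct{}{} } // signal: blocks once the count reaches the buffer size
+func (s semaphore) P() { <-s }             // wait: blocks while the count is zero
+
+func main() {
+	sem := make(semaphore, 128) // buffer must cover the highest expected count
+	sem.V()                     // unblocks one waiter, if any
+	sem.P()
+}
+\end{lstlisting}
+In such an emulation, each outstanding signal occupies a slot in the channel buffer, which is one way a high semaphore count can lead to different behaviour between the two primitives.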
+
+A second possible explanation is that Go may sometimes allocate variables on the heap, based on the result of escape analysis of the code.
+It is possible that variables that should be placed on the stack are instead placed on the heap.
+This could cause extra pointer chasing in the benchmark, heightening locality effects.
+Depending on how the heap is structured, this could also lead to false sharing.
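+
+As a hedged illustration of this effect (not taken from the benchmark), the following Go sketch shows a variable whose address escapes its function and is therefore heap-allocated, versus one that can stay on the stack; the compiler's decisions can be inspected with @go build -gcflags=-m@.
+\begin{lstlisting}[language=Go]
+package main
+
+type node struct{ value int }
+
+// The address of n outlives the call, so escape analysis moves n to
+// the heap, adding a pointer indirection.
+func escapes() *node {
+	n := node{value: 42}
+	return &n
+}
+
+// No address escapes here, so n can remain on the stack.
+func staysLocal() int {
+	n := node{value: 42}
+	return n.value
+}
+
+func main() {
+	_ = escapes()
+	_ = staysLocal()
+}
+\end{lstlisting}
+If many such heap-resident objects are chased during the benchmark, the extra cache misses compared to stack-resident data could account for part of the difference.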
 
 The objective of this benchmark is to demonstrate that unparking \ats from remote \procs does not cause too much contention on the local queues.
@@ -541,9 +542,9 @@
 In both cases, the graphs in the left column show the results for the @share@ variation and the graphs in the right column show the results for the @noshare@ variation.
 
-that the results somewhat follow the expectation.
-On the left of the figure showing the results for the shared variation, where \CFA and tokio outperform libfibre as expected.
+On Intel, Figure~\ref{fig:locality:jax} shows Go trailing behind the three other runtimes.
+The left of the figure shows the results for the shared variation, where \CFA and tokio slightly outperform libfibre, as expected.
 Correspondingly, on the right, we see the expected performance inversion, where libfibre now outperforms \CFA and tokio.
 Otherwise the results are similar to the churn benchmark, with lower throughput due to the array processing.
-It is unclear why Go's performance is notably worst than the other runtimes.
+Presumably, the reasons why Go trails behind are the same as in Figure~\ref{fig:churn:nasus}.
 
 Figure~\ref{fig:locality:nasus} shows the same experiment on AMD.
