I would like you address the comments of Reviewer 2, particularly with regard to the description of the adaptation Java harness to deal with warmup. I would expect to see a convincing argument that the computation has reached a steady state.

We understand Referee 2's and your concerns about the JIT experiments, which is why we verified our experiments with two experts in JIT development for both Java and Node.js before submitting the paper. We also read the supplied papers, but most of the information is not applicable to our work for the following reasons.

1. SPEC benchmarks are medium to large. In contrast, our benchmarks are 5-15 lines in length for each programming language (see code for the Cforall tests in the paper). Hence, there are no significant computations, complex control flow, or use of memory. Each benchmark tests one specific language feature (context switch, mutex call, etc.) in isolation over and over again. These language features have a fixed cost (e.g., acquiring and releasing a lock). Therefore, unless the feature can be removed, there is nothing to optimize at runtime. But these features cannot be removed without changing the meaning of the benchmark; if the feature is removed, the timing result would be 0. In fact, it was difficult to prevent the JIT from completely eliding some benchmarks because there are no side-effects.

2. All of our benchmark results correlate across programming languages with and without JIT, indicating the JIT has completed any runtime optimizations (we added this sentence to Section 8.1). Any large differences are explained by how a language implements a feature, not by how the compiler/JIT processes that feature. Section 8.1 discusses these points in detail.

3. We also added a sentence stating that we ran all JIT-based programming-language experiments for 30 minutes and found no statistical difference: the med/avg/std correlated with the short-run experiments, which seems a convincing argument that the benchmarks have reached a steady state. If the JIT takes longer than 30 minutes to achieve its optimization goals, it is unlikely to be useful.

4. The purpose of the performance section is not to draw conclusions about improvements. It is to contrast programming-language implementation approaches. Section 8.1 talks about ramifications of certain design and implementation decisions with respect to overall performance. The only conclusion we draw about performance is:

   Performance comparisons with other concurrent systems and languages show the Cforall approach is competitive across all basic operations, which translates directly into good performance in well-written applications with advanced control-flow.
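
To make points 1 and 3 concrete, the following is a minimal sketch of the kind of check described, not the harness used for the paper. It assumes a hypothetical stand-in operation, a volatile sink to keep the timed loop live, and a relative tolerance on the medians; the class name, method names, batch sizes, and the 5% tolerance are all illustrative assumptions.

```java
import java.util.Arrays;

class SteadyStateCheck {
    // Storing the loop result in a volatile field keeps the result live,
    // so the JIT cannot drop the timed loop as dead code outright.
    static volatile long sink;

    // Hypothetical stand-in for one benchmarked operation (e.g., a mutex call).
    static long operation(long x) { return x + 1; }

    // Times one batch of n operations and returns nanoseconds per operation.
    static double timeBatch(int n) {
        long acc = 0;
        long start = System.nanoTime();
        for (int i = 0; i < n; i++) acc = operation(acc);
        long end = System.nanoTime();
        sink = acc;
        return (end - start) / (double) n;
    }

    static double median(double[] xs) {
        double[] s = xs.clone();
        Arrays.sort(s);
        int m = s.length / 2;
        return s.length % 2 == 1 ? s[m] : (s[m - 1] + s[m]) / 2.0;
    }

    static double mean(double[] xs) {
        double sum = 0;
        for (double x : xs) sum += x;
        return sum / xs.length;
    }

    // Sample standard deviation (divides by n - 1); 0 for fewer than 2 values.
    static double stddev(double[] xs) {
        if (xs.length < 2) return 0;
        double mu = mean(xs), ss = 0;
        for (double x : xs) ss += (x - mu) * (x - mu);
        return Math.sqrt(ss / (xs.length - 1));
    }

    // True when the medians of the two samples differ by at most tol (relative).
    static boolean similar(double[] a, double[] b, double tol) {
        double ma = median(a), mb = median(b);
        return Math.abs(ma - mb) <= tol * Math.max(ma, mb);
    }

    public static void main(String[] args) {
        int batches = 20, n = 1_000_000;          // illustrative sizes only
        double[] early = new double[batches], late = new double[batches];
        for (int i = 0; i < batches; i++) early[i] = timeBatch(n);
        for (int i = 0; i < batches; i++) late[i] = timeBatch(n);
        System.out.printf("early med/avg/std: %.3f %.3f %.3f%n",
                median(early), mean(early), stddev(early));
        System.out.printf("late  med/avg/std: %.3f %.3f %.3f%n",
                median(late), mean(late), stddev(late));
        System.out.println("medians within 5%: " + similar(early, late, 0.05));
    }
}
```

In this sketch the early batches play the role of the short-run experiments and the late batches the role of the long run; agreement between the two sets of statistics is the evidence that no further runtime optimization is taking place. It compares medians only, which is a simplification of a full statistical test.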


       I would also like you to provide the values for N for each benchmark run.

Done.


Referee 2 suggested

   * don't start sentences with "However"

However, there are numerous grammar sites on the web indicating "however" (a conjunction) at the start of a sentence is acceptable, e.g.:

https://www.merriam-webster.com/words-at-play/can-you-start-a-sentence-with-however

   This is a stylistic choice, more than anything else, as we have a considerable body of evidence of writers using however to begin sentences, frequently with the meaning of "nevertheless."