% source: doc/theses/thierry_delisle_PhD/thesis/text/eval_macro.tex @ 749cf69
% Last change in 999faf1 by Thierry Delisle
\chapter{Macro-Benchmarks}\label{macrobench}
The previous chapter demonstrated that the scheduler achieves its performance goal in small and controlled scenarios.
The next step is to demonstrate that this result holds in more realistic and complete scenarios.
This chapter presents two flavours of webserver that demonstrate \CFA performs competitively with production environments.

Webservers were chosen because they are fairly simple applications that are nonetheless useful as standalone products.
Furthermore, webservers are generally amenable to parallelisation since their workloads are mostly homogeneous.
They therefore offer a stringent performance benchmark for \CFA: existing solutions are likely to achieve close to optimal performance, while the homogeneity of the workloads means the additional fairness \CFA provides is not needed.
\section{Memcached}
Memcached~\cit{memcached} is an in-memory key-value store used in many production environments, \eg \cit{Berk Atikoglu et al., Workload Analysis of a Large-Scale Key-Value Store, SIGMETRICS 2012}.
This server also has the notable benefit that a full-featured front-end for performance testing, called @mutilate@~\cit{mutilate}, already exists.
Experimenting on memcached provides a simple test of the \CFA runtime as a whole: it exercises the scheduler, the idle-sleep mechanism, as well as the \io subsystem for sockets.
This experiment does not exercise the \io subsystem with regards to disk operations.
The experiments compare 3 different variations of memcached:
\begin{itemize}
 \item \emph{vanilla}: the official release of memcached, version~1.6.9.
 \item \emph{fibre}: a modification of vanilla that uses a thread-per-connection model on top of the libfibre runtime~\cite{DBLP:journals/pomacs/KarstenB20}.
 \item \emph{cfa}: a modification of the fibre webserver that replaces the libfibre runtime with \CFA.
\end{itemize}
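The thread-per-connection model shared by the fibre and cfa variants can be sketched as follows. This is an illustrative sketch using kernel threads (pthreads), not memcached's actual code; the names @handle_connection@ and @serve@ are hypothetical. The fibre and cfa variants use user-level threads in place of pthreads, but the structure is the same.

```c
// Illustrative sketch of the thread-per-connection model (not memcached's
// actual code): each accepted connection gets its own thread.  The fibre
// and cfa variants use user-level threads in place of pthreads.
#include <pthread.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

// Stand-in for memcached's per-connection work (protocol parsing and
// hash-table lookups): echo each request back to the client.
static void * handle_connection( void * arg ) {
	int fd = (int)(long)arg;
	char buf[512];
	ssize_t n;
	while ( (n = read( fd, buf, sizeof(buf) )) > 0 ) {
		if ( write( fd, buf, (size_t)n ) != n ) break;
	}
	close( fd );
	return NULL;
}

// Accept loop: spawn one detached thread per accepted connection.
static void serve( int listen_fd ) {
	for ( ;; ) {
		int fd = accept( listen_fd, NULL, NULL );
		if ( fd < 0 ) break;
		pthread_t t;
		if ( pthread_create( &t, NULL, handle_connection, (void *)(long)fd ) != 0 ) {
			close( fd );
			continue;
		}
		pthread_detach( t );
	}
}
```

With libfibre or \CFA, the @pthread_create@ call is replaced by the runtime's lightweight thread creation, which makes the per-connection cost small enough to sustain thousands of concurrent connections.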
\subsection{Benchmark Environment}
These experiments are run on a cluster of homogeneous Supermicro SYS-6017R-TDF compute nodes with the following characteristics.
Each node runs Ubuntu 20.04.3 LTS on top of Linux Kernel 5.11.0-34.
Each node has 2 Intel Xeon E5-2620 v2 CPUs running at 2.10GHz.
Each CPU has 6 cores and 2 \glspl{hthrd} per core, for a total of 24 \glspl{hthrd} per node.
Each CPU has 384 KB, 3 MB and 30 MB of L1, L2 and L3 cache respectively.
Each node is connected to the network through a Mellanox 10 Gigabit Ethernet port.
The network route uses 1 Mellanox SX1012 10/40 Gigabit Ethernet cluster switch.
\subsection{Throughput}
\begin{figure}
        \centering
        \input{result.memcd.rate.qps.pstex_t}
        \caption[Memcached Benchmark: Throughput]{Memcached Benchmark: Throughput\smallskip\newline Desired vs Actual request rate for 15360 connections. Target QPS is the request rate the clients attempt to maintain and Actual QPS is the rate at which the server is able to respond.}
        \label{fig:memcd:rate:qps}
\end{figure}
Figure~\ref{fig:memcd:rate:qps} shows the throughput results for all three webservers.
In this experiment, the clients establish 15360 connections in total, which persist for the duration of the experiment.
The clients then send requests, attempting to follow a desired request rate.
The servers respond to the offered load as best they can, and the difference between the desired rate, ``Target \underline{Q}ueries \underline{P}er \underline{S}econd'', and the actual rate, ``Actual QPS'', shows how far a server falls behind the offered load.
The results show that \CFA achieves equivalent throughput even when the server starts to reach saturation.
Only then does it start to fall behind slightly.
This is a demonstration of the \CFA runtime achieving its performance goal.
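The request schedule the clients follow is open-loop: request $i$ is due at a fixed time determined only by the target rate, regardless of how quickly the server responds, which is what keeps the offered (``Target QPS'') load constant. A minimal sketch of that arithmetic follows; @send_deadline_ns@ is a hypothetical helper, not mutilate's actual code.

```c
// Sketch of open-loop load generation, as done by mutilate-style clients:
// request i is due at start + i/qps regardless of how quickly the server
// responds, so the offered ("Target QPS") load is held constant.
// send_deadline_ns is a hypothetical helper, not mutilate's actual code.
#include <stdint.h>

// Nanoseconds after the start of the run at which request i should be sent.
static uint64_t send_deadline_ns( uint64_t i, uint64_t target_qps ) {
	return i * UINT64_C(1000000000) / target_qps;
}

// The pacing loop then looks like:
//   for ( uint64_t i = 0;; i++ ) {
//       sleep_until( start + send_deadline_ns( i, target_qps ) );
//       send_request();        // responses are collected asynchronously
//   }
```

The gap between this fixed sending schedule and the measured response rate is exactly the Target-vs-Actual QPS gap plotted in Figure~\ref{fig:memcd:rate:qps}.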
\subsection{Tail Latency}
\begin{figure}
        \centering
        \input{result.memcd.rate.99th.pstex_t}
        \caption[Memcached Benchmark: 99th Percentile Latency]{Memcached Benchmark: 99th Percentile Latency\smallskip\newline 99th percentile of the response latency as a function of \emph{desired} request rate for 15360 connections.}
        \label{fig:memcd:rate:tail}
\end{figure}
Another important performance metric is \newterm{tail} latency.
Since many web applications rely on a combination of different requests made in parallel, the latency of the slowest response, \ie the tail latency, can dictate overall performance.
Figure~\ref{fig:memcd:rate:tail} shows the 99th percentile latency results for the same memcached experiment.
As expected, the latency starts low and increases as the server approaches saturation, at which point the latency increases dramatically.
Note that the figure shows the \emph{target} request rate; the actual response rate is given in Figure~\ref{fig:memcd:rate:qps}, as this is the same underlying experiment.
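For concreteness, a 99th-percentile figure like the one plotted above can be computed from a run's collected latency samples with the nearest-rank definition. The sketch below is illustrative; @percentile_ns@ is a hypothetical helper, not mutilate's actual implementation.

```c
// Sketch of a nearest-rank percentile computation over a run's collected
// latency samples.  percentile_ns is a hypothetical helper, not
// mutilate's actual implementation.
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

static int cmp_u64( const void * a, const void * b ) {
	uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
	return (x > y) - (x < y);
}

// Nearest-rank definition: the p-th percentile is the sample at
// rank ceil( p/100 * n ) in sorted order (1-based).  Sorts in place.
static uint64_t percentile_ns( uint64_t * samples, size_t n, double p ) {
	qsort( samples, n, sizeof(uint64_t), cmp_u64 );
	double x = p * (double)n / 100.0;
	size_t rank = (size_t)x;
	if ( (double)rank < x ) rank += 1;   // ceil
	if ( rank == 0 ) rank = 1;
	return samples[rank - 1];
}
```

Note that at the 99th percentile only 1 request in 100 contributes to the reported value, so long runs with many requests are needed for the number to be stable.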
\subsection{Update Rate}
\begin{figure}
        \centering
        \input{result.memcd.updt.qps.pstex_t}
        \caption[Update Rate Benchmark: Throughput]{Update Rate Benchmark: Throughput\smallskip\newline Description}
        \label{fig:memcd:updt:qps}
\end{figure}

\begin{figure}
        \centering
        \input{result.memcd.updt.lat.pstex_t}
        \caption[Update Rate Benchmark: Latency]{Update Rate Benchmark: Latency\smallskip\newline Description}
        \label{fig:memcd:updt:lat}
\end{figure}

\section{Static Web-Server}
The memcached experiment does not exercise two aspects of the \io subsystem: accepting new connections and interacting with disks.
On the other hand, static webservers, \ie servers that offer static webpages, do stress disk \io since they serve files from disk\footnote{Dynamic webservers, which construct pages as they are sent, are not as interesting since the construction of the pages does not exercise the runtime in a meaningfully different way.}.
The static webserver experiments compare NGINX with a custom webserver developed for this experiment.

\subsection{Benchmark Environment}
Unlike the memcached experiment, the webserver runs in a more heterogeneous environment.
The server runs Ubuntu 20.04.4 LTS on top of Linux Kernel 5.13.0-52.
It has an AMD Opteron 6380 processor running at 2.50GHz.
Only 8 of this CPU's \glspl{hthrd} are enabled by grub, which is sufficient to achieve line rate.
The CPU has 64 KB, 256 KiB and 8 MB of L1, L2 and L3 cache respectively.

The client machines each have two 2.8 GHz Xeon CPUs and four one-gigabit Ethernet cards.
Each client machine runs two copies of the workload generator.
The clients run a 2.6.11-1 SMP Linux kernel, which permits each client load-generator to run on a separate CPU.
Since the clients outnumber the server 8-to-1, this is more than sufficient to ensure the clients do not become the bottleneck.

\todo{switch}

\subsection{Throughput}
\begin{figure}
        \centering
        \input{result.swbsrv.25gb.pstex_t}
        \caption[Static Webserver Benchmark: Throughput]{Static Webserver Benchmark: Throughput\smallskip\newline }
        \label{fig:swbsrv}
\end{figure}

\todo{Networked ZIPF}

\todo{Nginx: 5Gb still good, 4Gb starts to suffer}

\todo{Cforall: 10Gb too high, 4 Gb too low}