- Timestamp: Sep 19, 2022, 8:11:02 PM (3 years ago)
- Branches: ADT, ast-experimental, master, pthread-emulation
- Children: aa9f215
- Parents: ebf8ca5 (diff), ae1d151 (diff)

Note: this is a merge changeset; the changes displayed below correspond to the merge itself.
Use the (diff) links above to see all the changes relative to each parent.

- Files: 1 edited
Legend:
- Unmodified (no prefix)
- Added (prefixed with +)
- Removed (prefixed with -)
- Elided unchanged lines are marked with …
doc/theses/thierry_delisle_PhD/thesis/text/eval_macro.tex
rebf8ca5 → r23a08aa0

The previous chapter demonstrated the \CFA scheduler achieves its equivalent performance goal in small and controlled \at-scheduling scenarios.
The next step is to demonstrate performance stays true in more realistic and complete scenarios.
- Therefore, this chapter exercises both \at and I/O scheduling using two flavours of web servers that demonstrate \CFA performs competitively with production environments.
-
- Web servers are chosen because they offer fairly simple applications that perform complex I/O, both network and disk, and are useful as standalone products.
- Furthermore, web servers are generally amenable to parallelization since their workloads are mostly homogeneous.
- Therefore, web servers offer a stringent performance benchmark for \CFA.
- Indeed, existing web servers have close to optimal performance, while the homogeneity of the workload means fairness may not be a problem.
- As such, these experiments should highlight the overhead tue to any \CFA fairness cost in realistic scenarios.
+ Therefore, this chapter exercises both \at and I/O scheduling using two flavours of web servers that demonstrate \CFA performs competitively compared to web servers used in production environments.
+
+ Web servers are chosen because they offer fairly simple applications that perform complex I/O, both network and disk, and are useful as standalone products.
+ Furthermore, web servers are generally amenable to parallelization since their workloads are mostly homogeneous.
+ Therefore, web servers offer a stringent performance benchmark for \CFA.
+ Indeed, existing web servers have close to optimal performance, while the homogeneity of the workload means fairness may not be a problem.
+ As such, these experiments should highlight the overhead due to any \CFA fairness cost in realistic scenarios.

\section{Memcached}
Memcached~\cite{memcached} is an in-memory key-value store used in many production environments, \eg \cite{atikoglu2012workload}.
- In fact, the Memcached server is so popular there exists a full-featured front-end for performance testing, called @mutilate@~\cite{GITHUB:mutilate}.
- Experimenting on Memcached allows for a simple test of the \CFA runtime as a whole, exercising the scheduler, the idle-sleep mechanism, as well the \io subsystem for sockets.
- Note, this experiment does not exercise the \io subsystem with regards to disk operations because Memcached is an in-memory server.
+ The Memcached server is so popular there exists a full-featured front-end for performance testing, called @mutilate@~\cite{GITHUB:mutilate}.
+ Experimenting on Memcached allows for a simple test of the \CFA runtime as a whole, exercising the scheduler, the idle-sleep mechanism, as well as the \io subsystem for sockets.
+ Note that this experiment does not exercise the \io subsystem with regard to disk operations because Memcached is an in-memory server.

\subsection{Benchmark Environment}
…
Each node has 2 Intel(R) Xeon(R) CPU E5-2620 v2 running at 2.10GHz.
\item
- These CPUs have 6 cores per CPUs and 2 \glspl{hthrd} per core, for a total of 24 \glspl{hthrd}.
- \item
- The CPUs each have 384 KB, 3 MB and 30 MB of L1, L2 and L3 caches respectively.
- \item
- Each node is connected to the network through a Mellanox 10 Gigabit Ethernet port.
+ Each CPU has 6 cores and 2 \glspl{hthrd} per core, for a total of 24 \glspl{hthrd}.
+ \item
+ A CPU has 384 KB, 3 MB and 30 MB of L1, L2 and L3 caches, respectively.
+ \item
+ The compute nodes are connected to the network through a Mellanox 10 Gigabit Ethernet port.
\item
Network routing is performed by a Mellanox SX1012 10/40 Gigabit Ethernet switch.
…
\subsection{Memcached threading}\label{memcd:thrd}
Memcached can be built to use multiple threads in addition to its @libevent@ subsystem to handle requests.
- When enabled, the threading implementation operates as follows~\cite{https://docs.oracle.com/cd/E17952_01/mysql-5.6-en/ha-memcached-using-threads.html}:
+ When enabled, the threading implementation operates as follows~\cite[\S~16.2.2.8]{MemcachedThreading}:
\begin{itemize}
\item
…
For UDP connections, all the threads listen to a single UDP socket for incoming requests.
Threads that are not currently dealing with another request ignore the incoming packet.
- One of the remaining, non busy, threads reads the request and sends the response.
- This implementation can lead to increased CPU load as threads wake from sleep to potentially process the request.
- \end{itemize}
- Here, Memcached is based on an event-based web server architecture~\cite{Pai99Flash}, using \gls{kthrd}ing to run multiple largely independent event engines, and if needed, spinning up additional kernel threads to handle blocking I/O.
- Alternative web server architecture are:
+ One of the remaining, non-busy, threads reads the request and sends the response.
+ This implementation can lead to increased CPU \gls{load} as threads wake from sleep to potentially process the request.
+ \end{itemize}
+ Here, Memcached is based on an event-based web server architecture~\cite{Pai99Flash}, using \gls{kthrd}ing to run multiple largely independent event engines, and if needed, spinning up additional kernel threads to handle blocking I/O.
+ Alternative web server architectures are:
\begin{itemize}
\item
…
\item \emph{vanilla}: the official release of Memcached, version~1.6.9.
\item \emph{fibre}: a modification of vanilla using the thread-per-connection model on top of the libfibre runtime.
- \item \emph{cfa}: a modification of the fibre web server that replaces the libfibre runtime with \CFA.
+ \item \emph{cfa}: a modification of the fibre web server that replaces the libfibre runtime with \CFA.
\end{itemize}

…
This experiment is done by having the clients establish 15,360 total connections, which persist for the duration of the experiment.
The clients then send read and write queries with only 3\% writes (updates), attempting to follow a desired query rate, and the server responds to the desired rate as best as possible.
- Figure~\ref{fig:memcd:rate:qps} shows the 3 server versions at different client rates, ``Target \underline{Q}ueries \underline{P}er \underline{S}econd'', and the actual rate, ``Actual QPS'', for all three web servers.
-
- Like the experimental setup in Chapter~\ref{microbench}, each experiment is run 15 times, and for each client rate, the measured web server rate is plotted.
+ Figure~\ref{fig:memcd:rate:qps} shows the 3 server versions at different client rates, ``Target \underline{Q}ueries \underline{P}er \underline{S}econd'', and the actual rate, ``Actual QPS'', for all three web servers.
+
+ Like the experimental setup in Chapter~\ref{microbench}, each experiment is run 15 times, and for each client rate, the measured web server rate is plotted.
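The Memcached threading scheme described in the hunks above, a dedicated listening thread accepting connections and assigning them round-robin to worker threads that each run their own event loop, can be illustrated with a minimal C sketch. This is an illustration only, not Memcached's code: the listener/worker split, the pipe hand-off, and all names are assumptions, and the per-worker libevent loop is elided.

#include <pthread.h>
#include <sys/socket.h>
#include <unistd.h>

#define NWORKERS 4

/* Illustration only, not Memcached's implementation.  Each worker owns a
   private set of connections and runs its own event loop (libevent in the
   real server; elided here). */
struct worker {
	int       hand_off[2];   /* pipe: listener writes new fds, worker reads them */
	pthread_t tid;
};
static struct worker workers[NWORKERS];

static void *worker_loop(void *arg) {
	struct worker *w = arg;
	for (;;) {
		int conn;
		if (read(w->hand_off[0], &conn, sizeof conn) == sizeof conn) {
			/* ... register conn with this worker's event loop and service
			   all of its requests from now on ... */
		}
	}
	return NULL;
}

/* Dedicated listening thread: accept and assign connections round-robin. */
static void listener_loop(int listen_fd) {
	unsigned next = 0;
	for (;;) {
		int conn = accept(listen_fd, NULL, NULL);
		if (conn < 0) continue;
		struct worker *w = &workers[next++ % NWORKERS];
		write(w->hand_off[1], &conn, sizeof conn);   /* round-robin hand-off */
	}
}

static void start_workers(void) {
	for (int i = 0; i < NWORKERS; i++) {
		pipe(workers[i].hand_off);
		pthread_create(&workers[i].tid, NULL, worker_loop, &workers[i]);
	}
}

Because each connection is serviced by exactly one worker after the hand-off, the workers stay largely independent, which is what makes the workload amenable to parallelization.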
The solid line represents the median while the dashed and dotted lines represent the maximum and minimum respectively.
- For rates below 500K queries per seconds, all three webservers match the client rate.
- Beyond 500K, the web servers cannot match the client rate.
- During this interval, vanilla Memcached achieves the highest web server throughput, with libfibre and \CFA slightly lower but very similar throughput.
- Overall the performance of all three web servers is very similar, especially considering that at 500K the servers have reached saturation, which is discussed more in the next section.
+ For rates below 500K queries per second, all three web servers match the client rate.
+ Beyond 500K, the web servers cannot match the client rate.
+ During this interval, vanilla Memcached achieves the highest web server throughput, with libfibre and \CFA slightly lower but very similar throughput.
+ Overall the performance of all three web servers is very similar, especially considering that at 500K the servers have reached saturation, which is discussed more in the next section.

\begin{figure}
\centering
\resizebox{0.83\linewidth}{!}{\input{result.memcd.rate.qps.pstex_t}}
- \caption[Memcached Benchmark: Throughput]{Memcached Benchmark: Throughput\smallskip\newline Desired vs Actual query rate for 15,360 connections. Target QPS is the query rate that the clients are attempting to maintain and Actual QPS is the rate at which the server is able to respond.}
+ \caption[Memcached Benchmark: Throughput]{Memcached Benchmark: Throughput\smallskip\newline Desired vs Actual query rate for 15,360 connections. Target QPS is the query rate that the clients are attempting to maintain and Actual QPS is the rate at which the server can respond.}
\label{fig:memcd:rate:qps}
%\end{figure}
…
\centering
\resizebox{0.83\linewidth}{!}{\input{result.memcd.rate.99th.pstex_t}}
- \caption[Memcached Benchmark : 99th Percentile Lantency]{Memcached Benchmark : 99th Percentile Lantency\smallskip\newline 99th Percentile of the response latency as a function of \emph{desired} query rate for 15,360 connections. }
+ \caption[Memcached Benchmark: 99th Percentile Latency]{Memcached Benchmark: 99th Percentile Latency\smallskip\newline 99th Percentile of the response latency as a function of \emph{desired} query rate for 15,360 connections. }
\label{fig:memcd:rate:tail}
\end{figure}

\subsection{Tail Latency}
- Another popular performance metric is \newterm{tail} latency, which indicates some notion of fairness among requests across the experiment, \ie do some requests wait longer than other requests for service.
+ Another popular performance metric is \newterm{tail} latency, which indicates some notion of fairness among requests across the experiment, \ie do some requests wait longer than other requests for service?
Since many web applications rely on a combination of different queries made in parallel, the latency of the slowest response, \ie tail latency, can dictate a performance perception.
Figure~\ref{fig:memcd:rate:tail} shows the 99th percentile latency results for the same Memcached experiment.

Again, each experiment is run 15 times with the median, maximum and minimum plotted with different lines.
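For reference, the 99th-percentile (tail) latency plotted in these figures is the response time below which 99% of the sampled requests fall. A minimal sketch of one common way to compute it from raw samples (the nearest-rank convention; other interpolating conventions exist) is:

#include <math.h>
#include <stdlib.h>

static int cmp_double(const void *a, const void *b) {
	double x = *(const double *)a, y = *(const double *)b;
	return (x > y) - (x < y);
}

/* Nearest-rank 99th percentile: sort the n latency samples and return the
   value at rank ceil(0.99 * n), counting ranks from 1. */
double percentile_99(double *latency, size_t n) {
	qsort(latency, n, sizeof *latency, cmp_double);
	size_t rank = (size_t)ceil(0.99 * (double)n);
	if (rank < 1) rank = 1;
	return latency[rank - 1];
}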
- As expected, the latency starts low and increases as the server gets close to saturation, at which point, the latency increases dramatically because the web servers cannot keep up with the connection rate so client requests are disproportionally delayed.
- Because of this dramatic increase, the Y axis is presented using log scale.
- Note that the graph shows \emph{target} query rate, the actual response rate is given in Figure~\ref{fig:memcd:rate:qps} as this is the same underlying experiment.
-
- For all three servers, the saturation point is reached before 500K queries per second, which is when throughput starts to decline among the web servers.
- In this experiment, all three web servers are much more distinguishable than the throughput experiment.
- Vanilla Memcached achieves the lowest latency until 600K, after which all the web servers are struggling to respond to client requests.
+ As expected, the latency starts low and increases as the server gets close to saturation, at which point, the latency increases dramatically because the web servers cannot keep up with the connection rate so client requests are disproportionally delayed.
+ Because of this dramatic increase, the Y-axis is presented using a log scale.
+ Note that the graph shows the \emph{target} query rate, the actual response rate is given in Figure~\ref{fig:memcd:rate:qps} as this is the same underlying experiment.
+
+ For all three servers, the saturation point is reached before 500K queries per second, which is when throughput starts to decline among the web servers.
+ In this experiment, all three web servers are much more distinguishable than in the throughput experiment.
+ Vanilla Memcached achieves the lowest latency until 600K, after which all the web servers are struggling to respond to client requests.
\CFA begins to decline at 600K, indicating some bottleneck after saturation.
- Overall, all three web servers achieve micro-second latencies and the increases in latency mostly follow each other.
+ Overall, all three web servers achieve microsecond latencies and the increases in latency mostly follow each other.

\subsection{Update rate}
- Since Memcached is effectively a simple database, the information that is cached can be written to concurrently by multiple queries.
+ Since Memcached is effectively a simple database, the cache information can be written to concurrently by multiple queries.
And since writes can significantly affect performance, it is interesting to see how varying the update rate affects performance.
Figure~\ref{fig:memcd:updt} shows the results for the same experiment as the throughput and latency experiment but increasing the update percentage to 5\%, 10\% and 50\%, respectively, versus the original 3\% update percentage.

\begin{figure}
+ \hspace{-15pt}
\subfloat[][\CFA: Throughput]{
\resizebox{0.5\linewidth}{!}{
…
}
\subfloat[][\CFA: Latency]{
- \resizebox{0.5\linewidth}{!}{
+ \resizebox{0.52\linewidth}{!}{
\input{result.memcd.forall.lat.pstex_t}
}
…
}

+ \hspace{-15pt}
\subfloat[][LibFibre: Throughput]{
\resizebox{0.5\linewidth}{!}{
…
}
\subfloat[][LibFibre: Latency]{
- \resizebox{0.5\linewidth}{!}{
+ \resizebox{0.52\linewidth}{!}{
\input{result.memcd.fibre.lat.pstex_t}
}
…
}

+ \hspace{-15pt}
\subfloat[][Vanilla: Throughput]{
\resizebox{0.5\linewidth}{!}{
…
}
\subfloat[][Vanilla: Latency]{
- \resizebox{0.5\linewidth}{!}{
+ \resizebox{0.52\linewidth}{!}{
\input{result.memcd.vanilla.lat.pstex_t}
}
\label{fig:memcd:updt:vanilla:lat}
}
- \caption[Throughput and Latency results at different update rates (percentage of writes).]{Throughput and Latency results at different update rates (percentage of writes).\smallskip\newline Description}
+ \caption[Throughput and Latency results at different update rates (percentage of writes).]{Throughput and Latency results at different update rates (percentage of writes).\smallskip\newline On the left, throughput as Desired vs Actual query rate.
+ Target QPS is the query rate that the clients are attempting to maintain and Actual QPS is the rate at which the server can respond.
+ On the right, tail latency, \ie 99th Percentile of the response latency as a function of \emph{desired} query rate.
+ For throughput, higher is better, for tail-latency, lower is better.
+ Each series represent 15 independent runs, the dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.}
+ All runs have 15,360 client connections.
\label{fig:memcd:updt}
\end{figure}
…
\section{Static Web-Server}
The Memcached experiment does not exercise two key aspects of the \io subsystem: accept\-ing new connections and interacting with disks.
- On the other hand, a web server servicing static web-pages does stress both accepting connections and disk \io by accepting tens of thousands of client requests per second where these requests return static data serviced from the file-system cache or disk.\footnote{
- Webservers servicing dynamic requests, which read from multiple locations and construct a response, are not as interesting since creating the response takes more time and does not exercise the runtime in a meaningfully different way.}
- The static web server experiment compares NGINX~\cite{nginx} with a custom \CFA-based webserver developed for this experiment.
+ On the other hand, a web server servicing static web pages does stress both accepting connections and disk \io by accepting tens of thousands of client requests per second where these requests return static data serviced from the file-system cache or disk.\footnote{
+ web servers servicing dynamic requests, which read from multiple locations and construct a response, are not as interesting since creating the response takes more time and does not exercise the runtime in a meaningfully different way.}
+ The static web server experiment compares NGINX~\cite{nginx} with a custom \CFA-based web server developed for this experiment.

\subsection{NGINX threading}
- Like memcached, NGINX can be makde to use multiple \glspl{kthrd}.
- It has a very similar architecture to the memcached architecture decscribed in Section~\ref{memcd:thrd}, where multiple \glspl{kthrd} each run a mostly independent network logic.
- While it does not necessarily use a dedicated listening thread, each connection is arbitrarily assigned to one of the \newterm{worker} threads.
- Each worker threads handles multiple connections exclusively, effectively dividing the connections into distinct sets.
- Again, this is effectively the \emph{event-based server} approach.
-
- \cit{https://www.nginx.com/blog/inside-nginx-how-we-designed-for-performance-scale/}
-
-
- \subsection{\CFA webserver}
- The \CFA webserver is a straightforward thread-per-connection webserver, where a fixed number of \ats are created upfront.
+ NGINX is a high-performance, \emph{full-service}, event-driven web server.
+ It can handle both static and dynamic web content, as well as serve as a reverse proxy and a load balancer~\cite{reese2008nginx}.
+ This wealth of capabilities comes with a variety of potential configurations, dictating available features and performance.
+ The NGINX server runs a master process that performs operations such as reading configuration files, binding to ports, and controlling worker processes.
+ When running as a static web server, it uses an event-driven architecture to service incoming requests.
+ Incoming connections are assigned a \emph{stackless} HTTP state machine and worker processes can handle thousands of these state machines.
+ For the following experiment, NGINX is configured to use @epoll@ to listen for events on these state machines and have each worker process independently accept new connections.
+ Because of the realities of Linux, see Subsection~\ref{ononblock}, NGINX also maintains a pool of auxiliary threads to handle blocking \io.
+ The configuration can set the number of worker processes desired, as well as the size of the auxiliary pool.
+ However, for the following experiments, NGINX is configured to let the master process decide the appropriate number of threads.
+
+ \subsection{\CFA web server}
+ The \CFA web server is a straightforward thread-per-connection web server, where a fixed number of \ats are created upfront.
Each \at calls @accept@, through @io_uring@, on the listening port and handles the incoming connection once accepted.
Most of the implementation is fairly straightforward;
however, the inclusion of file \io found an @io_uring@ problem that required an unfortunate workaround.

- Normally, web servers use @sendfile@~\cite{MAN:sendfile} to send files over a socket because it performs a direct move in the kernel from the file-system cache to the NIC, eliminating reading/writing the file into the webserver.
- While @io_uring@ does not support @sendfile@, it does supports @splice@~\cite{MAN:splice}, which is strictly more powerful.
- However, because of how Linux implements file \io, see Subsection~\ref{ononblock}, @io_uring@ must delegate splice calls to worker threads inside the kernel.
+ Normally, web servers use @sendfile@~\cite{MAN:sendfile} to send files over a socket because it performs a direct move in the kernel from the file-system cache to the NIC, eliminating reading/writing the file into the web server.
+ While @io_uring@ does not support @sendfile@, it does support @splice@~\cite{MAN:splice}, which is strictly more powerful.
+ However, because of how Linux implements file \io, see Subsection~\ref{ononblock}, @io_uring@ must delegate splice calls to worker threads \emph{inside} the kernel.
As of Linux 5.13, @io_uring@ had no mechanism to restrict the number of worker threads, and therefore, when tens of thousands of splice requests are made, it correspondingly creates tens of thousands of internal \glspl{kthrd}.
Such a high number of \glspl{kthrd} slows Linux significantly.
- Rather than abandon the experiment, the \CFA web server was switched to @sendfile@.
-
- With a blocking @sendfile@ the \CFA achieves acceptable performance until saturation is reached.
- At saturation, latency increases so some client connections timeout.
+ Rather than abandon the experiment, the \CFA web server was switched to @sendfile@.
+
+ Starting with \emph{blocking} @sendfile@, \CFA achieves acceptable performance until saturation is reached.
+ At saturation, latency increases and client connections begin to timeout.
As these clients close their connection, the server must close its corresponding side without delay so the OS can reclaim the resources used by these connections.
Indeed, until the server connection is closed, the connection lingers in the CLOSE-WAIT TCP state~\cite{rfc:tcp} and the TCP buffers are preserved.
- However, this poses a problem using nonblocking @sendfile@ calls:
+ However, this poses a problem using blocking @sendfile@ calls:
when @sendfile@ blocks, the \proc rather than the \at blocks, preventing other connections from closing their sockets.
The call can block if there is insufficient memory, which can be caused by having too many connections in the CLOSE-WAIT state.\footnote{
\lstinline{sendfile} can always block even in nonblocking mode if the file to be sent is not in the file-system cache, because Linux does not provide nonblocking disk I/O.}
- This effect results in a negative feedback where more timeouts lead to more @sendfile@ calls running out of resources.
-
- Normally, this is address by using @select@/@epoll@ to wait for sockets to have sufficient resources.
- However, since @io_uring@ respects nonblocking semantics, marking all sockets as non-blocking effectively circumvents the @io_uring@ subsystem entirely:
- all calls would simply immediately return @EAGAIN@ and all asynchronicity would be lost.
-
- For this reason, the \CFA webserver sets and resets the @O_NONBLOCK@ flag before and after any calls to @sendfile@.
+ This effect results in a negative feedback loop where more timeouts lead to more @sendfile@ calls running out of resources.
+
+ Normally, this problem is addressed by using @select@/@epoll@ to wait for sockets to have sufficient resources.
+ However, since @io_uring@ does not support @sendfile@ but does respect non\-blocking semantics, marking all sockets as non-blocking effectively circumvents the @io_uring@ subsystem entirely:
+ all calls simply immediately return @EAGAIN@ and all asynchronicity is lost.
+
+ Switching the entire \CFA runtime to @epoll@ for this experiment is unrealistic and does not help in the evaluation of the \CFA runtime.
+ For this reason, the \CFA web server sets and resets the @O_NONBLOCK@ flag before and after any calls to @sendfile@.
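The thread-per-connection structure described above for the \CFA server, a fixed pool of \ats each looping over accept and then serving the requested file with @sendfile@, is roughly the following. This is a plain-C approximation using kernel threads and blocking calls in place of \ats and @io_uring@; the request-parsing step and all names are placeholders, not the thesis's actual code.

#include <fcntl.h>
#include <pthread.h>
#include <sys/sendfile.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <unistd.h>

static int listen_fd;   /* assumed already bound and listening */

static void serve(int conn) {
	char path[1024] = "index.html";   /* placeholder */
	/* ... read the HTTP GET request from conn, extract the file path,
	   and write the response header ... */
	int file = open(path, O_RDONLY);
	if (file >= 0) {
		struct stat st;
		fstat(file, &st);
		off_t off = 0;
		/* direct in-kernel move from the file-system cache to the socket */
		sendfile(conn, file, &off, st.st_size);
		close(file);
	}
	close(conn);
}

static void *connection_thread(void *arg) {   /* one of a fixed pool */
	(void)arg;
	for (;;) {
		int conn = accept(listen_fd, NULL, NULL);   /* the real server issues
		                                               this through io_uring */
		if (conn >= 0) serve(conn);
	}
	return NULL;
}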
However, when the nonblocking @sendfile@ returns @EAGAIN@, the \CFA server cannot block the \at because its I/O subsystem uses @io_uring@.
- Therefore, the \at must spin performing the @sendfile@ and yield if the call returns @EAGAIN@.
- Normally @epoll@ would also be used when these calls to @sendfile@ return @EAGAIN@, but since this would not help in the evaluation of the \CFA runtime, the \CFA webserver simply yields and retries in these cases.
-
- Interestingly, Linux 5.15 @io_uring@ introduces the ability to limit the number of worker threads that are created, through the @IORING_REGISTER_IOWQ_MAX_WORKERS@ option.
- Presumably, this limit could prevent the explosion of \glspl{kthrd} which justified using @sendfile@ over @io_uring@ and @splice@.
+ Therefore, the \at spins performing the @sendfile@, yields if the call returns @EAGAIN@ and retries in these cases.
+
+ Interestingly, Linux 5.15 @io_uring@ introduces the ability to limit the number of worker threads that are created through the @IORING_REGISTER_IOWQ_MAX_WORKERS@ option.
+ Presumably, this limit would prevent the explosion of \glspl{kthrd}, which justified using @sendfile@ over @io_uring@ and @splice@.
However, recall from Section~\ref{iouring} that @io_uring@ maintains two pools of workers: bounded workers and unbounded workers.
- In the particular case of the webserver, we would want the unbounded workers to handle accepts and reads on socket and bounded workers to handle reading the files from disk.
- This would allow fine grained countrol over the number of workers needed for each operation type and would presumably lead to good performance.
+ For a web server, the unbounded workers should handle accepts and reads on sockets, and the bounded workers should handle reading files from disk.
+ This setup allows fine-grained control over the number of workers needed for each operation type and presumably leads to good performance.
+
However, @io_uring@ must contend with another reality of Linux: the versatility of @splice@.
- Indeed, @splice@ can be used both for reading and writing, to or from any type of file descriptor.
- This makes it more ambiguous which pool @io_uring@ should delegate @splice@ calls to.
- In the case of splicing from a socket to pipe, @splice@ will behave like an unbounded operation, but when splicing from a regular file to a pipe, @splice@ becomes a bounded operation.
- To make things more complicated, @splice@ can read from a pipe and write out to a regular file.
+ Indeed, @splice@ can be used both for reading and writing to or from any type of file descriptor.
+ This generality makes it ambiguous which pool @io_uring@ should delegate @splice@ calls to.
+ In the case of splicing from a socket to a pipe, @splice@ behaves like an unbounded operation, but when splicing from a regular file to a pipe, @splice@ becomes a bounded operation.
+ To make things more complicated, @splice@ can read from a pipe and write to a regular file.
In this case, the read is an unbounded operation but the write is a bounded one.
This leaves @io_uring@ in a difficult situation where it can be very difficult to delegate splice operations to the appropriate type of worker.
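The @O_NONBLOCK@ workaround described above, set the flag, attempt @sendfile@, yield and retry on @EAGAIN@, then restore the flag, looks roughly like the following sketch. Here @yield_thread()@ is only a stand-in for the \CFA scheduler's user-level yield, not a real API name.

#include <errno.h>
#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/types.h>

extern void yield_thread(void);   /* placeholder for the runtime's user-level yield */

/* Send part of 'file' over 'sock' without ever blocking the underlying kernel
   thread: toggle O_NONBLOCK around sendfile and yield the user thread on EAGAIN.
   (As the thesis notes, sendfile can still block on disk reads even in
   nonblocking mode, because Linux has no nonblocking disk I/O.) */
ssize_t send_file_nonblocking(int sock, int file, off_t *off, size_t count) {
	int flags = fcntl(sock, F_GETFL, 0);
	fcntl(sock, F_SETFL, flags | O_NONBLOCK);       /* set */

	ssize_t ret;
	for (;;) {
		ret = sendfile(sock, file, off, count);
		if (ret >= 0 || errno != EAGAIN) break;     /* success or a real error */
		yield_thread();                             /* let other user threads run, then retry */
	}

	fcntl(sock, F_SETFL, flags);                    /* reset */
	return ret;
}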
- Since there is little to no context available to @io_uring@, I believe it makes the decision to always delegate @splice@ operations to the unbounded workers.
- This is unfortunate for this specific experiment, since it prevents the webserver from limiting the number of calls to @splice@ happening in parallel without affecting the performance of @read@ or @accept@.
+ Since there is little or no context available to @io_uring@, it seems to always delegate @splice@ operations to the unbounded workers.
+ This decision is unfortunate for this specific experiment since it prevents the web server from limiting the number of parallel calls to @splice@ without affecting the performance of @read@ or @accept@.
For this reason, the @sendfile@ approach described above is still the most performant solution in Linux 5.15.

- Note that it could be possible to workaround this problem, for example by creating more @io_uring@ instances so @splice@ operations can be issued to a different instance than the @read@ and @accept@ operations.
- However, I do not believe this solution is appropriate in general, it simply replaces a hack in the webserver with a different, equivalent hack.
+ One possible workaround is to create more @io_uring@ instances so @splice@ operations can be issued to a different instance than the @read@ and @accept@ operations.
+ However, I do not believe this solution is appropriate in general;
+ it simply replaces my current web server hack with a different, equivalent hack.

\subsection{Benchmark Environment}
- Unlike the Memcached experiment, the web server experiment is run on a heterogeneous environment.
+ Unlike the Memcached experiment, the web server experiment is run on a heterogeneous environment.
\begin{itemize}
\item
The server runs Ubuntu 20.04.4 LTS on top of Linux Kernel 5.13.0-52.
\item
- It has an AMD Opteron(tm) Processor 6380 running at 2.5GHz.
+ The server computer has four AMD Opteron\texttrademark Processor 6380 with 16 cores running at 2.5GHz, for a total of 64 \glspl{hthrd}.
+ \item
+ The computer is booted with only 8 CPUs enabled, which is sufficient to achieve line rate.
\item
Each CPU has 64 KB, 256 KiB and 8 MB of L1, L2 and L3 caches respectively.
\item
- The computer is booted with only 8 CPUs enabled, which is sufficient to achieve line rate.
The computer is booted with only 25GB of memory to restrict the file-system cache.
- \item
\end{itemize}
…
\begin{itemize}
\item
- A client runs a 2.6.11-1 SMP Linux kernel, which permits each client load-generator to run on a separate CPU.
+ A client runs a 2.6.11-1 SMP Linux kernel, which permits each client load generator to run on a separate CPU.
\item
It has two 2.8 GHz Xeon CPUs, and four one-gigabit Ethernet cards.
\item
- \todo{switch}
+ Network routing is performed by an HP 2530 10 Gigabit Ethernet switch.
\item
A client machine runs two copies of the workload generator.
\end{itemize}
The clients and network are sufficiently provisioned to drive the server to saturation and beyond.
- Hence, any server effects are attributable solely to the runtime system and web server.
- Finally, without restricting the server hardware resources, it is impossible to determine if a runtime system or the web server using it has any specific design restrictions, \eg using space to reduce time.
- Trying to determine these restriction with large numbers of processors or memory simply means running equally large experiments, which takes longer and are harder to set up.
+ Hence, any server effects are attributable solely to the runtime system and web server.
+ Finally, without restricting the server hardware resources, it is impossible to determine if a runtime system or the web server using it has any specific design restrictions, \eg using space to reduce time.
+ Trying to determine these restrictions with large numbers of processors or memory simply means running equally large experiments, which take longer and are harder to set up.

\subsection{Throughput}
- To measure web server throughput, the server computer is loaded with 21,600 files, sharded across 650 directories, occupying about 2.2GB of disk, distributed over the server's RAID-5 4-drives to achieve high throughput for disk I/O.
+ To measure web server throughput, the server computer is loaded with 21,600 files, sharded across 650 directories, occupying about 2.2GB of disk, distributed over the server's RAID-5 4-drives to achieve high throughput for disk I/O.
The clients run httperf~\cite{httperf} to request a set of static files.
- The httperf load-generator is used with session files to simulate a large number of users and to implement a partially open-loop system.
+ The httperf load generator is used with session files to simulate a large number of users and to implement a partially open-loop system.
This permits httperf to produce overload conditions, generate multiple requests from persistent HTTP/1.1 connections, and include both active and inactive off periods to model browser processing times and user think times~\cite{Barford98}.

The experiments are run with 16 clients, each running a copy of httperf (one copy per CPU), requiring a set of 16 log files with requests conforming to a Zipf distribution.
- This distribution is representative of users accessing static data through a web-browser.
- Each request reads a file name from its trace, establishes a connection, performs an HTTP get-request for the file name, receive the file data, close the connection, and repeat the process.
+ This distribution is representative of users accessing static data through a web browser.
+ Each request reads a file name from its trace, establishes a connection, performs an HTTP GET request for the file name, receives the file data, closes the connection, and repeats the process.
Some trace elements have multiple file names that are read across a persistent connection.
- A client times-out if the server does not complete a request within 10 seconds.
+ A client times out if the server does not complete a request within 10 seconds.

An experiment consists of running a server with request rates ranging from 10,000 to 70,000 requests per second;
each rate takes about 5 minutes to complete.
- There is 20 seconds idle time between rates and between experiments to allow connections in the TIME-WAIT state to clear.
+ There are 20 seconds of idle time between rates and between experiments to allow connections in the TIME-WAIT state to clear.
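The client workload described above, read a file name from the trace, open a connection, issue an HTTP GET, drain the response, close, repeat, amounts to the following loop. This is a simplified stand-in for httperf, not its code; timing, persistent connections, think times and error handling are omitted, and all names are illustrative.

#include <netdb.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

/* One load-generator iteration: connect, request one file, drain, close. */
static void one_request(const struct addrinfo *srv, const char *path) {
	int s = socket(srv->ai_family, srv->ai_socktype, srv->ai_protocol);
	if (s < 0) return;
	if (connect(s, srv->ai_addr, srv->ai_addrlen) == 0) {
		char req[2048];
		int len = snprintf(req, sizeof req,
			"GET /%s HTTP/1.1\r\nHost: server\r\nConnection: close\r\n\r\n", path);
		write(s, req, len);
		char buf[8192];
		while (read(s, buf, sizeof buf) > 0)
			;                            /* receive and discard the file data */
	}
	close(s);
}

/* Replay a trace of file names, one request per line. */
static void run_trace(const struct addrinfo *srv, FILE *trace) {
	char path[1024];
	while (fscanf(trace, "%1023s", path) == 1)
		one_request(srv, path);
}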
Server throughput is measured both at peak and after saturation (\ie after peak).
Peak indicates the level of client requests the server can handle and after peak indicates if a server degrades gracefully.
- Throughput is measured by aggregating the results from httperf of all the clients.
+ Throughput is measured by aggregating the results from httperf for all the clients.

This experiment can be done for two workload scenarios by reconfiguring the server with different amounts of memory: 25 GB and 2.5 GB.
…
\end{table}

- Figure~\ref{fig:swbsrv} shows the results comparing \CFA to NGINX in terms of throughput.
- These results are fairly straightforward.
- Both servers achieve the same throughput until around 57,500 requests per seconds.
- Since the clients are asking for the same files, the fact that the throughput matches exactly is expected as long as both servers are able to serve the desired rate.
- Once the saturation point is reached, both servers are still very close.
- NGINX achieves slightly better throughput.
- However, Figure~\ref{fig:swbsrv:err} shows the rate of errors, a gross approximation of tail latency, where \CFA achieves notably fewer errors once the machine reaches saturation.
- This suggest that \CFA is slightly more fair and NGINX may slightly sacrifice some fairness for improved throughput.
- It demonstrate that the \CFA webserver described above is able to match the performance of NGINX up-to and beyond the saturation point of the machine.
-
\begin{figure}
+ \centering
\subfloat[][Throughput]{
\resizebox{0.85\linewidth}{!}{\input{result.swbsrv.25gb.pstex_t}}
…
\label{fig:swbsrv:err}
}
- \caption[Static Webserver Benchmark : Throughput]{Static Webserver Benchmark : Throughput\smallskip\newline Throughput vs request rate for short lived connections connections.}
+ \caption[Static web server Benchmark: Throughput]{Static web server Benchmark: Throughput\smallskip\newline Throughput vs request rate for short-lived connections.}
\label{fig:swbsrv}
\end{figure}

+ Figure~\ref{fig:swbsrv} shows the results comparing \CFA to NGINX in terms of throughput.
+ These results are fairly straightforward.
+ Both servers achieve the same throughput until around 57,500 requests per second.
+ Since the clients are asking for the same files, the fact that the throughput matches exactly is expected as long as both servers are able to serve the request rate.
+ Once the saturation point is reached, both servers are still very close.
+ NGINX achieves slightly better throughput.
+ However, Figure~\ref{fig:swbsrv:err} shows the rate of errors, a gross approximation of tail latency, where \CFA achieves notably fewer errors once the servers reach saturation.
+ This suggests \CFA is slightly fairer with less throughput, while NGINX sacrifices fairness for more throughput.
+ This experiment demonstrates that the \CFA web server is able to match the performance of NGINX up to and beyond the saturation point of the machine.
+
\subsection{Disk Operations}
- The throughput was made using a server with 25gb of memory, this was sufficient to hold the entire fileset in addition to all the code and data needed to run the webserver and the rest of the machine.
- Previous work like \cit{Cite Ashif's stuff} demonstrate that an interesting follow-up experiment is to rerun the same throughput experiment but allowing significantly less memory on the machine.
- If the machine is constrained enough, it will force the OS to evict files from the file cache and cause calls to @sendfile@ to have to read from disk.
- However, in this configuration, the problem with @splice@ and @io_uring@ rears its ugly head again.
+ With 25GB of memory, the entire experimental file-set plus the web server and OS fit in memory.
+ If memory is constrained, the OS must evict files from the file cache, which causes @sendfile@ to read from disk.\footnote{
+ For the in-memory experiments, the file-system cache was warmed by running an experiment three times before measuring started to ensure all files are in the file-system cache.}
+ web servers can behave very differently once file I/O begins and increases.
+ Hence, prior work~\cite{Harji10} suggests running both kinds of experiments to test overall web server performance.
+
+ However, after reducing memory to 2.5GB, the problem with @splice@ and @io_uring@ rears its ugly head again.
Indeed, in the in-memory configuration, replacing @splice@ with calls to @sendfile@ works because the bounded side basically never blocks.
Like @splice@, @sendfile@ is in a situation where the read side requires bounded blocking, \eg reading from a regular file, while the write side requires unbounded blocking, \eg blocking until the socket is available for writing.
- The unbounded side can be handled by yielding when it returns @EAGAIN@ like mentioned above, but this trick does not work for the bounded side.
+ The unbounded side can be handled by yielding when it returns @EAGAIN@, as mentioned above, but this trick does not work for the bounded side.
The only solution for the bounded side is to spawn more threads and let these handle the blocking.

- Supporting this case in the web server would require creating more \procs or creating a dedicated thread-pool.
- However, since what I am to evaluate in this thesis is the runtime of \CFA, I decided to forgo experiments on low memory server.
- The implementation of the webserver itself is simply too impactful to be an interesting evaluation of the underlying runtime.
+ Supporting this case in the web server would require creating more \procs or creating a dedicated thread pool.
+ However, I felt this kind of modification moves too far away from my goal of evaluating the \CFA runtime, \ie it begins writing another runtime system;
+ hence, I decided to forgo experiments on low-memory performance.