Changeset a8dd247 for doc/theses/thierry_delisle_PhD
- Timestamp: Aug 30, 2022, 6:29:11 PM (2 years ago)
- Branches: ADT, ast-experimental, master, pthread-emulation
- Children: a0dbf20
- Parents: 31b9d3c
- Location: doc/theses/thierry_delisle_PhD/thesis
- Files: 2 edited
Legend:
  (unmarked)  Unmodified
  +           Added
  -           Removed
doc/theses/thierry_delisle_PhD/thesis/local.bib
r31b9d3c → ra8dd247

    title = {Mach Scheduling and Thread Interfaces - Kernel Programming Guide},
    organization = {Apple Inc.},
-   howPublish = {\href{https://developer.apple.com/library/archive/documentation/Darwin/Conceptual/KernelProgramming/scheduler/scheduler.html}{https://developer.apple.com/library/archive/documentation/Darwin/Conceptual/KernelProgramming/scheduler/scheduler.html}}
+   note = {\href{https://developer.apple.com/library/archive/documentation/Darwin/Conceptual/KernelProgramming/scheduler/scheduler.html}{https://developer.apple.com/library/archive/documentation/Darwin/Conceptual/KernelProgramming/scheduler/scheduler.html}}
+ }
+
+ @misc{MemcachedThreading,
+   author = {Oracle},
+   title = {MySQL 5.6 Reference Manual Including MySQL NDB Cluster 7.3-7.4 Reference Guide},
+   howpublished = {\href{https://docs.oracle.com/cd/E17952_01/mysql-5.6-en/ha-memcached-using-threads.html}{https://docs.oracle.com/\-cd/E17952\_01/\-mysql-5.6-en/\-ha-memcached-using-threads.html}},
+   note = "[Online; accessed 5-August-2022]"
  }
…
    note = "[Online; accessed 5-August-2022]"
  }
+
+ @article{reese2008nginx,
+   title = {NGINX: the high-performance web server and reverse proxy},
+   author = {Reese, Will},
+   journal = {Linux Journal},
+   volume = {2008},
+   number = {173},
+   pages = {2},
+   year = {2008},
+   publisher = {Belltown Media}
+ }
+
+ @phdthesis{Harji10,
+   author = {Ashif Harji},
+   title = {Performance Comparison of Uniprocessor and Multiprocessor Web Server Architectures},
+   school = {University of Waterloo},
+   year = 2010,
+   month = feb,
+   address = {Waterloo, Ontario, Canada, N2L 3G1},
+   note = {\textsf{http://uwspace.uwaterloo.ca/\-bitstream/\-10012/\-5040/\-1/\-Harji\_thesis.pdf}},
+ }
doc/theses/thierry_delisle_PhD/thesis/text/eval_macro.tex
r31b9d3c → ra8dd247

  Therefore, webservers offer a stringent performance benchmark for \CFA.
  Indeed, existing webservers have close to optimal performance, while the homogeneity of the workload means fairness may not be a problem.
- As such, these experiments should highlight the overhead tue to any \CFA fairness cost in realistic scenarios.
+ As such, these experiments should highlight the overhead due to any \CFA fairness cost in realistic scenarios.

  \section{Memcached}
…
  Each node has 2 Intel(R) Xeon(R) CPU E5-2620 v2 running at 2.10GHz.
  \item
- These CPUs have 6 cores per CPUs and 2 \glspl{hthrd} per core, for a total of 24 \glspl{hthrd}.
- \item
- The CPUs each have 384 KB, 3 MB and 30 MB of L1, L2 and L3 caches respectively.
- \item
- Each node is connected to the network through a Mellanox 10 Gigabit Ethernet port.
+ Each CPU has 6 cores and 2 \glspl{hthrd} per core, for a total of 24 \glspl{hthrd}.
+ \item
+ A CPU has 384 KB, 3 MB and 30 MB of L1, L2 and L3 caches, respectively.
+ \item
+ The compute nodes are connected to the network through a Mellanox 10 Gigabit Ethernet port.
  \item
  Network routing is performed by a Mellanox SX1012 10/40 Gigabit Ethernet switch.
…
  \subsection{Memcached threading}\label{memcd:thrd}
  Memcached can be built to use multiple threads in addition to its @libevent@ subsystem to handle requests.
- When enabled, the threading implementation operates as follows~\cite{https://docs.oracle.com/cd/E17952_01/mysql-5.6-en/ha-memcached-using-threads.html}:
+ When enabled, the threading implementation operates as follows~\cite[\S~16.2.2.8]{MemcachedThreading}:
  \begin{itemize}
  \item
…
  \centering
  \resizebox{0.83\linewidth}{!}{\input{result.memcd.rate.99th.pstex_t}}
- \caption[Memcached Benchmark : 99th Percentile Lantency]{Memcached Benchmark : 99th Percentile Lantency\smallskip\newline 99th Percentile of the response latency as a function of \emph{desired} query rate for 15,360 connections. }
+ \caption[Memcached Benchmark : 99th Percentile Latency]{Memcached Benchmark : 99th Percentile Latency\smallskip\newline 99th Percentile of the response latency as a function of \emph{desired} query rate for 15,360 connections. }
  \label{fig:memcd:rate:tail}
  \end{figure}
…
  \subsection{NGINX threading}
- Like memcached, NGINX can be makde to use multiple \glspl{kthrd}.
- It has a very similar architecture to the memcached architecture decscribed in Section~\ref{memcd:thrd}, where multiple \glspl{kthrd} each run a mostly independent network logic.
- While it does not necessarily use a dedicated listening thread, each connection is arbitrarily assigned to one of the \newterm{worker} threads.
- Each worker threads handles multiple connections exclusively, effectively dividing the connections into distinct sets.
- Again, this is effectively the \emph{event-based server} approach.
-
- \cit{https://www.nginx.com/blog/inside-nginx-how-we-designed-for-performance-scale/}
-
+ % Like memcached, NGINX can be made to use multiple \glspl{kthrd}.
+ % It has a very similar architecture to the memcached architecture described in Section~\ref{memcd:thrd}, where multiple \glspl{kthrd} each run a mostly independent network logic.
+ % While it does not necessarily use a dedicated listening thread, each connection is arbitrarily assigned to one of the \newterm{worker} threads.
+ % Each worker thread handles multiple connections exclusively, effectively dividing the connections into distinct sets.
+ % Again, this is effectively the \emph{event-based server} approach.
+ %
+ % \cit{https://www.nginx.com/blog/inside-nginx-how-we-designed-for-performance-scale/}
+
+ NGINX 1.13.7 is a high-performance, \emph{full-service}, event-driven webserver, with multiple operating-system processes or multiple kernel-threads within a process to handle blocking I/O.
+ It can also serve as a reverse proxy and a load balancer~\cite{reese2008nginx}.
+ Concurrent connections are handled using a complex event-driven architecture.
+ The NGINX server runs a master process that performs operations such as reading configuration files, binding to ports, and controlling worker processes.
+ NGINX uses a disk-based cache for performance, and assigns a dedicated process to manage the cache.
+ This process, known as the \textit{cache manager}, is spun off by the master process.
+ Additionally, there can be many \textit{worker processes}, each handling network connections, reading and writing disk files, and communicating with upstream servers, such as reverse proxies or databases.
+
+ A worker is a single-threaded process, running independently of other workers.
+ The worker process handles new incoming connections and processes them.
+ Workers communicate using shared memory for shared cache and session data, and other shared resources.
+ Each worker assigns incoming connections to an HTTP state-machine.
+ As in a typical event-driven architecture, the worker listens for events from the clients, and responds immediately without blocking.
+ Memory use in NGINX is very conservative, because it does not spin up a new process or thread per connection, like Apache~\cite{apache} or \CFA.
+ All operations are asynchronous -- implemented using event notifications, callback functions and fine-tuned timers.
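To make the event-driven worker model above concrete, the following is a minimal sketch of the kind of @epoll@ loop such a worker runs. It is illustrative only, not NGINX source; @handle_connection@ is an assumed application callback that advances a connection's HTTP state-machine and must not block.

    // Minimal sketch of an event-driven worker loop (illustrative, not NGINX code).
    #define _GNU_SOURCE
    #include <sys/epoll.h>
    #include <sys/socket.h>
    #include <unistd.h>

    #define MAX_EVENTS 128

    void handle_connection( int fd );               // assumed: advances the HTTP state-machine, never blocks

    void worker_loop( int listen_fd ) {             // listen_fd: shared non-blocking listening socket
        int epfd = epoll_create1( 0 );
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = listen_fd };
        epoll_ctl( epfd, EPOLL_CTL_ADD, listen_fd, &ev );

        struct epoll_event events[MAX_EVENTS];
        for ( ;; ) {
            int n = epoll_wait( epfd, events, MAX_EVENTS, -1 );   // the only blocking point
            for ( int i = 0; i < n; i += 1 ) {
                int fd = events[i].data.fd;
                if ( fd == listen_fd ) {                          // new connection: register it
                    int conn = accept4( listen_fd, 0, 0, SOCK_NONBLOCK );
                    struct epoll_event cev = { .events = EPOLLIN, .data.fd = conn };
                    epoll_ctl( epfd, EPOLL_CTL_ADD, conn, &cev );
                } else {
                    handle_connection( fd );                      // respond without blocking
                }
            }
        }
    }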
  \subsection{\CFA webserver}
…
  Normally, webservers use @sendfile@~\cite{MAN:sendfile} to send files over a socket because it performs a direct move in the kernel from the file-system cache to the NIC, eliminating reading/writing the file into the webserver.
  While @io_uring@ does not support @sendfile@, it does support @splice@~\cite{MAN:splice}, which is strictly more powerful.
- However, because of how Linux implements file \io, see Subsection~\ref{ononblock}, @io_uring@ must delegate splice calls to worker threads inside the kernel.
+ However, because of how Linux implements file \io, see Subsection~\ref{ononblock}, @io_uring@ must delegate splice calls to worker threads \emph{inside} the kernel.
  As of Linux 5.13, @io_uring@ had no mechanism to restrict the number of worker threads, and therefore, when tens of thousands of splice requests are made, it correspondingly creates tens of thousands of internal \glspl{kthrd}.
  Such a high number of \glspl{kthrd} slows Linux significantly.
  Rather than abandon the experiment, the \CFA webserver was switched to @sendfile@.

- With a blocking @sendfile@ the \CFA achieves acceptable performance until saturation is reached.
- At saturation, latency increases so some client connections timeout.
+ Starting with \emph{blocking} @sendfile@, \CFA achieves acceptable performance until saturation is reached.
+ At saturation, latency increases and client connections begin to timeout.
  As these clients close their connection, the server must close its corresponding side without delay so the OS can reclaim the resources used by these connections.
  Indeed, until the server connection is closed, the connection lingers in the CLOSE-WAIT TCP state~\cite{rfc:tcp} and the TCP buffers are preserved.
- However, this poses a problem using nonblocking @sendfile@ calls:
+ However, this poses a problem using blocking @sendfile@ calls:
  when @sendfile@ blocks, the \proc rather than the \at blocks, preventing other connections from closing their sockets.
  The call can block if there is insufficient memory, which can be caused by having too many connections in the CLOSE-WAIT state.\footnote{
…
  This effect results in a negative feedback where more timeouts lead to more @sendfile@ calls running out of resources.

- Normally, this is address by using @select@/@epoll@ to wait for sockets to have sufficient resources.
- However, since @io_uring@ respects nonblocking semantics, marking all sockets as non-blocking effectively circumvents the @io_uring@ subsystem entirely:
- all calls would simply immediately return @EAGAIN@ and all asynchronicity would be lost.
-
+ Normally, this problem is addressed by using @select@/@epoll@ to wait for sockets to have sufficient resources.
+ However, since @io_uring@ does not support @sendfile@ but does respect non\-blocking semantics, marking all sockets as non-blocking effectively circumvents the @io_uring@ subsystem entirely:
+ all calls simply immediately return @EAGAIN@ and all asynchronicity is lost.
+
+ Switching all of the \CFA runtime to @epoll@ for this experiment is unrealistic and does not help in the evaluation of the \CFA runtime.
  For this reason, the \CFA webserver sets and resets the @O_NONBLOCK@ flag before and after any calls to @sendfile@.
  However, when the nonblocking @sendfile@ returns @EAGAIN@, the \CFA server cannot block the \at because its I/O subsystem uses @io_uring@.
- Therefore, the \at must spin performing the @sendfile@ and yield if the call returns @EAGAIN@.
- Normally @epoll@ would also be used when these calls to @sendfile@ return @EAGAIN@, but since this would not help in the evaluation of the \CFA runtime, the \CFA webserver simply yields and retries in these cases.
+ Therefore, the \at spins performing the @sendfile@, yields if the call returns @EAGAIN@, and retries in these cases.
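This set/call/reset pattern can be sketched as follows; the code is illustrative, not the webserver's actual source, and @yield()@ stands in for the \CFA runtime's user-level thread yield.

    // Sketch of the nonblocking-sendfile retry pattern described above
    // (illustrative; yield() stands in for the CFA runtime's thread yield).
    #include <sys/types.h>
    #include <sys/sendfile.h>
    #include <fcntl.h>
    #include <errno.h>

    extern void yield( void );                        // assumed: yields the current user-level thread

    ssize_t sendfile_nonblock( int out_fd, int in_fd, off_t * off, size_t count ) {
        int flags = fcntl( out_fd, F_GETFL );
        fcntl( out_fd, F_SETFL, flags | O_NONBLOCK ); // set O_NONBLOCK before the call
        ssize_t ret;
        for ( ;; ) {
            ret = sendfile( out_fd, in_fd, off, count );
            if ( ret >= 0 || errno != EAGAIN ) break; // done, or a real error
            yield();                                  // cannot block the proc; yield and retry
        }
        fcntl( out_fd, F_SETFL, flags );              // reset the original flags after the call
        return ret;
    }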
- Interestingly, Linux 5.15 @io_uring@ introduces the ability to limit the number of worker threads that are created, through the @IORING_REGISTER_IOWQ_MAX_WORKERS@ option.
- Presumably, this limit could prevent the explosion of \glspl{kthrd} which justified using @sendfile@ over @io_uring@ and @splice@.
+ Interestingly, Linux 5.15 @io_uring@ introduces the ability to limit the number of worker threads that are created through the @IORING_REGISTER_IOWQ_MAX_WORKERS@ option.
+ Presumably, this limit would prevent the explosion of \glspl{kthrd}, which justified using @sendfile@ over @io_uring@ and @splice@.
  However, recall from Section~\ref{iouring} that @io_uring@ maintains two pools of workers: bounded workers and unbounded workers.
- In the particular case of the webserver, we would want the unbounded workers to handle accepts and reads on socket and bounded workers to handle reading the files from disk.
- This would allow fine grained countrol over the number of workers needed for each operation type and would presumably lead to good performance.
+ For a webserver, the unbounded workers should handle accepts and reads on sockets, and the bounded workers should handle reading files from disk.
+ This setup allows fine-grained control over the number of workers needed for each operation type and presumably leads to good performance.
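With liburing 2.1 or later, such a per-pool limit might be set as sketched below; the specific worker counts are hypothetical, chosen only to illustrate the bounded/unbounded split.

    // Sketch: capping io_uring's kernel worker pools (Linux >= 5.15, liburing >= 2.1).
    // The limits chosen here are hypothetical, for illustration only.
    #include <liburing.h>

    int cap_iowq_workers( struct io_uring * ring ) {
        unsigned int limits[2];
        limits[0] = 8;    // bounded workers: file reads, bounded by disk parallelism
        limits[1] = 64;   // unbounded workers: socket operations that may block indefinitely
        // Passing 0 for an entry leaves that pool's limit unchanged.
        return io_uring_register_iowq_max_workers( ring, limits );
    }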
  However, @io_uring@ must contend with another reality of Linux: the versatility of @splice@.
- Indeed, @splice@ can be used both for reading and writing, to or from any type of file descriptor.
- This makes it more ambiguous which pool @io_uring@ should delegate @splice@ calls to.
- In the case of splicing from a socket to pipe, @splice@ will behave like an unbounded operation, but when splicing from a regular file to a pipe, @splice@ becomes a bounded operation.
- To make things more complicated, @splice@ can read from a pipe and write out to a regular file.
+ Indeed, @splice@ can be used both for reading and writing to or from any type of file descriptor.
+ This generality makes it ambiguous which pool @io_uring@ should delegate @splice@ calls to.
+ In the case of splicing from a socket to a pipe, @splice@ behaves like an unbounded operation, but when splicing from a regular file to a pipe, @splice@ becomes a bounded operation.
+ To make things more complicated, @splice@ can read from a pipe and write to a regular file.
  In this case, the read is an unbounded operation but the write is a bounded one.
  This leaves @io_uring@ in a situation where it is very difficult to delegate splice operations to the appropriate type of worker.
- Since there is little to no context available to @io_uring@, I believe it makes the decision to always delegate @splice@ operations to the unbounded workers.
- This is unfortunate for this specific experiment, since it prevents the webserver from limiting the number of calls to @splice@ happening in parallel without affecting the performance of @read@ or @accept@.
+ Since there is little or no context available to @io_uring@, it seems to always delegate @splice@ operations to the unbounded workers.
+ This decision is unfortunate for this specific experiment, since it prevents the webserver from limiting the number of parallel calls to @splice@ without affecting the performance of @read@ or @accept@.
  For this reason, the @sendfile@ approach described above is still the most performant solution in Linux 5.15.

- Note that it could be possible to workaround this problem, for example by creating more @io_uring@ instances so @splice@ operations can be issued to a different instance than the @read@ and @accept@ operations.
- However, I do not believe this solution is appropriate in general, it simply replaces a hack in the webserver with a different, equivalent hack.
+ One possible workaround is to create more @io_uring@ instances so @splice@ operations can be issued to a different instance than the @read@ and @accept@ operations.
+ However, I do not believe this solution is appropriate in general;
+ it simply replaces my current webserver hack with a different, equivalent hack.
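For reference, replacing @sendfile@ with @splice@ routes the file data through a pipe, which makes the bounded and unbounded sides discussed above explicit; a minimal sketch (error handling and partial pipe drains are elided for brevity):

    // Sketch: sending a file over a socket with splice via an intermediate pipe.
    // The file-to-pipe splice is the bounded (disk) side; the pipe-to-socket
    // splice is the unbounded (network) side. Error handling is elided.
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>

    ssize_t splice_file_to_socket( int file_fd, int sock_fd, size_t count ) {
        int pipefd[2];
        if ( pipe( pipefd ) < 0 ) return -1;
        size_t left = count;
        while ( left > 0 ) {
            // bounded side: regular file -> pipe
            ssize_t in = splice( file_fd, 0, pipefd[1], 0, left, SPLICE_F_MOVE );
            if ( in <= 0 ) break;
            // unbounded side: pipe -> socket
            ssize_t out = splice( pipefd[0], 0, sock_fd, 0, in, SPLICE_F_MOVE );
            if ( out <= 0 ) break;
            left -= out;
        }
        close( pipefd[0] ); close( pipefd[1] );
        return count - left;
    }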
  \subsection{Benchmark Environment}
…
  The server runs Ubuntu 20.04.4 LTS on top of Linux Kernel 5.13.0-52.
  \item
- It has an AMD Opteron(tm) Processor 6380 running at 2.5GHz.
+ The server computer has four AMD Opteron(tm) Processor 6380 with 16 cores running at 2.5GHz, for a total of 64 \glspl{hthrd}.
+ \item
+ The computer is booted with only 8 CPUs enabled, which is sufficient to achieve line rate.
  \item
  Each CPU has 64 KB, 256 KiB and 8 MB of L1, L2 and L3 caches respectively.
- \item
- The computer is booted with only 8 CPUs enabled, which is sufficient to achieve line rate.
  \item
  The computer is booted with only 25GB of memory to restrict the file-system cache.
…
  The experiments are run with 16 clients, each running a copy of httperf (one copy per CPU), requiring a set of 16 log files with requests conforming to a Zipf distribution.
- This distribution is representative of users accessing static data through a web-browser.
+ This distribution is representative of users accessing static data through a web browser.
  Each request reads a file name from its trace, establishes a connection, performs an HTTP get-request for the file name, receives the file data, closes the connection, and repeats the process.
  Some trace elements have multiple file names that are read across a persistent connection.
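One way such a Zipf-distributed request trace could be produced is inverse-CDF sampling, sketched below; the actual httperf logs were generated separately, and the file count and file-name scheme here are hypothetical.

    // Sketch: generating a Zipf-distributed request trace (illustrative only).
    #include <stdio.h>
    #include <stdlib.h>

    // Sample a rank in [1, n] from a Zipf distribution with exponent s = 1,
    // by inverse-CDF sampling over precomputed, normalized harmonic sums.
    static int zipf_sample( const double * cdf, int n ) {
        double u = (double)rand() / RAND_MAX;
        int lo = 1, hi = n;
        while ( lo < hi ) {               // smallest rank with cdf >= u
            int mid = (lo + hi) / 2;
            if ( cdf[mid] < u ) lo = mid + 1; else hi = mid;
        }
        return lo;
    }

    int main( void ) {
        enum { N = 10000 };               // number of distinct files (hypothetical)
        static double cdf[N + 1];
        double h = 0.0;
        for ( int k = 1; k <= N; k += 1 ) { h += 1.0 / k; cdf[k] = h; }
        for ( int k = 1; k <= N; k += 1 ) cdf[k] /= h;   // normalize to a CDF

        for ( int i = 0; i < 100000; i += 1 )            // one trace entry per request
            printf( "/file%d.html\n", zipf_sample( cdf, N ) );
    }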
…
  Server throughput is measured both at peak and after saturation (\ie after peak).
  Peak indicates the level of client requests the server can handle and after peak indicates if a server degrades gracefully.
- Throughput is measured by aggregating the results from httperf of all the clients.
+ Throughput is measured by aggregating the results from httperf for all the clients.

  This experiment can be done for two workload scenarios by reconfiguring the server with different amounts of memory: 25 GB and 2.5 GB.
…
  \end{table}

+ \begin{figure}
+ \centering
+ \subfloat[][Throughput]{
+ \resizebox{0.85\linewidth}{!}{\input{result.swbsrv.25gb.pstex_t}}
+ \label{fig:swbsrv:ops}
+ }
+
+ \subfloat[][Rate of Errors]{
+ \resizebox{0.85\linewidth}{!}{\input{result.swbsrv.25gb.err.pstex_t}}
+ \label{fig:swbsrv:err}
+ }
+ \caption[Static Webserver Benchmark : Throughput]{Static Webserver Benchmark : Throughput\smallskip\newline Throughput vs request rate for short-lived connections.}
+ \label{fig:swbsrv}
+ \end{figure}
+
  Figure~\ref{fig:swbsrv} shows the results comparing \CFA to NGINX in terms of throughput.
  These results are fairly straightforward.
  Both servers achieve the same throughput until around 57,500 requests per second.
- Since the clients are asking for the same files, the fact that the throughput matches exactly is expected as long as both servers are able to serve the desired rate.
+ Since the clients are asking for the same files, the fact that the throughput matches exactly is expected as long as both servers are able to serve the request rate.
  Once the saturation point is reached, both servers are still very close.
  NGINX achieves slightly better throughput.
- However, Figure~\ref{fig:swbsrv:err} shows the rate of errors, a gross approximation of tail latency, where \CFA achieves notably fewer errors once the machine reaches saturation.
- This suggest that \CFA is slightly more fair and NGINX may slightly sacrifice some fairness for improved throughput.
- It demonstrate that the \CFA webserver described above is able to match the performance of NGINX up-to and beyond the saturation point of the machine.
-
- \begin{figure}
- \subfloat[][Throughput]{
- \resizebox{0.85\linewidth}{!}{\input{result.swbsrv.25gb.pstex_t}}
- \label{fig:swbsrv:ops}
- }
-
- \subfloat[][Rate of Errors]{
- \resizebox{0.85\linewidth}{!}{\input{result.swbsrv.25gb.err.pstex_t}}
- \label{fig:swbsrv:err}
- }
- \caption[Static Webserver Benchmark : Throughput]{Static Webserver Benchmark : Throughput\smallskip\newline Throughput vs request rate for short lived connections connections.}
- \label{fig:swbsrv}
- \end{figure}
+ However, Figure~\ref{fig:swbsrv:err} shows the rate of errors, a gross approximation of tail latency, where \CFA achieves notably fewer errors once the servers reach saturation.
+ This suggests \CFA is slightly fairer with less throughput, while NGINX sacrifices fairness for more throughput.
+ This experiment demonstrates that the \CFA webserver is able to match the performance of NGINX up-to and beyond the saturation point of the machine.

  \subsection{Disk Operations}
- The throughput was made using a server with 25gb of memory, this was sufficient to hold the entire fileset in addition to all the code and data needed to run the webserver and the rest of the machine.
- Previous work like \cit{Cite Ashif's stuff} demonstrate that an interesting follow-up experiment is to rerun the same throughput experiment but allowing significantly less memory on the machine.
- If the machine is constrained enough, it will force the OS to evict files from the file cache and cause calls to @sendfile@ to have to read from disk.
- However, in this configuration, the problem with @splice@ and @io_uring@ rears its ugly head again.
+ With 25GB of memory, the entire experimental file-set plus the webserver and OS fit in memory.
+ If memory is constrained, the OS must evict files from the file cache, which causes @sendfile@ to read from disk.\footnote{
+ For the in-memory experiments, the file-system cache was warmed by running an experiment three times before measuring started to ensure all files are in the file-system cache.}
+ Webservers can behave very differently once file I/O begins and increases.
+ Hence, prior work~\cite{Harji10} suggests running both kinds of experiments to test overall webserver performance.
+
+ However, after reducing memory to 2.5GB, the problem with @splice@ and @io_uring@ rears its ugly head again.
  Indeed, in the in-memory configuration, replacing @splice@ with calls to @sendfile@ works because the bounded side basically never blocks.
  Like @splice@, @sendfile@ is in a situation where the read side requires bounded blocking, \eg reading from a regular file, while the write side requires unbounded blocking, \eg blocking until the socket is available for writing.
- The unbounded side can be handled by yielding when it returns @EAGAIN@ like mentioned above, but this trick does not work for the bounded side.
+ The unbounded side can be handled by yielding when it returns @EAGAIN@, as mentioned above, but this trick does not work for the bounded side.
  The only solution for the bounded side is to spawn more threads and let these handle the blocking.

  Supporting this case in the webserver would require creating more \procs or creating a dedicated thread-pool.
- However, since what I am to evaluate in this thesis is the runtime of \CFA, I decided to forgo experiments on low memory server.
- The implementation of the webserver itself is simply too impactful to be an interesting evaluation of the underlying runtime.
+ However, I felt this kind of modification moves too far away from my goal of evaluating the \CFA runtime, \ie it begins writing another runtime system;
+ hence, I decided to forgo experiments on low-memory performance.
+ % The implementation of the webserver itself is simply too impactful to be an interesting evaluation of the underlying runtime.