Changeset fc6c410
Timestamp: Aug 25, 2022, 11:52:00 AM
Branches: ADT, ast-experimental, master, pthread-emulation
Children: d2f09e4
Parents: 82b9e956
Location: doc/theses/thierry_delisle_PhD/thesis
Files: 3 edited
doc/theses/thierry_delisle_PhD/thesis/local.bib
r82b9e956 rfc6c410
969 969 }
970 970
971     @misc{xkcd:cloud,
972         author = "Randall Munroe",
973         title = "908: The Cloud",
974         year = "2011",
975         month = "June",
976         howpublished = "\href{https://xkcd.com/908/}",
977         note = "[Online; accessed 25-August-2022]"
978     }
979
971 980 @misc{go:safepoints,
972 981     author = "The Go Programming Language",
doc/theses/thierry_delisle_PhD/thesis/text/eval_macro.tex
r82b9e956 rfc6c410
8 8 Therefore, webservers offer a stringent performance benchmark for \CFA.
9 9 Indeed, existing webservers have close to optimal performance, while the homogeneity of the workload means fairness may not be a problem.
10 As such, these experiments should highlight any \CFA fairness cost (overhead)in realistic scenarios.
10 As such, these experiments should highlight the overhead due to any \CFA fairness cost in realistic scenarios.
11 11
12 12 \section{Memcached}
… …
33 33 \end{itemize}
34 34
35 \subsection{Memcached threading}
35 \subsection{Memcached threading}\label{memcd:thrd}
36 36 Memcached can be built to use multiple threads in addition to its @libevent@ subsystem to handle requests.
37 37 When enabled, the threading implementation operates as follows~\cite{https://docs.oracle.com/cd/E17952_01/mysql-5.6-en/ha-memcached-using-threads.html}:
… …
49 49 Threads that are not currently dealing with another request ignore the incoming packet.
50 50 One of the remaining, nonbusy, threads reads the request and sends the response.
51 This implementation can lead to increased CPU load as threads wake from sleep to potentially process the request.
52 \end{itemize}
53 Here, Memcached is based on an event-based webserver architecture~\cite{Pai99Flash}, using \gls{kthrd}ing to run multiple (largely)independent event engines, and if needed, spinning up additional kernel threads to handle blocking I/O.
51 This implementation can lead to increased CPU load as threads wake from sleep to potentially process the request.
52 \end{itemize}
53 Here, Memcached is based on an event-based webserver architecture~\cite{Pai99Flash}, using \gls{kthrd}ing to run multiple largely independent event engines, and if needed, spinning up additional kernel threads to handle blocking I/O.
54 54 Alternative webserver architectures are:
55 55 \begin{itemize}
… …
179 179 The static webserver experiment compares NGINX~\cite{nginx} with a custom \CFA-based webserver developed for this experiment.
180 180
181 \subsection{NGINX threading}
182 Like memcached, NGINX can be made to use multiple \glspl{kthrd}.
183 It has a very similar architecture to the memcached architecture described in Section~\ref{memcd:thrd}, where multiple \glspl{kthrd} each run mostly independent network logic.
184 While it does not necessarily use a dedicated listening thread, each connection is arbitrarily assigned to one of the \newterm{worker} threads.
185 Each worker thread handles multiple connections exclusively, effectively dividing the connections into distinct sets.
186 Again, this is effectively the \emph{event-based server} approach.
187
188 \cit{https://www.nginx.com/blog/inside-nginx-how-we-designed-for-performance-scale/}
189
190
181 191 \subsection{\CFA webserver}
182 The \CFA webserver is a straightforward thread-per-connection webserver, where a fixed number of \ats are created upfront (tuning parameter).
192 The \CFA webserver is a straightforward thread-per-connection webserver, where a fixed number of \ats are created upfront.
183 193 Each \at calls @accept@, through @io_uring@, on the listening port and handles the incoming connection once accepted.
184 194 Most of the implementation is fairly straightforward;
… …
190 200 As of Linux 5.13, @io_uring@ had no mechanism to restrict the number of worker threads, and therefore, when tens of thousands of splice requests are made, it correspondingly creates tens of thousands of internal \glspl{kthrd}.
191 201 Such a high number of \glspl{kthrd} slows Linux significantly.
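The thread-per-connection structure described above can be illustrated with a short sketch. The following is plain C using liburing, not the \CFA implementation itself: it shows one connection-handling loop that submits @accept@ through @io_uring@ and then services the connection; @handle_request@ is a hypothetical helper standing in for the request parsing and file transfer.

// Sketch only: a single connection-handling loop, accept submitted through io_uring.
#include <liburing.h>
#include <sys/socket.h>
#include <netinet/in.h>

void handle_request(int conn_fd);            // hypothetical: parse request, send reply

void serve_connections(int listen_fd) {
    struct io_uring ring;
    io_uring_queue_init(64, &ring, 0);       // one small ring for this sketch

    for (;;) {
        struct sockaddr_in addr;
        socklen_t len = sizeof(addr);

        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_accept(sqe, listen_fd, (struct sockaddr *)&addr, &len, 0);
        io_uring_submit(&ring);

        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(&ring, &cqe);      // a real runtime parks the thread here instead
        int conn_fd = cqe->res;              // new connection fd, or -errno on failure
        io_uring_cqe_seen(&ring, cqe);
        if (conn_fd < 0) continue;

        handle_request(conn_fd);
    }
}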
192 Rather than abandon the experiment, the \CFA webserver was switched to nonblocking @sendfile@.
193 However, when the nonblocking @sendfile@ returns @EAGAIN@, the \CFA server cannot block the \at because its I/O subsystem uses @io_uring@.
194 Therefore, the \at must spin performing the @sendfile@ and yield if the call returns @EAGAIN@.
195 This workaround works up to the saturation point, when other problems occur.
196
202 Rather than abandon the experiment, the \CFA webserver was switched to @sendfile@.
203
204 With a blocking @sendfile@, the \CFA webserver achieves acceptable performance until saturation is reached.
197 205 At saturation, latency increases so some client connections time out.
198 206 As these clients close their connection, the server must close its corresponding side without delay so the OS can reclaim the resources used by these connections.
199 207 Indeed, until the server connection is closed, the connection lingers in the CLOSE-WAIT TCP state~\cite{rfc:tcp} and the TCP buffers are preserved.
200 208 However, this poses a problem using nonblocking @sendfile@ calls:
201 the call can still block if there is insufficient memory, which can be caused by having too many connections in the CLOSE-WAIT state.\footnote{
209 when @sendfile@ blocks, the \proc rather than the \at blocks, preventing other connections from closing their sockets.
210 The call can block if there is insufficient memory, which can be caused by having too many connections in the CLOSE-WAIT state.\footnote{
202 211 \lstinline{sendfile} can always block even in nonblocking mode if the file to be sent is not in the file-system cache, because Linux does not provide nonblocking disk I/O.}
203 When @sendfile@ blocks, the \proc rather than the \at blocks, preventing other connections from closing their sockets.
204 212 This effect results in a negative feedback where more timeouts lead to more @sendfile@ calls running out of resources.
205 213
206 214 Normally, this is addressed by using @select@/@epoll@ to wait for sockets to have sufficient resources.
207 However, since @io_uring@ respects nonblocking semantics, marking all sockets as non-blocking effectively circumvents the @io_uring@ subsystem entirely.
215 However, since @io_uring@ respects nonblocking semantics, marking all sockets as non-blocking effectively circumvents the @io_uring@ subsystem entirely:
216 all calls would simply immediately return @EAGAIN@ and all asynchronicity would be lost.
217
208 218 For this reason, the \CFA webserver sets and resets the @O_NONBLOCK@ flag before and after any calls to @sendfile@.
219 However, when the nonblocking @sendfile@ returns @EAGAIN@, the \CFA server cannot block the \at because its I/O subsystem uses @io_uring@.
220 Therefore, the \at must spin performing the @sendfile@ and yield if the call returns @EAGAIN@.
209 221 Normally @epoll@ would also be used when these calls to @sendfile@ return @EAGAIN@, but since this would not help in the evaluation of the \CFA runtime, the \CFA webserver simply yields and retries in these cases.
210 222
211 223 Interestingly, Linux 5.15 @io_uring@ introduces the ability to limit the number of worker threads that are created, through the @IORING_REGISTER_IOWQ_MAX_WORKERS@ option.
212 However, as of writing this document Ubuntu does not have a stable release of Linux 5.15.
213 There exists versions of the kernel that are currently under testing, but these caused unrelated but nevertheless prohibitive issues in this experiment.
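The @O_NONBLOCK@ set/reset and the yield-and-retry on @EAGAIN@ described above amount to a small wrapper around @sendfile@. The following is a minimal C sketch of such a wrapper, not the actual \CFA code; @sched_yield@ stands in for the runtime's user-level yield and error handling is elided.

// Sketch: toggle O_NONBLOCK around sendfile, yield and retry on EAGAIN.
#include <fcntl.h>
#include <sys/sendfile.h>
#include <errno.h>
#include <sched.h>

off_t send_whole_file(int sock_fd, int file_fd, off_t size) {
    int flags = fcntl(sock_fd, F_GETFL);
    fcntl(sock_fd, F_SETFL, flags | O_NONBLOCK);    // set nonblocking for the transfer

    off_t off = 0;
    while (off < size) {
        ssize_t ret = sendfile(sock_fd, file_fd, &off, (size_t)(size - off));
        if (ret > 0) continue;                      // progress made, off was advanced
        if (ret < 0 && errno == EAGAIN) {           // socket not ready: yield and retry
            sched_yield();
            continue;
        }
        break;                                      // EOF or a real error
    }

    fcntl(sock_fd, F_SETFL, flags);                 // restore the original flags
    return off;
}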
214 Presumably, the new kernel would remove the need for the hack described above, as it would allow connections in the CLOSE-WAIT state to be closed even while the calls to @splice@/@sendfile@ are underway.
215 However, since this could not be tested, this is purely a conjecture at this point.
224 Presumably, this limit could prevent the explosion of \glspl{kthrd} which justified using @sendfile@ over @io_uring@ and @splice@.
225 However, recall from Section~\ref{iouring} that @io_uring@ maintains two pools of workers: bounded workers and unbounded workers.
226 In the particular case of the webserver, we would want the unbounded workers to handle accepts and reads on sockets, and bounded workers to handle reading the files from disk.
227 This would allow fine-grained control over the number of workers needed for each operation type and would presumably lead to good performance.
228 However, @io_uring@ must contend with another reality of Linux: the versatility of @splice@.
229 Indeed, @splice@ can be used both for reading and writing, to or from any type of file descriptor.
230 This makes it more ambiguous which pool @io_uring@ should delegate @splice@ calls to.
231 In the case of splicing from a socket to a pipe, @splice@ behaves like an unbounded operation, but when splicing from a regular file to a pipe, @splice@ becomes a bounded operation.
232 To make things more complicated, @splice@ can read from a pipe and write out to a regular file.
233 In this case, the read is an unbounded operation but the write is a bounded one.
234 This leaves @io_uring@ in a difficult situation, where it is hard to delegate @splice@ operations to the appropriate type of worker.
235 Since there is little to no context available to @io_uring@, I believe it makes the decision to always delegate @splice@ operations to the unbounded workers.
236 This is unfortunate for this specific experiment, since it prevents the webserver from limiting the number of calls to @splice@ happening in parallel without affecting the performance of @read@ or @accept@.
237 For this reason, the @sendfile@ approach described above is still the most performant solution in Linux 5.15.
238
239 Note that it could be possible to work around this problem, for example by creating more @io_uring@ instances so @splice@ operations can be issued to a different instance than the @read@ and @accept@ operations.
240 However, I do not believe this solution is appropriate in general; it simply replaces a hack in the webserver with a different, equivalent hack.
216 241
217 242 \subsection{Benchmark Environment}
… …
264 289 Throughput is measured by aggregating the results from httperf of all the clients.
265 290
266 Two workload scenarios are created by reconfiguring the server with different amounts of memory: 4 GB and 2 GB.
267 The two workloads correspond to in-memory (4 GB) and disk-I/O (2 GB).
291 This experiment can be done for two workload scenarios by reconfiguring the server with different amounts of memory: 25 GB and 2.5 GB.
292 The two workloads correspond to in-memory and disk-I/O respectively.
268 293 Due to the Zipf distribution, only a small amount of memory is needed to service a significant percentage of requests.
269 294 Table~\ref{t:CumulativeMemory} shows the cumulative memory required to satisfy the specified percentage of requests; e.g., 95\% of the requests come from 126.5 MB of the file set and 95\% of the requests are for files less than or equal to 51,200 bytes.
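For reference, the Linux 5.15 worker-thread limit discussed above is set through @IORING_REGISTER_IOWQ_MAX_WORKERS@. The following is a minimal C sketch using the liburing wrapper, assuming liburing 2.1 or later on a 5.15 kernel: the first value caps the bounded workers (regular-file and block-device operations), the second caps the unbounded workers (socket operations and @IOSQE_ASYNC@), and a value of zero leaves the corresponding limit unchanged.

// Sketch: capping io_uring's io-wq worker pools (assumes liburing >= 2.1, Linux >= 5.15).
#include <liburing.h>
#include <stdio.h>

int cap_workers(struct io_uring *ring, unsigned bounded, unsigned unbounded) {
    unsigned values[2] = { bounded, unbounded };
    int ret = io_uring_register_iowq_max_workers(ring, values);
    if (ret < 0) return ret;                      // e.g. -EINVAL on older kernels
    // On success, the kernel writes the previous limits back into values[].
    printf("previous limits: bounded=%u unbounded=%u\n", values[0], values[1]);
    return 0;
}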
270 Interestingly, with 2 GB of memory, significant disk-I/O occurs.
295 Interestingly, with 2.5 GB of memory, significant disk-I/O occurs.
271 296
272 297 \begin{table}
… …
308 333 Previous work like \cit{Cite Ashif's stuff} demonstrates that an interesting follow-up experiment is to rerun the same throughput experiment but allowing significantly less memory on the machine.
309 334 If the machine is constrained enough, it will force the OS to evict files from the file cache and cause calls to @sendfile@ to have to read from disk.
310 However, what these low memory experiments demonstrate is how the memory footprint of the webserver affects the performance.
335 However, in this configuration, the problem with @splice@ and @io_uring@ rears its ugly head again.
336 Indeed, in the in-memory configuration, replacing @splice@ with calls to @sendfile@ works because the bounded side basically never blocks.
337 Like @splice@, @sendfile@ is in a situation where the read side requires bounded blocking, \eg reading from a regular file, while the write side requires unbounded blocking, \eg blocking until the socket is available for writing.
338 The unbounded side can be handled by yielding when it returns @EAGAIN@ as mentioned above, but this trick does not work for the bounded side.
339 The only solution for the bounded side is to spawn more threads and let these handle the blocking.
340
341 Supporting this case in the webserver would require creating more \procs or creating a dedicated thread-pool.
311 342 However, since what I aim to evaluate in this thesis is the runtime of \CFA, I decided to forgo experiments on a low-memory server.
312 343 The implementation of the webserver itself is simply too impactful to be an interesting evaluation of the underlying runtime.
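A dedicated thread-pool of the kind mentioned above, where blocking @sendfile@ calls are handed to a small set of kernel threads so they never stall the event-handling processors, could look roughly like the following C sketch. The job queue (@pop_job@, @job_done@) and the job type are hypothetical helpers, not part of the actual webserver.

// Sketch: a small pool of kernel threads dedicated to (possibly disk-blocking) sendfile calls.
#include <pthread.h>
#include <sys/sendfile.h>

struct send_job { int sock_fd, file_fd; off_t off; size_t len; };

struct send_job *pop_job(void);          // hypothetical: blocks until a job is queued
void job_done(struct send_job *job);     // hypothetical: wake the user-level thread waiting on the job

static void *sendfile_worker(void *arg) {
    (void)arg;
    for (;;) {
        struct send_job *job = pop_job();
        // Blocking here only stalls this dedicated kernel thread, not an event-handling processor.
        while (job->len > 0) {
            ssize_t ret = sendfile(job->sock_fd, job->file_fd, &job->off, job->len);
            if (ret <= 0) break;
            job->len -= (size_t)ret;
        }
        job_done(job);
    }
    return NULL;
}

void start_sendfile_pool(int nthreads) {
    for (int i = 0; i < nthreads; i++) {
        pthread_t tid;
        pthread_create(&tid, NULL, sendfile_worker, NULL);
    }
}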
doc/theses/thierry_delisle_PhD/thesis/text/io.tex
r82b9e956 rfc6c410
158 158 The parked \glspl{thrd} are then rescheduled by the event engine once the desired operation has completed.
159 159
160 \subsection{\lstinline{io_uring} in depth}
160 \subsection{\lstinline{io_uring} in depth}\label{iouring}
161 161 Before going into details on the design of my event engine, more details on @io_uring@ usage are presented, each important in the design of the engine.
162 162 Figure~\ref{fig:iouring} shows an overview of an @io_uring@ instance.
… …
216 216 This restriction means \io request bursts may have to be subdivided and submitted in chunks at a later time.
217 217
218 An important detail to keep in mind is that just like ``The cloud is just someone else's computer''\cite{xkcd:cloud}, asynchronous operations are just operations using someone else's threads.
219 Indeed, asynchronous operations can require computation time to complete, which means that if this time is not taken from the thread that triggered the asynchronous operation, it must be taken from some other thread.
220 In this case, the @io_uring@ operations that cannot be handled directly in the system call must be delegated to some other \gls{kthrd}.
221 To this end, @io_uring@ maintains multiple \glspl{kthrd} inside the kernel that are not exposed to the user.
222 There are three kinds of operations that can need these \glspl{kthrd}:
223
224 \paragraph{Operations using} @IOSQE_ASYNC@.
225 This is a straightforward case: users can explicitly set the @IOSQE_ASYNC@ flag on an SQE to specify that it \emph{must} be delegated to a different \gls{kthrd}.
226
227 \paragraph{Bounded operations.}
228 This is also a fairly simple case. As mentioned earlier in this chapter, @O_NONBLOCK@ has no effect for regular files and block devices.
229 @io_uring@ must also take this reality into account by delegating operations on regular files and block devices.
230 In fact, @io_uring@ maintains a pool of \glspl{kthrd} dedicated to these operations, which are referred to as \newterm{bounded workers}.
231
232 \paragraph{Unbounded operations that must be retried.}
233 While operations like reads on sockets can return @EAGAIN@ instead of blocking the \gls{kthrd}, when these operations do return @EAGAIN@, they must be retried by @io_uring@ once the data is available on the socket.
234 Since this retry cannot necessarily be done in the system call, @io_uring@ must delegate these calls to a \gls{kthrd}.
235 @io_uring@ maintains a separate pool for these operations.
236 The \glspl{kthrd} in this pool are referred to as \newterm{unbounded workers}.
237 Unbounded workers are also responsible for handling operations using @IOSQE_ASYNC@.
238
239 @io_uring@ implicitly spawns and joins both the bounded and unbounded workers based on its evaluation of the needs of the workload.
240 This effectively encapsulates the work that is needed when using @epoll@.
241 Indeed, @io_uring@ does not change Linux's underlying handling of \io operations; it simply offers an asynchronous \glsxtrshort{api} on top of the existing system.
242
243
218 244 \subsection{Multiplexing \io: Submission}
219 245
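As described above, @IOSQE_ASYNC@ lets the user force an operation onto the worker pools rather than letting @io_uring@ first attempt it inline. The following is a minimal liburing sketch in C (error handling elided) of submitting a read with this flag set.

// Sketch: explicitly delegating a read to io_uring's worker pool with IOSQE_ASYNC.
// Without the flag, io_uring first tries the operation without blocking and only
// falls back to a bounded or unbounded worker when the operation would block.
#include <liburing.h>

int queue_async_read(struct io_uring *ring, int fd, void *buf, unsigned len, off_t off) {
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    if (!sqe) return -1;                          // submission queue is full
    io_uring_prep_read(sqe, fd, buf, len, off);
    io_uring_sqe_set_flags(sqe, IOSQE_ASYNC);     // must run on an io-wq worker thread
    return io_uring_submit(ring);
}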