The impact of network latency, errors, and concurrency on benchmarks

By Jake Morrison in Development on Fri 06 July 2018

The goal of benchmarking is to understand the performance of our system and how to improve it. When we make benchmarks, we need to make sure they match real-world usage.

In my post on Benchmarking Phoenix on Digital Ocean, changing the number of concurrent connections and the network latency had a big effect on the results. This post goes into more detail on why.

Most websites have connections from many clients. That's difficult to simulate, so we often make a lot of requests from a small number of clients, maybe just one. That's an unrealistic way to benchmark if we want to test how the server handles load: the benchmark ends up being limited by the latency of the connection between the client and the server, not by the server itself.

Latency

There is a direct relationship between latency and the number of requests a single connection can push through per second. It is basically just math.

If the server takes 1 ms to handle a request, then the fastest we can do is 1000 requests per second on a single connection, done serially.

1000 ms/sec / 1 ms/request = 1000 requests/sec

That's limited by the server: it is working as hard as it can to process requests. This assumes that network latency is negligible.

If we are testing from a client in the same data center and round-trip latency is 4 ms, then each request takes 1 ms of server processing and 4 ms waiting on the network.

1000 ms/sec / 5 ms/request = 200 requests/sec

With one client, the server is sitting idle most of the time; it could be handling five times as many requests if it had more clients talking to it.

If we add more network latency, say 50 ms, the effect is dramatic:

1000 ms/sec / 51 ms/request = 19.6 requests/sec
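
To see how many concurrent connections it takes to keep the server busy, divide the server's capacity by what one connection can achieve. Here is a quick sketch of that arithmetic in Elixir, using the hypothetical 1 ms service time and the round-trip latencies from above:

# Per-connection rate is limited by service time plus round-trip latency.
# Connections needed to saturate the server = server capacity / per-connection rate.
service_ms = 1.0
server_capacity = 1000.0 / service_ms   # 1000 requests/sec
for rtt_ms <- [0.0, 4.0, 50.0] do
  per_connection = 1000.0 / (service_ms + rtt_ms)
  needed = server_capacity / per_connection
  IO.puts("RTT #{rtt_ms} ms: #{Float.round(per_connection, 1)} req/sec per connection, #{round(needed)} connections to saturate the server")
end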

Client concurrency

In order to accurately benchmark what the server can do, we have to add more clients.

We do that in wrk by increasing the number of threads and connections:

wrk -t200 -c200 -d60s --latency -s wrk.lua "http://159.89.197.173"

At a certain point, the client's performance starts to affect the benchmark. This is particularly true if we run the benchmark on the same server, from localhost: the client takes up resources that the server needs to handle the requests, skewing the results.

Server concurrency

If we only have one CPU, the server can only do one thing at a time. In practice, servers spend much of their time waiting on network connections, so we can use a non-blocking I/O framework and work on other requests while we wait. Single-threaded frameworks with non-blocking I/O can handle a lot of clients, as long as they are not doing CPU-heavy work. It's efficient and easy to reason about. As the processing work per request grows, they start to have problems, because other requests have to wait on the one being processed. This is how Node.js and Twisted Python work.

Other languages, like Java, C++, and Go, use threads: each request gets its own independent thread, and the operating system (or, in Go's case, the language runtime) schedules them. Programming can be complex, though, with race conditions when accessing shared data. Shared memory means that one thread can step on another thread's memory. There are also limits to how many OS threads can be active at a time, as each one takes up system resources.

Server-side concurrency is where Elixir really shines. The Erlang VM handles the low-level networking using non-blocking I/O. It also schedules Elixir processes across a moderate number of OS threads, allowing it to handle millions of lightweight processes efficiently. This is particularly useful when the server has multiple cores: Elixir will transparently take advantage of them all.
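
As a rough illustration (my own sketch, not code from the benchmark), we can start one lightweight process per simulated request and let the VM spread them across the schedulers; ten thousand requests that each spend 50 ms waiting complete in roughly 50 ms of wall-clock time, not 500 seconds:

# Simulate 10,000 concurrent requests that each wait 50 ms on "I/O".
# Each one runs in its own lightweight Elixir process, spread across all cores.
{elapsed_us, _} =
  :timer.tc(fn ->
    1..10_000
    |> Task.async_stream(fn _ -> Process.sleep(50) end, max_concurrency: 10_000, timeout: :infinity)
    |> Stream.run()
  end)
IO.puts("10,000 waiting requests finished in #{div(elapsed_us, 1000)} ms")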

TCP/IP connections and reliability

In that benchmark post, I set max_keepalive: 5_000_000, which told Phoenix to keep the network connection open between requests. This allows the client to send multiple requests on the same TCP/IP connection, avoiding the work and latency from the TCP three-packet handshake (SYN/SYN-ACK/ACK).
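
With the default Cowboy adapter, that setting goes in the endpoint's HTTP protocol options; the snippet below is a sketch with placeholder app and endpoint names:

# config/prod.exs (app and endpoint names are placeholders)
config :my_app, MyAppWeb.Endpoint,
  http: [port: 4000, protocol_options: [max_keepalive: 5_000_000]]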

This is realistic for interactive web use, where a single user may request multiple pages or other assets. For an API, keep-alive may not be relevant.

Packet loss can have a big effect on the worst-case time. If the connection loses a packet and it needs to be retransmitted, that request will be extra slow. When benchmarking over a keep-alive connection, a lost packet can block the requests queued behind it, giving poor results, and you may get better overall throughput without keep-alive.

The real world can be nasty, though, and tuning according to benchmarks in a clean network can lead to poor application behavior.

Mobile connections may have very high packet loss (> 50%) and/or very high latency (> 500 ms). Retransmits happen both at the cellular-network level and the TCP level, causing pathological network behavior. Running a reliable protocol on top of another reliable protocol results in conflicts with timing of acknowledgements, retries, and windowing.

With 50% packet loss, every packet we send may get lost: DNS packets, TCP handshake packets, request packets, response packets, etc. For an API, we are probably best off minimizing the number of API requests and putting more data in each one, relying on the automatic network retransmits to get the data there. If a connection dies completely, though, it can take minutes for TCP to give up, so we may be better off with a connection per request again.

We can also run out of TCP/IP sockets as we handle more simultaneous requests. As traffic grows, it's important to configure each part of the system to make sure it has enough sockets, or that will fundamentally limit the application. The defaults are surprisingly small. This is particularly important when we are using WebSockets, as described in The Road to 2 Million Websocket Connections in Phoenix.
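
One place to start is asking the running VM how many ports (which include TCP/IP sockets) it is using and what its limit is; raising the limits themselves involves the VM's +Q flag and the operating system's open-file limit (ulimit -n). A quick check from IEx:

# Ports include TCP/IP sockets; the default limits are much lower than a busy server needs.
IO.puts("ports in use: #{:erlang.system_info(:port_count)}")
IO.puts("port limit:   #{:erlang.system_info(:port_limit)}")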

Tuning the bottlenecks

The most important thing to look at when performance tuning is not the average, but the worst case. That indicates where the problems lie. Sometimes, as described above, it's lost network packets. Other times, it's waiting on a shared bottleneck, typically the database. Sometimes it's a garbage collection pause. The Erlang VM manages memory on a fine-grained, per-process basis, so it's particularly good at avoiding long garbage collection pauses.

See this performance tuning presentation for more details.