Avoiding GenServer bottlenecks

GenServers are the standard way to create services in Elixir. At its heart, a GenServer is a separate process (similar to a thread) that receives messages, does some work, manages state, and sends responses back. If that's what you want, great. It's important to recognize, though, that a GenServer handles only one request at a time, so it can become a bottleneck for your system. The Erlang system has other tools available which may work better.

There is a tendency for developers to model their systems with GenServers, creating unnecessary problems. This is particularly an issue for developers coming to Elixir from object-oriented languages. It leads me to sometimes joke that "GenServer is a code smell."

Following are some examples of how GenServers became bottlenecks in high-volume systems, and how we resolved them.

Example: Geoip lookups

We needed to determine which country a user is coming from based on their IP address. MaxMind publishes various databases related to IP addresses. Each is a binary file in a format that supports efficient querying by network prefix.

The database we are using is about 65MB in size. That is too much data to read from disk on every request, so our initial design was to put it in a GenServer. On application startup, we start the GenServer, telling it where the file is, and it loads the file into its state. Then it responds to GenServer "call" requests to return the data associated with an IP address.
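A minimal sketch of that initial design, with illustrative module and function names (this is not our actual code, and the MaxMind parsing is stubbed out):

```elixir
defmodule Geo.Lookup do
  use GenServer

  # Client API
  def start_link(db_path) do
    GenServer.start_link(__MODULE__, db_path, name: __MODULE__)
  end

  def lookup(ip), do: GenServer.call(__MODULE__, {:lookup, ip})

  # Server callbacks
  @impl true
  def init(db_path) do
    # Load the ~65MB database into this process's state once, at startup
    {:ok, File.read!(db_path)}
  end

  @impl true
  def handle_call({:lookup, ip}, _from, db) do
    # query_db/2 stands in for the real binary search by network prefix
    {:reply, query_db(db, ip), db}
  end

  defp query_db(_db, _ip), do: {:error, :not_implemented}
end
```

Note that every lookup in the whole system is serialized through this single process's mailbox.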

That worked fine for a while, but at a certain point, we started getting timeouts. A GenServer can only handle a single request at a time, so it became the bottleneck. We were effectively forcing all the requests in the system to line up and go through the process one by one.

To avoid that, we switched to a process pool, using the Episcina library. We ran multiple instances of the GenServer. The process handling a request would check out a server from the pool, call the server to get the data, then put it back in the pool.
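The checkout/call/checkin cycle looks roughly like this. This is a generic pool sketch, not Episcina's actual API; the `Pool` module and `:geoip_pool` name are illustrative:

```elixir
def lookup(ip) do
  # Message 1: check a GenServer out of the pool
  {:ok, server} = Pool.checkout(:geoip_pool)

  try do
    # Message 2: run the actual lookup
    GenServer.call(server, {:lookup, ip})
  after
    # Message 3: return the server to the pool, even if the call failed
    Pool.checkin(:geoip_pool, server)
  end
end
```

Three message round-trips per lookup, which matters for the discussion below.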

That worked for a while, but eventually the geoip lookups became the bottleneck for the system again. We added more and more servers to the pool, but it didn't help. At first we thought the problem was the pool manager, as it was a GenServer, too: we needed to send multiple messages per lookup, one to check out a GenServer, one to run the request, and another to check it back in. Message passing is actually quite fast, though.

The bigger issue was how many processes we should have in the pool and how to scale it. Our peak load was driven by traffic spikes, particularly DDoS attacks. We would get sustained traffic of 5-10K requests per second, with spikes above that.

If the pool started with a small number of processes, there was a delay launching new ones as each read the data from disk. Sometimes we would launch hundreds of processes at once in response to demand, all of them fighting for the same disk. If we pre-loaded lots of processes, we used a lot of RAM, and our startup time was poor.

The solution to this, like a lot of Elixir performance issues, was to use ETS. ETS stands for "Erlang Term Storage." It is an in-memory key/value database built into the Erlang virtual machine and highly optimized for concurrent access from multiple processes. It works on Erlang "terms," i.e. native data structures, so there is no serialization overhead. Lookup times in ETS are less than one microsecond, roughly 1000 times faster than something like Redis.

On startup, we load the geoip data into an ETS table. Then, in the process that handles the HTTP request, we fetch the data from the ETS table and do the lookup on the data blob. You might think that would be inefficient due to copying data around, but the Erlang virtual machine optimizes the sharing of binary data. If a binary is larger than 64 bytes, it is stored in a shared binary heap, so we are just passing a reference to the binary between ETS and the process. This is a case where immutable data is a big win.
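A sketch of this approach, again with illustrative names and a stubbed-out parser:

```elixir
# At startup (e.g. in Application.start/2): create a public named table
# with read_concurrency enabled for parallel reads, and load the blob.
:ets.new(:geoip, [:named_table, :set, :public, read_concurrency: true])
:ets.insert(:geoip, {:db, File.read!(db_path)})

# Later, inside the process handling the HTTP request:
# only a reference to the large binary is copied, not the 65MB blob.
[{:db, db}] = :ets.lookup(:geoip, :db)
country = query_db(db, ip)   # query_db/2 stands in for the MaxMind lookup
```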

After this optimization, our worst-case geoip lookups were taking five microseconds, and our memory usage dropped a lot. That was pretty good, but under DDoS attack we get a lot of requests from the same IPs. We added a second ETS table to cache the results of the lookup, getting the time to less than one microsecond.
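The result cache is a straightforward read-through pattern (table names and `query_db/2` are illustrative):

```elixir
# Check the per-IP cache first; fall through to the full lookup on a miss.
case :ets.lookup(:geoip_cache, ip) do
  [{^ip, country}] ->
    country

  [] ->
    [{:db, db}] = :ets.lookup(:geoip, :db)
    country = query_db(db, ip)
    :ets.insert(:geoip_cache, {ip, country})
    country
end
```

In a real system you would also want to bound or expire this cache so it doesn't grow without limit during an attack.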

Principles

This is also a good example of the principle of "model the natural concurrency of your application." We had created a lot of processes to manage the geoip data and lookups, and we had overhead talking to them. The number of pool processes bore no relation to the natural unit of concurrency, the request.

For each HTTP request, we have a Cowboy process that does the work, then goes back into a pool. The right answer was to do all the work associated with the request in this process. That way we don't have the overhead and latency of dealing with the queue manager or sending messages to the GenServer.

Another principle is that we should restrict load at the edge of the system. If we can't handle the load, we should reject it, rather than overloading the bottleneck and making the system fail (see below). When we do the work in the HTTP request process, Cowboy makes it possible (though not required) to limit the number of acceptor processes. So if we can handle 1000 requests per second, we can limit concurrency at the HTTP layer, causing excess requests to be queued by the kernel in the TCP/IP stack. That in turn applies backpressure to the clients of the system.

One common case where we really do need to limit concurrent access is when we are talking to a database like PostgreSQL. The database works best with a relatively small number of simultaneous requests. Any more causes problems with locking. So the fundamental bottleneck in the system is concurrency of the db pool. And, once again, ETS can be a solution by caching db results that don't change.
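With Ecto, for example, the database pool size is an explicit configuration knob. App and repo names here are illustrative:

```elixir
# config/config.exs
config :my_app, MyApp.Repo,
  pool_size: 20,       # small, fixed concurrency toward PostgreSQL
  queue_target: 50     # ms a request may wait for a connection before
                       # DBConnection starts shedding load
```

Keeping this number small and letting callers queue (or fail fast) is usually better than letting hundreds of connections fight over locks in the database.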

Example: logging

We needed to write a transaction log for each request for accounting purposes. This is not a traditional text error/debug log; it is a CSV file.

We originally implemented this as an Erlang gen_event handler. Under the hood, a gen_event manager is a single process, much like a GenServer, that responds to messages; all of its handlers run inside it.

The event handler was responsible for managing log files with timestamped names, opening and closing them as needed. It received events from multiple HTTP request processes, formatted them, and wrote the messages to the files in an orderly way. This design makes sense, as having multiple processes independently opening and writing to log files would cause a lot of contention. The problem is that the gen_event process became the bottleneck for the whole system. Once again, we were making all requests line up to go through a single process one by one. It got overloaded and timed out as disk I/O became an issue under load.

We could have played the same game of splitting things up into multiple GenServers. Instead, we followed a rule of Erlang: "Ericsson probably ran into this problem at British Telecom 20 years ago and solved it." So I went looking into the Erlang libs and found disk_log.

disk_log is very full-featured and designed for exactly this situation. Telecom systems produce Call Detail Records (CDRs): every time a switch touches a call, it records who called whom and how long they talked. It then sends the records to a central server for "mediation," which calculates the bill from the various pieces.

disk_log can handle 100K writes per second, using low-level Erlang I/O features to support concurrent writes from multiple processes. It supports error handling, log rotation, and reading back from logs. It's great, but the docs are limited; all you have is the man page. I am sure Ericsson has lots of examples in their products, but those are not open source. So we needed to find some examples and make some prototypes, but it solved the problem the right way.
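Calling disk_log from Elixir looks roughly like this; the log name, path, and rotation sizes are illustrative:

```elixir
# Open (or create) an externally-formatted wrap log with size-based
# rotation: here, up to 10 files of ~10MB each.
{:ok, log} =
  :disk_log.open(
    name: :transactions,
    file: ~c"/var/log/myapp/transactions.log",
    format: :external,
    type: :wrap,
    size: {10_000_000, 10}
  )

# Any process can append; disk_log serializes and buffers the writes.
# blog/2 writes raw bytes, which suits a CSV transaction log.
:ok = :disk_log.blog(log, "2024-01-01T00:00:00Z,203.0.113.9,NL,200\n")
```

For external-format logs you write raw bytes with `blog/2`; internal-format logs store Erlang terms instead and can be read back with disk_log's own chunk API.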

This is one of the things I love about Erlang. The platform is very mature and has good solutions for the real problems we have. It's not magic, the laws of physics still apply, but it is a lot better than starting from scratch as we would with systems like Golang or Node.js.

Backpressure and load

One big problem with gen_event is that it doesn't have a mechanism for backpressure. It's really designed for low message volumes, and is now deprecated in Elixir. Instead you should generally use GenStage, which uses a pull model to avoid overload.

Lack of backpressure is a fundamental issue with the way Erlang process mailboxes work. If you send more data to a GenServer than it can handle, the mailbox will fill up, and eventually you will get a timeout. If you are having performance problems, look for processes with overloaded mailboxes and deal with them.
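A quick way to hunt for overloaded mailboxes from an IEx session is to rank live processes by queue length:

```elixir
# List the five processes with the longest message queues.
Process.list()
|> Enum.map(fn pid -> {pid, Process.info(pid, :message_queue_len)} end)
|> Enum.flat_map(fn
  {pid, {:message_queue_len, len}} -> [{pid, len}]
  {_pid, nil} -> []   # process exited while we were scanning
end)
|> Enum.sort_by(fn {_pid, len} -> len end, :desc)
|> Enum.take(5)
```

A process whose queue length keeps growing is a bottleneck candidate; `Process.info/2` with `:registered_name` or `:current_function` helps identify which one it is.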

You may wonder why the Erlang system hasn't fixed this. The current mechanism has low overhead. If we had to acknowledge every message, it would increase the load on the system. It also fits with the unreliable nature of real world systems. If we send a message and don't get a response back, then we try again. That handles messages that get lost due to network problems, crashes and overload with the same mechanism.

Another reason is that telecom systems are sold according to the amount of load they can handle. As part of the product development process, they identify what the bottlenecks are, then they limit the inbound load to what the system can handle, rejecting anything beyond that. If you have more, you need to buy another telephone switch. If a process mailbox is filling up in production, it means you have a bug or some other resource problem, e.g. failing hardware.

With systems connected to the Internet, we can't control our load, but we can be smart about how we deal with it. Have a look at Ulf Wiger's jobs framework and his presentation at the Erlang User Conference for more info on load regulation. (Another rule of Erlang is: "pay attention to anything Ulf Wiger does.")

The standard Elixir Logger framework is based on gen_event and has a similar issue. It monitors its own mailbox and if it gets too full, it drops messages and applies backpressure by switching to "synchronous" mode. That makes things slower, though, so it can cause your system to crash, basically kicking it when it's down. If you are limiting load at the edge, you could use the mailbox size of your bottleneck as part of your load check when determining whether you accept requests.

There is some more discussion of logging in this performance tuning presentation.