Kubernetes Health Checks for Elixir Apps
By Programming on Mon 01 January 2024
Health checks are an important part of making your application reliable and manageable in production. They can also help make development with containers faster.
Kubernetes health checks
Kubernetes has well-defined semantics for how health checks should behave, distinguishing between "startup", "liveness", and "readiness".
Liveness is the core health check. It determines whether the app is alive and able to respond to requests. It should be relatively fast, as it is called frequently, but should include checks for dependencies, e.g., whether the app can connect to a database or back-end service. If the liveness check fails for a specified period, Kubernetes kills and replaces the instance.
Startup checks whether the app has finished booting up. It is useful when the app may take significant time to start, e.g., because it loads data from a database into a cache. Separating this from liveness allows us to use different timeouts rather than making the liveness timeout long enough to cover startup. Once startup has completed successfully, Kubernetes stops calling it and uses the liveness check instead.
Readiness checks whether the app should receive requests. Kubernetes uses it to decide whether to route traffic to the instance. If the readiness probe fails, Kubernetes doesn't kill and restart the container. Instead it marks the pod as "unready" and stops sending traffic to it, e.g., in the ingress. It is useful to be able to temporarily stop serving traffic, e.g., when the instance is overloaded or it has transient problems connecting to a back-end service.
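In a pod spec, the three probes might be configured like the following sketch. It assumes the app serves its health endpoints under /healthz on port 4001 (matching the docker-compose example later in this post); the exact paths depend on how the health check plug is mounted.
startupProbe:
  httpGet:
    path: /healthz/startup
    port: 4001
  periodSeconds: 5
  failureThreshold: 30
livenessProbe:
  httpGet:
    path: /healthz/liveness
    port: 4001
  periodSeconds: 10
  timeoutSeconds: 2
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /healthz/readiness
    port: 4001
  periodSeconds: 10
  timeoutSeconds: 2
  failureThreshold: 3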
Kubernetes itself relies only on the HTTP response code to determine service health. A code greater than or equal to 200 and less than 400 indicates success; any other code indicates failure. While Kubernetes treats the health check response as binary, i.e., ok or not, the health check can return additional information about the cause of the error, making troubleshooting easier for developers or ops staff. This might be a simple string or JSON, e.g., {"status": "OK"} or {"status": "error", "code": 503, "reason": "timeout connecting to downstream service"}.
In addition, the service should generally write a message to the log or add information to a trace, allowing people to find and debug systems that are having problems.
The kubernetes_health_check project provides a Plug that handles Kubernetes health check requests. It is driven by a health module that is custom to the app. The following is an example:
defmodule Example.Health do
  @moduledoc """
  Collect app status for Kubernetes health checks.
  """
  alias Example.Repo

  @app :example
  @repos Application.compile_env(@app, :ecto_repos) || []

  @doc """
  Check if the app has finished booting up.

  This returns app status for the Kubernetes `startupProbe`.
  Kubernetes checks this probe repeatedly until it returns a successful
  response. After that, Kubernetes switches to executing the other two probes.
  If the app fails to start successfully before the `failureThreshold` is
  reached, Kubernetes kills the container and restarts it.

  For example, this check might return OK when the app has started the
  web server, connected to a DB, connected to external services, and performed
  initial setup tasks such as loading a large cache.
  """
  @spec startup ::
          :ok
          | {:error, {status_code :: non_neg_integer(), reason :: binary()}}
          | {:error, reason :: binary()}
  def startup do
    # Return an error if there are available migrations which have not been
    # executed. This supports deployment to AWS ECS using the following strategy:
    # https://engineering.instawork.com/elegant-database-migrations-on-ecs-74f3487da99f
    #
    # By default, Elixir migrations lock the database migration table, so they
    # will only run from a single instance.
    pending_migrations =
      @repos
      |> Enum.map(&Ecto.Migrator.migrations/1)
      |> List.flatten()
      |> Enum.filter(fn {status, _version, _name} -> status == :down end)

    if Enum.empty?(pending_migrations) do
      liveness()
    else
      {:error, "Database not migrated"}
    end
  end

  @doc """
  Check if the app is alive and working properly.

  This returns app status for the Kubernetes `livenessProbe`.
  Kubernetes continuously checks if the app is alive and working as expected.
  If it crashes or becomes unresponsive for a specified period of time,
  Kubernetes kills and replaces the container.

  This check should be lightweight, only determining if the server is
  responding to requests and can connect to the DB.
  """
  @spec liveness ::
          :ok
          | {:error, {status_code :: non_neg_integer(), reason :: binary()}}
          | {:error, reason :: binary()}
  def liveness do
    case Ecto.Adapters.SQL.query(Repo, "SELECT 1") do
      {:ok, %{num_rows: 1, rows: [[1]]}} ->
        :ok

      {:error, reason} ->
        {:error, inspect(reason)}
    end
  rescue
    e ->
      {:error, inspect(e)}
  end

  @doc """
  Check if the app should be serving public traffic.

  This returns app status for the Kubernetes `readinessProbe`.
  Kubernetes continuously checks if the app should serve traffic. If the
  readiness probe fails, Kubernetes doesn't kill and restart the container;
  instead, it marks the pod as "unready" and stops sending traffic to it,
  e.g., in the ingress.

  This is useful to temporarily stop serving requests. For example, if the
  app gets a timeout connecting to a back-end service, it might return an
  error for the readiness probe. After multiple failed attempts, it would
  switch to returning an error for the `livenessProbe`, triggering a restart.
  Similarly, the app might return an error if it is overloaded, shedding
  traffic until it has caught up.
  """
  @spec readiness ::
          :ok
          | {:error, {status_code :: non_neg_integer(), reason :: binary()}}
          | {:error, reason :: binary()}
  def readiness do
    liveness()
  end

  @doc """
  Basic check that always returns `:ok`, used to verify only that the HTTP
  server is up and responding, without checking any dependencies.
  """
  @spec basic :: :ok
  # | {:error, {status_code :: non_neg_integer(), reason :: binary()}}
  # | {:error, reason :: binary()}
  def basic do
    :ok
  end
end
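To expose these checks over HTTP, the kubernetes_health_check plug can be added to the Phoenix endpoint; see that project's README for the exact module and options. As a rough illustration of what such a plug does, here is a minimal hand-rolled version (the module name is hypothetical) that maps /healthz paths to the functions above:
defmodule ExampleWeb.HealthPlug do
  @moduledoc """
  Minimal sketch of a health check plug: answers /healthz requests by
  calling Example.Health and translating the result into an HTTP response.
  """
  import Plug.Conn

  def init(opts), do: opts

  def call(%Plug.Conn{path_info: ["healthz" | check]} = conn, _opts) do
    result =
      case check do
        ["startup"] -> Example.Health.startup()
        ["liveness"] -> Example.Health.liveness()
        ["readiness"] -> Example.Health.readiness()
        _ -> Example.Health.basic()
      end

    case result do
      :ok ->
        conn |> send_resp(200, "OK") |> halt()

      {:error, {status_code, reason}} ->
        conn |> send_resp(status_code, reason) |> halt()

      {:error, reason} ->
        conn |> send_resp(503, reason) |> halt()
    end
  end

  def call(conn, _opts), do: conn
end
In a Phoenix app, this would sit in the endpoint ahead of the router, so probe requests are answered before any other routing happens.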
Dependencies
Services also need health checks for the services they depend on. In development, these might be databases or Kafka running in a container. In production, those might be managed services in AWS.
For services that do not provide an HTTP API, we can define a command that runs within the container.
For example, this probe checks a Postgres database container:
readinessProbe:
  exec:
    command: ["pg_isready"]
  initialDelaySeconds: 10
  periodSeconds: 5
  timeoutSeconds: 2
  failureThreshold: 20
livenessProbe:
  exec:
    command: ["psql", "-w", "-U", "postgres", "-d", "my-db", "-c", "SELECT 1"]
  periodSeconds: 10
  timeoutSeconds: 2
  failureThreshold: 1
For historical reasons, Kubernetes checks are different from Docker healthcheck definitions. In docker-compose.yml, health checks look like this:
---
version: "3.9"
services:
  deploy:
    image: example-service
    healthcheck:
      test: ["CMD", "curl", "http://127.0.0.1:4001/healthz"]
      start_period: 6s
      interval: 2s
      timeout: 5s
      retries: 20
    depends_on:
      postgres:
        condition: service_healthy

  postgres:
    image: postgres:14.1-alpine
    restart: always
    healthcheck:
      test: ["CMD-SHELL", "pg_isready"]
      start_period: 5s
      interval: 2s
      timeout: 5s
      retries: 20

  router:
    image: ghcr.io/apollographql/router:v1.2.1
    ports:
      # GraphQL endpoint
      - "4000:4000"
      # Health check
      - "8088:8088"
    environment:
      # https://www.apollographql.com/docs/router/configuration/overview
      APOLLO_ROUTER_LOG: "debug"
      APOLLO_ROUTER_SUPERGRAPH_PATH: /dist/schema/local.graphql
      APOLLO_ROUTER_CONFIG_PATH: /router.yaml
      APOLLO_ROUTER_HOT_RELOAD: "true"
    volumes:
      - "./apollo-router.yml:/router.yaml"
      - "./supergraph.graphql:/dist/schema/local.graphql"
    healthcheck:
      test: ["CMD-SHELL", "curl -v --fail http://127.0.0.1:8088/health"]
      start_period: 5s
      interval: 2s
      timeout: 5s
      retries: 20
    depends_on:
      deploy:
        condition: service_healthy
With containerized tests, we might run tests via docker-compose. The external API tests bring up the app container, containers for the other services it depends on, the associated database containers, and the Apollo Router container. We can run docker-compose up router to bring up all the containers, wait until they are up and healthy, and then run the Postman/Newman tests against the stack.
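A hypothetical test-runner service shows how this can hang off the health checks; the image command and file paths here are illustrative:
  # Hypothetical test runner: docker-compose only starts it once the router
  # (and, through its dependencies, the app and Postgres) reports healthy.
  newman:
    image: postman/newman
    command: ["run", "/etc/newman/api-tests.postman_collection.json"]
    volumes:
      - "./postman:/etc/newman"
    depends_on:
      router:
        condition: service_healthy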
Robust health checks for each component in the stack help the system to come up quickly and reliably, and they provide messages that help us debug startup failures easily.
Running OS commands
Instead of external HTTP checks, we can execute a health check command inside the container. That may use curl to call the app on localhost. While this does not exercise the full path that external requests take, it may be more reliable.
An example Kubernetes check that runs a command:
livenessProbe:
  exec:
    command:
      - /app/grpc-health-probe
      - -addr=:50051
      - -connect-timeout=5s
      - -rpc-timeout=5s
  failureThreshold: 3
  initialDelaySeconds: 60
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 10
readinessProbe:
  exec:
    command:
      - /app/grpc-health-probe
      - -addr=:50051
      - -connect-timeout=5s
      - -rpc-timeout=5s
  failureThreshold: 3
  initialDelaySeconds: 1
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 10
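For an HTTP app, the same exec pattern might use curl against localhost. This is a sketch, assuming the app serves /healthz/liveness on port 4001 inside the container and that curl is installed in the image:
livenessProbe:
  exec:
    command: ["curl", "--fail", "--silent", "http://127.0.0.1:4001/healthz/liveness"]
  periodSeconds: 10
  timeoutSeconds: 2
  failureThreshold: 3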
When an Elixir app is deployed as a release, we can evaluate code directly to run the health check, e.g.:
healthcheck:
  test: ["CMD", "bin/api", "eval", "API.Health.liveness()"]
  start_period: 2s
  interval: 1s
  timeout: 1s
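One caveat: eval exits with status 0 as long as the expression does not raise, so returning an error tuple alone will not fail the check. A minimal sketch of one way to handle this, using a hypothetical wrapper module around the Example.Health module above, is:
defmodule Example.Health.CLI do
  @moduledoc """
  Hypothetical entry point for `bin/example eval`, converting the health
  result into a process exit code that Docker and Kubernetes can interpret.
  """

  def liveness do
    case Example.Health.liveness() do
      :ok ->
        System.halt(0)

      {:error, reason} ->
        IO.puts(:stderr, "liveness check failed: #{inspect(reason)}")
        System.halt(1)
    end
  end
end
The healthcheck would then evaluate something like Example.Health.CLI.liveness() instead of calling the health module directly.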
Higher-level checks
The above health checks are used at the infrastructure level to identify problems. They help Kubernetes automatically resolve problems by restarting containers, scaling resources, etc.
We can also add production checks that indicate problems visible to end customers. For example, if users can't log in, we should alert. Some of these can be driven by metrics: if we would normally see 100 successful logins a minute and are now seeing 0, there is a problem; likewise if we normally see 1% failed logins in a period and are now seeing 50%.
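For instance, assuming the app exports Prometheus-style counters for login attempts (the metric names here are hypothetical), alert rules for those two conditions might look like:
groups:
  - name: login-health
    rules:
      # Normally ~100 successful logins per minute; alert when that drops to zero.
      - alert: NoSuccessfulLogins
        expr: sum(rate(login_attempts_total{result="success"}[5m])) == 0
        for: 10m
      # Normally ~1% of logins fail; alert when the failure ratio exceeds 50%.
      - alert: HighLoginFailureRate
        expr: |
          sum(rate(login_attempts_total{result="failure"}[5m]))
            / sum(rate(login_attempts_total[5m])) > 0.5
        for: 10m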
We use external API tests as part of containerized testing in CI. We can leverage these tests to make production health checks for standard scenarios across multiple services, e.g., a customer logs in, adds an item to their cart, and then checks out.
See the following articles for more background information: