Kubernetes Health Checks for Elixir Apps

By Jake Morrison in Programming on Mon 01 January 2024

Health checks are an important part of making your application reliable and manageable in production. They can also help make development with containers faster.

Kubernetes health checks

Kubernetes has well-defined semantics for how health checks should behave, distinguishing between "startup", "liveness", and "readiness".

Liveness is the core health check. It determines whether the app is alive and able to respond to requests. It should be relatively fast, as it is called frequently, but should include checks for dependencies, e.g., whether the app can connect to a database or back-end service. If the liveness check fails for a specified period, Kubernetes kills and replaces the instance.

Startup checks whether the app has finished booting up. It is useful when the app may take significant time to start, e.g., because it loads data from a database into a cache. Separating this from liveness lets us use different timeouts, rather than making the liveness timeout long enough to cover startup. Once the startup check succeeds, Kubernetes stops calling it and switches to the liveness check.

Readiness checks whether the app should receive requests. Kubernetes uses it to decide whether to route traffic to the instance. If the readiness probe fails, Kubernetes doesn't kill and restart the container. Instead it marks the pod as "unready" and stops sending traffic to it, e.g., in the ingress. It is useful to be able to temporarily stop serving traffic, e.g., when the instance is overloaded or it has transient problems connecting to a back-end service.
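
For example, the three probes might be configured on the app's container along these lines; the port and /healthz paths are placeholders for whatever routes the app actually exposes:

startupProbe:
  httpGet:
    path: /healthz/startup
    port: 4001
  periodSeconds: 5
  failureThreshold: 30
livenessProbe:
  httpGet:
    path: /healthz/liveness
    port: 4001
  periodSeconds: 10
  timeoutSeconds: 2
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /healthz/readiness
    port: 4001
  periodSeconds: 10
  timeoutSeconds: 2
  failureThreshold: 3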

Kubernetes itself relies only on the HTTP response code to determine health. A code greater than or equal to 200 and less than 400 indicates success, and any other code indicates failure. While Kubernetes treats the result as binary, i.e., ok or not, the health check can return additional information about the cause of the error, making troubleshooting easier for developers or ops staff. This might be a simple string or JSON, e.g., {"status": "OK"} or {"status": "error", "code": 503, "reason": "timeout connecting to downstream service"}.

In addition, the service should generally write a message to the log or add information to a trace, allowing people to find and debug systems that are having problems.

The kubernetes_health_check project provides a Plug that handles Kubernetes health check requests. The Plug is driven by an app-specific health module. Here is an example:

defmodule Example.Health do
  @moduledoc """
  Collect app status for Kubernetes health checks.
  """
  alias Example.Repo

  @app :example
  @repos Application.compile_env(@app, :ecto_repos) || []

  @doc """
  Check if the app has finished booting up.

  This returns app status for the Kubernetes `startupProbe`.
  Kubernetes checks this probe repeatedly until it returns a successful
  response. After that Kubernetes switches to executing the other two probes.
  If the app fails to successfully start before the `failureThreshold` time is
  reached, Kubernetes kills the container and restarts it.

  For example, this check might return OK when the app has started the
  web-server, connected to a DB, connected to external services, and performed
  initial setup tasks such as loading a large cache.
  """
  @spec startup ::
          :ok
          | {:error, {status_code :: non_neg_integer(), reason :: binary()}}
          | {:error, reason :: binary()}
  def startup do
    # Return error if there are available migrations which have not been executed.
    # This supports deployment to AWS ECS using the following strategy:
    # https://engineering.instawork.com/elegant-database-migrations-on-ecs-74f3487da99f
    #
    # By default Elixir migrations lock the database migration table, so they
    # will only run from a single instance.
    migrations =
      @repos
      |> Enum.map(&Ecto.Migrator.migrations/1)
      |> List.flatten()
      # Keep only migrations which have not yet been run (status :down)
      |> Enum.filter(fn {status, _version, _name} -> status == :down end)

    if Enum.empty?(migrations) do
      liveness()
    else
      {:error, "Database not migrated"}
    end
  end

  @doc """
  Check if the app is alive and working properly.

  This returns app status for the Kubernetes `livenessProbe`.
  Kubernetes continuously checks if the app is alive and working as expected.
  If it crashes or becomes unresponsive for a specified period of time,
  Kubernetes kills and replaces the container.

  This check should be lightweight, only determining if the server is
  responding to requests and can connect to the DB.
  """
  @spec liveness ::
          :ok
          | {:error, {status_code :: non_neg_integer(), reason :: binary()}}
          | {:error, reason :: binary()}
  def liveness do
    case Ecto.Adapters.SQL.query(Repo, "SELECT 1") do
      {:ok, %{num_rows: 1, rows: [[1]]}} ->
        :ok

      {:error, reason} ->
        {:error, inspect(reason)}
    end
  rescue
    e ->
      {:error, inspect(e)}
  end

  @doc """
  Check if app should be serving public traffic.

  This returns app status for the Kubernetes `readinessProbe`.
  Kubernetes continuously checks if the app should serve traffic. If the
  readiness probe fails, Kubernetes doesn't kill and restart the container,
  instead it marks the pod as "unready" and stops sending traffic to it, e.g.,
  in the ingress.

  This is useful to temporarily stop serving requests. For example, if the app
  gets a timeout connecting to a back end service, it might return an error for
  the readiness probe. After multiple failed attempts, it would switch to
  returning false for the `livenessProbe`, triggering a restart.

  Similarly, the app might return an error if it is overloaded, shedding
  traffic until it has caught up.
  """
  @spec readiness ::
          :ok
          | {:error, {status_code :: non_neg_integer(), reason :: binary()}}
          | {:error, reason :: binary()}
  def readiness do
    liveness()
  end

  @doc """
  Basic check that the app is running and able to respond to requests.

  It always returns `:ok` and does not check any dependencies.
  """
  @spec basic :: :ok
  def basic do
    :ok
  end
end
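
The plug provided by kubernetes_health_check handles the HTTP side. As a rough illustration of the general shape only (not the library's actual API), a hand-rolled version using the hypothetical /healthz paths from above might look like this; it also logs failures, as suggested earlier:

defmodule ExampleWeb.HealthPlug do
  @moduledoc """
  Illustrative sketch of a health check plug. The real kubernetes_health_check
  plug has its own module name and options; this is not its API.
  """
  @behaviour Plug

  import Plug.Conn
  require Logger

  @impl true
  def init(opts), do: opts

  @impl true
  def call(%Plug.Conn{path_info: ["healthz" | rest]} = conn, _opts) do
    result =
      case rest do
        ["startup"] -> Example.Health.startup()
        ["readiness"] -> Example.Health.readiness()
        _ -> Example.Health.liveness()
      end

    case result do
      :ok ->
        conn |> send_resp(200, ~s({"status": "OK"})) |> halt()

      {:error, {status, reason}} ->
        Logger.warning("Health check failed: #{reason}")
        conn |> send_resp(status, reason) |> halt()

      {:error, reason} ->
        Logger.warning("Health check failed: #{reason}")
        conn |> send_resp(503, reason) |> halt()
    end
  end

  def call(conn, _opts), do: conn
end

A plug like this is usually mounted early in the Phoenix endpoint, before request logging and telemetry, so frequent probe requests do not clutter the logs.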

Dependencies

Services also need health checks for the services they depend on. In development, these might be databases or Kafka running in a container. In production, those might be managed services in AWS.

For services that do not provide an HTTP API, we can define a command that runs within the container.

For example, this probe checks a Postgres database container:

readinessProbe:
  exec:
    command: ["pg_isready"]
  initialDelaySeconds: 10
  periodSeconds: 5
  timeoutSeconds: 2
  failureThreshold: 20

livenessProbe:
  exec:
    command: ["psql", "-w", "-U", "postgres", "-d", "my-db", "-c", "SELECT 1"]
  periodSeconds: 10
  timeoutSeconds: 2
  failureThreshold: 1

For historical reasons, Kubernetes checks are different from Docker healthcheck definitions.

In docker-compose.yml, health checks look like:

---
version: "3.9"
services:
  deploy:
    image: example-service
    healthcheck:
      test: ["CMD", "curl", "http://127.0.0.1:4001/healthz"]
      start_period: 6s
      interval: 2s
      timeout: 5s
      retries: 20
    depends_on:
      postgres:
        condition: service_healthy

  postgres:
    image: postgres:14.1-alpine
    restart: always
    healthcheck:
      test: ["CMD-SHELL", "pg_isready"]
      start_period: 5s
      interval: 2s
      timeout: 5s
      retries: 20

  router:
    image: ghcr.io/apollographql/router:v1.2.1
    ports:
      # GraphQL endpoint
      - "4000:4000"
      # Health check
      - "8088:8088"
    environment:
      # https://www.apollographql.com/docs/router/configuration/overview
      APOLLO_ROUTER_LOG: "debug"
      APOLLO_ROUTER_SUPERGRAPH_PATH: /dist/schema/local.graphql
      APOLLO_ROUTER_CONFIG_PATH: /router.yaml
      APOLLO_ROUTER_HOT_RELOAD: "true"
    volumes:
      - "./apollo-router.yml:/router.yaml"
      - "./supergraph.graphql:/dist/schema/local.graphql"
    healthcheck:
      test: ["CMD-SHELL", "curl -v --fail http://127.0.0.1:8088/health"]
      start_period: 5s
      interval: 2s
      timeout: 5s
      retries: 20
    depends_on:
      deploy:
        condition: service_healthy

With containerized tests, we might run tests via docker-compose. The external API tests bring up the app container, containers for the services it depends on, their databases, and the Apollo Router container, then run tests against the stack using Postman/Newman. Running docker-compose up router brings up all of the containers in dependency order, waiting until each is up and healthy, after which we can run the Newman tests against it.
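
One way to wire this up is a short-lived test-runner service, added under services: in the same compose file, that only starts once the stack is healthy. The service name, image, and collection path below are assumptions; adjust them to your project:

  newman:
    image: postman/newman:alpine
    command: ["run", "/etc/newman/external-api.postman_collection.json"]
    volumes:
      - "./postman:/etc/newman"
    depends_on:
      router:
        condition: service_healthy

With this in place, docker-compose run newman starts the dependencies, waits for them to report healthy, and then runs the collection.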

Robust health checks for each component in the stack help the system to come up quickly and reliably, and they provide messages that help us debug startup failures easily.

Running OS commands

Instead of having Kubernetes call the app over HTTP from outside the container, we can execute a health check command inside the container, e.g., using curl to call the app on localhost. While this does not exercise the full network path into the container, it may be more reliable.
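
For example, a liveness probe might run curl inside the app container. This assumes curl is installed in the image and that the app serves a health route on port 4001; adjust the path and port to match your setup:

livenessProbe:
  exec:
    command: ["curl", "--fail", "--silent", "http://127.0.0.1:4001/healthz"]
  periodSeconds: 10
  timeoutSeconds: 2
  failureThreshold: 3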

Here is another example, using grpc-health-probe to check a gRPC service:

livenessProbe:
  exec:
    command:
    - /app/grpc-health-probe
    - -addr=:50051
    - -connect-timeout=5s
    - -rpc-timeout=5s
  failureThreshold: 3
  initialDelaySeconds: 60
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 10
readinessProbe:
  exec:
    command:
    - /app/grpc-health-probe
    - -addr=:50051
    - -connect-timeout=5s
    - -rpc-timeout=5s
  failureThreshold: 3
  initialDelaySeconds: 1
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 10

When an Elixir app is packaged as a release, we can call the health check function directly with the release's eval command, e.g., in a Docker healthcheck:

healthcheck:
  test: ["CMD", "bin/api", "eval", "API.Health.liveness()"]
  start_period: 2s
  interval: 1s
  timeout: 1s
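
Note that eval generally exits with status 0 as long as the expression does not raise or halt, so a check that returns {:error, ...} would still appear healthy. One way to handle this, sketched here with a hypothetical wrapper module (not part of kubernetes_health_check), is to convert the result into an exit code:

defmodule API.HealthCLI do
  @moduledoc "Hypothetical wrapper for running health checks via bin/api eval."

  @doc "Runs the given check and halts with a non-zero status on failure."
  def run(check \\ :liveness) do
    case apply(API.Health, check, []) do
      :ok ->
        :ok

      error ->
        IO.puts(:stderr, inspect(error))
        System.halt(1)
    end
  end
end

The healthcheck command then becomes bin/api eval "API.HealthCLI.run(:liveness)".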

Higher-level checks

The above health checks are used at the infrastructure level to identify problems. They help Kubernetes automatically resolve problems by restarting containers, scaling resources, etc.

We can also add production checks that detect problems visible to end customers. For example, if users can't log in, we should alert. Some of these can be based on metrics: if we normally get 100 successful logins a minute and are now getting 0, there is a problem. Likewise, if the login failure rate is normally 1% and is now 50%.
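
If login attempts are exported as metrics, this kind of check can be written as an alerting rule. For example, a Prometheus rule along these lines, assuming a hypothetical logins_total counter labeled by result:

groups:
  - name: login-health
    rules:
      - alert: LoginFailureRateHigh
        # Alert when more than 25% of logins failed over the last 5 minutes
        expr: |
          sum(rate(logins_total{result="failure"}[5m]))
            / sum(rate(logins_total[5m])) > 0.25
        for: 10m
        labels:
          severity: page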

We use external API tests as part of containerized testing in CI. We can leverage these tests to make production health checks for standard scenarios across multiple services, e.g., a customer logs in, adds an item to their cart, and then checks out.

See the following articles for more background information: