Sheharyar Naseer

Kubernetes Health Checks in Elixir & Phoenix


At Devfest this year, I gave a talk on Self-healing applications with Kubernetes. The session covered the high-level concepts with a minimal web application in Kubernetes, but I wanted to do a dedicated write-up for accomplishing the same in Elixir web applications.

This post does assume a basic understanding of some Kubernetes concepts, specifically Containers, Pods and Clusters. If you’re not familiar with these, check out some of these resources first:

Fault-tolerance in Kubernetes

Like Elixir / Erlang, Kubernetes can also be set up as a self-healing system. While the Erlang VM gives you fault tolerance at the process level within your application, Kubernetes can do the same for each of the nodes in a cluster by monitoring them and automatically restarting them when they fail.

It does this by executing three different kinds of user-defined health checks or “probes”:

  • Startup Probe:

    This probe determines if your application has finished booting up and can now be considered “live”. It runs repeatedly until it gets its first successful response. After that it stops, and Kubernetes starts executing the other two probes to determine the health of your application. If your app has not started successfully by the time the failure threshold is reached, the container is killed and restarted.

    An example use case is returning OK once your application has started the web server, connected to any external services or dependencies it relies on (like a DB), and performed any initial setup tasks (like loading a large cache).

  • Liveness Probe:

    The liveness probe is the most important one. It continuously checks if the app is still live and working as expected. If for some reason your app crashes or becomes unresponsive for a given period of time, the underlying Kubelet process will kill and replace the container.

    Liveness checks are supposed to be very lightweight, only determining things like if the server is responding to requests and the app can connect to the DB.

  • Readiness Probe:

    This probe determines if the app can accept traffic. Most guides you’ll find online confuse the readiness probe with the startup probe, only using it to check if the app is “ready” at startup. But there are some important differences:

    1. Unlike the startup probe, the readiness probe runs throughout the container’s lifecycle.
    2. When the readiness probe fails, it doesn’t kill and restart the container like the other two probes. Instead, the pod is marked “unready” and all traffic is serviced by the remaining “ready” pods.

    Beyond the startup use cases, this is also useful if you want your app to temporarily stop serving requests without killing the container. An example is when you’re performing manual maintenance tasks, or when the node is running long, expensive jobs during which it might not make sense to accept user traffic (e.g. because of increased response times).

Each of these probes can be defined with one of three methods: 1) running a command inside the container (ExecAction), 2) checking for an open port in the container (TCPSocketAction), or 3) testing for a successful response to an HTTP GET request (HTTPGetAction).
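For comparison with the HTTP probes used in the rest of this post, here is a rough sketch of what the other two methods might look like in a container spec. The command, file path and port number are placeholders chosen for illustration, not values from a real deployment:

```yaml
# Hypothetical fragment of a container spec, not a complete manifest

livenessProbe:
  exec:
    # Passes when the command exits with status 0
    command: ["cat", "/tmp/healthy"]
  periodSeconds: 10

readinessProbe:
  tcpSocket:
    # Passes when a TCP connection to this port can be opened
    port: 4000
  periodSeconds: 10
```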

Defining the Probes for your Deployment

For the majority of applications out there, you can get away with defining only the liveness probe, but for the sake of completeness we’ll do all three. Since we’re dealing with Elixir web apps, the obvious choice is using HTTPGetAction to test for a successful web request.

Open your app’s Kubernetes deployment YAML configuration and add the probes:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  # ...
  template:
    # ...
    spec:
      containers:
        - name: my-app
          image: my-app:1.0.0

          # Define the port your app uses
          ports:
            - containerPort: 4000
              name: http
              protocol: TCP

          # Configure the three probes

          startupProbe:
            httpGet:
              path: /health/startup
              port: http
            periodSeconds: 3
            failureThreshold: 5

          livenessProbe:
            httpGet:
              path: /health/liveness
              port: http
            periodSeconds: 10
            failureThreshold: 6

          readinessProbe:
            httpGet:
              path: /health/readiness
              port: http
            periodSeconds: 10
            failureThreshold: 1

       # ...

In the configuration above, we’ve configured the startup probe to run every 3 seconds, up to a maximum of 5 times, giving the app 3 * 5 = 15 seconds to get up and running. I chose the small interval of 3 seconds because Elixir apps generally boot up really quickly, but depending on the other startup tasks your app performs, you may want to increase this value.

The liveness check runs every 10 seconds and tolerates up to 6 consecutive failures (10 * 6 = 60 seconds, or 1 minute), after which the container is deemed failing and is killed and restarted. The readiness probe also runs every 10 seconds, but the pod is marked “unready” as soon as it fails once. Depending on the factors that determine whether your app should service traffic, you might want to increase the threshold here.
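The arithmetic behind these windows can be sketched in a few lines of Elixir. ProbeBudget is a hypothetical helper written for this post, and the result is only an approximation: it multiplies the period by the failure threshold and ignores other settings such as initialDelaySeconds and timeoutSeconds.

```elixir
defmodule ProbeBudget do
  @moduledoc """
  Hypothetical helper: approximates how long a probe can keep failing
  before Kubernetes acts on it (period * failure threshold).
  """

  # Values copied from the deployment configuration above
  @probes %{
    startup: %{period_seconds: 3, failure_threshold: 5},
    liveness: %{period_seconds: 10, failure_threshold: 6},
    readiness: %{period_seconds: 10, failure_threshold: 1}
  }

  @doc "Approximate failure window in seconds for the given probe"
  def seconds(probe) do
    %{period_seconds: period, failure_threshold: threshold} = Map.fetch!(@probes, probe)
    period * threshold
  end
end

IO.inspect(ProbeBudget.seconds(:startup), label: "startup")     # 3 * 5
IO.inspect(ProbeBudget.seconds(:liveness), label: "liveness")   # 10 * 6
IO.inspect(ProbeBudget.seconds(:readiness), label: "readiness") # 10 * 1
```

If your app needs longer to boot, raising failureThreshold on the startup probe widens the window without making the individual checks any slower.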

For more details on defining the different probes and the available options, see the Kubernetes documentation on the topic.

Implementing Probe Handlers in Elixir

First, let’s define a module to determine the different health attributes of the app:

defmodule MyApp.Health do
  @moduledoc """
  Check various health attributes of the application
  """


  @doc """
  Check if required services are loaded and startup
  tasks completed
  """
  def has_started? do
    is_alive?() &&
    MyApp.CacheHelper.finished?() &&            # Startup tasks have been completed
    MyApp.SomeService.status() == :connected    # An external service has connected
  end


  @doc """
  Check if app is alive and working, by making a simple
  request to the DB
  """
  def is_alive? do
    !!Ecto.Adapters.SQL.query!(MyApp.Repo, "SELECT 1")
  rescue
    _e -> false
  end


  @doc """
  Check if app should be serving public traffic
  """
  def is_ready? do
    Application.get_env(:my_app, :maintenance_mode) != :enabled
  end
end
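To see how the readiness check reacts to the maintenance flag, here is a small sketch that flips it at runtime. ReadinessDemo is a hypothetical stand-in for MyApp.Health so the snippet runs on its own; in a real deployment you would presumably set the flag from a remote shell or an admin task rather than a script.

```elixir
# Hypothetical stand-in that mirrors MyApp.Health.is_ready?/0
defmodule ReadinessDemo do
  def is_ready? do
    Application.get_env(:my_app, :maintenance_mode) != :enabled
  end
end

# With no flag set, the app reports itself as ready
IO.inspect(ReadinessDemo.is_ready?(), label: "before maintenance")

# Enabling maintenance mode makes the readiness check fail
Application.put_env(:my_app, :maintenance_mode, :enabled)
IO.inspect(ReadinessDemo.is_ready?(), label: "during maintenance")

# Removing the flag makes the app ready again
Application.delete_env(:my_app, :maintenance_mode)
IO.inspect(ReadinessDemo.is_ready?(), label: "after maintenance")
```

Because the flag lives in the application environment, changing it takes effect on the next probe without restarting the container.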

While we can just expose these via a Phoenix controller, we can make it much more lightweight by just writing a simple Plug. This way they can also be used in both Phoenix and other Plug-based applications.

defmodule MyApp.Health.Plug do
  import Plug.Conn

  # Lets us refer to MyApp.Health as Health below
  alias MyApp.Health

  @behaviour Plug

  @path_startup   "/health/startup"
  @path_liveness  "/health/liveness"
  @path_readiness "/health/readiness"


  # Plug Callbacks

  @impl true
  def init(opts), do: opts

  @impl true
  def call(%Plug.Conn{} = conn, _opts) do
    case conn.request_path do
      @path_startup   -> health_response(conn, Health.has_started?())
      @path_liveness  -> health_response(conn, Health.is_alive?())
      @path_readiness -> health_response(conn, Health.is_ready?())
      _other          -> conn
    end
  end


  # Respond according to health checks

  defp health_response(conn, true) do
    conn
    |> send_resp(200, "OK")
    |> halt()
  end

  defp health_response(conn, false) do
    conn
    |> send_resp(503, "SERVICE UNAVAILABLE")
    |> halt()
  end
end

For Phoenix, just put this in your Endpoint file near the top:

plug(MyApp.Health.Plug)

For other Plug-based apps, you can put this in your Plug.Router before the :match and :dispatch plugs, so that health requests are answered and halted before routing takes place.

Again, these are just examples of the factors that might determine if your app has started, or can be considered “alive” and “ready”. Depending on your requirements, YMMV. For example, for very simple Elixir/Phoenix applications, you might only need the liveness probe. And instead of having a dedicated health module, you can always return OK (meaning that if the app can respond to web requests, it’s considered alive and ready):

defmodule MyApp.SimpleHealthPlug do
  import Plug.Conn

  @behaviour Plug

  @impl true
  def init(opts), do: opts

  @impl true
  def call(%Plug.Conn{request_path: "/health"} = conn, _opts) do
    conn
    |> send_resp(200, "OK")
    |> halt()
  end

  # Pass every other request through untouched
  def call(conn, _opts), do: conn
end
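To check that a plug like this behaves as described, one option is to call it directly with a test connection. This is a rough sketch, assuming Elixir 1.12 or later and network access so that Mix.install can fetch the plug package; SimpleHealthPlugDemo is a hypothetical copy of the module above so the script stands on its own.

```elixir
# Fetches Plug for this script only (assumes network access)
Mix.install([{:plug, "~> 1.14"}])

# Hypothetical copy of the simple health plug
defmodule SimpleHealthPlugDemo do
  import Plug.Conn

  @behaviour Plug

  @impl true
  def init(opts), do: opts

  @impl true
  def call(%Plug.Conn{request_path: "/health"} = conn, _opts) do
    conn
    |> send_resp(200, "OK")
    |> halt()
  end

  def call(conn, _opts), do: conn
end

opts = SimpleHealthPlugDemo.init([])

# A request to /health is answered and halted
health = Plug.Test.conn(:get, "/health") |> SimpleHealthPlugDemo.call(opts)
IO.inspect({health.status, health.resp_body, health.halted}, label: "/health")

# Any other path passes through without a response being set
other = Plug.Test.conn(:get, "/users") |> SimpleHealthPlugDemo.call(opts)
IO.inspect({other.status, other.halted}, label: "/users")
```

The same approach should carry over to an ExUnit test for the three-path plug, with the health functions stubbed or the dependencies running.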

Ending Notes

Like Elixir, Kubernetes can help make your application more resilient and fault-tolerant with some minor configuration. Using these health checks correctly might save you the hassle of “restarting the server” late on a Saturday night (I’ve been there).

There are plenty more resources available on the topic, but I highly recommend the following two: