Monitoring our Stack with Elixir

  • Will Raxworthy
  • Jul 14, 2016

Walk into any office with developers in it, and chances are you’ll see some sort of system status dashboard. For historic reasons that no one can remember, ours is called Walnut.

Walnut. No one knows why.

First, a history lesson. The previous version ran entirely in the front end. It didn’t need a backend, so you could run it locally with no deployment, and it was only 100 lines of code. As time went on, its limitations really started to show. We took on more remote workers, and running Walnut locally meant you needed access to all the API keys for our various services and had to run an insecure browser to allow cross-site requests. We also wanted to switch from Mac Minis to Chromebits in kiosk mode for space and hassle reasons. This meant we couldn’t really run Walnut without some sort of proxy service involved, which isn’t ideal.

The idea behind Walnut is simple: when you look at the screen, you should get an instant look at the state of our stack. To this end, each box on the screen represents a third-party service, a CI build or one of our applications, and has one of three states: green, red or grey. Green and red are pretty self-explanatory. Either everything is a-ok 👌 or it’s panic stations 🚨. Grey is mainly for our CI builds, which can be marked as in progress or pending while we wait for them to finish and be deployed to production.

Under the hood, each square is actually an Elixir process that’s responsible for monitoring that site or service. This is where Elixir / Erlang really shines. Running each site as a process means:

- We can add more without ever slowing down or blocking other requests.

- The processes can be monitored and the site marked red if the process crashes for whatever reason.

- We can run it on one Heroku hobby dyno.
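To make the process-per-site idea concrete, here is a minimal sketch. It is not Walnut’s actual code: the `WalnutSketch.Board` module, its `run/2` function and the `check` callback are names made up for illustration, and it runs a single round of checks rather than polling continuously.

```elixir
defmodule WalnutSketch.Board do
  # Hypothetical module for illustration; not Walnut's real implementation.

  # Runs `check` for every site in its own monitored process and returns
  # a map of site => :ok | :notok | :pending.
  def run(sites, check) do
    parent = self()

    refs =
      Map.new(sites, fn site ->
        # Each site gets its own process, so a slow or crashing check
        # cannot block the others.
        {_pid, ref} = spawn_monitor(fn -> send(parent, {:status, site, check.(site)}) end)
        {ref, site}
      end)

    collect(refs, %{})
  end

  # Every monitored process has finished: return what we gathered.
  defp collect(refs, results) when map_size(refs) == 0, do: results

  defp collect(refs, results) do
    receive do
      {:status, site, status} ->
        collect(refs, Map.put(results, site, status))

      # Normal exit: the status message has already been received.
      {:DOWN, ref, :process, _pid, :normal} when is_map_key(refs, ref) ->
        collect(Map.delete(refs, ref), results)

      # The checking process crashed: mark the site red.
      {:DOWN, ref, :process, _pid, _reason} when is_map_key(refs, ref) ->
        collect(Map.delete(refs, ref), Map.put(results, Map.fetch!(refs, ref), :notok))
    end
  end
end

# Stand-in check function; "search" simulates a crashing check.
check = fn
  "billing" -> :ok
  "ci" -> :pending
  "search" -> exit(:boom)
end

WalnutSketch.Board.run(["billing", "ci", "search"], check)
```

In this sketch the crashed "search" check ends up as `:notok`, while the other two sites keep the status their check returned.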

Unfortunately, not every site offers a nice JSON API, so we had to account for practically any type of HTTP request and response. As a result, the site “recipes” are extremely flexible. Here’s the recipe for our CI builds:

defmodule Site.CircleCi do
  use Status.Site

  # Parse the JSON body and map the most recent build to a board state.
  def handle_response(response) do
    response.body
    |> Poison.Parser.parse!()
    |> determine_app_state()
  end

  # Only ask for the latest build on master.
  defp url(site) do
    "https://circleci.com/api/v1/project/#{site.app}/tree/master?limit=1&circle-token=#{System.get_env("CIRCLECI_TOKEN")}"
  end

  # Build is in progress or waiting: grey.
  def determine_app_state([%{"status" => "running"} | _]), do: :pending
  def determine_app_state([%{"status" => "queued"} | _]), do: :pending
  def determine_app_state([%{"status" => "not_running"} | _]), do: :pending

  # Build passed, or was not run: green.
  def determine_app_state([%{"status" => "success"} | _]), do: :ok
  def determine_app_state([%{"status" => "fixed"} | _]), do: :ok
  def determine_app_state([%{"status" => "not_run"} | _]), do: :ok

  # Anything else, including an unexpected payload: red.
  def determine_app_state(_), do: :notok
end

And here is a recipe for Heroku.

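defmodule Site.Heroku do
  use Status.Site

  # Parse the JSON body and map Heroku's reported status to a board state.
  def handle_response(response) do
    response.body
    |> Poison.Parser.parse!()
    |> determine_app_state()
  end

  # Green only when both Production and Development report "green".
  def determine_app_state(%{"status" => %{"Production" => "green", "Development" => "green"}}), do: :ok
  # Anything else: red.
  def determine_app_state(_), do: :notok
end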

As you can see, the two checks are quite different but follow a similar pattern. In essence, each process requests a URL listed in a configuration file and hands the response to its recipe. If the request fails for whatever reason, the site is automatically marked as bad without the recipe needing to do anything.
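The request-then-recipe flow can be sketched roughly like this. It is a simplified illustration, not Walnut’s actual code: `WalnutSketch.Checker`, `WalnutSketch.BodyRecipe` and the `fetch` function (a stand-in for a real HTTP client that returns `{:ok, response}` or `{:error, reason}`) are all assumptions.

```elixir
defmodule WalnutSketch.Checker do
  # Hypothetical module. `fetch` stands in for the real HTTP client call.
  def check(url, recipe, fetch) do
    case fetch.(url) do
      # The request succeeded: let the recipe decide what the response means.
      {:ok, response} -> recipe.handle_response(response)
      # The request failed: the site is bad and the recipe is never called.
      {:error, _reason} -> :notok
    end
  end
end

defmodule WalnutSketch.BodyRecipe do
  # Made-up recipe for a service that answers with the plain text "up".
  def handle_response(%{body: "up"}), do: :ok
  def handle_response(_), do: :notok
end

# Fake fetch functions so the sketch runs without network access.
up = fn _url -> {:ok, %{body: "up"}} end
unreachable = fn _url -> {:error, :timeout} end

WalnutSketch.Checker.check("https://example.com/health", WalnutSketch.BodyRecipe, up)
WalnutSketch.Checker.check("https://example.com/health", WalnutSketch.BodyRecipe, unreachable)
```

The first call returns `:ok` and the second returns `:notok`, without the recipe having to know anything about failed requests.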

In the first iteration, we figured that sending a request and getting a 200 status back would mean that everything was a-ok. How naive. We quickly realised that this doesn’t work at all for Heroku… or really any other service. So we switched to a recipe-style structure. Now, whether we need to parse JSON, CSS or just check for a 200 status, we have the ability to do so.
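For services where a simple check is enough, a recipe can stay tiny. The two modules below are hypothetical examples rather than recipes from Walnut itself: they assume the response is a map with `status_code` and `body` keys, and the "All systems operational" text is invented for the example.

```elixir
defmodule WalnutSketch.StatusOnly do
  # Simplest possible recipe: trust the HTTP status code.
  def handle_response(%{status_code: 200}), do: :ok
  def handle_response(_), do: :notok
end

defmodule WalnutSketch.PlainText do
  # Recipe for an imaginary status page that answers in plain text.
  def handle_response(%{status_code: 200, body: body}) when is_binary(body) do
    if String.contains?(body, "All systems operational"), do: :ok, else: :notok
  end

  def handle_response(_), do: :notok
end
```

Both follow the same shape as the JSON recipes: pattern match on the parts of the response that matter and fall through to `:notok` for everything else.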

A TV dashboard is great for an instant spot check of our stack, but it breaks down because, in reality, we aren’t looking at the board 24/7.

To help with this, we used Elixir’s GenEvent. With GenEvent, we can create different modules that respond to certain events. With an event manager in place, we can add a GenEvent handler that listens for :notok events and opens tickets in PagerDuty. Likewise, we can resolve the ticket when the issue is over.

# Status unchanged: the repeated `status` variable only matches when the
# site's current status equals the previous one.
def site_checked(site = %{status: status}, status) do
  payload = {:nochange, site, %{}}
  GenServer.cast(__MODULE__, {:notify, {:site_checked, payload}})
end

# Status changed and the site carries a reason: include it in the payload.
def site_checked(site = %{status: new_status, reason: reason}, old_status) do
  payload = {:change, site, %{from: old_status, to: new_status, reason: reason}}
  GenServer.cast(__MODULE__, {:notify, {:site_checked, payload}})
end

# Status changed without a reason.
def site_checked(site = %{status: new_status}, old_status) do
  payload = {:change, site, %{from: old_status, to: new_status}}
  GenServer.cast(__MODULE__, {:notify, {:site_checked, payload}})
end

The Event Manager

  def handle_event({:site_checked, {:change, site, payload}}, state) do
    %{from: old_status, to: new_status} = payload
    trigger_pagerduty(site, new_status, old_status)
    {:ok, state}
  end
  def handle_event(_, state), do: {:ok, state}

PagerDuty Event Listener
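For a self-contained picture of how a handler like this plugs into an event manager, here is a rough sketch. Note the swap: it uses Erlang’s `:gen_event` directly rather than the GenEvent module our code uses, and instead of calling PagerDuty it sends messages to a `notify_pid`. The `WalnutSketch.PagerHandler` module and the `{:trigger, site}` / `{:resolve, site}` messages are made up for the example.

```elixir
defmodule WalnutSketch.PagerHandler do
  # Hypothetical handler; messages to `notify_pid` stand in for real
  # PagerDuty API calls.
  @behaviour :gen_event

  @impl true
  def init(notify_pid), do: {:ok, notify_pid}

  # A site went bad: open an incident.
  @impl true
  def handle_event({:site_checked, {:change, site, %{to: :notok}}}, notify_pid) do
    send(notify_pid, {:trigger, site})
    {:ok, notify_pid}
  end

  # A bad site recovered: resolve the incident.
  def handle_event({:site_checked, {:change, site, %{from: :notok, to: :ok}}}, notify_pid) do
    send(notify_pid, {:resolve, site})
    {:ok, notify_pid}
  end

  # Ignore everything else, including :nochange events.
  def handle_event(_event, notify_pid), do: {:ok, notify_pid}

  @impl true
  def handle_call(_request, notify_pid), do: {:ok, :ok, notify_pid}
end

# Start an event manager, attach the handler and publish a change.
{:ok, manager} = :gen_event.start_link()
:ok = :gen_event.add_handler(manager, WalnutSketch.PagerHandler, self())

event = {:site_checked, {:change, "api", %{from: :ok, to: :notok}}}
:ok = :gen_event.sync_notify(manager, event)
```

After the `sync_notify/2` call returns, the calling process has a `{:trigger, "api"}` message in its mailbox. Adding another reaction to the same events, such as a Slack notification, would just be another handler module attached to the same manager.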

We also have events that broadcast changes over WebSockets to any connected clients (including our Slack bot), so that all clients have the correct data in real time. From here we could hook into site events and trigger any type of action we wanted. There are a few gotchas with GenEvent, but I think for this particular use case, it works well. We would most likely look to move to GenStage once it’s closer to release.

The Future

From here, we’d like to focus on the design of the board. As more services are added, it becomes increasingly hard to see exactly what is failing, and we’d like to add more arbitrary information, like who is on support for that day.

Collecting statistics about each request (uptime, time to first byte, etc.), storing them in a database and then using D3 to create some meaningful insights into the data is also something that we’d like to explore.

Lastly, I’m hoping that this could be something we open source. For now, it has some custom integrations that are unique to us, and extracting them would be the first step towards open sourcing the status board.


Will Raxworthy joined AlphaSights in March 2015 and worked as a Software Engineer in our London office.