
Automatically restart struggling Heroku dynos using LogEntries

We have a Rails app hosted on Heroku which periodically develops a memory leak, pushing it well over Heroku’s per-dyno memory quota and slowing everything down as it hits swap. The issue is intermittent and random, only happening every few days, but it’s easy enough to deal with: just restart the dynos. However, it has a habit of happening at night or at weekends (the site is used entirely in the US), which makes it difficult to deal with out of hours.

While we are making efforts to find the cause of the leak, our primary concern is to make sure the site remains usable. To that end, I’ve put together a little something to restart the web dynos automatically, even when it’s the middle of the night for us.

We use the LogEntries service, available as a free plugin for Heroku apps, to monitor our applications. LogEntries tails the logs and triggers alerts based on configurable conditions. It can detect all the Heroku platform errors, such as the one we are interested in, “R14 - Memory quota exceeded”, and send an email, post a Slack notification, or poke a webhook. It seemed logical to use LogEntries to restart the dynos when they got into trouble.
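For reference, the Heroku log line we want LogEntries to match looks like this (the format comes from Heroku’s platform error documentation; the dyno name will vary):

    heroku[web.1]: Error R14 (Memory quota exceeded)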

Restarting the Dynos

To restart our web dynos we create an ActiveJob task, which uses the Heroku Platform API (via the platform-api Ruby gem) to fetch the list of dynos, filter them down to just the running web instances (we’ve never had a problem with the workers), and restart each one in turn.

First, install the Heroku CLI OAuth plugin:

    heroku plugins:install https://github.com/heroku/heroku-oauth

Then create an OAuth token with write privileges (I suggest creating it from a Heroku account that can only access this app) and set it as an environment variable. The job below also reads the app’s name from an APP_NAME config var, so set that at the same time:

    heroku authorizations:create -s write
    heroku config:add RESTART_API_KEY=<API KEY>
    heroku config:add APP_NAME=<APP NAME>
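If you want to check the token works before going any further, you can exercise it from a Rails console using the same platform-api gem the job below relies on. This is just a quick sanity check, assuming the config vars above are set:

require 'platform-api'

heroku = PlatformAPI.connect_oauth(ENV['RESTART_API_KEY'])
heroku.dyno.list(ENV['APP_NAME']).each do |dyno|
  puts "#{dyno['name']} (#{dyno['type']}): #{dyno['state']}"
end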


Now create an ActiveJob task, which we’ve called RestartAppJob.

require 'platform-api'

class RestartAppJob < ActiveJob::Base
  queue_as :restarts

  # Thin wrapper around the dyno info hashes returned by the Platform API
  class Dyno
    attr_accessor :type
    attr_accessor :name
    attr_accessor :state

    def self.connection
      # Memoize a single API client; returns nil (making the job a no-op) if no key is set
      if ENV['RESTART_API_KEY']
        @@connection ||= PlatformAPI.connect_oauth(ENV['RESTART_API_KEY'])
      end
    end

    def self.dynos
      connection.dyno.list(ENV['APP_NAME']).map do |dyno_info|
        Dyno.new(dyno_info)
      end
    end

    def self.running_web_dynos
      # Only web dynos that are currently up; workers are left alone
      dynos.select { |dyno| dyno.web? && dyno.up? }
    end

    def web?
      type == 'web'
    end

    def up?
      state == 'up'
    end

    def connection
      self.class.connection
    end

    def restart!
      connection.dyno.restart(ENV['APP_NAME'], name)
    end

    def initialize(info)
      self.type = info['type']
      self.name = info['name']
      self.state = info['state']
    end
  end

  def perform(*args)
    # No-op unless RESTART_API_KEY is configured
    if Dyno.connection
      Dyno.running_web_dynos.each do |dyno|
        dyno.restart!
      end
    end
  end
end

As you can see, most of the work is done in the Dyno class.

Calling…

    RestartAppJob.perform_later

…will queue up a job to restart your web dynos.
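Note that the job goes onto a restarts queue (via queue_as :restarts at the top of the job), so whichever ActiveJob backend you use needs a worker processing that queue, or the restart will never actually run.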

Triggering the Job

To trigger the job we have a controller action that looks like this…

def restart_web_dynos
  if params[:key] == ENV['RESTART_WEBHOOK_KEY']
    RestartAppJob.perform_later
    render text: 'Restart triggered'
  else
    render text: 'You are not allowed to restart the dynos'
  end
end

You can put this in any controller you think is appropriate, and set up the routes however you like; a sketch follows below. It expects a ‘key’ parameter that matches whatever you set the RESTART_WEBHOOK_KEY environment variable to (I suggest generating a GUID with SecureRandom.uuid).
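For example, a minimal routing sketch (the path and controller name here are placeholders, chosen to match the webhook URL below; adjust the HTTP method to whatever your alerting service actually sends):

# config/routes.rb
match 'foo/restart_web_dynos', to: 'foo#restart_web_dynos', via: [:get, :post]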

With the controller action and route in place, point the webhook action in LogEntries at http://example.com/foo/restart_web_dynos?key=somegibberish.

Now, whenever LogEntries detects the memory quota issue it will call the webhook, which will schedule the job, which restarts the dynos. You could extend this to other events or monitoring services easily enough.

Caveat

Obviously this relies on at least one dyno still being functional. We tend to find that while the app slows down when it hits the quota, it doesn’t actually stop, so this approach works for us. However, if your dynos stop responding entirely, you will need to host this code somewhere separate.