We have a Rails app hosted on Heroku which periodically develops a memory leak, pushing it well over Heroku’s per-dyno memory quota and slowing everything down as it hits swap. The issue is intermittent and only happens every few days, but it’s easy enough to deal with: just restart the dynos. However, it has a habit of happening at night or at weekends (the site is used entirely in the US), which makes it difficult to deal with out of hours.
While we are making efforts to find the cause of the leak, our primary concern is to make sure the site remains usable. To that end, I’ve put together a little something to restart the web dynos automatically, even when it’s the middle of the night for us.
We use the LogEntries service, available as a free add-on for Heroku apps, to monitor our applications. LogEntries tails the logs and triggers alerts based on configurable conditions. It can detect all the Heroku platform errors, such as the one we are interested in, “R14 (Memory quota exceeded)”, and then send an email or Slack notification, or poke a webhook. It seemed logical to use LogEntries to restart the dynos when they got into trouble.
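To make the alert condition concrete, here is a minimal Ruby sketch of the kind of pattern match involved. It assumes the log line format Heroku documents for R14 errors; the `r14_dyno` helper and the sample lines are made up for illustration and are not part of LogEntries.

```ruby
# Matches Heroku platform lines such as:
#   heroku[web.1]: Error R14 (Memory quota exceeded)
R14_PATTERN = /heroku\[(?<dyno>[\w-]+\.\d+)\]: Error R14 \(Memory quota exceeded\)/

# Returns the dyno name (e.g. "web.1") if the line reports an R14 error, else nil.
def r14_dyno(line)
  match = R14_PATTERN.match(line)
  match && match[:dyno]
end

# Illustrative sample lines (timestamps are invented).
puts r14_dyno('2016-03-01T02:14:07+00:00 heroku[web.1]: Error R14 (Memory quota exceeded)').inspect
puts r14_dyno('2016-03-01T02:14:08+00:00 app[web.1]: Completed 200 OK').inspect
```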
Restarting the Dynos
To restart our web dynos we create an ActiveJob task, which uses the Heroku Platform API (Ruby gem) to fetch the list of dynos, filter them down to just the running web instances (we’ve never had a problem with the workers), and restart each one in turn.
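First install the Heroku CLI OAuth plugin (recent versions of the Heroku CLI include the authorizations commands, in which case this step should not be needed):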
heroku plugins:install https://github.com/heroku/heroku-oauth
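Then create an OAuth token with write privileges (I suggest you create the token with a Heroku account that can only access this app) and set it as an environment variable. The job below also expects an APP_NAME environment variable holding the name of the Heroku app.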
heroku authorizations:create -s write
heroku config:add RESTART_API_KEY=<API KEY>
Now create an ActiveJob task, which we’ve called RestartAppJob.
require 'platform-api'

class RestartAppJob < ActiveJob::Base
  queue_as :restarts

  # Thin wrapper around the dyno info returned by the Heroku Platform API.
  class Dyno
    attr_accessor :type, :name, :state

    # Returns a memoized API client, or nil if no API key is configured.
    def self.connection
      return unless ENV['RESTART_API_KEY']
      @connection ||= PlatformAPI.connect_oauth(ENV['RESTART_API_KEY'])
    end

    # All dynos belonging to the app named in APP_NAME.
    def self.dynos
      connection.dyno.list(ENV['APP_NAME']).map do |dyno_info|
        Dyno.new(dyno_info)
      end
    end

    # Only the web dynos that are currently up.
    def self.running_web_dynos
      dynos.select { |dyno| dyno.web? && dyno.up? }
    end

    def initialize(info)
      self.type = info['type']
      self.name = info['name']
      self.state = info['state']
    end

    def web?
      type == 'web'
    end

    def up?
      state == 'up'
    end

    def connection
      self.class.connection
    end

    def restart!
      connection.dyno.restart(ENV['APP_NAME'], name)
    end
  end

  # Restarts each running web dyno in turn; does nothing without an API key.
  def perform(*args)
    return unless Dyno.connection
    Dyno.running_web_dynos.each(&:restart!)
  end
end
As you can see, most of the work is done in the Dyno class.
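To see the filtering in isolation, the sketch below applies the same type and state checks to a hand-written list of hashes shaped like the `type`, `name` and `state` fields the job reads. The sample data and the `running_web_dyno_names` helper are invented for illustration, and no API call is made.

```ruby
# Hypothetical sample of a dyno list (only the three fields the job reads).
SAMPLE_DYNOS = [
  { 'type' => 'web',    'name' => 'web.1',    'state' => 'up' },
  { 'type' => 'web',    'name' => 'web.2',    'state' => 'starting' },
  { 'type' => 'worker', 'name' => 'worker.1', 'state' => 'up' }
].freeze

# Same rule as Dyno.running_web_dynos: web dynos that are currently up.
def running_web_dyno_names(dynos)
  dynos.select { |d| d['type'] == 'web' && d['state'] == 'up' }
       .map { |d| d['name'] }
end

puts running_web_dyno_names(SAMPLE_DYNOS).inspect # prints ["web.1"]
```

Only `web.1` is selected: `web.2` is not yet up, and the worker is skipped because of its type.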
Calling…
RestartAppJob.perform_later
…will queue up a job to restart your web dynos.
Triggering the Job
To trigger the job we have a controller action that looks like this…
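def restart_web_dynos
  expected = ENV['RESTART_WEBHOOK_KEY'].to_s
  # Refuse when no key is configured; otherwise a request with no key would match nil.
  # secure_compare avoids leaking the key through response timing.
  if expected.present? && ActiveSupport::SecurityUtils.secure_compare(params[:key].to_s, expected)
    RestartAppJob.perform_later
    render plain: 'Restart triggered'
  else
    render plain: 'You are not allowed to restart the dynos', status: :forbidden
  end
end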
You can put this in any controller you think is appropriate, and set up the routes however you like. It expects a ‘key’ parameter that matches the value of the RESTART_WEBHOOK_KEY environment variable (I suggest generating a GUID using the SecureRandom library, for example with SecureRandom.uuid).
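A minimal sketch of generating such a key with Ruby’s standard `SecureRandom` module follows. The `generate_webhook_key` name is just a placeholder for this example; either a UUID or a longer hex string should work as the shared secret.

```ruby
require 'securerandom'

# Returns a random UUID string, suitable as a value for RESTART_WEBHOOK_KEY.
def generate_webhook_key
  SecureRandom.uuid
end

# Alternative: 32 random bytes rendered as 64 hexadecimal characters.
def generate_long_webhook_key
  SecureRandom.hex(32)
end

puts generate_webhook_key
puts generate_long_webhook_key
```

You would then store the generated value with `heroku config:add RESTART_WEBHOOK_KEY=<GENERATED KEY>`, in the same way as the API key earlier.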
With the controller action in place you can set the webhook action in LogEntries to point to http://example.com/foo/restart_web_dynos?key=somejibberish.
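If the key might contain characters that are not URL-safe, it is worth building the webhook URL programmatically. The sketch below uses Ruby’s standard `uri` library; the `webhook_url` helper is hypothetical and the base URL is the example one from above.

```ruby
require 'uri'

# Builds the webhook URL for a base URL and key, escaping the key as a query parameter.
def webhook_url(base, key)
  uri = URI.parse(base)
  uri.query = URI.encode_www_form(key: key)
  uri.to_s
end

puts webhook_url('http://example.com/foo/restart_web_dynos', 'somejibberish')
```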
Now, whenever LogEntries detects the memory quota issue it will call the webhook, which will schedule the job, which restarts the dynos. You could extend this to other events or monitoring services easily enough.
Caveat
Obviously this relies on at least one web dyno still being functional enough to receive the webhook, and on a worker being available to run the queued job. We tend to find that while the app slows down when it hits the quota, it doesn’t actually stop, so this approach is OK. However, if you have dynos that stop responding entirely, you will need to host this code separately.
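As a rough sketch of what hosting this separately could look like, the restart logic can be pulled out of Rails into a plain Ruby function that accepts any client exposing `dyno.list` and `dyno.restart`, the two calls the job above already uses. The `restart_running_web_dynos` name, the `FakeDynoEndpoint` and `FakeClient` stand-ins, and the `my-app` name are all hypothetical; in a real script you would pass `PlatformAPI.connect_oauth(ENV['RESTART_API_KEY'])` instead of the fake.

```ruby
# Restarts every running web dyno of +app+ and returns the names restarted.
# +client+ is anything that responds to dyno.list(app) and dyno.restart(app, name).
def restart_running_web_dynos(client, app)
  running_web = client.dyno.list(app).select do |d|
    d['type'] == 'web' && d['state'] == 'up'
  end
  running_web.map do |d|
    client.dyno.restart(app, d['name'])
    d['name']
  end
end

# Stand-in for the Platform API dyno endpoint so the sketch runs without network access.
class FakeDynoEndpoint
  attr_reader :restarted

  def initialize(dynos)
    @dynos = dynos
    @restarted = []
  end

  def list(_app)
    @dynos
  end

  def restart(_app, name)
    @restarted << name
  end
end

# Minimal client exposing a #dyno accessor, like the real API client does.
FakeClient = Struct.new(:dyno)

endpoint = FakeDynoEndpoint.new([
  { 'type' => 'web',    'name' => 'web.1',    'state' => 'up' },
  { 'type' => 'worker', 'name' => 'worker.1', 'state' => 'up' }
])
puts restart_running_web_dynos(FakeClient.new(endpoint), 'my-app').inspect
```

Because the client is passed in, the same function could be called from a small standalone script or a separate app, independently of the dynos it is restarting.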


