Deploying a Microservice With Ansible
The Problem
Some time ago, happily tending our app behind the tall walls of Heroku-town, we faced a problem: long-running requests to a Heroku dyno block it (or at least one process on the dyno) from processing any further requests, but Heroku’s router may still send requests to that dyno, resulting in timeouts.
Our app processes image uploads from both web and mobile clients and does some thumbnailing of the images, so direct-to-S3 uploads were not an option; we needed a server outside of Heroku to handle these uploads.
The Quick Fix
One of our engineers whipped up a Node.js app which would:
- Accept a file upload
- Mimic the thumbnailing processes performed by Paperclip on our Rails app
- Send the files to S3
- Call back to the Rails app with the image metadata, so we could create an Image record on the server
Everything worked nicely and we stopped seeing timeouts on our Rails app.
The Problems With The Fix
Everyone on our team at the time had some basic knowledge of Node and intermediate Linux admin skills. But since these image uploads were critical to our user experience, intermediate was not enough. Any bug in this service was potentially critical, so any botched configuration – whether it was the bug itself or part of someone’s attempt to fix a bug – could have quite a negative impact. So the config files became scary to touch, and we wanted to monitor the service carefully for some time after any change to them.
Deployment was also not fully automated, so if anyone forgot a step, it could result in bugs.
The Ideal Solution
After some high stress bugfixes and unacceptably long outages, we came up with a list of requirements (a few of which had been satisfied by the original developer, hence the config files I mentioned). The service should:
- Be easy to deploy to either a running instance or a clean box.
- Bring itself back up if it failed.
- Manage its logs.
- Allow for easy restoration of any config files modified during troubleshooting.
The tool we chose for this (and more, detailed below) was Ansible. Here’s the best introductory guide I’ve seen.
I’d like to highlight a few excerpts from our Ansible playbook because, while this is not a full tutorial and far from a “best practices” guide, it does cover a few things that might get left out of other beginner-level guides.
Ansible tips
Installing updated packages
Install versions of nginx and node that were released after, say, the closing of the Western frontier:
playbook.yml
- name: Package prerequisites for node.js
  action: apt pkg=python-software-properties state=installed

- name: Add the node.js PPA
  apt_repository: repo="ppa:chris-lea/node.js"

- name: Add the nginx PPA
  apt_repository: repo="ppa:nginx/stable"
Installing and configuring upstart
Install upstart, which allows you to easily start and stop processes, and check their status, like so:
sudo status image-manager
sudo restart image-manager
sudo stop image-manager
sudo start image-manager
The docs are thorough, if a bit intimidating.
playbook.yml
- name: install upstart
  apt: pkg=upstart state=latest

- name: (PROD) copy upstart config file to /etc/init
  copy: src=config/image-manager.conf dest=/etc/init/image-manager.conf
The config file:
#!upstart
description "Image Manager Production App"

start on startup
stop on shutdown

console output

# Important: this means upstart will expect the process it is
# managing to call `fork` exactly twice. For our app, I
# distinguished between this and "expect fork" by trial and error.
#
# see http://upstart.ubuntu.com/cookbook/#expect-daemon
expect daemon

script
  export HOME="/home/ubuntu"
  cd $HOME/image-manager
  sudo NODE_ENV=production \
    PORT=5555 \
    /usr/bin/node /home/ubuntu/image-manager/app.js >> \
    /var/log/image-manager.log &
end script

# Upstart has to know about the process id of all the processes it monitors.
# This is the only way I could find to get that.
post-start script
  upstart_pid=$(status image-manager | awk '{print $NF}')
  sudo echo $upstart_pid > /var/run/image-manager.pid
end script
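Not shown in the excerpt above: once the upstart job is in place, Ansible’s service module (which understands upstart jobs on Ubuntu) can make sure it is running, and a handler can restart it whenever the config file changes. A minimal sketch, with task names of my own choosing:

- name: ensure image-manager is running
  service: name=image-manager state=started

# add "notify: restart image-manager" to the copy task above so this
# handler only fires when /etc/init/image-manager.conf actually changes
handlers:
  - name: restart image-manager
    service: name=image-manager state=restarted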
Installing ntpd
This is the Network Time Protocol daemon, and installing it will prevent clock drift and ensure that, say, S3 won’t decide to stop talking to your server during the second half of the BCS Championship game because your server thinks it’s living 16 minutes in the future.
playbook.yml
- name: ensure ntpd is at the latest version
  apt: pkg=ntp state=latest
  notify:
    - restart ntpd

#...

handlers:
  - name: restart ntpd
    service: name=ntp state=restarted
This uses an Ansible handler. A handler is basically the same as a task, but it only runs when notified by another task, like a callback.
Setting up logrotate
Don’t let your logs fill up your hard drive:
playbook.yml
- name: install logrotate
  apt: pkg=logrotate state=latest

- name: copy logrotate config file
  copy: src=config/logrotate dest=/etc/logrotate.d/image-manager
/etc/logrotate.d/image-manager
/var/log/*image*.log {
  daily
  compress
  copytruncate
  size 2M
  rotate 4
}
Setting up monit
Monit will periodically check the status of your services and bring them back up as needed, optionally also sending you alerts or taking other actions if it detects that a service is down. In this sense it has some overlap with upstart, but I find it easier to use, and it makes an actual HTTP request instead of just checking the pid file. I like to stick to upstart just for enabling the start/stop commands and starting services on server startup.
playbook.yml
- name: copy monit config file
  copy: src=config/monitrc dest=/etc/monit/monitrc mode=0700

# this might not be needed, I forgot why it's here
- name: reload monit
  command: monit reload

- name: monitor collage service
  monit: name=image-manager state=monitored
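One thing this excerpt glosses over: monit itself has to be installed first. That task isn’t shown above, but mirroring the other packages it would presumably be something like:

- name: install monit
  apt: pkg=monit state=latest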
We set it up to check the service every minute, with a 10 second timeout.
The config file:
set logfile /var/log/monit.log
set daemon 60

# Even if you don't use the web interface for monit,
# this has to be set up:
set httpd port 2813 and
    use address localhost
    allow localhost

check process image-manager with pidfile "/var/run/image-manager.pid"
  start program = "/sbin/start image-manager"
  stop program = "/sbin/stop image-manager"
  if failed port 5555 protocol HTTP
    request /
    with timeout 10 seconds
  then restart
Uptime monitoring
We use pingdom.com for this. We could set up monit to send an email when the process goes down, but we also want notifications if the whole box is unreachable.
Other notes
This isn’t really part of the playbook, and probably goes without saying, but if you don’t explicitly specify the versions of your dependencies, you don’t have a repeatable deployment process.
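With the apt module, pinning is just a matter of putting the version in the package name; the version string below is only an illustration, not one we actually used:

- name: install a pinned nginx version
  apt: pkg=nginx=1.4.6-1ubuntu3 state=present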
There’s a lot more that can be done with Ansible, including just for organizing a simple playbook like ours. It’s a great tool that we will use more in the future.
Deployment
To deploy, we have a hosts file which specifies the machine(s) that the Ansible playbook can reference; it looks like this:
[aws]
ec2-123456.compute-1.amazonaws.com ansible_ssh_private_key_file=secret.pem
And to deploy we can run:
ansible-playbook -i hosts playbook.yml -u ubuntu --extra-vars "env=prod"
This will ensure all our config files match what we have in our playbook, pull down the latest master branch from the service’s GitHub repo, and restart the image-manager service.
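Neither the git pull nor the env switch appears in the excerpts above; a rough sketch of what they could look like (the repo URL and paths here are placeholders):

- name: pull down the latest master branch of image-manager
  git: repo=git@github.com:ourorg/image-manager.git dest=/home/ubuntu/image-manager version=master

# the env extra-var can then gate environment-specific tasks, e.g. the
# "(PROD)" upstart config task shown earlier could carry:
#   when: env == "prod"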
If we need to deploy to a new AWS box, we can run:
vagrant up web --provider=aws
This is set up through Vagrant, and the config looks like this:
web.vm.provider :aws do |aws, override|
  aws.access_key_id = "secret"
  aws.secret_access_key = "secret"
  # ubuntu AMI
  aws.ami = "ami-1d8c9574"
  aws.instance_type = "m3.medium"
  aws.keypair_name = "secret"
  aws.security_groups = ["quicklaunch-1"]
  aws.tags = {
    'Name' => 'Img Manager'
  }
  web.vm.box = "dummy"
  web.vm.box_url = "https://github.com/mitchellh/vagrant-aws/raw/master/dummy.box"
  web.ssh.username = "ubuntu"
  web.ssh.private_key_path = "secret.pem"
end