90 percent immutable

By Jake Morrison in DevOps on Fri 10 June 2016

After a fair amount of debugging, I got an app running in an AWS Auto Scaling Group (ASG), pulling its config on startup from S3 and its code from Amazon CodeDeploy. Along the way I ran into some annoying parts of the cloud initialization process in AWS.

The idea is that we can build a "generic" AMI which has the application dependencies, then when the ASG starts up an instance, it will get the latest code and configuration.

Initially, I was planning to use the instance tags to hold bootstrap configuration information, e.g. whether the instance is running in the staging or production environment. That would let the instance get its config from the right S3 bucket, contact the right RDS instance, etc.

I wrote some systemd init scripts, one of which reads the EC2 instance metadata and tags and writes the data to files on disk, both as JSON and as a shell include file. Another script syncs the data from S3 to the local disk, and a third starts the application. Getting the dependencies between these scripts set up properly ended up being tedious, and it was hard to get them running reliably.
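
In case it helps picture the setup, here is a rough sketch of what the metadata/tags script does, using boto3 and the instance metadata endpoint. The file paths and the "env" tag name are just examples, not the real ones:

import json
import urllib.request

import boto3  # assumes an instance role that allows ec2:DescribeTags

METADATA = "http://169.254.169.254/latest/meta-data"

def metadata(path):
    # Read one value from the EC2 instance metadata service
    with urllib.request.urlopen("%s/%s" % (METADATA, path), timeout=2) as resp:
        return resp.read().decode()

instance_id = metadata("instance-id")
az = metadata("placement/availability-zone")
region = az[:-1]  # "us-east-1a" -> "us-east-1"

# Tags come back empty while the instance is still in the Waiting state
ec2 = boto3.client("ec2", region_name=region)
tags = {t["Key"]: t["Value"]
        for t in ec2.describe_tags(
            Filters=[{"Name": "resource-id", "Values": [instance_id]}])["Tags"]}

data = {"instance_id": instance_id, "region": region, "tags": tags}

# JSON for programs, shell include for scripts (example paths)
with open("/etc/foo-app/instance.json", "w") as f:
    json.dump(data, f, indent=2)

with open("/etc/foo-app/instance.sh", "w") as f:
    f.write('INSTANCE_ID="%s"\nREGION="%s"\nENV="%s"\n'
            % (instance_id, region, tags.get("env", "")))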

The fundamental problem is that it takes some time for the metadata to be available after the machine starts. And, critically, instance tags are not available until after CodeDeploy runs.

As part of the deployment process, CodeDeploy takes an instance out of the auto-scaling group, upgrades it, checks that it's healthy, then puts it back. When CodeDeploy launches a fresh instance, it puts it in the Waiting state, then loads the code, then enables it. But instance tags are not available in the Waiting state; they only show up once the instance is ready. So you can call boto and read the tags, but they won't be there: the list comes back empty. And you can't just wait for them, because they won't appear until the instance has started successfully.

The next problem is that it takes some time for the basic metadata to be available. The startup script may run, make an HTTP request to the metadata service, and get nothing back, so it needs to sleep and retry.
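
The retry loop itself is simple enough, something like the following sketch (the retry count and delay are arbitrary):

import time
import urllib.error
import urllib.request

def wait_for_metadata(path, retries=30, delay=2):
    # Poll the metadata service until the value shows up, or give up
    url = "http://169.254.169.254/latest/meta-data/" + path
    for _ in range(retries):
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                return resp.read().decode()
        except (urllib.error.URLError, OSError):
            time.sleep(delay)
    raise RuntimeError("metadata %s not available after %d tries" % (path, retries))

instance_id = wait_for_metadata("instance-id")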

What I ended up doing was putting the parameters into the user_data field as base64-encoded JSON, which is available, though you may have to wait a while to get it. And I set up hard dependencies between the startup scripts: the metadata script runs first, then the S3 sync script (which needs the metadata to know which bucket to read from), then the application startup. Yay systemd.
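
Decoding that in the startup script is straightforward. A minimal sketch, assuming the user_data content is the base64-encoded JSON blob described above, with made-up keys:

import base64
import json
import urllib.request

# user_data is assumed to be base64-encoded JSON put there by the launch config
with urllib.request.urlopen("http://169.254.169.254/latest/user-data",
                            timeout=2) as resp:
    params = json.loads(base64.b64decode(resp.read()).decode())

env = params["env"]                  # e.g. "staging" or "production"
config_bucket = params["s3_bucket"]  # bucket the S3 sync script pulls from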

And to make startup faster, I added manual calls so that each script runs the previous one to fetch its dependencies if they are not already there.

Making this even more fun to debug is that everything works fine on a standalone instance but has problems in the auto-scaling group. When the instance fails in the ASG, the app doesn't start up, so it fails the health check, gets shut down, and another instance is launched, over and over again. So while you are debugging, you ssh in, the instance is abruptly terminated out from under you, and you wait for the next instance to start so you can try again. And each debug cycle takes 20 minutes as you build an AMI and deploy it. Sigh.

After all that, I am tempted to use the same lifecycle hook that CodeDeploy uses to drive Ansible instead: an instance starts up, and the hook pushes a message to SNS/SQS. A Python script sits on the ops server waiting for that message; when it arrives, it runs an Ansible playbook on the instance to configure it and deploy the code.

See the lifecycle hooks docs for details.
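
A bare-bones version of that listener might look like this. The queue URL, playbook name, and inventory setup are placeholders, the message format depends on whether the hook goes through SNS first, and it skips completing the lifecycle action, which a real script would have to do:

import json
import subprocess

import boto3

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/asg-launch-hook"
sqs = boto3.client("sqs", region_name="us-east-1")

while True:
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1,
                               WaitTimeSeconds=20)
    for msg in resp.get("Messages", []):
        body = json.loads(msg["Body"])
        # Lifecycle hook notifications carry the instance id; if the message
        # came via SNS it is wrapped one level deeper in body["Message"]
        instance_id = body.get("EC2InstanceId")
        if instance_id:
            # Run the playbook against just this instance, using the EC2
            # dynamic inventory (ec2.py) to find it
            subprocess.call(["ansible-playbook", "-i", "ec2.py",
                             "--limit", instance_id, "foo-app.yml"])
        sqs.delete_message(QueueUrL=QUEUE_URL,
                           ReceiptHandle=msg["ReceiptHandle"]) if False else \
            sqs.delete_message(QueueUrl=QUEUE_URL,
                               ReceiptHandle=msg["ReceiptHandle"])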

ANSIBLE ALL THE THINGS!

This is a case where the Chef "pull" model might be more convenient, though in general I like Ansible a lot. I find that Ansible is better at creating the instances in the first place. I like the fact that Ansible doesn't need an agent and is basically just a list of canned tasks. The list of tasks is comprehensive, and it's easy enough to define your own. Ansible is also natural for the systems admins on the team.

So here we are at something like 90% "immutable infrastructure". I could go all the way and burn the deps, env-specific config and code into an AMI and launch it from the ASG.

Packer makes it quite easy to build AMIs, and I am kinda tempted at this point, but it's still slow enough to be annoying.

Adding the runtime configuration would mean putting "secrets" into the AMI. I didn't find a really satisfactory way of passing the vault password into Ansible. By not satisfactory, I mean it worked, but it was a bit hackish: you somehow get the password into an env var, which gets passed into the Ansible run, which means it's visible in your terminal. Or, if you are paranoid, you can use temp files which you hopefully delete. So choose your poison.

Here is an example Packer file:

{
    "variables":{
        "pass": "{{env `ANSIBLE_VAULT_PASS`}}"
    },
    "builders": [
        {
            "type": "amazon-ebs",
            "region": "us-east-1",
            "source_ami": "ami-123",
            "instance_type": "t2.micro",
            "ssh_username": "centos",
            "ami_name": "foo app {{timestamp}}",
            "vpc_id": "vpc-123",
            "subnet_id": "subnet-123",
            "associate_public_ip_address": "true",
            "ami_virtualization_type": "hvm",
            "communicator": "ssh",
            "ssh_pty": "true",
            "launch_block_device_mappings": [{
                "device_name": "/dev/sda1",
                "volume_size": 8,
                "volume_type": "gp2",
                "delete_on_termination": true
            }]
        }
    ],
    "provisioners": [
        {
            "type": "shell",
            "inline": [
                "sudo yum install -y epel-release",
                "sudo yum install -y ansible"
            ]
        },
        {
            "type": "ansible-local",
            "playbook_file": "../ansible/foo-app.yml",
            "playbook_dir": "../ansible",
            "inventory_groups": "app,tag_env_staging",
            "command": "echo '{{user `pass`}}' | ansible-playbook",
            "extra_arguments": "--vault-password-file=/bin/cat --tags setup"
        }
    ]
}
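
To build the image, you export the vault password and run packer build on this template, e.g. ANSIBLE_VAULT_PASS=... packer build foo-app.json (the filename is just an example). The variables block pulls the password out of the environment, the command setting pipes it into ansible-playbook on stdin, and --vault-password-file=/bin/cat turns that stdin back into a "password file". Hackish, as promised, but it works.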

CodeDeploy

I am pretty happy with CodeDeploy so far. By standardizing the deployment process across apps, our “follow the sun” sysadmin team in Europe, Asia and Latin America can roll back to a previous successful release without needing to know much about the app. So if something goes bump in the night, someone will be able to deal with it during their day without having to get the developers out of bed.

UPDATE

We are now using Terraform to provision instances, with Ansible to set them up. We build an AMI for each environment (staging, production) using Packer, with the config it needs baked in, then deploy the app using CodeDeploy. This avoids the problems discussed here.