Using Lifecycle Hooks to better understand your AWS environment

If you have a large number of EC2 instances running in an AWS environment, it can be difficult to really understand what is happening in production.

In particular, when you use autoscaling groups, it can be difficult to notice when and why your instances are dying and being replaced. For the most part this isn't a problem, but in rare cases it's possible to run up large bills by restarting instances too frequently (for example, if a system fails to start and so is constantly rebooted). And although you're usually perfectly happy for your machines to die and be automatically replaced, sometimes it can be useful to jump onto a dead machine to figure out what went wrong.


To help get a grip on this, you can use Lifecycle Hooks to alert you when an unhealthy instance needs to be restarted.

Lifecycle hooks are designed to let you run custom code when an instance is started or stopped, for example to install software or configure network and security settings. We can use them to alert us when an unhealthy instance is terminated, and to give us time to log into the terminating instance and debug it.

In this post I will explain how to use lifecycle hooks and Lambda to alert you when your instances are unhealthy and give you a chance to debug them. The examples use CloudFormation, but you can of course create these resources manually in the AWS console, or use an alternative tool such as Terraform.

This only works if you have an autoscaling group that is performing health checks.
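For example, if your group sits behind a load balancer, you can switch the group's health check from the default EC2 checks to ELB health checks. A minimal fragment (the grace period is an arbitrary example, and the group's other required properties are omitted):

```json
"AutoScalingGroup": {
    "Properties": {
        "HealthCheckType": "ELB",
        "HealthCheckGracePeriod": 300
    },
    "Type": "AWS::AutoScaling::AutoScalingGroup"
}
```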

Create a topic
First, we need to create an SNS topic where lifecycle events will be posted. We can then subscribe to this topic to act in response to instances that are being killed.

"UnhealthyHostReporterTopic": {
    "Properties": {
        "TopicName": "unhealthy-hosts"
    },
    "Type": "AWS::SNS::Topic"
}

Set up the hook

Then, we set up a lifecycle hook on our autoscaling group to post to the topic when an instance is terminating. This assumes you already have an autoscaling group defined with the logical ID "AutoScalingGroup".

"Parameters": {
    "LifecycleHeartbeatTimeout": {
        "Default": "7200",
        "Description": "How long in seconds a terminating instance is held in the wait state before being removed.",
        "Type": "Number"
    },
    "LifecycleSNSTopicArn": {
        "Description": "Topic ARN to send lifecycle transition messages",
        "Type": "String"
    }
}
"LifecycleHook": {
    "Properties": {
        "AutoScalingGroupName": {
            "Ref": "AutoScalingGroup"
        },
        "HeartbeatTimeout": {
            "Ref": "LifecycleHeartbeatTimeout"
        },
        "LifecycleTransition": "autoscaling:EC2_INSTANCE_TERMINATING",
        "NotificationTargetARN": {
            "Ref": "LifecycleSNSTopicArn"
        },
        "RoleARN": {
            "Fn::GetAtt": [
                "LifecycleHookRole",
                "Arn"
            ]
        }
    },
    "Type": "AWS::AutoScaling::LifecycleHook"
},
"LifecycleHookRole": {
    "Properties": {
        "AssumeRolePolicyDocument": {
            "Statement": [
                {
                    "Action": ["sts:AssumeRole"],
                    "Effect": "Allow",
                    "Principal": {
                        "Service": ["autoscaling.amazonaws.com"]
                    }
                }
            ]
        },
        "ManagedPolicyArns": ["arn:aws:iam::aws:policy/service-role/AutoScalingNotificationAccessRole"],
        "Path": "/",
        "Policies": [
            {
                "PolicyDocument": {
                    "Statement": [
                        {
                            "Action": ["sns:*"],
                            "Effect": "Allow",
                            "Resource": ["*"]
                        }
                    ]
                },
                "PolicyName": "LifecycleHookPolicy"
            }
        ]
    },
    "Type": "AWS::IAM::Role"
}

With this set up, when your instances are shut down (for failing health checks or for other reasons) they will first be put into a wait state for the given amount of time, and a message will be posted to the topic containing:

    LifecycleActionToken — The lifecycle action token.
    AccountId — The AWS account ID.
    AutoScalingGroupName — The name of the Auto Scaling group.
    LifecycleHookName — The name of the lifecycle hook.
    EC2InstanceId — The ID of the EC2 instance.
    LifecycleTransition — The lifecycle hook type.
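A message posted to the topic therefore looks something like this (the values here are invented placeholders; real messages also carry a few extra fields such as Service and RequestId):

```json
{
  "LifecycleActionToken": "11111111-2222-3333-4444-555555555555",
  "AccountId": "123456789012",
  "AutoScalingGroupName": "my-autoscaling-group",
  "LifecycleHookName": "LifecycleHook",
  "EC2InstanceId": "i-0123456789abcdef0",
  "LifecycleTransition": "autoscaling:EC2_INSTANCE_TERMINATING",
  "Time": "2017-01-01T12:00:00.000Z"
}
```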

You could stop here, set up an email subscription to that topic, and be informed whenever your instances shut down. However, this would still put machines into a wait state when they are simply being scaled down, removing some of the advantages of autoscaling. To deal with this, we can create a simple Lambda which decides whether the instance should go into the wait state or shut down immediately.
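The decision the Lambda has to make is simple: a healthy instance is being terminated deliberately (scaling down, a deployment, and so on) and can continue shutting down, while an unhealthy or unknown instance should stay in the wait state. A minimal sketch of that logic, with a function name of my own choosing:

```javascript
const HEALTHY_STATUS = 'Healthy';

// Decide what to do with a terminating instance. Healthy instances are
// being removed deliberately, so the lifecycle action can complete with
// CONTINUE; unhealthy or missing instances are left in the wait state.
function decideLifecycleAction(instance) {
  if (!instance) {
    return { complete: false, reason: 'instance not found' };
  }
  if (instance.HealthStatus === HEALTHY_STATUS) {
    return { complete: true, reason: 'healthy instance, scaling down' };
  }
  return { complete: false, reason: 'unhealthy instance, keep for debugging' };
}
```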

Setting up a Lambda

Create a Lambda function with code something like the following:

const AWS = require('aws-sdk');

const HEALTHY_STATUS = 'Healthy';

exports.handler = (notification, context) => {
  const snsMessage = JSON.parse(notification.Records[0].Sns.Message);

  const autoscaling = new AWS.AutoScaling();

  autoscaling.describeAutoScalingGroups(
    { AutoScalingGroupNames: [snsMessage.AutoScalingGroupName] }, (err, data) => {
      if (err) {
        // Error - report and don't kill it
        context.fail("Error describing autoscaling groups");
      } else {
        const matchingInstances = data.AutoScalingGroups[0].Instances.filter(
          i => i.InstanceId === snsMessage.EC2InstanceId);

        if (matchingInstances.length === 0) {
          // No instance found - report and don't kill it
          context.fail("No instance found");
        } else {
          const instance = matchingInstances[0];
          if (instance.HealthStatus === HEALTHY_STATUS) {
            // Healthy - just tell the autoscaling to continue
            const lifecycleParams = {
              AutoScalingGroupName: snsMessage.AutoScalingGroupName,
              LifecycleHookName: snsMessage.LifecycleHookName,
              LifecycleActionToken: snsMessage.LifecycleActionToken,
              LifecycleActionResult: 'CONTINUE',
            };
            autoscaling.completeLifecycleAction(lifecycleParams, (lifecycleErr) => {
              if (lifecycleErr) {
                console.error('AS lifecycle completion failed.\nDetails:\n', lifecycleErr);
                console.debug('CompleteLifecycleAction\nParams:\n', lifecycleParams);
                context.fail("AS lifecycle completion failed");
              } else {
                context.succeed("Healthy instance shut down");
              }
            });
          } else {
            // Unhealthy - report and don't kill it
            context.succeed("Unhealthy instance detected");
          }
        }
      }
    });
};

We can now have this Lambda execute in response to messages on the topic, by adding a subscription to the topic definition from earlier.

"UnhealthyHostReporterTopic": {
    "Properties": {
        "Subscription": [
            {
                "Endpoint": {
                    "Fn::GetAtt": [
                        "UnhealthyHostReporter",
                        "Arn"
                    ]
                },
                "Protocol": "lambda"
            }
        ],
        "TopicName": "unhealthy-hosts"
    },
    "Type": "AWS::SNS::Topic"
}

We also need to add permission for the Lambda to be invoked by SNS.

"UnhealthyHostsTriggerPermission": {
    "Properties": {
        "Action": "lambda:InvokeFunction",
        "FunctionName": {
            "Ref": "UnhealthyHostReporter"
        },
        "Principal": "sns.amazonaws.com"
    },
    "Type": "AWS::Lambda::Permission"
}

Now, this Lambda will be executed whenever one of your EC2 instances is terminated. It will check whether the instance is unhealthy, and if so will pause the shutdown. If the instance is healthy, and is just being terminated due to autoscaling or similar, it will be allowed to shut down immediately.

Notifying

The final part is to modify the Lambda so it alerts you in an appropriate way when instances are shut down, and gives you the tools to terminate a paused instance when you're done with it.

In this example we integrate with Slack, but you could just as easily write code to post to internal tools, send emails, open Jira tickets, or whatever you like.

const AWS = require('aws-sdk');
const https = require('https');
const util = require('util');

const HEALTHY_STATUS = 'Healthy';

function sendToSlack(autoscaling_group_name, subject, text) {
  return new Promise((resolve, fail) => {
    const postData = {
      channel: process.env.channel,
      username: 'AWS Load Balancers',
      text: `*${subject}*`,
      attachments: [
        {
          author_name: autoscaling_group_name,
          mrkdwn_in: ['text', 'fields'],
          text,
          author_icon: 'http://demo/icon.png',
        },
      ],
    };

    const options = {
      method: 'POST',
      hostname: 'hooks.slack.com',
      port: 443,
      path: process.env.webhook_path,
    };

    const req = https.request(options, (res) => {
      res.setEncoding('utf8');
      let body = '';
      res.on('data', (chunk) => { body += chunk; });
      res.on('end', () => resolve(body));
    });

    req.on('error', e => fail(e.message));

    req.write(util.format('%j', postData));
    req.end();
  });
}

exports.handler = (notification, context) => {
  console.log(notification);
  console.log(notification.Records[0].Sns.Message);
  const snsMessage = JSON.parse(notification.Records[0].Sns.Message);

  const info = `*AutoScalingGroupName:* ${snsMessage.AutoScalingGroupName}\n*Instance:* ${snsMessage.EC2InstanceId}\n*Time:* ${snsMessage.Time}\n*Terminate Command:* aws autoscaling complete-lifecycle-action --lifecycle-action-result CONTINUE --lifecycle-action-token ${snsMessage.LifecycleActionToken} --lifecycle-hook-name ${snsMessage.LifecycleHookName} --auto-scaling-group-name ${snsMessage.AutoScalingGroupName}`;

  const autoscaling = new AWS.AutoScaling();

  autoscaling.describeAutoScalingGroups(
    { AutoScalingGroupNames: [snsMessage.AutoScalingGroupName] }, (err, data) => {
      if (err) {
        // Error - report and don't kill it
        sendToSlack(snsMessage.AutoScalingGroupName, `Could not find autoscaling group: ${snsMessage.AutoScalingGroupName}`, `*${err}*\n\n\n${info}`).then(context.succeed);
      } else {
        const matchingInstances = data.AutoScalingGroups[0].Instances.filter(
          i => i.InstanceId === snsMessage.EC2InstanceId);

        if (matchingInstances.length === 0) {
          // No instance found - report and don't kill it
          sendToSlack(snsMessage.AutoScalingGroupName, `Could not find instance: ${snsMessage.EC2InstanceId}`, info).then(context.succeed);
        } else {
          const instance = matchingInstances[0];
          if (instance.HealthStatus === HEALTHY_STATUS) {
            // Healthy - just tell the autoscaling to continue
            const lifecycleParams = {
              AutoScalingGroupName: snsMessage.AutoScalingGroupName,
              LifecycleHookName: snsMessage.LifecycleHookName,
              LifecycleActionToken: snsMessage.LifecycleActionToken,
              LifecycleActionResult: 'CONTINUE',
            };
            autoscaling.completeLifecycleAction(lifecycleParams, (lifecycleErr) => {
              if (lifecycleErr) {
                console.error('AS lifecycle completion failed.\nDetails:\n', lifecycleErr);
                console.debug('CompleteLifecycleAction\nParams:\n', lifecycleParams);
                sendToSlack(snsMessage.AutoScalingGroupName, 'Error terminating autoscaled instance', info).then(context.succeed);
              } else {
                context.succeed();
              }
            });
          } else {
            // Unhealthy - report to slack and don't kill it
            sendToSlack(snsMessage.AutoScalingGroupName, 'TERMINATING UNHEALTHY INSTANCE', info).then(context.succeed);
          }
        }
      }
    });
};

With this, you will now be informed when you have an unhealthy instance, and be provided with the tools to either debug it or let it be shut down.

There's more that could be done here. For example, if you already have a machine in a wait state, you could let subsequent dead machines shut down immediately, so you don't end up with tens of machines in the wait state while you're dealing with the first one.
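As a starting point, the describeAutoScalingGroups response already reports each instance's LifecycleState, so the Lambda could count how many instances are already paused before holding another one (a sketch; countWaitingInstances is my own name):

```javascript
// Count the instances in an autoscaling group description that are already
// paused in the Terminating:Wait state, so the Lambda can decide whether to
// hold another instance or just let it terminate.
function countWaitingInstances(autoScalingGroup) {
  return autoScalingGroup.Instances
    .filter(i => i.LifecycleState === 'Terminating:Wait')
    .length;
}
```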

The code is fairly simple so it's easy to expand on it from here.
