Destroy THEIR Stacks - Ephemeral CDK Stacks as a Service

In this post, we will enhance our ephemeral stack architecture by consolidating the destruction process into a central service. We will use a stack lifetime tag in conjunction with the MakeDestroyable aspect from the @aws-community/ephemeral npm library.

Ephemeral stacks are temporary stacks in AWS that are designed to exist for a short period of time. This is particularly useful in development environments where you want to test something but don’t need the stack to be up indefinitely.

This article is a follow-up to two previous posts on the topic of ephemeral stacks.

The Value of Ephemeral Stacks

But why would you want to use ephemeral stacks?

  1. Cost Savings 💰: By using resources only for the time needed, you can significantly reduce costs. You no longer have to worry about unused resources accumulating costs because the stacks self-terminate after the stipulated period.

  2. Efficient Resource Allocation 🔄: In fast-paced development environments, resources are constantly being allocated and deallocated. Ephemeral stacks make this process more efficient, ensuring that resources are available when needed and are released when no longer in use.

  3. Reduced Complexity 🧠: Keeping track of which resources are actively being used can be a complex task. By using ephemeral stacks, you know that any active resource is being used for a good reason. This reduces the complexity of managing your infrastructure.

  4. Enhanced Security 🔒: Minimizing the lifespan of your stacks reduces the exposure window for potential security vulnerabilities. By limiting the duration a resource is up, you inherently limit the time it can be exploited.

  5. Realistic Testing Environments 🧪: Ephemeral stacks are great for simulating production environments without the permanence. They allow you to conduct realistic tests and experiments, enabling you to glean insights and identify issues that might not be evident in traditional development environments.

  6. Simplified Clean-Up 🧹: Forget the days of manually cleaning up resources post-testing. With the self-destruction aspect of ephemeral stacks, the clean-up is automatic. This not only saves time but also ensures that no remnants are left behind that can cause clutter or additional costs.

  7. Easy Scalability for Temporary Needs ⚖️: Sometimes you need to scale resources quickly to meet a temporary need (e.g., a one-time data processing job). Ephemeral stacks allow for such scalability without the long-term commitment.

Armed with these benefits, it’s clear that ephemeral stacks are an incredibly powerful tool for optimizing AWS resource management, especially in development environments. Let's dive into how we can further improve the architecture by consolidating the destruction process.

Understanding the Key Components

We will go through the changes to the @aws-community/ephemeral npm library and demonstrate how it can be used.

The code for the @aws-community/ephemeral npm library is here: https://github.com/aws-community-projects/ephemeral

The example project that uses it is here: https://github.com/martzcodes/blog-ephemeral

The DestroyMe Stack and Construct

The DestroyMeConstruct uses the SelfDestructAspect from previous posts to ensure that all of the AWS resources in the stack are set to a DESTROY removal policy. Additionally, it sets a STACK_LIFE tag on the stack, holding the number of seconds the stack should remain if there are no updates to it. An external service uses this tag to pick up the stack and process it for destruction. Here’s the code snippet for this part:

Tags.of(Stack.of(this)).add('STACK_LIFE', duration.toSeconds().toString());
Aspects.of(Stack.of(this)).add(new SelfDestructAspect());

DestroyMeStack is a higher-level construct that simply includes DestroyMeConstruct, making it convenient to extend.

The Destroyer Stack

The DestroyerStack is the core of this enhancement. Instead of having each stack deploy a step function that will self-destroy, which could lead to conflicts or complications, we centralize the destruction process.

DestroyerStack uses AWS Service Events from CloudFormation to detect stacks that have the STACK_LIFE tag. Every time a CDK Stack deploys, it generates a CloudFormation Stack Status event. Using this event, we can fetch the stack details, including the tags, and determine if we should track the stack for deletion.

If the stack has the STACK_LIFE tag, we add an entry into a DynamoDB table with a TimeToLive (TTL) property. This TTL is the sum of the current time and the stack life duration. When DynamoDB removes the item due to TTL expiration, we trigger a Lambda function to delete the stack.
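As a rough illustration of the TTL arithmetic described above, here is a minimal sketch. The helper name computeTtl is hypothetical and not part of the library. It assumes the STACK_LIFE tag holds a duration in seconds and relies on the fact that DynamoDB expects the TTL attribute as a Unix epoch timestamp in seconds.

```typescript
// Hypothetical helper (not part of @aws-community/ephemeral).
// DynamoDB TTL attributes are Unix epoch timestamps in seconds.
function computeTtl(nowMs: number, stackLifeSeconds: number): number {
  // Round up so the item never expires earlier than requested.
  return Math.ceil(nowMs / 1000 + stackLifeSeconds);
}

// A stack deployed at 2023-01-01T00:00:00Z with a 3-minute lifetime.
const deployedAt = Date.UTC(2023, 0, 1);
const ttl = computeTtl(deployedAt, 180);
console.log(new Date(ttl * 1000).toISOString()); // 2023-01-01T00:03:00.000Z
```

Passing the current time in as a parameter, rather than reading the clock inside the function, keeps the calculation easy to test.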

Here's how the table is created:

const tableName = 'destroyer';
const table = new Table(this, tableName, {
  tableName,
  partitionKey: {
    name: 'pk',
    type: AttributeType.STRING,
  },
  billingMode: BillingMode.PAY_PER_REQUEST,
  removalPolicy: RemovalPolicy.DESTROY,
  timeToLiveAttribute: 'ttl',
  stream: StreamViewType.NEW_AND_OLD_IMAGES,
});

If the stack deletion fails, we can also track the DELETE_FAILED status and send notifications to an SNS Topic for manual intervention.

const failTopic = new Topic(this, 'fail-topic');
new Rule(this, 'delete-failed-rule', {
  eventPattern: {
    source: ['aws.cloudformation'],
    detailType: ['CloudFormation Stack Status Change'],
    detail: {
      // Stack status change events report the status under detail.status-details.status
      'status-details': {
        status: ['DELETE_FAILED'],
      },
    },
  },
  targets: [new SnsTopic(failTopic)],
});

CloudFormation Event Function

This Lambda function is triggered by AWS Service Events. It retrieves information from the CloudFormation service and writes to the DynamoDB table.

Here's how the Lambda function is configured:

const cloudformationFn = new NodejsFunction(this, 'fn-cloudformation', {
  runtime: Runtime.NODEJS_18_X,
  memorySize: 1024,
  timeout: Duration.minutes(5),
  entry: join(__dirname, local ? 'destroyer-stack.fn-cloudformation.ts' : 'destroyer-stack.fn-cloudformation.js'),
  initialPolicy: [
    new PolicyStatement({
      effect: Effect.ALLOW,
      actions: [
        'cloudformation:Describe*',
        'cloudformation:Get*',
        'cloudformation:List*',
      ],
      resources: ['*'],
    }),
  ],
});
table.grantReadWriteData(cloudformationFn);
cloudformationFn.addEnvironment('DESTROY_TABLE_NAME', table.tableName);

We then trigger the Lambda function on those AWS Service Events:

new Rule(this, 'cloudformation-rule', {
  eventPattern: {
    source: ['aws.cloudformation'],
    detailType: ['CloudFormation Stack Status Change'],
  },
  targets: [new LambdaFunction(cloudformationFn)],
});

In our case, we don't care whether the stack deployed successfully or not. We reset the TTL with every deployment (failed or not): if a developer is actively working on the project, we don't want to delete it.
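To make the reset behaviour concrete, here is a minimal sketch that uses an in-memory Map as a stand-in for the DynamoDB table. The recordDeployment helper and the sample timestamps are hypothetical. The point is only that writing the same key again replaces the earlier ttl, so each deployment pushes the expiry forward.

```typescript
// In-memory stand-in for the DynamoDB table: pk (stack name) -> ttl (epoch seconds).
const tracked = new Map<string, number>();

// Hypothetical helper mirroring a put: the same key overwrites the previous ttl.
function recordDeployment(stackName: string, nowSeconds: number, stackLifeSeconds: number): void {
  tracked.set(stackName, Math.ceil(nowSeconds + stackLifeSeconds));
}

recordDeployment('EphemeralStack', 1000, 180); // first deploy, would expire at 1180
recordDeployment('EphemeralStack', 1100, 180); // redeploy 100 seconds later, failed or not
console.log(tracked.get('EphemeralStack')); // 1280
```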

The Lambda handler code simply describes the stack and, if the STACK_LIFE tag exists, puts an item into DynamoDB with the StackName as the partition key.

// The event detail carries the stack ID, which DescribeStacks accepts as StackName
const StackName = event.detail['stack-id'];
const describeCommand = new DescribeStacksCommand({
  StackName,
});
const stacks = await cf.send(describeCommand);
const stack = stacks.Stacks?.[0];
const stackLife = stack?.Tags?.find((tag) => tag.Key === 'STACK_LIFE')?.Value;
if (stack && stackLife) {
  try {
    // Writing the same pk again overwrites the previous ttl, which resets the countdown
    await ddbDocClient.send(
      new PutCommand({
        TableName: process.env.DESTROY_TABLE_NAME,
        Item: {
          pk: stack.StackName,
          ttl: Math.ceil(new Date().getTime() / 1000 + Number(stackLife)),
        },
      }),
    );
  } catch (e) {
    console.error(e);
  }
}
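The handler above trusts that the tag value is numeric. As a hedged sketch of a stricter lookup, the hypothetical parseStackLife helper below (not part of the library) returns the lifetime only when the tag is present and is a positive, finite number. The StackTag interface is a minimal local stand-in for the SDK's tag type.

```typescript
// Minimal local shape of a CloudFormation tag (a stand-in for the SDK type).
interface StackTag {
  Key?: string;
  Value?: string;
}

// Hypothetical helper: returns the stack lifetime in seconds, or undefined
// when the tag is missing or is not a positive finite number.
function parseStackLife(tags: StackTag[] | undefined): number | undefined {
  const raw = tags?.find((tag) => tag.Key === 'STACK_LIFE')?.Value;
  if (raw === undefined) return undefined;
  const seconds = Number(raw);
  return Number.isFinite(seconds) && seconds > 0 ? seconds : undefined;
}

console.log(parseStackLife([{ Key: 'STACK_LIFE', Value: '180' }])); // 180
console.log(parseStackLife([{ Key: 'STACK_LIFE', Value: 'soon' }])); // undefined
console.log(parseStackLife([{ Key: 'team', Value: 'blog' }])); // undefined
```

Rejecting malformed values up front avoids computing a NaN ttl for a stack that was tagged by hand.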

Destroy Function

The destroy function operates similarly. We make sure that it is triggered from the DynamoDB Stream and that it has permission to call cloudformation:DeleteStack.

destroyFn.addEventSource(
  new DynamoEventSource(table, {
    startingPosition: StartingPosition.LATEST,
  }),
);
destroyFn.addToRolePolicy(
  new PolicyStatement({
    actions: ['cloudformation:DeleteStack'],
    resources: ['*'],
    effect: Effect.ALLOW,
  }),
);

The destroy function handler code filters the DynamoDB stream records to make sure that the item is being removed and that the TTL has actually expired. For safety, there's an escape hatch: if you remove an item from DynamoDB before its expiration, the stack won't be deleted.

const currentTimeInSeconds = new Date().getTime() / 1000;
if (item.ttl > currentTimeInSeconds) {
  // item was manually removed and not expired
  console.log('item was manually removed and not expired', currentTimeInSeconds, item.ttl);
  return [...p];
}
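The snippet above is one branch of a reducer over the stream records. A minimal, self-contained sketch of the whole filter might look like the following. The StreamRecord shape and the expiredStacks helper are hypothetical simplifications: a real handler would receive DynamoDB stream records and unmarshall the old image first.

```typescript
// Minimal local shapes; a real handler would use the DynamoDB stream event types.
interface StreamRecord {
  eventName?: 'INSERT' | 'MODIFY' | 'REMOVE';
  // Assumes the old image has already been unmarshalled into plain values.
  oldItem?: { pk: string; ttl: number };
}

// Hypothetical helper: collect stack names whose items were removed after their ttl passed.
function expiredStacks(records: StreamRecord[], nowSeconds: number): string[] {
  return records.reduce<string[]>((p, record) => {
    const item = record.oldItem;
    if (record.eventName !== 'REMOVE' || !item) {
      return p; // inserts and ttl refreshes are not deletions
    }
    if (item.ttl > nowSeconds) {
      return p; // removed by hand before expiry: the escape hatch
    }
    return [...p, item.pk];
  }, []);
}

const sampleRecords: StreamRecord[] = [
  { eventName: 'MODIFY', oldItem: { pk: 'StackA', ttl: 1000 } },
  { eventName: 'REMOVE', oldItem: { pk: 'StackB', ttl: 1500 } },
  { eventName: 'REMOVE', oldItem: { pk: 'StackC', ttl: 2500 } },
];
console.log(expiredStacks(sampleRecords, 2000)); // [ 'StackB' ]
```

DynamoDB also marks TTL deletions in the stream record with a userIdentity whose principalId is dynamodb.amazonaws.com, which could serve as an additional check alongside the timestamp comparison.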

Then it removes the valid expired stacks:

const client = new CloudFormationClient({});
await Promise.all(
  stacksToDestroy.map(
    async (stackName) =>
      await client.send(
        new DeleteStackCommand({
          StackName: stackName,
        }),
      ),
  ),
);
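Promise.all rejects as soon as any one deletion fails, so the handler cannot tell which stacks were handled. As one possible variation, and not what the library does, the sketch below catches failures per stack and returns the names that failed. The DeleteStack type and the fakeDelete function are hypothetical stand-ins for the client.send(new DeleteStackCommand(...)) call.

```typescript
// Hypothetical stand-in for client.send(new DeleteStackCommand({ StackName })).
type DeleteStack = (stackName: string) => Promise<void>;

// Attempt every deletion and report which ones failed.
async function destroyAll(stackNames: string[], deleteStack: DeleteStack): Promise<string[]> {
  const outcomes = await Promise.all(
    stackNames.map((stackName) =>
      deleteStack(stackName).then(
        () => null, // the delete request was accepted
        () => stackName, // remember the stack whose delete request failed
      ),
    ),
  );
  return outcomes.filter((outcome): outcome is string => outcome !== null);
}

// Usage with a fake deleter that fails for one stack.
const fakeDelete: DeleteStack = async (stackName) => {
  if (stackName === 'StackB') throw new Error('access denied');
};

destroyAll(['StackA', 'StackB', 'StackC'], fakeDelete).then((failed) => {
  console.log(failed); // [ 'StackB' ]
});
```

The returned list could then be logged or published to the failure topic, depending on how much visibility you want.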

How to Use @aws-community/ephemeral

Example Code for this section is located here: https://github.com/martzcodes/blog-ephemeral

First, you need to deploy the DestroyerStack. You can do this in a separate project or by manually deploying the stack with npx cdk deploy DestroyerStack.

import * as cdk from 'aws-cdk-lib';
import { DestroyerStack } from '@aws-community/ephemeral';

const app = new cdk.App();
new DestroyerStack(app, 'DestroyerStack');

Once the DestroyerStack is in place and monitoring the AWS Service Events, you can make any of your stacks ephemeral by extending the DestroyMeStack or adding the DestroyMeConstruct.

Here, we have extended the stack:

import { DestroyMeStack, DestroyMeStackProps } from '@aws-community/ephemeral';
import { Construct } from 'constructs';

export class BlogEphemeralStack extends DestroyMeStack {
  constructor(scope: Construct, id: string, props: DestroyMeStackProps) {
    super(scope, id, props);
    // your stuff here
  }
}

Then we can instantiate it as follows and deploy it using npx cdk deploy EphemeralStack:

new BlogEphemeralStack(app, 'EphemeralStack', {
  destroyMeEnable: true,
  destroyMeDuration: cdk.Duration.minutes(3),
});

It is important to note that the DynamoDB TTL timing is NOT exact. From the AWS documentation:

TTL typically deletes expired items within a few days. Depending on the size and activity level of a table, the actual delete operation of an expired item can vary. Because TTL is meant to be a background process, the nature of the capacity used to expire and delete items via TTL is variable (but free of charge).

If you need to delete the stack sooner, you can delete it manually, or delete the item from DynamoDB after the TTL has expired.

Conclusion

In this blog post, we dove into enhancing our ephemeral stack architecture by centralizing the stack destruction process and employing a stack life tag with the MakeDestroyable aspect from the @aws-community/ephemeral npm library. This approach ensures that all of the AWS resources in the stack are set to a DESTROY removal policy and also sets a STACK_LIFE tag, indicating the lifetime of the stack in the absence of updates.

To summarize the key enhancements:

  1. Centralized Destruction Service: The centralization of destruction using the DestroyerStack minimizes the risks of conflicts and complications, making it more efficient.

  2. AWS Service Events: Utilizing AWS Service Events to detect stacks with the STACK_LIFE tag enables automation and efficiency in monitoring and managing the lifetime of resources.

  3. Automated Cleanup: The architecture now has an automated cleanup mechanism, which will be triggered based on the STACK_LIFE tag, and if there's a failure in the cleanup process, you will be notified via SNS.

  4. Enhanced Resource Management: With this setup, resources can be more efficiently managed, particularly during development stages where resource provisioning might be ephemeral.

This enhancement is particularly beneficial for DevOps environments, where teams frequently create and destroy resources for testing and development purposes. By automating the destruction of temporary resources, teams can ensure that only necessary resources are retained, leading to cost savings and more manageable infrastructure.

However, do remember that the timing for deletion with DynamoDB's TTL is not precise. If you require more exact timing for resource cleanup, additional manual steps may be necessary.

By integrating these enhancements into your ephemeral stack architecture, you’ll enable more streamlined, automated, and efficient resource management within your AWS environment.