Automating backups of databases in ECS containers with EFS persistent storage

aws ecs efs postgresql redis

One of my PostgreSQL tables recently hit 50 million rows. To keep costs down, and because this is just a pet project, I run all my databases in AWS ECS with persistent storage on EFS. While this only costs a couple of dollars for compute and a few dollars for storage per day, it also means I get none of the benefits of an actual AWS RDS instance. No backups or anything.

AWS EFS does have snapshots, but all they do is restore the file system back to a point in time. I wanted actual database backups that I could easily pull down if I ever needed to test against the production dataset.

Version controlling the scheduling of backups

To me, the most important aspect aside from the actual backups is version controlling whatever the solution ends up being. Getting a cronjob onto the running container is a bit of an annoyance because it requires either:

  1. A custom container image built off the base image to set up the cronjob
  2. A container registry to host that image

OR

  1. Passing all the commands to set up the cronjob at container startup

Neither is an awful solution, just not my preference. I like to use the official base PostgreSQL or Redis (or whatever other storage) image, and my AWS infrastructure is already version controlled in Terraform. So I instead opted to handle scheduling with AWS EventBridge/CloudWatch Events rules that trigger a Lambda at midnight, following the workflow below.

This ends up being only a few more AWS resources at negligible cost.

Lambda pseudo-logic workflow for daily backups

  • Check to see if a backup needs to be done for today
  • If a backup needs to be done, the Lambda needs to tell the ECS containers to run a command
    • The Lambda cannot run the command directly and must use ECS Exec
    • Example commands:
      • For Redis, simply copy the rdb dump file: cp /data/{DUMP_FILE_NAME} {EFS_PATH} | gzip - > {EFS_PATH}/done.gz
      • For PostgreSQL, pg_dump: pg_dump -U {DATABASE_USER} -d {DATABASE_NAME} -F c -f {EFS_PATH}/{DUMP_FILE_NAME} | gzip - > {EFS_PATH}/done.gz
      • The commands pipe their stdout through gzip into done.gz, which acts as a sentinel – once its size is non-zero, the dump has finished
  • Trigger the Lambda again (at hourly intervals) and check whether done.gz indicates the backup is ready
    • Repeat until the backup is done
  • Upload the finished backup to S3
  • Delete done.gz. The absence of the file means the whole process starts again the next day (the full flow is sketched in Python below)
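Expressed as (heavily simplified) Python, the Lambda could look something like the sketch below. The cluster, service, container, and bucket names, the /mnt/backups mount path, and the S3 key format are all made up for illustration. It also assumes the Lambda mounts the same EFS access point as the containers, so it can read the dump and the done.gz sentinel directly, and that calling execute_command is enough to kick the dump off without attaching to the returned SSM session.

import datetime
import os

import boto3

ecs = boto3.client("ecs")
s3 = boto3.client("s3")

# Hypothetical names and paths -- adjust to your own setup.
EFS_PATH = "/mnt/backups"            # the Lambda mounts the same EFS access point
DONE_FILE = f"{EFS_PATH}/done.gz"    # sentinel written by the dump command
DUMP_FILE = f"{EFS_PATH}/postgresql.dump"
BUCKET = "my-backup-bucket"
CLUSTER = "databases"
SERVICE = "postgresql"
CONTAINER = "postgresql"

# Same command as in the workflow above; {DATABASE_USER}/{DATABASE_NAME} stay as placeholders.
PG_DUMP = (
    "pg_dump -U {DATABASE_USER} -d {DATABASE_NAME} -F c "
    f"-f {DUMP_FILE} | gzip - > {DONE_FILE}"
)


def backup_exists_for_today() -> bool:
    """Check S3 for a backup object already uploaded today."""
    key = f"backup-{datetime.date.today():%Y-%m-%d}.dump"
    return s3.list_objects_v2(Bucket=BUCKET, Prefix=key).get("KeyCount", 0) > 0


def start_dump() -> None:
    """Tell the running PostgreSQL container to start the dump via ECS Exec."""
    task_arn = ecs.list_tasks(cluster=CLUSTER, serviceName=SERVICE)["taskArns"][0]
    ecs.execute_command(
        cluster=CLUSTER,
        task=task_arn,
        container=CONTAINER,
        interactive=True,  # ECS Exec only supports interactive sessions
        command=f'/bin/sh -c "{PG_DUMP}"',
    )


def handler(event, context):
    if backup_exists_for_today():
        return  # nothing to do until tomorrow

    if not os.path.exists(DONE_FILE):
        start_dump()  # first trigger of the day: kick off the dump and exit
        return

    if os.path.getsize(DONE_FILE) == 0:
        return  # dump still running; the next hourly trigger will check again

    # Dump finished: upload it, then delete the sentinel so the cycle restarts tomorrow.
    s3.upload_file(DUMP_FILE, BUCKET, f"backup-{datetime.date.today():%Y-%m-%d}.dump")
    os.remove(DONE_FILE)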

Triggering the Lambda again to poll the state of the backup is what I dislike most about this workflow, but it keeps each Lambda invocation's runtime negligible.

Restricted by AWS ECS

I mentioned this before, but unlike Kubernetes, where you can just exec into a pod, you can’t SSH into your ECS container or run a command on it directly. Anything you want to execute on the container has to go through a proxy called the SSM Agent, which is what ECS Exec uses under the hood. It’s cumbersome, but it’s the only way to run a command on your container.

So to run those database dump commands on the containers themselves, there are quite a few permissions you need to set up:

The role that the container runs as needs to be assumable by ECS

Clarification: that role is the Task Role that the ECS container runs as, not the Task Execution Role which is in charge of spinning up the containers in ECS.

In Terraform, you would define this role as:

resource "aws_iam_role" "role_ecs_task_role" {
  name = "ecsTaskRole"
  assume_role_policy = jsonencode(
    {
      Statement = [
        {
          Action = "sts:AssumeRole"
          Effect = "Allow"
          Principal = {
            Service = "ecs-tasks.amazonaws.com"
          }
          Sid = ""
        }
      ]
      Version = "2008-10-17"
    }
  )
}

The role should give permissions to the SSM Agent

You need to attach the following permissions to the role to allow shell commands to be run.

resource "aws_iam_role_policy" "policy_systems_manager" {
  name = "ssmAgent"
  role = aws_iam_role.role_ecs_task_role.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action = [
          "ssmmessages:CreateControlChannel",
          "ssmmessages:CreateDataChannel",
          "ssmmessages:OpenControlChannel",
          "ssmmessages:OpenDataChannel"
        ]
        Effect   = "Allow"
        Resource = "*"
      },
    ]
  })
}

In my case, to store the backups in S3, my ECS task role also allowed S3 access:

resource "aws_iam_role_policy" "s3" {
  name = "s3"
  role = aws_iam_role.role_ecs_task_role.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action = "s3:*"
        Effect = "Allow"
        Resource = [
          aws_s3_bucket.main_bucket.arn,
          "${aws_s3_bucket.main_bucket.arn}/*"
        ]
      }
    ]
  })
}

The ECS service needs to allow command execution

In Terraform, this is simply:

resource "aws_ecs_service" "ecs_service_postgresql" {
    ...

    enable_execute_command            = true
}

For a Lambda to interact with an ECS container in a private subnet, you need VPC endpoints

This gives a Lambda, or anything else running in the private subnet, access to the ECS API and therefore to the container:

resource "aws_vpc_endpoint" "endpoint_ecs" {
  vpc_id          = aws_vpc.main.id
  service_name    = "com.amazonaws.us-west-2.ecs"
  ip_address_type = "ipv4"

  policy = jsonencode(
    {
      Statement = [
        {
          Action    = "*"
          Effect    = "Allow"
          Principal = "*"
          Resource  = "*"
        },
      ]
      Version = "2008-10-17"
    }
  )

  private_dns_enabled = true

  route_table_ids    = []
  security_group_ids = [aws_security_group.sg_private.id]
  subnet_ids         = [aws_subnet.private.id]

  vpc_endpoint_type = "Interface"

  dns_options {
    dns_record_ip_type                             = "ipv4"
    private_dns_only_for_inbound_resolver_endpoint = false
  }
}

ECS Exec checker

I originally missed adding the endpoint and couldn’t figure out why my Lambda could not interact with my ECS container. I finally figured out what was wrong after using amazon-ecs-exec-checker. This is an unofficial tool, but it is referenced in the official AWS documentation, so I consider it trustworthy.

The tool needs the following permissions to run. You can set up a role with this policy and let your CLI assume it:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "iam:ListRoles",
                "ecs:DescribeTaskDefinition",
                "ec2:DescribeSubnets",
                "ec2:DescribeVpcEndpoints"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "ecs:DescribeClusters"
            ],
            "Resource": [
                "arn:aws:ecs:us-west-2:[account-id]:cluster/[cluster-name]"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "iam:SimulatePrincipalPolicy"
            ],
            "Resource": [
                "arn:aws:iam::[account-id]:role/[ecs-task-execution-role]"
            ]
        }
    ]
}

Gotchas

Version control the Lambda code in Terraform but not the build artifacts

You can include the Lambda code alongside your Terraform files and specify that it should be zipped so it can be uploaded to Lambda.

data "archive_file" "file_upload_postgresql_backup" {
  type        = "zip"
  source_file = "${path.module}/lambda_upload_postgresql_backup.py"
  output_path = "${path.module}/lambda_upload_postgresql_backup.zip"
}

Every time you modify your Lambda code (in my case lambda_upload_postgresql_backup.py), you need to make sure Terraform picks up the change and applies it.

resource "aws_lambda_function" "lambda_upload_postgresql_backup" {
  ...

  source_code_hash = data.archive_file.file_upload_postgresql_backup.output_base64sha256

  ...
}

The important part is specifying source_code_hash so that every time you run Terraform, it zips the newest version of your code and uploads the changes if the source code hash has changed.

The other important thing to remember is to exclude this zip file from version control. In the same directory, add a .gitignore with:

*.zip

Infinite backups

Currently, my backup Lambda code just uploads a nightly backup to S3 without regard to existing backups. That is fine for me, but it means backups will accumulate in S3 indefinitely.

You could modify your script to keep only the last X days. In my case, a rotating set of 7 daily backups would be fine, so all I would need to do is change the S3 key naming scheme to something like backup-{day_of_week}.dump. That way only the last 7 days of backups are kept, with new uploads overwriting the old ones.
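Sticking with the hypothetical Python sketch from earlier, that change is a one-liner: derive the S3 key from the weekday instead of the date, so day 8 overwrites day 1.

import datetime

# Keying the upload by weekday gives a rotating 7-day window:
# backup-Monday.dump is overwritten every Monday, and so on.
key = f"backup-{datetime.date.today():%A}.dump"
s3.upload_file(DUMP_FILE, BUCKET, key)  # same client and constants as the earlier sketch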