One of my PostgreSQL tables recently hit 50 million rows. To keep costs down, and because this is just a pet project, I run all my databases in AWS ECS with persistent storage in EFS. While this only means a couple of dollars for compute and a few dollars for storage per day, it also means I get none of the benefits of an actual AWS RDS instance: no backups or anything.
AWS EFS does have snapshots, but all they do is restore the file system back to a point in time. I wanted to make sure I had actual database backups that I could easily pull if I ever needed to test against the production dataset.
Version controlling the scheduling of backups
To me, the most important aspect aside from the actual backups is version controlling whatever my solution is. Setting up a cronjob on the running container is a bit of an annoyance to me because that introduces:
- Custom container image built off the base image to set up the cronjob
- Creating a container registry to host it
OR
- Passing all the commands to set up the cronjob at container startup
Not awful solutions, but not my preference: I like to use the official base PostgreSQL or Redis (or whatever other storage) image. Also, my AWS infrastructure is already version controlled in Terraform, so I instead opted to handle scheduling with AWS EventBridge/CloudWatch Events, triggering a Lambda at midnight with the workflow below.
This ends up being only a few more AWS resources at negligible cost.
Lambda pseudo-logic workflow for daily backups
- Check to see if a backup needs to be done for today
- If a backup needs to be done, Lambda needs to tell the ECS containers to run a command
- The Lambda cannot run the command directly and must use ECS Exec
- Example commands:
- For Redis, simply copy the rdb dump file:
cp /data/{DUMP_FILE_NAME} {EFS_PATH} | gzip - > {EFS_PATH}/done.gz
- For PostgreSQL, pg_dump:
pg_dump -U {DATABASE_USER} -d {DATABASE_NAME} -F c -f {EFS_PATH}/{DUMP_FILE_NAME} | gzip - > {EFS_PATH}/done.gz
- The commands pipe the result to done.gz: when this file's size is non-zero, the process is done
- Trigger the Lambda again (at hourly intervals) and check whether done.gz indicates the backup is ready
- Repeat until the backup is done
- Upload the finished backup to S3
- Delete done.gz. The absence of the file means the whole process will start again on the next day
Triggering the Lambda again to poll the state of the backup is what I dislike most about this workflow, but it keeps the Lambda runtime negligible.
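To make the workflow concrete, here is a minimal Python sketch of the Lambda. Every name in it (cluster, service, bucket, user, paths) is a placeholder, and it assumes the Lambda mounts the same EFS path the database dumps to (Lambda supports EFS mounts); my actual function differs in the details.
import datetime
import os

import boto3

# Placeholders throughout -- swap in whatever your own Terraform creates.
CLUSTER = "main-cluster"
SERVICE = "postgresql"
CONTAINER = "postgresql"
BUCKET = "my-backup-bucket"
DB_USER = "postgres"
DB_NAME = "mydb"
EFS_PATH = "/mnt/backups"      # EFS mount shared by the Lambda and the container
DUMP_FILE = "backup.dump"
DONE_FILE = f"{EFS_PATH}/done.gz"

ecs = boto3.client("ecs")
s3 = boto3.client("s3")

def handler(event, context):
    today = datetime.date.today().isoformat()
    s3_key = f"postgresql/{today}/{DUMP_FILE}"

    # Already uploaded today? Nothing to do until tomorrow.
    if s3.list_objects_v2(Bucket=BUCKET, Prefix=s3_key).get("KeyCount", 0) > 0:
        return "backup already uploaded"

    # A previous invocation started a dump; upload it once done.gz is non-empty.
    if os.path.exists(DONE_FILE):
        if os.path.getsize(DONE_FILE) == 0:
            return "dump still running, check again next hour"
        s3.upload_file(f"{EFS_PATH}/{DUMP_FILE}", BUCKET, s3_key)
        os.remove(DONE_FILE)  # the absence of done.gz restarts the cycle tomorrow
        return "backup uploaded"

    # Otherwise, ask ECS Exec to start the dump inside the running container.
    # The pipe/redirect needs a shell, hence the sh -c wrapper.
    task_arn = ecs.list_tasks(cluster=CLUSTER, serviceName=SERVICE)["taskArns"][0]
    ecs.execute_command(
        cluster=CLUSTER,
        task=task_arn,
        container=CONTAINER,
        interactive=True,  # ECS Exec only supports interactive sessions
        command=(
            f"sh -c 'pg_dump -U {DB_USER} -d {DB_NAME} -F c "
            f"-f {EFS_PATH}/{DUMP_FILE} | gzip - > {DONE_FILE}'"
        ),
    )
    return "dump started"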
Restricted by AWS ECS
I mentioned this before, but unlike Kubernetes, you can't SSH into your ECS container. Any command you want to execute on the container has to go through a proxy called the SSM Agent. It's cumbersome, but it's the only way to run a command on your container.
So to run those database dump commands on the containers themselves, there are a lot of permissions you need to set up:
The role that the container runs as needs to allow ECS to assume it
Clarification: that role is the Task Role that the ECS container runs as, not the Task Execution Role, which is in charge of spinning up the containers in ECS.
In Terraform, you would define this role as:
resource "aws_iam_role" "role_ecs_task_role" {
name = "ecsTaskRole"
assume_role_policy = jsonencode(
{
Statement = [
{
Action = "sts:AssumeRole"
Effect = "Allow"
Principal = {
Service = "ecs-tasks.amazonaws.com"
}
Sid = ""
}
]
Version = "2008-10-17"
}
)
}
The role should give permissions to the SSM Agent
You need to attach the following permissions to the role to allow shell commands to be run.
resource "aws_iam_role_policy" "policy_systems_manager" {
name = "ssmAgent"
role = aws_iam_role.role_ecs_task_role.id
policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Action = [
"ssmmessages:CreateControlChannel",
"ssmmessages:CreateDataChannel",
"ssmmessages:OpenControlChannel",
"ssmmessages:OpenDataChannel"
]
Effect = "Allow"
Resource = "*"
},
]
})
}
In my case, to store the backups in S3, my ECS task role also allowed S3 access:
resource "aws_iam_role_policy" "s3" {
name = "s3"
role = aws_iam_role.role_ecs_task_role.id
policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Action = "s3:*"
Effect = "Allow"
Resource = [
aws_s3_bucket.main_bucket.arn,
"${aws_s3_bucket.main_bucket.arn}/*"
]
}
]
})
}
The ECS service needs to allow command execution
In Terraform, this is simply:
resource "aws_ecs_service" "ecs_service_postgresql" {
...
enable_execute_command = true
}
For a Lambda to interact with an ECS container in a private subnet, you need VPC endpoints
This is what allows a Lambda (or any other AWS service inside the VPC) to reach the ECS container:
resource "aws_vpc_endpoint" "endpoint_ecs" {
vpc_id = aws_vpc.main.id
service_name = "com.amazonaws.us-west-2.ecs"
ip_address_type = "ipv4"
policy = jsonencode(
{
Statement = [
{
Action = "*"
Effect = "Allow"
Principal = "*"
Resource = "*"
},
]
Version = "2008-10-17"
}
)
private_dns_enabled = true
route_table_ids = []
security_group_ids = [aws_security_group.sg_private.id]
subnet_ids = [aws_subnet.private.id]
vpc_endpoint_type = "Interface"
dns_options {
dns_record_ip_type = "ipv4"
private_dns_only_for_inbound_resolver_endpoint = false
}
}
ECS Exec checker
I originally missed adding the endpoint and couldn’t figure out why my Lambda could not interact with my ECS container. I finally figured out what was wrong after using amazon-ecs-exec-checker. This is an unofficial tool, but it is referenced in the official AWS documentation, so I consider it trustworthy.
The tool needs the following permissions to run. You can set up a role with this policy and let your CLI assume it:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"iam:ListRoles",
"ecs:DescribeTaskDefinition",
"ec2:DescribeSubnets",
"ec2:DescribeVpcEndpoints"
],
"Resource": "*"
},
{
"Effect": "Allow",
"Action": [
"ecs:DescribeClusters"
],
"Resource": [
"arn:aws:ecs:us-west-2:[account-id]:cluster/[cluster-name]"
]
},
{
"Effect": "Allow",
"Action": [
"iam:SimulatePrincipalPolicy"
],
"Resource": [
"arn:aws:iam::[account-id]:role/[ecs-task-execution-role]"
]
}
]
}
Gotchas
Version control the Lambda code in Terraform but not the build artifacts
You can include the Lambda code alongside your Terraform files and specify that it should be zipped so it can be uploaded to Lambda.
data "archive_file" "file_upload_postgresql_backup" {
type = "zip"
source_file = "${path.module}/lambda_upload_postgresql_backup.py"
output_path = "${path.module}/lambda_upload_postgresql_backup.zip"
}
Every time you modify your Lambda code (in my case, lambda_upload_postgresql_backup.py), you need to make sure Terraform picks up the change and applies it.
resource "aws_lambda_function" "lambda_upload_postgresql_backup" {
...
source_code_hash = data.archive_file.file_upload_postgresql_backup.output_base64sha256
...
}
The important part is to specify source_code_hash so that every time you run Terraform, it will zip the newest version of your code and upload the changes if the source code hash has changed.
The other important thing to remember is to exclude the zip file from version control. So in the same directory, add a .gitignore with:
*.zip
Infinite backups
Currently, my backup Lambda code just uploads a nightly backup to S3 without regard to existing backups. That is fine for me, but it will keep accumulating backups indefinitely.
You could modify your script to only keep the last X days of backups. In my case, a rotating 7 days of backups would be fine, so all I would need to do is change the S3 upload naming scheme to something like backup-{day_of_week}.dump. That way it keeps the last 7 days of backups and just overwrites the old ones.
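A rough sketch of that naming scheme in Python (the bucket name and key prefix are just for illustration):
import datetime

import boto3

s3 = boto3.client("s3")
BUCKET = "my-backup-bucket"  # placeholder

def upload_rotating_backup(local_path):
    # The key only depends on the weekday, so each upload overwrites
    # the backup taken seven days ago.
    day_of_week = datetime.date.today().strftime("%A").lower()  # e.g. "monday"
    key = f"postgresql/backup-{day_of_week}.dump"
    s3.upload_file(local_path, BUCKET, key)
    return key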