2022-02-16 00:00:00
A lot of teams run MySQL or PostgreSQL databases, often for critical software applications or internal systems. Many of these companies are migrating, or planning to migrate, their databases to the cloud to reduce the complexity and overhead of achieving high availability and resilience, as well as of managing backups and upgrades.
In this article, we are going to describe how to leverage Terraform to deploy some additional Aurora (RDS) backup features.
We will start by going through a summary of the standard Aurora backup capabilities, and the reasons some companies might need additional backup options. We will follow up with some engineering considerations, and finally, describe the solution chosen to expand on some RDS API options that are not typically deployed with the standard Aurora service.
AWS provides some pretty awesome backup features in the box! As a quick summary, we get:
- continuous, automated backups, with a retention period configurable from 1 to 35 days
- point-in-time recovery to any moment within that retention window
- manual snapshots, which can be taken on demand and kept until you delete them

That said, some businesses may still need additional backup options, such as retaining snapshots for longer periods or exporting them outside the cluster's lifecycle.
This will of course depend on which regulations and laws your business must comply with, and could for example include data-retention requirements from frameworks such as PCI DSS, HIPAA, or local data-protection legislation, which often mandate keeping backups well beyond 35 days.
Aurora is part of the managed database service Amazon Relational Database Service (Amazon RDS). The AWS RDS API gives us access to a variety of methods, accessible through programmatic or command line interfaces.
Terraform can of course provision, scale and modify some RDS resources. This will enable us to manage the RDS instances and clusters declaratively.
However, at the time of writing, some RDS interfaces and methods are not available as standard Terraform ‘resources’, nor are they necessarily suitable to be provisioned by Terraform.
For instance, RDS operations on the Aurora Clusters such as reboot_db_instance would more likely fall under the umbrella of database maintenance and operations, and for this reason, many teams will understandably decide that other tools or processes should be used.
For this particular task our main engineering considerations are:
A good starting point to deploy Aurora with Terraform can be found in this Terraform registry GitHub link. You should review and discuss all the arguments available in the aws_rds_cluster and aws_rds_cluster_instance resources. At a minimum, we’ll use encryption through a KMS key, and refrain from declaring a plain text password for the Aurora master user, preferring a data source referring to a password stored in Secrets Manager (See this article for a possible Secrets management solution).
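As a minimal sketch of that starting point (all resource names, the secret ID, and the instance class are hypothetical and should be adapted to your environment), an encrypted cluster whose master password comes from Secrets Manager might look like:

```hcl
# Hypothetical secret name: the master password is read from Secrets Manager
# rather than declared in plain text in the configuration.
data "aws_secretsmanager_secret_version" "aurora_master" {
  secret_id = "aurora/master-password"
}

resource "aws_kms_key" "aurora" {
  description         = "KMS key for Aurora storage encryption"
  enable_key_rotation = true
}

resource "aws_rds_cluster" "this" {
  cluster_identifier        = "my-aurora-cluster"
  engine                    = "aurora-postgresql"
  master_username           = "dbadmin"
  master_password           = data.aws_secretsmanager_secret_version.aurora_master.secret_string
  storage_encrypted         = true
  kms_key_id                = aws_kms_key.aurora.arn
  final_snapshot_identifier = "my-aurora-final-snap"
}

resource "aws_rds_cluster_instance" "this" {
  count              = 2
  identifier         = "my-aurora-instance-${count.index}"
  cluster_identifier = aws_rds_cluster.this.id
  instance_class     = "db.r6g.large"
  engine             = aws_rds_cluster.this.engine
}
```

Note that the password never appears in the configuration itself, although it will still be stored in the Terraform state, so the state backend should be encrypted too.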
The AWS SDK supports a good range of language-specific APIs for AWS services, and a popular choice is to use Lambda functions to make calls to the AWS APIs. However, since some RDS actions are asynchronous, it may be difficult, for example, to predict how long creating and exporting snapshots could take across all possible situations and workloads.
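The core difficulty is the polling pattern those asynchronous calls require: the API returns immediately while the snapshot is still "creating", and the caller must wait for the "available" state. A generic sketch of that pattern (the function name and the simulated status sequence are illustrative, not part of any AWS SDK) looks like:

```python
import time


def wait_for_status(get_status, desired="available", interval=1.0, timeout=30.0):
    """Poll get_status() until it returns `desired` or the timeout elapses.

    This mirrors what an SSM aws:waitForAwsResourceProperty step (or a boto3
    waiter) does for asynchronous RDS operations such as snapshot creation.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_status()
        if status == desired:
            return status
        time.sleep(interval)
    raise TimeoutError(f"resource never reached status {desired!r}")


# Simulated snapshot that becomes "available" after three polls.
states = iter(["creating", "creating", "available"])
result = wait_for_status(lambda: next(states), interval=0.01)
```

Inside a Lambda function this loop is awkward, because the function itself has a hard execution timeout; an SSM Automation runbook handles the waiting for you.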
Another option would be to use an AWS Systems Manager (SSM) Automation runbook, and here is how a simplified diagram of the solution would look.
The following actions can be used in a runbook:

- aws:executeAwsApi: this automation action calls and runs AWS API operations.
- aws:waitForAwsResourceProperty: this automation action allows your automation to wait for a specific resource or event state before continuing.
- aws:assertAwsResourceProperty: this automation action allows you to assert a specific resource state for a specific step.

With Terraform, we can easily provision an SSM automation document from a template file, the IAM policies and role to execute the automation, as well as the resources that will trigger the execution of the automation steps.
Let’s assume our database teams requested a feature to create manual Aurora snapshots that can be kept beyond 35 days.
We will create an SSM Automation runbook template named ssm_rds_create_snap.yaml.tpl:
```yaml
description: Custom Automation RDS snapshots
schemaVersion: '0.3'
assumeRole: '{{ AutomationAssumeRole }}'
parameters:
  InstanceId:
    type: String
    description: RDS Instance ID
    default: '${instance_id}'
  AutomationAssumeRole:
    type: String
    description: the ARN of the role that allows the automation to perform the actions on your behalf
    default: 'arn:aws:iam::{{ global:ACCOUNT_ID }}:role/${automation_assume_role}'
  DBClusterIdentifier:
    type: String
    description: The identifier of the DB cluster to create a snapshot for. Not case-sensitive.
    default: '${cluster_identifier}'
  DBClusterSnapshotIdentifier:
    type: String
    description: The identifier of the DB cluster snapshot. Stored as a lowercase string.
    default: 'aurora-snapshot-{{ automation:EXECUTION_ID }}'
  SnsTopic:
    type: String
    description: The SNS topic to send automation notifications to, that users will be able to subscribe to.
    default: '${ssm_sns_topic}'
mainSteps:
  - name: AssertNotStartingOrAvailable
    action: 'aws:assertAwsResourceProperty'
    onFailure: step:FailedJobMessage
    isCritical: false
    nextStep: CheckDBInstance
    inputs:
      Service: rds
      Api: DescribeDBClusters
      DBClusterIdentifier: '{{ DBClusterIdentifier }}'
      PropertySelector: '$.DBClusters[0].Status'
      DesiredValues:
        - available
        - starting
  - name: CheckDBInstance
    action: 'aws:waitForAwsResourceProperty'
    onFailure: step:FailedJobMessage
    nextStep: createSnapshot
    maxAttempts: 10
    timeoutSeconds: 600
    inputs:
      Service: rds
      Api: DescribeDBInstances
      DBInstanceIdentifier: '{{ InstanceId }}'
      PropertySelector: '$.DBInstances[0].DBInstanceStatus'
      DesiredValues:
        - available
  - name: createSnapshot
    action: 'aws:executeAwsApi'
    maxAttempts: 3
    onFailure: step:FailedJobMessage
    nextStep: waitForSnapshotCompletion
    inputs:
      Service: rds
      Api: CreateDBClusterSnapshot
      DBClusterIdentifier: '{{ DBClusterIdentifier }}'
      DBClusterSnapshotIdentifier: '{{ DBClusterSnapshotIdentifier }}'
  - name: waitForSnapshotCompletion
    action: 'aws:waitForAwsResourceProperty'
    onFailure: step:FailedJobMessage
    nextStep: CompleteJobNotification
    inputs:
      Service: rds
      Api: DescribeDBClusterSnapshots
      DBClusterIdentifier: '{{ DBClusterIdentifier }}'
      DBClusterSnapshotIdentifier: '{{ DBClusterSnapshotIdentifier }}'
      PropertySelector: '$.DBClusterSnapshots[0].Status'
      DesiredValues:
        - available
  - name: CompleteJobNotification
    action: 'aws:executeAwsApi'
    onFailure: Abort
    inputs:
      Service: sns
      Api: Publish
      TopicArn: '{{ SnsTopic }}'
      Message: 'RDS Snapshot created ID: {{ DBClusterSnapshotIdentifier }}'
      Subject: 'RDS Snapshot complete'
    outputs:
      - Name: MessageId
        Selector: '$.MessageId'
        Type: String
    isEnd: true
  - name: FailedJobMessage
    action: 'aws:executeAwsApi'
    onFailure: Abort
    inputs:
      Service: sns
      Api: Publish
      TopicArn: '{{ SnsTopic }}'
      Message: 'RDS Snapshot Failed ID: {{ DBClusterSnapshotIdentifier }}'
      Subject: 'RDS Snapshot Failed!!'
    outputs:
      - Name: MessageId
        Selector: '$.MessageId'
        Type: String
    isEnd: true
```
Taking a look at the above SSM automation template, we see ${...} placeholders, which Terraform will populate through the templatefile function. And we'd refer to the automation template in a Terraform block, like:
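A possible sketch (the resource names on the right-hand side of the template variables are hypothetical references to the cluster, role, and topic defined elsewhere in your configuration):

```hcl
resource "aws_ssm_document" "rds_create_snapshot" {
  name            = "rds-create-snapshot"
  document_type   = "Automation"
  document_format = "YAML"

  # Render the runbook template, substituting the ${...} placeholders.
  content = templatefile("${path.module}/ssm_rds_create_snap.yaml.tpl", {
    instance_id            = aws_rds_cluster_instance.this[0].identifier
    cluster_identifier     = aws_rds_cluster.this.cluster_identifier
    automation_assume_role = aws_iam_role.ssm_automation.name
    ssm_sns_topic          = aws_sns_topic.ssm_notifications.arn
  })
}
```

From here, the document can be started manually, on a schedule (for example via an EventBridge rule), or from another automation.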
It’s worth noting that in order to run an automation workflow, we are passing the ARN of an AWS Identity and Access Management (IAM) service role, which must be configured with permissions to execute the automation as well as permissions to invoke the RDS and SNS services.
A good starting point is to review the AWS built-in AmazonSSMAutomationRole policy, and customise a least-privilege IAM policy you can attach to the automation_assume_role for the creation of snapshots and exports to S3.
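A hedged sketch of such a least-privilege policy for the snapshot workflow above (the document and topic names are hypothetical, and in production the wildcard resource should be narrowed to your specific cluster and snapshot ARNs):

```hcl
data "aws_iam_policy_document" "rds_snapshot_automation" {
  # Only the RDS calls the runbook actually makes.
  statement {
    actions = [
      "rds:CreateDBClusterSnapshot",
      "rds:DescribeDBClusters",
      "rds:DescribeDBInstances",
      "rds:DescribeDBClusterSnapshots",
    ]
    resources = ["*"] # narrow to specific cluster/snapshot ARNs in production
  }

  # Allow the success/failure notification steps.
  statement {
    actions   = ["sns:Publish"]
    resources = [aws_sns_topic.ssm_notifications.arn]
  }
}

resource "aws_iam_role_policy" "rds_snapshot_automation" {
  name   = "rds-snapshot-automation"
  role   = aws_iam_role.ssm_automation.id
  policy = data.aws_iam_policy_document.rds_snapshot_automation.json
}
```

If you later extend the runbook with snapshot exports to S3, the policy would also need the relevant rds:StartExportTask, S3, and KMS permissions.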
Having the flexibility to look under the bonnet and tune the engine as we wish is what AWS has given us from the early days with their strong focus on APIs.
On AWS, there are several options available to teams implementing a solution, and very often there is more than one way to achieve your goals.
Sometimes, a simple, cloud-native, and cost-effective solution is exactly what your customers need: a resilient way to run the critical processes that protect their Aurora database data.