I’m using Terraform to manage our AWS setup. We recently shrunk our RDS Aurora-MySQL8 cluster from r5 to t3 instances. The cluster has two nodes: a reader and a writer in different availability zones. I’m trying to figure out if there was any downtime during this process.
Here’s what I think happened, but I’m not sure:
The r5 reader became the writer, and the old r5 writer shut down.
Some changes were applied to the new writer (still r5).
A t3 instance became the new writer, and the r5 writer turned into a reader.
There was some activity with the reader between 4:49 and 4:54.
I don’t think there was any downtime for reading or writing, but I’m not certain. Can someone help me understand what actually occurred during this resize? Did I interpret the events correctly?
# Example log snippet
April 23, 2025, 16:42 (UTC+01:00)
Started cross AZ failover to DB instance: db-cluster-node-1
April 23, 2025, 16:47 (UTC+01:00)
Finished applying modification to DB instance class
April 23, 2025, 16:49 (UTC+01:00)
Completed failover to DB instance: db-cluster-node-0
Am I on the right track with my understanding of these events?
Based on my experience with AWS RDS Aurora-MySQL, your interpretation is generally correct. The process you’ve described is typical for a resize operation, designed to minimize downtime.
During the resize, there’s usually a brief moment of unavailability during the failover, but it’s typically in the range of seconds, not minutes. This occurs when the cluster switches from the old instance type to the new one.
To verify if there was any significant downtime, I’d suggest reviewing your application logs for connection errors or timeouts during the resize window. Additionally, checking CloudWatch metrics for connection counts and query latencies can provide valuable insights.
In most cases, read operations remain available throughout the process, as Aurora maintains read replicas. Write availability might experience a momentary interruption during the failover, but it’s generally very brief.
Remember, the exact behavior can vary depending on your specific configuration and workload. If you’re concerned about potential downtime, consider implementing retry logic in your application to handle brief connection interruptions gracefully.
Having gone through a similar process with our RDS Aurora-MySQL cluster, I can share some insights. Your interpretation is quite close, but there are some nuances to consider.
In my experience, the resize process for Aurora clusters is designed to minimize downtime. The failover you’re seeing is a key part of this. When we downsized, we observed a brief connection blip during the failover, lasting just a few seconds.
The log snippet you provided aligns with what we saw. The cross-AZ failover is the critical moment. This is when your applications might experience a short interruption as connections are re-established to the new primary instance.
One thing to note: Aurora maintains read availability throughout most of the process. Even during the instance class modification, read operations can typically continue uninterrupted on the secondary node.
To confirm if there was any significant downtime, I’d recommend checking your application logs for any connection errors or reviewing CloudWatch metrics for connection count drops during that timeframe. In our case, we found these methods invaluable for pinpointing any actual impact on our services.