On Monday, October 8, our primary database server started reporting unusually high load. When we discovered this, all services were still operational, and we began investigating the cause. During the diagnosis, we saw growing packet loss between our application servers and the primary database server.
The degradation in networking exacerbated the load on the primary database server and ultimately made the server unresponsive. At this point, our APIs for Spreedly Subscriptions and Core were unavailable, and our primary focus was bringing the server back online to restore services. All in all, we experienced around an hour of degraded performance and/or downtime. In hindsight, we should have promoted a replication slave as soon as the server became unresponsive, or even sooner, to restore availability. Instead, we carried on with a traditional recovery, bringing the server back online via lights-out management.
The traditional recovery took up valuable time and produced only a temporary fix, because we were still experiencing significant packet loss. The result was a second incident in which the master database server again became unresponsive under load. It was after this second crash that we decided to promote a replication slave. After promotion, services were available again under usual load. Unfortunately, replication had not been working properly between the first reboot and the slave promotion. While we didn't lose any data, the data on the two database servers had drifted apart, and we had to bring the new master database up to date while it was already online.
To achieve this, we had to carefully slice the binary logs from the old master database to retrieve the data and prepare it for replay on the newly promoted database server. It was a long, complicated process, but we were able to restore all the data from the affected time window.
As a service provider helping companies around the globe, we have a particular aversion to downtime and are continually taking measures to prevent it. We know you depend on us to be available around the clock, and we sincerely apologize for the interruption in service.