On Tuesday, July 1, from 1:28 pm PST to 2:59 pm PST, Ashby’s reporting functionality was unavailable to the majority of our customers. This impacted dashboards, reports, and report alerts.
Ashby uses a database vendor to power our reporting. At 1:15 pm PST, the vendor began a version upgrade of the database. That upgrade included a change that was not compatible with our existing systems. As a result, all requests to the database failed, which in turn caused the reporting system to fail.
The first error occurred at 1:28 pm PST. Our on-call engineer began the investigation shortly after 1:35 pm PST. At 1:49 pm PST, the on-call engineer declared an incident.
We began investigating recent changes we made to Ashby to determine if one of them was the root cause. By 1:58 pm PST, we had identified that the issue was with the database and raised an issue with the vendor. Additionally, we started rolling out a fallback mechanism to a different database vendor.
The vendor began investigating by 2:07 pm PST. At 2:13 pm PST, they confirmed that the recent version upgrade was the culprit and began the downgrade process.
By 2:51 pm PST, the first part of the downgrade was complete, and error rates started to decrease. By 2:57 pm PST, the downgrade was completed and the incident was resolved at 2:59 pm PST.
Ashby’s reporting was unavailable for approximately 1 hour and 31 minutes, with some customers recovering sooner as we rolled out our fallback.
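For readers curious about what a fallback of this kind can look like, here is a minimal sketch in Python. It is an illustration only, not our production code: the function `run_with_fallback`, the exception `AllBackendsFailed`, and the two backend functions are hypothetical names, and the backends are simple stand-ins for real database clients.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


class AllBackendsFailed(Exception):
    """Raised when every configured backend fails for a query."""


def run_with_fallback(query: str, backends: Sequence[Callable[[str], T]]) -> T:
    """Try each backend in order and return the first successful result."""
    errors = []
    for backend in backends:
        try:
            return backend(query)
        except Exception as exc:  # a real system would catch narrower error types
            errors.append(exc)
    raise AllBackendsFailed(f"{len(errors)} backend(s) failed for query: {query!r}")


# Hypothetical stand-in for the primary vendor, failing as it did during the incident.
def primary_backend(query: str) -> list:
    raise ConnectionError("primary reporting database rejected the request")


# Hypothetical stand-in for the fallback vendor.
def fallback_backend(query: str) -> list:
    return [{"query": query, "source": "fallback"}]
```

Calling `run_with_fallback("SELECT 1", [primary_backend, fallback_backend])` returns the rows from the fallback backend because the primary one raises. If every backend fails, the caller gets a single `AllBackendsFailed` error instead of a partial result.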
We had neither a process for handling upgrade notifications nor the necessary control over upgrades set up with the vendor.
We received an email from the vendor informing us of the upgrade. This email would have given us plenty of time to run our compatibility tests and prepare for the upgrade. However, only two people on our team were set up to be notified by email. Only one of them was capable of upgrading the database, and the email went to their spam folder. We had also purchased the ability to control database upgrades, but the vendor had not yet upgraded our account. As a result, we had less control over the timing of upgrades.
We’ve done an internal postmortem on this incident and have implemented a variety of changes that achieve three things:
Specifically, we have done the following:
We have also addressed the problems with this upgrade. We have made our reporting system compatible with the latest version, and the data warehouse has been upgraded without any further issues.