All-in-One & Analytics: Reports Erroring

Incident Report for Ashby

Postmortem

Summary

On Tuesday, July 1, from 1:28 pm PST to 2:54 pm PST, Ashby’s reporting functionality was unavailable to the majority of our customers. This impacted dashboards, reports, and report alerts.

What happened?

Ashby uses a database vendor to power our reporting. At 1:15 pm PST, the vendor began a version upgrade of the database. That upgrade included a change that was not compatible with our existing systems. This resulted in all requests to the database failing and, ultimately, the reporting system failing.

How did we resolve this situation?

The first error occurred at 1:28 pm PST. Our on-call engineer began the investigation shortly after 1:35 pm PST. At 1:49 pm PST, the on-call engineer declared an incident.

We began by reviewing recent changes we had made to Ashby to determine whether one of them was the root cause. By 1:58 pm PST, we had identified that the problem was with the database and raised it with the vendor. Additionally, we started rolling out a fallback mechanism to a different database vendor.
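
A fallback of this kind can be sketched roughly as follows. This is an illustrative Python sketch, not Ashby's actual implementation: the function names, the two stand-in backends, and the query string are all hypothetical, and a real system would catch narrower error types and decide per customer when to fall back.

```python
from typing import Callable, List, Sequence


class AllBackendsFailed(Exception):
    """Raised when every configured backend failed to answer the query."""


def run_with_fallback(query: str, backends: Sequence[Callable[[str], list]]) -> list:
    """Try each backend in order and return the first successful result."""
    errors: List[Exception] = []
    for backend in backends:
        try:
            return backend(query)
        except Exception as exc:  # assumption: any exception means "try the next one"
            errors.append(exc)
    raise AllBackendsFailed(errors)


# Hypothetical stand-ins for the primary and the fallback database vendor.
def primary(query: str) -> list:
    raise ConnectionError("primary database rejected the request")


def secondary(query: str) -> list:
    return [("rows for", query)]


result = run_with_fallback("SELECT 1", [primary, secondary])
```

Here the primary backend fails, so the result comes from the secondary one. If every backend fails, the caller receives a single error that carries all of the underlying failures.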

The vendor began investigating by 2:07 pm PST. At 2:13 pm PST, they confirmed that the recent version upgrade was the culprit and began the downgrade process.

By 2:51 pm PST, the first part of the downgrade was complete, and error rates started to decrease. The downgrade was completed by 2:57 pm PST, and the incident was resolved at 2:59 pm PST.

Ashby’s reporting was unavailable for approximately 1 hour and 31 minutes, with some customers recovering sooner as we rolled out our fallback.

How did we get to this point?

We did not have a process for handling upgrade notifications, nor had we set up the necessary control over upgrades with the vendor.

We received an email from the vendor informing us of the upgrade. This email would have given us plenty of time to run our compatibility tests and prepare for the upgrade. However, only two people on our team were set up to be notified by email. Only one of them was capable of upgrading the database, and the email went to their spam folder. We had also purchased the ability to control database upgrades, but the vendor had not yet upgraded our account. As a result, we had less control over the timing of upgrades.

What have we put in place to prevent this from happening in the future?

We’ve done an internal postmortem on this incident and have implemented a variety of changes that achieve three things:

  • Reduced likelihood of future failure
  • Faster incident response times
  • Lower impact on customers in the event of failure

Specifically, we have done the following:

  • Gained explicit control over when our database can be upgraded
  • Scheduled future upgrades of our database for Saturday mornings, when fewer customers will be impacted by any potential issues
  • Improved how and where we receive notifications, so we get notified when an upgrade begins
  • Subscribed multiple people to the third party’s mailing lists to increase our awareness of upcoming upgrades
  • Improved our own error monitoring to decrease the time it takes for us to identify similar incidents in the future
  • Updated our automated test suites to test against the current version of the database and the next version 
  • Made more of the team aware of our compatibility testing suite
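
Testing against both the current and the next database version can be pictured as a small compatibility matrix. The Python sketch below is illustrative only and does not describe Ashby's actual test suite: the version numbers, check names, and the idea that a check is a function of the version string are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a dotted version string such as "2.0" into a comparable tuple."""
    return tuple(int(part) for part in version.split("."))


def run_compatibility_matrix(
    versions: List[str],
    checks: Dict[str, Callable[[str], bool]],
) -> Dict[str, List[str]]:
    """Run every check against every version; return failed check names per version."""
    failures: Dict[str, List[str]] = {}
    for version in versions:
        failed = [name for name, check in checks.items() if not check(version)]
        if failed:
            failures[version] = failed
    return failures


# Hypothetical checks: one always passes, one relies on behavior that is
# assumed (for this example only) to be removed in version 2.0.
checks = {
    "basic_select": lambda version: True,
    "legacy_date_function": lambda version: parse_version(version) < (2, 0),
}

# "1.9" stands in for the current version and "2.0" for the next one.
failures = run_compatibility_matrix(["1.9", "2.0"], checks)
```

In this example the matrix reports a failure only for the next version, which is the kind of signal that would surface an incompatible change before the upgrade is applied.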

We have also addressed the problems with this upgrade. We have made our reporting system compatible with the latest version, and the data warehouse has been upgraded without any further issues.

Posted Aug 04, 2025 - 14:10 UTC

Resolved

We are considering the incident resolved. You may need to refresh reports using the 🔄 button to see results.
Posted Jul 01, 2025 - 21:59 UTC

Monitoring

Our database vendor has performed the rollback, and we are seeing reports recover. We will keep our fallback on as a precaution.
Posted Jul 01, 2025 - 21:55 UTC

Update

We've enabled a fallback for some customers and are performing the work necessary to roll it out to all customers. We are also in contact with our database vendor. They are performing a rollback, which has not yet completed.
Posted Jul 01, 2025 - 21:46 UTC

Identified

We've identified that our database vendor is having an outage, and they are pulling in their on-call engineer. While they do that, we are investigating a short-term mitigation.
Posted Jul 01, 2025 - 21:16 UTC

Investigating

We're investigating reports and dashboards consistently erroring.
Posted Jul 01, 2025 - 20:53 UTC
This incident affected: Ashby Products (Analytics).