Ashby Unavailable

Incident Report for Ashby

Postmortem

Summary

On October 22, 2025, Ashby was unavailable for approximately 20 minutes (4:00 PM to 4:19 PM UTC). Due to a bug introduced five hours earlier, a critical service in Ashby’s infrastructure began to fail, causing our application to become unavailable. Customer job boards remained functional during this period.

Once we resolved the availability issue, we disabled the Workday and Analytics syncs and then gradually re-enabled them, ramping their frequency back up over approximately 1 hour and 45 minutes, from 5:29 PM UTC to 7:15 PM UTC.

Why did this happen?

Before the incident, a bug was introduced into application code that runs on an automated schedule. Over the course of five hours, the bug caused an increasing amount of network traffic to a critical service that the application code communicates with. Approximately 10 minutes before the incident, the increased traffic caused a sudden spike in memory usage on the virtual machine hosting the critical service; the virtual machine ran out of memory and the service failed.

How did we resolve the situation?

On Wednesday, October 22, 2025, at 10:48 AM UTC, we deployed the application code change containing the bug.

At 4:00 PM UTC, automated monitors alerted our team that Ashby was unavailable, and our on-call engineer initiated an incident.

At 4:09 PM UTC, we determined that the service had failed because its virtual machine ran out of memory, and we began failing the service over to a backup virtual machine.

At 4:19 PM UTC, the service was restored, and we confirmed Ashby was available.

At 4:33 PM UTC, we identified the probable cause and began reverting the change that introduced the bug.

At 6:09 PM UTC, we verified that the suspected bug was the root cause. At 6:15 PM UTC, the revert was shipped to production.

At 6:16 PM UTC, out of an abundance of caution, we slowly increased the frequency at which the application code ran while monitoring the critical service it communicates with.
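
For illustration, the ramp-up resembled the following minimal sketch. The step sizes, pause length, and helper functions (run_scheduled_tasks, critical_service_healthy) are hypothetical placeholders, not our actual tooling:

    import time

    # Hypothetical sketch of a staged ramp-up: run the scheduled application
    # code at a fraction of its normal rate, observe the critical service,
    # and only proceed to the next step if it stays healthy.
    RAMP_STEPS = [0.25, 0.5, 0.75, 1.0]   # fraction of the normal schedule frequency
    PAUSE_BETWEEN_STEPS = 15 * 60         # seconds to observe before the next step

    def critical_service_healthy() -> bool:
        """Placeholder for a real health check (memory headroom, error rates, latency)."""
        return True

    def run_scheduled_tasks(fraction_of_normal_rate: float) -> None:
        """Placeholder for running the application code at a reduced rate."""
        print(f"running scheduled tasks at {fraction_of_normal_rate:.0%} of the normal rate")

    for fraction in RAMP_STEPS:
        run_scheduled_tasks(fraction)
        time.sleep(PAUSE_BETWEEN_STEPS)   # watch dashboards and alerts during the pause
        if not critical_service_healthy():
            raise RuntimeError("critical service degraded; halting the ramp-up")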

At 7:15 PM UTC, confident that we had removed the effects introduced by the identified bug, we restored the application code to its normal schedule. Because the normal schedule hadn’t run for several hours, it would take time for the application code to work through the backlog of tasks it needed to perform.

At 8:07 PM UTC, the backlog of tasks was completed, and we resolved the incident.

What have we put in place to prevent it from happening in the future?

Once we identified the root cause and resolved the incident, we immediately implemented two changes to detect or prevent this issue from recurring:

  • We added monitors that detect and alert our team to outliers in network traffic to the critical service (see the sketch below).
  • We deployed a change that removed the network traffic side effect that caused the incident; the bug had exposed this side effect.
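
As a sketch of what the first change does, the check below flags traffic that deviates sharply from a rolling baseline. The metric, window size, and threshold are hypothetical and not our production monitor:

    from collections import deque
    from statistics import mean, stdev

    WINDOW = 60          # number of recent traffic samples kept as the baseline
    THRESHOLD_SIGMA = 4  # how far outside the baseline counts as an outlier

    baseline = deque(maxlen=WINDOW)

    def traffic_is_outlier(requests_per_minute: float) -> bool:
        """Flag a sample that deviates sharply from the rolling baseline."""
        outlier = False
        if len(baseline) >= 10:  # wait for a minimal baseline before alerting
            mu, sigma = mean(baseline), stdev(baseline)
            if sigma > 0 and abs(requests_per_minute - mu) > THRESHOLD_SIGMA * sigma:
                outlier = True
        baseline.append(requests_per_minute)
        return outlier

    # Example: steady traffic around 100 req/min, then a sudden jump to 350
    for sample in [100, 102, 98, 101, 99, 103, 100, 97, 102, 99, 101, 350]:
        if traffic_is_outlier(sample):
            print(f"ALERT: network traffic outlier ({sample} req/min)")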

Our team has also committed to moving the critical service that failed onto virtual machines that auto-scale. One reason the critical service failed was that we had explicitly allocated a fixed amount of memory to the virtual machine, and when that limit was reached, the machine failed. We will transition this service to infrastructure that lets our cloud service provider automatically scale both the size of each machine (including available memory) and the number of machines on which the service runs.
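
As a rough illustration of why this helps, a target-tracking autoscaler grows the fleet as memory use rises instead of letting a fixed allocation run out. The numbers and function below are hypothetical; the real scaling is performed by the cloud provider, not by code we run:

    import math

    TARGET_MEMORY_UTILIZATION = 0.60   # keep average memory use around 60%

    def desired_machine_count(current_machines: int, avg_memory_utilization: float) -> int:
        """Target-tracking style calculation: scale out when memory use exceeds the target."""
        if avg_memory_utilization <= 0:
            return current_machines
        desired = math.ceil(current_machines * avg_memory_utilization / TARGET_MEMORY_UTILIZATION)
        return max(1, desired)

    # Example: two machines at 95% memory utilization would scale out to four,
    # instead of a fixed-size machine running out of memory and failing.
    print(desired_machine_count(current_machines=2, avg_memory_utilization=0.95))  # -> 4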

Posted Nov 10, 2025 - 18:35 UTC

Resolved

Our systems have caught up on the backlog of Analytics and Workday sync tasks. Sync times have returned to normal. This incident has been resolved.
Posted Oct 22, 2025 - 20:07 UTC

Update

We're beginning to process the backlog of Analytics and Workday syncs and are continuing to monitor. Until we are caught up, customers of Ashby Analytics will see data that is a couple of hours behind their ATS, and Ashby All-in-One customers will see delays in the Workday integration.
Posted Oct 22, 2025 - 19:15 UTC

Update

We are continuing to monitor. As a precaution we have disabled Workday and Analytics syncs.
Posted Oct 22, 2025 - 17:29 UTC

Monitoring

We've implemented a fix and are monitoring.
Posted Oct 22, 2025 - 16:19 UTC

Identified

We've identified the issue and are investigating a fix.
Posted Oct 22, 2025 - 16:09 UTC

Investigating

We are currently investigating this issue.
Posted Oct 22, 2025 - 16:06 UTC
This incident affected: Notification Delivery (Email, Slack), Authentication Services (Google, Office 365, Magic Link, Single Sign On), Ashby APIs (Ashby API, Reports API, Job Post API, Documentation), and Ashby Products (Recruiting, Analytics, Hosted Job Boards, Scheduling, Chrome Extension, Mobile, AI Notetaker).