Availability issues

Incident Report for Seravo

Postmortem

Partial Service Disruption in the fi-perko cluster affecting a small number of sites, May 14, 2025

On Wednesday, May 14, 2025, Seravo’s fi-perko server cluster experienced partial availability issues that affected a small number of its sites.

The disruption was caused by a configuration issue in the server environment and was triggered by the deployment of new hardware in the cluster. The hardware itself was not at fault: the operating system configuration could not utilize all of the available resources.

Timeline (local time, UTC+3)

14.5.2025 21:33 - First site alert
14.5.2025 21:36 - Second site alert
14.5.2025 21:38 - Problem detection: On-call begins to investigate
14.5.2025 21:43 - Troubleshooting: On-call determines problem to be a systemwide issue
14.5.2025 21:46 - Status update: First notification of the disruption is published on status.seravo.com
14.5.2025 21:52 - Tenth site alert
14.5.2025 22:10 - Escalation: Additional Systems team member joins investigation
14.5.2025 22:19 - Narrowed the problem to a limitation on parallel processes and applied a local fix
14.5.2025 22:20 - Mitigation: Reduce system load by shutting down shadow containers
14.5.2025 22:24 - Begin rollout restarts for malfunctioning sites
14.5.2025 22:39 - First malfunctioning site recovers
14.5.2025 23:11 - Final malfunctioning site recovers
14.5.2025 23:12 - Status update: Disruption is fixed and we continue to monitor the situation

Follow-Up Action

As a result of the incident, Seravo has identified the need for further action:

Configuration defaults will be raised to match the requirements of the more powerful hardware being deployed.
The change management process will be enhanced to extend the testing of new operating system versions, related configuration, and hardware deployments.

24/7 Monitoring

All of Seravo’s customer sites are monitored 24 hours a day, 7 days a week. We detect problems quickly and respond to them directly.

All customer data is also backed up in a separate environment. In extreme emergencies, we can restore sites to operation, even in the event of a catastrophic data center failure.

Posted May 16, 2025 - 10:07 UTC

Resolved

This incident has been resolved.

Posted May 14, 2025 - 21:08 UTC

Update

We are continuing to monitor for any further issues.

Posted May 14, 2025 - 20:13 UTC

Monitoring

A fix has been implemented and we are monitoring the results.

Posted May 14, 2025 - 20:12 UTC

Identified

The issue has been identified and a fix is being implemented.

Posted May 14, 2025 - 19:35 UTC

Investigating

We are experiencing availability issues. Our team is actively investigating and resolving the situation.

We apologise for any inconvenience. Thank you for your patience.

Posted May 14, 2025 - 18:46 UTC

This incident affected: Finland (fi-perko cluster).