Partial Service Disruption in the fi-perko cluster affecting a small number of sites, May 14, 2025
On Wednesday, May 14, 2025, Seravo’s server cluster fi-perko experienced partial availability issues that affected a small number of its sites.
The disruption was caused by a configuration issue in the server environment and was triggered after new hardware was deployed in the cluster. The hardware itself was not at fault: the operating system configuration was unable to utilize all of the resources available.
Timeline
- 14.5.2025 21:33 - First site alert
- 14.5.2025 21:36 - Second site alert
- 14.5.2025 21:38 - Problem detection: On-call begins to investigate
- 14.5.2025 21:43 - Troubleshooting: On-call determines problem to be a systemwide issue
- 14.5.2025 21:46 - Status update: First notification of the disruption is published on status.seravo.com
- 14.5.2025 21:52 - Tenth site alert
- 14.5.2025 22:10 - Escalation: Additional Systems team member joins investigation
- 14.5.2025 22:19 - Narrowed the problem to a limitation in parallel processes and applied a local fix
- 14.5.2025 22:20 - Mitigation: Reduce system load by shutting down shadow containers
- 14.5.2025 22:24 - Begin rollout restarts for malfunctioning sites
- 14.5.2025 22:39 - First malfunctioning site recovers
- 14.5.2025 23:11 - Final malfunctioning site recovers
- 14.5.2025 23:12 - Status update: Disruption is fixed and we continue to monitor the situation
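The key intervals can be read off the timeline above. The following is a minimal sketch that computes them from the published timestamps; the event labels (such as `first_alert`) are shorthand introduced here for illustration and are not Seravo's own terminology.

```python
from datetime import datetime

# Timestamp format used in the timeline, e.g. "14.5.2025 21:33".
FMT = "%d.%m.%Y %H:%M"

# Timestamps copied from the timeline; the keys are illustrative labels only.
EVENTS = {
    "first_alert": "14.5.2025 21:33",
    "investigation_started": "14.5.2025 21:38",
    "first_status_update": "14.5.2025 21:46",
    "local_fix_applied": "14.5.2025 22:19",
    "first_site_recovered": "14.5.2025 22:39",
    "last_site_recovered": "14.5.2025 23:11",
}


def minutes_between(start_label: str, end_label: str) -> int:
    """Return the whole minutes elapsed between two labelled events."""
    start = datetime.strptime(EVENTS[start_label], FMT)
    end = datetime.strptime(EVENTS[end_label], FMT)
    return int((end - start).total_seconds() // 60)


if __name__ == "__main__":
    # Report every event as minutes elapsed since the first site alert.
    for label in EVENTS:
        elapsed = minutes_between("first_alert", label)
        print(f"{label}: +{elapsed} min")
```

Measured from the first site alert, the investigation began after 5 minutes, the first status update was published after 13 minutes, and the final malfunctioning site recovered after 98 minutes.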
Follow-Up Actions
As a result of the incident, Seravo has identified the need for further action:
- Configuration defaults will be increased to match the requirements of the more performant hardware that is being deployed.
- The change management process will be enhanced to extend the testing of new operating system versions, related configuration, and hardware implementation.
24/7 Monitoring
All Seravo customer sites are monitored 24 hours a day, 7 days a week. We detect problems quickly and respond to them directly.
All customer data is also backed up in a separate environment. In extreme emergencies, we can restore sites to operation even in the event of a catastrophic data center failure.