Incident report from 2021-04-29

Something went wrong!

First of all: sorry. As you can imagine, this should not have happened! As part of further improving the stability of my systems, I am posting this incident report. It is a more in-depth analysis of the incident and should give you an overview of what went wrong and how. Maybe you can learn from it, or at least understand how such a thing could happen in the first place.

https://media.giphy.com/media/QQQoLTqkm7v3y/source.gif

When?

2021-04-29 17:40 - 2021-04-29 18:20

Which services were affected?

All services hosted on the servers “Luke” and “Zeus”.

What happened?

17:40:11 Electrical fuse failed
         UPS takeover
17:56:00 UPS reports low battery voltage
         Server begins emergency shutdown procedure
18:??:?? Emergency shutdown hangs
18:??:?? UPS power fails prematurely, causes server to crash
18:18:00 Power restored
         Server boots
18:20:00 Services coming back online
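
To put the timeline into numbers, here is a small Python sketch that computes the gaps between the events with known timestamps. The event labels are my own naming and not part of any log; the two 18:??:?? entries are left out because their exact times are unknown.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# Timestamps copied from the timeline above; the keys are made-up labels.
events = {
    "fuse_failed": datetime.strptime("2021-04-29 17:40:11", FMT),
    "low_battery": datetime.strptime("2021-04-29 17:56:00", FMT),
    "power_restored": datetime.strptime("2021-04-29 18:18:00", FMT),
    "services_back": datetime.strptime("2021-04-29 18:20:00", FMT),
}


def elapsed(start: str, end: str) -> float:
    """Return the number of seconds between two named events."""
    return (events[end] - events[start]).total_seconds()


# Time on battery until the UPS raised the low-voltage warning.
on_battery_until_warning = elapsed("fuse_failed", "low_battery")
# Time without mains power.
without_mains = elapsed("fuse_failed", "power_restored")
# Time from the fuse failure until services started coming back.
total_outage = elapsed("fuse_failed", "services_back")

for label, seconds in [
    ("on battery until warning", on_battery_until_warning),
    ("without mains power", without_mains),
    ("total outage", total_outage),
]:
    print(f"{label}: {int(seconds // 60)} min {int(seconds % 60)} s")
```

The low-battery warning therefore came a little under 16 minutes after the fuse failed, and the whole incident lasted just under 40 minutes.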

What went wrong?

The UPS failed far too early: it should have kept the server alive for at least an hour even under full load, but it failed after 20 minutes! In addition, the emergency shutdown hung because the service that saves the VM states took too long to stop. This is caused by the increase in server performance over the past years: too many resources needed to be suspended, which took far longer than expected (~6 minutes) compared to when the respective service units were written.

How to improve?

  • Investigate the UPS health, possibly scheduling further maintenance windows.
  • Perform real load tests. The self-tests of the UPS are fine, but they do not reflect a real incident with longer periods of power failure.
  • Disabled the respective service unit until I have time to apply it to only a selected group of VMs.
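
One possible shape for the last point is sketched below in Python: only a selected group of VMs gets its state saved, and saving stops once a fixed time budget is used up, so the shutdown cannot hang on it indefinitely. This is only an illustration under assumptions, not the actual service unit. The function name, the VM names, and the `suspend` callable are all hypothetical; the real suspend command depends on the hypervisor, which this report does not name. Note also that the sketch only checks the budget between VMs and does not interrupt a suspend that is already running.

```python
from __future__ import annotations

import time
from typing import Callable, Iterable


def suspend_selected(
    vms: Iterable[str],
    selected: set[str],
    suspend: Callable[[str], None],
    budget_seconds: float,
    clock: Callable[[], float] = time.monotonic,
) -> tuple[list[str], list[str]]:
    """Suspend only the selected VMs until the time budget is used up.

    Returns (suspended, skipped). Skipped VMs are left to the normal
    shutdown path instead of having their state saved.
    """
    deadline = clock() + budget_seconds
    suspended: list[str] = []
    skipped: list[str] = []
    for vm in vms:
        # Skip VMs outside the selected group, and everything once
        # the budget has run out.
        if vm not in selected or clock() >= deadline:
            skipped.append(vm)
            continue
        suspend(vm)  # placeholder for the real, hypervisor-specific call
        suspended.append(vm)
    return suspended, skipped


if __name__ == "__main__":
    # Hypothetical VM names and a stand-in suspend function for the demo.
    done, left = suspend_selected(
        vms=["vm-a", "vm-b", "vm-c"],
        selected={"vm-a", "vm-c"},
        suspend=lambda vm: print(f"suspending {vm}"),
        budget_seconds=120.0,
    )
    print("suspended:", done)
    print("skipped:", left)
```

The budget would have to be chosen well below the battery runtime that remains once the UPS reports low voltage, which is exactly what the real load tests should establish.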