Tuesday 14th February 2017

Data center power outage

Updates from the data center are below. As an aside, we were back online by 6:30 PM EST:

6:20 PM EST

At approximately 5:40 PM EST, we experienced a temporary utility interruption at the data center, which caused an unknown error in our UPS and resulted in a power outage to your environment. zColo Operations is diligently working to restore power to your environment. Additional updates will be provided when available.

6:41 PM EST

We continue to troubleshoot the power issue affecting your data center space. Please be aware there will be some delays on ticket requests as we work through the power issue. We are working to bring the power back online safely and as quickly as possible. Our UPS vendor has been dispatched to assist us in troubleshooting. Additional updates will be provided when available.

7:30 PM EST

We are still diligently working to restore power to your environment. Our UPS vendor has an estimated time of arrival of 1 hour and 30 minutes. Please be aware there will be some delays on ticket requests as we work through the power issue. Additional updates will be provided when available.

8:33 PM EST

Unfortunately, we are still working diligently to restore power to your environment. Our UPS vendor has an estimated time of arrival at 9:30 EST. Please be aware there will be some delays on ticket requests as we work through the power issue. Additional updates will be provided when available.

11:01 PM EST

We have resolved the issue that may have impacted your services and believe all service levels are now restored. If you are still experiencing issues, please reach out to us immediately. An RFO (Reason for Outage) will be available within 72 hours upon request.

2017/02/17: An RFO has been published by our data center. A confidentiality agreement is tied to it, so, as an affected party, open a ticket in the panel to request the RFO.


On our end, a misconfiguration on HV1 required manual configuration changes to Sol and Luna; other servers came back up as soon as power was restored. We're continuing to evaluate our overall disaster-recovery response. If you have any issues, open a ticket in Launchpad or drop us an email at help@hostineer.com. But please don't call in these situations: it takes hands off fixing the major issue and instead puts them on addressing it with people individually!

Addendum

  • An above-average rate of spam flowed in for ~75 minutes. Spam filtering uses an aggregate memory-backed database on another server, which went down with the power outage. MySQL takes some time to recover its internal InnoDB records before fully coming online, which is why database outages take 4-5 minutes to stabilize. Further, because the system uses SysV-style startup scripts without dependency ordering, rc.local, which restores the filter data from backup, fired before MySQL had fully started. We adjusted rc.local to sleep 2 minutes before attempting to restore filter data.
  • On v5 and earlier platforms, dedicated IPs were not brought up on boot, resulting in prolonged outages for customers on those servers. This was caused by an oversight in changing an 8-year-old startup routine, specifically: system("[[ ! -f /home/virtual/"$3"/info/disabled ]]"). A successful shell command exits with status 0, so system(CALL) == 0 is true in the expected state; but awk treats a bare 0 as false, so using the call's return value directly as a condition evaluated the expected state as 1 == 0, i.e. false, and the IPs were skipped. This issue did not affect platform v6 and above.
  • The CP reverse proxy failed to come up because the path in its reboot cron job was incorrect.
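The startup race in the first item can also be closed by polling the database instead of sleeping for a fixed interval. A minimal sketch, assuming a POSIX shell and that `mysqladmin ping` is available; the `wait_for` helper and the restore step are illustrative names, not our actual scripts:

```shell
#!/bin/sh
# wait_for TRIES CMD...: retry CMD once per second until it succeeds
# or TRIES attempts have been made; returns 0 on success, 1 on timeout.
wait_for() {
    tries=$1; shift
    i=0
    while [ "$i" -lt "$tries" ]; do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        i=$((i + 1))
        sleep 1
    done
    return 1
}

# Example rc.local usage (restore-filter-data is a hypothetical placeholder):
#   wait_for 120 mysqladmin ping && restore-filter-data
```

Unlike a fixed 2-minute sleep, this proceeds as soon as mysqld answers and fails loudly if it never does.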
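The inversion in the second item comes from mixing two boolean conventions: a shell command signals success with exit status 0, while awk treats 0 as false. A minimal sketch of that failure mode; the path is illustrative, and POSIX `test ! -f` stands in for the original `[[ ! -f ... ]]`:

```shell
#!/bin/sh
# awk's system() returns the command's exit status: 0 on success.
# Using that value directly as an awk truth test inverts the check.
awk 'BEGIN {
    # The account is "enabled" when the disabled marker file is absent,
    # so the negated file test succeeds and exits 0.
    status = system("test ! -f /nonexistent/disabled-marker")

    if (status)      print "buggy check:   account looks disabled"
    if (status == 0) print "correct check: account is enabled"
}'
```

A fix along these lines, comparing `system()`'s return value against 0 explicitly rather than using it as a bare condition, matches the exit-code semantics described above.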