Past Incidents

Wednesday 15th February 2017

Cabinet Migration, scheduled 8 years ago

All servers will be taken offline beginning Saturday, January 28 at 10 PM EST (-0500 GMT) to migrate to another cabinet in the data center. Anticipate a 30-90 minute window while all servers are deracked and reracked in another location in the main annex.

12:02 AM: still attempting to remove old rails from rack; some screws are stripped and those are being tended to now.

12:17 AM: network online, new uplink is GigE. Continuing to cable servers, then will power on

12:41 AM: servers coming back online

1:34 AM: everything back online, maintenance window concluded.

2:08 AM: another bug preventing mysql from starting up on systemd-based platforms (v6.5+) has been identified and corrected.

Tuesday 31st January 2017

No incidents reported

Monday 30th January 2017

No incidents reported

Sunday 29th January 2017

No incidents reported

Saturday 28th January 2017

Luna automated system recovery & reboot startup, scheduled 8 years ago

At 9:50 AM EST (-0500 GMT), Luna underwent an automated system recovery that necessitated a fast 20 second reboot. Upon rebooting, some services were incorrectly booted or not booted at all:

cron tasks depend upon a state marker file in /var/run, which is typically a tmpfs mount and is therefore emptied on every reboot. In synthetic filesystems, /var/run is instead backed by disk storage that persists across reboots. Because the marker persists, crond cannot tell on boot that this is its first start since the reboot, so it ignores @reboot tasks. Resolved.
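The marker-file problem above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not crond's actual logic: the function names, the marker path, and the idea of comparing the marker's modification time against the boot time are all hypothetical and were not part of the fix described here.

```python
import os


def naive_is_first_run(marker_path):
    """Marker-only check: any existing marker means 'already ran this boot'.

    This is safe on tmpfs, where the marker vanishes at reboot, but wrong
    when the directory is backed by persistent storage.
    """
    return not os.path.exists(marker_path)


def is_first_run_since_boot(marker_path, boot_time):
    """Hypothetical hardening: treat a marker older than the boot as stale.

    boot_time is a Unix timestamp supplied by the caller.
    """
    try:
        marker_mtime = os.stat(marker_path).st_mtime
    except FileNotFoundError:
        return True
    return marker_mtime < boot_time


def touch_marker(marker_path):
    """Create the marker, or refresh its modification time to now."""
    with open(marker_path, "a"):
        pass
    os.utime(marker_path, None)
```

With a marker left over from a previous boot, naive_is_first_run() returns False and the @reboot work is skipped, while is_first_run_since_boot() returns True as long as the stale marker predates the supplied boot time.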

mysqld checks to see if it is running before attempting to restart, which is necessary to prevent InnoDB corruption. A known issue occurs when mysqld, on startup, performs a lengthy scan of the InnoDB journal. While that scan is in progress, a second service check fires, determines mysqld is not running, and starts a second mysqld process while the initial mysqld is still recovering the InnoDB journal. The second mysqld process then binds to the known MySQL sockets (:3306, mysql.sock) without InnoDB support. A prerequisite check is made to ensure this scenario doesn't happen, as it always requires human intervention to fix, but that check failed this morning. Root cause is still under investigation.

11:10 AM EST: Resolved. lsof arguments to -F (field selection) changed between 4.82 and 4.87, causing the InnoDB lock query to fail.
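As a rough illustration of the kind of prerequisite check described above, the sketch below parses lsof -F style output by field identifier rather than by position, so extra or reordered fields are less likely to break the lookup. It is not the check that actually runs on these servers: the function names, the sample output, and the data file path are assumptions made for the example, and the parser relies only on the documented convention that each output line begins with a one-character field id (p for PID, c for command, f for file descriptor, n for file name).

```python
def parse_lsof_fields(output):
    """Parse `lsof -F` style output into one dict per open file.

    A 'p' line starts a new process; an 'f' line starts a new file
    belonging to that process. Other fields attach to whichever of the
    two is currently being described.
    """
    records = []
    process = {}
    current = None
    for line in output.splitlines():
        if not line:
            continue
        field, value = line[0], line[1:]
        if field == "p":
            process = {"p": value}
            current = None
        elif field == "f":
            current = dict(process, f=value)
            records.append(current)
        elif current is not None:
            current[field] = value
        else:
            process[field] = value
    return records


def pids_holding(output, path, command="mysqld"):
    """Return sorted PIDs of `command` processes that have `path` open."""
    return sorted({
        int(r["p"])
        for r in parse_lsof_fields(output)
        if r.get("n") == path and r.get("c") == command
    })


# Hypothetical sample output, not captured from a real server.
SAMPLE = (
    "p1234\ncmysqld\n"
    "f4\nn/var/lib/mysql/ibdata1\n"
    "f5\nn/var/lib/mysql/ib_logfile0\n"
    "p2000\ncbash\n"
    "f3\nn/tmp/x\n"
)

if __name__ == "__main__":
    print(pids_holding(SAMPLE, "/var/lib/mysql/ibdata1"))
```

A service check built this way would refuse to start a second mysqld whenever pids_holding() returns a non-empty list for the InnoDB data file, even if the first process has not yet bound its sockets.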

Friday 27th January 2017

No incidents reported

Thursday 26th January 2017

No incidents reported

Wednesday 25th January 2017

No incidents reported