All systems are operational

Past Incidents

Thursday 2nd February 2017

Sol: Backporting concurrency deadlock check (scheduled)

On February 2, beginning at approximately 7:10 AM EST (GMT-0500) and concluding by 7:35 AM EST, Sol experienced connectivity issues to MySQL. The root cause is a deadlock bug that can be triggered when a narrow set of conditions is met. The bug has been a concern since it was first encountered in MySQL 4.0, released in 2003; since then, steps have been taken to reduce exposure to it.

Sol uses a second-generation monitor that works more efficiently, tracks a wider array of defects, and can detect service flaps. Until today, however, it did not detect this deadlock bug. That check has now been backported to Sol and newer platforms, which will resolve incidents like this automatically going forward.
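A minimal sketch of what such a check might look like, assuming a pymysql-based probe against a hypothetical heartbeat table (the host, credentials, and table are placeholders; the actual monitor implementation was not described in this report):

    import pymysql  # assumed client library; the real monitor may differ

    def mysql_write_probe(host="sol.example.com", user="monitor",
                          password="secret", timeout=5):
        """Return True if MySQL completes a trivial write within `timeout` seconds.

        A deadlocked server still accepts connections but never finishes
        the write, so the read timeout is what actually trips the check.
        """
        try:
            conn = pymysql.connect(host=host, user=user, password=password,
                                   database="monitor",
                                   connect_timeout=timeout,
                                   read_timeout=timeout,
                                   write_timeout=timeout)
            with conn.cursor() as cur:
                # Must be a write: the deadlock only blocks UPDATE/INSERT,
                # so a SELECT-based probe would report a healthy server.
                cur.execute("REPLACE INTO heartbeat (id, ts) VALUES (1, NOW())")
            conn.commit()
            conn.close()
            return True
        except pymysql.MySQLError:
            return False  # refused, timed out, or wedged: flag a defect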

For those curious, the bug requires all of the following conditions:

(1) the table storage engine is MyISAM;
(2) the user is over quota;
(3) identical UPDATE or INSERT queries are issued within a millisecond of each other;
(4) another UPDATE or INSERT query is issued by a different account in between those two queries, before the queries are postponed due to quota overages.
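A minimal reproduction sketch under those conditions, assuming pymysql, a MyISAM test table, and an account already over quota (all names and credentials are placeholders, and the millisecond window makes this race very hard to hit deliberately):

    import threading
    import pymysql  # assumed client library; any MySQL client would do

    def fire(user, query):
        conn = pymysql.connect(host="localhost", user=user, password="secret",
                               database="test", autocommit=True)
        with conn.cursor() as cur:
            cur.execute(query)  # hangs here if the deadlock triggers
        conn.close()

    # (1) + (2): a MyISAM table owned by an account that is over quota.
    q = "UPDATE myisam_table SET counter = counter + 1 WHERE id = 1"

    # (3): two identical UPDATEs fired as close together as possible.
    t1 = threading.Thread(target=fire, args=("overquota_user", q))
    t2 = threading.Thread(target=fire, args=("overquota_user", q))

    # (4): a write from a different account landing in between; real
    # incidents get this from unrelated traffic, simulated here.
    t3 = threading.Thread(target=fire,
                          args=("other_user", "INSERT INTO t2 (v) VALUES (1)"))

    t1.start(); t3.start(); t2.start()
    for t in (t1, t2, t3): t.join()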

Wednesday 1st February 2017

No incidents reported

Tuesday 31st January 2017

No incidents reported

Monday 30th January 2017

No incidents reported

Sunday 29th January 2017

No incidents reported

Saturday 28th January 2017

Luna: Automated system recovery & reboot startup (scheduled)

At 9:50 AM EST (GMT-0500), Luna underwent an automated system recovery that necessitated a quick 20-second reboot. Upon rebooting, some services started incorrectly or did not start at all:

cron's @reboot tasks depend upon a state marker file in /var/run, which is typically a tmpfs mount that is wiped on reboot. On synthetic filesystems, however, /var/run is backed by disk storage that persists across reboots. Because the marker persists, crond on boot does not recognize that this is the first boot and skips @reboot tasks. Resolved.
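The fix amounts to not trusting the marker when /var/run is disk-backed. A sketch of that idea, run early in boot (the marker path is hypothetical; the actual file crond consults was not named in this report):

    import os

    MARKER = "/var/run/crond.reboot"  # hypothetical marker path

    def var_run_is_tmpfs():
        """Check /proc/mounts to see whether /var/run lives on tmpfs."""
        with open("/proc/mounts") as f:
            for line in f:
                device, mountpoint, fstype = line.split()[:3]
                if mountpoint in ("/var/run", "/run"):
                    return fstype == "tmpfs"
        return False  # /var/run is part of a persistent filesystem

    # On a disk-backed /var/run the marker survives reboots, so it must
    # be removed for @reboot jobs to fire on the next crond start.
    if not var_run_is_tmpfs() and os.path.exists(MARKER):
        os.remove(MARKER)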

mysqld checks whether it is already running before attempting to restart, which is necessary to prevent InnoDB corruption. A known issue occurs when mysqld performs a lengthy scan of the InnoDB journal on startup: a second service check fires, determines mysqld is not running, and starts a second mysqld process while the first is still recovering the journal. The second process then binds to the known MySQL sockets (:3306, mysql.sock) without InnoDB support. A prerequisite check exists to ensure this scenario doesn't happen, as it always requires human intervention to fix, but that check failed this morning. Root cause is still under investigation.
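A sketch of what a prerequisite check of this shape can look like, assuming the socket locations named above (this is illustrative, not the platform's actual check; the mysql.sock path is assumed):

    import os
    import socket

    SOCK = "/var/lib/mysql/mysql.sock"  # assumed path for mysql.sock

    def mysqld_already_present(port=3306, unix_sock=SOCK):
        """Best-effort test that no mysqld owns the known MySQL sockets.

        Note the gap this incident exposed: a mysqld still replaying the
        InnoDB journal has not bound anything yet, so socket tests alone
        pass; that is why a file-level lock query (via lsof) also exists.
        """
        if os.path.exists(unix_sock):
            return True
        probe = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            probe.bind(("0.0.0.0", port))  # fails if :3306 is taken
        except OSError:
            return True
        finally:
            probe.close()
        return False

    if mysqld_already_present():
        raise SystemExit("refusing to start a second mysqld")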

11:10 AM EST: Resolved. lsof's -F (field selection) behavior changed between versions 4.82 and 4.87, causing the InnoDB lock query to fail.
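For context, a lock query of this kind typically asks lsof which processes hold the InnoDB data files open and parses its machine-readable -F output. A sketch under those assumptions (the data-file path is assumed; the production check itself was not published):

    import subprocess

    def innodb_holder_pids(datafile="/var/lib/mysql/ibdata1"):
        """Return PIDs holding the InnoDB system tablespace open.

        With -F p, lsof emits one "p<pid>" line per process. The field
        selection semantics are what shifted between lsof releases, so
        the parser tolerates unexpected lines instead of assuming an
        exact layout.
        """
        result = subprocess.run(["lsof", "-F", "p", datafile],
                                capture_output=True, text=True)
        pids = []
        for line in result.stdout.splitlines():
            if line.startswith("p") and line[1:].isdigit():
                pids.append(int(line[1:]))
        return pids

    # A non-empty result means a mysqld still holds the journal open;
    # starting a second instance at that point reproduces this incident.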

Friday 27th January 2017

No incidents reported