Saturday 28th January 2017

Luna: automated system recovery & reboot startup

At 9:50 AM EST (-0500 GMT), Luna underwent an automated system recovery that required a quick 20-second reboot. After the reboot, some services started incorrectly or did not start at all:

cron @reboot tasks depend upon a state marker file in /var/run, which is typically a tmpfs mount and is therefore emptied on every boot. In synthetic filesystems, /var/run is instead backed by disk storage that persists across reboots. Because the marker persists, crond cannot tell on boot that this is its first start since the reboot, so it skips @reboot tasks. Resolved.
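The stale-marker problem can be sketched as follows. This is a minimal illustration, not crond's actual logic or the fix that was deployed: it assumes a Linux host where /proc/stat exposes a btime line, and the function names and marker path are hypothetical. The idea is to treat any marker older than the current boot as left over from a previous boot.

```python
import os


def read_boot_time(proc_stat="/proc/stat"):
    """Return the kernel boot time (Unix seconds) from the btime line."""
    with open(proc_stat) as fh:
        for line in fh:
            if line.startswith("btime "):
                return float(line.split()[1])
    raise RuntimeError("btime not found in " + proc_stat)


def marker_is_stale(marker_mtime, boot_time):
    """A marker written before the current boot belongs to a previous boot."""
    return marker_mtime < boot_time


def first_start_since_boot(marker_path, boot_time):
    """True when @reboot jobs should run.

    That is the case when the marker is missing (the tmpfs behaviour) or
    when it survived on persistent storage but predates the current boot.
    """
    try:
        mtime = os.stat(marker_path).st_mtime
    except FileNotFoundError:
        return True
    return marker_is_stale(mtime, boot_time)
```

On a tmpfs-backed /var/run the marker is always missing after a reboot, so the first branch applies. On disk-backed storage the timestamp comparison takes over.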

The mysqld service check verifies whether mysqld is already running before attempting a restart, which is necessary to prevent InnoDB corruption. A known issue occurs when mysqld, on startup, performs a lengthy scan of the InnoDB journal and a second service check fires in the meantime. That check determines mysqld is not running and starts a second mysqld process while the first is still recovering the InnoDB journal. The second process then binds to the known MySQL sockets (:3306, mysql.sock) without InnoDB support. A prerequisite check exists to prevent this scenario, since it always requires human intervention to fix, but that check failed this morning. Root cause is still under investigation.
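One way to keep a second instance from starting during a long recovery is to test whether the recorded process still exists, rather than whether the server answers on its socket. The sketch below is an assumption-laden illustration, not the prerequisite check used on Luna: it presumes mysqld writes a pid file, and the helper names are hypothetical.

```python
import os


def pid_from_file(pidfile):
    """Return the PID stored in pidfile, or None if missing or unreadable."""
    try:
        with open(pidfile) as fh:
            pid = int(fh.read().strip())
    except (FileNotFoundError, ValueError):
        return None
    return pid if pid > 0 else None


def process_alive(pid):
    """Signal 0 delivers nothing; it only reports whether the PID exists."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        # The process exists but belongs to another user.
        return True
    return True


def safe_to_start(pidfile):
    """Allow a start only when no live process is recorded in pidfile."""
    pid = pid_from_file(pidfile)
    return pid is None or not process_alive(pid)
```

A PID can be reused by an unrelated process after a crash, so a check like this is best combined with a test on the data files themselves, such as the InnoDB lock query.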

11:10 AM EST: Resolved. The arguments to lsof -F (field selection) changed between 4.82 and 4.87, causing the InnoDB lock query to fail.
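For reference, lsof -F prints one field per line, each prefixed by a single identifying character (p for PID, c for command name, n for file name). The parser below is a hedged sketch of how a lock query could consume that output while ignoring any fields it did not ask for. It is not the query used on Luna, the exact change between lsof versions is not reproduced here, and the function names and sample path are illustrative assumptions.

```python
def parse_lsof_fields(output):
    """Parse `lsof -F`-style output into a list of per-process dicts."""
    processes = []
    current = None
    for line in output.splitlines():
        if not line:
            continue
        key, value = line[0], line[1:]
        if key == "p":
            # A 'p' line opens a new process set.
            current = {"pid": int(value), "command": None, "names": []}
            processes.append(current)
        elif current is None:
            continue
        elif key == "c":
            current["command"] = value
        elif key == "n":
            current["names"].append(value)
        # Any other field identifier is ignored rather than treated as an error.
    return processes


def holders_of(output, path, command="mysqld"):
    """Return PIDs of processes named `command` that have `path` open."""
    return [
        proc["pid"]
        for proc in parse_lsof_fields(output)
        if proc["command"] == command and path in proc["names"]
    ]


# Illustrative output only; the path is an example, not taken from Luna.
SAMPLE = "p1234\ncmysqld\nf4\nn/var/lib/mysql/ibdata1\np5678\ncbash\nfcwd\nn/root\n"
```

With the sample above, holders_of(SAMPLE, "/var/lib/mysql/ibdata1") yields the single mysqld PID. Skipping unrecognised field identifiers keeps the parser from failing when a different lsof version emits extra fields.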