Saturday 3rd June 2017

Helios MySQL Outage

Helios experienced a MySQL outage beginning at approximately 8:45 AM EDT (GMT-0400) and lasting until 11 AM EDT, when the issue was fully resolved.

MySQL has suffered from a mutex bug since 4.0 in 2003 that can occur when a few conditions are met:

  • an account exceeds its storage quota
  • multiple queries are issued to a database that is bound by the exceeded storage quota
  • a query is issued when another query, which cannot execute due to a blocking write, expires

The lock, which should exist on the database, its tables, and nothing more, leaks out to lock all tables across all accounts. It's a serious bug, which is why checks have been created and reimplemented over the years to detect such a condition and automatically restart the database service. Presently, an internal check runs at 3-minute intervals to query the number of concurrent database connections. If that connection count exceeds reasonable operation and the system load is normal, then the mutex bug is assumed to have struck and the database service is restarted automatically.
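As a rough illustration of that decision, here is a minimal Python sketch. It is not the actual check: the names CheckThresholds and should_restart are made up for this example, and the load threshold is an assumed placeholder rather than a production setting.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckThresholds:
    # Both values are illustrative assumptions, not the production settings.
    max_connections: int = 50
    max_normal_load: float = 4.0


def should_restart(connection_count: int, load_average: float,
                   thresholds: CheckThresholds = CheckThresholds()) -> bool:
    """Return True when the symptoms match the mutex bug.

    Many connections under high load is plausibly real traffic. Many
    connections while the load is normal suggests queries are stuck
    behind the leaked lock.
    """
    too_many_connections = connection_count > thresholds.max_connections
    load_is_normal = load_average <= thresholds.max_normal_load
    return too_many_connections and load_is_normal
```

Pairing the connection count with the load is what keeps a genuine traffic spike from triggering a restart: stuck queries pile up connections without doing any work.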

What happens when, during a 3-minute window, concurrent connections go from fewer than 50 to consuming every available slot (more than 50)? The check itself can no longer connect to query the count. In that case it is now assumed the mutex bug has struck again: instead of querying the connection count, which will fail, a restart is issued automatically.
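The corrected behavior might look something like the Python sketch below, where a probe that cannot connect is itself treated as the restart signal. The function check_once, the query_connection_count callable, and the CONNECTION_LIMIT value are all hypothetical stand-ins, and restarting on a failed probe regardless of load is an assumption of this sketch.

```python
from typing import Callable

CONNECTION_LIMIT = 50  # illustrative threshold, assumed for this sketch


def check_once(query_connection_count: Callable[[], int],
               load_is_normal: bool) -> str:
    """Run one check pass and return "restart" or "ok".

    query_connection_count is a hypothetical callable that connects to the
    database and returns the current connection count, raising an exception
    when it cannot connect.
    """
    try:
        count = query_connection_count()
    except Exception:
        # No free slot to connect with: treat the failed probe itself as
        # evidence of the mutex bug instead of skipping this pass.
        return "restart"
    if count > CONNECTION_LIMIT and load_is_normal:
        return "restart"
    return "ok"
```

The point of the change is that a failed probe no longer ends the check quietly. Previously, exhausting every slot between two passes meant the check had nothing to measure.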

Helios is the oldest server still continuously managed under Hostineer. Accordingly, Helios runs some pretty old software, including Percona MySQL 5.5, which still suffers from the mutex bug. Newer servers run MariaDB, and there is an off-chance this bug doesn't exist in that variant of MySQL, but only time will tell. Moving forward, this internal check has been corrected.