11月 20 2014

[MySQL案例]之备份与Waiting for table flush

今天在处理慢查询平台的slow query时发现今天排在第一个的居然是一个简单的SQL，运行了4166次，最长耗时610秒，长期处于“Waiting for table flush”状态。

结果说在前面

排查一番后果然不出意外：一个慢查询堵塞数据库备份进程，备份进程堵塞了后续的查询SQL，最后导致后续的大面积SQL等待。

那么问题来了

一个慢查询是怎么堵塞mysqldump进程
mysqldump是怎么堵塞后续的查询请求

首先我们来看下“Waiting for table flush”的解释

Waiting for table：
Waiting for tables, Waiting for table, Waiting for table flush
The thread got a notification that the underlying structure for a table has changed and it needs to reopen the table to get the new structure. However, to reopen the table, it must wait until all other threads have closed the table in question.
This notification takes place if another thread has used FLUSH TABLES or one of the following statements on the table in question: FLUSH TABLES tbl_name, ALTER TABLE, RENAME TABLE, REPAIR TABLE, ANALYZE TABLE, or OPTIMIZE TABLE.
In MySQL 5.5.6, Waiting for table was replaced with Waiting for table flush.

看到手册的说明是不是能想到点什么？如果不行我们模拟下案发时的场景或许就明白了。

场景重现

session A //模拟一个慢查询

1 2	mysql> select count(*) from t_user_log a left join t_user b on a.uid = b.id; ...

session B //启动一个备份进程

1	shell> mysqldump -q -Q --flush-logs --force -R --opt --single-transaction -S ./mysql.sock dbname > ./dbname.sql

session C //模拟一个简单的查询

1 2	mysql> select * from t_user_log where id = 1; ...

session D //查看进程状态

mysql> show processlist;
+------+------+-----------+--------------+---------+------+-------------------------+----------------------------------------------------------------------+-----------+---------------+-----------+
| Id   | User | Host      | db           | Command | Time | State                   | Info                                                                 | Rows_sent | Rows_examined | Rows_read |
+------+------+-----------+--------------+---------+------+-------------------------+----------------------------------------------------------------------+-----------+---------------+-----------+
| 1405 | root | localhost | join_test_db | Query   |    7 | Sending data            | select count(*) from t_user_log a left join t_user b on a.uid = b.id |         0 |             0 |         1 | 
| 1406 | root | localhost | join_test_db | Query   |    2 | Waiting for table flush | select * from t_user_log where id = 1                                |         0 |             0 |         1 | 
| 1408 | root | localhost | join_test_db | Query   |    0 | NULL                    | show processlist                                                     |         0 |             0 |         5 | 
| 1413 | root | localhost | NULL         | Query   |    5 | Waiting for table flush | FLUSH TABLES                                                         |         0 |             0 |         1 | 
+------+------+-----------+--------------+---------+------+-------------------------+----------------------------------------------------------------------+-----------+---------------+-----------+
4 rows in set (0.00 sec)

从上面的实验中我们可以看到一个慢查询堵塞了FLUSH TABLES的请求，FLUSH TABLES堵塞了后面的select请求。OK，回归到第一个问题。

一个慢查询是怎么堵塞mysqldump的备份

我们的备份参数是-q -Q —flush-logs —force -R —opt —single-transaction，其中—flush-logs与—single-transaction两个参数放在一起会在开始dump数据之前先执行一个FLUSH TABLES操作，而FLUSH TABLES会关闭所有已经打开的表，所以会等待之前的请求执行结束，于是一个慢查询就这么堵塞了mysqldump进程。书面化的语言来描述就是：FLUSH TABLES正在等待成功获得锁，所以它必须等待其他事物其他锁的释放。

FLUSH TABLES的效果是：关闭所有打开的表，同时清空query cache的内容。

备份时general日志的记录

141120 18:07:11  1458 Connect   root@localhost on
                 1458 Query     SELECT SQL_NO_FCACHE NOW()
                 1458 Query     /*!40100 SET @@SQL_MODE='' */
                 1458 Query     /*!40103 SET TIME_ZONE='+00:00' */
                 1458 Query     FLUSH TABLES
                 1458 Query     FLUSH TABLES WITH READ LOCK
                 1458 Refresh

mysqldump是怎么堵塞后续的查询请求

我们说到按上面描述的参数执行mysqldump时会首先执行一个FLUSH TABLES的命令，在FLUSH TABLES等待加锁成功的过程中会导致后续的请求也在排队等待获得锁，类似一个排他锁的效果。因此后续执行的SQL也在等待中，这就是文章开头我们描述的问题产生的全过程，

it must wait until all other threads have closed the table in question.

总结

最后总结下，很多时候我们会把后台统计的SQL或者备份进程放在半夜负载最低的时候运行，但是这往往就容易造成上面我们描述的一系列问题，严重的时候可能导致系统瘫痪，所以各位DBA同学需要合理的管理规划好自己数据库的统计以及备份的作业计划。

其余说明

mysqldump除了—flush-logs与—single-transaction两个参数放在一起会导致这样的问题以外，—lock-all-tables以及—master-data两个参数同样也会引发这个问题。

遗留问题

mysqldump进程单独使用—flush-logs与—single-transaction两个参数其中一个时都不会执行FLUSH TABLES，为什么放在一起就会呢？
一个是刷新二进制日志，一个是通过设置事务隔离级别达到获取InnoDB表一致性快照的效果，两个加在一起为什么会需要执行FLUSH TABLES呢？