[MySQL案例]之一把看不见的锁后续(metadata lock)

今天MySQL过载保护又立功干掉了一堆堵塞的SQL,到后台一看,发现了一大堆熟悉的”Waiting for table metadata lock”。之前刚好写过一篇关于metadata lock相关问题处理的文章,有兴趣的同学自己点进去看看。

关于metadata lock说明

在正式说明问题之前我们先来回顾下MySQL的metadata lock。metadata lock是MySQL在5.5.3版本以后引入的,在那之前MySQL的元数据锁的颗粒度是statement级别,在5.5.3版本引入metadata lock以后变成是transaction级别,改造的目的是为了解决著名的BUG989,也就是说可能出现slave同步复制失败。简单来说就是因为MySQL二进制日志是顺序写,如果在事务未提交前执行了DDL,则二进制日志先记录了DDL再记录DML的语句就可能导致slave同步应用失败。具体参见MySQL BUG989。关于metadata lock的说明可以看这里

问题说在前面

这次发现大量的SELECT执行超过300秒,状态都是”Waiting for table metadata lock”。按照前面的说明,metadata lock应该是与DDL加锁互斥,而DDL加的是一把write access read lock,理论上是不会堵塞查询请求。而后在测试过程中又遇到几个问题,这里就一一描述下:

问题一:为什么出现上文提及的select会需要等待metadata lock加锁?
问题二:在测试过程中发现模拟同样的操作,有些select无需等待metadata lock加锁?
问题三:为什么有些DDL不会有metadata lock等待?同样的数据表在5.5与5.6版本下实验结果不一致?

问题一过程重现

这个就是这次发生的问题,这里做个简答的现场重现

session A //开启显性事务,执行一个查询后不要提交

1
2
3
4
5
6
7
8
9
10
mysql> begin;
Query OK, 0 rows affected (0.00 sec)
mysql> select * from a;
+------+
| id |
+------+
| 1 |
+------+
1 row in set (0.00 sec)

session B //执行一个DDL

1
2
mysql> alter table a add id2 int;
...堵塞中

session C //执行一个查询

1
2
mysql> select * from a;
...堵塞中

session D //查看进程状态

1
2
3
4
5
6
7
8
9
10
mysql> show processlist;
+-----+------+-----------+--------------+---------+------+---------------------------------+---------------------------+-----------+---------------+-----------+
| Id | User | Host | db | Command | Time | State | Info | Rows_sent | Rows_examined | Rows_read |
+-----+------+-----------+--------------+---------+------+---------------------------------+---------------------------+-----------+---------------+-----------+
| 181 | root | localhost | join_test_db | Sleep | 169 | | NULL | 1 | 1 | 2 |
| 182 | root | localhost | join_test_db | Query | 96 | Waiting for table metadata lock | alter table a add id2 int | 0 | 0 | 2 |
| 184 | root | localhost | join_test_db | Query | 50 | Waiting for table metadata lock | select * from a | 0 | 0 | 1 |
| 185 | root | localhost | join_test_db | Query | 0 | NULL | show processlist | 0 | 0 | 5 |
+-----+------+-----------+--------------+---------+------+---------------------------------+---------------------------+-----------+---------------+-----------+
4 rows in set (0.00 sec)

然后session A提交以后,session B跟session C就都解锁开始执行。

问题二过程重现

session A //开启显性事务,执行一个查询后不要提交

1
2
3
4
5
6
7
8
9
10
mysql> begin;
Query OK, 0 rows affected (0.00 sec)
mysql> select * from a;
+------+------+
| id | id2 |
+------+------+
| 1 | NULL |
+------+------+
1 row in set (0.00 sec)

session B //执行一个DDL

1
2
mysql> alter table a add id3 int;
...堵塞中

session C //执行一个查询

1
2
3
4
5
6
7
mysql> mysql> select * from a;
+------+------+
| id | id2 |
+------+------+
| 1 | NULL |
+------+------+
1 row in set (0.00 sec)

session D //查看进程状态

1
2
3
4
5
6
7
8
9
10
mysql> show processlist;
+-----+------+-----------+--------------+---------+------+---------------------------------+---------------------------+-----------+---------------+-----------+
| Id | User | Host | db | Command | Time | State | Info | Rows_sent | Rows_examined | Rows_read |
+-----+------+-----------+--------------+---------+------+---------------------------------+---------------------------+-----------+---------------+-----------+
| 181 | root | localhost | join_test_db | Sleep | 15 | | NULL | 1 | 1 | 2 |
| 182 | root | localhost | join_test_db | Query | 11 | Waiting for table metadata lock | alter table a add id3 int | 0 | 0 | 2 |
| 184 | root | localhost | join_test_db | Sleep | 7 | | NULL | 0 | 0 | 2 |
| 185 | root | localhost | join_test_db | Query | 0 | NULL | show processlist | 0 | 0 | 5 |
+-----+------+-----------+--------------+---------+------+---------------------------------+---------------------------+-----------+---------------+-----------+
4 rows in set (0.00 sec)

然后session A提交以后,session B就解锁开始执行。

问题三过程重现

第三个问题是在一张两千万的大表上测试发现的,在测试的过程中发现DDL的状态并未处于”Waiting for table metadata lock”,而是开始执行,在查看进程的时候已经是处于”copy to tmp table”状态。
更奇怪的是同样的数据在不同版本的数据库中,实验结果是不一样的。在5.5.28的版本中如上面描述的未出现”Waiting for table metadata lock”,但是在5.6.21版本实验的结果是跟问题一描述得有一般。

MySQL5.5.28实验

session A //开启显性事务,执行一个查询后不要提交

1
2
3
4
5
6
7
8
9
10
mysql> begin;
Query OK, 0 rows affected (0.00 sec)
mysql> select * from t_user_log limit 1 ;
+----+-----------+----------------------+--------------+-----------------+-----------------------------------+------------+-----------+-----------+
| id | uid | username | appname | loginip | loginlocation | logintime | logintype | useragent |
+----+-----------+----------------------+--------------+-----------------+-----------------------------------+------------+-----------+-----------+
| 1 | 101603700 | xxxxxx@17173.com | 发号中心 | xx.246.216.xxx | 湖南省郴州市 (永兴)电信 | 1406822402 | 1 | |
+----+-----------+----------------------+--------------+-----------------+-----------------------------------+------------+-----------+-----------+
1 row in set (0.00 sec)

session B //执行一个DDL

1
mysql> alter table t_user_log add xxx int;

session C //执行一个查询

1
2
3
4
5
6
7
mysql> select * from t_user_log limit 1 ;
+----+-----------+----------------------+--------------+-----------------+-----------------------------------+------------+-----------+-----------+
| id | uid | username | appname | loginip | loginlocation | logintime | logintype | useragent |
+----+-----------+----------------------+--------------+-----------------+-----------------------------------+------------+-----------+-----------+
| 1 | 101603700 | xxxxxx@17173.com | 发号中心 | xx.246.216.xxx | 湖南省郴州市 (永兴)电信 | 1406822402 | 1 | |
+----+-----------+----------------------+--------------+-----------------+-----------------------------------+------------+-----------+-----------+
1 row in set (0.00 sec)

session D //查看进程状态

1
2
3
4
5
6
7
8
9
10
mysql> show processlist;
+-----+------+-----------+--------------+---------+------+-------------------+------------------------------------+-----------+---------------+-----------+
| Id | User | Host | db | Command | Time | State | Info | Rows_sent | Rows_examined | Rows_read |
+-----+------+-----------+--------------+---------+------+-------------------+------------------------------------+-----------+---------------+-----------+
| 181 | root | localhost | join_test_db | Sleep | 28 | | NULL | 1 | 1 | 2 |
| 182 | root | localhost | join_test_db | Query | 15 | copy to tmp table | alter table t_user_log add xxx int | 0 | 0 | 377166 |
| 184 | root | localhost | join_test_db | Sleep | 5 | | NULL | 1 | 1 | 2 |
| 185 | root | localhost | join_test_db | Query | 0 | NULL | show processlist | 0 | 0 | 5 |
+-----+------+-----------+--------------+---------+------+-------------------+------------------------------------+-----------+---------------+-----------+
4 rows in set (0.00 sec)

MySQL5.6.21实验

session A //开启显性事务,执行一个查询后不要提交

1
2
3
4
5
6
7
8
9
10
mysql> begin;
Query OK, 0 rows affected (0.00 sec)
mysql> select * from t_user_log limit 1 ;
+----+-----------+----------------------+--------------+-----------------+-----------------------------------+------------+-----------+-----------+
| id | uid | username | appname | loginip | loginlocation | logintime | logintype | useragent |
+----+-----------+----------------------+--------------+-----------------+-----------------------------------+------------+-----------+-----------+
| 1 | 101603700 | xxxxxx@17173.com | 发号中心 | xx.246.216.xxx | 湖南省郴州市 (永兴)电信 | 1406822402 | 1 | |
+----+-----------+----------------------+--------------+-----------------+-----------------------------------+------------+-----------+-----------+
1 row in set (0.00 sec)

session B //执行一个DDL

1
2
mysql> alter table t_user_log add xxx int;
...堵塞中

session C //执行一个查询

1
2
mysql> select * from t_user_log limit 1 ;
...堵塞中

session D //查看进程状态

1
2
3
4
5
6
7
8
9
10
mysql> show processlist;
+----+------+-----------+--------------+---------+------+---------------------------------+------------------------------------+
| Id | User | Host | db | Command | Time | State | Info |
+----+------+-----------+--------------+---------+------+---------------------------------+------------------------------------+
| 38 | root | localhost | join_test_db | Sleep | 18 | | NULL |
| 39 | root | localhost | join_test_db | Query | 12 | Waiting for table metadata lock | alter table t_user_log add xxx int |
| 40 | root | localhost | join_test_db | Query | 5 | Waiting for table metadata lock | select * from t_user_log limit 1 |
| 41 | root | localhost | join_test_db | Query | 0 | init | show processlist |
+----+------+-----------+--------------+---------+------+---------------------------------+------------------------------------+
4 rows in set (0.00 sec)

同样的实验在不同的数据库版本中结果居然是不一样的,这个又是为什么呢?

问题一答案

在第一个问题的实验中我们发现一个未提交的事务使得后续的DDL以及SELECT请求都堵塞了,一旦commit以后,后面的两个请求都能迅速执行,那么我们就来可以借助profile工具来看下到底两个请求分别是在什么地方发生了等待。

session B

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
mysql> show profile;
+------------------------------+-----------+
| Status | Duration |
+------------------------------+-----------+
| starting | 0.000046 |
| checking permissions | 0.000004 |
| checking permissions | 0.000004 |
| init | 0.000008 |
| Opening tables | 0.000052 |
| System lock | 0.000009 |
| setup | 0.000020 |
| creating table | 0.016214 |
| After create | 0.000051 |
| copy to tmp table | 0.000165 |
| rename result table | 16.229622 |
| end | 0.000134 |
| Waiting for query cache lock | 0.000003 |
| end | 0.000013 |
| query end | 0.000003 |
| closing tables | 0.000011 |
| freeing items | 0.000017 |
| cleaning up | 0.000003 |
+------------------------------+-----------+
18 rows in set (0.00 sec)

session C

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
mysql> show profile;
+--------------------------------+----------+
| Status | Duration |
+--------------------------------+----------+
| starting | 0.000016 |
| Waiting for query cache lock | 0.000002 |
| Waiting on query cache mutex | 0.000002 |
| checking query cache for query | 0.000033 |
| checking permissions | 0.000006 |
| Opening tables | 8.808491 |
| System lock | 0.000009 |
| Waiting for query cache lock | 0.000002 |
| Waiting on query cache mutex | 0.000017 |
| init | 0.000016 |
| optimizing | 0.000003 |
| statistics | 0.000009 |
| preparing | 0.000008 |
| executing | 0.000002 |
| Sending data | 0.000040 |
| end | 0.000003 |
| query end | 0.000003 |
| closing tables | 0.000005 |
| freeing items | 0.000005 |
| Waiting for query cache lock | 0.000002 |
| Waiting on query cache mutex | 0.000001 |
| freeing items | 0.000008 |
| Waiting for query cache lock | 0.000001 |
| Waiting on query cache mutex | 0.000001 |
| freeing items | 0.000002 |
| storing result in query cache | 0.000002 |
| logging slow query | 0.000001 |
| cleaning up | 0.000002 |
+--------------------------------+----------+
28 rows in set (0.00 sec)

从上面profile的结果我们可以的只session B是在rename result table的时候发生等待,也就是说metadata lock确实堵塞了修改表结构的操作。
而session C先是检查query cache后发现没有命中,然后才在Opening tables的时候发生了等待,而实际上也就是说rename result table这个步骤堵塞了Opening tables的执行,这个也能理解,就像X与S互斥一样。(在MySQL 5.6的版本中profile输出略有不同,会在rename result table下面显示Waiting for table metadata lock的具体等待时间以及等待时长)
这样我们就发现了第一个问题的答案,那么第二个问题也不难发现,那就是session C执行SELECT过程中没有执行Opening tables步骤,也就是说直接在query cache中命中,下面我们继续验证。

问题二答案

我们继续看下问题二的session C的profile
session C

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
mysql> show profile;
+--------------------------------+----------+
| Status | Duration |
+--------------------------------+----------+
| starting | 0.000015 |
| Waiting for query cache lock | 0.000002 |
| Waiting on query cache mutex | 0.000002 |
| checking query cache for query | 0.000006 |
| checking privileges on cached | 0.000003 |
| checking permissions | 0.000007 |
| sending cached result to clien | 0.000011 |
| logging slow query | 0.000002 |
| cleaning up | 0.000001 |
+--------------------------------+----------+
9 rows in set (0.00 sec)

结果不出意料之外,SELECT请求没有出现等待metadata lock的情况是因为query cache能够命中而无需执行Opening tables。

问题三答案

我们刚在问题三的MySQL 5.5.28实验中发现session B中的DDL请求并未处于”Waiting for table metadata lock”而是开始执行并处于”copy to tmp table”状态。但是经过前面两个问题的解析,我们就能明白其实并不是session B的DDL没有等待metadata lock加锁,而是加锁是在”rename result table”步骤,处于”copy to tmp table”之后,而前面我们也说了问题三的实验对象是一张两千万行的大表,所以临时表环境耗时很长,因此我们在show processlist的时候发现该进程处于”copy to tmp table”而不是”Waiting for table metadata lock”状态,造成了没有等待加锁的假象,其实继续等待下去,等待临时表生成完毕还是会进入等待加锁状态。

那为什么同样一张大表,在MySQL 5.6.21中却直接进入”Waiting for table metadata lock”呢?我们猜测跟MySQL 5.6的online DDL特性有关,下面我们还是通过profile来验证。

session B

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
mysql> show profile;
+--------------------------------+----------+
| Status | Duration |
+--------------------------------+----------+
| starting | 0.000065 |
| checking permissions | 0.000004 |
| checking permissions | 0.000005 |
| init | 0.000003 |
| Opening tables | 0.000035 |
| setup | 0.000030 |
| creating table | 0.000663 |
| After create | 0.000078 |
| Waiting for table metadata loc | 1.003192 |
| After create | 0.000012 |
| Waiting for table metadata loc | 1.002947 |
| After create | 0.000014 |
| Waiting for table metadata loc | 1.002883 |
| After create | 0.000012 |
| Waiting for table metadata loc | 1.002999 |
| After create | 0.000012 |
| Waiting for table metadata loc | 1.002944 |
| After create | 0.000010 |
| Waiting for table metadata loc | 1.002934 |
| After create | 0.000008 |
| Waiting for table metadata loc | 1.002937 |
| After create | 0.000009 |
| Waiting for table metadata loc | 1.002954 |
| After create | 0.000009 |
| Waiting for table metadata loc | 0.247139 |
| After create | 0.000021 |
| System lock | 0.000015 |
| preparing for alter table | 0.001604 |
| altering table | 0.785378 |
| committing alter table to stor | 0.032014 |
| end | 0.000025 |
| query end | 0.000154 |
| closing tables | 0.000010 |
| freeing items | 0.000017 |
| cleaning up | 0.000018 |
+--------------------------------+----------+
35 rows in set, 1 warning (0.00 sec)

我们在profile中并未见”copy to tmp table”的步骤,这也就是为什么在一开始测试的5.6.21版本中不会像5.5.28版本那样需要在”copy to tmp table”等待那么久。下面我们关闭掉online DDL特性再来一试。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
mysql> set old_alter_table = 1;
Query OK, 0 rows affected (0.00 sec)
mysql> alter table xxx add xxxid int; //执行DDL
...
mysql> show profile;
+--------------------------------+----------+
| Status | Duration |
+--------------------------------+----------+
| starting | 0.000055 |
| checking permissions | 0.000005 |
| checking permissions | 0.000004 |
| init | 0.000002 |
| Opening tables | 0.000034 |
| setup | 0.000035 |
| creating table | 0.000558 |
| After create | 0.000020 |
| System lock | 0.002289 |
| copy to tmp table | 1.039543 |
| rename result table | 0.000017 |
| Waiting for table metadata loc | 1.002122 |
| rename result table | 0.000012 |
| Waiting for table metadata loc | 1.001946 |
| rename result table | 0.000009 |
| Waiting for table metadata loc | 1.001933 |
| rename result table | 0.000008 |
| Waiting for table metadata loc | 1.001938 |
| rename result table | 0.000009 |
| Waiting for table metadata loc | 1.001964 |
| rename result table | 0.000009 |
| Waiting for table metadata loc | 0.586454 |
| rename result table | 0.003639 |
| end | 0.000019 |
| query end | 0.000135 |
| closing tables | 0.000018 |
| freeing items | 0.000019 |
| cleaning up | 0.000028 |
+--------------------------------+----------+
28 rows in set, 1 warning (0.00 sec)

果然”copy to tmp table”出现了,这次再MySQL 5.6.21版本的实验结果就跟之前在问题三的MySQL 5.5.28版本的实验结果是一致的了(上面profile的结果是一张1w行的小表,2000w的大表执行一次ddl等待时间太长了…)。

结论重复

  1. 在MySQL 5.5.3以后的版本中,未提交的事务一定会堵塞DDL请求(据说5.6.6以后的版本做了相关的优化,未做测试暂且不提)
  2. DDL在等待metada lock的过程中肯定会堵塞后续的DML请求,也有可能堵塞后续的SELECT请求(需要看SELECT请求是否能够命中qcache)
  3. 怎么处理这种问题是老生常谈了,优化SQL避免出现大事务,请求结束立即执行commit关闭事务,千万千万不要把所有的请求甚至包括代码的逻辑处理都塞在一个事务里面去完成,否则一旦出现一个DDL那就是灾难的发生,很难想象这次要是没有过载保护,XX系统是不会直接宕机呢?