[MySQL案例]之一把看不见的锁

昨晚半夜处理了一个故障,只是简单SQL执行缺少索引导致CPU飙高,加一条索引就搞定。原来以为事情到此为止,让另外一个同事收尾准备睡觉了,这个时候高潮出现了。刚才处理的是一台slave server,为了保证实例数据对象一致性,让同事先删掉,然后在master server再加这条索引,最后发现add index的操作在slave server一直在等metadata lock释放了,并且在processlist里面居然找不到任何引起表锁的请求。就这样折腾了到凌晨,后面索性就丢在自己等锁释放,差不多过了6分钟创建成功,load再次下降。

分析

事后在5.5的手册中找到了解释:

To ensure transaction serializability, the server must not permit one session to perform a data definition language (DDL) statement on a table that is used in an uncompleted transaction in another session. The server achieves this by acquiring metadata locks on tables used within a transaction and deferring release of those locks until the transaction ends. A metadata lock on a table prevents changes to the table’s structure. This locking approach has the implication that a table that is being used by a transaction within one session cannot be used in DDL statements by other sessions until the transaction ends.

如此一来我们就知道原来是有事务堵塞DDL操作,但是回忆昨晚的操作,我曾经把所有非Sleep的session都杀掉,但是锁依然没有释放,除非是那些Sleep的session在作怪,想想还真有可能。
这里我们做个试验测试下:

A session关闭自动提交,做一个简单的查询后,不主动提交事务。

1
2
3
4
5
6
7
8
9
10
11
12
mysql> set autocommit=0;
Query OK, 0 rows affected (0.00 sec)
mysql> select * from t where id = 8;
+------+------+-------+-------+-------+-------+-------+
| id | name | name1 | name2 | name3 | name4 | name5 |
+------+------+-------+-------+-------+-------+-------+
| 8 | NULL | NULL | NULL | NULL | NULL | NULL |
+------+------+-------+-------+-------+-------+-------+
1 row in set (0.00 sec)
mysql>

B session做一个DDL操作,这时候被堵塞了。

1
2
mysql> alter table t add index idx_1 (id);
...

C session查看MySQL进程表,发现session B被堵塞,session A处于Sleep状态,这样根本看不到任何引起表锁的请求。

1
2
3
4
5
6
7
8
9
10
mysql> show processlist;
+------+-----------------+-----------+------+---------+---------+---------------------------------+------------------------------------+-----------+---------------+-----------+
| Id | User | Host | db | Command | Time | State | Info | Rows_sent | Rows_examined | Rows_read |
+------+-----------------+-----------+------+---------+---------+---------------------------------+------------------------------------+-----------+---------------+-----------+
| 1 | event_scheduler | localhost | NULL | Daemon | 5286457 | Waiting on empty queue | NULL | 0 | 0 | 1 |
| 3127 | root | localhost | test | Query | 16 | Waiting for table metadata lock | alter table t add index idx_1 (id) | 0 | 0 | 1 |
| 3133 | root | localhost | test | Sleep | 45 | | NULL | 1 | 9 | 10 |
| 3134 | root | localhost | test | Query | 0 | NULL | show processlist | 0 | 0 | 3 |
+------+-----------------+-----------+------+---------+---------+---------------------------------+------------------------------------+-----------+---------------+-----------+
4 rows in set (0.00 sec)

为了验证当时数据库是不是就是这样引起DDL的堵塞,我大致抓取了十几秒钟的general_log,果然发现了大量的set autocommit=x/commit的操作,不出意外就是这些processlist看不到的进程堵塞了DDL。

1
2
3
4
[ root@localhost ]#cat general_log |grep commit |cut -f 4 |sort |uniq -c
197 commit
186 SET autocommit=0
393 SET autocommit=1

怎么快速发现锁的元凶

通过上面的实验我们证实了是因为未提交的事务引起metadata lock,但是怎么快速发现元凶并解决它呢?

查找InnoDB事务表

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
mysql> SELECT * FROM information_schema.INNODB_TRX\G
*************************** 1. row ***************************
trx_id: 57E651
trx_state: RUNNING
trx_started: 2014-09-21 15:10:47
trx_requested_lock_id: NULL
trx_wait_started: NULL
trx_weight: 0
trx_mysql_thread_id: 3133
trx_query: NULL
trx_operation_state: NULL
trx_tables_in_use: 0
trx_tables_locked: 0
trx_lock_structs: 0
trx_lock_memory_bytes: 376
trx_rows_locked: 0
trx_rows_modified: 0
trx_concurrency_tickets: 0
trx_isolation_level: REPEATABLE READ
trx_unique_checks: 1
trx_foreign_key_checks: 1
trx_last_foreign_key_error: NULL
trx_adaptive_hash_latched: 0
trx_adaptive_hash_timeout: 10000
1 row in set (0.00 sec)

通过事务表我们确实快速找到了未提交的事务,但是这只是模拟的测试环境,换成是生产环境中不可能只有一张表,而且存在许多并发的事务,所以通过事务表来查找明显就不现实了。

通过InnoDB的状态监控

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
------------
TRANSACTIONS
------------
Trx id counter 57E655
Purge done for trx's n:o < 57E651 undo n:o < 0
History list length 1641
LIST OF TRANSACTIONS FOR EACH SESSION:
---TRANSACTION 0, not started
MySQL thread id 3134, OS thread handle 0x418e4940, query id 11413 localhost root
show engine innodb status
---TRANSACTION 57E652, not started
mysql tables in use 1, locked 1
MySQL thread id 3127, OS thread handle 0x412db940, query id 11402 localhost root Waiting for table metadata lock
alter table t add index idx_1 (id)
---TRANSACTION 57E651, ACTIVE 426 sec
MySQL thread id 3133, OS thread handle 0x41946940, query id 11401 localhost root
Trx read view will not see trx with id >= 57E652, sees < 57E652

上面是InnoDB status输出的事务信息(用innodb_lock_monitor来调试,发现输出是一样的,可能5.5以后的版本show engine innodb status包含了所有的信息输出了吧),从上面信息我们可以看到线程3217被锁住,但是依然找不到锁的根源,上面只有看到一个活跃的线程3133,或许我们可以推断出来。但是换成生产环境,这又不现实了。

这里我们是使用事务控制来模拟锁,如果是在autocommit=1的情况下,直接加一个显示锁呢?那更悲剧了,事务表跟innodb status都看不到任何信息。

1
lock table t read/write;

A session

1
2
3
4
5
mysql> set autocommit=1;
Query OK, 0 rows affected (0.00 sec)
mysql> lock table t read;
Query OK, 0 rows affected (0.00 sec)

B session

1
2
mysql> alter table t add index idx_4 (id);
...

C session

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
mysql> SELECT * FROM information_schema.INNODB_TRX\G
Empty set (0.00 sec)
mysql> show engine innodb status\G
*************************** 1. row ***************************
...
------------
TRANSACTIONS
------------
Trx id counter 57E66B
Purge done for trx's n:o < 57E667 undo n:o < 0
History list length 1635
LIST OF TRANSACTIONS FOR EACH SESSION:
---TRANSACTION 0, not started
mysql tables in use 1, locked 1
MySQL thread id 3138, OS thread handle 0x41915940, query id 11448 localhost root
---TRANSACTION 0, not started
MySQL thread id 3137, OS thread handle 0x4181e940, query id 11451 localhost root
show engine innodb status
---TRANSACTION 57E668, not started
mysql tables in use 1, locked 1
MySQL thread id 3127, OS thread handle 0x412db940, query id 11449 localhost root Waiting for table metadata lock
alter table t add index idx_4 (id)
...

所以说想快速发现锁的根源,好像还是没有快速便捷的方法,还是只能仔细的看事务表或者innodb status输出了,如果你有更好的办法也麻烦告知Orz。

其他总结

在做实验看文档的过程中顺便复习了下锁,MVCC等一些列的知识,顺便记录下:

  1. 在RR跟RC两种事务隔离级别下,InnoDB支持MVCC,我们可以理解为行级锁的妥协,在很多查询的场景下不加共享锁,实现非堵塞读,也就是说这个时候update跟select是不会互斥的(当然根据不同的事务隔离级别,查询出来的值可能是不一样的)。
  2. 实验中间自己思维绕进死胡同:支持mvcc的模式下select不堵塞update是因为不加S锁,但是为什么依然堵塞DDL?难道还是有加S锁?后面再想想其实引起DDL堵塞的是metadata lock,其实跟S锁无关,是MySQL为了保证事务客串行性(serializability)所做的限制,而可串行性是并行事务正确性的唯一准则。
  3. 在5.6版本以前DDL加的是write allow read lock(因为MySQL是通过重建表+重命名方式实现ddl)。