9月 20 2014

[MySQL案例]之一把看不见的锁

昨晚半夜处理了一个故障，只是简单SQL执行缺少索引导致CPU飙高，加一条索引就搞定。原来以为事情到此为止，让另外一个同事收尾准备睡觉了，这个时候高潮出现了。刚才处理的是一台slave server，为了保证实例数据对象一致性，让同事先删掉，然后在master server再加这条索引，最后发现add index的操作在slave server一直在等metadata lock释放了，并且在processlist里面居然找不到任何引起表锁的请求。就这样折腾了到凌晨，后面索性就丢在自己等锁释放，差不多过了6分钟创建成功，load再次下降。

分析

事后在5.5的手册中找到了解释：

To ensure transaction serializability, the server must not permit one session to perform a data definition language (DDL) statement on a table that is used in an uncompleted transaction in another session. The server achieves this by acquiring metadata locks on tables used within a transaction and deferring release of those locks until the transaction ends. A metadata lock on a table prevents changes to the table’s structure. This locking approach has the implication that a table that is being used by a transaction within one session cannot be used in DDL statements by other sessions until the transaction ends.

如此一来我们就知道原来是有事务堵塞DDL操作，但是回忆昨晚的操作，我曾经把所有非Sleep的session都杀掉，但是锁依然没有释放，除非是那些Sleep的session在作怪，想想还真有可能。
这里我们做个试验测试下：

A session关闭自动提交，做一个简单的查询后，不主动提交事务。

mysql> set autocommit=0;
Query OK, 0 rows affected (0.00 sec)
mysql> select * from t where id = 8;
+------+------+-------+-------+-------+-------+-------+
| id   | name | name1 | name2 | name3 | name4 | name5 |
+------+------+-------+-------+-------+-------+-------+
|    8 | NULL | NULL  | NULL  | NULL  | NULL  | NULL  |
+------+------+-------+-------+-------+-------+-------+
1 row in set (0.00 sec)
mysql>

B session做一个DDL操作，这时候被堵塞了。

1 2	mysql> alter table t add index idx_1 (id); ...

C session查看MySQL进程表，发现session B被堵塞，session A处于Sleep状态，这样根本看不到任何引起表锁的请求。

mysql> show processlist;
+------+-----------------+-----------+------+---------+---------+---------------------------------+------------------------------------+-----------+---------------+-----------+
| Id   | User            | Host      | db   | Command | Time    | State                           | Info                               | Rows_sent | Rows_examined | Rows_read |
+------+-----------------+-----------+------+---------+---------+---------------------------------+------------------------------------+-----------+---------------+-----------+
|    1 | event_scheduler | localhost | NULL | Daemon  | 5286457 | Waiting on empty queue          | NULL                               |         0 |             0 |         1 | 
| 3127 | root            | localhost | test | Query   |      16 | Waiting for table metadata lock | alter table t add index idx_1 (id) |         0 |             0 |         1 | 
| 3133 | root            | localhost | test | Sleep   |      45 |                                 | NULL                               |         1 |             9 |        10 | 
| 3134 | root            | localhost | test | Query   |       0 | NULL                            | show processlist                   |         0 |             0 |         3 | 
+------+-----------------+-----------+------+---------+---------+---------------------------------+------------------------------------+-----------+---------------+-----------+
4 rows in set (0.00 sec)

为了验证当时数据库是不是就是这样引起DDL的堵塞，我大致抓取了十几秒钟的general_log，果然发现了大量的set autocommit=x/commit的操作，不出意外就是这些processlist看不到的进程堵塞了DDL。

[ root@localhost ]#cat general_log |grep commit |cut -f 4 |sort |uniq -c
    197 commit
    186 SET autocommit=0
    393 SET autocommit=1

怎么快速发现锁的元凶

通过上面的实验我们证实了是因为未提交的事务引起metadata lock，但是怎么快速发现元凶并解决它呢？

查找InnoDB事务表

mysql> SELECT * FROM information_schema.INNODB_TRX\G 
*************************** 1. row ***************************
                    trx_id: 57E651
                 trx_state: RUNNING
               trx_started: 2014-09-21 15:10:47
     trx_requested_lock_id: NULL
          trx_wait_started: NULL
                trx_weight: 0
       trx_mysql_thread_id: 3133
                 trx_query: NULL
       trx_operation_state: NULL
         trx_tables_in_use: 0
         trx_tables_locked: 0
          trx_lock_structs: 0
     trx_lock_memory_bytes: 376
           trx_rows_locked: 0
         trx_rows_modified: 0
   trx_concurrency_tickets: 0
       trx_isolation_level: REPEATABLE READ
         trx_unique_checks: 1
    trx_foreign_key_checks: 1
trx_last_foreign_key_error: NULL
 trx_adaptive_hash_latched: 0
 trx_adaptive_hash_timeout: 10000
1 row in set (0.00 sec)

通过事务表我们确实快速找到了未提交的事务，但是这只是模拟的测试环境，换成是生产环境中不可能只有一张表，而且存在许多并发的事务，所以通过事务表来查找明显就不现实了。

通过InnoDB的状态监控

------------
TRANSACTIONS
------------
Trx id counter 57E655
Purge done for trx's n:o < 57E651 undo n:o < 0
History list length 1641
LIST OF TRANSACTIONS FOR EACH SESSION:
---TRANSACTION 0, not started
MySQL thread id 3134, OS thread handle 0x418e4940, query id 11413 localhost root
show engine innodb status
---TRANSACTION 57E652, not started
mysql tables in use 1, locked 1
MySQL thread id 3127, OS thread handle 0x412db940, query id 11402 localhost root Waiting for table metadata lock
alter table t add index idx_1 (id)
---TRANSACTION 57E651, ACTIVE 426 sec
MySQL thread id 3133, OS thread handle 0x41946940, query id 11401 localhost root
Trx read view will not see trx with id >= 57E652, sees < 57E652

上面是InnoDB status输出的事务信息（用innodb_lock_monitor来调试，发现输出是一样的，可能5.5以后的版本show engine innodb status包含了所有的信息输出了吧），从上面信息我们可以看到线程3217被锁住，但是依然找不到锁的根源，上面只有看到一个活跃的线程3133，或许我们可以推断出来。但是换成生产环境，这又不现实了。

这里我们是使用事务控制来模拟锁，如果是在autocommit=1的情况下，直接加一个显示锁呢？那更悲剧了，事务表跟innodb status都看不到任何信息。

1	lock table t read/write;

A session

mysql> set autocommit=1;
Query OK, 0 rows affected (0.00 sec)
mysql> lock table t read;
Query OK, 0 rows affected (0.00 sec)

B session

1 2	mysql> alter table t add index idx_4 (id); ...

C session

mysql> SELECT * FROM information_schema.INNODB_TRX\G 
Empty set (0.00 sec)
mysql> show engine innodb status\G                   
*************************** 1. row ***************************
...
------------
TRANSACTIONS
------------
Trx id counter 57E66B
Purge done for trx's n:o < 57E667 undo n:o < 0
History list length 1635
LIST OF TRANSACTIONS FOR EACH SESSION:
---TRANSACTION 0, not started
mysql tables in use 1, locked 1
MySQL thread id 3138, OS thread handle 0x41915940, query id 11448 localhost root
---TRANSACTION 0, not started
MySQL thread id 3137, OS thread handle 0x4181e940, query id 11451 localhost root
show engine innodb status
---TRANSACTION 57E668, not started
mysql tables in use 1, locked 1
MySQL thread id 3127, OS thread handle 0x412db940, query id 11449 localhost root Waiting for table metadata lock
alter table t add index idx_4 (id)
...

所以说想快速发现锁的根源，好像还是没有快速便捷的方法，还是只能仔细的看事务表或者innodb status输出了，如果你有更好的办法也麻烦告知Orz。

其他总结

在做实验看文档的过程中顺便复习了下锁，MVCC等一些列的知识，顺便记录下：

在RR跟RC两种事务隔离级别下，InnoDB支持MVCC，我们可以理解为行级锁的妥协，在很多查询的场景下不加共享锁，实现非堵塞读，也就是说这个时候update跟select是不会互斥的（当然根据不同的事务隔离级别，查询出来的值可能是不一样的）。
实验中间自己思维绕进死胡同：支持mvcc的模式下select不堵塞update是因为不加S锁，但是为什么依然堵塞DDL？难道还是有加S锁？后面再想想其实引起DDL堵塞的是metadata lock，其实跟S锁无关，是MySQL为了保证事务客串行性（serializability）所做的限制，而可串行性是并行事务正确性的唯一准则。
在5.6版本以前DDL加的是write allow read lock（因为MySQL是通过重建表+重命名方式实现ddl）。