[MySQL SQL优化系列]之分页查询

记得在很久以前我在公司内部做过一次MySQL优化分享里面说到一个使用延迟关联实现排序分页类型SQL的优化,这个案例在《高性能MySQL》的第二版还是第三版也有提及。

延迟关联:通过覆盖索引返回所需数据行的主键,再通过主键获取所需数据。

这里我们来看一个使用延迟关联优化排序分页SQL的案例:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
mysql> select sql_no_cache * from t_user_log where appname = '发号中心' order by logintime limit 1000000,1;
+---------+-----------+--------------------------+--------------+-----------------+------------------------------------+------------+-----------+-----------+
| id | uid | username | appname | loginip | loginlocation | logintime | logintype | useragent |
+---------+-----------+--------------------------+--------------+-----------------+------------------------------------+------------+-----------+-----------+
| 2246895 | 114790028 | xxxxxxxxxxxxxx@17173.com | 发号中心 | 220.172.xxx.xxx | 贵州省黔南州都匀市 电信 | 1407055334 | 1 | |
+---------+-----------+--------------------------+--------------+-----------------+------------------------------------+------------+-----------+-----------+
1 row in set (2.05 sec)
mysql> select sql_no_cache a.* from t_user_log a inner join (select id from t_user_log where appname = '发号中心' order by logintime limit 1000000,1) b on a.id = b.id;
+---------+-----------+--------------------------+--------------+-----------------+------------------------------------+------------+-----------+-----------+
| id | uid | username | appname | loginip | loginlocation | logintime | logintype | useragent |
+---------+-----------+--------------------------+--------------+-----------------+------------------------------------+------------+-----------+-----------+
| 2246895 | 114790028 | xxxxxxxxxxxxxx@17173.com | 发号中心 | 220.172.xxx.xxx | 贵州省黔南州都匀市 电信 | 1407055334 | 1 | |
+---------+-----------+--------------------------+--------------+-----------------+------------------------------------+------------+-----------+-----------+
1 row in set (0.87 sec)

确实使用延迟关联以后SQL效率提升了约135%左右。

那么问题来了

  1. 延迟关联的用法在InnoDB跟MyISAM两种引擎上是否有区别?
  2. 延迟关联是否优化排序分页SQL的最优方法?

延迟关联的方法在InnoDB与MyISAM两种引擎的区别

上面我们说了什么是延迟关联,但是很明显跟InnoDB应该是没有关系的,所以大体来说是一样的,但是严格来说还是存在一点点小区别。区别在于覆盖索引的范围。因为InnoDB是主键聚合,所以任何二级索引的尾部都会加上pk,但是MyISAM就没有这个特性了,因此MyISAM如果需要使用延迟关联,那么覆盖索引的最后需要加上主键字段,否则效率会大打折扣。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
mysql> select sql_no_cache * from log_201409 where lastloginchannel = '1018' limit 1000000,1;
+---------------------------+--------------+---------------------+------------------+---------------------+---------+
| username | loginip | logintime | lastloginchannel | regtime | id |
+---------------------------+--------------+---------------------+------------------+---------------------+---------+
| xxxxxxxxxxxxxxx@17173.com | 60.167.xx.xx | 2014-09-03 22:24:10 | 1018 | 0000-00-00 00:00:00 | 2353297 |
+---------------------------+--------------+---------------------+------------------+---------------------+---------+
1 row in set (1.41 sec)
mysql> select sql_no_cache a.* from log_201409 a inner join (select id from log_201409 where lastloginchannel = '1018' limit 1000000,1 ) b on a.id = b.id;
+---------------------------+--------------+---------------------+------------------+---------------------+---------+
| username | loginip | logintime | lastloginchannel | regtime | id |
+---------------------------+--------------+---------------------+------------------+---------------------+---------+
| xxxxxxxxxxxxxxx@17173.com | 60.167.xx.xx | 2014-09-03 22:24:10 | 1018 | 0000-00-00 00:00:00 | 2353297 |
+---------------------------+--------------+---------------------+------------------+---------------------+---------+
1 row in set (0.95 sec)
//覆盖索引包含主键
mysql> select sql_no_cache a.* from log_201409 a inner join (select id from log_201409 force index (i_lastloginchannel) where lastloginchannel = '1018' limit 1000000,1 ) b on a.id = b.id;
+---------------------------+--------------+---------------------+------------------+---------------------+---------+
| username | loginip | logintime | lastloginchannel | regtime | id |
+---------------------------+--------------+---------------------+------------------+---------------------+---------+
| xxxxxxxxxxxxxxx@17173.com | 60.167.xx.xx | 2014-09-03 22:24:10 | 1018 | 0000-00-00 00:00:00 | 2353297 |
+---------------------------+--------------+---------------------+------------------+---------------------+---------+
1 row in set (1.32 sec)
// 覆盖索引不包含主键

比延迟关联更好的解决方案

延迟关联避免了不必要的数据页扫描,但是避免不了索引扫描,因此在limit m,n的m值很大的时候,优化效果还是不尽如人意的。这个我看下执行计划就知道了:

1
2
3
4
5
6
7
8
9
mysql> explain select sql_no_cache * from t_user_log a inner join (select id from t_user_log where appname = '发号中心' order by logintime limit 1000000,1) b on a.id = b.id;
+----+-------------+------------+--------+-------------------------------+-----------------------+---------+-------+----------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+--------+-------------------------------+-----------------------+---------+-------+----------+--------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 1000001 | NULL |
| 1 | PRIMARY | a | eq_ref | PRIMARY | PRIMARY | 4 | b.id | 1 | NULL |
| 2 | DERIVED | t_user_log | ref | appname,idx_appname_logintime | idx_appname_logintime | 152 | const | 14672600 | Using where; Using index |
+----+-------------+------------+--------+-------------------------------+-----------------------+---------+-------+----------+--------------------------+
3 rows in set (0.00 sec)

后面在@叶金荣的博客里面看到有个同学提出在雅虎一种方法来实现排序分页的SQL优化的,用他的方法我实际测试了下:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
mysql> select sql_no_cache id,uid,loginlocation,appname,logintime from t_user_log force index (idx_logintime) order by logintime limit 1000005,5;
+---------+-----------+---------------------------------------------------------------------+--------------+------------+
| id | uid | loginlocation | appname | logintime |
+---------+-----------+---------------------------------------------------------------------+--------------+------------+
| 1000006 | 111620882 | 亚太地区 | | 1406949591 |
| 1000007 | 114237512 | 美国 | | 1406949591 |
| 1000008 | 108382349 | 湖南省 电信 | 发号中心 | 1406949591 |
| 1000009 | 108410167 | 德国 | 发号中心 | 1406949591 |
| 1000013 | 60018858 | 北京市 北京搜狐互联网信息服务有限公司电信节点 | 17173邮箱 | 1406949591 |
+---------+-----------+---------------------------------------------------------------------+--------------+------------+
5 rows in set (1.44 sec)
mysql> select sql_no_cache m2.id,m2.uid,m2.loginlocation,m2.appname,m2.logintime from t_user_log m1 , t_user_log m2
where m1.id = m2.id and m1.logintime >= 1406949591 and (m1.id > 1000005 or m1.logintime > 1406949591)
order by m1.logintime,m1.id limit 5;
+---------+-----------+---------------------------------------------------------------------+--------------+------------+
| id | uid | loginlocation | appname | logintime |
+---------+-----------+---------------------------------------------------------------------+--------------+------------+
| 1000006 | 111620882 | 亚太地区 | | 1406949591 |
| 1000007 | 114237512 | 美国 | | 1406949591 |
| 1000008 | 108382349 | 湖南省 电信 | 发号中心 | 1406949591 |
| 1000009 | 108410167 | 德国 | 发号中心 | 1406949591 |
| 1000013 | 60018858 | 北京市 北京搜狐互联网信息服务有限公司电信节点 | 17173邮箱 | 1406949591 |
+---------+-----------+---------------------------------------------------------------------+--------------+------------+
5 rows in set (0.00 sec)

哇哦,效率惊人啊。它的核心思想就是:通过extra过滤配合ID扫描,避免大量的回表操作,这样就达到了要取多少条,就扫描多少条的效果。但是这种方法有几个硬限制:
1. 只能适用于没有任何条件的排序分页
2. 只能适用于固定的业务场景下,否则还要再次计算某一分页的minvalue_col1跟minvalue_id的值反而需要更大的开销

总结

回过头来我们总结下,无论是延迟关联或者是雅虎提出的方法,都是在于尽量少的扫描数据页来实现排序分页的优化。因此我们建议大分页SQL需要优化的情况下:

  1. 一般情况下建议使用延迟关联这种方法
  2. 如果有特殊的业务场景(不需要使用条件过滤,可以直接固定的算法算出minvalue_col1跟minvalue_id),那么雅虎的做法值得你去借鉴

btw,优化不一定要死死遵循一定的原理,咱们的原则是尽量少的扫描数据页,在顺序读且m值不大的时候,全表扫描可能是一个更好的选择:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
mysql> select sql_no_cache * from log_201409 where lastloginchannel = '1018' limit 1000000,1;
+---------------------------+--------------+---------------------+------------------+---------------------+---------+
| username | loginip | logintime | lastloginchannel | regtime | id |
+---------------------------+--------------+---------------------+------------------+---------------------+---------+
| xxxxxxxxxxxxxxx@17173.com | 60.167.xx.xx | 2014-09-03 22:24:10 | 1018 | 0000-00-00 00:00:00 | 2353297 |
+---------------------------+--------------+---------------------+------------------+---------------------+---------+
1 row in set (1.38 sec)
mysql> select sql_no_cache * from log_201409 ignore index (i_lastloginchannel,idx_1) where lastloginchannel = '1018' limit 1000000,1;
+---------------------------+--------------+---------------------+------------------+---------------------+---------+
| username | loginip | logintime | lastloginchannel | regtime | id |
+---------------------------+--------------+---------------------+------------------+---------------------+---------+
| xxxxxxxxxxxxxxx@17173.com | 60.167.xx.xx | 2014-09-03 22:24:10 | 1018 | 0000-00-00 00:00:00 | 2353297 |
+---------------------------+--------------+---------------------+------------------+---------------------+---------+
1 row in set (0.77 sec)
//直接禁用lastloginchannel字段相关的索引,强制走全表扫描,效率比上面案例中的使用延迟关联还来的更快

参考内容:
http://imysql.com/2014/07/26/mysql-optimization-case-paging-optimize.shtml
http://www.vmcd.org/2014/07/advance-for-mysql-pagination/