"Not Exists" なクエリの最適化

某所でのネタだったんですが、少し追加情報があったのでここに書いてみます。

以下、mysql4.1での話。それ以外では、確認してないどころかどうなのか想像すらつきません(postgresとかほとんど触ったことない)。

なぞなぞ

次のようなテーブルがあるとします。

CREATE TABLE  (
   user_id INT NOT NULL,
   item_id INT NOT NULL,
   price   INT NOT NULL,
   PRIMARY KEY (user_id, item_id)
);

mysql> select * from test;
+---------+---------+-------+
| user_id | item_id | price |
+---------+---------+-------+
|       1 |     100 |    30 |
|       1 |     200 |    80 |
|       2 |     200 |    80 |
+---------+---------+-------+
3 rows in set (0.00 sec)

で、

item_id=200を買ったけどitem_id=100を買っていないユーザを求めよ

を1クエリでするにはどうすればいいでしょう?

答え

いきなりですがこんなかんじ。

mysql> select t1.user_id
    -> from test as t1 left join test as t2
    ->     on t1.user_id = t2.user_id and t1.item_id = 200 and t2.item_id = 100
    -> where t1.item_id = 200 and t2.item_id is null;
+---------+
| user_id |
+---------+
|       2 |
+---------+
1 row in set (0.00 sec)

なんで?

inner join (from t1, t2 とならべるjoin) と left join の違いを考えれば、(いわれてみれば)あたりまえなかんじです。

inner join の場合

mysql> select * from test as t1, test as t2 where t1.user_id = t2.user_id and t1.item_id = 200 and t2.item_id = 100;
+---------+---------+-------+---------+---------+-------+
| user_id | item_id | price | user_id | item_id | price |
+---------+---------+-------+---------+---------+-------+
|       1 |     200 |    80 |       1 |     100 |    30 |
+---------+---------+-------+---------+---------+-------+
1 row in set (0.00 sec)

- where の条件にかいたものだけが結果セットに入ってきます。この場合、item_id が 100 と 200 の両方を買った人が選べることになります。

left join の場合

mysql> select * from test as t1 left join test as t2 on t1.user_id = t2.user_id and t1.item_id = 200 and t2.item_id = 100; 
+---------+---------+-------+---------+---------+-------+
| user_id | item_id | price | user_id | item_id | price |
+---------+---------+-------+---------+---------+-------+
|       1 |     100 |    30 |    NULL |    NULL |  NULL |
|       1 |     200 |    80 |       1 |     100 |    30 |
|       2 |     200 |    80 |    NULL |    NULL |  NULL |
+---------+---------+-------+---------+---------+-------+
3 rows in set (0.00 sec)

- on の条件にかいたものは、該当するレコードが join するテーブル(t2)にあればそのレコードを、なければ NULL がつまったレコードを t1 に join します。

ということで、left join した結果からそれぞれ t1 の item_id は 200 で t2 の item_id は NULL のものを where で選んであげればよいことになります。

on と where の使いわけ

いままで on と where に書く条件で、どちらをどちらに書けばいいのかが明確に決まる場面にあまり遭遇していませんでした。しかし、この場合は t2.item_id = 200 と t2.item_id is null という矛盾した条件をそれぞれに適切に書いてあげなければなりません。

もちろんインデックスが効くから、とかいろいろ状況はあるので一概には言えないんですが、こういうときにきちんと使い分けるんだな、と実感しました。

本題

さて、答えのところに書いたクエリをexplainしてみるとどうなるでしょう?

mysql> explain select * from test as t1 left join test as t2 on t1.user_id = t2.user_id and t1.item_id = 200 and t2.item_id = 100 where t1.item_id = 200 and t2.item_id is null\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: t1
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 3
        Extra: Using where
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: t2
         type: eq_ref
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 8
          ref: aho.t1.user_id,const
         rows: 1
        Extra: Using where; Not exists
2 rows in set (0.00 sec)

Extraのところに、"Not exists" という見慣れない文字列が見えました。

http://dev.mysql.com/doc/refman/4.1/ja/explain.html (日本語) によると、

Not exists
MySQL でクエリに対する LEFT JOIN 最適化が実行でき、LEFT JOIN に一致するレコードが 1 つ検索されると、前のレコードの組み合わせによるその後のテーブルのレコードについては調べないことを示す。
この例は以下のとおりである。
SELECT * FROM t1 LEFT JOIN t2 ON t1.id=t2.id
WHERE t2.id IS NULL;
t2.id が NOT NULL で定義されているとする。この場合、MySQL で t1 がスキャンされ、t1.id で t2 内のレコードのルックアップが行われる。MySQL によって t2 内のマッチするレコードが検索されると、t2 は t2.id ではないと認識され、t2 内の同じ id を持つ残りのレコードのスキャンは行われない。言い換えると、t2 にあるマッチするレコードの数に関わらず、MySQL で実行が必要なことは t1 のレコードのそれぞれに対して、t2 のルックアップを 1 回実行することだけである。

ほとんど似たようなクエリが書いてありました。けど、はっきり言って何いってんだかさっぱりわかりません。

http://dev.mysql.com/doc/refman/4.1/en/explain.html (英語; 原文?) のほうは

Not exists
MySQL was able to do a LEFT JOIN optimization on the query and does not examine more rows in this table for the previous row combination after it finds one row that matches the LEFT JOIN criteria. Here is an example of the type of query that can be optimized this way:
SELECT * FROM t1 LEFT JOIN t2 ON t1.id=t2.id
WHERE t2.id IS NULL;
Assume that t2.id is defined as NOT NULL. In this case, MySQL scans t1 and looks up the rows in t2 using the values of t1.id. If MySQL finds a matching row in t2, it knows that t2.id can never be NULL, and does not scan through the rest of the rows in t2 that have the same id value. In other words, for each row in t1, MySQL needs to do only a single lookup in t2, regardless of how many rows actually match in t2.

キモは "it knows that t2.id can never be NULL" ってことですね。

つまり、この "Not Exists" は文字どおり「存在しないもの」(item_id=200はあるけどitem_id=100はない、みたいな)を求めるための最適化をすると表明しているわけです。

素直には

join on の条件でまずは土台となるテーブルを構成する
where の条件で土台のテーブルからフィルタする

という2段階を経ているように思うわけですが、「んなアホなことしねーよ」ってことなんですね!

いや、ほんとRDBの世界って奥が深いと思った。すごい。