Optimieren Sie die gruppenweise Maximalabfrage

Question

Jun 16, 2014, 02:42 PM

query-optimization greatest-n-per-group sql postgresql groupwise-maximum

Optimieren Sie die gruppenweise Maximalabfrage

select * 
from records 
where id in ( select max(id) from records group by option_id )

Diese Abfrage funktioniert auch bei Millionen von Zeilen einwandfrei. Wie Sie jedoch aus dem Ergebnis der EXPLAIN-Anweisung ersehen können:

                                               QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------
Nested Loop  (cost=30218.84..31781.62 rows=620158 width=44) (actual time=1439.251..1443.458 rows=1057 loops=1)
->  HashAggregate  (cost=30218.41..30220.41 rows=200 width=4) (actual time=1439.203..1439.503 rows=1057 loops=1)
     ->  HashAggregate  (cost=30196.72..30206.36 rows=964 width=8) (actual time=1438.523..1438.807 rows=1057 loops=1)
           ->  Seq Scan on records records_1  (cost=0.00..23995.15 rows=1240315 width=8) (actual time=0.103..527.914 rows=1240315 loops=1)
->  Index Scan using records_pkey on records  (cost=0.43..7.80 rows=1 width=44) (actual time=0.002..0.003 rows=1 loops=1057)
     Index Cond: (id = (max(records_1.id)))
Total runtime: 1443.752 ms

(cost=0.00..23995.15 rows=1240315 width=8) <- Hier heißt es, dass alle Zeilen gescannt werden und das ist offensichtlich ineffizient.

Ich habe auch versucht, die Abfrage neu zu ordnen:

select r.* from records r
inner join (select max(id) id from records group by option_id) r2 on r2.id= r.id;

                                               QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------

Nested Loop  (cost=30197.15..37741.04 rows=964 width=44) (actual time=835.519..840.452 rows=1057 loops=1)
->  HashAggregate  (cost=30196.72..30206.36 rows=964 width=8) (actual time=835.471..835.836 rows=1057 loops=1)
     ->  Seq Scan on records  (cost=0.00..23995.15 rows=1240315 width=8) (actual time=0.336..348.495 rows=1240315 loops=1)
->  Index Scan using records_pkey on records r  (cost=0.43..7.80 rows=1 width=44) (actual time=0.003..0.003 rows=1 loops=1057)
     Index Cond: (id = (max(records.id)))
Total runtime: 840.809 ms

(cost=0.00..23995.15 rows=1240315 width=8) <- Alle Zeilen werden noch gescannt.

Ich habe es mit und ohne Index anprobiert(option_id), (option_id, id), (option_id, id desc)hatte keiner von ihnen Auswirkungen auf den Abfrageplan.

Gibt es eine Möglichkeit, eine gruppenweise Maximalabfrage in Postgres auszuführen, ohne alle Zeilen zu scannen?

Was ich programmatisch suche, ist ein Index, der die maximale ID für jeden speichertoption_id wie sie in die Datensatztabelle eingefügt werden. Auf diese Weise muss ich beim Abfragen der Maximalwerte für option_ids nur so oft Indexdatensätze scannen, wie es unterschiedliche option_ids gibt.

Ich habe gesehenselect distinct on Antworten von hochrangigen Nutzern aus der ganzen Welt (danke an @Clodoaldo Neto, der mir Schlüsselwörter für die Suche gegeben hat). Hier ist, warum es nicht funktioniert:

create index index_name on records(option_id, id desc)

select distinct on (option_id) *
from records
order by option_id, id desc
                                               QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------
Unique  (cost=0.43..76053.10 rows=964 width=44) (actual time=0.049..1668.545 rows=1056 loops=1)
  ->  Index Scan using records_option_id_id_idx on records  (cost=0.43..73337.25 rows=1086342 width=44) (actual time=0.046..1368.300 rows=1086342 loops=1)
Total runtime: 1668.817 ms

Das ist großartig, es wird ein Index verwendet. Die Verwendung eines Index zum Durchsuchen aller IDs ist jedoch wenig sinnvoll. Nach meinen Ausführungen ist es tatsächlich langsamer als ein einfacher sequentieller Scan.

Interessanterweise ist MySQL 5.5 in der Lage, die Abfrage einfach über einen Index zu optimierenrecords(option_id, id)

mysql> select count(1) from records;

+----------+
| count(1) |
+----------+
|  1086342 |
+----------+

1 row in set (0.00 sec)

mysql> explain extended select * from records
       inner join ( select max(id) max_id from records group by option_id ) mr
                                                      on mr.max_id= records.id;

+------+----------+--------------------------+
| rows | filtered | Extra                    |
+------+----------+--------------------------+
| 1056 |   100.00 |                          |
|    1 |   100.00 |                          |
|  201 |   100.00 | Using index for group-by |
+------+----------+--------------------------+

3 rows in set, 1 warning (0.02 sec)