Basic Implementation of MySQL DISTINCT in Detail

preamble

DISTINCT is actually very similar to the implementation of GROUP BY, except that only one row is taken out of each group after GROUP BY. Therefore, the implementation of DISTINCT is basically the same as the implementation of GROUP BY, without much difference. DISTINCT can also be accomplished by loose index scanning or compact index scanning, but of course, when DISTINCT cannot be accomplished by using indexes alone, MySQL can only accomplish it by using temporary tables.

However, the difference with GROUP BY is that DISTINCT does not require sorting. That is to say, in the case of a DISTINCT query where the index alone cannot be used to complete the operation, MySQL will use the temporary table to "cache" the data, but will not perform a filesort operation on the data in the temporary table.

Of course, if we also use GROUP BY and grouping in DISTINCT, and use aggregation functions such as MAX, there is no way to avoid filesort.

We'll show you how DISTINCT is implemented with a few simple Query examples.

1. First look at the loose index scanning to complete the DISTINCT operation:

sky@localhost : example 11:03:41> EXPLAIN SELECT DISTINCT group_id 
  -> FROM group_messageG
*************************** 1. row ***************************
      id: 1
 SELECT_type: SIMPLE
    table: group_message
     type: range
possible_keys: NULL
     key: idx_gid_uid_gc
   key_len: 4
     ref: NULL
     rows: 10
    Extra: Using index for group-by
1 row in set (0.00 sec)

We can clearly see that the Extra message in the execution plan is "Using index for group-by", what does this mean? Why is it that when I don't perform a GROUP BY operation, the execution plan will tell me that I have performed a GROUP BY through the index here?

In fact, this is related to the implementation principle of DISTINCT, in the process of implementing DISTINCT, it is also necessary to group, and then take out one piece of data from each group and return it to the client. The Extra information here tells us that MySQL completed the entire operation using a loose index scan.

Of course, if MySQL Query Optimizer could be a bit more user-friendly and replace the message with "Using index for distinct", it would be much better and easier for people to understand, huh?

2. Let's go back to the example of scanning through a compact index:

sky@localhost : example 11:03:53> EXPLAIN SELECT DISTINCT user_id 
  -> FROM group_message
  -> WHERE group_id = 2G
*************************** 1. row ***************************
      id: 1
 SELECT_type: SIMPLE
    table: group_message
     type: ref
possible_keys: idx_gid_uid_gc
     key: idx_gid_uid_gc
   key_len: 4
     ref: const
     rows: 4
    Extra: Using WHERE; Using index
1 row in set (0.00 sec)

The display here is exactly the same as the GROUP BY implementation via compact index scanning as well. In fact, during the implementation of this query, MySQL will let the storage engine scan all index keys with group_id = 2 to get all the user_ids, and then utilize the sorted nature of the indexes to retain one piece of information for each index key value of user_id, so that the entire DISTINCT operation can be completed when all index keys with gruop_id = 2 are scanned. DISTINCT operation can be completed after scanning all index keys with gruop_id = 2.

3. Below we are looking at the inability to use the index alone can be accomplished when the DISTINCT is how it will be:

sky@localhost : example 11:04:40> EXPLAIN SELECT DISTINCT user_id 
  -> FROM group_message
  -> WHERE group_id > 1 AND group_id < 10G
*************************** 1. row ***************************
      id: 1
 SELECT_type: SIMPLE
    table: group_message
     type: range
possible_keys: idx_gid_uid_gc
     key: idx_gid_uid_gc
   key_len: 4
     ref: NULL
     rows: 32
    Extra: Using WHERE; Using index; Using temporary
1 row in set (0.00 sec)

When MySQL can't rely on indexes alone to complete a DISTINCT operation, it has to use temporary tables to do so. However, as we can see, there is one difference between MySQL's use of temporary tables for DISTINCT and its handling of GROUP BY, and that is the absence of filesort.

In fact, in MySQL's grouping algorithm, you don't have to sort to complete the grouping operation, as I mentioned in the GROUP BY optimization tip above. In fact, here MySQL achieves the DISTINCT operation without sorting, so the filesort operation is missing.

4. Finally, try combining it with GROUP BY:

sky@localhost : example 11:05:06> EXPLAIN SELECT DISTINCT max(user_id) 
  -> FROM group_message
  -> WHERE group_id > 1 AND group_id < 10
  -> GROUP BY group_idG
*************************** 1. row ***************************
      id: 1
 SELECT_type: SIMPLE
    table: group_message
     type: range
possible_keys: idx_gid_uid_gc
     key: idx_gid_uid_gc
   key_len: 4
     ref: NULL
     rows: 32
    Extra: Using WHERE; Using index; Using temporary; Using filesort
1 row in set (0.00 sec)

Finally, we look at this and GROUP BY used together with the aggregation function example, and the third example above, you can see that there has been more filesort sorting operation, it is because we use the MAX function for this reason. To obtain the MAX value after grouping, and can not use the index to complete the operation, only through the sort.

Since the implementation of DISTINCT is basically similar to the implementation of GROUP BY, this article will not draw diagrams to show the process of implementation

This is the whole content of this article.