The search_after paging method determines the position of the next page from the last document of the previous page. Because the cursor is stateless, any inserts, deletes, or updates to the index data during paging are reflected in the results in real time. Note, however, that since each page depends on the last document of the previous page, jumping to an arbitrary page is not possible.
To locate the last document of each page, every document must carry a globally unique sort value. Older official documentation recommended _uid as this unique value (it has since been replaced by _id, and by _shard_doc when a PIT is used); a unique business-layer id also works.
The request above returns, for each document, an array of sort values. These sort values can be passed to the search_after parameter to fetch the next page. For example, we can take the sort values of the last document and pass them to search_after:
Note: when using search_after, the from parameter must be set to 0 or -1.
The disadvantage of search_after is that it cannot jump to an arbitrary page; it can only page backwards one page at a time, and it requires at least one unique, non-repeating field to sort on. It is very similar to the scroll API, but unlike scroll, search_after is stateless and always resolves against the latest version of the searcher, so the sort order may change while paging, depending on updates and deletions to the index.
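The cursor idea described above can be sketched in plain Python. This is a minimal in-memory simulation, not Elasticsearch code: the list stands in for a sorted result set, and the hypothetical `price`/`id` fields play the role of the sort value and the unique tiebreaker.

```python
# Minimal sketch of search_after semantics over an in-memory list:
# documents are pre-sorted by (sort_value, unique_id), and the next
# page starts strictly after the cursor taken from the previous page.

docs = [{"id": i, "price": p} for i, p in enumerate([10, 10, 20, 20, 30, 40, 50])]
# Sort once, the way Elasticsearch returns hits for a sorted query.
docs.sort(key=lambda d: (d["price"], d["id"]))

def fetch_page(search_after=None, size=3):
    """Return the next `size` docs strictly after the (price, id) cursor."""
    if search_after is None:
        remaining = docs
    else:
        remaining = [d for d in docs if (d["price"], d["id"]) > tuple(search_after)]
    return remaining[:size]

page1 = fetch_page()                            # first page: no cursor
cursor = (page1[-1]["price"], page1[-1]["id"])  # sort values of the last hit
page2 = fetch_page(search_after=cursor)         # next page starts after the cursor
```

Because the cursor is just the last hit's sort values, there is no way to compute the cursor for, say, page 50 without walking through the pages before it; that is exactly why random page jumps are impossible.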
1. search_after query
search_after query definition and practical cases
The essence of a search_after query: use a set of sort values from the previous page to retrieve the matching next page.
Precondition: search_after requires that subsequent requests return the same sorted result sequence as the first query. In other words, even if new data is written while paging, those writes must not affect the original result set.
How to achieve it?
You can create a Point In Time (PIT) to preserve the index state at a specific point in time for the duration of the search.
- Point In Time (PIT) is a feature introduced in Elasticsearch 7.10.
- The essence of PIT: a lightweight view that stores the state of indexed data.
The following example illustrates well what a PIT view means.
Create a PIT:

```
POST kibana_sample_data_logs/_pit?keep_alive=1m
```

Get the document count: 14074.

```
POST kibana_sample_data_logs/_count
```

Add one new document:

```
POST kibana_sample_data_logs/_doc/14075
{
  "test": "just testing"
}
```

The total count is now 14075:

```
POST kibana_sample_data_logs/_count
```

Query through the PIT: the count is still 14074, which means the statistics follow the view at the earlier point in time.

```
POST /_search
{
  "track_total_hits": true,
  "query": {
    "match_all": {}
  },
  "pit": {
    "id": "48myAwEXa2liYW5hX3NhbXBsZV9kYXRhX2xvZ3MWM2hGWXpxLXFSSGlfSmZIaXJWN0dxUQAWdG1TOWFMTF9UdTZHdVZDYmhoWUljZwAAAAAAAAEN3RZGOFJCMGVrZVNndTk3U1I0SG81V3R3AAEWM2hGWXpxLXFSSGlfSmZIaXJWN0dxUQAA"
  }
}
```
With a PIT, subsequent search_after queries are based on the PIT view, which effectively guarantees the consistency of the data.
A search_after paging query can be summarized in the following steps.
Step 1: Create a PIT
Step 1: Create a PIT view; this is an indispensable precondition.
```
POST kibana_sample_data_logs/_pit?keep_alive=5m
```

The result is returned as follows:

```
{
  "id" : "48myAwEXa2liYW5hX3NhbXBsZV9kYXRhX2xvZ3MWM2hGWXpxLXFSSGlfSmZIaXJWN0dxUQAWdG1TOWFMTF9UdTZHdVZDYmhoWUljZwAAAAAAAAEg5RZGOFJCMGVrZVNndTk3U1I0SG81V3R3AAEWM2hGWXpxLXFSSGlfSmZIaXJWN0dxUQAA"
}
```

keep_alive=5m is a scroll-like parameter, meaning the view is retained for 5 minutes.
If a request is executed after the 5 minutes have elapsed, an error like the following is reported:

```
{
  "type": "search_context_missing_exception",
  "reason": "No search context found for id [91600]"
}
```
Step 2: Create a basic query
Step 2: Create a basic query statement, where you need to set the conditions for turning pages.
```
GET /_search
{
  "size": 10,
  "query": {
    "match": {
      "host": "elastic"
    }
  },
  "pit": {
    "id": "48myAwEXa2liYW5hX3NhbXBsZV9kYXRhX2xvZ3MWM2hGWXpxLXFSSGlfSmZIaXJWN0dxUQAWdG1TOWFMTF9UdTZHdVZDYmhoWUljZwAAAAAAAAEg5RZGOFJCMGVrZVNndTk3U1I0SG81V3R3AAEWM2hGWXpxLXFSSGlfSmZIaXJWN0dxUQAA",
    "keep_alive": "1m"
  },
  "sort": [
    { "response.keyword": "asc" }
  ]
}
```
With PIT set, you don't need to specify an index when searching.
The id value comes from the id returned in step 1.
sort specifies the field by which the results are sorted.
At the end of each returned document there are two sort values, like this:

```
{
  "sort": [
    "200",
    4
  ]
}
```

Here, "200" is the value of the sort field we specified, arranged in ascending order.
And what does 4 mean?
4 is the implicit sort value: the result of the ascending sort on _shard_doc.
The official documentation calls this implicit field the tiebreaker (the deciding field); here the tiebreaker is _shard_doc.
The essential meaning of the tiebreaker: a unique value per document that guarantees paging neither loses documents nor returns duplicates (whether on the same page or across pages).
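The effect of the tiebreaker can be demonstrated with a small Python simulation (not Elasticsearch code; the `price`/`id` fields are hypothetical stand-ins for a non-unique sort field and a unique document id):

```python
# Why a tiebreaker matters: paging on a non-unique sort value alone
# skips (or could re-read) documents tied with the last value on a page.

docs = sorted(
    [{"id": i, "price": p} for i, p in enumerate([10, 20, 20, 20, 30])],
    key=lambda d: (d["price"], d["id"]),
)

def page_no_tiebreaker(after_price=None, size=2):
    # Cursor is the price alone: every doc still tied with that price is lost.
    pool = docs if after_price is None else [d for d in docs if d["price"] > after_price]
    return pool[:size]

def page_with_tiebreaker(after=None, size=2):
    # Cursor is (price, id): ties are broken by the unique id, nothing is lost.
    pool = docs if after is None else [d for d in docs if (d["price"], d["id"]) > after]
    return pool[:size]

# Without a tiebreaker, the docs with ids 2 and 3 (both price 20) vanish.
p1 = page_no_tiebreaker()
p2 = page_no_tiebreaker(after_price=p1[-1]["price"])
# With the tiebreaker, every doc appears exactly once across pages.
q1 = page_with_tiebreaker()
q2 = page_with_tiebreaker(after=(q1[-1]["price"], q1[-1]["id"]))
```

This is precisely the role _shard_doc plays: it is unique per document, so tied sort values can never cause lost or duplicated pages.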
Step 3: Start turning pages
Step 3: Implement the subsequent page turns.
```
GET /_search
{
  "size": 10,
  "query": {
    "match": {
      "host": "elastic"
    }
  },
  "pit": {
    "id": "48myAwEXa2liYW5hX3NhbXBsZV9kYXRhX2xvZ3MWM2hGWXpxLXFSSGlfSmZIaXJWN0dxUQAWdG1TOWFMTF9UdTZHdVZDYmhoWUljZwAAAAAAAAEg5RZGOFJCMGVrZVNndTk3U1I0SG81V3R3AAEWM2hGWXpxLXFSSGlfSmZIaXJWN0dxUQAA",
    "keep_alive": "1m"
  },
  "sort": [
    { "response.keyword": "asc" }
  ],
  "search_after": [
    "200",
    4
  ]
}
```
For each subsequent page, use search_after to pass the sort values of the last document on the previous page.
The relevant part looks like this:

```
{
  "search_after": [
    "200",
    4
  ]
}
```
Obviously, the search_after query only supports paging backwards.
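The three steps above form a loop: keep requesting pages, feeding the last document's sort values back into search_after, until an empty page comes back. A minimal sketch in Python, again using an in-memory stand-in for the sorted result set (the `response` field echoes the sort field used in the example requests; no live cluster is assumed):

```python
# Sketch of the full search_after page-turning loop: fetch, remember the
# last hit's sort values, fetch again, stop when a page comes back empty.

docs = sorted(
    [{"id": i, "response": r} for i, r in enumerate([200, 200, 200, 404, 404, 503, 503])],
    key=lambda d: (d["response"], d["id"]),
)

def search(search_after=None, size=3):
    """In-memory stand-in for one sorted search request."""
    pool = docs if search_after is None else [
        d for d in docs if (d["response"], d["id"]) > search_after
    ]
    return pool[:size]

all_hits, cursor = [], None
while True:
    hits = search(search_after=cursor, size=3)
    if not hits:  # an empty page means the result set is exhausted
        break
    all_hits.extend(hits)
    cursor = (hits[-1]["response"], hits[-1]["id"])  # sort values of last hit
```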
search_after query advantages and disadvantages and applicable scenarios
search_after advantages
- Not strictly limited by max_result_window; you can page backwards without limit.
- Note on "not strictly": a single request still cannot fetch more than max_result_window documents, but the total result set across all pages can exceed it.
search_after disadvantages
- Only backward paging is supported; random page jumps are not.
search_after applicable scenarios
- Similar to a feed-style page search, e.g. Toutiao's /search.
- Because it does not support random page jumps, it is better suited to mobile application scenarios.
2. Scroll traversal query
Scroll traversal query definition and practical cases
Compared with from + size and search_after, which return one page of data, the scroll API can retrieve large numbers of results (even all of them) from a single search request, much like a cursor in a traditional database.
If from + size and search_after are regarded as near-real-time request processing, scroll traversal is clearly non-real-time; when the data volume is large, the response time can be long.
The execution steps of scroll core are as follows:
Step 1: Specify the search statement to set the scroll context retention time at the same time.
In fact, scroll includes by default the view/snapshot capability that PIT provides for search_after.
The results returned by a scroll request reflect the state of the index at the moment the initial search request was issued, like a snapshot taken at that time; subsequent changes to documents (writes, updates, or deletes) only affect later search requests.

```
POST kibana_sample_data_logs/_search?scroll=3m
{
  "size": 100,
  "query": {
    "match": {
      "host": "elastic"
    }
  }
}
```
Step 2: Turn the page back and continue to get the data until there is no result to return.
```
POST _search/scroll
{
  "scroll": "3m",
  "scroll_id": "FGluY2x1ZGVfY29udGV4dF91dWlkDXF1ZXJ5QW5kRmV0Y2gBFkY4UkIwZWtlU2d1OTdTUjRIbzVXdHcAAAAAAAGmkBZ0bVM5YUxMX1R1Nkd1VkNiaGhZSWNn"
}
```
The scroll_id value is the result value returned in step 1.
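The two steps above can be sketched as a loop in Python. StubClient here is a hypothetical in-memory stand-in, not the real Elasticsearch client API: it only mimics the protocol shape (initial search returns a scroll_id plus the first page; each follow-up call returns the next page from the frozen snapshot).

```python
# Sketch of the scroll loop: one search opens a scroll context, then the
# returned scroll_id is passed back repeatedly until a page comes back empty.

class StubClient:
    """Hypothetical stand-in simulating scroll over a frozen snapshot."""

    def __init__(self, docs, size):
        self._docs, self._size, self._pos = docs, size, 0

    def search(self, scroll):
        # Initial request: returns the first page plus a scroll_id.
        self._pos = self._size
        return {"_scroll_id": "ctx-1", "hits": self._docs[:self._size]}

    def scroll(self, scroll_id, scroll):
        # Follow-up request: the next page from the frozen snapshot.
        page = self._docs[self._pos:self._pos + self._size]
        self._pos += self._size
        return {"_scroll_id": scroll_id, "hits": page}

client = StubClient(docs=list(range(10)), size=4)
resp = client.search(scroll="3m")
collected = list(resp["hits"])
while True:
    resp = client.scroll(scroll_id=resp["_scroll_id"], scroll="3m")
    if not resp["hits"]:  # an empty page means the traversal is finished
        break
    collected.extend(resp["hits"])
```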
Advantages and disadvantages of Scroll traversal query and applicable scenarios
Advantages of scroll query
- Supports full traversal.
- Note: the size of a single scroll batch still cannot exceed max_result_window.
Scroll query disadvantages
- The response time is not real-time.
- Preserving the context requires sufficient heap memory space.
Scroll query applicable scenarios
- Iterating over the full data set, or a very large result set, rather than serving paged queries.
- The official documentation emphasizes that scroll is no longer recommended for deep pagination; to page through more than the top 10,000 results, use PIT + search_after instead.
Summary

- From + size: scenarios that need random jumps to different pages (like mainstream search engines) within the top 10,000 documents.
- search_after: scenarios that only need to page backwards, including beyond the top 10,000 documents.
- Scroll: scenarios that need to traverse the full data set.
- max_result_window: raising it treats the symptom, not the root cause; it is not recommended to raise it excessively.
The above is my personal experience. I hope it gives you a useful reference, and I appreciate your support.