
A detailed guide to Elasticsearch deep pagination (pitfalls and error handling)

1. Reproducing the problem: why a query triggers "Result window is too large"

When we use the traditional paging parameters from and size in Elasticsearch, any query where from + size > 10000 fails immediately with the following exception:

{
  "error": {
    "root_cause": [{
      "type": "illegal_argument_exception",
      "reason": "Result window is too large, from + size must be <= 10000"
    }]
  }
}

Root cause:

Elasticsearch by default limits the number of documents a single query can page through to 10,000 (the index.max_result_window setting). For a deep page (for example documents 10,001-10,100), the coordinating node has to fetch the top 10,100 documents from every shard, sort them globally, and then cut out the requested page, so memory and CPU consumption explode as the depth grows.
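
To make the failure concrete, here is a minimal sketch of a request whose from + size exceeds the 10,000 window (it assumes restHighLevelClient is an already initialized Java high-level REST client; the index name is illustrative):

public void deepPageThatFails() throws IOException {
    SearchRequest request = new SearchRequest("logs");
    SearchSourceBuilder source = new SearchSourceBuilder()
            .from(10000)   // skip the first 10,000 documents
            .size(100);    // ...and ask for the next 100, so from + size > 10000
    request.source(source);

    // Fails with an exception whose root cause is "Result window is too large"
    restHighLevelClient.search(request, RequestOptions.DEFAULT);
}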

2. Solution comparison: Which solution is suitable for your scenario

| Plan | Principle | Advantage | Shortcoming | Applicable scenarios |
| --- | --- | --- | --- | --- |
| Adjust max_result_window | Directly raise the paging window in the index settings | Simple to implement, no code changes | High memory risk, only suitable for small data volumes | Small data sets (≤100,000 documents) |
| Scroll API | Maintains a query context via a snapshot mechanism and pulls data in batches | Supports massive data export | Poor real-time behavior, high resource consumption | Batch export / offline tasks |
| Search After | Uses the sort values of the last document on the previous page as a cursor, avoiding from accumulation | Best performance, supports real-time pagination | A global sort field must be defined | Real-time pagination on the C-end (e.g. browsing list pages) |

3. Detailed solutions and code implementation

1. Brute-force expansion: adjust max_result_window (not recommended)

Implementation steps:

# Dynamically modify the index settings (preserve_existing=true keeps settings that are already set)
PUT /your_index/_settings?preserve_existing=true
{
  "index": {
    "max_result_window": "20000"  # set a larger window
  }
}
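
The same change can also be made from Java. Below is a minimal sketch using the high-level REST client (restHighLevelClient and the index name your_index are assumptions carried over from the request above):

public void raiseResultWindow() throws IOException {
    // Raise index.max_result_window for your_index to 20,000
    UpdateSettingsRequest request = new UpdateSettingsRequest("your_index");
    request.settings(Settings.builder().put("index.max_result_window", 20000));
    restHighLevelClient.indices().putSettings(request, RequestOptions.DEFAULT);
}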

Core problems:

  • The official documentation explicitly warns that raising this limit can cause OOM (out-of-memory) errors and even node failures
  • During deep paging the coordinating node still has to load all from + size documents into memory before returning a page, so performance degrades sharply as the depth grows
  • Only suitable for temporary testing or very small data sets (for example an admin backend exporting on the order of 100,000 records)

2. Batch export method: Scroll API (suitable for offline scenarios)

Implementation principle:

The scroll parameter creates a snapshot of the query context; each subsequent request passes the returned scroll_id to pull the next batch, so the sort does not have to be recomputed from scratch every time.

Java code examples:

// restHighLevelClient is an initialized RestHighLevelClient (e.g. injected via Spring)
public JSONArray scrollQuery(JSONObject params) throws IOException {
    JSONArray result = new JSONArray();
    String scrollId = null;

    try {
        // Initialize the scroll query (keep the snapshot alive for 10 minutes)
        SearchRequest searchRequest = new SearchRequest("logs");
        SearchSourceBuilder sourceBuilder = new SearchSourceBuilder();
        sourceBuilder.size(1000);
        sourceBuilder.query(QueryBuilders.matchAllQuery());

        searchRequest.source(sourceBuilder);
        searchRequest.scroll(TimeValue.timeValueMinutes(10));

        // The first request returns the scroll_id
        SearchResponse response = restHighLevelClient.search(searchRequest, RequestOptions.DEFAULT);
        scrollId = response.getScrollId();
        addHits(result, response);

        // Keep pulling batches until no more hits are returned
        while (true) {
            SearchScrollRequest scrollRequest = new SearchScrollRequest(scrollId);
            scrollRequest.scroll(TimeValue.timeValueMinutes(10));
            response = restHighLevelClient.scroll(scrollRequest, RequestOptions.DEFAULT);

            if (response.getHits().getHits().length == 0) break;
            addHits(result, response);
            scrollId = response.getScrollId();
        }
    } finally {
        // Clear the scroll context (mandatory: it keeps consuming resources until it expires)
        if (scrollId != null) {
            ClearScrollRequest clearRequest = new ClearScrollRequest();
            clearRequest.addScrollId(scrollId);
            restHighLevelClient.clearScroll(clearRequest, RequestOptions.DEFAULT);
        }
    }
    return result;
}

// Convert each hit's _source into a JSONObject and append it to the result array
private void addHits(JSONArray result, SearchResponse response) {
    for (SearchHit hit : response.getHits().getHits()) {
        result.add(JSONObject.parseObject(hit.getSourceAsString()));
    }
}

Key Issues:

  • Every subsequent request must carry the scroll_id, and the scroll context holds resources on the data nodes until it expires or is cleared, so memory pressure grows with the number of open contexts
  • Results come from a snapshot taken at the first request, so documents updated or deleted afterwards are not reflected (the data is not real-time)

3. Real-time pagination method: Search After (recommended solution)

Implementation principle:

Record the sort values (such as a timestamp or a unique ID) of the last document on the previous page and pass them to the next query; Elasticsearch then starts directly after that position instead of scanning and discarding everything before it.

Java code implementation:

public JSONObject searchData(JSONObject queryConditionsParam) {
    int pageSize = queryConditionsParam.getIntValue("pageSize");
    Object[] searchAfter = null;

    // Extract the cursor (the sort values of the last document on the previous page)
    if (queryConditionsParam.containsKey("search_after")) {
        JSONArray searchAfterArray = queryConditionsParam.getJSONArray("search_after");
        searchAfter = searchAfterArray.toArray();
    }

    SearchRequest searchRequest = new SearchRequest("my_log");
    SearchSourceBuilder sourceBuilder = new SearchSourceBuilder();

    // Key configuration: the sort field(s) must correspond to the search_after values
    sourceBuilder.sort("created_start_time", SortOrder.DESC);
    if (searchAfter != null) {
        sourceBuilder.searchAfter(searchAfter);
    }
    sourceBuilder.size(pageSize); // No from parameter is needed

    // Build the query conditions (example: filter by log ID)
    BoolQueryBuilder boolQueryBuilder = QueryBuilders.boolQuery();
    boolQueryBuilder.filter(QueryBuilders.termQuery("log_id", queryConditionsParam.getString("log_id")));
    // Other complex conditions can be added here...
    sourceBuilder.query(boolQueryBuilder);
    searchRequest.source(sourceBuilder);

    try {
        SearchResponse response = restHighLevelClient.search(searchRequest, RequestOptions.DEFAULT);
        return buildResult(response); // Encapsulate the result and return the cursor
    } catch (IOException e) {
        log.error("ES query failed", e);
        throw new RuntimeException("Query exception");
    }
}

// Result encapsulation: extract the cursor and return it as the parameter for the next page
private JSONObject buildResult(SearchResponse response) {
    JSONObject result = new JSONObject();
    JSONArray hits = new JSONArray();
    Object[] nextCursor = null;

    for (SearchHit hit : response.getHits().getHits()) {
        hits.add(JSONObject.parseObject(hit.getSourceAsString()));
        // The sort values of the last hit become the cursor for the next page
        if (hit.getSortValues().length > 0) {
            nextCursor = hit.getSortValues();
        }
    }

    result.put("data", hits);
    result.put("totalCount", response.getHits().getTotalHits().value);
    if (nextCursor != null) {
        result.put("search_after", new JSONArray(Arrays.asList(nextCursor))); // cursor for the next query
    }
    return result;
}
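
For illustration, a hypothetical caller can page through the whole result set by feeding the returned cursor back into the next request (the parameter names match the sketch above; process() stands in for your business logic):

// Keep requesting pages until no more data or no cursor is returned
JSONObject params = new JSONObject();
params.put("pageSize", 100);
params.put("log_id", "order-service");   // hypothetical filter value

while (true) {
    JSONObject page = searchData(params);
    JSONArray data = page.getJSONArray("data");
    if (data.isEmpty()) break;                       // no more results

    process(data);                                   // hypothetical business handling

    if (!page.containsKey("search_after")) break;    // last page reached
    params.put("search_after", page.getJSONArray("search_after"));
}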

Performance Advantages:

• No deep-paging overhead: each query fetches only the current page, and the full data set is never scanned

• Real-time results: queries run against the live index rather than a frozen snapshot, so newly indexed data is visible after the next refresh

• Low resource consumption: memory usage grows with the page size, not with the total data volume

4. Plan selection decision tree

• Data volume ≤ 100,000 documents → adjust max_result_window (quickest to implement)

• Full data export required → Scroll API (combined with asynchronous tasks)

• Real-time C-end interaction → Search After (best practice)

5. Pitfall avoidance guide

1. Cursor invalidation scenarios:

If documents are updated or deleted between requests, a previously returned cursor may no longer line up with the result set (evaluate this against your business scenario)

Avoid using search_after on fields that are updated frequently

2. Pagination depth limit:

Even with search_after, it is recommended to cap the maximum paging depth (for example, 1,000 pages) to prevent malicious requests; see the sketch after this list

3. Monitoring and alerting:

Monitor the frequency of deep-pagination queries (for example through the search slow log or the index _stats API) and set threshold alerts
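
The depth cap mentioned above can be enforced before a query is ever sent to Elasticsearch. A minimal sketch (the 1,000-page cap and the pageNo parameter are illustrative assumptions):

private static final int MAX_PAGE_DEPTH = 1000; // illustrative cap

public void checkPageDepth(int pageNo) {
    // Reject requests that try to page unreasonably deep, even when search_after is used
    if (pageNo > MAX_PAGE_DEPTH) {
        throw new IllegalArgumentException(
                "Paging depth exceeds the allowed maximum of " + MAX_PAGE_DEPTH + " pages");
    }
}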

6. Summary

| Plan | Recommendation | Applicable stage |
| --- | --- | --- |
| Adjust max_result_window | ⭐☆☆☆☆ | Early verification phase |
| Scroll API | ⭐⭐☆☆☆ | Temporary data migration / batch export |
| Search After | ⭐⭐⭐⭐⭐ | Real-time pagination in production |

Final suggestion: in scenarios such as log analysis and user-behavior tracking, combining search_after with time-range filtering and an appropriate caching strategy makes it possible to page efficiently through billions of documents. Upgrade your paging scheme now and say goodbye to the "Result window is too large" error!
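
As an illustration of the time-range filtering mentioned above (the field name and bounds are assumptions), such a filter can simply be added to the bool query in the earlier searchData example:

// Restrict the query to a time window in addition to the existing filters
boolQueryBuilder.filter(
        QueryBuilders.rangeQuery("created_start_time")
                .gte("2025-04-01T00:00:00")   // assumed lower bound
                .lt("2025-04-15T00:00:00"));  // assumed upper bound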

This concludes the detailed guide to Elasticsearch deep pagination (pitfalls and error handling). For more on Elasticsearch deep pagination, please search my earlier articles or keep browsing the related articles, and I hope you will continue to support this site!