Elasticsearch holds a large amount of data: how can I compute statistics over a whole index?
There are multiple indices recording product data, each roughly 20 GB in size.
I want to summarize each of these indices, for example compute statistics for all merchants in an index and save them to a new index, but an aggs query cannot be paged. I know about scroll and scan, but I have also read that:
- if a scroll query contains an aggregation, only the initial response carries the aggregation result
- a scan query does not support aggregations
So if I want to compute statistics over an entire index, what are my options?
ES paging scheme
Approach:
- Do not use from/size paging.
- Obtain the sort values with search_after. Each request returns only the sort values and no other document data. The sort values are saved in ES.
- Page queries also use search_after: the starting point is looked up in the ES sort index, and each request returns the full documents.
Advantages: this is the fastest way to access a page. In theory the access time for a single page is on the order of seconds, and the load on ES is small.
Disadvantages: a separate thread is needed to maintain the array of sort values. Documents in the paged index may be deleted or added before the sort index has been updated, so there is a small chance that a document is pushed off its page by new data and does not show up in the paging interface. There is also a small chance that the last document of the previous page appears on this page, or that the last document of this page appears again on the next page.
Implementation details: at startup the program creates a sort index in ES. It then uses search_after to walk the index being paged, computing the sort values for each page and saving them in the sort index, and it repeats this in a loop. To serve a page, it only needs to read the sort values for that page from the sort index and query the paged index with them.
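The two failure modes listed under the disadvantages can be reproduced with a small simulation. This is only an illustrative sketch: `page_after` is a hypothetical stand-in for an ES search_after query, and a single integer key replaces the three-field sort used in the requests.

```python
from typing import List, Optional


def page_after(keys_desc: List[int], cursor: Optional[int], size: int) -> List[int]:
    """Emulate search_after over sort keys held in descending order."""
    if cursor is None:
        return keys_desc[:size]
    return [k for k in keys_desc if k < cursor][:size]


SIZE = 3
keys = list(range(10, 0, -1))                      # documents 10, 9, ..., 1
stored_cursor = page_after(keys, None, SIZE)[-1]   # 8, saved before the index changed

# Case 1: document 9 (on page 1) is deleted after the cursor was saved.
after_delete = [k for k in keys if k != 9]
page1 = page_after(after_delete, None, SIZE)            # [10, 8, 7]
page2 = page_after(after_delete, stored_cursor, SIZE)   # [7, 6, 5]
duplicated = sorted(set(page1) & set(page2))            # [7] appears on both pages

# Case 2: document 11 is added at the top after the cursor was saved.
after_insert = [11] + keys
page1_new = page_after(after_insert, None, SIZE)            # [11, 10, 9]
page2_new = page_after(after_insert, stored_cursor, SIZE)   # [7, 6, 5]
shown = page1_new + page2_new
skipped = [k for k in after_insert[:2 * SIZE] if k not in shown]  # [8] is on neither page
```

Both effects disappear once the stored sort values are recomputed, which is why the refresh loop runs continuously.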
Walkthrough of the REST requests:
- REST requests that fetch the sort values
First page:
GET test_delete/_search
{
  "size": 15,
  "sort": [
    {"randomDouble": "desc"},
    {"randomInt": "desc"},
    {"phone": "desc"}
  ],
  "_source": false
}
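Take the sort values (the "sort" array) of the last hit in the response and pass them as "search_after" in the next request.
Page N: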
GET test_delete/_search
{
  "size": 15,
  "sort": [
    {"randomDouble": "desc"},
    {"randomInt": "desc"},
    {"phone": "desc"}
  ],
  "search_after": [
  ],
  "_source": false
}
This process runs in its own thread, which continuously updates the sort values.
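The loop that collects the sort values can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the original implementation: `search` is a hypothetical callable that sends a request body to `test_delete/_search` and returns the parsed JSON response (for example a thin wrapper around an Elasticsearch client), and the collected values are returned as a Python list instead of being written to a separate ES sort index.

```python
from typing import Any, Callable, Dict, List

# Sort definition taken from the requests above.
SORT = [{"randomDouble": "desc"}, {"randomInt": "desc"}, {"phone": "desc"}]

# Hypothetical signature: request body in, parsed _search response out.
SearchFn = Callable[[Dict[str, Any]], Dict[str, Any]]


def collect_page_cursors(search: SearchFn, page_size: int = 15) -> List[List[Any]]:
    """Walk the whole index with search_after and return one cursor per page.

    cursors[i] holds the sort values of the last hit on page i + 1, which is
    the value to pass as "search_after" when requesting page i + 2.
    """
    cursors: List[List[Any]] = []
    # "_source": False skips the document bodies; only sort values are needed.
    body: Dict[str, Any] = {"size": page_size, "sort": SORT, "_source": False}
    while True:
        hits = search(body)["hits"]["hits"]
        if not hits:
            break
        last_sort = hits[-1]["sort"]
        cursors.append(last_sort)
        if len(hits) < page_size:
            break  # a short page means nothing is left after it
        body = {**body, "search_after": last_sort}
    return cursors
```

In the scheme described here, a background thread would call such a function repeatedly and persist the result, so that page reads never have to walk the index themselves.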
- REST requests that fetch the data of a given page
First page:
GET test_delete/_search
{
  "size": 15,
  "sort": [
    {"randomDouble": "desc"},
    {"randomInt": "desc"},
    {"phone": "desc"}
  ]
}
Page N ("search_after" takes the sort values stored for the last hit of page N-1):
GET test_delete/_search
{
  "size": 15,
  "sort": [
    {"randomDouble": "desc"},
    {"randomInt": "desc"},
    {"phone": "desc"}
  ],
  "search_after": [
  ]
}
Compared with the requests that fetch only the sort values, these requests omit the "_source" setting, so every hit comes back with its full document.
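Serving a page then comes down to one lookup plus one request. The sketch below only builds the request body and leaves the HTTP call to whichever client is in use. It assumes the stored sort values are available as a list in which entry i holds the sort values of the last hit on page i + 1; that layout and the name `build_page_request` are hypothetical, since the original keeps these values in an ES sort index.

```python
from typing import Any, Dict, List, Sequence

# Sort definition taken from the requests above.
SORT = [{"randomDouble": "desc"}, {"randomInt": "desc"}, {"phone": "desc"}]


def build_page_request(
    page: int, cursors: Sequence[List[Any]], page_size: int = 15
) -> Dict[str, Any]:
    """Return the _search body for a 1-based page number."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    # No "_source" key: the full documents are wanted here.
    body: Dict[str, Any] = {"size": page_size, "sort": SORT}
    if page > 1:
        if page - 2 >= len(cursors):
            raise IndexError(f"no stored sort values for page {page}")
        # Start right after the last hit of the previous page.
        body["search_after"] = cursors[page - 2]
    return body
```

Serializing the result with `json.dumps` gives a body that can be sent to `GET test_delete/_search`; page 1 needs no stored values, while every later page needs exactly one list lookup.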