I am currently using the Elasticsearch Python client to search an index in my Elasticsearch cluster.
Let's say I have 20 million documents and I am paginating with the from and size parameters. I have read in the documentation that there is a limit of 10k, but I don't understand what that limit means.
For example,
Does that limit mean I can only make pagination calls (i.e. with from and size) 10,000 times?
For example from=0, size=10, then from=10, size=10, and so on, 10,000 times.
Or does it mean I can make an unlimited number of pagination calls with from and size, but each call has a size limit of 10k?
Can someone clarify this?
The pagination limit of 10k means that, for the applied query, only the first 10k results can be retrieved.
from: 0, size: 10001 will give the error "Result window is too large".
from: 10000, size: 10 will give the error "Result window is too large".
In both cases we are trying to read past the 10,000th result of the current query, hence the exception.
from does not represent a page number; it represents the starting offset.
The limit is called max_result_window and its default value is 10k. Mathematically, it is the maximum value that from + size can take.
from: 1, size: 10000 will give an error.
from: 5, size: 9996 will give an error.
from: 9999, size: 2 will give an error.
search_after is the recommended alternative if you need deeper results.
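For instance, a minimal sketch of search_after with the Python client might look like this ("timestamp" and "doc_id" are placeholder sort fields from your own mapping, the endpoint is a placeholder, and the 7.x-style body argument is assumed):
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

# search_after needs a deterministic sort; "timestamp" is a placeholder field
# and "doc_id" stands for any unique keyword field used as a tiebreaker so
# that pages never overlap.
sort = [{"timestamp": "asc"}, {"doc_id": "asc"}]

search_after = None
while True:
    body = {"query": {"match_all": {}}, "sort": sort, "size": 1000}
    if search_after is not None:
        body["search_after"] = search_after
    page = es.search(index="myexistingindex", body=body)
    hits = page["hits"]["hits"]
    if not hits:
        break
    for hit in hits:
        pass  # process hit["_source"] here
    # the sort values of the last hit become the cursor for the next page
    search_after = hits[-1]["sort"]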
You can update existing index settings with this query:
PUT myexistingindex/_settings
{
  "settings": {
    "max_result_window": 20000000
  }
}
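With the Python client the same settings change looks roughly like this (a sketch, assuming an elasticsearch-py client instance es as in the sketch above and the 7.x-style body argument; like the REST call, raising the window this far is generally discouraged):
# Same settings change through the Python client (sketch only).
es.indices.put_settings(
    index="myexistingindex",
    body={"index": {"max_result_window": 20000000}},
)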
If you are creating the index dynamically, you can set max_result_window in the index settings.
In Java it looks like this:
private void createIndex(String indexName) throws IOException {
    Settings settings = Settings.builder()
            .put("number_of_shards", 1)
            .put("number_of_replicas", 0)
            .put("index.translog.durability", "async")
            .put("index.translog.sync_interval", "5s")
            .put("max_result_window", "20000000")
            .build();
    CreateIndexRequest createIndexRequest = new CreateIndexRequest(indexName).settings(settings);
    restHighLevelClient.indices().create(createIndexRequest, RequestOptions.DEFAULT);
}
After this configuration, you can use from offsets of up to 20 million.
But this approach is not recommended.
You can review this document instead: Scroll API
I need to come up with a strategy to process and update documents in an Elasticsearch index periodically and efficiently. I do not have to look at documents that I have processed before.
My setup is a long-running process that continuously inserts documents into an index, roughly 500 documents per hour (think of the common logging example).
I need a solution that periodically (via a cron job, for example) updates some number of documents by running code on a specific field (a text field, for example) to enrich each document with a number of new fields. I want to do this to offer more fine-grained aggregations on the index. In the logging analogy this could be: I take the User-Agent string from a log entry (document), do some parsing on it, and add some new fields back to that document before indexing it again.
So my approach would be:
Get some number of documents (or even all of them) that I haven't looked at before. I could query them by combining must_not and exists, for instance.
Run my code on these documents (run the parser, compute some new stuff, whatever).
Update the documents obtained previously (preferably via the Bulk API).
I know there is the Update by query API, but it does not seem right here, since I need to run my own code (which, by the way, depends on external libraries) on my server, and not as a Painless script, which would not support the more involved processing I need.
I am accessing Elasticsearch via Python.
The problem now is that I don't know how to implement the above approach. For example, what if the number of documents obtained in step 1 is larger than myindex.settings.index.max_result_window?
Any ideas?
I considered @Jay's comment and ended up with this pattern, for the moment:
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk
from elasticsearch.helpers import scan
from my_module.postprocessing import post_process_doc
es = Elasticsearch(...)
es.ping()
def update_docs( docs ):
    """Yield bulk update actions for the scanned documents."""
    for idx, doc in enumerate(docs):
        if idx % 10000 == 0:
            print( 'next 10k' )
        new_field_value = post_process_doc( doc )
        doc_update = {
            "_index": doc["_index"],
            "_id": doc["_id"],
            "_op_type": "update",
            "doc": { <<the new field>> : new_field_value }
        }
        yield doc_update

docs = scan( es,
             query={ "query": { "bool": { "must_not": { "exists": { "field": <<the new field>> } } } } },
             index=index,
             scroll="1m",
             preserve_order=True )
bulk( es, update_docs( docs ) )
Comments:
I learned that Elasticsearch keeps a view of the search results when you do a scroll and you pass the corresponding id with each request; the scan helper handles that for you. The scroll parameter in the call above tells Elasticsearch how long the view stays open, i.e., how long it stays consistent.
As stated in my comment, the documentation no longer recommends the scroll API for deep pagination. If you need to preserve the index state while paging, use a point in time (PIT); I haven't tried it myself yet, but a rough sketch follows these comments.
In my implementation I needed to pass preserve_order=True, otherwise an error was thrown.
Remember to update the mapping of the index beforehand, e.g., when you want to add a nested field as another field in your document.
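A rough, untested sketch of PIT plus search_after with the Python client (assumes Elasticsearch 7.10+ and a matching elasticsearch-py version; <<the new field>> stays a placeholder exactly as in the snippet above):
# PIT + search_after instead of scroll (sketch only).
pit = es.open_point_in_time(index=index, keep_alive="1m")
pit_id = pit["id"]

search_after = None
while True:
    body = {
        "size": 1000,
        "query": { "bool": { "must_not": { "exists": { "field": <<the new field>> } } } },
        "pit": {"id": pit_id, "keep_alive": "1m"},
        "sort": [{"_shard_doc": "asc"}],  # tiebreaker available in PIT searches
    }
    if search_after is not None:
        body["search_after"] = search_after
    page = es.search(body=body)  # no index in the request when a PIT is used
    hits = page["hits"]["hits"]
    if not hits:
        break
    # hand the hits to update_docs()/bulk() just like above
    search_after = hits[-1]["sort"]

es.close_point_in_time(body={"id": pit_id})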
I have been familiarising myself with PRAW for Reddit. I am trying to get the top x posts for the week, but I am having trouble changing the limit for the top method.
The documentation doesn't seem to mention how to do it, unless I am missing something. I can change the time period fine by just passing in the string "week", but the limit has me flummoxed. The image shows that there is a param for limit and it is set to 100.
r = self.getReddit()
sub = r.subreddit('CryptoCurrency')
results = sub.top("week")
for r in results:
    print(r.title)
DOCS: subreddit.top()
IMAGE: Inspect listing generator params
From the docs you've linked:
Additional keyword arguments are passed in the initialization of
ListingGenerator.
So we follow that link and see the limit parameter for ListingGenerator:
limit – The number of content entries to fetch. If limit is None, then
fetch as many entries as possible. Most of reddit’s listings contain a
maximum of 1000 items, and are returned 100 at a time. This class will
automatically issue all necessary requests (default: 100).
So using the following should do it for you:
results = sub.top("week", limit=500)
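And if you want as much as the listing will give you (roughly 1000 items, per the quoted docs), limit=None should also work:
# limit=None makes the ListingGenerator keep requesting pages until the
# listing is exhausted (reddit caps most listings at about 1000 items).
for submission in sub.top("week", limit=None):
    print(submission.title)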
I have a web app where I store some data in Mongo, and I need to return a paginated response from a find or an aggregation pipeline. I use Django Rest Framework and its pagination, which in the end just slices the Cursor object. This works seamlessly for Cursors, but aggregation returns a CommandCursor, which does not implement __getitem__().
cursor = collection.find({})
cursor[10:20] # works, no problem
command_cursor = collection.aggregate([{'$match': {}}])
command_cursor[10:20] # throws not subscriptable error
What is the reason behind this? Does anybody have an implementation for CommandCursor.__getitem__()? Is it feasible at all?
I would like to find a way to avoid fetching all the values when I need just one page. Converting to a list and then slicing it is not feasible for large (100k+ docs) pipeline results. There is a workaround based on this answer, but it only works for the first few pages, and performance drops rapidly for pages near the end.
Mongo has certain aggregation pipeline stages to deal with this, like $skip and $limit that you can use like so:
aggregation_results = list(collection.aggregate([{'$match': {}}, {'$skip': 10}, {'$limit': 10}]))
Specifically, as you noticed, PyMongo's CommandCursor has no implementation of __getitem__, hence the regular slicing syntax does not work as expected. I would personally recommend not tampering with their code unless you're interested in becoming a contributor to the package.
The MongoDB cursors for find and aggregate work differently: the cursor returned by an aggregation query is the result of processed data (in most cases), which is not the case for find cursors, which are static and can therefore be skipped and limited at will.
You can add the paginator limits as $skip and $limit stages in the aggregation pipeline.
For Example:
command_cursor = collection.aggregate([
    {
        "$match": {
            # Match Conditions
        }
    },
    {
        "$skip": 10  # No. of documents to skip (should be 0 for page 1)
    },
    {
        "$limit": 10  # No. of documents to be displayed on your webpage
    }
])
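If you are wiring this into DRF-style pagination, a small helper along these lines could translate a page number into those two stages (paginated_aggregate, page_number, and page_size are just illustrative names, not part of PyMongo):
def paginated_aggregate(collection, match, page_number, page_size):
    """Return one page of aggregation results (page_number is 1-based)."""
    pipeline = [
        {"$match": match},
        {"$skip": (page_number - 1) * page_size},  # page 1 -> skip 0
        {"$limit": page_size},
    ]
    return list(collection.aggregate(pipeline))

# e.g. the second page of 10 documents:
second_page = paginated_aggregate(collection, {}, page_number=2, page_size=10)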
I am developing a simple app to consume data from some G Suite APIs (Admin SDK, Drive, Gmail, etc.).
G Suite API endpoints that support the list method (for collections) respond to queries with something like the following (content may vary from API to API):
{
  "kind": "admin#directory#users",
  "etag": "\"WczyXiapC9UmAQ6oKabcde6P59w-7argQ83zwDwKoUE/zsH-hyZTP1lFsB3-wabK4_8VXMk\"",
  "users": [
    {
      "kind": "admin#directory#user",
      "id": "137674315191655104007",
      "etag": "\"WczyXiapC9..."
      ...
    },
    ...
    # N elements of type 'user', where N <= maxResults,
    # <maxResults> being the maximum number of elements in the response per query.
    # <maxResults> has a system default value.
  ]
}
In order to get the total number of available elements for consumption in that API, I may encounter the following cases:
A single query, if the total number of available elements is less than or equal to maxResults.
More than one query, if the total number of available elements is greater than maxResults.
When the second case occurs, the G Suite API returns a pagination token which I use in successive queries to retrieve more pages of up to maxResults elements each.
Once I have consumed all the elements I can compute the total number.
My question is:
Is it possible, to retrieve the total number of elements (just the integer value) in the query with a single API call and thus, avoid pagination?
Thank you for your answers.
Is it possible, to retrieve the total number of elements (just the integer value) in the query with a single API call and thus, avoid pagination?
If a method has a parameter called maxResults, that is because there is a maximum number of rows a single call can return.
If you look at the documentation for the Google Drive API files.list method:
The maximum number of files to return per page. Partial or empty result pages are possible even before the end of the files list has been reached. Acceptable values are 1 to 1000, inclusive. (Default: 100)
This means that it can return a maximum of 1000 files, after which you will need to paginate. There is no way around this limitation in the API.
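So the only option is to walk every page and count as you go. A rough sketch for the Admin SDK Directory API users.list (assuming an authorized service object built with googleapiclient for "admin", "directory_v1"; other G Suite list endpoints follow the same pageToken pattern):
# Count all Directory users by paging until nextPageToken runs out.
total = 0
page_token = None
while True:
    response = service.users().list(
        customer="my_customer",
        maxResults=500,        # per-page cap enforced by the API
        pageToken=page_token,
    ).execute()
    total += len(response.get("users", []))
    page_token = response.get("nextPageToken")
    if not page_token:
        break

print("total users:", total)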
I am trying to get all the documents which have an ancestor name of "Laptops" with the following lines of code in Python, using PyMongo.
for p in collection.find({"ancestors.name":"Laptops"}):
    print p
But I am getting this error.
pymongo.errors.OperationFailure: database error: BSONObj size: 536871080 (0x200000A8) is invalid. Size must be between 0 and 16793600(16MB) First element: seourl: "https://example.com"
If I limit the query like this:
for p in collection.find({"ancestors.name":"Laptops"}).limit(5):
    print p
then it works. So I guess the problem occurs while fetching all the documents in this category. How can I solve this? I want all the documents with "Laptops".
EDIT:
Using the aggregation pipeline, I tried the following query:
db.product_attributes.aggregate([
    {
        $match:
        {
            "ancestors.name": "Laptops"
        }
    }
])
I get the same error
uncaught exception: aggregate failed: {
    "errmsg" : "exception: BSONObj size: 536871080 (0x200000A8) is invalid. Size must be between 0 and 16793600(16MB) First element: seourl: \"https://example.com\"",
    "code" : 10334,
    "ok" : 0
}
What's wrong here? Help is appreciated :)
This restriction exists so that your MongoDB process does not consume all the memory on the server. To learn more, here is a ticket about the 4 MB -> 16 MB limit increase, and a discussion about its purpose.
An alternative approach is to use the aggregation pipeline:
If the aggregate command returns a single document that contains the
complete result set, the command will produce an error if the result
set exceeds the BSON Document Size limit, which is currently 16
megabytes. To manage result sets that exceed this limit, the aggregate
command can return result sets of any size if the command return a
cursor or store the results to a collection.
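With PyMongo 3+, aggregate() already returns such a cursor, so a minimal sketch (assuming that version; on older PyMongo 2.x you may need to request cursor output explicitly) is just to iterate it instead of materialising everything at once:
# Consume the aggregation through a cursor so no single 16 MB response
# document has to hold the whole result set.
cursor = collection.aggregate(
    [{"$match": {"ancestors.name": "Laptops"}}],
    allowDiskUse=True,  # let large pipeline stages spill to disk (MongoDB 2.6+)
)
for p in cursor:
    print(p)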
The maximum size of a document returned by a query is 16MB. You can see that, and other limits, in the official documentation.
To work around this you could count the total number of matching records and then loop over them in batches:
Sample:
count = collection.count({"ancestors.name": "Laptops"})
for num in range(0, count, 500):
    if num != 0:
        for p in collection.find({"ancestors.name": "Laptops"}).skip(num).limit(500):
            print p
    else:
        for p in collection.find({"ancestors.name": "Laptops"}).limit(500):
            print p
Warning:
This method is slow, since for every batch skip has to walk over all the skipped records.