I am trying to get all the documents whose ancestor name is "Laptops", using the following lines of Python code with the help of pymongo.
for p in collection.find({"ancestors.name": "Laptops"}):
    print p
But I am getting this error.
pymongo.errors.OperationFailure: database error: BSONObj size: 536871080 (0x200000A8) is invalid. Size must be between 0 and 16793600(16MB) First element: seourl: "https://example.com"
If I limit the query as
for p in collection.find({"ancestors.name": "Laptops"}).limit(5):
    print p
Then it works. So I guess the problem occurs while fetching all the documents in this category. How can I solve this? I want all the documents with "Laptops".
EDIT:-
Using the aggregation pipeline concept, I tried the following query:
db.product_attributes.aggregate([
{
$match:
{
"ancestors.name":"Laptops"
}
}
])
I get the same error
uncaught exception: aggregate failed: {
"errmsg" : "exception: BSONObj size: 536871080 (0x200000A8) is invalid. Size must be between 0 and 16793600(16MB) First element: seourl: \"https://example.com\"",
"code" : 10334,
"ok" : 0
}
What's wrong here? Help is appreciated :)
The restriction exists to stop your MongoDB process from consuming all the memory on the server. To learn more, here is a ticket about the 4 MB to 16 MB limit increase, and a discussion about its purpose.
An alternative approach is to use the aggregation pipeline:
If the aggregate command returns a single document that contains the
complete result set, the command will produce an error if the result
set exceeds the BSON Document Size limit, which is currently 16
megabytes. To manage result sets that exceed this limit, the aggregate
command can return result sets of any size if the command return a
cursor or store the results to a collection.
The maximum size of a document returned by a query is 16 MB. You can see that limit, and others, in the official documentation.
To work around this, you can count the total number of matching records and then fetch and print them in batches.
Sample:
count = db.collection.count({"ancestors.name": "Laptops"})
for num in range(0, count, 500):
    for p in collection.find({"ancestors.name": "Laptops"}).skip(num).limit(500):
        print p
Warning:
This method is slow, since skip() still has to walk over all the skipped records on each query.
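The skip/limit arithmetic above can be factored into a small helper. This is a sketch; `batch_ranges` is a name I made up, not part of pymongo:

```python
def batch_ranges(total, batch_size=500):
    """Yield (skip, limit) pairs that cover `total` records in order."""
    for start in range(0, total, batch_size):
        yield start, min(batch_size, total - start)

# Hypothetical usage with the query above:
# for skip, limit in batch_ranges(count):
#     for p in collection.find({"ancestors.name": "Laptops"}).skip(skip).limit(limit):
#         print p
```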
Related
I need to come up with a strategy to process and update documents in an elasticsearch index periodically and efficiently. I do not have to look at documents that I processed before.
My setting is that I have a long running process, which continuously inserts documents to an index, say approx. 500 documents per hour (think about the common logging example).
I need to find a solution to update some number of documents periodically (via a cron job, e.g.) to run some code on a specific field (a text field, e.g.) to enrich each document with a number of new fields. I want to do this to offer more fine-grained aggregations on the index. In the logging analogy, this could be: I get the UserAgent string from a log entry (document), do some parsing on it, add some new fields back to that document, and index it.
So my approach would be:
Get some amount of documents (or even all) that I haven't looked at before. I could query them by combining must_not and exists, for instance.
Run my code on these documents (run the parser, compute some new stuff, whatever).
Update the documents obtained previously (probably most preferably via bulk api).
I know there is the Update by query API. But it does not seem right here, since I need to run my own code (which, by the way, depends on external libraries) on my server, not as a Painless script, which would not support the comprehensive processing I need.
I am accessing elasticsearch via python.
The problem is that I don't know how to implement the approach above. E.g., what if the number of documents obtained in step 1 is larger than myindex.settings.index.max_result_window?
Any ideas?
I considered @Jay's comment and ended up with this pattern, for the moment:
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk
from elasticsearch.helpers import scan
from my_module.postprocessing import post_process_doc
es = Elasticsearch(...)
es.ping()
def update_docs(docs):
    """Turn scanned documents into bulk update actions."""
    for idx, doc in enumerate(docs):
        if idx % 10000 == 0:
            print('next 10k')
        new_field_value = post_process_doc(doc)
        doc_update = {
            "_index": doc["_index"],
            "_id": doc["_id"],
            "_op_type": "update",
            "doc": {<<the new field>>: new_field_value}
        }
        yield doc_update
docs = scan(es,
            query={"query": {"bool": {"must_not": {"exists": {"field": <<the new field>>}}}}},
            index=index, scroll="1m", preserve_order=True)
bulk(es, update_docs(docs))
Comments:
I learned that Elasticsearch keeps a consistent view of the search results while you scroll, passing the corresponding ids along with each request. The scan helper method handles that for you. The scroll parameter in the call above tells Elasticsearch how long the view stays open, i.e., how long it remains consistent.
As stated in my comment, the documentation says they no longer recommend using the scroll API for deep pagination. If you need to preserve the index state while paging, use a point in time (PIT), but I haven't tried it yet.
In my implementation, I needed to pass preserve_order=True, otherwise an error was thrown.
Remember to update the mapping of the index beforehand, e.g., when you want to add a nested field as another field in your document.
I have been familiarising myself with PRAW for Reddit. I am trying to get the top x posts for the week, but I am having trouble changing the limit for the "top" method.
The documentation doesn't seem to mention how to do it, unless I am missing something. I can change the time period fine by just passing in the string "week", but the limit has me flummoxed. The image shows that there is a limit param, and it is set to 100.
r = self.getReddit()
sub = r.subreddit('CryptoCurrency')
results = sub.top("week")
for r in results:
    print(r.title)
DOCS: subreddit.top()
IMAGE: Inspect listing generator params
From the docs you've linked:
Additional keyword arguments are passed in the initialization of
ListingGenerator.
So we follow that link and see the limit parameter for ListingGenerator:
limit – The number of content entries to fetch. If limit is None, then
fetch as many entries as possible. Most of reddit’s listings contain a
maximum of 1000 items, and are returned 100 at a time. This class will
automatically issue all necessary requests (default: 100).
So using the following should do it for you:
results = sub.top("week", limit=500)
I am currently using the Elasticsearch Python client to search my index.
Let's say I have 20 million documents, and I am using pagination with the from and size parameters. I have read in the documentation that there is a limit of 10k, but I didn't understand what that limit means.
For example,
Does that limit mean I can only make pagination (i.e. from and size) calls 10,000 times?
Like from=0, size=10, then from=10, size=10, and so on, 10,000 times.
Or does it mean I can make unlimited pagination calls using the from and size params, but there is a limit of 10k results per pagination call?
Can someone clarify this?
The pagination limit of 10k means that, for a given query, only the first 10k results can be accessed.
from: 0, size: 10001 will give the error "Result window is too large".
from: 10000, size: 10 will give the same error.
In both cases we are trying to access documents beyond offset 10,000 of the current query, hence the exception.
Note that from does not represent a page number; it represents the starting offset.
The limit is called max_result_window and its default value is 10k. Mathematically, it is the maximum value that size + from can take.
from: 1, size: 10000 will give an error.
from: 5, size: 9996 will give an error.
from: 9999, size: 2 will give an error.
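The rule can be expressed as a one-line check (`window_ok` is a hypothetical helper for illustration, not part of the Elasticsearch client):

```python
def window_ok(from_, size, max_result_window=10000):
    """True if the requested page fits in the result window: from + size <= limit."""
    return from_ + size <= max_result_window
```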
search_after is the recommended alternative if you need deeper results.
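A minimal sketch of how search_after paging fits together; `build_page` is a hypothetical helper, and the body shape follows the standard search API (query, sort, size, search_after):

```python
def build_page(query, sort, size=100, search_after=None):
    """Build a search body; pass the last hit's sort values to get the next page."""
    body = {"query": query, "sort": sort, "size": size}
    if search_after is not None:
        body["search_after"] = search_after
    return body

# First page: no search_after key at all.
first = build_page({"match_all": {}}, [{"timestamp": "asc"}, {"_id": "asc"}])

# Next page: feed back the "sort" values of the last hit from the previous response.
next_page = build_page({"match_all": {}}, [{"timestamp": "asc"}, {"_id": "asc"}],
                       search_after=["2021-01-01T00:00:00", "abc123"])
```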
You can update the settings of an existing index with this request:
PUT myexistingindex/_settings
{
  "index": {
    "max_result_window": 20000000
  }
}
If you are creating the index dynamically, you can set the max result window parameter in the index settings.
In Java, it looks like this:
private void createIndex(String indexName) throws IOException {
    Settings settings = Settings.builder()
            .put("number_of_shards", 1)
            .put("number_of_replicas", 0)
            .put("index.translog.durability", "async")
            .put("index.translog.sync_interval", "5s")
            .put("max_result_window", "20000000").build();
    CreateIndexRequest createIndexRequest = new CreateIndexRequest(indexName).settings(settings);
    restHighLevelClient.indices().create(createIndexRequest, RequestOptions.DEFAULT);
}
After these configuration changes, you can use "from" values of up to 20 million.
However, this approach is not recommended.
You can review this document: Scroll Api
Some of the AWS accounts I'm using have a lot of KMS keys that have aliases tied to them.
My problem is that if the list_aliases() command returns too many results, the results are truncated and the script fails if the value it's searching for is beyond the truncation point.
I tried this to get 200 results back, but it did not work:
alias_list = kms_client.list_aliases(Marker='200')
botocore.errorfactory.InvalidMarkerException: An error occurred (InvalidMarkerException) when calling the ListAliases operation: Could not deserialize marker '200'
How do I set the marker for the list_aliases() command?
First, per the documentation, list_aliases returns 1 to 100 results per call. The page size is set using Limit, not Marker.
Example:
alias_list = kms_client.list_aliases(Limit=100)
Second, results beyond the first 100 require using the Marker value returned by the previous call to list_aliases(). Below is a simple example of how to get the marker and use it to fetch the next 1-100 values.
Disclaimer, this code doesn't actually do anything with the retrieved aliases and I've not tested it.
def get_kms_alias(kms_client, limit=100, marker=""):
    if marker:
        alias_request = kms_client.list_aliases(Limit=limit, Marker=marker)
    else:
        alias_request = kms_client.list_aliases(Limit=limit)
    print(alias_request["Aliases"])
    if alias_request["Truncated"]:
        marker = alias_request["NextMarker"]
        get_kms_alias(kms_client, limit, marker)
    return None
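The same Truncated/NextMarker loop can also be written iteratively as a generator. This is a sketch: `iter_aliases` is a name I made up, and the stub client below only imitates the parts of the boto3 response shape used here, so the paging logic can be exercised without AWS credentials:

```python
def iter_aliases(kms_client, limit=100):
    """Yield every alias, following NextMarker until Truncated is False."""
    kwargs = {"Limit": limit}
    while True:
        resp = kms_client.list_aliases(**kwargs)
        for alias in resp["Aliases"]:
            yield alias
        if not resp.get("Truncated"):
            break
        kwargs["Marker"] = resp["NextMarker"]

class FakeKms:
    """Stub serving three aliases over two pages, mimicking the response shape."""
    def __init__(self):
        self.pages = [
            {"Aliases": ["alias/a", "alias/b"], "Truncated": True, "NextMarker": "m1"},
            {"Aliases": ["alias/c"], "Truncated": False},
        ]
    def list_aliases(self, **kwargs):
        return self.pages.pop(0)

print(list(iter_aliases(FakeKms())))  # ['alias/a', 'alias/b', 'alias/c']
```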
I'm new to MongoDB and pymongo and looking for some guidance in terms of algorithms and performance for a specific task described below. I have posted a link to an image of the data sample and also my sample python code below.
I have a single collection that grows by about 5 to 10 million documents every month. It receives all this info from other systems, which I have no access to modify in any way (they belong to different companies). Each document represents a sort of financial transaction. I need to group documents that are part of the same "transaction group".
Each document has hundreds of keys. Almost all keys vary between documents (which is why they moved from MySQL to MongoDB - no easy way to align schema). However, I found out that three keys are guaranteed to always be in all of them. I'll call these keys key1, key2 and key3 in this example. These keys are my only option to identify the transactions that are part of the same transaction group.
The basic rule is:
- If consecutive documents have the same key1, the same key2, and the same key3, they are all in the same "transaction group". Then I must give them some integer id in a new key named 'transaction_group_id'.
- Otherwise, consecutive documents that do not match on key1, key2 and key3 each form their own individual "transaction group".
It's really easy to understand by looking at the screenshot of a data sample (better than my explanation, anyway). See here:
As you can see in the sample:
- Documents 1 and 2 are in the same group, because they match key1, key2 and key3;
- Documents 3 and 4 also match and are in their own group;
- Following the same logic, documents 18 and 19 obviously form a group. However, even though they match the values of documents 1 and 3, they are not in the same group (because the documents are not consecutive).
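The rule above can be written as a pure function over the ordered key triples (`assign_groups` is a hypothetical name and the sample keys are invented):

```python
def assign_groups(key_triples):
    """Give consecutive identical (key1, key2, key3) runs the same group id."""
    group_id = 0
    prev = object()  # sentinel that never equals a real triple
    ids = []
    for triple in key_triples:
        if triple != prev:
            group_id += 1
            prev = triple
        ids.append(group_id)
    return ids

# Docs 1-2 match, 3-4 match, and a later repeat of the first triple
# still starts a new group because it is not consecutive:
print(assign_groups([("a", "b", "c"), ("a", "b", "c"),
                     ("x", "y", "z"), ("x", "y", "z"),
                     ("a", "b", "c")]))  # [1, 1, 2, 2, 3]
```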
I created a very simplified version of the current python function, to give you guys an idea of the current implementation:
from pymongo import MongoClient, ASCENDING

def groupTransactions(mongo_host,
                      mongo_port,
                      mongo_db,
                      mongo_collection):
    """
    Group transactions if keys 1, 2 and 3 all match in consecutive docs.
    """
    mc = MongoClient(mongo_host, mongo_port)
    db = mc[mongo_db]
    coll = db[mongo_collection]
    # The first document's transaction group must always be equal to 1.
    first_doc_id = coll.find_one()['_id']
    coll.update({'_id': first_doc_id},
                {"$set": {"transaction_group_id": 1}},
                upsert=False, multi=False)
    # Cursor order is undetermined unless we use sort(), no matter what
    # the _id is. We learned that the hard way.
    cur = coll.find().sort('subtransaction_id', ASCENDING)
    doc_count = cur.count()
    unique_data = []
    unique_data.append((cur[0]['key1'], cur[0]['key2'], cur[0]['key3']))
    transaction_group_id = 1
    i = 1
    while i < doc_count:
        doc_id = cur[i]['_id']
        unique_data.append((cur[i]['key1'], cur[i]['key2'], cur[i]['key3']))
        if unique_data[i] != unique_data[i-1]:
            # New group found, increase group id by 1
            transaction_group_id = transaction_group_id + 1
        # Update the group id in the database
        coll.update({'_id': doc_id},
                    {"$set": {"transaction_group_id": transaction_group_id}},
                    upsert=False, multi=False)
        i = i + 1
    print "%d subtransactions were grouped into %d transaction groups." % (doc_count, transaction_group_id)
    return 1
This is the code, more or less, and it works. But it takes between 2 and 3 days to finish, which is starting to become unacceptable. The hardware is good (VMs on last-generation Xeons, local MongoDB on SSD, 128 GB RAM). It would probably run faster if we decided to run it on AWS, use threading/subprocesses, etc., which are all obviously good options to try at some point.
However, I'm not convinced this is the best algorithm. It's just the best I could come up with. There must be obvious ways to improve it that I'm not seeing.
Moving to c/c++ or out of NoSQL is out of the question at this point. I have to make it work the way it is.
So basically the question is: Is this the best possible algorithm (using MongoDB/pymongo) in terms of speed? If not, I'd appreciate it if you could point me in the right direction.
EDIT: Just so you can have an idea of how slow this code is: Last time I measured, it took 22 hours to run on 1,000,000 results. As a quick workaround, I wrote something else to load the data into a Pandas DataFrame first and then apply more or less the same logic as this code. It took 3 to 4 minutes to group everything, using the same hardware. I mean, I know Pandas is efficient, etc. But there's something wrong; there can't be such a huge gap between the two solutions' performances (4 min vs 1,320 min).
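For reference, the vectorized version of the same rule that the EDIT describes is essentially a shift/compare/cumsum in pandas (the sample keys below are invented):

```python
import pandas as pd

df = pd.DataFrame({
    "key1": ["a", "a", "x", "x", "a"],
    "key2": ["b", "b", "y", "y", "b"],
    "key3": ["c", "c", "z", "z", "c"],
})
keys = df[["key1", "key2", "key3"]]
# A new group starts whenever any key differs from the previous row
changed = (keys != keys.shift()).any(axis=1)
df["transaction_group_id"] = changed.cumsum()
print(df["transaction_group_id"].tolist())  # [1, 1, 2, 2, 3]
```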
It is the case that most of the time is spent writing to the database, which includes the round trip of sending work to the DB, plus the DB doing the work. I will point out a couple of places where you can speed up each of those.
Speeding up the back-and-forth of sending write requests to the DB:
One of the best ways to improve the latency of the requests to the DB is to minimize the number of round trips. In your case, it's actually possible because multiple documents will get updated with the same transaction_group_id. If you accumulate their values and only send a "multi" update for all of them, then it will cut down on the back-and-forth. The larger the transaction groups the more this will help.
In your code, you would replace the current update statement:
coll.update({'_id': doc_id},
            {"$set": {"transaction_group_id": transaction_group_id}},
            upsert=False, multi=False)
With an accumulator of doc_id values (appending them to a list should be just fine). When you detect the "pattern" change and the transaction group advances to the next one, you would then run one update for the whole group:
coll.update({'_id': {'$in': list_of_doc_ids}},
            {"$set": {"transaction_group_id": transaction_group_id}},
            upsert=False, multi=True)
A second way to increase the parallelism of this process and speed up the end-to-end work would be to split the job between more than one client. The downside is that you need a single pre-computation pass to work out how many transaction_group_id values there will be and where the split points are. Then you can run multiple clients like this one, each handling only a specific range of subtransaction_id values, with a transaction_group_id starting value that is not 1 but whatever the "pre-work" process assigns to it.
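A sketch of the accumulator idea: walk the sorted cursor once, batching _id values per transaction group, so that a single multi update can be issued per group. `batch_by_group` and the plain dicts below are illustrative, not from the original code:

```python
def batch_by_group(docs):
    """Yield (transaction_group_id, [_id, ...]) per run of matching key triples."""
    group_id = 0
    prev = object()  # sentinel: never equals a real key triple
    ids = []
    for doc in docs:
        triple = (doc['key1'], doc['key2'], doc['key3'])
        if triple != prev:
            if ids:
                yield group_id, ids
            group_id += 1
            prev = triple
            ids = []
        ids.append(doc['_id'])
    if ids:
        yield group_id, ids

# Each batch would then be sent as one multi update instead of many single ones:
# for group_id, doc_ids in batch_by_group(cur):
#     coll.update({'_id': {'$in': doc_ids}},
#                 {'$set': {'transaction_group_id': group_id}},
#                 upsert=False, multi=True)
```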
Speeding up the actual write on the DB:
The reason I asked about the existence of the transaction_group_id field is that if a field being $set does not exist, it will be created, which increases the document size. If there is not enough space for the grown document, it has to be relocated, and that is less efficient than an in-place update.
MongoDB stores documents in BSON format. Different BSON values have different sizes. As a quick demonstration, here's a shell session that shows total document size based on the type and size of value stored:
> db.sizedemo.find()
{ "_id" : ObjectId("535abe7a5168d6c4735121c9"), "transaction_id" : "" }
{ "_id" : ObjectId("535abe7d5168d6c4735121ca"), "transaction_id" : -1 }
{ "_id" : ObjectId("535abe815168d6c4735121cb"), "transaction_id" : 9999 }
{ "_id" : ObjectId("535abe935168d6c4735121cc"), "transaction_id" : NumberLong(123456789) }
{ "_id" : ObjectId("535abed35168d6c4735121cd"), "transaction_id" : " " }
{ "_id" : ObjectId("535abedb5168d6c4735121ce"), "transaction_id" : " " }
> db.sizedemo.find().forEach(function(doc) { print(Object.bsonsize(doc)); })
43
46
46
46
46
53
Note how the empty string takes up three bytes fewer than the double or NumberLong values do. The string " " takes the same space as a number, and longer strings take proportionally more. To guarantee that the updates which $set the transaction group never cause the document to grow, you want to set transaction_group_id on initial load to a value of the same size as (or larger than) the one it will be updated to. This is why I suggested -1 or some other agreed-upon "invalid" or "unset" value.
You can check if the updates have been causing document moves by looking at the value in db.serverStatus().metrics.record.moves - this is the number of document moves caused by growth since the last time server was restarted. You can compare this number before and after your process runs (or during) and see how much it goes up relative to the number of documents you are updating.