Bulk update in Pymongo using multiple ObjectId - python

I want to update thousands of documents in mongo collection. I want to find them using ObjectId and then whichever document matches , should be updated. My update is same for all documents. I have list of ObjectId. For every ObjectId in list, mongo should find matching document and update "isBad" key of that document to "N"
ids = [ObjectId('56ac9d3fa722f1029b75b128'), ObjectId('56ac8961a722f10249ad0ad1')]
bulk = db.testdata.initialize_unordered_bulk_op()
bulk.find( { '_id': ids} ).update( { '$set': { "isBad" : "N" } } )
print bulk.execute()
This gives me result :
{'nModified': 0, 'nUpserted': 0, 'nMatched': 0, 'writeErrors': [], 'upserted': [], 'writeConcernErrors': [], 'nRemoved': 0, 'nInserted': 0}
This is expected because it is trying to match "_id" with list. But I don't know how to proceed.
I know how to update every document individually. My list size is of the order of 25000. I do not want to make 25000 calls individually. Number of documents in my collection are much more. I am using python2, pymongo = 3.2.1.

Iterate through the id list using a for loop and send the bulk updates in batches of 500:
bulk = db.testdata.initialize_unordered_bulk_op()
counter = 0
for id in ids:
# process in bulk
bulk.find({ '_id': id }).update({ '$set': { 'isBad': 'N' } })
counter += 1
if (counter % 500 == 0):
bulk.execute()
bulk = db.testdata.initialize_ordered_bulk_op()
if (counter % 500 != 0):
bulk.execute()
Because write commands can accept no more than 1000 operations (from the docs), you will have to split bulk operations into multiple batches, in this case you can choose an arbitrary batch size of up to 1000.
The reason for choosing 500 is to ensure that the sum of the associated document from the Bulk.find() and the update document is less than or equal to the maximum BSON document size even though there is no there is no guarantee using the default 1000 operations requests will fit under the 16MB BSON limit. The Bulk() operations in the mongo shell and comparable methods in the drivers do not have this limit.

bulk = db.testdata.initialize_unordered_bulk_op()
for id in ids:
bulk.find( { '_id': id}).update({ '$set': { "isBad" : "N" }})
bulk.execute()

Related

Pymongo not modifying all matching documents

I'm trying to update around 100k documents using pymongo 3.12. I believe I'm using the pymongo api correctly, but every time I run a bulk write; it only updates roughly half of the documents that it matches with.
upserts = [UpdateOne({'_id': event["$set"]["metaData"]["id"]}, {'$set': event["$set"]}, upsert=True)
for event in events_update_data]
result = cursor.bulk_write(upserts, ordered=False)
print(result.bulk_api_result)
Results in:
{'writeErrors': [], 'writeConcernErrors': [], 'nInserted': 0, 'nUpserted': 427, 'nMatched': 116940, 'nModified': 43303, 'nRemoved': 0, 'upserted': removed_by_me}
So why is it matching with all the documents I need it to, but only modifying half? I have to run this multiple times for it to update all docs, and the consistency varies greatly. Here is the operation for each document.
data = {"$set": {f'inventoryData.{self.parse_time}': event.pop("inventory"), 'metaData': event}}
I've also tried different variations of upserting. All reproduce the same result.
bulk_operations = cursor.initialize_unordered_bulk_op()
for event in events_update_data:
bulk_operations.find({'_id': event["$set"]["metaData"]["id"]}).upsert().update({"$set": event["$set"]})
result = bulk_operations.execute()
print(result)
and
upserts = []
for event in events_update_data:
upserts.append(UpdateOne({'_id': event["$set"]["metaData"]["id"]}, {'$set': event["$set"]}, upsert=True))
if len(upserts) == 1000:
try:
result = cursor.bulk_write(upserts, ordered=False)
upserts = []
except BulkWriteError as e:
print(e.details)
print(result.bulk_api_result)
If your nMatched is higher then your nModified, then chances are your "updates" are the same as the original record, hence not modified.

How to sort paginated logs by #timestamp with Elasticsearch?

My goal is to sort millions of logs by timestamp that I receive out of Elasticsearch.
Example logs:
{"realIp": "192.168.0.2", "#timestamp": "2020-12-06T02:00:09.000Z"}
{"realIp": "192.168.0.2", "#timestamp": "2020-12-06T02:01:09.000Z"}
{"realIp": "192.168.0.2", "#timestamp": "2020-12-06T02:02:09.000Z"}
{"realIp": "192.168.0.2", "#timestamp": "2020-12-06T02:04:09.000Z"}
Unfortunately, I am not able to get all the logs sorted out of Elastic. It seems like I have to do it by myself.
Approaches I have tried to get the data sorted out of elastic:
es = Search(index="somelogs-*").using(client).params(preserve_order=True)
for hit in es.scan():
print(hit['#timestamp'])
Another approach:
notifications = (es
.query("range", **{
"#timestamp": {
'gte': 'now-48h',
'lt' : 'now'
}
})
.sort("#timestamp")
.scan()
)
So I am looking for a way to sort these logs by myself or directly through Elasticsearch. Currently, I am saving all the data in a local 'logs.json' and it seems to me I have to iter over and sort it by myself.
You should definitely let Elasticsearch do the sorting, then return the data to you already sorted.
The problem is that you are using .scan(). It uses Elasticsearch's scan/scroll API, which unfortunately only applies the sorting params on each page/slice, not the entire search result. This is noted in the elasticsearch-dsl docs on Pagination:
Pagination
...
If you want to access all the documents matched by your query you can
use the scan method which uses the scan/scroll elasticsearch API:
for hit in s.scan():
print(hit.title)
Note that in this case the results won’t be sorted.
(emphasis mine)
Using pagination is definitely an option especially when you have a "millions of logs" as you said. There is a search_after pagination API:
Search after
You can use the search_after parameter to retrieve the next page of
hits using a set of sort values from the previous page.
...
To get the first page of results, submit a search request with a sort
argument.
...
The search response includes an array of sort values for
each hit.
...
To get the next page of results, rerun the previous search using the last hit’s sort values as the search_after argument. ... The search’s query and sort arguments must remain unchanged. If provided, the from argument must be 0 (default) or -1.
...
You can repeat this process to get additional pages of results.
(omitted the raw JSON requests since I'll show a sample in Python below)
Here's a sample how to do it with elasticsearch-dsl for Python. Note that I'm limiting the fields and the number of results to make it easier to test. The important parts here are the sort and the extra(search_after=).
search = Search(using=client, index='some-index')
# The main query
search = search.extra(size=100)
search = search.query('range', **{'#timestamp': {'gte': '2020-12-29T09:00', 'lt': '2020-12-29T09:59'}})
search = search.source(fields=('#timestamp', ))
search = search.sort({
'#timestamp': {
'order': 'desc'
},
})
# Store all the results (it would be better to be wrap all this in a generator to be performant)
hits = []
# Get the 1st page
results = search.execute()
hits.extend(results.hits)
total = results.hits.total
print(f'Expecting {total}')
# Get the next pages
# Real use-case condition should be "until total" or "until no more results.hits"
while len(hits) < 1000:
print(f'Now have {len(hits)}')
last_hit_sort_id = hits[-1].meta.sort[0]
search = search.extra(search_after=[last_hit_sort_id])
results = search.execute()
hits.extend(results.hits)
with open('results.txt', 'w') as out:
for hit in hits:
out.write(f'{hit["#timestamp"]}\n')
That would lead to an already sorted data:
# 1st 10 lines
2020-12-29T09:58:57.749Z
2020-12-29T09:58:55.736Z
2020-12-29T09:58:53.627Z
2020-12-29T09:58:52.738Z
2020-12-29T09:58:47.221Z
2020-12-29T09:58:45.676Z
2020-12-29T09:58:44.523Z
2020-12-29T09:58:43.541Z
2020-12-29T09:58:40.116Z
2020-12-29T09:58:38.206Z
...
# 250-260
2020-12-29T09:50:31.117Z
2020-12-29T09:50:27.754Z
2020-12-29T09:50:25.738Z
2020-12-29T09:50:23.601Z
2020-12-29T09:50:17.736Z
2020-12-29T09:50:15.753Z
2020-12-29T09:50:14.491Z
2020-12-29T09:50:13.555Z
2020-12-29T09:50:07.721Z
2020-12-29T09:50:05.744Z
2020-12-29T09:50:03.630Z
...
# 675-685
2020-12-29T09:43:30.609Z
2020-12-29T09:43:30.608Z
2020-12-29T09:43:30.602Z
2020-12-29T09:43:30.570Z
2020-12-29T09:43:30.568Z
2020-12-29T09:43:30.529Z
2020-12-29T09:43:30.475Z
2020-12-29T09:43:30.474Z
2020-12-29T09:43:30.468Z
2020-12-29T09:43:30.418Z
2020-12-29T09:43:30.417Z
...
# 840-850
2020-12-29T09:43:27.953Z
2020-12-29T09:43:27.929Z
2020-12-29T09:43:27.927Z
2020-12-29T09:43:27.920Z
2020-12-29T09:43:27.897Z
2020-12-29T09:43:27.895Z
2020-12-29T09:43:27.886Z
2020-12-29T09:43:27.861Z
2020-12-29T09:43:27.860Z
2020-12-29T09:43:27.853Z
2020-12-29T09:43:27.828Z
...
# Last 3
2020-12-29T09:43:25.878Z
2020-12-29T09:43:25.876Z
2020-12-29T09:43:25.869Z
There are some considerations on using search_after as discussed in the API docs:
Use a Point In Time or PIT parameter
If a refresh occurs between these requests, the order of your results may change, causing inconsistent results across pages. To prevent this, you can create a point in time (PIT) to preserve the current index state over your searches.
You need to first make a POST request to get a PIT ID
Then add an extra 'pit': {'id':xxxx, 'keep_alive':5m} parameter to every request
Make sure to use the PIT ID from the last response
Use a tiebreaker
We recommend you include a tiebreaker field in your sort. This tiebreaker field should contain a unique value for each document. If you don’t include a tiebreaker field, your paged results could miss or duplicate hits.
This would depend on your Document schema
# Add some ID as a tiebreaker to the `sort` call
search = search.sort(
{'#timestamp': {
'order': 'desc'
}},
{'some.id': {
'order': 'desc'
}}
)
# Include both the sort ID and the some.ID in `search_after`
last_hit_sort_id, last_hit_route_id = hits[-1].meta.sort
search = search.extra(search_after=[last_hit_sort_id, last_hit_route_id])
Thank you Gino Mempin. It works!
But I also figured out, that a simple change does the same job.
by adding .params(preserve_order=True) elasticsearch will sort all the data.
es = Search(index="somelog-*").using(client)
notifications = (es
.query("range", **{
"#timestamp": {
'gte': 'now-48h',
'lt' : 'now'
}
})
.sort("#timestamp")
.params(preserve_order=True)
.scan()
)

Does spark utilize the sorted order of hbase keys, when using hbase as data source

I store time-series data in HBase. The rowkey is composed from user_id and timestamp, like this:
{
"userid1-1428364800" : {
"columnFamily1" : {
"val" : "1"
}
}
}
"userid1-1428364803" : {
"columnFamily1" : {
"val" : "2"
}
}
}
"userid2-1428364812" : {
"columnFamily1" : {
"val" : "abc"
}
}
}
}
Now I need to perform per-user analysis. Here is the initialization of hbase_rdd (from here)
sc = SparkContext(appName="HBaseInputFormat")
conf = {"hbase.zookeeper.quorum": host, "hbase.mapreduce.inputtable": table}
keyConv = "org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter"
valueConv = "org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter"
hbase_rdd = sc.newAPIHadoopRDD(
"org.apache.hadoop.hbase.mapreduce.TableInputFormat",
"org.apache.hadoop.hbase.io.ImmutableBytesWritable",
"org.apache.hadoop.hbase.client.Result",
keyConverter=keyConv,
valueConverter=valueConv,
conf=conf)
The natural mapreduce-like way to process would be:
hbase_rdd
.map(lambda row: (row[0].split('-')[0], (row[0].split('-')[1], row[1]))) # shift timestamp from key to value
.groupByKey()
.map(processUserData) # process user's data
While executing first map (shift timestamp from key to value) it is crucial to know when the time-series data of the current user is finished and therefore groupByKey transformation could be started. Thus we do not need to map over all table and store all the temporary data. It is possible because hbase stores row-keys in a sorted order.
With hadoop streaming it could be done in such way:
import sys
current_user_data = []
last_userid = None
for line in sys.stdin:
k, v = line.split('\t')
userid, timestamp = k.split('-')
if userid != last_userid and current_user_data:
print processUserData(last_userid, current_user_data)
last_userid = userid
current_user_data = [(timestamp, v)]
else:
current_user_data.append((timestamp, v))
The question is: how to utilize the sorted order of hbase keys within Spark?
I'm not super familiar with the guarantees you get with the way you're pulling data from HBase, but if I understand correctly, I can answer with just plain old Spark.
You've got some RDD[X]. As far as Spark knows, the Xs in that RDD are completely unordered. But you have some outside knowledge, and you can guarantee that the data is in fact grouped by some field of X (and perhaps even sorted by another field).
In that case, you can use mapPartitions to do virtually the same thing you did with hadoop streaming. That lets you iterate over all the records in one partition, so you can look for blocks of records w/ the same key.
val myRDD: RDD[X] = ...
val groupedData: RDD[Seq[X]] = myRdd.mapPartitions { itr =>
var currentUserData = new scala.collection.mutable.ArrayBuffer[X]()
var currentUser: X = null
//itr is an iterator over *all* the records in one partition
itr.flatMap { x =>
if (currentUser != null && x.userId == currentUser.userId) {
// same user as before -- add the data to our list
currentUserData += x
None
} else {
// its a new user -- return all the data for the old user, and make
// another buffer for the new user
val userDataGrouped = currentUserData
currentUserData = new scala.collection.mutable.ArrayBuffer[X]()
currentUserData += x
currentUser = x
Some(userDataGrouped)
}
}
}
// now groupedRDD has all the data for one user grouped together, and we didn't
// need to do an expensive shuffle. Also, the above transformation is lazy, so
// we don't necessarily even store all that data in memory -- we could still
// do more filtering on the fly, eg:
val usersWithLotsOfData = groupedRDD.filter{ userData => userData.size > 10 }
I realize you wanted to use python -- sorry I figure I'm more likely to get the example correct if I write in Scala. and I think the type annotations make the meaning more clear, but that it probably a Scala bias ... :). In any case, hopefully you can understand what is going on and translate it. (Don't worry too much about flatMap & Some & None, probably unimportant if you understand the idea ...)

MongoDb speed decrease

I use mongodb to store compressed html files .
Basically, a complete document of mongod is like:
{'_id': 1, 'p1': data, 'p2': data2, 'p3': data3}
where data, data1, data3 are :bson.binary.Binary(zlib_compressed_html)
I have 12 Million ids and dataX are each one average 90KB,
so each document has at least size 180KB + sizeof(_id) + some_overhead.
The total data size would be at least 2TB.
I would like to notice that '_id' is index.
I insert to mongo with the following way:
def _save(self, mongo_col, my_id, page, html):
doc = mongo_col.find_one({'_id': my_id})
key = 'p%d' % page
success = False
if doc is None:
doc = {'_id': my_id, key: html}
try:
mongo_col.save(doc, safe=True)
success = True
except:
log.exception('Exception saving to mongodb')
else:
try:
mongo_col.update({'_id': my_id}, {'$set': {key: html}})
success = True
except:
log.exception('Exception updating mongodb')
return success
As you can see first I lookup the collection to see if a document with
my_id exists.
If it does not exist then I create it and save it to mongo else I update it.
The problem with the above is that although it was super fast, at some point it became really slow.
I will give you some numbers:
When it was fast I was doing 1.500.000 per 4 hours and after 300.000 per 4 hours.
I suspect that this affects the speed:
Note
When performing update operations that increase the document size beyond the allocated space for that document, the update operation relocates the document on disk and may reorder the document fields depending on the type of update.
As of these driver versions, all write operations will issue a getLastError command to confirm the result of the write operation:
{ getLastError: 1 }
Refer to the documentation on write concern in the Write Operations document for more information.
the above is from : http://docs.mongodb.org/manual/applications/update/
I am saying that because we could have the following :
{'_id: 1, 'p1': some_data}, ...., {'_id': 10000000, 'p2': some_data2}, ...{'_id': N, 'p1': sd3}
and imagine that I am calling the above _save method as:
_save(my_collection, 1, 2, bin_compressed_html)
Then it should update the doc with _id 1 . But if the thing that mongo site is the case,
because I am adding a key to the document it does not fit and should rearrange the document.
It is possible to move the document in the end of the collection, which could be very far on the disk. Could this slow things down?
Or speed slow down has to do with the size of the collection?
In any way to you think it should be more efficient to modify my structure to be like:
{'_id': ObjectId, 'mid': 1, 'p': 1, 'd': html}
where mid=my_id, p=page, d=compressed html
and modify _save method to do only inserts?
def _save(self, mongo_col, my_id, page, html):
doc = {'mid': my_id, 'p': page, 'd': html}
success = False
try:
mongo_col.save(doc, safe=True)
success = True
except:
log.exception('Exception saving to mongodb')
return success
this way I avoid the update (so the rearrange on disk) and one lookup (find_one)
but the documents would be 3x mores and I would have 2 indexes ( _id and mid ) .
What do you suggest?
Document relocation could be an issue if you continue to add pages of html as new attributes. Would it really be an issue to move pages to a new collection where you could simply add them one record each? Also I don't really think MongoDB is a good fit for your use case. E.g. Redis would be much more efficient.
Another thing you should take care of is to have enough ram for your _id index. Use db.mongocol.stats() to check the index size.
When inserting new Documents into MongoDB, a Document can grow without moving it up to a certain point. Because the DB is analyzing the incoming Data and adds a padding to the Document.
So do deal with less Document movements you can do two things:
manually tweaking the padding factor
preallocate space (attributes) for each document.
See Article about Padding or MongoDB Docs for more Information about the padding factor.
Btw. insetad of using save for creating new documents, you should use .insert() which will throw a duplicate key error if the _id is already there (.save() will overwrite your document)

How can I use Mongodb Aggregation in this example?

I am currently using Python to build many of my results instead of MongoDB itself. I am trying to get my head around Aggregation, but I'm struggling a bit. Here is an example of what I am doing currently which perhaps could be better handled by MongoDB.
I have a collection of programs and a collection of episodes. Each program has a list of episodes (DBRefs) associated with it. (The episodes are stored in their own collection because both programs and episodes are quite complex and deep, so embedding is impractical). Each episode has a duration (float). If I want to find a program's average episode duration, I do this:
episodes = list(db.Episodes.find({'Program':DBRef('Programs',ObjectId(...))}))
durations = set(e['Duration'] for e in episodes if e['Duration'] > 0)
avg_mins = int(sum(durations) / len(durations) / 60
This is pretty slow when a program has over 1000 episodes. Is there a way I can do it in MongoDB?
Here is some sample data in Mongo shell format. There are three episodes belonging to the same program. How can I calculate the average episode duration for the program?
> db.Episodes.find({
'_Program':DBRef('Programs',ObjectId('4ec634fbf4c4005664000313'))},
{'_Program':1,'Duration':1}).limit(3)
{
"_id" : ObjectId("506c15cbf4c4005f9c40f830"),
"Duration" : 1643.856,
"_Program" : DBRef("Programs", ObjectId("4ec634fbf4c4005664000313"))
}
{
"_id" : ObjectId("506c15d3f4c4005f9c40f8cf"),
"Duration" : 1598.088,
"_Program" : DBRef("Programs", ObjectId("4ec634fbf4c4005664000313"))
}
{
"_id" : ObjectId("506c15caf4c4005f9c40f80e"),
"_Program" : DBRef("Programs", ObjectId("4ec634fbf4c4005664000313")),
"Duration" : 1667.04
}
I figured it out, and it is ridiculously fast compared to pulling it all into Python.
p = db.Programs.find_one({'Title':'...'})
pipe = [
{'$match':{'_Program':DBRef('Programs',p['_id']),'Duration':{'$gt':0}}},
{'$group':{'_id':'$_Program', 'AverageDuration':{'$avg':'$Duration'}}}
]
eps = db.Episodes.aggregate(pipeline=pipe)
print eps['result']

Categories