I'm trying to process about a hundred million records in MongoDB. Basically, each key (prescription number) corresponds to about 1,300 records (not unique). These keys have been indexed.
Right now, I am querying a specific key with pymongo to return that set of results so they can be processed with Python.
Querying mongo is the biggest bottleneck. It is taking about 20 seconds per query. At the current rate, it will take 400 hrs to query every record.
This is what it looks like when I 'explain' my query:
db.prescriptions.find({'key':68565299}).explain()
{
"cursor" : "BasicCursor",
"nscanned" : 103578563,
"nscannedObjects" : 103578563,
"n" : 1603,
"millis" : 287665,
"nYields" : 0,
"nChunkSkips" : 0,
"isMultiKey" : false,
"indexOnly" : false,
"indexBounds" : {
}
}
And this shows that I have the indexes in place
> db.prescriptions.getIndexes()
[
{
"v" : 1,
"key" : {
"_id" : 1
},
"ns" : "processed_data.prescriptions",
"name" : "_id_"
}
]
Am I off my rocker for trying to run this data processing on one server instance?
(Interestingly, my CPU and RAM do not appear to be maxed out when I run top.)
I would be grateful for any advice.
Thanks!!
Add an index
From the explain result (a BasicCursor scanning all 103,578,563 documents), there is no index on "key"; you need to add one.
> db.prescriptions.ensureIndex({'key': 1});
If mongo reports any kind of warning, you'll need to act on it.
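If you are driving this from Python anyway, here is a minimal PyMongo sketch of the same fix (connection details are assumed; building the index on ~100 million documents can take a while):

from pymongo import MongoClient, ASCENDING

# Assumes a local mongod on the default port; adjust the URI as needed.
prescriptions = MongoClient()['processed_data']['prescriptions']

# Build the ascending index on 'key' (PyMongo's create_index is the
# equivalent of the shell's ensureIndex/createIndex).
prescriptions.create_index([('key', ASCENDING)])

# Once the index exists, per-key lookups should stop scanning the whole collection.
print(prescriptions.find({'key': 68565299}).explain())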
Related
I add documents to a collection using insert_many(). Then, when I go to add more documents, for example the same documents as before plus one additional document (as data is generated daily), I get duplicate entries in my database. How can I simply add the new document(s) and not end up with duplicates? I have tried using update_many({}, {}, {upsert: true}) but nothing appears to be working.
This is what it looks like in the db: the 1st and 2nd of Jan 2019 are uploaded; then a report is generated on the 3rd, and the 1st, 2nd and 3rd all get added, but I only want the 3rd of Jan, since the 1st and 2nd are duplicates:
{ "_id" : ObjectId("5d4178fe4c934b011e86b193"),
"date" : ISODate("2019-01-01"),
"sessions" : "330",
"bounceRate" : "81.22"}
{ "_id" : ObjectId("5d4178fe4c934b011e86b194"),
"date" : ISODate("2019-01-02"),
"sessions" : "333",
"bounceRate" : "80.21"}
{ "_id" : ObjectId("5d4178fe4c934b011e86b195"),
"date" : ISODate("2019-01-01"),
"sessions" : "330",
"bounceRate" : "81.22"}
{"_id" : ObjectId("5d4178fe4c934b011e86b196"),
"date" : ISODate("2019-01-02"),
"sessions" : "333",
"bounceRate" : "80.21"}
{ "_id" : ObjectId("5d4178fe4c934b011e86b197"),
"date" : ISODate("2019-01-03"),
"sessions" : "340",
"bounceRate" : "79.45"}
I have a collection with documents like this:
{
"_id" : "1234567890",
"area" : "Zone 63",
"last_state" : "Cloudy",
"recent_indices" : [
21,
18,
33,
...
38
41
],
"Report_stats" : [
{
"date_hour" : "2017-01-01 01",
"count" : 31
},
{
"date_hour" : "2017-01-01 02",
"count" : 20
},
...
{
"date_hour" : "2018-08-26 13",
"count" : 3
}
]
}
which is supposed to be updated based on some real-time online reports.
Assume each report looks like this:
{
'datetime' : '2018-08-26 13:48:11.677635',
'areas' : 'Zone 3; Zone 45; Zone 63',
'status' : 'Clear',
'index' : '33'
}
Now I have to update the collection in a way that:
Each time a new 'area' (say Zone 1025) shows up in a report, a new document is added to hold the related data.
The new 'index' is appended to the list "recent_indices", while "last_state" is updated to 'status'.
Based on the 'datetime', the respective "Report_stats.count" is incremented by 1, or a new "Report_stats" entry (the 'datetime' at hour resolution, with 'count' set to 1) is inserted.
The way to do each of these updates separately is fairly obvious; the problem is: how can I do all of these simultaneously in a single update/upsert?
I tried update_one and find_one_and_update (as well as update and find_and_modify) using PyMongo, but I was not able (for me at least) to resolve the problem that way.
So I started to wonder whether there is a simple/single operation to do this, or whether I should start trying to fix it in a different way altogether.
Can you please help me with how to do this, or (since a lot of data is being gathered and therefore has to be processed) suggest a low-cost alternative?
Thank you!
I am unsure if I understand your question, but if your problem revolves around upsert, i.e. update the record or add it if it is not there,
you can do it by adding one parameter, like this:
update_one({'_id': 1}, {'$set': {...}}, upsert=True)
If you want to update multiple fields, you can simply put them all in the $set document. For example, to turn this document:
{
    name: 'Kanika',
    age: 19
}
into this one, pass the second document as the $set document:
{
    name: 'Andy',
    age: 30
}
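In PyMongo, a multi-field upsert along those lines would look roughly like this (a sketch; the database and collection names are made up):

from pymongo import MongoClient

collection = MongoClient()['mydb']['people']  # hypothetical names

# Match on one field, set several fields at once, insert if nothing matches.
collection.update_one(
    {'name': 'Kanika'},
    {'$set': {'name': 'Andy', 'age': 30}},
    upsert=True,
)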
Please try looking into: https://docs.mongodb.com/manual/reference/method/db.collection.update/ , if it helps.
Thanks, Kanika
The best solution I have reached so far is this:
if mycollection.find_one({'area': 'zone 45', 'Report_stats.date_hour': '2018-08-26 13'}):
    mycollection.update_one(
        {'area': 'zone 45', 'Report_stats.date_hour': '2018-08-26 13'},
        {
            '$inc': {'Report_stats.$.count': 1},
            '$set': {'last_state': 'Clear'},
            '$push': {'recent_indices': 33},
        },
    )
else:
    mycollection.update_one(
        {'area': 'zone 45'},
        {
            '$set': {'last_state': 'Clear'},
            '$push': {
                'recent_indices': 33,
                'Report_stats': {'date_hour': '2018-08-26 13', 'count': 1},
            },
        },
        upsert=True,
    )
However, it still queries the collection twice (a find_one followed by an update_one) to update one document for one report, which is not quite satisfactory.
Any better suggestions?
What I figured out from your reply above is that if Report_stats.date_hour exists in your document, then you increment the counter, or else you just push a new entry.
I believe we can do it using $cond or $switch. Can you please take a look.
https://docs.mongodb.com/manual/reference/operator/aggregation/cond/#exp._S_cond
Meanwhile, I am trying to write the whole query for you and let's see if it works.
Thanks, Kanika
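For what it is worth, here is a sketch of one way to collapse this into a single update_one call, assuming a server that supports aggregation-pipeline updates (MongoDB 4.2+). It uses $cond as suggested above; the client/collection names are hypothetical and the report values are hard-coded for illustration:

from pymongo import MongoClient

mycollection = MongoClient()['mydb']['areas']  # hypothetical names

# Values taken from the example report
area, status, new_index, hour = 'Zone 63', 'Clear', 33, '2018-08-26 13'

mycollection.update_one(
    {'area': area},
    [
        {'$set': {
            'last_state': status,
            # append the new index (treating a missing array as empty)
            'recent_indices': {'$concatArrays': [
                {'$ifNull': ['$recent_indices', []]}, [new_index]
            ]},
            'Report_stats': {'$cond': [
                # does an entry for this hour already exist?
                {'$in': [hour, {'$ifNull': ['$Report_stats.date_hour', []]}]},
                # yes: increment that entry's count
                {'$map': {
                    'input': '$Report_stats',
                    'as': 's',
                    'in': {'$cond': [
                        {'$eq': ['$$s.date_hour', hour]},
                        {'date_hour': '$$s.date_hour',
                         'count': {'$add': ['$$s.count', 1]}},
                        '$$s',
                    ]},
                }},
                # no: append a fresh entry with count 1
                {'$concatArrays': [
                    {'$ifNull': ['$Report_stats', []]},
                    [{'date_hour': hour, 'count': 1}],
                ]},
            ]},
        }},
    ],
    upsert=True,
)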
I have a JSON-array from a mongoexport containing data from the Beddit sleeptracker. Below is an example of one of the truncated documents (removed some unneeded detail).
{
"user" : "xxx",
"provider" : "beddit",
"date" : ISODate("2016-11-30T23:00:00.000Z"),
"data" : [
{
"end_timestamp" : 1480570804.26226,
"properties" : {
"sleep_efficiency" : 0.8772404,
"resting_heart_rate" : 67.67578,
"short_term_resting_heart_rate" : 61.36963,
"activity_index" : 50.51958,
"average_respiration_rate" : 16.25667,
"total_sleep_score" : 64,
},
"date" : "2016-12-01",
"session_range_start" : 1480545636.55059,
"start_timestamp" : 1480545636.55059,
"session_range_end" : 1480570804.26226,
"tags" : [
"not_enough_sleep",
"long_sleep_latency"
],
"updated" : 1480570805.25201
}
],
"__v" : 0
}
Several related questions like this and this do not seem to work for the data structure above. As recommended in other related questions I am trying to stay away from looping over each row for performance reasons (the full dataset is ~150MB). How would I flatten out the "data"-key with json_normalize so that each key is at the top-level? I would prefer one DataFrame where e.g. total_sleep_score is a column.
Any help is much appreciated! Even though I know how to 'prepare' the data using JavaScript, I would like to be able to understand and do it using Python.
edit (request from comment to show preferred structure):
{
"user" : "xxx",
"provider" : "beddit",
"date" : ISODate("2016-11-30T23:00:00.000Z"),
"end_timestamp" : 1480570804.26226,
"properties.sleep_efficiency" : 0.8772404,
"properties.resting_heart_rate" : 67.67578,
"properties.short_term_resting_heart_rate" : 61.36963,
"properties.activity_index" : 50.51958,
"properties.average_respiration_rate" : 16.25667,
"properties.total_sleep_score" : 64,
"date" : "2016-12-01",
"session_range_start" : 1480545636.55059,
"start_timestamp" : 1480545636.55059,
"session_range_end" : 1480570804.26226,
"updated" : 1480570805.25201,
"__v" : 0
}
Keeping the 'properties.' prefix is not necessary but would be nice.
Try this algorithm to flatten it:
def flattenPattern(pattern):
    newPattern = {}
    if type(pattern) is list:
        pattern = pattern[0]
    if type(pattern) is not str:
        for key, value in pattern.items():
            if type(value) in (list, dict):
                returnedData = flattenPattern(value)
                for i, j in returnedData.items():
                    if key == "data":
                        newPattern[i] = j
                    else:
                        newPattern[key + "." + i] = j
            else:
                newPattern[key] = value
    return newPattern

print(flattenPattern(dictFromJson))
Output:
{
'session_range_start':1480545636.55059,
'start_timestamp':1480545636.55059,
'properties.average_respiration_rate':16.25667,
'session_range_end':1480570804.26226,
'properties.resting_heart_rate':67.67578,
'properties.short_term_resting_heart_rate':61.36963,
'updated':1480570805.25201,
'properties.total_sleep_score':64,
'properties.activity_index':50.51958,
'__v':0,
'user':'xxx',
'provider':'beddit',
'date':'2016-12-01',
'properties.sleep_efficiency':0.8772404,
'end_timestamp':1480570804.26226
}
Although not explicitly what I asked for, the following worked for me so far:
Step 1
Normalize the data records using json_normalize on the original dataset (not inside a Pandas DataFrame) and prefix the new columns with 'data.'.
beddit_data = pd.io.json.json_normalize(beddit, record_path='data', record_prefix='data.', meta='_id')
Step 2
The 'data.properties' column was a Series of dicts, so these can be expanded with .apply(pd.Series):
beddit_data_properties = beddit_data['data.properties'].apply(pd.Series)
Step 3
The final step is to merge both DataFrames. In step 1, I kept meta='_id' so that this DataFrame can be merged with the original DataFrame from Beddit. I didn't include that merge in the final step yet because I want to spend some more time on the results so far.
beddit_final = pd.concat([beddit_data_properties[:], beddit_data[:]], axis=1)
If anyone is interested, I can share the final Jupyter Notebook when it is ready :)
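Putting the three steps together, here is a sketch of the full flow, assuming the export was produced with mongoexport --jsonArray and saved to a file called beddit.json (the file name is made up):

import json

import pandas as pd

# Load the exported JSON array (assumes mongoexport was run with --jsonArray)
with open('beddit.json') as f:
    beddit = json.load(f)

# Step 1: flatten the 'data' array, prefixing its keys and keeping _id as meta
beddit_data = pd.io.json.json_normalize(beddit, record_path='data',
                                        record_prefix='data.', meta='_id')

# Step 2: expand the nested 'data.properties' dicts into their own columns
beddit_data_properties = beddit_data['data.properties'].apply(pd.Series)

# Step 3: put the expanded properties next to the flattened data columns
beddit_final = pd.concat([beddit_data_properties, beddit_data], axis=1)

print(beddit_final.head())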
My problem is admittedly a bit atypical. My Mongo instance's records appear as follows:
{
"_id" : ObjectId("559670400084d37ea4cafa29"),
"('7412791816', '3838144', '723031613')" : {
"Customer_Loc_PinCode" : "110035",
"Net_Delivery_Time" : 3,
"Manifest_Date" : ISODate("2015-04-04T00:00:00Z"),
"Shipping_Date" : ISODate("2015-04-05T00:00:00Z"),
"Shipping_Method_Code" : "COD",
"Origin_PinCode" : "382470",
"Net_Manifest_Time" : 0,
"Transition_State" : [
[
"DNE",
"CTD",
"NULL",
"2015-04-05 15:23:22",
"NULL"
],
...# Many more such tuples present within this list.
],
"Net_Shipping_Time" : 2,
"RTD_Date" : "NULL",
"Delivery_Date" : ISODate("2015-04-07T00:00:00Z"),
"Intervening_Distance" : 522.3881079330106,
"Awb_Number" : "723031613",
"SubOrder_Number" : "7412791816",
"Last_Status" : "SHP",
"Customer_LatLong" : [
-,#Some float value
-#Some float value
],
"Order_Date" : ISODate("2015-04-04T00:00:00Z"),
"RTA_Date" : "NULL",
"Return_Direction" : 0,
"New_Status" : "DEL",
"Origin_LatLong" : [
-,#Some float value
-
],
"Rec_ID" : "3838144",
"RTU_Date" : "NULL"
}}
Now I need to obtain the dates and Net_Delivery_Time (as an example here) for all the records, for further processing (plotting).
However, the major obstacle is that each such dictionary is referenced by a composite key, i.e. a tuple consisting of 3 fields. Each such key uniquely identifies the associated record. I wish to extract the required fields from each such dictionary, but I have no means of iterating through all the keys.
I tried an approach of first collecting all the keys and then retrieving the concerned fields, but that method didn't work, as there is no associated support for that in PyMongo.
If I were to use the db.'collection_name'.find() method, how would I craft the query? Can the uniqueness of each key present any potential problems? And what approach should I employ to achieve this task?
Thank You
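One way to deal with the dynamic field names, sketched in PyMongo (the database and collection names are made up), is to iterate over the cursor and pick out each document's single non-_id key on the client side:

from pymongo import MongoClient

collection = MongoClient()['mydb']['shipments']  # hypothetical names

dates = []
delivery_times = []

for doc in collection.find():
    for field_name, record in doc.items():
        if field_name == '_id':
            continue
        # field_name is the stringified tuple key, record is the inner dict
        dates.append(record.get('Order_Date'))
        delivery_times.append(record.get('Net_Delivery_Time'))

Reshaping the documents so the tuple components become regular top-level fields (e.g. SubOrder_Number, Rec_ID, Awb_Number) would make this queryable server-side, but the loop above works against the data as it stands.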
I have created 4 indexes to test query performance in my collection when querying for two fields of the same document, one of which is an array (so it needs a multikey index). Two of the indexes are single-field and two are compound.
I am surprised to be getting better performance with one of the single-field indexes than with the compound ones. I was expecting the best performance from a compound index, because I understand that it indexes both fields, allowing for faster querying.
These are my indexes:
{ "v" : 1,
"key" : { "_id" : 1 },
"ns" : "bt_twitter.mallorca.mallorca",
"name" : "_id_"
},
{ "v" : 1,
"key" : { "epoch_creation_date" :1 },
"ns" : "bt_twitter.mallorca.mallorca",
"name" : "epoch_creation_date_1"
},
{ "v" : 1,
"key" : { "related_hashtags" : 1 },
"ns" : "bt_twitter.mallorca.mallorca",
"name" : "related_hashtags_1"
},
{ "v" : 1,
"key" : { "epoch_creation_date" : 1, "related_hashtags" : 1 },
"ns" : "bt_twitter.mallorca.mallorca",
"name" : "epoch_creation_date_1_related_hashtags_1"
}
My queries and performance indicators are as follows (the hint parameter shows the index used by each query):
QUERY 1:
active_collection.find(
{'epoch_creation_date': {'$exists': True}},
{"_id": 0, "related_hashtags":1}
).hint([("epoch_creation_date", ASCENDING)]).explain()
millis: 237
nscanned: 101226
QUERY 2:
active_collection.find(
{'epoch_creation_date': {'$exists': True}},
{"_id": 0, "related_hashtags": 1}
).hint([("related_hashtags", ASCENDING)]).explain()
millis: 1131
nscanned: 306715
QUERY 3:
active_collection.find(
{'epoch_creation_date': {'$exists': True}},
{"_id": 0, "related_hashtags": 1}
).hint([("epoch_creation_date", ASCENDING), ("related_hashtags", ASCENDING)]).explain()
millis: 935
nscanned: 306715
QUERY 4:
active_collection.find(
{'epoch_creation_date': {'$exists': True}},
{"_id": 0, "related_hashtags": 1}
).hint([("related_hashtags", ASCENDING),("epoch_creation_date", ASCENDING)]).explain()
millis: 1165
nscanned: 306715
QUERY 1 scans fewer documents, which is probably why it is faster. Can somebody help me understand why it performs better than the queries that use the compound indexes? And when is it better to use a compound index rather than a single-field one?
I am reading the Mongo documentation, but these concepts are hard for me to digest.
Thanks in advance.
UPDATED question (in response to Sammaye and Philipp)
This is the result of a full explain()
"cursor" : "BtreeCursor epoch_creation_date_1",
"isMultiKey" : false,
"n" : 101226,
"nscannedObjects" : 101226,
"nscanned" : 101226,
"nscannedObjectsAllPlans" : 101226,
"nscannedAllPlans" : 101226,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 0,
"nChunkSkips" : 0,
"millis" : 242,
"indexBounds" : {u'epoch_creation_date': [[{u'$minElement': 1}, {u'$maxElement': 1}]]
},
"server" : "vmmongodb:27017"
for the following query:
active_collection.find(
{'epoch_creation_date': {'$exists': True}},
{"_id": 0, "related_hashtags":1})
.hint([("epoch_creation_date", ASCENDING)]).explain()
You created a compound index (named epoch_creation_date_1_related_hashtags_1), but you aren't using it in those hints. Instead of that you are using the two single-field indexes you also created (related_hashtags_1 and epoch_creation_date_1) in different order.
Of those two indexes, only epoch_creation_date_1 is effective, because you aren't querying for both fields. You are only querying for one, and this is 'epoch_creation_date': {'$exists': True}. The field-filtering which you perform with {"_id": 0, "related_hashtags":1} is done on the documents which were found by that query. At that point, indexes are of no use anymore. That means any index on related_hashtags won't be able to increase performance on this query. The compound index (when you would actually use it) might be better than no index at all, but not as good as the index on epoch_creation_date only.
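For illustration, a sketch of the kind of query where the compound index would actually pay off, i.e. where both indexed fields appear in the filter (the hashtag value and the epoch bound are made up, and this assumes epoch_creation_date is stored as a number):

from pymongo import MongoClient, ASCENDING

active_collection = MongoClient()['bt_twitter']['mallorca.mallorca']

# Both fields are constrained, so the compound index
# epoch_creation_date_1_related_hashtags_1 can narrow the scan on both keys.
cursor = active_collection.find(
    {'epoch_creation_date': {'$gte': 1396000000},  # hypothetical lower bound
     'related_hashtags': 'ibiza'}                  # hypothetical hashtag
).hint([('epoch_creation_date', ASCENDING), ('related_hashtags', ASCENDING)])

print(cursor.explain())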
OK, after reading the question more I understand the problem. A multikey index writes an index entry PER array value. This means that if you have 3 values in related_hashtags per document, your index is actually 3x the size and has 3x the number of entries to scan (if my math adds up there...).
nscanned is a counter for how many times a document had to be looked at (note: a counter, not the number of unique documents looked at). This means that, due to the multikey index, you had to scan roughly 3x the amount of (same) documents you normally would, compared to the first query.
This is a known caveat with multikey indexes and why you should be careful about just throwing them around like this.
I believe the reason the third query is so slow is that multikey indexes cannot support indexOnly cursors, so MongoDB could not use a covered query there.
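As a rough way to see that last point in practice, you can compare the indexOnly flag that explain() reports for a projection limited to the non-multikey indexed field against one that pulls the multikey array (a sketch; the $gte bound is arbitrary and assumes epoch_creation_date is numeric):

from pymongo import MongoClient, ASCENDING

active_collection = MongoClient()['bt_twitter']['mallorca.mallorca']

# Filter and projection both touch only epoch_creation_date, so this query
# can be answered from the single-field index alone (indexOnly: true).
covered = active_collection.find(
    {'epoch_creation_date': {'$gte': 0}},
    {'_id': 0, 'epoch_creation_date': 1},
).hint([('epoch_creation_date', ASCENDING)]).explain()

# Even when the compound index contains both fields, the multikey entries for
# related_hashtags prevent a covered query, so indexOnly stays false.
not_covered = active_collection.find(
    {'epoch_creation_date': {'$gte': 0}},
    {'_id': 0, 'related_hashtags': 1},
).hint([('epoch_creation_date', ASCENDING), ('related_hashtags', ASCENDING)]).explain()

print(covered.get('indexOnly'), not_covered.get('indexOnly'))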