MongoDB query takes too long - Python

I have the following documents in my MongoDB collection:
{'name' : 'abc-1','parent':'abc', 'price': 10}
{'name' : 'abc-2','parent':'abc', 'price': 5}
{'name' : 'abc-3','parent':'abc', 'price': 9}
{'name' : 'abc-4','parent':'abc', 'price': 11}
{'name' : 'efg', 'parent':'', 'price': 10}
{'name' : 'efg-1','parent':'efg', 'price': 5}
{'name' : 'abc-2','parent':'efg','price': 9}
{'name' : 'abc-3','parent':'efg','price': 11}
I want to perform the following actions:
a. Group by distinct parent
b. Sort all the groups by price
c. For each group, select the document with the minimum price
i. Check whether the record's parent SKU exists as a record in the name field
ii. If the name exists, do nothing
iii. If the record does not exist, insert a document with an empty parent and the other values taken from the record selected previously (the one with the minimum price).
I tried to use forEach as follows:
db.file.find().sort([("price", 1)]).forEach(function(doc){
    cnt = db.file.count({"sku": {"$eq": doc.parent}});
    if (cnt < 1){
        newdoc = doc;
        newdoc.name = doc.parent;
        newdoc.parent = "";
        delete newdoc["_id"];
        db.file.insertOne(newdoc);
    }
});
The problem is that it takes too much time. What is wrong here, and how can it be optimized? Would an aggregation pipeline be a good solution, and if so, how can it be done?

1. Retrieve a set of product names
def product_names():
    # Group by name so that each distinct name is returned exactly once
    for product in db.file.aggregate([{'$group': {'_id': '$name'}}]):
        yield product['_id']
product_names = set(product_names())
2. Retrieve the product with the minimum price from each group
result_set = db.file.aggregate([
{
'$sort': {
'price': 1,
}
},
{
'$group': {
'_id': '$parent',
'name': {
'$first': '$name',
},
'price': {
'$min': '$price',
}
}
},
{
'$sort': {
'price': 1,
}
}
])
3. Insert the products retrieved in step 2 if their name is not in the set of product names retrieved in step 1.
from pymongo.operations import InsertOne
def insert_request(product):
    # Dictionary keys must be quoted strings in Python
    return InsertOne({
        'name': product['name'],
        'price': product['price'],
        'parent': ''
    })
requests = (
insert_request(product)
for product in result_set
if product['name'] not in product_names
)
db.file.bulk_write(list(requests))
Steps 2 and 3 can be implemented in the aggregation pipeline.
db.file.aggregate([
{
'$sort': {'price': 1}
},
{
'$group': {
'_id': '$parent',
'name': {
'$first': '$name'
},
'price': {
'$min': '$price'
},
}
},
{
'$sort': {
'price': 1
}
},
{
'$project': {
'name': 1,
'price': 1,
'_id': 0,
'parent':''
}
},
{
'$match': {
'name': {
'$nin': list(product_names)  # product_names is already a set at this point, not a function
}
}
},
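{
# $out would replace the entire 'file' collection with the pipeline result.
# $merge (MongoDB 4.2+) inserts into it instead; merging into the collection
# being aggregated requires MongoDB 4.4+.
'$merge': {'into': 'file'}
}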
])
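To check the logic of the three steps without a database, the same grouping can be sketched in plain Python over the sample documents from the question. This is only an illustrative sketch: `docs`, `cheapest_per_parent`, and `missing_parents` are hypothetical names, and it assumes, following step c.iii of the question, that the new document takes the parent's name and the price of the cheapest child.

```python
docs = [
    {'name': 'abc-1', 'parent': 'abc', 'price': 10},
    {'name': 'abc-2', 'parent': 'abc', 'price': 5},
    {'name': 'abc-3', 'parent': 'abc', 'price': 9},
    {'name': 'abc-4', 'parent': 'abc', 'price': 11},
    {'name': 'efg', 'parent': '', 'price': 10},
    {'name': 'efg-1', 'parent': 'efg', 'price': 5},
    {'name': 'abc-2', 'parent': 'efg', 'price': 9},
    {'name': 'abc-3', 'parent': 'efg', 'price': 11},
]


def cheapest_per_parent(documents):
    """Map each non-empty parent to its child document with the lowest price."""
    cheapest = {}
    for doc in documents:
        parent = doc['parent']
        if not parent:
            continue  # top-level documents have no parent to check
        if parent not in cheapest or doc['price'] < cheapest[parent]['price']:
            cheapest[parent] = doc
    return cheapest


def missing_parents(documents):
    """Build the documents to insert for parents that have no record of their own."""
    names = {doc['name'] for doc in documents}
    return [
        {**doc, 'name': parent, 'parent': ''}
        for parent, doc in cheapest_per_parent(documents).items()
        if parent not in names
    ]


new_docs = missing_parents(docs)
print(new_docs)
```

A single pass over the data replaces the one-count-query-per-document pattern from the question, which is where most of the time goes.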

Related

point specific key and value from highly nested dictionary

I believe there must be a way to access a specific key in a nested dict other than the traditional ways.
Imagine dictionaries like these:
dict1 = { 'level1': "value",
'unexpectable': { 'maybe': { 'onemotime': {'name': 'John'} } } }
dict2 = { 'level1': "value", 'name': 'Steve'}
dict3 = { 'find': { 'what': { 'you': { 'want': { 'in': { 'this': { 'maze': { 'I': { 'made': { 'for': { 'you': { 'which': { 'is in': { 'fact that': { 'was just': { 'bully your': { 'searching': { 'for': { 'the name': { 'even tho': { 'in fact': { 'actually': { 'there': { 'is': { 'in reality': { 'only': { 'one': { 'key': { 'named': { 'name': 'Michael' } } } } } } } } } } } } } } } } } } } } } } } } } } } } } }
In this case, to get 'John', 'Steve', and 'Michael' from the 'name' key, you have to write different code for dict1, dict2, and dict3.
The traditional way I know to access a key buried in a nested dictionary is this:
print(dict1['unexpectable']['maybe']['onemotime']['name'])
If you don't want your code to break because of a missing value, you may want to use the get() method.
I'm curious: if I want to get the 'name' of dict1 safely with get(), should I code it like this?
print(dict1.get('unexpectable', '').get('maybe', '').get('onemotime', '').get('name', ''))
In fact, I got an error when running that get().get().get().get() chain.
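The error with chained get() calls comes from the default value: when a key is missing, `get(key, '')` returns a string, and strings have no `get` method. Below is a minimal sketch of two ways around this, assuming a broken path should yield an empty string. `safe_get` is a hypothetical helper written for illustration, not a built-in.

```python
dict1 = {'level1': 'value',
         'unexpectable': {'maybe': {'onemotime': {'name': 'John'}}}}

# Option 1: use an empty dict as the default for every intermediate level,
# so the next .get() call always has a dict to work on.
name = (dict1.get('unexpectable', {})
             .get('maybe', {})
             .get('onemotime', {})
             .get('name', ''))


# Option 2: walk the path in a loop and stop as soon as it breaks.
def safe_get(d, *keys, default=''):
    """Follow keys one level at a time; return default if the path breaks."""
    current = d
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


print(name)
print(safe_get(dict1, 'unexpectable', 'maybe', 'onemotime', 'name'))
print(repr(safe_get(dict1, 'level1', 'name')))
```

Both options still require knowing the full path; they only make a missing level harmless.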
And consider having to print 'name' from that horrible dict3, even though it actually contains only one such key.
Also imagine extracting 'name' from an unknown dict4 whose nesting structure you cannot predict.
I believe Python must have a way to deal with this.
I searched the internet for this problem, but the solutions seem really difficult.
I just want a simple solution,
one that does not require naming every key at every level,
just the last-level key, the most important one.
Something like print(dict1.lastlevel('name')) --> 'John'
No matter what the nesting structure is, no matter how many duplicates there are, even if a nested key is omitted in the middle so that dict17 has one level fewer than dict16, you should get what you want: the value of the last-level key.
So, in conclusion:
I want to know if there is a simple solution like
print(dict.lastlevel('name'))
without creating a custom function.
I want to know whether such a solution exists among Python's built-in methods, syntax, functions, logic, or concepts,
one that can be applied to dict1, dict2, dict3, or whatever dict comes along.
There is no built-in method to accomplish what you are asking for. However, you can use a recursive function to dig through a nested dictionary. The function checks if the desired key is in the dictionary and returns the value if it is. Otherwise it iterates over the dict's values for other dictionaries and scans their keys as well, repeating until it reaches the bottom.
dict1 = { 'level1': "value",
'unexpectable': { 'maybe': { 'onemotime': {'name': 'John'} } } }
dict2 = { 'level1': "value", 'name': 'Steve'}
dict3 = { 'find': { 'what': { 'you': { 'want': { 'in': { 'this': { 'maze': { 'I': {
'made': { 'for': { 'you': { 'which': { 'is in': { 'fact that': {
'was just': { 'bully your': { 'searching': { 'for': { 'the name': {
'even tho': { 'in fact': { 'actually': { 'there': { 'is': { 'in reality': {
'only': { 'one': { 'key': { 'named': { 'name': 'Michael'
} } } } } } } } } } } } } } } } } } } } } } } } } } } } } }
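def get_nested_dict_key(d, key):
    if key in d:
        return d[key]
    for item in d.values():
        if isinstance(item, dict):
            # Keep searching sibling dicts if this branch does not contain the key
            found = get_nested_dict_key(item, key)
            if found is not None:
                return found
    return None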
print(get_nested_dict_key(dict1, 'name'))
print(get_nested_dict_key(dict2, 'name'))
print(get_nested_dict_key(dict3, 'name'))
# prints:
# John
# Steve
# Michael
You can write a simple recursive generator function that yields the value of every occurrence of a given key:
def get_nested_key(source, key):
    if isinstance(source, dict):
        # Test membership rather than truthiness so falsy values (0, '') are yielded too
        if key in source:
            yield source[key]
        for value in source.values():
            yield from get_nested_key(value, key)
    elif isinstance(source, (list, tuple)):
        for value in source:
            yield from get_nested_key(value, key)
Usage:
dictionaries = [
{'level1': 'value', 'unexpectable': {'maybe': {'onemotime': {'name': 'John'}}}},
{'level1': 'value', 'name': 'Steve'},
{'find': {'what': {'you': {'want': {'in': {'this': {'maze': {'I': {'made': {'for': {'you': {'which': {'is in': {'fact that': {'was just': {'bully your': {'searching': {'for': {'the name': {'even tho': {'in fact': {'actually': {'there': {'is': {'in reality': {'only': {'one': {'key': {'named': {'name': 'Michael'}}}}}}}}}}}}}}}}}}}}}}}}}}}}}},
{'level1': 'value', 'unexpectable': {'name': 'Alex', 'maybe': {'onemotime': {'name': 'John'}}}},
{}
]
for d in dictionaries:
    print(*get_nested_key(d, 'name'), sep=', ')
Output:
John
Steve
Michael
Alex, John

group all elements in arrays from mongo db

I have data in MongoDB with many fields. One of them is the content of the tweets I scraped. I want to extract all hashtags from the content and then group them.
My data looks like this:
{
"_id" : NumberLong(1564531556487659520),
"content" : "Wie hat die #Corona-Pandemie den Arbeitsmarkt in Deutschland verändert? – #JansenAnika und Paula Risius vom #iw_koeln geben auf unserem Blog einen",
"likes" : NumberInt(0),
"replies" : NumberInt(0),
"retweet" : NumberInt(0)
},
{
"_id" : NumberLong(1564531463999168512),
"content" : "Start-ups noch pessimistischer als im #Corona-#Krisenjahr 2020",
"likes" : NumberInt(0),
"replies" : NumberInt(0),
"retweet" : NumberInt(0)
},
{
"_id" : NumberLong(1564531140802789381),
"content" : "Gesundheitsminister #klausholetschek fürchtet das Sinken der Hemmschwelle bei der #Legalisierung von #Cannabis. Ab Mitte September erleben wir in #München wieder das Absinken ganz anderer Hemmschwellen, #Corona-Hotspot inklusive.",
"likes" : NumberInt(1),
"replies" : NumberInt(1),
"retweet" : NumberInt(0)
}
I wrote the following code:
data = db.tweets.aggregate([{
"$project":{
"content":{
"$regexFindAll":{
"input":"$content",
"regex":r'[#]\w+'
}
}
}
},
{
"$group":{
"_id":"$content.match",
"count":{
"$sum":1
}
}
}
])
The result was not what I wanted: it gives me a list of dictionaries, each with an "_id" that contains the list of hashtags collected from one tweet.
My results:
{'_id': ['#Gersemann', '#Corona'], 'count': 1},
{'_id': ['#MAH', '#CORONA', '#CASES'], 'count': 3},
{'_id': ['#corona', '#coronalanding', '#coronasymptoms', '#coronawordpresstheme', '#coronavirus', '#coronavirusprevention', '#covid', '#covid19', '#covid19theme', '#covid19',
'#healthbeauty', '#healthcare', '#imithemes', '#medical'], 'count': 1},
{'_id': ['#China', '#Covid', '#Corona', '#SarsCoV2'], 'count': 1},
{'_id': ['#Gehorsam', '#Staat', '#Unterdr', '#Corona', '#Covid', '#Masken', '#Manie', '#Deutschen', '#Coronauten'], 'count': 1},
{'_id': ['#Maskenregeln', '#Corona', '#COVID19', '#Maske'], 'count': 1},
{'_id': ['#Pandemie', '#GBD', '#Medienversagen', '#Corona'], 'count': 1},
{'_id': ['#Herbst', '#Covid', '#Gesundheit', '#Corona', '#Maskenpflicht', '#Bundesregierung', '#Krankheit', '#Pandemie', '#Wochenblatt', '#WochenblattMedia', '#WochenblattNews'], 'count': 1}, {'_id': ['#COVID19', '#SARSCoV2', '#CORONA'], 'count': 1}]
What I want instead is a count for each individual hashtag.
You can use $unwind to split the array of matches into one document per hashtag before grouping:
[
{
"$project": {
"content": {
"$regexFindAll": {
"input": "$content",
"regex": "[#]\\w+"
}
}
}
},
{
"$unwind": "$content"
},
{
"$group": {
"_id": "$content.match",
"count": {
"$sum": 1
}
}
}
]
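To see what the $unwind step changes, the same counting can be sketched in plain Python: each tweet yields a list of matches, and flattening those lists before counting is the in-memory counterpart of unwinding. This is an illustration only; the `tweets` list holds shortened versions of the sample contents from the question, and `hashtag_counts` is a hypothetical name.

```python
import re
from collections import Counter

# Shortened versions of the sample tweets from the question
tweets = [
    {'content': 'Wie hat die #Corona-Pandemie den Arbeitsmarkt verändert? #JansenAnika #iw_koeln'},
    {'content': 'Start-ups noch pessimistischer als im #Corona-#Krisenjahr 2020'},
    {'content': '#klausholetschek #Legalisierung von #Cannabis in #München, #Corona-Hotspot inklusive.'},
]

# The nested loop flattens the per-tweet match lists (the "$unwind" step);
# Counter then tallies each hashtag on its own (the "$group" step).
hashtag_counts = Counter(
    tag
    for tweet in tweets
    for tag in re.findall(r'#\w+', tweet['content'])
)

for tag, count in hashtag_counts.most_common():
    print(tag, count)
```

For large collections the aggregation pipeline is likely the better choice, since it avoids transferring every tweet to the client.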

python script to create the cloud function

We have written a Python script to create a Cloud Function with an HTTPS trigger. We need to invoke the function and fetch its output, so we are using environment variables for that, but somehow the value is not getting stored.
def generate_config(context):
    """ Entry point for the deployment resources. """
    name = context.properties.get('name', context.env['name'])
    project_id = context.properties.get('project', context.env['project'])
    region = context.properties['region']
    resources = []
    resources.append(
        {
            'name': 'createfunction',
            'type': 'gcp-types/cloudfunctions-v1:projects.locations.functions',
            'properties':
            {
                'function': "licensevalidation",
                'parent': 'projects//locations/',
                'sourceArchiveUrl': 'gs://path',
                'entryPoint':'handler',
                'httpsTrigger': {"url": "https://.cloudfunctions.net/licensevalidation","securityLevel": "SECURE_ALWAYS"},
                'timeout': '60s',
                'serviceAccountEmail' : '.iam.gserviceaccount.com',
                'availableMemoryMb': 256,
                'runtime': 'python37' ,
                'environmentvaiable' :
            }
        }
    )
    call ={
        'type': 'gcp-types/cloudfunctions-v1:cloudfunctions.projects.locations.functions.call',
        'name': 'call',
        'properties':
        {
            'name':'/licensevalidation',
            'data': '{""}'
        },
        'metadata': {
            'dependsOn': ['createfunction']
        }
    }
    resources.append(call)
    return{
        'resources': resources,
        'outputs':
        [
            {
                'name': 'installationtoken',
                'value': 'os.environ.get(environment_variable)'
            },
        ]
    }

Merge dictionaries with same key from two lists of dicts in python

I have two dictionaries, as below. Both dictionaries have a list of dictionaries as the value associated with their properties key; each dictionary within these lists has an id key. I wish to merge my two dictionaries into one such that the properties list in the resulting dictionary only has one dictionary for each id.
{
"name":"harry",
"properties":[
{
"id":"N3",
"status":"OPEN",
"type":"energetic"
},
{
"id":"N5",
"status":"OPEN",
"type":"hot"
}
]
}
and the other dictionary:
{
"name":"harry",
"properties":[
{
"id":"N3",
"type":"energetic",
"language": "english"
},
{
"id":"N6",
"status":"OPEN",
"type":"cool"
}
]
}
The output I am trying to achieve is:
{
"name":"harry",
"properties":[
{
"id":"N3",
"status":"OPEN",
"type":"energetic",
"language": "english"
},
{
"id":"N5",
"status":"OPEN",
"type":"hot"
},
{
"id":"N6",
"status":"OPEN",
"type":"cool"
}
]
}
As id: N3 is common in both the lists, those 2 dicts should be merged with all the fields. So far I have tried using itertools and
ds = [d1, d2]
d = {}
for k in d1.keys():
    d[k] = tuple(d[k] for d in ds)
Could someone please help in figuring this out?
Here is one approach:
a = {
"name":"harry",
"properties":[
{
"id":"N3",
"status":"OPEN",
"type":"energetic"
},
{
"id":"N5",
"status":"OPEN",
"type":"hot"
}
]
}
b = {
"name":"harry",
"properties":[
{
"id":"N3",
"type":"energetic",
"language": "english"
},
{
"id":"N6",
"status":"OPEN",
"type":"cool"
}
]
}
# Map each id to its index in the respective properties list
a_ids = {item['id']: index for index,item in enumerate(a['properties'])} #{'N3': 0, 'N5': 1}
b_ids = {item['id']: index for index,item in enumerate(b['properties'])} #{'N3': 0, 'N6': 1}
# Loop through the ids of the first dict
for prop_id in a_ids:
    if prop_id in b_ids:
        # The same id exists in the other dict: update it with the key-value pairs from a
        b['properties'][b_ids[prop_id]].update(a['properties'][a_ids[prop_id]])
    else:
        # The id does not exist there: just append the new dict
        b['properties'].append(a['properties'][a_ids[prop_id]])
print(b)
Output:
{'name': 'harry', 'properties': [{'id': 'N3', 'type': 'energetic', 'language': 'english', 'status': 'OPEN'}, {'id': 'N6', 'status': 'OPEN', 'type': 'cool'}, {'id': 'N5', 'status': 'OPEN', 'type': 'hot'}]}
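If mutating b is not desired, the same merge can be sketched with a dictionary keyed by id, which also keeps the order in which ids first appear. This is a minimal sketch under the assumption that every property has an "id" and that later values may overwrite earlier ones for the same field; `merge_properties` is a hypothetical name.

```python
a = {
    "name": "harry",
    "properties": [
        {"id": "N3", "status": "OPEN", "type": "energetic"},
        {"id": "N5", "status": "OPEN", "type": "hot"},
    ],
}
b = {
    "name": "harry",
    "properties": [
        {"id": "N3", "type": "energetic", "language": "english"},
        {"id": "N6", "status": "OPEN", "type": "cool"},
    ],
}


def merge_properties(first, second):
    """Return a new dict whose properties hold one merged entry per id."""
    merged = {}
    for prop in first["properties"] + second["properties"]:
        # setdefault creates a fresh dict per id, so the inputs stay untouched
        merged.setdefault(prop["id"], {}).update(prop)
    return {**first, "properties": list(merged.values())}


result = merge_properties(a, b)
print(result)
```

Because each merged entry is a new dict, both a and b remain unchanged after the call.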
It might help to treat the two objects as elements each in their own lists. Maybe you have other objects with different name values, such as might come out of a JSON-formatted REST request.
Then you could do a left outer join on both name and id keys:
#!/usr/bin/env python
a = [
{
"name": "harry",
"properties": [
{
"id":"N3",
"status":"OPEN",
"type":"energetic"
},
{
"id":"N5",
"status":"OPEN",
"type":"hot"
}
]
}
]
b = [
{
"name": "harry",
"properties": [
{
"id":"N3",
"type":"energetic",
"language": "english"
},
{
"id":"N6",
"status":"OPEN",
"type":"cool"
}
]
}
]
a_names = set()
a_prop_ids_by_name = {}
a_by_name = {}
for ao in a:
    an = ao['name']
    a_names.add(an)
    if an not in a_prop_ids_by_name:
        a_prop_ids_by_name[an] = set()
    for ap in ao['properties']:
        api = ap['id']
        a_prop_ids_by_name[an].add(api)
    a_by_name[an] = ao
res = []
for bo in b:
    bn = bo['name']
    if bn not in a_names:
        res.append(bo)
    else:
        ao = a_by_name[bn]
        bp = bo['properties']
        for bpo in bp:
            if bpo['id'] not in a_prop_ids_by_name[bn]:
                ao['properties'].append(bpo)
        res.append(ao)
print(res)
The idea above is to process list a for names and ids. The names and ids-by-name are instances of a Python set. So members are always unique.
Once you have these sets, you can do the left outer join on the contents of list b.
Either there's an object in b that doesn't exist in a (i.e. no object in a shares its name), in which case you add that object to the result as-is. Or there is an object in b that does exist in a (i.e. they share a common name), in which case you iterate over that object's properties and look for ids not already in the a ids-by-name set. You add the missing properties to the object from a, and then add that processed object to the result.
Output:
[{'name': 'harry', 'properties': [{'id': 'N3', 'status': 'OPEN', 'type': 'energetic'}, {'id': 'N5', 'status': 'OPEN', 'type': 'hot'}, {'id': 'N6', 'status': 'OPEN', 'type': 'cool'}]}]
This doesn't do any error checking on input. This relies on name values being unique per object. So if you have duplicate keys in objects in both lists, you may get garbage (incorrect or unexpected output).

How to access a MongoDB array in which key-value pairs are stored, by key name

I am working with pymongo and after writing aggregate query
db.collection.aggregate([{'$project': {'Id': '$ResultData.Id','data' : '$Results.Data'}}])
I received the object:
{'data': [{'key': 'valid', 'value': 'true'},
{'key': 'number', 'value': '543543'},
{'key': 'name', 'value': 'Saturdays cx'},
{'key': 'message', 'value': 'it is valid.'},
{'key': 'city', 'value': 'London'},
{'key': 'street', 'value': 'Bigeye'},
{'key': 'pc', 'value': '3566'}],
Is there a way to access the values by key name? For example, '$Results.Data.city' would return London. I would like to do this at the level of the MongoDB aggregate query, that is, write a query like this:
db.collection.aggregate([{'$project':
{'Id': '$ResultData.Id',
'data' : '$Results.Data',
'city' : '$Results.Data.city',
'name' : '$Results.Data.name',
'street' : '$Results.Data.street',
'pc' : '$Results.Data.pc',
}}])
And receive all the values of provided keys.
Using the $elemMatch projection operator in the following query from mongo shell:
db.collection.find(
{ _id: <some_value> },
{ _id: 0, data: { $elemMatch: { key: "city" } } }
)
The output:
{ "data" : [ { "key" : "city", "value" : "London" } ] }
Using PyMongo (gets the same output):
collection.find_one(
{ '_id': <some_value> },
{ '_id': 0, 'data': { '$elemMatch': { 'key': 'city' } } }
)
Using PyMongo aggregate method (gets the same result):
import pprint
INPUT_KEY = 'city'  # must be defined before the pipeline that references it
pipeline = [
{
'$project': {
'_id': 0,
'data': {
'$filter': {
'input': '$data', 'as': 'dat',
'cond': { '$eq': [ '$$dat.key', INPUT_KEY ] }
}
}
}
}
]
pprint.pprint(list(collection.aggregate(pipeline)))
Naming the received object "result", if result['data'] always is a list of dictionaries with 2 keys (key and value), you can convert the whole list to a dictionary using keys as keys and values as values. Given that this statement is somewhat confusing, here's the code:
data = {pair['key']: pair['value'] for pair in result['data']}
From here, data['city'] will give you 'London', data['street'] will be 'Bigeye', and so on. This assumes that there are no conflicts among the key values in result['data']. Note that this dictionary (just like the original result['data']) only contains strings, so don't expect data['number'] to be an integer.
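As a quick check of the conversion described above, here is a small self-contained sketch that uses the object from the question as a literal. The `result` variable stands in for one document returned by the aggregate call, and the explicit `int()` conversion is an assumption about how the number would be used.

```python
# Stand-in for one document returned by the aggregate query in the question
result = {
    'data': [
        {'key': 'valid', 'value': 'true'},
        {'key': 'number', 'value': '543543'},
        {'key': 'name', 'value': 'Saturdays cx'},
        {'key': 'message', 'value': 'it is valid.'},
        {'key': 'city', 'value': 'London'},
        {'key': 'street', 'value': 'Bigeye'},
        {'key': 'pc', 'value': '3566'},
    ]
}

# Turn the list of key-value pairs into a plain dictionary
data = {pair['key']: pair['value'] for pair in result['data']}

print(data['city'])
print(data['street'])

# All values are strings, so numeric fields need an explicit conversion
number = int(data['number'])
print(number + 1)
```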
Another approach would be to dynamically create an object holding each key-value pair as an attribute, allowing the syntax data.city, data.street, and so on. But this would require more complicated code and is a less common and less stable approach.
