Mongo Distinct Query with full row object - python

First of all, I'm new to MongoDB, so I don't know much, and I cannot simply remove the duplicate rows because of some dependencies.
I have the following data stored in Mongo:
{'id': 1, 'key': 'qscderftgbvqscderftgbvqscderftgbvqscderftgbvqscderftgbv', 'name': 'some name', 'country': 'US'},
{'id': 2, 'key': 'qscderftgbvqscderftgbvqscderftgbvqscderftgbvqscderftgbv', 'name': 'some name', 'country': 'US'},
{'id': 3, 'key': 'pehnvosjijipehnvosjijipehnvosjijipehnvosjijipehnvosjiji', 'name': 'some name', 'country': 'IN'},
{'id': 4, 'key': 'pfvvjwovnewpfvvjwovnewpfvvjwovnewpfvvjwovnewpfvvjwovnew', 'name': 'some name', 'country': 'IN'},
{'id': 5, 'key': 'pfvvjwovnewpfvvjwovnewpfvvjwovnewpfvvjwovnewpfvvjwovnew', 'name': 'some name', 'country': 'IN'}
You can see that some of the rows are duplicates with different ids.
Fixing this on the input side will take a long time, so for now I must tackle it on output.
I need the data in the following form:
{'id': 1, 'key': 'qscderftgbvqscderftgbvqscderftgbvqscderftgbvqscderftgbv', 'name': 'some name', 'country': 'US'},
{'id': 3, 'key': 'pehnvosjijipehnvosjijipehnvosjijipehnvosjijipehnvosjiji', 'name': 'some name', 'country': 'IN'},
{'id': 4, 'key': 'pfvvjwovnewpfvvjwovnewpfvvjwovnewpfvvjwovnewpfvvjwovnew', 'name': 'some name', 'country': 'IN'}
My query:
keys = db.collection.distinct('key', {})
all_data = db.collection.find({'key': {'$in': keys}})
As you can see, it takes two queries to get the same result set. Please combine them into one, as the database is very large.
I might also create a unique index on key, but the value is so long (152 characters) that I doubt it will help me.
Or will it?

You need to use the aggregation framework for this. There are multiple ways to do it; the solution below sorts the documents, groups them by key, and uses the $$ROOT variable to keep the first document of each group:
db.data.aggregate([{
    "$sort": {
        "_id": 1
    }
}, {
    "$group": {
        "_id": "$key",
        "first": {
            "$first": "$$ROOT"
        }
    }
}, {
    "$project": {
        "_id": 0,
        "id": "$first.id",
        "key": "$first.key",
        "name": "$first.name",
        "country": "$first.country"
    }
}])
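From Python, the same idea might be sketched with PyMongo as below. This is a minimal sketch, assuming a PyMongo collection object is passed in; the helper name first_document_per_key is hypothetical, and $replaceRoot (available since MongoDB 3.4) is swapped in for the explicit $project so the fields do not have to be listed by hand.

```python
# The pipeline is plain Python data; it is only sent to the server
# when it is passed to Collection.aggregate().
DEDUP_PIPELINE = [
    {"$sort": {"_id": 1}},
    # keep the whole first document seen for each distinct key
    {"$group": {"_id": "$key", "first": {"$first": "$$ROOT"}}},
    # promote the saved document back to the top level
    {"$replaceRoot": {"newRoot": "$first"}},
    # drop the original _id, as the $project stage above does
    {"$project": {"_id": 0}},
]


def first_document_per_key(collection):
    """Return one document per distinct 'key' (hypothetical helper)."""
    # allowDiskUse lets large $sort/$group stages spill to disk
    return list(collection.aggregate(DEDUP_PIPELINE, allowDiskUse=True))
```

It would be called as first_document_per_key(db.data). Note that $group does not guarantee the order of its output, so a final $sort stage may be appended if the order matters.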

Related

Mapping JSON key-value pairs from source to destination using Python

Using Python requests I want to grab a piece of JSON from one source and post it to a destination. The structure of the JSON received and the one required by the destination, however, differs a bit so my question is, how do I best map the items from the source structure onto the destination structure?
To illustrate, imagine we get a list of all purchases made by John and Mary. And now we want to post the individual items purchased linking these to the individuals who purchased them (NOTE: The actual use case involves thousands of entries so I am looking for an approach that would scale accordingly):
Source JSON:
{
    'Total Results': 2,
    'Results': [
        {
            'Name': 'John',
            'Age': 25,
            'Purchases': [
                {
                    'Fruits': {
                        'Type': 'Apple',
                        'Quantity': 3,
                        'Color': 'Red'
                    }
                },
                {
                    'Veggie': {
                        'Type': 'Salad',
                        'Quantity': 2,
                        'Color': 'Green'
                    }
                }
            ]
        },
        {
            'Name': 'Mary',
            'Age': 20,
            'Purchases': [
                {
                    'Fruits': {
                        'Type': 'Orange',
                        'Quantity': 2,
                        'Color': 'Orange'
                    }
                }
            ]
        }
    ]
}
Destination JSON:
[
    {
        'Purchase': 'Apple',
        'Purchased by': 'John',
        'Quantity': 3,
        'Type': 'Red'
    },
    {
        'Purchase': 'Salad',
        'Purchased by': 'John',
        'Quantity': 2,
        'Type': 'Green'
    },
    {
        'Purchase': 'Orange',
        'Purchased by': 'Mary',
        'Quantity': 2,
        'Type': 'Orange'
    }
]
Any help on this would be greatly appreciated! Cheers!
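Just loop through the dict.
from pprint import pprint

# d is the parsed source JSON shown above
res = []
for result in d['Results']:
    for purchase in result['Purchases']:
        # each purchase holds a single category key ('Fruits', 'Veggie', ...)
        item = list(purchase.values())[0]
        # build a fresh dict per purchase so earlier entries are not overwritten
        value = {}
        value['Purchase'] = item['Type']
        value['Purchased by'] = result['Name']
        value['Quantity'] = item['Quantity']
        value['Type'] = item['Color']
        res.append(value)
pprint(res)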
[{'Purchase': 'Apple', 'Purchased by': 'John', 'Quantity': 3, 'Type': 'Red'},
{'Purchase': 'Salad', 'Purchased by': 'John', 'Quantity': 2, 'Type': 'Green'},
{'Purchase': 'Orange', 'Purchased by': 'Mary', 'Quantity': 2, 'Type': 'Orange'}]
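Since the question mentions thousands of entries, the same mapping could also be written as a generator, so records are produced one at a time instead of being collected up front. This is only a sketch, assuming every purchase dict has the single-category shape of the sample; the function name iter_purchases and the trimmed sample data are hypothetical.

```python
def iter_purchases(source):
    """Yield one flat record per purchased item."""
    for person in source['Results']:
        for purchase in person['Purchases']:
            # each purchase is assumed to map one category to its details
            for details in purchase.values():
                yield {
                    'Purchase': details['Type'],
                    'Purchased by': person['Name'],
                    'Quantity': details['Quantity'],
                    'Type': details['Color'],
                }


# Trimmed copy of the source structure, for illustration only
sample = {
    'Results': [
        {'Name': 'John', 'Purchases': [
            {'Fruits': {'Type': 'Apple', 'Quantity': 3, 'Color': 'Red'}},
            {'Veggie': {'Type': 'Salad', 'Quantity': 2, 'Color': 'Green'}},
        ]},
        {'Name': 'Mary', 'Purchases': [
            {'Fruits': {'Type': 'Orange', 'Quantity': 2, 'Color': 'Orange'}},
        ]},
    ]
}

records = list(iter_purchases(sample))
```

The generator can be consumed in batches (for example, when posting to the destination) without holding the whole result list in memory.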

How can I implement this recursion in python?

Let's say that I have a list of dictionaries like this:
dict1 = [{
    'Name': 'Team1',
    'id': '1',
    'Members': [
        {
            'type': 'user',
            'id': '11'
        },
        {
            'type': 'user',
            'id': '12'
        }
    ]
},
{
    'Name': 'Team2',
    'id': '2',
    'Members': [
        {
            'type': 'group',
            'id': '1'
        },
        {
            'type': 'user',
            'id': '21'
        }
    ]
},
{
    'Name': 'Team3',
    'id': '3',
    'Members': [
        {
            'type': 'group',
            'id': '2'
        }
    ]
}]
and I want to get an output that can replace all the groups and nested groups with all distinct users.
In this case the output should look like this:
dict2 = [{
    'Name': 'Team1',
    'id': '1',
    'Members': [
        {
            'type': 'user',
            'id': '11'
        },
        {
            'type': 'user',
            'id': '12'
        }
    ]
},
{
    'Name': 'Team2',
    'id': '2',
    'Members': [
        {
            'type': 'user',
            'id': '11'
        },
        {
            'type': 'user',
            'id': '12'
        },
        {
            'type': 'user',
            'id': '21'
        }
    ]
},
{
    'Name': 'Team3',
    'id': '3',
    'Members': [
        {
            'type': 'user',
            'id': '11'
        },
        {
            'type': 'user',
            'id': '12'
        },
        {
            'type': 'user',
            'id': '21'
        }
    ]
}]
Now let's assume that I have a large dataset to perform these actions on (approx. 20k individual groups).
What would be the best way to code this? I am attempting recursion, but I am not sure how to search through the dictionaries and lists in such a way that it doesn't end up using too much memory.
I do not think you need recursion. Looping is enough.
I think you can simply walk through each Members list, fetch the users of any member of group type, and make them unique. Then you can replace the value of Members with distinct_users.
You might have a dictionary for groups like:
group_dict = {
    '1': [
        {'type': 'user', 'id': '11'},
        {'type': 'user', 'id': '12'}
    ],
    '2': [
        {'type': 'user', 'id': '11'},
        {'type': 'user', 'id': '12'},
        {'type': 'user', 'id': '21'}
    ],
    '3': [
        {'type': 'group', 'id': '1'},
        {'type': 'group', 'id': '2'},
        {'type': 'group', 'id': '3'}  # recursive
    ]
    ...
}
You can try:
def users_in_group(group_id):
    users = []
    groups_to_fetch = []
    for user_or_group in group_dict[group_id]:
        if user_or_group['type'] == 'group':
            groups_to_fetch.append(user_or_group)
        else:  # 'user' type
            users.append(user_or_group)
    # groups already expanded, so that we do not loop forever;
    # the starting group was expanded by the loop above
    groups_fetched = {group_id}
    while groups_to_fetch:
        group = groups_to_fetch.pop()
        if group['id'] not in groups_fetched:
            groups_fetched.add(group['id'])
            for user_or_group in group_dict[group['id']]:
                if user_or_group['type'] == 'group':
                    # queue only groups that have not been expanded yet;
                    # an already-fetched group must not be treated as a user
                    if user_or_group['id'] not in groups_fetched:
                        groups_to_fetch.append(user_or_group)
                else:  # 'user' type
                    users.append(user_or_group)
    return users


def distinct_users_in(members):
    distinct_users = []
    user_id_set = set()

    def add(user):
        if user['id'] not in user_id_set:
            distinct_users.append(user)
            user_id_set.add(user['id'])

    for member in members:
        if member['type'] == 'group':
            for user in users_in_group(member['id']):
                add(user)
        else:  # 'user'
            add(member)
    return distinct_users


dict2 = dict1  # or `copy.deepcopy`
for element in dict2:
    element['Members'] = distinct_users_in(element['Members'])
Each Members value is replaced with the list of distinct users returned by distinct_users_in.
That function takes Members and fetches the users of every member of group type; a member of user type is itself a user. As the fetched users are appended to distinct_users, their ids are recorded in a set to guarantee uniqueness.
When you fetch users in users_in_group, you can use two collections: groups_to_fetch and groups_fetched. The former is a stack that replaces recursion when fetching all groups nested in a group. The latter prevents an already fetched group from being fetched again; without it, the loop could run forever.
Finally, if your data are already in memory, this approach should work without exhausting memory, since it only adds a few lists and sets on top of the data.
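To tie the pieces together, here is a compact variation on the same stack-based idea that builds the group lookup directly from the team list. It is a sketch rather than a definitive implementation: it assumes that a team's id is the group id other teams refer to, and the names teams, build_group_dict and expand_members are hypothetical.

```python
import copy

teams = [
    {'Name': 'Team1', 'id': '1',
     'Members': [{'type': 'user', 'id': '11'}, {'type': 'user', 'id': '12'}]},
    {'Name': 'Team2', 'id': '2',
     'Members': [{'type': 'group', 'id': '1'}, {'type': 'user', 'id': '21'}]},
    {'Name': 'Team3', 'id': '3',
     'Members': [{'type': 'group', 'id': '2'}]},
]


def build_group_dict(teams):
    """Map each team id to its member list (assumes team id == group id)."""
    return {team['id']: team['Members'] for team in teams}


def expand_members(members, group_dict):
    """Return distinct users, expanding nested groups with an explicit stack."""
    users, seen_users, seen_groups = [], set(), set()
    stack = list(reversed(members))  # reversed so members pop in original order
    while stack:
        member = stack.pop()
        if member['type'] == 'group':
            if member['id'] not in seen_groups:  # guards against cycles
                seen_groups.add(member['id'])
                stack.extend(reversed(group_dict.get(member['id'], [])))
        elif member['id'] not in seen_users:
            seen_users.add(member['id'])
            users.append(member)
    return users


group_dict = build_group_dict(teams)
flattened = copy.deepcopy(teams)  # leave the input untouched
for team in flattened:
    team['Members'] = expand_members(team['Members'], group_dict)
```

With 20k groups, caching the expanded user list per group id would avoid re-expanding the same nested groups for every team; that is left out here to keep the sketch short.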

Create avro schema for python dict

I want to create an Avro schema for the following Python dictionary:
d = {
    'topic': 'example',
    'content': (
        {'description': {'name': 'alex', 'value': 12}, 'id': '234ba'},
        {'description': {'name': 'john', 'value': 14}, 'id': '823cx'}
    )
}
How can I do this?
Have you tried the default serialization and deserialization included in the Avro library for Python?
https://avro.apache.org/docs/1.10.0/gettingstartedpython.html
Check whether that does what you want.
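For the dictionary in the question, a schema might look like the sketch below. Avro schemas are JSON documents, so it is written here as a Python dict and serialized with the standard json module. The record names (Message, Item, Description) are hypothetical, since Avro requires every record type to be named, and the value field is assumed to be an int. The content tuple would most likely need to be converted to a list to be written as an Avro array.

```python
import json

# Record names are placeholders chosen for this example.
schema = {
    "type": "record",
    "name": "Message",
    "fields": [
        {"name": "topic", "type": "string"},
        {
            "name": "content",
            "type": {
                "type": "array",
                "items": {
                    "type": "record",
                    "name": "Item",
                    "fields": [
                        {
                            "name": "description",
                            "type": {
                                "type": "record",
                                "name": "Description",
                                "fields": [
                                    {"name": "name", "type": "string"},
                                    {"name": "value", "type": "int"},
                                ],
                            },
                        },
                        {"name": "id", "type": "string"},
                    ],
                },
            },
        },
    ],
}

# JSON text that can be saved to an .avsc file or handed to a schema parser
schema_json = json.dumps(schema, indent=2)
```

The resulting JSON string is what the getting-started guide linked above reads from its schema file.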

Sort and return all of nested dictionaries based on specified key value

I am trying to re-arrange the contents of a nested dictionary according to the value of a specified key.
dict_entries = {
    'entries': {
        'AzP746r3Nl': {
            'uniqueID': 'AzP746r3Nl',
            'index': 2,
            'data': {'comment': 'First Plastique Mat.',
                     'created': '17/01/19 10:18',
                     'project': 'EMZ',
                     'name': 'plastique_varA',
                     'version': '1'},
            'name': 'plastique_varA',
            'text': 'plastique test',
            'thumbnail': '/Desktop/mat/plastique_varA/plastique_varA.jpg',
            'type': 'matEntry'
        },
        'Q2tch2xm6h': {
            'uniqueID': 'Q2tch2xm6h',
            'index': 0,
            'data': {'comment': 'Camino from John Inds.',
                     'created': '03/01/19 12:08',
                     'project': 'EMZ',
                     'name': 'camino_H10a',
                     'version': '1'},
            'name': 'camino_H10a',
            'text': 'John Inds : Camino',
            'thumbnail': '/Desktop/chips/camino_H10a/camino_H10a.jpg',
            'type': 'ChipEntry'
        },
        'ZeqCFCmHqp': {
            'uniqueID': 'ZeqCFCmHqp',
            'index': 1,
            'data': {'comment': 'Prototype Bleu.',
                     'created': '03/01/19 14:07',
                     'project': 'EMZ',
                     'name': 'bleu_P23y',
                     'version': '1'},
            'name': 'bleu_P23y',
            'text': 'Bleu : Prototype',
            'thumbnail': '/Desktop/chips/bleu_P23y/bleu_P23y.jpg',
            'type': 'ChipEntry'
        }
    }
}
In the nested dictionary example above, I am trying to sort the entries by the name key and by the created key (one function for each), and once the entries have been sorted, the index value should be updated accordingly as well.
So far, I am able to query for the values of the said key(s):
for item in dict_entries.get('entries').values():
    # The key that I am targeting
    tar_key = item['name']
but this only gives me the value of the name key, and I am unsure of my next step: I want to sort by the value of the name key while capturing and re-arranging the full contents of the nested dictionaries.
This is my desired output (if checking by name):
{'entries': {
    'ZeqCFCmHqp': {
        'uniqueID': 'ZeqCFCmHqp',
        'index': 1,
        'data': {'comment': 'Prototype Bleu.',
                 'created': '03/01/19 14:07',
                 'project': 'EMZ',
                 'name': 'bleu_P23y',
                 'version': '1'},
        'name': 'bleu_P23y',
        'text': 'Bleu : Prototype',
        'thumbnail': '/Desktop/chips/bleu_P23y/bleu_P23y.jpg',
        'type': 'ChipEntry'
    },
    'Q2tch2xm6h': {
        'uniqueID': 'Q2tch2xm6h',
        'index': 0,
        'data': {'comment': 'Camino from John Inds.',
                 'created': '03/01/19 12:08',
                 'project': 'EMZ',
                 'name': 'camino_H10a',
                 'version': '1'},
        'name': 'camino_H10a',
        'text': 'John Inds : Camino',
        'thumbnail': '/Desktop/chips/camino_H10a/camino_H10a.jpg',
        'type': 'ChipEntry'
    },
    'AzP746r3Nl': {
        'uniqueID': 'AzP746r3Nl',
        'index': 2,
        'data': {'comment': 'First Plastique Mat.',
                 'created': '17/01/19 10:18',
                 'project': 'EMZ',
                 'name': 'plastique_varA',
                 'version': '1'},
        'name': 'plastique_varA',
        'text': 'plastique test',
        'thumbnail': '/Desktop/mat/plastique_varA/plastique_varA.jpg',
        'type': 'matEntry'
    }
}
}
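One possible approach is to sort the (key, entry) pairs with sorted() and rebuild the dictionary, since dicts preserve insertion order in Python 3.7+. The sketch below rests on a few assumptions: the helper names sort_entries, by_name and by_created are hypothetical, the created strings are taken to be day/month/year hour:minute, and index is renumbered from 0 in the new order. The sample data is trimmed to the fields that matter.

```python
from datetime import datetime


def sort_entries(entries, key_func):
    """Return a new dict ordered by key_func(entry), with 'index' renumbered."""
    ordered = sorted(entries.items(), key=lambda pair: key_func(pair[1]))
    result = {}
    for new_index, (unique_id, entry) in enumerate(ordered):
        # shallow-copy the entry so the input dict is left untouched
        result[unique_id] = {**entry, 'index': new_index}
    return result


def by_name(entry):
    return entry['name']


def by_created(entry):
    # assumed format: DD/MM/YY HH:MM, e.g. '17/01/19 10:18'
    return datetime.strptime(entry['data']['created'], '%d/%m/%y %H:%M')


# Trimmed copy of the entries from the question
entries = {
    'AzP746r3Nl': {'index': 2, 'name': 'plastique_varA',
                   'data': {'created': '17/01/19 10:18'}},
    'Q2tch2xm6h': {'index': 0, 'name': 'camino_H10a',
                   'data': {'created': '03/01/19 12:08'}},
    'ZeqCFCmHqp': {'index': 1, 'name': 'bleu_P23y',
                   'data': {'created': '03/01/19 14:07'}},
}

sorted_by_name = {'entries': sort_entries(entries, by_name)}
sorted_by_created = {'entries': sort_entries(entries, by_created)}
```

Parsing the created string into a datetime matters here: sorting the raw strings would compare the day first and give the wrong order across months.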

Filter python dictionary with dictionary-comprehension

I have a dictionary that is really a geojson:
points = {
    'crs': {'properties': {'name': 'urn:ogc:def:crs:OGC:1.3:CRS84'}, 'type': 'name'},
    'features': [
        {'geometry': {
            'coordinates': [[[-3.693162104185235, 40.40734504903418],
                             [-3.69320229317164, 40.40719570724241],
                             [-3.693227952841606, 40.40698546120488],
                             [-3.693677594635894, 40.40712700492216]]],
            'type': 'Polygon'},
         'properties': {
            'name': 'place1',
            'temp': 28},
         'type': 'Feature'
        },
        {'geometry': {
            'coordinates': [[[-3.703886381691941, 40.405197271972035],
                             [-3.702972834622821, 40.40506272989243],
                             [-3.702552994966045, 40.40506798079752],
                             [-3.700985024825222, 40.405500820623814]]],
            'type': 'Polygon'},
         'properties': {
            'name': 'place2',
            'temp': 27},
         'type': 'Feature'
        },
        {'geometry': {
            'coordinates': [[[-3.703886381691941, 40.405197271972035],
                             [-3.702972834622821, 40.40506272989243],
                             [-3.702552994966045, 40.40506798079752],
                             [-3.700985024825222, 40.405500820623814]]],
            'type': 'Polygon'},
         'properties': {
            'name': 'place',
            'temp': 25},
         'type': 'Feature'
        }
    ],
    'type': u'FeatureCollection'
}
I would like to filter it to keep only the places with a specific temperature, for example, more than 25 degrees Celsius.
I have managed to do it this way:
dict(crs=points["crs"],
     features=[i for i in points["features"] if i["properties"]["temp"] > 25],
     type=points["type"])
But I wondered if there was any way to do it more directly, with dictionary comprehension.
Thank you very much.
I'm very late. A dict comprehension won't help you, since you have only three keys. But if you meet the following conditions: 1. you don't need a copy of features (e.g. your dict is read-only); 2. you don't need index access to features, you may use a generator expression instead of a list comprehension:
dict(crs=points["crs"],
     features=(i for i in points["features"] if i["properties"]["temp"] > 25),
     type=points["type"])
The generator is created in constant time, while the list comprehension takes O(n) to build. Furthermore, if you create a lot of those dicts, you have only one copy of the features in memory. Keep in mind that a generator can be iterated over only once.
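The difference between the two variants can be seen in a short sketch. The feature list is a trimmed, hypothetical stand-in for points["features"] that keeps only the properties.

```python
# Trimmed stand-in for points["features"]
features = [
    {'properties': {'name': 'place1', 'temp': 28}},
    {'properties': {'name': 'place2', 'temp': 27}},
    {'properties': {'name': 'place', 'temp': 25}},
]

# List comprehension: filtered immediately, can be indexed and re-read
as_list = [f for f in features if f['properties']['temp'] > 25]

# Generator expression: nothing is filtered until it is iterated
as_gen = (f for f in features if f['properties']['temp'] > 25)

first_pass = [f['properties']['name'] for f in as_gen]
second_pass = [f['properties']['name'] for f in as_gen]  # already exhausted
list_names = [f['properties']['name'] for f in as_list]
```

Here first_pass and list_names hold the same two names, while second_pass is empty, which is why the generator version only suits dicts whose features are read a single time.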
