parsing nested json using Pandas - python

I want to parse this nested JSON using Pandas, but I'm confused about how to extract the data from the "amount" and "items" columns. The data has hundreds of rows; this is one example:
{
    "_id": "62eaa99b014c9bb30203e48a",
    "amount": {
        "product": 291000,
        "shipping": 75000,
        "admin_fee": 4500,
        "order_voucher_deduction": 0,
        "transaction_voucher_deduction": 0,
        "total": 366000,
        "paid": 366000
    },
    "status": 32,
    "items": [
        {
            "_id": "62eaa99b014c9bb30203e48d",
            "earning": 80400,
            "variants": [
                {
                    "name": "Color",
                    "value": "Black"
                },
                {
                    "name": "Size",
                    "value": "38"
                }
            ],
            "marketplace_price": 65100,
            "product_price": 62000,
            "reseller_price": 145500,
            "product_id": 227991,
            "name": "Heels",
            "sku_id": 890512,
            "internal_markup": 3100,
            "weight": 500,
            "image": "https://product-asset.s3.ap-southeast-1.amazonaws.com/1659384575578.jpeg",
            "quantity": 1,
            "supplier_price": 60140
        }
I've tried using this, and it only shows the index:
dfjson=pd.json_normalize(datasetjson)
dfjson.head(3)
UPDATE
I tried wrapping it in pd.DataFrame, and it does become a DataFrame, but I still haven't figured out how to extract _id, earning, and variants.

Given:
data = {
'_id': '62eaa99b014c9bb30203e48a',
'amount': {'admin_fee': 4500,
'order_voucher_deduction': 0,
'paid': 366000,
'product': 291000,
'shipping': 75000,
'total': 366000,
'transaction_voucher_deduction': 0},
'items': [{'_id': '62eaa99b014c9bb30203e48d',
'earning': 80400,
'image': 'https://product-asset.s3.ap-southeast-1.amazonaws.com/1659384575578.jpeg',
'internal_markup': 3100,
'marketplace_price': 65100,
'name': 'Heels',
'product_id': 227991,
'product_price': 62000,
'quantity': 1,
'reseller_price': 145500,
'sku_id': 890512,
'supplier_price': 60140,
'variants': [{'name': 'Color', 'value': 'Black'},
{'name': 'Size', 'value': '38'}],
'weight': 500}],
'status': 32
}
Doing:
df = pd.json_normalize(data, ['items'], ['amount'])
df = df.join(df.amount.apply(pd.Series))
df = df.join(df.variants.apply(pd.DataFrame)[0].set_index('name').T.reset_index(drop=True))
df = df.drop(['amount', 'variants'], axis=1)
print(df)
Output:
_id earning marketplace_price product_price reseller_price product_id name sku_id internal_markup weight image quantity supplier_price product shipping admin_fee order_voucher_deduction transaction_voucher_deduction total paid Color Size
0 62eaa99b014c9bb30203e48d 80400 65100 62000 145500 227991 Heels 890512 3100 500 https://product-asset.s3.ap-southeast-1.amazon... 1 60140 291000 75000 4500 0 0 366000 366000 Black 38
There's probably a better way to do some of this, but the sample provided wasn't even a valid json object, so I can't be sure what the real data actually looks like.
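One possibly tidier sketch of the same flattening, against the data dict above (the nested meta paths and the variants pivot are choices of mine, not part of the original answer):

import pandas as pd

# Flatten each item to a row, pulling every nested amount field in as a meta column.
df = pd.json_normalize(
    data,
    record_path=['items'],
    meta=[['amount', k] for k in data['amount']],  # -> columns like 'amount.product'
)

# Collapse the list of {'name': ..., 'value': ...} dicts into real columns.
variant_cols = df['variants'].apply(
    lambda vs: {v['name']: v['value'] for v in vs}
).apply(pd.Series)
df = df.drop(columns='variants').join(variant_cols)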

Try pd.json_normalize(datasetjson, max_level=0)
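For reference, a sketch of what that produces on the sample above (assuming the parsed JSON is bound to a dict named datasetjson):

import pandas as pd

df = pd.json_normalize(datasetjson, max_level=0)
print(df.columns.tolist())
# ['_id', 'amount', 'status', 'items'] -- 'amount' and 'items' keep their raw
# dict/list values, which you can then unpack column by column.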

I guess you are confused about working with dictionaries or the JSON format.
This is the same sample you have, but yours is missing ]} at the end. I reformatted it with the whitespace removed, but it's the same data:
dfjson = {"_id":"62eaa99b014c9bb30203e48a","amount":{"product":291000,"shipping":75000,"admin_fee":4500,"order_voucher_deduction":0,"transaction_voucher_deduction":0,"total":366000,"paid":366000},"status":32,"items":[{"_id":"62eaa99b014c9bb30203e48d","earning":80400,"variants":[{"name":"Color","value":"Black"},{"name":"Size","value":"38"}],"marketplace_price":65100,"product_price":62000,"reseller_price":145500,"product_id":227991,"name":"Heels","sku_id":890512,"internal_markup":3100,"weight":500,"image":"https://product-asset.s3.ap-southeast-1.amazonaws.com/1659384575578.jpeg","quantity":1,"supplier_price":60140}]}
Now, if you want to call amount:
dfjson['amount']
# Output
{'product': 291000,
'shipping': 75000,
'admin_fee': 4500,
'order_voucher_deduction': 0,
'transaction_voucher_deduction': 0,
'total': 366000,
'paid': 366000}
If you want to call items:
dfjson['items']
# Output
[{'_id': '62eaa99b014c9bb30203e48d',
'earning': 80400,
'variants': [{'name': 'Color', 'value': 'Black'},
{'name': 'Size', 'value': '38'}],
'marketplace_price': 65100,
'product_price': 62000,
'reseller_price': 145500,
'product_id': 227991,
'name': 'Heels',
'sku_id': 890512,
'internal_markup': 3100,
'weight': 500,
'image': 'https://product-asset.s3.ap-southeast-1.amazonaws.com/1659384575578.jpeg',
'quantity': 1,
'supplier_price': 60140}]
For getting the items, you can create a list:
list_items = []
for i in dfjson['items']:
    list_items.append(i)
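If the end goal is a DataFrame of the items rather than a raw list, a one-line sketch against the same dfjson dict:

import pandas as pd

# Each item dict becomes one row; 'variants' stays as a list-valued column.
items_df = pd.json_normalize(dfjson['items'])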

For how to import the entire JSON data into a pandas DataFrame, check out the answer given by BeRT2me. But perhaps you are only after extracting _id, earning, variants into a Pandas DataFrame, giving:
_id _id_id earning variants
0 62eaa99b014c9bb30203e48a 62eaa99b014c9bb30203e48d 80400 [{'name': 'Color', 'value': 'Black'}, {'name':...
as you state in your question:
but I still haven't got to know how to extract the _id, earning, variants
Notice that the problem with extracting _id, earning, variants is that these values are 'hidden' within a single-item list. Indexing that list with [0] gives the required values:
json_text = """\
{'_id': '62eaa99b014c9bb30203e48a',
'amount': {'admin_fee': 4500,
'order_voucher_deduction': 0,
'paid': 366000,
'product': 291000,
'shipping': 75000,
'total': 366000,
'transaction_voucher_deduction': 0},
'items': [{'_id': '62eaa99b014c9bb30203e48d',
'earning': 80400,
'image': 'https://product-asset.s3.ap-southeast-1.amazonaws.com/1659384575578.jpeg',
'internal_markup': 3100,
'marketplace_price': 65100,
'name': 'Heels',
'product_id': 227991,
'product_price': 62000,
'quantity': 1,
'reseller_price': 145500,
'sku_id': 890512,
'supplier_price': 60140,
'variants': [{'name': 'Color', 'value': 'Black'},
{'name': 'Size', 'value': '38'}],
'weight': 500}],
'status': 32}"""
json_dict = eval(json_text)  # ast.literal_eval(json_text) would be the safer choice here
print(f'{(_id := json_dict["items"][0]["_id"])=}')
print(f'{(earning := json_dict["items"][0]["earning"])=}')
print(f'{(variants := json_dict["items"][0]["variants"])=}')
print('---')
print(f'{_id=}')
print(f'{earning=}')
print(f'{variants=}')
gives:
(_id := json_dict["items"][0]["_id"])='62eaa99b014c9bb30203e48d'
(earning := json_dict["items"][0]["earning"])=80400
(variants := json_dict["items"][0]["variants"])=[{'name': 'Color', 'value': 'Black'}, {'name': 'Size', 'value': '38'}]
---
_id='62eaa99b014c9bb30203e48d'
earning=80400
variants=[{'name': 'Color', 'value': 'Black'}, {'name': 'Size', 'value': '38'}]
If you additionally want a Pandas DataFrame with rows holding all these extracted values, you can loop over all your JSON data files, adding a row to the DataFrame each time:
# Create an empty DataFrame:
df = pd.DataFrame(columns=['_id', '_id_id', 'earning', 'variants'])
# Add rows to df in a loop processing the json data files:
df_to_append = pd.DataFrame(
    [[json_dict['_id'], _id, earning, variants]],
    columns=['_id', '_id_id', 'earning', 'variants']
)
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df = pd.concat([df, df_to_append], ignore_index=True)
pd.set_option('display.max_columns', None)
print(df.to_dict())
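Appending one row per iteration gets slow, since each concat copies the frame. A common pattern (a sketch, with all_json_dicts as a hypothetical iterable over your parsed JSON files) is to collect plain dicts and build the DataFrame once:

import pandas as pd

rows = []
for json_dict in all_json_dicts:  # hypothetical: however you iterate your files
    item = json_dict['items'][0]
    rows.append({
        '_id': json_dict['_id'],
        '_id_id': item['_id'],
        'earning': item['earning'],
        'variants': item['variants'],
    })

df = pd.DataFrame(rows)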

Related

How could I create this dictionary whose values are coming from a DataFrame faster?

I have this dataframe:
event    attendees                                                    duration
meeting  [{"id":1, "email": "email1"}, {"id":2, "email": "email2"}]  3600
lunch    [{"id":2, "email": "email2"}, {"id":3, "email": "email3"}]  7200
Which I am trying to turn into this dictionary:
{
    email1: {
        'num_events_with': 1,
        'duration_of_events': 3600,
    },
    email2: {
        'num_events_with': 2,
        'duration_of_events': 10800,
    },
    email3: {
        'num_events_with': 1,
        'duration_of_events': 7200,
    },
}
except in my case the dataframe has thousands of rows, and the dictionary I'm creating uses multiple columns to get the results for the dictionary keys, so I need to be able to access the information that is relevant to each user email while creating the dictionary.
The purpose of the dictionary is to give information about the people the user has been in events with. So the first dictionary key is saying that the user has been in 1 event with email1 which lasted 3600 seconds.
Here is my approach to getting this dictionary:
# need to sort because I use diff() later
df.sort_values(by='startTime', ascending=True, inplace=True)
# a list of all contacts (emails) that have been in events with user
contacts = contacts_in_events
contact_info_dict = {}
df['attendees_str'] = df['attendees'].astype(str)
for contact in contacts:
    temp_df = df[df['attendees_str'].str.contains(contact)]
    duration_of_events = temp_df['duration'].sum()
    num_events_with = len(temp_df.index)
    contact_info_dict[contact] = {
        'duration_of_events': duration_of_events,
        'num_events_with': num_events_with
    }
but this is too slow. Any ideas for how to do this in a different way that would be faster?
This is the output of the actual dataframe .to_dict('records'):
{
    'creator': {
        'displayName': None,
        'email': 'creator of event',
        'id': None,
        'self': None
    },
    'start': {
        'date': None,
        'dateTime': '2022-09-13T12:30:00-04:00',
        'timeZone': 'America/Toronto'
    },
    'end': {
        'date': None,
        'dateTime': '2022-09-13T13:00:00-04:00',
        'timeZone': 'America/Toronto'
    },
    'attendees': [
        {
            'comment': None,
            'displayName': None,
            'email': 'email1#email.com',
            'responseStatus': 'accepted'
        },
        {
            'comment': None,
            'displayName': None,
            'email': 'email2#email.com',
            'responseStatus': 'accepted'
        }
    ],
    'summary': 'One on One Meeting',
    'description': '...',
    'calendarType': 'work',
    'startTime': Timestamp('2022-09-13 16:30:00+0000', tz='UTC'),
    'endTime': Timestamp('2022-09-13 17:00:00+0000', tz='UTC'),
    'eventDuration': 1800.0,
    'dowStart': 1.0,
    'endStart': 1.0,
    'weekday': True,
    'startTOD': 59400,
    'endTOD': 61200,
    'day': Period('2022-09-13', 'D')
}
Explode 'attendees' into individual rows, then convert the dicts to columns with json_normalize, aggregate the data with groupby.agg, and convert with to_dict:
out = (df
       .explode('attendees', ignore_index=True)
       .pipe(lambda d: d.join(pd.json_normalize(d.pop('attendees'))))
       .groupby('email')
       .agg(**{'num_events_with': ('email', 'count'),
               'duration_of_events': ('duration', 'sum')})
       .to_dict(orient='index')
)
Output:
{'email1': {'num_events_with': 1, 'duration_of_events': 3600},
'email2': {'num_events_with': 2, 'duration_of_events': 10800},
'email3': {'num_events_with': 1, 'duration_of_events': 7200}}
Example
col = ['event', 'attendees', 'duration']
data = [['meeting', [{"id":1, "email": "email1"}, {"id":2, "email": "email2"}], 3600], ['lunch', [{"id":2, "email": "email2"}, {"id":3, "email": "email3"}],7200]]
df = pd.DataFrame(data, columns=col)
Code
df1 = df.explode('attendees')
grouper = df1['attendees'].str['email']
col1 = ['num_events_with', 'duration_of_events']
out = (df1.groupby(grouper)['duration'].agg(['count', sum]).T.set_axis(col1).to_dict())
out:
{'email1': {'num_events_with': 1, 'duration_of_events': 3600},
'email2': {'num_events_with': 2, 'duration_of_events': 10800},
'email3': {'num_events_with': 1, 'duration_of_events': 7200}}
If you want one line, use the following:
(df.explode('attendees')
   .assign(attendees=lambda x: x['attendees'].str['email'])
   .groupby('attendees')['duration'].agg(['count', sum])
   .T.set_axis(['num_events_with', 'duration_of_events'])
   .to_dict())
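Aside: the grouper/assign lines above work because the pandas .str accessor also does element-wise [] lookup on dict values, not just strings. A tiny sketch:

import pandas as pd

s = pd.Series([{'id': 1, 'email': 'email1'}, {'id': 2, 'email': 'email2'}])
print(s.str['email'].tolist())  # ['email1', 'email2']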

Mapping JSON key-value pairs from source to destination using Python

Using Python requests I want to grab a piece of JSON from one source and post it to a destination. The structure of the JSON received and the one required by the destination, however, differs a bit so my question is, how do I best map the items from the source structure onto the destination structure?
To illustrate, imagine we get a list of all purchases made by John and Mary, and now we want to post the individual items purchased, linking them to the individuals who purchased them (NOTE: the actual use case involves thousands of entries, so I am looking for an approach that scales accordingly):
Source JSON:
{
    'Total Results': 2,
    'Results': [
        {
            'Name': 'John',
            'Age': 25,
            'Purchases': [
                {
                    'Fruits': {
                        'Type': 'Apple',
                        'Quantity': 3,
                        'Color': 'Red'
                    }
                },
                {
                    'Veggie': {
                        'Type': 'Salad',
                        'Quantity': 2,
                        'Color': 'Green'
                    }
                }
            ]
        },
        {
            'Name': 'Mary',
            'Age': 20,
            'Purchases': [
                {
                    'Fruits': {
                        'Type': 'Orange',
                        'Quantity': 2,
                        'Color': 'Orange'
                    }
                }
            ]
        }
    ]
}
Destination JSON:
{
    [
        {
            'Purchase': 'Apple',
            'Purchased by': 'John',
            'Quantity': 3,
            'Type': 'Red',
        },
        {
            'Purchase': 'Salad',
            'Purchased by': 'John',
            'Quantity': 2,
            'Type': 'Green',
        },
        {
            'Purchase': 'Orange',
            'Purchased by': 'Mary',
            'Quantity': 2,
            'Type': 'Orange',
        }
    ]
}
Any help on this would be greatly appreciated! Cheers!
Just loop through the dict:
from pprint import pprint

res = []
for result in d['Results']:
    for purchase in result['Purchases']:
        item = list(purchase.values())[0]
        value = {}
        value['Purchase'] = item['Type']
        value['Purchased by'] = result['Name']
        value['Quantity'] = item['Quantity']
        value['Type'] = item['Color']
        res.append(value)
pprint(res)
[{'Purchase': 'Apple', 'Purchased by': 'John', 'Quantity': 3, 'Type': 'Red'},
{'Purchase': 'Salad', 'Purchased by': 'John', 'Quantity': 2, 'Type': 'Green'},
{'Purchase': 'Orange', 'Purchased by': 'Mary', 'Quantity': 2, 'Type': 'Orange'}]
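For what it's worth, the same mapping compresses to a nested list comprehension (same d as above):

res = [
    {
        'Purchase': item['Type'],
        'Purchased by': result['Name'],
        'Quantity': item['Quantity'],
        'Type': item['Color'],
    }
    for result in d['Results']
    for purchase in result['Purchases']
    for item in purchase.values()  # each purchase dict holds a single inner dict
]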

How to convert raw json to required format in pythonic way

I have JSON from some service, where each value is a different row.
Input example:
[
{'author': 'alf', 'topic': 'topic1', 'lang': 'ge', 'value': 11},
{'author': 'alf', 'topic': 'topic1', 'lang': 'ge', 'value': 22},
{'author': 'bob', 'topic': 'topic1', 'lang': 'ge', 'value': 33},
{'author': 'bob', 'topic': 'topic1', 'lang': 'ge', 'value': 44},
{'author': 'alf', 'topic': 'topic1', 'lang': 'fr', 'value': 99},
{'author': 'alf', 'topic': 'topic2', 'lang': 'ge', 'value': -20},
]
Output example:
{
    'alf': {
        'topic1': [
            {'ge': [11, 22]},
            {'fr': [99]}
        ],
        'topic2': [
            {'ge': [-20]}
        ]
    },
    'bob': {
        'topic1': [
            {'ge': [33, 44]}
        ]
    }
}
So basically this is a simple transformation that groups on the specified keys to collect all values into one array.
I did this transformation by checking for and creating each required key if it is missing:
parsed = {}
for entry in self._raw_data:
    author = entry["author"]
    topic = entry["topic"]
    lang = entry["lang"]
    value = entry["value"]
    if not parsed.get(author):
        parsed[author] = {}
    if not parsed[author].get(topic):
        parsed[author][topic] = []
    # etc
I am sure that this could be done in a more transparent way. Can anyone recommend something?
If you're willing to change the type of "topic"'s value from list to dict, you can use .setdefault():
res = {}
for entry in raw_data:
    res.setdefault(entry['author'], {}).setdefault(entry['topic'], {}).setdefault(entry['lang'], []).append(entry['value'])
OUTPUT:
{
    "alf": {
        "topic1": {
            "fr": [99],
            "ge": [11, 22]
        },
        "topic2": {
            "ge": [-20]
        }
    },
    "bob": {
        "topic1": {
            "ge": [33, 44]
        }
    }
}
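An equivalent sketch with collections.defaultdict, which spares the repeated setdefault calls (convert back with dict() at the end if you need a plain dict):

from collections import defaultdict

res = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
for entry in raw_data:
    res[entry['author']][entry['topic']][entry['lang']].append(entry['value'])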

Filter python dictionary with dictionary-comprehension

I have a dictionary that is really a geojson:
points = {
    'crs': {'properties': {'name': 'urn:ogc:def:crs:OGC:1.3:CRS84'}, 'type': 'name'},
    'features': [
        {'geometry': {
             'coordinates': [[[-3.693162104185235, 40.40734504903418],
                              [-3.69320229317164, 40.40719570724241],
                              [-3.693227952841606, 40.40698546120488],
                              [-3.693677594635894, 40.40712700492216]]],
             'type': 'Polygon'},
         'properties': {
             'name': 'place1',
             'temp': 28},
         'type': 'Feature'
        },
        {'geometry': {
             'coordinates': [[[-3.703886381691941, 40.405197271972035],
                              [-3.702972834622821, 40.40506272989243],
                              [-3.702552994966045, 40.40506798079752],
                              [-3.700985024825222, 40.405500820623814]]],
             'type': 'Polygon'},
         'properties': {
             'name': 'place2',
             'temp': 27},
         'type': 'Feature'
        },
        {'geometry': {
             'coordinates': [[[-3.703886381691941, 40.405197271972035],
                              [-3.702972834622821, 40.40506272989243],
                              [-3.702552994966045, 40.40506798079752],
                              [-3.700985024825222, 40.405500820623814]]],
             'type': 'Polygon'},
         'properties': {
             'name': 'place',
             'temp': 25},
         'type': 'Feature'
        }
    ],
    'type': 'FeatureCollection'
}
I would like to filter it to keep only the places that have a specific temperature, for example, more than 25 degrees Celsius.
I have managed to do it this way:
dict(crs=points["crs"],
     features=[i for i in points["features"] if i["properties"]["temp"] > 25],
     type=points["type"])
But I wondered if there was any way to do it more directly, with dictionary comprehension.
Thank you very much.
I'm very late. A dict comprehension won't help you much since you have only three keys. But if you meet the following conditions: 1. you don't need a copy of features (e.g. your dict is read-only); 2. you don't need index access to features, you may use a generator expression instead of a list comprehension:
dict(crs=points["crs"],
     features=(i for i in points["features"] if i["properties"]["temp"] > 25),
     type=points["type"])
The generator is created in constant time, while the list comprehension is created in O(n). Furthermore, if you create a lot of those dicts, you have only one copy of the features in memory.
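That said, if you specifically want the dict-comprehension form the question asks about, one sketch rebuilds the dict while filtering only the 'features' entry:

filtered = {
    key: [f for f in value if f['properties']['temp'] > 25]
         if key == 'features' else value
    for key, value in points.items()
}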

Mongo Distinct Query with full row object

First of all, I'm new to Mongo, so I don't know much, and I cannot just remove duplicate rows due to some dependencies.
I have the following data stored in Mongo:
{'id': 1, 'key': 'qscderftgbvqscderftgbvqscderftgbvqscderftgbvqscderftgbv', 'name': 'some name', 'country': 'US'},
{'id': 2, 'key': 'qscderftgbvqscderftgbvqscderftgbvqscderftgbvqscderftgbv', 'name': 'some name', 'country': 'US'},
{'id': 3, 'key': 'pehnvosjijipehnvosjijipehnvosjijipehnvosjijipehnvosjiji', 'name': 'some name', 'country': 'IN'},
{'id': 4, 'key': 'pfvvjwovnewpfvvjwovnewpfvvjwovnewpfvvjwovnewpfvvjwovnew', 'name': 'some name', 'country': 'IN'},
{'id': 5, 'key': 'pfvvjwovnewpfvvjwovnewpfvvjwovnewpfvvjwovnewpfvvjwovnew', 'name': 'some name', 'country': 'IN'}
You can see that some of the rows are duplicates with different ids.
However long it takes to solve this issue on the input side, for now I must tackle it on output.
I need the data in the following way:
{'id': 1, 'key': 'qscderftgbvqscderftgbvqscderftgbvqscderftgbvqscderftgbv', 'name': 'some name', 'country': 'US'},
{'id': 3, 'key': 'pehnvosjijipehnvosjijipehnvosjijipehnvosjijipehnvosjiji', 'name': 'some name', 'country': 'IN'},
{'id': 4, 'key': 'pfvvjwovnewpfvvjwovnewpfvvjwovnewpfvvjwovnewpfvvjwovnew', 'name': 'some name', 'country': 'IN'}
My query:
keys = db.collection.distinct('key', {})
all_data = db.collection.find({'key': {'$in': keys}})
As you can see, it takes two queries for the same result set. Please help me combine them into one, as the database is very large.
I might also create a unique index on key, but the value is so long (152 characters) that it will not help me.
Or will it?
You need to use the aggregation framework for this. There are multiple ways to do this, the solution below uses the $$ROOT variable to get the first document for each group:
db.data.aggregate([{
    "$sort": {
        "_id": 1
    }
}, {
    "$group": {
        "_id": "$key",
        "first": {
            "$first": "$$ROOT"
        }
    }
}, {
    "$project": {
        "_id": 0,
        "id": "$first.id",
        "key": "$first.key",
        "name": "$first.name",
        "country": "$first.country"
    }
}])
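If you're driving this from Python, a PyMongo sketch of the same pipeline (the client setup, database name, and the data collection name are assumptions matching the answer above):

from pymongo import MongoClient

client = MongoClient()   # assumes a local MongoDB instance
db = client['mydb']      # hypothetical database name

pipeline = [
    {'$sort': {'_id': 1}},
    {'$group': {'_id': '$key', 'first': {'$first': '$$ROOT'}}},
    {'$project': {'_id': 0, 'id': '$first.id', 'key': '$first.key',
                  'name': '$first.name', 'country': '$first.country'}},
]
all_data = list(db.data.aggregate(pipeline))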
