Create dataframe columns from JSON within CSV column - python

Good day all!
I am trying to flatten some nested JSON using json_normalize, but I the output I keep getting is not what I need.
Here's my code so far:
df1 = pd.read_csv('data_file.csv')
groups_dict = df1['groups']
df2 = pd.json_normalize(groups_dict)
The bit where the dictionary gets created seems to be working as seen here:
groups_dict.info()
groups_dict.head()
<class 'pandas.core.series.Series'>
RangeIndex: 19 entries, 0 to 18
Series name: groups
Non-Null Count Dtype
-------------- -----
19 non-null object
dtypes: object(1)
memory usage: 280.0+ bytes
0 [{'group_id': 798800, 'name': 'Clickers 1 '}]
1 [{'group_id': 798803, 'name': 'Clickers 2'}]
2 [{'group_id': 848426, 'name': 'Colin Safe Brow...
3 [{'group_id': 798804, 'name': 'Clickers 3'}]
4 [{'group_id': 855348, 'name': 'Email Whitelist...
Name: groups, dtype: object
But when I try to normalize the dictionary, I get the following output:
df2 = pd.json_normalize(groups_dict)
df2.head()
0
1
2
3
4
I need to have each item from the groups column listed as it's own column to complete my project. Please see example below for sample data file (csv format) and what I am trying to accomplish.
CSV:
campaign_id,name,groups,status,content,duration_type,start_date,end_date,relative_duration,auto_enroll,allow_multiple_enrollments,completion_percentage
201644,Clicker 1 Retraining ,"[{'group_id': 798800, 'name': 'Clickers 1 '}]",Closed,"[{'store_purchase_id': 1076203, 'content_type': 'Store Purchase', 'name': 'Spot the Phish Game: Foundational', 'description': 'Make sure you can spot a phishing attempt by using this condensed Spot the Phish game. With ten...', 'type': 'Game', 'duration': 5, 'retired': False, 'retirement_date': None, 'publish_date': '2020-10-02T17:08:16.000Z', 'publisher': 'APP1', 'purchase_date': '2022-04-13T00:00:00.000Z', 'policy_url': None}]",Relative End Date,2022-04-19T08:00:00.000Z,,1 weeks,TRUE,FALSE,14
201645,Clicker 2 Retraining ,"[{'group_id': 798803, 'name': 'Clickers 2'}]",In Progress,"[{'store_purchase_id': 1060139, 'content_type': 'Store Purchase', 'name': 'Micro-module – Social Engineering', 'description': 'This five-minute micro-module defines social engineering and describes what criminals are after....', 'type': 'Training Module', 'duration': 5, 'retired': False, 'retirement_date': None, 'publish_date': '2020-09-09T16:06:01.000Z', 'publisher': 'APP2', 'purchase_date': '2022-03-21T00:00:00.000Z', 'policy_url': None}]",Relative End Date,2022-04-13T08:00:00.000Z,,1 weeks,TRUE,FALSE,0
Before script:
df1['groups'].head()
0 [{'group_id': 798800, 'name': 'Clickers 1 '}]
1 [{'group_id': 798803, 'name': 'Clickers 2'}]
2 [{'group_id': 848426, 'name': 'Colin Safe Brow...
3 [{'group_id': 798804, 'name': 'Clickers 3'}]
4 [{'group_id': 855348, 'name': 'Email Whitelist...
Name: groups, dtype: object
After script:
df2.head()
group_id name
0 798800 Clickers 1
1 798803 Clickers 2
2 848426 Colin Safe Brow...
3 798804 Clickers 3
4 855348 Email Whitelist...
Anyone have pointers on how I should proceed?
Any assistance would be greatly appreaciated. Thanks!

You need to first extract the nested dict from its str representation by using eval or ast.literal_eval from the ast module.
You can then create a separate dataframe from the column you want by doing:
import ast
df1['groups'] = df1['groups'].apply(ast.literal_eval)
However, this returns a list of a single dict in your dataset. To combat this, we'll extract the first element of each row.
df1['groups'] = df1['groups'].apply(lambda l: l[0])
df2 = df1['groups'].apply(pd.Series)
Then you can access individual columns such as group_id and name using:
df2['group_id']
df2['name'] # etc.
group_id
0 798800
1 798803
2 848426
3 798804
4 855348
Similarly for other columns within your nested dict.

Related

Several dictionary with duplicate key but different value and no limit in column

Here dataset with unlimited key in dictionary. The detail column in row may have different information products depending on customer.
ID Name Detail
1 Sara [{"Personal":{"ID":"001","Name":"Sara","Type":"01","TypeName":"Book"},"Order":[{"ID":"0001","Date":"20200222","ProductID":"C0123","ProductName":"ABC", "Price":"4"}]},{"Personal":{"ID":"001","Name":"Sara","Type":"02","TypeName":"Food"},"Order":[{"ID":"0004","Date":"20200222","ProductID":"D0123","ProductName":"Small beef", "Price":"15"}]},{"Personal":{"ID":"001","Name":"Sara","Type":"02","TypeName":"Food"},"Order":[{"ID":"0005","Date":"20200222","ProductID":"D0200","ProductName":"Shrimp", "Price":"28"}]}]
2 Frank [{"Personal":{"ID":"002","Name":"Frank","Type":"02","TypeName":"Food"},"Order":[{"ID":"0008","Date":"20200228","ProductID":"D0288","ProductName":"Salmon", "Price":"24"}]}]
My expected output is
ID Name Personal_ID Personal_Name Personal_Type Personal_TypeName Personal_Order_ID Personal_Order_Date Personal_Order_ProductID Personal_Order_ProductName Personal_Order_Price
1 Sara 001 Sara 01 Book 0001 20200222 C0123 ABC 4
2 Sara 001 Sara 02 Food 0004 20200222 D0123 Small beef 15
3 Sara 001 Sara 02 Food 0005 20200222 D0200 Shrimp 28
4 Frank 002 Frank 02 Food 0008 20200228 D0288 Salmon 24
So basically you have a nested JSON in your detail column that you need to break out into a df then merge with your original.
import pandas as pd
import json
from pandas import json_normalize
#create empty df to hold the detail information
detailDf = pd.DataFrame()
#We will need to loop over each row to read each JSON
for ind, row in df.iterrows():
#Read the json, make it a DF, then append the information to the empty DF
detailDf = detailDf.append(json_normalize(json.loads(row['Detail']), record_path = ('Order'), meta = [['Personal','ID'], ['Personal','Name'], ['Personal','Type'],['Personal','TypeName']]))
# Personally, you don't really need the rest of the code, as the columns Personal.Name
# and Personal.ID is the same information, but none the less.
# You will have to merge on name and ID
df = df.merge(detailDf, how = 'right', left_on = [df['Name'], df['ID']], right_on = [detailDf['Personal.Name'], detailDf['Personal.ID'].astype(int)])
#Clean up
df.rename(columns = {'ID_x':'ID', 'ID_y':'Personal_Order_ID'}, inplace = True)
df.drop(columns = {'Detail', 'key_1', 'key_0'}, inplace = True)
If you look through my comments, I recommend using detailDf as your final df as the merge really isnt necessary and that information is already in the Detail JSON.
First you need to create a function that processes the list of dicts in each row of Detail column. Briefly, pandas can process a list of dicts as a dataframe. So all I am doing here is processing the list of dicts in each row of Personal and Detail column, to get mapped dataframes which can be merged for each entry. This function when applied :
def processdicts(x):
personal=pd.DataFrame.from_dict(list(pd.DataFrame.from_dict(x)['Personal']))
personal=personal.rename(columns={"ID": "Personal_ID"})
personal['Personal_Name']=personal['Name']
orders=pd.DataFrame(list(pd.DataFrame.from_dict(list(pd.DataFrame.from_dict(x)['Order']))[0]))
orders=orders.rename(columns={"ID": "Order_ID"})
personDf=orders.merge(personal, left_index=True, right_index=True)
return personDf
Create an empty dataframe that will contain the compiled data
outcome=pd.DataFrame(columns=[],index=[])
Now process the data for each row of the DataFrame using the function we created above. Using a simple for loop here to show the process. 'apply' function can also be called for greater efficiency but with slight modification of the concat process. With an empty dataframe at hand where we will concat the data from each row, for loop is as simple as 2 lines below:
for details in yourdataframe['Detail']:
outcome=pd.concat([outcome,processdicts(details)])
Finally reset index:
outcome=outcome.reset_index(drop=True)
You may rename columns according to your requirement in the final dataframe. For example:
outcome=outcome.rename(columns={"TypeName": "Personal_TypeName","ProductName":"Personal_Order_ProductName","ProductID":"Personal_Order_ProductID","Price":"Personal_Order_Price","Date":"Personal_Order_Date","Order_ID":"Personal_Order_ID","Type":"Personal_Type"})
Order (or skip) the columns according to your requirement using:
outcome=outcome[['Name','Personal_ID','Personal_Name','Personal_Type','Personal_TypeName','Personal_Order_ID','Personal_Order_Date','Personal_Order_ProductID','Personal_Order_ProductName','Personal_Order_Price']]
Assign a name to the index of the dataframe:
outcome.index.name='ID'
This should help.
You can use explode to get all elements of lists in Details separatly, and then you can use Shubham Sharma's answer,
import io
import pandas as pd
#Creating dataframe:
s_e='''
ID Name
1 Sara
2 Frank
'''
df = pd.read_csv(io.StringIO(s_e), sep='\s\s+', engine='python')
df['Detail']=[[{"Personal":{"ID":"001","Name":"Sara","Type":"01","TypeName":"Book"},"Order":[{"ID":"0001","Date":"20200222","ProductID":"C0123","ProductName":"ABC", "Price":"4"}]},{"Personal":{"ID":"001","Name":"Sara","Type":"02","TypeName":"Food"},"Order":[{"ID":"0004","Date":"20200222","ProductID":"D0123","ProductName":"Small beef", "Price":"15"}]},{"Personal":{"ID":"001","Name":"Sara","Type":"02","TypeName":"Food"},"Order":[{"ID":"0005","Date":"20200222","ProductID":"D0200","ProductName":"Shrimp", "Price":"28"}]}],[{"Personal":{"ID":"002","Name":"Frank","Type":"02","TypeName":"Food"},"Order":[{"ID":"0008","Date":"20200228","ProductID":"D0288","ProductName":"Salmon", "Price":"24"}]}]]
#using explode
df = df.explode('Detail').reset_index()
df['Detail']=df['Detail'].apply(lambda x: [x])
print('using explode:', df)
#retrieved from #Shubham Sharma's answer:
personal = df['Detail'].str[0].str.get('Personal').apply(pd.Series).add_prefix('Personal_')
order = df['Detail'].str[0].str.get('Order').str[0].apply(pd.Series).add_prefix('Personal_Order_')
result = pd.concat([df[['ID', "Name"]], personal, order], axis=1)
#reset ID
result['ID']=[i+1 for i in range(len(result.index))]
print(result)
Output:
#Using explode:
index ID Name Detail
0 0 1 Sara [{'Personal': {'ID': '001', 'Name': 'Sara', 'Type': '01', 'TypeName': 'Book'}, 'Order': [{'ID': '0001', 'Date': '20200222', 'ProductID': 'C0123', 'ProductName': 'ABC', 'Price': '4'}]}]
1 0 1 Sara [{'Personal': {'ID': '001', 'Name': 'Sara', 'Type': '02', 'TypeName': 'Food'}, 'Order': [{'ID': '0004', 'Date': '20200222', 'ProductID': 'D0123', 'ProductName': 'Small beef', 'Price': '15'}]}]
2 0 1 Sara [{'Personal': {'ID': '001', 'Name': 'Sara', 'Type': '02', 'TypeName': 'Food'}, 'Order': [{'ID': '0005', 'Date': '20200222', 'ProductID': 'D0200', 'ProductName': 'Shrimp', 'Price': '28'}]}]
3 1 2 Frank [{'Personal': {'ID': '002', 'Name': 'Frank', 'Type': '02', 'TypeName': 'Food'}, 'Order': [{'ID': '0008', 'Date': '20200228', 'ProductID': 'D0288', 'ProductName': 'Salmon', 'Price': '24'}]}]
#result:
ID Name Personal_ID Personal_Name Personal_Type Personal_TypeName Personal_Order_ID Personal_Order_Date Personal_Order_ProductID Personal_Order_ProductName Personal_Order_Price
0 1 Sara 001 Sara 01 Book 0001 20200222 C0123 ABC 4
1 2 Sara 001 Sara 02 Food 0004 20200222 D0123 Small beef 15
2 3 Sara 001 Sara 02 Food 0005 20200222 D0200 Shrimp 28
3 4 Frank 002 Frank 02 Food 0008 20200228 D0288 Salmon 24

how do I split a column into seperate columns in a csv file?

So I'm working on a movie genre data set and the dataset has all the genres in a single column but I want to split them.
here's how the data set looks like:
genres
----------------------------------------------
[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}, {'id': 10751, 'name': 'Family'}]
[{'id': 35, 'name': 'Comedy'}, {'id': 10749, 'name': 'Romance'}]
[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}, {'id': 10749, 'name': 'Romance'}]
[{'id': 35, 'name': 'Comedy'}]
[{'id': 28, 'name': 'Action'}, {'id': 80, 'name': 'Crime'}, {'id': 18, 'name': 'Drama'}, {'id': 53, 'name': 'Thriller'}]
So what I want to do is get only the first genre so the new column should look like:
genres
_____________
Animation
Comedy
Comedy
Comedy
Action
I hope this is clear enough to understand my problem.
Use DataFrame.apply.
The first dictionary in the list is selected in each cell. From that dictionary the name field is selected:
df['genres']=df['genres'].apply(lambda x: x[0]['name'])
print(df)
ID genres
0 0 Animation
1 1 Comedy
2 2 Comedy
3 3 Comedy
4 4 Action
or
df['genres']=df['genres'].apply(lambda x: eval(x)[0]['name'])
TRY THIS
def decode_str_dict(x):
try:
out=eval(x)[0]['name']
except Exception:
try:
out=eval(x)['name']
except Exception:
try:
out=eval(x)
except Exception:
out=x
return out
df['genres'].apply(decode_str_dict)
df['genres'] = df['genres'].map(lambda x:[i['name'] for i in x])
df['first_genre'] = df['genres'][0]
df = df[['name','first_genre']]
This works if the values are considered a string.
from ast import literal_eval
df['genres'] = df.genres.map(lambda x: literal_eval(x)[0]['name'])
Result:
Out[294]:
ID genres
1 0 Animation
2 1 Comedy
3 2 Comedy
4 3 Comedy
5 4 Action

Convert dataframe column values into a list

I have a dataframe named matchdf. It is a huge one so I'm showing the 1st 3 rows and columns of the dataframe:
print(matchdf.iloc[:3,:3]
Unnamed: 0 athletesInvolved awayScore
0 0 [{'id': '39037', 'name': 'Azhar Ali', 'shortNa... 0
1 1 [{'id': '568276', 'name': 'Imam-ul-Haq', 'shor... 0
2 2 [{'id': '568276', 'name': 'Imam-ul-Haq', 'shor... 0
I was working with athletesInvolved column and as you can see it contains a list which is of form:
print(matchdf['athletesInvolved'][0])
[{'id': '39037', 'name': 'Azhar Ali', 'shortName': 'Azhar Ali', 'displayName': 'Azhar Ali'}, {'id': '17134', 'name': 'Tim Murtagh', 'shortName': 'Murtagh', 'displayName': 'Tim Murtagh'}]
However the datatype for this object is str as opposed to a list. How can we convert the above datatype to a list
We can using ast
import ast
df.c=df.c.apply(ast.literal_eval)

Extract values from column of dictionaries using pandas

I am trying to extract the name from the below dictionary:
df = df[[x.get('Name') for x in df['Contact']]]
Given below is how my Dataframe looks like:
data = [{'emp_id': 101,
'name': {'Name': 'Kevin',
'attributes': {'type': 'Contact',
'url': '/services/data/v38.0/sobjects/Contact/00985300000bt4HEG4'}}},
{'emp_id': 102,
'name': {'Name': 'Scott',
'attributes': {'type': 'Contact',
'url': '/services/data/v38.0/sobjects/Contact/00985300000yr5UTR9'}}}]
df = pd.DataFrame(data)
df
emp_id name
0 101 {'Name': 'Kevin', 'attributes': {'type': 'Cont...
1 102 {'Name': 'Scott', 'attributes': {'type': 'Cont...
I get an error:
AttributeError: 'NoneType' object has no attribute 'get'
If there are no NaNs, use json_normalize.
pd.io.json.json_normalize(df.name.tolist())['Name']
0 Kevin
1 Scott
Name: Name, dtype: object
If there are NaNs, you will need to drop them first. However, it is easy to retain the indices.
df
emp_id name
0 101.0 {'Name': 'Kevin', 'attributes': {'type': 'Cont...
1 102.0 NaN
2 103.0 {'Name': 'Scott', 'attributes': {'type': 'Cont...
idx = df.index[df.name.notna()]
names = pd.io.json.json_normalize(df.name.dropna().tolist())['Name']
names.index = idx
names
0 Kevin
2 Scott
Name: Name, dtype: object
Use apply, and use tolist to make it a list:
print(df['name'].apply(lambda x: x.get('Name')).tolist())
Output:
['Kevin', 'Scott']
If don't need list, want Series, use:
print(df['name'].apply(lambda x: x.get('Name')))
Output:
0 Kevin
1 Scott
Name: name, dtype: object
Update:
print(df['name'].apply(lambda x: x['attributes'].get('Name')).tolist())
Try following line:
names = [name.get('Name') for name in df['name']]

Python: Json unpacking into a dataframe

I'm trying to parse json I've recieved from an api into a pandas DataFrame. That json is ierarchical, in this example I have city code, line name and list of stations for this line. Unfortunately I can't "unpack" it. Would be gratefull for help and explanation.
Json:
{'id': '1',
'lines': [{'hex_color': 'FFCD1C',
'id': '8',
'name': 'Калининская', <------Line name
'stations': [{'id': '8.189',
'lat': 55.745113,
'lng': 37.864052,
'name': 'Новокосино', <------Station 1
'order': 0},
{'id': '8.88',
'lat': 55.752237,
'lng': 37.814587,
'name': 'Новогиреево', <------Station 2
'order': 1},
etc.
I'm trying to recieve evrything from lowest level and the add all higher level information (starting from linename):
c = r.content
j = simplejson.loads(c)
tmp=[]
i=0
data1=pd.DataFrame(tmp)
data2=pd.DataFrame(tmp)
pd.concat
station['name']
for station in j['lines']:
data2 = data2.append(pd.DataFrame(station['stations'], station['name']),ignore_index=True)
data2
Once more - the questions are:
How to make it work?
Is this solution an optimal one, or there are some functions I should know about?
Update:
The Json parses normally:
json_normalize(j)
id lines name
1 [{'hex_color': 'FFCD1C', 'stations': [{'lat': ... Москва
Current DataFrame I can get:
data2 = data2.append(pd.DataFrame(station['stations']),ignore_index=True)
id lat lng name order
0 8.189 55.745113 37.864052 Новокосино 0
1 8.88 55.752237 37.814587 Новогиреево 1
Desired dataframe can be represented as:
id lat lng name order Line_Name Id_Top Name_Top
0 8.189 55.745113 37.864052 Новокосино 0 Калининская 1 Москва
1 8.88 55.752237 37.814587 Новогиреево 1 Калининская 1 Москва
In addition to MaxU's answer, I think you still need the highest level id, this should work:
json_normalize(data, ['lines','stations'], ['id',['lines','name']],record_prefix='station_')
Assuming you have the following dictionary:
In [70]: data
Out[70]:
{'id': '1',
'lines': [{'hex_color': 'FFCD1C',
'id': '8',
'name': 'Калининская',
'stations': [{'id': '8.189',
'lat': 55.745113,
'lng': 37.864052,
'name': 'Новокосино',
'order': 0},
{'id': '8.88',
'lat': 55.752237,
'lng': 37.814587,
'name': 'Новогиреево',
'order': 1}]}]}
Solution: use pandas.io.json.json_normalize:
In [71]: pd.io.json.json_normalize(data['lines'],
['stations'],
['name', 'id'],
meta_prefix='parent_')
Out[71]:
id lat lng name order parent_name parent_id
0 8.189 55.745113 37.864052 Новокосино 0 Калининская 8
1 8.88 55.752237 37.814587 Новогиреево 1 Калининская 8
UPDATE: reflects updated question
res = (pd.io.json.json_normalize(data,
['lines', 'stations'],
['id', ['lines', 'name']],
meta_prefix='Line_')
.assign(Name_Top='Москва'))
Result:
In [94]: res
Out[94]:
id lat lng name order Line_id Line_lines.name Name_Top
0 8.189 55.745113 37.864052 Новокосино 0 1 Калининская Москва
1 8.88 55.752237 37.814587 Новогиреево 1 1 Калининская Москва

Categories