How to convert a JSON string in a data frame column into multiple columns - python

I have a data frame with a JSON string in a column:
ID   Data
11   {'Name': 'Sam', 'Age': 21}
22   {'Name': 'Nam', 'Age': 22}
33   {'Name': 'Pam', 'Age': 21, 'Salary': 10000}
How can I convert the above JSON string into columns?
Desired result:
ID   Name   Age   Salary
11   Sam    21
22   Nam    22
33   Pam    21    10000

You can use pandas.Series to read the dictionaries in your column out into separate columns.
Creating the data:
import pandas as pd

data = {
    'Id': [11, 22, 33],
    'Data': ["{'Name': 'Sam', 'Age': 21}",
             "{'Name': 'Nam', 'Age': 22}",
             "{'Name': 'Pam', 'Age': 21, 'Salary': 10000}"],
}
df = pd.DataFrame(data)
Converting the dictionary strings to columns:
import ast

# ast.literal_eval is a safer alternative to eval for parsing dict literals
df['Data'] = df['Data'].map(lambda x: ast.literal_eval(x) if pd.notnull(x) else x)
df = pd.concat([df, df.pop("Data").apply(pd.Series)], axis=1)
Output:
Id Name Age Salary
0 11 Sam 21 NaN
1 22 Nam 22 NaN
2 33 Pam 21 10000.0
Alternate solution
You can also use json_normalize to unravel the dictionary column into columns named after the dictionary keys:
import ast

df['Data'] = df['Data'].map(lambda x: ast.literal_eval(x) if pd.notnull(x) else x)
df = pd.concat([df, pd.json_normalize(df.pop("Data"))], axis=1)
which gives you the same output.
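Note that parsing with eval/literal_eval is needed here because the strings use Python dict syntax (single quotes). If the column held actual JSON (double-quoted keys), json.loads would be the natural parser — a small sketch on a hypothetical double-quoted variant of the data:

```python
import json

import pandas as pd

# Hypothetical variant of the data using valid JSON (double quotes)
df = pd.DataFrame({
    'Id': [11, 22],
    'Data': ['{"Name": "Sam", "Age": 21}', '{"Name": "Nam", "Age": 22}'],
})

# Parse each JSON string into a dict, then expand the dicts into columns
parsed = df.pop('Data').map(json.loads)
df = pd.concat([df, pd.json_normalize(parsed.tolist())], axis=1)
print(df)
```

json.loads has the added benefit of rejecting anything that is not valid JSON, unlike eval.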

Related

Replace certain values of one column, with different values from a different df, pandas

I have a df, for example -
df = pd.DataFrame({'name': ['name1', 'name2', 'name3', 'name4'],
                   'age': [21, 23, 24, 28],
                   'occupation': ['data scientist', 'doctor', 'data analyst', 'engineer'],
                   'knowledge': ['python', 'medical', 'sql', 'c++'],
                   })
and another df -
df2 = pd.DataFrame({'occupation': ['data scientist', 'data analyst'],
                    'knowledge': ['5', '4'],
                    })
I want to replace the knowledge values of the first df with the knowledge values of the second, but only for the rows where the occupation is the same, making the first df look like this:
df = pd.DataFrame({'name': ['name1', 'name2', 'name3', 'name4'],
                   'age': [21, 23, 24, 28],
                   'occupation': ['data scientist', 'doctor', 'data analyst', 'engineer'],
                   'knowledge': ['5', 'medical', '4', 'c++'],
                   })
I tried to do stuff with replace, but it didn't work...
You may try this:
occ_know_dict = df2.set_index('occupation').to_dict()['knowledge']
df['knowledge'] = df[['knowledge', 'occupation']].apply(
    lambda row: occ_know_dict[row['occupation']] if row['occupation'] in occ_know_dict else row['knowledge'],
    axis=1)
You can try map: map df2's knowledge column onto df via the shared occupation values, then update df with the result.
df['knowledge'].update(df['occupation'].map(df2.set_index('occupation')['knowledge']))
Note that update happens inplace.
print(df)
name age occupation knowledge
0 name1 21 data scientist 5
1 name2 23 doctor medical
2 name3 24 data analyst 4
3 name4 28 engineer c++
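If you'd rather avoid the in-place update, the same map can be combined with fillna to build the column explicitly — a sketch using frames shaped like the ones above:

```python
import pandas as pd

df = pd.DataFrame({'occupation': ['data scientist', 'doctor', 'data analyst', 'engineer'],
                   'knowledge': ['python', 'medical', 'sql', 'c++']})
df2 = pd.DataFrame({'occupation': ['data scientist', 'data analyst'],
                    'knowledge': ['5', '4']})

# Occupations present in df2 get its knowledge value; everything else maps to NaN,
# and fillna falls back to the original knowledge values for those rows
mapped = df['occupation'].map(df2.set_index('occupation')['knowledge'])
df['knowledge'] = mapped.fillna(df['knowledge'])
```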

Several dictionary with duplicate key but different value and no limit in column

Here is a dataset with an unlimited number of keys in the dictionaries. The Detail column in each row may contain different product information depending on the customer.
ID Name Detail
1 Sara [{"Personal":{"ID":"001","Name":"Sara","Type":"01","TypeName":"Book"},"Order":[{"ID":"0001","Date":"20200222","ProductID":"C0123","ProductName":"ABC", "Price":"4"}]},{"Personal":{"ID":"001","Name":"Sara","Type":"02","TypeName":"Food"},"Order":[{"ID":"0004","Date":"20200222","ProductID":"D0123","ProductName":"Small beef", "Price":"15"}]},{"Personal":{"ID":"001","Name":"Sara","Type":"02","TypeName":"Food"},"Order":[{"ID":"0005","Date":"20200222","ProductID":"D0200","ProductName":"Shrimp", "Price":"28"}]}]
2 Frank [{"Personal":{"ID":"002","Name":"Frank","Type":"02","TypeName":"Food"},"Order":[{"ID":"0008","Date":"20200228","ProductID":"D0288","ProductName":"Salmon", "Price":"24"}]}]
My expected output is
ID Name Personal_ID Personal_Name Personal_Type Personal_TypeName Personal_Order_ID Personal_Order_Date Personal_Order_ProductID Personal_Order_ProductName Personal_Order_Price
1 Sara 001 Sara 01 Book 0001 20200222 C0123 ABC 4
2 Sara 001 Sara 02 Food 0004 20200222 D0123 Small beef 15
3 Sara 001 Sara 02 Food 0005 20200222 D0200 Shrimp 28
4 Frank 002 Frank 02 Food 0008 20200228 D0288 Salmon 24
So basically you have a nested JSON in your detail column that you need to break out into a df then merge with your original.
import pandas as pd
import json
from pandas import json_normalize

# Collect the detail information from each row, then concat once at the end
# (DataFrame.append was removed in pandas 2.0)
detail_frames = []
# We will need to loop over each row to read each JSON
for ind, row in df.iterrows():
    # Read the JSON, make it a DF, then collect it
    detail_frames.append(json_normalize(json.loads(row['Detail']),
                                        record_path='Order',
                                        meta=[['Personal', 'ID'], ['Personal', 'Name'],
                                              ['Personal', 'Type'], ['Personal', 'TypeName']]))
detailDf = pd.concat(detail_frames, ignore_index=True)
# Personally, you don't really need the rest of the code, as the columns Personal.Name
# and Personal.ID carry the same information, but nonetheless:
# you will have to merge on name and ID
df = df.merge(detailDf, how='right',
              left_on=[df['Name'], df['ID']],
              right_on=[detailDf['Personal.Name'], detailDf['Personal.ID'].astype(int)])
# Clean up
df.rename(columns={'ID_x': 'ID', 'ID_y': 'Personal_Order_ID'}, inplace=True)
df.drop(columns={'Detail', 'key_1', 'key_0'}, inplace=True)
If you look through my comments, I recommend using detailDf as your final df, as the merge really isn't necessary and that information is already in the Detail JSON.
First you need to create a function that processes the list of dicts in each row of the Detail column. Briefly, pandas can treat a list of dicts as a dataframe. So all I am doing here is processing the Personal and Order entries of each row's list of dicts to get matching dataframes, which can be merged for each entry. Here is the function:
def processdicts(x):
    personal = pd.DataFrame.from_dict(list(pd.DataFrame.from_dict(x)['Personal']))
    personal = personal.rename(columns={"ID": "Personal_ID"})
    personal['Personal_Name'] = personal['Name']
    orders = pd.DataFrame(list(pd.DataFrame.from_dict(list(pd.DataFrame.from_dict(x)['Order']))[0]))
    orders = orders.rename(columns={"ID": "Order_ID"})
    personDf = orders.merge(personal, left_index=True, right_index=True)
    return personDf
Create an empty dataframe that will contain the compiled data:
outcome = pd.DataFrame(columns=[], index=[])
Now process the data for each row of the DataFrame using the function created above. A simple for loop is used here to show the process ('apply' could also be called for greater efficiency, with a slight modification of the concat step). With an empty dataframe at hand to concat the data from each row into, the loop is as simple as these two lines:
for details in yourdataframe['Detail']:
    outcome = pd.concat([outcome, processdicts(details)])
Finally, reset the index:
outcome=outcome.reset_index(drop=True)
You may rename columns according to your requirement in the final dataframe. For example:
outcome=outcome.rename(columns={"TypeName": "Personal_TypeName","ProductName":"Personal_Order_ProductName","ProductID":"Personal_Order_ProductID","Price":"Personal_Order_Price","Date":"Personal_Order_Date","Order_ID":"Personal_Order_ID","Type":"Personal_Type"})
Order (or skip) the columns according to your requirement using:
outcome=outcome[['Name','Personal_ID','Personal_Name','Personal_Type','Personal_TypeName','Personal_Order_ID','Personal_Order_Date','Personal_Order_ProductID','Personal_Order_ProductName','Personal_Order_Price']]
Assign a name to the index of the dataframe:
outcome.index.name='ID'
This should help.
You can use explode to get each element of the lists in Detail separately, and then you can use Shubham Sharma's answer:
import io
import pandas as pd
# Creating dataframe:
s_e = '''
ID Name
1 Sara
2 Frank
'''
df = pd.read_csv(io.StringIO(s_e), sep=r'\s\s+', engine='python')
df['Detail']=[[{"Personal":{"ID":"001","Name":"Sara","Type":"01","TypeName":"Book"},"Order":[{"ID":"0001","Date":"20200222","ProductID":"C0123","ProductName":"ABC", "Price":"4"}]},{"Personal":{"ID":"001","Name":"Sara","Type":"02","TypeName":"Food"},"Order":[{"ID":"0004","Date":"20200222","ProductID":"D0123","ProductName":"Small beef", "Price":"15"}]},{"Personal":{"ID":"001","Name":"Sara","Type":"02","TypeName":"Food"},"Order":[{"ID":"0005","Date":"20200222","ProductID":"D0200","ProductName":"Shrimp", "Price":"28"}]}],[{"Personal":{"ID":"002","Name":"Frank","Type":"02","TypeName":"Food"},"Order":[{"ID":"0008","Date":"20200228","ProductID":"D0288","ProductName":"Salmon", "Price":"24"}]}]]
# using explode
df = df.explode('Detail').reset_index()
df['Detail'] = df['Detail'].apply(lambda x: [x])
print('using explode:', df)

# retrieved from @Shubham Sharma's answer:
personal = df['Detail'].str[0].str.get('Personal').apply(pd.Series).add_prefix('Personal_')
order = df['Detail'].str[0].str.get('Order').str[0].apply(pd.Series).add_prefix('Personal_Order_')
result = pd.concat([df[['ID', 'Name']], personal, order], axis=1)

# reset ID
result['ID'] = [i + 1 for i in range(len(result.index))]
print(result)
Output:
#Using explode:
index ID Name Detail
0 0 1 Sara [{'Personal': {'ID': '001', 'Name': 'Sara', 'Type': '01', 'TypeName': 'Book'}, 'Order': [{'ID': '0001', 'Date': '20200222', 'ProductID': 'C0123', 'ProductName': 'ABC', 'Price': '4'}]}]
1 0 1 Sara [{'Personal': {'ID': '001', 'Name': 'Sara', 'Type': '02', 'TypeName': 'Food'}, 'Order': [{'ID': '0004', 'Date': '20200222', 'ProductID': 'D0123', 'ProductName': 'Small beef', 'Price': '15'}]}]
2 0 1 Sara [{'Personal': {'ID': '001', 'Name': 'Sara', 'Type': '02', 'TypeName': 'Food'}, 'Order': [{'ID': '0005', 'Date': '20200222', 'ProductID': 'D0200', 'ProductName': 'Shrimp', 'Price': '28'}]}]
3 1 2 Frank [{'Personal': {'ID': '002', 'Name': 'Frank', 'Type': '02', 'TypeName': 'Food'}, 'Order': [{'ID': '0008', 'Date': '20200228', 'ProductID': 'D0288', 'ProductName': 'Salmon', 'Price': '24'}]}]
#result:
ID Name Personal_ID Personal_Name Personal_Type Personal_TypeName Personal_Order_ID Personal_Order_Date Personal_Order_ProductID Personal_Order_ProductName Personal_Order_Price
0 1 Sara 001 Sara 01 Book 0001 20200222 C0123 ABC 4
1 2 Sara 001 Sara 02 Food 0004 20200222 D0123 Small beef 15
2 3 Sara 001 Sara 02 Food 0005 20200222 D0200 Shrimp 28
3 4 Frank 002 Frank 02 Food 0008 20200228 D0288 Salmon 24
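The explode step can also feed json_normalize's record_path/meta arguments directly, skipping the re-wrapping of each element into a one-item list — a sketch on a trimmed-down version of the data (fewer keys, for brevity):

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 2], 'Name': ['Sara', 'Frank']})
df['Detail'] = [
    [{"Personal": {"ID": "001", "TypeName": "Book"},
      "Order": [{"ID": "0001", "Price": "4"}]},
     {"Personal": {"ID": "001", "TypeName": "Food"},
      "Order": [{"ID": "0004", "Price": "15"}]}],
    [{"Personal": {"ID": "002", "TypeName": "Food"},
      "Order": [{"ID": "0008", "Price": "24"}]}],
]

# One row per list element, then one row per order;
# Personal fields are pulled in via meta paths
exploded = df.explode('Detail')
detail = pd.json_normalize(exploded['Detail'].tolist(),
                           record_path='Order',
                           meta=[['Personal', 'ID'], ['Personal', 'TypeName']])
print(detail)
```

From here the Personal.* columns can be renamed to the Personal_* scheme of the expected output.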

filter pandas where some columns contain any of the words in a list

I would like to filter a Dataframe. The resulting dataframe should contain all the rows where any of a number of columns contains any of the words in a list.
I started with for loops, but there should be a better pythonic/pandonic way.
Example:
# importing pandas
import pandas as pd

# Creating the dataframe with dict of lists
df = pd.DataFrame({'Name': ['Geeks', 'Peter', 'James', 'Jack', 'Lisa'],
                   'Team': ['Boston', 'Boston', 'Boston', 'Chele', 'Barse'],
                   'Position': ['PG', 'PG', 'UG', 'PG', 'UG'],
                   'Number': [3, 4, 7, 11, 5],
                   'Age': [33, 25, 34, 35, 28],
                   'Height': ['6-2', '6-4', '5-9', '6-1', '5-8'],
                   'Weight': [89, 79, 113, 78, 84],
                   'College': ['MIT', 'MIT', 'MIT', 'Stanford', 'Stanford'],
                   'Salary': [99999, 99994, 89999, 78889, 87779]},
                  index=['ind1', 'ind2', 'ind3', 'ind4', 'ind5'])
df1 = df[df['Team'].str.contains("Boston") | df['College'].str.contains('MIT')]
print(df1)
So it is clear how to filter columns individually that contain a particular word.
Further on, it is also clear how to filter rows per column containing any of the strings of a list:
df[df.Name.str.contains('|'.join(search_values ))]
Where search_values contains a list of words or strings.
search_values = ['boston','mike','whatever']
I am looking for a short way to code
#pseudocode
give me a subframe of df where any of the columns 'Name','Position','Team' contains any of the words in search_values
I know I can do
df[df['Name'].str.contains('|'.join(search_values)) | df['Position'].str.contains('|'.join(search_values)) | df['Team'].str.contains('|'.join(search_values))]
but if I had, say, 20 columns, that would be a mess of a line of code.
Any suggestions?
EDIT Bonus:
When looking in a list of columns, i.e. 'Name', 'Position', 'Team', how do I also include the index? Passing ['index', 'Name', 'Position', 'Team'] does not work.
thanks.
I had a look to this:
https://www.geeksforgeeks.org/get-all-rows-in-a-pandas-dataframe-containing-given-substring/
https://kanoki.org/2019/03/27/pandas-select-rows-by-condition-and-string-operations/
Filter out rows based on list of strings in Pandas
You can also stack and then reduce back to rows with groupby(level=0).any() (Series.any(level=0) was deprecated in pandas 1.3):
cols_list = ['Name', 'Team']  # add column names
df[df[cols_list].stack().str.contains('|'.join(search_values), case=False, na=False)
   .groupby(level=0).any()]
Name Team Position Number Age Height Weight College Salary
ind1 Geeks Boston PG 3 33 6-2 89 MIT 99999
ind2 Peter Boston PG 4 25 6-4 79 MIT 99994
ind3 James Boston UG 7 34 5-9 113 MIT 89999
Do apply with any:
df[[c1, c2]].apply(lambda x: x.str.contains('|'.join(search_values)), axis=1).any(axis=1)
You can simply apply in this case:
cols_to_filter = ['Name', 'Position', 'Team']
search_values = ['word1', 'word2']
patt = '|'.join(search_values)
mask = df[cols_to_filter].apply(lambda x: x.str.contains(patt)).any(axis=1)
df[mask]
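For the bonus question — searching the index as well — one option is to reset_index first, so the index becomes a regular column (named 'index' by default) that can be searched like any other; a sketch:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Geeks', 'Peter', 'James'],
                   'Team': ['Boston', 'Chele', 'Barse']},
                  index=['ind1', 'ind2', 'ind3'])
search_values = ['boston', 'ind3']
patt = '|'.join(search_values)

tmp = df.reset_index()  # the old index becomes a column named 'index'
mask = (tmp[['index', 'Name', 'Team']]
        .apply(lambda c: c.str.contains(patt, case=False, na=False))
        .any(axis=1))
result = df[mask.to_numpy()]  # mask has a fresh RangeIndex, so align by position
```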

how to fix 'size', 'occurred at index City' error

I am trying to do the example in Use Python & Pandas to replace NaN in the 'size' column with a specific value, depending on the City. In the example below I am trying to assign a value of 18 if the City is St. Louis.
I have used a lambda function to do it, since the original dataframe has a lot of rows with repeated City names and only a few of them have NaN values.
When I run the code I get an error - KeyError: ('size', 'occurred at index City')
below is the snippet of the code -
raw_data = {'City': ['Dallas', 'Chicago', 'St Louis', 'SFO', 'St Louis'],
            'size': [24, 36, 'NaN', 'NaN', 22],
            'Type': ['Pie', 'Hallo', 'Zombi', 'Dru', 'Zoro']
            }
df = pd.DataFrame(raw_data)
df
df['size'] = df.apply(lambda x : x['size'].fillna(value = 18 if x['City' == 'St Louis'] else x['size'], axis = 1, inplace = True))
df
Expected - 18 to be populated in size column for St. Louis
Actual - KeyError: ('size', 'occurred at index City')
If all you're trying to do is set the size of St. Louis, you can run:
df.loc[df['City'] == 'St Louis', 'size'] = 18
However, if you instead want to set all values of NaN to 18, you could likewise run:
df.loc[df['size'] == 'NaN', 'size'] = 18
And if you'd just like to set the size of all St Louis entries where the size is NaN, you could combine the conditions with & (using and on Series raises a ValueError):
df.loc[(df['City'] == 'St Louis') & (df['size'] == 'NaN'), 'size'] = 18
There is a simple solution by fillna method
df['size'] = df['size'].fillna(18)
EDITED
What I failed to notice is that you populate the cells with the string 'NaN', not with real NaN values.
If you change your input data to
import numpy as np

raw_data = {'City': ['Dallas', 'Chicago', 'St Louis', 'SFO', 'St Louis'],
            'size': [24, 36, np.nan, np.nan, 22],
            'Type': ['Pie', 'Hallo', 'Zombi', 'Dru', 'Zoro']
            }
then the following method will allow you to re-populate the size column's cells by city name:
df = pd.DataFrame(raw_data)
df[['City', 'size']] = df.set_index('City')['size'].fillna({'St Louis': 18, 'SFO': 20}).reset_index()
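An equivalent approach, without touching the index, maps each city to its fill value and hands the resulting Series to fillna (a sketch; cities without a mapping simply stay NaN):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'City': ['Dallas', 'Chicago', 'St Louis', 'SFO', 'St Louis'],
                   'size': [24, 36, np.nan, np.nan, 22]})

# fillna accepts a Series: each NaN is replaced by the value aligned at its position
df['size'] = df['size'].fillna(df['City'].map({'St Louis': 18, 'SFO': 20}))
```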

Extract values from column of dictionaries using pandas

I am trying to extract the name from the dictionaries below:
df = df[[x.get('Name') for x in df['Contact']]]
Given below is how my Dataframe looks like:
data = [{'emp_id': 101,
         'name': {'Name': 'Kevin',
                  'attributes': {'type': 'Contact',
                                 'url': '/services/data/v38.0/sobjects/Contact/00985300000bt4HEG4'}}},
        {'emp_id': 102,
         'name': {'Name': 'Scott',
                  'attributes': {'type': 'Contact',
                                 'url': '/services/data/v38.0/sobjects/Contact/00985300000yr5UTR9'}}}]
df = pd.DataFrame(data)
df
emp_id name
0 101 {'Name': 'Kevin', 'attributes': {'type': 'Cont...
1 102 {'Name': 'Scott', 'attributes': {'type': 'Cont...
I get an error:
AttributeError: 'NoneType' object has no attribute 'get'
If there are no NaNs, use json_normalize (pd.io.json.json_normalize is deprecated; use the top-level pd.json_normalize):
pd.json_normalize(df.name.tolist())['Name']
0 Kevin
1 Scott
Name: Name, dtype: object
If there are NaNs, you will need to drop them first. However, it is easy to retain the indices.
df
emp_id name
0 101.0 {'Name': 'Kevin', 'attributes': {'type': 'Cont...
1 102.0 NaN
2 103.0 {'Name': 'Scott', 'attributes': {'type': 'Cont...
idx = df.index[df.name.notna()]
names = pd.json_normalize(df.name.dropna().tolist())['Name']
names.index = idx
names
0 Kevin
2 Scott
Name: Name, dtype: object
Use apply, and use tolist to make it a list:
print(df['name'].apply(lambda x: x.get('Name')).tolist())
Output:
['Kevin', 'Scott']
If you don't need a list and want a Series, use:
print(df['name'].apply(lambda x: x.get('Name')))
Output:
0 Kevin
1 Scott
Name: name, dtype: object
Update:
print(df['name'].apply(lambda x: x['attributes'].get('Name')).tolist())
Try the following line:
names = [name.get('Name') for name in df['name']]
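The asker's AttributeError comes from NaN/None entries in the column, which the comprehension above would also hit; a None-safe variant (a sketch):

```python
import pandas as pd

df = pd.DataFrame({'name': [{'Name': 'Kevin'}, None, {'Name': 'Scott'}]})

# Guard against missing entries before calling .get
names = [x.get('Name') if isinstance(x, dict) else None for x in df['name']]
```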
