convert dictionary with metadata to a dataframe - python

I have a dictionary like the one below, and I'm looking to convert it to a dataframe with all of its data.
{'content': 'hello i'm from london',
'meta': {'name': 'smith)',
'q_id': 001,
'c_id': '002'}}

IIUC use:
import pandas as pd

d = {'content': 'hello im from london',
     'meta': {'name': 'smith)', 'q_id': '001', 'c_id': '002'}}
df = pd.json_normalize(d)
print(df)
content meta.name meta.q_id meta.c_id
0 hello im from london smith) 001 002
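If the meta. prefix in the column names is unwanted, json_normalize also accepts a sep argument; a small variant of the snippet above (my addition, not part of the original answer):
df = pd.json_normalize(d, sep='_')
print(df)
                 content meta_name meta_q_id meta_c_id
0  hello im from london     smith)       001       002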


Extracting a scraped list into new columns

I have this code (borrowed from an old question posted on this site):
import pandas as pd
import json
import numpy as np
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get("https://www.baseball-reference.com/leagues/MLB/2013-finalyear.shtml")
doc = BeautifulSoup(driver.page_source, "html.parser")

# (The table has an id, which makes it simpler to target)
batting = doc.find(id='misc_batting')

careers = []
for row in batting.find_all('tr')[1:]:
    dictionary = {}
    dictionary['names'] = row.find(attrs={"data-stat": "player"}).text.strip()
    dictionary['experience'] = row.find(attrs={"data-stat": "experience"}).text.strip()
    careers.append(dictionary)
Which generates a result like this:
[{'names': 'David Adams', 'experience': '1'}, {'names': 'Steve Ames', 'experience': '1'}, {'names': 'Rick Ankiel', 'experience': '11'}, {'names': 'Jairo Asencio', 'experience': '4'}, {'names': 'Luis Ayala', 'experience': '9'}, {'names': 'Brandon Bantz', 'experience': '1'}, {'names': 'Scott Barnes', 'experience': '2'}, {'names':
How do I turn this into a dataframe with separate columns, like this?
Names Experience
David Adams 1
You can simplify this quite a bit with pandas: have it pull the table directly, then keep just the Name and Yrs columns.
import pandas as pd
url = "https://www.baseball-reference.com/leagues/MLB/2013-finalyear.shtml"
df = pd.read_html(url, attrs = {'id': 'misc_batting'})[0]
df_filter = df[['Name','Yrs']]
If you need to rename those columns, add:
df_filter = df_filter.rename(columns={'Name':'names','Yrs':'experience'})
Output:
print(df_filter)
names experience
0 David Adams 1
1 Steve Ames 1
2 Rick Ankiel 11
3 Jairo Asencio 4
4 Luis Ayala 9
.. ... ...
209 Dewayne Wise 11
210 Ross Wolf 3
211 Kevin Youkilis 10
212 Michael Young 14
213 Totals 1357
[214 rows x 2 columns]
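One caveat (my addition, not part of the answer): as the output above shows, baseball-reference appends a Totals row at the bottom, which you may want to drop:
# drop the aggregate row appended at the end of the table
df_filter = df_filter[df_filter['names'] != 'Totals'].reset_index(drop=True)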
Simply pass your list of dicts (careers) to pandas.DataFrame() to get your expected result.
Example
import pandas as pd
careers = [{'names': 'David Adams', 'experience': '1'}, {'names': 'Steve Ames', 'experience': '1'}, {'names': 'Rick Ankiel', 'experience': '11'}, {'names': 'Jairo Asencio', 'experience': '4'}, {'names': 'Luis Ayala', 'experience': '9'}, {'names': 'Brandon Bantz', 'experience': '1'}, {'names': 'Scott Barnes', 'experience': '2'}]
pd.DataFrame(careers)
Output
           names experience
0    David Adams          1
1     Steve Ames          1
2    Rick Ankiel         11
3  Jairo Asencio          4
4     Luis Ayala          9
5  Brandon Bantz          1
6   Scott Barnes          2
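Note that experience comes through as strings here; if you need numeric values, a quick follow-up (my addition, not part of the original answer) would be:
df = pd.DataFrame(careers)
# convert the string counts to integers
df['experience'] = pd.to_numeric(df['experience'])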

How to apply json_normalize on entire pandas column

I have a dataframe with lists (of dicts) as column values. My intention is to normalize the entire column (all rows). I found a way to normalize a single row; however, I'm unable to apply the same function to the entire dataframe or column.
data = {'COLUMN': [
    [{'name': 'WAG 01', 'id': '105F', 'state': 'available', 'nodes': 3,
      'volumes': [{'state': 'available', 'id': '330172', 'name': 'q_-4144d4e'},
                  {'state': 'available', 'id': '275192', 'name': 'p_3089d821ae'}]}],
    [{'name': 'FEC 01', 'id': '382E', 'state': 'available', 'nodes': 4,
      'volumes': [{'state': 'unavailable', 'id': '830172', 'name': 'w_-4144d4e'},
                  {'state': 'unavailable', 'id': '223192', 'name': 'g_3089d821ae'}]}],
    [{'name': 'ASD 01', 'id': '303F', 'state': 'available', 'nodes': 6,
      'volumes': [{'state': 'unavailable', 'id': '930172', 'name': 'e_-4144d4e'},
                  {'state': 'unavailable', 'id': '245192', 'name': 'h_3089d821ae'}]}],
]}
source_df = pd.DataFrame(data)
source_df looks like below:
As per https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html I managed to get output for one row.
Code to apply for one row:
Target_df = json_normalize(source_df['COLUMN'][0], 'volumes', ['name','id','state','nodes'], record_prefix='volume_')
Output for the above code:
I would like to know how we can achieve the desired output for the entire column.
Expected output:
EDIT:
@lostCode, below is the input with NaN and an empty list.
You can do:
Target_df = pd.concat(
    [json_normalize(source_df['COLUMN'][key], 'volumes',
                    ['name', 'id', 'state', 'nodes'], record_prefix='volume_')
     for key in source_df.index]
).reset_index(drop=True)
Output:
volume_state volume_id volume_name name id state nodes
0 available 330172 q_-4144d4e WAG 01 105F available 3
1 available 275192 p_3089d821ae WAG 01 105F available 3
2 unavailable 830172 w_-4144d4e FEC 01 382E available 4
3 unavailable 223192 g_3089d821ae FEC 01 382E available 4
4 unavailable 930172 e_-4144d4e ASD 01 303F available 6
5 unavailable 245192 h_3089d821ae ASD 01 303F available 6
concat is used to concatenate a list of dataframes; here it concatenates the dataframes generated by json_normalize for every row of source_df.
You can use isinstance to check the type of each value in source_df and skip rows that are not lists (e.g. NaN or empty values):
Target_df = pd.concat(
    [json_normalize(source_df['COLUMN'][key], 'volumes',
                    ['name', 'id', 'state', 'nodes'], record_prefix='volume_')
     for key in source_df.index
     if isinstance(source_df['COLUMN'][key], list)]
).reset_index(drop=True)
Target_df=source_df.apply(json_normalize)
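As a side note (not from the original answers), on pandas 0.25+ a similar result can be reached by exploding the column first and then calling pd.json_normalize directly; a minimal sketch, assuming source_df as defined above:
import pandas as pd

# explode turns each list element into its own row; dropna skips NaN/empty rows
records = source_df['COLUMN'].explode().dropna()

# normalize the 'volumes' records while carrying the parent fields along
Target_df = pd.json_normalize(
    records.tolist(),
    record_path='volumes',
    meta=['name', 'id', 'state', 'nodes'],
    record_prefix='volume_',
)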

Convert dataframe column values into a list

I have a dataframe named matchdf. It is a huge one, so I'm showing the first 3 rows and columns:
print(matchdf.iloc[:3, :3])
Unnamed: 0 athletesInvolved awayScore
0 0 [{'id': '39037', 'name': 'Azhar Ali', 'shortNa... 0
1 1 [{'id': '568276', 'name': 'Imam-ul-Haq', 'shor... 0
2 2 [{'id': '568276', 'name': 'Imam-ul-Haq', 'shor... 0
I was working with the athletesInvolved column and, as you can see, it contains a list of the form:
print(matchdf['athletesInvolved'][0])
[{'id': '39037', 'name': 'Azhar Ali', 'shortName': 'Azhar Ali', 'displayName': 'Azhar Ali'}, {'id': '17134', 'name': 'Tim Murtagh', 'shortName': 'Murtagh', 'displayName': 'Tim Murtagh'}]
However, the datatype of this object is str as opposed to list. How can we convert it to a list?
We can use ast.literal_eval:
import ast
matchdf['athletesInvolved'] = matchdf['athletesInvolved'].apply(ast.literal_eval)
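If the column can also contain NaN or cells that were already parsed, a guarded variant (a small sketch using the question's matchdf, not part of the original answer) avoids literal_eval raising on non-strings:
import ast

def to_list(value):
    # only parse string cells; pass lists, NaN, etc. through unchanged
    return ast.literal_eval(value) if isinstance(value, str) else value

matchdf['athletesInvolved'] = matchdf['athletesInvolved'].apply(to_list)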

Converting pandas JSON rows into separate columns

I have a pandas dataframe where one of the columns is in JSON format. It contains lists of movie production companies for a given title. Below is the sample structure:
ID | production_companies
---------------
1 | "[{'name': 'Paramount Pictures', 'id': 4}, {'name': 'United Artists', 'id': 60}, {'name': 'Metro-Goldwyn-Mayer (MGM)', 'id': 8411}]"
2 | "[{'name': 'Walt Disney Pictures', 'id': 2}]"
3 | "[{'name': 'Bold Films', 'id': 2266}, {'name': 'Blumhouse Productions', 'id': 3172}, {'name': 'Right of Way Films', 'id': 32157}]"
4 | nan
5 | nan
6 | nan
7 | "[{'name': 'Ghost House Pictures', 'id': 768}, {'name': 'North Box Productions', 'id': 22637}]"
As you can see, one movie (row) can have multiple production companies. I want to create separate columns for each movie containing the producers' names. The columns should look like name_1, name_2, name_3, etc. If there is no second or third producer, the value should be NaN.
I don't have much experience working with JSON formats, and I've tried a few methods (iterators with lambda functions), but they are not even close to what I need.
Therefore I hope for your help, guys!
EDIT:
The following code ("movies" is the main database):
from pandas.io.json import json_normalize
companies = list(movies['production_companies'])
json_normalize(companies)
gives me the following error:
AttributeError: 'str' object has no attribute 'values'
Adding on to @Andy's answer (below) to answer the OP's question.
This part was by @Andy:
import pandas as pd
import numpy as np
import ast
import itertools
# dummy data
df = pd.DataFrame({
    "ID": [1, 2, 3],
    "production_companies": ["[{'name': 'Paramount Pictures', 'id': 4}, {'name': 'United Artists', 'id': 60}, {'name': 'Metro-Goldwyn-Mayer (MGM)', 'id': 8411}]", np.nan, "[{'name': 'Walt Disney Pictures', 'id': 2}]"]
})
# remove the nans
df.dropna(inplace=True)
# convert the strings into lists
df["production_companies"] = df["production_companies"].apply(lambda x: ast.literal_eval(x))
My additions to answer OP's requirements:
tmp_lst = []
for idx, item in df.groupby(by='ID'):
    # Crediting this part to @Andy
    tmp_df = pd.DataFrame(list(itertools.chain(*item["production_companies"].values.tolist()))).drop(columns='id')
    # Transpose dataframe
    tmp_df = tmp_df.T
    # Add back movie id to tmp_df
    tmp_df['ID'] = item['ID'].values
    # Accumulate tmp_df from all unique movie ids
    tmp_lst.append(tmp_df)
pd.concat(tmp_lst, sort=False)
Result:
0 1 2 ID
name Paramount Pictures United Artists Metro-Goldwyn-Mayer (MGM) 1
name Walt Disney Pictures NaN NaN 3
This should do it
import pandas as pd
import numpy as np
import ast
import itertools
# dummy data
df = pd.DataFrame({
    "ID": [1, 2, 3],
    "production_companies": ["[{'name': 'Paramount Pictures', 'id': 4}, {'name': 'United Artists', 'id': 60}, {'name': 'Metro-Goldwyn-Mayer (MGM)', 'id': 8411}]", np.nan, "[{'name': 'Walt Disney Pictures', 'id': 2}]"]
})
# remove the nans
df.dropna(inplace=True)
# convert the strings into lists
df["production_companies"] = df["production_companies"].apply(lambda x: ast.literal_eval(x))
# flatten the column of lists into a single list, and convert to DataFrame
pd.DataFrame(list(itertools.chain(*df["production_companies"].values.tolist())))
Which yields:
id name
0 4 Paramount Pictures
1 60 United Artists
2 8411 Metro-Goldwyn-Mayer (MGM)
3 2 Walt Disney Pictures
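For the exact name_1, name_2, ... layout the question asked for, one extra step helps; a sketch building on the dummy df above after literal_eval (my addition, not from either answer):
# pull out just the company names per movie, then spread them across columns
names = df["production_companies"].apply(lambda lst: [d["name"] for d in lst]).apply(pd.Series)
names.columns = [f"name_{i + 1}" for i in names.columns]
result = df[["ID"]].join(names)
print(result)

   ID                name_1          name_2                     name_3
0   1    Paramount Pictures  United Artists  Metro-Goldwyn-Mayer (MGM)
2   3  Walt Disney Pictures             NaN                        NaN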

Extract values from column of dictionaries using pandas

I am trying to extract the name from the below dictionary:
df = df[[x.get('Name') for x in df['Contact']]]
Below is what my dataframe looks like:
data = [{'emp_id': 101,
         'name': {'Name': 'Kevin',
                  'attributes': {'type': 'Contact',
                                 'url': '/services/data/v38.0/sobjects/Contact/00985300000bt4HEG4'}}},
        {'emp_id': 102,
         'name': {'Name': 'Scott',
                  'attributes': {'type': 'Contact',
                                 'url': '/services/data/v38.0/sobjects/Contact/00985300000yr5UTR9'}}}]
df = pd.DataFrame(data)
df
emp_id name
0 101 {'Name': 'Kevin', 'attributes': {'type': 'Cont...
1 102 {'Name': 'Scott', 'attributes': {'type': 'Cont...
I get an error:
AttributeError: 'NoneType' object has no attribute 'get'
If there are no NaNs, use json_normalize.
pd.io.json.json_normalize(df.name.tolist())['Name']
0 Kevin
1 Scott
Name: Name, dtype: object
If there are NaNs, you will need to drop them first. However, it is easy to retain the indices.
df
emp_id name
0 101.0 {'Name': 'Kevin', 'attributes': {'type': 'Cont...
1 102.0 NaN
2 103.0 {'Name': 'Scott', 'attributes': {'type': 'Cont...
idx = df.index[df.name.notna()]
names = pd.io.json.json_normalize(df.name.dropna().tolist())['Name']
names.index = idx
names
0 Kevin
2 Scott
Name: Name, dtype: object
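Note that on pandas 1.0+ pd.io.json.json_normalize is deprecated; the equivalent call (my addition, the rest of the approach stays the same) is:
names = pd.json_normalize(df['name'].dropna().tolist())['Name']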
Use apply, and use tolist to make it a list:
print(df['name'].apply(lambda x: x.get('Name')).tolist())
Output:
['Kevin', 'Scott']
If you don't need a list and want a Series instead, use:
print(df['name'].apply(lambda x: x.get('Name')))
Output:
0 Kevin
1 Scott
Name: name, dtype: object
Update:
print(df['name'].apply(lambda x: x['attributes'].get('Name')).tolist())
Try the following line:
names = [name.get('Name') for name in df['name']]
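If the AttributeError from the question comes from NaN/None entries in the column, a guarded version of the apply approach (my addition, not in the original answers) skips them:
# return None when the cell is not a dict (e.g. NaN), otherwise pull out 'Name'
df['name'].apply(lambda x: x.get('Name') if isinstance(x, dict) else None)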
