Extracting a scraped list into new columns - python

I have this code (borrowed from an old question posted on this site):
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://www.baseball-reference.com/leagues/MLB/2013-finalyear.shtml")
doc = BeautifulSoup(driver.page_source, "html.parser")
# (The table has an id, which makes it simpler to target)
batting = doc.find(id='misc_batting')
careers = []
for row in batting.find_all('tr')[1:]:
    dictionary = {}
    dictionary['names'] = row.find(attrs={"data-stat": "player"}).text.strip()
    dictionary['experience'] = row.find(attrs={"data-stat": "experience"}).text.strip()
    careers.append(dictionary)
Which generates a result like this:
[{'names': 'David Adams', 'experience': '1'}, {'names': 'Steve Ames', 'experience': '1'}, {'names': 'Rick Ankiel', 'experience': '11'}, {'names': 'Jairo Asencio', 'experience': '4'}, {'names': 'Luis Ayala', 'experience': '9'}, {'names': 'Brandon Bantz', 'experience': '1'}, {'names': 'Scott Barnes', 'experience': '2'}, ...]
How do I turn this into a DataFrame with separate columns, like this?
Names Experience
David Adams 1

You can simplify this quite a bit with pandas. Have it pull the table directly, then keep just the Name and Yrs columns.
import pandas as pd
url = "https://www.baseball-reference.com/leagues/MLB/2013-finalyear.shtml"
df = pd.read_html(url, attrs = {'id': 'misc_batting'})[0]
df_filter = df[['Name','Yrs']]
If you need to rename those columns, add:
df_filter = df_filter.rename(columns={'Name':'names','Yrs':'experience'})
Output:
print(df_filter)
names experience
0 David Adams 1
1 Steve Ames 1
2 Rick Ankiel 11
3 Jairo Asencio 4
4 Luis Ayala 9
.. ... ...
209 Dewayne Wise 11
210 Ross Wolf 3
211 Kevin Youkilis 10
212 Michael Young 14
213 Totals 1357
[214 rows x 2 columns]
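One caveat visible in the output above: the scraped table ends with a league "Totals" row (row 213). If you only want player rows, a small filter like this sketch (shown on a stand-in frame rather than the live page) removes it:

```python
import pandas as pd

# Small stand-in for the read_html result, including the trailing "Totals" row
df = pd.DataFrame({
    'Name': ['David Adams', 'Steve Ames', 'Totals'],
    'Yrs': [1, 1, 1357],
})

# Drop the aggregate row so only player rows remain
df_filter = df[df['Name'] != 'Totals'].reset_index(drop=True)
print(df_filter)
```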

Simply pass your list of dicts (careers) to pandas.DataFrame() to get your expected result.
Example
import pandas as pd
careers = [{'names': 'David Adams', 'experience': '1'}, {'names': 'Steve Ames', 'experience': '1'}, {'names': 'Rick Ankiel', 'experience': '11'}, {'names': 'Jairo Asencio', 'experience': '4'}, {'names': 'Luis Ayala', 'experience': '9'}, {'names': 'Brandon Bantz', 'experience': '1'}, {'names': 'Scott Barnes', 'experience': '2'}]
pd.DataFrame(careers)
Output
           names experience
0    David Adams          1
1     Steve Ames          1
2    Rick Ankiel         11
3  Jairo Asencio          4
4     Luis Ayala          9
5  Brandon Bantz          1
6   Scott Barnes          2
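If you also want the capitalized headers from the question (Names / Experience) and numeric experience values, a small follow-up sketch:

```python
import pandas as pd

careers = [{'names': 'David Adams', 'experience': '1'},
           {'names': 'Steve Ames', 'experience': '1'},
           {'names': 'Rick Ankiel', 'experience': '11'}]

# Build the frame, capitalize the headers, and make experience numeric
df = pd.DataFrame(careers).rename(columns={'names': 'Names',
                                           'experience': 'Experience'})
df['Experience'] = df['Experience'].astype(int)
print(df)
```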

Related

Pandas flattening nested jsons

This is probably going to be a duplicate question, but I'll give it a try since I have not found anything.
I am trying to flatten a JSON with pandas, which should be routine work.
Looking at the examples in the docs, here is the closest example to what I am trying to do:
data = [{'state': 'Florida',
         'shortname': 'FL',
         'info': {'governor': 'Rick Scott'},
         'counties': [{'name': 'Dade', 'population': 12345},
                      {'name': 'Broward', 'population': 40000},
                      {'name': 'Palm Beach', 'population': 60000}]},
        {'state': 'Ohio',
         'shortname': 'OH',
         'info': {'governor': 'John Kasich'},
         'counties': [{'name': 'Summit', 'population': 1234},
                      {'name': 'Cuyahoga', 'population': 1337}]}]
result = pd.json_normalize(data, 'counties', ['state', 'shortname',
                                              ['info', 'governor']])
result
name population state shortname info.governor
0 Dade 12345 Florida FL Rick Scott
1 Broward 40000 Florida FL Rick Scott
2 Palm Beach 60000 Florida FL Rick Scott
3 Summit 1234 Ohio OH John Kasich
4 Cuyahoga 1337 Ohio OH John Kasich
However, this example shows a way to flatten the data inside counties alongside the state and shortname columns.
Let's say I have n columns at the root of each JSON object (n state or shortname columns in the example above). How do I include them all, so that counties is flattened but everything adjacent to it is kept?
First I tried things like these:
#None to treat data as a list of records
#Result of counties is still nested, not working
result = pd.json_normalize(data, None, ['counties'])
or
result = pd.json_normalize(data, None, ['counties', 'name'])
Then I thought of getting the columns with dataframe.columns and reusing them, since the meta argument of json_normalize can take an array of strings.
But I'm stuck: columns also returns nested JSON attributes, which I don't want.
#still nested
cols = pd.json_normalize(data).columns.to_list()
#Exclude it because we already have it
cols = [index for index in cols if index != 'counties']
#remove nested columns if any
cols = [index for index in cols if "." not in index]
result = pd.json_normalize(data, 'counties', cols, errors="ignore")
#still nested
name population state shortname ... other6 other7 counties info.governor
0 Dade 12345 Florida FL ... dumb_data dumb_data [{'name': 'Dade', 'population': 12345}, {'name... NaN
1 Broward 40000 Florida FL ... dumb_data dumb_data [{'name': 'Dade', 'population': 12345}, {'name... NaN
2 Palm Beach 60000 Florida FL ... dumb_data dumb_data [{'name': 'Dade', 'population': 12345}, {'name... NaN
3 Summit 1234 Ohio OH ... dumb_data dumb_data [{'name': 'Summit', 'population': 1234}, {'nam... NaN
4 Cuyahoga 1337 Ohio OH ... dumb_data dumb_data [{'name': 'Summit', 'population': 1234}, {'nam... NaN
I would prefer not to hardcode the column names, since they change, and in this case I have 64 of them...
For better understanding, this is the real kind of data I'm working on, from the Woo REST API. I am not using it here because it is really long, but basically I am trying to flatten line_items, keeping only product_id inside it, and of course all the other columns adjacent to line_items.
Okay, so if you want to flatten a JSON while keeping everything else, you should use pd.DataFrame.explode().
Here is my logic:
import pandas as pd
data = [
{'state': 'Florida',
'shortname': 'FL',
'info': {'governor': 'Rick Scott'},
'counties': [
{'name': 'Dade', 'population': 12345},
{'name': 'Broward', 'population': 40000},
{'name': 'Palm Beach', 'population': 60000}
]
},
{'state': 'Ohio',
'shortname': 'OH',
'info': {'governor': 'John Kasich'},
'counties': [{'name': 'Summit', 'population': 1234},
{'name': 'Cuyahoga', 'population': 1337}]}
]
# No formatting, only converting to a DataFrame
result = pd.json_normalize(data)
#Exploding the wanted nested column
exploded = result.explode('counties')
#Keeping the name only - this can be custom
exploded['countie_name'] = exploded['counties'].apply(lambda x: x['name'])
#Drop the used column since we took what interested us inside it.
exploded = exploded.drop(['counties'], axis=1)
print(exploded)
#Duplicate rows for Florida, as wanted, with different county names
state shortname info.governor countie_name
0 Florida FL Rick Scott Dade
0 Florida FL Rick Scott Broward
0 Florida FL Rick Scott Palm Beach
1 Ohio OH John Kasich Summit
1 Ohio OH John Kasich Cuyahoga
Imagine you have the contents of a basket of products as nested JSON: to explode the contents of the basket while keeping the general basket attributes, you can do this.
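If you want to keep every field inside the exploded dictionaries instead of picking them out one by one with apply, one possible sketch (on the same sample data; the county_ prefix is just an illustrative choice) is to normalize the exploded column and join it back:

```python
import pandas as pd

data = [{'state': 'Florida', 'shortname': 'FL',
         'info': {'governor': 'Rick Scott'},
         'counties': [{'name': 'Dade', 'population': 12345},
                      {'name': 'Broward', 'population': 40000}]},
        {'state': 'Ohio', 'shortname': 'OH',
         'info': {'governor': 'John Kasich'},
         'counties': [{'name': 'Summit', 'population': 1234}]}]

# Flatten the root, explode the nested list, then expand each county dict
exploded = pd.json_normalize(data).explode('counties').reset_index(drop=True)
county_cols = pd.json_normalize(exploded['counties'].tolist()).add_prefix('county_')
result = exploded.drop(columns='counties').join(county_cols)
print(result)
```

This keeps all root columns automatically, however many there are, which avoids hardcoding the 64 column names.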

pandas json normalize not all fields from record path

I am trying to get just some of the fields of a record because I do not want to delete the unwanted columns afterwards, but I can't figure out how to do it. My real JSON has a lot more fields in the "counties" path; this is just an example.
Example JSON
data = [{'state': 'Florida',
         'shortname': 'FL',
         'info': {'governor': 'Rick Scott'},
         'counties': [{'name': 'Dade', 'population': 12345},
                      {'name': 'Broward', 'population': 40000},
                      {'name': 'Palm Beach', 'population': 60000}]},
        {'state': 'Ohio',
         'shortname': 'OH',
         'info': {'governor': 'John Kasich'},
         'counties': [{'name': 'Summit', 'population': 1234},
                      {'name': 'Cuyahoga', 'population': 1337}]}]
json_normalize
result = pd.json_normalize(
    data=data,
    record_path='counties',
    meta=['state', 'shortname',
          ['info', 'governor']])
output
name population state shortname info.governor
0 Dade 12345 Florida FL Rick Scott
1 Broward 40000 Florida FL Rick Scott
2 Palm Beach 60000 Florida FL Rick Scott
3 Summit 1234 Ohio OH John Kasich
4 Cuyahoga 1337 Ohio OH John Kasich
but I do not want "population" in this example; I just want the names of the counties.
I tried all kinds of combinations in the meta attribute.
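As far as I know, json_normalize itself cannot select individual fields from the record path, but you can pre-project the records so the unwanted fields never reach it. A sketch on the sample data (the trimmed name is just illustrative):

```python
import pandas as pd

data = [{'state': 'Florida',
         'shortname': 'FL',
         'info': {'governor': 'Rick Scott'},
         'counties': [{'name': 'Dade', 'population': 12345},
                      {'name': 'Broward', 'population': 40000},
                      {'name': 'Palm Beach', 'population': 60000}]},
        {'state': 'Ohio',
         'shortname': 'OH',
         'info': {'governor': 'John Kasich'},
         'counties': [{'name': 'Summit', 'population': 1234},
                      {'name': 'Cuyahoga', 'population': 1337}]}]

# Pre-project each record so only 'name' ever reaches json_normalize
trimmed = [{**d, 'counties': [{'name': c['name']} for c in d['counties']]}
           for d in data]

result = pd.json_normalize(trimmed, record_path='counties',
                           meta=['state', 'shortname', ['info', 'governor']])
print(result)
```

This way no "population" column is ever created, so nothing has to be deleted afterwards.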

How to transpose column values to headers, while getting values from another column?

I am currently struggling with extracting/flattening data from a hugely nested dictionary: Flattening a nested dictionary with unique keys for each dictionary?.
I received a somewhat acceptable response, but now have problems applying that methodology to another dictionary.
So far I have gotten to the point where I have the following DataFrame.
First I would concatenate the values of "this_should_be_columns" + '_' + "child_column_name" (not a problem).
What I want is for all the unique values in ("this_should_be_columns"_"child_column_name") to become headers, and the rows should be their corresponding values (column "0").
Any ideas/solutions would be much appreciated!
FYI, my dictionary looks as follows:
{'7454':
{'coach':
{'wyId': 562711, 'shortName': 'Name1', 'firstName': 'N1', 'middleName': '', 'lastName': 'N2',
'birthDate': None,
'birthArea':
{'id': 208, 'alpha2code': 'DK', 'alpha3code': 'DNK', 'name': 'Denmark'},
'passportArea':
{'id': 208, 'alpha2code': 'DK', 'alpha3code': 'DNK', 'name': 'Denmark'},
'currentTeamId':
7454, 'gender': 'male', 'status': 'active'}},
'7453':
{'coach':
{'wyId': 56245, 'shortName': 'Name2', 'firstName': 'N3', 'middleName': '', 'lastName': 'N4',
'birthDate': 'yyyy-mm-dd',
'birthArea':
{'id': 208, 'alpha2code': 'DK', 'alpha3code': 'DNK', 'name': 'Denmark'},
'passportArea':
{'id': 208, 'alpha2code': 'DK', 'alpha3code': 'DNK', 'name': 'Denmark'},
'currentTeamId':
7453, 'gender': 'male', 'status': 'active'}}}
The code looks as follows:
df_test = pd.DataFrame(
    pd.Series(responses)
    .apply(pd.Series).stack()
    .apply(pd.Series).stack()
    .apply(pd.Series).stack()
    .apply(pd.Series).stack()
    .apply(pd.Series).stack()
    .apply(pd.Series).stack()
    .apply(pd.Series)
    .reset_index()
    .rename(columns={'level_0': 'teamId', 'level_1': 'type',
                     'level_2': 'this_should_be_columns',
                     'level_3': 'child_column_name',
                     'level_4': 'firstname', 'level_5': 'middleName',
                     'level_6': 'ignore'}))
del df_test['firstname']
del df_test['middleName']
del df_test['ignore']
print(df_test)
The problem is that your dictionaries have a different number of levels. 'birthArea' and 'passportArea' contain dictionaries while the other keys simply contain values. You can use pd.json_normalize() to flatten the keys of the innermost dictionary as described in Flatten nested dictionaries, compressing keys.
In [37]: pd.DataFrame(responses).stack().apply(lambda x: pd.json_normalize(x, sep='_').to_dict(orient='records')[0]).apply(pd.Series).stack().reset_index()
Out[37]:
level_0 level_1 level_2 0
0 coach 7454 wyId 562711
1 coach 7454 shortName Name1
2 coach 7454 firstName N1
3 coach 7454 middleName
4 coach 7454 lastName N2
.. ... ... ... ...
28 coach 7453 birthArea_name Denmark
29 coach 7453 passportArea_id 208
30 coach 7453 passportArea_alpha2code DK
31 coach 7453 passportArea_alpha3code DNK
32 coach 7453 passportArea_name Denmark
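If the goal is the wide layout from the question title (unique keys as headers, one row per team), the stacked result above can be pivoted. A sketch on a hypothetical two-team sample shaped like that output:

```python
import pandas as pd

# Hypothetical long-format sample shaped like the stacked output above
long_df = pd.DataFrame({
    'teamId': ['7454', '7454', '7453', '7453'],
    'col': ['coach_wyId', 'coach_shortName', 'coach_wyId', 'coach_shortName'],
    0: [562711, 'Name1', 56245, 'Name2'],
})

# Pivot: each unique 'col' value becomes a header, one row per teamId
wide = long_df.pivot(index='teamId', columns='col', values=0)
print(wide)
```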

How to apply json_normalize on entire pandas column

I have a dataframe with LISTS (of dicts) as column values. My intention is to normalize the entire column (all rows). I found a way to normalize a single row; however, I'm unable to apply the same function to the entire dataframe or column.
data = {'COLUMN': [ [{'name': 'WAG 01', 'id': '105F', 'state': 'available', 'nodes': 3,'volumes': [{'state': 'available', 'id': '330172', 'name': 'q_-4144d4e'}, {'state': 'available', 'id': '275192', 'name': 'p_3089d821ae', }]}], [{'name': 'FEC 01', 'id': '382E', 'state': 'available', 'nodes': 4,'volumes': [{'state': 'unavailable', 'id': '830172', 'name': 'w_-4144d4e'}, {'state': 'unavailable', 'id': '223192', 'name': 'g_3089d821ae', }]}], [{'name': 'ASD 01', 'id': '303F', 'state': 'available', 'nodes': 6,'volumes': [{'state': 'unavailable', 'id': '930172', 'name': 'e_-4144d4e'}, {'state': 'unavailable', 'id': '245192', 'name': 'h_3089d821ae', }]}] ] }
source_df = pd.DataFrame(data)
source_df looks like below :
As per https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html I managed to get output for one row.
Code to apply for one row:
Target_df = json_normalize(source_df['COLUMN'][0], 'volumes', ['name','id','state','nodes'], record_prefix='volume_')
Output for above code :
I would like to know how we can achieve desired output for the entire column
Expected output:
EDIT:
@lostCode, below is the input with nan and an empty list
You can do:
Target_df=pd.concat([json_normalize(source_df['COLUMN'][key], 'volumes', ['name','id','state','nodes'], record_prefix='volume_') for key in source_df.index]).reset_index(drop=True)
Output:
volume_state volume_id volume_name name id state nodes
0 available 330172 q_-4144d4e WAG 01 105F available 3
1 available 275192 p_3089d821ae WAG 01 105F available 3
2 unavailable 830172 w_-4144d4e FEC 01 382E available 4
3 unavailable 223192 g_3089d821ae FEC 01 382E available 4
4 unavailable 930172 e_-4144d4e ASD 01 303F available 6
5 unavailable 245192 h_3089d821ae ASD 01 303F available 6
concat is used to concatenate a list of dataframes; in this case, the list generated by json_normalize is concatenated over all rows of source_df.
You can check the type of each value of source_df to skip rows that are not lists:
Target_df=pd.concat([json_normalize(source_df['COLUMN'][key], 'volumes', ['name','id','state','nodes'], record_prefix='volume_') for key in source_df.index if isinstance(source_df['COLUMN'][key],list)]).reset_index(drop=True)
Target_df = source_df.apply(json_normalize)
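An alternative sketch that avoids indexing row by row: explode the column of lists first, then make a single json_normalize call (the sample below is a shortened, hypothetical version of the question's data):

```python
import pandas as pd

data = {'COLUMN': [[{'name': 'WAG 01', 'id': '105F', 'state': 'available', 'nodes': 3,
                     'volumes': [{'state': 'available', 'id': '330172', 'name': 'q_-4144d4e'},
                                 {'state': 'available', 'id': '275192', 'name': 'p_3089d821ae'}]}],
                   float('nan')]}
source_df = pd.DataFrame(data)

# Explode the outer lists so each row holds one dict, drop nan rows,
# then normalize everything in a single call
flat = source_df.explode('COLUMN').dropna(subset=['COLUMN'])
target_df = pd.json_normalize(flat['COLUMN'].tolist(), 'volumes',
                              ['name', 'id', 'state', 'nodes'],
                              record_prefix='volume_')
print(target_df)
```

The dropna step also handles the nan and empty-list rows mentioned in the edit.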

Converting pandas JSON rows into separate columns

I have a pandas dataframe where one of the columns is in JSON format. It contains lists of movie production companies for a given title. Below is the sample structure:
ID | production_companies
---------------
1 | "[{'name': 'Paramount Pictures', 'id': 4}, {'name': 'United Artists', 'id': 60}, {'name': 'Metro-Goldwyn-Mayer (MGM)', 'id': 8411}]"
2 | "[{'name': 'Walt Disney Pictures', 'id': 2}]"
3 | "[{'name': 'Bold Films', 'id': 2266}, {'name': 'Blumhouse Productions', 'id': 3172}, {'name': 'Right of Way Films', 'id': 32157}]"
4 | nan
5 | nan
6 | nan
7 | "[{'name': 'Ghost House Pictures', 'id': 768}, {'name': 'North Box Productions', 'id': 22637}]"
As you see, one movie (row) can have multiple production companies. I want to create, for each movie, separate columns containing the names of the producers. Columns should look like name_1, name_2, name_3, etc. If there is no second or third producer, it should be NaN.
I don't have much experience working with JSON formats, and I've tried a few methods (iterators with lambda functions), but they are not even close to what I need.
Therefore I hope for your help guys!
EDIT:
The following code ("movies" is the main database):
from pandas.io.json import json_normalize
companies = list(movies['production_companies'])
json_normalize(companies)
gives me the following error:
AttributeError: 'str' object has no attribute 'values'
Adding on to @Andy's answer to address the OP's question.
This part is by @Andy:
import pandas as pd
import numpy as np
import ast
import itertools
# dummy data
df = pd.DataFrame({
    "ID": [1, 2, 3],
    "production_companies": ["[{'name': 'Paramount Pictures', 'id': 4}, {'name': 'United Artists', 'id': 60}, {'name': 'Metro-Goldwyn-Mayer (MGM)', 'id': 8411}]", np.nan, "[{'name': 'Walt Disney Pictures', 'id': 2}]"]
})
# remove the nans
df.dropna(inplace=True)
# convert the strings into lists
df["production_companies"] = df["production_companies"].apply(lambda x: ast.literal_eval(x))
My additions to answer OP's requirements:
tmp_lst = []
for idx, item in df.groupby(by='ID'):
    # Crediting this part to @Andy
    tmp_df = pd.DataFrame(list(itertools.chain(*item["production_companies"].values.tolist()))).drop(columns='id')
    # Transpose dataframe
    tmp_df = tmp_df.T
    # Add back movie id to tmp_df
    tmp_df['ID'] = item['ID'].values
    # Accumulate tmp_df from all unique movie ids
    tmp_lst.append(tmp_df)
pd.concat(tmp_lst, sort=False)
Result:
0 1 2 ID
name Paramount Pictures United Artists Metro-Goldwyn-Mayer (MGM) 1
name Walt Disney Pictures NaN NaN 3
This should do it
import pandas as pd
import numpy as np
import ast
import itertools
# dummy data
df = pd.DataFrame({
    "ID": [1, 2, 3],
    "production_companies": ["[{'name': 'Paramount Pictures', 'id': 4}, {'name': 'United Artists', 'id': 60}, {'name': 'Metro-Goldwyn-Mayer (MGM)', 'id': 8411}]", np.nan, "[{'name': 'Walt Disney Pictures', 'id': 2}]"]
})
# remove the nans
df.dropna(inplace=True)
# convert the strings into lists
df["production_companies"] = df["production_companies"].apply(lambda x: ast.literal_eval(x))
# flatten the column of lists into a single list, and convert to DataFrame
pd.DataFrame(list(itertools.chain(*df["production_companies"].values.tolist())))
Which yields:
id name
0 4 Paramount Pictures
1 60 United Artists
2 8411 Metro-Goldwyn-Mayer (MGM)
3 2 Walt Disney Pictures
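To get the exact name_1, name_2, ... layout the question asks for, with NaN where a movie has fewer producers, one possible sketch (names_row is a hypothetical helper, not part of the answers above):

```python
import pandas as pd
import numpy as np
import ast

df = pd.DataFrame({
    "ID": [1, 2, 3],
    "production_companies": [
        "[{'name': 'Paramount Pictures', 'id': 4}, {'name': 'United Artists', 'id': 60}]",
        np.nan,
        "[{'name': 'Walt Disney Pictures', 'id': 2}]",
    ],
})

def names_row(cell):
    # Parse the JSON-like string and pull each company's 'name';
    # nan rows contribute no columns, so they end up all-NaN after join
    if pd.isna(cell):
        return pd.Series(dtype=object)
    names = [c['name'] for c in ast.literal_eval(cell)]
    return pd.Series(names, index=[f'name_{i}' for i in range(1, len(names) + 1)])

wide = df[['ID']].join(df['production_companies'].apply(names_row))
print(wide)
```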
