Extracting a scraped list into new columns - python

I have this code (borrowed from an old question posted on this site):
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://www.baseball-reference.com/leagues/MLB/2013-finalyear.shtml")
doc = BeautifulSoup(driver.page_source, "html.parser")
# (The table has an id, which makes it simpler to target)
batting = doc.find(id='misc_batting')
careers = []
for row in batting.find_all('tr')[1:]:
    dictionary = {}
    dictionary['names'] = row.find(attrs={"data-stat": "player"}).text.strip()
    dictionary['experience'] = row.find(attrs={"data-stat": "experience"}).text.strip()
    careers.append(dictionary)
Which generates a result like this:
[{'names': 'David Adams', 'experience': '1'}, {'names': 'Steve Ames', 'experience': '1'}, {'names': 'Rick Ankiel', 'experience': '11'}, {'names': 'Jairo Asencio', 'experience': '4'}, {'names': 'Luis Ayala', 'experience': '9'}, {'names': 'Brandon Bantz', 'experience': '1'}, {'names': 'Scott Barnes', 'experience': '2'}, ...]
How do I turn this into a DataFrame with separate columns, like this?
Names Experience
David Adams 1

You can simplify this quite a bit with pandas. Have it pull the table directly, then keep just the Name and Yrs columns.
import pandas as pd
url = "https://www.baseball-reference.com/leagues/MLB/2013-finalyear.shtml"
df = pd.read_html(url, attrs = {'id': 'misc_batting'})[0]
df_filter = df[['Name','Yrs']]
If you need to rename those columns, add:
df_filter = df_filter.rename(columns={'Name':'names','Yrs':'experience'})
Output:
print(df_filter)
names experience
0 David Adams 1
1 Steve Ames 1
2 Rick Ankiel 11
3 Jairo Asencio 4
4 Luis Ayala 9
.. ... ...
209 Dewayne Wise 11
210 Ross Wolf 3
211 Kevin Youkilis 10
212 Michael Young 14
213 Totals 1357
[214 rows x 2 columns]
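One caveat visible in the output above: the scraped table ends with a league "Totals" row (row 213). If you only want player rows, a small filter like this sketch (shown on a stand-in frame rather than the live page) removes it:

```python
import pandas as pd

# Small stand-in for the read_html result, including the trailing "Totals" row
df = pd.DataFrame({
    'Name': ['David Adams', 'Steve Ames', 'Totals'],
    'Yrs': [1, 1, 1357],
})

# Drop the aggregate row so only player rows remain
df_filter = df[df['Name'] != 'Totals'].reset_index(drop=True)
print(df_filter)
```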

Simply pass your list of dicts (careers) to pandas.DataFrame() to get your expected result.
Example
import pandas as pd
careers = [{'names': 'David Adams', 'experience': '1'}, {'names': 'Steve Ames', 'experience': '1'}, {'names': 'Rick Ankiel', 'experience': '11'}, {'names': 'Jairo Asencio', 'experience': '4'}, {'names': 'Luis Ayala', 'experience': '9'}, {'names': 'Brandon Bantz', 'experience': '1'}, {'names': 'Scott Barnes', 'experience': '2'}]
pd.DataFrame(careers)
Output
           names experience
0    David Adams          1
1     Steve Ames          1
2    Rick Ankiel         11
3  Jairo Asencio          4
4     Luis Ayala          9
5  Brandon Bantz          1
6   Scott Barnes          2
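If you also want the capitalized headers from the question (Names / Experience) and numeric experience values, a small follow-up sketch:

```python
import pandas as pd

careers = [{'names': 'David Adams', 'experience': '1'},
           {'names': 'Steve Ames', 'experience': '1'},
           {'names': 'Rick Ankiel', 'experience': '11'}]

# Build the frame, capitalize the headers, and make experience numeric
df = pd.DataFrame(careers).rename(columns={'names': 'Names',
                                           'experience': 'Experience'})
df['Experience'] = df['Experience'].astype(int)
print(df)
```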

Related

Pandas flattening nested jsons

This is probably going to be a duplicate question, but I'll give it a try since I have not found anything.
I am trying to flatten a JSON with pandas, which should be routine work.
Looking at the examples in the docs, here is the closest example to what I am trying to do:
data = [{'state': 'Florida',
         'shortname': 'FL',
         'info': {'governor': 'Rick Scott'},
         'counties': [{'name': 'Dade', 'population': 12345},
                      {'name': 'Broward', 'population': 40000},
                      {'name': 'Palm Beach', 'population': 60000}]},
        {'state': 'Ohio',
         'shortname': 'OH',
         'info': {'governor': 'John Kasich'},
         'counties': [{'name': 'Summit', 'population': 1234},
                      {'name': 'Cuyahoga', 'population': 1337}]}]
result = pd.json_normalize(data, 'counties', ['state', 'shortname',
                                              ['info', 'governor']])
result
name population state shortname info.governor
0 Dade 12345 Florida FL Rick Scott
1 Broward 40000 Florida FL Rick Scott
2 Palm Beach 60000 Florida FL Rick Scott
3 Summit 1234 Ohio OH John Kasich
4 Cuyahoga 1337 Ohio OH John Kasich
However, this example shows a way to flatten the data inside counties alongside the state and shortname columns.
Let's say I have n columns at the root of each JSON object (n state or shortname columns in the example above). How do I include them all, so that counties is flattened but everything adjacent to it is kept?
First I tried things like these:
#None to treat data as a list of records
#Result of counties is still nested, not working
result = pd.json_normalize(data, None, ['counties'])
or
result = pd.json_normalize(data, None, ['counties', 'name'])
Then I thought of getting the columns with dataframe.columns and reusing them, since the meta argument of json_normalize can take an array of strings.
But I'm stuck: columns also returns nested JSON attributes, which I don't want.
#still nested
cols = pd.json_normalize(data).columns.to_list()
#Exclude it because we already have it
cols = [index for index in cols if index != 'counties']
#remove nested columns if any
cols = [index for index in cols if "." not in index]
result = pd.json_normalize(data, 'counties', cols, errors="ignore")
#still nested
name population state shortname ... other6 other7 counties info.governor
0 Dade 12345 Florida FL ... dumb_data dumb_data [{'name': 'Dade', 'population': 12345}, {'name... NaN
1 Broward 40000 Florida FL ... dumb_data dumb_data [{'name': 'Dade', 'population': 12345}, {'name... NaN
2 Palm Beach 60000 Florida FL ... dumb_data dumb_data [{'name': 'Dade', 'population': 12345}, {'name... NaN
3 Summit 1234 Ohio OH ... dumb_data dumb_data [{'name': 'Summit', 'population': 1234}, {'nam... NaN
4 Cuyahoga 1337 Ohio OH ... dumb_data dumb_data [{'name': 'Summit', 'population': 1234}, {'nam... NaN
I would prefer not to hardcode the column names, since they change, and in this case I have 64 of them...
For better understanding, this is the real kind of data I'm working on, from the Woo REST API. I am not using it here because it is really long, but basically I am trying to flatten line_items, keeping only product_id inside it, and of course all the other columns adjacent to line_items.
Okay, so if you want to flatten a JSON while keeping everything else, you should use pd.DataFrame.explode().
Here is my logic:
import pandas as pd
data = [
{'state': 'Florida',
'shortname': 'FL',
'info': {'governor': 'Rick Scott'},
'counties': [
{'name': 'Dade', 'population': 12345},
{'name': 'Broward', 'population': 40000},
{'name': 'Palm Beach', 'population': 60000}
]
},
{'state': 'Ohio',
'shortname': 'OH',
'info': {'governor': 'John Kasich'},
'counties': [{'name': 'Summit', 'population': 1234},
{'name': 'Cuyahoga', 'population': 1337}]}
]
# No formatting, only converting to a DataFrame
result = pd.json_normalize(data)
#Exploding the wanted nested column
exploded = result.explode('counties')
#Keeping the name only - this can be custom
exploded['countie_name'] = exploded['counties'].apply(lambda x: x['name'])
#Drop the used column since we took what interested us inside it.
exploded = exploded.drop(['counties'], axis=1)
print(exploded)
#Duplicate rows for Florida, as wanted, with different county names
state shortname info.governor countie_name
0 Florida FL Rick Scott Dade
0 Florida FL Rick Scott Broward
0 Florida FL Rick Scott Palm Beach
1 Ohio OH John Kasich Summit
1 Ohio OH John Kasich Cuyahoga
Imagine you have the contents of a basket of products as nested JSON: to explode the contents of the basket while keeping the general basket attributes, you can do this.
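If you want to keep every field inside the exploded dictionaries instead of picking them out one by one with apply, one possible sketch (on the same sample data; the county_ prefix is just an illustrative choice) is to normalize the exploded column and join it back:

```python
import pandas as pd

data = [{'state': 'Florida', 'shortname': 'FL',
         'info': {'governor': 'Rick Scott'},
         'counties': [{'name': 'Dade', 'population': 12345},
                      {'name': 'Broward', 'population': 40000}]},
        {'state': 'Ohio', 'shortname': 'OH',
         'info': {'governor': 'John Kasich'},
         'counties': [{'name': 'Summit', 'population': 1234}]}]

# Flatten the root, explode the nested list, then expand each county dict
exploded = pd.json_normalize(data).explode('counties').reset_index(drop=True)
county_cols = pd.json_normalize(exploded['counties'].tolist()).add_prefix('county_')
result = exploded.drop(columns='counties').join(county_cols)
print(result)
```

This keeps all root columns automatically, however many there are, which avoids hardcoding the 64 column names.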

pandas json normalize not all fields from record path

I am trying to get just some of the fields of a record because I do not want to delete the unwanted columns afterwards, but I can't figure out how to do it. My real JSON has a lot more fields in the "counties" path; this is just an example.
Example JSON
data = [{'state': 'Florida',
         'shortname': 'FL',
         'info': {'governor': 'Rick Scott'},
         'counties': [{'name': 'Dade', 'population': 12345},
                      {'name': 'Broward', 'population': 40000},
                      {'name': 'Palm Beach', 'population': 60000}]},
        {'state': 'Ohio',
         'shortname': 'OH',
         'info': {'governor': 'John Kasich'},
         'counties': [{'name': 'Summit', 'population': 1234},
                      {'name': 'Cuyahoga', 'population': 1337}]}]
json_normalize
result = pd.json_normalize(
    data=data,
    record_path='counties',
    meta=['state', 'shortname',
          ['info', 'governor']])
output
name population state shortname info.governor
0 Dade 12345 Florida FL Rick Scott
1 Broward 40000 Florida FL Rick Scott
2 Palm Beach 60000 Florida FL Rick Scott
3 Summit 1234 Ohio OH John Kasich
4 Cuyahoga 1337 Ohio OH John Kasich
but I do not want "population" in this example; I just want the names of the counties.
I tried all kinds of combinations in the meta attribute.
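As far as I know, json_normalize itself cannot select individual fields from the record path, but you can pre-project the records so the unwanted fields never reach it. A sketch on the sample data (the trimmed name is just illustrative):

```python
import pandas as pd

data = [{'state': 'Florida',
         'shortname': 'FL',
         'info': {'governor': 'Rick Scott'},
         'counties': [{'name': 'Dade', 'population': 12345},
                      {'name': 'Broward', 'population': 40000},
                      {'name': 'Palm Beach', 'population': 60000}]},
        {'state': 'Ohio',
         'shortname': 'OH',
         'info': {'governor': 'John Kasich'},
         'counties': [{'name': 'Summit', 'population': 1234},
                      {'name': 'Cuyahoga', 'population': 1337}]}]

# Pre-project each record so only 'name' ever reaches json_normalize
trimmed = [{**d, 'counties': [{'name': c['name']} for c in d['counties']]}
           for d in data]

result = pd.json_normalize(trimmed, record_path='counties',
                           meta=['state', 'shortname', ['info', 'governor']])
print(result)
```

This way no "population" column is ever created, so nothing has to be deleted afterwards.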

How to transpose column values to headers, while getting values from another column?

I am currently struggling with extracting/flattening data from a hugely nested dictionary: Flattening a nested dictionary with unique keys for each dictionary?.
I received a somewhat acceptable response, but now have problems applying that methodology to another dictionary.
So far I have gotten to the point where I have the following DataFrame.
First I would concatenate the values of "this_should_be_columns" + '_' + "child_column_name" (not a problem).
What I want is for all the unique values in ("this_should_be_columns"_"child_column_name") to become headers, and the rows should be their corresponding values (column "0").
Any ideas/solutions would be much appreciated!
FYI, my dictionary looks as follows:
{'7454':
{'coach':
{'wyId': 562711, 'shortName': 'Name1', 'firstName': 'N1', 'middleName': '', 'lastName': 'N2',
'birthDate': None,
'birthArea':
{'id': 208, 'alpha2code': 'DK', 'alpha3code': 'DNK', 'name': 'Denmark'},
'passportArea':
{'id': 208, 'alpha2code': 'DK', 'alpha3code': 'DNK', 'name': 'Denmark'},
'currentTeamId':
7454, 'gender': 'male', 'status': 'active'}},
'7453':
{'coach':
{'wyId': 56245, 'shortName': 'Name2', 'firstName': 'N3', 'middleName': '', 'lastName': 'N4',
'birthDate': 'yyyy-mm-dd',
'birthArea':
{'id': 208, 'alpha2code': 'DK', 'alpha3code': 'DNK', 'name': 'Denmark'},
'passportArea':
{'id': 208, 'alpha2code': 'DK', 'alpha3code': 'DNK', 'name': 'Denmark'},
'currentTeamId':
7453, 'gender': 'male', 'status': 'active'}}}
The code looks as follows:
df_test = pd.DataFrame(
    pd.Series(responses)
    .apply(pd.Series).stack()
    .apply(pd.Series).stack()
    .apply(pd.Series).stack()
    .apply(pd.Series).stack()
    .apply(pd.Series).stack()
    .apply(pd.Series).stack()
    .apply(pd.Series)
    .reset_index()
    .rename(columns={'level_0': 'teamId', 'level_1': 'type',
                     'level_2': 'this_should_be_columns',
                     'level_3': 'child_column_name',
                     'level_4': 'firstname', 'level_5': 'middleName',
                     'level_6': 'ignore'}))
del df_test['firstname']
del df_test['middleName']
del df_test['ignore']
print(df_test)
The problem is that your dictionaries have a different number of levels. 'birthArea' and 'passportArea' contain dictionaries while the other keys simply contain values. You can use pd.json_normalize() to flatten the keys of the innermost dictionary as described in Flatten nested dictionaries, compressing keys.
In [37]: pd.DataFrame(responses).stack().apply(lambda x: pd.json_normalize(x, sep='_').to_dict(orient='records')[0]).apply(pd.Series).stack().reset_index()
Out[37]:
level_0 level_1 level_2 0
0 coach 7454 wyId 562711
1 coach 7454 shortName Name1
2 coach 7454 firstName N1
3 coach 7454 middleName
4 coach 7454 lastName N2
.. ... ... ... ...
28 coach 7453 birthArea_name Denmark
29 coach 7453 passportArea_id 208
30 coach 7453 passportArea_alpha2code DK
31 coach 7453 passportArea_alpha3code DNK
32 coach 7453 passportArea_name Denmark
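If the goal is the wide layout from the question title (unique keys as headers, one row per team), the stacked result above can be pivoted. A sketch on a hypothetical two-team sample shaped like that output:

```python
import pandas as pd

# Hypothetical long-format sample shaped like the stacked output above
long_df = pd.DataFrame({
    'teamId': ['7454', '7454', '7453', '7453'],
    'col': ['coach_wyId', 'coach_shortName', 'coach_wyId', 'coach_shortName'],
    0: [562711, 'Name1', 56245, 'Name2'],
})

# Pivot: each unique 'col' value becomes a header, one row per teamId
wide = long_df.pivot(index='teamId', columns='col', values=0)
print(wide)
```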

How to apply json_normalize on entire pandas column

I have a dataframe with LISTS (of dicts) as column values. My intention is to normalize the entire column (all rows). I found a way to normalize a single row; however, I'm unable to apply the same function to the entire dataframe or column.
data = {'COLUMN': [ [{'name': 'WAG 01', 'id': '105F', 'state': 'available', 'nodes': 3,'volumes': [{'state': 'available', 'id': '330172', 'name': 'q_-4144d4e'}, {'state': 'available', 'id': '275192', 'name': 'p_3089d821ae', }]}], [{'name': 'FEC 01', 'id': '382E', 'state': 'available', 'nodes': 4,'volumes': [{'state': 'unavailable', 'id': '830172', 'name': 'w_-4144d4e'}, {'state': 'unavailable', 'id': '223192', 'name': 'g_3089d821ae', }]}], [{'name': 'ASD 01', 'id': '303F', 'state': 'available', 'nodes': 6,'volumes': [{'state': 'unavailable', 'id': '930172', 'name': 'e_-4144d4e'}, {'state': 'unavailable', 'id': '245192', 'name': 'h_3089d821ae', }]}] ] }
source_df = pd.DataFrame(data)
source_df looks like below :
As per https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html I managed to get output for one row.
Code to apply for one row:
Target_df = json_normalize(source_df['COLUMN'][0], 'volumes', ['name','id','state','nodes'], record_prefix='volume_')
Output for above code :
I would like to know how we can achieve desired output for the entire column
Expected output:
EDIT:
@lostCode, below is the input with nan and an empty list
You can do:
Target_df=pd.concat([json_normalize(source_df['COLUMN'][key], 'volumes', ['name','id','state','nodes'], record_prefix='volume_') for key in source_df.index]).reset_index(drop=True)
Output:
volume_state volume_id volume_name name id state nodes
0 available 330172 q_-4144d4e WAG 01 105F available 3
1 available 275192 p_3089d821ae WAG 01 105F available 3
2 unavailable 830172 w_-4144d4e FEC 01 382E available 4
3 unavailable 223192 g_3089d821ae FEC 01 382E available 4
4 unavailable 930172 e_-4144d4e ASD 01 303F available 6
5 unavailable 245192 h_3089d821ae ASD 01 303F available 6
concat is used to concatenate a list of dataframes; in this case, the list generated by json_normalize is concatenated over all rows of source_df.
You can check the type of each value of source_df to skip rows that are not lists:
Target_df=pd.concat([json_normalize(source_df['COLUMN'][key], 'volumes', ['name','id','state','nodes'], record_prefix='volume_') for key in source_df.index if isinstance(source_df['COLUMN'][key],list)]).reset_index(drop=True)
Target_df = source_df.apply(json_normalize)
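An alternative sketch that avoids indexing row by row: explode the column of lists first, then make a single json_normalize call (the sample below is a shortened, hypothetical version of the question's data):

```python
import pandas as pd

data = {'COLUMN': [[{'name': 'WAG 01', 'id': '105F', 'state': 'available', 'nodes': 3,
                     'volumes': [{'state': 'available', 'id': '330172', 'name': 'q_-4144d4e'},
                                 {'state': 'available', 'id': '275192', 'name': 'p_3089d821ae'}]}],
                   float('nan')]}
source_df = pd.DataFrame(data)

# Explode the outer lists so each row holds one dict, drop nan rows,
# then normalize everything in a single call
flat = source_df.explode('COLUMN').dropna(subset=['COLUMN'])
target_df = pd.json_normalize(flat['COLUMN'].tolist(), 'volumes',
                              ['name', 'id', 'state', 'nodes'],
                              record_prefix='volume_')
print(target_df)
```

The dropna step also handles the nan and empty-list rows mentioned in the edit.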

Converting pandas JSON rows into separate columns

I have a pandas dataframe where one of the columns is in JSON format. It contains lists of movie production companies for a given title. Below is the sample structure:
ID | production_companies
---------------
1 | "[{'name': 'Paramount Pictures', 'id': 4}, {'name': 'United Artists', 'id': 60}, {'name': 'Metro-Goldwyn-Mayer (MGM)', 'id': 8411}]"
2 | "[{'name': 'Walt Disney Pictures', 'id': 2}]"
3 | "[{'name': 'Bold Films', 'id': 2266}, {'name': 'Blumhouse Productions', 'id': 3172}, {'name': 'Right of Way Films', 'id': 32157}]"
4 | nan
5 | nan
6 | nan
7 | "[{'name': 'Ghost House Pictures', 'id': 768}, {'name': 'North Box Productions', 'id': 22637}]"
As you see, one movie (row) can have multiple production companies. I want to create, for each movie, separate columns containing the names of the producers. Columns should look like name_1, name_2, name_3, etc. If there is no second or third producer, it should be NaN.
I don't have much experience working with JSON formats, and I've tried a few methods (iterators with lambda functions), but they are not even close to what I need.
Therefore I hope for your help guys!
EDIT:
The following code ("movies" is the main database):
from pandas.io.json import json_normalize
companies = list(movies['production_companies'])
json_normalize(companies)
gives me the following error:
AttributeError: 'str' object has no attribute 'values'
Adding on to @Andy's answer to address the OP's question.
This part is by @Andy:
import pandas as pd
import numpy as np
import ast
import itertools
# dummy data
df = pd.DataFrame({
    "ID": [1, 2, 3],
    "production_companies": ["[{'name': 'Paramount Pictures', 'id': 4}, {'name': 'United Artists', 'id': 60}, {'name': 'Metro-Goldwyn-Mayer (MGM)', 'id': 8411}]", np.nan, "[{'name': 'Walt Disney Pictures', 'id': 2}]"]
})
# remove the nans
df.dropna(inplace=True)
# convert the strings into lists
df["production_companies"] = df["production_companies"].apply(lambda x: ast.literal_eval(x))
My additions to answer OP's requirements:
tmp_lst = []
for idx, item in df.groupby(by='ID'):
    # Crediting this part to @Andy
    tmp_df = pd.DataFrame(list(itertools.chain(*item["production_companies"].values.tolist()))).drop(columns='id')
    # Transpose dataframe
    tmp_df = tmp_df.T
    # Add back movie id to tmp_df
    tmp_df['ID'] = item['ID'].values
    # Accumulate tmp_df from all unique movie ids
    tmp_lst.append(tmp_df)
pd.concat(tmp_lst, sort=False)
Result:
0 1 2 ID
name Paramount Pictures United Artists Metro-Goldwyn-Mayer (MGM) 1
name Walt Disney Pictures NaN NaN 3
This should do it
import pandas as pd
import numpy as np
import ast
import itertools
# dummy data
df = pd.DataFrame({
    "ID": [1, 2, 3],
    "production_companies": ["[{'name': 'Paramount Pictures', 'id': 4}, {'name': 'United Artists', 'id': 60}, {'name': 'Metro-Goldwyn-Mayer (MGM)', 'id': 8411}]", np.nan, "[{'name': 'Walt Disney Pictures', 'id': 2}]"]
})
# remove the nans
df.dropna(inplace=True)
# convert the strings into lists
df["production_companies"] = df["production_companies"].apply(lambda x: ast.literal_eval(x))
# flatten the column of lists into a single list, and convert to DataFrame
pd.DataFrame(list(itertools.chain(*df["production_companies"].values.tolist())))
Which yields:
id name
0 4 Paramount Pictures
1 60 United Artists
2 8411 Metro-Goldwyn-Mayer (MGM)
3 2 Walt Disney Pictures
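To get the exact name_1, name_2, ... layout the question asks for, with NaN where a movie has fewer producers, one possible sketch (names_row is a hypothetical helper, not part of the answers above):

```python
import pandas as pd
import numpy as np
import ast

df = pd.DataFrame({
    "ID": [1, 2, 3],
    "production_companies": [
        "[{'name': 'Paramount Pictures', 'id': 4}, {'name': 'United Artists', 'id': 60}]",
        np.nan,
        "[{'name': 'Walt Disney Pictures', 'id': 2}]",
    ],
})

def names_row(cell):
    # Parse the JSON-like string and pull each company's 'name';
    # nan rows contribute no columns, so they end up all-NaN after join
    if pd.isna(cell):
        return pd.Series(dtype=object)
    names = [c['name'] for c in ast.literal_eval(cell)]
    return pd.Series(names, index=[f'name_{i}' for i in range(1, len(names) + 1)])

wide = df[['ID']].join(df['production_companies'].apply(names_row))
print(wide)
```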
