Web scraping a table through multiple pages with a single link - python
I am trying to web scrape a table on a webpage as part of an assignment using Python. I want to scrape all 618 records of the table which are scattered across 13 pages in the same URL. However, my program only scrapes the first page of the table and its records. The URL is in my code, which can be found below:
from bs4 import BeautifulSoup as bs
import requests as r
base_URL = 'https://www.nba.com/players'
def scrape_webpage(URL):
    player_names = []
    page = r.get(URL)
    print(f'{page.status_code}')
    soup = bs(page.content, 'html.parser')
    raw_player_names = soup.find_all('div', class_='flex flex-col lg:flex-row')
    for name in raw_player_names:
        player_names.append(name.get_text().strip())
    print(player_names)

scrape_webpage(base_URL)
The player data is embedded as JSON inside a <script> element in the page, so one request is enough to get every record. You can decode it with this example:
import re
import json

import requests
import pandas as pd

url = "https://www.nba.com/players"

# the full player list is serialized as JSON inside a <script> tag;
# grab the {"props": ...} object with a regex and parse it
data = re.search(r'({"props":.*})', requests.get(url).text).group(0)
data = json.loads(data)

# uncomment to print all data:
# print(json.dumps(data, indent=4))

df = pd.DataFrame(data["props"]["pageProps"]["players"])
print(df.head().to_markdown())
Prints:
|    | PERSON_ID | PLAYER_LAST_NAME | PLAYER_FIRST_NAME | PLAYER_SLUG | TEAM_ID | TEAM_SLUG | IS_DEFUNCT | TEAM_CITY | TEAM_NAME | TEAM_ABBREVIATION | JERSEY_NUMBER | POSITION | HEIGHT | WEIGHT | COLLEGE | COUNTRY | DRAFT_YEAR | DRAFT_ROUND | DRAFT_NUMBER | ROSTER_STATUS | FROM_YEAR | TO_YEAR | PTS | REB | AST | STATS_TIMEFRAME | PLAYER_LAST_INITIAL | HISTORIC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1630173 | Achiuwa | Precious | precious-achiuwa | 1610612761 | raptors | 0 | Toronto | Raptors | TOR | 5 | F | 6-8 | 225 | Memphis | Nigeria | 2020 | 1 | 20 | 1 | 2020 | 2021 | 9.1 | 6.5 | 1.1 | Season | A | False |
| 1 | 203500 | Adams | Steven | steven-adams | 1610612763 | grizzlies | 0 | Memphis | Grizzlies | MEM | 4 | C | 6-11 | 265 | Pittsburgh | New Zealand | 2013 | 1 | 12 | 1 | 2013 | 2021 | 6.9 | 10 | 3.4 | Season | A | False |
| 2 | 1628389 | Adebayo | Bam | bam-adebayo | 1610612748 | heat | 0 | Miami | Heat | MIA | 13 | C-F | 6-9 | 255 | Kentucky | USA | 2017 | 1 | 14 | 1 | 2017 | 2021 | 19.1 | 10.1 | 3.4 | Season | A | False |
| 3 | 1630583 | Aldama | Santi | santi-aldama | 1610612763 | grizzlies | 0 | Memphis | Grizzlies | MEM | 7 | F-C | 6-11 | 215 | Loyola-Maryland | Spain | 2021 | 1 | 30 | 1 | 2021 | 2021 | 4.1 | 2.7 | 0.7 | Season | A | False |
| 4 | 200746 | Aldridge | LaMarcus | lamarcus-aldridge | 1610612751 | nets | 0 | Brooklyn | Nets | BKN | 21 | C-F | 6-11 | 250 | Texas-Austin | USA | 2006 | 1 | 2 | 1 | 2006 | 2021 | 12.9 | 5.5 | 0.9 | Season | A | False |
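Since the JSON payload contains the whole roster, there is no need to iterate over the 13 pages. As a quick follow-up (a sketch continuing from the df built above; 'players.csv' is just an example filename):

# continuing from the df built in the answer above
print(len(df))                         # should report all records (618 per the question)
df.to_csv('players.csv', index=False)  # export the full table in one go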
Related
Scrapy: unable to locate table or scrape data in table
For a group project, I am trying to scrape the Salaries table within https://www.basketball-reference.com/players/a/allenra02.html. I have tried multiple CSS and XPath selectors, such as:

#all_salaries > tbody > tr:nth-child(1)
#all_salaries > tbody
#all_salaries > tbody > tr:nth-child(1) > td.right
#all_salaries
//*[@id="all_salaries"]/tbody/tr[1]/td[3]
//*[@id="all_salaries"]/tbody
//*[@id="all_salaries"]

The code looks as follows:

def start_requests(self):
    start_urls = ['https://www.basketball-reference.com/players/a/allenra02.html']
    for url in start_urls:
        yield scrapy.Request(url=url, callback=self.parse_season)

def parse_player(self, response):
    response.css('#all_salaries > tbody')

I tried printing it out, but it keeps returning an empty list. Other tables seem fine, except this one.

EDIT: My final solution looks something like:

regex = re.compile(r'<!--(.*)-->', re.DOTALL)
salaries = response.xpath('//*[@id="all_all_salaries"]/comment()').get()
if salaries:
    salaries = response.xpath('//*[@id="all_all_salaries"]/comment()').re(regex)[0]
    salaries_sel = scrapy.Selector(text=salaries, type="html")
    all_salaries = salaries_sel.css('#all_salaries > tbody > tr').extract()
You can use BeautifulSoup to pull out the comments, then parse the table with pandas. I chose to only pull out the salary table, but you can get all the tables in the comments this way.

import requests
from bs4 import BeautifulSoup, Comment
import pandas as pd

url = "https://www.basketball-reference.com/players/a/allenra02.html"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# the salary table is inside an HTML comment, so collect all comment nodes
comments = soup.find_all(string=lambda text: isinstance(text, Comment))

tables = []
for each in comments:
    if 'table' in str(each):
        try:
            tables.append(pd.read_html(str(each), attrs={'id': 'all_salaries'})[0])
            break
        except:
            continue

print(tables[0].to_string())

Output:

     Season                 Team   Lg        Salary
0   1996-97      Milwaukee Bucks  NBA    $1,785,000
1   1997-98      Milwaukee Bucks  NBA    $2,052,360
2   1998-99      Milwaukee Bucks  NBA    $2,320,000
3   1999-00      Milwaukee Bucks  NBA    $9,000,000
4   2000-01      Milwaukee Bucks  NBA   $10,130,000
5   2001-02      Milwaukee Bucks  NBA   $11,250,000
6   2002-03      Milwaukee Bucks  NBA   $12,375,000
7   2003-04  Seattle SuperSonics  NBA   $13,500,000
8   2004-05  Seattle SuperSonics  NBA   $14,625,000
9   2005-06  Seattle SuperSonics  NBA   $13,223,140
10  2006-07  Seattle SuperSonics  NBA   $14,611,570
11  2007-08       Boston Celtics  NBA   $16,000,000
12  2008-09       Boston Celtics  NBA   $18,388,430
13  2009-10       Boston Celtics  NBA   $18,776,860
14  2010-11       Boston Celtics  NBA   $10,000,000
15  2011-12       Boston Celtics  NBA   $10,000,000
16  2012-13           Miami Heat  NBA    $3,090,000
17  2013-14           Miami Heat  NBA    $3,229,050
18   Career  (may be incomplete)  NaN  $184,356,410
It's because that table is actually commented out in the original source code and later added via JavaScript. Have a look here for how to get the comment contents: Scrapy: Extract commented (hidden) content
How to exclude certain rows in a table using BeautifulSoup?
The code works fine; however, the table at the URL I'm fetching seems to have its headers repeated throughout the body. I'm not sure how to deal with this and remove those rows, and I need to, because I'm trying to load the data into BigQuery and certain characters aren't allowed.

URL = 'https://www.basketball-reference.com/leagues/NBA_2020_games-august.html'

driver = webdriver.Chrome(chrome_options=chrome_options)
driver.get(URL)
soup = BeautifulSoup(driver.page_source, 'html')
driver.quit()

tables = soup.find_all('table', {"id": ["schedule"]})
table = tables[0]
tab_data = [[cell.text for cell in row.find_all(["th", "td"])] for row in table.find_all("tr")]

json_string = ''
headers = [col.replace('.', '_').replace('/', '_').replace('%', 'pct').replace('3', '_3').replace('(', '_').replace(')', '_') for col in tab_data[1]]
for row in tab_data[2:]:
    json_string += json.dumps(dict(zip(headers, row))) + '\n'

with open('example.json', 'w') as f:
    f.write(json_string)

print(json_string)
You can restrict the tr rows to those with no class, so that you don't pick up the duplicate headers (on this page the repeated header rows carry a class attribute, while data rows don't). The following code creates a dataframe from the table:

from bs4 import BeautifulSoup
import requests
import pandas as pd

res = requests.get("https://www.basketball-reference.com/leagues/NBA_2020_games-august.html")
soup = BeautifulSoup(res.text, "html.parser")
table = soup.find("div", {"id": "div_schedule"}).find("table")
columns = [i.get_text() for i in table.find("thead").find_all('th')]

data = []
# class_=False keeps only rows without a class attribute, i.e. skips the repeated headers
for tr in table.find('tbody').find_all('tr', class_=False):
    temp = [tr.find('th').get_text(strip=True)]
    temp.extend([i.get_text(strip=True) for i in tr.find_all("td")])
    data.append(temp)

df = pd.DataFrame(data, columns=columns)
print(df)

Output:

Date Start (ET) Visitor/Neutral PTS Home/Neutral PTS ... Attend. Notes
0 Sat, Aug 1, 2020 1:00p Miami Heat 125 Denver Nuggets 105 Box Score
1 Sat, Aug 1, 2020 3:30p Utah Jazz 94 Oklahoma City Thunder 110 Box Score
2 Sat, Aug 1, 2020 6:00p New Orleans Pelicans 103 Los Angeles Clippers 126 Box Score
3 Sat, Aug 1, 2020 7:00p Philadelphia 76ers 121 Indiana Pacers 127 Box Score
4 Sat, Aug 1, 2020 8:30p Los Angeles Lakers 92 Toronto Raptors 107 Box Score
.. ... ... ... ... ... ... ...
75 Thu, Aug 13, 2020 Portland Trail Blazers Brooklyn Nets
76 Fri, Aug 14, 2020 Philadelphia 76ers Houston Rockets
77 Fri, Aug 14, 2020 Miami Heat Indiana Pacers
78 Fri, Aug 14, 2020 Oklahoma City Thunder Los Angeles Clippers
79 Fri, Aug 14, 2020 Denver Nuggets Toronto Raptors

[80 rows x 10 columns]

To insert into BigQuery, you can either load the JSON directly or upload the dataframe using DataFrame.to_gbq (see the sketch below): https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_gbq.html
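For the BigQuery step, here is a minimal sketch using DataFrame.to_gbq from the link above (it requires the pandas-gbq package; the destination table and project id are placeholders, not from the original answer):

# continuing from the df built above; table name and project id are hypothetical
df.to_gbq('nba.schedule_2020_august', project_id='my-gcp-project', if_exists='replace')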
scraping basketball results and associate related competition to each match
I want to scrape basketball results from this webpage: http://www.nowgoal.group/nba/Schedule.aspx?f=ft2&date=2020-07-29

I created the code using bs4 and requests:

url = 'http://www.nowgoal.group/nba/Schedule.aspx?f=ft2&date=2020-07-29'

with requests.Session() as session:
    session.headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:76.0) Gecko/20100101 Firefox/76.0'}
    r = session.get(url, timeout=30)
    soup = BeautifulSoup(r.content, 'html.parser')

The issue I face is how to attach the competition to each row I scrape. I want to create a table in which each row is a match result (competition, home team, away team, score...).
Selenium

Try this (selenium):

import time

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver

res = []
url = 'http://www.nowgoal.group/nba/Schedule.aspx?f=ft2&date=2020-07-29'

driver = webdriver.Firefox(executable_path='c:/program/geckodriver.exe')
driver.get(url)
time.sleep(2)
page = driver.page_source
driver.close()

soup = BeautifulSoup(page, 'html.parser')
span = soup.select_one('span#live')
tables = span.select('table')

for table in tables:
    if table.get('class'):
        # a table with a class attribute is a competition header
        competition = table.select_one('a b font').text
    else:
        # match tables hold pairs of tr.b1 rows: home team, then away team
        for home, away in zip(table.select('tr.b1')[0::2], table.select('tr.b1')[1::2]):
            res.append([competition,
                        home.select_one('td a').text,
                        away.select_one('td a').text,
                        home.select_one('td.red').text,
                        away.select_one('td.red').text,
                        home.select_one('td.odds1').text,
                        away.select_one('td.odds1').text,
                        f"{home.select('td font')[0].text}/{home.select('td font')[1].text}",
                        f"{away.select('td font')[0].text}/{away.select('td font')[1].text}",
                        home.select('td div a')[-1].get('href')])

df = pd.DataFrame(res, columns=['competition', 'home', 'away', 'home score', 'away score',
                                'home odds', 'away odds', 'home ht', 'away ht', 'odds'])
print(df.to_string())
df.to_csv('Res.csv')

prints:

competition home away home score away score home odds away odds home ht away ht odds
0 National Basketball Association Portland Trail Blazers Oklahoma City Thunder 120 131 2.72 1.45 50/70 63/68 http://data.nowgoal.group/OddsCompBasket/387520.html
1 National Basketball Association Houston Rockets Boston Celtics 137 112 1.49 2.58 77/60 60/52 http://data.nowgoal.group/OddsCompBasket/387521.html
2 National Basketball Association Philadelphia 76ers Dallas Mavericks 115 118 2.04 1.76 39/64 48/55 http://data.nowgoal.group/OddsCompBasket/387522.html
3 Women’s National Basketball Association Connecticut Sun Washington Mystics 89 94 2.28 1.59 52/37 48/46 http://data.nowgoal.group/OddsCompBasket/385886.html
4 Women’s National Basketball Association Chicago Sky Los Angeles Sparks 96 78 2.72 1.43 40/56 36/42 http://data.nowgoal.group/OddsCompBasket/385618.html
5 Women’s National Basketball Association Seattle Storm Minnesota Lynx 90 66 1.21 4.19 41/49 35/31 http://data.nowgoal.group/OddsCompBasket/385884.html
6 Friendly Competition Labas Pasauli LT Balduasenaras 85 78 52/33 31/47 http://data.nowgoal.group/OddsCompBasket/387769.html
7 Friendly Competition BC Vikings Nemuno Banga KK 66 72 29/37 30/42 http://data.nowgoal.group/OddsCompBasket/387771.html
8 Friendly Competition NRG Kiev Hizhaki 51 76 31/20 28/48 http://data.nowgoal.group/OddsCompBasket/387766.html
9 Friendly Competition Finland Estonia 97 76 2.77 1.40 48/49 29/47 http://data.nowgoal.group/OddsCompBasket/387740.html
10 Friendly Competition Synkarb Sk nemenchine 82 79 37/45 38/41 http://data.nowgoal.group/OddsCompBasket/387770.html

and so on...
And it saves Res.csv to disk.

Requests

Try this (requests):

import pandas as pd
from bs4 import BeautifulSoup
import requests

res = []
url = 'http://www.nowgoal.group/GetNbaWithTimeZone.aspx?date=2020-07-29&timezone=2&kind=0&t=1596143185000'

r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')

# each match lives in an <h> element whose text is a '^'-separated record
items = soup.find_all('h')
for item in items:
    values = item.text.split('^')
    res.append([values[1], values[8], values[10], values[11], values[12]])

df = pd.DataFrame(res, columns=['competition', 'home', 'away', 'home score', 'away score'])
print(df.to_string())
df.to_csv('Res.csv')

prints:

competition home away home score away score
0 NBA Portland Trail Blazers Oklahoma City Thunder 120 131
1 NBA Houston Rockets Boston Celtics 137 112
2 NBA Philadelphia 76ers Dallas Mavericks 115 118
3 WNBA Connecticut Sun Washington Mystics 89 94
4 WNBA Chicago Sky Los Angeles Sparks 96 78
5 WNBA Seattle Storm Minnesota Lynx 90 66
6 FC Labas Pasauli LT Balduasenaras 85 78
7 FC BC Vikings Nemuno Banga KK 66 72
8 FC NRG Kiev Hizhaki 51 76

and it saves Res.csv as well. If you do not want the index column, simply add index=False:

df.to_csv('Res.csv', index=False)

Note on selenium: you need selenium and geckodriver, and in this code geckodriver is expected at c:/program/geckodriver.exe. The selenium version is slower, but there is no need to hunt down the XML endpoint with DevTools.
This page uses JavaScript to load the data, but requests/BeautifulSoup can't run JavaScript. So you have two options.

First: use Selenium to control a real web browser, which can run JavaScript. This can be better when a page uses complex JavaScript code to generate its data, but it is slower because it has to start a browser, render the page, and run the JavaScript.

Second: use DevTools in Firefox/Chrome (Network tab, XHR filter) to find the URL that JavaScript/AJAX (XHR) uses to fetch the data from the server, and request that URL directly. Often it returns JSON, which can be converted to a Python list/dictionary, and then you don't need BeautifulSoup at all. This is faster, but sometimes the page runs JavaScript code that is hard to reproduce in Python.

I chose the second method. I found the page reads its data from http://www.nowgoal.group/GetNbaWithTimeZone.aspx?date=2020-07-29&timezone=2&kind=0&t=1596143185000 but it returns XML, so it still needs BeautifulSoup (or lxml) to parse.

import requests
from bs4 import BeautifulSoup as BS

url = 'http://www.nowgoal.group/GetNbaWithTimeZone.aspx?date=2020-07-29&timezone=2&kind=0&t=1596143185000'

r = requests.get(url)
soup = BS(r.text, 'html.parser')

all_items = soup.find_all('h')
for item in all_items:
    values = item.text.split('^')
    #print(values)
    print(values[8], values[11])
    print(values[10], values[12])
    print('---')

Result:

Portland Trail Blazers 120
Oklahoma City Thunder 131
---
Houston Rockets 137
Boston Celtics 112
---
Philadelphia 76ers 115
Dallas Mavericks 118
---
Connecticut Sun 89
Washington Mystics 94
---
Chicago Sky 96
Los Angeles Sparks 78
---
Seattle Storm 90
Minnesota Lynx 66
---
Labas Pasauli LT 85
Balduasenaras 78
---
BC Vikings 66
Nemuno Banga KK 72
---
NRG Kiev 51
Hizhaki 76
---
Finland 97
Estonia 76
---
Synkarb 82
Sk nemenchine 79
---
CS Sfaxien (w) 51
ES Cap Bon (w) 54
---
Police De La Circulation (w) 43
Etoile Sportive Sahel (w) 39
---
CA Bizertin 63
ES Goulette 71
---
JS Manazeh 77
AS Hammamet 53
---
Southern Huskies 84
Canterbury Rams 98
---
Taranaki Mountainairs 99
Franklin Bulls 90
---
Chaophraya Thunder 67
Thai General Equipment 102
---
Airforce Madgoat Basketball Club 60
HiTech Bangkok City 77
---
Bizoni 82
Leningrad 75
---
chameleon 104
Leningrad 80
---
Bizoni 71
Zubuyu 57
---
Drakony 89
chameleon 79
---
Dragoni 71
Zubuyu 87
---
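The t query parameter in that URL looks like a millisecond Unix timestamp. Here is a sketch of building the request for an arbitrary date (assuming the endpoint accepts any freshly generated timestamp, which is an untested guess):

import time

import requests

date = '2020-07-29'          # the schedule date you want
t = int(time.time() * 1000)  # millisecond timestamp, mirroring the captured URL
url = f'http://www.nowgoal.group/GetNbaWithTimeZone.aspx?date={date}&timezone=2&kind=0&t={t}'
r = requests.get(url)
print(r.status_code, len(r.text))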
Parsing html with correct encoding
I'm trying to use the basketball-reference API from Python with the requests and bs4 libraries.

from requests import get
from bs4 import BeautifulSoup

Here's a minimal example of what I'm trying to do:

# example request
r = get(f'https://widgets.sports-reference.com/wg.fcgi?css=1&site=bbr&url=%2Fteams%2FMIL%2F2015.html&div=div_roster')
soup = BeautifulSoup(r.content, 'html.parser')
table = soup.find('table')

It all works well; I can then feed this table to pandas with its read_html and get the data I need nicely packed into a dataframe. The problem I have is the encoding. This particular request returns two NBA player names with non-ASCII characters: Ersan İlyasova and Jorge Gutiérrez. In the current code they come out as "Ersan Ä°lyasova" and "Jorge GutiÃ©rrez", which is obviously not what I want. So the question is: how do I fix it? This website seems to suggest the page has the windows-1251 encoding, but I'm not sure how to use that information (in fact, I'm not even sure it's true). I know I'm missing something fundamental here, as I'm a bit confused about how these encodings work and at which point they get "interpreted", so I'll be grateful if you help me with this!
I really don't know why you are using a format string, and your question isn't entirely clear; you've just copy/pasted the URL from the network traffic and then mixed up quoted strings with encoding. The code below should get it done.

import pandas as pd

df = pd.read_html("https://www.basketball-reference.com/teams/MIL/2015.html")
print(df)

Output:

[ No. Player Pos ... Unnamed: 6 Exp College
0 34 Giannis Antetokounmpo SG ... gr 1 NaN
1 19 Jerryd Bayless PG ... us 6 Arizona
2 5 Michael Carter-Williams PG ... us 1 Syracuse
3 9 Jared Dudley SG ... us 7 Boston College
4 11 Tyler Ennis PG ... ca R Syracuse
5 13 Jorge Gutiérrez PG ... mx 1 California
6 31 John Henson C ... us 2 UNC
7 7 Ersan İlyasova PF ... tr 6 NaN
8 23 Chris Johnson SF ... us 2 Dayton
9 11 Brandon Knight PG ... us 3 Kentucky
10 5 Kendall Marshall PG ... us 2 UNC
11 6 Kenyon Martin PF ... us 14 Cincinnati
12 0 O.J. Mayo SG ... us 6 USC
13 22 Khris Middleton SF ... us 2 Texas A&M
14 3 Johnny O'Bryant PF ... us R LSU
15 27 Zaza Pachulia C ... ge 11 NaN
16 12 Jabari Parker PF ... us R Duke
17 21 Miles Plumlee C ... us 2 Duke
18 8 Larry Sanders C ... us 4 Virginia Commonwealth
19 6 Nate Wolters PG ... us 1 South Dakota State
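If you do want to keep the widget URL from the question, the usual culprit is that the response does not declare a charset, so requests falls back to ISO-8859-1 when it builds r.text. A minimal sketch of two fixes (assuming the page is actually UTF-8):

from requests import get
from bs4 import BeautifulSoup

r = get('https://widgets.sports-reference.com/wg.fcgi?css=1&site=bbr&url=%2Fteams%2FMIL%2F2015.html&div=div_roster')

# fix 1: tell requests the real encoding before touching r.text
r.encoding = r.apparent_encoding  # or simply r.encoding = 'utf-8'
soup = BeautifulSoup(r.text, 'html.parser')

# fix 2: hand BeautifulSoup the raw bytes and let it detect the charset itself
# soup = BeautifulSoup(r.content, 'html.parser')

print(soup.find('table') is not None)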
Trying to extract one column that seems to be JSON from a pandas dataframe in Python, how do I achieve this?
I have a dataset that I loaded into a pandas dataframe, with one column that seems to be in JSON format (not sure), and I want to extract the information from this column and put it in other columns of the same dataframe. I've tried read_json, normalization and other Python functions, but I can't achieve my goal...

Here's what I tried:

x = {'latitude': '47.61219025', 'needs_recoding': False, 'human_address': '{""address"":""405 OLIVE WAY"",""city"":""SEATTLE"",""state"":""WA"",""zip"":""98101""}', 'longitude': '-122.33799744'}
print(x.get('latitude'))
print(x.get('longitude'))

This works for one line only. I also tried this:

s = data2015.groupby('OSEBuildingID')['Location'].apply(lambda x: x.tolist())
print(s)
pd.read_json(s, typ='series', orient='records')

but I get this error:

ValueError: Invalid file path or buffer object type

Loading the dataframe:

data2015 = pd.read_csv(filepath_or_buffer=r'C:\Users\mehdi\OneDrive\Documents\OpenClassRooms\Projet 3\2015-building-energy-benchmarking\2015-building-energy-benchmarking.csv', delimiter=",", low_memory=False)

Example of the file content:

OSEBuildingID,DataYear,BuildingType,PrimaryPropertyType,PropertyName,TaxParcelIdentificationNumber,Location,CouncilDistrictCode,Neighborhood,YearBuilt,NumberofBuildings,NumberofFloors,PropertyGFATotal,PropertyGFAParking,PropertyGFABuilding(s),ListOfAllPropertyUseTypes,LargestPropertyUseType,LargestPropertyUseTypeGFA,SecondLargestPropertyUseType,SecondLargestPropertyUseTypeGFA,ThirdLargestPropertyUseType,ThirdLargestPropertyUseTypeGFA,YearsENERGYSTARCertified,ENERGYSTARScore,SiteEUI(kBtu/sf),SiteEUIWN(kBtu/sf),SourceEUI(kBtu/sf),SourceEUIWN(kBtu/sf),SiteEnergyUse(kBtu),SiteEnergyUseWN(kBtu),SteamUse(kBtu),Electricity(kWh),Electricity(kBtu),NaturalGas(therms),NaturalGas(kBtu),OtherFuelUse(kBtu),GHGEmissions(MetricTonsCO2e),GHGEmissionsIntensity(kgCO2e/ft2),DefaultData,Comment,ComplianceStatus,Outlier
1,2015,NonResidential,Hotel,MAYFLOWER PARK HOTEL,659000030,"{'latitude': '47.61219025', 'needs_recoding': False, 'human_address': '{""address"":""405 OLIVE WAY"",""city"":""SEATTLE"",""state"":""WA"",""zip"":""98101""}', 'longitude': '-122.33799744'}",7,DOWNTOWN,1927,1,12,88434,0,88434,Hotel,Hotel,88434,,,,,,65,78.90,80.30,173.50,175.10,6981428,7097539,2023032,1080307,3686160,12724,1272388,0,249.43,2.64,No,,Compliant,
2,2015,NonResidential,Hotel,PARAMOUNT HOTEL,659000220,"{'latitude': '47.61310583', 'needs_recoding': False, 'human_address': '{""address"":""724 PINE ST"",""city"":""SEATTLE"",""state"":""WA"",""zip"":""98101""}', 'longitude': '-122.33335756'}",7,DOWNTOWN,1996,1,11,103566,15064,88502,"Hotel, Parking, Restaurant",Hotel,83880,Parking,15064,Restaurant,4622,,51,94.40,99.00,191.30,195.20,8354235,8765788,0,1144563,3905411,44490,4448985,0,263.51,2.38,No,,Compliant,
3,2015,NonResidential,Hotel,WESTIN HOTEL,659000475,"{'latitude': '47.61334897', 'needs_recoding': False, 'human_address': '{""address"":""1900 5TH AVE"",""city"":""SEATTLE"",""state"":""WA"",""zip"":""98101""}', 'longitude': '-122.33769944'}",7,DOWNTOWN,1969,1,41,961990,0,961990,"Hotel, Parking, Swimming Pool",Hotel,757243,Parking,100000,Swimming Pool,0,,18,96.60,99.70,242.70,246.50,73130656,75506272,19660404,14583930,49762435,37099,3709900,0,2061.48,1.92,Yes,,Compliant,
5,2015,NonResidential,Hotel,HOTEL MAX,659000640,"{'latitude': '47.61421585', 'needs_recoding': False, 'human_address': '{""address"":""620 STEWART ST"",""city"":""SEATTLE"",""state"":""WA"",""zip"":""98101""}', 'longitude': '-122.33660889'}",7,DOWNTOWN,1926,1,10,61320,0,61320,Hotel,Hotel,61320,,,,,,1,460.40,462.50,636.30,643.20,28229320,28363444,23458518,811521,2769023,20019,2001894,0,1936.34,31.38,No,,Compliant,High Outlier
8,2015,NonResidential,Hotel,WARWICK SEATTLE HOTEL,659000970,"{'latitude': '47.6137544', 'needs_recoding': False, 'human_address': '{""address"":""401 LENORA ST"",""city"":""SEATTLE"",""state"":""WA"",""zip"":""98121""}', 'longitude': '-122.3409238'}",7,DOWNTOWN,1980,1,18,119890,12460,107430,"Hotel, Parking, Swimming Pool",Hotel,123445,Parking,68009,Swimming Pool,0,,67,120.10,122.10,228.80,227.10,14829099,15078243,0,1777841,6066245,87631,8763105,0,507.7,4.02,No,,Compliant,
9,2015,Nonresidential COS,Other,WEST PRECINCT (SEATTLE POLICE),660000560,"{'latitude': '47.6164389', 'needs_recoding': False, 'human_address': '{""address"":""810 VIRGINIA ST"",""city"":""SEATTLE"",""state"":""WA"",""zip"":""98101""}', 'longitude': '-122.33676431'}",7,DOWNTOWN,1999,1,2,97288,37198,60090,Police Station,Police Station,88830,,,,,,,135.70,146.90,313.50,321.60,12051984,13045258,0,2130921,7271004,47813,4781283,0,304.62,2.81,No,,Compliant,
10,2015,NonResidential,Hotel,CAMLIN WORLDMARK HOTEL,660000825,"{'latitude': '47.6141141', 'needs_recoding': False, 'human_address': '{""address"":""1619 9TH AVE"",""city"":""SEATTLE"",""state"":""WA"",""zip"":""98101""}', 'longitude': '-122.33274086'}",7,DOWNTOWN,1926,1,11,83008,0,83008,Hotel,Hotel,81352,,,,,,25,76.90,79.60,149.50,158.20,6252842,6477493,0,785342,2679698,35733,3573255,0,208.46,2.37,No,,Compliant,
11,2015,NonResidential,Other,PARAMOUNT THEATER,660000955,"{'latitude': '47.61290234', 'needs_recoding': False, 'human_address': '{""address"":""901 PINE ST"",""city"":""SEATTLE"",""state"":""WA"",""zip"":""98101""}', 'longitude': '-122.33130949'}",7,DOWNTOWN,1926,1,8,102761,0,102761,Other - Entertainment/Public Assembly,Other - Entertainment/Public Assembly,102761,,,,,,,62.50,71.80,152.20,160.40,6426022,7380086,2003108,1203937,4108004,3151,315079,0,199.99,1.77,No,,Compliant,

I would like to have at least another dataframe with the columns: latitude, needs_recoding, human_address, and longitude.
There might be a better way of doing this, but I just iterated through the rows, parsed that JSON-like string into its individual parts, and put everything back together into a dataframe. You could then just use .to_csv() to save it:

import ast
import json

import pandas as pd

data2015 = pd.read_csv('C:/test.csv', delimiter=",", low_memory=False)

results = pd.DataFrame()
for idx, row in data2015.iterrows():
    # Location holds a Python dict literal, so ast.literal_eval parses it
    data_dict = ast.literal_eval(row['Location'])
    lat = data_dict['latitude']
    lon = data_dict['longitude']
    need_recode = data_dict['needs_recoding']
    # human_address is proper JSON inside that dict
    normalize = pd.Series(json.loads(data_dict['human_address']))

    row = row.drop('Location')
    cols = list(row.index) + ['latitude', 'longitude', 'need_recoding'] + list(normalize.index)
    temp_df = pd.DataFrame([list(row) + [lat, lon, need_recode] + list(normalize)], columns=cols)
    results = results.append(temp_df).reset_index(drop=True)

Output:

print(results.to_string())

OSEBuildingID DataYear BuildingType PrimaryPropertyType PropertyName TaxParcelIdentificationNumber CouncilDistrictCode Neighborhood YearBuilt NumberofBuildings NumberofFloors PropertyGFATotal PropertyGFAParking PropertyGFABuilding(s) ListOfAllPropertyUseTypes LargestPropertyUseType LargestPropertyUseTypeGFA SecondLargestPropertyUseType SecondLargestPropertyUseTypeGFA ThirdLargestPropertyUseType ThirdLargestPropertyUseTypeGFA YearsENERGYSTARCertified ENERGYSTARScore SiteEUI(kBtu/sf) SiteEUIWN(kBtu/sf) SourceEUI(kBtu/sf) SourceEUIWN(kBtu/sf) SiteEnergyUse(kBtu) SiteEnergyUseWN(kBtu) SteamUse(kBtu) Electricity(kWh) Electricity(kBtu) NaturalGas(therms) NaturalGas(kBtu) OtherFuelUse(kBtu) GHGEmissions(MetricTonsCO2e) GHGEmissionsIntensity(kgCO2e/ft2) DefaultData Comment ComplianceStatus Outlier latitude longitude need_recoding address city state zip
0 1 2015 NonResidential Hotel MAYFLOWER PARK HOTEL 659000030 7 DOWNTOWN 1927 1 12 88434 0 88434 Hotel Hotel 88434 NaN NaN NaN NaN NaN 65.0 78.9 80.3 173.5 175.1 6981428 7097539 2023032 1080307 3686160 12724 1272388 0 249.43 2.64 No NaN Compliant NaN 47.61219025 -122.33799744 False 405 OLIVE WAY SEATTLE WA 98101
1 2 2015 NonResidential Hotel PARAMOUNT HOTEL 659000220 7 DOWNTOWN 1996 1 11 103566 15064 88502 Hotel, Parking, Restaurant Hotel 83880 Parking 15064.0 Restaurant 4622.0 NaN 51.0 94.4 99.0 191.3 195.2 8354235 8765788 0 1144563 3905411 44490 4448985 0 263.51 2.38 No NaN Compliant NaN 47.61310583 -122.33335756 False 724 PINE ST SEATTLE WA 98101
2 3 2015 NonResidential Hotel WESTIN HOTEL 659000475 7 DOWNTOWN 1969 1 41 961990 0 961990 Hotel, Parking, Swimming Pool Hotel 757243 Parking 100000.0 Swimming Pool 0.0 NaN 18.0 96.6 99.7 242.7 246.5 73130656 75506272 19660404 14583930 49762435 37099 3709900 0 2061.48 1.92 Yes NaN Compliant NaN 47.61334897 -122.33769944 False 1900 5TH AVE SEATTLE WA 98101
3 5 2015 NonResidential Hotel HOTEL MAX 659000640 7 DOWNTOWN 1926 1 10 61320 0 61320 Hotel Hotel 61320 NaN NaN NaN NaN NaN 1.0 460.4 462.5 636.3 643.2 28229320 28363444 23458518 811521 2769023 20019 2001894 0 1936.34 31.38 No NaN Compliant High Outlier 47.61421585 -122.33660889 False 620 STEWART ST SEATTLE WA 98101
4 8 2015 NonResidential Hotel WARWICK SEATTLE HOTEL 659000970 7 DOWNTOWN 1980 1 18 119890 12460 107430 Hotel, Parking, Swimming Pool Hotel 123445 Parking 68009.0 Swimming Pool 0.0 NaN 67.0 120.1 122.1 228.8 227.1 14829099 15078243 0 1777841 6066245 87631 8763105 0 507.70 4.02 No NaN Compliant NaN 47.6137544 -122.3409238 False 401 LENORA ST SEATTLE WA 98121
5 9 2015 Nonresidential COS Other WEST PRECINCT (SEATTLE POLICE) 660000560 7 DOWNTOWN 1999 1 2 97288 37198 60090 Police Station Police Station 88830 NaN NaN NaN NaN NaN NaN 135.7 146.9 313.5 321.6 12051984 13045258 0 2130921 7271004 47813 4781283 0 304.62 2.81 No NaN Compliant NaN 47.6164389 -122.33676431 False 810 VIRGINIA ST SEATTLE WA 98101
6 10 2015 NonResidential Hotel CAMLIN WORLDMARK HOTEL 660000825 7 DOWNTOWN 1926 1 11 83008 0 83008 Hotel Hotel 81352 NaN NaN NaN NaN NaN 25.0 76.9 79.6 149.5 158.2 6252842 6477493 0 785342 2679698 35733 3573255 0 208.46 2.37 No NaN Compliant NaN 47.6141141 -122.33274086 False 1619 9TH AVE SEATTLE WA 98101
7 11 2015 NonResidential Other PARAMOUNT THEATER 660000955 7 DOWNTOWN 1926 1 8 102761 0 102761 Other - Entertainment/Public Assembly Other - Entertainment/Public Assembly 102761 NaN NaN NaN NaN NaN NaN 62.5 71.8 152.2 160.4 6426022 7380086 2003108 1203937 4108004 3151 315079 0 199.99 1.77 No NaN Compliant NaN 47.61290234 -122.33130949 False 901 PINE ST SEATTLE WA 98101
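A vectorized alternative to the row-by-row loop (a sketch under the same assumptions as the answer above: every Location cell is a clean dict literal, human_address inside it is valid JSON, and the same C:/test.csv path is used):

import ast
import json

import pandas as pd

data2015 = pd.read_csv('C:/test.csv', low_memory=False)

# parse the dict literal in each Location cell
loc = data2015['Location'].apply(ast.literal_eval)
loc_df = pd.DataFrame(loc.tolist())  # latitude, needs_recoding, human_address, longitude

# human_address is JSON inside that dict: expand it into address/city/state/zip
addr_df = pd.DataFrame(loc_df.pop('human_address').apply(json.loads).tolist())

results = pd.concat([data2015.drop(columns='Location'), loc_df, addr_df], axis=1)
print(results.head().to_string())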