Parsing webpage table with BeautifulSoup4 - python

So I'm attempting to parse a table from a webpage using BeautifulSoup4. It gets the webpage and parses the content, but when I move on to looking for the table to put into a pandas DataFrame, I get an AttributeError: 'NoneType' object has no attribute 'find_all'.
I tried this same process on another webpage and it worked just fine, so I'm trying to figure out what I'm doing incorrectly here, where one works and the other does not.
# Imports
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests

# Load data
url = 'https://gisopendata.siouxfalls.org/datasets/7b0407feca3e4f47bfe54559b9c1dd5d_13/data'

# Get request
web_data = requests.get(url)

# Parse content
soup = BeautifulSoup(web_data.text, 'lxml')
# print(soup.prettify())

table = soup.find('table', {'class': 'table table-striped table-bordered table-hover'})

headers = []
for i in table.find_all('th'):
    title = i.text.strip()
    headers.append(title)

The data is dynamically pulled in via a POST request. However, the page shows you an API endpoint you can use. The following is one way you can make a request to that API and generate a DataFrame from the response.
Simplest is to call the endpoint with json specified as the output format:
import requests
import pandas as pd
r = requests.get('https://gis2.siouxfalls.org/arcgis/rest/services/Data/Community/MapServer/13/query?where=1%3D1&outFields=*&outSR=4326&f=json').json()
print(pd.DataFrame([i['attributes'] for i in r['features']]))
Otherwise,
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
r = requests.get('https://gis2.siouxfalls.org/arcgis/rest/services/Data/Community/MapServer/13/query?outFields=*&where=1%3D1')
soup = bs(r.content, 'lxml')
headers = ['OBJECTID', 'Id', 'DESCRIP', 'LOCATION', 'YEARBUILT', 'LOCAL_REGISTER', 'LOCAL_REG_DATE',
           'NATIONAL_REGISTER', 'NATIONAL_REG_DATE', 'GlobalID', 'Shape_Length', 'Shape_Area']
data = {}
for header in headers:
    if header == 'OBJECTID':
        data[header] = [i.next_sibling.next_sibling.text for i in soup.select(f'i:contains("{header}")')]
    else:
        data[header] = [i.next_sibling for i in soup.select(f'i:contains("{header}")')]

df = pd.DataFrame(zip(*data.values()), columns=headers)
print(df)

Each table usually has a thead and a tbody (and then tr elements) which you need to access before you can use find_all on th.
If you check the HTML on the source page, this is indeed the case; you have
<table class="table table-striped table-bordered table-hover" role="grid">
  <thead role="rowgroup">
    <tr role="row">
      <th id="ember123" class="ember-view">
So after the table tag, you have to access the thead tag, then the tr tag, and then you can use find_all to gather all the th tags.
Can you try and see whether something like this works:
for i in table.find('thead').find('tr').find_all('th'):
    title = i.text.strip()
    headers.append(title)
The giveaway here is to observe the data in the source page carefully: the AttributeError tells you clearly that BeautifulSoup cannot find a tag with the instructions you specified, hence the NoneType reference.
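As an aside, you can make this failure mode explicit with a guard instead of letting the AttributeError surface later. A minimal sketch (not part of the original answer, assuming the same table class as the question):

table = soup.find('table', {'class': 'table table-striped table-bordered table-hover'})
if table is None:
    # find() returned None: the selector matched nothing, possibly because
    # the table is rendered by JavaScript and absent from the raw HTML
    raise SystemExit('table not found in the downloaded HTML')
headers = [th.text.strip() for th in table.find_all('th')]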

Related

Beautifulsoup object does not contain full table from webpage, instead grabs first 100 rows

I am attempting to scrape tables from the website spotrac.com and save the data to a pandas dataframe. For whatever reason, if the table I am scraping is over 100 rows, the BeautifulSoup object only appears to grab the first 100 rows of the table. If you run my code below, you'll see that the resulting dataframe has only 100 rows, and ends with "David Montgomery."
If you visit the webpage (https://www.spotrac.com/nfl/rankings/2019/base/running-back/) and Ctrl+F "David Montgomery", you'll see that there are additional rows. If you change the webpage in the get row of the code to "https://www.spotrac.com/nfl/rankings/2019/base/wide-receiver/" you'll see that the same thing happens: only the first 100 rows are included in the BeautifulSoup object and in the dataframe.
import pandas as pd
import requests, lxml.html
from bs4 import BeautifulSoup

# Begin requests session
with requests.session() as s:
    # Get page
    r = s.get('https://www.spotrac.com/nfl/rankings/2019/base/running-back/')
    # Get page content, find first table, and save to df
    soup = BeautifulSoup(r.content, 'lxml')
    table = soup.find_all('table')[0]
    df_list = pd.read_html(str(table))
    df = df_list[0]
I have read that changing the parser can help. I have tried using different parsers by replacing the BeautifulSoup object code with the following:
soup = BeautifulSoup(r.content,'html5lib')
soup = BeautifulSoup(r.content,'html.parser')
Neither of these changes worked. I have run "pip install html5lib" and "pip install lxml" and confirmed that both were already installed.
This page uses JavaScript to load the extra data.
In DevTools in Firefox/Chrome you can see that it sends a POST request with extra information: {'ajax': True, 'mobile': False}.
import pandas as pd
import requests, lxml.html
from bs4 import BeautifulSoup

with requests.session() as s:
    r = s.post('https://www.spotrac.com/nfl/rankings/2019/base/running-back/', data={'ajax': True, 'mobile': False})
    # Get page content, find first table, and save to df
    soup = BeautifulSoup(r.content, 'lxml')
    table = soup.find_all('table')[0]
    df_list = pd.read_html(str(table))
    df = df_list[0]
    print(df)
I suggest you use requests-html:
import pandas as pd
from bs4 import BeautifulSoup
from requests_html import HTMLSession

if __name__ == "__main__":
    # Begin requests session
    s = HTMLSession()
    # Get page
    r = s.get('https://www.spotrac.com/nfl/rankings/2019/base/running-back/')
    r.html.render()
    # Get page content, find first table, and save to df
    soup = BeautifulSoup(r.html.html, 'lxml')
    table = soup.find_all('table')[0]
    df_list = pd.read_html(str(table))
    df = df_list[0]
Then you will get all 140 rows.

How to access specific table shown in inspect using Python and BeautifulSoup for web scraping

I am working on web scraping using Python and BeautifulSoup. My purpose is to pull member data from https://thehia.org/directory?&tab=1. There are around 1685 records.
When I view the page source in Chrome, I cannot find the table; it seems the data is pulled dynamically. But when I use the Inspect option in Chrome, I can find the "membersTable" table in the div that I need.
How can I use BeautifulSoup to access that membersTable that I can see in Inspect?
You can mimic the POST request the page makes for content, then use hjson to handle the unquoted keys in the string pulled out of the response:
import requests, hjson
import pandas as pd

data = {'formId': '3721260'}
r = requests.post('https://thehia.org/Sys/MemberDirectory/LoadMembers', data=data)
# Strip the anti-hijacking prefix so the payload parses as (h)json
data = hjson.loads(r.text.replace('while(1); ', ''))
total = data['TotalCount']
structure = data['JsonStructure']
members = hjson.loads(structure)
df = pd.DataFrame([[member[k][0]['v'] for k in member.keys()] for member in members['members'][0]],
                  columns=['Organisation', 'City', 'State', 'Country'])
print(df)
Try this one
import requests
from bs4 import BeautifulSoup

url = "https://thehia.org/directory?&tab=1"
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table', attrs={'class': 'membersTable'})

row_list = []
for row in table.findAll('tr', {'class': ['normal']}):
    data = []
    for cell in row.findAll('td'):
        data.append(cell.text)
    row_list.append(data)
print(row_list)

Web scrape not pulling back title correctly

I am trying to pull back only the title from some source code online. My code currently pulls all the correct lines, but I cannot figure out how to make it pull back only the title.
from bs4 import BeautifulSoup  # BeautifulSoup is in the bs4 package
import requests

URL = 'https://sc2replaystats.com/replay/playerStats/10774659/8465'
content = requests.get(URL)
soup = BeautifulSoup(content.text, 'html.parser')
tb = soup.find('table', class_='table table-striped table-condensed')
for link in tb.find_all('tr'):
    name = link.find('td')
    print(name.get_text('title'))
I expect it to only say
Nexus
Pylon
Gateway
Assimilator
etc.
but I get the error:
Traceback (most recent call last):
  File "main.py", line 11, in <module>
    print(name.get_text().strip())
AttributeError: 'NoneType' object has no attribute 'get_text'
I don't understand what I am doing wrong, since from what I've read it should only pull back the desired results.
Try the code below. Your first row had table headers instead of table data, so it will be None when you look for the td tag.
So add a condition to check that you actually found the span inside the td tag, then get its title, as below.
from bs4 import BeautifulSoup  # BeautifulSoup is in the bs4 package
import requests

URL = 'https://sc2replaystats.com/replay/playerStats/10774659/8465'
content = requests.get(URL)
soup = BeautifulSoup(content.text, 'html.parser')
tb = soup.find('table', class_='table table-striped table-condensed')
for link in tb.find_all('tr'):
    name = link.find('span')
    if name is not None:
        # Process only if the element is available
        print(name['title'])
I think you should use something like

for link in tb.find_all('tr'):
    for name in link.select('td[title]'):
        print(name['title'])

because, from what I can see, the string comes back empty: the title is not in the tag's text, so you were trying to get text from the td tag instead of reading its title attribute.
bkyada's answer is perfect, but if you want another solution:
In your for loop, instead of finding the td, find the span and read its title attribute.

containers = link.find('span')
if containers is not None:
    print(containers['title'])
It is more efficient to simply use the class name to identify the elements with a title attribute, as they all have one in the first column.
from bs4 import BeautifulSoup  # BeautifulSoup is in the bs4 package
import requests

URL = 'https://sc2replaystats.com/replay/playerStats/10774659/8465'
content = requests.get(URL)
soup = BeautifulSoup(content.text, 'html.parser')
tb = soup.find('table', class_='table table-striped table-condensed')

titles = [i['title'] for i in tb.select('.blizzard_icons_single')]
print(titles)

titles = {i['title'] for i in tb.select('.blizzard_icons_single')}  # set of unique titles
print(titles)
As the title attribute is limited to that column, you could also have used the (slightly less quick) attribute selector:
titles = {i['title'] for i in tb.select('[title]')} #set of unique

Using BeautifulSoup to find an attribute called data-stat

I'm currently working on a web scraper that will allow me to pull stats for a football player. Usually this would be an easy task if I could just grab the divs; however, this website uses an attribute called data-stat and uses it like a class. Here is an example of that:
<th scope="row" class="left " data-stat="year_id">2000</th>
If you would like to check the site for yourself, here is the link:
https://www.pro-football-reference.com/players/B/BradTo00.htm
I've tried a few different methods. Either it won't work at all, or I can start a for loop and begin putting things into arrays, but you will notice that not everything in the table is the same var type.
Sorry for the formatting and the grammar.
Here is what I have so far. I'm sure it's not the best looking code; it's mainly code I've tried on my own with a few things mixed in from searching on Google. Ignore the random imports, I was trying different things.
# import libraries
import csv
from datetime import datetime
import requests
from bs4 import BeautifulSoup
import lxml.html as lh
import pandas as pd

# specify url
url = 'https://www.pro-football-reference.com/players/B/BradTo00.htm'

# request html
page = requests.get(url)

# Parse html using BeautifulSoup; you can use a different parser like lxml if present
soup = BeautifulSoup(page.content, 'lxml')

# find searches for the given tag with the given class attribute and returns the first match it finds
headers = [c.get_text() for c in soup.find(class_='table_container').find_all('td')[0:31]]
data = [[cell.get_text(strip=True) for cell in row.find_all('td')[0:32]]
        for row in soup.find_all("tr", class_=True)]

tags = soup.find(data='pos')
# stats = tags.find_all('td')
print(tags)
You need to use the get method from BeautifulSoup to get the attributes by name
See: BeautifulSoup Get Attribute
Here is a snippet to get all the data you want from the table:
from bs4 import BeautifulSoup
import requests

url = "https://www.pro-football-reference.com/players/B/BradTo00.htm"
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')

# Get table
table = soup.find(class_="table_outer_container")

# Get head
thead = table.find('thead')
th_head = thead.find_all('th')
for thh in th_head:
    # Get cell value
    print(thh.get_text())
    # Get data-stat value
    print(thh.get('data-stat'))

# Get body
tbody = table.find('tbody')
tr_body = tbody.find_all('tr')
for trb in tr_body:
    # Get id
    print(trb.get('id'))
    # Get th data
    th = trb.find('th')
    print(th.get_text())
    print(th.get('data-stat'))
    for td in trb.find_all('td'):
        # Get cell value
        print(td.get_text())
        # Get data-stat value
        print(td.get('data-stat'))

# Get footer
tfoot = table.find('tfoot')
thf = tfoot.find('th')
# Get cell value
print(thf.get_text())
# Get data-stat value
print(thf.get('data-stat'))
for tdf in tfoot.find_all('td'):
    # Get cell value
    print(tdf.get_text())
    # Get data-stat value
    print(tdf.get('data-stat'))
You can of course save the data to a CSV or even a JSON file instead of printing it.
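For instance, here is a minimal sketch of a CSV export using the data-stat values as column names; it reuses the tbody from the snippet above, and the output file name is just an illustration:

import csv

rows = []
for trb in tbody.find_all('tr'):
    # Map each cell's data-stat name to the cell's text
    row = {cell.get('data-stat'): cell.get_text() for cell in trb.find_all(['th', 'td'])}
    rows.append(row)

with open('player_stats.csv', 'w', newline='') as f:  # hypothetical output path
    # Take the union of keys in case rows differ slightly
    fieldnames = sorted({key for row in rows for key in row})
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)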
It's not very clear what exactly you're trying to extract, but this might help you a little bit:
import requests
from bs4 import BeautifulSoup as bs

url = 'https://www.pro-football-reference.com/players/B/BradTo00.htm'
page = requests.get(url)
soup = bs(page.text, "html.parser")

# Extract tables
table = soup.find_all('table')

# Let's extract data from each row in each table
for row in table:
    col = row.find_all('td')
    for c in col:
        print(c.text)
Hope this helps!

Wikipedia table scraping using python

I am trying to scrape tables from Wikipedia. I wrote a table scraper that downloads a table and saves it as a pandas DataFrame.
This is the code:
from bs4 import BeautifulSoup
import pandas as pd
import urllib2

headers = {'User-Agent': 'Mozilla/5.0'}
req = urllib2.Request('https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population', None, headers)
html = urllib2.urlopen(req).read()
soup = BeautifulSoup(html, 'lxml')  # Parse the HTML as a string
print soup

# Create an object of the first table
table = soup.find("table", {"class": "wikitable sortable jquery-tablesorter"})
print table

rank = []
country = []
pop = []
date = []
per = []
source = []
for row in table.find_all('tr')[1:]:
    col = row.find_all('td')
    col1 = col[0].string.strip()
    rank.append(col1)
    col2 = col[1].string.strip()
    country.append(col2)
    col3 = col[2].string.strip()
    pop.append(col3)
    col4 = col[3].string.strip()
    date.append(col4)
    col5 = col[4].string.strip()
    per.append(col5)
    col6 = col[5].string.strip()
    source.append(col6)

columns = {'Rank': rank, 'Country': country, 'Population': pop, 'Date': date, 'Percentage': per, 'Source': source}

# Create a dataframe from the columns variable
df = pd.DataFrame(columns)
df
But it is not downloading the table. The problem is in this section:
table = soup.find("table", {"class": "wikitable sortable jquery-tablesorter"})
print table
where the output is None.
As far as I can see, there is no such element on that page. The main table has the class "wikitable sortable" but not jquery-tablesorter; that class is added by JavaScript in the browser after the page loads, so it is not present in the HTML you download.
Make sure you know which element you are trying to select, check that your program sees the same elements you see, and then write your selector.
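For example, dropping the JavaScript-added class should find the main table (a small sketch against the same page and soup object as above):

table = soup.find("table", {"class": "wikitable sortable"})
print table is not None  # True once the selector matches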
The docs say you need to specify multiple classes like so:
soup.find("table", class_="wikitable sortable jquery-tablesorter")
Also, consider using requests instead of urllib2.
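For instance, a rough equivalent with requests, as a sketch (requests is a third-party package; this also drops the JavaScript-added jquery-tablesorter class):

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0'}
url = 'https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population'
html = requests.get(url, headers=headers).text
soup = BeautifulSoup(html, 'lxml')
table = soup.find("table", {"class": "wikitable sortable"})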
