I'm currently working on a web scraper that will allow me to pull stats from a football player. Usually this would be an easy task if I could just grab the divs however, this website uses a attribute called data-stats and uses it like a class. This is an example of that.
<th scope="row" class="left " data-stat="year_id">2000</th>
If you would like to check the site for yourself here is the link.
https://www.pro-football-reference.com/players/B/BradTo00.htm
I'm tried a few different methods. Either It won't work at all or I will be able to start a for loop and start putting things into arrays, however you will notice that not everything in the table is the same var type.
Sorry for the formatting and the grammer.
Here is what I have so far, I'm sure its not the best looking code, it's mainly just code I've tried on my own and a few things mixed in from searching on Google. Ignore the random imports I was trying different things
# import libraries
import csv
from datetime import datetime
import requests
from bs4 import BeautifulSoup
import lxml.html as lh
import pandas as pd
# specify url
url = 'https://www.pro-football-reference.com/players/B/BradTo00.htm'
# request html
page = requests.get(url)
# Parse html using BeautifulSoup, you can use a different parser like lxml if present
soup = BeautifulSoup(page.content, 'lxml')
# find searches the given tag (div) with given class attribute and returns the first match it finds
headers = [c.get_text() for c in soup.find(class_ = 'table_container').find_all('td')[0:31]]
data = [[cell.get_text(strip=True) for cell in row.find_all('td')[0:32]]
for row in soup.find_all("tr", class_=True)]
tags = soup.find(data ='pos')
#stats = tags.find_all('td')
print(tags)
You need to use the get method from BeautifulSoup to get the attributes by name
See: BeautifulSoup Get Attribute
Here is a snippet to get all the data you want from the table:
from bs4 import BeautifulSoup
import requests
url = "https://www.pro-football-reference.com/players/B/BradTo00.htm"
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
# Get table
table = soup.find(class_="table_outer_container")
# Get head
thead = table.find('thead')
th_head = thead.find_all('th')
for thh in th_head:
# Get case value
print(thh.get_text())
# Get data-stat value
print(thh.get('data-stat'))
# Get body
tbody = table.find('tbody')
tr_body = tbody.find_all('tr')
for trb in tr_body:
# Get id
print(trb.get('id'))
# Get th data
th = trb.find('th')
print(th.get_text())
print(th.get('data-stat'))
for td in trb.find_all('td'):
# Get case value
print(td.get_text())
# Get data-stat value
print(td.get('data-stat'))
# Get footer
tfoot = table.find('tfoot')
thf = tfoot.find('th')
# Get case value
print(thf.get_text())
# Get data-stat value
print(thf.get('data-stat'))
for tdf in tfoot.find_all('td'):
# Get case value
print(tdf.get_text())
# Get data-stat value
print(tdf.get('data-stat'))
You can of course save the data in a csv or even a json instead of printing it
It's not very clear what exactly you're trying to extract, but this might help you a little bit:
import requests
from bs4 import BeautifulSoup as bs
url = 'https://www.pro-football-reference.com/players/B/BradTo00.htm'
page = requests.get(url)
soup = bs(page.text, "html.parser")
# Extract table
table = soup.find_all('table')
# Let's extract data from each row in table
for row in table:
col = row.find_all('td')
for c in col:
print(c.text)
Hope this helps!
Related
So, I'm attempting to parse the table from a webpage using BeautifulSoup4 and it is able to get the webpage, and parse the content, but when I move onto looking for the table to put into a pandas data frame I get an attribute error: 'NONETYPE' object has no attribute 'Find_all'
I tried this same process for another webpage and it was able to work just fine, and I'm just trying to figure out what I'm doing incorrectly here where one works and the other does not.
#Imports
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests
#Load data
url = 'https://gisopendata.siouxfalls.org/datasets/7b0407feca3e4f47bfe54559b9c1dd5d_13/data'
#Get request
web_data = requests.get(url)
#Parse Content
soup = BeautifulSoup(web_data.text, 'lxml')
#print(soup.prettify())
table = soup.find('table', {'class':'table table-striped table-bordered table-hover'})
headers = []
for i in table.find_all('th'):
title = i.text.strip()
headers.append(title)
Data is dynamically pulled from a POST request. However, the page shows you an API endpoint you can use. The following is one way you can make a request to that API and generate a dataframe from the response.
Simplest is to use with json specified for output:
import requests
import pandas as pd
r = requests.get('https://gis2.siouxfalls.org/arcgis/rest/services/Data/Community/MapServer/13/query?where=1%3D1&outFields=*&outSR=4326&f=json').json()
print(pd.DataFrame([i['attributes'] for i in r['features']]))
Otherwise,
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
r = requests.get('https://gis2.siouxfalls.org/arcgis/rest/services/Data/Community/MapServer/13/query?outFields=*&where=1%3D1')
soup = bs(r.content, 'lxml')
headers = ['OBJECTID', 'Id', 'DESCRIP', 'LOCATION', 'YEARBUILT', 'LOCAL_REGISTER', 'LOCAL_REG_DATE',
'NATIONAL_REGISTER', 'NATIONAL_REG_DATE', 'GlobalID', 'Shape_Length', 'Shape_Area']
data = {}
for header in headers:
if header == 'OBJECTID':
data[header] = [i.next_sibling.next_sibling.text for i in soup.select(f'i:contains("{header}")')]
else:
data[header] = [i.next_sibling for i in soup.select(f'i:contains("{header}")')]
df = pd.DataFrame(zip(*data.values()), columns = headers)
print(df)
each table usually has a thead and tbody (and possibly a tr) which you need to access before you can use find_all on th.
If you check the html on the source page this is indeed the case, you have
<table class="table table-striped table-bordered table-hover" role="grid">
<thead role="rowgroup">
<tr role="row">
<th id="ember123" class="ember-view">
So after the table tag, you have to access the thead tag, then the tr tag, then you can use find_all to gather all the ths
Can you try and see whether something like this works:
for i in table.find('thead').find('tr').find_all('th'):
title = i.text.strip()
headers.append(title)
The giveaway here is to observe carefully data in the source page, the AttributeError tells you clearly that BeautifulSoup cannot find the tag with the instructions you specified, hence the NoneType reference.
I'm new to Python and am working to extract data from website https://www.screener.in/company/ABB/consolidated/ on a particular table (the last table which is Shareholding Pattern)
I'm using BeautifulSoup library for this but I do not know how to go about it.
So far, here below is my code snippet. am failing to pick the right table due to the fact that the page has multiple tables and all tables share common classes and IDs which makes it difficult for me to filter for the one table I want.
import requests import urllib.request
from bs4 import BeautifulSoup
url = "https://www.screener.in/company/ABB/consolidated/"
r = requests.get(url)
print(r.status_code)
html_content = r.text
soup = BeautifulSoup(html_content,"html.parser")
# print(soup)
#data_table = soup.find('table', class_ = "data-table")
# print(data_table) table_needed = soup.find("<h2>ShareholdingPattern</h2>")
#sub = table_needed.contents[0] print(table_needed)
Just use requests and pandas. Grab the last table and dump it to a .csv file.
Here's how:
import pandas as pd
import requests
df = pd.read_html(
requests.get("https://www.screener.in/company/ABB/consolidated/").text,
flavor="bs4",
)
df[-1].to_csv("last_table.csv", index=False)
Output from a .csv file:
I am working on web scraping using Python and BeautifulSoup. My purpose is to pull members data from https://thehia.org/directory?&tab=1. There are around 1685 records.
When I view the page source on my Chrome, I cannot find the table. Seems it dynamically pulls the data. But when I use the inspect option of Chrome, I can find the "membersTable" table in the div that I need.
How can I use BeautifulSoup to access that membersTable that I can access in the inspect.
You can mimic the POST request the page makes for content then use hjson to handle unquoted keys in string pulled out of response
import requests, hjson
import pandas as pd
data = {'formId': '3721260'}
r = requests.post('https://thehia.org/Sys/MemberDirectory/LoadMembers', data=data)
data = hjson.loads(r.text.replace('while(1); ',''))
total = data['TotalCount']
structure = data['JsonStructure']
members = hjson.loads(structure)
df = pd.DataFrame([[member[k][0]['v'] for k in member.keys()] for member in members['members'][0]]
,columns = ['Organisation', 'City', 'State','Country'])
print(df)
Try this one
import requests
from bs4 import BeautifulSoup
url = "https://thehia.org/directory?&tab=1"
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html)
table = soup.find('table', attrs={'class': 'membersTable'})
row_list = []
for row in table.findAll('tr',{'class':['normal']}):
data= []
for cell in row.findAll('td'):
data.append(cell.text)
row_list.append(data)
print(row_list)
I have this small piece of code to scrape table data from a web site and then display in a csv format. The issue is that for loop is printing the records multiple time . I am not sure if it is due to tag. btw I am new to Python. Thanks for your help!
#import needed libraries
import urllib
from bs4 import BeautifulSoup
import requests
import pandas as pd
import csv
import sys
import re
# read the data from a URL
url = requests.get("https://www.top500.org/list/2018/06/")
# parse the URL using Beauriful Soup
soup = BeautifulSoup(url.content, 'html.parser')
newtxt= ""
for record in soup.find_all('tr'):
tbltxt = ""
for data in record.find_all('td'):
tbltxt = tbltxt + "," + data.text
newtxt= newtxt+ "\n" + tbltxt[1:]
print(newtxt)
from bs4 import BeautifulSoup
import requests
url = requests.get("https://www.top500.org/list/2018/06/")
soup = BeautifulSoup(url.content, 'html.parser')
table = soup.find_all('table', attrs={'class':'table table-condensed table-striped'})
for i in table:
tr = i.find_all('tr')
for x in tr:
print(x.text)
Or the best way to parse table using pandas
import pandas as pd
table = pd.read_html('https://www.top500.org/list/2018/06/', attrs={
'class': 'table table-condensed table-striped'}, header = 1)
print(table)
It's printing much of the data multiple times because the newtext variable, which you are printing after getting the text of each <td></td>, is just accumulating all the values. Easiest way to get this to work is probably to just move the line print(newtxt) outside of both for loops - that is, leave it totally unindented. You should then see a list of all the text, with that from each row on a new line, and that from each individual cell in a row separated by commas.
I am trying to extract the first and third columns of this data table using BeautifulSoup. From looking at the HTML the first column has a <th> tag. The other column of interest has as <td> tag. In any case, all I've been able to get out is a list of the column with the tags. But, I just want the text.
table is already a list so I can't use findAll(text=True). I'm not sure how to get the listing of the first column in another form.
from BeautifulSoup import BeautifulSoup
from sys import argv
import re
filename = argv[1] #get HTML file as a string
html_doc = ''.join(open(filename,'r').readlines())
soup = BeautifulSoup(html_doc)
table = soup.findAll('table')[0].tbody.th.findAll('th') #The relevant table is the first one
print table
You can try this code:
import urllib2
from BeautifulSoup import BeautifulSoup
url = "http://www.samhsa.gov/data/NSDUH/2k10State/NSDUHsae2010/NSDUHsaeAppC2010.htm"
soup = BeautifulSoup(urllib2.urlopen(url).read())
for row in soup.findAll('table')[0].tbody.findAll('tr'):
first_column = row.findAll('th')[0].contents
third_column = row.findAll('td')[2].contents
print first_column, third_column
As you can see the code just connects to the url and gets the html, and the BeautifulSoup finds the first table, then all the 'tr' and selects the first column, which is the 'th', and the third column, which is a 'td'.
In addition to #jonhkr's answer I thought I'd post an alternate solution I came up with.
#!/usr/bin/python
from BeautifulSoup import BeautifulSoup
from sys import argv
filename = argv[1]
#get HTML file as a string
html_doc = ''.join(open(filename,'r').readlines())
soup = BeautifulSoup(html_doc)
table = soup.findAll('table')[0].tbody
data = map(lambda x: (x.findAll(text=True)[1],x.findAll(text=True)[5]),table.findAll('tr'))
print data
Unlike jonhkr's answer, which dials into the webpage, mine assumes that you have it save on your computer and pass it as a command line argument. For example:
python file.py table.html
you can try this code also
import requests
from bs4 import BeautifulSoup
page =requests.get("http://www.samhsa.gov/data/NSDUH/2k10State/NSDUHsae2010/NSDUHsaeAppC2010.htm")
soup = BeautifulSoup(page.content, 'html.parser')
for row in soup.findAll('table')[0].tbody.findAll('tr'):
first_column = row.findAll('th')[0].contents
third_column = row.findAll('td')[2].contents
print (first_column, third_column)