Scrape data from a sports table using Python and Beautiful Soup - python

I'm trying to scrape the data from a table, namely http://stats.nba.com/leagueTeamGeneral.html?pageNo=1&rowsPerPage=30. I'm having difficulty finding the right commands; I've tried various parameters and none worked. Ideally I'd like the data returned in the following format, for
example:
Atlanta Hawks,32, 48.8, 18, 14, .563, etc
I can get the data formatted without a problem; it's getting the required data in the first place that is causing me grief.
import urllib2
from bs4 import BeautifulSoup
page = 'http://stats.nba.com/leagueTeamGeneral.html?pageNo=1&rowsPerPage=30'
page = urllib2.urlopen(page)
soup = BeautifulSoup(page)
for dS in soup.find_all(???):
    print(dS.get(???))

Use a tool like Firefox's Firebug to track down the HTML call you need. Looking at the link you shared in Firebug's 'Net' tab shows that the data you are after is in a subsequent request to http://www.nba.com/cmsinclude/desktopWrapperHeader_jsonp.html,
which actually contains JSON data. I'm not sure BeautifulSoup will be handy here; try loading it with Python's json module instead.

Thanks for the suggestion, it worked rather nicely. What I ended up using was something like this:
import json
from pprint import pprint
with open('NBA_DATA.json') as data_file:
    data = json.load(data_file)

#Have this here for debug purpose just to see output
pprint(data["resultSets"])

for hed in data["resultSets"]:
    s1 = hed["headers"]
    s2 = hed["rowSet"]
    #more debugging
    #pprint(hed["headers"])
    #pprint(hed["rowSet"])
    list_of_s1 = list(hed["headers"])
    list_of_s2 = list(hed["rowSet"])
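From there, to get the comma-separated lines mentioned in the question (e.g. "Atlanta Hawks,32, 48.8, ..."), a minimal sketch, assuming the JSON keeps the resultSets / headers / rowSet layout used above, could look like this:

import json

with open('NBA_DATA.json') as data_file:
    data = json.load(data_file)

for hed in data["resultSets"]:
    # column names first, then one comma-separated line per team
    print(",".join(hed["headers"]))
    for row in hed["rowSet"]:
        print(",".join(str(value) for value in row))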


HTML scraping a website with duplicated div class names

I'm currently working on HTML scraping of Baka-Updates.
However, the div class names are duplicated.
Since my goal is CSV or JSON output, I would like to use the information in [sCat] as the column names and the information in [sContent] as the values to be stored.
Is there a way to scrape this kind of website?
Thanks,
Sample
https://www.mangaupdates.com/series.html?id=75363
from lxml import html
import requests
page = requests.get('http://www.mangaupdates.com/series.html?id=153558?')
tree = html.fromstring(page.content)
#Get the name of the columns.... I hope
sCat = tree.xpath('//div[@class="sCat"]/text()')
#Get the actual data
sContent = tree.xpath('//div[@class="sContent"]/text()')
print('sCat: ', sCat)
print('sContent: ', sContent)
I tried, but I couldn't find anything that works.
@Jasper Nichol M Fabella
I tried to edit your code and got the following output. Maybe it will help.
from lxml import html
import requests
page = requests.get('http://www.mangaupdates.com/series.html?id=153558?')
tree = html.fromstring(page.content)
# print(page.content)
#Get the name of the columns.... I hope
sCat = tree.xpath('//div[@class="sCat"]')
#Get the actual data
sContent = tree.xpath('//div[@class="sContent"]')
print('sCat: ', len(sCat))
print('sContent: ', len(sContent))
json_dict = {}
for i in range(0, len(sCat)):
    # print(''.join(i.itertext()))
    sCat_text = ''.join(sCat[i].itertext())
    sContent_text = ''.join(sContent[i].itertext())
    json_dict[sCat_text] = sContent_text
print(json_dict)
I got the following output.
Hope it helps.
You can use XPath expressions to build an absolute path to what you want to scrape.
Here is an example with the requests and lxml libraries:
from lxml import html
import requests
r = requests.get('https://www.mangaupdates.com/series.html?id=75363')
tree = html.fromstring(r.content)
sCat = [i.text_content().strip() for i in tree.xpath('//div[@class="sCat"]')]
sContent = [i.text_content().strip() for i in tree.xpath('//div[@class="sContent"]')]
What are you using to scrape?
If you are using BeautifulSoup, then you can search for all matching content on the page with the find_all method and a class identifier, and iterate through that. You can use the special class_ keyword argument.
Something like:
import bs4
soup = bs4.BeautifulSoup(html.source)
soup.find_all('div', class_='sCat')
# do rest of your logic work here
Edit: I was typing on my mobile against a cached page before you made the edits, so I didn't see the changes. I see you are using the raw lxml library to parse. Yes, that's faster, but I am not too familiar with it, as I've only used raw lxml for one project. I think you can chain two search methods to distill it into something equivalent.
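For instance, a rough BeautifulSoup sketch that pairs the two searches into one dict, assuming the page really does carry matching sCat/sContent divs in the same order (the URL is the one from the question):

import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.mangaupdates.com/series.html?id=75363')
soup = BeautifulSoup(r.content, 'html.parser')

cats = soup.find_all('div', class_='sCat')          # column names
contents = soup.find_all('div', class_='sContent')  # matching values

# pair the two lists positionally, same idea as the lxml loop above
record = {c.get_text(strip=True): v.get_text(strip=True)
          for c, v in zip(cats, contents)}
print(record)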

Python web scraping - how to get resources with Beautiful Soup when the page loads content via JS?

So I am trying to scrape a table from a specific website using BeautifulSoup and urllib. My goal is to create a single list from all the data in this table. I have tried using this same code using tables from other websites, and it works fine. However, while trying it with this website the table returns a NoneType object. Can someone help me with this? I've tried looking for other answers online but I'm not having much luck.
Here's the code:
import requests
import urllib.request
from bs4 import BeautifulSoup
soup = BeautifulSoup(urllib.request.urlopen("http://www.teamrankings.com/ncaa-basketball/stat/free-throw-pct").read())
table = soup.find("table", attrs={'class':'sortable'})
data = []
rows = table.findAll("tr")
for tr in rows:
    cols = tr.findAll("td")
    for td in cols:
        text = ''.join(td.find(text=True))
        data.append(text)
print(data)
It looks like this data is loaded via an AJAX call.
You should target that URL instead: http://www.teamrankings.com/ajax/league/v3/stats_controller.php
import requests
import urllib.request
import urllib.parse
from bs4 import BeautifulSoup
params = {
    "type":"team-detail",
    "league":"ncb",
    "stat_id":"3083",
    "season_id":"312",
    "cat_type":"2",
    "view":"stats_v1",
    "is_previous":"0",
    "date":"04/06/2015"
}
content = urllib.request.urlopen("http://www.teamrankings.com/ajax/league/v3/stats_controller.php",data=urllib.parse.urlencode(params).encode('utf8')).read()
soup = BeautifulSoup(content)
table = soup.find("table", attrs={'class':'sortable'})
data = []
rows = table.findAll("tr")
for tr in rows:
    cols = tr.findAll("td")
    for td in cols:
        text = ''.join(td.find(text=True))
        data.append(text)
print(data)
Using your web inspector you can also view the parameters that are passed along with the POST request.
Generally the server on the other end will check for these values and reject your request if you do not have some or all of them. The above code snippet ran fine for me; I used urllib here because I generally prefer that library.
If the data loads in your browser, it is possible to scrape it. You just need to mimic the request your browser sends.
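For what it's worth, a hedged sketch of the same POST made with the requests library instead of urllib; the parameter values are simply copied from the snippet above and may well change over time:

import requests
from bs4 import BeautifulSoup

params = {
    "type": "team-detail", "league": "ncb", "stat_id": "3083",
    "season_id": "312", "cat_type": "2", "view": "stats_v1",
    "is_previous": "0", "date": "04/06/2015",
}
# requests encodes the form body for you, so no urlencode call is needed
r = requests.post("http://www.teamrankings.com/ajax/league/v3/stats_controller.php", data=params)
soup = BeautifulSoup(r.content, "html.parser")
table = soup.find("table", attrs={"class": "sortable"})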
The table on that website is being created via JavaScript, and so does not exist when you simply throw the source code at BeautifulSoup.
Either you need to start poking around with your web inspector of choice and find out where the JavaScript is getting the data from, or you should use something like Selenium to run a complete browser instance.
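If you go the Selenium route, a minimal sketch (not the original poster's code; it assumes a ChromeDriver install and reuses the 'sortable' class from the question) might look like this:

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()   # needs chromedriver on your PATH
driver.get("http://www.teamrankings.com/ncaa-basketball/stat/free-throw-pct")

# the browser has executed the JavaScript, so the table now exists in the page source
soup = BeautifulSoup(driver.page_source, "html.parser")
table = soup.find("table", attrs={"class": "sortable"})
driver.quit()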
Since the table data is loaded dynamically, there can be some lag in updating the table data for multiple reasons, such as network delay. So you can wait by adding a delay before reading the data.
Check whether the table data (i.e. its length) is empty; if so, read the table data again after some delay. This will help.
I also looked at the URL you used: since you are using a class selector for the table, check whether that class is also used in other places in the HTML.
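If you end up driving a real browser with Selenium (as suggested above), one way to express that wait-and-recheck idea is an explicit wait on the table; this is only a sketch, with the selector assumed from the class name in the question:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()
driver.get("http://www.teamrankings.com/ncaa-basketball/stat/free-throw-pct")

# block for up to 10 seconds until the table has actually been rendered
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "table.sortable")))

rows = driver.find_elements(By.CSS_SELECTOR, "table.sortable tr")
print(len(rows))   # if this is still 0, wait a little longer and read again
driver.quit()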

urllib keeps freezing while trying to pull HTML data from a website - is my code correct?

I'm trying to build a simple Python script algorithm on Mac OS X that has four parts to it.
1. go to a defined website and grab all the HTML using urllib
2. parse the HTML data to find a table of numbers (using BeautifulSoup)
3. with those numbers, do a simple calculation
4. print out the results in a table in numerical order
I'm having trouble with step 1. I can grab the data with urllib using this code:
import urllib.request
y=urllib.request.urlopen('my target website url')
x=y.read()
print(x)
But it keeps freezing once it has returned the HTML, and the Python shell becomes unresponsive.
Since you mentioned requests, I think it's a great solution.
import requests
from bs4 import BeautifulSoup
r = requests.get('http://example.com')
html = r.content
soup = BeautifulSoup(html)
table = soup.find("table", {"id": "targettable"})
As suggested by jonrsharpe, if you're concerned about the size of the response returned by that url, you can check the size first before printing or parsing.
With requests:
r = requests.get('http://example.com')
print r.headers['content-length']
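As a rough sketch of that check, a streaming request lets you look at the headers before deciding to download and parse the body (the 5 MB cut-off below is an arbitrary example, not part of the original answer):

import requests

r = requests.get('http://example.com', stream=True)   # headers arrive, body not read yet
size = r.headers.get('content-length')                # may be None for chunked responses
print(size)

if size is None or int(size) < 5 * 1024 * 1024:       # only parse responses under ~5 MB
    html = r.content                                   # the body is actually fetched here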

BeautifulSoup not parsing the entire page content

I am trying to get a set of URLs (which are webpages) from The New York Times, but I get a different answer; I am sure that I gave the correct class, yet it extracts different classes. My ny_url.txt has "http://query.nytimes.com/search/sitesearch/?action=click&region=Masthead&pgtype=SectionFront&module=SearchSubmit&contentCollection=us&t=qry900#/isis; http://query.nytimes.com/search/sitesearch/?action=click&region=Masthead&pgtype=SectionFront&module=SearchSubmit&contentCollection=us&t=qry900#/isis/since1851/allresults/2/"
Here is my code:
import urllib2
import urllib
from cookielib import CookieJar
from bs4 import BeautifulSoup
cj = CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
text_file = open('ny_url.txt', 'r')
for line in text_file:
    print line
    soup = BeautifulSoup(opener.open(line))
    links = soup.find_all('div', attrs = {'class' : 'element2'})
    for href in links:
        print href
Well, it's not that simple.
The data you are looking for is not in the page source downloaded by urllib2.
Try printing opener.open(line).read() and you will find the data to be missing.
This is because the site makes another GET request to http://query.nytimes.com/svc/cse/v2pp/sitesearch.json?query=isis&page=1
where your query parameters, query=isis and page=1, are passed within the URL.
The data fetched is in JSON format; try opening the URL above in the browser manually and you will find your data there.
So a pure Pythonic way would be to call this URL and parse the JSON to get what you want.
No rocket science needed - just parse the dict using the proper keys.
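As a minimal sketch of that pure-Python route, using urllib2 and json as elsewhere in this question; the keys inside the returned dict aren't documented here, so the sketch only prints the top-level keys to explore:

import json
import urllib2

url = 'http://query.nytimes.com/svc/cse/v2pp/sitesearch.json?query=isis&page=1'
response = urllib2.urlopen(url)
data = json.load(response)   # the endpoint returns JSON, not HTML

# inspect the structure first, then drill down with the proper keys
print data.keys()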
OR
An easier way would be to use a webdriver like Selenium: navigate to the page and parse the page source using BeautifulSoup. That should easily fetch the entire content.
Hope that helps. Let me know if you need more insights.

Python - getting information from nodes

I've been trying to get information from a site, and recently found out that it is stored in childNodes[0].data.
I'm pretty new to python and never tried scripting against websites.
Somebody told me I could make a tmp.xml file and extract the information from there, but since it's only getting the source code (which I think is of no use to me), I don't get any results.
Current code:
response = urllib2.urlopen(get_link)
html = response.read()
with open("tmp.xml", "w") as f:
    f.write(html)
dom = parse("tmp.xml")
name = dom.getElementsByTagName("name[0].firstChild.nodeValue")
I've also tried using 'dom = parse(html)' with no better result.
getElementsByTagName() takes an element name, not an expression. It is highly unlikely that the page you are loading contains any <name[0].firstChild.nodeValue> tags.
If you are loading HTML, use an HTML parser instead, like BeautifulSoup. For XML, the ElementTree API is a lot easier to use than the (archaic and very verbose) DOM API.
Neither approach requires that you first save the source to disk, both APIs can parse directly from the response object returned by urllib2.
# HTML
import urllib2
from bs4 import BeautifulSoup
response = urllib2.urlopen(get_link)
soup = BeautifulSoup(response.read(), from_encoding=response.headers.getparam('charset'))
print soup.find('title').text
or
# XML
import urllib2
from xml.etree import ElementTree as ET
response = urllib2.urlopen(get_link)
tree = ET.parse(response)
print tree.find('elementname').text
