Using Beautiful Soup to find a specific class - Python

I am trying to use Beautiful Soup to scrape housing price data from Zillow.
I fetch the page by property ID, e.g. http://www.zillow.com/homes/for_sale/18429834_zpid/
When I try the find_all() function, I get no results:
results = soup.find_all('div', attrs={"class":"home-summary-row"})
However, if I take the HTML and cut it down to just the bits I want, e.g.:
<html>
<body>
<div class=" status-icon-row for-sale-row home-summary-row">
</div>
<div class=" home-summary-row">
<span class=""> $1,342,144 </span>
</div>
</body>
</html>
I get 2 results, both <div>s with the class home-summary-row. So, my question is, why do I not get any results when searching the full page?
Working example:
from bs4 import BeautifulSoup
import requests
zpid = "18429834"
url = "http://www.zillow.com/homes/" + zpid + "_zpid/"
response = requests.get(url)
html = response.content
#html = '<html><body><div class=" status-icon-row for-sale-row home-summary-row"></div><div class=" home-summary-row"><span class=""> $1,342,144 </span></div></body></html>'
soup = BeautifulSoup(html, "html5lib")
results = soup.find_all('div', attrs={"class":"home-summary-row"})
print(results)

The HTML you are parsing is not well-formed, and in cases like this choosing the right parser is crucial. BeautifulSoup currently has three available HTML parsers, which handle broken HTML differently:
html.parser (built-in, no additional modules needed)
lxml (the fastest, requires lxml to be installed)
html5lib (the most lenient, requires html5lib to be installed)
The Differences between parsers documentation page describes this in more detail. To demonstrate the difference in your case:
>>> from bs4 import BeautifulSoup
>>> import requests
>>>
>>> zpid = "18429834"
>>> url = "http://www.zillow.com/homes/" + zpid + "_zpid/"
>>> response = requests.get(url)
>>> html = response.content
>>>
>>> len(BeautifulSoup(html, "html5lib").find_all('div', attrs={"class":"home-summary-row"}))
0
>>> len(BeautifulSoup(html, "html.parser").find_all('div', attrs={"class":"home-summary-row"}))
3
>>> len(BeautifulSoup(html, "lxml").find_all('div', attrs={"class":"home-summary-row"}))
3
As you can see, in your case, both html.parser and lxml do the job, but html5lib does not.
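If you want a scraper that tolerates this, one option is to try several parsers and keep the first one that finds what you need. A minimal sketch (the parse_with_fallback helper is my own name, not part of any answer here):
from bs4 import BeautifulSoup

def parse_with_fallback(html, parsers=("lxml", "html.parser", "html5lib")):
    # Try each parser in turn; return the first soup that yields matches.
    for parser_name in parsers:
        try:
            soup = BeautifulSoup(html, parser_name)
        except Exception:
            # Parser not installed (bs4 raises FeatureNotFound); try the next one.
            continue
        if soup.find_all('div', attrs={"class": "home-summary-row"}):
            return soup
    return None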

import requests
from bs4 import BeautifulSoup
zpid = "18429834"
url = "http://www.zillow.com/homes/" + zpid + "_zpid/"
r = requests.get(url)
soup = BeautifulSoup(r.content, "lxml")
g_data = soup.find_all("div", {"class": "home-summary-row"})
print(g_data[1].text)

# for item in g_data:
#     print(item("span")[0].text)
#     print()
I got this working too, but it looks like someone beat me to it. Going to post anyway.

According to the W3.org Validator, there are a number of issues with the HTML such as stray closing tags and tags split across multiple lines. For example:
<a
href="http://www.zillow.com/danville-ca-94526/sold/" title="Recent home sales" class="" data-za-action="Recent Home Sales" >
This kind of markup can make it much more difficult for BeautifulSoup to parse the HTML.
You may want to try running something to clean up the HTML, such as removing the line breaks and trailing spaces from the end of each line. BeautifulSoup can also clean up the HTML tree for you:
from bs4 import BeautifulSoup
tree = BeautifulSoup(bad_html, "html.parser")
good_html = tree.prettify()
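For the whitespace cleanup suggested above, a rough sketch (assuming bad_html holds the raw page HTML, as in the snippet above) might be:
from bs4 import BeautifulSoup

# Strip per-line leading/trailing whitespace and join the lines back together,
# so tags split across several lines end up on one line before parsing.
cleaned = " ".join(line.strip() for line in bad_html.splitlines())
tree = BeautifulSoup(cleaned, "html.parser")
good_html = tree.prettify()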

Related

Finding number of pages using Python BeautifulSoup

I want to extract the total page number (11 in this case) from a Steam page. I believe the following code should work (return 11), but it returns an empty list, as if it is not finding the paged_items_paging_pagelink class.
import requests
import re
from bs4 import BeautifulSoup
r = requests.get('http://store.steampowered.com/tags/en-us/RPG/')
c = r.content
soup = BeautifulSoup(c, 'html.parser')
total_pages = soup.find_all("span",{"class":"paged_items_paging_pagelink"})[-1].text
If you check the page source, the content you want is not available there; it is generated dynamically with JavaScript.
The page numbers are located inside the <span id="NewReleases_links"> tag, but in the page source the HTML shows only this:
<span id="NewReleases_links"></span>
The easiest way to handle this is to use Selenium.
However, the page source does contain the text "Showing 1-20 of 213 results", so you can scrape that instead and calculate the number of pages from it.
Required HTML:
<div class="paged_items_paging_summary ellipsis">
Showing
<span id="NewReleases_start">1</span>
-
<span id="NewReleases_end">20</span>
of
<span id="NewReleases_total">213</span>
results
</div>
Code:
import requests
from bs4 import BeautifulSoup
r = requests.get('http://store.steampowered.com/tags/en-us/RPG/')
soup = BeautifulSoup(r.text, 'lxml')
def get_pages_no(soup):
    total_items = int(soup.find('span', id='NewReleases_total').text)
    items_per_page = int(soup.find('span', id='NewReleases_end').text)
    return round(total_items / items_per_page)

print(get_pages_no(soup))
# prints 11
(Note: I still recommend the use of Selenium, as most of the content from this site is dynamically generated. It'll be a pain to scrape all the data like this.)
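If you do go the Selenium route, a minimal sketch might look like the following (this assumes a Chrome driver is installed; the class name is the same paged_items_paging_pagelink from the question):
from selenium import webdriver
from bs4 import BeautifulSoup
import time

driver = webdriver.Chrome()
driver.get('http://store.steampowered.com/tags/en-us/RPG/')
time.sleep(3)  # crude wait for the JavaScript to render; a WebDriverWait would be cleaner
soup = BeautifulSoup(driver.page_source, 'lxml')
driver.quit()

links = soup.find_all("span", {"class": "paged_items_paging_pagelink"})
print(links[-1].text if links else "no paging links found")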
An alternative faster way without using BeautifulSoup:
import requests
url = "http://store.steampowered.com/contenthub/querypaginated/tags/NewReleases/render/?query=&start=20&count=20&cc=US&l=english&no_violence=0&no_sex=0&v=4&tag=RPG" # This returns your query in json format
r = requests.get(url)
print(round(r.json()['total_count'] / 20))  # total_count = number of records, 20 = items shown per page
11
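One small caveat with both snippets (my note, not part of the answers): round() only gives the right page count here because 213/20 happens to round up; math.ceil is the safer way to turn an item count into a page count:
import math
print(math.ceil(213 / 20))  # 11; also correct for e.g. 201 items, where round() would give 10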

Can't get item with python beautifulsoup

I'm trying to learn how to webscrape with beautifulsoup + python, and I want to grab the name of the cinematographer from https://letterboxd.com/film/donnie-darko/ but I can't figure out how to isolate the text. The html for what I want is written as below, what I want to output is "Steven Poster":
<h3><span>Cinematography</span></h3>
<div class="text-sluglist">
<p>
Steven Poster
</p>
</div>
within my code I've done soup.find(text="Cinematography"), and a mixture of different things like trying to find item or get_text from within the a and p tags, but ...
I would use a regex to parse the soup object for a link that contains "cinematography".
import re
import requests
from bs4 import BeautifulSoup
r = requests.get('https://letterboxd.com/film/donnie-darko/')
soup = BeautifulSoup(r.text, 'lxml')
cinematographer = soup(href=re.compile(r'/cinematography/'))[0].text
print(cinematographer)
# outputs "Steven Poster"
You can do the same without using regex as well:
import requests
from bs4 import BeautifulSoup
res = requests.get('https://letterboxd.com/film/donnie-darko/')
soup = BeautifulSoup(res.text,'lxml')
item = soup.select("[href*='cinematography']")[0].text
print(item)
Output:
Steven Poster
Use a CSS attribute ("contains") selector with select_one (note that find() does not accept CSS selectors):
soup.select_one('a[href*="cinematography"]').text
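In context, that would be something like this short sketch, using the same page as the answers above:
import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get('https://letterboxd.com/film/donnie-darko/').text, 'lxml')
print(soup.select_one('a[href*="cinematography"]').text)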

Beautiful Soup not returning everything in HTML file?

HTML noob here, so I could be misunderstanding something about the HTML document, so bear with me.
I'm using Beautiful Soup to parse web data in Python. Here is my code:
import urllib
import BeautifulSoup
url = "http://www.nba.com/gameline/20160323/"
page = urllib.urlopen(url).read()
soup = BeautifulSoup.BeautifulSoup(page)
indicateGameDone = str(soup.find("div", {"class": "nbaModTopStatus"}))
print indicateGameDone
Now, if you look at the website, the HTML code has the line <p class="nbaLiveStatTxSm"> FINAL </p> (inspect the 'Final' text on the left side of the container of the first ATL-WAS game on the page to see it for yourself). But when I run the code above, it doesn't return the 'FINAL' that is seen on the webpage; instead the nbaLiveStatTxSm class is empty.
On my machine, this is the output when I print indicateGameDone:
<div class="nbaModTopStatus"><p class="nbaLiveStatTx">Live</p><p class="nbaLiveStatTxSm"></p><p class="nbaFnlStatTx">Final</p><p class="nbaFnlStatTxSm"></p></div>
Does anyone know why this is happening?
EDIT: clarification: the problem isn't retrieving the text within the tag, the problem is that when I take the html code from the website and print it out in python, something that I saw when I inspected the element on the web is not there in the print statement in Python.
You can use this logic to extract the text between any pair of tags.
Output - FINAL
import urllib
from bs4 import BeautifulSoup
url = "http://www.nba.com/gameline/20160323/"
page = urllib.urlopen(url)
soup = BeautifulSoup(page)
indicateGameDone = soup.find("div", {"class": "nbaFnlStatTx"})
for p in indicateGameDone:
    p_text = soup.find("p", {"class": "nbaFnlStatTxSm"})
    print(p_text.getText())
    break
It looks like your problem is not with BeautifulSoup but instead with urllib.
Try running the following commands
>>> import urllib
>>> url = "http://www.nba.com/gameline/20160323/"
>>> page = urllib.urlopen(url).read()
>>> page.find('<div class="nbaModTopStatus">')
44230
Which is no surprise, considering that Beautiful Soup was able to find the div itself. However, when we look a little deeper into what urllib is actually collecting, we can see that <p class="nbaFnlStatTxSm"> is empty by running:
>>> page[44230:45000]
'<div class="nbaModTopStatus"><p class="nbaLiveStatTx">Live</p><p class="nbaLiveStatTxSm"></p><p class="nbaFnlStatTx">Final</p><p class="nbaFnlStatTxSm"></p></div><div id="nbaGLBroadcast"><img src="/.element/img/3.0/sect/gameline/broadcasters/lp.png"></div><div class="nbaTeamsRow"><div class="nbaModTopTeamScr nbaModTopTeamAw"><h5 class="nbaModTopTeamName awayteam">ATL</h5><img src="http://i.cdn.turner.com/nba/nba/.element/img/2.0/sect/gameline/teams/ATL.gif" width="34" height="22" title="Atlanta Hawks"><h4 class="nbaModTopTeamNum win"></h4></div><div class="nbaModTopTeamScr nbaModTopTeamHm"><h5 class="nbaModTopTeamName hometeam">WAS</h5><img src="http://i.cdn.turner.com/nba/nba/.element/img/2.0/sect/gameline/teams/WAS.gif" width="34" '
You can see that the tag is empty, so your problem is the data that's being passed to Beautiful Soup, not the package itself.
With some minor modifications to your code, as listed below, it runs for me:
changed the import of BeautifulSoup to the proper syntax for the current version (bs4)
corrected the way you were constructing the BeautifulSoup object
fixed your find statement, then used the .text attribute to get the string representation of the text in the HTML you're after
import urllib
from bs4 import BeautifulSoup
url = "http://www.nba.com/gameline/20160323/"
page = urllib.urlopen(url).read()
soup = BeautifulSoup(page)
indicateGameDone = soup.find("div", {"class": "nbaModTopStatus"})
print indicateGameDone.text ## "LiveFinal "
to address comments:
import urllib
from bs4 import BeautifulSoup
url = "http://www.nba.com/gameline/20160323/"
page = urllib.urlopen(url).read()
soup = BeautifulSoup(page)
indicateGameDone = soup.find("p", {"class": "nbaFnlStatTx"})
print indicateGameDone.text
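If you need the status of every game on the page rather than just the first one, find_all works the same way; here is a small sketch using the requests/Python 3 style from earlier in this thread rather than urllib:
import requests
from bs4 import BeautifulSoup

page = requests.get("http://www.nba.com/gameline/20160323/").content
soup = BeautifulSoup(page, "html.parser")
# One nbaFnlStatTx <p> per finished game on the scoreboard.
for status in soup.find_all("p", {"class": "nbaFnlStatTx"}):
    print(status.text)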

Tags are converted to HTML entities?

I'm trying to use BeautifulSoup to parse some dirty HTML. One such HTML is http://f10.5post.com/forums/showthread.php?t=1142017
What happens is that, firstly, the tree misses a large chunk of the page. Secondly, tostring(tree) converts tags like <div> on half of the page into HTML entities like &lt;/div&gt;. For instance:
Original:
<div class="smallfont" align="center">All times are GMT -4. The time now is <span class="time">02:12 PM</span>.</div>
tostring(tree) gives:
&lt;div class="smallfont" align="center"&gt;All times are GMT -4. The time now is &lt;span class="time"&gt;02:12 PM&lt;/span&gt;.&lt;/div&gt;
Here's my code:
from BeautifulSoup import BeautifulSoup
import urllib2
page = urllib2.urlopen("http://f10.5post.com/forums/showthread.php?t=1142017")
soup = BeautifulSoup(page)
print soup
Thanks
Use beautifulsoup4 and an extremely lenient html5lib parser:
import urllib2
from bs4 import BeautifulSoup # NOTE: importing beautifulsoup4 here
page = urllib2.urlopen("http://f10.5post.com/forums/showthread.php?t=1142017")
soup = BeautifulSoup(page, "html5lib")
print soup
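If you are on Python 3 (where urllib2 no longer exists), the requests-based equivalent used elsewhere in this thread should work the same way:
import requests
from bs4 import BeautifulSoup

page = requests.get("http://f10.5post.com/forums/showthread.php?t=1142017")
soup = BeautifulSoup(page.content, "html5lib")
print(soup.prettify())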

I am struggling with python-html. I know the class of a certain header. I need the info from the generic <a href... in this h1

So, I have this:
<h1 class='entry-title'>
<a href='http://theurlthatvariesinlengthbasedonwhenirequesthehtml'>theTitleIneedthatvariesinlength</a>
</h1>
How can I retrieve the URL (it is not always the same) and the title (also not always the same)?
Parse it with an HTML parser, e.g. with BeautifulSoup it would be:
from bs4 import BeautifulSoup
data = "your HTML here" # data can be the result of urllib2.urlopen(url)
soup = BeautifulSoup(data)
link = soup.select("h1.entry-title > a")[0]
print link.get("href")
print link.get_text()
where h1.entry-title > a is a CSS selector matching an a element directly under h1 element with class="entry-title".
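If you prefer the find-style API over CSS selectors, the equivalent lookup (a sketch using the snippet from the question) would be:
from bs4 import BeautifulSoup

data = "<h1 class='entry-title'><a href='http://theurlthatvariesinlengthbasedonwhenirequesthehtml'>theTitleIneedthatvariesinlength</a></h1>"
soup = BeautifulSoup(data, "html.parser")
link = soup.find("h1", class_="entry-title").find("a")
print(link["href"])
print(link.get_text())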
Well, just working with strings, you can
>>> s = '''<h1 class='entry-title'>
... <a href='http://theurlthatvariesinlengthbasedonwhenirequesthehtml'>theTitleIneedthatvariesinlength</a>
... </h1>'''
>>> s.split('>')[1].strip().split('=')[1].strip("'")
'http://theurlthatvariesinlengthbasedonwhenirequesthehtml'
>>> s.split('>')[2][:-3]
'theTitleIneedthatvariesinlength'
There are other (and better) options for parsing HTML though.
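For example, lxml (already used as a bs4 backend in the answers above) can parse the same snippet directly with an XPath expression, assuming the lxml package is installed:
from lxml import html

s = "<h1 class='entry-title'><a href='http://theurlthatvariesinlengthbasedonwhenirequesthehtml'>theTitleIneedthatvariesinlength</a></h1>"
tree = html.fromstring(s)
link = tree.xpath("//h1[@class='entry-title']/a")[0]
print(link.get("href"))
print(link.text)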
