Python web scraping

When using this regular expression in Python:
import re

pathstring = '<span class="titletext">(.*)</span>'
pathFinderTitle = re.compile(pathstring)
My output is:
Govt has nothing to do with former CAG official RP Singh:
Sibal</span></a></h2></div><div class="esc-lead-article-source-wrapper">
<table class="al-attribution single-line-height" cellspacing="0" cellpadding="0">
<tbody><tr><td class="al-attribution-cell source-cell">
<span class='al-attribution-source'>Times of India</span></td>
<td class="al-attribution-cell timestamp-cell">
<span class='dash-separator'> - </span>
<span class='al-attribution-timestamp'>46 minutes ago
The match should have stopped at the first </span>.
Please suggest what's wrong here.

.* is a greedy match of any characters; it is going to consume as many characters as possible. Instead, use the non-greedy version .*?, as in
pathstring = '<span class="titletext">(.*?)</span>'
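For example, here is a minimal demonstration of the difference (the sample HTML is made up):
import re

# Hypothetical sample resembling the page in question.
html = '<span class="titletext">Title one</span> ... <span class="titletext">Title two</span>'

greedy = re.findall('<span class="titletext">(.*)</span>', html)
lazy = re.findall('<span class="titletext">(.*?)</span>', html)

print(greedy)  # ['Title one</span> ... <span class="titletext">Title two'] - runs to the last </span>
print(lazy)    # ['Title one', 'Title two'] - stops at the first </span>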

I would suggest using pyquery instead of going mad on regular expressions... It's based on lxml and makes HTML parsing as easy as using jQuery.
Something like this is all you need:
from pyquery import PyQuery

doc = PyQuery(html)
doc('span.titletext').text()
You could also use beautifulsoup, but the result is always the same: don't use regular expressions for parsing HTML, there are tools out there for making your life easier.

.* will match </span> so it keeps on going until the last one.
The best answer is: Don't parse html with regular expressions. Use the lxml library (or something similar).
from lxml import html
html_string = '<blah>'
tree = html.fromstring(html_string)
titles = tree.xpath("//span[@class='titletext']")
for title in titles:
    print title.text
Using a proper xml/html parser will save you massive amounts of time and trouble. If you roll your own parser, you'll have to cater for malformed tags, comments, and myriad other things. Don't reinvent the wheel.

You could also just as easily use BeautifulSoup which is great for doing this kind of thing.
# using BeautifulSoup4; install with "pip install BeautifulSoup4"
from bs4 import BeautifulSoup
soup = BeautifulSoup(html)
result = soup.find('span', 'titletext')
And then result would hold the <span> with class titletext as you're looking for.
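If you then want just the title string, the tag's text attribute holds it (a small sketch; result is None when nothing matched):
if result is not None:
    print(result.text)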

Related

get div from HTML with Python

I want to get a value inside a certain div from an HTML page
<div class="well credit">
<div class="span2">
<h3><span>
$ 5.402
</span></h3>
</div>
</div>
I've done it with regular expressions (re.search()) but it takes too long to find the div since the HTML is huge.
Is there a way to do this faster but with no external libraries?
Thanks
I would use BeautifulSoup!
To get everything with a <div> tag, just do:
soup = BeautifulSoup(html)  # make soup that is parse-able by bs
soup.findAll('div')
To get the value inside the span you could do:
soup.find('span').get_text()
There are tons of different methods of getting the information you need.
Good luck, hope this helps!
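Putting those together for the snippet in the question (a sketch):
from bs4 import BeautifulSoup

html = '''
<div class="well credit">
<div class="span2">
<h3><span>
$ 5.402
</span></h3>
</div>
</div>
'''

soup = BeautifulSoup(html)
# Narrow the search to the outer div first, then read the span's text.
well = soup.find('div', 'well credit')
print(well.find('span').get_text().strip())  # $ 5.402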
Python has only one HTML parser in the standard library and it's pretty low-level, so you'll have to install some sort of HTML parsing library if you want to work with HTML.
lxml is by far the fastest:
import lxml.html
root = lxml.html.parse(handle)
price = root.xpath('//div[@class="well credit"]//span/text()')[0]
If you want it to be even faster, use root.iter and stop parsing the HTML once you hit the right element.
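A sketch of that early-exit idea with iter (html_source is an assumed variable holding the page as a string):
import lxml.html

doc = lxml.html.fromstring(html_source)  # html_source: the raw page (assumption)
for div in doc.iter('div'):
    if div.get('class') == 'well credit':
        price = div.findtext('.//span')  # text of the first span underneath
        break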
Scrapy might also be a solution for this. Please read http://doc.scrapy.org/en/latest/topics/selectors.html
x = sel.xpath('//div[@class="span2"]')
for i in x:
    print i.extract()
Output:
<div class="span2">
<h3><span>
$ 5.402
</span></h3>
</div>

Using Beautiful Soup for HTML tables that lack </td> tags

I'm struggling with parsing some flaky HTML tables down to lists with Beautiful Soup. The tables in question lack a </td> tag.
Using the following code (not the real tables I'm parsing, but functionally similar):
import bs4
test = "<table> <tr><td>1<td>2<td>3</tr> <tr><td>1<td>2<td>3</tr> </table>"
def walk_table2(text):
    "Take an HTML table and spit out a list of lists (of entries in a row)."
    soup = bs4.BeautifulSoup(text)
    return [[x for x in row.findAll('td')] for row in soup.findAll('tr')]

print walk_table2(test)
Gives me:
[[<td>1<td>2<td>3</td></td></td>, <td>2<td>3</td></td>, <td>3</td>], [<td>1<td>2<td>3</td></td></td>, <td>2<td>3</td></td>, <td>3</td>]]
Rather than the expected:
[[<td>1</td>, <td>2</td>, <td>3</td>], [<td>1</td>, <td>2</td>, <td>3</td>]]
It seems that the lxml parser that Beautiful Soup is using decides to add the </td> tag before the next instance of </tr> rather than the next instance of <td>.
At this point, I'm wondering if there is a good option to make the parser place the closing td tags in the correct location, or if it would be easier to use a regular expression to place them manually before tossing the string into BeautifulSoup... Any thoughts? Thanks in advance!
You're seeing decisions made by Python's built-in HTML parser. If you don't like the way that parser does things, you can tell Beautiful Soup to use a different parser. The html5lib parser and the lxml parser both give the result you want:
>>> soup = bs4.BeautifulSoup(test, "lxml")
>>> [[x for x in row.findAll('td')] for row in soup.findAll('tr')]
[[<td>1</td>, <td>2</td>, <td>3</td>], [<td>1</td>, <td>2</td>, <td>3</td>]]
>>> soup = bs4.BeautifulSoup(test, "html5lib")
>>> [[x for x in row.findAll('td')] for row in soup.findAll('tr')]
[[<td>1</td>, <td>2</td>, <td>3</td>], [<td>1</td>, <td>2</td>, <td>3</td>]]
This sounds like a BeautifulSoup bug to me. I found this page detailing why there are regressions in BS 3.1 relative to 3.0.8 (including "'bad end tag' errors"), which suggests that, for parsing bad HTML, one option would be to jump back several versions. That said, the page says it has been superseded and now exists only for historical reference. It's unclear, however, exactly how much BS4 resolves the issues introduced in BS 3.1; at the very least, it couldn't hurt to try the older version.
A patchy fix to get you through this particular pinch:
Massage the incoming data with a regex (this is VERY brittle, and I know how stackoverflow feels about regexes and html but C'MON, just this one time...)
import re
r1 = re.compile(r'(?<!<tr>)<td', re.IGNORECASE)
r2 = re.compile(r'</tr>', re.IGNORECASE)
test = "<table> <tr><td>1<td>2<td>3</tr> <tr><td>1<td>2<td>3</tr> </table>"
test = r1.sub('</td><td', test)
test = r2.sub('</td></tr>', test)
print test
Oh, and test afterwards:
<table> <tr><td>1</td><td>2</td><td>3</td></tr> <tr><td>1</td><td>2</td><td>3</td></tr> </table>

How to use Beautiful Soup to find a tag with changing id?

I am using Beautiful Soup in Python.
Here is an example URL:
http://www.locationary.com/place/en/US/Ohio/Middletown/McDonald%27s-p1013254580.jsp
In the HTML, there are a bunch of tags and the only way I can specify which ones to find is with their id. The only thing I want to find is the telephone number. The tag looks like this:
<td class="dispTxt" id="value_xxx_c_1_f_8_a_134242498">5134231582</td>
I have gone to other URLs on the same website and found almost the same id for the telephone number tag every time. The part that always stays the same is:
'value_xxx_c_1_f_8_a_'
However, the numbers that come after that always change. Is there a way that I can tell Beautiful Soup to look for part of the id and match it and let the other part be numbers like a regular expression could?
Also, once I get the tag, I was wondering...how can I extract the phone number without using regular expressions? I don't know if Beautiful Soup can do that but it would probably be simpler than regex.
You can use regular expressions (this example matches on tag names; you would need to adjust it to match on an element's id):
import re
for tag in soup.find_all(re.compile("^value_xxx_c_1_f_8_a_")):
    print(tag.name)
Know your documentation
http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html
soup.findAll(id=re.compile("para$"))
# [<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>,
# <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>]
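Applied to this question's id prefix, that becomes (a sketch; it assumes the changing suffix is always digits and soup is the parsed page):
import re

tag = soup.find(id=re.compile(r'^value_xxx_c_1_f_8_a_\d+$'))
print(tag.text)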
You can use CSS Selectors here, to match on an attribute value prefix:
soup.select('td[id^="value_xxx_c_1_f_8_a_"]')
This will only match <td> tags with an id attribute that starts with the string value_xxx_c_1_f_8_a_.
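To pull the number out of the first match (a sketch; soup is the parsed page):
cells = soup.select('td[id^="value_xxx_c_1_f_8_a_"]')
if cells:
    print(cells[0].get_text())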
If you are willing to switch to lxml instead, you can use an XPath 1.0 expression to find these:
from lxml import etree
doc = etree.parse(openfile)
for elem in doc.xpath('//td[starts-with(@id, "value_xxx_c_1_f_8_a_")]'):
    print elem.text
Using an lxml XPath expression will be an order of magnitude faster than using a BeautifulSoup regular-expression match.
To get the phone number you can use the .text attribute.
tag = soup.find("foo")
phone_number = tag.text

Python: Separating an HTML snippet into paragraphs

I have a snippet of HTML that contains paragraphs. (I mean p tags.) I want to split the string into the different paragraphs. For instance:
'''
<p class="my_class">Hello!</p>
<p>What's up?</p>
<p style="whatever: whatever;">Goodbye!</p>
'''
Should become:
['<p class="my_class">Hello!</p>',
 "<p>What's up?</p>",
 '<p style="whatever: whatever;">Goodbye!</p>']
What would be a good way to approach this?
If your string only contains paragraphs, you may be able to get away with a nicely crafted regex and re.split(). However, if your string is more complex HTML, or not always valid HTML, you might want to look at the BeautifulSoup package.
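For the regex route, a split that keeps the tags could look like this (a sketch; it assumes well-formed, non-nested <p> elements, with snippet holding the HTML above):
import re

paragraphs = re.split(r'(?<=</p>)\s*(?=<p)', snippet.strip())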
BeautifulSoup usage goes like:
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(some_html)
paragraphs = list(unicode(x) for x in soup.findAll('p'))
Use lxml.html to parse the HTML into the form you want. This is essentially the same advice as the people who are recommending BeautifulSoup, except lxml is still being actively developed while BeautifulSoup development has slowed.
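A minimal sketch of the lxml route (wrapping the fragment gives it a single root; tostring serializes each <p> back to markup):
import lxml.html

snippet = '''
<p class="my_class">Hello!</p>
<p>What's up?</p>
<p style="whatever: whatever;">Goodbye!</p>
'''

# Wrap the fragment so it has a single root, then serialize each <p> back out.
root = lxml.html.fromstring('<div>%s</div>' % snippet)
paragraphs = [lxml.html.tostring(p) for p in root.findall('p')]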
Use BeautifulSoup to parse the HTML and iterate over the paragraphs.
The xml.etree (std lib) or lxml.etree (enhanced) make this easy to do, but I'm not going to get the answer cred for this because I don't remember the exact syntax. I keep mixing it up with similar packages and have to look it up afresh every time.

How to get the content of an HTML page in Python

I have downloaded the web page into an html file. I am wondering what's the simplest way to get the content of that page. By content, I mean I need the strings that a browser would display.
To be clear:
Input:
<html><head><title>Page title</title></head>
<body><p id="firstpara" align="center">This is paragraph <b>one</b>.
<p id="secondpara" align="blah">This is paragraph <b>two</b>.
</html>
Output:
Page title This is paragraph one. This is paragraph two.
Putting it together:
from BeautifulSoup import BeautifulSoup
import re
def removeHtmlTags(page):
    p = re.compile(r'''<(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])+>''')
    return p.sub('', page)

def removeHtmlTags2(page):
    soup = BeautifulSoup(page)
    return ''.join(soup.findAll(text=True))
Related
Python HTML removal
Extracting text from HTML file using Python
What is a light python library that can eliminate HTML tags? (and only text)
Remove HTML tags in AppEngine Python Env (equivalent to Ruby’s Sanitize)
RegEx match open tags except XHTML self-contained tags (famous don't use regex to parse html rant)
Parse the HTML with Beautiful Soup.
To get all the text, without the tags, try:
''.join(soup.findAll(text=True))
Personally, I use lxml because it's a swiss-army knife...
from lxml import html
print html.parse('http://someurl.at.domain').xpath('//body')[0].text_content()
This tells lxml to retrieve the page, locate the <body> tag then extract and print all the text.
I do a lot of page parsing and a regex is the wrong solution most of the time, unless it's a one-time-only need. If the author of the page changes their HTML you run a good risk of your regex breaking. A parser is a lot more likely to continue working.
The big problem with a parser is learning how to access the sections of the document you are after, but there are a lot of XPath tools you can use inside your browser that simplify the task.
You want to look at "Extracting data from HTML documents" in Dive Into Python; it does (almost) exactly what you want.
The best modules for this task are lxml or html5lib; Beautiful Soup is IMHO not worth using anymore. And for recursively nested markup, regular expressions are definitely the wrong method.
If I am understanding your question correctly, this can simply be done using the urlopen function of urllib. Just have a look at this function to open a URL and read the response, which will be the HTML code of that page.
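For example (Python 2's urllib; the URL is a placeholder):
import urllib

page = urllib.urlopen('http://example.com/')
html = page.read()  # the raw HTML of the page
page.close()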
The quickest way to get a usable sample of what a browser would display is to remove any tags from the HTML and print the rest. This can, for example, be done using Python's re.
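A minimal sketch of that approach (crude: anything that looks like a tag is dropped, so scripts, styles, and entities are not handled):
import re

def strip_tags(html):
    # Remove anything between < and >; good enough for a quick text sample.
    return re.sub(r'<[^>]+>', '', html)

print(strip_tags('<p id="firstpara" align="center">This is paragraph <b>one</b>.'))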
