Extract html data using regular expressions - python

I have an HTML page that looks like this:
<tr>
<td align=left>
<a href="history/2c0b65635b3ac68a4d53b89521216d26.html">
<img src="/images/page.gif" border="0" title="полная информация о документе" width=20 height=20>
</a>
Th
</td>
</tr>
<tr align=right>
<td align=left>
<a href="marketing/3c0a65635b2bc68b5c43b88421306c37.html">
<img src="/images/page.gif" border="0" title="полная информация о документе" width=20 height=20>
</a>
aa
</td>
</tr>
I need to get the following text:
history/2c0b65635b3ac68a4d53b89521216d26.html
marketing/3c0a65635b2bc68b5c43b88421306c37.html
I wrote a script in Python that uses regular expressions:
import re
a = re.compile("[0-9 a-z]{0,15}/[0-9 a-f]{32}.html")
print(a.match(s))
where s holds the HTML above. However, when I run this script it prints "None". Where did I go wrong?

Don't use regex for parsing HTML content.
Use a specialized tool - an HTML Parser.
Example (using BeautifulSoup):
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
data = u"""Your HTML here"""
soup = BeautifulSoup(data, "html.parser")
for link in soup.select('td a[href]'):
    print(link['href'])
Prints:
history/2c0b65635b3ac68a4d53b89521216d26.html
history/2c0b65635b3ac68a4d53b89521216d26_0.html
marketing/3c0a65635b2bc68b5c43b88421306c37.html
marketing/3c0a65635b2bc68b5c43b88421306c37_0.html
Or, if you want to get the href values that follow a pattern, use:
import re
for link in soup.find_all('a', href=re.compile(r'\w+/\w{32}\.html')):
    print(link['href'])
where r'\w+/\w{32}\.html' is a regular expression that is applied to the href attribute of every a tag found. It matches one or more word characters (\w+, meaning letters, digits, or underscore), followed by a slash, followed by exactly 32 word characters (\w{32}), followed by a literal dot (\., which needs to be escaped), followed by html.
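As for why the original script printed None: re.match only tries to match at the very start of the string, and the page starts with <tr>, not with a link. A minimal sketch of the difference, using only the standard re module; the html variable and the trimmed sample markup are illustrative assumptions, and the pattern is a slightly tidied version of the one in the question:

```python
import re

# Sample markup modelled on the question; `html` is an illustrative name.
html = '''<tr>
<td align=left>
<a href="history/2c0b65635b3ac68a4d53b89521216d26.html">
<img src="/images/page.gif" border="0" width=20 height=20>
</a>
</td>
</tr>
<tr align=right>
<td align=left>
<a href="marketing/3c0a65635b2bc68b5c43b88421306c37.html">
<img src="/images/page.gif" border="0" width=20 height=20>
</a>
</td>
</tr>'''

# Spaces removed from the character classes and the dot escaped.
pattern = re.compile(r"[0-9a-z]{0,15}/[0-9a-f]{32}\.html")

# match() only tries position 0, where the text is "<tr>", so it fails.
anchored = pattern.match(html)

# findall() scans the whole string and returns every non-overlapping hit.
links = pattern.findall(html)

print(anchored)
print(links)
```

This only illustrates the match/findall distinction; for real pages the parser-based approach above is still the more robust choice.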

You can also write something like
>>> soup = BeautifulSoup(html) #html is the string containing the data to be parsed
>>> for a in soup.select('a'):
...     print(a['href'])
...
history/2c0b65635b3ac68a4d53b89521216d26.html
history/2c0b65635b3ac68a4d53b89521216d26_0.html
marketing/3c0a65635b2bc68b5c43b88421306c37.html
marketing/3c0a65635b2bc68b5c43b88421306c37_0.html

Related

Python's BeautifulSoup find_all Method: Including and Excluding with RegEx in one search

I have a question regarding regular expressions and Python's (2.6) BeautifulSoup (4.4.0):
If I do this:
import re
re_expr = re.compile(r"(?!.*\bthis\b).*\bcol\b")
a = u"col this"
b = u"col"
print re.search(re_expr, a)
#None
print re.search(re_expr, b)
#<_sre.SRE_Match object at 0x0209EEC8>
I get no match in the first search and a match in the second search (as expected).
But if I do this:
from bs4 import BeautifulSoup
import re
html = """ <html>
<table>
<tr class="row">
<td class="col this">value 1</td>
<td class="col">value 2</td>
</tr>
</table>
</html>
"""
soup = BeautifulSoup(html, 'html.parser')
t_row = soup.find('tr', class_='row')
print "t_row:\n", t_row
#<tr class="row">
#<td class="col this">value 1</td>
#<td class="col">value 2</td>
#</tr>
re_expr = re.compile(r"(?!.*\bthis\b).*\bcol\b")
t_tag = t_row.find_all('td', class_=re_expr)
print "t_tag:\n", t_tag
#[<td class="col this">value 1</td>, <td class="col">value 2</td>]
I get back both td tags, not only the td class="col" that I would expect.
I know there are many workarounds for this, and I'm aware of them,
but I am trying to find out why this logic works differently than I expect.
Why is the "fail if we can match 'this'" part of the regex not working here?
Being able to combine a positive (inclusion) and a negative (exclusion) match in a single regex for a class_ search in BeautifulSoup would be great.
Does anybody have a working example that does this with ONE regex and the .find or .find_all method of BeautifulSoup?
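A likely explanation, going by BeautifulSoup's documented handling of multi-valued attributes: class is split into its individual values, and a class_ filter succeeds if any single value matches. The value "col" on its own contains no "this", so the negative lookahead never gets a chance to reject the tag. The sketch below imitates that behaviour with the standard re module only; matches_like_class_filter and has_col_but_not_this are hypothetical helpers written for illustration, not BeautifulSoup functions:

```python
import re

re_expr = re.compile(r"(?!.*\bthis\b).*\bcol\b")

def matches_like_class_filter(class_attr, pattern):
    """Hypothetical helper: test each class token alone, then the full string."""
    tokens = class_attr.split()
    return any(pattern.search(t) for t in tokens) or bool(pattern.search(class_attr))

def has_col_but_not_this(class_attr):
    """Hypothetical helper: decide on the complete set of classes instead."""
    tokens = class_attr.split()
    return "col" in tokens and "this" not in tokens

# Result of the regex on each individual class value of class="col this".
per_token = {t: bool(re_expr.search(t)) for t in "col this".split()}

# Token-wise matching accepts both cells; the set-based check accepts one.
tokenwise = [matches_like_class_filter(c, re_expr) for c in ("col this", "col")]
setwise = [has_col_but_not_this(c) for c in ("col this", "col")]

print(per_token)
print(tokenwise)
print(setwise)
```

Under that assumption, a check that looks at the whole class list at once (for example a function passed to find_all that inspects the tag's class values) is the kind of filter that can express "col but not this".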

How to write python regex to find guid in html?

How do I find the GUIDs in the following HTML section?
HTML sample:
<td>xxxxxxx</td>
<td style="display: none">e3aa8247-354b-e311-b6eb-005056b42341</td>
<td>yyyyyy</td>
<td style="display: none">e3aa8247-354b-e311-b6eb-005056b42342</td>
<td>zzzz</td>
re.findall(r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}", the_whole_text)
This works because UUIDs written in their canonical text form always follow this layout. In general, though, when parsing HTML/XML you should use an HTML/XML parser rather than re, since regular expressions have a very hard time with nesting.
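A minimal, self-contained version of the findall approach, for reference. The GUID_RE name, the word boundaries, and the re.IGNORECASE flag (to tolerate uppercase hex digits) are additions made here as assumptions; the sample text is the snippet from the question:

```python
import re
import uuid

# 8-4-4-4-12 hexadecimal groups; IGNORECASE also accepts uppercase digits.
GUID_RE = re.compile(
    r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b",
    re.IGNORECASE,
)

the_whole_text = """
<td>xxxxxxx</td>
<td style="display: none">e3aa8247-354b-e311-b6eb-005056b42341</td>
<td>yyyyyy</td>
<td style="display: none">e3aa8247-354b-e311-b6eb-005056b42342</td>
<td>zzzz</td>
"""

guids = GUID_RE.findall(the_whole_text)

# Optional sanity check: uuid.UUID raises ValueError on malformed input.
parsed = [uuid.UUID(g) for g in guids]

print(guids)
```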
Use an HTML Parser, like that "beautiful" and transparent BeautifulSoup package.
The idea is to locate td elements with xxxxxxx, yyyyyy texts and get the following td sibling's text value (assuming xxxxxxx and yyyyyy are labels you know beforehand):
from bs4 import BeautifulSoup
data = """
<tr>
<td>xxxxxxx</td>
<td style="display: none">e3aa8247-354b-e311-b6eb-005056b42341</td>
<td>yyyyyy</td>
<td style="display: none">e3aa8247-354b-e311-b6eb-005056b42342</td>
<td>zzzz</td>
</tr>
"""
soup = BeautifulSoup(data, "html.parser")
print(soup.find("td", text="xxxxxxx").find_next_sibling('td').text)
Prints:
e3aa8247-354b-e311-b6eb-005056b42341

Using regex on python + beautiful soup

I have an html page like this:
<td class="subject windowbg2">
<div>
<span id="msg_152617">
<a href= SOME INFO THAT I WANT </a>
</span>
</div>
<div>
<span id="msg_465412">
<a href= SOME INFO THAT I WANT</a>
</span>
</div>
As you can see, the number in id="msg_465412" varies, so this is my code:
import urllib.request, http.cookiejar,re
from bs4 import BeautifulSoup
contenturl = "http://megahd.me/peliculas-microhd/"
htmll=urllib.request.urlopen(contenturl).read()
soup = BeautifulSoup(htmll)
print (soup.find('span', attrs=re.compile(r"{'id': 'msg_\d{6}'}")))
In the last line I tried to find all the span tags whose id has the form msg_###### (with any number), but something is wrong in my code and it doesn't find anything.
P.S.: all the data I want is in a table with 6 columns and I want the third column of every row, but I thought it would be easier to use regex.
You're a bit mixed up with your attrs argument ... at the moment it's a regex which contains the string representation of a dictionary, when it needs to be a dictionary containing the attribute you're searching for and a regex for its value.
This ought to work:
print (soup.find('span', attrs={'id': re.compile(r"msg_\d{6}")}))
Try using the following:
soup.find_all("span", id=re.compile(r"msg_\d{6}"))
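To make the idea concrete without any third-party dependency, here is a rough sketch of the same filter (a regex applied to the id of each span) built on the standard library's html.parser as a stand-in for BeautifulSoup. The SpanIdCollector class, the sample markup, and the looser msg_\d+ pattern are all assumptions made for illustration:

```python
import re
from html.parser import HTMLParser

class SpanIdCollector(HTMLParser):
    """Collect the id of every <span> whose id fully matches `pattern`."""

    def __init__(self, pattern):
        super().__init__()
        self.pattern = pattern
        self.ids = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        if tag != "span":
            return
        span_id = dict(attrs).get("id")
        if span_id is not None and self.pattern.fullmatch(span_id):
            self.ids.append(span_id)

sample = """
<td class="subject windowbg2">
<div><span id="msg_152617"><a href="#first">first</a></span></div>
<div><span id="msg_465412"><a href="#second">second</a></span></div>
<div><span id="other_1">ignored</span></div>
</td>
"""

collector = SpanIdCollector(re.compile(r"msg_\d+"))
collector.feed(sample)
print(collector.ids)
```

The point carried over from the answers is the shape of the filter: the attribute name selects which value to test, and the compiled regex is applied to that value only.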

Add parent tags with beautiful soup

I have many pages of HTML with various sections containing these code snippets:
<div class="footnote" id="footnote-1">
<h3>Reference:</h3>
<table cellpadding="0" cellspacing="0" class="floater" style="margin-bottom:0;" width="100%">
<tr>
<td valign="top" width="20px">
1.
</td>
<td>
<p> blah </p>
</td>
</tr>
</table>
</div>
I can parse the HTML successfully and extract these relevant tags
tags = soup.find_all(attrs={"footnote"})
Now I need to add new parent tags around these such that the code snippet becomes:
<div class="footnote-out"><CODE></div>
But I can't find a way of adding parent tags in bs4 such that they enclose the identified tags. insert() and insert_before() only add content inside or next to the identified tags.
I started by trying string manipulation:
for tags in soup.find_all(attrs={"footnote"}):
    tags = BeautifulSoup("""<div class="footnote-out">""" + str(tags) + "</div>")
but I believe this isn't the best course.
Thanks for any help. I've just started using bs/bs4 and can't seem to crack this.
How about this:
def wrap(to_wrap, wrap_in):
    contents = to_wrap.replace_with(wrap_in)
    wrap_in.append(contents)
Simple example:
from bs4 import BeautifulSoup
soup = BeautifulSoup("<body><a>Some text</a></body>", "html.parser")
wrap(soup.a, soup.new_tag("b"))
print(soup.body)
# <body><b><a>Some text</a></b></body>
Example with your document:
for footnote in soup.find_all("div", "footnote"):
    new_tag = soup.new_tag("div")
    new_tag['class'] = 'footnote-out'
    wrap(footnote, new_tag)

Beautiful soup question

I want to fetch specific rows in an HTML document.
The rows have the following attributes set: bgcolor and valign.
Here is a snippet of the HTML table:
<table>
<tbody>
<tr bgcolor="#f01234" valign="top">
<!--- td's follow ... -->
</tr>
<tr bgcolor="#c01234" valign="top">
<!--- td's follow ... -->
</tr>
</tbody>
</table>
I've had a very quick look at BS's documentation. It's not clear what params to pass to findAll to match the rows I want.
Does anyone know what to pass to findAll() to match the rows I want?
Don't use regex to parse HTML. Use an HTML parser:
import lxml.html
doc = lxml.html.fromstring(your_html)
result = doc.xpath("//tr[(@bgcolor='#f01234' or @bgcolor='#c01234') "
                   "and @valign='top']")
print(result)
That extracts all matching tr elements from your HTML; you can then work with them further, for example change their text or attribute values, extract them, or search within them.
Obligatory link:
RegEx match open tags except XHTML self-contained tags
Something like this (after import re):
soup.findAll('tr', attrs={'bgcolor': re.compile(r'#f01234|#c01234'),
                          'valign': 'top'})
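For comparison, here is a dependency-free sketch of the same row filter using the standard library's xml.etree.ElementTree instead of lxml or BeautifulSoup. This is a swapped-in technique that only works when the markup is well-formed XML, which real-world HTML often is not. The sample table, its cell contents, and the WANTED_COLORS name are assumptions made for illustration. ElementTree's XPath subset has no "or", so the colour test is done in Python:

```python
import xml.etree.ElementTree as ET

# Illustrative, well-formed sample; the cell contents are made up.
table = """<table>
<tbody>
<tr bgcolor="#f01234" valign="top"><td>a</td></tr>
<tr bgcolor="#c01234" valign="top"><td>b</td></tr>
<tr bgcolor="#ffffff" valign="top"><td>c</td></tr>
<tr bgcolor="#f01234" valign="middle"><td>d</td></tr>
</tbody>
</table>"""

WANTED_COLORS = {"#f01234", "#c01234"}

root = ET.fromstring(table)

# The XPath predicate handles valign; bgcolor is checked against the set.
rows = [
    tr for tr in root.findall(".//tr[@valign='top']")
    if tr.get("bgcolor") in WANTED_COLORS
]
cells = [tr.findtext("td") for tr in rows]
print(cells)
```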
