How to write a Python regex to find a GUID in HTML?

How can I find the GUIDs in the following HTML section?
HTML sample:
<td>xxxxxxx</td>
<td style="display: none">e3aa8247-354b-e311-b6eb-005056b42341</td>
<td>yyyyyy</td>
<td style="display: none">e3aa8247-354b-e311-b6eb-005056b42342</td>
<td>zzzz</td>

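import re
re.findall(r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}", the_whole_text)
This works because a UUID in its canonical text form always has this 8-4-4-4-12 layout of hexadecimal digits (pass re.IGNORECASE if the digits may be uppercase). In general, though, when parsing HTML/XML you should use an HTML/XML parser rather than re, since regular expressions have a very hard time with nesting.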
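A slightly more defensive variant of the pattern above, as a sketch only: it assumes the GUIDs may appear in upper or lower case and that normalising them through the standard library's `uuid.UUID` is acceptable. The names `GUID_RE` and `find_guids`, and the uppercase second GUID in the sample, are made up for illustration.

```python
import re
import uuid

# Canonical 8-4-4-4-12 hex layout; IGNORECASE also accepts uppercase digits.
GUID_RE = re.compile(
    r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b",
    re.IGNORECASE,
)

html = """
<td>xxxxxxx</td>
<td style="display: none">e3aa8247-354b-e311-b6eb-005056b42341</td>
<td>yyyyyy</td>
<td style="display: none">E3AA8247-354B-E311-B6EB-005056B42342</td>
"""


def find_guids(text):
    """Return every GUID-shaped substring, normalised to lowercase."""
    return [str(uuid.UUID(match)) for match in GUID_RE.findall(text)]


guids = find_guids(html)
print(guids)
```

The word boundaries keep the pattern from matching inside a longer run of hex digits; whether that matters depends on the page being scraped.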

Use an HTML Parser, like that "beautiful" and transparent BeautifulSoup package.
The idea is to locate td elements with xxxxxxx, yyyyyy texts and get the following td sibling's text value (assuming xxxxxxx and yyyyyy are labels you know beforehand):
from bs4 import BeautifulSoup
data = """
<tr>
<td>xxxxxxx</td>
<td style="display: none">e3aa8247-354b-e311-b6eb-005056b42341</td>
<td>yyyyyy</td>
<td style="display: none">e3aa8247-354b-e311-b6eb-005056b42342</td>
<td>zzzz</td>
</tr>
"""
soup = BeautifulSoup(data, "html.parser")
print(soup.find("td", string="xxxxxxx").find_next_sibling("td").text)
Prints:
e3aa8247-354b-e311-b6eb-005056b42341

Related

Python BeautifulSoup find next_sibling

I have some html scraping code issues with beautiful soup. I cannot figure out how to go through the whole html document to find the rest of the things I am looking for.
I have this code that will find and print the word "Totem" in the below html. I want to be able to cycle through the html and find the remaining "One, Two, Three", and "Rent"
Code that works to find the first tag and text:
print(html.find('td', {'class': 'play'}).next_sibling.next_sibling.text)
Let the below be the sample html to scrape:
<tr>
<td class="play">
<span class="play-button as_audio-button"></span>
<audio class="as_audio_preview" src="https://shopify.audiosalad.com/" >foo</audio>
</td>
**<td>Totem</td>**
<!--<td>$0.99</td>-->
<td class="buy">
<tr>
<td class="play">
<span class="play-button as_audio-button"></span>
<audio class="as_audio_preview" src="https://shopify.audiosalad.com/" >foo</audio>
</td>
**<td>One, Two, Three</td>**
<!--<td>$0.99</td>-->
<td class="buy">
<tr>
<td class="play">
<span class="play-button as_audio-button"></span>
<audio class="as_audio_preview" src="https://shopify.audiosalad.com/" >foo</audio>
</td>
**<td>Rent</td>**
<!--<td>$0.99</td>-->
<td class="buy">
Try this. It should fetch you the content you are after:
from bs4 import BeautifulSoup
soup = BeautifulSoup(content,"lxml")
for items in soup.find_all(class_="play"):
    data = items.find_next_sibling().text
    print(data)
Or, you can try like this as well:
for items in soup.find_all(class_="play"):
    data = items.find_next("td").text
    print(data)
Output:
Totem
One, Two, Three
Rent
You have to iterate over the elements, like this:
for td in html.find_all('td', {'class': 'play'}):
    print(td.next_sibling.next_sibling.text)
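The doubled `.next_sibling` in the snippets above is needed because the first sibling of each `td` is the whitespace between the tags, not the next `td`. A minimal sketch of the difference, assuming a trimmed-down stand-in for the sample markup (the `content` string below is not the real page):

```python
from bs4 import BeautifulSoup

# Trimmed-down stand-in for the sample markup (an assumption, not the real page).
content = """
<table>
<tr>
<td class="play"><span class="play-button"></span></td>
<td>Totem</td>
</tr>
<tr>
<td class="play"><span class="play-button"></span></td>
<td>Rent</td>
</tr>
</table>
"""

soup = BeautifulSoup(content, "html.parser")
first = soup.find("td", class_="play")

# The immediate sibling is the newline between the two <td> tags, not a tag.
print(repr(first.next_sibling))

# find_next_sibling() skips text nodes and returns the next tag.
titles = [
    td.find_next_sibling("td").get_text(strip=True)
    for td in soup.find_all("td", class_="play")
]
print(titles)
```

`find_next_sibling("td")` tends to be the more robust choice, since it does not depend on how much whitespace or how many comments sit between the cells.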

Python's BeautifulSoup find_all Method: Including and Excluding with RegEx in one search

I have a question regarding regular expressions and Python's (2.6) BeautifulSoup (4.4.0):
If I do this:
import re
re_expr = re.compile(r"(?!.*\bthis\b).*\bcol\b")
a = u"col this"
b = u"col"
print re.search(re_expr, a)
#None
print re.search(re_expr, b)
#<_sre.SRE_Match object at 0x0209EEC8>
I get a non hit in the first search, and a hit in the second search (as expected).
But if I do this:
from bs4 import BeautifulSoup
import re
html = """ <html>
<table>
<tr class="row">
<td class="col this">value 1</td>
<td class="col">value 2</td>
</tr>
</table>
</html>
"""
soup = BeautifulSoup(html, 'html.parser')
t_row = soup.find('tr', class_='row')
print "t_row:\n", t_row
#<tr class="row">
#<td class="col this">value 1</td>
#<td class="col">value 2</td>
#</tr>
re_expr = re.compile(r"(?!.*\bthis\b).*\bcol\b")
t_tag = t_row.find_all('td', class_=re_expr)
print "t_tag:\n", t_tag
#[<td class="col this">value 1</td>, <td class="col">value 2</td>]
I get back both td tags, and not only the td class="col" as I would expect.
I know there are many workarounds for this, and I'm aware of them,
but I'm trying to find out why this logic works differently than I expect.
Why is the "fail if we can match 'this'" regex block not working here?
Using positive (inclusion) and negative (exclusion) regex matching in a class_ search in BeautifulSoup would be great.
Does anybody have a working example that does this in ONE regex with BeautifulSoup's .find or .find_all method?
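A likely explanation, going by BeautifulSoup's documented handling of multi-valued attributes: `class` is split into separate values, and a `class_` filter matches if any single value matches. For `class="col this"` the regex is therefore also tried against the lone token `"col"`, where the negative lookahead has nothing to reject. A single regex cannot see the other tokens in that case, so the sketch below swaps in a function filter instead; `col_without_this` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

html = """
<table>
<tr class="row">
<td class="col this">value 1</td>
<td class="col">value 2</td>
</tr>
</table>
"""


def col_without_this(tag):
    """True for <td> tags that carry class 'col' but not class 'this'."""
    classes = tag.get("class", [])
    return tag.name == "td" and "col" in classes and "this" not in classes


soup = BeautifulSoup(html, "html.parser")
matches = soup.find("tr", class_="row").find_all(col_without_this)
print(matches)
```

This is a workaround rather than the one-regex solution the question asks for; it is shown here only because the per-token matching seems to rule the regex approach out.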

Extract Text from HTML Python (BeautifulSoup, RE, Other Option?)

I am familiar with BeautifulSoup and Regular Expressions as a means of extracting text from HTML but not as familiar with others, such as ElementTree, Minidom, etc.
My question is fairly straightforward. Given the HTML snippet below, which library is best for extracting the text below? The text being the integer.
<td class="tl-cell tl-popularity" data-tooltip="7,944,796" data-tooltip-instant="">
<div class="pop-meter">
<div class="pop-meter-background"></div>
<div class="pop-meter-overlay" style="width: 55%"></div>
</div>
</td>
With BeautifulSoup it is fairly straight-forward:
from bs4 import BeautifulSoup
data = """
<td class="tl-cell tl-popularity" data-tooltip="7,944,796" data-tooltip-instant="">
<div class="pop-meter">
<div class="pop-meter-background"></div>
<div class="pop-meter-overlay" style="width: 55%"></div>
</div>
</td>
"""
soup = BeautifulSoup(data, "html.parser")
print(soup.td['data-tooltip'])
If you have multiple td elements and you need to extract the data-tooltip from each one:
for td in soup.find_all('td', {'data-tooltip': True}):
    print(td['data-tooltip'])
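Since the question also asks about other options, here is a rough standard-library alternative using `html.parser`. The class name `TooltipCollector` is invented for this sketch, and turning the value into an `int` by stripping the commas is an assumption about what "the integer" should look like.

```python
from html.parser import HTMLParser


class TooltipCollector(HTMLParser):
    """Collect the data-tooltip value of every <td> as an integer."""

    def __init__(self):
        super().__init__()
        self.tooltips = []

    def handle_starttag(self, tag, attrs):
        if tag != "td":
            return
        value = dict(attrs).get("data-tooltip")
        if value is not None:
            # "7,944,796" -> 7944796 (assumes comma thousands separators)
            self.tooltips.append(int(value.replace(",", "")))


data = """
<td class="tl-cell tl-popularity" data-tooltip="7,944,796" data-tooltip-instant="">
<div class="pop-meter">
<div class="pop-meter-background"></div>
<div class="pop-meter-overlay" style="width: 55%"></div>
</div>
</td>
"""

collector = TooltipCollector()
collector.feed(data)
print(collector.tooltips)  # [7944796]
```

For anything beyond a single attribute lookup, BeautifulSoup is usually less work than a hand-written `HTMLParser` subclass.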

Extract html data using regular expressions

I have an html page that looks like this
<tr>
<td align=left>
<a href="history/2c0b65635b3ac68a4d53b89521216d26.html">
<img src="/images/page.gif" border="0" title="полная информация о документе" width=20 height=20>
</a>
Th
</td>
</tr>
<tr align=right>
<td align=left>
<a href="marketing/3c0a65635b2bc68b5c43b88421306c37.html">
<img src="/images/page.gif" border="0" title="полная информация о документе" width=20 height=20>
</a>
aa
</td>
</tr>
I need to get the text
history/2c0b65635b3ac68a4d53b89521216d26.html
marketing/3c0a65635b2bc68b5c43b88421306c37.html
I wrote a script in python that uses regular expressions
import re
a = re.compile("[0-9 a-z]{0,15}/[0-9 a-f]{32}.html")
print(a.match(s))
where s's value is the html page above. However when I use this script I get "None". Where did I go wrong?
Don't use regex for parsing HTML content.
Use a specialized tool - an HTML Parser.
Example (using BeautifulSoup):
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
data = u"""Your HTML here"""
soup = BeautifulSoup(data, "html.parser")
for link in soup.select('td a[href]'):
    print(link['href'])
Prints:
history/2c0b65635b3ac68a4d53b89521216d26.html
history/2c0b65635b3ac68a4d53b89521216d26_0.html
marketing/3c0a65635b2bc68b5c43b88421306c37.html
marketing/3c0a65635b2bc68b5c43b88421306c37_0.html
Or, if you want to get the href values that follow a pattern, use:
import re
for link in soup.find_all('a', href=re.compile(r'\w+/\w{32}\.html')):
    print(link['href'])
where r'\w+/\w{32}\.html' is a regular expression that is applied to the href attribute of every a tag found. It matches one or more word characters (\w+: letters, digits, or underscore), followed by a slash, followed by exactly 32 word characters (\w{32}), followed by a literal dot (\., which needs to be escaped), followed by html.
You can also write something like
>>> soup = BeautifulSoup(html, "html.parser")  # html is the string containing the data to be parsed
>>> for a in soup.select('a'):
...     print(a['href'])
...
history/2c0b65635b3ac68a4d53b89521216d26.html
history/2c0b65635b3ac68a4d53b89521216d26_0.html
marketing/3c0a65635b2bc68b5c43b88421306c37.html
marketing/3c0a65635b2bc68b5c43b88421306c37_0.html
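As for why the script in the question printed None: `re.match` only tries to match at the very beginning of the string, and `s` begins with markup rather than with a path. `re.search` or `findall` scan the whole string instead. The sketch below uses a shortened stand-in for `s` and a tightened pattern (no spaces in the character classes, an escaped dot); those tweaks are assumptions about what was intended.

```python
import re

# Shortened stand-in for the page in the question (an assumption).
s = """
<a href="history/2c0b65635b3ac68a4d53b89521216d26.html">
<a href="marketing/3c0a65635b2bc68b5c43b88421306c37.html">
"""

pattern = re.compile(r"[0-9a-z]{1,15}/[0-9a-f]{32}\.html")

# match() anchors at position 0, where the string holds a newline, so it fails.
print(pattern.match(s))

# findall() scans the whole string and returns every non-overlapping match.
links = pattern.findall(s)
print(links)
```

The parser-based answers above remain the safer route for real pages; this only explains the None.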

Add parent tags with beautiful soup

I have many pages of HTML with various sections containing these code snippets:
<div class="footnote" id="footnote-1">
<h3>Reference:</h3>
<table cellpadding="0" cellspacing="0" class="floater" style="margin-bottom:0;" width="100%">
<tr>
<td valign="top" width="20px">
1.
</td>
<td>
<p> blah </p>
</td>
</tr>
</table>
</div>
I can parse the HTML successfully and extract these relevant tags
tags = soup.find_all(attrs={"footnote"})
Now I need to add new parent tags around these so that each code snippet becomes:
<div class="footnote-out"><CODE></div>
But I can't find a way of adding parent tags in bs4 so that they wrap the identified tags. insert() and insert_before() add the new tag inside or beside the identified tags, not around them.
I started by trying string manipulation:
for tags in soup.find_all(attrs={"footnote"}):
    tags = BeautifulSoup("""<div class="footnote-out">"""+str(tags)+("</div>"))
but I believe this isn't the best course.
Thanks for any help. Just started using bs/bs4 but can't seem to crack this.
How about this:
def wrap(to_wrap, wrap_in):
    contents = to_wrap.replace_with(wrap_in)
    wrap_in.append(contents)
Simple example:
from bs4 import BeautifulSoup
soup = BeautifulSoup("<body><a>Some text</a></body>", "html.parser")
wrap(soup.a, soup.new_tag("b"))
print(soup.body)
# <body><b><a>Some text</a></b></body>
Example with your document:
for footnote in soup.find_all("div", "footnote"):
    new_tag = soup.new_tag("div")
    new_tag['class'] = 'footnote-out'
    wrap(footnote, new_tag)
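Recent bs4 releases also ship a built-in `wrap()` method on tags that does the same job as the helper above. A minimal sketch, assuming a cut-down version of the footnote markup (the `html` string here is a stand-in, not the real pages):

```python
from bs4 import BeautifulSoup

# Cut-down stand-in for one footnote section (an assumption).
html = '<div class="footnote" id="footnote-1"><h3>Reference:</h3></div>'
soup = BeautifulSoup(html, "html.parser")

for footnote in soup.find_all("div", class_="footnote"):
    outer = soup.new_tag("div")
    outer["class"] = "footnote-out"
    # wrap() puts `footnote` inside `outer` and leaves `outer` in its place.
    footnote.wrap(outer)

print(soup)
```

Because `find_all` collects its results before the loop starts, the newly created `footnote-out` wrappers are not visited again.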
