Extract Text from HTML Python (BeautifulSoup, RE, Other Option?) - python

I am familiar with BeautifulSoup and regular expressions as a means of extracting text from HTML, but less familiar with alternatives such as ElementTree, minidom, etc.
My question is fairly straightforward. Given the HTML snippet below, which library is best for extracting the integer it contains (the value of the data-tooltip attribute)?
<td class="tl-cell tl-popularity" data-tooltip="7,944,796" data-tooltip-instant="">
<div class="pop-meter">
<div class="pop-meter-background"></div>
<div class="pop-meter-overlay" style="width: 55%"></div>
</div>
</td>

With BeautifulSoup it is fairly straight-forward:
from bs4 import BeautifulSoup
data = """
<td class="tl-cell tl-popularity" data-tooltip="7,944,796" data-tooltip-instant="">
<div class="pop-meter">
<div class="pop-meter-background"></div>
<div class="pop-meter-overlay" style="width: 55%"></div>
</div>
</td>
"""
soup = BeautifulSoup(data, 'html.parser')
print(soup.td['data-tooltip'])
If you have multiple td elements and you need to extract the data-tooltip from each one:
for td in soup.find_all('td', {'data-tooltip': True}):
    print(td['data-tooltip'])
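Since the question also asks about other options, here is a minimal sketch using only the standard library's html.parser module. The class name TooltipCollector is made up for this illustration, and the final step assumes the tooltip is always a comma-grouped integer, as it is in the snippet above.

```python
from html.parser import HTMLParser


class TooltipCollector(HTMLParser):
    """Collect the data-tooltip attribute of every <td> start tag."""

    def __init__(self):
        super().__init__()
        self.tooltips = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        if tag == "td":
            value = dict(attrs).get("data-tooltip")
            if value is not None:
                self.tooltips.append(value)


data = """
<td class="tl-cell tl-popularity" data-tooltip="7,944,796" data-tooltip-instant="">
<div class="pop-meter">
<div class="pop-meter-background"></div>
<div class="pop-meter-overlay" style="width: 55%"></div>
</div>
</td>
"""

collector = TooltipCollector()
collector.feed(data)

# Assumption: the value is an integer with thousands separators.
popularity = [int(value.replace(",", "")) for value in collector.tooltips]
print(popularity)  # [7944796]
```

This avoids a third-party dependency, at the cost of writing the traversal by hand; for anything beyond a single attribute, BeautifulSoup is usually less work.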

Related

How to scrape just one text value on one p tag from bs4

Actually The website has one <p> but inside it there are two text values, I just want to scrape one of the texts. website HTML as below:
<p class="text-base font-medium text-gray-700 w-1/2" xpath="1">
Great Clips
<br><span class="text-blue-600 font-normal text-sm">Request Info</span>
</p>
In the HTML above, there are two text values ("Great Clips" and "Request Info") if we target <p>. I just want to scrape "Great Clips", not both. How would I do that with bs4?
You could use .contents with indexing to extract only the first child:
soup.p.contents[0].strip()
Example
from bs4 import BeautifulSoup
html = '''
<p class="text-base font-medium text-gray-700 w-1/2" xpath="1">
Great Clips
<br><span class="text-blue-600 font-normal text-sm">Request Info</span>
</p>
'''
soup = BeautifulSoup(html, 'html.parser')
print(soup.p.contents[0].strip())
Output
Great Clips
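Indexing into .contents assumes the wanted text is the very first child. A slightly more tolerant sketch asks for the first text node that is a direct child of <p>, using find() with string=True and recursive=False. This is only an illustration on the same snippet; if the markup puts a whitespace-only text node first, it would return that instead.

```python
from bs4 import BeautifulSoup

html = '''
<p class="text-base font-medium text-gray-700 w-1/2" xpath="1">
Great Clips
<br><span class="text-blue-600 font-normal text-sm">Request Info</span>
</p>
'''

soup = BeautifulSoup(html, "html.parser")

# recursive=False restricts the search to direct children of <p>,
# so the text nested inside <span> is never considered.
direct_text = soup.p.find(string=True, recursive=False)
name = direct_text.strip()
print(name)  # Great Clips
```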

BeautifulSoup to find text inside the table

Here's what my HTML looks like:
<head> ... </head>
<body>
<div>
<h2>Something really cool here</h2>
<div class="section mylist">
<table id="list_1" class="table">
<thead> ... not important </thead>
<tr id="blahblah1"> <td> ... </td> </tr>
<tr id="blah2"> <td> ... </td> </tr>
<tr id="bl3"> <td> ... </td> </tr>
</table>
</div>
</div>
</body>
There are four occurrences of this div in my HTML file; each table's content is different and each h2 text is different. Everything else is more or less the same. So far I've been able to get the parent of each h2, but I am not sure how to get each tr from there so that I can then extract the td I really need.
Here is the code I've written so far...
from bs4 import BeautifulSoup
soup = BeautifulSoup(open('myhtml.html'), 'html.parser')
currently_watching = soup.find('h2', text='Something really cool here')
parent = currently_watching.parent
I would suggest finding the parent div, which actually encloses the table, and then search for all td tags. Here's how you'd do it:
from bs4 import BeautifulSoup
soup = BeautifulSoup(open('myhtml.html'), 'lxml')
div = soup.find('div', class_='section mylist')
for td in div.find_all('td'):
    print(td.text)
Searched around a bit and realized that it was my parser that was causing the issue. I installed lxml and everything works fine now.
See also: Why is BeautifulSoup not finding a specific table class?

How to write python regex to find guid in html?

How to find guid in following HTML section?
HTML sample:
<td>xxxxxxx</td>
<td style="display: none">e3aa8247-354b-e311-b6eb-005056b42341</td>
<td>yyyyyy</td>
<td style="display: none">e3aa8247-354b-e311-b6eb-005056b42342</td>
<td>zzzz</td>
re.findall(r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}", the_whole_text)
This works because GUIDs are always written in this 8-4-4-4-12 hexadecimal format. Note that the pattern as written matches lowercase hex digits only. In general, when parsing HTML/XML you should use an HTML/XML parser rather than re, since regular expressions have a very hard time with nesting.
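As a rough sketch of the regex approach made a bit more defensive: the pattern below is compiled with re.IGNORECASE and each match is passed through the standard library's uuid.UUID, which normalizes it to lowercase. The sample text is the HTML from the question, except that the second GUID has been uppercased here purely to show the case-insensitive match.

```python
import re
import uuid

GUID_RE = re.compile(
    r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b",
    re.IGNORECASE,
)

# Second GUID uppercased for illustration only.
the_whole_text = """
<td>xxxxxxx</td>
<td style="display: none">e3aa8247-354b-e311-b6eb-005056b42341</td>
<td>yyyyyy</td>
<td style="display: none">E3AA8247-354B-E311-B6EB-005056B42342</td>
<td>zzzz</td>
"""

# uuid.UUID validates each match and gives a canonical lowercase string.
guids = [uuid.UUID(match) for match in GUID_RE.findall(the_whole_text)]
guid_strings = [str(g) for g in guids]
print(guid_strings)
```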
Use an HTML Parser, like that "beautiful" and transparent BeautifulSoup package.
The idea is to locate td elements with xxxxxxx, yyyyyy texts and get the following td sibling's text value (assuming xxxxxxx and yyyyyy are labels you know beforehand):
from bs4 import BeautifulSoup
data = """
<tr>
<td>xxxxxxx</td>
<td style="display: none">e3aa8247-354b-e311-b6eb-005056b42341</td>
<td>yyyyyy</td>
<td style="display: none">e3aa8247-354b-e311-b6eb-005056b42342</td>
<td>zzzz</td>
</tr>
"""
soup = BeautifulSoup(data, 'html.parser')
print(soup.find("td", string="xxxxxxx").find_next_sibling('td').text)
Prints:
e3aa8247-354b-e311-b6eb-005056b42341

Matching specific table within HTML, BeautifulSoup

I have this problem. There're several similar tables on the page I'm trying to scrape.
<h2 class="tabellen_ueberschrift al">Points</h2>
<div class="fl" style="width:49%;">
<table class="tabelle_grafik lh" cellpadding="2" cellspacing="1">
The only difference between them is the text within h2 tags, here: Points
How can I specify which table I need to search in?
I have this code and need to adjust the h2 tag factor:
my_tab = soup.find('table', {'class':'tabelle_grafik lh'})
Need some help guys.
This works for me. Walk back through each table's previous siblings: if the nearest preceding h2 has the text "Points", you've found the right table; if it has any other text, skip that table.
from bs4 import BeautifulSoup
t = """
<h2 class="tabellen_ueberschrift al">Points</h2>
<table class="tabelle_grafik lh" cellpadding="2" cellspacing="1">
<tr><td>yes me!</td></tr></table>
<h2 class="tabellen_ueberschrift al">Bad</h2>
<table class="tabelle_grafik lh" cellpadding="2" cellspacing="1">
<tr><td>woo woo</td></tr></table>
"""
soup = BeautifulSoup(t, 'html.parser')
for ta in soup.find_all('table'):
    # find_previous_siblings() yields the nearest sibling first
    for s in ta.find_previous_siblings():
        if s.name == 'h2':
            if s.text == 'Points':
                print(ta)
            # only the nearest preceding h2 decides
            break
Looks like this is a job for xpath. But, BeautifulSoup doesn't support XPath expressions.
Consider switching to lxml or scrapy.
FYI, for test xml like:
<html>
<h2 class="tabellen_ueberschrift al">Points</h2>
<div class="fl" style="width:49%;">
<table class="tabelle_grafik lh" cellpadding="2" cellspacing="1">a</table>
</div>
<h2 class="tabellen_ueberschrift al">Illegal</h2>
<div class="fl" style="width:49%;">
<table class="tabelle_grafik lh" cellpadding="2" cellspacing="1">b</table>
</div>
</html>
XPath expression to find table with class "tabelle_grafik lh" in div after h2="Points" is:
//table[@class="tabelle_grafik lh" and ../preceding-sibling::h2[1][text()="Points"]]
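If neither lxml nor scrapy is at hand, the same selection can be approximated with the standard library. ElementTree's XPath subset has no preceding-sibling axis, so this sketch swaps in a plain loop that remembers the most recent h2. It assumes well-formed markup like the test XML above; the helper name tables_under_heading is invented for the example.

```python
import xml.etree.ElementTree as ET

xml_text = """
<html>
<h2 class="tabellen_ueberschrift al">Points</h2>
<div class="fl" style="width:49%;">
<table class="tabelle_grafik lh" cellpadding="2" cellspacing="1">a</table>
</div>
<h2 class="tabellen_ueberschrift al">Illegal</h2>
<div class="fl" style="width:49%;">
<table class="tabelle_grafik lh" cellpadding="2" cellspacing="1">b</table>
</div>
</html>
"""


def tables_under_heading(root, heading):
    """Return tables inside divs whose nearest preceding h2 has the given text."""
    found = []
    current = None
    for child in root:
        if child.tag == "h2":
            # Remember the latest heading seen among root's children.
            current = (child.text or "").strip()
        elif child.tag == "div" and current == heading:
            found.extend(child.findall("table[@class='tabelle_grafik lh']"))
    return found


root = ET.fromstring(xml_text)
tables = tables_under_heading(root, "Points")
print([t.text for t in tables])  # ['a']
```

Real-world HTML is rarely well-formed XML, so for scraped pages the lxml route suggested above is likely the more robust choice.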

Add parent tags with beautiful soup

I have many pages of HTML with various sections containing these code snippets:
<div class="footnote" id="footnote-1">
<h3>Reference:</h3>
<table cellpadding="0" cellspacing="0" class="floater" style="margin-bottom:0;" width="100%">
<tr>
<td valign="top" width="20px">
1.
</td>
<td>
<p> blah </p>
</td>
</tr>
</table>
</div>
I can parse the HTML successfully and extract these relevant tags
tags = soup.find_all(attrs={"footnote"})
Now I need to add new parent tags about these such that the code snippet goes:
<div class="footnote-out"><CODE></div>
But I can't find a way of adding parent tags in bs4 such that they enclose the identified tags. insert() and insert_before() add the new tag beside the identified tags, not around them.
I started by trying string manipulation:
for tags in soup.find_all(attrs={"footnote"}):
    tags = BeautifulSoup("""<div class="footnote-out">"""+str(tags)+("</div>"))
but I believe this isn't the best course.
Thanks for any help. Just started using bs/bs4 but can't seem to crack this.
How about this:
def wrap(to_wrap, wrap_in):
    # replace_with() puts wrap_in into the tree and returns the removed element
    contents = to_wrap.replace_with(wrap_in)
    wrap_in.append(contents)
Simple example:
from bs4 import BeautifulSoup
soup = BeautifulSoup("<body><a>Some text</a></body>", "html.parser")
wrap(soup.a, soup.new_tag("b"))
print(soup.body)
# <body><b><a>Some text</a></b></body>
Example with your document:
for footnote in soup.find_all("div", "footnote"):
    new_tag = soup.new_tag("div")
    new_tag['class'] = 'footnote-out'
    wrap(footnote, new_tag)
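Recent versions of bs4 also ship a wrap() method on tags, which does the same job as the helper above. A small sketch, using a shortened footnote div as stand-in input (the full table from the question is omitted here for brevity):

```python
from bs4 import BeautifulSoup

# Shortened stand-in for one footnote block from the question.
html = '<div class="footnote" id="footnote-1"><h3>Reference:</h3></div>'
soup = BeautifulSoup(html, "html.parser")

for footnote in soup.find_all("div", class_="footnote"):
    wrapper = soup.new_tag("div")
    wrapper["class"] = "footnote-out"
    # wrap() places the footnote inside wrapper, in the footnote's old position.
    footnote.wrap(wrapper)

result = str(soup)
print(result)
```

The class_="footnote" filter matches whole class names, so the new footnote-out wrappers are not picked up by the same search.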
