Here's what my HTML looks like:
<head> ... </head>
<body>
<div>
<h2>Something really cool here<h2>
<div class="section mylist">
<table id="list_1" class="table">
<thead> ... not important <thead>
<tr id="blahblah1"> <td> ... </td> </tr>
<tr id="blah2"> <td> ... </td> </tr>
<tr id="bl3"> <td> ... </td> </tr>
</table>
</div>
</div>
</body>
Now there are four occurrences of this div in my HTML file; each table's content is different and each h2 text is different. Everything else is roughly the same. So far I've been able to extract the parent of each h2, but I'm not sure how to extract each tr so that I can then pull out the td that I really need.
Here is the code I've written so far...
from bs4 import BeautifulSoup
soup = BeautifulSoup(open('myhtml.html'), 'html.parser')
currently_watching = soup.find('h2', text='Something really cool here')
parent = currently_watching.parent
I would suggest finding the parent div that actually encloses the table, and then searching for all td tags. Here's how you'd do it:
from bs4 import BeautifulSoup
soup = BeautifulSoup(open('myhtml.html'), 'lxml')
div = soup.find('div', class_='section mylist')
for td in div.find_all('td'):
    print(td.text)
I searched around a bit and realized that my parser was causing the issue. I installed lxml and everything works fine now.
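To get from each h2 to the rows and cells of its own table, one option is to go through the h2's parent and search for the table from there. The sketch below is only an illustration: the sample HTML is a hypothetical, well-formed stand-in for the real file, and the cell texts and the second heading are made up.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real file: two of the repeated sections.
html = """
<div>
  <h2>Something really cool here</h2>
  <div class="section mylist">
    <table id="list_1" class="table">
      <tr id="blahblah1"><td>first</td></tr>
      <tr id="blah2"><td>second</td></tr>
    </table>
  </div>
</div>
<div>
  <h2>Another heading</h2>
  <div class="section mylist">
    <table id="list_2" class="table">
      <tr id="x1"><td>third</td></tr>
    </table>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

cells_by_heading = {}
for h2 in soup.find_all("h2"):
    # The table lives inside the same outer div as the h2.
    table = h2.parent.find("table")
    if table is None:
        continue
    cells_by_heading[h2.get_text(strip=True)] = [
        td.get_text(strip=True)
        for tr in table.find_all("tr")
        for td in tr.find_all("td")
    ]

print(cells_by_heading)
```

This keeps each table's cells grouped under the heading they belong to, instead of collecting every td on the page into one list.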
Why is BeautifulSoup not finding a specific table class?
I'm trying to scrape data off a table on a web page using Python, BeautifulSoup, Requests, as well as Selenium to log into the site.
Here's the table I'm looking to get data for...
<div class="sastrupp-class">
<table>
<tbody>
<tr>
<td class="key">Thing I dont want 1</td>
<td class="value money">$1.23</td>
<td class="key">Thing I dont want 2</td>
<td class="value">99,999,999</td>
<td class="key">Target</td>
<td class="money value">$1.23</td>
<td class="key">Thing I dont want 3</td>
<td class="money value">$1.23</td>
<td class="key">Thing I dont want 4</td>
<td class="value percentage">1.23%</td>
<td class="key">Thing I dont want 5</td>
<td class="money value">$1.23</td>
</tr>
</tbody>
</table>
</div>
I can find the "sastrupp-class" fine, but I don't know how to look through it and get to the part of the table I want.
I figured I could just look for the class that I'm searching for like this...
output = soup.find('td', {'class':'key'})
print(output)
but that doesn't return anything.
Important to note:
1. Other <td>s inside the table have the same class name as the one that I want. If I can't separate them out, I'm OK with that, although I'd rather just return the one I want.
2. There are other <div>s with class="sastrupp-class" on the site.
I'm obviously a beginner at this so let me know if I can help you help me.
Any help/pointers would be appreciated.
1) First off, to get your 'Target' you need find_all, not find. Then, assuming you know exactly which position your target will be in (index 2 in the example you gave), the solution could look like this:
from bs4 import BeautifulSoup
html = """(YOUR HTML)"""
soup = BeautifulSoup(html, 'html.parser')
table = soup.find('div', {'class': 'sastrupp-class'})
all_keys = table.find_all('td', {'class': 'key'})
my_key = all_keys[2]
print(my_key.text)  # prints 'Target'
2)
There are other <div>s with class="sastrupp-class" on the site
Again, use find_all and then pick the one you need by its index.
Example HTML:
<body>
<div class="sastrupp-class"> Don't need this</div>
<div class="sastrupp-class"> Don't need this</div>
<div class="sastrupp-class"> Don't need this</div>
<div class="sastrupp-class"> Target</div>
</body>
To extract the target, you can just:
all_divs = soup.find_all('div', {'class':'sastrupp-class'})
target = all_divs[3] # assuming you know exactly which index to look for
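If the position is not known in advance, another option is to match the key cell by its text and then step to the value cell that follows it. This is only a sketch: it assumes the key text is exactly 'Target' and that the value is the next <td> sibling, as in the sample table from the question (shortened here).

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample table from the question.
html = """
<div class="sastrupp-class">
  <table><tbody><tr>
    <td class="key">Thing I dont want 1</td>
    <td class="value money">$1.23</td>
    <td class="key">Target</td>
    <td class="money value">$1.23</td>
    <td class="key">Thing I dont want 3</td>
    <td class="money value">$4.56</td>
  </tr></tbody></table>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Find the key cell whose text is exactly 'Target' ...
key_cell = soup.find("td", class_="key", string="Target")

# ... then take the next <td> sibling, which holds the value.
value_cell = key_cell.find_next_sibling("td") if key_cell is not None else None
target_value = value_cell.get_text(strip=True) if value_cell is not None else None

print(target_value)
```

Matching on the text avoids depending on the order of the cells, at the cost of depending on the exact label.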
For example, I have HTML code that contains markup like this:
<a class="some" href="some" onclick="return false;">anchor</a>
<table id="some">
<tr>
<td class="some">
</td>
</tr>
</table>
<p class="" style="">content</p>
I want to remove all tag attributes and keep only some tags (for example, remove the table, tr, and th tags), so that I get something like this:
<a>anchor</a>
<table>
<tr>
<td>
</td>
</tr>
</table>
<p>content</p>
I do it with a for loop, but my code retrieves each tag and cleans it one by one, and I think this approach is slow.
What would you suggest? Thanks.
Update #1
In my solution I use this code for removing tags (borrowed from Django):
import re

def remove_tags(html, tags):
    """Returns the given HTML with given tags removed."""
    tags = [re.escape(tag) for tag in tags.split()]
    tags_re = '(%s)' % '|'.join(tags)
    starttag_re = re.compile(r'<%s(/?>|(\s+[^>]*>))' % tags_re, re.U)
    endtag_re = re.compile('</%s>' % tags_re)
    html = starttag_re.sub('', html)
    html = endtag_re.sub('', html)
    return html
And this code to clean HTML attributes:
# But this code doesn't remove empty tags (without content, etc.) like this `<div><img></div>`
import lxml.html.clean
html = 'Some html code'
safe_attrs = lxml.html.clean.defs.safe_attrs
cleaner = lxml.html.clean.Cleaner(safe_attrs_only=True, safe_attrs=frozenset())
html = cleaner.clean_html(html)
Use BeautifulSoup.
html = """
<a class="some" href="some" onclick="return false;">anchor</a>
<table id="some">
<tr>
<td class="some">
</td>
</tr>
</table>
<p class="" style="">content</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html)
# Reset the attribute dicts instead of deleting them, so the tags stay usable
soup.table.tr.td.attrs = {}
soup.table.attrs = {}
print(soup.prettify())
<html>
<body>
<a class="some" href="some" onclick="return false;">
anchor
</a>
<table>
<tr>
<td>
</td>
</tr>
</table>
<p class="" style="">
content
</p>
</body>
</html>
To clear tags:
soup = BeautifulSoup(html)
soup.table.clear()
print(soup.prettify())
<html>
<body>
<a class="some" href="some" onclick="return false;">
anchor
</a>
<table id="some">
</table>
<p class="" style="">
content
</p>
</body>
</html>
To delete a particular attribute:
soup = BeautifulSoup(html)
td_tag = soup.table.td
del td_tag['class']
print(soup.prettify())
<html>
<body>
<a class="some" href="some" onclick="return false;">
anchor
</a>
<table id="some">
<tr>
<td>
</td>
</tr>
</table>
<p class="" style="">
content
</p>
</body>
</html>
What you are looking for is called parsing.
BeautifulSoup is one of the most popular libraries for parsing HTML.
You can use it to remove tags, and it is well documented.
If for some reason you cannot use BeautifulSoup, look into Python's re module.
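To strip the attributes from every tag rather than from one tag at a time, a single pass over all tags may be enough. The sketch below is one possible approach, not a benchmarked one: the helper name clean_html and its unwrap_tags parameter are made up for illustration, and unwrap() is used to drop a tag while keeping its contents.

```python
from bs4 import BeautifulSoup


def clean_html(html, unwrap_tags=()):
    """Remove all attributes; optionally drop some tags but keep their contents.

    clean_html and unwrap_tags are hypothetical names used for this sketch.
    """
    soup = BeautifulSoup(html, "html.parser")

    # find_all(True) matches every tag in the document.
    for tag in soup.find_all(True):
        tag.attrs = {}

    # Only search when there is something to unwrap.
    if unwrap_tags:
        for tag in soup.find_all(list(unwrap_tags)):
            tag.unwrap()

    return str(soup)


sample = (
    '<a class="some" href="some">anchor</a>'
    '<table id="some"><tr><td class="some">x</td></tr></table>'
    '<p class="" style="">content</p>'
)
result = clean_html(sample, unwrap_tags=("a",))
print(result)
```

Calling it with unwrap_tags=("table", "tr", "td") instead would remove the table markup and leave only the cell text.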
I have many pages of HTML with various sections containing these code snippets:
<div class="footnote" id="footnote-1">
<h3>Reference:</h3>
<table cellpadding="0" cellspacing="0" class="floater" style="margin-bottom:0;" width="100%">
<tr>
<td valign="top" width="20px">
1.
</td>
<td>
<p> blah </p>
</td>
</tr>
</table>
</div>
I can parse the HTML successfully and extract these relevant tags
tags = soup.find_all(attrs={"footnote"})
Now I need to add new parent tags around these so that the code snippet becomes:
<div class="footnote-out"><CODE></div>
But I can't find a way of adding parent tags in bs4 so that they enclose the identified tags. insert()/insert_before() add content next to the identified tags rather than around them.
I started by trying string manipulation:
for tags in soup.find_all(attrs={"footnote"}):
    tags = BeautifulSoup("""<div class="footnote-out">"""+str(tags)+("</div>"))
but I believe this isn't the best course.
Thanks for any help. Just started using bs/bs4 but can't seem to crack this.
How about this:
def wrap(to_wrap, wrap_in):
    contents = to_wrap.replace_with(wrap_in)
    wrap_in.append(contents)
Simple example:
from bs4 import BeautifulSoup
soup = BeautifulSoup("<body><a>Some text</a></body>")
wrap(soup.a, soup.new_tag("b"))
print(soup.body)
# <body><b><a>Some text</a></b></body>
Example with your document:
for footnote in soup.find_all("div", "footnote"):
    new_tag = soup.new_tag("div")
    new_tag['class'] = 'footnote-out'
    wrap(footnote, new_tag)
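bs4 also ships a wrap() method on tags that does the same job as the helper above. A minimal sketch, assuming a shortened copy of the footnote markup from the question:

```python
from bs4 import BeautifulSoup

# Shortened copy of the footnote snippet from the question.
html = '<div class="footnote" id="footnote-1"><h3>Reference:</h3></div>'
soup = BeautifulSoup(html, "html.parser")

for footnote in soup.find_all("div", class_="footnote"):
    wrapper = soup.new_tag("div")
    wrapper["class"] = "footnote-out"
    # wrap() puts the footnote inside the new tag, in place.
    footnote.wrap(wrapper)

result = str(soup)
print(result)
```

Because class matching works on whole class names, the new footnote-out wrappers are not picked up by a later search for footnote.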
I have a html doc similar to following:
<html xmlns="http://www.w3.org/1999/xhtml" xmlns="http://www.w3.org/1999/xhtml">
<div id="Symbols" class="cb">
<table class="quotes">
<tr><th>Code</th><th>Name</th>
<th style="text-align:right;">High</th>
<th style="text-align:right;">Low</th>
</tr>
<tr class="ro" onclick="location.href='/xyz.com/A.htm';" style="color:red;">
<td>A</td>
<td>A Inc.</td>
<td align="right">45.44</td>
<td align="right">44.26</td>
<tr class="re" onclick="location.href='/xyz.com/B.htm';" style="color:red;">
<td>B</td>
<td>B Inc.</td>
<td align="right">18.29</td>
<td align="right">17.92</td>
</div></html>
I need to extract code/name/high/low information from the table.
I used the following code from one of the similar examples on Stack Overflow:
#############################
from urllib.request import urlopen
from lxml import html, etree
webpg = urlopen('http://www.eoddata.com/stocklist/NYSE/A.htm').read()
table = html.fromstring(webpg)
for row in table.xpath('//table[@class="quotes"]/tbody/tr'):
    for column in row.xpath('./th[position()>0]/text() | ./td[position()=1]/a/text() | ./td[position()>1]/text()'):
        print(column.strip(), end=' ')
    print()
#############################
I am getting no output. It only works if I change the XPath in the first loop from table.xpath('//table[@class="quotes"]/tbody/tr') to table.xpath('//tr').
I just don't understand why xpath('//table[@class="quotes"]/tbody/tr') does not work.
You are probably looking at the HTML in Firebug, correct? The browser inserts the implicit <tbody> tag when it is not present in the document, whereas the lxml library only processes the tags present in the raw HTML string.
Omit the tbody level in your XPath. For example, this works:
import lxml.html
tree = lxml.html.fromstring(raw_html)
tree.xpath('//table[@class="quotes"]/tr')
[<Element tr at 1014206d0>, <Element tr at 101420738>, <Element tr at 1014207a0>]
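Putting this together, the rows can be read without the tbody step. The sketch below uses a trimmed, well-formed copy of the table from the question as raw_html instead of fetching the live page, and collects the th/td text of each row.

```python
import lxml.html

# Trimmed copy of the table from the question; note there is no <tbody>.
raw_html = """
<div id="Symbols" class="cb">
<table class="quotes">
<tr><th>Code</th><th>Name</th><th>High</th><th>Low</th></tr>
<tr class="ro"><td>A</td><td>A Inc.</td><td align="right">45.44</td><td align="right">44.26</td></tr>
<tr class="re"><td>B</td><td>B Inc.</td><td align="right">18.29</td><td align="right">17.92</td></tr>
</table>
</div>
"""

tree = lxml.html.fromstring(raw_html)

rows = []
# No tbody in the path, because the raw HTML has none.
for tr in tree.xpath('//table[@class="quotes"]/tr'):
    cells = [cell.text_content().strip() for cell in tr.xpath('./th | ./td')]
    rows.append(cells)

for cells in rows:
    print(' '.join(cells))
```

If some pages do contain an explicit tbody, '//table[@class="quotes"]//tr' matches the rows in both cases.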
I have a html page that looks like:
<html>
..
<form post="/products.hmlt" ..>
..
<table ...>
<tr>...</tr>
<tr>
<td>part info</td>
..
</tr>
</table>
..
</form>
..
</html>
I tried:
form = soup.findAll('form')
table = form.findAll('table') # table inside form
But I get an error saying:
ResultSet object has no attribute 'findAll'
I guess the call to findAll doesn't return a 'beautifulsoup' object? What can I do then?
Update
There are many tables on this page, but only 1 table INSIDE the tag shown above.
findAll returns a list, so extract the element first:
form = soup.findAll('form')[0]
table = form.findAll('table')[0] # table inside form
Of course, you should do some error checking (i.e. make sure it's not empty) before indexing into the list.
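One way to do that checking is to use find(), which returns None when nothing matches, and test for None before going further. A small sketch with made-up sample HTML, written against bs4:

```python
from bs4 import BeautifulSoup

# Made-up sample: one table outside the form, one inside it.
html = (
    "<html><body>"
    "<table><tr><td>outside</td></tr></table>"
    "<form><table><tr><td>part info</td></tr></table></form>"
    "</body></html>"
)
soup = BeautifulSoup(html, "html.parser")

# find() returns the first match, or None if there is no match.
form = soup.find("form")
table = form.find("table") if form is not None else None

if table is None:
    cells = []
else:
    cells = [td.get_text(strip=True) for td in table.find_all("td")]

print(cells)
```

This avoids an IndexError on pages where the form or the table is missing.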
I like ars's answer, and certainly agree w/ the need for error-checking;
especially if this is going to be used in any kind of production code.
Here's perhaps a more verbose / explicit way of finding the data you seek:
from bs4 import BeautifulSoup as bs
html = '''<html><body><table><tr><td>some text</td></tr></table>
<form><table><tr><td>some text we care about</td></tr>
<tr><td>more text we care about</td></tr>
</table></form></html></body>'''
soup = bs(html, 'html.parser')
for tr in soup.form.findAll('tr'):
    print(tr.text)
# output:
# some text we care about
# more text we care about
For reference here is the cleaned-up HTML:
>>> print(soup.prettify())
<html>
<body>
<table>
<tr>
<td>
some text
</td>
</tr>
</table>
<form>
<table>
<tr>
<td>
some text we care about
</td>
</tr>
<tr>
<td>
more text we care about
</td>
</tr>
</table>
</form>
</body>
</html>