Can't scrape HTML table using BeautifulSoup - python

I'm trying to scrape data off a table on a web page using Python, BeautifulSoup, Requests, as well as Selenium to log into the site.
Here's the table I'm looking to get data for...
<div class="sastrupp-class">
<table>
<tbody>
<tr>
<td class="key">Thing I dont want 1</td>
<td class="value money">$1.23</td>
<td class="key">Thing I dont want 2</td>
<td class="value">99,999,999</td>
<td class="key">Target</td>
<td class="money value">$1.23</td>
<td class="key">Thing I dont want 3</td>
<td class="money value">$1.23</td>
<td class="key">Thing I dont want 4</td>
<td class="value percentage">1.23%</td>
<td class="key">Thing I dont want 5</td>
<td class="money value">$1.23</td>
</tr>
</tbody>
</table>
</div>
I can find the "sastrupp-class" fine, but I don't know how to look through it and get to the part of the table I want.
I figured I could just look for the class that I'm searching for like this...
output = soup.find('td', {'class':'key'})
print(output)
but that doesn't return anything.
Important to note:
1. Other <td>s inside the table have the same class name as the one that I want. If I can't separate them out, I'm OK with that, although I'd rather just return the one I want.
2. There are other <div>s with class="sastrupp-class" on the site.
I'm obviously a beginner at this so let me know if I can help you help me.
Any help/pointers would be appreciated.

1) First off, to get your 'Target' you need find_all, not find. Then, assuming you know exactly which position your target is in (index 2 in the example you gave), the solution could look like this:
from bs4 import BeautifulSoup
html = """(YOUR HTML)"""
soup = BeautifulSoup(html, 'html.parser')
table = soup.find('div', {'class': 'sastrupp-class'})
all_keys = table.find_all('td', {'class': 'key'})
my_key = all_keys[2]
print(my_key.text)  # prints 'Target'
2) You also mention:
There are other <div>s with class="sastrupp-class" on the site.
Again, select the one you need by calling find_all and then picking the correct index.
Example HTML:
<body>
<div class="sastrupp-class"> Don't need this</div>
<div class="sastrupp-class"> Don't need this</div>
<div class="sastrupp-class"> Don't need this</div>
<div class="sastrupp-class"> Target</div>
</body>
To extract the target, you can just:
all_divs = soup.find_all('div', {'class':'sastrupp-class'})
target = all_divs[3] # assuming you know exactly which index to look for
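If the position of the cell is not stable, a lookup by the key's text may be more robust than a fixed index. The following is only a minimal sketch: the helper name value_for_key is made up for illustration, the HTML is a shortened copy of the table from the question, and it assumes the parsed page really contains that markup (for example, Selenium's page source after login).

```python
from bs4 import BeautifulSoup

# Shortened copy of the table from the question.
html = """
<div class="sastrupp-class">
  <table><tbody><tr>
    <td class="key">Thing I dont want 1</td>
    <td class="value money">$1.23</td>
    <td class="key">Thing I dont want 2</td>
    <td class="value">99,999,999</td>
    <td class="key">Target</td>
    <td class="money value">$1.23</td>
  </tr></tbody></table>
</div>
"""

soup = BeautifulSoup(html, "html.parser")


def value_for_key(soup, label):
    """Hypothetical helper: text of the cell right after the 'key' cell named label."""
    # Match a <td class="key"> whose only content is exactly `label`.
    key_cell = soup.find("td", class_="key", string=label)
    if key_cell is None:
        return None
    # The value lives in the next <td> on the same row.
    value_cell = key_cell.find_next_sibling("td")
    return value_cell.get_text(strip=True) if value_cell else None


print(value_for_key(soup, "Target"))
```

Because the search is by text, it does not matter how many other <div>s share the class or how many key/value pairs come before the one you want.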

Related

Use beautifulSoup to find a table after a header?

I am trying to scrape some data off a website. The data that I want is listed in a table, but there are multiple tables and no IDs. I then had the idea of finding the header just above the table I am searching for and using that as an indicator.
This has really troubled me, so as a last resort, I wanted to ask if someone knows how to use BeautifulSoup to find the table.
A snippet of the HTML code is provided below; thanks in advance :)
The table I am interested in is the one right beneath <h2>Mine neaste vagter</h2>
<h2>Min aktuelle vagt</h2>
<div>
<a href='/shifts/detail/595212/'>Flere detaljer</a>
<p>Vagt starter: <b>11/06 2021 - 07:00</b></p>
<p>Vagt slutter: <b>11/06 2021 - 11:00</b></p>
<h2>Masker</h2>
<table class='list'>
<tr><th>Type</th><th>Fra</th><th> </th><th>Til</th></tr>
<tr>
<td>Fri egen regningD</td>
<td>07:00</td>
<td> - </td>
<td>11:00</td>
</tr>
</table>
</div>
<hr>
<h2>Mine neaste vagter</h2>
<table class='list'>
<tr>
<th class="alignleft">Dato</th>
<th class="alignleft">Rolle</th>
<th class="alignleft">Tidsrum</th>
<th></th>
<th class="alignleft">Bytte</th>
<th class="alignleft" colspan='2'></th>
</tr>
<tr class="rowA separator">
<td>
<h3>12/6</h3>
</td>
<td>Kundeservice</td>
<td>18:00 → 21:30 (3.5 t)</td>
<td style="max-width: 20em;"></td>
<td>
<a href="/shifts/ajax/popup/595390/" class="swap shiftpop">
Byt denne vagt
</a>
</td>
<td><a href="/shifts/detail/595390/">Detaljer</td>
<td>
</td>
</tr>
Here are two approaches to find the correct <table>:
Since the table you want is the last one in the HTML, you can use find_all() and index the result with [-1] to get the last table:
print(soup.find_all("table", class_="list")[-1])
Find the h2 element by its text, then use the find_next() method to find the table:
print(soup.find(lambda tag: tag.name == "h2" and "Mine neaste vagter" in tag.text).find_next("table"))
You can use :-soup-contains (or the older :contains alias, which newer Soup Sieve releases deprecate) to target the <h2> by its text and then use find_next to move to the table:
from bs4 import BeautifulSoup as bs
html = '''your html'''
soup = bs(html, 'lxml')
soup.select_one('h2:-soup-contains("Mine neaste vagter")').find_next('table')
This is assuming the HTML, as shown, is returned by whatever access method you are using.
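Once the table is located, its rows can be turned into plain lists. This is a rough sketch under a few assumptions: the HTML is a trimmed version of the snippet from the question, the heading text matches exactly, and the built-in html.parser is used so no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Trimmed version of the markup from the question.
html = """
<h2>Masker</h2>
<table class="list">
  <tr><th>Type</th><th>Fra</th></tr>
  <tr><td>Fri egen regningD</td><td>07:00</td></tr>
</table>
<hr>
<h2>Mine neaste vagter</h2>
<table class="list">
  <tr><th>Dato</th><th>Rolle</th><th>Tidsrum</th></tr>
  <tr><td><h3>12/6</h3></td><td>Kundeservice</td><td>18:00 - 21:30</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Locate the heading by its exact text, then the first table after it.
heading = soup.find("h2", string="Mine neaste vagter")
table = heading.find_next("table") if heading else None

# Collect the text of every header and data cell, row by row.
rows = []
if table is not None:
    for tr in table.find_all("tr"):
        rows.append([cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])])

print(rows)
```

The None checks matter in practice: if the page is rendered by JavaScript or the heading text differs slightly, find returns None and chaining find_next directly would raise an AttributeError.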

How to extract HTML table following a specific heading?

I am using BeautifulSoup to parse HTML files. I have a HTML file similar to this:
<h3>Unimportant heading</h3>
<table class="foo">
<tr>
<td>Key A</td>
</tr>
<tr>
<td>A value I don't want</td>
</tr>
</table>
<h3>Unimportant heading</h3>
<table class="foo">
<tr>
<td>Key B</td>
</tr>
<tr>
<td>A value I don't want</td>
</tr>
</table>
<h3>THE GOOD STUFF</h3>
<table class="foo">
<tr>
<td>Key C</td>
</tr>
<tr>
<td>I WANT THIS STRING</td>
</tr>
</table>
<h3>Unimportant heading</h3>
<table class="foo">
<tr>
<td>Key A</td>
</tr>
<tr>
<td>A value I don't want</td>
</tr>
</table>
I want to extract the string "I WANT THIS STRING". The perfect solution would be to get the first table following the h3 heading called "THE GOOD STUFF". I have no idea how to do this with BeautifulSoup - I only know how to extract a table with a specific class, or a table nested within some particular tag, but not following a particular tag.
I think a fallback solution could make use of the string "Key C", assuming it's unique (it almost certainly is) and appears in only that one table, but I'd feel better with going for the specific h3 heading.
Following the logic of @Zroq's answer on another question, this code will give you the table following your defined header ("THE GOOD STUFF"). Please note I just put all your HTML in the variable called "html".
import re
from bs4 import BeautifulSoup, Tag
soup = BeautifulSoup(html, "lxml")
# string= is the current name of the older text= argument
for header in soup.find_all('h3', string=re.compile('THE GOOD STUFF')):
    next_node = header
    while True:
        next_node = next_node.next_sibling
        if next_node is None:
            break  # ran out of siblings
        if isinstance(next_node, Tag):
            if next_node.name == "h3":
                break  # reached the next heading
            print(next_node)
Output:
<table class="foo">
<tr>
<td>Key C</td>
</tr>
<tr>
<td>I WANT THIS STRING</td>
</tr>
</table>
Cheers!
The docs explain that if you don't want to use find_all, you can do this:
for sibling in soup.a.next_siblings:
    print(repr(sibling))
I am sure there are more efficient ways to do this, but here is what I can think of right now:
from bs4 import BeautifulSoup
# read the saved page; adjust the path to wherever your file lives
with open("/Users/Downloads/train.html", "r") as f:
    html_data = f.read()
soup = BeautifulSoup(html_data, 'html.parser')
all_td = soup.find_all("td")
flag = 'no_print'
for td in all_td:
    if flag == 'print':
        # this is the cell right after 'Key C'
        print(td.text)
        break
    if td.text == 'Key C':
        flag = 'print'
Output:
I WANT THIS STRING
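For the preferred approach from the question (the first table following the "THE GOOD STUFF" heading), a shorter variant is possible with find_next_siblings. This is only a sketch: the function name first_table_after is made up, the HTML is a condensed copy of the question's example, and it assumes the heading and the table are siblings as shown there.

```python
from bs4 import BeautifulSoup

# Condensed copy of the example from the question.
html = """
<h3>Unimportant heading</h3>
<table class="foo"><tr><td>Key B</td></tr><tr><td>A value I don't want</td></tr></table>
<h3>THE GOOD STUFF</h3>
<table class="foo"><tr><td>Key C</td></tr><tr><td>I WANT THIS STRING</td></tr></table>
<h3>Unimportant heading</h3>
<table class="foo"><tr><td>Key A</td></tr><tr><td>A value I don't want</td></tr></table>
"""


def first_table_after(soup, heading_text):
    """Hypothetical helper: first <table> sibling after the named <h3>."""
    heading = soup.find("h3", string=heading_text)
    if heading is None:
        return None
    for sibling in heading.find_next_siblings():
        if sibling.name == "h3":
            return None  # reached the next section without finding a table
        if sibling.name == "table":
            return sibling
    return None


soup = BeautifulSoup(html, "html.parser")
table = first_table_after(soup, "THE GOOD STUFF")
cells = [td.get_text(strip=True) for td in table.find_all("td")]
wanted = cells[-1]  # the value sits in the last cell of this table
print(wanted)
```

Stopping at the next h3 keeps the helper from silently returning a table that belongs to a different section when the heading you asked for has no table of its own.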

Python beautifulsoup select CSS issue

I'm learning Scrapy and trying to use it to get some specific topics from a forum.
In the forum, the information I need is stored like this:
<tbody id="threadnumber">
<tr>
<th class="new">
<em>[topic]</em>
postname
</th>
<td class="by">
<a something to show the poster and time>**</a>
</td>
<td class="num">
<a something to show the numbers of read and replys>**</a>
</td>
<td class="by">
<a something to show the last replyer and time>**</a>
</td>
</tr>
</tbody>
<tbody id="threadnumber">#next thread
<tr>....
</tr>
</tbody>
Is there a way to get the postname in the second a tag for a specific topic whose unique topic ID is stored in the first a tag? Should I use siblings?
For example I get
[NEWS]
news1
[NEWS]
news2
[NEWS]
news3
[PIC]
picture1
as input, and I want an output that includes only the "NEWS" topic, like ['news1', 'news2', 'news3'].
Thanks for your help!
You can use BeautifulSoup to find all tags with class="post". Then, for each tag, you search for an <a> tag among the descendants of its parent and test whether its text is the topic you are interested in. If it is, you add the postname to a result list. The code could be:
def findposts(soup, topic):
    '''Finds all postnames associated with topic in a BeautifulSoup element'''
    posts = []  # initialize an empty result list
    # search postnames by class
    for postname in soup.find_all('a', attrs={'class': 'post'}):
        # find associated topic in immediate parent
        if postname.find_parent().find('a').text == topic:
            posts.append(postname.text)  # OK, add to result list
    return posts
With your example data, you could do:
soup = BeautifulSoup('data', 'html.parser')
print(findposts(soup, 'topic'))
and the result would be as expected:
['postname']
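The markup as posted in the question has no class="post" links, so here is an alternative sketch that works on the structure exactly as shown there: the topic sits in an <em> inside <th class="new">, followed by the post name as plain text. The function name posts_for_topic and the thread ids are made up for illustration, and the real forum markup may differ.

```python
from bs4 import BeautifulSoup

# Assumed structure, modelled on the snippet in the question.
html = """
<table>
<tbody id="thread1"><tr><th class="new"><em>[NEWS]</em> news1</th></tr></tbody>
<tbody id="thread2"><tr><th class="new"><em>[NEWS]</em> news2</th></tr></tbody>
<tbody id="thread3"><tr><th class="new"><em>[NEWS]</em> news3</th></tr></tbody>
<tbody id="thread4"><tr><th class="new"><em>[PIC]</em> picture1</th></tr></tbody>
</table>
"""


def posts_for_topic(soup, topic):
    """Hypothetical helper: post names whose <em> label is [topic]."""
    posts = []
    for th in soup.select("th.new"):
        em = th.find("em")
        if em is None or em.get_text(strip=True) != f"[{topic}]":
            continue
        # The post name is the plain text that follows the <em> tag.
        name = "".join(s for s in em.next_siblings if isinstance(s, str)).strip()
        posts.append(name)
    return posts


soup = BeautifulSoup(html, "html.parser")
print(posts_for_topic(soup, "NEWS"))
```

This answers the "should I use sibling?" part of the question: the text after the <em> is reached through next_siblings, keeping only the string nodes.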

bs4 parent attrs python

I'm just starting to code in Python, and a friend asked me for an application that finds specific data on the web and presents it nicely.
I have already found a website that contains the data and I can extract the basic info, but the challenge is to go deeper.
While using BS4 in Python 3.4, I arrived at the following example markup:
<tr class=" " somethingc1="" somethingc2="" somethingc3="" data-something="1" something="1something6" something_id="6something0">
<td class="text-center td_something">
<div>
Super String of Something
</div>
</td>
<td class="text-center">08/26 15:00</td>
<td class="text-center something_status">
<span class="something_status_something">Full</span>
</td>
</tr>
<tr class=" " somethingc1="" somethingc2="" somethingc3="" data-something="0" something="1something4" something_id="6something7">
<td class="text-center td_something">
<div>
Super String of Something
</div>
</td>
<td class="text-center">05/26 15:00</td>
<td class="text-center something_status">
<span class="something_status_something"></span>
</td>
</tr>
What I want to do now is find the date string, but only if the parent has data-something="1" and not if it has data-something="0".
I can scrape all dates with:
soup.find_all(lambda tag: tag.name == 'td' and tag.get('class') == ['text-center'] and not tag.has_attr('style'))
but it does not check parent. That is why I tried:
def KieMeWar(tag):
    return tag.name == 'td' and tag.parent.name == 'tr' and tag.parent.attrs == {"data-something": "1"} #and tag.get('class') == ['text-center'] and not tag.has_attr('style')
soup.find_all(KieMeWar)
The result is an empty set. What is wrong or how to reach the target I am aiming for with easiest solution?
P.S. This is an example taken from the full code, which is why I check for the absence of style: it does not appear here, but it does appear later.
BeautifulSoup's find_all (findAll in older code) accepts an attrs keyword argument, which matches tags by attribute:
import bs4
soup = bs4.BeautifulSoup(html, 'html.parser')  # html holds the markup from the question
trs = soup.find_all('tr', attrs={'data-something': '1'})
That finds all tr tags with the attribute data-something="1". Afterwards, you can loop through trs and take the second td tag of each row to extract the date:
for t in trs:
    print(t.find_all('td')[1].text)
Output:
08/26 15:00
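As for why the function in the question returns nothing: tag.parent.attrs is a dict of all attributes on the <tr> (class, something_id, and so on), so comparing it for equality with {"data-something": "1"} can never be true here. Checking the single attribute with .get() should work. A minimal sketch, using a trimmed copy of the rows from the question (the function name is_wanted_date_cell is made up):

```python
from bs4 import BeautifulSoup

# Trimmed copy of the two rows from the question.
html = """
<table>
<tr class=" " data-something="1" something_id="6something0">
  <td class="text-center td_something"><div>Super String of Something</div></td>
  <td class="text-center">08/26 15:00</td>
  <td class="text-center something_status"><span>Full</span></td>
</tr>
<tr class=" " data-something="0" something_id="6something7">
  <td class="text-center td_something"><div>Super String of Something</div></td>
  <td class="text-center">05/26 15:00</td>
  <td class="text-center something_status"><span></span></td>
</tr>
</table>
"""


def is_wanted_date_cell(tag):
    """Match <td class="text-center"> cells whose parent row has data-something="1"."""
    return (
        tag.name == "td"
        and tag.get("class") == ["text-center"]  # exactly this one class
        and tag.parent is not None
        and tag.parent.name == "tr"
        # Compare one attribute, not the whole attrs dict.
        and tag.parent.get("data-something") == "1"
    )


soup = BeautifulSoup(html, "html.parser")
dates = [td.get_text(strip=True) for td in soup.find_all(is_wanted_date_cell)]
print(dates)
```

This keeps the single-pass filter style from the question; the extra style check mentioned there can be added back as another condition.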

Beautifulsoup Unable to Find Classes with Hyphens in Their Name

I am using BeautifulSoup4 on a MacOSX running Python 2.7.8. I am having difficulty extracting information from the following html code
<tbody tabindex="0" class="yui-dt-data" id="yui_3_5_0_1_1408418470185_1650">
<tr id="yui-rec0" class="yui-dt-first yui-dt-even">
<td headers="yui-dt0-th-rank" class="rank yui-dt0-col-rank"></td>
</tr>
<tr id="yui-rec1" class="yui-dt-odd">...</tr>
<tr id="yui-rec2" class="yui-dt-even">...</tr>
</tbody>
I can't seem to grab the table or any of its contents because BS and/or Python doesn't seem to recognize values with hyphens. So the usual code, something like
Table = soup.find('tbody',{'class':'yui-dt-data'})
or
Row2 = Table.find('tr',{'id':'yui-rec2'})
just returns an empty object (not None, simply empty). I'm not new to BS4 or Python and I've extracted information from this site before, but the class names are different now than when I previously did it. Now everything has hyphens. Is there any way to get Python to recognize the hyphen, or is there a workaround?
I need to have my code be general so that I can run it across numerous pages that all have the same class name. Unfortunately, the id attribute in <tbody> is unique to that particular table, so I can't use that to identify this table across webpages.
Any help would be appreciated. Thanks in advance.
The following code:
from bs4 import BeautifulSoup
htmlstring = """ <tbody tabindex="0" class="yui-dt-data" id="yui_3_5_0_1_1408418470185_1650">
<tr id="yui-rec0" class="yui-dt-first yui-dt-even">
<tr id="yui-rec1" class="yui-dt-odd">
<tr id="yui-rec2" class="yui-dt-even">"""
soup = BeautifulSoup(htmlstring, 'html.parser')  # parser named explicitly to avoid the "no parser specified" warning
Table = soup.find('tbody', attrs={'class': 'yui-dt-data'})
print("Table:\n")
print(Table)
tr = Table.find('tr', attrs={'class': 'yui-dt-odd'})
print("tr:\n")
print(tr)
outputs:
Table:
<tbody class="yui-dt-data" id="yui_3_5_0_1_1408418470185_1650" tabindex="0">
<tr class="yui-dt-first yui-dt-even" id="yui-rec0">
<tr class="yui-dt-odd" id="yui-rec1">
<tr class="yui-dt-even" id="yui-rec2"></tr></tr></tr></tbody>
tr:
<tr class="yui-dt-odd" id="yui-rec1">
<tr class="yui-dt-even" id="yui-rec2"></tr></tr>
Even though the html you supplied isn't by itself valid, it seems that BS is making a guess about how it should be, because soup.prettify() yields
<tbody class="yui-dt-data" id="yui_3_5_0_1_1408418470185_1650" tabindex="0">
<tr class="yui-dt-first yui-dt-even" id="yui-rec0">
<tr class="yui-dt-odd" id="yui-rec1">
<tr class="yui-dt-even" id="yui-rec2">
</tr>
</tr>
</tr>
</tbody>
Though I'm guessing those tr's aren't supposed to be nested.
Could you try running that exact code and seeing what the output is?
For people trying to find a tag with a hyphen in its attribute names, there is an answer in the documentation:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#the-keyword-arguments
This code raises an error, because a keyword argument name cannot contain a hyphen:
data_soup = BeautifulSoup('<div data-foo="value">foo!</div>', 'html.parser')
data_soup.find_all(data-foo="value")
# SyntaxError: keyword can't be an expression
Instead, pass the attribute in the attrs dictionary:
data_soup.find_all(attrs={"data-foo": "value"})
# [<div data-foo="value">foo!</div>]
Just use select (bs4 4.7.1):
from bs4 import BeautifulSoup as bs
html = '''
<tbody tabindex="0" class="yui-dt-data" id="yui_3_5_0_1_1408418470185_1650">
<tr id="yui-rec0" class="yui-dt-first yui-dt-even">
<td headers="yui-dt0-th-rank" class="rank yui-dt0-col-rank"></td>
</tr>
<tr id="yui-rec1" class="yui-dt-odd">...</tr>
<tr id="yui-rec2" class="yui-dt-even">...</tr>
</tbody>
'''
soup = bs(html, 'lxml')
soup.select('.yui-dt-data')
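To tie the answers together: hyphens in attribute values are not a problem for either find or select. Only hyphenated attribute names need the attrs dictionary. A small sketch on the markup from the question, using the built-in html.parser (the surrounding <table> is added here only to make the fragment complete):

```python
from bs4 import BeautifulSoup

# Markup from the question, wrapped in a <table> for completeness.
html = """
<table>
<tbody tabindex="0" class="yui-dt-data" id="yui_3_5_0_1_1408418470185_1650">
  <tr id="yui-rec0" class="yui-dt-first yui-dt-even">
    <td headers="yui-dt0-th-rank" class="rank yui-dt0-col-rank"></td>
  </tr>
  <tr id="yui-rec1" class="yui-dt-odd"><td>...</td></tr>
  <tr id="yui-rec2" class="yui-dt-even"><td>...</td></tr>
</tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Hyphenated values work with the class_ and id keyword arguments.
body = soup.find("tbody", class_="yui-dt-data")
row2 = body.find("tr", id="yui-rec2")

# The same lookup with CSS selectors; class_ and .class both match
# any one class of a multi-class attribute.
even_rows = soup.select("tbody.yui-dt-data > tr.yui-dt-even")
even_ids = [tr["id"] for tr in even_rows]

print(row2["class"], even_ids)
```

If these lookups still come back empty on the live page, the likelier cause is that the table is built by JavaScript after the page loads, so the HTML you downloaded does not contain it at all. That is a guess, since the question does not show how the page was fetched.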
