Some <td>'s Cannot Be Found by find_next() - python

This is a question about scraping with BS4. The website I'm scraping has barely any IDs on the elements I need, so I'm relying on find_next(), find_next_siblings(), or other iterator-style BS4 methods.
The thing is, I used find_next() to get some td values from my tables. It works for some values, but for others it can't find anything.
Here's the html:
<table style="max-width: 350px;" border="0">
<tbody><tr>
<td style="max-width: 215px;">REF. NO.</td>
<td style="max-width: 12px;" align="center"> </td>
<td align="right">000124 </td>
</tr>
<tr>
<td>REF. NO.</td>
<td align="center"> </td>
<td align="right"> </td>
</tr>
<tr>
<td>MANU</td>
<td align="center"> </td>
<td align="right"></td>
</tr>
<tr>
<td>STREAK</td>
<td align="center"> </td>
<td align="right">1075</td>
</tr>
<tr>
<td>PACK</td>
<td align="center"> </td>
<td align="right">1</td>
</tr>
<tr>
<td colspan="3">ON STOCK. </td>
</tr>
.... and so on
So I used this code to get what I want:
div = soup.find('div', {'id': 'infodata'})
table_data = div.find_all('td')
for element in table_data:
    if "STREAK" in element.get_text():
        price = element.find_next('td').find_next('td').text
        print(price + "price")
    else:
        print('NOT FOUND!')
I actually copied and pasted text from the HTML many times to make sure I didn't mistype anything, but it still always falls through to NOT FOUND!. Yet if I try other table labels I can get them; PACK, for example, works.
By the way, I'm using two find_next() calls there because the HTML has three td's in every <tr>.
Why does this work for some words but not for others? Any help is appreciated. Thank you very much!

I would rewrite it like this:
trs = div.find_all('tr')
for tr in trs:
    tds = tr.select('td')
    if len(tds) > 1 and 'STREAK' in tds[0].get_text().strip():
        price = tds[-1].get_text().strip()
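For reference, here is a minimal, self-contained sketch of that approach run against a trimmed copy of the table from the question (the infodata div id is taken from your own code); iterating row by row avoids chaining find_next() across sibling cells:
from bs4 import BeautifulSoup

# Trimmed copy of the table from the question, wrapped in the infodata div used in the question's code.
html = """
<div id="infodata"><table>
<tr><td>STREAK</td><td align="center"> </td><td align="right">1075</td></tr>
<tr><td>PACK</td><td align="center"> </td><td align="right">1</td></tr>
</table></div>
"""

soup = BeautifulSoup(html, 'html.parser')
div = soup.find('div', {'id': 'infodata'})
for tr in div.find_all('tr'):
    tds = tr.select('td')
    if len(tds) > 1 and 'STREAK' in tds[0].get_text().strip():
        print(tds[-1].get_text().strip())  # prints: 1075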

Related

How to grab the number that ends with "0" from the website?

I use BeautifulSoup to grab some text from the URL "https://nature.altmetric.com/details/114136890", and get a response like this:
# The table is called twitterGeographical_TableChoice
<table>
<tr>
<th>Country</th>
<th class="num">Count</th>
<th class="num percent">As %</th>
</tr>
<tr>
<td>Japan</td>
<td class="num">3</td>
<td class="num">12%</td>
</tr>
<tr>
<td>Poland</td>
<td class="num">3</td>
<td class="num">12%</td>
</tr>
<tr>
<td>Spain</td>
<td class="num">3</td>
<td class="num">12%</td>
</tr>
<tr>
<td>El Salvador</td>
<td class="num">2</td>
<td class="num">8%</td>
</tr>
<tr>
<td>Ecuador</td>
<td class="num">1</td>
<td class="num">4%</td>
</tr>
<tr>
<td>Mexico</td>
<td class="num">1</td>
<td class="num">4%</td>
</tr>
<tr>
<td>Chile</td>
<td class="num">1</td>
<td class="num">4%</td>
</tr>
<tr>
<td>India</td>
<td class="num">1</td>
<td class="num">4%</td>
</tr>
<tr class="meta">
<td>Unknown</td>
<td class="num">10</td>
<td class="num">40%</td>
</tr>
</table>
Then I want to get the numbers from it. I use a regular expression to do that.
My pattern is
twitterGeographical_Table_Num_pattern = re.compile('<td class=\"num\">(\d*%)</td>', re.S)
twitterGeographical_Table_Num = twitterGeographical_Table_Num_pattern.findall(twitterGeographical_TableChoice)
But I can only get 4% instead of 40%. I am puzzled. Thanks for your help!
I am not sure why you want to extract the numbers with the regex module when BeautifulSoup already offers plenty of ways to do this. Anyway, if you are interested in regex, you can use this pattern instead:
<td class=\"num\">((\d+)(%)?)</td>
Then you can get the numbers (as percentages where applicable) using the code below:
[x[0] for x in twitterGeographical_Table_Num]
Output
['10', '40%']
Side note: please consider giving the variables shorter, clearer names! :)
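If you do go the BeautifulSoup route instead of regex, a minimal sketch would look like the following (the table string is a trimmed copy of the one in the question):
from bs4 import BeautifulSoup

# Trimmed copy of the table from the question.
twitterGeographical_TableChoice = """<table>
<tr><td>Japan</td><td class="num">3</td><td class="num">12%</td></tr>
<tr class="meta"><td>Unknown</td><td class="num">10</td><td class="num">40%</td></tr>
</table>"""

soup = BeautifulSoup(twitterGeographical_TableChoice, 'html.parser')
# Every data cell carries class="num"; the percentages are the values ending in '%'.
numbers = [td.get_text(strip=True) for td in soup.find_all('td', class_='num')]
counts = [n for n in numbers if not n.endswith('%')]    # ['3', '10']
percentages = [n for n in numbers if n.endswith('%')]   # ['12%', '40%']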

How to scrape a specific HTML line that follows another HTML line

I want to scrape some data from an HTML page that looks something like this:
<tr>
<td> Some information <td>
<td> 123 </td>
</tr>
<tr>
<td> some other information </td>
<td> 456 </td>
</tr>
<tr>
<td> and the info continues </td>
<td> 789 </td>
</tr>
What I want is to obtain the HTML line that comes after a given line. That is, if I see 'some other information' I want the output '456'. I thought of combining regex with .find_next from BeautifulSoup, but I haven't had any luck with this (I'm also not that familiar with regex). Does anyone have a clue how to do it? Thanks a lot in advance.
Actually with a mix of regex and find_next in BeautifulSoup you can achieve what you want:
from bs4 import BeautifulSoup
import re
html = """
<tr>
<td> Some information <td>
<td> 123 </td>
</tr>
<tr>
<td> some other information </td>
<td> 456 </td>
</tr>
<tr>
<td> and the info continues </td>
<td> 789 </td>
</tr>
"""
soup = BeautifulSoup(html)
x = soup.find('td', text = re.compile('some other information'))
print(x.find_next('td').text)
Output
' 456 '
EDIT: replaced x.find_next('td').contents[0] with x.find_next('td').text, which is shorter.
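If you want '456' without the surrounding whitespace, a small follow-up using the x from the snippet above is:
print(x.find_next('td').get_text(strip=True))  # '456'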

Extract 2 pieces of information from html in python

I need help figuring out how to extract Grab and the number following data-b. There are many <tr>s in the complete, unmodified webpage, and I need to filter using the "Need" that appears just before </a>. I've been trying to do this with BeautifulSoup, though it looks like lxml might work better. I can get either all of the <tr>s, or only the <a>...</a> lines that contain Need, but not just the <tr>s that contain Need in that <a> line.
<tr >
<td>3</td>
<td>Leave</td><td>Useless</td>
<td class="text-right"> <span class="float2" data-a="24608000.0" data-b="518" data-n="818">Garbage</span></td>
<td class="text-right"> <span class="Float" data-a="3019" data-b="0.0635664" data-n="283">Garbage2</span></td>
<td class="text-right">7.38%</td>
<td class="text-right " >Recently</td>
</tr>
<tr >
<td>4</td>
<td>Grab</td><td>Need</td>
<td class="text-right"> <span class="bloat2" data="22435000.0" data-b="512" data-n="74491.2">More junk</span></td>
<td class="text-right"> <span class="bloat" data-a="301.177" data-b="35.848" data-n="0.5848">More junk2</span></td>
<td class="text-right">Some more</td>
<td class="text-right " >Recently</td>
</tr>
Thanks for any help!
from bs4 import BeautifulSoup
data = '''<tr>
<td>3</td>
<td>Leave</td><td>Useless</td>
<td class="text-right"> <span class="float2" data-a="24608000.0" data-b="518" data-n="818">Garbage</span></td>
<td class="text-right"> <span class="Float" data-a="3019" data-b="0.0635664" data-n="283">Garbage2</span></td>
<td class="text-right">7.38%</td>
<td class="text-right " >Recently</td>
</tr>
<tr>
<td>4</td>
<td>Grab</td><td>Need</td>
<td class="text-right"> <span class="bloat2" data="22435000.0" data-b="512" data-n="74491.2">More junk</span></td>
<td class="text-right"> <span class="bloat" data-a="301.177" data-b="35.848" data-n="0.5848">More junk2</span></td>
<td class="text-right">Some more</td>
<td class="text-right " >Recently</td>
</tr>
'''
soup = BeautifulSoup(data)
print(soup.findAll('a',{"href":"/local" })[0].text)
for a in soup.findAll('span',{"class":["bloat","bloat2"]}):
    print(a['data-b'])
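Note that the snippet in the question contains no <a> tags, so the href="/local" lookup above only applies to the full page. Working purely from the rows shown, a hedged alternative is to filter each <tr> by whether one of its cells reads Need and then pull data-b from the spans in that row:
soup = BeautifulSoup(data, 'html.parser')  # data is the HTML string defined just above
for tr in soup.find_all('tr'):
    cells = [td.get_text(strip=True) for td in tr.find_all('td')]
    if 'Need' in cells:                     # keep only the rows that mention Need
        print(cells[1])                     # 'Grab'
        for span in tr.find_all('span'):
            if span.has_attr('data-b'):
                print(span['data-b'])       # '512', then '35.848'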

BeautifulSoup scraping nested tables

I have been trying to scrape data from a website that uses a good number of tables. I have been reading the BeautifulSoup documentation as well as Stack Overflow here, but I am still lost.
Here is the said table:
<form action="/rr/" class="form">
<table border="0" width="100%" cellpadding="2" cellspacing="0" align="left">
<tr bgcolor="#6699CC">
<td valign="top"><font face="arial"><b>Useless Data</b></font></td>
<td width="10%"><br /></td>
<td align="right"><font face="arial">Useless Data</font></td>
</tr>
<tr bgcolor="#DCDCDC">
<td> <input size="12" name="s" value="data:" onfocus=
"this.value = '';" /> <input type="hidden" name="d" value="research" />
<input type="submit" value="Date" /></td>
<td width="10%"><br /></td>
</tr>
</table>
</form>
<table border="0" width="100%">
<tr>
<td></td>
</tr>
</table><br />
<br />
<table border="0" width="100%">
<tr>
<td valign="top" width="99%">
<table cellpadding="2" cellspacing="0" border="0" width="100%">
<tr bgcolor="#A0B8C8">
<td colspan="6"><b>Data to be pulled</b></td>
</tr>
<tr bgcolor="#DCDCDC">
<td><font face="arial"><b>Data to be pulled</b></font></td>
<td><font face="arial"><b>Data to be pulled</b></font></td>
<td align="center"><font face="arial"><b>Data to be pulled
</b></font></td>
<td align="center"><font face="arial"><b>Data to be pulled
</b></font></td>
<td align="center"><font face="arial"><b>Data to be pulled
</b></font></td>
<td align="center"><font face="arial"><b>Data to be pulled
</b></font></td>
</tr>
<tr>
<td>Data to be pulled</td>
<td align="center">Data to be pulled</td>
<td align="center">Data to be pulled</td>
<td align="center">Data to be pulled</td>
<td align="center"><br /></td>
</tr>
</table>
</td>
</tr>
</table>
There are quite a few tables, and none of them really have any distinguishing ids or attributes. My most recent attempt was:
table = soup.find('table', attrs={'border': '0', 'width': '100%'})
which pulls only the first, empty table. I feel like the answer is simple and I am overthinking it.
If you're just looking for all of the tables, rather than the first one, you just want find_all instead of find.
If you're trying to find a particular table, like the one nested inside another one, and the page is using a 90s-style design that makes it impossible to find it via id or other attrs, the only option is to search by structure:
for table in soup.find_all('table'):
    for subtable in table.find_all('table'):
        pass  # Found it!
And of course you can flatten this into a single comprehension if you really want to:
subtable = next(subtable for table in soup.find_all('table')
                for subtable in table.find_all('table'))
Notice that I left off the attrs. If every table on the page has a superset of the same attrs, you aren't helping anything by specifying them.
This whole thing is obviously ugly and brittle… but there's really no way not to be brittle with this kind of layout.
Using a different library, like lxml.html, that lets you search by XPath might make it a little more compact, but it's ultimately going to be doing the same thing.
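For completeness, here is a minimal sketch of the lxml.html/XPath version mentioned above, using a trimmed stand-in for the page:
from lxml import html

# Trimmed stand-in for the real page: an outer table with another table nested inside it.
page = """
<table border="0" width="100%">
<tr><td>
<table cellpadding="2" cellspacing="0" border="0" width="100%">
<tr><td>Data to be pulled</td><td>More data</td></tr>
</table>
</td></tr>
</table>
"""

tree = html.fromstring(page)
# '//table//table' selects any table that sits inside another table.
inner = tree.xpath('//table//table')[0]
rows = [[td.text_content().strip() for td in tr.findall('td')]
        for tr in inner.findall('.//tr')]
print(rows)  # [['Data to be pulled', 'More data']]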

BeautifulSoup Parsing with Bad HTML Tables

I'm trying to parse tables similar to the following with BeautifulSoup to extract the name, age, and position for each person.
<TABLE width="100%" align="center" cellspacing="0" cellpadding="0" border="0">
<TR>
<TD></TD>
<TD></TD>
<TD align="center" nowrap colspan="3"><FONT size="2"><B>Age as of</B></FONT></TD>
<TD></TD>
<TD></TD>
</TR>
<TR>
<TD align="center" nowrap><FONT size="2"><B>Name</B></FONT></TD>
<TD></TD>
<TD align="center" nowrap colspan="3"><FONT size="2"><B>November 1, 1999</B></FONT></TD>
<TD></TD>
<TD align="center" nowrap><FONT size="2"><B>Position</B></FONT></TD>
</TR>
<TR>
<TD align="center" nowrap><HR size="1"></TD>
<TD></TD>
<TD align="center" nowrap colspan="3"><HR size="1"></TD>
<TD></TD>
<TD align="center" nowrap><HR size="1"></TD>
</TR>
<TR>
<TD align="left" valign="top"><FONT size="2">
Terry S. Jacobs</FONT></TD>
<TD></TD>
<TD></TD>
<TD align="right" valign="top" nowrap><FONT size="2">57</FONT></TD>
<TD></TD>
<TD></TD>
<TD align="left" valign="top"><FONT size="2">
Chairman of the Board, Chief Executive Officer, Treasurer and
director</FONT></TD>
</TR>
<TR><TD><TR><TD><TR><TD><TR><TD>
<TR>
<TD align="left" valign="top"><FONT size="2">
William L. Stakelin</FONT></TD>
<TD></TD>
<TD></TD>
<TD align="right" valign="top" nowrap><FONT size="2">56</FONT></TD>
<TD></TD>
<TD></TD>
<TD align="left" valign="top"><FONT size="2">
President, Chief Operating Officer, Secretary and director</FONT></TD>
</TR>
<TR><TD><TR><TD><TR><TD><TR><TD>
<TR>
<TD align="left" valign="top"><FONT size="2">
Joel M. Fairman</FONT></TD>
<TD></TD>
<TD></TD>
<TD align="right" valign="top" nowrap><FONT size="2">70</FONT></TD>
<TD></TD>
<TD></TD>
<TD align="left" valign="top"><FONT size="2">
Vice Chairman and director</FONT></TD>
</TR>
</TABLE>
My current attempt is as follows:
soup = BeautifulSoup(in_file)
out = []
headers = soup.findAll(['td','th'])
for header in headers:
    if header.find(text = re.compile(r"^age( )?", re.I)):
        out.append(header)
table = out[0].find_parent("table")
rows = table.findAll('tr')
filter_regex = re.compile(r'[\w][\w .,]*', re.I)
data = [[td.find(text=filter_regex) for td in tr.findAll("td")] for tr in rows]
Things work fine for the first person, but the bad <tr><td><tr><td>... lines really mess things up from there. I am trying to do this for a few thousand HTML files, each with a slightly different table structure. That said, this pattern of <tr> and <td> tags not being closed appears quite common across the files.
Does anyone have thoughts on how to generalize the above parsing to work with tables that contain constructs such as these? Thanks a lot!
You can take advantage of the fact that the valign attribute is set to top in all of the fields you'd like to keep and none of the ones you don't:
soup = BeautifulSoup(in_file)
cells = [cell.text.strip() for cell in soup('td', valign='top')]
Then you can sort this list of cells into a two-dimensional structure. There are three cells per entry, so you can sort it out pretty simply by doing something like this:
entries = []
for i in range(0, len(cells), 3):
    entries.append(cells[i:i+3])
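After that loop, each entry holds the three cells for one person; roughly (the exact internal whitespace depends on the parser):
print(entries[0])
# ['Terry S. Jacobs', '57', 'Chairman of the Board, Chief Executive Officer, Treasurer and director']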
On the off chance anyone else gets stuck with this issue and stumbles in here: the modern solution is to change which parser you are using. The default parser, 'html.parser', is pretty good when working with reasonably clean HTML with properly closed tags, but the moment you have to deal with edge cases (like Example 1 below, which is similar to the OP's issue), that goes right out the window even 8 years later (Example 2 below).
In the documentation for BeautifulSoup4 (current version 4.9.3), there is a section detailing parser selection: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser
Example 1, the raw HTML:
<TABLE >
<TR VALIGN="top">
<td> <td><b>Title:</b>
<td> title is here <i>-subtitle</i><br>
<TR VALIGN="top">
<td>
<td><b>Date:</b>
<td> Thursday , August 27th, 2020
<TR VALIGN="top">
<td> <td><b>Type:</b>
<td> 61
<TR VALIGN="top">
<td>
<td><b>Status:</b>
<td> ACTIVE - ACTIVE
</TABLE>
Example 2, results when using BeautifulSoup(html, 'html.parser'):
<table>
<tr valign="top">
<td> <td><b>Title:</b>
<td> title is here <i>-subtitle</i><br/>
<tr valign="top">
<td>
<td><b>Date:</b>
<td> Thursday , August 27th, 2020
<tr valign="top">
<td> <td><b>Type:</b>
<td> 61
<tr valign="top">
<td>
<td><b>Status:</b>
<td> ACTIVE - ACTIVE
</td></td></td></tr></td></td></td></tr></td></td></td></tr></td></td></td></tr></table>
Example 3, results when using BeautifulSoup(html, 'html5lib'):
<table>
<tbody><tr valign="top">
<td> </td><td><b>Title:</b>
</td><td> title is here <i>-subtitle</i><br/>
</td></tr><tr valign="top">
<td>
</td><td><b>Date:</b>
</td><td> Thursday , August 27th, 2020
</td></tr><tr valign="top">
<td> </td><td><b>Type:</b>
</td><td> 61
</td></tr><tr valign="top">
<td>
</td><td><b>Status:</b>
</td><td> ACTIVE - ACTIVE
</td></tr></tbody></table>
There are also parsers written externally in C, such as 'lxml', that you could potentially use; according to the documentation they are much faster.
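As a hedged illustration of why the parser choice matters, here is a minimal sketch that feeds a trimmed copy of the Example 1 markup to the html5lib parser (installed separately with pip install html5lib) and reads the repaired rows back out:
from bs4 import BeautifulSoup

# Trimmed copy of the Example 1 markup, with its unclosed <tr>/<td> tags left as-is.
html = """
<TABLE >
<TR VALIGN="top">
<td> <td><b>Title:</b>
<td> title is here <i>-subtitle</i><br>
<TR VALIGN="top">
<td>
<td><b>Date:</b>
<td> Thursday , August 27th, 2020
</TABLE>
"""

soup = BeautifulSoup(html, 'html5lib')  # html5lib closes the cells the way a browser would
for tr in soup.find_all('tr'):
    tds = tr.find_all('td')
    if len(tds) >= 3:
        # tds[0] is the empty spacer cell, tds[1] the label, tds[2] the value
        print(tds[1].get_text(strip=True), tds[2].get_text(' ', strip=True))
# Title: title is here -subtitle
# Date: Thursday , August 27th, 2020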
