BeautifulSoup scraping nested tables - python

I have been trying to scrape data from a website that uses a lot of tables. I have been reading the BeautifulSoup documentation as well as Stack Overflow, but I am still lost.
Here is the table in question:
<form action="/rr/" class="form">
<table border="0" width="100%" cellpadding="2" cellspacing="0" align="left">
<tr bgcolor="#6699CC">
<td valign="top"><font face="arial"><b>Uesless Data</b></font></td>
<td width="10%"><br /></td>
<td align="right"><font face="arial">Uesless Data</font></td>
</tr>
<tr bgcolor="#DCDCDC">
<td> <input size="12" name="s" value="data:" onfocus=
"this.value = '';" /> <input type="hidden" name="d" value="research" />
<input type="submit" value="Date" /></td>
<td width="10%"><br /></td>
</tr>
</table>
</form>
<table border="0" width="100%">
<tr>
<td></td>
</tr>
</table><br />
<br />
<table border="0" width="100%">
<tr>
<td valign="top" width="99%">
<table cellpadding="2" cellspacing="0" border="0" width="100%">
<tr bgcolor="#A0B8C8">
<td colspan="6"><b>Data to be pulled</b></td>
</tr>
<tr bgcolor="#DCDCDC">
<td><font face="arial"><b>Data to be pulled</b></font></td>
<td><font face="arial"><b>Data to be pulled</b></font></td>
<td align="center"><font face="arial"><b>Data to be pulled
</b></font></td>
<td align="center"><font face="arial"><b>Data to be pulled
</b></font></td>
<td align="center"><font face="arial"><b>Data to be pulled
</b></font></td>
<td align="center"><font face="arial"><b>Data to be pulled
</b></font></td>
</tr>
<tr>
<td>Data to be pulled</td>
<td align="center">Data to be pulled</td>
<td align="center">Data to be pulled</td>
<td align="center">Data to be pulled</td>
<td align="center"><br /></td>
</tr>
</table>
</td>
</tr>
</table>
There are quite a few tables, and none of them have any distinguishing ids or tags. My most recent attempt was:
table = soup.find('table', attrs={'border': '0', 'width': '100%'})
This pulls only the first empty table. I feel like the answer is simple and I am overthinking it.

If you're just looking for all of the tables, rather than the first one, you just want find_all instead of find.
If you're trying to find a particular table, like the one nested inside another one, and the page is using a 90s-style design that makes it impossible to find it via id or other attrs, the only option is to search by structure:
for table in soup.find_all('table'):
    for subtable in table.find_all('table'):
        # Found it!
        print(subtable)
And of course you can flatten this into a single comprehension if you really want to:
subtable = next(subtable for table in soup.find_all('table')
                for subtable in table.find_all('table'))
Notice that I left off the attrs. If every table on the page has a superset of the same attrs, you aren't helping anything by specifying them.
This whole thing is obviously ugly and brittle… but there's really no way not to be brittle with this kind of layout.
Using a different library, like lxml.html, that lets you search by XPath might make it a little more compact, but it's ultimately going to be doing the same thing.
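As a rough illustration of the structural search described above, here is a minimal sketch run against a trimmed-down copy of the question's markup. The `html` string, the cell texts, and the `nested_tables` name are made up for the example, and it assumes bs4 with the built-in `html.parser`. It treats a table as "nested" when it has another table as an ancestor.

```python
from bs4 import BeautifulSoup

# Trimmed-down stand-in for the page in the question (assumed, not the real page).
html = """
<table border="0" width="100%"><tr><td></td></tr></table>
<table border="0" width="100%">
  <tr><td valign="top" width="99%">
    <table cellpadding="2" cellspacing="0" border="0" width="100%">
      <tr><td colspan="6"><b>Header</b></td></tr>
      <tr><td>Cell A</td><td align="center">Cell B</td></tr>
    </table>
  </td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# A table counts as nested if some other table sits above it in the tree.
nested_tables = [t for t in soup.find_all("table") if t.find_parent("table")]

# Collect the text of every cell, row by row, from the first nested table.
rows = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in nested_tables[0].find_all("tr")
]
print(rows)
```

This is just as brittle as the loop version: if the page gains a second nested table, `nested_tables[0]` may no longer be the one you want.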

Python Selenium: Finding elements by xpath when there are duplicates in the html code

So I'm working in Selenium to scrape content from a liquor sales website in order to more quickly add product details to a spreadsheet. I'm using Selenium to log into the website and search for the correct product. Once I get to the product page I'm able to scrape all the data I need except for some data that's contained in a certain block of the code.
I'm wanting 3 pieces of data: price per case, price per bottle, and price per oz. I noticed in the code that the data I'm looking for appears twice in a similar pattern. Interestingly, the correct data that I want is the second occurrence of the data (the first occurrence is incorrect). The relevant HTML code is:
<h2>Pricing</h2>
<div id="prices-table">
<div class="table-responsive">
<table class="table table-condensed auto-width">
<thead>
<tr>
<th></th>
<th class="best-bottle-top">
Frontline
</th>
</tr>
</thead>
<tbody>
<tr>
<td>Bottles</td>
<td class="best-bottle-mid">1</td>
</tr>
<tr>
<td>Cases</td>
<td class="best-bottle-mid">—</td>
</tr>
<tr>
<td>Price per bottle</td>
<td class="best-bottle-mid">
<div>$16.14 #I don't want this data </div>
</td>
</tr>
<tr>
<td>Price per case</td>
<td class="best-bottle-mid">
<div>
$193.71 #I don't want this data
</div>
</td>
</tr>
<tr>
<td>Cost per ounce</td>
<td class="best-bottle-mid">
<div>$1.27 #I don't want this data </div>
</td>
</tr>
<tr>
<td></td>
<td class="best-bottle-bot text-muted">
<span class="best-bottle-bot-content">
<span>
<div><small>Best</small></div>
<small>Bottle</small>
</span>
</span>
</td>
</tr>
</tbody>
</table>
</div>
<p>
<em class="price-disclaimer">Defer to Athens Distributing Company of Tennessee in case of any price discrepancies.</em>
</p>
</div>
</div>
<hr class="visible-print-block">
<div class="tab-pane active" id="3400355">
<dl class="dl-horizontal vpv-row">
<dt>Sizing</dt><dd>750 mL bottle × 6</dd>
<dt>SKU</dt><dd>80914</dd>
<dt>UPC</dt><dd>853192006189</dd>
<dt>Status</dt><dd>Active</dd>
<dt>Availability</dt><dd>
<span class="label label-success inventory-status-badge"><span data-container="body" data-toggle="popover" data-placement="top" data-content="Athens Distributing Company of Tennessee is integrated with SevenFifty and sends inventory levels at least once a day. You can order this item and expect that it is available." data-original-title="" title="">IN STOCK</span></span>
</dd></dl>
<div id="prices-table">
<div class="table-responsive">
<table class="table table-condensed auto-width">
<h2>Pricing</h2><thead>
<tr>
<th></th>
<th class="best-bottle-top">
Frontline
</th>
</tr>
</thead>
<tbody>
<tr>
<td>Bottles</td>
<td class="best-bottle-mid">1</td>
</tr>
<tr>
<td>Cases</td>
<td class="best-bottle-mid">—</td>
</tr>
<tr>
<td>Price per bottle</td>
<td class="best-bottle-mid">
<div>$33.03 #I want THIS data </div>
</td>
</tr>
<tr>
<td>Price per case</td>
<td class="best-bottle-mid">
<div>
$198.18 I want THIS data
</div>
</td>
</tr>
<tr>
<td>Cost per ounce</td>
<td class="best-bottle-mid">
<div>$1.30 I want THIS data </div>
</td>
</tr>
<tr>
<td></td>
<td class="best-bottle-bot text-muted">
<span class="best-bottle-bot-content">
<span>
<div><small>Best</small></div>
<small>Bottle</small>
</span>
</span>
</td>
</tr>
</tbody>
</table>
</div>
Using the full XPath that Chrome gives me finds what I want, but a relative path does not work. Here's what I've tried:
Full XPath for case price (works, but I don't want to use absolute references):
/html/body/div[3]/div[1]/div/div[2]/div[2]/div[2]/div/div[3]/div[2]/div[3]/div[2]/div/div/table/tbody/tr[4]/td[2]/div
Relative XPath for case price (returns None):
//*[@id="prices-table"]/div/table/tbody/tr[4]/td[2]/div
Unfortunately I can't link the actual webpage because it requires login credentials. Thanks for any/all help.
There are two ways to do it.
If the tags and their attributes are identical in both blocks, use XPath indexing.
//td[text()='Price per bottle']/following-sibling::td[@class='best-bottle-mid']
This matches two nodes, and find_element returns only the first occurrence, which is the one you do not want. So you can do this:
(//td[text()='Price per bottle']/following-sibling::td[@class='best-bottle-mid'])[2]
to locate the second web element. You can do the same for Price per case and Cost per ounce.
The other way is to use find_elements:
from selenium.webdriver.common.by import By
price_per_bottle_elements = driver.find_elements(By.XPATH, "//td[text()='Price per bottle']/following-sibling::td[@class='best-bottle-mid']")
print(price_per_bottle_elements[0].text) # this we do not want.
print(price_per_bottle_elements[1].text) # this we want.
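To avoid writing the indexed expression three times by hand, the XPath strings can be generated. The sketch below is only string building, so it runs without a browser. The helper name `nth_value_cell_xpath` is hypothetical, and it assumes the label text contains no single quotes.

```python
def nth_value_cell_xpath(label: str, occurrence: int) -> str:
    """Build an XPath for the value cell next to the `occurrence`-th label cell.

    The outer parentheses matter: the index then applies to the whole
    result set rather than to each td's position among its siblings.
    XPath positions start at 1, not 0.
    """
    base = f"//td[text()='{label}']/following-sibling::td[@class='best-bottle-mid']"
    return f"({base})[{occurrence}]"


labels = ["Price per bottle", "Price per case", "Cost per ounce"]

# The question wants the second occurrence of each value on the page.
xpaths = {label: nth_value_cell_xpath(label, 2) for label in labels}

for label, xpath in xpaths.items():
    print(label, "->", xpath)
```

Each string can then be passed to `driver.find_element(By.XPATH, xpath)`, where `driver` is the WebDriver instance from the question.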

Some <td>'s Cannot Be Found by find_next()

This is a question about using BS4 for scraping. I am scraping a website that has barely any IDs on the elements I need to extract, so I'm hellbent on using find_next, find_next_siblings, or any other iterator-style BS4 methods.
I used find_next() to get some td values from my tables. It worked for some values, but for some reason it can't detect the others.
Here's the HTML:
<table style="max-width: 350px;" border="0">
<tbody><tr>
<td style="max-width: 215px;">REF. NO.</td>
<td style="max-width: 12px;" align="center"> </td>
<td align="right">000124 </td>
</tr>
<tr>
<td>REF. NO.</td>
<td align="center"> </td>
<td align="right"> </td>
</tr>
<tr>
<td>MANU</td>
<td align="center"> </td>
<td align="right"></td>
</tr>
<tr>
<td>STREAK</td>
<td align="center"> </td>
<td align="right">1075</td>
</tr>
<tr>
<td>PACK</td>
<td align="center"> </td>
<td align="right">1</td>
</tr>
<tr>
<td colspan="3">ON STOCK. </td>
</tr>
.... and so on
So I used this code to get what I want:
div = soup.find('div', {'id': 'infodata'})
table_data = div.find_all('td')
for element in table_data:
    if "STREAK" in element.get_text():
        price = element.find_next('td').find_next('td').text
        print(price + "price")
    else:
        print('NOT FOUND!')
I copied and pasted the text from the HTML many times to make sure I didn't mistype anything, but it still always goes to NOT FOUND. If I try other labels, though, I can get them, for example PACK.
By the way, I'm using two find_next() calls there because the HTML has three td's in every <tr>.
Please, I need your help: why does this work for some words but not for others? Any help is appreciated. Thank you very much!
I would rewrite it like this:
trs = div.find_all('tr')
for tr in trs:
    tds = tr.select('td')
    if len(tds) > 1 and 'STREAK' in tds[0].get_text().strip():
        price = tds[-1].get_text().strip()
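The same row-by-row idea can be pushed a little further by building a label-to-value mapping once and looking up whatever is needed afterwards. This is a minimal sketch on a shortened copy of the question's table. The wrapping `<div id="infodata">` is assumed from the question's code, and the `values` name is made up for the example.

```python
from bs4 import BeautifulSoup

# Shortened copy of the table from the question; the wrapping div is assumed.
html = """
<div id="infodata">
<table style="max-width: 350px;" border="0">
<tbody>
<tr><td>REF. NO.</td><td align="center"> </td><td align="right">000124 </td></tr>
<tr><td>MANU</td><td align="center"> </td><td align="right"></td></tr>
<tr><td>STREAK</td><td align="center"> </td><td align="right">1075</td></tr>
<tr><td>PACK</td><td align="center"> </td><td align="right">1</td></tr>
<tr><td colspan="3">ON STOCK. </td></tr>
</tbody>
</table>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", {"id": "infodata"})

values = {}
for tr in div.find_all("tr"):
    tds = tr.find_all("td")
    if len(tds) > 1:  # skip single-cell rows such as "ON STOCK."
        label = tds[0].get_text(strip=True)
        values[label] = tds[-1].get_text(strip=True)

# Look up any label after the single pass over the rows.
print(values.get("STREAK", "NOT FOUND!"))
```

Working per row keeps each label tied to the cells of its own `<tr>`, so an empty or missing cell cannot shift the lookup into the next row. If a label repeats, as REF. NO. does in the full table, the later row overwrites the earlier one in this sketch.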

Click doesn't work on locate dynamic element which is in table

Using self.browser.find_element_by_xpath(".//td[text() = 'image 2']/following-sibling::td/input") I can locate this input element, but when I try to click it, nothing happens.
<div class="animationImage">
<table class="animationTab">
<tbody>
<tr>
<td class="deign_tab">Image List</td>
<td class="deign_tab" style="padding-left:30px;text-align:center;">Select</td>
</tr>
<tr>
<td>image 2</td>
<td>
<input type="checkbox" id="d6dea005-1b58-4890-8ea6-d561b30ba8c8" checked="checked">
</td>
</tr>
</tbody>
</table>
</div>
You can simply locate the checkbox element by its input tag, like this:
//input[@id='d6dea005-1b58-4890-8ea6-d561b30ba8c8']
or, if you have to access it relative to the table:
//table[@class='animationTab']/tbody/tr[2]/td[2]/input

How can I make a "FOR"(loop) in html, using chameleon and pyramid in python 3.4?

How can I make a loop using Chameleon and Pyramid in my HTML?
I searched but found nothing like that =/
Is it easier to use JavaScript in this case?
I use a datatable in MACADMIN (a Bootstrap theme).
<div class="table-responsive">
<table cellpadding="0" cellspacing="0" border="0" id="data-table" width="100%">
<thead>
<tr>
<th>
Rendering engine
</th>
<th>
Browser
</th>
<th>
Platform(s)
</th>
<th>
Engine version
</th>
<th>
CSS grade
</th>
</tr>
</thead>
<tbody>
Maybe put FOR here? like {for x items in "TABLE"}
<tr>
<td>
{orgao_doc[x].nome}
</td>
<td>
{orgao_doc[x].cargo}
</td>
<td>
{orgao_doc[x].coleta}
</td>
<td>
{orgao_doc[x].email}
</td>
<td>
{orgao_doc[x].endereco}
</td>
</tr>
</tbody>
</table>
<div class="clearfix">
</div>
</div>
Use a tal:repeat attribute to repeat parts of a template, given a sequence:
<tbody>
<tr tal:repeat="item orgao_doc">
<td>${item.nome}</td>
<td>${item.cargo}</td>
<td>${item.coleta}</td>
<td>${item.email}</td>
<td>${item.endereco}</td>
</tr>
</tbody>
The <tr> tag is repeatedly inserted into the output, once for each element in orgao_doc. The name item is bound to each element when rendering this part of the template.

BeautifulSoup Parsing with Bad HTML Tables

I'm trying to parse tables similar to the following with BeautifulSoup to extract the name, age, and position for each person.
<TABLE width="100%" align="center" cellspacing="0" cellpadding="0" border="0">
<TR>
<TD></TD>
<TD></TD>
<TD align="center" nowrap colspan="3"><FONT size="2"><B>Age as of</B></FONT></TD>
<TD></TD>
<TD></TD>
</TR>
<TR>
<TD align="center" nowrap><FONT size="2"><B>Name</B></FONT></TD>
<TD></TD>
<TD align="center" nowrap colspan="3"><FONT size="2"><B>November 1, 1999</B></FONT></TD>
<TD></TD>
<TD align="center" nowrap><FONT size="2"><B>Position</B></FONT></TD>
</TR>
<TR>
<TD align="center" nowrap><HR size="1"></TD>
<TD></TD>
<TD align="center" nowrap colspan="3"><HR size="1"></TD>
<TD></TD>
<TD align="center" nowrap><HR size="1"></TD>
</TR>
<TR>
<TD align="left" valign="top"><FONT size="2">
Terry S. Jacobs</FONT></TD>
<TD></TD>
<TD></TD>
<TD align="right" valign="top" nowrap><FONT size="2">57</FONT></TD>
<TD></TD>
<TD></TD>
<TD align="left" valign="top"><FONT size="2">
Chairman of the Board, Chief Executive Officer, Treasurer and
director</FONT></TD>
</TR>
<TR><TD><TR><TD><TR><TD><TR><TD>
<TR>
<TD align="left" valign="top"><FONT size="2">
William L. Stakelin</FONT></TD>
<TD></TD>
<TD></TD>
<TD align="right" valign="top" nowrap><FONT size="2">56</FONT></TD>
<TD></TD>
<TD></TD>
<TD align="left" valign="top"><FONT size="2">
President, Chief Operating Officer, Secretary and director</FONT></TD>
</TR>
<TR><TD><TR><TD><TR><TD><TR><TD>
<TR>
<TD align="left" valign="top"><FONT size="2">
Joel M. Fairman</FONT></TD>
<TD></TD>
<TD></TD>
<TD align="right" valign="top" nowrap><FONT size="2">70</FONT></TD>
<TD></TD>
<TD></TD>
<TD align="left" valign="top"><FONT size="2">
Vice Chairman and director</FONT></TD>
</TR>
</TABLE>
My current attempt is as follows:
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(in_file)
out = []
headers = soup.findAll(['td','th'])
for header in headers:
    if header.find(text = re.compile(r"^age( )?", re.I)):
        out.append(header)
table = out[0].find_parent("table")
rows = table.findAll('tr')
filter_regex = re.compile(r'[\w][\w .,]*', re.I)
data = [[td.find(text=filter_regex) for td in tr.findAll("td")] for tr in rows]
Things work fine for the first person, but the bad <tr><td><tr><td>... lines really mess things up from there. I am trying to do this for a few thousand HTML files, each having a slightly different table structure. That said, this pattern of unclosed <tr> and <td> tags appears quite common across the files.
Anyone have thoughts on how to generalize the above parsing to work with tables that have constructs such as these? Thanks a lot!
You can take advantage of the fact that the valign attribute is set to top in all of the fields you'd like to keep and none of the ones you don't:
soup = BeautifulSoup(in_file)
cells = [cell.text.strip() for cell in soup('td', valign='top')]
Then you can sort this list of cells into a two-dimensional structure. There are three cells per entry, so you can sort it out pretty simply by doing something like this:
entries = []
for i in range(0, len(cells), 3):
    entries.append(cells[i:i+3])
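Here is a small, standard-library-only sketch of that grouping step with a sanity check added. The `cells` list is typed in by hand to stand in for the output of the valign filter on the sample table, with whitespace normalized. The field names in the dicts are assumptions made for the example.

```python
# Hand-typed stand-in for the filtered cell texts (assumed, whitespace normalized).
cells = [
    "Terry S. Jacobs", "57",
    "Chairman of the Board, Chief Executive Officer, Treasurer and director",
    "William L. Stakelin", "56",
    "President, Chief Operating Officer, Secretary and director",
    "Joel M. Fairman", "70",
    "Vice Chairman and director",
]

FIELDS_PER_ENTRY = 3

# Fail loudly if a file does not follow the name/age/position pattern.
if len(cells) % FIELDS_PER_ENTRY != 0:
    raise ValueError(f"expected a multiple of {FIELDS_PER_ENTRY} cells, got {len(cells)}")

entries = []
for i in range(0, len(cells), FIELDS_PER_ENTRY):
    name, age, position = cells[i:i + FIELDS_PER_ENTRY]
    entries.append({"name": name, "age": int(age), "position": position})

for entry in entries:
    print(entry)
```

With a few thousand files of slightly different structure, the length check is a cheap way to flag files where the valign assumption does not hold, instead of silently producing shifted records.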
On the off chance that anyone else gets stuck on this issue and stumbles in here, the modern solution is to change which parser you are using. Python's built-in parser, 'html.parser', is pretty good with HTML that is close to well-formed and has properly closed tags, but the moment you have to deal with edge cases (like Example 1 below, which is similar to the OP's issue), it still falls apart, even 8 years later (Example 2 below).
The documentation for BeautifulSoup4 (version 4.9.3 at the time of writing) has a section on parser selection: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser
Example 1, the raw HTML:
<TABLE >
<TR VALIGN="top">
<td> <td><b>Title:</b>
<td> title is here <i>-subtitle</i><br>
<TR VALIGN="top">
<td>
<td><b>Date:</b>
<td> Thursday , August 27th, 2020
<TR VALIGN="top">
<td> <td><b>Type:</b>
<td> 61
<TR VALIGN="top">
<td>
<td><b>Status:</b>
<td> ACTIVE - ACTIVE
</TABLE>
Example 2, results when using BeautifulSoup(html, 'html.parser'):
<table>
<tr valign="top">
<td> <td><b>Title:</b>
<td> title is here <i>-subtitle</i><br/>
<tr valign="top">
<td>
<td><b>Date:</b>
<td> Thursday , August 27th, 2020
<tr valign="top">
<td> <td><b>Type:</b>
<td> 61
<tr valign="top">
<td>
<td><b>Status:</b>
<td> ACTIVE - ACTIVE
</td></td></td></tr></td></td></td></tr></td></td></td></tr></td></td></td></tr></table>
Example 3, results when using BeautifulSoup(html, 'html5lib'):
<table>
<tbody><tr valign="top">
<td> </td><td><b>Title:</b>
</td><td> title is here <i>-subtitle</i><br/>
</td></tr><tr valign="top">
<td>
</td><td><b>Date:</b>
</td><td> Thursday , August 27th, 2020
</td></tr><tr valign="top">
<td> </td><td><b>Type:</b>
</td><td> 61
</td></tr><tr valign="top">
<td>
</td><td><b>Status:</b>
</td><td> ACTIVE - ACTIVE
</td></tr></tbody></table>
There are also external parsers written in C, such as 'lxml', which according to the documentation are much faster.
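Since html5lib and lxml are separate installs, one option is to try the more lenient parsers first and fall back to the built-in one. The following is a minimal sketch of that idea. The `make_soup` helper and its preference order are assumptions for illustration, not something the BeautifulSoup documentation prescribes. It relies on `bs4.FeatureNotFound`, the exception raised when a requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Shortened version of Example 1: rows and cells are never closed.
html = """
<TABLE>
<TR VALIGN="top">
<td> <td><b>Title:</b>
<td> title is here
<TR VALIGN="top">
<td> <td><b>Type:</b>
<td> 61
</TABLE>
"""


def make_soup(markup, parsers=("html5lib", "lxml", "html.parser")):
    """Return (soup, parser_name) using the first parser that is installed."""
    for name in parsers:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            continue  # this parser is not installed, try the next one
    raise RuntimeError("no usable parser found")


soup, used = make_soup(html)
rows = soup.find_all("tr")
print(used, len(rows))
```

Which parser ends up being used depends on what is installed, so the resulting tree can differ between machines. Pinning a single parser explicitly is the safer choice when results need to be reproducible.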
