How to scrape a specific HTML line that follows another HTML line - python

I want to scrape some data from an HTML page that looks something like this:
<tr>
<td> Some information </td>
<td> 123 </td>
</tr>
<tr>
<td> some other information </td>
<td> 456 </td>
</tr>
<tr>
<td> and the info continues </td>
<td> 789 </td>
</tr>
What I want is to obtain the HTML line that comes after a given HTML line. That is, if I see 'some other information', I want the output '456'. I thought of combining regex with .find_next from BeautifulSoup, but I haven't had any luck with this (I'm also not that familiar with regex). Does anyone have a clue how to do it? Thanks a lot in advance.

Actually with a mix of regex and find_next in BeautifulSoup you can achieve what you want:
from bs4 import BeautifulSoup
import re
html = """
<tr>
<td> Some information </td>
<td> 123 </td>
</tr>
<tr>
<td> some other information </td>
<td> 456 </td>
</tr>
<tr>
<td> and the info continues </td>
<td> 789 </td>
</tr>
"""
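# Name the parser explicitly; otherwise bs4 picks one itself and emits a warning.
soup = BeautifulSoup(html, 'html.parser')
# string= is the current name of the older text= argument (bs4 4.4+).
x = soup.find('td', string=re.compile('some other information'))
print(x.find_next('td').text)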
Output
' 456 '
EDIT: replaced x.find_next('td').contents[0] with x.find_next('td').text, which is shorter.
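
The same idea can be wrapped in a small helper. This is only a sketch: the function name value_after is made up for illustration, the label is escaped so it is matched literally rather than as a regex, and find_next_sibling is swapped in for find_next so the lookup stays inside the same row.

```python
import re
from bs4 import BeautifulSoup

HTML = """
<table>
<tr><td> Some information </td><td> 123 </td></tr>
<tr><td> some other information </td><td> 456 </td></tr>
<tr><td> and the info continues </td><td> 789 </td></tr>
</table>
"""

def value_after(soup, label):
    """Return the stripped text of the <td> following the <td> that contains label.

    Hypothetical helper, not part of BeautifulSoup.
    """
    cell = soup.find("td", string=re.compile(re.escape(label)))
    if cell is None:
        return None  # label not present on the page
    sibling = cell.find_next_sibling("td")
    return sibling.get_text(strip=True) if sibling else None

soup = BeautifulSoup(HTML, "html.parser")
found = value_after(soup, "some other information")
missing = value_after(soup, "no such label")
print(found)
print(missing)
```

Returning None for a missing label avoids the AttributeError that x.find_next(...) would raise when soup.find finds nothing.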


Python Selenium: Finding elements by xpath when there are duplicates in the html code

I'm using Selenium to scrape content from a liquor sales website so that I can add product details to a spreadsheet more quickly. Selenium logs into the website and searches for the correct product. Once I get to the product page I can scrape all the data I need, except for some data that is contained in a certain block of the code.
I want 3 pieces of data: price per case, price per bottle, and price per oz. I noticed that the data I'm looking for appears twice in the code, in a similar pattern. Interestingly, the data I want is the second occurrence (the first occurrence is incorrect). The relevant HTML code is:
<h2>Pricing</h2>
<div id="prices-table">
<div class="table-responsive">
<table class="table table-condensed auto-width">
<thead>
<tr>
<th></th>
<th class="best-bottle-top">
Frontline
</th>
</tr>
</thead>
<tbody>
<tr>
<td>Bottles</td>
<td class="best-bottle-mid">1</td>
</tr>
<tr>
<td>Cases</td>
<td class="best-bottle-mid">—</td>
</tr>
<tr>
<td>Price per bottle</td>
<td class="best-bottle-mid">
<div>$16.14 #I don't want this data </div>
</td>
</tr>
<tr>
<td>Price per case</td>
<td class="best-bottle-mid">
<div>
$193.71 #I don't want this data
</div>
</td>
</tr>
<tr>
<td>Cost per ounce</td>
<td class="best-bottle-mid">
<div>$1.27 #I don't want this data </div>
</td>
</tr>
<tr>
<td></td>
<td class="best-bottle-bot text-muted">
<span class="best-bottle-bot-content">
<span>
<div><small>Best</small></div>
<small>Bottle</small>
</span>
</span>
</td>
</tr>
</tbody>
</table>
</div>
<p>
<em class="price-disclaimer">Defer to Athens Distributing Company of Tennessee in case of any price discrepancies.</em>
</p>
</div>
</div>
<hr class="visible-print-block">
<div class="tab-pane active" id="3400355">
<dl class="dl-horizontal vpv-row">
<dt>Sizing</dt><dd>750 mL bottle × 6</dd>
<dt>SKU</dt><dd>80914</dd>
<dt>UPC</dt><dd>853192006189</dd>
<dt>Status</dt><dd>Active</dd>
<dt>Availability</dt><dd>
<span class="label label-success inventory-status-badge"><span data-container="body" data-toggle="popover" data-placement="top" data-content="Athens Distributing Company of Tennessee is integrated with SevenFifty and sends inventory levels at least once a day. You can order this item and expect that it is available." data-original-title="" title="">IN STOCK</span></span>
</dd></dl>
<div id="prices-table">
<div class="table-responsive">
<table class="table table-condensed auto-width">
<h2>Pricing</h2><thead>
<tr>
<th></th>
<th class="best-bottle-top">
Frontline
</th>
</tr>
</thead>
<tbody>
<tr>
<td>Bottles</td>
<td class="best-bottle-mid">1</td>
</tr>
<tr>
<td>Cases</td>
<td class="best-bottle-mid">—</td>
</tr>
<tr>
<td>Price per bottle</td>
<td class="best-bottle-mid">
<div>$33.03 #I want THIS data </div>
</td>
</tr>
<tr>
<td>Price per case</td>
<td class="best-bottle-mid">
<div>
$198.18 I want THIS data
</div>
</td>
</tr>
<tr>
<td>Cost per ounce</td>
<td class="best-bottle-mid">
<div>$1.30 I want THIS data </div>
</td>
</tr>
<tr>
<td></td>
<td class="best-bottle-bot text-muted">
<span class="best-bottle-bot-content">
<span>
<div><small>Best</small></div>
<small>Bottle</small>
</span>
</span>
</td>
</tr>
</tbody>
</table>
</div>
Using the full XPath that Chrome gives me finds what I want, but a relative path does not work. Here's what I've tried:
Full XPath for case price (works, but I don't want to use absolute references):
/html/body/div[3]/div[1]/div/div[2]/div[2]/div[2]/div/div[3]/div[2]/div[3]/div[2]/div/div/table/tbody/tr[4]/td[2]/div
Relative XPath for case price (returns None):
//*[@id="prices-table"]/div/table/tbody/tr[4]/td[2]/div
Unfortunately I can't link the actual webpage because it requires login credentials. Thanks for any/all help.
There are two ways to do it.
If the tags and their attributes are identical in both places, use XPath indexing.
//td[text()='Price per bottle']/following-sibling::td[@class='best-bottle-mid']
This matches two nodes, and find_element only returns the first occurrence, which you do not want. So you can do this:
(//td[text()='Price per bottle']/following-sibling::td[@class='best-bottle-mid'])[2]
to locate the second web element. You can do the same for Price per case and Cost per ounce.
The other way is to use find_elements:
from selenium.webdriver.common.by import By
price_per_bottle_elements = driver.find_elements(By.XPATH, "//td[text()='Price per bottle']/following-sibling::td[@class='best-bottle-mid']")
print(price_per_bottle_elements[0].text)  # the first match, which we do not want
print(price_per_bottle_elements[1].text)  # the second match, which we want
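
Since the real page sits behind a login, the XPath can also be checked offline against a saved or trimmed copy of the markup. The sketch below assumes the lxml package is installed and uses a cut-down sample with the two prices from the question; it is not the actual page, and Selenium itself is not involved.

```python
from lxml import html

# Trimmed stand-in for the page: two pricing tables with the same structure.
PAGE = """
<div>
  <table><tbody>
    <tr><td>Price per bottle</td><td class="best-bottle-mid"><div>$16.14</div></td></tr>
  </tbody></table>
  <table><tbody>
    <tr><td>Price per bottle</td><td class="best-bottle-mid"><div>$33.03</div></td></tr>
  </tbody></table>
</div>
"""

tree = html.fromstring(PAGE)
xpath = "//td[text()='Price per bottle']/following-sibling::td[@class='best-bottle-mid']"

all_matches = tree.xpath(xpath)       # every matching cell, in document order
second = tree.xpath(f"({xpath})[2]")  # parentheses index the whole node set

print(len(all_matches))
print(second[0].text_content().strip())
```

The parentheses matter: without them, [2] would apply to the position of the cell among the siblings matched within each row rather than to the combined result.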

BS4 findall not returning all divs

I was trying to get to the bottom table on the site, but findAll() kept returning empty results, so I fetched the divs on the same level one by one and noticed that the last two come back as []:
the_page=urllib.request.urlopen("https://theunderminejournal.com/#eu/sylvanas/item/124105")
bsObj=BeautifulSoup(the_page,'html.parser')
test=bsObj.findAll('div',{'class':'page','id':"item-page"})
print(test)
I have gone through the bs4 object that I got, and the two divs I'm looking for aren't in it. What's happening?
The div I'm looking for is at https://theunderminejournal.com/#eu/sylvanas/item/124105
This is the div I'm trying to extract
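You will need to use Selenium instead of a plain HTTP library such as urllib or requests. The page appears to build that content with JavaScript after it loads, and those libraries do not run scripts, so the divs never show up in what they download.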
Note that I couldn't post all of the output as the HTML parsed was huge.
Code:
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Chrome()
driver.get("https://theunderminejournal.com/#eu/sylvanas/item/124105")
bsObj = BeautifulSoup(driver.page_source,'html.parser')
test = bsObj.find('div', id='item-page')
print(test.prettify())
Output:
<div class="page" id="item-page" style="display: block;">
<div class="item-stats">
<table>
<tr class="available">
<th>
Available Quantity
</th>
<td>
<span>
30,545
</span>
</td>
</tr>
<tr class="spacer">
<td colspan="3">
</td>
</tr>
<tr class="current-price">
<th>
Current Price
</th>
<td>
<span class="money-gold">
27.34
</span>
</td>
</tr>
<tr class="median-price">
<th>
Median Price
</th>
<td>
<span class="money-gold">
30.11
</span>
</td>
</tr>
<tr class="mean-price">
<th>
Mean Price
</th>
<td>
<span class="money-gold">
30.52
</span>
</td>
</tr>
<tr class="standard-deviation">
<th>
Standard Deviation
</th>
<td>
<span class="money-gold">
.
.
.
</span>
</abbr>
</td>
</tr>
</table>
</div>
</div>
</div>
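
One likely reason the plain urllib request misses those divs: everything after the # in the URL is a fragment, which the client keeps to itself and never sends to the server. A small standard-library sketch of that split (the variable names are only illustrative):

```python
from urllib.parse import urlsplit

url = "https://theunderminejournal.com/#eu/sylvanas/item/124105"
parts = urlsplit(url)

# What an HTTP client actually asks the server for: scheme, host and path.
requested = f"{parts.scheme}://{parts.netloc}{parts.path}"
print(requested)

# The fragment stays on the client; only script running in the page can act on it.
print(parts.fragment)
```

So urlopen effectively fetches the site's front page, and the item markup presumably only exists once a browser (here driven by Selenium) has run the page's scripts.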

Some <td>'s Cannot Be Found by find_next()

This is a question about scraping with BS4. I'm scraping a website that has barely any IDs on the elements I need, so I'm relying on find_next, find_next_siblings and the other navigation methods in BS4.
I used find_next() to get some td values from my tables. It works for some values, but for some reason it can't find others.
Here's the html:
<table style="max-width: 350px;" border="0">
<tbody><tr>
<td style="max-width: 215px;">REF. NO.</td>
<td style="max-width: 12px;" align="center"> </td>
<td align="right">000124 </td>
</tr>
<tr>
<td>REF. NO.</td>
<td align="center"> </td>
<td align="right"> </td>
</tr>
<tr>
<td>MANU</td>
<td align="center"> </td>
<td align="right"></td>
</tr>
<tr>
<td>STREAK</td>
<td align="center"> </td>
<td align="right">1075</td>
</tr>
<tr>
<td>PACK</td>
<td align="center"> </td>
<td align="right">1</td>
</tr>
<tr>
<td colspan="3">ON STOCK. </td>
</tr>
.... and so on
So I used this code to get what I want:
div = soup.find('div', {'id': 'infodata'})
table_data = div.find_all('td')
for element in table_data:
    if "STREAK" in element.get_text():
        price = element.find_next('td').find_next('td').text
        print(price + "price")
    else:
        print('NOT FOUND!')
I copied and pasted text straight from the HTML many times to make sure I didn't mistype anything, but it still always ends up at NOT FOUND. If I try other labels, I can get them, for example PACK.
By the way, I'm using two find_next() calls there because the HTML has three td's in every <tr>.
Why does this work for some words but not for others? Any help is appreciated. Thank you very much!
I would rewrite it like this:
trs = div.find_all('tr')
for tr in trs:
    tds = tr.select('td')
    if len(tds) > 1 and 'STREAK' in tds[0].get_text().strip():
        price = tds[-1].get_text().strip()
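
The row-by-row idea also extends to collecting every label at once. A minimal sketch, assuming the table sits inside the div with id infodata as in the question and using a shortened copy of its rows (the dictionary name values is illustrative):

```python
from bs4 import BeautifulSoup

HTML = """
<div id="infodata">
<table>
<tr><td>REF. NO.</td><td align="center"> </td><td align="right">000124 </td></tr>
<tr><td>STREAK</td><td align="center"> </td><td align="right">1075</td></tr>
<tr><td>PACK</td><td align="center"> </td><td align="right">1</td></tr>
<tr><td colspan="3">ON STOCK. </td></tr>
</table>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")
div = soup.find("div", {"id": "infodata"})

# Map the first cell of each row to the last cell of the same row.
values = {}
for tr in div.find_all("tr"):
    tds = tr.find_all("td")
    if len(tds) > 1:  # skip single-cell rows such as "ON STOCK."
        values[tds[0].get_text(strip=True)] = tds[-1].get_text(strip=True)

print(values.get("STREAK", "NOT FOUND!"))
```

Working within one <tr> at a time means a lookup can never drift into the next row, which chained find_next() calls can do when a row has fewer cells than expected. Note that a repeated label (REF. NO. appears twice in the original table) would overwrite the earlier entry in this dictionary.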

How can I make a "for" loop in HTML, using Chameleon and Pyramid in Python 3.4?

How can I make a loop using Chameleon and Pyramid in my HTML?
I searched but found nothing like that.
Is it easier to use JavaScript in this case?
I use a datatable in MACADMIN (a Bootstrap theme).
<div class="table-responsive">
<table cellpadding="0" cellspacing="0" border="0" id="data-table" width="100%">
<thead>
<tr>
<th>
Rendering engine
</th>
<th>
Browser
</th>
<th>
Platform(s)
</th>
<th>
Engine version
</th>
<th>
CSS grade
</th>
</tr>
</thead>
<tbody>
Maybe put FOR here? like {for x items in "TABLE"}
<tr>
<td>
{orgao_doc[x].nome}
</td>
<td>
{orgao_doc[x].cargo}
</td>
<td>
{orgao_doc[x].coleta}
</td>
<td>
{orgao_doc[x].email}
</td>
<td>
{orgao_doc[x].endereco}
</td>
</tr>
</tbody>
</table>
<div class="clearfix">
</div>
</div>
Use a tal:repeat attribute to repeat parts of a template, given a sequence:
<tbody>
<tr tal:repeat="item orgao_doc">
<td>${item.nome}</td>
<td>${item.cargo}</td>
<td>${item.coleta}</td>
<td>${item.email}</td>
<td>${item.endereco}</td>
</tr>
</tbody>
The <tr> tag is repeatedly inserted into the output, once for each element in orgao_doc. The name item is bound to each element when rendering this part of the template.

BeautifulSoup: How to extract data after specific html tag

I have the following HTML and I am trying to figure out how exactly I can tell BeautifulSoup to extract the td that comes after a certain HTML element. In this case I want to get the data in the <td> after <td>Color Digest</td>
<tr>
<td> Color Digest </td>
<td> 2,36,156,38,25,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, </td>
</tr>
This is the entire HTML
<html>
<head>
<body>
<div align="center">
<table cellspacing="0" cellpadding="0" style="clear:both; width:100%;margin:0px; font-size:1pt;">
<br>
<br>
<table>
<table>
<tbody>
<tr bgcolor="#AAAAAA">
<tr>
<tr>
<tr>
<tr>
<tr>
<tr>
<tr>
<tr>
<tr>
<tr>
<tr>
<tr>
<td> Color Digest </td>
<td> 2,36,156,38,25,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, </td>
</tr>
</tbody>
</table>
Sounds like you need to iterate over a list of <td> and stop once you've found your data.
Example:
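from bs4 import BeautifulSoup
soup = BeautifulSoup('<html><tr><td>X</td><td>Color Digest</td><td>THE DIGEST</td></tr></html>', 'html.parser')
for cell in soup.find_all('td'):
    # strip() so that cells written as "<td> Color Digest </td>" also match
    if cell.get_text().strip() == 'Color Digest':
        print(cell.find_next_sibling('td').get_text().strip())
        break  # stop once the data is found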
