BS4 findall not returning all divs - python

I was trying to get to the bottom table on the site, but findAll() kept returning empty lists, so I went through the divs on that level one by one and noticed that the last two always give me [].
import urllib.request
from bs4 import BeautifulSoup

the_page = urllib.request.urlopen("https://theunderminejournal.com/#eu/sylvanas/item/124105")
bsObj = BeautifulSoup(the_page, 'html.parser')
test = bsObj.findAll('div', {'class': 'page', 'id': 'item-page'})
print(test)
I have gone through the bs4 object I got, and the two divs I'm looking for aren't in it. What's happening?
The div I'm looking for is on https://theunderminejournal.com/#eu/sylvanas/item/124105; it holds the stats table at the bottom of the page, and that is what I'm trying to extract.

You will need to use Selenium instead of urllib/requests: the item data (everything after the # in that URL) is rendered in the browser by JavaScript, so the HTML that urlopen returns never contains those divs.
Note that I couldn't post all of the output, as the parsed HTML was huge.
Code:
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://theunderminejournal.com/#eu/sylvanas/item/124105")

# The browser has executed the page's JavaScript, so page_source holds the rendered DOM.
bsObj = BeautifulSoup(driver.page_source, 'html.parser')
test = bsObj.find('div', id='item-page')
print(test.prettify())
Output:
<div class="page" id="item-page" style="display: block;">
<div class="item-stats">
<table>
<tr class="available">
<th>
Available Quantity
</th>
<td>
<span>
30,545
</span>
</td>
</tr>
<tr class="spacer">
<td colspan="3">
</td>
</tr>
<tr class="current-price">
<th>
Current Price
</th>
<td>
<span class="money-gold">
27.34
</span>
</td>
</tr>
<tr class="median-price">
<th>
Median Price
</th>
<td>
<span class="money-gold">
30.11
</span>
</td>
</tr>
<tr class="mean-price">
<th>
Mean Price
</th>
<td>
<span class="money-gold">
30.52
</span>
</td>
</tr>
<tr class="standard-deviation">
<th>
Standard Deviation
</th>
<td>
<span class="money-gold">
.
.
.
</span>
</abbr>
</td>
</tr>
</table>
</div>
</div>
</div>
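If all you need are the numbers in that stats table, a small follow-on sketch (assuming bsObj is the soup built from driver.page_source above and the th/span layout shown in the output stays the same) could be:
# Collect the labelled values from the item-stats table parsed above.
stats = {}
for row in bsObj.select('#item-page .item-stats tr'):
    header = row.find('th')
    value = row.find('span')
    if header and value:
        stats[header.get_text(strip=True)] = value.get_text(strip=True)
print(stats)  # e.g. {'Available Quantity': '30,545', 'Current Price': '27.34', ...}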

Related

Python Selenium: Finding elements by xpath when there are duplicates in the html code

I'm working in Selenium to scrape content from a liquor sales website so I can add product details to a spreadsheet more quickly. I use Selenium to log into the website and search for the correct product. Once I get to the product page I'm able to scrape all the data I need except for some values contained in a certain block of the code.
I want three pieces of data: price per case, price per bottle, and price per ounce. I noticed that the data I'm looking for appears twice in the code in a similar pattern. Interestingly, the correct data is the second occurrence (the first occurrence is wrong). The relevant HTML is:
<h2>Pricing</h2>
<div id="prices-table">
<div class="table-responsive">
<table class="table table-condensed auto-width">
<thead>
<tr>
<th></th>
<th class="best-bottle-top">
Frontline
</th>
</tr>
</thead>
<tbody>
<tr>
<td>Bottles</td>
<td class="best-bottle-mid">1</td>
</tr>
<tr>
<td>Cases</td>
<td class="best-bottle-mid">—</td>
</tr>
<tr>
<td>Price per bottle</td>
<td class="best-bottle-mid">
<div>$16.14 #I don't want this data </div>
</td>
</tr>
<tr>
<td>Price per case</td>
<td class="best-bottle-mid">
<div>
$193.71 #I don't want this data
</div>
</td>
</tr>
<tr>
<td>Cost per ounce</td>
<td class="best-bottle-mid">
<div>$1.27 #I don't want this data </div>
</td>
</tr>
<tr>
<td></td>
<td class="best-bottle-bot text-muted">
<span class="best-bottle-bot-content">
<span>
<div><small>Best</small></div>
<small>Bottle</small>
</span>
</span>
</td>
</tr>
</tbody>
</table>
</div>
<p>
<em class="price-disclaimer">Defer to Athens Distributing Company of Tennessee in case of any price discrepancies.</em>
</p>
</div>
</div>
<hr class="visible-print-block">
<div class="tab-pane active" id="3400355">
<dl class="dl-horizontal vpv-row">
<dt>Sizing</dt><dd>750 mL bottle × 6</dd>
<dt>SKU</dt><dd>80914</dd>
<dt>UPC</dt><dd>853192006189</dd>
<dt>Status</dt><dd>Active</dd>
<dt>Availability</dt><dd>
<span class="label label-success inventory-status-badge"><span data-container="body" data-toggle="popover" data-placement="top" data-content="Athens Distributing Company of Tennessee is integrated with SevenFifty and sends inventory levels at least once a day. You can order this item and expect that it is available." data-original-title="" title="">IN STOCK</span></span>
</dd></dl>
<div id="prices-table">
<div class="table-responsive">
<table class="table table-condensed auto-width">
<h2>Pricing</h2><thead>
<tr>
<th></th>
<th class="best-bottle-top">
Frontline
</th>
</tr>
</thead>
<tbody>
<tr>
<td>Bottles</td>
<td class="best-bottle-mid">1</td>
</tr>
<tr>
<td>Cases</td>
<td class="best-bottle-mid">—</td>
</tr>
<tr>
<td>Price per bottle</td>
<td class="best-bottle-mid">
<div>$33.03 #I want THIS data </div>
</td>
</tr>
<tr>
<td>Price per case</td>
<td class="best-bottle-mid">
<div>
$198.18 I want THIS data
</div>
</td>
</tr>
<tr>
<td>Cost per ounce</td>
<td class="best-bottle-mid">
<div>$1.30 I want THIS data </div>
</td>
</tr>
<tr>
<td></td>
<td class="best-bottle-bot text-muted">
<span class="best-bottle-bot-content">
<span>
<div><small>Best</small></div>
<small>Bottle</small>
</span>
</span>
</td>
</tr>
</tbody>
</table>
</div>
Using the full XPath that Chrome gives me finds what I want, but a relative path does not. Here's what I've tried:
Full XPath for the case price (works, but I don't want to use absolute references):
/html/body/div[3]/div[1]/div/div[2]/div[2]/div[2]/div/div[3]/div[2]/div[3]/div[2]/div/div/table/tbody/tr[4]/td[2]/div
Relative XPath for the case price (returns None):
//*[@id="prices-table"]/div/table/tbody/tr[4]/td[2]/div
Unfortunately I can't link the actual webpage because it requires login credentials. Thanks for any/all help.
There are two ways to do it.
If the tags and their attributes are all the same, use XPath indexing.
//td[text()='Price per bottle']/following-sibling::td[@class='best-bottle-mid']
This matches two nodes, and find_element only returns the first occurrence, which is not the one you want. So you can wrap it and index it:
(//td[text()='Price per bottle']/following-sibling::td[@class='best-bottle-mid'])[2]
to locate the second element. You can do the same for Price per case and Cost per ounce.
The other way is to use find_elements:
from selenium.webdriver.common.by import By

price_per_bottle_elements = driver.find_elements(By.XPATH, "//td[text()='Price per bottle']/following-sibling::td[@class='best-bottle-mid']")
print(price_per_bottle_elements[0].text)  # the first occurrence, which we do not want
print(price_per_bottle_elements[1].text)  # the second occurrence, which we want
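As a small extension (a sketch only, assuming driver is already on the product page and the wanted value is always the second match), the same XPath pattern can be looped over all three labels:
from selenium.webdriver.common.by import By

labels = ["Price per bottle", "Price per case", "Cost per ounce"]
prices = {}
for label in labels:
    # All cells next to this label, in document order; index 1 is the second occurrence.
    cells = driver.find_elements(By.XPATH, "//td[text()='" + label + "']/following-sibling::td[@class='best-bottle-mid']")
    if len(cells) > 1:
        prices[label] = cells[1].text.strip()
print(prices)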

Some <td>'s Cannot Be Found by find_next()

This is a question about scraping with BS4. The website I'm scraping barely has any IDs on the elements I need, so I'm relying on find_next(), find_next_siblings(), and similar iterator-style BS4 methods.
I used find_next() to pull some td values from my tables. It worked for some values, but for some reason it can't find the others.
Here's the html:
<table style="max-width: 350px;" border="0">
<tbody><tr>
<td style="max-width: 215px;">REF. NO.</td>
<td style="max-width: 12px;" align="center"> </td>
<td align="right">000124 </td>
</tr>
<tr>
<td>REF. NO.</td>
<td align="center"> </td>
<td align="right"> </td>
</tr>
<tr>
<td>MANU</td>
<td align="center"> </td>
<td align="right"></td>
</tr>
<tr>
<td>STREAK</td>
<td align="center"> </td>
<td align="right">1075</td>
</tr>
<tr>
<td>PACK</td>
<td align="center"> </td>
<td align="right">1</td>
</tr>
<tr>
<td colspan="3">ON STOCK. </td>
</tr>
.... and so on
So I used this code to get what I want:
div = soup.find('div', {'id': 'infodata'})
table_data = div.find_all('td')
for element in table_data:
    if "STREAK" in element.get_text():
        # Two find_next('td') calls skip the middle spacer cell and land on the value cell.
        price = element.find_next('td').find_next('td').text
        print(price + "price")
    else:
        print('NOT FOUND!')
I copied and pasted strings straight from the HTML many times to make sure I didn't mistype anything, but it still always falls through to NOT FOUND!. If I try other table labels, for example PACK, I can get them.
By the way, I'm using two find_next() calls because the HTML has three td's in every <tr>.
Why does this work for some words but not for others? Any help is appreciated. Thank you very much!
I would rewrite it like this:
trs = div.find_all('tr')
for tr in trs:
    tds = tr.select('td')
    if len(tds) > 1 and 'STREAK' in tds[0].get_text().strip():
        # The value sits in the last <td> of the row that starts with STREAK.
        price = tds[-1].get_text().strip()
        print(price)
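An alternative sketch along the same lines (assuming soup is the parsed page from the question and the value really is in the last <td> of the STREAK row):
import re

div = soup.find('div', {'id': 'infodata'})
# Find the <td> whose text contains STREAK, then read the last cell of that row.
label_cell = div.find('td', string=re.compile('STREAK'))
if label_cell is not None:
    price = label_cell.find_parent('tr').find_all('td')[-1].get_text(strip=True)
    print(price)
else:
    print('NOT FOUND!')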

Extract 2 pieces of information from html in python

I need help figuring out how to extract Grab and the number following data-b. There are many <tr> elements in the complete, unmodified webpage, and I need to filter on the "Need" that appears just before </a>. I've been trying to do this with Beautiful Soup, though it looks like lxml might work better. I can get either all of the <tr>s, or only the <a>...</a> lines that contain Need, but not just the <tr>s whose <a> line contains Need.
<tr >
<td>3</td>
<td>Leave</td><td>Useless</td>
<td class="text-right"> <span class="float2" data-a="24608000.0" data-b="518" data-n="818">Garbage</span></td>
<td class="text-right"> <span class="Float" data-a="3019" data-b="0.0635664" data-n="283">Garbage2</span></td>
<td class="text-right">7.38%</td>
<td class="text-right " >Recently</td>
</tr>
<tr >
<td>4</td>
<td>Grab</td><td>Need</td>
<td class="text-right"> <span class="bloat2" data="22435000.0" data-b="512" data-n="74491.2">More junk</span></td>
<td class="text-right"> <span class="bloat" data-a="301.177" data-b="35.848" data-n="0.5848">More junk2</span></td>
<td class="text-right">Some more</td>
<td class="text-right " >Recently</td>
</tr>
Thanks for any help!
from bs4 import BeautifulSoup
data = '''<tr>
<td>3</td>
<td>Leave</td><td>Useless</td>
<td class="text-right"> <span class="float2" data-a="24608000.0" data-b="518" data-n="818">Garbage</span></td>
<td class="text-right"> <span class="Float" data-a="3019" data-b="0.0635664" data-n="283">Garbage2</span></td>
<td class="text-right">7.38%</td>
<td class="text-right " >Recently</td>
</tr>
<tr>
<td>4</td>
<td>Grab</td><td>Need</td>
<td class="text-right"> <span class="bloat2" data="22435000.0" data-b="512" data-n="74491.2">More junk</span></td>
<td class="text-right"> <span class="bloat" data-a="301.177" data-b="35.848" data-n="0.5848">More junk2</span></td>
<td class="text-right">Some more</td>
<td class="text-right " >Recently</td>
</tr>
'''
soup = BeautifulSoup(data, 'html.parser')
# The <a href="/local"> link only exists on the full page, not in the sample above.
print(soup.findAll('a', {"href": "/local"})[0].text)
for a in soup.findAll('span', {"class": ["bloat", "bloat2"]}):
    print(a['data-b'])
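To get exactly what the question asks for from the sample markup, a sketch that reuses the data string from the snippet above (and assumes the wanted rows are the ones whose cells contain the literal text Need) could be:
soup = BeautifulSoup(data, 'html.parser')
for tr in soup.find_all('tr'):
    cell_texts = [td.get_text(strip=True) for td in tr.find_all('td')]
    if 'Need' in cell_texts:
        name = cell_texts[1]  # 'Grab'
        # Every data-b value carried by a span in this row.
        data_b = [span['data-b'] for span in tr.find_all('span') if span.has_attr('data-b')]
        print(name, data_b)   # prints: Grab ['512', '35.848']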

Python code to click next pages links and scrape all the page hyper links

I am very new to Python and I am stuck on moving from one page to the next; I am able to scrape the details of a single page.
Below is the code I am using:
def getURLinfo(url):
    url = "https://apps1.coned.com/cemyaccount/MemberPages/MyAccounts.aspx?lang=eng"
    driver.get(url)
    html = driver.page_source
    nextpage = "ctl00$Main$DataPager1$ctl01$ctl01"
    soup = BeautifulSoup(html)
    while soup.find(id=re.compile(nextpage)):
        for table in soup.findAll('table', {'id': 'ctl00_Main_lvMyAccount_itemPlaceholderContainer'}):
            for link in table.findAll('a'):
                link.findAll('a')
                print link['href']
        driver.find_element_by_link_text(nextpage).click()
        html = html + driver.page_source
        soup = BeautifulSoup(driver.page_source)
    soup = BeautifulSoup(html)
    driver.close()
I am not sure whether I am on the right track either.
Below is the HTML code:
<tr style="background-color:#EFEFEF">
<td>
<a id="ctl00_Main_lvMyAccount_ctrl17_lnkSelect" href="javascript:__doPostBack('ctl00$Main$lvMyAccount$ctrl17$lnkSelect','')">View </a>
</td>
<td>
<span id="ctl00_Main_lvMyAccount_ctrl17_lblAcctNumber">211558100500042</span>
</td>
<td>
<span id="ctl00_Main_lvMyAccount_ctrl17_LblServiceAddress">41-12 41 ST ENTM </span>
</td>
<td>
<span id="ctl00_Main_lvMyAccount_ctrl17_LblCustName">41-12 41ST MGMT CORP.</span>
</td>
<td>
<span id="ctl00_Main_lvMyAccount_ctrl17_LblAcctStatus">Active</span>
</td>
<td>
<a onclick="return confirm('Are you sure you want to delete this account number?');" id="ctl00_Main_lvMyAccount_ctrl17_lnkDelete" href="javascript:__doPostBack('ctl00$Main$lvMyAccount$ctrl17$lnkDelete','')">Delete </a>
</td>
</tr>
<tr>
<td>
<a id="ctl00_Main_lvMyAccount_ctrl18_lnkSelect" href="javascript:__doPostBack('ctl00$Main$lvMyAccount$ctrl18$lnkSelect','')">View </a>
</td>
<td>
<span id="ctl00_Main_lvMyAccount_ctrl18_lblAcctNumber">211558102300045</span>
</td>
<td>
<span id="ctl00_Main_lvMyAccount_ctrl18_LblServiceAddress">41-12 41 ST 1D </span>
</td>
<td>
<span id="ctl00_Main_lvMyAccount_ctrl18_LblCustName">41-12 MGMT CORP </span>
</td>
<td>
<span id="ctl00_Main_lvMyAccount_ctrl18_LblAcctStatus">Active</span>
</td>
<td valign="top">
<a onclick="return confirm('Are you sure you want to delete this account number?');" id="ctl00_Main_lvMyAccount_ctrl18_lnkDelete" href="javascript:__doPostBack('ctl00$Main$lvMyAccount$ctrl18$lnkDelete','')">Delete </a>
</td>
</tr>
<tr style="background-color:#EFEFEF">
<td>
<a id="ctl00_Main_lvMyAccount_ctrl19_lnkSelect" href="javascript:__doPostBack('ctl00$Main$lvMyAccount$ctrl19$lnkSelect','')">View </a>
</td>
<td>
<span id="ctl00_Main_lvMyAccount_ctrl19_lblAcctNumber">211564295000053</span>
</td>
<td>
<span id="ctl00_Main_lvMyAccount_ctrl19_LblServiceAddress">47-07 39 ST HLSM </span>
</td>
<td>
<span id="ctl00_Main_lvMyAccount_ctrl19_LblCustName">QPII-47-07 39 ST.,LLC</span>
</td>
<td>
<span id="ctl00_Main_lvMyAccount_ctrl19_LblAcctStatus">Active</span>
</td>
<td>
<a onclick="return confirm('Are you sure you want to delete this account number?');" id="ctl00_Main_lvMyAccount_ctrl19_lnkDelete" href="javascript:__doPostBack('ctl00$Main$lvMyAccount$ctrl19$lnkDelete','')">Delete </a>
</td>
</tr>
</table>
</td>
</tr>
</td>
</tr>
<tr align="center"><td>
<span id="ctl00_Main_DataPager1"><a disabled="disabled"><< </a> <span>1</span> 2 3 4 5 ... >> </span>
</td></tr>
</table>
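Since the paging loop is the sticking point, here is one possible shape for it. This is a rough sketch under assumptions, not a tested answer: it assumes you are already logged in, that the account table keeps the same id on every page, and that the ASP.NET pager button can be located through the name string the question already uses.
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

driver = webdriver.Chrome()
driver.get("https://apps1.coned.com/cemyaccount/MemberPages/MyAccounts.aspx?lang=eng")

all_links = []
while True:
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    table = soup.find('table', {'id': 'ctl00_Main_lvMyAccount_itemPlaceholderContainer'})
    if table:
        all_links.extend(a.get('href') for a in table.findAll('a'))
    try:
        # Assumed locator: the pager's next-page control, addressed by its name attribute.
        next_button = driver.find_element(By.NAME, "ctl00$Main$DataPager1$ctl01$ctl01")
    except NoSuchElementException:
        break
    if not next_button.is_enabled():
        break  # last page reached
    next_button.click()
    time.sleep(2)  # crude wait for the postback; a WebDriverWait would be more robust

driver.quit()
print(all_links)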

BeautifulSoup: How to extract data after specific html tag

I have the following HTML and I am trying to figure out how to tell BeautifulSoup to extract the td after a certain HTML element. In this case I want to get the data in the <td> that follows <td>Color Digest</td>.
<tr>
<td> Color Digest </td>
<td> 2,36,156,38,25,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, </td>
</tr>
This is the entire HTML
<html>
<head>
<body>
<div align="center">
<table cellspacing="0" cellpadding="0" style="clear:both; width:100%;margin:0px; font-size:1pt;">
<br>
<br>
<table>
<table>
<tbody>
<tr bgcolor="#AAAAAA">
<tr>
<tr>
<tr>
<tr>
<tr>
<tr>
<tr>
<tr>
<tr>
<tr>
<tr>
<tr>
<td> Color Digest </td>
<td> 2,36,156,38,25,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, </td>
</tr>
</tbody>
</table>
Sounds like you need to iterate over a list of <td> and stop once you've found your data.
Example:
from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup('<html><tr><td>X</td><td>Color Digest</td><td>THE DIGEST</td></tr></html>')
for cell in soup.html.tr.findAll('td'):
    if 'Color Digest' == cell.text:
        print cell.nextSibling.text
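For the HTML in the question itself, an equivalent bs4 / Python 3 sketch (assuming html_doc holds that markup; the regex match copes with the spaces around 'Color Digest'):
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')  # html_doc: the question's HTML as a string
# Match the label cell despite its surrounding spaces, then take the next <td>.
label = soup.find('td', string=re.compile('Color Digest'))
if label is not None:
    digest = label.find_next_sibling('td').get_text(strip=True)
    print(digest)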
