BS4 findAll() not returning all divs - Python
I was trying to get to the bottom table on the site, but findAll() kept returning empty results, so I went through the divs on that level one by one and noticed that the last two always come back as [].
import urllib.request
from bs4 import BeautifulSoup

the_page = urllib.request.urlopen("https://theunderminejournal.com/#eu/sylvanas/item/124105")
bsObj = BeautifulSoup(the_page, 'html.parser')
test = bsObj.findAll('div', {'class': 'page', 'id': "item-page"})
print(test)
I have gone through the bs4 object I got back and the two divs I'm looking for aren't in it. What's happening?
The div I'm looking for is on https://theunderminejournal.com/#eu/sylvanas/item/124105 - this is the div I'm trying to extract.
You will need to use Selenium instead of plain urllib/requests: the item page fills that table in with JavaScript after the initial HTML loads, so the data never appears in the response urllib downloads. Selenium drives a real browser, which runs that JavaScript, and you can then hand driver.page_source to BeautifulSoup.
Note that I couldn't post all of the output, as the parsed HTML is huge.
Code:
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://theunderminejournal.com/#eu/sylvanas/item/124105")

# driver.page_source contains the DOM after the JavaScript has run
bsObj = BeautifulSoup(driver.page_source, 'html.parser')
test = bsObj.find('div', id='item-page')
print(test.prettify())
Output:
<div class="page" id="item-page" style="display: block;">
<div class="item-stats">
<table>
<tr class="available">
<th>
Available Quantity
</th>
<td>
<span>
30,545
</span>
</td>
</tr>
<tr class="spacer">
<td colspan="3">
</td>
</tr>
<tr class="current-price">
<th>
Current Price
</th>
<td>
<span class="money-gold">
27.34
</span>
</td>
</tr>
<tr class="median-price">
<th>
Median Price
</th>
<td>
<span class="money-gold">
30.11
</span>
</td>
</tr>
<tr class="mean-price">
<th>
Mean Price
</th>
<td>
<span class="money-gold">
30.52
</span>
</td>
</tr>
<tr class="standard-deviation">
<th>
Standard Deviation
</th>
<td>
<span class="money-gold">
.
.
.
</span>
</abbr>
</td>
</tr>
</table>
</div>
</div>
</div>
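If the table is occasionally missing or empty because page_source is read before the JavaScript has finished rendering, an explicit wait helps. A minimal sketch, assuming the item-page id shown in the output above and an arbitrary 10-second timeout:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://theunderminejournal.com/#eu/sylvanas/item/124105")

# Wait (up to 10 s) until the item-page div is actually present in the DOM
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "item-page"))
)

bsObj = BeautifulSoup(driver.page_source, 'html.parser')
print(bsObj.find('div', id='item-page').prettify())
driver.quit()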
Related
Python Selenium: Finding elements by XPath when there are duplicates in the HTML code
I'm working in Selenium to scrape content from a liquor sales website so I can add product details to a spreadsheet more quickly. I use Selenium to log into the website and search for the correct product. Once I get to the product page I can scrape all the data I need except for some data contained in a certain block of the code. I want 3 pieces of data: price per case, price per bottle, and price per oz. I noticed that the data I'm looking for appears twice in a similar pattern, and interestingly the correct data is the second occurrence (the first occurrence is incorrect). The relevant HTML is:

<h2>Pricing</h2>
<div id="prices-table">
  <div class="table-responsive">
    <table class="table table-condensed auto-width">
      <thead>
        <tr> <th></th> <th class="best-bottle-top"> Frontline </th> </tr>
      </thead>
      <tbody>
        <tr> <td>Bottles</td> <td class="best-bottle-mid">1</td> </tr>
        <tr> <td>Cases</td> <td class="best-bottle-mid">—</td> </tr>
        <tr> <td>Price per bottle</td> <td class="best-bottle-mid"> <div>$16.14 #I don't want this data </div> </td> </tr>
        <tr> <td>Price per case</td> <td class="best-bottle-mid"> <div> $193.71 #I don't want this data </div> </td> </tr>
        <tr> <td>Cost per ounce</td> <td class="best-bottle-mid"> <div>$1.27 #I don't want this data </div> </td> </tr>
        <tr> <td></td> <td class="best-bottle-bot text-muted"> <span class="best-bottle-bot-content"> <span> <div><small>Best</small></div> <small>Bottle</small> </span> </span> </td> </tr>
      </tbody>
    </table>
  </div>
  <p> <em class="price-disclaimer">Defer to Athens Distributing Company of Tennessee in case of any price discrepancies.</em> </p>
</div>
</div>
<hr class="visible-print-block">
<div class="tab-pane active" id="3400355">
  <dl class="dl-horizontal vpv-row">
    <dt>Sizing</dt><dd>750 mL bottle × 6</dd>
    <dt>SKU</dt><dd>80914</dd>
    <dt>UPC</dt><dd>853192006189</dd>
    <dt>Status</dt><dd>Active</dd>
    <dt>Availability</dt><dd> <span class="label label-success inventory-status-badge"><span data-container="body" data-toggle="popover" data-placement="top" data-content="Athens Distributing Company of Tennessee is integrated with SevenFifty and sends inventory levels at least once a day. You can order this item and expect that it is available." data-original-title="" title="">IN STOCK</span></span> </dd>
  </dl>
  <div id="prices-table">
    <div class="table-responsive">
      <table class="table table-condensed auto-width">
        <h2>Pricing</h2><thead>
          <tr> <th></th> <th class="best-bottle-top"> Frontline </th> </tr>
        </thead>
        <tbody>
          <tr> <td>Bottles</td> <td class="best-bottle-mid">1</td> </tr>
          <tr> <td>Cases</td> <td class="best-bottle-mid">—</td> </tr>
          <tr> <td>Price per bottle</td> <td class="best-bottle-mid"> <div>$33.03 #I want THIS data </div> </td> </tr>
          <tr> <td>Price per case</td> <td class="best-bottle-mid"> <div> $198.18 I want THIS data </div> </td> </tr>
          <tr> <td>Cost per ounce</td> <td class="best-bottle-mid"> <div>$1.30 I want THIS data </div> </td> </tr>
          <tr> <td></td> <td class="best-bottle-bot text-muted"> <span class="best-bottle-bot-content"> <span> <div><small>Best</small></div> <small>Bottle</small> </span> </span> </td> </tr>
        </tbody>
      </table>
    </div>

Using the full XPath that Chrome gives me finds what I want, but a relative XPath does not work.
Here's what I've tried.
Full XPath for the case price (works, but I don't want to use absolute references):
/html/body/div[3]/div[1]/div/div[2]/div[2]/div[2]/div/div[3]/div[2]/div[3]/div[2]/div/div/table/tbody/tr[4]/td[2]/div
Relative XPath for the case price (returns None):
//*[@id="prices-table"]/div/table/tbody/tr[4]/td[2]/div
Unfortunately I can't link the actual webpage because it requires login credentials. Thanks for any/all help.
There are two ways to do it. If the tags and attributes are the same in both blocks, use XPath indexing.

//td[text()='Price per bottle']/following-sibling::td[@class='best-bottle-mid']

This matches two nodes, and find_element only returns the first occurrence, which you do not want. So use

(//td[text()='Price per bottle']/following-sibling::td[@class='best-bottle-mid'])[2]

to locate the second element. You can do the same for Price per case and Cost per ounce.

The other way is to use find_elements:

price_per_bottle_elements = driver.find_elements(By.XPATH, "//td[text()='Price per bottle']/following-sibling::td[@class='best-bottle-mid']")
print(price_per_bottle_elements[0].text)  # the first occurrence, which we do not want
print(price_per_bottle_elements[1].text)  # the second occurrence, which we want
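Putting the find_elements variant together for all three values, a minimal sketch (the login and navigation steps are only indicated by a comment, since the site needs credentials; the label texts come from the posted HTML):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
# ... log in and navigate to the product page as in the existing script ...

labels = ["Price per bottle", "Price per case", "Cost per ounce"]
prices = {}
for label in labels:
    # Both pricing blocks match; the second occurrence holds the data we want
    cells = driver.find_elements(
        By.XPATH,
        f"//td[text()='{label}']/following-sibling::td[@class='best-bottle-mid']"
    )
    if len(cells) > 1:
        prices[label] = cells[1].text.strip()

print(prices)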
Some <td>'s Cannot Be Found by find_next()
This is a question about scraping with BS4. The website I'm scraping has barely any IDs on the elements I need, so I'm relying on find_next(), find_next_siblings() and similar traversal methods. I used find_next() to get some td values from my tables; it works for some values, but for others it can't find anything. Here's the HTML:

<table style="max-width: 350px;" border="0">
  <tbody>
    <tr>
      <td style="max-width: 215px;">REF. NO.</td>
      <td style="max-width: 12px;" align="center"> </td>
      <td align="right">000124 </td>
    </tr>
    <tr>
      <td>REF. NO.</td>
      <td align="center"> </td>
      <td align="right"> </td>
    </tr>
    <tr>
      <td>MANU</td>
      <td align="center"> </td>
      <td align="right"></td>
    </tr>
    <tr>
      <td>STREAK</td>
      <td align="center"> </td>
      <td align="right">1075</td>
    </tr>
    <tr>
      <td>PACK</td>
      <td align="center"> </td>
      <td align="right">1</td>
    </tr>
    <tr>
      <td colspan="3">ON STOCK. </td>
    </tr>
    .... and so on

So I used this code to get what I want:

div = soup.find('div', {'id': 'infodata'})
table_data = div.find_all('td')
for element in table_data:
    if "STREAK" in element.get_text():
        price = element.find_next('td').find_next('td').text
        print(price + "price")
    else:
        print('NOT FOUND!')

I copied and pasted text from the HTML many times to make sure I didn't mistype anything, but it still always goes to "NOT FOUND!". If I try other table names, for example PACK, I can get them. By the way, I'm using two find_next() calls there because every <tr> has three <td>s. Why does this work for some words but not for others? Any help is appreciated. Thank you very much!
I would rewrite it like this:

trs = div.find_all('tr')
for tr in trs:
    tds = tr.select('td')
    if len(tds) > 1 and 'STREAK' in tds[0].get_text().strip():
        price = tds[-1].get_text().strip()
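For reference, a self-contained sketch of that row-by-row approach, run against a trimmed copy of the question's table (only a few rows reproduced, without the surrounding infodata div):

from bs4 import BeautifulSoup

html = """
<table border="0">
  <tr><td>REF. NO.</td><td align="center"> </td><td align="right">000124 </td></tr>
  <tr><td>MANU</td><td align="center"> </td><td align="right"></td></tr>
  <tr><td>STREAK</td><td align="center"> </td><td align="right">1075</td></tr>
  <tr><td>PACK</td><td align="center"> </td><td align="right">1</td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')
for tr in soup.find_all('tr'):
    tds = tr.find_all('td')
    # first cell is the label, last cell is the value
    if len(tds) > 1 and 'STREAK' in tds[0].get_text().strip():
        print(tds[-1].get_text().strip())   # prints: 1075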
Extract 2 pieces of information from HTML in Python
I need help figuring out how to extract Grab and the number following data-b. There are many <tr>s in the complete, unmodified webpage, and I need to filter using the "Need" just before </a>. I've been trying to do this with Beautiful Soup, though it looks like lxml might work better. I can get either all of the <tr>s, or only the <a>...</a> lines that contain Need, but not just the <tr>s that contain Need in that <a> line.

<tr>
  <td>3</td>
  <td>Leave</td><td>Useless</td>
  <td class="text-right"> <span class="float2" data-a="24608000.0" data-b="518" data-n="818">Garbage</span></td>
  <td class="text-right"> <span class="Float" data-a="3019" data-b="0.0635664" data-n="283">Garbage2</span></td>
  <td class="text-right">7.38%</td>
  <td class="text-right " >Recently</td>
</tr>
<tr>
  <td>4</td>
  <td>Grab</td><td>Need</td>
  <td class="text-right"> <span class="bloat2" data="22435000.0" data-b="512" data-n="74491.2">More junk</span></td>
  <td class="text-right"> <span class="bloat" data-a="301.177" data-b="35.848" data-n="0.5848">More junk2</span></td>
  <td class="text-right">Some more</td>
  <td class="text-right " >Recently</td>
</tr>

Thanks for any help!
from bs4 import BeautifulSoup

data = '''<tr>
<td>3</td>
<td>Leave</td><td>Useless</td>
<td class="text-right"> <span class="float2" data-a="24608000.0" data-b="518" data-n="818">Garbage</span></td>
<td class="text-right"> <span class="Float" data-a="3019" data-b="0.0635664" data-n="283">Garbage2</span></td>
<td class="text-right">7.38%</td>
<td class="text-right " >Recently</td>
</tr>
<tr>
<td>4</td>
<td>Grab</td><td>Need</td>
<td class="text-right"> <span class="bloat2" data="22435000.0" data-b="512" data-n="74491.2">More junk</span></td>
<td class="text-right"> <span class="bloat" data-a="301.177" data-b="35.848" data-n="0.5848">More junk2</span></td>
<td class="text-right">Some more</td>
<td class="text-right " >Recently</td>
</tr>
'''

soup = BeautifulSoup(data, 'html.parser')

# Note: the posted snippet has no <a> tags, so this line targets the full page
print(soup.findAll('a', {"href": "/local"})[0].text)

# The spans in the "Grab" row use the bloat/bloat2 classes
for a in soup.findAll('span', {"class": ["bloat", "bloat2"]}):
    print(a['data-b'])
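Since the goal is to keep only the <tr> that contains Need and then pull Grab plus the data-b values, one possible filter against the posted snippet (which has no <a> tags, so the <td>Need</td> cell is matched instead) is sketched below:

from bs4 import BeautifulSoup

# Trimmed copy of the two rows from the question
data = '''<tr><td>3</td><td>Leave</td><td>Useless</td>
<td class="text-right"><span class="float2" data-a="24608000.0" data-b="518" data-n="818">Garbage</span></td></tr>
<tr><td>4</td><td>Grab</td><td>Need</td>
<td class="text-right"><span class="bloat2" data="22435000.0" data-b="512" data-n="74491.2">More junk</span></td>
<td class="text-right"><span class="bloat" data-a="301.177" data-b="35.848" data-n="0.5848">More junk2</span></td></tr>'''

soup = BeautifulSoup(data, 'html.parser')

for tr in soup.find_all('tr'):
    # keep only rows containing a cell whose text is exactly "Need"
    if tr.find(string="Need"):
        print(tr.find_all('td')[1].get_text(strip=True))   # Grab
        for span in tr.find_all('span', attrs={"data-b": True}):
            print(span['data-b'])                           # 512, 35.848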
Python code to click next-page links and scrape all the page hyperlinks
I am very new to Python and I'm stuck on moving from one page to another; I am able to scrape one page's details. Below is the code I am using:

def getURLinfo(url):
    url = "https://apps1.coned.com/cemyaccount/MemberPages/MyAccounts.aspx?lang=eng"
    driver.get(url)
    html = driver.page_source
    nextpage = "ctl00$Main$DataPager1$ctl01$ctl01"
    soup = BeautifulSoup(html)
    while soup.find(id=re.compile(nextpage)):
        for table in soup.findAll('table', {'id': 'ctl00_Main_lvMyAccount_itemPlaceholderContainer'}):
            for link in table.findAll('a'):
                link.findAll('a')
                print link['href']
        driver.find_element_by_link_text(nextpage).click()
        html = html + driver.page_source
        soup = BeautifulSoup(driver.page_source)
    soup = BeautifulSoup(html)
    driver.close()

I am not sure if I am on the right track either. Below is the HTML:

View 211538138800143 43-38 39 PLAC 35 JUAN MENDOZA Active Delete
<tr style="background-color:#EFEFEF">
  <td> <a id="ctl00_Main_lvMyAccount_ctrl17_lnkSelect" href="javascript:__doPostBack('ctl00$Main$lvMyAccount$ctrl17$lnkSelect','')">View </a> </td>
  <td> <span id="ctl00_Main_lvMyAccount_ctrl17_lblAcctNumber">211558100500042</span> </td>
  <td> <span id="ctl00_Main_lvMyAccount_ctrl17_LblServiceAddress">41-12 41 ST ENTM </span> </td>
  <td> <span id="ctl00_Main_lvMyAccount_ctrl17_LblCustName">41-12 41ST MGMT CORP.</span> </td>
  <td> <span id="ctl00_Main_lvMyAccount_ctrl17_LblAcctStatus">Active</span> </td>
  <td> <a onclick="return confirm('Are you sure you want to delete this account number?');" id="ctl00_Main_lvMyAccount_ctrl17_lnkDelete" href="javascript:__doPostBack('ctl00$Main$lvMyAccount$ctrl17$lnkDelete','')">Delete </a> </td>
</tr>
<tr>
  <td> <a id="ctl00_Main_lvMyAccount_ctrl18_lnkSelect" href="javascript:__doPostBack('ctl00$Main$lvMyAccount$ctrl18$lnkSelect','')">View </a> </td>
  <td> <span id="ctl00_Main_lvMyAccount_ctrl18_lblAcctNumber">211558102300045</span> </td>
  <td> <span id="ctl00_Main_lvMyAccount_ctrl18_LblServiceAddress">41-12 41 ST 1D </span> </td>
  <td> <span id="ctl00_Main_lvMyAccount_ctrl18_LblCustName">41-12 MGMT CORP </span> </td>
  <td> <span id="ctl00_Main_lvMyAccount_ctrl18_LblAcctStatus">Active</span> </td>
  <td valign="top"> <a onclick="return confirm('Are you sure you want to delete this account number?');" id="ctl00_Main_lvMyAccount_ctrl18_lnkDelete" href="javascript:__doPostBack('ctl00$Main$lvMyAccount$ctrl18$lnkDelete','')">Delete </a> </td>
</tr>
<tr style="background-color:#EFEFEF">
  <td> <a id="ctl00_Main_lvMyAccount_ctrl19_lnkSelect" href="javascript:__doPostBack('ctl00$Main$lvMyAccount$ctrl19$lnkSelect','')">View </a> </td>
  <td> <span id="ctl00_Main_lvMyAccount_ctrl19_lblAcctNumber">211564295000053</span> </td>
  <td> <span id="ctl00_Main_lvMyAccount_ctrl19_LblServiceAddress">47-07 39 ST HLSM </span> </td>
  <td> <span id="ctl00_Main_lvMyAccount_ctrl19_LblCustName">QPII-47-07 39 ST.,LLC</span> </td>
  <td> <span id="ctl00_Main_lvMyAccount_ctrl19_LblAcctStatus">Active</span> </td>
  <td> <a onclick="return confirm('Are you sure you want to delete this account number?');" id="ctl00_Main_lvMyAccount_ctrl19_lnkDelete" href="javascript:__doPostBack('ctl00$Main$lvMyAccount$ctrl19$lnkDelete','')">Delete </a> </td>
</tr>
</table>
</td>
</tr>
</td>
</tr>
<tr align="center"><td>
  <span id="ctl00_Main_DataPager1"><a disabled="disabled"><< </a> <span>1</span> 2 3 4 5 ... >> </span>
</td></tr>
</table>
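The loop being attempted here (parse the current page, collect the hrefs, click the pager link, repeat) might be sketched roughly like this. It is only a sketch: it assumes the ">>" text in the DataPager span is the next-page link and reuses the table id from the posted code, so treat it as a starting point rather than a tested answer.

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

driver = webdriver.Chrome()
driver.get("https://apps1.coned.com/cemyaccount/MemberPages/MyAccounts.aspx?lang=eng")

all_links = []
while True:
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    table = soup.find('table', id='ctl00_Main_lvMyAccount_itemPlaceholderContainer')
    if table:
        # collect every href in the accounts table on the current page
        all_links.extend(a['href'] for a in table.find_all('a', href=True))
    try:
        # assumes the ">>" text in the DataPager span is the next-page link
        driver.find_element(By.LINK_TEXT, ">>").click()
        # a short explicit wait may be needed here for the postback to finish
    except NoSuchElementException:
        break  # no next-page link found, so this was the last page

print(all_links)
driver.quit()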
BeautifulSoup: How to extract data after a specific HTML tag
I have the following HTML and I am trying to figure out how to tell BeautifulSoup to extract a td that comes after a certain HTML element. In this case I want to get the data in the <td> after <td>Color Digest</td>:

<tr>
  <td> Color Digest </td>
  <td> 2,36,156,38,25,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, </td>
</tr>

This is the entire HTML:

<html>
<head>
<body>
<div align="center">
<table cellspacing="0" cellpadding="0" style="clear:both; width:100%;margin:0px; font-size:1pt;">
<br>
<br>
<table>
<table>
<tbody>
<tr bgcolor="#AAAAAA">
<tr> <tr> <tr> <tr> <tr> <tr> <tr> <tr> <tr> <tr> <tr>
<tr>
<td> Color Digest </td>
<td> 2,36,156,38,25,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, </td>
</tr>
</tbody>
</table>
Sounds like you need to iterate over a list of <td> and stop once you've found your data. Example:

from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup('<html><tr><td>X</td><td>Color Digest</td><td>THE DIGEST</td></tr></html>')
for cell in soup.html.tr.findAll('td'):
    if 'Color Digest' == cell.text:
        print cell.nextSibling.text
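The answer above uses the old BeautifulSoup 3 / Python 2 API. An equivalent sketch with bs4 on Python 3, using find_next_sibling against a trimmed copy of the question's own row, would be:

from bs4 import BeautifulSoup

# Trimmed copy of the row from the question (digest shortened)
html = '''<tr>
  <td> Color Digest </td>
  <td> 2,36,156,38,25,0,0,0, </td>
</tr>'''

soup = BeautifulSoup(html, 'html.parser')

# Find the cell whose text is "Color Digest", then take the next <td>
label = soup.find('td', string=lambda s: s and s.strip() == 'Color Digest')
digest = label.find_next_sibling('td').get_text(strip=True)
print(digest)   # 2,36,156,38,25,0,0,0,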