How to scrape a specific HTML line that follows another HTML line - Python
I want to scrape some data from an HTML page that looks something like this:
<tr>
<td> Some information </td>
<td> 123 </td>
</tr>
<tr>
<td> some other information </td>
<td> 456 </td>
</tr>
<tr>
<td> and the info continues </td>
<td> 789 </td>
</tr>
What I want is to obtain the HTML line that comes after a given HTML line. That is, if I see 'some other information', I want the output '456'. I thought of combining regex with .find_next from BeautifulSoup, but I haven't had any luck with this (I'm also not that familiar with regex). Does anyone have a clue how to do it? Thanks a lot in advance.
Actually, with a mix of regex and find_next in BeautifulSoup you can achieve what you want:
from bs4 import BeautifulSoup
import re
html = """
<tr>
<td> Some information </td>
<td> 123 </td>
</tr>
<tr>
<td> some other information </td>
<td> 456 </td>
</tr>
<tr>
<td> and the info continues </td>
<td> 789 </td>
</tr>
"""
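# Name the parser explicitly; otherwise bs4 guesses one and emits a warning
soup = BeautifulSoup(html, 'html.parser')
# string= is the current name of the older text= argument
x = soup.find('td', string=re.compile('some other information'))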
print(x.find_next('td').text)
Output
' 456 '
EDIT: replaced x.find_next('td').contents[0] with x.find_next('td').text, which is shorter.
Related
Python Selenium: Finding elements by xpath when there are duplicates in the html code
So I'm working in Selenium to scrape content from a liquor sales website in order to more quickly add product details to a spreadsheet. I'm using Selenium to log into the website and search for the correct product. Once I get to the product page, I'm able to scrape all the data I need except for some data that's contained in a certain block of the code. I want 3 pieces of data: price per case, price per bottle, and price per oz. I noticed in the code that the data I'm looking for appears twice in a similar pattern. Interestingly, the data I want is the second occurrence (the first occurrence is incorrect). The relevant HTML code is:
<h2>Pricing</h2>
<div id="prices-table">
<div class="table-responsive">
<table class="table table-condensed auto-width">
<thead> <tr> <th></th> <th class="best-bottle-top"> Frontline </th> </tr> </thead>
<tbody>
<tr> <td>Bottles</td> <td class="best-bottle-mid">1</td> </tr>
<tr> <td>Cases</td> <td class="best-bottle-mid">—</td> </tr>
<tr> <td>Price per bottle</td> <td class="best-bottle-mid"> <div>$16.14 #I don't want this data </div> </td> </tr>
<tr> <td>Price per case</td> <td class="best-bottle-mid"> <div> $193.71 #I don't want this data </div> </td> </tr>
<tr> <td>Cost per ounce</td> <td class="best-bottle-mid"> <div>$1.27 #I don't want this data </div> </td> </tr>
<tr> <td></td> <td class="best-bottle-bot text-muted"> <span class="best-bottle-bot-content"> <span> <div><small>Best</small></div> <small>Bottle</small> </span> </span> </td> </tr>
</tbody>
</table>
</div>
<p> <em class="price-disclaimer">Defer to Athens Distributing Company of Tennessee in case of any price discrepancies.</em> </p>
</div>
</div>
<hr class="visible-print-block">
<div class="tab-pane active" id="3400355">
<dl class="dl-horizontal vpv-row">
<dt>Sizing</dt><dd>750 mL bottle × 6</dd>
<dt>SKU</dt><dd>80914</dd>
<dt>UPC</dt><dd>853192006189</dd>
<dt>Status</dt><dd>Active</dd>
<dt>Availability</dt><dd> <span class="label label-success inventory-status-badge"><span data-container="body" data-toggle="popover" data-placement="top" data-content="Athens Distributing Company of Tennessee is integrated with SevenFifty and sends inventory levels at least once a day. You can order this item and expect that it is available." data-original-title="" title="">IN STOCK</span></span> </dd></dl>
<div id="prices-table">
<div class="table-responsive">
<table class="table table-condensed auto-width">
<h2>Pricing</h2><thead> <tr> <th></th> <th class="best-bottle-top"> Frontline </th> </tr> </thead>
<tbody>
<tr> <td>Bottles</td> <td class="best-bottle-mid">1</td> </tr>
<tr> <td>Cases</td> <td class="best-bottle-mid">—</td> </tr>
<tr> <td>Price per bottle</td> <td class="best-bottle-mid"> <div>$33.03 #I want THIS data </div> </td> </tr>
<tr> <td>Price per case</td> <td class="best-bottle-mid"> <div> $198.18 I want THIS data </div> </td> </tr>
<tr> <td>Cost per ounce</td> <td class="best-bottle-mid"> <div>$1.30 I want THIS data </div> </td> </tr>
<tr> <td></td> <td class="best-bottle-bot text-muted"> <span class="best-bottle-bot-content"> <span> <div><small>Best</small></div> <small>Bottle</small> </span> </span> </td> </tr>
</tbody>
</table>
</div>
Using the full XPath that Chrome gives me finds what I want, but trying a relative path does not work. Here's what I've tried:
Full XPath for case price (works, but I don't want to use absolute references):
/html/body/div[3]/div[1]/div/div[2]/div[2]/div[2]/div/div[3]/div[2]/div[3]/div[2]/div/div/table/tbody/tr[4]/td[2]/div
Relative XPath for case price (returns None):
//*[@id="prices-table"]/div/table/tbody/tr[4]/td[2]/div
Unfortunately I can't link the actual webpage because it requires login credentials. Thanks for any/all help.
There are two ways to do it. If everything is the same (the tags and their attributes), use XPath indexing.
//td[text()='Price per bottle']/following-sibling::td[@class='best-bottle-mid']
This matches two nodes, and find_element only returns the first occurrence, which you do not want. So you can do this:
(//td[text()='Price per bottle']/following-sibling::td[@class='best-bottle-mid'])[2]
to locate the second web element. You can do the same for Price per case and Cost per ounce.
The other way is to use find_elements:
from selenium.webdriver.common.by import By

price_per_bottle_elements = driver.find_elements(By.XPATH, "//td[text()='Price per bottle']/following-sibling::td[@class='best-bottle-mid']")
print(price_per_bottle_elements[0].text)  # the first match, which we do not want
print(price_per_bottle_elements[1].text)  # the second match, which we want
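The "collect every match, then pick by position" idea can also be tried offline, without a browser session. The sketch below is not Selenium: it uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath (no following-sibling axis), so the path is rewritten to select the row by its label cell. The markup is a trimmed, hypothetical copy of the two pricing tables, wrapped in a made-up root element so that it parses as XML.

```python
import xml.etree.ElementTree as ET

# Trimmed, hypothetical copy of the two pricing tables (well-formed XML).
page = """
<root>
  <table><tbody>
    <tr><td>Price per bottle</td><td class="best-bottle-mid"><div>$16.14</div></td></tr>
  </tbody></table>
  <table><tbody>
    <tr><td>Price per bottle</td><td class="best-bottle-mid"><div> $33.03 </div></td></tr>
  </tbody></table>
</root>
"""

root = ET.fromstring(page)

# Select every value cell in a row that has a <td> reading 'Price per bottle'.
cells = root.findall(".//tr[td='Price per bottle']/td[@class='best-bottle-mid']")

# Matches come back in document order, so index 1 is the second occurrence.
prices = [cell.findtext("div").strip() for cell in cells]
second_price = prices[1]
print(second_price)
```

Python lists are zero-based, so the second match is index 1, whereas the XPath form (...)[2] counts from 1.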
BS4 findall not returning all divs
I was trying to get to the bottom table on the site, but findAll() kept returning empty results, so I fetched the divs on the same level one by one and noticed that the last two come back as []:
the_page=urllib.request.urlopen("https://theunderminejournal.com/#eu/sylvanas/item/124105")
bsObj=BeautifulSoup(the_page,'html.parser')
test=bsObj.findAll('div',{'class':'page','id':"item-page"})
print(test)
I have gone through the bs4 object that I got, and the 2 divs I'm looking for aren't in it. What's happening? The div I'm trying to extract is on this page: https://theunderminejournal.com/#eu/sylvanas/item/124105
You will need to use Selenium instead of the normal requests libraries, presumably because the page fills in those divs with JavaScript after it loads, which a plain HTTP fetch never runs. Note that I couldn't post all of the output, as the parsed HTML was huge.
Code:
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://theunderminejournal.com/#eu/sylvanas/item/124105")
# Parse the page as rendered by the browser, not the raw server response
bsObj = BeautifulSoup(driver.page_source,'html.parser')
test = bsObj.find('div', id='item-page')
print(test.prettify())
Output:
<div class="page" id="item-page" style="display: block;">
<div class="item-stats">
<table>
<tr class="available"> <th> Available Quantity </th> <td> <span> 30,545 </span> </td> </tr>
<tr class="spacer"> <td colspan="3"> </td> </tr>
<tr class="current-price"> <th> Current Price </th> <td> <span class="money-gold"> 27.34 </span> </td> </tr>
<tr class="median-price"> <th> Median Price </th> <td> <span class="money-gold"> 30.11 </span> </td> </tr>
<tr class="mean-price"> <th> Mean Price </th> <td> <span class="money-gold"> 30.52 </span> </td> </tr>
<tr class="standard-deviation"> <th> Standard Deviation </th> <td> <span class="money-gold"> . . . </span> </abbr> </td> </tr>
</table>
</div>
</div>
</div>
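One hint that a plain fetch cannot be enough here comes from the URL itself (this is an inference from the URL, not something stated in the answer): everything after the # is a fragment, which stays on the client and is not part of the HTTP request. A small standard-library sketch makes this visible:

```python
from urllib.parse import urlsplit, urldefrag

url = "https://theunderminejournal.com/#eu/sylvanas/item/124105"

parts = urlsplit(url)
# The path the server sees is just "/"; the item details live in the fragment.
print(parts.path)
print(parts.fragment)

# The URL with the fragment removed, i.e. what an HTTP client actually requests.
requested_url = urldefrag(url).url
print(requested_url)
```

Since the server is only asked for the site root, the item-specific content has to be built in the browser, which is why a browser-driving tool such as Selenium sees divs that urllib does not.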
Some <td>'s Cannot Be Found by find_next()
This is a question about BS4 for scraping. I'm scraping a website that has barely any IDs on the elements I need, so I'm set on using find_next, find_next_siblings or any other iterator-style BS4 methods. I used find_next() to get some td values from my tables. It worked for some values, but for some reason it can't detect the others. Here's the HTML:
<table style="max-width: 350px;" border="0">
<tbody><tr> <td style="max-width: 215px;">REF. NO.</td> <td style="max-width: 12px;" align="center"> </td> <td align="right">000124 </td> </tr>
<tr> <td>REF. NO.</td> <td align="center"> </td> <td align="right"> </td> </tr>
<tr> <td>MANU</td> <td align="center"> </td> <td align="right"></td> </tr>
<tr> <td>STREAK</td> <td align="center"> </td> <td align="right">1075</td> </tr>
<tr> <td>PACK</td> <td align="center"> </td> <td align="right">1</td> </tr>
<tr> <td colspan="3">ON STOCK. </td> </tr>
.... and so on
So I used this code to get what I want:
div = soup.find('div', {'id': 'infodata'})
table_data = div.find_all('td')
for element in table_data:
    if "STREAK" in element.get_text():
        price = element.find_next('td').find_next('td').text
        print(price + "price")
    else:
        print('NOT FOUND!')
I copied and pasted text straight from the HTML many times to make sure I didn't mistype anything, but it would still always go to 'NOT FOUND!'. If I try other row labels, for example PACK, I can get them. By the way, I'm using two find_next() calls there because the HTML has three td's in every <tr>. Why is this working for some words but not for others? Any help is appreciated. Thank you very much!
I would rewrite it like this:
trs = div.find_all('tr')
for tr in trs:
    tds = tr.select('td')
    # Skip rows with a single cell; match on the first cell, read the last one
    if len(tds) > 1 and 'STREAK' in tds[0].get_text().strip():
        price = tds[-1].get_text().strip()
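The row-by-row idea can be taken one step further by collecting every label/value pair in a single pass. This is a minimal sketch on a trimmed copy of the question's table; the values dictionary and the choice of html.parser are assumptions made for illustration, not part of the original answer.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the table from the question.
html = """
<table>
  <tr><td>REF. NO.</td><td align="center"> </td><td align="right">000124 </td></tr>
  <tr><td>STREAK</td><td align="center"> </td><td align="right">1075</td></tr>
  <tr><td>PACK</td><td align="center"> </td><td align="right">1</td></tr>
  <tr><td colspan="3">ON STOCK. </td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

values = {}
for tr in soup.find_all("tr"):
    tds = tr.find_all("td")
    # Rows with a single cell (such as "ON STOCK.") carry no label/value pair.
    if len(tds) > 1:
        label = tds[0].get_text(strip=True)
        values[label] = tds[-1].get_text(strip=True)

print(values["STREAK"])
```

Working per row keeps each lookup inside its own <tr>, so a label can never be paired with a cell from a neighbouring row.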
How can I make a "FOR" (loop) in HTML, using Chameleon and Pyramid in Python 3.4?
How can I make a loop using Chameleon and Pyramid in my HTML? I searched but found nothing like that. Is it easier to use JavaScript in this case? I use a datatable in MACADMIN (Bootstrap theme).
<div class="table-responsive">
<table cellpadding="0" cellspacing="0" border="0" id="data-table" width="100%">
<thead>
<tr> <th> Rendering engine </th> <th> Browser </th> <th> Platform(s) </th> <th> Engine version </th> <th> CSS grade </th> </tr>
</thead>
<tbody>
Maybe put FOR here? like {for x items in "TABLE"}
<tr> <td> {orgao_doc[x].nome} </td> <td> {orgao_doc[x].cargo} </td> <td> {orgao_doc[x].coleta} </td> <td> {orgao_doc[x].email} </td> <td> {orgao_doc[x].endereco} </td> </tr>
</tbody>
</table>
<div class="clearfix"> </div>
</div>
Use a tal:repeat attribute to repeat parts of a template, given a sequence:
<tbody>
<tr tal:repeat="item orgao_doc">
<td>${item.nome}</td>
<td>${item.cargo}</td>
<td>${item.coleta}</td>
<td>${item.email}</td>
<td>${item.endereco}</td>
</tr>
</tbody>
The <tr> tag is repeatedly inserted into the output, once for each element in orgao_doc. The name item is bound to each element when rendering this part of the template.
BeautifulSoup: How to extract data after specific html tag
I have following html and I am trying to figure out how exactly I can tell BeautifulSoup to extract td after certain html element. In this case I want to get data in <td> after <td>Color Digest</td> <tr> <td> Color Digest </td> <td> 2,36,156,38,25,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, </td> </tr> This is the entire HTML <html> <head> <body> <div align="center"> <table cellspacing="0" cellpadding="0" style="clear:both; width:100%;margin:0px; font-size:1pt;"> <br> <br> <table> <table> <tbody> <tr bgcolor="#AAAAAA"> <tr> <tr> <tr> <tr> <tr> <tr> <tr> <tr> <tr> <tr> <tr> <tr> <td> Color Digest </td> <td> 2,36,156,38,25,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, </td> </tr> </tbody> </table>
Sounds like you need to iterate over a list of <td> and stop once you've found your data. Example:
from bs4 import BeautifulSoup

soup = BeautifulSoup('<html><tr><td>X</td><td>Color Digest</td><td>THE DIGEST</td></tr></html>', 'html.parser')
for cell in soup.html.tr.find_all('td'):
    if cell.text == 'Color Digest':
        print(cell.next_sibling.text)
        break  # stop once the data has been found
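The example above has no whitespace between its cells or around the label. In the question's markup the cells are separated by whitespace and the label is padded with spaces, so the node right after a <td> is a text node rather than the next cell, and an exact string comparison fails. A hedged variant using find_next_sibling and stripped text handles both; the digest value below is shortened for illustration and is not the real data.

```python
from bs4 import BeautifulSoup

# Same layout as the question's row, with a shortened, illustrative digest.
html = """
<table>
  <tr>
    <td> Color Digest </td>
    <td> 2,36,156,38,25,0,0 </td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

digest = None
for cell in soup.find_all("td"):
    # Compare the stripped text so the padding around the label does not matter.
    if cell.get_text(strip=True) == "Color Digest":
        # find_next_sibling("td") skips the whitespace text node between cells.
        value_cell = cell.find_next_sibling("td")
        if value_cell is not None:
            digest = value_cell.get_text(strip=True)
        break

print(digest)
```

If the label is missing, or is the last cell in its row, digest simply stays None instead of raising an error.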