Extracting the right elements by text and span / Beautiful Soup / Python

I'm trying to scrape the following data:
Cuisine: 4.5
Service: 4.0
Quality: 4.5
But I'm having trouble selecting the right elements. I tried the following two approaches:
for bewertungen in soup.find_all('div', {'class' : 'histogramCommon bubbleHistogram wrap'}):
    if bewertungen.find(text='Cuisine'):
        cuisine = bewertungen.find(text='Cuisine')
        cuisine = cuisine.next_element
        print("test " + str(cuisine))
    if bewertungen.find_all(text='Service'):
        for s_bewertung in bewertungen.find_all('span', {'class':'ui_bubble_rating'}):
            s_speicher = s_bewertung['alt']
In the first if block I get no result. In the second if block I get the right elements, but I get all three ratings at once and cannot tell which one belongs to which label (Cuisine, Service, Quality).
Can someone advise me how to get the right data?
The HTML is below.
<div class="histogramCommon bubbleHistogram wrap">
<div class="colTitle">\nGesamtwertung\n</div>
<ul class="barChart">
<li>
<div class="ratingRow wrap">
<div class="label part ">
<span class="text">Cuisine</span>
</div>
<div class="wrap row part ">
<span alt="4.5 of five" class="ui_bubble_rating bubble_45"></span>
</div>
</div>
<div class="ratingRow wrap">
<div class="label part ">
<span class="text">Service</span>
</div>
<div class="wrap row part ">
<span alt="4.0 of five" class="ui_bubble_rating bubble_40"></span>
</div>
</div>
</li>
<li>
<div class="ratingRow wrap">
<div class="label part ">
<span class="text">Quality</span>
</div>
<div class="wrap row part "><span alt="4.5 of five" class="ui_bubble_rating bubble_45"></span></div>
</div>
</li>
</ul>
</div>

Try this. According to the snippet you have pasted above, the following code should work:
from bs4 import BeautifulSoup
# content is assumed to hold the HTML of the page
soup = BeautifulSoup(content,"lxml")
for item in soup.select(".ratingRow"):
    # the label and its rating live in the same .ratingRow block
    category = item.select_one(".text").text
    rating = item.select_one(".row span")['alt'].split(" ")[0]
    print("{} : {}".format(category,rating))
Another way would be:
for item in soup.select(".ratingRow"):
    category = item.select_one(".text").text
    rating = item.select_one(".text").find_parent().find_next_sibling().select_one("span")['alt'].split(" ")[0]
    print("{} : {}".format(category,rating))
Output:
Cuisine : 4.5
Service : 4.0
Quality : 4.5
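As for why the first attempt in the question printed nothing useful: find(text='Cuisine') returns the text node itself, and next_element is simply the next node in document order (here most likely the whitespace after the closing span), not the rating. Below is a minimal sketch of a label-based lookup, assuming the same markup as the pasted snippet. rating_for is a hypothetical helper name, not part of Beautiful Soup, and html.parser is used only to avoid the lxml dependency:

```python
from bs4 import BeautifulSoup

html = """
<div class="ratingRow wrap">
  <div class="label part "><span class="text">Cuisine</span></div>
  <div class="wrap row part "><span alt="4.5 of five" class="ui_bubble_rating bubble_45"></span></div>
</div>
<div class="ratingRow wrap">
  <div class="label part "><span class="text">Service</span></div>
  <div class="wrap row part "><span alt="4.0 of five" class="ui_bubble_rating bubble_40"></span></div>
</div>
<div class="ratingRow wrap">
  <div class="label part "><span class="text">Quality</span></div>
  <div class="wrap row part "><span alt="4.5 of five" class="ui_bubble_rating bubble_45"></span></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")


def rating_for(soup, label):
    """Hypothetical helper: return the rating next to a label, or None."""
    text_node = soup.find(string=label)  # exact match on the text node
    if text_node is None:
        return None
    # climb from the text node to the row that holds both label and rating
    row = text_node.find_parent("div", class_="ratingRow")
    if row is None:
        return None
    bubble = row.find("span", class_="ui_bubble_rating")
    return bubble["alt"].split(" ")[0] if bubble else None


ratings = {label: rating_for(soup, label) for label in ("Cuisine", "Service", "Quality")}
print(ratings)
```

Going up to the shared row first keeps each label tied to its own rating, which is the part the second attempt in the question was missing.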

Related

Confirm preceding and following sibling in XPATH

I've got the statement below to check that two conditions hold within an element:
if len(driver.find_elements(By.XPATH, "//span[text()='$400.00']/../following-sibling::div/a[text()='Buy']")) > 0:
    elem = driver.find_element(By.XPATH, "//span[text()='$400.00']/../following-sibling::div/a[text()='Buy']")
I've tried a few variations, including "preceding sibling::span[text()='x'", but I can't get the syntax right, and I'm not sure I'm going about it the right way.
The HTML is below. The current find_elements(By.XPATH, ...) call correctly matches the "Total" and "Buy" divs; I would also like to add $20.00 in the "Price" div as a condition.
<ul>
<li class="List">
<div class="List-Content row">
<div class="Price">"$20.00"</div>
<div class="Quantity">10</div>
<div class="Change">0%</div>
<div class="Total">
<span>$400.00</span>
</div>
<div class="Buy">
<a class="Button">Buy</a>
</div>
</div>
</li>
</ul>

Using the built-in ElementTree:
import xml.etree.ElementTree as ET
html = '''<li class="List">
<div class="List-Content row">
<div class="Price">"$20.00"</div>
<div class="Quantity">10</div>
<div class="Change">0%</div>
<div class="Total"><span>$400.00</span></div>
<div class="Buy"><a class="Button">Buy</a></div>
</div>
<div class="List-Content row">
<div class="Price">"$27.00"</div>
<div class="Quantity">10</div>
<div class="Change">0%</div>
<div class="Total"><span>$400.00</span></div>
<div class="Buy"><a class="Button">Buy</a></div>
</div>
</li>'''
items = {'Total':'$400.00','Buy':'Buy','Price':'"$20.00"'}
root = ET.fromstring(html)
first_level_divs = root.findall('div')
for first_level_div in first_level_divs:
    results = {}
    for k,v in items.items():
        # look up the direct child div that carries the wanted class
        div = first_level_div.find(f'./div[@class="{k}"]')
        # Total and Buy keep their text one level down (in <span> / <a>)
        one_level_down = len(list(div)) > 0
        results[k] = list(div)[0].text if one_level_down else div.text
    if results == items:
        print('found')
    else:
        print('not found')
Output:
found
not found

Given this HTML snippet
<ul>
<li class="List">
<div class="List-Content row">
<div class="Price">"$20.00"</div>
<div class="Quantity">10</div>
<div class="Change">0%</div>
<div class="Total"><span>$400.00</span></div>
<div class="Buy"><a class="Button">Buy</a></div>
</div>
</li>
</ul>
I would use this XPath:
buy_buttons = driver.find_elements(By.XPATH, """//div[
    contains(@class, 'List-Content')
    and div[@class = 'Price'] = '$20.00'
    and div[@class = 'Total'] = '$400.00'
]//a[. = 'Buy']""")
for buy_button in buy_buttons:
    print(buy_button)
The for loop replaces your if len(buy_buttons) > 0 check. It won't run when there are no results, so the if is superfluous.

detecting the presence of text with BeautifulSoup

I'm trying to check for the presence of text on a certain page (if you have previously sent a message, the text appears in this area; otherwise it is blank).
html= urlopen(single_link)
parsed= BeautifulSoup.BeautifulSoup(html,'html.parser')
lastmessages = parsed.find('div',attrs={'id':'message-box'})
if lastmessages :
    print('Already Reached')
else:
    print('you can write your message')
<div class="lastMessage">
<div class="mine messages">
<div class="message last" id="msg-snd-1601248710299" dir="auto">
Hello, and how are you ?
</div>
<div style="clear : both ;"></div>
<div class="msg-status" id="msg-status-1601248710299" dir="rtl">
<div class="send-state">
Last message :
<span class="r2">before 7:35 </span>
</div>
<div class="read-state">
<span style="color : gray ;"> – </span>
Reading :
<span class="r2">Not yet</span>
</div>
</div>
<div style="clear : both ;"></div>
</div>
</div>
My problem is that I don't know how to check whether the text "Hello, and how are you ?" exists or not.

Simple solution
import bs4
parsed= bs4.BeautifulSoup(html,'html.parser')
lastmessages = parsed.find('div', class_='message last')
if lastmessages :
    print(lastmessages.text.strip())
else:
    print('No message')
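If the goal is a yes/no check for one specific message rather than printing whatever is there, something like the following may help. It is only a sketch under the assumption that the message always sits in a div with both the message and last classes; message_present is a made-up helper name:

```python
from bs4 import BeautifulSoup

html = """
<div class="lastMessage">
  <div class="mine messages">
    <div class="message last" id="msg-snd-1601248710299" dir="auto">
      Hello, and how are you ?
    </div>
  </div>
</div>
"""


def message_present(html, wanted):
    """Hypothetical helper: True if a last-message div holds exactly `wanted`."""
    soup = BeautifulSoup(html, "html.parser")
    # select() matches individual classes, so their order in the attribute does not matter
    for div in soup.select("div.message.last"):
        if div.get_text(strip=True) == wanted:
            return True
    return False


already_sent = message_present(html, "Hello, and how are you ?")
not_sent = message_present(html, "Goodbye")
print(already_sent, not_sent)
```

Comparing the stripped text avoids false negatives caused by the newlines and indentation around the message in the markup.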

Look for element in string and return element + following two characters

The following HTML appears as a string in my code. That is fine, but what I need is a way to get:
"class="company-image company-34""
For each company-## there is also a price, found in this tag further down in the HTML:
class="small-12 medium-4 cell text-right" data-after="kr./år">1.813
I tried the following code:
for x in html:
    if "company-image company" in x:
        print("Oh yes")
    else:
        print("Nahh")
but it doesn't really work. My idea is to find every occurrence of "company-image company" and capture that string together with the number that follows, which is always two digits (##). Whenever one is found, I then look for "data-after="kr./år"" and capture the number that follows. Eventually this would run in a for loop, as there are multiple companies and prices.
<app-offer-match _ngcontent-vdv-c20="" _nghost-vdv-c22="" class="ng-star-inserted">
<div _ngcontent-vdv-c22="" class="box">
<!---->
<div _ngcontent-vdv-c22="" class="line1">
<div _ngcontent-vdv-c22="" class="company-image company-34"><img _ngcontent-vdv-c22="" src="/assets/images/companies/34.svg"></div>
<div _ngcontent-vdv-c22="" class="button compare">Sammenlign </div>
</div>
<div _ngcontent-vdv-c22="" class="line2">
<div _ngcontent-vdv-c22="" class="container-button">
<div _ngcontent-vdv-c22="" class="button mini-accordion"></div>
</div>
<div _ngcontent-vdv-c22="" class="container-insurance-list">
<!---->
<div _ngcontent-vdv-c22="" class="indbo ng-star-inserted">
<div _ngcontent-vdv-c22="" class="grid-x container-product-overview">
<div _ngcontent-vdv-c22="" class="small-5 cell detail"><span _ngcontent-vdv-c22="">Indbo</span>
<!----><span _ngcontent-vdv-c22="" class="ng-star-inserted">Kongshaven 3</span>
</div>
<div _ngcontent-vdv-c22="" class="small-6 cell">
<div _ngcontent-vdv-c22="" class="grid-x price">
<div _ngcontent-vdv-c22="" class="small-12 medium-8 cell text-right" data-after="kr.">Selvrisiko 2.199</div>
<div _ngcontent-vdv-c22="" class="small-12 medium-4 cell text-right" data-after="kr./år">1.813 </div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</app-offer-match>
EDIT: Added desired output.
The desired output would be a pandas DataFrame:
Company Price
company-image company-34 1.813
EDIT 2:
It looks like XML because I formatted it that way for readability. When I output it, it is of type str. Thank you.

Try this:
company = """[your string above]"""
import lxml.html as lh
import pandas as pd
doc = lh.fromstring(company)
columns = ["Company", "Price"]
rows = []
targets = doc.xpath('//div[contains(@class,"company-image company")]')
for target in targets:
    row = []
    row.append(target.attrib['class'])
    price = target.xpath('../following-sibling::div//div[@data-after="kr./år"]')[0]
    row.append(price.text)
    rows.append(row)
pd.DataFrame(rows,columns=columns)
Output:
Company Price
0 company-image company-34 1.813
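A side note on why the loop in the question never matched: iterating over a str yields one character at a time, so "company-image company" can never be contained in x. If you would rather stay at the string level, here is a rough regex sketch. It assumes every company block carries exactly one kr./år price and that both appear in the same order, which the single sample cannot confirm; an HTML parser as above is usually the more robust choice:

```python
import re

# Trimmed copy of the markup from the question
html = '''
<div class="line1">
<div class="company-image company-34"><img src="/assets/images/companies/34.svg"></div>
</div>
<div class="grid-x price">
<div class="small-12 medium-8 cell text-right" data-after="kr.">Selvrisiko 2.199</div>
<div class="small-12 medium-4 cell text-right" data-after="kr./år">1.813 </div>
</div>
'''

# class value followed by exactly two digits, as described in the question
company_re = re.compile(r'class="(company-image company-\d{2})"')
# number that follows the yearly-price marker
price_re = re.compile(r'data-after="kr\./år">\s*([\d.,]+)')

companies = company_re.findall(html)
prices = price_re.findall(html)

# pair them up by position (assumption: one yearly price per company)
rows = list(zip(companies, prices))
print(rows)
```

The resulting list of tuples can be passed to pd.DataFrame(rows, columns=["Company", "Price"]) in the same way as in the answer above.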

Web scraping with Selenium and Xpath

I'm new to XPath. I'm trying to scrape a stock website to get the name and value of each element.
In my Python Selenium script I've extracted the main part of the web page locally into html_content, as follows.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import NoSuchElementException
dirinstall="C:\\Program Files (x86)\\www\\mm\\"
chrome_driver = dirinstall+"\\Webdriver\\chromedriver.exe"
options = Options()
driver = webdriver.Chrome(chrome_driver, options=options)
html_content = """
<html class="ng-scope">
<head data-meta-tags="">
<title> Stock NYSE </title>
<ui-layout class="ng-isolate-scope">
<div data-ng-include="" src="layoutCtrl.template" class="ng-scope">
<app-root class="ng-scope" _nghost-rqp-c0="" ng-version="8.2.14"></app-root>
<div ng-class="{'demo-mode': $root.session.user.portfolio.account.type === 'Demo' }" class="ng-scope">
<div ng-view="" ng-class="layoutCtrl.isBannerShown ? 'banner-shown' : ''" class="main-app-view ng-scope" role="main">
<et-discovery-markets-results class="ng-scope" _nghost-rqp-c42="" ng-version="8.2.14">
<div _ngcontent-rqp-c42="" class="discover main-content no-footer" ui-fun-scroll="{'class': 'minimize', 'classEl': '.user-head-wrapper, .table-discover', 'scrollContainer': '.table-discover', 'setClassAtScroll': 200 }">
<div _ngcontent-rqp-c42="" automation-id="discover-market-results-wrapp" class="table-discover markets-table">
<et-discovery-markets-results-list _ngcontent-rqp-c42="" automation-id="discover-market-results-sub-view-list" _nghost-rqp-c44="" class="ng-star-inserted">
<div _ngcontent-rqp-c44="" class="market-list list-view" data-etoro-locale-ns="discoverMarketResultsList">
<et-instrument-mobile-row _ngcontent-rqp-c44="" automation-id="discover-market-results-row" _nghost-rqp-c18="" class="ng-star-inserted">
<et-instrument-trading-mobile-row _ngcontent-rqp-c18="" automation-id="watchlist-grid-instruments-list" _nghost-rqp-c47="" class="ng-star-inserted">
<div _ngcontent-rqp-c47="" class="row-wrap">
<div _ngcontent-rqp-c47="" automation-id="watchlist-item-list-wrapp-instrument" class="instrument-cell name-cell">
<div _ngcontent-rqp-c47="" class="avatar-img-wrap"> </div>
<div _ngcontent-rqp-c47="" automation-id="watchlist-item-grid-wrapp-instrument-info" class="avatar-info">
<div _ngcontent-rqp-c47="" automation-id="watchlist-item-grid-instrument-name" class="symbol">A</div>
<div _ngcontent-rqp-c47="" automation-id="watchlist-item-grid-instrument-full-name" class="name positive"> 0.68 (0.90%) </div>
</div>
</div>
<et-buy-sell-buttons _ngcontent-rqp-c47="" automation-id="watchlist-item-grid-instrument-buy-sell-container" class="instrument-cell buy-sell-buttons" _nghost-rqp-c24="">
<et-buy-sell-button _ngcontent-rqp-c24="" _nghost-rqp-c27="">
<div _ngcontent-rqp-c27="" class="prices no-label positive-change" automation-id="buy-sell-button-container-sell">
<div _ngcontent-rqp-c27="" class="trade-button-title">S</div>
<div _ngcontent-rqp-c27="" automation-id="buy-sell-button-rate-value" class="price">75.<span class="after-decimal">85</span></div>
</div>
</et-buy-sell-button>
<div _ngcontent-rqp-c24="" class="space-gap"></div>
<et-buy-sell-button _ngcontent-rqp-c24="" _nghost-rqp-c27="">
<div _ngcontent-rqp-c27="" class="prices no-label negative-change" automation-id="buy-sell-button-container-buy">
<div _ngcontent-rqp-c27="" class="trade-button-title">B</div>
<div _ngcontent-rqp-c27="" automation-id="buy-sell-button-rate-value" class="price">76.<span class="after-decimal">03</span></div>
</div>
</et-buy-sell-button>
</et-buy-sell-buttons>
</div>
<et-trade-item-card-action _ngcontent-rqp-c18="" _nghost-rqp-c15="">
</et-trade-item-card-action>
</et-instrument-trading-mobile-row>
</et-instrument-mobile-row>
<et-instrument-mobile-row _ngcontent-rqp-c44="" automation-id="discover-market-results-row" _nghost-rqp-c18="" class="ng-star-inserted">
<et-instrument-trading-mobile-row _ngcontent-rqp-c18="" automation-id="watchlist-grid-instruments-list" _nghost-rqp-c47="" class="ng-star-inserted">
<div _ngcontent-rqp-c47="" class="row-wrap">
<div _ngcontent-rqp-c47="" automation-id="watchlist-item-list-wrapp-instrument" class="instrument-cell name-cell">
<div _ngcontent-rqp-c47="" class="avatar-img-wrap"> </div>
<div _ngcontent-rqp-c47="" automation-id="watchlist-item-grid-wrapp-instrument-info" class="avatar-info">
<div _ngcontent-rqp-c47="" automation-id="watchlist-item-grid-instrument-name" class="symbol">AA</div>
<div _ngcontent-rqp-c47="" automation-id="watchlist-item-grid-instrument-full-name" class="name negative"> -0.11 (-1.46%) </div>
</div>
</div>
<et-buy-sell-buttons _ngcontent-rqp-c47="" automation-id="watchlist-item-grid-instrument-buy-sell-container" class="instrument-cell buy-sell-buttons" _nghost-rqp-c24="">
<et-buy-sell-button _ngcontent-rqp-c24="" _nghost-rqp-c27="">
<div _ngcontent-rqp-c27="" class="prices no-label negative-change" automation-id="buy-sell-button-container-sell">
<div _ngcontent-rqp-c27="" class="trade-button-title">S</div>
<div _ngcontent-rqp-c27="" automation-id="buy-sell-button-rate-value" class="price">7.<span class="after-decimal">44</span></div>
</div>
</et-buy-sell-button>
<div _ngcontent-rqp-c24="" class="space-gap"></div>
<et-buy-sell-button _ngcontent-rqp-c24="" _nghost-rqp-c27="">
<div _ngcontent-rqp-c27="" class="prices no-label negative-change" automation-id="buy-sell-button-container-buy">
<div _ngcontent-rqp-c27="" class="trade-button-title">B</div>
<div _ngcontent-rqp-c27="" automation-id="buy-sell-button-rate-value" class="price">7.<span class="after-decimal">47</span></div>
</div>
</et-buy-sell-button>
</et-buy-sell-buttons>
</div>
<et-trade-item-card-action _ngcontent-rqp-c18="" _nghost-rqp-c15="">
</et-trade-item-card-action>
</et-instrument-trading-mobile-row>
</et-instrument-mobile-row>
</div>
</et-discovery-markets-results-list>
</div>
</div>
</et-discovery-markets-results>
</div>
</div>
</div>
</ui-layout>
</body>
</html>
"""
driver.get("data:text/html;charset=utf-8,{html_content}".format(html_content=html_content))
#results = driver.find_elements_by_xpath("//*[@class='ng-star-inserted']")
results = driver.find_elements_by_xpath("//*[et-instrument-mobile-row and @class='ng-star-inserted']")
print('Number of results', len(results))
I don't know why, if I search for 'et-instrument-mobile-row', I get only 1 element instead of 2, and if I search for both 'et-instrument-mobile-row' and 'ng-star-inserted', I get 0 elements.
Looking at the example, my goal is to get the symbol and the current buy/sell values (price and after-decimal).
Something like:
[A, 75.85, 76.03]
[AA, 7.44, 7.47]
Could anyone help me? Thanks!

It looks like you may have some malformed HTML and Selenium is unsure how to parse it. I noticed this line:
<div _ngcontent-rqp-c47="" class="avatar-img-wrap"><img _ngcontent-rqp-c47="" automation-id="watchlist-item-grid-instrument-avatar" class="avatar-img" src="https://etoro-cdn.etorostatic.com/market-avatars/a/150x150.png" alt="Agilent Technologies Inc">
This <img> tag is unclosed. You can see that the syntax highlighting gets confused here too.
Otherwise, the XPath you are searching by looks generally well formed.
Edit: I looked at it more closely. The element (tag) name should go where the * is. Inside the square brackets, a bare name is a test for a child element, so your expression looks for any element that has an et-instrument-mobile-row child and the class ng-star-inserted.
Here is your XPath:
"//et-instrument-mobile-row[@class='ng-star-inserted']"
Edit 2: Asker had an additional question about how to search within what they found with the XPath above.
To find more elements within these elements, each Selenium WebElement provides its own find_element methods, according to the documentation. You can use those to search further within the elements we just found. Be sure to start these XPaths with .// so that the search only traverses that specific element's content; the other find_element methods don't have this caveat.
Once you have identified the elements containing the symbols and prices, you can simply reference the text attribute on those elements. Let's look at a simpler example:
<div class="a">
<div class="b" id="1">B</div>
<div class="c" id="2">2</div>
<div class="d" id="3">22</div>
</div>
Suppose we have already found the root div here and stored it in a variable named element. Then:
symbol = element.find_element_by_xpath(".//*[@class='b']").text
integral = element.find_element_by_xpath(".//*[@class='c']").text
fractional = element.find_element_by_xpath(".//*[@class='d']").text
Generally, if you can search by something other than XPath, though, it's easier for everyone involved. Here is a more typical way you could accomplish this with class names:
symbol = element.find_element_by_class_name("b").text
integral = element.find_element_by_class_name("c").text
fractional = element.find_element_by_class_name("d").text
Edit 3: Note from author
After the precious help of @firstbass I dug deeper to get the symbol and the separate sell/buy prices, as follows:
for element in results:
    symbol = element.find_element_by_xpath(".//*[@class='symbol']").text
    print(str(symbol))
    sell = element.find_element_by_xpath(".//et-buy-sell-buttons//et-buy-sell-button//div[@automation-id='buy-sell-button-container-sell']")
    sell_integral = sell.find_element_by_xpath(".//*[@class='price']").text
    sell_fractional = sell.find_element_by_xpath(".//*[@class='after-decimal']").text
    print(str(sell_integral)+':'+str(sell_fractional))
    buy = element.find_element_by_xpath(".//et-buy-sell-buttons//et-buy-sell-button//div[@automation-id='buy-sell-button-container-buy']")
    buy_integral = buy.find_element_by_xpath(".//*[@class='price']").text
    buy_fractional = buy.find_element_by_xpath(".//*[@class='after-decimal']").text
    print(str(buy_integral)+':'+str(buy_fractional))
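One thing to watch in the loop above: the after-decimal span is nested inside the price div, and as far as I know Selenium's text on the price div already includes the text of nested elements, so joining the two values would repeat the decimals. The sketch below shows the same idea offline. It uses the standard library's ElementTree instead of Selenium (a stand-in so it runs without a browser), works on a simplified, well-formed copy of the markup, and full_price is a hypothetical helper name:

```python
import xml.etree.ElementTree as ET

# Simplified, well-formed stand-in for the page in the question:
# only the tags and attributes used below are kept.
page = """
<root>
  <et-instrument-mobile-row class="ng-star-inserted">
    <div class="symbol">A</div>
    <div automation-id="buy-sell-button-container-sell">
      <div class="price">75.<span class="after-decimal">85</span></div>
    </div>
    <div automation-id="buy-sell-button-container-buy">
      <div class="price">76.<span class="after-decimal">03</span></div>
    </div>
  </et-instrument-mobile-row>
  <et-instrument-mobile-row class="ng-star-inserted">
    <div class="symbol">AA</div>
    <div automation-id="buy-sell-button-container-sell">
      <div class="price">7.<span class="after-decimal">44</span></div>
    </div>
    <div automation-id="buy-sell-button-container-buy">
      <div class="price">7.<span class="after-decimal">47</span></div>
    </div>
  </et-instrument-mobile-row>
</root>
"""


def full_price(row, side):
    """Hypothetical helper: join the price text with its nested decimals."""
    container = row.find(f".//div[@automation-id='buy-sell-button-container-{side}']")
    price = container.find("./div[@class='price']")
    # itertext() walks the div's own text and the nested span's text in order
    return "".join(price.itertext()).strip()


root = ET.fromstring(page)
quotes = []
for row in root.findall("./et-instrument-mobile-row[@class='ng-star-inserted']"):
    symbol = row.find(".//div[@class='symbol']").text.strip()
    quotes.append([symbol, full_price(row, "sell"), full_price(row, "buy")])

print(quotes)
```

This produces one [symbol, sell, buy] list per row, which is the shape asked for in the question.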

How to get index number of specific tag and class by searching some text?

I have the following HTML:
<ul class="vote_list clearfix" id="vote_div">
<li class="vote_one">
<div class="vote_show">
<div class="vote_T1">Chelsea</div>
<div class="vote_state">
<div class="vote_ST1">Votes：30000</div>
<div class="vote_ST2">Ranking：1</div>
</div>
</div>
<div class="vote_date">
<div class="vote_T1">Chelsea</div>
</div>
</li>
<li class="vote_one">
<div class="vote_show">
<div class="vote_T1">Arsenal</div>
<div class="vote_state">
<div class="vote_ST1">Votes：20000</div>
<div class="vote_ST2">Ranking：2</div>
</div>
</div>
<div class="vote_date">
<div class="vote_T1">Arsenal</div>
</div>
</li>
<li class="vote_one">
<div class="vote_show">
<div class="vote_T1">Liverpool</div>
<div class="vote_state">
<div class="vote_ST1">Votes：10000</div>
<div class="vote_ST2">Ranking：3</div>
</div>
</div>
<div class="vote_date">
<div class="vote_T1">Liverpool</div>
</div>
</li>
</ul>
I want to extract the total votes for Chelsea, so it should show Votes: 30000.
My idea is to find which <li class="vote_one"> contains the text Chelsea; it should return 0, since Chelsea is in the first vote_one element.
But I don't know how to turn this idea into code.
Thanks in advance.

Finally solved @Idlehands
soup = BeautifulSoup(full_content, "lxml")
i=0
for vote_one_list in soup.find_all("li", class_="vote_one"):
    if vote_one_list.find("div", class_="vote_show").find("div", class_="vote_T1").text == "Chelsea":
        total_vote = soup.find_all("li", class_="vote_one")[i].find("div", class_="vote_show").find("div", class_="vote_state").find("div", class_="vote_ST1").text
        rank = soup.find_all("li", class_="vote_one")[i].find("div", class_="vote_show").find("div", class_="vote_state").find("div", class_="vote_ST2").text
        print("Chelsea | "+ rank + " | "+total_vote)
    # advance the index for every <li>, whether or not it matched
    i = i+1

Printing the votes and rank
The simplest way to get the votes for any given input would be:
input_str = 'Chelsea'
for vote in soup.find_all('div', class_='vote_show'):
    if vote.find('div', class_='vote_T1').get_text().strip() == input_str:
        print(vote.find('div', class_='vote_ST1').get_text().strip()) # Prints votes
        print(vote.find('div', class_='vote_ST2').get_text().strip()) # Prints rank
The solution looks at all <div class='vote_show'> to check if text in the <div class='vote_T1'> is same as input string, Chelsea, for example.
I added the strip() so that you can find a match even if there are spaces around the string. If a match is found, the text of the contained <div class='vote_ST1'> is printed, again stripping any surrounding whitespace.
Printing the index
You can modify the for loop to use enumerate() as follows:
for idx, vote in enumerate(soup.find_all('div', class_='vote_show')):
    if vote.find('div', class_='vote_T1').get_text().strip() == input_str:
        print(idx) # prints index
        print(vote.find('div', class_='vote_ST1').get_text().strip()) # prints votes
        print(vote.find('div', class_='vote_ST2').get_text().strip()) # prints rank
Enumerate allows us to loop over something and have an automatic counter.
If you want to stop looking any further once you've found a match, you can add a break statement after the print() statement.
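To make the early exit concrete, here is one possible way to wrap the loop in a function, where return plays the role of break. This is a sketch on a trimmed copy of the HTML from the question; find_team is a made-up name, and html.parser is used only so the example does not depend on lxml:

```python
from bs4 import BeautifulSoup

html = """
<ul class="vote_list clearfix" id="vote_div">
  <li class="vote_one">
    <div class="vote_show">
      <div class="vote_T1">Chelsea</div>
      <div class="vote_state">
        <div class="vote_ST1">Votes：30000</div>
        <div class="vote_ST2">Ranking：1</div>
      </div>
    </div>
  </li>
  <li class="vote_one">
    <div class="vote_show">
      <div class="vote_T1">Arsenal</div>
      <div class="vote_state">
        <div class="vote_ST1">Votes：20000</div>
        <div class="vote_ST2">Ranking：2</div>
      </div>
    </div>
  </li>
</ul>
"""


def find_team(soup, name):
    """Hypothetical helper: (index, votes, rank) of the first match, or None."""
    for idx, vote in enumerate(soup.find_all("div", class_="vote_show")):
        if vote.find("div", class_="vote_T1").get_text().strip() == name:
            votes = vote.find("div", class_="vote_ST1").get_text().strip()
            rank = vote.find("div", class_="vote_ST2").get_text().strip()
            # returning here stops the search at the first match
            return idx, votes, rank
    return None


soup = BeautifulSoup(html, "html.parser")
result = find_team(soup, "Arsenal")
missing = find_team(soup, "Everton")
print(result, missing)
```

Returning None for an unknown team lets the caller tell "not found" apart from index 0.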
