Web scraping with Selenium and Xpath

I'm new to XPath. I'm trying to scrape a stock website to get the name and value of each element.
In my Python Selenium script I've extracted the main part of the web page into html_content, as follows.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import NoSuchElementException
dirinstall = "C:\\Program Files (x86)\\www\\mm\\"
chrome_driver = dirinstall + "Webdriver\\chromedriver.exe"
options = Options()
driver = webdriver.Chrome(chrome_driver, options=options)
html_content = """
<html class="ng-scope">
<head data-meta-tags="">
<title> Stock NYSE </title>
<ui-layout class="ng-isolate-scope">
<div data-ng-include="" src="layoutCtrl.template" class="ng-scope">
<app-root class="ng-scope" _nghost-rqp-c0="" ng-version="8.2.14"></app-root>
<div ng-class="{'demo-mode': $root.session.user.portfolio.account.type === 'Demo' }" class="ng-scope">
<div ng-view="" ng-class="layoutCtrl.isBannerShown ? 'banner-shown' : ''" class="main-app-view ng-scope" role="main">
<et-discovery-markets-results class="ng-scope" _nghost-rqp-c42="" ng-version="8.2.14">
<div _ngcontent-rqp-c42="" class="discover main-content no-footer" ui-fun-scroll="{'class': 'minimize', 'classEl': '.user-head-wrapper, .table-discover', 'scrollContainer': '.table-discover', 'setClassAtScroll': 200 }">
<div _ngcontent-rqp-c42="" automation-id="discover-market-results-wrapp" class="table-discover markets-table">
<et-discovery-markets-results-list _ngcontent-rqp-c42="" automation-id="discover-market-results-sub-view-list" _nghost-rqp-c44="" class="ng-star-inserted">
<div _ngcontent-rqp-c44="" class="market-list list-view" data-etoro-locale-ns="discoverMarketResultsList">
<et-instrument-mobile-row _ngcontent-rqp-c44="" automation-id="discover-market-results-row" _nghost-rqp-c18="" class="ng-star-inserted">
<et-instrument-trading-mobile-row _ngcontent-rqp-c18="" automation-id="watchlist-grid-instruments-list" _nghost-rqp-c47="" class="ng-star-inserted">
<div _ngcontent-rqp-c47="" class="row-wrap">
<div _ngcontent-rqp-c47="" automation-id="watchlist-item-list-wrapp-instrument" class="instrument-cell name-cell">
<div _ngcontent-rqp-c47="" class="avatar-img-wrap"> </div>
<div _ngcontent-rqp-c47="" automation-id="watchlist-item-grid-wrapp-instrument-info" class="avatar-info">
<div _ngcontent-rqp-c47="" automation-id="watchlist-item-grid-instrument-name" class="symbol">A</div>
<div _ngcontent-rqp-c47="" automation-id="watchlist-item-grid-instrument-full-name" class="name positive"> 0.68 (0.90%) </div>
</div>
</div>
<et-buy-sell-buttons _ngcontent-rqp-c47="" automation-id="watchlist-item-grid-instrument-buy-sell-container" class="instrument-cell buy-sell-buttons" _nghost-rqp-c24="">
<et-buy-sell-button _ngcontent-rqp-c24="" _nghost-rqp-c27="">
<div _ngcontent-rqp-c27="" class="prices no-label positive-change" automation-id="buy-sell-button-container-sell">
<div _ngcontent-rqp-c27="" class="trade-button-title">S</div>
<div _ngcontent-rqp-c27="" automation-id="buy-sell-button-rate-value" class="price">75.<span class="after-decimal">85</span></div>
</div>
</et-buy-sell-button>
<div _ngcontent-rqp-c24="" class="space-gap"></div>
<et-buy-sell-button _ngcontent-rqp-c24="" _nghost-rqp-c27="">
<div _ngcontent-rqp-c27="" class="prices no-label negative-change" automation-id="buy-sell-button-container-buy">
<div _ngcontent-rqp-c27="" class="trade-button-title">B</div>
<div _ngcontent-rqp-c27="" automation-id="buy-sell-button-rate-value" class="price">76.<span class="after-decimal">03</span></div>
</div>
</et-buy-sell-button>
</et-buy-sell-buttons>
</div>
<et-trade-item-card-action _ngcontent-rqp-c18="" _nghost-rqp-c15="">
</et-trade-item-card-action>
</et-instrument-trading-mobile-row>
</et-instrument-mobile-row>
<et-instrument-mobile-row _ngcontent-rqp-c44="" automation-id="discover-market-results-row" _nghost-rqp-c18="" class="ng-star-inserted">
<et-instrument-trading-mobile-row _ngcontent-rqp-c18="" automation-id="watchlist-grid-instruments-list" _nghost-rqp-c47="" class="ng-star-inserted">
<div _ngcontent-rqp-c47="" class="row-wrap">
<div _ngcontent-rqp-c47="" automation-id="watchlist-item-list-wrapp-instrument" class="instrument-cell name-cell">
<div _ngcontent-rqp-c47="" class="avatar-img-wrap"> </div>
<div _ngcontent-rqp-c47="" automation-id="watchlist-item-grid-wrapp-instrument-info" class="avatar-info">
<div _ngcontent-rqp-c47="" automation-id="watchlist-item-grid-instrument-name" class="symbol">AA</div>
<div _ngcontent-rqp-c47="" automation-id="watchlist-item-grid-instrument-full-name" class="name negative"> -0.11 (-1.46%) </div>
</div>
</div>
<et-buy-sell-buttons _ngcontent-rqp-c47="" automation-id="watchlist-item-grid-instrument-buy-sell-container" class="instrument-cell buy-sell-buttons" _nghost-rqp-c24="">
<et-buy-sell-button _ngcontent-rqp-c24="" _nghost-rqp-c27="">
<div _ngcontent-rqp-c27="" class="prices no-label negative-change" automation-id="buy-sell-button-container-sell">
<div _ngcontent-rqp-c27="" class="trade-button-title">S</div>
<div _ngcontent-rqp-c27="" automation-id="buy-sell-button-rate-value" class="price">7.<span class="after-decimal">44</span></div>
</div>
</et-buy-sell-button>
<div _ngcontent-rqp-c24="" class="space-gap"></div>
<et-buy-sell-button _ngcontent-rqp-c24="" _nghost-rqp-c27="">
<div _ngcontent-rqp-c27="" class="prices no-label negative-change" automation-id="buy-sell-button-container-buy">
<div _ngcontent-rqp-c27="" class="trade-button-title">B</div>
<div _ngcontent-rqp-c27="" automation-id="buy-sell-button-rate-value" class="price">7.<span class="after-decimal">47</span></div>
</div>
</et-buy-sell-button>
</et-buy-sell-buttons>
</div>
<et-trade-item-card-action _ngcontent-rqp-c18="" _nghost-rqp-c15="">
</et-trade-item-card-action>
</et-instrument-trading-mobile-row>
</et-instrument-mobile-row>
</div>
</et-discovery-markets-results-list>
</div>
</div>
</et-discovery-markets-results>
</div>
</div>
</div>
</ui-layout>
</body>
</html>
"""
driver.get("data:text/html;charset=utf-8,{html_content}".format(html_content=html_content))
#results = driver.find_elements_by_xpath("//*[@class='ng-star-inserted']")
results = driver.find_elements_by_xpath("//*[et-instrument-mobile-row and @class='ng-star-inserted']")
print('Number of results', len(results))
I don't know why searching for 'et-instrument-mobile-row' returns only 1 element instead of 2, and why searching for both 'et-instrument-mobile-row' and 'ng-star-inserted' returns 0 elements.
Looking at the example, my goal is to get the symbol and the current buy/sell values (price and after-decimal).
Something like:
[A, 75.85, 76.03]
[AA, 7.44, 7.47]
Could anyone help me? Thanks!

It looks like you may have some malformed HTML and Selenium is unsure how to parse it. I noticed this line:
<div _ngcontent-rqp-c47="" class="avatar-img-wrap"><img _ngcontent-rqp-c47="" automation-id="watchlist-item-grid-instrument-avatar" class="avatar-img" src="https://etoro-cdn.etorostatic.com/market-avatars/a/150x150.png" alt="Agilent Technologies Inc">
This <img> tag is unclosed. You can see that the syntax highlighting gets confused here too.
Otherwise, the XPath you are searching by looks generally well formed.
Edit: Looked at it closer. The element name should go where the * is, with the attribute test kept in the predicate.
Here is your XPath:
"//et-instrument-mobile-row[@class='ng-star-inserted']"
Edit 2: The asker had an additional question about how to search within the elements found with the XPath above.
To find more elements within these elements, note that, per the documentation, each Selenium WebElement provides its own find_element methods. You can use those to search further within the elements we just found (be sure to start your XPaths with .// here, as you only want to traverse that specific element's content; the other find_element methods don't have this caveat).
Once you have identified the elements containing the symbols and prices, you can simply reference the text attribute on those elements. Let's look at a simpler example:
<div class="a">
<div class="b" id="1">B</div>
<div class="c" id="2">2</div>
<div class="d" id="3">22</div>
</div>
Suppose we have already found the root div here and stored it in a variable named element. Then:
symbol = element.find_element_by_xpath(".//*[@class='b']").text
integral = element.find_element_by_xpath(".//*[@class='c']").text
fractional = element.find_element_by_xpath(".//*[@class='d']").text
Generally, if you can search by something other than XPath, though, it's easier for everyone involved. Here is a more typical way you could accomplish this with class names:
symbol = element.find_element_by_class_name("b").text
integral = element.find_element_by_class_name("c").text
fractional = element.find_element_by_class_name("d").text
Edit 3: Note from author
After the precious help of @firstbass I went deeper to get the symbol and the separate sell/buy prices, as follows:
for element in results:
    symbol = element.find_element_by_xpath(".//*[@class='symbol']").text
    print(str(symbol))
    sell = element.find_element_by_xpath(".//et-buy-sell-buttons//et-buy-sell-button//div[@automation-id='buy-sell-button-container-sell']")
    # .text includes the text of the nested span, so this already holds the full price, e.g. '75.85'
    sell_integral = sell.find_element_by_xpath(".//*[@class='price']").text
    sell_fractional = sell.find_element_by_xpath(".//*[@class='after-decimal']").text
    print(str(sell_integral)+':'+str(sell_fractional))
    buy = element.find_element_by_xpath(".//et-buy-sell-buttons//et-buy-sell-button//div[@automation-id='buy-sell-button-container-buy']")
    buy_integral = buy.find_element_by_xpath(".//*[@class='price']").text
    buy_fractional = buy.find_element_by_xpath(".//*[@class='after-decimal']").text
    print(str(buy_integral)+':'+str(buy_fractional))
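The scoped-search idea from this thread can be tried without a browser. The following is only a sketch under stated assumptions: it swaps Selenium for the standard library's xml.etree.ElementTree, and SAMPLE is a hypothetical, trimmed and well-formed version of the question's markup (the real page is messier and rendered by Angular). The helper names full_text and extract_rows are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed and well-formed version of the question's markup.
SAMPLE = """
<div class="market-list list-view">
  <et-instrument-mobile-row class="ng-star-inserted">
    <div class="symbol">A</div>
    <div automation-id="buy-sell-button-container-sell">
      <div class="price">75.<span class="after-decimal">85</span></div>
    </div>
    <div automation-id="buy-sell-button-container-buy">
      <div class="price">76.<span class="after-decimal">03</span></div>
    </div>
  </et-instrument-mobile-row>
  <et-instrument-mobile-row class="ng-star-inserted">
    <div class="symbol">AA</div>
    <div automation-id="buy-sell-button-container-sell">
      <div class="price">7.<span class="after-decimal">44</span></div>
    </div>
    <div automation-id="buy-sell-button-container-buy">
      <div class="price">7.<span class="after-decimal">47</span></div>
    </div>
  </et-instrument-mobile-row>
</div>
"""


def full_text(node):
    """Return the text of a node together with all of its descendants."""
    return "".join(node.itertext()).strip()


def extract_rows(markup):
    """Return one [symbol, sell, buy] list per instrument row."""
    root = ET.fromstring(markup)
    rows = []
    # The element name goes in the step; the attribute test goes in the predicate.
    for row in root.findall(".//et-instrument-mobile-row[@class='ng-star-inserted']"):
        # Paths starting with './/' are evaluated relative to this row only.
        symbol = full_text(row.find(".//*[@class='symbol']"))
        sell = full_text(row.find(
            ".//*[@automation-id='buy-sell-button-container-sell']/*[@class='price']"))
        buy = full_text(row.find(
            ".//*[@automation-id='buy-sell-button-container-buy']/*[@class='price']"))
        rows.append([symbol, sell, buy])
    return rows


for row in extract_rows(SAMPLE):
    print(row)
```

Joining the descendant text of the price element yields the whole number in one step, so the integral and fractional parts do not need to be fetched separately.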

Related

Confirm preceding and following sibling in XPATH

I've got the statement below to check that two conditions hold within an element:
if len(driver.find_elements(By.XPATH, "//span[text()='$400.00']/../following-sibling::div/a[text()='Buy']")) > 0:
    elem = driver.find_element(By.XPATH, "//span[text()='$400.00']/../following-sibling::div/a[text()='Buy']")
I've tried a few variations, including "preceding sibling::span[text()='x'", but I can't get the syntax right, and I'm not sure I'm going about it the right way.
The HTML is below. The current find_elements(By.XPATH, ...) correctly finds the "Total" and "Buy" classes; I would also like to add $20.00 in the "Price" class as a condition.
<ul>
<li class="List">
<div class="List-Content row">
<div class="Price">"$20.00"</div>
<div class="Quantity">10</div>
<div class="Change">0%</div>
<div class="Total">
<span>$400.00</span>
</div>
<div class="Buy">
<a class="Button">Buy</a>
</div>
</div>
</li>
</ul>

Using the built-in ElementTree:
import xml.etree.ElementTree as ET
html = '''<li class="List">
<div class="List-Content row">
<div class="Price">"$20.00"</div>
<div class="Quantity">10</div>
<div class="Change">0%</div>
<div class="Total"><span>$400.00</span></div>
<div class="Buy"><a class="Button">Buy</a></div>
</div>
<div class="List-Content row">
<div class="Price">"$27.00"</div>
<div class="Quantity">10</div>
<div class="Change">0%</div>
<div class="Total"><span>$400.00</span></div>
<div class="Buy"><a class="Button">Buy</a></div>
</div>
</li>'''
items = {'Total': '$400.00', 'Buy': 'Buy', 'Price': '"$20.00"'}
root = ET.fromstring(html)
first_level_divs = root.findall('div')
for first_level_div in first_level_divs:
    results = {}
    for k, v in items.items():
        div = first_level_div.find(f'./div[@class="{k}"]')
        # Use the first child's text when the value is nested one level down
        one_level_down = len(list(div)) > 0
        results[k] = list(div)[0].text if one_level_down else div.text
    if results == items:
        print('found')
    else:
        print('not found')
Output:
found
not found

Given this HTML snippet
<ul>
<li class="List">
<div class="List-Content row">
<div class="Price">"$20.00"</div>
<div class="Quantity">10</div>
<div class="Change">0%</div>
<div class="Total"><span>$400.00</span></div>
<div class="Buy"><a class="Button">Buy</a></div>
</div>
</li>
</ul>
I would use this XPath:
buy_buttons = driver.find_elements(By.XPATH, """//div[
    contains(@class, 'List-Content')
    and div[@class = 'Price'] = '$20.00'
    and div[@class = 'Total'] = '$400.00'
]//a[. = 'Buy']""")
for buy_button in buy_buttons:
    print(buy_button)
The for loop replaces your if len(buy_buttons) > 0 check. It won't run when there are no results, so the if is superfluous.
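One detail may be worth checking against the real page: in the posted snippet the Price text is shown with literal double quotes. If those quotes really are part of the page text (rather than an artifact of copying from the browser's inspector), the comparison value has to include them. The sketch below illustrates this with the standard library's xml.etree.ElementTree instead of Selenium; SAMPLE is a shortened copy of the posted markup with a hypothetical second row, and text_of and find_buy_links are made-up helper names.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the posted markup; the second row is hypothetical.
SAMPLE = '''<li class="List">
  <div class="List-Content row">
    <div class="Price">"$20.00"</div>
    <div class="Total"><span>$400.00</span></div>
    <div class="Buy"><a class="Button">Buy</a></div>
  </div>
  <div class="List-Content row">
    <div class="Price">"$27.00"</div>
    <div class="Total"><span>$400.00</span></div>
    <div class="Buy"><a class="Button">Buy</a></div>
  </div>
</li>'''


def text_of(node):
    """Text of a node plus its descendants, or '' when the node is missing."""
    return "".join(node.itertext()).strip() if node is not None else ""


def find_buy_links(markup, price, total):
    """Return the Buy links of every row whose Price and Total both match."""
    root = ET.fromstring(markup)
    links = []
    for row in root.iter("div"):
        if "List-Content" not in row.get("class", "").split():
            continue
        if (text_of(row.find("div[@class='Price']")) == price
                and text_of(row.find("div[@class='Total']")) == total):
            links.extend(a for a in row.findall("div[@class='Buy']/a")
                         if text_of(a) == "Buy")
    return links


# The quotes are part of the text here, so they must be part of the value.
for link in find_buy_links(SAMPLE, '"$20.00"', "$400.00"):
    print(link.get("class"))

# Without the quotes nothing matches, and the loop body simply never runs.
for link in find_buy_links(SAMPLE, "$20.00", "$400.00"):
    print("never reached")
```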

Beautifulsoup get element with the same class

I'm having trouble parsing HTML elements with a "class" attribute using BeautifulSoup.
The HTML code looks like this:
<div class="info-item">
<div class="item-name">Model</div>
<div class="item-content">XPANDER 1.5L GLX</div>
</div>
<div class="info-item">
<div class="item-name">Transmission</div>
<div class="item-content"> MT </div>
</div>
<div class="info-item">
<div class="item-name">Engine Capacity (cc)</div>
<div class="item-content">1499 cc</div>
</div>
<div class="info-item">
<div class="item-name">Fuel</div>
<div class="item-content">Bensin </div>
</div>
I need to get the data (XPANDER 1.5L GLX, MT, 1499, Gasoline).
I tried detail.find(class_='item-content'), but it only returns XPANDER 1.5L GLX.
Please help.

Use .find_all() or .select():
from bs4 import BeautifulSoup
html_doc = """
<div class="info-item">
<div class="item-name">Model</div>
<div class="item-content">XPANDER 1.5L GLX</div>
</div>
<div class="info-item">
<div class="item-name">Transmission</div>
<div class="item-content"> MT </div>
</div>
<div class="info-item">
<div class="item-name">Engine Capacity (cc)</div>
<div class="item-content">1499 cc</div>
</div>
<div class="info-item">
<div class="item-name">Fuel</div>
<div class="item-content">Bensin </div>
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")
items = [
    item.get_text(strip=True) for item in soup.find_all(class_="item-content")
]
print(*items)
Prints:
XPANDER 1.5L GLX MT 1499 cc Bensin
Or:
items = [item.get_text(strip=True) for item in soup.select(".item-content")]

You can try this
soup = BeautifulSoup(html, "html.parser")
items = [item.text for item in soup.find_all("div", {"class": "item-content"})]
find_all retrieves all occurrences, whereas find returns only the first one.
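If BeautifulSoup is not available, roughly the same "collect every match" behaviour can be sketched with the standard library's html.parser. This is a swapped-in technique, not what the answers above use, and the names ClassTextCollector and collect_text are hypothetical. It assumes the matching elements contain no unclosed void tags such as a bare br, which would throw off the depth counter.

```python
from html.parser import HTMLParser

HTML_DOC = """
<div class="info-item">
<div class="item-name">Model</div>
<div class="item-content">XPANDER 1.5L GLX</div>
</div>
<div class="info-item">
<div class="item-name">Transmission</div>
<div class="item-content"> MT </div>
</div>
<div class="info-item">
<div class="item-name">Engine Capacity (cc)</div>
<div class="item-content">1499 cc</div>
</div>
<div class="info-item">
<div class="item-name">Fuel</div>
<div class="item-content">Bensin </div>
</div>
"""


class ClassTextCollector(HTMLParser):
    """Collect the stripped text of every element carrying a given class."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.depth = 0      # greater than 0 while inside a matching element
        self.buffer = []    # text pieces of the element being collected
        self.items = []     # finished results, in document order

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            self.depth += 1
        elif self.class_name in classes:
            self.depth = 1
            self.buffer = []

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.items.append("".join(self.buffer).strip())

    def handle_data(self, data):
        if self.depth:
            self.buffer.append(data)


def collect_text(markup, class_name):
    """Return the text of all elements with the given class, not just the first."""
    parser = ClassTextCollector(class_name)
    parser.feed(markup)
    parser.close()
    return parser.items


names = collect_text(HTML_DOC, "item-name")
contents = collect_text(HTML_DOC, "item-content")
specs = dict(zip(names, contents))
print(specs)
```

Pairing the item-name and item-content lists gives a lookup by label, which may be more convenient than a flat list when only some fields are needed.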

Look for element in string and return element + following two characters

The following HTML appears as a string in my code. That is fine; what I need is a way to get:
"class="company-image company-34""
For each company-## there is also a price, found in this tag further down in the HTML:
class="small-12 medium-4 cell text-right" data-after="kr./år">1.813
I tried the following code:
for x in html:
    if "company-image company" in x:
        print("Oh yes")
    else:
        print("Nahh")
but it doesn't really work. My idea is to find every occurrence of "company-image company" and capture the whole string plus the number that follows, which is always two digits (##). Whenever one is found, I then look for "data-after="kr./år"" and capture the number that follows. Eventually this would end up in a for loop, as there are multiple companies and prices.
<app-offer-match _ngcontent-vdv-c20="" _nghost-vdv-c22="" class="ng-star-inserted">
<div _ngcontent-vdv-c22="" class="box">
<!---->
<div _ngcontent-vdv-c22="" class="line1">
<div _ngcontent-vdv-c22="" class="company-image company-34"><img _ngcontent-vdv-c22="" src="/assets/images/companies/34.svg"></div>
<div _ngcontent-vdv-c22="" class="button compare">Sammenlign </div>
</div>
<div _ngcontent-vdv-c22="" class="line2">
<div _ngcontent-vdv-c22="" class="container-button">
<div _ngcontent-vdv-c22="" class="button mini-accordion"></div>
</div>
<div _ngcontent-vdv-c22="" class="container-insurance-list">
<!---->
<div _ngcontent-vdv-c22="" class="indbo ng-star-inserted">
<div _ngcontent-vdv-c22="" class="grid-x container-product-overview">
<div _ngcontent-vdv-c22="" class="small-5 cell detail"><span _ngcontent-vdv-c22="">Indbo</span>
<!----><span _ngcontent-vdv-c22="" class="ng-star-inserted">Kongshaven 3</span>
</div>
<div _ngcontent-vdv-c22="" class="small-6 cell">
<div _ngcontent-vdv-c22="" class="grid-x price">
<div _ngcontent-vdv-c22="" class="small-12 medium-8 cell text-right" data-after="kr.">Selvrisiko 2.199</div>
<div _ngcontent-vdv-c22="" class="small-12 medium-4 cell text-right" data-after="kr./år">1.813 </div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</app-offer-match>
EDIT: Added desired output.
Desired output would be a pandas dataframe of:
Company Price
company-image company-34 1.813
EDIT 2:
It looks like XML only because I formatted it that way for readability. When I output it, it is of type str. Thank you.

Try this:
company = """[your string above]"""
import lxml.html as lh
import pandas as pd
doc = lh.fromstring(company)
columns = ["Company", "Price"]
rows = []
targets = doc.xpath('//div[contains(@class,"company-image company")]')
for target in targets:
    row = []
    row.append(target.attrib['class'])
    price = target.xpath('../following-sibling::div//div[@data-after="kr./år"]')[0]
    row.append(price.text.strip())
    rows.append(row)
rows
pd.DataFrame(rows,columns=columns)
Output:
Company Price
0 company-image company-34 1.813
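The question's loop iterates over the string one character at a time, so a multi-character substring test can never succeed. Since the asker's own idea was to search the string directly, here is a minimal sketch of that approach using regular expressions from the standard library. It is more brittle than parsing the HTML and assumes each company block contains exactly one "kr./år" price, in the same order as the companies. The second company in SAMPLE and the name company_prices are hypothetical.

```python
import re

# Shortened sample; the second company block is hypothetical.
SAMPLE = '''
<div class="company-image company-34"><img src="/assets/images/companies/34.svg"></div>
<div class="small-12 medium-8 cell text-right" data-after="kr.">Selvrisiko 2.199</div>
<div class="small-12 medium-4 cell text-right" data-after="kr./år">1.813 </div>
<div class="company-image company-12"><img src="/assets/images/companies/12.svg"></div>
<div class="small-12 medium-8 cell text-right" data-after="kr.">Selvrisiko 1.500</div>
<div class="small-12 medium-4 cell text-right" data-after="kr./år">2.045 </div>
'''

# The class value with its two-digit company number.
COMPANY_RE = re.compile(r'class="(company-image company-\d{2})"')
# The number right after the yearly-price marker; the plain "kr." tag is skipped.
PRICE_RE = re.compile(r'data-after="kr\./år">\s*([\d.]+)')


def company_prices(html):
    """Pair each company class with the yearly price that follows it."""
    companies = COMPANY_RE.findall(html)
    prices = PRICE_RE.findall(html)
    return list(zip(companies, prices))


for company, price in company_prices(SAMPLE):
    print(company, price)
```

The resulting list of tuples can be passed to pd.DataFrame with columns=["Company", "Price"] if a dataframe is needed.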

Extracting the right elements by text and span / Beautiful Soup / Python

I'm trying to scrape the following data:
Cuisine: 4.5
Service: 4.0
Quality: 4.5
But I'm having trouble scraping the right data. I tried the following two approaches:
for bewertungen in soup.find_all('div', {'class' : 'histogramCommon bubbleHistogram wrap'}):
    if bewertungen.find(text='Cuisine'):
        cuisine = bewertungen.find(text='Cuisine')
        cuisine = cuisine.next_element
        print("test " + str(cuisine))
    if bewertungen.find_all(text='Service'):
        for s_bewertung in bewertungen.find_all('span', {'class':'ui_bubble_rating'}):
            s_speicher = s_bewertung['alt']
In the first if I get no result. In the second if I get the right elements, but I get all 3 results and cannot tell which one belongs to which label (Cuisine, Service, Quality).
Can someone advise me on how to get the right data?
The HTML is below.
<div class="histogramCommon bubbleHistogram wrap">
<div class="colTitle">\nGesamtwertung\n</div>
<ul class="barChart">
<li>
<div class="ratingRow wrap">
<div class="label part ">
<span class="text">Cuisine</span>
</div>
<div class="wrap row part ">
<span alt="4.5 of five" class="ui_bubble_rating bubble_45"></span>
</div>
</div>
<div class="ratingRow wrap">
<div class="label part ">
<span class="text">Service</span>
</div>
<div class="wrap row part ">
<span alt="4.0 of five" class="ui_bubble_rating bubble_40"></span>
</div>
</div>
</li>
<li>
<div class="ratingRow wrap">
<div class="label part ">
<span class="text">Quality</span>
</div>
<div class="wrap row part "><span alt="4.5 of five" class="ui_bubble_rating bubble_45"></span></div>
</div>
</li>
</ul>
</div>

Try this. According to the snippet you have pasted above, the following code should work:
from bs4 import BeautifulSoup
soup = BeautifulSoup(content,"lxml")
for item in soup.select(".ratingRow"):
    category = item.select_one(".text").text
    rating = item.select_one(".row span")['alt'].split(" ")[0]
    print("{} : {}".format(category,rating))
Another way would be:
for item in soup.select(".ratingRow"):
    category = item.select_one(".text").text
    rating = item.select_one(".text").find_parent().find_next_sibling().select_one("span")['alt'].split(" ")[0]
    print("{} : {}".format(category,rating))
Output:
Cuisine : 4.5
Service : 4.0
Quality : 4.5
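The key idea in both variants is to look up the label and the rating inside the same ratingRow, so they cannot get mixed up. The sketch below shows that row-scoped pairing with the standard library's xml.etree.ElementTree instead of BeautifulSoup, returning numeric ratings in a dictionary. SAMPLE is a shortened, well-formed copy of the posted markup, and the function name ratings is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed copy of the posted markup.
SAMPLE = """
<ul class="barChart">
  <li>
    <div class="ratingRow wrap">
      <div class="label part "><span class="text">Cuisine</span></div>
      <div class="wrap row part "><span alt="4.5 of five" class="ui_bubble_rating bubble_45"></span></div>
    </div>
    <div class="ratingRow wrap">
      <div class="label part "><span class="text">Service</span></div>
      <div class="wrap row part "><span alt="4.0 of five" class="ui_bubble_rating bubble_40"></span></div>
    </div>
  </li>
  <li>
    <div class="ratingRow wrap">
      <div class="label part "><span class="text">Quality</span></div>
      <div class="wrap row part "><span alt="4.5 of five" class="ui_bubble_rating bubble_45"></span></div>
    </div>
  </li>
</ul>
"""


def ratings(markup):
    """Map each category label to its numeric rating, one row at a time."""
    root = ET.fromstring(markup)
    result = {}
    for row in root.iter("div"):
        if "ratingRow" not in row.get("class", "").split():
            continue
        # Both lookups are scoped to this row, so label and rating stay paired.
        label = row.find(".//span[@class='text']").text.strip()
        bubble = next(span for span in row.iter("span")
                      if "ui_bubble_rating" in span.get("class", "").split())
        result[label] = float(bubble.get("alt").split()[0])
    return result


print(ratings(SAMPLE))
```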

lxml xpath() not returning what I expect

I am following a guide (http://docs.python-guide.org/en/latest/scenarios/scrape/) to scrape a website (https://www.brookfieldproperties.com/portfolio/toronto/bay-adelaide-east/) and have gone through the lxml package website and can't figure out what's going wrong.
I have this code:
from lxml import html
import requests
page = requests.get('https://www.brookfieldproperties.com/portfolio/toronto/bay-adelaide-east/')
tree = html.fromstring(page.content)
floor = tree.xpath('//div[@class="column floor"]/text()')
sf = tree.xpath('//div[@class="column rsf"]/text()')
but floor and sf return a list of '\n\t\t\t\t' values, not the integers you'd expect from the HTML of the actual website ("20" and "5117" in the case below):
<div class="availabilityWrap">
<h3>Availabilities</h3>
<div class="availabilityRow headerRow">
<div class="column floor">
<a href="/media/img/asset/pdf/BAC-ET-_20th_Floor_-_5100sf.pdf"
target='blank'><img src="/static/images/pdf.png" class="floorPDF" />20</a>
</div>
<div class="column rsf">
<p><b>5117</b></p>
</div>
<div class="column divisible">
<p><b>yes</b></p>
</div>
<div class="column date">
<p><b>05/01/2017</b></p>
</div>
<div class="column space">
<p><b>Office</b></p>
</div>
<div class="column description">
<p><b>model suite</b></p>
</div>
<div class="column rent">
<p><b>$26.55</b></p>
</div>
</div>
Shouldn't it just be returning all text in the "column floor" div class? Any help would be great.

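floor = tree.xpath('normalize-space(//div[@class="column floor"])')
The div's direct text consists only of the \n\t characters used for line breaks and indentation, and those count as text too. The number itself sits inside the nested a element, so text() on the div never sees it. The normalize-space() function concatenates all the text inside the div and collapses the whitespace: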
In [14]: '''<div class="column floor">
...:
...: <a href="/media/img/asset/pdf/BAC-ET-_20th_Floor_-_5100sf.pdf"
...: target='blank'><img src="/static/images/pdf.png" class="floorPDF" />20</a>
...:
...: </div>'''
Out[14]: '<div class="column floor">\n\n <a href="/media/img/asset/pdf/BAC-ET-_20th_Floor_-_5100sf.pdf"\n target=\'blank\'><img src="/static/images/pdf.png" class="floorPDF" />20</a>\n\n </div>'
EDIT:
for div in tree.xpath('//div[@class="column floor"]'):
    print(div.xpath('normalize-space(.)'))  # `.` means current node
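To see why the direct text is only whitespace while the normalized string value is the number, here is a small sketch using the standard library's xml.etree.ElementTree rather than lxml. The helper normalize_space is a hypothetical Python approximation of the XPath function (str.split treats a few more characters as whitespace than XPath does), and SNIPPET is the posted div with its indentation simplified.

```python
import xml.etree.ElementTree as ET

# The posted div, with indentation simplified.
SNIPPET = """<div class="column floor">
    <a href="/media/img/asset/pdf/BAC-ET-_20th_Floor_-_5100sf.pdf"
       target="blank"><img src="/static/images/pdf.png" class="floorPDF" />20</a>
</div>"""


def normalize_space(element):
    """Join all descendant text, trim the ends, and collapse whitespace runs."""
    return " ".join("".join(element.itertext()).split())


div = ET.fromstring(SNIPPET)

# The div's own text is just the line break and indentation before <a>.
print(repr(div.text))

# The "20" lives inside the nested <a>, so it only shows up in the joined text.
print(normalize_space(div))
```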
