Making xpath more selective? [Web scraping] - python

I am trying to print off some housing prices and am having trouble using Xpath. Here's my code:
from selenium import webdriver

driver = webdriver.Chrome("my/path/here")
driver.get("https://www.realtor.com/realestateandhomes-search/?pgsz=10")

for house_number in range(1, 11):
    try:
        price = driver.find_element_by_xpath('//*[@id="{}"]/div[2]/div[1]'.format(house_number))
        print(price.text)
    except:
        print('couldnt find')
I am on this website, trying to print off the housing prices of the first ten houses.
My output is that for every house with a "NEW" sticker, "NEW" gets captured as the price instead of the actual price. But for the bottom two, which don't have the sticker, the actual price is recorded.
How do I make my XPath selector select the numbers and not "NEW"?

You can write it like this, without loading images, which can increase your fetching speed:
from selenium import webdriver

# Disable image loading
chrome_opt = webdriver.ChromeOptions()
prefs = {"profile.managed_default_content_settings.images": 2}
chrome_opt.add_experimental_option("prefs", prefs)

driver = webdriver.Chrome(chrome_options=chrome_opt, executable_path="my/path/here")
driver.get("https://www.realtor.com/realestateandhomes-search/Bladen-County_NC/sby-6/pg-1?pgsz=10")

for house_number in range(1, 11):
    try:
        price = driver.find_element_by_xpath('//*[@id="{}"]/div[2]/div[@class="srp-item-price"]'.format(house_number))
        print(price.text)
    except:
        print('couldnt find')

You're on the right track; you've just made an XPath that is too brittle. I would try making it a little more verbose, without relying on indices and wildcards.
Here's your XPath (I used id="1" for example purposes):
//*[@id="1"]/div[2]/div[1]
And here's the HTML (some attributes/elements removed for brevity):
<li id="1">
    <div></div>
    <div class="srp-item-body">
        <div>New</div><!-- this is optional! -->
        <div class="srp-item-price">$100,000</div>
    </div>
</li>
First, replace the * wildcard with the element that you are expecting to contain the id="1". This simply serves as a way to help "self-document" the XPath a little bit better:
//li[@id="1"]/div[2]/div[1]
Next, you want to target the second <div>, but instead of searching by index, try to use the element's attributes if applicable, such as class:
//li[@id="1"]/div[@class="srp-item-body"]/div[1]
Lastly, you want to target the <div> with the price. Since the "New" text is in its own <div>, your XPath was targeting the first <div> ("New"), not the <div> with the price. Your XPath did, however, work when the "New" <div> was absent.
We can use a similar method as the previous step, targeting by attribute. This forces the XPath to always target the <div> with the price:
//li[@id="1"]/div[@class="srp-item-body"]/div[@class="srp-item-price"]
Hope this helps!
And so... having said all of that, if you are just interested in the prices and nothing else, this would probably also work :)
for price in driver.find_elements_by_class_name('srp-item-price'):
    print(price.text)
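If the listings render after the initial page load, an explicit wait before collecting the prices makes this more reliable. A minimal sketch, assuming the same srp-item-price class is present on the page:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome("my/path/here")
driver.get("https://www.realtor.com/realestateandhomes-search/?pgsz=10")

# Wait up to 10 seconds for at least one price element to be present
prices = WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.CLASS_NAME, "srp-item-price"))
)
for price in prices:
    print(price.text)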

Can you try this code:
from selenium import webdriver

driver = webdriver.Chrome()
driver.maximize_window()
driver.get("https://www.realtor.com/realestateandhomes-search/Bladen-County_NC/sby-6/pg-1?pgsz=10")

prices = driver.find_elements_by_xpath('//*[@class="data-price-display"]')
for price in prices:
    print(price.text)
It will print
$39,900
$86,500
$39,500
$40,000
$179,000
$31,000
$104,900
$94,900
$54,900
$19,900
Do let me know if any other details are required.

Related

How do I click and open a collection of objects in an element in python selenium without closing and opening the browser for each element

Say I have an e-commerce site I want to scrape, and I am interested in the top ten trending products. When I dig into the HTML, it looks like this:
<div>
    <div>
        <span>
            <a href='www.mysite/products/1'>
                Product 1
            </a>
        </span>
    </div>
    <div>
        <span>
            <a href='www.mysite/products/2'>
                Product 2
            </a>
        </span>
    </div>
    <div>
        <span>
            <a href='www.mysite/products/3'>
                Product 3
            </a>
        </span>
    </div>
    <div>
        <span>
            <a href='www.mysite/products/4'>
                Product 4
            </a>
        </span>
    </div>
</div>
My first solution was to extract the href attributes and store them in a list; I would then open a browser instance for each one. But that comes at a cost, since every time I close and reopen the browser I have to authenticate again. So I tried a second solution. There, the outer div is the parent, and the Selenium way of doing things would mean the products are stored as follows:
product_1 = driver.find_element_by_xpath("//div/div[1]")
product_2 = driver.find_element_by_xpath("//div/div[2]")
product_3 = driver.find_element_by_xpath("//div/div[3]")
product_4 = driver.find_element_by_xpath("//div/div[4]")
So my objective is to search for a product, and after getting the list, target each box's <a> tag, click it, extract more details on the product, and then go back, without closing the browser until my list is finished. Below is my solution:
from selenium.common.exceptions import NoSuchElementException

for i in range(10):
    try:
        num = i + 1
        path = f"//div/div[{num}]/span/a"
        product_click = driver.find_element_by_xpath(path)
        driver.execute_script("arguments[0].click();", product_click)
        scrape_product_detail()  # function that scrapes the whole product detail
        driver.execute_script("window.history.go(-1)")  # go backwards to continue looping
    except NoSuchElementException:
        print('Element not found')
The problem is that it works for the first product: it scrapes all the detail and then goes back. But despite returning to the product page, the program fails to find the second element and those coming afterwards, and I am failing to understand what the problem may be. May you kindly assist? Thanks
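A likely cause is that navigating away and back reloads the page, so element references located before the click go stale. One workaround, sketched here under the same markup as above, is to collect the href attributes first and then visit each URL directly in the same browser session, so there is no clicking, no going back, and no re-authentication:

# Collect every product link up front, while the listing page is still loaded
links = [a.get_attribute("href")
         for a in driver.find_elements_by_xpath("//div/div/span/a")]

for url in links:
    driver.get(url)           # navigate directly; no stale references
    scrape_product_detail()   # your existing detail-scraping function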
Thanks @Debenjan, you did help me a lot there. Your solution is working like a charm. For those who want to know how I went about it, here is the code:
article_elements = self.find_elements_by_class_name("s-card-image")
collection = []

for news_box in article_elements:
    # Pull the link from each card
    slug = news_box.find_element_by_tag_name('a').get_attribute('href')
    collection.append(slug)

for i in range(len(collection)):
    self.execute_script("window.open()")
    self.switch_to.window(self.window_handles[i + 1])
    url = collection[i]
    self.get(url)
    print(self.title, url, self.current_url)
@A D, thanks so much, your solution is working too. I will just have to test and see what the best strategy is and go with it. Thanks a lot, guys.

Web Scraping Thesaurus using Selenium

I'm fairly new to the web scraping world but I really need to do some web scraping on the Thesaurus website for a project I'm working on. I have successfully created a program using beautifulsoup4 that asks the user for a word, then returns the most likely synonyms based on Thesaurus. However, I would like to not only have those synonyms but also the synonyms of every sense of the word (which is depicted on Thesaurus by a list of buttons above the synonyms). I noticed that when clicking a button, the name of the classes also change, so I did a little digging and decided to go with Selenium instead of beautifulsoup.
I have now a code that writes a word on the search bar and clicks it, however, I'm unable to get the synonyms or the said buttons, simply because the find_element finds nothing, and being new to this, I'm afraid I'm using the wrong syntax.
This is my code at the moment (it looks for synonyms of "good"):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
import time

PATH = r"C:\Program Files (x86)\chromedriver_win32\chromedriver.exe"
driver = webdriver.Chrome(PATH)
driver.get("https://thesaurus.com")

search = driver.find_element_by_id("searchbar_input")
search.send_keys('good')
search.send_keys(Keys.RETURN)

try:
    headword = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "headword"))
    )
    print(headword.text)
    #buttons = headword.find_element_by_class_name("css-bjn8wh e1br8a1p0")
    #print(buttons.text)

    meanings = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "meanings"))
    )
    print(meanings.text)
    #words = meanings.find_elements_by_class_name("css-1kg1yv8 eh475bn0")
    #print(words.text)
except:
    print('failed')

driver.quit()
For the first part, I want to access the buttons. The headword is simply the element that contains all the buttons I want to press. This is the headword element according to the inspect tool:
<div id="headword" class="css-bjn8wh e1br8a1p0">
  <div class="css-vw3jp5 e1ibdjtj4">
    *unnecessary stuff*
    <div class="css-bjn8wh e1br8a1p0">
      <div class="postab-container css-cthfds ew5makj3">
        <ul class="css-gap396 ew5makj2">
          <li data-test-pos-tab="true" class="active-postab css-kgfkmr ew5makj4">
            <a class="css-sc11zf ew5makj1">
              <em class="css-1v93s5a ew5makj0">adj.</em>
              <strong>pleasant, fine</strong>
            </a>
          </li>
          <li data-test-pos-tab="true" class=" css-1ha4k0a ew5makj4">
            *similar stuff*
          <li data-test-pos-tab="true" class=" css-1ha4k0a ew5makj4">
          ...
where each one of these <li data-test-pos-tab="true" class=" css-1ha4k0a ew5makj4"> elements is a button I want to click. So far I have tried a bunch of things like the one shown in the code, and also things like:
buttons = headword.find_elements_by_class_name("css-1ha4k0a ew5makj4")
buttons = headword.find_elements_by_css_selector("css-1ha4k0a ew5makj4")
buttons = headword.find_elements_by_class_name("postab-container css-cthfds ew5makj3")
buttons = headword.find_elements_by_css_selector("postab-container css-cthfds ew5makj3")
but in no case can Selenium find these elements.
For the second part, I want the synonyms. Here is the meanings element:
<div id="meanings" class="css-16lv1yi e1qo4u831">
  <div class="css-1f3egm3 efhksxz0">
    *unnecessary stuff*
  <div data-testid="word-grid-container" class="css-ixatld e1cc71bi0">
    <ul class="css-1ngwve3 e1ccqdb60">
      <li>
        <a font-weight="inherit" href="/browse/acceptable" data-linkid="nn1ov4" class="css-1kg1yv8 eh475bn0">
        </a>
      </li>
      <li>
        <a font-weight="inherit" href="/browse/bad" data-linkid="nn1ov4" class="css-1kg1yv8 eh475bn0">
        ...
where each of these elements is a synonym I want to get. As in the previous case, I tried several things, such as:
synGrid = meanings.find_element_by_class_name("css-ixatld e1cc71bi0")
synGrid = meanings.find_element_by_css_selector("css-ixatld e1cc71bi0")
words = meanings.find_elements_by_class_name("css-1kg1yv8 eh475bn0")
words = meanings.find_elements_by_css_selector("css-1kg1yv8 eh475bn0")
And again Selenium cannot find these elements...
I would really appreciate some help in order to achieve this, even if it is just a push in the right direction instead of giving a full solution.
Hope I wrote all the needed information, if not, please let me know.
If you use a CSS selector then you have to use a dot for a class
css_selector(".css-ixatld.e1cc71bi0")
and a hash for an id
css_selector("#headword")
just as you would in a .css file. In a CSS selector you can also use the other methods available in CSS; see CSS selectors on w3schools.com.
Selenium converts class_name to a CSS selector, but class_name() expects a single name and has problems when there are two or more. When it converts class_name to a CSS selector, it adds a dot only before the first name, but it needs one before the second and subsequent names too. So you have to add the extra dot manually:
class_name("css-ixatld.e1cc71bi0")
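Applied to the code in the question, a minimal sketch of the corrected lookups (assuming the class names from the question are still current on the page):

# CSS selector: every class name needs its own leading dot
syn_grid = meanings.find_element_by_css_selector(".css-ixatld.e1cc71bi0")
words = meanings.find_elements_by_css_selector(".css-1kg1yv8.eh475bn0")

# class_name: join the names with a dot yourself, since Selenium only adds the first
buttons = headword.find_elements_by_class_name("css-1ha4k0a.ew5makj4")

for word in words:
    print(word.get_attribute("href"))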
See if this works:
meanings = driver.find_elements_by_xpath(".//div[@id='meanings']/div[@data-testid='word-grid-container']/ul/li")
for e in meanings:
    e.find_element_by_tag_name("a").click()
    # add an explicit wait here if you need one
    driver.back()
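One caveat: clicking navigates away, so the elements in meanings go stale after the first iteration. Re-locating the list on every pass, a sketch under the same assumptions, avoids a StaleElementReferenceException:

li_xpath = ".//div[@id='meanings']/div[@data-testid='word-grid-container']/ul/li"
count = len(driver.find_elements_by_xpath(li_xpath))
for i in range(count):
    # Re-locate the list after every navigation so the references stay fresh
    items = driver.find_elements_by_xpath(li_xpath)
    items[i].find_element_by_tag_name("a").click()
    driver.back()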

How to scrape elements in Selenium/Python by calling different css selectors at the same time?

I am trying to select the title of posts that are loaded in a webpage by integrating multiple css selectors. See below my process:
Load relevant libraries
import time
from selenium import webdriver
from webdriver_manager.firefox import GeckoDriverManager
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
Then load the content I wish to analyse
options = Options()
options.set_preference("dom.push.enabled", False)
browser = webdriver.Firefox(options=options)

browser.get("https://medium.com/search")
browser.find_element_by_xpath("//input[@type='search']").send_keys("international development", Keys.ENTER)
time.sleep(5)

scrolls = 2
while True:
    scrolls -= 1
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight)")
    time.sleep(5)
    if scrolls < 0:
        break
Then, to get the content for each selector separately, I call find_elements_by_css_selector:
titles = browser.find_elements_by_css_selector("h3[class^='graf']")
TitlesList = []
for names in titles:
    TitlesList.append(names.text)

times = browser.find_elements_by_css_selector("time[datetime^='2016']")
Times = []
for names in times:
    Times.append(names.text)
It all works so far... Now I am trying to bring them together, with the aim of identifying only the choices from 2016:
choices = browser.find_elements_by_css_selector("time[datetime^='2016'] and h3[class^='graf']")
browser.quit()
On this last snippet, I always get an empty list.
So I wonder: 1) how can I select elements by using different css_selectors as simultaneous conditions; 2) whether the syntax for finding under multiple conditions is the same when mixing approaches such as css_selector and XPath; and 3) whether there is a way to get the text of elements identified by multiple CSS selectors, along the lines of the below:
[pair.text for pair in browser.find_elements_by_css_selector("h3[class^='graf']") if pair.text]
Thanks
Firstly, I think what you're trying to do is to get any title whose posting time is in 2016, right?
You're using the CSS selector "time[datetime^='2016'] and h3[class^='graf']", but this will not work because the syntax is not valid ("and" is not a CSS operator). Plus, these are two different elements; a CSS selector can only match one element. In your case, to add a condition from another element, use a common element, such as a shared parent.
I've checked the site; here's the HTML you need to take a look at (if you're trying to get the titles published in 2016). This is the minimal HTML that can help you identify what you need to get.
<div class="postArticle postArticle--short js-postArticle js-trackPostPresentation" data-post-id="d17220aecaa8"
     data-source="search_post---------2">
  <div class="u-clearfix u-marginBottom15 u-paddingTop5">
    <div class="postMetaInline u-floatLeft u-sm-maxWidthFullWidth">
      <div class="u-flexCenter">
        <div class="postMetaInline postMetaInline-authorLockup ui-captionStrong u-flex1 u-noWrapWithEllipsis">
          <div class="ui-caption u-fontSize12 u-baseColor--textNormal u-textColorNormal js-postMetaInlineSupplemental">
            <a class="link link--darken"
               href="https://provocations.darkmatterlabs.org/reimagining-international-development-for-the-21st-century-d17220aecaa8?source=search_post---------2"
               data-action="open-post"
               data-action-value="https://provocations.darkmatterlabs.org/reimagining-international-development-for-the-21st-century-d17220aecaa8?source=search_post---------2"
               data-action-source="preview-listing">
              <time datetime="2016-09-05T13:55:05.811Z">Sep 5, 2016</time>
            </a>
          </div>
        </div>
      </div>
    </div>
  </div>
  <div class="postArticle-content">
    <a href="https://provocations.darkmatterlabs.org/reimagining-international-development-for-the-21st-century-d17220aecaa8?source=search_post---------2"
       data-action="open-post" data-action-source="search_post---------2"
       data-action-value="https://provocations.darkmatterlabs.org/reimagining-international-development-for-the-21st-century-d17220aecaa8?source=search_post---------2"
       data-action-index="2" data-post-id="d17220aecaa8">
      <section class="section section--body section--first section--last">
        <div class="section-divider">
          <hr class="section-divider">
        </div>
        <div class="section-content">
          <div class="section-inner sectionLayout--insetColumn">
            <h3 name="5910" id="5910" class="graf graf--h3 graf--leading graf--title">Reimagining
              International Development for the 21st Century.</h3>
          </div>
        </div>
      </section>
    </a>
  </div>
</div>
Both time and h3 are in a big div with a class of postArticle. The article contains the publication time & the title, so it makes sense to get each whole article div published in 2016, right?
Using XPATH is much more powerful & easier to write:
This will get every article div whose class contains postArticle--short: article_xpath = '//div[contains(@class, "postArticle--short")]'
This will get every time tag whose datetime contains 2016: //time[contains(@datetime, "2016")]
Let's combine the two. I want the article divs that contain a time tag with a datetime in 2016:
article_2016_xpath = '//div[contains(@class, "postArticle--short")][.//time[contains(@datetime, "2016")]]'
article_element_list = driver.find_elements_by_xpath(article_2016_xpath)

# now let's get the title
for article in article_element_list:
    title = article.find_element_by_tag_name("h3").text
I haven't tested the code yet, only the xpath. You might need to adapt the code to work on your side.
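If you would rather stay with CSS selectors, one workaround (a sketch, assuming the postArticle wrapper shown in the HTML above) is to select the 2016 time elements with a descendant combinator and then climb to the enclosing article with an XPath ancestor step:

# Find every 2016 <time> inside an article, then climb to its enclosing div
for t in browser.find_elements_by_css_selector("div.postArticle time[datetime^='2016']"):
    article = t.find_element_by_xpath('./ancestor::div[contains(@class, "postArticle")]')
    print(article.find_element_by_tag_name("h3").text)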
By the way, using find_element... right away is not a good idea; try using an explicit wait instead: https://selenium-python.readthedocs.io/waits.html
This will help you avoid blunt time.sleep waits, improve your app's performance, and let you handle errors pretty well.
Only use find_element... when you have already located an element and need to find a child element inside it. For example, in this case, if I want to find the articles, I will find them with an explicit wait; then, once each element is located, I will use find_element... to find the child h3.
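A minimal sketch of that pattern, reusing the article_2016_xpath defined above:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for the 2016 articles instead of a fixed time.sleep
articles = WebDriverWait(browser, 10).until(
    EC.presence_of_all_elements_located((By.XPATH, article_2016_xpath))
)
for article in articles:
    # A child lookup is fine once the parent has been located
    print(article.find_element_by_tag_name("h3").text)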

Getting search bar ID for selenium

Trying to get the search bar ID from this website: http://www.pexels.com
browser = webdriver.Chrome(executable_path="C:\\Users\\James\\Documents\\PythonScripts\\chromedriver.exe")
url = "https://www.pexels.com"
browser.get(url)
browser.maximize_window()
search_bar = browser.find_element_by_id("//input[#id='search__input']")
search_bar.send_keys("sky")
search_button.click()
However this isn't correct and I'm not sure how to get the search to work. First time using selenium so all help is appreciated!
There's no id attribute in the tag you are searching for. You may use css selectors instead. Here's a sample snippet:
search_bar = driver.find_element_by_css_selector("input[placeholder='Search for free photos…']")
search_bar.send_keys("sky")
search_bar.send_keys(Keys.RETURN)
Above snippet will insert 'sky' in the search bar and hit enter button.
Locating Elements gives an explanation on how to locate elements with Selenium.
The element you want to select, looks like this in the DOM:
<input required="required" autofocus="" class="search__input" type="search" placeholder="Search for free photos…" name="s">
Since there is no id specified you can't use find_element_by_id.
For me this worked:
search_bar = driver.find_element_by_xpath('/html/body/header[2]/div/section/form/input')
search_bar.send_keys("sky")
search_bar.send_keys(Keys.RETURN)
for the selection you can also use the class name:
search_bar = driver.find_element_by_class_name("search__input")
or the name tag:
search_bar = driver.find_element_by_name('s')
However, locating elements by name is probably not a good idea if there are multiple elements with the same name.
BTW, if you are unsure about the XPath, the Google Chrome inspection tool lets you copy the XPath of an element straight from the document.

Impossible to locate an element in selenium

I'm trying to find one of these elements (vote up or vote down):
<div class="votingWrapper ">
    <span class="voteBtn voteUp " data-postid="c4fcff79-7f73-493c-b6d8-f05ff1962897"></span>
    <span class="postRank">
        <span class="rankWrapper wrapper_up" data-points="1320">+1.3k</span>
        <span class="rankWrapper wrapper_down" data-points="512">-512</span>
    </span>
    <span class="voteBtn voteDown " data-postid="c4fcff79-7f73-493c-b6d8-f05ff1962897"></span>
</div>
I tried to locate it by class, id, xpath (a lot of types) but I got NoSuchElementException for all attempts.
element = driver.find_element_by_id("c4fcff79-7f73-493c-b6d8-f05ff1962897")
element = driver.find_element_by_class_name("voteBtn voteDown ")
element = driver.find_element_by_xpath('//*[@id="c4fcff79-7f73-493c-b6d8-f05ff1962897"]/div[1]/span[3]')
I have searched for 3 hours for how to do this, but I failed. It seems impossible.
Can anyone give me some tip?
You have to pass a single class at a time:
element_up = driver.find_element_by_class_name("voteUp")
element_down = driver.find_element_by_class_name("voteDown")
A space inside a class attribute separates multiple class names, and your issue can be resolved easily.
In CSS, you can access an element whose class attribute contains several space-separated names by joining the names with dots.
Example:
<span class="voteBtn voteUp " data-postid="c4fcff79-7f73-493c-b6d8-f05ff1962897"></span>
can be accessed in a CSS file by:
.voteBtn.voteUp {}
You can use the same technique with Selenium, as a one-liner:
element = driver.find_element_by_css_selector("span.voteBtn.voteDown")
try the following XPaths (in case they are not in a frame):
to find voteUp:
//span[contains(@class,"voteBtn voteUp")]
to find voteDown:
//span[contains(@class,"voteBtn voteDown")]
If they are inside a frame (check whether the elements you are trying to find are child elements of an iframe tag), then first find the frame and switch to it, and only then use the XPath to find the elements.
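A minimal sketch of the frame switch, assuming the voting widget sits in the page's first iframe (the index is an assumption):

# Switch into the iframe before locating the buttons; index 0 is an assumption
frames = driver.find_elements_by_tag_name("iframe")
driver.switch_to.frame(frames[0])

vote_up = driver.find_element_by_xpath('//span[contains(@class, "voteBtn voteUp")]')
vote_up.click()

# Return to the main document afterwards
driver.switch_to.default_content()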
CSS for vote up: div.votingWrapper > span
or
span[class*='voteUp']
CSS for vote down: div.votingWrapper > span:nth-child(3)
or
span[class*='voteDown']
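A quick usage sketch of those selectors:

vote_up = driver.find_element_by_css_selector("div.votingWrapper > span.voteBtn.voteUp")
vote_down = driver.find_element_by_css_selector("span[class*='voteDown']")
vote_down.click()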
