Selenium Driver - Web Scraping - Python

I'm using the Selenium module to web-scrape, but when I print out the element, it seems to return a reference to where the data is stored on the Selenium server rather than the data itself. I'm not exactly sure how this works. Anyway, here's my code. I'm very confused. Can someone tell me what I'm doing wrong?
from selenium import webdriver
browser = webdriver.Firefox()
browser.get('https://caribeexpress.com.do/') #get method
elem2 = browser.find_elements_by_css_selector('div.plan:nth-child(3) > div:nth-child(2) > span:nth-child(2)')
print(elem2)
elems3 = browser.find_elements_by_class_name('value')
print(elems3)
elem4 = browser.find_element_by_xpath('//*[@id="content-wrapper"]/div[2]/div[3]/div/span[2]')
print(elem4)
For some reason, what displays in my Python IDE doesn't display here, so I included it in my gist:
https://gist.github.com/jtom343

In case you want to extract the text between the span tags, replace this:
print(elem2)
with:
print(elem2[0].text.strip())
and this:
print(elem4)
with:
print(elem4.text.strip())
Note that find_elements_by_css_selector (plural) returns a list, so elem2 has to be indexed or iterated before reading .text, while elem4 comes from the singular find_element_by_xpath and supports .text directly.
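For context on the original confusion: printing a WebElement (or a list of them) shows only its repr, the session and element ids Selenium uses to talk to the browser, not anything stored "on the Selenium server". The page data is only fetched when you ask the element for it. A minimal sketch using the question's own class-name lookup:
from selenium import webdriver

browser = webdriver.Firefox()
browser.get('https://caribeexpress.com.do/')

# find_elements_* returns a list of WebElement handles; printing the list
# shows reprs like <selenium.webdriver.remote.webelement.WebElement (session=...)>
elems = browser.find_elements_by_class_name('value')
print(elems)

# Asking each handle for .text makes Selenium fetch the live page content
for e in elems:
    print(e.text.strip())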

Related

Delete dynamic elements from HTML with Selenium and Python

I've used BeautifulSoup to find a specific div class in the page's HTML. I want to check whether this div has a span class inside it. If the div contains the span, I want to keep it in the page's code, but if it doesn't, I want to delete it, maybe using Selenium.
For that I have two lists selecting the elements (div and span). I tried to check whether one list is inside the other, and that sort of worked. But how can one delete the found element from the page's source code?
Edit
I've edited the code after a few conversations in the comments section. With help, I was able to implement code that removes elements by executing JavaScript.
The code runs with no errors, but nothing is being deleted from the page.
# Import required module
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
import time
# Option to launch browser in incognito
options = Options()
options.add_argument("--incognito")
#options.add_argument("--headless")
# Using chrome driver
driver = webdriver.Chrome(ChromeDriverManager().install(), options=options)
# Web page url request
driver.get('https://www.facebook.com/ads/library/?active_status=all&ad_type=all&country=BR&q=frete%20gr%C3%A1tis%20aproveite&sort_data[direction]=desc&sort_data[mode]=relevancy_monthly_grouped&search_type=keyword_unordered&media_type=all')
driver.maximize_window()
time.sleep(10)
driver.execute_script("""
for(let div of document.querySelectorAll('div._99s5')){
let match = div.innerText.match(/(\d+) ads? use this creative and text/)
let numAds = match ? parseInt(match[1]) : 0
if(numAds < 10){
div.querySelector(".tp-logo")?.remove()
}
}
""")
Since you're deleting them in JavaScript anyway:
driver.execute_script("""
    for (let div of document.querySelectorAll('div._99s5')) {
        let match = div.innerText.match(/(\d+) ads? use this creative and text/)
        let numAds = match ? parseInt(match[1]) : 0
        if (numAds < 10) {
            div.querySelector('.tp-logo')?.remove()
        }
    }
""")
Note: the question and comments read a bit confusingly, so it would be great to improve them. Assuming you want to decompose() some elements, the reason why, or what to do after this action, is not clear, so this answer will only point out an approach.
To decompose() the elements that do not contain "ads use this creative and text", just negate your selection and iterate the ResultSet:
for e in soup.select('div._99s5:not(:-soup-contains("ads use this creative and text"))'):
    e.decompose()
Now these elements will no longer be included in your soup and you could process it for your needs.
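Put together as a minimal sketch, assuming the soup is built from the Selenium driver's rendered page source (keep in mind decompose() edits only this parsed copy in Python, not the live page in the browser):
from bs4 import BeautifulSoup

# Parse the rendered page; decompose() mutates this parsed copy only
soup = BeautifulSoup(driver.page_source, 'html.parser')

for e in soup.select('div._99s5:not(:-soup-contains("ads use this creative and text"))'):
    e.decompose()

print(len(soup.select('div._99s5')))  # divs remaining after decompose()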

Selenium CSS selector (Works in Scrapy but not Selenium) Python

I have tried probably every kind of selector and am unable to output this element as text.
ID, CSS selector, XPath: all return no result, but when using the same reference in a Scrapy shell, the desired output is returned.
Any idea why the Selenium selector does not work?
I am trying to return the text in masterBody_trSalesDate:
発売予定日 : 7月(2021/4/21予約開始)
https://www.example.co.jp/10777687
try:
    hatsubai = driver.find_element_by_id('#masterBody_trSalesDate').text
I have honestly tried every possible combination of elements and selectors I can think of with no luck, but as mentioned, the Scrapy shell DOES return the correct data, so I am not sure what is going wrong.
Is there any way to test Selenium selectors interactively, like the Scrapy shell, without running the script?
Thank you if you have any advice.
[Image: the same selector returning the expected text in the Scrapy shell]
When you use find_element_by_id (or an [@id=...] predicate in XPath), you don't need the # character; that prefix belongs to CSS selectors:
hatsubai = driver.find_element_by_id('masterBody_trSalesDate').text
That's all.
Minimal working code that works for me:
from selenium import webdriver
url = 'https://www.1999.co.jp/10777687'
#driver = webdriver.Firefox()
driver = webdriver.Chrome()
driver.get(url)
hatsubai = driver.find_element_by_id('masterBody_trSalesDate').text
print(hatsubai)
hatsubai = driver.find_element_by_xpath('//*[@id="masterBody_trSalesDate"]').text
print(hatsubai)
hatsubai = driver.find_element_by_css_selector('#masterBody_trSalesDate').text
print(hatsubai)
BTW: the same applies to by_class_name; it takes only the bare class name, without the leading dot (.).
You can also use a CSS selector for this one; here the # prefix is exactly what the selector syntax expects:
hatsubai = driver.find_element_by_css_selector('#masterBody_trSalesDate').text
print(hatsubai)
Output:
発売予定日 : 7月(2021/4/21予約開始)
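For completeness, a hedged note: the find_element_by_* helpers used above were deprecated in Selenium 4 and removed in later 4.x releases, so on a current install the same three lookups are written with the By locator:
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('https://www.1999.co.jp/10777687')

# The same three lookups as above, in the Selenium 4 style
print(driver.find_element(By.ID, 'masterBody_trSalesDate').text)
print(driver.find_element(By.XPATH, '//*[@id="masterBody_trSalesDate"]').text)
print(driver.find_element(By.CSS_SELECTOR, '#masterBody_trSalesDate').text)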

Why does Python selenium throw a StaleElementReferenceException when I try to access the text of an element? [duplicate]

I am trying to access the text of an element using selenium with Python. I can access the elements themselves just fine, but when I try to get the text it doesn't work.
This is my code:
from selenium import webdriver
driver = webdriver.Chrome() # I removed the path for my post, but there is one that works in my actual code
URL = "https://www.costco.com/laptops.html"
driver.get(URL)
prices = driver.find_elements_by_class_name("price")
print([price.text for price in prices])
If I run this code I get: selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: element is not attached to the page document
However, if I were to print out the elements themselves, I have no problem.
I read some previous posts about the stale element exception, but I don't understand why it applies to me in this case. Why would the DOM change when I try to access the text? Why is this happening?
Turns out you just need to wait:
from selenium import webdriver
import time
driver = webdriver.Chrome() # I removed the path for my post, but there is one that works in my actual code
URL = "https://www.costco.com/laptops.html"
driver.get(URL)
time.sleep(3)
prices = driver.find_elements_by_class_name("price")
print([price.text for price in prices])
Output:
['$1,999.99', '$2,299.99', '', '', '$769.99', '', '$799.99', '$1,449.99', '$1,199.99', '$1,199.99', '$1,999.99', '$1,599.99', '$1,299.99', '$2,299.99', '$1,549.99', '$1,499.99', '$599.99', '$1,699.99', '$1,079.99', '$2,999.99', '$1,649.99', '$1,499.99', '$2,399.99', '$1,499.97', '$1,199.99', '$1,649.99', '$849.99', '']
The correct way to do this, though, is to use WebDriverWait rather than a fixed sleep; a sketch follows.
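A minimal WebDriverWait version, assuming the same page and class name (visibility_of_all_elements_located blocks until the price elements are actually rendered, instead of sleeping a fixed 3 seconds):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://www.costco.com/laptops.html")

# Block until the elements are rendered (up to 10 s), then read them
prices = WebDriverWait(driver, 10).until(
    EC.visibility_of_all_elements_located((By.CLASS_NAME, "price"))
)
print([price.text for price in prices])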
Old answer:
I am not entirely sure why that is happening, but I would suggest you try BeautifulSoup:
from selenium import webdriver
from bs4 import BeautifulSoup
driver = webdriver.Chrome() # I removed the path for my post, but there is one that works in my actual code
URL = "https://www.costco.com/laptops.html"
driver.get(URL)
soup = BeautifulSoup(driver.page_source, "html.parser")
divs = soup.find_all("div",{"class":"price"})
[div.text.replace("\t",'').replace("\n",'') for div in divs]
Output:
['$1,099.99',
'$399.99',
'$1,199.99',
'$599.99',
'$1,049.99',
'$799.99',
'$699.99',
'$949.99',
'$699.99',
'$1,999.99',
'$449.99',
'$2,699.99',
'$1,149.99',
'$1,599.99',
'$1,049.99',
'$1,249.99',
'$299.99',
'$1,799.99',
'$749.99',
'$849.99',
'$2,299.99',
'$999.99',
'$649.99',
'$799.99']

"How to fix 'malformed URL' in Selenium web scraping

My problem is that I am attempting to scrape the titles of Netflix movies and shows from a website that lists them across 146 pages, so I made a loop to try to capture data from all the pages. However, when using the loop, my URL becomes malformed and I don't know how to fix it.
I have made sure the webdriver part of the code works, meaning if I type the URL into driver.get it gives me the information I need. However, when using the loop it pops up multiple Firefox windows and doesn't put any URL into any of them. I also added a time delay to see whether the URL was changing before it got used, but it still didn't work.
from selenium import webdriver
import time

for i in range(1, 3):
    URL = "https://flixable.com/?min-rating=0&min-year=1920&max-year=2019&order=date&page={}"
    newURL = URL.format(i)
    print(newURL)
    time.sleep(10)
    driver = webdriver.Firefox()
    driver.get('newURL')
    titles = driver.find_elements_by_css_selector('#filterContainer > div > div > p > strong > a')
    for post in titles:
        print(post.text)
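The fix is visible in the snippet itself: driver.get('newURL') passes the literal string 'newURL', not the variable newURL, which is why Firefox reports a malformed URL, and constructing webdriver.Firefox() inside the loop is what opens a fresh blank window on every iteration. A corrected sketch, looping over all 146 pages mentioned in the question:
from selenium import webdriver

driver = webdriver.Firefox()  # create the browser once, outside the loop
URL = "https://flixable.com/?min-rating=0&min-year=1920&max-year=2019&order=date&page={}"

for i in range(1, 147):  # the site lists 146 pages
    driver.get(URL.format(i))  # pass the formatted variable, not the string 'newURL'
    titles = driver.find_elements_by_css_selector('#filterContainer > div > div > p > strong > a')
    for post in titles:
        print(post.text)

driver.quit()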

Get html of inspect element source with selenium

I'm working in Selenium with Chrome.
The webpage I'm accessing updates dynamically.
I need the HTML that shows the results, which I can see when I do 'inspect element'.
I don't get how to access that HTML from my code; I always get the original HTML.
I tried this: Get HTML Source of WebElement in Selenium WebDriver using Python
browser.get('http://bijsluiters.fagg-afmps.be/?localeValue=nl')
searchform = browser.find_element_by_class_name('iceInpTxt')
searchform.send_keys('cefuroxim')
button = browser.find_element_by_class_name('iceCmdBtn').click()
element = browser.find_element_by_class_name('contentContainer')
html = element.get_attribute('innerHTML')
browser.close()
print(html)
It seems that it works after some delay. If I were you, I would experiment with the delay time:
from selenium import webdriver
import time
browser = webdriver.Chrome()
browser.get('http://bijsluiters.fagg-afmps.be/?localeValue=nl')
searchform = browser.find_element_by_class_name('iceInpTxt')
searchform.send_keys('cefuroxim')
button = browser.find_element_by_class_name('iceCmdBtn').click()
time.sleep(10)
element = browser.find_element_by_class_name('contentContainer')
html = element.get_attribute('innerHTML')
browser.close()
print(html)
Addition: a nicer way is to let the script proceed as soon as an element is available (because JavaScript, for example, can take time before a specific element has been added to the DOM). The element to look for in your example is the table with id iceDatTbl (from what I could find after a quick look).
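A sketch of that wait-based approach, assuming the iceDatTbl table id spotted above really is the results container:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.Chrome()
browser.get('http://bijsluiters.fagg-afmps.be/?localeValue=nl')
browser.find_element_by_class_name('iceInpTxt').send_keys('cefuroxim')
browser.find_element_by_class_name('iceCmdBtn').click()

# Proceed as soon as the results table exists instead of sleeping a fixed 10 s
WebDriverWait(browser, 15).until(
    EC.presence_of_element_located((By.ID, 'iceDatTbl'))
)
html = browser.find_element_by_class_name('contentContainer').get_attribute('innerHTML')
browser.close()
print(html)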
