I am trying to scrape this URL, but the URL I pass to driver.get() changes when the program runs and the Chrome page opens.
What could be causing it to change?
I want to open this link and extract specific things, but the URL changes and an error is raised because this class does not exist on the changed URL.
Here's my code:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager

s = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=s)
driver.get("https://www.booking.com/hotel/pk/one-bahawalpur.html?aid=378266&label=bdot-Os1%2AaFx2GVFdW3rxGd0MYQS541115605091%3Apl%3Ata%3Ap1%3Ap22%2C563%2C000%3Aac%3Aap%3Aneg%3Afi%3Atikwd-334108349%3Alp1011080%3Ali%3Adec%3Adm%3Appccp%3DUmFuZG9tSVYkc2RlIyh9YYriJK-Ikd_dLBPOo0BdMww&sid=0a2d2e37ba101e6b9547da95c4a30c48&all_sr_blocks=41645002_248999805_0_1_0;checkin=2022-11-10;checkout=2022-11-11;dest_id=-2755460;dest_type=city;dist=0;group_adults=2;group_children=0;hapos=1;highlighted_blocks=41645002_248999805_0_1_0;hpos=1;matching_block_id=41645002_248999805_0_1_0;no_rooms=1;req_adults=2;req_children=0;room1=A%2CA;sb_price_type=total;sr_order=popularity;sr_pri_blocks=41645002_248999805_0_1_0__1120000;srepoch=1668070223;srpvid=e9da3e27383900b4;type=total;ucfs=1&#hotelTmpl")
print(driver.find_element(by=By.CLASS_NAME, value="d2fee87262"))
The URL after the Chrome page opens is as follows:
https://www.booking.com/searchresults.html?aid=378266&label=bdot-Os1%2AaFx2GVFdW3rxGd0MYQS541115605091%3Apl%3Ata%3Ap1%3Ap22%2C563%2C000%3Aac%3Aap%3Aneg%3Afi%3Atikwd-334108349%3Alp1011080%3Ali%3Adec%3Adm%3Appccp%3DUmFuZG9tSVYkc2RlIyh9YYriJK-Ikd_dLBPOo0BdMww&sid=63b32ef8c1d53ae0613d71baf62c3e56&checkin=2022-11-10&checkout=2022-11-11&city=-2755460&group_adults=2&group_children=0&highlighted_hotels=416450&hlrd=with_av&keep_landing=1&no_rooms=1&redirected=1&source=hotel&srpvid=e9da3e27383900b4&room1=A,A,;#hotelTmpl
The URL is changed by the site; you cannot change that behavior. It redirects users with new sessions to the search results instead of the hotel's page.
As a workaround, I can suggest clicking the hotel name in the search results:
driver.find_element(By.XPATH, '//div[text()="Hotel One Bahawalpur"]').click()
Put this line between the line that opens the page and the line that finds the element.
I also recommend adding driver.implicitly_wait(10) after the line driver = webdriver.Chrome(service=s). This will improve your scraper's stability, since it makes Selenium wait for an element to appear on the page before failing.
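Putting both suggestions together, a minimal end-to-end sketch (selectors taken verbatim from the question and this answer; the long hotel URL is abbreviated here, use the full one from the question):

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager

s = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=s)
driver.implicitly_wait(10)  # make element lookups wait up to 10 s
driver.get("https://www.booking.com/hotel/pk/one-bahawalpur.html?aid=...")  # full URL from the question
# The site redirects new sessions to the search results, so click back to the hotel page
driver.find_element(By.XPATH, '//div[text()="Hotel One Bahawalpur"]').click()
print(driver.find_element(By.CLASS_NAME, "d2fee87262"))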
I am using Selenium WebDriver to try to scrape information from realestate.com.au; here is my code:
from selenium.webdriver import Chrome
from bs4 import BeautifulSoup

path = r'C:\Program Files (x86)\Google\Chrome\Application\chromedriver.exe'  # raw string keeps the backslashes literal
url = 'https://www.realestate.com.au/buy'
url2 = 'https://www.realestate.com.au/property-house-nsw-castle+hill-134181706'
driver = Chrome(path)
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'html.parser')
print(soup)
It works fine with url, but when I try to do the same to open url2, it opens a blank page, and when I check the console I get the following:
"Failed to load resource: the server responded with a status of 429 ()
about:blank:1 Failed to load resource: net::ERR_UNKNOWN_URL_SCHEME
149e9513-01fa-4fb0-aad4-566afd725d1b/2d206a39-8ed7-437e-a3be-862e0f06eea3/fingerprint:1 Failed to load resource: the server responded with a status of 404 ()"
While on url, I also tried searching for anything, which likewise leads to a blank page like url2.
It looks like the www.realestate.com.au website is using an Akamai security tool.
A quick DNS lookup shows that www.realestate.com.au resolves to dualstack.realestate.com.au.edgekey.net.
They are most likely using the Bot Manager product (https://www.akamai.com/us/en/products/security/bot-manager.jsp). I have encountered this on another website recently.
Typically, rotating user agents and IP addresses (ideally using residential proxies) should do the trick; you want to load the site with a "fresh" browser profile each time, as sketched below. You should also check out https://github.com/67-6f-64/akamai-sensor-data-bypass
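A rough sketch of that idea with Chrome options (the user-agent string and proxy address are placeholders you would rotate yourself; the flags are standard Chrome switches):

import tempfile
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--user-agent=Mozilla/5.0 ...")  # placeholder UA string, rotate per run
options.add_argument("--proxy-server=http://my-residential-proxy:8080")  # placeholder proxy address
options.add_argument("--user-data-dir=" + tempfile.mkdtemp())  # fresh browser profile each run
driver = webdriver.Chrome(options=options)
driver.get("https://www.realestate.com.au/buy")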
I think you should try adding driver.implicitly_wait(10) before your get line; this adds an implicit wait in case the page loads too slowly for the driver to pull the site. You should also consider trying the Firefox WebDriver, since this bug appears to only affect Chromium-based browsers.
I am trying to get stock option prices from this website based on the series code (for example FMM1), but the content is dynamically generated after the page loads, and my Python Selenium script is not able to extract the correct source code and therefore does not find it. When I inspect the element, I can find it, but not when I click on "view source code".
This is my code:
from selenium import webdriver
import time

# Here, we open the website for options prices in Chrome
driver = webdriver.Chrome()
driver.get("http://www.bmfbovespa.com.br/pt_br/servicos/market-data/consultas/mercado-de-derivativos/precos-referenciais/precos-referenciais-bm-f-premios-de-opcoes/")
# Since the page is populated by JavaScript code *after* loading the page, we
# tell the browser to wait 10 seconds before getting the source html code
time.sleep(10)
html_file = driver.page_source  # gets the html source of the page
print(html_file)
I have also tried the following, but it did not work:
WebDriverWait(driver, 60).until(EC.visibility_of_element_located((By.ID, "divContainerIframeBmf")))
Use this after the page loads:
driver.switch_to.frame(driver.find_element(By.XPATH, "//iframe"))
and then continue performing your operations on the page. The prices are rendered inside an iframe, so you have to switch into it before Selenium can see its elements.
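Putting it together, a sketch of the full flow (untested against this site; frame_to_be_available_and_switch_to_it waits for the iframe and switches into it in one step):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("http://www.bmfbovespa.com.br/pt_br/servicos/market-data/consultas/mercado-de-derivativos/precos-referenciais/precos-referenciais-bm-f-premios-de-opcoes/")
# Wait up to 60 s for the iframe, then switch into it
WebDriverWait(driver, 60).until(EC.frame_to_be_available_and_switch_to_it((By.XPATH, "//iframe")))
print(driver.page_source)  # now includes the dynamically generated prices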
I want to do web scraping of Bing's search results. Basically, I am using Selenium; the idea is to use Selenium to click 'Next' automatically and scrape the URLs of the search results on each page. I got it running with the Chrome browser on my Ubuntu machine:
from selenium import webdriver
import os
import time

class bingURL(object):
    def __init__(self):
        self.driver = webdriver.Chrome(os.path.expanduser('./chromedriver'))

    def get_urls(self, url):
        driver = self.driver
        driver.get(url)
        elems = driver.find_elements_by_xpath("//a[@href]")
        href = []
        for elem in elems:
            link = elem.get_attribute("href")
            try:
                if 'bing.com' not in link and 'http' in link and 'microsoft.com' not in link and 'smashboards.com' not in link:
                    href.append(link)
            except:
                pass
        return list(set(href))

    def search_urls(self, keyword, pagenum):
        driver = self.driver
        searchurl = self.lookup(keyword)  # url of first page of bing search
        driver.get(searchurl)
        results = self.get_urls(searchurl)
        for i in range(pagenum):
            driver.find_elements_by_class_name("sb_pagN")[0].click()  # click 'Next' of bing search result
            time.sleep(5)  # wait to load page
            current_url = driver.current_url
            #print(current_url)
            #print(self.get_urls(current_url))
            results[0:0] = self.get_urls(current_url)
        driver.quit()
        return results

    def lookup(self, query):
        return "https://www.bing.com/search?q=" + query

if __name__ == "__main__":
    g = bingURL()
    result = g.search_urls('Stackoverflow is good', 10)
It works perfectly: when I run the code, it launches a Chrome browser, and I can see it go to the next page automatically and get the URLs for 10 pages of search results.
However, my goal is to run this code successfully on AWS. The original code failed with the error 'Chrome failed to start'. After googling, it seems I need to use a headless browser such as PhantomJS on AWS. So I installed PhantomJS and changed def __init__(self) to:
def __init__(self):
    self.driver = webdriver.PhantomJS()
However, it can no longer click 'Next' and cannot scrape the URLs using the old code. The error message is:
File ".../SEARCH_BING_MODULE.py", line 70, in search_urls
driver.find_elements_by_class_name("sb_pagN")[0].click()
IndexError: list index out of range
It looks like changing the browser completely changes the rules. How should I modify my original code to make it work again? Or how can I scrape Bing search results' URLs using Selenium + PhantomJS?
Thanks for your help!
Yes, you can perform all of those operations using a headless browser. Don't use HtmlUnit, as it has many configuration issues.
PhantomJS was another approach to headless browsing, but PhantomJS is buggy these days because it is poorly maintained.
You can use chromedriver itself for headless jobs.
You just need to pass one option to chromedriver, as below:
chromeOptions.addArguments("--headless");
The full code (here in Java) will look like this:
System.setProperty("webdriver.chrome.driver","D:\\Workspace\\JmeterWebdriverProject\\src\\lib\\chromedriver.exe");
ChromeOptions chromeOptions = new ChromeOptions();
chromeOptions.addArguments("--headless");
chromeOptions.addArguments("--start-maximized");
WebDriver driver = new ChromeDriver(chromeOptions);
driver.get("https://www.google.co.in/");
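Since the question's scraper is written in Python, a rough Python equivalent of the same idea would be (a sketch, untested):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.add_argument("--window-size=1920,1080")  # headless Chrome has no screen, so give it a viewport
driver = webdriver.Chrome(options=chrome_options)
driver.get("https://www.bing.com/search?q=Stackoverflow+is+good")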
Hope it will help you :)
I have scraped a site for 840 urls...
When I rebuild the URLs for more information, my Python scraper does not provide the same data as when I manually click on the links.
For example, when I visit this website, https://salesweb.civilview.com/Sales/SalesSearch
If I click on the first 'Details' entry in the list, it takes me to a page with more information.
The link that is given is a relative one: '/Sales/SaleDetails?PropertyId=254119896'.
I've scraped the 'Details' relative link and then rebuilt it into the absolute address.
This address becomes
https://salesweb.civilview.com/Sales/SaleDetails?PropertyId=254119896
However, when I do this and try to scrape, I get a totally different set of data, and it takes me to a general landing page:
https://salesweb.civilview.com/
At first I thought I needed to use a headless browser to fix the problem, but now I am not sure.
Here is my code:
import time
from selenium import webdriver

baseurl = 'https://salesweb.civilview.com'
link = '/Sales/SaleDetails?PropertyId=254119946'
url1 = baseurl + link

driver = webdriver.PhantomJS()
driver.get(url1)
time.sleep(10)  # wait for the page to load *before* grabbing the source
html = driver.page_source
driver.quit()
I found a workaround: if you first interact with the website, you can access the other URLs. I have no idea why it works; most likely the first click establishes a session cookie that the detail pages require, and without it the site bounces you to the landing page:
driver = webdriver.PhantomJS()
driver.get("https://salesweb.civilview.com/")
driver.find_element_by_link_text('Atlantic County, NJ').click()
driver.get("https://salesweb.civilview.com/Sales/SaleDetails?PropertyId=254119946")
html = driver.page_source
print(html)
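If you need many of the 840 detail pages, the same workaround should generalize: interact with the site once to establish the session, then loop over the rebuilt URLs (the PropertyIds below are the two from the question; the parsing step is only sketched):

import time
from selenium import webdriver

driver = webdriver.PhantomJS()
driver.get("https://salesweb.civilview.com/")
driver.find_element_by_link_text('Atlantic County, NJ').click()  # first interaction establishes the session
for prop_id in ['254119946', '254119896']:
    driver.get("https://salesweb.civilview.com/Sales/SaleDetails?PropertyId=" + prop_id)
    time.sleep(2)  # give each detail page time to load
    html = driver.page_source
    # ... parse html here ...
driver.quit()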
I am trying to log into a website and then, once logged in, navigate to a different page on the site while remaining logged in, using Selenium. However, when I try to navigate to the different page, I find I have been logged out.
I believe this is because I do not understand how the webdriver.Firefox().get() function works exactly.
My code:
from selenium import webdriver
from Code.Other import XMLParser
#Initialise driver and go to webpage
driver = webdriver.Firefox()
URL = 'http://www.website.com'
driver.get(URL)
#Login
UserName = XMLParser.XMLParse('./Config.xml','UserName')
Password = XMLParser.XMLParse('./Config.xml','Password')
element = driver.find_elements_by_id('UserName')
element[0].send_keys(UserName)
element = driver.find_elements_by_id('Password')
element[0].send_keys(Password)
element = driver.find_elements_by_id('Submit')
element[0].click()
#Go to new page
URL = 'http://www.website.com/page1'
driver.get(URL)
Unfortunately I am navigated to the new page but I am no longer logged in. How do I fix this?
It looks like the website doesn't have enough time to react to your submission of the authorization form: you click Submit, but you don't wait for the response before opening another URL.
Wait for some event that follows the login (such as cookies being set, a change in the DOM, or just a time.sleep) and only then go to the other page.
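For example, with an explicit wait right after the submit (the 'Logout' element id is hypothetical; wait on whatever only appears once you are logged in):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# ...continuing from element[0].click() in the question's code:
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, 'Logout'))  # hypothetical post-login element
)
driver.get('http://www.website.com/page1')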
P.S.: if that doesn't help, check your cookies after logging in and again after opening the new URL; it may be a problem with the authorization backend or with the webdriver.