Selenium get() redirects to another url - python

I'm trying to navigate to the following page and extract its HTML: https://www.automobile.it/annunci?b=data&d=DESC. But every time I call the get() method, the website redirects me to another page, always the same one: https://www.automobile.it/torrenova?radius=100&b=data&d=DESC.
Here's the simple code I'm running:
from selenium import webdriver
driver = webdriver.Chrome(executable_path=ex_path)
driver.get("https://www.automobile.it/annunci?b=data&d=DESC")
html=driver.page_source
If I do the same thing using the requests module, I don't get redirected:
import requests
html=requests.get("https://www.automobile.it/annunci?b=data&d=DESC")
I don't understand why it's behaving like this. Any ideas?

Use driver.delete_all_cookies():
from selenium import webdriver
driver = webdriver.Chrome(executable_path=ex_path)
driver.delete_all_cookies()
driver.get("https://www.automobile.it/annunci?b=data&d=DESC")
html=driver.page_source
PS: also be warned that page_source will not give you the complete DOM as rendered.
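If you need the rendered DOM, one option is to wait for a known element before reading page_source. A minimal sketch using the usual expected_conditions helpers; the "div.listing" selector is a placeholder, not taken from the actual site:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome(executable_path=ex_path)
driver.delete_all_cookies()
driver.get("https://www.automobile.it/annunci?b=data&d=DESC")

# Wait up to 10 seconds for the results container to appear before
# grabbing the source; "div.listing" is a hypothetical selector.
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "div.listing"))
)
html = driver.page_source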

Well, you can clear the browser cache using the code below.
I am assuming that you are using Chrome.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
driver = webdriver.Chrome(executable_path=ex_path)
driver.get('chrome://settings/clearBrowserData')
driver.find_element_by_xpath('//settings-ui').send_keys(Keys.ENTER)
driver.get("https://www.automobile.it/annunci?b=data&d=DESC")

Related

Use Selenium to download multiple URLs as PDFs

I am trying to have Selenium download the URLs of a webpage as PDFs on Safari. So far I have been able to open the URL, but I can't get Safari to download it. All the solutions I found so far were either for another browser or they didn't work. Ideally I would like it to download all links of one page and then move on to the next page.
At first I thought that clicking on each hyperlink and then downloading it was the way to go. But that would require switching windows each time, so then I tried to find a way to download it without having to click on it, but nothing worked.
I am quite new to programming, so I am sure I am missing something.
import selenium
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pdfkit
browser = webdriver.Safari()
browser.get(a_base_url)
username_field = browser.find_element_by_name("tb_LoginName")
password_field = browser.find_element_by_name("tb_Password")
submit = browser.find_element_by_id("btn_Login")
username_field.send_keys(username)  # username/password strings defined elsewhere
password_field.send_keys(password)
submit.click()
element = WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.XPATH, '//*[@id="maincolumn"]/div/table/tbody/tr[2]/td[9]/a[2]'))).click()
browser.switch_to_window(browser.window_handles[0])
url=browser.current_url
I would go for the following approach (see the sketch after this list):
Get the href attribute of the link you want to download via the WebElement.get_attribute() function
Use the urllib or requests library to retrieve the URL from step 1 without using the browser
Most probably you will also need to get the browser cookies via the WebDriver.get_cookies() function and add them to the Cookie header for your download request
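A rough sketch of those three steps, assuming the PDF links are plain anchor elements (the locator and output file name below are placeholders):
import requests

# Step 1: pull the direct link out of the anchor element
link = browser.find_element_by_xpath('//a[contains(@href, ".pdf")]')  # hypothetical locator
pdf_url = link.get_attribute("href")

# Steps 2 and 3: copy the browser session's cookies into requests so the
# download is authenticated, then fetch the file without the browser
session = requests.Session()
for cookie in browser.get_cookies():
    session.cookies.set(cookie["name"], cookie["value"])

response = session.get(pdf_url)
with open("download.pdf", "wb") as f:
    f.write(response.content)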

Eaten by the net? Crawler survival on this website using Python

I am making a web crawler to get information from http://www.caam.org.cn/hyzc, but it shows me HTTP Error 302 and I cannot fix it.
https://imgur.com/a/W0cykim
The picture gives you a rough idea of this website's special layout: while you are browsing it, a window pops up telling you that the website is accelerating because so many people are online, and then redirects you to the site. As a result, when I use my web crawler, all I get is the information on this window and nothing from the site itself. I think this is a good way for the website keeper to get rid of our web crawlers, so I want to ask for your help getting useful information from this website.
At first I used Python's requests module for my crawler, and I only got the information on that window; the results are shown here: https://imgur.com/a/GLcpdZn
Then I disabled redirects and got HTTP Error 303, shown here:
https://imgur.com/a/6YtaVOt
This is the latest code I used:
import requests

def getpage(url):
    try:
        r = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}, timeout=10)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return "try again"

url = "http://www.caam.org.cn/hyzc"
print(getpage(url))
The expected outcome is to get useful information from the website http://www.caam.org.cn/hyzc. We may need to deal with the pop-up window.
It looks like this website has some kind of protection against crawlers that use requests; the page is not entirely loaded when you send a GET request.
You can try to emulate a browser using selenium:
from selenium import webdriver
driver = webdriver.Chrome()
driver.get('http://www.caam.org.cn/hyzc')
print(driver.page_source)
driver.close()
driver.page_source will contain the page source.
You can learn how to set up the Selenium webdriver here.
I added something to delay the closure of my crawler and it worked, so I want to share my lines in case you meet a similar problem in the future:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
options = Options()
driver = webdriver.Chrome(chrome_options=options)
driver.get('http://www.caam.org.cn')
body = driver.find_element_by_tag_name("body")
wait = WebDriverWait(driver, 5, poll_frequency=0.05)
wait.until(EC.staleness_of(body))
print(driver.page_source)
driver.close()
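A note on why this works: EC.staleness_of waits until the original body element has been detached from the DOM, which is exactly what happens when the "accelerating" interstitial redirects to the real page; once it fires, page_source reflects the new document.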

Reopen same browser window using selenium python and Firefox

Hey, I'm trying to make a program that sends WhatsApp messages automatically.
I'm currently using Python, Firefox, and Selenium to achieve that.
The problem is that every time I call driver.get(url), it opens a new instance of the Firefox browser, blank, with no memory of the last run. That makes me scan the QR code every time I run it.
from selenium import webdriver
from selenium.webdriver.firefox.webdriver import FirefoxProfile
cp_profile = webdriver.FirefoxProfile("/Users/Hodai/AppData/Roaming/Mozilla/Firefox/Profiles/v27qat5d.whatsapp_profile")
driver = webdriver.Firefox(executable_path="/Users/Hodai/Desktop/geckodriver",firefox_profile=cp_profile)
driver.get('http://web.whatsapp.com')
#Scan the code before proceeding further
input('Enter anything after scanning QR code')
I've tried to use a profile, but it seems to have no effect.
cp_profile = webdriver.FirefoxProfile("/Users/Hodai/AppData/Roaming/Mozilla/Firefox/Profiles/v27qat5d.whatsapp_profile")
driver = webdriver.Firefox(executable_path="/Users/Hodai/Desktop/geckodriver",firefox_profile=cp_profile)
In the end I used chromedriver to achieve my goal.
I tried cookies with pickle, but it was a bit tricky because it remembered the cookies only for the same domain,
so I used Chrome's user data directory instead.
Now it works like a charm. Thank you all.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
options = Options()
options.add_argument("user-data-dir=C:/Users/Designer1/AppData/Local/Google/Chrome/User Data/Profile 1")
driver = webdriver.Chrome(chrome_options=options,executable_path="C:\webdrivers\chromedriver.exe")
The easiest way, I think, is to save your cookies after scanning the QR code and push them to Selenium manually.
# Load page to be able to set cookies
driver.get('http://web.whatsapp.com')
# Set saved cookies
cookies = {'name1': 'value1', 'name2': 'value2'}
for name in cookies:
    driver.add_cookie({
        'name': name,
        'value': cookies[name],
    })
# Load page using cookies
driver.get('http://web.whatsapp.com')
To get your cookies you can use the browser console (F12): in the Network tab, right-click on the request, then Copy => Copy Request Headers.
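If copying headers by hand is too fiddly, here is a minimal sketch of the same idea with pickle: save the cookies once after scanning the QR code, then restore them on later runs. The file name is arbitrary, and since WhatsApp may keep session state outside cookies as well, this is not guaranteed to skip the scan:
import pickle
from selenium import webdriver

driver = webdriver.Firefox()
driver.get('http://web.whatsapp.com')
input('Scan the QR code, then press Enter')

# First run: persist the session cookies to disk
pickle.dump(driver.get_cookies(), open('whatsapp_cookies.pkl', 'wb'))

# Later runs: open the domain once, restore the cookies, then reload
driver.get('http://web.whatsapp.com')
for cookie in pickle.load(open('whatsapp_cookies.pkl', 'rb')):
    driver.add_cookie(cookie)
driver.get('http://web.whatsapp.com')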
It should not behave like that: a new window opens only when the driver is initialized with a new variable or the program starts again. Here is code for Chrome; no matter how many times you call driver.get(url), it opens the URL in the same browser window.
from selenium import webdriver
import selenium.webdriver.support.ui as ui
from selenium.webdriver.common.keys import Keys
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
import time
driver = webdriver.Chrome(executable_path=r"C:\new\chromedriver.exe")
driver.get('https://www.olx.com.pk/lahore/apple/q-iphone-6s/?search%5Bfilter_float_price%3Afrom%5D=40000&search%5Bfilter_float_price%3Ato%5D=55000')
time.sleep(10)
driver.get('https://www.olx.com.pk/lahore/apple/q-iphone-6s/?search%5Bfilter_float_price%3Afrom%5D=40000&search%5Bfilter_float_price%3Ato%5D=55000')
time.sleep(10)
driver.get('https://www.olx.com.pk/lahore/apple/q-iphone-6s/?search%5Bfilter_float_price%3Afrom%5D=40000&search%5Bfilter_float_price%3Ato%5D=55000')
time.sleep(10)
Let me know if this resolves the issue or if you are trying to do something else.

Selenium Python - Getting the current URL of web browser?

I have this so far:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
driver = webdriver.Chrome('C:\Users\Fan\Desktop\chromedriver.exe')
url = driver.current_url
print url
It keeps saying that "driver" on line 4 is invalid syntax. How would I fix this?
Also, is there a way I can get all the currently open tabs, and not just a single one?
EDIT: the above code works now, but I have another problem!
The code now opens a new tab, and for some reason the URL bar has "data;" in it, and it prints data; as output.
But I want it to take the URL from an existing, already-open web browser. How do I solve this?
In Python you do not specify the type of a variable as is required in Java, which is the reason for the error. The same error will also happen because your last line starts with String.
Calling webdriver.Chrome() already returns a driver object, so the line webdriver driver = new webdriver() is not needed.
The new keyword is not used in Python to create a new object.
Try this:
from selenium import webdriver
driver = webdriver.Chrome()
url = driver.current_url
In order to extract the URL of the current page from the webdriver, read the current_url attribute:
from selenium import webdriver
import time
driver = webdriver.Chrome()
#Opens a known doi url
driver.get("https://doi.org/10.1002/rsa.1006")
#Gives the browser a few seconds to process the redirect
time.sleep(3)
#Retrieves the url after the redirect
#In this case https://onlinelibrary.wiley.com/doi/abs/10.1002/rsa.1006
url = driver.current_url
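Regarding the edit about reading the URL from a browser that is already open: Selenium cannot attach to an arbitrary window, but if Chrome was started with remote debugging enabled (e.g. chrome.exe --remote-debugging-port=9222), a driver can connect to that running instance. A sketch, assuming such a Chrome is running and a matching chromedriver:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
# Attach to the Chrome instance started with --remote-debugging-port=9222
options.add_experimental_option("debuggerAddress", "127.0.0.1:9222")
driver = webdriver.Chrome(chrome_options=options)

# The URL of the currently focused tab
print(driver.current_url)

# To list the URLs of all open tabs (also answers the second question above):
for handle in driver.window_handles:
    driver.switch_to.window(handle)
    print(driver.current_url)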

How to get all links on a web page using python and selenium IDE

I want to get all links from a web page using Selenium IDE and Python.
For example, if I search for "test" or anything else on Google, I want all the links related to that search.
Here is my code:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
baseurl="https://www.google.co.in/?gws_rd=ssl"
driver = webdriver.Firefox()
driver.get(baseurl)
driver.find_element_by_id("lst-ib").click()
driver.find_element_by_id("lst-ib").clear()
driver.find_element_by_id("lst-ib").send_keys("test")
link_name = driver.find_element_by_xpath(".//*[@id='rso']/div[2]/li[2]/div/h3/a")
print(link_name)
driver.close()
Output
<selenium.webdriver.remote.webelement.WebElement object at 0x7f0ba50c2090>
Using xpath $x(".//*[@id='rso']/div[2]/li[2]/div/h3/a") in Firebug's console.
Output
[a jtypes2.asp]
How can I get the links' content from this object?
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
baseurl="https://www.google.co.in/?gws_rd=ssl"
driver = webdriver.Firefox()
driver.get(baseurl)
driver.find_element_by_id("lst-ib").click()
driver.find_element_by_id("lst-ib").clear()
driver.find_element_by_id("lst-ib").send_keys("test")
driver.find_element_by_id("lst-ib").send_keys(Keys.RETURN)
driver.implicitly_wait(2)
link_name = driver.find_elements_by_xpath(".//*[@id='rso']/div/li/div/h3/a")
for link in link_name:
    print(link.get_attribute('href'))
Try the above code. Your code doesn't send a RETURN key after typing the search keyword. I've also added an implicit wait of 2 seconds to let the search results load, and changed the xpath to get all the links.
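Google's result markup changes often, so the xpath above can break. If you just want every link on the results page, here is a looser sketch of the same idea:
# Collect every anchor element on the page and print its target
links = driver.find_elements_by_tag_name('a')
for link in links:
    href = link.get_attribute('href')
    if href:  # skip anchors without an href
        print(href)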
