I've searched the web for how to access a website using requests when the site asks the user to complete a CAPTCHA form before they can access it.
As of now, I understand the process should be:
visit the site using selenium
from selenium import webdriver
browser = webdriver.Chrome('chromedriver.exe')
browser.get('link-to-site')
complete the captcha form
save the cookies from that selenium session (since somehow these cookies will contain data showing that you've completed the captcha)
import pickle
input('cookies ready ?')
with open("cookies.pkl", "wb") as f:
    pickle.dump(browser.get_cookies(), f)
open a request session
get the site
import requests
session = requests.session()
r = session.get('link-to-site')
then load the cookies in
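with open('cookies.pkl', 'rb') as f:
    # the file was written with pickle, and get_cookies() returns a list of dicts
    cookies = requests.utils.cookiejar_from_dict({c['name']: c['value'] for c in pickle.load(f)})
session.cookies.update(cookies)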
But I'm still unable to access the site, so I assume the Google CAPTCHA still counts as unsolved when I use requests.
There must be a correct way to go about this. What am I missing?
You need to load the site after setting the cookies; otherwise, the response is what you would get without any cookies. That said, you will normally need to submit the form with Selenium and then collect the cookies, because completing a CAPTCHA doesn't normally set a cookie by itself.
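A minimal sketch of that ordering, assuming cookies.pkl was written by the Selenium snippet in the question. The names cookies_to_dict and fetch_with_saved_cookies are made up for illustration, and whether the site accepts the replayed cookies is not guaranteed.

```python
import pickle


def cookies_to_dict(selenium_cookies):
    """Reduce Selenium's list of cookie dicts to a name -> value mapping."""
    return {c["name"]: c["value"] for c in selenium_cookies}


def fetch_with_saved_cookies(url, cookie_file="cookies.pkl"):
    """Load the pickled Selenium cookies first, then request the page."""
    # requests is imported here so the helper above works without it installed
    import requests

    # The file was written with pickle, so read it back with pickle in binary mode
    with open(cookie_file, "rb") as f:
        selenium_cookies = pickle.load(f)

    session = requests.Session()
    session.cookies.update(
        requests.utils.cookiejar_from_dict(cookies_to_dict(selenium_cookies))
    )
    # Only now fetch the page, so the cookies are sent along with the request
    return session.get(url)
```

Usage would be something like r = fetch_with_saved_cookies('link-to-site'), called after the Selenium step has saved the cookies.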
Requirement
Get the final URL after the redirect chain, as happens in a normal browser, using Selenium.
The URLs are article URLs obtained from Twitter.
Behavior on normal Desktop browser after viewing the redirect and headers:
The URL is a Twitter URL that returns a 301 (Moved Permanently) response.
The browser then follows the Location header to a shortened URL, which in turn returns a 302 response.
It follows the redirect chain again and lands on the final page.
Behavior using Selenium
It finally redirects to the website's homepage/index rather than the actual article page. The final URL is not the same as the actual one.
Initial basic code
chrome_options = Options()
chrome_options.add_argument('--user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:70.0) Gecko/20100101 Firefox/70.0')
#above user agent is an example but multiple different user agents were tried
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
browser = webdriver.Chrome(driver_location,chrome_options=chrome_options)
browser.set_page_load_timeout(45)
browser.get(url)
final_url = browser.current_url
Various attempts to get the final url instead of the main website home/index page
With normal wait
browser.get(url)
time.sleep(25)
With WebDriverWait-
WebDriverWait(browser,20)
with expected_conditions
WebDriverWait(browser, 20).until(EC.url_contains("portion of the final url"))
ends up timing out every time, even with different conditions such as url_to_be.
Behavior when trying non-Selenium options
1. Wget -
Below is the response from a wget call, edited to obscure actual details -
Resolving t.co (t.co)...,
... Connecting to t.co (t.co)|:443... connected. HTTP
request sent, awaiting response... 301 Moved Permanently Location:
[following]
Resolving domain (domain)...
... Connecting to ... connected.
HTTP request sent, awaiting response... 302 Found Location:
[following]
--Date-- Resolving website (website)... ip ,
connected. HTTP request sent, awaiting response... 200 OK
As seen, we finally got the homepage rather than the article page.
Requests library -
response = requests.get(Url, allow_redirects=True, headers=self.headers, timeout=30)
(The headers contain a user agent; I also tried the exact request headers from a browser that gets the proper final URL.) This gets the homepage.
Checking the redirects via response.history, we see that from t.co (the Twitter URL) we are redirected to the short URL, then to the website homepage, and the chain ends there.
urllib library -
Same final response.
Example test URL - t.co/Mg3IYF8ZLm?amp=1 (add the https:// prefix; I removed it for posting)
After days of different approaches, I am stuck. I think Selenium is the key to resolving this: since it works in normal desktop browsers, it should work with Selenium, right?
Edit: This seems to happen with other driver and Selenium versions too. It would be great if we could at least find out why it happens with certain links such as the example given.
I used your code as is.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
chrome_options = Options()
chrome_options.add_argument('--user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:70.0) Gecko/20100101 Firefox/70.0')
#above user agent is an example but multiple different user agents were tried
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
browser = webdriver.Chrome(chrome_options=chrome_options)
browser.set_page_load_timeout(120)
browser.get("url")
final_url = browser.current_url
print(final_url)
The issue is that, after the intermediate URL, the redirect to the final page fails because of missing headers: the intermediate URL (http://www.cumhuriyet.com.tr/) is blocked with a 403 status code.
Once we add the headers to the request, the browser navigates to the final URL.
I have given a Java implementation below. Sorry, I am not familiar with doing this in Python. Please convert this code to Python; it may solve the issue.
@Test
public void RUN() {
    BrowserMobProxy proxy = new BrowserMobProxyServer();
    proxy.start(0);
    Proxy seleniumProxy = ClientUtil.createSeleniumProxy(proxy);
    proxy.addRequestFilter((request, contents, messageInfo)->{
        request.headers().add("Accept-Language", "en-US,en;q=0.5");
        request.headers().add("Upgrade-Insecure-Requests", "1");
        return null;
    });
    String proxyOption = "--proxy-server=" + seleniumProxy.getHttpProxy();
    ChromeOptions options = new ChromeOptions();
    options.addArguments("--headless");
    options.addArguments(proxyOption);
    System.setProperty("webdriver.chrome.driver","<driverLocation>");
    WebDriver driver = new ChromeDriver(options);
    driver.get("<url>");
    System.out.println(driver.getCurrentUrl());
}
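For Python, here is a rough sketch of the same idea. It is not a port of the BrowserMob proxy setup above: as a simpler substitute it sends the same two headers with the requests library and lists every hop of the redirect chain. The names EXTRA_HEADERS, redirect_chain and resolve_final_url are made up for illustration, and I have not verified that the site accepts this.

```python
# The two headers the Java request filter adds
EXTRA_HEADERS = {
    "Accept-Language": "en-US,en;q=0.5",
    "Upgrade-Insecure-Requests": "1",
}


def redirect_chain(response):
    """Return (status_code, url) for every hop, ending with the final response."""
    hops = [(r.status_code, r.url) for r in response.history]
    hops.append((response.status_code, response.url))
    return hops


def resolve_final_url(url, user_agent=None, timeout=30):
    """Follow redirects with the extra headers and return the list of hops."""
    # requests is imported here so redirect_chain works without it installed
    import requests

    headers = dict(EXTRA_HEADERS)
    if user_agent is not None:
        headers["User-Agent"] = user_agent
    response = requests.get(url, headers=headers, allow_redirects=True, timeout=timeout)
    return redirect_chain(response)
```

The last tuple returned by resolve_final_url holds the final URL. A 403 among the hops would suggest that the headers are still not enough for that site.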
I need to get a cookie from a specific request. The problem is that it is generated where I can't see it, so I need to use Selenium to simulate opening a browser and generate it myself. The second problem is that I can't access the request cookie. The cookie I need is in the request, not the response.
from selenium import webdriver
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary
binary = FirefoxBinary('/usr/bin/firefox')
driver = webdriver.Firefox(firefox_binary=binary)
driver.get('http://www.princess.com/find/searchResults.do')
driver.find_elements_by_xpath('//*[@id="LSV010"]/div[3]/div[1]/div[1]/button')[0].click()
This code block opens the page and on the second result, clicks the "View all dates and pricing" link. The cookie is sent there but by the browser, not as a response. I need to get my hands on that cookie. Other libraries are ok if they can do the job.
If you go to the page manually, this is what I need:
I have selected the request and the cookie I need and, as it shows, it is in the request, not the response. Is this possible to achieve?
I found out how this is done. Using the Selenium library, I managed to get this working:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
def fix_header():
    # profile is a Firefox profile defined elsewhere (not shown)
    browser = webdriver.Firefox(executable_path='geckodriver.exe', firefox_profile=profile)
    browser.get('https://www.princess.com/find/searchResults.do')
    browser.find_element_by_xpath(
        "//*[@class='expand-table view-all-link util-link plain-text-btn gotham-bold']")
    # wait until the "view all" link is visible
    WebDriverWait(browser, 60).until(EC.visibility_of_any_elements_located(
        (By.XPATH, "//*[@class='expand-table view-all-link util-link plain-text-btn gotham-bold']")))
    try:
        browser.find_element_by_xpath(
            "//*[@class='expand-table view-all-link util-link plain-text-btn gotham-bold']").click()
    except Exception:
        # a popup may be covering the link: close it and click again
        browser.find_element_by_class_name('mfp-close').click()
        browser.find_element_by_xpath(
            "//*[@class='expand-table view-all-link util-link plain-text-btn gotham-bold']").click()
    cookies = browser.get_cookies()
    browser.close()
    # build a "name=value; name=value" string from the cookies
    chrome_cookie = ''
    for c in cookies:
        chrome_cookie += c['name']
        chrome_cookie += '='
        chrome_cookie += c['value']
        chrome_cookie += "; "
    return chrome_cookie[:-2]
Selenium actually goes to the page and "clicks" the url i need with a browser and gets the needed cookies.
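The string built at the end of fix_header() has the shape of a Cookie request header. As a small sketch, the same string can be produced with str.join; cookie_header is just an illustrative name, and sending it with requests is my assumption about how the result would be used.

```python
def cookie_header(selenium_cookies):
    """Join Selenium cookie dicts into a 'name=value; name=value' header string."""
    return "; ".join(
        "{}={}".format(c["name"], c["value"]) for c in selenium_cookies
    )


def get_with_cookie_header(url, selenium_cookies, timeout=30):
    """Send the cookies as a raw Cookie header (assumed use of the string)."""
    # requests is imported here so cookie_header works without it installed
    import requests

    headers = {"Cookie": cookie_header(selenium_cookies)}
    return requests.get(url, headers=headers, timeout=timeout)
```

For an empty cookie list this returns an empty string, the same as the chrome_cookie[:-2] slice does.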
What I am trying to achieve
I am trying to log in to a website where cookies must be enabled, using headless Selenium with PhantomJS as the driver.
Problem
I first recorded the procedure using Selenium IDE where it works fine using Firefox (not headless). Then I exported the code to Python and now I can't log in because it's throwing an error saying "Can only set Cookies for the current domain". I don't know why I am getting this problem, am I not on the correct domain?
Code
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select
import unittest, time, re
self.driver = webdriver.PhantomJS()
self.driver.implicitly_wait(30)
self.base_url = "https://login.example.com"
driver = self.driver
driver.get(self.base_url)
all_cookies = self.driver.get_cookies()
# It prints out all cookies and values just fine
for cookie in all_cookies:
    print(cookie['name'] + " --> " + cookie['value'])
# Set cookies to driver
for s_cookie in all_cookies:
    c = { s_cookie['name'] : s_cookie['value']}
    # This is where it's throwing an error saying "Can only set Cookies for current domain"
    driver.add_cookie(c)
...
What I've tried
I've tried saving the cookies in a dict, going to another domain, going back to original domain and added the cookies and then trying to log in but it still doesn't work (as suggested in this thread)
Any help is appreciated.
Investigate each cookie pair. I ran into a similar issue, and some of the cookies belonged to Google. You need to make sure cookies are added only to the current domain and also belong to that same domain; otherwise your exception is expected. On a side note, if I recall correctly, you cannot add cookies when using localhost, if that is what you are doing; change to the IP address. Also, inspect the cookies you are getting, especially the domain and expiry information, and see whether they are returning null.
Edit
I did this simple test on Gmail to show what you have done wrong. At first I did not notice that you are grabbing a partial cookie, just a name/value pair, and adding that to the domain. Since that cookie does not have any domain, path, expiry, etc. information, the driver tried to add it to the current domain (127.0.0.1) and threw a misleading message that did not quite make sense. Note: to be a valid cookie, it must have the correct domain and expiry information, which yours is missing.
import unittest
from selenium.webdriver.common.by import By
from selenium import webdriver
__author__ = 'Saifur'
class CookieManagerTest(unittest.TestCase):
    def setUp(self):
        self.driver = webdriver.PhantomJS("E:\\working\\selenium.python\\selenium\\resources\\phantomjs.exe")
        self.driver.get("https://accounts.google.com/ServiceLogin?service=mail&continue=https://mail.google.com/mail/")
        self.driver.find_element(By.ID, "Email").send_keys("userid")
        self.driver.find_element(By.ID, "next").click()
        self.driver.find_element(By.ID, "Passwd").send_keys("supersimplepassword")
        self.driver.find_element(By.CSS_SELECTOR, "[type='submit'][value='Sign in']").click()
        self.driver.maximize_window()
    def test(self):
        driver = self.driver
        listcookies = driver.get_cookies()
        for s_cookie in listcookies:
            # this is what you are doing
            c = {s_cookie['name']: s_cookie['value']}
            print("*****The partial cookie info you are doing*****\n")
            print(c)
            # Should be done
            print("The Full Cookie including domain and expiry info\n")
            print(s_cookie)
            # driver.add_cookie(s_cookie)
    def tearDown(self):
        self.driver.quit()
Console output:
D:\Python34\python.exe "D:\Program Files (x86)\JetBrains\PyCharm Educational Edition 1.0.1\helpers\pycharm\utrunner.py" E:\working\selenium.python\selenium\python\FirstTest.py::CookieManagerTest true
Testing started at 9:59 AM ...
*******The partial cookie info you are doing*******
{'PREF': 'ID=*******:FF=0:LD=en:TM=*******:LM=*******:GM=1:S=*******'}
The Full Cookie including domain and expiry info
{'httponly': False, 'name': '*******', 'value': 'ID=*******:FF=0:LD=en:TM=*******:LM=1432393656:GM=1:S=iNakWMI5h_2cqIYi', 'path': '/', 'expires': 'Mon, 22 May 2017 15:07:36 GMT', 'secure': False, 'expiry': *******, 'domain': '.google.com'}
Notice: I just replaced some info with ******* on purpose
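To make the "only cookies that belong to the current domain" advice concrete, here is a small sketch that filters a saved cookie list against the driver's current URL before adding anything. The helper names are made up for illustration, and the matching rule (exact host or a parent domain) is a simplification of real cookie domain matching.

```python
from urllib.parse import urlparse


def domain_matches(cookie_domain, host):
    """True if host equals the cookie domain or is a subdomain of it."""
    d = cookie_domain.lstrip(".").lower()
    host = host.lower()
    return host == d or host.endswith("." + d)


def cookies_for_url(cookies, current_url):
    """Keep only the full cookie dicts whose domain fits the current page."""
    host = urlparse(current_url).hostname or ""
    # A cookie without a 'domain' key is treated as belonging to the current host
    return [c for c in cookies if domain_matches(c.get("domain", host), host)]
```

With a live driver this would be used roughly as: for c in cookies_for_url(saved_cookies, driver.current_url): driver.add_cookie(c), passing the full cookie dict rather than a bare name/value pair.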
I was going to just add a comment at the bottom of what @Saifur said above, but I figured I had enough new content to warrant a full answer.
The revelation for me, having this exact same error, was that using Selenium works exactly the same as if you were actually opening up your browser and physically clicking and typing things. With this in mind, if you enter the user/pass via Selenium and call click(), your Selenium driver will, upon successful authentication, automatically have the cookie in it. That removes any need to force in my saved (and probably soon to expire) cookie. I felt a little silly realizing this. It made everything so much simpler.
Using @Saifur's code above as a template, I've made some adjustments and removed the extra class, which I feel is a bit excessive for this example.
url = 'http://domainname.com'
url2 = 'http://domainname.com/page'
USER = 'superAwesomeRobot'
PASS = 'superSecretRobot'
# initiates your browser
driver = webdriver.PhantomJS()
# browses to your desired URL
driver.get(url)
# searches for the user or email field on the page, and inputs USER
driver.find_element_by_id("email").send_keys(USER)
# searches for the password field on the page, and inputs PASS
driver.find_element_by_id("pass").send_keys(PASS)
# finds the login button and clicks it; you're in
driver.find_element_by_id("loginbutton").click()
From here you can browse to the page you want to address:
driver.get(url2)
Note: if you have a modern site that auto-loads content when you scroll down, it might be handy to use this:
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
I would also like to note, @simeg, that Selenium is supposed to wait automatically until the page reports that it has loaded (and yes, I've had the AJAX issue being referred to, so sometimes it is necessary to wait a few seconds, but what page takes more than 30 seconds to load?!). The way you're running your wait command just waits for PhantomJS to load, not the actual page itself, so it seems of no use to me considering the built-in behavior:
The driver.get method will navigate to a page given by the URL. WebDriver will wait until the page has fully loaded (that is, the “onload” event has fired) before returning control to your test or script. It’s worth noting that if your page uses a lot of AJAX on load then WebDriver may not know when it has completely loaded.
source: http://selenium-python.readthedocs.io/getting-started.html#example-explained
Hope this helps somebody!
Some web pages put too many keys in their cookies, keys not supported by WebDriver, and then you get "errorMessage":"Can only set Cookies for the current domain" even though you are 100% sure that you are setting cookies for the current domain. An example of such a web page is "https://stackoverflow.com/". In this case, you need to make sure that only the required keys are added to the cookies, as mentioned in some previous posts.
driver.add_cookie({k: cookie[k] for k in ('name', 'value', 'domain', 'path', 'expiry') if k in cookie})
In contrast, some web pages put too few keys in their cookies, omitting ones required by WebDriver, and then you get the same "errorMessage":"Can only set Cookies for the current domain" after you fix the first problem. An example of such a web page is "https://github.com/". You need to add the key 'expiry' to the cookies for this web page.
for k in ('name', 'value', 'domain', 'path', 'expiry'):
    if k not in list(cookie.keys()):
        if k == 'expiry':
            cookie[k] = 1475825481
Putting them all together, the complete code is as below:
# uncomment one of the following three URLs to test
#targetURL = "http://pythonscraping.com"
targetURL = "https://stackoverflow.com/"
#targetURL = "https://github.com/"
from selenium import webdriver
driver = webdriver.PhantomJS()
driver.get(targetURL)
driver.implicitly_wait(1)
#print(driver.get_cookies())
savedCookies = driver.get_cookies()
driver2 = webdriver.PhantomJS()
driver2.get(targetURL)
driver2.implicitly_wait(1)
driver2.delete_all_cookies()
for cookie in savedCookies:
    # fix the 2nd problem
    for k in ('name', 'value', 'domain', 'path', 'expiry'):
        if k not in list(cookie.keys()):
            if k == 'expiry':
                cookie[k] = 1475825481
    # fix the 1st problem
    driver2.add_cookie({k: cookie[k] for k in ('name', 'value', 'domain', 'path', 'expiry') if k in cookie})
    print(cookie)
driver2.get(targetURL)
driver2.implicitly_wait(1)
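The two fixes can also be folded into one helper. This is only a sketch: normalize_cookie and lifetime_seconds are illustrative names, and the default expiry is computed relative to the current time because the fixed value 1475825481 used above is a timestamp from October 2016. I have not checked how each driver treats a cookie whose expiry is already in the past.

```python
import time

# The keys kept by the "1st problem" fix above
ALLOWED_KEYS = ("name", "value", "domain", "path", "expiry")


def normalize_cookie(cookie, lifetime_seconds=3600, now=None):
    """Drop unsupported keys and fill in a missing 'expiry' with a future time."""
    now = int(time.time()) if now is None else int(now)
    cleaned = {k: cookie[k] for k in ALLOWED_KEYS if k in cookie}
    # Only add an expiry when the site did not provide one
    cleaned.setdefault("expiry", now + lifetime_seconds)
    return cleaned
```

In the loop above, this would replace both steps with driver2.add_cookie(normalize_cookie(cookie)).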
I am trying to log into a website and then, once logged in, navigate to a different page on the website while remaining logged in, using Selenium. However, when I try to navigate to the different page, I find that I have been logged out.
I believe this is because I do not understand how the webdriver.Firefox().get() function works exactly.
My code:
from selenium import webdriver
from Code.Other import XMLParser
#Initialise driver and go to webpage
driver = webdriver.Firefox()
URL = 'http://www.website.com'
driver.get(URL)
#Login
UserName = XMLParser.XMLParse('./Config.xml','UserName')
Password = XMLParser.XMLParse('./Config.xml','Password')
element = driver.find_elements_by_id('UserName')
element[0].send_keys(UserName)
element = driver.find_elements_by_id('Password')
element[0].send_keys(Password)
element = driver.find_elements_by_id('Submit')
element[0].click()
#Go to new page
URL = 'http://www.website.com/page1'
driver.get(URL)
Unfortunately I am navigated to the new page but I am no longer logged in. How do I fix this?
It looks like the website doesn't have enough time to react to your submit on the authorization form. You click submit, but you don't wait for the response before opening another URL.
Wait for some event after login (such as a cookie being set, a change in the DOM, or just time.sleep) and only then go to another page.
P.S.: If that doesn't help, check your cookies after login and again after you open the new URL; maybe the problem is with the authorization backend or the webdriver.
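As a sketch of the "wait for a cookie" option: the helper below polls driver.get_cookie() until a named cookie appears or a timeout passes. wait_for_cookie is an illustrative name and the cookie name is hypothetical; you would need to look up which cookie your site actually sets on login.

```python
import time


def wait_for_cookie(driver, name, timeout=10.0, poll=0.5):
    """Poll the driver until the cookie `name` exists, or raise TimeoutError."""
    deadline = time.monotonic() + timeout
    while True:
        cookie = driver.get_cookie(name)  # Selenium returns None if it is absent
        if cookie is not None:
            return cookie
        if time.monotonic() >= deadline:
            raise TimeoutError(
                "cookie {!r} not set within {} s".format(name, timeout)
            )
        time.sleep(poll)
```

In the question's script this would go between element[0].click() and the second driver.get(URL), for example wait_for_cookie(driver, 'sessionid'), where 'sessionid' is only a placeholder name.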