Selenium gets a 429 response code but Firefox private mode does not - Python

I used Selenium in Python 3 to open a page. It does not open under Selenium, but it does open in a Firefox private window.
What is the difference, and how can I fix it?
from selenium import webdriver
from time import sleep
driver = webdriver.Firefox()
driver.get('https://google.com') # creating a google cookie
driver.get_cookies() # check google gets cookies
sleep(3.0)
url='https://www.realestate.com.au/buy/in-sydney+cbd%2c+nsw/list-1'
driver.get(url)
Creating the Google cookie is not necessary; the cookie is not there in a Firefox private window either, and the page still works without it. Under Selenium, however, the behaviour is different.
I also see the website return an [HTTP/2 429 Too Many Requests 173ms] status, and the page is blank white. This does not happen in Firefox private mode.
UPDATE:
I turned on the persistent log. Firefox in private mode receives a 429 response too, but its JavaScript seems to resume loading from another URL. This only happens the first time.
Under Selenium, however, the request does not survive the 429 response. It does report something to the cdndex website; I have blocked that site, which is why you do not see the request go through there. Either way, the behaviour differs between Firefox and Selenium.
Selenium with persistent log:
Firefox with persistent log:

This is just my hunch after working with Selenium and WebDriver for a while: I suspect that the default user agent of Selenium is set to something lame by default, and that the server side recognizes this and serves you a silly HTTP code and a blank page as a result.
Try setting the user agent to something reasonable and/or disabling Selenium's interference with the browser defaults.
Another tip is to look at the request using Wireshark or a similar tool to see exactly what is sent over the wire.
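For Firefox, the user agent can be overridden through a profile preference. A minimal sketch, assuming a recent Selenium release (the UA string below is just an example of a stock desktop Firefox value, not a recommendation):
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

options = Options()
# Override the user agent the browser reports (example string; pick your own).
options.set_preference(
    "general.useragent.override",
    "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0",
)
driver = webdriver.Firefox(options=options)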

429 Too Many Requests
The HTTP 429 Too Many Requests response status code indicates the user has sent too many requests within a short period of time. The 429 status code is intended for use with rate-limiting schemes.
Root Cause
When your server detects that a user agent is trying to access a specific page too often in a short period of time, it triggers a rate-limiting feature. The most common example of this is when a user (or an attacker) repeatedly tries to log into a web application.
The server can also identify a client by a cookie rather than by its login credentials. Requests may also be counted per client, across a single server, or across several servers. So there are a variety of situations that can result in you seeing an error like one of these:
429 Too Many Requests
429 Error
HTTP 429
Error 429 (Too Many Requests)
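As an aside, a well-behaved client reacts to a 429 by backing off before retrying. A minimal sketch using the requests library (the retry policy below is an assumption for illustration, not part of the original answer):
import time
import requests

def get_with_backoff(url, max_retries=5):
    # Retry a GET, honouring the server's Retry-After hint on 429 responses.
    delay = 1.0
    for _ in range(max_retries):
        response = requests.get(url)
        if response.status_code != 429:
            return response
        # Prefer the server's Retry-After value; fall back to exponential backoff.
        time.sleep(float(response.headers.get("Retry-After", delay)))
        delay *= 2
    return response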
This use case
This use case looks like a classic instance of a Selenium-driven, GeckoDriver-initiated Firefox browsing context being detected as a bot, for a simple reason:
Selenium identifies itself: in a WebDriver-controlled browser, navigator.webdriver is set to true.
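You can confirm this signal yourself; a quick sketch (not part of the original answer):
from selenium import webdriver

driver = webdriver.Firefox()
driver.get("https://example.com")
# Prints True in a WebDriver-controlled browser; in a normal browser the
# property is false or undefined.
print(driver.execute_script("return navigator.webdriver"))
driver.quit()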
References
You can find a couple of relevant detailed discussions in:
How to Conceal WebDriver in Geckodriver from BotD in Java?
How can I make a Selenium script undetectable using GeckoDriver and Firefox through Python?

Related

2 factor authentication handling in selenium webdriver with python

I am logging in to a website with valid credentials, but if my network changes, or even my device changes (within the same network), it redirects to an authentication page where I have to enter an access code that I receive via email. I want to skip this authentication page and navigate to other pages to continue my process.
Expected result - Home page of the site
Actual result - Secure Access Code page
When you initialise your driver you can configure the browser to load your Chrome profile (that is, if you're using Chrome). This may allow you to bypass the authentication page if you have previously logged in with that profile. Not sure if this will work, but it's worth a shot.
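A rough sketch of what loading an existing profile looks like (the directory paths are placeholders; point them at your own Chrome user data directory):
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
# Placeholder paths: substitute your own user data directory and profile name.
options.add_argument("user-data-dir=/home/me/.config/google-chrome")
options.add_argument("profile-directory=Default")
driver = webdriver.Chrome(options=options)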

Some websites block selenium webdriver, how does this work?

So I'm trying to crawl clothing websites to build a list of great deals/products to look out for, but I notice that some of the websites I try to load don't load. How are websites able to block Selenium WebDriver HTTP requests? Do they look at the headers or something? Can you give me a step-by-step of how Selenium WebDriver sends requests and how the server receives them and is able to block them?
Selenium uses a real web browser (typically Firefox or Chrome) to make its requests, so the website probably has no idea that you're using Selenium behind the scenes.
If the website is blocking you, it's probably because of your usage patterns (i.e. you're clogging up their web server by making 1000 requests every minute. That's rude. Don't do that!)
One exception would be if you're using Selenium in "headless" mode with the HtmlUnitDriver. The website can detect that.
It's very likely that the website is blocking you because of your AWS IP.
Not only does that tell the website that somebody is probably scraping them programmatically, but most websites also limit the number of queries they will accept from any one IP address.
You most likely need a proxy service to pipe your requests through.
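For what it's worth, routing Firefox traffic through a proxy with Selenium looks roughly like this (the host and port are placeholders):
from selenium import webdriver

options = webdriver.FirefoxOptions()
options.set_preference("network.proxy.type", 1)  # 1 = manual proxy configuration
options.set_preference("network.proxy.http", "proxy.example.com")  # placeholder host
options.set_preference("network.proxy.http_port", 8080)            # placeholder port
options.set_preference("network.proxy.ssl", "proxy.example.com")
options.set_preference("network.proxy.ssl_port", 8080)
driver = webdriver.Firefox(options=options)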

Race condition in selenium python bindings?

I have run into a problem when running Selenium with the Python bindings.
I have a REST web service that I would like to call from the Selenium Firefox WebDriver with a pre-created session cookie. (The cookie is created earlier by a python requests call; I am just passing it to Selenium.)
To be able to add a cookie for a specific domain, I first run a dummy request to that domain and then set the cookie for the second, real request (if I don't do that, add_cookie throws an error):
from selenium import webdriver
from time import sleep

driver = webdriver.Firefox()
driver.get("http://url.com/preheat")  # dummy request so the cookie's domain matches
#sleep(10)
driver.add_cookie(cookie_dict=persistedCookie)  # cookie created earlier via requests
driver.get("http://url.com/realrequest")
The problem is that when I run the code above, the web framework cannot see any cookie set. If I uncomment the sleep, wait 10 seconds after the first request, and then set the cookie, everything works as desired.
(I also tried applying WebDriverWait to an element in the result document of the first request, but experienced the same thing.)
Is this expected behaviour? If yes, could anyone recommend a "deterministic" way of doing this?
Thanks,
Marcell
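One deterministic alternative to a fixed sleep (a sketch, not an answer from the thread) is to wait until the document reports that it has finished loading before adding the cookie:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Firefox()
driver.get("http://url.com/preheat")
# Block until the page has actually finished loading, instead of sleeping.
WebDriverWait(driver, 10).until(
    lambda d: d.execute_script("return document.readyState") == "complete"
)
driver.add_cookie(cookie_dict=persistedCookie)
driver.get("http://url.com/realrequest")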

Using Selenium with grid servers to visit multiple URLs in parallel while sharing the same cookies

If you don't want to waste your time on the details, please skip to the "Here is the problem" part. If you are very impatient, go directly to the last part of this question.
First of all, I'm not using Selenium for automated testing; I'm using it for scraping (collecting data) from specific websites. The reason for using Selenium is a long story that I'm not going to go into here.
This is the environment:
Client side:
Python 2.7
selenium 2.43.0
Server side:
CentOS
selenium 2.43.1 (both the hub and grids)
Firefox 32
1 selenium hub and multiple selenium grids (running on different servers)
Currently we have workable scrapers (or data collectors, if you prefer) based on this setup, but they scrape pages serially, and we have now decided to use multithreading to speed things up.
But the Selenium FAQ says:
WebDriver is not thread-safe.
So we are going to have multiple WebDriver (Firefox) instances in one scraper, visiting different URLs.
Here is the problem:
We need scrapers that scrape the same website via the same proxy to share cookies (and cache, if possible) between their WebDrivers. But we don't want scrapers to share cookies with other scrapers that scrape different websites or go through different proxies.
I've done some research on this.
I know the client can specify a profile path for the Selenium Firefox WebDriver (which runs on the grid server). But Selenium is not able to create and delete the profile automatically; we would have to do it ourselves, which means creating profiles dynamically and deleting them when no longer needed, because we don't know in advance how many scrapers/websites/proxies will be in use. This does not sound like a good idea.
The second choice is to sync the cookies in code, but Selenium prevents access to cookies that don't belong to the current domain, which becomes tricky when the website spans two or more domains. I could patch 2 JS files in the webdriver.xpi file to remove this limit (see here), but that requires patching the grid server, which sounds like a bad idea too.
So, is there any way to make the Selenium remote WebDrivers (Firefox instances) share cookies without modifying Selenium or having a "babysitter" program to look after the Selenium servers?
Thanks.
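For reference, the in-code cookie sync mentioned above looks roughly like this for a single domain (a sketch only; the hub and site URLs are placeholders, and it uses the Selenium 2.x desired-capabilities API that matches the question's environment):
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

hub = "http://hub.example.com:4444/wd/hub"  # placeholder hub URL
a = webdriver.Remote(command_executor=hub, desired_capabilities=DesiredCapabilities.FIREFOX)
b = webdriver.Remote(command_executor=hub, desired_capabilities=DesiredCapabilities.FIREFOX)

a.get("http://site.example.com/")  # placeholder site
b.get("http://site.example.com/")  # b must be on the domain before add_cookie works
for cookie in a.get_cookies():
    b.add_cookie(cookie)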

Selenium prevent redirect

Does Selenium automatically follow redirects? It seems that the WebDriver isn't loading the page I requested.
And if it does automatically follow redirects, is there any way to prevent this?
No. Selenium drives the browser as a regular user would, which means redirects are followed when the web application requests them, either via a 3xx HTTP status or via JavaScript.
If you consider the redirect problematic when it happens to real users, I suggest treating it as a legitimate bug in the application.
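One way to at least detect that a redirect happened (a sketch; the URL is a placeholder) is to compare the URL you requested with the one the browser ended up on:
from selenium import webdriver

driver = webdriver.Firefox()
requested = "http://example.com/some-page"  # placeholder URL
driver.get(requested)
if driver.current_url != requested:
    print("Redirected to:", driver.current_url)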
