I'm trying to crawl clothing websites to build a list of great deals/products to watch, but I notice that some of the websites I try to load simply don't load. How are websites able to block Selenium WebDriver HTTP requests? Do they look at the headers or something? Can you give me a step-by-step of how Selenium WebDriver sends requests, and how the server receives them and is able to block them?
Selenium uses a real web browser (typically Firefox or Chrome) to make its requests, so the website probably has no idea that you're using Selenium behind the scenes.
If the website is blocking you, it's probably because of your usage patterns (i.e. you're clogging up their web server by making 1000 requests every minute. That's rude. Don't do that!)
One exception would be if you're using a headless driver such as HtmlUnitDriver, which is not a real browser; the website can detect that.
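That said, a site's own JavaScript can also fingerprint an automated browser directly. A minimal sketch of one well-known signal, the navigator.webdriver flag, which a Selenium-driven browser exposes (the URL is just a placeholder):

from selenium import webdriver

driver = webdriver.Firefox()
driver.get("https://www.example.com")

# Selenium-driven browsers report navigator.webdriver as true;
# a site's JavaScript can read this flag and refuse to serve you.
print(driver.execute_script("return navigator.webdriver"))

driver.quit()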
It's very likely that the website is blocking you due to your AWS IP.
Not only does that tell the website that somebody is probably scraping them programmatically, but most websites also cap the number of queries they will accept from any one IP address.
You most likely need a proxy service to pipe your requests through.
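As an illustration, a minimal sketch of routing a Selenium Firefox session through an HTTP proxy (proxy.example.com:8080 is a placeholder for whatever proxy service you choose):

from selenium import webdriver

options = webdriver.FirefoxOptions()
# 1 = manual proxy configuration; host and port are placeholders
options.set_preference("network.proxy.type", 1)
options.set_preference("network.proxy.http", "proxy.example.com")
options.set_preference("network.proxy.http_port", 8080)
options.set_preference("network.proxy.ssl", "proxy.example.com")
options.set_preference("network.proxy.ssl_port", 8080)

driver = webdriver.Firefox(options=options)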
Related
I need to scrape a few details from a website, and the problem is that this particular website is banned in India. I cannot open the site without a VPN, but the VPN makes scraping a lot slower, and the program crashes a lot because the site's response time increases. Is there any other way I can access the website?
Try this method: it's a private DNS that lets you access blocked websites, and it's faster and better than a VPN.
Works only on Chrome:
Go to Chrome Settings.
Click on Security.
On the secure DNS, select Cloudflare (1.1.1.1).
For more details: https://asapguide.com/open-blocked-websites-without-vpn/
You can use Scraper API (https://www.scraperapi.com/), which provides you with a dynamic IP. It supports all languages; you only need to prepend the Scraper API endpoint and pass your own URL as a parameter.
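A rough sketch of that pattern with the requests library (the api_key value is a placeholder; check ScraperAPI's docs for the exact parameters):

import requests

# Placeholder key; ScraperAPI issues one per account
payload = {"api_key": "YOUR_API_KEY", "url": "https://www.example.com/deals"}
response = requests.get("http://api.scraperapi.com", params=payload)
print(response.status_code, response.text[:200])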
I used Selenium in Python 3 to open a page. It does not open under Selenium, but it does open in a Firefox private window.
What is the difference, and how do I fix it?
from selenium import webdriver
from time import sleep

driver = webdriver.Firefox()
driver.get('https://google.com')  # creating a google cookie
driver.get_cookies()              # check google gets cookies
sleep(3.0)

url = 'https://www.realestate.com.au/buy/in-sydney+cbd%2c+nsw/list-1'
driver.get(url)
Creating the Google cookie is not necessary; it is not there in a Firefox private window either, and the page works without it. Under Selenium, however, the behavior is different.
I also see the website return a [HTTP/2 429 Too Many Requests 173ms] status, and the page is blank white. This does not happen in Firefox private mode.
UPDATE:
I turned on the persistent log. Firefox in private mode receives a 429 response too, but its JavaScript seems to recover by continuing from another URL; this only happens the first time. Under Selenium, however, the request does not survive the 429 response. It does report something to the cdndex website; I have blocked that site, so you do not see the request go through there. Either way, the behavior differs between Firefox and Selenium.
Selenium with persistent log:
Firefox with persistent log:
This is just my hunch after working with Selenium and WebDriver for a while: I suspect Selenium's default user agent is set to something the server side recognizes, so it answers with a silly HTTP code and a blank page.
Try setting the user agent to something reasonable and/or disabling Selenium's interference with the defaults, as sketched below.
Another tip is to inspect the request with Wireshark or a similar tool to see exactly what is sent over the wire.
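A minimal sketch of overriding the user agent through a Firefox preference (the UA string is only an example; copy a current one from a normal browser session):

from selenium import webdriver

options = webdriver.FirefoxOptions()
# Example UA string; substitute one copied from a regular browser
ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/115.0"
options.set_preference("general.useragent.override", ua)

driver = webdriver.Firefox(options=options)
driver.get("https://www.realestate.com.au/buy/in-sydney+cbd%2c+nsw/list-1")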
429 Too Many Requests
The HTTP 429 Too Many Requests response status code indicates the user has sent too many requests within a short period of time. The 429 status code is intended for use with rate-limiting schemes.
Root Cause
When your server detects that a user agent is trying to access a specific page too often in a short period of time, it triggers a rate-limiting feature. The most common example is a user (or an attacker) repeatedly trying to log into a web application.
The server can also identify a bot by its cookies rather than by login credentials. Requests may be counted per request, per server, or across several servers. So there are a variety of situations that can result in you seeing an error like one of these:
429 Too Many Requests
429 Error
HTTP 429
Error 429 (Too Many Requests)
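When you control the client, the usual way to cope with a 429 is to back off and retry, honoring the Retry-After header when the server sends one. A minimal sketch with the requests library (the URL is a placeholder):

import time
import requests

def get_with_backoff(url, max_retries=5):
    delay = 1.0
    for _ in range(max_retries):
        response = requests.get(url)
        if response.status_code != 429:
            return response
        # Prefer the server's hint; otherwise back off exponentially
        retry_after = response.headers.get("Retry-After")
        time.sleep(float(retry_after) if retry_after else delay)
        delay *= 2
    return response

resp = get_with_backoff("https://www.example.com/listings")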
This use case
This use case looks like a classic case of a Selenium-driven, GeckoDriver-initiated Firefox browsing context being detected as a bot, for the simple reason that:
Selenium identifies itself.
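A mitigation often suggested in the discussions referenced below is to stop Firefox from advertising automation, e.g. by flipping a couple of preferences; a minimal sketch, with no guarantee it defeats any particular bot detector:

from selenium import webdriver

options = webdriver.FirefoxOptions()
# Ask Firefox not to expose navigator.webdriver = true
options.set_preference("dom.webdriver.enabled", False)
options.set_preference("useAutomationExtension", False)

driver = webdriver.Firefox(options=options)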
References
You can find a couple of relevant detailed discussions in:
How to Conceal WebDriver in Geckodriver from BotD in Java?
How can I make a Selenium script undetectable using GeckoDriver and Firefox through Python?
I made a program with Selenium that automates posting comments on some blogs' content. I'm not familiar with Python's requests module (I've been working with it for just a week). What I'm wondering is: my Selenium program is a bit slow at page loading, and it loads everything from ads to images/videos. If I had built my program with the requests module, would it use less data and run a bit faster than the Selenium version?
I searched this issue on some forum sites; generally they say the requests module is a bit faster, but not always. Also, I couldn't find any information comparing data usage between these modules.
Please don't just give me a thumbs-down; I need this answer with details.
Selenium is used for web automation: clicking web elements and sending keys to input boxes.
To speed up Selenium, use headless mode, so that the page is not rendered visually and the work goes faster; see Selenium's documentation to learn more about headless mode, and the sketch below.
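A minimal sketch of launching Firefox headless (assuming Selenium 4's options API):

from selenium import webdriver

options = webdriver.FirefoxOptions()
options.add_argument("-headless")  # run without a visible browser window

driver = webdriver.Firefox(options=options)
driver.get("https://www.example.com")
print(driver.title)
driver.quit()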
requests, on the other hand, is used for HTTP methods like GET, POST, etc. Learn more about requests here.
If the blogging site has a public API, then you can use the requests module.
If you are new to APIs, I recommend watching this YouTube video:
https://youtu.be/GZvSYJDk-us
For example, to create issues on GitHub you can use the GitHub API, as sketched below.
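A rough sketch of that with requests, following GitHub's REST API (OWNER/REPO and the token are placeholders):

import requests

# Placeholders: substitute your repository path and a personal access token
url = "https://api.github.com/repos/OWNER/REPO/issues"
headers = {
    "Authorization": "token YOUR_TOKEN",
    "Accept": "application/vnd.github+json",
}
data = {"title": "Found a bug", "body": "Steps to reproduce: ..."}

response = requests.post(url, json=data, headers=headers)
print(response.status_code)  # 201 means the issue was created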
But to comment on a blogging site which has no public API, you need to use Selenium.
requests sends and receives data directly from the server that hosts a particular service, so it is fast. Selenium, by contrast, interacts with a web browser.
When you are using requests, you can perform an action directly, without having to perform a bunch of clicks or send keys.
Selenium allows you to control a browser and execute actions on a webpage; the requests library is for making HTTP requests.
So, if you know how to write your program for posting comments using just the HTTP API, I'd go with requests; Selenium would be an overhead in this case.
If you are proficient with HTTP requests and verbs (you know how to make a POST request to a server with the requests library), then choose requests. If not, use Selenium, or pair requests with BeautifulSoup for parsing.
Generally, whenever a page is loaded, it sends several requests, which can be recorded in the Network tab of Chrome developer tools.
My prime motive is to log all the network requests whenever a page is loaded, using a Python script. A sample screenshot is attached to illustrate which requests I am trying to collect.
[Screenshot: request log from the Chrome DevTools Network tab]
I am trying to achieve this using the urllib library in Python; however, I am not exactly sure of the usage.
Looking forward to your responses. Thanks in advance.
You can't do this with the urllib family of libraries. To capture AJAX requests you need something that has JavaScript support... like a browser.
So your best option in this case is to use Selenium: write a script that drives whatever browser you're using through WebDriver, and then capture/log the AJAX requests being sent out.
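One way to do that with plain Selenium is Chrome's performance logging, which records DevTools network events. A minimal sketch, assuming Chrome/chromedriver and Selenium 4 (goog:loggingPrefs is a Chrome-specific capability):

import json
from selenium import webdriver

options = webdriver.ChromeOptions()
# Ask chromedriver to record DevTools performance events
options.set_capability("goog:loggingPrefs", {"performance": "ALL"})

driver = webdriver.Chrome(options=options)
driver.get("https://www.example.com")

# Each log entry wraps a DevTools message; keep only outgoing requests
for entry in driver.get_log("performance"):
    message = json.loads(entry["message"])["message"]
    if message["method"] == "Network.requestWillBeSent":
        print(message["params"]["request"]["url"])

driver.quit()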
If you don't want to waste time on the details, please skip to the "Here is the problem" part. If you are very impatient, please go directly to the last part of this question.
First of all, I'm not using Selenium for automated testing; I'm using it for scraping (collecting data from) specific websites. The reason for using Selenium is a long story that I'm not going to get into here.
This is the environment:
Client side:
Python 2.7
selenium 2.43.0
Server side:
CentOS
selenium 2.43.1 (both the hub and grids)
Firefox 32
1 selenium hub and multiple selenium grids (running on different servers)
Currently we have workable scrapers (or data collectors, if you prefer) built on this, but they scrape pages serially, and now we have decided to use multithreading to speed things up.
But the selenium FAQ said:
WebDriver is not thread-safe.
So we are going to have multiple WebDriver (Firefox) instances in a scraper, visiting different URLs.
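Since WebDriver is not thread-safe, the usual pattern is one WebDriver instance per thread, never shared across threads. A minimal local sketch of that shape (the URLs are placeholders; a grid setup would use webdriver.Remote instead):

from concurrent.futures import ThreadPoolExecutor
from selenium import webdriver

def scrape(url):
    # One driver per thread; never share a driver between threads
    driver = webdriver.Firefox()
    try:
        driver.get(url)
        return driver.title
    finally:
        driver.quit()

urls = ["https://www.example.com/a", "https://www.example.com/b"]
with ThreadPoolExecutor(max_workers=2) as pool:
    print(list(pool.map(scrape, urls)))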
Here is the problem:
We need the scrapers to share cookies (and cache, if possible) between WebDrivers that scrape the same website via the same proxy. But we don't want scrapers to share cookies with other scrapers that scrape different websites or go through different proxies.
I've done some research on this.
I know the client can specify a profile path for the Selenium Firefox WebDriver (which runs on the grid server). But Selenium cannot create and delete the profile automatically; we would have to do that ourselves. This means we might need to create profiles dynamically and delete them once they are no longer needed, because we don't know in advance how many scrapers/websites/proxies will be used - this does not sound like a good idea.
The second choice is to sync the cookies in code, but Selenium prevents access to cookies that don't belong to the current domain, which might get tricky when the website spans two or more domains. Also, I could patch 2 JS files in the webdriver.xpi file to remove this limit (see here), but that requires patching the grid server - which sounds like a bad idea too.
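For the same-domain case, syncing in code only needs get_cookies()/add_cookie(). A minimal sketch copying cookies from one remote session to another, written in current Selenium syntax (a 2.43-era client would pass desired_capabilities instead of options; the hub URL is a placeholder). Note both sessions must already be on a page of the target domain, because WebDriver only accepts cookies for the current domain:

from selenium import webdriver

# Placeholder hub address; both sessions go through the same grid
hub = "http://selenium-hub.example.com:4444/wd/hub"
source = webdriver.Remote(command_executor=hub, options=webdriver.FirefoxOptions())
target = webdriver.Remote(command_executor=hub, options=webdriver.FirefoxOptions())

url = "https://www.example.com/"
source.get(url)
target.get(url)  # must be on the domain before add_cookie() will work

# Copy every cookie visible to the source session into the target session
for cookie in source.get_cookies():
    target.add_cookie(cookie)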
So, is there any way to make remote Selenium WebDrivers (Firefox instances) share cookies without modifying Selenium or running a "babysitter" program to look after the Selenium server?
Thanks.