I am parsing one site, but lately when I visit it I get the following message: your IP is blocked or you are sending requests too often. I work with Selenium and the Chrome WebDriver. Can I somehow change my IP to get around this limitation?
Selenium itself has no ability to change your IP address.
This can be done with other tools or libraries.
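For example, the standard library's urllib can route requests through a proxy without involving the browser at all. A minimal sketch (the proxy address below is a placeholder, not a working proxy):

```python
import urllib.request

def make_proxied_opener(proxy_url):
    # Route both HTTP and HTTPS traffic through the given proxy.
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)

# opener = make_proxied_opener("http://203.0.113.10:3128")  # placeholder proxy
# opener.open("http://example.com")  # this request would go via the proxy
```

Within Selenium itself, the closest equivalent is pointing the browser at a proxy (e.g. Chrome's --proxy-server=... command-line argument); actually changing the machine's IP is outside Selenium's scope.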
Related
So I'm using Python and Selenium to repeat requests to a specific website. I am using a rotating proxy that is supposed to give me a new IP after each request. The issue is that when I make a request, for example to whatsmyip.org, in the Chrome window I don't always get a fresh IP.
If my requests are made every 2 seconds I keep the same IP, but if they are made every 10-15 seconds my IP changes.
If you have any idea how I could fix this it would be nice - for example a Chrome option, or maybe a capability? I really don't know.
I have multiple network interfaces (tun0, tun1 ...) and want to open several firefox browser instances in python, such that each one goes through a specific interface.
I can obtain the IP address of each interface with netifaces but have not found any way to 'attach' them to browser = webdriver.Firefox(...). There is plenty of documentation on using webdriver.DesiredCapabilities and proxies, but that isn't what I'd like to achieve.
Ideally I'd really like to make it work at the Python level rather than the OS level, since the interfaces/IP addresses will change and this is driven by the Python code.
Using FreeBSD 11.1 and Python 3.6.
I am not sure if it works, but you can download the Selenium standalone server and run it on another network interface, as in this answer. By assigning a different port to each instance (you can do this on the command line when starting the server: java -jar selenium-server-standalone-version.jar -port 4545) you can connect to each of them individually. I don't know whether the network-interface method works with browsers, because the driver launches a new process, but I think it's worth a try; maybe it will help you think of different approaches.
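A sketch of how the per-port setup might be organized from Python (the jar filename, host, and ports are illustrative): build one launch command per server instance, plus the hub URL a webdriver.Remote client would use to reach it.

```python
def standalone_instance(port, jar="selenium-server-standalone.jar", host="127.0.0.1"):
    # Launch command for one standalone server bound to its own port,
    # plus the hub URL a webdriver.Remote client would use for it.
    cmd = ["java", "-jar", jar, "-port", str(port)]
    url = f"http://{host}:{port}/wd/hub"
    return cmd, url

# One server (and one webdriver.Remote client) per interface/port:
# for port in (4545, 4546):
#     cmd, url = standalone_instance(port)
```

Each server process would then be started (e.g. via subprocess) under its own interface binding, and each Firefox instance driven through its own Remote connection.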
I'd like to check Tor before I start crawling with Python Scrapy. I am using Polipo/Tor/Scrapy on Linux.
With this setup, Scrapy correctly uses Tor for its crawls. The way I check whether Scrapy is using Tor correctly is to crawl this page in my spider:
import logging
import scrapy
from scrapy import Request

class MySpider(scrapy.Spider):
    name = "tor_check"

    def start_requests(self):
        yield Request('https://check.torproject.org/', self.parse)

    def parse(self, response):
        logging.info("Check tor page: " + str(response.css('.content h1::text')))
However, I think there might be a better/cleaner way of doing it. I know I can check the Tor service status or check my IP address, but I want to actually check whether the Tor connection is correctly established.
A somewhat definitive way to do this is to connect to Tor's control port and issue GETINFO status/circuit-established.
If Tor has an active circuit built, it will return:
250-status/circuit-established=1
250 OK
If Tor hasn't been used for a while, this could be 0. You can also call GETINFO dormant which would yield 250-dormant=1. Most likely when you then try to use Tor, it will build a circuit and dormant will become 0 and circuit-established will be 1 barring any major network issues.
In either case, dormant=0 or circuit-established=1 should be enough to tell you that Tor is usable.
It's a simple protocol so you can just open a socket to the control port, authenticate, and issue commands, or use Controller from Stem.
See the control spec for more info.
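A minimal stdlib sketch of that exchange (assuming ControlPort 9051 and password authentication in your torrc; those are assumptions, and Stem's Controller wraps the same protocol more robustly):

```python
import socket

def circuit_established(reply: str) -> bool:
    # Parse a GETINFO status/circuit-established reply from the control port.
    return "status/circuit-established=1" in reply

def check_tor(host="127.0.0.1", port=9051, password=""):
    # Connect to Tor's control port, authenticate, and ask whether a
    # circuit is currently built. Port and password are assumptions;
    # they must match your torrc ControlPort/HashedControlPassword.
    with socket.create_connection((host, port), timeout=5) as s:
        f = s.makefile("rw", newline="")
        f.write('AUTHENTICATE "%s"\r\n' % password)
        f.flush()
        if not f.readline().startswith("250"):
            raise RuntimeError("control-port authentication failed")
        f.write("GETINFO status/circuit-established\r\n")
        f.flush()
        reply = f.readline() + f.readline()  # data line + "250 OK"
        f.write("QUIT\r\n")
        f.flush()
        return circuit_established(reply)
```

With Stem, the same check is Controller.from_port(), authenticate(), then get_info('status/circuit-established').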
I am writing a web scraper using Selenium for Python. The scraper is visiting the same sites many times per hour, therefore I was hoping to find a way to alter my IP every few searches. What is the best strategy for this (I am using Firefox)? Is there any prewritten code or a CSV of IP addresses I can switch through? I am completely new to masking IPs, proxies, etc., so please go easy on me!
Try using a proxy.
There are free options (not so reliable) or paid services.
from selenium import webdriver

def change_proxy(proxy, port):
    profile = webdriver.FirefoxProfile()
    profile.set_preference("network.proxy.type", 1)  # 1 = manual proxy configuration
    profile.set_preference("network.proxy.http", proxy)
    profile.set_preference("network.proxy.http_port", port)
    profile.set_preference("network.proxy.ssl", proxy)
    profile.set_preference("network.proxy.ssl_port", port)
    driver = webdriver.Firefox(firefox_profile=profile)
    return driver
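To alter the IP every few searches, one simple pattern is to cycle through a list of proxies and build a fresh driver each time. A sketch (the proxy entries are placeholders):

```python
from itertools import cycle

def proxy_rotator(proxies):
    # Cycle endlessly through (host, port) pairs; call the returned
    # function before creating each new driver to get the next proxy.
    pool = cycle(proxies)
    return lambda: next(pool)

# next_proxy = proxy_rotator([("203.0.113.10", 3128), ("203.0.113.11", 3128)])
# host, port = next_proxy()   # feed into change_proxy(host, port) above
```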
Your ISP will assign you your IP address. If you sign up for something like hidemyass.com, they will probably provide you with an app that changes your proxy, although I don't know how they do it.
But, if they have an app that cycles you through various proxies, then all your internet traffic will go through that proxy - including your scraper. There's no need for the scraper to know about these proxies or how hide my ass works - it'll connect through the proxies just like your browser or FTP client or ....
I'm attempting to use a proxy, via python, in order to log into a site from a different, specific IP address. It seems that certain websites, however, can detect the original (host) IP address. I've investigated the issue a bit and here's what I found.
There are four proxy methods I've tried:
Firefox with a proxy setting.
Python with mechanize.set_proxies.
Firefox in a virtual machine using an internal network, along with another virtual machine acting as a router (having two adapters: a NAT, and that internal network), set up such that the internal network traffic is routed through a proxy.
TorBrowser (which uses Firefox as the actual browser).
For the first three I used the same proxy. The Tor option was just for additional testing, not via my own proxy. The following things are behaviors I've noticed that are expected:
With all of these, if I go to http://www.whatismyip.com/, it gives the correct IP address (the IP address of the proxy, not the host computer).
whatismyip.com says "No Proxy Detected" for all of these.
Indeed, it seems like the websites I visit do think my IP is that of the proxy. However, there have been a few weird cases which makes me think that some sites can somehow detect my original IP address.
In one situation, visiting a non-US site via Firefox with a non-US proxy, the site literally was able to print my originating IP address (from the US) and deny me access. Shouldn't this be impossible? Visiting the site via the virtual machine with that same non-US proxy, or the TorBrowser with a non-US exit node, though, the site was unable to do so.
In a similar situation, I was visiting another non-US site from a non-US proxy. If I logged into the site from Firefox within the virtual machine, or from the TorBrowser with a non-US exit node, the site would work properly. However, if I attempted to log in via Firefox with a proxy (the same proxy the virtual machine uses), or with mechanize, it would fail to log in with an unrelated error message.
In a third situation, using the mechanize.set_proxies option, I overloaded a site with too many requests so it decided to block access (it would purposefully time out whenever I logged in). I thought it might have blocked the proxy's IP address. However, when I ran the code from a different host machine, but with the same proxy, it worked again, for a short while, until they blocked it again. (No worries, I won't be harassing the site any further - I re-ran the program as I thought it might have been a glitch on my end, not a block from their end.) Visiting that site with the Firefox+proxy solution from one of the blocked hosts also resulted in the purposeful timeout.
It seems to me that all of these sites, in the Firefox + proxy and mechanize cases, were able to find out something about the host machine's IP address, whereas in the TorBrowser and virtual machine cases, they weren't.
How are the sites able to gather this information? What is different about the TorBrowser and virtual machine cases that prevents the sites from gathering this information? And, how would I implement my python script so that the sites I'm visiting via the proxy can't detect the host/host's IP address?
It's possible that the proxy is reporting your real IP address in the X-Forwarded-For HTTP header, although if so, I'm surprised that the WhatIsMyIP site didn't tell you about it.
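To illustrate (a sketch with made-up addresses): a non-anonymous proxy typically appends the client's address to the X-Forwarded-For header, so the target server can recover the original IP even though the connection arrives from the proxy:

```python
def forwarded_client_ip(headers):
    # The first entry in X-Forwarded-For is the original client;
    # later entries are proxies the request passed through.
    xff = headers.get("X-Forwarded-For")
    if xff:
        return xff.split(",")[0].strip()
    return None

# A server behind a non-anonymous proxy might see:
# forwarded_client_ip({"X-Forwarded-For": "198.51.100.7, 203.0.113.10"})
# -> "198.51.100.7"  (the real client, despite connecting via the proxy)
```

Anonymous ("elite") proxies omit this header, which is one reason the Tor and virtual-machine setups behaved differently.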
If you first visited the non-US site directly, and then later again using the proxy, it's also possible that the site might have set cookies in your browser on your first visit that let the site identify you even after your IP address changes. This could account for the differences you've observed between browser instances.
(I've noticed that academic journal sites like to do that. If I try to access a paywalled article from home and get blocked because I wasn't using my university's proxy server, I'll typically have to clear cookies after enabling the proxy to be allowed access.)