Selenium Python: Changing IP

I am writing a web scraper using Selenium for Python. The scraper is visiting the same sites many times per hour, therefore I was hoping to find a way to alter my IP every few searches. What is the best strategy for this (I am using firefox)? Is there any prewritten code/a csv of IP addresses I can switch through? I am completely new to masking IP, proxies, etc. so please go easy on me!

Try using a proxy.
There are free options (generally unreliable) or paid services.
from selenium import webdriver

def change_proxy(proxy, port):
    """Return a Firefox driver configured to use the given HTTP/SSL proxy."""
    profile = webdriver.FirefoxProfile()
    profile.set_preference("network.proxy.type", 1)  # 1 = manual proxy config
    profile.set_preference("network.proxy.http", proxy)
    profile.set_preference("network.proxy.http_port", int(port))
    profile.set_preference("network.proxy.ssl", proxy)
    profile.set_preference("network.proxy.ssl_port", int(port))
    driver = webdriver.Firefox(profile)
    return driver
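To rotate IPs "every few searches," as asked, one approach is to keep a list of proxy addresses and hand the next one to the change_proxy helper above after every N requests. This is only a sketch: the addresses below are hypothetical placeholders, and free proxy lists tend to be unreliable.

```python
# Rotate through a pool of (host, port) proxies, advancing every N uses.
from itertools import cycle

PROXIES = [
    ("203.0.113.10", 8080),   # placeholder addresses; substitute your own,
    ("203.0.113.11", 8080),   # e.g. loaded from a CSV of host,port rows
    ("203.0.113.12", 8080),
]

def proxy_rotation(proxies, every=5):
    """Yield (host, port) pairs, moving to the next proxy after
    `every` consecutive uses, looping over the pool forever."""
    pool = cycle(proxies)
    current = next(pool)
    count = 0
    while True:
        if count and count % every == 0:
            current = next(pool)
        yield current
        count += 1

rotation = proxy_rotation(PROXIES, every=5)
# For each batch of searches:
#   host, port = next(rotation)
#   driver = change_proxy(host, port)   # helper from the answer above
```

Each call to change_proxy starts a fresh Firefox instance, so quit() the old driver before switching.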

Your ISP assigns your IP address. If you sign up for a service like hidemyass.com, it will probably provide you with an app that changes your proxy, although I don't know how they do it.
But if that app cycles you through various proxies, then all your internet traffic will go through the current proxy, including your scraper. There's no need for the scraper to know about these proxies or how the service works; it will connect through them just like your browser or FTP client would.

Related

Selenium Python, reset tcp connection after each request

So I'm using Python and Selenium to repeat requests to a specific website. I am using a rotating proxy that is supposed to give me a new IP after each request. The issue is that when I make a request, for example to whatsmyip.org, in the Chrome window I don't always get a fresh IP.
If my requests are made every 2 seconds I keep the same IP, but if they are 10-15 seconds apart my IP changes.
If you have any idea how I could fix this, it would be nice; for example a Chrome option, or a capability maybe? I really don't know.
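A likely explanation: rotating proxies usually assign a new IP per TCP connection, and HTTP keep-alive reuses one connection for requests made in quick succession, so fast requests keep the same IP. Forcing a fresh connection per request gets a fresh IP. This can be demonstrated locally without any proxy, by counting how many distinct client connections a toy server sees:

```python
# Each new client connection arrives from a new ephemeral source port;
# a rotating proxy would likewise see each one as a separate connection.
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

seen_ports = []

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        seen_ports.append(self.client_address[1])  # client source port
        self.send_response(200)
        self.send_header("Content-Length", "2")
        self.end_headers()
        self.wfile.write(b"ok")

    def log_message(self, *args):  # silence per-request logging
        pass

server = ThreadingHTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

# Opening a fresh HTTPConnection per request = a fresh TCP connection.
for _ in range(3):
    conn = http.client.HTTPConnection("127.0.0.1", port)
    conn.request("GET", "/")
    conn.getresponse().read()
    conn.close()

server.shutdown()
print(len(set(seen_ports)))  # distinct connections observed
```

With Selenium there is no clean per-request switch for this; the blunt workaround is to quit() and recreate the driver between requests, which guarantees new connections at the cost of browser startup time.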

How can I change my IP using selenium WebDriver Chrome

I am parsing one site, but lately when I visit it I get the following: your IP is blocked or you are sending requests too often. I work with Selenium and WebDriver for Chrome. Can I somehow change my IP to get around this limitation?
Selenium itself has no ability to change your IP address.
This can be done with other tools / libraries.
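For Chrome specifically, the usual route is to start the browser behind a proxy using Chrome's --proxy-server command-line switch, then restart the driver with a different proxy when you want a new IP. A minimal sketch, with a hypothetical proxy address:

```python
# Build Chrome's proxy switch and start a driver behind it.

def proxy_argument(host, port):
    """Return Chrome's --proxy-server command-line switch."""
    return f"--proxy-server=http://{host}:{port}"

def chrome_with_proxy(host, port):
    # Imported inside the function so proxy_argument stays usable
    # without Selenium installed.
    from selenium import webdriver
    options = webdriver.ChromeOptions()
    options.add_argument(proxy_argument(host, port))
    return webdriver.Chrome(options=options)

# driver = chrome_with_proxy("203.0.113.5", 3128)  # placeholder address
```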

How to do proxy over proxy (2 layers) by using scrapy?

How do you do proxy over proxy (2 layers) using scrapy? I assume these are HTTP/HTTPS proxies.
For example: my local machine --> proxy1 --> proxy2 --> the site I want to crawl.
How do I do that in scrapy?
Why do I want to do this?
The goal is to hide my IP address. Assume proxy1 is very reliable but blocked by the site I want to crawl, while proxy2 is unreliable but does have access to the site.
I could do: my local machine --> proxy2 --> the site I want to crawl. But because proxy2 is unreliable, I could expose my IP address to the site. So I want to add another layer before proxy2 to protect it.
What for? To hide your IP address you can use a high-anonymity ("elite") proxy.
High-anonymity proxies mask your IP, replacing it with their own. However, the servers you visit may still be able to detect your real IP. This is unlikely, but servers that add code to detect underlying IP addresses can possibly uncover yours.
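For the single-layer case, scrapy's built-in HttpProxyMiddleware already supports a per-request proxy via request.meta["proxy"]; chaining two proxies is not supported natively and is usually delegated to an external tunneling tool. A sketch of the supported one-layer setup (the address is a placeholder):

```python
# scrapy reads the proxy for each request from request.meta["proxy"].

def with_proxy(meta, proxy_url):
    """Return a copy of a request's meta dict routed through proxy_url."""
    return {**meta, "proxy": proxy_url}

# Inside a spider you would then yield, e.g.:
#   yield scrapy.Request(url, meta=with_proxy({}, "http://203.0.113.7:8080"))
```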

Using IP authenticated proxies in a distributed crawler

I'm working on a distributed web crawler in Python running on a cluster of CentOS 6.3 servers. The crawler uses many proxies from different proxy providers. Everything works like a charm for username/password-authenticated proxy providers. But now we have bought some proxies that use IP-based authentication, meaning that when I want to crawl a webpage using one of these proxies I need to make the request from a specific subset of our servers.
The question is: is there a way in Python (using a library/software) to make a request to a domain passing through 2 proxies? (One proxy is one of the subset needed for the IP authentication and the second is the actual proxy from the provider.) Or is there another way to do this without setting up this subset of our servers as proxies?
The code I'm using now to make the request through a proxy uses the requests library:
import requests
from requests.auth import HTTPProxyAuth

proxy_obj = {
    'http': proxy['ip']
}
auth = HTTPProxyAuth(proxy['username'], proxy['password'])
data = requests.get(url, proxies=proxy_obj, auth=auth)
Thanks in advance!
is there a way in Python (using a library/software) to make a request
to a domain passing through 2 proxies?
If you need to go through two proxies, you'll have to use HTTP tunneling: any host which isn't on the authorized list would connect to an HTTP proxy server on one of the hosts which is, and use the HTTP CONNECT method to create a tunnel through it to the remote proxy. It may not be possible to achieve that with the requests library alone, though.
Or is there another way to do this without setting up this subset of
our servers as proxies?
Assuming that the remote proxies which use IP address-based authentication are all expecting the same IP address, then you could instead configure a NAT router, between your cluster and the remote proxies, to translate all outbound HTTP requests to come from that single IP address.
But, before you look into implementing either of these unnecessarily complicated options, and given that you're paying for this service, can't you just ask the provider to allow requests for the entire range of IP addresses which you're currently using?
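The tunneling idea above can be sketched with raw sockets: connect to the first proxy, ask it to CONNECT to the second, then ask the second (through the established tunnel) to CONNECT to the target. All addresses here are hypothetical placeholders, and real proxies may require authentication headers this sketch omits.

```python
# Chain two HTTP proxies with nested CONNECT tunnels.
import socket

def connect_request(host, port):
    """Build the HTTP CONNECT request for a target host:port."""
    return (f"CONNECT {host}:{port} HTTP/1.1\r\n"
            f"Host: {host}:{port}\r\n\r\n").encode("ascii")

def open_tunnel(sock, host, port):
    """Ask the proxy at the far end of `sock` to tunnel to host:port."""
    sock.sendall(connect_request(host, port))
    status_line = sock.recv(4096).split(b"\r\n", 1)[0]
    if b" 200" not in status_line:
        raise OSError(f"CONNECT {host}:{port} refused: {status_line!r}")
    return sock

def chain(proxy1, proxy2, target):
    """local -> proxy1 -> proxy2 -> target, each hop via CONNECT."""
    sock = socket.create_connection(proxy1)  # plain TCP to the first proxy
    open_tunnel(sock, *proxy2)   # proxy1 opens a tunnel to proxy2
    open_tunnel(sock, *target)   # proxy2 opens a tunnel to the target
    return sock                  # the socket now speaks to the target
```

For an HTTPS target you would additionally wrap the returned socket with ssl.SSLContext.wrap_socket() before speaking TLS.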

how to use proxies without the remote site being able to detect the host/host IP?

I'm attempting to use a proxy, via python, in order to log into a site from a different, specific IP address. It seems that certain websites, however, can detect the original (host) IP address. I've investigated the issue a bit and here's what I found.
There are four proxy methods I've tried:
Firefox with a proxy setting.
Python with mechanize.set_proxies.
Firefox in a virtual machine using an internal network, along with another virtual machine acting as a router (having two adapters: a NAT, and that internal network), set up such that the internal network traffic is routed through a proxy.
TorBrowser (which uses Firefox as the actual browser).
For the first three I used the same proxy; the Tor option was just for additional testing, not via my own proxy. The following behaviors I've noticed are as expected:
With all of these, if I go to http://www.whatismyip.com/, it gives the correct IP address (the IP address of the proxy, not the host computer).
whatismyip.com says "No Proxy Detected" for all of these.
Indeed, it seems like the websites I visit do think my IP is that of the proxy. However, there have been a few weird cases which makes me think that some sites can somehow detect my original IP address.
In one situation, visiting a non-US site via Firefox with a non-US proxy, the site literally was able to print my originating IP address (from the US) and deny me access. Shouldn't this be impossible? Visiting the site via the virtual machine with that same non-US proxy, or the TorBrowser with a non-US exit node, though, the site was unable to do so.
In a similar situation, I was visiting another non-US site from a non-US proxy. If I logged into the site from Firefox within the virtual machine, or from the TorBrowser with a non-US exit node, the site would work properly. However, if I attempted to log in via Firefox with a proxy (the same proxy the virtual machine uses), or with mechanize, it would fail to log in with an unrelated error message.
In a third situation, using the mechanize.set_proxies option, I overloaded a site with too many requests so it decided to block access (it would purposefully time out whenever I logged in). I thought it might have blocked the proxy's IP address. However, when I ran the code from a different host machine, but with the same proxy, it worked again, for a short while, until they blocked it again. (No worries, I won't be harassing the site any further - I re-ran the program as I thought it might have been a glitch on my end, not a block from their end.) Visiting that site with the Firefox+proxy solution from one of the blocked hosts also resulted in the purposeful timeout.
It seems to me that all of these sites, in the Firefox + proxy and mechanize cases, were able to find out something about the host machine's IP address, whereas in the TorBrowser and virtual machine cases, they weren't.
How are the sites able to gather this information? What is different about the TorBrowser and virtual machine cases that prevents the sites from gathering this information? And, how would I implement my python script so that the sites I'm visiting via the proxy can't detect the host/host's IP address?
It's possible that the proxy is reporting your real IP address in the X-Forwarded-For HTTP header, although if so, I'm surprised that the WhatIsMyIP site didn't tell you about it.
If you first visited the non-US site directly, and then later again using the proxy, it's also possible that the site might have set cookies in your browser on your first visit that let the site identify you even after your IP address changes. This could account for the differences you've observed between browser instances.
(I've noticed that academic journal sites like to do that. If I try to access a paywalled article from home and get blocked because I wasn't using my university's proxy server, I'll typically have to clear cookies after enabling the proxy to be allowed access.)
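One concrete leak check for the X-Forwarded-For theory: fetch a header-echo endpoint (httpbin.org/headers echoes back whatever headers it received) through the proxy and look for headers that commonly carry the original client address. The filtering logic is pure and kept separate from the network call; the proxy address is a hypothetical placeholder.

```python
# Headers that commonly reveal the original client behind a proxy.
REVEALING = {"x-forwarded-for", "via", "forwarded", "x-real-ip"}

def revealing_headers(headers):
    """Return only the received headers that could expose the real IP."""
    return {k: v for k, v in headers.items() if k.lower() in REVEALING}

def check_proxy(proxy_url):
    # Imported here so revealing_headers works without requests installed.
    import requests
    proxies = {"http": proxy_url, "https": proxy_url}
    resp = requests.get("https://httpbin.org/headers",
                        proxies=proxies, timeout=10)
    return revealing_headers(resp.json()["headers"])

# An empty dict from check_proxy("http://203.0.113.9:8080") would mean
# none of the usual identifying headers reached the server.
```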
