Selenium Python: reset TCP connection after each request

So I'm using Python and Selenium to send repeated requests to a specific website. I'm using a rotating proxy that is supposed to give me a new IP after each request. The issue is that when I make a request, for example to whatsmyip.org in the Chrome window, I don't always get a fresh IP.
If my requests are made every 2 seconds I keep the same IP, but if they're made every 10-15 seconds my IP changes.
If you have any idea how I could fix this it would be great, for example a Chrome option, or maybe a capability? I really don't know.
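One thing worth trying (just a sketch, not a confirmed fix, and the proxy address below is a placeholder): tear the browser session down and recreate it between requests, so each request starts from a brand-new set of TCP connections through the proxy.

from selenium import webdriver

PROXY = "my.rotating.proxy:8000"  # placeholder; use your provider's endpoint

def fetch_once(url):
    options = webdriver.ChromeOptions()
    options.add_argument(f"--proxy-server=http://{PROXY}")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()  # closes Chrome and drops its open connections

for _ in range(3):
    html = fetch_once("https://www.whatsmyip.org/")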

Related

How can I change my IP using selenium WebDriver Chrome

I am parsing one site, but lately when I visit it I get the following message: your IP is blocked or you are sending requests too often. I work with Selenium and Chrome WebDriver. Can I somehow change my IP to get around this limitation?
Selenium has no ability to change your IP address.
This can be done with other tools / libraries.
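A common workaround is to point the Chrome instance that Selenium launches at a proxy, so the target site sees the proxy's IP instead of yours. A minimal sketch, assuming an unauthenticated HTTP proxy (the address is a placeholder):

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--proxy-server=http://203.0.113.10:3128")  # placeholder proxy
driver = webdriver.Chrome(options=options)
driver.get("https://www.whatsmyip.org/")  # should now report the proxy's IP
driver.quit()

Rotating the IP then comes down to rotating which proxy you pass in, not anything Selenium does itself.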

Python Requests - Get Server IP

I'm making a small tool that tests CDN performance and would like to check where the response comes from. I thought of getting the host's IP and then using one of the geolocation APIs on GitHub to check the country.
I've tried doing so with
import socket
...
raw._fp.fp._sock.getpeername()
...however, that only works when I use stream=True for the request, and that in turn breaks the tool's functionality.
Is there any other option to get the server ip with requests or in a completely different way?
The socket.gethostbyname() function from Python's socket library should solve your problem; you can look it up in the Python docs.
Here is an example of how to use it:
import socket

url = "cdnjs.cloudflare.com"
print("IP:", socket.gethostbyname(url))
All you need to do is pass the URL to socket.gethostbyname() and it will do the rest. Just make sure to remove the http:// prefix first, because that will trip it up.
I could not get Akilan's solution to give the IP address of the particular host I was using; socket.gethostbyname() and getpeername() were not working for me, and in my case weren't even available. His solution did open the door, though.
However, poking around the socket module, I did find this:
socket.getaddrinfo('host name', 443)[0][4][0]
I wrapped this in a try/except block.
Maybe there is a prettier way.
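For illustration, the wrapping can look roughly like this (the host name and the fallback value are placeholders, not part of the original answer):

import socket

def resolve_ip(host, port=443):
    # getaddrinfo returns a list of (family, type, proto, canonname, sockaddr)
    # tuples; sockaddr[0] is the IP address of the first entry.
    try:
        return socket.getaddrinfo(host, port)[0][4][0]
    except socket.gaierror:
        return None  # name resolution failed

print(resolve_ip("cdnjs.cloudflare.com"))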

Handling https requests using a SOCK_STREAM proxy

I'm working on a project that allows a user to redirect his browsing through a proxy. The system works like this - a user runs this proxy on a remote PC and then also runs the proxy on his laptop. The user then changes his browser settings on the laptop to use localhost:8080 to make use of that local proxy, which in turn forwards all browser traffic to the proxy running on the remote PC.
This is where I ran into HTTPS. I was able to get normal HTTP requests working fine and dandy, but as soon as I clicked on google.com, Firefox skipped my proxy and connected to https://google.com directly.
My idea was to watch for browser requests that say CONNECT host:443 and then use the Python ssl module to wrap that socket. This would give me a secure connection between the outer proxy and the target server. However, when I ran Wireshark to see what a browser request looks like before SSL kicks in, the SSL was already there, meaning the browser appears to connect to port 443 directly, which explains why it bypassed my local proxy.
I would like to be able to handle HTTPS as well, as that would make for a complete browsing experience.
I'd really appreciate any tips that could push me in the right direction.
Well, after doing a fair amount of reading on proxies, I found out that my understanding of the problem was insufficient.
For anyone else that might end up in the same spot as me, know that there's a pretty big difference between HTTP, HTTPS, and SOCKS proxies.
HTTP proxies usually take a quick look at the HTTP headers to determine where to forward the whole packet. These are quite easy to code on your own with some basic knowledge of sockets.
HTTPS proxies, on the other hand, have to work differently. They should either be able to do the whole SSL magic for the client, or they can simply tunnel the traffic through without changes; however, if the latter is chosen, the user's IP will be known. This is a wee bit more demanding when it comes to coding.
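For the tunnelling variant, the core of it is honouring the browser's CONNECT request and then shuttling raw bytes in both directions. A rough sketch, not a production proxy (addresses, parsing and error handling are all simplified assumptions):

import socket
import threading

def handle_client(client):
    # Read the request line, e.g. b"CONNECT example.com:443 HTTP/1.1"
    request = client.recv(4096)
    method, target, _ = request.split(b"\r\n")[0].split(b" ")
    if method != b"CONNECT":
        client.close()
        return
    host, port = target.split(b":")
    upstream = socket.create_connection((host.decode(), int(port)))
    # Tell the browser the tunnel is up; the TLS handshake then passes through untouched.
    client.sendall(b"HTTP/1.1 200 Connection established\r\n\r\n")

    def pipe(src, dst):
        while True:
            data = src.recv(4096)
            if not data:
                break
            dst.sendall(data)

    threading.Thread(target=pipe, args=(client, upstream), daemon=True).start()
    pipe(upstream, client)

server = socket.socket()
server.bind(("127.0.0.1", 8080))
server.listen(5)
while True:
    conn, _ = server.accept()
    threading.Thread(target=handle_client, args=(conn,), daemon=True).start()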
SOCKS proxies are a whole different, albeit really cool, beast. They work on the 5th layer of the OSI model and honestly, I have no clue as to where I would even begin creating one. They achieve both security and anonymity. However, I do know that a person may be able to use SSH to start a SOCKS proxy on their machine; just read this http://www.revsys.com/writings/quicktips/ssh-tunnel.html . That link also gave me the idea that it should be possible to drive SSH from a Python script to make it much more convenient.
Hope this helps anyone with the same question as I had. Good luck!

how to use proxies without the remote site being able to detect the host/host IP?

I'm attempting to use a proxy, via python, in order to log into a site from a different, specific IP address. It seems that certain websites, however, can detect the original (host) IP address. I've investigated the issue a bit and here's what I found.
There are four proxy methods I've tried:
Firefox with a proxy setting.
Python with mechanize.set_proxies.
Firefox in a virtual machine using an internal network, along with another virtual machine acting as a router (having two adapters: a NAT, and that internal network), set up such that the internal network traffic is routed through a proxy.
TorBrowser (which uses Firefox as the actual browser).
For the first three I used the same proxy. The Tor option was just for additional testing, not via my own proxy. The following things are behaviors I've noticed that are expected:
With all of these, if I go to http://www.whatismyip.com/, it gives the correct IP address (the IP address of the proxy, not the host computer).
whatismyip.com says "No Proxy Detected" for all of these.
Indeed, it seems like the websites I visit do think my IP is that of the proxy. However, there have been a few weird cases which make me think that some sites can somehow detect my original IP address.
In one situation, visiting a non-US site via Firefox with a non-US proxy, the site literally was able to print my originating IP address (from the US) and deny me access. Shouldn't this be impossible? Visiting the site via the virtual machine with that same non-US proxy, or the TorBrowser with a non-US exit node, though, the site was unable to do so.
In a similar situation, I was visiting another non-US site from a non-US proxy. If I logged into the site from Firefox within the virtual machine, or from the TorBrowser with a non-US exit node, the site would work properly. However, if I attempted to log in via Firefox with a proxy (the same proxy the virtual machine uses), or with mechanize, it would fail to log in with an unrelated error message.
In a third situation, using the mechanize.set_proxies option, I overloaded a site with too many requests so it decided to block access (it would purposefully time out whenever I logged in). I thought it might have blocked the proxy's IP address. However, when I ran the code from a different host machine, but with the same proxy, it worked again, for a short while, until they blocked it again. (No worries, I won't be harassing the site any further - I re-ran the program as I thought it might have been a glitch on my end, not a block from their end.) Visiting that site with the Firefox+proxy solution from one of the blocked hosts also resulted in the purposeful timeout.
It seems to me that all of these sites, in the Firefox + proxy and mechanize cases, were able to find out something about the host machine's IP address, whereas in the TorBrowser and virtual machine cases, they weren't.
How are the sites able to gather this information? What is different about the TorBrowser and virtual machine cases that prevents the sites from gathering this information? And, how would I implement my python script so that the sites I'm visiting via the proxy can't detect the host/host's IP address?
It's possible that the proxy is reporting your real IP address in the X-Forwarded-For HTTP header, although if so, I'm surprised that the WhatIsMyIP site didn't tell you about it.
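One quick way to check this yourself (assuming a plain-HTTP proxy and using the third-party echo service httpbin.org, neither of which comes from the question) is to ask a header-echoing endpoint what actually reaches the server:

import requests

proxies = {"http": "http://my.proxy.example:8080"}  # placeholder proxy address
resp = requests.get("http://httpbin.org/headers", proxies=proxies, timeout=10)
# If the proxy injects X-Forwarded-For or Via, it will show up in this dict.
print(resp.json()["headers"])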
If you first visited the non-US site directly, and then later again using the proxy, it's also possible that the site might have set cookies in your browser on your first visit that let the site identify you even after your IP address changes. This could account for the differences you've observed between browser instances.
(I've noticed that academic journal sites like to do that. If I try to access a paywalled article from home and get blocked because I wasn't using my university's proxy server, I'll typically have to clear cookies after enabling the proxy to be allowed access.)

urllib.urlopen to open page on same port just hangs

I am trying to use urllib.urlopen to open a web page running on the same host and port as the page I am loading it from and it is just hanging.
For example I have a page at: "http://mydevserver.com:8001/readpage.html" and I have the following code in it:
data = urllib.urlopen("http://mydevserver.com:8001/testpage.html")
When I try and load the page it just hangs. However if I move the testpage.html script to a different port on the same host it works fine. e.g.
data = urllib.urlopen("http://mydevserver.com:8002/testpage.html")
Does anyone know why this might be and how I can solve the problem?
A firewall, perhaps? Try opening the page from the command line with wget/curl (assuming you're on Linux) or in a browser, trying both ports. You could also run a packet sniffer to find out what's going on and where the connection gets stuck. And if testpage.html is dynamically generated, check whether it is actually hit and whether the request shows up in the web server logs.
Maybe something is already running on port 8001. Does the page open properly with a browser?
You seem to be implying that the page you are accessing is scripted in Python. That implies the Python script is handling the incoming connections, which could mean that, since it is already busy serving the page that makes the urllib call, it is not available to handle the connection that results from it.
Show the code (or tell us what software) you're using to serve these Python scripts.
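To illustrate the failure mode being suggested here, a hypothetical single-threaded server (not the asker's actual code) deadlocks exactly this way, while a threading server does not:

from http.server import BaseHTTPRequestHandler, HTTPServer, ThreadingHTTPServer
from urllib.request import urlopen

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/readpage.html":
            # Nested request to the same server: hangs forever under HTTPServer,
            # because the only thread is already busy serving this very request.
            body = urlopen("http://localhost:8001/testpage.html").read()
        else:
            body = b"hello from testpage"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

# HTTPServer(("", 8001), Handler).serve_forever()          # deadlocks on /readpage.html
ThreadingHTTPServer(("", 8001), Handler).serve_forever()    # each request gets its own thread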
