How to do proxy over proxy (2 layers) by using scrapy? - python

How do I chain two proxies (proxy over proxy, 2 layers) using Scrapy? Assume both are HTTP/HTTPS proxies.
For example: my local machine --> proxy1 --> proxy2 --> the site I want to crawl.
How do I do that in Scrapy?
Why I want to do this?
The goal is to hide my IP address. Suppose proxy1 is very reliable but has been blocked by the site I want to crawl, while proxy2 is unreliable but still has access to the site.
I could do: my local machine --> proxy2 --> the site I want to crawl. But because proxy2 is unreliable, it could expose my IP address to the site. So I want to add another layer in front of proxy2 to protect myself.

What for? To hide your IP address you can use a high-anonymity (elite) proxy.
High-anonymity proxies mask your IP, replacing it with their own. However, the servers you visit may still be able to detect your real IP. This is unlikely, but servers that add code to detect underlying IP addresses can sometimes succeed.
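Note that Scrapy's built-in HttpProxyMiddleware accepts only a single proxy per request (via request.meta['proxy']), so chaining two proxies has to happen at the connection level. If both hops support the HTTP CONNECT method, the chain can be built by hand. Below is a minimal stdlib sketch of that idea; the function name is mine, the addresses are placeholders, and real proxies may additionally require authentication:

```python
import socket

def open_chained_tunnel(proxy1, proxy2, target):
    """Open a TCP connection to `target` through proxy1 and then proxy2,
    issuing an HTTP CONNECT at each hop. Each argument is a (host, port)
    tuple. Returns the connected socket, or raises OSError on refusal."""
    sock = socket.create_connection(proxy1)
    # The first CONNECT asks proxy1 to tunnel to proxy2; the second CONNECT
    # travels through that tunnel and asks proxy2 to reach the target.
    for host, port in (proxy2, target):
        sock.sendall(
            f"CONNECT {host}:{port} HTTP/1.1\r\n"
            f"Host: {host}:{port}\r\n\r\n".encode()
        )
        status_line = sock.recv(4096).split(b"\r\n", 1)[0]
        if b" 200" not in status_line:
            sock.close()
            raise OSError(f"CONNECT to {host}:{port} refused: {status_line!r}")
    return sock
```

Once the tunnel is open you can speak TLS/HTTP over the returned socket; the site only ever sees proxy2's address, and proxy2 only ever sees proxy1's.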

Related

How do I intercept web browser traffic in python?

I want to intercept all destinations, so I can reroute them, kind of like a virtual lan. How would I intercept and find the hostname of a destination packet?
I've searched the web but I haven't found anything. I would like it to work like a device driver: it starts, waits for web browsers to request a specific IP or domain name, and reroutes the request to a different IP or domain name.
You do that using a (local) "proxy" process. There are several existing solutions for setting up such a web proxy, and you can even write one in a few lines of Python that captures HTTP traffic.
However, since most web traffic is nowadays protected by SSL/TLS (HTTPS), you probably can't inspect the plain text of that traffic without resorting to man-in-the-middle techniques.
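As a concrete illustration of the "few lines of Python" idea, here is a minimal plain-HTTP forward proxy built on the stdlib. The rerouting hook marked below is where you would map one host to another; HTTPS would additionally require handling the CONNECT method, which this sketch omits:

```python
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
import urllib.request

class InterceptingProxy(BaseHTTPRequestHandler):
    """A toy forward proxy. A browser configured to use it sends the
    full URL in the request line, so self.path is the destination --
    inspect or rewrite it here before fetching."""

    def do_GET(self):
        url = self.path  # e.g. "http://example.com/page"
        # Rerouting hook: rewrite `url` here to send traffic elsewhere.
        try:
            with urllib.request.urlopen(url) as upstream:
                body = upstream.read()
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        except OSError as exc:
            self.send_error(502, str(exc))

    def log_message(self, fmt, *args):
        pass  # silence per-request logging

# To run: ThreadingHTTPServer(("127.0.0.1", 8888), InterceptingProxy).serve_forever()
```

Point your browser's HTTP proxy setting at 127.0.0.1:8888 and every plain-HTTP request will pass through do_GET.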

Changing IP of python requests

How do I change the IP of HTTP requests in python?
My friend built an API for a website, and sometimes it blocks certain IPs, so we need to change the IP of the request... here is an example:
login_req = self.sess.post('https://www.XXX/YYY', params={...})
Now, each request that it sends, is through the computer's IP, and we need it basically to pass through an imaginary VPN.
Thanks for the help. If something isn't clear I will explain.
Short answer: you can't.
Long answer: it seems like you're misunderstanding how IP addresses work. Your IP address is the network address that corresponds to your computer - when you send a request to a server, you attach your IP as a "return address" of sorts, so that the server can send a response back to you.
However, just like a physical address, you don't get to choose what your IP address is – you live on a street, and that's your address, you don't get to change what the street is called or what your house number is. In general, when you send a request from your computer, the message passes through a chain of devices. For example:
Your computer --> Your router --> Your ISP --> The Server
In a lot of cases, each of these assigns a different IP address to whatever's below it. So, when your request passes through your router, your router records your IP address and then forwards the request through your ISP using its own IP address. That's how several users on the same network can share the same public IP address.
There are physical IP addresses that correspond directly to devices, but there is a limited number of them. Mostly, each Internet Service Provider has a few blocks of IP addresses that it can hand out; an ISP can keep a specific IP address pointed at a specific computer all of the time, but it doesn't have to, and for many of its regular users it doesn't.
Your computer has basically no power to determine what its own public IP address is. There's nothing Python can do about that.
Your Question:
we need [the request] basically to pass through an imaginary VPN.
It'd be easier to actually requisition a real proxy or VPN from somewhere and push your request through it. You'd have to talk with your internet service provider to get them to set something like that up for you specifically, and unless you're representing a reasonably big company they're unlikely to want to put in that effort. Most python libraries that deal with HTTP can easily handle proxy servers, so once you figure it out it shouldn't be a problem.
You can use an IP address from https://www.sslproxies.org/
For example,
import requests
response = requests.get("yourURL", proxies={'http': 'http://219.121.1.93:80', 'https': 'https://219.121.1.93:80'})
The IP addresses on that site are pretty crappy and sometimes don't work, so it would be best to find a way to constantly scrape IP addresses from the site so you have a couple to try. Check out this article: https://www.scrapehero.com/how-to-rotate-proxies-and-ip-addresses-using-python-3/
Warning: these proxies should not be used for sensitive information, as they are not secure. Don't use those IP addresses unless you are OK with anyone in the world knowing what you're doing.
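The rotation idea suggested above can be sketched roughly like this. The function name and arguments are hypothetical; in practice you would pass requests.get as the `get` callable (requests' exceptions subclass OSError, so the catch below covers them) along with a timeout:

```python
import itertools

def fetch_with_rotation(url, proxy_pool, get, tries=None):
    """Try `get(url, proxies=...)` with each proxy in turn until one works.

    proxy_pool: a list of proxy URLs, e.g. scraped from sslproxies.org.
    get: a callable like requests.get.
    """
    tries = tries if tries is not None else len(proxy_pool)
    last_error = None
    for proxy in itertools.islice(itertools.cycle(proxy_pool), tries):
        try:
            return get(url, proxies={"http": proxy, "https": proxy})
        except OSError as exc:
            last_error = exc  # this proxy is dead or blocked; rotate
    raise last_error
```

Since free proxies die constantly, re-scraping the pool periodically and letting dead entries rotate out is the part that matters most.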

How to listen to/forward all ports on an interface, in Python or otherwise

I am writing an application, currently in Python + Twisted, which serves as a first port of call for DNS requests – if the requests meet certain patterns, e.g. Namecoin .bit addresses or OpenNIC TLDs, they are passed to different DNS resolvers, otherwise the default one is used.
Some addresses however I need redirected through special routes, e.g. Tor .onion addresses which don't resolve to traditional IPv4 addresses, or certain websites that require tunneling through a VPN for geolocation reasons. So when DNS requests for such sites come in, I want the application to create a new loopback interface/alias and return the IP of this interface. I then need to be able to tunnel all TCP and UDP traffic coming through this interface through the proxy/VPN or whatever to the endpoint it was set up for.
The question is, how can I do this? Listening on specific ports (e.g. 80) will be fine for most purposes, but as a perfectionist I would like to know how to accept connections/messages sent to ALL ports, without having to set up tens of thousands of listeners and potentially crashing the system.
Note: while everything is currently in Python, I don't mind adding components in C++ or another language, or playing around with network configurations to get this working.

Changing IP for a scraping script

I am trying to scrape a website, but my IP has been banned after a while. I have tried using a Tor proxy, but it's unstable and slow. Therefore I think the best solution might be a standard proxy that changes its IP, say, once per 12 hours. Or do you have any other suggestion?
IP spoofing is useless, since the server's response would be delivered to some undesired address. You'll either have to ask the site owner or set up something like a botnet, which won't be easy or cheap.

how to use proxies without the remote site being able to detect the host/host IP?

I'm attempting to use a proxy, via python, in order to log into a site from a different, specific IP address. It seems that certain websites, however, can detect the original (host) IP address. I've investigated the issue a bit and here's what I found.
There are four proxy methods I've tried:
Firefox with a proxy setting.
Python with mechanize.set_proxies.
Firefox in a virtual machine using an internal network, along with another virtual machine acting as a router (having two adapters: a NAT, and that internal network), set up such that the internal network traffic is routed through a proxy.
TorBrowser (which uses Firefox as the actual browser).
For the first three I used the same proxy. The Tor option was just for additional testing, not via my own proxy. The following things are behaviors I've noticed that are expected:
With all of these, if I go to http://www.whatismyip.com/, it gives the correct IP address (the IP address of the proxy, not the host computer).
whatismyip.com says "No Proxy Detected" for all of these.
Indeed, it seems like the websites I visit do think my IP is that of the proxy. However, there have been a few weird cases which makes me think that some sites can somehow detect my original IP address.
In one situation, visiting a non-US site via Firefox with a non-US proxy, the site literally was able to print my originating IP address (from the US) and deny me access. Shouldn't this be impossible? Visiting the site via the virtual machine with that same non-US proxy, or the TorBrowser with a non-US exit node, though, the site was unable to do so.
In a similar situation, I was visiting another non-US site from a non-US proxy. If I logged into the site from Firefox within the virtual machine, or from the TorBrowser with a non-US exit node, the site would work properly. However, if I attempted to log in via Firefox with a proxy (the same proxy the virtual machine uses), or with mechanize, it would fail to log in with an unrelated error message.
In a third situation, using the mechanize.set_proxies option, I overloaded a site with too many requests so it decided to block access (it would purposefully time out whenever I logged in). I thought it might have blocked the proxy's IP address. However, when I ran the code from a different host machine, but with the same proxy, it worked again, for a short while, until they blocked it again. (No worries, I won't be harassing the site any further - I re-ran the program as I thought it might have been a glitch on my end, not a block from their end.) Visiting that site with the Firefox+proxy solution from one of the blocked hosts also resulted in the purposeful timeout.
It seems to me that all of these sites, in the Firefox + proxy and mechanize cases, were able to find out something about the host machine's IP address, whereas in the TorBrowser and virtual machine cases, they weren't.
How are the sites able to gather this information? What is different about the TorBrowser and virtual machine cases that prevents the sites from gathering this information? And, how would I implement my python script so that the sites I'm visiting via the proxy can't detect the host/host's IP address?
It's possible that the proxy is reporting your real IP address in the X-Forwarded-For HTTP header, although if so, I'm surprised that the WhatIsMyIP site didn't tell you about it.
If you first visited the non-US site directly, and then later again using the proxy, it's also possible that the site might have set cookies in your browser on your first visit that let the site identify you even after your IP address changes. This could account for the differences you've observed between browser instances.
(I've noticed that academic journal sites like to do that. If I try to access a paywalled article from home and get blocked because I wasn't using my university's proxy server, I'll typically have to clear cookies after enabling the proxy to be allowed access.)
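To see what an X-Forwarded-For leak looks like from the server's side, here is a small self-contained simulation: a local echo server plays the origin site, and the client plays a badly behaved transparent proxy that appends the real address (203.0.113.7 is a documentation-range placeholder):

```python
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
import threading
import urllib.request

class EchoXFF(BaseHTTPRequestHandler):
    """Replies with whatever X-Forwarded-For header it received --
    i.e. what the origin server would learn about the client."""

    def do_GET(self):
        body = self.headers.get("X-Forwarded-For", "none").encode()
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass

server = ThreadingHTTPServer(("127.0.0.1", 0), EchoXFF)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Play the transparent proxy's role: forward a request, adding the
# client's "real" address in the X-Forwarded-For header.
req = urllib.request.Request(
    "http://127.0.0.1:%d/" % server.server_address[1],
    headers={"X-Forwarded-For": "203.0.113.7"},
)
leaked = urllib.request.urlopen(req).read().decode()
print(leaked)
```

A high-anonymity proxy simply never adds that header, which is why the sites behind it see only the proxy's address.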
