Using IP authenticated proxies in a distributed crawler - python

I'm working on a distributed web crawler in Python running on a cluster of CentOS 6.3 servers; the crawler uses many proxies from different proxy providers. Everything works like a charm for the username/password-authenticated proxy providers. But now we have bought some proxies that use IP-based authentication, which means that when I want to crawl a webpage through one of these proxies I need to make the request from a specific subset of our servers.
The question is: is there a way in Python (using a library/software) to make a request to a domain passing through 2 proxies? (One proxy would be from the subset needed for the IP authentication and the second the actual proxy from the provider.) Or is there another way to do this without setting up this subset of our servers as proxies?
The code I'm using now to make the request through a proxy uses the requests library:
import requests
from requests.auth import HTTPProxyAuth

# proxy holds one provider entry: its address plus the username/password.
proxy_obj = {
    'http': proxy['ip']
}
auth = HTTPProxyAuth(proxy['username'], proxy['password'])  # sends Proxy-Authorization
data = requests.get(url, proxies=proxy_obj, auth=auth)
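For reference, requests also accepts the credentials embedded directly in the proxy URL, which makes the HTTPProxyAuth object unnecessary; a sketch, assuming proxy['ip'] is a bare host:port pair:

import requests

# Embed the credentials in the proxy URL itself.
proxy_url = 'http://%s:%s@%s' % (proxy['username'], proxy['password'], proxy['ip'])
proxy_obj = {'http': proxy_url}
data = requests.get(url, proxies=proxy_obj)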
Thanks in advance!

is there a way in Python (using a library/software) to make a request to a domain passing through 2 proxies?
If you need to go through two proxies, it looks like you'll have to use HTTP tunneling: any host which isn't on the authorized list would have to connect to an HTTP proxy server on one of the hosts which is, and use the HTTP CONNECT method to create a tunnel through it to the remote proxy. It may not be possible to achieve that with the requests library alone, though.
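A minimal sketch of the chained-CONNECT idea using raw sockets (the addresses are placeholders; a real implementation would also need to read the proxy replies robustly and handle errors):

import socket

# Placeholder addresses: first_proxy runs on one of the authorized hosts,
# second_proxy is the provider's IP-authenticated proxy.
first_proxy = ('10.0.0.5', 3128)
second_proxy = ('203.0.113.20', 8080)
target = ('example.com', 80)

def connect_through(sock, host, port):
    # Ask the proxy at the other end of sock to open a tunnel.
    sock.sendall(('CONNECT %s:%d HTTP/1.1\r\nHost: %s:%d\r\n\r\n'
                  % (host, port, host, port)).encode())
    status = sock.recv(4096).decode().split('\r\n', 1)[0]
    if ' 200' not in status:
        raise IOError('CONNECT failed: ' + status)

sock = socket.create_connection(first_proxy)
connect_through(sock, *second_proxy)   # tunnel to the second proxy...
connect_through(sock, *target)         # ...and through it to the target

# The socket now behaves like a direct connection to the target.
sock.sendall(b'GET / HTTP/1.1\r\nHost: example.com\r\nConnection: close\r\n\r\n')
print(sock.recv(4096).decode())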
Or is there another way to do this without setting up this subset of our servers as proxies?
Assuming that the remote proxies which use IP address-based authentication all expect the same IP address, you could instead configure a NAT router between your cluster and the remote proxies, translating all outbound HTTP requests so that they appear to come from that single IP address.
But, before you look into implementing either of these unnecessarily complicated options, and given that you're paying for this service, can't you just ask the provider to authorize the entire range of IP addresses which you're currently using?

Related

Can I intercept HTTP requests that are coming for another application and port using python

I am currently thinking about a project that automatically executes defensive actions, such as adding the IP of a DoS attacker to the iptables list so their requests are dropped permanently.
My question is: can I intercept the HTTP requests that are coming in for another application, using Python? For example, can I count how many times an Apache server running on port 80 received an HTTP POST request, and extract the sender of each one, etc.?
I tried looking into the requests documentation but couldn't find anything relevant.
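requests is purely an HTTP client library, so it won't help here; one way to observe traffic addressed to another process is packet sniffing. A rough sketch, assuming the third-party scapy package is installed and the script runs as root:

from collections import Counter
from scapy.all import IP, Raw, sniff

post_counts = Counter()

def count_posts(pkt):
    # Count TCP payloads headed for port 80 that look like HTTP POSTs.
    if pkt.haslayer(Raw) and pkt[Raw].load.startswith(b'POST '):
        post_counts[pkt[IP].src] += 1
        print('%s has sent %d POSTs' % (pkt[IP].src, post_counts[pkt[IP].src]))

# Requires root privileges; sniffing does not interfere with Apache itself.
sniff(filter='tcp dst port 80', prn=count_posts, store=False)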

Is it possible to tunnel 2 proxy servers through websocket

A proxy server forwards HTTP traffic from a client to a host. It really has two jobs: (A) receive data from the client and (B) send data to the server, and vice versa.
Now what if we separate these two tasks into 2 different proxy servers and connect those 2 servers using another protocol, such as WebSocket?
Why do I want to do this? My initial intention is to bypass internet censorship in regions where most of the internet is blocked and only some protocols and servers (including Cloudflare) are reachable. By doing this we can add a reverse proxy in front of our client, so our proxy server B will remain anonymous.
WebSocket is used here because only standard HTTP and WebSocket are allowed through Cloudflare, not HTTP(S) proxying. And in case WebSocket is blocked (which sounds unlikely), we might use another intermediary like SSH, FTP, or HTTP. What are your thoughts about this? Is it possible? Is there such a proxy server out there? Or is there a better way?
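The split is certainly possible, since the piece between the two halves is just a byte relay. A minimal sketch of what server B could look like, assuming a recent version of the third-party websockets package (where the handler takes a single connection argument), binary frames, and a hypothetical fixed destination:

import asyncio
import websockets

TARGET = ('example.com', 80)  # hypothetical fixed destination

async def relay(ws):
    # One plain TCP connection to the target per websocket client.
    reader, writer = await asyncio.open_connection(*TARGET)

    async def ws_to_tcp():
        async for message in ws:   # assumes binary frames
            writer.write(message)
            await writer.drain()

    async def tcp_to_ws():
        while data := await reader.read(4096):
            await ws.send(data)

    # Pump bytes in both directions until one side closes.
    await asyncio.gather(ws_to_tcp(), tcp_to_ws())

async def main():
    async with websockets.serve(relay, '0.0.0.0', 8765):
        await asyncio.Future()  # run forever

asyncio.run(main())

The client-side half (proxy server A) would do the mirror image: accept local TCP connections and forward the bytes over a websocket to B.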

Python requests get website using custom dns

I need to access a specific server which only responds to connections when its name is resolved through a specific DNS server. So before connecting to that website I have to set my system's DNS servers to custom IPs. That works, but now I'm writing a Python script with the requests module and I want to access that server from it. How can I give a requests session custom DNS IPs so that a GET uses those DNS servers?
I should say that I just need a JSON file from that server, so it's simply exhausting to change the system DNS servers every time.
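requests has no DNS-server setting of its own, but one workaround is to resolve the name yourself and then connect to the resulting IP. A sketch, assuming the third-party dnspython package (2.x, which provides resolver.resolve) and a plain-HTTP endpoint; the hostname, DNS server, and path are placeholders, and HTTPS would additionally raise certificate/SNI issues:

import dns.resolver
import requests

hostname = 'example.com'      # placeholder server name
custom_dns = ['192.0.2.53']   # placeholder custom DNS server

resolver = dns.resolver.Resolver(configure=False)  # ignore /etc/resolv.conf
resolver.nameservers = custom_dns
ip = resolver.resolve(hostname, 'A')[0].address

# Connect to the IP directly, keeping the original Host header
# so name-based virtual hosting on the server still works.
resp = requests.get('http://%s/data.json' % ip, headers={'Host': hostname})
print(resp.json())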

Python JSON fetching via Tor Socks5 proxy

I am trying to fetch JSON data with a Python 3 script on Tails. I would like to know whether this code is secure and doesn't leak my IP or anything else. I know that Tails is configured to block any problematic connection, but I would still like to know if my code is safe.
import json
import requests

url = 'https://api.bitcoincharts.com/v1/markets.json'
proxy = {'https': 'socks5://127.0.0.1:9050'}

with open('datafile', 'w') as outfile:
    json.dump(requests.get(url, proxies=proxy).json(), outfile)
As you can see I am using requests, which is what's suggested for proxies. I use socks5 just as the docs suggest, configured for the localhost port 9050 that Tor listens on.
I guess if the website were plain http then I would have to change the proxy key to 'http' as well.
One thing I am not sure about is whether to use port 9150 or 9050; the code seems to work with both, but I don't know which one is safer.
Other than these, is my code safe to use on Tails?
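One detail worth double-checking in a setup like this: with requests, a socks5:// proxy URL resolves hostnames locally, while socks5h:// pushes DNS resolution through the proxy as well, which is generally what you want with Tor. (Port 9050 is the standalone Tor daemon's SOCKS port; 9150 belongs to the Tor Browser bundle, so use whichever service is actually running.) A sketch of the socks5h variant:

import requests

# socks5h:// makes Tor resolve hostnames too, so no DNS
# lookups happen outside the Tor circuit.
proxy = {
    'http': 'socks5h://127.0.0.1:9050',
    'https': 'socks5h://127.0.0.1:9050',
}
print(requests.get('https://check.torproject.org/api/ip', proxies=proxy).json())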

Different IP for each bot?

I'm writing a Python bot that will request a URL from different IP addresses on one computer. Is there a way to change my IP address for free and apply it to the bot? I have looked around, and people seem to say that I should use proxies for this. But I'm not familiar with proxies or how to implement them in Python. It'd be great if someone could guide me.
Thanks
You can change your IP in Python, but your gateway will not be able to route an IP that is not in your subnet.
Therefore, you have to use a proxy or a different router.
If you have/know an active router that will forward your packets using NAT, you can use it as the gateway for the route to the IP of the URL you are going to request.
For changing routes you can use this package: https://pypi.python.org/pypi/pyroute2
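A minimal sketch of adding a per-destination route with pyroute2 (needs root; the addresses are placeholders):

from pyroute2 import IPRoute

ipr = IPRoute()
# Send traffic for this one destination through a hypothetical NAT router.
ipr.route('add', dst='203.0.113.7/32', gateway='192.168.1.254')
ipr.close()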
For using proxies directly in your bot with the requests library, you can check this documentation: http://docs.python-requests.org/en/latest/user/advanced/.
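A sketch of the proxies option described there (the proxy address and credentials are placeholders you'd get from a proxy provider):

import requests

proxies = {
    'http': 'http://user:password@203.0.113.10:3128',
    'https': 'http://user:password@203.0.113.10:3128',
}
resp = requests.get('http://example.com', proxies=proxies)
print(resp.status_code)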
Another thing you might do is rent some VPS servers to get different worldwide IPs.
