I am working on web-crawler [using python].
Situation is, for example, I am behind server-1 and I use proxy setting to connect to the Outside world. So in Python, using proxy-handler I can fetch the urls.
Now thing is, I am building a crawler so I cannot use only one IP [otherwise I will be blocked]. To solve this, I have bunch of Proxies, I want to shuffle through.
My question is: This is two level proxy, one to connect to main server-1, I use proxy and then after to shuffle through proxies, I want to use proxy. How can I achieve this?
Update Sounds like you're looking to connect to proxy A and from there initiate HTTP connections via proxies B, C, D which are outside of A. You might look into the proxychains project which says it can "tunnel any protocol via a user-defined chain of TOR, SOCKS 4/5, and HTTP proxies".
Version 3.1 is available as a package in Ubuntu Lucid. If it doesn't work directly for you, the proxychains source code may provide some insight into how this capability could be implemented for your app.
Orig answer:
Check out the urllib2.ProxyHandler. Here is an example of how you can use several different proxies to open urls:
import random
import urllib2
# put the urls for all of your proxies in a list
proxies = ['http://localhost:8080/']
# construct your list of url openers which each use a different proxy
openers = []
for proxy in proxies:
opener = urllib2.build_opener(urllib2.ProxyHandler({'http': proxy}))
openers.append(opener)
# select a url opener randomly, round-robin, or with some other scheme
opener = random.choice(openers)
req = urllib2.Request(url)
res = opener.open(req)
I recommend you take a look at CherryProxy. It lets you send a proxy request to an intermediate server (where CherryProxy is running) and then forward your HTTP request to a proxy on a second level machine (e.g. squid proxy on another server) for processing. Viola! A two-level proxy chain.
http://www.decalage.info/python/cherryproxy
Related
I am trying to implement a script which tests a tracking URL is reachable to the app on the app store or not.
For this, I need to get the IP address of country in question and using that IP I need to request the tracking URL. So it will be like request went through that country and we will come to know if URL is reachable to the app store for that country or not.
Question: How do I request the URL as if its requested from IP I provide?
Example:
def check_url_is_valid(url, ip_address):
# Trying to request url using ip_address
# return True or False
PS: Not expecting to complete this code but any direction or guidance is appreciated.
There is no way to get a "country's IP address" since there is no such thing. There are ranges of IP addresses corresponding to different ISP's in different locations. You should pass each request through a proxy that you know is located where you want it to be.
For that you will need to create your own proxy list for each country and pass your requests through proxy every time for each country. You should explore some possible free or payed proxies for that.
Anyway, once you do, sending a request through a proxy can be done like this:
proxyDict = {
"http" : "http://1.1.1.1:123",
"https" : "https://1.1.1.1:456",
"ftp" : "ftp://1.1.1.1:789"
}
r = requests.get(url, proxies=proxyDict)
Of course you should replace the fake addresses above with real proxies that are good for what you want.
By the way, i'm sure there are off the shelf solutions for that, so maybe you should seek them out first instead of "reinventing the wheel". For example: https://www.uptrends.com/tools/uptime
You can use web proxies that allow hotlinking or APIs, or you can use proxychains if you are on linux, or if you want to go for manual effort go for VPNs.
You need to use 3rd party service which manually checks the URL by country/region, e.g. asm.ca.com I guess there's no way you can do it for specific IP. So you should determine the country by IP first.
How do I include my automatic proxy config file in HTTP libraries like
urllib or requests.
pacfile = 'http://myintranet.com/proxies/ourproxies.pac'
proxy = urllib3.ProxyManager(????????????????)
I've created a pure-Python library called PyPAC which should do what you're looking for. It provides a subclass of requests.Session that includes honours PACs and includes PAC auto-discovery.
Current there is no support for a proxy PAC file directly in urllib3 or requests. While support could in principle be added for proxy PAC files, because they are Javascript files that require interpretation it is likely to be extremely difficult to provide broad-based support.
In principle you could use requests/urllib3 to request the Proxy PAC file, then pass it to something like Node.JS for interpreting, then parse the results back in Python to pass to urllib3/requests, but nothing like that exists out of the box.
Use PYPAC.
from pypac import PACSession, get_pac
pac = get_pac(url='http://your/pac/url/file.pac')
session = PACSession(pac, proxy_auth=HTTPProxyAuth('your_user', 'password'))
print(session.get('http://www.google.com'))
you will get a 200
I developed till now with different webapp-servers (Tornado, Django, ...) and am encountering the same problem again and again:
I want a simple web proxy (reverse proxy) that allows me, that I can combine different source entities from other web-servers (that could be static files, dynamic content from an app server or other content) to one set of served files. That means, the browser should see them as if they come from one source.
I know, that I can do that with nginx, but I am searching for an even simpler tool for development. I want something, that can be started on command line and does not need to run as root. Changing the configuration (routing of the requests) should be as simple as possible.
In development, I just want to be able to mashup different sources. For example: On my production server runs something, that I don't want to copy, but I want to connect with static files on a different server and also a new application on my development system.
Speed of the proxy is not the issue, just flexibility and speed of development!
Preferred would be a Python or other scripting solution. I found also a big list of Python proxys, but after scanning the list, I found that all are lacking. Most of them just connect to one destination server and no way to have multiple servers (the proxy has to decide which to take by analysis of the local url).
I am just wondering, that nobody else has this need ...
You do not need to start nginx as root as long as you do not let it serve on port 80. If you want it to run on port 80 as a normal user, use setcap. In combination with a script that converts between an nginx configuration file and a route specification for your reverse proxy, this should give you the most reliable solution.
If you want something simpler/smaller, it should be pretty straight-forward to write a script using Python's BaseHTTPServer and urllib. Here's an example that only implements GET, you'd have to extend it at least to POST and add some exception handling:
#!/usr/bin/env python
# encoding: utf-8
import BaseHTTPServer
import SocketServer
import urllib
import re
FORWARD_LIST = {
'/google/(.*)': r'http://www.google.com/%s',
'/so/(.*)': r'http://www.stackoverflow.com/%s',
}
class HTTPServer(BaseHTTPServer.HTTPServer, SocketServer.ThreadingMixIn):
pass
class ProxyHandler(BaseHTTPServer.BaseHTTPRequestHandler):
def do_GET(self):
for pattern, url in FORWARD_LIST.items():
match = re.search(pattern, self.path)
if match:
url = url % match.groups()
break
else:
self.send_error(404)
return
dataobj = urllib.urlopen(url)
data = dataobj.read()
self.send_response(200)
self.send_header("Content-Length", len(data))
for key, value in dataobj.info().items():
self.send_header(key, value)
self.end_headers()
self.wfile.write(data)
HTTPServer(("", 1234), ProxyHandler).serve_forever()
Your use case should be covered by:
https://mitmproxy.org/doc/features/reverseproxy.html
There is now a proxy that covers my needs (and more) -- very lightweight and very good:
Devd
I have a website developed in flask running on an apache2 server that responds on port 80 to two URLs
Url-1 http://www.example.com
Url-2 http://oer.example.com
I want to detect which of the two urls the user is coming in from and adjust what the server does and store the variable in a config variable
app.config['SITE'] = 'OER'
or
app.config['SITE'] = 'WWW'
Looking around on the internet I can find lots of examples using urllib2 the issue is that you need to pass it the url you want to slice and I cant find a way to pull that out as it may change between the two with each request.
I could fork the code and put up two different versions but that's as ugly as a box of frogs.
Thoughts welcome.
Use the Flask request object (from flask import request) and one of the following in your request handler:
hostname = request.environ.get('HTTP_HOST', '')
url = urlparse(request.url)
hostname = url.netloc
This will get e.g. oer.example.com or www.example.com. If there is a port number that will be included too. Keep in mind that this ultimately comes from the client request so "bad" requests might have it set wrong, although hopefully apache wouldn't route those to your app.
I'm trying to use the salesforce-python-toolkit to make web services calls to the Salesforce API, however I'm having trouble getting the client to go through a proxy. Since the toolkit is based on top of suds, I tried going down to use just suds itself to see if I could get it to respect the proxy setting there, but it didn't work either.
This is tested on suds 0.3.9 on both OS X 10.7 (python 2.7) and ubuntu 12.04.
an example request I've made that did not end up going through the proxy (just burp or charles proxy running locally):
import suds
ws = suds.client.Client('file://sandbox.xml',proxy={'http':'http://localhost:8888'})
ws.service.login('user','pass')
I've tried various things with the proxy - dropping http://, using an IP, using a FQDN. I've stepped through the code in pdb and see it setting the proxy option. I've also tried instantiating the client without the proxy and then setting it with:
ws.set_options(proxy={'http':'http://localhost:8888'})
Is proxy not used by suds any longer? I don't see it listed directly here http://jortel.fedorapeople.org/suds/doc/suds.options.Options-class.html, but I do see it under transport. Do I need to set it differently through a transport? When I stepped through in pdb it did look like it was using a transport, but I'm not sure how.
Thank you!
I went into #suds on freenode and Xelnor/rbarrois provided a great answer! Apparently the custom mapping in suds overrides urllib2's behavior for using the system configuration environment variables. This solution now relies on having the http_proxy/https_proxy/no_proxy environment variables set accordingly.
I hope this helps anyone else running into issues with proxies and suds (or other libraries that use suds). https://gist.github.com/3721801
from suds.transport.http import HttpTransport as SudsHttpTransport
class WellBehavedHttpTransport(SudsHttpTransport):
"""HttpTransport which properly obeys the ``*_proxy`` environment variables."""
def u2handlers(self):
"""Return a list of specific handlers to add.
The urllib2 logic regarding ``build_opener(*handlers)`` is:
- It has a list of default handlers to use
- If a subclass or an instance of one of those default handlers is given
in ``*handlers``, it overrides the default one.
Suds uses a custom {'protocol': 'proxy'} mapping in self.proxy, and adds
a ProxyHandler(self.proxy) to that list of handlers.
This overrides the default behaviour of urllib2, which would otherwise
use the system configuration (environment variables on Linux, System
Configuration on Mac OS, ...) to determine which proxies to use for
the current protocol, and when not to use a proxy (no_proxy).
Thus, passing an empty list will use the default ProxyHandler which
behaves correctly.
"""
return []
client = suds.client.Client(my_wsdl, transport=WellBehavedHttpTransport())
I think you can do by using a urllib2 opener like below.
import suds
t = suds.transport.http.HttpTransport()
proxy = urllib2.ProxyHandler({'http': 'http://localhost:8888'})
opener = urllib2.build_opener(proxy)
t.urlopener = opener
ws = suds.client.Client('file://sandbox.xml', transport=t)
I was actually able to get it working by doing two things:
making sure there were keys in the proxy dict for http as well as https.
setting the proxy using set_options AFTER creation of the client.
So, my relevant code looks like this:
self.suds_client = suds.client.Client(wsdl)
self.suds_client.set_options(proxy={'http': 'http://localhost:8888', 'https': 'http://localhost:8888'})
I had multiple issues using Suds, even though my proxy was configured properly I could not connect to the endpoint wsdl. After spending significant time attempting to formulate a workaround, I decided to give soap2py a shot - https://code.google.com/p/pysimplesoap/wiki/SoapClient
Worked straight off the bat.
For anyone who's attempting cji's solution over HTTPS, you actually need to keep one of the handlers for the basic authentication. I also am using python3.7 so urllib2 has been replaced with urllib.request.
from suds.transport.https import HttpAuthenticated as SudsHttpsTransport
from urllib.request import HTTPBasicAuthHandler
class WellBehavedHttpsTransport(SudsHttpsTransport):
""" HttpsTransport which properly obeys the ``*_proxy`` environment variables."""
def u2handlers(self):
""" Return a list of specific handlers to add.
The urllib2 logic regarding ``build_opener(*handlers)`` is:
- It has a list of default handlers to use
- If a subclass or an instance of one of those default handlers is given
in ``*handlers``, it overrides the default one.
Suds uses a custom {'protocol': 'proxy'} mapping in self.proxy, and adds
a ProxyHandler(self.proxy) to that list of handlers.
This overrides the default behaviour of urllib2, which would otherwise
use the system configuration (environment variables on Linux, System
Configuration on Mac OS, ...) to determine which proxies to use for
the current protocol, and when not to use a proxy (no_proxy).
Thus, passing an empty list (asides from the BasicAuthHandler)
will use the default ProxyHandler which behaves correctly.
"""
return [HTTPBasicAuthHandler(self.pm)]