I have Fiddler2 listening on 0.0.0.0:8888.
import urllib2

try:
    data = ''
    proxy = urllib2.ProxyHandler({'http': '127.0.0.1:8888'})  # also tried {'http': 'http://127.0.0.1:8888/'}
    opener = urllib2.build_opener(proxy)
    urllib2.install_opener(opener)
    req = urllib2.Request('http://www.google.com')
    response = urllib2.urlopen(req)
    the_page = response.read()
    print the_page
except Exception, detail:
    print "Err ", detail
I don't see the GET, or any request to Google, in Fiddler (but I can see other requests).
Is there a way to debug this? It seems like Python bypasses Fiddler or ignores the proxy.
I also configured WinHTTP to work with Fiddler -
C:\Windows\system32>netsh winhttp set proxy 127.0.0.1:8888
Current WinHTTP proxy settings:
Proxy Server(s) : 127.0.0.1:8888
Bypass List : (none)
Does it matter if the request is to an SSL address? (Fiddler supports HTTPS.)
Thanks!
Maybe you can work with the opener directly instead of installing it. Turn on your Fiddler proxy listener on port 8008 (I'm using WebScarab, but they're probably the same), then try this code exactly (it also handles cookies, which you don't need, but let's try it as-is and narrow it down later):
import os
import urllib
import urllib2
import cookielib

# cookie_filename, email and passw are expected to be defined beforehand
cj = cookielib.MozillaCookieJar(cookie_filename)
if os.access(cookie_filename, os.F_OK):
    cj.load()
proxy_handler = urllib2.ProxyHandler({'https': 'localhost:8008'})
opener = urllib2.build_opener(
    proxy_handler,
    urllib2.HTTPCookieProcessor(cj)
)
opener.addheaders = [
    ('User-agent', ('Mozilla/4.0 (compatible; MSIE 6.0; '
                    'Windows NT 5.2; .NET CLR 1.1.4322)'))
]
auth = urllib.urlencode({'email': email, 'pass': passw})
data = opener.open('https://login.facebook.com/login.php', data=auth)
So, the things I'm doing differently: using the opener directly, changing the port to 8008, adding cookies, and using WebScarab. Let me know which of these did the trick for you.
proxy_bypass_registry in urllib.py does not handle the ProxyOverride registry value properly: it treats an empty override as *, i.e. bypass the proxy for all hosts. This behavior does not match other programs (e.g. Chrome).
There are a number of possible workarounds:
Set urllib.proxy_bypass = lambda h: 0 to disable bypass checking.
Specify the proxy settings in the http_proxy environment variable (proxy_bypass_registry is not called in this case).
In Fiddler2, go to the page Tools->Fiddler Options ...->Connections, remove the trailing semicolon from the value in the "IE should bypass Fiddler for ..." field and restart Fiddler2.
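For readers on Python 3, where urllib2 became urllib.request, the environment-variable workaround might be sketched roughly as follows. This is only an illustration: it assumes Fiddler is still listening on 127.0.0.1:8888 as in the question, and the names FIDDLER_PROXY, detected, handler and opener are made up for the example. No request is sent; the snippet only shows which proxy urllib would pick up.

```python
import os
import urllib.request

# Assumption: Fiddler listens on 127.0.0.1:8888, as in the question.
FIDDLER_PROXY = "http://127.0.0.1:8888"

# Put the proxy in the environment so that urllib takes its proxy
# settings from there rather than from the Windows registry.
os.environ["http_proxy"] = FIDDLER_PROXY

# getproxies() checks the <scheme>_proxy environment variables first.
detected = urllib.request.getproxies()
print(detected.get("http"))

# Alternatively, pass the proxy explicitly and use the opener directly.
handler = urllib.request.ProxyHandler({"http": FIDDLER_PROXY})
opener = urllib.request.build_opener(handler)
# opener.open("http://www.google.com/") should now be routed through Fiddler.
```

Printing getproxies() before and after setting the variable is also a quick way to see what Python believes the proxy configuration is.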
In Fiddler2, go to the page Tools -> Fiddler Options ... -> Connections, remove the trailing semicolon from the value in the IE should bypass Fiddler for ... field and restart Fiddler2.
This solution definitely works for me when using a urllib2 proxy; however, I still don't understand why removing the trailing semicolon solves it.
By the way, you need to use http://www.google.com/ instead of http://www.google.com so that Fiddler can tell you are requesting 'GET /';
otherwise Fiddler cannot figure out the URI (you may get a 504 receive failure).
I am trying to scrape a website using requests in Python.
import requests

url = "https://stackoverflow.com/questions/23013220/max-retries-exceeded-with-url"
# set the headers as if we were a browser
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'}
# download the page
s = requests.Session()
s.trust_env = False  # do not read proxy settings from the environment
response = s.get(url, headers=headers)
This is working fine when I use my personal wifi. However, when I connect to my company's VPN, I get the following error.
ConnectionError: HTTPSConnectionPool(host='stackoverflow.com', port=443): Max retries exceeded with url: /questions/23013220/max-retries-exceeded-with-url (Caused by NewConnectionError(': Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it',))
I need this to work over my company's VPN because I need to access a website that is reachable only through it. How can I resolve this?
In my case, the problem was related to IPv6.
Our VPN used split tunneling, and it seems the VPN configuration does not support IPv6.
So for example this would hang forever:
requests.get('https://pokeapi.co/api/v2/pokemon')
But if you add a timeout, the request succeeds:
requests.get('https://pokeapi.co/api/v2/pokemon', timeout=1)
But not all machines were having this problem. So I compared the output of this among two different machines:
import socket
for line in socket.getaddrinfo('pokeapi.co', 443):
    print(line)
The working one only returned IPv4 addresses. The non-working machine returned both IPv4 and IPv6 addresses.
So with the timeout specified, my theory is that Python fails quickly on the IPv6 address and then moves on to IPv4, where the request succeeds.
Ultimately we resolved this by disabling IPv6 on the machine:
networksetup -setv6off "Wi-Fi"
But I assume that this could instead be resolved through VPN configuration.
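To make the comparison between machines easier to read, the getaddrinfo output can be grouped by address family. The following is a minimal sketch; the helper name addresses_by_family and the sample addresses (taken from the documentation ranges, not from pokeapi.co) are assumptions for illustration.

```python
import socket
from collections import defaultdict

def addresses_by_family(infos):
    """Group getaddrinfo() results into {'IPv4': [...], 'IPv6': [...]}."""
    labels = {socket.AF_INET: "IPv4", socket.AF_INET6: "IPv6"}
    grouped = defaultdict(list)
    for family, _type, _proto, _canonname, sockaddr in infos:
        label = labels.get(family, str(family))
        address = sockaddr[0]
        # The same address usually appears once per socket type; keep it once.
        if address not in grouped[label]:
            grouped[label].append(address)
    return dict(grouped)

# Made-up sample in the shape returned by socket.getaddrinfo().
sample = [
    (socket.AF_INET, socket.SOCK_STREAM, 6, "", ("192.0.2.10", 443)),
    (socket.AF_INET6, socket.SOCK_STREAM, 6, "", ("2001:db8::10", 443, 0, 0)),
    (socket.AF_INET, socket.SOCK_DGRAM, 17, "", ("192.0.2.10", 443)),
]
print(addresses_by_family(sample))

# On a real machine:
# print(addresses_by_family(socket.getaddrinfo("pokeapi.co", 443)))
# To ask the resolver for IPv4 results only:
# socket.getaddrinfo("pokeapi.co", 443, family=socket.AF_INET)
```

A machine whose output contains an 'IPv6' entry while the VPN cannot route IPv6 would match the failing case described above.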
How about trying it like this:
import requests
from fake_useragent import UserAgent  # assumption: UserAgent comes from the fake-useragent package

url = "https://stackoverflow.com/questions/23013220/max-retries-exceeded-with-url"
ua = UserAgent()
headers = {"User-Agent": ua.random}
# download the page
s = requests.Session()
s.trust_env = False
response = s.get(url, headers=headers)
It seems to be caused by a difference in the User-Agent settings.
Try setting trust_env = None. From the requests documentation:
trust_env = None
Trust environment settings for proxy configuration, default authentication and similar.
Or you can disable proxies for a particular domain:
import os
os.environ['NO_PROXY'] = 'stackoverflow.com'
In my organization, I have to run my program over a VPN for different geographic locations, so we have multiple proxy configurations.
I found it simpler to use a package called PyPAC to get my proxy details automatically:
from pypac import PACSession
from requests.auth import HTTPProxyAuth
session = PACSession()
# when the username and password is required
# session = PACSession(proxy_auth=HTTPProxyAuth(name, password))
r = session.get('http://example.org')
How this works:
The package locates the PAC file configured by the organization. This file contains the proxy configuration details.
I am using the python shell to test requests together with proxy servers.
After reading documentation (http://docs.python-requests.org/en/master/user/advanced/) and a few stackoverflow threads I am doing the following:
import requests
s = requests.session()
proxies = {'http': 'http://90.178.216.202:3128'}
s.proxies.update(proxies)
req = s.get('http://jsonip.com')
After this, if I print req.text, I get this:
u'{"ip":"my current IP (not the proxy server IP I have inserted before)","about":"/about", ......}'
Can you please explain why I'm getting my computer's IP address and not the proxy server's IP address?
Did I go wrong somewhere or am I expecting the wrong thing to happen here?
I am new to requests + proxy servers so I would like to make sure I am understanding this.
UPDATE
I also have this in my code:
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.0; WOW64; rv:24.0) Gecko/20100101 Firefox/24.0'}
s.headers.update(headers)
Thanks
Vittorio
The site (http://jsonip.com) broadcasts an 'Upgrade-Insecure-Requests' header. This means that your request gets redirected to https://jsonip.com, so requests doesn't use a proxy because you don't have an https proxy in your proxies dict.
So, all you have to do is add an https proxy to proxies, e.g.:
proxies = {'http':'http://90.178.216.202:3128', 'https':'https://90.178.216.202:3128'}
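The reasoning above can be illustrated with a deliberately simplified model of scheme-based proxy selection. This is not the code requests uses internally (requests also accepts more specific keys such as scheme://hostname); pick_proxy is a hypothetical helper that only shows why an https URL finds nothing in a dict that has just an 'http' key.

```python
from urllib.parse import urlparse

def pick_proxy(url, proxies):
    """Return the proxy configured for the URL's scheme, or None."""
    scheme = urlparse(url).scheme.lower()
    return proxies.get(scheme)

# The dict from the question: only an 'http' entry.
proxies = {"http": "http://90.178.216.202:3128"}

print(pick_proxy("http://jsonip.com", proxies))   # the configured proxy
print(pick_proxy("https://jsonip.com", proxies))  # None: direct connection
```

Once the request is redirected to the https URL, the lookup yields nothing and the connection goes out directly, which would explain seeing your own IP address.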
Instead of doing this, pass a User-Agent header:
requests.post(url='http://abc.com', headers={'user-agent': 'Mozilla/5.0'})
You need to change your GET request so that the proxies are used, something like this:
req = s.get('http://jsonip.com', proxies=proxies)
import urllib
import urllib2

url = 'https://www.instagram.com/accounts/login/ajax/'
values = {'username': 'User',
          'password': 'Pass'}
#'User-agent', ''
data = urllib.urlencode(values)
req = urllib2.Request(url, data, headers={'User-Agent': "Mozilla/5.0"})
con = urllib2.urlopen(req)
the_page = con.read()  # read from the object returned by urlopen
Does anyone have any ideas about this? I keep getting the error "403 forbidden".
It's possible Instagram has something that won't let me connect via Python (I don't want to connect via their API). What on earth is going on here?
Thanks!
EDIT: Adding more info.
The error I was getting was this:
This page could not be loaded. If you have cookies disabled in your browser, or you are browsing in Private Mode, please try enabling cookies or turning off Private Mode, and then retrying your action.
I edited my code but am still getting that error.
import urllib
import urllib2
import cookielib

jar = cookielib.FileCookieJar("cookies")
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))
print len(jar)  # prints 0
opener.addheaders = [('User-agent','Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.111 Safari/537.36')]
result = opener.open('https://www.instagram.com')
print result.getcode(), len(jar)  # prints 200 and 2
url = 'https://www.instagram.com/accounts/login/ajax/'
values = {'username': 'username',
          'password': 'password'}
data = urllib.urlencode(values)
response = opener.open(url, data)
print response.getcode()
Two important things, for starters:
Make sure you stay on the legal side. According to Instagram's Terms of Use:
We prohibit crawling, scraping, caching or otherwise accessing any content on the Service via automated means, including but not limited to, user profiles and photos (except as may be the result of standard search engine protocols or technologies used by a search engine with Instagram's express consent).
You must not create accounts with the Service through unauthorized means, including but not limited to, by using an automated device, script, bot, spider, crawler or scraper.
There is an Instagram API that would help you stay on the legal side and make life easier. There is a Python client: python-instagram
Aside from that, Instagram itself is JavaScript-heavy, and you may find it difficult to work with using just urllib2 or requests. If, for some reason, you cannot use the API, you could look into browser automation via selenium. Note that you can also automate a headless browser like PhantomJS. Here is sample code to log in:
from selenium import webdriver
USERNAME = "username"
PASSWORD = "password"
driver = webdriver.PhantomJS()
driver.get("https://www.instagram.com")
driver.find_element_by_name("username").send_keys(USERNAME)
driver.find_element_by_name("password").send_keys(PASSWORD)
driver.find_element_by_xpath("//button[. = 'Log in']").click()
I'm trying to determine high anonymity proxies. Also called private/elite proxies. From a forum I've read this:
High anonymity Servers don't send HTTP_X_FORWARDED_FOR, HTTP_VIA and
HTTP_PROXY_CONNECTION variables. Host doesn't even know you are using
proxy server and of course it doesn't know your IP address.
A highly anonymous proxy will display the following information:
REMOTE_ADDR = Proxy's IP address
HTTP_VIA = blank
HTTP_X_FORWARDED_FOR = blank
So, how can I check for these headers in Python, in order to tell whether a proxy is a high anonymity proxy? I have tried to retrieve the headers for 20-30 proxies using the requests package, and also with urllib, with the built-in http.client, and with urllib2. But I never saw these headers, so I must be doing something wrong...
This is the code I've used to test with requests:
import requests

proxies = {'http': 'http://176.100.108.214:3128'}
header = {'user-agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.360',}
s = requests.session()
s.proxies = proxies
r = s.get('http://www.python.org', headers=header)
print(r.status_code)
print(r.request.headers)
print(r.headers)
It sounds like the forum post you're referring to is talking about the headers seen by the server on your proxied request, not the headers seen by the client on the proxied response.
Since you're testing with www.python.org as the server, the only way to see the headers it receives would be to have access to their logs. Which you don't.
But there's a simple solution: run your own HTTP server, make requests against that, and then you can see what it receives. (If you're behind a firewall or NAT that the proxy you're testing won't be able to connect to, you may have to get a free hosted server somewhere; if not, you can just run it on your machine.)
If you have no idea how to set up and configure a web server, Python comes with one of its own. Just run this script with Python 3.2+ (on your own machine, or an Amazon EC2 free instance, or whatever):
from http.server import HTTPServer, SimpleHTTPRequestHandler

class HeaderDumper(SimpleHTTPRequestHandler):
    def do_GET(self):
        try:
            return super().do_GET()
        finally:
            print(self.headers)

server = HTTPServer(("", 8123), HeaderDumper)
server.serve_forever()
Then run that script with python3 in the shell.
Then just run your client script, with http://my.host.ip:8123 (the port the server above listens on) instead of http://www.python.org, and look at what the script dumps to the server's shell.
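Once the server prints the headers it receives, the check described in the quoted forum post could be sketched as below. The assumption here is that HTTP_VIA, HTTP_X_FORWARDED_FOR and HTTP_PROXY_CONNECTION are the CGI-style names of the Via, X-Forwarded-For and Proxy-Connection request headers; the function names and the sample header values are made up for illustration.

```python
# Request headers that, per the quoted forum post, reveal a proxy.
PROXY_REVEALING_HEADERS = ("Via", "X-Forwarded-For", "Proxy-Connection")

def revealing_headers(headers):
    """Return the proxy-revealing header names present with a non-empty value."""
    # Header names are case-insensitive, so compare in lower case.
    present = {name.lower(): value for name, value in headers.items()}
    return [
        name for name in PROXY_REVEALING_HEADERS
        if present.get(name.lower(), "").strip()
    ]

def looks_high_anonymity(headers):
    """True if none of the proxy-revealing headers was received."""
    return not revealing_headers(headers)

# Made-up examples of what the server might receive.
via_proxy = {"Host": "my.host.ip:8123", "via": "1.1 proxy", "X-Forwarded-For": "203.0.113.5"}
clean = {"Host": "my.host.ip:8123", "User-Agent": "Mozilla/5.0"}

print(revealing_headers(via_proxy))
print(looks_high_anonymity(clean))
```

Inside HeaderDumper.do_GET, self.headers could be passed to these functions directly, since it supports items(). The absence of these headers only shows that the proxy did not add them; it does not prove anything else about the proxy.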
I am quite new to Python and have been stuck for a couple of days now trying to send a cookie with urllib2. Basically, on the page I want to get, I see from Firebug that there is a "sent cookie" which looks like:
list_type=height
...which basically arranges the list on the page in a certain order.
I would like to send the above cookie info via urllib2, so that the rendered page takes this setting into effect. Here is the code I am trying to write to make it work:
class Networksx(object):
    def __init__(self):
        self.cj = cookielib.CookieJar()
        self.opener = urllib2.build_opener\
        #socks handler
        self.opener.addheaders = [
            ('User-Agent', 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-GB; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13'),
            ('Accept-Charset', 'ISO-8859-1,utf-8;q=0.7,*;q=0.7'),
            ('Keep-Alive', '115'),
            ('Connection', 'keep-alive'),
            ('Cache-Control', 'max-age=0'),
            ('Referer', 'http://www.google.com'),
            ("Cookie", {"list_type":"height"}),
        ]
        urllib2.install_opener(self.opener)
        self.params = { 'Set-Cookie': "list_type":"height"}
        self.encoded_params = urllib.urlencode( self.params )

    def fullinfo(self,url):
        return self.opener.open(url,self.encoded_params).read()
As you can see, I have tried a couple of things:
setting the parameter via a header
setting a cookie
However, these do not seem to render the page in the desired list order (height). Could someone point me in the right direction on how to send the cookie information with urllib2?
Thanks.
An easy way to generate a cookie.txt is this chrome extension: https://chrome.google.com/webstore/detail/cookietxt-export/lopabhfecdfhgogdbojmaicoicjekelh
import urllib2, cookielib
url = 'https://example.com/path/default.aspx'
txheaders = {'User-agent' : 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}
cj = cookielib.LWPCookieJar()
# cj.load signature: filename=None, ignore_discard=False, ignore_expires=False
cj.load('/path/to/my/cookies.txt')
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
urllib2.install_opener(opener)
req = urllib2.Request(url, None, txheaders)
handle = urllib2.urlopen(req)
[update]
Sorry, I was pasting from an old code snippet long forgotten. From the LWPCookieJar docstring:
The LWPCookieJar saves a sequence of "Set-Cookie3" lines. "Set-Cookie3" is the format used by the libwww-perl library, not known to be compatible with any browser, but which is easy to read and doesn't lose information about RFC 2965 cookies.
So it is not compatible with the cookie.txt generated by modern browsers. If you try to load it with LWPCookieJar you will get: LoadError: 'cookies.txt' does not look like a Set-Cookie3 (LWP) format file.
You can do as the OP and convert the file:
There is something wrong with the format of the output from the Chrome extension. I just googled the LWP problem and found code.activestate.com/recipes/302930-cookielib-example. The code spits out the cookie in LWP format, and then I followed your steps as-is. - James W
You can also use this Firefox addon, and then "Tools->Export cookies". Make sure the first line in the cookies.txt file is "# Netscape HTTP Cookie File" and use:
cj = cookielib.MozillaCookieJar('/path/to/my/cookies.txt')
cj.load()
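To see what such a file needs to look like, here is a small self-contained sketch for Python 3, where cookielib is http.cookiejar. It writes a one-cookie file in the Netscape format to a temporary location and loads it back. The domain and expiry are made up for illustration; the cookie name and value are borrowed from the question.

```python
import os
import tempfile
from http.cookiejar import MozillaCookieJar

# Fields are tab-separated: domain, include-subdomains flag, path,
# secure flag, expiry (Unix time), name, value. Values are hypothetical.
cookie_line = "\t".join(
    [".example.com", "TRUE", "/", "FALSE", "4102444800", "list_type", "height"]
)
content = "# Netscape HTTP Cookie File\n" + cookie_line + "\n"

with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as handle:
    handle.write(content)
    cookie_path = handle.name

jar = MozillaCookieJar(cookie_path)
jar.load()  # raises LoadError if the first line is not the expected header
os.remove(cookie_path)

loaded = {cookie.name: cookie.value for cookie in jar}
print(loaded)
```

The jar can then be passed to HTTPCookieProcessor in the same way as in the snippets above.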
You would be better off looking into the 'requests' module for Python, which makes HTTP much more approachable than the low-level urllib modules.
See
http://docs.python-requests.org/en/latest/user/quickstart/#cookies
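For the single cookie in the question, a cookie jar may not be needed at all: the Cookie request header is just a string of name=value pairs. A minimal sketch with the Python 3 standard library (where urllib2 is urllib.request) follows. The URL is a placeholder, and nothing is sent over the network here.

```python
import urllib.request

# Hypothetical URL for illustration; the cookie is the one from the question.
url = "http://example.com/list"
cookies = {"list_type": "height"}

# The Cookie header is a single string of name=value pairs joined by "; ".
cookie_header = "; ".join(f"{name}={value}" for name, value in cookies.items())

request = urllib.request.Request(url, headers={"Cookie": cookie_header})
print(request.get_header("Cookie"))

# urllib.request.urlopen(request) would send the request with this header.
```

Note that the header value is a plain string rather than a dict, and that Set-Cookie is a response header, so it should not be sent by the client.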