Python urllib2 and SSH proxy -- throws a 404 not found

I'm trying to use SSH tunnels inside of Python's urllib2.
Creating the tunnel:
ssh -N user@machine.place.edu -L 1337:localhost:80
The above line should forward port 1337 on the local machine to port 80 on the remote machine.
I used -N, so the bash prompt (intentionally) hangs as long as this tunnel is running.
Using the tunnel in urllib2:
import urllib2
url = "http://ifconfig.me/ip"
headers={'User-agent' : 'Mozilla/5.0'}
proxy_support = urllib2.ProxyHandler({'http': 'http://127.0.0.1:1337'})
opener = urllib2.build_opener(proxy_support, urllib2.HTTPHandler(debuglevel=1))
urllib2.install_opener(opener)
req = urllib2.Request(url, None, headers)
html = urllib2.urlopen(req).read()
print html
When I run the above code, html = urllib2.urlopen(req).read() throws the error urllib2.HTTPError: HTTP Error 404: Not Found.
What might be going wrong, and how can we fix it?
Troubleshooting:
If I turn off the SSH tunnel, the error changes to urllib2.URLError: <urlopen error [Errno 61] Connection refused>. So, Python is clearly "seeing" the SSH tunnel.
If I comment out the proxy stuff by replacing opener = urllib2.build_opener(proxy_support, urllib2.HTTPHandler(debuglevel=1)) with opener = urllib2.build_opener(), then the ifconfig.me page downloads properly. (Of course, the project that I'm working on requires me to access documents from a few different networks, so I still need proxies to work.)
Some StackOverflow posts suggest using Requests instead of urllib2. I wouldn't mind using Requests instead -- I just used urllib2 here because I wasn't sure how to do custom headers (e.g. user-agent, referer) in Requests.

Unfortunately, since you're the only one with access to machine.place.edu, it's going to be impossible for anyone else to reproduce the problem.
First of all, try something like...
$ telnet localhost 1337
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
GET http://ifconfig.me/ip HTTP/1.0
...and hit enter a couple of times after the 'GET' line, and see what you get back.
If you get a 404, there's probably something wrong with the proxy.
If you get a 200, then you should be able to recreate that fairly easily with httplib.
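The telnet check above can also be scripted. What follows is a minimal sketch, assuming Python 3's standard library rather than the Python 2 httplib mentioned above. The helper names build_proxy_get and status_line_via are made up for illustration, and the Host header is an addition that the telnet session does not send. It sends the same proxy-style request (absolute URL in the request line) to the local end of the tunnel and returns the status line of the reply.

```python
import socket
from urllib.parse import urlsplit


def build_proxy_get(url):
    """Build the GET request a client would send to an HTTP proxy for url."""
    host = urlsplit(url).netloc
    # A proxy-style request puts the absolute URL in the request line.
    request = "GET %s HTTP/1.0\r\nHost: %s\r\n\r\n" % (url, host)
    return request.encode("ascii")


def status_line_via(url, proxy_host="127.0.0.1", proxy_port=1337, timeout=10.0):
    """Send the request to the local tunnel endpoint and return the status line."""
    with socket.create_connection((proxy_host, proxy_port), timeout=timeout) as conn:
        conn.sendall(build_proxy_get(url))
        with conn.makefile("rb") as reader:
            reply = reader.readline()
    return reply.decode("iso-8859-1").strip()
```

With the tunnel up, status_line_via("http://ifconfig.me/ip") should show whether whatever listens on the far end of port 1337 answers with a 200 or a 404.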

Related

pycURL; Received HTTP code 400 from proxy after CONNECT

I'm using pycURL to make a few requests to a https site through a http proxy.
Here's my code:
import cStringIO
import pycurl
buf = cStringIO.StringIO()
c = pycurl.Curl()
c.setopt(c.URL, url) # 'url' is the base url of the form https://www.target.com
c.setopt(c.PROXY, proxy) # 'proxy' has the form 1.2.3.4:8080
c.setopt(c.WRITEFUNCTION, buf.write)
c.perform()
I've tried this code with different proxies. I get either Proxy CONNECT aborted or Received HTTP code 400 from proxy after CONNECT.
Is there something I'm missing? Should I be using https proxies instead? I've looked around and can't seem to find any help or documentation on pycURL's usage.
Any help appreciated. Thanks!

I have a problem similar to yours, and my error log is:
fatal: unable to access 'https://github.com/nhn/raphael.git/': Received HTTP code 400 from proxy after CONNECT
So I used these commands to resolve my problem.
First, open your global git config:
git config --global --edit
Then delete this proxy entry from the config:
[remote "origin"]
proxy = https://github.com/facette/facette.git

503 Response when trying to use python request on local website

I'm trying to scrape my own site from my local server. But when I use python requests on it, it gives me a response 503. Other ordinary sites on the web work. Any reason/solution for this?
import requests
url = 'http://127.0.0.1:8080/full_report/a1uE0000002vu2jIAA/'
r = requests.get(url)
print r
prints out
<Response [503]>
After further investigation, I've found a similar problem to mine.
Python requests 503 erros when trying to access localhost:8000
However, I don't think he's solved it yet. I can access the local website via the web browser but can't access using the requests.get function. I'm also using Django to host the server.
python manage.py runserver 8080
When I use:
curl -vvv http://127.0.0.1:8080
* Rebuilt URL to: http://127.0.0.1:8080/
* Trying 10.37.135.39...
* Connected to proxy.kdc.[company-name].com (10.37.135.39) port 8099 (#0)
* Proxy auth using Basic with user '[company-id]'
> GET http://127.0.0.1:8080/ HTTP/1.1
> Host: 127.0.0.1:8080
> Proxy-Authorization: Basic Y2FhNTc2OnJ2YTkxQ29kZQ==
> User-Agent: curl/7.49.0
> Accept: */*
>
< HTTP/1.1 301 Moved Permanently
< Server: BlueCoat-Security-Appliance
< Location:http://10.118.216.201
< Connection: Close
<
<HTML>
<HEAD><TITLE>Redirection</TITLE></HEAD>
<BODY><H1>Redirect</H1></BODY>
* Closing connection 0

I cannot request a local url using python requests because the company's network software won't allow it. This is a dead end and other avenues must be pursued.
EDIT: Working Solution
>>> import requests
>>> session = requests.Session()
>>> session.trust_env = False
>>> r = session.get("http://127.0.0.1:8080")
>>> r
<Response [200]>
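Setting trust_env to False tells Requests to ignore settings taken from the environment, proxy variables such as http_proxy among them, which is why the request no longer goes to the corporate proxy seen in the curl output. A rough standard-library counterpart, assuming Python 3's urllib.request and a hypothetical helper name direct_opener, is to pass an empty mapping to ProxyHandler:

```python
import urllib.request


def direct_opener():
    """Build an opener that never uses a proxy, whatever the environment says."""
    # An explicit empty mapping stops ProxyHandler from calling getproxies(),
    # and passing a ProxyHandler instance makes build_opener leave out its
    # default, environment-aware ProxyHandler.
    no_proxy = urllib.request.ProxyHandler({})
    return urllib.request.build_opener(no_proxy)


# Usage (assumes the Django dev server from the question is running):
#     body = direct_opener().open("http://127.0.0.1:8080", timeout=10).read()
```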

Maybe you should disable your proxies in your requests.
import requests
proxies = {
"http": None,
"https": None,
}
requests.get("http://127.0.0.1:8080/myfunction", proxies=proxies)
ref:
https://stackoverflow.com/a/35470245/8011839
https://2.python-requests.org//en/master/user/advanced/#proxies

HTTP Error 503 means:
The Web server (running the Web site) is currently unable to handle the HTTP request due to a temporary overloading or maintenance of the server. The implication is that this is a temporary condition which will be alleviated after some delay. Some servers in this state may also simply refuse the socket connection, in which case a different error may be generated because the socket creation timed out.
You can try the following:
Check that you are able to open the URL in the browser.
If the URL opens, then check the domain in your code; it might be incorrect.
If it does not open in the browser either, your site may be overloaded, or the server may lack the resources to handle the request.

The most common cause of a 503 error is that a proxy host of some form is unable to communicate with the back end. For example, if you have Varnish trying to handle a request but Apache is down.
In your case, you have Django running on port 8080. (That's what the 8080 means). When you try to get content from 127.0.0.1, though, you're going to the default HTTP port (80). This means that your default server (Apache maybe? NginX?) is trying to find a host to serve 127.0.0.1 and can't find one.
You have two choices. Either you can update your server's configuration, or you can include the port in the URL.
url = 'http://127.0.0.1:8080/full_report/a1uE0000002vu2jIAA/'
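To confirm which port a given URL actually targets, the URL can be parsed rather than read by eye. This is a small sketch assuming Python 3; effective_port and DEFAULT_PORTS are names invented here, and only the http and https schemes are handled.

```python
from urllib.parse import urlsplit

# Default ports for the two schemes this sketch knows about.
DEFAULT_PORTS = {"http": 80, "https": 443}


def effective_port(url):
    """Return the explicit port in url, or the scheme's default if none is given."""
    parts = urlsplit(url)
    if parts.port is not None:
        return parts.port
    return DEFAULT_PORTS[parts.scheme]
```

For the URL shown above this returns 8080, while the same address without ":8080" would fall back to 80.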

urllib2.URLError: <urlopen error [Errno 11004] getaddrinfo failed>

If I run:
urllib2.urlopen('http://google.com')
even if I use another url, I get the same error.
I'm pretty sure there is no firewall running on my computer or router, and the internet (from a browser) works fine.

The problem, in my case, was that some install at some point defined an environment variable http_proxy on my machine when I had no proxy.
Removing the http_proxy environment variable fixed the problem.
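A quick way to check whether such a variable is set is to list every *_proxy entry in the environment. This is a minimal sketch assuming Python 3; proxy_env_settings is a made-up helper name, not a library function.

```python
import os


def proxy_env_settings(environ=None):
    """Return the non-empty *_proxy variables from an environment mapping."""
    if environ is None:
        environ = os.environ
    return {
        name: value
        for name, value in environ.items()
        # Match http_proxy, HTTPS_PROXY, no_proxy and so on, in any letter case.
        if name.lower().endswith("_proxy") and value
    }


# Usage: print(proxy_env_settings()) shows what the current process inherited.
```

An unexpected http_proxy entry in the output would point to the same cause as described above.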

The site's DNS record is such that Python fails the DNS lookup in a peculiar way: it finds the entry, but zero associated IP addresses. (Verify with nslookup.) Hence, 11004, WSANO_DATA.
Prefix the site with 'www.' and try the request again. (Use nslookup to verify that its result is different, too.)
This fails essentially the same way with the Python Requests module:
requests.exceptions.ConnectionError: HTTPConnectionPool(host='...', port=80): Max retries exceeded with url: / (Caused by : [Errno 11004] getaddrinfo failed)
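The nslookup comparison can also be done from Python, since both urllib2 and Requests end up calling getaddrinfo. The sketch below assumes Python 3, and resolve is a hypothetical helper name. It returns the addresses found for a host name, or an empty list when the lookup fails with socket.gaierror (the exception behind "getaddrinfo failed").

```python
import socket


def resolve(host, port=80):
    """Return the sorted unique IP addresses for host, or [] if the lookup fails."""
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address string.
    return sorted({info[4][0] for info in infos})
```

Comparing resolve("example.com") with resolve("www.example.com") for the site in question (the names here are placeholders) shows whether only the bare domain fails to resolve.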

This may not help you if it's a network-level issue but you can get some debugging info by setting debuglevel on httplib. Try this:
import urllib, urllib2, httplib
url = 'http://www.mozillazine.org/atom.xml'
httplib.HTTPConnection.debuglevel = 1
print "urllib"
data = urllib.urlopen(url);
print "urllib2"
request = urllib2.Request(url)
opener = urllib2.build_opener()
feeddata = opener.open(request).read()
Which is copied directly from here, hope that's kosher: http://bytes.com/topic/python/answers/517894-getting-debug-urllib2

You probably need to use a proxy. Check your normal browser settings to find out which. Take a look at "opening websites using urllib2 from behind corporate firewall - 11004 getaddrinfo failed" for a similar problem with a solution.

To troubleshoot the issue:
1. Let us know which OS the script is running on and which version of Python.
2. In a command prompt on that very same machine, run ping google.com and observe whether that works (or whether you get, say, "could not find host").
3. If (2) worked, open a browser on that machine (try IE if on Windows) and try opening "google.com" there. If there is a problem, look closely at the proxy settings in Internet Options / Connections / LAN Settings.
Let us know how it goes either way.

Add an s to the http, i.e. urllib2.urlopen('https://google.com').
That worked for me.

Does urllib2 in Python 2.6.1 support proxy via https

Does urllib2 in Python 2.6.1 support proxy via https?
I've found the following at http://www.voidspace.org.uk/python/articles/urllib2.shtml:
NOTE
Currently urllib2 does not support
fetching of https locations through a
proxy. This can be a problem.
I'm trying to automate logging in to a web site and downloading a document. I have a valid username/password.
proxy_info = {
'host':"axxx", # commented out the real data
'port':"1234" # commented out the real data
}
proxy_handler = urllib2.ProxyHandler(
{"http" : "http://%(host)s:%(port)s" % proxy_info})
opener = urllib2.build_opener(proxy_handler,
urllib2.HTTPHandler(debuglevel=1),urllib2.HTTPCookieProcessor())
urllib2.install_opener(opener)
fullurl = 'https://correct.url.to.login.page.com/user=a&pswd=b' # example
req1 = urllib2.Request(url=fullurl, headers=headers)
response = urllib2.urlopen(req1)
I've had it working for similar pages but not using HTTPS and I suspect it does not get through proxy - it just gets stuck in the same way as when I did not specify proxy. I need to go out through proxy.
I need to authenticate but not using basic authentication, will urllib2 figure out authentication when going via https site (I supply username/password to site via url)?
EDIT:
Nope, I tested with
proxies = {
"http" : "http://%(host)s:%(port)s" % proxy_info,
"https" : "https://%(host)s:%(port)s" % proxy_info
}
proxy_handler = urllib2.ProxyHandler(proxies)
And I get error:
urllib2.URLError: urlopen error
[Errno 8] _ssl.c:480: EOF occurred in
violation of protocol

Fixed in Python 2.6.3 and several other branches:
http://bugs.python.org/issue1424152
http://www.python.org/download/releases/2.6.3/NEWS.txt
Issue #1424152: Fix for httplib, urllib2 to support SSL while working through
proxy. Original patch by Christopher Li, changes made by Senthil Kumaran.

I'm not sure Michael Foord's article, that you quote, is updated to Python 2.6.1 -- why not give it a try? Instead of telling ProxyHandler that the proxy is only good for http, as you're doing now, register it for https, too (of course you should format it into a variable just once before you call ProxyHandler and just repeatedly use that variable in the dict): that may or may not work, but, you're not even trying, and that's sure not to work!-)
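The suggestion above (format the proxy address once, then register it for both schemes) might look like the following. This is a sketch under assumptions, not a confirmed fix: it uses Python 3's urllib.request, the successor to urllib2, the helper name make_proxy_opener and the host "proxy.example" are invented, and it assumes the proxy itself speaks plain HTTP and tunnels HTTPS traffic via CONNECT.

```python
import urllib.request


def make_proxy_opener(host, port):
    """Build an opener that sends both http and https requests through one proxy."""
    # Format the proxy address once and reuse it for both schemes.
    proxy_url = "http://%s:%s" % (host, port)
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


# Usage (hypothetical proxy address):
#     opener = make_proxy_opener("proxy.example", 1234)
#     page = opener.open("https://example.com/", timeout=30).read()
```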

In case anyone else has this issue in the future, I'd like to point out that it does support https proxying now. Make sure the proxy supports it too, or you risk running into a bug that puts the python library into an infinite loop (this happened to me).
See the unittest in the python source that is testing https proxying support for further information:
http://svn.python.org/view/python/branches/release26-maint/Lib/test/test_urllib2.py?r1=74203&r2=74202&pathrev=74203

Python urllib2 timeout when using Tor as proxy?

I am using Python's urllib2 with Tor as a proxy to access a website. When I
open the site's main page it works fine but when I try to view the login page
(not actually log-in but just view it) I get the following error...
URLError: <urlopen error (10060, 'Operation timed out')>
To counteract this I did the following:
import socket
socket.setdefaulttimeout(None)
I still get the same timeout error.
Does this mean the website is timing out on the server side? (I don't know much
about http processes so sorry if this is a dumb question)
Is there any way I can correct it so that Python is able to view the page?
Thanks,
Rob

According to the Python Socket Documentation the default is no timeout so specifying a value of "None" is redundant.
There are a number of possible reasons that your connection is dropping. One could be that your user-agent is "Python-urllib" which may very well be blocked. To change your user agent:
request = urllib2.Request('http://site.com/login')
request.add_header('User-Agent','Mozilla/5.0 (X11; U; Linux i686; it-IT; rv:1.9.0.2) Gecko/2008092313 Ubuntu/9.04 (jaunty) Firefox/3.5')
You may also want to try overriding the proxy settings before you try and open the url using something along the lines of:
proxy = urllib2.ProxyHandler({"http":"http://127.0.0.1:8118"})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)

I don't know enough about Tor to be sure, but the timeout may not happen on the server side, but on one of the Tor nodes somewhere between you and the server. In that case there is nothing you can do other than to retry the connection.

urllib2.urlopen(url[, data][, timeout])
The optional timeout parameter specifies a timeout in seconds for blocking operations like the connection attempt (if not specified, the global default timeout setting will be used). This actually only works for HTTP, HTTPS, FTP and FTPS connections.
http://docs.python.org/library/urllib2.html
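Combining the per-request timeout with the earlier advice to retry when a node along the way times out could look like this. It is a minimal sketch assuming Python 3's urllib.request; fetch_with_retries and its parameters are made up for illustration, and the open_url argument exists only so that a different opener (for example one built with a ProxyHandler) can be passed in.

```python
import socket
import time
import urllib.error
import urllib.request


def fetch_with_retries(url, attempts=3, timeout=30.0, delay=0.0,
                       open_url=urllib.request.urlopen):
    """Fetch url, retrying on timeouts and URL errors up to `attempts` times."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            # The timeout applies to blocking operations such as the connect.
            return open_url(url, timeout=timeout).read()
        except (urllib.error.URLError, socket.timeout) as exc:
            last_error = exc
            time.sleep(delay)
    # Every attempt failed; surface the most recent error to the caller.
    raise last_error
```

Retrying only helps when the failure is transient; if the site blocks the request outright, every attempt will fail the same way.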
