python urllib2: connection reset by peer

I have a Perl program that retrieves data from my university library's database, and it works well. Now I want to rewrite it in Python, but I encounter the problem:
<urlopen error [Errno 104] Connection reset by peer>
The Perl code is:
my $ua = LWP::UserAgent->new;
$ua->cookie_jar( HTTP::Cookies->new() );
$ua->timeout(30);
$ua->env_proxy;
my $response = $ua->get($url);
The Python code I wrote is:
import urllib2
from cookielib import CookieJar

cj = CookieJar()
request = urllib2.Request(url)  # url: target web page
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
urllib2.install_opener(opener)  # install_opener returns None, so don't assign it
data = urllib2.urlopen(request)
I use a VPN (virtual private network) to log in to my university's library from home, and I tried both the Perl code and the Python code. The Perl code works as I expected, but the Python code always encounters the "urlopen error".
I googled the problem, and it seems that urllib2 fails to load the environment proxy settings. But according to the urllib2 documentation, the urlopen() function works transparently with proxies that do not require authentication. Now I feel quite confused. Can anybody help me with this problem?

I tried faking the User-Agent header, as Uku Loskit and Mikko Ohtamaa suggested, and it solved my problem. The code is as follows:
proxy = "YOUR_PROXY_GOES_HERE"
proxies = {"http":"http://%s" % proxy}
headers={'User-agent' : 'Mozilla/5.0'}
proxy_support = urllib2.ProxyHandler(proxies)
opener = urllib2.build_opener(proxy_support, urllib2.HTTPHandler(debuglevel=1))
urllib2.install_opener(opener)
req = urllib2.Request(url, None, headers)
html = urllib2.urlopen(req).read()
print html
Hope it is useful for someone else!

Firstly, as Steve said, you need response.read(), but that's not your problem:
import urllib2
response = urllib2.urlopen('http://python.org/')
html = response.read()
Can you give details of the error? You can get it like this:
try:
    urllib2.urlopen(req)
except urllib2.URLError, e:
    # e.code and e.read() are available when the error is an HTTPError
    print e.code
    print e.read()
Source: http://www.voidspace.org.uk/python/articles/urllib2.shtml
(I put this in a comment but it ate my newlines)

You might find that the requests module is a much easier-to-use replacement for urllib2.
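For instance, a minimal sketch of the same kind of fetch with requests (the proxy and URL below are placeholders, not from the original question):
# Sketch with requests: cookies are handled for you, and proxies and
# headers are plain keyword arguments.
import requests

proxies = {'http': 'http://YOUR_PROXY_GOES_HERE'}
headers = {'User-Agent': 'Mozilla/5.0'}

r = requests.get('http://example.com/library-page',
                 proxies=proxies, headers=headers, timeout=30)
print(r.text)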

Did you try specifying the proxy manually?
import urllib2

proxy = urllib2.ProxyHandler({'http': 'your_proxy_ip'})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)
urllib2.urlopen('http://www.uni-database.com')
If it still fails, try faking your User-Agent header so that the request seems to come from a real browser, as in the sketch below.
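A hedged sketch of that combination, reusing the proxy opener installed above (the URL is the same placeholder):
# Sketch: the proxy opener from above plus a browser-like User-Agent header.
import urllib2

req = urllib2.Request('http://www.uni-database.com',
                      headers={'User-agent': 'Mozilla/5.0'})
html = urllib2.urlopen(req).read()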

Related

How to log into a restricted site

I am doing some summer research with my school. I have to download ~2000 images off of a restricted site with graphs. I could absolutely do this manually, but I know it would be much faster with some sort of script. I've settled on Python because I am assuming it will be easier than another language. I have the URL for the site and the generic link for the database where the images are stored. I plan to feed the program a list of orbit numbers, and it will download the appropriate images. The main issue is that when you visit the site, it pops up a login window through the browser, not in HTML, so I cannot view any of the site's code to see how to submit the login.
I have already tried to use urllib and cookielib. I realize that urllib2 does not work in Python 3. I have also looked into using requests and mechanize with no luck.
import cookielib
import urllib2
import string

def cook():
    url = "SITE"
    cj = cookielib.LWPCookieJar()
    authinfo = urllib2.HTTPBasicAuthHandler()
    realm = "realmName"
    username = "USERNAME"
    password = "PASS"
    host = "HOST"
    authinfo.add_password(realm, host, username, password)
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj), authinfo)
    urllib2.install_opener(opener)
    # Create request object
    txheaders = {'User-agent': "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)"}
    try:
        req = urllib2.Request(url, None, txheaders)
        cj.add_cookie_header(req)
        f = urllib2.urlopen(req)
    except IOError as e:
        print("Failed to open", url)
        if hasattr(e, 'code'):
            print("Error code:", e.code)
    else:
        print(f)
        print(f.read())
        print(f.info())
        f.close()
        print('Cookies:')
        for index, cookie in enumerate(cj):
            print(index, " : ", cookie)
        cj.save("cookies.lwp")
The code, obviously, just throws a bunch of errors. I really just need to be able to get into the site and download my images.
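For reference, since urllib2 and cookielib do not exist under those names in Python 3, here is a hedged sketch of the same basic-auth flow ported to urllib.request and http.cookiejar (SITE, realmName, HOST, USERNAME, and PASS are the same placeholders as in the code above):
# Hedged Python 3 port of the flow above; all-caps names are placeholders.
import http.cookiejar
import urllib.request

cj = http.cookiejar.LWPCookieJar()
authinfo = urllib.request.HTTPBasicAuthHandler()
authinfo.add_password("realmName", "HOST", "USERNAME", "PASS")
opener = urllib.request.build_opener(
    urllib.request.HTTPCookieProcessor(cj), authinfo)

with opener.open("SITE") as f:
    print(f.read())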
Totally was able to fix it by bypassing the verify. I know it's not a great method, but it does what I need it to. Thanks, guys!
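"Bypassing the verify" presumably means disabling TLS certificate verification; a hedged sketch of that with requests, assuming HTTP Basic Auth (the URL and credentials are placeholders, not from the original post):
# Sketch only: disabling certificate verification is not great practice,
# but it unblocks the download when verification is what's failing.
import requests

resp = requests.get("https://SITE/graph.png",
                    auth=("USERNAME", "PASS"),
                    verify=False)
with open("graph.png", "wb") as f:
    f.write(resp.content)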
You should use the Selenium web driver to automate the login and download the images. Read this article; it will help you scrape data from a website that requires login.
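A hedged sketch of that approach (the URL and element IDs are hypothetical; a site that shows a browser-level login prompt rather than an HTML form would need a different tactic):
# Hypothetical Selenium sketch; the URL and locators are placeholders.
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get("https://example.com/login")
driver.find_element(By.ID, "username").send_keys("USERNAME")
driver.find_element(By.ID, "password").send_keys("PASS")
driver.find_element(By.ID, "submit").click()
# Once logged in, driver.get(...) can fetch each graph page in turn.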

Is it possible to read Wikipedia using Python requests library?

To read the content of a given URL, I do the following:
import requests
proxies = {'http':'http://user:pswd@foo-webproxy.foo.com:7777'}
url = 'http://example.com/foo/bar'
r = requests.get(url, proxies = proxies)
print r.text.encode('utf-8')
And it works fine! I get the content.
However, if I use another URL:
url = 'https://en.wikipedia.org/wiki/Mestisko'
It does not work. I get an error message that starts with:
requests.exceptions.ConnectionError: ('Connection aborted.', error(10060
Is Wikipedia blocking automatic requests?
ADDED
I tried to set a user agent in the following way:
headers = {'User-Agent':'Mozilla/5.0'}
r = requests.get(url, proxies = proxies, headers = headers)
Unfortunately it does not help. I still get the same error.
ADDED 2
Now I am confused. If I try to get content from http://example.com/foo/bar with the proxy set, I get it. If I do not set the proxy, I get content generated by the proxy. That behavior I can understand. Now, if I try to get content from Wikipedia, I get the same error message regardless of whether I set the proxy or not. So I do not understand whether this error message comes from Wikipedia or from the proxy (both cannot be true).
The problem was resolved by replacing:
proxies = {'http':'http://user:pswd@foo-webproxy.foo.com:7777'}
with the following line:
proxies = {'http':'http://user:pswd@foo-webproxy.foo.com:7777', 'https':'http://user:pswd@foo-webproxy.foo.com:7777'}
The Wikipedia URL uses https, and requests only consults the proxy entry that matches the URL's scheme, so without an 'https' key the request went out without the proxy at all.

Proxy authentication error

Hi, I have written a few simple lines of code, but I seem to be getting an authentication error. Can anyone please suggest what credentials Python is looking for here?
Code:
import urllib2
response = urllib2.urlopen('http://google.com')
html = response.read()
Error
urllib2.HTTPError: HTTP Error 407: Proxy Authentication Required
PS: I do not have access to IE --> Advanced settings or regedit.
As advised, I've modified the code:
import urllib2
proxy_support = urllib2.ProxyHandler({'http': r'http://username:psw@IP:port'})
auth = urllib2.HTTPBasicAuthHandler()
opener = urllib2.build_opener(proxy_support, auth, urllib2.HTTPHandler)
urllib2.install_opener(opener)
response = urllib2.urlopen('http://google.com')
html = response.read()
Also, I have created two environment variables:
HTTP_PROXY = http://username:password@proxyserver.domain.com
HTTPS_PROXY = https://username:password@proxyserver.domain.com
But I am still getting the error:
urllib2.HTTPError: HTTP Error 407: Proxy Authentication Required
There are multiple ways to work around your problem. You may want to try defining environment variables named http_proxy and https_proxy, each set to your proxy URL. Refer to this link for more details.
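For example, a minimal sketch of the environment-variable approach (the proxy URL and port are placeholders):
# Sketch: set the proxy variables from within Python before the first
# urlopen; urllib2 reads them when it builds its default opener.
import os
os.environ['http_proxy'] = 'http://username:password@proxyserver.domain.com:8080'
os.environ['https_proxy'] = 'http://username:password@proxyserver.domain.com:8080'

import urllib2
html = urllib2.urlopen('http://google.com').read()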
Alternatively, you may want to explicitly define a ProxyHandler for urllib2 to use while handling requests through the proxy. The link is already present in the comment to your query; however, I am including it here for the sake of completeness.
Hope this helps.
If your OS is Windows and you are behind an ISA proxy, urllib2 does not use any proxy information; instead, the "Firewall Client for ISA Server" authenticates the user automatically. That means you don't need to set the http_proxy and https_proxy environment variables. Keep the ProxyHandler empty, as follows:
import urllib2

proxy = urllib2.ProxyHandler({})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)
u = urllib2.urlopen('your-url-goes-here')
data = u.read()
The error code and message indicate that the username and password failed to pass the proxy server's authentication.
The following code:
import urllib2

proxy_handler = urllib2.ProxyHandler({'http': 'http://username:psw@IP:port'})
opener = urllib2.build_opener(proxy_handler)
urllib2.install_opener(opener)
response = urllib2.urlopen('http://google.com')
html = response.read()
should also work once the authentication succeeds.

Issue with cookies and sending POST/GET to get the web content in Python [duplicate]

Possible Duplicate:
How to use Python to login to a webpage and retrieve cookies for later usage?
I want to download the whole webpage source from a service that handles cookies in an unusual way. I wrote a script that actually works and seems to be fine; however, at some point it returned this error:
urllib2.HTTPError: HTTP Error 302: The HTTP server returned a redirect error that would lead to an infinite loop.
The last 30x error message was:
Found
My script works in a loop and changes the link to the subpage whose content I want to download.
I get a cookie, send a packet of data, and then I am able to reach the proper link and download the HTML.
The script looks like this:
import urllib2
data = 'some_string'
url = "http://example/index.php"
url2 = "http://example/source"
req1 = urllib2.Request(url)
response = urllib2.urlopen(req1)
cookie = response.info().getheader('Set-Cookie')
## Use the cookie in subsequent requests
req2 = urllib2.Request(url, data)
req2.add_header('cookie', cookie)
response = urllib2.urlopen(req2)
## reuse again
req3 = urllib2.Request(url2)
req3.add_header('cookie', cookie)
response = urllib2.urlopen(req3)
html = response.read()
I've been reading about cookiejar/cookielib because using that library I am supposed to get rid of the error mentioned above; however, I have no clue how to rework my code to use http.cookiejar and urllib.request.
I tried something like this:
import http.cookiejar, urllib.request
cj = http.cookiejar.CookieJar()
opener = urllib.request.build_opener( urllib.request.HTTPCookieProcessor(cj) )
r = opener.open(url) # now cookies are stored in cj
r1 = opener.open(url, data)  # TypeError: POST data should be bytes or an iterable of bytes. It cannot be str.
r2 = opener.open(url2)
print( r2.read() )
But it's not working like my first script.
PS: Sorry for my English, but I am not a native speaker.
@Piotr Dobrogost, thanks for the link; it solved the issue.
The TypeError was solved by using data=b"string" instead of data="string".
I've still got some issues due to porting to Python 3, but this issue can be closed.
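For completeness, a hedged Python 3 sketch of the whole corrected flow (the URLs and data are the same placeholders as above):
# Sketch: Python 3 version of the original script, with the POST body
# as bytes. url, url2, and data are placeholders.
import http.cookiejar
import urllib.request

url = "http://example/index.php"
url2 = "http://example/source"
data = b"some_string"  # bytes, not str

cj = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))

r = opener.open(url)         # the Set-Cookie response is captured in cj
r1 = opener.open(url, data)  # POST; the stored cookie is sent automatically
r2 = opener.open(url2)
print(r2.read())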

Retrieving pages from what.cd

I'm working on a screen scraper using BeautifulSoup for what.cd using Python. I came across this script while working and decided to look at it, since it seems to be similar to what I'm working on. However, every time I run the script I get a message that my credentials are wrong, even though they are not.
As far as I can tell, I'm getting this message because when the script tries to log into what.cd, what.cd is supposed to return a cookie containing the information that lets me request pages later in the script. So where the script is failing is:
import urllib
import urllib2
import cookielib
from bs4 import BeautifulSoup  # or BeautifulSoup 3, depending on the script

cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
login_data = urllib.urlencode({'username': username,
                               'password': password})
check = opener.open('http://what.cd/login.php', login_data)
soup = BeautifulSoup(check.read())
warning = soup.find('span', 'warning')
if warning:
    exit(str(warning) + '\n\nprobably means username or pw is wrong')
I've tried multiple methods of authenticating with the site including using CookieFileJar, the script located here, and the Requests module. I've gotten the same HTML message with each one. It says, in short, that "Javascript is disabled", and "Cookies are disabled", and also provides a login box in HTML.
I don't really want to mess around with Mechanize, but I don't see any other way to do it at the moment. If anyone can provide any help, it would be greatly appreciated.
After a few more hours of searching, I found the solution to my problem. I'm still not sure why this code works as opposed to the version above, but it does. Here is the code I'm using now:
import urllib
import urllib2
import cookielib

cj = cookielib.LWPCookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
urllib2.install_opener(opener)

# First hit the index page so the site can set its session cookie.
request = urllib2.Request("http://what.cd/index.php", None)
f = urllib2.urlopen(request)
f.close()

# Then POST the credentials; the session cookie is sent along automatically.
data = urllib.urlencode({"username": "your-login", "password": "your-password"})
request = urllib2.Request("http://what.cd/login.php", data)
f = urllib2.urlopen(request)
html = f.read()
f.close()
Credit goes to carl.waldbieser from linuxquestions.org. Thanks to everyone who gave input.
