urllib2.open error in python - python

I can't get URL
base_url = "http://status.aws.amazon.com/"
socket.setdefaulttimeout(30)
htmldata = urllib2.urlopen(base_url)
for url in parser.url_list:
    get_rss_th = threading.Thread(target=parser.get_rss,name="get_rss_th", args=(url,))
    get_rss_th.start()
print htmldata
<addinfourl at 140176301032584 whose fp = <socket._fileobject object at 0x7f7d56a09750>>
When I call htmldata.read() instead (see: Python error when using urllib.open), I get a blank screen.
Python 2.7
Whole code: https://github.com/tech-sketch/zabbix_aws_template/blob/master/scripts/AWS_Service_Health_Dashboard.py
The problem is that I can't get any output from the URL (an RSS feed): the variable data = zbx_client.recv(4096) is empty, with no status.

There's no real problem with your code (except for a bunch of indentation errors and syntax errors that apparently aren't in your real code), only with your attempts to debug it.
First, you did this:
print htmldata
That's perfectly fine, but since htmldata is a urllib2 response object, printing it just prints that response object. Which apparently looks like this:
<addinfourl at 140176301032584 whose fp = <socket._fileobject object at 0x7f7d56a09750>>
That doesn't look like particularly useful information, but that's the kind of output you get when you print something that's only really useful for debugging purposes. It tells you what type of object it is, some unique identifier for it, and the key members (in this case, the socket fileobject wrapped up by the response).
Then you apparently tried this:
print htmldata.read()
But you had already called read() on the same object earlier:
parser.feed(htmldata.read())
When you read() the same file-like object twice, the first call returns everything in the file, and the second call returns everything after that, which is nothing.
What you want to do is read() the contents once, into a string, and then you can reuse that string as many times as you want:
contents = htmldata.read()
parser.feed(contents)
print contents
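To see the "second read() is empty" behaviour without touching the network, here is a minimal sketch. It assumes io.BytesIO is an acceptable stand-in for the urllib2 response (both are file-like objects), and the HTML bytes are made up for illustration:

```python
import io

# io.BytesIO stands in for the urllib2 response object (an assumption for this demo).
fake_response = io.BytesIO(b"<html>status page</html>")

first = fake_response.read()   # consumes the whole stream
second = fake_response.read()  # the stream is exhausted, so this is empty

contents = first               # keep the bytes and reuse them as often as needed
print(len(first), len(second))
```

The same pattern applies to the real response: read once, keep the result in a variable, and pass that variable around.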
It's also worth noting that, as the urllib2 documentation says right at the top:
See also: The Requests package is recommended for a higher-level HTTP client interface.
Using urllib2 can be a big pain, in a lot of ways, and this is just one of the more minor ones. Occasionally you can't use requests because you have to dig deep under the covers of HTTP, or handle some protocol it doesn't understand, or you just can't install third-party libraries, so urllib2 (or urllib.request, as it's renamed in Python 3.x) is still there. But when you don't have to use it, it's better not to. Even pip, which Python installs through the ensurepip bootstrapper, uses a bundled copy of requests instead of urllib2.
With requests, the normal way to access the contents of a response is with the content (for binary) or text (for Unicode text) properties. You don't have to worry about when to read(); it does it automatically for you, and lets you access the text over and over. So, you can just do this:
import requests
base_url = "http://status.aws.amazon.com/"
response = requests.get(base_url, timeout=30)
parser.feed(response.content) # assuming it wants bytes, not unicode
print response.text

If I use this code:
import urllib2
import socket
base_url = "http://status.aws.amazon.com/"
socket.setdefaulttimeout(30)
htmldata = urllib2.urlopen(base_url)
print(htmldata.read())
I get the page's HTML code.

Related

Using URL of file as file path in Python in Lambda

I am trying to acquire a file from a URL on the web and then open that file for use in an application I’m making in Python on AWS Lambda. There doesn’t seem to be a way for me to acquire the file in the form I need it, which I believe to be an os.PathLike object.
Here is what I am trying now, which doesn’t work since requests.get returns a response not path. I’m posting from a phone right now so I cannot use code tags. Apologies.
filename = requests.get("url.com/file.txt")
f = open(filename, 'rb')
I have also tried a urlparse and a urllib urlretrieve on the url but that does not return a pathlike object either. Note that I don’t believe I can just use wget or something on the shell level as I am using AWS lambda.
import requests
url = 'http://url.com/file.txt'
r = requests.get(url, allow_redirects=True)
f = open(r, 'rb')
When you do this kind of operation, it's always a good idea to look at the entire response of the request you are making. I usually inspect the __dict__ attribute, which works quite often:
print(response.__dict__)
On the ones I have done, there was a _content field in the response object holding the file bytes (the same bytes are available through the public response.content property). Then you can simply use the io module to read this file:
file = io.BytesIO(response._content)
This can then be used as a file, just like the object returned by open().
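As a rough sketch of both options (an in-memory file object, or a real path when a library insists on one), something like the following should work. The helper names as_file_like and as_temp_path are hypothetical, and the payload bytes stand in for response.content so that no network call is needed:

```python
import io
import os
import tempfile


def as_file_like(data):
    """Wrap downloaded bytes in an in-memory, file-like object."""
    return io.BytesIO(data)


def as_temp_path(data, suffix=".txt"):
    """Write downloaded bytes to a temporary file and return its path."""
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        tmp.write(data)
        return tmp.name


# Placeholder for `requests.get(url).content`; no request is made here.
payload = b"first line\nsecond line\n"

stream = as_file_like(payload)
first_line = stream.readline()

path = as_temp_path(payload)
with open(path, "rb") as f:
    round_trip = f.read()
os.remove(path)  # clean up the temporary file
```

On AWS Lambda the temporary file would normally end up under /tmp, which is the writable scratch area there.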

requests - Python command line behavior differs from behavior when script is run

I'm trying to write a script that will input data I supply into a web form at a url I supply.
To start with, I'm testing it out by simply getting the html of the page and outputting it as a text file. (I'm using Windows, hence .txt.)
import sys
import requests
sys.stdout = open('html.txt', 'a')
content = requests.get('http://www.york.ac.uk/teaching/cws/wws/webpage1.html')
content.text
When I do this (i.e., the last two lines) on the python command line (>>>), I get what I expect. When I do it in this script and run it from the normal command line, the resulting html.txt is blank. If I add print(content) then html.txt contains only: <Response [200]>.
Can anyone elucidate what's going on here? Also, as you can probably tell, I'm a beginner, and I can't for the life of me find a beginner-level tutorial that explains how to use requests (or urllib[2] or selenium or whatever) to send data to webpages and retrieve the results. Thanks!
You want:
import sys
import requests
result = requests.get('http://www.york.ac.uk/teaching/cws/wws/webpage1.html')
if result.status_code == requests.codes.ok:
    with open('html.txt', 'a') as sys.stdout:
        print result.content
Requests returns an instance of type requests.Response. When you tried to print that, the __repr__ method was called, which looks like this:
def __repr__(self):
    return '<Response [%s]>' % (self.status_code)
That is where the <Response [200]> came from.
The requests.Response object has a content attribute which is an instance of str (or bytes for Python 3) that contains your HTML.
The text attribute is type unicode which may or may not be what you want. You mention in the comments that you saw a UnicodeDecodeError when you tried to write it to a file. I was able to replace the print result.content above with print result.text and I did not get that error.
If you need help solving your unicode problems, I recommend reading this unicode presentation. It explains why and when to decode and encode unicode.
The interactive interpreter echoes the result of every expression that doesn't produce None. This doesn't happen in regular scripts.
Use print to explicitly echo values:
print response.content
I used the undecoded version here as you are redirecting stdout to a file with no further encoding information.
You'd be better off writing the output directly to a file, however:
with open('html.txt', 'ab') as outputfile:
    outputfile.write(response.content)
This writes the response body, undecoded, directly to the file.
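If you would rather write the decoded text (response.text) than the raw bytes, one option is to open the file with an explicit encoding. This is only a sketch: the helper append_text and the output location are hypothetical, and html_text stands in for response.text so the example runs offline:

```python
import io
import os
import tempfile


def append_text(path, text, encoding="utf-8"):
    """Append Unicode text to a file, encoding it explicitly."""
    # io.open accepts `encoding` on both Python 2 and Python 3.
    with io.open(path, "a", encoding=encoding) as out:
        out.write(text)


# Stand-in for `response.text`; the non-ASCII character is deliberate.
html_text = u"<p>caf\u00e9</p>"

target = os.path.join(tempfile.mkdtemp(), "html.txt")  # hypothetical output path
append_text(target, html_text)

with io.open(target, "rb") as check:
    written = check.read()
```

Choosing the encoding yourself avoids depending on whatever default the console or platform happens to use.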

Python urllib download only some of a webpage?

I have a program where I need to open many webpages and download information in them. The information, however, is in the middle of the page, and it takes a long time to get to it. Is there a way to have urllib only retrieve x lines? Or, if nothing else, don't load the information afterwards?
I'm using Python 2.7.1 on Mac OS 10.8.2.
The returned object is a file-like object, and you can use .readline() to only read a partial response:
resp = urllib.urlopen(url)
for i in range(10):
    line = resp.readline()
would read only 10 lines, for example. Note that this won't guarantee a faster response.
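A slightly more defensive version of the same idea stops early if the page has fewer lines than requested. The helper name read_first_lines is hypothetical, and io.BytesIO stands in for the object returned by urlopen() so the sketch runs without a network connection:

```python
import io


def read_first_lines(resp, limit):
    """Return at most `limit` lines from a file-like response object."""
    lines = []
    for _ in range(limit):
        line = resp.readline()
        if not line:  # an empty result means the stream ended early
            break
        lines.append(line)
    return lines


# io.BytesIO stands in for the urlopen() result (an assumption for this demo).
fake_page = io.BytesIO(b"<html>\n<head>\n<body>\n</html>\n")
head = read_first_lines(fake_page, 2)
rest = read_first_lines(fake_page, 10)
```

With a real response you would typically close it once you have the lines you need, so the rest of the body is not read.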

Python 2 vs. Python 3 - urllib formats

I'm getting really tired of trying to figure out why this code works in Python 2 and not in Python 3. I'm just trying to grab a page of json and then parse it. Here's the code in Python 2:
import urllib, json
response = urllib.urlopen("http://reddit.com/.json")
content = response.read()
data = json.loads(content)
I thought the equivalent code in Python 3 would be this:
import urllib.request, json
response = urllib.request.urlopen("http://reddit.com/.json")
content = response.read()
data = json.loads(content)
But it blows up in my face, because the data returned by read() is a "bytes" type. However, I cannot for the life of me get it to convert to something that json will be able to parse. I know from the headers that reddit is trying to send utf-8 back to me, but I can't seem to get the bytes to decode into utf-8:
import urllib.request, json
response = urllib.request.urlopen("http://reddit.com/.json")
content = response.read()
data = json.loads(content.decode("utf8"))
What am I doing wrong?
Edit: the problem is that I cannot get the data into a usable state; even though json loads the data, part of it is undisplayable, and I want to be able to print the data to the screen.
Second edit: The problem has more to do with print than parsing, it seems. Alex's answer provides a way for the script to work in Python 3, by setting the IO to utf8. But a question still remains: why is it that the code worked in Python 2, but not Python 3?
The code you posted presumably suffered from a copy-and-paste error, because it's clearly wrong in both versions (f.read() fails because no name f is defined).
In Py3, ur = response.decode('utf8') works perfectly well for me, as does the following json.loads(ur). Maybe the copy-and-paste errors affected your 2-to-3 conversion attempts.
Depending on your Python version, you have to choose the correct library.
For Python 3.5:
import urllib.request
data = urllib.request.urlopen(url).read().decode('utf8')
For Python 2.7:
import urllib
url = serviceurl + urllib.urlencode({'sensor':'false', 'address': address})
uh = urllib.urlopen(url)
Please see that answer in another Unicode related question.
Now: the Python 3 str (which was the Python 2 unicode) type is an idealised object, in the sense that it deals with “characters”, not “bytes”. To be written to or read from disk or the network, these characters need to be encoded into, or decoded from, bytes by a “conversion table”, a.k.a. an encoding, a.k.a. a codepage. Because of operating system variety, Python has historically avoided guessing what that encoding should be; this has been changing over the years, but the principle of “In the face of ambiguity, refuse the temptation to guess.” still applies.
Thankfully, a web server makes your work easier. Your response above should give you all extra information needed:
>>> response.headers['content-type']
'application/json; charset=UTF-8'
So, every time you issue a request to a web server, check the Content-Type header for a charset value, and decode the response's data into Unicode (Python 3: bytes.decode(charset) → str) by using that charset.
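One way to put that into practice is to let the standard library parse the header for you. The sketch below is an assumption-laden illustration: the helper names charset_from_content_type and decode_json_body are hypothetical, the UTF-8 fallback is my own choice, and the raw bytes stand in for response.read():

```python
import json
from email.message import Message


def charset_from_content_type(content_type, default="utf-8"):
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns None when the header has no charset.
    return msg.get_content_charset() or default


def decode_json_body(raw_bytes, content_type):
    """Decode the body with the server-declared charset, then parse the JSON."""
    charset = charset_from_content_type(content_type)
    return json.loads(raw_bytes.decode(charset))


# Stand-in for `response.read()`; no request is made here.
raw = u'{"title": "caf\u00e9"}'.encode("utf-8")
data = decode_json_body(raw, "application/json; charset=UTF-8")
```

In Python 3, the object in response.headers offers the same get_content_charset() method directly, so the header does not need to be re-parsed there.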
Here is an approach that is compatible across both versions - it works by first converting bytes data to string, and then loading the string.
import json
try:
    from urllib.request import Request, urlopen #python3+
except ImportError:
    from urllib2 import Request, urlopen #python2
url = 'https://jsonfeed.org/feed.json'
request = Request(url)
response_json_string = urlopen(request).read().decode('utf8')
response_json_object = json.loads(response_json_string)

Python urllib2 file upload problems

I'm currently trying to initiate a file upload with urllib2 and the urllib2_file library. Here's my code:
import sys
import urllib2_file
import urllib2
URL='http://aquate.us/upload.php'
d = [('uploaded', open(sys.argv[1:]))]
req = urllib2.Request(URL, d)
u = urllib2.urlopen(req)
print u.read()
I've placed this .py file in my My Documents directory and placed a shortcut to it in my Send To folder (the shortcut URL is ).
When I right click a file, choose Send To, and select Aquate (my python), it opens a command prompt for a split second and then closes it. Nothing gets uploaded.
I knew there was probably an error going on so I typed the code into CL python, line by line.
When I ran the u=urllib2.urlopen(req) line, I didn't get an error;
instead, the cursor simply started blinking on a new line beneath that line. I waited a couple of minutes to see if something would happen but it just stayed like that. To get it to stop, I had to press ctrl+break.
What's up with this script?
Thanks in advance!
[Edit]
Forgot to mention -- when I ran the script without the request data (the file) it ran like a charm. Is it a problem with urllib2_file?
[edit 2]:
import MultipartPostHandler, urllib2, cookielib,sys
import win32clipboard as w
cookies = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookies),MultipartPostHandler.MultipartPostHandler)
params = {"uploaded" : open("c:/cfoot.js") }
a=opener.open("http://www.aquate.us/upload.php", params)
text = a.read()
w.OpenClipboard()
w.EmptyClipboard()
w.SetClipboardText(text)
w.CloseClipboard()
That code works like a charm if you run it through the command line.
If you're using Python 2.5 or newer, urllib2_file is both unnecessary and unsupported, so check which version you're using (and perhaps upgrade).
If you're using Python 2.3 or 2.4 (the only versions supported by urllib2_file), try running the sample code and see if you have the same problem. If so, there is likely something wrong either with your Python or urllib2_file installation.
EDIT:
Also, you don't seem to be using either of urllib2_file's two supported formats for POST data. Note as well that sys.argv[1:] is a list, while open() needs a single path such as sys.argv[1]. Try using one of the following two lines instead:
d = ['uploaded', open(sys.argv[1])]
## --OR-- ##
d = {'uploaded': open(sys.argv[1])}
First, there's a third way to run Python programs.
From cmd.exe, type python myprogram.py. You get a nice log. You don't have to type stuff one line at a time.
Second, check the urllib2 documentation. You'll need to look at urllib, also.
A Request requires a URL and a URL-encoded buffer of data. From the documentation:
data should be a buffer in the standard application/x-www-form-urlencoded format. The urllib.urlencode() function takes a mapping or sequence of 2-tuples and returns a string in this format.
You need to encode your data.
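For plain form fields, the encoding step can be sketched like this. The field values are made up for illustration, and the import fallback is there only so the snippet runs on both Python 2 and Python 3. Note that this covers ordinary key/value fields; actual file contents need multipart/form-data, which is a different format:

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2

# Hypothetical form fields, given as a sequence of 2-tuples.
fields = [("sensor", "false"), ("address", "10 Downing St, London")]

# urlencode() returns text; the request body should be bytes.
body = urlencode(fields).encode("ascii")
print(body)
```

The resulting body is what you would pass as the data argument of a Request.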
If you're still on Python 2.5, what worked for me was to download the code here:
http://peerit.blogspot.com/2007/07/multipartposthandler-doesnt-work-for.html
and save it as MultipartPostHandler.py
then use:
import urllib2, MultipartPostHandler
opener = urllib2.build_opener(MultipartPostHandler.MultipartPostHandler())
opener.open(url, {"file":open(...)})
or if you need cookies:
import urllib2, MultipartPostHandler, cookielib
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj), MultipartPostHandler.MultipartPostHandler())
opener.open(url, {"file":open(...)})
