I am using the following code to open a URL and retrieve its response:
import urllib2

def get_issue_report(query):
    request = urllib2.Request(query)
    response = urllib2.urlopen(request)
    response_headers = response.info()
    print response.read()
The response I get is as follows:
<?xml version='1.0' encoding='UTF-8'?><entry xmlns='http://www.w3.org/2005/Atom' xmlns:gd='http://schemas.google.com/g/2005' xmlns:issues='http://schemas.google.com/projecthosting/issues/2009' gd:etag='W/"DUUFQH47eCl7ImA9WxBbFEg."'><id>http://code.google.com/feeds/issues/p/chromium/issues/full/2</id><published>2008-08-30T16:00:21.000Z</published><updated>2010-03-13T05:13:31.000Z</updated><title>Testing if chromium id works</title><content type='html'>&lt;b&gt;What steps will reproduce the problem?&lt;/b&gt;
&lt;b&gt;1.&lt;/b&gt;
&lt;b&gt;2.&lt;/b&gt;
&lt;b&gt;3.&lt;/b&gt;
&lt;b&gt;What is the expected output? What do you see instead?&lt;/b&gt;
&lt;b&gt;Please use labels and text to provide additional information.&lt;/b&gt;
</content><link rel='replies' type='application/atom+xml' href='http://code.google.com/feeds/issues/p/chromium/issues/2/comments/full'/><link rel='alternate' type='text/html' href='http://code.google.com/p/chromium/issues/detail?id=2'/><link rel='self' type='application/atom+xml' href='https://code.google.com/feeds/issues/p/chromium/issues/full/2'/><author><name>rah...#google.com</name><uri>/u/#VBJVRVdXDhZCVgJ%2FF3tbUV5SAw%3D%3D/</uri></author><issues:closedDate>2008-08-30T20:48:43.000Z</issues:closedDate><issues:id>2</issues:id><issues:label>Type-Bug</issues:label><issues:label>Priority-Medium</issues:label><issues:owner><issues:uri>/u/kuchhal#chromium.org/</issues:uri><issues:username>kuchhal#chromium.org</issues:username></issues:owner><issues:stars>4</issues:stars><issues:state>closed</issues:state><issues:status>Invalid</issues:status></entry>
I would like to get rid of the entity references like &lt;, &gt;, etc. I tried using
response.read().decode('utf-8')
but this doesn't help much.
Just in case, response.info() prints the following:
Content-Type: application/atom+xml; charset=UTF-8; type=entry
Expires: Fri, 01 Jul 2011 11:15:17 GMT
Date: Fri, 01 Jul 2011 11:15:17 GMT
Cache-Control: private, max-age=0, must-revalidate, no-transform
Vary: Accept, X-GData-Authorization, GData-Version
GData-Version: 1.0
ETag: W/"DUUFQH47eCl7ImA9WxBbFEg."
Last-Modified: Sat, 13 Mar 2010 05:13:31 GMT
X-Content-Type-Options: nosniff
X-Frame-Options: SAMEORIGIN
X-XSS-Protection: 1; mode=block
Server: GSE
Connection: close
Here's the URL : https://code.google.com/feeds/issues/p/chromium/issues/full/2
Sentinel has explained how you can decode entity references like &lt;, but there's a bit more to the problem than that.
The example you give suggests that you are reading an Atom feed. If you want to do this reliably in Python, then I recommend using Mark Pilgrim's Universal Feed Parser.
Here's how one would read the feed in your example:
>>> import feedparser
>>> d = feedparser.parse('http://code.google.com/feeds/issues/p/chromium/issues/full/2')
>>> len(d.entries)
1
>>> print d.entries[0].title
Testing if chromium id works
>>> print d.entries[0].description
<b>What steps will reproduce the problem?</b>
<b>1.</b>
<b>2.</b>
<b>3.</b>
<b>What is the expected output? What do you see instead?</b>
<b>Please use labels and text to provide additional information.</b>
Using feedparser is likely to be much more reliable and convenient than trying to do your own XML parsing, entity decoding, date parsing, HTML sanitization, and so on.
from HTMLParser import HTMLParser
import urllib2
query="http://code.google.com/feeds/issues/p/chromium/issues/full/2"
def get_issue_report(query):
request = urllib2.Request(query)
response = urllib2.urlopen(request)
response_headers = response.info()
return response.read()
s = get_issue_report(query)
p = HTMLParser()
print p.unescape(s)
p.close()
Use
xml.sax.saxutils.unescape()
http://docs.python.org/library/xml.sax.utils.html#module-xml.sax.saxutils
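A minimal sketch of how that would apply to the escaped content above (Python 2):
from xml.sax.saxutils import unescape

# unescape() converts &lt;, &gt; and &amp; back to <, > and &
print unescape("&lt;b&gt;What steps will reproduce the problem?&lt;/b&gt;")
# prints: <b>What steps will reproduce the problem?</b>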
Related
I am trying to send a PNG as an attachment, but it doesn't show up in Slack. I am setting the right parameters in the POST method, but Slack refuses to use the image I am providing.
I am using Flask to serve the static files:
from flask import Flask, make_response, send_file

app = Flask(__name__)

@app.route('/data/<path:path>')
def send_png(path):
    response = make_response(send_file("data/" + path))
    return response
When I call the URL in my browser, the file gets displayed without any issues. When I pass the URL to Slack as an attachment, the file doesn't show up.
When I pass the URL of an imgur image, the attachment does get displayed.
For that reason, I assume the issue lies somewhere in the content-type/file headers of the files Flask serves.
My file headers are:
HTTP/1.0 200 OK
Content-Length: 391777
Content-Type: image/png
Last-Modified: Fri, 02 Mar 2018 22:46:41 GMT
Cache-Control: public, max-age=43200
Expires: Sat, 03 Mar 2018 12:48:53 GMT
ETag: "1520030801.2465587-391777-4064615867"
Server: Werkzeug/0.11.11 Python/3.5.2
Date: Sat, 03 Mar 2018 00:48:53 GMT
Connection: keep-alive
I can also verify that Slack does request my attachment (it just doesn't display it, as said before):
[('User-Agent', 'Slackbot 1.0 (+https://api.slack.com/robots)'), ('X-Forwarded-For', '54.89.92.4'), ('Content-Type', ''), ('Accept-Encoding', 'gzip,deflate'), ('Accept', '*/*'), ('Host', 'XXXXXXXXXX'), ('Referer', 'https://slack.com'), ('Content-Length', ''), ('X-Forwarded-Proto', 'https')]
What about this? The image should go in files= if you want it sent as an attachment:
import io
import requests
from PIL import Image

img = Image.open('picture.png')
buf = io.BytesIO()
img.save(buf, format='PNG')  # re-encode the image into an in-memory buffer
buf.seek(0)
# files= takes a dict of field name -> file object (a multipart upload);
# the URL needs a scheme, '127.0.0.1' alone would raise MissingSchema
r = requests.post("http://127.0.0.1", files={'file': buf}, timeout=5)
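If you'd rather keep serving the image from Flask and have Slack fetch it by URL, you could also try pinning the response headers explicitly. This is just a sketch: mimetype is a real send_file parameter, but whether the extra headers change Slack's behavior is an assumption.
from flask import Flask, send_file

app = Flask(__name__)

@app.route('/data/<path:path>')
def send_png(path):
    # Pin the Content-Type instead of relying on extension guessing,
    # and ask clients to render the file inline rather than download it
    response = send_file("data/" + path, mimetype='image/png')
    response.headers['Content-Disposition'] = 'inline'
    return response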
I'm using Python 2.7 and I want to parse string HTTP response fields which I already extracted from a text file. What would be the easiest way? I can parse requests by using BaseHTTPServer, but I couldn't manage to find something for responses.
The responses I have are pretty standard and in the following format:
HTTP/1.1 200 OK
Date: Thu, Jul 3 15:27:54 2014
Content-Type: text/xml; charset="utf-8"
Connection: close
Content-Length: 626
Thanks in advance,
You might find this useful; keep in mind that HTTPResponse wasn't designed to be "instantiated directly by user."
Also note that the Content-Length header in your response string may no longer be valid (it depends on how you've acquired these responses). This just means that the call to HTTPResponse.read() needs a value larger than the content in order to get it all.
In Python 2 it can be run this way:
from httplib import HTTPResponse
from StringIO import StringIO
http_response_str = """HTTP/1.1 200 OK
Date: Thu, Jul 3 15:27:54 2014
Content-Type: text/xml; charset="utf-8"
Connection: close
Content-Length: 626"""
class FakeSocket():
    def __init__(self, response_str):
        self._file = StringIO(response_str)
    def makefile(self, *args, **kwargs):
        return self._file
source = FakeSocket(http_response_str)
response = HTTPResponse(source)
response.begin()
print "status:", response.status
print "single header:", response.getheader('Content-Type')
print "content:", response.read(len(http_response_str)) # the len here will give a 'big enough' value to read the whole content
In Python 3, HTTPResponse is imported from http.client, and the response to be parsed needs to be byte-encoded. Depending on where the data comes from, this may already be done or may need to be done explicitly:
from http.client import HTTPResponse
from io import BytesIO
http_response_str = """HTTP/1.1 200 OK
Date: Thu, Jul 3 15:27:54 2014
Content-Type: text/xml; charset="utf-8"
Connection: close
Content-Length: 626
teststring"""
http_response_bytes = http_response_str.encode()
class FakeSocket():
    def __init__(self, response_bytes):
        self._file = BytesIO(response_bytes)
    def makefile(self, *args, **kwargs):
        return self._file
source = FakeSocket(http_response_bytes)
response = HTTPResponse(source)
response.begin()
print( "status:", response.status)
# status: 200
print( "single header:", response.getheader('Content-Type'))
# single header: text/xml; charset="utf-8"
print( "content:", response.read(len(http_response_str)))
# content: b'teststring'
You might want to consider using python-requests.
Link: http://docs.python-requests.org/en/latest/
Here is an example from http://dancallahan.info/journal/python-requests/
Assuming your responses are compliant with the HTTP RFC, does this look like something you want to do?
>>> import requests
>>> url = 'http://example.test/'
>>> response = requests.get(url)
>>> response.status_code
200
>>> response.headers['content-type']
'text/html; charset=utf-8'
>>> response.text
u'Hello, world!'
I want to download this GIF image to disk:
http://www.portaportese.it/telefono/es_2014043024395.gif
With all the code I found around for downloading pictures, I end up with an error in the final saved picture, such as:
GIF image was truncated or incomplete.
In a few words, the picture is not being saved correctly.
Is there anybody able to provide a correct solution which will download this picture to disk?
Every piece of code returns an empty image. I tried this:
import urllib2
picture_page = "http://www.portaportese.it/telefono/es_2014043024395.gif"
opener1 = urllib2.build_opener()
page1 = opener1.open(picture_page)
my_picture = page1.read()
filename = "my_image.gif"
fout = open(filename, "wb")
fout.write(my_picture)
fout.close()
The problem does not lie with your Python code; the image that you are trying to download does not exist. If I use curl to place a request at that URL, you can see that no image is stored there.
~ ❯❯❯ curl -I http://www.portaportese.it/telefono/es_2014043024395.gif
HTTP/1.1 200 OK
Date: Mon, 07 Jul 2014 19:35:05 GMT
Server: Apache/2.2.3 (Red Hat)
Connection: close
Content-Type: text/plain; charset=UTF-8
Compare that with this request to a known image source:
~ ❯❯❯ curl -I http://baconmockup.com/300/200
HTTP/1.1 200 OK
Date: Mon, 07 Jul 2014 19:35:42 GMT
Server: Apache/2.4.7 (Ubuntu)
X-Powered-By: PHP/5.5.9-1ubuntu4.2
Access-Control-Allow-Origin: *
Content-Length: 20564
Content-Disposition: inline; filename=brisket-300-200.jpg
Pragma: public
Cache-Control: public
Expires: Mon, 21 Jul 2014 19:35:43 GMT
Last-Modified: Mon, 20 Aug 2012 19:20:21 GMT
Vary: User-Agent
Content-Type: image/jpeg
If you change the URL in your code to a good image source, then it will work perfectly well.
import urllib2
picture_page = "http://baconmockup.com/300/200"
opener1 = urllib2.build_opener()
page1 = opener1.open(picture_page)
my_picture = page1.read()
filename = "my_image.gif"
fout = open(filename, "wb")
fout.write(my_picture)
fout.close()
I just ran this, and was given a picture of some tasty brisket.
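If you want to guard against this case in your own code, you could check the Content-Type before writing the file. A sketch using urllib2 (assuming the server reports the type accurately):
import urllib2

picture_page = "http://baconmockup.com/300/200"
page = urllib2.urlopen(picture_page)
content_type = page.info().gettype()  # e.g. 'image/jpeg'
if content_type.startswith('image/'):
    # only save the body if the server actually claims it's an image
    with open("my_image", "wb") as fout:
        fout.write(page.read())
else:
    print "Not an image, got %s" % content_type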
I just want a better idea of what's going on here; I can of course "work around" the problem by using urllib2.
import urllib
import urllib2
url = "http://www.crutchfield.com/S-pqvJFyfA8KG/p_15410415/Dynamat-10415-Xtreme-Speaker-Kit.html"
# urllib2 works fine (foo.headers / foo.read() also behave)
foo = urllib2.urlopen(url)
# urllib throws errors though, what specifically is causing this?
bar = urllib.urlopen(url)
http://pae.st/AxDW/ shows this code in action with the exception/stacktrace. foo.headers and foo.read() work fine.
stu#sente.cc ~ $: curl -I "http://www.crutchfield.com/S-pqvJFyfA8KG/p_15410415/Dynamat-10415-Xtreme-Speaker-Kit.html"
HTTP/1.1 302 Object Moved
Cache-Control: private
Transfer-Encoding: chunked
Content-Type: text/html; charset=utf-8
Location: /S-FSTWJcduy5w/p_15410415/Dynamat-10415-Xtreme-Speaker-Kit.html
Server: Microsoft-IIS/7.5
Set-Cookie: SESSIONID=FSTWJcduy5w; domain=.crutchfield.com; expires=Fri, 22-Feb-2013 22:06:43 GMT; path=/
Set-Cookie: SYSTEMID=0; domain=.crutchfield.com; expires=Fri, 22-Feb-2013 22:06:43 GMT; path=/
Set-Cookie: SESSIONDATE=02/23/2012 17:07:00; domain=.crutchfield.com; expires=Fri, 22-Feb-2013 22:06:43 GMT; path=/
X-AspNet-Version: 4.0.30319
HostName: cws105
Date: Thu, 23 Feb 2012 22:06:43 GMT
Thanks.
This server is both non-deterministic and sensitive to HTTP version. urllib2 is HTTP/1.1, urllib is HTTP/1.0. You can reproduce this by running curl --http1.0 -I "http://www.crutchfield.com/S-pqvJFyfA8KG/p_15410415/Dynamat-10415-Xtreme-Speaker-Kit.html"
a few times in a row. You should see the output curl: (52) Empty reply from server occasionally; that's the error urllib is reporting. (If you re-issue the request a bunch of times with urllib, it should succeed sometimes.)
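A quick sketch to see the flakiness from Python (five attempts is an arbitrary choice):
import urllib

url = "http://www.crutchfield.com/S-pqvJFyfA8KG/p_15410415/Dynamat-10415-Xtreme-Speaker-Kit.html"
for attempt in range(5):
    try:
        # urllib speaks HTTP/1.0, so this should fail intermittently
        bar = urllib.urlopen(url)
        print "attempt %d: HTTP %s" % (attempt, bar.getcode())
        break
    except IOError as e:
        print "attempt %d failed: %s" % (attempt, e)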
I solved the problem. I'm simply using urllib now instead of urllib2 and everything works fine. Thank you all :)
I wrote a crawler in Python. The fetched URLs are of different types: a URL can point to HTML, to an image, to a big archive, or to other files. So I need to determine the type quickly, to avoid reading big files such as large archives, and continue crawling. What is the best way to determine the URL type at the start of page loading?
I understand that I can do it by URL name (ends with .rar, .jpg, etc.), but I don't think that's a full solution. Do I need to check a header or something like that? I also need some page-size prediction to prevent large downloads. In other words, I want to set a limit on the downloaded page size, to keep memory from filling up fast.
If you use an HTTP HEAD request on the resource, you will get relevant metadata about the resource without the resource data itself. Specifically, the Content-Length and Content-Type headers will be of interest.
E.g.
HEAD /stackoverflow/img/favicon.ico HTTP/1.1
host: sstatic.net
HTTP/1.1 200 OK
Cache-Control: max-age=604800
Content-Length: 1150
Content-Type: image/x-icon
Last-Modified: Mon, 02 Aug 2010 06:04:04 GMT
Accept-Ranges: bytes
ETag: "2187d82832cb1:0"
X-Powered-By: ASP.NET
Date: Sun, 12 Sep 2010 13:38:36 GMT
You can do this in Python using httplib:
>>> import httplib
>>> conn = httplib.HTTPConnection("sstatic.net")
>>> conn.request("HEAD", "/stackoverflow/img/favicon.ico")
>>> res = conn.getresponse()
>>> print res.getheaders()
[('content-length', '1150'), ('x-powered-by', 'ASP.NET'), ('accept-ranges', 'bytes'), ('last-modified', 'Mon, 02 Aug 2010 06:04:04 GMT'), ('etag', '"2187d82832cb1:0"'), ('cache-control', 'max-age=604800'), ('date', 'Sun, 12 Sep 2010 13:39:26 GMT'), ('content-type', 'image/x-icon')]
This tells you it's an image (image/* MIME type) of 1150 bytes. Enough information for you to decide whether you want to fetch the full resource.
Additionally, this tells you the server accepts HTTP partial content requests (the Accept-Ranges header), which allows you to retrieve the data in batches.
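As a sketch, you could then fetch just the first chunk with a Range header (the 206 status and 512-byte body assume the server honors the range):
>>> conn = httplib.HTTPConnection("sstatic.net")
>>> conn.request("GET", "/stackoverflow/img/favicon.ico", headers={"Range": "bytes=0-511"})
>>> res = conn.getresponse()
>>> res.status
206
>>> len(res.read())
512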
You will get the same header information if you do a GET directly, but this will also start sending the resource data in the body of the response, something you want to avoid.
If you want to learn more about HTTP headers and their meaning, you can use an online tool such as 'Fetch'.