python parse http response (string) - python

I'm using python 2.7 and I want to parse string HTTP response fields which I already extracted from a text file. What would be the easiest way? I can parse requests by using the BaseHTTPServer but couldn't manage to find something for the responses.
The responses I have are pretty standard and in the following format
HTTP/1.1 200 OK
Date: Thu, Jul 3 15:27:54 2014
Content-Type: text/xml; charset="utf-8"
Connection: close
Content-Length: 626
Thanks in advance,

You might find this useful. Keep in mind that HTTPResponse wasn't designed to be "instantiated directly by user."
Also note that the Content-Length header in your response string may no longer be valid (it depends on how you've acquired these responses). This just means that the call to HTTPResponse.read() needs to be given a value larger than the content in order to get all of it.
In Python 2 it can be run this way:
from httplib import HTTPResponse
from StringIO import StringIO
http_response_str = """HTTP/1.1 200 OK
Date: Thu, Jul 3 15:27:54 2014
Content-Type: text/xml; charset="utf-8"
Connection: close
Content-Length: 626"""
class FakeSocket():
    def __init__(self, response_str):
        self._file = StringIO(response_str)
    def makefile(self, *args, **kwargs):
        return self._file
source = FakeSocket(http_response_str)
response = HTTPResponse(source)
response.begin()
print "status:", response.status
print "single header:", response.getheader('Content-Type')
print "content:", response.read(len(http_response_str)) # the len here will give a 'big enough' value to read the whole content
In Python 3, HTTPResponse is imported from http.client, and the response to be parsed needs to be bytes. Depending on where the data comes from, it may already be bytes or may need to be encoded explicitly:
from http.client import HTTPResponse
from io import BytesIO
http_response_str = """HTTP/1.1 200 OK
Date: Thu, Jul 3 15:27:54 2014
Content-Type: text/xml; charset="utf-8"
Connection: close
Content-Length: 626

teststring"""
http_response_bytes = http_response_str.encode()
class FakeSocket():
    def __init__(self, response_bytes):
        self._file = BytesIO(response_bytes)
    def makefile(self, *args, **kwargs):
        return self._file
source = FakeSocket(http_response_bytes)
response = HTTPResponse(source)
response.begin()
print("status:", response.status)
# status: 200
print("single header:", response.getheader('Content-Type'))
# single header: text/xml; charset="utf-8"
print("content:", response.read(len(http_response_str)))
# content: b'teststring'
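If all you need is the status line and the header fields, a lighter alternative is to split off the status line yourself and hand the rest to the standard library's email parser, which understands the same "Name: value" header syntax. This is only a sketch for Python 3 and is not taken from the answer above: the function name parse_http_response is made up for illustration, and it assumes a well-formed response without chunked transfer encoding or compression.

```python
from email.parser import Parser


def parse_http_response(raw):
    """Split a raw HTTP response string into (status, reason, message).

    `message` is an email.message.Message: index it for headers and
    call get_payload() for the body text.
    """
    status_line, _, rest = raw.partition("\n")
    parts = status_line.strip().split(None, 2)  # e.g. ['HTTP/1.1', '200', 'OK']
    status = int(parts[1])
    reason = parts[2] if len(parts) > 2 else ""
    message = Parser().parsestr(rest)  # headers up to the blank line, then body
    return status, reason, message


raw_response = (
    "HTTP/1.1 200 OK\r\n"
    'Content-Type: text/xml; charset="utf-8"\r\n'
    "Connection: close\r\n"
    "\r\n"
    "teststring"
)
status, reason, message = parse_http_response(raw_response)
print(status, reason)
print(message["Content-Type"])
print(message.get_payload())
```

Header lookups on the returned message are case-insensitive, so message["content-type"] works as well.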

You might want to consider using python-requests.
Link: http://docs.python-requests.org/en/latest/
Here is an example from http://dancallahan.info/journal/python-requests/
Assuming your responses are compliant with the HTTP RFC, does this look like something you want to do?
>>> import requests
>>> url = 'http://example.test/'
>>> response = requests.get(url)
>>> response.status_code
200
>>> response.headers['content-type']
'text/html; charset=utf-8'
>>> response.text
u'Hello, world!'

Related

httplib - http not accepting content length

Problem
When I switched Macbooks, all of the sudden I am getting an HTTP 411: Length Required (I wasn't getting this using a different Mac) trying to use a POST request with httplib. I cannot seem to find a work around for this.
Code Portion 1: from a supporting class; retrieves data and other things,
class Data(object):
    def __init__(self, value):
        self.company_id = None
        self.host = settings.CONSUMER_URL
        self.body = None
        self.headers = {"clienttype": "Cloud-web", "Content-Type": "application/json", "ErrorLogging": value}
    def login(self):
        '''Login and store auth token'''
        path = "/Security/Login"
        body = self.get_login_info()
        status_code, resp = self.submit_request("POST", path, json.dumps(body))
        self.info = json.loads(resp)
        company_id = self.get_company_id(self.info)
        self.set_token(self.info["token"])
        return company_id
    def submit_request(self, method, path, body=None, header=None):
        '''Submit requests for API tests'''
        conn = httplib.HTTPSConnection(self.host)
        conn.set_debuglevel(1)
        conn.request(method, path, body, self.headers)
        resp = conn.getresponse()
        return resp.status, resp.read()
Code Portion 2: my unittests,
# logging in
cls.api = data.Data(False) # initializing the data class from Code Portion 1
cls.company_id = cls.api.login()
...
# POST Client/Register
def test_client_null_body(self):
    '''Null body object - 501'''
    status, resp = self.api.submit_request('POST', '/Client/register')
    if status != 500:
        log.log_warning('POST /Client/register: %s, %s' % (str(status), str(resp)))
    self.assertEqual(status, 500)
Code Portion 3: example of the data I send from a settings file,
API_ACCOUNT = {
    "userName": "account@account.com",
    "password": "password",
    "companyId": 107
}
From Logging
WARNING:root: POST /Client/register: 411, <!DOCTYPE HTML PUBLIC "-//W3C//DTD
HTML 4.01//EN""http://www.w3.org/TR/html4/strict.dtd">
<HTML><HEAD><TITLE>Length Required</TITLE>
<META HTTP-EQUIV="Content-Type" Content="text/html; charset=us-ascii"></HEAD>
<BODY><h2>Length Required</h2>
<hr><p>HTTP Error 411. The request must be chunked or have a content length.</p>
</BODY></HTML>
Additional Info: I was using a 2008 Macbook Pro without issue. Switched to a 2013 Macbook Pro and this keeps occurring.
I took a look at this post, Python httplib and POST, and it seems that at the time httplib did not automatically generate the Content-Length header.
The current documentation at https://docs.python.org/2/library/httplib.html now says:
If one is not provided in headers, a Content-Length header is added automatically for all methods if the length of the body can be determined, either from the length of the str representation, or from the reported size of the file on disk.
With conn.set_debuglevel(1) enabled, this is the reply that httplib reports:
reply: 'HTTP/1.1 411 Length Required\r\n'
header: Content-Type: text/html; charset=us-ascii
header: Server: Microsoft-HTTPAPI/2.0
header: Date: Thu, 26 May 2016 17:08:46 GMT
header: Connection: close
header: Content-Length: 344
Edit
Unittest Failure:
======================================================================
FAIL: test_client_null_body (__main__.NegApi)
Null body object - 501
----------------------------------------------------------------------
Traceback (most recent call last):
File "API_neg.py", line 52, in test_client_null_body
self.assertEqual(status, 500)
AssertionError: 411 != 500
.send: 'POST /Client/register HTTP/1.1\r\nHost: my.host\r\nAccept-Encoding: identity\r\nAuthorizationToken: uhkGGpJ4aQxm8BKOCH5dt3bMcwsHGCHs1p+OJvtf9mHKa/8pTEnKyYeJr+boBr8oUuvWvZLr1Fd+Og2xJP3xVw==\r\nErrorLogging: False\r\nContent-Type: application/json\r\nclienttype: Cloud-web\r\n\r\n'
reply: 'HTTP/1.1 411 Length Required\r\n'
header: Content-Type: text/html; charset=us-ascii
header: Server: Microsoft-HTTPAPI/2.0
header: Date: Thu, 26 May 2016 17:08:27 GMT
header: Connection: close
header: Content-Length: 344
Any ideas as to why this was working on a previous Mac and is currently not working here? It's the same code, same operating systems. Let me know if I can provide any more information.
Edit 2
The issue seemed to be with OSX 10.10.4, after upgrading to 10.10.5 all is well. I still would like to get some insight on why I was having this issue.
The only change from 10.10.4 to 10.10.5, that seems close, would have been the python update from 2.7.6 to 2.7.10 which includes this bug fix: http://bugs.python.org/issue22417
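Whatever the root cause turns out to be, a server that answers 411 is asking for an explicit Content-Length, and the failing request in the debug output above has no body and no such header. One way to avoid depending on what the HTTP library adds automatically is to set the header yourself before calling conn.request(). A minimal sketch in Python 3 syntax; the helper name with_content_length is hypothetical and not part of httplib:

```python
def with_content_length(headers, body=None):
    """Return (headers, body_bytes) with an explicit Content-Length.

    A missing body is treated as an empty one, so a body-less POST
    still carries 'Content-Length: 0'.
    """
    if body is None:
        body_bytes = b""
    elif isinstance(body, str):
        body_bytes = body.encode("utf-8")
    else:
        body_bytes = body
    merged = dict(headers)
    # Only add the header if the caller has not set it (in any letter case).
    if not any(name.lower() == "content-length" for name in merged):
        merged["Content-Length"] = str(len(body_bytes))
    return merged, body_bytes


headers, body = with_content_length({"Content-Type": "application/json"})
print(headers)
```

The returned pair can then be passed on, for example as conn.request(method, path, body, headers) inside submit_request.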

Python requests - print entire http request (raw)?

While using the requests module, is there any way to print the raw HTTP request?
I don't want just the headers, I want the request line, headers, and content printout. Is it possible to see what ultimately is constructed from HTTP request?
Since v1.2.3 Requests added the PreparedRequest object. As per the documentation "it contains the exact bytes that will be sent to the server".
One can use this to pretty print a request, like so:
import requests
req = requests.Request('POST','http://stackoverflow.com',headers={'X-Custom':'Test'},data='a=1&b=2')
prepared = req.prepare()
def pretty_print_POST(req):
    """
    At this point it is completely built and ready
    to be fired; it is "prepared".
    However pay attention to the formatting used in
    this function because it is programmed to be pretty
    printed and may differ from the actual request.
    """
    print('{}\n{}\r\n{}\r\n\r\n{}'.format(
        '-----------START-----------',
        req.method + ' ' + req.url,
        '\r\n'.join('{}: {}'.format(k, v) for k, v in req.headers.items()),
        req.body,
    ))
pretty_print_POST(prepared)
which produces:
-----------START-----------
POST http://stackoverflow.com/
Content-Length: 7
X-Custom: Test

a=1&b=2
Then you can send the actual request with this:
s = requests.Session()
s.send(prepared)
These links are to the latest documentation available, so they might change in content:
Advanced - Prepared requests and API - Lower level classes
import requests
response = requests.post('http://httpbin.org/post', data={'key1': 'value1'})
print(response.request.url)
print(response.request.body)
print(response.request.headers)
Response objects have a .request property which is the PreparedRequest object that was sent.
An even better idea is to use the requests_toolbelt library, which can dump out both requests and responses as strings for you to print to the console. It handles all the tricky cases with files and encodings which the above solution does not handle well.
It's as easy as this:
import requests
from requests_toolbelt.utils import dump
resp = requests.get('https://httpbin.org/redirect/5')
data = dump.dump_all(resp)
print(data.decode('utf-8'))
Source: https://toolbelt.readthedocs.org/en/latest/dumputils.html
You can simply install it by typing:
pip install requests_toolbelt
Note: this answer is outdated. Newer versions of requests support getting the request content directly, as AntonioHerraizS's answer documents.
It's not possible to get the true raw content of the request out of requests, since it only deals with higher level objects, such as headers and method type. requests uses urllib3 to send requests, but urllib3 also doesn't deal with raw data - it uses httplib. Here's a representative stack trace of a request:
-> r= requests.get("http://google.com")
/usr/local/lib/python2.7/dist-packages/requests/api.py(55)get()
-> return request('get', url, **kwargs)
/usr/local/lib/python2.7/dist-packages/requests/api.py(44)request()
-> return session.request(method=method, url=url, **kwargs)
/usr/local/lib/python2.7/dist-packages/requests/sessions.py(382)request()
-> resp = self.send(prep, **send_kwargs)
/usr/local/lib/python2.7/dist-packages/requests/sessions.py(485)send()
-> r = adapter.send(request, **kwargs)
/usr/local/lib/python2.7/dist-packages/requests/adapters.py(324)send()
-> timeout=timeout
/usr/local/lib/python2.7/dist-packages/requests/packages/urllib3/connectionpool.py(478)urlopen()
-> body=body, headers=headers)
/usr/local/lib/python2.7/dist-packages/requests/packages/urllib3/connectionpool.py(285)_make_request()
-> conn.request(method, url, **httplib_request_kw)
/usr/lib/python2.7/httplib.py(958)request()
-> self._send_request(method, url, body, headers)
Inside the httplib machinery, we can see HTTPConnection._send_request indirectly uses HTTPConnection._send_output, which finally creates the raw request and body (if it exists), and uses HTTPConnection.send to send them separately. send finally reaches the socket.
Since there are no hooks for doing what you want, as a last resort you can monkey patch httplib to get the content. It's a fragile solution, and you may need to adapt it if httplib changes. If you intend to distribute software using this solution, you may want to consider packaging httplib instead of using the system's, which is easy, since it's a pure Python module.
Without further ado, the solution:
import requests
import httplib
def patch_send():
    old_send = httplib.HTTPConnection.send
    def new_send(self, data):
        print data
        return old_send(self, data)  # return is not necessary, but never hurts, in case the library is changed
    httplib.HTTPConnection.send = new_send
patch_send()
requests.get("http://www.python.org")
which yields the output:
GET / HTTP/1.1
Host: www.python.org
Accept-Encoding: gzip, deflate, compress
Accept: */*
User-Agent: python-requests/2.1.0 CPython/2.7.3 Linux/3.2.0-23-generic-pae
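In Python 3 the module is called http.client instead of httplib, and the same trick can be wrapped in a context manager so the patch is undone afterwards. The following is only a sketch: capture_sent_bytes and _NullSocket are names invented here, the fake socket exists solely so the example runs without a network, and whether a third-party library such as requests actually routes its traffic through HTTPConnection.send depends on the library version.

```python
import contextlib
import http.client


@contextlib.contextmanager
def capture_sent_bytes():
    """Temporarily wrap HTTPConnection.send and record everything it sends."""
    captured = []
    original_send = http.client.HTTPConnection.send

    def recording_send(self, data):
        captured.append(data)
        return original_send(self, data)

    http.client.HTTPConnection.send = recording_send
    try:
        yield captured
    finally:
        # Always restore the original method, even if the request fails.
        http.client.HTTPConnection.send = original_send


class _NullSocket:
    """Stand-in socket that discards data, so no connection is opened."""
    def sendall(self, data):
        pass


conn = http.client.HTTPConnection("example.invalid")
conn.sock = _NullSocket()  # pre-set socket: the connection never dials out
with capture_sent_bytes() as sent:
    conn.request("GET", "/")
print(sent[0].decode("ascii"))
```

Restoring the original method in the finally clause keeps the patch from leaking into unrelated code, which is the main weakness of a permanent monkey patch.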
requests supports so-called event hooks (as of 2.23 there is actually only the response hook). A hook can be used on a request to print the full request-response pair, including the effective URL, headers and bodies, like this:
import textwrap
import requests
def print_roundtrip(response, *args, **kwargs):
    format_headers = lambda d: '\n'.join(f'{k}: {v}' for k, v in d.items())
    print(textwrap.dedent('''
        ---------------- request ----------------
        {req.method} {req.url}
        {reqhdrs}
        {req.body}
        ---------------- response ----------------
        {res.status_code} {res.reason} {res.url}
        {reshdrs}
        {res.text}
    ''').format(
        req=response.request,
        res=response,
        reqhdrs=format_headers(response.request.headers),
        reshdrs=format_headers(response.headers),
    ))
requests.get('https://httpbin.org/', hooks={'response': print_roundtrip})
Running it prints:
---------------- request ----------------
GET https://httpbin.org/
User-Agent: python-requests/2.23.0
Accept-Encoding: gzip, deflate
Accept: */*
Connection: keep-alive
None
---------------- response ----------------
200 OK https://httpbin.org/
Date: Thu, 14 May 2020 17:16:13 GMT
Content-Type: text/html; charset=utf-8
Content-Length: 9593
Connection: keep-alive
Server: gunicorn/19.9.0
Access-Control-Allow-Origin: *
Access-Control-Allow-Credentials: true
<!DOCTYPE html>
<html lang="en">
...
</html>
You may want to change res.text to res.content if the response is binary.
Here is code that does the same thing, but for the response headers:
import socket
def patch_requests():
    old_readline = socket._fileobject.readline
    if not hasattr(old_readline, 'patched'):
        def new_readline(self, size=-1):
            res = old_readline(self, size)
            print res,
            return res
        new_readline.patched = True
        socket._fileobject.readline = new_readline
patch_requests()
I spent a lot of time searching for this, so I'm leaving it here in case someone needs it.
A fork of @AntonioHerraizS's answer (the HTTP version was missing, as stated in the comments).
Use this code to get a string representing the raw HTTP packet without sending it:
import requests
def get_raw_request(request):
    request = request.prepare() if isinstance(request, requests.Request) else request
    headers = '\r\n'.join(f'{k}: {v}' for k, v in request.headers.items())
    body = '' if request.body is None else request.body.decode() if isinstance(request.body, bytes) else request.body
    return f'{request.method} {request.path_url} HTTP/1.1\r\n{headers}\r\n\r\n{body}'
headers = {'User-Agent': 'Test'}
request = requests.Request('POST', 'https://stackoverflow.com', headers=headers, json={"hello": "world"})
raw_request = get_raw_request(request)
print(raw_request)
Result:
POST / HTTP/1.1
User-Agent: Test
Content-Length: 18
Content-Type: application/json

{"hello": "world"}
💡 Can also print the request in the response object
r = requests.get('https://stackoverflow.com')
raw_request = get_raw_request(r.request)
print(raw_request)
I use the following function to format requests. It's like @AntonioHerraizS's answer except that it also pretty-prints JSON objects in the body, and it labels all parts of the request.
import functools
import json
import textwrap
format_json = functools.partial(json.dumps, indent=2, sort_keys=True)
indent = functools.partial(textwrap.indent, prefix=' ')
def format_prepared_request(req):
    """Pretty-format 'requests.PreparedRequest'
    Example:
        res = requests.post(...)
        print(format_prepared_request(res.request))
        req = requests.Request(...)
        req = req.prepare()
        print(format_prepared_request(req))
    """
    headers = '\n'.join(f'{k}: {v}' for k, v in req.headers.items())
    content_type = req.headers.get('Content-Type', '')
    if 'application/json' in content_type:
        try:
            body = format_json(json.loads(req.body))
        except json.JSONDecodeError:
            body = req.body
    else:
        body = req.body
    # textwrap.indent needs a str: a request without a body has body None,
    # and some bodies are bytes.
    if body is None:
        body = ''
    elif isinstance(body, bytes):
        body = body.decode('utf-8', errors='replace')
    s = textwrap.dedent("""
    REQUEST
    =======
    endpoint: {method} {url}
    headers:
    {headers}
    body:
    {body}
    =======
    """).strip()
    s = s.format(
        method=req.method,
        url=req.url,
        headers=indent(headers),
        body=indent(body),
    )
    return s
And I have a similar function to format the response:
def format_response(resp):
    """Pretty-format 'requests.Response'"""
    headers = '\n'.join(f'{k}: {v}' for k, v in resp.headers.items())
    content_type = resp.headers.get('Content-Type', '')
    if 'application/json' in content_type:
        try:
            body = format_json(resp.json())
        except json.JSONDecodeError:
            body = resp.text
    else:
        body = resp.text
    s = textwrap.dedent("""
    RESPONSE
    ========
    status_code: {status_code}
    headers:
    {headers}
    body:
    {body}
    ========
    """).strip()
    s = s.format(
        status_code=resp.status_code,
        headers=indent(headers),
        body=indent(body),
    )
    return s
test_print.py content:
import logging
import pytest
import requests
from requests_toolbelt.utils import dump
def print_raw_http(response):
    data = dump.dump_all(response, request_prefix=b'', response_prefix=b'')
    return '\n' * 2 + data.decode('utf-8')
@pytest.fixture
def logger():
    log = logging.getLogger()
    log.addHandler(logging.StreamHandler())
    log.setLevel(logging.DEBUG)
    return log
def test_print_response(logger):
    session = requests.Session()
    response = session.get('http://127.0.0.1:5000/')
    # The assertion fails on purpose so that pytest shows the logged request and response.
    assert response.status_code == 300, logger.warning(print_raw_http(response))
hello.py content:
from flask import Flask
app = Flask(__name__)
@app.route('/')
def hello_world():
    return 'Hello, World!'
Run:
$ FLASK_APP=hello.py python -m flask run
$ python -m pytest test_print.py
Stdout:
------------------------------ Captured log call ------------------------------
DEBUG urllib3.connectionpool:connectionpool.py:225 Starting new HTTP connection (1): 127.0.0.1:5000
DEBUG urllib3.connectionpool:connectionpool.py:437 http://127.0.0.1:5000 "GET / HTTP/1.1" 200 13
WARNING root:test_print_raw_response.py:25
GET / HTTP/1.1
Host: 127.0.0.1:5000
User-Agent: python-requests/2.23.0
Accept-Encoding: gzip, deflate
Accept: */*
Connection: keep-alive
HTTP/1.0 200 OK
Content-Type: text/html; charset=utf-8
Content-Length: 13
Server: Werkzeug/1.0.1 Python/3.6.8
Date: Thu, 24 Sep 2020 21:00:54 GMT
Hello, World!

'urllib2.urlopen' adding Host header

I'm using Observium to pull Nginx stats on localhost however it returns '405 Not Allowed':
# curl -I localhost/nginx_status
HTTP/1.1 405 Not Allowed
Server: nginx
Date: Wed, 19 Jun 2013 22:12:37 GMT
Content-Type: text/html; charset=utf-8
Content-Length: 166
Connection: keep-alive
Keep-Alive: timeout=5
# curl -I -H "Host: example.com" localhost/nginx_status
HTTP/1.1 200 OK
Server: nginx
Date: Wed, 19 Jun 2013 22:12:43 GMT
Content-Type: text/plain
Connection: keep-alive
Keep-Alive: timeout=5
Could you please advise how to add a Host header with 'urllib2.urlopen' in Python (Python 2.6.6)?
Current script:
#!/usr/bin/env python
import urllib2
import re
data = urllib2.urlopen('http://localhost/nginx_status').read()
params = {}
for line in data.split("\n"):
    smallstat = re.match(r"\s?Reading:\s(.*)\sWriting:\s(.*)\sWaiting:\s(.*)$", line)
    req = re.match(r"\s+(\d+)\s+(\d+)\s+(\d+)", line)
    if smallstat:
        params["Reading"] = smallstat.group(1)
        params["Writing"] = smallstat.group(2)
        params["Waiting"] = smallstat.group(3)
    elif req:
        params["Requests"] = req.group(3)
    else:
        pass
dataorder = [
    "Active",
    "Reading",
    "Writing",
    "Waiting",
    "Requests"
]
print "<<<nginx>>>\n"
for param in dataorder:
    if param == "Active":
        Active = int(params["Reading"]) + int(params["Writing"]) + int(params["Waiting"])
        print Active
    else:
        print params[param]
You might want to check out the urllib2 missing manual for more information, but basically you create a dictionary of your header names and values and pass it to the urllib2.Request constructor. A (slightly) modified version of the code from the linked manual:
from urllib import urlencode
from urllib2 import Request, urlopen
# Define values that we'll pass to our urllib and urllib2 methods
url = 'http://www.something.com/blah'
user_host = 'example.com'
values = {'name' : 'Engineero', # dict of keys and values for our POST data
'location' : 'Interwebs',
'language' : 'Python' }
headers = { 'Host' : user_host } # dict of keys and values for our header
# Set up our request, execute, and read
data = urlencode(values) # encode for sending URL request
req = Request(url, data, headers) # make POST request to url with data and headers
response = urlopen(req) # get the response from the server
the_page = response.read() # read the response from the server
# Do other stuff with the response
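For the problem in the question, a plain GET with a custom Host header, no POST data is needed: passing only the headers dictionary keeps the request a GET. A sketch of the same idea using Python 3's urllib.request (urllib2.Request in Python 2 accepts the same url and headers arguments). The URL and host name are the ones from the question, and the actual network call is left commented out:

```python
from urllib.request import Request, urlopen

url = "http://localhost/nginx_status"
req = Request(url, headers={"Host": "example.com"})

print(req.get_method())        # no data argument, so this stays a GET
print(req.get_header("Host"))  # explicit Host value to send instead of 'localhost'

# Uncomment to perform the request against a running server:
# data = urlopen(req).read().decode("utf-8")
```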

python and twisted proxy, how to gunzip on the fly?

How can I gunzip and process the response part when using the twistedmatrix ProxyClient?
I need to examine text or JavaScript and AJAX queries/answers. Should I use handleResponseEnd?
I thought it belonged inside handleResponsePart, but it looks like I have misunderstood something. Here is my skeleton code:
import gzip
import sys
from StringIO import StringIO
from twisted.python import log
from twisted.web import http, proxy
class ProxyClient(proxy.ProxyClient):
    """Manage returned header, content here.
    Use `self.father` methods to modify request directly.
    """
    def handleHeader(self, key, value):
        # change response header here
        log.msg("Header: %s: %s" % (key, value))
        proxy.ProxyClient.handleHeader(self, key, value)
    def handleResponsePart(self, buffer):
        # The part below does not work:
        # it looks like at this moment I do not have 'Content-Encoding' or 'Content-Type'.
        # What am I misunderstanding?
        cEncoding = self.father.getAllHeaders().get('Content-Encoding', '')
        cType = self.father.getAllHeaders().get('Content-Type', '')
        print >> sys.stderr, 'Content-Encoding', cEncoding
        print >> sys.stderr, 'Content-Type', cType
        if ('text' in cType.lower() or 'javascript' in cType.lower()) and 'gzip' in cEncoding.lower():
            buf = StringIO(buffer)
            s = gzip.GzipFile(mode="rb", fileobj=buf)
            content = s.read(len(buffer))
            # here process content as it should be gunzipped
        proxy.ProxyClient.handleResponsePart(self, buffer)
class ProxyClientFactory(proxy.ProxyClientFactory):
    protocol = ProxyClient
class ProxyRequest(proxy.ProxyRequest):
    protocols = dict(http=ProxyClientFactory)
class Proxy(proxy.Proxy):
    requestFactory = ProxyRequest
class ProxyFactory(http.HTTPFactory):
    protocol = Proxy
From my logging I have:
2013-06-11 14:07:33+0200 [ProxyClient,client] Header: Date: Tue, 11 Jun 2013 12:07:25 GMT
2013-06-11 14:07:33+0200 [ProxyClient,client] Header: Server: Apache
...
2013-06-11 14:07:33+0200 [ProxyClient,client] Header: Content-Type: text/html;charset=ISO-8859-1
...
2013-06-11 14:07:33+0200 [ProxyClient,client] Header: Content-Encoding: gzip
...
2013-06-11 14:07:33+0200 [ProxyClient,client] Header: Connection: close
So both conditions should be satisfied. What am I missing?
Also, even though I am not really interested in this second approach, which is to remove the Accept headers from the request, is it possible to do it like this?
(By the way, it looks like it does not work, or the web servers I tested do not care that we do not want to receive gzipped content.)
class ProxyRequest(proxy.ProxyRequest):
    protocols = dict(http=ProxyClientFactory)
    def process(self):
        # Remove the accept headers so that we do not announce "I'm OK with gzip-encoded content"
        # and should only receive content that is not gzipped.
        self.requestHeaders.removeHeader('accept')
        self.requestHeaders.removeHeader('accept-encoding')
You have to collect the chunks of data into a StringIO buffer in handleResponsePart, and then decode them with GzipFile in handleResponseEnd.
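The reason is that the chunks arrive at arbitrary boundaries, so a single chunk is usually not a complete gzip stream on its own. The buffering pattern can be sketched outside of Twisted like this, in Python 3 (where the buffer is io.BytesIO rather than StringIO). The class and method names are hypothetical; handle_part and handle_end only play the roles of handleResponsePart and handleResponseEnd and are not Twisted APIs.

```python
import gzip
import io


class GzipCollector:
    """Buffer body chunks and gunzip them once the response is complete."""

    def __init__(self):
        self._buffer = io.BytesIO()

    def handle_part(self, chunk):
        # Called once per chunk (the role of handleResponsePart): just store it.
        self._buffer.write(chunk)

    def handle_end(self):
        # Called once at the end (the role of handleResponseEnd): decode everything.
        self._buffer.seek(0)
        with gzip.GzipFile(mode="rb", fileobj=self._buffer) as gz:
            return gz.read()


compressed = gzip.compress(b"<html>hello proxy</html>")
collector = GzipCollector()
for start in range(0, len(compressed), 7):  # feed the data in small pieces
    collector.handle_part(compressed[start:start + 7])
print(collector.handle_end())
```

In a real proxy the decompressed content would be inspected in the end handler, and what gets forwarded to the client is a separate decision that this sketch does not cover.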

Decoding response while opening a URL

I am using the following code to open a URL and retrieve its response:
def get_issue_report(query):
    request = urllib2.Request(query)
    response = urllib2.urlopen(request)
    response_headers = response.info()
    print response.read()
The response I get is as follows:
<?xml version='1.0' encoding='UTF-8'?><entry xmlns='http://www.w3.org/2005/Atom' xmlns:gd='http://schemas.google.com/g/2005' xmlns:issues='http://schemas.google.com/projecthosting/issues/2009' gd:etag='W/"DUUFQH47eCl7ImA9WxBbFEg."'><id>http://code.google.com/feeds/issues/p/chromium/issues/full/2</id><published>2008-08-30T16:00:21.000Z</published><updated>2010-03-13T05:13:31.000Z</updated><title>Testing if chromium id works</title><content type='html'>&lt;b&gt;What steps will reproduce the problem?&lt;/b&gt;
&lt;b&gt;1.&lt;/b&gt;
&lt;b&gt;2.&lt;/b&gt;
&lt;b&gt;3.&lt;/b&gt;
&lt;b&gt;What is the expected output? What do you see instead?&lt;/b&gt;
&lt;b&gt;Please use labels and text to provide additional information.&lt;/b&gt;
</content><link rel='replies' type='application/atom+xml' href='http://code.google.com/feeds/issues/p/chromium/issues/2/comments/full'/><link rel='alternate' type='text/html' href='http://code.google.com/p/chromium/issues/detail?id=2'/><link rel='self' type='application/atom+xml' href='https://code.google.com/feeds/issues/p/chromium/issues/full/2'/><author><name>rah...#google.com</name><uri>/u/#VBJVRVdXDhZCVgJ%2FF3tbUV5SAw%3D%3D/</uri></author><issues:closedDate>2008-08-30T20:48:43.000Z</issues:closedDate><issues:id>2</issues:id><issues:label>Type-Bug</issues:label><issues:label>Priority-Medium</issues:label><issues:owner><issues:uri>/u/kuchhal#chromium.org/</issues:uri><issues:username>kuchhal#chromium.org</issues:username></issues:owner><issues:stars>4</issues:stars><issues:state>closed</issues:state><issues:status>Invalid</issues:status></entry>
I would like to get rid of the character entities like &lt;, &gt; etc. I tried using
response.read().decode('utf-8')
but this doesn't help much.
Just in case, the response.info() prints the following :
Content-Type: application/atom+xml; charset=UTF-8; type=entry
Expires: Fri, 01 Jul 2011 11:15:17 GMT
Date: Fri, 01 Jul 2011 11:15:17 GMT
Cache-Control: private, max-age=0, must-revalidate, no-transform
Vary: Accept, X-GData-Authorization, GData-Version
GData-Version: 1.0
ETag: W/"DUUFQH47eCl7ImA9WxBbFEg."
Last-Modified: Sat, 13 Mar 2010 05:13:31 GMT
X-Content-Type-Options: nosniff
X-Frame-Options: SAMEORIGIN
X-XSS-Protection: 1; mode=block
Server: GSE
Connection: close
Here's the URL : https://code.google.com/feeds/issues/p/chromium/issues/full/2
Sentinel has explained how you can decode entity references like &lt; but there's a bit more to the problem than that.
The example you give suggests that you are reading an Atom feed. If you want to do this reliably in Python, then I recommend using Mark Pilgrim's Universal Feed Parser.
Here's how one would read the feed in your example:
>>> import feedparser
>>> d = feedparser.parse('http://code.google.com/feeds/issues/p/chromium/issues/full/2')
>>> len(d.entries)
1
>>> print d.entries[0].title
Testing if chromium id works
>>> print d.entries[0].description
<b>What steps will reproduce the problem?</b>
<b>1.</b>
<b>2.</b>
<b>3.</b>
<b>What is the expected output? What do you see instead?</b>
<b>Please use labels and text to provide additional information.</b>
Using feedparser is likely to be much more reliable and convenient than trying to do your own XML parsing, entity decoding, date parsing, HTML sanitization, and so on.
from HTMLParser import HTMLParser
import urllib2
query="http://code.google.com/feeds/issues/p/chromium/issues/full/2"
def get_issue_report(query):
    request = urllib2.Request(query)
    response = urllib2.urlopen(request)
    response_headers = response.info()
    return response.read()
s = get_issue_report(query)
p = HTMLParser()
print p.unescape(s)
p.close()
Use
xml.sax.saxutils.unescape()
http://docs.python.org/library/xml.sax.utils.html#module-xml.sax.saxutils
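On Python 3, where the HTMLParser.unescape method used above is no longer available in current versions, the two approaches can be compared side by side. This is a small sketch; the sample string is made up to resemble the escaped feed content and is not the real response.

```python
import html
from xml.sax.saxutils import unescape

# Made-up sample resembling the escaped <content> element of the feed.
escaped = "&lt;b&gt;What steps will reproduce the problem?&lt;/b&gt; Q&amp;A"

print(html.unescape(escaped))  # knows all named HTML entities
print(unescape(escaped))       # handles only &lt;, &gt; and &amp; by default
```

For text that only contains the basic XML entities both calls give the same result; html.unescape is the safer choice when the content may contain HTML entities such as &nbsp;.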
