Dump JSON from string in unknown character encoding - python

I'm trying to dump HTML from websites into JSON, and I need a way to handle the different character encodings.
I've read that if it isn't UTF-8, it's probably ISO-8859-1, so what I'm doing now is:
import simplejson

json = None
for possible_encoding in ["utf-8", "ISO-8859-1"]:
    try:
        # post_dict contains, among other things, website html retrieved
        # with urllib2
        json = simplejson.dumps(post_dict, encoding=possible_encoding)
        break
    except UnicodeDecodeError:
        pass
if json is None:
    raise ValueError("no candidate encoding worked")
This will of course fail if I come across any other encodings, so I'm wondering if there is a way to solve this problem in the general case.
The reason I'm trying to serialize the HTML in the first place is because I need to send it in a POST request to our NodeJS server. So, if someone has a different solution that allows me to do that (maybe without serializing to JSON at all), I'd be happy to hear that as well.

You should know the character encoding regardless of the media type you use to send the POST request (unless you want to send binary blobs). To get the character encoding of your HTML content, see A good way to get the charset/encoding of an HTTP response in Python.
To send post_dict as JSON, make sure all strings in it are Unicode (just convert the HTML to Unicode as soon as you receive it) and don't use the encoding parameter in the json.dumps() call. That parameter won't help you anyway if different websites (where you get your HTML strings) use different encodings.
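A minimal sketch of that approach, assuming the page is fetched with urllib2 as in the question; the fallback to UTF-8 and the 'html' key are illustrative assumptions:

import urllib2
import simplejson

response = urllib2.urlopen(url)
# Read the declared charset from the Content-Type header; fall back to UTF-8.
charset = response.headers.getparam('charset') or 'utf-8'
html = response.read().decode(charset, 'replace')

post_dict['html'] = html               # every string in post_dict is now unicode
payload = simplejson.dumps(post_dict)  # no encoding= parameter needed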

Related

Given a URL, how to encode a file's contents as base64 with Python / Django?

I am building a Django-based website, and am having trouble figuring out a decent way to email some larger PDFs and such to my users.
The files in question never touch our servers; they're handled on a CDN. So, my starting point is with the unique URLs for the files, not with the files themselves. It would be nice to find a solution that doesn't involve saving the files locally.
In order for me to be able to send the email in the way I want (with the PDF/DOCX/whatever attached to it), I need to be able to encode the attachment as a base-64 string.
I would prefer not to save the file to our server; I would also prefer not to read a response object in chunks and write it plainly to a file on our server, then encode that file.
That said, given a direct URL to a file, is there a way to stream the response and encode it in base64 as it comes in?
I have been reading about Django's StreamingHttpResponse and FileWrapper and feel like I am close, but I'm not able to put it together just yet.
Edit: the snippet below is working for now, but I'm worried about memory usage - how well would something like this scale?
import base64
import requests

req = requests.get('url')
encoded = base64.b64encode(req.content)
Thanks to beetea, I am comfortable implementing the simple snippet above as the solution to this issue.
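For reference, a hedged sketch of the streaming variant the question asks about, in case memory ever does become a concern; the function name and chunk size are illustrative. It encodes in multiples of three bytes so the base64 pieces concatenate cleanly:

import base64
import requests

def fetch_as_base64(url, chunk_size=64 * 1024):
    pieces = []
    buf = b''
    resp = requests.get(url, stream=True)
    resp.raise_for_status()
    try:
        for chunk in resp.iter_content(chunk_size=chunk_size):
            buf += chunk
            usable = len(buf) - (len(buf) % 3)   # whole 3-byte groups only
            pieces.append(base64.b64encode(buf[:usable]))
            buf = buf[usable:]
    finally:
        resp.close()
    pieces.append(base64.b64encode(buf))         # trailing bytes, with padding
    return b''.join(pieces)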

How can I get XML data in Python/Django passed in POST request?

XML data is passed in a POST request as a simple string, not inside some name=value pair, with the HTTP header (optionally) set to 'Content-Type: text/xml'.
How can I get at this data in Python (with its own tools or with Django's)?
I'm not quite sure what you're asking. Do you just want to know how to access the POST data? You can get that via request.body, which will contain your XML as a string. You can then use your favourite Python XML parser on it.
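For example, a minimal sketch of such a view; the view name and the choice of ElementTree are illustrative:

import xml.etree.ElementTree as ET
from django.http import HttpResponse

def receive_xml(request):
    # request.body holds the raw POST payload as a bytestring
    root = ET.fromstring(request.body)
    return HttpResponse('received root element: %s' % root.tag)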

Torrent Tracker info hash GET Request- Python

I'm trying to connect to a torrent tracker to receive a list of peers to play BitTorrent with; however, I am having trouble forming the proper GET request.
As far as I understand, I must obtain the 20 byte SHA1 hash of the bencoded 'info' section from the .torrent file. I use the following code:
import hashlib
import bencode

h = hashlib.new('sha1')
h.update(bencode.bencode(meta_dict['info']))
info_hash = h.digest()
This is where I am stuck. I can not figure out how to create the proper url-encoded info_hash to stick into a URL string as a parameter.
I believe it involves some combination of urllib.urlencode and urllib.quote, however my attempts have not worked so far.
A bit late, but this might help someone.
The requests module URL-encodes the parameters by itself. First create a dictionary with the parameters (info_hash, peer_id, etc.); then you only have to do a GET request:
response = requests.get(tracker_url, params=params)
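A hedged sketch of the full announce request, reusing meta_dict and tracker_url from the question; the parameter names follow the BitTorrent tracker convention, and the peer_id and port values are illustrative:

import hashlib
import requests
import bencode

info_hash = hashlib.sha1(bencode.bencode(meta_dict['info'])).digest()
params = {
    'info_hash': info_hash,             # raw 20-byte digest; requests percent-encodes it
    'peer_id': '-PY0001-123456789012',  # any 20-byte client id
    'port': 6881,
    'uploaded': 0,
    'downloaded': 0,
    'left': 0,
}
response = requests.get(tracker_url, params=params)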
I think that urllib.quote_plus() is all you need.
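That would look roughly like this (Python 2 urllib; the URL assembly and variable names are illustrative):

import urllib

quoted_hash = urllib.quote_plus(info_hash)  # percent-encodes the raw digest bytes
url = '%s?info_hash=%s&peer_id=%s' % (tracker_url, quoted_hash, peer_id)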

Browser charsets order of precedence

Client browsers are sending the header HTTP_ACCEPT_CHARSET: ISO-8859-1,utf-8;q=0.7,*;q=0.3. I only serve webpages as UTF-8 with the correct header, but browsers are posting data from forms encoded with the ISO-8859-1 charset. My question is: will a browser always prefer charsets in the order of its ACCEPT_CHARSET header, so that I can reliably write a middleware that decodes any posted data with the first entry (in this case ISO-8859-1) and re-encodes it as UTF-8?
UPDATE:
I updated the form tag with accept-charset="utf-8" and I'm still seeing non-Unicode characters appearing. Is it possible that a user copy/pasting their password from somewhere else (LastPass, an Excel file) could be injecting non-Unicode characters?
The request header Accept-Charset (which may get mapped to HTTP_ACCEPT_CHARSET server-side) expresses the client’s preferences, to be used when the server is capable of serving the resource in different encodings. The server may ignore it, and often will.
If your page is UTF-8 encoded and declared as such, then any form on your page will send its data as UTF-8 encoded, unless you specify an accept-charset attribute. So if a browser posts data as ISO-8859-1 encoded, then this is a browser bug. However, this would need to be analyzed before drawing conclusions.
There’s an old technique of including some special character, written as a character reference for safety, as the value of a hidden field. The server-side handler can then pick up the value of this field and detect an encoding mismatch, or even heuristically deduce the actual encoding from the encoded form of the special character.
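A hedged sketch of that trick in Django terms; the field name _charset_probe and the choice of the euro sign are illustrative assumptions:

# In the template, the form carries a hidden probe field, written as a
# character reference so the template file's own encoding cannot interfere:
#   <input type="hidden" name="_charset_probe" value="&#8364;">
def post_was_utf8(request):
    probe = request.POST.get('_charset_probe', u'')
    # If the browser submitted UTF-8 as expected, the probe decodes back
    # to the euro sign; anything else signals an encoding mismatch.
    return probe == u'\u20ac'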
I am not sure if all browsers always prefer charset in the same specific order, but you can set the accept-charset in the form, which forces the browser to send utf-8 encoded data.
Like this:
<form accept-charset="utf-8"></form>

Ñ not displayed in Google App Engine website

I'm using Google App Engine to build a website and I'm having problems with special characters. I think I've reduced the problem to this two code samples:
request = urlfetch.fetch(
    url=self.WWW_INFO,
    payload=urllib.urlencode(inputs),
    method=urlfetch.POST,
    headers={'Content-Type': 'application/x-www-form-urlencoded'})
print request.content
The previous code displays the content just fine, showing the special characters. But, the correct way to use the framework to display something is using:
request = urlfetch.fetch(
    url=self.WWW_INFO,
    payload=urllib.urlencode(inputs),
    method=urlfetch.POST,
    headers={'Content-Type': 'application/x-www-form-urlencoded'})
self.response.out.write(request.content)
Which doesn't display the special characters, and instead just prints �. What should I do so it displays correctly?
I know I'm missing something, but I can't seem to grasp what it is. The website sets the <meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">, and I've tried with charset=UTF-8 with no success.
I'll appreciate any advice that can point me in the right direction.
You need to get the charset from the Content-Type header in the fetch's result, use it to decode the bytes into Unicode, then, on the response, set the header with your favorite encoding (I suggest UTF-8; there's no good reason to do otherwise) and emit the Unicode text encoded via that codec. The pass through Unicode is not strictly needed (when you're doing nothing at all with the contents, just bouncing them right back to the response, you might use the same content type and charset you received), but it's recommended on general grounds: use encoded byte strings only at input/output boundaries, and keep all text "within" your app as Unicode.
In other words, your problem seems to be mostly that you're not setting the headers correctly on the response.
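A minimal sketch of what that looks like in a classic webapp handler; the header parsing, the ISO-8859-1 fallback, and the 'replace' error handling are illustrative assumptions:

result = urlfetch.fetch(
    url=self.WWW_INFO,
    payload=urllib.urlencode(inputs),
    method=urlfetch.POST,
    headers={'Content-Type': 'application/x-www-form-urlencoded'})

# Pull the charset out of the response's Content-Type header.
content_type = result.headers.get('content-type', '')
charset = 'ISO-8859-1'                    # fallback if none is declared
if 'charset=' in content_type:
    charset = content_type.split('charset=')[-1].strip()

text = result.content.decode(charset, 'replace')   # bytes -> unicode

self.response.headers['Content-Type'] = 'text/html; charset=utf-8'
self.response.out.write(text.encode('utf-8'))      # unicode -> UTF-8 bytes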
