Browser charsets order of precedence - python

Client browsers are sending the header HTTP_ACCEPT_CHARSET: ISO-8859-1,utf-8;q=0.7,*;q=0.3. I only serve webpages as UTF-8 with the correct header, but browsers are posting form data encoded with the ISO-8859-1 charset. My question is: will a browser always prefer charsets in the order given in its Accept-Charset header, so that I can reliably write middleware that decodes any posted data with the first entry (in this case ISO-8859-1) and re-encodes it as UTF-8?
UPDATE:
I updated the form tag with accept-charset="utf-8" and I'm still seeing non-UTF-8 characters appearing. Is it possible that a user copy/pasting their password from somewhere else (LastPass, an Excel file) could be injecting these characters?

The request header Accept-Charset (which may get mapped to HTTP_ACCEPT_CHARSET server-side) expresses the client's preferences, to be used when the server is capable of serving the resource in different encodings. The server may ignore it, and often will.
If your page is UTF-8 encoded and declared as such, then any form on your page will send its data UTF-8 encoded, unless you specify an accept-charset attribute. So if a browser posts data as ISO-8859-1 encoded, that is a browser bug. However, this would need to be analyzed before drawing conclusions.
There's an old technique of including some special character, written as a character reference for safety, as the value of a hidden field. The server-side handler can then pick up the value of this field and detect an encoding mismatch, or even heuristically deduce the actual encoding from the encoded form of the special character.
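A minimal sketch of that technique, assuming a probe field named _charset_probe and é as the special character (both are illustrative choices, not part of any standard):

PROBE = u"\u00e9"  # é: one byte (0xE9) in ISO-8859-1, two bytes (0xC3 0xA9) in UTF-8

# The form carries the probe as a character reference, so the page source
# itself stays pure ASCII:
#   <input type="hidden" name="_charset_probe" value="&#233;">

def detect_form_encoding(raw_probe_bytes):
    """Guess which encoding the browser used for the POST body."""
    for encoding in ("utf-8", "iso-8859-1"):
        try:
            if raw_probe_bytes.decode(encoding) == PROBE:
                return encoding
        except UnicodeDecodeError:
            pass
    return None  # neither candidate reproduces the probe

# detect_form_encoding(b"\xc3\xa9") -> "utf-8"
# detect_form_encoding(b"\xe9")     -> "iso-8859-1"

Trying UTF-8 first matters: every byte sequence decodes without error as ISO-8859-1, so that check is only meaningful as a fallback.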

I am not sure whether all browsers always prefer charsets in the same specific order, but you can set accept-charset on the form, which forces the browser to send UTF-8 encoded data.
Like this:
<form accept-charset="utf-8"></form>


Obtaining metadata from magnet link infohash

I am learning about bittorrent protocols and have a question I'm not too sure about.
According to BEP009,
magnet URI format
The magnet URI format is:
v1: magnet:?xt=urn:btih:info-hash&dn=name&tr=tracker-url
v2: magnet:?xt=urn:btmh:tagged-info-hash&dn=name&tr=tracker-url
info-hash Is the info-hash hex encoded, for a total of 40 characters. For compatibility with existing links in the wild, clients should also support the 32 character base32 encoded info-hash.
tagged-info-hash Is the multihash formatted, hex encoded full infohash for torrents in the new metadata format. 'btmh' and 'btih' exact topics may exist in the same magnet if they describe the same hybrid torrent.
example magnet link: magnet:?xt=urn:btih:407AEA6F3D7DC846879449B24CA3F57DB280DE5C&dn=ubuntu-educationpack_14+04_all&tr=udp%3A%2F%2Ftracker.opentrackr.org%3A1337%2Fannounce&tr=udp%3A%2F%2Fexplodie.org%3A6969
Correct me if I'm wrong, but urn:btih:407AEA6F3D7DC846879449B24CA3F57DB280DE5C is the info-hash from the magnet link, and I will need to decode it to be able to obtain bencoded metadata such as listed in BEP015: things such as downloaded, left, uploaded, event, etc.
My question is, how do I decode this in python?
The info-hash in a magnet link is the same as the info-hash required by a UDP tracker (the 20-byte SHA-1 hash of the bencoded "info" dictionary of a torrent).
Additionally, a UDP tracker doesn't use bencoded data at all, just raw bytes!
The bencoded format is used by HTTP/HTTPS trackers, though.
You can look at some open source code like libtorrent. It's written in C++, so you need to read the bdecode and bencode parts. That part is not complex, and then you can write the Python code yourself.
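For example, a minimal bdecode sketch (my own illustration, not libtorrent's code), covering the four bencode types: integers, byte strings, lists and dictionaries:

def bdecode(data, i=0):
    """Decode the bencoded value starting at data[i]; return (value, next_i)."""
    c = data[i:i + 1]
    if c == b"i":                      # integer: i<digits>e
        end = data.index(b"e", i)
        return int(data[i + 1:end]), end + 1
    if c == b"l":                      # list: l<items>e
        i += 1
        items = []
        while data[i:i + 1] != b"e":
            item, i = bdecode(data, i)
            items.append(item)
        return items, i + 1
    if c == b"d":                      # dictionary: d<key><value>...e
        i += 1
        result = {}
        while data[i:i + 1] != b"e":
            key, i = bdecode(data, i)
            value, i = bdecode(data, i)
            result[key] = value
        return result, i + 1
    colon = data.index(b":", i)        # byte string: <length>:<bytes>
    start = colon + 1
    length = int(data[i:colon])
    return data[start:start + length], start + length

value, _ = bdecode(b"d3:foo3:bar4:spaml1:a1:bee")
# value == {b'foo': b'bar', b'spam': [b'a', b'b']}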
Correct me if I'm wrong, but urn:btih:407AEA6F3D7DC846879449B24CA3F57DB280DE5C is the info-hash from the magnet link, and I will need to decode it to be able to obtain bencoded metadata such as listed in BEP015: things such as downloaded, left, uploaded, event, etc.
Infohash is a unique SHA-1 hash that identifies a torrent. Therefore it cannot be decoded to obtain any further information; it's just an identifier. Furthermore, if you think about it, the link would constantly need to change if it contained that information.
You must use this infohash in the announce request to a tracker. The purpose of the announce request is to let the tracker know that you are downloading the particular hash, how far along you are and to provide you with peers the tracker knows about.
In your example there are two UDP trackers:
tr=udp%3A%2F%2Ftracker.opentrackr.org%3A1337%2Fannounce&tr=udp%3A%2F%2Fexplodie.org%3A6969
After URL decoding these, they become:
tr=udp://tracker.opentrackr.org:1337/announce&tr=udp://explodie.org:6969
So, these are the trackers you must send your announce request to, by implementing the protocol described at https://libtorrent.org/udp_tracker_protocol.html
Note that this does not give you any information about the torrent file; for that you need to implement BEP-9.
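To make the flow above concrete, here is a hedged sketch of the BEP 15 connect/announce handshake against the first tracker from the magnet link (the peer_id prefix, listening port and timeout are illustrative values, and real code would need retries and error handling):

import random
import socket
import struct

info_hash = bytes.fromhex("407AEA6F3D7DC846879449B24CA3F57DB280DE5C")
peer_id = b"-XX0001-" + bytes(random.randrange(256) for _ in range(12))

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.settimeout(5)
tracker = ("tracker.opentrackr.org", 1337)

# 1. Connect request: magic protocol id, action 0, random transaction id.
tid = random.getrandbits(32)
sock.sendto(struct.pack(">QLL", 0x41727101980, 0, tid), tracker)
action, rtid, connection_id = struct.unpack(">LLQ", sock.recv(16))
assert action == 0 and rtid == tid

# 2. Announce request: action 1, carrying exactly the BEP 15 fields the
#    question lists (downloaded, left, uploaded, event, ...).
tid = random.getrandbits(32)
packet = struct.pack(
    ">QLL20s20sQQQLLLlH",
    connection_id, 1, tid,
    info_hash, peer_id,
    0,                       # downloaded
    0,                       # left
    0,                       # uploaded
    0,                       # event: 0 = none
    0,                       # IP: 0 = use sender address
    random.getrandbits(32),  # key
    -1,                      # num_want: -1 = default
    6881)                    # our listening port

sock.sendto(packet, tracker)
data = sock.recv(4096)
action, rtid, interval, leechers, seeders = struct.unpack(">LLLLL", data[:20])
peers = [(socket.inet_ntoa(data[i:i + 4]),
          struct.unpack(">H", data[i + 4:i + 6])[0])
         for i in range(20, len(data), 6)]
print(interval, seeders, leechers, peers[:5])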

Python | Http - can't get the correct mime type

I am building a web crawler using urllib3. Example code:
from urllib3 import PoolManager
pool = PoolManager()
response = pool.request("GET", url)
mime_type = response.getheader("content-type")
I have stumbled upon a few links to document files such as docx and epub, and the MIME type I'm getting from the server is text/plain. It is important to me to get the correct MIME type.
Example to a problematic url:
http://lsa.mcgill.ca/pubdocs/files/advancedcommonlawobligations/523-gold_advancedcommonlawobligations_-2013.docx
Right now my logic for getting a file's MIME type is to take it from the server, and if that's not available, to fall back on the file's extension.
How come Firefox is not confused by these kinds of URLs and lets the user download the file right away? How does it know that this file is not plain text? How can I get the correct MIME type?
I haven't read the Firefox source code, but I would guess that Firefox either tries to guess the filetype based on the URL, or refuses to render it inline if it's a specific Content-Type and larger than some maximum size, or perhaps it even inspects some of the file contents to figure out what it is based on a magic number at the start.
You can use the Python mimetypes module in the standard library to guess what the filetype is based on the URL:
import mimetypes
url = "http://lsa.mcgill.ca/pubdocs/files/advancedcommonlawobligations/523-gold_advancedcommonlawobligations_-2013.docx"
type, encoding = mimetypes.guess_type(url)
In this case, type is "application/vnd.openxmlformats-officedocument.wordprocessingml.document" which is probably what you want.
Unfortunately, text/plain is the right MIME type for your response, as stated here.
For text documents without specific subtype, text/plain should be used.
I tested your URL in Chrome and the behaviour you described for Firefox happened as well: Chrome downloaded the file instead of opening it, even with the Content-Type header being text/plain.
This means that those browsers use more than just this header to decide whether to download or open the file, which might include their own ability to parse it.
That said, you can't rely on the Content-Type header if you want to determine the real MIME type of whatever comes back in the response. One alternative is to temporarily store the response body and determine its MIME type afterwards.
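One way to do that without writing anything to disk, sketched here as an assumption rather than what the browsers actually do: sniff the leading bytes of the body. DOCX and EPUB files are ZIP containers, so they start with the PK\x03\x04 signature regardless of what the server claims:

from urllib3 import PoolManager

ZIP_MAGIC = b"PK\x03\x04"
url = "http://lsa.mcgill.ca/pubdocs/files/advancedcommonlawobligations/523-gold_advancedcommonlawobligations_-2013.docx"

pool = PoolManager()
# preload_content=False lets us read just the first few bytes
response = pool.request("GET", url, preload_content=False)
head = response.read(4)

if head.startswith(ZIP_MAGIC):
    # ZIP-based container (docx/xlsx/epub/...); fall back to
    # mimetypes.guess_type(url) to narrow down which one it is.
    print("ZIP container; don't trust the Content-Type header")
else:
    print(response.getheader("content-type"))
response.release_conn()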

Given a URL, how to encode a file's contents as base64 with Python / Django?

I am building a Django-based website, and am having trouble figuring out a decent way to email some larger PDFs and such to my users.
The files in question never touch our servers; they're handled on a CDN. So, my starting point is with the unique URLs for the files, not with the files themselves. It would be nice to find a solution that doesn't involve saving the files locally.
In order for me to be able to send the email in the way I want (with the PDF/DOCX/whatever attached to it), I need to be able to encode the attachment as a base-64 string.
I would prefer not to save the file to our server; I would also prefer not to read a response object in chunks and write it plainly to a file on our server, then encode that file.
That said, given a direct url to a file is there a way to stream the response and encode it in base64 as it comes in?
I have been reading about Django's StreamingHttpResponse and FileWrapper and feel like I am close, but I'm not able to put it together just yet.
Edit: the snippet below is working for now, but I'm worried about memory usage - how well would something like this scale?
import base64
import requests
req = requests.get('url')
encoded = base64.b64encode(req.content)
Thanks to beetea, I am comfortable implementing the simple:
import base64
import requests
req = requests.get('url')
encoded = base64.b64encode(req.content)
As the solution to this issue.
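For what it's worth, req.content buffers the whole body, so memory grows linearly with file size. If that ever becomes a problem, here is a sketch of a streaming variant (the helper name and chunk size are my own choices): reading in chunks and encoding whole 3-byte groups means no padding appears mid-stream, so the concatenated pieces form one valid base64 string.

import base64
import requests

def b64encode_streaming(url, chunk_size=3 * 64 * 1024):
    """Yield base64 pieces of the response body without holding it all."""
    pending = b""
    with requests.get(url, stream=True) as response:
        for chunk in response.iter_content(chunk_size=chunk_size):
            pending += chunk
            # Encode only whole 3-byte groups; carry the remainder forward.
            usable = len(pending) - len(pending) % 3
            yield base64.b64encode(pending[:usable])
            pending = pending[usable:]
    yield base64.b64encode(pending)  # final group, possibly padded

encoded = b"".join(b64encode_streaming('url'))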

Dump JSON from string in unknown character encoding

I'm trying to dump HTML from websites into JSON, and I need a way to handle the different character encodings.
I've read that if it isn't utf-8, it's probably ISO-8859-1, so what I'm doing now is:
import simplejson

json = None
for possible_encoding in ["utf-8", "ISO-8859-1"]:
    try:
        # post_dict contains, among other things, website html retrieved
        # with urllib2
        json = simplejson.dumps(post_dict, encoding=possible_encoding)
        break
    except UnicodeDecodeError:
        pass
if json is None:
    raise ValueError("no candidate encoding matched")
This will of course fail if I come across any other encodings, so I'm wondering if there is a way to solve this problem in the general case.
The reason I'm trying to serialize the HTML in the first place is because I need to send it in a POST request to our NodeJS server. So, if someone has a different solution that allows me to do that (maybe without serializing to JSON at all), I'd be happy to hear that as well.
You should know the character encoding regardless of the media type you use to send the POST request (unless you want to send binary blobs). To get the character encoding of your HTML content, see A good way to get the charset/encoding of an HTTP response in Python.
To send post_dict as JSON, make sure all strings in it are Unicode (just convert the HTML to Unicode as soon as you receive it) and don't use the encoding parameter in the json.dumps() call. The parameter won't help you anyway if different websites (where you get your HTML strings) use different encodings.
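A sketch of that flow, staying with the question's urllib2/simplejson (Python 2); the example URL and the ISO-8859-1 fallback are assumptions:

import urllib2
import simplejson

url = 'http://example.com/'  # hypothetical source page
response = urllib2.urlopen(url)
raw_html = response.read()

# Take the charset from the Content-Type header; fall back to ISO-8859-1,
# the historical HTTP default for text/* bodies.
charset = response.headers.getparam('charset') or 'ISO-8859-1'
html = raw_html.decode(charset, 'replace')  # html is now unicode

post_dict = {'html': html}
json = simplejson.dumps(post_dict)  # no encoding= needed: input is unicode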

Ñ not displayed in Google App Engine website

I'm using Google App Engine to build a website and I'm having problems with special characters. I think I've reduced the problem to these two code samples:
request = urlfetch.fetch(
    url=self.WWW_INFO,
    payload=urllib.urlencode(inputs),
    method=urlfetch.POST,
    headers={'Content-Type': 'application/x-www-form-urlencoded'})
print request.content
The previous code displays the content just fine, showing the special characters. But, the correct way to use the framework to display something is using:
request = urlfetch.fetch(
    url=self.WWW_INFO,
    payload=urllib.urlencode(inputs),
    method=urlfetch.POST,
    headers={'Content-Type': 'application/x-www-form-urlencoded'})
self.response.out.write(request.content)
This doesn't display the special characters, and instead just prints �. What should I do so it displays correctly?
I know I'm missing something, but I can't seem to grasp what it is. The website sets the <meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">, and I've tried with charset=UTF-8 with no success.
I'll appreciate any advice that can point me in the right direction.
You need to get the charset from the content-type header in the fetch's result, use it to decode the bytes into Unicode, then set the header on your response with your favorite encoding (I suggest utf-8; there's no good reason to do otherwise) and write out the Unicode text encoded via that codec. The pass through Unicode is not strictly needed (when you're doing nothing at all with the contents, just bouncing them right back to the response, you could use the same content-type and charset you received), but it's recommended on general grounds: use encoded byte strings only at input/output boundaries, and keep all text "within" your app as Unicode.
IOW, your problem seems to be mostly that you're not setting headers correctly on the response.
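A sketch of that flow in the handler (the cgi.parse_header call and the ISO-8859-1 fallback are my assumptions, not something the answer prescribes):

import cgi

result = urlfetch.fetch(
    url=self.WWW_INFO,
    payload=urllib.urlencode(inputs),
    method=urlfetch.POST,
    headers={'Content-Type': 'application/x-www-form-urlencoded'})

# 1. Decode the upstream bytes with the charset the server declared.
_, params = cgi.parse_header(result.headers.get('content-type', ''))
charset = params.get('charset', 'ISO-8859-1')
text = result.content.decode(charset)  # unicode from here on

# 2. Emit UTF-8 and declare it in our own response headers.
self.response.headers['Content-Type'] = 'text/html; charset=utf-8'
self.response.out.write(text.encode('utf-8'))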
