Cannot resolve encoding of filename in HTTP response headers - python

I am trying to make an HTTP request in Python using urllib.request:
import urllib.request
url = 'https://www.example.com/pdf/123'
request = urllib.request.urlopen(url)
headers = request.getheaders()
When I print the headers, the output includes a filename in Cyrillic, but with the wrong encoding:
('Content-Disposition', 'attachment; filename="Ð\x9fÑ\x80о наÑ\x83к-доÑ\x81л Ñ\x81емÑ\x96наÑ\x80.pdf"; modification-date="Thu, 12 May 2016 01:48:56 +0300"; size=57814;')
It probably has something to do with the binary encoding that is set by default, since the HTTP response is a PDF file, but I could be wrong. I also tried downloading the file via a browser, and the filename is displayed and saved correctly in Cyrillic, without mojibake: Про наук-досл семінар.pdf.
So, I guess, the "Ð\x9fÑ\x80о наÑ\x83к-доÑ\x81л Ñ\x81емÑ\x96наÑ\x80" corresponds to "Про наук-досл семінар".
How can I make Python display the filename correctly in the HTTP response headers?

Figured it out. Encoding the returned string from headers as latin-1 and then decoding it as utf-8 worked for me.
Input:
headers[6][1]
Output:
'attachment; filename="Ð\x9fÑ\x80о наÑ\x83к-доÑ\x81л Ñ\x81емÑ\x96наÑ\x80.pdf"; modification-date="Thu, 12 May 2016 01:48:56 +0300"; size=57814;'
Input:
headers[6][1].encode('latin1')
Output:
b'attachment; filename="\xd0\x9f\xd1\x80\xd0\xbe \xd0\xbd\xd0\xb0\xd1\x83\xd0\xba-\xd0\xb4\xd0\xbe\xd1\x81\xd0\xbb \xd1\x81\xd0\xb5\xd0\xbc\xd1\x96\xd0\xbd\xd0\xb0\xd1\x80.pdf"; modification-date="Thu, 12 May 2016 01:48:56 +0300"; size=57814;'
Input:
headers[6][1].encode('latin1').decode('utf-8')
Output:
'attachment; filename="Про наук-досл семінар.pdf"; modification-date="Thu, 12 May 2016 01:48:56 +0300"; size=57814;'
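Wrapped as a reusable helper, the round-trip looks like this (a minimal sketch; the helper name is mine, and it assumes the server sent UTF-8 bytes that the HTTP library decoded as Latin-1):

```python
def fix_mojibake(header_value):
    # The server sent UTF-8 bytes, but the library decoded them as
    # Latin-1. Re-encoding as Latin-1 recovers the original bytes,
    # which can then be decoded as UTF-8.
    return header_value.encode('latin-1').decode('utf-8')
```

This works because Latin-1 maps every code point below 256 to the byte of the same value, so the round-trip is lossless.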

Related

Encoding error: in MIME file data via AWS SES

I am trying to retrieve attachment data, like the file format and filename, from MIME via AWS SES. Unfortunately, sometimes the filename encoding is changed: a file named "3_amrishmishra_Entry Level Resume - 02.pdf" appears in MIME as '=?UTF-8?Q?amrishmishra=5FEntry_Level_Resume_=E2=80=93_02=2Epdf?='. Is there any way to get the exact filename?
if email_message.is_multipart():
    message = ''
    if "apply" in receiver_email.split('@')[0].split('_')[0] and isinstance(int(receiver_email.split('@')[0].split('_')[1]), int):
        for part in email_message.walk():
            content_type = str(part.get_content_type()).lower()
            content_dispo = str(part.get('Content-Disposition')).lower()
            print(content_type, content_dispo)
            if 'text/plain' in content_type and "attachment" not in content_dispo:
                message = part.get_payload()
            if content_type in ['application/pdf', 'text/plain', 'application/vnd.openxmlformats-officedocument.wordprocessingml.document', 'image/jpeg', 'image/jpg', 'image/png', 'image/gif'] and "attachment" in content_dispo:
                filename = part.get_filename()
                # open('/tmp/local' + filename, 'wb').write(part.get_payload(decode=True))
                # s3r.meta.client.upload_file('/tmp/local' + filename, bucket_to_upload, filename)
                data = {
                    'base64_resume': part.get_payload(),
                    'filename': filename,
                }
                data_list.append(data)
        try:
            api_data = {
                'email_data': email_data,
                'resumes_data': data_list
            }
            print(len(data_list))
            response = requests.post(url, data=json.dumps(api_data),
                                     headers={'content-type': 'application/json'})
            print(response.status_code, response.content)
        except Exception as e:
            print("error %s" % e)
This syntax '=?UTF-8?Q?...?=' is a MIME encoded word. It is used in MIME email when a header value includes non-ASCII characters (gory details in RFC 2047). Your attachment filename includes an "en dash" character, which is why it was sent with this encoding.
The best way to handle it depends on which Python version you're using...
Python 3
Python 3's updated email.parser package can correctly decode RFC 2047 headers for you:
# Python 3
from email import message_from_bytes, policy
raw_message_bytes = b"<< the MIME message you downloaded from SES >>"
message = message_from_bytes(raw_message_bytes, policy=policy.default)
for attachment in message.iter_attachments():
    # (EmailMessage.iter_attachments is new in Python 3)
    print(attachment.get_filename())
    # amrishmishra_Entry Level Resume – 02.pdf
You must specifically request policy.default. If you don't, the parser will use a compat32 policy that replicates Python 2.7's buggy behavior—including not decoding RFC 2047. (Also, early Python 3 releases were still shaking out bugs in the new email package, so make sure you're on Python 3.5 or later.)
Python 2
If you're on Python 2, the best option is upgrading to Python 3.5 or later, if at all possible. Python 2's email parser has many bugs and limitations that were fixed with a massive rewrite in Python 3. (And the rewrite added handy new features like iter_attachments() shown above.)
If you can't switch to Python 3, you can decode the RFC 2047 filename yourself using email.header.decode_header:
# Python 2 (also works in Python 3, but you shouldn't need it there)
from email.header import decode_header
filename = '=?UTF-8?Q?amrishmishra=5FEntry_Level_Resume_=E2=80=93_02=2Epdf?='
decode_header(filename)
# [('amrishmishra_Entry Level Resume \xe2\x80\x93 02.pdf', 'utf-8')]
(decoded_string, charset) = decode_header(filename)[0]
decoded_string.decode(charset)
# u'amrishmishra_Entry Level Resume – 02.pdf'
But again, if you're trying to parse real-world email in Python 2.7, be aware that this is probably just the first of several problems you'll encounter.
The django-anymail package I maintain includes a compatibility version of email.parser.BytesParser that tries to work around several (but not all) other bugs in Python 2.7 email parsing. You may be able to borrow that (internal) code for your purposes. (Or since you tagged your question Django, you might want to look into Anymail's normalized inbound email handling, which includes Amazon SES support.)

How to get HTML from URL that returns "junk" data?

I want to get the html source code of a given url. I had tried using this
import urllib2
url = 'http://mp3.zing.vn' # write the url here
usock = urllib2.urlopen(url)
data = usock.read()
usock.close()
print data
But the returned data is not in HTML format for some pages. I tried another link, http://phuctrancs.info, and it works (as that page is plain-HTML based). I have also tried using the BeautifulSoup library, but it didn't work either. Any suggestions?
You're getting the HTML you expect, but it's compressed. I tried this URL by hand and got back a binary mess with this in the headers:
Content-Encoding: gzip
I saved the response body to a file and was able to gunzip it on the command line. You should also be able to decompress it in your program with the functions in the standard library's zlib module.
Update for anyone having trouble with zlib.decompress...
The compressed data you will get (or at least that I got in Python 2.6) apparently has a "gzip header and trailer" like you'd expect in *.gz files, while zlib.decompress expects a "zlib wrapper"... probably. I kept getting an unhelpful zlib.error exception:
Traceback (most recent call last):
  File "./fixme.py", line 32, in <module>
    text = zlib.decompress(data)
zlib.error: Error -3 while decompressing data: incorrect header check
The solution is entirely undocumented in the Python standard library, but can be found in Greg Hewgill's answer to a question about gzip streams: You have to feed zlib.decompress a wbits argument, created by adding a magic number to an undocumented module-level constant <grumble, mutter...>:
text = zlib.decompress(data, 16 + zlib.MAX_WBITS)
If you feel this isn't obfuscated enough, note that a 32 here would be every bit as magical as the 16.
The only hint of this is buried in the original zlib's manual, under the deflateInit2 function:
windowBits can also be greater than 15 for optional gzip decoding. Add 16 to windowBits to write a simple gzip header and trailer around the compressed data instead of a zlib wrapper.
...and the inflateInit2 function:
windowBits can also be greater than 15 for optional gzip decoding. Add 32 to windowBits to enable zlib and gzip decoding with automatic header detection, or add 16 to decode only the gzip format [...]
Note that the zlib.decompress docs explicitly tell you that you can't do this:
The default value is therefore the highest value, 15.
But this is... the opposite of true.
<fume, curse, rant...>
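Putting the pieces together, here is a short sketch (the helper name is mine) of decompressing a response body based on its Content-Encoding header:

```python
import zlib

def decompress_body(data, content_encoding):
    # Decompress an HTTP response body if the server gzipped it.
    if content_encoding == 'gzip':
        # 16 + MAX_WBITS makes zlib expect a gzip header and trailer;
        # 32 + MAX_WBITS would auto-detect gzip or zlib framing.
        return zlib.decompress(data, 16 + zlib.MAX_WBITS)
    return data
```

Call it with the raw bytes and the value of the response's Content-Encoding header; anything other than 'gzip' is passed through unchanged.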
Have you looked into the response code? urllib2 may need you to handle responses such as a 301 redirect and so on. You should print the response code like:
data = usock.read()
if usock.getcode() != 200:
    print "something unexpected"
Updated:
If the response contains non-localized or unreadable text, you might need to specify the character set in the request header.
import cookielib
cookie = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))
opener.addheaders = [('Content-Type', 'text/html; charset=UTF-8')]
urllib2.install_opener(opener)
PS: untested.
Use Beautiful Soup from Python:
import requests
from bs4 import BeautifulSoup
url = 'http://www.google.com'
r = requests.get(url)
b = BeautifulSoup(r.text, 'html.parser')
b will contain all the HTML tags and also provides iterators to traverse elements/tags. To learn more, see https://pypi.python.org/pypi/beautifulsoup4/4.3.2

Python Requests URL with Unicode Parameters

I'm currently trying to hit the Google TTS URL, http://translate.google.com/translate_tts, with Japanese characters and phrases in Python using the requests library.
Here is an example:
http://translate.google.com/translate_tts?tl=ja&q=ひとつ
However, when I try to use the Python requests library to download the mp3 that the endpoint returns, the resulting mp3 is blank. I have verified that I can hit this URL in requests using non-Unicode characters (via romaji) and have gotten correct responses back.
Here is a part of the code I am using to make the request
langs = {'japanese': 'ja',
         'english': 'en'}

def get_sound_file_for_text(text, download=False, lang='japanese'):
    r = StringIO()
    glang = langs[lang]
    text = text.replace('*', '')
    text = text.replace('/', '')
    text = text.replace('x', '')
    url = 'http://translate.google.com/translate_tts'
    if download:
        result = requests.get(url, params={'tl': glang, 'q': text})
        r.write(result.content)
        r.seek(0)
        return r
    else:
        return url
Also, if I print text or url within this snippet, the kana/kanji is rendered correctly in my console.
Edit:
If I attempt to encode the unicode and quote it as such, I still get the same response.
# -*- coding: utf-8 -*-
from StringIO import StringIO
import urllib
import requests

__author__ = 'jacob'

langs = {'japanese': 'ja',
         'english': 'en'}

def get_sound_file_for_text(text, download=False, lang='japanese'):
    r = StringIO()
    glang = langs[lang]
    text = text.replace('*', '')
    text = text.replace('/', '')
    text = text.replace('x', '')
    text = urllib.quote(text.encode('utf-8'))
    url = 'http://translate.google.com/translate_tts?tl=%(glang)s&q=%(text)s' % locals()
    print url
    if download:
        result = requests.get(url)
        r.write(result.content)
        r.seek(0)
        return r
    else:
        return url
Which returns this:
http://translate.google.com/translate_tts?tl=ja&q=%E3%81%B2%E3%81%A8%E3%81%A4
Which seems like it should work, but doesn't.
Edit 2:
If I attempt to use urllib/urllib2, I get a 403 error.
Edit 3:
So, it seems that this problem/behavior is simply limited to this endpoint. If I try the following URL, a different endpoint:
http://www.kanjidamage.com/kanji/13-un-%E4%B8%8D
from within requests and my browser, I get the same response (they match). The same happens if I send plain ASCII characters to the server, like this URL:
http://translate.google.com/translate_tts?tl=ja&q=sayonara
I get the same response as well (they match again). But if I attempt to send Unicode characters to this URL, I get a correct audio file in my browser, but not from requests, which receives an audio file with no sound.
http://translate.google.com/translate_tts?tl=ja&q=%E3%81%B2%E3%81%A8%E3%81%A4
So, it seems like this behavior is limited to the Google TTS URL?
The user agent can be part of the problem; however, it is not in this case. The translate_tts service rejects (with HTTP 403) some user agents, e.g. any that begin with Python, curl, wget, and possibly others. That is why you are seeing an HTTP 403 response when using urllib2.urlopen(): it sets the user agent to Python-urllib/2.7 (the version might vary).
You found that setting the user agent to Mozilla/5.0 fixed the problem, but that might work because the API might assume a particular encoding based on the user agent.
What you actually should do is to explicitly specify the URL character encoding with the ie field. Your URL request should look like this:
http://translate.google.com/translate_tts?ie=UTF-8&tl=ja&q=%E3%81%B2%E3%81%A8%E3%81%A4
Note the ie=UTF-8, which explicitly sets the URL character encoding. The spec does state that UTF-8 is the default, but that doesn't seem to be entirely true, so you should always set ie in your requests.
The API supports kanji, hiragana, and katakana (possibly others?). These URLs all produce "nihongo", although the audio produced for hiragana input has a slightly different inflection to the others.
import requests
one = u'\u3072\u3068\u3064'
kanji = u'\u65e5\u672c\u8a9e'
hiragana = u'\u306b\u307b\u3093\u3054'
katakana = u'\u30cb\u30db\u30f3\u30b4'
url = 'http://translate.google.com/translate_tts'
for text in one, kanji, hiragana, katakana:
    r = requests.get(url, params={'ie': 'UTF-8', 'tl': 'ja', 'q': text})
    print u"{} -> {}".format(text, r.url)
    open(u'/tmp/{}.mp3'.format(text), 'wb').write(r.content)
I made this little method a while back to help me with UTF-8 encoding. I was having issues printing Cyrillic and CJK languages to CSVs, and this did the trick.
def assist(unicode_string):
    utf8 = unicode_string.encode('utf-8')
    read = utf8.decode('string_escape')  # 'string_escape' codec is Python 2 only
    return read  # UTF-8 encoded string
Also, make sure you have these two lines at the beginning of your .py.
#!/usr/bin/python
# -*- coding: utf-8 -*-
The first line is just a good Python habit; it specifies which interpreter to run the .py with (really only useful if you have more than one version of Python installed on your machine). The second line specifies the encoding of the Python file. A slightly longer answer for this is given here.
Setting the User-Agent to Mozilla/5.0 fixes this issue.
from StringIO import StringIO
import urllib
import requests

__author__ = 'jacob'

langs = {'japanese': 'ja',
         'english': 'en'}

def get_sound_file_for_text(text, download=False, lang='japanese'):
    r = StringIO()
    glang = langs[lang]
    text = text.replace('*', '')
    text = text.replace('/', '')
    text = text.replace('x', '')
    url = 'http://translate.google.com/translate_tts'
    if download:
        result = requests.get(url, params={'tl': glang, 'q': text},
                              headers={'User-Agent': 'Mozilla/5.0'})
        r.write(result.content)
        r.seek(0)
        return r
    else:
        return url

why are python double-quotes converted to hyphen in filename?

I'm generating some pdfs using ReportLab in Django. I followed and experimented with the answer given to this question, and realised that the double-quotes therein don't make sense:
response['Content-Disposition'] = 'inline; filename=constant_"%s_%s".pdf'\
% ('foo','bar')
gives filename constant_-foo_bar-.pdf
response['Content-Disposition'] = 'inline; filename=constant_%s_%s.pdf' \
% ('foo','bar')
gives filename constant_foo_bar.pdf
Why is this? Is it just to do with slug-esque sanitisation for filesystems?
It seems from the research in this question that it's actually the browser doing the encoding/escaping. I used cURL to confirm that Django itself does not escape these headers. First, I set up a minimal test view:
# views.py
def index(request):
    response = render(request, 'template.html')
    response['Content-Disposition'] = 'inline; filename=constant"a_b".html'
    return response
then ran:
carl@chaffinch:~$ HEAD http://localhost:8003
200 OK
Date: Thu, 16 Aug 2012 19:28:54 GMT
Server: WSGIServer/0.1 Python/2.7.3
Vary: Cookie
Content-Type: text/html; charset=utf-8
Client-Date: Thu, 16 Aug 2012 19:28:54 GMT
Client-Peer: 127.0.0.1:8003
Client-Response-Num: 1
Content-Disposition: inline; filename=constant"a_b".html
Check out the header: filename=constant"a_b".html. The quotes are still there!
Python does not convert double quotes to hyphens in filenames:
>>> with open('constant_"%s_%s".pdf' % ('foo', 'bar'), 'w'): pass
$ ls
...
constant_"foo_bar".pdf
...
Probably it's Django that won't allow you to use such strange names.
Anyway, I'd recommend using only the following characters in filenames, to avoid portability issues:
Letters [a-z][A-Z]
digits [0-9]
hyphen(-), underscore(_), plus(+)
Note: I've excluded whitespace from the list, because a lot of scripts don't use proper quoting and break on such filenames.
If you restrict yourself to this set of characters, you probably won't ever have problems with pathnames. Obviously, other people or other programs may not follow this "guideline", so you shouldn't assume the convention is shared by paths you obtain from users or other external sources.
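As a sketch of that guideline (the helper name and regex are mine), filenames can be normalized to the safe set like this:

```python
import re

def sanitize_filename(name):
    # Replace anything outside the conservative set of letters, digits,
    # hyphen, underscore, and plus with an underscore.
    return re.sub(r'[^A-Za-z0-9+_-]', '_', name)
```

Note this also replaces the dot before the extension; keep `.` in the character class if you want extensions preserved.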
Your usage is slightly incorrect. You would want the quotes around the entire filename in order to account for spaces, etc.
change:
response['Content-Disposition'] = 'inline; filename=constant_"%s_%s".pdf'\
% ('foo','bar')
to:
response['Content-Disposition'] = 'inline; filename="constant_%s_%s.pdf"'\
% ('foo','bar')

Django: Unicode Filenames with ASCII headers?

I have a list of strangely encoded files: 02 - Charlie, Woody and You/Study #22.mp3 which I suppose isn't so bad but there are a few particular characters which Django OR nginx seem to be snagging on.
>>> test = u'02 - Charlie, Woody and You/Study #22.mp3'
>>> test
u'02 - Charlie, Woody and You\uff0fStudy #22.mp3'
I am using nginx as a reverse proxy to connect to Django's built-in web server (still in development) and PostgreSQL for my database. My database and tables are all en_US.UTF-8, and I am using pgadmin3 to view my tables outside of Django. My issue goes a little beyond my title: first, how should I be saving possibly whacky filenames in my database? My current method is
'path': smart_unicode(path.lstrip(MUSIC_PATH)),
'filename': smart_unicode(file)
and when I pprint out the values they do show u'whateverthecrap'
I am not sure that is how I should be doing it, but assuming it is, I now have issues trying to spit out the download.
My download view looks something like this:
def song_download(request, song_id):
    song = get_object_or_404(Song, pk=song_id)
    url = u'/static_music/%s/%s' % (song.path, song.filename)
    print url
    response = HttpResponse()
    response['X-Accel-Redirect'] = url
    response['Content-Type'] = 'audio/mpeg'
    response['Content-Disposition'] = "attachment; filename=test.mp3"
    return response
and most files will download but when I get to 02 - Charlie, Woody and You/Study #22.mp3 I receive this from django: 'ascii' codec can't encode character u'\uff0f' in position 118: ordinal not in range(128), HTTP response headers must be in US-ASCII format.
How can I use an ASCII acceptable string if my filename is out of bounds? 02 - Charlie, Woody and You\uff0fStudy #22.mp3 doesn't seem to work...
EDIT 1
I am using Ubuntu for my OS.
Although / is an unusual and undesirable character, your script will break for any non-ASCII character.
response['X-Accel-Redirect'] = url
url is Unicode (and it isn't a URL, it's a filepath). Response headers are bytes. You'll need to encode it.
response['X-Accel-Redirect'] = url.encode('utf-8')
That's assuming you're running on a server with UTF-8 as the filesystem encoding.
(Now, how to encode the filename in the Content-Disposition header... that's an altogether trickier question!)
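For the record, the usual answer to that trickier question (a Python 3 sketch, not what Django shipped at the time) is the RFC 6266/5987 style: an ASCII fallback in filename, plus the real name percent-encoded as UTF-8 in filename*:

```python
from urllib.parse import quote

def content_disposition(filename):
    # ASCII fallback for old clients, plus an RFC 5987 filename*
    # parameter carrying the percent-encoded UTF-8 name.
    fallback = filename.encode('ascii', 'replace').decode('ascii')
    return "attachment; filename=\"%s\"; filename*=UTF-8''%s" % (
        fallback, quote(filename, safe=''))
```

Modern browsers prefer the filename* parameter and decode it as UTF-8; older clients fall back to the plain filename value.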