I already decoded a lot of email attachments filenames in my code.
But this particular filename breaks my code.
Here is a minimal example:
from email.header import decode_header
encoded_filename='=?UTF-8?B?U2FsZXNJbnZvaWNl?==?UTF-8?B?LVJlcG9ydC5wZGY=?='
decoded_header=decode_header(encoded_filename) # --> [('SalesInvoiceQ1|\x04\xb5I\x95\xc1\xbd\xc9\xd0\xb9\xc1\x91\x98', 'utf-8')]
filename=str(decoded_header[0][0]).decode(decoded_header[0][1])
Exception:
UnicodeDecodeError: 'utf8' codec can't decode byte 0xb5 in position 16: invalid start byte
Don't ask my how, but Thunderbird is able to decode this filename to: SalesInvoice-Report.pdf
How can I decode this with python like email clients apparently are able to?
There are two Encoded-Word sections in that header. You'd have to detect where one ends and one begins:
>>> print decode_header(encoded_filename[:28])[0]
('SalesInvoice', 'utf-8')
>>> print decode_header(encoded_filename[28:])[0]
('-Report.pdf', 'utf-8')
Apparently that's what Thunderbird does in this case; split the string into =?encoding?data?= chunks. Normally these should be separated by \r\n (CARRIAGE RETURN + LINE FEED) characters, but in your case they are mashed up together. If you re-introduce the \r\n separator the value decodes correctly:
>>> decode_header(encoded_filename[:28] + '\r\n' + encoded_filename[28:])[0]
('SalesInvoice-Report.pdf', 'utf-8')
You could use a regular expression to extract the parts and re-introduce the separator:
import re
from email.header import decode_header
quopri_entry = re.compile(r'=\?[\w-]+\?[QB]\?[^?]+?\?=')
def decode_multiple(encoded, _pattern=quopri_entry):
fixed = '\r\n'.join(_pattern.findall(encoded))
output = [b.decode(c) for b, c in decode_header(fixed)]
return ''.join(output)
Demo:
>>> encoded_filename = '=?UTF-8?B?U2FsZXNJbnZvaWNl?==?UTF-8?B?LVJlcG9ydC5wZGY=?='
>>> decode_multiple(encoded_filename)
u'SalesInvoice-Report.pdf'
Of course, it could be that you have a bug in how you read the header in the first place. Make sure you don't accidentally destroy an existing \r\n separator when extracting the encoded_filename value.
Related
Ia have the following data container which is constantly being updated:
data = []
for val, track_id in zip(values,list(track_ids)):
#below
if val < threshold:
#structure data as dictionary
pre_data = {"artist": sp.track(track_id)['artists'][0]['name'], "track":sp.track(track_id)['name'], "feature": filter_name, "value": val}
data.append(pre_data)
#write to file
with open('db/json/' + user + '_' + product + '_' + filter_name + '.json', 'w') as f:
json.dump(data,f, ensure_ascii=False, indent=4, sort_keys=True)
but I am getting a lot of errors like this:
json.dump(data,f, ensure_ascii=False, indent=4, sort_keys=True)
File"/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/__init__.py", line 190, in dump
fp.write(chunk)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 6: ordinal not in range(128)
Is there a way I can get rid of this encoding problem once and for all?
I was told that this would do it:
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
but many people do not recommend it.
I use python 2.7.10
any clues?
When you write to a file that was opened in text mode, Python encodes the string for you. The default encoding is ascii, which generates the error you see; there are a lot of characters that can't be encoded to ASCII.
The solution is to open the file in a different encoding. In Python 2 you must use the codecs module, in Python 3 you can add the encoding= parameter directly to open. utf-8 is a popular choice since it can handle all of the Unicode characters, and for JSON specifically it's the standard; see https://en.wikipedia.org/wiki/JSON#Data_portability_issues.
import codecs
with codecs.open('db/json/' + user + '_' + product + '_' + filter_name + '.json', 'w', encoding='utf-8') as f:
Your object has unicode strings and python 2.x's support for unicode can be a bit spotty. First, lets make a short example that demonstrates the problem:
>>> obj = {"artist":u"Björk"}
>>> import json
>>> with open('deleteme', 'w') as f:
... json.dump(obj, f, ensure_ascii=False)
...
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "/usr/lib/python2.7/json/__init__.py", line 190, in dump
fp.write(chunk)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 3: ordinal not in range(128)
From the json.dump help text:
If ``ensure_ascii`` is true (the default), all non-ASCII characters in the
output are escaped with ``\uXXXX`` sequences, and the result is a ``str``
instance consisting of ASCII characters only. If ``ensure_ascii`` is
``False``, some chunks written to ``fp`` may be ``unicode`` instances.
This usually happens because the input contains unicode strings or the
``encoding`` parameter is used. Unless ``fp.write()`` explicitly
understands ``unicode`` (as in ``codecs.getwriter``) this is likely to
cause an error.
Ah! There is the solution. Either use the default ensure_ascii=True and get ascii escaped unicode characters or use the codecs module to open the file with the encoding you want. This works:
>>> import codecs
>>> with codecs.open('deleteme', 'w', encoding='utf-8') as f:
... json.dump(obj, f, ensure_ascii=False)
...
>>>
Why not encode the specific string instead? try, the .encode('utf-8') method on the string that is raising the exception.
just trying to load this JSON file(with non-ascii characters) as a python dictionary with Unicode encoding but still getting this error:
return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 90: ordinal not in range(128)
JSON file content = "tooltip":{
"dxPivotGrid-sortRowBySummary": "Sort\"{0}\"byThisRow",}
import sys
import json
data = []
with open('/Users/myvb/Desktop/Automation/pt-PT.json') as f:
for line in f:
data.append(json.loads(line.encode('utf-8','replace')))
You have several problems as near as I can tell. First, is the file encoding. When you open a file without specifying an encoding, the file is opened with whatever sys.getfilesystemencoding() is. Since that may vary (especially on Windows machines) its a good idea to explicitly use encoding="utf-8" for most json files. Because of your error message, I suspect that the file was opened with an ascii encoding.
Next, the file is decoded from utf-8 into python strings as it is read by the file system object. The utf-8 line has already been decoded to a string and is already ready for json to read. When you do line.encode('utf-8','replace'), you encode the line back into a bytes object which the json loads (that is, "load string") can't handle.
Finally, "tooltip":{ "navbar":"Operações de grupo"} isn't valid json, but it does look like one line of a pretty-printed json file containing a single json object. My guess is that you should read the entire file as 1 json object.
Putting it all together you get:
import json
with open('/Users/myvb/Desktop/Automation/pt-PT.json', encoding="utf-8") as f:
data = json.load(f)
From its name, its possible that this file is encoded as a Windows Portugese code page. If so, the "cp860" encoding may work better.
I had the same problem, what worked for me was creating a regular expression, and parsing every line from the json file:
REGEXP = '[^A-Za-z0-9\'\:\.\;\-\?\!]+'
new_file_line = re.sub(REGEXP, ' ', old_file_line).strip()
Having a file with content similar to yours I can read the file in one simple shot:
>>> import json
>>> fname = "data.json"
>>> with open(fname) as f:
... data = json.load(f)
...
>>> data
{'tooltip': {'navbar': 'Operações de grupo'}}
You don't need to read each line. You have two options:
import sys
import json
data = []
with open('/Users/myvb/Desktop/Automation/pt-PT.json') as f:
data.append(json.load(f))
Or, you can load all lines and pass them to the json module:
import sys
import json
data = []
with open('/Users/myvb/Desktop/Automation/pt-PT.json') as f:
data.append(json.loads(''.join(f.readlines())))
Obviously, the first suggestion is the best.
The following statement to fill a list from a file :
action = []
with open (os.getcwd() + "/files/" + "actions.txt") as temp:
action = list (temp)
gives me the following error:
(result, consumed) = self._buffer_decode (data, self.errors, end)
UnicodeDecodeError: 'utf-8' codec can not decode byte 0xf1 in position 67: invalid continuation byte
if I add errors = 'ignore':
action = []
with open (os.getcwd () + "/ files /" + "actions.txt", errors = 'ignore') as temp:
action = list (temp)
Is read the file but not the ñ and vowels accented á-é-í-ó-ú being that python 3 works, as I have understood, default to 'utf-8'
I'm looking for a solution for two or more days, and I'm getting more confused.
In advance thank you very much for any suggestions.
You should use codecs to open the file with the correct encoding.
import codecs
with codecs.open(os.getcwd () + "/ files /" + "actions.txt", "r", encoding="utf8") as temp:
action = list(temp)
See the codecs docs
As #Bogdan pointed out, you're likely not dealing with utf-8 data. You can leverage a module like chardet to try to determine the encoding. If you're on a unix-y environment, you can also try running the file command on it to guess the encoding.
Using your error message character:
>>> import chardet
>>> sample_string = '\xf1'
>>> chardet.detect(sample_string)
{'confidence': 0.5, 'encoding': 'windows-1252'}
I have a list of html pages which may contain certain encoded characters. Some examples are as below -
<a href="mailto:lad%20at%20maestro%20dot%20com">
<em>ada#graphics.maestro.com</em>
<em>mel#graphics.maestro.com</em>
I would like to decode (escape, I'm unsure of the current terminology) these strings to -
<a href="mailto:lad at maestro dot com">
<em>ada#graphics.maestro.com</em>
<em>mel#graphics.maestro.com</em>
Note, the HTML pages are in a string format. Also, I DO NOT want to use any external library like a BeautifulSoup or lxml, only native python libraries are ok.
Edit -
The below solution isn't perfect. HTML Parser unescaping with urllib2 throws a
UnicodeDecodeError: 'ascii' codec can't decode byte 0x94 in position 31: ordinal not in range(128)
error in some cases.
You need to unescape HTML entities, and URL-unquote.
The standard library has HTMLParser and urllib2 to help with those tasks.
import HTMLParser, urllib2
markup = '''<a href="mailto:lad%20at%20maestro%20dot%20com">
<em>ada#graphics.maestro.com</em>
<em>mel#graphics.maestro.com</em>'''
result = HTMLParser.HTMLParser().unescape(urllib2.unquote(markup))
for line in result.split("\n"):
print(line)
Result:
<a href="mailto:lad at maestro dot com">
<em>ada#graphics.maestro.com</em>
<em>mel#graphics.maestro.com</em>
Edit:
If your pages can contain non-ASCII characters, you'll need to take care to decode on input and encode on output.
The sample file you uploaded has charset set to cp-1252, so let's try decoding from that to Unicode:
import codecs
with codecs.open(filename, encoding="cp1252") as fin:
decoded = fin.read()
result = HTMLParser.HTMLParser().unescape(urllib2.unquote(decoded))
with codecs.open('/output/file.html', 'w', encoding='cp1252') as fou:
fou.write(result)
Edit2:
If you don't care about the non-ASCII characters you can simplify a bit:
with open(filename) as fin:
decoded = fin.read().decode('ascii','ignore')
...
So I ran into a problem this afternoon, I was able to solve it, but I don't quite understand why it worked.
this is related to a problem I had the other week: python check if utf-8 string is uppercase
basically, the following will not work:
#!/usr/bin/python
import codecs
from lxml import etree
outFile = codecs.open('test.xml', 'w', 'utf-8') #cannot use codecs.open()
root = etree.Element('root')
sect = etree.SubElement(root,'sect')
words = ( u'\u041c\u041e\u0421\u041a\u0412\u0410', # capital of Russia, all uppercase
u'R\xc9SUM\xc9', # RESUME with accents
u'R\xe9sum\xe9', # Resume with accents
u'R\xe9SUM\xe9', ) # ReSUMe with accents
for word in words:
print word
if word.encode('utf8').decode('utf8').isupper(): #.isupper won't function on utf8
title = etree.SubElement(sect,'title')
title.text = word
else:
item = etree.SubElement(sect,'item')
item.text = word
print>>outFile,etree.tostring(root,pretty_print=True,xml_declaration=True,encoding='utf-8')
it fails with the following:
Traceback (most recent call last):
File "./temp.py", line 25, in
print >>outFile,etree.tostring(root,pretty_print=True,xml_declaration=True,encoding='utf-8')
File "/usr/lib/python2.7/codecs.py",
line 691, in write
return self.writer.write(data) File "/usr/lib/python2.7/codecs.py",
line 351, in write
data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec
can't decode byte 0xd0 in position 66:
ordinal not in range(128)
but if I open the new file without codecs.open('test.xml', 'w', 'utf-8') and instead use
outFile = open('test.xml', 'w') it works perfectly.
So whats happening??
since encoding='utf-8' is specified in etree.tostring() is it encoding the file again?
if I leave codecs.open() and remove encoding='utf-8' the file then becomes an ascii file. Why? becuase etree.tostring() has a default encoding of ascii I persume?
but etree.tostring() is simply being written to stdout, and is then redirect to a file that was created as a utf-8 file??
is print>> not workings as I expect? outFile.write(etree.tostring()) behaves the same way.
Basically, why wouldn't this work? what is going on here. It might be trivial, but I am obviously a bit confused and have a desire to figure out why my solution works,
You've opened the file with UTF-8 encoding, which means that it expects Unicode strings.
tostring is encoding to UTF-8 (in the form of bytestrings, str), which you're writing to the file.
Because the file is expecting Unicode, it's decoding the bytestrings to Unicode using the default ASCII encoding so that it can then encode the Unicode to UTF-8.
Unfortunately, the bytestrings aren't ASCII.
EDIT: The best advice to avoid this kind of problem is to use Unicode internally, decoding on input and encoding on output.
Using print>>outFile is a little strange. I don't have lxml installed, but the built-in xml.etree library is similar (but doesn't support pretty_print). Wrap the root Element in an ElementTree and use the write method.
Also, if you using a # coding line to declare the encoding of the source file, you can use readable Unicode strings instead of escape codes:
#!/usr/bin/python
# coding: utf8
import codecs
from xml.etree import ElementTree as etree
root = etree.Element(u'root')
sect = etree.SubElement(root,u'sect')
words = [u'МОСКВА',u'RÉSUMÉ',u'Résumé',u'RéSUMé']
for word in words:
print word
if word.isupper():
title = etree.SubElement(sect,u'title')
title.text = word
else:
item = etree.SubElement(sect,u'item')
item.text = word
tree = etree.ElementTree(root)
tree.write('text.xml',xml_declaration=True,encoding='utf-8')
In addition to MRABs answer some lines of code:
import codecs
from lxml import etree
root = etree.Element('root')
sect = etree.SubElement(root,'sect')
# do some other xml building here
with codecs.open('test.xml', 'w', encoding='utf-8') as f:
f.write(etree.tostring(root, encoding=unicode))