Some readable content, but impossible to JSON dump to file - python

This text file (30 bytes only; the content is '(Ne pas r\xe9pondre a ce message)') can be opened and inserted into a dict successfully:
import json

d = {}
with open('temp.txt', 'r') as f:
    d['blah'] = f.read()

with open('test.txt', 'w') as f:
    data = json.dumps(d)
    f.write(data)
But it is impossible to dump the dict into a JSON file (see traceback below). Why?
I tried lots of solutions from various SO questions. The closest I could get was this answer. Using it, I can dump to a file, but then the JSON file looks like this:
# test.txt
{"blah": "(Ne pas r\u00e9pondre a ce message)"}
instead of
# test.txt
{"blah": "(Ne pas répondre a ce message)"}
Traceback:
File "C:\Python27\lib\json\encoder.py", line 270, in iterencode
return _iterencode(o, 0)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe9 in position 9: invalid continuation byte
[Finished in 0.1s with exit code 1]

Your file is not UTF-8 encoded. It uses a Latin codec, like ISO-8859-1 or Windows Codepage 1252. Reading the file gives you the encoded text.
JSON however requires Unicode text. Python 2 has a separate Unicode type, and byte strings (type str) need to be decoded using a suitable codec. The json.dumps() function uses UTF-8 by default when it needs to decode byte strings; UTF-8 is a widely used codec that can handle all Unicode codepoints, and it is also the default codec for JSON strings (JSON requires documents to be encoded in one of the 3 UTF codecs).
You need to either decode the string manually or tell json.dumps() what codec to use for the byte string:
data = json.dumps(d, encoding='latin1') # applies to all bytestrings in d
or
d['blah'] = d['blah'].decode('latin1')
data = json.dumps(d)
or use io.open() to decode as you read:
import io

with io.open('temp.txt', 'r', encoding='latin1') as f:
    d['blah'] = f.read()
By default, the json library produces ASCII-safe JSON output by using the \uhhhh escape syntax the JSON standard allows for. This is entirely normal; the output is valid JSON and readable by any compliant JSON decoder.
If you must produce UTF-8 encoded output without the \uhhhh escape sequences, see Saving utf-8 texts in json.dumps as UTF8, not as \u escape sequence
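In short, that comes down to passing ensure_ascii=False and encoding the result yourself when writing; a minimal sketch (Python 2), assuming d already holds decoded unicode values:
import io
import json

data = json.dumps(d, ensure_ascii=False)  # unicode output, no \uhhhh escapes
with io.open('test.txt', 'w', encoding='utf8') as f:
    f.write(data)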

Related

UnicodeEncodeError: 'charmap' codec can't encode character '\ufeff' in position 0: character maps to <undefined>

I'm working on an application which uses utf-8 encoding. For debugging purposes I need to print the text. If I use print() directly with the variable containing my unicode string, e.g. print(pred_str), I get this error:
UnicodeEncodeError: 'charmap' codec can't encode character '\ufeff' in position 0: character maps to <undefined>
So I tried print(pred_str.encode('utf-8')) and my output looks like this:
b'\xef\xbb\xbfpudgala-dharma-nair\xc4\x81tmyayo\xe1\xb8\xa5 apratipanna-vipratipann\xc4\x81n\xc4\x81m'
b'avipar\xc4\xabta-pudgala-dharma-nair\xc4\x81tmya-pratip\xc4\x81dana-artham'
b'tri\xe1\xb9\x83\xc5\x9bik\xc4\x81-vij\xc3\xb1apti-prakara\xe1\xb9\x87a-\xc4\x81rambha\xe1\xb8\xa5'
b'pudgala-dharma-nair\xc4\x81tmya-pratip\xc4\x81danam punar kle\xc5\x9ba-j\xc3\xb1eya-\xc4\x81vara\xe1\xb9\x87a-prah\xc4\x81\xe1\xb9\x87a-artham'
But, I want my output to look like this:
pudgala-dharma-nairātmyayoḥ apratipanna-vipratipannānām
aviparīta-pudgala-dharma-nairātmya-pratipādana-artham
triṃśikā-vijñapti-prakaraṇa-ārambhaḥ
pudgala-dharma-nairātmya-pratipādanam punar kleśa-jñeya-āvaraṇa-prahāṇa-artham
If I save my string to a file using:
import codecs

with codecs.open('out.txt', 'w', 'UTF-8') as f:
    f.write(pred_str)
it saves the string as expected.
Your data is encoded with the "UTF-8-SIG" codec, which is sometimes used in Microsoft environments.
This variant of UTF-8 prefixes encoded text with a byte order mark '\xef\xbb\xbf', to make it easier for applications to detect UTF-8 encoded text vs other encodings.
You can decode such bytestrings like this:
>>> bs = b'\xef\xbb\xbfpudgala-dharma-nair\xc4\x81tmyayo\xe1\xb8\xa5 apratipanna-vipratipann\xc4\x81n\xc4\x81m'
>>> text = bs.decode('utf-8-sig')
>>> print(text)
pudgala-dharma-nairātmyayoḥ apratipanna-vipratipannānām
To read such data from a file:
with open('myfile.txt', 'r', encoding='utf-8-sig') as f:
    text = f.read()
Note that even after decoding from UTF-8-SIG, you may still be unable to print your data, because your console's default code page may not be able to encode other non-ASCII characters in the data. In that case you will need to adjust your console settings to support UTF-8.
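For example, on Python 3.7+ you can re-wrap the standard output stream to use UTF-8 regardless of the console's code page; a minimal sketch:
import sys

# Ask stdout to emit UTF-8; 'replace' avoids a crash if the terminal still
# cannot display a particular character (Python 3.7+ only).
sys.stdout.reconfigure(encoding='utf-8', errors='replace')
print(text)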
Try this code:
if pred_str.startswith('\ufeff'):
    pred_str = pred_str.split('\ufeff')[1]

UnicodeDecodeError while processing Accented words

I have a python script which reads a YAML file (it runs on an embedded system). Without accents the script runs normally on my development machine and on the embedded system, but accented words make it crash with
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 6: ordinal not in range(128)
only in the embedded environment.
The YAML sample:
data: ã
The snippet which reads the YAML:
with open(YAML_FILE, 'r') as stream:
    try:
        data = yaml.load(stream)
    except yaml.YAMLError as exc:
        print(exc)
Tried a bunch of solutions without success.
Versions: Python 3.6, PyYAML 3.12
The codec that is reading your bytes has been set to ASCII. This restricts you to byte values between 0 and 127.
The byte representation of accented characters falls outside this range, so you're getting a decoding error.
A UTF-8 codec decodes ASCII as well as UTF-8, because ASCII is a (very small) subset of UTF-8, by design.
If you can change your codec to UTF-8, decoding should work.
In general, you should always specify how you will decode a byte stream to text, otherwise, your stream could be ambiguous.
You can specify the codec that should be used when dumping data with PyYAML, but there is no way to specify your codec in PyYAML when you load. However, PyYAML handles unicode input, and you can explicitly specify which codec to use when opening the file for reading; that codec is then used to return the text (you open the file as a text file with 'r', which is the default for open()).
import yaml

YAML_FILE = 'input.yaml'
with open(YAML_FILE, encoding='utf-8') as stream:
    data = yaml.safe_load(stream)
Please note that you should almost never have to use yaml.load(), which is documented to be unsafe, use yaml.safe_load() instead.
To dump data in the same format you loaded it, use:
import sys

# encoding='utf-8' makes PyYAML emit bytes, so write to stdout's binary buffer
yaml.safe_dump(data, sys.stdout.buffer, allow_unicode=True, encoding='utf-8',
               default_flow_style=False)
The default_flow_style is needed in order not to get the flow-style curly braces, and the allow_unicode is necessary or else you get data: "\xE3" (i.e. escape sequences for unicode characters)
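For illustration, this is roughly what allow_unicode changes (a sketch; exact quoting may vary slightly between PyYAML versions):
>>> import yaml
>>> print(yaml.safe_dump({'data': u'ã'}, default_flow_style=False))
data: "\xE3"

>>> print(yaml.safe_dump({'data': u'ã'}, allow_unicode=True, default_flow_style=False))
data: ã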

Write bytes literal with undefined character to CSV file (Python 3)

Using Python 3.4.2, I want to get a part of a website. According to the meta tags, that website is encoded with iso-8859-1. And I want to write one part (along with other parts) to a CSV file.
However, this part contains an undefined character with the hex value 0x8b. In order to preserve the part as well as possible, I want to write it as-is into the CSV file. However, Python doesn't let me do it.
Here's a minimal example:
import urllib.request
import urllib.parse
import csv

if __name__ == "__main__":
    with open("bytewrite.csv", "w", newline="") as csvfile:
        a = b'\x8b'  # byte literal by urllib.request
        b = a.decode("iso-8859-1")
        w = csv.writer(csvfile)
        w.writerow([b])
And this is the output:
Traceback (most recent call last):
File "D:\Eigene\Dateien\Code\Python\writebyte.py", line 12, in <module>
w.writerow([b])
File "C:\Python34\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\x8b' in position 0: character maps to <undefined>
Eventually, I did it manually. It was just copy and paste with Notepad++, and according to a hex editor the value was inserted correctly. But how can I do it with Python 3? Why does Python even care what 0x8b stands for, instead of just writing it to the file?
It further irritates me that according to iso8859_1.py (and also cp1252.py) in C:\Python34\lib\encodings\ the lookup table seems to not interfere:
# iso8859_1.py
'\x8b' # 0x8B -> <control>
# cp1252.py
'\u2039' # 0x8B -> SINGLE LEFT-POINTING ANGLE QUOTATION MARK
Quoted from csv docs:
Since open() is used to open a CSV file for reading, the file will by
default be decoded into unicode using the system default encoding (see
locale.getpreferredencoding()). To decode a file using a different
encoding, use the encoding argument of open:
import csv

with open('some.csv', newline='', encoding='utf-8') as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)
The same applies to writing in something other than the system default encoding: specify the encoding argument when opening the output file.
What is happening is you've decoded to Unicode from iso-8859-1, but getpreferredencoding() returns cp1252 and the Unicode character \x8b is not supported in that encoding.
Corrected minimal example:
import csv

with open('bytewrite.csv', 'w', encoding='iso-8859-1', newline='') as csvfile:
    a = b'\x8b'
    b = a.decode("iso-8859-1")
    w = csv.writer(csvfile)
    w.writerow([b])
Your interpretation of the lookup tables in encodings is not correct. The code you've listed:
# iso8859_1.py
'\x8b' # 0x8B -> <control>
# cp1252.py
'\u2039' # 0x8B -> SINGLE LEFT-POINTING ANGLE QUOTATION MARK
Tells you two things:
How to map the unicode character '\x8b' to bytes in iso8859-1: it's just a control character.
How to map the unicode character '\u2039' to bytes in cp1252: it's a piece of punctuation: ‹
Neither entry tells you how to map the unicode character '\x8b' to bytes in cp1252, which is what you're trying to do.
The root of the problem is that "\x8b" is not a valid iso8859-1 character. Look at the table here:
http://en.wikipedia.org/wiki/ISO/IEC_8859-1#Codepage_layout
8b is undefined, so it just decodes as a control character. After it's decoded and we're in unicode land, what is 0x8b? This is a little tricky to find out, but it's defined in the unicode database here:
008B;<control>;Cc;0;BN;;;;;N;PARTIAL LINE FORWARD;;;;
Now, does CP1252 have this control character, "PARTIAL LINE FORWARD"?
http://en.wikipedia.org/wiki/Windows-1252#Code_page_layout
No, it does not. So you get an error when trying to encode it in CP1252.
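You can see the asymmetry directly in the interpreter; a quick sketch (Python 3):
>>> b'\x8b'.decode('iso-8859-1')  # byte 0x8B decodes to the control character U+008B
'\x8b'
>>> b'\x8b'.decode('cp1252')      # in cp1252 the same byte means U+2039
'‹'
>>> '\x8b'.encode('cp1252')       # ...but U+008B itself has no cp1252 encoding
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'charmap' codec can't encode character '\x8b' in position 0: character maps to <undefined>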
Unfortunately there's no good solution for this. Some ideas:
Guess what encoding the page actually is. It's probably CP1252, not ISO-8859-1, but who knows. It could even contain a mix of encodings, or incorrectly encoded data (mojibake). You can use chardet to guess the encoding, or force this URL to use CP1252 in your program (overriding what the meta tag says), or you could try a series of codecs and take the first one that decodes & encodes successfully.
Fix up the input text or the decoded unicode string using some kind of mapping of problematic characters like this. This will work most of the time, but will fail silently or do something weird if you're trying to "fix up" data where it doesn't make sense.
Do not try to convert from ISO-8859-1 to CP1252, as they aren't compatible with each other. If you use UTF-8 that might work better.
Use an encoding error handler. See this table for a list of handlers. Using xmlcharrefreplace and backslashreplace will preserve the information (but then require you to do extra steps when decoding), while replace and ignore will silently skip over the bad character.
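For example, a minimal sketch of the last idea, keeping the output in cp1252 but escaping anything it cannot represent (the filename is only illustrative):
import csv

with open('bytewrite.csv', 'w', encoding='cp1252',
          errors='backslashreplace', newline='') as csvfile:
    w = csv.writer(csvfile)
    w.writerow(['\x8b'])  # written as the escape \x8b instead of raising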
These types of issues caused by older encodings are really hard to solve, and there is no perfect solution. This is the reason why unicode was invented.

Encoding error even using the right codec

I want to open files depending on the encoding format, therefore I do the following:
import magic
import csv
import codecs

i_file = open(filename).read()
mag = magic.Magic(mime_encoding=True)
encoding = mag.from_buffer(i_file)
print "The encoding is ", encoding
Once I know the encoding format, I try to open the file using the right one:
csvlist = []
with codecs.open(filename, "rb", encoding) as f_obj:
    reader = csv.reader(f_obj)
    for row in reader:
        csvlist.append(row)
However, I get the following error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff' in position 0: ordinal not in range(128)
when trying to open a CSV file whose detected encoding is:
The encoding is utf-16le
The funny part comes here: if utf-16le is replaced by utf-16, the utf-16le CSV file is read properly. However, that does not work for plain ASCII CSV files.
What am I doing wrong?
Python 2's csv module doesn't support Unicode. Is switching to Python 3 an option? If not, can you convert the input file to UTF-8 first?
From the docs linked above:
The csv module doesn’t directly support reading and writing Unicode,
but it is 8-bit-clean save (sic!) for some problems with ASCII NUL
characters. So you can write functions or classes that handle the
encoding and decoding for you as long as you avoid encodings like
UTF-16 that use NULs. UTF-8 is recommended.
Quick and dirty example:
with codecs.open(filename, "rb", encoding) as f_obj:
    with codecs.open(filename + "u8", "wb", "utf-8") as utf8:
        utf8.write(f_obj.read())

with codecs.open(filename + "u8", "rb", "utf-8") as f_obj:
    reader = csv.reader(f_obj)
    # etc.
This may be useful to you.
Check out the Python 2 documentation:
https://docs.python.org/2/library/csv.html
Especially this section:
For all other encodings the following UnicodeReader and UnicodeWriter
classes can be used. They take an additional encoding parameter in
their constructor and make sure that the data passes the real reader
or writer encoded as UTF-8:
Look at the bottom of the page!!!!
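For convenience, here is a condensed sketch along the lines of that recipe (paraphrased, not a verbatim copy of the docs): it re-encodes each input line to UTF-8 before handing it to csv.reader, then decodes the resulting cells back to unicode.
import csv
import codecs

class UTF8Recoder:
    """Iterates over lines of f, re-encoding them from the given codec to UTF-8."""
    def __init__(self, f, encoding):
        self.reader = codecs.getreader(encoding)(f)
    def __iter__(self):
        return self
    def next(self):
        return self.reader.next().encode("utf-8")

class UnicodeReader:
    """A CSV reader for a file f encoded in the given encoding (Python 2)."""
    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        self.reader = csv.reader(UTF8Recoder(f, encoding), dialect=dialect, **kwds)
    def next(self):
        return [unicode(cell, "utf-8") for cell in self.reader.next()]
    def __iter__(self):
        return self
Usage would then look like reader = UnicodeReader(open(filename, 'rb'), encoding=encoding).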

Python bytearray ignoring encoding?

I've got a chunk of code that reads binary data off a string buffer (a StringIO object) and tries to convert it to a bytearray object, but it throws errors when a byte value is greater than 127, which the ascii codec can't handle, even though I'm trying to override it:
import zlib
from StringIO import StringIO

file = open(filename, 'r+b')
file.seek(offset)
chunk = file.read(length)
chunk = zlib.decompress(chunk)
chunk = StringIO(chunk)

d = bytearray(chunk.read(10), encoding="iso8859-1", errors="replace")
Running that code gives me:
d = bytearray(chunk.read(10), encoding="iso8859-1", errors="replace")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf0 in position 3: ordinal not in range(128)
Obviously 240 (decimal of 0xf0) can't fit in the ascii encoding range, but that's why I'm explicitly setting the encoding. But it seems to be ignoring it.
When converting a string to another encoding, its original encoding is taken to be ASCII if it is a str or Unicode if it is a unicode object. When creating the bytearray, the encoding parameter is required only if the string is unicode. Just don't specify an encoding and you will get the results you want.
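In other words, a minimal sketch of the fix:
# chunk.read(10) already returns a byte string (str in Python 2), so it can be
# passed to bytearray() directly, with no encoding or decoding involved:
d = bytearray(chunk.read(10))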
I am not quite sure what the problem is.
StringIO is for string IO, not for binary IO.
If you want to get a bytearray representing the whole content of the file, use:
with open('filename', 'r') as file: bytes = bytearray(file.read())
If you want to get a string with only the ASCII characters contained in that file, use:
with open('filename', 'r') as file: asciis = file.read().decode('ascii', 'ignore')
(If you run this on Windows, you will probably need the binary flag when opening the file.)
