Python - inferring input file encoding while reading

I have to process an input text file, which can be in ANSI, and convert it to UTF-8, whilst doing some processing of the lines read. In Python, that amounts to:
with open(input_file_location, 'r', newline='\r\n', encoding='cp1252') as old, open(output_file_location, 'w', encoding='utf_8') as new:
    for line in old:
        modified = ...  # do processing here
        new.write(modified)
However, this will work as expected only if the input file is ANSI (Windows). If, however, the input file was originally UTF-8, the above code still runs silently, reading it as if it were ANSI, and the output is not as expected.
So - the question is - how to handle the scenario where the existing file is already UTF-8: either read it as UTF-8 or, better, avoid the whole of the above processing.
Thanks

UTF-8 is more constraining than CP1252, and both are ASCII compatible. So you can start by reading the file as UTF-8: if that works you're fine (it's either plain ASCII or valid UTF-8), and if it doesn't, fall back to CP1252.
Alternatively you could try running chardet on it, but that's not necessarily more reliable: every byte is "valid" in ISO-8859 encodings (of which CP1252 is a derivative), so every file "decodes properly"; it just comes back as garbage.
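A minimal sketch of that fallback, reusing the question's own variable names (input_file_location, output_file_location) and assuming UTF-8 and cp1252 are the only candidates:
try:
    with open(input_file_location, 'r', encoding='utf-8') as old:
        lines = old.readlines()
except UnicodeDecodeError:
    # Not valid UTF-8, so reread it the way the question already does.
    with open(input_file_location, 'r', newline='\r\n', encoding='cp1252') as old:
        lines = old.readlines()

with open(output_file_location, 'w', encoding='utf-8') as new:
    for line in lines:
        new.write(line)  # ... do the per-line processing here ...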

There isn't a guaranteed way to determine the encoding of a file if it isn't known in advance. However, if you are sure that the possibilities are restricted to UTF-8 and cp1252, then the following approach may work:
Open the file in binary mode and read the first three bytes. If these bytes are b'\xef\xbb\xbf' then the encoding is extremely likely to be 'utf-8-sig', a Microsoft variant of UTF-8 (unless you have cp1252 files that legitimately begin with "ï»¿"). See the final paragraph of this section of the codecs docs.
Assume UTF-8. Both UTF-8 and cp1252 will decode bytes in the ASCII range (0-127) identically. Single bytes with the high bit set are not valid UTF-8, so if the file is encoded as cp1252 and contains such bytes a UnicodeDecodeError will be raised.
Catch the above UnicodeDecodeError and try again with cp1252.
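A rough sketch of those three steps; read_with_fallback is a hypothetical helper name, not anything from the standard library:
def read_with_fallback(path):
    # Step 1: peek at the first three bytes for a UTF-8 BOM.
    with open(path, 'rb') as f:
        has_bom = f.read(3) == b'\xef\xbb\xbf'
    encoding = 'utf-8-sig' if has_bom else 'utf-8'
    # Step 2: assume UTF-8 (or utf-8-sig); step 3: fall back to cp1252 on failure.
    try:
        with open(path, 'r', encoding=encoding) as f:
            return f.read(), encoding
    except UnicodeDecodeError:
        with open(path, 'r', encoding='cp1252') as f:
            return f.read(), 'cp1252'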

Related

UTF-8 Encoding in python gets transformed to ASCII?

I'm attempting to do something very simple, which is read a file in ascii or utf-8-sig and save it as utf-8. However, when I run the function below, and then do file filename.json in Linux, it always shows the file as being ASCII. I have tried using codecs, and no luck either. The only way I can get it to work is if I replace utf-8 with utf-8-sig, BUT that gives me the issue that the file has a BOM. I've searched around for solutions, and I found some that remove the beginning characters; however, after this is performed, the file becomes ascii again. I have tried everything here: Convert UTF-8 with BOM to UTF-8 with no BOM in Python
def file_converter(file_path):
    s = open(file_path, mode='r', encoding='ascii').read()
    open(file_path, mode='w', encoding='utf-8').write(s)
Files that only contain characters below U+0080 encode to exactly the same bytes as either ASCII or UTF-8 (this was one of the compatibility goals of UTF-8). file detects the file as ASCII, and it is, but it's also UTF-8, and will decode correctly as UTF-8 (just like any ASCII file will). So nothing at all is wrong.
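A quick way to see this in Python 3 (the sample strings are just illustrations): pure-ASCII text produces byte-for-byte identical output under both codecs, which is why file cannot tell them apart.
s = 'plain ascii text'
assert s.encode('ascii') == s.encode('utf-8')   # identical bytes
print('π'.encode('utf-8'))                      # b'\xcf\x80': only non-ASCII characters differ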

json.dump() uses ASCII codec encoding (instead of requested UTF-8) when redirecting stdout to a file

This tiny python program:
#!/usr/bin/env python
# -*- coding: utf8 -*-
import json
import sys
x = { "name":u"This doesn't work β" }
json.dump(x, sys.stdout, ensure_ascii=False, encoding="utf8")
print
Generates this output when run at a terminal:
$ ./tester.py
{"name": "This doesn't work β"}
Which is exactly as I would expect. However, if I redirect stdout to a file, it fails:
$ ./tester.py > output.json
Traceback (most recent call last):
File "./tester.py", line 9, in <module>
json.dump(x, sys.stdout, ensure_ascii=False, encoding="utf8")
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/__init__.py", line 190, in dump
fp.write(chunk)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u03b2' in position 19: ordinal not in range(128)
However, a direct print (without json.dump) can be redirected to a file:
print u"This does work β".encode('utf-8')
It's as if the json package ignores the encoding option if stdout is not a terminal.
How can I get the json package to do what I want?
JSON is a text serialization format (that incidentally has a recommended binary encoding), not a binary serialization format. The json module itself only cares about encoding to the extent that it would like to know what Python 2's terrible str type is supposed to represent (is it ASCII bytes? UTF-8 bytes? latin-1 bytes?).
Since Python 2 text handling is, as stated, terrible, the json module is happy to return either str (when ensure_ascii is true, or the stars align in other cases and it's convinced you've told it str is compatible with your expected encoding, and none of the inputs are actually unicode) or unicode (when ensure_ascii is false, most of the time).
Like the rest of Python 2, sys.stdout is a bit wishy-washy. Even if its encoding is set to ASCII by your locale settings, it ignores that when you write a str to it (sys.stdout.write('\xe9') should fail, but instead it treats the str as pre-encoded raw binary data and doesn't bother to verify it matches the expected encoding). But when unicode comes in, it doesn't have that option; unicode is text (not UTF-8 text, not ASCII text, etc.), from the ideal text world of unicorns and rainbows, and that world isn't expressed in tawdry bytes.
So sys.stdout must encode the result, and it does so with the locale determined encoding (sys.stdout.encoding will tell you what it is). When that's ASCII, and it receives something that can't encode to ASCII, it explodes (as it should).
The point is, the json module is always returning text (either unicode, or str that it's convinced is effectively text in the wishy-washy Python 2 world), and sometimes you get lucky and that text happens to be in a format that bypasses checks in sys.stdout.
But you shouldn't be relying on that. If your output must be in a specific encoding, use that encoding. The simplest way to do this (simplest in the sense that it pushes most work to the interpreter to do for you) is to not use sys.stdout (explicitly, or implicitly via print) and write your data to files you open with io.open (a backport of Python 3's open, that properly handles encodings), explicitly specifying encoding='utf-8'. If you must use sys.stdout, and you insist on ignoring the locale encoding, you can rewrap it, e.g.:
with io.open(sys.stdout.fileno(), 'w', encoding='utf-8', closefd=False) as encodedout:
    json.dump(x, encodedout, ensure_ascii=False, encoding="utf-8")
which temporarily wraps the stdout file descriptor in a modern file-like object (using closefd to avoid closing the underlying descriptor when it's closed).
TL;DR: Switch to Python 3. Python 2 is awful when it comes to non-ASCII text, and its modules are often even worse (json should absolutely be returning a consistent type, or at least just one type for each setting of ensure_ascii, not dynamically selecting based on the inputs and encoding; it's not even the worst either, the csv module is absolutely awful). Also, it's reached end-of-life, and will not be patched for anything from here on out, so continuing to use it leaves you vulnerable to any security problems found between the beginning of this year and the end of time. Among other things, Python 3 uses str exclusively for text (which has the full Unicode support of Py2's unicode type) and modern Python 3 (3.7+) will coerce ASCII locales to UTF-8 (because basically all systems can actually handle the latter), which should fix all your problems. Non-ASCII text will behave the same as ASCII text, and weirdo locales like yours that insist they're ASCII (and therefore won't handle non-ASCII output) will be "fixed" to work as you desire, without manually encoding and decoding, rewrapping file handles, etc.
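For comparison, a minimal Python 3 sketch (the file name output.json is just an example): str is already Unicode and open() takes an encoding, so no rewrapping is needed.
import json

data = {"name": "This works β"}
with open("output.json", "w", encoding="utf-8") as fp:
    json.dump(data, fp, ensure_ascii=False)   # writes the β as UTF-8, unescaped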
Consolidating all the comments and answers into one final answer:
Note: this answer is for Python 2.7. Python 3 is likely to be different.
The json spec says that json files are utf-8 encoded. However, the Python json package does not like to take chances and so writes straight ascii and escapes unicode characters in the output.
You can set the ensure_ascii flag to False, in which case the json package will generate unicode output instead of str. In that case, encoding the unicode output is your problem.
There is no way to make the json package generate utf-8 or any other encoding on output. It's either ascii or unicode; take your pick.
The encoding argument was a red herring. That option tells the json package how the input strings are encoded.
Here's what finally worked for me:
ofile = codecs.getwriter('utf-8')(sys.stdout)
json.dump(x, ofile, ensure_ascii=False)
tl;dr: the real mystery was why it didn't barf when just letting stdout go to the terminal. It turned out that stdout.write() was detecting when output was going to a terminal and encoding per the $LANG environment variable. When output goes to a file, the unicode is encoded to ascii, and an error results when a non-encodable character is encountered.
There is an environment variable that overrides the encoding Python uses for the terminal or for redirection, so this should work without wrapping stdout inside the script.
$ export PYTHONIOENCODING=utf8
$ ./tester.py > output.json

How do I detect if a file is encoded using UTF-8?

Is there a way to recognize if text file is UTF-8 in Python?
I would really like to get if the file is UTF-8 or not. I don't need to detect other encodings.
You mentioned in a comment you only need to detect UTF-8. If you know the alternative consists of only single byte encodings, then there is a solution that often works.
If you know it's either UTF-8 or a single-byte encoding like latin-1, then try opening it first in UTF-8 and then in the other encoding. If the file contains only ASCII characters, it will end up opened in UTF-8 even if it was intended as the other encoding. If it contains any non-ASCII characters, this will almost always correctly detect the right character set between the two.
try:
    # or codecs.open on Python <= 2.5
    # or io.open on Python > 2.5 and <= 2.7
    filedata = open(filename, encoding='UTF-8').read()
except UnicodeDecodeError:
    filedata = open(filename, encoding='other-single-byte-encoding').read()
Your best bet is to use the chardet package from PyPI, either directly or through UnicodeDammit from BeautifulSoup:
chardet 1.0.1
Universal encoding detector
Detects:
ASCII, UTF-8, UTF-16 (2 variants), UTF-32 (4 variants)
Big5, GB2312, EUC-TW, HZ-GB-2312, ISO-2022-CN (Traditional and Simplified Chinese)
EUC-JP, SHIFT_JIS, ISO-2022-JP (Japanese)
EUC-KR, ISO-2022-KR (Korean)
KOI8-R, MacCyrillic, IBM855, IBM866, ISO-8859-5, windows-1251 (Cyrillic)
ISO-8859-2, windows-1250 (Hungarian)
ISO-8859-5, windows-1251 (Bulgarian)
windows-1252 (English)
ISO-8859-7, windows-1253 (Greek)
ISO-8859-8, windows-1255 (Visual and Logical Hebrew)
TIS-620 (Thai)
Requires Python 2.1 or later
However, some files will be valid in multiple encodings, so chardet is not a panacea.
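For reference, a small sketch of using chardet directly (the file name is hypothetical); detect() returns a guessed encoding plus a confidence score, and the guess should be treated as exactly that:
import chardet

with open('unknown.txt', 'rb') as f:
    raw = f.read()

guess = chardet.detect(raw)   # e.g. {'encoding': 'utf-8', 'confidence': 0.99, ...}
text = raw.decode(guess['encoding'] or 'utf-8', errors='replace')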
Reliably? No.
In general, a byte sequence has no meaning unless you know how to interpret it -- this goes for text files, but also integers, floating point numbers, etc.
But, there are ways of guessing the encoding of a file, by looking at the byte order mark (if there is one) and the first chunk of the file (to see which encoding yields the most sensible characters). The chardet library is pretty good at this, but be aware it's only a heuristic, albeit a rather powerful one.

Writing UTF-8 friendly parsers in python

I wrote a simple file parser and writer, but then I came across an article talking about the importance of unicode and then it occurred to me that I'm assuming the input file is ascii encoded, which may not be the case all the time, though it would be rare in my situation.
In those rare cases, I would expect UTF-8 encoded files.
Is there a way to work with UTF-8 files by simply changing how I read and write? All I do with the strings is store them and then write them out, so I just need to make sure I can read them, store them, and write them properly.
Furthermore, would I have to treat ascii and UTF-8 files separately and write different functions for each? I have not worked with anything other than ascii files yet and only read about handling unicode.
Python natively supports Unicode. If you directly read and write from the first file to the second, then no data is lost as it copies the bytes verbatim. However, if you decode the string and then re-encode it, you'll need to make sure you use the right encoding.
If you are using Python 2, you can simply change all your str objects to unicode objects. Unicode objects have all the same methods as strings, but they hold Unicode text rather than ASCII bytes. See http://docs.python.org/library/functions.html#unicode .
If you are using Python 3, strings are Unicode by default, and source files are assumed to be UTF-8.
If you are using Python 2.6 or later, you can use the io library and its io.open method to open the files you want. It has an encoding argument which should be set to 'utf-8' in your case. When you read or write the returned file objects, strings are automatically encoded/decoded.
Anyway, you don't need to do something special for ASCII, because UTF-8 is a superset of ASCII.
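A small sketch of that io.open approach (Python 2.7 or later; the file names are placeholders): decoding on read and encoding on write means your code only ever handles text, and plain ASCII input works unchanged because it is valid UTF-8.
import io

with io.open('input.txt', 'r', encoding='utf-8') as src, \
        io.open('output.txt', 'w', encoding='utf-8') as dst:
    for line in src:
        dst.write(line)   # line is already text (unicode); io re-encodes it as UTF-8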
So long as you are only reading and writing to files and not expecting any other type of encoded input, then you should not have to do anything special.
% cat /tmp/u
π is 3.14.
% file /tmp/u
/tmp/u: UTF-8 Unicode text
% cat f.py
f = open('/tmp/u', 'r')
d = f.read()
print d.split()
f.close()
% python f.py
['\xcf\x80', 'is', '3.14.']
This changes when you declare UTF-8 text in the program source itself (or accept it on standard input).
% cat g.py
s = 'π is 3.14.'
print s.split()
% python g.py
File "g.py", line 1
SyntaxError: Non-ASCII character '\xcf' in file g.py on line 1, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
To handle this properly, declare the encoding for the Python program at the beginning per PEP 263 (referenced by the SyntaxError exception above).
% cat h.py
# -*- coding: utf-8 -*-
s = 'π is 3.14.'
print s.split()
% python h.py
['\xcf\x80', 'is', '3.14.']

python noob question about codecs and utf-8

Using Python to pick up some pieces, so definitely a noob question here, but I didn't see a satisfactory answer.
I have a JSON UTF-8 file with some pieces that have graves, acutes, etc. I'm using codecs and have (for example):
str = codecs.open('../../publish_scripts/locations.json', 'r', 'utf-8')
locations = json.load(str)
for location in locations:
    print location['name']
For printing, does anything special need to be done? It's giving me the following:
'ascii' codec can't encode character u'\xe9' in position 5
It looks like the correct utf-8 value for e-acute. I suspect I'm doing something wrong with printing. Would the iteration cause it to lose its UTF-8'ness?
PHP and Ruby versions handle the utf-8 piece fine; is there some looseness in those languages that python won't do?
thx
codecs.open() will decode the contents of the file using the codec you supplied (utf-8). You then have a Python unicode object (which behaves similarly to a string object).
Printing a unicode object will cause an implicit (behind-the-scenes) encode using the default codec, which is usually ascii. If ascii cannot encode all of the characters present, it will fail.
To print it, you should first encode it, thus:
for location in locations:
    print location['name'].encode('utf8')
EDIT:
For your info, json.load() actually takes a file-like object (which is what codecs.open() returns). What you have at that point is neither a string nor a unicode object, but an iterable wrapper around the file.
By default json.load() expects the file to be utf8 encoded so your code snippet can be simplified:
locations = json.load(open('../../publish_scripts/locations.json'))
for location in locations:
    print location['name'].encode('utf8')
You're probably reading the file correctly. The error occurs when you're printing. Python tries to convert the unicode string to ascii, and fails on the character in position 5.
Try this instead:
print location['name'].encode('utf-8')
If your terminal is set to expect output in utf-8 format, this will print correctly.
It's the same as in PHP. UTF8 strings are good to print.
The standard I/O streams are broken for non-ASCII character I/O in Python 2 under some site.py setups. Basically, you need to call sys.setdefaultencoding('utf8') (or whatever the system locale's encoding is) very early in your script. With the site.py shipped in Ubuntu, you need imp.reload(sys) to make sys.setdefaultencoding available. Alternatively, you can wrap sys.stdout (and stdin and stderr) in unicode-aware readers/writers, which you can get from codecs.getreader / codecs.getwriter, as sketched below.
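A short sketch of that codecs wrapping (Python 2; the string literal is just an example): the stream is replaced with a UTF-8-aware writer, so printing unicode no longer trips the ascii default.
import codecs
import sys

sys.stdout = codecs.getwriter('utf-8')(sys.stdout)
sys.stderr = codecs.getwriter('utf-8')(sys.stderr)

print u'caf\xe9'   # now encoded to UTF-8 on the way out instead of raising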
