Writing UTF-8 friendly parsers in python - python

I wrote a simple file parser and writer, but then I came across an article talking about the importance of unicode and then it occurred to me that I'm assuming the input file is ascii encoded, which may not be the case all the time, though it would be rare in my situation.
In those rare cases, I would expect UTF-8 encoded files.
Is there a way to work with UTF-8 files by simply changing how I read and write? All I do with the strings is store them and then write them out, so I just need to make sure I can read them, store them, and write them properly.
Furthermore, would I have to treat ascii and UTF-8 files separately and write different functions for each? I have not worked with anything other than ascii files yet and only read about handling unicode.

Python natively supports Unicode. If you directly read and write from the first file to the second, then no data is lost as it copies the bytes verbatim. However, if you decode the string and then re-encode it, you'll need to make sure you use the right encoding.

If you are using Python 2, you can simply change all your str objects to unicode objects. Unicode objects have all the same methods as strings but are encoded in a unicode format instead of ASCII. See http://docs.python.org/library/functions.html#unicode .
If you are using Python 3, strings are encoded in UTF-8 by default.

If you are using Python 2.6 or later, you can use the io library and its io.open method to open the files you want. It has an encoding argument which should be set to 'utf-8' in your case. When you read or write the returned file objects, string are automatically en-/decoded.
Anyway, you don't need to do something special for ASCII, because UTF-8 is a superset of ASCII.

So long as you are only reading and writing to files and not expecting any other type of encoded input, then you should not have to do anything special.
% cat /tmp/u
π is 3.14.
% file /tmp/u
/tmp/u: UTF-8 Unicode text
% cat f.py
f = open('/tmp/u', 'r')
d = f.read()
print d.split()
f.close()
% python f.py
['\xcf\x80', 'is', '3.14.']
This changes when you declare or accept standard input using UTF-8.
% cat g.py
s = 'π is 3.14.'
print s.split()
% python g.py
File "g.py", line 1
SyntaxError: Non-ASCII character '\xcf' in file g.py on line 1, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
To handle this properly, declare the encoding for the Python program at the beginning per PEP 263 (referenced by the SyntaxError exception above).
% cat h.py
# -*- coding: utf-8 -*-
s = 'π is 3.14.'
print s.split()
% python h.py
['\xcf\x80', 'is', '3.14.']

Related

How to write strings in Unicode to a text file in Python? [duplicate]

I'm having some brain failure in understanding reading and writing text to a file (Python 2.4).
# The string, which has an a-acute in it.
ss = u'Capit\xe1n'
ss8 = ss.encode('utf8')
repr(ss), repr(ss8)
("u'Capit\xe1n'", "'Capit\xc3\xa1n'")
print ss, ss8
print >> open('f1','w'), ss8
>>> file('f1').read()
'Capit\xc3\xa1n\n'
So I type in Capit\xc3\xa1n into my favorite editor, in file f2.
Then:
>>> open('f1').read()
'Capit\xc3\xa1n\n'
>>> open('f2').read()
'Capit\\xc3\\xa1n\n'
>>> open('f1').read().decode('utf8')
u'Capit\xe1n\n'
>>> open('f2').read().decode('utf8')
u'Capit\\xc3\\xa1n\n'
What am I not understanding here? Clearly there is some vital bit of magic (or good sense) that I'm missing. What does one type into text files to get proper conversions?
What I'm truly failing to grok here, is what the point of the UTF-8 representation is, if you can't actually get Python to recognize it, when it comes from outside. Maybe I should just JSON dump the string, and use that instead, since that has an asciiable representation! More to the point, is there an ASCII representation of this Unicode object that Python will recognize and decode, when coming in from a file? If so, how do I get it?
>>> print simplejson.dumps(ss)
'"Capit\u00e1n"'
>>> print >> file('f3','w'), simplejson.dumps(ss)
>>> simplejson.load(open('f3'))
u'Capit\xe1n'
Rather than mess with .encode and .decode, specify the encoding when opening the file. The io module, added in Python 2.6, provides an io.open function, which allows specifying the file's encoding.
Supposing the file is encoded in UTF-8, we can use:
>>> import io
>>> f = io.open("test", mode="r", encoding="utf-8")
Then f.read returns a decoded Unicode object:
>>> f.read()
u'Capit\xe1l\n\n'
In 3.x, the io.open function is an alias for the built-in open function, which supports the encoding argument (it does not in 2.x).
We can also use open from the codecs standard library module:
>>> import codecs
>>> f = codecs.open("test", "r", "utf-8")
>>> f.read()
u'Capit\xe1l\n\n'
Note, however, that this can cause problems when mixing read() and readline().
In the notation u'Capit\xe1n\n' (should be just 'Capit\xe1n\n' in 3.x, and must be in 3.0 and 3.1), the \xe1 represents just one character. \x is an escape sequence, indicating that e1 is in hexadecimal.
Writing Capit\xc3\xa1n into the file in a text editor means that it actually contains \xc3\xa1. Those are 8 bytes and the code reads them all. We can see this by displaying the result:
# Python 3.x - reading the file as bytes rather than text,
# to ensure we see the raw data
>>> open('f2', 'rb').read()
b'Capit\\xc3\\xa1n\n'
# Python 2.x
>>> open('f2').read()
'Capit\\xc3\\xa1n\n'
Instead, just input characters like á in the editor, which should then handle the conversion to UTF-8 and save it.
In 2.x, a string that actually contains these backslash-escape sequences can be decoded using the string_escape codec:
# Python 2.x
>>> print 'Capit\\xc3\\xa1n\n'.decode('string_escape')
Capitán
The result is a str that is encoded in UTF-8 where the accented character is represented by the two bytes that were written \\xc3\\xa1 in the original string. To get a unicode result, decode again with UTF-8.
In 3.x, the string_escape codec is replaced with unicode_escape, and it is strictly enforced that we can only encode from a str to bytes, and decode from bytes to str. unicode_escape needs to start with a bytes in order to process the escape sequences (the other way around, it adds them); and then it will treat the resulting \xc3 and \xa1 as character escapes rather than byte escapes. As a result, we have to do a bit more work:
# Python 3.x
>>> 'Capit\\xc3\\xa1n\n'.encode('ascii').decode('unicode_escape').encode('latin-1').decode('utf-8')
'Capitán\n'
Now all you need in Python3 is open(Filename, 'r', encoding='utf-8')
[Edit on 2016-02-10 for requested clarification]
Python3 added the encoding parameter to its open function. The following information about the open function is gathered from here: https://docs.python.org/3/library/functions.html#open
open(file, mode='r', buffering=-1,
encoding=None, errors=None, newline=None,
closefd=True, opener=None)
Encoding is the name of the encoding used to decode or encode the
file. This should only be used in text mode. The default encoding is
platform dependent (whatever locale.getpreferredencoding()
returns), but any text encoding supported by Python can be used.
See the codecs module for the list of supported encodings.
So by adding encoding='utf-8' as a parameter to the open function, the file reading and writing is all done as utf8 (which is also now the default encoding of everything done in Python.)
So, I've found a solution for what I'm looking for, which is:
print open('f2').read().decode('string-escape').decode("utf-8")
There are some unusual codecs that are useful here. This particular reading allows one to take UTF-8 representations from within Python, copy them into an ASCII file, and have them be read in to Unicode. Under the "string-escape" decode, the slashes won't be doubled.
This allows for the sort of round trip that I was imagining.
This works for reading a file with UTF-8 encoding in Python 3.2:
import codecs
f = codecs.open('file_name.txt', 'r', 'UTF-8')
for line in f:
print(line)
# -*- encoding: utf-8 -*-
# converting a unknown formatting file in utf-8
import codecs
import commands
file_location = "jumper.sub"
file_encoding = commands.getoutput('file -b --mime-encoding %s' % file_location)
file_stream = codecs.open(file_location, 'r', file_encoding)
file_output = codecs.open(file_location+"b", 'w', 'utf-8')
for l in file_stream:
file_output.write(l)
file_stream.close()
file_output.close()
Aside from codecs.open(), io.open() can be used in both 2.x and 3.x to read and write text files. Example:
import io
text = u'á'
encoding = 'utf8'
with io.open('data.txt', 'w', encoding=encoding, newline='\n') as fout:
fout.write(text)
with io.open('data.txt', 'r', encoding=encoding, newline='\n') as fin:
text2 = fin.read()
assert text == text2
To read in an Unicode string and then send to HTML, I did this:
fileline.decode("utf-8").encode('ascii', 'xmlcharrefreplace')
Useful for python powered http servers.
Well, your favorite text editor does not realize that \xc3\xa1 are supposed to be character literals, but it interprets them as text. That's why you get the double backslashes in the last line -- it's now a real backslash + xc3, etc. in your file.
If you want to read and write encoded files in Python, best use the codecs module.
Pasting text between the terminal and applications is difficult, because you don't know which program will interpret your text using which encoding. You could try the following:
>>> s = file("f1").read()
>>> print unicode(s, "Latin-1")
Capitán
Then paste this string into your editor and make sure that it stores it using Latin-1. Under the assumption that the clipboard does not garble the string, the round trip should work.
You have stumbled over the general problem with encodings: How can I tell in which encoding a file is?
Answer: You can't unless the file format provides for this. XML, for example, begins with:
<?xml encoding="utf-8"?>
This header was carefully chosen so that it can be read no matter the encoding. In your case, there is no such hint, hence neither your editor nor Python has any idea what is going on. Therefore, you must use the codecs module and use codecs.open(path,mode,encoding) which provides the missing bit in Python.
As for your editor, you must check if it offers some way to set the encoding of a file.
The point of UTF-8 is to be able to encode 21-bit characters (Unicode) as an 8-bit data stream (because that's the only thing all computers in the world can handle). But since most OSs predate the Unicode era, they don't have suitable tools to attach the encoding information to files on the hard disk.
The next issue is the representation in Python. This is explained perfectly in the comment by heikogerlach. You must understand that your console can only display ASCII. In order to display Unicode or anything >= charcode 128, it must use some means of escaping. In your editor, you must not type the escaped display string but what the string means (in this case, you must enter the umlaut and save the file).
That said, you can use the Python function eval() to turn an escaped string into a string:
>>> x = eval("'Capit\\xc3\\xa1n\\n'")
>>> x
'Capit\xc3\xa1n\n'
>>> x[5]
'\xc3'
>>> len(x[5])
1
As you can see, the string "\xc3" has been turned into a single character. This is now an 8-bit string, UTF-8 encoded. To get Unicode:
>>> x.decode('utf-8')
u'Capit\xe1n\n'
Gregg Lind asked: I think there are some pieces missing here: the file f2 contains: hex:
0000000: 4361 7069 745c 7863 335c 7861 316e Capit\xc3\xa1n
codecs.open('f2','rb', 'utf-8'), for example, reads them all in a separate chars (expected) Is there any way to write to a file in ASCII that would work?
Answer: That depends on what you mean. ASCII can't represent characters > 127. So you need some way to say "the next few characters mean something special" which is what the sequence "\x" does. It says: The next two characters are the code of a single character. "\u" does the same using four characters to encode Unicode up to 0xFFFF (65535).
So you can't directly write Unicode to ASCII (because ASCII simply doesn't contain the same characters). You can write it as string escapes (as in f2); in this case, the file can be represented as ASCII. Or you can write it as UTF-8, in which case, you need an 8-bit safe stream.
Your solution using decode('string-escape') does work, but you must be aware how much memory you use: Three times the amount of using codecs.open().
Remember that a file is just a sequence of bytes with 8 bits. Neither the bits nor the bytes have a meaning. It's you who says "65 means 'A'". Since \xc3\xa1 should become "à" but the computer has no means to know, you must tell it by specifying the encoding which was used when writing the file.
The \x.. sequence is something that's specific to Python. It's not a universal byte escape sequence.
How you actually enter in UTF-8-encoded non-ASCII depends on your OS and/or your editor. Here's how you do it in Windows. For OS X to enter a with an acute accent you can just hit option + E, then A, and almost all text editors in OS X support UTF-8.
You can also improve the original open() function to work with Unicode files by replacing it in place, using the partial function. The beauty of this solution is you don't need to change any old code. It's transparent.
import codecs
import functools
open = functools.partial(codecs.open, encoding='utf-8')
I was trying to parse iCal using Python 2.7.9:
from icalendar import Calendar
But I was getting:
Traceback (most recent call last):
File "ical.py", line 92, in parse
print "{}".format(e[attr])
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 7: ordinal not in range(128)
and it was fixed with just:
print "{}".format(e[attr].encode("utf-8"))
(Now it can print liké á böss.)
I found the most simple approach by changing the default encoding of the whole script to be 'UTF-8':
import sys
reload(sys)
sys.setdefaultencoding('utf8')
any open, print or other statement will just use utf8.
Works at least for Python 2.7.9.
Thx goes to https://markhneedham.com/blog/2015/05/21/python-unicodeencodeerror-ascii-codec-cant-encode-character-uxfc-in-position-11-ordinal-not-in-range128/ (look at the end).

json.dump() uses ASCII codec encoding (instead of requested UTF-8) when redirecting stdout to a file

This tiny python program:
#!/usr/bin/env python
# -*- coding: utf8 -*-
import json
import sys
x = { "name":u"This doesn't work β" }
json.dump(x, sys.stdout, ensure_ascii=False, encoding="utf8")
print
Generates this output when run at a terminal:
$ ./tester.py
{"name": "This doesn't work β"}
Which is exactly as I would expect. However, if I redirect stdout to a file, it fails:
$ ./tester.py > output.json
Traceback (most recent call last):
File "./tester.py", line 9, in <module>
json.dump(x, sys.stdout, ensure_ascii=False, encoding="utf8")
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/__init__.py", line 190, in dump
fp.write(chunk)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u03b2' in position 19: ordinal not in range(128)
However, a direct print (without json.dump) can can be redirect to file:
print u"This does work β".encode('utf-8')
It's as if the json package ignores the encoding option if stdout is not a terminal.
How can I get the json package to do what I want?
JSON is a text serialization format (that incidentally has a recommended binary encoding), not a binary serialization format. The json module itself only cares about encoding to the extent that it would like to know what Python 2's terrible str type is supposed to represent (is it ASCII bytes? UTF-8 bytes? latin-1 bytes?).
Since Python 2 text handling is, as stated, terrible, the json module is happy to return either str (when ensure_ascii is true, or the stars align in other cases and it's convinced you've told it str is compatible with your expected encoding, and none of the inputs are actually unicode) or unicode (when ensure_ascii is false, most of the time).
Like the rest of Python 2, sys.stdout is a bit wishy-washy. Even if it is set to an encoding='ascii' by your locale settings, it ignores it when you write a str to it (sys.stdout.write('\xe9') should fail, but instead, it treats the str as pre-encoded raw binary data and doesn't bother to verify it matches the expected encoding. But when unicode comes in, it doesn't have that option; unicode is text (not UTF-8 text, not ASCII text, etc.), from the ideal text world of unicorns and rainbows, and that world isn't expressed in tawdry bytes.
So sys.stdout must encode the result, and it does so with the locale determined encoding (sys.stdout.encoding will tell you what it is). When that's ASCII, and it receives something that can't encode to ASCII, it explodes (as it should).
The point is, the json module is always returning text (either unicode, or str that it's convinced is effectively text in the wishy-washy Python 2 world), and sometimes you get lucky and that text happens to be in a format that bypasses checks in sys.stdout.
But you shouldn't be relying on that. If your output must be in a specific encoding, use that encoding. The simplest way to do this (simplest in the sense that it pushes most work to the interpreter to do for you) is to not use sys.stdout (explicitly, or implicitly via print) and write your data to files you open with io.open (a backport of Python 3's open, that properly handles encodings), explicitly specifying encoding='utf-8'. If you must use sys.stdout, and you insist on ignoring the locale encoding, you can rewrap it, e.g.:
with io.open(sys.stdout.fileno(), encoding='utf-8', closefd=False) as encodedout:
json.dump(x, encodedout, ensure_ascii=False, encoding="utf-8")
which temporarily wraps the stdout file descriptor in a modern file-like object (using closefd to avoid closing the underlying descriptor when it's closed).
TL;DR: Switch to Python 3. Python 2 is awful when it comes to non-ASCII text, and its modules are often even worse (json should absolutely be returning a consistent type, or at least just one type for each setting of ensure_ascii, not dynamically selecting based on the inputs and encoding; it's not even the worst either, the csv module is absolutely awful). Also, it's reached end-of-life, and will not be patched for anything from here on out, so continuing to use it leaves you vulnerable to any security problems found between the beginning of this year and the end of time. Among other things, Python 3 uses str exclusively for text (which has the full Unicode support of Py2's unicode type) and modern Python 3 (3.7+) will coerce ASCII locales to UTF-8 (because basically all systems can actually handle the latter), which should fix all your problems. Non-ASCII text will behave the same as ASCII text, and weirdo locales like yours that insist they're ASCII (and therefore won't handle non-ASCII output) will be "fixed" to work as you desire, without manually encoding and decoding, rewrapping file handles, etc.
Consolidating all the comments and answers into one final answer:
Note: this answer is for Python 2.7. Python 3 is likely to be different.
The json spec says that json files are utf-8 encoded. However, the Python json package does not like to take chances and so writes straight ascii and escapes unicode characters in the output.
You can set the ensure_ascii flag to False, in which case the json package will generate unicode output instead of str. In that case, encoding the unicode output is your problem.
There is no way to make the json package generate utf-8 or any other encoding on output. It's either ascii or unicode; take your pick.
The encoding argument was a red herring. That option tells the json package how the input strings are encoded.
Here's what finally worked for me:
ofile = codecs.getwriter('utf-8')(sys.stdout)
json.dump(x, ofile, ensure_ascii=False)
tl;dr: the real mystery was why didn't it barf when just letting stdout go to the terminal. It turned out that stdout.write() was detecting when output was to a terminal and encoding per the $LANG environment variable. When output goes to a file, the unicode is encoded to ascii, and an error results when a non-encodable character is encountered.
There is an environment variable Python uses that can override encoding to the terminal or for redirection, so this should work without wrapping stdout inside the script.
$ export PYTHONIOENCODING=utf8
$ ./tester.py > output.json

Python 2.7: Setting I/O Encoding, ’?

Attempting to write a line to a text file in Python 2.7, and have the following code:
# -*- coding: utf-8 -*-
...
f = open(os.path.join(os.path.dirname(__file__), 'output.txt'), 'w')
f.write('Smith’s BaseBall Cap') // Note the strangely shaped apostrophe
However, in output.txt, I get Smith’s BaseBall Cap, instead. Not sure how to correct this encoding problem? Any protips with this sort of issue?
You have declared your file to be encoded with UTF-8, so your byte-string literal is in UTF-8. The curly apostrophe is U+2019. In UTF-8, this is encoded as three bytes, \xE2\x80\x99. Those three bytes are written to your output file. Then, when you examine the output file, it is interpreted as something other than UTF-8, and you see the three incorrect characters instead.
In Mac OS Roman, those three bytes display as ’.
Your file is a correct UTF-8 file, but you are viewing it incorrectly.
There are a couple possibilities, but the first one to check is that the output file actually contains what you think it does. Are you sure you're not viewing the file with the wrong encoding? Some editors have an option to choose what encoding you're viewing the file in. The editor needs to know the file's encoding, and if it interprets the file as being in some other encoding than UTF-8, it will display the wrong thing even though the contents of the file are correct.
When I run your code (on Python 2.6) I get the correct output in the file. Another thing to try: Use the codecs module to open the file for UTF-8 writing: f = codecs.open("file.txt", "w", "utf-8"). Then declare the string as a unicode string withu"'Smith’s BaseBall Cap'"`.

python noob question about codecs and utf-8

Using python to pick it some pieces so definitely a noob ? here but didn't seeing a satisfactory answer.
I have a json utf-8 file with some pieces that have grave's, accute's etc.... I'm using codecs and have (for example):
str=codecs.open('../../publish_scripts/locations.json', 'r','utf-8')
locations=json.load(str)
for location in locations:
print location['name']
For print'ing, does anything special need to be done? It's giving me the following
ascii' codec can't encode character u'\xe9' in position 5
It looks like the correct utf-8 value for e-accute. I suspect I'm doing something wrong with print'ing. Would the iteration cause it to lose it's utf-8'ness?
PHP and Ruby versions handle the utf-8 piece fine; is there some looseness in those languages that python won't do?
thx
codec.open() will decode the contents of the file using the codec you supplied (utf-8). You then have a python unicode object (which behaves similarly to a string object).
Printing a unicode object will cause an implict (behind-the-scenes) encode using the default codec, which is usually ascii. If ascii cannot encode all of the characters present it will fail.
To print it, you should first encode it, thus:
for location in locations:
print location['name'].encode('utf8')
EDIT:
For your info, json.load() actually takes a file-like object (which is what codecs.open() returns). What you have at that point is neither a string nor a unicode object, but an iterable wrapper around the file.
By default json.load() expects the file to be utf8 encoded so your code snippet can be simplified:
locations = json.load(open('../../publish_scripts/locations.json'))
for location in locations:
print location['name'].encode('utf8')
You're probably reading the file correctly. The error occurs when you're printing. Python tries to convert the unicode string to ascii, and fails on the character in position 5.
Try this instead:
print location['name'].encode('utf-8')
If your terminal is set to expect output in utf-8 format, this will print correctly.
It's the same as in PHP. UTF8 strings are good to print.
The standard io streams are broken for non-ascii, character io in python2 and some site.py setups. Basically, you need to sys.setdefaultencoding('utf8') (or whatever the system locale's encoding is) very early in your script. With the site.py shipped in ubuntu, you need to imp.reload(sys) to make sys.setdefaultencoding available. Alternatively, you can wrap sys.stdout (and stdin and stderr) to be unicode-aware readers/writers, which you can get from codecs.getreader / getwriter.

Unicode (UTF-8) reading and writing to files in Python

I'm having some brain failure in understanding reading and writing text to a file (Python 2.4).
# The string, which has an a-acute in it.
ss = u'Capit\xe1n'
ss8 = ss.encode('utf8')
repr(ss), repr(ss8)
("u'Capit\xe1n'", "'Capit\xc3\xa1n'")
print ss, ss8
print >> open('f1','w'), ss8
>>> file('f1').read()
'Capit\xc3\xa1n\n'
So I type in Capit\xc3\xa1n into my favorite editor, in file f2.
Then:
>>> open('f1').read()
'Capit\xc3\xa1n\n'
>>> open('f2').read()
'Capit\\xc3\\xa1n\n'
>>> open('f1').read().decode('utf8')
u'Capit\xe1n\n'
>>> open('f2').read().decode('utf8')
u'Capit\\xc3\\xa1n\n'
What am I not understanding here? Clearly there is some vital bit of magic (or good sense) that I'm missing. What does one type into text files to get proper conversions?
What I'm truly failing to grok here, is what the point of the UTF-8 representation is, if you can't actually get Python to recognize it, when it comes from outside. Maybe I should just JSON dump the string, and use that instead, since that has an asciiable representation! More to the point, is there an ASCII representation of this Unicode object that Python will recognize and decode, when coming in from a file? If so, how do I get it?
>>> print simplejson.dumps(ss)
'"Capit\u00e1n"'
>>> print >> file('f3','w'), simplejson.dumps(ss)
>>> simplejson.load(open('f3'))
u'Capit\xe1n'
Rather than mess with .encode and .decode, specify the encoding when opening the file. The io module, added in Python 2.6, provides an io.open function, which allows specifying the file's encoding.
Supposing the file is encoded in UTF-8, we can use:
>>> import io
>>> f = io.open("test", mode="r", encoding="utf-8")
Then f.read returns a decoded Unicode object:
>>> f.read()
u'Capit\xe1l\n\n'
In 3.x, the io.open function is an alias for the built-in open function, which supports the encoding argument (it does not in 2.x).
We can also use open from the codecs standard library module:
>>> import codecs
>>> f = codecs.open("test", "r", "utf-8")
>>> f.read()
u'Capit\xe1l\n\n'
Note, however, that this can cause problems when mixing read() and readline().
In the notation u'Capit\xe1n\n' (should be just 'Capit\xe1n\n' in 3.x, and must be in 3.0 and 3.1), the \xe1 represents just one character. \x is an escape sequence, indicating that e1 is in hexadecimal.
Writing Capit\xc3\xa1n into the file in a text editor means that it actually contains \xc3\xa1. Those are 8 bytes and the code reads them all. We can see this by displaying the result:
# Python 3.x - reading the file as bytes rather than text,
# to ensure we see the raw data
>>> open('f2', 'rb').read()
b'Capit\\xc3\\xa1n\n'
# Python 2.x
>>> open('f2').read()
'Capit\\xc3\\xa1n\n'
Instead, just input characters like á in the editor, which should then handle the conversion to UTF-8 and save it.
In 2.x, a string that actually contains these backslash-escape sequences can be decoded using the string_escape codec:
# Python 2.x
>>> print 'Capit\\xc3\\xa1n\n'.decode('string_escape')
Capitán
The result is a str that is encoded in UTF-8 where the accented character is represented by the two bytes that were written \\xc3\\xa1 in the original string. To get a unicode result, decode again with UTF-8.
In 3.x, the string_escape codec is replaced with unicode_escape, and it is strictly enforced that we can only encode from a str to bytes, and decode from bytes to str. unicode_escape needs to start with a bytes in order to process the escape sequences (the other way around, it adds them); and then it will treat the resulting \xc3 and \xa1 as character escapes rather than byte escapes. As a result, we have to do a bit more work:
# Python 3.x
>>> 'Capit\\xc3\\xa1n\n'.encode('ascii').decode('unicode_escape').encode('latin-1').decode('utf-8')
'Capitán\n'
Now all you need in Python3 is open(Filename, 'r', encoding='utf-8')
[Edit on 2016-02-10 for requested clarification]
Python3 added the encoding parameter to its open function. The following information about the open function is gathered from here: https://docs.python.org/3/library/functions.html#open
open(file, mode='r', buffering=-1,
encoding=None, errors=None, newline=None,
closefd=True, opener=None)
Encoding is the name of the encoding used to decode or encode the
file. This should only be used in text mode. The default encoding is
platform dependent (whatever locale.getpreferredencoding()
returns), but any text encoding supported by Python can be used.
See the codecs module for the list of supported encodings.
So by adding encoding='utf-8' as a parameter to the open function, the file reading and writing is all done as utf8 (which is also now the default encoding of everything done in Python.)
So, I've found a solution for what I'm looking for, which is:
print open('f2').read().decode('string-escape').decode("utf-8")
There are some unusual codecs that are useful here. This particular reading allows one to take UTF-8 representations from within Python, copy them into an ASCII file, and have them be read in to Unicode. Under the "string-escape" decode, the slashes won't be doubled.
This allows for the sort of round trip that I was imagining.
This works for reading a file with UTF-8 encoding in Python 3.2:
import codecs
f = codecs.open('file_name.txt', 'r', 'UTF-8')
for line in f:
print(line)
# -*- encoding: utf-8 -*-
# converting a unknown formatting file in utf-8
import codecs
import commands
file_location = "jumper.sub"
file_encoding = commands.getoutput('file -b --mime-encoding %s' % file_location)
file_stream = codecs.open(file_location, 'r', file_encoding)
file_output = codecs.open(file_location+"b", 'w', 'utf-8')
for l in file_stream:
file_output.write(l)
file_stream.close()
file_output.close()
Aside from codecs.open(), io.open() can be used in both 2.x and 3.x to read and write text files. Example:
import io
text = u'á'
encoding = 'utf8'
with io.open('data.txt', 'w', encoding=encoding, newline='\n') as fout:
fout.write(text)
with io.open('data.txt', 'r', encoding=encoding, newline='\n') as fin:
text2 = fin.read()
assert text == text2
To read in an Unicode string and then send to HTML, I did this:
fileline.decode("utf-8").encode('ascii', 'xmlcharrefreplace')
Useful for python powered http servers.
Well, your favorite text editor does not realize that \xc3\xa1 are supposed to be character literals, but it interprets them as text. That's why you get the double backslashes in the last line -- it's now a real backslash + xc3, etc. in your file.
If you want to read and write encoded files in Python, best use the codecs module.
Pasting text between the terminal and applications is difficult, because you don't know which program will interpret your text using which encoding. You could try the following:
>>> s = file("f1").read()
>>> print unicode(s, "Latin-1")
Capitán
Then paste this string into your editor and make sure that it stores it using Latin-1. Under the assumption that the clipboard does not garble the string, the round trip should work.
You have stumbled over the general problem with encodings: How can I tell in which encoding a file is?
Answer: You can't unless the file format provides for this. XML, for example, begins with:
<?xml encoding="utf-8"?>
This header was carefully chosen so that it can be read no matter the encoding. In your case, there is no such hint, hence neither your editor nor Python has any idea what is going on. Therefore, you must use the codecs module and use codecs.open(path,mode,encoding) which provides the missing bit in Python.
As for your editor, you must check if it offers some way to set the encoding of a file.
The point of UTF-8 is to be able to encode 21-bit characters (Unicode) as an 8-bit data stream (because that's the only thing all computers in the world can handle). But since most OSs predate the Unicode era, they don't have suitable tools to attach the encoding information to files on the hard disk.
The next issue is the representation in Python. This is explained perfectly in the comment by heikogerlach. You must understand that your console can only display ASCII. In order to display Unicode or anything >= charcode 128, it must use some means of escaping. In your editor, you must not type the escaped display string but what the string means (in this case, you must enter the umlaut and save the file).
That said, you can use the Python function eval() to turn an escaped string into a string:
>>> x = eval("'Capit\\xc3\\xa1n\\n'")
>>> x
'Capit\xc3\xa1n\n'
>>> x[5]
'\xc3'
>>> len(x[5])
1
As you can see, the string "\xc3" has been turned into a single character. This is now an 8-bit string, UTF-8 encoded. To get Unicode:
>>> x.decode('utf-8')
u'Capit\xe1n\n'
Gregg Lind asked: I think there are some pieces missing here: the file f2 contains: hex:
0000000: 4361 7069 745c 7863 335c 7861 316e Capit\xc3\xa1n
codecs.open('f2','rb', 'utf-8'), for example, reads them all in a separate chars (expected) Is there any way to write to a file in ASCII that would work?
Answer: That depends on what you mean. ASCII can't represent characters > 127. So you need some way to say "the next few characters mean something special" which is what the sequence "\x" does. It says: The next two characters are the code of a single character. "\u" does the same using four characters to encode Unicode up to 0xFFFF (65535).
So you can't directly write Unicode to ASCII (because ASCII simply doesn't contain the same characters). You can write it as string escapes (as in f2); in this case, the file can be represented as ASCII. Or you can write it as UTF-8, in which case, you need an 8-bit safe stream.
Your solution using decode('string-escape') does work, but you must be aware how much memory you use: Three times the amount of using codecs.open().
Remember that a file is just a sequence of bytes with 8 bits. Neither the bits nor the bytes have a meaning. It's you who says "65 means 'A'". Since \xc3\xa1 should become "à" but the computer has no means to know, you must tell it by specifying the encoding which was used when writing the file.
The \x.. sequence is something that's specific to Python. It's not a universal byte escape sequence.
How you actually enter in UTF-8-encoded non-ASCII depends on your OS and/or your editor. Here's how you do it in Windows. For OS X to enter a with an acute accent you can just hit option + E, then A, and almost all text editors in OS X support UTF-8.
You can also improve the original open() function to work with Unicode files by replacing it in place, using the partial function. The beauty of this solution is you don't need to change any old code. It's transparent.
import codecs
import functools
open = functools.partial(codecs.open, encoding='utf-8')
I was trying to parse iCal using Python 2.7.9:
from icalendar import Calendar
But I was getting:
Traceback (most recent call last):
File "ical.py", line 92, in parse
print "{}".format(e[attr])
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 7: ordinal not in range(128)
and it was fixed with just:
print "{}".format(e[attr].encode("utf-8"))
(Now it can print liké á böss.)
I found the most simple approach by changing the default encoding of the whole script to be 'UTF-8':
import sys
reload(sys)
sys.setdefaultencoding('utf8')
any open, print or other statement will just use utf8.
Works at least for Python 2.7.9.
Thx goes to https://markhneedham.com/blog/2015/05/21/python-unicodeencodeerror-ascii-codec-cant-encode-character-uxfc-in-position-11-ordinal-not-in-range128/ (look at the end).

Categories