I'm having some brain failure in understanding reading and writing text to a file (Python 2.4).
# The string, which has an a-acute in it.
ss = u'Capit\xe1n'
ss8 = ss.encode('utf8')
repr(ss), repr(ss8)
("u'Capit\xe1n'", "'Capit\xc3\xa1n'")
print ss, ss8
print >> open('f1','w'), ss8
>>> file('f1').read()
'Capit\xc3\xa1n\n'
So I type in Capit\xc3\xa1n into my favorite editor, in file f2.
Then:
>>> open('f1').read()
'Capit\xc3\xa1n\n'
>>> open('f2').read()
'Capit\\xc3\\xa1n\n'
>>> open('f1').read().decode('utf8')
u'Capit\xe1n\n'
>>> open('f2').read().decode('utf8')
u'Capit\\xc3\\xa1n\n'
What am I not understanding here? Clearly there is some vital bit of magic (or good sense) that I'm missing. What does one type into text files to get proper conversions?
What I'm truly failing to grok here is what the point of the UTF-8 representation is if you can't actually get Python to recognize it when it comes from outside. Maybe I should just JSON-dump the string and use that instead, since that has an ASCII-able representation! More to the point: is there an ASCII representation of this Unicode object that Python will recognize and decode when it comes in from a file? If so, how do I get it?
>>> import simplejson
>>> simplejson.dumps(ss)
'"Capit\u00e1n"'
>>> print >> file('f3','w'), simplejson.dumps(ss)
>>> simplejson.load(open('f3'))
u'Capit\xe1n'
Rather than mess with .encode and .decode, specify the encoding when opening the file. The io module, added in Python 2.6, provides an io.open function, which allows specifying the file's encoding.
Supposing the file is encoded in UTF-8, we can use:
>>> import io
>>> f = io.open("test", mode="r", encoding="utf-8")
Then f.read() returns a decoded Unicode object:
>>> f.read()
u'Capit\xe1n\n'
In 3.x, the io.open function is an alias for the built-in open function, which supports the encoding argument (it does not in 2.x).
We can also use open from the codecs standard library module:
>>> import codecs
>>> f = codecs.open("test", "r", "utf-8")
>>> f.read()
u'Capit\xe1n\n'
Note, however, that this can cause problems when mixing read() and readline().
In the notation u'Capit\xe1n\n' (should be just 'Capit\xe1n\n' in 3.x, and must be in 3.0 and 3.1), the \xe1 represents just one character. \x is an escape sequence, indicating that e1 is in hexadecimal.
Writing Capit\xc3\xa1n into the file in a text editor means that it actually contains the literal characters \xc3\xa1. Those are 8 bytes, and the code reads them all. We can see this by displaying the result:
# Python 3.x - reading the file as bytes rather than text,
# to ensure we see the raw data
>>> open('f2', 'rb').read()
b'Capit\\xc3\\xa1n\n'
# Python 2.x
>>> open('f2').read()
'Capit\\xc3\\xa1n\n'
Instead, just input characters like á in the editor, which should then handle the conversion to UTF-8 and save it.
In 2.x, a string that actually contains these backslash-escape sequences can be decoded using the string_escape codec:
# Python 2.x
>>> print 'Capit\\xc3\\xa1n\n'.decode('string_escape')
Capitán
The result is a str that is encoded in UTF-8 where the accented character is represented by the two bytes that were written \\xc3\\xa1 in the original string. To get a unicode result, decode again with UTF-8.
In 3.x, the string_escape codec is replaced with unicode_escape, and it is strictly enforced that we can only encode from a str to bytes, and decode from bytes to str. unicode_escape needs to start with a bytes in order to process the escape sequences (the other way around, it adds them); and then it will treat the resulting \xc3 and \xa1 as character escapes rather than byte escapes. As a result, we have to do a bit more work:
# Python 3.x
>>> 'Capit\\xc3\\xa1n\n'.encode('ascii').decode('unicode_escape').encode('latin-1').decode('utf-8')
'Capitán\n'
Now all you need in Python 3 is open(Filename, 'r', encoding='utf-8')
[Edit on 2016-02-10 for requested clarification]
Python 3 added the encoding parameter to its open function. The following information about the open function is gathered from here: https://docs.python.org/3/library/functions.html#open
open(file, mode='r', buffering=-1,
encoding=None, errors=None, newline=None,
closefd=True, opener=None)
Encoding is the name of the encoding used to decode or encode the
file. This should only be used in text mode. The default encoding is
platform dependent (whatever locale.getpreferredencoding()
returns), but any text encoding supported by Python can be used.
See the codecs module for the list of supported encodings.
So by adding encoding='utf-8' as a parameter to the open function, the file reading and writing is all done as UTF-8. (Note that while UTF-8 is also the default source encoding in Python 3, the default encoding used by open() itself remains platform dependent, so it is safest to pass it explicitly.)
So, I've found a solution for what I'm looking for, which is:
print open('f2').read().decode('string-escape').decode("utf-8")
There are some unusual codecs that are useful here. This particular reading allows one to take UTF-8 representations from within Python, copy them into an ASCII file, and have them be read in to Unicode. Under the "string-escape" decode, the slashes won't be doubled.
This allows for the sort of round trip that I was imagining.
This works for reading a file with UTF-8 encoding in Python 3.2:
import codecs
f = codecs.open('file_name.txt', 'r', 'UTF-8')
for line in f:
    print(line)
# -*- encoding: utf-8 -*-
# Convert a file of unknown encoding to UTF-8 (Python 2.x).
# Uses the external `file` utility to guess the source encoding.
import codecs
import commands

file_location = "jumper.sub"
file_encoding = commands.getoutput('file -b --mime-encoding %s' % file_location)

file_stream = codecs.open(file_location, 'r', file_encoding)
file_output = codecs.open(file_location + "b", 'w', 'utf-8')
for l in file_stream:
    file_output.write(l)
file_stream.close()
file_output.close()
Aside from codecs.open(), io.open() can be used in both 2.x and 3.x to read and write text files. Example:
import io
text = u'á'
encoding = 'utf8'
with io.open('data.txt', 'w', encoding=encoding, newline='\n') as fout:
    fout.write(text)
with io.open('data.txt', 'r', encoding=encoding, newline='\n') as fin:
    text2 = fin.read()
assert text == text2
To read in a Unicode string and then send it to HTML, I did this:
fileline.decode("utf-8").encode('ascii', 'xmlcharrefreplace')
Useful for Python-powered HTTP servers.
Well, your favorite text editor does not realize that \xc3\xa1 are supposed to be character literals, but it interprets them as text. That's why you get the double backslashes in the last line -- it's now a real backslash + xc3, etc. in your file.
If you want to read and write encoded files in Python, best use the codecs module.
Pasting text between the terminal and applications is difficult, because you don't know which program will interpret your text using which encoding. You could try the following:
>>> s = file("f1").read()
>>> print unicode(s, "Latin-1")
Capitán
Then paste this string into your editor and make sure that it stores it using Latin-1. Under the assumption that the clipboard does not garble the string, the round trip should work.
You have stumbled over the general problem with encodings: How can I tell in which encoding a file is?
Answer: You can't unless the file format provides for this. XML, for example, begins with:
<?xml encoding="utf-8"?>
This header was carefully chosen so that it can be read no matter the encoding. In your case, there is no such hint, hence neither your editor nor Python has any idea what is going on. Therefore, you must use the codecs module and use codecs.open(path,mode,encoding) which provides the missing bit in Python.
As for your editor, you must check if it offers some way to set the encoding of a file.
The point of UTF-8 is to be able to encode 21-bit characters (Unicode) as an 8-bit data stream (because that's the only thing all computers in the world can handle). But since most OSs predate the Unicode era, they don't have suitable tools to attach the encoding information to files on the hard disk.
The next issue is the representation in Python. This is explained perfectly in the comment by heikogerlach. You must understand that your console can only display ASCII. In order to display Unicode or anything >= charcode 128, it must use some means of escaping. In your editor, you must not type the escaped display string but what the string means (in this case, you must enter the umlaut and save the file).
That said, you can use the Python function eval() to turn an escaped string into a string:
>>> x = eval("'Capit\\xc3\\xa1n\\n'")
>>> x
'Capit\xc3\xa1n\n'
>>> x[5]
'\xc3'
>>> len(x[5])
1
As you can see, the string "\xc3" has been turned into a single character. This is now an 8-bit string, UTF-8 encoded. To get Unicode:
>>> x.decode('utf-8')
u'Capit\xe1n\n'
Gregg Lind asked: I think there are some pieces missing here. The file f2 contains, in hex:
0000000: 4361 7069 745c 7863 335c 7861 316e Capit\xc3\xa1n
codecs.open('f2','rb', 'utf-8'), for example, reads them all in as separate chars (expected). Is there any way to write to a file in ASCII that would work?
Answer: That depends on what you mean. ASCII can't represent characters > 127. So you need some way to say "the next few characters mean something special" which is what the sequence "\x" does. It says: The next two characters are the code of a single character. "\u" does the same using four characters to encode Unicode up to 0xFFFF (65535).
So you can't directly write Unicode to ASCII (because ASCII simply doesn't contain the same characters). You can write it as string escapes (as in f2); in this case, the file can be represented as ASCII. Or you can write it as UTF-8, in which case, you need an 8-bit safe stream.
Your solution using decode('string-escape') does work, but you must be aware of how much memory you use: three times the amount used by codecs.open().
Remember that a file is just a sequence of bytes with 8 bits. Neither the bits nor the bytes have a meaning. It's you who says "65 means 'A'". Since \xc3\xa1 should become "á", but the computer has no way to know, you must tell it by specifying the encoding that was used when writing the file.
The \x.. sequence is something that's specific to Python. It's not a universal byte escape sequence.
How you actually enter UTF-8-encoded non-ASCII text depends on your OS and/or your editor (on Windows, consult your editor's documentation). On OS X, to enter an a with an acute accent you can just hit Option + E, then A, and almost all text editors on OS X support UTF-8.
You can also make the original open() function work with Unicode files by shadowing it in place, using functools.partial. The beauty of this solution is that you don't need to change any old code. It's transparent.
import codecs
import functools
open = functools.partial(codecs.open, encoding='utf-8')
I was trying to parse iCal using Python 2.7.9:
from icalendar import Calendar
But I was getting:
Traceback (most recent call last):
  File "ical.py", line 92, in parse
    print "{}".format(e[attr])
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 7: ordinal not in range(128)
and it was fixed with just:
print "{}".format(e[attr].encode("utf-8"))
(Now it can print liké á böss.)
I found the simplest approach to be changing the default encoding of the whole script to 'UTF-8':
import sys
reload(sys)
sys.setdefaultencoding('utf8')
Any open, print, or other statement will then just use UTF-8.
This works at least for Python 2.7.9.
Thanks go to https://markhneedham.com/blog/2015/05/21/python-unicodeencodeerror-ascii-codec-cant-encode-character-uxfc-in-position-11-ordinal-not-in-range128/ (see the end).
I want to read files with special file names in Python (2.7). But whatever I try, it always fails to open them.
The filenames are
F\xA8\xB9hrerschein
and
Gro\xDFhandel
I know the encoding was done with one of several codepages. I could try to find out which one and convert it, and all that mumbo jumbo, but I don't want to.
Can't I somehow tell Python to open the file without having to go through all that encoding stuff? I mean, opening the file by its raw name in bytes?
In the end, I fixed it with
reload(sys)
sys.setdefaultencoding('utf-8')
and setting the environment variable
LANG="C.UTF-8"
Thanks for the hints.
One way is to use os.listdir(). See the following example.
Add some data to a file with non-ascii character 0xdf in the name:
$ echo abcd > `printf "A\xdfA"`
Check that the file contains a non-ascii character:
$ ls A*
A?A
Start Python, read the directory and open the first file (which is the one with the non-ascii character):
$ python
>>> import os
>>> d = os.listdir('.')
>>> d
['A\xdfA']
>>> f = open(d[0])
>>> f.readline()
'abcd\n'
>>>
If you have source code like
with open('Großhandel') as input:
    # stuff
You should look at Source Code Encodings and write
#!python2
# -*- coding: utf-8 -*-
with open('Großhandel') as input:
    …
It is worth mentioning that the authors of PEP 263 are Marc-André Lemburg and Martin von Löwis, which I suppose makes the push toward defined source encodings back in 2002 slightly more understandable.
Under Linux, filenames can be encoded in any character encoding. When opening a file, you must use the exact byte string, encoded to match.
That is, if the filename Großhandel.txt is encoded as UTF-8, you must pass the bytes Gro\xc3\x9fhandel.txt.
If you pass a Unicode string to open(), the user's locale is used to encode the filename, which may or may not match the filename's actual encoding.
Under OS X, UTF-8 encoding is enforced. Under Windows, the character encoding is abstracted by the I/O drivers. A Unicode object passed to open() should always be used on these operating systems; it will be converted appropriately.
If you're reading filenames from the filesystem, it's useful to get already-decoded Unicode filenames that you can pass straight to open(). Fortunately, you can pass Unicode strings to os.listdir().
E.g.
Locale: LANG=en_GB.UTF-8
A directory with the following files, with their filenames encoded to UTF-8:
test.txt
€.txt
When running Python 2.7 with a byte-string path:
>>> os.listdir(".")
['\xe2\x82\xac.txt', 'test.txt']
Using a Unicode path:
>>> os.listdir(u".")
[u'\u20ac.txt', u'test.txt']
I'd like to find a function that accesses a directory using the os.listdir() function (or any other method) and returns all the file names in that directory but converts non-ASCII characters into their unicode format. For example, if I had the file Hello WorlЪ.py, I'd like for the function to return Hello Worl\u042a.py or something equivalent. Any help is appreciated.
If you pass os.listdir a unicode path, then os.listdir returns unicode:
os.listdir(u'.')
From the docs:
Changed in version 2.3: On Windows NT/2k/XP and Unix, if path is a Unicode object, the result will be a list of Unicode objects. Undecodable filenames will still be returned as string objects.
str.encode("unicode_escape") will encode a string in the way you described.
>>> print(u"Hello WorlЪ.py".encode("unicode_escape"))
Hello Worl\u042a.py
I have a file with the following path : D:/bar/クレイジー・ヒッツ!/foo.abc
I am parsing the path from an XML file and storing it in a variable called path in the form of file://localhost/D:/bar/クレイジー・ヒッツ!/foo.abc
Then, the following operations are being done:
path=path.strip()
path=path[17:] #to remove the file://localhost/ part
path=urllib.url2pathname(path)
path=urllib.unquote(path)
The error is:
IOError: [Errno 2] No such file or directory: 'D:\\bar\\\xe3\x82\xaf\xe3\x83\xac\xe3\x82\xa4\xe3\x82\xb8\xe3\x83\xbc\xe3\x83\xbb\xe3\x83\x92\xe3\x83\x83\xe3\x83\x84\xef\xbc\x81\\foo.abc'
I am using Python 2.7 on Windows 7
The path in your error is:
'\xe3\x82\xaf\xe3\x83\xac\xe3\x82\xa4\xe3\x82\xb8\xe3\x83\xbc\xe3\x83\xbb\xe3\x83\x92\xe3\x83\x83\xe3\x83\x84\xef\xbc\x81'
I think this is the UTF8 encoded version of your filename.
I've created a folder of the same name on Windows7 and placed a file called 'abc.txt' in it:
>>> a = '\xe3\x82\xaf\xe3\x83\xac\xe3\x82\xa4\xe3\x82\xb8\xe3\x83\xbc\xe3\x83\xbb\xe3\x83\x92\xe3\x83\x83\xe3\x83\x84\xef\xbc\x81'
>>> os.listdir('.')
['?????\xb7???!']
>>> os.listdir(u'.') # Pass unicode to have unicode returned to you
[u'\u30af\u30ec\u30a4\u30b8\u30fc\u30fb\u30d2\u30c3\u30c4\uff01']
>>>
>>> a.decode('utf8') # UTF8 decoding your string matches the listdir output
u'\u30af\u30ec\u30a4\u30b8\u30fc\u30fb\u30d2\u30c3\u30c4\uff01'
>>> os.listdir(a.decode('utf8'))
[u'abc.txt']
So it seems that Duncan's suggestion of path.decode('utf8') does the trick.
Update
I can't test this for you, but I suggest that you try checking whether the path contains non-ascii before doing the .decode('utf8'). This is a bit hacky...
ASCII_TRANS = '_'*32 + ''.join([chr(x) for x in range(32,126)]) + '_'*130

path = path.strip()
path = path[17:]  # to remove the file://localhost/ part
path = urllib.unquote(path)
if path.translate(ASCII_TRANS) != path:  # contains non-ASCII
    path = path.decode('utf8')
path = urllib.url2pathname(path)
Provide the filename as a unicode string to the open call.
How do you produce the filename?
if provided as a constant by you
Add a line near the beginning of your script:
# -*- coding: utf8 -*-
Then, in a UTF-8 capable editor, set path to the unicode filename:
path = u"D:/bar/クレイジー・ヒッツ!/foo.abc"
read from a list of directory contents
Retrieve the contents of the directory using a unicode dirspec:
dir_files= os.listdir(u'.')
read from a text file
Open the filename-containing file using codecs.open to read Unicode data from it. You need to specify the encoding of that file (because you know which "default Windows charset" non-Unicode applications use on your computer).
in any case
Do a:
path= path.decode("utf8")
before opening the file; substitute the correct encoding if not "utf8".
Here's some interesting stuff from the documentation:
sys.getfilesystemencoding()

Return the name of the encoding used to convert Unicode filenames into system file names, or None if the system default encoding is used. The result value depends on the operating system: On Mac OS X, the encoding is 'utf-8'. On Unix, the encoding is the user's preference according to the result of nl_langinfo(CODESET), or None if nl_langinfo(CODESET) failed. On Windows NT+, file names are Unicode natively, so no conversion is performed. getfilesystemencoding() still returns 'mbcs', as this is the encoding that applications should use when they explicitly want to convert Unicode strings to byte strings that are equivalent when used as file names. On Windows 9x, the encoding is 'mbcs'.

New in version 2.3.
If I understand this correctly, you should pass the file name as unicode:
f = open(unicode(path, encoding))
I am on Python 2.6 for Windows.
I use os.walk to read a file tree. Files may have non-7-bit characters (the German "ae", for example) in their filenames. These are encoded in Python's internal string representation.
I am processing these filenames with Python library functions and that fails due to wrong encoding.
How can I convert these filenames to proper (unicode?) python strings?
I have a file "d:\utest\ü.txt". Passing the path as unicode does not work:
>>> list(os.walk('d:\\utest'))
[('d:\\utest', [], ['\xfc.txt'])]
>>> list(os.walk(u'd:\\utest'))
[(u'd:\\utest', [], [u'\xfc.txt'])]
If you pass a Unicode string to os.walk(), you'll get Unicode results:
>>> list(os.walk(r'C:\example')) # Passing an ASCII string
[('C:\\example', [], ['file.txt'])]
>>>
>>> list(os.walk(ur'C:\example')) # Passing a Unicode string
[(u'C:\\example', [], [u'file.txt'])]
I was looking for a solution for Python 3.0+. I'll put it up here in case someone else needs it.
import os
import sys

rootdir = 'D:\\COUNTRY\\ROADS\\'  # note: a raw string literal cannot end with a backslash
fs_enc = sys.getfilesystemencoding()

for (root, dirnames, filenames) in os.walk(rootdir.encode(fs_enc)):
    # do your stuff here, but remember that now
    # root, dirnames and filenames are bytes objects, not str
    pass
A more direct way might be to try the following: find your file system's encoding and then convert the filename to Unicode. For example:
unicode_name = unicode(filename, "utf-8", errors="ignore")
To go the other way:
unicode_name.encode("utf-8")
os.walk(unicode(root_dir, 'utf-8'))
os.walk isn't specified to always use os.listdir, and its documentation doesn't say how Unicode is handled either. However, the os.listdir documentation does say:
Changed in version 2.3: On Windows NT/2k/XP and Unix, if path is a Unicode object, the result will be a list of Unicode objects. Undecodable filenames will still be returned as string objects.
Does simply using a Unicode argument work for you?
for dirpath, dirnames, filenames in os.walk(u"."):
    print dirpath
    for fn in filenames:
        print " ", fn
No, they are not encoded in Python's internal string representation; there is no such thing. They are encoded in the encoding of the operating system/file system. Passing in Unicode does work for os.walk, though.
I don't know how os.walk behaves when filenames can't be decoded, but I assume that you'll get a string back, like with os.listdir(). In that case you'll again have problems later. Also, not all of Python 2.x standard library will accept unicode parameters properly, so you may need to encode them as strings anyway. So, the problem may in fact be somewhere else, but you'll notice if that is the case. ;-)
If you need more control over the decoding, you can pass in a byte string instead and then decode the filenames yourself with
filename = filename.decode()
as usual.