Python 3 unicode encode error

I'm using glob.glob to get a list of files from a directory input. When trying to open said files, Python fights me back with this error:
UnicodeEncodeError: 'charmap' codec can't encode character '\xf8' in position 18: character maps to <undefined>
By defining a string variable first, I can do this:
filePath = r"C:\Users\Jørgen\Tables\\"
Is there some way to apply the 'r' (raw string) prefix to a variable?
EDIT:
import glob

di = r"C:\Users\Jørgen\Tables\\"

def main():
    fileList = getAllFileURLsInDirectory(di)
    print(fileList)

def getAllFileURLsInDirectory(directory):
    return glob.glob(directory + '*.xls*')
There is a lot more code, but this problem stops the process.

Regardless of whether you use a raw string literal or a normal string literal, the Python interpreter must know the source code encoding. It seems your file uses some 8-bit encoding, not UTF-8. Therefore you have to add a line like
# -*- coding: cp1252 -*-
at the beginning of the file (or whichever encoding is actually used for the source file). It need not be the first line, but it is usually the first or second (the first should contain #!python3 for a script used on Windows).
Anyway, it is usually better not to use non-ASCII characters in file and directory names.
You can also use forward slashes in the path (the same way as on Unix-based systems). Also, have a look at os.path.join when you need to compose paths.
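As an illustration, here is one way the setup could look with a coding declaration, forward slashes, and os.path.join combined (a sketch only; the directory name is taken from the question):
# -*- coding: cp1252 -*-
import glob
import os.path

directory = "C:/Users/Jørgen/Tables"         # forward slashes work on Windows
pattern = os.path.join(directory, "*.xls*")  # os.path.join picks the separator
print(glob.glob(pattern))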
Updated
The problem is probably not where you are looking for it. My guess is that the error manifests only when you want to display the resulting list via print. This is usually because the console by default uses a non-Unicode encoding that is not capable of displaying the character. Try the chcp command without arguments in your cmd window to see which code page that is.
You can modify the print call in your main() function to convert the string representation to an ASCII one that can always be displayed:
print(ascii(fileList))
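For example (the list below is made up), ascii() escapes the offending character so any console can display it:
>>> print(ascii(['C:/Users/Jørgen/Tables/a.xls']))
['C:/Users/J\xf8rgen/Tables/a.xls']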

Please also see:
Convert python filenames to unicode
and
Listing chinese filenames in directory with python
You can tell Python to explicitly handle strings as Unicode, but you have to maintain that from the first string onward. In this case, pass u'somepath' to os.walk().
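A minimal Python 2 sketch of that idea (the directory path here is illustrative):
# -*- coding: utf-8 -*-
import os

# A unicode argument makes os.walk yield unicode names back.
for root, dirs, files in os.walk(u"C:/Users/Jørgen/Tables"):
    print repr(root), repr(files)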

Related

Python sqlite3 connect with special characters in path

I have an application that is compiled with PyInstaller that uses a sqlite database. Everything works fine until a user with special characters in their name runs the software. Even simple code like this:
import sqlite3
path = "C:\\Users\\Jøen\\test.db"
db = sqlite3.connect(path)
Results in a traceback:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
sqlite3.OperationalError: unable to open database file
I have tried all kinds of combinations including using chardet to detect the encoding and then converting to UTF-8 but that didn't work either. All of my usual Python encoding/decoding tricks are failing me at this point.
Has anyone successfully opened a SQLite DB in Python that has special characters in a path?
So if any of you have international or special characters in your user path, here is some test code that could potentially help me:
import os
import sqlite3
path = os.path.expanduser("~")
sqlite3.connect(path + "\\test.db")
I see two issues:
\t is a tab character and \U starts an 8-hex-digit Unicode escape, so a version of this path written with single backslashes would be mangled.
You'd need to encode to the platform filesystem encoding, sys.getfilesystemencoding(), which on Windows is usually UTF-16 (little endian) or MBCS (multi-byte character set, really meaning "any of the supported multi-byte encodings", including UTF-16), but not UTF-8. Or just pass in a Unicode string and let Python worry about this for you.
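If you want to see what that encoding is on a given machine, a quick check (typical values noted in the comment):
import sys

# Usually 'mbcs' on Windows under Python 2 and 'utf-8' on most Unix systems.
print(sys.getfilesystemencoding())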
On Python 2, the following should work:
path = ur"C:\Users\Jøen\test.db"
This uses a raw unicode string literal, meaning that it will a) not interpret \t as a tab but as two separate characters, and b) produce a Unicode string for Python to then encode to the correct filesystem encoding.
Alternatively, on Windows forward slashes are also acceptable as separators, or you could double the backslashes to properly escape them:
path = u"C:/Users/Jøen/test.db"
path = u"C:\\Users\\Jøen\\test.db"
On Python 3, just drop the u and still not encode:
path = r"C:\Users\Jøen\test.db"
To build a path from the home directory, use Unicode strings everywhere and use os.path.join() to compose the path. Unfortunately, os.path.expanduser() is not Unicode-aware on Python 2 (see bug 28171), so using it requires decoding with sys.getfilesystemencoding(), and this can actually fail (see Problems with umlauts in python appdata environment variable as to why). You could of course try anyway:
import os
import sqlite3
import sys

path = os.path.expanduser("~").decode(sys.getfilesystemencoding())
sqlite3.connect(os.path.join(path, u"test.db"))
But relying on retrieving the Unicode value of the environment variables instead would ensure you get an uncorrupted value; building on Problems with umlauts in python appdata environment variable, that could look like:
import ctypes
import os
import sqlite3

def getEnvironmentVariable(name):
    name = unicode(name)  # make sure the string argument is unicode
    n = ctypes.windll.kernel32.GetEnvironmentVariableW(name, None, 0)
    if n == 0:
        return None
    buf = ctypes.create_unicode_buffer(u'\0' * n)
    ctypes.windll.kernel32.GetEnvironmentVariableW(name, buf, n)
    return buf.value

if 'HOME' in os.environ:
    userhome = getEnvironmentVariable('HOME')
elif 'USERPROFILE' in os.environ:
    userhome = getEnvironmentVariable('USERPROFILE')

sqlite3.connect(os.path.join(userhome, u"test.db"))
The approach I found that actually works without having to deal with encoding (which I never did find a solution to) is to use the answer from here:
How to get Windows short file name in python?
The short name appears to always have the non-ASCII characters removed, based on my testing. I realize this is a kludge, but I could not find another way.
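For completeness, a sketch of that kludge using ctypes (based on the approach in the linked answer; the example path and its short form are illustrative):
import ctypes

def get_short_path_name(long_name):
    # The first call asks Windows for the required buffer size,
    # the second call fills the buffer with the 8.3 short path.
    GetShortPathNameW = ctypes.windll.kernel32.GetShortPathNameW
    needed = GetShortPathNameW(long_name, None, 0)
    buf = ctypes.create_unicode_buffer(needed)
    GetShortPathNameW(long_name, buf, needed)
    return buf.value

print(get_short_path_name(u"C:\\Users\\Jøen"))  # e.g. u'C:\\Users\\JEN~1'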

File name encoding in Python 2.7

I want to read files with special file names in Python (2.7). But whatever I try, it always fails to open them.
The filenames are
F\xA8\xB9hrerschein
and
Gro\xDFhandel
I know the encoding was done with one of several codepages. I could try to find out which one, convert it, and all that mumbo jumbo, but I don't want to.
Can't I somehow tell Python to open the file without having to go through all that encoding stuff? I mean, open the file by its raw name in bytes?
In the end, I fixed it with
reload(sys)
sys.setdefaultencoding('utf-8')
and setting the environment variable
LANG="C.UTF-8"
Thanks for the hints.
One way is to use os.listdir(). See the following example.
Add some data to a file with the non-ASCII character 0xdf in its name:
$ echo abcd > `printf "A\xdfA"`
Check that the file contains a non-ascii character:
$ ls A*
A?A
Start Python, read the directory and open the first file (which is the one with the non-ascii character):
$ python
>>> import os
>>> d = os.listdir('.')
>>> d
['A\xdfA']
>>> f = open(d[0])
>>> f.readline()
'abcd\n'
>>>
If you have source code like
with open('Großhandel') as input:
    # stuff
You should look at Source Code Encodings and write
#!python2
# -*- coding: utf-8 -*-
with open('Großhandel') as input:
    …
It is worth mentioning that the authors of PEP 263 are Marc-André Lemburg and Martin von Löwis, which I suppose makes the push toward defined source encodings back in 2002 slightly more understandable.
Under Linux, filenames can be encoded in any character encoding. When opening a file, you must use the exact bytes of the name as stored on disk.
I.e. if the filename is Großhandel.txt encoded using UTF-8, it must be passed as Gro\xc3\x9fhandel.txt.
If you pass a Unicode string to open(), the user's locale is used to encode the filename, which may or may not match the encoding of the actual filename.
Under OS X, UTF-8 encoding is enforced. Under Windows, the character encoding is abstracted by the I/O drivers. A Unicode object passed to open() should always be used on these operating systems, where it will be converted appropriately.
If you're reading filenames from the filesystem, it would be useful to get decoded Unicode filenames to pass straight to open(). Fortunately, you can: pass a Unicode string to os.listdir().
E.g.
Locale: LANG=en_GB.UTF-8
A directory with the following files, with their filenames encoded to UTF-8:
test.txt
€.txt
Running Python 2.7 with a byte string path:
>>> os.listdir(".")
['\xe2\x82\xac.txt', 'test.txt']
Using a Unicode path:
>>> os.listdir(u".")
[u'\u20ac.txt', u'test.txt']
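Tying that together, a short Python 2 sketch (run inside the directory above): listing with a Unicode path returns decoded names that can be passed straight to open():
import os

for name in os.listdir(u"."):
    with open(name) as f:  # Python encodes the unicode name for you
        print repr(name), len(f.read())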

'ascii' codec can't encode error when reading using Python to read JSON

Yet another person unable to find the correct magic incantation to get Python to print UTF-8 characters.
I have a JSON file. The JSON file contains string values. One of those string values contains the character "à". I have a Python program that reads in the JSON file and prints some of the strings in it. Sometimes when the program tries to print the string containing "à" I get the error
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 12: ordinal not in range(128)
This is hard to reproduce. Sometimes a slightly different program is able to print the string "à". A smaller JSON file containing only this string does not exhibit the problem. If I start sprinkling encode('utf-8') and decode('utf-8') around the code it changes what blows up in unpredictable ways. I haven't been able to create a minimal code fragment and input that exhibits this problem.
I load the JSON file like so.
with codecs.open(filename, 'r', 'utf-8') as f:
    j = json.load(f)
I'll pull out the offending string like so.
s = j['key']
Later I do a print that has s as part of it and see the error.
I'm pretty sure the original file is in UTF-8 because in the interactive command line
codecs.open(filename, 'r', 'utf-8').read()
returns a string but
codecs.open(filename, 'r', 'ascii').read()
gives an error about the ascii codec not being able to decode such-and-such a byte. The file size in bytes is identical to the number of characters returned by wc -c, and I don't see anything else that looks like a non-ASCII character, so I suspect the problem lies entirely with this one high-ASCII "à".
I am not making any explicit calls to str() in my code.
I've been through the Python Unicode HOWTO multiple times. I understand that I'm supposed to "sandwich" unicode handling. I think I'm doing this, but obviously there's something I still misunderstand.
Mostly I'm confused because it seems like if I specify 'utf-8' in the codecs.open call, everything should be happening in UTF-8. I don't understand how the ASCII codec still sneaks in.
What am I doing wrong? How do I go about debugging this?
Edit: Used io module in place of codecs. Same result.
Edit: I don't have a minimal example, but at least I have a minimal repro scenario.
I am printing an object derived from the strings in the JSON that is causing the problem. So the following gives an error.
print(myobj)
(Note that I am using from __future__ import print_function though that does not appear to make a difference.)
Putting an encode('utf-8') on the end of my object's __str__ function return value does not fix the bug. However changing the print line to this does.
print("%s" % myobj)
This looks wrong to me. I'd expect these two print calls to be equivalent.
I can make this work by doing the sys.setdefaultencoding hack:
import sys
reload(sys)
sys.setdefaultencoding("UTF-8")
But this is apparently a bad idea that can make Python malfunction in other ways.
What is the correct way to do this? I tried
env PYTHONIOENCODING=UTF-8 ./myscript.py
but that didn't work. (Unsurprisingly, since the issue is the default encoding, not the io encoding.)
When you write directly to a file, or redirect stdout to a file or pipe, the default encoding is ASCII, and you have to encode Unicode strings before writing them. With explicitly opened file handles you can set an encoding to have this happen automatically, but with print you must use an encode() method.
print s.encode('utf-8')
It is recommended to use the newer io module in place of codecs because it has an improved implementation and is forward compatible with Py3.x open().
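For example, the io version of the reading code from the question (the filename here is an illustrative stand-in for the question's variable) keeps everything in the Unicode sandwich and encodes only at the output boundary:
import io
import json

filename = 'data.json'  # stand-in for the question's filename variable
with io.open(filename, 'r', encoding='utf-8') as f:
    j = json.load(f)

s = j['key']             # a unicode object, possibly containing u'\xe0'
print s.encode('utf-8')  # encode explicitly when printing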

Python's glob module and Unix's find command don't recognize non-ASCII

I am on Mac OS X 10.8.2
When I try to find files with filenames that contain non-ASCII-characters I get no results although I know for sure that they are existing. Take for example the console input
> find */Bärlauch*
I get no results. But if I try without the umlaut I get
> find */B*rlauch*
images/Bärlauch1.JPG
So the file is definitely existing. If I rename the file replacing 'ä' by 'ae' the file is being found.
Similarly, the Python module glob is not able to find the file:
>>> glob.glob('*/B*rlauch*')
['images/Bärlauch1.JPG']
>>> glob.glob('*/Bärlauch*')
[]
I figured out it must have something to do with the encoding, but my terminal is set to UTF-8 and I am using Python 3.3.0, which uses Unicode strings.
Mac OS X always stores filenames on HFS+ in decomposed (NFD) form. Use unicodedata.normalize('NFD', pattern) to decompose the glob pattern the same way.
import unicodedata
glob.glob(unicodedata.normalize('NFD', '*/Bärlauch*'))
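To see the difference between the two normal forms (a quick Python 3 illustration):
import unicodedata

nfc = unicodedata.normalize('NFC', 'ä')  # one code point, U+00E4
nfd = unicodedata.normalize('NFD', 'ä')  # 'a' followed by combining U+0308
print(len(nfc), len(nfd))                # prints: 1 2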
Python programs are fundamentally text files. Conventionally, people write them using only characters from the ASCII character set, and thus do not have to think about the encoding they write them in: all character sets agree on how ASCII characters should be decoded.
You have written a Python program using a non-ASCII character, so your program comes with an implicit encoding (which you haven't mentioned): to save such a file, you have to decide how to represent the a-umlaut on disk. I would guess that perhaps your editor has chosen something non-Unicode for you.
Anyway, there are two ways around such a problem: either you can restrict yourself to using only ASCII characters in the source code of your program, or you can declare to Python that you want it to read the text file with a specific encoding.
To do the former, you should replace the a-umlaut with its Unicode escape sequence ('\xe4'). To do the latter, you should add a coding declaration at the top of the file:
# -*- coding: <your encoding> -*-

python noob question about codecs and utf-8

Using Python to pick apart some pieces, so definitely a noob question here, but I didn't see a satisfactory answer.
I have a UTF-8 JSON file with some pieces that have graves, acutes, etc. I'm using codecs and have (for example):
import codecs
import json

str = codecs.open('../../publish_scripts/locations.json', 'r', 'utf-8')
locations = json.load(str)
for location in locations:
    print location['name']
Does anything special need to be done for printing? It's giving me the following:
'ascii' codec can't encode character u'\xe9' in position 5
That looks like the correct value for e-acute. I suspect I'm doing something wrong with printing. Would the iteration cause the string to lose its UTF-8'ness?
The PHP and Ruby versions handle the UTF-8 piece fine; is there some looseness in those languages that Python won't allow?
thx
codecs.open() will decode the contents of the file using the codec you supplied (utf-8). You then have a Python unicode object (which behaves similarly to a string object).
Printing a unicode object will cause an implicit (behind-the-scenes) encode using the default codec, which is usually ascii. If ascii cannot encode all of the characters present, it will fail.
To print it, you should first encode it, thus:
for location in locations:
    print location['name'].encode('utf8')
EDIT:
For your info, json.load() actually takes a file-like object (which is what codecs.open() returns). What you have at that point is neither a string nor a unicode object, but an iterable wrapper around the file.
By default json.load() expects the file to be utf8 encoded so your code snippet can be simplified:
locations = json.load(open('../../publish_scripts/locations.json'))
for location in locations:
    print location['name'].encode('utf8')
You're probably reading the file correctly. The error occurs when you're printing. Python tries to convert the unicode string to ascii, and fails on the character in position 5.
Try this instead:
print location['name'].encode('utf-8')
If your terminal is set to expect output in utf-8 format, this will print correctly.
It's the same as in PHP. UTF8 strings are good to print.
The standard I/O streams are broken for non-ASCII character I/O in Python 2 and some site.py setups. Basically, you need to call sys.setdefaultencoding('utf8') (or whatever the system locale's encoding is) very early in your script. With the site.py shipped in Ubuntu, you need imp.reload(sys) to make sys.setdefaultencoding available again. Alternatively, you can wrap sys.stdout (and stdin and stderr) in Unicode-aware readers/writers, which you can get from codecs.getreader / codecs.getwriter.
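A sketch of the stream-wrapping alternative (Python 2; UTF-8 is assumed as the terminal encoding):
import codecs
import sys

# Encode unicode objects to UTF-8 transparently on their way to stdout.
sys.stdout = codecs.getwriter('utf-8')(sys.stdout)
print u'caf\xe9'  # prints 'café' with no explicit .encode()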
