Python sqlite3 connect with special characters in path

I have an application that is compiled with PyInstaller that uses a sqlite database. Everything works fine until a user with special characters in their name runs the software. Even simple code like this:
import sqlite3
path = "C:\\Users\\Jøen\\test.db"
db = sqlite3.connect(path)
Results in a traceback:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
sqlite3.OperationalError: unable to open database file
I have tried all kinds of combinations including using chardet to detect the encoding and then converting to UTF-8 but that didn't work either. All of my usual Python encoding/decoding tricks are failing me at this point.
Has anyone successfully opened a SQLite DB in Python that has special characters in a path?
So if any of you have international or special characters in your user path, here is some test code that might help me reproduce this:
import os
import sqlite3
path = os.path.expanduser("~")
sqlite3.connect(path + "\\test.db")

I see two issues:
If you write the path with single backslashes (C:\Users\Jøen\test.db), \t is a tab character and \U is the start of an 8-hex-digit Unicode character escape, so the backslashes need escaping or a raw string.
You'd also need to encode to the platform filesystem encoding, sys.getfilesystemencoding(), which on Windows is usually UTF-16 (little endian) or MBCS (multi-byte character set, really meaning any of the supported multi-byte encodings, including UTF-16), but not UTF-8. Or just pass in a Unicode string and let Python worry about the encoding for you.
On Python 2, the following should work:
path = ur"C:\Users\Jøen\test.db"
This uses a raw unicode string literal, meaning that it'll a) not interpret \t as a tab but as two separate characters, and b) produce a Unicode string that Python will then encode to the correct filesystem encoding for you.
Alternatively, on Windows forward slashes are also acceptable as separators, or you could double the backslashes to properly escape them:
path = u"C:/Users/Jøen/test.db"
path = u"C:\\Users\\Jøen\\test.db"
On Python 3, just drop the u prefix and don't encode anything:
path = r"C:\Users\Jøen\test.db"
When building a path from the home directory, use Unicode strings everywhere and use os.path.join() to combine them. Unfortunately, os.path.expanduser() is not Unicode-aware on Python 2 (see bug 28171), so using it requires decoding with sys.getfilesystemencoding(), and even that can fail (see Problems with umlauts in python appdata environvent variable as to why). You could of course try anyway:
path = os.path.expanduser("~").decode(sys.getfilesystemencoding())
sqlite3.connect(os.path.join(path, u"test.db"))
But relying on retrieving the Unicode value of the environment variables instead would ensure you get an uncorrupted value; building on Problems with umlauts in python appdata environvent variable, that could look like this:
import ctypes
import os
def getEnvironmentVariable(name):
    name = unicode(name)  # make sure the string argument is unicode
    n = ctypes.windll.kernel32.GetEnvironmentVariableW(name, None, 0)
    if n == 0:
        return None
    buf = ctypes.create_unicode_buffer(u'\0' * n)
    ctypes.windll.kernel32.GetEnvironmentVariableW(name, buf, n)
    return buf.value

if 'HOME' in os.environ:
    userhome = getEnvironmentVariable('HOME')
elif 'USERPROFILE' in os.environ:
    userhome = getEnvironmentVariable('USERPROFILE')

sqlite3.connect(os.path.join(userhome, u"test.db"))
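On Python 3, none of this juggling is needed: os.path.expanduser() and os.environ already return Unicode str values, so a minimal sketch of the same idea (reusing the question's test.db name) is simply:
import os
import sqlite3

# Python 3: expanduser() returns a Unicode str, so no manual decoding is needed.
home = os.path.expanduser("~")
db = sqlite3.connect(os.path.join(home, "test.db"))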

The way that I found that actually works, without having to deal with encoding (which I never did find a solution to), is to use the answer from here:
How to get Windows short file name in python?
Based on my testing, the short (8.3) name never contains the special characters. I realize this is a kludge, but I could not find another way.
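For reference, that linked answer boils down to something like the following ctypes sketch (hedged: this is my paraphrase rather than the accepted answer verbatim; it needs 8.3 name generation enabled on the volume, and the directory must already exist):
import ctypes
import sqlite3
from ctypes import create_unicode_buffer, wintypes

def get_short_path_name(long_name):
    """Return the Windows 8.3 short form of an existing path (ASCII-safe)."""
    GetShortPathNameW = ctypes.windll.kernel32.GetShortPathNameW
    GetShortPathNameW.argtypes = [wintypes.LPCWSTR, wintypes.LPWSTR, wintypes.DWORD]
    GetShortPathNameW.restype = wintypes.DWORD

    buf = create_unicode_buffer(260)
    needed = GetShortPathNameW(long_name, buf, len(buf))
    if needed > len(buf):  # buffer too small: retry with the required size
        buf = create_unicode_buffer(needed)
        GetShortPathNameW(long_name, buf, len(buf))
    return buf.value       # empty string if the call failed (e.g. path missing)

short_dir = get_short_path_name(u"C:\\Users\\J\u00f8en")  # i.e. C:\Users\Jøen
db = sqlite3.connect(short_dir + u"\\test.db")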

Related

Python 3 unicode encode error

I'm using glob.glob to get a list of files from a directory input. When trying to open said files, Python fights me back with this error:
UnicodeEncodeError: 'charmap' codec can't encode character '\xf8' in position 18: character maps to <undefined>
By defining a string variable first, I can do this:
filePath = r"C:\Users\Jørgen\Tables\\"
Is there some way to get the 'r' encoding for a variable?
EDIT:
import glob

di = r"C:\Users\Jørgen\Tables\\"

def main():
    fileList = getAllFileURLsInDirectory(di)
    print(fileList)

def getAllFileURLsInDirectory(directory):
    return glob.glob(directory + '*.xls*')
There is a lot more code, but this problem stops the process.
Regardless of whether you use a raw string literal or a normal string literal, the Python interpreter must know the source code encoding. It seems your file is saved in some 8-bit encoding, not UTF-8. Therefore you have to add a line like
# -*- coding: cp1252 -*-
at the beginning of the file (or whatever encoding your source files actually use). It need not be the first line, but it usually is the first or second (the first may contain something like #!python3 for scripts run on Windows).
Anyway, it is usually better not to use non-ASCII characters in file and directory names.
You can also use normal slashes in the path (the same way as in Unix-based systems). Also, have a look at os.path.join when you need to compose the paths.
Updated
The problem is probably not where you are looking for it. My guess is that the error manifests only when you try to display the resulting list via print. This is usually because the console uses a non-Unicode encoding by default that cannot display the character. Try the chcp command without arguments in your cmd window.
You can modify the print call in your main() function to convert the string representation to ASCII, which can always be displayed:
print(ascii(fileList))
Please also see:
Convert python filenames to unicode
and
Listing chinese filenames in directory with python
You can tell Python to explicitly handle strings as Unicode -- but you have to maintain that from the first string onward.
In this case, that means passing a u'somepath' to os.walk.
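A minimal sketch of that idea on Python 2, where the bytes/unicode distinction matters (on Python 3 every str is already Unicode):
# -*- coding: utf-8 -*-
import glob

# Because the pattern is a unicode literal, glob returns unicode filenames
# (decoded with the filesystem encoding) instead of raw byte strings.
for name in glob.glob(u"C:/Users/Jørgen/Tables/*.xls*"):
    print(repr(name))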

python2.7 utf-8 input through command line in Windows7

I'm a newbie, and I'm sure a similar question has been asked in the past, but I am having trouble finding/understanding an answer. Thank you in advance for being patient with me!
So I'm trying to write a script to read lines in a utf-8 encoded input file, compare portions of it to an optional command line argument passed in by the user, and if there's a match, to do some stuff to that line before printing it to an output file. I'm using codecs to open the files.
I'm using the argparse module to parse command line arguments right now. The lines in the file can be in all sorts of languages, hence the command line argument needs to also be utf-8.
For example:
A line from the file might look like this:
разъедают {. r ax z . j je . d ax1 . ju t .}
The script should be called from the command line with something like this:
>python myscript.py mytextfile.txt -grapheme ъ
Here's the part of my code that is supposed to do the processing. In this case, orth is some Cyrillic text and grapheme is a Cyrillic character.
def process_orth(orth, grapheme):
    grapheme = grapheme.decode(sys.stdin.encoding).encode('utf-8')
    if (grapheme in orth):
        print 'success, your grapheme was: ' + grapheme.encode('utf-8')
        return True
    else:
        print 'failure, your grapheme was: ' + grapheme.encode('utf-8')
        return False
Unfortunately, even though the grapheme is definitely there, the function returns false and prints a question mark instead of the grapheme:
failure, your grapheme was: ?
I've tried adding the following at the start of process_orth() as per the recommendation of some other post I read, but it didn't seem to work:
grapheme.decode(sys.stdin.encoding).encode('utf-8')
So my question is...
How do I pass utf-8 strings through the command line into a python script? Also, are there any extra quirks with this on Windows7 (and does having cygwin installed change anything)?
If you are opening the input file using codecs.open() then you have unicode data, not encoded data. You would want to just decode grapheme, not encode it again to UTF-8:
grapheme = grapheme.decode(sys.stdin.encoding)

if grapheme in orth:
    print u'success, your grapheme was: ' + grapheme
    return True
Note that we print unicode as well; normally print will ensure that Unicode values are encoded again for your current codepage. This can still fail as Windows console printing is notoriously difficult, see http://wiki.python.org/moin/PrintFails.
Unfortunately, sys.argv on Windows can apparently end up garbled, as Python uses a non-unicode aware system call. See Read Unicode characters from command-line arguments in Python 2.x on Windows for a unicode-aware alternative.
I see no reason for argparse to have any problems with Unicode input, but if it does, you can always take the unicode output from win32_unicode_argv() and encode it to UTF-8 before passing it to argparse.
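For reference, the recipe from that linked answer looks roughly like this (a sketch from memory, not a tested copy):
import sys
from ctypes import POINTER, byref, c_int, cdll, windll
from ctypes.wintypes import LPCWSTR, LPWSTR

def win32_unicode_argv():
    """Return sys.argv as a list of unicode strings via the Win32 API."""
    GetCommandLineW = cdll.kernel32.GetCommandLineW
    GetCommandLineW.argtypes = []
    GetCommandLineW.restype = LPCWSTR

    CommandLineToArgvW = windll.shell32.CommandLineToArgvW
    CommandLineToArgvW.argtypes = [LPCWSTR, POINTER(c_int)]
    CommandLineToArgvW.restype = POINTER(LPWSTR)

    argc = c_int(0)
    argv = CommandLineToArgvW(GetCommandLineW(), byref(argc))
    # Skip the interpreter executable and any interpreter options so the
    # result lines up with what sys.argv already contains.
    start = argc.value - len(sys.argv)
    return [argv[i] for i in range(start, argc.value)]

Replacing sys.argv with win32_unicode_argv() early in the script then gives you proper unicode arguments to work with (encoded to UTF-8 first, if argparse turns out to need that).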

Python's glob module and unix' find command don't recognize non-ascii

I am on Mac OS X 10.8.2
When I try to find files with filenames that contain non-ASCII-characters I get no results although I know for sure that they are existing. Take for example the console input
> find */Bärlauch*
I get no results. But if I try without the umlaut I get
> find */B*rlauch*
images/Bärlauch1.JPG
So the file is definitely existing. If I rename the file replacing 'ä' by 'ae' the file is being found.
Similarily the Python module glob is not able to find the file:
>>> glob.glob('*/B*rlauch*')
['images/Bärlauch1.JPG']
>>> glob.glob('*/Bärlauch*')
[]
I figured out it must have something to do with the encoding but my terminal is set to be utf-8 and I am using Python 3.3.0 which uses unicode strings.
Mac OS X always stores filenames on HFS+ in decomposed form. Use unicodedata.normalize('NFD', pattern) to decompose the glob pattern so it matches.
import unicodedata
glob.glob(unicodedata.normalize('NFD', '*/Bärlauch*'))
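If the rest of your program compares these names against composed (NFC) strings, it can help to normalize the results back, for example:
import glob
import unicodedata

# Decompose the pattern so it matches HFS+'s decomposed filenames, then
# recompose the results for display or comparison elsewhere in the program.
pattern = unicodedata.normalize('NFD', '*/Bärlauch*')
matches = [unicodedata.normalize('NFC', m) for m in glob.glob(pattern)]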
Python programs are fundamentally text files. Conventionally, people write them using only characters from the ASCII character set, and thus do not have to think about the encoding they write them in: all character sets agree on how ASCII characters should be decoded.
You have written a Python program using a non-ASCII character. Your program thus comes with an implicit encoding (which you haven't mentioned): to save such a file, you have to decide how you are going to represent a-umlaut on disk. I would guess that perhaps your editor has chosen something non-Unicode for you.
Anyway, there are two ways around such a problem: either you can restrict yourself to using only ASCII characters in the source code of your program, or you can declare to Python that you want it to read the text file with a specific encoding.
To do the former, you should replace the a-umlaut with its Unicode escape sequence (for ä that is '\u00e4'). To do the latter, you should add a coding declaration at the top of the file:
# -*- coding: <your encoding> -*-
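For instance, a sketch combining both options (the declared coding must match however the file is actually saved; note that on Python 3 the default source encoding is already UTF-8):
# -*- coding: utf-8 -*-
import glob

# Option 1: spell the character directly (relies on the declaration above).
print(glob.glob('*/Bärlauch*'))

# Option 2: keep the source ASCII-only and use the escape for ä (U+00E4).
print(glob.glob('*/B\u00e4rlauch*'))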

Confirm that Python 2.6 ftplib does not support Unicode file names? Alternatives?

Can someone confirm that Python 2.6 ftplib does NOT support Unicode file names? Or must Unicode file names be specially encoded in order to be used with the ftplib module?
The following email exchange seems to support my conclusion that the ftplib module only supports ASCII file names.
Should ftplib use UTF-8 instead of latin-1 encoding?
http://mail.python.org/pipermail/python-dev/2009-January/085408.html
Any recommendations on a 3rd party Python FTP module that supports Unicode file names? I've googled this question without success [1], [2].
The official Python documentation does not mention Unicode file names [3].
Thank you,
Malcolm
[1] ftputil wraps ftplib and inherits ftplib's apparent ASCII only support?
[2] Paramiko's SFTP library does support Unicode file names, however I'm looking specifically for ftp (vs. sftp) support relative to our current project.
[3] http://docs.python.org/library/ftplib.html
WORKAROUND:
The encodings.idna.ToASCII and .ToUnicode methods can be used to convert Unicode path names to an ASCII format. If you wrap all your remote path names and the output of the dir/nlst methods with these functions, then you can create a way to preserve Unicode path names using the standard ftplib (and also preserve Unicode file names on file systems that don't support Unicode paths). The downside to this technique is that other processes on the server will also have to use encodings.idna when referencing the files that you upload to the server. BTW: I understand that this is an abuse of the encodings.idna library.
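A minimal sketch of that workaround (assuming Python 2.x; encodings.idna works on single labels, so real code would have to split paths and handle extensions separately):
# -*- coding: utf-8 -*-
import encodings.idna as idna

# Map a Unicode name to an ASCII-only form before sending it over plain ftplib...
ascii_name = idna.ToASCII(u'Bärlauch')      # an ASCII byte string, e.g. 'xn--...'
# ...and back again when reading directory listings.
unicode_name = idna.ToUnicode(ascii_name)   # u'bärlauch' (nameprep lower-cases)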
Thank you Peter and Bob for your comments which I found very helpful.
ftplib has no knowledge of Unicode whatsoever. It is intended to be passed byte-strings for filenames, and it'll return byte strings when asked for a directory list. Those are the exact strings of bytes passed-to/returned-from the server.
If you pass a Unicode string to ftplib in Python 2.x, it'll end up getting coerced to bytes when it's sent to the underlying socket object. This coercion uses Python's ‘default’ encoding, i.e. US-ASCII for safety, with exceptions raised for non-ASCII characters.
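A quick way to see that coercion fail (Python 2.x):
# Python 2: the implicit coercion uses the ASCII default codec, so a
# non-ASCII unicode filename blows up before it ever reaches the server.
filename = u'B\xe4rlauch.jpg'   # u'Bärlauch.jpg'
filename.encode('ascii')        # raises UnicodeEncodeError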
The python-dev message to which you linked is talking about ftplib in Python 3.x, where strings are Unicode by default. This leaves modules like ftplib in a tricky situation because although they now use Unicode strings at their front-end, the actual protocol behind it is byte-based. There therefore has to be an extra level of encoding/decoding involved, and without explicit intervention to specify what encoding is in use, there's a fair chance it'll choose wrong.
ftplib in 3.x chose to default to ISO-8859-1 in order to preserve each byte as a character inside the Unicode string. Unfortunately this will give unexpected results in the common case where the target server uses a UTF-8 collation for filenames (whether or not the FTP daemon itself knows that filenames are UTF-8, which it commonly won't). There are a number of cases like this where the Python standard libraries have been brutally hacked over to Unicode strings with negative consequences; Python 3's included batteries are still leaking corrosive fluid, IMO.
Personally I would be more worried about what is on the other side of the ftp connection than the support of the library. FTP is a brittle protocol at the best of times without trying to be creative with filenames.
from RFC 959:
Pathname is defined to be the character string which must be
input to a file system by a user in order to identify a file.
Pathname normally contains device and/or directory names, and
file name specification. FTP does not yet specify a standard
pathname convention. Each user must follow the file naming
conventions of the file systems involved in the transfer.
To me that means the filenames should conform to the lowest common denominator. Nowadays the number of DOS servers, VAXes and IBM mainframes is negligible, and chances are you'll end up talking to a Windows or Unix box, so the common denominator is quite high; still, making assumptions about which codepage the remote site will accept seems pretty risky to me.
To get around this, I used the following code
ftp.storbinary("STOR " + target_name.encode( "utf-8" ), open(file_name, 'rb'))
This assumes that the FTP server supports RFC 2640 http://www.ietf.org/rfc/rfc2640.txt which allows for utf-8 file names. In my case I used SwiFTP server for Android and it transfers the files with the proper names successfully.
Can someone confirm that Python 2.6 ftplib does NOT support Unicode file names?
It doesn't.
Should ftplib use UTF-8 instead of latin-1 encoding?
It's debatable. UTF-8 is the preferred encoding as dictated by RFC-2640 but latin-1 is usually more friendly for misbehaving implementations (either server or client).
If the server includes "UTF8" as part of its FEAT response then you should definitely use UTF-8.
>>> utf8_server = 'UTF8' in ftp.sendcmd('FEAT')
To support Unicode in Python 2.x you can use the following ftplib.FTP subclass:
class UnicodeFTP(ftplib.FTP):
    """A ftplib.FTP subclass supporting unicode file names as
    described by RFC-2640."""

    def putline(self, line):
        line = line + '\r\n'
        if isinstance(line, unicode):
            line = line.encode('utf8')
        self.sock.sendall(line)
...and pass unicode strings when using the remaining API as in:
>>> ftp = UnicodeFTP(host='ftp.site.com', user='foo', passwd='bar')
>>> ftp.delete(u'somefile')
We got UTF8 encoded filenames working for Python 2.7's FTPlib.
Note 1: Here's some background that explains UTF8 and Unicode:
https://code.google.com/p/iqbox-ftp/wiki/ProgrammingGuide_UnicodeVsAscii
Note 2: You can take a look at the AGPL libraries we use for IQBox. You might be able to use those (or parts of those), and they support UTF8 over FTP. Look at filetransfer_abc.py
You do need to add code to (1) determine whether the server supports UTF8, and (2) encode the Unicode Python string as UTF-8 before sending it. (3) When you get file listings (full code not shown, since everyone retrieves listings differently), you also need to decode the names: if UTF8_support: name = name.decode('utf-8')
# PART (1): DETERMINE IF SERVER HAS UTF8 SUPPORT:
# Get FTP features:
try:
    features_string_ftp = ftp.sendcmd('FEAT')
    print features_string_ftp
    # Determine UTF8 support:
    if 'UTF8' in features_string_ftp.upper():
        print "FTP>> Server supports international characters (UTF8)"
        UTF8_support = True
    else:
        print "FTP>> Server does NOT support international (non-ASCII) characters."
        UTF8_support = False
except:
    print "FTP>> Could not get list of features using FEAT command."
    print "FTP>> Server does NOT support international (non-ASCII) characters."
    UTF8_support = False
# Part (2): Encode FTP commands needed to be sent using UTF8 encoding, if it's supported.
def sendFTPcommand(ftp, command_string, UTF8_support):
    # Needed for UTF8 international file names etc.
    c = None
    if UTF8_support:
        c = command_string.encode('utf-8')
    else:
        c = command_string
    # TODO: Add try-catch here and connection error retries.
    return ftp.sendcmd(c)
# If you just want to get a string with the UTF8 command and send it yourself, then use this:
def encodeFTPcommand(command_string, UTF8_support):
    # Needed for UTF8 international file names etc.
    c = None
    if UTF8_support:
        c = command_string.encode('utf-8')
    else:
        c = command_string
    return c
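Hypothetical usage of these helpers, assuming ftp is an already-connected ftplib.FTP instance and the filename is just an example:
# Send a command containing a non-ASCII filename, encoding it only when
# the server advertised UTF8 in its FEAT response (see part 1 above).
sendFTPcommand(ftp, u'DELE B\xe4rlauch.jpg', UTF8_support)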

How to convert html entities into symbols?

I have made some adaptations to the script from this answer, and I am having problems with Unicode. Some of the questions end up being rendered badly.
Some answers and responses end up looking like:
Yeah.. I know.. I’m a simpleton.. So what’s a Singleton? (2)
How can I get the ’ translated to the right character?
Note: If that matters, I'm using python 2.6, on a French windows.
>>> sys.getdefaultencoding()
'ascii'
>>> sys.getfilesystemencoding()
'mbcs'
EDIT1: Based on Ryan Ginstrom's post, I have been able to correct a part of the output, but I am having problems with python's unicode.
In Idle / python shell:
Yeah.. I know.. I’m a simpleton.. So
what’s a Singleton?
In a text file, when redirecting stdout
Yeah.. I know.. I’m a simpleton.. So
what’s a Singleton?
How can I correct that ?
Edit2: I have tried Jarret Hardie's solution but it didn't do anything.
I am on windows, using python 2.6, so my site-packages folder is at:
C:\Python26\Lib\site-packages
There was no siteconfig.py file, so I created one, pasted the code provided by Jarret Hardie, and started a Python interpreter, but it seems like it has not been loaded.
sys.getdefaultencoding()
'ascii'
I noticed there is a site.py file at :
C:\Python26\Lib\site.py
I tried changing the encoding in the function
def setencoding():
    """Set the string encoding used by the Unicode implementation.  The
    default is 'ascii', but if you're willing to experiment, you can
    change this."""
    encoding = "ascii"  # Default value set by _PyUnicode_Init()
    if 0:
        # Enable to support locale aware default string encodings.
        import locale
        loc = locale.getdefaultlocale()
        if loc[1]:
            encoding = loc[1]
    if 0:
        # Enable to switch off string to Unicode coercion and implicit
        # Unicode to string conversion.
        encoding = "undefined"
    if encoding != "ascii":
        # On Non-Unicode builds this will raise an AttributeError...
        sys.setdefaultencoding(encoding)  # Needs Python Unicode build !
to set the encoding to utf-8. It worked (after a restart of python of course).
>>> sys.getdefaultencoding()
'utf-8'
The sad thing is that it didn't correct the characters in my program. :(
You should be able to convert HTML/XML entities into Unicode characters. Check out this answer on SO:
Decoding HTML Entities With Python
Basically you want something like this:
import urllib2
from BeautifulSoup import BeautifulStoneSoup

# URL is the address of the page whose entities you want decoded.
soup = BeautifulStoneSoup(urllib2.urlopen(URL),
                          convertEntities=BeautifulStoneSoup.ALL_ENTITIES)
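If you'd rather avoid BeautifulSoup, a hand-rolled standard-library sketch for Python 2.x (handles named and numeric character references only) could look like this:
import re
import htmlentitydefs

def unescape_entities(text):
    """Replace &name;, &#nnnn; and &#xhhhh; references with unicode characters."""
    def replace(match):
        ref = match.group(1)
        if ref.startswith('#'):
            # Numeric reference, e.g. &#8217; or &#x2019;
            if ref[1:2] in ('x', 'X'):
                return unichr(int(ref[2:], 16))
            return unichr(int(ref[1:]))
        # Named reference, e.g. &rsquo;
        if ref in htmlentitydefs.name2codepoint:
            return unichr(htmlentitydefs.name2codepoint[ref])
        return match.group(0)  # leave unknown references untouched
    return re.sub(r'&(#?\w+);', replace, text)

print unescape_entities(u'So what&#8217;s a Singleton?')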
Does changing your default encoding in siteconfig.py work?
In your site-packages directory (on my OS X system it's /Library/Python/2.5/site-packages/) create a file called siteconfig.py. In this file put:
import sys
sys.setdefaultencoding('utf-8')
The setdefaultencoding method is removed from the sys module once siteconfig.py is processed, so you must put it in site-packages so that Python will read it when the interpreter starts up.
