I am querying a MySQL database with sqlalchemy and getting the following error:
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 498-499: unexpected end of data
A column in the table was defined as Unicode(500) so this error suggests to me that there is an entry that was truncated because it was longer than 500 characters. Is there a way to handle this error and still load the entry? Is there a way to find the errant entry and delete it other than trying to load every entry one by one (or in batches) until I get the error?
In short, you should change:
Unicode(500)
to:
Unicode(500, unicode_errors='ignore', convert_unicode='force')
(Python 2 code follows, but the principles hold in Python 3; only some of the output will differ.)
What's going on is that when you decode a bytestring, it complains if the bytestring can't be decoded, with the error you saw.
>>> u = u'ABCDEFGH\N{TRADE MARK SIGN}'
>>> u
u'ABCDEFGH\u2122'
>>> print(u)
ABCDEFGH™
>>> s = u.encode('utf-8')
>>> s
'ABCDEFGH\xe2\x84\xa2'
>>> truncated = s[:-1]
>>> truncated
'ABCDEFGH\xe2\x84'
>>> truncated.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/cliffdyer/.virtualenvs/edx-platform/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 8-9: unexpected end of data
Python provides different optional modes of handling decode errors, though. Raising an exception is the default, but you can also drop the malformed bytes (truncating the text) or convert them to the official Unicode replacement character.
>>> truncated.decode('utf-8', errors='replace')
u'ABCDEFGH\ufffd'
>>> truncated.decode('utf-8', errors='ignore')
u'ABCDEFGH'
This is exactly what's happening within the column handling.
Looking at the Unicode and String classes in sqlalchemy/sql/sqltypes.py, it looks like there is a unicode_errors argument you can pass to the constructor, which passes its value through to the codec's errors argument. There is also a note that you will need to set convert_unicode='force' to make it work.
Thus Unicode(500, unicode_errors='ignore', convert_unicode='force') should solve your problem, if you're okay with truncating the ends of your data.
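For example (a sketch: the table and column names are invented, and these keyword arguments, as given above, apply to SQLAlchemy versions that still support convert_unicode, which was removed in 1.4):
from sqlalchemy import Column, Integer, MetaData, Table, Unicode

metadata = MetaData()
entries = Table(
    'entries', metadata,
    Column('id', Integer, primary_key=True),
    # Malformed byte sequences coming back from MySQL are now dropped
    # instead of raising UnicodeDecodeError.
    Column('text', Unicode(500, unicode_errors='ignore',
                           convert_unicode='force')),
)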
If you have some control over the database, you should be able to prevent this issue in the future by defining your database to use the utf8mb4 character set. (Don't just use utf8, or it will fail on four byte utf8 characters, including most emojis). Then you will be guaranteed to have valid utf-8 stored in and returned from your database.
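Here is a minimal sketch of what that looks like from the SQLAlchemy side (the connection URL, table, and column names are invented; mysql_charset is the MySQL dialect's table-level option, and utf8mb4 requires MySQL 5.5.3 or later):
from sqlalchemy import Column, Integer, Unicode, create_engine
from sqlalchemy.ext.declarative import declarative_base

# Request utf8mb4 on the connection as well as on the table itself.
engine = create_engine('mysql+mysqldb://user:secret@localhost/mydb?charset=utf8mb4')

Base = declarative_base()

class Comment(Base):
    __tablename__ = 'comments'
    __table_args__ = {'mysql_charset': 'utf8mb4'}
    id = Column(Integer, primary_key=True)
    body = Column(Unicode(500))

Base.metadata.create_all(engine)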
In short, your MySQL setup is incorrect in that it truncates UTF-8 characters in mid-sequence. I would double-check that MySQL is actually configured for UTF-8, both in the session settings and in the tables themselves.
I would suggest switching to PostgreSQL (seriously) to avoid this kind of problem: not only does PostgreSQL handle UTF-8 properly in its default configuration, but it also never silently truncates a string to fit into a column, raising an error instead:
psql (9.5.3, server 9.5.3)
Type "help" for help.
testdb=> create table foo(bar varchar(4));
CREATE TABLE
testdb=> insert into foo values ('aaaaa');
ERROR: value too long for type character varying(4)
This is also not unlike the Zen of Python:
Explicit is better than implicit.
and
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
Declare the column you are loading into as a BLOB. After loading the data, run a few diagnostics, such as
SELECT MAX(LENGTH(col)) FROM ... -- to see what the longest is in _bytes_.
Copy the data into another BLOB column and do
ALTER TABLE t MODIFY col2 TEXT CHARACTER SET utf8 ... -- to see if it converts correctly
If that succeeds, then do
SELECT MAX(CHAR_LENGTH(col2)) ... -- to see if the longest is more than 500 _characters_.
After you have tried a few things like that, we can see what direction to take next.
Related
I am trying to write data in a StringIO object using Python and then ultimately load this data into a postgres database using psycopg2's copy_from() function.
When I first did this, copy_from() threw an error: ERROR: invalid byte sequence for encoding "UTF8": 0xc92. So I followed this question.
I figured out that my Postgres database has UTF8 encoding.
The file/StringIO object I am writing my data into shows its encoding as the following:
setgid Non-ISO extended-ASCII English text, with very long lines, with CRLF line terminators
I tried to encode every string that I am writing to the intermediate file/StringIO object into UTF-8. To do this, I used .encode(encoding='UTF-8', errors='strict') for every string.
This is the error I got now:
UnicodeDecodeError: 'ascii' codec can't decode byte 0x92 in position 47: ordinal not in range(128)
What does it mean? How do I fix it?
EDIT:
I am using Python 2.7
Some pieces of my code:
I read from a MySQL database that has data encoded in UTF-8 as per MySQL Workbench.
This is a few lines code for writing my data (that's obtained from MySQL db) to StringIO object:
# Populate the table_data variable with rows delimited by \n and columns delimited by \t
row_num = 0
for row in cursor.fetchall():
    # Separate rows in a table by new line delimiter
    if row_num != 0:
        table_data.write("\n")
    col_num = 0
    for cell in row:
        # Separate cells in a row by tab delimiter
        if col_num != 0:
            table_data.write("\t")
        table_data.write(cell.encode(encoding='UTF-8', errors='strict'))
        col_num = col_num + 1
    row_num = row_num + 1
This is the code that writes to Postgres database from my StringIO object table_data:
cursor = db_connection.cursor()
cursor.copy_from(table_data, <postgres_table_name>)
The problem is that you're calling encode on a str object.
A str is a byte string, usually representing text encoded in some way, such as UTF-8. When you call encode on it, it first has to be decoded back to text so the text can be re-encoded. By default, Python does that by calling s.decode(sys.getdefaultencoding()), and getdefaultencoding() usually returns 'ascii'.
So, you're taking UTF-8 encoded text, decoding it as if it were ASCII, then re-encoding it in UTF-8.
The general solution is to explicitly call decode with the right encoding, instead of letting Python use the default, and then encode the result.
But when the right encoding is already the one you want, the easier solution is to skip the .decode('utf-8').encode('utf-8') round trip and use the UTF-8 str as the UTF-8 str it already is.
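In your loop, that means writing the cell as-is (assuming the driver really hands you UTF-8 byte strings):
table_data.write(cell)  # cell is already UTF-8 bytes; don't re-encode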
Or, alternatively, if your MySQL wrapper has a feature to let you specify an encoding and get back unicode values for CHAR/VARCHAR/TEXT columns instead of str values (e.g., in MySQLdb, you pass use_unicode=True to the connect call, or charset='utf8' if your database is too old to auto-detect it), just do that. Then you'll have unicode objects, and you can call .encode('utf-8') on them.
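A minimal sketch of that approach with MySQLdb (the connection details, table, and column are placeholders):
from StringIO import StringIO
import MySQLdb

# use_unicode=True makes the driver decode text columns for us.
conn = MySQLdb.connect(host='localhost', user='me', passwd='secret',
                       db='mydb', use_unicode=True, charset='utf8')
cursor = conn.cursor()
cursor.execute("SELECT name FROM people")  # hypothetical table/column

table_data = StringIO()
for (name,) in cursor.fetchall():
    # name is a unicode object here, so one explicit encode is safe.
    table_data.write(name.encode('utf-8'))
    table_data.write('\n')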
In general, the best way to deal with Unicode problems is the last one—decode everything as early as possible, do all the processing in Unicode, and then encode as late as possible. But either way, you have to be consistent. Don't call str on something that might be a unicode; don't concatenate a str literal to a unicode or pass one to its replace method; etc. Any time you mix and match, Python is going to implicitly convert for you, using your default encoding, which is almost never what you want.
As a side note, this is one of the many things that Python 3.x's Unicode changes help with. First, str is now Unicode text, not encoded bytes. More importantly, if you have encoded bytes, e.g., in a bytes object, calling encode will give you an AttributeError instead of trying to silently decode so it can re-encode. And, similarly, trying to mix and match Unicode and bytes will give you an obvious TypeError, instead of an implicit conversion that succeeds in some cases and gives a cryptic message about an encode or decode you didn't ask for in others.
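Both failure modes are easy to demonstrate in a Python 3 session (tracebacks abbreviated; exact messages vary slightly between 3.x versions):
>>> b'ABCDEFGH\xe2\x84\xa2'.encode('utf-8')
AttributeError: 'bytes' object has no attribute 'encode'
>>> 'text' + b'bytes'
TypeError: can only concatenate str (not "bytes") to str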
I'm trying to deal with Unicode in Python 2.7.2. I know there is the .encode('utf-8') thing, but half the time when I add it I get errors, and half the time when I don't add it I get errors.
Is there any way to tell Python, which I thought was an up-to-date and modern language, to just use Unicode for strings and not make me have to fart around with .encode('utf-8') stuff?
I know... Python 3.0 is supposed to do this, but I can't use 3.0, and 2.7 isn't all that old anyways...
For example:
url = "http://en.wikipedia.org//w/api.php?action=query&list=search&format=json&srlimit=" + str(items) + "&srsearch=" + urllib2.quote(title.encode('utf-8'))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 19: ordinal not in range(128)
Update
If I remove all my .encode statements from all my code and add # -*- coding: utf-8 -*- to the top of my file, right under the #!/usr/bin/python line, then I get the following, the same as if I hadn't added the # -*- coding: utf-8 -*- at all.
/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py:1250: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  return ''.join(map(quoter, s))
Traceback (most recent call last):
  File "classes.py", line 583, in <module>
    wiki.getPage(title)
  File "classes.py", line 146, in getPage
    url = "http://en.wikipedia.org/w/api.php?action=query&prop=revisions&format=json&rvprop=content&rvlimit=1&titles=" + urllib2.quote(title)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 1250, in quote
    return ''.join(map(quoter, s))
KeyError: u'\xf1'
I'm not manually typing in any strings; I'm parsing HTML and JSON from websites. So the scripts/bytestreams/whatever they are, are all created by Python.
Update 2
I can move the error along, but it just keeps coming up in new places. I was hoping Python would be a useful scripting tool, but after 3 days of no luck I'll just try a different language. It's a shame; Python is preinstalled on OS X. I've marked correct the answer that fixed the one instance of the error I posted.
This is a very old question but just wanted to add one partial suggestion. While I sympathise with the OP's pain - having gone through it a lot myself - here's one (partial) answer to make things "easier". Put this at the top of any Python 2.7 script:
from __future__ import unicode_literals
This will at least ensure that your own literal strings default to unicode rather than str.
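For example, in a Python 2 session:
>>> from __future__ import unicode_literals
>>> type('this looks like a plain literal')
<type 'unicode'>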
There is no way to make unicode "just work" apart from using unicode strings everywhere and immediately decoding any encoded string you receive. The problem is that you MUST ALWAYS keep straight whether you're dealing with encoded or unencoded data, or use tools that keep track of it for you, or you're going to have a bad time.
Python 2 does some things that are problematic for this: it makes str the "default" rather than unicode for things like string literals, it silently coerces str to unicode when you add the two, and it lets you call .encode() on an already-encoded string to double-encode it. As a result, there are a lot of python coders and python libraries out there that have no idea what encodings they're designed to work with, but are nonetheless designed to deal with some particular encoding since the str type is designed to let the programmer manage the encoding themselves. And you have to think about the encoding each time you use these libraries since they don't support the unicode type themselves.
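For example, the silent coercion means an innocent-looking concatenation only blows up when non-ASCII data shows up:
>>> u'title: ' + '\xc3\xa9'  # the str is silently decoded as ASCII first
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)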
In your particular case, the first error tells you you're dealing with encoded UTF-8 data and trying to double-encode it, while the 2nd tells you you're dealing with UNencoded data. It looks like you may have both. You should really find and fix the source of the problem (I suspect it has to do with the silent coercion I mentioned above), but here's a hack that should fix it in the short term:
encoded_title = title
if isinstance(encoded_title, unicode):
    encoded_title = title.encode('utf-8')
If this is in fact a case of silent coercion biting you, you should be able to easily track down the problem using the excellent unicode-nazi tool:
python -Werror -municodenazi myprog.py
This will give you a traceback right at the point unicode leaks into your non-unicode strings, instead of trying troubleshooting this exception way down the road from the actual problem. See my answer on this related question for details.
Yes, define your unicode data as unicode literals:
>>> u'Hi, this is unicode: üæ'
u'Hi, this is unicode: üæ'
You usually want to use \uxxxx Unicode escapes or set a source code encoding. The following line at the top of your module, for example, sets the encoding to UTF-8:
# -*- coding: utf-8 -*-
Read the Python Unicode HOWTO for the details, such as default encodings and such (the default source code encoding, for example, is ASCII).
As for your specific example, your title is not a Unicode literal but a Python byte string, and Python is trying to decode it to unicode for you just so you can encode it again. This fails, as the default codec for such automatic conversions is ASCII:
>>> 'å'.encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
Encoding only applies to actual unicode strings, so a byte string needs to be explicitly decoded:
>>> 'å'.decode('utf-8').encode('utf-8')
'\xc3\xa5'
If you are used to Python 3, then unicode literals in Python 2 (u'') are the new default string type in Python 3, while regular (byte) strings in Python 2 ('') are the same as bytes objects in Python 3 (b'').
If you have errors both with and without the encode call on title, you have mixed data. Test the title and encode as needed:
if isinstance(title, unicode):
    title = title.encode('utf-8')
You may want to find out what produces the mixed unicode / byte string titles though, and correct that source to always produce one or the other.
Be sure that the title in your title.encode("utf-8") is of type unicode, and don't use str("İŞşĞğÖöÜü");
use unicode("ĞğıIİiÖöŞşcçÇ") in your stringifiers instead.
Actually, the easiest way to make Python work with unicode is to use Python 3, where everything is unicode by default.
Unfortunately, there are not many libraries written for Python 3 yet, and there are some basic differences in syntax and keyword use. That's the problem I have: the libraries I need are only available for Python 2.7, and I don't know enough to convert them to Python 3. :(
Working on importing a tab-delimited file over HTTP in Python.
Before inserting a row's data into MongoDB, I'm removing slashes, ticks and quotes from the string.
Whatever the encoding of the data is, MongoDB is throwing me the exception:
bson.errors.InvalidStringData: strings in documents must be valid UTF-8
So, in an endeavour to solve this problem, based on the reading I've done, I want to convert the row's data to Unicode as early as I can, using the unicode() function. I have also tried calling the decode() function, passing "unicode" as the first parameter, but I receive the error:
LookupError: unknown encoding: unicode
From there, I can make my string manipulations such as replacing the slashes, ticks, and quotes. Then before inserting the data into MongoDB, convert it to UTF-8 using the str.encode('utf-8') function.
Problem: When converting to Unicode, I am receiving the error
UnicodeDecodeError: 'ascii' codec can't decode byte 0x93 in position 1258: ordinal not in range(128)
With this error, I'm not exactly sure where to continue.
My question is this: How do I successfully import the data from a file without knowing its encoding and successfully insert it into MongoDB, which requires UTF-8?
Thanks Much!
Try these in order:
(0) Check that your removal of the slashes/ticks/etc. is not butchering the data. What's a tick? Please show your code, and show a sample of the raw data: use print repr(sample_raw_data) and copy/paste the output into an edit of your question.
(1) There's an old maxim: "If the encoding of a file is unknown, or stated to be ISO-8859-1, it is cp1252." Where are you getting the file from? If it's coming from Western Europe, the Americas, or any English/French/Spanish-speaking country/territory elsewhere, and it's not valid UTF-8, then it's likely to be cp1252.
[Edit 2] Your error byte 0x93 decodes to U+201C LEFT DOUBLE QUOTATION MARK in all of the encodings cp1250 through cp1258 inclusive ... what language is the text written in? (See the quick checks after this list.) [/Edit 2]
(2) Save the file (before tick removal), then open the file in your browser: Does it look sensible? What do you see when you click on View / Character Encoding?
(3) Try chardet
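Two quick checks in a Python session (the chardet result shown is illustrative; its guess is probabilistic, so treat it as a hint, not a fact):
>>> '\x93Hello\x94'.decode('cp1252')  # 0x93/0x94 are curly quotes in cp1252
u'\u201cHello\u201d'
>>> import chardet
>>> chardet.detect(raw_data)  # raw_data is the undecoded byte string
{'encoding': 'windows-1252', 'confidence': 0.84}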
Edit with some more advice:
Once you know what the encoding is (let's assume it's cp1252):
(1) convert your input data to unicode: uc = raw_data.decode('cp1252')
(2) process the data (remove slashes/ticks/etc) as unicode: clean_uc = manipulate(uc)
(3) you need to output your data encoded as utf8: to_mongo = clean_uc.encode('utf8')
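Putting the three steps together (a sketch: the filename and the cleanup operations are placeholders, and cp1252 is only an assumption until chardet or your browser confirms it):
raw_data = open('import.tsv', 'rb').read()
uc = raw_data.decode('cp1252')                        # (1) bytes -> unicode, once, on input
clean_uc = uc.replace(u'\\', u'').replace(u'"', u'')  # (2) manipulate as unicode
to_mongo = clean_uc.encode('utf8')                    # (3) unicode -> UTF-8 bytes for MongoDB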
Note 1: Your error message says "can't decode byte 0x93 in position 1258" ... 1258 bytes is a rather long chunk of text; is this reasonable? Have you had a look at the data that it is complaining about? How? what did you see?
Note 2: Please consider reading the Python Unicode HOWTO and this article
I want to send Chinese characters to be translated by an online service, and have the resulting English string returned. I'm using simple JSON and urllib for this.
And yes, I am declaring
# -*- coding: utf-8 -*-
on top of my code.
Now everything works fine if I feed urllib a string type object, even if that object contains what would be Unicode information. My function is called translate.
For example:
stringtest1 = '無與倫比的美麗'
print translate(stringtest1)
results in the proper translation and doing
type(stringtest1)
confirms this to be a string object.
But if I do
stringtest1 = u'無與倫比的美麗'
and try to use my translation function I get this error:
File "C:\Python27\lib\urllib.py", line 1275, in urlencode
v = quote_plus(str(v))
UnicodeEncodeError: 'ascii' codec can't encode characters in position 2-8: ordinal not in range(128)
After researching a bit, it seems this is a common problem:
Problem: neither urllib2.quote nor urllib.quote encode the unicode strings arguments
urllib.quote throws exception on Unicode URL
Now, if I type in a script
stringtest1 = '無與倫比的美麗'
stringtest2 = u'無與倫比的美麗'
print 'stringtest1',stringtest1
print 'stringtest2',stringtest2
execution of it returns:
stringtest1 無與倫比的美麗
stringtest2 無與倫比的美麗
But just typing the variables in the console:
>>> stringtest1
'\xe7\x84\xa1\xe8\x88\x87\xe5\x80\xab\xe6\xaf\x94\xe7\x9a\x84\xe7\xbe\x8e\xe9\xba\x97'
>>> stringtest2
u'\u7121\u8207\u502b\u6bd4\u7684\u7f8e\u9e97'
gets me that.
My problem is that I don't control how the information to be translated comes to my function. And it seems it may arrive in Unicode form, which is not accepted by the function.
So, how do I convert one thing into the other?
I've read Stack Overflow question Convert Unicode to a string in Python (containing extra symbols).
But this is not what I'm after. urllib accepts the str object but not the unicode object, even though both contain the same information, at least in the eyes of the web application I'm sending the unchanged information to; I'm not sure whether they're still equivalent things in Python.
When you get a unicode object and want to return a UTF-8 encoded byte string from it, use theobject.encode('utf8').
It seems strange that you don't know whether the incoming object is a str or unicode -- surely you do control the call sites to that function, too?! But if that is indeed the case, for whatever weird reason, you may need something like:
def ensureutf8(s):
    if isinstance(s, unicode):
        s = s.encode('utf8')
    return s
which only encodes conditionally, that is, if it receives a unicode object, not if the object it receives is already a byte string. It returns a byte string in either case.
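For example, using the byte values from your console session above:
>>> ensureutf8(u'\u7121\u8207')  # unicode in, encoded bytes out
'\xe7\x84\xa1\xe8\x88\x87'
>>> ensureutf8('\xe7\x84\xa1\xe8\x88\x87')  # already a byte string: unchanged
'\xe7\x84\xa1\xe8\x88\x87'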
BTW, part of your confusion seems to be due to the fact that you don't know that just entering an expression at the interpreter prompt will show you its repr, which is not the same effect you get with print;-).
I have a Python 2.6 script that is gagging on special characters, encoded in Latin-1, that I am retrieving from a SQL Server database. I would like to print these characters, but I'm somewhat limited because I am using a library that calls the unicode factory, and I don't know how to make Python use a codec other than ascii.
The script is a simple tool to return lookup data from a database without having to execute the SQL directly in a SQL editor. I use the PrettyTable 0.5 library to display the results.
The core of the script is this bit of code. The tuples I get from the cursor contain integer and string data, and no Unicode data. (I'd use adodbapi instead of pyodbc, which would get me Unicode, but adodbapi gives me other problems.)
x = pyodbc.connect(cxnstring)
r = x.cursor()
r.execute(sql)
t = PrettyTable(columns)
for rec in r:
    t.add_row(rec)
r.close()
x.close()
t.set_field_align("ID", 'r')
t.set_field_align("Name", 'l')
print t
But the Name column can contain characters that fall outside the ASCII range. I'll sometimes get an error message like this, in line 222 of prettytable.pyc, when it gets to the t.add_row call:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xed in position 12: ordinal not in range(128)
This is line 222 in prettytable.py. It uses unicode, which is the source of my problems, and not just in this script, but in other Python scripts that I have written.
for i in range(0, len(row)):
    if len(unicode(row[i])) > self.widths[i]:  # This is line 222
        self.widths[i] = len(unicode(row[i]))
Please tell me what I'm doing wrong here. How can I make unicode work without hacking prettytable.py or any of the other libraries that I use? Is there even a way to do this?
EDIT: The error occurs not at the print statement, but at the t.add_row call.
EDIT: With Bastien Léonard's help, I came up with the following solution. It's not a panacea, but it works.
x = pyodbc.connect(cxnstring)
r = x.cursor()
r.execute(sql)
t = PrettyTable(columns)
for rec in r:
    urec = [s.decode('latin-1') if isinstance(s, str) else s for s in rec]
    t.add_row(urec)
r.close()
x.close()
t.set_field_align("ID", 'r')
t.set_field_align("Name", 'l')
print t.get_string().encode('latin-1')
I ended up having to decode on the way in and encode on the way out. All of this makes me hopeful that everybody ports their libraries to Python 3.x sooner than later!
Add this at the beginning of the module:
# coding: latin1
Or decode the string to Unicode yourself.
[Edit]
It's been a while since I played with Unicode, but hopefully this example will show how to convert from Latin1 to Unicode:
>>> s = u'ééé'.encode('latin1') # a string you may get from the database
>>> s.decode('latin1')
u'\xe9\xe9\xe9'
[Edit]
Documentation:
http://docs.python.org/howto/unicode.html
http://docs.python.org/library/codecs.html
Maybe try to decode the latin1-encoded strings into unicode?
t.add_row((value.decode('latin1') for value in rec))
After a quick peek at the source for PrettyTable, it appears that it works on unicode objects internally (see _stringify_row, add_row and add_column, for example). Since it doesn't know what encoding your input strings are using, it uses the default encoding, usually ascii.
Now ascii is a subset of latin-1, which means if you're converting from ascii to latin-1, you shouldn't have any problems. The reverse however, isn't true; not all latin-1 characters map to ascii characters. To demonstrate this:
>>> s = '\xed\x31\x32\x33'  # a latin-1 encoded byte string
>>> print unicode(s)
# FAILS: unicode() calls "s.decode('ascii')", but the ascii codec can't decode '\xed'
>>> print s.decode('ascii')
# FAILS: same as above
>>> print s.decode('latin-1')
í123
Explicitly converting the strings to unicode (like you eventually did) fixes things, and makes more sense, IMO: you're more likely to know what charset your data is using than the author of PrettyTable :). BTW, note that the isinstance check in your list comprehension is still needed: unicode(s, 'latin-1') with an explicit encoding only accepts strings and buffers, so it would fail on the integer ID column.
One last thing: don't forget to check the character set of your database and tables -- you don't want to assume 'latin-1' in code, when the data is actually being stored as something else ('utf-8'?) in the database. In MySQL, you can use the SHOW CREATE TABLE <table_name> command to find out what character set a table is using, and SHOW CREATE DATABASE <db_name> to do the same for a database.