When I use an SQLite3 database with the SQLAlchemy library, I get this error:
sqlalchemy.exc.ProgrammingError: (ProgrammingError)
You must not use 8-bit bytestrings unless you use a text_factory that can
interpret 8-bit bytestrings (like text_factory = str).
It is highly recommended that you instead just switch your application
to Unicode strings.
u'INSERT INTO model_pair (user, password) VALUES (?, ?)' ('\xebE\xc2\xe4.\t312874#gg.com', '123456')
and here is some test data:
呆呆 3098920#gg.com 11111
�言 9707996#gg.com 11111
wwwj55572#gg?€? 11111
I have configured the database encoding as utf-8 and as gbk, but neither works.
Before inserting, I tried str.decode('gbk'), but it gets stuck on characters like € and raises an error like the one above.
Can anyone tell me how to get around this error?
Try changing '\xebE\xc2\xe4.\t312874#gg.com' to the Unicode literal u'\xebE\xc2\xe4.\t312874#gg.com'.
You could also try '\xebE\xc2\xe4.\t312874#gg.com'.decode("utf-8"), but that raises an "invalid continuation byte" error; perhaps your string is not valid UTF-8 after all?
By the way, do mention whether you are running Python 2.x or 3.x; there is a difference.
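To make the decode step concrete, here is a minimal sketch (the question is Python 2, but the same bytes/text round trip works in Python 3, where str.decode becomes bytes.decode; errors='replace' is one assumed way to cope with data that is not valid GBK):

```python
# -*- coding: utf-8 -*-
# Sketch: turn raw 8-bit bytestrings into unicode text before handing
# them to SQLAlchemy, so sqlite3 never sees undecoded bytes.

raw = b'\xebE\xc2\xe4.\t312874#gg.com'   # bytes from the error message above

# errors='replace' swaps undecodable bytes for U+FFFD instead of raising,
# which avoids getting "stuck" on characters the codec cannot handle.
name = raw.decode('gbk', errors='replace')

# A clean round trip for data that really is GBK-encoded:
assert u'呆呆'.encode('gbk').decode('gbk') == u'呆呆'
```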
I have 10 tables in a database. 9 of them only store data with standard ascii 1-byte characters supported by Latin-1. 1 of them requires that I store special characters that are only supported by UTF8. I would like to use the same MySQL connection object (using Python's PyMySQL library) to populate all 10 tables.
Previously, when creating the MySQL connection object, I did not specify the character set and it defaulted to Latin-1. That was fine when I was only populating the 9 Latin-1 tables. Now that I am populating the UTF8 table, I modified the connection object by passing in the parameter charset='utf8mb4' to the PyMySQL connection object function:
# Connect to the database
connection = pymysql.connect(host='localhost',
                             user='user',
                             password='passwd',
                             db='db',
                             charset='utf8mb4',
                             cursorclass=pymysql.cursors.DictCursor)
Now I am confident that, when inserting into my UTF8 MySQL table, all of my data is being stored fine. However, I am unsure if problems may arise when using my UTF8 connection object and inserting into the Latin-1 tables. After my first rounds of testing, everything looks great.
Is there anything I have overlooked? Are there any potential issues with inserting UTF8 encoded characters into a Latin-1 table?
It can be done. But... You must set some things correctly, else you will get any of several forms of garbage.
If the bytes in your client are UTF-8 encoded, then you must tell MySQL that fact. This is usually done on the connect string. Your charset='utf8mb4' connection argument does that. Here are some Python-specific tips: http://mysql.rjweb.org/doc.php/charcoll#python
Meanwhile, the column(s) in the table(s) can be either latin1 or utf8 (since you are sure the data is limited to the characters that are common between them).
A character example: é is hex E9 in latin1 and C3A9 in MySQL's utf8 (or utf8mb4). The conversion occurs during INSERT and SELECT if you correctly state the client's encoding.
(For your purposes, either utf8 or utf8mb4 will work.)
If you have further troubles, see Trouble with utf8 characters; what I see is not what I stored and/or provide SHOW CREATE TABLE and hex of some offending character.
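The é example above can be checked directly in Python, without any database (a quick illustration of the two byte representations being converted between):

```python
# 'é' occupies one byte in latin1 and two bytes in UTF-8;
# MySQL converts between them on INSERT/SELECT when the client
# charset is declared correctly on the connection.
assert u'\xe9'.encode('latin-1') == b'\xe9'       # é in latin1: E9
assert u'\xe9'.encode('utf-8') == b'\xc3\xa9'     # é in utf8:   C3 A9
```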
utf8 and latin1 are both simple encodings, and each supports some characters that the other does not, so problems may occur if you pass utf8 data that has no latin1 equivalent. In that process, double encoding can occur.
Here is a link describing how to insert utf8 data into a latin1 table.
I had the same problem and solved it by using the CONVERT and CAST functions:
mycursor.execute("INSERT INTO `topics` (`title`, parent_id) "
                 "VALUES (convert(cast(convert(%s using utf8) as binary) using latin1), 0)",
                 (name,))
So. This issue is almost exactly the same as the one discussed here -- but the fix (such as it is) discussed in that post doesn't fix things for me.
I'm trying to use Python 2.7.5 and pyodbc 3.0.7 to connect from an Ubuntu 12.04 64bit machine to an IBM Netezza Database. I'm using unixODBC to handle specifying a DSN. This DSN works beautifully from the isql CLI -- so I know it's configured correctly, and unixODBC is ticking right along.
The code is currently dead simple, and easy to reproduce in a REPL:
In [1]: import pyodbc
In [2]: conn = pyodbc.connect(dsn='NZSQL')
In [3]: curs = conn.cursor()
In [4]: curs.execute("SELECT * FROM DB..FOO ORDER BY created_on DESC LIMIT 10")
Out[4]: <pyodbc.Cursor at 0x1a70ab0>
In [5]: curs.fetchall()
---------------------------------------------------------------------------
InvalidOperation Traceback (most recent call last)
<ipython-input-5-ad813e4432e9> in <module>()
----> 1 curs.fetchall()
/usr/lib/python2.7/decimal.pyc in __new__(cls, value, context)
546 context = getcontext()
547 return context._raise_error(ConversionSyntax,
--> 548 "Invalid literal for Decimal: %r" % value)
549
550 if m.group('sign') == "-":
/usr/lib/python2.7/decimal.pyc in _raise_error(self, condition, explanation, *args)
3864 # Errors should only be risked on copies of the context
3865 # self._ignored_flags = []
-> 3866 raise error(explanation)
3867
3868 def _ignore_all_flags(self):
InvalidOperation: Invalid literal for Decimal: u''
So I get a connection, the query returns correctly, and then when I try to get a row... asplode.
Anybody ever managed to do this?
Turns out pyodbc can't gracefully convert all of Netezza's types. The table I was working with had two that are problematic:
A column of type NUMERIC(7,2)
A column of type NVARCHAR(255)
The NUMERIC column causes a decimal conversion error on NULL. The NVARCHAR column returns a utf-16-le encoded string, which is a pain in the ass.
I haven't found a good driver-or-wrapper-level solution yet. This can be hacked by casting types in the SQL statement:
SELECT
foo::FLOAT AS was_numeric
, bar::VARCHAR(255) AS was_nvarchar
I'll post here if I find a lower-level answer.
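The traceback above can be reproduced without any database: Python's decimal module raises the same InvalidOperation when it is handed an empty string, which appears to be what the driver passes through for a NULL NUMERIC value (a minimal repro of the error, not the driver's actual code):

```python
from decimal import Decimal, InvalidOperation

# Decimal cannot parse an empty string; it raises InvalidOperation,
# matching the "Invalid literal for Decimal: u''" in the traceback.
try:
    Decimal(u'')
except InvalidOperation as exc:
    print('reproduced: %s' % exc)
```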
I've just encountered the same problem and found a different solution.
I managed to solve the issue by:
Making sure the following attributes are part of my driver options in the ODBC ini file:
UnicodeTranslationOption = utf16
CharacterTranslationOption = all
Adding the following environment variables:
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:[NETEZZA_LIB_FILES_PATH]
ODBCINI=[ODBC_INI_FULL_PATH]
NZ_ODBC_INI_PATH=[ODBC_INI_FOLDER_PATH]
In my case the values are:
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/nz/lib
ODBCINI=/etc/odbc.ini
NZ_ODBC_INI_PATH=/etc
I'm using CentOS 6 and also installed both the 'unixODBC' and 'unixODBC-devel' packages.
Hope it helps someone.
I'm not sure what your error is, but the code below is allowing me to connect to Netezza via ODBC:
# Connect via pyodbc to listed data sources on machine
import pyodbc
print pyodbc.dataSources()
print "Connecting via ODBC"
# get a connection, if a connect cannot be made an exception will be raised here
conn = pyodbc.connect("DRIVER={NetezzaSQL};SERVER=<myserver>;PORT=<myport>;DATABASE=<mydbschema>;UID=<user>;PWD=<password>;")
print "Connected!\n"
# you can then use conn cursor to perform queries
The Netezza Linux client package includes /usr/local/nz/lib/ODBC_README, which lists all the values for those two attributes:
UnicodeTranslationOption:
Specify translation option for Unicode.
Possible value strings are:
utf8 : unicode data is in utf-8 encoding
utf16 : unicode data is in utf-16 encoding
utf32 : unicode data is in utf-32 encoding
Do not add '-' in the value strings listed above, e.g. "utf-8" is not
a valid string value. These value strings are case insensitive.
On windows this option is not available as windows DM always passes
unicode data in utf16.
CharacterTranslationOption ("Optimize for ASCII character set" on Windows):
Specify translation option for character encodings.
Possible value strings are:
all : Support all character encodings
latin9 : Support Latin9 character encoding only
Do not add '-' in the value strings listed above, e.g. "latin-9"
is not a valid string value. These value strings are case
insensitive.
NPS uses the Latin9 character encoding for char and varchar
datatypes. The character encoding on many Windows systems
is similar, but not identical to this. For the ASCII subset
(letters a-z, A-Z, numbers 0-9 and punctuation marks) they are
identical. If your character data in CHAR or VARCHAR datatypes is
only in this ASCII subset then it will be faster if this box is
checked. If your data has special characters such as the Euro
sign (€) then keep the box unchecked to accurately convert
the encoding. Characters in the NCHAR or NVARCHAR data types
will always be converted appropriately.
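With UnicodeTranslationOption = utf16, NCHAR/NVARCHAR data comes back as UTF-16-LE bytes, which Python can decode directly (a small illustration with a made-up value, not driver output):

```python
# What an NVARCHAR column might hand back with the utf16 translation option:
raw = u'caf\xe9'.encode('utf-16-le')
assert raw == b'c\x00a\x00f\x00\xe9\x00'   # two bytes per BMP character
assert raw.decode('utf-16-le') == u'caf\xe9'
```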
I have a MySQL database encoded in UTF-8, but when I connect to it with SQLAlchemy (Python 2.7), I get back strings with Latin1 Unicode characters in them.
So, the Dutch spelling of Belgium (België) comes out as
'Belgi\xeb'
rather than
'Belgi\xc3\xab'
or, ideally the Unicode object
u'Belgi\xeb'
According to docs (http://docs.sqlalchemy.org/en/rel_0_9/core/engines.html#custom-dbapi-args):
MySQLdb will accommodate Python unicode objects if the use_unicode=1 parameter, or the charset parameter, is passed as a connection argument.
Without this setting, many MySQL server installations default to a latin1 encoding for client connections.
You need to use
create_engine('mysql+mysqldb://HOSTNAME/DATABASE?charset=utf8')
rather than just
create_engine('mysql+mysqldb://HOSTNAME/DATABASE')
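The difference between the two byte sequences in the question is just the two encodings of ë, which you can check without a database:

```python
# 'België' as a unicode object, and its two encoded forms:
s = u'Belgi\xeb'
assert s.encode('latin-1') == b'Belgi\xeb'      # what you are getting back
assert s.encode('utf-8') == b'Belgi\xc3\xab'    # what a utf8 table stores
```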
I created a database in MySQL and use webpy to build my web server.
But webpy's and MySQLdb's behavior with Chinese characters differs strangely when I use each to access the database.
Here is my problem:
My table t_test (utf8 database):

id    name
1     测试

The utf8 encoding of "测试" is: \xe6\xb5\x8b\xe8\xaf\x95
when using MySQLdb to do "select" like this:
c = conn.cursor()
c.execute("SELECT * FROM t_test")
items = c.fetchall()
c.close()
print "items=%s, name=%s" % (items[0], items[0][1])
the result is normal, it prints:
items=(127L, '\xe6\xb5\x8b\xe8\xaf\x95'), name=测试
But when I use webpy do the same things:
db = web.database(dbn='mysql', host="127.0.0.1",
user='test', pw='test', db='db_test', charset="utf8")
eval_items=db.select('t_test')
comment=eval_items[0].name
print "comment code=%s"%repr(comment)
print "comment=%s"%comment.encode("utf8")
The Chinese text comes out garbled; the print result is:
comment code=u'\xe6\xb5\u2039\xe8\xaf\u2022'
comment=忙碌鈥姑€
I know webpy's database layer also depends on MySQLdb, but the two approaches behave so differently. Why?
BTW, for the reason above, I could just use MySQLdb directly to solve my garbled Chinese character problem, but then I lose the column names of the table, which is ungraceful. I want to know how I can solve this with webpy.
Indeed, something very wrong is taking place --
as you said on your comment, the unicode repr. bytes for "测试" are E6B5 8BE8 AF95 -
which works on my utf-8 terminal here:
>>> d
'\xe6\xb5\x8b\xe8\xaf\x95'
>>> print d
测试
But look at the bytes on your "comment" unicode object:
comment code=u'\xe6\xb5\u2039\xe8\xaf\u2022'
Meaning part of your content is the utf-8 bytes for the comment
(the chars represented as "\xYY") and part is encoded as Unicode code points
(the chars represented as \uYYYY) - this indicates serious garbage.
MySQL has some catches around properly decoding (utf-8 or otherwise) encoded
text - one of which is passing a proper "charset" parameter
to the connection. But you did that already -
One attempt you can do is to pass the connection the option use_unicode=False -
and decode the utf-8 strings in your own code.
db = web.database(dbn='mysql', host="127.0.0.1",
user='test', pw='test', db='db_test', charset="utf8", use_unicode=False)
Check the options to connect for this and other parameters you might try:
http://mysql-python.sourceforge.net/MySQLdb.html
Regardless of getting it to work correctly, with the hints above I got a workaround for you -- it looks like the Unicode characters (not the utf-8 raw bytes in the unicode objects)
in your encoded string are encoded in one of these encodings:
("cp1258", "cp1252", "palmos", "cp1254")
Of these, cp1252 is almost the same as "latin1", which is the default charset MySQL uses
if it does not get the "charset" argument on the connection. But it is not only a matter
of webpy not passing it to the database, as you are getting mangled chars, not
just the wrong encoding - it is as if webpy is encoding and decoding your string back and forth, and ignoring encoding errors.
From all of these encodings I could retrieve your original "测试" string, as a utf-8 byte string, by doing, for example:
comment = comment.encode("cp1252", errors="ignore")
So, placing this line might work for you now, but guessing around with Unicode is never good -
the proper thing is to narrow down what is making webpy give you those semi-decoded utf-8 strings in the first place, and make it stop there.
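The whole failure mode can be reproduced in plain Python with the exact bytes from the question (a reconstruction of what the workaround relies on, not webpy's actual code path):

```python
# -*- coding: utf-8 -*-
correct = b'\xe6\xb5\x8b\xe8\xaf\x95'        # UTF-8 bytes for 测试

# Decoding those UTF-8 bytes as cp1252 yields exactly the garbled
# unicode object seen in the question:
garbled = correct.decode('cp1252')
assert garbled == u'\xe6\xb5\u2039\xe8\xaf\u2022'

# The workaround re-encodes as cp1252 to recover the original bytes,
# which then decode cleanly as UTF-8:
recovered = garbled.encode('cp1252', errors='ignore')
assert recovered == correct
assert recovered.decode('utf-8') == u'测试'
```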
update
I checked here - this is what is happening: the correct utf-8 '\xe6\xb5\x8b\xe8\xaf\x95' string is read from MySQL, and before being delivered to you (in the use_unicode=True case) these bytes are decoded as if they were "cp1252" - this yields the incorrect u'\xe6\xb5\u2039\xe8\xaf\u2022' unicode. It is probably a webpy error, e.g. it does not pass your "charset=utf8" parameter to the actual connection. When you set use_unicode=False, instead of giving you the raw bytes, it apparently takes the incorrect unicode and encodes it using "utf-8" - this yields the
'\xc3\xa6\xc2\xb5\xe2\x80\xb9\xc3\xa8\xc2\xaf\xe2\x80\xa2' sequence you commented below (which is even more incorrect).
All in all, the workaround I mentioned above seems the only way to retrieve the original, correct string - that is, given the wrong unicode, do u'\xe6\xb5\u2039\xe8\xaf\u2022'.encode("cp1252", errors="ignore") - short of
doing something else to set up the database connection (or maybe updating webpy or the MySQL drivers, if possible).
** update 2 **
I further checked the code in webpy's database module itself - it attempts to set up the connection as utf-8 by default - but it looks like it will try both the MySQLdb and pymysql drivers; if you have both installed, try uninstalling pymysql and leaving only MySQLdb.
The "Incorrect string value error" is raised from MySQLdb.
_mysql_exceptions.OperationalError: (1366, "Incorrect string value: '\\xF0\\xA0\
\x84\\x8E\\xE2\\x8B...' for column 'from_url' at row 1")
But I already set both the connection charset and the from_url encoding to utf8. It worked without problems for millions of records previously.
I think the issue is related to the special character u'\U0002010e' (a rare Chinese character outside the BMP).
The value that causes the exception:
u'http://www.ettoday.net/news/20120227/27879.htm?fb_action_ids=305328666231772&
fb_action_types=og.likes&fb_source=aggregation&fb_aggregation_id=288381481237582
\u7c89\u53ef\u611b\U0002010e\u22ef http://www.ettoday.net/news/20120221/26254.h
tm?fb_action_ids=305330026231636&fb_action_types=og.likes&fb_source=aggregation&
fb_aggregation_id=288381481237582 \u597d\u840c\u53c8\u22ef'
but this character can be encoded as utf8 in python as well.
>>> u'\U0002010e'.encode('utf8')
'\xf0\xa0\x84\x8e'
So why MySQL cannot accept this character?
The character you are using is outside the BMP, therefore it requires 4 bytes to store. Using the utf8 charset is not enough; you must have MySQL 5.5 or greater and use the utf8mb4 charset instead.
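The 4-byte requirement is easy to verify in Python; MySQL's legacy utf8 charset stores at most 3 bytes per character, so this character cannot fit:

```python
ch = u'\U0002010e'                       # character outside the BMP
encoded = ch.encode('utf-8')
assert encoded == b'\xf0\xa0\x84\x8e'    # the bytes from the error message
assert len(encoded) == 4                 # exceeds utf8's 3-byte-per-char limit
```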
Check the charset you have set for MySQL and make sure you are using one that accepts 4-byte UTF-8 encodings (i.e. utf8mb4).