I have a MySQL database encoded in UTF-8, but when I connect to it with SQLAlchemy (Python 2.7), I get back byte strings encoded as Latin-1.
So, the Dutch spelling of Belgium (België) comes out as
'Belgi\xeb'
rather than
'Belgi\xc3\xab'
or, ideally, the Unicode object
u'Belgi\xeb'
According to docs (http://docs.sqlalchemy.org/en/rel_0_9/core/engines.html#custom-dbapi-args):
MySQLdb will accommodate Python unicode objects if the use_unicode=1 parameter, or the charset parameter, is passed as a connection argument.
Without this setting, many MySQL server installations default to a latin1 encoding for client connections.
You need to use
create_engine('mysql+mysqldb://HOSTNAME/DATABASE?charset=utf8')
rather than just
create_engine('mysql+mysqldb://HOSTNAME/DATABASE')
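The three representations from the question can be told apart without a database. The following is a minimal sketch in plain Python (no SQLAlchemy or MySQL involved; the variable names are illustrative only) showing that 'Belgi\xeb' is the Latin-1 byte string and 'Belgi\xc3\xab' the UTF-8 byte string for the same text:

```python
# Illustration only: no database connection is made here.
text = u"Belgi\xeb"                      # the Unicode text "België"

latin1_bytes = text.encode("latin-1")    # bytes as a latin1 connection would carry them
utf8_bytes = text.encode("utf-8")        # bytes as a utf8 connection would carry them

# latin1_bytes == b"Belgi\xeb"
# utf8_bytes   == b"Belgi\xc3\xab"

# Decoding with the matching codec recovers the same text either way.
same_text = latin1_bytes.decode("latin-1") == utf8_bytes.decode("utf-8") == text
```

If the bytes coming back look like the first form, the connection is most likely still using latin1, which is what the charset=utf8 argument is meant to change.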
Related
I have 10 tables in a database. 9 of them only store data with standard ascii 1-byte characters supported by Latin-1. 1 of them requires that I store special characters that are only supported by UTF8. I would like to use the same MySQL connection object (using Python's PyMySQL library) to populate all 10 tables.
Previously, when creating the MySQL connection object, I did not specify the character set and it defaulted to Latin-1. That was fine when I was only populating the 9 Latin-1 tables. Now that I am populating the UTF8 table, I modified the connection object by passing in the parameter charset='utf8mb4' to the PyMySQL connection object function:
import pymysql.cursors

# Connect to the database
connection = pymysql.connect(host='localhost',
                             user='user',
                             password='passwd',
                             db='db',
                             charset='utf8mb4',
                             cursorclass=pymysql.cursors.DictCursor)
Now I am confident that, when inserting into my UTF8 MySQL table, all of my data is being stored fine. However, I am unsure if problems may arise when using my UTF8 connection object and inserting into the Latin-1 tables. After my first rounds of testing, everything looks great.
Is there anything I have overlooked? Are there any potential issues with inserting UTF8 encoded characters into a Latin-1 table?
It can be done. But... You must set some things correctly, else you will get any of several forms of garbage.
If the bytes in your client are UTF-8 encoded, then you must tell MySQL that fact. This is usually done on the connect string. Your charset='utf8mb4' connection argument does that. Here are some Python-specific tips: http://mysql.rjweb.org/doc.php/charcoll#python
Meanwhile, the column(s) in the table(s) can be either latin1 or utf8 (since you are sure the data is limited to the characters that are common between them).
A character example: é is hex E9 in latin1 and C3A9 in MySQL's utf8 (or utf8mb4). The conversion will occur during INSERT and SELECT if you correctly state the client's encoding.
(For your purposes, either utf8 or utf8mb4 will work.)
If you have further troubles, see Trouble with utf8 characters; what I see is not what I stored and/or provide SHOW CREATE TABLE and hex of some offending character.
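The é example can be checked in Python. This sketch only mimics the conversion MySQL performs between the connection charset and the column charset. `fits_in_latin1` is a hypothetical helper, and it uses Python's cp1252 codec on the assumption that MySQL's latin1 is closer to cp1252 than to strict ISO-8859-1.

```python
import binascii

e_acute = u"\xe9"  # é

# Same character, different bytes in each encoding.
latin1_hex = binascii.hexlify(e_acute.encode("latin-1")).decode("ascii").upper()
utf8_hex = binascii.hexlify(e_acute.encode("utf-8")).decode("ascii").upper()
# latin1_hex == "E9", utf8_hex == "C3A9"


def fits_in_latin1(text):
    """Hypothetical helper: True if every character has a cp1252 byte."""
    try:
        text.encode("cp1252")
        return True
    except UnicodeEncodeError:
        return False
```

Running a check like this over the data destined for the latin1 tables is one way to confirm the assumption that it is limited to characters both encodings share.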
UTF-8 can represent characters that Latin-1 cannot, so problems may occur if you pass UTF-8 data containing characters that are not in Latin-1. Double encoding can also occur in this process.
Here is a link to insert utf8 to latin
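To make "double encoding" concrete, here is a small sketch of how it arises and how it can be reversed. It is plain Python with no database involved, and the variable names are illustrative only.

```python
original = u"\xe9"                         # é

utf8_bytes = original.encode("utf-8")      # b'\xc3\xa9'
# Mistake: the UTF-8 bytes are read as if each byte were a Latin-1 character.
misread = utf8_bytes.decode("latin-1")     # u'\xc3\xa9', displayed as "Ã©"
# Storing that text as UTF-8 encodes it a second time.
double_encoded = misread.encode("utf-8")   # b'\xc3\x83\xc2\xa9'

# Undoing it reverses the two steps.
repaired = double_encoded.decode("utf-8").encode("latin-1").decode("utf-8")
```

Seeing "Ã©" where "é" was expected is the usual symptom of this kind of mismatch between what the client sends and what it told the server it would send.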
I had the same problem and solved it by using the CONVERT and CAST functions:
mycursor.execute(
    "INSERT INTO `topics` (`title`, parent_id) "
    "VALUES (CONVERT(CAST(CONVERT(%s USING utf8) AS BINARY) USING latin1), 0)",
    (name,))
I created a database in MySQL and use webpy for my web server.
But webpy and MySQLdb behave differently with Chinese characters when I use each of them to access the database.
Here is my problem:
My table t_test (utf8 database):
id name
1 测试
The UTF-8 encoding of "测试" is: \xe6\xb5\x8b\xe8\xaf\x95
When I use MySQLdb to run a SELECT like this:
c=conn.cursor()
c.execute("SELECT * FROM t_test")
items = c.fetchall()
c.close()
print "items=%s, name=%s" % (items[0], items[0][1])
the result is normal. It prints:
items=(127L, '\xe6\xb5\x8b\xe8\xaf\x95'), name=测试
But when I use webpy to do the same thing:
db = web.database(dbn='mysql', host="127.0.0.1",
user='test', pw='test', db='db_test', charset="utf8")
eval_items=db.select('t_test')
comment=eval_items[0].name
print "comment code=%s"%repr(comment)
print "comment=%s"%comment.encode("utf8")
the Chinese text is garbled. The output is:
comment code=u'\xe6\xb5\u2039\xe8\xaf\u2022'
comment=忙碌鈥姑€
I know webpy's database layer also depends on MySQLdb, but the two approaches give very different results. Why?
By the way, because of this I could just use MySQLdb directly to avoid the garbled Chinese characters, but then I lose the column names of the table, which is inelegant. How can I solve this with webpy?
Indeed, something very wrong is taking place.
As you said in your comment, the UTF-8 bytes for "测试" are E6B5 8BE8 AF95,
which print correctly on my UTF-8 terminal here:
>>> d
'\xe6\xb5\x8b\xe8\xaf\x95'
>>> print d
测试
But look at the contents of your "comment" unicode object:
comment code=u'\xe6\xb5\u2039\xe8\xaf\u2022'
This means part of your content is the UTF-8 bytes for the comment
(the characters represented as "\xYY") and part is Unicode code points
(the characters represented as \uYYYY). This indicates serious garbage.
MySQL has some catches when it comes to properly decoding (UTF-8 or otherwise) encoded
text. One of them is passing a proper "charset" parameter
to the connection, but you did that already.
One thing you can try is to pass the connection the option use_unicode=False
and decode the UTF-8 strings in your own code:
db = web.database(dbn='mysql', host="127.0.0.1",
user='test', pw='test', db='db_test', charset="utf8", use_unicode=False)
Check the options to connect for this and other parameters you might try:
http://mysql-python.sourceforge.net/MySQLdb.html
Whether or not you get it to work correctly with the hints above, I have a workaround for you. It looks like the Unicode characters (not the raw UTF-8 bytes in the unicode object)
in your string come from one of these encodings:
("cp1258", "cp1252", "palmos", "cp1254")
Of these, cp1252 is almost the same as "latin1", which is the default charset MySQL uses
if it does not get the "charset" argument in the connection. But it is not only a matter
of web2py not passing it to the database, because you are getting mangled characters, not
just the wrong encoding. It is as if web2py is encoding and decoding your string back and forth and ignoring encoding errors.
With any of these encodings I could retrieve your original "测试" string as a UTF-8 byte string by doing, for example:
comment = comment.encode("cp1252", errors="ignore")
So adding this line might work for you for now, but guessing around with Unicode is never good.
The proper thing is to narrow down what is making web2py give you those semi-decoded UTF-8 strings in the first place, and make it stop there.
update
I checked, and this is what is happening: the correct UTF-8 string '\xe6\xb5\x8b\xe8\xaf\x95' is read from MySQL, and before it is delivered to you (in the use_unicode=True case) these bytes are decoded as if they were "cp1252". This yields the incorrect unicode u'\xe6\xb5\u2039\xe8\xaf\u2022'. It is probably a web2py error; for example, it may not pass your "charset=utf8" parameter to the actual connection. When you set use_unicode=False, instead of giving you the raw bytes, it apparently takes the incorrect unicode and encodes it using "utf-8". This yields the '\xc3\xa6\xc2\xb5\xe2\x80\xb9\xc3\xa8\xc2\xaf\xe2\x80\xa2' sequence you mentioned in the comments below (which is even more incorrect).
All in all, the workaround I mentioned above seems to be the only way to retrieve the original, correct string. That is, given the wrong unicode, do u'\xe6\xb5\u2039\xe8\xaf\u2022'.encode("cp1252", errors="ignore"), short of
doing something else to set up the database connection (or maybe updating web2py or the MySQL drivers, if possible).
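The chain of events described in this update can be reproduced in a few lines. This is a sketch of the byte-level behavior only; it does not call webpy, web2py or MySQLdb, and the variable names are illustrative.

```python
raw = b"\xe6\xb5\x8b\xe8\xaf\x95"       # correct UTF-8 bytes as stored in MySQL

# Decoding UTF-8 bytes with the wrong codec gives the repr from the question.
wrong = raw.decode("cp1252")
# wrong == u'\xe6\xb5\u2039\xe8\xaf\u2022'

# Encoding that wrong text as UTF-8 gives the longer, "even more incorrect" bytes.
worse = wrong.encode("utf-8")
# worse == b'\xc3\xa6\xc2\xb5\xe2\x80\xb9\xc3\xa8\xc2\xaf\xe2\x80\xa2'

# The workaround: go back through cp1252, then decode as UTF-8.
recovered = wrong.encode("cp1252").decode("utf-8")
# recovered == u'\u6d4b\u8bd5', the two characters of the original text
```

The round trip works here because every byte in this particular string has a cp1252 mapping. Bytes that cp1252 leaves undefined would not survive it, which is one more reason to fix the connection rather than rely on the workaround.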
** update 2 **
I further checked the code in the web2py dal.py file itself. It attempts to set up the connection as UTF-8 by default, but it looks like it will try both the MySQLdb and pymysql drivers. If you have both installed, try uninstalling pymysql and leaving only MySQLdb.
The "Incorrect string value" error is raised from MySQLdb.
_mysql_exceptions.OperationalError: (1366, "Incorrect string value: '\\xF0\\xA0\
\x84\\x8E\\xE2\\x8B...' for column 'from_url' at row 1")
But I have already set both the connection charset and the from_url encoding to utf8. It has worked without problems for millions of records previously.
The value that causes the exception:
I think the issue is related to the special character u'\U0002010e' (a special Chinese character, "ㄋ")
u'http://www.ettoday.net/news/20120227/27879.htm?fb_action_ids=305328666231772&
fb_action_types=og.likes&fb_source=aggregation&fb_aggregation_id=288381481237582
\u7c89\u53ef\u611b\U0002010e\u22ef http://www.ettoday.net/news/20120221/26254.h
tm?fb_action_ids=305330026231636&fb_action_types=og.likes&fb_source=aggregation&
fb_aggregation_id=288381481237582 \u597d\u840c\u53c8\u22ef'
But this character can be encoded as UTF-8 in Python as well.
>>> u'\U0002010e'.encode('utf8')
'\xf0\xa0\x84\x8e'
So why can't MySQL accept this character?
The character you are using is outside the BMP, therefore it requires 4 bytes to store. Using the utf8 charset is not enough; you must have MySQL 5.5 or greater and use the utf8mb4 charset instead.
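A quick way to see which strings are affected is to look for 4-byte UTF-8 sequences before inserting. `needs_utf8mb4` below is a hypothetical helper, not part of MySQLdb or any other library, and is only a sketch of the check.

```python
def needs_utf8mb4(text):
    """Hypothetical helper: True if text has a character outside the BMP."""
    # In UTF-8, only 4-byte sequences start with a lead byte of 0xF0 or above.
    return any(b >= 0xF0 for b in bytearray(text.encode("utf-8")))


supplementary = u"\U0002010e"
encoded = supplementary.encode("utf-8")   # 4 bytes: f0 a0 84 8e
bmp_only = u"\u7c89\u53ef\u611b\u22ef"    # 3 bytes per character at most
```

For strings where the check returns True, both the column and the connection charset would need to be utf8mb4 for the insert to succeed.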
Check the character set you have configured in MySQL and make sure it is one that accepts the UTF-8 data you are sending.
When I use a sqlite3 database with the SQLAlchemy library, I get this error:
sqlalchemy.exc.ProgrammingError: (ProgrammingError)
You must not use 8-bit bytestrings unless you use a text_factory that can
interpret 8-bit bytestrings (like text_factory = str).
It is highly recommended that you instead just switch your application
to Unicode strings.
u'INSERT INTO model_pair (user, password) VALUES (?, ?)' ('\xebE\xc2\xe4.\t312874#gg.com', '123456')
And here is some test data:
呆呆 3098920#gg.com 11111
�言 9707996#gg.com 11111
wwwj55572#gg?€? 11111
I have tried configuring the database encoding as utf-8 and as gbk, but neither worked.
When inserting, I tried str.decode('gbk'), but it gets stuck on characters like € and raises an error like the one above.
Can anyone tell me how to get around this error?
Try changing '\xebE\xc2\xe4.\t312874#gg.com' to u'\xebE\xc2\xe4.\t312874#gg.com'.
Also, you could try '\xebE\xc2\xe4.\t312874#gg.com'.decode("utf-8"), but it gives the error "invalid continuation byte"; perhaps your string is not valid UTF-8 after all?
By the way, do mention whether you are running Python 2.x or 3.x; there is a difference.
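One way to follow the error message's advice and hand SQLite Unicode text is to decode the raw bytes before inserting. The sketch below uses the standard-library sqlite3 module directly rather than SQLAlchemy. `to_text` is a hypothetical helper, and the order of encodings it tries (UTF-8, then GBK) is an assumption about the input data.

```python
import sqlite3


def to_text(raw, encodings=("utf-8", "gbk")):
    """Hypothetical helper: decode bytes with the first codec that accepts them."""
    for name in encodings:
        try:
            return raw.decode(name)
        except UnicodeDecodeError:
            continue
    # Last resort: keep what decodes and mark the rest with U+FFFD.
    return raw.decode(encodings[0], "replace")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE model_pair (user TEXT, password TEXT)")

# Stand-in for one GBK-encoded line of the input file.
raw_user = u"\u5446\u5446 3098920#gg.com".encode("gbk")
conn.execute("INSERT INTO model_pair (user, password) VALUES (?, ?)",
             (to_text(raw_user), u"11111"))
stored = conn.execute("SELECT user FROM model_pair").fetchone()[0]
```

Rows that fall through to the "replace" branch are probably damaged at the source, like the second line of the test data, and are worth logging rather than storing silently.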
Using MySQLdb I connect to a database where everything is stored in the utf8 encoding.
If I do
cursor.execute("SET NAMES utf8")
and then fetch some data from the database with another statement, does that mean that the strings returned by
cursor.execute("SELECT ...")
cursor.fetchall()
will be unicode objects? Or do I first have to convert them with
mystr.decode("utf8")
to unicode?
From the docs:
connect(parameters...)
...
use_unicode
If True, CHAR and VARCHAR and TEXT columns are returned as Unicode strings, using the configured character set. It is best to set the default encoding in the server configuration, or client configuration (read with read_default_file). If you change the character set after connecting (MySQL-4.1 and later), you'll need to put the correct character set name in connection.charset.
If False, text-like columns are returned as normal strings, but you can always write Unicode strings.
This must be a keyword parameter.
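Read against the question, the quoted docs suggest that use_unicode, not SET NAMES, decides whether text columns come back as unicode objects. If the connection returns byte strings, they still have to be decoded by hand. A minimal sketch of that manual step follows; `decode_row` is a hypothetical helper and the row is a stand-in for what cursor.fetchall() might return, so no MySQLdb call is made here.

```python
def decode_row(row, charset="utf8"):
    """Hypothetical helper: decode byte-string columns, leave others alone."""
    return tuple(col.decode(charset) if isinstance(col, bytes) else col
                 for col in row)


# Stand-in for a row fetched with use_unicode=False.
row = (127, b"\xe6\xb5\x8b\xe8\xaf\x95")
decoded = decode_row(row)
# decoded == (127, u'\u6d4b\u8bd5')
```

Passing use_unicode=True together with a charset at connect time, as the docs describe, should make this manual step unnecessary.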