I've been having a lot of trouble sending Unicode characters to a database for the last couple of hours. I am using the MySQLdb Python library. First I had to declare the encoding so that I could save the file.
# -*- coding: utf-8 -*-
Then I added charset="utf8",use_unicode=True as parameters in the connect function.
db = MySQLdb.connect(host="146.247.111.111",user="xxxxxxxx",passwd="[redacted]",db="xxxxxxxx",charset="utf8",use_unicode=True)
db.set_character_set('utf8')
cursor = db.cursor()
cursor.execute("INSERT INTO xls_pravne_osebe_users(name,email,Wall_idWall) VALUES(%s,%s,1) ",("šđčđĐĐšš","ŽŽŽŽŽŽ"))
It definitely changed something, yet everything is still not the way it should be. The letters ž and š are alright, which they weren't before I added the encoding parameters, however the letters ćčđ are still not alright. All of these characters are unicode characters. What encoding is this, that properly encodes some characters and fails with others of the same family?
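One possible explanation, offered as a guess rather than a diagnosis: š and ž exist in cp1252 (which is what MySQL calls latin1), while č, ć and đ do not, so a latin1 column, table or connection somewhere in the chain could keep the first two and mangle the rest. The Python 3 sketch below only checks which characters cp1252 can represent; it does not touch a database, and the helper name fits_cp1252 is made up for the illustration.

```python
# Check which of the question's characters exist in cp1252.
# Assumption: some part of the chain (column, table or connection)
# is still latin1/cp1252 rather than utf8.
def fits_cp1252(ch):
    """Return True if the character can be encoded as cp1252."""
    try:
        ch.encode("cp1252")
        return True
    except UnicodeEncodeError:
        return False

results = {ch: fits_cp1252(ch) for ch in "šžčćđ"}
for ch, ok in results.items():
    print(ch, "fits in cp1252" if ok else "not in cp1252")
```

If this is the cause, SHOW CREATE TABLE on the target table should reveal a latin1 column.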
I'm developing a website in Python on Google App Engine with MySQL. My problem is that if I have the characters 'ő' or 'ű' in the database, the rendering either shows an error or shows '?' instead of the 'ő' or 'ű' characters.
I've already tried changing the collation in the database to utf-8 or latin-1, but the result is the same.
I've also tried unidecode(), .decode('latin1') and .decode('utf8'), and added the line # -*- coding: utf-8 -*- to my .py file.
Nothing helps. Sometimes I receive an 'ascii decoding error' or a 'utf8 can't decode byte' error. The best I could achieve is the '?' sign instead of the special characters.
This is a sample of my code:
c.execute("""select subject from mytable""")
blogs = []
for (row) in c:
    blogs.append(dict([('azon',row[0])]))
return blogs
If I use this one, then the page renders perfectly:
c.execute("""select subject from mytable""")
blogs = []
for (row) in c:
    blogs.append(dict([('azon','ő')]))
return blogs
You can't change the encoding type when decoding, i.e. an encoded latin-1 string doesn't magically become utf-8 if you decode it as utf-8.
Make sure your MySQL inputs are utf-8 first. In the MySQL connection string, set the character set to utf-8. Also make sure the character set on the tables is set to utf-8: https://dev.mysql.com/doc/refman/5.7/en/charset-applications.html
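A small Python 3 sketch of the point above, independent of App Engine and MySQL: bytes written as UTF-8 do not turn into the right text when decoded as latin-1, and 'ő' has no latin-1 code at all, which is one way a '?' can end up on the page.

```python
text = "ő"  # U+0151, which latin-1 cannot represent

utf8_bytes = text.encode("utf-8")

# Decoding with the wrong codec raises no error, it just yields other characters.
wrong = utf8_bytes.decode("latin-1")

# Decoding with the codec that was used to encode gives the text back.
right = utf8_bytes.decode("utf-8")

# Forcing the character into latin-1 substitutes a question mark.
replaced = text.encode("latin-1", errors="replace")

print(repr(utf8_bytes), repr(wrong), repr(right), repr(replaced))
```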
I have a variable which contains a string in Persian language, and I cannot save that string into the database correctly. I am using flask for REST API, and I am getting the string from client. Here's my code:
@app.route('/getfile',methods=['POST'])
def get_file():
    #check the validity of json format
    if not request.json or not 'FileName' in request.json:
        abort(400)
    if not request.json or not 'FilePath' in request.json:
        abort(400)
    if not request.json or not 'Message' in request.json:
        abort(400)
    #retrieve data from request
    filename_=request.json['FileName']
    filepath_=request.json['FilePath']
    message_=request.json['Message']
    try:
        conn = mysql.connector.connect(host=DBhost,database=DBname,user=DBusername,password=DBpassword)
    except:
        return jsonify({'Result':'Error, Could not connect to database.'})
    cursor_ = conn.cursor()
    query_ = "INSERT INTO sms_excel_files VALUES(null,%s,%s,%s,0)"
    data_ =(filename_,Dst_num_file,message_)
    cursor_.execute(query_, data_)
    last_row_id_=cursor_.lastrowid
    conn.commit()
The variable in question is message_. I can save English texts correctly, but not Persian ones. I also added # -*- coding: utf-8 -*- at the top of my code, but this did not solve the problem. But if I manually fill message_ with a Persian string, it is saved correctly to the database. Furthermore, if I simply return the value of message_, it is correct.
For example, this is what gets inserted into the database when message_ contains the string 'سلام':
&#1587;&#1604;&#1575;&#1605;
Any help is appreciated.
Please note that this is the first time I am trying to read Arabic / Persian characters, so the following information might not be correct (I could have made a mistake when comparing my test output with the Persian string you have shown in your question). Furthermore, I have never heard of Flask before.
Having said this:
1587 1604 1575 1605 is the sequence of code points which represents the Persian string you have shown in Unicode. Now, in HTML, Unicode code points (in decimal) can be encoded as entities in the form &#xxxx;. So the string &#1587;&#1604;&#1575;&#1605; is one of the allowed forms of representation of that string in HTML.
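The code point claim can be checked without a database. This Python 3 sketch shows that the decimal entity form decodes back to the original string, and that encoding with the xmlcharrefreplace error handler is one way such entities can be produced. Whether anything in the asker's stack actually does this is an assumption.

```python
import html

# The Persian string from the question, written with escapes to avoid
# any dependence on the editor's encoding.
original = "\u0633\u0644\u0627\u0645"
stored = "&#1587;&#1604;&#1575;&#1605;"

# Decimal code points of each character.
code_points = [ord(ch) for ch in original]

# Turning the entities back into text.
decoded = html.unescape(stored)

# One way to produce the entity form: encode to ASCII and let the
# error handler replace every non-ASCII character with a character reference.
entity_form = original.encode("ascii", errors="xmlcharrefreplace").decode("ascii")

print(code_points, decoded == original, entity_form)
```

If the client sends the message from an HTML form or page, that is the first place to look for this kind of conversion.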
Given that, there could be two possible reasons for the misbehavior:
1) request.json['Message'] already contains / returns HTML (and not natural text) and (for some reason I don't know) contains / returns the string in question in HTML-entity encoded form. So this is the first thing you should check.
2) cursor_.execute(...) somehow encodes the string into HTML and thereby (for some reason I don't know) encodes your string in question into HTML-entity encoded form. Perhaps you have told your database driver to encode non-ASCII characters in message_ as HTML entities?
For further analysis, you could check what happens in a test case where request.json['Message'] contains / returns only ASCII characters.
If ASCII characters are written into the database as HTML entities as well, there must be a basic problem which causes all characters without exception to be encoded into HTML entities.
Otherwise, you possibly have not told your database, your database drivers or your file system drivers which encoding to use. In such cases, ASCII characters are often treated correctly, whereas weird things happen to non-ASCII characters. Automatically encoding non-ASCII characters to HTML entities during file IO or database operations would be very unusual, though. But as mentioned above, I don't know Flask ...
Please consult the MySQL manual to see how to set the character encoding for databases, tables, columns and connections, your database driver documentation to see which other things you must do to get this encoding to be handled correctly, and your interpreter's and its libraries' manuals to see how to correctly set that encoding for file IO (CGI works via STDIN / STDOUT).
You make your life a lot easier if the database character encodings and the file IO encoding are all the same. Personally, I always use UTF-8.
A final note: Since I don't know anything about flask, I don't know what # -*- coding: utf-8 -*- is supposed to do. But chances are that this only tells the interpreter how the script itself is encoded, but not which encoding to use for input / output / database operations.
Try this code. It uses the MySQLdb library, which is almost like the library you are using (install it with pip first).
I tried to set "utf-8" in all possible ways.
# -*- coding: utf-8 -*-
import MySQLdb
import MySQLdb.cursors
# Open database connection
try:
    db = MySQLdb.connect(host="localhost",
                         user="root",
                         passwd="",
                         db="db_name"
                         #,unix_socket="/opt/lampp/var/mysql/mysql.sock"
                         )
    db.set_character_set('utf8')
    crsr = db.cursor(MySQLdb.cursors.DictCursor)
    crsr.execute('SET NAMES utf8;')
    crsr.execute('SET CHARACTER SET utf8;')
    crsr.execute('SET character_set_connection=utf8;')
except MySQLdb.Error as e:
    print(e)
I have 10 tables in a database. 9 of them only store data with standard ascii 1-byte characters supported by Latin-1. 1 of them requires that I store special characters that are only supported by UTF8. I would like to use the same MySQL connection object (using Python's PyMySQL library) to populate all 10 tables.
Previously, when creating the MySQL connection object, I did not specify the character set and it defaulted to Latin-1. That was fine when I was only populating the 9 Latin-1 tables. Now that I am populating the UTF8 table, I modified the connection object by passing in the parameter charset='utf8mb4' to the PyMySQL connection object function:
# Connect to the database
connection = pymysql.connect(host='localhost',
                             user='user',
                             password='passwd',
                             db='db',
                             charset='utf8mb4',
                             cursorclass=pymysql.cursors.DictCursor)
Now I am confident that, when inserting into my UTF8 MySQL table, all of my data is being stored fine. However, I am unsure if problems may arise when using my UTF8 connection object and inserting into the Latin-1 tables. After my first rounds of testing, everything looks great.
Is there anything I have overlooked? Are there any potential issues with inserting UTF8 encoded characters into a Latin-1 table?
It can be done. But... You must set some things correctly, else you will get any of several forms of garbage.
If the bytes in your client are UTF-8 encoded, then you must tell MySQL that fact. This is usually done on the connect string. Your charset='utf8mb4' connection argument does that. Here are some Python-specific tips: http://mysql.rjweb.org/doc.php/charcoll#python
Meanwhile, the column(s) in the table(s) can be either latin1 or utf8 (since you are sure the data is limited to the characters that are common between them).
A character example: é is hex E9 in latin1 and C3A9 in MySQL's utf8 (or utf8mb4). The conversion will occur during INSERT and SELECT if you correctly state the client's encoding.
(For your purposes, either utf8 or utf8mb4 will work.)
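The é example can be checked in Python 3 without a server. This only illustrates the byte values; the actual conversion on INSERT and SELECT is done by MySQL, not by this code.

```python
ch = "é"

latin1_hex = ch.encode("latin-1").hex()  # one byte
utf8_hex = ch.encode("utf-8").hex()      # two bytes

# Converting between the encodings means decoding with one codec
# and encoding again with the other.
converted = bytes.fromhex("c3a9").decode("utf-8").encode("latin-1")

print(latin1_hex, utf8_hex, converted.hex())
```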
If you have further troubles, see Trouble with utf8 characters; what I see is not what I stored and/or provide SHOW CREATE TABLE and hex of some offending character.
Hi. utf8 and latin1 are different encodings, and utf8 can represent characters that latin1 cannot, so problems may occur if you pass utf8 data that has no latin1 equivalent. Double encoding can also occur in this process.
Here is a link to insert utf8 to latin
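Double encoding can be reproduced in Python 3. This is a sketch of the usual mechanism (UTF-8 bytes mistaken for latin1 and converted to UTF-8 a second time), not a trace of what any particular server did.

```python
ch = "é"

# Correct: the text is encoded once.
once = ch.encode("utf-8")

# Double encoding: the UTF-8 bytes are misread as latin-1 text,
# then that text is encoded as UTF-8 again.
twice = once.decode("latin-1").encode("utf-8")

# Undoing it reverses those two steps before the final decode.
repaired = twice.decode("utf-8").encode("latin-1").decode("utf-8")

print(once.hex(), twice.hex(), repaired)
```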
I had the same problem and solved it by using the CONVERT and CAST functions:
mycursor.execute("""INSERT INTO `topics` (`title`,parent_id)
VALUES (convert(cast(convert( %s using utf8) as binary) using latin1),0)""" ,(name,) )
I'm writing SQL to a file on a server this way:
import codecs
f = codecs.open('translate.sql',mode='a',encoding='utf8',errors='strict')
and then writing SQL statements like this:
query = (u"""INSERT INTO story_question_response
(group_id, story_id, question_id, answer )
VALUES
(%s,@last_story_id,%s,'%s');
""" % (kw.get('to'), lookup.get(q), kw.get(q)))
f.write(query)
I have confirmed that the text was okay when I pulled it. Here is the data from the dictionary (kw) passed out to a webpage:
46:埼玉県
47:熊谷市
42:お散歩デモ
It appears correct (I want it to be utf8 escaped).
But the file.write output is garbage (encoding problems):
INSERT INTO story_question_response
(group_id, story_id, question_id, answer )
VALUES
(279,@last_story_id,62,'ãã©ã³ãã£ã¢ããã'); )
/* updating the story text on old story_id */
UPDATE story_question_response
SET answer = '大å¦ã®ããã·ã§ã¯ãã¦å¦çãæ¬å¤§éç½ã®è¢«ç½å°(岩æçã®å¤§è¹æ¸¡å¸)ã«æ´¾é£ãããããã¦ã¯ç¾å°ã®å¤ç¥ãã®ãæ$
WHERE story_id = 65591
AND question_id = 41
AND group_id = 276;
using an explicit decode gives an error:
f.write(query.decode('utf8'))
I don't know what else to try.
Question: What am I doing wrong, in writing a utf8 file?
We don't have enough information to be sure, but I'd give decent odds that your file is actually perfectly valid UTF-8, and you're just viewing it as if it were something else.
For example, on Windows, if you open a file in Notepad, by default, it will only treat it as UTF-8 if it starts with a UTF-8 BOM (which no valid file ever should, but Microsoft likes them anyway); otherwise, it will treat it as whatever your default code page is. Which is probably some Latin-1 derivative like CP1252.
So, your string of kana and kanji ends up encoded as a bunch of three-byte UTF-8 sequences like '\xe6\xad\xa9'. Then, that gets displayed in Notepad as whatever each of those bytes happen to mean in CP1252, like æ© (note that there's an invisible character between the two visible ones).
As a general rule, whenever you see weirdly-accented versions of lowercase A and E every 2 or 3 characters, that almost always means you've interpreted some CJK UTF-8 as some Latin-1-derived character set, because UTF-8 uses \xE3 through \xED as the prefix bytes for most CJK characters, and Latin-1 has accented lowercase A and E characters in that range. (Similarly, weirdly-accented capital A versions usually mean European or symbolic UTF-8 interpreted as Latin-1, especially when you've got stray Âs inserted into what looks like otherwise valid or almost-valid European text. If you look at the charts, you should be able to tell why.)
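The effect is easy to reproduce in Python 3: write text as UTF-8, then read the same bytes back as CP1252. The file below is a temporary file created only for this demonstration, and the single kanji is the one from the byte example above.

```python
import os
import tempfile

text = "歩"

# Write the text as UTF-8 into a throwaway file.
path = os.path.join(tempfile.mkdtemp(), "translate.sql")
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

# The raw bytes on disk.
with open(path, "rb") as f:
    raw = f.read()

# What a viewer sees if it assumes CP1252.
with open(path, encoding="cp1252") as f:
    misread = f.read()

# What a viewer sees if it assumes UTF-8.
with open(path, encoding="utf-8") as f:
    reread = f.read()

print(repr(raw), repr(misread), reread)
```

The file is identical in both reads; only the viewer's assumption differs.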
Assuming your input is utf8, you should probably use the following code to generate the query:
query = (u"""INSERT INTO story_question_response
(group_id, story_id, question_id, answer )
VALUES
(%s,@last_story_id,%s,'%s');
""" % (kw.get('to').decode('utf8'), lookup.get(q).decode('utf8'), kw.get(q).decode('utf8')))
I would also suggest trying to output the contents of kw and lookup to some log file to debug this issue.
In Python 2, you should use encode on objects of class unicode, and decode on objects of class str.
You should escape any string you insert into SQL statement to prevent nasty SQL injections.
The code above doesn't include such escaping, so be careful.
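In Python 3 terms the same rule reads: encode str to bytes, decode bytes to str. The sketch below uses one of the strings from the question. It also hints at why the explicit decode in the question fails: in Python 2, calling decode on a unicode object first encodes it implicitly as ASCII, while Python 3 simply does not offer the wrong direction.

```python
text = "埼玉県"

data = text.encode("utf-8")   # str -> bytes
back = data.decode("utf-8")   # bytes -> str

# Python 3 removed the wrong-direction methods altogether.
has_decode = hasattr(text, "decode")  # str has no decode()
has_encode = hasattr(data, "encode")  # bytes has no encode()

print(len(text), len(data), back == text, has_decode, has_encode)
```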
I created a database in MySQL and use webpy to build my web server.
But webpy and MySQLdb behave strangely differently with Chinese characters when I use each of them to access the database.
Below is my problem:
My table t_test (utf8 database):
id name
1 测试
The UTF-8 encoding of "测试" is: \xe6\xb5\x8b\xe8\xaf\x95
when using MySQLdb to do "select" like this:
c=conn.cursor()
c.execute("SELECT * FROM t_test")
items = c.fetchall()
c.close()
print "items=%s, name=%s"%(items[0], items[0][1])
the result is normal, it prints:
items=(127L, '\xe6\xb5\x8b\xe8\xaf\x95'), name=测试
But when I use webpy do the same things:
db = web.database(dbn='mysql', host="127.0.0.1",
user='test', pw='test', db='db_test', charset="utf8")
eval_items=db.select('t_test')
comment=eval_items[0].name
print "comment code=%s"%repr(comment)
print "comment=%s"%comment.encode("utf8")
The Chinese text comes out garbled; the printed result is:
comment code=u'\xe6\xb5\u2039\xe8\xaf\u2022'
comment=忙碌鈥姑€
I know webpy's database module also depends on MySQLdb, but the two ways behave so differently. Why?
BTW, because of the above, I can use MySQLdb directly to solve my garbled Chinese character problem, but then I lose the column names of the table, which is ungraceful. I want to know how I can solve it with webpy.
Indeed, something very wrong is taking place. As you said in your comment, the UTF-8 bytes for "测试" are E6B5 8BE8 AF95, which works on my UTF-8 terminal here:
>>> d
'\xe6\xb5\x8b\xe8\xaf\x95'
>>> print d
测试
But look at the contents of your "comment" unicode object:
comment code=u'\xe6\xb5\u2039\xe8\xaf\u2022'
This means that part of your content consists of the UTF-8 bytes for the comment (the chars represented as "\xYY") and part is encoded as Unicode code points (the chars represented as "\uYYYY"). This indicates serious garbage.
MySQL has some catches when it comes to properly decoding (UTF-8 or otherwise) encoded text in it, one of which is passing a proper "charset" parameter to the connection. But you did that already.
One thing you can try is to pass the connection the option use_unicode=False and decode the UTF-8 strings in your own code.
db = web.database(dbn='mysql', host="127.0.0.1",
user='test', pw='test', db='db_test', charset="utf8", use_unicode=False)
Check the options to connect for this and other parameters you might try:
http://mysql-python.sourceforge.net/MySQLdb.html
Regardless of getting it to work correctly with the hints above, I have a workaround for you. It looks like the Unicode characters (not the raw UTF-8 bytes in the unicode object) in your string were decoded using one of these encodings:
("cp1258", "cp1252", "palmos", "cp1254")
Of these, cp1252 is almost the same as "latin1", which is the default charset MySQL uses if it does not get the "charset" argument in the connection. But it is not only a matter of web2py not passing it to the database, as you are getting mangled chars, not just the wrong encoding. It is as if web2py is encoding and decoding your string back and forth and ignoring encoding errors.
With any of these encodings I could retrieve your original "测试" string as a UTF-8 byte string by doing, for example:
comment = comment.encode("cp1252", errors="ignore")
So, placing this line in your code might work for you now, but guessing around with Unicode is never good. The proper thing is to narrow down what is making web2py give you those semi-decoded UTF-8 strings in the first place, and make it stop there.
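The workaround can be checked in Python 3 with the exact value from the question. This is a sketch of the round trip only; for this particular string, strict encoding succeeds, so errors="ignore" is not needed here.

```python
# The garbled unicode value reported in the question.
garbled = "\xe6\xb5\u2039\xe8\xaf\u2022"

# Encoding with cp1252 maps each character back to the byte it came from.
raw = garbled.encode("cp1252")

# Those bytes are the original UTF-8 sequence.
recovered = raw.decode("utf-8")

print(repr(raw), recovered)
```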
update
I checked here, and this is what is happening: the correct UTF-8 string '\xe6\xb5\x8b\xe8\xaf\x95' is read from MySQL, and before it is delivered to you (in the use_unicode=True case) these bytes are decoded as if they were "cp1252". This yields the incorrect unicode u'\xe6\xb5\u2039\xe8\xaf\u2022'. It is probably a web2py error; for example, it may not pass your "charset=utf8" parameter to the actual connection. When you set use_unicode=False, instead of giving you the raw bytes, it apparently takes the incorrect unicode and encodes it using "utf-8", which yields the '\xc3\xa6\xc2\xb5\xe2\x80\xb9\xc3\xa8\xc2\xaf\xe2\x80\xa2' sequence you mentioned in the comment below (which is even more incorrect).
All in all, the workaround I mentioned above seems to be the only way to retrieve the original, correct string; that is, given the wrong unicode, do u'\xe6\xb5\u2039\xe8\xaf\u2022'.encode("cp1252", errors="ignore"), short of doing something else to set up the database connection (or maybe updating web2py or the MySQL drivers, if possible).
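Both wrong values described in this update can be reproduced from the correct bytes in Python 3. The sketch imitates the suspected mistake with plain codec calls; it does not run any web2py or driver code.

```python
# The correct UTF-8 bytes stored in MySQL.
correct_bytes = b"\xe6\xb5\x8b\xe8\xaf\x95"

# Suspected mistake 1 (use_unicode=True): the bytes are decoded as cp1252.
wrong_text = correct_bytes.decode("cp1252")

# Suspected mistake 2 (use_unicode=False): that wrong text is encoded as UTF-8.
wrong_bytes = wrong_text.encode("utf-8")

print(ascii(wrong_text), repr(wrong_bytes))
```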
** update 2 **
I further checked the code in the web2py dal.py file itself. It attempts to set up the connection as utf-8 by default, but it looks like it will try both the MySQLdb and pymysql drivers. If you have both installed, try uninstalling pymysql and leaving only MySQLdb.