I have a variable which contains a string in Persian language, and I cannot save that string into the database correctly. I am using flask for REST API, and I am getting the string from client. Here's my code:
#app.route('/getfile',methods=['POST'])
def get_file():
#check the validity of json format
if not request.json or not 'FileName' in request.json:
abort(400)
if not request.json or not 'FilePath' in request.json:
abort(400)
if not request.json or not 'Message' in request.json:
abort(400)
#retreive data from request
filename_=request.json['FileName']
filepath_=request.json['FilePath']
message_=request.json['Message']
try:
conn = mysql.connector.connect(host=DBhost,database=DBname,user=DBusername,password=DBpassword)
except:
return jsonify({'Result':'Error, Could not connect to database.'})
cursor_ = conn.cursor()
query_ = "INSERT INTO sms_excel_files VALUES(null,%s,%s,%s,0)"
data_ =(filename_,Dst_num_file,message_)
cursor_.execute(query_, data_)
last_row_id_=cursor_.lastrowid
conn.commit()
The variable in question is message_. I can save English texts correctly, but not Persian ones. I also added # -*- coding: utf-8 -*- at the top of my code, but this did not solve the problem. But if I manually fill message_ with a Persian string, it is saved correctly to the database. Furthermore, if I simply return the value of message_, it is correct.
For example, this is what gets inserted into the database when message_ contains the string 'سلام':
سلام
Any help is appreciated.
Please note that this is the first time I am trying to read Arabic / Persian characters, so the following information might not be correct (I could have made a mistake when comparing my test output with the Persian string you have shown in your question). Furthermore, I never have heard of flask so far.
Having said this:
1587 1604 1575 1605 is the sequence of code points which represents the Persian string you have shown in Unicode. Now, in HTML, Unicode code points (in decimal) can be encoded as entities in the form &#xxxx;. So the string سلام is one of the allowed forms of representation of that string in HTML.
Given that, there could be two possible reasons for the misbehavior:
1) request.json['Message'] already contains / returns HTML (and not natural text) and (for some reason I don't know) contains / returns the string in question in HTML-entity encoded form. So this is the first thing you should check.
2) cursor_.execute(...) somehow encodes the string into HTML and thereby (for some reason I don't know) encodes your string in question into HTML-entity encoded form. Perhaps you have told your database driver to encode non-ASCII characters in message_ as HTML entities?
For further analysis, you could check what happens in a test case where request.json['Message'] contains / returns only ASCII characters.
If ASCII characters are written into the database as HTML entities as well, there must be a basic problem which causes all characters without exception to be encoded into HTML entities.
Otherwise, you eventually have not told your database, your database drivers or your file system drivers which encoding to use. In such cases, ASCII characters are often treated correctly, whereas weird things happen to non-ASCII characters. Automatically encoding non-ASCII characters to HTML entities during file IO or database operations would be very unusual, though. But as mentioned above, I don't know flask ...
Please consult the MySQL manual to see how to set the character encoding for databases, tables, columns and connections, your database driver documentation to see which other things you must do to get this encoding to be handled correctly, and your interpreter's and its libraries' manuals to see how to correctly set that encoding for file IO (CGI works via STDIN / STDOUT).
You make your life a lot easier if the database character encodings and the file IO encoding are all the same. Personally, I always use UTF-8.
A final note: Since I don't know anything about flask, I don't know what # -*- coding: utf-8 -*- is supposed to do. But chances are that this only tells the interpreter how the script itself is encoded, but not which encoding to use for input / output / database operations.
Try this code. it is using MySQLdb library which is almost like the library you are using (install it using pip before using).
I tried to set "utf-8" in all possible ways.
# -*- coding: utf-8 -*-
import MySQLdb
# Open database connection
try:
db = MySQLdb.connect(host="localhost",
user="root",
passwd="",
db="db_name"
#,unix_socket="/opt/lampp/var/mysql/mysql.sock"
)
db.set_character_set('utf8')
crsr = db.cursor(MySQLdb.cursors.DictCursor)
crsr.execute('SET NAMES utf8;')
crsr.execute('SET CHARACTER SET utf8;')
crsr.execute('SET character_set_connection=utf8;')
except MySQLdb.Error as e:
print e
Related
I'm developing a website in python on google app engine with mysql. My problem if I have the characters 'ő' or 'ű' in the database, the rendering shows an error or shows the '?' instead of the 'ő' or 'ű' characters.
I've already tried to change the collation in the database to utf-8 or latin-1, but the result is the same.
i've also tried to use the unidecode(), .decode('latin1'), .decode('utf8') and added the line to my .py # -*- coding: utf-8 -*-
nothing helps. sometimes i receive an 'ascii decoding error' or 'utf8 can't decode byte' error. The best what i could achieve is the '?' sign instead of the special characters.
this is a sample of my code:
c.execute("""select subject from mytable""")
blogs = []
for (row) in c:
blogs.append(dict([('azon',row[0])]))
return blogs
if i use use this one, then the page rendering perfectly
c.execute("""select subject from mytable""")
blogs = []
for (row) in c:
blogs.append(dict([('azon','ő')]))
return blogs
You cant change encoding type when decoding. ie an encoded latin-1 string doesnt magically become utf-8 if you set utf-8 decode type.
Make sure your mysql inputs are utf-8 first. In the mysql connection string set it to utf-8. Also make sure the character set on the tables is set to utf-8 https://dev.mysql.com/doc/refman/5.7/en/charset-applications.html
I have an sqlite database that was populated by an external program. Im trying to read the data with python. When I attempt to read the data I get the following error:
OperationalError: Could not decode to UTF-8
If I open the database in sqlite manager and look at the data in the offending record(s) using the inbuilt browse and search it looks fine, however if I export the table as csv, I notice the character £ in the offending records has become £
If I read the csv in python, the £ in the offending records is still read as £ but its not a problem I can parse this manually. However I need to be able to read the data direct from the database, without the intermediate step of converting to csv.
I have looked at some answers online for similar questions, I have so far tried setting "text_factory = str" and I have also tried changing the datatype of the column from TEXT to BLOB using sqlite manager, but still get the error.
My code below results in the OperationalError: Could not decode to UTF-8
conn = sqlite3.connect('test.db')
conn.text_factory = str
curr = conn.cursor()
curr.execute('''SELECT xml_dump FROM hands_1 LIMIT 5000 , 5001''')
row = curr.fetchone()
All the records above 5000 in the database have this character problem and hence produce the error.
Any help appreciated.
Python is trying to be helpful by converting pieces of text (stored as bytes in a database) into a python str object for you. In order to do this conversion, python has to guess what letter each byte (or group of bytes) returned by your query represents. The default guess is an encoding called utf-8. Obviously, this guess is wrong in your case.
The solution is to give python a little hint as to how to do the mapping from bytes to letters (i.e., unicode characters). You've already come close with the line
conn.text_factory = str
However (based on your response in the comments above), since you are using python 3, str is the default text factory, so that line will do nothing new for you (see the docs).
What happens behind the scenes with this line is that python tries to convert the bytes returned by the query using the str function, kind of like:
your_string = str(the_bytes, 'utf-8') # actually uses `conn.text_factory`, not `str`
...but you want a different encoding where 'utf-8' is. Since you can't change the default encoding of the str function, you will have to mimic it some other way. You can use a one-off nameless function called a lambda for this:
conn.text_factory = lambda x: str(x, 'latin1')
Now when the database is handing the bytes to python, python will try to map them to letters using the 'latin1' scheme instead of the 'utf-8' scheme. Of course, I don't know if latin1 is the correct encoding of your data. Realistically, you will have to try a handful of encodings to find the right one. I would try the following first:
'iso-8859-1'
'utf-16'
'utf-32'
'latin1'
You can find a more complete list here.
Another option is to simply let the bytes coming out of the database remain as bytes. Whether this is a good idea for you depends on your application. You can do it by setting:
conn.text_factory = bytes
If the text in the database is actually mostly encoded in UTF-8, but you're still seeing this error (Could not decode to UTF-8), then the problem may be that one or more rows have bogus data that is not valid UTF-8. By default, Python's decode() function throws an exception when it sees text like that. If you are in this situation and want to simply ignore these errors, you can set up a text_factory like this:
conn = sqlite3.connect('my-database.db')
conn.text_factory = lambda b: b.decode(errors = 'ignore')
I'm writing SQL to a file on a server this way:
import codecs
f = codecs.open('translate.sql',mode='a',encoding='utf8',errors='strict')
and then writing SQL statements like this:
query = (u"""INSERT INTO story_question_response
(group_id, story_id, question_id, answer )
VALUES
(%s,#last_story_id,%s,'%s');
""" % (kw.get('to'), lookup.get(q), kw.get(q)))
f.write(query)
I have confirmed that the text was okay when I pulled it. Here is the data from the dictionary (kw) passed out to a webpage:
46:埼玉県
47:熊谷市
42:お散歩デモ
It appears correct (I want it to be utf8 escaped).
But the file.write output is garbage (encoding problems):
INSERT INTO story_question_response
(group_id, story_id, question_id, answer )
VALUES
(279,#last_story_id,62,'ãã©ã³ãã£ã¢ããã'); )
/* updating the story text on old story_id */
UPDATE story_question_response
SET answer = '大å¦ã®ããã·ã§ã¯ãã¦å¦çãæ¬å¤§éç½ã®è¢«ç½å°(岩æçã®å¤§è¹æ¸¡å¸)ã«æ´¾é£ãããããã¦ã¯ç¾å°ã®å¤ç¥ãã®ãæ$
WHERE story_id = 65591
AND question_id = 41
AND group_id = 276;
using an explicit decode gives an error:
f.write(query.decode('utf8'))
I don't know what else to try.
Question: What am I doing wrong, in writing a utf8 file?
We don't have enough information to be sure, but I'd give decent odds that your file is actually perfectly valid UTF-8, and you're just viewing it as if it were something else.
For example, on Windows, if you open a file in Notepad, by default, it will only treat it as UTF-8 if it starts with a UTF-8 BOM (which no valid file ever should, but Microsoft likes them anyway); otherwise, it will treat it as whatever your default code page is. Which is probably some Latin-1 derivative like CP1252.
So, your string of kana and kanji ends up encoded as a bunch of three-byte UTF-8 sequences like '\xe6\xad\xa9'. Then, that gets displayed in Notepad as whatever each of those bytes happen to mean in CP1252, like æ© (note that there's an invisible character between the two visible ones).
As a general rule, whenever you see weirdly-accented versions of lowercase A and E every 2 or 3 characters, that almost always means you've interpreted some CJK UTF-8 as some Latin-1-derived character set, because UTF-8 uses \xE3 through \xED as the prefix bytes for most CJK characters, and Latin-1 has accented lowercase A and E characters in that range. (Similarly, weirdly-accented capital A versions usually mean European or symbolic UTF-8 interpreted as Latin-1, especially when you've got stray Âs inserted into what looks like otherwise valid or almost-valid European text. If you look at the charts, you should be able to tell why.)
Assuming your input is utf8, you should probably use the following code to generate the query:
query = (u"""INSERT INTO story_question_response
(group_id, story_id, question_id, answer )
VALUES
(%s,#last_story_id,%s,'%s');
""" % (kw.get('to').decode('utf8'), lookup.get(q).decode('utf8'), kw.get(q).decode('utf8')))
I would also suggest trying to output the contents of kw and lookup to some log file to debug this issue.
You should use encode on objects of class unicode, and decode on objects of class str in python.
You should escape any string you insert into SQL statement to prevent nasty SQL injections.
The code above doesn't include such escaping, so be careful.
I 've been having a lot of trouble with sending unicode characters to a database for the last couple of hours. I am using the MySQLdb Python library. I had to first declare the enconding, so that I could save the file.
# -*- coding: utf-8 -*-
Than I added charset="utf8",use_unicode=True as parametres in the connect function.
db = MySQLdb.connect(host="146.247.111.111",user="xxxxxxxx",passwd="[redacted]",db="xxxxxxxx",charset="utf8",use_unicode=True)
db.set_character_set('utf8')
cursor = db.cursor()
cursor.execute("INSERT INTO xls_pravne_osebe_users(name,email,Wall_idWall) VALUES(%s,%s,1) ",("šđčđĐĐšš","ŽŽŽŽŽŽ"))
It definitely changed something, yet everything is still not the way it should be. The letters ž and š are alright, which they weren't before I added the encoding parameters, however the letters ćčđ are still not alright. All of these characters are unicode characters. What encoding is this, that properly encodes some characters and fails with others of the same family?
I create a database in mysql and use webpy to construct my web server.
But it's so strange for Chinese character between the webpy's and MySQLdb's behaviors when using them to access database respectively.
Below is my problem:
My table t_test (utf8 databse):
id name
1 测试
the utf8 code for "测试" is: \xe6\xb5\x8b\xe8\xaf\x95
when using MySQLdb to do "select" like this:
c=conn.cursor()
c.execute("SELECT * FROM t_test")
items = c.fetchall()
c.close()
print "items=%s, name=%s"%(eval_items, eval_items[1])
the result is normal, it prints:
items=(127L, '\xe6\xb5\x8b\xe8\xaf\x95'), name=测试
But when I use webpy do the same things:
db = web.database(dbn='mysql', host="127.0.0.1",
user='test', pw='test', db='db_test', charset="utf8")
eval_items=db.select('t_test')
comment=eval_items[0].name
print "comment code=%s"%repr(comment)
print "comment=%s"%comment.encode("utf8")
Chinese garble occured, the print result is:
comment code=u'\xe6\xb5\u2039\xe8\xaf\u2022'
comment=忙碌鈥姑€
I know webpy's database is also dependent on MySQLdb, but it's so different for these two way. Why?
BTW, for the reason above, I can just use MySQLdb directly to solve my Chinese character garble problem, but it loses the clolumn name in table——It's so ungraceful. I want to know how can I solve it with webpy?
Indeed, something very wrong is taking place --
as you said on your comment, the unicode repr. bytes for "测试" are E6B5 8BE8 AF95 -
which works on my utf-8 terminal here:
>>> d
'\xe6\xb5\x8b\xe8\xaf\x95'
>>> print d
测试
But look at the bytes on your "comment" unicode object:
comment code=u'\xe6\xb5\u2039\xe8\xaf\u2022'
Meaning part of your content are the utf-8 bytes for the comment
(the chars represented as "\xYY" and part is encoded as Unicode points
(the chares represented with \uYYYY ) - this indicates serious garbage.
MySQL has some catchs to proper decode (utf-8 or otherwise) encoded
text in it - one of which is passing a proper "charset" parameter
to the connection. But you did that already -
One attempt you can do is to pass the connection the option use_unicode=False -
and decode the utf-8 strings in your own code.
db = web.database(dbn='mysql', host="127.0.0.1",
user='test', pw='test', db='db_test', charset="utf8", use_unicode=False)
Check the options to connect for this and other parameters you might try:
http://mysql-python.sourceforge.net/MySQLdb.html
Regardless of getting it to work correctly, with the hints above, I got a workaround for you -- It looks like the Unicode characters (not the utf-8 raw bytes in the unicode objects)
in your encoded string are encoded in one of these encodings:
("cp1258", "cp1252", "palmos", "cp1254")
Of these, cp1252 is almost the same as "latin1" - which is the default charset MySQL uses
if it does not get the "charset" argument in the connection. But it is not only a matter
of web2py not passing it to the database, as you are getting mangled chars, not
just the wrong encoding - it is as if web2py is encoding and decoding your string back and forth, and ignoring encoding errors
From all of these encodings I could retrieve your original "测试" string,as an utf-8 byte string, doing, for example:
comment = comment.encode("cp1252", errors="ignore")
So, placing this line might work for you now, but guessing around with unicode is never good -
the proepr thing is to narrow down what is making web2py to give you those semi-decoded utf-8 strings on the first place, and make it stop there.
update
I checked here- this is what is happening - the correct utf-8 '\xe6\xb5\x8b\xe8\xaf\x95'string is read from the mysql, and before delivering it to you, (in the use_unicode=True case) 0- these bytes are being decoded as if they werhe "cp1252" - this yields the incorrect u'\xe6\xb5\u2039\xe8\xaf\u2022' unicode. It is probably a web2py error, like, it does not pass your "charset=utf8" parameter to the actual connection. When you set the "use_unicode=False" instead of giving you the raw bytes, it apparently picks the incorrect unicode, an dencode it using "utf-8" - this yields the
'\xc3\xa6\xc2\xb5\xe2\x80\xb9\xc3\xa8\xc2\xaf\xe2\x80\xa2'sequence you commented bellow (which is even more incorrect).
all in all, the workaround I mentioned above seems the only way to retrieve the original, correct string -that is, given the wrong unicode, do u'\xe6\xb5\u2039\xe8\xaf\u2022'.encode("cp1252", errors="ignore") - that is, short of
doing some other thing to set-up the database connection (or maybe update web2py or mysql drivers, if possible)
** update 2 **
I futrher checked the code in web2py dal.py file itself - it attempts to setup the connection as utf-8 by default - but it looks like it will try both MySQLdb and pymysql drivers -- if you have both installed try uninstalling pymysql, and leave only MySQLdb.