encoding problems writing UTF-8 SQL statements to a local file - python

I'm writing SQL to a file on a server this way:
import codecs
f = codecs.open('translate.sql',mode='a',encoding='utf8',errors='strict')
and then writing SQL statements like this:
query = (u"""INSERT INTO story_question_response
(group_id, story_id, question_id, answer )
VALUES
(%s,#last_story_id,%s,'%s');
""" % (kw.get('to'), lookup.get(q), kw.get(q)))
f.write(query)
I have confirmed that the text was okay when I pulled it. Here is the data from the dictionary (kw) passed out to a webpage:
46:埼玉県
47:熊谷市
42:お散歩デモ
It appears correct (I want it to be utf8 escaped).
But the file.write output is garbage (encoding problems):
INSERT INTO story_question_response
(group_id, story_id, question_id, answer )
VALUES
(279,#last_story_id,62,'ãã©ã³ãã£ã¢ããã'); )
/* updating the story text on old story_id */
UPDATE story_question_response
SET answer = '大学ã®ãã­ã·ã§ã¯ãã¦å­¦çãæ¬å¤§éç½ã®è¢«ç½å°(岩æçã®å¤§è¹æ¸¡å¸)ã«æ´¾é£ãããããã¦ã¯ç¾å°ã®å¤ç¥­ãã®ãæ$
WHERE story_id = 65591
AND question_id = 41
AND group_id = 276;
using an explicit decode gives an error:
f.write(query.decode('utf8'))
I don't know what else to try.
Question: What am I doing wrong, in writing a utf8 file?

We don't have enough information to be sure, but I'd give decent odds that your file is actually perfectly valid UTF-8, and you're just viewing it as if it were something else.
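One way to test this hypothesis without relying on any editor is to read the file back in binary mode and try to decode it as UTF-8. This is only a minimal sketch: the helper name is_valid_utf8 and the temporary path are made up for illustration, and io.open is used in place of codecs.open on the assumption that the two behave the same for this purpose.

```python
import io
import os
import tempfile

def is_valid_utf8(path):
    """Return True if the whole file decodes cleanly as UTF-8."""
    with io.open(path, "rb") as fh:
        data = fh.read()
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

# Hypothetical location, used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "translate.sql")
with io.open(path, "a", encoding="utf-8") as fh:
    # The escapes spell the Japanese place name 埼玉県.
    fh.write(u"UPDATE t SET answer = '\u57fc\u7389\u770c';\n")

file_is_utf8 = is_valid_utf8(path)
```

If the check passes, the bytes on disk are fine and the garbage comes from whatever program is displaying them.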
For example, on Windows, if you open a file in Notepad, by default it will only treat it as UTF-8 if it starts with a UTF-8 BOM (which the Unicode standard permits but does not recommend for UTF-8, though Microsoft likes them anyway); otherwise, it will treat it as whatever your default code page is, which is probably some Latin-1 derivative like CP1252.
So, your string of kana and kanji ends up encoded as a bunch of three-byte UTF-8 sequences like '\xe6\xad\xa9'. Then, that gets displayed in Notepad as whatever each of those bytes happen to mean in CP1252, like æ­© (note that there's an invisible character between the two visible ones).
As a general rule, whenever you see weirdly-accented versions of lowercase A and E every 2 or 3 characters, that almost always means you've interpreted some CJK UTF-8 as some Latin-1-derived character set, because UTF-8 uses \xE3 through \xED as the prefix bytes for most CJK characters, and Latin-1 has accented lowercase A and E characters in that range. (Similarly, weirdly-accented capital A versions usually mean European or symbolic UTF-8 interpreted as Latin-1, especially when you've got stray Âs inserted into what looks like otherwise valid or almost-valid European text. If you look at the charts, you should be able to tell why.)
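The effect is easy to reproduce. The sketch below takes the single character used as the example above, encodes it as UTF-8, and then decodes those bytes as CP1252 the way a misconfigured viewer would; the variable names are just for illustration.

```python
# -*- coding: utf-8 -*-
text = u"\u6b69"                        # the kanji 歩
utf8_bytes = text.encode("utf-8")       # the three bytes E6 AD A9
misread = utf8_bytes.decode("cp1252")   # what a CP1252 viewer would display

# Reversing the wrong decode recovers the original text.
recovered = misread.encode("cp1252").decode("utf-8")
```

The round trip back only works when every byte happens to be defined in CP1252; a few byte values (0x81, 0x8D, 0x8F, 0x90, 0x9D) are not, so this is a diagnostic aid rather than a general repair.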

Assuming your input is utf8, you should probably use the following code to generate the query:
query = (u"""INSERT INTO story_question_response
(group_id, story_id, question_id, answer )
VALUES
(%s,#last_story_id,%s,'%s');
""" % (kw.get('to').decode('utf8'), lookup.get(q).decode('utf8'), kw.get(q).decode('utf8')))
I would also suggest trying to output the contents of kw and lookup to some log file to debug this issue.
You should use encode on objects of class unicode, and decode on objects of class str in python.
You should escape any string you insert into SQL statement to prevent nasty SQL injections.
The code above doesn't include such escaping, so be careful.
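The encode/decode rule above can be wrapped in a small helper so that values which are already text, or are not strings at all (such as an integer group id), are not passed through decode by mistake. This is a sketch under assumptions: to_text is a hypothetical name, and UTF-8 is assumed for any byte strings.

```python
def to_text(value, encoding="utf-8"):
    """Return value as text, decoding only when it is a byte string."""
    if isinstance(value, bytes):        # 'str' on Python 2, 'bytes' on Python 3
        return value.decode(encoding)
    return u"%s" % (value,)             # text stays text; numbers are formatted

decoded = to_text(b"\xe5\x9f\xbc")      # UTF-8 bytes for 埼
number = to_text(279)
already_text = to_text(u"\u57fc")
```

It does nothing about SQL escaping, which still has to be handled separately.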

Related

Python 2.7 convert special characters into utf-8 bytes

I have strings that I need to replace into an URL for accessing different JSON files. My problem is that some strings have special characters and I need only these as UTF-8 bytes, so I can properly find the JSON tables.
An example:
# I have this string
a = 'code - Brasilândia'
#in the JSON url it appears as
'code%20-%20Brasil%C3%A2ndia'
I managed to get the spaces converted right using urllib.quote(), but it does not convert the special characters as I need them.
print(urllib.quote('code - Brasilândia'))
'code%20-%20Brasil%83ndia'
When I substitute this in the URL, I cannot reach the JSON table.
I managed to make this work using u before the string, u'code - Brasilândia', but this did not solve my issue, because the string will ultimately be a user input, and will need to be constantly changed.
I have tried several methods, but I could not get the result I need.
I'm specifically using python 2.7 for this project, and I cannot change it.
Any ideas?
You could try decoding the string as UTF-8, and if it fails, assume that it's Latin-1, or whichever 8-bit encoding you expect.
try:
    yourstring.decode('utf-8')
except UnicodeDecodeError:
    yourstring = yourstring.decode('latin-1').encode('utf-8')
print(urllib.quote(yourstring))
... provided you can establish the correct encoding; 0x83 seems to correspond to â only in some fairly obscure legacy encodings like code pages 437 and 850 (and those are the least obscure). See also https://tripleee.github.io/8bit/#83
(disclosure: the linked site is mine).
Demo: https://ideone.com/fjX15c
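The same idea as a reusable function, written so that it runs on both Python 2 and Python 3. The name quote_as_utf8 and its fallback parameter are assumptions for this sketch; the fallback encoding has to be whatever your input really uses.

```python
try:
    from urllib import quote            # Python 2
except ImportError:
    from urllib.parse import quote      # Python 3

def quote_as_utf8(raw, fallback="latin-1"):
    """Percent-encode a byte string, re-encoding it to UTF-8 first if needed."""
    try:
        raw.decode("utf-8")             # already valid UTF-8: leave untouched
    except UnicodeDecodeError:
        raw = raw.decode(fallback).encode("utf-8")
    return quote(raw)

from_latin1 = quote_as_utf8(b"code - Brasil\xe2ndia")
from_cp850 = quote_as_utf8(b"code - Brasil\x83ndia", fallback="cp850")
```

Both calls should produce the form that appears in the JSON URL, since 0xE2 is â in Latin-1 and 0x83 is â in code page 850.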

Regex conflict for certain characters (ISO-8859-1 Windows-1252)

all - I'm trying to perform a regex on a bunch of science data, converting certain special symbols into ASCII-friendly characters. For example, I want to replace 'µ'(UTF-8 \xc2\xb5) to the string 'micro', and '±' with '+/-'. I cooked up a python script to do this, which looks like this:
import re
def stripChars(string):
    outString = (re.sub(r'\xc2\xb5+','micro', string)) #Metric 'micro (10^-6)' (Greek 'mu') letter
    outString = (re.sub(r'\xc2\xb1+','+/-', outString)) #Scientific 'Plus-Minus' symbol
    return outString
However, for these two specific characters, I'm getting strange results. I dug into it a bit, and it looks like I'm suffering from the bug described here, in which certain characters come out wrong because they are UTF data being interpreted as Windows-1252 (or ISO 8859-1).
I grepped the relevant data and found that it shows the erroneous result there as well (e.g. the 'µ' appears as 'Âµ'). However, elsewhere in the same data set there are data in which the same symbol is displayed correctly. This may be due to a bug in the system which collected the data in the first place. The real weirdness is that it seems my current code only catches the incorrect version, letting the correct one pass through.
In any case, I'm really stuck on how to proceed. I need to be able to come up with a series of regex substitutions which will catch both the correct and incorrect versions of these characters, but the identifier for the correct version is failing in this case.
I must admit, I'm still fairly junior to programming, and anything more than the most basic regex is still like black magic to me. This problem seems a bit more intractable than any I've had to tackle before, and that's why I bring it to here to get some more eyes on it.
Thanks!
If your input data is encoded as UTF-8, your code should work. Here’s a
complete program that works for me. It assumes the input is UTF-8 and
simply operates on the raw bytes, not converting to or from Unicode.
Note that I removed the + from the end of each input regex; that
would accept one or more of the last character, which you probably
didn’t intend.
import re
def stripChars(s):
    s = (re.sub(r'\xc2\xb5', 'micro', s)) # micro
    s = (re.sub(r'\xc2\xb1', '+/-', s)) # plus-or-minus
    return s
f_in = open('data')
f_out = open('output', 'w')
for line in f_in:
    print(type(line))
    line = stripChars(line)
    f_out.write(line)
If your data is encoded some other way (see for example this
question for how to tell), this version will be more useful. You can
specify any encoding for input and output. It decodes to internal
Unicode on reading, acts on that when replacing, then encodes on
writing.
import codecs
import re
encoding_in = 'iso8859-1'
encoding_out = 'ascii'
def stripChars(s):
    s = (re.sub(u'\u00B5', 'micro', s)) # micro
    s = (re.sub(u'\u00B1', '+/-', s)) # plus-or-minus
    return s
f_in = codecs.open('data-8859', 'r', encoding_in)
f_out = codecs.open('output', 'w', encoding_out)
for uline in f_in:
    uline = stripChars(uline)
    f_out.write(uline)
Note that it will raise an exception if it tries to write non-ASCII data
with an ASCII encoding. The easy way to avoid this is to just write
UTF-8, but then you may not notice uncaught characters. You can catch
the exception and do something graceful. Or you can let the program
crash and update it for the character(s) you’re missing.
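If you would rather get a list of the characters you still need to handle than a crash on the first one, you can test each character against the output encoding before writing. A small sketch; the function name unencodable_chars is made up for this example.

```python
def unencodable_chars(text, encoding="ascii"):
    """Return the sorted, de-duplicated characters that encoding cannot represent."""
    bad = set()
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            bad.add(ch)
    return sorted(bad)

# micro sign, plus-minus sign and Greek capital omega in otherwise ASCII text
leftovers = unencodable_chars(u"5 \u00b5m \u00b1 2 \u03a9")
```

Running this over the decoded input before the replacement step gives you the full set of symbols that need a substitution rule.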
OK, since you use Python 2, you read the file as byte strings, and your code should successfully translate all UTF-8 encoded versions of µ (U+00B5) or ± (U+00B1).
This is consistent with what you say later:
my current code only catches the incorrect version, letting the correct one pass through
This is in fact the expected behavior. Let us first look at what exactly happens for µ. µ is u'\u00b5'; it is encoded in UTF-8 as '\xc2\xb5' and in Latin1 or cp1252 as '\xb5'. As 'Â' is U+00C2, its Latin1 or cp1252 code is 0xc2. That means that a µ character correctly encoded in UTF-8 will read as Âµ on a Windows-1252 system. And when it looks correct there, it is because it is not UTF-8 encoded but Latin1 encoded.
It looks like you are trying to process a file where some parts are UTF-8 encoded while others are Latin1 (or cp1252) encoded. You really should try to fix that in the system that collects the data, because it can cause trouble that is hard to recover from.
The good news is that it can be fixed here because you only want to process 2 non-ASCII characters: you just have to replace the UTF-8 version as you do, and then replace the Latin1 version in a second pass. The code could be (no need for regexes here):
def stripChars(string):
    outString = string.replace('\xc2\xb5','micro') #Metric 'micro (10^-6)' (Greek 'mu') letter in utf-8
    outString = outString.replace('\xb5','micro') #Metric 'micro (10^-6)' (Greek 'mu') letter in Latin1
    outString = outString.replace('\xc2\xb1','+/-') #Scientific 'Plus-Minus' symbol in utf-8
    outString = outString.replace('\xb1','+/-') #Scientific 'Plus-Minus' symbol in Latin1
    return outString
For reference, Latin1 (AKA ISO-8859-1) maps each byte value to the Unicode code point with the same number, so it covers exactly the code points below 256. Windows code page 1252 (cp1252 in Python) is a Windows variation of Latin1 in which byte values that Latin1 reserves for rarely used control characters (0x80 to 0x9F) are used for printable characters with higher code points. For example, € (U+20AC) is encoded as '\x80' in cp1252, while it does not exist at all in Latin1.
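The two-pass replacement can also be written with explicit bytes literals, which makes it run unchanged on Python 3 as well. This is only a sketch (strip_chars is an illustrative name), and it assumes the data contains no other non-ASCII characters: a lone 0xB5 or 0xB1 byte is also the tail of other UTF-8 sequences (for example õ is C3 B5), so on richer data the second pass would corrupt them.

```python
def strip_chars(data):
    """Replace the micro and plus-minus signs in mixed UTF-8/Latin-1 bytes."""
    # The UTF-8 forms must go first; otherwise the Latin-1 pass would eat
    # the second byte and leave a stray 0xC2 behind.
    data = data.replace(b"\xc2\xb5", b"micro")   # micro sign, UTF-8
    data = data.replace(b"\xb5", b"micro")       # micro sign, Latin-1
    data = data.replace(b"\xc2\xb1", b"+/-")     # plus-minus, UTF-8
    data = data.replace(b"\xb1", b"+/-")         # plus-minus, Latin-1
    return data

cleaned = strip_chars(b"5 \xc2\xb5m \xb1 1 \xb5m")
```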

Save Persian string into mysql database with python

I have a variable which contains a string in Persian language, and I cannot save that string into the database correctly. I am using flask for REST API, and I am getting the string from client. Here's my code:
@app.route('/getfile',methods=['POST'])
def get_file():
    #check the validity of json format
    if not request.json or not 'FileName' in request.json:
        abort(400)
    if not request.json or not 'FilePath' in request.json:
        abort(400)
    if not request.json or not 'Message' in request.json:
        abort(400)
    #retrieve data from request
    filename_=request.json['FileName']
    filepath_=request.json['FilePath']
    message_=request.json['Message']
    try:
        conn = mysql.connector.connect(host=DBhost,database=DBname,user=DBusername,password=DBpassword)
    except:
        return jsonify({'Result':'Error, Could not connect to database.'})
    cursor_ = conn.cursor()
    query_ = "INSERT INTO sms_excel_files VALUES(null,%s,%s,%s,0)"
    data_ =(filename_,Dst_num_file,message_)
    cursor_.execute(query_, data_)
    last_row_id_=cursor_.lastrowid
    conn.commit()
The variable in question is message_. I can save English texts correctly, but not Persian ones. I also added # -*- coding: utf-8 -*- at the top of my code, but this did not solve the problem. But if I manually fill message_ with a Persian string, it is saved correctly to the database. Furthermore, if I simply return the value of message_, it is correct.
For example, this is what gets inserted into the database when message_ contains the string 'سلام':
&#1587;&#1604;&#1575;&#1605;
Any help is appreciated.
Please note that this is the first time I am trying to read Arabic / Persian characters, so the following information might not be correct (I could have made a mistake when comparing my test output with the Persian string you have shown in your question). Furthermore, I never have heard of flask so far.
Having said this:
1587 1604 1575 1605 is the sequence of code points which represents the Persian string you have shown in Unicode. Now, in HTML, Unicode code points (in decimal) can be encoded as entities in the form &#xxxx;. So the string &#1587;&#1604;&#1575;&#1605; is one of the allowed forms of representation of that string in HTML.
Given that, there could be two possible reasons for the misbehavior:
1) request.json['Message'] already contains / returns HTML (and not natural text) and (for some reason I don't know) contains / returns the string in question in HTML-entity encoded form. So this is the first thing you should check.
2) cursor_.execute(...) somehow encodes the string into HTML and thereby (for some reason I don't know) encodes your string in question into HTML-entity encoded form. Perhaps you have told your database driver to encode non-ASCII characters in message_ as HTML entities?
For further analysis, you could check what happens in a test case where request.json['Message'] contains / returns only ASCII characters.
If ASCII characters are written into the database as HTML entities as well, there must be a basic problem which causes all characters without exception to be encoded into HTML entities.
Otherwise, you possibly have not told your database, your database drivers or your file system drivers which encoding to use. In such cases, ASCII characters are often treated correctly, whereas weird things happen to non-ASCII characters. Automatically encoding non-ASCII characters to HTML entities during file IO or database operations would be very unusual, though. But as mentioned above, I don't know flask ...
Please consult the MySQL manual to see how to set the character encoding for databases, tables, columns and connections, your database driver documentation to see which other things you must do to get this encoding to be handled correctly, and your interpreter's and its libraries' manuals to see how to correctly set that encoding for file IO (CGI works via STDIN / STDOUT).
You make your life a lot easier if the database character encodings and the file IO encoding are all the same. Personally, I always use UTF-8.
A final note: Since I don't know anything about flask, I don't know what # -*- coding: utf-8 -*- is supposed to do. But chances are that this only tells the interpreter how the script itself is encoded, but not which encoding to use for input / output / database operations.
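For what it is worth, Python itself can produce exactly this kind of output: encoding text to ASCII with the standard xmlcharrefreplace error handler turns each non-ASCII character into a decimal HTML entity. Whether anything in the asker's stack actually does this is an assumption; the sketch only shows the mechanism and how to reverse it.

```python
try:
    from html import unescape             # Python 3.4+
except ImportError:
    from HTMLParser import HTMLParser      # Python 2
    unescape = HTMLParser().unescape

text = u"\u0633\u0644\u0627\u0645"         # the Persian string from the question
entities = text.encode("ascii", "xmlcharrefreplace")

# Going the other way recovers the original text from stored entities.
restored = unescape(entities.decode("ascii"))
```

Comparing request.json['Message'] against both forms would tell you whether the entities arrive from the client or are introduced later.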
Try this code. It uses the MySQLdb library, which is almost like the library you are using (install it with pip first).
I tried to set "utf-8" in all possible ways.
# -*- coding: utf-8 -*-
import MySQLdb
# Open database connection
try:
    db = MySQLdb.connect(host="localhost",
                         user="root",
                         passwd="",
                         db="db_name"
                         #,unix_socket="/opt/lampp/var/mysql/mysql.sock"
                         )
    db.set_character_set('utf8')
    crsr = db.cursor(MySQLdb.cursors.DictCursor)
    crsr.execute('SET NAMES utf8;')
    crsr.execute('SET CHARACTER SET utf8;')
    crsr.execute('SET character_set_connection=utf8;')
except MySQLdb.Error as e:
    print e

Why is Chinese text garbled when using webpy but fine when using MySQLdb?

I created a database in MySQL and use webpy to build my web server.
But webpy and MySQLdb behave strangely differently with Chinese characters when each is used to access the database.
Below is my problem:
My table t_test (utf8 database):
id name
1 测试
the utf8 code for "测试" is: \xe6\xb5\x8b\xe8\xaf\x95
when using MySQLdb to do "select" like this:
c=conn.cursor()
c.execute("SELECT * FROM t_test")
items = c.fetchall()
c.close()
print "items=%s, name=%s"%(items[0], items[0][1])
the result is normal, it prints:
items=(127L, '\xe6\xb5\x8b\xe8\xaf\x95'), name=测试
But when I use webpy to do the same thing:
db = web.database(dbn='mysql', host="127.0.0.1",
user='test', pw='test', db='db_test', charset="utf8")
eval_items=db.select('t_test')
comment=eval_items[0].name
print "comment code=%s"%repr(comment)
print "comment=%s"%comment.encode("utf8")
the Chinese text is garbled. The printed result is:
comment code=u'\xe6\xb5\u2039\xe8\xaf\u2022'
comment=忙碌鈥姑€
I know webpy's database module also depends on MySQLdb, but the two approaches behave so differently. Why?
BTW, because of the above, I can just use MySQLdb directly to solve my garbled Chinese character problem, but then I lose the column names of the table, which is inelegant. I want to know how I can solve it with webpy.
Indeed, something very wrong is taking place.
As you said in your comment, the UTF-8 bytes for "测试" are E6B5 8BE8 AF95,
which works on my UTF-8 terminal here:
>>> d
'\xe6\xb5\x8b\xe8\xaf\x95'
>>> print d
测试
But look at the bytes on your "comment" unicode object:
comment code=u'\xe6\xb5\u2039\xe8\xaf\u2022'
Meaning part of your content is the UTF-8 bytes for the comment
(the chars represented as "\xYY") and part is encoded as Unicode code points
(the chars represented as "\uYYYY"). This indicates serious garbage.
MySQL has some catches when it comes to properly decoding (UTF-8 or otherwise) encoded
text in it, one of which is passing a proper "charset" parameter
to the connection. But you did that already.
One thing you can try is to pass the connection the option use_unicode=False
and decode the UTF-8 strings in your own code.
db = web.database(dbn='mysql', host="127.0.0.1",
user='test', pw='test', db='db_test', charset="utf8", use_unicode=False)
Check the options to connect for this and other parameters you might try:
http://mysql-python.sourceforge.net/MySQLdb.html
Regardless of getting it to work correctly with the hints above, I have a workaround for you. It looks like the Unicode characters (not the raw UTF-8 bytes in the unicode objects)
in your garbled string come from one of these encodings:
("cp1258", "cp1252", "palmos", "cp1254")
Of these, cp1252 is almost the same as "latin1", which is the default charset MySQL uses
if it does not get the "charset" argument in the connection. But it is not only a matter
of web2py not passing it to the database, as you are getting mangled chars, not
just the wrong encoding. It is as if web2py is encoding and decoding your string back and forth, and ignoring encoding errors.
With any of these encodings I could retrieve your original "测试" string as a UTF-8 byte string by doing, for example:
comment = comment.encode("cp1252", errors="ignore")
So, adding this line might work for you for now, but guessing around with Unicode is never good.
The proper thing is to narrow down what is making web2py give you those semi-decoded UTF-8 strings in the first place, and make it stop there.
update
I checked here, and this is what is happening: the correct UTF-8 string '\xe6\xb5\x8b\xe8\xaf\x95' is read from MySQL, and before it is delivered to you (in the use_unicode=True case) these bytes are decoded as if they were "cp1252". This yields the incorrect u'\xe6\xb5\u2039\xe8\xaf\u2022' unicode. It is probably a web2py error; for example, it may not pass your "charset=utf8" parameter to the actual connection. When you set "use_unicode=False", instead of giving you the raw bytes, it apparently takes the incorrect unicode and encodes it using "utf-8". This yields the '\xc3\xa6\xc2\xb5\xe2\x80\xb9\xc3\xa8\xc2\xaf\xe2\x80\xa2' sequence you mentioned in the comments below (which is even more incorrect).
All in all, the workaround I mentioned above seems to be the only way to retrieve the original, correct string: given the wrong unicode, do u'\xe6\xb5\u2039\xe8\xaf\u2022'.encode("cp1252", errors="ignore"). That is, short of doing something else to set up the database connection (or maybe updating web2py or the MySQL drivers, if possible).
** update 2 **
I further checked the code in the web2py dal.py file itself. It attempts to set up the connection as UTF-8 by default, but it looks like it will try both the MySQLdb and pymysql drivers. If you have both installed, try uninstalling pymysql and leaving only MySQLdb.
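The chain of events described in this answer can be replayed in a few lines, which also confirms the workaround. The variable names are only for illustration; the byte and code point values are the ones quoted in the question and comments.

```python
# -*- coding: utf-8 -*-
stored = b"\xe6\xb5\x8b\xe8\xaf\x95"         # correct UTF-8 bytes for 测试

# Step 1: the bytes are wrongly decoded as cp1252 instead of UTF-8.
garbled = stored.decode("cp1252")

# Step 2 (use_unicode=False case): the garbled text is encoded as UTF-8 again.
double_encoded = garbled.encode("utf-8")

# Workaround: undo the wrong decode, then decode correctly.
repaired = garbled.encode("cp1252").decode("utf-8")
```

Here garbled matches the repr printed by webpy, double_encoded matches the longer byte sequence from the comments, and repaired is the original two-character string.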

Latin-1 and the unicode factory in Python

I have a Python 2.6 script that is gagging on special characters, encoded in Latin-1, that I am retrieving from a SQL Server database. I would like to print these characters, but I'm somewhat limited because I am using a library that calls the unicode factory, and I don't know how to make Python use a codec other than ascii.
The script is a simple tool to return lookup data from a database without having to execute the SQL directly in a SQL editor. I use the PrettyTable 0.5 library to display the results.
The core of the script is this bit of code. The tuples I get from the cursor contain integer and string data, and no Unicode data. (I'd use adodbapi instead of pyodbc, which would get me Unicode, but adodbapi gives me other problems.)
x = pyodbc.connect(cxnstring)
r = x.cursor()
r.execute(sql)
t = PrettyTable(columns)
for rec in r:
    t.add_row(rec)
r.close()
x.close()
t.set_field_align("ID", 'r')
t.set_field_align("Name", 'l')
print t
But the Name column can contain characters that fall outside the ASCII range. I'll sometimes get an error message like this, in line 222 of prettytable.pyc, when it gets to the t.add_row call:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xed in position 12: ordinal not in range(128)
This is line 222 in prettytable.py. It uses unicode, which is the source of my problems, and not just in this script, but in other Python scripts that I have written.
for i in range(0,len(row)):
    if len(unicode(row[i])) > self.widths[i]: # This is line 222
        self.widths[i] = len(unicode(row[i]))
Please tell me what I'm doing wrong here. How can I make unicode work without hacking prettytable.py or any of the other libraries that I use? Is there even a way to do this?
EDIT: The error occurs not at the print statement, but at the t.add_row call.
EDIT: With Bastien Léonard's help, I came up with the following solution. It's not a panacea, but it works.
x = pyodbc.connect(cxnstring)
r = x.cursor()
r.execute(sql)
t = PrettyTable(columns)
for rec in r:
    urec = [s.decode('latin-1') if isinstance(s, str) else s for s in rec]
    t.add_row(urec)
r.close()
x.close()
t.set_field_align("ID", 'r')
t.set_field_align("Name", 'l')
print t.get_string().encode('latin-1')
I ended up having to decode on the way in and encode on the way out. All of this makes me hopeful that everybody ports their libraries to Python 3.x sooner than later!
Add this at the beginning of the module:
# coding: latin1
Or decode the string to Unicode yourself.
[Edit]
It's been a while since I played with Unicode, but hopefully this example will show how to convert from Latin1 to Unicode:
>>> s = u'ééé'.encode('latin1') # a string you may get from the database
>>> s.decode('latin1')
u'\xe9\xe9\xe9'
[Edit]
Documentation:
http://docs.python.org/howto/unicode.html
http://docs.python.org/library/codecs.html
Maybe try to decode the latin1-encoded strings into unicode?
t.add_row((value.decode('latin1') for value in rec))
After a quick peek at the source for PrettyTable, it appears that it works on unicode objects internally (see _stringify_row, add_row and add_column, for example). Since it doesn't know what encoding your input strings are using, it uses the default encoding, usually ascii.
Now ascii is a subset of latin-1, which means if you're converting from ascii to latin-1, you shouldn't have any problems. The reverse however, isn't true; not all latin-1 characters map to ascii characters. To demonstrate this:
>>> s = '\xed\x31\x32\x33'
>>> print unicode(s)
# FAILS: Python calls "s.decode('ascii')", but ascii codec can't decode '\xed'
>>> print s.decode('ascii')
# FAILS: Same as above
>>> print s.decode('latin-1')
í123
Explicitly converting the strings to unicode (like you eventually did) fixes things, and makes more sense, IMO -- you're more likely to know what charset your data is using than the author of PrettyTable :). BTW, unicode(s, 'latin-1') is equivalent to s.decode('latin-1') for byte strings, but it raises a TypeError for non-string values such as integers, so keep the isinstance check in your list comprehension.
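The decode-on-the-way-in step can be pulled out into a small function that leaves integers and other non-string values alone. A sketch, assuming the rows really are Latin-1; decode_row is an illustrative name, and the bytes check corresponds to str on Python 2.

```python
def decode_row(row, encoding="latin-1"):
    """Decode the byte strings in a database row, leaving other values as they are."""
    return [value.decode(encoding) if isinstance(value, bytes) else value
            for value in row]

# A made-up row with an integer ID and a Latin-1 encoded name.
urec = decode_row((1, b"Mar\xeda"))
```

Each decoded row can then be handed to t.add_row, and the final table encoded once when it is printed.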
One last thing: don't forget to check the character set of your database and tables -- you don't want to assume 'latin-1' in code, when the data is actually being stored as something else ('utf-8'?) in the database. In MySQL, you can use the SHOW CREATE TABLE <table_name> command to find out what character set a table is using, and SHOW CREATE DATABASE <db_name> to do the same for a database.
