Python Cyrillic decode

I'm trying to print Cyrillic characters selected from MySQL. Here is my code (the content in the DB is cp1251):
>>> db = MySQLdb.connect(host="localhost", user="XXX", passwd="XXXX" )
>>> cursor = db.cursor()
>>> cursor.execute("""select id,title,cat,text,tags,date from db1.table1;""")
>>> test=cursor.fetchone()
>>> somevar=test[1]
>>> somevar=somevar.decode('utf8')
>>> print somevar
Result: ?????? ?? ????????
Please guide me on how to print this correctly. Thanks.

This helped me (got it from here):
db = MySQLdb.connect("localhost", config.db_user, config.db_pwd, config.db_name)
# here's the magic
db.set_character_set("utf8")
dbc = db.cursor()
dbc.execute("SET NAMES utf8;")
dbc.execute("SET CHARACTER SET utf8;")
dbc.execute("SET character_set_connection=utf8;")
# and here goes your SELECT for cyrillic fields
dbc.execute("SELECT id, title, cat, text, tags, date FROM db1.table1;")
# and then you just get the results
test = dbc.fetchone()
somevar = test[1]
print somevar

Try this:
somevar = somevar.decode('cp1251')
If that does not help, try adding the charset='cp1251' parameter to MySQLdb.connect. There is also a use_unicode parameter; maybe you should use it too.
All connect parameters can be found here: https://github.com/farcepest/MySQLdb1/blob/master/MySQLdb/connections.py
use_unicode
    If True, text-like columns are returned as unicode objects
    using the connection's character set. Otherwise, text-like
    columns are returned as normal strings. Unicode objects will
    always be encoded to the connection's character set regardless
    of this setting.
charset
    If supplied, the connection character set will be changed
    to this character set (MySQL-4.1 and newer). This implies
    use_unicode=True.
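To see why decoding cp1251 bytes as UTF-8 fails, here is a minimal Python 3 sketch; the sample Cyrillic phrase is arbitrary, not from the original database:

```python
# Decoding a cp1251-encoded value with the wrong codec vs. the right one.
raw = 'Привет мир'.encode('cp1251')   # bytes as a cp1251 column would store them

try:
    raw.decode('utf8')                # wrong codec: cp1251 bytes are not valid UTF-8
except UnicodeDecodeError as exc:
    print('utf8 decode fails:', exc.reason)

print(raw.decode('cp1251'))           # right codec restores the original text
```

The question marks in the question's output are a further symptom: the terminal could not render the mis-decoded text either.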


How to fix python hardcoded dictionary encoding issue

Error:
pymysql.err.InternalError: (1366, "Incorrect string value: '\\xEF\\xBF\\xBD 20...' for column 'history' at row 1")
I've received a few variations of this as I've tried to tweak my dictionary, always in the history column; the only variation is the characters it tells me are issues.
I can't post the dictionary because it contains sensitive information, but here is the gist:
I started with 200 addresses (including state, zip, etc) that needed
to be validated, normalized and standardized for DB insertion.
I spent a lot of time on google maps validating and standardizing.
I decided to get fancy and put all the crazy accented letters back into the addresses of these world addresses (often copied from Google, because I don't know how to type an A with an o over it, lol), from Singapore to Brazil, everywhere.
I ended up with 120 unique addresses in my dictionary after processing.
Everything works 100% perfectly when inserting the data into SQLite and outputting to a CSV. The issue is exclusively with MySQL and some sneaky un-viewable characters.
Note: I used this to remove the accents after 7 hours of copy/pasting to Notepad, encoding it with Notepad++, and just trying to process the data in a way that made it all the correct encoding. I think I lost the version with the accents and only have this tool's output now.
I do not see "\xEF\xBF\xBD 20..." in my dictionary; I only see text. Currently I don't even see "20"... those two characters helped me find the previous issues.
Code I can show:
def insert_tables(cursor, assets_final, ips_final):
    # Insert asset data into the asset table
    field_names_dict = get_asset_field_names(assets_final)
    sql_field_names = ",".join(field_names_dict.keys())
    for key, row in assets_final.items():
        insert_sql = 'INSERT INTO asset(' + sql_field_names + ') VALUES ("' + '","'.join(field_value.replace('"', "'") for field_value in list(row.values())) + '")'
        print(insert_sql)
        cursor.execute(insert_sql)
    # Insert IP data into the ip table
    field_names_dict = get_ip_field_names(ips_final)
    sql_field_names = ",".join(field_names_dict.keys())
    for hostname_key, ip_dict in ips_final.items():
        for ip_key, ip_row in ip_dict.items():
            insert_sql = 'INSERT INTO ip(' + sql_field_names + ') VALUES ("' + '","'.join(field_value.replace('"', "'") for field_value in list(ip_row.values())) + '")'
            print(insert_sql)
            cursor.execute(insert_sql)

def output_sqlite_db(sqlite_file, assets_final, ips_final):
    conn = sqlite3.connect(sqlite_file)
    cursor = conn.cursor()
    insert_tables(cursor, assets_final, ips_final)
    conn.commit()
    conn.close()

def output_mysql_db(assets_final, ips_final):
    conn = mysql.connect(host=config.mysql_ip, port=config.mysql_port, user=config.mysql_user, password=config.mysql_password, charset="utf8mb4", use_unicode=True)
    cursor = conn.cursor()
    cursor.execute('USE ' + config.mysql_DB)
    insert_tables(cursor, assets_final, ips_final)
    conn.commit()
    conn.close()
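Separately from the charset question, building INSERT statements by string concatenation as above is fragile with quotes and unusual characters. A parameterized query lets the driver handle escaping and encoding. A minimal sketch with sqlite3 and made-up table/column names (the same placeholder style works with pymysql using %s):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE asset (name TEXT, history TEXT)')

# Values with quotes and non-ASCII characters need no manual escaping.
row = {'name': 'server-1', 'history': 'moved to "DC-2", owner \u0141ukasz'}
columns = ','.join(row)                    # 'name,history'
placeholders = ','.join('?' for _ in row)  # '?,?'
cur.execute('INSERT INTO asset (%s) VALUES (%s)' % (columns, placeholders),
            list(row.values()))
conn.commit()

print(cur.execute('SELECT history FROM asset').fetchone()[0])
```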
EDIT: Could this have something to do with the fact that I'm using Cygwin as my terminal? HA! I added this line and got a different message (now using the accented version again):
cursor.execute('SET NAMES utf8')
Error:
pymysql.err.InternalError: (1366, "Incorrect string value: '\\xC5\\x81A II...' for column 'history' at row 1")
I can shine a bit of light on the messages that you have supplied:
Case 1:
>>> import unicodedata as ucd
>>> s1 = b"\xEF\xBF\xBD"
>>> s1
b'\xef\xbf\xbd'
>>> u1 = s1.decode('utf8')
>>> u1
'\ufffd'
>>> ucd.name(u1)
'REPLACEMENT CHARACTER'
>>>
Looks like you have obtained some bytes encoded in an encoding other than UTF-8 (e.g. cp1252) and then tried bytes.decode(encoding='utf8', errors='strict'). This detected some errors. You then decoded again with errors='replace'. This raised no exceptions, but your data has had the error bytes replaced by the replacement character (U+FFFD). Then you encoded your data using str.encode so that you could write it to a file or database. Each replacement character turns up as the 3 hex bytes EF BF BD.
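That round trip (wrong decode with errors='replace', then re-encode) can be reproduced in a few lines of Python 3; the byte 0xC5 ('Å' in cp1252) is chosen only for illustration:

```python
original = 'Å 20'.encode('cp1252')                # b'\xc5 20', cp1252 bytes
text = original.decode('utf8', errors='replace')  # wrong codec; the bad byte becomes U+FFFD
print(repr(text))                                 # '\ufffd 20'
reencoded = text.encode('utf8')
print(reencoded)                                  # b'\xef\xbf\xbd 20' -- the bytes MySQL rejected
```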
... more to come
Case 2:
>>> s2 = b"\xC5\x81A II"
>>> s2
b'\xc5\x81A II'
>>> u2 = s2.decode('utf8')
>>> u2
'\u0141A II'
>>> ucd.name(u2[0])
'LATIN CAPITAL LETTER L WITH STROKE'
>>>

Why doesn't Python convert \n to a newline when queried from SQLite?

I want to query text from SQLite that contains \n, and I expect Python to convert it to a newline, but it doesn't. I also tried changing the \n in the database, but it still isn't converted.
cursor.execute('''SELECT test FROM table_name''')
for row in cursor:
    self.ui.textEdit.append(row[0])
    # or
    print row[0]
I also tried unicode(row[0]) and it doesn't work. I am surprised there is no easy solution for this on the web.
Neither SQLite nor Python convert characters in strings (except for \ escapes in a Python string written in the source code).
Newlines work correctly if you handle them correctly:
>>> import sqlite3
>>> db = sqlite3.connect(':memory:')
>>> c = db.cursor()
>>> c.execute('create table t(x)')
>>> c.execute("insert into t values ('x\ny')")
>>> c.execute("insert into t values ('x\\ny')")
>>> c.execute("select * from t")
>>> for row in c:
... print row[0]
...
x
y
x\ny
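If the database really does contain the two characters backslash and n (rather than a newline), you have to unescape them yourself after fetching; a minimal sketch:

```python
s = 'x\\ny'                    # the two characters backslash and n, as stored
print(s)                       # x\ny
print(s.replace('\\n', '\n'))  # x, a real newline, y
```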

IndexError: string index out of range when referencing first character

Here is my code (currently):
conn = sqlite3.connect(db)
conn.text_factory = str  # bugger 8-bit bytestrings
cur = conn.cursor()
reader = csv.reader(open(csvfile, "rU"), delimiter='\t')
for Number, Name, Message, Datetime, Type in reader:
    # populate subscriber table
    if str(Number)[0] == '1':  # errors on this line
        tmpNumber = str(Number)[1:]
        Number = int(tmpNumber)
    cur.execute('INSERT OR IGNORE INTO subscriber (name, phone_number) VALUES (?,?)', (Name, Number))
cur.close()
conn.close()
It returns this error on the line commented to indicate where the error lies:
IndexError: string index out of range
All of the numbers have values, but if the phone number starts with a 1 I want to remove the 1 before inserting it into the database. Why won't this work? I've converted it to a string before trying to reference the first character, so I don't understand why this isn't working.
Seems like you are getting an empty string. Try replacing your if statement with the following and see if it works.
if str(Number).startswith('1'):
(Edited to reflect @kindall's suggestion of using startswith instead of slicing with [:1].)
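The difference is that indexing an empty string raises IndexError, while startswith simply returns False. A quick Python sketch (the sample numbers are made up):

```python
for number in ('15551234567', '', '4471234567'):
    # number[0] would raise IndexError on the empty string;
    # startswith('1') just returns False instead.
    if number.startswith('1'):
        number = number[1:]
    print(repr(number))
```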

Python to show special characters

I know there are tons of threads regarding this issue but I have not managed to find one which solves my problem.
I am trying to print a string but when printed it doesn't show special characters (e.g. æ, ø, å, ö and ü). When I print the string using repr() this is what I get:
u'Von D\xc3\xbc' and u'\xc3\x96berg'
Does anyone know how I can convert this to Von Dü and Öberg? It's important to me that these characters are not ignored, e.g. myStr.encode("ascii", "ignore").
EDIT
This is the code I use. I use BeautifulSoup to scrape a website. The contents of a cell (<td>) in a table (<table>), is put into the variable name. This is the variable which contains special characters that I cannot print.
web = urllib2.urlopen(url)
soup = BeautifulSoup(web)
tables = soup.find_all("table")
scene_tables = [2, 3, 6, 7, 10]
scene_index = 0
# Iterate over the <table>s we want to work with
for scene_table in scene_tables:
    i = 0
    # Iterate over <td> to find time and name
    for td in tables[scene_table].find_all("td"):
        if i % 2 == 0:  # td contains the time
            time = remove_whitespace(td.get_text())
        else:  # td contains the name
            name = remove_whitespace(td.get_text())  # This is the variable containing "nonsense"
            print "%s: %s" % (time, name,)
        i += 1
    scene_index += 1
Prevention is better than cure. What you need is to find out how that rubbish is being created. Please edit your question to show the code that creates it, and then we can help you fix it. It looks like somebody has done:
your_unicode_string = original_utf8_encoded_bytestring.decode('latin1')
The cure is to reverse the process, simply, and then decode.
correct_unicode_string = your_unicode_string.encode('latin1').decode('utf8')
Update: Based on the code that you supplied, the probable cause is that the website declares that it is encoded in ISO-8859-1 (aka latin1) but in reality it is encoded in UTF-8. Please update your question to show us the url.
If you can't show it, read the BS docs; it looks like you'll need to use:
BeautifulSoup(web, from_encoding='utf8')
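The repair described above (re-encode as latin1, then decode as UTF-8) can be checked in isolation; a Python 3 sketch using the string from the question:

```python
mojibake = 'Von D\xc3\xbc'   # UTF-8 bytes that were mistakenly decoded as latin1
fixed = mojibake.encode('latin1').decode('utf8')
print(fixed)                 # Von Dü
```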
Unicode support in many languages is confusing, so your error here is understandable. Those strings are UTF-8 bytes, which would work properly if you drop the u at the front:
>>> err = u'\xc3\x96berg'
>>> print err
Ã?berg
>>> x = '\xc3\x96berg'
>>> print x
Öberg
>>> u = x.decode('utf-8')
>>> u
u'\xd6berg'
>>> print u
Öberg
For lots more information:
http://www.joelonsoftware.com/articles/Unicode.html
http://docs.python.org/howto/unicode.html
You should really really read those links and understand what is going on before proceeding. If, however, you absolutely need to have something that works today, you can use this horrible hack that I am embarrassed to post publicly:
def convert_fake_unicode_to_real_unicode(string):
    return ''.join(map(chr, map(ord, string))).decode('utf-8')
The contents of the strings are not unicode, they are UTF-8 encoded.
>>> print u'Von D\xc3\xbc'
Von Dü
>>> print 'Von D\xc3\xbc'
Von Dü
>>> print unicode('Von D\xc3\xbc', 'utf-8')
Von Dü
>>>
Edit:
>>> print '\xc3\x96berg' # no unicode identifier, works as expected because it's an UTF-8 encoded string
Öberg
>>> print u'\xc3\x96berg' # has unicode identifier, means print uses the unicode charset now, outputs weird stuff
Ãberg
# Look at the differing object types:
>>> type('\xc3\x96berg')
<type 'str'>
>>> type(u'\xc3\x96berg')
<type 'unicode'>
>>> '\xc3\x96berg'.decode('utf-8') # this command converts from UTF-8 to unicode, look at the unicode identifier in the output
u'\xd6berg'
>>> unicode('\xc3\x96berg', 'utf-8') # this does the same thing
u'\xd6berg'
>>> unicode(u'foo bar', 'utf-8') # trying to convert a unicode string to unicode will fail as expected
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: decoding Unicode is not supported

How to correctly compare a unicode string from psycopg2 in Python?

I have a problem with comparing a UTF-8 string obtained from PostgreSQL database:
>>> db_conn = psycopg2.connect("dbname='foo' user='foo' host='localhost' password='xxx'")
>>> db_cursor = db_conn.cursor()
>>> sql_com = ("""SELECT my_text FROM table WHERE id = 1""")
>>> db_cursor.execute(sql_com)
>>> sql_result = db_cursor.fetchone()
>>> db_conn.commit()
>>> db_conn.close()
>>> a = sql_result[0]
>>> a
u'M\xfcnchen'
>>> type(a)
<type 'unicode'>
>>> print a
München
>>> b = u'München'
>>> type(b)
<type 'unicode'>
>>> print b
München
>>> a == b
False
I am really confused why this is so. Can someone tell me how I should compare a string with an umlaut from the database to another string, so that the comparison is true? My database is UTF8:
postgres#localhost:$ psql -l
List of databases
Name | Owner | Encoding
-----------+----------+----------
foo | foo | UTF8
This is clearly a problem with the locale of your console.
u"München" is u'M\xfcnchen' in Unicode and 'M\xc3\xbcnchen' in UTF-8. That latter is your München if taken as ISO8859-1 or CP1252.
Psycopg2 seems to supply you with correct Unicode values, as it should.
If you type
b = 'München'
what do you get from type(b)? Maybe you don't need to literally transform the string into a unicode literal, as Python will handle this automatically.
EDIT: I get this from my python CLI:
>>> b = u'München'
>>> b
u'M\xfcnchen'
>>> print b
München
while you are getting your print result in a different encoding.
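A plausible explanation, sketched in Python 3: the literal was read under the wrong source or console encoding, so the two strings print similarly but contain different code points, and the repair is the same latin1/UTF-8 round trip seen elsewhere on this page. The values below illustrate the mechanism, not the poster's exact session:

```python
a = 'M\xfcnchen'        # 'München' as real Unicode, as psycopg2 returned it
b = 'M\xc3\xbcnchen'    # the same text mis-read: UTF-8 bytes taken as latin1
print(a == b)           # False -- different code points
print(b.encode('latin1').decode('utf8') == a)  # True after repairing b
```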
