How to fix a Python hardcoded dictionary encoding issue

Error:
pymysql.err.InternalError: (1366, "Incorrect string value: '\\xEF\\xBF\\xBD 20...' for column 'history' at row 1")
I've received a few variations of this as I've tried to tweak my dictionary, always in the history column; the only variation is which characters it tells me are the issue.
I can't post the dictionary because it's got sensitive information, but here is the gist:
I started with 200 addresses (including state, zip, etc) that needed
to be validated, normalized and standardized for DB insertion.
I spent a lot of time on google maps validating and standardizing.
I decided to get fancy and kept all the crazy accented letters in these world addresses (often copied from Google because I don't know how to type an A with a circle over it, lol), Singapore to Brazil, everywhere.
I ended up with 120 unique addresses in my dictionary after processing.
Everything works 100% perfectly when inserting the data into SQLite and outputting to a CSV. The issue is exclusively with MySQL and some sneaky un-viewable characters.
Note: I used this to remove the accents after 7 hours of copy/pasting to Notepad, encoding it with Notepad++ and just trying to process the data in a way that made it all the correct encoding. I think I lost the version with the accents and only have this tool's output now.
I do not see "\xEF\xBF\xBD 20..." in my dictionary I only see text. Currently I don't even see "20"... those two chars helped me find the previous issues.
Code I can show:
def insert_tables(cursor, assets_final, ips_final):
    # Insert asset data into the asset table
    field_names_dict = get_asset_field_names(assets_final)
    sql_field_names = ",".join(field_names_dict.keys())
    for key, row in assets_final.items():
        insert_sql = 'INSERT INTO asset(' + sql_field_names + ') VALUES ("' + '","'.join(field_value.replace('"', "'") for field_value in list(row.values())) + '")'
        print(insert_sql)
        cursor.execute(insert_sql)
    # Insert IP data into the ip table
    field_names_dict = get_ip_field_names(ips_final)
    sql_field_names = ",".join(field_names_dict.keys())
    for hostname_key, ip_dict in ips_final.items():
        for ip_key, ip_row in ip_dict.items():
            insert_sql = 'INSERT INTO ip(' + sql_field_names + ') VALUES ("' + '","'.join(field_value.replace('"', "'") for field_value in list(ip_row.values())) + '")'
            print(insert_sql)
            cursor.execute(insert_sql)

def output_sqlite_db(sqlite_file, assets_final, ips_final):
    conn = sqlite3.connect(sqlite_file)
    cursor = conn.cursor()
    insert_tables(cursor, assets_final, ips_final)
    conn.commit()
    conn.close()

def output_mysql_db(assets_final, ips_final):
    conn = mysql.connect(host=config.mysql_ip, port=config.mysql_port, user=config.mysql_user, password=config.mysql_password, charset="utf8mb4", use_unicode=True)
    cursor = conn.cursor()
    cursor.execute('USE ' + config.mysql_DB)
    insert_tables(cursor, assets_final, ips_final)
    conn.commit()
    conn.close()
EDIT: Could this have something to do with the fact that I'm using Cygwin as my terminal? HA! I added this line and got a different message (now using the accented version again):
cursor.execute('SET NAMES utf8')
Error:
pymysql.err.InternalError: (1366, "Incorrect string value: '\\xC5\\x81A II...' for column 'history' at row 1")

I can shine a bit of light on the messages that you have supplied:
Case 1:
>>> import unicodedata as ucd
>>> s1 = b"\xEF\xBF\xBD"
>>> s1
b'\xef\xbf\xbd'
>>> u1 = s1.decode('utf8')
>>> u1
'\ufffd'
>>> ucd.name(u1)
'REPLACEMENT CHARACTER'
>>>
Looks like you have obtained some bytes encoded in an encoding other than utf8 (e.g. cp1252), then tried bytes.decode(encoding='utf8', errors='strict'). This detected some errors. You then decoded again with errors="replace". This raised no exceptions, but your data has had the error bytes replaced by the replacement character (U+FFFD). Then you encoded your data using str.encode so that you could write to a file or database. Each replacement character turns up as the 3 hex bytes EF BF BD.
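A quick Python 3 sketch of that failure mode (the "café" string here is just an illustration, not your actual data):

```python
# Bytes produced under cp1252 are not always valid UTF-8.
raw = "café".encode("cp1252")                # b'caf\xe9'
# A strict decode would raise UnicodeDecodeError; "replace" substitutes U+FFFD.
text = raw.decode("utf8", errors="replace")
print(text)                                  # 'caf' followed by U+FFFD
# Re-encoding turns each U+FFFD into the 3 bytes EF BF BD seen in the MySQL error.
print(text.encode("utf8"))                   # b'caf\xef\xbf\xbd'
```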
... more to come
Case 2:
>>> s2 = b"\xC5\x81A II"
>>> s2
b'\xc5\x81A II'
>>> u2 = s2.decode('utf8')
>>> u2
'\u0141A II'
>>> ucd.name(u2[0])
'LATIN CAPITAL LETTER L WITH STROKE'
>>>
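Note that Case 2's bytes decode cleanly to Ł, so the data itself looks fine there and the remaining suspect is the column or connection character set rather than the values. Separately, hand-assembling the INSERT string invites exactly this class of quoting and encoding trouble; here is a hedged sketch of the parameterized form (column names and values are illustrative, not from the real dictionary):

```python
# Let the driver (pymysql in the question) quote and encode the values itself.
field_names = ["hostname", "history"]        # illustrative column names
row_values = ["srv01", "\u0141A II"]         # contains the Ł from case 2
placeholders = ",".join(["%s"] * len(row_values))
insert_sql = "INSERT INTO asset(" + ",".join(field_names) + ") VALUES (" + placeholders + ")"
print(insert_sql)   # INSERT INTO asset(hostname,history) VALUES (%s,%s)
# cursor.execute(insert_sql, row_values)     # pymysql substitutes the values safely
```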

Related

sqlite database supporting unicode and longdate

I'm working on my python script to pull the data from the sqlite3 database.
When I try this code:
#Pull the data from the database
c = con.cursor()
channelList = list()
channel_db = xbmc.translatePath(os.path.join('special://userdata/addon_data/script.tvguide', 'source.db'))
if os.path.exists(channel_db):
    c.execute('SELECT channel, title, start_date, stop_date FROM programs WHERE channel')
    for row in c:
        channel = row[0], row[1], row[2], row[3]
        channelList.append(channel)
        print channel
c.close()
I get the data back with the unicode u prefix and the long L suffix, like this:
20:52:01 T:5212 NOTICE: (u'101 ABC FAMILY ', u'Reba - Location, Location, Location', 20140522133000L, 20140522140000L)
20:52:01 T:5212 NOTICE: (u'101 ABC FAMILY ', u'Reba - Your Place or Mine', 20140522140000L, 20140522143000L)
20:52:01 T:5212 NOTICE: (u'101 ABC FAMILY ', u"Reba - She's Leaving Home, Bye, Bye", 20140522143000L, 20140522150000L)
20:52:01 T:5212 NOTICE: (u'101 ABC FAMILY ', u'Boy Meets World - No Such Thing as a Sure Thing', 20140522150000L, 20140522153000L)
I want to print the data without the u and L strings.
Could you please tell me how I can print the data without the u and the L strings?
The problem is that you are printing a tuple, the elements of which will be printed using __repr__ instead of __str__. To get each to be printed in a more natural way, try:
print row[0], row[1], row[2], row[3]
Explanation by example:
>>> print u'Hello'
Hello
>>> print (u'Hello', u'World')
(u'Hello', u'World')
>>> print u'Hello', u'World'
Hello World
Converting:
If you're interested in converting the data so that the strings are no longer unicode, and the dates are ints instead of longs, you can do the following:
>>> channel = row[0].encode('ascii'), row[1].encode('ascii'), int(row[2]), int(row[3])
>>> print channel
('101 ABC FAMILY ', 'Reba - Location, Location, Location', 20140522133000, 20140522140000)
Beware that encoding to ASCII will fail with a UnicodeEncodeError if the string contains a non-ASCII character. Casting the long to int will never raise an exception, but the result will simply be another long if the number is too large to fit in an int. More about Python's long.
Text factory:
Another option is to use a sqlite3 feature called text_factory. Do this before the c.execute:
con.text_factory = lambda x: x.encode('ascii')
This will be automatically called when retrieving any text columns. Note that in this case, the UnicodeDecodeError will be raised by c.execute if the text can't be decoded properly.
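For reference, the same hook exists in Python 3's sqlite3, where text_factory receives the raw UTF-8 bytes for each text column; a minimal sketch:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.text_factory = bytes                      # hand back raw UTF-8 bytes, not str
db.execute("CREATE TABLE t(x)")
db.execute("INSERT INTO t VALUES ('héllo')")
row = db.execute("SELECT x FROM t").fetchone()
print(row[0])  # b'h\xc3\xa9llo'
```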

why Python doesn't convert \n to newline when queried from Sqlite?

I want to query from SQLite; the column contains \n and I expect Python to convert it to a newline, but it doesn't. I also changed the \n in the database but it still isn't converted.
cursor.execute('''SELECT test FROM table_name ''')
for row in cursor:
    self.ui.textEdit.append(row[0])
    # or
    print row[0]
I also tried unicode(row[0]) and it's not working. I am surprised there is no easy solution for this on the web.
Neither SQLite nor Python converts characters in strings (except for backslash escapes in a Python string literal written in the source code).
Newlines work correctly if you handle them correctly:
>>> import sqlite3
>>> db = sqlite3.connect(':memory:')
>>> c = db.cursor()
>>> c.execute('create table t(x)')
>>> c.execute("insert into t values ('x\ny')")
>>> c.execute("insert into t values ('x\\ny')")
>>> c.execute("select * from t")
>>> for row in c:
... print row[0]
...
x
y
x\ny
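As the transcript shows, a doubled backslash stores the two literal characters \ and n rather than a newline. If that is what actually ended up in the database, the escape has to be undone explicitly after fetching; a small Python 3 sketch:

```python
s = "x\\ny"                       # two chars, backslash + n, like the 'x\\ny' row above
print(s)                          # prints: x\ny
unescaped = s.replace("\\n", "\n")
print(unescaped)                  # prints x, then y on the next line
```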

How can I convert this to unicode so it displays properly?

I'm querying a database which from the MySQL workbench returns the following value:
Vitória da Conquista
which should be displayed as:
Vitória da Conquista
No matter what I've tried, I can't convert 'Vit\xc3\xb3ria da Conquista' into 'Vitória da Conquista'.
#Querying MySQL "world" database
print "====================================="
query = 'select name from city where id=283;'
cursor.execute(query)
cities = cursor.fetchall()
print cities
for city in cities:
    cs = str(city)
    cs = cs[3:-3].decode('utf-8')
    print cs
    print cs.decode('utf-8')
    print cs.encode('ascii','ignore')
the output of which looks like:
=====================================
[(u'Vit\xc3\xb3ria da Conquista',)]
Vit\xc3\xb3ria da Conquista
Vit\xc3\xb3ria da Conquista
Vit\xc3\xb3ria da Conquista
Well, this actually worked, though I'm not sure why. I am getting the correct value of Vitória da Conquista, but I would like to understand what is happening.
#Querying MySQL "world" database
query = 'SELECT CONVERT(CAST(Name as BINARY) USING utf8) from city where id = 283;'
cursor.execute(query)
cities = cursor.fetchall()
for tup in cities:
    cs = tup[0]
    print cs
If the data coming in is UTF-8 (which it looks like it is), use unicode() (in Python 2) to convert it from bytes to a Python Unicode string:
cs = unicode(cs[3:-3], "utf-8")
Basic rule: inside your code, always use Unicode strings. Convert input data with unicode() and output data with encode().
You are getting unicode strings back, stored in a list of tuples, which is what fetchall does. So you don't need to encode or decode at all. Just try this:
#Querying MySQL "world" database
print "====================================="
query = 'select name from city where id=283;'
cursor.execute(query)
cities = cursor.fetchall()
for tup in cities:
    cs = tup[0]
    print cs
If this doesn't print correctly, then you probably have issues with your terminal, as mentioned by @Jarrod Roberson. The only other possibility is that the data was entered into, or is being returned from, the database with the wrong (unexpected) encoding.
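For the record, a value like u'Vit\xc3\xb3ria da Conquista' is a string whose code points are really UTF-8 byte values (classic mojibake, typically from decoding UTF-8 bytes as Latin-1), and the SQL CONVERT/CAST round trip works because it undoes that. The same repair in a minimal Python 3 sketch, using the sample string from the question and no MySQL at all:

```python
# A str whose code points are really UTF-8 byte values:
s = "Vit\xc3\xb3ria da Conquista"
# Re-encode as latin-1 to recover the original bytes, then decode as UTF-8.
fixed = s.encode("latin1").decode("utf8")
print(fixed)  # Vitória da Conquista
```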

IndexError string index out of range when referencing first character

Here is my code (currently):
conn = sqlite3.connect(db)
conn.text_factory = str #bugger 8-bit bytestrings
cur = conn.cursor()
reader = csv.reader(open(csvfile, "rU"), delimiter = '\t')
for Number, Name, Message, Datetime, Type in reader:
    # populate subscriber table
    if str(Number)[0] == '1': # errors on this line
        tmpNumber = str(Number)[1:]
        Number = int(tmpNumber)
    cur.execute('INSERT OR IGNORE INTO subscriber (name, phone_number) VALUES (?,?)', (Name, Number))
cur.close()
conn.close()
It returns this error on the line commented to indicate where the error lies:
IndexError: string index out of range
All of the numbers have values, but if the phone number starts with a 1 I want to remove the 1 before inserting it into the database. Why won't this work? I've converted it to a string before trying to reference the first character, so I don't understand why this isn't working.
Seems like you are getting an empty string. Try replacing your if statement with the following and see if it works:
if str(Number).startswith('1'):
(Edited to reflect @kindall's suggestion of using startswith instead of slicing [:1].)
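To see why the empty string matters: indexing it raises, while startswith (or slicing) does not. A quick sketch:

```python
s = ""                        # an empty CSV field
try:
    first = s[0]              # raises IndexError: string index out of range
except IndexError:
    first = None
print(first)                  # None
print(s.startswith('1'))      # False
print(s[:1] == '1')           # False -- slicing an empty string yields ''
```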

Python to show special characters

I know there are tons of threads regarding this issue but I have not managed to find one which solves my problem.
I am trying to print a string but when printed it doesn't show special characters (e.g. æ, ø, å, ö and ü). When I print the string using repr() this is what I get:
u'Von D\xc3\xbc' and u'\xc3\x96berg'
Does anyone know how I can convert this to Von Dü and Öberg? It's important to me that these characters are not ignored, e.g. myStr.encode("ascii", "ignore").
EDIT
This is the code I use. I use BeautifulSoup to scrape a website. The contents of a cell (<td>) in a table (<table>) are put into the variable name. This is the variable which contains special characters that I cannot print.
web = urllib2.urlopen(url)
soup = BeautifulSoup(web)
tables = soup.find_all("table")
scene_tables = [2, 3, 6, 7, 10]
scene_index = 0
# Iterate over the <table>s we want to work with
for scene_table in scene_tables:
    i = 0
    # Iterate over <td> to find time and name
    for td in tables[scene_table].find_all("td"):
        if i % 2 == 0: # td contains the time
            time = remove_whitespace(td.get_text())
        else: # td contains the name
            name = remove_whitespace(td.get_text()) # This is the variable containing "nonsense"
            print "%s: %s" % (time, name,)
        i += 1
    scene_index += 1
Prevention is better than cure. What you need is to find out how that rubbish is being created. Please edit your question to show the code that creates it, and then we can help you fix it. It looks like somebody has done:
your_unicode_string = original_utf8_encoded_bytestring.decode('latin1')
The cure is simply to reverse the process: encode back to latin1, then decode as utf8.
correct_unicode_string = your_unicode_string.encode('latin1').decode('utf8')
Update: Based on the code that you supplied, the probable cause is that the website declares that it is encoded in ISO-8859-1 (aka latin1) but in reality it is encoded in UTF-8. Please update your question to show us the url.
If you can't show it, read the BS docs; it looks like you'll need to use:
BeautifulSoup(web, from_encoding='utf8')
Unicode support in many languages is confusing, so your error here is understandable. Those strings are UTF-8 bytes, which would work properly if you drop the u at the front:
>>> err = u'\xc3\x96berg'
>>> print err
Ã?berg
>>> x = '\xc3\x96berg'
>>> print x
Öberg
>>> u = x.decode('utf-8')
>>> u
u'\xd6berg'
>>> print u
Öberg
For lots more information:
http://www.joelonsoftware.com/articles/Unicode.html
http://docs.python.org/howto/unicode.html
You should really really read those links and understand what is going on before proceeding. If, however, you absolutely need to have something that works today, you can use this horrible hack that I am embarrassed to post publicly:
def convert_fake_unicode_to_real_unicode(string):
    return ''.join(map(chr, map(ord, string))).decode('utf-8')
The contents of the strings are not Unicode; they are UTF-8 encoded.
>>> print u'Von D\xc3\xbc'
Von Dü
>>> print 'Von D\xc3\xbc'
Von Dü
>>> print unicode('Von D\xc3\xbc', 'utf-8')
Von Dü
>>>
Edit:
>>> print '\xc3\x96berg' # no unicode identifier, works as expected because it's an UTF-8 encoded string
Öberg
>>> print u'\xc3\x96berg' # has unicode identifier, means print uses the unicode charset now, outputs weird stuff
Ãberg
# Look at the differing object types:
>>> type('\xc3\x96berg')
<type 'str'>
>>> type(u'\xc3\x96berg')
<type 'unicode'>
>>> '\xc3\x96berg'.decode('utf-8') # this command converts from UTF-8 to unicode, look at the unicode identifier in the output
u'\xd6berg'
>>> unicode('\xc3\x96berg', 'utf-8') # this does the same thing
u'\xd6berg'
>>> unicode(u'foo bar', 'utf-8') # trying to convert a unicode string to unicode will fail as expected
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: decoding Unicode is not supported
