sqlite database supporting unicode and longdate - python

I'm working on a Python script that pulls data from a sqlite3 database.
When I try this code:
# Pull the data from the database
c = con.cursor()
channelList = list()
channel_db = xbmc.translatePath(os.path.join('special://userdata/addon_data/script.tvguide', 'source.db'))
if os.path.exists(channel_db):
    c.execute('SELECT channel, title, start_date, stop_date FROM programs WHERE channel')
    for row in c:
        channel = row[0], row[1], row[2], row[3]
        channelList.append(channel)
        print channel
c.close()
I get the data back with a unicode u prefix and a long L suffix, like this:
20:52:01 T:5212 NOTICE: (u'101 ABC FAMILY ', u'Reba - Location, Location, Location', 20140522133000L, 20140522140000L)
20:52:01 T:5212 NOTICE: (u'101 ABC FAMILY ', u'Reba - Your Place or Mine', 20140522140000L, 20140522143000L)
20:52:01 T:5212 NOTICE: (u'101 ABC FAMILY ', u"Reba - She's Leaving Home, Bye, Bye", 20140522143000L, 20140522150000L)
20:52:01 T:5212 NOTICE: (u'101 ABC FAMILY ', u'Boy Meets World - No Such Thing as a Sure Thing', 20140522150000L, 20140522153000L)
I want to print the data without the u and L markers. How can I do that?

The problem is that you are printing a tuple, the elements of which will be printed using __repr__ instead of __str__. To get each to be printed in a more natural way, try:
print row[0], row[1], row[2], row[3]
Explanation by example:
>>> print u'Hello'
Hello
>>> print (u'Hello', u'World')
(u'Hello', u'World')
>>> print u'Hello', u'World'
Hello World
Converting:
If you're interested in converting the data so that the strings are no longer unicode, and the dates are ints instead of longs, you can do the following:
>>> channel = row[0].encode('ascii'), row[1].encode('ascii'), int(row[2]), int(row[3])
>>> print channel
('101 ABC FAMILY ', 'Reba - Location, Location, Location', 20140522133000, 20140522140000)
Beware that encoding to ascii will fail with a UnicodeEncodeError if the string contains a non-ascii character. Casting the long to int will never raise an exception, but the result will simply be another long if the number is too large to fit in an int. More about Python's long.
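A quick Python 2 session illustrating that behaviour (the oversized value is arbitrary):

>>> int(10**30)
1000000000000000000000000000000L
>>> print int(10**30)
1000000000000000000000000000000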
Text factory:
Another option is to use a sqlite3 feature called text_factory. Do this before the c.execute:
con.text_factory = lambda x: x.encode('ascii')
This will be automatically called when retrieving any text columns. Note that in this case, the UnicodeDecodeError will be raised by c.execute if the text can't be decoded properly.
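As a rough sketch of how this fits the question's code (Python 2; reusing channel_db from the question, with the query trimmed for brevity):

import sqlite3

con = sqlite3.connect(channel_db)
con.text_factory = lambda x: x.encode('ascii')  # text columns now come back as plain str
c = con.cursor()
c.execute('SELECT channel, title, start_date, stop_date FROM programs')
for row in c:
    # no u'' prefixes now; printing the elements rather than the tuple hides the L too
    print row[0], row[1], row[2], row[3]
c.close()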

Related

How to fix python hardcoded dictionary encoding issue

Error:
pymysql.err.InternalError: (1366, "Incorrect string value: '\\xEF\\xBF\\xBD 20...' for column 'history' at row 1")
I've received a few variations of this error as I've tweaked my dictionary, always in the history column; the only thing that varies is the characters it flags.
I can't post the dictionary because it contains sensitive information, but here is the gist:
I started with 200 addresses (including state, zip, etc.) that needed to be validated, normalized and standardized for DB insertion.
I spent a lot of time on Google Maps validating and standardizing.
I decided to get fancy and put all the crazy accented letters in these worldwide addresses (often copied from Google, because I don't know how to type an A with an o over it, lol), Singapore to Brazil, everywhere.
I ended up with 120 unique addresses in my dictionary after processing.
Everything works 100% perfectly when INSERTING the data into SQLite and OUTPUTTING to a CSV. The issue is exclusively with MySQL and some sneaky un-viewable characters.
Note: I used this to remove the accents after 7 hours of copy/pasting to Notepad, encoding it with Notepad++ and just trying to process the data in a way that made it all the correct encoding. I think I did lose the version with the accents and only have this tool's output now.
I do not see "\xEF\xBF\xBD 20..." in my dictionary; I only see text. Currently I don't even see "20"... those two characters helped me find the previous issues.
Code I can show:
def insert_tables(cursor, assets_final, ips_final):
    # Insert asset data into asset table
    field_names_dict = get_asset_field_names(assets_final)
    sql_field_names = ",".join(field_names_dict.keys())
    for key, row in assets_final.items():
        insert_sql = 'INSERT INTO asset(' + sql_field_names + ') VALUES ("' + '","'.join(field_value.replace('"', "'") for field_value in list(row.values())) + '")'
        print(insert_sql)
        cursor.execute(insert_sql)
    # Insert IP data into IP table
    field_names_dict = get_ip_field_names(ips_final)
    sql_field_names = ",".join(field_names_dict.keys())
    for hostname_key, ip_dict in ips_final.items():
        for ip_key, ip_row in ip_dict.items():
            insert_sql = 'INSERT INTO ip(' + sql_field_names + ') VALUES ("' + '","'.join(field_value.replace('"', "'") for field_value in list(ip_row.values())) + '")'
            print(insert_sql)
            cursor.execute(insert_sql)

def output_sqlite_db(sqlite_file, assets_final, ips_final):
    conn = sqlite3.connect(sqlite_file)
    cursor = conn.cursor()
    insert_tables(cursor, assets_final, ips_final)
    conn.commit()
    conn.close()

def output_mysql_db(assets_final, ips_final):
    conn = mysql.connect(host=config.mysql_ip, port=config.mysql_port, user=config.mysql_user, password=config.mysql_password, charset="utf8mb4", use_unicode=True)
    cursor = conn.cursor()
    cursor.execute('USE ' + config.mysql_DB)
    insert_tables(cursor, assets_final, ips_final)
    conn.commit()
    conn.close()
EDIT: Could this have something to do with the fact that I'm using Cygwin as my terminal? HA! I added the following line and got a different message (now using the accented version again):
cursor.execute('SET NAMES utf8')
Error:
pymysql.err.InternalError: (1366, "Incorrect string value: '\\xC5\\x81A II...' for column 'history' at row 1")
I can shine a bit of light on the messages that you have supplied:
Case 1:
>>> import unicodedata as ucd
>>> s1 = b"\xEF\xBF\xBD"
>>> s1
b'\xef\xbf\xbd'
>>> u1 = s1.decode('utf8')
>>> u1
'\ufffd'
>>> ucd.name(u1)
'REPLACEMENT CHARACTER'
>>>
It looks like you obtained some bytes encoded in something other than UTF-8 (e.g. cp1252) and tried bytes.decode(encoding='utf8', errors='strict'), which detected errors. You then decoded again with errors='replace', which raised no exceptions, but your data had the offending bytes replaced by the replacement character (U+FFFD). Then you encoded your data using str.encode so that you could write it to a file or database. Each replacement character turns up as the 3 hex bytes EF BF BD.
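Here is a minimal Python 3 sketch of that suspected pipeline; the source text ('é 2019') is made up purely for illustration:

raw = 'é 2019'.encode('cp1252')              # b'\xe9 2019' -- one byte for é, invalid as UTF-8
try:
    raw.decode('utf8')                       # strict decode chokes on the stray 0xE9 byte
except UnicodeDecodeError as exc:
    print(exc)
text = raw.decode('utf8', errors='replace')  # '\ufffd 2019' -- bad byte becomes U+FFFD
print(text.encode('utf8'))                   # b'\xef\xbf\xbd 2019' -- the EF BF BD that MySQL rejects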
Case 2:
>>> s2 = b"\xC5\x81A II"
>>> s2
b'\xc5\x81A II'
>>> u2 = s2.decode('utf8')
>>> u2
'\u0141A II'
>>> ucd.name(u2[0])
'LATIN CAPITAL LETTER L WITH STROKE'
>>>

The bytes in Case 2, on the other hand, are valid UTF-8 (they decode cleanly to Ł, LATIN CAPITAL LETTER L WITH STROKE), so the data itself looks fine; MySQL still rejecting it suggests the history column or its table is not actually using a UTF-8 character set.

CSV to dict, dict not finding the item

I am converting a CSV to a dict; all the values load correctly, but there is one issue.
CSV:
Testing testing\nwe are into testing mode
My\nServer This is my server.
When I convert the CSV to a dict and try to use the dict.get() method, it returns None.
When I debug, I get the following output:
{'Testing': 'testing\\nwe are into testing mode', 'My\\nServer': 'This is my server.'}
The My\nServer key has an extra backslash, and if I do .get("My\nServer"), I get None.
Can anyone help me?
#!/usr/bin/env python
import os
import codecs
import json
from csv import reader

def get_dict(path):
    with codecs.open(path, 'r', 'utf-8') as msgfile:
        data = msgfile.read()
    data = reader([r.encode('utf-8') for r in data.splitlines()])
    newdata = []
    for row in data:
        newrow = []
        for val in row:
            newrow.append(unicode(val, 'utf-8'))
        newdata.append(newrow)
    return dict(newdata)
thanks
You either need to escape the newline properly, using \\n:
>>> d = {'Testing': 'testing\\nwe are into testing mode', 'My\\nServer': 'This is my server.'}
>>> d.get('My\\nServer')
'This is my server.'
or you can use a raw string literal which doesn't need extra escaping:
>>> d.get(r'My\nServer')
'This is my server.'
Note that a raw string treats all backslash escape sequences this way, not just the newline \n.
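For example, the raw literal and the escaped literal denote the same string:
>>> r'My\nServer' == 'My\\nServer'
True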
In case you are getting the values dynamically, you can use str.encode with string_escape or unicode_escape encoding:
>>> k = 'My\nServer' # API call result
>>> k.encode('string_escape')
'My\\nServer'
>>> d.get(k.encode('string_escape'))
'This is my server.'
"\n" is newline.
If you want to represent a text like "---\n---" in Python, and not having there newline, you have to escape it.
The way you write it in code and how it gets printed differs, in code, you will have to write "\" (unless u use raw string), when printed, the extra slash will not be seen
So in your code, you shall ask:
>>> dct = {'Testing': 'testing\\nwe are into testing mode', 'My\\nServer': 'This is my server.'}
>>> dct.get("My\\nServer")
'This is my server.'

How can I convert this to unicode so it displays properly?

I'm querying a database which from the MySQL workbench returns the following value:
Vitória da Conquista
which should be displayed as:
Vitória da Conquista
No matter what I've tried, I can't convert 'Vit\xc3\xb3ria da Conquista' into 'Vitória da Conquista'
# Querying MySQL "world" database
print "====================================="
query = 'select name from city where id=283;'
cursor.execute(query)
cities = cursor.fetchall()
print cities
for city in cities:
    cs = str(city)
    cs = cs[3:-3].decode('utf-8')
    print cs
    print cs.decode('utf-8')
    print cs.encode('ascii','ignore')
the output of which looks like:
=====================================
[(u'Vit\xc3\xb3ria da Conquista',)]
Vit\xc3\xb3ria da Conquista
Vit\xc3\xb3ria da Conquista
Vit\xc3\xb3ria da Conquista
Well, this actually worked, though I'm not sure why. I am getting the correct value of Vitória da Conquista, but I would like to understand what is happening.
# Querying MySQL "world" database
query = 'SELECT CONVERT(CAST(Name as BINARY) USING utf8) from city where id = 283;'
cursor.execute(query)
cities = cursor.fetchall()
for tup in cities:
    cs = tup[0]
    print cs
If the data coming in is in UTF-8 (which it looks like it is), use unicode() (in Python 2) to convert it from bytes to a Python Unicode string:
cs = unicode(cs[3:-3], "utf-8")
Basic rule: inside your code, always use Unicode strings; convert input data with unicode() and output data with encode(). That is likely also why your CONVERT(CAST(Name as BINARY) USING utf8) query worked: the cast to BINARY throws away the column's declared character set, and USING utf8 reinterprets the raw bytes as UTF-8 on the server.
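Returning to the basic rule, a minimal Python 2 sketch of decode-on-input, encode-on-output, using a value from the question:

raw = 'Vit\xc3\xb3ria da Conquista'   # UTF-8 bytes as they come from the driver
city = unicode(raw, 'utf-8')          # decode input -> u'Vit\xf3ria da Conquista'
print city                            # Vitória da Conquista (terminal permitting)
out = city.encode('utf-8')            # encode again before writing it anywhere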
You are getting unicode strings back, stored in a list of tuples, which is what fetchall does. So you don't need to encode or decode at all. Just try this:
# Querying MySQL "world" database
print "====================================="
query = 'select name from city where id=283;'
cursor.execute(query)
cities = cursor.fetchall()
for tup in cities:
    cs = tup[0]
    print cs
If this doesn't print right, then you probably have issues with your terminal, as mentioned by @Jarrod Roberson. The only other possibility is that the data was entered into, or is being returned from, the database with the wrong (unexpected) encoding.

Python to show special characters

I know there are tons of threads regarding this issue but I have not managed to find one which solves my problem.
I am trying to print a string, but when printed it doesn't show special characters (e.g. æ, ø, å, ö and ü). When I print the string using repr(), this is what I get:
u'Von D\xc3\xbc' and u'\xc3\x96berg'
Does anyone know how I can convert this to Von Dü and Öberg? It's important to me that these characters are not ignored, e.g. myStr.encode("ascii", "ignore").
EDIT
This is the code I use. I use BeautifulSoup to scrape a website. The contents of a cell (<td>) in a table (<table>) are put into the variable name. This is the variable that contains special characters which I cannot print.
web = urllib2.urlopen(url)
soup = BeautifulSoup(web)
tables = soup.find_all("table")
scene_tables = [2, 3, 6, 7, 10]
scene_index = 0

# Iterate over the <table>s we want to work with
for scene_table in scene_tables:
    i = 0
    # Iterate over <td> to find time and name
    for td in tables[scene_table].find_all("td"):
        if i % 2 == 0:  # td contains the time
            time = remove_whitespace(td.get_text())
        else:  # td contains the name
            name = remove_whitespace(td.get_text())  # This is the variable containing "nonsense"
            print "%s: %s" % (time, name,)
        i += 1
    scene_index += 1
Prevention is better than cure. What you need is to find out how that rubbish is being created. Please edit your question to show the code that creates it, and then we can help you fix it. It looks like somebody has done:
your_unicode_string = original_utf8_encoded_bytestring.decode('latin1')
The cure is simply to reverse the process:
correct_unicode_string = your_unicode_string.encode('latin1').decode('utf8')
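Applied to one of the strings from the question, a quick check in the Python 2 interpreter:

>>> print u'Von D\xc3\xbc'.encode('latin1').decode('utf8')
Von Dü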
Update: Based on the code that you supplied, the probable cause is that the website declares that it is encoded in ISO-8859-1 (aka latin1) but in reality it is encoded in UTF-8. Please update your question to show us the url.
If you can't show it, read the BS docs; it looks like you'll need to use:
BeautifulSoup(web, from_encoding='utf8')
Unicode support in many languages is confusing, so your error here is understandable. Those strings are UTF-8 bytes, which would work properly if you drop the u at the front:
>>> err = u'\xc3\x96berg'
>>> print err
Ã?berg
>>> x = '\xc3\x96berg'
>>> print x
Öberg
>>> u = x.decode('utf-8')
>>> u
u'\xd6berg'
>>> print u
Öberg
For lots more information:
http://www.joelonsoftware.com/articles/Unicode.html
http://docs.python.org/howto/unicode.html
You should really really read those links and understand what is going on before proceeding. If, however, you absolutely need to have something that works today, you can use this horrible hack that I am embarrassed to post publicly:
def convert_fake_unicode_to_real_unicode(string):
    return ''.join(map(chr, map(ord, string))).decode('utf-8')
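For what it's worth, the hack is effectively equivalent to string.encode('latin1').decode('utf-8'): mapping ord and then chr turns each code point below 256 back into the byte it came from. A quick check with one of the question's strings:

>>> convert_fake_unicode_to_real_unicode(u'\xc3\x96berg')
u'\xd6berg'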
The contents of the strings are not unicode, they are UTF-8 encoded.
>>> print u'Von D\xc3\xbc'
Von Dü
>>> print 'Von D\xc3\xbc'
Von Dü
>>> print unicode('Von D\xc3\xbc', 'utf-8')
Von Dü
>>>
Edit:
>>> print '\xc3\x96berg' # no unicode identifier, works as expected because it's an UTF-8 encoded string
Öberg
>>> print u'\xc3\x96berg' # has unicode identifier, means print uses the unicode charset now, outputs weird stuff
Ãberg
# Look at the differing object types:
>>> type('\xc3\x96berg')
<type 'str'>
>>> type(u'\xc3\x96berg')
<type 'unicode'>
>>> '\xc3\x96berg'.decode('utf-8') # this command converts from UTF-8 to unicode, look at the unicode identifier in the output
u'\xd6berg'
>>> unicode('\xc3\x96berg', 'utf-8') # this does the same thing
u'\xd6berg'
>>> unicode(u'foo bar', 'utf-8') # trying to convert a unicode string to unicode will fail as expected
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: decoding Unicode is not supported

Make unicode from variable containing QString

I have a QPlainTextEdit field with data containing national characters (ISO-8859-2).
tmp = self.ui.field.toPlainText() (QString type)
When I do:
tmp = unicode(tmp, 'iso-8859-2')
I get question marks instead of the national characters. How can I properly convert the data in the QPlainTextEdit field to unicode?
As was said, QPlainTextEdit.toPlainText() returns a QString, which is UTF-16, whereas the unicode() constructor expects a byte string. Below is a small example:
tmp = self.field.toPlainText()
print 'field.toPlainText: ', tmp
codec0 = QtCore.QTextCodec.codecForName("UTF-16")
codec1 = QtCore.QTextCodec.codecForName("ISO 8859-2")
print 'UTF-16: ', unicode(codec0.fromUnicode(tmp), 'UTF-16')
print 'ISO 8859-2: ', unicode(codec1.fromUnicode(tmp), 'ISO 8859-2')
This code produces the following output:
field.toPlainText: test ÖÄ это китайский: 最主要的
UTF-16: test ÖÄ это китайский: 最主要的
ISO 8859-2: test ÖÄ ??? ?????????: ????
hope this helps, regards
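A side note, with the caveat that this is an assumption about the bindings in use: with PyQt4 on Python 2 you can usually convert a QString directly, with no codec argument at all:

tmp = self.ui.field.toPlainText()  # QString
text = unicode(tmp)                # QString -> Python unicode string

Passing an encoding such as 'iso-8859-2' to unicode() instead makes Python treat the QString as a byte string in that encoding, which may be where the question marks came from.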
