I'm trying to get some tweet data from a MySQL database.
I've hit tons of encoding errors while developing this code. This last version is the only way I've gotten the code to run, and it produces an output file full of \uXXXX escapes everywhere, as you can see here:
[{..., "lang_tweet": "es", "text_tweet": "Recuerdo un d\u00eda de, *llamada a la 1:45*, \"Micho, me va a dar algo, estoy temblando, me tome un moster y un balium... Que me muero.!!\",...},...]
I've gone around and around trying different solutions, but the thing is I've gotten really confused by the layers of encoding and decoding.
What can I do to fix this?
Or would it be easier to just grab the dirty JSON and 'parse' it, decoding those characters manually?
If you want to take a look, here is the code I'm using to query the DB:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import pymysql
import collections
import json
conn = pymysql.connect(host='localhost', user='sut', passwd='r', db='tweetsjun2016')
cur = conn.cursor()
cur.execute("""
SELECT * FROM 20160607_tweets
WHERE 20160607_tweets.creation_date >= '2016-06-07 10:51'
AND 20160607_tweets.creation_date <= '2016-06-07 11:51'
AND 20160607_tweets.lang_tweet = "es"
AND 20160607_tweets.has_keyword = 1
AND 20160607_tweets.rt = 0
LIMIT 20
""")
objects_list = []
for row in cur:
    d = collections.OrderedDict()
    d['download_date'] = row[1]
    d['creation_date'] = row[2]
    d['id_user'] = row[5]
    d['favorited'] = row[7]
    d['lang_tweet'] = row[10]
    d['text_tweet'] = row[11].decode('latin1')
    d['rt'] = row[12]
    d['rt_count'] = row[13]
    d['has_keyword'] = row[19]
    objects_list.append(d)
    # print(row[11].decode('latin1')) <- looks perfect, it prints with accents and fine
j = json.dumps(objects_list, default=date_handler, encoding='latin1')
objects_file = "test23" + "_dicts"
f = open(objects_file,'w')
print >> f, j
cur.close()
conn.close()
If I delete the .decode('latin1') calls everywhere they appear, I get this error:
Traceback (most recent call last):
  File "test.py", line 51, in <module>
    j = json.dumps(objects_list, default=date_handler)
  File "C:\Users\Vichoko\Anaconda2\lib\json\__init__.py", line 251, in dumps
    sort_keys=sort_keys, **kw).encode(obj)
  File "C:\Users\Vichoko\Anaconda2\lib\json\encoder.py", line 207, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "C:\Users\Vichoko\Anaconda2\lib\json\encoder.py", line 270, in iterencode
    return _iterencode(o, 0)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xed in position 13: invalid continuation byte
I really can't figure out in what form the strings are coming from the DB into my script.
Thanks for reading; any idea would be appreciated.
Edit1:
Here you can see how the JSON files are being exported with the encoding error in the text_tweet key:
https://github.com/Vichoko/real-time-twit/blob/master/auto_labeling/json/tweets_sismos/tweetsago20160.json
Try passing the charset keyword argument to connect, as shown in the example on pymysql's GitHub.
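A minimal sketch of what that could look like (same credentials as in the question; charset='utf8mb4' is an assumption about how the tweets were stored, plain 'utf8' may also be enough):
import pymysql

# with an explicit charset, pymysql decodes text columns to unicode itself,
# so the manual .decode('latin1') calls become unnecessary
conn = pymysql.connect(host='localhost', user='sut', passwd='r',
                       db='tweetsjun2016', charset='utf8mb4')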
When calling json.dumps, add ensure_ascii=False (the Python counterpart of PHP's JSON_UNESCAPED_UNICODE):
j = json.dumps(objects_list, default=date_handler, ensure_ascii=False)
That will give you í instead of \u00ed.
(Don't use regexps, don't use decode functions, etc., they will only dig your hole deeper.)
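Note that in Python 2, ensure_ascii=False makes dumps return a unicode object whenever the input contains unicode, so the output file then needs an explicit encoding as well. A sketch of the dump-and-write step under that approach (reusing the question's date_handler):
import io
import json

j = json.dumps(objects_list, default=date_handler, ensure_ascii=False)
# io.open returns a file object that accepts unicode and encodes on write
with io.open('test23_dicts', 'w', encoding='utf-8') as f:
    f.write(j)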
Related
I'm working on a note taking app using python2/Gtk3/Glade.
The notes are stored in a MySQL Database and displayed in a TextView widget.
I can load/store/display plain text fine. However, I want the ability to add images to the note page and store them in the database, so the data has to be serialised, and I'm having some trouble figuring out how to encode/decode the serialised data going in and out of the database. I'm getting unicode start-byte errors. If I were working with files I could just open the file in binary mode, but I'm storing it as a string in a database. I've tried encoding/decoding as UTF-8 and ASCII using bytes() and string.encode() [see the sample code below] and a few other ways, but none work.
I am using this function to add the image to the textview buffer:
def _AddImagetoNode(self, oWidget):
    filenm = None
    seliter = self.GetTreeSelection(self.treeview)
    filenm = self.FileOpenDiag("Select an Image To Insert.", "Image", "*.png,*.jpg,*.bmp")
    if filenm == None:
        return()
    #filenm = "/home/drift/Pictures/a.png"
    buf = self.dataview.get_buffer()
    pixbuf = GdkPixbuf.Pixbuf.new_from_file(filenm)
    #pixbuf.scale_simple(dest_width, dest_height, gtk.gdk.INTERP_BILINEAR)
    buf.insert_pixbuf(buf.get_end_iter(), pixbuf)
    self.dataview.set_buffer(buf)
    self.dataview.show()
This is the function that stores the textview buffer:
def SaveDataView(self):
    global DataViewNode
    global DataViewIsImage
    if len(self.GetProjectName()) == 0:
        return()
    buf = self.dataview.get_buffer()
    format = buf.register_serialize_tagset()
    data2 = buf.serialize(buf, format, buf.get_start_iter(), buf.get_end_iter())
    #convert bytes(data) to string
    data = data2.decode(encoding='UTF-8') #<< i think my problem is here
    print("save b4 decode >>>>>>:%s" % data2)
    sql = "UPDATE " + self.GetProjectName() + " SET tDataPath=%s WHERE tNodeID=%s"
    val = (data, DataViewNode)
    self.cursor.execute(sql, val)
    self.mariadb_connection.commit()
This is the function that loads the buffer:
def UpdateDataView(self, nodeid):
    global DataViewNode
    #global DataViewIsFile
    DataViewNode = nodeid
    if self.GetProjectName() != None and DataViewNode != None:
        self.dataview.set_sensitive(True)
    else:
        self.dataview.set_sensitive(False)
        self.dataview.show()
        return()
    buf = self.dataview.get_buffer()
    buf.set_text('')
    enc = self.DbGetNodeData(nodeid)
    #convert string(enc) to bytes
    data = enc.encode(encoding='UTF-8') #<<< i think my problem is here
    print("update after decode >>>>>>>>>: %s" % data)
    ########### load
    format = buf.register_deserialize_tagset()
    buf.deserialize(buf, format, buf.get_end_iter(), data)
    #buf.set_text(enc)
    self.dataview.set_buffer(buf)
    self.dataview.show()
I'm using mysql.connector to connect to a MariaDB database.
This is the connection call:
self.mariadb_connection = mariadb.connect(user='box', password='box', host='localhost', database='Boxer',charset='utf8')
This is the error I'm getting:
Traceback (most recent call last):
  File "Boxer.py", line 402, in _TreeSelectionChanged
    self.SaveDataView()
  File "Boxer.py", line 334, in SaveDataView
    data = data2.decode(encoding='UTF-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb4 in position 174: invalid start byte
Traceback (most recent call last):
  File "Boxer.py", line 398, in _DataViewLostFocus
    self.SaveDataView()
  File "Boxer.py", line 334, in SaveDataView
    data = data2.decode(encoding='UTF-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb4 in position 174: invalid start byte
With this code I can add/edit plain text in the text view and successfully save/load it, but as soon as I add an image, I get the encoding errors. Any help would be appreciated.
Here is a more complete example:
def example(self):
    #retrieve info from first textview
    buf = self.builder.get_object('textbuffer1')
    format = buf.register_serialize_tagset()
    data = buf.serialize(buf, format, buf.get_start_iter(), buf.get_end_iter())
    #run db update to prove it can be inserted into a database
    db = psycopg2.connect(database='silrep_restore3', host='192.168.0.101',
                          user='postgres', password='true',
                          port='5432')
    c = db.cursor()
    c.execute("UPDATE products SET byt = %s WHERE id = 1", (psycopg2.Binary(data),))
    #append info to second textview as a proof of concept
    c.execute("SELECT byt FROM products WHERE id = 1")
    data = c.fetchone()[0]
    buf = self.builder.get_object('textbuffer2')
    format = buf.register_deserialize_tagset()
    buf.deserialize(buf, format, buf.get_end_iter(), data)
Since you are using MySQL, I recommend reading this article about inserting and retrieving data the way you are.
For my example I used a bytea column. In MySQL this may be a BLOB or BINARY type.
P.S. Sorry for not having a complete MySQL example in my answer. I would have posted a comment, but comments are pathetic for proper formatting.
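For what it's worth, a rough MySQL version of the same round trip (an untested sketch; it assumes a LONGBLOB column byt in a table products, and reuses the asker's mysql.connector import):
import mysql.connector as mariadb

db = mariadb.connect(user='box', password='box', host='localhost', database='Boxer')
c = db.cursor()
# the serialized buffer is raw bytes; it goes straight into the BLOB column
c.execute("UPDATE products SET byt = %s WHERE id = 1", (data,))
db.commit()
# and it comes back out as bytes, ready for buf.deserialize()
c.execute("SELECT byt FROM products WHERE id = 1")
data = c.fetchone()[0]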
Got it working, thanks to theGtknerd; your answer was the key. For anyone else having trouble with this, I ended up using a BLOB type for the MySQL column I'm working with. I tried BINARY (it returned malformed serialize data) and VARBINARY (it wouldn't even allow me to create the table), so I ended up using the LONGBLOB type.
Here is the working code for anyone who needs it.
def UpdateDataView(self, nodeid):
    global DataViewNode
    #global DataViewIsFile
    DataViewNode = nodeid
    if self.GetProjectName() != None and DataViewNode != None:
        self.dataview.set_sensitive(True)
    else:
        self.dataview.set_sensitive(False)
        self.dataview.show()
        return()
    buf = self.dataview.get_buffer()
    buf.set_text('')
    data = self.DbGetNodeData(nodeid)
    if data == '':
        return()
    format = buf.register_deserialize_tagset()
    buf.deserialize(buf, format, buf.get_end_iter(), data)
    self.dataview.set_buffer(buf)
    self.dataview.show()
def SaveDataView(self):
    global DataViewNode
    global DataViewIsImage
    if len(self.GetProjectName()) == 0:
        return()
    buf = self.dataview.get_buffer()
    enc = buf.get_text(buf.get_start_iter(), buf.get_end_iter(), False)
    self.AddData2Db(DataViewNode, enc)
    format = buf.register_serialize_tagset()
    data = buf.serialize(buf, format, buf.get_start_iter(), buf.get_end_iter())
    sql = "UPDATE " + self.GetProjectName() + " SET tDataPath=%s WHERE tNodeID=%s"
    val = (data, DataViewNode)
    self.cursor.execute(sql, val)
    self.mariadb_connection.commit()
And I'm using this to create the table:
sql = "CREATE TABLE %s (tParentNodeID TEXT,tNodeTxt TEXT,tNodeID TEXT,tDataPath LONGBLOB)" %pName
self.cursor.execute(sql)
self.mariadb_connection.commit()
I looked at some answers, including this one, but none seem to answer my question.
Here are some example lines from CSV:
_id category
ObjectId(56266da778d34fdc048b470b) [{"group":"Home","id":"53cea0be763f4a6f4a8b459e","name":"Cleaning Services","name_singular":"Cleaning Service"}]
ObjectId(56266e0c78d34f22058b46de) [{"group":"Local","id":"5637a1b178d34f20158b464f","name":"Balloon Dí©cor","name_singular":"Balloon Dí©cor"}]
Here is my code:
import csv
import sys
from sys import argv
import json
def ReadCSV(csvfile):
    with open('newCSVFile.csv', 'wb') as g:
        filewriter = csv.writer(g) #, delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)
        with open(csvfile, 'rb') as f:
            reader = csv.reader(f) # create reader object
            next(reader) # skip first row
            for row in reader: # go through all the rows
                listForExport = [] # initialize list that will have two items: id and list of categories
                # ID section
                vendorId = str(row[0]) # pull the raw vendor id out of the first column of the csv
                vendorId = vendorId[9:33] # slice to remove the ObjectId label and parentheses
                listForExport.append(vendorId) # add vendor ID as first item in list
                # categories section
                tempCatList = [] # temporary list of categories for second item in listForExport
                # this is line 41 where the error stems from
                categories = json.loads(row[1]) # creates a list of category dicts from a given row
                for names in categories: # loop through the category names using the key 'name'
                    print names['name']
Here's what I get:
Cleaning Services
Traceback (most recent call last):
  File "csvtesting.py", line 57, in <module>
    ReadCSV(csvfile)
  File "csvtesting.py", line 41, in ReadCSV
    categories = json.loads(row[1]) # creates a list of category dicts from a given row
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/__init__.py", line 338, in loads
    return _default_decoder.decode(s)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.py", line 366, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.py", line 382, in raw_decode
    obj, end = self.scan_once(s, idx)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 9-10: invalid continuation byte
So the code pulls out the first category, Cleaning Services, but then fails when it reaches the non-ASCII characters.
How do I deal with this? I'm happy to just remove any non-ASCII items.
Since you open the input csv file in rb mode, I assume you are using Python 2.x. The good news is that the csv part is not the problem, because the csv reader reads plain bytes without trying to interpret them. But the json module insists on decoding the text into unicode, and by default uses utf8. As your input file is not utf8-encoded, it chokes and raises a UnicodeDecodeError.
Latin1 has a nice property: the unicode value of any byte is just the value of the byte, so you are sure to decode anything - whether the result makes sense then depends on the actual encoding being Latin1...
So you could just do:
categories = json.loads(row[1], encoding="Latin1")
Alternatively, if you want to ignore non-ASCII characters, you could first convert the byte string to unicode, ignoring errors, and only then load the json:
categories = json.loads(row[1].decode('ascii', 'ignore')) # drop all non-ascii characters
Most probably you have some non-ASCII characters in your csv content.
import re

def remove_unicode(text):
    if not text:
        return text
    if isinstance(text, str):
        text = str(text.decode('ascii', 'ignore'))
    else:
        text = text.encode('ascii', 'ignore')
    remove_ctrl_chars_regex = re.compile(r'[^\x20-\x7e]')
    return remove_ctrl_chars_regex.sub('', text)

...
vendorId = remove_unicode(row[0])
...
categories = json.loads(remove_unicode(row[1]))
New to Python... I'm trying to get the parsed data decoded properly into a SQLite database, but it just won't work :(
# coding: utf8
from pysqlite2 import dbapi2 as sqlite3
import urllib2
from bs4 import BeautifulSoup
from string import *

conn = sqlite3.connect(':memory:')
cursor = conn.cursor()

# create a table
def createTable():
    cursor.execute("""CREATE TABLE characters
        (rank INTEGER PRIMARY KEY, word TEXT, definition TEXT)
        """)

def insertChar(rank, word, definition):
    cursor.execute("""INSERT INTO characters (rank,word,definition)
        VALUES (?,?,?)""", (rank, word, definition))

def main():
    createTable()
    # u = unicode("辣", "utf-8")
    # insertChar(1, u, "123123123")
    soup = BeautifulSoup(urllib2.urlopen('http://www.zein.se/patrick/3000char.html').read())
    # print (html_doc.prettify())
    tables = soup.blockquote.table
    # print tables
    rows = tables.find_all('tr')
    result = []
    for tr in rows:
        cols = tr.find_all('td')
        character = []
        x = cols[0].string
        y = cols[1].string
        z = cols[2].string
        xx = unicode(x, "utf-8")
        yy = unicode(y, "utf-8")
        zz = unicode(z, "utf-8")
        insertChar(xx, yy, zz)
    conn.commit()

main()
I keep getting the following errors:
TypeError: decoding Unicode is not supported
WARNING:root:Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
Traceback (most recent call last):
  File "sqlitetestbed.py", line 64, in <module>
    main()
  File "sqlitetestbed.py", line 48, in main
    xx = unicode(x, "utf-8")
Traceback (most recent call last):
  File "sqlitetestbed.py", line 52, in <module>
    main()
  File "sqlitetestbed.py", line 48, in main
    insertChar(x,y,z)
  File "sqlitetestbed.py", line 20, in insertChar
    VALUES (?,?,?)""",(rank,word,definition))
pysqlite2.dbapi2.IntegrityError: datatype mismatch
I'm probably doing something that's really stupid... :( Please tell me what I'm doing wrong... Thanks!
x is already unicode, as the cols[0].string field contains unicode (just as documented).
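So a minimal fix for the loop could be to drop those calls entirely. A sketch (the int() cast is my guess at the IntegrityError, since rank is declared INTEGER PRIMARY KEY but was being handed a string):
for tr in rows:
    cols = tr.find_all('td')
    x = cols[0].string  # already unicode; unicode(x, "utf-8") re-decodes and fails
    y = cols[1].string
    z = cols[2].string
    insertChar(int(x), y, z)  # cast the rank so the INTEGER column accepts it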
I'm trying to parse through a bunch of logfiles (up to 4 GiB) in a tar.gz file. The source files come from RedHat 5.8 Server systems and SunOS 5.10; processing has to be done on Windows XP.
I iterate through the tar.gz files, read the files, decode the file contents to UTF-8 and parse them with regular expressions before further processing.
When I'm writing out the processed data along with the raw-data that was read from the tar.gz, I get the following error:
Traceback (most recent call last):
  File "C:\WoMMaxX\lt_automation\Tools\LogParser.py", line 375, in <module>
    p.analyze_longtails()
  File "C:\WoMMaxX\lt_automation\Tools\LogParser.py", line 196, in analyze_longtails
    oFile.write(entries[key]['source'] + '\n')
  File "C:\Python\3.2\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 24835-24836: character maps to <undefined>
Here's the part where I read and parse the logfiles:
def getSalesSoaplogEntries(perfid=None):
    for tfile in parser.salestarfiles:
        path = os.path.join(parser.logpath, tfile)
        if os.path.isfile(path):
            if tarfile.is_tarfile(path):
                tar = tarfile.open(path, 'r:gz')
                for tarMember in tar.getmembers():
                    if 'salescomponent-soap.log' in tarMember.name:
                        tarMemberFile = tar.extractfile(tarMember)
                        content = tarMemberFile.read().decode('UTF-8', 'surrogateescape')
                        for m in parser.soaplogregex.finditer(content):
                            entry = {}
                            entry['time'] = datetime(datetime.now().year, int(m.group('month')), int(m.group('day')), int(m.group('hour')), int(m.group('minute')), int(m.group('second')), int(m.group('millis')) * 1000)
                            entry['perfid'] = m.group('perfid')
                            entry['direction'] = m.group('direction')
                            entry['payload'] = m.group('payload')
                            entry['file'] = tarMember.name
                            entry['source'] = m.group(0)
                            sm = parser.soaplogmethodregex.match(entry['payload'])
                            if sm:
                                entry['method'] = sm.group('method')
                            if entry['time'] >= parser.starttime and entry['time'] <= parser.endtime:
                                if perfid and entry['perfid'] == perfid:
                                    yield entry
                tar.members = []
And here's the part where I write out the processed information along with the raw data (it's an aggregation of all log entries for one specific process):
if len(entries) > 0:
    time = perfentry['time']
    filename = time.isoformat('-').replace(':', '').replace('-', '') + 'longtail_' + perfentry['perfid'] + '.txt'
    oFile = open(os.path.join(parser.logpath, filename), 'w')
    oFile.write(perfentry['source'] + '\n')
    oFile.write('------\n')
    for key in sorted(entries.keys()):
        oFile.write('------\n')
        oFile.write(entries[key]['source'] + '\n') #<-- here it is failing
What I don't get: if reading the files as UTF-8 works, why isn't it possible to just write them out as UTF-8 again? What am I doing wrong?
Your output file is using the default encoding for your OS, which is not UTF-8. Use codecs.open instead of open and specify encoding='utf-8'.
oFile = codecs.open(os.path.join(parser.logpath,filename), 'w', encoding='utf-8')
See http://docs.python.org/howto/unicode.html#reading-and-writing-unicode-data
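And since the traceback shows Python 3.2, the built-in open accepts the same parameter directly, so this is equivalent:
oFile = open(os.path.join(parser.logpath, filename), 'w', encoding='utf-8')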
Afternoon,
I am having some trouble with a SQLite-to-CSV Python script. I have searched high and low for an answer, but none have worked for me, or I am having a problem with my syntax.
I want to replace characters within the SQLite database which fall outside of the ASCII range (byte values of 128 and above).
Here is the script I have been using:
#!/opt/local/bin/python
import sqlite3
import csv, codecs, cStringIO
class UnicodeWriter:
    """
    A CSV writer which will write rows to CSV file "f",
    which is encoded in the given encoding.
    """
    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        # Redirect output to a queue
        self.queue = cStringIO.StringIO()
        self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()

    def writerow(self, row):
        self.writer.writerow([unicode(s).encode("utf-8") for s in row])
        # Fetch UTF-8 output from the queue ...
        data = self.queue.getvalue()
        data = data.decode("utf-8")
        # ... and reencode it into the target encoding
        data = self.encoder.encode(data)
        # write to the target stream
        self.stream.write(data)
        # empty queue
        self.queue.truncate(0)

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)
conn = sqlite3.connect('test.db')
c = conn.cursor()
# Select whichever rows you want in whatever order you like
c.execute('select ROWID, Name, Type, PID from PID')
writer = UnicodeWriter(open("ProductListing.csv", "wb"))
# Make sure the list of column headers you pass in are in the same order as your SELECT
writer.writerow(["ROWID", "Product Name", "Product Type", "PID", ])
writer.writerows(c)
I have tried to add 'replace' as indicated here, but got the same error: Python: Convert Unicode to ASCII without errors for CSV file
The error is a UnicodeDecodeError:
Traceback (most recent call last):
  File "SQLite2CSV1.py", line 53, in <module>
    writer.writerows(c)
  File "SQLite2CSV1.py", line 32, in writerows
    self.writerow(row)
  File "SQLite2CSV1.py", line 19, in writerow
    self.writer.writerow([unicode(s).encode("utf-8") for s in row])
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 65: ordinal not in range(128)
Obviously I want the code to be robust enough that, if it encounters characters outside these bounds, it replaces them with a character such as '?' (\x3f).
Is there a way to do this within the UnicodeWriter class, and a way to make the code robust enough that it won't produce these errors?
Your help is greatly appreciated.
If you just want to write an ASCII CSV, simply use the stock csv.writer(). To ensure that all values passed are indeed ASCII, use encode('ascii', errors='replace').
Example:
import csv
rows = [
    [u'some', u'other', u'more'],
    [u'umlaut:\u00fd', u'euro sign:\u20ac', '']
]
with open('/tmp/test.csv', 'wb') as csvFile:
    writer = csv.writer(csvFile)
    for row in rows:
        asciifiedRow = [item.encode('ascii', errors='replace') for item in row]
        print '%r --> %r' % (row, asciifiedRow)
        writer.writerow(asciifiedRow)
The console output for this is:
[u'some', u'other', u'more'] --> ['some', 'other', 'more']
[u'umlaut:\xfd', u'euro sign:\u20ac', ''] --> ['umlaut:?', 'euro sign:?', '']
The resulting CSV file contains:
some,other,more
umlaut:?,euro sign:?,
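If you would rather stay inside the UnicodeWriter class from the question, one hedged possibility is to decode byte strings defensively in writerow before re-encoding. A sketch only, which assumes the offending values are byte strings in roughly UTF-8 and uses 'replace' to turn undecodable bytes into placeholder characters:
def writerow(self, row):
    cells = []
    for s in row:
        if isinstance(s, str):
            # byte string straight from sqlite: decode with replacement so
            # bad bytes become U+FFFD instead of raising UnicodeDecodeError
            s = s.decode('utf-8', 'replace')
        else:
            s = unicode(s)
        cells.append(s.encode('utf-8'))
    self.writer.writerow(cells)
    # ... rest of the original method unchanged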
With access to a unix environment, here's what worked for me
sqlite3.exe a.db .dump > a.sql;
tr -d "[\\200-\\377]" < a.sql > clean.sql;
sqlite3.exe clean.db < clean.sql;
(It's not a python solution, but maybe it will help someone else due to its brevity. This solution STRIPS OUT all non ascii characters, doesn't try to replace them.)
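For completeness, a rough Python 2 equivalent of the tr step, under the same strip-don't-replace caveat:
# read the dump as raw bytes and delete every byte outside the ASCII range
data = open('a.sql', 'rb').read()
high_bytes = ''.join(chr(i) for i in range(0x80, 0x100))
open('clean.sql', 'wb').write(data.translate(None, high_bytes))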