import csv
import io

def openFile(fileName):
    try:
        trainFile = io.open(fileName, "r", encoding="utf-8")
    except IOError as e:
        print("File could not be opened: {}".format(e))
    else:
        trainData = csv.DictReader(trainFile)
        print trainData
        return trainData
def computeTFIDF(trainData):
    bodyList = []
    print "Inside computeTFIDF"
    for row in trainData:
        for key, value in row.iteritems():
            print key, unicode(value, "utf-8", "ignore")
    print "Done"
    return
if __name__ == "__main__":
print "Main"
trainData = openFile("../Data/TrainSample.csv")
print "File Opened"
computeTFIDF(trainData)
Error:

Traceback (most recent call last):
  File "C:\DebSeal\IUB MS Program\IUB Sem III\Facebook Kaggle Comp\Src\facebookChallenge.py", line 62, in <module>
    computeTFIDF(trainData)
  File "C:\DebSeal\IUB MS Program\IUB Sem III\Facebook Kaggle Comp\Src\facebookChallenge.py", line 42, in computeTFIDF
    for row in trainData:
  File "C:\Python27\lib\csv.py", line 104, in next
    row = self.reader.next()
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 215: ordinal not in range(128)
TrainSample.csv: Is a csv file with 4 columns (with header).
OS: Windows 7 64 bit.
Using Python 2.x
I don't know what is going wrong here. I told it to ignore encoding errors, but it still throws the same error.
I think the error is raised before control ever reaches my decoding step.
Can anybody tell me where I am going wrong?
The Python 2 CSV module does not handle Unicode input.
Open the file in binary mode, and decode after parsing it as CSV. This is safe for the UTF-8 codec as newlines, delimiters and quotes all encode to 1 byte.
The csv module documentation includes a UnicodeReader wrapper class in the example section that will do the decoding for you; it is easily adapted to the DictReader class:
import csv

class UnicodeDictReader:
    """
    A CSV reader which will iterate over lines in the CSV file "f",
    which is encoded in the given encoding.
    """
    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        self.encoding = encoding
        self.reader = csv.DictReader(f, dialect=dialect, **kwds)

    def next(self):
        row = self.reader.next()
        return {k: unicode(v, self.encoding) for k, v in row.iteritems()}

    def __iter__(self):
        return self
Use this with the file opened in binary mode:
def openFile(fileName):
    try:
        trainFile = open(fileName, "rb")
    except IOError as e:
        print "File could not be opened: {}".format(e)
    else:
        return UnicodeDictReader(trainFile)
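A minimal usage sketch (it assumes the same sample path as the question, and that openFile returns None when the file could not be opened):

trainData = openFile("../Data/TrainSample.csv")
if trainData is not None:
    for row in trainData:
        print row  # each row maps header names to unicode values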
I can't comment on Martijn's answer, but his solution works perfectly for me after a small upgrade, which I leave here for others:
def next(self):
    row = self.reader.next()
    try:
        d = dict((unicode(k, self.encoding), unicode(v, self.encoding)) for k, v in row.iteritems())
    except TypeError:
        d = row
    return d
One thing is that Python 2.6 and lower don't support dict comprehensions, hence the dict() call. Another is that dict values can have types other than str, which the two-argument unicode() call cannot handle, so it is worth catching TypeError in case of None or a number.
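For example, in a Python 2 interpreter, the two-argument form of unicode() rejects anything that is not a string or buffer:

>>> unicode(None, "utf-8")
Traceback (most recent call last):
  ...
TypeError: coercing to Unicode: need string or buffer, NoneType found
>>> unicode(42, "utf-8")
Traceback (most recent call last):
  ...
TypeError: coercing to Unicode: need string or buffer, int found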
One more thing that drove me crazy: it does not work when you open the file with an encoding. Just use a plain open().
Related
I am dealing with a very large CSV file in Python where some lines throw the error "'utf-8' codec can't decode byte 0x9b in position 7657: invalid start byte". Is there a way to skip lines that aren't UTF-8 without going through the data by hand to delete or fix them?
for filename in filenames:
    f = open(filename, 'rt')
    reader = csv.reader(f, delimiter=',')
    for row in reader:
        pass  # process data for future use
I can't use the non-UTF-8 data because later processing steps require UTF-8.
You could use a filter that reads each line as raw bytes and tries to decode it as UTF-8, and then:
if successful, passes it down to the csv reader
if not, stores it for later analysis
Assuming that you are using Python 2, you could use something like:
class MyFilter:
    def __init__(self, instr, errstr):
        self.instr = instr
        self.errstr = errstr

    def __enter__(self):
        print("ENTERING filter")
        return self

    def __exit__(self, a, b, c):
        print("EXITING filter")
        self.instr.close()
        self.errstr.close()
        return False

    def __next__(self):
        line = next(self.instr)
        while True:
            try:
                # decode only to validate; the csv module gets the raw line
                t = line.decode('utf8')
                return line.strip()
            except UnicodeDecodeError:
                self.errstr.write(line)
                line = next(self.instr)

    def __iter__(self):
        return self

    def next(self):
        return self.__next__()
You could then use it this way (assuming Python 2.7), collecting all offending lines in err.txt:
with open('file.csv') as istream, open("err.txt", 'w') as err, MyFilter(istream, err) as fd:
    c = csv.reader(fd)
    for i in c:
        print i  # do your stuff here
If you use Python 3, you can use almost the same filter class, simply replacing the line return line.strip() with return t.strip(), in order to return a string and not bytes.
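In code, the Python 3 __next__ then reads (a sketch of just that one-line change):

def __next__(self):
    line = next(self.instr)
    while True:
        try:
            t = line.decode('utf8')
            return t.strip()  # return str, not bytes, for Python 3's csv module
        except UnicodeDecodeError:
            self.errstr.write(line)
            line = next(self.instr)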
Usage is again almost the same :
with open('file.csv', 'rb') as istream, open("err.txt", 'wb') as err, MyFilter(istream, err) as fd:
    c = csv.reader(fd)
    for i in c:
        print(i)  # do your stuff here
Per your comment, you also want to filter out lines containing null characters. This only needs a slight change in the filter, the while block becoming (Python 3 version):
while True:
    if b'\x00' not in line:
        try:
            t = line.decode('utf8')
            return t.strip()
        except UnicodeDecodeError:
            pass
    self.errstr.write(line)
    line = next(self.instr)
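Put together, here is a self-contained Python 3 sketch of that null-filtering variant (the class name, file.csv and err.txt are placeholders of mine, not part of the original filter):

import csv

class NullSafeUTF8Filter:
    # Yields decoded lines that are valid UTF-8 and contain no NUL bytes;
    # everything else goes to the error stream for later analysis.
    def __init__(self, instr, errstr):
        self.instr = instr
        self.errstr = errstr

    def __iter__(self):
        return self

    def __next__(self):
        line = next(self.instr)
        while True:
            if b'\x00' not in line:
                try:
                    return line.decode('utf8').strip()
                except UnicodeDecodeError:
                    pass
            self.errstr.write(line)
            line = next(self.instr)

with open('file.csv', 'rb') as istream, open('err.txt', 'wb') as err:
    for row in csv.reader(NullSafeUTF8Filter(istream, err)):
        print(row)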
I have a problem converting nested JSON to CSV. For this I use https://github.com/vinay20045/json-to-csv (forked a bit to support Python 3.4); here is the full json-to-csv.py file.
Conversion works if I set
#Base Condition
else:
    reduced_item[str(key)] = (str(value)).encode('utf8', 'ignore')
and
fp = open(json_file_path, 'r', encoding='utf-8')
but when I import the CSV into MS Excel I see mangled Cyrillic characters, for example \xe0\xf1; English text is OK.
I experimented with setting encode('cp1251', 'ignore'), but then I got this error:
UnicodeDecodeError: 'charmap' codec can't decode byte X in position Y: character maps to <undefined>
import sys
import json
import csv

##
# This function converts an item like
# {
#   "item_1":"value_11",
#   "item_2":"value_12",
#   "item_3":"value_13",
#   "item_4":["sub_value_14", "sub_value_15"],
#   "item_5":{
#       "sub_item_1":"sub_item_value_11",
#       "sub_item_2":["sub_item_value_12", "sub_item_value_13"]
#   }
# }
# To
# {
#   "node_item_1":"value_11",
#   "node_item_2":"value_12",
#   "node_item_3":"value_13",
#   "node_item_4_0":"sub_value_14",
#   "node_item_4_1":"sub_value_15",
#   "node_item_5_sub_item_1":"sub_item_value_11",
#   "node_item_5_sub_item_2_0":"sub_item_value_12",
#   "node_item_5_sub_item_2_1":"sub_item_value_13"
# }
##
def reduce_item(key, value):
    global reduced_item

    # Reduction Condition 1
    if type(value) is list:
        i = 0
        for sub_item in value:
            reduce_item(key + '_' + str(i), sub_item)
            i = i + 1

    # Reduction Condition 2
    elif type(value) is dict:
        sub_keys = value.keys()
        for sub_key in sub_keys:
            reduce_item(key + '_' + str(sub_key), value[sub_key])

    # Base Condition
    else:
        reduced_item[str(key)] = (str(value)).encode('cp1251', 'ignore')

if __name__ == "__main__":
    if len(sys.argv) != 4:
        print("\nUsage: python json_to_csv.py <node_name> <json_in_file_path> <csv_out_file_path>\n")
    else:
        # Reading arguments
        node = sys.argv[1]
        json_file_path = sys.argv[2]
        csv_file_path = sys.argv[3]

        fp = open(json_file_path, 'r', encoding='cp1251')
        json_value = fp.read()
        raw_data = json.loads(json_value)

        processed_data = []
        header = []
        for item in raw_data[node]:
            reduced_item = {}
            reduce_item(node, item)
            header += reduced_item.keys()
            processed_data.append(reduced_item)

        header = list(set(header))
        header.sort()

        with open(csv_file_path, 'wt+') as f:  # wb+ for Python 2.7
            writer = csv.DictWriter(f, header, quoting=csv.QUOTE_ALL, delimiter=',')
            writer.writeheader()
            for row in processed_data:
                writer.writerow(row)

        print("Just completed writing csv file with %d columns" % len(header))
How do I convert the Cyrillic correctly? I also want to skip bad characters.
You need to know which Cyrillic encoding the file you are going to open uses.
For example, this is enough in Python 3:
with open(args.input_file, 'r', encoding="cp866") as input_file:
    data = input_file.read()
    structure = json.loads(data)
In Python 3 the data variable is automatically str (Unicode). In Python 2 there might be problems feeding the input to json.
Also try printing a line in the Python interpreter and see whether the characters come out right. Without the input file it is hard to tell whether everything is right. Also, are you sure it is a Python problem and not an Excel-related one? Did you try opening the file in Notepad++ or a similar encoding-aware editor?
The most important thing when working with encodings is checking that the input and the output are right. I would suggest looking here.
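For instance, a minimal Python 3 sketch of the whole round trip (assumptions: the JSON is cp1251-encoded, its top level is a list of objects, and the file and field names are made up; utf-8-sig writes the BOM Excel uses to detect UTF-8, which usually fixes mangled Cyrillic):

import csv
import json

# Read the JSON using its actual encoding (cp1251 is an assumption here).
with open('data.json', 'r', encoding='cp1251') as fp:
    records = json.load(fp)

# Write UTF-8 with a BOM so Excel picks the right encoding for Cyrillic text.
with open('out.csv', 'w', encoding='utf-8-sig', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['name', 'city'])
    for rec in records:
        writer.writerow([rec.get('name', ''), rec.get('city', '')])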
Maybe you could use chardet to detect the file's encoding.
import json
import chardet

File = 'arq.GeoJson'
enc = chardet.detect(open(File, 'rb').read())['encoding']
with open(File, 'r', encoding=enc) as f:
    data = json.load(f)
This avoids having to guess the encoding by hand.
I want to write data to files where a row from a CSV should look like this list (directly from the Python console):
row = ['\xef\xbb\xbft_11651497', 'http://kozbeszerzes.ceu.hu/entity/t/11651497.xml', "Szabolcs Mag '98 Kft.", 'ny\xc3\xadregyh\xc3\xa1za', 'ny\xc3\xadregyh\xc3\xa1za', '4400', 't\xc3\xbcnde utca 20.', 47.935175, 21.744975, u'Ny\xedregyh\xe1za', u'Borb\xe1nya', u'Szabolcs-Szatm\xe1r-Bereg', u'Ny\xedregyh\xe1zai', u'20', u'T\xfcnde utca', u'Magyarorsz\xe1g', u'4405']
Py2k's csv module does not do Unicode, but I had a UnicodeWriter wrapper:
import cStringIO, codecs, csv

class UnicodeWriter:
    """
    A CSV writer which will write rows to CSV file "f",
    which is encoded in the given encoding.
    """
    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        # Redirect output to a queue
        self.queue = cStringIO.StringIO()
        self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()

    def writerow(self, row):
        self.writer.writerow([unicode(s).encode("utf-8") for s in row])
        # Fetch UTF-8 output from the queue ...
        data = self.queue.getvalue()
        data = data.decode("utf-8")
        # ... and reencode it into the target encoding
        data = self.encoder.encode(data)
        # write to the target stream
        self.stream.write(data)
        # empty queue
        self.queue.truncate(0)

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)
However, these lines still produce the dreaded encoding error message below:
f.write(codecs.BOM_UTF8)
writer = UnicodeWriter(f)
writer.writerow(row)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 9: ordinal not in range(128)
What is there to do? Thanks!
You are passing bytestrings containing non-ASCII data in, and these are being decoded to Unicode using the default codec at this line:
self.writer.writerow([unicode(s).encode("utf-8") for s in row])
unicode(bytestring) with data that cannot be decoded as ASCII fails:
>>> unicode('\xef\xbb\xbft_11651497')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0: ordinal not in range(128)
Decode the data to Unicode before passing it to the writer:
row = [v.decode('utf8') if isinstance(v, str) else v for v in row]
This assumes that your bytestring values contain UTF-8 data. If you have a mix of encodings, try to decode to Unicode at the point of origin, where your program first sourced the data. You really want to do that anyway, regardless of where the data came from, even if it is already UTF-8-encoded.
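A minimal sketch of that fix (Python 2; UnicodeWriter is the wrapper from the question, out.csv is a placeholder, and the row is shortened from the one shown above):

import codecs

row = ['\xef\xbb\xbft_11651497', u'Ny\xedregyh\xe1za', '4400']  # mixed str/unicode values
row = [v.decode('utf8') if isinstance(v, str) else v for v in row]

with open('out.csv', 'wb') as f:
    f.write(codecs.BOM_UTF8)
    writer = UnicodeWriter(f)
    writer.writerow(row)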
I have copied this script from [python web site][1]. This is another question, but now the problem is with encoding:
import sqlite3
import csv
import codecs
import cStringIO
import sys

class UTF8Recoder:
    """
    Iterator that reads an encoded stream and reencodes the input to UTF-8
    """
    def __init__(self, f, encoding):
        self.reader = codecs.getreader(encoding)(f)

    def __iter__(self):
        return self

    def next(self):
        return self.reader.next().encode("utf-8")

class UnicodeReader:
    """
    A CSV reader which will iterate over lines in the CSV file "f",
    which is encoded in the given encoding.
    """
    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        f = UTF8Recoder(f, encoding)
        self.reader = csv.reader(f, dialect=dialect, **kwds)

    def next(self):
        row = self.reader.next()
        return [unicode(s, "utf-8") for s in row]

    def __iter__(self):
        return self

class UnicodeWriter:
    """
    A CSV writer which will write rows to CSV file "f",
    which is encoded in the given encoding.
    """
    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        # Redirect output to a queue
        self.queue = cStringIO.StringIO()
        self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()

    def writerow(self, row):
        self.writer.writerow([s.encode("utf-8") for s in row])
        # Fetch UTF-8 output from the queue ...
        data = self.queue.getvalue()
        data = data.decode("utf-8")
        # ... and reencode it into the target encoding
        data = self.encoder.encode(data)
        # write to the target stream
        self.stream.write(data)
        # empty queue
        self.queue.truncate(0)

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)
This time the problem is with encoding. When I ran it, it gave me this error:
Traceback (most recent call last):
  File "makeCSV.py", line 87, in <module>
    uW.writerow(d)
  File "makeCSV.py", line 54, in writerow
    self.writer.writerow([s.encode("utf-8") for s in row])
AttributeError: 'int' object has no attribute 'encode'
Then I converted all integers to strings, but this time I got this error:
Traceback (most recent call last):
  File "makeCSV.py", line 87, in <module>
    uW.writerow(d)
  File "makeCSV.py", line 54, in writerow
    self.writer.writerow([str(s).encode("utf-8") for s in row])
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 1: ordinal not in range(128)
I implemented the above to deal with Unicode characters, but it still gives me such errors. What is the problem and how do I fix it?
Then I converted all integers to string,
You converted both integers and strings to byte strings. For strings this will use the default character encoding, which happens to be ASCII, and this fails when you have non-ASCII characters. You want unicode instead of str.
self.writer.writerow([unicode(s).encode("utf-8") for s in row])
It might be better to convert everything to unicode before calling that method. The class is designed specifically for parsing Unicode strings. It was not designed to support other data types.
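For example, a small pre-conversion helper (a sketch of mine; uW and d are the names from your traceback):

def to_unicode(value, encoding='utf-8'):
    # bytestrings get decoded; ints, floats and None go through unicode()
    if isinstance(value, str):
        return value.decode(encoding)
    return unicode(value)

uW.writerow([to_unicode(s) for s in d])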
From the documentation:
http://docs.python.org/library/stringio.html?highlight=cstringio#cStringIO.StringIO
Unlike the StringIO module, this module is not able to accept Unicode strings that cannot be encoded as plain ASCII strings.
I.e. only 7-bit clean strings can be stored.
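Which is why writing a non-ASCII unicode string straight into the cStringIO queue fails, e.g. in a Python 2 interpreter:

>>> import cStringIO
>>> buf = cStringIO.StringIO()
>>> buf.write(u'T\xfcnde')
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 1: ordinal not in range(128)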
If you are using Python 2, make the encoding str(s.encode("utf-8")), i.e.:
def writerow(self, row):
    self.writer.writerow([str(s.encode("utf-8")) for s in row])
    # Fetch UTF-8 output from the queue ...
    data = self.queue.getvalue()
    data = data.decode("utf-8")
    # ... and reencode it into the target encoding
    data = self.encoder.encode(data)
    # write to the target stream
    self.stream.write(data)
    # empty queue
    self.queue.truncate(0)
Afternoon,
I am having some trouble with a SQLite to CSV Python script. I have searched high and low for an answer, but none has worked for me, or I am having a problem with my syntax.
I want to replace characters within the SQLite database which fall outside of the ASCII table (code points above 127).
Here is the script I have been using:
#!/opt/local/bin/python

import sqlite3
import csv, codecs, cStringIO

class UnicodeWriter:
    """
    A CSV writer which will write rows to CSV file "f",
    which is encoded in the given encoding.
    """
    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        # Redirect output to a queue
        self.queue = cStringIO.StringIO()
        self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()

    def writerow(self, row):
        self.writer.writerow([unicode(s).encode("utf-8") for s in row])
        # Fetch UTF-8 output from the queue ...
        data = self.queue.getvalue()
        data = data.decode("utf-8")
        # ... and reencode it into the target encoding
        data = self.encoder.encode(data)
        # write to the target stream
        self.stream.write(data)
        # empty queue
        self.queue.truncate(0)

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)

conn = sqlite3.connect('test.db')
c = conn.cursor()

# Select whichever rows you want in whatever order you like
c.execute('select ROWID, Name, Type, PID from PID')

writer = UnicodeWriter(open("ProductListing.csv", "wb"))
# Make sure the list of column headers you pass in are in the same order as your SELECT
writer.writerow(["ROWID", "Product Name", "Product Type", "PID"])
writer.writerows(c)
I have tried to add 'replace' as indicated here, but I got the same error: Python: Convert Unicode to ASCII without errors for CSV file.
The error is the UnicodeDecodeError:
Traceback (most recent call last):
  File "SQLite2CSV1.py", line 53, in <module>
    writer.writerows(c)
  File "SQLite2CSV1.py", line 32, in writerows
    self.writerow(row)
  File "SQLite2CSV1.py", line 19, in writerow
    self.writer.writerow([unicode(s).encode("utf-8") for s in row])
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 65: ordinal not in range(128)
Obviously I want the code to be robust enough that, if it encounters characters outside these bounds, it replaces them with a character such as '?' (\x3f).
Is there a way to do this within the UnicodeWriter class? And a way to make the code robust enough that it won't produce these errors?
Your help is greatly appreciated.
If you just want to write an ASCII CSV, simply use the stock csv.writer(). To ensure that all values passed are indeed ASCII, use encode('ascii', errors='replace').
Example:
import csv

rows = [
    [u'some', u'other', u'more'],
    [u'umlaut:\u00fd', u'euro sign:\u20ac', '']
]

with open('/tmp/test.csv', 'wb') as csvFile:
    writer = csv.writer(csvFile)
    for row in rows:
        asciifiedRow = [item.encode('ascii', 'replace') for item in row]
        print '%r --> %r' % (row, asciifiedRow)
        writer.writerow(asciifiedRow)
The console output for this is:
[u'some', u'other', u'more'] --> ['some', 'other', 'more']
[u'umlaut:\xfd', u'euro sign:\u20ac', ''] --> ['umlaut:?', 'euro sign:?', '']
The resulting CSV file contains:
some,other,more
umlaut:?,euro sign:?,
With access to a unix environment, here's what worked for me:
sqlite3.exe a.db .dump > a.sql;
tr -d "[\\200-\\377]" < a.sql > clean.sql;
sqlite3.exe clean.db < clean.sql;
(It's not a Python solution, but maybe it will help someone else due to its brevity. This solution STRIPS OUT all non-ASCII characters; it doesn't try to replace them.)
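For comparison, a rough Python 3 equivalent of the tr step (a sketch; it likewise deletes every byte in the 0x80-0xFF range, and the file names are placeholders):

# mirrors: tr -d "[\200-\377]" < a.sql > clean.sql
high_bytes = bytes(range(0x80, 0x100))
with open('a.sql', 'rb') as src, open('clean.sql', 'wb') as dst:
    for chunk in iter(lambda: src.read(65536), b''):
        dst.write(chunk.translate(None, high_bytes))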