Mixed encoding in csv file - python

I have a fairly large database (10,000+ records with about 120 vars each) in R. The problem is, that about half of the variables in the original .csv file were correctly encoded in UTF-8 while the rest were encoded in ANSI (Windows-1252) but are being decoded as UTF-8 resulting in weird characters for non-ASCII characters (mainly latin) like this é or ó.
I cannot simply change the file encoding because half of it would be decoded with the wrong type. Furthermore, I have no way of knowing which columns were encoded correctly and which ones didn't, and all I have is the original .csv file which I'm trying to fix.
So far I have found that a plain text file can be encoded in UTF-8 and misinterpreted characters (bad Unicode) can be inferred. One library that provides such functionality is ftfy for Python. However, I'm using the following code and so far, haven't had success:
import ftfy
file = open("file.csv", "r", encoding = "UTF8")
content = file.read()
content = ftfy.fix_text(content)
However, content will show exactly the same text than before. I believe this has to do with the way ftfy is inferring the content encoding.
Nevertheless, if I run ftfy.fix_text("Pública que cotiza en México") it will show the right response:
>> 'Pública que cotiza en México'
I'm thinking that maybe the way to solve the problem is to iterate through each of the values (cells) in the .csv file and try to fix if with ftfy, and the importing the file back to R, but it seems a little bit complicated
Any suggestions?

In fact, there was a mixed encoding for random cells in several places. Probably, there was an issue when exporting the data from it's original source.
The problem with ftfy is that it processes the file line by line, and if it encountered well formated characters, it assumes that the whole line is encoded in the same way and that strange characters were intended.
Since these errors appeared randomly through all the file, I wasn't able to transpose the whole table and process every line (column), so the answer was to process cell by cell. Fortunately, Python has a standard library that provides functionality to work painlessly with csv (specially because it escapes cells correctly).
This is the code I used to process the file:
import csv
import ftfy
import sys
def main(argv):
# input file
csvfile = open(argv[1], "r", encoding = "UTF8")
reader = csv.DictReader(csvfile)
# output stream
outfile = open(argv[2], "w", encoding = "Windows-1252") # Windows doesn't like utf8
writer = csv.DictWriter(outfile, fieldnames = reader.fieldnames, lineterminator = "\n")
# clean values
writer.writeheader()
for row in reader:
for col in row:
row[col] = ftfy.fix_text(row[col])
writer.writerow(row)
# close files
csvfile.close()
outfile.close()
if __name__ == "__main__":
main(sys.argv)
And then, calling:
$ python fix_encoding.py data.csv out.csv
will output a csv file with the right encoding.

a small suggestion: divide and conquer.
try using one tool (ftfy?) to align all the file to the same encoding (and save as plaintext file) and only then try parsing it as csv

Related

Python problem reading CSV files that contain the word NUL [duplicate]

I'm working with some CSV files, with the following code:
reader = csv.reader(open(filepath, "rU"))
try:
for row in reader:
print 'Row read successfully!', row
except csv.Error, e:
sys.exit('file %s, line %d: %s' % (filename, reader.line_num, e))
And one file is throwing this error:
file my.csv, line 1: line contains NULL byte
What can I do? Google seems to suggest that it may be an Excel file that's been saved as a .csv improperly. Is there any way I can get round this problem in Python?
== UPDATE ==
Following #JohnMachin's comment below, I tried adding these lines to my script:
print repr(open(filepath, 'rb').read(200)) # dump 1st 200 bytes of file
data = open(filepath, 'rb').read()
print data.find('\x00')
print data.count('\x00')
And this is the output I got:
'\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1\x00\x00\x00\x00\x00\x00\x00\x00\ .... <snip>
8
13834
So the file does indeed contain NUL bytes.
As #S.Lott says, you should be opening your files in 'rb' mode, not 'rU' mode. However that may NOT be causing your current problem. As far as I know, using 'rU' mode would mess you up if there are embedded \r in the data, but not cause any other dramas. I also note that you have several files (all opened with 'rU' ??) but only one causing a problem.
If the csv module says that you have a "NULL" (silly message, should be "NUL") byte in your file, then you need to check out what is in your file. I would suggest that you do this even if using 'rb' makes the problem go away.
repr() is (or wants to be) your debugging friend. It will show unambiguously what you've got, in a platform independant fashion (which is helpful to helpers who are unaware what od is or does). Do this:
print repr(open('my.csv', 'rb').read(200)) # dump 1st 200 bytes of file
and carefully copy/paste (don't retype) the result into an edit of your question (not into a comment).
Also note that if the file is really dodgy e.g. no \r or \n within reasonable distance from the start of the file, the line number reported by reader.line_num will be (unhelpfully) 1. Find where the first \x00 is (if any) by doing
data = open('my.csv', 'rb').read()
print data.find('\x00')
and make sure that you dump at least that many bytes with repr or od.
What does data.count('\x00') tell you? If there are many, you may want to do something like
for i, c in enumerate(data):
if c == '\x00':
print i, repr(data[i-30:i]) + ' *NUL* ' + repr(data[i+1:i+31])
so that you can see the NUL bytes in context.
If you can see \x00 in the output (or \0 in your od -c output), then you definitely have NUL byte(s) in the file, and you will need to do something like this:
fi = open('my.csv', 'rb')
data = fi.read()
fi.close()
fo = open('mynew.csv', 'wb')
fo.write(data.replace('\x00', ''))
fo.close()
By the way, have you looked at the file (including the last few lines) with a text editor? Does it actually look like a reasonable CSV file like the other (no "NULL byte" exception) files?
data_initial = open("staff.csv", "rb")
data = csv.reader((line.replace('\0','') for line in data_initial), delimiter=",")
This works for me.
Reading it as UTF-16 was also my problem.
Here's my code that ended up working:
f=codecs.open(location,"rb","utf-16")
csvread=csv.reader(f,delimiter='\t')
csvread.next()
for row in csvread:
print row
Where location is the directory of your csv file.
You could just inline a generator to filter out the null values if you want to pretend they don't exist. Of course this is assuming the null bytes are not really part of the encoding and really are some kind of erroneous artifact or bug.
with open(filepath, "rb") as f:
reader = csv.reader( (line.replace('\0','') for line in f) )
try:
for row in reader:
print 'Row read successfully!', row
except csv.Error, e:
sys.exit('file %s, line %d: %s' % (filename, reader.line_num, e))
I bumped into this problem as well. Using the Python csv module, I was trying to read an XLS file created in MS Excel and running into the NULL byte error you were getting. I looked around and found the xlrd Python module for reading and formatting data from MS Excel spreadsheet files. With the xlrd module, I am not only able to read the file properly, but I can also access many different parts of the file in a way I couldn't before.
I thought it might help you.
Converting the encoding of the source file from UTF-16 to UTF-8 solve my problem.
How to convert a file to utf-8 in Python?
import codecs
BLOCKSIZE = 1048576 # or some other, desired size in bytes
with codecs.open(sourceFileName, "r", "utf-16") as sourceFile:
with codecs.open(targetFileName, "w", "utf-8") as targetFile:
while True:
contents = sourceFile.read(BLOCKSIZE)
if not contents:
break
targetFile.write(contents)
Why are you doing this?
reader = csv.reader(open(filepath, "rU"))
The docs are pretty clear that you must do this:
with open(filepath, "rb") as src:
reader= csv.reader( src )
The mode must be "rb" to read.
http://docs.python.org/library/csv.html#csv.reader
If csvfile is a file object, it must be opened with the ‘b’ flag on platforms where that makes a difference.
appparently it's a XLS file and not a CSV file as http://www.garykessler.net/library/file_sigs.html confirm
Instead of csv reader I use read file and split function for string:
lines = open(input_file,'rb')
for line_all in lines:
line=line_all.replace('\x00', '').split(";")
I got the same error. Saved the file in UTF-8 and it worked.
This happened to me when I created a CSV file with OpenOffice Calc. It didn't happen when I created the CSV file in my text editor, even if I later edited it with Calc.
I solved my problem by copy-pasting in my text editor the data from my Calc-created file to a new editor-created file.
I had the same problem opening a CSV produced from a webservice which inserted NULL bytes in empty headers. I did the following to clean the file:
with codecs.open ('my.csv', 'rb', 'utf-8') as myfile:
data = myfile.read()
# clean file first if dirty
if data.count( '\x00' ):
print 'Cleaning...'
with codecs.open('my.csv.tmp', 'w', 'utf-8') as of:
for line in data:
of.write(line.replace('\x00', ''))
shutil.move( 'my.csv.tmp', 'my.csv' )
with codecs.open ('my.csv', 'rb', 'utf-8') as myfile:
myreader = csv.reader(myfile, delimiter=',')
# Continue with your business logic here...
Disclaimer:
Be aware that this overwrites your original data. Make sure you have a backup copy of it. You have been warned!
I opened and saved the original csv file as a .csv file through Excel's "Save As" and the NULL byte disappeared.
I think the original encoding for the file I received was double byte unicode (it had a null character every other character) so saving it through excel fixed the encoding.
For all those 'rU' filemode haters: I just tried opening a CSV file from a Windows machine on a Mac with the 'rb' filemode and I got this error from the csv module:
Error: new-line character seen in unquoted field - do you need to
open the file in universal-newline mode?
Opening the file in 'rU' mode works fine. I love universal-newline mode -- it saves me so much hassle.
I encountered this when using scrapy and fetching a zipped csvfile without having a correct middleware to unzip the response body before handing it to the csvreader. Hence the file was not really a csv file and threw the line contains NULL byte error accordingly.
Have you tried using gzip.open?
with gzip.open('my.csv', 'rb') as data_file:
I was trying to open a file that had been compressed but had the extension '.csv' instead of 'csv.gz'. This error kept showing up until I used gzip.open
One case is that - If the CSV file contains empty rows this error may show up. Check for row is necessary before we proceed to write or read.
for row in csvreader:
if (row):
do something
I solved my issue by adding this check in the code.

How to correctly put extended ASCII characters into a CSV file?

I'm trying to write some data in an array that contains extended ASCII characters to a CSV file. Below I made an small example of the code I'm using on real file.
The array text_array represents an array containing only one row.
import csv
text_array = [["Á","Â","Æ","Ç","Ö","×","Ø","Ù","Þ","ß","á","â","ã","ä","å","æ"]]
with open("/Files/out.csv", "wb") as f:
writer = csv.writer(f)
writer.writerows(text_array)
The output I'm getting on CSV file is wrong, showing these characters.
à Â Æ Ç Ö × Ø Ù Þ ß á â ã ä å æ
I found that the code below fixes the issue in Python 3.4 but I'm working on Python 2.7.
c = csv.writer(open("Out.csv", 'w', newline='', encoding='utf-8'))
How can I fix this?
UPDATE
I receive some links as comments, but is a kind of difficult for me to understand what is needed to do to fix this issue. May someone show some example please.

Saving data from python to excel file as CSV UTF-8 file format

I have been trying to save the data as a excel file as a type of CSV UTF-8 (Comma delimited) (*.csv) which is different then the normal
CSV (Comma delimited) (*.csv) file. It display the unicode text when opened in excel. I can save as that file easily from excel but from python i am only able to save it as normal csv. Which will not cause loss of data but when opened it shows this kind of text "à¤à¤‰à¤Ÿà¤¾" instead of "एउटा" this text.
If I copied the text opening it with notepad to the excel file and then manually save the file as CSV UTF-8 then it preserves the correct display. But doing so is time consuming since all values appear in same line in notepad and i have to separate it in excel file.
So i just want to know how can i save data as CSV UTF-8 format of excel using python.
I have tried the follwing code but it results in normal csv file.
import codecs
import unicodecsv as csv
input_text = codecs.open('input.txt', encoding='utf-8')
all_text = input_text.read()
text_list = all_text.split()
output_list = [['Words','Tags']]
for input_word in text_list:
word_tag_list = [input_word,'O']
output_list.append(word_tag_list)
with codecs.open("output.csv", "wb") as f:
writer = csv.writer(f)
writer.writerows(output_list)
You need to indicate to Excel that this is a UTF-8 file. Unfortunately the only way to do this is by prepending a special byte sequence to the front of the file. Python will do this automatically if you use a special encoding.
with codecs.open("output.csv", "w", "encoding="utf_8_sig") as f:
I have found the answer. The encoding="utf_8_sig" should be given to csv.writer method to write the excel file as CSV UTF-8 file. Previous code can be witten as:
with open("output.csv", "wb") as f:
writer = csv.writer(f, dialect='excel', encoding='utf_8_sig')
writer.writerows(output_list)
However there was problem when data has , at the end Eg: "भने," For this case i didn't need the comma so i removed it with following code within the for loop.
import re
if re.search(r'.,$',input_word):
input_word = re.sub(',$','',input_word)
Finally I was able to obtain the output as desired with Unicode character correctly displayed and removing extra comma which is present at the end of data. So, if anyone know how to ignore comma at the end of data in excel file then you can comment here. Thanks.

remove <feff> from a file

I am using this Python script to convert CSV to XML. After conversion I see tags in the text (vim), which causes XML parsing error.
I am already tried answers from here, without success.
The converted XML file.
Thanks for any help!
Your input file has BOM (byte-order mark) characters, and Python doesn't strip them automatically when file is encoded in utf8. See: Reading Unicode file data with BOM chars in Python
>>> s = '\xef\xbb\xbfABC'
>>> s.decode('utf8')
u'\ufeffABC'
>>> s.decode('utf-8-sig')
u'ABC'
So for your specific case, try something like
from io import StringIO
s = StringIO(open(csvFile).read().decode('utf-8-sig'))
csvData = csv.reader(s)
Very terrible style, but that script is a hacked together script anyway for a one-shot job.
Change utf-8 to utf-8-sig
import csv
with open('example.txt', 'r', encoding='utf-8-sig') as file:
Here's an example of a script that uses a real XML-aware library to run a similar conversion. It doesn't have the exact same output, but, well, it's an example -- salt to taste.
import csv
import lxml.etree
csvFile = 'myData.csv'
xmlFile = 'myData.xml'
reader = csv.reader(open(csvFile, 'r'))
with lxml.etree.xmlfile(xmlFile) as xf:
xf.write_declaration(standalone=True)
with xf.element('root'):
for row in reader:
row_el = lxml.etree.Element('row')
for col in row:
col_el = lxml.etree.SubElement(row_el, 'col')
col_el.text = col
xf.write(row_el)
To refer to the content of, say, row 2 column 3, you'd then use XPath like /row[2]/col[3]/text().

Writing unicode data in csv

I know similar kind of question has been asked many times but seriously i have not been able to properly implement the csv writer which writes properly in csv (it shows garbage).
I am trying to use UnicodeWriter as mention in official docs .
ff = open('a.csv', 'w')
writer = UnicodeWriter(ff)
st = unicode('Displaygrößen', 'utf-8') #gives (u'Displaygr\xf6\xdfen', 'utf-8')
writer.writerow([st])
This does not give me any decoding or encoding error. But it writes the word Displaygrößen as Displaygrößen which is not good. Can any one help me what i am doing wrong here??
You are writing a file in UTF-8 format, but you don't indicate that into your csv file.
You should write the UTF-8 header at the beginning of the file. Add this:
ff = open('a.csv', 'w')
ff.write(codecs.BOM_UTF8)
And your csv file should open correctly after that with the program trying to read it.
Opening the file with codecs.open should fix it.

Categories