I have a .csv file generated by a program. When I try to open it with the following code, the output makes no sense, even though the same code works fine on a .csv that was not generated by that program.
g = 'datos/1.81/IR20211103_2275.csv'
f = open(g, "r", newline = "")
f = f.readlines()
print(f)
The output of the code looks like this:
['ÿþA\x00l\x00l\x00 \x00t\x00e\x00m\x00p\x00e\x00r\x00a\x00t\x00u\x00r\x00e\x00s\x00 \x00i\x00n\x00 \x00°\x00F\x00.\x00\r',
'\x00\n',
'\x00\r',
'\x00\n',
'\x00D\x00:\x00\\\x00O\x00n\x00e\x00D\x00r\x00i\x00v\x00e\x00\\\x00M\x00A\x00E\x00S\x00T\x00R\x00I\x00A\x00 \x00I\x00M\x00E\x00C\x00\\\x00T\x00e\x00s\x00i\x00s\x00\\\x00d\x00a\x00t\x00o\x00s\x00\\\x001\x00.\x008\x001\x00\\\x00I\x00R\x002\x000\x002\x001\x001\x001\x000\x003\x00_\x002\x002\x007\x005\x00.\x00i\x00s\x002\x00\r',
However, when I first open the file with Excel and save it as a .csv (replacing the original with the .csv from Excel), the output is as expected, like this:
['All temperatures in °F.\r\n',
'\r\n',
'D:\\OneDrive\\MAESTRIA IMEC\\Tesis\\datos\\1.81\\IR20211103_2275.is2\r\n',
'\r\n',
'",1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,"\r\n',
I have also tried csv.reader(), and it doesn't work either.
Does anyone know what's going on and how I can solve it? How can I open my .csv without opening and saving it from Excel first? The source program is SmartView from Fluke, which reads a thermal image file (.is2) and converts it into a .csv file.
Thank you very much
Your file is encoded with UTF-16 (little-endian byte order). You can specify the file encoding using the encoding argument of the open() function (a list of standard encodings and their names can be found in the Python codecs documentation).
Also, I'd recommend not using .readlines(), as it will keep trailing newline chars. You can read the whole file content into a string (using .read()) and apply str.splitlines() to split it into a list of lines. Alternatively, you can consume the file line by line and call str.rstrip() to cut the trailing newline chars.
Final code:
filename = "datos/1.81/IR20211103_2275.csv"
with open(filename, encoding="utf16") as f:
    lines = f.read().splitlines()
    # OR
    lines = [line.rstrip() for line in f]
g = 'datos/1.81/IR20211103_2275.csv'
f = open(g, "r", newline="", encoding="utf-16")
f = f.readlines()
print(f)
Try this one; it may help.
I'm working with some CSV files, with the following code:
reader = csv.reader(open(filepath, "rU"))
try:
    for row in reader:
        print 'Row read successfully!', row
except csv.Error, e:
    sys.exit('file %s, line %d: %s' % (filename, reader.line_num, e))
And one file is throwing this error:
file my.csv, line 1: line contains NULL byte
What can I do? Google seems to suggest that it may be an Excel file that's been saved as a .csv improperly. Is there any way I can get round this problem in Python?
== UPDATE ==
Following #JohnMachin's comment below, I tried adding these lines to my script:
print repr(open(filepath, 'rb').read(200)) # dump 1st 200 bytes of file
data = open(filepath, 'rb').read()
print data.find('\x00')
print data.count('\x00')
And this is the output I got:
'\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1\x00\x00\x00\x00\x00\x00\x00\x00\ .... <snip>
8
13834
So the file does indeed contain NUL bytes.
As #S.Lott says, you should be opening your files in 'rb' mode, not 'rU' mode. However, that may NOT be causing your current problem. As far as I know, using 'rU' mode would mess you up if there are embedded \r characters in the data, but not cause any other dramas. I also note that you have several files (all opened with 'rU'??) but only one is causing a problem.
If the csv module says that you have a "NULL" (silly message, should be "NUL") byte in your file, then you need to check out what is in your file. I would suggest that you do this even if using 'rb' makes the problem go away.
repr() is (or wants to be) your debugging friend. It will show unambiguously what you've got, in a platform-independent fashion (which is helpful to helpers who are unaware of what od is or does). Do this:
print repr(open('my.csv', 'rb').read(200)) # dump 1st 200 bytes of file
and carefully copy/paste (don't retype) the result into an edit of your question (not into a comment).
Also note that if the file is really dodgy, e.g. no \r or \n within a reasonable distance from the start of the file, the line number reported by reader.line_num will be (unhelpfully) 1. Find where the first \x00 is (if any) by doing:
data = open('my.csv', 'rb').read()
print data.find('\x00')
and make sure that you dump at least that many bytes with repr or od.
What does data.count('\x00') tell you? If there are many, you may want to do something like
for i, c in enumerate(data):
    if c == '\x00':
        print i, repr(data[i-30:i]) + ' *NUL* ' + repr(data[i+1:i+31])
so that you can see the NUL bytes in context.
If you can see \x00 in the output (or \0 in your od -c output), then you definitely have NUL byte(s) in the file, and you will need to do something like this:
fi = open('my.csv', 'rb')
data = fi.read()
fi.close()
fo = open('mynew.csv', 'wb')
fo.write(data.replace('\x00', ''))
fo.close()
By the way, have you looked at the file (including the last few lines) with a text editor? Does it actually look like a reasonable CSV file like the other (no "NULL byte" exception) files?
data_initial = open("staff.csv", "rb")
data = csv.reader((line.replace('\0','') for line in data_initial), delimiter=",")
This works for me.
Reading it as UTF-16 was also my problem.
Here's my code that ended up working:
import codecs
import csv

f = codecs.open(location, "rb", "utf-16")
csvread = csv.reader(f, delimiter='\t')
csvread.next()  # skip the first row
for row in csvread:
    print row
Where location is the path to your csv file.
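For what it's worth, on Python 3 the same idea works with the built-in open(). A minimal sketch, assuming the same tab-delimited UTF-16 file (the filename is a placeholder):

import csv

# Python 3: text mode with an explicit encoding; newline='' is what the csv
# docs recommend for file objects handed to csv.reader
with open('my.csv', encoding='utf-16', newline='') as f:
    for row in csv.reader(f, delimiter='\t'):
        print(row)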
You could just inline a generator to filter out the null values if you want to pretend they don't exist. Of course this is assuming the null bytes are not really part of the encoding and really are some kind of erroneous artifact or bug.
with open(filepath, "rb") as f:
    reader = csv.reader((line.replace('\0', '') for line in f))
    try:
        for row in reader:
            print 'Row read successfully!', row
    except csv.Error, e:
        sys.exit('file %s, line %d: %s' % (filename, reader.line_num, e))
I bumped into this problem as well. Using the Python csv module, I was trying to read an XLS file created in MS Excel and running into the NULL byte error you were getting. I looked around and found the xlrd Python module for reading and formatting data from MS Excel spreadsheet files. With the xlrd module, I am not only able to read the file properly, but I can also access many different parts of the file in a way I couldn't before.
I thought it might help you.
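In case it helps, a minimal sketch of that approach, assuming the xlrd package is installed and the file really is a legacy .xls despite its extension (the filename is a placeholder):

import xlrd

# open the workbook and walk the rows of its first sheet
book = xlrd.open_workbook('my.csv')
sheet = book.sheet_by_index(0)
for rownum in range(sheet.nrows):
    print(sheet.row_values(rownum))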
Converting the encoding of the source file from UTF-16 to UTF-8 solved my problem.
How to convert a file to utf-8 in Python?
import codecs
BLOCKSIZE = 1048576 # or some other, desired size in bytes
with codecs.open(sourceFileName, "r", "utf-16") as sourceFile:
    with codecs.open(targetFileName, "w", "utf-8") as targetFile:
        while True:
            contents = sourceFile.read(BLOCKSIZE)
            if not contents:
                break
            targetFile.write(contents)
Why are you doing this?
reader = csv.reader(open(filepath, "rU"))
The docs are pretty clear that you must do this:
with open(filepath, "rb") as src:
    reader = csv.reader(src)
The mode must be "rb" to read.
http://docs.python.org/library/csv.html#csv.reader
If csvfile is a file object, it must be opened with the ‘b’ flag on platforms where that makes a difference.
Apparently it's an XLS file and not a CSV file, as http://www.garykessler.net/library/file_sigs.html confirms.
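If you want to verify that from Python, here is a small sketch: the eight bytes below are the standard OLE2 compound-document signature listed on that page, and they match the repr() dump earlier in this thread.

# check the file signature before trying to parse it as CSV
OLE2_MAGIC = b'\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1'
with open('my.csv', 'rb') as f:
    if f.read(8) == OLE2_MAGIC:
        print('This is an OLE2 container (e.g. a legacy .xls), not a CSV file')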
Instead of the csv reader, I read the file directly and use a string split on each line:
lines = open(input_file,'rb')
for line_all in lines:
    line = line_all.replace('\x00', '').split(";")
I got the same error. Saved the file in UTF-8 and it worked.
This happened to me when I created a CSV file with OpenOffice Calc. It didn't happen when I created the CSV file in my text editor, even if I later edited it with Calc.
I solved my problem by copy-pasting in my text editor the data from my Calc-created file to a new editor-created file.
I had the same problem opening a CSV produced by a web service which inserted NULL bytes in empty headers. I did the following to clean the file:
import codecs
import csv
import shutil

with codecs.open('my.csv', 'rb', 'utf-8') as myfile:
    data = myfile.read()

# clean file first if dirty
if data.count('\x00'):
    print 'Cleaning...'
    with codecs.open('my.csv.tmp', 'w', 'utf-8') as of:
        of.write(data.replace('\x00', ''))
    shutil.move('my.csv.tmp', 'my.csv')

with codecs.open('my.csv', 'rb', 'utf-8') as myfile:
    myreader = csv.reader(myfile, delimiter=',')
    # Continue with your business logic here...
Disclaimer:
Be aware that this overwrites your original data. Make sure you have a backup copy of it. You have been warned!
I opened and saved the original csv file as a .csv file through Excel's "Save As" and the NULL byte disappeared.
I think the original encoding for the file I received was double-byte Unicode (it had a null character every other character), so saving it through Excel fixed the encoding.
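If you'd rather detect that condition than round-trip through Excel, a small sketch that checks for a UTF-16 byte order mark (the filename is a placeholder):

# UTF-16 files usually begin with a BOM: FF FE (little endian) or FE FF (big endian)
with open('my.csv', 'rb') as f:
    bom = f.read(2)
if bom in (b'\xff\xfe', b'\xfe\xff'):
    print("Looks like UTF-16; reopen with encoding='utf-16'")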
For all those 'rU' filemode haters: I just tried opening a CSV file from a Windows machine on a Mac with the 'rb' filemode and I got this error from the csv module:
Error: new-line character seen in unquoted field - do you need to
open the file in universal-newline mode?
Opening the file in 'rU' mode works fine. I love universal-newline mode -- it saves me so much hassle.
I encountered this when using Scrapy to fetch a zipped csv file without having the correct middleware to unzip the response body before handing it to the csv reader. Hence the file was not really a csv file and threw the line contains NULL byte error accordingly.
Have you tried using gzip.open?
with gzip.open('my.csv', 'rb') as data_file:
    reader = csv.reader(data_file)
I was trying to open a file that had been compressed but had the extension '.csv' instead of 'csv.gz'. This error kept showing up until I used gzip.open
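If you're unsure whether a given file is compressed, you can check its first two bytes: gzip streams always start with 0x1f 0x8b. A hedged sketch (the filename is a placeholder):

# detect gzip by its magic bytes before deciding how to open the file
with open('my.csv', 'rb') as f:
    if f.read(2) == b'\x1f\x8b':
        print('gzip-compressed; use gzip.open() instead of open()')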
One other case: if the CSV file contains empty rows, this error may show up. A check on each row is necessary before we proceed to write or read.
for row in csvreader:
    if row:
        # do something with the non-empty row
I solved my issue by adding this check in the code.
I am incredibly new to python, so I might not have the right terminology...
I've extracted text from a PDF using pdfplumber. That's been saved as an object. The code I used for that is:
import pdfplumber

with pdfplumber.open('Bell_2014.pdf') as pdf:
    page = pdf.pages[0]
    bell = page.extract_text()
print(bell)
So "bell" is all of the text from the first page of the imported PDF.
I need to write all of that text as a string to a csv. I tried using:
with open('Bell_2014_ex.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(bell)
and
bell_ex = 'bell_2014_ex.csv'
with open(bell_ex, 'w', newline='') as csvfile:
    file_writer = csv.writer(csvfile, delimiter=',')
    file_writer.writerow(bell)
All I keep finding when I search this is how to create a csv with specific characters or numbers, but nothing about writing out the result of already-executed code. For instance, I can get the above code:
bell_ex = 'bell_2014_ex.csv'
with open(bell_ex, 'w', newline='') as csvfile:
    file_writer = csv.writer(csvfile, delimiter=',')
    file_writer.writerow(['bell'])
to create a csv that has "bell" in one cell of the csv, but that's as close as I can get.
I feel like this should be super easy, but I just can't seem to get it to work.
Any thoughts?
Please and thank you for helping my inexperienced self.
page.extract_text() is defined as: "Collates all of the page's character objects into a single string," which would make bell just a very long string.
The CSV writerow() expects by default a list of strings, with each item in the list corresponding to a single column.
Your main issue is a type mismatch: you're trying to write a single string where a list of strings is expected. You will need to further operate on your bell object to convert it into a format acceptable to be written to a CSV.
Without any knowledge of what bell contains or what you intend to write, I can't get more specific, but the documentation for Python's csv module is very comprehensive in terms of setting delimiters, dialects, column definitions, etc. Once you have converted bell into a proper iterable of lists of strings, you can then write it to a CSV.
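As a concrete illustration, here is one hedged sketch, assuming each line of the extracted text should become a single-column row (bell and the output filename come from the question):

import csv

with open('Bell_2014_ex.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    # writerows() wants an iterable of rows; wrap each line in a one-item list
    writer.writerows([line] for line in bell.splitlines())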
Maybe this is helpful: some similar code I wrote recently converts a tab-separated file to csv for insertion into an sqlite3 database.
import csv
import os

# Convert tab-delimited listfile.txt to a comma separated values (.csv) file
out_file = os.path.join('input', 'listfile.csv')

in_text = open('listfile.txt', 'r')
in_reader = csv.reader(in_text, delimiter='\t')
out_csv = open(out_file, 'w', newline='\n')
out_writer = csv.writer(out_csv, dialect=csv.excel)
for _line in in_reader:
    out_writer.writerow(_line)
in_text.close()
out_csv.close()
... and that's it, not too tough
So my problem was that I was missing encoding='utf-8' for special characters, and my delimiter needed to be a space instead of a comma. What ended up working was:
import csv
from pdfminer.high_level import extract_text

object = extract_text('filepath.pdf')
print(object)
new_csv = 'filename.csv'
with open(new_csv, 'w', newline='', encoding='utf-8') as csvfile:
    file_writer = csv.writer(csvfile, delimiter=' ')
    file_writer.writerow(object)
However, since a lot of my PDFs weren't true PDFs but scans, the csv ended up having a lot of weird symbols. This worked for about half of the PDFs I have. If you have true PDFs, this will work great. If not, I'm currently trying to figure out how to extract all the text into a pandas dataframe, separated by the headers within the PDFs, since pdfminer extracted all the text perfectly.
Thank you for everyone that helped!
I have a fairly large database (10,000+ records with about 120 vars each) in R. The problem is that about half of the variables in the original .csv file were correctly encoded in UTF-8 while the rest were encoded in ANSI (Windows-1252) but are being decoded as UTF-8, resulting in weird characters for non-ASCII characters (mainly Latin), like this: Ã© or Ã³.
I cannot simply change the file encoding because half of it would be decoded with the wrong type. Furthermore, I have no way of knowing which columns were encoded correctly and which weren't, and all I have is the original .csv file, which I'm trying to fix.
So far I have found that a plain text file can be encoded in UTF-8 and the misinterpreted characters (bad Unicode) can be inferred. One library that provides such functionality is ftfy for Python. However, I'm using the following code and so far haven't had success:
import ftfy
file = open("file.csv", "r", encoding = "UTF8")
content = file.read()
content = ftfy.fix_text(content)
However, content will show exactly the same text as before. I believe this has to do with the way ftfy is inferring the content's encoding.
Nevertheless, if I run ftfy.fix_text("Pública que cotiza en México") it will show the right response:
>> 'Pública que cotiza en México'
I'm thinking that maybe the way to solve the problem is to iterate through each of the values (cells) in the .csv file, try to fix them with ftfy, and then import the file back into R, but it seems a little complicated.
Any suggestions?
In fact, there was mixed encoding in random cells in several places. Probably, there was an issue when exporting the data from its original source.
The problem with ftfy is that it processes the file line by line, and if it encounters well-formatted characters, it assumes that the whole line is encoded in the same way and that the strange characters were intended.
Since these errors appeared randomly throughout the file, I wasn't able to transpose the whole table and process every line (column), so the answer was to process cell by cell. Fortunately, Python has a standard library that provides functionality for working painlessly with csv (especially because it escapes cells correctly).
This is the code I used to process the file:
import csv
import ftfy
import sys
def main(argv):
    # input file
    csvfile = open(argv[1], "r", encoding="UTF8")
    reader = csv.DictReader(csvfile)

    # output stream
    outfile = open(argv[2], "w", encoding="Windows-1252")  # Windows doesn't like utf8
    writer = csv.DictWriter(outfile, fieldnames=reader.fieldnames, lineterminator="\n")

    # clean values
    writer.writeheader()
    for row in reader:
        for col in row:
            row[col] = ftfy.fix_text(row[col])
        writer.writerow(row)

    # close files
    csvfile.close()
    outfile.close()

if __name__ == "__main__":
    main(sys.argv)
And then, calling:
$ python fix_encoding.py data.csv out.csv
will output a csv file with the right encoding.
A small suggestion: divide and conquer. Try using one tool (ftfy?) to align the whole file to the same encoding (and save it as a plain text file), and only then try parsing it as csv, as in the sketch below.
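Something along these lines, as a rough sketch of that two-step idea (filenames are placeholders, and this assumes the whole file decodes as UTF-8, as in the question's own code):

import ftfy

# step 1: normalize the whole file's text in one pass
with open('data.csv', 'r', encoding='utf-8') as src:
    fixed = ftfy.fix_text(src.read())
with open('data_fixed.csv', 'w', encoding='utf-8') as dst:
    dst.write(fixed)
# step 2: parse data_fixed.csv with the csv module as usual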
I need a comma separated txt file with a txt extension.
"a,b,c"
I used csv.writer to create a csv file and changed the extension. Another program would not use/process the data. I tried "wb" and "w".
F = open(Fn, 'w')
w = csv.writer(F)
w.writerow(sym)
F.close()
Opened with Notepad --- these are the complete files.
Their file (created using their GUI, with three symbols):
PDCO,ICUI,DVA
My file (created using Python):
PDCO,ICUI,DVA
Tested: opened their file - worked; opened my file - failed.
After a simple open and save in Notepad, my file worked too.
Works= 'PDCO,ICUI,DVA'
Fails= 'PDCO,ICUI,DVA\r\r\n'
Edit: writing the txt file without the csv writer:
sym = ['MHS', 'MRK', 'AIG']
with open(r'C:\filename.txt', 'w') as F:  # also try 'wb'
    for s in sym[:-1]:    # separate all but the last
        F.write(s + ',')  # symbols with commas
    F.write(sym[-1])      # end with the last symbol
To me, it looks like you don't know exactly what your third-party application's input format is. If a .csv isn't recognized, it might be something else.
Did you try changing the delimiter from ';' to ','?
import csv
spamWriter = csv.writer(open('eggs.csv', 'wb'), delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)
spamWriter.writerow(['Spam'] * 5 + ['Baked Beans'])
spamWriter.writerow(['Spam', 'Lovely Spam', 'Wonderful Spam'])
Take a look at the csv module in the Python API.
I think the problem is your file write mode, as per "CSV file written with Python has blank lines between each row".
If you create your csv file like
csv.writer(open('myfile.csv', 'w'))
csv.writer ends its lines in '\r\n', and Python's text file handling (on Windows machines) then converts '\n' to '\r\n', resulting in lines ending in '\r\r\n'. Many programs will choke on this; Notepad recognizes it as a problem and strips the extra '\r' out.
If you use
csv.writer(open('myfile.csv', 'wb'))
it produces the expected '\r\n' line ending, which should work as desired.
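A quick way to see the difference for yourself, as a sketch (Python 2, run on a Windows machine; filenames are placeholders):

import csv

# text mode: the '\n' in csv.writer's '\r\n' terminator gets expanded to '\r\n'
with open('demo_text.csv', 'w') as f:
    csv.writer(f).writerow(['PDCO', 'ICUI', 'DVA'])

# binary mode: the '\r\n' terminator is written through unchanged
with open('demo_binary.csv', 'wb') as f:
    csv.writer(f).writerow(['PDCO', 'ICUI', 'DVA'])

print repr(open('demo_text.csv', 'rb').read())    # 'PDCO,ICUI,DVA\r\r\n' on Windows
print repr(open('demo_binary.csv', 'rb').read())  # 'PDCO,ICUI,DVA\r\n'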
Edit: #senderle has a good point; try the following:
goodf = open('file_that_works.txt', 'rb')
print repr(goodf.read(100))
badf = open('file_that_fails.txt', 'rb')
print repr(badf.read(100))
paste the results of that here, so we can see how the two compare byte-for-byte.
Try this:
with open('file_that_works.csv', 'rb') as testfile:  # file is automatically closed
    d = csv.Sniffer().sniff(testfile.read(1024))     # at the end of the with block

with open(Fn, 'wb') as F:  # also try 'w'
    w = csv.writer(F, dialect=d)
    w.writerow(sym)
To explain further: this looks at a sample of a working .csv file and deduces its format. Then it uses that format to write a new .csv file that, hopefully, will not have to be resaved in notepad.
Edit: if the program you're using doesn't accept multi-line input (?!) then don't use csv. Just do something like this:
syms = ['JAGHS', 'GJKDGJ', 'GJDFAJ']
with open('filename.txt', 'wb') as F:
    for s in syms[:-1]:   # separate all but the last
        F.write(s + ',')  # symbols with commas
    F.write(syms[-1])     # end with the last symbol
Or more tersely:
with open('filename.txt', 'wb') as F:
    F.write(','.join(syms))
Also, check different file extensions (i.e. .txt, .csv, etc) to make sure that's not the problem. If this program chokes on a newline, then anything is possible.
So, I save as a text file. Now, I create my own txt file with Python.
What are the exact differences between their file and your file? Exact.
I suspect that #Hugh's comment is correct that it's an encoding issue.
When you do a Save As in Notepad, what's selected in the Encoding dropdown? If you select different encodings, do some or all of those fail to be opened by the 3rd-party program?