"Line contains NULL byte" in CSV reader (Python)

I'm trying to write a program that looks at a .CSV file (input.csv) and rewrites only the rows that begin with a certain element (corrected.csv), as listed in a text file (output.txt).
This is what my program looks like right now:
import csv
lines = []
with open('output.txt','r') as f:
    for line in f.readlines():
        lines.append(line[:-1])
with open('corrected.csv','w') as correct:
    writer = csv.writer(correct, dialect = 'excel')
    with open('input.csv', 'r') as mycsv:
        reader = csv.reader(mycsv)
        for row in reader:
            if row[0] not in lines:
                writer.writerow(row)
Unfortunately, I keep getting this error, and I have no clue what it's about.
Traceback (most recent call last):
File "C:\Python32\Sample Program\csvParser.py", line 12, in <module>
for row in reader:
_csv.Error: line contains NULL byte
Credit to all the people here for even getting me to this point.

I'm guessing you have a NUL byte in input.csv. You can test that with
if '\0' in open('input.csv').read():
    print("you have null bytes in your input file")
else:
    print("you don't")
If you do,
reader = csv.reader(x.replace('\0', '') for x in mycsv)
may get you around that. Or it may indicate that the .csv file is UTF-16 or something similarly 'interesting'.
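On Python 3 the same check can be run on the raw bytes, which avoids decoding errors when the file is not in the default encoding. This is only a minimal sketch: the helper names count_nul_bytes and looks_like_utf16 are made up for illustration, and the byte-order-mark test is a heuristic (a UTF-16 file may have no BOM, and a UTF-32 little-endian file starts with the same two bytes).

```python
def count_nul_bytes(path, chunk_size=1 << 20):
    """Count NUL bytes in a file, reading it in binary chunks."""
    total = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += chunk.count(b"\x00")
    return total


def looks_like_utf16(path):
    """Heuristic: True when the file starts with a UTF-16 byte order mark."""
    with open(path, "rb") as f:
        head = f.read(2)
    return head in (b"\xff\xfe", b"\xfe\xff")
```

If count_nul_bytes reports roughly one NUL per character and looks_like_utf16 is true, opening the file with encoding='utf-16' is probably a better fix than stripping the bytes.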

I've solved a similar problem with an easier solution:
import codecs
csvReader = csv.reader(codecs.open('file.csv', 'rU', 'utf-16'))
The key was using the codecs module to open the file with the UTF-16 encoding. Many other encodings are available; check the documentation.

If you want to replace the nulls with something you can do this:
def fix_nulls(s):
    for line in s:
        yield line.replace('\0', ' ')
r = csv.reader(fix_nulls(open(...)))

You could just inline a generator to filter out the null values if you want to pretend they don't exist. Of course this is assuming the null bytes are not really part of the encoding and really are some kind of erroneous artifact or bug.
See the (line.replace('\0','') for line in f) below. You will probably also want to open that file in mode rb.
import csv
lines = []
with open('output.txt','r') as f:
    for line in f.readlines():
        lines.append(line[:-1])
with open('corrected.csv','w') as correct:
    writer = csv.writer(correct, dialect = 'excel')
    with open('input.csv', 'rb') as mycsv:
        reader = csv.reader( (line.replace('\0','') for line in mycsv) )
        for row in reader:
            if row[0] not in lines:
                writer.writerow(row)

This will tell you what line is the problem.
import csv
lines = []
with open('output.txt','r') as f:
    for line in f.readlines():
        lines.append(line[:-1])
with open('corrected.csv','w') as correct:
    writer = csv.writer(correct, dialect = 'excel')
    with open('input.csv', 'r') as mycsv:
        reader = csv.reader(mycsv)
        try:
            for i, row in enumerate(reader):
                if row[0] not in lines:
                    writer.writerow(row)
        except csv.Error:
            print('csv choked on line %s' % (i+1))
            raise
Perhaps this from daniweb would be helpful:
I'm getting this error when reading from a csv file: "Runtime Error!
line contains NULL byte". Any idea about the root cause of this error?
...
Ok, I got it and thought I'd post the solution. Simply yet caused me
grief... Used file was saved in a .xls format instead of a .csv Didn't
catch this because the file name itself had the .csv extension while
the type was still .xls

A tricky way:
If you develop under Linux, you can use all the power of sed:
from subprocess import check_call, CalledProcessError
PATH_TO_FILE = '/home/user/some/path/to/file.csv'
try:
    check_call("sed -i -e 's|\\x0||g' {}".format(PATH_TO_FILE), shell=True)
except CalledProcessError as err:
    print(err)
The most efficient solution for huge files.
Checked for Python3, Kubuntu

def fix_nulls(s):
    for line in s:
        yield line.replace('\0', '')
with open(csv_file, 'r', encoding = "utf-8") as f:
    reader = csv.reader(fix_nulls(f))
    for line in reader:
        pass  # do something with each row
This way works for me.

I've recently fixed this issue and in my instance it was a file that was compressed that I was trying to read. Check the file format first. Then check that the contents are what the extension refers to.
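One way to check that the contents match the extension is to look at the first few bytes of the file. The sketch below is an illustration rather than a complete detector: the function name sniff_container and the labels are invented here, and only three well-known signatures are covered (the legacy Office one is the same byte sequence that shows up in the repr() dump of the mislabelled .xls file in the related question further down).

```python
# Leading "magic" bytes of a few formats that often hide behind a .csv name.
SIGNATURES = {
    b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1": "legacy Office file (for example .xls)",
    b"PK\x03\x04": "zip archive (also used by .xlsx)",
    b"\x1f\x8b": "gzip",
}


def sniff_container(path):
    """Return a label if the file starts with a known signature, else None."""
    with open(path, "rb") as f:
        head = f.read(8)
    for magic, label in SIGNATURES.items():
        if head.startswith(magic):
            return label
    return None
```

A result of None does not prove the file is valid CSV; it only rules out these particular containers.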

Turning my Linux environment into a clean, complete UTF-8 environment did the trick for me.
Try the following in your command line:
export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8
export LANGUAGE=en_US.UTF-8

This is long settled, but I ran across this answer because I was experiencing an unexpected error while reading a CSV to process as training data in Keras and TensorFlow.
In my case, the issue was much simpler, and is worth being conscious of. The data being produced into the CSV wasn't consistent, resulting in some columns being completely missing, which seems to end up throwing this error as well.
The lesson: If you're seeing this error, verify that your data looks the way that you think it does!

pandas.read_csv accepts an encoding argument, so a UTF-16 file (where the NUL bytes are part of the encoding) can be read directly:
import pandas as pd
data = pd.read_csv(file, encoding='utf-16')
See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

To skip the rows that contain NUL bytes:
import csv
with open('sample.csv', newline='') as csv_file:
    reader = csv.reader(csv_file)
    while True:
        try:
            row = next(reader)
            print(row)
        except csv.Error:
            continue
        except StopIteration:
            break

The above information is great. For me I had this same error. My fix was easy and just user error aka myself. Simply save the file as a csv and not an excel file.

It is very simple.
Don't make the CSV file through "create new Excel file" or by saving as ".csv" from Windows.
Simply import the csv module, write a dummy CSV file, and then paste your data into that.
A CSV made by the Python csv module itself will no longer give you encoding or blank line errors.

Related

Python problem reading CSV files that contain the word NUL [duplicate]

I'm working with some CSV files, with the following code:
reader = csv.reader(open(filepath, "rU"))
try:
    for row in reader:
        print 'Row read successfully!', row
except csv.Error, e:
    sys.exit('file %s, line %d: %s' % (filename, reader.line_num, e))
And one file is throwing this error:
file my.csv, line 1: line contains NULL byte
What can I do? Google seems to suggest that it may be an Excel file that's been saved as a .csv improperly. Is there any way I can get round this problem in Python?
== UPDATE ==
Following @JohnMachin's comment below, I tried adding these lines to my script:
print repr(open(filepath, 'rb').read(200)) # dump 1st 200 bytes of file
data = open(filepath, 'rb').read()
print data.find('\x00')
print data.count('\x00')
And this is the output I got:
'\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1\x00\x00\x00\x00\x00\x00\x00\x00\ .... <snip>
8
13834
So the file does indeed contain NUL bytes.

As @S.Lott says, you should be opening your files in 'rb' mode, not 'rU' mode. However that may NOT be causing your current problem. As far as I know, using 'rU' mode would mess you up if there are embedded \r in the data, but not cause any other dramas. I also note that you have several files (all opened with 'rU' ??) but only one causing a problem.
If the csv module says that you have a "NULL" (silly message, should be "NUL") byte in your file, then you need to check out what is in your file. I would suggest that you do this even if using 'rb' makes the problem go away.
repr() is (or wants to be) your debugging friend. It will show unambiguously what you've got, in a platform-independent fashion (which is helpful to helpers who are unaware what od is or does). Do this:
print repr(open('my.csv', 'rb').read(200)) # dump 1st 200 bytes of file
and carefully copy/paste (don't retype) the result into an edit of your question (not into a comment).
Also note that if the file is really dodgy e.g. no \r or \n within reasonable distance from the start of the file, the line number reported by reader.line_num will be (unhelpfully) 1. Find where the first \x00 is (if any) by doing
data = open('my.csv', 'rb').read()
print data.find('\x00')
and make sure that you dump at least that many bytes with repr or od.
What does data.count('\x00') tell you? If there are many, you may want to do something like
for i, c in enumerate(data):
    if c == '\x00':
        print i, repr(data[i-30:i]) + ' *NUL* ' + repr(data[i+1:i+31])
so that you can see the NUL bytes in context.
If you can see \x00 in the output (or \0 in your od -c output), then you definitely have NUL byte(s) in the file, and you will need to do something like this:
fi = open('my.csv', 'rb')
data = fi.read()
fi.close()
fo = open('mynew.csv', 'wb')
fo.write(data.replace('\x00', ''))
fo.close()
By the way, have you looked at the file (including the last few lines) with a text editor? Does it actually look like a reasonable CSV file like the other (no "NULL byte" exception) files?
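The snippets in this answer are Python 2. On Python 3, data read in 'rb' mode is a bytes object, so the search pattern has to be b'\x00'. A rough equivalent of the "NUL bytes in context" loop might look like this; nul_contexts is a name invented for this sketch, and it clamps the left edge of the slice so that a NUL within the first 30 bytes still shows its context.

```python
def nul_contexts(data, width=30, limit=10):
    """Yield (offset, bytes_before, bytes_after) for the first `limit` NUL bytes."""
    start = 0
    for _ in range(limit):
        i = data.find(b"\x00", start)
        if i == -1:
            return
        # max(0, ...) avoids a negative start index near the top of the file.
        yield i, data[max(0, i - width):i], data[i + 1:i + 1 + width]
        start = i + 1
```

Typical use would be reading the file with open('my.csv', 'rb').read() and printing each tuple with repr().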

data_initial = open("staff.csv", "rb")
data = csv.reader((line.replace('\0','') for line in data_initial), delimiter=",")
This works for me.

Reading it as UTF-16 was also my problem.
Here's my code that ended up working:
f=codecs.open(location,"rb","utf-16")
csvread=csv.reader(f,delimiter='\t')
csvread.next()
for row in csvread:
    print row
Where location is the path to your csv file.

You could just inline a generator to filter out the null values if you want to pretend they don't exist. Of course this is assuming the null bytes are not really part of the encoding and really are some kind of erroneous artifact or bug.
with open(filepath, "rb") as f:
    reader = csv.reader( (line.replace('\0','') for line in f) )
    try:
        for row in reader:
            print 'Row read successfully!', row
    except csv.Error, e:
        sys.exit('file %s, line %d: %s' % (filename, reader.line_num, e))

I bumped into this problem as well. Using the Python csv module, I was trying to read an XLS file created in MS Excel and running into the NULL byte error you were getting. I looked around and found the xlrd Python module for reading and formatting data from MS Excel spreadsheet files. With the xlrd module, I am not only able to read the file properly, but I can also access many different parts of the file in a way I couldn't before.
I thought it might help you.

Converting the encoding of the source file from UTF-16 to UTF-8 solved my problem.
How to convert a file to utf-8 in Python?
import codecs
BLOCKSIZE = 1048576 # or some other, desired size in bytes
with codecs.open(sourceFileName, "r", "utf-16") as sourceFile:
    with codecs.open(targetFileName, "w", "utf-8") as targetFile:
        while True:
            contents = sourceFile.read(BLOCKSIZE)
            if not contents:
                break
            targetFile.write(contents)

Why are you doing this?
reader = csv.reader(open(filepath, "rU"))
The docs are pretty clear that you must do this:
with open(filepath, "rb") as src:
    reader= csv.reader( src )
The mode must be "rb" to read.
http://docs.python.org/library/csv.html#csv.reader
If csvfile is a file object, it must be opened with the ‘b’ flag on platforms where that makes a difference.
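Note that the 'b' flag advice is for the Python 2 csv module. The Python 3 csv documentation instead asks for a text-mode file opened with newline='', so that the reader handles line endings itself, including newlines embedded in quoted fields. A minimal Python 3 sketch, where read_rows and the default encoding are assumptions made for illustration:

```python
import csv


def read_rows(path, encoding="utf-8"):
    """Read all rows from a CSV file opened the way the Python 3 docs recommend."""
    # newline="" turns off newline translation and leaves it to the csv module.
    with open(path, "r", encoding=encoding, newline="") as f:
        return list(csv.reader(f))
```

This does not remove NUL bytes; it only addresses the file-mode question discussed in this answer.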

Apparently it's an XLS file and not a CSV file, as http://www.garykessler.net/library/file_sigs.html confirms.

Instead of csv reader I use read file and split function for string:
lines = open(input_file,'rb')
for line_all in lines:
    line=line_all.replace('\x00', '').split(";")

I got the same error. Saved the file in UTF-8 and it worked.

This happened to me when I created a CSV file with OpenOffice Calc. It didn't happen when I created the CSV file in my text editor, even if I later edited it with Calc.
I solved my problem by copy-pasting in my text editor the data from my Calc-created file to a new editor-created file.

I had the same problem opening a CSV produced from a webservice which inserted NULL bytes in empty headers. I did the following to clean the file:
with codecs.open ('my.csv', 'rb', 'utf-8') as myfile:
    data = myfile.read()
    # clean file first if dirty
    if data.count( '\x00' ):
        print 'Cleaning...'
        with codecs.open('my.csv.tmp', 'w', 'utf-8') as of:
            for line in data:
                of.write(line.replace('\x00', ''))
        shutil.move( 'my.csv.tmp', 'my.csv' )
with codecs.open ('my.csv', 'rb', 'utf-8') as myfile:
    myreader = csv.reader(myfile, delimiter=',')
    # Continue with your business logic here...
Disclaimer:
Be aware that this overwrites your original data. Make sure you have a backup copy of it. You have been warned!

I opened and saved the original csv file as a .csv file through Excel's "Save As" and the NULL byte disappeared.
I think the original encoding for the file I received was double byte unicode (it had a null character every other character) so saving it through excel fixed the encoding.

For all those 'rU' filemode haters: I just tried opening a CSV file from a Windows machine on a Mac with the 'rb' filemode and I got this error from the csv module:
Error: new-line character seen in unquoted field - do you need to
open the file in universal-newline mode?
Opening the file in 'rU' mode works fine. I love universal-newline mode -- it saves me so much hassle.

I encountered this when using scrapy and fetching a zipped csvfile without having a correct middleware to unzip the response body before handing it to the csvreader. Hence the file was not really a csv file and threw the line contains NULL byte error accordingly.

Have you tried using gzip.open?
with gzip.open('my.csv', 'rb') as data_file:
I was trying to open a file that had been compressed but had the extension '.csv' instead of 'csv.gz'. This error kept showing up until I used gzip.open

Another case: this error may show up if the CSV file contains empty rows. Check that the row is non-empty before reading from it or writing it.
for row in csvreader:
    if row:
        pass  # do something with the row
I solved my issue by adding this check in the code.

Iterating through a csv file

Hi, I am trying to iterate through a csv file but I cannot get it to work somehow. I followed the python docs but I am still not able to iterate through it. I have a gzipped csv file that I work with, in this format:
2015-01-10 00:00:05;32
As you can see it's delimited with a ';'.
Here is my code to run through it (simplified):
gzip_fd = gzip.decompress(gzip_file).decode(encoding='utf8')
csv_data = csv.reader(gzip_fd, delimiter=';', lineterminator='\n')
for data in csv_data:
    print(data)
But when I want to work with data it only contains the first character (like: 2) and not the first part of the csv data that I need. Anyone here that had the same issues? I also tried csv.DictReader but with no success.

Even if your snippet was fixed to work, it would buffer all data in the memory, which might not scale well for very large files.
Gzipped data can also be iterated on-the-fly -- the following works for me on CPython 3.8:
import csv
import gzip
with gzip.open('test.csv.gz', 'r') as gzipped:
    reader = csv.reader(gzipped, delimiter=';', lineterminator='\n')
    for line in reader:
        print(line)
['2015-01-10 00:00:05', '32']
<...>
Update: As per comments below, my snippet does not work on older Python versions (reproduced on CPython 3.5).
You can use io.TextIOWrapper to achieve the same effect:
import csv
import io
import gzip
with gzip.open('test.csv.gz', 'rb') as gzipped:
    reader = csv.reader(io.TextIOWrapper(gzipped), delimiter=';',
                        lineterminator='\n')
    for line in reader:
        print(line)

So I fixed my issue. The problem was that I didn't split the string that I get (I can't use gzip.open because it isn't a file but rather a bytes string of the gzipped file).
Here is the fix to my problem:
gzip_fd = gzip.decompress(compressed_data).decode(encoding='utf-8').split('\n')
self.data = csv.reader(gzip_fd, delimiter=';', lineterminator='\n')
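The reason the original attempt returned single characters is that csv.reader accepts any iterable of strings and treats each item as one line; iterating over a plain str yields one character at a time. Besides split('\n'), wrapping the decoded text in io.StringIO gives the reader a file-like object and also copes with newlines inside quoted fields. A small sketch, with rows_from_gzip_bytes as a made-up helper name:

```python
import csv
import gzip
import io


def rows_from_gzip_bytes(compressed_data, delimiter=";", encoding="utf-8"):
    """Decompress an in-memory gzip payload and parse it as delimited text."""
    text = gzip.decompress(compressed_data).decode(encoding)
    # StringIO iterates line by line, which is what csv.reader expects.
    return list(csv.reader(io.StringIO(text, newline=""), delimiter=delimiter))
```

Like the fix above, this holds the whole decompressed text in memory, so it suits payloads of moderate size.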

CSV Silently Not Reading All Lines on Python on Windows

I'm trying to read all lines of a TSV file to a list. However, the TSV reader is terminating early and not reading the whole file. I know this because data is only 1/6 of the length of the whole file. No errors are thrown when this happens.
When I manually inspect the line it terminates on (corresponding to the length of data), I see that those lines have tons of Unicode symbols. I thought I could catch a UnicodeDecodeError, but instead of throwing an error, it quits out of reading the whole file entirely. I imagine it's hitting something that's triggering an end-of-file??
What's really throwing me for a loop: the error only occurs when I'm using Python 2.7 on Windows Server 2012. The file reads 100% perfectly on Unix implementations of Python 2.7 using both code snippets below. I'm running this inside Anaconda on both.
Here's what I've tried and neither works:
data = []
with open('data.tsv','r') as infile:
    csvreader = csv.reader((x.replace('\0', '') for x in infile),
                           delimiter='\t', quoting=csv.QUOTE_NONE)
    data = list(csvreader)
I also tried reading line by line...
with open('data.tsv','r') as infile:
    for line in infile:
        try:
            d = line.split('\t')
            q = d[0].decode('utf-8') #where the unicode symbols are located
            data.append(d)
        except UnicodeDecodeError:
            continue
Thanks in advance!

As per general suggestion from the documentation:
If csvfile is a file object, it must be opened with the ‘b’ flag on platforms where that makes a difference.
So open your file with:
with open('data.csv', 'rb') as infile:
    csvreader = csv.reader(infile, delimiter='\t', quoting=csv.QUOTE_NONE)
    data = list(csvreader)
Also, you will have to decode your strings if they have unicode data, or just use unicodecsv as a drop-in replacement so you don't have to worry about it.

CSV reader not reading entire file

I have looked at previous answers to this question, but in each of those scenarios the questioners were asking about something specific they were doing with the file, but the problem occurs for me even when I am not.
I have a .csv file of 27,204 rows. When I open the python interpreter:
python
import csv
o = open('btc_usd1hour.csv','r')
p = csv.reader(o)
for row in p:
    print(row)
I then only see roughly the last third of the document displayed to me.

Try this; it works for me:
with open(name) as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        print(row)
reference:
https://docs.python.org/3.6/library/csv.html#csv.DictReader

Try the following code
import csv
fname = 'btc_usd1hour.csv'
with open(fname, newline='') as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)
It is difficult to tell what the problem is without a sample of the file. My guess is that the problem will go away if you add newline='' when opening the file.
Use the with construct to close the file automatically. Use the f name for a file object when no further explanation is needed. Store the file name to fname to make future modifications easier (and also for easy copying the code fragment for your later programs).
olisch may be right that the console just scrolled so fast you could not see the result. You can write the result to another text file like this:
with open(fname, newline='') as fin,\
     open('output.txt', 'w') as fout:
    reader = csv.reader(fin)
    for row in reader:
        fout.write(repr(row) + '\n')
The repr function converts the row list into its string representation. For a list, print produces the same text, so you will get the same result that you otherwise observe on screen.

Maybe your scrollback buffer is just too short to see the whole list?
In general your csv.reader call should work fine, unless your 27k rows are so extremely long that you hit some 64-bit boundary, which would be quite uncommon.
The number of rows actually read, for example len(list(p)), might be interesting to see.
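Counting the rows is a quick way to tell a short scrollback buffer apart from a reader that really stops early. Calling len() directly on the open file object raises a TypeError, so the rows have to be consumed. A minimal sketch, where count_rows and its default encoding are assumptions for illustration:

```python
import csv


def count_rows(path, encoding="utf-8"):
    """Return how many rows csv.reader yields for the file at `path`."""
    with open(path, newline="", encoding=encoding) as f:
        return sum(1 for _ in csv.reader(f))
```

If count_rows('btc_usd1hour.csv') comes back as 27204, the reader is fine and only the terminal output was truncated.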

Python - CSV file empty after rewriting using csv module

I'm attempting to rewrite specific cells in a csv file using Python.
However, whenever I try to modify an aspect of the csv file, the csv file ends up being emptied (the file contents becomes blank).
Minimal code example:
import csv
ReadFile = open("./Resources/File.csv", "rt", encoding = "utf-8-sig")
Reader = csv.reader(ReadFile)
WriteFile = open("./Resources/File.csv", "wt", encoding = "utf-8-sig")
Writer = csv.writer(WriteFile)
for row in Reader:
    row[3] = 4
    Writer.writerow(row)
ReadFile.close()
WriteFile.close()
'File.csv' looks like this:
1,2,3,FOUR,5
1,2,3,FOUR,5
1,2,3,FOUR,5
1,2,3,FOUR,5
1,2,3,FOUR,5
In this example, I'm attempting to change 'FOUR' to '4'.
Upon running this code, the csv file becomes empty instead.
So far, the only other question related to this that I've managed to find is this one, which does not seem to be dealing with rewriting specific cells in a csv file but instead deals with writing new rows to a csv file.
I'd be very grateful for any help anyone reading this could provide.

The following should work:
import csv
with open("./Resources/File.csv", "rt", encoding = "utf-8-sig") as ReadFile:
    lines = list(csv.reader(ReadFile))
with open("./Resources/File.csv", "wt", encoding = "utf-8-sig") as WriteFile:
    Writer = csv.writer(WriteFile)
    for line in lines:
        line[3] = 4
        Writer.writerow(line)

When you open a writer with w option, it will delete the contents and start writing the file anew. The file is therefore, at the point when you start to read, empty.
Try writing to another file (like FileTemp.csv) and at the end of the program renaming FileTemp.csv to File.csv.
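The temp-file-then-rename approach could be sketched as follows. This is an illustration, not the only way to do it: rewrite_column is a hypothetical helper, the temporary file is created next to the original so that os.replace stays on one filesystem, and the original is left untouched if anything fails partway through.

```python
import csv
import os
import tempfile


def rewrite_column(path, column, value, encoding="utf-8-sig"):
    """Set one column to `value` in every row, replacing the file at the end."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        # open(fd, ...) takes ownership of the descriptor from mkstemp.
        with open(fd, "w", encoding=encoding, newline="") as dst, \
             open(path, "r", encoding=encoding, newline="") as src:
            writer = csv.writer(dst)
            for row in csv.reader(src):
                row[column] = value
                writer.writerow(row)
        os.replace(tmp_path, path)  # swap in the finished file
    except BaseException:
        os.unlink(tmp_path)  # clean up the partial temp file
        raise
```

For the example in the question, rewrite_column("./Resources/File.csv", 3, 4) would turn every FOUR into 4 without ever truncating the file that is being read.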
