Python reading CSV-File universal-newline mode

I have a problem reading my CSV file.
I already have a solution for another CSV file (from a webpage) that has always worked so far.
I am using this code:
def readCSV(link):
    data = []
    ftpstream = urllib.request.urlopen(link)
    csvFile = csv.reader(ftpstream.read().decode('latin-1').split('\n'))
    newData = [line for line in csvFile]
    data.extend(newData)
    return data
But now this solution no longer works. It always fails with:
_csv.Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?
I found a couple of suggested fixes, but whenever I implement them the file is not found. The file is found when I use neither split nor "rU", but then the output is not usable.
I also tried changing the split to ';' or other separators, but always ended up with the same problem.
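One possible cause, assuming the new file uses bare \r (or mixed) line endings: splitting on '\n' leaves carriage returns in the middle of the strings handed to csv.reader, which is exactly what this error complains about. Here is a minimal Python 3 sketch that lets io.StringIO with newline="" do the line splitting instead. The names rows_from_text and read_csv and the sample text are only illustrative, not part of the original code:

```python
import csv
import io
import urllib.request


def rows_from_text(text, delimiter=","):
    """Parse CSV text whose lines may end in LF, CRLF, or a bare CR."""
    # newline="" keeps line endings untranslated but still splits lines on
    # all three conventions, which is what the csv module expects.
    return list(csv.reader(io.StringIO(text, newline=""), delimiter=delimiter))


def read_csv(link, encoding="latin-1"):
    """Download a CSV file and parse it (same inputs as readCSV above)."""
    with urllib.request.urlopen(link) as response:
        return rows_from_text(response.read().decode(encoding))


# Bare carriage returns no longer end up inside a field.
print(rows_from_text("a,b\r1,2\r"))
```

If the file is semicolon-separated, passing delimiter=";" to rows_from_text should be enough; changing the split character does not help because the split is about lines, not fields.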

Related

Python problem reading CSV files that contain the word NUL [duplicate]

I'm working with some CSV files, with the following code:
reader = csv.reader(open(filepath, "rU"))
try:
    for row in reader:
        print 'Row read successfully!', row
except csv.Error, e:
    sys.exit('file %s, line %d: %s' % (filename, reader.line_num, e))
And one file is throwing this error:
file my.csv, line 1: line contains NULL byte
What can I do? Google seems to suggest that it may be an Excel file that's been saved as a .csv improperly. Is there any way I can get round this problem in Python?
== UPDATE ==
Following @JohnMachin's comment below, I tried adding these lines to my script:
print repr(open(filepath, 'rb').read(200)) # dump 1st 200 bytes of file
data = open(filepath, 'rb').read()
print data.find('\x00')
print data.count('\x00')
And this is the output I got:
'\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1\x00\x00\x00\x00\x00\x00\x00\x00\ .... <snip>
8
13834
So the file does indeed contain NUL bytes.

As @S.Lott says, you should be opening your files in 'rb' mode, not 'rU' mode. However that may NOT be causing your current problem. As far as I know, using 'rU' mode would mess you up if there are embedded \r in the data, but not cause any other dramas. I also note that you have several files (all opened with 'rU' ??) but only one causing a problem.
If the csv module says that you have a "NULL" (silly message, should be "NUL") byte in your file, then you need to check out what is in your file. I would suggest that you do this even if using 'rb' makes the problem go away.
repr() is (or wants to be) your debugging friend. It will show unambiguously what you've got, in a platform-independent fashion (which is helpful to helpers who are unaware what od is or does). Do this:
print repr(open('my.csv', 'rb').read(200)) # dump 1st 200 bytes of file
and carefully copy/paste (don't retype) the result into an edit of your question (not into a comment).
Also note that if the file is really dodgy e.g. no \r or \n within reasonable distance from the start of the file, the line number reported by reader.line_num will be (unhelpfully) 1. Find where the first \x00 is (if any) by doing
data = open('my.csv', 'rb').read()
print data.find('\x00')
and make sure that you dump at least that many bytes with repr or od.
What does data.count('\x00') tell you? If there are many, you may want to do something like
for i, c in enumerate(data):
    if c == '\x00':
        print i, repr(data[i-30:i]) + ' *NUL* ' + repr(data[i+1:i+31])
so that you can see the NUL bytes in context.
If you can see \x00 in the output (or \0 in your od -c output), then you definitely have NUL byte(s) in the file, and you will need to do something like this:
fi = open('my.csv', 'rb')
data = fi.read()
fi.close()
fo = open('mynew.csv', 'wb')
fo.write(data.replace('\x00', ''))
fo.close()
By the way, have you looked at the file (including the last few lines) with a text editor? Does it actually look like a reasonable CSV file like the other (no "NULL byte" exception) files?
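The snippets in this answer are Python 2. For anyone on Python 3, here is a rough equivalent of the same diagnosis, working on bytes throughout. The helper names nul_report and strip_nuls and the sample data are made up for illustration:

```python
def nul_report(data, context=30, limit=5):
    """Return (first_index, count, snippets) for NUL bytes in a bytes object."""
    first = data.find(b"\x00")
    count = data.count(b"\x00")
    snippets = []
    position = first
    while position != -1 and len(snippets) < limit:
        # max(0, ...) avoids a negative slice start near the top of the file
        before = data[max(0, position - context):position]
        after = data[position + 1:position + 1 + context]
        snippets.append((position, before, after))
        position = data.find(b"\x00", position + 1)
    return first, count, snippets


def strip_nuls(data):
    """Remove every NUL byte; only sensible if they are stray artifacts."""
    return data.replace(b"\x00", b"")


sample = b"id,name\r\n1,Al\x00ice\r\n"
print(nul_report(sample))
print(strip_nuls(sample))
```

For a real file you would pass open('my.csv', 'rb').read() as data. If the count is huge and the first NUL is within the first few bytes, stripping is probably the wrong fix and the file is likely not plain CSV at all.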

data_initial = open("staff.csv", "rb")
data = csv.reader((line.replace('\0','') for line in data_initial), delimiter=",")
This works for me.

Reading it as UTF-16 was also my problem.
Here's my code that ended up working:
f=codecs.open(location,"rb","utf-16")
csvread=csv.reader(f,delimiter='\t')
csvread.next()  # skip the header row
for row in csvread:
    print row
Where location is the path to your csv file.
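On Python 3 the codecs module is not needed for this, since the built-in open takes an encoding. A minimal sketch, assuming a tab-delimited UTF-16 file with a header row as in the code above (read_utf16_tsv is just an illustrative name):

```python
import csv


def read_utf16_tsv(path):
    """Read a tab-delimited UTF-16 file and return (header, rows)."""
    # encoding="utf-16" consumes the byte order mark; newline="" is what the
    # csv documentation recommends for files handed to csv.reader.
    with open(path, encoding="utf-16", newline="") as f:
        reader = csv.reader(f, delimiter="\t")
        header = next(reader, None)  # Python 3 spelling of reader.next()
        return header, list(reader)
```

Decoding the file properly removes the NUL bytes as a side effect, because they were half of each UTF-16 character rather than garbage.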

You could just inline a generator to filter out the null values if you want to pretend they don't exist. Of course this is assuming the null bytes are not really part of the encoding and really are some kind of erroneous artifact or bug.
with open(filepath, "rb") as f:
    reader = csv.reader( (line.replace('\0','') for line in f) )
    try:
        for row in reader:
            print 'Row read successfully!', row
    except csv.Error, e:
        sys.exit('file %s, line %d: %s' % (filename, reader.line_num, e))

I bumped into this problem as well. Using the Python csv module, I was trying to read an XLS file created in MS Excel and running into the NULL byte error you were getting. I looked around and found the xlrd Python module for reading and formatting data from MS Excel spreadsheet files. With the xlrd module, I am not only able to read the file properly, but I can also access many different parts of the file in a way I couldn't before.
I thought it might help you.

Converting the encoding of the source file from UTF-16 to UTF-8 solved my problem.
How to convert a file to utf-8 in Python?
import codecs
BLOCKSIZE = 1048576 # or some other, desired size in bytes
with codecs.open(sourceFileName, "r", "utf-16") as sourceFile:
    with codecs.open(targetFileName, "w", "utf-8") as targetFile:
        while True:
            contents = sourceFile.read(BLOCKSIZE)
            if not contents:
                break
            targetFile.write(contents)

Why are you doing this?
reader = csv.reader(open(filepath, "rU"))
The docs are pretty clear that you must do this:
with open(filepath, "rb") as src:
    reader= csv.reader( src )
The mode must be "rb" to read.
http://docs.python.org/library/csv.html#csv.reader
If csvfile is a file object, it must be opened with the ‘b’ flag on platforms where that makes a difference.
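Note that the quote above is from the Python 2 documentation. In Python 3 the csv documentation asks for a text-mode file opened with newline="" instead of 'rb'. A short sketch of that form (read_rows and the utf-8 default are assumptions for illustration):

```python
import csv


def read_rows(path, encoding="utf-8"):
    """Read every row of a CSV file the way the Python 3 csv docs suggest."""
    # newline="" hands line endings to the csv module untranslated, so
    # newlines inside quoted fields survive intact.
    with open(path, newline="", encoding=encoding) as src:
        return list(csv.reader(src))
```

This plays the role that 'rb' played in Python 2: the csv module, not the file object, decides what counts as the end of a row.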

Apparently it's an XLS file and not a CSV file, as the file signature table at http://www.garykessler.net/library/file_sigs.html confirms.
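The first eight bytes in the question's dump (\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1) are the OLE2 compound file signature used by legacy Excel files. A small sketch of checking a few such signatures before handing a file to csv; the table is deliberately short and not exhaustive, and sniff_signature is an illustrative name:

```python
# A handful of well-known leading byte sequences; far from a complete list.
SIGNATURES = [
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "OLE2 compound file (legacy .xls/.doc)"),
    (b"PK\x03\x04", "ZIP container (.xlsx, .zip)"),
    (b"\x1f\x8b", "gzip"),
    (b"\xef\xbb\xbf", "text with UTF-8 BOM"),
    (b"\xff\xfe", "text with UTF-16 little-endian BOM"),
    (b"\xfe\xff", "text with UTF-16 big-endian BOM"),
]


def sniff_signature(head):
    """Return a label for the first matching signature, or None."""
    for magic, label in SIGNATURES:
        if head.startswith(magic):
            return label
    return None


print(sniff_signature(b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1\x00\x00\x00\x00"))
print(sniff_signature(b"id,name\r\n"))
```

For a real file, head would be something like open(filepath, 'rb').read(8). A result of None only means none of these signatures matched, not that the file is valid CSV.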

Instead of csv.reader, I read the file line by line and use the string split function:
lines = open(input_file,'rb')
for line_all in lines:
    line=line_all.replace('\x00', '').split(";")

I got the same error. Saved the file in UTF-8 and it worked.

This happened to me when I created a CSV file with OpenOffice Calc. It didn't happen when I created the CSV file in my text editor, even if I later edited it with Calc.
I solved my problem by copy-pasting in my text editor the data from my Calc-created file to a new editor-created file.

I had the same problem opening a CSV produced from a webservice which inserted NULL bytes in empty headers. I did the following to clean the file:
with codecs.open('my.csv', 'rb', 'utf-8') as myfile:
    data = myfile.read()
    # clean file first if dirty
    if data.count('\x00'):
        print 'Cleaning...'
        with codecs.open('my.csv.tmp', 'w', 'utf-8') as of:
            for line in data:
                of.write(line.replace('\x00', ''))
        shutil.move('my.csv.tmp', 'my.csv')
with codecs.open('my.csv', 'rb', 'utf-8') as myfile:
    myreader = csv.reader(myfile, delimiter=',')
    # Continue with your business logic here...
Disclaimer:
Be aware that this overwrites your original data. Make sure you have a backup copy of it. You have been warned!

I opened and saved the original csv file as a .csv file through Excel's "Save As" and the NULL byte disappeared.
I think the original encoding for the file I received was double byte unicode (it had a null character every other character) so saving it through excel fixed the encoding.

For all those 'rU' filemode haters: I just tried opening a CSV file from a Windows machine on a Mac with the 'rb' filemode and I got this error from the csv module:
Error: new-line character seen in unquoted field - do you need to
open the file in universal-newline mode?
Opening the file in 'rU' mode works fine. I love universal-newline mode -- it saves me so much hassle.

I encountered this when using scrapy and fetching a zipped csvfile without having a correct middleware to unzip the response body before handing it to the csvreader. Hence the file was not really a csv file and threw the line contains NULL byte error accordingly.

Have you tried using gzip.open?
with gzip.open('my.csv', 'rb') as data_file:
I was trying to open a file that had been compressed but had the extension '.csv' instead of '.csv.gz'. This error kept showing up until I used gzip.open.
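If you cannot trust the extension, one option is to look at the first two bytes, since gzip data starts with \x1f\x8b. A Python 3 sketch under that assumption; open_maybe_gzip and read_rows_maybe_gzip are hypothetical helper names and utf-8 is an assumed encoding:

```python
import csv
import gzip


def open_maybe_gzip(path, encoding="utf-8"):
    """Open path as text, transparently decompressing gzip data."""
    with open(path, "rb") as probe:
        is_gzip = probe.read(2) == b"\x1f\x8b"  # gzip magic number
    if is_gzip:
        return gzip.open(path, "rt", encoding=encoding, newline="")
    return open(path, "r", encoding=encoding, newline="")


def read_rows_maybe_gzip(path):
    """Return all CSV rows from a plain or gzip-compressed file."""
    with open_maybe_gzip(path) as f:
        return list(csv.reader(f))
```

The same idea would cover the scrapy case mentioned above: check what the bytes are before assuming they are CSV.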

One possible cause: if the CSV file contains empty rows, this error may show up. Check that each row is non-empty before reading from it or writing it out.
for row in csvreader:
    if row:
        pass  # do something with row
I solved my issue by adding this check in the code.

Python not printing out special characters (extracted from an html file) when I write to another html file

I am extracting data from an html file and outputting it to another html file template using .replace. I wrote it so that on double clicking my script, the page opens up in a browser, ready to be printed.
Everything works fine until I ran into an extracted string that had a special character in it. On double click, nothing would happen (the web browser would not open). However, it seems to work when I run it straight from IDLE, with one issue: The special character comes up as a weird combination of characters.
I haven't tested this out with other special characters, but my problem right now is happening with Nyström, which comes up as NystrÃ¶m in my outputted file.
I figure this has something to do with encoding/decoding in 'utf-8', but even after some research I do not know enough about the subject to solve this issue myself.
When I open the read and write files, I make sure they have encoding='utf-8' as the third argument.
Finally, when I print the string I'm having trouble with in IDLE, it comes out fine. The issue just seems to pop up when I write it to my file.
Below are my file read and write calls if that helps
path = os.path.dirname(os.path.realpath(__file__))
htmlFile = open(path + input_filename, "r", encoding="utf-8")
htmlString = htmlFile.read()
infile = open(template_path, 'r', encoding='utf-8')
contents = infile.read()
After this I .replace certain parts of content with my extracted strings put into a dictionary named data.
eg:
(please ignore inconsistent naming conventions)
data = dict()
data['name_email'] = email
contents = contents.replace('_name_email', data['name_email'])
then:
outfile = open(output_filename, 'w', encoding='utf-8')
outfile.write(contents)
I am running this on python 3.6
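For what it's worth, that garbled name is what you get when the UTF-8 bytes of "ö" are decoded with a single-byte Windows/Latin charset, so the file itself may well be written correctly and only read back with the wrong charset. The sketch below reproduces the symptom. The ensure_meta_charset helper is purely hypothetical and assumes the viewer is a browser that was not told the page is UTF-8:

```python
name = "Nystr\u00f6m"  # the correctly spelled name

utf8_bytes = name.encode("utf-8")        # what gets written to the file
misread = utf8_bytes.decode("cp1252")    # what a wrong-charset reader shows
print(ascii(misread))                    # ascii() keeps the output console-safe

# Decoding the same bytes as UTF-8 gives the original string back.
assert utf8_bytes.decode("utf-8") == name

META = '<meta charset="utf-8">'


def ensure_meta_charset(html):
    """Naively add a charset declaration after <head> if none is present."""
    if "charset" in html.lower():
        return html
    return html.replace("<head>", "<head>" + META, 1)
```

If the template already declares a charset, the next thing to check would be whether every open() call in the script, including any you did not show, passes encoding='utf-8'.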

additional list added after each row python

There is a discrepancy in execution of code in repl.it (which works fine, presumably because the bugs in Python have been fixed/updated), and IDLE, in which the code does not work correctly.
I have consulted the documentation, and previous stack overflow answers to add the "newline", but the problem persists.
You'll notice the repl it, here: (works perfectly)
https://repl.it/Jbv6/0
However, in IDLE on pasting the file contents (without a line break) it works fine
001,Joe,Bloggs,Test1:99,Test2:100,Test3:1002,Ash,Smith,Test1:20,Test2:20,Test3:100003003,Jonathan,Peter,Test1:99,Test2:33,Test3:44
but on pasting the file contents into the txt file as it should be (with each record on a new line) as so:
001,Joe,Bloggs,Test1:99,Test2:100,Test3:1
002,Ash,Smith,Test1:20,Test2:20,Test3:100003
003,Jonathan,Peter,Test1:99,Test2:33,Test3:44
the error on output is as follows (produces a new list after each line):
[['001', 'Joe', 'Bloggs', 'Test1:99', 'Test2:100', 'Test3:1'], [], ['002', 'Ash', 'Smith', 'Test1:20', 'Test2:20', 'Test3:100'], ['003'], ['', 'Jonathan', 'Peter', 'Test1:99', 'Test2:33', 'Test3:44']]
The code is here:
import csv
#==========1. Open the File, Read it into a list, and Print Contents
print("1==============Open File, Read into List, Print Contents")
#open the file, read it into a list (each line is a list within a list, and the end of line spaces are stripped as well as the individual elements split at the comma)
with open("studentinfo.txt","rb",newline="") as f:
    studentlist=list(csv.reader(f))
    print(studentlist)
As the documentation and previous answers on Stack Overflow suggest, I have tried adding the newline argument:
with open("studentinfo.txt","r",newline="") as f:
Unfortunately the error persists.
Any suggestions/solutions with an explanation would be appreciated.
Update, I also tried this:
with open("studentinfo.txt",newline="") as f:
    reader=csv.reader(f)
    for row in reader:
        print(row)
again, it works perfectly in replit
https://repl.it/Jbv6/2
but this error in IDLE
1==============Open File, Read into List, Print Contents
['001', 'Joe', 'Bloggs', 'Test1:99', 'Test2:100', 'Test3:1']
[]
['002', 'Ash', 'Smith', 'Test1:20', 'Test2:20', 'Test3:100']
['003']
['', 'Jonathan', 'Peter', 'Test1:99', 'Test2:33', 'Test3:44']
>>>
This is a huge issue for students who need to be able to have consistency across both repl.it and IDLE which is what they are working on between their school and home environments.
Any answer that shows code that allows it to work on both is what I'm after.

The answer that is easiest is the following:
import csv
# ==========1. Open the File, Read it into a list, and Print Contents
print("1==============Open File, Read into List, Print Contents")
# open the file, read it into a list (each line is a list within a list,
# and the end of line spaces are stripped as well as the individual
# elements split at the comma)
studentlist = []
with open("studentinfo.txt", "r", newline="") as f:
    for row in csv.reader(f):
        if len(row) > 0:
            studentlist.append(row)
print(studentlist)
But your original code should work - I've run it, but on linux rather than windows. If I could ask you to do more work:
with open("studentinfo.txt", "r", newline="") as f:
    ascii_ch = list(map(ord,f.read()))
    eol_delims = list(map(str,(ch if ch < 32 else '' for ch in ascii_ch)))
    print(",".join(eol_delims))
This will produce a list of ,s but interspersed with either 13,10 or 10, but possibly even something like 10,13,10. These are the \r\n and \n that were talked about, but I'm wondering if you've managed to get that third option somehow?
If so, I think you'll need to rewrite that text file to get normal line endings.
-- (update in response to comment)
The only advice I have regarding the 10,13,10 is to only edit the text file in one application (say, notepad), and never edit it in another.
The actual problem comes from editing the file in two applications, each of which has a different interpretation of what the line endings should be (Windows applications use \r\n, "repl.it" uses \n). I've come across it before, but never worked out the sequence of actions required to reproduce it.
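If rewriting the file by hand is not practical, the line endings can be counted and normalized in code. This is only a sketch, assuming the file is small enough to read into memory as bytes; line_ending_counts and normalize_line_endings are made-up names:

```python
def line_ending_counts(data):
    """Count each line-ending style in a bytes object."""
    crlf = data.count(b"\r\n")
    return {
        "CRLF": crlf,
        "LF": data.count(b"\n") - crlf,   # bare \n only
        "CR": data.count(b"\r") - crlf,   # bare \r only
    }


def normalize_line_endings(data):
    """Rewrite CRLF and bare CR as LF."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


mixed = b"001,Joe\r\n002,Ash\n003,Jon\r"
print(line_ending_counts(mixed))
print(normalize_line_endings(mixed))
```

A 10,13,10 sequence becomes two \n in a row after normalizing, which csv.reader reports as an empty row, so the len(row) > 0 filter from the first snippet is still worth keeping.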

Try using codecs and explicitly specifying the file's encoding as UTF-8.
import csv
import codecs
print("1==============Open File, Read into List, Print Contents")
with codecs.open("studentinfo.txt",encoding='utf-8') as f:
    studentlist=list(csv.reader(f))
    print(studentlist)

Using a filter may help:
with open('studentinfo.txt', 'rU') as f:
    filtered = (line.replace('\r', '') for line in f)
    for row in csv.reader(filtered):
        print(row)

Pasting strings into a text editor and saving the file will not produce byte-identical files on different platforms. (Even different editors on the same platform are inconsistent!)
However, the CSV format accepted by the csv module is specified in terms of a byte-exact representation. The behavior can be customized by using a dialect (either a built-in dialect or implementing a new one) -- see the Python documentation for details. The default dialect is excel which requires Windows-style line endings (CR/LF). If you save the file in a different format it will not be parsed correctly.

Replace string in specific line of nonstandard text file

Similar to the post "Replace string in a specific line using python"; however, the answers there did not work in my slightly different case.
I am working with Python 3 on Windows 7. I am attempting to batch edit some files in a directory. They are basically text files with a .LIC extension. I'm not sure if that is relevant to my issue here. I am able to read the file into Python without issue.
My aim is to replace a specific string on a specific line in this file.
import os
import re
groupname = 'Oldtext'
aliasname = 'Newtext'
with open('filename') as f:
    data = f.readlines()
    data[1] = re.sub(groupname,aliasname, data[1])
    f.writelines(data[1])
    print(data[1])
print('done')
When running the above code I get an UnsupportedOperation: not writable. I am having some issue writing the changes back to the file. Based on suggestions in other posts, I added the "w" option, i.e. open('filename', "w"). This causes all text in the file to be deleted.
Based on suggestion, the r+ option was tried. This leads to successful editing of the file, however, instead of editing the correct line, the edited line is appended to the end of the file, leaving the original intact.

Writing a changed line into the middle of a text file is not going to work unless it's exactly the same length as the original - which is the case in your example, but you've got some obvious placeholder text there so I have no idea if the same is true of your actual application code. Here's an approach that doesn't make any such assumption:
with open('filename', 'r') as f:
    data = f.readlines()
data[1] = re.sub(groupname,aliasname, data[1])
with open('filename', 'w') as f:
    f.writelines(data)
EDIT: If you really wanted to write only the single line back into the file, you'd need to use f.tell() BEFORE reading the line, to remember its position within the file, and then f.seek() to go back to that position before writing.
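A rough sketch of that tell()/seek() variant, for the equal-length case only. It works on bytes to keep the offsets simple; replace_in_line_in_place is an illustrative name, and old and new are assumed to be byte strings of the same length:

```python
def replace_in_line_in_place(path, line_index, old, new):
    """Overwrite old with new on one line; both must have equal byte length."""
    if len(old) != len(new):
        raise ValueError("in-place replacement needs equal-length byte strings")
    with open(path, "r+b") as f:
        for _ in range(line_index):   # skip the lines before the target
            f.readline()
        position = f.tell()           # remember where the target line starts
        line = f.readline()
        f.seek(position)              # go back to that position before writing
        f.write(line.replace(old, new))
```

With the question's values this would be called as replace_in_line_in_place('filename', 1, b'Oldtext', b'Newtext'). For anything where the lengths can differ, the read-everything-then-rewrite version above is the safer choice.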

Why am I getting "_csv.Error: newline inside string"?

There is one answer to this question:
Getting "newline inside string" while reading the csv file in Python?
But this didn't work when I used the accepted answer.

If the answer in the above link doesn't work and you have opened multiple files during the execution of your code, go back and make sure you have closed all your previous files when you were done with them.
I had a script that opened and processed multiple files. Then at the very end, it kept throwing a _csv.Error in the same manner that Amit Pal saw.
My code runs about 500 lines and has three stages where it processes multiple files in succession. Here's the section of code that gave the error. As you can see, the code is plain vanilla:
f = open('file.csv')
fread = csv.reader(f)
for row in fread:
    pass  # do something with row
And the error was:
for row in fread:
_csv.Error: newline inside string
So I told the script to print what the row....OK, that's not clear, here's what I did:
print row
f = open('file.csv')
fread = csv.reader(f)
for row in fread:
    pass  # do something with row
Interestingly, what printed was the LAST LINE from one of the previous files I had opened and processed.
What made this really weird was that I used different variable names, but apparently the data was stuck in a buffer or memory somewhere.
So I went back and made sure I closed all previously opened files and that solved my problem.
Hope this helps someone.
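One way to make sure nothing stays open between stages is to use a with block per file, so each one is closed before the next is opened. A minimal Python 3 sketch of that habit; process_files is a hypothetical name and counting rows stands in for the real processing:

```python
import csv


def process_files(paths):
    """Return {path: row_count}, closing each file before opening the next."""
    counts = {}
    for path in paths:
        with open(path, newline="") as f:   # closed when the block exits
            counts[path] = sum(1 for _ in csv.reader(f))
    return counts
```

This does not explain the error by itself, but it rules out leftover open files as a factor.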
