Iterating through a csv file - python

Hi I am trying to iterate through a csv file but I cannot get it to work somehow. I followed the python docs but I am still not able to iterate through it. I have a gzipped csv file that I work with with this format:
2015-01-10 00:00:05;32
As you can see it's delimited with a ';'.
Here is my code to run though it (simplified)
gzip_fd = gzip.decompress(gzip_file).decode(encoding='utf8')
csv_data = csv.reader(gzip_fd, delimiter=';', lineterminator='\n')
for data in csv_data:
print(data)
But when I want to work with data it only contains the first character (like: 2) and not the first part of the csv data that I need. Anyone here that had the same issues? I also tried csv.DictReader but with no success.

Even if your snippet was fixed to work, it would buffer all data in the memory, which might not scale well for very large files.
Gzipped data can also be iterated on-the-fly -- the following works for me on CPython 3.8:
import csv
import gzip
with gzip.open('test.csv.gz', 'r') as gzipped:
reader = csv.reader(gzipped, delimiter=';', lineterminator='\n')
for line in reader:
print(line)
['2015-01-10 00:00:05', '32']
<...>
Update: As per comments below, my snippet does not work on older Python versions (reproduced on CPython 3.5).
You can use io.TextIOWrapper to achieve the same effect:
import csv
import io
import gzip
with gzip.open('test.csv.gz', 'rb') as gzipped:
reader = csv.reader(io.TextIOWrapper(gzipped), delimiter=';',
lineterminator='\n')
for line in reader:
print(line)

So I fixed my issue, the issue was that I didn't split the string that I get (can't do gzip.open because it isn't a file but rather a bytes string of the gzipped file
Here is the fix to my problem:
gzip_fd = gzip.decompress(compressed_data).decode(encoding='utf-8').split('\n')
self.data = csv.reader(gzip_fd, delimiter=';', lineterminator='\n')

Related

I need to edit a python script to remove quotes from a csv, then write back to that same csv file, quotes removed

I have seen similar posts to this but they all seem to be print statements (viewing the cleaned data) rather than overwriting the original csv with the cleaned data so I am stuck. When I tried to write back to the csv myself, it just deleted everything in the file. Here is the format of the csv:
30;"unemployed";"married";"primary";"no";1787;"no";"no";"cellular";19;"oct";79;1;-1;0;"unknown";"no"
33;"services";"married";"secondary";"no";4747;"yes";"cellular";11;"may";110;1;339;2;"failure";"no"
35;"management";"single";"tertiary";"no";1470;"yes";"no";"cellular";12;"apr"185;1;330;1;"failure";"no"
It is delimited by semicolons, which is fine, but all text is wrapped in quotes and I only want to remove the quotes and write back to the file. Here is the code I reverted back to that successfully reads the file, removes all quotes, and then prints the results:
import csv
f = open("bank.csv", 'r')
try:
for row in csv.reader(f, delimiter=';', skipinitialspace=True):
print(' '.join(row))
finally:
f.close()
Any help on properly writing back to the csv would be appreciated, thanks!
See here: Python CSV: Remove quotes from value
I've done this basically two different ways, depending on the size of the csv.
You can read the entire csv into a python object (list), do some things and then
overwrite the other existing file with the cleaned version
As in the link above, you can use one reader and one writer, Create a new file, and write line by-line as you clean the input from the csv reader, delete the original csv and rename the new one to replace the old file.
In my opinion option #2 is vastly preferable as it avoids the possibility of data loss if your script has an error part way through writing. It also will have lower memory usage.
Finally: It may be possible to open a file as read/write, and iterate line-by-line overwriting as you go: But that will leave you open to half of your file having quotes, and half not if your script crashes part way through.
You could do something like this. Read it in, and write using quoting=csv.QUOTE_NONE
import csv
f = open("bank.csv", 'r')
inputCSV = []
try:
for row in csv.reader(f, delimiter=';', skipinitialspace=True):
inputCSV.append(row)
finally:
f.close()
with open('bank.csv', 'w', newline='') as csvfile:
csvwriter = csv.writer(csvfile, delimiter=';')
for row in inputCSV:
csvwriter.writerow(row)

CSV Silently Not Reading All Lines on Python on Windows

I'm trying to read all lines of a TSV file to a list. However, the TSV reader is terminating early and not reading the whole file. I know this because data is only 1/6 of the length of the whole file. No errors are thrown when this happens.
When I manually inspect the line it terminates on (corresponding to the length of data, those lines have tons of Unicode symbols. I thought I could catch a UnicodeDecodeError, but instead of throwing an error, it quits out of reading the whole file entirely. I imagine it's hitting something that's triggering an end-of-file??
What's really throwing me for a loop: the error only occurs when I'm using Python 2.7 on Windows Server 2012. The file reads 100% perfectly on Unix implementations of Python 2.7 using both code snippets below. I'm running this inside Anaconda on both.
Here's what I've tried and neither works:
data = []
with open('data.tsv','r') as infile:
csvreader = csv.reader((x.replace('\0', '') for x in infile),
delimiter='\t', quoting=csv.QUOTE_NONE)
data = list(csvreader)
I also tried reading line by line...
with open('data.tsv','r') as infile:
for line in infile:
try:
d = line.split('\t')
q = d[0].decode('utf-8') #where the unicode symbols are located
data.append(d)
except UnicodeDecodeError:
continue
Thanks in advance!
As per general suggestion from the documentation:
If csvfile is a file object, it must be opened with the ‘b’ flag on platforms where that makes a difference.
So open your file with:
with open('data.csv', 'rb') as infile:
csvreader = csv.reader(infile, delimiter='\t', quoting=csv.QUOTE_NONE)
data = list(csvreader)
Also, you will have to decode your strings if they have unicode data, or just use unicodecsv as a drop-in replacement so you don't have to worry about it.

CSV reader not reading entire file

I have looked at previous answers to this question, but in each of those scenarios the questioners were asking about something specific they were doing with the file, but the problem occurs for me even when I am not.
I have a .csv file of 27,204 rows. When I open the python interpreter:
python
import csv
o = open('btc_usd1hour.csv','r')
p = csv.reader(o)
for row in p:
print(row)
I then only see roughly the last third of the document displayed to me.
Try so, at me works:
with open(name) as csvfile:
reader = csv.DictReader(csvfile)
for row in reader:
print(row)
reference:
https://docs.python.org/3.6/library/csv.html#csv.DictReader
Try the following code
import csv
fname = 'btc_usd1hour.csv'
with open(fname, newline='') as f:
reader = csv.reader(f)
for row in reader:
print(row)
It is difficult to tell what is the problem without having the sample. I guess the problem would be removed if you add that newline='' for opening the file.
Use the with construct to close the file automatically. Use the f name for a file object when no further explanation is needed. Store the file name to fname to make future modifications easier (and also for easy copying the code fragment for your later programs).
olisch may be right that the console just scrolled so fast you could not see the result. You can write the result to another text file like this:
with open(fname, newline='') as fin,\
open('output.txt', 'w') as fout:
reader = csv.reader(fin)
for row in reader:
fout.write(repr(row) + '\n')
The repr function converts the row list into its string representation. The print calls that function internally, so you will have the same result that you otherwise observe on screen.
maybe your scrollback buffer is just to short to see the whole list?
In general your csv.reader call should be working fine, except your 27k rows aren't extremly long so that you might be able to hit any 64bit boundaries, which would be quite uncommon.
len(o) might be interesting to see.

Python export to a TXT formatted or to Clipboard

I have a csv file that is something like BM13302, EM13203,etc
I have to read this from a file then reformat it to something like 'BM13302', 'EM13203',etc
What I'm having problems with is how do I export (write it either the clipboard or a file, I can cut and paste from. This is a tiny little project for reformatting some for part of some SQL code that's given to me in a unclean format and i have to spend a little while formatting it out. I would like to just point python to a directory and past the list in the file and have it export everything that way I need it.
I have the following code working
import os
f = open(r"/User/person/Desktop/folder/file.csv")
csv_f = csv.reader(f)
for row in csv_f:
print(row)
I get the expected results
I would like find out how to take the list(?) and format it like this
'BM1234', 'BM2351', '20394',....etc
and copy that to the clipboard
I thought something doing something like
with open('/Users/person/Desktop/csv/export.txt') as f:
f.write("open=", + "', '")
f.close()
nothing is printed. Can't find an example of what I'm needing. Anyone able to help me out??
Much Appreciate!
You can have the csv module quote things for you. As far as I know there is no clipboard in the python standard libs but there are various mechanisms out there. Here I'm using pyperclip which is reasonable for text-only copies.
import pyperclip
import csv
import io
def clip_csv(filename):
outbuf = io.StringIO()
with open('file.csv', newline='') as infile:
incsv = csv.reader(infile, skipinitialspace=True)
outcsv = csv.writer(outbuf, quotechar="'", quoting=csv.QUOTE_ALL)
outcsv.writerows(incsv)
pyperclip.copy(outbuf.getvalue())
clip_csv('file.csv')
# DEBUG: Verify by printing clipboard
print(pyperclip.paste())
I'm not sure but I think you try to add quote char ' to all data in csv
import csv
with open('export.csv', 'w') as f:
# use quote char `'` for all data
writer = csv.writer(f, quotechar="'", quoting=csv.QUOTE_ALL)
writer.writerow(["BM1234", "BM2351", "20394"])

"Line contains NULL byte" in CSV reader (Python)

I'm trying to write a program that looks at a .CSV file (input.csv) and rewrites only the rows that begin with a certain element (corrected.csv), as listed in a text file (output.txt).
This is what my program looks like right now:
import csv
lines = []
with open('output.txt','r') as f:
for line in f.readlines():
lines.append(line[:-1])
with open('corrected.csv','w') as correct:
writer = csv.writer(correct, dialect = 'excel')
with open('input.csv', 'r') as mycsv:
reader = csv.reader(mycsv)
for row in reader:
if row[0] not in lines:
writer.writerow(row)
Unfortunately, I keep getting this error, and I have no clue what it's about.
Traceback (most recent call last):
File "C:\Python32\Sample Program\csvParser.py", line 12, in <module>
for row in reader:
_csv.Error: line contains NULL byte
Credit to all the people here to even to get me to this point.
I'm guessing you have a NUL byte in input.csv. You can test that with
if '\0' in open('input.csv').read():
print "you have null bytes in your input file"
else:
print "you don't"
if you do,
reader = csv.reader(x.replace('\0', '') for x in mycsv)
may get you around that. Or it may indicate you have utf16 or something 'interesting' in the .csv file.
I've solved a similar problem with an easier solution:
import codecs
csvReader = csv.reader(codecs.open('file.csv', 'rU', 'utf-16'))
The key was using the codecs module to open the file with the UTF-16 encoding, there are a lot more of encodings, check the documentation.
If you want to replace the nulls with something you can do this:
def fix_nulls(s):
for line in s:
yield line.replace('\0', ' ')
r = csv.reader(fix_nulls(open(...)))
You could just inline a generator to filter out the null values if you want to pretend they don't exist. Of course this is assuming the null bytes are not really part of the encoding and really are some kind of erroneous artifact or bug.
See the (line.replace('\0','') for line in f) below, also you'll want to probably open that file up using mode rb.
import csv
lines = []
with open('output.txt','r') as f:
for line in f.readlines():
lines.append(line[:-1])
with open('corrected.csv','w') as correct:
writer = csv.writer(correct, dialect = 'excel')
with open('input.csv', 'rb') as mycsv:
reader = csv.reader( (line.replace('\0','') for line in mycsv) )
for row in reader:
if row[0] not in lines:
writer.writerow(row)
This will tell you what line is the problem.
import csv
lines = []
with open('output.txt','r') as f:
for line in f.readlines():
lines.append(line[:-1])
with open('corrected.csv','w') as correct:
writer = csv.writer(correct, dialect = 'excel')
with open('input.csv', 'r') as mycsv:
reader = csv.reader(mycsv)
try:
for i, row in enumerate(reader):
if row[0] not in lines:
writer.writerow(row)
except csv.Error:
print('csv choked on line %s' % (i+1))
raise
Perhaps this from daniweb would be helpful:
I'm getting this error when reading from a csv file: "Runtime Error!
line contains NULL byte". Any idea about the root cause of this error?
...
Ok, I got it and thought I'd post the solution. Simply yet caused me
grief... Used file was saved in a .xls format instead of a .csv Didn't
catch this because the file name itself had the .csv extension while
the type was still .xls
A tricky way:
If you develop under Lunux, you can use all the power of sed:
from subprocess import check_call, CalledProcessError
PATH_TO_FILE = '/home/user/some/path/to/file.csv'
try:
check_call("sed -i -e 's|\\x0||g' {}".format(PATH_TO_FILE), shell=True)
except CalledProcessError as err:
print(err)
The most efficient solution for huge files.
Checked for Python3, Kubuntu
def fix_nulls(s):
for line in s:
yield line.replace('\0', '')
with open(csv_file, 'r', encoding = "utf-8") as f:
reader = csv.reader(fix_nulls(f))
for line in reader:
#do something
this way works for me
I've recently fixed this issue and in my instance it was a file that was compressed that I was trying to read. Check the file format first. Then check that the contents are what the extension refers to.
Turning my linux environment into a clean complete UTF-8 environment made the trick for me.
Try the following in your command line:
export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8
export LANGUAGE=en_US.UTF-8
This is long settled, but I ran across this answer because I was experiencing an unexpected error while reading a CSV to process as training data in Keras and TensorFlow.
In my case, the issue was much simpler, and is worth being conscious of. The data being produced into the CSV wasn't consistent, resulting in some columns being completely missing, which seems to end up throwing this error as well.
The lesson: If you're seeing this error, verify that your data looks the way that you think it does!
pandas.read_csv now handles the different UTF encoding when reading/writing and therefore can deal directly with null bytes
data = pd.read_csv(file, encoding='utf-16')
see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
for skipping the NULL byte rows
import csv
with open('sample.csv', newline='') as csv_file:
reader = csv.reader(csv_file)
while True:
try:
row = next(reader)
print(row)
except csv.Error:
continue
except StopIteration:
break
The above information is great. For me I had this same error. My fix was easy and just user error aka myself. Simply save the file as a csv and not an excel file.
It is very simple.
don't make a csv file by "create new excel" or save as ".csv" from window.
simply import csv module, write a dummy csv file, and then paste your data in that.
csv made by python csv module itself will no longer show you encoding or blank line error.

Categories