I read a large CSV file (millions of records) with this script. How do I detect that I have reached the end of the file?
import csv

f = open("file.csv", newline='')
csv_reader = csv.reader(f)
while True:
    row = next(csv_reader)
    # do something with row[6]
The obvious solution is to loop over csv_reader, as suggested by this answer. If that is not practical, the docs for the next() function say:
Retrieve the next item from the iterator by calling its __next__() method. If default is given, it is returned if the iterator is exhausted, otherwise StopIteration is raised.
thus giving you two ways of detecting the end.
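A minimal sketch of both approaches, reusing the question's file name and column index:

import csv

# 1. Let next() raise StopIteration at the end of the file.
with open("file.csv", newline='') as f:
    csv_reader = csv.reader(f)
    try:
        while True:
            row = next(csv_reader)
            # do something with row[6]
    except StopIteration:
        pass  # end of file reached

# 2. Pass a default, which next() returns once the reader is exhausted.
with open("file.csv", newline='') as f:
    csv_reader = csv.reader(f)
    while (row := next(csv_reader, None)) is not None:
        pass  # do something with row[6]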
A csv.reader does not read the file entirely; it is a lazy iterable that yields one row at a time. For reading "line by line", you need this:
for row in csv_reader:
    pass  # do something with row
If you directly want the last line:
with open('file_name.csv', 'r') as file:
    data = file.readlines()
lastRow = data[-1]
This will be quite slow and memory-consuming. An alternative is to use pandas.
I solved it with pandas:
import pandas as pd

csv_reader = pd.read_csv("file.csv", skiprows=2, usecols=[6])
csv_a = csv_reader.to_numpy()
This script skips the first two rows, imports only the column at index 6, and converts it to a NumPy array.
Related
I am trying to pickle a csv file and then turn its pickled representation back into a csv file.
This is the code I came up with:
from pathlib import Path
import pickle, csv

csvFilePath = Path('/path/to/file.csv')
pathToSaveTo = Path('/path/to/newFile.csv')

csvFile = open(csvFilePath, 'r')
f = csvFile.read()
csvFile.close()

f_pickled = pickle.dumps(f)
f_unpickled = pickle.loads(f_pickled)

# save unpickled csv file
new_csvFile = open(pathToSaveTo, 'w')
csvWriter = csv.writer(new_csvFile)
csvWriter.writerow(f_unpickled)
new_csvFile.close()
newFile.csv is created; however, there are two problems with its content:
There is now a comma between every character.
There is now a pair of quotation marks after every line.
What would I have to change about my code to get an exact copy of file.csv?
The problem is that you are reading the raw text of the file with f = csvFile.read(). Then, on writing, you are feeding the data, which is a single lump of text in one string, through a CSV writer object. The CSV writer sees the string as an iterable and writes each of its elements, i.e. each character, into its own CSV cell. Then there is no data for a second row, and the process ends.
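A minimal sketch of that failure mode, using io.StringIO as a stand-in for the output file (an assumption, not the question's code):

import csv, io

out = io.StringIO()
csv.writer(out).writerow("AB,C")  # a string iterates character by character
print(out.getvalue())             # 'A,B,",",C\r\n' -- one cell per character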
The pickle dumps and loads you perform are just a no-op: nothing happens there. If there were any issue, it would rather be due to some unpicklable object reference in the object you are passing to dumps: you'd get an exception, not differing data when loads is called.
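For instance, a string round-trips through pickle unchanged (the sample text here is made up):

import pickle

s = "a,b\n1,2\n"                           # placeholder text, not the question's data
assert pickle.loads(pickle.dumps(s)) == s  # round-trip yields an equal string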
Now, without knowing why you want to do this, and what intermediate steps you have planned for the data, it is hard to advise: as written, you are performing a series of no-ops: reading a file, pickling and unpickling its contents, and writing those contents back to disk.
At which point do you need these data structured as rows, or as CSV cells? Just apply the proper transforms where you need them, and you are done.
If you want the whole "do nothing" cycle to actually go through having the CSV data separated into different elements in Python, you can do:
from pathlib import Path
import pickle, csv

csvFilePath = Path('file.csv')
pathToSaveTo = Path('newFile.csv')

data = list(csv.reader(open(csvFilePath)))
# ^ consumes all iterations of the reader: each iteration is a row,
#   a list in which each cell value is a list element
pickled_data = pickle.dumps(data)
restored_data = pickle.loads(pickled_data)

csv.writer(open(pathToSaveTo, "wt")).writerows(restored_data)
Note that in this snippet the data is read through csv.reader, not directly. Wrapping it in a list() call causes all rows to be read and turned into list items, because the reader is otherwise a lazy iterator (and it would not be picklable, since one of the attributes it depends on for its state is an open file).
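A quick sketch of that last point (io.StringIO stands in for the file; the exact error message may vary by Python version):

import csv, io, pickle

reader = csv.reader(io.StringIO("a,b\n1,2\n"))
try:
    pickle.dumps(reader)
except TypeError as e:
    print(e)  # e.g. "cannot pickle '_csv.reader' object"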
I believe the problem is in how you're attempting to write the CSV file; the pickling and unpickling are fine. If you compare f with f_unpickled:
if f == f_unpickled:
    print("Same")
This printed "Same" in my case. If you print the types, you'll see they're both strings.
The better option is to follow the csv module's documented style and write each row one at a time, rather than feeding in the entire file contents as one string, newlines included. Something like this:
from pathlib import Path
import pickle, csv

csvFilePath = Path('file.csv')
pathToSaveTo = Path('newFile.csv')

rows = []
with open(csvFilePath, 'r') as file:
    reader = csv.reader(file)
    for row in reader:
        rows.append(row)

# pickle and unpickle
rows_pickled = pickle.dumps(rows)
rows_unpickled = pickle.loads(rows_pickled)
if rows == rows_unpickled:
    print("Same")

# save unpickled csv file
with open(pathToSaveTo, 'w', newline='') as csvfile:
    csvWriter = csv.writer(csvfile)
    for row in rows_unpickled:
        csvWriter.writerow(row)
This worked when I tested it, although it would take more finagling with line separators to avoid an empty line at the end.
Is there a way, in the code below, to access the variable utterances_dict outside of the with block? The code below obviously raises the error ValueError: I/O operation on closed file.
from csv import DictReader

utterances_dict = {}
utterance_file = 'toy_utterances.csv'
with open(utterance_file, 'r') as utt_f:
    utterances_dict = DictReader(utt_f)

for line in utterances_dict:
    print(line)
I am not an expert on the DictReader implementation; however, its documentation leaves it open for the reader to parse the file lazily after construction. That means the underlying file may have to remain open until you are done using the reader. In that case, it is problematic to use utterances_dict outside of the with block, because the underlying file will have been closed by then.
Even if the current implementation of DictReader did parse the whole CSV on construction, that doesn't mean the implementation won't change in the future.
DictReader yields rows lazily from the open csv file rather than materializing them. Convert the result to a list of dictionaries while the file is still open.
from csv import DictReader

utterances = []
utterance_file = 'toy_utterances.csv'
with open(utterance_file, 'r') as utt_f:
    utterances = [dict(row) for row in DictReader(utt_f)]

for line in utterances:
    print(line)
I was just wondering what this line of code does:
writerow([recordlist[i][0], recordlist[i][1], recordlist[i][2]])
I know it's a parameter of some sort, but what does it actually do in all of this code:
recordlist = [["1", "chinese", "male"], ["2", "indian", "female"]]

import math
import csv

file_name = 'info.txt'
ofile = open(file_name, 'a')
writer = csv.writer(ofile, delimiter=',', lineterminator='\n')
for i in range(0, len(recordlist)):
    writer.writerow([recordlist[i][0], recordlist[i][1], recordlist[i][2]])
ofile.close()
Thank you!
You've created a csv.writer. It has a method writerow that takes a sequence (list, tuple, etc.) of values and writes it to the underlying file in delimited format, which in this case uses a comma as the delimiter. So it will create a row in the csv file for each row in the recordlist variable as the for loop iterates over it. Each row will consist of the values defined on the first line of your code, separated by commas.
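As a sketch, the same loop can iterate over the rows directly; the comments show what gets appended to info.txt (using the question's data):

import csv

recordlist = [["1", "chinese", "male"], ["2", "indian", "female"]]

with open('info.txt', 'a', newline='') as ofile:
    writer = csv.writer(ofile, delimiter=',', lineterminator='\n')
    for record in recordlist:
        writer.writerow(record)  # one CSV line per inner list

# info.txt now ends with:
# 1,chinese,male
# 2,indian,female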
The real answer should be "run it and try it" to see what it does. Then read the documentation of the csv module in Python.
I want to read two columns of a csv file separately, but when I write code like the below, Python just shows the first column and nothing for the second, even though in the csv file the second column also has lots of rows.
import csv
import pprint

f = open("arachnid.csv", 'r')
read = csv.DictReader(f)

for i in range(3):
    next(read)

for i in read:
    pprint.pprint(i["binomialAuthority_label"])

for i in read:
    pprint.pprint(i["rdf-schema#label"])
The reason for this is that when you use DictReader this way, it creates an iterator. Once you have iterated over it, it is exhausted, so you cannot iterate over it a second time the way you are doing it.
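A minimal sketch of the exhaustion (io.StringIO stands in for the file, and the column names are made up):

import csv, io

f = io.StringIO("a,b\n1,2\n3,4\n")
read = csv.DictReader(f)
print(list(read))  # [{'a': '1', 'b': '2'}, {'a': '3', 'b': '4'}]
print(list(read))  # [] -- the iterator is exhausted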
If you want to keep your logic as is, you can call seek(0) on your file object to reset its position, as such:
f.seek(0)
The next time you iterate over your DictReader object, it will give you what you are looking for. So the part of your code of interest would be this:
for i in read:
    pprint.pprint(i["binomialAuthority_label"])

# This is where you set your seek(0) before the second loop
f.seek(0)

for i in read:
    pprint.pprint(i['rdf-schema#label'])
Your DictReader instance gets exhausted after your first for i in read: loop, so when you try to do your second loop, there is nothing to iterate over.
What you want to do, once you've iterated over the CSV the first time, is seek the file back to the start and create a new instance of the DictReader to start again. You'll want a new DictReader instance; otherwise you'll need to manually skip the header line.
f = open(filename)
read = csv.DictReader(f)
for i in read:
    print(i)

f.seek(0)
read = csv.DictReader(f)
for i in read:
    print(i)
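A small sketch of why the fresh instance matters (io.StringIO and the sample rows are stand-ins): the old DictReader has already consumed the header line, so after seek(0) it hands the header back as a data row:

import csv, io

f = io.StringIO("name,age\nalice,30\n")
read = csv.DictReader(f)
print(list(read))  # [{'name': 'alice', 'age': '30'}]

f.seek(0)
print(list(read))  # the header line now comes back as data:
                   # [{'name': 'name', 'age': 'age'}, {'name': 'alice', 'age': '30'}]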
I haven't been able to re.sub the contents of a csv file. My expression is doing its job, but the writerow is where I'm stuck.
re.sub input              out
"A1","Address2"           "A1","Address2"
0138,"DEERFIELD AVE"      0138,"DEERFIELD"
0490,"REMMINGTON COURT"   0490,"REMMINGTON"
2039,"SANDHILL DR"        2039,"SANDHILL"
import csv
import re

with open('aa_street.txt', 'rb') as f:
    reader = csv.reader(f)
    read = csv.reader(f)
    for row in read:
        row_one = re.sub('\s+(DR|COURT|AVE|)\s*$', ' ', row[1])
        row_zero = row[0]
        print row_one
    for row in reader:
        print writerow([row[0], row[1]])
Perhaps something like this is what you need?
#!/usr/local/cpython-3.3/bin/python

# "A1","Address2"           "A1","Address2"
# 0138,"DEERFIELD AVE"      0138,"DEERFIELD"
# 0490,"REMMINGTON COURT"   0490,"REMMINGTON"
# 2039,"SANDHILL DR"        2039,"SANDHILL"

import re
import csv

with open('aa_street.txt', 'r') as infile, open('actual-output', 'w') as outfile:
    reader = csv.reader(infile)
    writer = csv.writer(outfile)
    for row in reader:
        row_zero = row[0]
        row_one = re.sub(r'\s+(DR|COURT|AVE|)\s*$', '', row[1])
        writer.writerow([row_zero, row_one])
A file is an iterator—you iterate over it once, and then it's empty.
A csv.reader is also an iterator.
In general, if you want to reuse an iterator, there are three ways to do it:
Re-generate the iterator (and, if its source was an iterator, re-generate that as well, and so on up the chain); in this case, that means opening the file again.
Use itertools.tee (a sketch follows below).
Copy the iterator into a sequence and reuse that instead.
In the special case of files, you can fake #1 by using f.seek(0). Some other iterators have similar behavior. But in general, you shouldn't rely on this.
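A minimal sketch of #2, reusing the file and column names from the earlier question (note that tee buffers everything one copy has consumed and the other hasn't, so running the two loops one after the other costs about as much memory as a list):

import csv
import itertools

f = open("arachnid.csv")
read1, read2 = itertools.tee(csv.DictReader(f))

for i in read1:
    print(i["binomialAuthority_label"])
for i in read2:  # read2 replays the rows read1 already consumed
    print(i["rdf-schema#label"])
f.close()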
Anyway, the last one is the easiest, so let's just see how that works:
reader = list(csv.reader(f))
read = reader
Now you've got a list of all of the rows in the file. You can copy it, loop over it, loop over the copy, close the file, and loop over the copy again; it's still there.
Of course the downside is that you need enough memory to hold the whole thing (plus, you can't start processing the first line until you've finished reading the last one). If that's a problem, you need to either reorganize your code so it only needs one pass, or re-open (or seek) the file.