Difference between csv.reader and .read when importing csv file - python

What is the difference between importing a CSV file with csv.reader and with .read()? With the reader:
import csv
f = open("nfl.csv", 'r')
data = csv.reader(f)
and using read directly
f = open('nfl.csv', 'r')
data = f.read()

From the docs, the reader will
Return a reader object which will iterate over lines in the given
csvfile.
whereas read() on a file
reads some quantity of data and returns it as a string. size is an optional
numeric argument. When size is omitted or negative, the entire
contents of the file will be read and returned; it’s your problem if
the file is twice as large as your machine’s memory.
So, with the first approach, you can write
for row in data:
and process the rows one at a time.
You can also process any file one line at a time by iterating over the file object itself.
The csv module expects comma-separated columns by default, though, so you get each row as a list (with csv.reader) or as a dictionary (with csv.DictReader), depending on how you set things up.
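To make the contrast concrete, here is a minimal sketch. It uses an in-memory string as a stand-in for nfl.csv; the column names and values are made up for illustration.

```python
import csv
import io

# Hypothetical sample data standing in for the contents of nfl.csv.
sample = "team,wins\nPackers,13\nBears,8\n"

# read() returns the entire content as one string.
text = io.StringIO(sample).read()

# csv.reader yields one list of column values per row.
rows = list(csv.reader(io.StringIO(sample)))

# csv.DictReader yields one dictionary per data row, keyed by the header.
dict_rows = [dict(r) for r in csv.DictReader(io.StringIO(sample))]

print(type(text).__name__)
print(rows[1])
print(dict_rows[0]["team"])
```

With a real file you would pass the open file object instead of io.StringIO.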

Related

Python: reading CSV file line by line, how to detect end?

I read a large CSV file (millions of records) with this script. How do I detect that the file is at its end?
import csv
f = open("file.csv", newline='')
csv_reader = csv.reader(f)
while True:
    value = next(csv_reader)[6]  # do something with this value
The obvious solution is to loop over csv_reader, as suggested by this answer. If that is not practical, the docs for the next function say:
Retrieve the next item from the iterator by calling its __next__() method. If default is given, it is returned if the iterator is exhausted, otherwise StopIteration is raised.
thus giving you two ways of detecting the end.
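Both ways can be sketched as follows. The data is an in-memory stand-in for file.csv, and the column index 1 is arbitrary; neither comes from the question.

```python
import csv
import io

# Way 1: pass a default to next(); None marks the end because rows are lists.
csv_reader = csv.reader(io.StringIO("a,1\nb,2\nc,3\n"))
values = []
while True:
    row = next(csv_reader, None)
    if row is None:
        break  # iterator exhausted: end of file
    values.append(row[1])

# Way 2: catch StopIteration.
csv_reader = csv.reader(io.StringIO("a,1\nb,2\n"))
count = 0
while True:
    try:
        next(csv_reader)
    except StopIteration:
        break  # end of file
    count += 1

print(values, count)
```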
csv.reader does not read the whole file up front; it returns an iterator that reads one row at a time as you loop over it. To read line by line, you need this:
for row in csv_reader:
    print(row)  # do something with row
If you want the last line directly:
with open('file_name.csv', 'r') as file:
    data = file.readlines()
lastRow = data[-1]
This will be quite slow and memory-hungry for a large file, because readlines() loads every line into a list. An alternative is to use pandas.
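If only the last row is needed, one lower-memory option in the standard library is a bounded deque, which still scans the whole file but keeps only the most recent row. This is a sketch on made-up in-memory data, not something proposed in the thread.

```python
import csv
import io
from collections import deque

# Hypothetical stand-in for a large CSV file.
source = io.StringIO("id,name\n1,a\n2,b\n3,c\n")

# maxlen=1 discards every row except the most recent one while iterating.
last_rows = deque(csv.reader(source), maxlen=1)
last_row = last_rows[0] if last_rows else None

print(last_row)
```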
I solved it with pandas:
import pandas as pd
csv_reader = pd.read_csv("file.csv", skiprows=2, usecols=[6])
csv_a = csv_reader.to_numpy()
This script skips the first 2 lines of the file, imports only the column at index 6 (the seventh column), and converts it to a NumPy array. Note that, by default, read_csv treats the first remaining line as the header.

Python: csv to pickle representation, back to csv messes with file content

I am trying to pickle a csv file and then turn its pickled representation back into a csv file.
This is the code I came up with:
from pathlib import Path
import pickle, csv
csvFilePath = Path('/path/to/file.csv')
pathToSaveTo = Path('/path/to/newFile.csv')
csvFile = open(csvFilePath, 'r')
f = csvFile.read()
csvFile.close()
f_pickled = pickle.dumps(f)
f_unpickled = pickle.loads(f_pickled)
#save unpickled csv file
new_csvFile = open(pathToSaveTo, 'w')
csvWriter = csv.writer(new_csvFile)
csvWriter.writerow(f_unpickled)
new_csvFile.close()
newFile.csv is created; however, there are two problems with its content:
There is now a comma between every character.
There is now a pair of quotation marks after every line.
What would I have to change about my code to get an exact copy of file.csv?
The problem is that you are reading the raw text of the file with f = csvFile.read(). Then, on writing, you feed that data, which is a single lump of text in one string, through a CSV writer object. The CSV writer treats the string as an iterable and writes each of its elements (each character) to its own CSV cell. After that there is no data for a second row, and the process ends.
The pickle dumps and loads round trip you perform is effectively a no-op: the data comes back unchanged. If there were an issue, it would be due to some unpicklable object referenced by the object you pass to dumps, and you would get an exception rather than different data when loads is called.
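The effect can be reproduced in a few lines. This sketch uses a made-up two-line CSV string and an in-memory buffer in place of the real files.

```python
import csv
import io
import pickle

# Hypothetical file content, as returned by csvFile.read().
text = "a,b\n1,2\n"

# The pickle round trip returns an equal string.
roundtrip_ok = pickle.loads(pickle.dumps(text)) == text

# writerow() iterates over the string, so every character becomes a cell.
out = io.StringIO()
csv.writer(out).writerow(text)
result = out.getvalue()

print(roundtrip_ok)
print(repr(result))
```

The commas and newlines of the original text become cells of their own, and the writer quotes them, which accounts for the stray commas and quotation marks.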
Now, without knowing why you want to do this and what intermediate steps you have planned for the data, it is hard to advise further: as written, the code just reads a file, pickles and unpickles its contents (which changes nothing), and writes those contents back to disk.
At which point do you need these data structured as rows, or as CSV cells? Just apply the proper transforms where you need it, and you are done.
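If no row structure is needed at any point, an exact copy can be made without the csv module at all. The sketch below assumes that reading and writing raw bytes is acceptable, and uses a temporary directory and made-up content in place of the real paths.

```python
import pickle
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "file.csv"      # hypothetical source path
    dst = Path(tmp) / "newFile.csv"   # hypothetical destination path

    original = b"name,score\r\nann,1\r\n"
    src.write_bytes(original)

    # Pickle the raw bytes, unpickle them, and write them back unchanged.
    payload = pickle.dumps(src.read_bytes())
    dst.write_bytes(pickle.loads(payload))

    identical = dst.read_bytes() == original

print(identical)
```

Working in bytes sidesteps newline translation, so the line endings of the source survive as they are.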
If you want the whole "do nothing" cycle to go through actually having the CSV data separated into different elements in Python, you can do this:
from pathlib import Path
import pickle, csv
csvFilePath = Path('file.csv')
pathToSaveTo = Path('newFile.csv')
with open(csvFilePath, newline='') as src:
    # list() consumes the whole reader: each row becomes a list of cell values
    data = list(csv.reader(src))
pickled_data = pickle.dumps(data)
restored_data = pickle.loads(pickled_data)
with open(pathToSaveTo, "wt", newline='') as dst:
    csv.writer(dst).writerows(restored_data)
Note that in this snippet the data is read through csv.reader, not directly. Wrapping it in a list call causes all rows to be read and turned into list items; otherwise the reader is a lazy iterator (and it would not be picklable, as one of the attributes its state depends on is an open file).
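A quick way to check this, as a sketch on in-memory data: pickling the reader object itself fails, while the list of rows round-trips. The exception types caught here are what I would expect from CPython, not something stated in the thread.

```python
import csv
import io
import pickle

reader = csv.reader(io.StringIO("a,b\n1,2\n"))
try:
    pickle.dumps(reader)
    reader_pickled = True
except (TypeError, pickle.PicklingError):
    # Expected: the lazy reader object cannot be pickled.
    reader_pickled = False

# A plain list of rows pickles without trouble.
rows = list(csv.reader(io.StringIO("a,b\n1,2\n")))
restored = pickle.loads(pickle.dumps(rows))

print(reader_pickled, restored)
```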
I believe the problem is in how you're attempting to write the CSV file; the pickling and unpickling is fine. If you compare f with f_unpickled:
if f == f_unpickled:
    print("Same")
This printed "Same" in my case. If you print the types, you'll see that both are strings.
The better option is to follow the style used in the documentation and write each row one at a time, rather than passing in the entire string including newlines. Something like this:
from pathlib import Path
import pickle, csv
csvFilePath = Path('file.csv')
pathToSaveTo = Path('newFile.csv')
rows = []
with open(csvFilePath, 'r', newline='') as file:
    reader = csv.reader(file)
    for row in reader:
        rows.append(row)
# pickle and unpickle
rows_pickled = pickle.dumps(rows)
rows_unpickled = pickle.loads(rows_pickled)
if rows == rows_unpickled:
    print("Same")
# save unpickled csv file
with open(pathToSaveTo, 'w', newline='') as csvfile:
    csvWriter = csv.writer(csvfile)
    for row in rows_unpickled:
        csvWriter.writerow(row)
This worked when I tested it, although it would take more finagling with line separators to avoid an empty line at the end.
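On the line separators: csv.writer ends each row with "\r\n" by default, so the output can differ from a source that uses "\n". A sketch on in-memory data, assuming the source uses "\n" line endings, shows the difference and how the lineterminator parameter changes it.

```python
import csv
import io

# Hypothetical source text with "\n" line endings.
original = "a,b\n1,2\n"
rows = list(csv.reader(io.StringIO(original)))

# Default writer: rows end with "\r\n".
default_out = io.StringIO()
csv.writer(default_out).writerows(rows)

# Writer configured to end rows with "\n".
unix_out = io.StringIO()
csv.writer(unix_out, lineterminator="\n").writerows(rows)

print(repr(default_out.getvalue()))
print(unix_out.getvalue() == original)
```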

Access values outside with-block

Is there a way, in the code below, to access the variable utterances_dict outside of the with-block? The code below raises the error: ValueError: I/O operation on closed file.
from csv import DictReader
utterances_dict = {}
utterance_file = 'toy_utterances.csv'
with open(utterance_file, 'r') as utt_f:
    utterances_dict = DictReader(utt_f)
for line in utterances_dict:
    print(line)
I am not an expert on the DictReader implementation, but the documentation leaves room for the reader to parse the file lazily, after construction. That means the underlying file may have to remain open until you are done using the reader. In that case, it would be problematic to use utterances_dict outside of the with block, because the underlying file will be closed by then.
Even if the current implementation of DictReader does in fact parse the whole csv on construction, it doesn't mean their implementation won't change in the future.
DictReader is a lazy iterator over the open file rather than a container holding the rows.
Convert the result to a list of dictionaries inside the with block.
from csv import DictReader
utterances = []
utterance_file = 'toy_utterances.csv'
with open(utterance_file, 'r') as utt_f:
    utterances = [dict(row) for row in DictReader(utt_f)]
for line in utterances:
    print(line)
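The two behaviours can be compared side by side. This sketch writes a small temporary file as a stand-in for toy_utterances.csv; the column names and rows are invented.

```python
import csv
import os
import tempfile

# Hypothetical stand-in for toy_utterances.csv.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as tmp:
    tmp.write("speaker,text\nA,hello\nB,hi\n")
    path = tmp.name

try:
    # Lazy reader: nothing has been read when the with block closes the file.
    with open(path, newline="") as utt_f:
        lazy_reader = csv.DictReader(utt_f)
    try:
        next(lazy_reader)
        error = None
    except ValueError as exc:
        error = str(exc)

    # Materialized rows: everything is read while the file is still open.
    with open(path, newline="") as utt_f:
        utterances = [dict(row) for row in csv.DictReader(utt_f)]
finally:
    os.remove(path)

print(error)
print(utterances)
```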

What does this line of code do?

I was just wondering what this line of code does:
writerow([recordlist[i][0], recordlist[i][1], recordlist[i][2]])
I know it's a parameter of some sort, but what does it actually do in the context of this code:
import csv
recordlist = [["1", "chinese", "male"], ["2", "indian", "female"]]
file_name = 'info.txt'
ofile = open(file_name, 'a')
writer = csv.writer(ofile, delimiter=',', lineterminator='\n')
for i in range(0, len(recordlist)):
    writer.writerow([recordlist[i][0], recordlist[i][1], recordlist[i][2]])
ofile.close()
Thank you!
You've created a csv writer. It has a method, writerow, that takes a sequence (list, tuple, etc.) of values and writes them to the underlying file in delimited format, which in this case uses a comma as the delimiter. So the loop creates one row in the output file for each entry in the recordlist variable. Each row consists of the three values of that entry, as defined in recordlist, separated by commas.
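To see the output without touching info.txt, the same calls can be pointed at an in-memory buffer. This is a sketch; the buffer and the quoted string values are assumptions for illustration.

```python
import csv
import io

recordlist = [["1", "chinese", "male"], ["2", "indian", "female"]]

# An in-memory buffer stands in for the file opened in append mode.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter=",", lineterminator="\n")
for record in recordlist:
    # Each call writes one comma-separated line built from the sequence.
    writer.writerow([record[0], record[1], record[2]])

output = buffer.getvalue()
print(output)
```

Here output holds two lines, one per entry in recordlist, with the three values of each entry joined by commas.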
The real answer should be: run it and see what it does.
Then read the documentation of Python's csv module.

Python CSV parsing fills up memory

I have a CSV file which has over a million rows and I am trying to parse this file and insert the rows into the DB.
with open(file, "r", newline="") as csvfile:
    re = csv.DictReader(csvfile)
    for row in re:
        pass  # insert row['column_name'] into DB
For CSV files below 2 MB this works well, but anything larger ends up eating my memory. It is probably because I store the DictReader's contents in a list called "re" and it is not able to loop over such a huge list. I definitely need to access the CSV file by column name, which is why I chose DictReader, since it provides easy column-level access. Can anyone tell me why this is happening and how it can be avoided?
DictReader does not load the whole file into memory; it reads it in chunks, as explained in the answer suggested by DhruvPathak.
But depending on your database engine, the actual write to disk may happen only at commit. That means the database layer (and not the csv reader) keeps all the pending data in memory and eventually exhausts it.
So you should try to commit every n records, with n typically between 10 and 1000, depending on the size of your lines and the available memory.
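A minimal sketch of committing every n records follows. It assumes sqlite3 as the database, an in-memory CSV source, and a hypothetical table named items; the batch size of 2 is only there to keep the example small.

```python
import csv
import io
import sqlite3

BATCH_SIZE = 2  # hypothetical; the answer suggests something between 10 and 1000

# In-memory stand-ins for the CSV file and the database.
source = io.StringIO("column_name,other\nv1,x\nv2,y\nv3,z\n")
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (value TEXT)")

commits = 0
for count, row in enumerate(csv.DictReader(source), start=1):
    conn.execute("INSERT INTO items (value) VALUES (?)", (row["column_name"],))
    if count % BATCH_SIZE == 0:
        conn.commit()  # flush this batch
        commits += 1
conn.commit()  # flush any remaining rows

stored = conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]
conn.close()
print(stored, commits)
```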
If you don't need the entire columns at once, you can simply read the file line by line like you would with a text file and parse each row. The exact parsing depends on your data format but you could do something like:
delimiter = ','
with open(filename, 'r') as fil:
    headers = next(fil)
    headers = headers.strip().split(delimiter)
    dic_headers = {hdr: headers.index(hdr) for hdr in headers}
    for line in fil:
        row = line.strip().split(delimiter)
        ## do something with row[dic_headers['column_name']]
This is a very simple example, and it can be made more elaborate. For example, it does not work if a field in your data contains the delimiter (a comma inside a quoted value).
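The limitation shows up as soon as a quoted field contains a comma. The line below is invented for illustration; it compares the plain split with csv.reader on the same input.

```python
import csv
import io

# Hypothetical line with a comma inside a quoted field.
line = 'Smith,"Portland, OR",42\n'

# Plain split breaks the quoted field in two and keeps the quote characters.
naive = line.strip().split(",")

# csv.reader understands the quoting and returns three fields.
parsed = next(csv.reader(io.StringIO(line)))

print(naive)
print(parsed)
```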
