I'm maintaining an app that exports CSV files from a multi-file database. The database's design has changed so that the set of columns of an exported table might be extended in each new file of the db (e.g. db_file0.db has columns 'Header1', 'Header2', and db_file1.db has columns 'Header1', 'Header2', 'Header3'). The order of the columns never changes, so I don't have to worry about data entries moving around within rows.
I want to update the CSV export function so it changes the header row in the file whenever the db's columns change. I slapped together this example of how I can do so, but I'm worried that the new_file.write(old_file.read()) line will use way too much memory on a very large CSV file.
import csv
import io
def update_headers(file, new_headers: list, dialect=None):
    new_file = io.StringIO()
    # Copy source file to buffer, changing header row first
    with open(file) as old_file:
        if dialect is None:
            dialect = csv.Sniffer().sniff(old_file.read(1024))
            old_file.seek(io.SEEK_SET)  # Sniffer moves file position; reset to 0
        r = csv.reader(old_file, dialect=dialect)
        w = csv.writer(new_file, dialect=dialect)
        w.writerow(next(r) + new_headers)
        new_file.write(old_file.read())
    # Overwrite source file with content of buffer
    with open(file, 'w', newline='') as csv_file:
        csv_file.write(new_file.getvalue())
    new_file.close()
Is there a more memory-efficient way to transfer the bulk of the file into and out of the stream? Or is there a way to modify the file on disk without streaming it all into memory first?
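One possible direction, offered as a sketch rather than a definitive answer: write the new header to a temporary file in the same directory, copy the rest of the source in fixed-size chunks with shutil.copyfileobj, and then swap the temporary file into place with os.replace. The function name update_headers_streaming and the csv.excel default dialect are assumptions made for illustration, and the sketch assumes the header row occupies a single line.

```python
import csv
import os
import shutil
import tempfile


def update_headers_streaming(path, new_headers, dialect=csv.excel):
    """Append new_headers to the header row of the CSV at path, streaming the rest."""
    directory = os.path.dirname(os.path.abspath(path))
    # newline='' keeps the original line endings of the data rows intact.
    with open(path, newline='') as old_file:
        with tempfile.NamedTemporaryFile(
            'w', newline='', dir=directory, delete=False, suffix='.tmp'
        ) as tmp:
            reader = csv.reader(old_file, dialect=dialect)
            writer = csv.writer(tmp, dialect=dialect)
            writer.writerow(next(reader) + list(new_headers))
            # Copy the remaining rows in chunks instead of one big read().
            shutil.copyfileobj(old_file, tmp)
    # Both files are closed here; swap the rewritten file into place.
    os.replace(tmp.name, path)
```

Only one chunk of the file is held in memory at a time, at the cost of needing enough free disk space for a second copy while the rewrite is in progress.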
Related
I am trying to pickle a csv file and then turn its pickled representation back into a csv file.
This is the code I came up with:
from pathlib import Path
import pickle, csv
csvFilePath = Path('/path/to/file.csv')
pathToSaveTo = Path('/path/to/newFile.csv')
csvFile = open(csvFilePath, 'r')
f = csvFile.read()
csvFile.close()
f_pickled = pickle.dumps(f)
f_unpickled = pickle.loads(f_pickled)
#save unpickled csv file
new_csvFile = open(pathToSaveTo, 'w')
csvWriter = csv.writer(new_csvFile)
csvWriter.writerow(f_unpickled)
new_csvFile.close()
newFile.csv is created however there are two problems with its content:
There is now a comma between every character.
There is now a pair of quotation marks after every line.
What would I have to change about my code to get an exact copy of file.csv?
The problem is that you read the raw text of the file with f = csvFile.read() and then, on writing, feed that data, a single lump of text in a single string, through a CSV writer object. The CSV writer sees the string as an iterable and writes each of its elements (each character) into a separate CSV cell. Then there is no data for a second row, and the process ends.
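To see that effect in isolation, here is a minimal sketch that passes a plain string to writerow. The io.StringIO buffer and the sample text are only there for illustration.

```python
import csv
import io

buffer = io.StringIO()
# A str is an iterable of characters, so each character becomes its own cell.
csv.writer(buffer).writerow("ab1")
print(repr(buffer.getvalue()))  # 'a,b,1\r\n'
```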
The pickle dumps and loads you perform is just a no-operation: nothing happens there, and if there were any issue, it would rather be due to some unpickleable object reference in the object you are passing to dumps: you'd get an exception, and not differing data when loads is called.
Now, without knowing why you want to do this, and what intermediate steps you have planned for the data, it is hard to advise you: as it stands, you are reading a file, pickling and unpickling its contents (two non-operations), and writing those contents back to disk.
At which point do you need these data structured as rows, or as CSV cells? Just apply the proper transforms where you need it, and you are done.
If you want the whole "do nothing" cycle to actually pass through Python with the CSV data separated into individual elements, you can do this:
from pathlib import Path
import pickle, csv
csvFilePath = Path('file.csv')
pathToSaveTo = Path('newFile.csv')
with open(csvFilePath, newline='') as source:
    data = list(csv.reader(source))
# ^ consumes all iterations of the reader: each iteration is a row, a list in which each cell value is an element
pickled_data = pickle.dumps(data)
restored_data = pickle.loads(pickled_data)
with open(pathToSaveTo, 'w', newline='') as target:
    csv.writer(target).writerows(restored_data)
Note that in this snippet the data is read through csv.reader, not directly. Wrapping it in a list call causes all rows to be read and turned into list items; otherwise the reader is a lazy iterator (and it would not be picklable, since one of the attributes its state depends on is an open file).
I believe the problem is in how you're attempting to write the CSV file; the pickling and unpickling are fine. If you compare f with f_unpickled:
if f == f_unpickled:
    print("Same")
This printed "Same" in my case. If you print the types, you'll see that both are strings.
The better option is to follow the documented style and write one row at a time rather than passing in the entire string, newlines included. Something like this:
from pathlib import Path
import pickle, csv
csvFilePath = Path('file.csv')
pathToSaveTo = Path('newFile.csv')
rows = []
with open(csvFilePath, 'r', newline='') as file:
    reader = csv.reader(file)
    for row in reader:
        rows.append(row)
# pickle and unpickle
rows_pickled = pickle.dumps(rows)
rows_unpickled = pickle.loads(rows_pickled)
if rows == rows_unpickled:
    print("Same")
# save unpickled csv file
with open(pathToSaveTo, 'w', newline='') as csvfile:
    csvWriter = csv.writer(csvfile)
    for row in rows_unpickled:
        csvWriter.writerow(row)
This worked when I tested it--albeit it would take more finagling with line separators to get no empty line at the end.
Suppose I have a very large csv file.
file = open("foo.csv")
seems to put the whole csv in RAM. If I just need the first row of the csv but don't want Python to load the entire file, is there a way to do it?
If you just need the first row then you can use the csv module like so.
import csv
with open("foo.csv", "r") as my_csv:
    reader = csv.reader(my_csv)
    first_row = next(reader)
    # do stuff with first_row
The csv reader is a lazy iterator: it pulls rows from the file only as they are requested, so the whole file is never loaded into RAM at once.
What is the difference in importing csv file with reader and with .read
import csv
f = open("nfl.csv", 'r')
data = csv.reader(f)
and using read directly
f = open('nfl.csv', 'r')
data = f.read()
From the docs, the reader will
Return a reader object which will iterate over lines in the given csvfile.
whereas the read on a file, will
reads some quantity of data and returns it as a string. size is an optional numeric argument. When size is omitted or negative, the entire contents of the file will be read and returned; it’s your problem if the file is twice as large as your machine’s memory.
So, with the first approach, you can use
for row in reader:
and process the rows one at a time.
You can also do things one line at a time for a file in general.
The csv module expects comma-separated columns by default, though, so you get each row as a list or a dictionary depending on how you set things up.
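As a rough illustration of the "list or a dictionary" point, the sketch below parses the same text with csv.reader and with csv.DictReader. The sample data is made up and read from an in-memory io.StringIO instead of nfl.csv.

```python
import csv
import io

sample = "team,wins\nEagles,10\nBears,7\n"  # made-up data for illustration

# csv.reader yields each row as a list of strings (the header row included).
rows_as_lists = list(csv.reader(io.StringIO(sample)))
# csv.DictReader uses the first row as keys and yields one mapping per data row.
rows_as_dicts = list(csv.DictReader(io.StringIO(sample)))

print(rows_as_lists[1])          # ['Eagles', '10']
print(rows_as_dicts[0]["wins"])  # 10
```

Both readers return every value as a string, so any numeric conversion is up to the caller.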
I have a CSV file which has over a million rows and I am trying to parse this file and insert the rows into the DB.
with open(file, "rb") as csvfile:
    re = csv.DictReader(csvfile)
    for row in re:
        pass  # insert row['column_name'] into DB
For csv files below 2 MB this works well, but anything larger ends up eating my memory. It is probably because I store the DictReader's contents in a list called "re" and it is not able to loop over such a huge list. I definitely need to access the csv file by its column names, which is why I chose DictReader, since it easily provides column-level access to my csv files. Can anyone tell me why this is happening and how it can be avoided?
The DictReader does not load the whole file into memory but reads it in chunks, as explained in this answer suggested by DhruvPathak.
But depending on your database engine, the actual write to disk may only happen at commit. That means that the database (and not the csv reader) keeps all the pending data in memory and eventually exhausts it.
So you should try to commit every n records, with n typically between 10 and 1000 depending on the size of your lines and the available memory.
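A minimal sketch of committing every n records, assuming sqlite3 as the database engine. The table items, its columns name and price, the helper insert_in_batches and the sample data are all hypothetical; another engine would need its own driver and SQL.

```python
import csv
import io
import sqlite3


def insert_in_batches(csv_file, connection, batch_size=500):
    """Insert rows into the hypothetical table `items`, committing once per batch."""
    reader = csv.DictReader(csv_file)
    inserted = 0
    for row in reader:
        connection.execute(
            "INSERT INTO items (name, price) VALUES (?, ?)",
            (row["name"], row["price"]),
        )
        inserted += 1
        if inserted % batch_size == 0:
            connection.commit()  # flush this batch so pending rows do not pile up
    connection.commit()  # commit the final, possibly partial, batch
    return inserted


connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE items (name TEXT, price TEXT)")
sample = io.StringIO("name,price\napple,1\npear,2\nplum,3\n")
count = insert_in_batches(sample, connection, batch_size=2)
```

The right batch_size depends on the row size and the engine, so it is worth measuring rather than guessing.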
If you don't need the entire columns at once, you can simply read the file line by line like you would with a text file and parse each row. The exact parsing depends on your data format but you could do something like:
delimiter = ','
with open(filename, 'r') as fil:
    headers = next(fil)  # first line holds the column names
    headers = headers.strip().split(delimiter)
    dic_headers = {hdr: headers.index(hdr) for hdr in headers}
    for line in fil:
        row = line.strip().split(delimiter)
        ## do something with row[dic_headers['column_name']]
This is a very simple example but it can be more elaborate. For example, this does not work if a field in your data itself contains the delimiter (such as a comma inside a quoted value).
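To make that limitation concrete, the sketch below runs one made-up line through a plain split and through csv.reader, which understands quoting:

```python
import csv
import io

line = 'Smith,"Portland, OR",42\n'  # made-up row with a comma inside a quoted field

naive = line.strip().split(',')
parsed = next(csv.reader(io.StringIO(line)))

print(naive)   # ['Smith', '"Portland', ' OR"', '42']
print(parsed)  # ['Smith', 'Portland, OR', '42']
```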
I'm using Python's csv module to do some reading and writing of csv files.
I've got the reading fine and appending to the csv fine, but I want to be able to overwrite a specific row in the csv.
For reference, here's my reading and then writing code to append:
#reading
b = open("bottles.csv", "rb")
bottles = csv.reader(b)
bottle_list = []
bottle_list.extend(bottles)
b.close()
#appending
b=open('bottles.csv','a')
writer = csv.writer(b)
writer.writerow([bottle,emptyButtonCount,100, img])
b.close()
And I'm using basically the same for the overwrite mode (which isn't correct; it just overwrites the whole csv file):
b=open('bottles.csv','wb')
writer = csv.writer(b)
writer.writerow([bottle,btlnum,100,img])
b.close()
In the second case, how do I tell Python I need a specific row overwritten? I've scoured Google and other Stack Overflow posts to no avail. I assume my limited programming knowledge is to blame rather than Google.
I will add to Steven's answer:
import csv
bottle_list = []
# Read all data from the csv file.
with open('a.csv', 'rb') as b:
    bottles = csv.reader(b)
    bottle_list.extend(bottles)
# data to override in the format {line_num_to_override:data_to_write}.
line_to_override = {1:['e', 'c', 'd'] }
# Write data to the csv file and replace the lines in the line_to_override dict.
with open('a.csv', 'wb') as b:
    writer = csv.writer(b)
    for line, row in enumerate(bottle_list):
        data = line_to_override.get(line, row)
        writer.writerow(data)
You cannot overwrite a single row in the CSV file. You'll have to write all the rows you want to a new file and then rename it back to the original file name.
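A sketch of that write-then-rename approach, written for Python 3 and streaming one row at a time so the file never has to fit in memory. The helper name replace_rows and its overrides argument (a dict mapping a zero-based row index to the replacement row) are assumptions made for illustration.

```python
import csv
import os
import tempfile


def replace_rows(path, overrides):
    """Rewrite the CSV at path, using overrides[index] in place of row index when present."""
    directory = os.path.dirname(os.path.abspath(path))
    with open(path, newline='') as source, tempfile.NamedTemporaryFile(
        'w', newline='', dir=directory, delete=False
    ) as target:
        writer = csv.writer(target)
        for index, row in enumerate(csv.reader(source)):
            writer.writerow(overrides.get(index, row))
    # Both files are closed here; rename the new file over the original.
    os.replace(target.name, path)
```

For example, replace_rows('bottles.csv', {1: ['e', 'c', 'd']}) would rewrite only the second row and copy every other row through unchanged.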
Your pattern of usage may fit a database better than a CSV file. Look into the sqlite3 module for a lightweight database.
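For comparison, here is roughly what overwriting a single record could look like with sqlite3. The table bottles, its columns and the sample values are hypothetical, and an in-memory database is used only to keep the sketch self-contained.

```python
import sqlite3

connection = sqlite3.connect(":memory:")  # pass a file path instead to persist the data
connection.execute(
    "CREATE TABLE bottles (name TEXT PRIMARY KEY, quantity INTEGER, level INTEGER)"
)
connection.execute("INSERT INTO bottles VALUES (?, ?, ?)", ("gin", 3, 100))
connection.execute("INSERT INTO bottles VALUES (?, ?, ?)", ("rum", 5, 100))

# Overwrite one record in place, without touching the others.
connection.execute("UPDATE bottles SET quantity = ? WHERE name = ?", (4, "gin"))
connection.commit()

rows = connection.execute("SELECT name, quantity FROM bottles ORDER BY name").fetchall()
print(rows)  # [('gin', 4), ('rum', 5)]
```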