Optimize adding a column to a CSV file (~300GB) with Python

I want to add a column to a CSV file; the new column is the difference of two other columns in the same file. I use Python (pandas) to do that, and this is what I do:
import pandas as pd

row = ['times1', 'times2']
for df1 in pd.read_csv('C:/SET/parti_no_diff.CSV', skipinitialspace=True,
                       usecols=row, chunksize=10**7):
    df1['time_difference'] = (df1['times2'].astype('datetime64[s]')
                              - df1['times1'].astype('datetime64[s]')).abs()
    df1.to_csv('E:/SET/parti_with_diff_seconds.csv', mode='a')
I use a machine with 12GB of RAM and an external 2TB hard disk (5200RPM); the input is not on the same hard disk as the output. The program takes more than 24 hours. How can I optimize it?

Honestly, Python's built-in functionality for reading and writing text files is well suited to this. Read a single line at a time, append your extra column, then write the line to the output file. It will happen faster than you think, and you can use something like tqdm to monitor progress.
Something like:
import csv
from tqdm import tqdm

with open('myfile.txt', newline='') as f, \
     open('output.csv', 'w', newline='') as outfile:
    reader = csv.reader(f)
    writer = csv.writer(outfile)
    for row in tqdm(reader):
        row.append('new_column')  # compute your real value here
        writer.writerow(row)
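Adapting that pattern to the original task (computing the absolute difference of the times1 and times2 columns), a sketch might look like this; the timestamp format is an assumption and may need adjusting:
import csv
from datetime import datetime

FMT = '%Y-%m-%d %H:%M:%S'  # assumed timestamp format

with open('C:/SET/parti_no_diff.CSV', newline='') as infile, \
     open('E:/SET/parti_with_diff_seconds.csv', 'w', newline='') as outfile:
    reader = csv.DictReader(infile)
    writer = csv.writer(outfile)
    writer.writerow(reader.fieldnames + ['time_difference'])
    for row in reader:
        t1 = datetime.strptime(row['times1'], FMT)
        t2 = datetime.strptime(row['times2'], FMT)
        diff = abs((t2 - t1).total_seconds())  # difference in seconds
        writer.writerow([row[f] for f in reader.fieldnames] + [diff])
This streams the 300GB file one row at a time, so memory use stays constant regardless of file size.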

Related

Get help for Python code that gets data from one CSV file and inserts it into another CSV file

I am doing a data migration. The old application's data is exported as one CSV file. We cannot import this CSV file directly into the new application; we need to create a new CSV template that matches the new application and import some of the data into it. I would like to request code that facilitates this requirement.
I'm not exactly sure what template you want to end up with. I'm going to assume that you either want to change the number/order of columns or the delimiter.
The simplest thing is to read it line by line and write it:
import csv

with open("Old.csv", newline='') as readfp, open("new.csv", 'w', newline='') as writefp:
    csvReader = csv.reader(readfp)
    csvWriter = csv.writer(writefp, delimiter=',')
    for line in csvReader:
        # line is a list of strings, so you can reorder it as you wish.
        # I'll skip the third column as an example.
        csvWriter.writerow(line[:2] + line[3:])
If you have pandas installed, this is even simpler:
import pandas as pd

df = pd.read_csv("Old.csv")
df = df.drop(columns=["name_of_bad_col1", "name_of_bad_col2"])  # drop returns a new frame
df.to_csv("new.csv", index=False)
If you are going the pandas route, make sure to check out the documentation (read_csv, to_csv).

Is it possible to open a large CSV without loading it into RAM entirely

Suppose I have a very large csv file.
file = open("foo.csv")
seems to put the whole CSV in RAM. If I just need the first row of the CSV but don't want Python to load the entire file, is there a way to do it?
If you just need the first row then you can use the csv module like so.
import csv

with open("foo.csv", "r") as my_csv:
    reader = csv.reader(my_csv)
    first_row = next(reader)
    # do stuff with first_row
The csv reader is lazy: rows are read from the file only as you request them, so the whole file is never loaded into RAM.
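If you need the first few rows rather than just the first one, itertools.islice works the same lazy way (a minimal sketch):
import csv
from itertools import islice

with open("foo.csv", newline="") as my_csv:
    reader = csv.reader(my_csv)
    first_five = list(islice(reader, 5))  # reads only five rows from disk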

How can I open multiple csv files in a folder, take the average of a column and save in a separate file using python?

I am extremely new to Python and need some help with this one. I've tried various pieces of code and none seem to work, so suggestions would be awesome.
I have a folder with about 1500 CSV files that each contain multiple columns of data. I need to take the average of the first column, called "agr", and save this value in a different Excel or CSV file. It would be great if I could also somehow save the name of the file with its averaged value so that I can keep track of which file it came from. The files are named crop_city (e.g. corn_omaha).
import glob
import csv
import numpy as np
import pandas as pd

path = ('C:/test/*.csv')
for fname in glob.glob(path):
    with open(fname) as csvfile:
        agr = []
        reader = csv.DictReader(fname)
        print row['agr']
I know the code above is extremely rudimentary, so any help would be great. Thanks, everyone!
Assuming the first column in these CSV files is a decimal or float, you don't really need to parse the entire line. Just split at the first separator and parse the first token. There is no real advantage to numpy or pandas here; just use the built-in sum function.
import glob
import os

path = 'test/*.csv'  # using local dir for test

with open('output.csv', 'w', newline='') as outfile:
    outfile.write("Filename,Sum\r\n")  # header for output
    for fname in glob.glob(path):
        with open(fname) as csvfile:
            next(csvfile)  # skip header
            outfile.write("{},{}\r\n".format(
                os.path.basename(fname),
                sum(float(line.split(',', 1)[0].strip())
                    for line in csvfile)))
Contrary to the answer by @tdelaney, I would not advise you to limit your code by relying on the fact that you are adding up the first column; what if you need to work with the third column next week? It's easy to do this properly by building on the code you provided. Parsing a couple of thousand text files is not going to slow you down.
The csv.DictReader constructor will automatically treat the first row of its input as a header (unless you explicitly specify a list of column names with the fieldnames parameter). So your code can look like this:
import csv
import glob

path = 'C:/test/*.csv'  # as in the question
averages = []
for fname in glob.glob(path):
    with open(fname, newline='') as csvfile:
        reader = csv.DictReader(csvfile)
        values = [float(row["agr"]) for row in reader]
        avg = sum(values) / len(values)
        averages.append((fname, avg))
The list averages now contains the numbers you want. This is how you write it out to another CSV file:
with open("avegages.csv", "wb") as outfile:
writer = csv.writer(outfile)
writer.writerow(["File", "Average agr"])
for row in averages:
writer.writerow(row)
PS. Since you included pandas in your imports, here's one way to do the same thing with pandas. However, I recommend sticking with csv for now. The pandas object model is complex, and hard to wrap your head around.
averages = []
for fname in glob.glob(path):
    data = pd.read_csv(fname)  # DataFrame.from_csv has been removed from pandas
    averages.append((fname, data["agr"].mean()))

df_out = pd.DataFrame.from_records(averages, columns=["File", "Average agr"])
df_out.to_csv("averages.csv", index=False)
As you can see, the code is a lot shorter, since the file I/O and the calculation can each be done in one statement.

Python CSV parsing fills up memory

I have a CSV file which has over a million rows and I am trying to parse this file and insert the rows into the DB.
with open(file, "rb") as csvfile:
    re = csv.DictReader(csvfile)
    for row in re:
        # insert row['column_name'] into DB
For CSV files below 2 MB this works well, but anything more than that ends up eating my memory. It is probably because I store the DictReader's contents in a list called "re" and it is not able to loop over such a huge list. I definitely need to access the CSV file by its column names, which is why I chose DictReader, since it easily provides column-level access to my CSV files. Can anyone tell me why this is happening and how it can be avoided?
The DictReader does not load the whole file into memory; it reads it lazily, row by row, as explained in this answer suggested by DhruvPathak.
But depending on your database engine, the actual write to disk may only happen at commit. That means that the database (and not the csv reader) keeps all the data in memory and eventually exhausts it.
So you should try to commit every n records, with n typically between 10 and 1000 depending on the size of your rows and the available memory.
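For example, with sqlite3, batched commits might look like this (the table and column names are hypothetical):
import csv
import sqlite3

BATCH_SIZE = 1000  # commit every n rows; tune to your memory budget

conn = sqlite3.connect("data.db")
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS events (column_name TEXT)")  # hypothetical table

with open("data.csv", newline="") as csvfile:
    reader = csv.DictReader(csvfile)
    for i, row in enumerate(reader, start=1):
        cur.execute("INSERT INTO events (column_name) VALUES (?)",
                    (row["column_name"],))
        if i % BATCH_SIZE == 0:
            conn.commit()  # flush this batch instead of buffering everything

conn.commit()  # commit the final partial batch
conn.close()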
If you don't need all the columns at once, you can simply read the file line by line, as you would with a plain text file, and parse each row. The exact parsing depends on your data format, but you could do something like:
delimiter = ','
with open(filename, 'r') as fil:
    headers = next(fil).strip().split(delimiter)
    dic_headers = {hdr: i for i, hdr in enumerate(headers)}
    for line in fil:
        row = line.strip().split(delimiter)
        # do something with row[dic_headers['column_name']]
This is a very simple example, but it can be made more elaborate. Note that it does not work if your data contains quoted fields with embedded commas.
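If embedded commas are a possibility, the csv module handles quoting for you while still streaming the file; a minimal sketch, reusing the hypothetical filename and column name from above:
import csv

with open(filename, newline='') as fil:
    reader = csv.reader(fil)
    headers = next(reader)
    dic_headers = {hdr: i for i, hdr in enumerate(headers)}
    for row in reader:
        value = row[dic_headers['column_name']]  # quoted commas are handled correctly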

Reading CSV files and rewriting them without certain rows in Python

I am new to programming. I have hundreds of CSV files in a folder, and certain files have the letters DIF in the second column. I want to rewrite the CSV files without those lines in them. I have attempted doing that for one file and have put my attempt below. I also need help getting the program to do that for all the files in my directory. Any help would be appreciated.
Thank you
import csv

reader = csv.reader(open("40_5.csv", "r"))
for row in reader:
    if row[1] == 'DIF':
        csv.writer(open('40_5N.csv', 'w')).writerow(row)
I made some changes to your code:
import csv
import glob
import os

fns = glob.glob('*.csv')
for fn in fns:
    with open(fn, newline='') as infile, \
         open(os.path.join('out', fn), 'w', newline='') as f:
        reader = csv.reader(infile)
        w = csv.writer(f)
        for row in reader:
            if 'DIF' not in row:
                w.writerow(row)
The glob call produces a list of all files ending in .csv in the current directory. If you want to pass the source directory as an argument to your program, have a look at sys.argv or argparse (the latter in particular is very powerful for command-line parsing; a minimal sketch follows the links below).
You also have to be careful when opening a file in 'w' mode: it truncates the file, so in your loop you would keep overwriting the existing output file and end up with only one csv line.
The directory 'out' must exist, or the script will raise an IOError.
Links:
open
sys.argv
argparse
glob
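As mentioned above, argparse makes it straightforward to take the source directory as a command-line argument. A minimal sketch (the argument name is an assumption):
import argparse
import glob
import os

parser = argparse.ArgumentParser(description="Rewrite CSV files without DIF rows")
parser.add_argument("src_dir", help="directory containing the .csv files")
args = parser.parse_args()

fns = glob.glob(os.path.join(args.src_dir, '*.csv'))  # then filter each file as above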
Most sequence types support the in and not in operators, which are much simpler to use for testing values than figuring out index positions.
with open('40_5N.csv', 'w', newline='') as outfile:  # open the output once, not per row
    writer = csv.writer(outfile)
    for row in reader:
        if 'DIF' not in row:
            writer.writerow(row)
If you're willing to install numpy, you can also read a csv file into numpy's convenient array format with either recfromcsv or the more general genfromtxt (genfromtxt requires you to specify the comma delimiter), and you can specify which rows and columns to ignore. Documentation for genfromtxt can be found here:
http://docs.scipy.org/doc/numpy/user/basics.io.genfromtxt.html
And here for recfromcsv: http://nullege.com/codes/search/numpy.recfromcsv?fulldoc=1
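A minimal genfromtxt sketch (the file name and column selection are assumptions):
import numpy as np

# skip_header drops the header row; usecols keeps only the listed columns
data = np.genfromtxt("40_5.csv", delimiter=",", skip_header=1,
                     usecols=(0, 2), dtype=None, encoding="utf-8")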
