I have a large CSV (hundreds of millions of rows) and I need to sum the Value column based on the grouping of the ID, Location, and Date columns.
My CSV is similar to:
ID Location Date Value
1 1 Loc1 2022-01-27 5
2 1 Loc1 2022-01-27 4
3 1 Loc1 2022-01-28 7
4 1 Loc2 2022-01-29 8
5 2 Loc1 2022-01-27 11
6 2 Loc2 2022-01-28 4
7 2 Loc2 2022-01-29 6
8 3 Loc1 2022-01-28 9
9 3 Loc1 2022-01-28 9
10 3 Loc2 2022-01-29 1
{ID: 1, Location: Loc1, Date: 2022-01-27} is one such group, and its sub values 5 and 4 should be summed to 9
{ID: 3, Location: Loc1, Date: 2022-01-28} is another group and its sum should be 18
Here's what that sample input should look like, processed/summed, and written to a new CSV:
ID Location Date Value
1 Loc1 2022-01-27 9
1 Loc1 2022-01-28 7
1 Loc2 2022-01-29 8
2 Loc1 2022-01-27 11
2 Loc2 2022-01-28 4
2 Loc2 2022-01-29 6
3 Loc1 2022-01-28 18
3 Loc2 2022-01-29 1
I know using df.groupby([columns]).sum() would give the desired result, but the CSV is so big that I keep getting memory errors. I've tried looking at other ways to read/manipulate the CSV data but still haven't been successful, so if anyone knows a way I can do this in Python without maxing out my memory, that would be great!
NB: I know there is an unnamed first column in my initial CSV; it's irrelevant and doesn't need to be in the output, but it doesn't matter if it is :)
The appropriate answer is probably to use Dask, but you can do it with Pandas and chunking. The last_row variable holds the last row of the previous chunk, in case the first row of the current chunk has the same ID, Location and Date.
import pandas as pd

chunksize = 4  # Number of rows per chunk (use a much larger value in practice)
last_row = pd.DataFrame()  # Last row of the previous chunk

with open('data.csv') as reader, open('output.csv', 'w') as writer:
    # Write headers
    writer.write(reader.readline())
    reader.seek(0)

    for chunk in pd.read_csv(reader, chunksize=chunksize):
        df = pd.concat([last_row, chunk])
        df = df.groupby(['ID', 'Location', 'Date'], as_index=False)['Value'].sum()
        df, last_row = df.iloc[:-1], df.iloc[-1:]
        df.to_csv(writer, header=False, index=False)

    # Don't forget the last row!
    last_row.to_csv(writer, header=False, index=False)
Content of output.csv:
ID,Location,Date,Value
1,Loc1,2022-01-27,9
1,Loc1,2022-01-28,7
1,Loc2,2022-01-29,8
2,Loc1,2022-01-27,11
2,Loc2,2022-01-28,4
2,Loc2,2022-01-29,6
3,Loc1,2022-01-28,18
3,Loc2,2022-01-29,1
If the lines to be combined are consecutive, the good old csv module allows processing huge files one line at a time, hence with a minimal memory footprint.
Here you could use:
import csv

with open('input.csv') as fd, open('output.csv', 'w', newline='') as fdout:
    rd, wr = csv.reader(fd), csv.writer(fdout)
    _ = wr.writerow(next(rd))      # header line
    old = [None] * 4
    for row in rd:
        row[3] = int(row[3])       # convert value field to integer
        if row[:3] == old[:3]:
            old[3] += row[3]       # merge values of similar rows
        else:
            if old[0]:             # and write the merged row
                _ = wr.writerow(old)
            old = row
    if old[0]:                     # do not forget the last row...
        _ = wr.writerow(old)
With the shown input data, it gives as expected:
ID,Location,Date,Value
1,Loc1,2022-01-27,9
1,Loc1,2022-01-28,7
1,Loc2,2022-01-29,8
2,Loc1,2022-01-27,11
2,Loc2,2022-01-28,4
2,Loc2,2022-01-29,6
3,Loc1,2022-01-28,18
3,Loc2,2022-01-29,1
Not as clean and neat as the Pandas code, but it should process files greater than the available memory without any problem.
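The same consecutive-rows assumption can also be expressed with itertools.groupby, which groups adjacent rows sharing a key. A minimal sketch with made-up, already-parsed rows (not the author's code, just an equivalent formulation):

```python
import itertools

# hypothetical rows, already split by csv.reader and with Value converted to int
rows = [['1', 'Loc1', '2022-01-27', 5],
        ['1', 'Loc1', '2022-01-27', 4],
        ['1', 'Loc1', '2022-01-28', 7]]

# groupby only merges *adjacent* rows with the same key,
# which is exactly the "consecutive lines" assumption above
summed = [[*key, sum(r[3] for r in grp)]
          for key, grp in itertools.groupby(rows, key=lambda r: r[:3])]

# summed == [['1', 'Loc1', '2022-01-27', 9], ['1', 'Loc1', '2022-01-28', 7]]
```

Like the csv-module loop, this holds only one group in memory at a time, but silently produces duplicate groups if the input is not sorted.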
You could use the built-in csv library and build up the output line by line. A Counter can be used to combine and sum rows with the same entries:
from collections import Counter
import csv

data = Counter()

with open('input.csv') as f_input:
    csv_input = csv.reader(f_input)
    header = next(csv_input)
    for row in csv_input:
        data[tuple(row[:3])] += int(row[3])

with open('output.csv', 'w', newline='') as f_output:
    csv_output = csv.writer(f_output)
    csv_output.writerow(header)
    for key, value in data.items():
        csv_output.writerow([*key, value])
Giving the output:
ID,Location,Date,Value
1,Loc1,2022-01-27,9
1,Loc1,2022-01-28,7
1,Loc2,2022-01-29,8
2,Loc1,2022-01-27,11
2,Loc2,2022-01-28,4
2,Loc2,2022-01-29,6
3,Loc1,2022-01-28,18
3,Loc2,2022-01-29,1
This avoids storing the input CSV in memory, only the output CSV data.
If this is also too large, a slight variation would be to output the data whenever the ID column changes. This does assume, though, that the input is in ID order:
from collections import Counter
import csv

def write_id(csv_output, data):
    for key, value in data.items():
        csv_output.writerow([*key, value])
    data.clear()

data = Counter()
current_id = None

with open('input.csv') as f_input, open('output.csv', 'w', newline='') as f_output:
    csv_input = csv.reader(f_input)
    csv_output = csv.writer(f_output)
    header = next(csv_input)
    csv_output.writerow(header)
    for row in csv_input:
        if current_id and row[0] != current_id:
            write_id(csv_output, data)
        data[tuple(row[:3])] += int(row[3])
        current_id = row[0]
    write_id(csv_output, data)
For the given example, this would give the same output.
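The core trick in both versions is that a Counter keyed by (ID, Location, Date) tuples accumulates sums rather than simple counts, because missing keys default to 0. A minimal sketch with hypothetical rows:

```python
from collections import Counter

# hypothetical rows, as produced by csv.reader
rows = [['1', 'Loc1', '2022-01-27', '5'],
        ['1', 'Loc1', '2022-01-27', '4'],
        ['1', 'Loc1', '2022-01-28', '7']]

data = Counter()
for row in rows:
    data[tuple(row[:3])] += int(row[3])  # missing keys start at 0

# data[('1', 'Loc1', '2022-01-27')] is now 9
```

Only one counter entry exists per distinct group, so memory scales with the number of groups, not the number of input rows.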
Have you tried:
output = []
for key, group in df.groupby(['ID', 'Location', 'Date']):
    output.append((*key, group['Value'].sum()))
pd.DataFrame(output).to_csv("....csv")
source:
https://stackoverflow.com/a/54244289/7132906
There are a number of answers already that may suffice: @MartinEvans and @Corralien both recommend breaking up/chunking the input and output.
I'm especially curious whether @MartinEvans's answer works within your memory constraints: it's the simplest and still-correct solution so far (as I see it).
If either of those don't work, I think you'll be faced with the question:
What makes a chunk such that all the ID/Loc/Date groups I need to sum are contained in that chunk, so no group crosses over a chunk boundary and gets counted multiple times (ending up with smaller sub-sums instead of a single, true sum)?
In a comment on the OP you said the input was sorted by "week number". I think this is the single deciding factor for when you have all the counts you'll get for a group of ID/Loc/Date. As the reader crosses week-group boundaries, it'll know it's "safe" to stop counting any of the groups encountered so far, and can flush those counts to disk (to avoid holding on to too many counts in memory).
This solution relies on the pre-sorted-ness of your input CSV. Though, if your input was a bit out of sorts: you could run this, test for duplicate groups, re-sort, and re-run this (I see this problem as making a big, memory-constrained reducer):
import csv
from collections import Counter
from datetime import datetime

# Get the data out...
out_csv = open('output.csv', 'w', newline='')
writer = csv.writer(out_csv)

def write_row(row):
    writer.writerow(row)

# Don't let counter get too big (for memory)
def flush_counter(counter):
    for key, sum_ in counter.items():
        id_, loc, date = key
        write_row([id_, loc, date, sum_])

# You said "already grouped by week-number", so:
# - read and sum your input CSV in chunks of "week (number) groups"
# - once the reader reads past a week-group, it concludes the week-group is finished
#   and flushes the counts for that week-group
last_wk_group = None
counter = Counter()

# Open input
with open('input.csv', newline='') as f:
    reader = csv.reader(f)

    # Copy header
    header = next(reader)
    write_row(header)

    for row in reader:
        # Get "base" values
        id_, loc, date = row[0:3]
        value = int(row[3])

        # 2022-01-27 -> 2022-04
        wk_group = datetime.strptime(date, r'%Y-%m-%d').strftime(r'%Y-%U')

        # Decide if last week-group has passed
        if wk_group != last_wk_group:
            flush_counter(counter)
            counter = Counter()
            last_wk_group = wk_group

        # Count/sum this week-group's values
        key = (id_, loc, date)
        counter[key] += value

# Flush remaining week-group counts
flush_counter(counter)
out_csv.close()
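As a quick sanity check of the %U directive used above (it maps a date to its zero-padded, Sunday-first week number within the year):

```python
from datetime import datetime

# 2022-01-27 falls in the 4th Sunday-first week of 2022
wk_group = datetime.strptime('2022-01-27', r'%Y-%m-%d').strftime(r'%Y-%U')
print(wk_group)  # 2022-04
```

Note that %U counts weeks from the first Sunday of the year; days before it fall in week 00, so week boundaries won't line up with ISO week numbers (%V).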
As a basic test, I moved the first row of your sample input to the last row, like @Corralien was asking:
ID,Location,Date,Value
1,Loc1,2022-01-27,5
1,Loc1,2022-01-28,7
1,Loc2,2022-01-29,8
2,Loc1,2022-01-27,11
2,Loc2,2022-01-28,4
2,Loc2,2022-01-29,6
3,Loc1,2022-01-28,9
3,Loc1,2022-01-28,9
3,Loc2,2022-01-29,1
1,Loc1,2022-01-27,4
and I still get the correct output (even in the correct order, because 1,Loc1,2022-01-27 appeared first in the input):
ID,Location,Date,Value
1,Loc1,2022-01-27,9
1,Loc1,2022-01-28,7
1,Loc2,2022-01-29,8
2,Loc1,2022-01-27,11
2,Loc2,2022-01-28,4
2,Loc2,2022-01-29,6
3,Loc1,2022-01-28,18
3,Loc2,2022-01-29,1
I'm new to programming and trying to supplement my learning by doing some online tutorials. Today, I started looking at working with CSV files using a tutorial that seemed easy enough to follow, but I've run into what amounts to an immaterial problem, and it's frustrating me that I can't figure it out, haha. I've spent around two hours Googling and testing things, but I'm just not savvy enough to know what to try next. Help, please! haha.
Here's the code in question:
# importing the csv module
import csv

# csv filename
filename = r'C:\Users\XXX\Documents\AAPL.csv'

# initialize the titles and row list
fields = []
rows = []

# read the csv file
with open(filename, 'r') as csvfile:
    # create the csv reader object
    csvreader = csv.reader(csvfile)

    # extract field names through the first row
    fields = next(csvreader)

    # extract each data row one by one
    for row in csvreader:
        rows.append(row)

    # get total number of rows
    print("total no. of rows: %d" % (csvreader.line_num))

# print the field names
print("Field names are: " + ", ".join(field for field in fields))

# print the first 5 rows of data
print("\nFirst 5 rows are:\n")
for row in rows[:5]:
    # parse each column of a row
    for col in row:
        print("%10s" % col),
    print("\n")
The tutorial was actually written for Python 2.x, so I found the updated formatting for 3.6 and changed that last statement to be:
for col in row:
    print('{:>10}'.format(col))
print("\n")
Either way it's written, the results come out in this format:
First 5 rows are:
2013-09-18
66.168571
66.621429
65.808571
66.382858
60.492519
114215500
...
instead of the expected columnar format shown on the tutorial.
I thought I finally found the solution when I read somewhere that you needed the formatting for each item, so I tried:
for col in row:
    print('{:>10} {:>10} {:>10} {:>10} {:>10} {:>10} {:>10}'.format(*col))
print("\n")
so that the formatting was there for each column; however, that seems to create a column for each letter in the field, e.g.:
2 0 1 3 - 0 9
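That happens because format(*col) unpacks the string col into one argument per character, so each {:>10} slot receives a single letter. A tiny demonstration of the difference:

```python
col = '2013-09-18'

# *col unpacks the string: each character fills one slot
print('{:>2} {:>2} {:>2}'.format(*col[:3]))  # prints " 2  0  1"

# without unpacking, the whole field fills a single 10-wide slot
print('{:>10}'.format(col))  # prints "2013-09-18"
```

Since the inner loop already visits one field at a time, a single placeholder per print call is all that's needed.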
The CSV is just a file of AAPL's stock prices; here are the first 9 rows of data if you want to create a CSV for testing:
Date,Open,High,Low,Close,Adj Close,Volume
2013-09-18,66.168571,66.621429,65.808571,66.382858,60.492519,114215500
2013-09-19,67.242859,67.975716,67.035713,67.471428,61.484497,101135300
2013-09-20,68.285713,68.364288,66.571426,66.772858,60.847912,174825700
2013-09-23,70.871429,70.987144,68.942856,70.091431,63.872025,190526700
2013-09-24,70.697144,70.781425,69.688568,69.871429,63.671543,91086100
2013-09-25,69.885712,69.948570,68.775711,68.790001,62.686062,79239300
2013-09-26,69.428574,69.794289,69.128571,69.459999,63.296616,59305400
2013-09-27,69.111427,69.238571,68.674286,68.964287,62.844891,57010100
2013-09-30,68.178574,68.808571,67.772858,68.107140,62.063782,65039100
# importing csv module
import csv

# csv file name
filename = r'C:\Users\XXX\Documents\AAPL.csv'

# initialize the titles and row list
fields = []
rows = []

# read the csv file
with open(filename, 'r') as csvfile:
    # create the csv reader object
    csvreader = csv.reader(csvfile)

    # extract field names through the first row
    fields = next(csvreader)

    # extract each data row one by one
    for row in csvreader:
        rows.append(row)

    # get total number of rows
    print("total no. of rows: %d" % (csvreader.line_num))

# print the field names
print("Field names are: " + ", ".join(field for field in fields))

# print the first 5 rows of data
print("\nFirst 5 rows are:\n")
for row in rows[:5]:
    # parse each column of a row
    for col in row:
        print("%10s" % col, end=',')
    print("\n")
You need to replace print("%10s"%col), with print("%10s" % col, end=',')
Krishnaa208's answer didn't quite give me the right format: print("%10s"%col, end=',') gave a table that included the commas. But it did point me in the right direction, which was:
# print the first 5 rows of data
print("\nFirst 5 rows are:\n")
for row in rows[:5]:
    # parse each column of a row
    for col in row:
        print('{:>12}'.format(col), end='')
    print("\n")
and my results were:
First 5 rows are:
2013-09-18 66.168571 66.621429 65.808571 66.382858 60.492519 114215500
2013-09-19 67.242859 67.975716 67.035713 67.471428 61.484497 101135300
2013-09-20 68.285713 68.364288 66.571426 66.772858 60.847912 174825700
2013-09-23 70.871429 70.987144 68.942856 70.091431 63.872025 190526700
2013-09-24 70.697144 70.781425 69.688568 69.871429 63.671543 91086100
({:>10} was a little too close together, since my CSV had prices down to six decimal places.)
Thanks for the answer, though. It really did help!
I am generating a list of phone numbers from a registry into a Python list. I would then like to iterate through this list and add it as a new column in the middle of an already existing CSV file. The column is index 9 in the CSV. Below is the code I have attempted to write, but I get a "list index out of range" error.
import csv

with open('input.csv', 'r') as csvinput:
    with open('output.csv', 'w') as csvoutput:
        writer = csv.writer(csvoutput)
        i = 0
        for row in csv.reader(csvinput):
            row[9] = writer.writerow(phone_list[i])
            i += 1
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-10-74ca13edafc8> in <module>()
6 i=0
7 for row in csv.reader(csvinput):
----> 8 row[9]=writer.writerow(phone_list[i])
9 i+=1
IndexError: list index out of range
Any and all help is appreciated.
Your code should almost work. You can use zip() to iterate over the input file and the list of phones at the same time, instead of using the i variable as a numerical index:
for row, phone in zip(csv.reader(csvinput), phone_list):
    row[9] = phone
    writer.writerow(row)
Now, if you're still getting that IndexError, it means your csv file has some line that doesn't contain 10 columns in the first place (index 9 is the 10th column, since indexes start at 0). Double-check your csv file. Try this testing code to check:
for row, phone in zip(csv.reader(csvinput), phone_list):
    if len(row) < 10:
        print('FOUND A ROW WITH {} COLUMNS: '.format(len(row)), row)
    else:
        row[9] = phone
        writer.writerow(row)
Another solution is to pad every row that has fewer than 10 columns with empty columns:
for row, phone in zip(csv.reader(csvinput), phone_list):
    row.extend([''] * (10 - len(row)))
    row[9] = phone
    writer.writerow(row)
In your for loop, insert the new column into the row list, then write that row to the csv:
for i, row in enumerate(reader):
    row.insert(9, phone_list[i])
    writer.writerow(row)
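Putting that idea together, here's a self-contained sketch (using io.StringIO with made-up data in place of the real files, and a hypothetical phone_list; the header is copied first so it doesn't get paired with a phone number):

```python
import csv
import io

# stand-ins for the real input.csv / output.csv and the generated phone list
csvinput = io.StringIO('c1,c2,c3,c4,c5,c6,c7,c8,c9,c10\n'
                       'a,b,c,d,e,f,g,h,i,j\n')
csvoutput = io.StringIO()
phone_list = ['555-0100']

reader = csv.reader(csvinput)
writer = csv.writer(csvoutput)
writer.writerow(next(reader))  # copy the header row unchanged
for row, phone in zip(reader, phone_list):
    row.insert(9, phone)       # the new column becomes index 9
    writer.writerow(row)
```

With real files, replace the StringIO objects with open(...) calls (and newline='' on the output file).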
If you are using Anaconda, or have knowledge of pandas:
import pandas as pd  # import the pandas library

input = pd.read_csv("input.csv")    # your input csv
output = pd.read_csv("output.csv")  # your output csv

input_array = list(input['Name of column'])  # use this if you know the name of the column
input_array = list(input.iloc[:, 0])         # use this if you know the index of the column (indexes start at 0)

# insert the array into output.csv as the 10th column (loc=9, since indexes start at 0);
# note that DataFrame.insert modifies output in place and returns None, so don't reassign its result
output.insert(loc=9, column='Name of the new column', value=input_array)

output.to_csv("path and output.csv", index=False)  # index=False: don't write an extra index column
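One pitfall worth calling out with this approach: DataFrame.insert modifies the frame in place and returns None, so writing output = output.insert(...) would silently replace the frame with None. A minimal illustration with made-up data:

```python
import pandas as pd

output = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# insert works in place; do NOT write "output = output.insert(...)"
output.insert(loc=1, column='phone', value=['555-0100', '555-0101'])

print(list(output.columns))  # ['a', 'phone', 'b']
```

The new column lands at the requested position, with the existing columns shifted right.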