I have a large CSV (hundreds of millions of rows) and I need to sum the Value column based on the grouping of the ID, Location, and Date columns.
My CSV is similar to:
    ID Location        Date  Value
1    1     Loc1  2022-01-27      5
2    1     Loc1  2022-01-27      4
3    1     Loc1  2022-01-28      7
4    1     Loc2  2022-01-29      8
5    2     Loc1  2022-01-27     11
6    2     Loc2  2022-01-28      4
7    2     Loc2  2022-01-29      6
8    3     Loc1  2022-01-28      9
9    3     Loc1  2022-01-28      9
10   3     Loc2  2022-01-29      1
{ID: 1, Location: Loc1, Date: 2022-01-27} is one such group, and its sub values 5 and 4 should be summed to 9
{ID: 3, Location: Loc1, Date: 2022-01-28} is another group and its sum should be 18
Here's what that sample input should look like, processed/summed, and written to a new CSV:
ID Location Date Value
1 Loc1 2022-01-27 9
1 Loc1 2022-01-28 7
1 Loc2 2022-01-29 8
2 Loc1 2022-01-27 11
2 Loc2 2022-01-28 4
2 Loc2 2022-01-29 6
3 Loc1 2022-01-28 18
3 Loc2 2022-01-29 1
I know using df.groupby([columns]).sum() would give the desired result, but the CSV is so big I keep getting memory errors. I've tried looking at other ways to read/manipulate CSV data but have still not been successful, so if anyone knows a way I can do this in python without maxing out my memory that would be great!
NB: I know there is an unnamed first column in my initial CSV. It is irrelevant and doesn't need to be in the output, but it doesn't matter if it is :)
The appropriate answer is probably to use Dask, but you can also do it with Pandas by reading the file in chunks. The last_row variable holds the last row of the previous chunk, in case the first row of the current chunk has the same ID, Location and Date. This assumes that rows belonging to the same group are consecutive in the file.
import pandas as pd
chunksize = 4  # Number of rows
last_row = pd.DataFrame()  # Last row of the previous chunk
with open('data.csv') as reader, open('output.csv', 'w') as writer:
    # Write headers
    writer.write(reader.readline())
    reader.seek(0)
    for chunk in pd.read_csv(reader, chunksize=chunksize):
        # Prepend the row carried over from the previous chunk
        df = pd.concat([last_row, chunk])
        df = df.groupby(['ID', 'Location', 'Date'], as_index=False)['Value'].sum()
        # Hold back the last group: it may continue in the next chunk
        df, last_row = df.iloc[:-1], df.iloc[-1:]
        df.to_csv(writer, header=False, index=False)
    # Don't forget the last row!
    last_row.to_csv(writer, header=False, index=False)
Content of output.csv:
ID,Location,Date,Value
1,Loc1,2022-01-27,9
1,Loc1,2022-01-28,7
1,Loc2,2022-01-29,8
2,Loc1,2022-01-27,11
2,Loc2,2022-01-28,4
2,Loc2,2022-01-29,6
3,Loc1,2022-01-28,18
3,Loc2,2022-01-29,1
If the lines to be combined are consecutive, the good old csv module lets you process huge files one line at a time, and hence with a minimal memory footprint.
Here you could use:
import csv
with open('input.csv', newline='') as fd, open('output.csv', 'w', newline='') as fdout:
    rd, wr = csv.reader(fd), csv.writer(fdout)
    _ = wr.writerow(next(rd))  # header line
    old = [None]*4
    for row in rd:
        row[3] = int(row[3])  # convert value field to integer
        if row[:3] == old[:3]:
            old[3] += row[3]  # add the value to the running sum of the group
        else:
            if old[0]:  # a group just ended: write its summed row
                _ = wr.writerow(old)
            old = row
    if old[0]:  # do not forget the last row...
        _ = wr.writerow(old)
With the shown input data, it gives as expected:
ID,Location,Date,Value
1,Loc1,2022-01-27,9
1,Loc1,2022-01-28,7
1,Loc2,2022-01-29,8
2,Loc1,2022-01-27,11
2,Loc2,2022-01-28,4
2,Loc2,2022-01-29,6
3,Loc1,2022-01-28,18
3,Loc2,2022-01-29,1
Not as clean and neat as the Pandas code, but it should process files larger than the available memory without any problem.
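The same consecutive-rows idea can also be written with itertools.groupby and csv.DictReader. This is only a sketch, not part of the answer above: the function name sum_consecutive and its parameters are made up for illustration, and it still assumes that all rows of a group are adjacent in the input. Because columns are looked up by name, the unnamed first column from the question is simply ignored.

```python
import csv
from itertools import groupby

KEYS = ("ID", "Location", "Date")  # grouping columns, as named in the question


def sum_consecutive(lines, keys=KEYS, value="Value"):
    """Yield one (ID, Location, Date, total) tuple per run of consecutive rows.

    `lines` is any iterable of CSV lines (an open file works).
    Assumption: rows belonging to the same group are adjacent.
    """
    reader = csv.DictReader(lines)

    def group_key(record):
        return tuple(record[k] for k in keys)

    # groupby only merges *adjacent* records with equal keys,
    # so only one group is held in memory at a time.
    for key, records in groupby(reader, key=group_key):
        yield (*key, sum(int(record[value]) for record in records))
```

With an open file, the result can be streamed straight into a csv.writer via writerows(sum_consecutive(f)) after writing the header row.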
You could use the built-in csv library and build up the output line by line. A Counter can be used to combine rows with the same key and sum their values:
from collections import Counter
import csv
data = Counter()
with open('input.csv') as f_input:
    csv_input = csv.reader(f_input)
    header = next(csv_input)
    for row in csv_input:
        data[tuple(row[:3])] += int(row[3])
with open('output.csv', 'w', newline='') as f_output:
    csv_output = csv.writer(f_output)
    csv_output.writerow(header)
    for key, value in data.items():
        csv_output.writerow([*key, value])
Giving the output:
ID,Location,Date,Value
1,Loc1,2022-01-27,9
1,Loc1,2022-01-28,7
1,Loc2,2022-01-29,8
2,Loc1,2022-01-27,11
2,Loc2,2022-01-28,4
2,Loc2,2022-01-29,6
3,Loc1,2022-01-28,18
3,Loc2,2022-01-29,1
This avoids storing the input CSV in memory; only the output CSV data is held.
If this is also too large, a slight variation would be to output data whenever the ID column changes. This does, however, assume the input is in ID order:
from collections import Counter
import csv
def write_id(csv_output, data):
    for key, value in data.items():
        csv_output.writerow([*key, value])
    data.clear()
data = Counter()
current_id = None
with open('input.csv') as f_input, open('output.csv', 'w', newline='') as f_output:
    csv_input = csv.reader(f_input)
    csv_output = csv.writer(f_output)
    header = next(csv_input)
    csv_output.writerow(header)
    for row in csv_input:
        if current_id and row[0] != current_id:
            write_id(csv_output, data)
        data[tuple(row[:3])] += int(row[3])
        current_id = row[0]
    write_id(csv_output, data)
For the given example, this would give the same output.
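If the input is not sorted at all and even the set of distinct groups does not fit in memory, one option that none of the answers here spell out is to partition the rows into temporary bucket files by a hash of the key, then sum each bucket separately. The sketch below is an assumption-laden illustration: partitioned_sum and n_buckets are hypothetical names, the key is taken to be the first three columns, and the output comes out in bucket order rather than input order.

```python
import csv
import os
import tempfile
from collections import Counter
from contextlib import ExitStack
from zlib import crc32


def partitioned_sum(in_path, out_path, n_buckets=16):
    """Sum column 3 per (col0, col1, col2) group using on-disk buckets."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        bucket_paths = [
            os.path.join(tmp_dir, f"bucket_{i}.csv") for i in range(n_buckets)
        ]
        # Pass 1: route every row to a bucket chosen by a hash of its key,
        # so all rows of one group end up in the same bucket file.
        with ExitStack() as stack:
            reader = csv.reader(stack.enter_context(open(in_path, newline="")))
            writers = [
                csv.writer(stack.enter_context(open(p, "w", newline="")))
                for p in bucket_paths
            ]
            header = next(reader)
            for row in reader:
                key_bytes = "\x1f".join(row[:3]).encode("utf-8")
                writers[crc32(key_bytes) % n_buckets].writerow(row)
        # Pass 2: each bucket holds only a fraction of the groups,
        # so a Counter per bucket stays small.
        with open(out_path, "w", newline="") as f_out:
            out = csv.writer(f_out)
            out.writerow(header)
            for path in bucket_paths:
                totals = Counter()
                with open(path, newline="") as f_bucket:
                    for row in csv.reader(f_bucket):
                        totals[tuple(row[:3])] += int(row[3])
                for key, total in totals.items():
                    out.writerow([*key, total])
```

The trade-off is extra disk I/O (the input is written once more and read twice) in exchange for a memory footprint of roughly one bucket's worth of groups; n_buckets is also bounded by how many files the OS lets a process keep open.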
Have you tried:
output = []
for key, group in df.groupby([columns]):
    output.append((key, group['a'].sum()))
pd.DataFrame(output).to_csv("....csv")
source:
https://stackoverflow.com/a/54244289/7132906
There are a number of answers already that may suffice: @MartinEvans and @Corralien both recommend breaking up/chunking the input and output.
I'm especially curious whether @MartinEvans's answer works within your memory constraints: it's the simplest still-correct solution so far (as I see it).
If neither of those works, I think you'll be faced with the question:
What defines a chunk that contains every row of the ID/Loc/Date groups I need to sum, so that no group crosses a chunk boundary and gets counted multiple times (ending up as several smaller sub-sums instead of a single, true sum)?
In a comment on the OP you said the input was sorted by "week number". I think this is the single deciding factor for when you have all the counts you'll get for a group of ID/Loc/Date. As the reader crosses a week-group boundary, it knows it's "safe" to stop counting any of the groups encountered so far, and can flush those counts to disk (to avoid holding on to too many counts in memory).
This solution relies on your input CSV being pre-sorted. If your input turns out to be a bit out of order, though, you could run this, test for duplicate groups, re-sort, and re-run it (I see this problem as building a big, memory-constrained reducer):
import csv
from collections import Counter
from datetime import datetime
# Get the data out...
out_csv = open('output.csv', 'w', newline='')
writer = csv.writer(out_csv)
def write_row(row):
    writer.writerow(row)
# Don't let counter get too big (for memory)
def flush_counter(counter):
    for key, sum_ in counter.items():
        id_, loc, date = key
        write_row([id_, loc, date, sum_])
# You said "already grouped by week-number", so:
# - read and sum your input CSV in chunks of "week (number) groups"
# - once the reader reads past a week-group, it concludes week-group is finished
#   and flushes the counts for that week-group
last_wk_group = None
counter = Counter()
# Open input
with open('input.csv', newline='') as f:
    reader = csv.reader(f)
    # Copy header
    header = next(reader)
    write_row(header)
    for row in reader:
        # Get "base" values
        id_, loc, date = row[0:3]
        value = int(row[3])
        # 2022-01-27 -> 2022-04
        wk_group = datetime.strptime(date, r'%Y-%m-%d').strftime(r'%Y-%U')
        # Decide if last week-group has passed
        if wk_group != last_wk_group:
            flush_counter(counter)
            counter = Counter()
            last_wk_group = wk_group
        # Count/sum this week-group (the original used an undefined name, date_)
        key = (id_, loc, date)
        counter[key] += value
# Flush remaining week-group counts
flush_counter(counter)
out_csv.close()
As a basic test, I moved the first row of your sample input to the last row, like @Corralien was asking:
ID,Location,Date,Value
1,Loc1,2022-01-27,5
1,Loc1,2022-01-28,7
1,Loc2,2022-01-29,8
2,Loc1,2022-01-27,11
2,Loc2,2022-01-28,4
2,Loc2,2022-01-29,6
3,Loc1,2022-01-28,9
3,Loc1,2022-01-28,9
3,Loc2,2022-01-29,1
1,Loc1,2022-01-27,4
and I still get the correct output (even in the correct order, because 1,Loc1,2022-01-27 appeared first in the input):
ID,Location,Date,Value
1,Loc1,2022-01-27,9
1,Loc1,2022-01-28,7
1,Loc2,2022-01-29,8
2,Loc1,2022-01-27,11
2,Loc2,2022-01-28,4
2,Loc2,2022-01-29,6
3,Loc1,2022-01-28,18
3,Loc2,2022-01-29,1
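The "test for duplicate groups" step mentioned above could look roughly like this. It is a sketch under assumptions: duplicate_groups is a hypothetical helper, the key is the first three columns of the output file, and it keeps one counter entry per distinct group in memory, so it only suits outputs whose set of groups fits in RAM.

```python
import csv
from collections import Counter


def duplicate_groups(lines, n_key_cols=3):
    """Return the group keys that occur on more than one output row.

    `lines` is any iterable of CSV lines (an open file works);
    the first line is treated as the header and skipped.
    """
    reader = csv.reader(lines)
    next(reader, None)  # skip header, tolerate an empty file
    seen = Counter(tuple(row[:n_key_cols]) for row in reader)
    return [key for key, count in seen.items() if count > 1]
```

An empty result means every group was flushed exactly once; any key it returns was split across week-group boundaries, which is the signal to re-sort the input and re-run.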
import csv
def csv_to_kvs(fileName):
    stringFloats = []
    with open(fileName,'r') as csvFile:
        csvreader = csv.reader(csvFile)
        for row in csvreader:
            stringFloats.append(row)
    print(stringFloats)
I am trying to take a 10-row CSV file in string,float,float,float format and turn each row into a key-value pair: the string is the key and all the floats on that row are the value.
So if the CSV file is:
age,16,17,18
area,1,7,4
call,2,3,6
The code needs to return {age:[16,17,18],etc...}. Any steps in the right direction are appreciated. I am learning CSV file reading and don't understand it too well.
When you read the row, you have the dictionary key in column 0 and the values in the remaining columns. You can slice the row, optionally converting to float on the way, and assign to the needed dict.
import csv
def csv_to_kvs(fileName):
    stringFloats = {}
    with open(fileName,'r') as csvFile:
        csvreader = csv.reader(csvFile)
        for row in csvreader:
            # assuming columns 1 and following should be floats
            stringFloats[row[0]] = [float(val) for val in row[1:]]
    print(stringFloats)
    return stringFloats
(...and come to terms with 4 space indentation!)
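The slice-and-convert step can also be written as a single dict comprehension. This is just an illustrative variant: rows_to_dict is a made-up name, and it takes any iterable of lines instead of a file name so it can be tried without a file on disk. Blank lines are skipped, which is an added assumption about what you want.

```python
import csv


def rows_to_dict(lines):
    """Map column 0 of each CSV row to the list of floats in the other columns."""
    return {
        row[0]: [float(value) for value in row[1:]]
        for row in csv.reader(lines)
        if row  # csv.reader yields [] for blank lines; skip them
    }


if __name__ == "__main__":
    sample = ["age,16,17,18", "area,1,7,4", "call,2,3,6"]
    print(rows_to_dict(sample))
```

To use it on a file, pass the open file object: with open(fileName, newline='') as f: result = rows_to_dict(f).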
I work with csv files and it seems python provides a lot of flexibility for handling csv files.
I found several questions linked to my issue, but I cannot figure out how to combine the solutions effectively...
My starting point CSV file looks like this (note there is only 1 column in the 'header' row):
FILE1
Z1 20 44 3
Z1 21 44 5
Z1 21 44 8
Z1 22 45 10
What I want to do is add a column in between cols #1 and #2, and keep the rest unchanged. This new column has the same # rows as the other columns, but contains the same integer for all entries (10 in my example below). Another important point is I don't really know the number of rows, so I might have to count the # rows somehow first (?) My output should then look like:
FILE1
Z1 10 20 44 3
Z1 10 21 44 5
Z1 10 21 44 8
Z1 10 22 45 10
Is there a simple solution to this?
I think the easiest solution would be to just read each row and write a corresponding new row (with the inserted value) in a new file:
import csv
with open('input.csv', 'r') as infile:
    with open('output.csv', 'w') as outfile:
        reader = csv.reader(infile, delimiter=' ')
        writer = csv.writer(outfile, delimiter=' ')
        for row in reader:
            new_row = [row[0], 10]
            new_row += row[1:]
            writer.writerow(new_row)
This might not make sense if you're not doing anything else with the data besides this bulk processing, though. You'd want to look into csv libraries if that were the case.
Use pandas to import the csv file as a DataFrame named df and then use df.insert(idx, col_name, value); where idx is the index of the newly created column, col_name is the name you assign to this column and value is the list of values you wish to assign to the column. See below for illustration:
import pandas as pd
prices = pd.read_csv('C:\\Users\\abdou.seck\\Documents\\prices.csv')
prices
## Output
  Shares  Number  Prices
0    AAP     100  100.67
1   MSFT      50   56.50
2    SAN     200   19.18
3   GOOG     300  500.34
prices.insert(3, 'Total', prices['Number']*prices['Prices'])
prices
## Output:
  Shares  Number  Prices   Total
0    AAP     100  100.67   10067
1   MSFT      50   56.50    2825
2    SAN     200   19.18    3836
3   GOOG     300  500.34  150102
Hope this helps.
Read the header first, then initialize the reader, write the header first, then initialize the writer:
import csv
# Python 3: open in text mode with newline="" (the "rb"/"wb" modes are Python 2 style)
with open("in.csv", newline="") as in_file:
    header = in_file.readline()
    csv_file_in = csv.reader(in_file, delimiter=" ")
    with open("out.csv", "w", newline="") as out_file:
        out_file.write(header)
        csv_file_out = csv.writer(out_file, delimiter=" ")
        for row in csv_file_in:
            csv_file_out.writerow([row[0], 10] + row[1:])
Pull the data into a list, insert data for each row into the desired spot, and re-write the data.
import csv
data_to_add = 10
new_column_index = 1  # 0 based index
with open('FILE1.csv', 'r', newline='') as f:
    csv_r = csv.reader(f, delimiter=' ')
    data = [line for line in csv_r]
for row in data[1:]:  # skip the single-column header row
    row.insert(new_column_index, data_to_add)
with open('FILE1.csv', 'w', newline='') as f:
    csv_w = csv.writer(f, delimiter=' ')
    for row in data:
        csv_w.writerow(row)  # csv writers have writerow(), not write()
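Since the number of rows is unknown and may be large, the same rewrite can be done without pulling the whole file into a list: stream into a temporary file next to the original and swap it in at the end. This is a sketch, not the answerer's method. insert_column_in_place is a hypothetical name, and treating any row with a single field as a header to leave untouched is an assumption based on the FILE1 example.

```python
import csv
import os
import tempfile


def insert_column_in_place(path, index, value, delimiter=" "):
    """Insert `value` at position `index` in every multi-field row of `path`."""
    directory = os.path.dirname(os.path.abspath(path))
    with open(path, newline="") as src, tempfile.NamedTemporaryFile(
        "w", newline="", dir=directory, delete=False
    ) as tmp:
        reader = csv.reader(src, delimiter=delimiter)
        writer = csv.writer(tmp, delimiter=delimiter, lineterminator="\n")
        for row in reader:
            if len(row) > 1:  # assumption: single-field rows are headers
                row.insert(index, value)
            writer.writerow(row)
    # Both files are closed here; replace the original with the new file.
    os.replace(tmp.name, path)
```

Creating the temporary file in the same directory keeps os.replace on one filesystem, and only one row is held in memory at a time, so there is no need to count the rows first.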
Here's how I might do it with pandas:
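import pandas as pd
with open("in.csv") as input_file:
    header = input_file.readline()
    # header=None: the remaining lines are all data, so none is used as column names
    data = pd.read_csv(input_file, sep=" ", header=None)
data.insert(1, "New Data", 10)
with open("out.csv", "w", newline="") as output_file:
    output_file.write(header)
    # sep=" " keeps the space delimiter of the input file
    data.to_csv(output_file, sep=" ", index=False, header=False)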