The total amount of precipitation for the year - python

How do I write a program that reads the PRCP column and sums all of its values? We are using import csv and from pathlib import Path. Using Python.
For the sample below, the answer should be 1.
Example of the data:
STATION NAME, DATE, PRCP, TMAX, TMIN
USW00023183PHOENIX AIRPORT, 1/1/2020, 1, 60, 40
USW00023183PHOENIX AIRPORT, 1/2/2020, 0, 64, 41
Tried:
prcps = 0
for item in prcps:
    month = item['DATE']
    prcps = (item["PRCP"])
    if prcps > 0[month]:
        sum(prcps)

perp = 0
for PRCP in reader:
    month = item['DATE']
    perp = (item["PRCP"])
    perp += perp

You could use Python's csv library to read the CSV lines in. Each row is parsed as strings, so the PRCP value will first need to be converted into an integer using int().
Using a csv.reader() which converts each row into a list:
import csv

with open('input.csv') as f_input:
    csv_input = csv.reader(f_input)
    header = next(csv_input)
    print(sum(int(row[2]) for row in csv_input))
This first skips the header row and then extracts the third string value from each row, converts it into an integer and sums them.
Using a csv.DictReader() which assumes the first row is a header and then reads each row as a dictionary:
import csv

with open('input.csv') as f_input:
    csv_input = csv.DictReader(f_input)
    print(sum(int(row["PRCP"]) for row in csv_input))
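Since the question mentions pathlib, here is a sketch of the DictReader approach using Path instead of a bare filename. The sample file written at the top is hypothetical, mirroring the example data (written without spaces after the commas so the DictReader keys come out clean):

```python
import csv
from pathlib import Path

# Hypothetical sample file matching the question's data
Path('input.csv').write_text(
    'STATION NAME,DATE,PRCP,TMAX,TMIN\n'
    'USW00023183PHOENIX AIRPORT,1/1/2020,1,60,40\n'
    'USW00023183PHOENIX AIRPORT,1/2/2020,0,64,41\n'
)

with Path('input.csv').open(newline='') as f_input:
    reader = csv.DictReader(f_input)
    total = sum(int(row['PRCP']) for row in reader)

print(total)  # 1
```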


How to access each element of each row when inputting line by line in python

I have a tab-delimited csv file.
My csv file:
0.227996254681648 0.337028824833703 0.238163571416268 0.183009231781289 0.085746697332588 0.13412895376826
0.247891283973758 0.335555555555556 0.272129379268419 0.187328622765857 0.085921240923626 0.128372465534807
0.264761012183693 0.337777777777778 0.245917821271498 0.183211905363232 0.080493183753814 0.122786059549795
0.30506091846298 0.337777777777778 0.204265153911403 0.208453197418743 0.0715575291087 0.083682658454807
0.222748815165877 0.337028824833703 0.209714942778068 0.084252659537679 0.142013573559938 0.234672985858848
Now I would like to input each line from the csv file, do something with each element of each row and then do the same thing for the next line and so on.
My code:
import csv

lines = []
with open("/path/testfile.csv") as f:
    csvReader = csv.reader(f, delimiter="\t")
    for row in csvReader:
        x = row[0]  # access first floating number of each line from csv
        y = row[1]  # access second floating number of each line from csv
        z = row[2]  # access third floating number of each line from csv
        r = row[3]  # access fourth floating number of each line from csv
        s = row[4]  # access fifth floating number of each line from csv
        t = row[5]  # access sixth floating number of each line from csv
#do something else with each element
Here I only put print(row[0]) inside the for loop:
import csv

lines = []
with open("/path/testfile.csv") as f:
    csvReader = csv.reader(f, delimiter="\t")
    for row in csvReader:
        print(row[0])
But even when trying only print(row[0]), it already prints values from every line of the csv file. How can I access each element of each row in Python?
Not sure if you are familiar with the pandas library, but it would simplify things a lot here.
Code
import pandas as pd
df = pd.read_csv('./data/data.csv', delimiter='\t', header=None)
print(df)
Output
0 1 2 3 4 5
0 0.227996 0.337029 0.238164 0.183009 0.085747 0.134129
1 0.247891 0.335556 0.272129 0.187329 0.085921 0.128372
2 0.264761 0.337778 0.245918 0.183212 0.080493 0.122786
3 0.305061 0.337778 0.204265 0.208453 0.071558 0.083683
4 0.222749 0.337029 0.209715 0.084253 0.142014 0.234673
You can then run any operation you want on any column. Example:
df[0] = df[0]*10 # Multiply all numbers in the 0th column by 10
Just add another loop (note the file is tab-delimited, so the delimiter should be "\t"):
import csv

with open("/path/testfile.csv") as f:
    csvReader = csv.reader(f, delimiter="\t")
    for row in csvReader:
        for element in row:
            do_something_with(element)
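One thing to keep in mind when doing something with each element: csv.reader yields every field as a string, so you will likely need float() before doing arithmetic. A minimal sketch, assuming a hypothetical two-line tab-delimited file written inline:

```python
import csv
from pathlib import Path

# Hypothetical tab-delimited sample
Path('testfile.csv').write_text('0.25\t0.5\t0.75\n0.1\t0.2\t0.3\n')

totals = []
with open('testfile.csv', newline='') as f:
    for row in csv.reader(f, delimiter='\t'):
        values = [float(element) for element in row]  # strings -> floats
        totals.append(sum(values))

print(totals)  # first row sums to 1.5
```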
I recommend taking a look at the pandas library and its User Guide; it's a good starting point and will let you handle and process your data more easily.

How to group columns and sum them, in a large CSV?

I have a large CSV (hundreds of millions of rows) and I need to sum the Value column based on the grouping of the ID, Location, and Date columns.
My CSV is similar to:
ID Location Date Value
1 1 Loc1 2022-01-27 5
2 1 Loc1 2022-01-27 4
3 1 Loc1 2022-01-28 7
4 1 Loc2 2022-01-29 8
5 2 Loc1 2022-01-27 11
6 2 Loc2 2022-01-28 4
7 2 Loc2 2022-01-29 6
8 3 Loc1 2022-01-28 9
9 3 Loc1 2022-01-28 9
10 3 Loc2 2022-01-29 1
{ID: 1, Location: Loc1, Date: 2022-01-27} is one such group, and its sub values 5 and 4 should be summed to 9
{ID: 3, Location: Loc1, Date: 2022-01-28} is another group and its sum should be 18
Here's what that sample input should look like, processed/summed, and written to a new CSV:
ID Location Date Value
1 Loc1 2022-01-27 9
1 Loc1 2022-01-28 7
1 Loc2 2022-01-29 8
2 Loc1 2022-01-27 11
2 Loc2 2022-01-28 4
2 Loc2 2022-01-29 6
3 Loc1 2022-01-28 18
3 Loc2 2022-01-29 1
I know using df.groupby([columns]).sum() would give the desired result, but the CSV is so big I keep getting memory errors. I've tried looking at other ways to read/manipulate CSV data but have still not been successful, so if anyone knows a way I can do this in python without maxing out my memory that would be great!
NB: I know there is an unnamed first column in my initial CSV; this is irrelevant and doesn't need to be in the output, but it doesn't matter if it is :)
The appropriate answer is probably to use Dask, but you can do it with Pandas and chunks. The last_row variable holds the last row of the previous chunk, in case the first row of the current chunk has the same ID, Location, and Date.
import pandas as pd

chunksize = 4  # Number of rows per chunk
last_row = pd.DataFrame()  # Last row of the previous chunk

with open('data.csv') as reader, open('output.csv', 'w') as writer:
    # Write headers
    writer.write(reader.readline())
    reader.seek(0)
    for chunk in pd.read_csv(reader, chunksize=chunksize):
        df = pd.concat([last_row, chunk])
        df = df.groupby(['ID', 'Location', 'Date'], as_index=False)['Value'].sum()
        df, last_row = df.iloc[:-1], df.iloc[-1:]
        df.to_csv(writer, header=False, index=False)
    # Don't forget the last row!
    last_row.to_csv(writer, header=False, index=False)
Content of output.csv:
ID,Location,Date,Value
1,Loc1,2022-01-27,9
1,Loc1,2022-01-28,7
1,Loc2,2022-01-29,8
2,Loc1,2022-01-27,11
2,Loc2,2022-01-28,4
2,Loc2,2022-01-29,6
3,Loc1,2022-01-28,18
3,Loc2,2022-01-29,1
If the lines to be combined are consecutive, the good old csv module allows processing huge files one line at a time, hence with a minimal memory footprint.
Here you could use:
import csv

with open('input.csv') as fd, open('output.csv', 'w', newline='') as fdout:
    rd, wr = csv.reader(fd), csv.writer(fdout)
    _ = wr.writerow(next(rd))  # header line
    old = [None] * 4
    for row in rd:
        row[3] = int(row[3])  # convert value field to integer
        if row[:3] == old[:3]:
            old[3] += row[3]  # combine values of similar rows
        else:
            if old[0]:  # and write the combined row
                _ = wr.writerow(old)
            old = row
    if old[0]:  # do not forget the last row...
        _ = wr.writerow(old)
With the shown input data, it gives as expected:
ID,Location,Date,Value
1,Loc1,2022-01-27,9
1,Loc1,2022-01-28,7
1,Loc2,2022-01-29,8
2,Loc1,2022-01-27,11
2,Loc2,2022-01-28,4
2,Loc2,2022-01-29,6
3,Loc1,2022-01-28,18
3,Loc2,2022-01-29,1
Not as clean and neat as the Pandas code, but it should process files larger than the available memory without any problem.
You could use the built in csv library and build up the output line by line. A Counter can be used to combine and count rows with the same entries:
from collections import Counter
import csv

data = Counter()

with open('input.csv') as f_input:
    csv_input = csv.reader(f_input)
    header = next(csv_input)
    for row in csv_input:
        data[tuple(row[:3])] += int(row[3])

with open('output.csv', 'w', newline='') as f_output:
    csv_output = csv.writer(f_output)
    csv_output.writerow(header)
    for key, value in data.items():
        csv_output.writerow([*key, value])
Giving the output:
ID,Location,Date,Value
1,Loc1,2022-01-27,9
1,Loc1,2022-01-28,7
1,Loc2,2022-01-29,8
2,Loc1,2022-01-27,11
2,Loc2,2022-01-28,4
2,Loc2,2022-01-29,6
3,Loc1,2022-01-28,18
3,Loc2,2022-01-29,1
This avoids storing the input CSV in memory, only the output CSV data.
If this is also too large, a slight variation would be to output the data whenever the ID column changes. This assumes, though, that the input is sorted by ID:
from collections import Counter
import csv

def write_id(csv_output, data):
    for key, value in data.items():
        csv_output.writerow([*key, value])
    data.clear()

data = Counter()
current_id = None

with open('input.csv') as f_input, open('output.csv', 'w', newline='') as f_output:
    csv_input = csv.reader(f_input)
    csv_output = csv.writer(f_output)
    header = next(csv_input)
    csv_output.writerow(header)
    for row in csv_input:
        if current_id and row[0] != current_id:
            write_id(csv_output, data)
        data[tuple(row[:3])] += int(row[3])
        current_id = row[0]
    write_id(csv_output, data)
For the given example, this would give the same output.
Have you tried:
output = []
for key, group in df.groupby([columns]):
    output.append((key, group['a'].sum()))
pd.DataFrame(output).to_csv("....csv")
source:
https://stackoverflow.com/a/54244289/7132906
There are a number of answers already that may suffice: #MartinEvans and #Corralien both recommend breaking up / chunking the input and output.
I'm especially curious if #MartinEvans's answer works within your memory constraints: it's the simplest and still-correct solution so far (as I see it).
If neither of those works, I think you'll be faced with the question:
What makes a chunk such that all the ID/Loc/Date groups I need to count are contained within it, so no group crosses a chunk boundary and gets counted multiple times (ending up with smaller sub-sums instead of a single, true sum)?
In a comment on the OP you said the input was sorted by "week number". I think this is the single deciding factor for when you have all the counts you'll get for a group of ID/Loc/Date. As the reader crosses week-group boundaries, it'll know it's "safe" to stop counting any of the groups encountered so far, and flush those counts to disk (to avoid holding on to too many counts in memory).
This solution relies on the pre-sorted-ness of your input CSV. Though, if your input was a bit out of sorts: you could run this, test for duplicate groups, re-sort, and re-run this (I see this problem as making a big, memory-constrained reducer):
import csv
from collections import Counter
from datetime import datetime

# Get the data out...
out_csv = open('output.csv', 'w', newline='')
writer = csv.writer(out_csv)

def write_row(row):
    writer.writerow(row)

# Don't let counter get too big (for memory)
def flush_counter(counter):
    for key, sum_ in counter.items():
        id_, loc, date = key
        write_row([id_, loc, date, sum_])

# You said "already grouped by week-number", so:
# - read and sum your input CSV in chunks of "week (number) groups"
# - once the reader reads past a week-group, it concludes the week-group is
#   finished and flushes the counts for that week-group
last_wk_group = None
counter = Counter()

# Open input
with open('input.csv', newline='') as f:
    reader = csv.reader(f)
    # Copy header
    header = next(reader)
    write_row(header)
    for row in reader:
        # Get "base" values
        id_, loc, date = row[0:3]
        value = int(row[3])
        # 2022-01-27 -> 2022-04
        wk_group = datetime.strptime(date, r'%Y-%m-%d').strftime(r'%Y-%U')
        # Decide if the last week-group has passed
        if wk_group != last_wk_group:
            flush_counter(counter)
            counter = Counter()
            last_wk_group = wk_group
        # Count/sum this week-group
        key = (id_, loc, date)
        counter[key] += value

# Flush remaining week-group counts
flush_counter(counter)
As a basic test, I moved the first row of your sample input to the last row, like #Corralien was asking:
ID,Location,Date,Value
1,Loc1,2022-01-27,5
1,Loc1,2022-01-28,7
1,Loc2,2022-01-29,8
2,Loc1,2022-01-27,11
2,Loc2,2022-01-28,4
2,Loc2,2022-01-29,6
3,Loc1,2022-01-28,9
3,Loc1,2022-01-28,9
3,Loc2,2022-01-29,1
1,Loc1,2022-01-27,4
and I still get the correct output (even in the correct order, because 1,Loc1,2022-01-27 appeared first in the input):
ID,Location,Date,Value
1,Loc1,2022-01-27,9
1,Loc1,2022-01-28,7
1,Loc2,2022-01-29,8
2,Loc1,2022-01-27,11
2,Loc2,2022-01-28,4
2,Loc2,2022-01-29,6
3,Loc1,2022-01-28,18
3,Loc2,2022-01-29,1
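Another option, not taken by any of the answers above, is to offload the grouping to an on-disk SQLite database via the stdlib sqlite3 module: rows stream from the CSV into the database, the GROUP BY happens on disk, and memory use stays flat regardless of file size. A sketch with a hypothetical miniature of the input (file and table names are illustrative):

```python
import csv
import sqlite3
from pathlib import Path

# Hypothetical miniature of the large input file
with open('input.csv', 'w', newline='') as f:
    w = csv.writer(f)
    w.writerow(['ID', 'Location', 'Date', 'Value'])
    w.writerows([
        ['1', 'Loc1', '2022-01-27', '5'],
        ['1', 'Loc1', '2022-01-27', '4'],
        ['1', 'Loc1', '2022-01-28', '7'],
    ])

Path('groups.db').unlink(missing_ok=True)  # start fresh
con = sqlite3.connect('groups.db')  # on-disk database keeps memory use flat
con.execute('CREATE TABLE t (id TEXT, loc TEXT, date TEXT, value INTEGER)')

with open('input.csv', newline='') as f:
    rows_in = csv.reader(f)
    next(rows_in)  # skip header
    con.executemany('INSERT INTO t VALUES (?, ?, ?, ?)', rows_in)
con.commit()

grouped = con.execute(
    'SELECT id, loc, date, SUM(value) FROM t '
    'GROUP BY id, loc, date ORDER BY id, loc, date'
).fetchall()
print(grouped)  # [('1', 'Loc1', '2022-01-27', 9), ('1', 'Loc1', '2022-01-28', 7)]
```

Unlike the chunked approaches, this does not require the input to be sorted at all.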

CSV file to Key Value pairs Python 3

import csv

def csv_to_kvs(fileName):
    stringFloats = []
    with open(fileName, 'r') as csvFile:
        csvreader = csv.reader(csvFile)
        for row in csvreader:
            stringFloats.append(row)
        print(stringFloats)
I am trying to take a CSV file that is in string,float,float,float format with 10 rows, and I have to turn each string into a key whose value is the list of floats on the corresponding row.
So if the CSV file is:
age,16,17,18
area,1,7,4
call,2,3,6
The code needs to return {age:[16,17,18],etc...}. Any steps in the right direction are appreciated. I am learning CSV file reading and don't understand it too well.
When you read a row, you have the dictionary key in column 0 and the values in the remaining columns. You can slice the row, converting to float along the way, and assign it into the dict.
import csv

def csv_to_kvs(fileName):
    stringFloats = {}
    with open(fileName, 'r') as csvFile:
        csvreader = csv.reader(csvFile)
        for row in csvreader:
            # assuming columns 1 and onward should be floats
            stringFloats[row[0]] = [float(val) for val in row[1:]]
    print(stringFloats)
    return stringFloats
(...and come to terms with 4 space indentation!)
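The same loop can also be condensed into a dict comprehension. A sketch, using a hypothetical sample file matching the question's data:

```python
import csv
from pathlib import Path

# Hypothetical sample file from the question
Path('data.csv').write_text('age,16,17,18\narea,1,7,4\ncall,2,3,6\n')

with open('data.csv', newline='') as f:
    # first column is the key, the rest are float values
    result = {row[0]: [float(v) for v in row[1:]] for row in csv.reader(f)}

print(result)  # {'age': [16.0, 17.0, 18.0], ...}
```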

from a CSV file, count unique value in a row and print the total using python

I am new to Python, I have a CSV file that looks something like this:
Date, Profit
02/2019 , 100
03/2019 , 410
03/2019 , 300
04/2019 , 200
I need to write code in Python that prints how many unique dates are in the file.
Here is what I have so far, but it only prints 1:
import os
import csv

with open('budget_data.csv') as file:
    reader = csv.reader(file, delimiter=',')
    t_dates = [1]
    for row in reader:
        if row[0] not in row:
            t_dates.append(row[0])
    print(len(t_dates))
You can use DictReader, which maps the CSV headers onto each row's items as keys. You can then use a set comprehension to collect the unique dates and count them.
import csv

with open('test.dat') as file:
    reader = csv.DictReader(file, delimiter=',')
    print(len({row['Date'] for row in reader}))
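Keeping the shape of your original loop, a set handles the uniqueness check directly. A minimal sketch with hypothetical sample data; note the header row is skipped, and each date is tested against the set (your code tested row[0] against its own row, which is always a member):

```python
import csv
from pathlib import Path

# Hypothetical sample mirroring budget_data.csv from the question
Path('budget_data.csv').write_text(
    'Date,Profit\n02/2019,100\n03/2019,410\n03/2019,300\n04/2019,200\n'
)

unique_dates = set()
with open('budget_data.csv', newline='') as file:
    reader = csv.reader(file)
    next(reader)  # skip the header row
    for row in reader:
        unique_dates.add(row[0])  # sets ignore duplicates automatically

print(len(unique_dates))  # 3
```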

Adding in-between column in csv Python

I work with csv files and it seems python provides a lot of flexibility for handling csv files.
I found several questions linked to my issue, but I cannot figure out how to combine the solutions effectively...
My starting point CSV file looks like this (note there is only 1 column in the 'header' row):
FILE1
Z1 20 44 3
Z1 21 44 5
Z1 21 44 8
Z1 22 45 10
What I want to do is add a column between cols #1 and #2, and keep the rest unchanged. This new column has the same number of rows as the other columns, but contains the same integer for all entries (10 in my example below). Another important point: I don't really know the number of rows, so I might have to count the rows somehow first (?) My output should then look like:
FILE1
Z1 10 20 44 3
Z1 10 21 44 5
Z1 10 21 44 8
Z1 10 22 45 10
Is there a simple solution to this?
I think the easiest solution would be to just read each row and write a corresponding new row (with the inserted value) in a new file:
import csv

with open('input.csv', 'r') as infile:
    with open('output.csv', 'w', newline='') as outfile:
        reader = csv.reader(infile, delimiter=' ')
        writer = csv.writer(outfile, delimiter=' ')
        for row in reader:
            new_row = [row[0], 10]
            new_row += row[1:]
            writer.writerow(new_row)
This might not make sense if you're doing anything else with the data besides this bulk processing, though. You'd want to look into other csv-handling libraries in that case.
Use pandas to import the csv file as a DataFrame named df, then use df.insert(idx, col_name, value), where idx is the index of the newly created column, col_name is the name you assign to it, and value is the list of values you wish to assign to the column. See below for an illustration:
import pandas as pd
prices = pd.read_csv('C:\\Users\\abdou.seck\\Documents\\prices.csv')
prices
## Output
Shares Number Prices
0 AAP 100 100.67
1 MSFT 50 56.50
2 SAN 200 19.18
3 GOOG 300 500.34
prices.insert(3, 'Total', prices['Number']*prices['Prices'])
prices
## Output:
Shares Number Prices Total
0 AAP 100 100.67 10067
1 MSFT 50 56.50 2825
2 SAN 200 19.18 3836
3 GOOG 300 500.34 150102
Hope this helps.
Read the header first, then initialize the reader, write the header first, then initialize the writer:
import csv

with open("in.csv", "r", newline="") as in_file:
    header = in_file.readline()
    csv_file_in = csv.reader(in_file, delimiter=" ")
    with open("out.csv", "w", newline="") as out_file:
        out_file.write(header)
        csv_file_out = csv.writer(out_file, delimiter=" ")
        for row in csv_file_in:
            csv_file_out.writerow([row[0], 10] + row[1:])
Pull the data into a list, insert data for each row into the desired spot, and re-write the data.
import csv

data_to_add = 10
new_column_index = 1  # 0 based index

with open('FILE1.csv', 'r') as f:
    csv_r = csv.reader(f, delimiter=' ')
    data = [line for line in csv_r]

for row in data:
    row.insert(new_column_index, data_to_add)

with open('FILE1.csv', 'w', newline='') as f:
    csv_w = csv.writer(f, delimiter=' ')
    for row in data:
        csv_w.writerow(row)
Here's how I might do it with pandas:
import pandas as pd

with open("in.csv") as input_file:
    header = input_file.readline()  # the one-column "FILE1" line
    data = pd.read_csv(input_file, sep=" ", header=None)

data.insert(1, "New Data", 10)

with open("out.csv", "w") as output_file:
    output_file.write(header)
    data.to_csv(output_file, sep=" ", index=False, header=False)
