Python CSV sum value if they have same ID/name - python

I want to sum all values that have the same name / ID in a csv file
Right now I am only looking for ID with the name 'company'
csv file format:
company A, 100
company B, 200
company A, 300
The end result I am looking for is:
company A, 400
company B, 200
total: 600
My code so far:
import csv
name = ''
num = ''
total = 0
with open('xx.csv', 'r', newline='') as csvfile:
    reader = csv.reader(csvfile)
    next(csvfile)
    for a in reader:
        if a[0].__contain__('company'):
            name = (a[0])
            num = (a[1])
            total += float(a[1])
            print(str(name) + ', ' + str(num))
print('total: ' + str(total))

First, CSV files typically use commas, and the delimiter for csv.reader must be a single character, so I suggest updating your file to use commas properly.
Secondly, to aggregate the companies, you need to store them as you iterate the file. Easiest way is to use a dictionary type.
Then only after you've aggregated everything, should you create a second loop to go over the aggregated values, then print the final total.
import csv
from collections import defaultdict
totals = defaultdict(int)
total = 0
with open('companies.csv') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    # next(csvfile)  # shown file has no header
    for row in reader:
        if not row[0].startswith('company'):
            continue
        name, value = row
        totals[name] += float(value)
total = 0
for name, value in totals.items():
    print(f'{name},{value}')
    total += value
print(f'total: {total}')

You don't necessarily need the csv module here. Just read each line, split it from the right (rsplit), and fill a dictionary like below:
d = {}
with open('your_file.csv') as f:
    # next(f) - If header needs to be skipped
    for line in f:
        name, value = line.rsplit(',', maxsplit=1)
        d[name] = d.get(name, 0) + int(value)
for k, v in d.items():
    print(f"{k}, {v}")
print(f"total: {sum(d.values())}")
output:
company A, 400
company B, 200
total: 600
To avoid iterating over the dictionary's values a second time just to compute the total (the sum(d.values()) expression), you can add to the total while printing the items:
d = {}
with open('new.csv') as f:
    for line in f:
        name, value = line.rsplit(',', maxsplit=1)
        d[name] = d.get(name, 0) + int(value)
total = 0
for k, v in d.items():
    total += v
    print(f"{k}, {v}")
print(f"total: {total}")
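The answers above share one pattern: accumulate per-name totals in a dictionary while reading, then print the totals afterwards. A minimal sketch of that pattern follows, assuming the two-column layout shown in the question. The helper name `sum_by_name` and the use of `io.StringIO` in place of a real file are illustrative choices, not part of the original answers; `skipinitialspace=True` is used so the blank after each comma is dropped before conversion.

```python
import csv
import io


def sum_by_name(lines, prefix="company"):
    """Return {name: total} for rows whose first column starts with prefix.

    sum_by_name is a hypothetical helper written for this sketch.
    """
    totals = {}
    # skipinitialspace=True drops the blank that follows each comma
    reader = csv.reader(lines, skipinitialspace=True)
    for row in reader:
        if len(row) != 2 or not row[0].startswith(prefix):
            continue  # skip blank, malformed, or unrelated rows
        name, value = row
        totals[name] = totals.get(name, 0) + float(value)
    return totals


# Sample data from the question, held in memory instead of a file
sample = io.StringIO("company A, 100\ncompany B, 200\ncompany A, 300\n")
totals = sum_by_name(sample)
for name, value in totals.items():
    print(f"{name}, {value:g}")
print(f"total: {sum(totals.values()):g}")
```

With a real file, the same function could be called as `sum_by_name(open_file_object)`, since `csv.reader` accepts any iterable of lines.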

Related

How to create a dictionary contains the summation of the keys for the first dictionary?

I am working on a CSV file. I am looking to make summation of all case numbers for each day of week and put them in a list of dictionaries.  
The list of dictionaries should be:
T = [{'Saturday', total number of cases}, {'Sunday', total number of cases}, {'Monday', total number of cases}, .... ]
This is what I did, but it doesn't work:
with open('DataCleaning.csv') as f:
    data = [{k: str(v) for k, v in row.items()}
            for row in csv.DictReader(f, skipinitialspace=True)]
Case_Count_Saturday = {}
for row in data:
    if 'Saturday' in row["dayofweek_name"]:
        k = row['CASE_COUNT']
        if k in Case_Count_Saturday.keys():
            Case_Count_Saturday[k] += 1
        else:
            Case_Count_Saturday[k] = 1
print(Case_Count_Saturday)
Case_Count_Saturday1 = {}
for case, count in Case_Count_Saturday.items():
    k = "saterday"
    if k in Case_Count_Saturday1.keys():
        Case_Count_Saturday1[k] += 1
    else:
        Case_Count_Saturday1[k] = 1
print(Case_Count_Saturday1)
This is part of the dataset:

How do you make a dictionary out of 2 elements in each list?

I have a list of lists that I want to make into a dictionary. Basically it's a list of births based on date (year/month/day/day of week/births). I want to tally the total births for each day to see in total how many births on each day of the week.
List example:
[2000,12,3,2,12000],[2000,12,4,3,34000]...
days_counts = {1: 23000, 2: 43000, ..., 7: 11943}
Here's the code so far:
f = open('births.csv', 'r')
text = f.read()
text = text.split("\n")
header = text[0]
data = text[1:]
for d in data:
    split_data = d.split(",")
    print(split_data)
So basically I want to iterate over each day and add the births from duplicate days into the same key (obviously).
EDIT: I have to do this with an if statement that looks for the day of week as a key in the dict. If it's found, assign the corresponding births as the value. If it's not in the dict, then add the key and value. I can't import anything or use lambda functions.
Use a collections.Counter() object to track the counts per day-of-the-week. You also want to use the csv module to handle the file parsing:
import csv
from collections import Counter
per_dow = Counter()
with open('births.csv', 'r') as f:
    reader = csv.reader(f)
    header = next(reader)
    for row in reader:
        dow, births = map(int, row[-2:])
        per_dow[dow] += births
I've used a with statement to manage the file object; Python auto-closes the file for you when the with block ends.
Now that you have a Counter object (which is a dictionary with some extra powers), you can now find the day of the week with the most births; the following loop prints out days of the week in order from most to least:
for day, births in per_dow.most_common():
print(day, births)
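As a quick illustration of the Counter approach without a file, here is a small sketch that uses the two sample rows from the question plus one made-up row (labeled in the code) so that a day of the week actually repeats. The variable names mirror the answer above; the extra row is an assumption added only for demonstration.

```python
from collections import Counter

# Rows in the question's layout: year, month, day, day of week, births
rows = [
    [2000, 12, 3, 2, 12000],
    [2000, 12, 4, 3, 34000],
    [2000, 12, 10, 2, 8000],  # made-up extra row to show accumulation
]

per_dow = Counter()
for row in rows:
    dow, births = row[-2:]  # the last two columns
    per_dow[dow] += births  # missing keys start at 0

# Days of the week ordered from most to fewest births
print(per_dow.most_common())
```

A Counter returns 0 for a key it has never seen, which is why no membership check is needed before the `+=`.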
Without using external libraries or if statements, you can use exception handling:
birth_dict = {}
birth_list = [[2000,12,3,2,12000],[2000,12,4,3,34000]]
for birth in birth_list:
    try:
        birth_dict[birth[3]] += birth[4]
    except KeyError:
        birth_dict[birth[3]] = birth[4]
print(birth_dict)
Ok, after playing around with the code and using print statements where I need them for tests, I finally did it without using any external libraries. A very special thanks to Tobey and the others.
Here's the code with tests:
f = open('births.csv', 'r')
text = f.read()
text = text.split("\n")
header = text[0]
data = text[1:-1]
days_counts = {}
for d in data:
    r = d.split(",")
    print(r) #<--- used to test
    k = r[3]
    print(k) #<--- used to test
    v = int(r[4])
    print(v) #<--- used to test
    if k in days_counts:
        days_counts[k] += v
        print("If : " , days_counts) #<--- used to test
    else:
        days_counts[k] = v
        print("Else : ", days_counts) #<--- used to test
print(days_counts)
Code without tests:
f = open('births.csv', 'r')
text = f.read()
text = text.split("\n")
header = text[0]
data = text[1:-1]
days_counts = {}
for d in data:
    r = d.split(",")
    k = r[3]
    v = int(r[4])
    if k in days_counts:
        days_counts[k] += v
    else:
        days_counts[k] = v
print(days_counts)

get value of one column by another column in csv file python

I have my csv file like this:
ID Value Amount
---- ------- -------
A 3 2
A 4 4
B 3 6
C 5 5
A 3 2
B 10 1
I want the sum of column "Value" or "Amount" grouped by the column "ID". For example, for 'A' the output should be the sum of all values related to A, i.e. 3+4+3.
My Code:
import csv
file = open(datafile.csv)
rows=csv.DictReader(file)
summ=0.0
count=0
for r in rows:
    summ=summ+int(r['Value'])
    count=count+1
print "Mean for column Value is: ",(summ/count)
file.close()
You can use a defaultdict of list to group the data by the ID column. Then use sum() to produce the totals.
from collections import defaultdict
with open('datafile.csv') as f:
    d = defaultdict(list)
    next(f) # skip first header line
    next(f) # skip second header line
    for line in f:
        id_, value, amount = line.split()
        d[id_].append((int(value), int(amount)))
# sum and average of column Value by ID
for id_ in d:
    total = sum(t[0] for t in d[id_])
    average = total / float(len(d[id_]))
    print('{}: sum = {}, avg = {:.2f}'.format(id_, total, average))
Output for your input data:
A: sum = 10, avg = 3.33
C: sum = 5, avg = 5.00
B: sum = 13, avg = 6.50
It can also be done with a standard Python dictionary. The solution is very similar:
with open('datafile.csv') as f:
    d = {}
    next(f) # skip first header line
    next(f) # skip second header line
    for line in f:
        id_, value, amount = line.split()
        d[id_] = d.get(id_, []) + [(int(value), int(amount))]
# sum and average of column Value by ID
for id_ in d:
    total = sum(t[0] for t in d[id_])
    average = total / float(len(d[id_]))
    print('{}: sum = {}, avg = {:.2f}'.format(id_, total, average))
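If only the sum and average are needed, the per-ID lists are not strictly required: a running total and a row count per ID are enough. The sketch below is one way to do that, assuming the whitespace-separated layout with two header lines shown in the question. The name `stats` and the in-memory `io.StringIO` sample are illustrative and not part of the original answer.

```python
import io

# The question's table, held in memory; a real script would open the file
sample = io.StringIO(
    "ID Value Amount\n"
    "---- ------- -------\n"
    "A 3 2\nA 4 4\nB 3 6\nC 5 5\nA 3 2\nB 10 1\n"
)

stats = {}  # id -> [sum of Value, number of rows]
next(sample)  # skip the column names
next(sample)  # skip the dashed separator line
for line in sample:
    if not line.strip():
        continue  # ignore blank lines
    id_, value, amount = line.split()
    entry = stats.setdefault(id_, [0, 0])
    entry[0] += int(value)
    entry[1] += 1

# Sorting the keys gives a stable, alphabetical report order
for id_, (total, count) in sorted(stats.items()):
    print(f"{id_}: sum = {total}, avg = {total / count:.2f}")
```

Summing the "Amount" column instead would only require replacing `int(value)` with `int(amount)`.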

Sort Average In A file

I have a file with 3 scores for each person. Each person has their own row. I want to use these scores, and get the average of all 3 of them. Their scores are separated by tabs and in descending order. For example:
tam 10 6 11
tom 3 7 3
tim 5 4 6
these people would come out with an average of:
tam 9
tom 4
tim 5
I want these to be able to print to the python shell, however not be saved to the file.
with open("file.txt") as file1:
    d = {}
    count = 0
    for line in file1:
        column = line.split()
        names = column[0]
        average = (int(column[1].strip()) + int(column[2].strip()) + int(column[3].strip()))/3
        count = 0
        while count < 3:
            d.setdefault(names, []).append(average)
            count = count + 1
    for names, v in sorted(d.items()):
        averages = (sum(v)/3)
        print(names,average)
        averageslist=[]
        averageslist.append(averages)
My code only finds the first person's average and outputs it for all of them. I also want the output to be in descending order of average.
You can use the following code, which parses your file into a list of (name, average) tuples and prints every entry of the list sorted by average:
import operator
with open("file.txt") as f:
    data = []
    for line in f:
        parts = line.split()
        name = parts[0]
        vals = parts[1:]
        avg = sum(int(x) for x in vals)/len(vals)
        data.append((name, avg))
for person in sorted(data, key=operator.itemgetter(1), reverse=True):
    print("{} {}".format(*person))
You are almost correct. You already calculate the average in the first step, so there is no need for sum(v)/3 again. Try this:
with open("file.txt") as file1:
    d = {}
    count = 0
    for line in file1:
        column = line.split()
        names = column[0]
        average = (int(column[1].strip()) + int(column[2].strip()) + int(column[3].strip()))/3
        d[names] = average
for names, v in sorted(d.items(),key=lambda x:x[1],reverse=True): #increasing order==>sorted(d.items(),key=lambda x:x[1])
    print(names,v)
#output
('tam', 9)
('tim', 5)
('tom', 4)
To sort by name
for names, v in sorted(d.items()):
    print(names,v)
#output
('tam', 9)
('tim', 5)
('tom', 4)
The issue is this:
averages = (sum(v)/3)
print(names,average)
Notice that on the first line you are computing averages (with an s at the end) and on the next line you are printing average (without an s).
Try This:
from operator import itemgetter
with open("file.txt") as file1:
    d = {}
    count = 0
    for line in file1:
        column = line.split()
        names = column[0]
        average = (int(column[1].strip()) + int(column[2].strip()) + int(column[3].strip()))/3
        count = 0
        d.setdefault(names, []).append(average)
for names,v in sorted(d.items(), key=itemgetter(1),reverse=True):
    print(names,v)
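For comparison, here is a compact sketch of the same task that works for any number of scores per row. It assumes Python 3, where `/` and `statistics.mean` do true division, so tom's average comes out as 4.33 rather than the truncated 4 shown in the output above. The in-memory sample replaces the tab-separated file and is only there to keep the sketch self-contained.

```python
import io
from statistics import mean

# The question's rows, tab-separated, held in memory instead of file.txt
sample = io.StringIO("tam\t10\t6\t11\ntom\t3\t7\t3\ntim\t5\t4\t6\n")

averages = []
for line in sample:
    parts = line.split()
    if not parts:
        continue  # ignore blank lines
    name, scores = parts[0], [int(x) for x in parts[1:]]
    averages.append((name, mean(scores)))

# Highest average first
averages.sort(key=lambda pair: pair[1], reverse=True)
for name, avg in averages:
    print(f"{name} {avg:.2f}")
```

Nothing is written back to the file; the results are only printed.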

Grouping data in Python

I have the following (space delimited) input:
2012-10-05 PETER 6
2012-10-05 PETER 4
2012-10-06 PETER 60
2012-10-06 TOM 10
2012-10-08 SOMNATH 80
And I would like to achieve the following pipe-delimited output:
(where the columns are [DATE AND NAME, NUM ENTRIES, SUM OF LAST COL])
2012-10-05 PETER|2|10
2012-10-06 PETER|1|60
2012-10-06 TOM|1|10
2012-10-08 SOMNATH|1|80
This is my code so far:
s = open("output.txt","r")
fn=s.readlines()
d = {}
for line in fn:
    parts = line.split()
    if parts[0] in d:
        d[parts[0]][1] += int(parts[2])
        d[parts[0]][2] += 1
    else:
        d[parts[0]] = [parts[1], int(parts[2]), 1]
for date in sorted(d):
    print "%s %s|%d|%d" % (date, d[date][0], d[date][2], d[date][1])
I am getting the output as:
2012-10-06 PETER|2|70
instead of
2012-10-06 PETER|1|60
and TOM isn't showing in the list.
What do I need to do to correct my code?
import collections
d = collections.defaultdict(list)
with open('output.txt', 'r') as f:
    for line in f:
        date, name, val = line.split()
        d[date, name].append(int(val))
for (date, name), vals in sorted(d.items()):
    print('%s %s|%d|%d' % (date, name, len(vals), sum(vals)))
<3 itertools
import itertools
with open('output.txt', 'r') as f:
    splitlines = (line.split() for line in f if line.strip())
    for (date, name), bits in itertools.groupby(splitlines, key=lambda bits: bits[:2]):
        total = 0
        count = 0
        for _, _, val in bits:
            total += int(val)
            count += 1
        print('%s %s|%d|%d' % (date, name, count, total))
If you don't want to use groupby (either it is unavailable, or your input data isn't guaranteed to be sorted), here's a conventional solution (which is effectively just a fixed version of your code):
d = {}
with open('output.txt', 'r') as f:
    for line in f:
        date, name, val = line.split()
        key = (date, name)
        if key not in d:
            d[key] = [0, 0]
        d[key][0] += int(val)
        d[key][1] += 1
for key in sorted(d):
    date, name = key
    total, count = d[key]
    print('%s %s|%d|%d' % (date, name, count, total))
Note that we use (date, name) as the key instead of just using date.
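Because itertools.groupby only merges adjacent rows with equal keys, unsorted input has to be sorted on the same key first. The sketch below shows one way to do that, using the question's rows in a deliberately shuffled order; the in-memory sample and the `results` list are illustrative assumptions, not part of the answers above.

```python
import io
from itertools import groupby

# The question's rows, deliberately shuffled to show why sorting matters
sample = io.StringIO(
    "2012-10-05 PETER 6\n"
    "2012-10-06 TOM 10\n"
    "2012-10-05 PETER 4\n"
    "2012-10-06 PETER 60\n"
    "2012-10-08 SOMNATH 80\n"
)


def date_and_name(row):
    """Grouping key: the (date, name) pair of a split row."""
    return (row[0], row[1])


rows = [line.split() for line in sample if line.strip()]
rows.sort(key=date_and_name)  # groupby needs equal keys to be adjacent

results = []
for (date, name), group in groupby(rows, key=date_and_name):
    vals = [int(row[2]) for row in group]
    results.append(f"{date} {name}|{len(vals)}|{sum(vals)}")

print("\n".join(results))
```

Sorting loads all rows into memory, so for very large inputs the dictionary-based version above may be the better fit.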
