How to sum values of an identical key - python

I need Python to read a .txt file and sum up the hours each student has attended school for the year. I need help understanding how to do this when the same student has multiple lines in the file. The .txt file looks something like this:
John0550
John0550
Sally1007
And the ultimate result I'm looking for in Python is to print out a list like:
John has attended 1100 hours
Sally has attended 1007 hours
I know I can't rely on a dict() because it won't accommodate identical keys. So what is the best way to do this?

Suppose you already have a function named split_line that returns the (name, hours attended) pair for each line. Your algorithm would look like this:
hours_attended_per_student = {}  # Create an empty dict
with open("my_file.txt", "r") as file:
    for line in file:
        name, hours = split_line(line)
        # Check whether you have already counted some hours for this student
        if name not in hours_attended_per_student:
            # Student was not encountered yet, set their hours to 0
            hours_attended_per_student[name] = 0
        # Now that the student name is in the dict, increase the hours attended for the student
        hours_attended_per_student[name] += hours
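For the file format shown in the question, split_line could be sketched as follows. This is only one possible implementation, not part of the original answer: it assumes every line is a name made of non-digit characters followed directly by the hours as digits. The helper total_hours is likewise a hypothetical name, and io.StringIO stands in for the real file.

```python
import io
import re


def split_line(line):
    """Hypothetical helper: split 'John0550' into ('John', 550)."""
    match = re.fullmatch(r"(\D+)(\d+)", line.strip())
    if match is None:
        raise ValueError(f"Unexpected line format: {line!r}")
    return match.group(1), int(match.group(2))


def total_hours(lines):
    """Sum the hours per student over an iterable of lines."""
    totals = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        name, hours = split_line(line)
        totals[name] = totals.get(name, 0) + hours
    return totals


# io.StringIO stands in for the real file here (sample lines from the question).
sample = io.StringIO("John0550\nJohn0550\nSally1007\n")
totals = total_hours(sample)
for name, hours in totals.items():
    print(f"{name} has attended {hours} hours")
```

A plain dict works here because each student name is stored once as a key and the hours are accumulated in its value, so repeated names in the file are not a problem.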

A defaultdict could be helpful here:
import re
from collections import defaultdict
from io import StringIO
# Simulate File
with StringIO('''John0550
John0550
Sally1007''') as f:
    # Create defaultdict initialized at 0
    d = defaultdict(lambda: 0)
    # For each line in the file
    for line in f.readlines():
        # Split Name from Value
        name, value = re.split(r'(^[^\d]+)', line)[1:]
        # Sum Value into dict
        d[name] += int(value)
# For Display
print(dict(d))
Output:
{'John': 1100, 'Sally': 1007}
Assuming values are already split and parsed:
from collections import defaultdict
entries = [('John', 550), ('John', 550), ('Sally', 1007)]
d = defaultdict(int)
for name, value in entries:
    # Sum Value into dict
    d[name] += int(value)
# For Display
print(dict(d))

How do i find the percentage of the elements in a list? (Python)

I'm new to Python and I'm running into an issue in my project.
I have to read a file containing users + tasks.
Then I should list the user names and count the number of times each name is listed in the file, grouped together. Once I have the counts, I need to calculate each count as a percentage of the total number of entries in the file.
The file contents look like this:
user1, task
user2, task
user1, task
user4, task
user4, task
user1, task
Here is my code -
with open('tasks.txt', 'r') as tasks:
    for line in tasks.readlines():
        mine = line.lower().split(", ")
        for i in mine[0].split(", "):
            cnt[i] += 1
    print("\nThese are the number of tasks assigned to each user: \n" + str(cnt))
    t = sum(cnt.values())
    d = dict(cnt)
    u, v = zip(*d.items())
    print(u, v)
    for n in v:
        divide = float(n / t) * 100
        print("The users are assigned this percentage of the tasks: \n")
        print(n, divide)
I would like the results to look like this:
user1 : 3, 50%
user4 : 2, 33%
user2 : 1, 16.7%
If anyone has any suggestions, please let me know

code:
cnt = {}
usertask = []
res = {}
with open('tasks.txt', 'r') as tasks:
    for line in tasks.readlines():
        mine = line.lower().split(", ")
        usertask.append(mine[0])
# start every user at a count of 0
for i in (list(set(usertask))):
    cnt[i] = 0
for user in usertask:
    cnt[user] += 1
# convert each count into a percentage of all entries
for user, task in cnt.items():
    res[user] = task * (100 / len(usertask))
print(res)

You could try this:
# read data to a list
with open('tasks.txt', 'r') as f:
    lines = f.readlines()
lines = [line.strip() for line in lines]
The original way:
from collections import defaultdict
count = defaultdict(list)
for line in lines:
    user, task = line.split(', ')
    count[user].append(task)
for user, tasks in count.items():
    print(f'{user}: {len(tasks)*100/len(lines)}%')
Or the faster way is to use Counter:
from collections import Counter
users = [line.split(', ')[0] for line in lines]
count = Counter(users)
for user, value in count.items():
    print(f'{user}: {value*100/len(lines)}%')
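To get closer to the "user1 : 3, 50%" layout asked for in the question, Counter.most_common can be combined with a format specification. This is a minimal sketch, not part of the answer above: format_task_shares is a hypothetical name, and rounding to one decimal place is an assumption, since the question mixes "50%" and "16.7%".

```python
from collections import Counter


def format_task_shares(lines):
    """Return one 'user : count, percent%' string per user, most tasks first."""
    users = [line.split(",")[0].strip().lower() for line in lines if line.strip()]
    counts = Counter(users)
    total = sum(counts.values())
    return [
        f"{user} : {count}, {count / total * 100:.1f}%"
        for user, count in counts.most_common()
    ]


# Sample lines copied from the question.
sample_lines = [
    "user1, task", "user2, task", "user1, task",
    "user4, task", "user4, task", "user1, task",
]
for row in format_task_shares(sample_lines):
    print(row)
```

most_common() returns the (user, count) pairs sorted from the highest count to the lowest, which gives the ordering shown in the question without a separate sort step.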

You could simply store all tasks of one user in a dictionary, using a list as the value and appending each incoming task.
The number of tasks per user is just the length of that list; the total number of tasks is the sum of all the lengths:
fn = "d.txt"
# write demo data
with open(fn, "w") as f:
    f.write("""user1, task
user2, task
user1, task
user4, task
user4, task
user1, task""")
from collections import defaultdict
# use a dict whose values default to an empty list
users = defaultdict(list)
with open(fn) as tasks:
    for line in tasks:
        # split your line into 2 parts at the 1st ', ' - use 1st as user, 2nd as task-text
        user, task = line.strip().lower().split(", ", 1)
        # append task to user, autocreates key if needed
        users[user].append(task)
# sum all length values together
total_tasks = sum(map(len, users.values()))
# how much % equals one assigned task?
percent_per_task = 100 / total_tasks
for user, t in users.items():
    # output stuff
    lt = len(t)
    print(user, lt, (lt * percent_per_task), '%')
Output:
user1 3 50.0 %
user2 1 16.666666666666668 %
user4 2 33.333333333333336 %

While there is a lot of merit in learning how to use the basic Python types, the big benefit of Python, from my point of view, is the vast array of libraries available that already solve a large number of common problems.
If you are going to find yourself managing and transforming data files frequently in this project, consider using a library.
import pandas  # import the pandas library
# read your file into a dataframe, which is a table-like object
df = pandas.read_csv('tasks.txt', header=None, names=['user', 'task'])
# count the rows per user; normalize=True gives each count as a fraction, and mul (short for multiply) by 100 turns the fraction into a percentage
print(df['user'].value_counts(normalize=True).mul(100))

Counter module and dictionaries

In a previous exercise, I wrote code that prints the height of each mountain in a csv file. Here it is:
import csv
def mountain_height(filename):
    """ Read in a csv file of mountain names and heights.
    Parse the lines and print the names and heights.
    Return the data as a dictionary.
    The key is the mountain and the height is the value.
    """
    mountains = dict()
    msg = "The height of {} is {} meters."
    err_msg = "Error: File doesn't exist or is unreadable."
    # TYPE YOUR CODE HERE.
    try:
        with open('mountains.csv', 'r') as handle:
            reader = csv.reader(handle, delimiter=',')
            for row in reader:
                name = row[0]
                height = row[1]
                mountains[name] = int(height)
        for name, height in mountains.items():
            print("The height of {names} is {heights} meters.".format(names=name, heights=height))
    except:
        print("Error: Something wrong with your file location?")
        return None
I'm not sure if it's ideal, but it seems to work.
Here's a preview of the csv file:
mountains.csv
Now, I have to rewrite this code using Counter from the collections module, to count how many times each mountain range is mentioned. Each row contains a mountain, its height, and the range it is part of.
I also need to add a dictionary that records all the heights of the mountains in a particular range. I must use a list for the values of the heights. The key will be the range name. Each time there's a new mountain in the range, the height has to be added to the list for that key. For example, after reading all the data, mountains['Himalayas'] == [8848, 8586, 8516, 8485, 8201, 8167, 8163, 8126, 8091, 8027]. (The "Himalayas" are the range.)
The function should add each range name to the counter and print the top 2 ranges.
Then, print the average height of the mountains in each range. Return the dictionary object with the ranges and their lists of mountain heights after all the printing.
I have only a basic understanding of the Counter module and I feel overwhelmed by the task.
Do you have some advice on where to start?
Here's what I've got so far:
from collections import Counter
from collections import defaultdict
from statistics import mean
def mountain_ranges(filename):
    ranges = Counter()
    heights = defaultdict(list)
Thank you in advance....

The following will print out what you asked for, and return the counter and dictionary of heights.
import csv
from collections import defaultdict, Counter
from statistics import mean
def mountain_height(filename):
    """ Read in a csv file of mountain names, heights and ranges.
    Print the 2 most frequent ranges and the average height per range.
    Return the range counter and the dictionary of heights per range.
    """
    range_heights = defaultdict(list)
    range_count = Counter()
    # TYPE YOUR CODE HERE.
    try:
        with open(filename, 'r') as handle:
            reader = csv.reader(handle, delimiter=',')
            for row in reader:
                if row:
                    name = row[0]
                    height = row[1]
                    mrange = row[2]
                    range_heights[mrange].append(int(height))
                    range_count[mrange] += 1
    except OSError:
        # file is missing or unreadable
        print("Error: Something wrong with your file location?")
        return None
    print("The 2 most frequent ranges are:")
    for mrange in range_count.most_common(2):
        print(f"{mrange[0]} has {mrange[1]} mountains")
    print("The average heights of each range are:")
    for mrange, heights in range_heights.items():
        print(f"{mrange} -- {mean(heights)}m")
    return range_count, range_heights
counts, heights = mountain_height('mountains.csv')
Output:
The 2 most frequent ranges are:
Himalayas has 10 mountains
Karakoram has 4 mountains
The average heights of each range are:
Himalayas -- 8321m
Karakoram -- 8194.25m
So you know, I don't personally believe that using a Counter here is necessary or the "right" way to do things, but since it's what you require it's what I've given you. In fact it isn't even the only way you could use a Counter here - you could create a list of each range as you loop through the rows, and then just apply the Counter(list_of_ranges) but for larger files that would mean creating a large list in memory which again seems pointless.
For clarity my personal solution to getting the counts without a Counter would be to just use the range_heights dictionary and dict comprehension like so:
range_counts = {r: len(heights) for r, heights in range_heights.items()}
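As a quick check that the two counting approaches agree, here is a small sketch that runs both on in-memory rows. The rows are made up for illustration (they are not taken from mountains.csv) and follow the name,height,range layout described in the question; io.StringIO stands in for the file.

```python
import csv
import io
from collections import Counter, defaultdict
from statistics import mean

# Made-up rows in the name,height,range layout described in the question.
sample_csv = io.StringIO(
    "Peak A,8000,Range X\n"
    "Peak B,7000,Range X\n"
    "Peak C,6000,Range Y\n"
)

range_count = Counter()
range_heights = defaultdict(list)
for name, height, mrange in csv.reader(sample_csv):
    range_count[mrange] += 1
    range_heights[mrange].append(int(height))

# Same counts derived from the lists of heights, without a Counter.
derived_counts = {r: len(heights) for r, heights in range_heights.items()}

print(range_count.most_common(1))           # the most frequent range
print(derived_counts == dict(range_count))  # both approaches agree
print({r: mean(h) for r, h in range_heights.items()})
```

The Counter mainly saves the call to most_common; the per-range lists already carry the same information.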

Count word appear in csv

I have a plain csv file that starts with these 2 rows:
1.Clubhouse,Fibre Ready,.......
2.Clubhouse,Aircon,.........
3....
I want to use Python to write a program that counts how many times each value appears in the csv file. I have tried several ways but none of them worked.
My output should be like this:
Clubhouse: .... times
Fibre Ready: .... times

You can use collections.Counter:
from collections import Counter
import csv
counter = Counter()
with open('furniture.csv') as fobj:
    reader = csv.reader(fobj)
    for row in reader:
        counter.update(row)
for k, v in counter.items():
    print('{}: {} times'.format(k, v))
Output for your two lines:
Clubhouse: 2 times
Fibre Ready: 1 times
Fitness Corner: 2 times
Aircon: 2 times
...
You can also access single items:
>>> counter['Clubhouse']
2
>>> counter['Fibre Ready']
1
collections.Counter is useful for this type of task:
Dict subclass for counting hashable items. Sometimes called a bag
or multiset. Elements are stored as dictionary keys and their counts
are stored as dictionary values.
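If the cells may carry stray whitespace, or the results should be listed from most to least frequent, the same idea can be sketched like this. The sample rows are invented for illustration, since the rows in the question are truncated, and io.StringIO stands in for the csv file.

```python
import csv
import io
from collections import Counter

# Invented sample rows; the question's actual file contents are truncated.
sample_csv = io.StringIO(
    "Clubhouse,Fibre Ready,Aircon\n"
    "Clubhouse, Aircon ,Gym\n"
)

counter = Counter()
for row in csv.reader(sample_csv):
    # Strip whitespace and skip empty cells before counting.
    counter.update(cell.strip() for cell in row if cell.strip())

# most_common() yields (value, count) pairs, highest count first.
for value, count in counter.most_common():
    print(f"{value}: {count} times")
```

Looking up a value that never occurred, such as counter['Missing'], returns 0 instead of raising a KeyError.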

Why doesn't this return the average of the column of the CSV file?

def averager(filename):
    f=open(filename, "r")
    avg=f.readlines()
    f.close()
    avgr=[]
    final=""
    x=0
    i=0
    while i < range(len(avg[0])):
        while x < range(len(avg)):
            avgr+=str((avg[x[i]]))
            x+=1
        final+=str((sum(avgr)/(len(avgr))))
        clear(avgr)
        i+=1
    return final
The error I get is:
  File "C:\Users\konrad\Desktop\exp\trail3.py", line 11, in averager
    avgr+=str((avg[x[i]]))
TypeError: 'int' object has no attribute '__getitem__'

x is just an integer, so you can't index it.
So, this:
x[i]
Should never work. That's what the error is complaining about.
UPDATE
Since you asked for a recommendation on how to simplify your code (in a below comment), here goes:
Assuming your CSV file looks something like:
-9,2,12,90...
1423,1,51,-12...
...
You can read the file in like this:
with open(filename, 'r') as file_reader:
    file_lines = file_reader.read().split('\n')
Notice that I used .split('\n'). This causes the file's contents to be stored in file_lines as, well, a list of the lines in the file.
So, assuming you want the ith column to be summed, this can easily be done with a comprehension:
ith_col_sum = sum(float(line.split(',')[i]) for line in file_lines if line)
So then to average it all out you could just divide the sum by the number of non-empty lines:
average = ith_col_sum / len([line for line in file_lines if line])

Others have pointed out the root cause of your error. Here is a different way to write your method:
import csv
def csv_average(filename, column):
    """ Returns the average of the values in
    column for the csv file """
    column_values = []
    with open(filename) as f:
        reader = csv.reader(f)
        for row in reader:
            # csv.reader yields strings, so convert before summing
            column_values.append(float(row[column]))
    return sum(column_values) / len(column_values)

Let's pick through this code:
def averager(filename):
averager as a name is not as clear as it could be. How about averagecsv, for example?
f=open(filename, "r")
avg=f.readlines()
avg is poorly named. It isn't the average of everything! It's a bunch of lines. Call it csvlines for example.
f.close()
avgr=[]
avgr is poorly named. What is it? Names should be meaningful, otherwise why give them?
final=""
x=0
i=0
while i < range(len(avg[0])):
while x < range(len(avg)):
As mentioned in comments, you can replace these with for loops, as in for i in range(len(avg[0])):. This saves you from needing to declare and increment the variable in question.
avgr+=str((avg[x[i]]))
Huh? Let's break this line down.
The poorly named avg is our lines from the csv file.
So, we index into avg by x, okay, that would give us the line number x. But... x[i] is meaningless, since x is an integer, and integers don't support array access. I guess what you're trying to do here is... split the file into rows, then the rows into columns, since it's csv. Right?
So let's ditch the code. You want something like this, using the str.split function (http://docs.python.org/2/library/stdtypes.html#str.split):
totalaverage = 0
for col in range(len(csvlines[0].split(","))):
    average = 0
    for row in range(len(csvlines)):
        average += int(csvlines[row].split(",")[col])
    totalaverage += average/len(csvlines)
return totalaverage
BUT wait! There's more! Python has a built-in csv parser that is safer than splitting by ",". Check it out here: http://docs.python.org/2/library/csv.html

In response to OP asking how he should go about this in one of the comments, here is my suggestion:
import csv
from collections import defaultdict
with open('numcsv.csv') as f:
    reader = csv.reader(f)
    numbers = defaultdict(list)  # used so each column starts with a list we can append to
    for row in reader:
        for column, value in enumerate(row, start=1):
            # convert the value to a float: 1. the number may be a float and 2. when we calc the average we need to force float division
            numbers[column].append(float(value))
# simple comprehension to print the averages: %d = integer, %f = float. items() goes over key,value pairs
print('\n'.join(["Column %d had average of: %f" % (i, sum(column)/(len(column))) for i, column in numbers.items()]))
Producing
>>>
Column 1 had average of: 2.400000
Column 2 had average of: 2.000000
Column 3 had average of: 1.800000
For a file:
1,2,3
1,2,3
3,2,1
3,2,1
4,2,1
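The same per-column averages can also be sketched by transposing the rows with zip(*rows) instead of collecting the columns in a defaultdict. This is a variant for illustration, not the code from the answer above: column_averages is a hypothetical name, it assumes every row has the same number of numeric cells, and io.StringIO stands in for numcsv.csv.

```python
import csv
import io
from statistics import mean


def column_averages(file_obj):
    """Return a list with the average of each column of a numeric CSV."""
    rows = [[float(cell) for cell in row] for row in csv.reader(file_obj) if row]
    # zip(*rows) groups the first cells together, then the second cells, ...
    return [mean(column) for column in zip(*rows)]


# The sample file shown above, held in memory.
sample_csv = io.StringIO("1,2,3\n1,2,3\n3,2,1\n3,2,1\n4,2,1\n")
averages = column_averages(sample_csv)
for number, average in enumerate(averages, start=1):
    print("Column %d had average of: %f" % (number, average))
```

Note that zip stops at the shortest row, so a ragged file would silently lose cells; the defaultdict version keeps them.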

Here are two methods. The first one just gets the average for the line (what your code above looks like it's doing). The second gets the average for a column (which is what your question asked).
# This just gets the avg (length) for a line
def averager(filename):
    f = open(filename, "r")
    avg = f.readlines()
    f.close()
    count = 0
    for i in range(len(avg)):
        count += len(avg[i])
    return count / float(len(avg))
# This gets the avg (length) for all "columns"
# char is what we split on , ; | (etc)
def averager2(filename, char):
    f = open(filename, "r")
    avg = f.readlines()
    f.close()
    count = 0  # count of items
    total = 0  # sum of all the lengths
    for i in range(len(avg)):
        cols = avg[i].split(char)
        count += len(cols)
        for j in range(len(cols)):
            total += len(cols[j].strip())  # Remove line endings
    return total / float(count)

Avoiding variables as variable or list names in Python

I'd like to read in a number of text files that have the following structure:
3 560
7 470
2 680
4 620
3 640
...
The first column specifies conditions of a behavioral experiment, the second column reaction times. What I'd like to do is to create an array/list for each condition that contains the reaction times for this condition only. I've previously used Perl for this. As there are many different conditions, I wanted to avoid writing many separate elsif statements and therefore used the condition name in the array name:
push(@{condition.${cond}}, $rt); # $cond being the condition, $rt being the reaction time
For example, creating an array like this:
@condition3 = (560, 640,...);
As I got interested in Python, I wanted to see how I would accomplish this using Python. I found a number of answers discouraging the use of variables in variable names, but have to admit that from these answers I could not derive how I would create lists as described above without reverting to separate elif's for each condition. Is there a way to do this? Any suggestions would be greatly appreciated!
Thomas

A dictionary would be a good way to do this. Try something like this:
from collections import defaultdict
conditions = defaultdict(list)
for cond, rt in data:
    conditions[cond].append(rt)
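The snippet above leaves data undefined. One way to fill that gap, sketched under the assumption that each line holds a condition and a reaction time separated by whitespace, as in the question; read_conditions is a hypothetical name, and io.StringIO stands in for one of the text files.

```python
import io
from collections import defaultdict


def read_conditions(file_obj):
    """Group reaction times by condition; both columns are parsed as ints."""
    conditions = defaultdict(list)
    for line in file_obj:
        if not line.strip():
            continue  # skip blank lines
        cond, rt = map(int, line.split())
        conditions[cond].append(rt)
    return conditions


# io.StringIO stands in for one of the text files from the question.
sample = io.StringIO("3 560\n7 470\n2 680\n4 620\n3 640\n")
conditions = read_conditions(sample)
print(conditions[3])
```

With the sample lines, conditions[3] holds [560, 640], which plays the role of the per-condition Perl array, and no elif per condition is needed because the condition value itself is the dictionary key.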

The following code reads the file data.txt with the format you described and computes a dictionary with the reaction times per experiment:
experiments = {}
with open('data.txt', 'r') as f:
    data = [l.strip() for l in f.readlines()]
for line in data:
    index, value = line.split()
    try:
        experiments[int(index)].append(value)
    except KeyError:
        experiments[int(index)] = [value]
print(experiments)
# prints (key order may vary by Python version): {2: ['680'], 3: ['560', '640'], 4: ['620'], 7: ['470']}
You can now access the reaction times per experiment using experiments[2], experiments[3], et cetera.

This is a perfect application for a dictionary, which is similar to a Perl hash:
data = {}
with open('data.txt') as f:
    for line in f:
        try:
            condition, value = map(int, line.strip().split())
            data.setdefault(condition, []).append(value)
        except Exception:
            print('Bad format for line')
Now you can access your different conditions by indexing data:
>>> data
{2: [680], 3: [560, 640], 4: [620], 7: [470]}
>>> data[3]
[560, 640]

I am not sure why you would think about using elif conditions.
You can store the data in a dictionary, with the key being the value of the first column (the condition) and the corresponding value being a list of reaction times.
For example, the dict would look like:
conditions['3'] -> [560, 640, ...]
