Counter module and dictionaries - python

In a previous exercise, I wrote code that prints the height of each mountain listed in a CSV file. Here it is:
import csv
def mountain_height(filename):
    """ Read in a csv file of mountain names and heights.
    Parse the lines and print the names and heights.
    Return the data as a dictionary.
    The key is the mountain and the height is the value.
    """
    mountains = dict()
    msg = "The height of {} is {} meters."
    err_msg = "Error: File doesn't exist or is unreadable."
    # TYPE YOUR CODE HERE.
    try:
        with open('mountains.csv', 'r') as handle:
            reader = csv.reader(handle, delimiter=',')
            for row in reader:
                name = row[0]
                height = row[1]
                mountains[name] = int(height)
        for name, height in mountains.items():
            print("The height of {names} is {heights} meters.".format(names=name, heights=height))
    except:
        print("Error: Something wrong with your file location?")
        return None
I'm not sure if it's ideal, but it seems to work.
Here's a preview of the csv file:
mountains.csv
Now I have to rewrite this code using Counter from the collections module to count how many times each mountain range is mentioned. Each row contains a mountain, its height, and the range it is part of.
I also need to add a dictionary that records all the heights of the mountains in a particular range. I must use a list for the values of the heights. The key will be the range name. Each time there's a new mountain in the range, the height has to be added to the list for that key. For example, after reading all the data, mountains['Himalayas'] == [8848, 8586, 8516, 8485, 8201, 8167, 8163, 8126, 8091, 8027]. (The "Himalayas" are the range.)
The function should add each range name to the counter as it reads the rows, then print the top 2 ranges.
Then, print the average height of the mountains in each range. Return the dictionary object with the ranges and their lists of mountain heights after all the printing.
I have only a very basic understanding of the Counter module and I feel overwhelmed by the task.
Do you have any advice on where to start?
Here's what I've got so far:
from collections import Counter
from collections import defaultdict
from statistics import mean
def mountain_ranges(filename):
    ranges = Counter()
    heights = defaultdict(list)
Thank you in advance....

The following will print out what you asked for, and return the counter and dictionary of heights.
import csv
from collections import defaultdict, Counter
from statistics import mean
def mountain_height(filename):
    """ Read in a csv file of mountain names, heights and ranges.
    Count the mountains in each range and collect their heights.
    Print the 2 most frequent ranges and the average height per range.
    Return the Counter and the dictionary of heights per range.
    """
    range_heights = defaultdict(list)
    range_count = Counter()
    # TYPE YOUR CODE HERE.
    try:
        with open(filename, 'r') as handle:
            reader = csv.reader(handle, delimiter=',')
            for row in reader:
                if row:  # skip blank lines
                    name = row[0]
                    height = row[1]
                    mrange = row[2]
                    range_heights[mrange].append(int(height))
                    range_count[mrange] += 1
    except OSError:  # file is missing or unreadable
        print("Error: Something wrong with your file location?")
        return None
    print("The 2 most frequent ranges are:")
    for mrange in range_count.most_common(2):
        print(f"{mrange[0]} has {mrange[1]} mountains")
    print("The average heights of each range are:")
    for mrange, heights in range_heights.items():
        print(f"{mrange} -- {mean(heights)}m")
    return range_count, range_heights
counts, heights = mountain_height('mountains.csv')
The 2 most frequent ranges are:
Himalayas has 10 mountains
Karakoram has 4 mountains
The average heights of each range are:
Himalayas -- 8321m
Karakoram -- 8194.25m
So you know, I don't personally believe that using a Counter here is necessary or the "right" way to do things, but since it's what you require it's what I've given you. In fact it isn't even the only way you could use a Counter here - you could create a list of each range as you loop through the rows, and then just apply the Counter(list_of_ranges) but for larger files that would mean creating a large list in memory which again seems pointless.
For clarity my personal solution to getting the counts without a Counter would be to just use the range_heights dictionary and dict comprehension like so:
range_counts = {r: len(heights) for r, heights in range_heights.items()}
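To make the comparison concrete, here is a small sketch of the two alternatives mentioned above: building a Counter from an iterable of range names, and deriving the counts from the lists of heights. The rows are made-up sample data, not the contents of mountains.csv.

```python
from collections import Counter, defaultdict

# Hypothetical sample rows: (mountain, height in meters, range).
rows = [
    ("Peak A", 8000, "Range X"),
    ("Peak B", 7900, "Range X"),
    ("Peak C", 7500, "Range Y"),
]

# Collect the heights per range, as in the function above.
range_heights = defaultdict(list)
for _name, height, mrange in rows:
    range_heights[mrange].append(height)

# Option 1: build the Counter from an iterable of range names.
counts_from_counter = Counter(mrange for _name, _height, mrange in rows)

# Option 2: derive the counts from the lists already collected.
counts_from_lengths = {r: len(heights) for r, heights in range_heights.items()}

print(counts_from_counter.most_common(1))
print(counts_from_lengths)
```

Both options give the same counts; the Counter mainly earns its place when you want most_common().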

Related

How to sum values of an identical key

I need Python to read a .txt file and sum up the hours each student has attended school for the year. I need help understanding how to do this when the same student has multiple lines in the file. The .txt file looks something like this:
John0550
John0550
Sally1007
And the ultimate result I'm looking for in Python is to print out a list like:
John has attended 1100 hours
Sally has attended 1007 hours
I know I can't rely on a dict() because it won't accommodate identical keys. So what is the best way to do this?
Suppose you already have a function named split_line that returns the student's name and hours attended for a given line. Your algorithm would look like this:
hours_attended_per_student = {} # Create an empty dict
with open("my_file.txt", "r") as file:
    for line in file:
        name, hours = split_line(line)
        # Check whether you have already counted some hours for this student
        if name not in hours_attended_per_student:
            # Student was not encountered yet, set their hours to 0
            hours_attended_per_student[name] = 0
        # Now that the student's name is in the dict, add the hours attended
        hours_attended_per_student[name] += hours
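The answer above leaves split_line undefined. A minimal sketch of what it could look like, assuming every line is a name followed directly by a zero-padded number, as in the sample file (the pattern and the error handling are my assumptions, not part of the original answer):

```python
import re

# A name made of non-digit characters followed by the hours as digits.
LINE_PATTERN = re.compile(r"([^\d]+)(\d+)")

def split_line(line):
    """Return (name, hours) for a line such as 'John0550'."""
    match = LINE_PATTERN.fullmatch(line.strip())
    if match is None:
        raise ValueError(f"Unexpected line format: {line!r}")
    return match.group(1), int(match.group(2))

# Usage with the sample lines from the question.
totals = {}
for line in ["John0550\n", "John0550\n", "Sally1007\n"]:
    name, hours = split_line(line)
    totals[name] = totals.get(name, 0) + hours
print(totals)  # {'John': 1100, 'Sally': 1007}
```

Note that dict.get with a default does the same job as the explicit "if name not in ..." check.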
A defaultdict could be helpful here:
import re
from collections import defaultdict
from io import StringIO
# Simulate File
with StringIO('''John0550
John0550
Sally1007''') as f:
    # Create defaultdict initialized at 0
    d = defaultdict(lambda: 0)
    # For each line in the file
    for line in f.readlines():
        # Split Name from Value
        name, value = re.split(r'(^[^\d]+)', line)[1:]
        # Sum Value into dict
        d[name] += int(value)
# For Display
print(dict(d))
Output:
{'John': 1100, 'Sally': 1007}
Assuming values are already split and parsed:
from collections import defaultdict
entries = [('John', 550), ('John', 550), ('Sally', 1007)]
d = defaultdict(int)
for name, value in entries:
    # Sum Value into dict
    d[name] += int(value)
# For Display
print(dict(d))

Python How do I sum the data values

Currently my program reads the data file (electricity readings), sums the values in each row, and then changes negative sums to 0 while keeping positive sums as they are. The program does this perfectly. This is the code I currently have:
import csv
from datetime import timedelta
from collections import defaultdict
def convert(item):
    try:
        return float(item)
    except ValueError:
        return 0
sums = defaultdict(list)
def daily():
    lista = []
    with open('Data.csv', 'r') as inp:
        reader = csv.reader(inp, delimiter = ';')
        headers = next(reader)
        for line in reader:
            mittaus = max(0,sum([convert(i) for i in line[1:-2]]))
            lista.append()
            #print(line[0],mittaus) ('#'only printing to check that it works ok)
daily()
My question is: how can I save the data to lists so I can use them, and add up all the values per day? The result should look something like this:
1.1.2016;358006
2.1.2016;39
3.1.2016;0 ...
8.1.2016;239143
Once these are in a list (to be saved later to a new data file), the program should calculate the cumulative values, which should look like this:
1.1.2016;358006
2.1.2016;358045
3.1.2016;358045...
8.1.2016;597188
Having done this, it should be ready to write the data to a new csv file.
A small peek at what's in the data file: https://pastebin.com/9HxwcixZ [It's actually delimited with ';', not with ' ' as in the pastebin]
The data file: https://files.fm/u/yuf4bbuk
I have clarified the questions, so you might have seen me ask before. These should be done without external libraries. I hope to find some help.
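One possible approach, using only the standard library, is sketched below. It assumes each reading has already been reduced to a (day, value) pair; the helper names and the split of the first day into two readings are hypothetical, chosen so the totals match the numbers in the question.

```python
from collections import defaultdict

def daily_totals(readings):
    """Sum the values per day, keeping days in first-seen order."""
    totals = defaultdict(int)
    for day, value in readings:
        totals[day] += value
    return list(totals.items())

def cumulative(daily):
    """Turn per-day totals into running (cumulative) totals."""
    running = 0
    result = []
    for day, value in daily:
        running += value
        result.append((day, running))
    return result

# Hypothetical readings; in the real program these would come from Data.csv.
readings = [("1.1.2016", 358000), ("1.1.2016", 6), ("2.1.2016", 39), ("3.1.2016", 0)]
per_day = daily_totals(readings)
running_totals = cumulative(per_day)

# Lines in the 'date;value' format shown in the question.
for day, total in running_totals:
    print(f"{day};{total}")
```

The resulting lines could then be written to a new file with csv.writer(out, delimiter=';').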

Outputting Bubble sorting

I have a file called countries.txt that lists every country by name, area (in km2), and population (e.g. ["Afghanistan", 647500.0, 25500100]).
def readCountries(filename):
    result=[]
    lines=open(filename)
    for line in lines:
        result.append(line.strip('\n').split(',\t'))
    for sublist in result:
        sublist[1]=float(sublist[1])
        sublist[2]=int(sublist[2])
    return result
I am trying to sort the list with a bubble sort according to the area of each country:
>>> c = countryByArea(7)
>>> c
["India",3287590.0,1239240000]
Given n as the parameter, it should return the country with the nth largest area.
I have this, but I'm not sure how to output the information:
def countryByArea(area):
    myList=readCountries('countries.txt')
    for i in range(0,len(list)):
        for j in range(0,len(list)-1):
            if list[j]>list[j+1]:
                temp=list[j]
                list[j]=list[j+1]
                list[j+1]=temp
First of all, implement a generic bubble sort function. This is a correct bubble sort implementation; I'm sure you can find other implementations on http://rosettacode.org
def bubble_sort(a_list,a_key):
    changed=True
    while changed:
        changed = False
        for i in range(len(a_list)-1):
            if a_key(a_list[i]) > a_key(a_list[i+1]):
                a_list[i],a_list[i+1] = a_list[i+1],a_list[i]
                changed = True
Then simply pass a key function that represents the data you want to sort by (in this case the middle value, index 1 of each row):
import csv
def sort_by_area(fname):
    with open(fname) as f:
        a = list(csv.reader(f))
    bubble_sort(a,lambda row:float(row[1])) # areas such as 647500.0 are floats, not ints
    return a
a = sort_by_area("a_file.txt")
print(a[-7]) #the 7th largest by area
You can take this info and combine it to complete your assignment, but really this is a question you should have asked a classmate or your teacher for help with.
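Putting the pieces together, a self-contained sketch might look like the following. It works on an in-memory list rather than countries.txt; nth_largest_by_area is a hypothetical name, and the "Smallland" row is made up so the example has three rows to sort.

```python
def bubble_sort(rows, key):
    """Sort rows in place, ascending by key(row), using bubble sort."""
    changed = True
    while changed:
        changed = False
        for i in range(len(rows) - 1):
            if key(rows[i]) > key(rows[i + 1]):
                rows[i], rows[i + 1] = rows[i + 1], rows[i]
                changed = True

def nth_largest_by_area(rows, n):
    """Return the row with the nth largest area (index 1 of each row)."""
    ordered = list(rows)  # copy so the caller's list is left untouched
    bubble_sort(ordered, key=lambda row: row[1])
    return ordered[-n]

countries = [
    ["Afghanistan", 647500.0, 25500100],
    ["India", 3287590.0, 1239240000],
    ["Smallland", 2.0, 1000],  # made-up row for illustration
]
print(nth_largest_by_area(countries, 1))
```

Because the sort is ascending, the nth largest row sits at index -n, which is why the function returns ordered[-n].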

Vector data from a file

I am just starting out with Python. I have some fortran and some Matlab skills, but I am by no means a coder. I need to post-process some output files.
I can't figure out how to read each value into the respective variable. The data looks something like this:
h5097600N1 2348.13 2348.35 -0.2219 20.0 -4.438
h5443200N1 2348.12 2348.36 -0.2326 20.0 -4.651
h8467200N2 2348.11 2348.39 -0.2813 20.0 -5.627
...
In my limited Matlab notation, I would like to assign the following variables of the form tN1(i,j) something like this:
tN1(1,1)=5097600; tN1(1,2)=5443200; tN2(1,3)=8467200; #time between 'h' and 'N#'
hmN1(1,1)=2348.13; hmN1(1,2)=2348.12; hmN2(1,3)=2348.11; #value in 2nd column
hsN1(1,1)=2348.35; hsN1(1,2)=2348.36; hsN2(1,3)=2348.39; #value in 3rd column
I will have about 30 sets, or tN1(1:30,1:j); hmN1(1:30,1:j);hsN1(1:30,1:j)
I know it may not seem like it, but I have been trying to figure this out for 2 days now. I am trying to learn this on my own and it seems I am missing something fundamental in my understanding of python.
I wrote a simple script which does what you ask. It creates three dictionaries, t, hm and hs, keyed by the N values.
import csv
import re
path = 'vector_data.txt'
# Using the <with func as obj> syntax handles the closing of the file for you.
with open(path) as in_file:
    # Use the csv package to read csv files
    csv_reader = csv.reader(in_file, delimiter=' ')
    # Create empty dictionaries to store the values
    t = dict()
    hm = dict()
    hs = dict()
    # Iterate over all rows
    for row in csv_reader:
        # Get the <n> and <t_i> values by using regular expressions, only
        # save the integer part (hence [1:] and [1:-1])
        n = int(re.findall('N[0-9]+', row[0])[0][1:])
        t_i = int(re.findall('h.+N', row[0])[0][1:-1])
        # Cast the other values to float
        hm_i = float(row[1])
        hs_i = float(row[2])
        # Try to append the values to an existing list in the dictionaries.
        # If that fails, new lists are added to the dictionaries.
        try:
            t[n].append(t_i)
            hm[n].append(hm_i)
            hs[n].append(hs_i)
        except KeyError:
            t[n] = [t_i]
            hm[n] = [hm_i]
            hs[n] = [hs_i]
Output:
>>> t
{1: [5097600, 5443200], 2: [8467200]}
>>> hm
{1: [2348.13, 2348.12], 2: [2348.11]}
>>> hs
{1: [2348.35, 2348.36], 2: [2348.39]}
(remember that Python uses zero-indexing)
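As a variation on the same idea, here is a sketch that uses defaultdict instead of the try/except KeyError pattern, and str.split() so that runs of several spaces between columns do not matter. The function name parse_lines and the label pattern are my own assumptions based on the three sample lines in the question.

```python
import re
from collections import defaultdict

# Labels look like 'h5097600N1': 'h', the time, 'N', then the set number.
LABEL = re.compile(r"h(\d+)N(\d+)")

def parse_lines(lines):
    """Group time, hm and hs values by the set number after 'N'."""
    t, hm, hs = defaultdict(list), defaultdict(list), defaultdict(list)
    for line in lines:
        fields = line.split()  # splits on any run of whitespace
        if not fields:
            continue  # skip blank lines
        match = LABEL.fullmatch(fields[0])
        if match is None:
            continue  # skip lines that do not start with a data label
        n = int(match.group(2))
        t[n].append(int(match.group(1)))
        hm[n].append(float(fields[1]))
        hs[n].append(float(fields[2]))
    return dict(t), dict(hm), dict(hs)

sample = [
    "h5097600N1 2348.13 2348.35 -0.2219 20.0 -4.438",
    "h5443200N1 2348.12 2348.36 -0.2326 20.0 -4.651",
    "h8467200N2 2348.11 2348.39 -0.2813 20.0 -5.627",
]
t, hm, hs = parse_lines(sample)
print(t)
```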
Thanks for all your comments. Suggested readings led to other things which helped. Here is what I came up with:
if len(line) >= 45:
    if line[0:45] == " FIT OF SIMULATED EQUIVALENTS TO OBSERVATIONS": #! indicates data to follow, after 4 lines of junk text
        for i in range (0,4):
            junk = file.readline()
        for i in range (0,int(nobs)):
            line = file.readline()
            sline = line.split()
            obsname.append(sline[0])
            hm.append(sline[1])
            hs.append(sline[2])

Why doesn't this return the average of the column of the CSV file?

def averager(filename):
    f=open(filename, "r")
    avg=f.readlines()
    f.close()
    avgr=[]
    final=""
    x=0
    i=0
    while i < range(len(avg[0])):
        while x < range(len(avg)):
            avgr+=str((avg[x[i]]))
            x+=1
        final+=str((sum(avgr)/(len(avgr))))
        clear(avgr)
        i+=1
    return final
The error I get is:
File "C:\Users\konrad\Desktop\exp\trail3.py", line 11, in averager
avgr+=str((avg[x[i]]))
TypeError: 'int' object has no attribute '__getitem__'
x is just an integer, so you can't index it.
So, this:
x[i]
Should never work. That's what the error is complaining about.
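A short sketch of the difference, using two made-up lines as a stand-in for what f.readlines() returns: indexing the integer fails, while indexing the list first and then splitting the line reaches the value that was probably intended.

```python
lines = ["-9,2,12\n", "1423,1,51\n"]  # hypothetical stand-in for f.readlines()
x, i = 1, 2

try:
    lines[x[i]]  # x is an int, so x[i] fails before lines is even indexed
except TypeError as exc:
    print("TypeError:", exc)

# Index the list first to get the line, then split it to get the column.
value = int(lines[x].strip().split(",")[i])
print(value)  # 51
```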
UPDATE
Since you asked for a recommendation on how to simplify your code (in a comment below), here goes:
Assuming your CSV file looks something like:
-9,2,12,90...
1423,1,51,-12...
...
You can read the file in like this:
with open(<filename>, 'r') as file_reader:
    file_lines = file_reader.read().split('\n')
Notice that I used .split('\n'). This causes the file's contents to be stored in file_lines as, well, a list of the lines in the file.
So, assuming you want the ith column to be summed, this can easily be done with comprehensions:
ith_col_sum = sum(float(line.split(',')[i]) for line in file_lines if line)
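So then to average it all out you could just divide the sum by the number of non-empty lines (a trailing newline leaves an empty string at the end of file_lines, which must not be counted):
average = ith_col_sum / sum(1 for line in file_lines if line)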
Others have pointed out the root cause of your error. Here is a different way to write your method:
import csv
def csv_average(filename, column):
    """ Returns the average of the values in
    column for the csv file """
    column_values = []
    with open(filename) as f:
        reader = csv.reader(f)
        for row in reader:
            column_values.append(float(row[column])) # csv yields strings, so convert before summing
    return sum(column_values) / len(column_values)
Let's pick through this code:
def averager(filename):
averager as a name is not as clear as it could be. How about averagecsv, for example?
f=open(filename, "r")
avg=f.readlines()
avg is poorly named. It isn't the average of everything! It's a bunch of lines. Call it csvlines for example.
f.close()
avgr=[]
avgr is poorly named. What is it? Names should be meaningful, otherwise why give them?
final=""
x=0
i=0
while i < range(len(avg[0])):
while x < range(len(avg)):
As mentioned in comments, you can replace these with for loops, as in for i in range(len(avg[0])):. This saves you from needing to declare and increment the variable in question.
avgr+=str((avg[x[i]]))
Huh? Let's break this line down.
The poorly named avg is our lines from the csv file.
So, we index into avg by x, okay, that would give us the line number x. But... x[i] is meaningless, since x is an integer, and integers don't support array access. I guess what you're trying to do here is... split the file into rows, then the rows into columns, since it's csv. Right?
So let's ditch the code. You want something like this, using the split http://docs.python.org/2/library/stdtypes.html#str.split function:
totalaverage = 0
for col in range(len(csvlines[0].split(","))):
    average = 0
    for row in range(len(csvlines)):
        average += int(csvlines[row].split(",")[col])
    totalaverage += average/len(csvlines)
return totalaverage
BUT wait! There's more! Python has a built in csv parser that is safer than splitting by ,. Check it out here: http://docs.python.org/2/library/csv.html
In response to OP asking how he should go about this in one of the comments, here is my suggestion:
import csv
from collections import defaultdict
with open('numcsv.csv') as f:
    reader = csv.reader(f)
    numbers = defaultdict(list) # each column number starts with an empty list we can append to
    for row in reader:
        for column, value in enumerate(row,start=1):
            numbers[column].append(float(value)) # convert to float: 1. the value may not be an integer and 2. the average needs float division
#simple comprehension to print the averages: %d = integer, %f = float. items() goes over key,value pairs
print('\n'.join(["Column %d had average of: %f" % (i,sum(column)/(len(column))) for i,column in numbers.items()]))
Producing
>>>
Column 1 had average of: 2.400000
Column 2 had average of: 2.000000
Column 3 had average of: 1.800000
For a file:
1,2,3
1,2,3
3,2,1
3,2,1
4,2,1
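The same column averages can be cross-checked with a shorter sketch that transposes the rows with zip and uses statistics.mean. The function name column_averages is hypothetical, and io.StringIO stands in for the real file so the example is self-contained.

```python
import csv
import io
from statistics import mean

def column_averages(handle):
    """Return the average of each column of a numeric CSV file object."""
    rows = [row for row in csv.reader(handle) if row]  # skip blank lines
    columns = zip(*rows)  # transpose: one tuple per column
    return [mean(float(value) for value in column) for column in columns]

# The five-line example file from above, simulated in memory.
sample = io.StringIO("1,2,3\n1,2,3\n3,2,1\n3,2,1\n4,2,1\n")
averages = column_averages(sample)
print(averages)
```

Note that zip(*rows) stops at the shortest row, so this sketch assumes every row has the same number of columns.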
Here are two methods. The first one just gets the average per line (what your code above looks like it's doing). The second gets the average per column (which is what your question asked). Note that both average the lengths of the text, not the numeric values.
''' This just gets the avg length of a line'''
def averager(filename):
    f=open(filename, "r")
    avg = f.readlines()
    f.close()
    count = 0
    for i in range(len(avg)):
        count += len(avg[i])
    return count/len(avg)
''' This gets the avg length for all "columns"
char is what we split on , ; | (etc)
'''
def averager2(filename, char):
    f=open(filename, "r")
    avg = f.readlines()
    f.close()
    count = 0 # count of items
    total = 0 # sum of all the lengths
    for i in range(len(avg)):
        cols = avg[i].split(char)
        count += len(cols)
        for j in range(len(cols)):
            total += len(cols[j].strip()) # Remove line endings
    return total/float(count)
