How to find the average of values in a .txt file - python

I need to find the minimum, maximum, and average of the values given in a .txt file. I've been able to find the minimum and maximum values, but I'm struggling with the average. I haven't written any code for it yet, as I have no clue where to start. My current code is:
def summaryStats():
    filename = input("Enter a file name: ")
    file = open(filename)
    data = file.readlines()
    data = data[0:]
    print("The minimum value is " + min(data))
    print("The maximum value is " + max(data))
I need to be able to return the average of these values. As of now the .txt document has the following values:
893
255
504
I'm struggling to find the average because every way I try to compute the sum, my result is 0.
Thanks
(sorry I'm just learning to work with files)

You should first convert the data retrieved from the file to integers, because your data list contains strings, not numbers. After the conversion, the average is easy to compute:
Why is conversion to int required?
>>> '2' > '10' #strings are compared lexicographically
True
Code:
def summaryStats():
    filename = input("Enter a file name: ")
    with open(filename) as f:
        data = [int(line) for line in f]
    print("The minimum value is ", min(data))
    print("The maximum value is ", max(data))
    print("The average value is ", sum(data)/len(data))
Output:
Enter a file name: abc1
The minimum value is 255
The maximum value is 893
The average value is 550.6666666666666
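If you'd rather not compute the average by hand, the standard library's statistics.mean (Python 3.4+) does it for you. A minimal sketch along the same lines (the function name here is just illustrative):
from statistics import mean

def summary_stats():
    filename = input("Enter a file name: ")
    with open(filename) as f:
        data = [int(line) for line in f if line.strip()]  # skip blank lines
    print("The average value is", mean(data))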

Don't reinvent the wheel: with numpy it takes only a couple of instructions. You can load a .txt file into a numpy array and then use the built-in functions to perform the operations you want:
>>> import numpy as np
>>> data = np.loadtxt('data.txt')
>>> np.average(data)
550.666666667
>>> np.max(data)
893.0
>>> np.min(data)
255.0


Unable to append a list with numpy values

I want to calculate the average vector length from a file that contains coordinates. Ultimately I want to store each vector_length in a list called pair_length. I will calculate the average of the pair_length list later in my program using the average() function. Here is a snippet of my code:
from numpy import sqrt
from itertools import islice
from statistics import mean

data = open("coords.txt","r")

def average():
    return mean()

pair_length = []
for line in islice(data, 1, None): #the first line is the number of pairs
    fields = line.split(" ")
    pair_num = int(fields[0]) #the first field is the pair number
    x_cord = float(fields[1]) #x-coordinate
    y_cord = float(fields[2]) #y-coordinate
    vector_length = sqrt(x_cord**2 + y_cord**2) #vector length (all numbers in the coords.txt file are real and positive)
    vector_length.append(pair_length)
I receive the error:
AttributeError: 'numpy.float64' object has no attribute 'append'
Here vector_length stores a float value, so the append operation won't work on it; in Python, append is a method of lists. So, what we can do is:
Instead of
vector_length.append(pair_length)
We can do as follows:
pair_length.append(vector_length)
Hope this works.
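For reference, a minimal corrected sketch of the whole loop, substituting math.sqrt for the numpy import and statistics.mean for the unused average() stub (my substitutions, not the original poster's):
from math import sqrt
from itertools import islice
from statistics import mean

pair_length = []
with open("coords.txt") as data:
    for line in islice(data, 1, None):  # the first line is the number of pairs
        fields = line.split()
        x_cord = float(fields[1])  # x-coordinate
        y_cord = float(fields[2])  # y-coordinate
        pair_length.append(sqrt(x_cord**2 + y_cord**2))  # append each length to the list

print(mean(pair_length))  # average vector length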

Is there a way in Python to find a file with the smallest number in its name?

I have a bunch of documents created by one script that are all called like this:
name_*score*
*score* is a float, and in another script I need to identify the file with the smallest number in the folder. Example:
name_123.12
name_145.45
This should return string "name_123.12"
min takes a key function. You can use that to define the way min is calculated:
files = [
    "name_123.12",
    "name_145.45",
    "name_121.45",
    "name_121.457"
]

min(files, key=lambda x: float(x.split('_')[1]))
# name_121.45
You can try getting the number part first, then converting it to float and taking the minimum. For example:
new_list = [float(name[5:]) for name in YOURLIST] # trim out the unrelated part and convert to float
result = 'name_' + str(min(new_list)) # this is your result
Just wanted to say Mark Meyer is completely right on this one, but you also mentioned that you were reading these file names from a directory. In that case, there is a bit of code you could add to Mark's answer:
import glob, os
os.chdir("/path/to/directory")
files = glob.glob("*")
print(min(files, key=lambda x: float((x.split('_')[1]))))
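One caveat: glob.glob("*") matches every file in the directory, and float() will raise a ValueError on any name that doesn't follow the name_*score* pattern. Assuming all score files share the name_ prefix, a narrower pattern avoids that:
files = glob.glob("name_*")  # only the score files, so float() won't choke on other names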
A way to get the lowest value by providing a directory:
import os
import re

def get_lowest(directory):
    lowest = float('inf')  # sys.maxint is gone in Python 3; float('inf') works everywhere
    for filename in os.listdir(directory):
        match = re.match(r'name_(\d+(?:\.\d+)?)', filename)
        if match:
            value = float(match.group(1))
            if value < lowest:
                lowest = value
    return lowest

print(get_lowest('./'))
Expanded on Tim Biegeleisen's answer, thank you Tim!
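Note that get_lowest returns the lowest number itself, while the question asks for the filename. A small variation (a sketch under the same assumptions) tracks the name as well:
import os
import re

def get_lowest_file(directory):
    lowest = float('inf')
    lowest_name = None
    for filename in os.listdir(directory):
        match = re.match(r'name_(\d+(?:\.\d+)?)$', filename)
        if match:
            value = float(match.group(1))
            if value < lowest:
                lowest, lowest_name = value, filename
    return lowest_name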

Returning .txt file contents

I have a file, Testing.txt:
type,stan,820000000,92
paul,tanner,820000095,54
remmy,gono,820000046,68
bono,Jose,820000023,73
simple,rem,820000037,71
I'm trying to create a function that takes this file and returns:
The average of all the grades (the last number on each line),
and the IDs (the long numbers in the file) of the highest and lowest grades.
I know how to get the average but am stuck trying to get the IDs.
So far my code looks like this:
#Function:
def avg_file(filename):
    with open(filename, 'r') as f:
        data = [int(line.split()[2]) for line in f]
        return sum(data)/len(data)
    avg = avg_file(filename)
    return avg

#main program:
import q3_function
filename = "testing.txt"
average = q3_function.avg_file(filename)
print (average)
You can use a list comprehension to get the desired pairs of ID and score:
>>> l = [line.strip().split(',')[-2:] for line in open(filename) if line.strip()]
>>> l
[['820000000', '92'], ['820000095', '54'], ['820000046', '68'], ['820000023', '73'], ['820000037', '71']]
Then to calculate the average you can use zip with map and sum (in Python 3, zip returns an iterator, so wrap it in list before indexing):
>>> avg = sum(map(int, list(zip(*l))[1])) / len(l)
>>> avg
71.6
And for the min and max IDs, use the built-in functions min and max with a numeric key (comparing the raw score strings would be lexicographic):
max_id = max(l, key=lambda p: int(p[1]))[0]
min_id = min(l, key=lambda p: int(p[1]))[0]
Demo:
>>> max(l, key=lambda p: int(p[1]))
['820000000', '92']
>>> max(l, key=lambda p: int(p[1]))[0]
'820000000'
>>> min(l, key=lambda p: int(p[1]))
['820000095', '54']
>>> min(l, key=lambda p: int(p[1]))[0]
'820000095'
I think using the python csv module would help.
Here are several examples: http://nbviewer.ipython.org/github/rasbt/python_reference/blob/master/tutorials/sorting_csvs.ipynb
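For instance, a short sketch with the csv module (the function name and return format are my own choices, not from the question):
import csv

def grade_stats(filename):
    with open(filename, newline='') as f:
        rows = [(row[2], int(row[3])) for row in csv.reader(f) if row]  # (ID, grade) pairs
    average = sum(grade for _, grade in rows) / len(rows)
    highest_id = max(rows, key=lambda r: r[1])[0]
    lowest_id = min(rows, key=lambda r: r[1])[0]
    return average, highest_id, lowest_id

print(grade_stats("testing.txt"))  # (71.6, '820000000', '820000095')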

Why doesn't this return the average of the column of the CSV file?

def averager(filename):
    f=open(filename, "r")
    avg=f.readlines()
    f.close()
    avgr=[]
    final=""
    x=0
    i=0
    while i < range(len(avg[0])):
        while x < range(len(avg)):
            avgr+=str((avg[x[i]]))
            x+=1
        final+=str((sum(avgr)/(len(avgr))))
        clear(avgr)
        i+=1
    return final
The error I get is:
File "C:\Users\konrad\Desktop\exp\trail3.py", line 11, in averager
avgr+=str((avg[x[i]]))
TypeError: 'int' object has no attribute '__getitem__'
x is just an integer, so you can't index it.
So this:
x[i]
should never work. That's what the error is complaining about.
UPDATE
Since you asked in a comment below for a recommendation on how to simplify your code, here goes:
Assuming your CSV file looks something like:
-9,2,12,90...
1423,1,51,-12...
...
You can read the file in like this:
with open(<filename>, 'r') as file_reader:
    file_lines = file_reader.read().split('\n')
Notice that I used .split('\n'). This causes the file's contents to be stored in file_lines as, well, a list of the lines in the file.
So, assuming you want the ith column to be summed, this can easily be done with comprehensions:
ith_col_sum = sum(float(line.split(',')[i]) for line in file_lines if line)
So then to average it all out you could just divide the sum by the number of non-empty lines:
average = ith_col_sum / sum(1 for line in file_lines if line)  # avoid counting a trailing blank line
Others have pointed out the root cause of your error. Here is a different way to write your method:
import csv

def csv_average(filename, column):
    """ Returns the average of the values in
    column for the csv file """
    column_values = []
    with open(filename) as f:
        reader = csv.reader(f)
        for row in reader:
            column_values.append(float(row[column]))  # csv fields are strings, so convert
    return sum(column_values) / len(column_values)
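A hypothetical usage (the filename and column index are just for illustration):
print(csv_average('numbers.csv', 2))  # average of the third column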
Let's pick through this code:
def averager(filename):
averager as a name is not as clear as it could be. How about averagecsv, for example?
f=open(filename, "r")
avg=f.readlines()
avg is poorly named. It isn't the average of everything! It's a bunch of lines. Call it csvlines for example.
f.close()
avgr=[]
avgr is poorly named. What is it? Names should be meaningful, otherwise why give them?
final=""
x=0
i=0
while i < range(len(avg[0])):
while x < range(len(avg)):
As mentioned in comments, you can replace these with for loops, as in for i in range(len(avg[0])):. This saves you from needing to declare and increment the variable in question.
avgr+=str((avg[x[i]]))
Huh? Let's break this line down.
The poorly named avg is our lines from the csv file.
So, we index into avg by x, okay, that would give us the line number x. But... x[i] is meaningless, since x is an integer, and integers don't support array access. I guess what you're trying to do here is... split the file into rows, then the rows into columns, since it's csv. Right?
So let's ditch the code. You want something like this, using the str.split function (http://docs.python.org/2/library/stdtypes.html#str.split):
totalaverage = 0
for col in range(len(csvlines[0].split(","))):
    average = 0
    for row in range(len(csvlines)):
        average += int(csvlines[row].split(",")[col])
    totalaverage += average/len(csvlines)
return totalaverage
BUT wait! There's more! Python has a built-in csv parser that is safer than splitting by ,. Check it out here: http://docs.python.org/2/library/csv.html
In response to the OP asking in a comment how to go about this, here is my suggestion:
import csv
from collections import defaultdict

with open('numcsv.csv') as f:
    reader = csv.reader(f)
    numbers = defaultdict(list)  # so each column starts with a list we can append to
    for row in reader:
        for column, value in enumerate(row, start=1):
            # convert to float: the number may be a float, and the
            # average calculation needs float division
            numbers[column].append(float(value))
    # print the averages: %d = integer, %f = float; items() iterates over (key, value) pairs
    print('\n'.join(["Column %d had average of: %f" % (i, sum(column)/len(column))
                     for i, column in numbers.items()]))
Producing
>>>
Column 1 had average of: 2.400000
Column 2 had average of: 2.000000
Column 3 had average of: 1.800000
For a file:
1,2,3
1,2,3
3,2,1
3,2,1
4,2,1
Here are two methods. The first just gets the average for a line (which is what your code above looks like it's doing). The second gets the average for a column (which is what your question asked):
''' This just gets the avg for a line '''
def averager(filename):
    f = open(filename, "r")
    avg = f.readlines()
    f.close()
    count = 0
    for i in xrange(len(avg)):
        count += len(avg[i])
    return count/len(avg)

''' This gets the avg for all "columns"
    char is what we split on: , ; | (etc.)
'''
def averager2(filename, char):
    f = open(filename, "r")
    avg = f.readlines()
    f.close()
    count = 0  # count of items
    total = 0  # sum of all the lengths
    for i in xrange(len(avg)):
        cols = avg[i].split(char)
        count += len(cols)
        for j in xrange(len(cols)):
            total += len(cols[j].strip())  # remove line endings
    return total/float(count)

What is the lightest way of doing this task?

I have a file whose contents are of the form:
.2323 1
.2327 1
.3432 1
.4543 1
and so on some 10,000 lines in each file.
I have a variable whose value is, say, a=.3344.
From the file I want to get the row number of the row whose first column is closest to this variable; for example, it should give row_num='3', as .3432 is closest to it.
I have tried loading the first column's elements into a list and then comparing the variable to each element to get the index.
This method is very time-consuming and slows my model; I need something very quick, as it will be called some 1,000 times at a minimum.
I want a method with the least overhead. Can anyone tell me how this can be done very fast?
Since the file is at most 100 KB, can this be done directly, without loading it into a list? If so, how?
Any method quicker than the one above is welcome; I am desperate to improve the speed.
def get_list(file, cmp, fout):
    ind, _ = min(enumerate(file), key=lambda x: abs(x[1] - cmp))
    return fout[ind].rstrip('\n').split(' ')

#root = r'c:\begpython\wavnk'
header = 6
for lst in lists:
    save = database_index[lst]
    #print save
    index, base, abs2, _, abs1 = save
    using_data[index] = save
    base = 'C:/begpython/wavnk/' + base.replace('phone', 'text')
    fin, fout = base + '.pm', base + '.mcep'
    file = open(fin)
    fout = open(fout).readlines()
    [next(file) for _ in range(header)]
    file = [float(line.partition(' ')[0]) for line in file]
    join_cost_index_end[index] = get_list(file, float(abs1), fout)
    join_cost_index_strt[index] = get_list(file, float(abs2), fout)
This is the code I was using, copying the file into a list. Please suggest better alternatives.
Building on John Kugelman's answer, here's a way you might be able to do a binary search on a file with fixed-length lines:
class SubscriptableFile(object):
    def __init__(self, file):
        self._file = file
        file.seek(0, 0)
        self._line_length = len(file.readline())
        file.seek(0, 2)
        self._len = file.tell() // self._line_length  # integer division: number of lines

    def __len__(self):
        return self._len

    def __getitem__(self, key):
        self._file.seek(key * self._line_length)
        s = self._file.readline()
        if s:
            return float(s.split()[0])
        else:
            raise KeyError('Line number too large')
This class wraps a file in a list-like structure, so that now you can use the functions of the bisect module on it:
import bisect

def find_row(file, target):
    fw = SubscriptableFile(file)
    i = bisect.bisect_left(fw, target)
    if i == 0:
        return 0
    if i == len(fw):
        return len(fw) - 1
    # bisect_left returns the first position whose value is >= target,
    # so the closest line is either that one or the one just before it
    if fw[i] - target < target - fw[i - 1]:
        return i
    return i - 1
Here file is an open file object and target is the number you want to find. The function returns the number of the line with the closest value.
I will note, however, that the bisect module will try to use a C implementation of its binary search when it is available, and I'm not sure if the C implementation supports this kind of behavior. It might require a true list, rather than a "fake list" (like my SubscriptableFile).
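A hypothetical usage sketch, assuming numbers.txt holds sorted, fixed-width lines like the sample in the question:
with open('numbers.txt') as f:
    print(find_row(f, 0.3344))  # 2: the zero-based line number of .3432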
Is the data in the file sorted in numerical order? Are all the lines of the same length? If not, the simplest approach is best. Namely, reading through the file line by line. There's no need to store more than one line in memory at a time.
Code:
def closest(num):
    closest_row = None
    closest_row_num = None
    closest_value = None
    for row_num, row in enumerate(open('numbers.txt')):
        value = float(row.split()[0])
        if closest_value is None or abs(value - num) < abs(closest_value - num):
            closest_row = row
            closest_row_num = row_num
            closest_value = value
    return (closest_row_num, closest_row)

print(closest(.3344))
Output for sample data:
(2, '.3432 1\n')
If the lines are all the same length and the data is sorted then there are some optimizations that will make this a very fast process. All the lines being the same length would let you seek directly to particular lines (you can't do this in a normal text file with lines of different length). Which would then enable you to do a binary search.
A binary search would be massively faster than a linear search. A linear search will on average have to read 5,000 lines of a 10,000 line file each time, whereas a binary search would on average only read log2 10,000 ≈ 13 lines.
Load it into a list then use bisect.
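A minimal sketch of that approach, assuming the first column is sorted in ascending order as in the sample:
import bisect

with open('numbers.txt') as f:
    values = [float(line.split()[0]) for line in f if line.strip()]

def closest_row(values, target):
    i = bisect.bisect_left(values, target)
    if i == 0:
        return 0
    if i == len(values):
        return len(values) - 1
    # the closest value is either values[i] or values[i - 1]
    return i if values[i] - target < target - values[i - 1] else i - 1

print(closest_row(values, 0.3344))  # 2 for the sample data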
