Finding highest mean in CSV file - python

So, I have a csv file as follows: [Smaller Sample]
value,variable
320,1
272,1
284,1
544,2
568,2
544,2
316,3
558,3
542,3
260,4
266,4
710,4
272,5
290,5
558,5
416,6
782,6
626,6
My goal is to find which grouping has the highest average. In this case, grouping 6 has the highest. With this information, I'd then make a new column that distinguishes grouping 6 from all the others.
Like so:
320,1,1
272,1,1
284,1,1
544,2,1
568,2,1
544,2,1
316,3,1
558,3,1
542,3,1
260,4,1
266,4,1
710,4,1
272,5,1
290,5,1
558,5,1
416,6,9
782,6,9
626,6,9
I have absolutely no idea where to start. I initially thought I should read each line into a dictionary, average each grouping, store the averages as new keys, then compare those keys (the averaged groupings) to find the highest. I'm just not sure how I'd write the result back to CSV, or even do this while keeping the integrity of the data.

To do this kind of thing, I would advise using the pandas package:
import pandas as pd
# Read your file
data = pd.read_csv("file.csv")
# Get the group means
group_means = data.groupby('variable')['value'].agg('mean')
# Get the group with highest mean
group_max = group_means.idxmax()
# Flag the highest-mean group with 9 and the rest with 1, as in the desired output
data['comparison'] = data['variable'].map(lambda v: 9 if v == group_max else 1)
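If you then want the result back in a CSV file, a minimal sketch (the output file name is just a placeholder):
# Write the annotated table back out, dropping the index column
data.to_csv("output.csv", index=False)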

You can use itertools.groupby (note that it groups consecutive rows, which works here because the file is already ordered by the grouping column):
import itertools, csv

_h, *data = csv.reader(open('filename.csv'))
new_data = [(a, list(b)) for a, b in itertools.groupby(data, key=lambda x: x[-1])]
# the csv values are strings, so convert before averaging
_max = max(new_data, key=lambda x: sum(float(a) for a, _ in x[-1]) / len(x[-1]))[0]
with open('results.csv', 'w') as f:
    write = csv.writer(f)
    write.writerows([_h, *[[a, b, 9 if b == _max else 1] for a, b in data]])
Output:
value,variable
320,1,1
272,1,1
284,1,1
544,2,1
568,2,1
544,2,1
316,3,1
558,3,1
542,3,1
260,4,1
266,4,1
710,4,1
272,5,1
290,5,1
558,5,1
416,6,9
782,6,9
626,6,9

Related

Finding outliers in an excel row

As an example, say column C has 1000 cells, most filled with '1', but with a couple of '2's sprinkled in. I'm trying to find how many '2's there are and print that number.
import openpyxl
wb = openpyxl.load_workbook('TestBook')
ws = wb.get_sheet_by_name('Sheet1')
for cell in ws['C']:
    print(cell.value)
How can I iterate through the column and just pull how many twos there are?
As @K.Marker pointed out, you can query the count of a specific value in the column with
[c.value for c in ws['C']].count(2)
But what if you don't know the values and/or you'd like to see the distribution of the values of a particular row? You can use a Counter which has dict-like behaviour.
In [448]: from collections import Counter
In [449]: counter = Counter([c.value for c in ws[3]])
In [451]: counter
Out[451]: Counter({1: 17, 2: 5})
In [452]: for k, v in counter.items():
     ...:     print('{0} occurs {1} time(s)'.format(k, v))
     ...:
1 occurs 17 time(s)
2 occurs 5 time(s)
import openpyxl
wb = openpyxl.load_workbook('TestBook')
ws = wb.get_sheet_by_name('Sheet1')
num_of_twos = [c.value for c in ws["C"]].count(2)
The list comprehension creates a list of the cell values in column C, and count(2) counts how many 2s are in it.
Are you looking for how many 2's there are?
count = 0
# load the row into a list
row = list(worksheet.rows)[wantedRowNumber]
# iterate over it and increase the count
for r in row:
    if r.value == 2:  # compare the cell's value, not the cell object
        count += 1
Now, this only works with values of "2" and doesn't find other outliers. To find outliers in general you will have to determine a threshold first. In this example I'll use the average value, although you would need to determine the best test to get the threshold for outliers based on your data. Don't worry, statistics are fun!
count = 0
# load the row into a list
row = list(worksheet.rows)[wantedRowNumber]
# work with the cell values, not the cell objects
values = [r.value for r in row]
# calculate the average using numpy
import numpy as np
NPavg = np.mean(values)
# without numpy
# need to cast to float - otherwise integer division would round the result
avg = sum(values) / float(len(values))
# iterate over it and increase the count
for v in values:
    # of course use your own threshold,
    # determined appropriately, instead of the average
    if v > NPavg:
        count += 1
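A slightly more principled threshold (my assumption here, not something from the question) is to flag values more than two standard deviations from the mean. A minimal sketch, reusing the values list from above:
import numpy as np
mean, std = np.mean(values), np.std(values)
# flag anything further than two standard deviations from the mean
outliers = [v for v in values if abs(v - mean) > 2 * std]
print(len(outliers), outliers)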

Reading a csv file and counting a row depending on another row

I have a csv file where I need to read different columns and sum their numbers up depending on the value in another column of the dataset.
The question is:
How do the flight phases (e.g. take off, cruise, landing...) contribute
to fatalities?
I have to sum up column number 23 for each distinct value in column 28.
I have a solution with masks and a lot of IF statements:
database = pd.read_csv('Aviation.csv', quotechar='"', skipinitialspace=True, delimiter=',', encoding='latin1').fillna(0)
data = database.as_matrix()
TOcounter = 0
for r in data:
    if r[28] == "TAKEOFF":
        TOcounter += r[23]
print(TOcounter)
This example shows the general idea of my solution, where I would have to add a lot of if statements and counters for every distinct value in column 28.
But I was wondering if there is a smarter solution to the issue.
The raw data can be found at: https://raw.githubusercontent.com/edipetres/Depressed_Year/master/Dataset_Assignment/AviationDataset.csv
It sounds like what you are trying to achieve is
df.groupby('Broad.Phase.of.Flight')['Total.Fatal.Injuries'].sum()
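In full, a minimal sketch (the URL is the one from the question, and the latin1 encoding is borrowed from your own read_csv call):
import pandas as pd

url = 'https://raw.githubusercontent.com/edipetres/Depressed_Year/master/Dataset_Assignment/AviationDataset.csv'
df = pd.read_csv(url, encoding='latin1')
# total fatalities per flight phase
print(df.groupby('Broad.Phase.of.Flight')['Total.Fatal.Injuries'].sum())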
This is a quick solution that does not check for errors, such as whether a string can be converted to float. You should also think about looking the columns up by their header text instead of relying on the column indices (like 23 and 28), but this should work:
import csv
import urllib2
import collections

url = 'https://raw.githubusercontent.com/edipetres/Depressed_Year/master/Dataset_Assignment/AviationDataset.csv'
response = urllib2.urlopen(url)
reader = csv.reader(response)

d = collections.defaultdict(list)
for i, row in enumerate(reader):
    key = row[28]
    if key == "" or i == 0:
        continue
    val = 0 if row[23] == "" else float(row[23])
    d[key].append(val)  # defaultdict makes setdefault unnecessary

d2 = {}
for k, v in d.iteritems():
    d2[k] = sum(v)
for k, v in d2.iteritems():
    print "{}:{}".format(k, v)
Result:
TAXI:110.0
STANDING:193.0
MANEUVERING:6430.0
DESCENT:1225.0
UNKNOWN:919.0
TAKEOFF:5267.0
LANDING:592.0
OTHER:107.0
CRUISE:6737.0
GO-AROUND:783.0
CLIMB:1906.0
APPROACH:4493.0
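For Python 3, a sketch of the same approach (urllib2 became urllib.request; this assumes the same column positions and latin1 encoding as above):
import csv
import collections
import io
import urllib.request

url = 'https://raw.githubusercontent.com/edipetres/Depressed_Year/master/Dataset_Assignment/AviationDataset.csv'
with urllib.request.urlopen(url) as response:
    reader = csv.reader(io.TextIOWrapper(response, encoding='latin1'))
    next(reader)  # skip the header row
    totals = collections.defaultdict(float)
    for row in reader:
        if row[28]:
            totals[row[28]] += float(row[23]) if row[23] else 0.0
for phase, total in totals.items():
    print('{}:{}'.format(phase, total))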

Averaging values in a list of a list of a list in Python

I'm working on a method to average data from multiple files and put the results into a single file. Each line of the files looks like:
File #1
Test1,5,2,1,8
Test2,10,4,3,2
...
File #2
Test1,2,4,5,1
Test2,4,6,10,3
...
Here is the code I use to store the data:
totalData = []
for i in range(0, len(files)):
    data = []
    if ".csv" in files[i]:
        infile = open(files[i], "r")
        temp = infile.readline()
        while temp != "":
            data.append([c.strip() for c in temp.split(",")])
            temp = infile.readline()
        totalData.append(data)
So what I'm left with is totalData looking like the following:
totalData = [[['Test1',5,2,1,8],
              ['Test2',10,4,3,2]],
             [['Test1',2,4,5,1],
              ['Test2',4,6,10,3]]]
What I want to average is for all Test1, Test2, etc, average all the first values and then the second values and so forth. So testAverage would look like:
testAverage = [['Test1',3.5,3,3,4.5],
               ['Test2',7,5,6.5,2.5]]
I'm struggling to think of a concise/efficient way to do this. Any help is greatly appreciated! Also, if there are better ways to manage this type of data, please let me know.
This just needs a few nested loops:
totalData = [[['Test1',5,2,1,8], ['Test2',10,4,3,2]],
             [['Test1',2,4,5,1], ['Test2',4,6,10,3]]]

for t in range(len(totalData[0])):  # tests
    result = [totalData[0][t][0]]
    for i in range(1, len(totalData[0][0])):  # numbers
        total = 0.0  # renamed from sum to avoid shadowing the builtin
        for j in range(len(totalData)):  # files
            total += totalData[j][t][i]
        total /= len(totalData)
        result.append(total)
    print(result)
first flatten it out (chain.from_iterable returns an iterator, so turn it into a list):
import itertools
results = list(itertools.chain.from_iterable(totalData))
then sort it
results.sort()
then use groupby
data = {}
for key, values in itertools.groupby(results, lambda x: x[0]):
    columns = list(zip(*values))[1:]  # drop the test-name column
    data[key] = [sum(c) * 1.0 / len(c) for c in columns]
and finally just print your data
If your data structure is regular, the best option is probably numpy. You should be able to install it with pip from the terminal:
pip install numpy
Then in python:
import numpy as np
totalData = np.array(totalData)
# drop the first entry of each row (the 'Test1'/'Test2' labels), since it's not a number
totalData = np.array(totalData[:, :, 1:], float)
# average across files
print(np.mean(totalData, axis=0))
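To get back the testAverage layout from the question, a sketch that reattaches the test names (assuming totalData is the original nested list, before the conversion above):
import numpy as np
names = [row[0] for row in totalData[0]]  # ['Test1', 'Test2']
numbers = np.array([[row[1:] for row in f] for f in totalData], float)
testAverage = [[n] + list(a) for n, a in zip(names, np.mean(numbers, axis=0))]
# [['Test1', 3.5, 3.0, 3.0, 4.5], ['Test2', 7.0, 5.0, 6.5, 2.5]]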

Sort CSV using a key computed from two columns, grab first n largest values

Python amateur here... let's say I have a snippet of an example csv file:
Country, Year, GDP, Population
Country1,2002,44545,24352
Country2,2004,14325,75677
Country3,2005,23132412,1345234
Country4,,2312421,12412
I need to sort the file by descending GDP per capita (GDP/Population) in a certain year, say, 2002, then grab the first 10 rows with the largest GDP per capita values.
So far, after I import the csv to a 'data' variable, I grab all the 2002 data without missing fields using:
data_2 = []
for row in data:
    if row[1] == '2002' and row[2] != ' ' and row[3] != ' ':
        data_2.append(row)
I need to find some way to sort data_2 by row[2]/row[3] descending, preferably without using a class, and then grab each entire row tied to each of the largest 10 values to write to another csv. If someone could point me in the right direction I would be forever grateful, as I've tried countless Google searches...
This is an approach that will enable you to do one scan of the file and keep the top 10 for each year...
It is possible to do this without pandas by utilising the heapq module. The following is untested, but it should give you a base to adapt, with reference to the appropriate documentation:
import csv
import heapq
from itertools import islice

freqs = {}
with open('yourfile') as fin:
    csvin = csv.reader(fin)
    # prepend GDP per capita to each row; skip the header and rows with missing fields
    rows_with_gdp = ([float(row[2]) / float(row[3])] + row for row in islice(csvin, 1, None) if row[2] and row[3])
    for row in rows_with_gdp:
        cnt = freqs.setdefault(row[2], [[]] * 10)  # row[2] is now the year; keep 10 per year
        heapq.heappushpop(cnt, row)
for year, vals in freqs.iteritems():
    print year, [row[1:] for row in sorted(filter(None, vals), reverse=True)]
The relevant modules would be:
csv for parsing the input
collections.namedtuple to name the fields
the filter() function to extract the specified year range
heapq.nlargest() to find the largest values
pprint.pprint() for nice output
Here's a little bit to get you started (I would do it all but what is the fun in having someone write your whole program and deprive you of the joy of finishing it):
from __future__ import division
import csv, collections, heapq, pprint

filecontents = '''\
Country, Year, GDP, Population
Country1,2002,44545,24352
Country2,2004,14325,75677
Country3,2004,23132412,1345234
Country4,2004,2312421,12412
'''

CountryStats = collections.namedtuple('CountryStats', ['country', 'year', 'gdp', 'population'])

dialect = csv.Sniffer().sniff(filecontents)
data = []
for country, year, gdp, pop in csv.reader(filecontents.splitlines()[1:], dialect):
    row = CountryStats(country, int(year), int(gdp), int(pop))
    if row.year == 2004:
        data.append(row)
data.sort(key=lambda s: s.gdp / s.population, reverse=True)  # descending GDP per capita
pprint.pprint(data)
Use the optional key argument of the sort method:
array.sort(key=lambda x: x[2])
will sort array using each element's third item as the key. The key argument should be a function (typically a lambda expression) that takes a single argument (an element of the array being sorted) and returns the key to sort by.
For your GDP example, the lambda function to use would be:
lambda x: float(x[2])/float(x[3]) # x[2] is GDP, x[3] is population
The float function converts the CSV fields from strings into floating point numbers. Since there are no guarantees that this will be successful (improper formatting, bad data, etc), I'd typically do this before sorting, when inserting stuff into the array. You should use floating point division here explicitly, as integer division won't give you the results you expect. If you find yourself doing this often, changing the behavior of the division operator is an option (http://www.python.org/dev/peps/pep-0238/ and related links).
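For the "first 10 rows" part, heapq.nlargest (mentioned above) avoids sorting the whole list. A sketch, assuming data_2 holds the filtered rows from your snippet:
import heapq

# the ten rows with the largest GDP per capita
top10 = heapq.nlargest(10, data_2, key=lambda row: float(row[2]) / float(row[3]))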

How to average values in one column of a csv if the values in another column are not changing?

EDIT: See end of my post for working code, obtained from zeekay here.
I have a CSV file with two columns (voltage and current). Because the voltage is recorded to many significant digits and the current only has 2, there are many identical current values as the value of the voltage changes. This isn't important to the programming but I'm just explaining how the data is physically obtained. I want to perform the following action:
For as long as the value of the second column (current) does not change, collect the values of the first column (voltage) into a list and average them. Then write a row into a new CSV file which is this averaged value of the voltage in the first column and the constant current value which did not change in the second column. In other words, if there are 20 rows for which the current did not change (say it is 6 uA), the 20 corresponding voltage values are averaged (say this average comes out to be 600 mV) and a row is generated in a new csv file which reads ('0.6','0.000006'). Then I want to continue iterating through the csv which is being read, repeating the above procedure for each set of fixed currents.
I've got the following code so far, but I'm not sure if I'm on the right track:
import sys, csv

with open('filetowriteto.csv', 'w') as avg:
    loadeddata = open('filetoreadfrom.csv', 'r')
    writer = csv.writer(avg)
    readloaded = csv.reader(loadeddata)
    listloaded = list(readloaded)
    oldcurrent = listloaded[0][1]
    for row in readloaded:
        newcurrent = row[1]
        biaslist = []
        if newcurrent == oldcurrent:
            biaslist.append(row[0])
        else:
            biasavg = float(sum(biaslist)) / len(biaslist)
            writer.writerow([biasavg, newcurrent])
            newcurrent = row[1]
and then I'm not sure where to go.
Edit: It seems that zeekay is on the right track for what I want to do. I'm trying to implement his itertools.groupby() method but I'm currently getting a blank file generated. Here's my new code so far:
import sys, csv, itertools

with open('VI_avg(12).csv', 'w') as avg:  # this is the file which gets written
    loadeddata = open('VI(12).csv', 'r')  # this is the file which is read
    writer = csv.writer(avg)
    readloaded = csv.reader(loadeddata)
    listloaded = list(readloaded)
    oldcurrent = listloaded[0][1]  # looks like this is no longer required
    for current, row in itertools.groupby(readloaded, lambda x: x[1]):
        biaslist = [float(x[0]) for x in row]
        biasavg = float(sum(biaslist)) / len(biaslist)
        # write it out
        writer.writerow(biasavg, current)
Suppose the CSV file being opened is something like this (shortened example):
0.595417,0.000065
0.595177,0.000065
0.594937,0.000065
0.594697,0.000065
0.594457,0.000065
0.594217,0.000065
0.593977,0.000065
0.593737,0.000065
0.593497,0.000064
0.593017,0.000064
0.592777,0.000064
0.592537,0.000064
0.592297,0.000064
0.587018,0.000064
0.586778,0.000064
0.586538,0.000063
0.586299,0.000063
0.586059,0.000063
0.585579,0.000063
0.585339,0.000063
0.585099,0.000063
0.584859,0.000063
0.584619,0.000063
0.584379,0.000063
0.584139,0.000063
0.583899,0.000063
0.583659,0.000063
Final update: Here's the working version, obtained from zeekay:
import csv
import itertools
with open('VI(12).csv') as input, open('VI_avg(12).csv', 'w') as output:
    reader = csv.reader(input)
    writer = csv.writer(output)
    for current, row in itertools.groupby(reader, lambda x: x[1]):
        biaslist = [float(x[0]) for x in row]
        biasavg = float(sum(biaslist)) / len(biaslist)
        writer.writerow([biasavg, current])
You can use itertools.groupby to group results as you read through the csv, which would simplify things a lot. Given your updated example:
import csv
import itertools
with open('VI(12).csv') as input, open('VI_avg(12).csv', 'w') as output:
    reader = csv.reader(input)
    writer = csv.writer(output)
    for current, row in itertools.groupby(reader, lambda x: x[1]):
        biaslist = [float(x[0]) for x in row]
        biasavg = float(sum(biaslist)) / len(biaslist)
        writer.writerow([biasavg, current])
Maybe you can try using pandas. Note that, unlike itertools.groupby above, DataFrame.groupby collects all rows with the same current even when they are not consecutive:
import pandas
voltage = [1.1, 1.2, 1.3, 2.1, 2.2, 2.3]
current = [1.0, 1.0, 1.1, 1.3, 1.2, 1.3]
df = pandas.DataFrame({'voltage': voltage, 'current': current})
result = df.groupby('current').mean()
# Output
         voltage
current
1.0         1.15
1.1         1.30
1.2         2.20
1.3         2.20
result.to_csv('grouped_data.csv')
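Applied to the files from the question, a sketch (the column names are invented, since the CSV has no header row):
import pandas as pd

# name the two headerless columns for clarity
df = pd.read_csv('VI(12).csv', header=None, names=['voltage', 'current'])
means = df.groupby('current', sort=False, as_index=False)['voltage'].mean()
# match the question's (average voltage, current) column order
means[['voltage', 'current']].to_csv('VI_avg(12).csv', header=False, index=False)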
One way:
curDict = {}
for row in readloaded:  # readloaded is the csv.reader from your code
    if row[1] not in curDict:  # if not already there, create key/value pair
        curDict[row[1]] = [float(row[0])]
    else:  # already exists, add to key/value pair
        curDict[row[1]].append(float(row[0]))
# You'll end up with:
# {'0.000065': [0.595417, 0.595177, ...], ...}
# write the rows
for k, v in curDict.items():
    avgValue = sum(v) / len(v)  # calculate the avg of the voltages
    writer.writerow([avgValue, k])
This version will do what you describe, but it will average all values with the same current, regardless of whether they are consecutive or not. Apologies if that's not what you want, but maybe it can help you along the way:
import csv
from collections import defaultdict
from functools import reduce  # needed on Python 3; a builtin on Python 2

def f(acc, row):
    acc[row[1]].append(float(row[0]))
    return acc

with open('out.csv', 'w') as out:
    writer = csv.writer(out)
    data = open('in.csv', 'r')
    r = csv.reader(data)
    reduced = reduce(f, r, defaultdict(list))
    for current, voltages in reduced.items():
        writer.writerow([sum(voltages) / len(voltages), current])
Yet another way using some very small test data (haven't included the csv stuff as you appear to have a handle on that):
#!/usr/bin/python3
test_data = [ # Only 3 currents in test data:
    (0.00030,5), # 5 : Only one entry, total 0.00030 - so should give 0.00030 as the average
    (0.00012,6), # 6 : Two entries, total 0.00048 - so should give 0.00024 as the average
    (0.00036,6),
    (0.00001,7), # 7 : Four entries, total 0.00010 - so should give 0.000025 as the average
    (0.00001,7),
    (0.00001,7),
    (0.00007,7)]

currents = dict()
for row in test_data:
    if not row[1] in currents:
        matching_currents = list((each[0] for each in test_data if each[1] == row[1]))
        current_average = sum(matching_currents) / len(matching_currents)
        currents[row[1]] = current_average

print("There were {0} unique currents found:\n".format(len(currents)))
for current, bias in currents.items():
    print("Current: {0:2d} ( Average: {1:1.5f} )".format(current, bias))
