Finding outliers in an excel row - python

As an example, say column C has 1000 cells, most filled with '1', but with a couple of '2's sprinkled in. I'm trying to find how many '2's there are and print that number.
import openpyxl

wb = openpyxl.load_workbook('TestBook')
ws = wb.get_sheet_by_name('Sheet1')
for cell in ws['C']:
    print(cell.value)
How can I iterate through the column and just pull how many twos there are?

As @K.Marker pointed out, you can get the count of a specific value in the column with
[c.value for c in ws['C']].count(2)
But what if you don't know the values, and/or you'd like to see the distribution of the values of a particular row? You can use a Counter, which has dict-like behaviour.
In [446]: from collections import Counter

In [449]: counter = Counter([c.value for c in ws[3]])

In [451]: counter
Out[451]: Counter({1: 17, 2: 5})

In [452]: for k, v in counter.items():
     ...:     print('{0} occurs {1} time(s)'.format(k, v))
     ...:
1 occurs 17 time(s)
2 occurs 5 time(s)
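If the outliers are simply "whatever isn't the dominant value", a small follow-on sketch (reusing the counter built above):
most_common_value, _ = counter.most_common(1)[0]  # the "normal" value
outlier_count = sum(v for k, v in counter.items() if k != most_common_value)
print(outlier_count)  # 5 for the Counter({1: 17, 2: 5}) above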

import openpyxl
wb = openpyxl.load_workbook('TestBook')
ws = wb.get_sheet_by_name('Sheet1')
num_of_twos = [c.value for c in ws["C"]].count(2)
The list comprehension creates a list of the cell values throughout column C, and count(2) counts how many 2s are in it.

Are you looking for how many 2's there are?
count = 0
# load the wanted row into a list
row = list(worksheet.rows)[wantedRowNumber]
# iterate over it and increase the count
for r in row:
    if r.value == 2:  # .rows yields cells, so compare the cell's value
        count += 1
Now, this only works for the value "2" and doesn't find other outliers. To find outliers in general you will have to determine a threshold first. In this example I'll use the average value, although you would need to determine the best test for your own data. Don't worry, statistics are fun!
count = 0
# load the wanted row into a list of cells, then pull out their values
row = list(worksheet.rows)[wantedRowNumber]
values = [r.value for r in row]
# calculate the average
# using numpy
import numpy as np
NPavg = np.mean(values)
# without numpy
# need to cast to float - otherwise Python 2 integer division truncates
avg = sum(values) / float(len(values))
# iterate over the values and increase the count
for v in values:
    # of course use your own threshold,
    # determined appropriately, instead of the average
    if v > NPavg:
        count += 1
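As one concrete example of a better test: a common rule of thumb treats anything more than two standard deviations from the mean as an outlier. A minimal sketch with stand-in data (substitute your own row values):
import numpy as np

values = [1] * 20 + [9, 9]  # stand-in data: mostly 1s plus two clear outliers
mean, std = np.mean(values), np.std(values)
# two standard deviations is a rule of thumb, not a universal test
outliers = [v for v in values if abs(v - mean) > 2 * std]
print(len(outliers), 'outlier(s) found')  # prints: 2 outlier(s) found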

Related

Finding highest mean in CSV file

So, I have a csv file as follows: [Smaller Sample]
value,variable
320,1
272,1
284,1
544,2
568,2
544,2
316,3
558,3
542,3
260,4
266,4
710,4
272,5
290,5
558,5
416,6
782,6
626,6
My goal is to find which grouping has the highest average. In this case, that's grouping 6. With this information, I'd then make a new column that compares grouping 6 to all the others.
Like so:
320,1,1
272,1,1
284,1,1
544,2,1
568,2,1
544,2,1
316,3,1
558,3,1
542,3,1
260,4,1
266,4,1
710,4,1
272,5,1
290,5,1
558,5,1
416,6,9
782,6,9
626,6,9
I have absolutely no idea where to start. I initially thought maybe I should split each line into a dictionary, then average each grouping, make a new key for the average, then take all of the keys (the averaged groupings) and detect which is the highest. I'm just not sure how I'd put it back into a CSV, or even execute this while keeping the integrity of the data.
To do this kind of thing, I would advise using the pandas package:
import pandas as pd
# Read your file
data = pd.read_csv("file.csv")
# Get the group means
group_means = data.groupby('variable')['value'].agg('mean')
# Get the group with highest mean
group_max = group_means.idxmax()
# Add the last column: 9 for the group with the highest mean, 1 for the rest
data['comparison'] = data['variable'].eq(group_max).map({True: 9, False: 1})
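The question also asks how to write the result back out; pandas handles that directly (the output file name is just a placeholder):
data.to_csv('results.csv', index=False)  # index=False keeps pandas' row index out of the file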
You can use itertools.groupby:
import itertools, csv

_h, *data = csv.reader(open('filename.csv'))
new_data = [(a, list(b)) for a, b in itertools.groupby(data, key=lambda x: x[-1])]
_max = max(new_data, key=lambda x: sum(float(a) for a, _ in x[-1]) / len(x[-1]))[0]
with open('results.csv', 'w') as f:
    write = csv.writer(f)
    write.writerows([_h, *[[a, b, 9 if b == _max else 1] for a, b in data]])
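Note that itertools.groupby only merges consecutive rows with equal keys, so this assumes the file is already ordered by the variable column, as the sample data is.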
Output:
value,variable
320,1,1
272,1,1
284,1,1
544,2,1
568,2,1
544,2,1
316,3,1
558,3,1
542,3,1
260,4,1
266,4,1
710,4,1
272,5,1
290,5,1
558,5,1
416,6,9
782,6,9
626,6,9

Reading a csv file and counting a row depending on another row

I have a CSV file where I need to read different columns and sum their numbers up depending on another column in the dataset.
The question is:
How do the flight phases (ex. take off, cruise, landing..) contribute
to fatalities?
I have to sum up column number 23 for each distinct value in column 28.
I have a solution with masks and a lot of IF statements:
database = pd.read_csv('Aviation.csv', quotechar='"', skipinitialspace=True, delimiter=',', encoding='latin1').fillna(0)
data = database.as_matrix()
TOcounter = 0
for r in data:
    if r[28] == "TAKEOFF":
        TOcounter += r[23]
print(TOcounter)
This example shows the general idea of my solution, where I would have to add a lot of if statements and counters for every distinct value in column 28.
But I was wondering if there is a smarter solution to the issue.
The raw data can be found at: https://raw.githubusercontent.com/edipetres/Depressed_Year/master/Dataset_Assignment/AviationDataset.csv
It sounds like what you are trying to achieve is
df.groupby('Broad.Phase.of.Flight')['Total.Fatal.Injuries'].sum()
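Spelled out against the question's own loading code (same file and read_csv options; sort_values just ranks the phases):
import pandas as pd

df = pd.read_csv('Aviation.csv', quotechar='"', skipinitialspace=True, encoding='latin1').fillna(0)
fatalities = df.groupby('Broad.Phase.of.Flight')['Total.Fatal.Injuries'].sum()
print(fatalities.sort_values(ascending=False))  # phases ranked by total fatalities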
This is a quick solution that doesn't check for errors, such as whether a string can be converted to float. You should also think about looking the right column up by its header text instead of relying on the column index (like 23 and 28), but this should work:
import csv
import urllib2
import collections

url = 'https://raw.githubusercontent.com/edipetres/Depressed_Year/master/Dataset_Assignment/AviationDataset.csv'
response = urllib2.urlopen(url)
df = csv.reader(response)
d = collections.defaultdict(list)
for i, row in enumerate(df):
    key = row[28]
    if key == "" or i == 0:
        continue
    val = 0 if row[23] == "" else float(row[23])
    d[key].append(val)  # defaultdict makes the original setdefault call redundant
d2 = {}
for k, v in d.iteritems():
    d2[k] = sum(v)
for k, v in d2.iteritems():
    print "{}:{}".format(k, v)
Result:
TAXI:110.0
STANDING:193.0
MANEUVERING:6430.0
DESCENT:1225.0
UNKNOWN:919.0
TAKEOFF:5267.0
LANDING:592.0
OTHER:107.0
CRUISE:6737.0
GO-AROUND:783.0
CLIMB:1906.0
APPROACH:4493.0
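For reference, the answer above is Python 2 (urllib2, iteritems, print statements). A minimal Python 3 port, assuming the same URL and column positions:
import csv
import io
import urllib.request
from collections import defaultdict

url = 'https://raw.githubusercontent.com/edipetres/Depressed_Year/master/Dataset_Assignment/AviationDataset.csv'
totals = defaultdict(float)
with urllib.request.urlopen(url) as response:
    reader = csv.reader(io.TextIOWrapper(response, encoding='latin1'))
    next(reader)  # skip the header row
    for row in reader:
        if row[28]:  # skip rows with no flight phase
            totals[row[28]] += float(row[23]) if row[23] else 0.0

for phase, total in totals.items():
    print('{}:{}'.format(phase, total))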

compare sum of column values in python

I have a CSV file loaded in a Python object. 15 of the columns contain binary values. I have several thousand rows.
I want to compute the sum of the binary values in each of the columns and sort the results in ascending order.
I only made it to:
sum1=sum(products['1'])
sum2=sum(products['2'])
sum3=sum(products['3'])
....
...
sum15=sum(products['15'])
and process the result manually. Is there a programmatic way to achieve this?
How about this:
sorted_sum = sorted(sum(products[str(i)]) for i in range(1, 16))
sorted_sum is the sorted list of column sums; str(i) matches the string keys ('1' to '15') used in the question.
Or, working from the CSV file directly (note that a csv.reader can only be consumed once, so the rows are read into a list up front rather than re-reading the file for every column):
import csv

with open("file.csv") as fin:
    headerline = fin.next()  # skip the header line
    rows = list(csv.reader(fin))  # read the file once
list_sum_product = []
for i in range(15):
    total = sum(int(row[i]) for row in rows)
    list_sum_product.append(total)
print sorted(list_sum_product)
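If the file is loaded with pandas instead (a guess, since the products['1'] indexing looks like a DataFrame), the whole job is a couple of lines; the column labels assume the string keys '1' through '15' from the question:
import pandas as pd

products = pd.read_csv('file.csv')  # hypothetical file name
sorted_sums = products[[str(i) for i in range(1, 16)]].sum().sort_values()
print(sorted_sums)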

Python Import data dictionary and pattern

If I have data as:
Code, data_1, data_2, data_3, [....], data204700
a,1,1,0, ... , 1
b,1,0,0, ... , 1
a,1,1,0, ... , 1
c,0,1,0, ... , 1
b,1,0,0, ... , 1
etc.; the same code can appear with different values (0, 1, or ? for not known).
I need to create a big matrix that I want to analyze.
How can I import the data into a dictionary? I want to use a dictionary for the columns (204,700 + 1).
Is there a built-in function (or package) that returns a pattern? (I expect a percentage pattern.) I mean something like: 90% of 1s in column 1, 80% in column 2.
Alright, so I am going to assume you want this in a dictionary for storing purposes, and I will tell you that you don't want that with this kind of data: use a pandas DataFrame.
This is how you get your data into a DataFrame:
import pandas as pd
my_file = 'file_name'
df = pd.read_csv(my_file)
Now, you don't need a package to return the pattern you are looking for; just write a simple algorithm for it!
def one_percentage(data):
    # get total number of rows for calculating percentages
    size = len(data)
    # grab the dtype of the first data column so we only use the data columns
    x = data.columns[1]
    x = data[x].dtype
    # list of tuples holding the column names and the count of 1s
    ones = [(i, sum(data[i])) for i in data if data[i].dtype == x]
    my_dict = {}
    # create a dictionary mapping column names to the fraction of 1s
    for x in ones:
        percent = x[1] / float(size)
        my_dict[x[0]] = percent
    return my_dict
Now, if you want to get the fraction of 1s in any column, this is what you do:
percentages = one_percentage(df)
column_name = 'any_column_name'
print percentages[column_name]
Now, if you want it for every single column, you can grab all of the column names and loop through them (multiplying by 100 since the function returns fractions):
columns = [name for name in percentages]
for name in columns:
    print str(percentages[name] * 100) + "% of 1s in column " + name
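As a side note: because the columns are 0/1, the column mean is already the fraction of 1s, so pandas can produce the whole percentage pattern at once. A minimal sketch reusing the df from above (== 1 simply treats ? entries as non-matches):
data_cols = df.drop(columns=['Code'])  # 'Code' is the label column from the question
print((data_cols == 1).mean() * 100)   # percent of 1s in each column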
let me know if you need anything else!

How to average values in one column of a csv if the values in another column are not changing?

EDIT: See end of my post for working code, obtained from zeekay here.
I have a CSV file with two columns (voltage and current). Because the voltage is recorded to many significant digits and the current only has 2, there are many identical current values as the value of the voltage changes. This isn't important to the programming but I'm just explaining how the data is physically obtained. I want to perform the following action:
For as long as the value of the second column (current) does not change, collect the values of the first column (voltage) into a list and average them. Then write a row into a new CSV file which is this averaged value of the voltage in the first column and the constant current value which did not change in the second column. In other words, if there are 20 rows for which the current did not change (say it is 6 uA), the 20 corresponding voltage values are averaged (say this average comes out to be 600 mV) and a row is generated in a new csv file which reads ('0.6','0.000006'). Then I want to continue iterating through the csv which is being read, repeating the above procedure for each set of fixed currents.
I've got the following code so far, but I'm not sure if I'm on the right track:
import sys, csv

with open('filetowriteto.csv','w') as avg:
    loadeddata = open('filetoreadfrom.csv','r')
    writer = csv.writer(avg)
    readloaded = csv.reader(loadeddata)
    listloaded = list(readloaded)
    oldcurrent = listloaded[0][1]
    for row in readloaded:
        newcurrent = row[1]
        biaslist = []
        if newcurrent == oldcurrent:
            biaslist.append(row[0])
        else:
            biasavg = float(sum(biaslist))/len(biaslist)
            writer.writerow([biasavg,newcurrent])
            newcurrent = row[1]
and then I'm not sure where to go.
Edit: It seems that zeekay is on the right track for what I want to do. I'm trying to implement his itertools.groupby() method but I'm currently getting a blank file generated. Here's my new code so far:
import sys, csv, itertools

with open('VI_avg(12).csv','w') as avg:  # this is the file which gets written
    loadeddata = open('VI(12).csv','r')  # this is the file which is read
    writer = csv.writer(avg)
    readloaded = csv.reader(loadeddata)
    listloaded = list(readloaded)
    oldcurrent = listloaded[0][1]  # looks like this is no longer required
    for current, row in itertools.groupby(readloaded, lambda x: x[1]):
        biaslist = [float(x[0]) for x in row]
        biasavg = float(sum(biaslist))/len(biaslist)
        # write it out
        writer.writerow(biasavg, current)
Suppose the CSV file being opened is something like this (shortened example):
0.595417,0.000065
0.595177,0.000065
0.594937,0.000065
0.594697,0.000065
0.594457,0.000065
0.594217,0.000065
0.593977,0.000065
0.593737,0.000065
0.593497,0.000064
0.593017,0.000064
0.592777,0.000064
0.592537,0.000064
0.592297,0.000064
0.587018,0.000064
0.586778,0.000064
0.586538,0.000063
0.586299,0.000063
0.586059,0.000063
0.585579,0.000063
0.585339,0.000063
0.585099,0.000063
0.584859,0.000063
0.584619,0.000063
0.584379,0.000063
0.584139,0.000063
0.583899,0.000063
0.583659,0.000063
Final update: Here's the working version, obtained from zeekay:
import csv
import itertools

with open('VI(12).csv') as input, open('VI_avg(12).csv','w') as output:
    reader = csv.reader(input)
    writer = csv.writer(output)
    for current, row in itertools.groupby(reader, lambda x: x[1]):
        biaslist = [float(x[0]) for x in row]
        biasavg = float(sum(biaslist))/len(biaslist)
        writer.writerow([biasavg, current])
You can use itertools.groupby to group results as you read through the csv, which would simplify things a lot. Given your updated example:
import csv
import itertools

with open('VI(12).csv') as input, open('VI_avg(12).csv','w') as output:
    reader = csv.reader(input)
    writer = csv.writer(output)
    for current, row in itertools.groupby(reader, lambda x: x[1]):
        biaslist = [float(x[0]) for x in row]
        biasavg = float(sum(biaslist))/len(biaslist)
        writer.writerow([biasavg, current])
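Note that itertools.groupby starts a new group every time the key changes, so each consecutive run of constant current becomes exactly one averaged row, which is the behaviour the question asks for; if the same current reappears later in the sweep it gets its own row.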
Maybe you can try using pandas:
import pandas

voltage = [1.1, 1.2, 1.3, 2.1, 2.2, 2.3]
current = [1.0, 1.0, 1.1, 1.3, 1.2, 1.3]
df = pandas.DataFrame({'voltage': voltage, 'current': current})
result = df.groupby('current').mean()

# Output:
#          voltage
# current
# 1.0         1.15
# 1.1         1.30
# 1.2         2.20
# 1.3         2.20

result.to_csv('grouped_data.csv')
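One caveat: groupby('current') pools every row with a given current wherever it occurs in the file, rather than averaging each consecutive run separately as the itertools.groupby version does; for a monotonic sweep the two approaches agree.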
One way:
curDict = {}
for row in readloaded:  # the csv.reader from the question
    if row[1] not in curDict:  # if not already there create key/value pair
        curDict[row[1]] = [float(row[0])]
    else:  # already exists, add to key/value pair
        curDict[row[1]].append(float(row[0]))
# You'll end up with:
# {'0.6': [599, 600, 601, ...], ...}
# write the rows (average voltage first, then the current, as in the desired output)
for k, v in curDict.items():
    avgValue = sum(v) / len(v)  # calculate the avg of the voltages
    writer.writerow([avgValue, k])
This version will do what you describe, but it will average all values with the same current, regardless of whether they are consecutive or not. Apologies if that's not what you want, but maybe it can help you along the way:
import csv
from collections import defaultdict

def f(acc, row):
    acc[row[1]].append(float(row[0]))
    return acc

with open('out.csv', 'w') as out:
    writer = csv.writer(out)
    data = open('in.csv', 'r')
    r = csv.reader(data)
    reduced = reduce(f, r, defaultdict(list))
    for current, voltages in reduced.items():
        writer.writerow([sum(voltages) / len(voltages), current])
Yet another way using some very small test data (haven't included the csv stuff as you appear to have a handle on that):
#!/usr/bin/python3
test_data = [  # Only 3 currents in the test data:
    (0.00030, 5),  # 5 : one entry, total 0.00030 - so should give 0.00030 as the average
    (0.00012, 6),  # 6 : two entries, total 0.00048 - so should give 0.00024 as the average
    (0.00036, 6),
    (0.00001, 7),  # 7 : four entries, total 0.00010 - so should give 0.000025 as the average
    (0.00001, 7),
    (0.00001, 7),
    (0.00007, 7)]
currents = dict()
for row in test_data:
    if not row[1] in currents:
        matching_currents = [each[0] for each in test_data if each[1] == row[1]]
        current_average = sum(matching_currents) / len(matching_currents)
        currents[row[1]] = current_average
print("There were {0} unique currents found:\n".format(len(currents)))
for current, bias in currents.items():
    print("Current: {0:2d} ( Average: {1:1.5f} )".format(current, bias))
