Reading a csv file and counting a row depending on another row - python

I have a csv file where i need to read different columns and sum their numbers up depending on another row in the dataset.
The question is:
How do the flight phases (ex. take off, cruise, landing..) contribute
to fatalities?
I have to sum up column number 23 for each different data in column 28.
I have a solution with masks and a lot of IF statements:
database = pd.read_csv('Aviation.csv',quotechar='"',skipinitialspace=True, delimiter=',', encoding='latin1').fillna(0)
data = database.as_matrix()
TOcounter = 0
for r in data:
if r[28] == "TAKEOFF":
TOcounter += r[23]
print(TOcounter)
This example shows the general idea of my solution. Where i would have to add a lot of if statements and counters for every different data in column 28.
But i was wondering if there is a smarter solution to the issue.
The raw data can be found at: https://raw.githubusercontent.com/edipetres/Depressed_Year/master/Dataset_Assignment/AviationDataset.csv

It sounds like what you are trying to achieve is
df.groupby('Broad.Phase.of.Flight')['Total.Fatal.Injuries'].sum()

This is a quick solution, not checking for errors like if can convert a string for float. Also you should think about in searching for the right column(with text) instead of reliing on the column index (like 23 and 28)
but this should work:
import csv
import urllib2
import collections
url = 'https://raw.githubusercontent.com/edipetres/Depressed_Year/master/Dataset_Assignment/AviationDataset.csv'
response = urllib2.urlopen(url)
df = csv.reader(response)
d = collections.defaultdict(list)
for i,row in enumerate(df):
key = row[28]
if key == "" or i == 0 : continue
val = 0 if(row[23]) =="" else float(row[23])
d.setdefault(key,[]).append(val)
d2 = {}
for k, v in d.iteritems(): d2[k] = sum(v)
for k, v in d2.iteritems(): print "{}:{}".format(k,v)
Result:
TAXI:110.0
STANDING:193.0
MANEUVERING:6430.0
DESCENT:1225.0
UNKNOWN:919.0
TAKEOFF:5267.0
LANDING:592.0
OTHER:107.0
CRUISE:6737.0
GO-AROUND:783.0
CLIMB:1906.0
APPROACH:4493.0

Related

Finding highest mean in CSV file

So, I have a csv file as follows: [Smaller Sample]
value,variable
320,1
272,1
284,1
544,2
568,2
544,2
316,3
558,3
542,3
260,4
266,4
710,4
272,5
290,5
558,5
416,6
782,6
626,6
My goal is to find the highest average, of each grouping. So, in this case, grouping 6 is the highest. With this information, I'd then make a new column that compares grouping 6 to all others.
Like so:
320,1,1
272,1,1
284,1,1
544,2,1
568,2,1
544,2,1
316,3,1
558,3,1
542,3,1
260,4,1
266,4,1
710,4,1
272,5,1
290,5,1
558,5,1
416,6,9
782,6,9
626,6,9
I have absolutely no idea where to start. I initially thought maybe I should split each line into a dictionary, then average each grouping, make a new key as the average, then take all of the keys[averaged groupings] and detect which is the highest. I'm just not sure how I'd put it back into CSV, or even execute this while keeping the integrity of the data.
To do this kind of things, I would advise to use the pandas package:
import pandas as pd
# Read your file
data = pd.read_csv("file.csv")
# Get the group means
group_means = data.groupby('variable')['value'].agg('mean')
# Get the group with highest mean
group_max = group_means.idxmax()
# Add the last column to differentiate the highest mean
data['comparison'] = (data['variable'] == group_max).astype(int)
You can use itertools.groupby:
import itertools, csv
_h, *data = csv.reader(open('filename.csv'))
new_data = [(a, list(b)) for a, b in itertools.groupby(data, key=lambda x:x[-1])]
_max = max(new_data, key=lambda x:sum(a for a, _ in x[-1])/float(len(x[-1])))[0]
with open('results.csv', 'w') as f:
write = csv.writer(f)
write.writerows([_h, *[[a, b, 9 if b == _max else 1] for a, b in data]])
Output:
value,variable
320,1,1
272,1,1
284,1,1
544,2,1
568,2,1
544,2,1
316,3,1
558,3,1
542,3,1
260,4,1
266,4,1
710,4,1
272,5,1
290,5,1
558,5,1
416,6,9
782,6,9
626,6,9

Appending values in rows into new columns?

I have this data in a CSV:
first, middle, last, id, fte
Alexander,Frank,Johnson,460700,1
Ashley,Jane,Smith,470000,.5
Ashley,Jane,Smith,470000,.25
Ashley,Jane,Smith,470000,.25
Steve,Robert,Brown,460001,1
I need to find the rows of people with the same ID numbers, and then combine the FTEs of those rows into the same line. I'll also need to add 0s for the rows that don't have duplicates. For example (using data above):
first, middle, last, id, fte1, fte2, fte3, fte4
Alexander,Frank,Johnson,460700,1,0,0,0
Ashley,Jane,Smith,470000,.5,.25,.25,0
Steve,Robert,Brown,460001,1,0,0,0
Basically, we're looking at the jobs people hold. Some people work one 40-hour per week job (1.0 FTE), some work two 20-hour per week jobs (.5 and .5 FTEs), some might work 4 10-hour per week jobs (.25, .25, .25, and .25 FTEs), and some may have other combinations. We only get one row of data per employee, so we need the FTEs on the same row.
This is what we have so far. Right now, our current code only works if they have two FTEs. If they have three or four, it just overwrites them with the last two (so if they have 3, it gives us 2 and 3. If they have 4, it gives us 3 and 4).
f = open('data.csv')
csv_f = csv.reader(f)
dataset = []
for row in csv_f:
dictionary = {}
dictionary["first"] = row[2]
dictionary["middle"] = row[3]
dictionary["last"] = row[4]
dictionary["id"] = row[10]
dictionary["fte"] = row[12]
dataset.append(dictionary)
def is_match(dict1, dict2):
return (dict1["id"] == dict2["id"])
def find_match(dictionary, dict_list):
for index in range(0, len(dict_list)):
if is_match(dictionary, dict_list[index]):
return index
return -1
def process_data(dataset):
result = []
for index in range(1, len(dataset)):
data_dict = dataset[index]
match_index = find_match(data_dict, result)
id = str(data_dict["id"])
if match_index == -1:
result.append(data_dict)
else:
(result[match_index])["fte2"] = data_dict["fte"]
return result
f.close()
for row in process_data(dataset):
print(row)
Any help would be extremely appreciated! Thanks!
I would say to use the pandas library to make it simple. You can use group by along with aggregation. The below is an example following the aggregation example provided here https://www.tutorialspoint.com/python_pandas/python_pandas_groupby.htm.
import pandas as pd
import numpy as np
df = pd.read_csv('filename.csv')
grouped = df.groupby('id')
print grouped['fte'].agg(np.sum)

Finding outliers in an excel row

As an example say column C has 1000 cells and most are filled with '1' however there are a couple of '2' sprinkled in. I'm trying to be able to find how many '2' there are and print the number.
import openpyxl
wb = openpyxl.load_workbook('TestBook')
ws = wb.get_sheet_by_name('Sheet1')
for cell in ws['C']:
print(cell.value)
How can I iterate through the column and just pull how many twos there are?
As #K.Marker pointed out, you can query the count of a specific value in the rows with
[c.value for c in ws['C']].count(2)
But what if you don't know the values and/or you'd like to see the distribution of the values of a particular row? You can use a Counter which has dict-like behaviour.
In [446]: from collections import Counter
In [448]: from collections import Counter
In [449]: counter = Counter([c.value for c in ws[3]])
In [451]: counter
Out[451]: Counter({1: 17, 2: 5})
In [452]: for k, v in counter.items():
...: print('{0} occurs {1} time(s)'.format(k, v))
...:
1 occurs 17 time(s)
2 occurs 5 time(s)
import openpyxl
wb = openpyxl.load_workbook('TestBook')
ws = wb.get_sheet_by_name('Sheet1')
num_of_twos = [c.value for c in ws["C"]].count(2)
The list comprehension creates a list of cell values throughout column C, and it counts how many 2 is in it.
Are you looking for how many 2's are there?
count = 0
#load a row in the list
row = list(worksheet.rows)[wantedRowNumber]
#iterate over it and increase the count
for r in row:
if r==2:
count+=1
Now, this only works with values of "2" and doesn't find other outliers. To find outliers in general you will have to determine a threshold first. In this example I'll use the average value, although you would need to determine the best test to get the threshold for outliers based on your data. Don't worry, statistics are fun!
count = 0
#load a row in the list
row = list(worksheet.rows)[wantedRowNumber]
#calculatethe average
#using numpy
import numpy as np
NPavg = np.mean(list)
#without numpy
#need to cast it to float - otherwise it will round it to int
avg=sum(row)/float(len(row))
#iterate over it and increase the count
for r in row:
#of course use your own threshold,
#determined appropriately, instead of average
if r>NPavg:
count+=1

Averaging values in a list of a list of a list in Python

I'm working on a method to average data from multiple files and put the results into a single file. Each line of the files looks like:
File #1
Test1,5,2,1,8
Test2,10,4,3,2
...
File #2
Test1,2,4,5,1
Test2,4,6,10,3
...
Here is the code I use to store the data:
totalData = []
for i in range(0, len(files)):
data = []
if ".csv" in files[i]:
infile = open(files[i],"r")
temp = infile.readline()
while temp != "":
data.append([c.strip() for c in temp.split(",")])
temp = infile.readline()
totalData.append(data)
So what I'm left with is totalData looking like the following:
totalData = [[
[Test1,5,2,1,8],
[Test2,10,4,3,2]],
[[Test1,2,4,5,1],
[Test2,4,6,10,3]]]
What I want to average is for all Test1, Test2, etc, average all the first values and then the second values and so forth. So testAverage would look like:
testAverage = [[Test1,3.5,3,3,4.5],
[Test2,7,5,6.5,2.5]]
I'm struggling to think of a concise/efficient way to do this. Any help is greatly appreciated! Also, if there are better ways to manage this type of data, please let me know.
It just need two loops
totalData = [ [['Test1',5,2,1,8],['Test2',10,4,3,2]],
[['Test1',2,4,5,1],['Test2',4,6,10,3]] ]
for t in range(len(totalData[0])): #tests
result = [totalData[0][t][0],]
for i in range(1,len(totalData[0][0])): #numbers
sum = 0.0
for j in range(len(totalData)):
sum += totalData[j][t][i]
sum /= len(totalData)
result.append(sum)
print result
first flatten it out
results = itertools.chain.from_iterable(totalData)
then sort it
results.sort()
then use groupby
data = {}
for key,values in itertools.groupby(results,lambda x:x[0]):
columns = zip(*values)
data[key] = [sum(c)*1.0/len(c) for c in columns]
and finally just print your data
If your data structure is regular, the best is probably to use numpy. You should be able to install it with pip from the terminal
pip install numpy
Then in python:
import numpy as np
totalData = np.array(totalData)
# remove the last dimension (i.e. 'Test1', 'Test2'), since it's not a number
totalData = np.array(totalData[:, :, 1:], float)
# average
np.mean(totalData, axis=0)

How to average values in one column of a csv if the values in another column are not changing?

EDIT: See end of my post for working code, obtained from zeekay here.
I have a CSV file with two columns (voltage and current). Because the voltage is recorded to many significant digits and the current only has 2, there are many identical current values as the value of the voltage changes. This isn't important to the programming but I'm just explaining how the data is physically obtained. I want to perform the following action:
For as long as the value of the second column (current) does not change, collect the values of the first column (voltage) into a list and average them. Then write a row into a new CSV file which is this averaged value of the voltage in the first column and the constant current value which did not change in the second column. In other words, if there are 20 rows for which the current did not change (say it is 6 uA), the 20 corresponding voltage values are averaged (say this average comes out to be 600 mV) and a row is generated in a new csv file which reads ('0.6','0.000006'). Then I want to continue iterating through the csv which is being read, repeating the above procedure for each set of fixed currents.
I've got the following code so far, but I'm not sure if I'm on the right track:
import sys, csv
with open('filetowriteto.csv','w') as avg:
loadeddata = open('filetoreadfrom.csv','r')
writer=csv.writer(avg)
readloaded=csv.reader(loadeddata)
listloaded=list(readloaded)
oldcurrent=listloaded[0][1]
for row in readloaded:
newcurrent = row[1]
biaslist = []
if newcurrent == oldcurrent:
biaslist.append(row[0])
else :
biasavg = float(sum(biaslist))/len(biaslist)
writer.writerow([biasavg,newcurrent])
newcurrent = row[1]
and then I'm not sure where to go.
Edit: It seems that zeekay is on the right track for what I want to do. I'm trying to implement his itertools.groupby() method but I'm currently getting a blank file generated. Here's my new code so far:
import sys, csv, itertools
with open('VI_avg(12).csv','w') as avg: # this is the file which gets written
loadeddata = open('VI(12).csv','r') # this is the file which is read
writer=csv.writer(avg)
readloaded=csv.reader(loadeddata)
listloaded=list(readloaded)
oldcurrent=listloaded[0][1] # looks like this is no longer required
for current, row in itertools.groupby(readloaded, lambda x: x[1]):
biaslist = [float(x[0]) for x in row]
biasavg = float(sum(biaslist))/len(biaslist)
# write it out
writer.writerow(biasavg, current)
Suppose the CSV file being opened is something like this (shortened example):
0.595417,0.000065
0.595177,0.000065
0.594937,0.000065
0.594697,0.000065
0.594457,0.000065
0.594217,0.000065
0.593977,0.000065
0.593737,0.000065
0.593497,0.000064
0.593017,0.000064
0.592777,0.000064
0.592537,0.000064
0.592297,0.000064
0.587018,0.000064
0.586778,0.000064
0.586538,0.000063
0.586299,0.000063
0.586059,0.000063
0.585579,0.000063
0.585339,0.000063
0.585099,0.000063
0.584859,0.000063
0.584619,0.000063
0.584379,0.000063
0.584139,0.000063
0.583899,0.000063
0.583659,0.000063
Final update: Here's the working version, obtained from zeekay:
import csv
import itertools
with open('VI(12).csv') as input, open('VI_avg(12).csv','w') as output:
reader = csv.reader(input)
writer = csv.writer(output)
for current, row in itertools.groupby(reader, lambda x: x[1]):
biaslist = [float(x[0]) for x in row]
biasavg = float(sum(biaslist))/len(biaslist)
writer.writerow([biasavg, current])
You can use itertools.groupby to group results as you read through the csv, which would simplify things a lot. Given your updated example:
import csv
import itertools
with open('VI(12).csv') as input, open('VI_avg(12).csv','w') as output:
reader = csv.reader(input)
writer = csv.writer(output)
for current, row in itertools.groupby(reader, lambda x: x[1]):
biaslist = [float(x[0]) for x in row]
biasavg = float(sum(biaslist))/len(biaslist)
writer.writerow([biasavg, current])
Maybe you can try using pandas:
import pandas
voltage = [1.1, 1.2, 1.3, 2.1, 2.2, 2.3]
current = [1.0, 1.0, 1.1, 1.3, 1.2, 1.3]
df = pandas.DataFrame({'voltage': voltage, 'current': current})
result = df.groupby('current').mean()
# Output
voltage
current
1.0 1.15
1.1 1.30
1.2 2.20
1.3 2.20
result.to_csv('grouped_data.csv')
One way:
curDict = {}
for row in loaded row:
if row[1] not in curDict.keys(): # if not already there create key/value pair
curDict[str(row[1])] = [row[0]]
else: # already exists, add to key/value pair
curDict[str(row[1])].append(row[0])
#You'll end up with:
# {'0.6': [599, 600, 601...], ...}
# write the rows
for k,v in curDict.values():
avgValue = reduce(lambda a,b: a+b, v)/len(v) # calculate the avg of the voltages
writer.writerow([k,avgValue])
This version will do what you describe, but it will average all values with the same voltage, regardless of whether they are consecutive or not. Apologies if that's not what you want, but maybe it can help you along the way:
import csv
from collections import defaultdict
def f(acc, row):
acc[row[1]].append(float(row[0]))
return acc
with open('out.csv', 'w') as out:
writer = csv.writer(out)
data = open('in.csv', 'r')
r = csv.reader(data)
reduced = reduce(f, r, defaultdict(list))
for v, c in reduced.items():
writer.writerow([v, sum(c)/len(c)])
Yet another way using some very small test data (haven't included the csv stuff as you appear to have a handle on that):
#!/usr/bin/python3
test_data = [ # Only 3 currents in testdata:
(0.00030,5), # 5 : Only one entry, total 0.00030 - so should give 0.00030 as the average
(0.00012,6), # 6 : Two entries, total 0.00048 - so should give 0.00024 as the average
(0.00036,6),
(0.00001,7), # 7 : Four entries, total 0.00008 - so should give 0.00002 as the average
(0.00001,7),
(0.00001,7),
(0.00007,7)]
currents = dict()
for row in test_data:
if not row[1] in currents:
matching_currents = list((each[0] for each in test_data if each[1] == row[1]))
current_average = sum(matching_currents) / len(matching_currents)
currents[row[1]] = current_average
print("There were {0} unique currents found:\n".format(len(currents)))
for current,bias in currents.items():
print("Current: {0:2d} ( Average: {1:1.5f} )".format(current,bias))

Categories