How to compare these data sets from a csv? Python 2.7 - python

I have a project where I'm trying to create a program that takes a CSV data set from www.transtats.gov, a data set of airline flights in the US. My goal is to find the flight from one airport to another that had the worst delays overall, meaning it is the "worst flight". So far I have this:
import csv
with open('826766072_T_ONTIME.csv') as csv_infile: #import and open CSV
    reader = csv.DictReader(csv_infile)
    total_delay = 0
    flight_count = 0
    flight_numbers = []
    delay_totals = []
    dest_list = [] #create empty list of destinations
    for row in reader:
        if row['ORIGIN'] == 'BOS': #only take flights leaving BOS
            if row['FL_NUM'] not in flight_numbers:
                flight_numbers.append(row['FL_NUM'])
            if row['DEST'] not in dest_list: #if the dest is not already in the list
                dest_list.append(row['DEST']) #append the dest to dest_list
    for number in flight_numbers:
        for row in reader:
            if row['ORIGIN'] == 'BOS': #for flights leaving BOS
                if row['FL_NUM'] == number:
                    if float(row['CANCELLED']) < 1: #if the flight is not cancelled
                        if float(row['DEP_DELAY']) >= 0: #and the delay is greater or equal to 0 (some flights had negative delay?)
                            total_delay += float(row['DEP_DELAY']) #add time of delay to total delay
                            flight_count += 1 #add the flight to total flight count
    for row in reader:
        for number in flight_numbers:
            delay_totals.append(sum(row['DEP_DELAY']))
I was thinking that I could create a list of flight numbers and a list of the total delays from those flight numbers and compare the two and see which flight had the highest delay total. What is the best way to go about comparing the two lists?

I'm not sure if I understand you correctly, but I think you should use a dict for this purpose, where the key is the 'FL_NUM' and the value is the total delay.
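For example, a minimal sketch of that idea (assuming, as in your code, that DEP_DELAY is always populated for non-cancelled rows):
import csv
from collections import defaultdict

delay_by_flight = defaultdict(float)  # FL_NUM -> total departure delay in minutes
with open('826766072_T_ONTIME.csv') as csv_infile:
    for row in csv.DictReader(csv_infile):
        if row['ORIGIN'] == 'BOS' and float(row['CANCELLED']) < 1:
            if float(row['DEP_DELAY']) >= 0:
                delay_by_flight[row['FL_NUM']] += float(row['DEP_DELAY'])

worst = max(delay_by_flight, key=delay_by_flight.get)  # flight number with the largest total delay
print("%s: %.1f minutes of total delay" % (worst, delay_by_flight[worst]))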

In general I try to eliminate redundant loops in Python code. For files that aren't massive I'll typically read through the data file once and build up some dicts that I can analyze at the end. The code below isn't tested because I don't have the original data, but it follows the general pattern I would use.
Since a flight is identified by the origin, destination, and flight number, I would capture those as a tuple and use that as the key in my dict.
import csv
from collections import defaultdict

flight_delays = defaultdict(list)  # look this up if you aren't familiar
with open('826766072_T_ONTIME.csv') as csv_infile:
    reader = csv.DictReader(csv_infile)
    for row in reader:
        if row['ORIGIN'] == 'BOS':  # only take flights leaving BOS
            if float(row['CANCELLED']) < 1:  # skip cancelled flights
                flight = (row['ORIGIN'], row['DEST'], row['FL_NUM'])
                flight_delays[flight].append(float(row['DEP_DELAY']))

# Finished reading through data, now I want to calculate average delays
worst_flight = ""
worst_delay = 0
for flight, delays in flight_delays.items():
    average_delay = sum(delays) / len(delays)
    if average_delay > worst_delay:
        worst_flight = flight[0] + " to " + flight[1] + " on FL#" + flight[2]
        worst_delay = average_delay
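After the loop, worst_flight and worst_delay hold the result and can simply be printed (the % formatting also works on Python 2.7):
print("Worst flight: %s (average delay %.1f minutes)" % (worst_flight, worst_delay))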

A very simple solution would be to add two new variables:
max_delay = 0
delay_flight = 0

# Change: if float(row['DEP_DELAY']) >= 0:  to:
if float(row['DEP_DELAY']) > max_delay:
    max_delay = float(row['DEP_DELAY'])
    delay_flight = row['FL_NUM']  # save the row number or flight number for reference
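In context, slotted into the existing loop over the rows from the question, that change would look roughly like this:
for row in reader:
    if row['ORIGIN'] == 'BOS' and row['FL_NUM'] == number:
        if float(row['CANCELLED']) < 1:  # the flight was not cancelled
            if float(row['DEP_DELAY']) > max_delay:
                max_delay = float(row['DEP_DELAY'])
                delay_flight = row['FL_NUM']  # remember which flight produced this delay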


How to remove an element and append python list conditionally?

I receive timeseries data from a broker and want to implement condition monitoring on this data. I want to analyze the data in a window of size 10, and the window size must always stay the same. When the 11th data point comes, I need to check its value against two thresholds calculated from the 10 values inside the window. If the 11th point is an outlier, I must drop it; if it is within the range, I must delete the first element and append the 11th point at the end. This way the size of the window stays the same. The code is simplified; data arrives every second.
temp_list = []
window_size = 10

if len(temp_list) <= window_size:
    temp_list.append(data)
if len(temp_list) == 10:
    avg = statistics.mean(temp_list)
    std = statistics.stdev(temp_list)
    u_thresh = avg + 3*std
    l_thresh = avg - 3*std
    temp_list.append(data)
    if temp_list[window_size] < l_thresh or temp_list[window_size] > u_thresh:
        temp_list.pop(-1)
    else:
        temp_list.pop(0)
        temp_list.append(data)
With this code the list does not get updated: the 11th data point is stored and then no new data is added. I don't know how to implement it correctly. Sorry if this is a simple question; I am still not very comfortable with Python lists. Thank you for your hint/help.
As your code currently stands, if you plan to keep the last data point you end up adding it twice instead. You can simplify your code to make it a bit clearer and more straightforward.
import statistics

## First set up your initial variables
temp_list = []
window_size = 10
Then:
while True:
    data = ...  ## Generate/get data here
    ## If fewer than 10 data points, add them to the list
    if len(temp_list) < window_size:
        temp_list.append(data)
    ## If already at 10, check whether the new point is within the needed range
    else:
        avg = statistics.mean(temp_list)
        std = statistics.stdev(temp_list)
        u_thresh = avg + 3*std
        l_thresh = avg - 3*std
        ## If within range, append the point to the end of the list and remove the first element
        if l_thresh <= data <= u_thresh:
            temp_list.pop(0)
            temp_list.append(data)
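As an aside (not part of the answer above), collections.deque with maxlen gives you a fixed-size window for free; a minimal sketch using the same 3-sigma check:
import statistics
from collections import deque

window = deque(maxlen=10)  # oldest values are evicted automatically once the deque is full

def process(data):
    if len(window) == window.maxlen:
        avg = statistics.mean(window)
        std = statistics.stdev(window)
        if not (avg - 3*std <= data <= avg + 3*std):
            return  # outlier: drop it and keep the window unchanged
    window.append(data)  # in range (or still filling up): append; the oldest value falls off if full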

Concatenating tables with axis=1 in Orange python

I'm fairly new to Orange.
I'm trying to separate rows of angle (elv) into intervals.
Let's say I want to separate my 90-degree angle into 8 intervals, i.e. 90/8 = 11.25 degrees per interval.
Here's the table I'm working with
Here's what I did originally, separating them by their elv value
Here's the result that I want, x rows 16 columns separated by their elv value.
But I want them done dynamically.
I list them out and turn each list into a table with x rows and 2 columns.
This is what I originally did
from Orange.data.table import Table
from Orange.data import Domain, ContinuousVariable, DiscreteVariable
import numpy
import pandas as pd
from pandas import DataFrame

df = pd.DataFrame()
num = 10  #number of intervals that we want to separate our elv into
interval = 90.00/num  #separating them into degrees per interval
low = 0
high = interval
table = []
first = []
second = []
for i in range(num):
    between = []
    if i != 0:  #not the first run
        low = high
        high = high + interval
    for row in in_data:  #run through the whole table to see if the elv falls between the interval bounds
        if row[0] >= low and row[0] < high:
            between.append(row)
    elv = "elv" + str(i)
    err = "err" + str(i)
    domain = Domain([ContinuousVariable.make(err)], [ContinuousVariable.make(elv)])
    data = Table.from_numpy(domain, numpy.array(between))
    print("table number ", i)
    print(data[:3])
Here's the output
But as you can see, these are separate tables being reassigned on every loop iteration, and I have to find a way to concatenate them along axis=1.
Even the source code for Orange3 forbids this for some reason.
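No answer is recorded for this question, but one hedged sketch (pure pandas/numpy rather than Orange's Table, since the question says Orange forbids axis-1 concatenation, and with hypothetical sample data standing in for in_data) is to keep each interval as a DataFrame and concatenate those:
import numpy as np
import pandas as pd

# hypothetical (elv, err) pairs standing in for in_data
rows = np.array([[3.0, 0.1], [12.5, 0.2], [25.0, 0.3], [40.0, 0.1], [70.0, 0.4]])

num = 8
interval = 90.0 / num
pieces = []
for i in range(num):
    low, high = i * interval, (i + 1) * interval
    between = rows[(rows[:, 0] >= low) & (rows[:, 0] < high)]  # rows whose elv falls in this interval
    piece = pd.DataFrame(between, columns=["elv" + str(i), "err" + str(i)])
    pieces.append(piece.reset_index(drop=True))

wide = pd.concat(pieces, axis=1)  # x rows, 2*num columns: the intervals side by side
print(wide)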

Append recursively each record from dictionary to obtain values

Following my question submitted in the last few days, I have a defaultdict where each entry holds the ticket-sale records for a deviceID, i.e. a passenger for a bus sale. The whole devicedict contains all the tickets sold for a given year, around 1 million. The defaultdict is indexed by the deviceID, which is the key.
I need to know the average delay between the purchase date and the actual date of departure for each ticket purchase. My problem is that I can't seem to extract each record from the dictionary.
So, for each key, devicedict[key] is a list of records with over 60 different fields: date_departure, date_arrival, etc. In each turn of the loop I want to process something like devicedict[deviceID][field of interest], do something with it, and, for example, extract the median delay between each purchase and departure.
I've tried using append and using nested arrays, but it doesn't return each individual record by itself.
valoresDias is the sum of the delays for each ticket (purchase date minus departure date) in seconds divided by a day (86400 seconds), and valoresTotalesDias is just a counter. The total average delay should be valoresDias/valoresTotalesDias over all the records.
with open('Salida1.csv', newline='', mode='r') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    #rows1 = list(csv_reader)
    #print(len(rows1))
    line_count = 0
    count = 0
    for row in csv_reader:
        key = row[20]
        devicedict[key].append(row)
        if line_count == 0:
            print(f'Column names are {", ".join(row)}')
            line_count += 1
        else:
            #print(f'\t{row[0]} works in the {row[20]} department, and was born in {row[2]}.')
            #print(row['id'], row['idapp'])
            #print(len(row))
            #print(list(row))
            mydict5ordenado.append(list(row))
            line_count += 1

print(len(devicedict.keys()))
f = "%Y-%m-%d %H:%M:%S"
p = devicedict.keys()
for i in range(0, len(devicedict)):
    mydict.append(devicedict[list(p)[i]])
    print(mydict[i])
    print("Los campos temporales:")
    #print(mydict[i][4])
    #print(mydict[i][3])
    out1 = datetime.datetime.strptime(mydict[i][4], f)
    out2 = datetime.datetime.strptime(mydict[i][3], f)
    out3 = out1 - out2
    valoresTotalesDias += 1
    valoresDias += out3.seconds / 86400

# This is what I am trying to obtain for each record without hardcoding
# I want to access each field in the above loop
count1 = len(devicedict['4ff70ad8e2e74f49'])
for i in range(0, count1):
    mydict5.append(devicedict['4ff70ad8e2e74f49'][i])
print(len(mydict5))
for i in range(0, len(mydict5)):
    print(mydict5[i][7])
    print("Tipo de Bus:")
    print(mydict5[i][16])
    print(mydict5[i][14])
    if (mydict5[i][16] == 'P'):
        preferente += 1
mydict[i] should contain only one record, that is, one sale per passenger, not the whole set of records.
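No answer is recorded here either, but a minimal sketch of the per-record loop, assuming (as the code above suggests) that fields 3 and 4 of each record hold the purchase and departure timestamps in "%Y-%m-%d %H:%M:%S" format, and with a hypothetical one-record devicedict standing in for the real one:
import datetime
from collections import defaultdict

# hypothetical stand-in for the devicedict built from Salida1.csv
devicedict = defaultdict(list)
devicedict['4ff70ad8e2e74f49'].append(
    ['', '', '', '2019-05-01 10:00:00', '2019-05-03 08:30:00'] + [''] * 55)

fmt = "%Y-%m-%d %H:%M:%S"
valoresDias = 0.0
valoresTotalesDias = 0
for device_id, records in devicedict.items():            # every device
    for record in records:                                # every ticket sold on that device
        out1 = datetime.datetime.strptime(record[4], fmt)  # assumed: departure timestamp
        out2 = datetime.datetime.strptime(record[3], fmt)  # assumed: purchase timestamp
        delta = out1 - out2
        valoresDias += delta.total_seconds() / 86400       # total_seconds() also counts whole days
        valoresTotalesDias += 1

print(valoresDias / valoresTotalesDias)  # average delay in days over all records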

Performing different operations on columns in a file

I am trying to write code that will handle my input file of numbers and then perform various operations on them. The first column is a name, the second is an hourly rate, and the third is hours. The file looks like this:
John 15 8
Sam 10 4
Mike 16 10
John 19 15
I want to go through and, if a name is a duplicate (John in the example), average the 2nd number (hourly rate), sum the 3rd number (hours), and delete the duplicate, leaving one John with the average wage and total hours. If it's not a duplicate, it should just output the original entry.
I cannot figure out how to keep track of the duplicates and then move on to the next line. Is there any way to do this without using line.split()?
This problem is easier if you break it up into parts.
First, you want to read through the file and parse each line into three variables, the name, the hourly rate, and the hours.
Second, you need to handle the matching on the first value (the name). You need some kind of data structure to store values in; a dict is probably the right thing here.
Thirdly, you need to compute the average at the end (you can't compute it along the way because you need the count of values).
Putting it together, I would do something like this:
class PersonRecord:
    def __init__(self, name):
        self.name = name
        self.hourly_rates = []
        self.total_hours = 0

    def add_record(self, hourly_rate, hours):
        self.hourly_rates.append(hourly_rate)
        self.total_hours += hours

    def get_average_hourly_rate(self):
        return sum(self.hourly_rates) / len(self.hourly_rates)


def compute_person_records(data_file_path):
    person_records = {}
    with open(data_file_path, 'r') as data_file:
        for line in data_file:
            parts = line.split(' ')
            name = parts[0]
            hourly_rate = int(parts[1])
            hours = int(parts[2])
            person_record = person_records.get(name)
            if person_record is None:
                person_record = PersonRecord(name)
                person_records[name] = person_record
            person_record.add_record(hourly_rate, hours)
    return person_records


def main():
    person_records = compute_person_records('data.txt')  # path to your input file
    for person_name, person_record in person_records.items():
        print('{name} {average_hourly_rate} {total_hours}'.format(
            name=person_name,
            average_hourly_rate=person_record.get_average_hourly_rate(),
            total_hours=person_record.total_hours))


if __name__ == '__main__':
    main()
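With the sample file from the question, this prints something like (dict order may vary): John 17.0 23, Sam 10.0 4, Mike 16.0 10.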
Here we go. Just group by the name and aggregate on the rate and hours, taking the mean and the sum respectively, as shown below.
#assume d is the name of your DataFrame.
d.groupby(by =['name']).agg({'rate': "mean", 'hours':'sum'})
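For reference, a hedged sketch of building that DataFrame from the whitespace-separated file in the question (the file name and column names here are assumptions):
import pandas as pd

# the file has no header row, so supply the column names ourselves
d = pd.read_csv('employees.txt', sep=r'\s+', names=['name', 'rate', 'hours'])
print(d.groupby(by=['name']).agg({'rate': 'mean', 'hours': 'sum'}))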
Here's a version that's not particularly efficient. I wouldn't run it on lots of data, but it's easy to read and returns your data to its original form, which is apparently what you want...
from statistics import mean

raw = '''John 15 8
Sam 10 4
Mike 16 10
John 19 15'''

lines = raw.splitlines()
data = [line.split(' ') for line in lines]
names = set([item[0] for item in data])
processed = [(name,
              str(mean([int(i[1]) for i in data if i[0] == name])),
              str(sum([int(i[2]) for i in data if i[0] == name])))
             for name in names]
joined = [' '.join(p) for p in processed]
line_joined = '\n'.join(joined)
a = []  # list to store all the rows
while True:  # keep reading input lines until end of input
    try:
        l = input().split()
        if l:
            a.append(l)
    except EOFError:
        break

seen = set()   # names already merged
result = []
for i in a:
    if i[0] in seen:
        continue  # this name was already merged into an earlier row
    seen.add(i[0])
    m = [row for row in a if row[0] == i[0]]  # this row plus any duplicates
    hr = sum(int(k[1]) for k in m)   # total hourly rate across duplicates
    hrs = sum(int(k[2]) for k in m)  # total hours across duplicates
    result.append([i[0], hr / len(m), hrs])  # average rate, summed hours

for i in result:
    print(i[0], i[1], i[2])  # printing the final list
Read the comments in the code for an explanation of each step.
You can do:
from collections import defaultdict

with open('file_name') as fd:
    data = fd.read().splitlines()

line_elems = []
for line in data:
    line_elems.append(line.split())

a_dict = defaultdict(list)
for e in line_elems:
    a_dict[e[0]].append((e[1], e[2]))

final_dict = {}
for key in a_dict:
    if len(a_dict[key]) > 1:
        hour_rates = [float(x[0]) for x in a_dict[key]]
        hours = [float(x[1]) for x in a_dict[key]]
        ave_rate = sum(hour_rates) / len(hour_rates)
        total_hours = sum(hours)
        final_dict[key] = (ave_rate, total_hours)
    else:
        final_dict[key] = a_dict[key]

print(final_dict)
# write to file or do whatever

How can I loop through an index and keep the associated row information?

I have a loop within my function that is supposed to find the max rate and min rate and compute the average, and the function I wrote does this correctly. But how can I keep the row information when I find the max and min within my data? I'm a beginner at Python, but here is the loop that I have.
max_rate = -1
min_rate = 25
count = 0
sum = 0
with open(file_names, "r") as file_out:
    # skips the headers in the file
    next(file_out)
    for line in file_out:
        values = line.split(",")
        # since rate is index 6 that is what we are going to compare to the values above
        if float(values[6]) > max_rate:
            max_rate = float(values[6])
        if float(values[6]) < min_rate:
            min_rate = float(values[6])
        count += 1
        # sum up all rates in the rates column
        sum = float(values[6]) + sum
avg_rate = sum / count
print(avg_rate)
I have printed the average just to test my function. Hopefully the question I am asking makes sense: I don't just want the 6th index, I want the rest of the row information that goes with the min or the max. An example would be to get the company name, state, zip, and rate. Don't worry about indentation; I don't know if I formatted it right in the code block here, but all the indents are right in my actual code.
It looks like you're working with CSV or other table-like data. Pandas handles this really well. An example would be:
import pandas as pd
df = pd.read_csv('something.csv')
print(df)
print(f'\nMax Rate: {df.rate.max()}')
print(f'Avg Rate: {df.rate.mean()}')
print(f'Min Rate: {df.rate.min()}')
print(f'Last Company (Alphabetically): {df.company_name.max()}')
Yields:
company_name state zip rate
0 Company1 Inc. Texas 76189 0.6527
1 Company2 LLC. Pennsylvania 18657 0.7265
2 Company3 Corp Indiana 47935 0.5267
Max Rate: 0.7265
Avg Rate: 0.6353
Min Rate: 0.5267
Last Company (Alphabetically): Company3 Corp
Try this:
max_rate = []
min_rate = []
count = 0
total = 0
with open(file_names, "r") as file_out:
    # skips the headers in the file
    next(file_out)
    for line in file_out:
        values = line.split(",")
        # keep the whole row, comparing on the rate column converted to a number
        max_rate = max(values, max_rate or values, key=lambda x: float(x[6]))
        min_rate = min(values, min_rate or values, key=lambda x: float(x[6]))
        # sum up all rates in the rates column
        total += float(values[6])
        count += 1
avg_rate = total / count
print(avg_rate)
This assigns the whole row (as a list) to max_rate and min_rate, compared on the 6th column as you intended. The max_rate or values expression falls back to values while max_rate is still empty (which is the case on the first iteration of the for loop); that prevents an IndexError in the key function. The same applies to min_rate.
An important change I've made to your code is the name of the variable sum. That's a Python built-in, and it's not good practice to shadow it with a variable name, so prefer something like total or total_sum instead.
Those suggestions are great, thanks. I also found out that I could just assign the line to a variable underneath my if statements, and at the beginning of my function assign those variables to an empty string, like:
info_high = ""  # at the top of the function
info_low = ""

info_high = line  # underneath the max-rate if statement
info_low = line   # underneath the min-rate if statement
That saves the row information I need, and then I can just index out the fields I want.
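Put together with the loop from the question (file_names is the path variable used there), that idea might look like this sketch:
info_high = ""  # full row that produced the max rate
info_low = ""   # full row that produced the min rate
max_rate = -1
min_rate = 25

with open(file_names, "r") as file_out:
    next(file_out)  # skip the header line
    for line in file_out:
        values = line.split(",")
        if float(values[6]) > max_rate:
            max_rate = float(values[6])
            info_high = line
        if float(values[6]) < min_rate:
            min_rate = float(values[6])
            info_low = line

# info_high and info_low now hold the complete rows; split them to get
# the company name, state, zip, and rate fields
print(info_high.split(","))
print(info_low.split(","))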
