Performing different operations on columns in a file - python

I am trying to write code that will handle my input file of numbers and then perform various operations on them. For example, the first column is a name, the second is an hourly rate, and the third is hours. The file looks like this:
John 15 8
Sam 10 4
Mike 16 10
John 19 15
I want to go through and, if a name is a duplicate (John in the example), average the 2nd number (hourly rate), sum the 3rd number (hours), and delete the duplicate, leaving one John with the average wage and total hours. If it is not a duplicate it will just output the original entry.
I cannot figure out how to keep track of the duplicates and then move on to the next line. Is there any way to do this without using line.split()?

This problem is easier if you break it up into parts.
First, you want to read through the file and parse each line into three variables, the name, the hourly rate, and the hours.
Second, you need to handle the matching on the first value (the name). You need some kind of data structure to store values in; a dict is probably the right thing here.
Third, you need to compute the average at the end (you can't compute it along the way because you need the count of values).
Putting it together, I would do something like this:
class PersonRecord:
    def __init__(self, name):
        self.name = name
        self.hourly_rates = []
        self.total_hours = 0

    def add_record(self, hourly_rate, hours):
        self.hourly_rates.append(hourly_rate)
        self.total_hours += hours

    def get_average_hourly_rate(self):
        return sum(self.hourly_rates) / len(self.hourly_rates)

def compute_person_records(data_file_path):
    person_records = {}
    with open(data_file_path, 'r') as data_file:
        for line in data_file:
            parts = line.split(' ')
            name = parts[0]
            hourly_rate = int(parts[1])
            hours = int(parts[2])
            person_record = person_records.get(name)
            if person_record is None:
                person_record = PersonRecord(name)
                person_records[name] = person_record
            person_record.add_record(hourly_rate, hours)
    return person_records

def main():
    person_records = compute_person_records('data.txt')  # pass the path to your data file
    for person_name, person_record in person_records.items():
        print('{name} {average_hourly_rate} {total_hours}'.format(
            name=person_name,
            average_hourly_rate=person_record.get_average_hourly_rate(),
            total_hours=person_record.total_hours))

if __name__ == '__main__':
    main()

Here we go. Just groupby the name and aggregate on the rate and hours, taking the mean and the sum as shown below.
# assume d is the name of your DataFrame.
d.groupby(by=['name']).agg({'rate': 'mean', 'hours': 'sum'})
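For completeness, here is a minimal runnable sketch under the same assumptions (columns named name, rate, and hours); the question's sample rows stand in for the real file via a StringIO buffer:

```python
import io

import pandas as pd

# The question's sample rows, used in place of a real file path.
raw = io.StringIO("John 15 8\nSam 10 4\nMike 16 10\nJohn 19 15")
d = pd.read_csv(raw, sep=" ", names=["name", "rate", "hours"])

# Average the rate and sum the hours per name.
result = d.groupby("name").agg({"rate": "mean", "hours": "sum"})
print(result)
```

For John this yields a mean rate of 17.0 and 23 total hours, one row per name.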

Here's a version that's not particularly efficient. I wouldn't run it on lots of data, but it's easy to read and returns your data to its original form, which is apparently what you want...
from statistics import mean

raw = '''John 15 8
Sam 10 4
Mike 16 10
John 19 15'''  # renamed from `input`, which would shadow the built-in

lines = raw.splitlines()
data = [line.split(' ') for line in lines]
names = set(item[0] for item in data)
processed = [(name,
              str(mean(int(i[1]) for i in data if i[0] == name)),
              str(sum(int(i[2]) for i in data if i[0] == name)))
             for name in names]
joined = [' '.join(p) for p in processed]
line_joined = '\n'.join(joined)

a = []  # list to store all the input rows
while True:  # keep reading until end of input
    try:
        l = input().split()
        a.append(l)
    except EOFError:
        break
for i in a:
    m = [i]  # temporary list that will contain this name's duplicates
    # iterate backwards so popping does not shift the indices still to be visited
    for j in range(len(a) - 1, a.index(i), -1):
        if i[0] == a[j][0]:
            m.append(a[j])  # collect the duplicate
            a.pop(j)        # remove it from the main list
    if len(m) > 1:
        hr = 0   # running total of hourly rates
        hrs = 0  # running total of hours
        for k in m:
            hr += int(k[1])
            hrs += int(k[2])
        i[1] = hr / len(m)  # average hourly rate
        i[2] = hrs          # total hours (summed, not averaged)
for i in a:
    print(i[0], i[1], i[2])  # printing the final list
See the comments in the code for an explanation.

You can do:
from collections import defaultdict

with open('file_name') as fd:
    data = fd.read().splitlines()

line_elems = []
for line in data:
    line_elems.append(line.split())

a_dict = defaultdict(list)
for e in line_elems:
    a_dict[e[0]].append((e[1], e[2]))

final_dict = {}
for key in a_dict:
    if len(a_dict[key]) > 1:
        hour_rates = [float(x[0]) for x in a_dict[key]]
        hours = [float(x[1]) for x in a_dict[key]]
        ave_rate = sum(hour_rates) / len(hour_rates)
        total_hours = sum(hours)
        final_dict[key] = (ave_rate, total_hours)
    else:
        # keep the same (rate, hours) shape as the branch above
        rate, hours = a_dict[key][0]
        final_dict[key] = (float(rate), float(hours))

print(final_dict)
# write to file or do whatever

Related

How to use pandas to check for a list of values from a csv spreadsheet while filtering out certain keywords?

Hey guys, this is my first post. I am planning on building an anime recommendation engine using Python. I ran into a problem: I made a list called genre_list which stores the genres that I want to filter from the huge data spreadsheet I was given. I am using the pandas library, and it has an isin() function to check whether the values of a list are included in the datasheet and filter on them. I am using the function, but it's not able to detect "Action" from the datasheet although it is there. I have a feeling there's something wrong with the data types and I probably have to work around it somehow, but I'm not sure how.
I downloaded my csv file from this link for anyone interested!
https://www.kaggle.com/datasets/marlesson/myanimelist-dataset-animes-profiles-reviews?resource=download
import pandas as pd

df = pd.read_csv('animes.csv')
genre = True
genre_list = []
while genre:
    genre_input = input("What genres would you like to watch?, input \"done\" when done listing!\n")
    if genre_input == "done":
        genre = False
    else:
        genre_list.append(genre_input)
print(genre_list)
df_genre = df[df["genre"].isin(genre_list)]
# df_genre = df["genre"]
print(df_genre)
You want to check whether ANY value in your user input list appears in each of the list values in the "genre" column. The isin function compares each whole cell against your inputs, and here every cell holds an entire list of genres, so nothing matches. Change that line to this:
df_genre = df[df['genre'].apply(lambda x: any([i in x for i in genre_list]))]
Let me know if you need any more help.
import pandas as pd

df = pd.read_csv('animes.csv')
genre = True
genre_list = []
while genre:
    genre_input = input("What genres would you like to watch?, input \"done\" when done listing!\n")
    if genre_input == "done":
        genre = False
    else:
        genre_list.append(genre_input)

# List of all cells and their genre put into a list
col_list = df["genre"].values.tolist()
temp_list = []
# Each val in the list is compared with the genre_list to see if there is a match
for index, val in enumerate(col_list):
    if all(x in val for x in genre_list):
        # If there is a match, the UID of that cell is added to a temp_list
        temp_list.append(df['uid'].iloc[index])
print(temp_list)

# This checks if the UID is contained in the temp_list of UIDs that have these genres
df_genre = df["uid"].isin(temp_list)
new_df = df.loc[df_genre, "title"]
# Prints all Anime with the specified genres
print(new_df)
This is another approach I took and works as well. Thanks for all the help :D
To make a selection from a dataframe, you can write this:
df_genre = df.loc[df['genre'].isin(genre_list)]
I've downloaded the file animes.csv from Kaggle and read it into a dataframe. What I found is that the column genre actually contains strings (of lists), not lists. So one way to get what you want would be:
...
m = df["genre"].str.contains(r"'(?:" + "|".join(genre_list) + r")'")
df_genre = df[m]
Result for genre_list = ["Action"]:
uid ... link
3 5114 ... https://myanimelist.net/anime/5114/Fullmetal_A...
4 31758 ... https://myanimelist.net/anime/31758/Kizumonoga...
5 37510 ... https://myanimelist.net/anime/37510/Mob_Psycho...
7 38000 ... https://myanimelist.net/anime/38000/Kimetsu_no...
9 2904 ... https://myanimelist.net/anime/2904/Code_Geass_...
... ... ... ...
19301 10350 ... https://myanimelist.net/anime/10350/Hakuouki_S...
19303 1293 ... https://myanimelist.net/anime/1293/Urusei_Yatsura
19304 150 ... https://myanimelist.net/anime/150/Blood_
19305 4177 ... https://myanimelist.net/anime/4177/Bounen_no_X...
19309 450 ... https://myanimelist.net/anime/450/InuYasha_Mov...
[4215 rows x 12 columns]
If you want to transform the values of the genre column for some reason into lists, then you could do either
df["genre"] = df["genre"].str[1:-1].str.replace("'", "").str.split(r",\s*")
or
df["genre"] = df["genre"].map(eval)
Afterwards
df_genre = df[~df["genre"].map(set(genre_list).isdisjoint)]
would give you the filtered dataframe.
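As a self-contained illustration of that isdisjoint filter, here is a toy two-row frame (made up for the example, not the Kaggle data) where the genre column already holds real lists:

```python
import pandas as pd

# Hypothetical frame with list-valued genre cells.
df = pd.DataFrame({
    "title": ["Show A", "Show B"],
    "genre": [["Action", "Drama"], ["Comedy"]],
})
genre_list = ["Action"]

# Keep rows whose genre list shares at least one entry with genre_list.
df_genre = df[~df["genre"].map(set(genre_list).isdisjoint)]
print(df_genre)
```

Only "Show A" survives the filter, since it is the only row sharing a genre with genre_list.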

How do I get this for loop to print a year for the amount of times of a value in another column

So I have a column release.TOTAL with values like [38,24,44,58,50,..]. This column states how many major films were made in a given year. What I want is for this to make a list that lists the year for each of the values. For example, if there were 25 movies made in 2016 there would be 25 2016s in the list.
total_years = []
for i in release.TOTAL:
    for j in range(i):
        for k in release.YEAR:
            total_years.append(k)
This is the function I have now, but it's printing the entire column each time the for loop runs. So how can I edit it so it does what I want?
If I understand correctly, release is a dataframe that has two columns, YEAR and TOTAL:
def append_years(yr, val, list_in):
    for i in range(val):
        list_in.append(yr)
    return list_in

total_years = []
for i in range(len(release)):
    total_years = append_years(release.YEAR[i], release.TOTAL[i], total_years)
print(total_years)
I think you want something like this?
class release:
    YEAR = None
    TOTAL = list()

release.YEAR = "2016"
release.TOTAL = [38, 24, 44, 58, 50]

total_years = []
for i in range(len(release.TOTAL)):
    total_years.append(release.YEAR)
print(total_years)
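If release really is a pandas DataFrame, as the question suggests, a loop-free alternative is to repeat each year by its total; a sketch with made-up numbers:

```python
import pandas as pd

# Hypothetical data: two years and their film counts.
release = pd.DataFrame({"YEAR": [2015, 2016], "TOTAL": [2, 3]})

# Repeat each year TOTAL times, then flatten to a plain list.
total_years = release["YEAR"].repeat(release["TOTAL"]).tolist()
print(total_years)  # [2015, 2015, 2016, 2016, 2016]
```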

find most frequent pairs in a dataframe

Suppose I have a two-column dataframe where the first column is the ID of a meeting and the second is the ID of one of the participants in that meeting. Like this:
meeting_id,person_id
meeting0,person1234
meeting0,person4321
meeting0,person5555
meeting1,person4321
meeting1,person9999
# ... ~1 million rows
I want to find each person's top 15 co-participants. Eg.: I want to know which 15 people most frequently participate in meetings with Brad.
As an intermediate step I wrote a script that takes the original dataframe and makes a person-to-person dataframe, like this:
person1234,person4321
person1234,person5555
person4321,person5555
person4321,person9999
...
But I'm not sure this intermediate step is necessary. Also, it's taking forever to run (by my estimate it should take weeks!). Here's the monstrosity:
import pandas as pd

links = []
lic = pd.read_csv('meetings.csv', sep=';', names=['meeting_id', 'person_id'],
                  dtype={'meeting_id': str, 'person_id': str})
grouped = lic.groupby('person_id')
for i, group in enumerate(grouped):
    print(i, 'of', len(grouped))
    person_id = group[0].strip()
    if len(person_id) == 14:
        meetings = set(group[1]['meeting_id'])
        for meeting in meetings:
            lic_sub = lic[lic['meeting_id'] == meeting]
            people = set(lic_sub['person_id'])
            for person in people:
                if person != person_id:
                    tup = (person_id, person)
                    links.append(tup)

df = pd.DataFrame(links)
df.to_csv('links.csv', index=False)
Any ideas?
So here is one way: merge the frame with itself, then sort the two person columns so each pair appears in a canonical order.
import numpy as np

s = df.merge(df, on='meeting_id')
s[['person_id_x', 'person_id_y']] = np.sort(s[['person_id_x', 'person_id_y']].values, 1)
s = s.query('person_id_x != person_id_y').drop_duplicates()
s
meeting_id person_id_x person_id_y
1 meeting0 person1234 person4321
2 meeting0 person1234 person5555
5 meeting0 person4321 person5555
10 meeting1 person4321 person9999
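To then get each person's top 15 co-participants, one option is to count pair frequencies on the merged table; here is a sketch on a tiny made-up frame (the pairs are kept in both directions here, so every person shows up in the left column):

```python
import pandas as pd

# Toy meetings table standing in for the ~1M-row file.
df = pd.DataFrame({
    "meeting_id": ["m0", "m0", "m0", "m1", "m1"],
    "person_id": ["p1", "p2", "p3", "p2", "p4"],
})

# Self-merge builds every co-participant pair; drop self-pairs.
s = df.merge(df, on="meeting_id")
s = s[s["person_id_x"] != s["person_id_y"]]

# Count how often each directed pair met, then take each person's top 15.
counts = (s.groupby(["person_id_x", "person_id_y"]).size()
            .rename("meetings").reset_index())
top15 = (counts.sort_values("meetings", ascending=False)
               .groupby("person_id_x").head(15))
print(top15)
```

Keeping both directions doubles the table, but it makes the per-person lookup a simple filter on person_id_x.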

How can I loop through an index and keep the associated row information?

I have a loop within my function that is supposed to find the max rate, min rate, and compute the average, and the function that I wrote is doing this right, but how can I keep the row information when I find the max, and min within my data? I'm a beginner at python, but here is the loop that I have.
max_rate = -1
min_rate = 25
count = 0
sum = 0
with open(file_names, "r") as file_out:
    # skips the headers in the file
    next(file_out)
    for line in file_out:
        values = line.split(",")
        # since rate is index 6 that is what we are going to compare to values above
        if float(values[6]) > max_rate:
            max_rate = float(values[6])
        if float(values[6]) < min_rate:
            min_rate = float(values[6])
        count += 1
        # sum up all rates in the rates column
        sum = float(values[6]) + sum
avg_rate = sum / count
print(avg_rate)
I have printed the average just to test my function. Hopefully the question I am asking makes sense, I don't just want the 6th index but I want the rest of the row information that has the min or the max. An example would be to get the company name, state, zip, and rate. Don't worry about indentations, I don't know if I formatted it right in the code block here, but all the indents are right in my code chunk.
It looks like you're working with CSV or other table-like data. Pandas handles this really well. An example would be:
import pandas as pd
df = pd.read_csv('something.csv')
print(df)
print(f'\nMax Rate: {df.rate.max()}')
print(f'Avg Rate: {df.rate.mean()}')
print(f'Min Rate: {df.rate.min()}')
print(f'Last Company (Alphabetically): {df.company_name.max()}')
Yields:
company_name state zip rate
0 Company1 Inc. Texas 76189 0.6527
1 Company2 LLC. Pennsylvania 18657 0.7265
2 Company3 Corp Indiana 47935 0.5267
Max Rate: 0.7265
Avg Rate: 0.6353
Min Rate: 0.5267
Last Company (Alphabetically): Company3 Corp
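To keep the whole row for the extremes rather than just the number, idxmax/idxmin return the index label of the extreme row, which loc can then pull out in full; a sketch with made-up rows:

```python
import pandas as pd

# Made-up rows standing in for the real CSV.
df = pd.DataFrame({
    "company_name": ["A Inc.", "B LLC", "C Corp"],
    "rate": [0.65, 0.73, 0.53],
})

# idxmax/idxmin give the row label of the max/min rate;
# loc then retrieves the entire row, not just the rate value.
max_row = df.loc[df["rate"].idxmax()]
min_row = df.loc[df["rate"].idxmin()]
print(max_row["company_name"], min_row["company_name"])  # B LLC C Corp
```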
Try this:
max_rate = []
min_rate = []
count = 0
total = 0
with open(file_names, "r") as file_out:
    # skips the headers in the file
    next(file_out)
    for line in file_out:
        values = line.split(",")
        # compare on the rate column, converted to float so the comparison is numeric
        max_rate = max(values, max_rate or values, key=lambda x: float(x[6]))
        min_rate = min(values, min_rate or values, key=lambda x: float(x[6]))
        # sum up all rates in the rates column
        total += float(values[6])
        count += 1
avg_rate = total / count
print(avg_rate)
This keeps the whole row (as a list) for the min and the max with respect to the rate column, as you intended. The max_rate or values code falls back to values while max_rate is still empty (as it is on the first iteration of the for loop), which prevents an IndexError. The same applies to min_rate. Note that the key converts the rate to float; comparing the raw strings would be lexicographic, so "9" would count as greater than "10".
An important change I've made to your code is the name of the variable sum. That's a Python built-in, and it's not good practice to shadow it with a variable name, so prefer something like total or total_sum instead.
Those suggestions are great, thanks. I also found out that I could just assign the line to a variable underneath my if statements, after initializing those variables to empty strings at the beginning of my function. Like
info_high = ""
info_low = ""
info_high = line
info_low = line
and it will be able to save the row information I need, and then I would just index the information that I need.
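That idea, sketched end to end (the two sample rows and the rate-at-index-6 layout are assumptions taken from the question):

```python
# Sample rows standing in for the real file; the rate sits at index 6.
rows = [
    "A Co,TX,76189,x,x,x,0.65\n",
    "B Co,PA,18657,x,x,x,0.73\n",
]
max_rate = -1.0
min_rate = 25.0
info_high = ""
info_low = ""
for line in rows:
    values = line.split(",")
    rate = float(values[6])
    if rate > max_rate:
        max_rate = rate
        info_high = line  # remember the whole row for the max
    if rate < min_rate:
        min_rate = rate
        info_low = line   # remember the whole row for the min
print(info_high.strip())  # full row holding the max rate
print(info_low.strip())   # full row holding the min rate
```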

How to compare these data sets from a csv? Python 2.7

I have a project where I'm trying to create a program that will take a csv data set from www.transtats.gov which is a data set for airline flights in the US. My goal is to find the flight from one airport to another that had the worst delays overall, meaning it is the "worst flight". So far I have this:
import csv

with open('826766072_T_ONTIME.csv') as csv_infile:  # import and open CSV
    reader = csv.DictReader(csv_infile)
    total_delay = 0
    flight_count = 0
    flight_numbers = []
    delay_totals = []
    dest_list = []  # create empty list of destinations
    for row in reader:
        if row['ORIGIN'] == 'BOS':  # only take flights leaving BOS
            if row['FL_NUM'] not in flight_numbers:
                flight_numbers.append(row['FL_NUM'])
            if row['DEST'] not in dest_list:  # if the dest is not already in the list
                dest_list.append(row['DEST'])  # append the dest to dest_list
    for number in flight_numbers:
        for row in reader:
            if row['ORIGIN'] == 'BOS':  # for flights leaving BOS
                if row['FL_NUM'] == number:
                    if float(row['CANCELLED']) < 1:  # if the flight is not cancelled
                        if float(row['DEP_DELAY']) >= 0:  # and the delay is greater or equal to 0 (some flights had negative delay?)
                            total_delay += float(row['DEP_DELAY'])  # add time of delay to total delay
                            flight_count += 1  # add the flight to total flight count
    for row in reader:
        for number in flight_numbers:
            delay_totals.append(sum(row['DEP_DELAY']))
I was thinking that I could create a list of flight numbers and a list of the total delays from those flight numbers and compare the two and see which flight had the highest delay total. What is the best way to go about comparing the two lists?
I'm not sure if I understand you correctly, but I think you should use a dict for this purpose, where the key is the 'FL_NUM' and the value is the total delay.
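A minimal sketch of that dict idea (the field names come from the question; the sample rows are made up):

```python
# Accumulate the total departure delay per flight number in a plain dict.
rows = [
    {"FL_NUM": "100", "DEP_DELAY": "10"},
    {"FL_NUM": "200", "DEP_DELAY": "5"},
    {"FL_NUM": "100", "DEP_DELAY": "20"},
]
delays = {}
for row in rows:
    delays[row["FL_NUM"]] = delays.get(row["FL_NUM"], 0.0) + float(row["DEP_DELAY"])

# The flight number with the largest total delay:
worst = max(delays, key=delays.get)
print(worst, delays[worst])  # 100 30.0
```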
In general I want to eliminate loops in Python code. For files that aren't massive I'll typically read through a data file once and build up some dicts that I can analyze at the end. The below code isn't tested because I don't have the original data but follows the general pattern I would use.
Since a flight is identified by the origin, destination, and flight number I would capture them as a tuple and use that as the key in my dict.
from collections import defaultdict

flight_delays = defaultdict(list)  # look this up if you aren't familiar
for row in reader:
    if row['ORIGIN'] == 'BOS':  # only take flights leaving BOS
        if float(row['CANCELLED']) < 1:  # keep only non-cancelled flights
            flight = (row['ORIGIN'], row['DEST'], row['FL_NUM'])
            flight_delays[flight].append(float(row['DEP_DELAY']))

# Finished reading through data, now I want to calculate average delays
worst_flight = ""
worst_delay = 0
for flight, delays in flight_delays.items():
    average_delay = sum(delays) / len(delays)
    if average_delay > worst_delay:
        worst_flight = flight[0] + " to " + flight[1] + " on FL#" + flight[2]
        worst_delay = average_delay
A very simple solution would be adding two new variables:
max_delay = 0
delay_flight = None

# Change: if float(row['DEP_DELAY']) >= 0: for:
if float(row['DEP_DELAY']) > max_delay:
    max_delay = float(row['DEP_DELAY'])
    delay_flight = row['FL_NUM']  # save the flight number for reference
