I have a data frame containing customers' ratings of the restaurants they visited, along with a few other attributes.
What I want to do is calculate, for each restaurant, the difference between its average star rating in its last year
and its average star rating in its first year.
import pandas as pd
import numpy as np

data = {'rating_id': ['1', '2','3','4','5','6','7'],
'user_id': ['56', '13','56','99','99','13','12'],
'restaurant_id': ['xxx', 'xxx','yyy','yyy','xxx','zzz','zzz'],
'star_rating': ['2.3', '3.7','1.2','5.0','1.0','3.2','1.0'],
'rating_year': ['2012','2012','2020','2001','2020','2015','2000'],
'first_year': ['2012', '2012','2001','2001','2012','2000','2000'],
'last_year': ['2020', '2020','2020','2020','2020','2015','2015'],
}
df = pd.DataFrame(data, columns = ['rating_id','user_id','restaurant_id','star_rating','rating_year','first_year','last_year'])
df.head()
df['star_rating'] = df['star_rating'].astype(float)
# calculate the average of the stars of the first year
ratings_mean_firstYear= df.groupby(['restaurant_id','first_year']).agg({'star_rating':[np.mean]})
ratings_mean_firstYear.columns = ['avg_firstYear']
ratings_mean_firstYear.reset_index()
# calculate the average of the stars of the last year
ratings_mean_lastYear= df.groupby(['restaurant_id','last_year']).agg({'star_rating':[np.mean]})
ratings_mean_lastYear.columns = ['avg_lastYear']
ratings_mean_lastYear.reset_index()
# merge the means into a single table
ratings_average = ratings_mean_firstYear.merge(
ratings_mean_lastYear.groupby('restaurant_id')['avg_lastYear'].max()
, on='restaurant_id'
)
ratings_average.head(20)
My problem is that the averages for the first and last years come out exactly the same, which makes no sense. I don't really know what I did wrong in my thought process; I suspect something is going on with .agg, since this is the first time I've used the pandas library.
Any suggestions?
Your data is provided in such a way that there is a single rating per user/restaurant pair, and you use that same rating in both the first-year and last-year aggregations, so naturally the result is equal for both years. I'd first filter the data on the rating_year == first_year criterion and then apply groupby and agg, then repeat the same for the last year and merge the two results. In your example there is not a single review whose year matches the first or last year of any restaurant, so showing a proper example requires more data; I assume that you have it in your larger dataframe.
Here is an example; I added more rows and changed the years to have more matches:
data = {'rating_id': ['1', '2','3','4','5','6','7','8','9'],
'user_id': ['56', '56','56','56', '99','99','99','99','99'],
'restaurant_id': ['xxx', 'xxx','yyy','yyy','xxx', 'xxx','yyy','yyy','xxx'],
'star_rating': ['2.3', '3.7','1.2','5.0','1.0','3.2','4.0','2.5','3.0'],
'rating_year': ['2012', '2020','2001','2020', '2012', '2020','2001','2020','2019'],
'first_year': ['2012', '2012','2001','2001','2012', '2012','2001','2001','2012'],
'last_year': ['2020', '2020','2020','2020','2020','2020','2020','2020','2020'],
}
df = pd.DataFrame(data, columns = ['rating_id','user_id','restaurant_id','star_rating','rating_year','first_year','last_year'])
df['star_rating'] = df['star_rating'].astype(float)
ratings_mean_firstYear = df[df.rating_year == df.first_year].groupby('restaurant_id').agg({'star_rating':'mean'})
ratings_mean_firstYear.columns = ['avg_firstYear']
ratings_mean_lastYear = df[df.rating_year == df.last_year].groupby('restaurant_id').agg({'star_rating':'mean'})
ratings_mean_lastYear.columns = ['avg_lastYear']
Result:
ratings_mean_firstYear.merge(ratings_mean_lastYear, left_index=True, right_index=True)
               avg_firstYear  avg_lastYear
restaurant_id
xxx                     1.65          3.45
yyy                     2.60          3.75
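The question ultimately asks for the difference between the two averages; a small follow-up step (a sketch, not part of the answer above, reusing the names defined there) would be:
ratings_change = ratings_mean_firstYear.merge(ratings_mean_lastYear, left_index=True, right_index=True)
# positive values mean the restaurant's average rating improved between its first and last year
ratings_change['rating_diff'] = ratings_change['avg_lastYear'] - ratings_change['avg_firstYear']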
I am trying to create a data frame that includes all the dates between '1/1/2019' and '28/02/2022', for each of 3 countries and each of 9 SKUs.
I am following this approach for the days, which is working fine:
days = pd.DataFrame(pd.date_range(start='1/1/2019', end='28/02/2022',freq='D'))
for i in range(26):
days=days.append(pd.DataFrame(pd.date_range(start='1/1/2019', end='28/02/2022',freq='D')))
days.rename(columns={0:'Date'},inplace=True)
days.reset_index(inplace=True,drop=True)
This gives a data frame with a single column of 31185 rows, corresponding to the 1155 days in that range multiplied by 27 (3 countries x 9 SKUs).
Now I would like to assign the country name and SKU name. To populate the countries I am following this manual approach, but it takes a while to run. Is there a smarter way to do it? I would like to do the same for the 9 SKUs.
for i in range(0,10395):
days.loc[i,'country']= 'Austria'
for i in range(10395,20790):
days.loc[i,'country']= 'Germany'
for i in range(20790,31185):
days.loc[i,'country']= 'Switzerland'
Thanks in advance!
A more efficient approach is to take the cross product of the three arrays representing Date, Country and SKU:
df = pd.MultiIndex.from_product([
pd.date_range(start='1/1/2019', end='28/02/2022', freq='D'),
['Austria', 'Germany', 'Switzerland'],
[f'SKU {i}' for i in range(9)]
], names=['Date', 'Country', 'SKU']).to_frame(index=False)
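A quick sanity check on the result (df is the frame built above):
assert len(df) == 1155 * 3 * 9   # 31185 rows, one per date/country/SKU combination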
Try days.loc[range(0,10395),'country']= 'Austria'
I am new to data science and your help is appreciated. My question is about grouping a dataframe by its columns so that a bar chart can be plotted showing the status counts for each subject.
My CSV file is something like this:
Name,Maths,Science,English,sports
S1,Pass,Fail,Pass,Pass
S2,Pass,Pass,NA,Pass
S3,Pass,Fail,Pass,Pass
S4,Pass,Pass,Pass,NA
S5,Pass,Fail,Pass,NA
Expected output:
Subject,Status,Count
Maths,Pass,5
Science,Pass,2
Science,Fail,3
English,Pass,4
English,NA,1
Sports,Pass,3
Sports,NA,2
You can do this with pandas; not exactly in the same output format as in the question, but definitely with the same information:
import pandas as pd
# reading csv
df = pd.read_csv("input.csv")
# turning columns into rows
melt_df = pd.melt(df, id_vars=['Name'], value_vars=['Maths', 'Science', 'English', 'sports'], var_name='Subject', value_name='Status')
# filling NaN values, otherwise the below groupby will ignore them.
melt_df = melt_df.fillna("Unknown")
# counting per group of subject and status.
result_df = melt_df.groupby(["Subject", "Status"]).size().reset_index(name="Count")
Then you get the following result:
Subject Status Count
0 English Pass 4
1 English Unknown 1
2 Maths Pass 5
3 Science Fail 3
4 Science Pass 2
5 sports Pass 3
6 sports Unknown 2
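Since the stated goal is a bar chart, one possible way to plot result_df (a sketch, assuming matplotlib is installed) is to pivot the counts so there is one group of bars per subject and one bar per status:
import matplotlib.pyplot as plt
# rows become subjects, columns become statuses, values are the counts
result_df.pivot(index='Subject', columns='Status', values='Count').plot.bar()
plt.show()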
PS: Going forward, always paste the code you've tried so far.
To match exactly your output, this is what you could do:
import pandas as pd
df = pd.read_csv('c:/temp/data.csv') # Or wherever your csv file is
subjects = ['Maths', 'Science' , 'English' , 'sports'] # Or you could get that as df.columns and drop 'Name'
grouped_rows = []
for eachsub in subjects:
rows = df.groupby(eachsub)['Name'].count()
idx = list(rows.index)
if 'Pass' in idx:
grouped_rows.append([eachsub, 'Pass', rows['Pass']])
if 'Fail' in idx:
grouped_rows.append([eachsub, 'Fail', rows['Fail']])
new_df = pd.DataFrame(grouped_rows, columns=['Subject', 'Grade', 'Count'])
print(new_df)
I must suggest, though, that I would avoid the for loop where possible. For a single subject, the counts are already available in two lines:
subjects = ['Maths', 'Science', 'English', 'sports']
grouped_rows = df.groupby(subjects[0])['Name'].count()  # status counts for 'Maths'
Depending on your application, you may already have the data you need in grouped_rows.
I'm trying to loop through a table that contains covid-19 data. My table has 4 columns: month, day, location, and cases. The values of each column are stored in their own list, so all the lists have the same length (i.e. there is a month list, a day list, a location list, and a cases list). There are 12 months, with up to 31 days in a month. Cases are recorded for many locations around the world. I would like to figure out what day of the year had the most total combined global cases. I'm not sure how to structure my loops appropriately. An oversimplified sample version of the table represented by the lists is shown below.
In this small example, the result would be month 1, day 3 with 709 cases (257 + 452).
Month  Day  Location  Cases
1      1    CAN         124
1      1    USA         563
1      2    CAN         242
1      2    USA         156
1      3    CAN         257
1      3    USA         452
...    ...  ...         ...
12     31   ...         ...
I assume that you've put all the data in the same data frame, df.
import pandas as pd

df = pd.DataFrame()
df['Month'] = name_of_your_month_list
df['Day'] = name_of_your_daylist
df['Location'] = name_of_your_location_list
df['Cases'] = name_of_your_cases_list
df.Cases.max() gives you the biggest number of cases. I assume there is only one year in the dataset, so df[df.Cases==df.Cases.max()].index gives you the index you are looking for.
For the day, just filter:
df[df.index==df[df.Cases==df.Cases.max()].index].Day
For the month:
df[df.index==df[df.Cases==df.Cases.max()].index].Month
For the number of cases:
df[df.index==df[df.Cases==df.Cases.max()].index].Cases
For the country:
df[df.index==df[df.Cases==df.Cases.max()].index].Location
Reading the comments, it is not clear whether you want the largest number of cases in a single location or the largest daily total. If it is the daily total, you will have to aggregate first with a groupby, summing the cases per day before taking the maximum.
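A compact way to express that (a sketch, not part of the original answer): sum the cases per month/day first, then take the index of the maximum:
daily_totals = df.groupby(['Month', 'Day'])['Cases'].sum()
month, day = daily_totals.idxmax()          # (1, 3) for the sample table
print(month, day, daily_totals.max())       # 1 3 709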
Group your dataframe by month and day, then iterate through the groups to find the group in which the sum of cases across all locations is largest, as shown below:
import pandas as pd
df = pd.DataFrame({'Month':[1,1,1,1,1,1], 'Day':[1,1,2,2,3,3],
'Location':['CAN', 'USA', 'CAN', 'USA','CAN', 'USA'],
'Cases':[124,563,242,156,257,452]})
grouped = df.groupby(['Month', 'Day'])
max_sum = 0
max_day = None
for idx, group in grouped:
if group['Cases'].sum() > max_sum:
max_sum = group['Cases'].sum()
max_day = group
month = max_day['Month'].iloc[0]  # every row in the winning group shares the same month and day
day = max_day['Day'].iloc[0]
print(f'Maximum cases of {max_sum} occurred on {month}/{day}.')
#prints: Maximum cases of 709 occurred on 1/3
If you don't want to use Pandas, this is how you do it:
months = [1,1,1,1,1,1]
days = [1,1,2,2,3,3]
locations = ['CAN', 'USA', 'CAN', 'USA','CAN', 'USA']
cases = [124,563,242,156,257,452]
dic = {}
for i in range(len(days)):
    key = f'{months[i]}/{days[i]}'            # e.g. '1/3'
    dic[key] = dic.get(key, 0) + cases[i]     # running total of cases for that day
max_cases = max(dic.values())
worst_day = max(dic, key=dic.get)
print(f'Maximum cases of {max_cases} occurred on {worst_day}.')
#Prints: Maximum cases of 709 occurred on 1/3.
You can find the maximum value in your cases list first, then use its index to look up the corresponding values in the other three lists. For example, with caseList = [1, 2, 3, 52, 1, 0] the maximum is 52 and its index is 3, so monthList[3], dayList[3] and locationList[3] give you the month, day and location of the largest entry. Check whether this helps in your scenario.
You may use this strategy to get the required result.
daylist, monthlist, location, Cases = [1, 2, 3, 4], [1, 1, 1, 1], ['CAN', 'USA', 'CAN', 'USA'], [124, 563, 242, 999]
max_index = Cases.index(max(Cases))   # position of the largest single value
print("Max Case:", Cases[max_index])
print("Location:", location[max_index])
print("Month:", monthlist[max_index])
print("Day:", daylist[max_index])
Suppose I have a two-column dataframe where the first column is the ID of a meeting and the second is the ID of one of the participants in that meeting. Like this:
meeting_id,person_id
meeting0,person1234
meeting0,person4321
meeting0,person5555
meeting1,person4321
meeting1,person9999
# ... ~1 million rows
I want to find each person's top 15 co-participants. E.g., I want to know which 15 people most frequently participate in meetings with Brad.
As an intermediate step I wrote a script that takes the original dataframe and makes a person-to-person dataframe, like this:
person1234,person4321
person1234,person5555
person4321,person5555
person4321,person9999
...
But I'm not sure this intermediate step is necessary. Also, it's taking forever to run (by my estimate it should take weeks!). Here's the monstrosity:
import pandas as pd
links = []
lic = pd.read_csv('meetings.csv', sep = ';', names = ['meeting_id', 'person_id'], dtype = {'meeting_id': str, 'person_id': str})
grouped = lic.groupby('person_id')
for i, group in enumerate(grouped):
print(i, 'of', len(grouped))
person_id = group[0].strip()
if len(person_id) == 14:
meetings = set(group[1]['meeting_id'])
for meeting in meetings:
lic_sub = lic[lic['meeting_id'] == meeting]
people = set(lic_sub['person_id'])
for person in people:
if person != person_id:
tup = (person_id, person)
links.append(tup)
df = pd.DataFrame(links)
df.to_csv('links.csv', index = False)
Any ideas?
So here is one way: merge the frame with itself on meeting_id, then sort the two person columns so each pair appears only once per meeting:
import numpy as np

s = df.merge(df, on='meeting_id')
s[['person_id_x', 'person_id_y']] = np.sort(s[['person_id_x', 'person_id_y']].values, axis=1)
s = s.query('person_id_x != person_id_y').drop_duplicates()
s
meeting_id person_id_x person_id_y
1 meeting0 person1234 person4321
2 meeting0 person1234 person5555
5 meeting0 person4321 person5555
10 meeting1 person4321 person9999
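To get from these pairs to each person's top 15 co-participants (a follow-up sketch, not part of the answer above): stack the pairs in both directions, count how often each pair shares a meeting, and keep the 15 largest counts per person:
# every person appears in the first column, with each of their partners in the second
pairs = pd.concat([
    s[['person_id_x', 'person_id_y']],
    s[['person_id_y', 'person_id_x']].rename(columns={'person_id_y': 'person_id_x',
                                                      'person_id_x': 'person_id_y'}),
])
top15 = (pairs.groupby('person_id_x')['person_id_y']
              .value_counts()               # number of meetings shared with each co-participant
              .groupby(level=0)
              .head(15))                    # 15 most frequent partners per person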
I'm new to Pandas and trying to put together training data for a neural net problem.
Essentially, I have 2 DataFrames:
One DataFrame has a column for the primary_key and 3 columns for 3 different positions (sports positions; for this example assume First Base, Second Base, Third Base if you'd like). Each position column holds the player ID for the player in that position.
On a second DataFrame, I have various statistics for each player like Height and Weight.
My ultimate goal is to add columns from the second DataFrame to the first DataFrame so that each position has the associated Height and Weight for a particular player represented as columns. Then, I'm going to export this DataFrame as a csv, arrange the columns in a particular order, and use that for my training data, where each column is a training feature and each row is a training set. I've worked out a solution, but I'm wondering if I'm doing it in the most efficient manner possible, fully utilizing Pandas functions and features.
Here's what my code looks like:
EDIT: I should point out that this is just a simplification of what my code looks like. In reality, my DataFrames are pulled from CSVs, not constructed from dictionaries I created myself.
import pandas as pd
dict_1 = {'primary_key' : ['a', 'b', 'c', 'd'],
'position_1_ID' : ['ida', 'idb', 'idc', 'idd'],
'position_2_ID' : ['ide', 'idb', 'idg', 'idd'],
'position_3_ID' : ['idg', 'idf', 'idc', 'idh']
}
dict_2 = {'position_ID' : ['ida', 'idb', 'idc', 'idd', 'ide', 'idf', 'idg', 'idh'],
'Height' : ['70', '71', '72', '73', '74', '75', '76', '77'],
'Weight' : ['200', '201', '202', '203', '204', '205', '206', '207']
}
positions = pd.DataFrame(dict_1)
players = pd.DataFrame(dict_2)
position_columns = ['position_1_ID', 'position_2_ID', 'position_3_ID']
carry = positions
previous = None
for p in position_columns:
merged = carry.merge(right = players, left_on = p, right_on = 'position_ID', suffixes = [previous, p] )
carry = merged
previous = p
carry.to_csv()
After this code runs, I have a DataFrame which contains the following columns:
'primary_key'
'position_1_ID'
'position_2_ID'
'position_3_ID'
'position_IDposition_1_ID'
'position_IDposition_2_ID'
'position_IDposition_3_ID'
'Heightposition_1_ID'
'Weightposition_1_ID'
'Heightposition_2_ID'
'Weightposition_2_ID'
'Heightposition_3_ID'
'Weightposition_3_ID'
It's not pretty, but this gives me the ability to eventually export a csv with a particular column order, and it doesn't take a prohibitively long time to produce the DataFrame.
That being said, I'm doing this project partially to learn Pandas. I would like to see if there are cleaner ways to do this.
Thanks!
You can use melt, merge and unstack:
df_out = positions.melt('primary_key')\
.merge(players, left_on='value', right_on='position_ID')\
.set_index(['primary_key','variable'])\
.drop('value', axis=1)\
.unstack()
df_out.columns = [f'{i}{j}' if i != 'position_ID' else f'{i}' for i,j in df_out.columns]
print(df_out)
Output:
position_ID position_ID position_ID Heightposition_1_ID Heightposition_2_ID Heightposition_3_ID Weightposition_1_ID Weightposition_2_ID Weightposition_3_ID
primary_key
a ida ide idg 70 74 76 200 204 206
b idb idb idf 71 71 75 201 201 205
c idc idg idc 72 76 72 202 206 202
d idd idd idh 73 73 77 203 203 207
height_dict = {k:v for k, v in zip(dict_2['position_ID'], dict_2['Height'])}
weight_dict = {k:v for k, v in zip(dict_2['position_ID'], dict_2['Weight'])}
positions = pd.DataFrame(dict_1)
positions['p1_height'] = positions['position_1_ID'].map(height_dict)
Repeat similar steps for all three position IDs, for both height and weight. You can loop instead of writing the repeated steps, as sketched below.
Hope this helps.
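That loop might look like this (a sketch; the p{i}_height / p{i}_weight column names are just illustrative):
for i in (1, 2, 3):
    positions[f'p{i}_height'] = positions[f'position_{i}_ID'].map(height_dict)
    positions[f'p{i}_weight'] = positions[f'position_{i}_ID'].map(weight_dict)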
positions.to_csv()