I'm new to Pandas and trying to put together training data for a neural net problem.
Essentially, I have 2 DataFrames:
One DataFrame has a column for the primary_key and 3 columns for 3 different positions (sports positions; for this example, assume First Base, Second Base, and Third Base if you'd like). Each position column holds the player ID of the player in that position.
On a second DataFrame, I have various statistics for each player like Height and Weight.
My ultimate goal is to add columns from the second DataFrame to the first DataFrame so that each position has the associated Height and Weight for a particular player represented as columns. Then I'm going to export this DataFrame as a CSV, arrange the columns in a particular order, and use that as my training data, where each column is a training feature and each row is a training example. I've worked out a solution, but I'm wondering whether I'm doing it in the most efficient manner possible, fully utilizing Pandas functions and features.
Here's what my code looks like:
EDIT: I should point out that this is just a simplification of what my code looks like. In reality, my DataFrames are pulled from CSVs, not constructed from dictionaries I created myself.
import pandas as pd
dict_1 = {'primary_key': ['a', 'b', 'c', 'd'],
          'position_1_ID': ['ida', 'idb', 'idc', 'idd'],
          'position_2_ID': ['ide', 'idb', 'idg', 'idd'],
          'position_3_ID': ['idg', 'idf', 'idc', 'idh']
          }
dict_2 = {'position_ID': ['ida', 'idb', 'idc', 'idd', 'ide', 'idf', 'idg', 'idh'],
          'Height': ['70', '71', '72', '73', '74', '75', '76', '77'],
          'Weight': ['200', '201', '202', '203', '204', '205', '206', '207']
          }
positions = pd.DataFrame(dict_1)
players = pd.DataFrame(dict_2)
position_columns = ['position_1_ID', 'position_2_ID', 'position_3_ID']
carry = positions
previous = None
for p in position_columns:
    merged = carry.merge(right=players, left_on=p, right_on='position_ID', suffixes=[previous, p])
    carry = merged
    previous = p
carry.to_csv()
After this code runs, I have a DataFrame which contains the following columns:
'primary_key'
'position_1_ID'
'position_2_ID'
'position_3_ID'
'position_IDposition_1_ID'
'position_IDposition_2_ID'
'position_IDposition_3_ID'
'Heightposition_1_ID'
'Weightposition_1_ID'
'Heightposition_2_ID'
'Weightposition_2_ID'
'Heightposition_3_ID'
'Weightposition_3_ID'
It's not pretty, but this gives me the ability to eventually export a csv with a particular column order, and it doesn't take a prohibitively long time to produce the DataFrame.
That being said, I'm doing this project partially to learn Pandas. I would like to see if there are cleaner ways to do this.
Thanks!
You can use melt, merge and unstack:
df_out = positions.melt('primary_key')\
                  .merge(players, left_on='value', right_on='position_ID')\
                  .set_index(['primary_key', 'variable'])\
                  .drop('value', axis=1)\
                  .unstack()
df_out.columns = [f'{i}{j}' if i != 'position_ID' else f'{i}' for i, j in df_out.columns]
print(df_out)
Output:
position_ID position_ID position_ID Heightposition_1_ID Heightposition_2_ID Heightposition_3_ID Weightposition_1_ID Weightposition_2_ID Weightposition_3_ID
primary_key
a ida ide idg 70 74 76 200 204 206
b idb idb idf 71 71 75 201 201 205
c idc idg idc 72 76 72 202 206 202
d idd idd idh 73 73 77 203 203 207
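One small follow-up (my addition, not part of the original answer): Height and Weight came in as strings, so you may want to cast them to numbers before using them as training features:

num_cols = [c for c in df_out.columns if c.startswith(('Height', 'Weight'))]
df_out[num_cols] = df_out[num_cols].astype(int)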
height_dict = dict(zip(dict_2['position_ID'], dict_2['Height']))
weight_dict = dict(zip(dict_2['position_ID'], dict_2['Weight']))
positions = pd.DataFrame(dict_1)
positions['p1_height'] = positions['position_1_ID'].map(height_dict)
Apply similar steps for all three IDs, for both height and weight. You can loop instead of writing out the repeated steps; see the sketch below.
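A minimal sketch of that loop (not from the original answer; the p1_height-style column names are my own):

stat_maps = {'height': height_dict, 'weight': weight_dict}
for i, id_col in enumerate(['position_1_ID', 'position_2_ID', 'position_3_ID'], start=1):
    for stat, mapping in stat_maps.items():
        positions[f'p{i}_{stat}'] = positions[id_col].map(mapping)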
positions.to_csv()
Hope this helps.
I am trying to implement some logic where I have two DataFrames, say A (left) and B (right).
I need to find the matching rows of A in B (I understand this can be done via an "inner" join). My use case needs all rows from DataFrame A plus the ID column from the matched record in B (I understand this can be done via select). But here the problem arises: after the inner join, the returned DataFrame also contains the other columns from DataFrame B, which I don't need, while if I use leftsemi, I can't fetch the ID column from B at all.
For example:
def fetch_match_condition():
    return ["EMAIL", "PHONE1"]

match_condition = fetch_match_condition()

from pyspark.sql.functions import sha2, concat_ws

schema_col = ["FNAME", "LNAME", "EMAIL", "PHONE1", "ADD1"]
schema_col_master = schema_col.copy()
schema_col_master.append("ID")

data_member = [["ALAN", "AARDVARK", "lbrennan.kei#malchikzer.gq", "7346281938", "4176 BARRINGTON CT"],
               ["DAVID", "DINGO", "ocoucoumamai#hatberkshire.com", "7362537492", "2622 MONROE AVE"],
               ["MICHAL", "MARESOVA", "annasimkova#chello.cz", "435261789", "FRANTISKA CERNEHO 61 623 99 UTERY"],
               ["JAKUB", "FILIP", "qcermak#email.czanek", "8653827956", "KOHOUTOVYCH 93 602 42 MLADA BOLESLAV"]]
master_df = spark.createDataFrame(data_member, schema_col)
master_df = master_df.withColumn("ID", sha2(concat_ws("||", *master_df.columns), 256))

test_member = [["ALAN", "AARDVARK", "lbren.kei#malchik.gq", "7346281938", "4176 BARRINGTON CT"],
               ["DAVID", "DINGO", "ocoucoumamai#hatberkshire.com", "99997492", "2622 MONROE AVE"],
               ["MICHAL", "MARESOVA", "annasimkova#chello.cz", "435261789", "FRANTISKA CERNEHO 61 623 99 UTERY"],
               ["JAKUB", "FILIP", "qcermak#email.czanek", "87463829", "KOHOUTOVYCH 93 602 42 MLADA BOLESLAV"]]
test_member_1 = [["DAVID", "DINGO", "ocoucoumamai#hatberkshire.com", "7362537492", "2622 MONROE AVE"],
                 ["MICHAL", "MARESOVA", "annasimkova#chello.cz", "435261789", "FRANTISKA CERNEHO 61 623 99 UTERY"]]
test_df = spark.createDataFrame(test_member, schema_col)
test_df_1 = spark.createDataFrame(test_member_1, schema_col)
matched_df = test_df.join(master_df, match_condition, "inner").select(test_df["*"], master_df["ID"])
master_df = master_df.union(matched_df.select(schema_col_master))
# Here only the second-to-last record will match and get added back to the master, with the same ID it had in master_df
matched_df = test_df_1.join(master_df, match_condition, "inner").select(test_df_1["*"], master_df["ID"])
# Here the problem arises: I need the match count to be only 2, since test_df_1 has only two records, but it comes out as 3, since the record added back to master_df gets selected as well.
PS: Is there a way I can take leftsemi and then use withColumn and a UDF which fetches this ID based on the row values of the leftsemi DataFrame from DataFrame B?
Can anyone propose a solution for this?
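One possible direction (my own sketch, not from the thread): select only the match columns plus ID from master_df before joining, so no other B columns come back, and drop duplicate master rows so repeated entries cannot inflate the match count:

# Hypothetical sketch: keep only the join keys and ID from B, de-duplicated on the keys
lookup = master_df.select(match_condition + ["ID"]).dropDuplicates(match_condition)
# All columns of test_df_1, plus the matched ID from master_df
matched_df = test_df_1.join(lookup, match_condition, "inner")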
I am new to data science; your help is appreciated. My question is about grouping a DataFrame by its columns so that a bar chart can be plotted based on each subject's status.
My CSV file is something like this:
Name,Maths,Science,English,sports
S1,Pass,Fail,Pass,Pass
S2,Pass,Pass,NA,Pass
S3,Pass,Fail,Pass,Pass
S4,Pass,Pass,Pass,NA
S5,Pass,Fail,Pass,NA
Expected output:
Subject,Status,Count
Maths,Pass,5
Science,Pass,2
Science,Fail,3
English,Pass,4
English,NA,1
Sports,Pass,3
Sports,NA,2
You can do this with pandas. It's not exactly the same output format as in the question, but it definitely carries the same information:
import pandas as pd
# reading csv
df = pd.read_csv("input.csv")
# turning columns into rows
melt_df = pd.melt(df, id_vars=['Name'], value_vars=['Maths', 'Science', "English", "sports"], var_name="Subject", value_name="Status")
# filling NaN values, otherwise the below groupby will ignore them.
melt_df = melt_df.fillna("Unknown")
# counting per group of subject and status.
result_df = melt_df.groupby(["Subject", "Status"]).size().reset_index(name="Count")
Then you get the following result:
Subject Status Count
0 English Pass 4
1 English Unknown 1
2 Maths Pass 5
3 Science Fail 3
4 Science Pass 2
5 sports Pass 3
6 sports Unknown 2
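Since the end goal is a bar chart, here is a minimal plotting sketch on top of result_df (my addition; assumes matplotlib is installed):

import matplotlib.pyplot as plt

# one group of bars per subject, one bar per status
result_df.pivot(index='Subject', columns='Status', values='Count').plot(kind='bar')
plt.ylabel('Count')
plt.show()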
PS: Going forward, always paste the code you've tried so far.
To match exactly your output, this is what you could do:
import pandas as pd

df = pd.read_csv('c:/temp/data.csv', keep_default_na=False)  # Or wherever your csv file is; keep_default_na preserves 'NA' as a string
subjects = ['Maths', 'Science', 'English', 'sports']  # Or you could get that as df.columns and drop 'Name'
grouped_rows = []
for eachsub in subjects:
    rows = df.groupby(eachsub)['Name'].count()
    idx = list(rows.index)
    if 'Pass' in idx:
        grouped_rows.append([eachsub, 'Pass', rows['Pass']])
    if 'Fail' in idx:
        grouped_rows.append([eachsub, 'Fail', rows['Fail']])
    if 'NA' in idx:
        grouped_rows.append([eachsub, 'NA', rows['NA']])
new_df = pd.DataFrame(grouped_rows, columns=['Subject', 'Status', 'Count'])
print(new_df)
I would avoid the for loop, though. My approach would be just these two lines:
subjects = ['Maths', 'Science', 'English', 'sports']
grouped_rows = df[subjects].apply(pd.Series.value_counts)
Depending on your application, grouped_rows already holds the data you need: a table of counts with one column per subject and one row per status.
I have been playing around with a dataset about football, and I need to group my ['position'] column by its values and assign the groups to new variables.
First, here is my dataframe
df = player_stats[['id', 'player', 'date', 'team_name', 'fixture_name', 'position', 'shots', 'shots_on_target', 'xg',
                   'xa', 'attacking_pen_area_touches', 'penalty_area_entry_passes',
                   'carries_total_distance', 'total_distance_passed', 'aerial_sucess_perc',
                   'passes_attempted', 'passes_completed', 'short_pass_accuracy_perc', 'medium_pass_accuracy_perc',
                   'long_pass_accuracy_perc', 'final_third_entry_passes', 'carries_total_distance', 'ball_recoveries',
                   'total_distance_passed', 'dribbles_completed', 'dribbles_attempted', 'touches',
                   'tackles_won', 'tackles_attempted']]
I have split my ['position'] column, as it had multiple string values, and added the result to a column called ['position_new'].
position_new
AM 277
CB 938
CM 534
DF 7
DM 604
FW 766
GK 389
LB 296
LM 149
LW 284
MF 5
RB 300
RM 160
RW 323
WB 275
What I need, is basically to have 3 different variables who have all the same columns, but are separated by the value in the position_new. Look at the below scheme:
So: my variable att needs to have all the columns of df, but only the rows whose position_new value is equal to FW, LW, or RW.
I know how to hardcode it, but cannot get my head around how to transform it into a for loop.
Here is my loop:
for col in df[29:30]:
    if df.loc[df['position_new'] == 'FW', 'LW', 'RW']:
        att = df
    elif df.loc[df['position_new'] == 'AM', 'CM', 'DM', 'LM', 'RM']:
        mid = df
    else:
        defender = df
Thank you!
I'm not sure what you are trying to do, but it looks like you want to split the positions into attackers, midfielders, and defenders, based on their two-letter abbreviations, and put each group in a separate variable.
What you are doing is not optimal because it won't work on any generic data frame with this type of info.
But if you want to do it for just this case, you are simply missing the correct membership test in your loop; isin does the job. Try:
att = df.loc[df['position_new'].isin(['FW', 'LW', 'RW'])]
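Extending the same idea to all three variables (a sketch of my own; the group lists mirror the question):

groups = {
    'att': ['FW', 'LW', 'RW'],
    'mid': ['AM', 'CM', 'DM', 'LM', 'RM'],
}
frames = {name: df.loc[df['position_new'].isin(codes)] for name, codes in groups.items()}
# everything not an attacker or midfielder counts as a defender
frames['defender'] = df.loc[~df['position_new'].isin(groups['att'] + groups['mid'])]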
I have a DataFrame containing customers' ratings of the restaurants they went to, plus a few other attributes.
What I want to do is calculate the difference between the average star rating for the last year and the average star rating for the first year of a restaurant.
import numpy as np
import pandas as pd

data = {'rating_id': ['1', '2', '3', '4', '5', '6', '7'],
        'user_id': ['56', '13', '56', '99', '99', '13', '12'],
        'restaurant_id': ['xxx', 'xxx', 'yyy', 'yyy', 'xxx', 'zzz', 'zzz'],
        'star_rating': ['2.3', '3.7', '1.2', '5.0', '1.0', '3.2', '1.0'],
        'rating_year': ['2012', '2012', '2020', '2001', '2020', '2015', '2000'],
        'first_year': ['2012', '2012', '2001', '2001', '2012', '2000', '2000'],
        'last_year': ['2020', '2020', '2020', '2020', '2020', '2015', '2015'],
        }
df = pd.DataFrame(data, columns=['rating_id', 'user_id', 'restaurant_id', 'star_rating', 'rating_year', 'first_year', 'last_year'])
df.head()
df.head()
df['star_rating'] = df['star_rating'].astype(float)
# calculate the average of the stars of the first year
ratings_mean_firstYear= df.groupby(['restaurant_id','first_year']).agg({'star_rating':[np.mean]})
ratings_mean_firstYear.columns = ['avg_firstYear']
ratings_mean_firstYear.reset_index()
# calculate the average of the stars of the last year
ratings_mean_lastYear= df.groupby(['restaurant_id','last_year']).agg({'star_rating':[np.mean]})
ratings_mean_lastYear.columns = ['avg_lastYear']
ratings_mean_lastYear.reset_index()
# merge the means into a single table
ratings_average = ratings_mean_firstYear.merge(
    ratings_mean_lastYear.groupby('restaurant_id')['avg_lastYear'].max(),
    on='restaurant_id'
)
ratings_average.head(20)
My problem is that the averages for the first and last years come out exactly the same, which makes no sense. I don't really know what I did wrong in my thought process; I suspect something is going on with .agg, since it's the first time I've used the pandas library.
Any suggestions?
Your data has a single rating per user/restaurant pair, and every rating is used in both the first-year and the last-year aggregation, so naturally the two averages are equal. I'd first filter the data on the rating_year == first_year criterion and then apply groupby and agg, then repeat the same for the last year and merge the two results. Showing this properly requires more data than your sample; I assume you have it in your larger DataFrame.
Here is an example; I added more rows and changed the years to get more matches:
data = {'rating_id': ['1', '2', '3', '4', '5', '6', '7', '8', '9'],
        'user_id': ['56', '56', '56', '56', '99', '99', '99', '99', '99'],
        'restaurant_id': ['xxx', 'xxx', 'yyy', 'yyy', 'xxx', 'xxx', 'yyy', 'yyy', 'xxx'],
        'star_rating': ['2.3', '3.7', '1.2', '5.0', '1.0', '3.2', '4.0', '2.5', '3.0'],
        'rating_year': ['2012', '2020', '2001', '2020', '2012', '2020', '2001', '2020', '2019'],
        'first_year': ['2012', '2012', '2001', '2001', '2012', '2012', '2001', '2001', '2012'],
        'last_year': ['2020', '2020', '2020', '2020', '2020', '2020', '2020', '2020', '2020'],
        }
df = pd.DataFrame(data, columns=['rating_id', 'user_id', 'restaurant_id', 'star_rating', 'rating_year', 'first_year', 'last_year'])
df['star_rating'] = df['star_rating'].astype(float)
ratings_mean_firstYear = df[df.rating_year == df.first_year].groupby('restaurant_id').agg({'star_rating':'mean'})
ratings_mean_firstYear.columns = ['avg_firstYear']
ratings_mean_lastYear= df[df.rating_year == df.last_year].groupby('restaurant_id').agg({'star_rating':'mean'})
ratings_mean_lastYear.columns = ['avg_lastYear']
result:
ratings_mean_firstYear.merge(ratings_mean_lastYear, left_index=True, right_index=True)
avg_firstYear avg_lastYear
restaurant_id
xxx 1.65 3.45
yyy 2.60 3.75
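Since the original goal was the difference between the two averages, a short follow-up on top of this result (my addition):

ratings_average = ratings_mean_firstYear.merge(ratings_mean_lastYear, left_index=True, right_index=True)
ratings_average['rating_change'] = ratings_average['avg_lastYear'] - ratings_average['avg_firstYear']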
I have a pandas DataFrame with a format exactly like the one in this question and I'm trying to achieve the same result. In my case, I am calculating the fuzz ratio between the row's index and its corresponding column.
If I try this code (based on the answer to the linked question)
def get_similarities(x):
    return x.index + x.name
test_df = test_df.apply(get_similarities)
the concatenation of the row index and col name happens cell-wise, just as intended. Running type(test_df) returns pandas.core.frame.DataFrame, as expected.
However, if I adapt the code to my scenario like so
def get_similarities(x):
    return fuzz.partial_ratio(x.index, x.name)
test_df = test_df.apply(get_similarities)
it doesn't work: instead of a DataFrame, I get back a Series (the return type of that function is an int).
I don't understand why the two samples behave differently, nor how to fix my code so it returns a DataFrame with the fuzz ratio for each cell, computed between that cell's row index and column name.
What about the following approach?
Assuming that we have two lists of strings:
In [245]: set1
Out[245]: ['car', 'bike', 'sidewalk', 'eatery']
In [246]: set2
Out[246]: ['walking', 'caring', 'biking', 'eating']
Solution:
In [247]: from itertools import product
In [248]: res = np.array([fuzz.partial_ratio(*tup) for tup in product(set1, set2)])
In [249]: res = pd.DataFrame(res.reshape(len(set1), -1), index=set1, columns=set2)
In [250]: res
Out[250]:
walking caring biking eating
car 33 100 0 33
bike 25 25 75 25
sidewalk 73 20 22 36
eatery 17 33 0 50
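If you already have test_df, its labels can feed the same recipe (my own variation, assuming the same np/pd/fuzz imports as above):

res = np.array([fuzz.partial_ratio(i, c) for i in test_df.index for c in test_df.columns])
res = pd.DataFrame(res.reshape(len(test_df.index), -1), index=test_df.index, columns=test_df.columns)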
There is a way to accomplish this via DataFrame.apply with some row manipulations.
Assuming the test_df is as follows:
In [73]: test_df
Out[73]:
walking caring biking eating
car carwalking carcaring carbiking careating
bike bikewalking bikecaring bikebiking bikeeating
sidewalk sidewalkwalking sidewalkcaring sidewalkbiking sidewalkeating
eatery eaterywalking eaterycaring eaterybiking eateryeating
In [74]: def get_ratio(row):
...: return row.index.to_series().apply(lambda x: fuzz.partial_ratio(x,
...: row.name))
...:
In [75]: test_df.apply(get_ratio)
Out[75]:
walking caring biking eating
car 33 100 0 33
bike 25 25 75 25
sidewalk 73 20 22 36
eatery 17 33 0 50
It took some digging, but I figured it out. The problem comes from the fact that DataFrame.apply is either applied column-wise or row-wise, not cell by cell. So your get_similarities function is actually getting access to an entire row or column of data at a time! By default it gets the entire column -- so to solve your problem, you just have to make a get_similarities function that returns a list where you manually call fuzz.partial_ratio on each element, like this:
import pandas as pd
from fuzzywuzzy import fuzz

def get_similarities(x):
    l = []
    for rname in x.index:
        print("Getting ratio for %s and %s" % (rname, x.name))
        score = fuzz.partial_ratio(rname, x.name)
        print("Score %s" % score)
        l.append(score)
    print(len(l))
    print()
    return l

a = pd.DataFrame([[1, 2], [3, 4]], index=['apple', 'banana'], columns=['aple', 'banada'])
c = a.apply(get_similarities, axis=0)
print(c)
print(type(c))
I left my print statements in there so you can see for yourself what the DataFrame.apply call is doing -- that's when it clicked for me.