Consider the following DataFrame:
import pandas as pd

candy = pd.DataFrame({'Name': ['Bob', 'Bob', 'Bob', 'Annie', 'Annie', 'Annie', 'Daniel', 'Daniel', 'Daniel'],
                      'Candy': ['Chocolate', 'Chocolate', 'Lollies', 'Chocolate', 'Chocolate', 'Lollies', 'Chocolate', 'Chocolate', 'Lollies'],
                      'Value': [15, 15, 10, 25, 30, 12, 40, 40, 16]})
After reading the following post, I am aware that within a groupby, apply works on the whole DataFrame while transform works on a single Series.
Apply vs transform on a group object
So if I want to append the total $ spend on candy per person, I can simply use the following.
candy['Total Spend'] = candy.groupby(['Name'])['Value'].transform('sum')
But if I need to append the total $ chocolate spend per person, it feels like I have no choice but to create a separate dataframe with apply and then merge it back, since transform only works on a series.
chocolate = candy.groupby(['Name']).apply(lambda x: x[x['Candy'] == 'Chocolate']['Value'].sum()).reset_index(name='Total_Chocolate_Spend')
candy = pd.merge(candy, chocolate, how='left', on='Name')
While I don't mind writing the above code to solve this problem, is it possible to 'transform' the applied results back onto the dataframe without having to create a separate dataframe and merge it?
What is actually happening when the transform function is used? Is a separate series being stored in memory and then merged back by the indexes, similar to my apply-then-merge approach?
There are other methods. For example:
Create a temp column with just the chocolate value using df.where:
candy["choc_val"] = candy.Value.where(candy.Candy =="Chocolate", other=0)
candy["Total_Chocolate_Spend"] = candy.groupby("Name").choc_val.transform(sum)
candy = candy.drop(columns="choc_val")
Output:
Name Candy Value Total Spend Total_Chocolate_Spend
0 Bob Chocolate 15 40 30
1 Bob Chocolate 15 40 30
2 Bob Lollies 10 40 30
3 Annie Chocolate 25 67 55
4 Annie Chocolate 30 67 55
5 Annie Lollies 12 67 55
6 Daniel Chocolate 40 96 80
7 Daniel Chocolate 40 96 80
8 Daniel Lollies 16 96 80
I don't know if this is more performant or easier to read.
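As a side note, the temp column can be avoided entirely by doing the masking inline; a minimal sketch, assuming a reasonably recent pandas:
# Mask non-chocolate values to 0, then group the masked Series by Name
# and broadcast each person's sum back to every row.
candy['Total_Chocolate_Spend'] = (
    candy['Value']
    .where(candy['Candy'] == 'Chocolate', 0)
    .groupby(candy['Name'])
    .transform('sum')
)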
I do not have much to add to the excellent reference you provided on apply vs. transform, but you can do what you want without creating a separate dataframe. For example, you can do:
candy.groupby(['Name'], group_keys=False).apply(lambda x: x.assign(Total_Chocolate_Spend=x[x['Candy'] == 'Chocolate']['Value'].sum()))
This uses assign on each group in the groupby to populate Total_Chocolate_Spend with the number you want (on newer pandas versions, group_keys=False keeps the original index instead of prepending the group key).
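As for your second question about what transform is actually doing: conceptually, it computes one aggregate per group and broadcasts it back onto the caller's index, which is why the result aligns with the original dataframe and no explicit merge is needed. A rough sketch of the equivalent steps (conceptual, not the literal implementation):
# One value per group...
per_person = candy.groupby('Name')['Value'].sum()
# ...broadcast back onto the original rows via the group labels.
broadcast = candy['Name'].map(per_person)
assert (broadcast == candy.groupby('Name')['Value'].transform('sum')).all()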
Related
I am working with a large dataset which I've stored in a pandas dataframe. All of the methods I've written to operate on this dataset work on dataframes, but some of them don't work on GroupBy objects.
I've come to a point in my code where I would like to group all data by author name (which I was able to achieve easily via .groupby()). Unfortunately, this outputs a GroupBy object, which isn't very useful to me when I want to use dataframe-only methods.
I've searched tons of other posts but haven't found a satisfying answer: how do I convert this GroupBy object back into a DataFrame? (Note: it is much too large for me to manually select groups and concatenate them into a dataframe; I need something automated.)
Not exactly sure I understand, so if this isn't what you are looking for, please comment.
Creating a dataframe:
df = pd.DataFrame({'author':['gatsby', 'king', 'michener', 'michener','king','king', 'tolkein', 'gatsby'], 'b':range(13,21)})
author b
0 gatsby 13
1 king 14
2 michener 15
3 michener 16
4 king 17
5 king 18
6 tolkein 19
7 gatsby 20
#create the groupby object
dfg = df.groupby('author')
In [44]: dfg
Out[44]: <pandas.core.groupby.generic.DataFrameGroupBy object at 0x000002169D24DB20>
#show groupby works using count()
dfg.count()
b
author
gatsby 2
king 3
michener 2
tolkein 1
But I think this is what you want: how to revert dfg back to a dataframe. You just need to apply a function to it that doesn't change the data. This is one way:
df_reverted = dfg.apply(lambda x: x)
author b
0 gatsby 13
1 king 14
2 michener 15
3 michener 16
4 king 17
5 king 18
6 tolkein 19
7 gatsby 20
This is another way and may be faster; note the dataframe names df and dfg.
df[dfg['b'].transform('count') > 0]
It tests each group's count against zero (so every row passes), returning a boolean series that is used to filter the original dataframe, df.
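If you just need the flat dataframe back and don't want to rely on apply, another sketch is to concatenate the groups directly:
# Iterating over a GroupBy yields (group_key, sub_dataframe) pairs;
# concatenating the sub-frames rebuilds the original rows, and
# sort_index() restores the original row order.
df_reverted2 = pd.concat(group for _, group in dfg).sort_index()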
I have a dataframe like this:
brand1 brand2 brand3
survey 22 33 12
clothes 19 22 19
shoes 34 12 15
What I'd like to do is count how many clothes and how many shoes I have in total, without taking the brand categories into consideration. I'm not sure how to do this since "survey" is not a column.
I basically want this:
survey
clothes 100
shoes 100
Any advice would be helpful.
Try
df.sum(axis=1)
This gives the sum of values across each row; to display the result, you can build a dictionary with the survey row names as keys and df.sum's values as values (perhaps after storing them in a list).
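A minimal sketch of that, assuming clothes and shoes are row labels in the dataframe's index:
# Sum across the brand columns for each row...
row_totals = df.sum(axis=1)
# ...and present the result as a one-column frame named 'survey'.
result = row_totals.to_frame('survey')
print(result.loc[['clothes', 'shoes']])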
Currently, I'm using pandas DataFrame.filter to filter the records of the dataset. If I give a word, I get all the records that match that word. But if I give two words that are present in the dataset yet not in the same record, I get an empty set. Is there any way, in pandas or another Python module, to search for multiple words that are not in one record?
With a Python list comprehension, we can build a full-text search by mapping. Pandas DataFrame.filter uses indexing. Is there any difference between mapping and indexing? If yes, what is it, and which gives better performance?
CustomerID Genre Age AnnualIncome (k$) SpendingScore (1-100)
1 Male 19 15 39
2 Male 21 15 81
3 Female 20 16 6
4 Female 23 16 77
5 Female 31 17 40
pokemon[pokemon['CustomerID'].isin(['200','5'])]
Output:
CustomerID Genre Age AnnualIncome (k$) SpendingScore (1-100)
5 Female 31 17 40
200 Male 30 137 83
Name Qty.
0 Apple 3
1 Orange 4
2 Cake 5
Considering the above dataframe, if you want to find quantities of Apples and Oranges, you can do it like this:
result = df[df['Name'].isin(['Apple','Orange'])]
print (result)
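For the multi-word, full-text part of the question, one sketch is to build a regex alternation and use str.contains, so a row matches if it contains any of the words (reusing the Name column here; swap in whatever text column you actually search):
import re

words = ['Apple', 'Orange']
# Escape each word and join with '|' so the regex matches any of them.
pattern = '|'.join(map(re.escape, words))
result = df[df['Name'].str.contains(pattern, case=False, na=False)]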
I want to take a large dataframe with around 26,000 rows, foodList, and multiply the column foodList['food_quant'] by a value looked up from the dataframe foodConversions. The lookup key is another column, foodList['food_name'], whose string values correspond to the index of foodConversions. I am doing this to convert grams of different foods to calories, since each food type has a different number of calories per gram.
I've tried nested loops that go through every value in foodConversions and check whether it equals foodList['food_name'], but that's super slow and never actually finishes running; hence, I would prefer to move away from this method.
I have also tried using applymap and a lambda function, but I don't think I've done this right.
Lastly, I've tried to use the methods outlined in another stackoverflow problem, but I wasn't sure how to apply it to my situation or if it even works for my situation. Here's the link to it: Multiply dataframe with values from other dataframe
Here are the two dataframes:
foodConversions = pd.DataFrame([2, 3], index=['meat', 'vegetables'], columns=['cal/gram'])
cal/gram
meat 2
vegetables 3
foodList = pd.DataFrame([['meat', 40], ['meat', 30], ['vegetables', 20], ['meat', 10]], columns=['food_name', 'food_quant'])
food_name food_quant
0 meat 40
1 meat 30
2 vegetables 20
3 meat 10
And the output should look like:
food_name food_quant
0 meat 80
1 meat 60
2 vegetables 60
3 meat 20
Hopefully that made sense; I tried to be as thorough as possible, so I'm sorry for the lengthy explanation. Thanks everyone for your help!
We can do this with reindex, loc, map, or merge.
reindex|loc
# df1 is foodConversions, df2 is foodList
df2.assign(food_quant=df2.food_quant * df1['cal/gram'].reindex(df2.food_name).values)  # change reindex to loc
Out[121]:
food_name food_quant
0 meat 80
1 meat 60
2 vegetables 60
3 meat 20
map|replace
df2.assign(food_quant=df2.food_quant*df2.food_name.map(df1['cal/gram']))
df2.assign(food_quant=df2.food_quant*df2.food_name.replace(df1['cal/gram']))
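And the merge variant mentioned above, as a sketch: join the conversion table on food_name, multiply, then drop the helper column.
# df1 has food names as its index, so join df2.food_name against it.
merged = df2.merge(df1, left_on='food_name', right_index=True, how='left')
merged['food_quant'] = merged['food_quant'] * merged['cal/gram']
merged = merged.drop(columns='cal/gram')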
Try using:
print(foodList.set_index('food_name')
              .mul(foodConversions.reindex(foodList['food_name'])['cal/gram'], axis=0)
              .reset_index())
Output:
food_name food_quant
0 meat 80
1 meat 60
2 vegetables 60
3 meat 20
I have the following pandas dataframe with 50,000 unique rows and 20 columns (included is a snippet of the relevant columns):
df1:
PRODUCT_ID PRODUCT_DESCRIPTION
0 165985858958 "Fish Burger with Lettuce"
1 185965653252 "Chicken Salad with Dressing"
2 165958565556 "Pork and Honey Rissoles"
3 655262522233 "Cheese, Ham and Tomato Sandwich"
4 857485966653 "Coleslaw with Yoghurt Dressing"
5 524156285551 "Lemon and Raspberry Cheesecake"
I also have the following dataframe (which I also have saved in dictionary form) which has 2 columns and 20,000 unique rows:
df2 (also saved as dict_2)
PROD_ID PROD_DESCRIPTION
0 548576 "Fish Burger"
1 156956 "Chckn Salad w/Ranch Dressing"
2 257848 "Rissoles - Lamb & Rosemary"
3 298770 "Lemn C-cake"
4 651452 "Potato Salad with Bacon"
5 100256 "Cheese Cake - Lemon Raspberry Coulis"
What I am wanting to do is compare the "PRODUCT_DESCRIPTION" field in df1 to the "PROD_DESCRIPTION" field in df2 and find the closest match/matches to help with the heavy lifting. I would then need to manually check the matches, but it would be a lot quicker. The ideal outcome would look like this, e.g. with one or more part matches noted:
PRODUCT_ID PRODUCT_DESCRIPTION PROD_ID PROD_DESCRIPTION
0 165985858958 "Fish Burger with Lettuce" 548576 "Fish Burger"
1 185965653252 "Chicken Salad with Dressing" 156956 "Chckn Salad w/Ranch Dressing"
2 165958565556 "Pork and Honey Rissoles" 257848 "Rissoles - Lamb & Rosemary"
3 655262522233 "Cheese, Ham and Tomato Sandwich" NaN NaN
4 857485966653 "Coleslaw with Yoghurt Dressing" NaN NaN
5 524156285551 "Lemon and Raspberry Cheesecake" 298770 "Lemn C-cake"
6 524156285551 "Lemon and Raspberry Cheesecake" 100256 "Cheese Cake - Lemon Raspberry Coulis"
I have already completed a join which has identified the exact matches. It's not important that the index is retained, as the Product IDs in each df are unique. The results can also be saved into a new dataframe, as this will then be applied to a third dataframe that has around 14 million rows.
I've used the following questions and answers (amongst others):
Is it possible to do fuzzy match merge with python pandas
Fuzzy merge match with duplicates including trying jellyfish module as suggested in one of the answers
Python fuzzy matching fuzzywuzzy keep only the best match
Fuzzy match items in a column of an array
and also various loops/functions/mapping etc. but have had no success, either getting the first "fuzzy match" which has a low score or no matches being detected.
I like the idea of a matching/distance score column being generated as per here as it would then allow me to speed up the manual checking process.
I'm using Python 2.7, pandas and have fuzzywuzzy installed.
Using fuzz.ratio as my distance metric, I calculate my distance matrix like this:
import numpy as np
from fuzzywuzzy import fuzz

df3 = pd.DataFrame(index=df.index, columns=df2.index)
for i in df3.index:
    for j in df3.columns:
        vi = df.get_value(i, 'PRODUCT_DESCRIPTION')
        vj = df2.get_value(j, 'PROD_DESCRIPTION')
        df3.set_value(i, j, fuzz.ratio(vi, vj))

print(df3)
0 1 2 3 4 5
0 63 15 24 23 34 27
1 26 84 19 21 52 32
2 18 31 33 12 35 34
3 10 31 35 10 41 42
4 29 52 32 10 42 12
5 15 28 21 49 8 55
Set a threshold for acceptable distance; I set 50. Then find the index value (for df2) that has the maximum value for every row.
threshold = df3.max(1) > 50
idxmax = df3.idxmax(1)
Make assignments
df['PROD_ID'] = np.where(threshold, df2.loc[idxmax, 'PROD_ID'].values, np.nan)
df['PROD_DESCRIPTION'] = np.where(threshold, df2.loc[idxmax, 'PROD_DESCRIPTION'].values, np.nan)
df
You should be able to iterate over both dataframes and populate either a dict or a 3rd dataframe with your desired information:
d = {
'df1_id': [],
'df1_prod_desc': [],
'df2_id': [],
'df2_prod_desc': [],
'fuzzywuzzy_sim': []
}
for _, df1_row in df1.iterrows():
    for _, df2_row in df2.iterrows():
        d['df1_id'].append(df1_row['PRODUCT_ID'])
        ...
df3 = pd.DataFrame.from_dict(d)
I don't have enough reputation to be able to comment on the answer from @piRSquared, hence this answer.
The definitions of 'vi' and 'vj' failed with an error (AttributeError: 'DataFrame' object has no attribute 'get_value'). It worked when I inserted an underscore, e.g. vi = df._get_value(i, 'PRODUCT_DESCRIPTION')
The same issue occurred for 'set_value', and the same solution worked there too, e.g. df3._set_value(i, j, fuzz.ratio(vi, vj))
Generating idxmax raised another error (TypeError: reduction operation 'argmax' not allowed for this dtype) because the contents of df3 (the fuzzy ratios) were of type 'object'. Converting them all to numeric just before defining the threshold fixed it, e.g. df3 = df3.apply(pd.to_numeric)
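For completeness, here is a sketch of the same distance-matrix loop on current pandas, using the public .at accessor instead of the removed get_value/set_value (and a float dtype up front so idxmax works without conversion):
# Build the matrix with a float dtype so max/idxmax work directly.
df3 = pd.DataFrame(index=df.index, columns=df2.index, dtype=float)
for i in df3.index:
    for j in df3.columns:
        df3.at[i, j] = fuzz.ratio(df.at[i, 'PRODUCT_DESCRIPTION'],
                                  df2.at[j, 'PROD_DESCRIPTION'])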
A million thanks to @piRSquared for the solution. For a Python novice like me, it worked like a charm. I am posting this answer to make it easy for other newbies like me.