Pandas dataframe ratio of difference of consecutive rows to previous value - python

Suppose I have the DataFrame (called df)
'name'  'order'  'quantity'
 'A'      1        10
 'A'      2        15
 'A'      3         5
 'B'      1         2
 'B'      2         6
What I want is to build another DataFrame containing a column with the ratio of each difference between consecutive rows (consecutive in terms of the 'order' column, within each 'name' group) to the previous value.
I can easily retrieve the difference (the numerator) as
def compute_diff(x):
    quantity_diff = x.quantity.diff()
    return quantity_diff

diff_df = df.sort_values('order').groupby('name').apply(compute_diff).reset_index(name='diff')
This gives me
'name'  'level_1'  'diff'
 'A'      0         NaN
 'A'      1         5.0
 'A'      2       -10.0
 'B'      3         NaN
 'B'      4         4.0
Now I want the ratio instead, as described above. Specifically, I'd want
'name'  'order'  'quantity'
 'A'      1        NaN
 'A'      2        0.5
 'A'      3       -0.6666
 'B'      1        NaN
 'B'      2        2
How can I do this?

After performing your groupby, use pct_change:
# Sort the DataFrame, if necessary.
df = df.sort_values(['name', 'order'])
# Use groupby and pct_change on the 'quantity' column.
df['quantity'] = df.groupby('name')['quantity'].pct_change()
The resulting output:
name order quantity
0 A 1 NaN
1 A 2 0.500000
2 A 3 -0.666667
3 B 1 NaN
4 B 2 2.000000

You could take your result and divide it by the shifted 'quantity' column in df (assuming df is sorted the same way):
diff_df['diff'] = diff_df['diff'] / df.quantity.shift(1)
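For intuition, pct_change is exactly this diff divided by the previous value. A minimal sketch computing the same ratio explicitly, assuming the original df from the question:
# pct_change() is the diff() divided by the value it was diffed against.
g = df.sort_values(['name', 'order']).groupby('name')['quantity']
ratio = g.diff() / g.shift(1)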

Related

Returning all rows where all values in a list are present in the same column using pandas

My data frame looks something like the below.
I want to find only those restaurants where all the food items are present. I first tried to group the rows by resturant_id, but it's not working.
Code used for grouping:
df_new = df_new.groupby('resturant_id')  # does not group by itself; it returns a GroupBy object, so the data looks unchanged
Then, since I basically have a list of food_items, I go through each row looking for the food items.
eg data:
items_list = ['a', 'b']

resturant_id  price  food_item
           1      1  'a'
           1      1  'b'
           2      1  'b'
           3      1  'a'
           3      2  'b'
So, basically, my ask is to find only those restaurants where all the food items are found. In our case that is restaurants 1 and 3, because they have both 'a' and 'b'.
The isin method in pandas looks for either 'a' or 'b' but not both at the same time (meaning it does an 'OR', not an 'AND'). How can I do this using pandas?
If I get the restaurants having all the items in the list, the restaurants should then be compared, to return the cheapest restaurant for all products.
I tried isin as below, but it's not working as expected:
choice = ['a','b']
df_new = df[np.isin(df['item_label'], choice)].reset_index(drop=True)
resturant_id price item_label
0 5 4.0 a
1 5 8.0 b
2 6 5.0 a
3 7 2.5 a
4 7 3.0 b
It returns even those rows where any one food item is present. I want only those restaurants where all the food items are present, and then I want to find the cheapest restaurant if there is more than one such restaurant (such as 1 and 3, as explained in the example above).
We can use GroupBy.filter:
import numpy as np

new_df = df.groupby('resturant_id').filter(
    lambda x: np.isin(choice, x['food_item']).all()
)
resturant_id price food_item
0 1 1 'a'
1 1 1 'b'
3 3 1 'a'
4 3 2 'b'
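The same membership test can also be written with plain Python sets; a sketch, assuming df and choice as above:
# Keep a restaurant only if its menu contains every wanted item.
wanted = set(choice)
new_df = df.groupby('resturant_id').filter(
    lambda g: wanted.issubset(g['food_item'])
)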
Another option:
new_df = df.loc[pd.get_dummies(df['food_item'])
                  .groupby(df['resturant_id'])
                  .transform('sum')
                  .gt(0)
                  .all(axis=1)]
Or, if you want to check only selected items:
new_df = df.loc[pd.get_dummies(df['food_item'])
                  .groupby(df['resturant_id'])[['a', 'b']]
                  .transform('sum')
                  .gt(0)
                  .all(axis=1)]
Now we can get the cheapest restaurant for each product. Note that GroupBy.rank is needed here because there may be a price tie:
s = new_df.groupby('food_item')['price'].rank()
print(s)
0 1.5
1 1.0
3 1.5
4 2.0
Name: price, dtype: float64
cheaper_df = new_df.loc[s.eq(s.groupby(new_df['food_item']).transform('min'))]
print(cheaper_df)
resturant_id price food_item
0 1 1 a
1 1 1 b
3 3 1 a
cheaper_df.groupby('food_item')['resturant_id'].agg(list)
food_item
a [1, 3]
b [1]
Name: resturant_id, dtype: object
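If you instead want the single restaurant with the cheapest total basket, a sketch (assuming each qualifying restaurant lists every item in the list exactly once):
# Sum the basket price per qualifying restaurant and take the smallest.
totals = new_df.groupby('resturant_id')['price'].sum()
print(totals.idxmin())  # -> 1 for the example data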

Use groupby in dataframe to perform data filtering and element-wise subtraction

I have a dataframe composed of the following table:
A B C D
A1 5 3 4
A1 8 1 0
A2 1 1 0
A2 1 9 1
A2 1 3 1
A3 0 4 7
...
I need to group the data according to the 'A' label, then check whether the sum of the 'B' column for each label is larger than 10. If it is larger than 10, then perform an operation that involves subtracting 'C' and 'D'. Finally, I need to drop all rows belonging to those 'A' labels for which the sum condition does not hold. I am trying to use the groupby method, but I am not sure this is the right way to go. So far I have grouped everything with df.groupby('A')['B'].sum() to get the sum per grouped label, in order to check the aforementioned condition against 10. But then how do I apply the subtraction between columns C and D, and also drop the irrelevant rows?
Use GroupBy.transform with 'sum' to get a new Series of aggregate values aligned to the original rows, keep the rows whose group sum is greater than 10 via boolean indexing with Series.gt, and then subtract the columns:
df = df[df.groupby('A')['B'].transform('sum').gt(10)].copy()
df['E'] = df['C'].sub(df['D'])
print (df)
A B C D E
0 A1 5 3 4 -1
1 A1 8 1 0 1
A similar idea if you also need the sum as a column:
df['sum'] = df.groupby('A')['B'].transform('sum')
df['E'] = df['C'].sub(df['D'])
df = df[df['sum'].gt(10)].copy()
print (df)
A B C D sum E
0 A1 5 3 4 13 -1
1 A1 8 1 0 13 1
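The same result can also be reached in one step with GroupBy.filter; a sketch, assuming df as in the question:
# Keep only the groups whose 'B' values sum to more than 10, then subtract.
df = df.groupby('A').filter(lambda g: g['B'].sum() > 10)
df['E'] = df['C'].sub(df['D'])
transform is usually faster when there are many groups, but filter reads closer to the problem statement.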

Selecting Column values based on dictionary keys

I have a dictionary from which I want to decide which column's value to choose, sort of like an if-condition driven by a dictionary.
import pandas as pd
dictname = {'A': 'Select1', 'B':'Select2','C':'Select3'}
df = pd.DataFrame([['A',1,2,3,4],['B',1,2,3,4],['B',1,3,4,5],['C',1,5,6,7]], columns=['Name','Score','Select1','Select2','Select3'])
So I want to create a new column called ChosenValue which selects values based on the row's value in the column 'Name', e.g. ChosenValue should equal column 'Select1''s value if the row's value in 'Name' is 'A', 'Select2''s value if it is 'B', and so forth. I really want something that links it to the dictionary 'dictname'.
Use Index.get_indexer to get a list of indices. After that, you can just index into the underlying numpy array.
import numpy as np

idx = df.columns.get_indexer(df.Name.map(dictname))
df['ChosenValue'] = df.values[np.arange(len(df)), idx]
df
Name Score Select1 Select2 Select3 ChosenValue
0 A 1 2 3 4 2
1 B 1 2 3 4 3
2 B 1 3 4 5 4
3 C 1 5 6 7 7
If you know that every Name is in the dictionary, you could use lookup:
In [104]: df["ChosenValue"] = df.lookup(df.index, df.Name.map(dictname))
In [105]: df
Out[105]:
Name Score Select1 Select2 Select3 ChosenValue
0 A 1 2 3 4 2
1 B 1 2 3 4 3
2 B 1 3 4 5 4
3 C 1 5 6 7 7
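Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0. On recent versions, a sketch of an equivalent based on the factorize recipe from the pandas release notes:
import numpy as np

# Map each row's Name to a column label, then pick those cells by position.
idx, cols = pd.factorize(df.Name.map(dictname))
df['ChosenValue'] = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]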

How to Select the Rows in a DataFrame with the Maximum Value in a Column

I have a dataframe, df, from which I want to select certain rows:
  A  B  C  D
'a'  1  1  1
'b'  1  2  1
'c'  1  1  1
'a'  1  2  2
'a'  2  2  2
'b'  1  2  2
I want to get the rows where the value in one column is the maximum for its group. So for the example above, if I group by 'A' and 'B', I want the rows that have the greatest value in 'C':
  A  B  C  D
'a'  1  2  2
'b'  1  2  2
'c'  1  1  1
'a'  2  2  2
I know that I want to use a groupby, but I'm not sure what to do after that.
The easiest way is to use the transform function. This basically lets you apply a function against a group while retaining the same index as the original dataframe. In this case, you can see you get the following from the transform:
In [13]: df.groupby(['A', 'B'])['C'].transform(max)
Out[13]:
0 2
1 2
2 1
3 2
4 2
5 2
Name: C, dtype: int64
This has the exact same index as the original dataframe, so you can use it to create a filter.
df[df['C'] == df.groupby(['A', 'B'])['C'].transform(max)]
Out[11]:
A B C D
1 b 1 2 1
2 c 1 1 1
3 a 1 2 2
4 a 2 2 2
5 b 1 2 2
For much more information on this, see the pandas groupby documentation, which is excellent.
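As an aside, idxmax offers a close alternative; a sketch, assuming df as above. Unlike the transform filter, it keeps exactly one row per group even when several rows tie for the maximum:
# Select one row per (A, B) group: the one holding the maximal 'C'.
df.loc[df.groupby(['A', 'B'])['C'].idxmax()]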

Filling cells with conditional column means

Consider the following DataFrame:
df2 = pd.DataFrame({
    'VAR_1': [1, 1, 1, 3, 3],
    'GROUP': [1, 1, 1, 2, 2],
})
My goal is to create a separate column "GROUP_MEAN" which holds the arithmetic mean of the column "VAR_1".
But it should always consider the row's value in "GROUP":
GROUP VAR_1 GROUP_MEAN
0 1 1 Mean Value GROUP = 1
1 1 1 Mean Value GROUP = 1
2 1 1 Mean Value GROUP = 1
3 2 3 Mean Value GROUP = 2
4 2 3 Mean Value GROUP = 2
I can easily access the overall mean:
df2['GROUP_MEAN'] = df2['VAR_1'].mean()
How do I go about making this conditional on another column's value?
I think this is a perfect use case for transform:
>>> df2 = pd.DataFrame({'VAR_1' : [1,2,3,4,5], 'GROUP': [1,1,1,2,2]})
>>> df2["GROUP_MEAN"] = df2.groupby('GROUP')['VAR_1'].transform('mean')
>>> df2
GROUP VAR_1 GROUP_MEAN
0 1 1 2.0
1 1 2 2.0
2 1 3 2.0
3 2 4 4.5
4 2 5 4.5
[5 rows x 3 columns]
Typically you use transform when you want to broadcast the result across all entries of the group.
Assuming that the actual data-frame has columns in addition to VAR_1:
import numpy as np

ts = df2.groupby('GROUP')['VAR_1'].aggregate(np.mean)
df2['GROUP_MEAN'] = ts[df2.GROUP].values
Alternatively, the last line could also be:
df2 = df2.join(ts, on='GROUP', rsuffix='_MEAN')
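A map-based sketch achieves the same broadcast, assuming df2 as above: compute each group's mean once, then map it back through the 'GROUP' key:
# One mean per group, broadcast back onto the rows via the group key.
means = df2.groupby('GROUP')['VAR_1'].mean()
df2['GROUP_MEAN'] = df2['GROUP'].map(means)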
