Change numeric column based on category after group by - python

I have a df like the one below:
dff = pd.DataFrame({'id': [1, 1, 2, 2], 'categ': ['A', 'B', 'A', 'B'], 'cost': [20, 5, 30, 10]})
dff
   id categ  cost
0   1     A    20
1   1     B     5
2   2     A    30
3   2     B    10
What I want is a new df where, after grouping by id, category B gains 20% of category A's cost and category A loses that same amount. For id 1, 20% of A's cost is 4, so A goes from 20 to 16 and B from 5 to 9. I would like my desired output to be like this:
   id categ  cost
0   1     A    16
1   1     B     9
2   2     A    24
3   2     B    16
I have tried the line below (with import numpy as np), but it only reduces A's cost by 20%. Any idea how to do what I want?
dff['cost'] = np.where(dff['categ'] == 'A', dff['cost'] * 0.8, dff['cost'])

Do a pivot, then modify and stack back:
s = dff.pivot(index='id', columns='categ', values='cost')
s['B'] = s['B'] + s['A'] * 0.2
s['A'] *= 0.8
s = s.stack().reset_index(name='cost')
s
Out[446]:
   id categ  cost
0   1     A  16.0
1   1     B   9.0
2   2     A  24.0
3   2     B  16.0

You can use transform to broadcast the 'A' cost to every row in its group and take 20% of it. Then, using map, you can subtract it for 'A' and add it for 'B':
s = dff['cost'].where(dff.categ.eq('A')).groupby(dff['id']).transform('first') * 0.2
dff['cost'] = dff['cost'] + s * dff['categ'].map({'A': -1, 'B': 1})
   id categ  cost
0   1     A  16.0
1   1     B   9.0
2   2     A  24.0
3   2     B  16.0
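If you prefer an explicit per-group function, here is a hedged sketch of the same transfer (it assumes exactly one 'A' and one 'B' row per id, as in the example; shift_cost is a hypothetical name):
def shift_cost(g):
    # amount to move within this id group: 20% of the 'A' cost
    move = g.loc[g['categ'].eq('A'), 'cost'].iloc[0] * 0.2
    # subtract the amount from 'A' rows, add it to 'B' rows
    sign = g['categ'].map({'A': -1, 'B': 1})
    return g.assign(cost=g['cost'] + sign * move)

out = dff.groupby('id', group_keys=False).apply(shift_cost)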

Related

How can I select the top rows of a dataframe whose column sum is 90% of that column's total?

Let's see an example. Say we have a dataframe like the one below:
  category  value
0        A      4
1        B      3
2        C      2
3        D      1
Since the sum of the values in the top 3 rows is 9 (90% of the total of 10), is there a quick way to select that part of the dataframe, so I can get:
  category  value
0        A      4
1        B      3
2        C      2
Thanks!
Create another column with a cumulative % and filter on that (this assumes the rows are already sorted by value in descending order):
df['cumpct'] = df['value'].cumsum() / df['value'].sum() * 100
df[df['cumpct'] <= 90]
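If the frame is not already sorted, a hedged one-liner to order it first:
df = df.sort_values('value', ascending=False)  # largest values first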
You can calculate the cumulative sum of values as follows:
Code:
from io import StringIO

import pandas as pd

if __name__ == '__main__':
    input_data = """
    index category value
    0     A        4
    1     B        3
    2     C        2
    3     D        1"""
    df = pd.read_csv(StringIO(input_data), sep=r"\s+", engine="python")
    df.sort_values(by=["value"], ascending=[False], inplace=True)
    df["cumsum"] = df["value"].cumsum() / df["value"].sum()
    sel_df = df[df["cumsum"] <= 0.9]
    print(sel_df)
Result:
   index category  value  cumsum
0      0        A      4     0.4
1      1        B      3     0.7
2      2        C      2     0.9

python pandas - transform custom aggregation

Given the following data frame of user activity across 2 days:
  user  score
0    A     10
1    A      0
2    B      5
I would like to calculate each user's average daily score over that period and broadcast the result to all of their rows:
import pandas as pd
df = pd.DataFrame({'user': ['A', 'A', 'B'],
                   'score': [10, 0, 5]})
df["avg"] = df.groupby(['user']).transform("sum")["score"]
df.head()
This gives me the sum for each user:
  user  score  avg
0    A     10   10
1    A      0   10
2    B      5    5
And now I would like to divide each score by the number of days (2) to get:
  user  score  avg
0    A     10    5
1    A      0    5
2    B      5  2.5
Can this be done on the same line where I calculated the sum?
You can divide the output Series by 2:
df = pd.DataFrame({'user': ['A', 'A', 'B'],
                   'score': [10, 0, 5]})
df["avg"] = df.groupby(['user']).transform("sum")["score"] / 2
print(df)
  user  score  avg
0    A     10  5.0
1    A      0  5.0
2    B      5  2.5
Here you can do something like this:
df["avg"] = df.groupby(['user']).transform("sum")["score"] / 2
In [54]: df.head()
Out[54]:
  user  score  avg
0    A     10  5.0
1    A      0  5.0
2    B      5  2.5
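If the frame carried a date column, the divisor could be derived from the data instead of hardcoded. A minimal sketch, assuming a hypothetical 'date' column (the sample frame has none, so this falls back to the known 2 days):
n_days = df['date'].nunique() if 'date' in df else 2  # distinct days, else the stated 2
df["avg"] = df.groupby('user')['score'].transform('sum') / n_days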

Groupby column name and add results as additional columns

I have a dataframe like this:
import pandas as pd
df = pd.DataFrame({
    'stuff_1_var_1': range(5),
    'stuff_1_var_2': range(2, 7),
    'stuff_2_var_1': range(3, 8),
    'stuff_2_var_2': range(5, 10)
})
   stuff_1_var_1  stuff_1_var_2  stuff_2_var_1  stuff_2_var_2
0              0              2              3              5
1              1              3              4              6
I would like to group by the column-header prefix and then add the mean and median of each group as new columns. So my expected output looks like this:
   stuff_1_var_mean  stuff_1_var_median  stuff_2_var_mean  stuff_2_var_median
0                 1                   1                 4                   4
1                 2                   2                 5                   5
Brief explanation:
We have two groups, stuff_1_var_ and stuff_2_var_, for which we calculate the mean and median per row. So, e.g., for stuff_1_var_ it would be:
# values from stuff_1_var_1 and stuff_1_var_2
(0 + 2) / 2 = 1 and
(1 + 3) / 2 = 2
The values are then added as a new column stuff_1_var_mean; analogously for the median and for stuff_2_var_.
I got as far as:
df = df.T
pattern = df.index.str.extract(r'(^stuff_\d_var_)', expand=False)
dfgb = df.groupby(pattern).agg(['mean', 'median']).T
          stuff_1_var_  stuff_2_var_
0 mean               1             4
  median             1             4
1 mean               2             5
  median             2             5
How can I do the final step(s)?
Your solution should be changed:
df = df.T
pattern = df.index.str.extract(r'(^stuff_\d_var_)', expand=False)
dfgb = df.groupby(pattern).agg(['mean', 'median']).T.unstack()
dfgb.columns = dfgb.columns.map(lambda x: f'{x[0]}{x[1]}')
print(dfgb)
   stuff_1_var_mean  stuff_1_var_median  stuff_2_var_mean  stuff_2_var_median
0                 1                   1                 4                   4
1                 2                   2                 5                   5
2                 3                   3                 6                   6
3                 4                   4                 7                   7
4                 5                   5                 8                   8
Unfortunately, agg is not implemented for axis=1, so a possible solution is to create the mean and median separately and then concat:
dfgb = df.groupby(pattern, axis=1).agg(['mean', 'median'])
NotImplementedError: axis other than 0 is not supported

pattern = df.columns.str.extract(r'(^stuff_\d_var_)', expand=False)
g = df.groupby(pattern, axis=1)
dfgb = pd.concat([g.mean().add_suffix('mean'),
                  g.median().add_suffix('median')], axis=1)
dfgb = dfgb.iloc[:, [0, 2, 1, 3]]
print(dfgb)
   stuff_1_var_mean  stuff_1_var_median  stuff_2_var_mean  stuff_2_var_median
0                 1                   1                 4                   4
1                 2                   2                 5                   5
2                 3                   3                 6                   6
3                 4                   4                 7                   7
4                 5                   5                 8                   8
Here's a way you can do it:
col = 'stuff_1_var_'
use_col = [x for x in df.columns if 'stuff_1' in x]
df[f'{col}mean'] = df[use_col].mean(1)
df[f'{col}median'] = df[use_col].median(1)
col2 = 'stuff_2_var_'
use_col = [x for x in df.columns if 'stuff_2' in x]
df[f'{col2}mean'] = df[use_col].mean(1)
df[f'{col2}median'] = df[use_col].median(1)
print(df.iloc[:,-4:]) # showing last four new columns
   stuff_1_var_mean  stuff_1_var_median  stuff_2_var_mean  stuff_2_var_median
0               1.0                 1.0               4.0                 4.0
1               2.0                 2.0               5.0                 5.0
2               3.0                 3.0               6.0                 6.0
3               4.0                 4.0               7.0                 7.0
4               5.0                 5.0               8.0                 8.0
Of course, you can put it in a function to avoid repeating the same code, as sketched below.
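A minimal sketch of such a helper, assuming the prefixes follow the stuff_<n>_var_ pattern used above (add_stats is a hypothetical name):
def add_stats(df, prefix):
    # columns belonging to this prefix group; run once per prefix,
    # before the new *_mean/*_median columns exist
    use_col = [c for c in df.columns if c.startswith(prefix)]
    df[f'{prefix}mean'] = df[use_col].mean(axis=1)
    df[f'{prefix}median'] = df[use_col].median(axis=1)
    return df

for prefix in ['stuff_1_var_', 'stuff_2_var_']:
    df = add_stats(df, prefix)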

How to make a counter of value changes by group in Python

I'm trying to make a counter that increments only when the Flag value differs from the previous row, and resets when the ID I'm grouping by changes.
Let's say I have the following dataframe:
ID  Flag  New_Column
A    NaN           1
A      0           1
A      0           1
A      0           1
A      1           2
A      1           2
A      1           2
A      0           3
A      0           3
A      0           3
A      1           4
A      1           4
A      1           4
B    NaN           1
B      0           1
I want to create New_Column so that every time the Flag value changes, New_Column increments by one, and when the ID changes, it resets to one and starts over.
Here is what I tried using np.select, but it's not working:
df['New_Column'] = None
df['Flag_Lag'] = df.sort_values(by=['ID', 'Date_Time'], ascending=True).groupby(['ID'])['Flag'].shift(1)
df['ID_Lag'] = df.sort_values(by=['ID', 'Date_Time'], ascending=True).groupby(['ID'])['ID'].shift(1)
conditions = [((df['Flag'] != df['Flag_Lag']) & (df['ID'] == df['ID_Lag'])),
              ((df['Flag'] == df['Flag_Lag']) & (df['ID'] == df['ID_Lag'])),
              ((df['Flag_Lag'] == np.nan) & (df['New_Column'].shift(1) == 1)),
              ((df['ID'] != df['ID_Lag']))]
choices = [(df['New_Column'].shift(1) + 1),
           (df['New_Column'].shift(1)),
           (df['New_Column'].shift(1)),
           1]
df['New_Column'] = np.select(conditions, choices, default=np.nan)
With this code, the first value of New_Column is 1, the second is NaN, and the rest are None.
Does anyone know a better way to do this?
Group by ID and use a cumulative sum of (current is not equal to previous):
df['new'] = df.groupby('ID') \
              .apply(lambda x: x['Flag'].fillna(0).diff().ne(0).cumsum()) \
              .reset_index(level=0, drop=True)
   ID  Flag  New_Column  new
0   A   NaN           1    1
1   A   0.0           1    1
2   A   0.0           1    1
3   A   0.0           1    1
4   A   1.0           2    2
5   A   1.0           2    2
6   A   1.0           2    2
7   A   0.0           3    3
8   A   0.0           3    3
9   A   0.0           3    3
10  A   1.0           4    4
11  A   1.0           4    4
12  A   1.0           4    4
13  B   NaN           1    1
14  B   0.0           1    1
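If you'd rather avoid apply, a hedged variant of the same idea: mark a change wherever the Flag differs from the previous row or the ID changes, then cumulatively sum those markers within each ID:
flag = df['Flag'].fillna(0)
# True at the first row of each ID and at every Flag change
change = flag.ne(flag.shift()) | df['ID'].ne(df['ID'].shift())
df['new'] = change.groupby(df['ID']).cumsum()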
If speed is not a concern and you want some easy-to-read code, you could simply iterate over the dataframe and run a simple function for each row.
def f(row):
    global previous_ID, previous_flag, previous_count
    if previous_ID is None:              # let's start the count
        count = 1
    elif previous_ID != row['ID']:       # new ID: start the count over
        count = 1
    elif previous_flag == row['Flag']:   # same ID, same Flag
        count = previous_count
    else:                                # same ID, different Flag
        count = previous_count + 1
    previous_ID = row['ID']
    previous_flag = row['Flag']
    previous_count = count
    return count
You should probably fill your NaN values with 0 first, or add a special case in the function for them, as shown right after.
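A hedged one-liner for that pre-fill (NaN != NaN, so without it every NaN row would look like a change):
df['Flag'] = df['Flag'].fillna(0)  # treat missing flags as 0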
You can run the function in the following way:
previous_ID, previous_flag, previous_count = None, None, None
df['New_Column'] = [f(row) for _, row in df.iterrows()]
And that's it.

How do I reshape this dataset in Python pandas?

Say I have a dataset like this:
is_a  is_b  is_c  population  infected
   1     0     1          50        20
   1     1     0         100        10
   0     1     1          20        10
...
How do I reshape it to look like this?
feature       0       1
      a   10/20  30/150
      b   20/50  20/120
      c  10/100   30/70
...
In the original dataset, I have features a, b, and c as their own separate columns. In the transformed dataset, these same variables are listed under column feature, and two new columns 0 and 1 are produced, corresponding to the values that these features can take on.
Where is_a is 0 in the original dataset, add up the infected values and divide by the summed population values. Where is_a is 1, do the same. Rinse and repeat for is_b and is_c. The new dataset has these fractions (or decimals) as shown. Thank you!
I've tried pd.pivot_table and pd.melt but nothing comes close to what I need.
After doing wide_to_long, your question is clearer:
df = pd.wide_to_long(df, ['is'], ['population', 'infected'], j='feature', sep='_', suffix=r'\w+').reset_index()
df
   population  infected feature  is
0          50        20       a   1
1          50        20       b   0
2          50        20       c   1
3         100        10       a   1
4         100        10       b   1
5         100        10       c   0
6          20        10       a   0
7          20        10       b   1
8          20        10       c   1
df.groupby(['feature', 'is']).apply(lambda x: sum(x['infected']) / sum(x['population'])).unstack()
is         0         1
feature
a        0.5  0.200000
b        0.4  0.166667
c        0.1  0.428571
I tried this on your small dataframe, but I am not sure it will work on a larger dataset.
dic_df = {}
for letter in ['a', 'b', 'c']:
    dic_da = {}
    dic_da[0] = df[df['is_' + letter] == 0].infected.sum() / df[df['is_' + letter] == 0].population.sum()
    dic_da[1] = df[df['is_' + letter] == 1].infected.sum() / df[df['is_' + letter] == 1].population.sum()
    dic_df[letter] = dic_da
dic_df
dic_df_ = pd.DataFrame(data=dic_df).T.reset_index().rename(columns={'index': 'feature'})
  feature    0         1
0       a  0.5  0.200000
1       b  0.4  0.166667
2       c  0.1  0.428571
Here, DF would be your original DataFrame:
Aux_NewDF = [{'feature': feature,
              0: '{}/{}'.format(DF['infected'][DF['is_{}'.format(feature.lower())] == 0].sum(),
                                DF['population'][DF['is_{}'.format(feature.lower())] == 0].sum()),
              1: '{}/{}'.format(DF['infected'][DF['is_{}'.format(feature.lower())] == 1].sum(),
                                DF['population'][DF['is_{}'.format(feature.lower())] == 1].sum())}
             for feature in ['a', 'b', 'c']]
NewDF = pd.DataFrame(Aux_NewDF)
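Printing NewDF should reproduce the fractions from the desired output (values derived from the sample rows above):
print(NewDF)
  feature       0       1
0       a   10/20  30/150
1       b   20/50  20/120
2       c  10/100   30/70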
