Trying to take a df and create a new column that's based on the difference between the Value in a group and that group's max:
Group  Value
A      4
A      6
A      10
B      5
B      8
B      11
End up with a new column "from_max"
from_max
6
4
0
6
3
0
I tried this, but it raises a ValueError:
df['from_max'] = df.groupby(['Group']).apply(lambda x: x['Value'].max() - x['Value'])
Thanks in advance.
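For context on the error: groupby(...).apply here returns a Series indexed by (Group, original row), which pandas cannot align with the frame's plain RangeIndex on assignment. A minimal sketch of what is going on (the exact exception text varies by pandas version):

import pandas as pd

df = pd.DataFrame({'Group': ['A', 'A', 'A', 'B', 'B', 'B'],
                   'Value': [4, 6, 10, 5, 8, 11]})

# apply returns a Series with a (Group, original index) MultiIndex...
res = df.groupby(['Group']).apply(lambda x: x['Value'].max() - x['Value'])
print(res.index)  # MultiIndex with levels ('Group', None)

# ...dropping the group level restores alignment, although the
# transform-based answers below are the idiomatic fix:
df['from_max'] = res.reset_index(level=0, drop=True)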
Option 1
vectorised groupby + transform
df['from_max'] = df.groupby('Group').Value.transform('max') - df.Value
df
  Group  Value  from_max
0     A      4         6
1     A      6         4
2     A     10         0
3     B      5         6
4     B      8         3
5     B     11         0
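Here transform('max') broadcasts each group's maximum back to the shape of the original column, so the subtraction is plain element-wise arithmetic. A quick look at the intermediate, using the sample df above:

print(df.groupby('Group').Value.transform('max'))
# 0    10
# 1    10
# 2    10
# 3    11
# 4    11
# 5    11
# Name: Value, dtype: int64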
Option 2
index-aligned subtraction
df['from_max'] = (df.groupby('Group').Value.max() - df.set_index('Group').Value).values
df
  Group  Value  from_max
0     A      4         6
1     A      6         4
2     A     10         0
3     B      5         6
4     B      8         3
5     B     11         0
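Option 2 relies on index alignment rather than broadcasting: the aggregated max is indexed by Group (A, B), and re-indexing the values by Group lets the subtraction match labels row by row; the trailing .values then strips the Group index so the result can be assigned back against the frame's RangeIndex. A sketch of the right-hand intermediate:

print(df.set_index('Group').Value)
# Group
# A     4
# A     6
# A    10
# B     5
# B     8
# B    11
# Name: Value, dtype: int64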
I think you need GroupBy.transform to return a Series with the same size as the original DataFrame:
df['from_max'] = df.groupby(['Group'])['Value'].transform(lambda x: x.max() - x)
Or:
df['from_max'] = df.groupby(['Group'])['Value'].transform(max) - df['Value']
An alternative is Series.map with the aggregated max:
df['from_max'] = df['Group'].map(df.groupby(['Group'])['Value'].max()) - df['Value']
print (df)
  Group  Value  from_max
0     A      4         6
1     A      6         4
2     A     10         0
3     B      5         6
4     B      8         3
5     B     11         0
Using reindex
df['From_Max'] = df.groupby('Group').Value.max().reindex(df.Group).values - df.Value.values
df
Out[579]:
  Group  Value  From_Max
0     A      4         6
1     A      6         4
2     A     10         0
3     B      5         6
4     B      8         3
5     B     11         0
I would like to obtain the 'Value' column below, from the original df:
   A  B  C Column_To_Use
0  2  3  4             A
1  5  6  7             C
2  8  0  9             B

   A  B  C Column_To_Use  Value
0  2  3  4             A      2
1  5  6  7             C      7
2  8  0  9             B      0
Use DataFrame.lookup:
df['Value'] = df.lookup(df.index, df['Column_To_Use'])
print (df)
   A  B  C Column_To_Use  Value
0  2  3  4             A      2
1  5  6  7             C      7
2  8  0  9             B      0
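Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0. On modern pandas, an equivalent along the lines of the replacement suggested in the pandas deprecation notes uses factorize plus NumPy indexing:

import numpy as np

# Encode each row's target column as an integer position...
idx, cols = pd.factorize(df['Column_To_Use'])
# ...then pick one value per row from the matching column:
df['Value'] = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]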
So I have the following pandas dataframe:
import pandas as pd
sample_df = pd.DataFrame({'note': ['D','C','D','C'], 'time': [1,1,4,6], 'val': [6,4,7,9]})
which gives the result
  note  time  val
0    D     1    6
1    C     1    4
2    D     4    7
3    C     6    9
What I want is
            time  val
note index
C    1         1    4
     3         6    9
D    0         1    6
     2         4    7
I tried sample_df.set_index('note', append=True), but it didn't work.
set_index with append=True adds 'note' as the second index level, so add DataFrame.swaplevel to move it first and sort by that level with DataFrame.sort_index:
df = sample_df.set_index('note', append=True).swaplevel(1,0).sort_index(level=0)
print (df)
        time  val
note
C    1     1    4
     3     6    9
D    0     1    6
     2     4    7
If you need to set the level name, add DataFrame.rename_axis:
df = (sample_df.rename_axis('idx')
               .set_index('note', append=True)
               .swaplevel(1, 0)
               .sort_index(level=0))
print (df)
          time  val
note idx
C    1       1    4
     3       6    9
D    0       1    6
     2       4    7
Alternatively:
sample_df.index.rename('old_index', inplace=True)
sample_df.reset_index(inplace=True)
sample_df.set_index(['note','old_index'], inplace=True)
sample_df.sort_index(level=0, inplace=True)
print (sample_df)
                time  val
note old_index
C    1             1    4
     3             6    9
D    0             1    6
     2             4    7
I am using MultiIndex.from_arrays to create the target index:
sample_df.index = pd.MultiIndex.from_arrays([sample_df.note, sample_df.index])
sample_df.drop('note', axis=1, inplace=True)  # positional axis argument was removed in pandas 2.0
sample_df = sample_df.sort_index(level=0)
sample_df
        time  val
note
C    1     1    4
     3     6    9
D    0     1    6
     2     4    7
I would use set_index and pop to simultaneously discard column 'note' and set the new index:
df.set_index([df.pop('note'), df.index]).sort_index(level=0)
Out[380]:
        time  val
note
C    1     1    4
     3     6    9
D    0     1    6
     2     4    7
How to drop columns with more than 50 distinct values using a function?
Here the columns to drop are date_dispatch, con_birth_dt, dat_cust_open, cust_mgr_team, mng_issu_date and created_date:
app_train.select_dtypes('object').apply(pd.Series.nunique, axis = 0)
label             1
date_dispatch  2883
con_birth_dt  12617
con_sex_mf        2
dat_cust_open   264
cust_mgr_team  2250
mng_issu_date  1796
um_num           38
created_date   2900
hqck_flag         2
dqck_flag         2
tzck_flag         2
yhlcck_flag       2
bzjck_flag        2
gzck_flag         2
jjsz_flag         2
e_yhlcck_flag     2
zq_flag           2
xtsz_flag         1
whsz_flag         1
hjsz_flag         2
yb_flag           2
qslc_flag         2
Use drop with index values filtered by boolean indexing:
a = app_train.select_dtypes('object').apply(pd.Series.nunique, axis = 0)
df = app_train.drop(a.index[a > 50], axis=1)
Another solution is to reindex the counts over all columns (non-object columns get a count of 0, so they always pass) and then filter with the inverted condition <=:
a = (app_train.select_dtypes('object')
.apply(pd.Series.nunique, axis = 0)
.reindex(app_train.columns, fill_value=0))
df = app_train.loc[:, a <= 50]
Sample:
app_train = pd.DataFrame({
    'A': list('abcdef'),
    'B': [4, 5, 4, 5, 5, 4],
    'C': [7, 8, 9, 4, 2, 3],
    'D': [1, 3, 5, 7, 1, 0],
    'E': [5, 3, 6, 9, 2, 4],
    'F': list('aaabbb')
})
print (app_train)
   A  B  C  D  E  F
0  a  4  7  1  5  a
1  b  5  8  3  3  a
2  c  4  9  5  6  a
3  d  5  4  7  9  b
4  e  5  2  1  2  b
5  f  4  3  0  4  b
a = (app_train.select_dtypes('object')
.apply(pd.Series.nunique, axis = 0)
.reindex(app_train.columns, fill_value=0))
df = app_train.loc[:, a <= 5]
print (df)
   B  C  D  E  F
0  4  7  1  5  a
1  5  8  3  3  a
2  4  9  5  6  a
3  5  4  7  9  b
4  5  2  1  2  b
5  4  3  0  4  b
nunique + loc
You can use nunique followed by loc with Boolean indexing:
n = 5 # maximum number of unique values permitted
counts = app_train.select_dtypes(['object']).apply(pd.Series.nunique)
df = app_train.loc[:, ~app_train.columns.isin(counts[counts > n].index)]
# data from jezrael
print(df)
   B  C  D  E  F
0  4  7  1  5  a
1  5  8  3  3  a
2  4  9  5  6  a
3  5  4  7  9  b
4  5  2  1  2  b
5  4  3  0  4  b
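Since the question asks for a function, here is one way to wrap the approach into a reusable helper (the name drop_high_cardinality is mine, not from the answers above):

def drop_high_cardinality(df, max_unique=50):
    """Drop object columns with more than max_unique distinct values."""
    counts = df.select_dtypes('object').nunique()
    return df.drop(columns=counts.index[counts > max_unique])

# e.g. app_train = drop_high_cardinality(app_train, max_unique=50)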
Given the following DataFrame:
>>> pd.DataFrame(data=[['a',1],['a',2],['b',3],['b',4],['c',5],['c',6],['d',7],['d',8],['d',9],['e',10]],columns=['key','value'])
  key  value
0   a      1
1   a      2
2   b      3
3   b      4
4   c      5
5   c      6
6   d      7
7   d      8
8   d      9
9   e     10
I'm looking for a method that will change the structure based on the key value, like so:
   a  b  c  d   e
0  1  3  5  7  10
1  2  4  6  8  10  <- 10 is duplicated
2  2  4  6  9  10  <- 10 is duplicated
The number of result rows equals the size of the longest group (d in the above example), and missing values are filled with duplicates of the last available value.
Create a MultiIndex with set_index using a counter column from cumcount, reshape with unstack, replace missing values with the last non-missing ones via ffill, and finally convert everything to integers if necessary:
df = df.set_index([df.groupby('key').cumcount(),'key'])['value'].unstack().ffill().astype(int)
Another solution with a custom lambda function:
df = (df.groupby('key')['value']
.apply(lambda x: pd.Series(x.values))
.unstack(0)
.ffill()
.astype(int))
print (df)
key  a  b  c  d   e
0    1  3  5  7  10
1    2  4  6  8  10
2    2  4  6  9  10
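To see how the counter works, this is the intermediate cumcount on the original sample frame: it numbers the rows within each key group, and those numbers become the new row labels after unstack:

print(df.groupby('key').cumcount())
# 0    0
# 1    1
# 2    0
# 3    1
# 4    0
# 5    1
# 6    0
# 7    1
# 8    2
# 9    0
# dtype: int64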
Using pivot, with groupby + cumcount (keyword arguments, since positional arguments to pivot were removed in pandas 2.0):
df.assign(key2=df.groupby('key').cumcount()).pivot(index='key2', columns='key', values='value').ffill().astype(int)
Out[214]:
key   a  b  c  d   e
key2
0     1  3  5  7  10
1     2  4  6  8  10
2     2  4  6  9  10
I have a data frame like this:
df1 = pd.DataFrame({'a': [1, 2],
                    'b': [3, 4],
                    'c': [6, 5]})
df1
Out[150]:
   a  b  c
0  1  3  6
1  2  4  5
Now I want to create a df that repeats each row based on the difference between columns b and c, plus 1. The difference for the first row is 6 - 3 = 3, so I want to repeat that row 3 + 1 = 4 times. Similarly, for the second row the difference is 5 - 4 = 1, so I want to repeat it 1 + 1 = 2 times. A column d is added that runs from b up to c within each block of repeated rows (i.e. from 3 to 6 for the first row). So I want to get this df:
   a  b  c  d
0  1  3  6  3
0  1  3  6  4
0  1  3  6  5
0  1  3  6  6
1  2  4  5  4
1  2  4  5  5
Do it with reindex + repeat, then assign the new column d using groupby + cumcount:
(df1.reindex(df1.index.repeat(df1.eval('c-b').add(1)))
    .assign(d=lambda x: x.c - x.groupby('a').cumcount(ascending=False)))
Out[572]:
   a  b  c  d
0  1  3  6  3
0  1  3  6  4
0  1  3  6  5
0  1  3  6  6
1  2  4  5  4
1  2  4  5  5
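To unpack the d assignment: within each block of repeated rows, cumcount(ascending=False) counts down from c - b to 0, so subtracting it from c walks d from b up to c. A quick check of the intermediates (note the groupby is on 'a', which assumes 'a' is unique per original row):

rep = df1.reindex(df1.index.repeat(df1.eval('c-b').add(1)))

# reverse counter within each repeated block: 3,2,1,0 then 1,0
print(rep.groupby('a').cumcount(ascending=False).tolist())   # [3, 2, 1, 0, 1, 0]

# c minus the counter runs from b up to c in each block
print((rep.c - rep.groupby('a').cumcount(ascending=False)).tolist())   # [3, 4, 5, 6, 4, 5]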