I am trying to create a new column that displays a cumulative count based on values in other columns.
For the code below, I'm trying to create two new columns based on the Cause and Answer columns: for each value in the Answer column, if 'In' appears in the Cause column on that row, I want to provide a cumulative count in a new column.
import pandas as pd

d = {
    'Cause': ['In', '', '', 'In', '', 'In', 'In'],
    'Answer': ['Yes', 'No', 'Maybe', 'No', 'Yes', 'No', 'Yes'],
}
df = pd.DataFrame(d)
Output:
  Answer Cause
0    Yes    In
1     No
2  Maybe
3     No    In
4    Yes
5     No    In
6    Yes    In
Intended Output:
  Answer Cause Count_No Count_Yes
0    Yes    In                  1
1     No
2  Maybe
3     No    In        1
4    Yes
5     No    In        2
6    Yes    In                  2
I have tried the following but get an error.
df['cumsum'] = df.groupby(['Answer'])['Cause'].cumsum()
Here is one way -
for val in ['Yes', 'No']:
    cond = df.Answer.eq(val) & df.Cause.eq('In')
    df.loc[cond, 'Count_' + val] = cond[cond].cumsum()

df
# Cause Answer Count_Yes Count_No
#0 In Yes 1.0 NaN
#1 No NaN NaN
#2 Maybe NaN NaN
#3 In No NaN 1.0
#4 Yes NaN NaN
#5 In No NaN 2.0
#6 In Yes 2.0 NaN
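If you want blanks instead of NaN, as in the intended output, one optional follow-up (a small sketch):
df = df.fillna('')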
Without a for loop :-)
s = df.loc[df.Cause == 'In'].Answer.str.get_dummies()
pd.concat([df, s.cumsum().mask(s != 1, '')], axis=1).fillna('')
Out[62]:
  Answer Cause No Yes
0    Yes    In      1
1     No
2  Maybe
3     No    In  1
4    Yes
5     No    In  2
6    Yes    In      2
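To see why this works, inspect the intermediate s: it is an indicator frame built only from the rows where Cause is 'In', so its cumulative sum counts each answer's occurrences, and mask blanks every position that is not the row's own answer. A sketch of the intermediates:
s = df.loc[df.Cause == 'In'].Answer.str.get_dummies()
print(s)
#    No  Yes
# 0   0    1
# 3   1    0
# 5   1    0
# 6   0    1
print(s.cumsum())
#    No  Yes
# 0   0    1
# 3   1    1
# 5   2    1
# 6   2    2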
I would like to take a dataframe such as:
USER PACKAGE
0 1 1
1 1 1
2 1 2
3 1 1
4 1 2
5 1 3
6 2 ...
And select the distinct USERS and then have new columns based on the frequency of the different packages, i.e. the highest-frequency package, second highest, etc.
User First Second Third
0 1 1 2 3
1 2 ...
I can implement this with for loops, but that's obviously a bad fit for dataframes; I need to run this on millions of records and can't quite find a vectorized way of doing it.
Cheers
On SO you're supposed to attempt it yourself and post your own code. Here are some hints for implementing the solution:
Do .groupby('USER')..., then .value_counts()...
(no need to .sort(), since .value_counts() does that by default)
take the .head(3)...
then pivot into a table, attaching the column names 'First', 'Second', 'Third' in that same step (see the sketch below).
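A minimal sketch of those hints (the rank labels come from cumcount plus a rename rather than an option on pivot itself; names like top3 are just for illustration):
import pandas as pd

df = pd.DataFrame({'USER': [1, 1, 1, 1, 1, 1, 2],
                   'PACKAGE': [1, 1, 2, 1, 2, 3, 3]})

# Per-user package frequencies; value_counts sorts descending by default.
counts = df.groupby('USER')['PACKAGE'].value_counts()

# Keep the top 3 packages per user and flatten to columns.
top3 = counts.groupby('USER').head(3).reset_index(name='freq')

# Rank 0, 1, 2 within each user, then pivot the ranks into columns.
top3['rank'] = top3.groupby('USER').cumcount()
result = (top3.pivot(index='USER', columns='rank', values='PACKAGE')
              .rename(columns=dict(enumerate(['First', 'Second', 'Third'])))
              .reset_index())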
You can use SeriesGroupBy.value_counts, which sorts by default, then take the first 3 index values, convert them to a Series, reshape with Series.unstack, rename the columns, and finally convert the index to a column:
print (df)
USER PACKAGE
0 1 1
1 1 1
2 1 2
3 1 1
4 1 2
5 1 3
6 2 3
df = (df.groupby('USER')['PACKAGE']
        .apply(lambda x: pd.Series(x.value_counts().index[:3]))
        .unstack()
        .rename(columns=dict(enumerate(['First','Second','Third'])))
        .reset_index())
print (df)
USER First Second Third
0 1 1.0 2.0 3.0
1 2 3.0 NaN NaN
If need all counts:
df = (df.groupby('USER')['PACKAGE']
        .apply(lambda x: pd.Series(x.value_counts().index))
        .unstack())
print (df)
0 1 2
USER
1 1.0 2.0 3.0
2 3.0 NaN NaN
EDIT: Another idea, hopefully faster:
s = (df.groupby('USER')['PACKAGE']
       .apply(lambda x: x.value_counts().index[:3]))
df = (pd.DataFrame(s.tolist(), index=s.index, columns=['First','Second','Third'])
        .reset_index())
print (df)
USER First Second Third
0 1 1 2.0 3.0
1 2 3 NaN NaN
I assumed you want to count the number of user and package occurrences:
USER =[1,1,1,1,1,1,2]
PACKAGE=[1,1,2,1,2,3,3]
df=pd.DataFrame({'user':USER,'package':PACKAGE})
results=df.groupby(['user','package']).size()
results=results.sort_values(ascending=False)
results=results.unstack(level='package').fillna(0)
results=results.rename(columns={1:'First',2:'Second',3:'Third'})
print(results)
output:
package First Second Third
user
1 3.0 2.0 1.0
2 0.0 0.0 1.0
The highest-frequency package is type 1, the second highest is type 2, and the third highest is type 3 for user 1. The highest rank for user 2 is type 3. You can do a lookup on the results to produce this output, as sketched below.
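A hedged sketch of that lookup, ordering the package labels by their per-user counts (building on the df defined above; note that packages with zero occurrences still receive a rank this way):
counts = df.groupby(['user', 'package']).size().unstack(fill_value=0)

# For each user, order the package labels by count, descending.
# Tie order (e.g. among zero counts) is arbitrary.
ranked = counts.apply(
    lambda row: pd.Series(row.sort_values(ascending=False).index[:3],
                          index=['First', 'Second', 'Third']),
    axis=1)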
Try using Groupby:
df.groupby(['X']).get_group('A')
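For reference, get_group returns the sub-frame for a single group key; a tiny sketch with a hypothetical column X:
import pandas as pd

df = pd.DataFrame({'X': ['A', 'B', 'A'], 'val': [1, 2, 3]})
print(df.groupby('X').get_group('A'))
#    X  val
# 0  A    1
# 2  A    3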
lets say I have dataframe below:
index value
1 1
2 2
3 3
4 4
I want to apply a function to each row using the previous two rows via an "apply" statement. Let's say, for example, I want to multiply the current row by the previous 2 rows if they exist. (This could be any function.)
Result:
index value result
1 1 nan
2 2 nan
3 3 6
4 4 24
Thank you.
You can try rolling with prod:
df['result'] = df['value'].rolling(3).apply(lambda x: x.prod())
Output:
index value result
0 1 1 NaN
1 2 2 NaN
2 3 3 6.0
3 4 4 24.0
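Since the question notes this could be any function, rolling(3).apply accepts an arbitrary callable over each window; a minimal sketch (my_func is just an illustrative placeholder):
import pandas as pd

df = pd.DataFrame({'value': [1, 2, 3, 4]})

# Any custom logic over the current row and the two before it.
# raw=True passes each window as a NumPy array, which is usually faster.
def my_func(window):
    return window.prod()  # replace with any function of the 3-row window

df['result'] = df['value'].rolling(3).apply(my_func, raw=True)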
Use assign function:
df = df.assign(result = lambda x: x['value'].cumprod().tail(len(df)-2))
I presume you have more than four rows. If so, try grouping every four rows, taking the cumulative product, choosing the last 2, and joining back to the original dataframe.
df['result'] = df.index.map(df.assign(result=df['value'].cumprod()).groupby(df.index//4).result.tail(2).to_dict())
If it's just four rows, then this should do; let's combine .cumprod() and .tail():
df['result'] = df['value'].cumprod().tail(2)
index value result
0 1 1 NaN
1 2 2 NaN
2 3 3 6.0
3 4 4 24.0
I have a dataframe as below:
df
ID val
1 0.0
2 yes
3 1.0
4 0.0
5 yes
How do I set the previous value to match the current value when the column val equals "yes"?
I tried df['val'] = df['val'].replace('yes', np.nan).bfill().astype(str), but it won't work as desired.
desired output
ID val
1 yes
2 yes
3 1.0
4 yes
5 yes
Can we use np.where along with bfill? How should I go about this?
How about:
df.loc[df['val'].shift(-1).eq('yes'), 'val'] = 'yes'
Output:
ID val
0 1 yes
1 2 yes
2 3 1.0
3 4 yes
4 5 yes
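The question also asks about np.where; the same shift-based mask works with it (a sketch equivalent to the .loc assignment above):
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 4, 5],
                   'val': ['0.0', 'yes', '1.0', '0.0', 'yes']})

# Rows whose next row is 'yes' are overwritten with 'yes'.
mask = df['val'].shift(-1).eq('yes')
df['val'] = np.where(mask, 'yes', df['val'])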
I have a similar question to this one.
I have a dataframe with several rows, which looks like this:
Name TypA TypB ... TypF TypA_value TypB_value ... TypF_value Divider
1 1 1 NaN 10 5 NaN 5
2 NaN 2 NaN NaN 20 NaN 10
and I want to divide all columns ending in "value" by the column "Divider". How can I do so? One trick would be to use sorting and apply the answer from the linked question, but is there a direct way that doesn't require sorting the dataframe?
The outcome would be:
Name TypA TypB ... TypF TypA_value TypB_value ... TypF_value Divider
1 1 1 NaN 2 1 0 5
2 NaN 2 NaN 0 2 0 10
So a NaN will lead to a 0.
Use DataFrame.filter to select the columns ending with value from the dataframe, then use DataFrame.div along axis=0 to divide them by the Divider column, and finally use DataFrame.update to write the values back into the dataframe:
d = df.filter(like='_value').div(df['Divider'], axis=0).fillna(0)
df.update(d)
Result:
Name TypA TypB TypF TypA_value TypB_value TypF_value Divider
0 1 1.0 1 NaN 2.0 1.0 0.0 5
1 2 NaN 2 NaN 0.0 2.0 0.0 10
You could select the columns of interest using DataFrame.filter, and divide as:
value_cols = df.filter(regex=r'_value$').columns
df[value_cols] /= df['Divider'].to_numpy()[:,None]
# df[value_cols] = df[value_cols].fillna(0)
print(df)
Name TypA TypB TypF TypA_value TypB_value TypF_value Divider
0 1 1.0 1 NaN 2.0 1.0 NaN 5
1 2 NaN 2 NaN NaN 2.0 NaN 10
Taking two sample columns A and B :
import pandas as pd
import numpy as np
a = {'Name': [1, 2],
     'TypA': [1, np.nan],
     'TypB': [1, 2],
     'TypA_value': [10, np.nan],
     'TypB_value': [5, 20],
     'Divider': [5, 10]}
df = pd.DataFrame(a)
cols_all = df.columns
Find the columns for which calculations are to be done, assuming they all contain '_value':
cols_to_calc = [c for c in cols_all if '_value' in c]
For these columns, first divide by the Divider column, then replace NaN with 0:
for c in cols_to_calc:
    df[c] = df[c] / df.Divider
    df[c] = df[c].fillna(0)
I am trying to create a new column in an existing df. The values of the new column are created by a combination of groupby and a rolling sum. How do I do this?
I've tried two approaches, both resulting in either NaN values or 'incompatible index of the inserted column with frame index'.
df = something like this:
HomeTeam FTHP
0 Bristol Rvs 0
1 Crewe 0
2 Hartlepool 3
3 Huddersfield 1
and I've tried:
(1)
df['new'] = df.groupby('HomeTeam')['FTHP'].rolling(4).sum()
(2)
df['new'] = df.groupby('HomeTeam').FTHP.apply(lambda x: x.rolling(4).mean())
(1) outputs the following which are the values that I would like to add in a new column.
HomeTeam
Brighton 12 NaN
36 NaN
49 NaN
72 2.0
99 2.0
And I am trying to add these values in a new column next to the appropriate HomeTeam, resulting in NaN for the first three rows (as it is rolling(4)) and picking up values after, something like:
HomeTeam FTHP RollingMean
0 Bristol Rvs 0 NaN
1 Crewe 0 NaN
2 Hartlepool 3 NaN
3 Huddersfield 1 NaN
To ensure alignment on the original (non-duplicated) index:
df.groupby('HomeTeam', as_index=False)['FTHP'].rolling(4).sum().reset_index(0, drop=True)
With a df:
HomeTeam FTHP
A a 0
B b 1
C b 2
D a 3
E b 4
grouping with as_index=False adds an ngroup value as the 0th level, preserving the original index in the 1st level:
df.groupby('HomeTeam', as_index=False)['FTHP'].rolling(2).sum()
#0 A NaN
# D 3.0
#1 B NaN
# C 3.0
# E 6.0
#Name: FTHP, dtype: float64
Drop level=0 to ensure alignment on the original index. Your original index should not be duplicated, else you get a ValueError.
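Putting it together on that sample frame, assigning back to a new column (a sketch; the column name new is arbitrary):
import pandas as pd

df = pd.DataFrame({'HomeTeam': ['a', 'b', 'b', 'a', 'b'],
                   'FTHP': [0, 1, 2, 3, 4]},
                  index=list('ABCDE'))

# Drop the group level (level=0) so the result aligns on the original
# index A..E and can be assigned as a column.
df['new'] = (df.groupby('HomeTeam', as_index=False)['FTHP']
               .rolling(2).sum()
               .reset_index(level=0, drop=True))
print(df)
#   HomeTeam  FTHP  new
# A        a     0  NaN
# B        b     1  NaN
# C        b     2  3.0
# D        a     3  3.0
# E        b     4  6.0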