I am trying to create a new column in an existing DataFrame. The values of the new column come from a combination of groupby and a rolling sum. How do I do this?
I've tried two approaches, both resulting in either all-NaN values or the error 'incompatible index of inserted column with frame index'.
df = something like this:
HomeTeam FTHP
0 Bristol Rvs 0
1 Crewe 0
2 Hartlepool 3
3 Huddersfield 1
and I've tried:
(1)
df['new'] = df.groupby('HomeTeam')['FTHP'].rolling(4).sum()
(2)
df['new'] = df.groupby('HomeTeam').FTHP.apply(lambda x: x.rolling(4).mean())
(1) outputs the following, which are the values that I would like to add in a new column:
HomeTeam
Brighton 12 NaN
36 NaN
49 NaN
72 2.0
99 2.0
And I am trying to add these values in a new column next to the appropriate HomeTeam, resulting in NaN for the first three occurrences of each team (as it is rolling(4)) and picking up values after that, something like:
HomeTeam FTHP RollingMean
0 Bristol Rvs 0 NaN
1 Crewe 0 NaN
2 Hartlepool 3 NaN
3 Huddersfield 1 NaN
To ensure alignment on the original (non-duplicated) index:
df.groupby('HomeTeam', as_index=False)['FTHP'].rolling(4).sum().reset_index(0, drop=True)
With a df:
HomeTeam FTHP
A a 0
B b 1
C b 2
D a 3
E b 4
Grouping with as_index=False adds the group number (ngroup) as the 0th index level, preserving the original index in the 1st level:
df.groupby('HomeTeam', as_index=False)['FTHP'].rolling(2).sum()
#0 A NaN
# D 3.0
#1 B NaN
# C 3.0
# E 6.0
#Name: FTHP, dtype: float64
Drop level=0 to ensure alignment on the original index. Your original index must not contain duplicates, or you will get a ValueError.
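Putting the pieces together, here is a minimal sketch of assigning the rolling sum back as a new column (the scores are made up; the real frame has many rows per team). Dropping the first index level is what makes the assignment align with df's original, non-duplicated index:
import pandas as pd

# hypothetical sample frame with four fixtures per team
df = pd.DataFrame({
    'HomeTeam': ['Bristol Rvs', 'Crewe', 'Hartlepool', 'Huddersfield'] * 4,
    'FTHP': [0, 0, 3, 1, 2, 1, 0, 2, 1, 3, 2, 0, 4, 1, 1, 2],
})

# rolling sum within each team; reset_index drops the HomeTeam level
# so the result aligns with df's original RangeIndex
df['RollingSum'] = (
    df.groupby('HomeTeam')['FTHP']
      .rolling(4).sum()
      .reset_index(level=0, drop=True)
)
print(df)
Each team's first three appearances get NaN and the fourth gets the sum of its last four FTHP values, which is the layout asked for above.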
Related
I have a dataframe and would like to assign values from one row to multiple other rows.
I can get it to work with .iloc, but for some reason, when I use conditions with .loc, it only returns NaN.
df = pd.DataFrame(dict(A = [1,2,0,0],B=[0,0,0,10],C=[3,4,5,6]))
df.index = ['a','b','c','d']
When I use loc with conditions or with direct index names:
df.loc[df['A']>0, ['B','C']] = df.loc['d',['B','C']]
df.loc[['a','b'], ['B','C']] = df.loc['d',['B','C']]
it will return
A B C
a 1.0 NaN NaN
b 2.0 NaN NaN
c 0.0 0.0 5.0
d 0.0 10.0 6.0
but when I use .iloc it actually works as expected
df.iloc[0:2,1:3] = df.iloc[3,1:3]
A B C
a 1 10 6
b 2 10 6
c 0 0 5
d 0 10 6
is there a way to do this with .loc or do I need to rewrite my code to get the row numbers from my mask?
When you use labels, pandas performs index alignment; in your case there are no common indices, hence the NaNs. Location-based indexing does not align.
You can assign a numpy array to prevent index alignment:
df.loc[['a','b'], ['B','C']] = df.loc['d',['B','C']].values
output:
A B C
a 1 10 6
b 2 10 6
c 0 0 5
d 0 10 6
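For a self-contained reproduction, here is a small sketch of the same fix using .to_numpy(), the newer spelling of .values; converting the row to a plain array strips its index so no alignment is attempted:
import pandas as pd

df = pd.DataFrame(dict(A=[1, 2, 0, 0], B=[0, 0, 0, 10], C=[3, 4, 5, 6]),
                  index=['a', 'b', 'c', 'd'])

# label-based assignment would align 'd' against 'a'/'b' and produce NaN;
# assigning the underlying array broadcasts the row instead
df.loc[df['A'] > 0, ['B', 'C']] = df.loc['d', ['B', 'C']].to_numpy()
print(df)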
I have a similar question to this one.
I have a dataframe with several rows, which looks like this:
Name TypA TypB ... TypF TypA_value TypB_value ... TypF_value Divider
1 1 1 NaN 10 5 NaN 5
2 NaN 2 NaN NaN 20 NaN 10
and I want to divide all columns ending in "_value" by the column "Divider". How can I do so? One trick would be to sort the columns and use the answer from above, but is there a direct way that does not require sorting the dataframe?
The outcome would be:
Name TypA TypB ... TypF TypA_value TypB_value ... TypF_value Divider
1 1 1 NaN 2 1 0 5
2 NaN 2 NaN 0 2 0 10
So a NaN will lead to a 0.
Use DataFrame.filter to select the columns containing _value, then use DataFrame.div along axis=0 to divide them by the Divider column, and finally use DataFrame.update to write the values back into the dataframe:
d = df.filter(like='_value').div(df['Divider'], axis=0).fillna(0)
df.update(d)
Result:
Name TypA TypB TypF TypA_value TypB_value TypF_value Divider
0 1 1.0 1 NaN 2.0 1.0 0.0 5
1 2 NaN 2 NaN 0.0 2.0 0.0 10
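For reference, a runnable sketch of this filter/div/update recipe on the two sample column types from the question (the exact column names are assumptions based on the sample shown):
import numpy as np
import pandas as pd

df = pd.DataFrame({'Name': [1, 2],
                   'TypA': [1, np.nan], 'TypB': [1, 2],
                   'TypA_value': [10, np.nan], 'TypB_value': [5, 20],
                   'Divider': [5, 10]})

# select the *_value columns, divide row-wise by Divider, map NaN to 0,
# then write the results back into df in place
d = df.filter(like='_value').div(df['Divider'], axis=0).fillna(0)
df.update(d)
print(df)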
You could select the columns of interest using DataFrame.filter, and divide as:
value_cols = df.filter(regex=r'_value$').columns
df[value_cols] /= df['Divider'].to_numpy()[:,None]
# df[value_cols] = df[value_cols].fillna(0)
print(df)
Name TypA TypB TypF TypA_value TypB_value TypF_value Divider
0 1 1.0 1 NaN 2.0 1.0 NaN 5
1 2 NaN 2 NaN NaN 2.0 NaN 10
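As an aside, the [:, None] reshape is what makes this work: it turns the (n,) Divider array into an (n, 1) column so NumPy broadcasts it across every _value column instead of trying to match labels. A tiny illustration of the shapes involved:
import numpy as np

divider = np.array([5, 10])            # shape (2,)
values = np.array([[10.0, 5.0],
                   [np.nan, 20.0]])    # shape (2, 2): one row per record

# (2, 2) / (2, 1) broadcasts row-wise: each row is divided by its Divider
print(values / divider[:, None])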
Taking two sample column types, A and B:
import pandas as pd
import numpy as np
a = {'Name': [1, 2],
     'TypA': [1, np.nan],
     'TypB': [1, 2],
     'TypA_value': [10, np.nan],
     'TypB_value': [5, 20],
     'Divider': [5, 10]}
df = pd.DataFrame(a)
cols_all = df.columns
Find the columns for which calculations are to be done, assuming they all contain '_value':
cols_to_calc = [c for c in cols_all if '_value' in c]
For these columns: first divide by the Divider column, then replace NaN with 0:
for c in cols_to_calc:
    df[c] = df[c] / df.Divider
    df[c] = df[c].fillna(0)
I have a df:
a b c
1 2 3 6
2 2 5 7
3 4 6 8
I want every nth row of groupby a:
w=df.groupby('a').nth(0) #first row
x=df.groupby('a').nth(1) #second row
The second group of the df has no second row; in this case I want to have 'None' values.
[In:] df.groupby('a').nth(1)
[Out:]
a b c
1 2 5 7
2 None None None
Or maybe simpler:
The df has 1-4 rows per group. If a group has fewer than 4 rows, I want to extend it to 4 rows and fill the missing rows with 'None'. Afterwards, if I pick the nth row of each group, I get the desired output.
If you are just interested in a specific nth row but some groups do not have enough rows, you can consider using reindex with the unique values of column a, like:
print (df.groupby('a').nth(1).reindex(df['a'].unique()).reset_index())
a b c
0 2 5.0 7.0
1 4 NaN NaN
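A runnable sketch with the small frame from the question (note this relies on nth() returning a frame indexed by the group key, which is the behaviour of older pandas versions; newer versions keep the original row index instead):
import pandas as pd

df = pd.DataFrame({'a': [2, 2, 4], 'b': [3, 5, 6], 'c': [6, 7, 8]},
                  index=[1, 2, 3])

# nth(1) only yields groups that actually have a second row;
# reindexing on every unique 'a' puts the missing groups back as NaN rows
second = df.groupby('a').nth(1).reindex(df['a'].unique()).reset_index()
print(second)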
One way is to assign a count/rank column and reindex/stack:
n = 2
(df.assign(rank=df.groupby('a').cumcount())
   .query('rank < @n')
   .set_index(['a', 'rank'])
   .unstack('rank')
   .stack('rank', dropna=False)
   .reset_index()
   .drop('rank', axis=1)
)
Output:
a b c
0 2 3.0 6.0
1 2 5.0 7.0
2 4 6.0 8.0
3 4 NaN NaN
ST_NUM ST_NAME OWN_OCCUPIED NUM_BEDROOMS
0 104.0 PUTNAM Y 3.0
1 197.0 LEXINGTON NaN NaN
2 NaN LEXINGTON N 3.0
3 201.0 BERKELEY NaN 1.0
4 203.0 BERKELEY Y NaN
This is my data frame. I want to create a user-defined function that returns a data frame showing the number of missing values per column and the row index of each missing value.
The output df should look like this:
col_name index
st_num 2
st_num 6
st_name 8
Num_bedrooms 2
Num_bedrooms 5
Num_bedrooms 7
Num_bedrooms 8 .......
You can slice the index by each column's isnull mask to get the indices. It's also possible with stacking and groupby.
def summarize_missing(df):
    # Null counts
    s1 = df.isnull().sum().rename('No. Missing')
    s2 = pd.Series(data=[df.index[m].tolist() for m in [df[col].isnull() for col in df.columns]],
                   index=df.columns,
                   name='Index')
    # Other way, probably overkill
    #s2 = (df.isnull().replace(False, np.NaN).stack().reset_index()
    #        .groupby('level_1')['level_0'].agg(list)
    #        .rename('Index'))
    return pd.concat([s1, s2], axis=1, sort=False)
summarize_missing(df)
# No. Missing Index
#ST_NUM 1 [2]
#ST_NAME 0 NaN
#OWN_OCCUPIED 2 [1, 3]
#NUM_BEDROOMS 2 [1, 4]
Here's another way:
m = df.isna().sum().to_frame().rename(columns={0: 'No. Missing'})
m['index'] = m.index.map(lambda x: ','.join(map(str, df.loc[df[x].isna()].index.values)))
print(m)
No. Missing index
ST_NUM 1 2
ST_NAME 0
OWN_OCCUPIED 2 1,3
NUM_BEDROOMS 2 1,4
I have data which I pivoted using the pivot_table method; now the data looks like this:
rule_id a b c
50211 8 0 0
50249 16 0 3
50378 0 2 0
50402 12 9 6
I have set 'rule_id' as the index. Now I compared each column to the column that follows it and created another column with the result. The idea is: if the first column has a value other than 0 and the second column (to which the first is compared) has 0, then 100 should go in the newly created column, but if the situation is the reverse, then 'Null' should be used. If both columns have 0, 'Null' should also be used. For the last column, if its value is 0 then 'Null' should be used, otherwise 100. But if both columns have values other than 0 (as in the last row of my data), then the comparison should be, for columns a and b:
value_of_b/value_of_a *50 + 50
and for columns b and c:
value_of_c/value_of_b *25 + 25
and similarly, if there were more columns, the multiplication and addition value would be 12.5, and so on.
I was able to achieve everything above apart from the last part, the division and multiplication. I used this code:
m = df.eq(df.shift(-1, axis=1))
arr = np.select([df == 0, m], [np.nan, df], 100)
df2 = pd.DataFrame(arr, index=df.index).rename(columns=lambda x: f'comp{x+1}')
df3 = df.join(df2)
df is the dataframe that stores the pivoted data I mentioned at the start. After using this code, my data looks like this:
rule_id a b c comp1 comp2 comp3
50211 8 0 0 100 NaN NaN
50249 16 0 3 100 NaN 100
50378 0 2 0 NaN 100 NaN
50402 12 9 6 100 100 100
But I want the data to look like this:
rule_id a b c comp1 comp2 comp3
50211 8 0 0 100 NaN NaN
50249 16 0 3 100 NaN 100
50378 0 2 0 NaN 100 NaN
50402 12 9 6 87.5 41.67 100
If you can help me get the desired data, I would greatly appreciate it.
The problem is that the coefficient used when building a new compx column does not depend only on the column's position. Within each row it is reset to its maximum of 50 after each 0 value and is halved after each non-zero value. Such resettable series are hard to vectorize in pandas, especially across rows. Here I would build a companion dataframe holding only those coefficients, and compute them directly from the underlying values as efficiently as possible. The code could be:
# transpose the dataframe to process columns instead of rows
coeff = df.T
# compute the coefficients
for name, s in coeff.items():
    top = 100  # start at 100
    r = []
    for i, v in enumerate(s):
        if v == 0:  # reset to 100 on a 0 value
            top = 100
        else:
            top = top / 2  # else halve the previous value
        r.append(top)
    coeff.loc[:, name] = r  # set the whole column in one operation
# transpose back to have a companion dataframe for df
coeff = coeff.T

# build a new column from 2 consecutive ones, using the coeff dataframe
def build_comp(col1, col2, i):
    df['comp{}'.format(i)] = np.where(df[col1] == 0, np.nan,
                                      np.where(df[col2] == 0, 100,
                                               df[col2] / df[col1] * coeff[col1]
                                               + coeff[col1]))

old = df.columns[0]  # store name of first column
# Ok, enumerate all the columns (except first one)
for i, col in enumerate(df.columns[1:], 1):
    build_comp(old, col, i)
    old = col  # keep current column name for next iteration
# special processing for last comp column
df['comp{}'.format(i + 1)] = np.where(df[col] == 0, np.nan, 100)
With this initial dataframe:
date 2019-04-25 15:08:23 2019-04-25 16:14:14 2019-04-25 16:29:05 2019-04-25 16:36:32
rule_id
50402 0 0 9 0
51121 0 1 0 0
51147 0 1 0 0
51183 2 0 0 0
51283 0 12 9 6
51684 0 1 0 0
52035 0 4 3 2
it gives as expected:
date 2019-04-25 15:08:23 2019-04-25 16:14:14 2019-04-25 16:29:05 2019-04-25 16:36:32 comp1 comp2 comp3 comp4
rule_id
50402 0 0 9 0 NaN NaN 100.000000 NaN
51121 0 1 0 0 NaN 100.0 NaN NaN
51147 0 1 0 0 NaN 100.0 NaN NaN
51183 2 0 0 0 100.0 NaN NaN NaN
51283 0 12 9 6 NaN 87.5 41.666667 100.0
51684 0 1 0 0 NaN 100.0 NaN NaN
52035 0 4 3 2 NaN 87.5 41.666667 100.0
Ok, I think you can iterate over your dataframe df and use some if-else to get the desired output.
for i in range(len(df.index)):
    if df.iloc[i, 1] != 0 and df.iloc[i, 2] == 0:  # columns start from index 0,
        df.loc[i, 'colname'] = 'whatever you want'  # so rule_id is column 0
    elif ...:  # the remaining conditions from the question go here
        ...
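For completeness, here is a hedged sketch of how the elided branches might be filled in for the first pair of columns, following the rules stated in the question (the comp1 column name is an assumption; extending to further column pairs with halved coefficients is left out):
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [8, 16, 0, 12], 'b': [0, 0, 2, 9], 'c': [0, 3, 0, 6]},
                  index=[50211, 50249, 50378, 50402])

for i in range(len(df.index)):
    a, b = df.iloc[i, 0], df.iloc[i, 1]
    if a != 0 and b == 0:    # a has a value, b does not -> 100
        df.loc[df.index[i], 'comp1'] = 100
    elif a != 0 and b != 0:  # both non-zero -> b/a * 50 + 50
        df.loc[df.index[i], 'comp1'] = b / a * 50 + 50
    else:                    # a is 0 (b zero or not) -> leave as NaN
        df.loc[df.index[i], 'comp1'] = np.nan
print(df)
On anything beyond a small frame, the np.where-based construction from the previous answer will be faster than a row-by-row loop like this.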