Compute rolling sum shifted for each group - python

My goal is to perform a groupby, then create rolling total stats, and then shift. The shift needs to restart at the first instance of each unique player. Right now it shifts the entire dataframe once, rather than shifting within each grouped player.
Original Data -
player date won
0 A 2016-01-11 0
1 A 2016-02-01 0
2 A 2016-02-01 1
3 A 2016-02-01 1
4 A 2016-10-24 0
5 A 2016-10-31 0
6 A 2018-10-22 0
7 B 2016-10-24 0
8 B 2016-10-24 1
9 B 2017-11-13 0
Things I've tried -
Attempt #1:
temp = temp_master.groupby('player', sort=False)[count_fields].rolling(10, min_periods=1).sum().shift(1).reset_index(drop=True)
temp = temp.add_suffix('_total')
temp['won_total'].head(10)
0 NaN
1 0.0
2 0.0
3 1.0
4 2.0
5 2.0
6 2.0
7 2.0
8 0.0
9 1.0
Attempt #2:
temp = temp_master.groupby('player', sort=False)[count_fields].shift(1).rolling(10, min_periods=1).sum().reset_index(drop=True)
temp = temp.add_suffix('_total')
temp['won_total'].head(10)
0 NaN
1 0.0
2 0.0
3 1.0
4 2.0
5 2.0
6 2.0
7 2.0
8 2.0
9 3.0
Attempt #3:
temp = temp_master.groupby('player', sort=False)[count_fields].rolling(10, min_periods=1).sum().reset_index(drop=True)
temp = temp.add_suffix('_total')
temp = temp.shift(1)
temp['won_total'].head(10)
0 NaN
1 0.0
2 0.0
3 1.0
4 2.0
5 2.0
6 2.0
7 2.0
8 0.0
9 1.0
This is what I need the results to be -
0 NaN
1 0.0
2 0.0
3 1.0
4 2.0
5 2.0
6 2.0
7 NaN
8 0.0
9 1.0
Index #7 should equal NaN: it is the first instance of player B, and I want the shift to restart at the first instance of every new player so that stats are summarized per player.
Index #8 should equal 0.
Index #9 should equal 1.
Attempts #1 and #3 look close, but they don't assign the NaN value at each new player. Attempt #3 isn't grouped by player anymore, though, so I know that won't really work.
Also, this will run on a fair amount of data (around 100K-300K rows), and count_fields contains around 3K-4K columns that I am calculating. Just something to be aware of.
Any ideas on how to create running stats by player that shift down at the start of every player?

You need apply here. The two methods are not both chained under the groupby object: sum runs under the groupby, but shift is applied to the result of sum, i.e. to the whole column.
temp = (temp_master.groupby('player', sort=False)['won']
        .apply(lambda x: x.rolling(10, min_periods=1).sum().shift(1))
        .reset_index(drop=True))
temp
0 NaN
1 0.0
2 0.0
3 1.0
4 2.0
5 2.0
6 2.0
7 NaN
8 0.0
9 1.0
Name: won, dtype: float64

Another option, if you don't want to use apply, is to layer a second groupby call and perform the shift there. The rolling result carries 'player' as level 0 of its index, so groupby(level=0) shifts within each player:
(df.groupby('player', sort=False)
.won.rolling(10, min_periods=1)
.sum()
.groupby(level=0)
.shift()
.reset_index(drop=True))
0 NaN
1 0.0
2 0.0
3 1.0
4 2.0
5 2.0
6 2.0
7 NaN
8 0.0
9 1.0
Name: won, dtype: float64
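Given the scale mentioned in the question (100K-300K rows and 3K-4K count_fields columns), calling apply once per column would be slow. A minimal sketch, under the question's assumptions that temp_master is the raw frame and count_fields is the list of stat columns, that rolls and shifts all of the columns in one grouped call (group_keys=False keeps the original row index):

# Each group's block of columns is rolled and shifted together, so the NaN
# lands on each player's first row for every stat column at once.
temp = (temp_master.groupby('player', sort=False, group_keys=False)[count_fields]
        .apply(lambda g: g.rolling(10, min_periods=1).sum().shift(1))
        .add_suffix('_total'))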

Related

Adding new rows with new values at some specific columns in pandas

Assume we have a table that looks like the following:
id  week_num  people     date  level  a  b
 1         1      20  1990101      1  2  3
 1         2      30  1990108      1  2  3
 1         3      40  1990115      1  2  3
 1         5     100  1990129      1  2  3
 1         7     100  1990212      1  2  3
week_num skips "4" and "6" because the corresponding "people" is 0. However, we want all the rows included, like in the following table:
id  week_num  people     date  level  a  b
 1         1      20  1990101      1  2  3
 1         2      30  1990108      1  2  3
 1         3      40  1990115      1  2  3
 1         4       0  1990122      1  2  3
 1         5     100  1990129      1  2  3
 1         6       0  1990205      1  2  3
 1         7     100  1990212      1  2  3
The date starts at 1990101, and each consecutive week_num is +7 days from the previous one (e.g. 1, 2 is consecutive; 1, 3 is not).
How can we use Python (pandas) to achieve this?
Note: each id has 10 week_nums (1, 2, 3, ..., 10); the output must include every week_num with its corresponding people and date.
Update: other columns like level, a and b should stay the same even when we add the skipped week_nums.
This assumes that the date restarts at 1990-01-01 for each id:
import itertools
import pandas as pd

# reindex to get all combinations of ids and week numbers
df_full = (df.set_index(["id", "week_num"])
             .reindex(list(itertools.product([1, 2], range(1, 11))))
             .reset_index())

# fill people with zero
df_full = df_full.fillna({"people": 0})

# forward-fill some other columns
cols_ffill = ["level", "a", "b"]
df_full[cols_ffill] = df_full[cols_ffill].ffill()

# reconstruct date from the week number, starting from 1990-01-01 for each id
df_full["date"] = pd.to_datetime("1990-01-01") + (df_full.week_num - 1) * pd.Timedelta("1w")
df_full
# out:
id week_num people date level a b
0 1 1 20.0 1990-01-01 1.0 2.0 3.0
1 1 2 30.0 1990-01-08 1.0 2.0 3.0
2 1 3 40.0 1990-01-15 1.0 2.0 3.0
3 1 4 0.0 1990-01-22 1.0 2.0 3.0
4 1 5 100.0 1990-01-29 1.0 2.0 3.0
5 1 6 0.0 1990-02-05 1.0 2.0 3.0
6 1 7 100.0 1990-02-12 1.0 2.0 3.0
7 1 8 0.0 1990-02-19 1.0 2.0 3.0
8 1 9 0.0 1990-02-26 1.0 2.0 3.0
9 1 10 0.0 1990-03-05 1.0 2.0 3.0
10 2 1 0.0 1990-01-01 1.0 2.0 3.0
11 2 2 0.0 1990-01-08 1.0 2.0 3.0
12 2 3 0.0 1990-01-15 1.0 2.0 3.0
13 2 4 0.0 1990-01-22 1.0 2.0 3.0
14 2 5 0.0 1990-01-29 1.0 2.0 3.0
15 2 6 0.0 1990-02-05 1.0 2.0 3.0
16 2 7 0.0 1990-02-12 1.0 2.0 3.0
17 2 8 0.0 1990-02-19 1.0 2.0 3.0
18 2 9 0.0 1990-02-26 1.0 2.0 3.0
19 2 10 0.0 1990-03-05 1.0 2.0 3.0
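For reference, a minimal construction of the input frame assumed by the code above, with values copied from the question's first table (date kept as plain integers, as shown there; the reindex over ids [1, 2] also fills in an id 2 that has no rows at all in the input):

import pandas as pd

df = pd.DataFrame({
    "id":       [1, 1, 1, 1, 1],
    "week_num": [1, 2, 3, 5, 7],
    "people":   [20, 30, 40, 100, 100],
    "date":     [1990101, 1990108, 1990115, 1990129, 1990212],
    "level":    [1, 1, 1, 1, 1],
    "a":        [2, 2, 2, 2, 2],
    "b":        [3, 3, 3, 3, 3],
})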

fill NA of a column with elements of another column

I'm in this situation:
my df is like this
A B
0 0.0 2.0
1 3.0 4.0
2 NaN 1.0
3 2.0 NaN
4 NaN 1.0
5 4.8 NaN
6 NaN 1.0
and I want to apply this line of code:
df['A'] = df['B'].fillna(df['A'])
and I expect the intermediate step and final output to look like this:
A B
0 2.0 2.0
1 4.0 4.0
2 1.0 1.0
3 NaN NaN
4 1.0 1.0
5 NaN NaN
6 1.0 1.0
A B
0 2.0 2.0
1 4.0 4.0
2 1.0 1.0
3 2.0 NaN
4 1.0 1.0
5 4.8 NaN
6 1.0 1.0
but I receive this error:
TypeError: Unsupported type Series
probably because each time there is an NA it tries to fill it with the whole series, rather than with the single element at the same index of column B.
I receive the same error with a syntax like that:
df['C'] = df['B'].fillna(df['A'])
so the problem does not seem to be that I'm first changing the values of A with the ones of B and then trying to fill B's NAs with a column that is technically the same as B.
I'm in a Databricks environment and I'm working with Koalas data frames, but they behave like pandas ones.
Can you help me?
Another option
Suppose the following dataset
import pandas as pd
import numpy as np

df = pd.DataFrame(data={
    'State': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Sno Center': ["Guntur", "Nellore", "Visakhapatnam", "Biswanath",
                   "Doom-Dooma", "Guntur", "Labac-Silchar", "Numaligarh",
                   "Sibsagar", "Munger-Jamalpu"],
    'Mar-21': [121, 118.8, 131.6, 123.7, 127.8, 125.9, 114.2, 114.2, 117.7, 117.7],
    'Apr-21': [121.1, 118.3, 131.5, np.NaN, 128.2, 128.2, 115.4, 115.1, np.NaN, 118.3],
})
df
State Sno Center Mar-21 Apr-21
0 1 Guntur 121.0 121.1
1 2 Nellore 118.8 118.3
2 3 Visakhapatnam 131.6 131.5
3 4 Biswanath 123.7 NaN
4 5 Doom-Dooma 127.8 128.2
5 6 Guntur 125.9 128.2
6 7 Labac-Silchar 114.2 115.4
7 8 Numaligarh 114.2 115.1
8 9 Sibsagar 117.7 NaN
9 10 Munger-Jamalpu 117.7 118.3
Then
df.loc[(df["Mar-21"].notnull()) & (df["Apr-21"].isna()), "Apr-21"] = df["Mar-21"]
df
State Sno Center Mar-21 Apr-21
0 1 Guntur 121.0 121.1
1 2 Nellore 118.8 118.3
2 3 Visakhapatnam 131.6 131.5
3 4 Biswanath 123.7 123.7
4 5 Doom-Dooma 127.8 128.2
5 6 Guntur 125.9 128.2
6 7 Labac-Silchar 114.2 115.4
7 8 Numaligarh 114.2 115.1
8 9 Sibsagar 117.7 117.7
9 10 Munger-Jamalpu 117.7 118.3
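In plain pandas, the same conditional fill can be written more compactly with fillna, since only the NaNs in Apr-21 are touched; a small sketch on the frame built above:

# Equivalent one-liner: fill Apr-21's NaNs from Mar-21, aligned by index.
df["Apr-21"] = df["Apr-21"].fillna(df["Mar-21"])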
IIUC, try with max():
df['A']=df[['A','B']].max(axis=1)
output of df:
A B
0 2.0 2.0
1 4.0 4.0
2 1.0 1.0
3 2.0 NaN
4 1.0 1.0
5 4.8 NaN
6 1.0 1.0
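Note that in plain pandas (unlike the Koalas version that raised the error), fillna does accept a Series aligned by index, and the same fill can also be written with combine_first. A minimal pandas sketch on the question's frame:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [0.0, 3.0, np.nan, 2.0, np.nan, 4.8, np.nan],
    "B": [2.0, 4.0, 1.0, np.nan, 1.0, np.nan, 1.0],
})

# Take B where it is non-null and fall back to A elsewhere; element-wise,
# by index, this matches df['B'].fillna(df['A']) in pandas.
df["A"] = df["B"].combine_first(df["A"])
print(df)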

Reset the counter when column has non zero value

I have a dataframe with a column:
A
0.0
0.0
0.0
12.0
0.0
0.0
34.0
0.0
0.0
0.0
0.0
11.0
I want the output to look like this, with a counter column. The counter should restart after every non-zero value: on the row after each non-zero value, the counter is initialized again and then increments.
A Counter
0.0 1
0.0 2
0.0 3
12.0 4
0.0 1
0.0 2
34.0 3
0.0 1
0.0 2
0.0 3
0.0 4
11.0 5
Let us try cumsum to create the groupby key; [::-1] here reverses the order:
df['Counter'] = df.A.groupby(df.A.ne(0)[::-1].cumsum()).cumcount()+1
Out[442]:
0 1
1 2
2 3
3 4
4 1
5 2
6 3
7 1
8 2
9 3
10 4
11 5
dtype: int64
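To see why the reversed cumsum forms the right grouping key, here is a small self-contained sketch of the intermediates, assuming the question's column:

import pandas as pd

df = pd.DataFrame({"A": [0.0, 0.0, 0.0, 12.0, 0.0, 0.0, 34.0,
                         0.0, 0.0, 0.0, 0.0, 11.0]})

# Mark non-zero rows, reverse, and take a cumulative sum: every run of rows
# that ends at a non-zero value (in original order) shares one constant key.
key = df.A.ne(0)[::-1].cumsum()

# groupby aligns the key by index, so the reversed order does not matter.
df["Counter"] = df.A.groupby(key).cumcount() + 1
print(df)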

Pandas dataframe insert missing row and fill with previous row

I have a dataframe as below:
import pandas as pd
import numpy as np

df = pd.DataFrame({'id': [0, 1, 2, 4, 5],
                   'A': [0, 1, 0, 1, 0],
                   'B': [None, None, 1, None, None]})
id A B
0 0 0 NaN
1 1 1 NaN
2 2 0 1.0
3 4 1 NaN
4 5 0 NaN
Notice that the vast majority of values in column B are NaN.
The id column increments by 1, so the row between ids 2 and 4 is missing.
Each missing row that needs inserting is the same as its previous row, except for the id column.
So for example the result is
id A B
0 0 0.0 NaN
1 1 1.0 NaN
2 2 0.0 1.0
3 3 0.0 1.0 <-add row here
4 4 1.0 NaN
5 5 0.0 NaN
I can do this for column A, but I don't know how to deal with column B, as ffill would fill 1.0 at rows 4 and 5, which is incorrect:
step=1
idx=np.arange(df['id'].min(), df['id'].max() + step, step)
df=df.set_index('id').reindex(idx).reset_index()
df['A']=df["A"].ffill()
EDIT:
Sorry, I forgot one situation: column B can have different values.
When DataFrame is as below:
id A B
0 0 0 NaN
1 1 1 NaN
2 2 0 1.0
3 4 1 NaN
4 5 0 NaN
5 6 1 2.0
6 9 0 NaN
7 10 1 NaN
the result would be:
id A B
0 0 0 NaN
1 1 1 NaN
2 2 0 1.0
3 3 0 1.0
4 4 1 NaN
5 5 0 NaN
6 6 1 2.0
7 7 1 2.0
8 8 1 2.0
9 9 0 NaN
10 10 1 NaN
Do the changes but keep the original ids, and use update with isin:
s = df.id.copy()  # change 1: remember the original ids
step = 1
idx = np.arange(df['id'].min(), df['id'].max() + step, step)
df = df.set_index('id').reindex(idx).reset_index()
df['A'] = df['A'].ffill()
df.B.update(df.B.ffill().mask(df.id.isin(s)))  # change 2: fill B only on the inserted rows
df
df
id A B
0 0 0.0 NaN
1 1 1.0 NaN
2 2 0.0 1.0
3 3 0.0 1.0
4 4 1.0 NaN
5 5 0.0 NaN
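A quick check of the update/isin variant against the edited example (the one where B takes several different values); a minimal self-contained sketch:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": [0, 1, 2, 4, 5, 6, 9, 10],
    "A":  [0, 1, 0, 1, 0, 1, 0, 1],
    "B":  [None, None, 1, None, None, 2, None, None],
})

s = df.id.copy()  # ids present in the input
idx = np.arange(df["id"].min(), df["id"].max() + 1)
df = df.set_index("id").reindex(idx).reset_index()
df["A"] = df["A"].ffill()
# Forward-fill B, then keep the filled values only on the inserted rows.
df["B"].update(df["B"].ffill().mask(df["id"].isin(s)))
print(df)  # inserted ids 3, 7 and 8 get B = 1.0, 2.0, 2.0; original NaNs stay NaN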
If I understand correctly, here is some sample code.
new_df = pd.DataFrame({
    'new_id': list(range(df['id'].max() + 1)),
})
df = df.merge(new_df, how='outer', left_on='id', right_on='new_id')
df = df.sort_values('new_id')
df = df.ffill()
df = df.drop(columns='id')
df
A B new_id
0 0.0 NaN 0
1 1.0 NaN 1
2 0.0 1.0 2
5 0.0 1.0 3
3 1.0 1.0 4
4 0.0 1.0 5
Try this
df = pd.DataFrame({'id': [0, 1, 2, 4, 5],
                   'A': [0, 1, 0, 1, 0],
                   'B': [None, None, 1, None, None]})
missingid = list(set(range(df.id.min(), df.id.max())) - set(df.id.tolist()))
for i in missingid:
    df.loc[len(df)] = np.concatenate((np.array([i]),
                                      df[df.id == i - 1][["A", "B"]].values[0]))
df = df.sort_values("id").reset_index(drop=True)
output
id A B
0 0.0 0.0 NaN
1 1.0 1.0 NaN
2 2.0 0.0 1.0
3 3.0 0.0 1.0
4 4.0 1.0 NaN
5 5.0 0.0 NaN

Pandas: get two different rows with same pair of values in two different columns

I have two columns, _Id and _ParentId, with this example data. Using this I want to group _Id with _ParentId.
_Id _ParentId
1 NaN
2 NaN
3 1.0
4 2.0
5 NaN
6 2.0
After grouping, the result should look like this:
_Id _ParentId
1 NaN
3 1.0
2 NaN
4 2.0
6 2.0
5 NaN
The main aim is to see which _Id belongs to which _ParentId (e.g. _Id 3 belongs to _Id 1).
I have attempted to use groupby and duplicated, but I can't seem to get the results shown above.
Use sort_values on a temporary key column:
In [3188]: (df.assign(temp=df._ParentId.combine_first(df._Id))
              .sort_values(by='temp').drop(columns='temp'))
Out[3188]:
_Id _ParentId
0 1 NaN
2 3 1.0
1 2 NaN
3 4 2.0
5 6 2.0
4 5 NaN
Details
In [3189]: df._ParentId.combine_first(df._Id)
Out[3189]:
0 1.0
1 2.0
2 1.0
3 2.0
4 5.0
5 2.0
Name: _ParentId, dtype: float64
In [3190]: df.assign(temp=df._ParentId.combine_first(df._Id))
Out[3190]:
_Id _ParentId temp
0 1 NaN 1.0
1 2 NaN 2.0
2 3 1.0 1.0
3 4 2.0 2.0
4 5 NaN 5.0
5 6 2.0 2.0
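A self-contained version of the same approach; kind='stable' is an assumption added here so that ties keep their original order, i.e. each parent row stays above its children:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "_Id": [1, 2, 3, 4, 5, 6],
    "_ParentId": [np.nan, np.nan, 1.0, 2.0, np.nan, 2.0],
})

# Sort key: a row's parent id if it has one, otherwise its own id, so each
# parent sorts next to its children.
out = (df.assign(temp=df._ParentId.combine_first(df._Id))
         .sort_values("temp", kind="stable")
         .drop(columns="temp"))
print(out)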
Your expected output is almost the same as the input; only IDs 4 and 6 are brought together, and the NaNs sit in different places. It's not possible to get that exact output from a plain groupby.
Here is how group-by would ideally work:
print("Original: ")
print(df)
df = df.fillna(-1) # if not replaced with another character , the grouping won't show NaNs.
df2 = df.groupby('_Parent')
print("\nAfter grouping: ")
for key, item in df2:
print (df2.get_group(key))
Output:
Original:
_Id _Parent
0 1 NaN
1 2 NaN
2 3 1.0
3 4 2.0
4 5 NaN
5 6 2.0
After grouping:
_Id _Parent
0 1 0.0
1 2 0.0
4 5 0.0
_Id _Parent
2 3 1.0
_Id _Parent
3 4 2.0
5 6 2.0
