I have a dataframe with dates, ids, and values.
For example:
date id value
2016-08-28 A 1
2016-08-28 B 1
2016-08-29 C 2
2016-09-02 B 0
2016-09-03 A 3
2016-09-06 C 1
2017-01-15 B 2
2017-01-18 C 3
2017-01-18 A 2
I want to apply a rolling mean per id, starting from each id's second occurrence, so that the result would look like:
date id value rolling_mean
2016-08-28 A 1 NaN
2016-08-28 B 1 NaN
2016-08-29 C 2 NaN
2016-09-02 B 0 0.5
2016-09-03 A 3 2.0
2016-09-06 C 1 1.5
2017-01-15 B 2 1.0
2017-01-18 C 3 2.0
2017-01-18 A 2 2.5
The closest I've come to this was:
grouped = df.groupby(["id", "value"])
df["rolling_mean"] = grouped["value"].shift(1).rolling(window = 2).mean()
But this gives me the wrong values back, because the window still follows the overall row order rather than each id's own rows.
Any idea?
Thank you in advance.
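For reference, a minimal snippet to reconstruct the example frame (the dates are kept as strings here, since the answers below rely only on row order within each id):
import pandas as pd
df = pd.DataFrame({
    'date': ['2016-08-28', '2016-08-28', '2016-08-29', '2016-09-02',
             '2016-09-03', '2016-09-06', '2017-01-15', '2017-01-18', '2017-01-18'],
    'id': ['A', 'B', 'C', 'B', 'A', 'C', 'B', 'C', 'A'],
    'value': [1, 1, 2, 0, 3, 1, 2, 3, 2],
})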
You can just groupby id and use transform:
df['rolling_mean'] = df.groupby('id')['value'].transform(lambda x: x.rolling(2).mean())
Output:
date id value rolling_mean
0 2016-08-28 A 1 NaN
1 2016-08-28 B 1 NaN
2 2016-08-29 C 2 NaN
3 2016-09-02 B 0 0.5
4 2016-09-03 A 3 2.0
5 2016-09-06 C 1 1.5
6 2017-01-15 B 2 1.0
7 2017-01-18 C 3 2.0
8 2017-01-18 A 2 2.5
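One caveat: rolling(2) here is purely positional, so it assumes each id's rows already appear in chronological order (as they do in the example). If that is not guaranteed, sort first; a hedged variant:
# sort by date so each id's rows are chronological before the positional rolling
df = df.sort_values('date')
df['rolling_mean'] = df.groupby('id')['value'].transform(lambda x: x.rolling(2).mean())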
Fix your code by grouping by id only:
grouped = df.groupby(["id"])
df['rolling_mean'] = grouped["value"].rolling(window=2).mean().reset_index(level=0, drop=True)
df
Out[67]:
date id value rolling_mean
0 2016-08-28 A 1 NaN
1 2016-08-28 B 1 NaN
2 2016-08-29 C 2 NaN
3 2016-09-02 B 0 0.5
4 2016-09-03 A 3 2.0
5 2016-09-06 C 1 1.5
6 2017-01-15 B 2 1.0
7 2017-01-18 C 3 2.0
8 2017-01-18 A 2 2.5
Like this:
df['rolling_mean'] = df.groupby('id')['value'].rolling(2).mean().reset_index(0,drop=True).sort_index()
Output:
date id value rolling_mean
0 2016-08-28 A 1 nan
1 2016-08-28 B 1 nan
2 2016-08-29 C 2 nan
3 2016-09-02 B 0 0.50
4 2016-09-03 A 3 2.00
5 2016-09-06 C 1 1.50
6 2017-01-15 B 2 1.00
7 2017-01-18 C 3 2.00
8 2017-01-18 A 2 2.50
The objective is to fill NaN with respect to two columns (i.e., a and b).
a b c d
2 0 1 4
5 0 5 6
6 0 1 1
1 1 1 4
4 1 5 6
5 1 5 6
6 1 1 1
1 2 2 3
6 2 5 6
For each fixed value in column b, column a should run continuously from 1 to 6, with the rows inserted to fill the gaps getting NaN in the remaining columns.
The code snippet below does the trick:
import numpy as np
import pandas as pd

maxval_col_a = 6
lowval_col_a = 1
maxval_col_b = 2
lowval_col_b = 0
r = list(range(lowval_col_b, maxval_col_b + 1))

df = pd.DataFrame(np.column_stack([[2, 5, 6, 1, 4, 5, 6, 1, 6],
                                   [0, 0, 0, 1, 1, 1, 1, 2, 2],
                                   [1, 5, 1, 1, 5, 5, 1, 2, 5],
                                   [4, 6, 1, 4, 6, 6, 1, 3, 6]]),
                  columns=['a', 'b', 'c', 'd'])

all_df = []
for idx in r:
    k = df.loc[df['b'] == idx].set_index('a').reindex(range(lowval_col_a, maxval_col_a + 1)).reset_index()
    k['b'] = idx
    all_df.append(k)
df = pd.concat(all_df)
But I am curious whether there is a more efficient or better way of doing this with pandas.
The expected output:
a b c d
0 1 0 NaN NaN
1 2 0 1.0 4.0
2 3 0 NaN NaN
3 4 0 NaN NaN
4 5 0 5.0 6.0
5 6 0 1.0 1.0
0 1 1 1.0 4.0
1 2 1 NaN NaN
2 3 1 NaN NaN
3 4 1 5.0 6.0
4 5 1 5.0 6.0
5 6 1 1.0 1.0
0 1 2 2.0 3.0
1 2 2 NaN NaN
2 3 2 NaN NaN
3 4 2 NaN NaN
4 5 2 NaN NaN
5 6 2 5.0 6.0
Create the Cartesian product of the combinations:
mi = pd.MultiIndex.from_product([df['b'].unique(), range(1, 7)],
names=['b', 'a']).swaplevel()
out = df.set_index(['a', 'b']).reindex(mi).reset_index()
print(out)
# Output
a b c d
0 1 0 NaN NaN
1 2 0 1.0 4.0
2 3 0 NaN NaN
3 4 0 NaN NaN
4 5 0 5.0 6.0
5 6 0 1.0 1.0
6 1 1 1.0 4.0
7 2 1 NaN NaN
8 3 1 NaN NaN
9 4 1 5.0 6.0
10 5 1 5.0 6.0
11 6 1 1.0 1.0
12 1 2 2.0 3.0
13 2 2 NaN NaN
14 3 2 NaN NaN
15 4 2 NaN NaN
16 5 2 NaN NaN
17 6 2 5.0 6.0
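The swaplevel is only needed because the product is built b-first; an equivalent sketch that skips it by indexing in (b, a) order and reordering the columns afterwards:
mi = pd.MultiIndex.from_product([df['b'].unique(), range(1, 7)], names=['b', 'a'])
out = df.set_index(['b', 'a']).reindex(mi).reset_index()[['a', 'b', 'c', 'd']]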
First create a MultiIndex from columns [a, b], then a new MultiIndex with all the combinations, and then reindex with the new MultiIndex
(showing all steps):
# set both a and b as the index (a MultiIndex)
df.set_index(['a', 'b'], drop=True, inplace=True)
# create the new multiindex: a runs from 1 to 6 for each value of b
new_idx_a = np.tile(np.arange(1, 6 + 1), 3)
new_idx_b = np.repeat([0, 1, 2], 6)
new_multidx = pd.MultiIndex.from_arrays([new_idx_a, new_idx_b])
# reindex
df = df.reindex(new_multidx)
# convert the multiindex back to columns
df.index.names = ['a', 'b']
df.reset_index()
results:
a b c d
0 1 0 NaN NaN
1 2 0 1.0 4.0
2 3 0 NaN NaN
3 4 0 NaN NaN
4 5 0 5.0 6.0
5 6 0 1.0 1.0
6 1 1 1.0 4.0
7 2 1 NaN NaN
8 3 1 NaN NaN
9 4 1 5.0 6.0
10 5 1 5.0 6.0
11 6 1 1.0 1.0
12 1 2 2.0 3.0
13 2 2 NaN NaN
14 3 2 NaN NaN
15 4 2 NaN NaN
16 5 2 NaN NaN
17 6 2 5.0 6.0
We can do it with a groupby on column b, then setting a as the index and adding the missing values of a using numpy.arange.
To finish, reset the index to get the expected result:
import numpy as np
df.groupby('b').apply(lambda x: x.set_index('a').reindex(np.arange(1, 7))).drop(columns='b').reset_index()
Output:
b a c d
0 0 1 NaN NaN
1 0 2 1.0 4.0
2 0 3 NaN NaN
3 0 4 NaN NaN
4 0 5 5.0 6.0
5 0 6 1.0 1.0
6 1 1 1.0 4.0
7 1 2 NaN NaN
8 1 3 NaN NaN
9 1 4 5.0 6.0
10 1 5 5.0 6.0
11 1 6 1.0 1.0
12 2 1 2.0 3.0
13 2 2 NaN NaN
14 2 3 NaN NaN
15 2 4 NaN NaN
16 2 5 NaN NaN
17 2 6 5.0 6.0
I have a DataFrame with thousands of rows and hundreds of columns where I want to forward-fill the data, grouped by id and bounded by the original data's date range. What I mean is: if id 1 has data for 01/01/2020 but null values for 01/05/2020 and 02/02/2020, I would like to fill 01/05/2020 but not 02/02/2020, since 02/02/2020 is not within the 30-day period. A plain ffill fills every gap based on the last result.
import pandas as pd
import numpy as np
res = pd.DataFrame({'id': [1, 1, 1, 1, 1, 2, 2],
                    'date': ['01/01/2020', '01/05/2020', '02/03/2020', '02/05/2020',
                             '04/01/2020', '01/01/2020', '01/02/2020'],
                    'result': [1.5, np.nan, np.nan, 2.6, np.nan, np.nan, 6.0]})
res['result1'] = res.groupby(['id']).apply(lambda x: x.result.ffill()).reset_index(drop=True)
The result I get is:
id date result result1
0 1 01/01/2020 1.5 1.5
1 1 01/05/2020 NaN 1.5
2 1 02/03/2020 NaN 1.5
3 1 02/05/2020 2.6 2.6
4 1 04/01/2020 NaN 2.6
5 2 01/01/2020 NaN NaN
6 2 01/02/2020 6.0 6.0
What I want is :
id date result result1
0 1 01/01/2020 1.5 1.5
1 1 01/05/2020 NaN 1.5
2 1 02/03/2020 NaN NaN
3 1 02/05/2020 2.6 2.6
4 1 04/01/2020 NaN NaN
5 2 01/01/2020 NaN NaN
6 2 01/02/2020 6.0 6.0
You can try with merge_asof:
res['date']=pd.to_datetime(res['date'])
res = res.sort_values('date')
res1 = res.dropna(subset=['result']).rename(columns={'result':'result1'})
out = pd.merge_asof(res.reset_index(), res1, by='id', on='date',
                    tolerance=pd.Timedelta(30, unit='d'), direction='backward').sort_values('index')
Out[72]:
index id date result result1
0 0 1 2020-01-01 1.5 1.5
3 1 1 2020-01-05 NaN 1.5
4 2 1 2020-02-03 NaN NaN
5 3 1 2020-02-05 2.6 2.6
6 4 1 2020-04-01 NaN NaN
1 5 2 2020-01-01 NaN NaN
2 6 2 2020-01-02 6.0 6.0
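merge_asof matches each row to the most recent non-null result within the 30-day tolerance. The helper index column only exists to restore the original row order, so it can be dropped afterwards (a small cleanup, assuming the out frame above):
out = out.drop(columns='index').reset_index(drop=True)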
Not so elegant as Ben's merge_asof, but you can do something like this:
res['date'] = pd.to_datetime(res['date'])
# valid blocks
valids = res['result'].notna().cumsum()
# first dates in each block
first_dates = res.groupby(['id',valids])['date'].transform('min')
# How far we ffill
mask = (res['date']-first_dates)<pd.Timedelta('30D')
# ffill and then mask
res['result1'] = res['result'].groupby(res['id']).ffill().where(mask)
Output:
id date result result1
0 1 2020-01-01 1.5 1.5
1 1 2020-01-05 NaN 1.5
2 1 2020-02-03 NaN NaN
3 1 2020-02-05 2.6 2.6
4 1 2020-04-01 NaN NaN
5 2 2020-01-01 NaN NaN
6 2 2020-01-02 6.0 6.0
I have a dataframe df:
Serial_no date Index x y
1 2014-01-01 1 2.0 3.0
1 2014-03-01 2 3.0 3.0
1 2014-04-01 3 6.0 2.0
2 2011-03-01 1 5.1 1.3
2 2011-04-01 2 5.8 0.6
2 2011-05-01 3 6.5 -0.1
2 2011-07-01 4 3.0 5.0
3 2019-10-01 1 7.9 -1.5
3 2019-11-01 2 8.6 -2.2
3 2020-01-01 3 10.0 -3.6
3 2020-02-01 4 10.7 -4.3
3 2020-03-01 5 4.0 3.0
Notice:
The data is grouped by Serial_no, and the dates are reported monthly (the first of every month).
The Index column is set so that each consecutive reported date gets a consecutive number within the series.
The number of reported dates differs between Serial_no groups.
The date intervals also differ between Serial_no groups (they don't start or end on the same date).
The problem:
There is no reported data for some dates in the time series; notice some dates are missing in each Serial_no group.
I want to add a row in each group for those missing dates, with the x and y columns reported as NaN.
Example of the dataframe I need:
Serial_no date Index x y
1 2014-01-01 1 2.0 3.0
1 2014-02-01 2 NaN NaN
1 2014-03-01 3 3.0 3.0
1 2014-04-01 4 6.0 2.0
2 2011-03-01 1 5.1 1.3
2 2011-04-01 2 5.8 0.6
2 2011-05-01 3 6.5 -0.1
2 2011-06-01 4 NaN NaN
2 2011-07-01 5 3.0 5.0
3 2019-10-01 1 7.9 -1.5
3 2019-11-01 2 8.6 -2.2
3 2019-12-01 3 NaN NaN
3 2020-01-01 4 10.0 -3.6
3 2020-02-01 5 10.7 -4.3
3 2020-03-01 6 4.0 3.0
I know how to replace the blank cells with NaN once the rows with missing dates are inserted, using the following code:
import pandas as pd
import numpy as np
df['x'].replace('', np.nan, inplace=True)
df['y'].replace('', np.nan, inplace=True)
I also know how to reset the index once the rows with missing dates are inserted, using the following code:
df["Index"] = df.groupby("Serial_no",).cumcount('date')
However, I'm unsure how to locate the missing dates in each group and insert the rows for those (monthly reported) dates. Any help is appreciated.
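For reference, the gaps can also be listed per group before filling them; a short diagnostic sketch (assuming date parses with pd.to_datetime):
df['date'] = pd.to_datetime(df['date'])
# for each group, the month starts in the full range that are absent from the data
missing = (df.groupby('Serial_no')['date']
             .apply(lambda s: pd.date_range(s.min(), s.max(), freq='MS').difference(s)))
print(missing)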
Use custom function with DataFrame.asfreq in GroupBy.apply and then reassign Index by GroupBy.cumcount:
df['date'] = pd.to_datetime(df['date'])
df = (df.set_index('date')
.groupby('Serial_no')
.apply(lambda x: x.asfreq('MS'))
.drop('Serial_no', axis=1))
df = df.reset_index()
df["Index"] = df.groupby("Serial_no").cumcount() + 1
print (df)
Serial_no date Index x y
0 1 2014-01-01 1 2.0 3.0
1 1 2014-02-01 2 NaN NaN
2 1 2014-03-01 3 3.0 3.0
3 1 2014-04-01 4 6.0 2.0
4 2 2011-03-01 1 5.1 1.3
5 2 2011-04-01 2 5.8 0.6
6 2 2011-05-01 3 6.5 -0.1
7 2 2011-06-01 4 NaN NaN
8 2 2011-07-01 5 3.0 5.0
9 3 2019-10-01 1 7.9 -1.5
10 3 2019-11-01 2 8.6 -2.2
11 3 2019-12-01 3 NaN NaN
12 3 2020-01-01 4 10.0 -3.6
13 3 2020-02-01 5 10.7 -4.3
14 3 2020-03-01 6 4.0 3.0
Alternative solution with DataFrame.reindex:
df['date'] = pd.to_datetime(df['date'])
f = lambda x: x.reindex(pd.date_range(x.index.min(), x.index.max(), freq='MS', name='date'))
df = df.set_index('date').groupby('Serial_no').apply(f).drop('Serial_no', axis=1)
df = df.reset_index()
df["Index"] = df.groupby("Serial_no").cumcount() + 1
One option is with complete from pyjanitor, which abstracts the process for exposing missing rows:
# pip install pyjanitor
import pandas as pd
import janitor
# create a mapping that is applied across each Serial_no group
new_dates = {'date': lambda d: pd.date_range(d.min(), d.max(), freq='MS')}
(df
.complete(new_dates, by='Serial_no')
.assign(Index = lambda df: df.groupby('Serial_no')
.Index
.cumcount()
.add(1))
)
Serial_no date Index x y
0 1 2014-01-01 1 2.0 3.0
1 1 2014-02-01 2 NaN NaN
2 1 2014-03-01 3 3.0 3.0
3 1 2014-04-01 4 6.0 2.0
4 2 2011-03-01 1 5.1 1.3
5 2 2011-04-01 2 5.8 0.6
6 2 2011-05-01 3 6.5 -0.1
7 2 2011-06-01 4 NaN NaN
8 2 2011-07-01 5 3.0 5.0
9 3 2019-10-01 1 7.9 -1.5
10 3 2019-11-01 2 8.6 -2.2
11 3 2019-12-01 3 NaN NaN
12 3 2020-01-01 4 10.0 -3.6
13 3 2020-02-01 5 10.7 -4.3
14 3 2020-03-01 6 4.0 3.0
I have a dataframe as below:
import pandas as pd
import numpy as np
df=pd.DataFrame({'id':[0,1,2,4,5],
'A':[0,1,0,1,0],
'B':[None,None,1,None,None]})
id A B
0 0 0 NaN
1 1 1 NaN
2 2 0 1.0
3 4 1 NaN
4 5 0 NaN
Notice that the vast majority of values in column B are NaN.
The id column increments by 1, so the row between id 2 and id 4 is missing.
Each missing row that needs inserting is the same as the previous row, except for the id column.
So, for example, the result is:
id A B
0 0 0.0 NaN
1 1 1.0 NaN
2 2 0.0 1.0
3 3 0.0 1.0 <-add row here
4 4 1.0 NaN
5 5 0.0 NaN
I can do this for column A, but I don't know how to deal with column B, as ffill would fill 1.0 at rows 4 and 5, which is incorrect.
step=1
idx=np.arange(df['id'].min(), df['id'].max() + step, step)
df=df.set_index('id').reindex(idx).reset_index()
df['A']=df["A"].ffill()
EDIT:
Sorry, I forgot one situation: column B can hold different values.
When the DataFrame is as below:
id A B
0 0 0 NaN
1 1 1 NaN
2 2 0 1.0
3 4 1 NaN
4 5 0 NaN
5 6 1 2.0
6 9 0 NaN
7 10 1 NaN
the result would be:
id A B
0 0 0 NaN
1 1 1 NaN
2 2 0 1.0
3 3 0 1.0
4 4 1 NaN
5 5 0 NaN
6 6 1 2.0
7 7 1 2.0
8 8 1 2.0
9 9 0 NaN
10 10 1 NaN
Make two changes: keep a copy of the original ids, then use update with an isin mask so the forward-filled B values land only on the newly inserted rows:
s = df.id.copy()  # change 1
step = 1
idx = np.arange(df['id'].min(), df['id'].max() + step, step)
df = df.set_index('id').reindex(idx).reset_index()
df['A'] = df["A"].ffill()
df.B.update(df.B.ffill().mask(df.id.isin(s)))  # change 2
df
id A B
0 0 0.0 NaN
1 1 1.0 NaN
2 2 0.0 1.0
3 3 0.0 1.0
4 4 1.0 NaN
5 5 0.0 NaN
If I understand correctly, here is some sample code:
new_df = pd.DataFrame({
'new_id': [i for i in range(df['id'].max() + 1)],
})
df = df.merge(new_df, how='outer', left_on='id', right_on='new_id')
df = df.sort_values('new_id')
df = df.ffill()
df = df.drop(columns='id')
df
A B new_id
0 0.0 NaN 0
1 1.0 NaN 1
2 0.0 1.0 2
5 0.0 1.0 3
3 1.0 1.0 4
4 0.0 1.0 5
Try this:
df=pd.DataFrame({'id':[0,1,2,4,5],
'A':[0,1,0,1,0],
'B':[None,None,1,None,None]})
missingid = list(set(range(df.id.min(), df.id.max())) - set(df.id.tolist()))
for i in missingid:
    df.loc[len(df)] = np.concatenate((np.array([i]), df[df.id == i - 1][["A", "B"]].values[0]))
df = df.sort_values("id").reset_index(drop=True)
Output:
id A B
0 0.0 0.0 NaN
1 1.0 1.0 NaN
2 2.0 0.0 1.0
3 3.0 0.0 1.0
4 4.0 1.0 NaN
5 5.0 0.0 NaN
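Note that np.concatenate promotes the whole row to float, which is why id prints as 0.0, 1.0, and so on; if integer ids matter, a cast restores them (a small follow-up to the snippet above):
df['id'] = df['id'].astype(int)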
I have a dataframe with 50 columns. I want to replace NAs with 0 in 10 columns.
What's the simplest, most readable way of doing this?
I was hoping for something like:
cols = ['a', 'b', 'c', 'd']
df[cols].fillna(0, inplace=True)
But that gives me ValueError: Must pass DataFrame with boolean values only.
I found this answer, but it's rather hard to understand.
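For reference, a minimal frame matching the first answer's example below (a, b, and c all NaN; d and e copied from the printed output):
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': np.nan, 'b': np.nan, 'c': np.nan,
                   'd': [3, 8, 2, 7, 4, 9, 7, 6, 0, 9],
                   'e': [8, 7, 8, 4, 9, 9, 7, 5, 0, 5]})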
You can use update(); it aligns on the index and only overwrites df where the frame you pass in has non-NaN values, so columns d and e are untouched:
In [145]: df
Out[145]:
a b c d e
0 NaN NaN NaN 3 8
1 NaN NaN NaN 8 7
2 NaN NaN NaN 2 8
3 NaN NaN NaN 7 4
4 NaN NaN NaN 4 9
5 NaN NaN NaN 1 9
6 NaN NaN NaN 7 7
7 NaN NaN NaN 6 5
8 NaN NaN NaN 0 0
9 NaN NaN NaN 9 5
In [146]: df.update(df[['a','b','c']].fillna(0))
In [147]: df
Out[147]:
a b c d e
0 0.0 0.0 0.0 3 8
1 0.0 0.0 0.0 8 7
2 0.0 0.0 0.0 2 8
3 0.0 0.0 0.0 7 4
4 0.0 0.0 0.0 4 9
5 0.0 0.0 0.0 1 9
6 0.0 0.0 0.0 7 7
7 0.0 0.0 0.0 6 5
8 0.0 0.0 0.0 0 0
9 0.0 0.0 0.0 9 5
In [15]: cols= ['one', 'two']
In [16]: df
Out[16]:
one two three four five
a -0.343241 0.453029 -0.895119 bar False
b NaN NaN NaN NaN NaN
c 0.839174 0.229781 -1.244124 bar True
d NaN NaN NaN NaN NaN
e 1.300641 -1.797828 0.495313 bar True
f -0.182505 -1.527464 0.712738 bar False
g NaN NaN NaN NaN NaN
h 0.626568 -0.971003 1.192831 bar True
In [17]: df[cols]=df[cols].fillna(0)
In [18]: df
Out[18]:
one two three four five
a -0.343241 0.453029 -0.895119 bar False
b 0.000000 0.000000 NaN NaN NaN
c 0.839174 0.229781 -1.244124 bar True
d 0.000000 0.000000 NaN NaN NaN
e 1.300641 -1.797828 0.495313 bar True
f -0.182505 -1.527464 0.712738 bar False
g 0.000000 0.000000 NaN NaN NaN
h 0.626568 -0.971003 1.192831 bar True
And a version using column slicing (.loc label slices include both endpoints), which might be useful in your case:
In [46]:
df
Out[46]:
a b c d e
0 NaN NaN NaN 3 8
1 NaN NaN NaN 8 7
2 NaN NaN NaN 2 8
3 NaN NaN NaN 7 4
4 NaN NaN NaN 4 9
5 9 NaN NaN 1 9
6 NaN NaN NaN 7 7
7 NaN NaN NaN 6 5
8 NaN NaN NaN 0 0
9 NaN NaN NaN 9 5
In [47]:
df.loc[:,'a':'c'] = df.loc[:,'a':'c'].fillna(0)
df
Out[47]:
a b c d e
0 0 0 0 3 8
1 0 0 0 8 7
2 0 0 0 2 8
3 0 0 0 7 4
4 0 0 0 4 9
5 9 0 0 1 9
6 0 0 0 7 7
7 0 0 0 6 5
8 0 0 0 0 0
9 0 0 0 9 5
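A related option that avoids both the column-list assignment and the slicing: fillna also accepts a dict mapping column names to fill values, so the subset fill becomes a single call (a hedged equivalent, using the cols list from the question):
df = df.fillna({c: 0 for c in cols})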