I am trying to fill all missing values through the end of the dataframe but have been unable to do so. In the example below, I am taking the average of the last three values. My code only fills through 2017-01-10, whereas I want to fill through 2017-01-14. For 1/14, I want to use the values from 11, 12 & 13. Please help.
import pandas as pd
df = pd.DataFrame([
{"ds":"2017-01-01","y":3},
{"ds":"2017-01-02","y":4},
{"ds":"2017-01-03","y":6},
{"ds":"2017-01-04","y":2},
{"ds":"2017-01-05","y":7},
{"ds":"2017-01-06","y":9},
{"ds":"2017-01-07","y":8},
{"ds":"2017-01-08","y":2},
{"ds":"2017-01-09"},
{"ds":"2017-01-10"},
{"ds":"2017-01-11"},
{"ds":"2017-01-12"},
{"ds":"2017-01-13"},
{"ds":"2017-01-14"}
])
df["y"].fillna(df["y"].rolling(3,min_periods=1).mean(),axis=0,inplace=True)
Result:
ds y
0 2017-01-01 3.0
1 2017-01-02 4.0
2 2017-01-03 6.0
3 2017-01-04 2.0
4 2017-01-05 7.0
5 2017-01-06 9.0
6 2017-01-07 8.0
7 2017-01-08 2.0
8 2017-01-09 5.0
9 2017-01-10 2.0
10 2017-01-11 NaN
11 2017-01-12 NaN
12 2017-01-13 NaN
13 2017-01-14 NaN
Desired output:
You can iterate over the values in y and if a nan value is encountered, look at the 3 earlier values and use .at[] to set the mean of the 3 earlier values as the new value:
import numpy as np

for index, value in df['y'].items():
    if np.isnan(value):
        # mean of the three preceding rows (earlier fills are reused)
        df.at[index, 'y'] = df['y'].iloc[index-3:index].mean()
Resulting dataframe for the missing values:
7 2017-01-08 2.000000
8 2017-01-09 6.333333
9 2017-01-10 5.444444
10 2017-01-11 4.592593
11 2017-01-12 5.456790
12 2017-01-13 5.164609
13 2017-01-14 5.071331
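The loop above can be wrapped in a small helper so that the window size becomes a parameter. This is only a minimal sketch, assuming the values are in row order; the name fill_with_trailing_mean is hypothetical and not part of pandas.

```python
import numpy as np
import pandas as pd


def fill_with_trailing_mean(series, window=3):
    """Fill NaNs in order, each with the mean of the `window` preceding values.

    Hypothetical helper: values filled earlier feed into later fills.
    """
    values = series.to_numpy(dtype=float, copy=True)
    for i in range(len(values)):
        if np.isnan(values[i]) and i > 0:
            start = max(i - window, 0)  # clamp so the slice never wraps around
            values[i] = values[start:i].mean()
    return pd.Series(values, index=series.index, name=series.name)


y = pd.Series([3, 4, 6, 2, 7, 9, 8, 2] + [np.nan] * 6, name="y")
filled = fill_with_trailing_mean(y)
print(filled.iloc[8:])
```

Working on a NumPy copy avoids repeated label lookups and leaves the original Series untouched.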
Related
Hey, I have a question about the pandas rolling function.
I am currently using it to get the mean for the last 10 days of my time series data.
Example df:
column
2020-12-04 14
2020-12-05 15
2020-12-06 16
2020-12-07 17
2020-12-08 18
2020-12-09 19
2020-12-13 20
2020-12-14 11
2020-12-16 12
2020-12-17 13
Usage:
df['column'].rolling('10D').mean()
But the function calculates the rolling mean over the last 10 calendar days. For example, if the current row date is 2020-12-17, it goes back only to 2020-12-07.
However, I would like the rolling mean over the last 10 days that are in the data frame, i.e. back to 2020-12-04.
How can I achieve it?
Edit: I can also have a datetime index at 15-minute intervals, so window=10 does not help in that case, though it works here.
As said in the comments by @cs95, if you want to consider only the rows that are in the dataframe, you can ignore that your data is part of a time series and just specify a window sized by a number of rows instead of by a number of days. In essence:
df['column'].rolling(window=10).mean()
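To make the difference between the two window types concrete, here is a small sketch on the example data that counts how many rows each window covers at the last date. The variable names are made up for illustration.

```python
import pandas as pd

dates = pd.to_datetime([
    "2020-12-04", "2020-12-05", "2020-12-06", "2020-12-07", "2020-12-08",
    "2020-12-09", "2020-12-13", "2020-12-14", "2020-12-16", "2020-12-17",
])
s = pd.Series([14, 15, 16, 17, 18, 19, 20, 11, 12, 13], index=dates, name="column")

# Time-based window: only rows that fall within the trailing 10 calendar days
by_days = s.rolling("10D").count()
# Row-based window: the last 10 rows present, whatever their dates are
by_rows = s.rolling(window=10, min_periods=1).count()

print(by_days.iloc[-1], by_rows.iloc[-1])
```

At 2020-12-17 the time-based window only sees the rows after 2020-12-07, while the row-based window reaches back to 2020-12-04.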
Just one little detail to remember: you have missing dates in your dataframe. You should fill them in, otherwise it will not be a 10-day window. Instead you would have a rolling window over 10 dates, which would be pretty meaningless if dates are randomly missing.
r = pd.date_range(start=df1.Date.min(), end=df1.Date.max())
df1 = df1.set_index('Date').reindex(r).fillna(0).rename_axis('Date').reset_index()
which gives you the dataframe:
Date column
0 2020-12-04 14.0
1 2020-12-05 15.0
2 2020-12-06 16.0
3 2020-12-07 17.0
4 2020-12-08 18.0
5 2020-12-09 19.0
6 2020-12-10 0.0
7 2020-12-11 0.0
8 2020-12-12 0.0
9 2020-12-13 20.0
10 2020-12-14 11.0
11 2020-12-15 0.0
12 2020-12-16 12.0
13 2020-12-17 13.0
Then applying:
df1['Mean']=df1['column'].rolling(window=10).mean()
returns
Date column Mean
0 2020-12-04 14.0 NaN
1 2020-12-05 15.0 NaN
2 2020-12-06 16.0 NaN
3 2020-12-07 17.0 NaN
4 2020-12-08 18.0 NaN
5 2020-12-09 19.0 NaN
6 2020-12-10 0.0 NaN
7 2020-12-11 0.0 NaN
8 2020-12-12 0.0 NaN
9 2020-12-13 20.0 11.9
10 2020-12-14 11.0 11.6
11 2020-12-15 0.0 10.1
12 2020-12-16 12.0 9.7
13 2020-12-17 13.0 9.3
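Filling the missing dates with 0 pulls the mean down. If the gaps should be ignored rather than counted as zeros, one possible variation (an assumption about what is wanted, not part of the answer above) is to leave them as NaN and let rolling skip them via min_periods:

```python
import pandas as pd

df1 = pd.DataFrame({
    "Date": pd.to_datetime([
        "2020-12-04", "2020-12-05", "2020-12-06", "2020-12-07", "2020-12-08",
        "2020-12-09", "2020-12-13", "2020-12-14", "2020-12-16", "2020-12-17",
    ]),
    "column": [14, 15, 16, 17, 18, 19, 20, 11, 12, 13],
})

full_range = pd.date_range(df1["Date"].min(), df1["Date"].max(), freq="D")
# Reindex onto every calendar day; the added days stay NaN instead of 0
daily = df1.set_index("Date")["column"].reindex(full_range)
# NaN rows are skipped, so each mean uses only the days that have data
mean_existing = daily.rolling(window=10, min_periods=1).mean()

print(mean_existing.tail())
```

With min_periods=1 the first rows also get a value, which differs from the NaN rows shown in the output above.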
I have a dataset like
Sno change date
0 NaN 2017-01-01
1 NaN 2017-02-01
2 NaN 2017-03-01
3 NaN 2017-04-01
4 NaN 2017-05-01
5 NaN 2017-06-01
6 NaN 2017-07-01
7 NaN 2017-08-01
8 0.0 2017-09-01
9 NaN 2017-10-01
10 NaN 2017-11-01
11 1 2017-12-01
12 NaN 2018-01-01
13 NaN 2018-02-01
I want to get the last 5 rows of "date" column in the data frame when the value in column "change" changes from NaN to anything else. So for this example, it will be divided into two sets:
Sno date
3 2017-04-01
4 2017-05-01
5 2017-06-01
6 2017-07-01
7 2017-08-01
8 2017-09-01
and
Sno date
6 2017-07-01
7 2017-08-01
8 2017-09-01
9 2017-10-01
10 2017-11-01
11 2017-12-01
Can anyone help me to get this? Thank you
You can try something like this, with loc and isna:
#df=df.set_index('Sno')
idxs=df.index[~df.change.isna()]
sets=[df.loc[i-5:i,['date']] for i in idxs]
Output:
sets
[ date
Sno
3 2017-04-01
4 2017-05-01
5 2017-06-01
6 2017-07-01
7 2017-08-01
8 2017-09-01,
date
Sno
6 2017-07-01
7 2017-08-01
8 2017-09-01
9 2017-10-01
10 2017-11-01
11 2017-12-01]
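For reference, a self-contained version of the approach above, rebuilding the example data so it can be run directly. The construction of df is an assumption about how the data was loaded.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "change": [np.nan] * 8 + [0.0, np.nan, np.nan, 1.0, np.nan, np.nan],
    "date": pd.date_range("2017-01-01", periods=14, freq="MS"),
})
df.index.name = "Sno"

# Index labels of the rows where "change" holds a value
idxs = df.index[df["change"].notna()]
# .loc slices by label and includes both endpoints, so each set has 6 rows
sets = [df.loc[i - 5:i, ["date"]] for i in idxs]

for part in sets:
    print(part)
```

Because .loc is label-based and inclusive, this relies on Sno being a sorted integer index.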
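You can use isna() to check for NaN values, then np.where to extract the locations of the rows where the value changes from NaN, and finally np.r_ to create the slices:
import numpy as np
s = df.change.isna()
# fill_value=False keeps the shifted mask boolean (no NaN in the first row)
valids = np.where(s.shift(fill_value=False) & (~s))[0]
[df.iloc[np.r_[x-5:x]] for x in valids]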
[ Sno change date
3 3 NaN 2017-04-01
4 4 NaN 2017-05-01
5 5 NaN 2017-06-01
6 6 NaN 2017-07-01
7 7 NaN 2017-08-01,
Sno change date
6 6 NaN 2017-07-01
7 7 NaN 2017-08-01
8 8 0.0 2017-09-01
9 9 NaN 2017-10-01
10 10 NaN 2017-11-01]
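If the changed row itself should be part of each set, as in the output requested in the question, a purely positional variant could look like the following. This is a sketch under the assumption that rows are already in chronological order; the letter index is only there to show that it does not depend on integer labels.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "change": [np.nan] * 8 + [0.0, np.nan, np.nan, 1.0, np.nan, np.nan],
        "date": pd.date_range("2017-01-01", periods=14, freq="MS"),
    },
    index=list("abcdefghijklmn"),  # deliberately non-integer labels
)

is_na = df["change"].isna().to_numpy()
# Positions where the previous row is NaN and the current row is not
change_pos = np.flatnonzero(is_na[:-1] & ~is_na[1:]) + 1
# Five preceding rows plus the changed row; clamp at 0 near the top
windows = [df.iloc[max(p - 5, 0):p + 1] for p in change_pos]

for w in windows:
    print(w)
```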
For a given pandas data frame called full_df which looks like
index id timestamp data
------- ---- ------------ ------
1 1 2017-01-01 10.0
2 1 2017-02-01 11.0
3 1 2017-04-01 13.0
4 2 2017-02-01 1.0
5 2 2017-03-01 2.0
6 2 2017-05-01 9.0
The start and end dates (and the time delta between start and end) vary.
But I need an id-wise resampled version (added rows marked with *)
index id timestamp data
------- ---- ------------ ------ ----
1 1 2017-01-01 10.0
2 1 2017-02-01 11.0
3 1 2017-03-01 NaN *
4 1 2017-04-01 13.0
5 2 2017-02-01 1.0
6 2 2017-03-01 2.0
7 2 2017-04-01 NaN *
8 2 2017-05-01 9.0
Because the dataset is very large, I was wondering if there is a more efficient way of doing so than
Do full_df.groupby('id')
Do for each group df
df.index = pd.DatetimeIndex(df['timestamp'])
all_days = pd.date_range(df.index.min(), df.index.max(), freq='MS')
df = df.reindex(all_days)
Combine all groups again with a new index
That's time consuming and not very elegant. Any ideas?
Using resample
In [1175]: (df.set_index('timestamp').groupby('id').resample('MS').asfreq()
              .drop(columns=['id', 'index'], errors='ignore').reset_index())
Out[1175]:
id timestamp data
0 1 2017-01-01 10.0
1 1 2017-02-01 11.0
2 1 2017-03-01 NaN
3 1 2017-04-01 13.0
4 2 2017-02-01 1.0
5 2 2017-03-01 2.0
6 2 2017-04-01 NaN
7 2 2017-05-01 9.0
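A self-contained variation of the same idea, selecting the data column before resampling so that no columns have to be dropped afterwards. This is a sketch that rebuilds the example frame; the original index column is left out on purpose.

```python
import pandas as pd

full_df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2],
    "timestamp": pd.to_datetime([
        "2017-01-01", "2017-02-01", "2017-04-01",
        "2017-02-01", "2017-03-01", "2017-05-01",
    ]),
    "data": [10.0, 11.0, 13.0, 1.0, 2.0, 9.0],
})

resampled = (
    full_df.set_index("timestamp")
    .groupby("id")["data"]   # resample each id separately
    .resample("MS")          # month-start frequency
    .asfreq()                # insert missing months as NaN
    .reset_index()
)

print(resampled)
```

Each id is only expanded between its own first and last timestamp, so the result does not grow beyond what each group needs.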
Details
In [1176]: df
Out[1176]:
index id timestamp data
0 1 1 2017-01-01 10.0
1 2 1 2017-02-01 11.0
2 3 1 2017-04-01 13.0
3 4 2 2017-02-01 1.0
4 5 2 2017-03-01 2.0
5 6 2 2017-05-01 9.0
In [1177]: df.dtypes
Out[1177]:
index int64
id int64
timestamp datetime64[ns]
data float64
dtype: object
Edit to add: this way uses the min/max of dates for full_df, not df. If there is wide variation in start/end dates between IDs, this will unfortunately inflate the dataframe and @JohnGalt's method is better. Nevertheless I'll leave this here as an alternate approach, as it ought to be faster than groupby/resample for cases where it is appropriate.
I think the most efficient approach is likely going to be with stack/unstack or melt/pivot.
You could do something like this, for example:
full_df.set_index(['timestamp','id']).unstack('id').stack('id',dropna=False)
index data
timestamp id
2017-01-01 1 1.0 10.0
2 NaN NaN
2017-02-01 1 2.0 11.0
2 4.0 1.0
2017-03-01 1 NaN NaN
2 5.0 2.0
2017-04-01 1 3.0 13.0
2 NaN NaN
2017-05-01 1 NaN NaN
2 6.0 9.0
Just add reset_index().set_index('id') if you want it to display more like how you have it above. Note in particular the use of dropna=False with stack which preserves the NaN placeholders. Without that, the stack/unstack method just leaves you back where you started.
This method automatically includes the min & max dates, and all dates present for at least one id. If there are interior timestamps missing for everyone, then you need to add a resample like this:
full_df.set_index(['timestamp','id']).unstack('id')\
       .resample('MS').mean()\
       .stack('id',dropna=False)
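The melt/pivot route mentioned above can be sketched as follows. This is an illustrative variation, not the code from this answer; it has the side benefit that melt keeps the NaN placeholders without needing a dropna argument.

```python
import pandas as pd

full_df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2],
    "timestamp": pd.to_datetime([
        "2017-01-01", "2017-02-01", "2017-04-01",
        "2017-02-01", "2017-03-01", "2017-05-01",
    ]),
    "data": [10.0, 11.0, 13.0, 1.0, 2.0, 9.0],
})

# One column per id; asfreq adds any month that is missing for every id
wide = full_df.pivot(index="timestamp", columns="id", values="data").asfreq("MS")
# Back to long format; NaN cells are kept as rows
long_df = (
    wide.reset_index()
    .melt(id_vars="timestamp", var_name="id", value_name="data")
    .sort_values(["id", "timestamp"])
    .reset_index(drop=True)
)

print(long_df)
```

As noted above, this spans the global min/max dates, so every id gets all five months here.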
I have a sparse dataframe including dates of when inventory is bought or sold like the following:
Date Inventory
2017-01-01 10
2017-01-05 -5
2017-01-07 15
2017-01-09 -20
The first step I would like to solve is to add in the other dates. I know you can use resample, but I am highlighting this part in case it has an impact on the next, more difficult part. As below:
Date Inventory
2017-01-01 10
2017-01-02 NaN
2017-01-03 NaN
2017-01-04 NaN
2017-01-05 -5
2017-01-06 NaN
2017-01-07 15
2017-01-08 NaN
2017-01-09 -20
The final step is to fill forward over the NaNs, except that once it encounters a new value, that value gets added to the current value of the row above, so that the final dataframe looks like the following:
Date Inventory
2017-01-01 10
2017-01-02 10
2017-01-03 10
2017-01-04 10
2017-01-05 5
2017-01-06 5
2017-01-07 20
2017-01-08 20
2017-01-09 0
2017-01-10 0
I am trying to get a pythonic approach to this and not a loop based approach as that will be very slow.
The example should also work for a table with multiple columns as such:
Date InventoryA InventoryB
2017-01-01 10 NaN
2017-01-02 NaN NaN
2017-01-03 NaN 5
2017-01-04 NaN 5
2017-01-05 -5 NaN
2017-01-06 NaN -10
2017-01-07 15 NaN
2017-01-08 NaN NaN
2017-01-09 -20 NaN
would become:
Date InventoryA InventoryB
2017-01-01 10 0
2017-01-02 10 0
2017-01-03 10 5
2017-01-04 10 10
2017-01-05 5 10
2017-01-06 5 0
2017-01-07 20 0
2017-01-08 20 0
2017-01-09 0 0
2017-01-10 0 0
Hope that helps too. I think the current solution will have a problem with the NaNs as such.
Thanks
You can just fill the missing values with 0 after resampling (no inventory change on that day), and then use cumsum
df.fillna(0).cumsum()
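Applied to the two-column example from the question, this can be sketched as below. The frame is rebuilt by hand here, with the dates assumed to be the index.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "InventoryA": [10, np.nan, np.nan, np.nan, -5, np.nan, 15, np.nan, -20],
        "InventoryB": [np.nan, np.nan, 5, 5, np.nan, -10, np.nan, np.nan, np.nan],
    },
    index=pd.date_range("2017-01-01", periods=9, freq="D", name="Date"),
)

# A missing value means "no change that day", so treat it as 0 and accumulate
running = df.fillna(0).cumsum()

print(running)
```

cumsum works column by column, so any number of inventory columns is handled the same way.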
You're simply doing the two steps in the wrong order :)
df['Inventory'].cumsum().resample('D').ffill()
Edit: you might need to set the Date as index first.
df = df.set_index('Date')
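A small sketch on the sparse example, showing that accumulating first and then carrying forward gives the same running total as expanding to daily rows first. The variable names are arbitrary.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": pd.to_datetime(["2017-01-01", "2017-01-05", "2017-01-07", "2017-01-09"]),
    "Inventory": [10, -5, 15, -20],
}).set_index("Date")

# Order A: running total on the sparse rows, then expand to daily and carry forward
total_a = df["Inventory"].cumsum().resample("D").ffill()
# Order B: expand to daily, treat missing days as zero change, then accumulate
total_b = df["Inventory"].resample("D").asfreq().fillna(0).cumsum()

print(total_a)
```

Order A does the cumulative sum on far fewer rows, which may matter when the data is very sparse.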
Part 1 : Assuming df is your
Date Inventory
2017-01-01 10
2017-01-05 -5
2017-01-07 15
2017-01-09 -20
Then
import pandas as pd
import datetime
df_new = pd.DataFrame([df.Date.min() + datetime.timedelta(days=day) for day in range((df.Date.max() - df.Date.min()).days+1)])
df_new = df_new.merge(df, left_on=0, right_on='Date',how="left").drop("Date",axis=1)
df_new.columns = df.columns
Gives you :
Date Inventory
0 2017-01-01 10.0
1 2017-01-02 NaN
2 2017-01-03 NaN
3 2017-01-04 NaN
4 2017-01-05 -5.0
5 2017-01-06 NaN
6 2017-01-07 15.0
7 2017-01-08 NaN
8 2017-01-09 -20.0
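The same daily calendar can also be built with pd.date_range instead of a list comprehension over timedeltas. This is an alternative sketch, assuming Date is already a datetime column; the name calendar is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": pd.to_datetime(["2017-01-01", "2017-01-05", "2017-01-07", "2017-01-09"]),
    "Inventory": [10, -5, 15, -20],
})

# One row per calendar day between the first and last date
calendar = pd.DataFrame(
    {"Date": pd.date_range(df["Date"].min(), df["Date"].max(), freq="D")}
)
# Left merge keeps every day; days without a record get NaN
df_new = calendar.merge(df, on="Date", how="left")

print(df_new)
```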
part 2
From fillna method descriptions:
method : {‘backfill’, ‘bfill’, ‘pad’, ‘ffill’, None}, default None
Method to use for filling holes in reindexed Series.
pad / ffill: propagate last valid observation forward to next valid.
backfill / bfill: use NEXT valid observation to fill gap.
df_new.Inventory = df_new.Inventory.ffill()
Gives you
Date Inventory
0 2017-01-01 10.0
1 2017-01-02 10.0
2 2017-01-03 10.0
3 2017-01-04 10.0
4 2017-01-05 -5.0
5 2017-01-06 -5.0
6 2017-01-07 15.0
7 2017-01-08 15.0
8 2017-01-09 -20.0
You should be able to generalise it to more than one column once you understand how it can be done with one.
I have a dataframe that includes two columns like the following:
date value
0 2017-05-01 1
1 2017-05-08 4
2 2017-05-15 9
Each row shows the Monday of a week, and I have a value only for that specific day. I want to estimate this value for all days of the week until the next Monday, and get the following output:
date value
0 2017-05-01 1
1 2017-05-02 1
2 2017-05-03 1
3 2017-05-04 1
4 2017-05-05 1
5 2017-05-06 1
6 2017-05-07 1
7 2017-05-08 4
8 2017-05-09 4
9 2017-05-10 4
10 2017-05-11 4
11 2017-05-12 4
12 2017-05-13 4
13 2017-05-14 4
14 2017-05-15 9
15 2017-05-16 9
16 2017-05-17 9
17 2017-05-18 9
18 2017-05-19 9
19 2017-05-20 9
20 2017-05-21 9
This link shows how to select the range in a DataFrame, but I don't know how to fill the value column as I explained.
Here is a solution using pandas reindex and ffill:
# Make sure dates is treated as datetime
df['date'] = pd.to_datetime(df['date'], format = "%Y-%m-%d")
from pandas.tseries.offsets import DateOffset
# Create target dates: all days in the weeks in the original dataframe
new_index = pd.date_range(start=df['date'].iloc[0],
                          end=df['date'].iloc[-1] + DateOffset(6),
                          freq='D')
# Temporarily set dates as index, conform to target dates and forward fill data
# Finally reset the index as in the original df
out = df.set_index('date')\
        .reindex(new_index).ffill()\
        .reset_index(drop=False)\
        .rename(columns = {'index' : 'date'})
Which gives the expected result:
date value
0 2017-05-01 1.0
1 2017-05-02 1.0
2 2017-05-03 1.0
3 2017-05-04 1.0
4 2017-05-05 1.0
5 2017-05-06 1.0
6 2017-05-07 1.0
7 2017-05-08 4.0
8 2017-05-09 4.0
9 2017-05-10 4.0
10 2017-05-11 4.0
11 2017-05-12 4.0
12 2017-05-13 4.0
13 2017-05-14 4.0
14 2017-05-15 9.0
15 2017-05-16 9.0
16 2017-05-17 9.0
17 2017-05-18 9.0
18 2017-05-19 9.0
19 2017-05-20 9.0
20 2017-05-21 9.0
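If the value column should stay an integer, one alternative is to repeat each row seven times and shift the dates. This is a different technique from the reindex/ffill answer above, and it assumes every row is a Monday that should cover exactly the seven days starting on it.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2017-05-01", "2017-05-08", "2017-05-15"]),
    "value": [1, 4, 9],
})

# Repeat every row 7 times: one copy per day of the week
out = df.loc[df.index.repeat(7)].reset_index(drop=True)
# Day offsets 0..6 for each original row, added positionally to the dates
offsets = pd.to_timedelta(np.tile(np.arange(7), len(df)), unit="D")
out["date"] = out["date"] + offsets

print(out)
```

No NaN is ever introduced here, so value keeps its original dtype.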