I want to up-sample a series from weekly to daily frequency by forward filling the result.
If the last observation of my original series is NaN, I would have expected this value to be replaced by the previous valid value, but instead it remains as NaN.
SETUP
import numpy as np
import pandas as pd
all_dates = pd.date_range(start='2018-01-01', freq='W-WED', periods=4)
ts = pd.Series([1, 2, 3], index=all_dates[:3])
ts[all_dates[3]] = np.nan
ts
Out[16]:
2018-01-03 1.0
2018-01-10 2.0
2018-01-17 3.0
2018-01-24 NaN
Freq: W-WED, dtype: float64
RESULT
ts.resample('B').ffill()
Out[17]:
2018-01-03 1.0
2018-01-04 1.0
2018-01-05 1.0
2018-01-08 1.0
2018-01-09 1.0
2018-01-10 2.0
2018-01-11 2.0
2018-01-12 2.0
2018-01-15 2.0
2018-01-16 2.0
2018-01-17 3.0
2018-01-18 3.0
2018-01-19 3.0
2018-01-22 3.0
2018-01-23 3.0
2018-01-24 NaN
Freq: B, dtype: float64
I was expecting the last value to be 3 as well.
Does anyone have an explanation of this behaviour?
resample() returns a DatetimeIndexResampler, not a Series.
You need to get back a plain pandas Series first.
You can use the asfreq() method to do that before filling the NaNs: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.asfreq.html
So, this should work:
ts.resample('B').asfreq().ffill()
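As a quick check (a minimal sketch reusing the ts from the question): asfreq() first materialises the Series on the business-day grid, leaving NaNs, and the plain Series ffill() then fills every NaN, including the trailing one.
ts.resample('B').asfreq().ffill().tail(3)
Out:
2018-01-22    3.0
2018-01-23    3.0
2018-01-24    3.0
Freq: B, dtype: float64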
The point of resample followed by ffill is simply to propagate forward from the first day of the week: if the first day of the week is NaN, that's what gets filled forward. For example:
ts.iloc[1] = np.nan
ts.resample('B').ffill()
2018-01-03 1.0
2018-01-04 1.0
2018-01-05 1.0
2018-01-08 1.0
2018-01-09 1.0
2018-01-10 NaN
2018-01-11 NaN
2018-01-12 NaN
2018-01-15 NaN
2018-01-16 NaN
2018-01-17 3.0
2018-01-18 3.0
2018-01-19 3.0
2018-01-22 3.0
2018-01-23 3.0
2018-01-24 NaN
Freq: B, dtype: float64
In most cases, propagating from the previous week's data would not be desired behaviour. If you'd like to use previous weeks' data when the original (weekly) series has missing values, just ffill that series first, as in the sketch below.
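A minimal sketch of that, reusing ts from the setup above:
ts.ffill().resample('B').ffill()
Here ts.ffill() first replaces the trailing NaN with the previous valid value (3.0), so the up-sampling then propagates 3.0 through 2018-01-24.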
Related
I have a dataframe with columns of timestamp and energy usage. The timestamp is taken for every minute of the day, i.e. a total of 1440 readings for each day. I have a few missing values in the data frame.
I want to impute those missing values with the mean of the same day, same time from the last two or three weeks. That way, if the previous week's value is also missing, I can use the value from two weeks ago.
Here's an example of the data:
mains_1
timestamp
2013-01-03 00:00:00 155.00
2013-01-03 00:01:00 154.00
2013-01-03 00:02:00 NaN
2013-01-03 00:03:00 154.00
2013-01-03 00:04:00 153.00
... ...
2013-04-30 23:55:00 NaN
2013-04-30 23:56:00 182.00
2013-04-30 23:57:00 181.00
2013-04-30 23:58:00 182.00
2013-04-30 23:59:00 182.00
Right now I have this line of code:
df['mains_1'] = (df
.groupby((df.index.dayofweek * 24) + (df.index.hour) + (df.index.minute / 60))
.transform(lambda x: x.fillna(x.mean()))
)
What this does is use the average usage for the same time of the week over the whole dataset. I want it to be more precise and use only the average of the last two or three weeks.
You can concat together the Series with shift in a loop, as the index alignment will ensure it matches on the previous weeks with the same hour. Then take the mean and use .fillna to update the original.
Sample Data
import pandas as pd
import numpy as np
np.random.seed(5)
df = pd.DataFrame(index=pd.date_range('2010-01-01 10:00:00', freq='W', periods=10),
                  data=np.random.choice([1, 2, 3, 4, np.nan], 10),
                  columns=['mains_1'])
# mains_1
#2010-01-03 10:00:00 4.0
#2010-01-10 10:00:00 1.0
#2010-01-17 10:00:00 2.0
#2010-01-24 10:00:00 1.0
#2010-01-31 10:00:00 NaN
#2010-02-07 10:00:00 4.0
#2010-02-14 10:00:00 1.0
#2010-02-21 10:00:00 1.0
#2010-02-28 10:00:00 NaN
#2010-03-07 10:00:00 2.0
Code
# range(4) gives shifts of 0-3 weeks: the current value plus the previous 3 weeks.
df1 = pd.concat([df.shift(periods=x, freq='W') for x in range(4)], axis=1)
# mains_1 mains_1 mains_1 mains_1
#2010-01-03 10:00:00 4.0 NaN NaN NaN
#2010-01-10 10:00:00 1.0 4.0 NaN NaN
#2010-01-17 10:00:00 2.0 1.0 4.0 NaN
#2010-01-24 10:00:00 1.0 2.0 1.0 4.0
#2010-01-31 10:00:00 NaN 1.0 2.0 1.0
#2010-02-07 10:00:00 4.0 NaN 1.0 2.0
#2010-02-14 10:00:00 1.0 4.0 NaN 1.0
#2010-02-21 10:00:00 1.0 1.0 4.0 NaN
#2010-02-28 10:00:00 NaN 1.0 1.0 4.0
#2010-03-07 10:00:00 2.0 NaN 1.0 1.0
#2010-03-14 10:00:00 NaN 2.0 NaN 1.0
#2010-03-21 10:00:00 NaN NaN 2.0 NaN
#2010-03-28 10:00:00 NaN NaN NaN 2.0
df['mains_1'] = df['mains_1'].fillna(df1.mean(axis=1))
print(df)
mains_1
2010-01-03 10:00:00 4.000000
2010-01-10 10:00:00 1.000000
2010-01-17 10:00:00 2.000000
2010-01-24 10:00:00 1.000000
2010-01-31 10:00:00 1.333333
2010-02-07 10:00:00 4.000000
2010-02-14 10:00:00 1.000000
2010-02-21 10:00:00 1.000000
2010-02-28 10:00:00 2.000000
2010-03-07 10:00:00 2.000000
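The same idea carries over to the original minute-level data: shifting the index by whole weeks lines each timestamp up with the same weekday and time of day in earlier weeks. A sketch, assuming df holds the minute-indexed mains_1 column from the question:
# previous 1-3 weeks only, excluding the current value
prev = pd.concat([df.shift(freq=pd.Timedelta(weeks=w)) for w in range(1, 4)], axis=1)
df['mains_1'] = df['mains_1'].fillna(prev.mean(axis=1))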
I want to create a DataFrame or time series using the index of an existing time series and the values from another time series with different time indices. The series look like:
<class 'pandas.core.series.Series'>
DT
2018-01-02 172.3000
2018-01-03 174.5500
2018-01-04 173.4700
2018-01-05 175.3700
2018-01-08 175.6100
2018-01-09 175.0600
2018-01-10 174.3000
2018-01-11 175.4886
2018-01-12 177.3600
2018-01-16 179.3900
2018-01-17 179.2500
2018-01-18 180.1000
...
and
<class 'pandas.core.series.Series'>
DT
2018-01-02 NaN
2018-01-09 175.610
2018-01-16 177.360
2018-01-23 180.100
...
I want to use the index from the first series and fill it with the values at the matching indices from the second series, like:
<class 'pandas.core.series.Series'>
DT
2018-01-02 NaN
2018-01-03 NaN
2018-01-04 NaN
2018-01-05 NaN
2018-01-08 NaN
2018-01-09 175.610
2018-01-10 NaN
2018-01-11 NaN
2018-01-12 NaN
2018-01-16 177.360
2018-01-17 NaN
2018-01-18 NaN
...
Thx
IIUC, use Series.reindex:
new_s = s2.reindex(s1.index)
#2018-01-02 NaN
#2018-01-03 NaN
#2018-01-04 NaN
#2018-01-05 NaN
#2018-01-08 NaN
#2018-01-09 175.61
#2018-01-10 NaN
#2018-01-11 NaN
#2018-01-12 NaN
#2018-01-16 177.36
#2018-01-17 NaN
#2018-01-18 NaN
#Name: s2, dtype: float64
Convert your Series into DataFrames, then use the following line:
pd.merge(TS1,TS2,left_index=True,right_index=True,how='left').iloc[:,-1]
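A minimal runnable version of that, assuming TS1 and TS2 are the two series shown above (pd.merge wants DataFrames or named Series, hence the to_frame calls):
merged = pd.merge(TS1.to_frame('s1'), TS2.to_frame('s2'),
                  left_index=True, right_index=True, how='left')
new_s = merged.iloc[:, -1]  # the s2 values aligned on TS1's index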
Generating the data
import numpy as np
import pandas as pd

np.random.seed(42)
date_rng = pd.date_range(start='1/1/2018', end='1/08/2018', freq='H')
df = pd.DataFrame(np.random.randint(0, 10, size=(len(date_rng), 3)),
                  columns=['data1', 'data2', 'data3'],
                  index=date_rng)
daily_mean_df = pd.DataFrame(np.zeros([len(date_rng), 3]),
                             columns=['data1', 'data2', 'data3'],
                             index=date_rng)
mask = np.random.choice([1, 0], df.shape, p=[.35, .65]).astype(bool)
df[mask] = np.nan
df
>>>
data1 data2 data3
2018-01-01 00:00:00 1.0 3.0 NaN
2018-01-01 01:00:00 8.0 5.0 8.0
2018-01-01 02:00:00 5.0 NaN 6.0
2018-01-01 03:00:00 4.0 7.0 4.0
2018-01-01 04:00:00 NaN 8.0 NaN
... ... ... ...
2018-01-07 20:00:00 8.0 7.0 NaN
2018-01-07 21:00:00 5.0 4.0 5.0
2018-01-07 22:00:00 NaN 6.0 NaN
2018-01-07 23:00:00 2.0 4.0 3.0
2018-01-08 00:00:00 NaN NaN NaN
I want to select a specific time each day, then set all values in that day equal to the data at that time.
For example, if I select 01:00:00, then all data of 2018-01-01 will be set equal to the row at 2018-01-01 01:00:00, all data of 2018-01-02 to the row at 2018-01-02 01:00:00, etc.
I know how to select the data of the time:
timestamp = "01:00:00"
df[df.index.strftime("%H:%M:%S") == timestamp]
but I don't know how to set the rest of each day's data equal to it.
Thank you for reading.
Check with reindex:
s = df[df.index.strftime("%H:%M:%S") == timestamp]  # one row per day, at 01:00:00
s.index = s.index.date                              # key the rows by calendar date
df[:] = s.reindex(df.index.date).values             # broadcast each day's row to all its timestamps
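An equivalent sketch using DataFrame.at_time, which selects rows at a given wall-clock time directly (days without a 01:00:00 row, such as the trailing 2018-01-08 00:00:00 here, come out as NaN):
s = df.at_time(timestamp)
s.index = s.index.date
df[:] = s.reindex(df.index.date).values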
Generating the data
import numpy as np
import pandas as pd

np.random.seed(42)
date_rng = pd.date_range(start='1/1/2018', end='1/08/2018', freq='H')
df = pd.DataFrame(np.random.randint(0, 10, size=len(date_rng)),
                  columns=['data'],
                  index=date_rng)
mask = np.random.choice([1, 0], df.shape, p=[.35, .65]).astype(bool)
df[mask] = np.nan
I want to calculate a rolling std() with a window of 5: if more than half of the elements in the window are NaN, the rolling result should be NaN; if less than half are NaN, drop them and calculate std() on the remaining elements.
I only know how to calculate normal rolling:
df.rolling(5).std()
How could I specify this condition in the rolling calculation?
I think you can use the min_periods argument of rolling: with a window of 5, min_periods=3 returns NaN whenever fewer than 3 elements (i.e. less than half the window) are valid.
df['rollingstd'] = df.rolling(5, min_periods=3).std()
df.head(20)
Output:
data rollingstd
2018-01-01 00:00:00 1.0 NaN
2018-01-01 01:00:00 6.0 NaN
2018-01-01 02:00:00 1.0 2.886751
2018-01-01 03:00:00 NaN 2.886751
2018-01-01 04:00:00 5.0 2.629956
2018-01-01 05:00:00 3.0 2.217356
2018-01-01 06:00:00 NaN 2.000000
2018-01-01 07:00:00 NaN NaN
2018-01-01 08:00:00 3.0 1.154701
2018-01-01 09:00:00 NaN NaN
2018-01-01 10:00:00 5.0 NaN
2018-01-01 11:00:00 9.0 3.055050
2018-01-01 12:00:00 NaN 3.055050
2018-01-01 13:00:00 9.0 2.309401
2018-01-01 14:00:00 1.0 3.829708
2018-01-01 15:00:00 0.0 4.924429
2018-01-01 16:00:00 3.0 4.031129
2018-01-01 17:00:00 0.0 3.781534
2018-01-01 18:00:00 1.0 1.224745
2018-01-01 19:00:00 NaN 1.414214
Here is an alternative, more customisable method:
Write a custom function for your logic that takes an array of window-size elements as input and returns the wanted result for that window (the question asks for std, so this uses np.std with ddof=1 to match pandas):
def cus_std(x):
    valid = ~np.isnan(x)   # mask of non-NaN elements in the window
    if valid.sum() > 2:    # more than half of the 5-element window is valid
        return np.std(x[valid], ddof=1)
    return np.nan
Then call the rolling function on your DataFrame as below:
df.rolling(5).apply(cus_std, raw=True)
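And to assign the result back, mirroring the min_periods answer above (rollingstd is just the column name used there):
df['rollingstd'] = df.rolling(5).apply(cus_std, raw=True)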
I have a dataframe (named df) sorted by identifier, id_number and contract_year_month in order like this so far:
identifier  id_number  contract_year_month  collection_year_month
K001 1 2018-01-03 2018-01-09
K001 1 2018-01-08 2018-01-10
K001 2 2018-01-01 2018-01-05
K001 2 2018-01-15 2018-01-18
K002 4 2018-01-04 2018-01-07
K002 4 2018-01-09 2018-01-15
and would like to add a column named 'date_difference' that consists of contract_year_month minus the previous row's collection_year_month within each identifier and id_number group (e.g. 2018-01-08 minus 2018-01-09),
so that the df would be:
identifier  id_number  contract_year_month  collection_year_month  date_difference
K001 1 2018-01-03 2018-01-09
K001 1 2018-01-08 2018-01-10 -1
K001 2 2018-01-01 2018-01-05
K001 2 2018-01-15 2018-01-18 10
K002 4 2018-01-04 2018-01-07
K002 4 2018-01-09 2018-01-15 2
I already converted the contract_year_month and collection_year_month columns to datetime, and tried a simple shift or iloc approach, but neither works.
df["date_difference"] = df.groupby(["identifier", "id_number"])["contract_year_month"]
Is there any way to use groupby to get the difference between the current row's value and the previous row's value in another column, separated by the two identifiers? (I've searched for an hour but couldn't find a hint...) I would sincerely appreciate any advice.
Here is one potential way to do this.
First create a boolean mask, then use numpy.where and Series.shift to create the column date_difference:
mask = df.duplicated(['identifier', 'id_number'])
df['date_difference'] = np.where(mask,
                                 (df['contract_year_month']
                                  - df['collection_year_month'].shift(1)).dt.days,
                                 np.nan)
[output]
identifier id_number contract_year_month collection_year_month date_difference
0 K001 1 2018-01-03 2018-01-09 NaN
1 K001 1 2018-01-08 2018-01-10 -1.0
2 K001 2 2018-01-01 2018-01-05 NaN
3 K001 2 2018-01-15 2018-01-18 10.0
4 K002 4 2018-01-04 2018-01-07 NaN
5 K002 4 2018-01-09 2018-01-15 2.0
Here's one approach using your groupby() (updated based on feedback from @piRSquared):
In []:
(df['contract_year_month']
 - df['collection_year_month']
     .groupby([df['identifier'], df['id_number']])
     .shift()).dt.days
Out[]:
0 NaN
1 -1.0
2 NaN
3 10.0
4 NaN
5 2.0
dtype: float64
You can just assign this to df['date_difference']:
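df['date_difference'] = (df['contract_year_month']
                         - df['collection_year_month']
                             .groupby([df['identifier'], df['id_number']])
                             .shift()).dt.days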