Generating the data
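import numpy as np
import pandas as pd

np.random.seed(42)  # seed NumPy's generator, since the data comes from np.random
date_rng = pd.date_range(start='1/1/2018', end='1/08/2018', freq='H')
df = pd.DataFrame(np.random.randint(0, 10, size=(len(date_rng))),
                  columns=['data'],
                  index=date_rng)
# Replace roughly 35% of the values with NaN
mask = np.random.choice([1, 0], df.shape, p=[.35, .65]).astype(bool)
df[mask] = np.nan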
I want to compute a rolling std() with a window of 5. If more than half of the elements in a window are NaN, the result for that window should be NaN; if fewer than half are NaN, the NaN values should be dropped and std() computed on the remaining elements.
I only know how to compute a plain rolling std:
df.rolling(5).std()
How can I specify this condition in the rolling calculation?
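I think you can use the min_periods argument of the rolling function. With a window of 5, min_periods=3 returns NaN whenever fewer than 3 values in the window are non-NaN, and otherwise computes std() over the non-NaN values only: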
df['rollingstd'] = df.rolling(5, min_periods=3).std()
df.head(20)
Output:
data rollingstd
2018-01-01 00:00:00 1.0 NaN
2018-01-01 01:00:00 6.0 NaN
2018-01-01 02:00:00 1.0 2.886751
2018-01-01 03:00:00 NaN 2.886751
2018-01-01 04:00:00 5.0 2.629956
2018-01-01 05:00:00 3.0 2.217356
2018-01-01 06:00:00 NaN 2.000000
2018-01-01 07:00:00 NaN NaN
2018-01-01 08:00:00 3.0 1.154701
2018-01-01 09:00:00 NaN NaN
2018-01-01 10:00:00 5.0 NaN
2018-01-01 11:00:00 9.0 3.055050
2018-01-01 12:00:00 NaN 3.055050
2018-01-01 13:00:00 9.0 2.309401
2018-01-01 14:00:00 1.0 3.829708
2018-01-01 15:00:00 0.0 4.924429
2018-01-01 16:00:00 3.0 4.031129
2018-01-01 17:00:00 0.0 3.781534
2018-01-01 18:00:00 1.0 1.224745
2018-01-01 19:00:00 NaN 1.414214
Here is an alternative, more customizable method.
Write a custom function for your logic that takes the elements of one window as input and returns the desired result for that window:
def cus_mean(x):
    notnone = ~np.isnan(x)
    # Need at least 3 non-NaN values in the window of 5
    if notnone.sum() > 2:
        return np.mean(x[notnone])
    return np.nan
Then call the rolling function on your dataframe as below. Pass min_periods=1 so that windows containing NaN are still handed to the function:
df.rolling(5, min_periods=1).apply(cus_mean)
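Since the question asks for std() rather than the mean, the same idea can be sketched for the standard deviation. This is only a minimal sketch and not part of the original answer: the helper name std_if_enough and its min_valid parameter are hypothetical, and the sample values are the first ten rows of the output shown earlier. It also compares the custom result with rolling(5, min_periods=3).std().

```python
import numpy as np
import pandas as pd

# First ten values from the sample output above
s = pd.Series([1.0, 6.0, 1.0, np.nan, 5.0, 3.0, np.nan, np.nan, 3.0, np.nan])

def std_if_enough(window, min_valid=3):
    """Hypothetical helper: sample std of the non-NaN values, or NaN if too few."""
    valid = window[~np.isnan(window)]
    if valid.size < min_valid:
        return np.nan
    return valid.std(ddof=1)  # ddof=1 matches the pandas default

# Built-in route: min_periods counts non-NaN observations in the window
builtin = s.rolling(5, min_periods=3).std()

# Custom route: min_periods=1 so that windows with NaN still reach the function
custom = s.rolling(5, min_periods=1).apply(std_if_enough, raw=True)

print(pd.DataFrame({"data": s, "builtin": builtin, "custom": custom}))
```

For a plain std() the built-in min_periods route is enough; the custom function is mainly useful when the per-window rule is more involved than a count of valid values.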
I am trying to add several dataframes that contain NaN values. The dataframes are indexed by a time series, and in my case a NaN is meaningful: it means that no measurement was taken. So if all the dataframes I'm adding have a NaN for a given timestamp, the result must have a NaN for that timestamp. But if one or more of them have a value for the timestamp, I need the sum of those values.
EDIT: Also, in my case a 0 is different from a NaN. It means that a measurement was taken and it measured 0 activity, whereas a NaN means that there was no measurement. So any solution using fillna(0) won't work.
I haven't found a proper way to do this yet. Here is an example of what I want to do:
import numpy as np
import pandas as pd
df1 = pd.DataFrame({'value': [0, 1, 1, 1, np.nan, np.nan, np.nan]},
                   index=pd.date_range("01/01/2020 00:00", "01/01/2020 01:00", freq='10min'))
df2 = pd.DataFrame({'value': [0, 5, 5, 5, 5, 5, np.nan]},
                   index=pd.date_range("01/01/2020 00:00", "01/01/2020 01:00", freq='10min'))
df1 + df2
What I get:
value
2020-01-01 00:00:00 0.0
2020-01-01 00:10:00 6.0
2020-01-01 00:20:00 6.0
2020-01-01 00:30:00 6.0
2020-01-01 00:40:00 NaN
2020-01-01 00:50:00 NaN
2020-01-01 01:00:00 NaN
What I would want to have as a result :
value
2020-01-01 00:00:00 0.0
2020-01-01 00:10:00 6.0
2020-01-01 00:20:00 6.0
2020-01-01 00:30:00 6.0
2020-01-01 00:40:00 5.0
2020-01-01 00:50:00 5.0
2020-01-01 01:00:00 NaN
Does anybody know a clean way to do this?
Thank you.
(I'm using Python 3.9.1 and pandas 1.2.4)
You can use add with the fill_value=0 option. The fill value is only substituted where exactly one of the two frames is missing a value, so the "all NaN" combinations stay NaN:
df1.add(df2, fill_value=0)
output:
value
2020-01-01 00:00:00 0.0
2020-01-01 00:10:00 6.0
2020-01-01 00:20:00 6.0
2020-01-01 00:30:00 6.0
2020-01-01 00:40:00 5.0
2020-01-01 00:50:00 5.0
2020-01-01 01:00:00 NaN
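The question mentions adding several dataframes, not just two. A minimal sketch of how the same approach might extend to a list of frames is given below; the third frame df3 and the variable names are made up for illustration. It folds add(fill_value=0) over the list with functools.reduce, and cross-checks the result against a concat followed by groupby(level=0).sum(min_count=1), which also leaves a timestamp as NaN when every frame is NaN there.

```python
from functools import reduce

import numpy as np
import pandas as pd

idx = pd.date_range("2020-01-01 00:00", periods=7, freq="10min")
df1 = pd.DataFrame({"value": [0, 1, 1, 1, np.nan, np.nan, np.nan]}, index=idx)
df2 = pd.DataFrame({"value": [0, 5, 5, 5, 5, 5, np.nan]}, index=idx)
# Hypothetical third frame, only here to show more than two operands
df3 = pd.DataFrame({"value": [np.nan, 2, np.nan, 2, np.nan, 2, np.nan]}, index=idx)

frames = [df1, df2, df3]

# Pairwise add: a NaN counts as 0 only when the other side has a value
total = reduce(lambda a, b: a.add(b, fill_value=0), frames)

# Alternative: stack the frames and sum per timestamp,
# requiring at least one non-NaN value per group
alt = pd.concat(frames).groupby(level=0).sum(min_count=1)

print(total)
```

Neither variant calls fillna(0) on the inputs, so a measured 0 and a missing measurement remain distinguishable in the result.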
I have a dataframe with a timestamp index and an energy usage column. A reading is taken every minute of the day, i.e. 1440 readings per day. A few values are missing.
I want to impute each missing value with the mean of the readings at the same time on the same day of the week over the last two or three weeks. This way, if the previous week's value is also missing, I can still use the value from two weeks ago.
Here's an example of the data:
mains_1
timestamp
2013-01-03 00:00:00 155.00
2013-01-03 00:01:00 154.00
2013-01-03 00:02:00 NaN
2013-01-03 00:03:00 154.00
2013-01-03 00:04:00 153.00
... ...
2013-04-30 23:55:00 NaN
2013-04-30 23:56:00 182.00
2013-04-30 23:57:00 181.00
2013-04-30 23:58:00 182.00
2013-04-30 23:59:00 182.00
Right now I have this line of code:
df['mains_1'] = (df
                 .groupby((df.index.dayofweek * 24) + (df.index.hour) + (df.index.minute / 60))
                 .transform(lambda x: x.fillna(x.mean()))
                 )
This fills each gap with the average usage for the same day of the week and time of day over the whole dataset. I want it to be more precise and use only the average of the last two or three weeks.
You can concat the Series with shifted copies of itself built in a loop; index alignment ensures that each row is matched with the same time in the previous weeks. Then take the row-wise mean and use .fillna to update the original.
Sample Data
import pandas as pd
import numpy as np
np.random.seed(5)
df = pd.DataFrame(index=pd.date_range('2010-01-01 10:00:00', freq='W', periods=10),
                  data=np.random.choice([1, 2, 3, 4, np.nan], 10),
                  columns=['mains_1'])
# mains_1
#2010-01-03 10:00:00 4.0
#2010-01-10 10:00:00 1.0
#2010-01-17 10:00:00 2.0
#2010-01-24 10:00:00 1.0
#2010-01-31 10:00:00 NaN
#2010-02-07 10:00:00 4.0
#2010-02-14 10:00:00 1.0
#2010-02-21 10:00:00 1.0
#2010-02-28 10:00:00 NaN
#2010-03-07 10:00:00 2.0
Code
# range(4): the current week (shift of 0) plus the previous 3 weeks.
df1 = pd.concat([df.shift(periods=x, freq='W') for x in range(4)], axis=1)
# mains_1 mains_1 mains_1 mains_1
#2010-01-03 10:00:00 4.0 NaN NaN NaN
#2010-01-10 10:00:00 1.0 4.0 NaN NaN
#2010-01-17 10:00:00 2.0 1.0 4.0 NaN
#2010-01-24 10:00:00 1.0 2.0 1.0 4.0
#2010-01-31 10:00:00 NaN 1.0 2.0 1.0
#2010-02-07 10:00:00 4.0 NaN 1.0 2.0
#2010-02-14 10:00:00 1.0 4.0 NaN 1.0
#2010-02-21 10:00:00 1.0 1.0 4.0 NaN
#2010-02-28 10:00:00 NaN 1.0 1.0 4.0
#2010-03-07 10:00:00 2.0 NaN 1.0 1.0
#2010-03-14 10:00:00 NaN 2.0 NaN 1.0
#2010-03-21 10:00:00 NaN NaN 2.0 NaN
#2010-03-28 10:00:00 NaN NaN NaN 2.0
df['mains_1'] = df['mains_1'].fillna(df1.mean(axis=1))
print(df)
mains_1
2010-01-03 10:00:00 4.000000
2010-01-10 10:00:00 1.000000
2010-01-17 10:00:00 2.000000
2010-01-24 10:00:00 1.000000
2010-01-31 10:00:00 1.333333
2010-02-07 10:00:00 4.000000
2010-02-14 10:00:00 1.000000
2010-02-21 10:00:00 1.000000
2010-02-28 10:00:00 2.000000
2010-03-07 10:00:00 2.000000
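For data that is not sampled once per week, the same shift-and-align idea might be written with a fixed offset of 7 days instead of the 'W' alias, which is anchored to a weekday. The following is a minimal sketch under that assumption; the function name fill_from_previous_weeks and the n_weeks parameter are hypothetical, and hourly data is used instead of minute data only to keep the example small.

```python
import numpy as np
import pandas as pd

def fill_from_previous_weeks(s, n_weeks=3):
    """Hypothetical helper: fill NaN with the mean of the values observed
    exactly 1..n_weeks weeks earlier (NaN values in those weeks are skipped)."""
    shifted = pd.concat(
        [s.shift(freq=pd.Timedelta(weeks=k)) for k in range(1, n_weeks + 1)],
        axis=1,
    )
    # Row-wise mean ignores NaN; fillna aligns on the original index
    return s.fillna(shifted.mean(axis=1))

# Four weeks of hourly readings with two gaps one week apart
idx = pd.date_range("2013-01-03", periods=24 * 7 * 4, freq=pd.Timedelta(hours=1))
s = pd.Series(np.arange(len(idx), dtype=float), index=idx, name="mains_1")
s.iloc[509] = np.nan  # week 4
s.iloc[341] = np.nan  # same weekday and hour, week 3

filled = fill_from_previous_weeks(s, n_weeks=3)
print(filled.iloc[[341, 509]])
```

Because the gap in week 3 is itself NaN, the week 4 gap is filled from weeks 1 and 2 only, which is the fallback behaviour asked for in the question.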
Generating the data
import numpy as np
import pandas as pd

np.random.seed(42)  # seed NumPy's generator, since the data comes from np.random
date_rng = pd.date_range(start='1/1/2018', end='1/08/2018', freq='H')
df = pd.DataFrame(np.random.randint(0, 10, size=(len(date_rng), 3)),
                  columns=['data1', 'data2', 'data3'],
                  index=date_rng)
daily_mean_df = pd.DataFrame(np.zeros([len(date_rng), 3]),
                             columns=['data1', 'data2', 'data3'],
                             index=date_rng)
# Replace roughly 35% of the values with NaN
mask = np.random.choice([1, 0], df.shape, p=[.35, .65]).astype(bool)
df[mask] = np.nan
df
>>>
data1 data2 data3
2018-01-01 00:00:00 1.0 3.0 NaN
2018-01-01 01:00:00 8.0 5.0 8.0
2018-01-01 02:00:00 5.0 NaN 6.0
2018-01-01 03:00:00 4.0 7.0 4.0
2018-01-01 04:00:00 NaN 8.0 NaN
... ... ... ...
2018-01-07 20:00:00 8.0 7.0 NaN
2018-01-07 21:00:00 5.0 4.0 5.0
2018-01-07 22:00:00 NaN 6.0 NaN
2018-01-07 23:00:00 2.0 4.0 3.0
2018-01-08 00:00:00 NaN NaN NaN
I want to select a specific time each day, then set all values in that day equal to the data at that time.
For example, if I select 1:00:00, all data for 2018-01-01 should equal the row at 2018-01-01 01:00:00, all data for 2018-01-02 should equal the row at 2018-01-02 01:00:00, and so on.
I know how to select the data at that time:
timestamp = "01:00:00"
df[df.index.strftime("%H:%M:%S") == timestamp]
but I don't know how to set the rest of the day's data equal to it.
Thank you for reading.
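You can do this with reindex:
# Rows at the chosen time, one per day
s = df[df.index.strftime("%H:%M:%S") == timestamp]
# Re-key them by calendar date
s.index = s.index.date
# Look up each row's date and overwrite the whole frame
df[:] = s.reindex(df.index.date).values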
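As a possible alternative that returns a new frame instead of overwriting df, the selected rows can be broadcast over each day with a groupby transform. This is only a sketch on made-up data (two numeric columns over 49 hours), and the variable names are illustrative rather than taken from the answer above.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2018-01-01", periods=49, freq=pd.Timedelta(hours=1))
df = pd.DataFrame(
    {"data1": np.arange(49, dtype=float), "data2": np.arange(49, dtype=float) * 10},
    index=idx,
)

timestamp = "01:00:00"
at_time = df.index.strftime("%H:%M:%S") == timestamp

# Keep only the rows at the chosen time; every other row becomes NaN
picked = df[at_time].reindex(df.index)

# Within each calendar day, broadcast the single remaining row to all rows
result = picked.groupby(df.index.normalize()).transform("first")

print(result.iloc[[0, 23, 24, 48]])
```

A day that has no row at the chosen time (here 2018-01-03, which only contains 00:00:00) comes out as NaN, as does a day whose value at that time is itself NaN.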
I have a data frame like the one below, and I want to sample it at '3S'.
The data contains NaN values. I expect the data frame to be sampled at '3S', and whenever a NaN is found in between, the sampling should stop there and restart from that index. I tried to achieve this with the dataframe.apply method, but it looks very complex. Is there a shorter way?
df.sample(n=3)
Code to generate the input:
index = pd.date_range('1/1/2000', periods=13, freq='min')
series = pd.DataFrame(range(13), index=index)
print(series)
series.iloc[4] = np.nan
series.iloc[10] = np.nan
I tried sampling, but after that I have no clue how to proceed.
2015-01-01 00:00:00 0.0
2015-01-01 01:00:00 1.0
2015-01-01 02:00:00 2.0
2015-01-01 03:00:00 2.0
2015-01-01 04:00:00 NaN
2015-01-01 05:00:00 3.0
2015-01-01 06:00:00 4.0
2015-01-01 07:00:00 4.0
2015-01-01 08:00:00 4.0
2015-01-01 09:00:00 NaN
2015-01-01 10:00:00 3.0
2015-01-01 11:00:00 4.0
2015-01-01 12:00:00 4.0
The new data frame should be sampled at '3S' but also take NaN values into account: wherever a NaN record is found, it should be kept and the sampling should restart from there.
Expected Output:
2015-01-01 02:00:00 2.0 -- Sampling after 3S
2015-01-01 03:00:00 2.0 -- Print because NaN has found in Next
2015-01-01 04:00:00 NaN -- print NaN record
2015-01-01 07:00:00 4.0 -- Sampling after 3S
2015-01-01 08:00:00 4.0 -- Print because NaN has found in Next
2015-01-01 09:00:00 NaN -- print NaN record
2015-01-01 12:00:00 4.0 -- Sampling after 3S
Use:
import numpy as np
import pandas as pd

index = pd.date_range('1/1/2000', periods=13, freq='h')
df = pd.DataFrame({'col': range(13)}, index=index)
df.iloc[4, 0] = np.nan
df.iloc[9, 0] = np.nan
print (df)
col
2000-01-01 00:00:00 0.0
2000-01-01 01:00:00 1.0
2000-01-01 02:00:00 2.0
2000-01-01 03:00:00 3.0
2000-01-01 04:00:00 NaN
2000-01-01 05:00:00 5.0
2000-01-01 06:00:00 6.0
2000-01-01 07:00:00 7.0
2000-01-01 08:00:00 8.0
2000-01-01 09:00:00 NaN
2000-01-01 10:00:00 10.0
2000-01-01 11:00:00 11.0
2000-01-01 12:00:00 12.0
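# True where the value is missing
m = df['col'].isna()
# Label each run of consecutive missing / non-missing rows
s1 = m.ne(m.shift()).cumsum()
t = pd.Timedelta(2, unit='h')
# Keep rows at least 2 hours after the start of their run
mask = df.index >= df.groupby(s1)['col'].transform(lambda x: x.index[0]) + t
# Also keep the NaN rows themselves
df1 = df[mask | m]
print (df1)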
col
2000-01-01 02:00:00 2.0
2000-01-01 03:00:00 3.0
2000-01-01 04:00:00 NaN
2000-01-01 07:00:00 7.0
2000-01-01 08:00:00 8.0
2000-01-01 09:00:00 NaN
2000-01-01 12:00:00 12.0
Explanation:
Create a mask of the missing values with Series.isna.
Label groups of consecutive equal mask values by comparing the mask with its shifted copy using Series.ne (!=) and taking the cumulative sum:
print (s1)
2000-01-01 00:00:00 1
2000-01-01 01:00:00 1
2000-01-01 02:00:00 1
2000-01-01 03:00:00 1
2000-01-01 04:00:00 2
2000-01-01 05:00:00 3
2000-01-01 06:00:00 3
2000-01-01 07:00:00 3
2000-01-01 08:00:00 3
2000-01-01 09:00:00 4
2000-01-01 10:00:00 5
2000-01-01 11:00:00 5
2000-01-01 12:00:00 5
Freq: H, Name: col, dtype: int32
Get the first index value of each group, add a Timedelta (2 hours here, to match the expected output), and compare the result with the DatetimeIndex.
Finally, filter by boolean indexing, chaining the two masks with | (bitwise OR).
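The steps above compare timestamps. If the rows are evenly spaced, which is an assumption this sketch makes, "at least 2 hours after the start of the run" is the same as "third row of the run or later", so the filter might also be expressed by position using groupby(...).cumcount(). The names run_id and pos_in_run are illustrative only.

```python
import numpy as np
import pandas as pd

index = pd.date_range("2000-01-01", periods=13, freq=pd.Timedelta(hours=1))
df = pd.DataFrame({"col": np.arange(13, dtype=float)}, index=index)
df.iloc[[4, 9], 0] = np.nan

# True where the value is missing
m = df["col"].isna()

# A new label starts every time the mask flips between NaN and non-NaN
run_id = m.ne(m.shift()).cumsum()

# 0-based position of each row inside its run
pos_in_run = df.groupby(run_id).cumcount()

# Keep the third and later rows of each run, plus every NaN row
df1 = df[(pos_in_run >= 2) | m]
print(df1)
```

With irregular timestamps the two formulations differ, and the Timedelta comparison from the answer is the one that follows elapsed time.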
One way would be to fill the NaNs with 0:
df['Col_of_Interest'] = df['Col_of_Interest'].fillna(0)
And then resample the series
(assuming the datetime is your index):
series.resample('30s').asfreq()
I work on time series in Python 3 with pandas, and I want to summarize the periods of contiguous missing values, but I'm only able to find the indexes of the NaN values...
Sample data :
Valeurs
2018-01-01 00:00:00 1.0
2018-01-01 04:00:00 NaN
2018-01-01 08:00:00 2.0
2018-01-01 12:00:00 NaN
2018-01-01 16:00:00 NaN
2018-01-01 20:00:00 5.0
2018-01-02 00:00:00 6.0
2018-01-02 04:00:00 7.0
2018-01-02 08:00:00 8.0
2018-01-02 12:00:00 9.0
2018-01-02 16:00:00 5.0
2018-01-02 20:00:00 NaN
2018-01-03 00:00:00 NaN
2018-01-03 04:00:00 NaN
2018-01-03 08:00:00 1.0
2018-01-03 12:00:00 2.0
2018-01-03 16:00:00 NaN
Expected results :
Start_Date number of contiguous missing values
2018-01-01 04:00:00 1
2018-01-01 12:00:00 2
2018-01-02 20:00:00 3
2018-01-03 16:00:00 1
How can I obtain this kind of result with pandas (shift(), cumsum(), groupby()?)?
Thank you for your advice!
Sylvain
Use groupby and agg:
mask = df.Valeurs.isna()
d = df.index.to_series()[mask].groupby((~mask).cumsum()[mask]).agg(['first', 'size'])
d.rename(columns=dict(size='num of contig null', first='Start_Date')).reset_index(drop=True)
Start_Date num of contig null
0 2018-01-01 04:00:00 1
1 2018-01-01 12:00:00 2
2 2018-01-02 20:00:00 3
3 2018-01-03 16:00:00 1
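To see why this grouping works, it may help to run it on the sample data from the question. The sketch below rebuilds that data (assuming a regular 4-hour index, as the sample suggests) and keeps the intermediate run_id visible; the names run_id, summary and n_missing are illustrative. The cumulative sum of the non-NaN mask only increases on valid rows, so all NaN rows of one gap share the same label.

```python
import numpy as np
import pandas as pd

index = pd.date_range("2018-01-01", periods=17, freq=pd.Timedelta(hours=4))
values = [1, np.nan, 2, np.nan, np.nan, 5, 6, 7, 8, 9, 5,
          np.nan, np.nan, np.nan, 1, 2, np.nan]
df = pd.DataFrame({"Valeurs": values}, index=index)

mask = df["Valeurs"].isna()

# Counter that advances on every valid row and stays flat across a gap
run_id = (~mask).cumsum()

summary = (
    df.index.to_series()[mask]      # timestamps of the NaN rows only
    .groupby(run_id[mask])          # one group per contiguous gap
    .agg(["first", "size"])         # gap start and gap length
    .rename(columns={"first": "Start_Date", "size": "n_missing"})
    .reset_index(drop=True)
)
print(summary)
```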
Working on the underlying numpy array:
a = df.Valeurs.values
m = np.concatenate(([False],np.isnan(a),[False]))
idx = np.nonzero(m[1:] != m[:-1])[0]
out = df[df.Valeurs.isnull() & ~df.Valeurs.shift().isnull()].index
pd.DataFrame({'Start date': out, 'contiguous': (idx[1::2] - idx[::2])})
Start date contiguous
0 2018-01-01 04:00:00 1
1 2018-01-01 12:00:00 2
2 2018-01-02 20:00:00 3
3 2018-01-03 16:00:00 1
If you have the indices where the NaN values occur, you can use itertools to find the contiguous chunks.
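One common way to do this, sketched here as an assumption about what was meant, is to group the positions with itertools.groupby on the difference between each position and its rank: consecutive integers share the same difference. The function name contiguous_runs is hypothetical, and the positions are the integer locations of the NaN rows in the sample data.

```python
from itertools import groupby

# Integer positions of the NaN rows in the sample data
nan_positions = [1, 3, 4, 11, 12, 13, 16]

def contiguous_runs(positions):
    """Hypothetical helper: return (start_position, length) for each run
    of consecutive integers in a sorted list of positions."""
    runs = []
    # position - rank is constant inside a run of consecutive integers
    for _, group in groupby(enumerate(positions), key=lambda pair: pair[1] - pair[0]):
        members = [pos for _, pos in group]
        runs.append((members[0], len(members)))
    return runs

runs = contiguous_runs(nan_positions)
print(runs)
```

The start positions can then be mapped back to timestamps with df.index[start] to get the Start_Date column of the expected result.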