I have a data frame like the one below and I want to sample it with a step of '3S'.
There are situations where NaN is present. What I expect is for the data frame to be sampled with '3S', but if a NaN is found in between, the sampling should stop there and restart from that index. I tried to achieve this with DataFrame.apply, but it looks very complex. Is there a shorter way to do it?
df.sample(n=3)
Code to generate the input:
import pandas as pd
import numpy as np

index = pd.date_range('1/1/2000', periods=13, freq='T')
series = pd.DataFrame(range(13), index=index)
series.iloc[4] = np.nan
series.iloc[10] = np.nan
print(series)
I tried sampling, but after that I have no clue how to proceed.
2015-01-01 00:00:00 0.0
2015-01-01 01:00:00 1.0
2015-01-01 02:00:00 2.0
2015-01-01 03:00:00 2.0
2015-01-01 04:00:00 NaN
2015-01-01 05:00:00 3.0
2015-01-01 06:00:00 4.0
2015-01-01 07:00:00 4.0
2015-01-01 08:00:00 4.0
2015-01-01 09:00:00 NaN
2015-01-01 10:00:00 3.0
2015-01-01 11:00:00 4.0
2015-01-01 12:00:00 4.0
The new data frame should be sampled with '3S' while also taking NaNs into account: whenever a NaN record is found, the sampling should restart from there.
Expected Output:
2015-01-01 02:00:00 2.0 -- sampled after '3S'
2015-01-01 03:00:00 2.0 -- printed because the next record is NaN
2015-01-01 04:00:00 NaN -- NaN record printed
2015-01-01 07:00:00 4.0 -- sampled after '3S'
2015-01-01 08:00:00 4.0 -- printed because the next record is NaN
2015-01-01 09:00:00 NaN -- NaN record printed
2015-01-01 12:00:00 4.0 -- sampled after '3S'
Use:
import numpy as np
import pandas as pd

index = pd.date_range('1/1/2000', periods=13, freq='H')
df = pd.DataFrame({'col': range(13)}, index=index)
df.iloc[4, 0] = np.nan
df.iloc[9, 0] = np.nan
print(df)
col
2000-01-01 00:00:00 0.0
2000-01-01 01:00:00 1.0
2000-01-01 02:00:00 2.0
2000-01-01 03:00:00 3.0
2000-01-01 04:00:00 NaN
2000-01-01 05:00:00 5.0
2000-01-01 06:00:00 6.0
2000-01-01 07:00:00 7.0
2000-01-01 08:00:00 8.0
2000-01-01 09:00:00 NaN
2000-01-01 10:00:00 10.0
2000-01-01 11:00:00 11.0
2000-01-01 12:00:00 12.0
m = df['col'].isna()
s1 = m.ne(m.shift()).cumsum()
t = pd.Timedelta(2, unit='H')
mask = df.index >= df.groupby(s1)['col'].transform(lambda x: x.index[0]) + t
df1 = df[mask | m]
print (df1)
col
2000-01-01 02:00:00 2.0
2000-01-01 03:00:00 3.0
2000-01-01 04:00:00 NaN
2000-01-01 07:00:00 7.0
2000-01-01 08:00:00 8.0
2000-01-01 09:00:00 NaN
2000-01-01 12:00:00 12.0
Explanation:
Create a mask of the missing values with Series.isna.
Create groups of consecutive values by comparing the mask with its shifted version using Series.ne (!=) and taking the cumulative sum:
print (s1)
2000-01-01 00:00:00 1
2000-01-01 01:00:00 1
2000-01-01 02:00:00 1
2000-01-01 03:00:00 1
2000-01-01 04:00:00 2
2000-01-01 05:00:00 3
2000-01-01 06:00:00 3
2000-01-01 07:00:00 3
2000-01-01 08:00:00 3
2000-01-01 09:00:00 4
2000-01-01 10:00:00 5
2000-01-01 11:00:00 5
2000-01-01 12:00:00 5
Freq: H, Name: col, dtype: int32
Get the first index value per group, add a Timedelta (2 hours here, to match the expected output), and compare against the DatetimeIndex.
Finally, filter by boolean indexing, chaining the masks with | (bitwise OR).
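If this pattern is needed repeatedly, the steps above can be wrapped into a small helper. A minimal sketch, assuming the same column layout as the example; the function name and the offset parameter are mine and not part of the answer:

import pandas as pd

def sample_after_gap(df, col, offset):
    """Keep rows at least `offset` past the start of their run, plus all NaN rows."""
    m = df[col].isna()                                             # mask of missing values
    s1 = m.ne(m.shift()).cumsum()                                  # labels of consecutive runs
    starts = df.groupby(s1)[col].transform(lambda x: x.index[0])   # first timestamp per run
    mask = df.index >= starts + pd.Timedelta(offset)
    return df[mask | m]

# equivalent to the answer above:
# df1 = sample_after_gap(df, 'col', '2H')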
One way would be to fill the NaNs with 0:
df['Col_of_Interest'] = df['Col_of_Interest'].fillna(0)
and then resample the series (assuming the datetime is your index):
series.resample('30S').asfreq()
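A minimal, self-contained sketch of this suggestion on an invented toy series; the column name 'Col_of_Interest' and the 15-second spacing are assumptions for illustration only:

import numpy as np
import pandas as pd

idx = pd.date_range('2000-01-01', periods=5, freq='15S')
df = pd.DataFrame({'Col_of_Interest': [1.0, np.nan, 3.0, np.nan, 5.0]}, index=idx)

df['Col_of_Interest'] = df['Col_of_Interest'].fillna(0)        # NaNs become 0
print(df['Col_of_Interest'].resample('30S').asfreq())          # pick the value every 30 seconds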
import numpy as np
import pandas as pd
import xarray as xr
validIdx = np.ones(365*5, dtype= bool)
validIdx[np.random.randint(low=0, high=365*5, size=30)] = False
time = pd.date_range("2000-01-01", freq="H", periods=365 * 5)[validIdx]
data = np.arange(365 * 5)[validIdx]
ds = xr.Dataset({"foo": ("time", data), "time": time})
df = ds.to_dataframe()
In the example above, the time-series data ds (or df) has 30 randomly chosen missing records, and they are simply absent rather than present as NaNs. Therefore, the length of the data is 365x5 - 30, not 365x5.
My question is this: how can I expand ds and df so that the 30 missing values appear as NaNs (so the length becomes 365x5)? For example, if the value at "2000-12-02" is missing in the example data, it currently looks like:
...
2000-12-01 value 1
2000-12-03 value 2
...
, while what I want to have is:
...
2000-12-01 value 1
2000-12-02 NaN
2000-12-03 value 2
...
Perhaps you can try resampling with a 1-hour frequency.
The df without NaNs (just after df = ds.to_dataframe()):
>>> df
foo
time
2000-01-01 00:00:00 0
2000-01-01 01:00:00 1
2000-01-01 02:00:00 2
2000-01-01 03:00:00 3
2000-01-01 04:00:00 4
... ...
2000-03-16 20:00:00 1820
2000-03-16 21:00:00 1821
2000-03-16 22:00:00 1822
2000-03-16 23:00:00 1823
2000-03-17 00:00:00 1824
[1795 rows x 1 columns]
The df with NaNs (df_1h):
>>> df_1h = df.resample('1H').mean()
>>> df_1h
foo
time
2000-01-01 00:00:00 0.0
2000-01-01 01:00:00 1.0
2000-01-01 02:00:00 2.0
2000-01-01 03:00:00 3.0
2000-01-01 04:00:00 4.0
... ...
2000-03-16 20:00:00 1820.0
2000-03-16 21:00:00 1821.0
2000-03-16 22:00:00 1822.0
2000-03-16 23:00:00 1823.0
2000-03-17 00:00:00 1824.0
[1825 rows x 1 columns]
Rows with NaN:
>>> df_1h[df_1h['foo'].isna()]
foo
time
2000-01-02 10:00:00 NaN
2000-01-04 07:00:00 NaN
2000-01-05 06:00:00 NaN
2000-01-09 02:00:00 NaN
2000-01-13 15:00:00 NaN
2000-01-16 16:00:00 NaN
2000-01-18 21:00:00 NaN
2000-01-21 22:00:00 NaN
2000-01-23 19:00:00 NaN
2000-01-24 01:00:00 NaN
2000-01-24 19:00:00 NaN
2000-01-27 12:00:00 NaN
2000-01-27 16:00:00 NaN
2000-01-29 06:00:00 NaN
2000-02-02 01:00:00 NaN
2000-02-06 13:00:00 NaN
2000-02-09 11:00:00 NaN
2000-02-15 12:00:00 NaN
2000-02-15 15:00:00 NaN
2000-02-21 04:00:00 NaN
2000-02-28 05:00:00 NaN
2000-02-28 06:00:00 NaN
2000-03-01 15:00:00 NaN
2000-03-02 18:00:00 NaN
2000-03-04 18:00:00 NaN
2000-03-05 20:00:00 NaN
2000-03-12 08:00:00 NaN
2000-03-13 20:00:00 NaN
2000-03-16 01:00:00 NaN
The number of NaNs in df_1h:
>>> df_1h.isnull().sum()
foo 30
dtype: int64
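As a side note, since the target is simply a complete hourly index with NaN where records are missing, reindexing against the full range should give the same result without computing a mean. This is my suggestion, not part of the answer above:

full_index = pd.date_range("2000-01-01", freq="H", periods=365 * 5)
df_full = df.reindex(full_index)          # missing hours appear as NaN rows
# for a regular target frequency, df.asfreq('H') is equivalent
print(len(df_full), df_full['foo'].isna().sum())   # 1825 rows; NaN count equals the number of missing records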
I have a large df with a datetime index at an hourly time step and precipitation values in several columns. My precipitation values are a cumulative total during the day (from 1:00 am to 0:00 am of the next day) and are reset every day, for example:
datetime S1
2000-01-01 00:00:00 4.5 ...
2000-01-01 01:00:00 0 ...
2000-01-01 02:00:00 0 ...
2000-01-01 03:00:00 0 ...
2000-01-01 04:00:00 0
2000-01-01 05:00:00 0
2000-01-01 06:00:00 0
2000-01-01 07:00:00 0
2000-01-01 08:00:00 0
2000-01-01 09:00:00 0
2000-01-01 10:00:00 0
2000-01-01 11:00:00 6.5
2000-01-01 12:00:00 7.5
2000-01-01 13:00:00 8.7
2000-01-01 14:00:00 8.7
...
2000-01-01 22:00:00 8.7
2000-01-01 23:00:00 8.7
2000-01-02 00:00:00 8.7
2000-01-02 01:00:00 0
I am trying to go from this to the actual hourly values: the value at 1:00 am of every day is fine as-is, and for all other hours I want to subtract the value of the previous timestep.
Can I somehow use an if statement inside df.apply?
I thought of something like:
df_copy = df.copy()
df = df.apply(lambda x: if df.hour !=1: era5_T[x]=era5_T[x]-era5_T_copy[x-1])
But this is not working, since I'm not actually calling a function? I could use a for loop, but that doesn't seem like the most efficient way, as I'm working with a big dataset.
You can use numpy.where and pd.Series.shift to achieve this (with the datetime in the index, the hour comes from df.index):
import numpy as np
df['hourly_S1'] = np.where(df.index.hour == 1, df.S1, df.S1 - df.S1.shift())
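A self-contained sketch on a few toy values; the small frame is mine, the S1 column name is from the question, and the datetime is assumed to be the index:

import numpy as np
import pandas as pd

idx = pd.date_range('2000-01-01 00:00', periods=5, freq='H')
df = pd.DataFrame({'S1': [4.5, 0.0, 0.0, 2.5, 3.0]}, index=idx)

# keep the 1 am value as-is (first value of the day), otherwise take the hourly difference
df['hourly_S1'] = np.where(df.index.hour == 1, df['S1'], df['S1'] - df['S1'].shift())
print(df)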
Generating the data:
import numpy as np
import pandas as pd

np.random.seed(42)
date_rng = pd.date_range(start='1/1/2018', end='1/08/2018', freq='H')
df = pd.DataFrame(np.random.randint(0, 10, size=(len(date_rng))),
                  columns=['data'],
                  index=date_rng)
mask = np.random.choice([1, 0], df.shape, p=[.35, .65]).astype(bool)
df[mask] = np.nan
I want to calculate a rolling std() with window = 5: if more than half of the elements in the window are NaN, the rolling result should be NaN; if fewer than half are NaN, drop them and calculate std() on the remaining elements.
I only know how to calculate normal rolling:
df.rolling(5).std()
How can I specify this condition for the rolling calculation?
I think you can use the min_periods argument of the rolling function:
df['rollingstd'] = df.rolling(5, min_periods=3).std()
df.head(20)
Output:
data rollingstd
2018-01-01 00:00:00 1.0 NaN
2018-01-01 01:00:00 6.0 NaN
2018-01-01 02:00:00 1.0 2.886751
2018-01-01 03:00:00 NaN 2.886751
2018-01-01 04:00:00 5.0 2.629956
2018-01-01 05:00:00 3.0 2.217356
2018-01-01 06:00:00 NaN 2.000000
2018-01-01 07:00:00 NaN NaN
2018-01-01 08:00:00 3.0 1.154701
2018-01-01 09:00:00 NaN NaN
2018-01-01 10:00:00 5.0 NaN
2018-01-01 11:00:00 9.0 3.055050
2018-01-01 12:00:00 NaN 3.055050
2018-01-01 13:00:00 9.0 2.309401
2018-01-01 14:00:00 1.0 3.829708
2018-01-01 15:00:00 0.0 4.924429
2018-01-01 16:00:00 3.0 4.031129
2018-01-01 17:00:00 0.0 3.781534
2018-01-01 18:00:00 1.0 1.224745
2018-01-01 19:00:00 NaN 1.414214
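A note on why min_periods=3 matches the requirement: with window=5, "more than half NaN" means at least 3 NaNs, i.e. at most 2 valid values, and min_periods=3 returns NaN exactly for those windows; windows with 3 or more valid values are computed on the non-NaN elements, which is what the rolling std() does by default.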
Here is an alternative, more customizable method:
Write a custom function for your logic that takes the elements of each window as input and returns the wanted result for that window; here it computes the standard deviation of the valid values and returns NaN when the window has too few of them:
def cond_std(x):
    valid = x[~np.isnan(x)]
    if len(valid) > 2:                # more than half of the 5-element window is valid
        return np.std(valid, ddof=1)  # sample std, matching pandas' default
    return np.nan
Then call the rolling function on your dataframe as below:
df.rolling(5).apply(cond_std)
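A usage note: df.rolling(5).apply(cond_std, raw=True) passes plain NumPy arrays to the function instead of Series, which is usually noticeably faster on large frames; the function above works the same either way.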
I have a dataframe with datetimes as its index. There are some gaps in the index, so I upsample it so there is only a 1-second gap between rows. I want to fill the gaps half by forward filling (from the left side of the gap) and half by backward filling (from the right side of the gap).
Input:
2000-01-01 00:00:00 0
2000-01-01 00:01:00 1
2000-01-01 00:02:00 2
2000-01-01 00:03:00 3
2000-01-01 00:04:10 4
Upsampled input, with a 10-second step:
2000-01-01 00:00:00 0.0
2000-01-01 00:00:10 NaN
2000-01-01 00:00:20 NaN
2000-01-01 00:00:30 NaN
2000-01-01 00:00:40 NaN
2000-01-01 00:00:50 NaN
2000-01-01 00:01:00 1.0
2000-01-01 00:01:10 NaN
2000-01-01 00:01:20 NaN
2000-01-01 00:01:30 NaN
2000-01-01 00:01:40 NaN
2000-01-01 00:01:50 NaN
2000-01-01 00:02:00 2.0
2000-01-01 00:02:10 NaN
2000-01-01 00:02:20 NaN
2000-01-01 00:02:30 NaN
2000-01-01 00:02:40 NaN
2000-01-01 00:02:50 NaN
2000-01-01 00:03:00 3.0
2000-01-01 00:04:10 4.0
Output I want:
2000-01-01 00:00:00 0.0
2000-01-01 00:00:10 0.0
2000-01-01 00:00:20 0.0
2000-01-01 00:00:30 0.0
2000-01-01 00:00:40 1.0
2000-01-01 00:00:50 1.0
2000-01-01 00:01:00 1.0
2000-01-01 00:01:10 1.0
2000-01-01 00:01:20 1.0
2000-01-01 00:01:30 1.0
2000-01-01 00:01:40 2.0
2000-01-01 00:01:50 2.0
2000-01-01 00:02:00 2.0
2000-01-01 00:02:10 2.0
2000-01-01 00:02:20 2.0
2000-01-01 00:02:30 2.0
2000-01-01 00:02:40 3.0
2000-01-01 00:02:50 3.0
2000-01-01 00:03:00 3.0
2000-01-01 00:04:10 4.0
I managed to get the result I want by finding the edges of the gaps after upsampling, forward filling across the whole gap, and then overwriting just the right half with the value of the right edge. But since my data is so large, it takes forever to run; some of my files have 1M gaps to fill, and I basically do this with a for loop over all the identified gaps.
Is there a way this could be done faster?
Thanks!
Edit:
I only want to upsample and fill gaps where the time difference is smaller than or equal to a given value (in the example, only gaps up to 1 minute), so the last two rows should not be upsampled or filled.
If your data is 1 minute apart, you can do:
df.set_index(0).asfreq('10S').ffill(limit=3).bfill(limit=2)
output:
1
0
2000-01-01 00:00:00 0.0
2000-01-01 00:00:10 0.0
2000-01-01 00:00:20 0.0
2000-01-01 00:00:30 0.0
2000-01-01 00:00:40 1.0
2000-01-01 00:00:50 1.0
2000-01-01 00:01:00 1.0
2000-01-01 00:01:10 1.0
2000-01-01 00:01:20 1.0
2000-01-01 00:01:30 1.0
2000-01-01 00:01:40 2.0
2000-01-01 00:01:50 2.0
2000-01-01 00:02:00 2.0
2000-01-01 00:02:10 2.0
2000-01-01 00:02:20 2.0
2000-01-01 00:02:30 2.0
2000-01-01 00:02:40 3.0
2000-01-01 00:02:50 3.0
2000-01-01 00:03:00 3.0
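For reference on why those limits work: with 1-minute spacing and a 10-second grid there are five NaN rows between consecutive known values, so ffill(limit=3) covers the left half (rounded up) and bfill(limit=2) the remaining right half. A generic sketch of the same idea, assuming a perfectly regular spacing (the 70-second gap mentioned in the edit would still need separate handling); the function and parameter names are mine:

import pandas as pd

def half_fill(s, gap_seconds=60, step_seconds=10):
    """Upsample a regularly spaced series and fill each gap half from each side."""
    n = gap_seconds // step_seconds - 1      # NaN rows inserted between known values
    left = (n + 1) // 2                      # filled forward from the left edge
    right = n - left                         # filled backward from the right edge
    out = s.asfreq(f'{step_seconds}S').ffill(limit=left)
    return out.bfill(limit=right) if right else out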
Setup
ts = pd.Series([0, 1, 2, 3], pd.date_range('2000-01-01', periods=4, freq='min'))
merge_asof with direction='nearest'
pd.merge_asof(
ts.asfreq('10s').to_frame('left'),
ts.to_frame('right'),
left_index=True,
right_index=True,
direction='nearest'
)
left right
2000-01-01 00:00:00 0.0 0
2000-01-01 00:00:10 NaN 0
2000-01-01 00:00:20 NaN 0
2000-01-01 00:00:30 NaN 0
2000-01-01 00:00:40 NaN 1
2000-01-01 00:00:50 NaN 1
2000-01-01 00:01:00 1.0 1
2000-01-01 00:01:10 NaN 1
2000-01-01 00:01:20 NaN 1
2000-01-01 00:01:30 NaN 1
2000-01-01 00:01:40 NaN 2
2000-01-01 00:01:50 NaN 2
2000-01-01 00:02:00 2.0 2
2000-01-01 00:02:10 NaN 2
2000-01-01 00:02:20 NaN 2
2000-01-01 00:02:30 NaN 2
2000-01-01 00:02:40 NaN 3
2000-01-01 00:02:50 NaN 3
2000-01-01 00:03:00 3.0 3
reindex with method='nearest'
ts.reindex(ts.asfreq('10s').index, method='nearest')
2000-01-01 00:00:00 0
2000-01-01 00:00:10 0
2000-01-01 00:00:20 0
2000-01-01 00:00:30 1
2000-01-01 00:00:40 1
2000-01-01 00:00:50 1
2000-01-01 00:01:00 1
2000-01-01 00:01:10 1
2000-01-01 00:01:20 1
2000-01-01 00:01:30 2
2000-01-01 00:01:40 2
2000-01-01 00:01:50 2
2000-01-01 00:02:00 2
2000-01-01 00:02:10 2
2000-01-01 00:02:20 2
2000-01-01 00:02:30 3
2000-01-01 00:02:40 3
2000-01-01 00:02:50 3
2000-01-01 00:03:00 3
Freq: 10S, dtype: int64
Note that the way 'nearest' resolves ties produces slightly different results between the two solutions:
pd.merge_asof(
ts.asfreq('10s').to_frame('left'),
ts.to_frame('merge_asof'),
left_index=True,
right_index=True,
direction='nearest'
).assign(reindex=ts.reindex(ts.asfreq('10s').index, method='nearest'))
left merge_asof reindex
2000-01-01 00:00:00 0.0 0 0
2000-01-01 00:00:10 NaN 0 0
2000-01-01 00:00:20 NaN 0 0
2000-01-01 00:00:30 NaN 0 1 # This row is different
2000-01-01 00:00:40 NaN 1 1
2000-01-01 00:00:50 NaN 1 1
2000-01-01 00:01:00 1.0 1 1
2000-01-01 00:01:10 NaN 1 1
2000-01-01 00:01:20 NaN 1 1
2000-01-01 00:01:30 NaN 1 2 # This row is different
2000-01-01 00:01:40 NaN 2 2
2000-01-01 00:01:50 NaN 2 2
2000-01-01 00:02:00 2.0 2 2
2000-01-01 00:02:10 NaN 2 2
2000-01-01 00:02:20 NaN 2 2
2000-01-01 00:02:30 NaN 2 3 # This row is different
2000-01-01 00:02:40 NaN 3 3
2000-01-01 00:02:50 NaN 3 3
2000-01-01 00:03:00 3.0 3 3
I'm currently working on time series in Python 3 and pandas, and I want to build a summary of the periods of contiguous missing values, but I'm only able to find the indexes of the NaN values ...
Sample data :
Valeurs
2018-01-01 00:00:00 1.0
2018-01-01 04:00:00 NaN
2018-01-01 08:00:00 2.0
2018-01-01 12:00:00 NaN
2018-01-01 16:00:00 NaN
2018-01-01 20:00:00 5.0
2018-01-02 00:00:00 6.0
2018-01-02 04:00:00 7.0
2018-01-02 08:00:00 8.0
2018-01-02 12:00:00 9.0
2018-01-02 16:00:00 5.0
2018-01-02 20:00:00 NaN
2018-01-03 00:00:00 NaN
2018-01-03 04:00:00 NaN
2018-01-03 08:00:00 1.0
2018-01-03 12:00:00 2.0
2018-01-03 16:00:00 NaN
Expected results :
Start_Date number of contiguous missing values
2018-01-01 04:00:00 1
2018-01-01 12:00:00 2
2018-01-02 20:00:00 3
2018-01-03 16:00:00 1
How can I manage to obtain this kind of result with pandas (shift(), cumsum(), groupby()?)?
Thank you for your advice!
Sylvain
groupby and agg
mask = df.Valeurs.isna()
d = df.index.to_series()[mask].groupby((~mask).cumsum()[mask]).agg(['first', 'size'])
d.rename(columns=dict(size='num of contig null', first='Start_Date')).reset_index(drop=True)
Start_Date num of contig null
0 2018-01-01 04:00:00 1
1 2018-01-01 12:00:00 2
2 2018-01-02 20:00:00 3
3 2018-01-03 16:00:00 1
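A note on how the grouping key works: (~mask).cumsum() increases by one at every non-NaN row and stays constant across a run of NaNs, so slicing it with [mask] yields one constant label per contiguous block of missing values; 'first' then returns the block's first timestamp and 'size' its length.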
Working on the underlying numpy array:
a = df.Valeurs.values
m = np.concatenate(([False],np.isnan(a),[False]))
idx = np.nonzero(m[1:] != m[:-1])[0]
out = df[df.Valeurs.isnull() & ~df.Valeurs.shift().isnull()].index
pd.DataFrame({'Start date': out, 'contiguous': (idx[1::2] - idx[::2])})
Start date contiguous
0 2018-01-01 04:00:00 1
1 2018-01-01 12:00:00 2
2 2018-01-02 20:00:00 3
3 2018-01-03 16:00:00 1
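In short: m is the NaN mask padded with False on both ends, np.nonzero(m[1:] != m[:-1])[0] gives the alternating start/end positions of each NaN run, so idx[1::2] - idx[::2] is the length of each run, while the out index picks the rows where a NaN run begins (a NaN whose previous value is not NaN).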
If you already have the indices where the NaN values occur, you can also use itertools to group them into contiguous chunks, as sketched below.
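A minimal sketch of that approach, reusing df and its Valeurs column from above and using the classic consecutive-integer grouping idiom with itertools.groupby:

import numpy as np
import pandas as pd
from itertools import groupby

# integer positions of the NaN values
nan_pos = np.flatnonzero(df['Valeurs'].isna().to_numpy())

runs = []
# consecutive positions share the same (position - enumeration index) key,
# so each group is one contiguous run of NaNs
for _, grp in groupby(enumerate(nan_pos), key=lambda t: t[1] - t[0]):
    positions = [p for _, p in grp]
    runs.append((df.index[positions[0]], len(positions)))

print(pd.DataFrame(runs, columns=['Start_Date', 'number of contiguous missing values']))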