I am trying to get the 10-day aggregate of my data, which has NaN values. The sum over 10 days should return NaN if there is a NaN value anywhere in that 10-day window.
When I apply the code below, pandas treats NaN as zero and returns the sum of the remaining days.
dateRange = pd.date_range(start_date, periods=len(data), freq='D')
# Create a DataFrame so that the time series can handle the numpy array.
df = pd.DataFrame(data)
base_Series = pd.DataFrame(list(df.values), index=dateRange)
# Converting to aggregated series
agg_series = base_Series.resample('10D', how='sum')
agg_data = agg_series.values
Sample Data:
2011-06-01 46.520536
2011-06-02 8.988311
2011-06-03 0.133823
2011-06-04 0.274521
2011-06-05 1.283360
2011-06-06 2.556313
2011-06-07 0.027461
2011-06-08 0.001584
2011-06-09 0.079193
2011-06-10 2.389549
2011-06-11 NaN
2011-06-12 0.195844
2011-06-13 0.058720
2011-06-14 6.570925
2011-06-15 0.015107
2011-06-16 0.031066
2011-06-17 0.073008
2011-06-18 0.072198
2011-06-19 0.044534
2011-06-20 0.240080
Output:
2011-06-01 62.254651
2011-06-11 7.301481
This uses the numpy sum, which will return NaN if a NaN is present in the sum:
In [35]: s = Series(randn(100),index=date_range('20130101',periods=100))
In [36]: s.iloc[11] = np.nan
In [37]: s.resample('10D',how=lambda x: x.values.sum())
Out[37]:
2013-01-01 6.910729
2013-01-11 NaN
2013-01-21 -1.592541
2013-01-31 -2.013012
2013-02-10 1.129273
2013-02-20 -2.054807
2013-03-02 4.669622
2013-03-12 3.489225
2013-03-22 0.390786
2013-04-01 -0.005655
dtype: float64
To filter out those days which have any NaNs, I propose that you do
noNaN_days_only = s.groupby(lambda x: x.date).filter(lambda x: ~x.isnull().any())
where s is a Series (or DataFrame) with a DatetimeIndex.
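As a usage sketch (not part of the original answer), applied to the s built above this simply drops the one day that contains a NaN:
# Keep only the days whose values contain no NaN; with one value per day,
# this drops exactly the NaN day, leaving 99 of the 100 rows.
noNaN_days_only = s.groupby(lambda x: x.date).filter(lambda x: ~x.isnull().any())
len(s), len(noNaN_days_only)  # (100, 99)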
Just apply an agg function:
agg_series = base_Series.resample('10D').agg(lambda x: np.nan if np.isnan(x).all() else np.sum(x))
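On current pandas versions, where the how= keyword used further above has been removed, a similar one-liner is possible with skipna=False (a sketch, assuming base_Series from the question):
# Disable skipna so that a single NaN in any 10-day window makes that
# window's sum NaN instead of being treated as zero.
agg_series = base_Series.resample('10D').apply(lambda x: x.sum(skipna=False))
agg_data = agg_series.values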
Related
I have a dataframe (df) with a date index, and I want to achieve the following:
1. Take the Dates index and add one month, e.g. nxt_dt = df.index + pd.DateOffset(months=1), and let's call df.index curr_dt.
2. Find the nearest entry in Dates that is >= nxt_dt.
3. Count the rows between curr_dt and nxt_dt and put them into a column in df.
The result is supposed to look like this:
px_volume listed_sh ... iv_mid_6m '30d'
Dates ...
2005-01-03 228805 NaN ... 0.202625 21
2005-01-04 189983 NaN ... 0.203465 22
2005-01-05 224310 NaN ... 0.202455 23
2005-01-06 221988 NaN ... 0.202385 20
2005-01-07 322691 NaN ... 0.201065 21
Needless to say, there are only dates/rows in the df for which there are observations.
I can think of a few different ways to get this done with loops, but since the data I work with is quite big, I would really like to avoid looping through the rows to fill them.
Is there a way in pandas to get this done in a vectorized fashion?
If you are OK with reindexing, this should do the job (the descending sort makes the 30-row rolling window look forward in time, and the - 1 excludes the current row):
import numpy as np
import pandas as pd
df = pd.DataFrame({'date': ['2020-01-01', '2020-01-08', '2020-01-24', '2020-01-29', '2020-02-09', '2020-03-04']})
df['date'] = pd.to_datetime(df['date'])
df['value'] = 1
df = df.set_index('date')
df = df.reindex(pd.date_range('2020-01-01','2020-03-04')).fillna(0)
df = df.sort_index(ascending=False)
df['30d'] = df['value'].rolling(30).sum() - 1
df.sort_index().query("value == 1")
gives:
value 30d
2020-01-01 1.0 3.0
2020-01-08 1.0 2.0
2020-01-24 1.0 2.0
2020-01-29 1.0 1.0
2020-02-09 1.0 NaN
2020-03-04 1.0 NaN
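A different sketch, not taken from the answer above: if the original Dates index is sorted, searchsorted can count the rows falling between curr_dt and curr_dt + 1 month directly, without reindexing to a daily grid (whether the current row itself should be counted is an assumption here):
import numpy as np
import pandas as pd

idx = df.index                               # the sorted DatetimeIndex of observations
nxt = idx + pd.DateOffset(months=1)          # step 1: add one month to every date
pos = idx.searchsorted(nxt)                  # step 2: position of the first entry >= nxt_dt
df['30d'] = pos - np.arange(len(idx)) - 1    # step 3: rows strictly between curr_dt and nxt_dt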
I have a big dataframe that contains around 7,000,000 rows of time series data that looks like this
timestamp | values
2019-08-01 14:53:01 | 20.0
2019-08-01 14:53:55 | 29.0
2019-08-01 14:53:58 | 22.4
...
2019-08-02 14:53:25 | 27.9
I want to create a column that is a 1-day-lagged version of the values for each row. Since my timestamps don't match up perfectly from day to day, I can't use the normal shift() method.
The result will be something like this:
timestamp | values | lag
2019-08-01 14:53:01 | 20.0 | NaN
2019-08-01 14:53:55 | 29.0 | NaN
2019-08-01 14:53:58 | 22.4 | NaN
...
2019-08-02 14:53:25 | 27.9 | 20.0
I found some posts on getting the closest timestamp to a given time (Find closest row of DataFrame to given time in Pandas) and tried those methods. They do the job but take too long to run. Here's what I have:
def get_nearest(data, timestamp):
    # Locate the row whose index is closest to the given timestamp.
    index = data.index.get_loc(timestamp, method="nearest")
    return data.iloc[index, 0]

# Look up the value closest to one day before each row's timestamp.
df['lag'] = [get_nearest(df, dt - pd.Timedelta(days=1)) for dt in df.index]
Any efficient ways to solve the problem?
Hmmmm, not sure if this will work out to be more efficient, but merge_asof is an approach worth looking at as won't require a udf.
df['date'] = df.timestamp.dt.date
df2 = df.copy()
df2['date'] = df2['date'] + pd.to_timedelta(1, unit='D')
df2['timestamp'] = df2['timestamp'] + pd.to_timedelta(1, unit='D')
pd.merge_asof(df, df2, on='timestamp', by='date', direction='nearest')
The approach essentially merges the previous day value to the next day and then matches to the nearest timestamp.
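A self-contained sketch of that idea (the column names and the normalize step are assumptions, not taken verbatim from the question):
import pandas as pd

df = pd.DataFrame({
    'timestamp': pd.to_datetime(['2019-08-01 14:53:01', '2019-08-01 14:53:55',
                                 '2019-08-01 14:53:58', '2019-08-02 14:53:25']),
    'values': [20.0, 29.0, 22.4, 27.9],
})
df['date'] = df['timestamp'].dt.normalize()   # the calendar day, kept as datetime64

# Shift a copy forward by one day so that yesterday's rows line up with today's date.
prev = df.copy()
prev['timestamp'] = prev['timestamp'] + pd.Timedelta(days=1)
prev['date'] = prev['date'] + pd.Timedelta(days=1)
prev = prev.rename(columns={'values': 'lag'})

out = pd.merge_asof(df.sort_values('timestamp'),
                    prev.sort_values('timestamp'),
                    on='timestamp', by='date', direction='nearest')
print(out[['timestamp', 'values', 'lag']])
# The 2019-08-02 row picks up 20.0, the nearest value from one day earlier;
# the 2019-08-01 rows get NaN because there is no previous day in the data.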
Assuming your dates are sorted, one way to do this quickly would be to use pd.DatetimeIndex.searchsorted to find all the matching dates in O(N log N) time.
Creating some test data, it might look something like this:
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame(
    {'values': np.random.rand(10)},
    index=sorted(np.random.choice(pd.date_range('2019-08-01', freq='T', periods=10000), 10, replace=False))
)
def add_lag(df):
    ind = df.index.searchsorted(df.index - pd.DateOffset(1))
    out_of_range = (ind <= 0) | (ind >= df.shape[0])
    ind[out_of_range] = 0
    lag = df['values'].values[ind]
    lag[out_of_range] = np.nan
    df['lag'] = lag
    return df
add_lag(df)
values lag
2019-08-01 06:17:00 0.548814 NaN
2019-08-01 10:51:00 0.715189 NaN
2019-08-01 13:56:00 0.602763 NaN
2019-08-02 09:50:00 0.544883 0.715189
2019-08-03 14:06:00 0.423655 0.423655
2019-08-04 03:00:00 0.645894 0.423655
2019-08-05 07:40:00 0.437587 0.437587
2019-08-07 00:41:00 0.891773 0.891773
2019-08-07 07:05:00 0.963663 0.891773
2019-08-07 15:55:00 0.383442 0.891773
With this approach, a dataframe with 1 million rows can be computed in tens of milliseconds:
df = pd.DataFrame(
    {'values': np.random.rand(1000000)},
    index=sorted(np.random.choice(pd.date_range('2019-08-01', freq='T', periods=10000000), 1000000, replace=False))
)
%timeit add_lag(df)
# 10 loops, best of 3: 71.5 ms per loop
Note however that this doesn't find the nearest value to a lag of one day, but the nearest value after a lag of one day. If you want the nearest value in either direction, you'll need to modify this approach.
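One way to make that modification (a sketch, assuming the same sorted index and 'values' column as above) is to compare the neighbors on both sides of each insertion point and keep whichever is closer:
import numpy as np
import pandas as pd

def add_nearest_lag(df):
    target = df.index - pd.DateOffset(1)
    right = df.index.searchsorted(target)            # first index position >= target
    left = np.clip(right - 1, 0, df.shape[0] - 1)    # last index position < target
    right = np.clip(right, 0, df.shape[0] - 1)
    # Pick whichever neighbor is closer to the 1-day-lagged timestamp.
    dist_left = np.abs((df.index[left] - target).to_numpy())
    dist_right = np.abs((df.index[right] - target).to_numpy())
    ind = np.where(dist_left <= dist_right, left, right)
    lag = df['values'].values[ind]
    lag[target < df.index[0]] = np.nan               # nothing exists a full day earlier
    df['lag'] = lag
    return df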
There is a dataframe. The period column contains lists. These lists contain time spans.
#load data
df = pd.DataFrame(data, columns=['task_id', 'target_start_date', 'target_end_date'])
df['target_start_date'] = pd.to_datetime(df.target_start_date)
df['target_end_date'] = pd.to_datetime(df.target_end_date)
df['period'] = np.nan
#create period column
z = dict()
freq = 'M'
for i in range(0, len(df)):
    l = pd.period_range(df.target_start_date[i], df.target_end_date[i], freq=freq)
    l = l.to_native_types()
    z[i] = l
df.period = list(z.values())
Output
task_id target_start_date target_end_date period
0 35851 2019-04-01 07:00:00 2019-04-01 07:00:00 [2019-04]
1 35852 2020-02-26 11:30:00 2020-02-26 11:30:00 [2020-02]
2 35854 2019-05-17 07:00:00 2019-06-01 17:30:00 [2019-05, 2019-06]
3 35855 2019-03-20 11:30:00 2019-04-07 15:00:00 [2019-03, 2019-04]
4 35856 2019-04-06 08:00:00 2019-04-26 19:00:00 [2019-04]
Then I add columns which are called time slices.
#create slices
date_min = df.target_start_date.min()
date_max = df.target_end_date.max()
period = pd.period_range(date_min, date_max, freq=freq)
#add columns
for i in period:
    df[str(i)] = np.nan
The result is df with one new, all-NaN column added per period.
How can I fill the NaN values with True if the corresponding period is in the list in the period column?
Apply a function across the dataframe rows:
def fillit(row):
    for i in row.period:
        row[i] = True
    return row

df = df.apply(fillit, axis=1)
My approach was to iterate over rows and column names and compare values:
import numpy as np
import pandas as pd
# handle assignment error
pd.options.mode.chained_assignment = None
# setup test data
data = {'time': [['2019-04'], ['2019-01'], ['2019-03'], ['2019-06', '2019-05']]}
data = pd.DataFrame(data=data)
# create periods
date_min = data.time.min()[0]
date_max = data.time.max()[0]
period = pd.period_range(date_min, date_max, freq='M')
for i in period:
    data[str(i)] = np.nan

# compare and fill data
for index, row in data.iterrows():
    for column in data:
        if data[column].name in row['time']:
            data[column][index] = 'True'
Output:
time 2019-01 2019-02 2019-03 2019-04 2019-05 2019-06
0 [2019-04] NaN NaN NaN True NaN NaN
1 [2019-01] True NaN NaN NaN NaN NaN
2 [2019-03] NaN NaN True NaN NaN NaN
3 [2019-06, 2019-05] NaN NaN NaN NaN True True
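A vectorized sketch, not from either answer above, continuing from the same data and period objects: explode the lists, one-hot encode, and join back, which avoids the nested Python loops:
# Explode the lists so each (row, period) pair becomes its own row,
# then one-hot encode and collapse back to one row per original index.
exploded = data['time'].explode()
dummies = (pd.get_dummies(exploded)
             .groupby(level=0).max()
             .reindex(columns=period.astype(str), fill_value=0)
             .astype(bool))
# where(...) keeps True and turns False into NaN, matching the output above.
result = data[['time']].join(dummies.where(dummies))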
I am trying to shift my datetime index so that 2018-04-09 shows as 2018-04-08, shifting only the last row. I tried a few ways and got different errors, such as the one below:
df.index[-1] = df.index[-1] + pd.offsets.Day(1)
TypeError: Index does not support mutable operations
Can you kindly advise a suitable way please?
My df looks like this:
FinalPosition
dt
2018-04-03 1.32
2018-04-04 NaN
2018-04-05 NaN
2018-04-06 NaN
2018-04-09 NaN
Use rename if the values of the DatetimeIndex are unique:
df = df.rename({df.index[-1]: df.index[-1] + pd.offsets.Day(1)})
print (df)
FinalPosition
dt
2018-04-03 1.32
2018-04-04 NaN
2018-04-05 NaN
2018-04-06 NaN
2018-04-10 NaN
If the values are possibly not unique, DatetimeIndex.insert works for me:
df.index = df.index[:-1].insert(len(df), df.index[-1] + pd.offsets.Day(1))
Use .iloc
Ex:
import pandas as pd
df = pd.DataFrame({"datetime": ["2018-04-09"]})
df["datetime"] = pd.to_datetime(df["datetime"])
print(df["datetime"].iloc[-1:] - pd.offsets.Day(1))
Output:
0 2018-04-08
Name: datetime, dtype: datetime64[ns]
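If the goal is to actually write the shifted timestamp back into the index (a sketch, assuming df is indexed by dt as in the question), rebuild the index rather than mutating it:
# An Index cannot be modified element-wise, so build a new one with the
# last label moved back by one day and assign it.
df.index = df.index[:-1].append(pd.DatetimeIndex([df.index[-1] - pd.offsets.Day(1)]))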
How do I fill in NaN values in a dataframe with a default date of 2015-01-01?
What do I use here: df['SIGN_DATE'] = df['SIGN_DATE'].fillna(??, inplace=True)
>>>df.SIGN_DATE.head()
0 2012-03-28 14:14:18
1 2011-05-18 00:41:48
2 2011-06-13 16:36:58
3 nan
4 2011-05-22 23:43:56
Name: SIGN_DATE, dtype: object
type(df.SIGN_DATE)
pandas.core.series.Series
df['SIGN_DATE'].fillna(value=pd.to_datetime('1/1/2015'), inplace=True)
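Since SIGN_DATE is object dtype in the sample above, it may also be worth converting the column to datetime first so that the filled column ends up as datetime64 (a sketch):
# Parse the strings (the literal 'nan' becomes NaT), then fill in the default date.
df['SIGN_DATE'] = pd.to_datetime(df['SIGN_DATE'], errors='coerce')
df['SIGN_DATE'] = df['SIGN_DATE'].fillna(pd.Timestamp('2015-01-01'))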