How to disallow keys other than Timestamps in DatetimeIndex? - python

Pandas does not restrict DatetimeIndex keys to Timestamps only. Why is that, and is there any way to enforce such a restriction?
df = pd.DataFrame({"A":{"2019-01-01":12.0,"2019-01-03":27.0,"2019-01-04":15.0},
"B":{"2019-01-01":25.0,"2019-01-03":27.0,"2019-01-04":27.0}}
)
df.index = pd.to_datetime(df.index)
df.loc['2010-05-05'] = 1 # string index
df.loc[150] = 1 # integer index
print(df)
I get the following dataframe:
A B
2019-01-01 00:00:00 12.0 25.0
2019-01-03 00:00:00 27.0 27.0
2019-01-04 00:00:00 15.0 27.0
2010-05-05 1.0 1.0
150 1.0 1.0
Of course I cannot do
df.index = pd.to_datetime(df.index)
again because of the last two rows.
However, I'd like the last two rows to be rejected with an error instead of being added silently.
Is that possible?

You have a slight misconception about the type of your index. It is not a DatetimeIndex:
>>> df.index
Index([2019-01-01 00:00:00, 2019-01-03 00:00:00, 2019-01-04 00:00:00,
'2010-05-05', 150],
dtype='object')
The index becomes an object-dtype Index as soon as you add a value of a different type. A DatetimeIndex can't hold anything other than timestamps, so the type of the index itself is changed.
If you would like to remove all values that are not datetimes from your index, you can do that with pd.to_datetime and errors='coerce':
df.index = pd.to_datetime(df.index, errors='coerce')
A B
2019-01-01 12.0 25.0
2019-01-03 27.0 27.0
2019-01-04 15.0 27.0
2010-05-05 1.0 1.0
NaT 1.0 1.0
To access only elements that have a valid Timestamp as index, you can use notnull:
df[df.index.notnull()]
A B
2019-01-01 12.0 25.0
2019-01-03 27.0 27.0
2019-01-04 15.0 27.0
2010-05-05 1.0 1.0

You can check whether each index label is a Timestamp instance (pd.Timestamp is the public alias of pd._libs.tslibs.timestamps.Timestamp):
flags = [isinstance(idx, pd.Timestamp) for idx in df.index]
df = df[flags]
However, note that both pd.to_datetime('2010-05-05') and pd.to_datetime(150) succeed: each still produces a valid timestamp without throwing an exception or error, so to_datetime alone will not reject such keys.
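If you want an actual error on assignment, as far as I know there is no built-in pandas flag for this, but you can route assignments through a small strict wrapper. A minimal sketch (loc_set_strict is a made-up helper, not a pandas API):
import pandas as pd
def loc_set_strict(df, key, value):
    # hypothetical helper: only accept pd.Timestamp keys, raise otherwise
    if not isinstance(key, pd.Timestamp):
        raise TypeError(f"index key must be a pd.Timestamp, got {type(key).__name__}")
    df.loc[key] = value
df = pd.DataFrame({"A": [12.0, 27.0, 15.0], "B": [25.0, 27.0, 27.0]},
                  index=pd.to_datetime(["2019-01-01", "2019-01-03", "2019-01-04"]))
loc_set_strict(df, pd.Timestamp("2019-01-05"), 1)   # OK, index stays a DatetimeIndex
# loc_set_strict(df, "2010-05-05", 1)               # raises TypeError (plain string)
# loc_set_strict(df, 150, 1)                        # raises TypeError (integer)
This only guards code paths that go through the helper; a direct df.loc[...] assignment will still fall back to an object index.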

Related

Sum of dataframes : treating NaN as 0 when summed with other values, but returning NaN where all summed elements are NaN

I am trying to add some dataframes that contain NaN values. The dataframes are indexed by a time series, and in my case a NaN is meaningful: it means that a measurement wasn't taken. So if all the dataframes I'm adding have a NaN for a given timestamp, I need the result to have a NaN for that timestamp. But if one or more of them has a value for the timestamp, I need the sum of those values.
EDIT: Also, in my case a 0 is different from a NaN; it means that there was a measurement and it measured 0 activity, unlike a NaN, which means there was no measurement. So any solution using fillna(0) won't work.
I haven't found a proper way to do this yet. Here is an example of what I want to do:
import numpy as np
import pandas as pd
df1 = pd.DataFrame({'value': [0, 1, 1, 1, np.NaN, np.NaN, np.NaN]},
                   index=pd.date_range("01/01/2020 00:00", "01/01/2020 01:00", freq='10T'))
df2 = pd.DataFrame({'value': [0, 5, 5, 5, 5, 5, np.NaN]},
                   index=pd.date_range("01/01/2020 00:00", "01/01/2020 01:00", freq='10T'))
df1 + df2
What I get:
df1 + df2
value
2020-01-01 00:00:00 0.0
2020-01-01 00:10:00 6.0
2020-01-01 00:20:00 6.0
2020-01-01 00:30:00 6.0
2020-01-01 00:40:00 NaN
2020-01-01 00:50:00 NaN
2020-01-01 01:00:00 NaN
What I would like to have as a result:
value
2020-01-01 00:00:00 0.0
2020-01-01 00:10:00 6.0
2020-01-01 00:20:00 6.0
2020-01-01 00:30:00 6.0
2020-01-01 00:40:00 5.0
2020-01-01 00:50:00 5.0
2020-01-01 01:00:00 NaN
Does anybody know a clean way to do this?
Thank you.
(I'm using Python 3.9.1 and pandas 1.2.4)
You can use add with the fill_value=0 option. This will maintain the "all NaN" combinations as NaN:
df1.add(df2, fill_value=0)
output:
value
2020-01-01 00:00:00 0.0
2020-01-01 00:10:00 6.0
2020-01-01 00:20:00 6.0
2020-01-01 00:30:00 6.0
2020-01-01 00:40:00 5.0
2020-01-01 00:50:00 5.0
2020-01-01 01:00:00 NaN
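If you have more than two frames to combine, the same idea extends with functools.reduce; a small sketch (df3 here is made up just to show a third input):
from functools import reduce
import numpy as np
import pandas as pd
idx = pd.date_range("01/01/2020 00:00", "01/01/2020 01:00", freq='10T')
df1 = pd.DataFrame({'value': [0, 1, 1, 1, np.nan, np.nan, np.nan]}, index=idx)
df2 = pd.DataFrame({'value': [0, 5, 5, 5, 5, 5, np.nan]}, index=idx)
df3 = pd.DataFrame({'value': [np.nan, 2, np.nan, np.nan, np.nan, np.nan, np.nan]}, index=idx)
# pairwise add(fill_value=0): a position stays NaN only if it is NaN in every input
total = reduce(lambda a, b: a.add(b, fill_value=0), [df1, df2, df3])
print(total)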

Select most recent example of multiindex dataframe

I have a similar problem as in Getting the last element of a level in a multiindex. In the question mentioned there, the multiindex dataframe has for each group a start number which is always the same.
However, my problem is slightly different. I again have two index levels: one with an integer (in the MWE below it is a bool) and a second with datetimes. Similar to the example above, I want to select, for each unique value in the first level, the last row. In my case, that means the row with the most recent timestamp. The solution from the question above does not work, since I have no fixed start value for the second level.
MWE:
import pandas as pd
df = pd.DataFrame(range(10), index=pd.date_range(pd.Timestamp("2020.01.01"), pd.Timestamp("2020.01.01") + pd.Timedelta(hours=50), 10))
mask = (df.index.hour > 1) & (df.index.hour < 9)
df.groupby(mask)
df = df.groupby(mask).rolling("4h").mean()
The resulting dataframe looks like:
0
False 2020-01-01 00:00:00 0.0
2020-01-01 11:06:40 2.0
2020-01-01 16:40:00 3.0
2020-01-01 22:13:20 4.0
2020-01-02 09:20:00 6.0
2020-01-02 14:53:20 7.0
2020-01-02 20:26:40 8.0
True 2020-01-01 05:33:20 1.0
2020-01-02 03:46:40 5.0
2020-01-03 02:00:00 9.0
Now, I want to get for each value in the first column the row with the most recent time stamp. I.e., I would like to get the following dataframe:
0
False 2020-01-02 20:26:40 8.0
True 2020-01-03 02:00:00 9.0
I would really appreciate ideas like in the mentioned link which do this.
Assuming the values in level 1 are sorted, try groupby with tail:
out = df.groupby(level=0).tail(1)
out:
0
False 2020-01-02 20:26:40 8.0
True 2020-01-03 02:00:00 9.0
If not, sort_index first:
out = df.sort_index(level=1).groupby(level=0).tail(1)
out:
0
False 2020-01-02 20:26:40 8.0
True 2020-01-03 02:00:00 9.0
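If you would rather not sort at all, another sketch (my own variant, assuming the timestamps are unique within each group) is to compute the maximum timestamp per group and select those labels directly:
latest = (pd.Series(df.index.get_level_values(1), index=df.index.get_level_values(0))
            .groupby(level=0).max())
out = df.loc[list(zip(latest.index, latest))]
This gives the same two rows as above without relying on the order of level 1.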

How to expand a DateTimeIndex with some months before?

I have a time-series corresponding to the end of the month for some dates of interest:
Date
31-01-2005 0.0
28-02-2006 0.0
30-06-2020 0.0
Name: Whatever, dtype: float64
I'd like to expand this dataframe's index with two monthly samples before each data point, resulting in the following:
Date
30-11-2004 NaN
31-12-2004 NaN
31-01-2005 0.0
31-12-2005 NaN
31-01-2006 NaN
28-02-2006 0.0
30-04-2020 NaN
31-05-2020 NaN
30-06-2020 0.0
Name: Whatever, dtype: float64
How can I do that? Note that I am only interested in the resulting index.
My naive attempt was to do:
df.index.apply(lambda x: [x - pd.DateOffset(months=2), x - pd.DateOffset(months=1), x])
but index doesn't have an apply function.
I think you need DataFrame.reindex with date_range:
idx = [y for x in df.index for y in pd.date_range(x - pd.DateOffset(months=2), x, freq='M')]
df = df.reindex(pd.to_datetime(idx))
print (df)
Whatever
2004-11-30 NaN
2004-12-31 NaN
2005-01-31 0.0
2005-12-31 NaN
2006-01-31 NaN
2006-02-28 0.0
2020-04-30 NaN
2020-05-31 NaN
2020-06-30 0.0
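An alternative sketch with MonthEnd offsets and Index.union, assuming (as in the example) that every stamp is already a month end:
import pandas as pd
s = pd.Series(0.0, name="Whatever",
              index=pd.to_datetime(["2005-01-31", "2006-02-28", "2020-06-30"]))
idx = s.index
# shift the whole index back one and two month ends, then merge with the original
expanded = idx.union(idx - pd.offsets.MonthEnd(1)).union(idx - pd.offsets.MonthEnd(2))
out = s.reindex(expanded)
print(out)
Index.union also sorts and de-duplicates, which matters if the two-month windows of neighbouring dates overlap.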

How to create a timeseries from a dataframe of event durations?

I have a dataframe full of bookings for one room (columns: booking_id, check-in date and check-out date) that I want to transform into a timeseries indexed by every day of the year (index: days of the year, feature: booked or not).
I have calculated the duration of the bookings, and reindexed the dataframe daily.
Now I need to forward-fill the dataframe, but only a limited number of times: the duration of each booking.
I tried iterating through each row with ffill, but it applies to the entire dataframe, not to selected rows.
Any idea how I can do that?
Here is my code:
import numpy as np
import pandas as pd
#create dataframe
data = [[1, '2019-01-01', '2019-01-02', 1],
        [2, '2019-01-03', '2019-01-07', 4],
        [3, '2019-01-10', '2019-01-13', 3]]
df = pd.DataFrame(data, columns=['booking_id', 'check-in', 'check-out', 'duration'])
#cast dates to datetime formats
df['check-in'] = pd.to_datetime(df['check-in'])
df['check-out'] = pd.to_datetime(df['check-out'])
#create timeseries indexed on check-in date
df2 = df.set_index('check-in')
#create new index and reindex timeseries
idx = pd.date_range(min(df['check-in']), max(df['check-out']), freq='D')
ts = df2.reindex(idx)
I have this:
booking_id check-out duration
2019-01-01 1.0 2019-01-02 1.0
2019-01-02 NaN NaT NaN
2019-01-03 2.0 2019-01-07 4.0
2019-01-04 NaN NaT NaN
2019-01-05 NaN NaT NaN
2019-01-06 NaN NaT NaN
2019-01-07 NaN NaT NaN
2019-01-08 NaN NaT NaN
2019-01-09 NaN NaT NaN
2019-01-10 3.0 2019-01-13 3.0
2019-01-11 NaN NaT NaN
2019-01-12 NaN NaT NaN
2019-01-13 NaN NaT NaN
I expect to have:
booking_id check-out duration
2019-01-01 1.0 2019-01-02 1.0
2019-01-02 1.0 2019-01-02 1.0
2019-01-03 2.0 2019-01-07 4.0
2019-01-04 2.0 2019-01-07 4.0
2019-01-05 2.0 2019-01-07 4.0
2019-01-06 2.0 2019-01-07 4.0
2019-01-07 NaN NaT NaN
2019-01-08 NaN NaT NaN
2019-01-09 NaN NaT NaN
2019-01-10 3.0 2019-01-13 3.0
2019-01-11 3.0 2019-01-13 3.0
2019-01-12 3.0 2019-01-13 3.0
2019-01-13 NaN NaT NaN
filluntil = ts['check-out'].ffill()
m = ts.index < filluntil.values
# reshape the mask to be the same shape as ts
m = np.repeat(m, ts.shape[1]).reshape(ts.shape)
ts = ts.ffill().where(m)
First we create a series in which the check-out dates are forward-filled. Then we create a mask where the index is less than those filled values. Finally we forward-fill the whole frame and keep values only where the mask is True.
If you want to include the row with the check-out date, change < to <= in the mask.
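Pulling the question's setup and this mask together, a self-contained sketch of the same approach (nothing new, just the imports and reindexing included so it runs on its own):
import numpy as np
import pandas as pd
data = [[1, '2019-01-01', '2019-01-02', 1],
        [2, '2019-01-03', '2019-01-07', 4],
        [3, '2019-01-10', '2019-01-13', 3]]
df = pd.DataFrame(data, columns=['booking_id', 'check-in', 'check-out', 'duration'])
df['check-in'] = pd.to_datetime(df['check-in'])
df['check-out'] = pd.to_datetime(df['check-out'])
# daily index spanning all bookings
idx = pd.date_range(df['check-in'].min(), df['check-out'].max(), freq='D')
ts = df.set_index('check-in').reindex(idx)
# forward-fill the check-out dates, then keep filled values only while the day
# is still before that check-out (use <= to include the check-out day itself)
filluntil = ts['check-out'].ffill()
m = np.repeat(ts.index < filluntil.values, ts.shape[1]).reshape(ts.shape)
ts = ts.ffill().where(m)
print(ts)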
I think to "forward-fill the dataframe" you should use pandas interpolate method. Documentation can be found here:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.interpolate.html
you can do something like this:
int_how_many_consecutive_to_fill = 3
df2 = df2.interpolate(axis=0, limit=int_how_many_consecutive_to_fill, limit_direction='forward')
Look at the documentation for interpolate; there is a lot of custom functionality you can add to the method via its parameters.
EDIT:
To do this using the row value in the duration column for each interpolation is a bit messier, but I think the following should work (there may be a less hacky, cleaner solution using some functionality in pandas or another library I am unaware of):
#get rows that contain NaNs:
nans_df = df2[df2.isnull().any(axis=1)]
#get rows without any NaNs:
non_nans_df = df2[df2.notnull().all(axis=1)]
#list of dfs we will concat vertically at the end to get final dataframe.
dfs = []
#iterate through each row that contains NaNs.
for nan_index, nan_row in nans_df.iterrows():
previous_day = nan_index - pd.DateOffset(1)
#this checks if the previous day to this NaN row is a day where we have non nan values, if the previous day is a nan day just skip this loop. This is mostly here to handle the case where the first row is a NaN one.
if previous_day not in non_nans_df.index:
continue
date_offset = 0
#here we are checking how many sequential rows there are after this one with all nan values in it, this will be stored in the date_offset variable.
while (nan_index + pd.DateOffset(date_offset)) in nans_df.index:
date_offset += 1
#this gets us the last date in the sequence of continuous days with all nan values after this current one.
end_sequence_date = nan_index + pd.DateOffset(date_offset)
#this gives us a dataframe whose first row is the previous day (confirmed to be non-NaN by the if statement above), followed by all the sequential NaN rows after the current one, combined into the variable df_to_interpolate.
df_to_interpolate = pd.concat([non_nans_df.loc[[previous_day]], nans_df.loc[nan_index:end_sequence_date]])
# now we pull the duration value from the first row of df_to_interpolate.
limit_val = int(df_to_interpolate['duration'].iloc[0])
#here we interpolate the dataframe using the limit_val
df_to_interpolate = df_to_interpolate.interpolate(axis=0, limit=limit_val, limit_direction='forward')
#append df_to_interpolate to our list that gets combined at the end.
dfs.append(df_to_interpolate)
#gives us our final dataframe, interpolated forward using a dynamic limit value based on the most recent duration value.
final_df = pd.concat(dfs)

Replacing NaN returns ValueError: Array conditional must be same shape as self

My aim is to impute error values (zeros and negatives) using 'ffill' (if they occur before 7 am) and 'interpolate' (for errors at or after 7 am). My text file contains thousands of days and hundreds of columns. Below is a small part of it showing three days with errors both before and after 7 am.
date a b c
2016-03-02 06:55:00 0.0 1.0 0.0
2016-03-02 07:00:00 2.0 2.0 0.0
2016-03-02 07:55:00 3.0 0.0 3.0
2016-03-03 06:10:00 -4.0 4.0 0.0
2016-03-03 07:00:00 5.0 5.0 5.0
2016-03-03 07:05:00 6.0 0.0 6.0
2016-03-03 08:05:00 7.0 0.0 7.0
2016-03-03 17:40:00 8.0 8.0 -8.0
2016-03-04 05:55:00 0.0 9.0 0.0
2016-03-04 06:00:00 0.0 0.0 10.0
A small variation of code from another post (below) works perfectly with other dataframes when 'date' is a column.
df['date'] = pd.to_datetime(df['date'])
df.set_index('date', inplace=True)
# Change zeros and negatives to NaN
df.replace(0, np.nan, inplace=True)
df[df < 0] = np.nan
# construct Boolean switch series
switch = (df.index - df.index.normalize()) > pd.to_timedelta('07:00:00')
# use numpy.where to differentiate between two scenarios
df.iloc[:, 0:] = df.iloc[:, 0:].interpolate().where(switch, df.iloc[:, 0:].ffill())
But when 'date' is made the index, the code raises ValueError: Array conditional must be same shape as self. Any help is appreciated.
The following finally solved my problem.
df['date'] = pd.to_datetime(df['date'])
# don't set column 'date' to index
# Change zeros and negatives to NaN
df.replace(0, np.nan, inplace=True)
# change negatives to NaN, but exclude column 'date';
# otherwise column 'date' would be converted to NaT
df[df.loc[:, df.columns != 'date'] < 0] = np.nan
# construct Boolean switch series
switch = (df['date'] - df['date'].dt.normalize()) > pd.to_timedelta('07:00:00')
# use numpy.where to differentiate between two scenarios
df.iloc[:, 0:] = df.iloc[:, 0:].interpolate().where(switch, df.iloc[:, 0:].ffill())
Thanks #jpp for suggesting the most important last two lines here.
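If you do want 'date' as the index after all, one sketch (my own assumption, not the OP's accepted fix) is to broadcast the 1-D switch to the frame's 2-D shape with np.repeat, the same trick used in the booking example above, so that DataFrame.where receives a mask of matching shape:
import numpy as np
import pandas as pd
# sample rows from the question; a, b, c contain the zeros/negatives to impute
df = pd.DataFrame({
    'date': pd.to_datetime([
        '2016-03-02 06:55:00', '2016-03-02 07:00:00', '2016-03-02 07:55:00',
        '2016-03-03 06:10:00', '2016-03-03 07:00:00', '2016-03-03 07:05:00',
        '2016-03-03 08:05:00', '2016-03-03 17:40:00',
        '2016-03-04 05:55:00', '2016-03-04 06:00:00']),
    'a': [0.0, 2.0, 3.0, -4.0, 5.0, 6.0, 7.0, 8.0, 0.0, 0.0],
    'b': [1.0, 2.0, 0.0, 4.0, 5.0, 0.0, 0.0, 8.0, 9.0, 0.0],
    'c': [0.0, 0.0, 3.0, 0.0, 5.0, 6.0, 7.0, -8.0, 0.0, 10.0],
}).set_index('date')
# change zeros and negatives to NaN (all remaining columns are numeric)
df = df.replace(0, np.nan)
df[df < 0] = np.nan
# True where the time of day is after 07:00 -> interpolate; otherwise ffill
switch = (df.index - df.index.normalize()) > pd.to_timedelta('07:00:00')
mask = np.repeat(switch, df.shape[1]).reshape(df.shape)
df = df.interpolate().where(mask, df.ffill())
print(df)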
