Shift DateTime index within a Pandas MultiIndex - python

I have a csv file that looks like this when I load it:
# generate example data
import numpy as np
import pandas as pd

users = ['A', 'B', 'C', 'D']
# dates = pd.date_range("2020-02-01 00:00:00", "2020-04-04 20:00:00", freq="H")
dates = pd.date_range("2020-02-01 00:00:00", "2020-02-04 20:00:00", freq="H")
idx = pd.MultiIndex.from_product([users, dates])
idx.names = ["user", "datehour"]
y = pd.Series(np.random.choice(a=[0, 1], size=len(idx)), index=idx).rename('y')

# write to csv and reload (turns out this matters)
y.to_csv('reprod_example.csv')
y = pd.read_csv('reprod_example.csv', parse_dates=['datehour'])
y = y.set_index(['user', 'datehour']).y
>>> y.head()
user  datehour
A     2020-02-01 00:00:00    0
      2020-02-01 01:00:00    0
      2020-02-01 02:00:00    1
      2020-02-01 03:00:00    0
      2020-02-01 04:00:00    0
Name: y, dtype: int64
I have the following function to create a lagged feature of an index level:
def shift_index(a, dt_idx_name, lag_freq, lag):
    # get datetime index of relevant level
    ac = a.copy()
    dti = ac.index.get_level_values(dt_idx_name)
    # shift it
    dti_shifted = dti.shift(lag, freq=lag_freq)
    # put it back where you found it
    ac.index.set_levels(dti_shifted, level=dt_idx_name, inplace=True)
    return ac
But when I run:
y_lag = shift_index(y, 'datehour', 'H', 1)
I get the following error:
ValueError: Level values must be unique...
(I can actually suppress this error by adding verify_integrity=False
in .index.set_levels... in the function, but that (predictably) causes problems down the line)
Here's the weird part. If you run the example above but without saving/reloading from csv, it works. The reason, I think, is that y.index.get_level_values('datehour') shows a freq='H' attribute right after it's created, but freq=None once it's reloaded from csv.
That makes sense: csv obviously doesn't save that metadata. But I've found it surprisingly difficult to set the freq attribute for a MultiIndexed series. For example, this did nothing:
df.index.freq = pd.tseries.frequencies.to_offset("H")
And this answer also didn't work for my MultiIndex.
So I think I could solve this if I were able to set the freq attribute of the DateTime component of my MultiIndex. But my ultimate goal is to create a version of my y data with a shifted DateTime MultiIndex component, such as with my shift_index function above. Since I receive my data via csv, "just don't save to csv and reload" is not an option.

After much fidgeting, I was able to set an hourly frequency using asfreq('H') on grouped data, such that each group has unique values for the datehour index.
y = pd.read_csv('reprod_example.csv', parse_dates=['datehour'])
y = y.groupby('user').apply(lambda df: df.set_index('datehour').asfreq('H')).y
Peeking at an index value shows the correct frequency.
y.index[0]
# ('A', Timestamp('2020-02-01 00:00:00', freq='H'))
All this is doing is setting the index in two parts. The user goes first so that the nested datehour index can be unique within it. Once the datehour index is unique, then asfreq can be used without difficulty.
If you try asfreq on a non-unique index (y_load below is the frame as read from the csv, before any index is set), it will not work.
y_load.set_index('datehour').asfreq('H')
# ---------------------------------------------------------------------------
# ValueError Traceback (most recent call last)
# <ipython-input-433-3ba51b619417> in <module>
# ----> 1 y_load.set_index('datehour').asfreq('H')
# ...
# ValueError: cannot reindex from a duplicate axis
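With the frequency restored by the grouped asfreq above, the shifted series itself can be built without set_levels; a minimal sketch, assuming the rebuilt y from this answer:
# Sketch (assuming the rebuilt y above): shift the datehour level by one hour.
# Passing freq='H' to .shift() works even when the level has no freq attribute,
# and rebuilding the MultiIndex with from_arrays avoids the uniqueness check
# that set_levels enforces.
shifted = y.index.get_level_values('datehour').shift(1, freq='H')
y_lag = y.copy()
y_lag.index = pd.MultiIndex.from_arrays(
    [y_lag.index.get_level_values('user'), shifted],
    names=['user', 'datehour'])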

Related

How can you drop certain dates in a data frame grouped by day?

I am working on code that groups a data frame by date:
gk = df_HR.groupby(['date'])
I now get a data frame where the first row for each date looks like this:
2022-05-23 22:18 60 2022-05-23 22:18:00 1653344280 1.000000
2022-05-24 00:00 54 2022-05-24 00:00:00 1653350400 0.900000
....
I want to drop, as an example, all the data for the date '2022-05-24'. However, when I use the .drop() function I get the error: 'DataFrameGroupBy' object has no attribute 'drop'.
How can I still drop all the data from this date?
Save your groupby result in a DataFrame df, then use the code below to drop the list of dates you want removed.
from datetime import datetime

date_list_filter = [datetime(2009, 5, 2),
                    datetime(2010, 8, 22)]
df.drop(date_list_filter, inplace=True)
Hope this helps!
From what I gather, the goal is to group the data frame by date and drop the groups that fall on a certain day.
import pandas as pd
# ...
gk = df_HR.groupby(['date'])
good_dfs = []
for date, sub_df in gk:
    if DATE_TO_DROP not in date:
        good_dfs.append(sub_df)
final_df = pd.concat(good_dfs)
Alternatively, you can drop the rows whose 'date' contains that string:
df_HR.drop(df_HR[df_HR['date'].astype(str).str.contains(DATE_TO_REMOVE)].index, inplace=True)
The above is for removing a single date. If you have multiple dates, here are those two options again.
option 1:
dates_to_drop = []
gk = df_HR.groupby(['date'])
good_dfs = []
for date, sub_df in gk:
    # keep the group only if none of the bad dates appears in its key
    if not any(bad_date in date for bad_date in dates_to_drop):
        good_dfs.append(sub_df)
final_df = pd.concat(good_dfs)
option 2:
dates_to_drop = []
for bad_date in dates_to_drop:
    df_HR.drop(df_HR[df_HR['date'].astype(str).str.contains(bad_date)].index, inplace=True)
The reason we have to loop is that the dates in the DataFrame contain more than just the string we're looking for, so we are matching substrings. We can't check a whole list of strings against a column in one pass this way, so we loop over the bad dates and remove the matching rows for each one.
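If the 'date' values are strings, the loop can also be collapsed into one vectorized filter; a sketch, assuming dates_to_drop holds date strings such as '2022-05-24':
# Sketch (assumes 'date' is a string column and dates_to_drop is non-empty,
# e.g. ['2022-05-24']): drop every row whose date matches any bad date.
pattern = '|'.join(dates_to_drop)
df_HR = df_HR[~df_HR['date'].astype(str).str.contains(pattern)]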
See the code below for a further example:
from datetime import datetime
import pandas as pd

my_date = [datetime(2009, 5, 2),
           datetime(2010, 8, 22),
           datetime(2022, 8, 22),
           datetime(2009, 5, 2),
           datetime(2010, 8, 22)]
df = pd.DataFrame(my_date)
df.columns = ['Date']
df1 = df.groupby('Date').mean()
df1  # the grouped frame now has the dates as its index
df1.drop('2009-05-02', inplace=True)
# the given date has been dropped
df1

DateTimeIndex should be sorted, but isn't

I am trying to resample a DateTime Series in pandas as follows:
df = pd.read_csv(pathToParam + "/" + file)
df.drop(["LAT", "LON", "STATION_HEIGHT"], axis=1, inplace=True)
df.set_index(df.DATE, inplace=True, drop=True)
if granularity == "daily":
    df.index = pd.to_datetime(df.index, cache=False)
    df = df.sort_index()
    df = df.resample("8H", closed="right").bfill()
The Dataframe looks like this:
DATE        STATION_ID  CLOUD_COVER_TOTAL
2016-01-01  1048        6.7
2016-01-02  1048        7.8
2016-01-03  1048        7.8
But I always get this error:
ValueError: index must be monotonic increasing or decreasing
I tried parse_dates=True and searched for possible solutions on a variety of platforms, but came up empty-handed. Please help.
Most likely one of the rows in your csv has an empty value where the date should be.
I can recreate your problem only if I intentionally put a blank date in:
dateSeries = ["2016-01-01", "", "2016-01-02", "2016-01-04"]
data = [[1048, 6.7], [1048, 7.8], [1048, 7.8], [1048,7.8]]
df = pd.DataFrame(data, index = dateSeries, columns=["STATION_ID", "CLOUD_COVER_TOTAL"])
df.index = pd.to_datetime(df.index, cache=False)
df = df.sort_index()
df = df.resample("8H", closed="right").bfill()
This raises the same error:
ValueError: index must be monotonic increasing or decreasing
If I have values in each index it works fine. You can find such problematic records with things like:
df.loc[None]
or
df.loc[""]
or
df.loc[pd.NaT]
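Once the offending rows are located, one way forward (a sketch, assuming those rows can simply be discarded) is to drop them before sorting and resampling:
# Sketch: drop rows whose index failed to parse (NaT), then resample as before.
df = df[df.index.notna()]
df = df.sort_index()
df = df.resample("8H", closed="right").bfill()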

Trying to find the nearest date before and after a specified date from a list of dates in a comma separated string within a pandas Dataframe

tldr; I have an index_date in dtype: datetime64[ns] <class 'pandas.core.series.Series'> and a list_of_dates of type <class 'list'> with individual elements in str format. What's the best way to convert these to the same data type so I can sort the dates into closest before and closest after index_date?
I have a pandas dataframe (df) with columns:
ID_string object
indexdate datetime64[ns]
XR_count int64
CT_count int64
studyid_concat object
studydate_concat object
modality_concat object
And it looks something like:
ID_string indexdate XR_count CT_count studyid_concat studydate_concat
0 55555555 2020-09-07 10 1 ['St1', 'St5'...] ['06/22/2019', '09/20/2020'...]
1 66666666 2020-06-07 5 0 ['St11', 'St17'...] ['05/22/2020', '06/24/2020'...]
Where the 0 element in studyid_concat ("St1") corresponds to the 0 element in studydate_concat, and in modality_concat, etc. I did not show modality_concat for space reasons, but it's something like ['XR', 'CT', ...]
My current goal is to find the closest X-ray study performed before and after my indexdate, as well as being able to rank studies from closest to furthest. I'm somewhat new to pandas, but here is my current attempt:
df = pd.read_excel(path_to_excel, sheet_name='Sheet1')
# Convert comma separated string from Excel to lists of strings
df.studyid_concat = df.studyid_concat.str.split(',')
df.studydate_concat = df.studydate_concat.str.split(',')
df.modality_concat = df.modality_concat.str.split(',')
for x in df['ID_string'].values:
    index_date = df.loc[df['ID_string'] == x, 'indexdate']
    # Had to use subscript [0] below because result of above was a list in an array
    studyid_list = df.loc[df['ID_string'] == x, 'studyid_concat'].values[0]
    date_list = df.loc[df['ID_string'] == x, 'studydate_concat'].values[0]
    modality_list = df.loc[df['ID_string'] == x, 'modality_concat'].values[0]
    xr_date_list = [date_list[i] for i in range(len(date_list)) if modality_list[i] == "XR"]
    xr_studyid_list = [studyid_list[i] for i in range(len(studyid_list)) if modality_list[i] == "XR"]
That's about as far as I got because I'm somewhat confused about the data types here. My indexdate is currently dtype: datetime64[ns] in a <class 'pandas.core.series.Series'>, which I was thinking of converting using the datetime module, but was having a hard time figuring out how (and whether I even needed to). My xr_date_list is a list of strings containing dates in the format 'mm/dd/yyyy'. I think I could figure out the rest if I could get the data types into the right format: I'd just compare whether each date is >= or < indexdate to sort it into before/after, then subtract indexdate from each date and sort. Whatever I do with xr_date_list, I'd have to do the same with xr_studyid_list to keep track of the unique study IDs.
Edit: Desired output dataframe would look like
ID_string indexdate StudyIDBefore StudyDateBefore
0 55555555 2020-09-07 ['St33', 'St1', ...] [2020-09-06, 2019-06-22, ...]
1 66666666 2020-06-07 ['St11', 'St2', ...] [2020-05-22, 2020-05-01, ...]
Where the "before" variables would be sorted from nearest to furthest, and similar "after columns would exist. My current goal is just to check if a study exists within 3 days before and after this indexdate, but having the above dataframe would give me the flexibility if I need to start looking beyond the nearest study.
Think I found my own answer after spending some more time with the pandas to_datetime documentation. Basically, I realized I could convert my list of date strings using pd.to_datetime:
date_list = pd.to_datetime(df.loc[df['ID_string'] == x, 'studydate_concat'].values[0]).values
Then I could subtract my index date from this list. I opted to do this within a temporary dataframe so I could keep track of the other column values (like study ID, modality, etc.).
Full code is below:
XRonindex, XRwi3days = {}, {}  # results keyed by ID_string
for x in df['ID_string'].values:
    index_date = df.loc[df['ID_string'] == x, 'indexdate'].values[0]
    date_list = pd.to_datetime(df.loc[df['ID_string'] == x, 'studydate_concat'].values[0]).values
    modality_list = df.loc[df['ID_string'] == x, 'modality_concat'].values[0]
    studyid_list = df.loc[df['ID_string'] == x, 'studyid_concat'].values[0]
    tempdata = list(zip(studyid_list, date_list, modality_list))
    tempdf = pd.DataFrame(tempdata, columns=['studyid', 'studydate', 'modality'])
    tempdf['indexdate'] = index_date
    tempdf['timedelta'] = tempdf['studydate'] - tempdf['indexdate']
    tempdf['study_done_wi_3daysbefore'] = np.where((tempdf['timedelta'] >= np.timedelta64(-3, 'D')) & (tempdf['timedelta'] < np.timedelta64(0, 'D')), True, False)
    tempdf['study_done_wi_3daysafter'] = np.where((tempdf['timedelta'] <= np.timedelta64(3, 'D')) & (tempdf['timedelta'] >= np.timedelta64(0, 'D')), True, False)
    tempdf['study_done_onindex'] = np.where(tempdf['timedelta'] == np.timedelta64(0, 'D'), True, False)
    XRonindex[x] = len(tempdf.loc[(tempdf['study_done_onindex']) & (tempdf['modality'] == 'XR'), 'studyid']) > 0
    XRwi3days[x] = len(tempdf.loc[(tempdf['study_done_wi_3daysbefore']) & (tempdf['modality'] == 'XR'), 'studyid']) > 0
    # can later map these values back to my original dataframe as a new column
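To also rank studies from closest to furthest, as the question mentions, the timedelta column can be split on its sign and sorted; a sketch building on tempdf inside the same loop:
# Sketch (building on tempdf above): XR studies ranked nearest-first
# before and after the index date.
xr = tempdf[tempdf['modality'] == 'XR']
before = xr[xr['timedelta'] < np.timedelta64(0, 'D')].sort_values('timedelta', ascending=False)
after = xr[xr['timedelta'] >= np.timedelta64(0, 'D')].sort_values('timedelta')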

Pandas: exact indexing on timeseries

I'm working on two pandas Timeseries object (the second coming from a group by on 30 minutes):
import numpy as np
import pandas as pd

df_lookup = pd.DataFrame(np.arange(10, 16),
                         index=('2017-12-15 17:58:00', '2017-12-15 17:59:00',
                                '2017-12-15 18:00', '2017-12-15 18:01:00',
                                '2017-12-15 18:02:00', '2017-12-15 18:03:00'))
df_lookup.index = pd.to_datetime(df_lookup.index)

avg_30min = pd.DataFrame([0.066627, 0.1234, 0.0432, 0.234],
                         index=("2017-12-15 18:00:00", "2017-12-15 18:30:00",
                                "2017-12-15 19:00:00", "2017-12-15 19:30:00"))
avg_30min.index = pd.to_datetime(avg_30min.index)
I need to iterate over the second, avg_30min, and lookup into the first, df_lookup in order to extract the value at index idx.
for idx, row in avg_30min.iterrows():
    value_in_lookup_df = df_lookup.loc[idx]
    # Here I'd use the object from the lookup to add a detail to a plot.
I tried using loc and iloc; the former returns:
KeyError: 'the label [2017-12-15 18:00:00] is not in the [index]'
while the latter:
TypeError: cannot do positional indexing on <class 'pandas.core.indexes.base.Index'> with these indexers [2017-12-15 18:00:00] of <class 'pandas._libs.tslib.Timestamp'>
The expected result would be the row from df_lookup whose index matches idx, somewhat similar to a dictionary lookup in plain Python (row_from_lookup = lookup_df[idx]).
What's the right method to have an exact match on a pandas Timeseries?
It looks like you want a merge on the index columns.
avg_30min.merge(df_lookup, left_index=True, right_index=True)

                          0_x  0_y
2017-12-15 18:00:00  0.066627   12

Alternatively, find the intersection of the indexes, and concatenate.
idx = avg_30min.index.intersection(df_lookup.index)
pd.concat([avg_30min.loc[idx], df_lookup.loc[idx]], axis=1, ignore_index=True)

                            0   1
2017-12-15 18:00:00  0.066627  12
Given a datetime.datetime object such as:
dt_obj = datetime.datetime(2017, 12, 15, 18, 0)
which can be extracted, for example, from another dataframe such as avg_30min in the example above, a lookup into a dataframe whose index has dtype='datetime64[ns]' can be performed with get_loc on the index:
>>> df_lookup.index.get_loc(dt_obj)
2
Then the position can be used to retrieve the requested row, with df_lookup.iloc[].
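For example, a short sketch using the df_lookup defined in the question:
import datetime

# Sketch: exact lookup via get_loc, then retrieve the row positionally.
dt_obj = datetime.datetime(2017, 12, 15, 18, 0)
row = df_lookup.iloc[df_lookup.index.get_loc(dt_obj)]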

pandas.DatetimeIndex frequency is None and can't be set

I created a DatetimeIndex from a "date" column:
sales.index = pd.DatetimeIndex(sales["date"])
Now the index looks as follows:
DatetimeIndex(['2003-01-02', '2003-01-03', '2003-01-04', '2003-01-06',
'2003-01-07', '2003-01-08', '2003-01-09', '2003-01-10',
'2003-01-11', '2003-01-13',
...
'2016-07-22', '2016-07-23', '2016-07-24', '2016-07-25',
'2016-07-26', '2016-07-27', '2016-07-28', '2016-07-29',
'2016-07-30', '2016-07-31'],
dtype='datetime64[ns]', name='date', length=4393, freq=None)
As you can see, the freq attribute is None. I suspect that errors down the road are caused by the missing freq. However, if I try to set the frequency explicitly, I get:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-148-30857144de81> in <module>()
1 #### DEBUG
----> 2 sales_train = disentangle(df_train)
3 sales_holdout = disentangle(df_holdout)
4 result = sarima_fit_predict(sales_train.loc[5002, 9990]["amount_sold"], sales_holdout.loc[5002, 9990]["amount_sold"])
<ipython-input-147-08b4c4ecdea3> in disentangle(df_train)
2 # transform sales table to disentangle sales time series
3 sales = df_train[["date", "store_id", "article_id", "amount_sold"]]
----> 4 sales.index = pd.DatetimeIndex(sales["date"], freq="d")
5 sales = sales.pivot_table(index=["store_id", "article_id", "date"])
6 return sales
/usr/local/lib/python3.6/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
89 else:
90 kwargs[new_arg_name] = new_arg_value
---> 91 return func(*args, **kwargs)
92 return wrapper
93 return _deprecate_kwarg
/usr/local/lib/python3.6/site-packages/pandas/core/indexes/datetimes.py in __new__(cls, data, freq, start, end, periods, copy, name, tz, verify_integrity, normalize, closed, ambiguous, dtype, **kwargs)
399 'dates does not conform to passed '
400 'frequency {1}'
--> 401 .format(inferred, freq.freqstr))
402
403 if freq_infer:
ValueError: Inferred frequency None from passed dates does not conform to passed frequency D
So apparently a frequency has been inferred, but is stored neither in the freq nor inferred_freq attribute of the DatetimeIndex - both are None. Can someone clear up the confusion?
You have a couple options here:
pd.infer_freq
pd.tseries.frequencies.to_offset
I suspect that errors down the road are caused by the missing freq.
You are absolutely right. Here's what I use often:
def add_freq(idx, freq=None):
    """Add a frequency attribute to idx, through inference or directly.

    Returns a copy.  If `freq` is None, it is inferred.
    """
    idx = idx.copy()
    if freq is None:
        if idx.freq is None:
            freq = pd.infer_freq(idx)
        else:
            return idx
    idx.freq = pd.tseries.frequencies.to_offset(freq)
    if idx.freq is None:
        raise AttributeError('no discernible frequency found to `idx`. Specify'
                             ' a frequency string with `freq`.')
    return idx
An example:
idx=pd.to_datetime(['2003-01-02', '2003-01-03', '2003-01-06']) # freq=None
print(add_freq(idx)) # inferred
DatetimeIndex(['2003-01-02', '2003-01-03', '2003-01-06'], dtype='datetime64[ns]', freq='B')
print(add_freq(idx, freq='D')) # explicit
DatetimeIndex(['2003-01-02', '2003-01-03', '2003-01-06'], dtype='datetime64[ns]', freq='D')
Using asfreq will actually reindex (fill) missing dates, so be careful of that if that's not what you're looking for.
The primary function for changing frequencies is the asfreq function.
For a DatetimeIndex, this is basically just a thin, but convenient
wrapper around reindex which generates a date_range and calls reindex.
It seems to relate to missing dates, as 3kt notes. You might be able to "fix" this with asfreq('D') as EdChum suggests, but that gives you a continuous index with missing data values. It works fine for some sample data I made up:
df=pd.DataFrame({ 'x':[1,2,4] },
index=pd.to_datetime(['2003-01-02', '2003-01-03', '2003-01-06']) )
df
Out[756]:
x
2003-01-02 1
2003-01-03 2
2003-01-06 4
df.index
Out[757]: DatetimeIndex(['2003-01-02', '2003-01-03', '2003-01-06'],
dtype='datetime64[ns]', freq=None)
Note that freq=None. If you apply asfreq('D'), this changes to freq='D':
df.asfreq('D')
Out[758]:
x
2003-01-02 1.0
2003-01-03 2.0
2003-01-04 NaN
2003-01-05 NaN
2003-01-06 4.0
df.asfreq('d').index
Out[759]:
DatetimeIndex(['2003-01-02', '2003-01-03', '2003-01-04', '2003-01-05',
'2003-01-06'],
dtype='datetime64[ns]', freq='D')
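As the quoted note above says, asfreq is essentially a reindex onto a generated date_range; a sketch with the same df shows the equivalence:
# Sketch (using the df defined above): asfreq('D') is essentially a reindex
# onto a generated daily date_range; both leave NaN on the missing days.
full_range = pd.date_range(df.index[0], df.index[-1], freq='D')
df.reindex(full_range)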
More generally, and depending on what exactly you are trying to do, you might want to check out the following for other options like reindex & resample: Add missing dates to pandas dataframe
I'm not sure if earlier versions have this, but recent pandas allows this simple solution:
# 'b' stands for business days
# 'w' for weekly, 'd' for daily, and you get the idea...
df.index.freq = 'b'
It can happen if, for example, the dates you are passing aren't sorted.
Look at this example:
example_ts = pd.Series(data=range(10),
index=pd.date_range('2020-01-01', '2020-01-10', freq='D'))
example_ts.index = pd.DatetimeIndex(np.hstack([example_ts.index[-1:],
example_ts.index[:-1]]), freq='D')
The previous code runs into your error because the dates are not sequential.
example_ts = pd.Series(data=range(10),
index=pd.date_range('2020-01-01', '2020-01-10', freq='D'))
example_ts.index = pd.DatetimeIndex(np.hstack([example_ts.index[:-1],
example_ts.index[-1:]]), freq='D')
This one runs correctly, instead.
I was having the same error. I was not able to resolve my issue with the suggestions posted above, but solved it using the solution below:
Pandas DatetimeIndex + seasonal_decompose = missing frequency.
Similar to some of the other answers here, my problem was that my data had missing dates.
Instead of dealing with this issue in Python, I opted to change my SQL query that I was using to source the data. So instead of skipping dates, I wrote the query such that it would fill in missing dates with the value 0.
It seems to be an issue with missing values in the index. I simply rebuilt the index, based on the original index, at the frequency I needed:
df.index = pd.date_range(start=df.index[0], end=df.index[-1], freq="h")
