DatetimeIndex: what is the purpose of the 'freq' attribute?

I don't see the point of the 'freq' attribute of a pandas DatetimeIndex object. It can be passed at construction time or set later as a property, but I don't see any difference in the behaviour of the DatetimeIndex object when this property changes.
Please look at this example. We add 1 day to a DatetimeIndex that has freq='B', but the returned index contains non-business days:
import pandas as pd
from pandas.tseries.offsets import BDay
rng = pd.date_range('2012-01-05', '2012-01-10', freq=BDay())
index = pd.DatetimeIndex(rng)
print(index)
index2 = index + pd.Timedelta('1D')
print(index2)
This is the output:
DatetimeIndex(['2012-01-05', '2012-01-06', '2012-01-09', '2012-01-10'], dtype='datetime64[ns]', freq='B')
DatetimeIndex(['2012-01-06', '2012-01-07', '2012-01-10', '2012-01-11'], dtype='datetime64[ns]', freq='B')
Why isn't freq considered when performing computation (+/- Timedelta) on the DatetimeIndex?
Why doesn't freq reflect the actual data contained in the DatetimeIndex? (It says 'B' even though the index contains non-business days.)

You are looking for shift, which advances each element by the index's own freq:
index.shift(1)
Out[336]: DatetimeIndex(['2012-01-06', '2012-01-09', '2012-01-10', '2012-01-11'], dtype='datetime64[ns]', freq='B')
Adding a BDay offset will do that too:
from pandas.tseries.offsets import BDay
index + BDay(1)
Out[340]: DatetimeIndex(['2012-01-06', '2012-01-09', '2012-01-10', '2012-01-11'], dtype='datetime64[ns]', freq='B')

From github issue:
The freq attribute is meant to be purely descriptive, so it doesn't
and shouldn't impact calculations. Potentially docs could be clearer.
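Since freq is purely descriptive, one way to check what frequency the values themselves actually follow is pd.infer_freq (a small sketch):

```python
import pandas as pd

idx = pd.date_range('2012-01-05', '2012-01-10', freq='B')
# freq is just metadata; infer_freq inspects the actual spacing of the values
print(pd.infer_freq(idx))  # 'B'

# after naive +1 day arithmetic the values are no longer business-day spaced,
# so no frequency can be inferred from them
shifted = pd.DatetimeIndex(idx.values) + pd.Timedelta('1D')
print(pd.infer_freq(shifted))  # None
```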


discarding all elements of datetimeindex except first and last

I have the following datetimeindex:
dates = DatetimeIndex(['2022-03-01', '2022-03-02', '2022-03-03', '2022-03-04',
'2022-03-05', '2022-03-06', '2022-03-07', '2022-03-08',
'2022-03-09', '2022-03-10',
...
'2022-06-06', '2022-06-07', '2022-06-08', '2022-06-09',
'2022-06-10', '2022-06-11', '2022-06-12', '2022-06-13',
'2022-06-14', '2022-06-15'],
dtype='datetime64[ns]', length=107, freq='D')
I want to discard all elements except the first and last one. How do I do that? I tried this:
[dates[0]] + [dates[-1]], but that returns a plain list of Timestamps and not this:
DatetimeIndex(['2022-03-01', '2022-06-15'],
dtype='datetime64[ns]', length=2, freq='D')
Index with a list to select multiple items.
>>> dates[[0, -1]]
DatetimeIndex(['2022-03-01', '2022-06-15'], dtype='datetime64[ns]', freq=None)
This is covered in the NumPy user guide under Integer array indexing. In the Pandas user guide, there's related info under Selection by position.
Here's a way to do it:
print(dates[::len(dates)-1])
Output:
DatetimeIndex(['2022-03-01', '2022-06-15'], dtype='datetime64[ns]', freq=None)
This slices straight from the first element to the last by using a step of len(dates)-1 (explanation suggested by @wjandrea).
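As a quick check, both approaches return the same two endpoints (a small sketch on a freshly built range):

```python
import pandas as pd

dates = pd.date_range('2022-03-01', '2022-06-15', freq='D')  # 107 daily dates
first_last = dates[[0, -1]]          # integer-array indexing
stepped = dates[::len(dates) - 1]    # slice with a step that jumps start -> end
print(first_last.equals(stepped))    # True: same element values
```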

How can I undo a time series conversion of a pandas dataframe?

I set the index of my dataframe to a time series:
new_data.index = pd.DatetimeIndex(new_data.index)
How can I convert this timeseries data back into the original string format?
Pandas index objects often have methods equivalent to those available to series. Here you can use pd.Index.astype:
df = pd.DataFrame(index=['2018-01-01', '2018-05-15', '2018-12-25'])
df.index = pd.DatetimeIndex(df.index)
# DatetimeIndex(['2018-01-01', '2018-05-15', '2018-12-25'],
# dtype='datetime64[ns]', freq=None)
df.index = df.index.astype(str)
# Index(['2018-01-01', '2018-05-15', '2018-12-25'], dtype='object')
Note strings in Pandas are stored in object dtype series. If you need a specific format, this can also be accommodated:
df.index = df.index.strftime('%d-%b-%Y')
# Index(['01-Jan-2018', '15-May-2018', '25-Dec-2018'], dtype='object')
See Python's strftime directives for conventions.
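Putting the pieces above together, the conversion round-trips cleanly when you stick with the default ISO format (a sketch; the column name 'v' is made up):

```python
import pandas as pd

df = pd.DataFrame({'v': [1, 2, 3]}, index=['2018-01-01', '2018-05-15', '2018-12-25'])
df.index = pd.DatetimeIndex(df.index)   # string index -> datetime index
back = df.index.astype(str)             # datetime index -> object-dtype strings
print(list(back))
# the strings parse back to the same datetimes
print(pd.DatetimeIndex(back).equals(df.index))  # True
```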

Randomly select n dates from pandas date_range

Given a date, I'm using pandas date_range to generate additional 30 dates:
import pandas as pd
from datetime import timedelta
pd.date_range(startdate, startdate + timedelta(days=30))
Out of these 30 dates, how can I randomly select 10 dates in order starting from date in first position and ending with date in last position?
Use np.random.choice to choose a specified number of items from a given set of choices.
In order to guarantee that the first and last dates are preserved, I pull them out explicitly and choose 8 more dates at random.
I then pass them back to pd.to_datetime and sort_values to ensure they stay in order.
import numpy as np

dates = pd.date_range('2011-04-01', periods=30, freq='D')
random_dates = pd.to_datetime(
    np.concatenate([
        np.random.choice(dates[1:-1], size=8, replace=False),
        dates[[0, -1]]
    ])
).sort_values()
random_dates
DatetimeIndex(['2011-04-01', '2011-04-02', '2011-04-03', '2011-04-13',
'2011-04-14', '2011-04-21', '2011-04-22', '2011-04-26',
'2011-04-27', '2011-04-30'],
dtype='datetime64[ns]', freq=None)
You can use numpy.random.choice with replace=False if you don't need to explicitly keep the first and last values (if you do, use the other answer):
a = pd.date_range('2011-04-01', periods=30, freq='D')
print (pd.to_datetime(np.sort(np.random.choice(a, size=10, replace=False))))
DatetimeIndex(['2011-04-01', '2011-04-03', '2011-04-05', '2011-04-09',
'2011-04-12', '2011-04-17', '2011-04-22', '2011-04-24',
'2011-04-29', '2011-04-30'],
dtype='datetime64[ns]', freq=None)
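The two answers above can be combined into one reproducible sketch; the seeded Generator is an assumption added only so the output is repeatable:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # seeded only so the sketch is reproducible
dates = pd.date_range('2011-04-01', periods=30, freq='D')
picked = pd.to_datetime(
    np.concatenate([
        dates[[0, -1]].values,                                  # keep the endpoints
        rng.choice(dates[1:-1].values, size=8, replace=False)   # 8 more, no repeats
    ])
).sort_values()
print(len(picked), picked[0].date(), picked[-1].date())
```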

How to include end date in pandas date_range method?

From pd.date_range('2016-01', '2016-05', freq='M', ).strftime('%Y-%m'), the last month is 2016-04, but I was expecting it to be 2016-05. It seems to me this function is behaving like the range method, where the end parameter is not included in the returning array.
Is there a way to get the end month included in the returning array, without processing the string for the end month?
Here is a way to do it without figuring out month ends yourself:
pd.date_range(*(pd.to_datetime(['2016-01', '2016-05']) + pd.offsets.MonthEnd()), freq='M')
DatetimeIndex(['2016-01-31', '2016-02-29', '2016-03-31', '2016-04-30',
'2016-05-31'],
dtype='datetime64[ns]', freq='M')
You can use .union to add the next logical value after initializing the date_range. It should work as written for any frequency:
d = pd.date_range('2016-01', '2016-05', freq='M')
d = d.union([d[-1] + d.freq]).strftime('%Y-%m')
Alternatively, you can use period_range instead of date_range. Depending on what you intend to do, this might not be the right thing to use, but it satisfies your question:
pd.period_range('2016-01', '2016-05', freq='M').strftime('%Y-%m')
In either case, the resulting output is as expected:
['2016-01' '2016-02' '2016-03' '2016-04' '2016-05']
For the later crowd: you can also use the month-start frequency 'MS'.
>>> pd.date_range('2016-01', '2016-05', freq='MS')
DatetimeIndex(['2016-01-01', '2016-02-01', '2016-03-01', '2016-04-01',
'2016-05-01'],
dtype='datetime64[ns]', freq='MS')
Include the day when specifying the dates in the date_range call:
pd.date_range('2016-01-31', '2016-05-31', freq='M', ).strftime('%Y-%m')
array(['2016-01', '2016-02', '2016-03', '2016-04', '2016-05'],
dtype='|S7')
I had a similar problem when using datetime objects in dataframe. I would set the boundaries through .min() and .max() functions and then fill in missing dates using the pd.date_range function. Unfortunately the returned list/df was missing the maximum value.
I found two work arounds for this:
1) Add the closed=None parameter to the pd.date_range call. This worked in the example below; however, it didn't work for me when working only with dataframes (no idea why).
2) If option #1 doesn't work, add one extra unit (in this case a day) using datetime.timedelta(). In the example below it over-indexes by a day, but it can help if date_range isn't giving you the full range.
import pandas as pd
import datetime as dt
#List of dates as strings
time_series = ['2020-01-01', '2020-01-03', '2020-01-05', '2020-01-06', '2020-01-07']
#Creates dataframe with time data that is converted to datetime object
raw_data_df = pd.DataFrame(pd.to_datetime(time_series), columns = ['Raw_Time_Series'])
#Creates an indexed_time list that includes missing dates and the full time range
#Option No. 1 is to use the closed = None parameter choice.
indexed_time = pd.date_range(start = raw_data_df.Raw_Time_Series.min(),end = raw_data_df.Raw_Time_Series.max(),freq='D',closed= None)
print('indexed_time option #1 = ', indexed_time)
#Option No. 2 if the function allows you to extend the time by one unit (in this case day)
#by using the datetime.timedelta function to get what you need.
indexed_time = pd.date_range(start = raw_data_df.Raw_Time_Series.min(),end = raw_data_df.Raw_Time_Series.max()+dt.timedelta(days=1),freq='D')
print('indexed_time option #2 = ', indexed_time)
#In this case you over-index by an extra day because the date_range function works properly
#However, if the closed=None parameter doesn't extend through the full range, this is a good workaround
I don't think so. You need to add the (n+1) boundary:
pd.date_range('2016-01', '2016-06', freq='M' ).strftime('%Y-%m')
The start and end dates are strictly inclusive. So it will not
generate any dates outside of those dates if specified.
http://pandas.pydata.org/pandas-docs/stable/timeseries.html
Either way, you have to manually add some information. I believe adding just one more month is not a lot of work.
The explanation for this issue is that the function pd.to_datetime() converts a '%Y-%m' date string by default to the first of the month datetime, or '%Y-%m-01':
>>> pd.to_datetime('2016-05')
Timestamp('2016-05-01 00:00:00')
>>> pd.date_range('2016-01', '2016-02')
DatetimeIndex(['2016-01-01', '2016-01-02', '2016-01-03', '2016-01-04',
'2016-01-05', '2016-01-06', '2016-01-07', '2016-01-08',
'2016-01-09', '2016-01-10', '2016-01-11', '2016-01-12',
'2016-01-13', '2016-01-14', '2016-01-15', '2016-01-16',
'2016-01-17', '2016-01-18', '2016-01-19', '2016-01-20',
'2016-01-21', '2016-01-22', '2016-01-23', '2016-01-24',
'2016-01-25', '2016-01-26', '2016-01-27', '2016-01-28',
'2016-01-29', '2016-01-30', '2016-01-31', '2016-02-01'],
dtype='datetime64[ns]', freq='D')
Then everything follows from that. Specifying freq='M' includes month ends between 2016-01-01 and 2016-05-01, which is the list you receive and excludes 2016-05-31. But specifying month starts 'MS' like the second answer provides, includes 2016-05-01 as it falls within the range. pd.date_range() default behavior isn't like the range method since ends are included. From the docs:
closed controls whether to include start and end that are on the boundary. The default includes boundary points on either end.
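A minimal sketch of the behaviour described above, showing that a boundary falling exactly on the frequency is included:

```python
import pandas as pd

# month starts: 2016-05-01 is exactly the end boundary, so it is included
starts = pd.date_range('2016-01-01', '2016-05-01', freq='MS')
print(starts[-1].date())  # 2016-05-01

# month ends: the last month end at or before 2016-05-01 is 2016-04-30
ends = pd.date_range('2016-01-01', '2016-05-01', freq='M')
print(ends[-1].date())  # 2016-04-30
```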

'Index' object has no attribute 'tz_localize'

I'm trying to convert all instances of 'GMT' time in a time/date column ('Created_At') in a csv file so that it is all formatted in 'EST'.
Please see below:
import pandas as pd
from pandas.tseries.resample import TimeGrouper
from pandas.tseries.offsets import DateOffset
from pandas.tseries.index import DatetimeIndex
cambridge = pd.read_csv('\Users\cgp\Desktop\Tweets.csv')
cambridge['Created_At'] = pd.to_datetime(pd.Series(cambridge['Created_At']))
cambridge.set_index('Created_At', drop=False, inplace=True)
cambridge.index = cambridge.index.tz_localize('GMT').tz_convert('EST')
cambridge.index = cambridge.index - DateOffset(hours = 12)
The error I'm getting is:
cambridge.index = cambridge.index.tz_localize('GMT').tz_convert('EST')
AttributeError: 'Index' object has no attribute 'tz_localize'
I've tried various different things but am stumped as to why the Index object won't recognized the tz_attribute. Thank you so much for your help!
Replace
cambridge.set_index('Created_At', drop=False, inplace=True)
with
cambridge.set_index(pd.DatetimeIndex(cambridge['Created_At']), drop=False, inplace=True)
Hmm. Like the other current tz_localize question, this works fine for me. Does this work for you? I have simplified some of the calls a bit from your example:
from numpy.random import randn
df2 = pd.DataFrame(randn(3, 3), columns=['A', 'B', 'C'])
# randn(3,3) returns nine random numbers in a 3x3 array.
# the columns argument to DataFrame names the 3 columns.
# no datetimes here! (look at df2 to check)
df2['A'] = pd.to_datetime(df2['A'])
# convert the random numbers to datetimes -- look at df2 again
# if A had values to_datetime couldn't handle, we'd clean up A first
df2.set_index('A',drop=False, inplace=True)
# and use that column as an index for the whole df2;
df2.index = df2.index.tz_localize('GMT').tz_convert('US/Eastern')
# make it timezone-conscious in GMT and convert that to Eastern
df2.index.tzinfo
<DstTzInfo 'US/Eastern' LMT-1 day, 19:04:00 STD>
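For reference, here is the same fix as a self-contained sketch without the CSV; the sample timestamps are made up, and only the column name 'Created_At' is taken from the question:

```python
import pandas as pd

df = pd.DataFrame({'Created_At': ['2014-01-01 12:00:00', '2014-01-02 15:30:00'],
                   'Text': ['hello', 'world']})
df['Created_At'] = pd.to_datetime(df['Created_At'])
# build a proper DatetimeIndex so tz_localize is available on the index
df = df.set_index(pd.DatetimeIndex(df['Created_At']))
df.index = df.index.tz_localize('GMT').tz_convert('US/Eastern')
print(df.index[0])  # 12:00 GMT becomes 07:00 Eastern (EST, UTC-5, in January)
```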
