discarding all elements of datetimeindex except first and last - python

I have the following datetimeindex:
dates = DatetimeIndex(['2022-03-01', '2022-03-02', '2022-03-03', '2022-03-04',
'2022-03-05', '2022-03-06', '2022-03-07', '2022-03-08',
'2022-03-09', '2022-03-10',
...
'2022-06-06', '2022-06-07', '2022-06-08', '2022-06-09',
'2022-06-10', '2022-06-11', '2022-06-12', '2022-06-13',
'2022-06-14', '2022-06-15'],
dtype='datetime64[ns]', length=107, freq='D')
I want to discard all elements except the first and last one. How do I do that? I tried this:
[dates[0]] + [dates[-1]], but it returns a plain Python list of timestamps and not this:
DatetimeIndex(['2022-03-01', '2022-06-15'],
dtype='datetime64[ns]', length=2, freq='D')

Index with a list to select multiple items.
>>> dates[[0, -1]]
DatetimeIndex(['2022-03-01', '2022-06-15'], dtype='datetime64[ns]', freq=None)
This is covered in the NumPy user guide under Integer array indexing. In the Pandas user guide, there's related info under Selection by position.
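Equivalently, Index.take selects by position as well (a small sketch, assuming dates is the 107-day index from the question):
import pandas as pd

dates = pd.date_range('2022-03-01', '2022-06-15', freq='D')  # 107 daily dates
print(dates.take([0, dates.size - 1]))  # the two endpoints, same as dates[[0, -1]]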

Here's a way to do it:
print(dates[::len(dates)-1])
Output:
DatetimeIndex(['2022-03-01', '2022-06-15'], dtype='datetime64[ns]', freq=None)
This is slicing with a step of len(dates) - 1, which jumps straight from the first element to the last (explanation suggested by @wjandrea).
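A quick sketch of the same idea (again assuming the index from the question); note it needs at least two elements, otherwise the step would be zero:
import pandas as pd

dates = pd.date_range('2022-03-01', '2022-06-15', freq='D')
print(dates[::len(dates) - 1])  # step of 106 jumps straight from the first element to the last
# A single-element index would raise ValueError: slice step cannot be zero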

Related

Splitting datetime index to identify consecutive timestamps (Python)

I know there are several posts on how to split series based on consecutive values, and I have adopted some of their code, but I'm not sure what I am doing wrong.
I have a long datetimeindex ("times" below), and I want to split it to identify consecutive groupings. So, every time there is a gap in time longer than the normal time increment, I want it to split. The index is evenly incremented (10m between times).
times = pd.date_range(start=start_date, end=end_date, freq=frequency).difference(x.index)
splits = ((times-pd.Series(times).shift(-1)).abs() != frequency)
consec = np.split(times, splits)
"Splits" is a boolean array that accurately indicates where the splits should occur, so that seems to be working correctly.
However, when I actually use np.split, instead of splitting into sections, the output is like this, where it is only keeping the values that are at the split indices:
[DatetimeIndex([], dtype='datetime64[ns]', freq=None), DatetimeIndex(['2003-02-05 09:20:00'], dtype='datetime64[ns]', freq=None), DatetimeIndex([], dtype='datetime64[ns]', freq=None), DatetimeIndex([], dtype='datetime64[ns]', freq=None), DatetimeIndex([], dtype='datetime64[ns]', freq=None), DatetimeIndex([], dtype='datetime64[ns]', freq=None), DatetimeIndex([], dtype='datetime64[ns]', freq=None), DatetimeIndex(['2003-02-09 01:20:00'], dtype='datetime64[ns]', freq=None), DatetimeIndex([], dtype='datetime64[ns]', freq=None), DatetimeIndex([], dtype='datetime64[ns]', freq=None), DatetimeIndex([], .... etc
Any ideas on why this is happening?
The indices_or_sections argument of np.split accepts an array of integers.
To get one from splits (an array of bools) you can use np.where:
splits_indexes = np.where(splits)[0]  # np.where returns a tuple; [0] takes the integer positions
Then you can call
consec = np.split(times, splits_indexes + 1)
The +1 shifts each split point past the last element before a gap, so every new piece begins with the first timestamp after a gap.
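Putting it together on a small, made-up index (a sketch; the 10-minute frequency and the single gap are hypothetical, chosen just to illustrate the mechanics):
import numpy as np
import pandas as pd

# Hypothetical 10-minute index with one 50-minute gap after 09:20
times = pd.DatetimeIndex(['2003-02-05 09:00', '2003-02-05 09:10', '2003-02-05 09:20',
                          '2003-02-05 10:10', '2003-02-05 10:20'])
frequency = pd.Timedelta('10min')

# True wherever the step to the next timestamp differs from the expected frequency
splits = (pd.Series(times).diff(-1).abs() != frequency).to_numpy()
# Drop the last element (always True, because shifting leaves a NaT there),
# then convert the boolean mask to integer positions
split_indexes = np.where(splits[:-1])[0]
consec = np.split(times, split_indexes + 1)  # +1 so each piece starts after a gap
for part in consec:
    print(part)
# First piece: the three 09:xx timestamps; second piece: the two 10:xx timestamps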

Is there a combined frequency argument in pd.date_range() function?

How can I add only two days to my frequency? I would like to select Wednesdays and Mondays.
The code below only generates Wednesdays in my data.
pd.date_range('11/21/2019', periods=5, freq='W-WED')
I don't think pd.date_range supports combined frequency strings like this. In your case, you need to construct two DatetimeIndexes, then use union and slicing to get the desired output:
ix_mon = pd.date_range('11/21/2019', periods=5, freq='W-MON')
ix_wed = pd.date_range('11/21/2019', periods=5, freq='W-WED')
ix_mw = ix_mon.union(ix_wed)[:5]
Out[806]:
DatetimeIndex(['2019-11-25', '2019-11-27', '2019-12-02', '2019-12-04',
'2019-12-09'],
dtype='datetime64[ns]', freq=None)
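As an alternative (not from the original answer, and assuming a pandas version with custom business-day support), pd.bdate_range accepts freq='C' together with a weekmask, which should produce the Mondays and Wednesdays directly:
import pandas as pd

# Custom business-day frequency restricted to Mondays and Wednesdays
ix_mw = pd.bdate_range('11/21/2019', periods=5, freq='C', weekmask='Mon Wed')
print(ix_mw)
# Expected to match the union result above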

DatetimeIndex: what is the purpose of 'freq' attribute?

I don't understand the purpose of the 'freq' attribute of a pandas DatetimeIndex object. It can be passed at construction time or set at any time as a property, but I don't see any difference in the behaviour of the DatetimeIndex object when this property changes.
Please look at this example. We add 1 day to a DatetimeIndex that has freq='B', but the returned index contains non-business days:
import pandas as pd
from pandas.tseries.offsets import *
rng = pd.date_range('2012-01-05', '2012-01-10', freq=BDay())
index = pd.DatetimeIndex(rng)
print(index)
index2 = index + pd.Timedelta('1D')
print(index2)
This is the output:
DatetimeIndex(['2012-01-05', '2012-01-06', '2012-01-09', '2012-01-10'], dtype='datetime64[ns]', freq='B')
DatetimeIndex(['2012-01-06', '2012-01-07', '2012-01-10', '2012-01-11'], dtype='datetime64[ns]', freq='B')
Why isn't freq considered when performing computation (+/- Timedelta) on the DatetimeIndex?
Why doesn't freq reflect the actual data contained in the DatetimeIndex? (It says 'B' even though the index contains non-business days.)
You are looking for shift:
index.shift(1)
Out[336]: DatetimeIndex(['2012-01-06', '2012-01-09', '2012-01-10', '2012-01-11'], dtype='datetime64[ns]', freq='B')
Adding a BDay offset will do that too:
from pandas.tseries.offsets import BDay
index + BDay(1)
Out[340]: DatetimeIndex(['2012-01-06', '2012-01-09', '2012-01-10', '2012-01-11'], dtype='datetime64[ns]', freq='B')
From a GitHub issue:
The freq attribute is meant to be purely descriptive, so it doesn't
and shouldn't impact calculations. Potentially docs could be clearer.
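To see that descriptive freq put to use (a small aside, not from the original answer): index.freq holds the BDay offset itself, so adding it steps every element forward by one business day, matching shift(1) above:
import pandas as pd
from pandas.tseries.offsets import BDay

index = pd.date_range('2012-01-05', '2012-01-10', freq=BDay())
print(index + index.freq)  # index.freq is the BusinessDay offset
# Expected: DatetimeIndex(['2012-01-06', '2012-01-09', '2012-01-10', '2012-01-11'], dtype='datetime64[ns]', freq='B')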

Converting datetimeindex to timestamp for pd.date_range

I have two lists:
"max_" consists of datetime types:
2012-04-20 00:00:00
2012-11-29 00:00:00
2013-11-22 00:00:00
"min_" , consists of datetimeindex:
DatetimeIndex(['2012-07-11'], dtype='datetime64[ns]', name=u'Date', freq=None)
DatetimeIndex(['2013-02-05', '2013-10-23'], dtype='datetime64[ns]', name=u'Date', freq=None)
DatetimeIndex([], dtype='datetime64[ns]', name=u'Date', freq=None)
My expected output is to take a range from each max value to its respective min value, for example, the first one would be range (2012-04-20 to 2012-07-11). I've tried:
pd.date_range(max_, min_)
TypeError: Cannot convert input [DatetimeIndex(['2012-07-11'], dtype='datetime64[ns]', name=u'Date', freq=None)] of type <class 'pandas.core.indexes.datetimes.DatetimeIndex'> to Timestamp
I'm not sure how to get around the conversion part. Additionally, I'd like to use only the first value from each min_ index (and ignore any additional ones).
I think you just need to pick out the specific items from your lists that you want to build the range on:
pd.date_range(max_[0], min_[0][0])  # min_[0][0] pulls the Timestamp out of the one-element DatetimeIndex
If you are trying to print all the ranges:
for i, date in enumerate(max_):
    print(pd.date_range(date, min_[i][0]))
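Note that the third min_ entry in the question is an empty DatetimeIndex, so min_[i][0] would raise an IndexError for it. A sketch that pairs the two lists and skips the empty entries (assuming max_ and min_ line up element for element):
import pandas as pd

for mx, mn in zip(max_, min_):
    if len(mn):                          # skip empty DatetimeIndex entries
        print(pd.date_range(mx, mn[0]))  # mn[0] is a Timestamp, which date_range accepts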

Randomly select n dates from pandas date_range

Given a date, I'm using pandas date_range to generate 30 additional dates:
import pandas as pd
from datetime import timedelta
pd.date_range(startdate, startdate + timedelta(days=30))
Out of these 30 dates, how can I randomly select 10 dates, in order, starting with the date in the first position and ending with the date in the last position?
Use np.random.choice to choose a specified number of items from a given set of choices.
In order to guarantee that the first and last dates are preserved, I pull them out explicitly and choose 8 more dates at random.
I then pass them back to pd.to_datetime and sort_values to ensure they stay in order.
import numpy as np
import pandas as pd

dates = pd.date_range('2011-04-01', periods=30, freq='D')
random_dates = pd.to_datetime(
    np.concatenate([
        np.random.choice(dates[1:-1], size=8, replace=False),  # 8 random interior dates
        dates[[0, -1]]                                          # plus the first and last
    ])
).sort_values()
random_dates
DatetimeIndex(['2011-04-01', '2011-04-02', '2011-04-03', '2011-04-13',
'2011-04-14', '2011-04-21', '2011-04-22', '2011-04-26',
'2011-04-27', '2011-04-30'],
dtype='datetime64[ns]', freq=None)
You can use numpy.random.choice with replace=False if you don't need to explicitly keep the first and last values (if you do, use the other answer):
a = pd.date_range('2011-04-01', periods=30, freq='D')
print (pd.to_datetime(np.sort(np.random.choice(a, size=10, replace=False))))
DatetimeIndex(['2011-04-01', '2011-04-03', '2011-04-05', '2011-04-09',
'2011-04-12', '2011-04-17', '2011-04-22', '2011-04-24',
'2011-04-29', '2011-04-30'],
dtype='datetime64[ns]', freq=None)
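If reproducibility matters (an aside, not part of the original answers), a seeded numpy Generator can pick the positions, so the sample is the same on every run:
import numpy as np
import pandas as pd

dates = pd.date_range('2011-04-01', periods=30, freq='D')
rng = np.random.default_rng(42)                                # fixed seed
pos = np.sort(rng.choice(len(dates), size=10, replace=False))  # sample positions, keep order
print(dates[pos])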
