Randomly select n dates from pandas date_range - python

Given a date, I'm using pandas date_range to generate additional 30 dates:
import pandas as pd
from datetime import timedelta
pd.date_range(startdate, startdate + timedelta(days=30))
Out of these dates, how can I randomly select 10, kept in order, so that the selection starts with the first date and ends with the last date?

Use np.random.choice to choose a specified number of items from a given set of choices.
To guarantee that the first and last dates are preserved, I pull them out explicitly and choose 8 more dates at random from the remaining ones.
I then pass the result to pd.to_datetime and call sort_values so the dates stay in order.
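import numpy as np
import pandas as pd

dates = pd.date_range('2011-04-01', periods=30, freq='D')
random_dates = pd.to_datetime(
    np.concatenate([
        # 8 distinct dates drawn from everything except the first and last
        np.random.choice(dates[1:-1], size=8, replace=False),
        # always keep the first and last date
        dates[[0, -1]]
    ])
).sort_values()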
random_dates
DatetimeIndex(['2011-04-01', '2011-04-02', '2011-04-03', '2011-04-13',
'2011-04-14', '2011-04-21', '2011-04-22', '2011-04-26',
'2011-04-27', '2011-04-30'],
dtype='datetime64[ns]', freq=None)
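The same idea can be wrapped in a small reusable function. This is only a sketch: the name sample_dates_keep_ends and its seed parameter are made up for illustration, and it samples integer positions with numpy's Generator API rather than the dates themselves.

```python
import numpy as np
import pandas as pd


def sample_dates_keep_ends(dates, n, seed=None):
    """Return n sorted dates from `dates`, always including the first and last.

    Hypothetical helper, not part of pandas or numpy.
    """
    if not 2 <= n <= len(dates):
        raise ValueError("n must be between 2 and len(dates)")
    rng = np.random.default_rng(seed)
    # Draw n - 2 distinct interior positions (everything except 0 and len - 1).
    interior = rng.choice(np.arange(1, len(dates) - 1), size=n - 2, replace=False)
    # Add the two end positions and sort so the result stays in date order.
    positions = np.sort(np.concatenate([[0], interior, [len(dates) - 1]]))
    return dates[positions]


dates = pd.date_range('2011-04-01', periods=30, freq='D')
picked = sample_dates_keep_ends(dates, 10, seed=0)
print(picked)
```

Passing a seed makes the selection reproducible; leave it as None for a different draw on every call.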

You can use numpy.random.choice with replace=False if you do not need to explicitly keep the first and last values (if you do, use the other answer):
a = pd.date_range('2011-04-01', periods=30, freq='D')
print(pd.to_datetime(np.sort(np.random.choice(a, size=10, replace=False))))
DatetimeIndex(['2011-04-01', '2011-04-03', '2011-04-05', '2011-04-09',
'2011-04-12', '2011-04-17', '2011-04-22', '2011-04-24',
'2011-04-29', '2011-04-30'],
dtype='datetime64[ns]', freq=None)

Related

Is there a combined frequency argument in pd.date_range() function?

How can I use two weekdays in one frequency? I would like to select only Mondays and Wednesdays.
The code below generates only Wednesdays:
pd.date_range('11/21/2019', periods=5, freq='W-WED')
I don't think pd.date_range supports a combined frequency string like the one you need. Instead, construct two DatetimeIndex objects, take their union, and slice it to get the desired output:
ix_mon = pd.date_range('11/21/2019', periods=5, freq='W-MON')
ix_wed = pd.date_range('11/21/2019', periods=5, freq='W-WED')
ix_mw = ix_mon.union(ix_wed)[:5]
Out[806]:
DatetimeIndex(['2019-11-25', '2019-11-27', '2019-12-02', '2019-12-04',
'2019-12-09'],
dtype='datetime64[ns]', freq=None)
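An alternative sketch, assuming it is acceptable to generate a daily range first and filter it afterwards: keep only the entries whose dayofweek is 0 (Monday) or 2 (Wednesday). The 30-day window is an arbitrary choice that is simply long enough to contain five matches.

```python
import pandas as pd

# Daily range starting at the same date as in the question.
days = pd.date_range('2019-11-21', periods=30, freq='D')

# dayofweek: Monday == 0, Wednesday == 2.
mon_wed = days[days.dayofweek.isin([0, 2])][:5]
print(mon_wed)
```

This yields the same five dates as the union approach and extends to any set of weekdays by changing the list passed to isin.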

DatetimeIndex: what is the purpose of 'freq' attribute?

I don't see the point of the 'freq' attribute of a pandas DatetimeIndex object. It can be passed at construction time or set at any time as a property, but I don't see any difference in the behaviour of the DatetimeIndex object when this property changes.
Please look at this example. We add 1 day to a DatetimeIndex that has freq='B', but the returned index contains non-business days:
import pandas as pd
from pandas.tseries.offsets import BDay
rng = pd.date_range('2012-01-05', '2012-01-10', freq=BDay())
index = pd.DatetimeIndex(rng)
print(index)
index2 = index + pd.Timedelta('1D')
print(index2)
This is the output:
DatetimeIndex(['2012-01-05', '2012-01-06', '2012-01-09', '2012-01-10'], dtype='datetime64[ns]', freq='B')
DatetimeIndex(['2012-01-06', '2012-01-07', '2012-01-10', '2012-01-11'], dtype='datetime64[ns]', freq='B')
Why isn't freq considered when performing computation (+/- Timedelta) on the DatetimeIndex?
Why doesn't freq reflect the actual data contained in the DatetimeIndex? (It says 'B' even though it contains non-business days.)
You are looking for shift, which moves each entry by the given number of periods of the index's freq:
index.shift(1)
Out[336]: DatetimeIndex(['2012-01-06', '2012-01-09', '2012-01-10', '2012-01-11'], dtype='datetime64[ns]', freq='B')
Adding a BDay offset does the same:
from pandas.tseries.offsets import BDay
index + BDay(1)
Out[340]: DatetimeIndex(['2012-01-06', '2012-01-09', '2012-01-10', '2012-01-11'], dtype='datetime64[ns]', freq='B')
From a GitHub issue:
The freq attribute is meant to be purely descriptive, so it doesn't
and shouldn't impact calculations. Potentially docs could be clearer.
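To see the difference side by side, here is a minimal sketch that rebuilds the index from the question and compares plain Timedelta arithmetic with shift. The variable names are illustrative only.

```python
import pandas as pd

index = pd.date_range('2012-01-05', '2012-01-10', freq='B')

# Plain arithmetic ignores freq: every timestamp moves by exactly 24 hours.
plus_one_day = index + pd.Timedelta('1D')

# shift moves each timestamp by one period of the index's freq (one business day).
shifted = index.shift(1)

# Entries of the Timedelta result that landed on a Saturday or Sunday.
weekend_hits = plus_one_day[plus_one_day.dayofweek >= 5]
print(weekend_hits)
print(shifted)
```

Only the Timedelta result contains a weekend date (2012-01-07, a Saturday); the shifted index skips from Friday to Monday.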

How to delete a date from pandas date_range

So I have a pandas date_range like so
dates = pd.date_range(start='2005-1-1', end='2014-12-31', freq='D')
I want to remove all the extra days resulting from leap years.
I loop over it with a for loop:
for each in dates:
    if each.month == 2 and each.day == 29:
        print(each)  # I actually want to delete this item from dates
But my problem is that I don't know how to delete the item. The regular Python list methods and functions don't work.
I've looked everywhere on SO. I've looked at the documentation for pandas.date_range but found nothing.
Any help will be appreciated.
You probably want to use drop to remove the entries.
import pandas as pd
dates = pd.date_range(start='2005-1-1', end='2014-12-31', freq='D')
leap = []
for each in dates:
    if each.month == 2 and each.day == 29:
        leap.append(each)
dates = dates.drop(leap)
You could try storing the days and months separately and combining them into a boolean mask. Note that the mask must negate the combined condition "February and the 29th"; testing (days != 29) & (months != 2) would instead drop every 29th and all of February.
dates = pd.date_range(start='2005-1-1', end='2014-12-31', freq='D')  # All dates in the range
days = dates.day  # Day of month for every date
months = dates.month  # Month for every date
is_leap_day = (days == 29) & (months == 2)  # True only for 29 February
unwanted_dates = dates[is_leap_day]  # The dates to eliminate
dates = dates[~is_leap_day]  # Keep everything else
Just to check that the approach works, we can look at unwanted_dates, which holds the dates you wish to eliminate.
Output:
DatetimeIndex(['2008-02-29', '2012-02-29'], dtype='datetime64[ns]', freq=None)
If the dates are stored as strings in a DataFrame column rather than in a DatetimeIndex, you can try:
df = df[~df['Date'].str.contains('02-29')]
In place of Date, put the name of the column where the dates are stored.
This avoids the for loop, so it is faster to run.
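For a DatetimeIndex, the loop-free version can be sketched as a single boolean mask built from the month and day attributes. The names is_leap_day, removed and kept are illustrative.

```python
import pandas as pd

dates = pd.date_range(start='2005-01-01', end='2014-12-31', freq='D')

# True only where the date is 29 February.
is_leap_day = (dates.month == 2) & (dates.day == 29)

removed = dates[is_leap_day]   # the leap days
kept = dates[~is_leap_day]     # everything else

print(removed)
print(len(dates), len(kept))
```

The range covers two leap years (2008 and 2012), so exactly two entries are removed and every remaining year contributes 365 days.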

Filling in missing date/times in my pd.date_range

I have a column of data which looks like the following:
I am trying to set a range of the entire month:
rng = pd.date_range('2016-09-01 00:00:00', '2016-09-30 23:59:58', freq='S')
But my column of data (above) is missing a few hours, and I am unsure where (since my data has 2 million rows).
I tried to use the reindex command, but it seems to have filled everything with zeroes.
The code that I was using is as follows:
df = pd.DataFrame(df_csv)
rng = pd.date_range('2016-09-01 00:00:00', '2016-09-30 23:59:58', freq='S')
df = df.reindex(rng,fill_value=0.0)
How do I properly fill in the missing date/times without filling everything with a 0?
I think you need to call set_index on the date column first; then you can use reindex:
# cast column date if its dtype is not datetime
df.date = pd.to_datetime(df.date)
df = df.set_index('date').reindex(rng, fill_value=0.0)
Without set_index you get NaN in every row, because you are reindexing an integer index with datetime values; fill_value=0.0 then replaces all of those NaN with 0.0. Omit fill_value if the missing timestamps should stay NaN.
Also, if the date column is sorted, you can use a more general solution that takes the first and last values of the column:
start_date = df.date.iat[0]
end_date = df.date.iat[-1]
rng = pd.date_range(start_date, end_date, freq='S')
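A minimal sketch on toy data shows how leaving out fill_value keeps the gaps visible as NaN, which also answers where the data is missing. The column names date and value and the sample rows are assumptions, and the toy data uses a one-minute frequency instead of one second to keep it short.

```python
import pandas as pd

# Toy data with a gap: 00:02 and 00:03 are missing.
df = pd.DataFrame({
    'date': pd.to_datetime(['2016-09-01 00:00', '2016-09-01 00:01', '2016-09-01 00:04']),
    'value': [1.0, 2.0, 3.0],
})

# Full range from the first to the last timestamp in the (sorted) column.
rng = pd.date_range(df['date'].iat[0], df['date'].iat[-1], freq='min')

# Index by date first, then reindex; rows that did not exist become NaN.
filled = df.set_index('date').reindex(rng)

# The timestamps that were absent from the original data.
missing = filled.index[filled['value'].isna()]
print(filled)
print(missing)
```

From here the NaN rows can be left as they are, or filled with a method such as ffill or interpolate, depending on what the data represents.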

Python Pandas: Resample date range

I create a list of pandas datetimes with the following line:
range = pd.date_range(start = '5/3/2005', periods =5+1, freq='1D')
Is there a quick way to resample that list so that it spans the same range but with a frequency of 30 min? (so far I can only see applications of that logic to Series or DataFrames index, but no daterange)
One way is:
date_range = pd.date_range(start = '5/3/2005', periods =5+1, freq='1D')
new_date_range = date_range.to_series().asfreq('30 min').index
Also, range is a builtin function, so I would not call the variable "range".
I hope this helps.
You could also do
date_range = pd.date_range(start = '5/3/2005', periods =5+1, freq='1D')
new_date_range = pd.date_range(date_range.min(), date_range.max(), freq='30 min')
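As a quick check, the two approaches can be compared directly. This sketch uses the names daily, via_asfreq and via_range for illustration and writes the frequency as '30min'.

```python
import pandas as pd

daily = pd.date_range(start='5/3/2005', periods=5 + 1, freq='1D')

# Approach 1: go through a Series and change its frequency.
via_asfreq = daily.to_series().asfreq('30min').index

# Approach 2: build a new range between the same end points.
via_range = pd.date_range(daily.min(), daily.max(), freq='30min')

print(len(via_range))
print(via_asfreq.equals(via_range))
```

Both produce the same 241 timestamps: 5 days at 48 half-hour steps each, plus the final end point.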
