Tricky slicing specifications on business-day datetimeindex - python

I have a pandas dataframe with a business-day-based DatetimeIndex. For each month that's in the index, I also have a single 'marker' day specified.
Here's a toy version of that dataframe:
# a dataframe with business dates as the index
df = pd.DataFrame(list(range(91)), pd.date_range('2015-04-01', '2015-6-30'), columns=['foo']).resample('B').last()
# each month has a single, arbitrary marker day specified
marker_dates = [df.index[12], df.index[33], df.index[57]]
For each month in the index, I need to calculate the average of the foo column over a specific slice of rows in that month.
There are two different ways I need to be able to specify those slices:
1) m'th day to n'th day.
An example might be the 2nd to 4th business day in that month. So April would be the average of 1 (Apr 2), 4 (Apr 3), and 5 (Apr 6) = 3.33. May would be 33 (May 4), 34 (May 5), 35 (May 6) = 34. I don't count weekends/holidays that don't occur in the index as days.
2) m'th day before/after the marker date to the n'th day before/after the marker date.
Example might be "average of the slice from 1 day before the marker date to 1 day after the marker date in each month" Eg. In April, the marker date is 17Apr. Looking at the index, we want the average of apr16, apr17, and apr20.
For Example 1, I had an ugly solution: for each month, I would slice out the rows of that month and then apply df_slice.iloc[m:n].mean().
Whenever I start doing iterative things with pandas, I always suspect I'm doing it wrong. So I imagine there is a cleaner, pythonic/vectorized way to produce this result for all the months.
For Example 2, I don't know a good way to do this slice-averaging based on arbitrary dates across many months.

Use BDay() from pandas.tseries.offsets
import pandas as pd
from pandas.tseries.offsets import BDay
M=2
N=4
# pd.datetime is no longer available in current pandas; pd.Timestamp does the same job
start_date = pd.Timestamp(2015, 4, 1)
end_date = pd.Timestamp(2015, 6, 30)
df = pd.DataFrame(list(range(91)), pd.date_range('2015-04-01', '2015-6-30'), columns=['foo']).resample('B').last()
# for month starts
marker_dates = pd.date_range(start=start_date, end=end_date, freq='BMS')
# create IntervalIndex
bins = pd.IntervalIndex.from_tuples([ (d + (M-1)*BDay(), d + (N-1)*BDay()) for d in marker_dates ], closed='both')
df.groupby(pd.cut(df.index, bins)).mean()
#[2015-04-02, 2015-04-06] 3.333333
#[2015-05-04, 2015-05-06] 34.000000
#[2015-06-02, 2015-06-04] 63.000000
# any markers
marker_dates = [df.index[12], df.index[33], df.index[57]]
# M Bday before, and N Bday after
bins = pd.IntervalIndex.from_tuples([ (d - M*BDay(), d + N*BDay()) for d in marker_dates ], closed='both')
df.groupby(pd.cut(df.index, bins)).mean()
#[2015-04-15, 2015-04-23] 18.428571
#[2015-05-14, 2015-05-22] 48.000000
#[2015-06-17, 2015-06-25] 81.428571

The most pythonic/vectorized (pandonic?) way to do this might be to use df.rolling and df.shift to generate the window over which you'll take the average, then df.reindex to select the value at the dates you've marked.
For your example (2), this could look like:
df['foo'].rolling(3).mean().shift(-1).reindex(marker_dates)
Out[8]:
2015-04-17 17.333333
2015-05-18 47.000000
2015-06-19 80.333333
Name: foo, dtype: float64
This could be wrapped in a small function:
def window_mean_at_indices(df, indices, begin=-1, end=1):
    return df.rolling(1+end-begin).mean().shift(-end).reindex(indices)
Helping to make it more clear how to apply this to situation (1):
month_starts = pd.date_range(df.index.min(), df.index.max(), freq='BMS')
month_starts
Out[11]: DatetimeIndex(['2015-04-01', '2015-05-01', '2015-06-01'],
dtype='datetime64[ns]', freq='BMS')
window_mean_at_indices(df['foo'], month_starts, begin=1, end=3)
Out[12]:
2015-04-01 3.333333
2015-05-01 34.000000
2015-06-01 63.000000
Freq: BMS, Name: foo, dtype: float64

For your first problem you can use grouper and iloc i.e
low = 2
high= 4
slice_mean = df.groupby(pd.Grouper(level=0,freq='m')).apply(lambda x : x.iloc[low-1:high].mean())
# or df.resample('m').apply(lambda x : x.iloc[low-1:high].mean())
foo
2015-04-30 3.333333
2015-05-31 34.000000
2015-06-30 63.000000
For your second problem you can concat the rows around each marker date and take the groupby mean per month, i.e.
# needs `import numpy as np`; the pd.np alias is no longer available in current pandas
idx = np.where(df.index.isin(marker_dates))[0]
#array([12, 33, 57])
temp = pd.concat([df.iloc[(idx+i)] for i in [-1,0,1]])
foo
2015-04-16 15
2015-05-15 46
2015-06-18 78
2015-04-17 18
2015-05-18 47
2015-06-19 81
2015-04-20 19
2015-05-19 48
2015-06-22 82
# Groupby mean
temp.groupby(pd.Grouper(level=0,freq='m')).mean()
# or temp.resample('m').mean()
foo
2015-04-30 17.333333
2015-05-31 47.000000
2015-06-30 80.333333
dtype: float64
Since the index of the output isn't specified in the question, do let us know what it should be.
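For the first case, the per-month apply can also be avoided by numbering the rows within each month and filtering on that number. The sketch below assumes the toy frame from the question; `day_in_month` and `slice_mean` are names made up for this illustration.

```python
import pandas as pd

# Same toy frame as in the question
df = pd.DataFrame(
    {"foo": range(91)},
    index=pd.date_range("2015-04-01", "2015-06-30"),
).resample("B").last()

m, n = 2, 4  # 1-based and inclusive: 2nd to 4th business day of each month

month = df.index.to_period("M")
# 1-based position of every row within its own month
day_in_month = df.groupby(month).cumcount() + 1
selected = df[day_in_month.between(m, n)]
slice_mean = selected.groupby(selected.index.to_period("M"))["foo"].mean()
print(slice_mean)
```

The result is indexed by monthly periods rather than month-end dates, which sidesteps the question of which day should label each month.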

Here's what I managed to come up with:
Import pandas and setup the dataframe
import pandas as pd
# older pandas applied mean() implicitly on resample; current versions need it spelled out
df = pd.DataFrame(list(range(91)), pd.date_range('2015-04-01', '2015-6-30'), columns=['foo']).resample('B').mean()
Start with a pure list of marker dates, since I'm guessing that what you're really starting with:
marker_dates = [
pd.to_datetime('2015-04-17', format='%Y-%m-%d'),
pd.to_datetime('2015-05-18', format='%Y-%m-%d'),
pd.to_datetime('2015-06-19', format='%Y-%m-%d')
]
marker_df = pd.DataFrame([], columns=['marker', 'start', 'end', 'avg'])
marker_df['marker'] = marker_dates
For the case where you want to just test ranges, input the start and end manually here instead of calculating it. If you want to change the range you can change the arguments to shift():
marker_df['start'] = df.index.shift(-1)[df.index.isin(marker_df['marker'])]
marker_df['end'] = df.index.shift(1)[df.index.isin(marker_df['marker'])]
Finally, use DataFrame.apply() to do a row-by-row calculation of averages and store them in the 'avg' column:
marker_df['avg'] = marker_df.apply(
    lambda x: df[(x['start'] <= df.index) & (df.index <= x['end'])]['foo'].mean(),
    axis=1
)
Which gives us this result:
marker start end avg
0 2015-04-17 2015-04-16 2015-04-20 17.000000
1 2015-05-18 2015-05-15 2015-05-19 46.666667
2 2015-06-19 2015-06-18 2015-06-22 80.000000
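If the index carries no frequency (so that index.shift is not available), the same window can be taken by row position instead. This is a minimal sketch, assuming every marker date is present in the index; `before`, `after` and `window_means` are illustrative names, and the frame is built with .last() as in the question, so the numbers match the question's setup rather than the table above.

```python
import pandas as pd

df = pd.DataFrame(
    {"foo": range(91)},
    index=pd.date_range("2015-04-01", "2015-06-30"),
).resample("B").last()
marker_dates = [df.index[12], df.index[33], df.index[57]]

before, after = 1, 1  # rows to include before / after each marker

# Integer position of each marker in the index (-1 would mean "not found")
pos = df.index.get_indexer(marker_dates)
window_means = pd.Series(
    [df["foo"].iloc[max(p - before, 0): p + after + 1].mean() for p in pos],
    index=pd.DatetimeIndex(marker_dates),
)
print(window_means)
```

Because the slice is positional, weekends and holidays that are absent from the index are skipped automatically.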

Related

Pandas - Compute sum of a column as week-wise columns

I have a table like below containing values for multiple IDs:
ID  value  date
1   20     2022-01-01 12:20
2   25     2022-01-04 18:20
1   10     2022-01-04 11:20
1   150    2022-01-06 16:20
2   200    2022-01-08 13:20
3   40     2022-01-04 21:20
1   75     2022-01-09 08:20
I would like to calculate week wise sum of values for all IDs:
The start date is given (for example, 01-01-2022).
Weeks are calculated based on range:
every Saturday 00:00 to next Friday 23:59 (i.e. Week 1 is from 01-01-2022 00:00 to 07-01-2022 23:59)
ID  Week 1 sum  Week 2 sum  Week 3 sum  ...
1   180         75          --          --
2   25          200         --          --
3   40          --          --          --
There's a pandas function (pd.Grouper) that allows you to specify a groupby instruction (note 1). In this case, that specification is to "resample" date by a weekly frequency anchored on Fridays, so that each week ends on a Friday (note 2). Since you need to group by ID as well, add it to the groupby.
# convert to datetime
df['date'] = pd.to_datetime(df['date'])
# pivot the dataframe
df1 = (
df.groupby(['ID', pd.Grouper(key='date', freq='W-FRI')])['value'].sum()
.unstack(fill_value=0)
)
# rename columns
df1.columns = [f"Week {c} sum" for c in range(1, df1.shape[1]+1)]
df1 = df1.reset_index()
Note 1: What you actually need is a pivot_table result, but groupby + unstack is equivalent and more convenient here.
Note 2: Because Jan 1, 2022 is a Saturday, anchoring on Friday makes each week run from Saturday to Friday.
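Since the question supplies an explicit start date, another option is to number the weeks by counting whole 7-day blocks from that date. The sketch below rebuilds the question's table; the `week` column and the `weekly` name are assumptions made for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2, 1, 1, 2, 3, 1],
    "value": [20, 25, 10, 150, 200, 40, 75],
    "date": pd.to_datetime([
        "2022-01-01 12:20", "2022-01-04 18:20", "2022-01-04 11:20",
        "2022-01-06 16:20", "2022-01-08 13:20", "2022-01-04 21:20",
        "2022-01-09 08:20",
    ]),
})

start = pd.Timestamp("2022-01-01")  # the given start date (a Saturday)
# Whole days since the start, split into 7-day blocks; week numbers are 1-based
df["week"] = (df["date"] - start).dt.days // 7 + 1

weekly = df.pivot_table(index="ID", columns="week", values="value",
                        aggfunc="sum", fill_value=0)
weekly.columns = [f"Week {w} sum" for w in weekly.columns]
print(weekly)
```

Only weeks that contain at least one row become columns, and the week labels stay correct even if an entire week has no data.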
You can compute a week column. If all your data falls within a single year, the week number alone is enough, but that is unlikely in real-world data. If the data spans multiple years, it is wiser to derive a combination of year and week number.
df['year_weeknumber'] = df['date'].dt.strftime('%Y-%U')
In your case the dates 2022-01-01 and 2022-01-04 18:20 should both be converted to 2022-01 under the scenario you described. Note that %U counts weeks starting on Sunday, so 2022-01-01 (a Saturday) comes out as 2022-00 and needs adjusting if your weeks start on Saturday.
To calculate your pivot table, you can use the pandas pivot_table. Example code:
pd.pivot_table(df, values='value', index=['ID'], columns=['year_weeknumber'], aggfunc='sum')
Let's define a formatting helper.
def fmt(row):
    return f"{row.year}-{row.week:02d}"  # We ignore row.day
Now it's easy.
>>> df = pd.DataFrame([dict(id=1, value=20, date="2022-01-01 12:20"),
dict(id=2, value=25, date="2022-01-04 18:20")])
>>> df['date'] = pd.to_datetime(df.date)
>>> df['iso'] = df.date.dt.isocalendar().apply(fmt, axis='columns')
>>> df
id value date iso
0 1 20 2022-01-01 12:20:00 2021-52
1 2 25 2022-01-04 18:20:00 2022-01
Just groupby the ISO week.
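A minimal sketch of that last step, building the year-week key without a row-wise apply. The three-row frame and the `weekly` name are assumptions for illustration. Keep in mind that ISO weeks run Monday to Sunday, so they do not coincide with the Saturday-to-Friday weeks in the question.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 1],
    "value": [20, 25, 10],
    "date": pd.to_datetime(
        ["2022-01-01 12:20", "2022-01-04 18:20", "2022-01-04 11:20"]
    ),
})

iso = df["date"].dt.isocalendar()  # columns: year, week, day
# Zero-padded "YYYY-WW" key built from the ISO year and ISO week
df["iso"] = iso["year"].astype(str) + "-" + iso["week"].astype(str).str.zfill(2)

weekly = df.groupby(["id", "iso"])["value"].sum().unstack(fill_value=0)
print(weekly)
```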

Group by list of different time ranges in Pandas

Edit: Changing example to use Timedelta indices.
I have a DataFrame of different time ranges that represent indices in my main DataFrame. eg:
ranges = pd.DataFrame(data=np.array([[1,10,20],[3,15,30]]).T, columns=["Start","Stop"])
ranges = ranges.apply(pd.to_timedelta, unit="s")
ranges
Start Stop
0 0 days 00:00:01 0 days 00:00:03
1 0 days 00:00:10 0 days 00:00:15
2 0 days 00:00:20 0 days 00:00:30
my_data= pd.DataFrame(data=list(range(0,40*5,5)), columns=["data"])
my_data.index = pd.to_timedelta(my_data.index, unit="s")
I want to calculate the averages of the data in my_data for each of the time ranges in ranges. How can I do this?
One option would be as follows:
ranges.apply(lambda row: my_data.loc[row["Start"]:row["Stop"]].iloc[:-1].mean(), axis=1)
data
0 7.5
1 60.0
2 122.5
But can we do this without apply?
Here is one way to approach it:
Generate the timedeltas and concatenate into a single block:
# note the use of closed='left' (`Stop` is not included in the build)
timedelta = [pd.timedelta_range(a,b, closed='left', freq='1s')
for a, b in zip(ranges.Start, ranges.Stop)]
timedelta = timedelta[0].append(timedelta[1:])
Get the grouping which will be used for the groupby and aggregation:
counts = ranges.Stop.sub(ranges.Start).dt.seconds
counts = np.arange(counts.size).repeat(counts)
Group by and aggregate:
my_data.loc[timedelta].groupby(counts).mean()
data
0 7.5
1 60.0
2 122.5
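Another way to avoid apply is to assign each sample to a range with np.searchsorted. This is only a sketch: it assumes the ranges are sorted by Start and do not overlap, treats Stop as exclusive (like the examples above), and uses made-up names (`which`, `inside`, `means`).

```python
import numpy as np
import pandas as pd

ranges = pd.DataFrame({
    "Start": pd.to_timedelta([1, 10, 20], unit="s"),
    "Stop": pd.to_timedelta([3, 15, 30], unit="s"),
})
my_data = pd.DataFrame(
    {"data": np.arange(0, 40 * 5, 5)},
    index=pd.to_timedelta(np.arange(40), unit="s"),
)

# Work in plain seconds so that NumPy can do the comparisons
t = my_data.index.total_seconds().to_numpy()
starts = ranges["Start"].dt.total_seconds().to_numpy()
stops = ranges["Stop"].dt.total_seconds().to_numpy()

# Index of the last range whose Start is <= t (-1 if t precedes every range)
which = np.searchsorted(starts, t, side="right") - 1
# Keep only samples that also fall before that range's Stop
inside = (which >= 0) & (t < stops[which.clip(min=0)])

means = pd.Series(my_data["data"].to_numpy()[inside]).groupby(which[inside]).mean()
print(means)
```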

Iterate over pd df with date column by week python

I have a one month DataFrame with a datetime object column and a bunch of functions I want to apply to it - by week. So I want to loop over the DataFrame and apply the functions to each week. How do I iterate over weekly time periods?
Here is some code that generates random datetime data similar to my DataFrame:
np.random.seed(123)
n = 500
df = pd.DataFrame(
{'date':pd.to_datetime(
pd.DataFrame( { 'year': np.random.choice(range(2017,2019), size=n),
'month': np.random.choice(range(1,2), size=n),
'day': np.random.choice(range(1,28), size=n)
} )
) }
)
df['random_num'] = np.random.choice(range(0,1000), size=n)
My week length is inconsistent (sometimes I have 1,000 tweets per week, sometimes 100,000). Could someone please give me an example of how to loop over this dataframe by week? (I don't need aggregation or groupby functions.)
If you really don't want to use groupby and aggregations then:
for week in df['date'].dt.isocalendar().week.unique():  # Series.dt.week was removed in pandas 2.0
    this_weeks_data = df[df['date'].dt.isocalendar().week == week]
This will, of course, go wrong if you have data from more than one year.
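If the data does span more than one year, iterating over a groupby keyed on pd.Grouper still yields one chunk of rows per calendar week without aggregating anything. A small sketch on made-up random data; `weekly_chunks` is a hypothetical container used only to show where your own functions would be called.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(123)
df = pd.DataFrame({
    # random days spread over roughly 13 months, so two calendar years
    "date": pd.Timestamp("2017-01-01")
            + pd.to_timedelta(rng.integers(0, 400, size=500), unit="D"),
    "random_num": rng.integers(0, 1000, size=500),
})

weekly_chunks = {}
for week_end, chunk in df.groupby(pd.Grouper(key="date", freq="W")):
    if chunk.empty:  # weeks without any rows may show up as empty groups
        continue
    # `chunk` holds all rows of the week ending on `week_end` (a Sunday);
    # call your own functions on it here
    weekly_chunks[week_end] = chunk

print(len(weekly_chunks), "non-empty weeks")
```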
Given your sample dataframe
date random_num
0 2017-01-01 214
1 2018-01-19 655
2 2017-01-24 663
3 2017-01-26 723
4 2017-01-01 974
First, you can try to set the index to datetime object as follows
df.set_index(df.date, inplace=True)
df.drop('date', axis=1, inplace=True)
This sets the index to the date column and drops the original column. You will get
>>> df.head()
date random_num
2017-01-01 214
2018-01-19 655
2017-01-24 663
2017-01-26 723
2017-01-01 974
Then you can use the pandas groupby function to group the data as per your frequency and apply any function of your choice.
# To group by week and count the number of occurrences
>>> df.groupby(pd.Grouper(freq='W')).count().head()
date random_num
2017-01-01 11
2017-01-08 65
2017-01-15 55
2017-01-22 66
2017-01-29 45
# To group by week and sum the random numbers per week
>>> df.groupby(pd.Grouper(freq='W')).sum().head()
date random_num
2017-01-01 7132
2017-01-08 33916
2017-01-15 31028
2017-01-22 31509
2017-01-29 22129
You can also apply any generic function myFunction by using the apply method of pandas
df.groupby(pd.Grouper(freq='W')).apply(myFunction)
If you want to apply a function myFunction to any specific column columnName after grouping, you can also do that as follows
df.groupby(pd.Grouper(freq='W'))[columnName].apply(myFunction)
[SOLVED FOR MULTIPLE YEARS]
pd.Grouper(freq='W') works fine but sometimes I have come across some undesired behaviors related to how weeks are split when the number of weeks are not even. So this is why I sometimes prefer to do the week split by hand like shown in this example.
So, having a dataset that spans in multiple years
import numpy as np
import pandas as pd
import datetime
# Create dataset
np.random.seed(123)
n = 100000
date = pd.to_datetime({
'year': np.random.choice(range(2017, 2020), size=n),
'month': np.random.choice(range(1, 13), size=n),
'day': np.random.choice(range(1, 28), size=n)
})
random_num = np.random.choice(
range(0, 1000),
size=n)
df = pd.DataFrame({'date': date, 'random_num': random_num})
Such as:
print(df.head())
date random_num
0 2019-12-11 413
1 2018-06-08 594
2 2019-08-06 983
3 2019-10-11 73
4 2017-09-19 32
First create a helper index that allows you to iterate by week (considering the year as well):
df['grp_idx'] = df['date'].apply(
lambda x: '%s-%s' % (x.year, '{:02d}'.format(x.week)))
print(df.head())
date random_num grp_idx
0 2019-12-11 413 2019-50
1 2018-06-08 594 2018-23
2 2019-08-06 983 2019-32
3 2019-10-11 73 2019-41
4 2017-09-19 32 2017-38
Then just apply your function that makes a computation on the weekly-subset, something like this:
def something_to_do_by_week(week_data):
    """
    Computes the mean random value.
    """
    return week_data['random_num'].mean()
weekly_mean = df.groupby('grp_idx').apply(something_to_do_by_week)
print(weekly_mean.head())
grp_idx
2017-01 515.875668
2017-02 487.226704
2017-03 503.371681
2017-04 497.717647
2017-05 475.323420
Once you have your weekly metrics you'll probably want to get back to actual dates, which are more useful than year-week indices:
def from_year_week_to_date(year_week):
    """
    Approximates a date for a 'YYYY-WW' label as
    January 1 of that year plus week * 7 days.
    """
    year, week = year_week.split('-')
    year, week = int(year), int(week)
    date = pd.to_datetime('%s-01-01' % year)
    date += datetime.timedelta(days=week * 7)
    return date
weekly_mean.index = [from_year_week_to_date(x) for x in weekly_mean.index]
print(weekly_mean.head())
2017-01-08 515.875668
2017-01-15 487.226704
2017-01-22 503.371681
2017-01-29 497.717647
2017-02-05 475.323420
dtype: float64
Finally, now you can make nice plots with nice interpretable dates:
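Just as a sanity check, the computation using pd.Grouper(freq='W') gives me almost the same results. It adds an extra week at the beginning of the pd.Series because 2017-01-01 is a Sunday and weekly bins end on Sunday by default, so that day forms a bin of its own.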
df.set_index('date').groupby(
pd.Grouper(freq='W')
).mean().head()
Out[27]:
random_num
date
2017-01-01 532.736364
2017-01-08 515.875668
2017-01-15 487.226704
2017-01-22 503.371681
2017-01-29 497.717647
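A by-hand split that keeps real dates as labels can also be sketched with weekly periods, which avoids converting year-week strings back to dates. The four-row frame and the `week_start` column are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(
        ["2017-01-01", "2017-01-02", "2018-12-31", "2019-12-11"]
    ),
    "random_num": [10, 20, 30, 40],
})

# Monday-to-Sunday weekly periods; start_time is the Monday of each week
week = df["date"].dt.to_period("W")
df["week_start"] = week.dt.start_time

weekly_mean = df.groupby("week_start")["random_num"].mean()
print(weekly_mean)
```

Dates near a year boundary (such as 2017-01-01 or 2018-12-31) are assigned to the week they actually fall in, regardless of the calendar year.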

How Can I Detect Gaps and Consecutive Periods In A Time Series In Pandas

I have a pandas Dataframe that is indexed by Date. I would like to select all consecutive gaps by period and all consecutive days by Period. How can I do this?
Example of Dataframe with No Columns but a Date Index:
In [29]: import pandas as pd
In [30]: dates = pd.to_datetime(['2016-09-19 10:23:03', '2016-08-03 10:53:39','2016-09-05 11:11:30', '2016-09-05 11:10:46','2016-09-05 10:53:39'])
In [31]: ts = pd.DataFrame(index=dates)
As you can see there is a gap from 2016-08-03 and 2016-09-19. How do I detect these so I can create descriptive statistics, i.e. 40 gaps, with median gap duration of "x", etc. Also, I can see that 2016-09-05 and 2016-09-06 is a two day range. How I can detect these and also print descriptive stats?
Ideally the result would be returned as another Dataframe in each case since I want use other columns in the Dataframe to groupby.
Pandas version 1.0.1 has a built-in method DataFrame.diff() which you can use to accomplish this. One benefit is you can use pandas series functions like mean() to quickly compute summary statistics on the gaps series object
from datetime import datetime, timedelta
import pandas as pd
# Construct dummy dataframe
dates = pd.to_datetime([
'2016-08-03',
'2016-08-04',
'2016-08-05',
'2016-08-17',
'2016-09-05',
'2016-09-06',
'2016-09-07',
'2016-09-19'])
df = pd.DataFrame(dates, columns=['date'])
# Take the diff of the first column (drop 1st row since it's undefined)
deltas = df['date'].diff()[1:]
# Filter diffs (here days > 1, but could be seconds, hours, etc)
gaps = deltas[deltas > timedelta(days=1)]
# Print results
print(f'{len(gaps)} gaps with average gap duration: {gaps.mean()}')
for i, g in gaps.items():  # Series.iteritems() was removed in pandas 2.0
    gap_start = df['date'][i - 1]
    print(f'Start: {datetime.strftime(gap_start, "%Y-%m-%d")} | '
          f'Duration: {str(g.to_pytimedelta())}')
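The same diff can be used for the other half of the question, the consecutive periods: start a new run whenever the step to the previous date exceeds one day, then summarise each run. A sketch assuming sorted daily dates; `run_id` and `runs` are illustrative names.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime([
    "2016-08-03", "2016-08-04", "2016-08-05", "2016-08-17",
    "2016-09-05", "2016-09-06", "2016-09-07", "2016-09-19",
]))

# True wherever a date is more than one day after the previous one
new_run = dates.diff() > pd.Timedelta(days=1)
# The running count of breaks gives every consecutive stretch its own id
run_id = new_run.cumsum()

runs = dates.groupby(run_id).agg(start="min", end="max", days="count")
print(runs)
```

The result is a DataFrame with one row per run, so describe() or further groupby calls work on it directly.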
here's something to get started:
df = pd.DataFrame(np.ones(5),columns = ['ones'])
df.index = pd.DatetimeIndex(['2016-09-19 10:23:03', '2016-08-03 10:53:39', '2016-09-05 11:11:30', '2016-09-05 11:10:46', '2016-09-06 10:53:39'])
daily_rng = pd.date_range('2016-08-03 00:00:00', periods=48, freq='D')
daily_rng = daily_rng.append(df.index)
daily_rng = sorted(daily_rng)
df = df.reindex(daily_rng).fillna(0)
df = df.astype(int)
df['ones'] = df['ones'].cumsum()
The cumsum() creates a grouping variable on 'ones', partitioning your data at the points you provided. If you print df (or export it to, say, a spreadsheet) it will make sense:
print(df.head())
ones
2016-08-03 00:00:00 0
2016-08-03 10:53:39 1
2016-08-04 00:00:00 1
2016-08-05 00:00:00 1
2016-08-06 00:00:00 1
print(df.tail())
ones
2016-09-16 00:00:00 4
2016-09-17 00:00:00 4
2016-09-18 00:00:00 4
2016-09-19 00:00:00 4
2016-09-19 10:23:03 5
now to complete:
df = df.reset_index()
# named aggregation; the nested-dict renaming syntax was removed in pandas 1.0
df = df.groupby('ones')['index'].agg(first_spotted='min', gaps='count')
which gives:
first_spotted gaps
ones
0 2016-08-03 00:00:00 1
1 2016-08-03 10:53:39 34
2 2016-09-05 11:10:46 1
3 2016-09-05 11:11:30 2
4 2016-09-06 10:53:39 14
5 2016-09-19 10:23:03 1

Add missing dates to pandas dataframe

My data can have multiple events on a given date or NO events on a date. I take these events, get a count by date and plot them. However, when I plot them, my two series don't always match.
idx = pd.date_range(df['simpleDate'].min(), df['simpleDate'].max())
s = df.groupby(['simpleDate']).size()
In the above code idx becomes a range of say 30 dates. 09-01-2013 to 09-30-2013
However, s may only have 25 or 26 days because no events happened on a given date. I then get an AssertionError as the sizes don't match when I try to plot:
fig, ax = plt.subplots()
ax.bar(idx.to_pydatetime(), s, color='green')
What's the proper way to tackle this? Do I want to remove dates with no values from IDX or (which I'd rather do) is add to the series the missing date with a count of 0. I'd rather have a full graph of 30 days with 0 values. If this approach is right, any suggestions on how to get started? Do I need some sort of dynamic reindex function?
Here's a snippet of S ( df.groupby(['simpleDate']).size() ), notice no entries for 04 and 05.
09-02-2013 2
09-03-2013 10
09-06-2013 5
09-07-2013 1
You could use Series.reindex:
import pandas as pd
idx = pd.date_range('09-01-2013', '09-30-2013')
s = pd.Series({'09-02-2013': 2,
'09-03-2013': 10,
'09-06-2013': 5,
'09-07-2013': 1})
s.index = pd.DatetimeIndex(s.index)
s = s.reindex(idx, fill_value=0)
print(s)
yields
2013-09-01 0
2013-09-02 2
2013-09-03 10
2013-09-04 0
2013-09-05 0
2013-09-06 5
2013-09-07 1
2013-09-08 0
...
A quicker workaround is to use .asfreq(). This doesn't require creation of a new index to call within .reindex().
# "broken" (staggered) dates
dates = pd.Index([pd.Timestamp('2012-05-01'),
pd.Timestamp('2012-05-04'),
pd.Timestamp('2012-05-06')])
s = pd.Series([1, 2, 3], dates)
print(s.asfreq('D'))
2012-05-01 1.0
2012-05-02 NaN
2012-05-03 NaN
2012-05-04 2.0
2012-05-05 NaN
2012-05-06 3.0
Freq: D, dtype: float64
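asfreq also accepts a fill_value, which covers the "count of 0" case from the question in one step. A short sketch reusing the series above; `filled` is an illustrative name.

```python
import pandas as pd

dates = pd.Index([pd.Timestamp("2012-05-01"),
                  pd.Timestamp("2012-05-04"),
                  pd.Timestamp("2012-05-06")])
s = pd.Series([1, 2, 3], dates)

# Days that are newly introduced by the upsampling get 0 instead of NaN
filled = s.asfreq("D", fill_value=0)
print(filled)
```

Note that asfreq only spans the first to the last date already present in the index.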
One issue is that reindex will fail if there are duplicate values. Say we're working with timestamped data, which we want to index by date:
df = pd.DataFrame({
'timestamps': pd.to_datetime(
['2016-11-15 1:00','2016-11-16 2:00','2016-11-16 3:00','2016-11-18 4:00']),
'values':['a','b','c','d']})
df.index = pd.DatetimeIndex(df['timestamps']).floor('D')
df
yields
timestamps values
2016-11-15 "2016-11-15 01:00:00" a
2016-11-16 "2016-11-16 02:00:00" b
2016-11-16 "2016-11-16 03:00:00" c
2016-11-18 "2016-11-18 04:00:00" d
Due to the duplicate 2016-11-16 date, an attempt to reindex:
all_days = pd.date_range(df.index.min(), df.index.max(), freq='D')
df.reindex(all_days)
fails with:
...
ValueError: cannot reindex from a duplicate axis
(by this it means the index has duplicates, not that it is itself a dup)
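Instead, on older pandas versions we can use .loc to look up entries for all dates in range (since pandas 1.0, .loc raises a KeyError when any requested label is missing):
df.loc[all_days]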
yields
timestamps values
2016-11-15 "2016-11-15 01:00:00" a
2016-11-16 "2016-11-16 02:00:00" b
2016-11-16 "2016-11-16 03:00:00" c
2016-11-17 NaN NaN
2016-11-18 "2016-11-18 04:00:00" d
fillna can be used on the column series to fill blanks if needed.
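On current pandas versions, where .loc raises a KeyError for missing labels, a left join from an empty frame indexed by all days gives a comparable result. This is a sketch of a swapped-in technique, not the original answer's method; `filled` is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({
    "timestamps": pd.to_datetime(
        ["2016-11-15 1:00", "2016-11-16 2:00",
         "2016-11-16 3:00", "2016-11-18 4:00"]),
    "values": ["a", "b", "c", "d"],
})
df.index = pd.DatetimeIndex(df["timestamps"]).floor("D")

all_days = pd.date_range(df.index.min(), df.index.max(), freq="D")
# Left join on the index: duplicate dates are kept, missing dates become NaN rows
filled = pd.DataFrame(index=all_days).join(df)
print(filled)
```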
An alternative approach is resample, which can handle duplicate dates in addition to missing dates. For example:
df.resample('D').mean()
resample is a deferred operation like groupby so you need to follow it with another operation. In this case mean works well, but you can also use many other pandas methods like max, sum, etc.
Here is the original data, but with an extra entry for '2013-09-03':
val
date
2013-09-02 2
2013-09-03 10
2013-09-03 20 <- duplicate date added to OP's data
2013-09-06 5
2013-09-07 1
And here are the results:
val
date
2013-09-02 2.0
2013-09-03 15.0 <- mean of original values for 2013-09-03
2013-09-04 NaN <- NaN b/c date not present in orig
2013-09-05 NaN <- NaN b/c date not present in orig
2013-09-06 5.0
2013-09-07 1.0
I left the missing dates as NaNs to make it clear how this works, but you can add fillna(0) to replace NaNs with zeroes as requested by the OP or alternatively use something like interpolate() to fill with non-zero values based on the neighboring rows.
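For counts like the OP's, aggregating with sum instead of mean fills the empty days with 0 directly, because the sum of an empty bin is 0. A minimal sketch on the data shown above; `daily` is an illustrative name.

```python
import pandas as pd

s = pd.Series(
    [2, 10, 20, 5, 1],
    index=pd.to_datetime(["2013-09-02", "2013-09-03", "2013-09-03",
                          "2013-09-06", "2013-09-07"]),
)

# Duplicate dates are added together; days with no rows sum to 0
daily = s.resample("D").sum()
print(daily)
```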
Here's a nice method to fill in missing dates into a dataframe, with your choice of fill_value, days_back to fill in, and sort order (date_order) by which to sort the dataframe:
# needs: from datetime import datetime, timedelta
def fill_in_missing_dates(df, date_col_name='date', date_order='asc', fill_value=0, days_back=30):
    df = df.set_index(date_col_name, drop=True)
    df.index = pd.DatetimeIndex(df.index)
    d = datetime.now().date()
    d2 = d - timedelta(days=days_back)
    idx = pd.date_range(d2, d, freq="D")
    df = df.reindex(idx, fill_value=fill_value)
    df[date_col_name] = pd.DatetimeIndex(df.index)
    # the original never used date_order; sort here so the parameter has an effect
    return df.sort_index(ascending=(date_order == 'asc'))
You can always just use DataFrame.merge() utilizing a left join from an 'All Dates' DataFrame to the 'Missing Dates' DataFrame. Example below.
# example DataFrame with missing dates between min(date) and max(date)
missing_df = pd.DataFrame({
'date':pd.to_datetime([
'2022-02-10'
,'2022-02-11'
,'2022-02-14'
,'2022-02-14'
,'2022-02-24'
,'2022-02-16'
])
,'value':[10,20,5,10,15,30]
})
# first create a DataFrame with all dates between specified start<-->end using pd.date_range()
all_dates = pd.DataFrame(pd.date_range(missing_df['date'].min(), missing_df['date'].max()), columns=['date'])
# from the all_dates DataFrame, left join onto the DataFrame with missing dates
new_df = all_dates.merge(right=missing_df, how='left', on='date')
s.asfreq('D').interpolate().asfreq('Q')
