Difference between pandas resample 'M' and 'MS' - python

I'm using the resample function to convert daily data in a pandas DataFrame to monthly data. Reading the documentation, I found that I can pass rule='M' or rule='MS'. The first is described as "calendar month end" and the second as "calendar month begin". What is the difference between the two?

The difference is the date used to label each resampled group.
Here is an example:
import pandas as pd

date = pd.Series([0, 1, 2],
                 index=pd.to_datetime(['2022-01-15',
                                       '2022-01-20',
                                       '2022-02-15']))
date
2022-01-15    0
2022-01-20    1
2022-02-15    2
dtype: int64
# resampling MS:
date.resample('MS').mean()
2022-01-01 0.5
2022-02-01 2.0
Freq: MS, dtype: float64
# resampling M:
date.resample('M').mean()
2022-01-31 0.5
2022-02-28 2.0
Freq: M, dtype: float64
Note the difference in the index dates: with 'MS' each group is labelled with the first day of the month, with 'M' the last day. The underlying groups are the same calendar months either way, so the aggregated values are identical. (In recent pandas versions the 'M' alias is deprecated in favour of 'ME'.)
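Since only the labels differ, you can also convert one result into the other after the fact; a minimal sketch relabelling a 'M' result with month starts:
m = date.resample('M').mean()
m.index = m.index.to_period('M').to_timestamp()  # to_timestamp() defaults to the period start
m
2022-01-01    0.5
2022-02-01    2.0
dtype: float64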

Related

Pandas: resample hourly values to monthly values with offset

I want to aggregate a pandas.Series with an hourly DatetimeIndex to monthly values - while considering the offset to midnight.
Example
Consider the following (uniform) timeseries that spans about 1.5 months.
import pandas as pd
hours = pd.Series(1, pd.date_range('2020-02-23 06:00', freq='H', periods=1008))
hours
# 2020-02-23 06:00:00 1
# 2020-02-23 07:00:00 1
# ..
# 2020-04-05 04:00:00 1
# 2020-04-05 05:00:00 1
# Freq: H, Length: 1008, dtype: int64
I would like to sum these to months while taking into account that days start at 06:00 in this use case. The result should be:
2020-02-01 06:00:00 168
2020-03-01 06:00:00 744
2020-04-01 06:00:00 96
Freq: MS, dtype: int64
How do I do that?
What I've tried and what works
I can aggregate to days while considering the offset, using the offset parameter:
days = hours.resample('D', offset=pd.Timedelta('06:00:00')).sum()
days
# 2020-02-23 06:00:00 24
# 2020-02-24 06:00:00 24
# ..
# 2020-04-03 06:00:00 24
# 2020-04-04 06:00:00 24
# Freq: D, dtype: int64
Using the same method to aggregate to months does not work. The timestamps do not have a time component, and the values are incorrect:
months = hours.resample('MS', offset=pd.Timedelta('06:00:00')).sum()
months
# 2020-02-01 162 # wrong
# 2020-03-01 744
# 2020-04-01 102 # wrong
# Freq: MS, dtype: int64
I could do the aggregation to months as a second step after aggregating to days. In that case, the values are correct, but the time component is still missing from the timestamps:
days = hours.resample('D', offset=pd.Timedelta('06:00:00')).sum()
months = days.resample('MS', offset=pd.Timedelta('06:00:00')).sum()
months
# 2020-02-01 168
# 2020-03-01 744
# 2020-04-01 96
# Freq: MS, dtype: int64
My current workaround is adding the timedelta and resetting the frequency manually.
months.index += pd.Timedelta('06:00:00')
months.index.freq = 'MS'
months
# 2020-02-01 06:00:00 168
# 2020-03-01 06:00:00 744
# 2020-04-01 06:00:00 96
# Freq: MS, dtype: int64
Not too much of an improvement on your attempt, but you could write the resampling as
months = hours.resample('D', offset='06:00:00').sum().resample('MS').sum()
Changing the index labels still requires the workaround you've been using: adding the timedelta manually and setting freq to 'MS'.
Note that you can pass a string representation of the timedelta to offset.
The reason two resampling operations are needed is that when the resampling frequency is greater than 'D', the offset is ignored. Once the resample at the daily level is performed with the offset, the result can be resampled further without specifying the offset.
I believe this is buggy behaviour, and I agree with you that hours.resample('MS', offset='06:00:00').sum() should produce the expected result.
Essentially, there are two issues:
the binning is incorrect when an offset is applied and the frequency is greater than 'D': the offset is simply ignored.
the offset is not reflected in the final output; the labels are truncated to the start or end of the period. I'm not sure whether the behaviour you're expecting can be generalized for all users.
Note that there is a related bug report impacting resampling with offsets. It turns out that it and the issue you face have the same root cause.
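Putting the pieces together, the workaround can be packaged in a small helper (a sketch; monthly_sum_with_offset is a hypothetical name, and it relies on the same manual freq reset as above):
import pandas as pd

def monthly_sum_with_offset(s, offset='06:00:00'):
    # Bin to days first, where the offset is honoured, then to month starts.
    months = s.resample('D', offset=offset).sum().resample('MS').sum()
    # Shift the labels so they carry the offset, then restore the frequency.
    months.index = months.index + pd.Timedelta(offset)
    months.index.freq = 'MS'
    return months

monthly_sum_with_offset(hours)
# 2020-02-01 06:00:00    168
# 2020-03-01 06:00:00    744
# 2020-04-01 06:00:00     96
# Freq: MS, dtype: int64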

Grouping and summing time differences from pandas dataframe

I have a dataframe like in example below:
Timestamp ComponentName Utilization
18.10.2020-19:07.10 A Available
19.10.2020-21:07.10 A Available
19.10.2020-19:07.10 A In use
22.10.2020-19:07.10 A In use
25.10.2020-19:07.10 A In use
And desired output should be:
ComponentName Total_Inuse_time Total_Available_time
A 6 days 1 day 2 hours
Basically, I want the total in-use time and available time for each component.
I have tried grouping by component name and aggregating the time differences with sum, but could not get the desired result.
df['Timestamp'] = pd.to_datetime(df['Timestamp'], format='%d.%m.%Y-%H:%M.%S')
# Within each (component, state) group, take the difference between consecutive
# timestamps; the first row of a group has no predecessor, so fill with zero.
df['Timestamp'] = df.groupby(['ComponentName', 'Utilization'])['Timestamp'].diff().fillna(pd.Timedelta(0))
sums = df.groupby(['ComponentName', 'Utilization'])['Timestamp'].sum()
Output:
>>> sums
ComponentName Utilization
A Available 1 days 02:00:00
In use 6 days 00:00:00
Name: Timestamp, dtype: timedelta64[ns]
>>> sums['A']
Utilization
Available 1 days 02:00:00
In use 6 days 00:00:00
Name: Timestamp, dtype: timedelta64[ns]
>>> sums['A']['Available']
Timedelta('1 days 02:00:00')
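To reshape sums into the wide layout from the question (one row per component, one column per state), you can unstack the inner index level:
>>> sums.unstack('Utilization')
Utilization          Available          In use
ComponentName
A              1 days 02:00:00 6 days 00:00:00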

Pandas: align two time series given an "overlapping" region?

I have two time-series, A and B. The time series have different lengths and different dates and different frequencies.
The core issue is that I want to find the dates between the two time series that overlap.
This is easy enough to do, but there's a twist.
Imagine time series A has a value on 01/01/2020 and series B has a value on 02/01/2020. These are only one day apart, and UNLESS there's a better candidate (e.g. if B ALSO has a value on 01/01/2020), I want to just include this one.
So, really the twist consists of a "lag" of, say, 5 days. Then what I want to do is the following:
Consider all dates in time series A.
For each such date, find the corresponding date in time series B.
For each date for which a match in B was NOT found, set a lag = 1 day, and search in B again.
Repeat step 3 with lag = 2, 3, ..., 5 days.
Return a new time series C, which consists of the dates of A and the values of B, found by matching using the above procedure.
You can achieve this by, at step 2, finding the nearest date in B with idxmin:
import pandas as pd
import random
from random import random as r

random.seed(123456)

ts_a = pd.Series([pd.Timestamp(f"{d+1}-jan-2021") for d in range(5)])
base = pd.Timestamp("01-jan-2021")
one_day = pd.Timedelta("1D")
ts_b = pd.Series([base + int(10 * r()) * one_day for _ in range(5)])
print(ts_a)
print(ts_b)

for t in ts_a:
    # absolute distance from t to every date in ts_b; the smallest wins
    deltas = abs(ts_b - t)
    index_closest = deltas.idxmin()
    print(f"The closest day to {t} in ts_b is {ts_b.loc[index_closest]}")
0 2021-01-01
1 2021-01-02
2 2021-01-03
3 2021-01-04
4 2021-01-05
dtype: datetime64[ns]
0 2021-01-09
1 2021-01-08
2 2021-01-01
3 2021-01-02
4 2021-01-01
dtype: datetime64[ns]
The closest day to 2021-01-01 00:00:00 in ts_b is 2021-01-01 00:00:00
The closest day to 2021-01-02 00:00:00 in ts_b is 2021-01-02 00:00:00
The closest day to 2021-01-03 00:00:00 in ts_b is 2021-01-02 00:00:00
The closest day to 2021-01-04 00:00:00 in ts_b is 2021-01-02 00:00:00
The closest day to 2021-01-05 00:00:00 in ts_b is 2021-01-08 00:00:00
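If you also want the 5-day cutoff (i.e. drop dates in A that have no match in B within the lag), pd.merge_asof can do the nearest-date matching and the cutoff in one call; a sketch, assuming B's values sit alongside its dates and both frames are sorted by date:
import pandas as pd

a = pd.DataFrame({'date': ts_a}).sort_values('date')
b = pd.DataFrame({'date': ts_b,
                  'value': range(len(ts_b))}).sort_values('date')  # placeholder values

# For each date in A, take the nearest date in B, but only if it lies
# within 5 days; rows with no match inside the tolerance get NaN.
c = pd.merge_asof(a, b, on='date', direction='nearest',
                  tolerance=pd.Timedelta('5D'))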

pd.date_range, Length of values does not match length of index

I have a data frame composed as follows, containing daily precipitation data from 1971 to 2017:
df.dtypes
Out[28]:
date datetime64[ns]
mm float64
dtype: object
df.head()
Out[29]:
date mm
0 1971-01-01 07:00:00 2.2
1 1971-01-02 07:00:00 2.0
2 1971-01-03 07:00:00 3.0
3 1971-01-04 07:00:00 0.0
4 1971-01-05 07:00:00 0.0
I want to ask the user which time frame they want to filter the available data to, therefore:
from datetime import datetime

start_date_entry = input("Input the month and the year to start your analysis (i.e. YYYY-MM-DD): ")
year, month, day = map(int, start_date_entry.split('-'))
start_date_form = datetime(year, month, day, 7, 0, 0)
end_date_entry = input("Input the last month and year to end your analysis (i.e. YYYY-MM-DD): ")
year, month, day = map(int, end_date_entry.split('-'))
end_date_form = datetime(year, month, day, 7, 0, 0)
Both start_date_form and end_date_form have the following format:
type(start_date_form)
Out[31]: datetime.datetime
print(start_date_form)
2010-01-01 07:00:00
Now, I want to filter my data frame with the new dates:
df['date']=pd.date_range(start=(start_date_form), end=(end_date_form),freq='D')
I get the following error:
Length of values (1827) does not match length of index (17167)
I am new to Python and programming; I would appreciate some help on this. Thanks!
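The error occurs because pd.date_range builds a new sequence of 1827 dates while df has 17167 rows, and column assignment requires matching lengths. To select the rows between the two dates, a boolean mask is the usual approach; a minimal sketch, assuming df as shown above:
mask = (df['date'] >= start_date_form) & (df['date'] <= end_date_form)
df_filtered = df.loc[mask]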

Cumulative elapsed minutes from Pandas datetime Series

I have a column of datetime stamps. I need a column of total minutes elapsed from the first to the last value.
I have:
>>> df = pd.DataFrame({'timestamp': [
... pd.Timestamp('2001-01-01 06:00:00'),
... pd.Timestamp('2001-01-01 06:01:00'),
... pd.Timestamp('2001-01-01 06:15:00')
... ]})
>>> df
timestamp
0 2001-01-01 06:00:00
1 2001-01-01 06:01:00
2 2001-01-01 06:15:00
I need to add a column that gives the running total:
timestamp minutes
1-1-2001 6:00 0
1-1-2001 6:01 1
1-1-2001 6:15 15
1-1-2001 7:00 60
1-1-2001 7:35 95
I'm having a hard time manipulating the datetime Series in a way that lets me total up the elapsed time.
I've looked at a lot of posts and can't find anything that does what I'm trying to do. Would appreciate any ideas!
You can chain a few methods together:
>>> df['minutes'] = df['timestamp'].diff().fillna(0).dt.total_seconds()\
... .cumsum().div(60).astype(int)
>>> df
timestamp minutes
0 2001-01-01 06:00:00 0
1 2001-01-01 06:01:00 1
2 2001-01-01 06:15:00 15
Creation:
>>> df = pd.DataFrame({'timestamp': [
... pd.Timestamp('2001-01-01 06:00:00'),
... pd.Timestamp('2001-01-01 06:01:00'),
... pd.Timestamp('2001-01-01 06:15:00')
... ]})
Walkthrough
The easiest way to break this down is to separate each intermediate method call.
df['timestamp'].diff() gives you a Series of pandas Timedelta objects (the pandas equivalent of Python's datetime.timedelta): the differences in times from each value to the next.
>>> df['timestamp'].diff()
0 NaT
1 00:01:00
2 00:14:00
Name: timestamp, dtype: timedelta64[ns]
This contains an N/A value (NaT/not a time) because there's nothing to subtract from the first value. You can simply fill it with the zero-value for timedeltas:
>>> df['timestamp'].diff().fillna(0)
0 00:00:00
1 00:01:00
2 00:14:00
Name: timestamp, dtype: timedelta64[ns]
Now you need to get an actual number of minutes from these objects. In .dt.total_seconds(), .dt is an "accessor" that exposes a collection of methods for working with datetime-like data:
>>> df['timestamp'].diff().fillna(0).dt.total_seconds()
0 0.0
1 60.0
2 840.0
Name: timestamp, dtype: float64
The result is the incremental second-change as a float. You need this on a cumulative basis, in minutes, and as an integer. That's what the final 3 operations do:
>>> df['timestamp'].diff().fillna(0).dt.total_seconds().cumsum().div(60).astype(int)
0 0
1 1
2 15
Name: timestamp, dtype: int64
Note that astype(int) truncates rather than rounds, so you will lose any leftover seconds that aren't fully divisible by 60.
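Since diff().fillna(0).cumsum() is just the distance from the first timestamp, an equivalent and slightly shorter variant subtracts the first value directly:
>>> df['minutes'] = (df['timestamp'] - df['timestamp'].iloc[0])\
...     .dt.total_seconds().div(60).astype(int)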
