Resample pandas dataframe and interpolate missing values for timeseries data - python

I need to resample timeseries data and interpolate missing values in 15 min intervals over the course of an hour. Each ID should have four rows of data per hour.
In:
ID Time Value
1 1/1/2019 12:17 3
1 1/1/2019 12:44 2
2 1/1/2019 12:02 5
2 1/1/2019 12:28 7
Out:
ID Time Value
1 2019-01-01 12:00:00 3.0
1 2019-01-01 12:15:00 3.0
1 2019-01-01 12:30:00 2.0
1 2019-01-01 12:45:00 2.0
2 2019-01-01 12:00:00 5.0
2 2019-01-01 12:15:00 7.0
2 2019-01-01 12:30:00 7.0
2 2019-01-01 12:45:00 7.0
I wrote a function to do this; however, efficiency drops drastically when processing a larger dataset.
Is there a more efficient way to do this?
import datetime
import pandas as pd
data = pd.DataFrame({'ID': [1,1,2,2],
                     'Time': ['1/1/2019 12:17','1/1/2019 12:44','1/1/2019 12:02','1/1/2019 12:28'],
                     'Value': [3,2,5,7]})
def clean_dataset(data):
    ids = data.drop_duplicates(subset='ID')
    data['Time'] = pd.to_datetime(data['Time'])
    data['Time'] = data['Time'].apply(
        lambda dt: datetime.datetime(dt.year, dt.month, dt.day, dt.hour,15*(dt.minute // 15)))
    data = data.drop_duplicates(subset=['Time','ID']).reset_index(drop=True)
    df = pd.DataFrame(columns=['Time','ID','Value'])
    for i in range(ids.shape[0]):
        times = pd.DataFrame(pd.date_range('1/1/2019 12:00','1/1/2019 13:00',freq='15min'),columns=['Time'])
        id_data = data[data['ID']==ids.iloc[i]['ID']]
        clean_data = times.join(id_data.set_index('Time'), on='Time')
        clean_data = clean_data.interpolate(method='linear', limit_direction='both')
        clean_data.drop(clean_data.tail(1).index,inplace=True)
        df = df.append(clean_data)
    return df
clean_dataset(data)

Linear interpolation does become slow with a large data set. Having a loop in your code is also responsible for a large part of the slowdown. Anything that can be removed from the loop and pre-computed will help increase efficiency. For example, if you pre-define the data frame that you use to initialize times, the code becomes 14% more efficient:
times_template = pd.DataFrame(pd.date_range('1/1/2019 12:00','1/1/2019 13:00',freq='15min'),columns=['Time'])
for i in range(ids.shape[0]):
    times = times_template.copy()
Profiling your code confirms that the interpolation takes the longest amount of time (22.7%), followed by the join (13.1%), the append (7.71%), and then the drop (7.67%) commands.
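As a rough illustration of moving work out of the loop, the sketch below builds the time grid once, collects the per-ID frames in a list, and concatenates them a single time instead of appending on every iteration. The function name clean_dataset_concat, the fixed 12:00 to 12:45 grid, and the use of groupby and merge are assumptions made for this example, not part of the original function.

```python
import pandas as pd

data = pd.DataFrame({'ID': [1, 1, 2, 2],
                     'Time': ['1/1/2019 12:17', '1/1/2019 12:44',
                              '1/1/2019 12:02', '1/1/2019 12:28'],
                     'Value': [3, 2, 5, 7]})

# Built once, outside the loop (assumed fixed grid of four 15-minute slots).
times_template = pd.DataFrame(
    {'Time': pd.date_range('1/1/2019 12:00', '1/1/2019 12:45', freq='15min')})

def clean_dataset_concat(data):
    # Hypothetical variant of clean_dataset that concatenates only once.
    data = data.copy()
    data['Time'] = pd.to_datetime(data['Time']).dt.floor('15min')
    data = data.drop_duplicates(subset=['Time', 'ID'])
    pieces = []
    for id_value, id_data in data.groupby('ID'):
        # Left merge keeps every slot of the grid, with NaN where no reading exists.
        clean = times_template.merge(id_data, on='Time', how='left')
        clean['ID'] = id_value
        clean['Value'] = clean['Value'].interpolate(method='linear',
                                                    limit_direction='both')
        pieces.append(clean)
    # One concatenation at the end instead of growing a frame inside the loop.
    return pd.concat(pieces, ignore_index=True)

result = clean_dataset_concat(data)
print(result)
```

Growing a DataFrame row block by row block copies the accumulated data each time, so collecting the pieces and concatenating once usually scales better as the number of IDs grows.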

You can use:
#round datetimes by 15 minutes
data['Time'] = pd.to_datetime(data['Time'])
minutes = pd.to_timedelta(15*(data['Time'].dt.minute // 15), unit='min')
data['Time'] = data['Time'].dt.floor('h') + minutes
#change date range for 4 values (to `12:45`)
rng = pd.date_range('1/1/2019 12:00','1/1/2019 12:45',freq='15min')
#create MultiIndex and reindex
mux = pd.MultiIndex.from_product([data['ID'].unique(), rng], names=['ID','Time'])
data = data.set_index(['ID','Time']).reindex(mux).reset_index()
#interpolate per groups
#transform keeps the original row index, so the result can be assigned back directly
data['Value'] = (data.groupby('ID')['Value']
                     .transform(lambda x: x.interpolate(method='linear', limit_direction='both')))
print (data)
ID Time Value
0 1 2019-01-01 12:00:00 3.0
1 1 2019-01-01 12:15:00 3.0
2 1 2019-01-01 12:30:00 2.0
3 1 2019-01-01 12:45:00 2.0
4 2 2019-01-01 12:00:00 5.0
5 2 2019-01-01 12:15:00 7.0
6 2 2019-01-01 12:30:00 7.0
7 2 2019-01-01 12:45:00 7.0
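The rounding step can also be written as a single floor to the 15-minute frequency. The short check below is a sketch using the sample timestamps from the question; the variable names manual and direct are made up for this illustration.

```python
import pandas as pd

times = pd.Series(pd.to_datetime(['1/1/2019 12:17', '1/1/2019 12:44',
                                  '1/1/2019 12:02', '1/1/2019 12:28']))

# Two-step version: floor to the hour, then add back whole 15-minute blocks.
manual = times.dt.floor('h') + pd.to_timedelta(15 * (times.dt.minute // 15), unit='min')

# One-step version: floor straight to the 15-minute grid.
direct = times.dt.floor('15min')

same = bool((manual == direct).all())
print(same)
print(direct.dt.strftime('%H:%M').tolist())
```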
If the range cannot be changed:
data['Time'] = pd.to_datetime(data['Time'])
minutes = pd.to_timedelta(15*(data['Time'].dt.minute // 15), unit='min')
data['Time'] = data['Time'].dt.floor('h') + minutes
#end in 13:00
rng = pd.date_range('1/1/2019 12:00','1/1/2019 13:00',freq='15min')
mux = pd.MultiIndex.from_product([data['ID'].unique(), rng], names=['ID','Time'])
data = data.set_index(['ID','Time']).reindex(mux).reset_index()
data['Value'] = (data.groupby('ID')['Value']
                     .transform(lambda x: x.interpolate(method='linear', limit_direction='both')))
#remove last row per groups
data = data[data['ID'].duplicated(keep='last')]
print (data)
ID Time Value
0 1 2019-01-01 12:00:00 3.0
1 1 2019-01-01 12:15:00 3.0
2 1 2019-01-01 12:30:00 2.0
3 1 2019-01-01 12:45:00 2.0
5 2 2019-01-01 12:00:00 5.0
6 2 2019-01-01 12:15:00 7.0
7 2 2019-01-01 12:30:00 7.0
8 2 2019-01-01 12:45:00 7.0
EDIT:
Another solution, using merge with a left join instead of reindex:
from itertools import product
#round datetimes by 15 minutes
data['Time'] = pd.to_datetime(data['Time'])
minutes = pd.to_timedelta(15*(data['Time'].dt.minute // 15), unit='min')
data['Time'] = data['Time'].dt.floor('h') + minutes
#change date range for 4 values (to `12:45`)
rng = pd.date_range('1/1/2019 12:00','1/1/2019 12:45',freq='15min')
#create helper DataFrame and merge with left join
df = pd.DataFrame(list(product(data['ID'].unique(), rng)), columns=['ID','Time'])
print (df)
ID Time
0 1 2019-01-01 12:00:00
1 1 2019-01-01 12:15:00
2 1 2019-01-01 12:30:00
3 1 2019-01-01 12:45:00
4 2 2019-01-01 12:00:00
5 2 2019-01-01 12:15:00
6 2 2019-01-01 12:30:00
7 2 2019-01-01 12:45:00
data = df.merge(data, how='left')
#interpolate per group
data['Value'] = (data.groupby('ID')['Value']
                     .transform(lambda x: x.interpolate(method='linear', limit_direction='both')))
print (data)
ID Time Value
0 1 2019-01-01 12:00:00 3.0
1 1 2019-01-01 12:15:00 3.0
2 1 2019-01-01 12:30:00 2.0
3 1 2019-01-01 12:45:00 2.0
4 2 2019-01-01 12:00:00 5.0
5 2 2019-01-01 12:15:00 7.0
6 2 2019-01-01 12:30:00 7.0
7 2 2019-01-01 12:45:00 7.0

Related

How to create a new column with the last value of the previous year

I have this data frame
import pandas as pd
df = pd.DataFrame({'COTA':['A','A','A','A','A','B','B','B','B'],
                   'Date':['14/10/2021','19/10/2020','29/10/2019','30/09/2021','20/09/2020','20/10/2021','29/10/2020','15/10/2019','10/09/2020'],
                   'Mark':[1,2,3,4,5,1,2,3,3]
                   })
print(df)
Based on this data frame, I want the Mark from the previous year. I managed to get the previous year's maximum Mark for each group, but I want the last one instead. I used .max() and thought I could get it with .last(), but it didn't work.
Here is my code:
df['Date'] = pd.to_datetime(df['Date'])
df['LastYear'] = df['Date'] - pd.offsets.YearEnd(0)
s1 = df.groupby(['Found', 'LastYear'])['Mark'].max()
s2 = s1.rename(index=lambda x: x + pd.offsets.DateOffset(years=1), level=1)
df = df.join(s2.rename('Max_MarkLastYear'), on=['Found', 'LastYear'])
print (df)
Found Date Mark LastYear Max_MarkLastYear
0 A 2021-10-14 1 2021-12-31 5.0
1 A 2020-10-19 2 2020-12-31 3.0
2 A 2019-10-29 3 2019-12-31 NaN
3 A 2021-09-30 4 2021-12-31 5.0
4 A 2020-09-20 5 2020-12-31 3.0
5 B 2021-10-20 1 2021-12-31 3.0
6 B 2020-10-29 2 2020-12-31 3.0
7 B 2019-10-15 3 2019-12-31 NaN
8 B 2020-10-09 3 2020-12-31 3.0
How do I create a new column with the last value of the previous year?
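One possible approach is sketched below. It assumes that "last" means the Mark with the latest Date within the previous calendar year, that the dates are day-first, and that the grouping column is COTA as in the DataFrame definition. The column names Year and Last_MarkLastYear are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({'COTA': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
                   'Date': ['14/10/2021', '19/10/2020', '29/10/2019',
                            '30/09/2021', '20/09/2020', '20/10/2021',
                            '29/10/2020', '15/10/2019', '10/09/2020'],
                   'Mark': [1, 2, 3, 4, 5, 1, 2, 3, 3]})

# Assumption: dates are written day/month/year.
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')
df['Year'] = df['Date'].dt.year

# Latest Mark per group and year: sort by date, then take the last row of each group.
last_per_year = (df.sort_values('Date')
                   .groupby(['COTA', 'Year'], as_index=False)['Mark']
                   .last()
                   .rename(columns={'Mark': 'Last_MarkLastYear'}))

# Shift the year forward so that each value lines up with the following year's rows.
last_per_year['Year'] += 1

df = df.merge(last_per_year, on=['COTA', 'Year'], how='left')
print(df)
```

Rows from the first year of each group have no previous year to look up, so they end up with NaN.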

Python - Sum of column values between 2 dates

I am trying to create a new column in my dataframe:
Let X be a variable number of days.
   Date                 Units Sold  Total Units sold in the last X days
0  2019-01-01 19:00:00  5
1  2019-01-01 15:00:00  4
2  2019-01-05 11:00:00  1
3  2019-01-12 12:00:00  3
4  2019-01-15 15:00:00  2
5  2019-02-04 18:00:00  7
For each row, I need to sum up units sold + all the units sold in the last 10 days (letting x = 10 days)
Desired Result:
   Date                 Units Sold  Total Units sold in the last X days
0  2019-01-01 19:00:00  5           5
1  2019-01-01 15:00:00  4           9
2  2019-01-05 11:00:00  1           10
3  2019-01-12 12:00:00  3           4
4  2019-01-15 15:00:00  2           6
5  2019-02-04 18:00:00  7           7
I have used the .rolling(window=) method before with a fixed number of periods, and I think something like
df = df.rolling("10D").sum() could help, but I can't get the syntax right.
Please help!
Try this (a time-based window such as "10D" needs Date to be a datetime column sorted by date):
df["Total Units sold in the last 10 days"] = df.rolling(on="Date", window="10D", closed="both").sum()["Units Sold"]
print(df)
Prints:
Date Units Sold Total Units sold in the last 10 days
0 2019-01-01 5 5.0
1 2019-01-01 4 9.0
2 2019-01-05 1 10.0
3 2019-01-12 3 4.0
4 2019-01-15 2 6.0
5 2019-02-04 7 7.0
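For reference, here is a self-contained sketch of the same rolling sum with the window length held in a variable. It assumes that only the calendar date matters, which matches the printed output above, so the time of day is dropped before sorting. The name x_days is made up for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'Date': pd.to_datetime(['2019-01-01 19:00:00', '2019-01-01 15:00:00',
                            '2019-01-05 11:00:00', '2019-01-12 12:00:00',
                            '2019-01-15 15:00:00', '2019-02-04 18:00:00']),
    'Units Sold': [5, 4, 1, 3, 2, 7]})

# Assumption: compare by calendar date only, so strip the time of day.
df['Date'] = df['Date'].dt.normalize()

# A time-based window needs the dates in order; a stable sort keeps the
# original order of rows that share a date.
df = df.sort_values('Date', kind='stable').reset_index(drop=True)

x_days = 10  # hypothetical name for the X in the question
totals = df.rolling(on='Date', window=f'{x_days}D', closed='both').sum()['Units Sold']
df['Total Units sold in the last X days'] = totals
print(df)
```

With closed='both', a sale made exactly X days earlier is still counted, which is why the row for 2019-01-15 includes the sale from 2019-01-05.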

Drop overlapping periods less than 6 months in pandas dataframe

I have the following Pandas dataframe and I want to drop the rows where the difference from the previously kept Date of the same customer is less than 6 months. For example, for the customer with ID 1, I want to keep the dates 2017-07-01, 2018-01-01 and 2018-08-01.
Customer_ID Date
1 2017-07-01
1 2017-08-01
1 2017-09-01
1 2017-10-01
1 2017-11-01
1 2017-12-01
1 2018-01-01
1 2018-02-01
1 2018-03-01
1 2018-04-01
1 2018-06-01
1 2018-08-01
2 2018-11-01
2 2019-02-01
2 2019-03-01
2 2019-05-01
2 2020-02-01
2 2020-05-01
Define the following function to process each group of rows (for each customer):
def selDates(grp):
    res = []
    while grp.size > 0:
        stRow = grp.iloc[0]
        res.append(stRow)
        grp = grp[grp.Date >= stRow.Date + pd.DateOffset(months=6)]
    return pd.DataFrame(res)
Then apply this function to each group:
result = df.groupby('Customer_ID', group_keys=False).apply(selDates)
The result, for your data sample, is:
Customer_ID Date
0 1 2017-07-01
6 1 2018-01-01
11 1 2018-08-01
12 2 2018-11-01
15 2 2019-05-01
16 2 2020-02-01
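The same greedy selection can be sketched as a plain loop over sorted dates, which may be easier to follow or adapt. The helper name keep_spaced_dates is made up for this example, and unlike the groupby/apply version it returns a freshly numbered index rather than the original row labels.

```python
import pandas as pd

def keep_spaced_dates(dates, months=6):
    # Hypothetical helper: keep a date only if it falls at least `months`
    # after the previously kept date.
    kept = []
    threshold = None
    for d in sorted(dates):
        if threshold is None or d >= threshold:
            kept.append(d)
            threshold = d + pd.DateOffset(months=months)
    return kept

df = pd.DataFrame({
    'Customer_ID': [1] * 12 + [2] * 6,
    'Date': pd.to_datetime([
        '2017-07-01', '2017-08-01', '2017-09-01', '2017-10-01', '2017-11-01',
        '2017-12-01', '2018-01-01', '2018-02-01', '2018-03-01', '2018-04-01',
        '2018-06-01', '2018-08-01',
        '2018-11-01', '2019-02-01', '2019-03-01', '2019-05-01', '2020-02-01',
        '2020-05-01'])})

# Apply the helper per customer and rebuild a flat result table.
result = pd.DataFrame(
    [(cid, d) for cid, grp in df.groupby('Customer_ID')
     for d in keep_spaced_dates(grp['Date'])],
    columns=['Customer_ID', 'Date'])
print(result)
```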

How to use groupby() with between_time()?

I have a DataFrame and want to multiply all values in a column a for a certain day with the value of a at 6h00m00 of that day. If there is no 6h00m00 entry, that day should stay unchanged.
The code below unfortunately gives an error.
How do I have to correct this code / replace it with any working solution?
import pandas as pd
import numpy as np
start = pd.Timestamp('2000-01-01')
end = pd.Timestamp('2000-01-03')
t = np.linspace(start.value, end.value, 9)
datetime1 = pd.to_datetime(t)
df = pd.DataFrame( {'a':[1,3,4,5,6,7,8,9,14]})
df['date']= datetime1
print(df)
def myF(x):
    y = x.set_index('date').between_time('05:59', '06:01').a
    return y
toMultiplyWith = df.groupby(df.date.dt.floor('D')).transform(myF)
.
a date
0 1 2000-01-01 00:00:00
1 3 2000-01-01 06:00:00
2 4 2000-01-01 12:00:00
3 5 2000-01-01 18:00:00
4 6 2000-01-02 00:00:00
5 7 2000-01-02 06:00:00
6 8 2000-01-02 12:00:00
7 9 2000-01-02 18:00:00
8 14 2000-01-03 00:00:00
....
AttributeError: ("'Series' object has no attribute 'set_index'", 'occurred at index a')
You should change this line:
toMultiplyWith = df.groupby(df.date.dt.floor('D')).transform(myF)
to this:
toMultiplyWith = df.groupby(df.date.dt.floor('D')).apply(myF)
Using .apply instead of .transform will give you the desired result.
apply is the right choice here since it implicitly passes all the columns for each group as a DataFrame to the custom function, whereas transform passes each column separately as a Series.
To read more about the difference between the two methods, consider this answer.
If you want to stick with the between_time(...) function, this is one way to do it:
df = df.set_index('date')
mask = df.between_time('05:59', '06:01').index
df.loc[mask, 'a'] = df.loc[mask, 'a'] ** 2 # the operation you want to perform
df.reset_index(inplace=True)
Outputs:
date a
0 2000-01-01 00:00:00 1
1 2000-01-01 06:00:00 9
2 2000-01-01 12:00:00 4
3 2000-01-01 18:00:00 5
4 2000-01-02 00:00:00 6
5 2000-01-02 06:00:00 49
6 2000-01-02 12:00:00 8
7 2000-01-02 18:00:00 9
8 2000-01-03 00:00:00 14
If I got your goal right, you can use apply to return a dataframe with the same amount of rows as the original dataframe (simulating a transform):
def myF(grp):
    # '%H:%M:%S' is the portable spelling of '%T'
    time = grp.date.dt.strftime('%H:%M:%S')
    target_idx = time == '06:00:00'
    if target_idx.any():
        grp.loc[~target_idx, 'a_sum'] = grp.loc[~target_idx, 'a'].values * grp.loc[target_idx, 'a'].values
    else:
        grp.loc[~target_idx, 'a_sum'] = np.nan
    return grp
df.groupby(df.date.dt.floor('D')).apply(myF)
Output:
a date a_sum
0 1 2000-01-01 00:00:00 3.0
1 3 2000-01-01 06:00:00 NaN
2 4 2000-01-01 12:00:00 12.0
3 5 2000-01-01 18:00:00 15.0
4 6 2000-01-02 00:00:00 42.0
5 7 2000-01-02 06:00:00 NaN
6 8 2000-01-02 12:00:00 56.0
7 9 2000-01-02 18:00:00 63.0
8 14 2000-01-03 00:00:00 NaN
Note that, for each day, every value with a time other than 06:00:00 is multiplied by that day's 06:00:00 value. It returns NaN for the 06:00:00 values themselves, as well as for groups without this time.
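If days without a 06:00:00 entry should stay unchanged, as the question asks, a vectorized variant without a custom group function could look like the sketch below. It assumes at most one 06:00:00 row per day; the column name a_scaled and the helper variables are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 3, 4, 5, 6, 7, 8, 9, 14],
                   'date': pd.date_range('2000-01-01', '2000-01-03', periods=9)})

day = df['date'].dt.floor('D')
is_six = df['date'].dt.strftime('%H:%M:%S') == '06:00:00'

# Keep only the 06:00 values, then broadcast each day's value to all of its rows.
# Days without a 06:00 row end up with NaN here.
factor = df['a'].where(is_six).groupby(day).transform('first')

# A missing factor becomes 1, so those days are left unchanged.
df['a_scaled'] = df['a'] * factor.fillna(1)
print(df)
```

In this variant the 06:00:00 row is multiplied by itself as well, since the question asks to multiply all values of the day.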

How to divide 60 mins datapoints into 15 mins?

I have a dataset with one value every 60 minutes. Now I want to split it into 15-minute intervals, filling the new rows with values averaged between the two surrounding hourly values. How do I do that?
Time A
2016-01-01 00:00:00 1
2016-01-01 01:00:00 5
2016-01-01 02:00:00 13
So I now want it at 15-minute intervals with averaged values:
Time A
2016-01-01 00:00:00 1
2016-01-01 00:15:00 2 ### at 2016-01-01 00:00:00 values is 1 and
2016-01-01 00:30:00 3 ### at 2016-01-01 01:00:00 values is 5.
2016-01-01 00:45:00 4 ### Therefore we have to fill 4 values ( 15 mins interval )
2016-01-01 01:00:00 5 ### with the average of the hour values.
2016-01-01 01:15:00 7
2016-01-01 01:30:00 9
2016-01-01 01:45:00 11
2016-01-01 02:00:00 13
I tried resampling it to 15 minutes with mean, but that doesn't work (obviously) and gives NaN values. Can anyone help me with how to do this?
I would just resample: df.resample("15min").interpolate("linear")
As you have the column Time set as index already, it should directly work
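A minimal runnable sketch of that one-liner, using the three sample rows from the question and assuming Time is already the index:

```python
import pandas as pd

df = pd.DataFrame(
    {'A': [1, 5, 13]},
    index=pd.DatetimeIndex(['2016-01-01 00:00:00', '2016-01-01 01:00:00',
                            '2016-01-01 02:00:00'], name='Time'))

# Upsample to a 15-minute grid and fill the new rows by linear interpolation.
out = df.resample('15min').interpolate('linear')
print(out)
```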
We can do this in one line with resample, replace and interpolate:
df.resample('15min').sum().replace(0, np.nan).interpolate()
Output
A
Time
2016-01-01 00:00:00 1.0
2016-01-01 00:15:00 2.0
2016-01-01 00:30:00 3.0
2016-01-01 00:45:00 4.0
2016-01-01 01:00:00 5.0
2016-01-01 01:15:00 7.0
2016-01-01 01:30:00 9.0
2016-01-01 01:45:00 11.0
2016-01-01 02:00:00 13.0
You can do that like this:
import pandas as pd
df = pd.DataFrame({
    'Time': ["2016-01-01 00:00:00", "2016-01-01 01:00:00", "2016-01-01 02:00:00"],
    'A': [1 , 5, 13]
})
df['Time'] = pd.to_datetime(df['Time'])
# build the 15-minute target index from the first to the last timestamp
new_idx = pd.date_range(start=df['Time'].iloc[0], end=df['Time'].iloc[-1], freq='15min')
df2 = df.set_index('Time').reindex(new_idx).interpolate().reset_index()
df2.rename(columns={'index': 'Time'}, inplace=True)
print(df2)
# Time A
# 0 2016-01-01 00:00:00 1.0
# 1 2016-01-01 00:15:00 2.0
# 2 2016-01-01 00:30:00 3.0
# 3 2016-01-01 00:45:00 4.0
# 4 2016-01-01 01:00:00 5.0
# 5 2016-01-01 01:15:00 7.0
# 6 2016-01-01 01:30:00 9.0
# 7 2016-01-01 01:45:00 11.0
# 8 2016-01-01 02:00:00 13.0
If you want column A in the result to be an integer you can add something like:
df2['A'] = df2['A'].round().astype(int)
