I have a dataset with one value every 60 minutes. I want to convert it to 15-minute intervals, filling the new rows with values interpolated between each pair of consecutive hourly values. How do I do that?
Time A
2016-01-01 00:00:00 1
2016-01-01 01:00:00 5
2016-01-01 02:00:00 13
So I now want it at 15-minute intervals, with the in-between values averaged (interpolated):
Time A
2016-01-01 00:00:00 1
2016-01-01 00:15:00 2 ### at 2016-01-01 00:00:00 values is 1 and
2016-01-01 00:30:00 3 ### at 2016-01-01 01:00:00 values is 5.
2016-01-01 00:45:00 4 ### Therefore we have to fill 4 values ( 15 mins interval )
2016-01-01 01:00:00 5 ### with the average of the hour values.
2016-01-01 01:15:00 7
2016-01-01 01:30:00 9
2016-01-01 01:45:00 11
2016-01-01 02:00:00 13
I tried resampling it to 15 minutes with mean, but (obviously) that doesn't work and gives NaN values. Can anyone help me out with how to do it?
I would just resample: df.resample("15min").interpolate("linear")
Since the Time column is already set as the index, this should work directly.
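A minimal, self-contained sketch of the resample-and-interpolate approach above, using the three hourly values from the question. The frame is built by hand here (an assumption about how the data is loaded), with Time as a DatetimeIndex:

```python
import pandas as pd

# Hourly sample data from the question, with Time as the index.
df = pd.DataFrame(
    {"A": [1, 5, 13]},
    index=pd.to_datetime(
        ["2016-01-01 00:00:00", "2016-01-01 01:00:00", "2016-01-01 02:00:00"]
    ),
)
df.index.name = "Time"

# Upsample to 15-minute bins and fill the new rows by linear interpolation.
df15 = df.resample("15min").interpolate("linear")
print(df15)
```

The new rows are NaN right after upsampling, which is why a plain mean gives NaN; interpolate fills them from the neighbouring hourly values.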
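We can do this in one line with resample, replace and interpolate (note that replace(0, np.nan) also turns any genuine zeros in the data into NaN before interpolating):
df.resample('15min').sum().replace(0, np.nan).interpolate()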
Output
A
Time
2016-01-01 00:00:00 1.0
2016-01-01 00:15:00 2.0
2016-01-01 00:30:00 3.0
2016-01-01 00:45:00 4.0
2016-01-01 01:00:00 5.0
2016-01-01 01:15:00 7.0
2016-01-01 01:30:00 9.0
2016-01-01 01:45:00 11.0
2016-01-01 02:00:00 13.0
You can do that like this:
import pandas as pd
df = pd.DataFrame({
'Time': ["2016-01-01 00:00:00", "2016-01-01 01:00:00", "2016-01-01 02:00:00"],
'A': [1 , 5, 13]
})
df['Time'] = pd.to_datetime(df['Time'])
new_idx = pd.date_range(start=df['Time'].iloc[0], end=df['Time'].iloc[-1], freq='15min')
df2 = df.set_index('Time').reindex(new_idx).interpolate().reset_index()
df2.rename(columns={'index': 'Time'}, inplace=True)
print(df2)
# Time A
# 0 2016-01-01 00:00:00 1.0
# 1 2016-01-01 00:15:00 2.0
# 2 2016-01-01 00:30:00 3.0
# 3 2016-01-01 00:45:00 4.0
# 4 2016-01-01 01:00:00 5.0
# 5 2016-01-01 01:15:00 7.0
# 6 2016-01-01 01:30:00 9.0
# 7 2016-01-01 01:45:00 11.0
# 8 2016-01-01 02:00:00 13.0
If you want column A in the result to be an integer you can add something like:
df2['A'] = df2['A'].round().astype(int)
How do I replace duplicates for each group with NaNs while keeping the rows?
I need to keep all rows rather than remove them, keeping the original value only where it first shows up.
import pandas as pd
from datetime import timedelta
df = pd.DataFrame({
'date': ['2019-01-01 00:00:00','2019-01-01 01:00:00','2019-01-01 02:00:00', '2019-01-01 03:00:00',
'2019-09-01 02:00:00','2019-09-01 03:00:00','2019-09-01 04:00:00', '2019-09-01 05:00:00'],
'value': [10,10,10,10,12,12,12,12],
'ID': ['Jackie','Jackie','Jackie','Jackie','Zoop','Zoop','Zoop','Zoop',]
})
df['date'] = pd.to_datetime(df['date'])
date value ID
0 2019-01-01 00:00:00 10 Jackie
1 2019-01-01 01:00:00 10 Jackie
2 2019-01-01 02:00:00 10 Jackie
3 2019-01-01 03:00:00 10 Jackie
4 2019-09-01 02:00:00 12 Zoop
5 2019-09-01 03:00:00 12 Zoop
6 2019-09-01 04:00:00 12 Zoop
7 2019-09-01 05:00:00 12 Zoop
Desired Dataframe:
date value ID
0 2019-01-01 00:00:00 10 Jackie
1 2019-01-01 01:00:00 NaN Jackie
2 2019-01-01 02:00:00 NaN Jackie
3 2019-01-01 03:00:00 NaN Jackie
4 2019-09-01 02:00:00 12 Zoop
5 2019-09-01 03:00:00 NaN Zoop
6 2019-09-01 04:00:00 NaN Zoop
7 2019-09-01 05:00:00 NaN Zoop
Edit:
Duplicated values should only be dropped within the same date, regardless of the frequency. So if the value 10 shows up twice on Jan-1 and three times on Jan-2, it should show up only once on Jan-1 and once on Jan-2.
I assume you want to check for duplicates on the columns value and ID, and additionally on the calendar date taken from the column date:
df.loc[df.assign(d=df.date.dt.date).duplicated(['value','ID', 'd']), 'value'] = np.nan
Out[269]:
date value ID
0 2019-01-01 00:00:00 10.0 Jackie
1 2019-01-01 01:00:00 NaN Jackie
2 2019-01-01 02:00:00 NaN Jackie
3 2019-01-01 03:00:00 NaN Jackie
4 2019-09-01 02:00:00 12.0 Zoop
5 2019-09-01 03:00:00 NaN Zoop
6 2019-09-01 04:00:00 NaN Zoop
7 2019-09-01 05:00:00 NaN Zoop
As @Trenton suggests, you may use pd.NA to avoid importing numpy.
(Note: as @rafaelc suggests, this link explains the differences between pd.NA and np.nan in detail: https://pandas.pydata.org/pandas-docs/stable/whatsnew/v1.0.0.html#experimental-na-scalar-to-denote-missing-values)
df.loc[df.assign(d=df.date.dt.date).duplicated(['value','ID', 'd']), 'value'] = pd.NA
Out[273]:
date value ID
0 2019-01-01 00:00:00 10 Jackie
1 2019-01-01 01:00:00 <NA> Jackie
2 2019-01-01 02:00:00 <NA> Jackie
3 2019-01-01 03:00:00 <NA> Jackie
4 2019-09-01 02:00:00 12 Zoop
5 2019-09-01 03:00:00 <NA> Zoop
6 2019-09-01 04:00:00 <NA> Zoop
7 2019-09-01 05:00:00 <NA> Zoop
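The duplicated-based idea above can be tried on a small frame that covers the case from the question's edit (the same value on two different days). This is only a sketch: the sample rows and the helper names `day` and `is_repeat` are made up for illustration, and `Series.mask` is used in place of the `.loc` assignment.

```python
import pandas as pd

# Hypothetical data: value 10 appears twice on Jan-1 and three times on Jan-2.
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2019-01-01 00:00", "2019-01-01 01:00",
        "2019-01-02 00:00", "2019-01-02 01:00", "2019-01-02 02:00",
    ]),
    "value": [10, 10, 10, 10, 10],
    "ID": ["Jackie"] * 5,
})

# Calendar day of each row (timestamp truncated to midnight).
day = df["date"].dt.normalize()

# True for every row that repeats an (ID, value, day) combination seen earlier.
is_repeat = df.assign(day=day).duplicated(["ID", "value", "day"])

# Blank out the repeats; the first occurrence per day is kept.
df["value"] = df["value"].mask(is_repeat)
print(df)
```

Because NaN is a float, the value column becomes float64 after masking.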
This works if the dataframe is sorted, as in your example:
import numpy as np # to be used for np.nan
df['duplicate'] = df['value'].shift(1) # create a duplicate column
df['value'] = df.apply(lambda x: np.nan if x['value'] == x['duplicate'] \
else x['value'], axis=1) # conditional replace
df = df.drop('duplicate', axis=1) # drop helper column
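The same shift-and-compare idea can be written without apply or a helper column. A minimal sketch, assuming (as above) that the frame is sorted so repeats are adjacent; note that it compares only neighbouring rows and ignores ID and date:

```python
import pandas as pd

df = pd.DataFrame({"value": [10, 10, 10, 10, 12, 12, 12, 12]})

# Keep a value only where it differs from the row directly above it;
# everything else becomes NaN.
previous = df["value"].shift()
df["value"] = df["value"].where(df["value"] != previous)
print(df)
```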
Group on the dates and take the first observed value (not necessarily the first when sorted by time), then merge the result back to the original dataframe.
df2 = df.groupby([df['date'].dt.date, 'ID'], as_index=False).first()
>>> df.drop(columns='value').merge(df2, on=['date', 'ID'], how='left')[df.columns]
date value ID
0 2019-01-01 00:00:00 10.0 Jackie
1 2019-01-01 01:00:00 NaN Jackie
2 2019-01-01 02:00:00 NaN Jackie
3 2019-01-01 03:00:00 NaN Jackie
4 2019-09-01 02:00:00 12.0 Zoop
5 2019-09-01 03:00:00 NaN Zoop
6 2019-09-01 04:00:00 NaN Zoop
7 2019-09-01 05:00:00 NaN Zoop
I have a data frame like the one below and want to sample it with '3S'.
Some values are NaN. What I expect is that the data frame is sampled with '3S', and that whenever a NaN is found in between, sampling stops there and restarts from that index. I tried to achieve this with dataframe.apply, but it looks very complex. Is there a shorter way?
df.sample(n=3)
Code to generate Input:
import numpy as np
import pandas as pd
index = pd.date_range('1/1/2000', periods=13, freq='min')
series = pd.DataFrame(range(13), index=index)
print(series)
series.iloc[4] = np.nan
series.iloc[10] = np.nan
I tried to do the sampling, but after that I have no idea how to proceed.
2015-01-01 00:00:00 0.0
2015-01-01 01:00:00 1.0
2015-01-01 02:00:00 2.0
2015-01-01 03:00:00 2.0
2015-01-01 04:00:00 NaN
2015-01-01 05:00:00 3.0
2015-01-01 06:00:00 4.0
2015-01-01 07:00:00 4.0
2015-01-01 08:00:00 4.0
2015-01-01 09:00:00 NaN
2015-01-01 10:00:00 3.0
2015-01-01 11:00:00 4.0
2015-01-01 12:00:00 4.0
The new data frame should be sampled with '3S', but should also take NaN records into account and restart the sampling wherever one is found.
Expected Output:
2015-01-01 02:00:00 2.0 -- Sampling after 3S
2015-01-01 03:00:00 2.0 -- Print because NaN has found in Next
2015-01-01 04:00:00 NaN -- print NaN record
2015-01-01 07:00:00 4.0 -- Sampling after 3S
2015-01-01 08:00:00 4.0 -- Print because NaN has found in Next
2015-01-01 09:00:00 NaN -- print NaN record
2015-01-01 12:00:00 4.0 -- Sampling after 3S
Use:
index = pd.date_range('1/1/2000', periods=13, freq='h')
df = pd.DataFrame({'col': range(13)}, index=index)
df.iloc[4, 0] = np.nan
df.iloc[9, 0] = np.nan
print (df)
col
2000-01-01 00:00:00 0.0
2000-01-01 01:00:00 1.0
2000-01-01 02:00:00 2.0
2000-01-01 03:00:00 3.0
2000-01-01 04:00:00 NaN
2000-01-01 05:00:00 5.0
2000-01-01 06:00:00 6.0
2000-01-01 07:00:00 7.0
2000-01-01 08:00:00 8.0
2000-01-01 09:00:00 NaN
2000-01-01 10:00:00 10.0
2000-01-01 11:00:00 11.0
2000-01-01 12:00:00 12.0
m = df['col'].isna()
s1 = m.ne(m.shift()).cumsum()
t = pd.Timedelta(2, unit='h')
mask = df.index >= df.groupby(s1)['col'].transform(lambda x: x.index[0]) + t
df1 = df[mask | m]
print (df1)
col
2000-01-01 02:00:00 2.0
2000-01-01 03:00:00 3.0
2000-01-01 04:00:00 NaN
2000-01-01 07:00:00 7.0
2000-01-01 08:00:00 8.0
2000-01-01 09:00:00 NaN
2000-01-01 12:00:00 12.0
Explanation:
Create a mask of the missing values with Series.isna
Create groups of consecutive values by comparing the mask with its shifted version using Series.ne (!=) and taking the cumulative sum
print (s1)
2000-01-01 00:00:00 1
2000-01-01 01:00:00 1
2000-01-01 02:00:00 1
2000-01-01 03:00:00 1
2000-01-01 04:00:00 2
2000-01-01 05:00:00 3
2000-01-01 06:00:00 3
2000-01-01 07:00:00 3
2000-01-01 08:00:00 3
2000-01-01 09:00:00 4
2000-01-01 10:00:00 5
2000-01-01 11:00:00 5
2000-01-01 12:00:00 5
Freq: H, Name: col, dtype: int32
Get the first index value of each group, add the timedelta (2 hours here, to match the expected output) and compare the result with the DatetimeIndex
Finally, filter by boolean indexing, chaining the two masks with | (bitwise OR)
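If the rows are evenly spaced, the steps above can also be sketched by position instead of by time: number the rows inside each run with groupby(...).cumcount() and keep those from the third row onward. This is an alternative to the timedelta comparison, not the answer's own method; the names `run_id`, `pos` and `kept` are illustrative.

```python
import numpy as np
import pandas as pd

index = pd.date_range("2000-01-01", periods=13, freq="60min")
df = pd.DataFrame({"col": np.arange(13, dtype=float)}, index=index)
df.iloc[[4, 9], 0] = np.nan

# Mask of missing values.
m = df["col"].isna()

# A new run id starts every time the mask flips between NaN and non-NaN.
run_id = m.ne(m.shift()).cumsum()

# Position of each row inside its run (0, 1, 2, ...).
pos = df.groupby(run_id).cumcount()

# Keep rows from the third position of each run onward, plus the NaN rows.
kept = df[(pos >= 2) | m]
print(kept)
```

With irregular timestamps the two approaches differ, and the timedelta comparison is the one that follows actual elapsed time.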
One way would be to fill the NaNs with 0:
df['Col_of_Interest'] = df['Col_of_Interest'].fillna(0)
Then resample the series (assuming the datetime is your index):
series.resample('30S').asfreq()
I am trying to set up a function with two different dictionaries.
datetime demand
0 2016-01-01 00:00:00 50.038
1 2016-01-01 00:00:10 50.021
2 2016-01-01 00:00:20 50.013
datetime dap
2016-01-01 00:00:00+01:00 23.86
2016-01-01 01:00:00+01:00 22.39
2016-01-01 02:00:00+01:00 20.59
As you can see, the dates are the same, but the time step (deltaT) is different.
The function I have set up is as follows:
for key, value in dap.items():
a = demand * value
print(a)
How do I make sure that the dap value 23.86 is used for the interval from 2016-01-01 00:00:00 until 2016-01-01 01:00:00? In other words, the values indexed 1-6 in the first dictionary should be multiplied by the dap value for 2016-01-01 00:00:00+01:00 (23.86), the values indexed 7-12 by the dap value 22.39, and so on.
datetime demand
0 2019-01-01 00:00:00 50.038
1 2019-01-01 00:00:10 50.021
2 2019-01-01 00:00:20 50.013
3 2019-01-01 00:00:30 50.004
4 2019-01-01 00:00:40 50.004
5 2019-01-01 00:00:50 50.009
6 2019-01-01 00:01:00 50.012
7 2019-01-01 00:01:10 49.998
8 2019-01-01 00:01:20 49.983
9 2019-01-01 00:01:30 49.979
10 2019-01-01 00:01:40 49.983
11 2019-01-01 00:01:50 49.983
12 2019-01-01 00:02:00 49.983
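One possible way to line the two series up is pandas.merge_asof, which attaches to each demand row the most recent dap row at or before its timestamp. The sketch below rests on several assumptions: both tables are held as DataFrames with a datetime column, both use the same clock (the +01:00 offset is dropped), and the sample rows and the `cost` column are invented for illustration.

```python
import pandas as pd

# Hypothetical 10-second demand readings spanning an hour boundary.
demand = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2016-01-01 00:00:00", "2016-01-01 00:00:10", "2016-01-01 00:59:50",
        "2016-01-01 01:00:00", "2016-01-01 01:00:10",
    ]),
    "demand": [50.038, 50.021, 50.013, 50.004, 50.009],
})

# Hourly dap values (time zone dropped; assumed to match the demand clock).
dap = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2016-01-01 00:00:00", "2016-01-01 01:00:00", "2016-01-01 02:00:00",
    ]),
    "dap": [23.86, 22.39, 20.59],
})

# Both frames must be sorted by the key. "backward" picks the last dap row
# whose timestamp is <= the demand timestamp.
merged = pd.merge_asof(demand, dap, on="datetime", direction="backward")
merged["cost"] = merged["demand"] * merged["dap"]
print(merged)
```

Every reading from 00:00:00 up to, but not including, 01:00:00 then uses 23.86, and readings from 01:00:00 onward use 22.39.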
I'm trying to resample a dataframe with a time series from 1-hour increments to 15-minute. Both .resample() and .asfreq() do almost exactly what I want, but I'm having a hard time filling the last three intervals.
I could add an extra hour at the end, resample, and then drop that last hour, but it feels hacky.
Current code:
df = pd.DataFrame({'date':pd.date_range('2018-01-01 00:00', '2018-01-01 01:00', freq = '1H'), 'num':5})
df = df.set_index('date').asfreq('15T', method = 'ffill', how = 'end').reset_index()
Current output:
date num
0 2018-01-01 00:00:00 5
1 2018-01-01 00:15:00 5
2 2018-01-01 00:30:00 5
3 2018-01-01 00:45:00 5
4 2018-01-01 01:00:00 5
Desired output:
date num
0 2018-01-01 00:00:00 5
1 2018-01-01 00:15:00 5
2 2018-01-01 00:30:00 5
3 2018-01-01 00:45:00 5
4 2018-01-01 01:00:00 5
5 2018-01-01 01:15:00 5
6 2018-01-01 01:30:00 5
7 2018-01-01 01:45:00 5
Thoughts?
Not sure about asfreq, but reindex works well (on pandas older than 1.4, pass closed='left' instead of inclusive='left'):
df.set_index('date').reindex(
    pd.date_range(
        df.date.min(),
        df.date.max() + pd.Timedelta('1h'), freq='15min', inclusive='left'
    ),
    method='ffill'
)
num
2018-01-01 00:00:00 5
2018-01-01 00:15:00 5
2018-01-01 00:30:00 5
2018-01-01 00:45:00 5
2018-01-01 01:00:00 5
2018-01-01 01:15:00 5
2018-01-01 01:30:00 5
2018-01-01 01:45:00 5
I have a pandas df and with df['Battery capacity'] = df['total_load'].cumsum() + 5200
This adds the values from "total_load" to the battery capacity (or subtracts them when they are negative).
Now I would like to add something to my code that stops the adding/subtracting at a certain value. For example, I don't want any values higher than 5200, so at 13:00:00 the adding up should stop at 5200.
How could I implement that in my code? Scott Boston proposed an if-statement, but how would that work with my code? Something like df['Battery capacity'] = df['total_load'].cumsum(if battery capacity = 5200, then stop adding) + 5200
Should I try to write a function?
Output should be something like that:
time total_load battery capacity
2016-06-01 12:00:00 2150 4487.7
2016-06-01 13:00:00 1200 5688 (but should stop at 5200)
2016-06-01 14:00:00 1980 5200 (don't actually add values now because we are still at 5200)
You can use np.clip to clip upper and lower bounds.
df['Battery capacity'] = np.clip(df['total_load'].cumsum() + 5200,-np.inf,5200)
Or, as @jezrael points out, a pandas Series has a clip method:
df['Battery capacity'] = (df['total_load'].cumsum() + 5200).clip(-np.inf,5200)
Output:
Battery capacity total_load
2016-01-01 00:00:00 4755.0000 -445.0000
2016-01-01 01:00:00 4375.0000 -380.0000
2016-01-01 02:00:00 4025.0000 -350.0000
2016-01-01 03:00:00 3685.0000 -340.0000
2016-01-01 04:00:00 2955.4500 -729.5500
2016-01-01 05:00:00 1870.4500 -1085.0000
2016-01-01 06:00:00 879.1500 -991.3000
2016-01-01 07:00:00 -2555.8333 -3434.9833
2016-01-01 08:00:00 -1952.7503 603.0830
2016-01-01 09:00:00 -864.7503 1088.0000
2016-01-01 10:00:00 1155.2497 2020.0000
2016-01-01 11:00:00 2336.2497 1181.0000
2016-01-01 12:00:00 4486.2497 2150.0000
2016-01-01 13:00:00 5200.0000 1200.8330
2016-01-01 14:00:00 5200.0000 1980.0000
2016-01-01 15:00:00 5200.0000 -221.2667
If you also don't want the value to go below zero, replace -np.inf with 0:
Battery capacity total_load
2016-01-01 00:00:00 4755.0000 -445.0000
2016-01-01 01:00:00 4375.0000 -380.0000
2016-01-01 02:00:00 4025.0000 -350.0000
2016-01-01 03:00:00 3685.0000 -340.0000
2016-01-01 04:00:00 2955.4500 -729.5500
2016-01-01 05:00:00 1870.4500 -1085.0000
2016-01-01 06:00:00 879.1500 -991.3000
2016-01-01 07:00:00 0.0000 -3434.9833
2016-01-01 08:00:00 0.0000 603.0830
2016-01-01 09:00:00 0.0000 1088.0000
2016-01-01 10:00:00 1155.2497 2020.0000
2016-01-01 11:00:00 2336.2497 1181.0000
2016-01-01 12:00:00 4486.2497 2150.0000
2016-01-01 13:00:00 5200.0000 1200.8330
2016-01-01 14:00:00 5200.0000 1980.0000
2016-01-01 15:00:00 5200.0000 -221.2667
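Note that clip is applied after cumsum, so the underlying running total keeps growing past the limit. This is why the 15:00 row above still shows 5200 even though its load is -221.2667. If the intent is that the battery level itself saturates, so that a later discharge starts from 5200, the clamp has to happen at every step. A minimal sketch with a plain loop follows; the function name `saturating_cumsum`, the starting level of 2337.7 and the sample loads are assumptions for illustration:

```python
import pandas as pd

def saturating_cumsum(loads, start, lower, upper):
    """Running total that is clamped to [lower, upper] after every step."""
    level = start
    levels = []
    for load in loads:
        level = min(max(level + load, lower), upper)
        levels.append(level)
    return levels

# Hypothetical hourly loads, loosely based on the rows in the question.
df = pd.DataFrame(
    {"total_load": [2150.0, 1200.0, 1980.0, -221.0]},
    index=pd.date_range("2016-06-01 12:00", periods=4, freq="60min"),
)

df["Battery capacity"] = saturating_cumsum(
    df["total_load"], start=2337.7, lower=0.0, upper=5200.0
)
print(df)
```

Here the level reaches 5200 at 13:00, stays there at 14:00, and drops to 4979 at 15:00 once the load turns negative. The loop is slower than cumsum plus clip on large frames, so it is only worth using when this step-by-step behaviour is what is wanted.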