Python Pandas period strings do not work on minutes

my df is like this:
timestamp power
0 2022-01-01 00:00:00 100.000000
1 2022-01-01 00:00:01 100.004526
2 2022-01-01 00:00:02 100.009053
3 2022-01-01 00:00:03 100.013579
4 2022-01-01 00:00:04 100.018105
... ... ...
31535995 2022-12-31 23:59:55 136.750000
31535996 2022-12-31 23:59:56 136.560000
31535997 2022-12-31 23:59:57 136.440000
31535998 2022-12-31 23:59:58 136.380000
31535999 2022-12-31 23:59:59 136.530000
[31536000 rows x 2 columns]
I have a super simple script:
directory = 'data/peak_shaving/20220803_132445'
df = pd.read_csv(f'{directory}/demand_profile_simulation.csv')
df['timestamp'] = pd.to_datetime(df['timestamp'])
df = df.groupby(pd.PeriodIndex(df['timestamp'], freq="15min"))['power'].mean()
the result for this is:
timestamp
2022-01-01 00:00 100.133526
2022-01-01 00:01 100.405105
2022-01-01 00:02 100.676684
2022-01-01 00:03 100.948263
2022-01-01 00:04 101.219842
...
2022-12-31 23:55 153.952833
2022-12-31 23:56 150.040333
2022-12-31 23:57 146.124167
2022-12-31 23:58 142.225833
2022-12-31 23:59 138.318167
Freq: 15T, Name: power, Length: 525600, dtype: float64
As you can see, it is grouped by minute, not in 15-minute intervals.
When I try another freq, like one day, it works perfectly:
2022-01-01 120.291041
2022-01-02 126.085428
2022-01-03 120.840020
2022-01-04 124.335800
2022-01-05 119.230694
...
2022-12-27 125.802254
2022-12-28 123.833951
2022-12-29 126.609810
2022-12-30 123.971885
2022-12-31 122.798069
Freq: D, Name: power, Length: 365, dtype: float64
I also tested hours and many other frequencies, and they work, but I cannot make it work for 15-minute intervals. Is there any issue in my code? Thanks

Your solution works correctly for me; here is an alternative with Series.dt.to_period:
df = pd.read_csv(f'{directory}/demand_profile_simulation.csv', parse_dates=['timestamp'])
df = df.groupby(df['timestamp'].dt.to_period('15Min'))['power'].mean()
Other solutions:
df = pd.read_csv(f'{directory}/demand_profile_simulation.csv', parse_dates=['timestamp'])
df = df.groupby(pd.Grouper(key='timestamp', freq="15min"))['power'].mean()
#alternative
#df = df.resample("15min", on='timestamp')['power'].mean()
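If you want to sanity-check any of these approaches, here is a minimal sketch on synthetic data (the one-hour, 1-second series below is made up, not your CSV):
import pandas as pd
import numpy as np

# One hour of 1-second readings standing in for the real file
ts = pd.date_range('2022-01-01', periods=3600, freq='s')
df = pd.DataFrame({'timestamp': ts, 'power': np.linspace(100, 101, 3600)})

out = df.groupby(df['timestamp'].dt.to_period('15min'))['power'].mean()
print(out)  # four rows, one per 15-minute period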

You can go through the pandas.date_range documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.date_range.html
I think this may help.
For example:
pd.Series(pd.date_range('1/1/2020', '1/2/2020',
                        freq='15min', closed='left')).dt.time
# note: in pandas >= 1.4, closed= is deprecated in favor of inclusive='left'

Related

How to split time records across midnight in Pandas

For example, if I have the following data:
df = pd.DataFrame({'Start': ['2022-01-01 08:30:00', '2022-01-01 13:00:00', '2022-01-02 22:00:00'],
'Stop': ['2022-01-01 12:00:00', '2022-01-02 10:30:00', '2022-01-04 8:00:00']})
df = df.apply(pd.to_datetime)
Start Stop
0 2022-01-01 08:30:00 2022-01-01 12:00:00
1 2022-01-01 13:00:00 2022-01-02 10:30:00
2 2022-01-02 22:00:00 2022-01-04 08:00:00
How can I split each record across midnight and upsample my data, so it looks like this:
Start Stop
0 2022-01-01 08:30:00 2022-01-01 12:00:00
1 2022-01-01 13:00:00 2022-01-02 00:00:00
2 2022-01-02 00:00:00 2022-01-02 10:30:00
3 2022-01-02 22:00:00 2022-01-03 00:00:00
4 2022-01-03 00:00:00 2022-01-04 00:00:00
5 2022-01-04 00:00:00 2022-01-04 08:00:00
I want to calculate the duration per day for each time record using df['Stop'] - df['Start']. Maybe there is another way to do it. Thank you!
You could start by implementing a function that computes all the date splits for each row:
from datetime import timedelta

def split_date(start, stop):
    # Same-day case: nothing to split
    if start.date() == stop.date():
        return [(start, stop)]
    # Multi-day case: cut at the next midnight, then recurse on the remainder
    stop_split = start.replace(hour=0, minute=0, second=0) + timedelta(days=1)
    return [(start, stop_split)] + split_date(stop_split, stop)
Then you can use your existing dataframe to create a new one containing all the records, by computing the split of each record:
new_dates = [
    elt for _, row in df.iterrows() for elt in split_date(row["Start"], row["Stop"])
]
new_df = pd.DataFrame(new_dates, columns=["Start", "Stop"])
The output should then be the one you expected:
Start Stop
0 2022-01-01 08:30:00 2022-01-01 12:00:00
1 2022-01-01 13:00:00 2022-01-02 00:00:00
2 2022-01-02 00:00:00 2022-01-02 10:30:00
3 2022-01-02 22:00:00 2022-01-03 00:00:00
4 2022-01-03 00:00:00 2022-01-04 00:00:00
5 2022-01-04 00:00:00 2022-01-04 08:00:00
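From there, the per-day duration asked about in the question is just the column difference (a small sketch; the Duration column name is my own):
new_df['Duration'] = new_df['Stop'] - new_df['Start']
print(new_df)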

How to vectorize an expensive for loop in python

I have a pandas column which I have initialized with ones; this column represents the health of a solar panel.
I need to decay this value linearly, except at the time when the panel is replaced, where the value resets to 1 (hence initializing the column to ones). What I am doing is looping through the column and updating the current value with the previous value minus a constant.
This operation is extremely expensive (and I have over 200,000 samples). I was hoping someone might be able to help me with a vectorized solution, where I can avoid this for loop. Here is my code:
def set_degredation_factors_pv(df):
    for i in df.index:
        if i != replacement_duration_PV_year * hour_per_year and i != 0:
            df.loc[i, 'degradation_factor_PV_power_frac'] = (
                df.loc[i - 1, 'degradation_factor_PV_power_frac']
                - degradation_rate_PV_power_perc_per_hour / 100
            )
    return df
Variables:
replacement_duration_PV_year = 25
hour_per_year = 8760
degradation_rate_PV_power_perc_per_hour = 5.479e-5
Input data:
time_datetime degradation_factor_PV_power_frac
0 2022-01-01 00:00:00 1
1 2022-01-01 01:00:00 1
2 2022-01-01 02:00:00 1
3 2022-01-01 03:00:00 1
4 2022-01-01 04:00:00 1
... ... ...
8732 2022-12-30 20:00:00 1
8733 2022-12-30 21:00:00 1
8734 2022-12-30 22:00:00 1
8735 2022-12-30 23:00:00 1
8736 2022-12-31 00:00:00 1
Output data (only taking one year for time):
time_datetime degradation_factor_PV_power_frac
0 2022-01-01 00:00:00 1.000000
1 2022-01-01 01:00:00 0.999999
2 2022-01-01 02:00:00 0.999999
3 2022-01-01 03:00:00 0.999998
4 2022-01-01 04:00:00 0.999998
... ... ...
8732 2022-12-30 20:00:00 0.995216
8733 2022-12-30 21:00:00 0.995215
8734 2022-12-30 22:00:00 0.995215
8735 2022-12-30 23:00:00 0.995214
8736 2022-12-31 00:00:00 0.995214
Try:
rate = degradation_rate_PV_power_perc_per_hour / 100
mask = ~((df.index != replacement_duration_PV_year * hour_per_year)
& (df.index != 0))
df['degradation_factor_PV_power_frac'] = (
df.groupby(mask.cumsum())['degradation_factor_PV_power_frac']
.apply(lambda x: x.shift().sub(rate).cumprod())
.fillna(df['degradation_factor_PV_power_frac'])
)
Output:
>>> df
time_datetime degradation_factor_PV_power_frac
0 2022-01-01 00:00:00 1.000000
1 2022-01-01 01:00:00 0.999999
2 2022-01-01 02:00:00 0.999999
3 2022-01-01 03:00:00 0.999998
4 2022-01-01 04:00:00 0.999998
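Note that x.shift().sub(rate).cumprod() compounds the rate multiplicatively, so it matches the loop's constant subtraction only approximately (very closely here, since the rate is tiny). If you need the exactly linear behaviour, here is a fully vectorized sketch, assuming the question's constants and a default RangeIndex:
# Constants taken from the question
replacement_duration_PV_year = 25
hour_per_year = 8760
rate = 5.479e-5 / 100  # degradation_rate_PV_power_perc_per_hour / 100

def set_degredation_factors_pv_vectorized(df):
    # True wherever the panel is replaced and the health factor resets to 1
    reset = (df.index == 0) | (df.index == replacement_duration_PV_year * hour_per_year)
    # Hours elapsed since the most recent reset, counted per reset segment
    hours_since_reset = df.groupby(reset.cumsum()).cumcount()
    # Exactly linear decay: subtract the constant once per elapsed hour
    df['degradation_factor_PV_power_frac'] = 1.0 - hours_since_reset * rate
    return df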

pandas datetime index unique difference

The following works for getting the unique differences between consecutive datetime index values.
# Data
import pandas
d = pandas.DataFrame({"a": [x for x in range(5)]})
d.index = pandas.date_range("2021-01-01 00:00:00", "2021-01-01 01:00:00", freq="15min")
# Get difference
delta = d.index.to_series().diff().astype("timedelta64[m]").unique()
delta
# array([nan, 15.])
But I am not clear where the nan comes from. I am only interested in the 15 minutes. Is delta[1] a reliable way to get it or am I missing something?
The first row doesn't have anything to diff against, so it's NaT.
>>> d.index.to_series().diff()
2021-01-01 00:00:00 NaT
2021-01-01 00:15:00 00:15:00
2021-01-01 00:30:00 00:15:00
2021-01-01 00:45:00 00:15:00
2021-01-01 01:00:00 00:15:00
Freq: 15T, dtype: timedelta64[ns]
From pandas.Series.unique: "Uniques are returned in order of appearance." Since that NaT is guaranteed to be the first element of the returned array, it is okay to do delta[1] as you suggest, assuming you have at least 2 rows and no NaT in the data.
More generally, if you don't want that first value in a diff, you can slice it off:
>>> d.index.to_series().diff()[1:]
2021-01-01 00:15:00 00:15:00
2021-01-01 00:30:00 00:15:00
2021-01-01 00:45:00 00:15:00
2021-01-01 01:00:00 00:15:00
Freq: 15T, dtype: timedelta64[ns]
When you do diff, the first item will be NaN (NaT here) in pandas, which is not the same as in R:
d.index.to_series().diff()
Out[713]:
2021-01-01 00:00:00 NaT
2021-01-01 00:15:00 0 days 00:15:00
2021-01-01 00:30:00 0 days 00:15:00
2021-01-01 00:45:00 0 days 00:15:00
2021-01-01 01:00:00 0 days 00:15:00
Freq: 15T, dtype: timedelta64[ns]
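If you'd rather not rely on the NaT's position at all, you can drop it before taking uniques (a small sketch; it assumes the index itself contains no missing timestamps):
delta = d.index.to_series().diff().dropna().unique()
# delta is now just the single 15-minute timedelta, with no leading NaT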

How to resample a dataframe

I have a problem: when I resample the dataframe index, the dates change!
>>>dpvis=dpvi.Puissance.resample('10min').mean()
>>> dpvi.head()
Puissance
Date
2016-05-01 00:00:00 0
2016-05-01 00:05:00 0
2016-05-01 00:10:00 0
2016-05-01 00:15:00 0
2016-05-01 00:20:00 0
>>> dpvis.head()
Date
2015-06-14 00:00:00 0.0
2015-06-14 00:10:00 0.0
2015-06-14 00:20:00 0.0
2015-06-14 00:30:00 0.0
2015-06-14 00:40:00 0.0
Freq: 10T, Name: Puissance, dtype: float64
Here's a demonstration that resample() will work correctly with the data you've provided, assuming that your dtypes are correct. It's not exactly an answer to your problem, but it may serve as a sort of sanity check.
First, generate sample data for a two-month period at 5-minute intervals:
import pandas as pd
import numpy as np

Date = pd.date_range("2016-05-01", "2016-07-01", freq="5min", name='Date')
Puissance = {'Puissance': np.zeros(len(Date), dtype=int)}
df = pd.DataFrame(Puissance, index=Date)
df.head()
df.head()
Puissance
Date
2016-05-01 00:00:00 0
2016-05-01 00:05:00 0
2016-05-01 00:10:00 0
2016-05-01 00:15:00 0
2016-05-01 00:20:00 0
df.shape # (17569, 1)
df.index.dtype # datetime64[ns]
df.Puissance.dtype # int64
Now resample to 10min intervals:
resampled = df.Puissance.resample('10min').mean()
resampled.shape # (8785,)
Note: df.resample('10min').mean() also gives the same results here.
resampled.head()
Date
2016-05-01 00:00:00 0
2016-05-01 00:10:00 0
2016-05-01 00:20:00 0
2016-05-01 00:30:00 0
2016-05-01 00:40:00 0
Freq: 10T, Name: Puissance, dtype: int64
resampled.tail()
Date
2016-06-30 23:20:00 0
2016-06-30 23:30:00 0
2016-06-30 23:40:00 0
2016-06-30 23:50:00 0
2016-07-01 00:00:00 0
Freq: 10T, Name: Puissance, dtype: int64
Resampling works as expected.
This suggests that there's an issue somewhere with your dtype declarations, or with the format of an observation that isn't shown in your head() output.
One clue may be that your Puissance values start out as integers (0), but are resampled as floats (0.0). If all of your Puissance values are zero-valued integers, the mean output dtype will also be int64, as seen above. (mean() will normally return dtype float64 if the values being averaged are not all the same.) Your example data may not be representative of the actual problem you're trying to solve - if so, consider updating your post with a more representative example.
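Along those lines, a quick sanity check you could run on your own frame before resampling (a sketch, assuming dpvi is the original dataframe from the question):
print(dpvi.index.dtype)                      # expect datetime64[ns], not object
print(dpvi.index.min(), dpvi.index.max())    # range before resampling
dpvis = dpvi.Puissance.resample('10min').mean()
print(dpvis.index.min(), dpvis.index.max())  # should span the same dates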

Optimize code to find the median of values of past 30 day for each row in a DataFrame

I'd like to find faster code to achieve the same goal: for each row, compute the median of all data from the past 30 days. If there are fewer than 5 data points in that window, return np.nan instead.
import pandas as pd
import numpy as np
import datetime

def findPastVar(df, var='var', window=30, method='median'):
    # window = number of past days to look back over
    def findPastVar_apply(row):
        # Strictly earlier rows within the past `window` days
        pastVar = df[var].loc[
            (df['timestamp'] - row['timestamp'] < datetime.timedelta(days=0))
            & (df['timestamp'] - row['timestamp'] > datetime.timedelta(days=-window))
        ]
        if len(pastVar) < 5:
            return np.nan
        if method == 'median':
            return np.median(pastVar.values)
    df['past{}d_{}_median'.format(window, var)] = df.apply(findPastVar_apply, axis=1)
    return df
df = pd.DataFrame()
df['timestamp'] = pd.date_range('1/1/2011', periods=100, freq='D')
df['timestamp'] = df.timestamp.astype(pd.Timestamp)
df['var'] = pd.Series(np.random.randn(len(df['timestamp'])))
The data looks like this. In my real data there are gaps in time, and possibly more than one data point per day.
In [47]: df.head()
Out[47]:
timestamp var
0 2011-01-01 00:00:00 -0.670695
1 2011-01-02 00:00:00 0.315148
2 2011-01-03 00:00:00 -0.717432
3 2011-01-04 00:00:00 2.904063
4 2011-01-05 00:00:00 -1.092813
Desired output:
In [55]: df.head(10)
Out[55]:
timestamp var past30d_var_median
0 2011-01-01 00:00:00 -0.670695 NaN
1 2011-01-02 00:00:00 0.315148 NaN
2 2011-01-03 00:00:00 -0.717432 NaN
3 2011-01-04 00:00:00 2.904063 NaN
4 2011-01-05 00:00:00 -1.092813 NaN
5 2011-01-06 00:00:00 -2.676784 -0.670695
6 2011-01-07 00:00:00 -0.353425 -0.694063
7 2011-01-08 00:00:00 -0.223442 -0.670695
8 2011-01-09 00:00:00 0.162126 -0.512060
9 2011-01-10 00:00:00 0.633801 -0.353425
However, here is the running speed of my current code:
In [49]: %timeit findPastVar(df)
1 loop, best of 3: 755 ms per loop
I need to run this on a large dataframe from time to time, so I want to optimize this code.
Any suggestions or comments are welcome.
New in pandas 0.19 is time-aware rolling. It can deal with missing data.
Code:
print(df.rolling('30d', on='timestamp', min_periods=5)['var'].median())
Test Code:
df = pd.DataFrame()
df['timestamp'] = pd.date_range('1/1/2011', periods=60, freq='D')
df['timestamp'] = df.timestamp.astype(pd.Timestamp)
df['var'] = pd.Series(np.random.randn(len(df['timestamp'])))
# duplicate one sample
df.timestamp.loc[50] = df.timestamp.loc[51]
# drop some data
df = df.drop(range(15, 50))
df['median'] = df.rolling(
'30d', on='timestamp', min_periods=5)['var'].median()
Results:
timestamp var median
0 2011-01-01 00:00:00 -0.639901 NaN
1 2011-01-02 00:00:00 -1.212541 NaN
2 2011-01-03 00:00:00 1.015730 NaN
3 2011-01-04 00:00:00 -0.203701 NaN
4 2011-01-05 00:00:00 0.319618 -0.203701
5 2011-01-06 00:00:00 1.272088 0.057958
6 2011-01-07 00:00:00 0.688965 0.319618
7 2011-01-08 00:00:00 -1.028438 0.057958
8 2011-01-09 00:00:00 1.418207 0.319618
9 2011-01-10 00:00:00 0.303839 0.311728
10 2011-01-11 00:00:00 -1.939277 0.303839
11 2011-01-12 00:00:00 1.052173 0.311728
12 2011-01-13 00:00:00 0.710270 0.319618
13 2011-01-14 00:00:00 1.080713 0.504291
14 2011-01-15 00:00:00 1.192859 0.688965
50 2011-02-21 00:00:00 -1.126879 NaN
51 2011-02-21 00:00:00 0.213635 NaN
52 2011-02-22 00:00:00 -1.357243 NaN
53 2011-02-23 00:00:00 -1.993216 NaN
54 2011-02-24 00:00:00 1.082374 -1.126879
55 2011-02-25 00:00:00 0.124840 -0.501019
56 2011-02-26 00:00:00 -0.136822 -0.136822
57 2011-02-27 00:00:00 -0.744386 -0.440604
58 2011-02-28 00:00:00 -1.960251 -0.744386
59 2011-03-01 00:00:00 0.041767 -0.440604
You can try rolling_median, an O(N log window) implementation using a skip list:
pd.rolling_median(df, window=30, min_periods=5)
(Note: pd.rolling_median was deprecated in pandas 0.18 in favor of the .rolling() method; see the modern equivalent below.)
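In modern pandas the same row-based rolling median is written with the .rolling() method (a sketch; note that window=30 counts 30 rows, not 30 calendar days, unlike the time-aware '30d' version above):
df['median'] = df['var'].rolling(window=30, min_periods=5).median()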
