I have a problem: when I resample my dataframe by its index, the dates change!
>>> dpvis = dpvi.Puissance.resample('10min').mean()
>>> dpvi.head()
Puissance
Date
2016-05-01 00:00:00 0
2016-05-01 00:05:00 0
2016-05-01 00:10:00 0
2016-05-01 00:15:00 0
2016-05-01 00:20:00 0
>>> dpvis.head()
Date
2015-06-14 00:00:00 0.0
2015-06-14 00:10:00 0.0
2015-06-14 00:20:00 0.0
2015-06-14 00:30:00 0.0
2015-06-14 00:40:00 0.0
Freq: 10T, Name: Puissance, dtype: float64
Here's a demonstration that resample() will work correctly with the data you've provided, assuming that your dtypes are correct. It's not exactly an answer to your problem, but it may serve as a sort of sanity check.
First, generate sample data for a two-month period at 5-minute intervals:
import numpy as np
import pandas as pd
Date = pd.date_range("2016-05-01", "2016-07-01", freq="5min", name='Date')
Puissance = {'Puissance': np.zeros(len(Date), dtype=int)}
df = pd.DataFrame(Puissance, index=Date)
df.head()
Puissance
Date
2016-05-01 00:00:00 0
2016-05-01 00:05:00 0
2016-05-01 00:10:00 0
2016-05-01 00:15:00 0
2016-05-01 00:20:00 0
df.shape # (17569, 1)
df.index.dtype # datetime64[ns]
df.Puissance.dtype # int64
Now resample to 10min intervals:
resampled = df.Puissance.resample('10min').mean()
resampled.shape # (8785,)
Note: df.resample('10min').mean() also gives the same results here.
resampled.head()
Date
2016-05-01 00:00:00 0
2016-05-01 00:10:00 0
2016-05-01 00:20:00 0
2016-05-01 00:30:00 0
2016-05-01 00:40:00 0
Freq: 10T, Name: Puissance, dtype: int64
resampled.tail()
Date
2016-06-30 23:20:00 0
2016-06-30 23:30:00 0
2016-06-30 23:40:00 0
2016-06-30 23:50:00 0
2016-07-01 00:00:00 0
Freq: 10T, Name: Puissance, dtype: int64
Resampling works as expected.
This suggests that there's an issue somewhere with your dtype declarations, or with the format of an observation that isn't shown in your head() output.
One clue may be that your Puissance values start out as integers (0), but are resampled as floats (0.0). If all of your Puissance values are zero-valued integers, the mean output dtype will also be int64, as seen above. (mean() will normally return dtype float64 if the values being averaged are not all the same.) Your example data may not be representative of the actual problem you're trying to solve - if so, consider updating your post with a more representative example.
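A minimal sanity check you could run on your own frame (the dayfirst=True option below is only an assumption about how your dates were parsed; adjust it to your source data):
dpvi.index.dtype    # should be datetime64[ns], not object
dpvi.index.min()    # if this prints a 2015 date, the frame really does contain
                    # earlier rows that head() doesn't show

# If the index turns out not to be a proper DatetimeIndex, rebuild it, e.g.:
dpvi.index = pd.to_datetime(dpvi.index, dayfirst=True, errors='raise')
dpvis = dpvi.Puissance.resample('10min').mean()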
my df is like this:
timestamp power
0 2022-01-01 00:00:00 100.000000
1 2022-01-01 00:00:01 100.004526
2 2022-01-01 00:00:02 100.009053
3 2022-01-01 00:00:03 100.013579
4 2022-01-01 00:00:04 100.018105
... ... ...
31535995 2022-12-31 23:59:55 136.750000
31535996 2022-12-31 23:59:56 136.560000
31535997 2022-12-31 23:59:57 136.440000
31535998 2022-12-31 23:59:58 136.380000
31535999 2022-12-31 23:59:59 136.530000
[31536000 rows x 2 columns]
I have a super simple script:
directory = 'data/peak_shaving/20220803_132445'
df = pd.read_csv(f'{directory}/demand_profile_simulation.csv')
df['timestamp'] = pd.to_datetime(df['timestamp'])
df = df.groupby(pd.PeriodIndex(df['timestamp'], freq="15min"))['power'].mean()
the result for this is:
timestamp
2022-01-01 00:00 100.133526
2022-01-01 00:01 100.405105
2022-01-01 00:02 100.676684
2022-01-01 00:03 100.948263
2022-01-01 00:04 101.219842
...
2022-12-31 23:55 153.952833
2022-12-31 23:56 150.040333
2022-12-31 23:57 146.124167
2022-12-31 23:58 142.225833
2022-12-31 23:59 138.318167
Freq: 15T, Name: power, Length: 525600, dtype: float64
As you can see, it is grouped by minutes, not by 15-minute intervals.
When I try another freq, like one day, it works perfectly:
2022-01-01 120.291041
2022-01-02 126.085428
2022-01-03 120.840020
2022-01-04 124.335800
2022-01-05 119.230694
...
2022-12-27 125.802254
2022-12-28 123.833951
2022-12-29 126.609810
2022-12-30 123.971885
2022-12-31 122.798069
Freq: D, Name: power, Length: 365, dtype: float64
I also tested hours and many other freqs and they work, but I cannot make it work for 15-minute intervals. Is there any issue in my code? Thanks
Your solution works correctly for me; here is an alternative with Series.dt.to_period:
df = pd.read_csv(f'{directory}/demand_profile_simulation.csv', parse_dates=['timestamp'])
df = df.groupby(df['timestamp'].dt.to_period('15Min'))['power'].mean()
Other solutions:
df = pd.read_csv(f'{directory}/demand_profile_simulation.csv', parse_dates=['timestamp'])
df = df.groupby(pd.Grouper(key='timestamp', freq="15min"))['power'].mean()
#alternative
#df = df.resample("15min", on='timestamp')['power'].mean()
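If you want to verify the Grouper / resample approach independently of your CSV, here is a minimal self-contained sketch; the one-second synthetic data is made up and only the column names mirror the question, so the exact means are not shown:
import numpy as np
import pandas as pd

# synthetic one-second data for one hour, same column names as the question
ts = pd.date_range("2022-01-01", periods=3600, freq="s")
df = pd.DataFrame({"timestamp": ts, "power": np.linspace(100, 101, len(ts))})

df.groupby(pd.Grouper(key="timestamp", freq="15min"))["power"].mean()
# yields four 15-minute bins labelled 00:00, 00:15, 00:30 and 00:45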
You can go through this link: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.date_range.html
I think this may help. For example:
pd.Series(pd.date_range('1/1/2020', '1/2/2020', freq='15min', closed='left')).dt.time
The following works for getting the unique difference between consecutive values of a datetime index.
# Data
import pandas
d = pandas.DataFrame({"a": [x for x in range(5)]})
d.index = pandas.date_range("2021-01-01 00:00:00", "2021-01-01 01:00:00", freq="15min")
# Get difference
delta = d.index.to_series().diff().astype("timedelta64[m]").unique()
delta
# array([nan, 15.])
But I am not clear on where the nan comes from. I am only interested in the 15 minutes. Is delta[1] a reliable way to get it, or am I missing something?
The first row doesn't have anything to diff against, so it's NaT.
>>> d.index.to_series().diff()
2021-01-01 00:00:00 NaT
2021-01-01 00:15:00 00:15:00
2021-01-01 00:30:00 00:15:00
2021-01-01 00:45:00 00:15:00
2021-01-01 01:00:00 00:15:00
Freq: 15T, dtype: timedelta64[ns]
From pandas.Series.unique: "Uniques are returned in order of appearance." Since that NaT is guaranteed to be the first element in the returned array, it is okay to do delta[1] as you suggest, assuming you have at least 2 rows and no NaT in the data.
More generally, if you don't want that first value in a diff, you can slice it off:
>>> d.index.to_series().diff()[1:]
2021-01-01 00:15:00 00:15:00
2021-01-01 00:30:00 00:15:00
2021-01-01 00:45:00 00:15:00
2021-01-01 01:00:00 00:15:00
Freq: 15T, dtype: timedelta64[ns]
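If you would rather not rely on the NaT's position at all, a small alternative (same idea, just dropping the NaT before taking uniques) is:
delta = d.index.to_series().diff().dropna().unique()
delta   # array with a single 15-minute timedelta64 value
pandas.Timedelta(delta[0]) / pandas.Timedelta("1min")   # 15.0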
When you do diff, the first item will be NaN (here NaT) in pandas, which is not the same as R (where diff shortens the vector instead):
d.index.to_series().diff()
Out[713]:
2021-01-01 00:00:00 NaT
2021-01-01 00:15:00 0 days 00:15:00
2021-01-01 00:30:00 0 days 00:15:00
2021-01-01 00:45:00 0 days 00:15:00
2021-01-01 01:00:00 0 days 00:15:00
Freq: 15T, dtype: timedelta64[ns]
I have the following table:
Hora_Retiro count_uses
0 00:00:18 1
1 00:00:34 1
2 00:02:27 1
3 00:03:13 1
4 00:06:45 1
... ... ...
748700 23:58:47 1
748701 23:58:49 1
748702 23:59:11 1
748703 23:59:47 1
748704 23:59:56 1
I want to group all values within each hour, so I can see the total number of uses per hour (00:00:00 - 23:00:00).
I have the following code:
hora_pico_aug= hora_pico.groupby(pd.Grouper(key="Hora_Retiro",freq='H')).count()
The Hora_Retiro column is of timedelta64[ns] type.
Which gives the following output:
count_uses
Hora_Retiro
00:00:02 2566
01:00:02 602
02:00:02 295
03:00:02 5
04:00:02 10
05:00:02 4002
06:00:02 16075
07:00:02 39410
08:00:02 76272
09:00:02 56721
10:00:02 36036
11:00:02 32011
12:00:02 33725
13:00:02 41032
14:00:02 50747
15:00:02 50338
16:00:02 42347
17:00:02 54674
18:00:02 76056
19:00:02 57958
20:00:02 34286
21:00:02 22509
22:00:02 13894
23:00:02 7134
However, the index starts at 00:00:02, and I want it to start at 00:00:00 and then go in one-hour intervals. Something like this:
count_uses
Hora_Retiro
00:00:00 2565
01:00:00 603
02:00:00 295
03:00:00 5
04:00:00 10
05:00:00 4002
06:00:00 16075
07:00:00 39410
08:00:00 76272
09:00:00 56721
10:00:00 36036
11:00:00 32011
12:00:00 33725
13:00:00 41032
14:00:00 50747
15:00:00 50338
16:00:00 42347
17:00:00 54674
18:00:00 76056
19:00:00 57958
20:00:00 34286
21:00:00 22509
22:00:00 13894
23:00:00 7134
How can I make it start at 00:00:00?
Thanks for the help!
You can create an hour column from the Hora_Retiro column. Since Hora_Retiro is timedelta64, take the hour from .dt.components (for a datetime column you would use .dt.hour instead):
df['hour'] = df['Hora_Retiro'].dt.components.hours
Then group by hour:
gpby_df = df.groupby('hour')['count_uses'].sum().reset_index()
gpby_df['hour'] = pd.to_datetime(gpby_df['hour'], format='%H').dt.time
gpby_df.columns = ['Hora_Retiro', 'sum_count_uses']
gpby_df
gives
Hora_Retiro sum_count_uses
0 00:00:00 14
1 09:00:00 1
2 10:00:00 2
3 20:00:00 2
I assume that the Hora_Retiro column in your DataFrame is of Timedelta type. It is not datetime; in that case the date part would also be printed.
Indeed, your code creates groups starting at the minute / second taken from the first row.
To group by "full hours":
round each element in this column to hour,
then group (just by this rounded value).
The code to do it is:
hora_pico.groupby(hora_pico.Hora_Retiro.apply(
lambda tt: tt.round('H'))).count_uses.count()
However, decide what you actually want to aggregate: the number of rows, or the values in the count_uses column. In the second case, replace count with sum.
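A minimal self-contained sketch of the floor-based grouping on made-up timedelta data (the sample values below are only illustrative):
import pandas as pd

hora_pico = pd.DataFrame({
    "Hora_Retiro": pd.to_timedelta(["00:05:10", "00:40:00", "01:02:30", "01:59:59", "23:59:56"]),
    "count_uses": [1, 1, 1, 1, 1],
})

# every retiro inside the same clock hour lands in one group labelled hh:00:00
hora_pico.groupby(hora_pico["Hora_Retiro"].dt.floor("H"))["count_uses"].sum()
# 0 days 00:00:00    2
# 0 days 01:00:00    2
# 0 days 23:00:00    1
# Name: count_uses, dtype: int64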
I'm working on a timeseries dataframe which looks like this and has data from January to August 2020.
Timestamp Value
2020-01-01 00:00:00 -68.95370
2020-01-01 00:05:00 -67.90175
2020-01-01 00:10:00 -67.45966
2020-01-01 00:15:00 -67.07624
2020-01-01 00:20:00 -67.30549
.....
2020-07-01 00:00:00 -65.34212
I'm trying to apply a filter on the previous dataframe using the columns start_time and end_time in the dataframe below:
start_time end_time
2020-01-12 16:15:00 2020-01-13 16:00:00
2020-01-26 16:00:00 2020-01-26 16:10:00
2020-04-12 16:00:00 2020-04-13 16:00:00
2020-04-20 16:00:00 2020-04-21 16:00:00
2020-05-02 16:00:00 2020-05-03 16:00:00
The output should set all values that are not within any start/end window to zero and retain the values that fall inside the windows specified in the filter. I tried applying two simultaneous filters for start and end time, but it didn't work.
Any help would be appreciated.
The idea is to create all masks with Series.between in a list comprehension, join them with np.logical_or.reduce, and finally pass the combined mask to Series.where:
print (df1)
Timestamp Value
0 2020-01-13 00:00:00 -68.95370 <- changed data for match
1 2020-01-01 00:05:00 -67.90175
2 2020-01-01 00:10:00 -67.45966
3 2020-01-01 00:15:00 -67.07624
4 2020-01-01 00:20:00 -67.30549
5 2020-07-01 00:00:00 -65.34212
import numpy as np

L = [df1['Timestamp'].between(s, e) for s, e in df2[['start_time','end_time']].values]
m = np.logical_or.reduce(L)
df1['Value'] = df1['Value'].where(m, 0)
print (df1)
Timestamp Value
0 2020-01-13 00:00:00 -68.9537
1 2020-01-01 00:05:00 0.0000
2 2020-01-01 00:10:00 0.0000
3 2020-01-01 00:15:00 0.0000
4 2020-01-01 00:20:00 0.0000
5 2020-07-01 00:00:00 0.0000
A solution using an outer join with merge and query:
print(df1)
timestamp Value <- changed Timestamp to timestamp to avoid name conflict in query
0 2020-01-13 00:00:00 -68.95370 <- also changed data for match
1 2020-01-01 00:05:00 -67.90175
2 2020-01-01 00:10:00 -67.45966
3 2020-01-01 00:15:00 -67.07624
4 2020-01-01 00:20:00 -67.30549
5 2020-07-01 00:00:00 -65.34212
df1.loc[df1.index.difference(df1.assign(key=0).merge(df2.assign(key=0), how = 'outer')\
.query("timestamp >= start_time and timestamp < end_time").index),"Value"] = 0
result:
timestamp Value
0 2020-01-13 00:00:00 -68.9537
1 2020-01-01 00:05:00 0.0000
2 2020-01-01 00:10:00 0.0000
3 2020-01-01 00:15:00 0.0000
4 2020-01-01 00:20:00 0.0000
5 2020-07-01 00:00:00 0.0000
The assign(key=0) adds a constant key column to both dataframes so that the merge produces a cartesian product.
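As a side note, pandas 1.2+ can build the cartesian product directly with how='cross', which avoids the dummy key; keeping an explicit row id makes the mapping back to df1 unambiguous. A hedged variation (not the answer's exact code):
cross = df1.reset_index().merge(df2, how="cross")          # pandas >= 1.2
inside = cross.query("timestamp >= start_time and timestamp < end_time")["index"]
df1.loc[~df1.index.isin(inside), "Value"] = 0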
Let's look at some one-minute data:
In [513]: rng = pd.date_range('1/1/2000', periods=12, freq='T')
In [514]: ts = Series(np.arange(12), index=rng)
In [515]: ts
Out[515]:
2000-01-01 00:00:00 0
2000-01-01 00:01:00 1
2000-01-01 00:02:00 2
2000-01-01 00:03:00 3
2000-01-01 00:04:00 4
2000-01-01 00:05:00 5
2000-01-01 00:06:00 6
2000-01-01 00:07:00 7
2000-01-01 00:08:00 8
2000-01-01 00:09:00 9
2000-01-01 00:10:00 10
2000-01-01 00:11:00 11
Freq: T
Suppose you wanted to aggregate this data into five-minute chunks or bars by taking
the sum of each group:
In [516]: ts.resample('5min', how='sum')
Out[516]:
2000-01-01 00:00:00 0
2000-01-01 00:05:00 15
2000-01-01 00:10:00 40
2000-01-01 00:15:00 11
Freq: 5T
However, I don't want to use the resample method and still want the same input and output. How can I use groupby, reindex, or other such methods?
You can use a custom pd.Grouper this way:
In [78]: ts.groupby(pd.Grouper(freq='5min', closed='right')).sum()
Out [78]:
1999-12-31 23:55:00 0
2000-01-01 00:00:00 15
2000-01-01 00:05:00 40
2000-01-01 00:10:00 11
Freq: 5T, dtype: int64
The closed='right' ensures that the bin sums are the same; by default the bins are labelled by their left edge, which is why the index above differs from the book output.
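If you also want the right-edge labels, so the index matches the book output exactly, pd.Grouper accepts label='right' as well; a small sketch:
ts.groupby(pd.Grouper(freq='5min', closed='right', label='right')).sum()
2000-01-01 00:00:00     0
2000-01-01 00:05:00    15
2000-01-01 00:10:00    40
2000-01-01 00:15:00    11
Freq: 5T, dtype: int64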
However, if your aim is to do more custom grouping, you can use .groupby with your own vector:
In [78]: buckets = (ts.index - ts.index[0]) / pd.Timedelta('5min')
In [79]: grp = ts.groupby(np.ceil(buckets.values))
In [80]: grp.sum()
Out[80]:
0 0
1 15
2 40
3 11
dtype: int64
The output is not exactly the same, but the method is more flexible (e.g. can create uneven buckets).
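For example, here is a hedged sketch of uneven buckets on the same ts (the bucket edges below are arbitrary):
import numpy as np

# minutes elapsed since the first timestamp
minutes = (ts.index - ts.index[0]).total_seconds() / 60

# three uneven buckets: [0, 2), [2, 7) and [7, 12) minutes
labels = np.digitize(minutes, bins=[2, 7])

ts.groupby(labels).sum()
# 0     1
# 1    20
# 2    45
# dtype: int64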