I am trying to set up a function with two different dictionaries.
datetime demand
0 2016-01-01 00:00:00 50.038
1 2016-01-01 00:00:10 50.021
2 2016-01-01 00:00:20 50.013
datetime dap
2016-01-01 00:00:00+01:00 23.86
2016-01-01 01:00:00+01:00 22.39
2016-01-01 02:00:00+01:00 20.59
As you can see, the dates line up, but the time step (deltaT) differs: demand is sampled every 10 seconds, while dap is hourly.
The function I have set up is as follows:
for key, value in dap.items():
    a = demand * value
    print(a)
How do I make sure that in this function the dap value 23.86 is used for the whole datetime interval from 2016-01-01 00:00:00 until 2016-01-01 01:00:00? For the sample below, that would mean the first dictionary's indexed values 1-6 are multiplied by 23.86 (the dap value for 2016-01-01 00:00:00+01:00), indexed values 7-12 by the next dap value 22.39, and so on.
datetime demand
0 2019-01-01 00:00:00 50.038
1 2019-01-01 00:00:10 50.021
2 2019-01-01 00:00:20 50.013
3 2019-01-01 00:00:30 50.004
4 2019-01-01 00:00:40 50.004
5 2019-01-01 00:00:50 50.009
6 2019-01-01 00:01:00 50.012
7 2019-01-01 00:01:10 49.998
8 2019-01-01 00:01:20 49.983
9 2019-01-01 00:01:30 49.979
10 2019-01-01 00:01:40 49.983
11 2019-01-01 00:01:50 49.983
12 2019-01-01 00:02:00 49.983
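One way to line this up, as a hedged sketch: assuming demand and dap are pandas Series indexed by datetime (as the printed samples suggest), reindex the hourly dap onto the fine-grained demand index with a forward fill, then multiply elementwise. Note the demand sample is tz-naive while dap carries a +01:00 offset, so in practice one side needs tz_localize/tz_convert first; here both indexes are built tz-aware for illustration.

import pandas as pd

# hypothetical construction mirroring the samples; +01:00 applies in January
demand = pd.Series(
    [50.038, 50.021, 50.013],
    index=pd.date_range("2016-01-01", periods=3, freq="10s", tz="Europe/Amsterdam"),
)
dap = pd.Series(
    [23.86, 22.39, 20.59],
    index=pd.date_range("2016-01-01", periods=3, freq="h", tz="Europe/Amsterdam"),
)

# each demand timestamp picks up the most recent hourly dap value
a = demand * dap.reindex(demand.index, method="ffill")
print(a)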
I'm working on a timeseries dataframe which looks like this and has data from January to August 2020.
Timestamp Value
2020-01-01 00:00:00 -68.95370
2020-01-01 00:05:00 -67.90175
2020-01-01 00:10:00 -67.45966
2020-01-01 00:15:00 -67.07624
2020-01-01 00:20:00 -67.30549
.....
2020-07-01 00:00:00 -65.34212
I'm trying to apply a filter on the previous dataframe using the columns start_time and end_time in the dataframe below:
start_time end_time
2020-01-12 16:15:00 2020-01-13 16:00:00
2020-01-26 16:00:00 2020-01-26 16:10:00
2020-04-12 16:00:00 2020-04-13 16:00:00
2020-04-20 16:00:00 2020-04-21 16:00:00
2020-05-02 16:00:00 2020-05-03 16:00:00
The output should set to zero all values whose timestamps are not within any start/end interval, and retain the values that fall inside the intervals specified in the filter. I tried applying two simultaneous filters for start and end time, but it didn't work.
Any help would be appreciated.
The idea is to create all the masks with Series.between in a list comprehension, join them with logical OR via np.logical_or.reduce, and finally pass the combined mask to Series.where:
print (df1)
Timestamp Value
0 2020-01-13 00:00:00 -68.95370 <- changed data for match
1 2020-01-01 00:05:00 -67.90175
2 2020-01-01 00:10:00 -67.45966
3 2020-01-01 00:15:00 -67.07624
4 2020-01-01 00:20:00 -67.30549
5 2020-07-01 00:00:00 -65.34212
import numpy as np

L = [df1['Timestamp'].between(s, e) for s, e in df2[['start_time','end_time']].values]  # one mask per interval
m = np.logical_or.reduce(L)              # True where at least one interval matches
df1['Value'] = df1['Value'].where(m, 0)  # keep matched values, zero out the rest
print (df1)
Timestamp Value
0 2020-01-13 00:00:00 -68.9537
1 2020-01-01 00:05:00 0.0000
2 2020-01-01 00:10:00 0.0000
3 2020-01-01 00:15:00 0.0000
4 2020-01-01 00:20:00 0.0000
5 2020-07-01 00:00:00 0.0000
A solution using an outer merge (a cartesian product via a dummy key) together with query; df1's original index is carried through the merge so the matched rows can be mapped back:
print(df1)
timestamp Value <- changed Timestamp to timestamp to avoid name conflict in query
0 2020-01-13 00:00:00 -68.95370 <- also changed data for match
1 2020-01-01 00:05:00 -67.90175
2 2020-01-01 00:10:00 -67.45966
3 2020-01-01 00:15:00 -67.07624
4 2020-01-01 00:20:00 -67.30549
5 2020-07-01 00:00:00 -65.34212
idx = (df1.assign(key=0, orig_idx=df1.index)
          .merge(df2.assign(key=0), how='outer')
          .query("timestamp >= start_time and timestamp < end_time")['orig_idx'])
df1.loc[df1.index.difference(idx), "Value"] = 0
result:
timestamp Value
0 2020-01-13 00:00:00 -68.9537
1 2020-01-01 00:05:00 0.0000
2 2020-01-01 00:10:00 0.0000
3 2020-01-01 00:15:00 0.0000
4 2020-01-01 00:20:00 0.0000
5 2020-07-01 00:00:00 0.0000
The dummy key added by assign(key=0) to both dataframes makes the merge produce a cartesian product; orig_idx preserves df1's row labels across the merge, since merge resets the index.
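As an aside, on pandas 1.2 and later the dummy key is unnecessary: merge supports how='cross' for the cartesian product directly, so the first step could be written as, for example:

merged = df1.assign(orig_idx=df1.index).merge(df2, how='cross')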
I am grouping a time series by hour to perform an operation on each hour of data separately:
import pandas as pd
from datetime import datetime, timedelta
x = [2, 2, 4, 2, 2, 0]
idx = pd.date_range(
    start=datetime(2019, 1, 1),
    end=datetime(2019, 1, 1, 2, 30),
    freq=timedelta(minutes=30),
)
s = pd.Series(x, index=idx)
hourly = s.groupby(lambda x: x.hour)
print(s)
print("summed:")
print(hourly.sum())
which produces:
2019-01-01 00:00:00 2
2019-01-01 00:30:00 2
2019-01-01 01:00:00 4
2019-01-01 01:30:00 2
2019-01-01 02:00:00 2
2019-01-01 02:30:00 0
Freq: 30T, dtype: int64
summed:
0 4
1 6
2 2
dtype: int64
As expected.
I now want to know the area under the time series per hour, for which I can use numpy.trapz:
import numpy as np

def series_trapz(series):
    hours = [i.timestamp() / 3600 for i in series.index]
    return np.trapz(series, x=hours)
print("Area under curve")
print(hourly.agg(series_trapz))
But for this to work correctly, the boundaries between the groups must appear in both groups!
For example, the first group must be:
2019-01-01 00:00:00 2
2019-01-01 00:30:00 2
2019-01-01 01:00:00 4
and the second group must be
2019-01-01 01:00:00 4
2019-01-01 01:30:00 2
2019-01-01 02:00:00 2
etc.
Is this at all possible using pandas.groupby?
I don't think that I have your np.trapz logic completely correct here, but I think you can probably get what you want with .rolling(..., closed="both") so that the endpoints of the intervals are always included:
In [366]: s.rolling("1H", closed="both").apply(np.trapz).iloc[::2]
Out[366]:
2019-01-01 00:00:00 0.0
2019-01-01 01:00:00 5.0
2019-01-01 02:00:00 5.0
Freq: 60T, dtype: float64
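If the area in value-hours is wanted rather than np.trapz's default unit spacing, and the samples are evenly spaced as here (30 minutes), a hedged variant simply scales the result by the spacing expressed in hours:

s.rolling("1H", closed="both").apply(np.trapz).mul(0.5).iloc[::2]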
I think you could repeat the group boundaries in your series using Series.repeat:
# repeat every on-the-hour point except the very first, so each boundary lands in both groups
r = ((s.index.minute == 0) & (s.index != s.index[0])).astype(int) + 1
new_s = s.repeat(r)
print(new_s)
2019-01-01 00:00:00 2
2019-01-01 00:30:00 2
2019-01-01 01:00:00 4
2019-01-01 01:00:00 4
2019-01-01 01:30:00 2
2019-01-01 02:00:00 2
2019-01-01 02:00:00 2
2019-01-01 02:30:00 0
Then you could use Series.groupby:
# fill_value must be a Timestamp so that .dt still works on the shifted series
groups = (new_s.index.to_series().shift(-1, fill_value=pd.Timestamp(0)).dt.minute != 0).cumsum()
for i, group in new_s.groupby(groups):
    print(group)
    print('-' * 50)
2019-01-01 00:00:00 2
2019-01-01 00:30:00 2
2019-01-01 01:00:00 4
Name: col1, dtype: int64
--------------------------------------------------
2019-01-01 01:00:00 4
2019-01-01 01:30:00 2
2019-01-01 02:00:00 2
Name: col1, dtype: int64
--------------------------------------------------
2019-01-01 02:00:00 2
2019-01-01 02:30:00 0
Name: col1, dtype: int64
--------------------------------------------------
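To close the loop with the asker's goal, a hedged sketch applying the trapezoid rule to each boundary-sharing group:

import numpy as np

areas = new_s.groupby(groups).apply(
    lambda g: np.trapz(g.values, x=[t.timestamp() / 3600 for t in g.index])
)
print(areas)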
IIUC, this can be solved manually with rolling:
hours = np.unique(s.index.floor('H'))
# the answer:
(s.add(s.shift())
   .mul(s.index.to_series()
         .diff()
         .dt.total_seconds()
         .div(3600))
   .rolling('1H').sum()[hours]
)
Output:
2019-01-01 00:00:00 NaN
2019-01-01 01:00:00 5.0
2019-01-01 02:00:00 5.0
dtype: float64
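One note on the arithmetic: the chain sums (y[i] + y[i-1]) * dt, which is twice the trapezoidal area, so appending .div(2) would give areas matching np.trapz with x expressed in hours (2.5 instead of 5.0 here).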
I have a dataset with a value at every 60-minute interval. Now I want to divide it into 15-minute intervals, using the averages between the two hourly values. How do I do that?
Time A
2016-01-01 00:00:00 1
2016-01-01 01:00:00 5
2016-01-01 02:00:00 13
So I now want it in 15-minute intervals, with the averaged values filled in:
Time A
2016-01-01 00:00:00 1
2016-01-01 00:15:00 2 ### at 2016-01-01 00:00:00 values is 1 and
2016-01-01 00:30:00 3 ### at 2016-01-01 01:00:00 values is 5.
2016-01-01 00:45:00 4 ### Therefore we have to fill 4 values ( 15 mins interval )
2016-01-01 01:00:00 5 ### with the average of the hour values.
2016-01-01 01:15:00 7
2016-01-01 01:30:00 9
2016-01-01 01:45:00 11
2016-01-01 02:00:00 13
I tried resampling it with mean to 15 minutes, but (obviously) that won't work: it gives NaN values. Can anyone help me out on how to do this?
I would just resample: df.resample("15min").interpolate("linear")
As you have the column Time set as index already, it should directly work
We can do this in one line with resample, replace and interpolate:
df.resample('15min').sum().replace(0, np.nan).interpolate()
Output
A
Time
2016-01-01 00:00:00 1.0
2016-01-01 00:15:00 2.0
2016-01-01 00:30:00 3.0
2016-01-01 00:45:00 4.0
2016-01-01 01:00:00 5.0
2016-01-01 01:15:00 7.0
2016-01-01 01:30:00 9.0
2016-01-01 01:45:00 11.0
2016-01-01 02:00:00 13.0
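One caveat on the replace trick: a genuine 0 in column A would also be wiped out and interpolated over. A hedged alternative that skips the placeholder entirely is to insert the new slots as NaN with asfreq and interpolate those:

df.resample('15min').asfreq().interpolate()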
You can do that like this:
import pandas as pd

df = pd.DataFrame({
    'Time': ["2016-01-01 00:00:00", "2016-01-01 01:00:00", "2016-01-01 02:00:00"],
    'A': [1, 5, 13],
})
df['Time'] = pd.to_datetime(df['Time'])
# pd.DatetimeIndex(start=..., end=..., freq=...) has been removed; use pd.date_range instead
new_idx = pd.date_range(start=df['Time'].iloc[0], end=df['Time'].iloc[-1], freq='15min')
df2 = df.set_index('Time').reindex(new_idx).interpolate().reset_index()
df2.rename(columns={'index': 'Time'}, inplace=True)
print(df2)
# Time A
# 0 2016-01-01 00:00:00 1.0
# 1 2016-01-01 00:15:00 2.0
# 2 2016-01-01 00:30:00 3.0
# 3 2016-01-01 00:45:00 4.0
# 4 2016-01-01 01:00:00 5.0
# 5 2016-01-01 01:15:00 7.0
# 6 2016-01-01 01:30:00 9.0
# 7 2016-01-01 01:45:00 11.0
# 8 2016-01-01 02:00:00 13.0
If you want column A in the result to be an integer you can add something like:
df2['A'] = df2['A'].round().astype(int)
Let's say I have the DataFrame below. How would I get an extra column 'flag' with 1s on days where the age is bigger than 90, but only if that happens on 2 consecutive days (48h in this case)? The output should contain 1s on 2 or more days, depending on how many days the condition is met. The dataset is much bigger, but I put here just a small portion so you get the idea.
Age
Dates
2019-01-01 00:00:00 29
2019-01-01 01:00:00 56
2019-01-01 02:00:00 82
2019-01-01 03:00:00 13
2019-01-01 04:00:00 35
2019-01-01 05:00:00 53
2019-01-01 06:00:00 25
2019-01-01 07:00:00 23
2019-01-01 08:00:00 21
2019-01-01 09:00:00 12
2019-01-01 10:00:00 15
2019-01-01 11:00:00 9
2019-01-01 12:00:00 13
2019-01-01 13:00:00 87
2019-01-01 14:00:00 9
2019-01-01 15:00:00 63
2019-01-01 16:00:00 62
2019-01-01 17:00:00 52
2019-01-01 18:00:00 43
2019-01-01 19:00:00 77
2019-01-01 20:00:00 95
2019-01-01 21:00:00 79
2019-01-01 22:00:00 77
2019-01-01 23:00:00 5
2019-01-02 00:00:00 78
2019-01-02 01:00:00 41
2019-01-02 02:00:00 10
2019-01-02 03:00:00 10
2019-01-02 04:00:00 88
2019-01-02 05:00:00 19
This would be the desired output:
Dates Age flag
0 2019-01-01 00:00:00 29 1
1 2019-01-01 01:00:00 56 1
2 2019-01-01 02:00:00 82 1
3 2019-01-01 03:00:00 13 1
4 2019-01-01 04:00:00 35 1
5 2019-01-01 05:00:00 53 1
6 2019-01-01 06:00:00 25 1
7 2019-01-01 07:00:00 23 1
8 2019-01-01 08:00:00 21 1
9 2019-01-01 09:00:00 12 1
10 2019-01-01 10:00:00 15 1
11 2019-01-01 11:00:00 9 1
12 2019-01-01 12:00:00 13 1
13 2019-01-01 13:00:00 87 1
14 2019-01-01 14:00:00 9 1
15 2019-01-01 15:00:00 63 1
16 2019-01-01 16:00:00 62 1
17 2019-01-01 17:00:00 52 1
18 2019-01-01 18:00:00 43 1
19 2019-01-01 19:00:00 77 1
20 2019-01-01 20:00:00 95 1
21 2019-01-01 21:00:00 79 1
22 2019-01-01 22:00:00 77 1
23 2019-01-01 23:00:00 5 1
24 2019-01-02 00:00:00 78 0
25 2019-01-02 01:00:00 41 0
26 2019-01-02 02:00:00 10 0
27 2019-01-02 03:00:00 10 0
28 2019-01-02 04:00:00 88 0
29 2019-01-02 05:00:00 19 0
The dates are the index of the dataframe and are incremented by 1h.
Thanks
You can first compare the column with Series.gt, then group by DatetimeIndex.date and check whether at least one value per group is True using GroupBy.transform with GroupBy.any, and last cast the mask to integers for True/False to 1/0 mapping, combining it with the previous answer:
import pandas as pd

df = pd.DataFrame({'Age': 10}, index=pd.date_range('2019-01-01', freq='5H', periods=24))
#for test 1H timestamp use
#df = pd.DataFrame({'Age': 10}, index=pd.date_range('2019-01-01', freq='H', periods=24 * 5))
df.loc[pd.Timestamp('2019-01-02 01:00:00'), 'Age'] = 95
df.loc[pd.Timestamp('2019-01-03 02:00:00'), 'Age'] = 95
df.loc[pd.Timestamp('2019-01-05 19:00:00'), 'Age'] = 95
#print (df)
#for test 48 consecutive values change N = 48
N = 10
s = df['Age'].gt(90)
s1 = (s.groupby(df.index.date).transform('any'))
g1 = s1.ne(s1.shift()).cumsum()
df['flag'] = (s.groupby(g1).transform('size').ge(N) & s1).astype(int)
print (df)
Age flag
2019-01-01 00:00:00 10 0
2019-01-01 05:00:00 10 0
2019-01-01 10:00:00 10 0
2019-01-01 15:00:00 10 0
2019-01-01 20:00:00 10 0
2019-01-02 01:00:00 95 1
2019-01-02 06:00:00 10 1
2019-01-02 11:00:00 10 1
2019-01-02 16:00:00 10 1
2019-01-02 21:00:00 10 1
2019-01-03 02:00:00 95 1
2019-01-03 07:00:00 10 1
2019-01-03 12:00:00 10 1
2019-01-03 17:00:00 10 1
2019-01-03 22:00:00 10 1
2019-01-04 03:00:00 10 0
2019-01-04 08:00:00 10 0
2019-01-04 13:00:00 10 0
2019-01-04 18:00:00 10 0
2019-01-04 23:00:00 10 0
2019-01-05 04:00:00 10 0
2019-01-05 09:00:00 10 0
2019-01-05 14:00:00 10 0
2019-01-05 19:00:00 95 0
Apparently, this could be a solution to the first version of the question: how to add a column whose row values are 1 if at least one of the rows with the same date (y-m-d) has an Age value greater than 90.
import pandas as pd

df = pd.DataFrame({
    'Dates': ['2019-01-01 00:00:00',
              '2019-01-01 01:00:00',
              '2019-01-01 02:00:00',
              '2019-01-02 00:00:00',
              '2019-01-02 01:00:00',
              '2019-01-03 02:00:00',
              '2019-01-03 03:00:00'],
    'Age': [29, 56, 92, 13, 1, 2, 93],
})
df.set_index('Dates', inplace=True)
df.index = pd.to_datetime(df.index)
# note: .day is the day of the month, so this assumes all dates fall within one month and year
df['flag'] = pd.DatetimeIndex(df.index).day
df['flag'] = df.flag.isin(df['flag'][df['Age'] > 90]).astype(int)
It returns:
Age flag
Dates
2019-01-01 00:00:00 29 1
2019-01-01 01:00:00 56 1
2019-01-01 02:00:00 92 1
2019-01-02 00:00:00 13 0
2019-01-02 01:00:00 1 0
2019-01-03 02:00:00 2 1
2019-01-03 03:00:00 93 1
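Since .day compares only the day of the month, dates from different months or years would collide. A hedged variant of the same idea, keyed on the full calendar date instead:

dates = df.index.normalize()
df['flag'] = dates.isin(dates[(df['Age'] > 90).to_numpy()]).astype(int)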
I have a pandas df, and with df['Battery capacity'] = df['total_load'].cumsum() + 5200
I accumulate the values from "total_load" into "Battery capacity", starting from 5200 (negative loads subtract, positive loads add).
So now I would like to add something to my code that stops the adding/subtracting at a certain value; for example, I don't want any values higher than 5200. So let's say at 13:00:00 the adding up should stop at 5200.
How could I implement that in my code? Scott Boston proposed an if statement, but how would that fit into my one-liner, df['Battery capacity'] = df['total_load'].cumsum(<stop adding once Battery capacity hits 5200>) + 5200?
Should I try to write a function?
The output should be something like this:
time total_load battery capacity
2016-06-01 12:00:00 2150 4487.7
2016-06-01 13:00:00 1200 5688 (but should stop at 5200)
2016-06-01 14:00:00 1980 5200 (don't actually add values now because we are still at 5200)
You can use np.clip to clip to upper and lower bounds:
df['Battery capacity'] = np.clip(df['total_load'].cumsum() + 5200, -np.inf, 5200)
Or, as @jezrael points out, pandas Series has a clip method:
df['Battery capacity'] = (df['total_load'].cumsum() + 5200).clip(-np.inf, 5200)
Output:
Battery capacity total_load
2016-01-01 00:00:00 4755.0000 -445.0000
2016-01-01 01:00:00 4375.0000 -380.0000
2016-01-01 02:00:00 4025.0000 -350.0000
2016-01-01 03:00:00 3685.0000 -340.0000
2016-01-01 04:00:00 2955.4500 -729.5500
2016-01-01 05:00:00 1870.4500 -1085.0000
2016-01-01 06:00:00 879.1500 -991.3000
2016-01-01 07:00:00 -2555.8333 -3434.9833
2016-01-01 08:00:00 -1952.7503 603.0830
2016-01-01 09:00:00 -864.7503 1088.0000
2016-01-01 10:00:00 1155.2497 2020.0000
2016-01-01 11:00:00 2336.2497 1181.0000
2016-01-01 12:00:00 4486.2497 2150.0000
2016-01-01 13:00:00 5200.0000 1200.8330
2016-01-01 14:00:00 5200.0000 1980.0000
2016-01-01 15:00:00 5200.0000 -221.2667
Now, if you didn't want the value to go below zero, replace -np.inf with 0:
Battery capacity total_load
2016-01-01 00:00:00 4755.0000 -445.0000
2016-01-01 01:00:00 4375.0000 -380.0000
2016-01-01 02:00:00 4025.0000 -350.0000
2016-01-01 03:00:00 3685.0000 -340.0000
2016-01-01 04:00:00 2955.4500 -729.5500
2016-01-01 05:00:00 1870.4500 -1085.0000
2016-01-01 06:00:00 879.1500 -991.3000
2016-01-01 07:00:00 0.0000 -3434.9833
2016-01-01 08:00:00 0.0000 603.0830
2016-01-01 09:00:00 0.0000 1088.0000
2016-01-01 10:00:00 1155.2497 2020.0000
2016-01-01 11:00:00 2336.2497 1181.0000
2016-01-01 12:00:00 4486.2497 2150.0000
2016-01-01 13:00:00 5200.0000 1200.8330
2016-01-01 14:00:00 5200.0000 1980.0000
2016-01-01 15:00:00 5200.0000 -221.2667
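A final caveat, offered as a hedged sketch: clipping the cumulative sum is not quite a saturating battery. Once the clipped total sits at 5200, a later discharge is subtracted from the unclipped running sum rather than from 5200, so charge poured in "above the cap" can mask a real discharge. If true stop-adding semantics are wanted, a plain loop over the loads does it (names follow the question):

def saturating_cumsum(loads, start=5200.0, lower=0.0, upper=5200.0):
    # accumulate step by step, clamping the level after every load
    level = start
    out = []
    for x in loads:
        level = min(max(level + x, lower), upper)
        out.append(level)
    return out

# hypothetical usage:
# df['Battery capacity'] = saturating_cumsum(df['total_load'])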