I have two dataframes (simple examples shown below):
df1
time column
2022-01-01 00:00:00
2022-01-01 00:15:00
2022-01-01 00:30:00
2022-01-01 00:45:00
2022-01-02 00:00:00
2022-01-02 00:15:00
2022-01-02 00:30:00
2022-01-02 00:45:00

df2
time column          ID column  Value
2022-01-01 00:00:00  1          10
2022-01-01 00:30:00  1          9
2022-01-02 00:30:00  1          5
2022-01-02 00:45:00  1          15
2022-01-01 00:00:00  2          6
2022-01-01 00:15:00  2          2
2022-01-02 00:45:00  2          7
df1 lists every timestamp I am interested in. df2 contains data sorted by timestamp and ID. For each unique ID, I need to add every timestamp from df1 that is missing in df2 and set its Value to zero.
This is the outcome I'm interested in:
df3
time column ID column Value
2022-01-01 00:00:00 1 10
2022-01-01 00:15:00 1 0
2022-01-01 00:30:00 1 9
2022-01-01 00:45:00 1 0
2022-01-02 00:00:00 1 0
2022-01-02 00:15:00 1 0
2022-01-02 00:30:00 1 5
2022-01-02 00:45:00 1 15
2022-01-01 00:00:00 2 6
2022-01-01 00:15:00 2 2
2022-01-01 00:30:00 2 0
2022-01-01 00:45:00 2 0
2022-01-02 00:00:00 2 0
2022-01-02 00:15:00 2 0
2022-01-02 00:30:00 2 0
2022-01-02 00:45:00 2 7
My df2 is much larger (hundreds of thousands of rows and more than 500 unique IDs), so doing this manually isn't feasible. I've searched for hours for something that could help, but everything has fallen flat. This data will ultimately be fed into a NN.
I am open to other libraries and can work in python or R.
Any help is greatly appreciated.
Try:
x = (
    df2.groupby("ID column")
    # merge each ID's rows with df1 so missing timestamps appear, then fill their Value with 0
    .apply(lambda x: x.merge(df1, how="outer").fillna(0))
    .drop(columns="ID column")
    .droplevel(1)
    .reset_index()
    .sort_values(by=["ID column", "time column"])
)
print(x)
Prints:
ID column time column Value
0 1 2022-01-01 00:00:00 10.0
4 1 2022-01-01 00:15:00 0.0
1 1 2022-01-01 00:30:00 9.0
5 1 2022-01-01 00:45:00 0.0
6 1 2022-01-02 00:00:00 0.0
7 1 2022-01-02 00:15:00 0.0
2 1 2022-01-02 00:30:00 5.0
3 1 2022-01-02 00:45:00 15.0
8 2 2022-01-01 00:00:00 6.0
9 2 2022-01-01 00:15:00 2.0
11 2 2022-01-01 00:30:00 0.0
12 2 2022-01-01 00:45:00 0.0
13 2 2022-01-02 00:00:00 0.0
14 2 2022-01-02 00:15:00 0.0
15 2 2022-01-02 00:30:00 0.0
10 2 2022-01-02 00:45:00 7.0
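Given the size you describe (hundreds of thousands of rows and 500+ IDs), a reindex-based approach may also be worth trying, since it avoids the per-group Python apply. This is only a sketch, assuming the column names above and that each (ID, time) pair occurs at most once in df2:
import pandas as pd

# Full grid of every ID crossed with every timestamp from df1.
full_index = pd.MultiIndex.from_product(
    [df2["ID column"].unique(), df1["time column"]],
    names=["ID column", "time column"],
)

df3 = (
    df2.set_index(["ID column", "time column"])
    .reindex(full_index, fill_value=0)  # missing (ID, time) pairs get Value = 0
    .reset_index()
)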
I have the following data, and I'm resampling it to find out how many bikes arrive at each of the stations every 15 minutes. However, my code is aggregating over my stations as well, and I only want to aggregate over the variable "dtm_end_trip".
Sample data:
id_trip  dtm_start_trip       dtm_end_trip         start_station  end_station
1        2018-10-01 10:15:00  2018-10-01 10:17:00  A              B
2        2018-10-01 10:17:00  2018-10-01 10:18:00  B              A
...      ...                  ...                  ...            ...
999999   2021-12-31 23:58:00  2022-01-01 00:22:00  C              A
1000000  2021-12-31 23:59:00  2022-01-01 00:29:00  A              D
Trial code:
df2 = df.groupby(['end_station', 'dtm_end_trip']).size().to_frame(name='count').reset_index()
df2 = df2.sort_values(by='count', ascending=False)
df2 = df2.set_index('dtm_end_trip')
df2 = df2.resample('15T').count()
Output I get:
dtm_end_trip         end_station  count
2018-10-01 00:15:00  2            2
2018-10-01 00:30:00  0            0
2018-10-01 00:45:00  1            1
2018-10-01 01:00:00  2            2
2018-10-01 01:15:00  1            1
Desired output:
dtm_end_trip         end_station  count
2018-10-01 00:15:00  A            2
2018-10-01 00:15:00  B            0
2018-10-01 00:15:00  C            1
2018-10-01 00:15:00  D            2
2018-10-01 00:30:00  A            3
2018-10-01 00:30:00  B            2
The count values in the table above are random numbers, included only to illustrate the structure of the desired output.
You can use pd.Grouper like this:
out = df.groupby([
    pd.Grouper(freq='15min', key='dtm_end_trip'),  # bin the end times into 15-minute intervals
    'end_station',
]).size()
>>> out
dtm_end_trip end_station
2018-10-01 10:15:00 A 1
B 1
2022-01-01 00:15:00 A 1
D 1
dtype: int64
The result is a Series, but you can easily convert it to a DataFrame with the same headings as per your desired output:
>>> out.to_frame('count').reset_index()
dtm_end_trip end_station count
0 2018-10-01 10:15:00 A 1
1 2018-10-01 10:15:00 B 1
2 2022-01-01 00:15:00 A 1
3 2022-01-01 00:15:00 D 1
Note: this is the result from the four rows in your sample input data.
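If you also want explicit zero rows for stations with no arrivals in a given 15-minute slot (as in your desired output), one possible follow-up, assuming the same df and column names as above, is to unstack per station with a zero fill and stack back:
out_full = (
    df.groupby([pd.Grouper(freq='15min', key='dtm_end_trip'), 'end_station'])
    .size()
    .unstack(fill_value=0)  # one column per station, 0 where nothing ended there
    .stack()                # back to long format, keeping the explicit zeros
    .rename('count')
    .reset_index()
)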
I have a pandas column which I have initialized with ones; this column represents the health of a solar panel.
I need to decay this value linearly, except when the time comes for the panel to be replaced, at which point the value resets to 1 (hence the initialization to ones). What I am doing is looping through the column and updating the current value with the previous value minus a constant.
This operation is extremely expensive (and I have over 200,000 samples). I was hoping someone might be able to help me with a vectorized solution that avoids this for loop. Here is my code:
def set_degredation_factors_pv(df):
    for i in df.index:
        if i != replacement_duration_PV_year * hour_per_year and i != 0:
            df.loc[i, 'degradation_factor_PV_power_frac'] = (
                df.loc[i - 1, 'degradation_factor_PV_power_frac']
                - degradation_rate_PV_power_perc_per_hour / 100
            )
    return df
Variables:
replacement_duration_PV_year = 25
hour_per_year = 8760
degradation_rate_PV_power_perc_per_hour = 5.479e-5
Input data:
time_datetime degradation_factor_PV_power_frac
0 2022-01-01 00:00:00 1
1 2022-01-01 01:00:00 1
2 2022-01-01 02:00:00 1
3 2022-01-01 03:00:00 1
4 2022-01-01 04:00:00 1
... ... ...
8732 2022-12-30 20:00:00 1
8733 2022-12-30 21:00:00 1
8734 2022-12-30 22:00:00 1
8735 2022-12-30 23:00:00 1
8736 2022-12-31 00:00:00 1
Output data (only taking one year for time):
time_datetime degradation_factor_PV_power_frac
0 2022-01-01 00:00:00 1.000000
1 2022-01-01 01:00:00 0.999999
2 2022-01-01 02:00:00 0.999999
3 2022-01-01 03:00:00 0.999998
4 2022-01-01 04:00:00 0.999998
... ... ...
8732 2022-12-30 20:00:00 0.995216
8733 2022-12-30 21:00:00 0.995215
8734 2022-12-30 22:00:00 0.995215
8735 2022-12-30 23:00:00 0.995214
8736 2022-12-31 00:00:00 0.995214
Try:
rate = degradation_rate_PV_power_perc_per_hour / 100
mask = ~(
    (df.index != replacement_duration_PV_year * hour_per_year)
    & (df.index != 0)
)
df['degradation_factor_PV_power_frac'] = (
    df.groupby(mask.cumsum())['degradation_factor_PV_power_frac']
    .apply(lambda x: x.shift().sub(rate).cumprod())
    .fillna(df['degradation_factor_PV_power_frac'])
)
Output:
>>> df
time_datetime degradation_factor_PV_power_frac
0 2022-01-01 00:00:00 1.000000
1 2022-01-01 01:00:00 0.999999
2 2022-01-01 02:00:00 0.999999
3 2022-01-01 03:00:00 0.999998
4 2022-01-01 04:00:00 0.999998
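Since the original loop subtracts a fixed constant each step, the decay between resets is purely linear, so a closed-form expression can reproduce it without any cumulative product. A minimal sketch, assuming the same variable names as above and resets only at index 0 and at the replacement hour:
rate = degradation_rate_PV_power_perc_per_hour / 100

# True at each reset point (start of the series and the replacement hour).
reset = (df.index == 0) | (df.index == replacement_duration_PV_year * hour_per_year)

# Number of hours elapsed since the most recent reset, per reset group.
hours_since_reset = df.groupby(reset.cumsum()).cumcount()

df['degradation_factor_PV_power_frac'] = 1 - rate * hours_since_reset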
I'm working on a timeseries dataframe which looks like this and has data from January to August 2020.
Timestamp Value
2020-01-01 00:00:00 -68.95370
2020-01-01 00:05:00 -67.90175
2020-01-01 00:10:00 -67.45966
2020-01-01 00:15:00 -67.07624
2020-01-01 00:20:00 -67.30549
.....
2020-07-01 00:00:00 -65.34212
I'm trying to apply a filter on the previous dataframe using the columns start_time and end_time in the dataframe below:
start_time end_time
2020-01-12 16:15:00 2020-01-13 16:00:00
2020-01-26 16:00:00 2020-01-26 16:10:00
2020-04-12 16:00:00 2020-04-13 16:00:00
2020-04-20 16:00:00 2020-04-21 16:00:00
2020-05-02 16:00:00 2020-05-03 16:00:00
The output should set all values that do not fall within any start/end interval to zero and retain the values that do fall within the intervals specified in the filter. I tried applying two simultaneous filters for start and end time, but it didn't work.
Any help would be appreciated.
The idea is to create all the masks with Series.between in a list comprehension, join them with logical OR via np.logical_or.reduce, and finally pass the result to Series.where:
print (df1)
Timestamp Value
0 2020-01-13 00:00:00 -68.95370 <- changed data for match
1 2020-01-01 00:05:00 -67.90175
2 2020-01-01 00:10:00 -67.45966
3 2020-01-01 00:15:00 -67.07624
4 2020-01-01 00:20:00 -67.30549
5 2020-07-01 00:00:00 -65.34212
import numpy as np

# one boolean mask per (start_time, end_time) interval
L = [df1['Timestamp'].between(s, e) for s, e in df2[['start_time','end_time']].values]
m = np.logical_or.reduce(L)          # True where the timestamp falls in any interval
df1['Value'] = df1['Value'].where(m, 0)
print (df1)
Timestamp Value
0 2020-01-13 00:00:00 -68.9537
1 2020-01-01 00:05:00 0.0000
2 2020-01-01 00:10:00 0.0000
3 2020-01-01 00:15:00 0.0000
4 2020-01-01 00:20:00 0.0000
5 2020-07-01 00:00:00 0.0000
A solution using merge with an outer join (to build a cross product) and query:
print(df1)
timestamp Value <- changed Timestamp to timestamp to avoid name conflict in query
0 2020-01-13 00:00:00 -68.95370 <- also changed data for match
1 2020-01-01 00:05:00 -67.90175
2 2020-01-01 00:10:00 -67.45966
3 2020-01-01 00:15:00 -67.07624
4 2020-01-01 00:20:00 -67.30549
5 2020-07-01 00:00:00 -65.34212
df1.loc[df1.index.difference(
    df1.assign(key=0).merge(df2.assign(key=0), how='outer')
    .query("timestamp >= start_time and timestamp < end_time").index
), "Value"] = 0
result:
timestamp Value
0 2020-01-13 00:00:00 -68.9537
1 2020-01-01 00:05:00 0.0000
2 2020-01-01 00:10:00 0.0000
3 2020-01-01 00:15:00 0.0000
4 2020-01-01 00:20:00 0.0000
5 2020-07-01 00:00:00 0.0000
The dummy key added with assign(key=0) to both dataframes makes the merge produce a Cartesian product.
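Another possible approach, sketched here under the assumption that the intervals in df2 do not overlap (IntervalIndex.get_indexer raises if they do), avoids building the len(df1) × len(df2) cross product entirely; it uses the lower-case timestamp column name from the snippet above and treats both endpoints as inclusive:
import pandas as pd

intervals = pd.IntervalIndex.from_arrays(
    df2['start_time'], df2['end_time'], closed='both'
)

# Position of the interval containing each timestamp, or -1 if none does.
pos = intervals.get_indexer(df1['timestamp'])

df1['Value'] = df1['Value'].where(pos != -1, 0)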
I'm trying to resample a dataframe with a time series from 1-hour increments to 15-minute increments. Both .resample() and .asfreq() do almost exactly what I want, but I'm having a hard time filling the last three intervals.
I could add an extra hour at the end, resample, and then drop that last hour, but it feels hacky.
Current code:
df = pd.DataFrame({'date':pd.date_range('2018-01-01 00:00', '2018-01-01 01:00', freq = '1H'), 'num':5})
df = df.set_index('date').asfreq('15T', method = 'ffill', how = 'end').reset_index()
Current output:
date num
0 2018-01-01 00:00:00 5
1 2018-01-01 00:15:00 5
2 2018-01-01 00:30:00 5
3 2018-01-01 00:45:00 5
4 2018-01-01 01:00:00 5
Desired output:
date num
0 2018-01-01 00:00:00 5
1 2018-01-01 00:15:00 5
2 2018-01-01 00:30:00 5
3 2018-01-01 00:45:00 5
4 2018-01-01 01:00:00 5
5 2018-01-01 01:15:00 5
6 2018-01-01 01:30:00 5
7 2018-01-01 01:45:00 5
Thoughts?
Not sure about asfreq but reindex works wonderfully:
df.set_index('date').reindex(
    pd.date_range(
        df.date.min(),
        df.date.max() + pd.Timedelta('1H'),
        freq='15T',
        closed='left',
    ),
    method='ffill',
)
num
2018-01-01 00:00:00 5
2018-01-01 00:15:00 5
2018-01-01 00:30:00 5
2018-01-01 00:45:00 5
2018-01-01 01:00:00 5
2018-01-01 01:15:00 5
2018-01-01 01:30:00 5
2018-01-01 01:45:00 5
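Note that newer pandas versions (1.4+) deprecate the closed= argument of pd.date_range in favour of inclusive=, so there the date_range call would become:
pd.date_range(
    df.date.min(),
    df.date.max() + pd.Timedelta('1H'),
    freq='15T',
    inclusive='left',  # replaces closed='left' on pandas >= 1.4
)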
I have the time series dataframe below.
I want to delete rows based on a per-day condition: if aaa > 100 at any time during a day, delete all rows for that day (in the example below, all 2015-12-01 rows should be deleted because the last three aaa values are 1000).
....
date time aaa
2015-12-01,00:00:00,0
2015-12-01,00:15:00,0
2015-12-01,00:30:00,0
2015-12-01,00:45:00,0
2015-12-01,01:00:00,0
2015-12-01,01:15:00,0
2015-12-01,01:30:00,0
2015-12-01,01:45:00,0
2015-12-01,02:00:00,0
2015-12-01,02:15:00,0
2015-12-01,02:30:00,0
2015-12-01,02:45:00,0
2015-12-01,03:00:00,0
2015-12-01,03:15:00,0
2015-12-01,03:30:00,0
2015-12-01,03:45:00,0
2015-12-01,04:00:00,0
2015-12-01,04:15:00,0
2015-12-01,04:30:00,0
2015-12-01,04:45:00,0
2015-12-01,05:00:00,0
2015-12-01,05:15:00,0
2015-12-01,05:30:00,0
2015-12-01,05:45:00,0
2015-12-01,06:00:00,0
2015-12-01,06:15:00,0
2015-12-01,06:30:00,1000
2015-12-01,06:45:00,1000
2015-12-01,07:00:00,1000
....
How can I do it?
If you have a MultiIndex, first compare the aaa values against the condition, take the matching values of the first level (the dates), and then filter them out by boolean indexing with isin and the inverted condition ~:
print (df)
aaa
date time
2015-12-01 00:00:00 0
00:15:00 0
00:30:00 0
00:45:00 0
2015-12-02 05:00:00 0
05:15:00 200
05:30:00 0
05:45:00 0
2015-12-03 06:00:00 0
06:15:00 0
06:30:00 1000
06:45:00 1000
07:00:00 1000
lvl0 = df.index.get_level_values(0)
idx = lvl0[df['aaa'].gt(100)].unique()
print (idx)
Index(['2015-12-02', '2015-12-03'], dtype='object', name='date')
df = df[~lvl0.isin(idx)]
print (df)
aaa
date time
2015-12-01 00:00:00 0
00:15:00 0
00:30:00 0
00:45:00 0
And if the date is a regular column rather than part of the index, compare the date column directly:
print (df)
date time aaa
0 2015-12-01 00:00:00 0
1 2015-12-01 00:15:00 0
2 2015-12-01 00:30:00 0
3 2015-12-01 00:45:00 0
4 2015-12-02 05:00:00 0
5 2015-12-02 05:15:00 200
6 2015-12-02 05:30:00 0
7 2015-12-02 05:45:00 0
8 2015-12-03 06:00:00 0
9 2015-12-03 06:15:00 0
10 2015-12-03 06:30:00 1000
11 2015-12-03 06:45:00 1000
12 2015-12-03 07:00:00 1000
idx = df.loc[df['aaa'].gt(100), 'date'].unique()
print (idx)
['2015-12-02' '2015-12-03']
df = df[~df['date'].isin(idx)]
print (df)
date time aaa
0 2015-12-01 00:00:00 0
1 2015-12-01 00:15:00 0
2 2015-12-01 00:30:00 0
3 2015-12-01 00:45:00 0
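A shorter possible alternative for the flat layout, sketched with the same column names, lets groupby/transform flag the offending days in one pass:
# keep only days whose maximum aaa never exceeds 100
bad_day = df.groupby('date')['aaa'].transform('max').gt(100)
df = df[~bad_day]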