How can I delete all rows for a whole day based on a condition on column values? (pandas, Python)

I have the time series DataFrame below.
I want to delete rows based on a per-day condition: if any value of aaa on a given day is greater than 100, delete all rows for that day. (In the data below, all 2015-12-01 rows should be deleted because the last 3 values of column aaa are 1000.)
....
date time aaa
2015-12-01,00:00:00,0
2015-12-01,00:15:00,0
2015-12-01,00:30:00,0
2015-12-01,00:45:00,0
2015-12-01,01:00:00,0
2015-12-01,01:15:00,0
2015-12-01,01:30:00,0
2015-12-01,01:45:00,0
2015-12-01,02:00:00,0
2015-12-01,02:15:00,0
2015-12-01,02:30:00,0
2015-12-01,02:45:00,0
2015-12-01,03:00:00,0
2015-12-01,03:15:00,0
2015-12-01,03:30:00,0
2015-12-01,03:45:00,0
2015-12-01,04:00:00,0
2015-12-01,04:15:00,0
2015-12-01,04:30:00,0
2015-12-01,04:45:00,0
2015-12-01,05:00:00,0
2015-12-01,05:15:00,0
2015-12-01,05:30:00,0
2015-12-01,05:45:00,0
2015-12-01,06:00:00,0
2015-12-01,06:15:00,0
2015-12-01,06:30:00,1000
2015-12-01,06:45:00,1000
2015-12-01,07:00:00,1000
....
How can I do it?

I think that, if you have a MultiIndex, you first need to compare the values of aaa against the condition, then use boolean indexing to select the matching values of the first index level, and finally filter with isin, inverting the condition with ~:
print (df)
                      aaa
date       time
2015-12-01 00:00:00     0
           00:15:00     0
           00:30:00     0
           00:45:00     0
2015-12-02 05:00:00     0
           05:15:00   200
           05:30:00     0
           05:45:00     0
2015-12-03 06:00:00     0
           06:15:00     0
           06:30:00  1000
           06:45:00  1000
           07:00:00  1000
lvl0 = df.index.get_level_values(0)
idx = lvl0[df['aaa'].gt(100)].unique()
print (idx)
Index(['2015-12-02', '2015-12-03'], dtype='object', name='date')
df = df[~lvl0.isin(idx)]
print (df)
                     aaa
date       time
2015-12-01 00:00:00    0
           00:15:00    0
           00:30:00    0
           00:45:00    0
If the first column is not in the index, compare the date column instead:
print (df)
date time aaa
0 2015-12-01 00:00:00 0
1 2015-12-01 00:15:00 0
2 2015-12-01 00:30:00 0
3 2015-12-01 00:45:00 0
4 2015-12-02 05:00:00 0
5 2015-12-02 05:15:00 200
6 2015-12-02 05:30:00 0
7 2015-12-02 05:45:00 0
8 2015-12-03 06:00:00 0
9 2015-12-03 06:15:00 0
10 2015-12-03 06:30:00 1000
11 2015-12-03 06:45:00 1000
12 2015-12-03 07:00:00 1000
idx = df.loc[df['aaa'].gt(100), 'date'].unique()
print (idx)
['2015-12-02' '2015-12-03']
df = df[~df['date'].isin(idx)]
print (df)
date time aaa
0 2015-12-01 00:00:00 0
1 2015-12-01 00:15:00 0
2 2015-12-01 00:30:00 0
3 2015-12-01 00:45:00 0
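As an alternative to collecting the offending dates first, the same filter can be sketched with a grouped transform. This is only a sketch: it assumes date is a regular column as in the second example above, and the names has_large and filtered are made up for illustration.

```python
import pandas as pd

# Rebuild the sample frame from the example above.
df = pd.DataFrame({
    "date": ["2015-12-01"] * 4 + ["2015-12-02"] * 4 + ["2015-12-03"] * 5,
    "time": ["00:00:00", "00:15:00", "00:30:00", "00:45:00",
             "05:00:00", "05:15:00", "05:30:00", "05:45:00",
             "06:00:00", "06:15:00", "06:30:00", "06:45:00", "07:00:00"],
    "aaa": [0, 0, 0, 0, 0, 200, 0, 0, 0, 0, 1000, 1000, 1000],
})

# For every row: does any row of the same date have aaa > 100?
has_large = df["aaa"].gt(100).groupby(df["date"]).transform("any")

# Keep only the days where no value exceeded the threshold.
filtered = df[~has_large]
print(filtered)
```

The transform broadcasts the per-day result back to every row of that day, so the mask lines up with df and can be inverted directly.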

Related

Add missing timestamps for each different ID in dataframe

I have two dataframes (simple examples shown below):
df1
time column
2022-01-01 00:00:00
2022-01-01 00:15:00
2022-01-01 00:30:00
2022-01-01 00:45:00
2022-01-02 00:00:00
2022-01-02 00:15:00
2022-01-02 00:30:00
2022-01-02 00:45:00
df2
time column          ID column  Value
2022-01-01 00:00:00  1          10
2022-01-01 00:30:00  1          9
2022-01-02 00:30:00  1          5
2022-01-02 00:45:00  1          15
2022-01-01 00:00:00  2          6
2022-01-01 00:15:00  2          2
2022-01-02 00:45:00  2          7
df1 shows every timestamp I am interested in. df2 shows data sorted by timestamp and ID. What I need to do is add every single timestamp from df1 that is not in df2 for each unique ID and add zero to the value column.
This is the outcome I'm interested in
df3
time column ID column Value
2022-01-01 00:00:00 1 10
2022-01-01 00:15:00 1 0
2022-01-01 00:30:00 1 9
2022-01-01 00:45:00 1 0
2022-01-02 00:00:00 1 0
2022-01-02 00:15:00 1 0
2022-01-02 00:30:00 1 5
2022-01-02 00:45:00 1 15
2022-01-01 00:00:00 2 6
2022-01-01 00:15:00 2 2
2022-01-01 00:30:00 2 0
2022-01-01 00:45:00 2 0
2022-01-02 00:00:00 2 0
2022-01-02 00:15:00 2 0
2022-01-02 00:30:00 2 0
2022-01-02 00:45:00 2 7
My df2 is much larger (hundreds of thousands of rows, and more than 500 unique IDs), so doing this manually isn't feasible. I've searched for hours for something that could help, but everything has fallen flat. This data will ultimately be fed into a NN.
I am open to other libraries and can work in Python or R.
Any help is greatly appreciated.
Try:
x = (
df2.groupby("ID column")
.apply(lambda x: x.merge(df1, how="outer").fillna(0))
.drop(columns="ID column")
.droplevel(1)
.reset_index()
.sort_values(by=["ID column", "time column"])
)
print(x)
Prints:
ID column time column Value
0 1 2022-01-01 00:00:00 10.0
4 1 2022-01-01 00:15:00 0.0
1 1 2022-01-01 00:30:00 9.0
5 1 2022-01-01 00:45:00 0.0
6 1 2022-01-02 00:00:00 0.0
7 1 2022-01-02 00:15:00 0.0
2 1 2022-01-02 00:30:00 5.0
3 1 2022-01-02 00:45:00 15.0
8 2 2022-01-01 00:00:00 6.0
9 2 2022-01-01 00:15:00 2.0
11 2 2022-01-01 00:30:00 0.0
12 2 2022-01-01 00:45:00 0.0
13 2 2022-01-02 00:00:00 0.0
14 2 2022-01-02 00:15:00 0.0
15 2 2022-01-02 00:30:00 0.0
10 2 2022-01-02 00:45:00 7.0
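Another way to get the same result, sketched here under the assumption that each (ID, timestamp) pair occurs at most once in df2, is to build the full ID by timestamp grid and reindex onto it. The names full_index and df3 are illustrative only.

```python
import pandas as pd

df1 = pd.DataFrame({"time column": pd.to_datetime([
    "2022-01-01 00:00:00", "2022-01-01 00:15:00",
    "2022-01-01 00:30:00", "2022-01-01 00:45:00",
    "2022-01-02 00:00:00", "2022-01-02 00:15:00",
    "2022-01-02 00:30:00", "2022-01-02 00:45:00",
])})
df2 = pd.DataFrame({
    "time column": pd.to_datetime([
        "2022-01-01 00:00:00", "2022-01-01 00:30:00",
        "2022-01-02 00:30:00", "2022-01-02 00:45:00",
        "2022-01-01 00:00:00", "2022-01-01 00:15:00",
        "2022-01-02 00:45:00",
    ]),
    "ID column": [1, 1, 1, 1, 2, 2, 2],
    "Value": [10, 9, 5, 15, 6, 2, 7],
})

# Every combination of ID and timestamp of interest.
full_index = pd.MultiIndex.from_product(
    [df2["ID column"].unique(), df1["time column"]],
    names=["ID column", "time column"],
)

# Reindex onto the full grid; missing combinations get Value 0.
# reindex raises on duplicate (ID, timestamp) pairs, hence the assumption above.
df3 = (
    df2.set_index(["ID column", "time column"])
    .reindex(full_index, fill_value=0)
    .reset_index()
)
print(df3)
```

Because fill_value is used instead of fillna, the Value column should stay integer rather than being upcast to float.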

Python Pandas: resample based on just one of the columns

I have the following data and I'm resampling my data to find out how many bikes arrive at each of the stations every 15 minutes. However, my code is aggregating my stations too, and I only want to aggregate the variable "dtm_end_trip"
Sample data:
id_trip  dtm_start_trip       dtm_end_trip         start_station  end_station
1        2018-10-01 10:15:00  2018-10-01 10:17:00  A              B
2        2018-10-01 10:17:00  2018-10-01 10:18:00  B              A
...      ...                  ...                  ...            ...
999999   2021-12-31 23:58:00  2022-01-01 00:22:00  C              A
1000000  2021-12-31 23:59:00  2022-01-01 00:29:00  A              D
Trial code:
df2 = df.groupby(['end_station', 'dtm_end_trip']).size().to_frame(name='count').reset_index()
df2 = df2.sort_values(by='count', ascending=False)
df2= df2.set_index('dtm_end_trip')
df2 = df2.resample('15T').count()
Output I get:
dtm_end_trip         end_station  count
2018-10-01 00:15:00  2            2
2018-10-01 00:30:00  0            0
2018-10-01 00:45:00  1            1
2018-10-01 01:00:00  2            2
2018-10-01 01:15:00  1            1
Desired output:
dtm_end_trip         end_station  count
2018-10-01 00:15:00  A            2
2018-10-01 00:15:00  B            0
2018-10-01 00:15:00  C            1
2018-10-01 00:15:00  D            2
2018-10-01 00:30:00  A            3
2018-10-01 00:30:00  B            2
The count column of the table above was, in this case, constructed with random numbers with the sole purpose of exemplifying the architecture of the desired output.
You can use pd.Grouper like this:
out = df.groupby([
pd.Grouper(freq='15min', key='dtm_end_trip'),
'end_station',
]).size()
>>> out
dtm_end_trip end_station
2018-10-01 10:15:00 A 1
B 1
2022-01-01 00:15:00 A 1
D 1
dtype: int64
The result is a Series, but you can easily convert it to a DataFrame with the same headings as per your desired output:
>>> out.to_frame('count').reset_index()
dtm_end_trip end_station count
0 2018-10-01 10:15:00 A 1
1 2018-10-01 10:15:00 B 1
2 2022-01-01 00:15:00 A 1
3 2022-01-01 00:15:00 D 1
Note: this is the result from the four rows in your sample input data.
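The desired output in the question also lists stations with a count of 0, which size() does not produce because it only reports combinations that occur. One possible way to add them, sketched on the four sample rows, is to reindex onto every observed bin and station combination. The names bins, stations, full_index and counts are illustrative.

```python
import pandas as pd

# The four sample rows from the question (only the columns needed here).
df = pd.DataFrame({
    "dtm_end_trip": pd.to_datetime([
        "2018-10-01 10:17:00", "2018-10-01 10:18:00",
        "2022-01-01 00:22:00", "2022-01-01 00:29:00",
    ]),
    "end_station": ["B", "A", "A", "D"],
})

out = df.groupby([
    pd.Grouper(freq="15min", key="dtm_end_trip"),
    "end_station",
]).size()

# All combinations of the observed 15-minute bins and observed stations.
bins = out.index.get_level_values("dtm_end_trip").unique()
stations = out.index.get_level_values("end_station").unique()
full_index = pd.MultiIndex.from_product(
    [bins, stations], names=["dtm_end_trip", "end_station"]
)

# Combinations without any trip get a count of 0.
counts = out.reindex(full_index, fill_value=0).to_frame("count").reset_index()
print(counts)
```

This only fills in stations for bins that have at least one trip. Bins with no trips at all would need an explicit pd.date_range for the time level instead of the observed bins.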

Applying start and endtime as filters to dataframe

I'm working on a timeseries dataframe which looks like this and has data from January to August 2020.
Timestamp Value
2020-01-01 00:00:00 -68.95370
2020-01-01 00:05:00 -67.90175
2020-01-01 00:10:00 -67.45966
2020-01-01 00:15:00 -67.07624
2020-01-01 00:20:00 -67.30549
.....
2020-07-01 00:00:00 -65.34212
I'm trying to apply a filter on the previous dataframe using the columns start_time and end_time in the dataframe below:
start_time end_time
2020-01-12 16:15:00 2020-01-13 16:00:00
2020-01-26 16:00:00 2020-01-26 16:10:00
2020-04-12 16:00:00 2020-04-13 16:00:00
2020-04-20 16:00:00 2020-04-21 16:00:00
2020-05-02 16:00:00 2020-05-03 16:00:00
The output should assign all values which are not within the start and end time as zero and retain values for the start and end times specified in the filter. I tried applying two simultaneous filters for start and end time but didn't work.
Any help would be appreciated.
The idea is to create all masks with Series.between in a list comprehension, combine them with np.logical_or.reduce, and finally pass the combined mask to Series.where:
print (df1)
Timestamp Value
0 2020-01-13 00:00:00 -68.95370 <- changed data for match
1 2020-01-01 00:05:00 -67.90175
2 2020-01-01 00:10:00 -67.45966
3 2020-01-01 00:15:00 -67.07624
4 2020-01-01 00:20:00 -67.30549
5 2020-07-01 00:00:00 -65.34212
L = [df1['Timestamp'].between(s, e) for s, e in df2[['start_time','end_time']].values]
m = np.logical_or.reduce(L)
df1['Value'] = df1['Value'].where(m, 0)
print (df1)
Timestamp Value
0 2020-01-13 00:00:00 -68.9537
1 2020-01-01 00:05:00 0.0000
2 2020-01-01 00:10:00 0.0000
3 2020-01-01 00:15:00 0.0000
4 2020-01-01 00:20:00 0.0000
5 2020-07-01 00:00:00 0.0000
A solution using an outer merge on a constant key (which yields a cartesian product) and query:
print(df1)
timestamp Value <- changed Timestamp to timestamp to avoid name conflict in query
0 2020-01-13 00:00:00 -68.95370 <- also changed data for match
1 2020-01-01 00:05:00 -67.90175
2 2020-01-01 00:10:00 -67.45966
3 2020-01-01 00:15:00 -67.07624
4 2020-01-01 00:20:00 -67.30549
5 2020-07-01 00:00:00 -65.34212
df1.loc[df1.index.difference(df1.assign(key=0).merge(df2.assign(key=0), how = 'outer')\
.query("timestamp >= start_time and timestamp < end_time").index),"Value"] = 0
result:
timestamp Value
0 2020-01-13 00:00:00 -68.9537
1 2020-01-01 00:05:00 0.0000
2 2020-01-01 00:10:00 0.0000
3 2020-01-01 00:15:00 0.0000
4 2020-01-01 00:20:00 0.0000
5 2020-07-01 00:00:00 0.0000
The column key, added with assign(key=0) to both dataframes, makes the outer merge produce a cartesian product.
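One caveat with the merge approach: merge returns a frame with a fresh RangeIndex, so the index of the merged result does not generally correspond to df1's row labels. A variant that carries the original labels along is sketched below. It assumes pandas 1.2 or later for how="cross", uses an inclusive end bound like Series.between (the query above uses an exclusive one), and the names pairs and inside are illustrative. The matching row is deliberately placed third to show that the right label is kept.

```python
import pandas as pd

df1 = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2020-01-01 00:05:00", "2020-01-01 00:10:00",
        "2020-01-13 00:00:00", "2020-07-01 00:00:00",
    ]),
    "Value": [-67.90175, -67.45966, -68.95370, -65.34212],
})
df2 = pd.DataFrame({
    "start_time": pd.to_datetime(["2020-01-12 16:15:00", "2020-01-26 16:00:00"]),
    "end_time": pd.to_datetime(["2020-01-13 16:00:00", "2020-01-26 16:10:00"]),
})

# reset_index() keeps df1's row labels in a column named "index",
# so they survive the cross join.
pairs = df1.reset_index().merge(df2, how="cross")

# Labels of df1 rows that fall inside at least one window.
inside = pairs.query(
    "timestamp >= start_time and timestamp <= end_time"
)["index"].unique()

# Zero out every row that is not inside any window.
df1.loc[~df1.index.isin(inside), "Value"] = 0
print(df1)
```

The cross join builds len(df1) * len(df2) rows, so for large inputs the mask-based approach shown earlier is likely lighter on memory.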

After groupby, evaluate value in column against column values in all rows in the group

I am looking for the following functionality in python:
I have a Pandas DataFrame with 4 columns: ID, StartDate, EndDate, Moment.
I want to group by ID and, for each row in a group, evaluate whether its Moment value falls within the interval between StartDate and EndDate of any row in that group. For example, the following DataFrame has two groups (ID=1 and ID=2), each with 5 rows. For each row, I want a boolean stating whether the moment value in that row falls in ANY of the time windows [date1, date2] in its group.
import pandas as pd
i = pd.date_range('2018-04-11', periods=10, freq='2D20min')
i2 = pd.date_range('2018-04-12', periods=10, freq='2D20min')
i3 = pd.date_range('2018-04-9', periods=10, freq='1D6H')
id = ['1', '1', '1', '1', '1', '2', '2', '2', '2', '2']
ts = pd.DataFrame({'date1': i, 'date2': i2, 'moment': i3}, index=id)
ID date1 date2 moment
1 2018-04-11 00:00:00 2018-04-12 00:00:00 2018-04-09 00:00:00
1 2018-04-13 00:20:00 2018-04-14 00:20:00 2018-04-10 06:00:00
1 2018-04-15 00:40:00 2018-04-16 00:40:00 2018-04-11 12:00:00
1 2018-04-17 01:00:00 2018-04-18 01:00:00 2018-04-12 18:00:00
1 2018-04-19 01:20:00 2018-04-20 01:20:00 2018-04-14 00:00:00
2 2018-04-21 01:40:00 2018-04-22 01:40:00 2018-04-15 06:00:00
2 2018-04-23 02:00:00 2018-04-24 02:00:00 2018-04-16 12:00:00
2 2018-04-25 02:20:00 2018-04-26 02:20:00 2018-04-17 18:00:00
2 2018-04-27 02:40:00 2018-04-28 02:40:00 2018-04-19 00:00:00
2 2018-04-29 03:00:00 2018-04-30 03:00:00 2018-04-20 06:00:00
In this case, the value for moment in the first row of the first group does not fall in any of the five time intervals. Neither does the second. The third value, 2018-04-11 12:00:00 does fall in the interval in the first row and I would thus want to have True returned.
The desired result would look as follows:
ID date1 date2 moment result
1 2018-04-11 00:00:00 2018-04-12 00:00:00 2018-04-09 00:00:00 False
1 2018-04-13 00:20:00 2018-04-14 00:20:00 2018-04-10 06:00:00 False
1 2018-04-15 00:40:00 2018-04-16 00:40:00 2018-04-11 12:00:00 True
1 2018-04-17 01:00:00 2018-04-18 01:00:00 2018-04-12 18:00:00 False
1 2018-04-19 01:20:00 2018-04-20 01:20:00 2018-04-14 00:00:00 True
2 2018-04-21 01:40:00 2018-04-22 01:40:00 2018-04-15 06:00:00 False
2 2018-04-23 02:00:00 2018-04-24 02:00:00 2018-04-16 12:00:00 False
2 2018-04-25 02:20:00 2018-04-26 02:20:00 2018-04-17 18:00:00 False
2 2018-04-27 02:40:00 2018-04-28 02:40:00 2018-04-19 00:00:00 False
2 2018-04-29 03:00:00 2018-04-30 03:00:00 2018-04-20 06:00:00 False
EDIT
I already 'solved' this problem with the following approach but am looking for a more pythonic and perhaps faster way...
boolean_result = []
for c in ts.index.unique():
    temp = ts.loc[ts.index == c]
    # iterate over the values: the index labels repeat within a group,
    # so a label lookup would return several rows instead of one
    for current_date in temp['moment']:
        boolean_result.append(max((temp['date1'] <= current_date)
                                  & (current_date <= temp['date2'])))
ts['Result'] = boolean_result
This may be very slow if your dataframe is large, and there may well be a better solution than this one:
def time_in_range(start, end, x):
    """Return true if x is in the range [start, end]"""
    if start <= x and x <= end:
        return True
    else:
        return False
# empty list to be appended
result = []
test_list = []
for i in ts.index.unique():
    temp_df = ts[ts.index == i]
    for j in range(0, len(temp_df)):
        for k in range(0, len(temp_df)):
            test_list.append(time_in_range(temp_df.date1.iloc[k], temp_df.date2.iloc[k], temp_df.moment.iloc[j]))
        result.append(any(test_list))
        # reset the list
        test_list = []
ts['result'] = result
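A possibly faster variant replaces the two inner loops with a NumPy broadcast per group. This is a sketch rather than a benchmarked solution: the helper name in_any_window is made up, and the frame is rebuilt with Timedelta frequencies that should be equivalent to the frequency strings in the question.

```python
import numpy as np
import pandas as pd

# Same sample frame as in the question.
i = pd.date_range("2018-04-11", periods=10, freq=pd.Timedelta(days=2, minutes=20))
i2 = pd.date_range("2018-04-12", periods=10, freq=pd.Timedelta(days=2, minutes=20))
i3 = pd.date_range("2018-04-09", periods=10, freq=pd.Timedelta(days=1, hours=6))
ids = ["1"] * 5 + ["2"] * 5
ts = pd.DataFrame({"date1": i, "date2": i2, "moment": i3}, index=ids)


def in_any_window(starts, ends, moments):
    """For each moment, check whether it lies in any [start, end] window."""
    # Shape (n_moments, n_windows): one comparison per moment/window pair.
    hits = (starts[None, :] <= moments[:, None]) & (moments[:, None] <= ends[None, :])
    return hits.any(axis=1)


result = np.zeros(len(ts), dtype=bool)
# .indices maps each group label to the integer positions of its rows,
# which sidesteps the duplicate index labels.
for positions in ts.groupby(level=0).indices.values():
    group = ts.iloc[positions]
    result[positions] = in_any_window(
        group["date1"].to_numpy(),
        group["date2"].to_numpy(),
        group["moment"].to_numpy(),
    )
ts["result"] = result
print(ts)
```

The broadcast builds an n by n boolean array per group, so memory grows quadratically with group size. For very large groups an interval-based approach may be preferable.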

pandas dataframe new column which checks previous day

I have a DataFrame which has a datetime index and a column named "Holiday", which is a flag with the value 1 or 0.
If the datetime index falls on a holiday, the Holiday column contains 1, and otherwise 0.
I need a new column that says whether a given timestamp falls on the first day after a holiday. The new column should just check whether the previous day has the "Holiday" flag set to 1 and, if so, set its own flag to 1, otherwise 0.
EDIT
Doing:
df['DayAfter'] = df.Holiday.shift(1).fillna(0)
Has the Output:
Holiday DayAfter AnyNumber
Datum
...
2014-01-01 20:00:00 1 1.0 9
2014-01-01 20:30:00 1 1.0 2
2014-01-01 21:00:00 1 1.0 3
2014-01-01 21:30:00 1 1.0 3
2014-01-01 22:00:00 1 1.0 6
2014-01-01 22:30:00 1 1.0 1
2014-01-01 23:00:00 1 1.0 1
2014-01-01 23:30:00 1 1.0 1
2014-01-02 00:00:00 0 1.0 1
2014-01-02 00:30:00 0 0.0 2
2014-01-02 01:00:00 0 0.0 1
2014-01-02 01:30:00 0 0.0 1
...
If you check the first timestamp for 2014-01-02, the DayAfter flag is set correctly. But the other flags for that day are 0, which is wrong.
Create an array of the unique days that are holidays, offset by one day:
days = pd.Series(df[df.Holiday == 1].index).add(pd.DateOffset(1)).dt.date.unique()
Create a new column that flags the rows whose date is one of those offset days (days):
df['DayAfter'] = np.where(pd.Series(df.index).dt.date.isin(days),1,0)
Holiday AnyNumber DayAfter
Datum
2014-01-01 20:00:00 1 9 0
2014-01-01 20:30:00 1 2 0
2014-01-01 21:00:00 1 3 0
2014-01-01 21:30:00 1 3 0
2014-01-01 22:00:00 1 6 0
2014-01-01 22:30:00 1 1 0
2014-01-01 23:00:00 1 1 0
2014-01-01 23:30:00 1 1 0
2014-01-02 00:00:00 0 1 1
2014-01-02 00:30:00 0 2 1
2014-01-02 01:00:00 0 1 1
2014-01-02 01:30:00 0 1 1
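The same idea can also be sketched without converting to Python date objects, by normalizing the timestamps to midnight. The sample frame below is a shortened, made-up version of the data above, and the names holiday_days and day_after are illustrative.

```python
import pandas as pd

# Small stand-in for the frame above: four holiday rows, then two rows on the next day.
idx = pd.date_range("2014-01-01 22:00", periods=6, freq="30min")
df = pd.DataFrame({"Holiday": [1, 1, 1, 1, 0, 0]}, index=idx)

# Unique holiday dates, as midnight timestamps.
holiday_days = df.index[df["Holiday"].eq(1).to_numpy()].normalize().unique()

# The calendar day following each holiday.
day_after = holiday_days + pd.Timedelta(days=1)

# Flag every row whose date is the day after a holiday.
df["DayAfter"] = df.index.normalize().isin(day_after).astype(int)
print(df)
```

As in the answer above, a row on the second day of a two-day holiday would also be flagged, since it is the day after a holiday. Whether that is desired depends on the use case.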
