Fill pandas column using values from list - python

This is my list:
my_list = [
    '2002-01-11 22:15:00',
    '2002-02-12 10:30:00',
    '2002-03-14 02:30:00',
    '2002-04-12 22:15:00'
]
I have DataFrame:
dt_object diff
0 2002-01-01 00:00:00 -160.95041
1 2002-01-01 00:15:00 -160.81016
2 2002-01-01 00:30:00 -160.66989
3 2002-01-01 00:45:00 -160.52961
4 2002-01-01 01:00:00 -160.38930
I want to create a new column 'hit' that is False by default and True where the date matches one of the dates in the list.
Expected output:
dt_object diff hit
0 2002-01-01 00:00:00 -160.95041 False
1 2002-01-01 00:15:00 -160.81016 False
2 2002-01-01 00:30:00 -160.66989 False
3 2002-01-01 00:45:00 -160.52961 False
4 2002-01-01 01:00:00 -160.38930 False
....................
....................
1010 2002-01-11 22:15:00 -150.54678 True
because 2002-01-11 22:15:00 is in list.

You can do:
import numpy as np
df['hit'] = np.where(df['dt_object'].isin(my_list), 1, 0)  # gives 1 or 0 depending on whether the condition is satisfied
To get True or False instead, drop the np.where and use isin directly:
df['hit'] = df['dt_object'].isin(my_list)

Use Series.isin
df['hit'] = df['dt_object'].isin(my_list)
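One caveat worth adding (not part of the answers above): if the hit column comes back all False, the usual cause is a dtype mismatch, e.g. dt_object holding strings while my_list holds timestamps or vice versa. A minimal sketch, assuming both sides may still be strings:
import pandas as pd

# Convert both sides to real timestamps so the comparison cannot fail
# silently because of a string/datetime dtype mismatch.
my_list = pd.to_datetime([
    '2002-01-11 22:15:00',
    '2002-02-12 10:30:00',
    '2002-03-14 02:30:00',
    '2002-04-12 22:15:00',
])
df['dt_object'] = pd.to_datetime(df['dt_object'])
df['hit'] = df['dt_object'].isin(my_list)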

Related

Why are there different results for pandas groupby+resample on an appended dataframe

I want to group by and resample a dataframe I have. I group by int_var and bool_var, and then I resample per 1Min to fill in any missing minutes in the dataset. This works perfectly fine for the base dataframe A:
date bool_var int_var
2021-01-01 00:03:00 True 1
2021-01-01 00:06:00 False 6
2021-01-01 00:06:00 True 6
The result then becomes something like this:
int_var bool_var date
1 True 2021-01-01 00:03:00 1
2021-01-01 00:04:00 0
2021-01-01 00:05:00 0
2021-01-01 00:06:00 0
6 True 2021-01-01 00:03:00 0
2021-01-01 00:04:00 0
2021-01-01 00:05:00 0
2021-01-01 00:06:00 1
6 False 2021-01-01 00:03:00 0
2021-01-01 00:04:00 0
2021-01-01 00:05:00 0
2021-01-01 00:06:00 1
This is exactly what I want. However, as you can see the data starts a bit after midnight, and I want those minutes from midnight to be in there as well. So I append a row for each bool_var / int_var combination at 2021-01-01 00:00:00, to make sure the resampling starts from there.
rows = []
# one midnight row per bool_var / int_var combination
for bool_var, int_var in A[['bool_var', 'int_var']].drop_duplicates().itertuples(index=False):
    rows.append({'date': pd.Timestamp('2021-01-01 00:00:00'),
                 'bool_var': bool_var, 'int_var': int_var})
extra_rows_df = pd.DataFrame(rows, columns=['date', 'bool_var', 'int_var'])
B = pd.concat([A, extra_rows_df], ignore_index=True)
The resulting dataframe B appears to be correct, and in the same format as dataframe A:
date bool_var int_var
2021-01-01 00:00:00 True 1
2021-01-01 00:03:00 True 1
2021-01-01 00:00:00 False 6
2021-01-01 00:06:00 False 6
2021-01-01 00:00:00 True 6
2021-01-01 00:06:00 True 6
However, if I run the exact same groupby and resample command on dataframe B, my results are all weird:
date 2021-01-01 00:00:00 ... 2021-12-31 23:59:00
int_var bool_var 1 ... 1
1 True
6 True
False
It is like each date suddenly became a column instead of being listed for each grouping.
TL;DR: use stack().
I figured it out. In dataframe A, every bool_var / int_var group has different datetime values; here (1, True) started with 00:03, but some other group, e.g. (2, True), could start with an entry at 01:14. Once I filled out dataframe A so that each group had an entry at 00:00 (dataframe B) and resampled to fill in each minute, every group covered every datetime. Because those datetimes now apply to every group, they became columns instead of index entries.
The solution is to call stack() on this final result.
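The exact groupby/resample command is not shown in the question, so the following is only a sketch of the pattern under that assumption (counting rows per minute); the isinstance guard keeps it harmless if the result already comes back long:
import pandas as pd

# Count rows per minute for each (int_var, bool_var) group (a sketch; the
# original command is not shown in the question).
counts = (
    B.set_index('date')
     .groupby(['int_var', 'bool_var'])
     .resample('1Min')
     .size()
)

# When every group spans the same minutes, the result can come back wide,
# with one column per timestamp, as in the question; stack() moves the dates
# back into the index and restores the long layout shown for dataframe A.
if isinstance(counts, pd.DataFrame):
    counts = counts.stack()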

After groupby, evaluate value in column against column values in all rows in the group

I am looking for the following functionality in Python:
I have a pandas DataFrame with 4 columns: ID, StartDate, EndDate, Moment.
I want to group by ID and evaluate, for each row in the group, whether the Moment value falls within the interval between StartDate and EndDate of any row in that group. For example, in the following DataFrame there are two groups (ID=1 and ID=2), each consisting of 5 rows. For each row I want a boolean indicating whether the moment value of that row falls in ANY of the group's time windows, a window being [date1, date2].
import pandas as pd
i = pd.date_range('2018-04-11', periods=10, freq='2D20min')
i2 = pd.date_range('2018-04-12', periods=10, freq='2D20min')
i3 = pd.date_range('2018-04-9', periods=10, freq='1D6H')
id = ['1', '1', '1', '1', '1', '2', '2', '2', '2', '2']
ts = pd.DataFrame({'date1': i, 'date2': i2, 'moment': i3}, index=id)
ID date1 date2 moment
1 2018-04-11 00:00:00 2018-04-12 00:00:00 2018-04-09 00:00:00
1 2018-04-13 00:20:00 2018-04-14 00:20:00 2018-04-10 06:00:00
1 2018-04-15 00:40:00 2018-04-16 00:40:00 2018-04-11 12:00:00
1 2018-04-17 01:00:00 2018-04-18 01:00:00 2018-04-12 18:00:00
1 2018-04-19 01:20:00 2018-04-20 01:20:00 2018-04-14 00:00:00
2 2018-04-21 01:40:00 2018-04-22 01:40:00 2018-04-15 06:00:00
2 2018-04-23 02:00:00 2018-04-24 02:00:00 2018-04-16 12:00:00
2 2018-04-25 02:20:00 2018-04-26 02:20:00 2018-04-17 18:00:00
2 2018-04-27 02:40:00 2018-04-28 02:40:00 2018-04-19 00:00:00
2 2018-04-29 03:00:00 2018-04-30 03:00:00 2018-04-20 06:00:00
In this case, the value of moment in the first row of the first group does not fall in any of the five time intervals. Neither does the second. The third value, 2018-04-11 12:00:00, does fall in the interval of the first row, so I would want True returned for it.
The desired result would look as follows:
ID date1 date2 moment result
1 2018-04-11 00:00:00 2018-04-12 00:00:00 2018-04-09 00:00:00 False
1 2018-04-13 00:20:00 2018-04-14 00:20:00 2018-04-10 06:00:00 False
1 2018-04-15 00:40:00 2018-04-16 00:40:00 2018-04-11 12:00:00 True
1 2018-04-17 01:00:00 2018-04-18 01:00:00 2018-04-12 18:00:00 False
1 2018-04-19 01:20:00 2018-04-20 01:20:00 2018-04-14 00:00:00 True
2 2018-04-21 01:40:00 2018-04-22 01:40:00 2018-04-15 06:00:00 False
2 2018-04-23 02:00:00 2018-04-24 02:00:00 2018-04-16 12:00:00 False
2 2018-04-25 02:20:00 2018-04-26 02:20:00 2018-04-17 18:00:00 False
2 2018-04-27 02:40:00 2018-04-28 02:40:00 2018-04-19 00:00:00 False
2 2018-04-29 03:00:00 2018-04-30 03:00:00 2018-04-20 06:00:00 False
EDIT
I already 'solved' this problem with the following approach but am looking for a more pythonic and perhaps faster way...
boolean_result = []
for c in ts.index.unique():
    temp = ts.loc[ts.index == c]
    for row in temp.index:
        current_date = temp['moment'][row]
        boolean_result.append(max((temp['date1'] <= current_date)
                                  & (current_date <= temp['date2'])))
ts['Result'] = boolean_result
This may actually be very slow if your dataframe is big, and there may be a better solution than this one:
def time_in_range(start, end, x):
    """Return true if x is in the range [start, end]"""
    if start <= x and x <= end:
        return True
    else:
        return False

# empty lists to be appended
result = []
test_list = []
for i in ts.index.unique():
    temp_df = ts[ts.index == i]
    for j in range(0, len(temp_df)):
        for k in range(0, len(temp_df)):
            test_list.append(time_in_range(temp_df.date1.iloc[k], temp_df.date2.iloc[k], temp_df.moment.iloc[j]))
        result.append(any(test_list))
        # reset the list
        test_list = []
ts['result'] = result
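For the "more pythonic and perhaps faster way" the question asks about, one option (not from the thread itself) is to broadcast each group's moments against its intervals with NumPy, which removes the inner Python loops. A sketch, assuming the group label is the index as in ts and the groups are contiguous:
import numpy as np
import pandas as pd

def moment_in_any_interval(group):
    # Broadcast every 'moment' in the group against every [date1, date2]
    # interval in the same group and check membership in any of them.
    m = group['moment'].to_numpy()
    lo = group['date1'].to_numpy()
    hi = group['date2'].to_numpy()
    return ((m[:, None] >= lo[None, :]) & (m[:, None] <= hi[None, :])).any(axis=1)

# sort=False keeps the groups in their original (contiguous) order, so the
# concatenated flags line up positionally with the rows of ts.
flags = np.concatenate(
    [moment_in_any_interval(g) for _, g in ts.groupby(level=0, sort=False)]
)
ts['result'] = flags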

Pandas .resample() or .asfreq() fill forward times

I'm trying to resample a dataframe with a time series from 1-hour increments to 15-minute increments. Both .resample() and .asfreq() do almost exactly what I want, but I'm having a hard time filling the last three intervals.
I could add an extra hour at the end, resample, and then drop that last hour, but it feels hacky.
Current code:
df = pd.DataFrame({'date':pd.date_range('2018-01-01 00:00', '2018-01-01 01:00', freq = '1H'), 'num':5})
df = df.set_index('date').asfreq('15T', method = 'ffill', how = 'end').reset_index()
Current output:
date num
0 2018-01-01 00:00:00 5
1 2018-01-01 00:15:00 5
2 2018-01-01 00:30:00 5
3 2018-01-01 00:45:00 5
4 2018-01-01 01:00:00 5
Desired output:
date num
0 2018-01-01 00:00:00 5
1 2018-01-01 00:15:00 5
2 2018-01-01 00:30:00 5
3 2018-01-01 00:45:00 5
4 2018-01-01 01:00:00 5
5 2018-01-01 01:15:00 5
6 2018-01-01 01:30:00 5
7 2018-01-01 01:45:00 5
Thoughts?
Not sure about asfreq but reindex works wonderfully:
df.set_index('date').reindex(
    pd.date_range(
        df.date.min(),
        df.date.max() + pd.Timedelta('1H'), freq='15T', closed='left'
    ),
    method='ffill'
)
num
2018-01-01 00:00:00 5
2018-01-01 00:15:00 5
2018-01-01 00:30:00 5
2018-01-01 00:45:00 5
2018-01-01 01:00:00 5
2018-01-01 01:15:00 5
2018-01-01 01:30:00 5
2018-01-01 01:45:00 5
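On newer pandas releases, date_range's closed= argument has been replaced by inclusive= and the 'T' alias by 'min', so the same idea would read roughly like this (a sketch, adjust to your version):
# Same reindex trick with the newer keyword and alias spellings.
out = df.set_index('date').reindex(
    pd.date_range(df.date.min(),
                  df.date.max() + pd.Timedelta(hours=1),
                  freq='15min', inclusive='left'),
    method='ffill'
)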

How can I delete all rows for a day based on a condition on column values? (pandas)

I have the time series DataFrame below.
I want to delete rows based on a condition checked per day: if aaa > 100 anywhere in a day, delete all rows for that day (below, delete all 2015-12-01 rows because the last three aaa values shown are 1000).
....
date time aaa
2015-12-01,00:00:00,0
2015-12-01,00:15:00,0
2015-12-01,00:30:00,0
2015-12-01,00:45:00,0
2015-12-01,01:00:00,0
2015-12-01,01:15:00,0
2015-12-01,01:30:00,0
2015-12-01,01:45:00,0
2015-12-01,02:00:00,0
2015-12-01,02:15:00,0
2015-12-01,02:30:00,0
2015-12-01,02:45:00,0
2015-12-01,03:00:00,0
2015-12-01,03:15:00,0
2015-12-01,03:30:00,0
2015-12-01,03:45:00,0
2015-12-01,04:00:00,0
2015-12-01,04:15:00,0
2015-12-01,04:30:00,0
2015-12-01,04:45:00,0
2015-12-01,05:00:00,0
2015-12-01,05:15:00,0
2015-12-01,05:30:00,0
2015-12-01,05:45:00,0
2015-12-01,06:00:00,0
2015-12-01,06:15:00,0
2015-12-01,06:30:00,1000
2015-12-01,06:45:00,1000
2015-12-01,07:00:00,1000
....
How can I do it?
If you have a MultiIndex, first compare the values of aaa against the condition, take the unique dates from the first index level by boolean indexing, and then filter with isin, inverting the condition with ~:
print (df)
aaa
date time
2015-12-01 00:00:00 0
00:15:00 0
00:30:00 0
00:45:00 0
2015-12-02 05:00:00 0
05:15:00 200
05:30:00 0
05:45:00 0
2015-12-03 06:00:00 0
06:15:00 0
06:30:00 1000
06:45:00 1000
07:00:00 1000
lvl0 = df.index.get_level_values(0)
idx = lvl0[df['aaa'].gt(100)].unique()
print (idx)
Index(['2015-12-02', '2015-12-03'], dtype='object', name='date')
df = df[~lvl0.isin(idx)]
print (df)
aaa
date time
2015-12-01 00:00:00 0
00:15:00 0
00:30:00 0
00:45:00 0
And if the date is a regular column rather than part of the index, compare the date column only:
print (df)
date time aaa
0 2015-12-01 00:00:00 0
1 2015-12-01 00:15:00 0
2 2015-12-01 00:30:00 0
3 2015-12-01 00:45:00 0
4 2015-12-02 05:00:00 0
5 2015-12-02 05:15:00 200
6 2015-12-02 05:30:00 0
7 2015-12-02 05:45:00 0
8 2015-12-03 06:00:00 0
9 2015-12-03 06:15:00 0
10 2015-12-03 06:30:00 1000
11 2015-12-03 06:45:00 1000
12 2015-12-03 07:00:00 1000
idx = df.loc[df['aaa'].gt(100), 'date'].unique()
print (idx)
['2015-12-02' '2015-12-03']
df = df[~df['date'].isin(idx)]
print (df)
date time aaa
0 2015-12-01 00:00:00 0
1 2015-12-01 00:15:00 0
2 2015-12-01 00:30:00 0
3 2015-12-01 00:45:00 0
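A compact alternative for the flat-column layout (a sketch, not part of the answer above) is to let groupby/transform mark the bad days, so you never build the list of dates yourself:
# Keep only the days on which aaa never exceeds 100.
df = df[df.groupby('date')['aaa'].transform('max').le(100)]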

Unable to convert to datetime using pd.to_datetime

I am trying to read a csv file and convert it to a dataframe to be used as a time series.
The csv file is of this type:
#Date Time CO_T1_AHU.01_CC_CTRV_CHW__SIG_STAT
0 NaN NaN %
1 NaN NaN Cooling Coil Hydronic Valve Position
2 2014-01-01 00:00:00 0
3 2014-01-01 01:00:00 0
4 2014-01-01 02:00:00 0
5 2014-01-01 03:00:00 0
6 2014-01-01 04:00:00 0
I read the file using:
df = pd.read_csv ('filepath/file.csv', sep=';', parse_dates = [[0,1]])
producing this result:
#Date_Time FCO_T1_AHU.01_CC_CTRV_CHW__SIG_STAT
0 nan nan %
1 nan nan Cooling Coil Hydronic Valve Position
2 2014-01-01 00:00:00 0
3 2014-01-01 01:00:00 0
4 2014-01-01 02:00:00 0
5 2014-01-01 03:00:00 0
6 2014-01-01 04:00:00 0
Then, to convert the strings to datetime and use the column as the index:
pd.to_datetime(df.values[:,0])
df.set_index([df.columns[0]], inplace=True)
so i get this:
FCO_T1_AHU.01_CC_CTRV_CHW__SIG_STAT
#Date_Time
nan nan %
nan nan Cooling Coil Hydronic Valve Position
2014-01-01 00:00:00 0
2014-01-01 01:00:00 0
2014-01-01 02:00:00 0
2014-01-01 03:00:00 0
2014-01-01 04:00:00 0
However, pd.to_datetime is unable to convert to datetime. Is there a way of finding out what the error is?
Many thanks.
Luis
The string entries 'nan nan' cannot be converted using to_datetime, so replace them with an empty string; they will then be converted to NaT:
In [122]:
df['Date_Time'].replace('nan nan', '',inplace=True)
df
Out[122]:
Date_Time index CO_T1_AHU.01_CC_CTRV_CHW__SIG_STAT
0 0 %
1 1 Cooling Coil Hydronic Valve Position
2 2014-01-01 00:00:00 2 0
3 2014-01-01 01:00:00 3 0
4 2014-01-01 02:00:00 4 0
5 2014-01-01 03:00:00 5 0
6 2014-01-01 04:00:00 6 0
In [124]:
df['Date_Time'] = pd.to_datetime(df['Date_Time'])
df
Out[124]:
Date_Time index CO_T1_AHU.01_CC_CTRV_CHW__SIG_STAT
0 NaT 0 %
1 NaT 1 Cooling Coil Hydronic Valve Position
2 2014-01-01 00:00:00 2 0
3 2014-01-01 01:00:00 3 0
4 2014-01-01 02:00:00 4 0
5 2014-01-01 03:00:00 5 0
6 2014-01-01 04:00:00 6 0
UPDATE
Actually if you just set coerce=True then it converts fine:
df['Date_Time'] = pd.to_datetime(df['Date_Time'], coerce=True)
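Note that the coerce keyword only exists on old pandas versions; on current releases the equivalent is:
# errors='coerce' turns unparseable entries into NaT instead of raising.
df['Date_Time'] = pd.to_datetime(df['Date_Time'], errors='coerce')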
