I have two dataframes. The first, main_df, holds high-frequency PI data with timestamps. The second, reference_df, is a header-level dataframe with start and end timestamps. I need to create an indicator on main_df for when individual rows fall between the start and end timestamps in reference_df. Any input would be appreciated.
import pandas as pd
#Proxy reference dataframe
reference_data = [['site a', '2021-03-05 00:00:00', '2021-03-05 23:52:00'],
['site a', '2021-03-06 00:00:00', '2021-03-06 12:00:00'],
['site b', '2021-04-08 20:04:00', '2021-04-08 23:00:00'],
['site c', '2021-04-09 04:08:00', '2021-04-09 09:52:00']]
ref_df = pd.DataFrame(reference_data, columns=['id', 'start', 'end'])
ref_df['start'] = pd.to_datetime(ref_df['start'])
ref_df['end'] = pd.to_datetime(ref_df['end'])
#Proxy main high frequency dataframe
main_data = [['site a', '2021-03-05 01:00:00', 10],
['site a', '2021-03-05 01:01:00', 11],
['site b', '2021-04-08 20:00:00', 9],
['site b', '2021-04-08 20:04:00', 10],
['site b', '2021-04-08 20:05:00', 11],
['site c', '2021-01-09 10:00:00', 7]]
# Create the pandas DataFrame
main_df = pd.DataFrame(main_data, columns=['id', 'timestamp', 'value'])
main_df['timestamp'] = pd.to_datetime(main_df['timestamp'])
Desired DataFrame:
print(main_df)
id timestamp value event_indicator
0 site a 2021-03-05 01:00:00 10 1
1 site a 2021-03-05 01:01:00 11 1
2 site b 2021-04-08 20:00:00 9 0
3 site b 2021-04-08 20:04:00 10 1
4 site b 2021-04-08 20:05:00 11 1
5 site c 2021-01-09 10:00:00 7 0
Perform an inner join on the site ids, then check whether each timestamp falls within any of the ranges. Call reset_index before the merge so the original row number is carried along, letting you track which row each range check belongs to.
s = main_df.reset_index().merge(ref_df, on='id')
s['event_indicator'] = s['timestamp'].between(s['start'], s['end']).astype(int)
# Max checks for at least one overlapping range per original row.
s = s.groupby('index')['event_indicator'].max()
main_df['event_indicator'] = s  # aligns on the original index
id timestamp value event_indicator
0 site a 2021-03-05 01:00:00 10 1
1 site a 2021-03-05 01:01:00 11 1
2 site b 2021-04-08 20:00:00 9 0
3 site b 2021-04-08 20:04:00 10 1
4 site b 2021-04-08 20:05:00 11 1
5 site c 2021-01-09 10:00:00 7 0
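If main_df can contain ids that never appear in ref_df, the inner join drops those rows and the index-aligned assignment above leaves NaN for them, silently turning the column into floats. A minimal hedged follow-up, assuming an unmatched id should simply mean "no event":
# Rows whose id had no match in ref_df got NaN from the aligned
# assignment; treat them as "no event" and restore integer dtype.
main_df['event_indicator'] = main_df['event_indicator'].fillna(0).astype(int)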
I have data that is in this inconvenient format. Simple reproducible example below:
26/9/21 26/9/21
10:00 Paul
12:00 John
27/9/21 27/9/21
1:00 Ringo
As you can see, the dates have not been entered as a column. Instead, each date appears as a "header" row for the rows below it, and each date has a variable number of data rows beneath it before the next date "header" row.
The output I would like would be:
26/9/21 10:00 Paul
26/9/21 12:00 John
27/9/21 1:00 Ringo
How can I do this in Python and Pandas?
Code for data entry below:
import pandas as pd
df = pd.DataFrame({'a': ['26/9/21', '10:00', '12:00', '27/9/21', '1:00'],
'b': ['26/9/21', 'Paul', 'John', '27/9/21', 'Ringo']})
df
Convert column a to datetime with errors='coerce', then forward-fill the dates. Now you can add the times as offsets to the filled dates.
sra = pd.to_datetime(df['a'], format='%d/%m/%y', errors='coerce')
msk = sra.isnull()
sra = sra.ffill() + pd.to_timedelta(df.loc[msk, 'a'] + ':00')
out = pd.merge(sra[msk], df['b'], left_index=True, right_index=True)
>>> out
a b
1 2021-09-26 10:00:00 Paul
2 2021-09-26 12:00:00 John
4 2021-09-27 01:00:00 Ringo
Step by step:
>>> sra = pd.to_datetime(df['a'], format='%d/%m/%y', errors='coerce')
0 2021-09-26
1 NaT
2 NaT
3 2021-09-27
4 NaT
Name: a, dtype: datetime64[ns]
>>> msk = sra.isnull()
0 False
1 True
2 True
3 False
4 True
Name: a, dtype: bool
>>> sra = sra.ffill() + pd.to_timedelta(df.loc[msk, 'a'] + ':00')
0 NaT
1 2021-09-26 10:00:00
2 2021-09-26 12:00:00
3 NaT
4 2021-09-27 01:00:00
Name: a, dtype: datetime64[ns]
>>> out = pd.merge(sra[msk], df['b'], left_index=True, right_index=True)
a b
1 2021-09-26 10:00:00 Paul
2 2021-09-26 12:00:00 John
4 2021-09-27 01:00:00 Ringo
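Since sra and df share the same index, the merge can also be written as a plain index-aligned selection; a minimal equivalent sketch:
# Same result without merge: take the masked rows of both columns,
# which are already aligned on the original index.
out = pd.DataFrame({'a': sra[msk], 'b': df.loc[msk, 'b']})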
The following is straightforward code that reads the original dataframe row by row and builds a new one:
df = pd.DataFrame({'a': ['26/9/21', '10:00', '12:00', '27/9/21', '1:00'],
'b': ['26/9/21', 'Paul', 'John', '27/9/21', 'Ringo']})
dflen = len(df)
newrow = []; newdata = []
for i in range(dflen): # read each row one by one
if '/' in df.iloc[i,0]: # if date found
item0 = df.iloc[i,0] # get new date
newrow = [item0] # put date as first entry of new row
continue # go to next row
newrow.append(df.iloc[i,0]) # add time
newrow.append(df.iloc[i,1]) # add name
newdata.append(newrow) # add row to new data
newrow = [item0] # create new row with same date entry
newdf = pd.DataFrame(newdata, columns=['Date','Time','Name']) # create new dataframe;
print(newdf)
Output:
Date Time Name
0 26/9/21 10:00 Paul
1 26/9/21 12:00 John
2 27/9/21 1:00 Ringo
I have some columns in a dataset that contain date and time together, and my goal is to obtain two separate columns, one with the date and one with the time.
Example:
Dataset name: A
Starting column: Cat
12/01/2021 20:15:06
02/01/2021 12:15:07
01/01/2021 15:05:03
01/01/2021 15:05:03
Goal
Column: Cat1
12/01/2021
02/01/2021
01/01/2021
01/01/2021
Column: Cat2
20:15:06
12:15:07
15:05:03
15:05:03
I assume that you're using pandas and that you want to keep working in the same dataframe.
# df = A (?)
# assumes df['Cat'] is already a datetime64 column
df['Cat1'] = [d.date() for d in df['Cat']]
df['Cat2'] = [d.time() for d in df['Cat']]
Working example:
import pandas as pd
from datetime import datetime, timedelta
df = pd.DataFrame.from_dict(
{'A': [1, 2, 3],
'B': [4, 5, 6],
'Datetime': [datetime.strftime(datetime.now()-timedelta(days=_),
"%m/%d/%Y, %H:%M:%S") for _ in range(3)]},
orient='index',
columns=['A', 'B', 'C']).T
df['Datetime'] = pd.to_datetime(df['Datetime'], format="%m/%d/%Y, %H:%M:%S")
# A B Datetime
# A 1 4 2021-03-05 14:07:59
# B 2 5 2021-03-04 14:07:59
# C 3 6 2021-03-03 14:07:59
df['Cat1'] = [d.date() for d in df['Datetime']]
df['Cat2'] = [d.time() for d in df['Datetime']]
# A B Datetime Cat1 Cat2
# A 1 4 2021-03-05 14:07:59 2021-03-05 14:07:59
# B 2 5 2021-03-04 14:07:59 2021-03-04 14:07:59
# C 3 6 2021-03-03 14:07:59 2021-03-03 14:07:59
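As a side note, when the column is already datetime64, the same split can be done with pandas' .dt accessor instead of a Python-level loop; a minimal sketch using df['Datetime'] as built above:
# .dt.date / .dt.time return datetime.date / datetime.time objects,
# so the resulting columns have object dtype, just like the loop version.
df['Cat1'] = df['Datetime'].dt.date
df['Cat2'] = df['Datetime'].dt.time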
I have a DataFrame like the one below:
rng = pd.date_range('2020-12-11', periods=5, freq='T')
df = pd.DataFrame({ 'Date': rng, 'status': ['active', 'active', 'finished', 'finished', 'active'] })
And I need to create 2 new columns in this DataFrame:
New1 = number of days from the "Date" column until today, for rows with status 'active'
New2 = number of days from the "Date" column until today, for rows with status 'finished'
Use Series.rsub to subtract from the right side with today's date (built via Timestamp and Timestamp.floor), convert the timedeltas to days with Series.dt.days, and assign the new columns conditionally with Series.where:
rng = pd.date_range('2020-12-01', periods=5, freq='D')
df = pd.DataFrame({ 'Date': rng,
'status': ['active', 'active', 'finished', 'finished', 'active'] })
days = df['Date'].rsub(pd.Timestamp('now').floor('d')).dt.days
df['New1'] = days.where(df['status'].eq('active'))
df['New2'] = days.where(df['status'].eq('finished'))
print (df)
Date status New1 New2
0 2020-12-01 active 13.0 NaN
1 2020-12-02 active 12.0 NaN
2 2020-12-03 finished NaN 11.0
3 2020-12-04 finished NaN 10.0
4 2020-12-05 active 9.0 NaN
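If the float columns with NaN are undesirable, a hedged variant using pandas' nullable integer dtype keeps whole-day counts while preserving the missing values:
# 'Int64' (capital I) is the nullable integer dtype; NaN becomes pd.NA.
df['New1'] = days.where(df['status'].eq('active')).astype('Int64')
df['New2'] = days.where(df['status'].eq('finished')).astype('Int64')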
Below is a sample of my df
date value
0006-03-01 00:00:00 1
0006-03-15 00:00:00 2
0006-05-15 00:00:00 1
0006-07-01 00:00:00 3
0006-11-01 00:00:00 1
2009-05-20 00:00:00 2
2009-05-25 00:00:00 8
2020-06-24 00:00:00 1
2020-06-30 00:00:00 2
2020-07-01 00:00:00 13
2020-07-15 00:00:00 2
2020-08-01 00:00:00 4
2020-10-01 00:00:00 2
2020-11-01 00:00:00 4
2023-04-01 00:00:00 1
2218-11-12 10:00:27 1
4000-01-01 00:00:00 6
5492-04-15 00:00:00 1
5496-03-15 00:00:00 1
5589-12-01 00:00:00 1
7199-05-15 00:00:00 1
9186-12-30 00:00:00 1
As you can see, the data contains some misspelled dates.
Questions:
How can we convert this column to the format dd.mm.yyyy?
How can we replace rows where the year is greater than 2022 with 01.01.2100?
How can we remove all rows where the year is less than 2005?
The final output should look like this.
date value
20.05.2009 2
25.05.2009 8
24.06.2020 1
30.06.2020 2
01.07.2020 13
15.07.2020 2
01.08.2020 4
01.10.2020 2
01.11.2020 4
01.01.2100 1
01.01.2100 1
01.01.2100 6
01.01.2100 1
01.01.2100 1
01.01.2100 1
01.01.2100 1
01.01.2100 1
I tried to convert the column using to_datetime but it failed.
df[col] = pd.to_datetime(df[col], infer_datetime_format=True)
Out of bounds nanosecond timestamp: 5-03-01 00:00:00
Thanks to anyone helping!
You could check the first element of your datetime strings after a split on '-' and clean up / replace based on its integer value. For the small values like '0006', calling pd.to_datetime with errors='coerce' will do the trick. It will leave 'NaT' for the invalid dates. You can drop those with dropna(). Example:
import pandas as pd
df = pd.DataFrame({'date': ['0006-03-01 00:00:00',
'0006-03-15 00:00:00',
'0006-05-15 00:00:00',
'0006-07-01 00:00:00',
'0006-11-01 00:00:00',
'nan',
'2009-05-25 00:00:00',
'2020-06-24 00:00:00',
'2020-06-30 00:00:00',
'2020-07-01 00:00:00',
'2020-07-15 00:00:00',
'2020-08-01 00:00:00',
'2020-10-01 00:00:00',
'2020-11-01 00:00:00',
'2023-04-01 00:00:00',
'2218-11-12 10:00:27',
'4000-01-01 00:00:00',
'NaN',
'5496-03-15 00:00:00',
'5589-12-01 00:00:00',
'7199-05-15 00:00:00',
'9186-12-30 00:00:00']})
# first, drop rows where 'date' contains 'nan' (case-insensitive):
df = df.loc[~df['date'].str.contains('nan', case=False)]
# now replace strings where the year is above a threshold:
df.loc[df['date'].str.split('-').str[0].astype(int) > 2022, 'date'] = '2100-01-01 00:00:00'
# convert to datetime, if year is too low, will result in NaT:
df['date'] = pd.to_datetime(df['date'], errors='coerce')
# df['date']
# 0 NaT
# 1 NaT
# 2 NaT
# 3 NaT
# 4 NaT
# 6 2009-05-25
# 7 2020-06-24
# ...
df = df.dropna()
# df
# date
# 6 2009-05-25
# 7 2020-06-24
# 8 2020-06-30
# 9 2020-07-01
# 10 2020-07-15
# 11 2020-08-01
# 12 2020-10-01
# 13 2020-11-01
# 14 2100-01-01
# 15 2100-01-01
# ...
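The question also asks for the dd.mm.yyyy presentation; assuming a plain string column is acceptable as the final output, one more hedged step covers it:
# Render the cleaned datetimes as dd.mm.yyyy strings. Note this makes
# the column object dtype; keep a datetime copy if you still need one.
df['date'] = df['date'].dt.strftime('%d.%m.%Y')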
Pandas' nanosecond-resolution timestamps have hard bounds, which is why the out-of-bounds error is thrown (https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html). The code below removes values that would cause this error before creating the dataframe.
import datetime as dt
import pandas as pd
data = [[dt.datetime(year=2022, month=3, day=1), 1],
[dt.datetime(year=2009, month=5, day=20), 2],
[dt.datetime(year=2001, month=5, day=20), 2],
[dt.datetime(year=2023, month=12, day=30), 3],
[dt.datetime(year=6, month=12, day=30), 3]]
dataCleaned = [elements for elements in data if pd.Timestamp.max > elements[0] > pd.Timestamp.min]
df = pd.DataFrame(dataCleaned, columns=['date', 'Value'])
print(df)
# OUTPUT
date Value
0 2022-03-01 1
1 2009-05-20 2
2 2001-05-20 2
3 2023-12-30 3
df.loc[df.date.dt.year > 2022, 'date'] = dt.datetime(year=2100, month=1, day=1)
df.drop(df.loc[df.date.dt.year < 2005, 'date'].index, inplace=True)
print(df)
# OUTPUT
date Value
0 2022-03-01 1
1 2009-05-20 2
3 2100-01-01 3
If you still want to include the dates that throw the out of bounds error, check out How to work around Python Pandas DataFrame's "Out of bounds nanosecond timestamp" error?
I suggest the following (note numpy is needed for the NaN placeholder):
import numpy as np
import pandas as pd
df = pd.DataFrame.from_dict({'date': ['0003-03-01 00:00:00',
'7199-05-15 00:00:00',
'2020-10-21 00:00:00'],
'value': [1, 2, 3]})
df['date'] = [d[8:10] + '.' + d[5:7] + '.' + d[:4] if '2004' < d[:4] < '2023' \
else '01.01.2100' if d[:4] > '2022' else np.nan for d in df['date']]
df.dropna(inplace=True)
This yields the desired output:
date value
01.01.2100 2
21.10.2020 3
I have a pandas dataframe like this -
ColA ColB ColC
Apple 2019-03-02 18:00:00 Saturday
Orange 2019-03-03 10:00:00 Sunday
Mango 2019-03-04 09:00:00 Monday
I am trying to remove rows from my dataframe based on certain conditions:
Remove a row if its time is between 9 AM and 5 PM.
Do not remove it if the day is a weekend (Saturday or Sunday).
Expected output will not have Mango in the dataframe.
It turns out to be harder than I thought:
s1 = df.ColB.dt.hour.between(9, 17, inclusive='neither')
df.loc[s1 | df.ColC.isin(['Saturday', 'Sunday'])]
ColA ColB ColC
0 Apple 2019-03-02 18:00:00 Saturday
1 Orange 2019-03-03 10:00:00 Sunday
Or using
s1 = pd.Index(df.ColB).indexer_between_time('09:00:00', '17:00:00', include_start=False, include_end=False)
s1 = df.index.isin(s1)
df.loc[s1 | df.ColC.isin(['Saturday', 'Sunday'])]
To give another alternative, you could write it like this:
cond1 = df.ColB.dt.hour >= 9 # After 09:00
cond2 = df.ColB.dt.hour < 17 # Before 17:00 (5 PM)
cond3 = df.ColB.dt.weekday < 5 # Mon-Fri
df = df[~(cond1 & cond2 & cond3)]
Full example:
import pandas as pd
df = pd.DataFrame({
'ColA': ['Apple','Orange','Mango'],
'ColB': pd.to_datetime([
'2019-03-02 18:00:00',
'2019-03-03 10:00:00',
'2019-03-04 09:00:00'
]),
'ColC': ['Saturday', 'Sunday', 'Monday']
})
cond1 = df.ColB.dt.hour >= 9 # After 09:00
cond2 = df.ColB.dt.hour < 17 # Before 17:00 (5 PM)
cond3 = df.ColB.dt.weekday < 5 # Mon-Fri
df = df[~(cond1 & cond2 & cond3)] # conditions mark the rows to drop, hence ~
print(df)
Returns:
ColA ColB ColC
0 Apple 2019-03-02 18:00:00 Saturday
1 Orange 2019-03-03 10:00:00 Sunday
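Both answers compare only the hour, so a boundary like exactly 17:00 versus 17:30 cannot be distinguished. If minute precision matters, a hedged variant comparing full times, applied to the original df from the example above (assumes ColB is datetime64):
from datetime import time
t = df['ColB'].dt.time # python datetime.time per row
in_window = (t >= time(9, 0)) & (t <= time(17, 0))
weekday = df['ColB'].dt.weekday < 5 # Monday=0 .. Friday=4
df = df[~(in_window & weekday)]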