Assign rows from one dataframe to another dataframe based on a condition - python

I need to generate df_Result_Sensor automatically.
I would like df_Result_Sensor to receive only the df_Sensor rows whose ['TimeStamp'] value does not fall within any of the ranges given by df_Message['date init'] and df_Message['date end'].
In the code example below, I wrote df_Result_Sensor manually, just to illustrate the desired output:
TimeStamp Sensor_one Sensor_two
0 2017-05-20 00:00:00 1 1
1 2017-04-13 00:00:00 1 1
2 2017-09-10 00:00:00 0 1
import pandas as pd
df_Sensor = pd.DataFrame({'TimeStamp' : ['2017-05-25 00:00:00','2017-05-20 00:00:00', '2017-04-13 00:00:00', '2017-08-29 01:15:12', '2017-08-15 02:15:12', '2017-09-10 00:00:00'], 'Sensor_one': [1,1,1,1,1,0], 'Sensor_two': [1,1,1,0,1,1]})
df_Message = pd.DataFrame({'date init': ['2017-05-22 00:00:00', '2017-08-14 00:00:10'], 'date end': ['2017-05-26 00:00:05', '2017-09-01 02:10:05'], 'Message': ['Cold', 'Cold']})
The same desired output, written as code:
df_Result_Sensor = pd.DataFrame({'TimeStamp' : ['2017-05-20 00:00:00', '2017-04-13 00:00:00', '2017-09-10 00:00:00'], 'Sensor_one': [1,1,0], 'Sensor_two': [1,1,1]})

This will work; make sure your date columns are converted to datetime before doing date comparisons:
df_Message["date init"] = pd.to_datetime(df_Message["date init"])
df_Message['date end'] = pd.to_datetime(df_Message['date end'])
df_Sensor["TimeStamp"] = pd.to_datetime(df_Sensor["TimeStamp"])
df_Sensor_ = df_Sensor.copy()
for index, row in df_Message.iterrows():
    df_Sensor_ = df_Sensor_[~((df_Sensor_["TimeStamp"] > row['date init']) & (df_Sensor_["TimeStamp"] < row['date end']))]
df_Result_Sensor = df_Sensor_
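If you want to avoid the row-wise loop, a broadcast comparison against all ranges at once should give the same result (a sketch, assuming the same column names and the datetime conversions above):
import numpy as np
# compare every TimeStamp against every (date init, date end) pair at once
ts = df_Sensor['TimeStamp'].to_numpy()[:, None]
starts = df_Message['date init'].to_numpy()
ends = df_Message['date end'].to_numpy()
in_any_range = ((ts > starts) & (ts < ends)).any(axis=1)
df_Result_Sensor = df_Sensor[~in_any_range].reset_index(drop=True)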

Related

How to find the difference in a dataframe between values given on the start date and end date

I have to find the difference between the data provided at 00:00:00 and at 23:59:59 for each day, over seven days.
How can I find, for each day, the difference between the value at the start and at the end of the day?
Sample Data
Date Data
2018-12-01 00:00:00 2
2018-12-01 12:00:00 5
2018-12-01 23:59:59 10
2018-12-02 00:00:00 12
2018-12-02 12:00:00 15
2018-12-02 23:59:59 22
Expected Output
Date Data
2018-12-01 8
2018-12-02 10
Example
import pandas as pd

data = {
    'Date': ['2018-12-01 00:00:00', '2018-12-01 12:00:00', '2018-12-01 23:59:59',
             '2018-12-02 00:00:00', '2018-12-02 12:00:00', '2018-12-02 23:59:59'],
    'Data': [2, 5, 10, 12, 15, 22]
}
df = pd.DataFrame(data)
Code
df['Date'] = pd.to_datetime(df['Date'])
out = (df.resample('D', on='Date')['Data']
         .agg(lambda x: x.iloc[-1] - x.iloc[0])
         .reset_index())
out
Date Data
0 2018-12-01 8
1 2018-12-02 10
Update
A more efficient way: you can get the same result with the following code:
g = df.resample('D', on='Date')['Data']
out = g.last().sub(g.first()).reset_index()
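Equivalently, both aggregates can be computed in one pass; a small variant of my own, not part of the original answer:
agg = df.resample('D', on='Date')['Data'].agg(['first', 'last'])
out = (agg['last'] - agg['first']).reset_index(name='Data')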
You can use groupby and compute the min-max range per day. (Note this takes max - min rather than last - first; for monotonically increasing data like this sample, the result is the same.)
import pandas as pd
df = pd.DataFrame({
'Date': ['2018-12-01 00:00:00', '2018-12-01 12:00:00', '2018-12-01 23:59:59',
'2018-12-02 00:00:00', '2018-12-02 12:00:00', '2018-12-02 23:59:59'],
'Data': [2, 5, 10, 12, 15, 22]
})
df['Date'] = pd.to_datetime(df['Date'])
df['Date_Only'] = df['Date'].dt.date
result = df.groupby('Date_Only').apply(lambda x: x['Data'].max() - x['Data'].min())
print(result)

Calculate Usage time per day and customer in Pandas

I have a Pandas DataFrame with monthly events sent by customers, like this:
df = pd.DataFrame(
[
('2017-01-01 12:00:00', 'SID1', 'Something', 'A. Inc'),
('2017-01-02 00:30:00', 'SID1', 'Something', 'A. Inc'),
('2017-01-02 12:00:00', 'SID2', 'Something', 'A. Inc'),
('2017-01-01 15:00:00', 'SID4', 'Something', 'B. GmbH')
],
columns=['TimeStamp', 'Session ID', 'Event', 'Customer']
)
The Session IDs are unique, but a session could span multiple days. In addition, multiple sessions could occur on a given day.
I would like to calculate the minutes of usage for each day of the month per customer, like this:
Customer   01.01  02.01  ...  31.01
A. Inc       720     30  ...     50
B. GmbH        1      0  ...      0
I suspect that splitting TimeStamp into day and time, followed by groupby(['Customer', 'Day', 'Session ID']) and then applying (via apply()) some maths, is the way to go, but so far I could not make any real progress.
You can try this.
Extract the date, and the time converted to minutes, into new columns. Then sum the minutes per customer and date using groupby and agg. Finally, pivot the dataframe.
df['TimeStamp'] = pd.to_datetime(df['TimeStamp'])
df['date'] = df['TimeStamp'].dt.date
df['minutes'] = df['TimeStamp'].dt.strftime('%H:%M').apply(lambda x: int(x.split(':')[0]) * 60 + int(x.split(':')[1]))
new_df = df.groupby(['Customer', 'date']).agg({'minutes': 'sum'}).reset_index()
print(pd.pivot_table(new_df, values = 'minutes', index=['Customer'], columns = 'date'))
Output:
date 2017-01-01 2017-01-02
Customer
A. Inc 720.0 750.0
B. GmbH 900.0 NaN
OK, I found a solution; it might not be the best, but it works.
import datetime
import dateutil.relativedelta
import pandas as pd

# make sure the timestamps are real datetimes before taking min/max
df['TimeStamp'] = pd.to_datetime(df['TimeStamp'])
# group by session id and add max and min values of each group as new columns
group_Session = df.groupby(['Session ID'])
df['Start Time'] = group_Session['TimeStamp'].transform(lambda x: x.min())
df['Stop Time'] = group_Session['TimeStamp'].transform(lambda x: x.max())
df.drop_duplicates(subset=['Session ID'], keep='first', inplace=True)
# now we have start/stop for each session
# add all days of the month to the dataframe and fill with zeros
dateStart = datetime.datetime(2022, 2, 1)
dateStop = dateStart + dateutil.relativedelta.relativedelta(day=31)
for single_date in (dateStart.day + n for n in range(dateStop.day)):
    df[str(single_date) + '.' + str(dateStart.month)] = 0
for index, row in df.iterrows():
    # create a date range from start to stop with minute frequency
    # and convert it to a dataframe
    dateRangeFrame = pd.date_range(start=row['Start Time'], end=row['Stop Time'],
                                   freq='T').to_frame(name='value')
    # extract the day from the DatetimeIndex (note: %#d / %#m is Windows-specific)
    dateRangeFrame['day'] = dateRangeFrame['value'].dt.strftime('%#d.%#m')
    # group by day and count the rows -> per session (index), minutes per day
    day_to_minute_df = dateRangeFrame.groupby(['day']).count()
    # for each day, write the minute count into the matching column
    for d2m_index, d2m_row in day_to_minute_df.iterrows():
        df.loc[index, d2m_index] = d2m_row['value']
new_df = df.groupby(['Customer']).sum()
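For reference, a more compact sketch of the same idea (my own variant, assuming the question's column names): expand each session to one row per minute, count minutes per calendar day, then pivot to the customer-by-day table.
import pandas as pd

df['TimeStamp'] = pd.to_datetime(df['TimeStamp'])
sessions = (df.groupby(['Customer', 'Session ID'])['TimeStamp']
              .agg(start='min', stop='max').reset_index())
rows = []
for _, s in sessions.iterrows():
    # one entry per minute of the session, counted per calendar day
    minutes = pd.date_range(s['start'], s['stop'], freq='min')
    per_day = pd.Series(1, index=minutes).resample('D').sum()
    for day, m in per_day.items():
        rows.append({'Customer': s['Customer'],
                     'day': day.strftime('%d.%m'), 'minutes': m})
usage = (pd.DataFrame(rows)
           .pivot_table(index='Customer', columns='day',
                        values='minutes', aggfunc='sum', fill_value=0))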

How to filter by day the last 3 days' values of a pandas dataframe, considering the hours?

I have this dataframe:
I need to get the values of the single days between the times 05:00:00 and 06:00:00 (so, in this example, ignore 07:00:00),
and create a separate dataframe for each day, considering the last 3 days.
This is the result I want to achieve (3 dataframes covering the last 3 days, with Time between 05 and 06):
I tried this: (without success)
df.sort_values(by = "Time", inplace=True)
df_of_yesterday = df[(df.Time.dt.hour > 4) & (df.Time.dt.hour < 7)]
You can use:
from datetime import date, time, timedelta
today = date.today()
m = df['Time'].dt.time.between(time(5), time(6))
df_yda = df.loc[m & (df['Time'].dt.date == today - timedelta(1))]
df_2da = df.loc[m & (df['Time'].dt.date == today - timedelta(2))]
df_3da = df.loc[m & (df['Time'].dt.date == today - timedelta(3))]
Output:
>>> df_yda
Time Open
77 2022-03-09 05:00:00 0.880443
78 2022-03-09 06:00:00 0.401932
>>> df_2da
Time Open
53 2022-03-08 05:00:00 0.781377
54 2022-03-08 06:00:00 0.638676
>>> df_3da
Time Open
29 2022-03-07 05:00:00 0.838719
30 2022-03-07 06:00:00 0.897211
Set up an MRE (minimal reproducible example):
import pandas as pd
import numpy as np
rng = np.random.default_rng()
dti = pd.date_range('2022-03-06', '2022-03-10', freq='H')
df = pd.DataFrame({'Time': dti, 'Open': rng.random(len(dti))})
Use Series.between with offsets.DateOffset to select the datetimes between these times, inside a list comprehension that builds the list of DataFrames:
now = pd.to_datetime('now').normalize()
dfs = [df[df.Time.between(now - pd.DateOffset(days=i, hour=5),
                          now - pd.DateOffset(days=i, hour=6))] for i in range(1, 4)]
print (dfs[0])
print (dfs[1])
print (dfs[2])
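One detail worth knowing about the sketch above: in pd.DateOffset, the singular hour keyword sets the hour field, while the plural hours adds to it:
now - pd.DateOffset(days=1, hour=5)   # yesterday, with the time set to 05:00
now - pd.DateOffset(days=1, hours=5)  # 1 day and 5 hours earlier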
First, you should probably edit your question to use a text version of the data instead of an image. I've manually copied your data into a dictionary; here's a small example:
import pandas as pd

data = {
'Time': [
'2022-03-06 05:00:00',
'2022-03-06 06:00:00',
'2022-03-06 07:00:00',
'2022-03-07 05:00:00',
'2022-03-07 06:00:00',
'2022-03-07 07:00:00',
'2022-03-08 05:00:00',
'2022-03-08 06:00:00',
'2022-03-08 07:00:00',
'2022-03-09 05:00:00',
'2022-03-09 06:00:00',
'2022-03-09 07:00:00'
],
'Open': [
'13823.6',
'13786.6',
'13823.6',
'13823.6',
'13786.6',
'13823.6',
'13823.6',
'13786.6',
'13823.6',
'13823.6',
'13786.6',
'13823.6'
]
}
df = pd.DataFrame(data)
Then you can use this code to collect all the dates that fall on the same day with the hour between 4 and 7 (exclusive), and create your dataframes, as follows:
import pandas as pd
from datetime import datetime

days = {}  # maps 'YYYY-MM-DD' -> list of 'Open' values (renamed from 'dict' to avoid shadowing the builtin)
for index, row in df.iterrows():
    found = False
    for item in days:
        date = datetime.strptime(row['Time'], '%Y-%m-%d %H:%M:%S')
        date2 = datetime.strptime(item, '%Y-%m-%d')
        if date.date() == date2.date() and date.hour > 4 and date.hour < 7:
            days[item].append(row['Open'])
            found = True
    date = datetime.strptime(row['Time'], '%Y-%m-%d %H:%M:%S')
    if not found and date.hour > 4 and date.hour < 7:
        days[date.strftime('%Y-%m-%d')] = []
        days[date.strftime('%Y-%m-%d')].append(row['Open'])
for key in days:
    temp = {
        key: days[key]
    }
    df = pd.DataFrame(temp)
    print(df)
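For comparison, the same per-day split can be done without the manual dictionary loop (a sketch under the same assumptions about the data):
df['Time'] = pd.to_datetime(df['Time'])
m = df['Time'].dt.hour.between(5, 6)  # integer hours 5-6, i.e. > 4 and < 7
sub = df[m]
dfs = {day: g for day, g in sub.groupby(sub['Time'].dt.date)}
# dfs is keyed by datetime.date; each value holds that day's 05:00-06:59 rows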

Convert datetime pandas

Below is a sample of my df
date value
0006-03-01 00:00:00 1
0006-03-15 00:00:00 2
0006-05-15 00:00:00 1
0006-07-01 00:00:00 3
0006-11-01 00:00:00 1
2009-05-20 00:00:00 2
2009-05-25 00:00:00 8
2020-06-24 00:00:00 1
2020-06-30 00:00:00 2
2020-07-01 00:00:00 13
2020-07-15 00:00:00 2
2020-08-01 00:00:00 4
2020-10-01 00:00:00 2
2020-11-01 00:00:00 4
2023-04-01 00:00:00 1
2218-11-12 10:00:27 1
4000-01-01 00:00:00 6
5492-04-15 00:00:00 1
5496-03-15 00:00:00 1
5589-12-01 00:00:00 1
7199-05-15 00:00:00 1
9186-12-30 00:00:00 1
As you can see, the data contains some misspelled dates.
Questions:
How can we convert this column to the format dd.mm.yyyy?
How can we replace dates whose year is greater than 2022 with 01.01.2100?
How can we remove all rows whose year is less than 2005?
The final output should look like this.
date value
20.05.2009 2
25.05.2009 8
24.06.2020 1
30.06.2020 2
01.07.2020 13
15.07.2020 2
01.08.2020 4
01.10.2020 2
01.11.2020 4
01.01.2100 1
01.01.2100 1
01.01.2100 1
01.01.2100 1
01.01.2100 1
01.01.2100 1
01.01.2100 1
01.01.2100 1
I tried to convert the column using to_datetime but it failed.
df[col] = pd.to_datetime(df[col], infer_datetime_format=True)
Out of bounds nanosecond timestamp: 5-03-01 00:00:00
Thanks to anyone helping!
You could check the first element of your datetime strings after a split on '-' and clean up / replace based on its integer value. For the small values like '0006', calling pd.to_datetime with errors='coerce' will do the trick: it leaves NaT for the invalid dates, and you can drop those with dropna(). Example:
import pandas as pd
df = pd.DataFrame({'date': ['0006-03-01 00:00:00',
'0006-03-15 00:00:00',
'0006-05-15 00:00:00',
'0006-07-01 00:00:00',
'0006-11-01 00:00:00',
'nan',
'2009-05-25 00:00:00',
'2020-06-24 00:00:00',
'2020-06-30 00:00:00',
'2020-07-01 00:00:00',
'2020-07-15 00:00:00',
'2020-08-01 00:00:00',
'2020-10-01 00:00:00',
'2020-11-01 00:00:00',
'2023-04-01 00:00:00',
'2218-11-12 10:00:27',
'4000-01-01 00:00:00',
'NaN',
'5496-03-15 00:00:00',
'5589-12-01 00:00:00',
'7199-05-15 00:00:00',
'9186-12-30 00:00:00']})
# first, drop rows where 'date' contains 'nan' (case-insensitive):
df = df.loc[~df['date'].str.contains('nan', case=False)]
# now replace strings where the year is above a threshold:
df.loc[df['date'].str.split('-').str[0].astype(int) > 2022, 'date'] = '2100-01-01 00:00:00'
# convert to datetime, if year is too low, will result in NaT:
df['date'] = pd.to_datetime(df['date'], errors='coerce')
# df['date']
# 0 NaT
# 1 NaT
# 2 NaT
# 3 NaT
# 4 NaT
# 6 2009-05-25
# 7 2020-06-24
# ...
df = df.dropna()
# df
# date
# 6 2009-05-25
# 7 2020-06-24
# 8 2020-06-30
# 9 2020-07-01
# 10 2020-07-15
# 11 2020-08-01
# 12 2020-10-01
# 13 2020-11-01
# 14 2100-01-01
# 15 2100-01-01
# ...
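To finish with the dd.mm.yyyy strings that the question asks for, one more step should do it (my addition, not part of the answer above):
df['date'] = df['date'].dt.strftime('%d.%m.%Y')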
Due to the limitations of pandas timestamps, the out-of-bounds error is thrown (see https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html). The code below removes values that would cause this error before creating the dataframe.
import datetime as dt
import pandas as pd
data = [[dt.datetime(year=2022, month=3, day=1), 1],
[dt.datetime(year=2009, month=5, day=20), 2],
[dt.datetime(year=2001, month=5, day=20), 2],
[dt.datetime(year=2023, month=12, day=30), 3],
[dt.datetime(year=6, month=12, day=30), 3]]
dataCleaned = [elements for elements in data if pd.Timestamp.max > elements[0] > pd.Timestamp.min]
df = pd.DataFrame(dataCleaned, columns=['date', 'Value'])
print(df)
# OUTPUT
date Value
0 2022-03-01 1
1 2009-05-20 2
2 2001-05-20 2
3 2023-12-30 3
df.loc[df.date.dt.year > 2022, 'date'] = dt.datetime(year=2100, month=1, day=1)
df.drop(df.loc[df.date.dt.year < 2005, 'date'].index, inplace=True)
print(df)
#OUTPUT
         date  Value
0  2022-03-01      1
1  2009-05-20      2
3  2100-01-01      3
If you still want to include the dates that throw the out of bounds error, check out How to work around Python Pandas DataFrame's "Out of bounds nanosecond timestamp" error?
I suggest the following:
import numpy as np
import pandas as pd

df = pd.DataFrame.from_dict({'date': ['0003-03-01 00:00:00',
                                      '7199-05-15 00:00:00',
                                      '2020-10-21 00:00:00'],
                             'value': [1, 2, 3]})
df['date'] = [d[8:10] + '.' + d[5:7] + '.' + d[:4] if '2004' < d[:4] < '2023'
              else '01.01.2100' if d[:4] > '2022' else np.nan
              for d in df['date']]
df.dropna(inplace=True)
This yields the desired output:
date value
01.01.2100 2
21.10.2020 3

pandas averaging data between timestamps

If I have some data (24 hours time series) read into Pandas:
import pandas as pd
import numpy as np
#read CSV file
df = pd.read_csv('https://raw.githubusercontent.com/bbartling/Building-Demand-Electrical-Load-Profiles/master/july15.csv',
index_col='Date', parse_dates=True)
How can I average the df column kW between these time stamps into a new separate pandas df?
bkps_timestamps_kW = [
'2013-06-19 00:15:00',
'2013-06-19 05:15:00',
'2013-06-19 16:30:00',
'2014-06-18 00:00:00']
The new pandas df could have column names like avg_kw1, avg_kw2, avg_kw3, representing the averages of the data between the time stamps in bkps_timestamps_kW.
Thanks for any help/tips!
I think you need cut for binning by your list converted to datetimes, then aggregate mean:
d = [
'2013-06-19 00:00:00',
'2013-06-19 00:15:00',
'2013-06-19 01:15:00',
'2013-06-19 05:15:00',
'2013-06-19 07:15:00',
'2013-06-19 16:30:00',
'2013-06-20 16:30:00',
'2014-06-18 00:00:00',
'2015-06-18 00:00:00']
df = pd.DataFrame({'Date':range(len(d))}, index=pd.to_datetime(d))
print (df)
Date
2013-06-19 00:00:00 0
2013-06-19 00:15:00 1
2013-06-19 01:15:00 2
2013-06-19 05:15:00 3
2013-06-19 07:15:00 4
2013-06-19 16:30:00 5
2013-06-20 16:30:00 6
2014-06-18 00:00:00 7
2015-06-18 00:00:00 8
bkps_timestamps_kW = [
'2013-06-19 00:15:00',
'2013-06-19 05:15:00',
'2013-06-19 16:30:00',
'2014-06-18 00:00:00']
b = pd.to_datetime(bkps_timestamps_kW)
labels = [f'{i}-{j}' for i, j in zip(bkps_timestamps_kW[:-1], bkps_timestamps_kW[1:])]
df = df.groupby(pd.cut(df.index, bins=b, labels=labels)).mean()
print (df)
Date
2013-06-19 00:15:00-2013-06-19 05:15:00 2.5
2013-06-19 05:15:00-2013-06-19 16:30:00 4.5
2013-06-19 16:30:00-2014-06-18 00:00:00 6.5
If you need left-closed intervals in cut:
df = df.groupby(pd.cut(df.index, bins=b, labels=labels, right=False)).mean()
print (df)
Date
2013-06-19 00:15:00-2013-06-19 05:15:00 1.5
2013-06-19 05:15:00-2013-06-19 16:30:00 3.5
2013-06-19 16:30:00-2014-06-18 00:00:00 5.5
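Applied to the question's CSV (the df read above, with its DatetimeIndex), the same pattern would look something like this; a sketch that assumes the power column is named kW, as in the question:
b = pd.to_datetime(bkps_timestamps_kW)
labels = ['avg_kw1', 'avg_kw2', 'avg_kw3']
avg_kw = df.groupby(pd.cut(df.index, bins=b, labels=labels))['kW'].mean()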
