replace dataframe values based on another dataframe - python

I have a pandas dataframe which is structured as follows:
timestamp y
0 2020-01-01 00:00:00 336.0
1 2020-01-01 00:15:00 544.0
2 2020-01-01 00:30:00 736.0
3 2020-01-01 00:45:00 924.0
4 2020-01-01 01:00:00 1260.0
...
The timestamp column is a datetime data type
and I have another dataframe with the following structure:
y
timestamp
00:00:00 625.076923
00:15:00 628.461538
00:30:00 557.692308
00:45:00 501.692308
01:00:00 494.615385
...
In this case, the time is the pandas datetime index.
Now what I want to do is replace the values in the first dataframe wherever the time of day matches the times in the second dataframe.

IIUC, your first dataframe df1 has a timestamp column of datetime type, and your second dataframe (df2) is indexed by time of day (time only, no date).
Then you can do:
df1['y'] = df1['timestamp'].dt.time.map(df2['y'])
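As a quick check, here is a minimal runnable sketch of this approach, with hypothetical sample data mirroring the question (df2 indexed by datetime.time objects):
import pandas as pd

# Hypothetical sample data mirroring the question
df1 = pd.DataFrame({
    'timestamp': pd.to_datetime(['2020-01-01 00:00:00', '2020-01-01 00:15:00']),
    'y': [336.0, 544.0],
})
df2 = pd.DataFrame(
    {'y': [625.076923, 628.461538]},
    index=pd.to_datetime(['00:00:00', '00:15:00'], format='%H:%M:%S').time,
)

# .dt.time extracts datetime.time objects; Series.map looks each one up
# in df2's index and returns the matching df2['y'] value
df1['y'] = df1['timestamp'].dt.time.map(df2['y'])
print(df1)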

I wouldn't be surprised if there is a better way, but you can accomplish this by reshaping the tables so that they can merge on the time. Assuming your dataframes are df and df2:
df['time'] = df['timestamp'].dt.time
df2 = df2.reset_index()
df2['timestamp'] = pd.to_datetime(df2['timestamp'].astype(str)).dt.time  # ensure plain time objects to match on
df_combined = pd.merge(df,df2,left_on='time',right_on='timestamp')
df_combined
timestamp_x y_x time timestamp_y y_y
0 2020-01-01 00:00:00 336.0 00:00:00 00:00:00 625.076923
1 2020-01-01 00:15:00 544.0 00:15:00 00:15:00 628.461538
2 2020-01-01 00:30:00 736.0 00:30:00 00:30:00 557.692308
3 2020-01-01 00:45:00 924.0 00:45:00 00:45:00 501.692308
4 2020-01-01 01:00:00 1260.0 01:00:00 01:00:00 494.615385
# This clearly has more than you need, so just keep what you want and rename things back.
df_combined = df_combined[['timestamp_x','y_y']]
df_combined = df_combined.rename(columns={'timestamp_x':'timestamp','y_y':'y'})

New answer I like way better: actually using .map().
You still need to get df2 to have a time column to match on.
df2 = df2.reset_index()
df2['timestamp'] = pd.to_datetime(df2['timestamp'].astype(str)).dt.time  # ensure plain time objects to match on
df['y'] = df['timestamp'].dt.time.map(dict(zip(df2['timestamp'], df2['y'])))
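Note that dict(zip(df2['timestamp'], df2['y'])) just builds a plain {time: value} lookup table; Series.map accepts a dict or a Series interchangeably here, so this is equivalent to the .map(df2['y']) approach in the first answer.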

Related

DataFrame Pandas datetime error: hour must be in 0..23

I have the following time series and I want to convert it to datetime in a DataFrame using pd.to_datetime. I am getting the following error: "hour must be in 0..23: 2017/ 01/01 24:00:00". How can I get around this error?
DateTime
0 2017/ 01/01 01:00:00
1 2017/ 01/01 02:00:00
2 2017/ 01/01 03:00:00
3 2017/ 01/01 04:00:00
...
22 2017/ 01/01 23:00:00
23 2017/ 01/01 24:00:00
Given:
DateTime
0 2017/01/01 01:00:00
1 2017/01/01 02:00:00
2 2017/01/01 03:00:00
3 2017/01/01 04:00:00
4 2017/01/01 23:00:00
5 2017/01/01 24:00:00
As the error says, 24:00:00 isn't a valid time. Depending on what it actually means, we can salvage it like this:
# Split up your Date and Time Values into separate Columns:
df[['Date', 'Time']] = df.DateTime.str.split(expand=True)
# Convert them separately, one as datetime, the other as timedelta.
df.Date = pd.to_datetime(df.Date)
df.Time = pd.to_timedelta(df.Time)
# Fix your DateTime Column, Drop the helper Columns:
df.DateTime = df.Date + df.Time
df = df.drop(['Date', 'Time'], axis=1)
print(df)
print(df.dtypes)
Output:
DateTime
0 2017-01-01 01:00:00
1 2017-01-01 02:00:00
2 2017-01-01 03:00:00
3 2017-01-01 04:00:00
4 2017-01-01 23:00:00
5 2017-01-02 00:00:00
DateTime datetime64[ns]
dtype: object
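The reason this works is that '24:00:00' is invalid as a time of day but perfectly valid as a duration. A small sketch to illustrate:
import pandas as pd

print(pd.to_timedelta('24:00:00'))                                 # 1 days 00:00:00
print(pd.to_datetime('2017-01-01') + pd.to_timedelta('24:00:00'))  # 2017-01-02 00:00:00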
df['DateTime'] = pd.to_datetime(df['DateTime'], format='%Y/%m/%d %H:%M:%S', errors='coerce')
Try this out! Note that errors='coerce' turns the invalid 24:00:00 row into NaT rather than rolling it over into the next day.
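For contrast with the answer above, a minimal sketch showing what the coerce approach does with the invalid row:
import pandas as pd

s = pd.Series(['2017/01/01 23:00:00', '2017/01/01 24:00:00'])
# The unparseable 24:00:00 row becomes NaT instead of raising an error
print(pd.to_datetime(s, format='%Y/%m/%d %H:%M:%S', errors='coerce'))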

Efficient way of dividing a column by the mean of given groups in Pandas

I have a dataframe with hourly values for several years. My dataframe is already in datetime format, and the column containing the values is called, say, "value column".
date = ['2015-02-03 23:00:00','2015-02-03 23:30:00','2015-02-04 00:00:00','2015-02-04 00:30:00']
value_column = [33.24, 31.71, 34.39, 34.49]
df = pd.DataFrame({'index': date, 'value column': value_column})
df.index = pd.to_datetime(df['index'], format='%Y-%m-%d %H:%M:%S')
df.drop(['index'], axis=1, inplace=True)
print(df.head())
value column
index
2015-02-03 23:00:00 33.24
2015-02-03 23:30:00 31.71
2015-02-04 00:00:00 34.39
2015-02-04 00:30:00 34.49
I know how to get the mean of the "value column" for each year efficiently with for instance the following command:
df = df.groupby(df.index.year).mean()
Now, I would like to divide all hourly values of the column "value column" by the mean of its values for its corresponding year (for instance dividing all the 2015 hourly values by the mean of 2015 values, and same for the other years).
Is there an efficient way to do that in pandas?
Expected result:
value column Value column/mean of year
index
2015-02-03 23:00:00 33.24 0.993499
2015-02-03 23:30:00 31.71 0.94777
2015-02-04 00:00:00 34.39 1.027871
2015-02-04 00:30:00 34.49 1.03086
Many thanks,
Try the following:
df.groupby(df.index.year).transform(lambda x: x/x.mean())
Refer: Group By: split-apply-combine
Transformation is recommended as it is meant to perform some group-specific computations and return a like-indexed object.
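A minimal runnable sketch of the transform approach, with hypothetical two-year sample data:
import pandas as pd

idx = pd.to_datetime(['2015-06-01', '2015-06-02', '2016-06-01', '2016-06-02'])
df = pd.DataFrame({'value column': [10.0, 30.0, 5.0, 15.0]}, index=idx)

# transform returns a like-indexed object, so it can be assigned straight back
df['ratio'] = df.groupby(df.index.year)['value column'].transform(lambda x: x / x.mean())
print(df)  # 2015 mean is 20.0 -> ratios 0.5, 1.5; 2016 mean is 10.0 -> 0.5, 1.5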
I just found another way, which I'm not sure I fully understand, but it works!
df['result'] = df['value column'].groupby(df.index.year).apply(lambda x: x/x.mean())
I thought that in apply functions x referred to single values of the array, but with a groupby it refers to each group as a Series.
You should be able to do:
df = (df.set_index(df.index.year)/df.groupby(df.index.year).mean()).set_index(df.index)
So you set the index to be the year in order to divide by the groupby object, and then reset the index to keep the original timestamps.
Full example:
import pandas as pd
import numpy as np
np.random.seed(1)
dr = pd.date_range('1-1-2010','1-1-2020', freq='H')
df = pd.DataFrame({'value column':np.random.rand(len(dr))}, index=dr)
print(df, '\n')
print(df.groupby(df.index.year).mean(), '\n')
df = (df.set_index(df.index.year)/df.groupby(df.index.year).mean()).set_index(df.index)
print(df)
Output:
#original data
value column
2010-01-01 00:00:00 0.417022
2010-01-01 01:00:00 0.720324
2010-01-01 02:00:00 0.000114
2010-01-01 03:00:00 0.302333
2010-01-01 04:00:00 0.146756
...
2019-12-31 20:00:00 0.530828
2019-12-31 21:00:00 0.224505
2019-12-31 22:00:00 0.459977
2019-12-31 23:00:00 0.931504
2020-01-01 00:00:00 0.581869
[87649 rows x 1 columns]
#grouped by year
value column
2010 0.497135
2011 0.503547
2012 0.501023
2013 0.497848
2014 0.497065
2015 0.501417
2016 0.498303
2017 0.499266
2018 0.499533
2019 0.492220
2020 0.581869
#final output
value column
2010-01-01 00:00:00 0.838851
2010-01-01 01:00:00 1.448952
2010-01-01 02:00:00 0.000230
2010-01-01 03:00:00 0.608150
2010-01-01 04:00:00 0.295203
...
2019-12-31 20:00:00 1.078436
2019-12-31 21:00:00 0.456107
2019-12-31 22:00:00 0.934494
2019-12-31 23:00:00 1.892455
2020-01-01 00:00:00 1.000000
[87649 rows x 1 columns]
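What makes the division in this answer work is pandas index alignment: after set_index(df.index.year), every hourly row carries its year as a label, and dividing by the one-row-per-year means matches labels pairwise. A tiny sketch of that mechanism:
import pandas as pd

vals = pd.Series([10.0, 30.0], index=[2015, 2015])   # many rows per year
means = pd.Series([20.0], index=[2015])              # one mean per year
print(vals / means)  # aligns on the 2015 label: 0.5 and 1.5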

Delete all (hourly) day entries per row based on a daily table in python

I have a dataframe with a datetime64[ns] column containing hourly data:
Datum Values
2020-01-01 00:00:00 1
2020-01-01 01:00:00 10
....
2020-02-28 00:00:00 5
2020-03-01 00:00:00 4
and another table of closing days, also a datetime64[ns] column, but containing only dates:
Dates
2020-02-28
2020-02-29
....
How can I delete all rows in the first dataframe df whose day occurs in the second dataframe Dates? So that df is:
2020-01-01 00:00:00 1
2020-01-01 01:00:00 10
....
2020-03-01 00:00:00 4
Use Series.dt.floor to set the times to 0, so it is possible to filter by Series.isin with an inverted mask in boolean indexing:
df['Datum'] = pd.to_datetime(df['Datum'])
df1['Dates'] = pd.to_datetime(df1['Dates'])
df = df[~df['Datum'].dt.floor('d').isin(df1['Dates'])]
print (df)
Datum Values
0 2020-01-01 00:00:00 1
1 2020-01-01 01:00:00 10
3 2020-03-01 00:00:00 4
EDIT: For a flag column, convert the mask to integers with Series.view or Series.astype:
df['flag'] = df['Datum'].dt.floor('d').isin(df1['Dates']).view('i1')
#alternative
#df['flag'] = df['Datum'].dt.floor('d').isin(df1['Dates']).astype('int')
print (df)
Datum Values flag
0 2020-01-01 00:00:00 1 0
1 2020-01-01 01:00:00 10 0
2 2020-02-28 00:00:00 5 1
3 2020-03-01 00:00:00 4 0
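To make the first snippet reproducible, here is a self-contained sketch with hypothetical frames mirroring the question:
import pandas as pd

df = pd.DataFrame({
    'Datum': pd.to_datetime(['2020-01-01 00:00:00', '2020-01-01 01:00:00',
                             '2020-02-28 00:00:00', '2020-03-01 00:00:00']),
    'Values': [1, 10, 5, 4],
})
df1 = pd.DataFrame({'Dates': pd.to_datetime(['2020-02-28', '2020-02-29'])})

# Floor each timestamp to midnight, then drop rows whose day is a closing day
print(df[~df['Datum'].dt.floor('d').isin(df1['Dates'])])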
Taking your added comment into consideration:
Build a "|"-joined string of the Dates in df1:
c = "|".join(df1.Dates.values)
c
Coerce Datum to datetime:
df['Datum'] = pd.to_datetime(df['Datum'])
df.dtypes
Extract Datum as a Dates column, dtype string:
df.set_index(df['Datum'], inplace=True)
df['Dates'] = df.index.date.astype(str)
Boolean-select dates that are in both:
m = df.Dates.str.contains(c)
m
Mark inclusive dates as 0 and exclusive as 1:
df['drop'] = np.where(m, 0, 1)  # numpy imported as np
df
Drop unwanted rows:
df.reset_index(drop=True).drop(columns=['Dates'])

Copying column from one data frame to another based on matching of combination of two columns

I have got two dataframes (i.e. df1 and df2).
df1 contains date and time columns. The time column contains a time series at 30-minute intervals:
df1:
date time
0 2015-04-01 00:00:00
1 2015-04-01 00:30:00
2 2015-04-01 01:00:00
3 2015-04-01 01:30:00
4 2015-04-01 02:00:00
df2 contains date, start-time, end-time, value:
df2
INCIDENT_DATE INTERRUPTION_TIME RESTORE_TIME WASTED_MINUTES
0 2015-04-01 00:32 01:15 1056.0
1 2015-04-01 01:20 02:30 3234.0
2 2015-04-01 01:22 03:30 3712.0
3 2015-04-01 01:30 03:15 3045.0
Now I want to copy the WASTED_MINUTES column from df2 to df1 when the date columns of both dataframes are the same and the INTERRUPTION_TIME of df2 falls within the 30-minute interval of df1's time column. So the output should look like:
df1:
date time Wasted_columns
0 2015-04-01 00:00:00 NaN
1 2015-04-01 00:30:00 1056.0
2 2015-04-01 01:00:00 6946.0
3 2015-04-01 01:30:00 3045.0
4 2015-04-01 02:00:00 NaN
I tried the merge command (on the basis of the date column), but it didn't produce the desired result, because I am not sure how to check whether a time falls in a 30-minute interval or not. Could anyone guide me on how to fix this?
Convert time to timedelta and assign it back to df1. Convert INTERRUPTION_TIME to timedelta, floor it to a 30-minute interval, and assign it to s. Group df2 by INCIDENT_DATE and s and take the sum of WASTED_MINUTES. Finally, join the result of the groupby back to df1:
df1['time'] = pd.to_timedelta(df1['time'].astype(str)) #cast to str before calling `to_timedelta`
s = pd.to_timedelta(df2.INTERRUPTION_TIME+':00').dt.floor('30Min')
df_final = df1.join(df2.groupby(['INCIDENT_DATE', s]).WASTED_MINUTES.sum(),
on=['date', 'time'])
Out[631]:
date time WASTED_MINUTES
0 2015-04-01 00:00:00 NaN
1 2015-04-01 00:30:00 1056.0
2 2015-04-01 01:00:00 6946.0
3 2015-04-01 01:30:00 3045.0
4 2015-04-01 02:00:00 NaN
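A self-contained version of this answer with hypothetical sample frames, for anyone who wants to run it:
import pandas as pd

df1 = pd.DataFrame({
    'date': ['2015-04-01'] * 5,
    'time': ['00:00:00', '00:30:00', '01:00:00', '01:30:00', '02:00:00'],
})
df2 = pd.DataFrame({
    'INCIDENT_DATE': ['2015-04-01'] * 4,
    'INTERRUPTION_TIME': ['00:32', '01:20', '01:22', '01:30'],
    'WASTED_MINUTES': [1056.0, 3234.0, 3712.0, 3045.0],
})

df1['time'] = pd.to_timedelta(df1['time'])                       # cast to timedelta
s = pd.to_timedelta(df2['INTERRUPTION_TIME'] + ':00').dt.floor('30min')
# Sum wasted minutes per (date, 30-minute slot), then join back onto df1
out = df1.join(df2.groupby(['INCIDENT_DATE', s])['WASTED_MINUTES'].sum(),
               on=['date', 'time'])
print(out)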
You can do this
df1['time']=pd.to_datetime(df1['time'])
df1['Wasted_columns']=df1.apply(lambda x: df2.loc[(pd.to_datetime(df2['INTERRUPTION_TIME'])>= x['time']) & (pd.to_datetime(df2['INTERRUPTION_TIME'])< x['time']+pd.Timedelta(minutes=30)),'WASTED_MINUTES'].sum(), axis=1)
df1['time']=df1['time'].dt.time
If you convert the 'time' column in the lambda function itself, then it is just one line of code as below
df1['Wasted_columns']=df1.apply(lambda x: df2.loc[(pd.to_datetime(df2['INTERRUPTION_TIME'])>= pd.to_datetime(x['time'])) & (pd.to_datetime(df2['INTERRUPTION_TIME'])< pd.to_datetime(x['time'])+pd.Timedelta(minutes=30)),'WASTED_MINUTES'].sum(), axis=1)
Output
date time Wasted_columns
0 2015-04-01 00:00:00 0.0
1 2015-04-01 00:30:00 1056.0
2 2015-04-01 01:00:00 6946.0
3 2015-04-01 01:30:00 3045.0
4 2015-04-01 02:00:00 0.0
The idea:
+ Convert to datetime
+ Round up to the next 30 mins
+ Merge
from datetime import datetime, timedelta

# Round dt up to the next multiple of delta
def ceil_dt(dt, delta):
    return dt + (datetime.min - dt) % delta

# Convert
df1['dt'] = (df1['date'] + ' ' + df1['time']).apply(datetime.strptime, args=['%Y-%m-%d %H:%M:%S'])
df2['dt'] = (df2['INCIDENT_DATE'] + ' ' + df2['INTERRUPTION_TIME']).apply(datetime.strptime, args=['%Y-%m-%d %H:%M'])
# Round
df2['dt'] = df2['dt'].apply(ceil_dt, args=[timedelta(minutes=30)])
# Merge
final = df1.merge(df2.loc[:, ['dt', 'WASTED_MINUTES']], on='dt', how='left')
Also, if multiple incidents happen in the same 30-minute timeframe, you would want to group df2 by the rounded dt column first to sum up WASTED_MINUTES, then merge.
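For what it's worth, a quick demonstration of the ceil_dt helper on its own:
from datetime import datetime, timedelta

def ceil_dt(dt, delta):
    # datetime.min - dt is a negative timedelta; Python's % keeps the result
    # in [0, delta), so adding it rounds dt up to the next multiple of delta
    return dt + (datetime.min - dt) % delta

print(ceil_dt(datetime(2015, 4, 1, 0, 32), timedelta(minutes=30)))  # 2015-04-01 01:00:00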

How to round date time index in a pandas data frame?

There is a pandas dataframe like this:
index
2018-06-01 02:50:00 R 45.48 -2.8
2018-06-01 07:13:00 R 45.85 -2.0
...
2018-06-01 08:37:00 R 45.87 -2.7
I would like to round the index to the hour like this:
index
2018-06-01 02:00:00 R 45.48 -2.8
2018-06-01 07:00:00 R 45.85 -2.0
...
2018-06-01 08:00:00 R 45.87 -2.7
I am trying the following code:
df = df.date_time.apply(lambda x: x.round('H'))
but this returns a Series instead of a dataframe with the modified index column.
Try using floor:
df.index.floor('H')
Setup:
df = pd.DataFrame(np.arange(25),index=pd.date_range('2018-01-01 01:12:50','2018-01-02 01:12:50',freq='H'),columns=['Value'])
df.head()
Value
2018-01-01 01:12:50 0
2018-01-01 02:12:50 1
2018-01-01 03:12:50 2
2018-01-01 04:12:50 3
2018-01-01 05:12:50 4
df.index = df.index.floor('H')
df.head()
Value
2018-01-01 01:00:00 0
2018-01-01 02:00:00 1
2018-01-01 03:00:00 2
2018-01-01 04:00:00 3
2018-01-01 05:00:00 4
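Note that the question's expected output is actually a floor (02:50 becomes 02:00, not 03:00), which is why floor('H') matches it; round('H') would send 02:50 to 03:00.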
Try my method:
Add a new column with the index rounded to the hour:
df['E'] = df.index.round('H')
Set it as the index:
df1 = df.set_index('E')
Delete the index name you set ('E' here):
df1.index.name = None
And now df1 is a new DataFrame whose index is the hour-rounded index of df.
Try this (it drops minutes and seconds, flooring to the hour; requires import datetime):
df['index'].apply(lambda dt: datetime.datetime(dt.year, dt.month, dt.day, dt.hour))
