How to get matching values between two DataFrames without iterating - python

I have a pandas DataFrame called df1, which looks like:
value  analysis_date        hour  error
7      2000-01-01 00:00:00  9     None
8      2000-01-01 00:00:00  10    None
9      2000-01-01 00:00:00  11    None
And a second DataFrame, df2:
value  analysis_date        hour  error
4      2000-01-01 09:00:00  1     None
5      2019-01-01 00:00:00  2     None
6      2000-01-01 08:00:00  3     None
I want to compare 'corresponding' rows, meaning rows in which analysis_date + hour are equivalent between df1 and df2; here, df1 rows 2 and 3 correspond to df2 rows 1 and 3 respectively (00:00 + 10h = 09:00 + 1h, and 00:00 + 11h = 08:00 + 3h).
Then I want to set the error column in df1 to df1['value'][row] - df2['value'][row] for each corresponding row. So in this case, df1 should end up looking like this:
value  analysis_date        hour  error
7      2000-01-01 00:00:00  9     None
8      2000-01-01 00:00:00  10    4
9      2000-01-01 00:00:00  11    3
Is there a way I can do this beyond looping through every single row and individually comparing them using iterrows()?

You could go about it like this:
import pandas as pd

df1['analysis_date'] = pd.to_datetime(df1['analysis_date'])
df2['analysis_date'] = pd.to_datetime(df2['analysis_date'])

# combine date and integer hour into one timestamp;
# pd.to_timedelta is the portable spelling (the astype('timedelta64[h]')
# trick was deprecated in pandas 1.5 and no longer works in modern versions)
df1['total_date'] = df1.analysis_date + pd.to_timedelta(df1.hour, unit='h')
df2['total_date'] = df2.analysis_date + pd.to_timedelta(df2.hour, unit='h')

# left-merge on the combined timestamp; df1 rows without a match get NaN
mr_df = df1.merge(df2.loc[:, ['value', 'total_date']], on='total_date', how='left')
df1['error'] = mr_df['value_x'] - mr_df['value_y']
df1
#    value analysis_date  hour  error  total_date
# 0  7     2000-01-01     9     NaN    2000-01-01 09:00:00
# 1  8     2000-01-01     10    4.0    2000-01-01 10:00:00
# 2  9     2000-01-01     11    3.0    2000-01-01 11:00:00
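An alternative that avoids the merge entirely is to build a lookup Series keyed by total_date and map it onto df1. A minimal sketch, assuming total_date is unique in df2 (map raises on duplicate index labels):
lookup = df2.set_index('total_date')['value']                # value keyed by combined timestamp
df1['error'] = df1['value'] - df1['total_date'].map(lookup)  # non-matches become NaN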

datetime hour component to column python pandas

I have a dataframe like this:
Date                 Value
2022-01-01 10:00:00  7
2022-01-01 10:30:00  5
2022-01-01 11:00:00  3
....
....
2022-02-15 21:00:00  8
I would like to convert it to a day-per-row and time-per-column format: the times become the columns, and the Value column fills the cells.
Date        10:00  10:30  11:00  11:30 ... 21:00
2022-01-01  7      5      3      4         11
2022-01-02  8      2      4      4         13
How can I achieve this? I have tried pivot_table but without success.
Use pivot_table:
df['Date'] = pd.to_datetime(df['Date'])
out = df.pivot_table('Value', df['Date'].dt.date, df['Date'].dt.time, fill_value=0)
print(out)
# Output
Date        10:00:00  10:30:00  11:00:00  21:00:00
Date
2022-01-01  7         5         3         0
2022-02-15  0         0         0         8
To remove Date labels, you can use rename_axis:
for the top Date label: out.rename_axis(columns=None)
for the bottom Date label: out.rename_axis(index=None)
for both: out.rename_axis(index=None, columns=None)
You can replace None with any string to rename the axis.
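Putting the pivot and the axis cleanup together in one chain (the same steps as above):
out = (df.pivot_table('Value', df['Date'].dt.date, df['Date'].dt.time, fill_value=0)
         .rename_axis(index=None, columns=None))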

Drop overlapping periods less than 6 months in pandas dataframe

I have the following Pandas dataframe, and for each customer I want to drop the rows where the difference between dates is less than 6 months. For example, for the customer with ID 1 I want to keep only 2017-07-01, 2018-01-01, and 2018-08-01.
Customer_ID Date
1 2017-07-01
1 2017-08-01
1 2017-09-01
1 2017-10-01
1 2017-11-01
1 2017-12-01
1 2018-01-01
1 2018-02-01
1 2018-03-01
1 2018-04-01
1 2018-06-01
1 2018-08-01
2 2018-11-01
2 2019-02-01
2 2019-03-01
2 2019-05-01
2 2020-02-01
2 2020-05-01
Define the following function to process each group of rows (one group per customer; it assumes the Date column is already datetime64):
def selDates(grp):
    res = []
    while grp.size > 0:
        stRow = grp.iloc[0]    # keep the earliest remaining row
        res.append(stRow)
        # discard everything less than 6 months after the kept row
        grp = grp[grp.Date >= stRow.Date + pd.DateOffset(months=6)]
    return pd.DataFrame(res)
Then apply this function to each group:
result = df.groupby('Customer_ID', group_keys=False).apply(selDates)
The result, for your data sample, is:
Customer_ID Date
0 1 2017-07-01
6 1 2018-01-01
11 1 2018-08-01
12 2 2018-11-01
15 2 2019-05-01
16 2 2020-02-01
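The row labels (0, 6, 11, ...) are carried over from the original frame; if you prefer a fresh integer index, reset it:
result = result.reset_index(drop=True)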

How to use groupby() with between_time()?

I have a DataFrame and want to multiply all values in column a for a given day by the value of a at 06:00:00 of that day. If there is no 06:00:00 entry, that day should stay unchanged.
The code below unfortunately gives an error.
How do I have to correct this code / replace it with any working solution?
import pandas as pd
import numpy as np

start = pd.Timestamp('2000-01-01')
end = pd.Timestamp('2000-01-03')
t = np.linspace(start.value, end.value, 9)
datetime1 = pd.to_datetime(t)
df = pd.DataFrame({'a': [1, 3, 4, 5, 6, 7, 8, 9, 14]})
df['date'] = datetime1
print(df)

def myF(x):
    y = x.set_index('date').between_time('05:59', '06:01').a
    return y

toMultiplyWith = df.groupby(df.date.dt.floor('D')).transform(myF)
a date
0 1 2000-01-01 00:00:00
1 3 2000-01-01 06:00:00
2 4 2000-01-01 12:00:00
3 5 2000-01-01 18:00:00
4 6 2000-01-02 00:00:00
5 7 2000-01-02 06:00:00
6 8 2000-01-02 12:00:00
7 9 2000-01-02 18:00:00
8 14 2000-01-03 00:00:00
....
AttributeError: ("'Series' object has no attribute 'set_index'", 'occurred at index a')
You should change this line:
toMultiplyWith = df.groupby(df.date.dt.floor('D')).transform(myF)
to this:
toMultiplyWith = df.groupby(df.date.dt.floor('D')).apply(myF)
Using .apply instead of .transform will give you the desired result.
apply is the right choice here, since it passes each group to the custom function as a whole DataFrame, with all its columns.
To read more about the difference between the two methods, consider this answer.
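A quick way to see the difference (a minimal sketch over the df built above, on a recent pandas): transform hands the function one column at a time as a Series, which is exactly why set_index('date') blows up, while apply receives the whole group:
groups = df.groupby(df.date.dt.floor('D'))
print(groups.apply(lambda g: type(g).__name__).unique())     # ['DataFrame']
print(groups.transform(lambda c: type(c).__name__).iloc[0])  # every column arrives as a Series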
If you want to stick with the between_time(...) function, this would be the way to do it:
df = df.set_index('date')
mask = df.between_time('05:59', '06:01').index
df.loc[mask, 'a'] = df.loc[mask, 'a'] ** 2 # the operation you want to perform
df.reset_index(inplace=True)
Outputs:
date a
0 2000-01-01 00:00:00 1
1 2000-01-01 06:00:00 9
2 2000-01-01 12:00:00 4
3 2000-01-01 18:00:00 5
4 2000-01-02 00:00:00 6
5 2000-01-02 06:00:00 49
6 2000-01-02 12:00:00 8
7 2000-01-02 18:00:00 9
8 2000-01-03 00:00:00 14
If I got your goal right, you can use apply to return a DataFrame with the same number of rows as the original (simulating a transform):
def myF(grp):
    time = grp.date.dt.strftime('%T')   # 'HH:MM:SS'
    target_idx = time == '06:00:00'
    if target_idx.any():
        grp.loc[~target_idx, 'a_sum'] = grp.loc[~target_idx, 'a'].values * grp.loc[target_idx, 'a'].values
    else:
        grp.loc[~target_idx, 'a_sum'] = np.nan
    return grp
df.groupby(df.date.dt.floor('D')).apply(myF)
Output:
a date a_sum
0 1 2000-01-01 00:00:00 3.0
1 3 2000-01-01 06:00:00 NaN
2 4 2000-01-01 12:00:00 12.0
3 5 2000-01-01 18:00:00 15.0
4 6 2000-01-02 00:00:00 42.0
5 7 2000-01-02 06:00:00 NaN
6 8 2000-01-02 12:00:00 56.0
7 9 2000-01-02 18:00:00 63.0
8 14 2000-01-03 00:00:00 NaN
See that, for each day, each value with a time other than 06:00:00 is multiplied by the value at 06:00:00. It returns NaN for the 06:00:00 values themselves, as well as for the groups without that time.
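If you prefer a fully vectorized route, here is a hedged sketch that builds one scaling factor per day and maps it back (it assumes at most one 06:00:00 row per day, as in the sample, and it scales the 06:00:00 rows by themselves rather than leaving them NaN):
is_six = df['date'].dt.time == pd.Timestamp('06:00').time()   # the 06:00:00 rows
six = df.loc[is_six]
factor = df['date'].dt.floor('D').map(six.set_index(six['date'].dt.floor('D'))['a'])
df['a_scaled'] = df['a'] * factor.fillna(1)   # days without a 06:00 row stay unchanged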

python pandas dataframe how to apply a function to each time period

I have the following dataframe,
df = pd.DataFrame({'col1':range(9), 'col2': list(range(7)) + [np.nan] *2},
index = pd.date_range('1/1/2000', periods=9, freq='0.5S'))
df
Out[109]:
col1 col2
2000-01-01 00:00:00.000 0 0.0
2000-01-01 00:00:00.500 1 1.0
2000-01-01 00:00:01.000 2 2.0
2000-01-01 00:00:01.500 3 3.0
2000-01-01 00:00:02.000 4 4.0
2000-01-01 00:00:02.500 5 5.0
2000-01-01 00:00:03.000 6 6.0
2000-01-01 00:00:03.500 7 NaN
2000-01-01 00:00:04.000 8 NaN
As can be seen above, there are two data points per second. What I would like to do is, for the two rows within a second: if both columns in the later row hold valid numbers, choose that row; if either column in the later row is invalid, check whether the earlier row is valid for both columns and, if so, choose it; otherwise skip that second. The resulting dataframe looks like this:
col1 col2
2000-01-01 00:00:00.000 1 1.0
2000-01-01 00:00:01.000 3 3.0
2000-01-01 00:00:02.000 5 5.0
2000-01-01 00:00:03.000 6 6.0
How to achieve this?
Here is one way using reindex: after dropna we reindex against the original index, so the dropped rows come back with NaN in both columns. In that situation, last will not select any item from those rows (related to your previous question).
df.dropna().reindex(df.index).resample('1s').last().dropna()
Out[175]:
col1 col2
2000-01-01 00:00:00 1.0 1.0
2000-01-01 00:00:01 3.0 3.0
2000-01-01 00:00:02 5.0 5.0
2000-01-01 00:00:03 6.0 6.0
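The same chain spelled out step by step (a sketch over the df defined above; note that after the reindex every row is either fully valid or fully NaN, so last never mixes values from different rows):
valid = df.dropna()                  # keep only rows where both columns are valid
aligned = valid.reindex(df.index)    # restore the original index; dropped rows become all-NaN
out = (aligned
       .resample('1s').last()        # per second, take the last non-NaN value in each column
       .dropna())                    # drop seconds where neither row survived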

How to compare two Dataframes on a column and replace with other column value

I have two input dataframes, df1 and df2:
df1:
id  first       last        size
A   1978-01-01  1979-01-01  2
B   2000-01-01  2000-01-01  1
C   1998-01-01  2000-01-01  3
D   1998-01-01  1998-01-01  1
E   1999-01-01  2000-01-01  2
df2:
id  token
A   ZA.00
B   As.11
C   SD.34
My desired output:
id     first       last        size
ZA.00  1978-01-01  1979-01-01  2
As.11  2000-01-01  2000-01-01  1
SD.34  1998-01-01  2000-01-01  3
D      1998-01-01  1998-01-01  1
E      1999-01-01  2000-01-01  2
If df2['id'] matches df1['id'] then replace df1['id'] with df2['token']. How can I achieve this?
Use map and fillna:
df1['id'] = df1['id'].map(df2.set_index('id')['token']).fillna(df1['id'])
df1
Output:
id first last size
0 ZA.00 1978-01-01 1979-01-01 2
1 As.11 2000-01-01 2000-01-01 1
2 SD.34 1998-01-01 2000-01-01 3
3 D 1998-01-01 1998-01-01 1
4 E 1999-01-01 2000-01-01 2
You can use map with a series as an argument.
Using merge and combine_first:
df = df1.merge(df2, how='outer')
df['id'] = df['token'].combine_first(df['id'])
df.drop('token', inplace=True, axis=1)
Another way is to use replace with a dictionary built from df2.values; note that this modifies df1 in place:
df1.id.replace(dict(df2.values), inplace=True)
id first last size
0 ZA.00 1978-01-01 1979-01-01 2
1 As.11 2000-01-01 2000-01-01 1
2 SD.34 1998-01-01 2000-01-01 3
3 D 1998-01-01 1998-01-01 1
4 E 1999-01-01 2000-01-01 2
If you do not wish to merge your DataFrames, you could use apply to solve this: turn the small DataFrame into a dictionary and map it onto the other DataFrame.
from io import StringIO  # used to build a df from a string
import pandas as pd

id_ = list('ABC')
token = 'ZA.00 As.11 SD.34'.split()
dt = pd.DataFrame(list(zip(id_, token)), columns=['id', 'token'])

a = '''
id first last size
A 1978-01-01 1979-01-01 2
B 2000-01-01 2000-01-01 1
C 1998-01-01 2000-01-01 3
D 1998-01-01 1998-01-01 1
E 1999-01-01 2000-01-01 2
'''
df = pd.read_csv(StringIO(a), sep=' ')

# These last two lines are all you need
mp = dict(zip(dt.id, dt.token))
df.id.apply(lambda x: mp.get(x, x))
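Note that apply returns a new Series rather than modifying df in place, so assign the result back if you want to keep it:
df['id'] = df.id.apply(lambda x: mp.get(x, x))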
