Left join two dataframes with date columns on a range of dates? - python

Here's an example that fulfills the criteria. Dataframes df1 and df2 both have id columns and date columns. In df1 the id column is unique, while in df2 it is non-unique. I'd like to create a new dataframe where the join happens if df1.id == df2.id and df2 has a date within one week on either side of the date in df1 (one visit in the week before the purchase and one in the week after, as the comments below illustrate).
df1
| customer_id (unique) | purchase_date |
| -------- | -------------- |
| 1 | 2021-05-14 |
| 2 | 2021-09-16 |
df2
| customer_id | visit_dates |
| -------- | -------------- |
| 1 | 2021-05-11 |
| 1 | 2021-05-16 |
| 1 | 2021-05-21 |
| 2 | 2021-07-14 |
| 2 | 2021-09-17 |
# The new df will have only 1 row.
# For customer 1 there is a date within the range (05-07 -> 05-14)
# and within the range (05-14 -> 05-21) with a matching id.
# For customer 2 there are no dates within (09-09 -> 09-16), so it is filtered out.
newdf
| customer_id (unique) | purchase_date | begin_date_range | end_date_range |
| -------- | -------------- | ---------------- | -------------- |
| 1 | 2021-05-14 | 2021-05-11 | 2021-05-16 |
I understand how to do this in SQL, but I don't know what functions allow similar date predicate filtering in Pandas.
To construct df1 and df2:
import pandas as pd

data1 = {'customer_id': [1, 2], 'purchase_date': ['2021-05-14', '2021-09-16']}
data2 = {'customer_id': [1, 1, 1, 2, 2],
         'visit_dates': ['2021-05-11', '2021-05-16', '2021-05-21', '2021-07-14', '2021-09-17']}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

Building on @Raymond Kwok's excellent answer: we can merge twice, once to left-merge df1 to df2, and then merge the "begin_date_range" matches with the "end_date_range" matches.
merged = df1.merge(df2, on='customer_id', how='left')
merged['purchase_date'] = pd.to_datetime(merged['purchase_date'])
merged['visit_dates'] = pd.to_datetime(merged['visit_dates'])
# positive day_diff: the visit happened before the purchase; negative: after
day_diff = merged['purchase_date'].sub(merged['visit_dates']).dt.days
out = (merged[day_diff.between(0, 6)]           # visits within the week before
       .merge(merged[day_diff.between(-6, 0)],  # paired with visits within the week after
              on=['customer_id', 'purchase_date'])
       .rename(columns={'visit_dates_x': 'begin_date_range',
                        'visit_dates_y': 'end_date_range'}))
Output:
customer_id purchase_date begin_date_range end_date_range
0 1 2021-05-14 2021-05-11 2021-05-16
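One edge case worth noting (my observation, not part of the original answer): a visit falling exactly on the purchase date satisfies both between(0, 6) and between(-6, 0) and would therefore pair with itself; if strictly-before and strictly-after windows are wanted, use between(1, 6) and between(-6, -1) instead.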

Alternatively, filter the merge result using the week condition. This covers the "visit within the week before" side; the week-after side works the same way with between(-6, 0) and can be merged back as in the answer above.
df = df1.merge(df2, on='customer_id', how='left')
df['purchase_date'] = pd.to_datetime(df['purchase_date'])
df['visit_dates'] = pd.to_datetime(df['visit_dates'])
df = df[(df['purchase_date'] - df['visit_dates']).dt.days.between(0, 6)]
df['visit_dates+1week'] = df['visit_dates'] + pd.Timedelta(days=6)


Pandas - Join two dataframes based on timestamp, inserting in between dates

I have two dataframes with matching and non-matching timestamps. I want to join them so that the new dataframe contains the timestamps from both, and where one dataframe has no value at a timestamp, it is filled with that dataframe's previous value. The goal is to analyze the two data sets side by side at any exact moment in time.
DF1
| Data1 | Timestamp1 |
| ----- | ---------- |
| A | 1623974400000|
| B | 1623974400200|
| C | 1623974400200|
| D | 1623974400400|
DF2
| Data2 | Timestamp2 |
| ----- | ---------- |
| M | 1623974400000|
| N | 1623974400100|
| O | 1623974400200|
| P | 1623974400500|
Output:
DF3
| Data1 | Data2 | Timestamp |
| ----- | ----- | --------- |
| A | M | 1623974400000|
| A | N | 1623974400100|
| B | O | 1623974400200|
| C | O | 1623974400200|
| D | O | 1623974400400|
| D | P | 1623974400500|
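For reference, a minimal construction of DF1 and DF2 matching the tables above (my own setup code, assuming the timestamps are integer epoch milliseconds):

import pandas as pd

DF1 = pd.DataFrame({'Data1': ['A', 'B', 'C', 'D'],
                    'Timestamp1': [1623974400000, 1623974400200, 1623974400200, 1623974400400]})
DF2 = pd.DataFrame({'Data2': ['M', 'N', 'O', 'P'],
                    'Timestamp2': [1623974400000, 1623974400100, 1623974400200, 1623974400500]})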
As I see it, this is just an outer merge, then sort_values and a forward fill.
Code below.
Rename the timestamp columns to a common name:
DF1.rename(columns={'Timestamp1': 'Timestamp'}, inplace=True)
DF2.rename(columns={'Timestamp2': 'Timestamp'}, inplace=True)
Merge, sort by timestamp, and forward-fill:
pd.merge(DF1, DF2, on='Timestamp', how='outer').sort_values(by='Timestamp').ffill()
Outcome:
Data1 Timestamp Data2
0 A 1623974400000 M
4 A 1623974400100 N
1 B 1623974400200 O
2 C 1623974400200 O
3 D 1623974400400 O
5 D 1623974400500 P
You can use pd.merge_asof:
# sort dfs by timestamp:
df1 = df1.sort_values(by="Timestamp1")
df2 = df2.sort_values(by="Timestamp2")
x = pd.merge_asof(df1, df2, left_on="Timestamp1", right_on="Timestamp2")
y = pd.merge_asof(df2, df1, left_on="Timestamp2", right_on="Timestamp1")
df_out = pd.concat([x, y]).drop_duplicates()
df_out["Timestamp"] = df_out[["Timestamp1", "Timestamp2"]].max(axis=1)
print(df_out[["Data1", "Data2", "Timestamp"]])
Prints:
Data1 Data2 Timestamp
0 A M 1623974400000
1 B O 1623974400200
2 C O 1623974400200
3 D O 1623974400400
1 A N 1623974400100
3 D P 1623974400500
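A note on why both directions are needed (my reading; the answer itself doesn't spell this out): pd.merge_asof with its default direction='backward' matches each left-hand row to the most recent right-hand row at or before its timestamp, so merging each way and concatenating covers timestamps that appear in only one frame. To present the rows in time order, a final sort can be appended:

df_out = df_out.sort_values('Timestamp').reset_index(drop=True)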

The total number of times User U bought Item T in the 5 most recent months

I have the following dataframe:
| order_id | item_id | user_id | order_date |
| -------- | -------------- | -------- | ----------- |
| 383706 | 1 | A | 2012-09-11 |
| 354776 | 2 | A | 2018-05-19 |
| 33333 | 2 | A | 2014-01-19 |
| 383706 | 3 | B | 2013-12-10 |
and I want to calculate the following variable: total_buy_m5(User U, Item T) is the total number of times User U bought Item T in the 5 most recent months (between 2019-07-01 and 2019-12-01).
I want this final table:
| user_id | item_id | count |
| -------------- | -------- | -------- |
| A | 1 | 100 |
| A | 2 | 1 |
| A | 3 | 12 |
| B | 1 | 5 |
Assuming that your order_date column is of datetime type, you can filter like this, with U, T, start_date and end_date being the function arguments. If not, you have to convert that column to datetime first (e.g. with pd.to_datetime).
df = df[(df['user_id'] == U) & (df['item_id'] == T) & (df['order_date'] >= start_date) & (df['order_date'] <= end_date)]
In order to get your final desired table, you can use a groupby.
import pandas as pd
from datetime import datetime

# Creating some sample data to illustrate the example
df = pd.DataFrame(columns=['user_id', 'item_id', 'order_date'],
                  data=[['a', 1, datetime(2020, 1, 1)], ['a', 1, datetime(2020, 1, 2)]])

# Filter the DataFrame based on your function arguments
df = df[(df['user_id'] == 'a') &
        (df['item_id'] == 1) &
        (df['order_date'] >= '2019-02-01') & (df['order_date'] <= '2020-02-02')]

# Now do a groupby and rename the order_date column to count
df2 = df.groupby(['user_id', 'item_id']).count().reset_index()
df3 = df2.rename(columns={'order_date': 'count'})
print(df3)
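Since the desired table lists every (user_id, item_id) pair, a variant of the same idea may be closer to what is wanted: count all pairs inside the window at once. A sketch under the same datetime assumption, applied to the unfiltered dataframe, with start_date and end_date as the window bounds:

in_window = df[(df['order_date'] >= start_date) & (df['order_date'] <= end_date)]
counts = (in_window.groupby(['user_id', 'item_id'])
                   .size()
                   .reset_index(name='count'))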

Select value from other dataframe where index is equal

I have two dataframes with the same index. I would like to add a column to one of those dataframes based on an equation for which I need the value from a row of another dataframe where the index is the same.
Using
df2['B'].loc[df2['Date'] == df1['Date']]
I get the 'Can only compare identically-labeled Series objects' error.
df1
+--------+---+
| Index  | A |
+--------+---+
| 3-2-20 | 3 |
| 4-2-20 | 1 |
| 5-2-20 | 3 |
+--------+---+
df2
+--------+---+
| Index  | A |
+--------+---+
| 1-2-20 | 2 |
| 2-2-20 | 4 |
| 3-2-20 | 3 |
| 4-2-20 | 1 |
| 5-2-20 | 3 |
+--------+---+
df1['B'] = 1 + df2['A'].loc[df2['Date'] == df1['Date']] also fails; the index is a date, but in my real df I also have a column called Date with the same values.
df1 desired
+--------+---+---+
| Index  | A | B |
+--------+---+---+
| 3-2-20 | 3 | 4 |
| 4-2-20 | 1 | 2 |
| 5-2-20 | 3 | 4 |
+--------+---+---+
This should work. If not, just play with the column names, because they are the same in both tables: A_y is df2's A column, automatically renamed by merge because both frames have a column named A.
df1['B'] = df1.merge(df2, left_index=True, right_index=True)['A_y'] + 1
I guess for now I will have to settle for cutting a clone of df2 down to the indexes of df1:
dfc = df2.copy()
t = list(df1['Date'])
dfc = dfc.loc[dfc['Date'].isin(t)]
df1['B'] = 1 + dfc['A']  # relies on the two frames sharing the same index
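Since the two frames share the same index, a simpler alternative (my sketch, assuming every index value of df1 is also present in df2) is to align on the index directly:

df1['B'] = 1 + df2.loc[df1.index, 'A']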

Pandas Merge multiple dataframes on index and column

I am trying to merge multiple dataframes into one main dataframe, using the datetime index and id from the main dataframe and the DateTime and id columns from the other dataframes.
Main dataframe
DateTime | id | data
(Df.Index)
---------|----|------
2017-9-8 | 1 | a
2017-9-9 | 2 | b
df1
id | data1 | data2 | DateTime
---|-------|-------|---------
1 | a | c | 2017-9-8
2 | b | d | 2017-9-9
5 | a | e | 2017-9-20
df2
id | data3 | data4 | DateTime
---|-------|-------|---------
1 | d | c | 2017-9-8
2 | e | a | 2017-9-9
4 | f | h | 2017-9-20
The main dataframe and the other dataframes are stored in separate dictionaries. I want to read from each dictionary and merge whenever the joining condition (DateTime, id) is met.
for sleep in dictOfSleep:  # main dataframes
    for sensorDevice in dictOfSensor:  # other dataframes
        try:
            dictOfSleep[sleep] = pd.merge(dictOfSleep[sleep], dictOfSensor[sensorDevice],
                                          how='outer', on=['DateTime', 'id'])
        except:
            print('Join could not be done')
Desired Output:
DateTime | id | data | data1 | data2 | data3 | data4
(Df.Index)
---------|----|------|-------|-------|-------|-------|
2017-9-8 | 1 | a | a | c | d | c |
2017-9-9 | 2 | b | b | d | e | a |
I'm not sure how your dictionaries are set up, so you will most likely need to modify this, but I'd try something like:
for sensorDevice in dictOfSensor:
    df = dictOfSensor[sensorDevice]
    # set df index to match the main_df index
    df = df.set_index(['DateTime'])
    # try join (not merge) when combining on index
    main_df = main_df.join(df, how='outer')
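One caveat (my note, not from the answer): both frames still carry an id column at this point, and DataFrame.join raises a ValueError on overlapping column names unless suffixes are supplied, so in practice something like the following may be needed (assuming the dictionary keys are strings):

main_df = main_df.join(df, how='outer', lsuffix='_main', rsuffix='_' + sensorDevice)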
Alternatively, if the id column is very important, you can first reset the main_df index and then merge on both columns.
main_df = main_df.reset_index()
for sensorDevice in dictOfSensor:
    df = dictOfSensor[sensorDevice]
    # try to merge on both columns
    main_df = main_df.merge(df, how='outer', on=['DateTime', 'id'])
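If the DateTime index is wanted back afterwards, it can be restored (my addition):

main_df = main_df.set_index('DateTime')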

Python Pandas merge and update dataframe

I am currently using Python and Pandas to build a stock price "database". I managed to find some code to download the stock prices.
df1 is my existing database. Each time I download the share prices, the result looks like df2 and df3. I then need to combine the df1, df2 and df3 data to look like df4.
Each stock has its own column.
Each date has its own row.
df1: Existing database
+----------+-------+----------+--------+
| Date | Apple | Facebook | Google |
+----------+-------+----------+--------+
| 1/1/2018 | 161 | 58 | 1000 |
| 2/1/2018 | 170 | 80 | |
| 3/1/2018 | 190 | 84 | 100 |
+----------+-------+----------+--------+
df2: New data (2/1/2018 and 4/1/2018) and updated data (3/1/2018) for Google.
+----------+--------+
| Date | Google |
+----------+--------+
| 2/1/2018 | 500 |
| 3/1/2018 | 300 |
| 4/1/2018 | 200 |
+----------+--------+
df3: New data for Amazon
+----------+--------+
| Date | Amazon |
+----------+--------+
| 1/1/2018 | 1000 |
| 2/1/2018 | 1500 |
| 3/1/2018 | 2000 |
| 4/1/2018 | 3000 |
+----------+--------+
df4, the final output: it merges and updates all the data into the database (df1 + df2 + df3); this becomes the updated version of df1.
+----------+-------+----------+--------+--------+
| Date | Apple | Facebook | Google | Amazon |
+----------+-------+----------+--------+--------+
| 1/1/2018 | 161 | 58 | 1000 | 1000 |
| 2/1/2018 | 170 | 80 | 500 | 1500 |
| 3/1/2018 | 190 | 84 | 300 | 2000 |
| 4/1/2018 | | | 200 | 3000 |
+----------+-------+----------+--------+--------+
I do not know how to combine df1 and df3. I also do not know how to combine df1 and df2: adding the new row (4/1/2018) while updating the changed data (2/1/2018: NaN -> 500; 3/1/2018: 100 -> 300) and leaving the intact data (1/1/2018) untouched.
Can anyone help me to get df4? =)
Thank you.
EDIT: Based on Sociopath's suggestion, I amended the code to:
dataframes = [df2, df3]
df4 = df1
for i in dataframes:
    # Merge the dataframe
    df4 = df4.merge(i, how='outer', on='date')
    # Get the stock name
    stock_name = i.columns[1]
    # If the merge produced duplicated "_x"/"_y" columns, combine them
    if stock_name + "_x" in df4.columns:
        x = stock_name + "_x"
        y = stock_name + "_y"
        df4[stock_name] = df4[y].fillna(df4[x])
        df4.drop([x, y], axis=1, inplace=True)
You need merge. (Note that the sample data below numbers the frames differently from the question: here df3 is the existing database, df1 the Google update and df2 the Amazon data.)
df1 = pd.DataFrame({'date': ['2/1/2018', '3/1/2018', '4/1/2018'], 'Google': [500, 300, 200]})
df2 = pd.DataFrame({'date': ['1/1/2018', '2/1/2018', '3/1/2018', '4/1/2018'], 'Amazon': [1000, 1500, 2000, 3000]})
df3 = pd.DataFrame({'date': ['1/1/2018', '2/1/2018', '3/1/2018'], 'Apple': [161, 171, 181], 'Google': [1000, None, 100], 'Facebook': [58, 75, 65]})
If the column is not present in the current database, simply use merge as below:
df_new = df3.merge(df2, how='outer',on=['date'])
If the column is already present in the DB, then use fillna to update the values as below:
df_new = df_new.merge(df1, how='outer', on='date')
#print(df_new)
df_new['Google'] = df_new['Google_y'].fillna(df_new['Google_x'])
df_new.drop(['Google_x','Google_y'], 1, inplace=True)
Output:
date Apple Facebook Amazon Google
0 1/1/2018 161.0 58.0 1000 1000.0
1 2/1/2018 171.0 75.0 1500 500.0
2 3/1/2018 181.0 65.0 2000 300.0
3 4/1/2018 NaN NaN 3000 200.0
EDIT
A more generic solution for the later part. As in the asker's edit above, the "_x"/"_y" columns only exist when the merged frame shares a column with df_new, so the combine step is guarded with an if.
dataframes = [df2, df3, df4]
for i in dataframes:
    stock_name = list(i.columns.difference(['date']))[0]
    df_new = df_new.merge(i, how='outer', on='date')
    x = stock_name + "_x"
    y = stock_name + "_y"
    if x in df_new.columns:
        df_new[stock_name] = df_new[y].fillna(df_new[x])
        df_new.drop([x, y], axis=1, inplace=True)
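Worth noting as an alternative (my suggestion, not part of the original answer): pandas' combine_first expresses the same "prefer the new values, fall back to the existing ones" logic directly once the date is the index. A sketch using the question's original df1/df2/df3 naming and its Date column:

df4 = df1.set_index('Date')
df4 = df2.set_index('Date').combine_first(df4)  # new/updated Google values win
df4 = df3.set_index('Date').combine_first(df4)  # adds the Amazon column
df4 = df4.reset_index()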
