Here's an example that fulfills the criteria. DataFrames df1 and df2 both have an id column and a date column. In df1 the id column is unique, while in df2 it is non-unique. I'd like to create a new dataframe where the join happens if df1.id == df2.id and some pair of dates in df2 falls within one week of the date in df1.
df1
| customer_id (unique) | purchase_date |
| -------- | -------------- |
| 1 | 2021-05-14 |
| 2 | 2021-09-16 |
df2
| customer_id | visit_dates |
| -------- | -------------- |
| 1 | 2021-05-11 |
| 1 | 2021-05-16 |
| 1 | 2021-05-21 |
| 2 | 2021-07-14 |
| 2 | 2021-09-17 |
# The new df will only have 1 row.
# For customer 1 there is a visit within the range (05-07 -> 05-14)
# and one within the range (05-14 -> 05-21) with a matching id.
# For customer 2 there is no visit within (09-09 -> 09-16), so it is filtered out.
newdf
| customer_id (unique) | purchase_date | begin_date_range | end_date_range |
| -------------------- | ------------- | ---------------- | -------------- |
| 1 | 2021-05-14 | 2021-05-11 | 2021-05-16 |
I understand how to do this in SQL, but I don't know which functions allow similar date-predicate filtering in Pandas.
To construct df1 and df2:
import pandas as pd

data1 = {'customer_id': [1, 2], 'purchase_date': ['2021-05-14', '2021-09-16']}
data2 = {'customer_id': [1, 1, 1, 2, 2],
         'visit_dates': ['2021-05-11', '2021-05-16', '2021-05-21', '2021-07-14', '2021-09-17']}
df1, df2 = pd.DataFrame(data1), pd.DataFrame(data2)
Building on @Raymond Kwok's excellent answer: we can merge twice, first left-merging df1 to df2, then merging the "begin_date_range" part with the "end_date_range" part.
merged = df1.merge(df2, on='customer_id', how='left')
merged['purchase_date'] = pd.to_datetime(merged['purchase_date'])
merged['visit_dates'] = pd.to_datetime(merged['visit_dates'])
# positive day_diff means the visit came before the purchase
day_diff = merged['purchase_date'].sub(merged['visit_dates']).dt.days
# rows with a visit up to 6 days before, joined to rows with a visit up to 6 days after
out = (merged[day_diff.between(0, 6)]
       .merge(merged[day_diff.between(-6, 0)], on=['customer_id', 'purchase_date'])
       .rename(columns={'visit_dates_x': 'begin_date_range', 'visit_dates_y': 'end_date_range'}))
Output:
customer_id purchase_date begin_date_range end_date_range
0 1 2021-05-14 2021-05-11 2021-05-16
Filter the merge result using the week condition:
df = df1.merge(df2, on='customer_id', how='left')
df['purchase_date'] = pd.to_datetime(df['purchase_date'])
df['visit_dates'] = pd.to_datetime(df['visit_dates'])
out = df[(df['purchase_date'] - df['visit_dates']).dt.days.between(0, 6)]
# an explicit end-of-window column can also be handy:
df['visit_dates+1week'] = df['visit_dates'] + pd.Timedelta(days=6)
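To also recover the begin/end columns with this approach rather than the double merge above, a hedged sketch that starts from the unfiltered df and takes the closest qualifying visit on each side:
d = (df['purchase_date'] - df['visit_dates']).dt.days
# latest visit within the week before the purchase
before = df[d.between(0, 6)].groupby(['customer_id', 'purchase_date'])['visit_dates'].max()
# earliest visit within the week after the purchase
after = df[d.between(-6, 0)].groupby(['customer_id', 'purchase_date'])['visit_dates'].min()
# keep only customers with a qualifying visit on both sides
newdf = (pd.concat([before.rename('begin_date_range'), after.rename('end_date_range')], axis=1)
         .dropna()
         .reset_index())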
I have two dataframes with matching and non-matching timestamps. I want to join them so that the new dataframe contains the timestamps from both, and where one dataframe has no row for a timestamp, its value is filled from that dataframe's previous row. I want to analyze two different data sets to check their values at any exact moment.
DF1
| Data1 | Timestamp1 |
| ----- | ---------- |
| A | 1623974400000|
| B | 1623974400200|
| C | 1623974400200|
| D | 1623974400400|
DF2
| Data2 | Timestamp2 |
| ----- | ---------- |
| M | 1623974400000|
| N | 1623974400100|
| O | 1623974400200|
| P | 1623974400500|
Output:
DF3
| Data1 | Data2 | Timestamp |
| ----- | ----- | --------- |
| A | M | 1623974400000|
| A | N | 1623974400100|
| B | O | 1623974400200|
| C | O | 1623974400200|
| D | O | 1623974400400|
| D | P | 1623974400500|
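For reference, a minimal construction of DF1 and DF2 matching the tables above (assuming the timestamps are integer epoch milliseconds):
import pandas as pd

DF1 = pd.DataFrame({'Data1': ['A', 'B', 'C', 'D'],
                    'Timestamp1': [1623974400000, 1623974400200, 1623974400200, 1623974400400]})
DF2 = pd.DataFrame({'Data2': ['M', 'N', 'O', 'P'],
                    'Timestamp2': [1623974400000, 1623974400100, 1623974400200, 1623974400500]})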
As I see it, this is just an outer merge, then sort_values and a forward fill.
Code below
Rename columns
DF1.rename(columns={'Timestamp1':'Timestamp'}, inplace=True)
DF2.rename(columns={'Timestamp2':'Timestamp'}, inplace=True)
merge
pd.merge(DF1, DF2, on='Timestamp', how='outer').sort_values(by='Timestamp').ffill()
outcome
Data1 Timestamp Data2
0 A 1623974400000 M
4 A 1623974400100 N
1 B 1623974400200 O
2 C 1623974400200 O
3 D 1623974400400 O
5 D 1623974400500 P
You can use pd.merge_asof:
# sort dfs by timestamp:
df1 = df1.sort_values(by="Timestamp1")
df2 = df2.sort_values(by="Timestamp2")
# for each row, find the latest row of the other frame at or before its timestamp
x = pd.merge_asof(df1, df2, left_on="Timestamp1", right_on="Timestamp2")
y = pd.merge_asof(df2, df1, left_on="Timestamp2", right_on="Timestamp1")
# union both directions, dropping exact duplicates
df_out = pd.concat([x, y]).drop_duplicates()
# the effective timestamp is the later of the two matched timestamps
df_out["Timestamp"] = df_out[["Timestamp1", "Timestamp2"]].max(axis=1)
print(df_out[["Data1", "Data2", "Timestamp"]])
Prints:
Data1 Data2 Timestamp
0 A M 1623974400000
1 B O 1623974400200
2 C O 1623974400200
3 D O 1623974400400
1 A N 1623974400100
3 D P 1623974400500
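As a usage note, merge_asof matches backward by default (the closest earlier-or-equal timestamp), which is what the forward-fill semantics above need; it also supports direction='forward' and direction='nearest' if your matching rule differs, e.g.:
pd.merge_asof(df1, df2, left_on="Timestamp1", right_on="Timestamp2", direction="nearest")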
I have the following dataframe:
| order_id | item_id | user_id | order_date |
| -------- | -------------- | -------- | ----------- |
| 383706 | 1 | A | 2012-09-11 |
| 354776 | 2 | A | 2018-05-19 |
| 33333 | 2 | A | 2014-01-19 |
| 383706 | 3 | B | 2013-12-10 |
and I want to calculate the following variable: total_buy_m5(User U, Item T) is the total number of times user U bought item T over the 5 most recent months (between 2019-07-01 and 2019-12-01).
I want this final table:
| user_id | item_id | count |
| -------------- | -------- | -------- |
| A | 1 | 100 |
| A | 2 | 1 |
| A | 3 | 12 |
| B | 1 | 5 |
Assuming that your order_date column is already of datetime type, you can filter like this (U, T, start_date and end_date being your function's arguments); if not, you have to convert that column to datetime first.
df = df[(df['user_id'] == U) & (df['item_id'] == T) & (df['order_date'] >= start_date) & (df['order_date'] <= end_date)]
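If the column holds strings, a one-line conversion sketch:
df['order_date'] = pd.to_datetime(df['order_date'])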
In order to get your final desired table, you can use a groupby.
import pandas as pd
from datetime import datetime
# Creating some sample data to illustrate the example
df = pd.DataFrame(columns=['user_id', 'item_id', 'order_date'], data=[['a', 1, datetime(2020, 1, 1)], ['a', 1, datetime(2020, 1, 2)]])
# Filter the DataFrame based on your function arguments
df = df[(df['user_id'] == 'a') &
(df['item_id'] == 1) &
((df['order_date'] >= '2019-02-01') & (df['order_date'] <= '2020-02-02'))]
# Now do a groupby and rename the order_date column to count
df2 = df.groupby(['user_id', 'item_id']).count().reset_index()
df3 = df2.rename(columns={'order_date': 'count'})
print(df3)
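If you want the final table for all user/item pairs at once, rather than one (U, T) pair at a time, a hedged sketch starting from the unfiltered DataFrame: filter only on the date window, then group by both keys:
# keep only orders inside the 5-month window
in_window = df[(df['order_date'] >= '2019-07-01') & (df['order_date'] <= '2019-12-01')]
# count rows per (user, item) pair
df_counts = in_window.groupby(['user_id', 'item_id']).size().reset_index(name='count')
print(df_counts)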
I have two dataframes with the same index. I would like to add a column to one of them, computed from an equation that needs the value of the row in the other dataframe with the same index.
Using
df2['B'].loc[df2['Date'] == df1['Date']]
I get the 'Can only compare identically-labeled Series objects' error.
df1
+--------+---+
| Index  | A |
+--------+---+
| 3-2-20 | 3 |
| 4-2-20 | 1 |
| 5-2-20 | 3 |
+--------+---+
df2
+--------+---+
| Index  | A |
+--------+---+
| 1-2-20 | 2 |
| 2-2-20 | 4 |
| 3-2-20 | 3 |
| 4-2-20 | 1 |
| 5-2-20 | 3 |
+--------+---+
df1['B'] = 1 + df2['A'].loc[df2['Date'] == df1['Date']] is what I tried; the index is a date, but in my real df I also have a column called Date with the same values.
df1 desired
+--------+---+---+
| Index  | A | B |
+--------+---+---+
| 3-2-20 | 3 | 4 |
| 4-2-20 | 1 | 2 |
| 5-2-20 | 3 | 4 |
+--------+---+---+
This should work. If not, just play with the column names, because the same name appears in both tables: A_y is the df2['A'] column (auto-renamed by the merge because of the name clash).
df1['B']=df1.merge(df2, left_index=True, right_index=True)['A_y']+1
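Since the indices already align, a simpler hedged sketch that skips the merge entirely (assuming every index value of df1 also appears in df2):
df1['B'] = df2.loc[df1.index, 'A'] + 1
The .loc lookup reindexes df2['A'] to df1's index, so the addition lines up row by row.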
I guess for now I will have to settle for slicing a copy of df2 down to the indexes of df1:
dfc = df2.copy()
t = list(df1['Date'])
dfc = dfc.loc[dfc['Date'].isin(t)]
df1['B'] = 1 + dfc['A']
I am trying to merge multiple dataframes into one main dataframe, using the datetime index and id from the main dataframe and the DateTime and id columns from the other dataframes.
Main dataframe
DateTime | id | data
(Df.Index)
---------|----|------
2017-9-8 | 1 | a
2017-9-9 | 2 | b
df1
id | data1 | data2 | DateTime
---|-------|-------|---------
1 | a | c | 2017-9-8
2 | b | d | 2017-9-9
5 | a | e | 2017-9-20
df2
id | data3 | data4 | DateTime
---|-------|-------|---------
1 | d | c | 2017-9-8
2 | e | a | 2017-9-9
4 | f | h | 2017-9-20
The main dataframe and the other dataframes live in different dictionaries. I want to read from each dictionary and merge whenever the joining condition (DateTime, id) is met:
for sleep in dictOfSleep:  # main DataFrames
    for sensorDevice in dictOfSensor:  # other DataFrames
        try:
            dictOfSleep[sleep] = pd.merge(dictOfSleep[sleep], dictOfSensor[sensorDevice],
                                          how='outer', on=['DateTime', 'id'])
        except Exception:
            print('Join could not be done')
Desired Output:
DateTime | id | data | data1 | data2 | data3 | data4
(Df.Index)
---------|----|------|-------|-------|-------|-------|
2017-9-8 | 1 | a | a | c | d | c |
2017-9-9 | 2 | b | b | d | e | a |
I'm not sure how your dictionaries are set up, so you will most likely need to modify this, but I'd try something like:
for sensorDevice in dictOfSensor:
    df = dictOfSensor[sensorDevice]
    # set df index to match the main_df index
    df = df.set_index(['DateTime'])
    # join (not merge) when combining on index
    main_df = main_df.join(df, how='outer')
Alternatively, if the id column is very important, you can first reset your main_df index and then merge:
main_df = main_df.reset_index()
for sensorDevice in dictOfSensor:
    df = dictOfSensor[sensorDevice]
    # try to merge on both columns
    main_df = main_df.merge(df, how='outer', on=['DateTime', 'id'])
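The same loop can also be written as a one-shot reduce; a minimal sketch, assuming each dictionary value is a DataFrame with DateTime and id columns and main_df already has its index reset:
from functools import reduce

frames = [main_df] + list(dictOfSensor.values())
main_df = reduce(lambda left, right: left.merge(right, how='outer', on=['DateTime', 'id']),
                 frames)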
I am currently using Python and Pandas to build a stock price "database". I managed to find some code to download the stock prices.
df1 is my existing database. Each time I download share prices, the new data looks like df2 and df3. I then need to combine the df1, df2 and df3 data to look like df4.
Each stock has its own column.
Each date has its own row.
df1: Existing database
+----------+-------+----------+--------+
| Date | Apple | Facebook | Google |
+----------+-------+----------+--------+
| 1/1/2018 | 161 | 58 | 1000 |
| 2/1/2018 | 170 | 80 | |
| 3/1/2018 | 190 | 84 | 100 |
+----------+-------+----------+--------+
df2: New data (2/1/2018 and 4/1/2018) and updated data (3/1/2018) for Google.
+----------+--------+
| Date | Google |
+----------+--------+
| 2/1/2018 | 500 |
| 3/1/2018 | 300 |
| 4/1/2018 | 200 |
+----------+--------+
df3: New data for Amazon
+----------+--------+
| Date | Amazon |
+----------+--------+
| 1/1/2018 | 1000 |
| 2/1/2018 | 1500 |
| 3/1/2018 | 2000 |
| 4/1/2018 | 3000 |
+----------+--------+
df4 final output: basically, it merges and updates all the data into the database; (df1 + df2 + df3) becomes the updated version of df1.
+----------+-------+----------+--------+--------+
| Date | Apple | Facebook | Google | Amazon |
+----------+-------+----------+--------+--------+
| 1/1/2018 | 161 | 58 | 1000 | 1000 |
| 2/1/2018 | 170 | 80 | 500 | 1500 |
| 3/1/2018 | 190 | 84 | 300 | 2000 |
| 4/1/2018 | | | 200 | 3000 |
+----------+-------+----------+--------+--------+
I do not know how to combine df1 and df3.
And I do not know how to combine df1 and df2 while adding the new row (4/1/2018), updating the amended data (2/1/2018: original NaN -> amended 500; 3/1/2018: original 100 -> amended 300), and leaving the intact data (1/1/2018) alone.
Can anyone help me to get df4? =)
Thank you.
EDIT: Based on Sociopath's suggestion, I amended the code to:
dataframes = [df2, df3]
df4 = df1
for i in dataframes:
    # Merge the dataframe
    df4 = df4.merge(i, how='outer', on='date')
    # Get the stock name
    stock_name = i.columns[1]
    # If suffixed "_x"/"_y" columns appeared, combine them into one
    if stock_name + "_x" in df4.columns:
        x = stock_name + "_x"
        y = stock_name + "_y"
        df4[stock_name] = df4[y].fillna(df4[x])
        df4.drop([x, y], axis=1, inplace=True)
You need merge. Note that the sample frames below reuse the names: here df3 plays the role of the existing database, df1 is the Google update and df2 the Amazon data:
df1 = pd.DataFrame({'date':['2/1/2018','3/1/2018','4/1/2018'], 'Google':[500,300,200]})
df2 = pd.DataFrame({'date':['1/1/2018','2/1/2018','3/1/2018','4/1/2018'], 'Amazon':[1000,1500,2000,3000]})
df3 = pd.DataFrame({'date':['1/1/2018','2/1/2018','3/1/2018'], 'Apple':[161,171,181], 'Google':[1000,None,100], 'Facebook':[58,75,65]})
If the column is not present in the current database, simply use merge as below:
df_new = df3.merge(df2, how='outer',on=['date'])
If the column is present in the DB, then use fillna to update the values as below:
df_new = df_new.merge(df1, how='outer', on='date')
#print(df_new)
df_new['Google'] = df_new['Google_y'].fillna(df_new['Google_x'])
df_new.drop(['Google_x', 'Google_y'], axis=1, inplace=True)
Output:
date Apple Facebook Amazon Google
0 1/1/2018 161.0 58.0 1000 1000.0
1 2/1/2018 171.0 75.0 1500 500.0
2 3/1/2018 181.0 65.0 2000 300.0
3 4/1/2018 NaN NaN 3000 200.0
EDIT
A more generic solution for the later part:
dataframes = [df2, df3, df4]
for i in dataframes:
    # the single non-date column is the stock name
    stock_name = list(i.columns.difference(['date']))[0]
    df_new = df_new.merge(i, how='outer', on='date')
    x = stock_name + "_x"
    y = stock_name + "_y"
    # prefer the new value, fall back to the existing one
    # (assumes the stock already exists in df_new; otherwise keep the if-guard from the EDIT above)
    df_new[stock_name] = df_new[y].fillna(df_new[x])
    df_new.drop([x, y], axis=1, inplace=True)
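An alternative sketch that avoids the _x/_y bookkeeping altogether: index every frame by date and let combine_first prefer the newer values. A minimal sketch, assuming the same df1/df2/df3 as above:
# combine_first keeps the caller's values wherever both sides have data,
# so applying each update on top of the database overwrites stale prices
db = df3.set_index('date')   # existing database
for update in (df1, df2):    # newly downloaded frames
    db = update.set_index('date').combine_first(db)
df4 = db.reset_index()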