How to concat 2 pandas DataFrames based on time ranges - Python

I have two DataFrames and I would like to concatenate them based on time ranges.
For example:
DataFrame A:
user timestamp product
A 2015/3/13 1
B 2015/3/15 2

DataFrame B:
user time behavior
A 2015/3/1 2
A 2015/3/8 3
A 2015/3/13 1
B 2015/3/1 2
I would like to combine the two DataFrames as below (frame B left-joined to frame A).
Column "time1" must fall within the 7 days before column "timestamp":
for example, when timestamp is 3/13, then 3/6-3/13 is in range;
otherwise the rows should not be joined.
user timestamp product time1 behavior
A 2015/3/13 1 2015/3/8 3
A 2015/3/13 1 2015/3/13 1
B 2015/3/15 2 NaN NaN
the SQL would look something like:
SELECT * FROM
B LEFT JOIN A
ON B.user = A.user
WHERE B.time >= A.timestamp - 7 AND B.time <= A.timestamp
-- i.e. WHERE B.time BETWEEN DATE_SUB(A.timestamp, INTERVAL 7 DAY) AND A.timestamp;
How can we do this in Python? I can only think of the following, and I don't know how to handle the time condition:
new = pd.merge(A, B, on='user', how='left')
Thanks!

A few steps are required to solve this:
from datetime import timedelta
First, convert your timestamps to pandas datetime (df1 refers to DataFrame A and df2 refers to DataFrame B):
df1['timestamp'] = pd.to_datetime(df1['timestamp'])
df2['time'] = pd.to_datetime(df2['time'])
Merge as follows (based on your final dataset, I think your left join is more of a right join):
df3 = pd.merge(df1, df2, on='user', how='left')
Get your final df:
df4 = df3[(df3.time>=df3.timestamp-timedelta(days=7)) & (df3.time<=df3.timestamp)]
The row containing NaN is missing because of the way conditional joins have to be emulated in pandas: conditional joins are not a feature of pandas yet, so the workaround is to filter after a join.
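If you also need users with no in-range match (user B here) to appear with NaN, one option, sketched with the df1, df2 and df4 from above, is to merge the filtered matches back onto df1:
# left-merge the surviving (user, time, behavior) rows back onto df1,
# so users without any in-range row reappear with NaT/NaN
df5 = df1.merge(df4[['user', 'time', 'behavior']], on='user', how='left')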

Here's one solution that relies on two merges - first, to narrow down dataframe B (df2), and then to produce the desired output:
We can read in the example dataframes with read_clipboard():
import pandas as pd
# copy dataframe A data, then:
df1 = pd.read_clipboard(parse_dates=['timestamp'])
# copy dataframe B data, then:
df2 = pd.read_clipboard(parse_dates=['time'])
Merge and filter:
# left merge df2, df1
tmp = df2.merge(df1, on='user', how='left')
# then drop rows which disqualify based on timestamp
mask = (tmp.time < (tmp.timestamp - pd.Timedelta(days=7))) | (tmp.time > tmp.timestamp)
tmp.loc[mask, ['time', 'behavior']] = None
tmp = tmp.dropna(subset=['time']).drop(['timestamp','product'], axis=1)
# now perform final merge
merged = df1.merge(tmp, on='user', how='left')
Output:
user timestamp product time behavior
0 A 2015-03-13 1 2015-03-08 3.0
1 A 2015-03-13 1 2015-03-13 1.0
2 B 2015-03-15 2 NaT NaN

Related

How to populate dates in a dataframe using pandas in Python

I have a dataframe with two columns, Case and Date, where Date is actually the starting date. I want to expand it into a time series: add three (month_num) more dates to each case and remove the original ones.
original dataframe:
Case Date
0 1 2010-01-01
1 2 2011-04-01
2 3 2012-08-01
after populating dates:
Case Date
0 1 2010-02-01
1 1 2010-03-01
2 1 2010-04-01
3 2 2011-05-01
4 2 2011-06-01
5 2 2011-07-01
6 3 2012-09-01
7 3 2012-10-01
8 3 2012-11-01
I tried declaring an empty dataframe with the same column names and data types, then using a for loop over Case and month_num to add rows into the new dataframe.
import pandas as pd
data = [[1, '2010-01-01'], [2, '2011-04-01'], [3, '2012-08-01']]
df = pd.DataFrame(data, columns = ['Case', 'Date'])
df.Date = pd.to_datetime(df.Date)
df_new = pd.DataFrame(columns=df.columns)
df_new['Case'] = pd.to_numeric(df_new['Case'])
df_new['Date'] = pd.to_datetime(df_new['Date'])
month_num = 3
for c in df.Case:
    for m in range(1, month_num+1):
        temp = df.loc[df['Case'] == c].copy()  # copy to avoid SettingWithCopyWarning
        temp['Date'] = temp['Date'] + pd.DateOffset(months=m)
        df_new = pd.concat([df_new, temp])
df_new.reset_index(inplace=True, drop=True)
My code works; however, when the original dataframe and month_num become large, it takes a huge amount of time to run. Are there any better ways to do what I need? Thanks a lot!
Your performance issue is probably related to the use of pd.concat inside the inner for loop. This answer explains why.
As that answer suggests, you may want to collect the dataframes you create in the loop in an external list, and then concatenate the list once.
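For example, a minimal sketch of that list-collect pattern, assuming the df and month_num defined in the question:
# build one shifted copy of df per month offset, collect the copies in a
# list, and concatenate a single time at the end
frames = []
for m in range(1, month_num + 1):
    shifted = df.copy()
    shifted['Date'] = shifted['Date'] + pd.DateOffset(months=m)
    frames.append(shifted)
df_new = pd.concat(frames).sort_values(['Case', 'Date']).reset_index(drop=True)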
Given your input data this is what worked on my notebook:
df2 = pd.DataFrame()
df2['Date'] = df['Date'].apply(lambda x: pd.date_range(start=x, periods=3, freq='M')).explode()
df2['Date'] = pd.to_datetime(df2['Date'])  # explode leaves object dtype; merge_asof needs datetime64
df3 = pd.merge_asof(df2, df, on='Date')
df3['Date'] = df3['Date'] + pd.DateOffset(days=1)
df3[['Case', 'Date']]
We create df2 and populate its 'Date' column with the needed dates derived from the original df.
Then df3 results from a merge_asof between df2 and df (to populate the 'Case' column).
Finally, we offset the resulting column by 1 day.

Stack the columns based on one column, keeping the ids

I have a DataFrame with 100 columns (however I provide only three columns here) and I want to build a new DataFrame with two columns. Here is the DataFrame:
import pandas as pd
df = pd.DataFrame()
df ['id'] = [1,2,3]
df ['c1'] = [1,5,1]
df ['c2'] = [-1,6,5]
df
I want to stack the values of all columns for each id into one column. For example, for id=1 I want to stack 1 and -1 in one column (the desired frame is the one shown in the answer's output below).
Note: df.melt does not solve my question, since I also want to keep the ids.
Note 2: I already tried stack and reset_index, and it does not help:
df = df.stack().reset_index()
df.columns = ['id','c']
df
You could first set_index with "id"; then stack + reset_index:
out = (df.set_index('id').stack()
         .droplevel(1).reset_index(name='c'))
Output:
id c
0 1 1
1 1 -1
2 2 5
3 2 6
4 3 1
5 3 5
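For reference, melt can also keep the ids when you pass id_vars; a sketch that yields the same values, sorted to match the order above:
# melt keeps 'id' via id_vars; the stable sort restores the per-id order
out = (df.melt(id_vars='id', value_name='c')
         .drop(columns='variable')
         .sort_values('id', kind='stable')
         .reset_index(drop=True))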

How to join data from two separate columns to another dataframe based upon a condition?

I want to add data from df2 if the date is on or after 01/01/2015 and from df3 if it is before 01/01/2015. I am unsure how to do this, as the columns are of different lengths.
I have a main DF df1, and two dataframes containing data called df2 and df3.
df2 looks something like this:
day Value
01/01/2015 5
02/01/2015 6
...
going up to today,
I also have df3, which is the same data but goes from 2000 to today:
day Value
01/01/2000 10
02/01/2000 15
...
I want to append a Value column to df1 that takes the values of df3 when the date is before 2015, and the values of df2 when the date is on or after 01/01/2015. I am unsure how to do this with a condition; I think there is a way with np.where, but I am not sure how.
To add more context, I want to codify this statement:
df1['Value'] = df2['Value'] when the date is on or after 1/1/2015, and df1['Value'] = df3['Value'] when the date is before 1/1/2015.
If you want to join two dataframes, you can use df1.merge(df2, on="col_name"). If you want to apply a condition on one dataframe and then join with the other, do it like this:
import pandas as pd

# generate 2 dataframes
date1 = pd.date_range(start='1/01/2015', end='2/02/2015')
val1 = [2, 3, 5] * 11  # length equal to date1
df1 = pd.DataFrame({"Date": date1, "Value": val1})
df1  # view df1

date2 = pd.date_range(start='1/01/2000', end='2/02/2020')
val2 = [21, 15] * 3669  # length equal to date2
df2 = pd.DataFrame({"Date": date2, "Value": val2})
df2  # view df2

# apply whatever condition you want and store the result in a different dataframe
df3 = df1[df1["Date"] < "2015-02-01"]

df2.merge(df3, on="Date")  # joined dataframe
This is how you can join two dataframes on date with a condition: just apply the condition, store the result in another dataframe, and join it with the first dataframe via df2.merge(df3, on="Date").
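Alternatively, since the question mentions np.where: assuming the question's df1, df2 and df3 (not the generated frames above) can each be aligned on a 'Date' column, a hedged sketch of the conditional pick could look like this:
import numpy as np
import pandas as pd

# assumption: df1 has a 'Date' column; df2 and df3 each have 'Date' and 'Value'
merged = (df1.merge(df2.rename(columns={'Value': 'v2'}), on='Date', how='left')
             .merge(df3.rename(columns={'Value': 'v3'}), on='Date', how='left'))
# take df2's value on/after the cutoff, otherwise df3's value
df1['Value'] = np.where(merged['Date'] >= pd.Timestamp('2015-01-01'),
                        merged['v2'], merged['v3'])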

Create new column based on another column for a multi-index pandas dataframe

I'm running Python 3.5 on Windows and writing code to study financial econometrics.
I have a multi-index pandas dataframe where the level=0 index is a series of month-end dates and the level=1 index is a simple integer ID. I want to create a new column of values ('new_var') where, for each month-end date, I look forward one month and get the values from another column ('some_var'); of course, the IDs from the current month need to align with the IDs for the forward month. Here is a simple test case.
import pandas as pd
import numpy as np
# Create some time series data
id = np.arange(0, 5)
date = [pd.Timestamp(2017, 1, 31) + pd.offsets.MonthEnd(i) for i in [0, 1]]
my_data = []
for d in date:
    for i in id:
        my_data.append((d, i, np.random.random()))
df = pd.DataFrame(my_data, columns=['date', 'id', 'some_var'])
df['new_var'] = np.nan
df.set_index(['date', 'id'], inplace=True)
# Drop an observation to reflect my true data
df.drop((pd.Timestamp('2017-02-28'), 3), inplace=True)
df
# The desired output....
list1 = df.loc['2017-01-31'].index.tolist()
list2 = df.loc['2017-02-28'].index.tolist()
common = list(set(list1) & set(list2))
for i in common:
    df.loc[('2017-01-31', i), 'new_var'] = df.loc[('2017-02-28', i), 'some_var']
df
I feel like there is a better way to get my desired output. Maybe I should just embrace the "for" loop? Maybe a better solution is to reset the index?
Thank you,
F
I would create an integer column representing the date, subtract one from it (to shift it by one month), and then left-merge the values back onto the original dataframe.
Out[28]:
some_var
date id
2017-01-31 0 0.736003
1 0.248275
2 0.844170
3 0.671364
4 0.034331
2017-02-28 0 0.051586
1 0.894579
2 0.136740
4 0.902409
df = df.reset_index()
df['n_group'] = df.groupby('date').ngroup()
df_shifted = df[['n_group', 'some_var','id']].rename(columns={'some_var':'new_var'})
df_shifted['n_group'] = df_shifted['n_group']-1
df = df.merge(df_shifted, on=['n_group','id'], how='left')
df = df.set_index(['date','id']).drop('n_group', axis=1)
Out[31]:
some_var new_var
date id
2017-01-31 0 0.736003 0.051586
1 0.248275 0.894579
2 0.844170 0.136740
3 0.671364 NaN
4 0.034331 0.902409
2017-02-28 0 0.051586 NaN
1 0.894579 NaN
2 0.136740 NaN
4 0.902409 NaN

Join two pandas dataframes using a specific column

I am new to pandas and I am trying to join two dataframes based on the equality of one specific column. For example, suppose that I have the following:
df1
A B C
1 2 3
2 2 2
df2
A B C
5 6 7
2 8 9
Both dataframes have the same columns and the value of only one column (say A) might be equal. What I want as output is this:
df3
A B C B C
2 8 9 2 2
The values for column 'A' are unique in both dataframes.
Thanks
pd.concat([df1.set_index('A'),df2.set_index('A')], axis=1, join='inner')
If you wish to maintain column A as a non-index, then:
pd.concat([df1.set_index('A'),df2.set_index('A')], axis=1, join='inner').reset_index()
Alternatively, you could just do:
df3 = df1.merge(df2, on='A', how='inner', suffixes=('_1', '_2'))
That way you can keep track of each value's origin via the suffixes.
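A quick check with the sample frames from the question (listing df2's columns first to match the desired layout):
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2], 'B': [2, 2], 'C': [3, 2]})
df2 = pd.DataFrame({'A': [5, 2], 'B': [6, 8], 'C': [7, 9]})
df3 = df2.merge(df1, on='A', how='inner', suffixes=('_2', '_1'))
print(df3)
#    A  B_2  C_2  B_1  C_1
# 0  2    8    9    2    2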
