I need to perform a merge to map a new set of ids to an old set of ids. My starting data looks like this:
lst = [10001, 20001, 30001]
dt = pd.date_range(start='2016', end='2018', freq='M')
idx = pd.MultiIndex.from_product([dt,lst],names=['date','id'])
df = pd.DataFrame(np.random.randn(len(idx)), index=idx)
In [94]: df.head()
Out[94]:
                         0
date       id
2016-01-31 10001 -0.512371
           20001 -1.164461
           30001 -1.253232
2016-02-29 10001 -0.129874
           20001  0.711938
And I want to map id to newid using data that looks like this:
df1 = pd.DataFrame({'id': [10001, 10001, 10001, 10001],
                    'start_date': ['2015-11-30', '2016-02-01', '2016-05-16', '2017-02-16'],
                    'end_date': ['2016-01-31', '2016-05-15', '2017-02-15', '2018-04-02'],
                    'new_id': ['ABC123', 'XYZ789', 'HIJ456', 'LMN654']})
df2 = pd.DataFrame({'id': [20001, 20001, 20001, 20001],
                    'start_date': ['2015-10-07', '2016-01-08', '2016-06-02', '2017-02-13'],
                    'end_date': ['2016-01-07', '2016-06-01', '2017-02-12', '2018-03-17'],
                    'new_id': ['CBA321', 'ZYX987', 'JIH765', 'NML345']})
df3 = pd.DataFrame({'id': [30001, 30001, 30001, 30001],
                    'start_date': ['2015-07-31', '2016-02-23', '2016-06-17', '2017-05-12'],
                    'end_date': ['2016-02-22', '2016-06-16', '2017-05-11', '2018-01-05'],
                    'new_id': ['CCC333', 'XXX444', 'HHH888', 'III888']})
df_ranges = pd.concat([df1,df2,df3])
In [95]: df_ranges.head()
Out[95]:
   index    end_date     id  new_id  start_date
0      0  2016-01-31  10001  ABC123  2015-11-30
1      1  2016-05-15  10001  XYZ789  2016-02-01
2      2  2017-02-15  10001  HIJ456  2016-05-16
3      3  2018-04-02  10001  LMN654  2017-02-16
4      0  2016-01-07  20001  CBA321  2015-10-07
Basically, my data is monthly panel data, and the mapping data has ranges of dates for which a specific mapping from A->B is valid. So the first row of the mapping data says that from 2015-11-30 through 2016-01-31 the id 10001 maps to ABC123.
I've previously done this in SAS/SQL with a statement like this:
SELECT a.*, b.newid FROM df as a, df_ranges as b
WHERE a.id = b.id AND b.start_date <= a.date < b.end_date
A few notes about the data:
- it should be a 1:1 mapping of id to new_id
- the date ranges are non-overlapping
The solution here may be a good start: Merging dataframes based on date range
It is exactly what I'm looking for except that it merges only on dates, not additionally on id. I played with groupby() and this solution but didn't find a way to make it work. Another idea I had was to unstack() the mapping data (df_ranges) to match the dimensions/time frequency of df but this seems to simply re-state the existing problem.
Perhaps I got downvoted because this was too easy, but I couldn't find the answer anywhere, so I'll just post it here: you should use merge_asof(), which provides fuzzy matching on dates.
First, the data need to be sorted:
df_ranges.sort_values(by=['start_date','id'],inplace=True)
df.sort_values(by=['date','id'],inplace=True)
Then, do the merge:
pd.merge_asof(df,df_ranges, by='id', left_on='date', right_on='start_date')
Output:
In [30]: pd.merge_asof(df,df_ranges, by='id', left_on='date', right_on='start_date').head()
Out[30]:
        date     id         0 start_date   end_date  new_id
0 2016-01-31  10001  0.120892 2015-11-30 2016-01-31  ABC123
1 2016-01-31  20001 -0.576096 2016-01-08 2016-06-01  ZYX987
2 2016-01-31  30001  0.543597 2015-07-31 2016-02-22  CCC333
3 2016-02-29  10001  0.316212 2016-02-01 2016-05-15  XYZ789
4 2016-02-29  20001 -0.625878 2016-01-08 2016-06-01  ZYX987
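For completeness, here is a minimal end-to-end sketch of the same approach. The reset_index() and pd.to_datetime calls are my additions: merge_asof matches on columns rather than index levels, needs both keys to be real datetimes, and requires both sides to be sorted on them.
import pandas as pd

left = df.reset_index()                  # 'date' and 'id' become columns (df as built above)
right = df_ranges.copy()
right['start_date'] = pd.to_datetime(right['start_date'])
right['end_date'] = pd.to_datetime(right['end_date'])

left = left.sort_values(['date', 'id'])
right = right.sort_values(['start_date', 'id'])

# Like the SQL, this only enforces start_date <= date; with non-overlapping
# ranges the matched end_date can be checked afterwards if needed.
result = pd.merge_asof(left, right, by='id',
                       left_on='date', right_on='start_date')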
Related
I have a pandas dataframe consisting of transactional data that looks like the below:
Customer_ID  Day         Sales
1            2018-08-01  80.11
2            2019-01-07  10.15
2            2021-02-21  74.15
1            2019-06-18  10.00
3            2020-03-17  15.15
2            2020-04-29  80.98
4            2016-06-01  133.54
3            2022-01-14  17.15
2            2021-02-28  25.12
1            2021-01-02  1.22
I need to calculate the forward rolling 365 day sum of sales grouped by the customer, exclusive of the current day. I would like to insert the result as a new column.
e.g. for customer_id == 1 for the first row, the value to be inserted in the new column will be the sum of sales between 2018-08-02 and 2019-08-01 for customer_id == 1.
I'm sure there's a way to do this using pandas, but I can't figure it out.
Code to produce the dataframe:
df = pd.DataFrame({
    'Customer_ID': [1, 2, 2, 1, 3, 2, 4, 3, 2, 1],
    'Day': ['2018-08-01', '2019-01-07', '2021-02-21', '2019-06-18', '2020-03-17', '2020-04-29', '2016-06-01', '2022-01-14', '2021-02-28', '2021-01-02'],
    'Sales': [80.11, 10.15, 74.15, 10.00, 15.15, 80.98, 133.54, 17.15, 25.12, 1.22]
})
df.Day = pd.to_datetime(df.Day)
You first need to groupby the Customer_ID column, then perform a rolling sum on each group after you set the 'Day' column as each group's index.
(df.groupby('Customer_ID')
   .apply(
       lambda gr: gr.set_index('Day').sort_index()['Sales'].rolling('365D').sum()
   )
   .reset_index())
There is probably a better way to do this, but this is the simplest one for me.
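Note that rolling('365D') looks backward from each row and includes the current day. If you need the forward window that excludes the current day, as the question asks, one straightforward (if not the fastest) sketch is to mask each customer's dates explicitly; the column name Sales_next_365d is just a placeholder I chose:
import numpy as np
import pandas as pd

def forward_365_sum(group):
    # For each row, sum Sales strictly after that day and up to 365 days later.
    days = group['Day'].to_numpy()
    sales = group['Sales'].to_numpy()
    sums = [sales[(days > d) & (days <= d + np.timedelta64(365, 'D'))].sum()
            for d in days]
    return pd.Series(sums, index=group.index)

df['Sales_next_365d'] = df.groupby('Customer_ID', group_keys=False).apply(forward_365_sum)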
Edit: You can use the alleged duplicate solution with reindex() if your dates don't include times; otherwise you need a solution like the one by #kosnik. In addition, their solution doesn't need your dates to be the index!
I have data formatted like this
df = pd.DataFrame(data=[['2017-02-12 20:25:00', 'Sam', '8'],
                        ['2017-02-15 16:33:00', 'Scott', '10'],
                        ['2017-02-15 16:45:00', 'Steve', '5']],
                  columns=['Datetime', 'Sender', 'Count'])
df['Datetime'] = pd.to_datetime(df['Datetime'], format='%Y-%m-%d %H:%M:%S')
Datetime Sender Count
0 2017-02-12 20:25:00 Sam 8
1 2017-02-15 16:33:00 Scott 10
2 2017-02-15 16:45:00 Steve 5
I need there to be at least one row for every date, so the expected result would be
Datetime Sender Count
0 2017-02-12 20:25:00 Sam 8
1 2017-02-13 00:00:00 None 0
2 2017-02-14 00:00:00 None 0
3 2017-02-15 16:33:00 Scott 10
4 2017-02-15 16:45:00 Steve 5
I have tried to make datetime the index, add the dates and use reindex() like so
df.index = df['Datetime']
values = df['Datetime'].tolist()
for i in range(len(values)-1):
    if values[i].date() + timedelta(days=1) < values[i+1].date():
        values.insert(i+1, pd.Timestamp(values[i].date() + timedelta(days=1)))
print(df.reindex(values, fill_value=0))
This makes every row forget about the other columns and the same thing happens for asfreq('D') or resample()
ID Sender Count
Datetime
2017-02-12 16:25:00 0 Sam 8
2017-02-13 00:00:00 0 0 0
2017-02-14 00:00:00 0 0 0
2017-02-15 20:25:00 0 0 0
2017-02-15 20:25:00 0 0 0
What would be the appropriate way of going about this?
I would create a new DataFrame that contains all the required datetimes and then left join it with your data frame.
A working code example is the following:
import datetime
import numpy as np

df['Datetime'] = pd.to_datetime(df['Datetime'])  # first convert to datetimes
datetimes = df['Datetime'].tolist()              # existing datetimes - the missing ones are added below
dates = [x.date() for x in datetimes]            # converted to plain dates
min_date = min(dates)
max_date = max(dates)
for d in range((max_date - min_date).days):
    forward_date = min_date + datetime.timedelta(d)
    if forward_date not in dates:
        datetimes.append(np.datetime64(forward_date))
# create a new dataframe, merge, and fill the 'Count' column with zeroes
df = pd.DataFrame({'Datetime': datetimes}).merge(df, on='Datetime', how='left')
df['Count'].fillna(0, inplace=True)
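One small follow-up (my addition): the merge appends the filler dates after the original rows and leaves Sender as NaN, so you may want to sort the result afterwards:
# put the inserted dates back into chronological order
df = df.sort_values('Datetime').reset_index(drop=True)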
I have the following sample dataset
df = pd.DataFrame({
    'names': ['joe', 'joe', 'joe'],
    'dates': [dt.datetime(2019,6,1), dt.datetime(2019,6,5), dt.datetime(2019,7,1)],
    'values': [5,2,13]
})
and I want to group by names and by weeks or 7 days, which I can achieve with
df_grouped = df.groupby(['names', pd.Grouper(key='dates', freq='7d')]).sum()
values
names dates
joe 2019-06-01 7
2019-06-29 13
But what I'm looking for is something like this, with all the intermediate dates shown explicitly:
values
names dates
joe 2019-06-01 7
2019-06-08 0
2019-06-15 0
2019-06-22 0
2019-06-29 13
And by doing df_grouped.index.levels[1] I see that all those intermediate dates are actually in the index, so maybe that's something I can leverage.
Any ideas on how to achieve this?
Thanks
Use DataFrameGroupBy.resample with DatetimeIndex:
df_grouped = df.set_index('dates').groupby('names').resample('7D').sum()
print (df_grouped)
values
names dates
joe 2019-06-01 7
2019-06-08 0
2019-06-15 0
2019-06-22 0
2019-06-29 13
I have this DataFrame:
dft2 = pd.DataFrame(np.random.randn(20, 1),
                    columns=['A'],
                    index=pd.MultiIndex.from_product([pd.date_range('20130101',
                                                                    periods=10,
                                                                    freq='4M'),
                                                      ['a', 'b']]))
That looks like this when I print it.
Output:
A
2013-01-31 a 0.275921
b 1.336497
2013-05-31 a 1.040245
b 0.716865
2013-09-30 a -2.697420
b -1.570267
2014-01-31 a 1.326194
b -0.209718
2014-05-31 a -1.030777
b 0.401654
2014-09-30 a 1.138958
b -1.162370
2015-01-31 a 1.770279
b 0.606219
2015-05-31 a -0.819126
b -0.967827
2015-09-30 a -1.423667
b 0.894103
2016-01-31 a 1.765187
b -0.334844
How do I filter for the rows whose date is the minimum of its year, like 2013-01-31 and 2014-01-31?
Thanks.
# Create dataframe from the dates in the first level of the index.
df = pd.DataFrame(dft2.index.get_level_values(0), columns=['date'], index=dft2.index)
# Add a `year` column that gets the year of each date.
df = df.assign(year=[d.year for d in df['date']])
# Find the minimum date of each year by grouping.
min_annual_dates = df.groupby('year')['date'].min().tolist()
# Filter the original dataframe based on these minimum dates by year.
>>> dft2.loc[(min_annual_dates, slice(None)), :]
A
2013-01-31 a 1.087274
b 1.488553
2014-01-31 a 0.119801
b 0.922468
2015-01-31 a -0.262440
b 0.642201
2016-01-31 a 1.144664
b 0.410701
Or you can try using isin
dft1=dft2.reset_index()
dft1['Year']=dft1.level_0.dt.year
dft1=dft1.groupby('Year')['level_0'].min()
dft2[dft2.index.get_level_values(0).isin(dft1.values)]
Out[2250]:
A
2013-01-31 a -1.072400
b 0.660115
2014-01-31 a -0.134245
b 1.344941
2015-01-31 a 0.176067
b -1.792567
2016-01-31 a 0.033230
b -0.960175
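Combining the two ideas above, a slightly more compact variant (my sketch, not part of either answer) keeps the rows whose level-0 date equals the minimum date of its year:
dates = dft2.index.get_level_values(0)
annual_min = pd.Series(dates).groupby(dates.year).min()  # earliest date per year
result = dft2[dates.isin(annual_min)]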
I have a Pandas DataFrame that looks like
col1
2015-02-02
2015-04-05
2016-07-02
I would like to add, for each date in col1, the x days before and x days after that date.
That means that the resulting DataFrame will contain more rows (specifically, n*(1 + 2*x), where n is the original number of dates in col1).
How can I do that in a proper Pandonic way?
Output would be (for x=1)
col1
2015-02-01
2015-02-02
2015-02-03
2015-04-04
etc
Thanks!
you can do it this way, but I'm not sure that it's the best / fastest way to do it:
In [143]: df
Out[143]:
col1
0 2015-02-02
1 2015-04-05
2 2016-07-02
In [144]: %paste
N = 2
(df.col1.apply(lambda x: pd.Series(pd.date_range(x - pd.Timedelta(days=N),
                                                 x + pd.Timedelta(days=N))))
   .stack()
   .drop_duplicates()
   .reset_index(level=[0,1], drop=True)
   .to_frame(name='col1')
)
## -- End pasted text --
Out[144]:
col1
0 2015-01-31
1 2015-02-01
2 2015-02-02
3 2015-02-03
4 2015-02-04
5 2015-04-03
6 2015-04-04
7 2015-04-05
8 2015-04-06
9 2015-04-07
10 2016-06-30
11 2016-07-01
12 2016-07-02
13 2016-07-03
14 2016-07-04
Something like this takes a DataFrame with a datetime.date column and stacks a shifted copy of that column underneath the original data:
import datetime
import pandas as pd
df = pd.DataFrame([{'date': datetime.date(2016, 1, 2)},
                   {'date': datetime.date(2016, 1, 1)}], columns=['date'])
df = pd.concat([df.date, df.date + datetime.timedelta(days=1)], ignore_index=True).to_frame()
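That only adds the day after each date. To cover the x days before and after each date that the question asks for, the final pd.concat line can be generalized along these lines (my sketch, not part of the original answer):
x = 1  # days to add on each side of every date
shifted = [df.date + datetime.timedelta(days=d) for d in range(-x, x + 1)]
df = (pd.concat(shifted, ignore_index=True)
        .drop_duplicates()
        .sort_values()
        .reset_index(drop=True)
        .to_frame(name='date'))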