I need to perform a merge to map a new set of ids to an old set of ids. My starting data looks like this:
lst = [10001, 20001, 30001]
dt = pd.date_range(start='2016', end='2018', freq='M')
idx = pd.MultiIndex.from_product([dt,lst],names=['date','id'])
df = pd.DataFrame(np.random.randn(len(idx)), index=idx)
In [94]: df.head()
Out[94]:
                         0
date       id
2016-01-31 10001 -0.512371
           20001 -1.164461
           30001 -1.253232
2016-02-29 10001 -0.129874
           20001  0.711938
And I want to map id to newid using data that looks like this:
df1 = pd.DataFrame({'id': [10001, 10001, 10001, 10001],
                    'start_date': ['2015-11-30', '2016-02-01', '2016-05-16', '2017-02-16'],
                    'end_date': ['2016-01-31', '2016-05-15', '2017-02-15', '2018-04-02'],
                    'new_id': ['ABC123', 'XYZ789', 'HIJ456', 'LMN654']})
df2 = pd.DataFrame({'id': [20001, 20001, 20001, 20001],
                    'start_date': ['2015-10-07', '2016-01-08', '2016-06-02', '2017-02-13'],
                    'end_date': ['2016-01-07', '2016-06-01', '2017-02-12', '2018-03-17'],
                    'new_id': ['CBA321', 'ZYX987', 'JIH765', 'NML345']})
df3 = pd.DataFrame({'id': [30001, 30001, 30001, 30001],
                    'start_date': ['2015-07-31', '2016-02-23', '2016-06-17', '2017-05-12'],
                    'end_date': ['2016-02-22', '2016-06-16', '2017-05-11', '2018-01-05'],
                    'new_id': ['CCC333', 'XXX444', 'HHH888', 'III888']})
df_ranges = pd.concat([df1,df2,df3])
In [95]: df_ranges.head()
Out[95]:
   index    end_date     id  new_id  start_date
0      0  2016-01-31  10001  ABC123  2015-11-30
1      1  2016-05-15  10001  XYZ789  2016-02-01
2      2  2017-02-15  10001  HIJ456  2016-05-16
3      3  2018-04-02  10001  LMN654  2017-02-16
4      0  2016-01-07  20001  CBA321  2015-10-07
Basically, my data is monthly panel data, and the mapping data has ranges of dates for which a specific mapping from A->B is valid. So the first row of the mapping data says that from 2015-11-30 through 2016-01-31 the id 10001 maps to ABC123.
I've previously done this in SAS/SQL with a statement like this:
SELECT a.*, b.newid FROM df as a, df_ranges as b
WHERE a.id = b.id AND b.start_date <= a.date < b.end_date
A few notes about the data:
- it should be a 1:1 mapping of id to new_id
- the date ranges are non-overlapping
The solution here may be a good start: Merging dataframes based on date range
It is exactly what I'm looking for except that it merges only on dates, not additionally on id. I played with groupby() and this solution but didn't find a way to make it work. Another idea I had was to unstack() the mapping data (df_ranges) to match the dimensions/time frequency of df but this seems to simply re-state the existing problem.
Perhaps I got downvoted because this was too easy, but I couldn't find the answer anywhere, so I'll just post it here: you should use merge_asof(), which provides fuzzy matching on dates.
First, the data need to be sorted:
df_ranges.sort_values(by=['start_date','id'],inplace=True)
df.sort_values(by=['date','id'],inplace=True)
Then, do the merge:
pd.merge_asof(df,df_ranges, by='id', left_on='date', right_on='start_date')
Output:
In [30]: pd.merge_asof(df,df_ranges, by='id', left_on='date', right_on='start_date').head()
Out[30]:
        date     id         0 start_date   end_date  new_id
0 2016-01-31  10001  0.120892 2015-11-30 2016-01-31  ABC123
1 2016-01-31  20001 -0.576096 2016-01-08 2016-06-01  ZYX987
2 2016-01-31  30001  0.543597 2015-07-31 2016-02-22  CCC333
3 2016-02-29  10001  0.316212 2016-02-01 2016-05-15  XYZ789
4 2016-02-29  20001 -0.625878 2016-01-08 2016-06-01  ZYX987
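For completeness, here is a minimal end-to-end sketch of the same approach. The reset_index() and pd.to_datetime calls are my additions: merge_asof matches on columns rather than index levels, needs both keys to be real datetimes, and requires both sides to be sorted on them.
import pandas as pd

left = df.reset_index()                  # 'date' and 'id' become columns (df as built above)
right = df_ranges.copy()
right['start_date'] = pd.to_datetime(right['start_date'])
right['end_date'] = pd.to_datetime(right['end_date'])

left = left.sort_values(['date', 'id'])
right = right.sort_values(['start_date', 'id'])

# Like the SQL, this only enforces start_date <= date; with non-overlapping
# ranges the matched end_date can be checked afterwards if needed.
result = pd.merge_asof(left, right, by='id',
                       left_on='date', right_on='start_date')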
Related
I have a pandas dataframe consisting of transactional data that looks like the below:
Customer_ID  Day         Sales
1            2018-08-01  80.11
2            2019-01-07  10.15
2            2021-02-21  74.15
1            2019-06-18  10.00
3            2020-03-17  15.15
2            2020-04-29  80.98
4            2016-06-01  133.54
3            2022-01-14  17.15
2            2021-02-28  25.12
1            2021-01-02  1.22
I need to calculate the forward rolling 365 day sum of sales grouped by the customer, exclusive of the current day. I would like to insert the result as a new column.
e.g. for customer_id == 1 for the first row, the value to be inserted in the new column will be the sum of sales between 2018-08-02 and 2019-08-01 for customer_id == 1.
I'm sure there's a way to do this using pandas, but I can't figure it out.
Code to produce the dataframe:
df = pd.DataFrame({
    'Customer_ID': [1, 2, 2, 1, 3, 2, 4, 3, 2, 1],
    'Day': ['2018-08-01', '2019-01-07', '2021-02-21', '2019-06-18', '2020-03-17', '2020-04-29', '2016-06-01', '2022-01-14', '2021-02-28', '2021-01-02'],
    'Sales': [80.11, 10.15, 74.15, 10.00, 15.15, 80.98, 133.54, 17.15, 25.12, 1.22]
})
df.Day = pd.to_datetime(df.Day)
You first need to groupby the Customer_ID column, then perform a rolling sum on each group after you set the 'Day' column as each group's index.
(df.groupby('Customer_ID')
   .apply(
       lambda gr: gr.set_index('Day').sort_index()['Sales'].rolling('365D').sum()
   )
   .reset_index())
There is probably a better way to do this, but this is the simplest one for me.
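Note that rolling('365D') looks backward from each row and includes the current day. If you need the forward window that excludes the current day, as the question asks, one straightforward (if not the fastest) sketch is to mask each customer's dates explicitly; the column name Sales_next_365d is just a placeholder I chose:
import numpy as np
import pandas as pd

def forward_365_sum(group):
    # For each row, sum Sales strictly after that day and up to 365 days later.
    days = group['Day'].to_numpy()
    sales = group['Sales'].to_numpy()
    sums = [sales[(days > d) & (days <= d + np.timedelta64(365, 'D'))].sum()
            for d in days]
    return pd.Series(sums, index=group.index)

df['Sales_next_365d'] = df.groupby('Customer_ID', group_keys=False).apply(forward_365_sum)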
Edit: You can use the alleged duplicate solution with reindex() if your dates don't include times; otherwise you need a solution like the one by #kosnik. In addition, their solution doesn't need your dates to be the index!
I have data formatted like this
df = pd.DataFrame(data=[['2017-02-12 20:25:00', 'Sam', '8'],
                        ['2017-02-15 16:33:00', 'Scott', '10'],
                        ['2017-02-15 16:45:00', 'Steve', '5']],
                  columns=['Datetime', 'Sender', 'Count'])
df['Datetime'] = pd.to_datetime(df['Datetime'], format='%Y-%m-%d %H:%M:%S')
Datetime Sender Count
0 2017-02-12 20:25:00 Sam 8
1 2017-02-15 16:33:00 Scott 10
2 2017-02-15 16:45:00 Steve 5
I need there to be at least one row for every date, so the expected result would be
Datetime Sender Count
0 2017-02-12 20:25:00 Sam 8
1 2017-02-13 00:00:00 None 0
2 2017-02-14 00:00:00 None 0
3 2017-02-15 16:33:00 Scott 10
4 2017-02-15 16:45:00 Steve 5
I have tried to make datetime the index, add the dates and use reindex() like so
df.index = df['Datetime']
values = df['Datetime'].tolist()
for i in range(len(values)-1):
    if values[i].date() + timedelta(days=1) < values[i+1].date():
        values.insert(i+1, pd.Timestamp(values[i].date() + timedelta(days=1)))
print(df.reindex(values, fill_value=0))
This makes every row forget about the other columns and the same thing happens for asfreq('D') or resample()
ID Sender Count
Datetime
2017-02-12 16:25:00 0 Sam 8
2017-02-13 00:00:00 0 0 0
2017-02-14 00:00:00 0 0 0
2017-02-15 20:25:00 0 0 0
2017-02-15 20:25:00 0 0 0
What would be the appropriate way of going about this?
I would create a new DataFrame that contains all the required datetimes and then left join it with your data frame.
A working code example is the following:
import datetime
import numpy as np

df['Datetime'] = pd.to_datetime(df['Datetime'])  # first convert to datetimes
datetimes = df['Datetime'].tolist()              # existing datetimes - the missing ones are added below
dates = [x.date() for x in datetimes]            # converted to plain dates
min_date = min(dates)
max_date = max(dates)
for d in range((max_date - min_date).days):
    forward_date = min_date + datetime.timedelta(d)
    if forward_date not in dates:
        datetimes.append(np.datetime64(forward_date))
# create a new dataframe, merge, and fill the 'Count' column with zeroes
df = pd.DataFrame({'Datetime': datetimes}).merge(df, on='Datetime', how='left')
df['Count'].fillna(0, inplace=True)
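One small follow-up (my addition): the merge appends the filler dates after the original rows and leaves Sender as NaN, so you may want to sort the result afterwards:
# put the inserted dates back into chronological order
df = df.sort_values('Datetime').reset_index(drop=True)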
I have the following sample dataset
df = pd.DataFrame({
    'names': ['joe', 'joe', 'joe'],
    'dates': [dt.datetime(2019,6,1), dt.datetime(2019,6,5), dt.datetime(2019,7,1)],
    'values': [5,2,13]
})
and I want to group by names and by weeks or 7 days, which I can achieve with
df_grouped = df.groupby(['names', pd.Grouper(key='dates', freq='7d')]).sum()
values
names dates
joe 2019-06-01 7
2019-06-29 13
But what I'm looking for is something like this, with all the intermediate dates shown explicitly:
values
names dates
joe 2019-06-01 7
2019-06-08 0
2019-06-15 0
2019-06-22 0
2019-06-29 13
And by doing df_grouped.index.levels[1] I see that all those intermediate dates are actually in the index, so maybe that's something I can leverage.
Any ideas on how to achieve this?
Thanks
Use DataFrameGroupBy.resample with DatetimeIndex:
df_grouped = df.set_index('dates').groupby('names').resample('7D').sum()
print (df_grouped)
values
names dates
joe 2019-06-01 7
2019-06-08 0
2019-06-15 0
2019-06-22 0
2019-06-29 13
I have this DataFrame:
dft2 = pd.DataFrame(np.random.randn(20, 1),
                    columns=['A'],
                    index=pd.MultiIndex.from_product([pd.date_range('20130101',
                                                                    periods=10,
                                                                    freq='4M'),
                                                      ['a', 'b']]))
That looks like this when I print it.
Output:
A
2013-01-31 a 0.275921
b 1.336497
2013-05-31 a 1.040245
b 0.716865
2013-09-30 a -2.697420
b -1.570267
2014-01-31 a 1.326194
b -0.209718
2014-05-31 a -1.030777
b 0.401654
2014-09-30 a 1.138958
b -1.162370
2015-01-31 a 1.770279
b 0.606219
2015-05-31 a -0.819126
b -0.967827
2015-09-30 a -1.423667
b 0.894103
2016-01-31 a 1.765187
b -0.334844
How do I filter for the rows whose date is the minimum of its year, like 2013-01-31 and 2014-01-31?
Thanks.
# Create dataframe from the dates in the first level of the index.
df = pd.DataFrame(dft2.index.get_level_values(0), columns=['date'], index=dft2.index)
# Add a `year` column that gets the year of each date.
df = df.assign(year=[d.year for d in df['date']])
# Find the minimum date of each year by grouping.
min_annual_dates = df.groupby('year')['date'].min().tolist()
# Filter the original dataframe based on these minimum dates by year.
>>> dft2.loc[(min_annual_dates, slice(None)), :]
A
2013-01-31 a 1.087274
b 1.488553
2014-01-31 a 0.119801
b 0.922468
2015-01-31 a -0.262440
b 0.642201
2016-01-31 a 1.144664
b 0.410701
Or you can try using isin
dft1=dft2.reset_index()
dft1['Year']=dft1.level_0.dt.year
dft1=dft1.groupby('Year')['level_0'].min()
dft2[dft2.index.get_level_values(0).isin(dft1.values)]
Out[2250]:
A
2013-01-31 a -1.072400
b 0.660115
2014-01-31 a -0.134245
b 1.344941
2015-01-31 a 0.176067
b -1.792567
2016-01-31 a 0.033230
b -0.960175
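Combining the two ideas above, a slightly more compact variant (my sketch, not part of either answer) keeps the rows whose level-0 date equals the minimum date of its year:
dates = dft2.index.get_level_values(0)
annual_min = pd.Series(dates).groupby(dates.year).min()  # earliest date per year
result = dft2[dates.isin(annual_min)]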
I have a Pandas DataFrame that looks like
col1
2015-02-02
2015-04-05
2016-07-02
I would like to add, for each date in col1, the x days before and x days after that date.
That means that the resulting DataFrame will contain more rows (specifically, n*(1 + 2*x), where n is the original number of dates in col1).
How can I do that in a proper Pandonic way?
Output would be (for x=1)
col1
2015-02-01
2015-02-02
2015-02-03
2015-04-04
etc
Thanks!
you can do it this way, but I'm not sure that it's the best / fastest way to do it:
In [143]: df
Out[143]:
col1
0 2015-02-02
1 2015-04-05
2 2016-07-02
In [144]: %paste
N = 2
(df.col1.apply(lambda x: pd.Series(pd.date_range(x - pd.Timedelta(days=N),
                                                 x + pd.Timedelta(days=N))))
   .stack()
   .drop_duplicates()
   .reset_index(level=[0,1], drop=True)
   .to_frame(name='col1')
)
## -- End pasted text --
Out[144]:
col1
0 2015-01-31
1 2015-02-01
2 2015-02-02
3 2015-02-03
4 2015-02-04
5 2015-04-03
6 2015-04-04
7 2015-04-05
8 2015-04-06
9 2015-04-07
10 2016-06-30
11 2016-07-01
12 2016-07-02
13 2016-07-03
14 2016-07-04
Something like this takes a DataFrame with a datetime.date column and stacks a shifted copy of that column underneath the original data:
import datetime
import pandas as pd
df = pd.DataFrame([{'date': datetime.date(2016, 1, 2)},
                   {'date': datetime.date(2016, 1, 1)}], columns=['date'])
df = pd.concat([df.date, df.date + datetime.timedelta(days=1)], ignore_index=True).to_frame()
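That only adds the day after each date. To cover the x days before and after each date that the question asks for, the final pd.concat line can be generalized along these lines (my sketch, not part of the original answer):
x = 1  # days to add on each side of every date
shifted = [df.date + datetime.timedelta(days=d) for d in range(-x, x + 1)]
df = (pd.concat(shifted, ignore_index=True)
        .drop_duplicates()
        .sort_values()
        .reset_index(drop=True)
        .to_frame(name='date'))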