Merge pandas df based on 2 keys - python

I have 2 DataFrames and I would like to merge them on 2 keys, ID and date.
The following is just a small slice of each df:
df_pw6
ID date pw10_0 pw50_0 pw90_0
0 153 2018-01-08 27.88590 43.2872 58.2024
0 2 2018-01-05 11.03610 21.4879 31.6997
0 506 2018-01-08 6.98468 25.3899 45.9486
df_ex
date ID measure f188 f187 f186 f185
0 2017-07-03 501 NaN 1 0.5 7 4.0
1 2017-07-03 502 NaN 0 2.5 5 3.0
2 2018-01-08 506 NaN 5 9.0 9 1.2
As you can see, only the third row has a match.
When I type:
#check date
df_ex.iloc[2,0]== df_pw6.iloc[1,1]
True
#check ID
df_ex.iloc[2,1] == df_pw6.iloc[2,0]
True
Now I try to merge them:
df19 = pd.merge(df_pw6,df_ex,on=['date','ID'])
I get an empty df
When I try:
df19 = pd.merge(df_pw6,df_ex,how ='left',on=['date','ID'])
I get:
ID date pw10_0 pw50_0 pw90_0 measure f188 f187 f186 f185
0 153 2018-01-08 00:00:00 27.88590 43.2872 58.2024 NaN NaN NaN NaN NaN
1 2 2018-01-05 00:00:00 11.03610 21.4879 31.6997 NaN NaN NaN NaN NaN
2 506 2018-01-08 00:00:00 6.98468 25.3899 45.9486 NaN NaN NaN NaN NaN
My desired result should be:
> ID date pw10_0 pw50_0 pw90_0 measure f188 f187 f186 f185
>
> 0 506 2018-01-08 00:00:00 6.98468 25.3899 45.9486 NaN 5 9.0 9 1.2

I ran your code after your edit and got the desired result.
import pandas as pd
# copy paste your first df by hand
pw = pd.read_clipboard()
# copy paste your second df by hand
ex = pd.read_clipboard()
pd.merge(pw,ex,on=['date','ID'])
# output [edited. now it is the correct result OP wanted.]
ID date pw10_0 pw50_0 pw90_0 measure f188 f187 f186 f185
0 506 2018-01-08 6.98468 25.3899 45.9486 NaN 5 9.0 9 1.2
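One common cause of an empty inner merge like the one in the question is that the key columns hold different Python types (for example Timestamp objects on one side and plain strings on the other), even though they print the same. The sketch below rebuilds a cut-down version of the two frames under that assumption (the mixed object-dtype date columns are a guess, the question does not confirm them) and normalizes both keys before merging.

```python
import pandas as pd

# Assumed setup: both date columns are object dtype, but one holds Timestamps
# and the other holds strings, so the keys never compare equal in a merge.
df_pw6 = pd.DataFrame({
    "ID": [153, 2, 506],
    "date": pd.Series(
        pd.to_datetime(["2018-01-08", "2018-01-05", "2018-01-08"])
    ).astype(object),
    "pw10_0": [27.88590, 11.03610, 6.98468],
})
df_ex = pd.DataFrame({
    "date": ["2017-07-03", "2017-07-03", "2018-01-08"],
    "ID": [501, 502, 506],
    "f188": [1, 0, 5],
})

# Inspect the key dtypes first; "object" hides what is actually stored.
print(df_pw6.dtypes, df_ex.dtypes, sep="\n\n")

# Bring both keys to the same dtype on both sides.
for frame in (df_pw6, df_ex):
    frame["date"] = pd.to_datetime(frame["date"])
    frame["ID"] = frame["ID"].astype("int64")

merged = pd.merge(df_pw6, df_ex, on=["date", "ID"])
print(merged)
```

If the merge is still empty after this, passing how='outer' and indicator=True to pd.merge shows which side each unmatched row came from.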

Related

Generate date ranges using group by + apply in pandas

I want to imitate prophet make_future_dataframe() functionality for multiple groups in a pandas dataframe.
If I would like to create a date range as a separate column I could do:
import pandas as pd
my_dataframe['prediction_range'] = pd.date_range(start=my_dataframe['date_column'].min(),
periods=48,
freq='M')
However my dataframe has the following structure:
id feature1 feature2 date_column
1 0 4.3 2022-01-01
2 0 3.3 2022-01-01
3 0 2.2 2022-01-01
4 1034 1.11 2022-01-01
5 1090 0.98 2022-01-01
6 1078 0 2022-01-01
I wanted to do the following:
def generate_date_range(date_column, data):
dates = pd.date_range(start=data[date_column].unique()[0],
periods=48,
freq='M')
return dates
And then:
my_dataframe = my_dataframe.groupby('id').apply(generate_date_range('date_column', my_dataframe))
But I am getting the following:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/anaconda/envs/scoring_env/lib/python3.9/site-packages/pandas/core/groupby/groupby.py", line 1377, in apply
func = com.is_builtin_func(func)
File "/anaconda/envs/scoring_env/lib/python3.9/site-packages/pandas/core/common.py", line 615, in is_builtin_func
return _builtin_table.get(arg, arg)
TypeError: unhashable type: 'DatetimeIndex'
I am not sure if I am approaching the problem in the right way. I have also done this with a MultiIndex:
multi_index = pd.MultiIndex.from_product([pd.Index(file['id'].unique()), dates], names=('customer', 'prediction_date'))
And then reindexing and filling the NaNs, but I am not able to understand why the apply version does not work.
The desired output is:
id feature1 feature2 date_column prediction_date
1 0 4.3 2022-01-01 2022-03-01
1 0 4.3 2022-01-01 2022-04-01
1 0 4.3 2022-01-01 2022-05-01
1 0 4.3 2022-01-01 2022-06-01
--- Up to 48 periods --
2 0 3.3 2022-01-01 2022-03-01
2 0 3.3 2022-01-01 2022-04-01
2 0 3.3 2022-01-01 2022-05-01
2 0 3.3 2022-01-01 2022-06-01
Try a list comprehension over your groupby object in which you reindex each group's dates, then forward fill the id:
df['date_column'] = pd.to_datetime(df['date_column'])
df = df.set_index('date_column')
new_df = pd.concat([g.reindex(pd.date_range(g.index.min(), periods=48, freq='MS'))
for _,g in df.groupby('id')])
new_df['id'] = new_df['id'].ffill().astype(int)
id feature1 feature2
2022-01-01 1 0.0 4.3
2022-02-01 1 NaN NaN
2022-03-01 1 NaN NaN
2022-04-01 1 NaN NaN
2022-05-01 1 NaN NaN
... .. ... ...
2025-08-01 6 NaN NaN
2025-09-01 6 NaN NaN
2025-10-01 6 NaN NaN
2025-11-01 6 NaN NaN
2025-12-01 6 NaN NaN
Update
If there is only one record for each ID we can do the following. If there is more than one record for each ID, we will need to keep only the min value for each ID, perform the task below, and merge everything back together.
# make sure your date is datetime
df['date_column'] = pd.to_datetime(df['date_column'])
# use index.repeat for the number of months you want
# in this case we will offset the min date for 48 months
new_df = df.reindex(df.index.repeat(48)).reset_index(drop=True)
# groupby the id, cumcount and set the type so we can offset
new_df['date_column'] = new_df['date_column'].values.astype('datetime64[M]') + \
new_df.groupby('id')['date_column'].cumcount().values.astype('timedelta64[M]')
id feature1 feature2 date_column
0 1 0 4.3 2022-01-01
1 1 0 4.3 2022-02-01
2 1 0 4.3 2022-03-01
3 1 0 4.3 2022-04-01
4 1 0 4.3 2022-05-01
.. .. ... ... ...
283 6 1078 0.0 2025-08-01
284 6 1078 0.0 2025-09-01
285 6 1078 0.0 2025-10-01
286 6 1078 0.0 2025-11-01
287 6 1078 0.0 2025-12-01
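On why the apply version in the question fails: the function is called right away, so groupby.apply receives its return value (a DatetimeIndex) instead of a callable, which is what the TypeError is complaining about. A minimal sketch of passing the function itself is below. The helper signature, the three-row sample frame and the final merge are my own illustration, not the asker's code, and I use freq='MS' because the desired output shows month starts.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3],
    "feature1": [0, 0, 1034],
    "feature2": [4.3, 3.3, 1.11],
    "date_column": pd.to_datetime(["2022-01-01"] * 3),
})

def generate_date_range(group, date_column, periods=48):
    # Hypothetical helper: apply passes each group's sub-frame as `group`.
    dates = pd.date_range(start=group[date_column].min(),
                          periods=periods, freq="MS")
    return pd.DataFrame({"prediction_date": dates})

# Pass the function object; extra keyword arguments are forwarded by apply.
ranges = (
    df.groupby("id")[["date_column"]]
      .apply(generate_date_range, date_column="date_column")
      .reset_index(level="id")
      .reset_index(drop=True)
)

# Attach the original features to every generated date.
result = df.merge(ranges, on="id")
print(result.head())
```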

How to create a new column with the last value of the previous year

I have this data frame
import pandas as pd
df = pd.DataFrame({'COTA':['A','A','A','A','A','B','B','B','B'],
'Date':['14/10/2021','19/10/2020','29/10/2019','30/09/2021','20/09/2020','20/10/2021','29/10/2020','15/10/2019','10/09/2020'],
'Mark':[1,2,3,4,5,1,2,3,3]
})
print(df)
Based on this data frame, I want the Mark from the previous year. I managed to get the maximum Mark per COTA with .max(), but I want the last one instead. I thought I could get it with .last(), but it didn't work.
Here is my code:
df['Date'] = pd.to_datetime(df['Date'])
df['LastYear'] = df['Date'] - pd.offsets.YearEnd(0)
s1 = df.groupby(['COTA', 'LastYear'])['Mark'].max()
s2 = s1.rename(index=lambda x: x + pd.offsets.DateOffset(years=1), level=1)
df = df.join(s2.rename('Max_MarkLastYear'), on=['COTA', 'LastYear'])
print (df)
COTA Date Mark LastYear Max_MarkLastYear
0 A 2021-10-14 1 2021-12-31 5.0
1 A 2020-10-19 2 2020-12-31 3.0
2 A 2019-10-29 3 2019-12-31 NaN
3 A 2021-09-30 4 2021-12-31 5.0
4 A 2020-09-20 5 2020-12-31 3.0
5 B 2021-10-20 1 2021-12-31 3.0
6 B 2020-10-29 2 2020-12-31 3.0
7 B 2019-10-15 3 2019-12-31 NaN
8 B 2020-10-09 3 2020-12-31 3.0
How do I create a new column with the last value of the previous year?
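A possible sketch, assuming "last" means the Mark with the latest Date within each COTA and calendar year: sort by date first so that .last() is chronological, then shift the year by one and merge back. The Year and Last_MarkLastYear column names are made up for this example, and the dates are parsed with an explicit day-first format, which the code in the question does not do.

```python
import pandas as pd

df = pd.DataFrame({
    "COTA": ["A", "A", "A", "A", "A", "B", "B", "B", "B"],
    "Date": ["14/10/2021", "19/10/2020", "29/10/2019", "30/09/2021",
             "20/09/2020", "20/10/2021", "29/10/2020", "15/10/2019",
             "10/09/2020"],
    "Mark": [1, 2, 3, 4, 5, 1, 2, 3, 3],
})

# Explicit format so 10/09/2020 is read as 10 September, not 9 October.
df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%Y")
df["Year"] = df["Date"].dt.year

# Last Mark per (COTA, Year) in chronological order.
last_per_year = (
    df.sort_values("Date")
      .groupby(["COTA", "Year"])["Mark"]
      .last()
      .rename("Last_MarkLastYear")
      .reset_index()
)

# Relabel each value with the following year, then look it up per row.
last_per_year["Year"] += 1
df = df.merge(last_per_year, on=["COTA", "Year"], how="left")
print(df)
```

Rows from the first year of each COTA have no previous year, so they end up with NaN.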

Leading and Trailing Padding Dates in Pandas DataFrame

This is my dataframe:
df = pd.DataFrame.from_records(data=data, coerce_float=False, index=['date'])
# date field a datetime.datetime values
account_id amount
date
2018-01-01 1 100.0
2018-01-01 1 50.0
2018-06-01 1 200.0
2018-07-01 2 100.0
2018-10-01 2 200.0
Problem description
How can I "pad" my dataframe with leading and trailing "empty dates"? I have tried to reindex on a date_range and a period_range, and I have tried to merge another index. I have tried all sorts of things all day, and I have read a lot of the docs.
I have a simple dataframe with columns transaction_date, transaction_amount, and transaction_account. I want to group this dataframe so that it is grouped by account at the first level, and then by year, and then by month. Then I want a column for each month, with the sum of that month's transaction amount value.
This seems like it should be something that is easy to do.
Expected Output
This is the closest I have gotten:
df = pd.DataFrame.from_records(data=data, coerce_float=False, index=['date'])
df = df.groupby(['account_id', df.index.year, df.index.month])
df = df.resample('M').sum().fillna(0)
print(df)
account_id amount
account_id date date date
1 2018 1 2018-01-31 2 150.0
6 2018-06-30 1 200.0
2 2018 7 2018-07-31 2 100.0
10 2018-10-31 2 200.0
And this is what I want to achieve (basically reindex the data by date_range(start='2018-01-01', periods=12, freq='M'))
(Ideally I would want the month to be transposed by year across the top as columns)
amount
account_id Year Month
1 2018 1 150.0
2 NaN
3 NaN
4 NaN
5 NaN
6 200.0
....
12 200.0
2 2018 1 NaN
....
7 100.0
....
10 200.0
....
12 NaN
One way is to reindex:
s=df.groupby([df['account_id'],df.index.year,df.index.month]).sum()
idx=pd.MultiIndex.from_product([s.index.levels[0],s.index.levels[1],list(range(1,13))])
s=s.reindex(idx)
s
Out[287]:
amount
1 2018 1 150.0
2 NaN
3 NaN
4 NaN
5 NaN
6 200.0
7 NaN
8 NaN
9 NaN
10 NaN
11 NaN
12 NaN
2 2018 1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 100.0
8 NaN
9 NaN
10 200.0
11 NaN
12 NaN
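For the "months across the top" layout the question mentions, a related sketch is to unstack the month level and reindex the columns to 1..12. This is a different technique from the MultiIndex.from_product reindex above, and the sample frame is rebuilt by hand from the question's printout, so treat it as an illustration only.

```python
import pandas as pd

# Sample data copied from the question, with the date as the index.
df = pd.DataFrame(
    {"account_id": [1, 1, 1, 2, 2],
     "amount": [100.0, 50.0, 200.0, 100.0, 200.0]},
    index=pd.to_datetime(["2018-01-01", "2018-01-01", "2018-06-01",
                          "2018-07-01", "2018-10-01"]).rename("date"),
)

# Sum per account, year and month, and name the index levels explicitly.
s = (
    df.groupby([df["account_id"], df.index.year, df.index.month])["amount"]
      .sum()
      .rename_axis(["account_id", "Year", "Month"])
)

# Months become columns; reindex pads the months that never occur.
wide = s.unstack("Month").reindex(columns=range(1, 13))
print(wide)
```

Only years that actually occur for an account show up as rows here; a year with no transactions at all would still need an explicit reindex of the rows.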

How to use loop to count the number of nan

There are a lot of stations in a csv file, and I don't know how to use a loop to count the number of NaN values for every station. Here is what I have so far, counting one station at a time. Can someone help me please? Thank you in advance.
station1= train_df[train_df['station'] == 28079004]
station1 = station1[['date', 'O_3']]
count_nan = len(station1) - station1.count()
print(count_nan)
I think you need to create an index from the station column with set_index, select the columns to check for missing values, and finally count them with sum:
import numpy as np
import pandas as pd
train_df = pd.DataFrame({'B':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3],
'date':pd.date_range('2015-01-01', periods=6),
'O_3':[np.nan,3,np.nan,9,2,np.nan],
'station':[28079004] * 2 + [28079005] * 4})
print (train_df)
B C date O_3 station
0 4 7 2015-01-01 NaN 28079004
1 5 8 2015-01-02 3.0 28079004
2 4 9 2015-01-03 NaN 28079005
3 5 4 2015-01-04 9.0 28079005
4 5 2 2015-01-05 2.0 28079005
5 4 3 2015-01-06 NaN 28079005
# sum(level=0) was removed in pandas 2.0; group on the index level instead
df = train_df.set_index('station')[['date', 'O_3']].isnull().groupby(level=0).sum().astype(int)
print (df)
date O_3
station
28079004 0 1
28079005 0 2
Another solution:
df = train_df[['date', 'O_3']].isnull().groupby(train_df['station']).sum().astype(int)
print (df)
date O_3
station
28079004 0 1
28079005 0 2
jez already answered, and that answer is probably better here, but this is what a groupby with apply would look like:
import pandas as pd
import numpy as np
np.random.seed(444)
n = 10
train_df = pd.DataFrame({
'station': np.random.choice(np.arange(28079004,28079008), size=n),
'date': pd.date_range('2018-01-01', periods=n),
'O_3': np.random.choice([np.nan,1], size=n)
})
print(train_df)
s = train_df.groupby('station')['O_3'].apply(lambda x: x.isna().sum())
print(s)
prints:
station date O_3
0 28079007 2018-01-01 NaN
1 28079004 2018-01-02 1.0
2 28079007 2018-01-03 NaN
3 28079004 2018-01-04 NaN
4 28079007 2018-01-05 NaN
5 28079004 2018-01-06 1.0
6 28079007 2018-01-07 NaN
7 28079004 2018-01-08 NaN
8 28079006 2018-01-09 NaN
9 28079007 2018-01-10 1.0
And the output (s):
station
28079004 2
28079006 1
28079007 4
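Since the question asks specifically for a loop, here is a small sketch of the same count written as an explicit loop over the groups, for comparison with the vectorized versions above. The sample frame reuses the values from the first answer, and the nan_counts dictionary is just a name chosen for this example.

```python
import numpy as np
import pandas as pd

train_df = pd.DataFrame({
    "station": [28079004] * 2 + [28079005] * 4,
    "date": pd.date_range("2015-01-01", periods=6),
    "O_3": [np.nan, 3, np.nan, 9, 2, np.nan],
})

# Iterating a groupby yields (key, sub-frame) pairs, one per station.
nan_counts = {}
for station, group in train_df.groupby("station"):
    nan_counts[int(station)] = int(group["O_3"].isna().sum())

print(nan_counts)
```

The groupby-based one-liners avoid the Python-level loop, so they are usually the better choice for many stations.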

Elegant resample for groups in Pandas

For a given pandas data frame called full_df which looks like
index id timestamp data
------- ---- ------------ ------
1 1 2017-01-01 10.0
2 1 2017-02-01 11.0
3 1 2017-04-01 13.0
4 2 2017-02-01 1.0
5 2 2017-03-01 2.0
6 2 2017-05-01 9.0
The start and end dates (and the time delta between start and end) are varying.
But I need a id wise resampled version (added rows marked with *)
index id timestamp data
------- ---- ------------ ------ ----
1 1 2017-01-01 10.0
2 1 2017-02-01 11.0
3 1 2017-03-01 NaN *
4 1 2017-04-01 13.0
5 2 2017-02-01 1.0
6 2 2017-03-01 2.0
7 2 2017-04-01 NaN *
8 2 2017-05-01 9.0
Because the dataset is very large I was wondering if there is more efficient way of doing so than
Do full_df.groupby('id')
Do for each group df
df.index = pd.DatetimeIndex(df['timestamp'])
all_days = pd.date_range(df.index.min(), df.index.max(), freq='MS')
df = df.reindex(all_days)
Combine all groups again with a new index
That's time consuming and not very elegant. Any ideas?
Using resample
In [1175]: (df.set_index('timestamp').groupby('id').resample('MS').asfreq()
.drop(['id', 'index'], axis=1).reset_index())
Out[1175]:
id timestamp data
0 1 2017-01-01 10.0
1 1 2017-02-01 11.0
2 1 2017-03-01 NaN
3 1 2017-04-01 13.0
4 2 2017-02-01 1.0
5 2 2017-03-01 2.0
6 2 2017-04-01 NaN
7 2 2017-05-01 9.0
Details
In [1176]: df
Out[1176]:
index id timestamp data
0 1 1 2017-01-01 10.0
1 2 1 2017-02-01 11.0
2 3 1 2017-04-01 13.0
3 4 2 2017-02-01 1.0
4 5 2 2017-03-01 2.0
5 6 2 2017-05-01 9.0
In [1177]: df.dtypes
Out[1177]:
index int64
id int64
timestamp datetime64[ns]
data float64
dtype: object
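A self-contained variant of the same idea is sketched below. It selects the data column before resampling, so there is nothing to drop afterwards. The frame is rebuilt from the question's sample, and the column selection is my own tweak rather than part of the original answer.

```python
import pandas as pd

full_df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2],
    "timestamp": pd.to_datetime(["2017-01-01", "2017-02-01", "2017-04-01",
                                 "2017-02-01", "2017-03-01", "2017-05-01"]),
    "data": [10.0, 11.0, 13.0, 1.0, 2.0, 9.0],
})

# Resample each id separately between its own first and last timestamp;
# asfreq() inserts the missing month starts with NaN.
resampled = (
    full_df.set_index("timestamp")
           .groupby("id")["data"]
           .resample("MS")
           .asfreq()
           .reset_index(name="data")
)
print(resampled)
```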
Edit to add: this approach uses the min/max dates of full_df as a whole, not of each id's df. If there is wide variation in start/end dates between IDs, this will unfortunately inflate the dataframe, and @JohnGalt's method is better. Nevertheless I'll leave this here as an alternate approach, as it ought to be faster than groupby/resample for cases where it is appropriate.
I think the most efficient approach is likely going to be with stack/unstack or melt/pivot.
You could do something like this, for example:
full_df.set_index(['timestamp','id']).unstack('id').stack('id',dropna=False)
index data
timestamp id
2017-01-01 1 1.0 10.0
2 NaN NaN
2017-02-01 1 2.0 11.0
2 4.0 1.0
2017-03-01 1 NaN NaN
2 5.0 2.0
2017-04-01 1 3.0 13.0
2 NaN NaN
2017-05-01 1 NaN NaN
2 6.0 9.0
Just add reset_index().set_index('id') if you want it to display more like how you have it above. Note in particular the use of dropna=False with stack which preserves the NaN placeholders. Without that, the stack/unstack method just leaves you back where you started.
This method automatically includes the min & max dates, and all dates present for at least one timestamp. If there are interior timestamps missing for everyone, then you need to add a resample like this:
full_df.set_index(['timestamp','id']).unstack('id')\
.resample('MS').mean()\
.stack('id',dropna=False)
