How to merge on multiple keys and add remaining info on conditions? - python

I have a simple DataFrame on which I am performing a merge. It has three columns: Id, Value and Year. I have another DataFrame with the same Id, a different Year and some names. For a simple example, df1 looks like this:
Id Value Year
1 10 2010
6 11 2020
3 12 2019
4 15 2018
2 17 2017
and df2 looks like this:
Id names Year
1 bs 2017
2 fs 2017
6 td 2020
4 dh 2018
3 sv 2019
So I'm merging with:
df3 = pd.merge(df1, df2, left_on=['Id', 'Year'],right_on=['Id', 'Year'],how='left')
The answer I want to get is this but I don't know how to do it:
Id Value Year names
1 10 2010 bs
6 11 2020 td
3 12 2019 sv
4 15 2018 dh
2 17 2017 fs
So the idea is that rows with years at or below 2017 can be matched against the 2017 data; the dataframe I actually have is much longer.

You can make a temporary column that uses a constant for years <= 2017, and merge on Id and that column:
import numpy as np

df1["tmp"] = np.where(df1["Year"] <= 2017, 1, df1["Year"])
df2["tmp"] = np.where(df2["Year"] <= 2017, 1, df2["Year"])
df3 = pd.merge(df1, df2, on=["Id", "tmp"], how="left")
print(
    df3[["Id", "Value", "Year_x", "names"]].rename(columns={"Year_x": "Year"})
)
Prints:
Id Value Year names
0 1 10 2010 bs
1 6 11 2020 td
2 3 12 2019 sv
3 4 15 2018 dh
4 2 17 2017 fs
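For completeness, here is a self-contained sketch of this approach, with the frames rebuilt from the tables in the question. Selecting only the needed columns from df2 is a small variation that avoids the Year_x/Year_y suffixes:

```python
import numpy as np
import pandas as pd

# Frames reconstructed from the question's tables.
df1 = pd.DataFrame({"Id": [1, 6, 3, 4, 2],
                    "Value": [10, 11, 12, 15, 17],
                    "Year": [2010, 2020, 2019, 2018, 2017]})
df2 = pd.DataFrame({"Id": [1, 2, 6, 4, 3],
                    "names": ["bs", "fs", "td", "dh", "sv"],
                    "Year": [2017, 2017, 2020, 2018, 2019]})

# Collapse every year <= 2017 onto one sentinel value so those rows
# match on Id alone; later years still have to match exactly.
df1["tmp"] = np.where(df1["Year"] <= 2017, 1, df1["Year"])
df2["tmp"] = np.where(df2["Year"] <= 2017, 1, df2["Year"])

result = (df1.merge(df2[["Id", "tmp", "names"]], on=["Id", "tmp"], how="left")
             .drop(columns="tmp"))
print(result)
```

The sentinel 1 is arbitrary; any value that cannot collide with a real Year works.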

Since we are attaching the names column of df2 to df1 by matching Id, we can give the two dataframes the same Id index and join them after dropping the Year column of df2.
We can use .join() with .set_index() as follows:
(df1.set_index('Id')
    .join(df2.set_index('Id').drop(columns='Year'))
).reset_index()
# Result
Id Value Year names
0 1 10 2010 bs
1 6 11 2020 td
2 3 12 2019 sv
3 4 15 2018 dh
4 2 17 2017 fs
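A runnable version of this join, with the frames rebuilt from the question:

```python
import pandas as pd

# Frames reconstructed from the question's tables.
df1 = pd.DataFrame({"Id": [1, 6, 3, 4, 2],
                    "Value": [10, 11, 12, 15, 17],
                    "Year": [2010, 2020, 2019, 2018, 2017]})
df2 = pd.DataFrame({"Id": [1, 2, 6, 4, 3],
                    "names": ["bs", "fs", "td", "dh", "sv"],
                    "Year": [2017, 2017, 2020, 2018, 2019]})

# Align both frames on Id; drop df2's Year so only `names` is brought over.
out = (df1.set_index("Id")
          .join(df2.set_index("Id").drop(columns="Year"))
          .reset_index())
print(out)
```

Note this matches on Id alone, which works here because each Id appears exactly once in df2; with duplicate Ids the join would multiply rows.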

Related

Doing joins between 2 csv files [duplicate]

For df2, which only has data for the year 2019:
type year value
0 a 2019 13
1 b 2019 5
2 c 2019 5
3 d 2019 20
df1 has data for multiple years:
type year value
0 a 2015 12
1 a 2016 2
2 a 2019 3
3 b 2018 50
4 b 2019 10
5 c 2017 1
6 c 2016 5
7 c 2019 8
I need to concatenate them together, replacing df1's 2019 values with df2's values for the same year.
The expected result will look like this:
type date value
0 a 2015 12
1 a 2016 2
2 b 2018 50
3 c 2017 1
4 c 2016 5
5 a 2019 13
6 b 2019 5
7 c 2019 5
8 d 2019 20
The result from pd.concat([df1, df2], ignore_index=True, sort=False) is below, which clearly has multiple values in 2019 for a single type. How should I improve the code? Thank you.
type date value
0 a 2019 13
1 b 2019 5
2 c 2019 5
3 d 2019 20
4 a 2015 12
5 a 2016 2
6 a 2019 3
7 b 2018 50
8 b 2019 10
9 c 2017 1
10 c 2016 5
11 c 2019 8
Add DataFrame.drop_duplicates to keep the last row per type and date after the concat.
This solution works if the type and date pairs are unique within each DataFrame.
df = (pd.concat([df1, df2], ignore_index=True, sort=False)
        .drop_duplicates(['type', 'date'], keep='last'))
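Put together with the sample data, a runnable sketch (the year column is named date here, matching the expected output shown in the question):

```python
import pandas as pd

# Frames reconstructed from the question's tables.
df1 = pd.DataFrame({"type": list("aaabbccc"),
                    "date": [2015, 2016, 2019, 2018, 2019, 2017, 2016, 2019],
                    "value": [12, 2, 3, 50, 10, 1, 5, 8]})
df2 = pd.DataFrame({"type": list("abcd"),
                    "date": [2019] * 4,
                    "value": [13, 5, 5, 20]})

# df2 is concatenated last, so keep='last' lets its 2019 rows
# win over df1's for the same (type, date) pair.
df = (pd.concat([df1, df2], ignore_index=True, sort=False)
        .drop_duplicates(["type", "date"], keep="last"))
print(df)
```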

merging two csv using python [duplicate]


Creating subsets of df using pandas groupby and getting a value based on a function

I have a df similar to the one below. Within each group of df['ID'], I need to select the row where df['Year 2'] is equal or closest to df['Year'], so in this example rows 1, 2 and 5.
df
Year ID A Year 2 C
0 2020 12 0 2019 0
1 2020 12 0 2020 0 <-
2 2017 10 1 2017 0 <-
3 2017 10 0 2018 0
4 2019 6 0 2017 0
5 2019 6 1 2018 0 <-
I am trying to achieve that with the following code, using groupby and passing a function to get the row with the closest value across the two columns.
df1 = df.groupby(['ID']).apply(min(df['Year 2'], key=lambda x:abs(x-df['Year'].min())))
This particular line raises 'int' object is not callable. Any ideas on how to fix this line of code, or a fresh approach to the problem, are appreciated.
TYIA.
You can subtract the two columns with Series.sub, take the absolute value, and get the index of the per-group minimum with DataFrameGroupBy.idxmin:
idx = df['Year 2'].sub(df['Year']).abs().groupby(df['ID']).idxmin()
If you need a new boolean column, use Index.isin:
df['new'] = df.index.isin(idx)
print (df)
Year ID A Year 2 C new
0 2020 12 0 2019 0 False
1 2020 12 0 2020 0 True
2 2017 10 1 2017 0 True
3 2017 10 0 2018 0 False
4 2019 6 0 2017 0 False
5 2019 6 1 2018 0 True
If you need to filter the rows, use DataFrame.loc:
df1 = df.loc[idx]
print (df1)
Year ID A Year 2 C
5 2019 6 1 2018 0
2 2017 10 1 2017 0
1 2020 12 0 2020 0
One-line solution:
df1 = df.loc[df['Year 2'].sub(df['Year']).abs().groupby(df['ID']).idxmin()]
You could get the idxmin per group:
idx = (df['Year 2']-df['Year']).abs().groupby(df['ID']).idxmin()
# assignment for test
df.loc[idx, 'D'] = '<-'
for selection only:
df2 = df.loc[idx]
output:
Year ID A Year 2 C D
0 2020 12 0 2019 0 NaN
1 2020 12 0 2020 0 <-
2 2017 10 1 2017 0 <-
3 2017 10 0 2018 0 NaN
4 2019 6 0 2017 0 NaN
5 2019 6 1 2018 0 <-
Note that there is a difference between:
df.loc[df.index.isin(idx)]
which gets all rows that achieve the minimum, and:
df.loc[idx]
which gets only the first match per group.
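The idxmin approach end to end, with the example frame reconstructed from the question:

```python
import pandas as pd

# Frame rebuilt from the question's table.
df = pd.DataFrame({"Year":   [2020, 2020, 2017, 2017, 2019, 2019],
                   "ID":     [12, 12, 10, 10, 6, 6],
                   "A":      [0, 0, 1, 0, 0, 1],
                   "Year 2": [2019, 2020, 2017, 2018, 2017, 2018],
                   "C":      [0] * 6})

# Index of the row with the smallest |Year 2 - Year| within each ID group.
idx = df["Year 2"].sub(df["Year"]).abs().groupby(df["ID"]).idxmin()

# Boolean marker column, as in the answer above.
df["new"] = df.index.isin(idx)
print(df)
```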

Pandas group by id and year(date), but show year for all years, not just those which are present in id?

I have years of transaction data which I am working with by customer id. The transaction information is at an invoice level; an id could easily have multiple invoices on the same day, or no invoices for years. I am attempting to create dataframes which contain sums of invoices per customer per year, but also show the years where no invoices were added. Something akin to:
tmp = invoices[invoices['invoice_year'].isin([2018, 2019, 2020])]
tmp = tmp.groupby(['id', pd.Grouper(key = 'invoice_date', freq = 'Y')])['sales'].sum()
This would return something akin to:
id invoice_year sales
1 2018 483982.20
1 2019 3453
1 2020 453533
2 2018 243
2 2020 23423
3 2020 2330202
However the desired output would be:
id invoice_year sales
1 2018 483982.20
1 2019 3453
1 2020 453533
2 2018 243
2 2019 nan
2 2020 23423
3 2018 nan
3 2019 nan
3 2020 2330202
Ideas?
Let's suppose the original values are defined in a dataframe named df; then you can try the following:
output = (df.groupby(['id', 'invoice_date'])['val'].sum()
            .unstack(fill_value=0)
            .stack()
            .reset_index(name='val'))
Otherwise, you can first create the column invoice_year:
df['invoice_year'] = df['invoice_date'].dt.year
and repeat the same code, which outputs:
id invoice_year val
0 1 2018 1
1 1 2019 1
2 1 2020 0
3 2 2018 1
4 2 2019 0
5 2 2020 1
6 3 2018 0
7 3 2019 1
8 3 2020 1
Using the following data as example:
df = pd.DataFrame({
    'id': [1]*2 + [2]*2 + [3]*2,
    'invoice_date': pd.to_datetime(['2018-12-01', '2019-12-01', '2020-12-01']*2),
    'val': [1]*6,
})
Stefan has posted a comment that may help. Simply passing dropna=False to your .groupby seems like the best bet; but, you could also take the approach where you bring the NaNs back afterward, which may be required on earlier versions of pandas that don't have the dropna=False parameter:
id invoice_year sales
1 2018 483982.20
1 2019 3453
1 2020 453533
2 2018 243
2 2020 23423
3 2020 2330202
You can use pd.MultiIndex.from_product and reindex the dataframe from a newly created index called idx:
i, iy = df['id'], df['invoice_year']
idx = pd.MultiIndex.from_product([range(i.min(), i.max()+1),
range(iy.min(), iy.max()+1)],
names=[i.name, iy.name])
df = df.set_index([i.name, iy.name]).reindex(idx).reset_index()
df
Out[1]:
id invoice_year sales
0 1 2018 483982.2
1 1 2019 3453.0
2 1 2020 453533.0
3 2 2018 243.0
4 2 2019 NaN
5 2 2020 23423.0
6 3 2018 NaN
7 3 2019 NaN
8 3 2020 2330202.0
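A self-contained sketch of the reindex approach, with the sample data rebuilt from the tables above:

```python
import pandas as pd

# Frame reconstructed from the question's (incomplete) groupby output.
df = pd.DataFrame({"id": [1, 1, 1, 2, 2, 3],
                   "invoice_year": [2018, 2019, 2020, 2018, 2020, 2020],
                   "sales": [483982.20, 3453, 453533, 243, 23423, 2330202]})

i, iy = df["id"], df["invoice_year"]
# Full grid of every id x every year, whether or not it occurs in the data.
idx = pd.MultiIndex.from_product([range(i.min(), i.max() + 1),
                                  range(iy.min(), iy.max() + 1)],
                                 names=[i.name, iy.name])

# Reindexing against the grid fills the missing (id, year) pairs with NaN.
out = df.set_index([i.name, iy.name]).reindex(idx).reset_index()
print(out)
```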

Sort GroupBy object by certain max. value within individual groups

I am trying to sort my groupby result by the highest value for a certain year, i.e. the 2018 values, but so far unsuccessfully.
Code:
aggs = {'sales': 'sum'}
df.groupby(by=['segment', 'year']).agg(aggs)
Default result by pandas when grouping (sorted alphabetically by level 0, then ascending by level 1):
Segment Year Sales
A 2016 2
A 2017 10
A 2018 6
B 2016 1
B 2017 4
B 2018 8
Expected result:
Segment Year Sales
B 2016 1
B 2017 4
B 2018 8
A 2016 2
A 2017 10
A 2018 6
i.e. A sorts after B, because B's sum in 2018 is 8 while A's is 6.
The idea is to create an ordered Categorical whose categories are the Segment values filtered to 2018 and sorted by Sales:
cats = df[df['Year'] == 2018].sort_values('Sales', ascending=False)['Segment']
aggs = {'Sales':'sum'}
df['Segment'] = pd.Categorical(df['Segment'], ordered=True, categories=cats)
df1 = df.groupby(by=['Segment', 'Year']).agg(aggs)
print (df1)
Sales
Segment Year
B 2016 1
2017 4
2018 8
A 2016 2
2017 10
2018 6
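A runnable sketch of the Categorical trick, with the example data rebuilt from the tables above (observed=False is added so that even empty categories would still appear in the result):

```python
import pandas as pd

# Frame reconstructed from the question's table.
df = pd.DataFrame({"Segment": list("AAABBB"),
                   "Year": [2016, 2017, 2018] * 2,
                   "Sales": [2, 10, 6, 1, 4, 8]})

# Category order = Segments ranked by their 2018 Sales, highest first.
cats = df[df["Year"] == 2018].sort_values("Sales", ascending=False)["Segment"]
df["Segment"] = pd.Categorical(df["Segment"], ordered=True, categories=cats)

# groupby sorts by the Categorical's order, so B's groups come before A's.
df1 = df.groupby(by=["Segment", "Year"], observed=False).agg({"Sales": "sum"})
print(df1)
```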
