Merge datasets with a certain priority - python

I have 3 datasets, all the same shape:
CustomerNumber, Name, Status
A customer can appear in 1, 2 or all 3.
Each dataset is a list of gold/silver/bronze customers.
example data:
Dataframe 1:
100,James,Gold
Dataframe 2:
100,James,Silver
101,Paul,Silver
Dataframe 3:
100,James,Bronze
101,Paul,Bronze
102,Fred,Bronze
Expected output/aggregated list:
100,James,Gold
101,Paul,Silver
102,Fred,Bronze
So for a customer that is captured in all 3, I want to keep the Status as Gold.
Have been playing with join and merge and just can’t get it right.

Use concat, then convert the status column to an ordered categorical so that sorting by multiple columns respects the Gold/Silver/Bronze priority, and finally remove duplicates with DataFrame.drop_duplicates:
print (df1)
print (df2)
print (df3)
a b c
0 100 James Gold
a b c
0 100 James Silver
1 101 Paul Silver
a b c
0 101 Paul Bronze
1 102 Fred Bronze
import pandas as pd

df = pd.concat([df1, df2, df3], ignore_index=True)
# ordered categorical: Gold sorts before Silver before Bronze
df['c'] = pd.Categorical(df['c'], ordered=True, categories=['Gold','Silver','Bronze'])
# after sorting, the first row per customer is the highest-priority status
df = df.sort_values(['a','b','c']).drop_duplicates(['a','b'])
print (df)
a b c
0 100 James Gold
2 101 Paul Silver
4 102 Fred Bronze
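As a variant (my sketch, not part of the answer above): once the column is an ordered categorical, the highest-priority status per customer is simply the group minimum, so groupby can replace the sort/drop_duplicates step:

```python
import pandas as pd

# Same example frames as above (columns a=CustomerNumber, b=Name, c=Status)
df1 = pd.DataFrame({'a': [100], 'b': ['James'], 'c': ['Gold']})
df2 = pd.DataFrame({'a': [100, 101], 'b': ['James', 'Paul'], 'c': ['Silver', 'Silver']})
df3 = pd.DataFrame({'a': [101, 102], 'b': ['Paul', 'Fred'], 'c': ['Bronze', 'Bronze']})

df = pd.concat([df1, df2, df3], ignore_index=True)
df['c'] = pd.Categorical(df['c'], ordered=True, categories=['Gold', 'Silver', 'Bronze'])

# Gold sorts first, so min() per customer keeps the best status
out = df.groupby(['a', 'b'], as_index=False)['c'].min()
```

Either form gives the expected three-row result; the groupby version avoids keeping duplicate rows around until the final step.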

Related

How do I merge two sets of data with Pandas in Python without losing rows?

I'm using Pandas in Python to compare two data frames. I want to match up the data from one set to another.
Dataframe 1
Name
Sam
Mike
John
Matthew
Mark
Dataframe 2
Name     Number
Mike     76
John     92
Mark     32
This is the output I would like to get:
Name     Number
Sam      0
Mike     76
John     92
Matthew  0
Mark     32
At the moment I am doing this
df1 = pd.read_csv('data_frame1.csv', usecols=['Name', 'Number'])
df2 = pd.read_csv('data_frame2.csv')
df3 = pd.merge(df1, df2, on = 'Name')
df3.set_index('Name', inplace = True)
df3.to_csv('output.csv')
However, this is deleting the names which do not have a number. I want to keep them and assign 0 to them.
You can use pd.merge(..., how='outer'): it keeps all rows, inserting NaN where a frame has no match, and then .fillna(0) replaces those NaN with 0:
>>> pd.merge(df1, df2, on = 'Name', how = 'outer').fillna(0)
Name Number
0 Sam 0
1 Mike 76
2 John 92
3 Matthew 0
4 Mark 32
With how='outer' both DataFrames contribute rows; if you only want to keep the rows of one DataFrame, merge with how='left' instead. Compare:
>>> df1 = pd.DataFrame({'Name': ['Mike','John','Mark','Matthew']})
>>> df2 = pd.DataFrame({'Name': ['Mike','John','Mark', 'Sara'], 'Number' : [76,92,32,50]})
>>> pd.merge(df1, df2, on='Name', how='outer').fillna(0)
Name Number
0 Mike 76.0
1 John 92.0
2 Mark 32.0
3 Matthew 0.0
4 Sara 50.0
>>> df1.merge(df2,on='Name', how='left').fillna(0)
Name Number
0 Mike 76.0
1 John 92.0
2 Mark 32.0
3 Matthew 0.0
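One side effect worth noting (my addition, not from the answers above): the outer merge introduces NaN before fillna, which upcasts Number to float, hence the 76.0-style output. If integer output is wanted, cast back after filling:

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['Sam', 'Mike', 'John', 'Matthew', 'Mark']})
df2 = pd.DataFrame({'Name': ['Mike', 'John', 'Mark'], 'Number': [76, 92, 32]})

out = pd.merge(df1, df2, on='Name', how='outer').fillna(0)
out['Number'] = out['Number'].astype(int)  # safe: no NaN left after fillna
```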

Add indicator to inform where the data came from Python

Many thanks for reading.
I have a pandas data frame which is the result of a concatenation of multiple smaller data frames. What I want to do is add multiple indicator columns to my final data frame, so that I can see what smaller data frame each row came from.
This would be my desired result:
Forename Surname Ind_1 Ind_2 Ind_3 Ind_4
jon smith 0 0 0 1
charlie jim 1 0 0 1
ian james 0 1 0 0
For example, "Jon Smith" came from data frame 4, and 'Charlie Jim" came from data frames 1 and 4 (duplicate rows).
I have been able to achieve this for rows that only came from one data frame (e.g. rows 1 and 3) but not for duplicate rows that came from multiple data frames (e.g. row 2).
Many thanks for any help.
You can use:
concat with the keys parameter to identify each source DataFrame
reset_index to turn the MultiIndex levels into columns
groupby and join the key labels per row
str.get_dummies to create the indicator columns
reindex to append all-zero columns for sources with no rows
reset_index to turn the group index back into columns
import pandas as pd

df1 = pd.DataFrame({'Forename':['charlie'], 'Surname':['jim']})
df2 = pd.DataFrame({'Forename':['ian'], 'Surname':['james']})
df3 = pd.DataFrame()
df4 = pd.DataFrame({'Forename':['charlie', 'jon'], 'Surname':['jim', 'smith']})

#list of DataFrames
dfs = [df1, df2, df3, df4]
#generate indicator names
inds = ['Ind_{}'.format(x+1) for x in range(len(dfs))]

df = (pd.concat(dfs, keys=inds)
        .reset_index()
        .groupby(['Forename','Surname'])['level_0']
        .apply('|'.join)
        .str.get_dummies()
        .reindex(columns=inds, fill_value=0)
        .reset_index())
print (df)
Forename Surname Ind_1 Ind_2 Ind_3 Ind_4
0 charlie jim 1 0 0 1
1 ian james 0 1 0 0
2 jon smith 0 0 0 1
A more general solution groups by all columns:
df = pd.concat(dfs, keys=inds)
print (df)
Forename Surname
Ind_1 0 charlie jim
Ind_2 0 ian james
Ind_4 0 charlie jim
1 jon smith
df1 = (df.reset_index()
         .groupby(df.columns.tolist())['level_0']
         .apply('|'.join)
         .str.get_dummies()
         .reindex(columns=inds, fill_value=0)
         .reset_index())
print (df1)
Forename Surname Ind_1 Ind_2 Ind_3 Ind_4
0 charlie jim 1 0 0 1
1 ian james 0 1 0 0
2 jon smith 0 0 0 1
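A shorter variant of the same idea (my sketch, not from the answer): pd.crosstab can count the source keys directly, which skips the join/get_dummies round trip:

```python
import pandas as pd

df1 = pd.DataFrame({'Forename': ['charlie'], 'Surname': ['jim']})
df2 = pd.DataFrame({'Forename': ['ian'], 'Surname': ['james']})
df3 = pd.DataFrame(columns=['Forename', 'Surname'])  # empty frame, columns made explicit
df4 = pd.DataFrame({'Forename': ['charlie', 'jon'], 'Surname': ['jim', 'smith']})

dfs = [df1, df2, df3, df4]
inds = ['Ind_{}'.format(x + 1) for x in range(len(dfs))]

# Tag each row with its source frame, then count rows per (name, source)
stacked = pd.concat(dfs, keys=inds).reset_index(level=0)
out = (pd.crosstab([stacked['Forename'], stacked['Surname']], stacked['level_0'])
         .reindex(columns=inds, fill_value=0)
         .reset_index())
```

Note that duplicate rows inside a single source frame would produce counts greater than 1 here; a final .clip(upper=1) on the indicator columns would turn them back into 0/1 flags.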

How to add two DataFrames

I have DataFrame number 1
Price Things
0 1 pen
1 2 pencil
2 6 apple
I have DataFrame number 2:
Price Things
0 5 pen
1 6 pencil
2 10 cup
I want to join two DataFrames and I'd like to see this DataFrame:
DataFrame number 1 + DataFrame number 2
Price Things
0 6 pen
1 8 pencil
2 6 apple
3 10 cup
How can I do this?
This code builds the two example DataFrames (Price is listed first so the column order matches the frames shown above):
import pandas as pd

df = pd.DataFrame({'Price': [1, 2], 'Things': ['pen', 'pencil']})
df.loc[2] = [6, 'apple']
print("DataFrame number 1")
print(df)

df2 = pd.DataFrame({'Price': [5, 6], 'Things': ['pen', 'pencil']})
df2.loc[2] = [10, 'cup']
print("DataFrame number 2")
print(df2)
You can use the concat function to combine the two DataFrames along axis=0, then group by Things and sum:
df3 = pd.concat([df, df2], axis=0).groupby('Things').sum().reset_index()
df3
Output:
Things Price
0 apple 6
1 cup 10
2 pen 6
3 pencil 8
You can merge, add, then drop the interim columns:
common = pd.merge(df, df2, on='Things', how='outer').fillna(0)
common['Price'] = common.Price_x + common.Price_y
common.drop(['Price_x', 'Price_y'], axis=1, inplace=True)
>>> common
Things Price
0 pen 6.0
1 pencil 8.0
2 apple 6.0
3 cup 10.0
You can also set Things as index on both data frames and then use add(..., fill_value=0):
df.set_index('Things').add(df2.set_index('Things'), fill_value=0).reset_index()
# Things Price
#0 apple 6.0
#1 cup 10.0
#2 pen 6.0
#3 pencil 8.0
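Both the merge-based variant and add(fill_value=0) return a float Price column, because NaN appears for the non-matching Things before it is filled; if integer prices are wanted, a cast restores the dtype (my addition, assuming whole-number prices):

```python
import pandas as pd

df = pd.DataFrame({'Price': [1, 2, 6], 'Things': ['pen', 'pencil', 'apple']})
df2 = pd.DataFrame({'Price': [5, 6, 10], 'Things': ['pen', 'pencil', 'cup']})

out = (df.set_index('Things')
         .add(df2.set_index('Things'), fill_value=0)
         .reset_index())
out['Price'] = out['Price'].astype(int)  # NaN already replaced by fill_value, cast is safe
```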

isin pandas dataframe from 2 other dataframes

I have a pandas dataframe:
df = pd.DataFrame({'countries':['US','UK','Germany','China','India','Pakistan','lanka'],
                   'id':['a','b','c','d','e','f','g']})
I also have two more dataframes, df2 and df3:
df2 = pd.DataFrame({'countries':['Germany','China'],
                    'capital':['c','d']})
df3 = pd.DataFrame({'countries':['lanka','USA'],
                    'capital':['g','a']})
I want to find the rows in df whose id appears in df2 or df3.
I had this code:
df[df.id.isin(df2.capital)]
but it only finds the rows that are in df2.
Is there any way I can do it for both df2 and df3 in a single expression?
I think you can simply add both lists together:
print (df[df.id.isin(df2.capital.tolist() + df3.capital.tolist())])
countries id
0 US a
2 Germany c
3 China d
6 lanka g
Another solution is numpy.union1d - the union of two arrays:
import numpy as np

print (df[df.id.isin(np.union1d(df2.capital, df3.capital))])
countries id
0 US a
2 Germany c
3 China d
6 lanka g
Or, as suggested in a comment, combine two isin masks with or (|):
print (df[(df.id.isin(df2.capital)) | (df.id.isin(df3.capital))])
countries id
0 US a
2 Germany c
3 China d
6 lanka g
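One caveat worth illustrating (my example, not from the answers above): numpy.setxor1d keeps only the elements that appear in exactly one of the two arrays, so a capital present in both df2 and df3 would be dropped, while numpy.union1d keeps everything in either array, which is what "in df2 or df3" asks for:

```python
import numpy as np

a = np.array(['c', 'd', 'g'])
b = np.array(['g', 'a'])

print(np.union1d(a, b))   # every element in either array
print(np.setxor1d(a, b))  # 'g' is missing: it appears in both arrays
```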

How to Stack Data Frames on top of one another (Pandas,Python3)

Let's say I have 3 pandas DataFrames:
DF1
Words Score
The Man 2
The Girl 4
Df2
Words2 Score2
The Boy 6
The Mother 7
Df3
Words3 Score3
The Son 3
The Daughter 4
Right now, I have them concatenated together so that it becomes 6 columns in one DF. That's all well and good, but I was wondering: is there a pandas function to stack them vertically into TWO columns and change the headers?
So to make something like this?
Family Members Score
The Man 2
The Girl 4
The Boy 6
The Mother 7
The Son 3
The Daughter 4
Everything I'm reading at http://pandas.pydata.org/pandas-docs/stable/merging.html seems to only cover "horizontal" methods of joining DFs!
As long as you rename the columns so that they're the same in each dataframe, pd.concat() should work fine:
# I read in your data as df1, df2 and df3 using:
# df1 = pd.read_clipboard(sep='\s\s+')
# Example dataframe:
Out[8]:
Words Score
0 The Man 2
1 The Girl 4
all_dfs = [df1, df2, df3]

# Give all df's common column names
for df in all_dfs:
    df.columns = ['Family_Members', 'Score']

pd.concat(all_dfs).reset_index(drop=True)
Out[16]:
Family_Members Score
0 The Man 2
1 The Girl 4
2 The Boy 6
3 The Mother 7
4 The Son 3
5 The Daughter 4
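Equivalently, set_axis renames the columns without mutating the input frames (my sketch, using the column names from the desired output):

```python
import pandas as pd

df1 = pd.DataFrame({'Words': ['The Man', 'The Girl'], 'Score': [2, 4]})
df2 = pd.DataFrame({'Words2': ['The Boy', 'The Mother'], 'Score2': [6, 7]})
df3 = pd.DataFrame({'Words3': ['The Son', 'The Daughter'], 'Score3': [3, 4]})

cols = ['Family_Members', 'Score']
# set_axis returns a renamed copy, so df1/df2/df3 keep their original headers
out = pd.concat(
    [d.set_axis(cols, axis=1) for d in (df1, df2, df3)],
    ignore_index=True,
)
```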
