Pandas Dataframe Comparison - specify mismatched columns - python

I have two dataframes as shown below, df1 and df2:
df1 =
       emp_name emp_city counts
emp_id
2           two    city2      3
4       fourxxx    city4      1
5          five    city5      1
df2 =
       emp_name emp_city counts
emp_id
2           two    city2      1
3         three    city3      1
4          four    city4      1
Note: 'emp_id' acts as index.
I want to find the differences between df1 and df2 and record the names of the columns that have mismatched values. The code snippet below does that:
df3 = df2.copy()
df3['mismatch_col'] = df2.ne(df1, axis=1).dot(df2.columns)
Results in df3:
df3 =
       emp_name emp_city counts  mismatch_col
emp_id
2           two    city2      1  counts
3         three    city3      1  emp_nameemp_citycounts
4          four    city4      1  emp_name
The problem is with 'mismatch_col'. It gives me the names of the columns where df1 and df2 differ, but the names are not separated. I want them separated by commas. The expected output looks like this:
Expected_df3 =
       emp_name emp_city counts  mismatch_col
emp_id
2           two    city2      1  counts
3         three    city3      1  emp_name,emp_city,counts
4          four    city4      1  emp_name
Can someone please help me with this?

You can use df2.columns + ',' to add commas and then str[:-1] to remove the last one:
df3['mismatch_col'] = df2.ne(df1, axis=1).dot(df2.columns + ',').str[:-1]
Result:
       emp_name emp_city counts  mismatch_col
emp_id
2           two    city2      1  counts
3         three    city3      1  emp_name,emp_city,counts
4          four    city4      1  emp_name
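As a self-contained variant, here is a minimal sketch that rebuilds the two frames from the question and joins the mismatched column names explicitly instead of trimming a trailing comma. The names `aligned` and `mismatch` are introduced here for illustration, and reindexing df1 to df2's rows is an assumption about how a missing emp_id should be treated (every column is reported as mismatched).

```python
import pandas as pd

df1 = pd.DataFrame(
    {"emp_name": ["two", "fourxxx", "five"],
     "emp_city": ["city2", "city4", "city5"],
     "counts": [3, 1, 1]},
    index=pd.Index([2, 4, 5], name="emp_id"),
)
df2 = pd.DataFrame(
    {"emp_name": ["two", "three", "four"],
     "emp_city": ["city2", "city3", "city4"],
     "counts": [1, 1, 1]},
    index=pd.Index([2, 3, 4], name="emp_id"),
)

# Align df1 to df2's row labels; emp_id 3 is absent from df1, so its row becomes all-NaN
aligned = df1.reindex(df2.index)

# True wherever a df2 cell differs from the aligned df1 cell (NaN counts as different)
mismatch = df2.ne(aligned)

df3 = df2.copy()
# For each row, join the names of the columns flagged True with commas
df3["mismatch_col"] = mismatch.apply(
    lambda row: ",".join(mismatch.columns[row.to_numpy()]), axis=1
)
print(df3)
```

The row-wise apply is slower than the dot trick on large frames, but it makes the separator handling explicit.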

Related

Function with if case to merge two columns or three in pandas

I'm trying to solve an interesting problem and would like any suggestions.
I want to merge two dataframes on three columns, but where the third column in the first dataframe has a NaN value, the merge should use only the first two.
Example:
---DataFrame 1---
Number  Number2  Name
1       2        One
2       2        NaN
3       2        Three
---DataFrame 2---
Number  Number2  Name2
1       2        One
2       2        Two
2       2        Two.5
3       2        Three
3       2        Three.5
4       2        Four
---Result---
Number  Number2  Name   Name2
1       2        One    One
2       2        NaN    Two
2       2        NaN    Two.5
3       2        Three  Three
So far I tried to do a function for this.
def merge_three_or_two(row):
    if row['Name'] == np.nan:
        row = pd.merge(row, df2, how='left', left_on=['Number','Number2'], right_on = ['Number','Number2'])
    else:
        row = pd.merge(row, df2, how='left', left_on=['Number','Number2','Name'], right_on = ['Number','Number2','Name2'])
df1 = df1.apply(merge_three_or_two, axis=1)
Try to use .isna().any() in the condition:
if df1.Name.isna().any():
    print(df1.merge(df2, how='left', on=['Number', 'Number2']))
else:
    print(df1.merge(df2, how='left', left_on=['Number','Number2','Name'], right_on = ['Number','Number2','Name2']))
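Some context on why the condition in the question never fires: NaN compares unequal to everything, including itself, so `row['Name'] == np.nan` is always False. A minimal sketch of the difference, using a hypothetical `names` Series that mirrors the Name column:

```python
import numpy as np
import pandas as pd

# NaN is never equal to anything, not even to itself
print(np.nan == np.nan)  # False
print(pd.isna(np.nan))   # True

# Hypothetical stand-in for df1['Name'] from the question
names = pd.Series(["One", np.nan, "Three"], name="Name")

eq_mask = names == np.nan  # element-wise equality: no element matches
na_mask = names.isna()     # proper missing-value test

print(eq_mask.any())
print(na_mask.tolist())
```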
You can first merge df1 and df2 on the 'Number' and 'Number2' columns, then drop the rows where 'Name' is present but does not equal 'Name2':
df3 = df1.merge(df2, how='left', left_on=['Number','Number2'], right_on=['Number','Number2'])
df3.drop(df3[df3['Name'].notna() & (df3['Name'] != df3['Name2'])].index, inplace=True)
print(df3)
Number Number2 Name Name2
0 1 2 One One
1 2 2 NaN Two
2 2 2 NaN Two.5
3 3 2 Three Three
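The two-step approach above can be sketched end to end as follows. The frames are rebuilt from the tables in the question, with the empty Name cell taken to be NaN. The names `merged`, `keep` and `result` are illustrative only.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    "Number": [1, 2, 3],
    "Number2": [2, 2, 2],
    "Name": ["One", np.nan, "Three"],
})
df2 = pd.DataFrame({
    "Number": [1, 2, 2, 3, 3, 4],
    "Number2": [2, 2, 2, 2, 2, 2],
    "Name2": ["One", "Two", "Two.5", "Three", "Three.5", "Four"],
})

# Step 1: merge on the two columns that are always populated
merged = df1.merge(df2, how="left", on=["Number", "Number2"])

# Step 2: keep a row if Name is missing, or if Name agrees with Name2
keep = merged["Name"].isna() | (merged["Name"] == merged["Name2"])
result = merged[keep].reset_index(drop=True)
print(result)
```

One caveat with this filter: a df1 row whose Name is present but matches no Name2 disappears entirely, which differs from a strict left join.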

Iterating through a “Pandas Data Frame” with criteria from another “Pandas Data Frame”?

Edit: Sorry for not being clear in my question.
The problem is that customer B22 has three entries in the "Sales Table" but a target of only two sales, so I have to sum only the first two entries and ignore the last one.
In the original data frame, the values in the Value column are not all the same.
I'm on pandas version 0.24.0.
I have two pandas data frames: one with customers and sales, and one with customers and sales targets. I want to sum the sales values up to the number of sales given in the "Sales Targets" data frame.
Sales Table
Index  Cust_ID  Date        Value
0      A11      02.01.2021  100
1      A11      03.01.2021  100
2      A11      04.01.2021  100
3      A11      05.01.2021  100
4      B22      05.01.2021  100
5      B22      06.01.2021  100
6      B22      07.01.2021  100
7      C33      08.01.2021  100
8      C33      09.01.2021  100
Sales Targets
Index  Cust_ID  Sales_Target
0      A11      4
1      B22      2
2      C33      4
Customer A11 has a "Sales_Target" of 4 and bought 4 times, so the value is 400.
Customer B22 has a "Sales_Target" of 2 and bought 3 times, so only the first 2 count, for a value of 200.
Customer C33 has a "Sales_Target" of 4 and bought 2 times, so the value is only 200.
Index  Cust_ID  Sales_Target  Sales  Sales_Value
0      A11      4             4      400
1      B22      2             3      200
2      C33      4             2      200
Sorry, I have no idea how to solve this problem.
Thank you for your help.
Cheers Marcus
You need to merge the tables and then group by the relevant columns. With the current data you can just group by customer:
sales_df = "sales_and_customer_data"  # placeholder for the sales DataFrame
target_df = "target_and_customer_data"  # placeholder for the sales-target DataFrame
merge_df = pd.merge(sales_df, target_df, how='left', on=['Cust_ID'], copy=False)
merge_df = merge_df[['Cust_ID','Sales_Target','Value']].groupby('Cust_ID').agg(sales_value=('Value', 'sum'), sales_count=('Value', 'count'))
What you actually want isn't a direct merge. First, transform your first table into sums grouped by customer ID:
sales_agg = sales.groupby('Cust_ID').agg(sales_value=('Value', 'sum'),
                                         sales_count=('Value', 'count')) \
                 .reset_index()
Then merge your targets against this new table to introduce the new columns:
table3 = sales_targets.merge(sales_agg, on='Cust_ID', how='left', validate='1:1')
If you're using an older pandas (named aggregation as above needs 0.25 or later), just use two groupbys:
sales_agg = pd.concat([
    sales.groupby('Cust_ID')['Value'].sum().rename('sales_value'),
    sales.groupby('Cust_ID')['Value'].count().rename('sales_count')],
    axis=1).reset_index()
Then do the merge as above.
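The question's edit asks that only the first Sales_Target sales per customer be summed, which a plain per-customer sum does not do on its own. One possible way to apply that cap is sketched below with `groupby(...).cumcount()`. The helper column `sale_no` and the choice to order sales by date are assumptions made for illustration, not part of the answers above.

```python
import pandas as pd

sales = pd.DataFrame({
    "Cust_ID": ["A11"] * 4 + ["B22"] * 3 + ["C33"] * 2,
    "Date": ["02.01.2021", "03.01.2021", "04.01.2021", "05.01.2021",
             "05.01.2021", "06.01.2021", "07.01.2021",
             "08.01.2021", "09.01.2021"],
    "Value": [100] * 9,
})
sales_targets = pd.DataFrame({
    "Cust_ID": ["A11", "B22", "C33"],
    "Sales_Target": [4, 2, 4],
})

# Assumption: "first" means earliest by date (dates are day.month.year)
sales["Date"] = pd.to_datetime(sales["Date"], format="%d.%m.%Y")
sales = sales.sort_values(["Cust_ID", "Date"])

# Number each customer's sales 0, 1, 2, ... and attach that customer's target
sales["sale_no"] = sales.groupby("Cust_ID").cumcount()
merged = sales.merge(sales_targets, on="Cust_ID", how="left")

# Keep only the first Sales_Target sales per customer, then sum them
capped = merged[merged["sale_no"] < merged["Sales_Target"]]
sales_value = capped.groupby("Cust_ID")["Value"].sum().rename("Sales_Value")

# Total number of sales per customer, regardless of the cap
sales_count = sales.groupby("Cust_ID").size().rename("Sales")

result = (sales_targets
          .join(sales_count, on="Cust_ID")
          .join(sales_value, on="Cust_ID"))
print(result)
```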

Returning the rows based on specific value without column name

I know how to return the rows based on specific text by specifying the column name like below.
import pandas as pd
data = {'id':['1', '2', '3','4'],
        'City1':['abc','def','abc','khj'],
        'City2':['JH','abc','abc','yuu'],
        'City2':['JRR','ytu','rr','abc']}
df = pd.DataFrame(data)
df.loc[df['City1']== 'abc']
and output is -
id City1 City2
0 1 abc JRR
2 3 abc rr
But what I need is different: my specific value 'abc' can be in any column, and I need to return the rows that contain that text without giving a column name. Is there any way to do this? I need the output below:
id City1 City2
0 1 abc JRR
1 3 abc rr
2 4 khj abc
You can use any with axis=1 to check across all columns of each row and get the expected result (recent pandas versions no longer accept the axis as a positional argument):
>>> df[(df == 'abc').any(axis=1)]
id City1 City2
0 1 abc JRR
2 3 abc rr
3 4 khj abc
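A self-contained sketch of the same idea, using `isin` so that several search values could be passed at once, and resetting the index to match the 0, 1, 2 numbering in the question. The frame below keeps only the columns that survive the duplicate 'City2' key in the question's dict. `mask` and `result` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    "id": ["1", "2", "3", "4"],
    "City1": ["abc", "def", "abc", "khj"],
    "City2": ["JRR", "ytu", "rr", "abc"],
})

# True for every row in which at least one cell equals 'abc'
mask = df.isin(["abc"]).any(axis=1)

# Select those rows and renumber them from 0
result = df[mask].reset_index(drop=True)
print(result)
```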

Pandas - Replace row values based on multi-column match [duplicate]

This is a simple question, but most of the solutions I found here were based on matching a single column (mainly just an ID).
Df1
'Name' 'Dept' 'Amount' 'Leave'
ABC 1 10 0
BCD 1 5 0
Df2
'Alias_Name', 'Dept', 'Amount', 'Leave', 'Address', 'Join_Date'
ABC 1 100 5 qwerty date1
PQR 2 0 2 asdfg date2
I want to replace row values in df1 when both the Name and Dept match.
I tried merge(left_on=['Name', 'Dept'], right_on=['Alias_Name', 'Dept'], how='left'), but it gives me double the number of columns, with _x and _y suffixes. I just need to replace Dept, Amount and Leave in df1 if the Name and Dept match any row in df2.
Desired Output:
Name Dept Amount Leave
ABC 1 100 5
BCD 1 5 0
new_df = df1[['Name', 'Dept']].merge(df2[['Alias_Name', 'Dept', 'Amount', 'Leave']].rename(columns={'Alias_Name': 'Name'}), how='left').fillna(df1[['Amount', 'Leave']])
Result:
Name Dept Amount Leave
0 ABC 1 100.0 5.0
1 BCD 1 5.0 0.0
You can use new_df[['Amount', 'Leave']] = new_df[['Amount', 'Leave']].astype(int) to re-cast the dtype if that's important.
You can create a temporary column in both data frames that combines "Name" and "Dept". That column can then be used as the key for matching.
Try:
# select rows that should be replaced
replace_df = df1[['Name', 'Dept']].merge(df2, left_on=['Name', 'Dept'], right_on=['Alias_Name', 'Dept'], how='inner')
# replace rows in df1
df1.iloc[replace_df.index] = replace_df
Result:
Name Dept Amount Leave
0 ABC 1 100 5
1 BCD 1 5 0
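Another option, not taken from the answers above, is to index both frames by the match columns and let `DataFrame.update` overwrite the matching rows in place. The following is a minimal sketch under the assumption that (Name, Dept) pairs are unique in both frames. `left` and `right` are illustrative names.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Name": ["ABC", "BCD"],
    "Dept": [1, 1],
    "Amount": [10, 5],
    "Leave": [0, 0],
})
df2 = pd.DataFrame({
    "Alias_Name": ["ABC", "PQR"],
    "Dept": [1, 2],
    "Amount": [100, 0],
    "Leave": [5, 2],
    "Address": ["qwerty", "asdfg"],
    "Join_Date": ["date1", "date2"],
})

# Index both frames by the two match columns
left = df1.set_index(["Name", "Dept"])
right = (df2.rename(columns={"Alias_Name": "Name"})
            .set_index(["Name", "Dept"])[["Amount", "Leave"]])

# Overwrite values in `left` wherever `right` has a row with the same (Name, Dept)
left.update(right)
result = left.reset_index()
print(result)
```

Depending on the pandas version, the updated columns may come back as floats. If so, they can be cast back with astype(int).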

pandas function to fill missing values from other dataframe based on matching column?

So I have two dataframes: one where certain columns are filled in and one where others are filled in but some from the previous df are missing. Both share some common non-empty columns.
DF1:
FirstName Uid JoinDate BirthDate
Bob 1 20160628 NaN
Charlie 3 20160627 NaN
DF2:
FirstName Uid JoinDate BirthDate
Bob 1 NaN 19910524
Alice 2 NaN 19950403
Result:
FirstName Uid JoinDate BirthDate
Bob 1 20160628 19910524
Alice 2 NaN 19950403
Charlie 3 20160627 NaN
Assuming that these rows do not share index positions in their respective dataframes, is there a way that I can fill the missing values in DF1 with values from DF2 where the rows match on a certain column (in this example Uid)?
Also, is there a way to create a new entry in DF1 from DF2 if there isn't a match on that column (e.g. Uid) without removing rows in DF1 that don't match any rows in DF2?
EDIT: I updated the dataframes to add non-matching results in both dataframes that I need in the result df. I also updated my last question to reflect that.
UPDATE: you can do this by setting the proper index on both frames and then resetting the index of the combined DF:
In [14]: df1.set_index('FirstName').combine_first(df2.set_index('FirstName')).reset_index()
Out[14]:
FirstName Uid JoinDate BirthDate
0 Alice 2.0 NaN 19950403.0
1 Bob 1.0 20160628.0 19910524.0
2 Charlie 3.0 20160627.0 NaN
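Since the question asks to match on Uid, the same combine_first pattern can be keyed on that column instead. A minimal sketch with the frames rebuilt from the question, assuming Uid is unique within each frame:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    "FirstName": ["Bob", "Charlie"],
    "Uid": [1, 3],
    "JoinDate": [20160628, 20160627],
    "BirthDate": [np.nan, np.nan],
})
df2 = pd.DataFrame({
    "FirstName": ["Bob", "Alice"],
    "Uid": [1, 2],
    "JoinDate": [np.nan, np.nan],
    "BirthDate": [19910524, 19950403],
})

# Values from df1 win; gaps are filled from df2; Uids found in only one frame are kept
result = (df1.set_index("Uid")
             .combine_first(df2.set_index("Uid"))
             .reset_index())
print(result)
```

Columns that end up containing NaN are stored as floats, which is why the dates print with a trailing .0.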
try this:
In [113]: df2.combine_first(df1)
Out[113]:
FirstName Uid JoinDate BirthDate
0 Bob 1 20160628.0 19910524
1 Alice 2 NaN 19950403
