I am facing issues concatenating two DataFrames of different lengths. Below is the issue:
df1 =
emp_id emp_name counts
1 sam 0
2 joe 0
3 john 0
df2 =
emp_id emp_name counts
1 sam 0
2 joe 0
2 joe 1
3 john 0
My Expected output is:
Please note that my expectation is not to merge the two DataFrames into one; rather, I would like to concat the two DataFrames side by side and highlight the differences, in such a way that if there is a duplicate row in one df (for example df2), the corresponding row of df1 shows NaN/blank/None or any null kind of value.
Expected_output_df =
df1 df2
emp_id emp_name counts emp_id emp_name counts
1 sam 0 1 sam 0
2 joe 0 2 joe 0
NaN NaN NaN 2 joe 1
3 john 0 3 john 0
whereas I am getting the output below:
actual_output_df = pd.concat([df1, df2], axis='columns', keys=['df1','df2'])
The above code gives me the DataFrame below; how can I get the DataFrame shown in the expected output?
actual_output_df =
df1 df2
emp_id emp_name counts emp_id emp_name counts
1 sam 0 1 sam 0
2 joe 0 2 joe 0
3 john 0 2 joe 1
NaN NaN NaN 3 john 0
I tried pd.concat with different parameters but am not getting the expected result.
The main issue I have with concat is that I am not able to move the duplicate rows one row down.
Can anyone please help me with this? Thanks in advance.
This does not give the exact output you asked for, but it could solve your problem anyway:
df1.merge(df2, on=['emp_id', 'emp_name', 'counts'], how='outer', indicator=True)
Output:
emp_id emp_name counts _merge
0 1 sam 0 both
1 2 joe 0 both
2 3 john 0 both
3 2 joe 1 right_only
You don't have rows with NaNs as you wanted, but this way you can check whether a row is in the left df, the right df, or both by looking at the _merge column. You can also give a custom name to that column using indicator='name'.
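For example (a quick sketch reusing the same merge, with 'source' as the custom indicator name):

out = df1.merge(df2, on=['emp_id', 'emp_name', 'counts'], how='outer', indicator='source')
# the flag column is now called 'source' instead of the default '_merge'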
Update
To get the exact output you want you can do the following:
import numpy as np

output_df = df1.merge(df2, on=['emp_id', 'emp_name', 'counts'], how='outer', indicator=True)
output_df[['emp_id2', 'emp_name2', 'counts2']] = output_df[['emp_id', 'emp_name', 'counts']]
output_df.loc[output_df._merge == 'right_only', ['emp_id', 'emp_name', 'counts']] = np.nan
output_df.loc[output_df._merge == 'left_only', ['emp_id2', 'emp_name2', 'counts2']] = np.nan
output_df = output_df.drop('_merge', axis=1)
output_df.columns = pd.MultiIndex.from_tuples([('df1', 'emp_id'), ('df1', 'emp_name'), ('df1', 'counts'),
('df2', 'emp_id'), ('df2', 'emp_name'), ('df2', 'counts')])
Output:
df1 df2
emp_id emp_name counts emp_id emp_name counts
0 1.0 sam 0.0 1.0 sam 0.0
1 2.0 joe 0.0 2.0 joe 0.0
2 3.0 john 0.0 3.0 john 0.0
3 NaN NaN NaN 2.0 joe 1.0
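Note that the duplicate row ends up last here rather than next to its match. If you need the exact row order from your expected output, one option (a sketch, assuming df2's emp_id and counts define the desired order) is to sort on the df2 side of the MultiIndex:

output_df = output_df.sort_values([('df2', 'emp_id'), ('df2', 'counts')]).reset_index(drop=True)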
Related
I have two dataframes df1 and df2.
df1:
id val
1 25
2 40
3 78
df2:
id val
2 8
1 5
Now I want to do something like df1['val'] = df1['val'] / df2['val'] for matching ids. I can do that by iterating over all df2 rows, since df2 is a subset of df1 and may be missing some ids, whose values I want to keep unchanged. This is what I have right now:
for row in df2.iterrows():
    df1.loc[df1['id'] == row[1]['id'], 'val'] /= row[1]['val']
df1 after the loop:
id val
1 5
2 5
3 78
How can I achieve the same without using a for loop, to improve speed?
Use Series.map with Series.div:
df1['val'] = df1['val'].div(df1['id'].map(df2.set_index('id')['val']), fill_value=1)
print (df1)
id val
0 1 5.0
1 2 5.0
2 3 78.0
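The fill_value=1 is what keeps the unmatched ids unchanged: ids missing from df2 map to NaN, and div then treats those NaNs as 1. A quick sketch of the intermediate map step (assuming the original df1):

mapped = df1['id'].map(df2.set_index('id')['val'])
print(mapped)
#0    5.0
#1    8.0
#2    NaN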
Solution with merge with left join:
df1['val'] = df1['val'].div(df1.merge(df2, on='id', how='left')['val_y'], fill_value=1)
print (df1)
id val
0 1 5.0
1 2 5.0
2 3 78.0
I feel like I should know this. I am attempting to compare two DataFrames and find the individuals who are not included in the first:
First df
data_x = {'Num':[321654,654987, 654321], 'Name':['Tim', 'Jake', 'Sam']}
x = pd.DataFrame(data_x)
x =
      Num  Name
0  321654   Tim
1  654987  Jake
2  654321   Sam
Second df
data_z = {'Num':[321654,123456, 654987,894523], 'Name':['Tim', 'Jim', 'Jake', 'Bob']}
z = pd.DataFrame(data_z)
z =
      Num  Name
0  321654   Tim
1  123456   Jim
2  654987  Jake
3  894523   Bob
Requested Results =
      Num  Name
0  123456   Jim
1  894523   Bob
You can do an outer .merge() on x and z with the parameter indicator=True, and keep the rows whose merge result is right_only, as follows:
out = x.merge(z, how='outer', indicator=True)
# You can also specify the 2 columns explicitly if there are other columns in the real data:
# out = x.merge(z, on=['Num', 'Name'], how='outer', indicator=True)
Result:
print(out)
Num Name _merge
0 321654 Tim both
1 654987 Jake both
2 654321 Sam left_only
3 123456 Jim right_only
4 894523 Bob right_only
and then filter the result by:
out.loc[out['_merge'] == 'right_only']
Output:
Num Name _merge
3 123456 Jim right_only
4 894523 Bob right_only
Of course you can remove the merge result column and reset the index, if you like:
out_filtered = out.loc[out['_merge'] == 'right_only']
out_filtered = out_filtered.drop(columns='_merge').reset_index(drop=True)
print(out_filtered)
Num Name
0 123456 Jim
1 894523 Bob
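If you like, the same steps can also be chained into a single expression (equivalent to the above):

out_filtered = (x.merge(z, how='outer', indicator=True)
                 .query('_merge == "right_only"')
                 .drop(columns='_merge')
                 .reset_index(drop=True))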
I have two DataFrames:
df:
id Name Number Stat
1 co 4
2 ma 98
3 sa 0
df1:
id Name Number Stat
1 co 4
2 ma 98 5%
I want to merge both DataFrames into one (dfnew), and I want it as follows:
id Name Number Stat
1 co 4
2 ma 98 5%
3 sa 0
I used
dfnew = pd.concat([df, df1])
dfnew = dfnew.drop_duplicates(keep='last')
I am not getting the result I want: the DataFrames are joined, but the duplicates are not deleted. I need help, please.
It seems you need to check only the first 3 columns for duplicates:
dfnew = pd.concat([df, df1]).drop_duplicates(subset=['id', 'Name', 'Number'], keep='last')
print (dfnew)
id Name Number Stat
2 3 sa 0 NaN
0 1 co 4 NaN
1 2 ma 98 5%
Try the pd.merge function with an inner/outer join, based on your requirement; a rough sketch follows.
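A sketch of that merge route (assuming the blank Stat cells are NaN and that id, Name and Number together identify a row):

dfnew = df.merge(df1, on=['id', 'Name', 'Number'], how='outer', suffixes=('', '_df1'))
dfnew['Stat'] = dfnew['Stat_df1'].fillna(dfnew['Stat'])  # prefer df1's Stat, like keep='last'
dfnew = dfnew.drop(columns='Stat_df1')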
I have a dataframe that looks like this:
Supervisor Score
Bill Pass
Bill Pass
Susan Fail
Susan Fail
Susan Fail
I would like to do some aggregates (such as getting the % of pass by supervisor) and would like to split up the Score column so all the Pass are in one column and all the Fail are in another column. Like this:
Supervisor Pass Fail
Bill 0 1
Bill 0 1
Susan 1 0
Susan 1 0
Susan 1 0
Any ideas? Would a simple groupby work by grouping both the supervisor and score columns and getting a count of Score?
pd.get_dummies
Removes any columns you specify from your DataFrame in favor of N dummy columns with the default naming convention 'OrigName_UniqueVal'. Specifying empty strings for the prefix and separator gives you column headers of only the unique values.
pd.get_dummies(df, columns=['Score'], prefix_sep='', prefix='')
Supervisor Fail Pass
0 Bill 0 1
1 Bill 0 1
2 Susan 1 0
3 Susan 1 0
4 Susan 1 0
If in the end you just want the % of each category by supervisor, then you don't really need the dummies; you can use a groupby. I use a reindex to ensure the resulting DataFrame has each category represented for each Supervisor.
(df.groupby(['Supervisor']).Score.value_counts(normalize=True)
.reindex(pd.MultiIndex.from_product([df.Supervisor.unique(), df.Score.unique()]))
.fillna(0))
#Bill Pass 1.0
# Fail 0.0
#Susan Pass 0.0
# Fail 1.0
#Name: Score, dtype: float64
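If you'd rather have those shares as columns (one per Score), you can unstack the same result:

print (df.groupby(['Supervisor']).Score.value_counts(normalize=True)
         .unstack(fill_value=0))
#Score       Fail  Pass
#Supervisor
#Bill         0.0   1.0
#Susan        1.0   0.0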
IIUC, you want DataFrame.pivot_table + DataFrame.join:
new_df = df[['Supervisor']].join(df.pivot_table(columns = 'Score',
index = df.index,
values ='Supervisor',
aggfunc='count',
fill_value=0))
print(new_df)
Supervisor Fail Pass
0 Bill 0 1
1 Bill 0 1
2 Susan 1 0
3 Susan 1 0
4 Susan 1 0
For the exact output you expect:
new_df = df[['Supervisor']].join(df.pivot_table(columns = 'Score',
index = df.index,
values ='Supervisor',
aggfunc='count',
fill_value=0)
.eq(0)
.astype(int))
print(new_df)
Supervisor Fail Pass
0 Bill 1 0
1 Bill 1 0
2 Susan 0 1
3 Susan 0 1
4 Susan 0 1
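pd.crosstab is a more direct way to build the same per-row count table (a sketch of the equivalent):

counts = pd.crosstab(df.index, df['Score'])  # one 0/1 count row per original row
new_df = df[['Supervisor']].join(counts)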
Let's try this one:
df=pd.DataFrame({'Supervisor':['Bill','Bill','Susan','Susan','Susan'],
'Score':['Pass','Pass','Fail','Fail','Fail']}).set_index('Supervisor')
pd.get_dummies(df['Score'])
Pandas 100 tricks
For more pandas tricks, refer to the following: https://www.kaggle.com/python10pm/pandas-100-tricks
To get the df you want you can do it like this:
df["Pass"] = df["Score"].apply(lambda x: 0 if x == "Pass" else 1)
df["Fail"] = df["Score"].apply(lambda x: 0 if x == "Fail" else 1)
I have two DataFrames, df and df2, like this:
id initials
0 100 J
1 200 S
2 300 Y
name initials
0 John J
1 Smith S
2 Nathan N
I want to compare the values in the initials columns of df and df2, and copy the name (from df2) whose initial matches the initial in the first DataFrame (df).
import pandas as pd
for i in df.initials:
    for j in df2.initials:
        if i == j:
            pass  # copy the name value of this particular initial to df
The output should be like this:
id name
0 100 John
1 200 Smith
2 300
Any idea how to solve this problem?
How about this:
df3 = df.merge(df2,on='initials',
how='outer').drop(['initials'],axis=1).dropna(subset=['id'])
>>> df3
id name
0 100.0 John
1 200.0 Smith
2 300.0 NaN
So the 'initials' column is dropped and so is anything with np.nan in the 'id' column.
If you don't want the np.nan in there, tack on a .fillna(''):
df3 = df.merge(df2,on='initials',
how='outer').drop(['initials'],axis=1).dropna(subset=['id']).fillna('')
>>> df3
id name
0 100.0 John
1 200.0 Smith
2 300.0
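Since you only want the ids from df, a left join gets there without the dropna step, and keeps id as an integer because no NaN rows are introduced (a sketch):

df3 = df.merge(df2, on='initials', how='left').drop('initials', axis=1).fillna('')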
df1
id initials
0 100 J
1 200 S
2 300 Y
df2
name initials
0 John J
1 Smith S
2 Nathan N
Use Boolean masks: df2.initials==df1.initials will tell you which values in the two initials columns are the same.
0 True
1 True
2 False
Use this mask to create a new column:
df1['name'] = df2.name[df2.initials==df1.initials]
Remove the initials column from df1:
df1 = df1.drop('initials', axis=1)
Replace the NaN using fillna(''):
df1.fillna('', inplace=True)  # inplace to avoid creating a copy
id name
0 100 John
1 200 Smith
2 300
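Note that the Boolean-mask comparison only works because df1 and df2 happen to share the same index. An alignment-free sketch using map (assuming the initials in df2 are unique):

df1['name'] = df1['initials'].map(df2.set_index('initials')['name']).fillna('')
df1 = df1.drop('initials', axis=1)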