I have two dataframes
df1 = pd.DataFrame({'col1': [1,2,3], 'col2': [4,5,6]})
df2 = pd.DataFrame({'col3': [1,5,3]})
and would like to left merge df1 to df2. I don't have a fixed merge column in df1, though: I would like to merge on col1 if the cell value of col1 exists in df2.col3, and on col2 if the cell value of col2 exists in df2.col3. So in the above example, merge on col1, col2 and then col1. (This is just an example; I actually have more than two columns.)
I could do this, but I'm not sure if it's OK:
df1 = df1.assign(merge_col = np.where(df1.col1.isin(df2.col3), df1.col1, df1.col2))
df1.merge(df2, left_on='merge_col', right_on='col3', how='left')
Are there any better ways to solve it?
Perform the merges in the preferred order, and use combine_first to combine the merges:
(df1.merge(df2, left_on='col1', right_on='col3', how='left')
    .combine_first(df1.merge(df2, left_on='col2', right_on='col3', how='left'))
)
For a generic method with many columns:
cols = ['col1', 'col2']
from functools import reduce
out = reduce(
    lambda a, b: a.combine_first(b),
    [df1.merge(df2, left_on=col, right_on='col3', how='left')
     for col in cols]
)
Output:
col1 col2 col3
0 1 4 1.0
1 2 5 5.0
2 3 6 3.0
Better example:
Adding another column to df2 to illustrate the merge:
df2 = pd.DataFrame({'col3': [1,5,3], 'new': ['A', 'B', 'C']})
Output:
col1 col2 col3 new
0 1 4 1.0 A
1 2 5 5.0 B
2 3 6 3.0 C
I think your solution can be modified to build a merge Series by comparing all the columns from the list, and then merging on this Series:
cols = ['col1', 'col2']
s = df1[cols].where(df1[cols].isin(df2.col3)).bfill(axis=1).iloc[:, 0]
print (s)
0 1.0
1 5.0
2 3.0
Name: col1, dtype: float64
df = df1.merge(df2, left_on=s, right_on='col3', how='left')
print (df)
col1 col2 col3
0 1 4 1
1 2 5 5
2 3 6 3
Your solution with helper column:
cols = ['col1', 'col2']
df1 = df1.assign(merge_col=df1[cols].where(df1[cols].isin(df2.col3))
                                    .bfill(axis=1).iloc[:, 0])
df = df1.merge(df2, left_on='merge_col', right_on='col3', how='left')
print (df)
col1 col2 merge_col col3
0 1 4 1.0 1
1 2 5 5.0 5
2 3 6 3.0 3
Explanation of s: compare all columns with DataFrame.isin, create missing values where there is no match with DataFrame.where, and for merge priority back fill the missing values and select the first column by position:
print (df1[cols].isin(df2.col3))
col1 col2
0 True False
1 False True
2 True False
print (df1[cols].where(df1[cols].isin(df2.col3)))
col1 col2
0 1.0 NaN
1 NaN 5.0
2 3.0 NaN
print (df1[cols].where(df1[cols].isin(df2.col3)).bfill(axis=1))
col1 col2
0 1.0 NaN
1 5.0 5.0
2 3.0 NaN
print (df1[cols].where(df1[cols].isin(df2.col3)).bfill(axis=1).iloc[:, 0])
0 1.0
1 5.0
2 3.0
Name: col1, dtype: float64
How to concat without column names?
>>> df = pd.DataFrame({'col1': [1], 'col2': [4]})
>>> df1 = pd.DataFrame([[5,5]])
>>> pd.concat([df, df1])
col1 col2 0 1
0 1.0 4.0 NaN NaN
0 NaN NaN 5.0 5.0
Also, if you look closely, the dtypes changed from int64 to float64.
Expected
col1 col2
0 1 4
0 5 5
Create the same column names in both DataFrames with DataFrame.set_axis:
df1 = df1.set_axis(df.columns, axis=1)
Or assign the column names directly:
df1.columns = df.columns
# alternative - this instead gives 0, 1 as the column names
# df.columns = df1.columns
Last, use concat:
out = pd.concat([df, df1], ignore_index=True)
Temporarily set_axis on df1 with the columns of df:
pd.concat([df, df1.set_axis(df.columns, axis=1)], ignore_index=True)
NB. append is being deprecated, don't use it.
output:
col1 col2
0 1 4
1 5 5
I have 2 dataframes (df1 and df2) and I want to append them as follows:
df1 and df2 have some columns in common, but I want to append the columns that exist in df2 and not in df1, while keeping the columns of df1 as they are.
df2 is empty (all rows are NaN).
I could just add the columns to df1, but in the future df2 could have new columns added, which is why I don't want to hardcode the column names but rather have this done automatically. I used to use append, but I get the following message:
df_new = df1.append(df2)
FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead
I tried the following
df_new = pd.concat([df1, df2], axis=1)
but it concatenates all the columns of both dataframes.
According to https://pandas.pydata.org/docs/reference/api/pandas.concat.html
join{‘inner’, ‘outer’}, default ‘outer’
How to handle indexes on other axis (or axes).
INNER
df = pd.DataFrame([['c', 3, 'cat'], ['d', 4, 'dog']], columns=['letter', 'number', 'animal'])
df2 = pd.DataFrame([[None,None,None,None],[None,None,None,None]], columns=['letter', 'number', 'animal', 'newcol'])
print(pd.concat([df,df2], join='inner').dropna(how='all'))
output:
letter number animal
0 c 3 cat
1 d 4 dog
OUTER
df = pd.DataFrame([['c', 3, 'cat'], ['d', 4, 'dog']], columns=['letter', 'number', 'animal'])
df2 = pd.DataFrame([[None,None,None,None],[None,None,None,None]], columns=['letter', 'number', 'animal', 'newcol'])
print(pd.concat([df,df2], join='outer').dropna(how='all'))
output:
letter number animal newcol
0 c 3 cat NaN
1 d 4 dog NaN
You could use pd.concat() with axis=0 (default) and join='outer' (default). I'm illustrating with some examples:
df1 = pd.DataFrame({'col1': [3,3,3],
                    'col2': [4,4,4]})
df2 = pd.DataFrame({'col1': [1,2,3],
                    'col2': [1,2,3],
                    'col3': [1,2,3],
                    'col4': [1,2,3]})
print(df1)
col1 col2
0 3 4
1 3 4
2 3 4
print(df2)
col1 col2 col3 col4
0 1 1 1 1
1 2 2 2 2
2 3 3 3 3
df3 = pd.concat([df1, df2], axis=0, join='outer')
print(df3)
col1 col2 col3 col4
0 3 4 NaN NaN
1 3 4 NaN NaN
2 3 4 NaN NaN
0 1 1 1.0 1.0
1 2 2 2.0 2.0
2 3 3 3.0 3.0
To concatenate just the columns from df2 that are not present in df1:
pd.concat([df1, df2.loc[:, [c for c in df2.columns if c not in df1.columns]]], axis=1)
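An alternative sketch, using Index.difference for the column selection, or (since df2's extra columns are all NaN in this question anyway) a plain reindex of df1 to the union of both column sets:
# select only the columns of df2 that df1 lacks (Index.difference sorts them)
pd.concat([df1, df2[df2.columns.difference(df1.columns)]], axis=1)
# or skip df2's values entirely and just add the missing columns as NaN
df1.reindex(columns=df1.columns.union(df2.columns, sort=False))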
I have a DataFrame defined like:
df1 = pd.DataFrame({"col1":[1,np.nan,np.nan,np.nan,2,np.nan,np.nan,np.nan,np.nan],
"col2":[np.nan,3,np.nan,4,np.nan,np.nan,np.nan,5,6],
"col3":[np.nan,np.nan,7,np.nan,np.nan,8,9,np.nan, np.nan]})
I want to transform it into a DataFrame like:
df2 = pd.DataFrame({"col_name":['col1','col2','col3','col2','col1',
'col3','col3','col2','col2'],
"value":[1,3,7,4,2,8,9,5,6]})
If possible, can we reverse this process too? By that I mean convert df2 into df1.
I don't want to go through the DataFrame iteratively as it becomes too computationally expensive.
You can stack it:
out = (df1.stack().astype(int).droplevel(0)
          .rename_axis('col_name').reset_index(name='value'))
Output:
col_name value
0 col1 1
1 col2 3
2 col3 7
3 col2 4
4 col1 2
5 col3 8
6 col3 9
7 col2 5
8 col2 6
To go from out back to df1, you could pivot (keyword arguments are required in recent pandas versions):
out1 = out.reset_index().pivot(index='index', columns='col_name', values='value')
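The pivot result keeps the axis names and may reorder the columns; a sketch of the extra cleanup to match df1 exactly:
out1.index.name = None                    # drop the 'index' label left by reset_index
out1.columns.name = None                  # drop the 'col_name' label left by pivot
out1 = out1.reindex(columns=df1.columns)  # restore df1's column order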
Given the following dataframe:
import numpy as np
import pandas as pd
df = pd.DataFrame({'COL1': ['A', np.nan, 'A'],
                   'COL2': [np.nan, 'A', 'A']})
df
COL1 COL2
0 A NaN
1 NaN A
2 A A
I would like to create a column ('COL3') that uses the value from COL1 per row unless that value is null (or NaN). If the value is null (or NaN), I'd like for it to use the value from COL2.
The desired result is:
COL1 COL2 COL3
0 A NaN A
1 NaN A A
2 A A A
Thanks in advance!
In [8]: df
Out[8]:
COL1 COL2
0 A NaN
1 NaN B
2 A B
In [9]: df["COL3"] = df["COL1"].fillna(df["COL2"])
In [10]: df
Out[10]:
COL1 COL2 COL3
0 A NaN A
1 NaN B B
2 A B A
You can use np.where to conditionally set column values.
df = df.assign(COL3=np.where(df.COL1.isnull(), df.COL2, df.COL1))
>>> df
COL1 COL2 COL3
0 A NaN A
1 NaN A A
2 A A A
If you don't mind mutating the values in COL2, you can update them directly to get your desired result.
df = pd.DataFrame({'COL1': ['A', np.nan, 'A'],
                   'COL2': [np.nan, 'B', 'B']})
>>> df
COL1 COL2
0 A NaN
1 NaN B
2 A B
df.COL2.update(df.COL1)
>>> df
COL1 COL2
0 A A
1 NaN B
2 A A
Using .combine_first, which gives precedence to non-null values in the Series or DataFrame calling it:
import pandas as pd
import numpy as np
df = pd.DataFrame({'COL1': ['A', np.nan, 'A'],
                   'COL2': [np.nan, 'B', 'B']})
df['COL3'] = df.COL1.combine_first(df.COL2)
Output:
COL1 COL2 COL3
0 A NaN A
1 NaN B B
2 A B A
If we mod your df slightly, then you will see that this works, and in fact will work for any number of columns so long as each row has at least one valid value:
In [5]:
df = pd.DataFrame({'COL1': ['B', np.nan, 'B'],
                   'COL2': [np.nan, 'A', 'A']})
df
Out[5]:
COL1 COL2
0 B NaN
1 NaN A
2 B A
In [6]:
df.apply(lambda x: x[x.first_valid_index()], axis=1)
Out[6]:
0 B
1 A
2 B
dtype: object
first_valid_index will return the index value (in this case column) that contains the first non-NaN value:
In [7]:
df.apply(lambda x: x.first_valid_index(), axis=1)
Out[7]:
0 COL1
1 COL2
2 COL1
dtype: object
So we can use this to index into the row Series.
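To store the result as a new column like the other answers, a minimal sketch (note that first_valid_index returns None for an all-NaN row, which would raise a KeyError here):
df['COL3'] = df.apply(lambda x: x[x.first_valid_index()], axis=1)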
You can also use mask, which replaces the values where COL1 is NaN with the values from COL2:
In [8]: df.assign(COL3=df['COL1'].mask(df['COL1'].isna(), df['COL2']))
Out[8]:
COL1 COL2 COL3
0 A NaN A
1 NaN A A
2 A A A
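All of these approaches generalize beyond two columns; a hedged sketch chaining Series.combine_first over a priority-ordered column list:
from functools import reduce
cols = ['COL1', 'COL2']  # priority order; extend with further fallback columns as needed
df['COL3'] = reduce(lambda s, col: s.combine_first(df[col]), cols[1:], df[cols[0]])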
Pardon my poorly phrased question--I'm not sure how to word it.
Given this pandas pivot table,
df = pd.DataFrame({'col1': list('AABB'),
                   'col2': list('acab'),
                   'values': [1,3,4,5]})
pt = pd.pivot_table(df,
                    index=['col1', 'col2'],
                    values='values',
                    aggfunc=sum)
Output:
           values
col1 col2
A    a          1
     c          3
B    a          4
     b          5
How can I make the pivot table output this instead:
           values
col1 col2
A    a          1
     b        NaN
     c          3
B    a          4
     b          5
     c        NaN
If you convert your column to the category data type (new in pandas 0.15!) you will get the aggregation that you are after:
df.col2 = df.col2.astype('category')
In [378]: df.groupby(['col1','col2']).sum()
Out[378]:
           values
col1 col2
A    a          1
     b        NaN
     c          3
B    a          4
     b          5
     c        NaN
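If converting to a categorical isn't convenient, a version-independent sketch is to reindex the original pivot table pt to the full cartesian product of the observed keys, which inserts the missing combinations as NaN:
full = pd.MultiIndex.from_product([sorted(df['col1'].unique()),
                                   sorted(df['col2'].unique())],
                                  names=['col1', 'col2'])
pt.reindex(full)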