I want to join two pandas DataFrames on "ColA", but the values in "ColA" are not in the same order in the two DataFrames, and the DataFrames are not the same length. I want to join them so that the values in "ColA" are matched up and missing values are filled with 0.
import pandas as pd

df1 = pd.DataFrame({"ColA":["num 1", "num 2", "num 3"],
                    "ColB":[5,6,7]})
print(df1)
df2 = pd.DataFrame({"ColA":["num 2", "num 3","num 1", "num 4"],
"ColC":[3,2,1,5]})
print(df2)
ColA ColB
0 num 1 5
1 num 2 6
2 num 3 7
ColA ColC
0 num 2 3
1 num 3 2
2 num 1 1
3 num 4 5
Result should look like this:
# "num 1" is matched with the appropriate values and "num 4" gets 0 for "ColB"
ColA ColB ColC
0 num 1 5 1
1 num 2 6 3
2 num 3 7 2
3 num 4 0 5
Use DataFrame.merge with an outer join, replace the resulting NaN values with 0, and if necessary convert the dtypes back to the originals with a dictionary:
d = pd.concat([df1.dtypes, df2.dtypes]).to_dict()
df = df1.merge(df2, how='outer', on='ColA').fillna(0).astype(d)
print (df)
ColA ColB ColC
0 num 1 5 1
1 num 2 6 3
2 num 3 7 2
3 num 4 0 5
Or use concat and convert all columns to integers (if possible):
df = (pd.concat([df1.set_index('ColA'),
df2.set_index('ColA')], axis=1, sort=True)
.fillna(0)
.astype(int)
.rename_axis('ColA')
.reset_index())
print (df)
ColA ColB ColC
0 num 1 5 1
1 num 2 6 3
2 num 3 7 2
3 num 4 0 5
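As a side note, if keeping integer dtypes through the outer join is the main concern, here is a minimal sketch using pandas nullable dtypes (an assumption: pandas >= 1.0), which avoids the dtype dictionary entirely; it uses only the df1/df2 defined above:
# convert_dtypes() switches ColB/ColC to the nullable Int64 dtype,
# so the outer merge introduces <NA> instead of upcasting to float
df = (df1.convert_dtypes()
         .merge(df2.convert_dtypes(), how='outer', on='ColA')
         .fillna(0))
print (df)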
I have two DataFrames, and df2 has more columns.
If a row in df1 does not appear in df2, I want to select it into df3.
df1
id colA colB
0 1 4 1
1 2 5 2
2 3 2 4
3 4 4 2
4 5 2 4
df2
id colA colB colC
0 1 4 1 0
1 2 5 2 0
2 5 2 4 0
I want to select those rows from df1:
df3
id colA colB
0 3 2 4
1 4 4 2
Assuming you are comparing on the 'id' column (if not, please clarify), you can use Series.isin with boolean indexing.
>>> df3 = df1[~df1['id'].isin(df2['id'])]
>>> df3
id colA colB
2 3 2 4
3 4 4 2
df3 = df1.loc[~df1['id'].isin(list(df2['id']))]
Output:
id colA colB
2 3 2 4
3 4 4 2
Use drop_duplicates:
import pandas as pd
df1 = pd.DataFrame({'id': [1,2,3,4,5],
'colA':[4,5,2,4,2],
'colB':[1,2,4,2,4]})
df2 = pd.DataFrame({'id': [1,2,5],
'colA':[4,5,2],
'colB':[1,2,4]})
pd.concat([df1,df2]).drop_duplicates(subset='id',keep=False)
Output:
id colA colB
2 3 2 4
3 4 4 2
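Another option, not shown in the answers above, is an anti-join via merge with indicator=True; a sketch, again assuming the comparison is on the 'id' column and using only the df1/df2 from the question:
# indicator=True adds a '_merge' column marking where each row was found
m = df1.merge(df2[['id']], on='id', how='left', indicator=True)
df3 = m.loc[m['_merge'] == 'left_only', df1.columns]
print (df3)
id colA colB
2 3 2 4
3 4 4 2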
I have two dataframes with the same form:
> df1
Day ItemId Quantity
1 1 2
1 2 3
1 4 5
> df2
Day ItemId Quantity
1 1 0
1 2 0
1 3 0
1 4 0
I'd like to merge df1 and df2, and if a ['Day','ItemId'] pair exists in both df1 and df2, keep the row with the maximum Quantity (which here comes from df1).
I tried this command:
df = pd.concat([df1, df2]).groupby(level=0).max(df1['Quantity'],df2['Quantity'])
Group by both columns (passed as a list) and aggregate Quantity with max. The attempt above fails because level=0 groups on the index rather than on ['Day','ItemId'], and GroupBy.max takes no Series arguments:
df = pd.concat([df1, df2]).groupby(['Day','ItemId'], as_index=False)['Quantity'].max()
print (df)
Day ItemId Quantity
0 1 1 2
1 1 2 3
2 1 3 0
3 1 4 5
If there can be multiple value columns:
df = (pd.concat([df1, df2])
.sort_values(['Day','ItemId','Quantity'], ascending=[True, True, False])
.drop_duplicates(['Day','ItemId']))
print (df)
Day ItemId Quantity
0 1 1 2
1 1 2 3
2 1 3 0
2 1 4 5
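If taking the per-group maximum independently in every value column is acceptable (an assumption; the sort/drop_duplicates version above keeps whole rows intact), the groupby form extends directly:
df = pd.concat([df1, df2]).groupby(['Day','ItemId'], as_index=False).max()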
I'm trying to find rows that have unique pairs of values across 2 columns, so this dataframe:
A B
1 0
2 0
3 0
0 1
2 1
3 1
0 2
1 2
3 2
0 3
1 3
2 3
should be reduced so that each pair appears only once regardless of column order. For instance, 1 and 3 is a combination I only want returned once; if the same pair also appears flipped (3 and 1), that duplicate can be removed. The table I'm looking to get is:
A B
0 2
0 3
1 0
1 2
1 3
2 3
Where there is only one occurrence of each pair of values that are mirrored if the columns are flipped.
I think you can use apply sorted + drop_duplicates:
df = df.apply(sorted, axis=1).drop_duplicates()
print (df)
A B
0 0 1
1 0 2
2 0 3
4 1 2
5 1 3
8 2 3
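On recent pandas versions, apply(sorted, axis=1) may return a Series of lists rather than a DataFrame, so the drop_duplicates call can fail; a sketch that keeps the original column labels, assuming pandas >= 0.23 for the result_type argument, is:
# result_type='broadcast' keeps the original columns A and B
df = df.apply(sorted, axis=1, result_type='broadcast').drop_duplicates()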
Faster solution with numpy.sort:
import numpy as np

df = (pd.DataFrame(np.sort(df.values, axis=1), index=df.index, columns=df.columns)
        .drop_duplicates())
print (df)
A B
0 0 1
1 0 2
2 0 3
4 1 2
5 1 3
8 2 3
A solution without sorting, using DataFrame.min and DataFrame.max:
a = df.min(axis=1)
b = df.max(axis=1)
df['A'] = a
df['B'] = b
df = df.drop_duplicates()
print (df)
A B
0 0 1
1 0 2
2 0 3
4 1 2
5 1 3
8 2 3
Loading the data:
import numpy as np
import pandas as pd
a = np.array("1 2 3 0 2 3 0 1 3 0 1 2".split(), dtype=np.double)
b = np.array("0 0 0 1 1 1 2 2 2 3 3 3".split(), dtype=np.double)
df = pd.DataFrame(dict(A=a,B=b))
In case you don't need to sort the entire DF:
df["trans"] = df.apply(
lambda row: (min(row['A'], row['B']), max(row['A'], row['B'])), axis=1
)
df.drop_duplicates("trans")
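If the helper column should not end up in the result, it can be dropped afterwards (a small addition to the snippet above):
df.drop_duplicates("trans").drop(columns="trans")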
I have a dictionary whose keys are the columns of a DataFrame, like:
dict = {"colA":1,"colB":1,"colC":1}
with colA, colB, colC the columns of my dataframe.
I would like to do something like:
df.loc[(df["colA"] <= dict["colA"]) & (df["colB"] <= dict["colB"]) & (df["colC"] <= dict["colC"])]
but dynamically (I don't know the length of the dict / number of columns)
Is there a way to do a & with dynamic number of arguments?
You can use:
from functools import reduce
df = pd.DataFrame({'colA':[1,2,0],
'colB':[0,5,6],
'colC':[1,8,9]})
print (df)
colA colB colC
0 1 0 1
1 2 5 8
2 0 6 9
d = {"colA":1,"colB":1,"colC":1}
a = df[(df["colA"] <= d["colA"]) & (df["colB"] <= d["colB"]) & (df["colC"] <= d["colC"])]
print (a)
colA colB colC
0 1 0 1
Solution: create a Series from the dictionary, compare with le, check that each row is all True with all(axis=1), and finally use boolean indexing:
d = {"colA":1,"colB":1,"colC":1}
s = pd.Series(d)
print (s)
colA 1
colB 1
colC 1
dtype: int64
print (df.le(s).all(axis=1))
0 True
1 False
2 False
dtype: bool
print (df[df.le(s).all(axis=1)])
colA colB colC
0 1 0 1
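If the dictionary only covers some of the columns (an assumption going beyond the original question), restrict the comparison to those keys first:
cols = list(d)
print (df[df[cols].le(pd.Series(d)).all(axis=1)])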
Another solution: use numpy.logical_and with reduce to create the mask, and a list comprehension to apply the conditions:
print ([df[x] <= d[x] for x in df.columns])
[0 True
1 False
2 True
Name: colA, dtype: bool, 0 True
1 False
2 False
Name: colB, dtype: bool, 0 True
1 False
2 False
Name: colC, dtype: bool]
mask = reduce(np.logical_and, [df[x] <= d[x] for x in df.columns])
print (mask)
0 True
1 False
2 False
Name: colA, dtype: bool
print (df[mask])
colA colB colC
0 1 0 1
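The same mask can also be built with the ufunc's own reduce, which returns a plain NumPy boolean array:
mask = np.logical_and.reduce([df[x] <= d[x] for x in df.columns])
print (df[mask])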
Here is one SQL-like solution, which uses the .query() method:
Data:
In [23]: df
Out[23]:
colA colB colC
0 2 2 5
1 3 0 8
2 5 9 2
3 3 0 2
4 9 1 3
5 7 5 6
6 7 8 0
7 0 4 1
8 8 2 6
9 9 6 7
Solution:
In [20]: dct = {"colA":4,"colB":4,"colC":4}
In [21]: qry = ' and '.join(('{0[0]} <= {0[1]}'.format(tup) for tup in dct.items()))
In [22]: qry
Out[22]: 'colB <= 4 and colA <= 4 and colC <= 4'
In [24]: df.query(qry)
Out[24]:
colA colB colC
3 3 0 2
7 0 4 1
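The same query string can also be built with an f-string (equivalent to the format call above, assuming Python 3.6+):
qry = ' and '.join(f'{col} <= {val}' for col, val in dct.items())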
I have a pandas DataFrame with a non-unique index:
index = [1,1,1,1,2,2,2,3]
df = pd.DataFrame(data = {'col1': [1,3,7,6,2,4,3,4]}, index=index)
df
Out[12]:
col1
1 1
1 3
1 7
1 6
2 2
2 4
2 3
3 4
I'd like to turn this into a unique MultiIndex, preserving order, like this:
col1
Ind2
1 0 1
1 3
2 7
3 6
2 0 2
1 4
2 3
3 0 4
I would imagine pandas has a function for something like this, but I haven't found anything.
You can do a groupby.cumcount on the index, and then append it as a new level to the index using set_index:
df = df.set_index(df.groupby(level=0).cumcount(), append=True)
The resulting output:
col1
1 0 1
1 3
2 7
3 6
2 0 2
1 4
2 3
3 0 4
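A quick check, not part of the original answer, that the new index really is unique and usable for lookups:
print (df.index.is_unique)      # True
print (df.loc[(1, 2), 'col1'])  # 7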