Number of missing entries when merging DataFrames - python

In an exercise, I was asked to merge 3 DataFrames with an inner join (df1 + df2 + df3 = mergedDf), and then, in a follow-up question, to report how many entries were lost in that 3-way merge.
#DataFrame1
df1 = pd.DataFrame(columns=["Goals","Medals"],data=[[5,2],[1,0],[3,1]])
df1.index = ['Argentina','Angola','Bolivia']
print(df1)
Goals Medals
Argentina 5 2
Angola 1 0
Bolivia 3 1
#DataFrame2
df2 = pd.DataFrame(columns=["Dates","Medals"],data=[[1,0],[2,1],[2,2]])
df2.index = ['Venezuela','Africa','Argentina']
print(df2)
Dates Medals
Venezuela 1 0
Africa 2 1
Argentina 2 2
#DataFrame3
df3 = pd.DataFrame(columns=["Players","Goals"],data=[[11,5],[11,1],[10,0]])
df3.index = ['Argentina','Australia','Belgica']
print(df3)
Players Goals
Argentina 11 5
Australia 11 1
Belgica 10 0
#mergedDf
mergedDf = pd.merge(df1,df2,how='inner',left_index=True, right_index=True)
mergedDf = pd.merge(mergedDf,df3,how='inner',left_index=True, right_index=True)
print(mergedDf)
Goals_x Medals_x Dates Medals_y Players Goals_y
Argentina 5 2 2 2 11 5
#Calculate number of lost entries by code
I tried merging everything with an outer join and then subtracting the inner mergedDf from it, but I don't know how to do this. Can anyone help me?

I've found a simple but effective solution:
Merging the 3 DataFrames, inner and outer:
# df1, df2, df3 are the three DataFrames to merge (built however your exercise provides them)
df1 = Df1()
df2 = Df2()
df3 = Df3()
inner = pd.merge(pd.merge(df1,df2,on='<Common column>',how='inner'),df3,on='<Common column>',how='inner')
outer = pd.merge(pd.merge(df1,df2,on='<Common column>',how='outer'),df3,on='<Common column>',how='outer')
Now the number of missing entries (rows) is:
missing = len(outer) - len(inner)
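For the index-joined frames from the question, a concrete version of the same idea might look like this (a sketch; it assumes df1, df2 and df3 share their country index, as above, rather than a common column):
# inner keeps only the countries present in all three frames
inner = (df1.merge(df2, how='inner', left_index=True, right_index=True)
            .merge(df3, how='inner', left_index=True, right_index=True))
# outer keeps every country that appears in at least one frame
outer = (df1.merge(df2, how='outer', left_index=True, right_index=True)
            .merge(df3, how='outer', left_index=True, right_index=True))
missing = len(outer) - len(inner)
print(missing)
6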

A solution with outer joins and the indicator parameter: at the end, count the rows where neither indicator column a nor b is 'both', by summing the True values (which are treated as 1s):
mergedDf = pd.merge(df1,df2,how='outer',left_index=True, right_index=True, indicator='a')
mergedDf = pd.merge(mergedDf,df3,how='outer',left_index=True, right_index=True, indicator='b')
print(mergedDf)
Goals_x Medals_x Dates Medals_y a Players Goals_y \
Africa NaN NaN 2.0 1.0 right_only NaN NaN
Angola 1.0 0.0 NaN NaN left_only NaN NaN
Argentina 5.0 2.0 2.0 2.0 both 11.0 5.0
Australia NaN NaN NaN NaN NaN 11.0 1.0
Belgica NaN NaN NaN NaN NaN 10.0 0.0
Bolivia 3.0 1.0 NaN NaN left_only NaN NaN
Venezuela NaN NaN 1.0 0.0 right_only NaN NaN
b
Africa left_only
Angola left_only
Argentina both
Australia right_only
Belgica right_only
Bolivia left_only
Venezuela left_only
missing = ((mergedDf['a'] != 'both') & (mergedDf['b'] != 'both')).sum()
print (missing)
6
Another solution is to use an inner join and, for each original DataFrame, sum the index values that did not match mergedDf.index:
mergedDf = pd.merge(df1,df2,how='inner',left_index=True, right_index=True)
mergedDf = pd.merge(mergedDf,df3,how='inner',left_index=True, right_index=True)
vals = mergedDf.index
print (vals)
Index(['Argentina'], dtype='object')
dfs = [df1, df2, df3]
missing = sum((~x.index.isin(vals)).sum() for x in dfs)
print (missing)
6
Another solution, if the values in each index are unique:
dfs = [df1, df2, df3]
L = [set(x.index) for x in dfs]
#https://stackoverflow.com/a/25324329/2901002
missing = len(set.union(*L) - set.intersection(*L))
print (missing)
6

You can pass True to the indicator parameter in merge:
df1=pd.DataFrame({'A':[1,2,3],'B':[1,1,1]})
df2=pd.DataFrame({'A':[2,3],'B':[1,1]})
df1.merge(df2,on='A',how='inner')
Out[257]:
A B_x B_y
0 2 1 1
1 3 1 1
df1.merge(df2,on='A',how='outer',indicator =True)
Out[258]:
A B_x B_y _merge
0 1 1 NaN left_only
1 2 1 1.0 both
2 3 1 1.0 both
mergedf=df1.merge(df2,on='A',how='outer',indicator =True)
Then with value_counts you know how many rows you lose when doing the inner join, since only the 'both' rows are kept when how='inner':
mergedf['_merge'].value_counts()
Out[260]:
both 2
left_only 1
right_only 0
Name: _merge, dtype: int64
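To reduce that to a single number of dropped rows (a small addition using the same mergedf): anything that is not 'both' would disappear in an inner join.
missing = (mergedf['_merge'] != 'both').sum()
print(missing)
1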
For 3 DataFrames, chain the merges and filter for rows where both merge-indicator columns are 'both':
df1.merge(df2, on='A',how='outer',indicator =True).rename(columns={'_merge':'merge'}).merge(df3, on='A',how='outer',indicator =True)
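The chained call above stops at the merge itself; a sketch of how the count could be completed (df3 here is a hypothetical third frame with an 'A' column, not defined in the original example):
df3 = pd.DataFrame({'A':[3,4],'B':[1,1]})
m = (df1.merge(df2, on='A', how='outer', indicator=True)
        .rename(columns={'_merge': 'merge'})
        .merge(df3, on='A', how='outer', indicator=True))
# a row survives the 3-way inner join only if both indicator columns are 'both'
missing = ((m['merge'] != 'both') | (m['_merge'] != 'both')).sum()
print(missing)
3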

Related

Outer join to check the existence of each record of two pandas dataframes, like SQL

I have two tables that look like the following:
Table T1
ColumnA ColumnB
A 1
A 3
B 1
C 2
Table T2
ColumnA ColumnB
A 1
A 4
B 1
D 2
In SQL I would use the following query to check the existence of each record:
select
COALESCE(T1.ColumnA,T2.ColumnA) as ColumnA
,T1.ColumnB as ExistT1
,T2.ColumnB as ExistT2
from T1
full join T2 on
T1.ColumnA=T2.ColumnA
and T1.ColumnB=T2.ColumnB
where
(T1.ColumnA is null or T2.ColumnA is null)
I have tried many ways in pandas, like concat, join, merge, etc., but it seems that the two merge keys get combined into one.
I think the problem is that what I want to check are not 'data columns' but 'key columns'.
Is there a good way to do this in Python? Thanks!
The expected result:
ColumnA ExistT1 ExistT2
A 3 null
A null 4
C 2 null
D null 2
pd.merge has an indicator parameter that could be helpful here:
(t1
 .merge(t2, how='outer', indicator=True)
 .loc[lambda df: df._merge != "both"]
 .assign(ExistT1=lambda df: df.ColumnB.where(df._merge.eq('left_only')),
         ExistT2=lambda df: df.ColumnB.where(df._merge.eq('right_only')))
 .drop(columns=['ColumnB', '_merge'])
)
ColumnA ExistT1 ExistT2
1 A 3.0 NaN
3 C 2.0 NaN
4 A NaN 4.0
5 D NaN 2.0
First, merge the 2 dataframes with the following code:
(df1.assign(ExistT1=df1['ColumnB'])
.merge(df2.assign(ExistT2=df2['ColumnB']), how='outer'))
output:
ColumnA ColumnB ExistT1 ExistT2
0 A 1 1.00 1.00
1 A 3 3.00 NaN
2 B 1 1.00 1.00
3 C 2 2.00 NaN
4 A 4 NaN 4.00
5 D 2 NaN 2.00
Second, drop ColumnB and the rows where both values are the same (like row 0 and row 2). Full code so far:
(df1.assign(ExistT1=df1['ColumnB'])
.merge(df2.assign(ExistT2=df2['ColumnB']), how='outer')
.drop('ColumnB', axis=1)
.loc[lambda x: x.isnull().any(axis=1)])
output:
ColumnA ExistT1 ExistT2
1 A 3.00 NaN
3 C 2.00 NaN
4 A NaN 4.00
5 D NaN 2.00
Finally, sort_values and reset_index (full code):
(df1.assign(ExistT1=df1['ColumnB'])
.merge(df2.assign(ExistT2=df2['ColumnB']), how='outer')
.drop('ColumnB', axis=1)
.loc[lambda x: x.isnull().any(axis=1)]
.sort_values(['ColumnA']).reset_index(drop=True))
result:
ColumnA ExistT1 ExistT2
0 A 3.00 NaN
1 A NaN 4.00
2 C 2.00 NaN
3 D NaN 2.00

Check if values in one dataframe match values from another, updating dataframe

Let's say I have 2 dataframes with different lengths but the same number of columns:
df1 = pd.DataFrame({'country': ['Russia','Mexico','USA','Argentina','Denmark','Syngapore'],
'population': [41,12,26,64,123,24]})
df2 = pd.DataFrame({'country': ['Russia','Argentina','Australia','USA'],
'population': [44,12,23,64]})
Let's assume that some of the data in df1 is outdated and I've received a new dataframe that contains data which may or may not already exist in the outdated dataframe.
I want to find out if any of the values of df2.country are inside df1.country
By doing the following I'm able to return a boolean:
df = df1.country.isin(df2.country)
print(df)
Unfortunately, this just creates a boolean Series holding the answer to my question:
0 True
1 False
2 True
3 True
4 False
5 False
Name: country, dtype: bool
My goal here is to delete the rows of df1 whose values match df2 and then add the new data, kind of like an update.
I've managed to come up with something like this:
df = df1.country.isin(df2.country)
i = 0
for x in df:
    if x:
        df1.drop(i, inplace=True)
    i += 1
frames = [df1, df2]
df1 = pd.concat(frames)
df1.reset_index(drop=True, inplace=True)
print(df1)
which in fact works and updates the dataframe
country population
0 Mexico 12
1 Denmark 123
2 Syngapore 24
3 Russia 44
4 Argentina 12
5 Australia 23
6 USA 64
But I really believe there's a better, quicker and more practical way of doing the same thing, considering that the real dataframe is much bigger and updates every few seconds.
I'd love to hear some suggestions, Thanks!
Assuming col1 remains unique in the original dataframe, you can join the two tables together. Once they are in the same dataframe, you can apply your logic, i.e. take the value from the new dataframe whenever it is not null. You don't actually need to check whether col2 has changed for every entry in col1; you can simply replace the old col2 (col2_x) with the new one (col2_y) whenever col2_y is not NaN (based on your sample output).
import numpy as np
import pandas as pd
df1 = pd.DataFrame({'col1': ['a','f','r','g','d','s'], 'col2': [41,12,26,64,123,24]})
df2 = pd.DataFrame({'col1': ['a','g','o','r'], 'col2': [44,12,23,64]})
# do the join
x = pd.merge(df1, df2, how='outer', left_on="col1", right_on="col1")
col1 col2_x col2_y
0 a 41.0 44.0
1 f 12.0 NaN
2 r 26.0 64.0
3 g 64.0 12.0
4 d 123.0 NaN
5 s 24.0 NaN
6 o NaN 23.0
# apply your update rules
x['col2_x'] = np.where(
    ~x['col2_y'].isnull(),
    x['col2_y'], x['col2_x']
)
col1 col2_x col2_y
0 a 44.0 44.0
1 f 12.0 NaN
2 r 64.0 64.0
3 g 12.0 12.0
4 d 123.0 NaN
5 s 24.0 NaN
6 o 23.0 23.0
#clean up
x.drop("col2_y", axis=1, inplace = True)
x.columns = ["col1", "col2"]
col1 col2
0 a 44.0
1 f 12.0
2 r 64.0
3 g 12.0
4 d 123.0
5 s 24.0
6 o 23.0
The isin approach is so close! Simply use the results from isin as a mask, then concat the rows from df1 that are not in (~) df2 with the rest of df2:
m = df1['country'].isin(df2['country'])
df3 = pd.concat((df1[~m], df2), ignore_index=True)
df3:
country population
0 Mexico 12
1 Denmark 123
2 Syngapore 24
3 Russia 44
4 Argentina 12
5 Australia 23
6 USA 64

Efficiently combine dataframes on 2nd level index

I have two dataframes looking like
import pandas as pd
df1 = pd.DataFrame([2.1,4.2,6.3,8.4,10.5], index=[2,4,6,8,10])
df1.index.name = 't'
df2 = pd.DataFrame(index=pd.MultiIndex.from_tuples([('A','a',1),('A','a',4),
('A','b',5),('A','b',6),('B','c',7),
('B','c',9),('B','d',10),('B','d',11),
], names=('big', 'small', 't')))
I am searching for an efficient way to combine them such that I get
0
big small t
A a 1 NaN
2 2.1
4 4.2
b 5 NaN
6 6.3
B c 7 NaN
8 8.4
9 NaN
d 10 10.5
11 NaN
I.e. I want to get the index levels 0 and 1 of df2 as index levels 0 and 1 in df1.
Of course a loop over the dataframe would work as well, though not feasible for large dataframes.
EDIT:
It appears from the comments below that I should add: the index levels big and small should be inferred for t in df1 based on the ordering of t.
Assuming that you want the unknown index levels to be inferred based on the ordering of 't', we can use an outer merge, sort the values, and then re-create the MultiIndex using ffill logic (a Series is needed for this).
res = (df2.reset_index()
          .merge(df1, on='t', how='outer')
          .set_index(df2.index.names)
          .sort_index(level='t'))
res.index = pd.MultiIndex.from_arrays(
    [pd.Series(res.index.get_level_values(i)).ffill()
     for i in range(res.index.nlevels)],
    names=res.index.names)
print(res)
0
big small t
A a 1 NaN
2 2.1
4 4.2
b 5 NaN
6 6.3
B c 7 NaN
8 8.4
9 NaN
d 10 10.5
11 NaN
Try extracting the level values and reindex:
df2['0'] = df1.reindex(df2.index.get_level_values('t'))[0].values
Output:
0
big small t
A a 1 NaN
4 4.2
b 5 NaN
6 6.3
B c 7 NaN
9 NaN
d 10 10.5
11 NaN
For more columns in df1, we can just merge:
(df2.reset_index()
.merge(df1, on='t', how='left')
.set_index(df2.index.names)
)

convert rows to columns in pandas dataframe

I'd like to convert Df1 below to Df2.
The empty values would be filled with NaN.
The Dfs below are examples.
My data has weeks from 1 to 8 and 100,000 IDs.
Only week 8 has all IDs, so the result will have 100,000 rows.
I also have Df3, which holds the 100,000 ids, and I want to merge Df1 onto Df3, formatted like Df2,
e.g. pd.merge(df3, df1, on="id", how="left"), but with the layout of Df2.
Df1>
wk, id, col1, col2 ...
1 1 0.5 15
2 2 0.5 15
3 3 0.5 15
1 2 0.5 15
3 2 0.5 15
------
Df2>
wk1, id, col1, col2, wk2, id, col1, col2, wk3, id, col1, col2,...
1 1 0.5 15 2 1 NaN NaN 3 1 NaN NaN
1 2 0.5 15 2 2 0.5 15 3 2 0.5 15
1 3 NaN NaN 2 3 NaN NaN 3 3 0.5 15
Use:
#create dictionary for rename columns for correct sorting
d = dict(enumerate(df.columns))
d1 = {v:k for k, v in d.items()}
#first add missing values for each `wk` and `id`
df1 = df.set_index(['wk', 'id']).unstack().stack(dropna=False).reset_index()
#for each id create DataFrame, reshape by unstack and rename columns
df1 = (df1.groupby('id')
          .apply(lambda x: pd.DataFrame(x.values, columns=df.columns))
          .unstack()
          .reset_index(drop=True)
          .rename(columns=d1, level=0)
          .sort_index(axis=1, level=1)
          .rename(columns=d, level=0))
#convert values to integers if necessary
df1.loc[:, ['wk', 'id']] = df1.loc[:, ['wk', 'id']].astype(int)
#flatten MultiIndex in columns
df1.columns = ['{}_{}'.format(a, b) for a, b in df1.columns]
print (df1)
wk_0 id_0 col1_0 col2_0 wk_1 id_1 col1_1 col2_1 wk_2 id_2 col1_2 \
0 1 1 0.5 15.0 2 1 NaN NaN 3 1 NaN
1 1 2 0.5 15.0 2 2 0.5 15.0 3 2 0.5
2 1 3 NaN NaN 2 3 NaN NaN 3 3 0.5
col2_2
0 NaN
1 15.0
2 15.0
You can use GroupBy + concat. The idea is to create a list of dataframes with appropriately named columns and an appropriate index, then concatenate along axis=1:
d = {k: v.reset_index(drop=True) for k, v in df.groupby('wk')}
def formatter(df, key):
    return df.rename(columns={'wk': f'wk{key}'}).set_index('id')
L = [formatter(df, key) for key, df in d.items()]
res = pd.concat(L, axis=1).reset_index()
print(res)
id wk1 col1 col2 wk2 col1 col2 wk3 col1 col2
0 1 1.0 0.5 15.0 NaN NaN NaN NaN NaN NaN
1 2 1.0 0.5 15.0 2.0 0.5 15.0 3.0 0.5 15.0
2 3 NaN NaN NaN NaN NaN NaN 3.0 0.5 15.0
Note NaN forces your series to become float. There's no "good" fix for this.
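(A side note, not part of the original answer: recent pandas versions do offer a nullable integer dtype that keeps missing values without the float cast, if that matters downstream.)
s = pd.Series([1, 2, None])
print(s.dtype)                  # float64 -- the NaN forces the cast
print(s.astype('Int64').dtype)  # Int64, pandas' nullable integer dtype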

Best way to avoid merge nulls

Let's say I have those 2 pandas dataframes.
In [3]: df1 = pd.DataFrame({'id':[None,20,None,40,50],'value':[1,2,3,4,5]})
In [4]: df2 = pd.DataFrame({'index':[None,20,None], 'value':[1,2,3]})
In [7]: df1
Out[7]: id value
0 NaN 1
1 20.0 2
2 NaN 3
3 40.0 4
4 50.0 5
In [8]: df2
Out[8]: index value
0 NaN 1
1 20.0 2
2 NaN 3
When I merge those dataframes (based on the id and index columns), the result includes rows where the id and index columns have missing values.
df3 = df1.merge(df2, left_on='id', right_on = 'index', how='inner')
In [9]: df3
Out[9]: id value_x index value_y
0 NaN 1 NaN 1
1 NaN 1 NaN 3
2 NaN 3 NaN 1
3 NaN 3 NaN 3
4 20.0 2 20.0 2
That's what I tried, but I guess it's not the best solution:
I replaced all the missing values in the join column of one dataframe with some value,
and did the same in the second dataframe with a different value, so that the join condition returns False and those rows don't end up in the result.
In [14]: df1_fill = df1.fillna({'id':'NONE1'})
In [13]: df2_fill = df2.fillna({'index':'NONE2'})
In [15]: df1_fill
Out[15]: id value
0 NONE1 1
1 20 2
2 NONE1 3
3 40 4
4 50 5
In [16]: df2_fill
Out[16]: index value
0 NONE2 1
1 20 2
2 NONE2 3
What is the best solution for that issue?
Also, in the example the data type of the join columns is numeric, but it can be another type like text or date...
EDIT:
So, with the solutions here I can use the dropna function to drop the rows with missing values before the join, but that only fits an inner join, where I don't want those rows at all.
What about a left join or full join?
Let's say I have those 2 dataframes I've used before - df1, df2.
So for inner and left joins I really can use the dropna function:
In [61]: df_inner = df1.dropna(subset=['id']).merge(df2.dropna(subset=['index']), left_on='id', right_on = 'index', how='inner')
In [62]: df_inner
Out[62]: id value_x index value_y
0 20.0 2 20.0 6
In [63]: df_left = df1.merge(df2.dropna(subset=['index']), left_on='id', right_on = 'index', how='left')
In [64]: df_left
Out[64]: id value_x index value_y
0 NaN 1 NaN NaN
1 20.0 2 20.0 6.0
2 NaN 3 NaN NaN
3 40.0 4 NaN NaN
4 50.0 5 NaN NaN
In [65]: df_full = df1.merge(df2, left_on='id', right_on = 'index', how='outer')
In [66]: df_full
Out[66]: id value_x index value_y
0 NaN 1 NaN 5.0
1 NaN 1 NaN 7.0
2 NaN 3 NaN 5.0
3 NaN 3 NaN 7.0
4 20.0 2 20.0 6.0
5 40.0 4 NaN NaN
6 50.0 5 NaN NaN
For the left join I dropped the rows with missing values from the "right" dataframe and then merged.
That was fine because in a left join, if the condition returns False you just get nulls in the right-source columns, so it doesn't matter whether those rows really exist or simply fail to match.
But for a full join I need all the rows from both sources...
I can't use dropna because it will drop rows that I need, and if I don't use it I get the wrong result.
Thanks.
Why not do something like this:
pd.merge(df1.dropna(subset=['id']), df2.dropna(subset=['index']),
left_on='id',right_on='index', how='inner')
Output:
id value_x index value_y
0 20.0 2 20.0 2
If you don't want NaN values then you can drop them, i.e.
df3 = df1.merge(df2, left_on='id', right_on = 'index', how='inner').dropna()
or
df3 = df1.dropna().merge(df2.dropna(), left_on='id', right_on = 'index', how='inner')
Output:
id value_x index value_y
0 20.0 2 20.0 2
For an outer merge, drop after merging, i.e.
df_full = df1.merge(df2, left_on='id', right_on = 'index', how='outer').dropna(subset = ['id'])
Output:
id value_x index value_y
4 20.0 2 20.0 2.0
5 40.0 4 NaN NaN
6 50.0 5 NaN NaN
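The outer merge above still matches the NaN ids to the NaN indexes before dropping, and the dropna(subset=['id']) also removes the rows that exist only in df2. A sketch of a full join that keeps every row from both frames without ever matching a NaN key to another NaN key (not from the original answers; it assumes the df1/df2 defined in the question):
# outer-join only the rows whose keys are actually present
core = df1.dropna(subset=['id']).merge(df2.dropna(subset=['index']),
                                       left_on='id', right_on='index', how='outer')
# re-attach the NaN-keyed rows from each side, aligning the value columns
left_nan = df1[df1['id'].isna()].rename(columns={'value': 'value_x'})
right_nan = df2[df2['index'].isna()].rename(columns={'value': 'value_y'})
df_full = pd.concat([core, left_nan, right_nan], ignore_index=True, sort=False)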
Since you are doing an 'inner' join, what you could do is drop the rows in df1 where the id column is NaN before you merge.
df1_nonan = df1.dropna(subset = ['id'])
df3 = df1_nonan.merge(df2, left_on='id', right_on = 'index', how='inner')
