Check if values in one dataframe match values from another, updating dataframe - python

Let's say I have 2 dataframes,
both have different lengths but the same number of columns
import pandas as pd

df1 = pd.DataFrame({'country': ['Russia','Mexico','USA','Argentina','Denmark','Syngapore'],
                    'population': [41,12,26,64,123,24]})
df2 = pd.DataFrame({'country': ['Russia','Argentina','Australia','USA'],
                    'population': [44,12,23,64]})
Let's assume that some of the data in df1 is outdated and I've received a new dataframe that contains some new data, which may or may not already exist in the outdated dataframe.
I want to find out if any of the values of df2.country are inside df1.country
By doing the following I'm able to return a boolean:
df = df1.country.isin(df2.country)
print(df)
Unfortunately this just creates a new boolean Series containing the answer to my question:
0 True
1 False
2 True
3 True
4 False
5 False
Name: country, dtype: bool
My goal here is to delete the rows of df1 whose values match df2 and then add the new data, kind of like an update.
I've managed to come up with something like this:
df = df1.country.isin(df2.country)
i = 0
for x in df:
    if x:
        df1.drop(i, inplace=True)
    i += 1
frames = [df1, df2]
df1 = pd.concat(frames)
df1.reset_index(drop=True, inplace=True)
print(df1)
which in fact works and updates the dataframe
country population
0 Mexico 12
1 Denmark 123
2 Syngapore 24
3 Russia 44
4 Argentina 12
5 Australia 23
6 USA 64
But I really believe there's a better, faster, and more practical way of doing this, considering that the real dataframe is much bigger and updates every few seconds.
I'd love to hear some suggestions, Thanks!

Assuming col1 remains unique in the original dataframe, you can join the two tables together. Once you have them in the same dataframe, you can apply your logic, i.e. take the value from the new dataframe if it is not null. You don't actually need to check whether col2 has changed for every entry in col1; you can simply replace the old col2 value with the new one as long as the new one is not NaN (based on your sample output).
import pandas as pd
import numpy as np

df1 = pd.DataFrame({'col1': ['a','f','r','g','d','s'], 'col2': [41,12,26,64,123,24]})
df2 = pd.DataFrame({'col1': ['a','g','o','r'], 'col2': [44,12,23,64]})

# do the join
x = pd.merge(df1, df2, how='outer',
             left_on="col1", right_on="col1")
col1 col2_x col2_y
0 a 41.0 44.0
1 f 12.0 NaN
2 r 26.0 64.0
3 g 64.0 12.0
4 d 123.0 NaN
5 s 24.0 NaN
6 o NaN 23.0
# apply your update rules
x['col2_x'] = np.where(
    ~x['col2_y'].isnull(),
    x['col2_y'], x['col2_x']
)
col1 col2_x col2_y
0 a 44.0 44.0
1 f 12.0 NaN
2 r 64.0 64.0
3 g 12.0 12.0
4 d 123.0 NaN
5 s 24.0 NaN
6 o 23.0 23.0
#clean up
x.drop("col2_y", axis=1, inplace = True)
x.columns = ["col1", "col2"]
col1 col2
0 a 44.0
1 f 12.0
2 r 64.0
3 g 12.0
4 d 123.0
5 s 24.0
6 o 23.0
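As an aside, a minimal sketch of the same "new values win" update applied directly to the question's country/population frames, using set_index plus combine_first instead of an explicit merge and np.where (assuming country is unique in both frames):
import pandas as pd

df1 = pd.DataFrame({'country': ['Russia','Mexico','USA','Argentina','Denmark','Syngapore'],
                    'population': [41,12,26,64,123,24]})
df2 = pd.DataFrame({'country': ['Russia','Argentina','Australia','USA'],
                    'population': [44,12,23,64]})

# Values from df2 take priority wherever a country appears in both frames;
# countries only present in df1 keep their old population.
updated = (df2.set_index('country')
              .combine_first(df1.set_index('country'))
              .reset_index())
print(updated)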

The isin approach is so close! Simply use the result of isin as a mask, then concat the rows from df1 that are not in (~) df2 with all of df2:
m = df1['country'].isin(df2['country'])
df3 = pd.concat((df1[~m], df2), ignore_index=True)
df3:
country population
0 Mexico 12
1 Denmark 123
2 Syngapore 24
3 Russia 44
4 Argentina 12
5 Australia 23
6 USA 64

Related

Mapping a dictionary to NaN rows of a column in Pandas

Shown below is a dataframe where column col2 contains many NaNs. I want to fill only those NaN values by using col1 as the key into the dictionary dict_map and mapping the corresponding value into col2.
Reproducible code:
import pandas as pd
import numpy as np
dict_map = {'a':45,'b':23,'c':97,'z': -1}
df = pd.DataFrame()
df['tag'] = [1,2,3,4,5,6,7,8,9,10,11]
df['col1'] = ['a','b','c','b','a','a','z','c','b','c','b']
df['col2'] = [np.nan,909,34,56,np.nan,45,np.nan,11,61,np.nan,np.nan]
df['_'] = df['col1'].map(dict_map)
Expected Output
One method is:
df['col3'] = np.where(df['col2'].isna(),df['_'],df['col2'])
df
I just wanted to know whether there is any other method, e.g. using a function or map, with which we can optimize this.
You can map col1 with your dict_map and then use that as input to fillna, as follows:
df['col3'] = df['col2'].fillna(df['col1'].map(dict_map))
You can achieve the very same result using a list comprehension; it is a pythonic solution and I believe it holds up well performance-wise.
We just read col2 and copy the value to col3 if it's not NaN. If it is, we look into col1, grab the dict key and use the corresponding value from dict_map instead.
df['col3'] = [df['col2'][idx] if not np.isnan(df['col2'][idx]) else dict_map[df['col1'][idx]] for idx in df.index.tolist()]
Output:
df
tag col1 col2 col3
0 1 a NaN 45.0
1 2 b 909.0 909.0
2 3 c 34.0 34.0
3 4 b 56.0 56.0
4 5 a NaN 45.0
5 6 a 45.0 45.0
6 7 z NaN -1.0
7 8 c 11.0 11.0
8 9 b 61.0 61.0
9 10 c NaN 97.0
10 11 b NaN 23.0

Efficiently combine dataframes on 2nd level index

I have two dataframes looking like
import pandas as pd
df1 = pd.DataFrame([2.1,4.2,6.3,8.4,10.5], index=[2,4,6,8,10])
df1.index.name = 't'
df2 = pd.DataFrame(index=pd.MultiIndex.from_tuples([('A','a',1),('A','a',4),
                                                    ('A','b',5),('A','b',6),('B','c',7),
                                                    ('B','c',9),('B','d',10),('B','d',11),
                                                   ], names=('big', 'small', 't')))
I am searching for an efficient way to combine them such that I get
0
big small t
A a 1 NaN
2 2.1
4 4.2
b 5 NaN
6 6.3
B c 7 NaN
8 8.4
9 NaN
d 10 10.5
11 NaN
I.e. I want to get the index levels 0 and 1 of df2 as index levels 0 and 1 in df1.
Of course a loop over the dataframe would work as well, though not feasible for large dataframes.
EDIT:
It appears from the comments below that I should add: the index levels big and small should be inferred for the t values of df1 based on the ordering of t.
Assuming that you want the unknown index levels to be inferred based on the ordering of 't', we can use an outer merge, sort the values and then re-create the MultiIndex using ffill logic (which needs a Series).
res = (df2.reset_index()
          .merge(df1, on='t', how='outer')
          .set_index(df2.index.names)
          .sort_index(level='t'))

res.index = pd.MultiIndex.from_arrays(
    [pd.Series(res.index.get_level_values(i)).ffill()
     for i in range(res.index.nlevels)],
    names=res.index.names)
print(res)
0
big small t
A a 1 NaN
2 2.1
4 4.2
b 5 NaN
6 6.3
B c 7 NaN
8 8.4
9 NaN
d 10 10.5
11 NaN
Try extracting the level values and reindex:
df2['0'] = df1.reindex(df2.index.get_level_values('t'))[0].values
Output:
0
big small t
A a 1 NaN
4 4.2
b 5 NaN
6 6.3
B c 7 NaN
9 NaN
d 10 10.5
11 NaN
For more columns in df1, we can just merge:
(df2.reset_index()
    .merge(df1, on='t', how='left')
    .set_index(df2.index.names)
)

Filling Null Values based on conditions on other columns

I want to fill the Null values in the first column based on the value of the 2nd column.
(For example)
For "Apples" in col2, the value should be 12 in place of NaN in col1.
For "Vegies" in col2, the value should be 134 in place of NaN in col1.
For every description, there is a specific code (number) in the 1st column. I need to map it somehow.
(IGNORE the dots.)
All I can think of is to make a dictionary of codes and replace the nulls, but that's very hardcoded.
Can anyone help?
col1. col2
12. Apple
134. Vegies
23. Oranges
Nan. Apples
Nan. Vegies
324. Sugar
Nan. Apples
Update:
Here I replicate your DataFrame and show the implementation:
import pandas as pd
import numpy as np
l1 = [12, 134, 23, np.nan, np.nan, 324, np.nan,np.nan,np.nan,np.nan]
l2 = ["Apple","Vegies","Oranges","Apples","Vegies","Sugar","Apples","Melon","Melon","Grapes"]
df = pd.DataFrame(l1, columns=["col1"])
df["col2"] = pd.DataFrame(l2)
df
Out[26]:
col1 col2
0 12.0 Apple
1 134.0 Vegies
2 23.0 Oranges
3 NaN Apples
4 NaN Vegies
5 324.0 Sugar
6 NaN Apples
7 NaN Melon
8 NaN Melon
9 NaN Grapes
Then, to replace the null values based on your rules:
df.loc[df.col2 == "Vegies", 'col1'] = 134
df.loc[df.col2 == "Apple", 'col1'] = 12
If you want to apply this at a larger scale, consider making a dictionary first.
For example:
item_dict = {"Apples":12, "Melon":65, "Vegies":134, "Grapes":78}
Then apply all of these to your dataframe with this custom function:
def item_mapping(df, dictionary, colsource, coltarget):
    dict_keys = list(dictionary.keys())
    dict_values = list(dictionary.values())
    for x in range(len(dict_keys)):
        df.loc[df[colsource] == dict_keys[x], coltarget] = dict_values[x]
    return df
Usage Examples:
item_mapping(df, item_dict, "col2", "col1")
col1 col2
0 12.0 Apple
1 134.0 Vegies
2 23.0 Oranges
3 12.0 Apples
4 134.0 Vegies
5 324.0 Sugar
6 12.0 Apples
7 65.0 Melon
8 65.0 Melon
9 78.0 Grapes
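As an aside, the same dictionary-driven fill can usually be done without an explicit loop by combining map and fillna; a minimal sketch, assuming the item_dict above covers every description that has a missing code:
import pandas as pd
import numpy as np

item_dict = {"Apples": 12, "Melon": 65, "Vegies": 134, "Grapes": 78}

df = pd.DataFrame({
    "col1": [12, 134, 23, np.nan, np.nan, 324, np.nan, np.nan, np.nan, np.nan],
    "col2": ["Apple", "Vegies", "Oranges", "Apples", "Vegies", "Sugar",
             "Apples", "Melon", "Melon", "Grapes"],
})

# Look up each description in the dictionary and use the result
# only where col1 is currently NaN; existing codes are untouched.
df["col1"] = df["col1"].fillna(df["col2"].map(item_dict))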

Number of missing entries when merging DataFrames

In an exercise, I was asked to merge 3 DataFrames with an inner join (df1+df2+df3 = mergedDf); then, in another question, I was asked how many entries I lost when performing this 3-way merge.
import pandas as pd

#DataFrame1
df1 = pd.DataFrame(columns=["Goals","Medals"],data=[[5,2],[1,0],[3,1]])
df1.index = ['Argentina','Angola','Bolivia']
print(df1)
Goals Medals
Argentina 5 2
Angola 1 0
Bolivia 3 1
#DataFrame2
df2 = pd.DataFrame(columns=["Dates","Medals"],data=[[1,0],[2,1],[2,2]])
df2.index = ['Venezuela','Africa','Argentina']
print(df2)
Dates Medals
Venezuela 1 0
Africa 2 1
Argentina 2 2
#DataFrame3
df3 = pd.DataFrame(columns=["Players","Goals"],data=[[11,5],[11,1],[10,0]])
df3.index = ['Argentina','Australia','Belgica']
print(df3)
Players Goals
Argentina 11 5
Australia 11 1
Belgica 10 0
#mergedDf
mergedDf = pd.merge(df1,df2,how='inner',left_index=True, right_index=True)
mergedDf = pd.merge(mergedDf,df3,how='inner',left_index=True, right_index=True)
print(mergedDf)
Goals_x Medals_x Dates Medals_y Players Goals_y
Argentina 5 2 2 2 11 5
#Calculate number of lost entries by code
I tried merging everything with an outer join and then subtracting mergedDf, but I don't know how to do this. Can anyone help me?
I've found a simple but effective solution:
Merging the 3 DataFrames, inner and outer:
df1 = Df1()
df2 = Df2()
df3 = Df3()
inner = pd.merge(pd.merge(df1,df2,on='<Common column>',how='inner'),df3,on='<Common column>',how='inner')
outer = pd.merge(pd.merge(df1,df2,on='<Common column>',how='outer'),df3,on='<Common column>',how='outer')
Now, the number of missed entries (rows) is:
return (len(outer)-len(inner))
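A minimal runnable sketch of this idea with the question's frames (index-based merges are used here, since the frames share an index rather than a key column; 'Belgica' follows the code above):
import pandas as pd

df1 = pd.DataFrame({"Goals": [5, 1, 3], "Medals": [2, 0, 1]},
                   index=["Argentina", "Angola", "Bolivia"])
df2 = pd.DataFrame({"Dates": [1, 2, 2], "Medals": [0, 1, 2]},
                   index=["Venezuela", "Africa", "Argentina"])
df3 = pd.DataFrame({"Players": [11, 11, 10], "Goals": [5, 1, 0]},
                   index=["Argentina", "Australia", "Belgica"])

# Inner keeps only the countries present in all three frames,
# outer keeps every country seen anywhere.
inner = pd.merge(pd.merge(df1, df2, how="inner", left_index=True, right_index=True),
                 df3, how="inner", left_index=True, right_index=True)
outer = pd.merge(pd.merge(df1, df2, how="outer", left_index=True, right_index=True),
                 df3, how="outer", left_index=True, right_index=True)

print(len(outer) - len(inner))  # 6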
Solution with outer joins and the indicator parameter; at the end, count the rows that are not 'both' in both indicator columns a and b by summing the True values (which are treated as 1s):
mergedDf = pd.merge(df1,df2,how='outer',left_index=True, right_index=True, indicator='a')
mergedDf = pd.merge(mergedDf,df3,how='outer',left_index=True, right_index=True, indicator='b')
print(mergedDf)
Goals_x Medals_x Dates Medals_y a Players Goals_y \
Africa NaN NaN 2.0 1.0 right_only NaN NaN
Angola 1.0 0.0 NaN NaN left_only NaN NaN
Argentina 5.0 2.0 2.0 2.0 both 11.0 5.0
Australia NaN NaN NaN NaN NaN 11.0 1.0
Belgica NaN NaN NaN NaN NaN 10.0 0.0
Bolivia 3.0 1.0 NaN NaN left_only NaN NaN
Venezuela NaN NaN 1.0 0.0 right_only NaN NaN
b
Africa left_only
Angola left_only
Argentina both
Australia right_only
Belgica right_only
Bolivia left_only
Venezuela left_only
missing = ((mergedDf['a'] != 'both') & (mergedDf['b'] != 'both')).sum()
print (missing)
6
Another solution is to use an inner join and, for each dataframe, sum the index values that do not match mergedDf.index:
mergedDf = pd.merge(df1,df2,how='inner',left_index=True, right_index=True)
mergedDf = pd.merge(mergedDf,df3,how='inner',left_index=True, right_index=True)
vals = mergedDf.index
print (vals)
Index(['Argentina'], dtype='object')
dfs = [df1, df2, df3]
missing = sum((~x.index.isin(vals)).sum() for x in dfs)
print (missing)
6
Another solution, if the values in each index are unique:
dfs = [df1, df2, df3]
L = [set(x.index) for x in dfs]
#https://stackoverflow.com/a/25324329/2901002
missing = len(set.union(*L) - set.intersection(*L))
print (missing)
6
You can pass True to the indicator parameter in merge:
df1=pd.DataFrame({'A':[1,2,3],'B':[1,1,1]})
df2=pd.DataFrame({'A':[2,3],'B':[1,1]})
df1.merge(df2,on='A',how='inner')
Out[257]:
A B_x B_y
0 2 1 1
1 3 1 1
df1.merge(df2,on='A',how='outer',indicator =True)
Out[258]:
A B_x B_y _merge
0 1 1 NaN left_only
1 2 1 1.0 both
2 3 1 1.0 both
mergedf=df1.merge(df2,on='A',how='outer',indicator =True)
Then with value_counts you know how many rows you lose when doing the inner join, since only the rows marked 'both' are kept when how='inner':
mergedf['_merge'].value_counts()
Out[260]:
both 2
left_only 1
right_only 0
Name: _merge, dtype: int64
For 3 dataframes, merge again and filter on the rows where both merge-indicator columns are 'both':
df1.merge(df2, on='A',how='outer',indicator =True).rename(columns={'_merge':'merge'}).merge(df3, on='A',how='outer',indicator =True)
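A minimal completion of this idea as a sketch, with a hypothetical toy df3 sharing the same key column; the lost-row count is then everything not marked 'both' in both indicator columns:
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [1, 1, 1]})
df2 = pd.DataFrame({'A': [2, 3], 'B': [1, 1]})
df3 = pd.DataFrame({'A': [3, 4], 'B': [1, 1]})  # hypothetical third frame

m = (df1.merge(df2, on='A', how='outer', indicator=True)
        .rename(columns={'_merge': 'merge'})
        .merge(df3, on='A', how='outer', indicator=True))

# A 3-way inner join keeps exactly the rows marked 'both' in both
# indicator columns; everything else would be lost.
lost = ((m['merge'] != 'both') | (m['_merge'] != 'both')).sum()
print(lost)  # 3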

Delete values from pandas dataframe based on logical operation

I want to delete the values that are greater than a certain threshold from a pandas dataframe. Is there an efficient way to perform this? I am doing it with apply and lambda, which works fine but is a bit slow for a large dataframe, and I feel like there must be a better method.
import pandas as pd

df = pd.DataFrame({'A': [1,2,3,4,5], 'B': [1,2,3,4,5]})
df
A B
0 1 1
1 2 2
2 3 3
3 4 4
4 5 5
How can this be done without apply and lambda?
df['A'] = df.apply(lambda x: x['A'] if x['A'] < 3 else None, axis=1)
df
A B
0 1.0 1
1 2.0 2
2 NaN 3
3 NaN 4
4 NaN 5
Use a boolean mask against the df:
In[21]:
df[df<3]
Out[21]:
A
0 1.0
1 2.0
2 NaN
3 NaN
4 NaN
Here, where the boolean condition is not met, False is returned; this just masks out the df value, returning NaN.
If you actually want to overwrite the df with this masked result then self-assign:
df = df[df<3]
To compare a specific column:
In[22]:
df[df['A']<3]
Out[22]:
A
0 1
1 2
If you want NaN for the removed values, you can use a trick: double square brackets return a single-column df, so we can mask the df:
In[25]:
df[df[['A']]<3]
Out[25]:
A
0 1.0
1 2.0
2 NaN
3 NaN
4 NaN
If you have multiple columns then the above won't work, as the boolean mask has to match the shape of the original df; in that case you can reindex against the original df index:
In[31]:
df = pd.DataFrame({'A': [1,2,3,4,5], 'B': [1,2,3,4,5]})
df[df['A']<3].reindex(df.index)
Out[31]:
A B
0 1.0 1.0
1 2.0 2.0
2 NaN NaN
3 NaN NaN
4 NaN NaN
EDIT
You've updated your question again; if you want to just overwrite the single column:
In[32]:
df = pd.DataFrame({'A': [1,2,3,4,5], 'B': [1,2,3,4,5]})
df['A'] = df.loc[df['A'] < 3,'A']
df
Out[32]:
A B
0 1.0 1
1 2.0 2
2 NaN 3
3 NaN 4
4 NaN 5
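As an aside, a minimal sketch of the same single-column overwrite using Series.where (or its inverse, Series.mask), which keeps the values that satisfy the condition and replaces the rest with NaN:
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [1, 2, 3, 4, 5]})

# Keep A where it is below the threshold, otherwise NaN.
df['A'] = df['A'].where(df['A'] < 3)

# Equivalent: blank out the values that meet the "too large" condition.
# df['A'] = df['A'].mask(df['A'] >= 3)
print(df)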
