How to complete NaN cells based on another Pandas dataframe in Python

I have the following two dataframes.
First dataframe df1:
import pandas as pd
import numpy as np
d1 = {'id': [1, 2, 3, 4], 'col1': [13, np.nan, 15, np.nan], 'col2': [23, np.nan, np.nan, np.nan]}
df1 = pd.DataFrame(data=d1)
df1
id col1 col2
0 1 13.0 23.0
1 2 NaN NaN
2 3 15.0 NaN
3 4 NaN NaN
And the second dataframe df2:
d2 = {'id': [2, 3, 4], 'col1': [ 14, 150, 16], 'col2': [24, 250, np.nan]}
df2 = pd.DataFrame(data=d2)
df2
id col1 col2
0 2 14 24.0
1 3 150 250.0
2 4 16 NaN
I need to replace the NaN fields in df1 with the non-NaN values from df2, where it is possible. But there are some conditions...
Condition 1) The id column in each dataframe consists of unique values. When replacing any NaN value in df1 with a value from df2, the id values must match.
Condition 2) The dataframes do not necessarily have the same size.
Condition 3) NaN values will only be looked for in col1 or col2 in either dataframe. The id column cannot be NaN in any row. There might be other columns in the dataframes, with or without NaN values, but for replacing the data we will only look at the col1 and col2 columns.
Condition 4) For a row in df1 to be replaced, it is enough that either col1 or col2 has a NaN value in that row. When any NaN value is detected in a row of df1, the entire row is replaced by the row with the same id value from df2, as long as all values of col1 and col2 in that df2 row are non-NaN. In other words, if the row with the same id value in df2 has a NaN in either col1 or col2, do not replace any data in df1.
After this operation, df1 should look like the following:
id col1 col2
0 1 13.0 23.0
1 2 14.0 24.0
2 3 150.0 250.0 # Note that the entire row is replaced!
3 4 NaN NaN # This row is not replaced because the col2 value is NaN in df2 for the same id
How can this be done in the most elegant way? Python offers a lot of functions that I may not be aware of, which may solve this problem in a few lines instead of requiring very complex logic.

You can drop the rows with NaN values from df2, then update with concat and groupby:
pd.concat([df2.dropna(), df1]).groupby('id', as_index=False).first()
Output:
id col1 col2
0 1 13.0 23.0
1 2 14.0 24.0
2 3 150.0 250.0
3 4 NaN NaN
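This works because, after the concat, the cleaned df2 rows come first, so groupby('id').first() takes their values whenever present and falls back to df1 otherwise. If the frames carry extra columns (condition 3), a sketch that only requires col1 and col2 to be complete in df2 would restrict dropna to those columns:
# Only require col1/col2 to be complete in df2; other columns may still hold NaN.
out = (pd.concat([df2.dropna(subset=['col1', 'col2']), df1])
         .groupby('id', as_index=False)
         .first())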

Here is another way, using fillna:
df1 = df1.set_index('id').fillna(df2.dropna().set_index('id')).reset_index()
Output:
id col1 col2
0 1 13.0 23.0
1 2 14.0 24.0
2 3 15.0 250.0
3 4 NaN NaN
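Note that fillna fills cell by cell, so col1 for id 3 keeps its original 15.0 instead of being replaced by 150.0; condition 4 (whole-row replacement) is not satisfied here. A minimal row-wise sketch, assuming only col1 and col2 participate:
a = df1.set_index('id')
b = df2.dropna(subset=['col1', 'col2']).set_index('id')
# df1 rows with any NaN in col1/col2 that have a complete counterpart in df2
mask = a[['col1', 'col2']].isna().any(axis=1) & a.index.isin(b.index)
# .loc assignment aligns on the id index, so matching rows are copied whole
a.loc[mask, ['col1', 'col2']] = b[['col1', 'col2']]
df1 = a.reset_index()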

Related

Mapping a dictionary to NaN rows of a column in Pandas

Shown below is a data frame where column col2 contains many NaNs. I want to fill only those NaN values by using col1 as the key into the dictionary dict_map and mapping the corresponding values into col2.
Reproducible code:
import pandas as pd
import numpy as np
dict_map = {'a':45,'b':23,'c':97,'z': -1}
df = pd.DataFrame()
df['tag'] = [1,2,3,4,5,6,7,8,9,10,11]
df['col1'] = ['a','b','c','b','a','a','z','c','b','c','b']
df['col2'] = [np.nan,909,34,56,np.nan,45,np.nan,11,61,np.nan,np.nan]
df['_'] = df['col1'].map(dict_map)
Expected Output
One of the methods is:
df['col3'] = np.where(df['col2'].isna(),df['_'],df['col2'])
df
I just wanted to know whether there is any other method, using a function with map, that could optimize this.
You can map col1 with your dict_map and then use that as input to fillna, as follows
df['col3'] = df['col2'].fillna(df['col1'].map(dict_map))
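For reference, this is the intermediate Series that fillna consumes; it shares df's index, so fillna aligns the fill values row by row:
fill_values = df['col1'].map(dict_map)  # 45, 23, 97, 23, 45, ... one value per row
df['col3'] = df['col2'].fillna(fill_values)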
You can achieve the very same result using a list comprehension; it is a very pythonic solution and I believe it holds better performance.
We just read col2 and copy the value to col3 if it is not NaN. If it is NaN, we look up the key from col1 and use the corresponding value from dict_map instead.
df['col3'] = [df['col2'][idx] if not np.isnan(df['col2'][idx]) else dict_map[df['col1'][idx]] for idx in df.index]
Output:
df
tag col1 col2 col3
0 1 a NaN 45.0
1 2 b 909.0 909.0
2 3 c 34.0 34.0
3 4 b 56.0 56.0
4 5 a NaN 45.0
5 6 a 45.0 45.0
6 7 z NaN -1.0
7 8 c 11.0 11.0
8 9 b 61.0 61.0
9 10 c NaN 97.0
10 11 b NaN 23.0

Replace NA values with the corresponding values from another dataframe

How can I replace NA values in df1
df1:
ID col1 col2 col3 col4
A NaN NaN NaN NaN
B 0 0 1 2
C NaN NaN NaN NaN
With the corresponding values from the other dataframe (so that existing non-NaN values are not overwritten)?
df2:
ID col1 col2 col3 col4
A 1 2 1 11
B 2 2 4 8
C 0 0 NaN NaN
So the result is:
ID col1 col2 col3 col4
A 1 2 1 11
B 0 0 1 2
C 0 0 NaN NaN
IIUC, if ID is the index in both DataFrames, use:
df = df1.fillna(df2)
Or:
df = df1.combine_first(df2)
print (df)
col1 col2 col3 col4
ID
A 1.0 2.0 1.0 11.0
B 0.0 0.0 1.0 2.0
C 0.0 0.0 NaN NaN
If ID is a column:
df = df1.set_index('ID').fillna(df2.set_index('ID'))
#alternative
#df = df1.set_index('ID').combine_first(df2.set_index('ID'))
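The practical difference between the two, as a minimal sketch with made-up frames a and b: combine_first also pulls in labels that exist only in the other frame, while fillna never changes the caller's shape.
import numpy as np
import pandas as pd
a = pd.DataFrame({'col1': [np.nan]}, index=['A'])
b = pd.DataFrame({'col1': [1.0, 3.0], 'col2': [2.0, 4.0]}, index=['A', 'B'])
a.fillna(b)         # shape stays (1, 1); only the NaN cell is filled
a.combine_first(b)  # shape becomes (2, 2); row 'B' and col2 are added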
import numpy as np
import pandas as pd
(rows, columns) = df1.shape
for i in range(rows):
    for j in range(columns):
        if pd.isna(df1.iloc[i, j]):
            df1.iloc[i, j] = df2.iloc[i, j]
If all df1 missing values have a corresponding value in df2, that should work.
This solution assumes the missing values in df1 are genuine np.nan values; if they are stored as strings or some other placeholder, pd.isna will not detect them.

Reliable way of dropping rows in df1 which are also in df2

I have a scenario where I have an existing dataframe and I have a new dataframe which contains rows which might be in the existing frame but might also have new rows. I have struggled to find a reliable way to drop these existing rows from the new dataframe by comparing it with the existing dataframe.
I've done my homework. The solution seems to be to use isin(). However, I find that this has hidden dangers. In particular:
pandas get rows which are NOT in other dataframe
Pandas cannot compute isin with a duplicate axis
Pandas promotes int to float when filtering
Is there a way to reliably filter out rows from one dataframe based on membership/containment in another dataframe? A simple use case which doesn't capture corner cases is shown below. Note that I want to remove rows in new that are in existing, so that new only contains rows not in existing. The simpler problem of updating existing with rows from new can be achieved with pd.merge() + DataFrame.drop_duplicates(), as sketched below.
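For reference, a minimal sketch of that simpler update problem, using concat rather than merge and with toy stand-in frames:
import pandas as pd
existing = pd.DataFrame({'col1': [1, 2], 'col2': [10, 11]})
new = pd.DataFrame({'col1': [2, 3], 'col2': [11, 12]})
# Append, then drop exact-duplicate rows, keeping the existing copy.
updated = pd.concat([existing, new]).drop_duplicates().reset_index(drop=True)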
In [53]: df1 = pd.DataFrame(data = {'col1' : [1, 2, 3, 4, 5], 'col2' : [10, 11, 12, 13, 14]})
...: df2 = pd.DataFrame(data = {'col1' : [1, 2, 3], 'col2' : [10, 11, 12]})
In [54]: df1
Out[54]:
col1 col2
0 1 10
1 2 11
2 3 12
3 4 13
4 5 14
In [55]: df2
Out[55]:
col1 col2
0 1 10
1 2 11
2 3 12
In [56]: df1[~df1.isin(df2)]
Out[56]:
col1 col2
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 4.0 13.0
4 5.0 14.0
In [57]: df1[~df1.isin(df2)].dropna()
Out[57]:
col1 col2
3 4.0 13.0
4 5.0 14.0
We can use DataFrame.merge with indicator=True, plus DataFrame.query and DataFrame.drop:
df_filtered = (df1.merge(df2, how='outer', indicator=True)
                  .query("_merge == 'left_only'")
                  .drop('_merge', axis=1))
print(df_filtered)
col1 col2
3 4 13
4 5 14
If, for example, we now change a value in row 0:
df1.iat[0, 0] = 3
row 0 is no longer filtered out:
df_filtered = (df1.merge(df2, how='outer', indicator=True)
                  .query("_merge == 'left_only'")
                  .drop('_merge', axis=1))
print(df_filtered)
col1 col2
0 3 10
3 4 13
4 5 14
Step by step:
df_filtered = df1.merge(df2, how='outer', indicator=True)
print(df_filtered)
col1 col2 _merge
0 3 10 left_only
1 2 11 both
2 3 12 both
3 4 13 left_only
4 5 14 left_only
5 1 10 right_only
df_filtered = (df1.merge(df2, how='outer', indicator=True)
                  .query("_merge == 'left_only'"))
print(df_filtered)
col1 col2 _merge
0 3 10 left_only
3 4 13 left_only
4 5 14 left_only
df_filtered = (df1.merge(df2, how='outer', indicator=True)
                  .query("_merge == 'left_only'")
                  .drop('_merge', axis=1))
print(df_filtered)
col1 col2
0 3 10
3 4 13
4 5 14
You may try Series.isin. It is independent of the index, i.e. it only checks values. You just need to convert the columns of each dataframe to a Series of tuples to create the mask:
s1 = df1.agg(tuple, axis=1)
s2 = df2.agg(tuple, axis=1)
df1[~s1.isin(s2)]
Out[538]:
col1 col2
3 4 13
4 5 14
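If only some columns should define row identity, you can build the tuples from that subset; a sketch assuming col1 and col2 are the relevant columns:
cols = ['col1', 'col2']  # columns that define row identity
s1 = df1[cols].agg(tuple, axis=1)
s2 = df2[cols].agg(tuple, axis=1)
df1[~s1.isin(s2)]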

Merge small file into big file and give NaNs for the rows that do not match in Python

I would like to merge two data frames: a big one and a small one. Examples of the data frames follow:
# small data frame construction
>>> d1 = {'col1': ['A', 'B'], 'col2': [3, 4]}
>>> df1 = pd.DataFrame(data=d1)
>>> df1
col1 col2
0 A 3
1 B 4
# big data frame construction
>>> d2 = {'col1': ['A', 'B', 'C', 'D', 'E'], 'col2': [3, 4, 6, 7, 8]}
>>> df2 = pd.DataFrame(data=d2)
>>> df2
col1 col2
0 A 3
1 B 4
2 C 6
3 D 7
4 E 8
The code I am looking for should produce the following output (a data frame with the big data frame's shape and column names, and NaNs in the rows that were not matched by the small data frame):
col1 col2
0 A 3
1 B 4
2 NA NA
3 NA NA
4 NA NA
The code I have tried:
>>> print(pd.merge(df1, df2, left_index=True, right_index=True, how='right', sort=False))
col1_x col2_x col1_y col2_y
0 A 3.0 A 3
1 B 4.0 B 4
2 NaN NaN C 6
3 NaN NaN D 7
4 NaN NaN E 8
You can pass the suffixes parameter so the added columns get a trailing _, and then remove those added columns with Series.str.endswith, inverting the mask with ~ and selecting via boolean indexing with loc:
df = pd.merge(df1, df2,
              left_index=True,
              right_index=True,
              how='right',
              sort=False,
              suffixes=('', '_'))
print (df)
col1 col2 col1_ col2_
0 A 3.0 A 3
1 B 4.0 B 4
2 NaN NaN C 6
3 NaN NaN D 7
4 NaN NaN E 8
df = df.loc[:, ~df.columns.str.endswith('_')]
print (df)
col1 col2
0 A 3.0
1 B 4.0
2 NaN NaN
3 NaN NaN
4 NaN NaN
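Since the desired output is simply df1 stretched to df2's row labels, a shorter alternative (assuming the frames align on the default integer index, as in this example) is reindex:
# Rows missing from df1 become NaN automatically.
df = df1.reindex(df2.index)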

Replace values based on another dataframe

There are two dataframes with the same columns and index, and the order of the columns is the same. I call them tableA and tableB.
tableA = pd.DataFrame({'col1':[np.NaN,1,2],'col2':[2,3,np.NaN]})
tableB = pd.DataFrame({'col1':[2,4,2],'col2':[2,3,5]})
tableA               tableB
   col1  col2           col1  col2
0   NaN     2        0     2     2
1     1     3        1     4     3
2     2   NaN        2     2     5
I want to replace values in tableB with 'NA' wherever the value at the same position in tableA is NaN.
For now, I use a loop to do it column by column.
for n in range(tableB.shape[1]):
    tableB.iloc[:, n] = tableB.iloc[:, n].where(pd.isnull(tableA.iloc[:, n]) == False, 'NA')
tableB
col1 col2
0 NA 2
1 4 3
2 2 NA
Is there another way to do it without a loop? I have tried replace, but it only changes the first column.
tableB.replace(pd.isnull(tableA), 'NA', inplace=True)  # only adjusts the first column
Thanks for your help!
I think you need where or numpy.where:
1.
df = tableB.where(tableA.notnull())
print (df)
col1 col2
0 NaN 2.0
1 4.0 3.0
2 2.0 NaN
2.
df = pd.DataFrame(np.where(tableA.notnull(), tableB, np.nan),
                  columns=tableB.columns,
                  index=tableB.index)
print (df)
col1 col2
0 NaN 2.0
1 4.0 3.0
2 2.0 NaN
You could use mask:
In [7]: tableB.mask(tableA.isnull())
Out[7]:
col1 col2
0 NaN 2.0
1 4.0 3.0
2 2.0 NaN
Or assign in place with boolean indexing:
tableB[tableA.isnull()] = np.nan
