Mapping a dictionary to NaN rows of a column in Pandas - python

Shown below is a dataframe in which column col2 contains many NaN values. I want to fill only those NaN values, using col1 as the key into the dictionary dict_map and mapping the corresponding values into col2.
Reproducible code:
import pandas as pd
import numpy as np
dict_map = {'a':45,'b':23,'c':97,'z': -1}
df = pd.DataFrame()
df['tag'] = [1,2,3,4,5,6,7,8,9,10,11]
df['col1'] = ['a','b','c','b','a','a','z','c','b','c','b']
df['col2'] = [np.nan,909,34,56,np.nan,45,np.nan,11,61,np.nan,np.nan]
df['_'] = df['col1'].map(dict_map)
Expected Output
One method is:
df['col3'] = np.where(df['col2'].isna(), df['_'], df['col2'])
df
I just wanted to know whether there is any other method, using a function and map, with which we can optimize this.

You can map col1 with your dict_map and then use that as the input to fillna, as follows:
df['col3'] = df['col2'].fillna(df['col1'].map(dict_map))
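If you prefer, combine_first does the same thing: it keeps the non-null values of col2 and falls back to the mapped values elsewhere. A minimal sketch using the sample frame above:
# keep col2 where present; fall back to the mapped col1 values otherwise
df['col3'] = df['col2'].combine_first(df['col1'].map(dict_map))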

You can achieve the very same result using a list comprehension; it is a very Pythonic solution, and I believe it holds up well performance-wise.
We simply read col2 and copy its value to col3 if it's not NaN. If it is, we look at col1, use its value as the dict key, and take the corresponding value from dict_map instead.
df['col3'] = [df['col2'][idx] if not np.isnan(df['col2'][idx])
              else dict_map[df['col1'][idx]]
              for idx in df.index.tolist()]
Output:
df
tag col1 col2 col3
0 1 a NaN 45.0
1 2 b 909.0 909.0
2 3 c 34.0 34.0
3 4 b 56.0 56.0
4 5 a NaN 45.0
5 6 a 45.0 45.0
6 7 z NaN -1.0
7 8 c 11.0 11.0
8 9 b 61.0 61.0
9 10 c NaN 97.0
10 11 b NaN 23.0
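A slightly tidier version of the same comprehension avoids the repeated positional lookups by zipping the two columns directly (same logic, just a sketch):
df['col3'] = [v if not np.isnan(v) else dict_map[k] for k, v in zip(df['col1'], df['col2'])]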

Related

output NaN value when using apply function to a dataframe with index

I am trying to use the apply function to create 2 new columns. When the dataframe has an index it doesn't work; the new columns have values of NaN. If the dataframe has no index, then it works. Could you please help? Thanks
def calc_test(row):
    a = row['col1'] + row['col2']
    b = row['col1'] / row['col2']
    return (a, b)
df_test_dict={'col1':[1,2,3,4,5],'col2':[10,20,30,40,50]}
df_test=pd.DataFrame(df_test_dict)
df_test.index=['a1','b1','c1','d1','e1']
df_test
col1 col2
a1 1 10
b1 2 20
c1 3 30
d1 4 40
e1 5 50
Now when I use the apply function, the newly created columns have values of NaN. Thanks for your help.
df_test[['a','b']] = pd.DataFrame(df_test.apply(lambda row:calc_test(row),axis=1).tolist())
df_test
col1 col2 a b
a1 1 10 NaN NaN
b1 2 20 NaN NaN
c1 3 30 NaN NaN
d1 4 40 NaN NaN
e1 5 50 NaN NaN
When using apply, you may use the result_type='expand' argument to expand the output of your function into columns of a pandas DataFrame:
df_test[['a','b']] = df_test.apply(lambda row: calc_test(row), axis=1, result_type='expand')
This returns:
col1 col2 a b
a1 1 10 11.0 0.1
b1 2 20 22.0 0.1
c1 3 30 33.0 0.1
d1 4 40 44.0 0.1
e1 5 50 55.0 0.1
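The reason this works is that with result_type='expand', apply returns a DataFrame that keeps the original index (a1 ... e1), so the column assignment aligns correctly. A quick way to see this:
expanded = df_test.apply(calc_test, axis=1, result_type='expand')
print(expanded.index.tolist())  # ['a1', 'b1', 'c1', 'd1', 'e1']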
You are wrapping the return of the apply in a DataFrame, which gets a default index of [0, 1, 2, 3, 4], none of which exist in your original DataFrame's index. You can see this by looking at the output of pd.DataFrame(df_test.apply(lambda row:calc_test(row),axis=1).tolist()).
Simply remove the pd.DataFrame() to fix this problem.
df_test[['a', 'b']] = df_test.apply(lambda row:calc_test(row),axis=1).tolist()
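Alternatively, if you want to keep the pd.DataFrame() wrapper, you can pass the original index explicitly so the assignment aligns (a sketch of the same fix):
df_test[['a', 'b']] = pd.DataFrame(df_test.apply(calc_test, axis=1).tolist(), index=df_test.index)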

How to complete NaN cells based on another Pandas dataframe in Python

I have the following 2 dataframes..
First dataframe df1:
import pandas as pd
import numpy as np
d1 = {'id': [1, 2, 3, 4], 'col1': [13, np.nan, 15, np.nan], 'col2': [23, np.nan, np.nan, np.nan]}
df1 = pd.DataFrame(data=d1)
df1
id col1 col2
0 1 13.0 23.0
1 2 NaN NaN
2 3 15.0 NaN
3 4 NaN NaN
And the second dataframe df2:
d2 = {'id': [2, 3, 4], 'col1': [ 14, 150, 16], 'col2': [24, 250, np.nan]}
df2 = pd.DataFrame(data=d2)
df2
id col1 col2
0 2 14 24.0
1 3 150 250.0
2 4 16 NaN
I need to replace the NaN fields in df1 with the non-NaN values from df2, where it is possible. But there are some conditions...
Condition 1) id column in each dataframe consists of unique values. When replacing any NaN value in df1 with another value from df2, the id column value needs to match.
Condition 2) Dataframes do not necessarily have the same size.
Condition 3) NaN values will only be looked for in col1 or col2 in any of the dataframes. The id column cannot be NaN in any row. There might be other columns in the dataframes, with or without NaN values. But for replacing the data, we will only be looking at col1 and col2 columns.
Condition 4) To trigger a replacement of a row in df1, it is enough that either col1 or col2 has a NaN value in that row. When a NaN value is detected in a row of df1, the entire row is replaced by the row with the same id value from df2, as long as all values of col1 and col2 in the corresponding row of df2 are non-NaN. In other words, if the row with the same id value in df2 has a NaN in either col1 or col2, do not replace any data in df1.
After doing this operation, the df1 should look like the following:
id col1 col2
0 1 13.0 23.0
1 2 14 24
2 3 150.0 250.0 # Note that the entire row is replaced!
3 4 NaN NaN # This row not replaced bcz col2 value is NaN in df2 for the same row
How can this be done in the most elegant way? Python offers a lot of functions that I may not be aware of, which may solve this problem in a few lines instead of requiring very complex logic.
You can drop the incomplete rows from df2, then update with concat and groupby, taking the first non-null value per id; because the rows from df2.dropna() come first in the concat, they take priority:
pd.concat([df2.dropna(), df1]).groupby('id', as_index=False).first()
Output:
id col1 col2
0 1 13.0 23.0
1 2 14.0 24.0
2 3 150.0 250.0
3 4 NaN NaN
Here is another way using fillna. Note that fillna fills cell by cell rather than replacing whole rows, so for id 3 the original col1 value of 15.0 is kept, which differs from the expected output above:
df1 = df1.set_index('id').fillna(df2.dropna().set_index('id')).reset_index()
output:
id col1 col2
0 1 13.0 23.0
1 2 14.0 24.0
2 3 15.0 250.0
3 4 NaN NaN
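If you need the strict whole-row behavior from Condition 4 (so that id 3 becomes 150.0 / 250.0), one explicit way is to align on id and overwrite entire rows only where df1 has any NaN and df2's row is complete. A sketch, with a, b, and rows_to_replace as illustrative temporary names:
a = df1.set_index('id')
b = df2.dropna().set_index('id')
# rows of df1 with any NaN whose id also appears (complete) in df2
rows_to_replace = a.index[a.isna().any(axis=1)].intersection(b.index)
a.loc[rows_to_replace] = b.loc[rows_to_replace]
df1 = a.reset_index()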

Check if values in one dataframe match values from another, updating dataframe

Let's say I have 2 dataframes, both with different lengths but the same number of columns:
df1 = pd.DataFrame({'country': ['Russia','Mexico','USA','Argentina','Denmark','Syngapore'],
                    'population': [41,12,26,64,123,24]})
df2 = pd.DataFrame({'country': ['Russia','Argentina','Australia','USA'],
                    'population': [44,12,23,64]})
Let's assume that some of the data in df1 is outdated, and I've received a new dataframe that contains some new data which may or may not already exist in the outdated dataframe.
I want to find out if any of the values of df2.country are inside df1.country
By doing the following I'm able to return a boolean:
df = df1.country.isin(df2.country)
print(df)
Unfortunately I'm just creating a new Series containing the answer to my question:
0 True
1 False
2 True
3 True
4 False
5 False
Name: country, dtype: bool
My goal here is to delete the rows of df1 which values match with df2 and add the new data, kind of like an update.
I've managed to come up with something like this:
df = df1.country.isin(df2.country)
i = 0
for x in df:
    if x:
        df1.drop(i, inplace=True)
    i += 1
frames = [df1, df2]
df1 = pd.concat(frames)
df1.reset_index(drop=True, inplace=True)
print(df1)
which in fact works and updates the dataframe
country population
0 Mexico 12
1 Denmark 123
2 Syngapore 24
3 Russia 44
4 Argentina 12
5 Australia 23
6 USA 64
But I really believe there's a better way of doing the same thing that is quicker and much more practical, considering that the real dataframe is much bigger and updates every few seconds.
I'd love to hear some suggestions, Thanks!
Assuming col1 remains unique in the original dataframe, you can join the two tables together. Once you have them in the same dataframe, you can apply your logic, i.e. update the value from the new dataframe if it is not null. You actually don't need to check whether col2 has changed for every entry in col1; you can just take the new col2 value as long as it is not NaN (based on your sample output).
df1 = pd.DataFrame({'col1': ['a','f','r','g','d','s'], 'col2': [41,12,26,64,123,24]})
df2 = pd.DataFrame({'col1': ['a','g','o','r'], 'col2': [44,12,23,64]})
# do the join
x = pd.merge(df1, df2, how='outer', left_on="col1", right_on="col1")
col1 col2_x col2_y
0 a 41.0 44.0
1 f 12.0 NaN
2 r 26.0 64.0
3 g 64.0 12.0
4 d 123.0 NaN
5 s 24.0 NaN
6 o NaN 23.0
# apply your update rules
x['col2_x'] = np.where(~x['col2_y'].isnull(), x['col2_y'], x['col2_x'])
col1 col2_x col2_y
0 a 44.0 44.0
1 f 12.0 NaN
2 r 64.0 64.0
3 g 12.0 12.0
4 d 123.0 NaN
5 s 24.0 NaN
6 o 23.0 23.0
#clean up
x.drop("col2_y", axis=1, inplace = True)
x.columns = ["col1", "col2"]
col1 col2
0 a 44.0
1 f 12.0
2 r 64.0
3 g 12.0
4 d 123.0
5 s 24.0
6 o 23.0
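As a side note, combine_first can collapse the merge, np.where, and cleanup into a single step, preferring df2's values wherever they exist. A sketch under the same uniqueness assumption (note the result comes back sorted by col1 rather than in the order above):
x = df2.set_index('col1').combine_first(df1.set_index('col1')).reset_index()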
The isin approach is so close! Simply use the results from isin as a mask, then concat the rows from df1 that are not in (~) df2 with the rest of df2:
m = df1['country'].isin(df2['country'])
df3 = pd.concat((df1[~m], df2), ignore_index=True)
df3:
country population
0 Mexico 12
1 Denmark 123
2 Syngapore 24
3 Russia 44
4 Argentina 12
5 Australia 23
6 USA 64
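The same update can also be written with drop_duplicates, putting df2's rows last so they win on conflicts (a sketch):
df3 = pd.concat([df1, df2]).drop_duplicates('country', keep='last', ignore_index=True)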

how to assign to slice of slice in pandas

I have a pandas dataframe df as shown.
col1 col2
0 NaN a
1 2 b
2 NaN c
3 NaN d
4 5 e
5 6 f
I want to find the first NaN value in col1 and assign a new value to it. I've tried both of the following methods, but neither of them works.
df.loc[df['col1'].isna(), 'col1'][0] = 1
df.loc[df['col1'].isna(), 'col1'].iloc[0] = 1
Neither of them shows any error or warning, but when I check the original dataframe, the value hasn't changed.
What is the correct way to do this?
You can use .fillna() with the limit=1 parameter:
df['col1'].fillna(1, limit=1, inplace=True)
print(df)
Prints:
col1 col2
0 1.0 a
1 2.0 b
2 NaN c
3 NaN d
4 5.0 e
5 6.0 f
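If you would rather locate the first NaN explicitly and assign through .loc (which avoids the chained-assignment trap in the question), something like this sketch works; the .any() guard matters because idxmax on an all-False mask would return the first label:
mask = df['col1'].isna()
if mask.any():
    # idxmax on a boolean Series returns the label of the first True
    df.loc[mask.idxmax(), 'col1'] = 1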

fillna() method in Pandas ignores inplace argument and returns an error when called on axis=1

I am experimenting with the fillna() method. I have created a small dataframe and two Series for that purpose:
col1 col2 col3 col4
0 NaN NaN 3 4
1 NaN NaN 7 8
2 9.0 10.0 11 12
n1 = pd.Series([10, 20])
n2 = pd.Series([30, 40, 50, 60])
n2.index = list(df.columns.values)
When I try the command:
df.fillna(n1, axis=0, inplace=True)
Nothing happens; the NaNs remain intact. I would expect to see them replaced with the values 10 (col1) and 20 (col2). When I try
df.fillna(n2, axis =1)
I get an error message:
NotImplementedError: Currently only can fill with dict/Series column by column
Could you explain this behavior? Your advice will be appreciated.
The default axis for fillna is 0. This translates to matching the columns of the frame with the index of the Series being passed. That means that filling in with n2 should be done on axis=0:
df.fillna(n2) # axis=0 is default
col1 col2 col3 col4
0 30.0 40.0 3 4
1 30.0 40.0 7 8
2 9.0 10.0 11 12
Doing this with inplace=True definitely works:
df.fillna(n2, inplace=True)
print(df)
col1 col2 col3 col4
0 30.0 40.0 3 4
1 30.0 40.0 7 8
2 9.0 10.0 11 12
df.fillna(n1, axis=1)
NotImplementedError: Currently only can fill with dict/Series column by column
Yeah! You're out of luck... sort of
option 1
transpose()
df.T.fillna(n1).T
col1 col2 col3 col4
0 10.0 10.0 3.0 4.0
1 20.0 20.0 7.0 8.0
2 9.0 10.0 11.0 12.0
option 2
use awkward pandas broadcasting
n1_ = pd.DataFrame([n1], index=df.columns).T
df.fillna(n1_)
Or inplace
df.fillna(n1_, inplace=True)
df
col1 col2 col3 col4
0 10.0 10.0 3 4
1 20.0 20.0 7 8
2 9.0 10.0 11 12
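If the single-row constructor broadcast above doesn't behave this way on your pandas version, you can build the same fill frame explicitly. This is a hypothetical equivalent, not part of the original answer:
# one column per df column, each equal to n1, so row 0 fills with 10 and row 1 with 20
n1_ = pd.DataFrame({c: n1 for c in df.columns})
df.fillna(n1_)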
You want to specify the column for which you will fill in the values. For example,
df['col1'].fillna(n1, inplace=True)
df
Out[17]:
col1 col2 col3 col4
0 10 NaN 3 4
1 20 NaN 7 8
2 9 10 11 12
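One caveat worth noting: under copy-on-write behavior in newer pandas versions, an inplace fillna on a selected column may no longer propagate to the parent DataFrame, so the assignment form is safer:
# copy-on-write-safe equivalent of the inplace call above
df['col1'] = df['col1'].fillna(n1)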
In the instance that you are filling in a single value instead, say 0, you can apply it to the DataFrame as you did above. Starting with the original DataFrame,
df.fillna(0, inplace=True)
df
Out[27]:
col1 col2 col3 col4
0 0 0 3 4
1 0 0 7 8
2 9 10 11 12
