Set dataframe column using values from matching indices in another dataframe - python

I would like to set values in col2 of DF1 using the value held at the matching index of col2 in DF2:
DF1:
col1 col2
index
0 a
1 b
2 c
3 d
4 e
5 f
DF2:
col1 col2
index
2 a x
3 d y
5 f z
DF3:
col1 col2
index
0 a NaN
1 b NaN
2 c x
3 d y
4 e NaN
5 f z
If I just try and set DF1['col2'] = DF2['col2'] then col2 comes out as all NaN values in DF3 - I take it this is because the indices are different. However when I try and use map() to do something like:
DF1.index.to_series().map(DF2['col2'])
then I still get the same NaN column, but I thought it would map the values over where the index matches...
What am I not getting?

You need join or assign:
df = df1.join(df2['col2'])
print (df)
col1 col2
index
0 a NaN
1 b NaN
2 c x
3 d y
4 e NaN
5 f z
Or:
df1 = df1.assign(col2=df2['col2'])
#same as
#df1['col2'] = df2['col2']
print (df1)
col1 col2
index
0 a NaN
1 b NaN
2 c x
3 d y
4 e NaN
5 f z
If nothing matches and all values are NaN, check whether the indices have the same dtype in both DataFrames:
print (df1.index.dtype)
print (df2.index.dtype)
If not, then use astype:
df1.index = df1.index.astype(int)
df2.index = df2.index.astype(int)
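To illustrate the dtype point, here is a minimal sketch with made-up data in which df2's index holds the labels as strings. This is an assumption for the demo only; the question does not show the actual dtypes.

```python
import pandas as pd

# Hypothetical data: df1 has an integer index, df2 has the "same" labels as strings.
df1 = pd.DataFrame({"col1": list("abcdef")}, index=pd.Index(range(6), name="index"))
df2 = pd.DataFrame(
    {"col1": ["a", "d", "f"], "col2": ["x", "y", "z"]},
    index=pd.Index(["2", "3", "5"], name="index"),
)

# The labels 2 and "2" are different, so nothing aligns and every value is NaN.
mismatched = df1.assign(col2=df2["col2"])
print(mismatched["col2"].isna().all())

# After casting the string index to int, the labels align as expected.
df2.index = df2.index.astype(int)
aligned = df1.assign(col2=df2["col2"])
print(aligned)
```

If the first print shows True on your own data, a dtype mismatch between the two indices is a likely cause.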
Bad solution, because combine_first also takes col1 from df2 (check index 2):
df = df2.combine_first(df1)
print (df)
col1 col2
index
0 a NaN
1 b NaN
2 a x
3 d y
4 e NaN
5 f z

You can simply use concat, since you are combining on the index:
df = pd.concat([df1['col1'], df2['col2']],axis = 1)
col1 col2
index
0 a NaN
1 b NaN
2 c x
3 d y
4 e NaN
5 f z

Related

Imputing values into a dataframe based on another dataframe and a condition

Suppose I have the following dataframes:
df1 = pd.DataFrame({'col1':['a','b','c','d'],'col2':[1,2,3,4]})
df2 = pd.DataFrame({'col3':['a','x','a','c','b']})
How can I look up the values of df2 in df1 and create a new column in df2 holding the matching values from col2? Where there is no match, I want to impute 0. The result should look like the following:
col3 col4
0 a 1
1 x 0
2 a 1
3 c 3
4 b 2
Use Series.map with Series.fillna:
df2['col2'] = df2['col3'].map(df1.set_index('col1')['col2']).fillna(0).astype(int)
print (df2)
col3 col2
0 a 1
1 x 0
2 a 1
3 c 3
4 b 2
Or use DataFrame.merge, which is better if you need to append multiple columns:
df = df2.merge(df1.rename(columns={'col1':'col3'}), how='left').fillna(0)
print (df)
col3 col2
0 a 1.0
1 x 0.0
2 a 1.0
3 c 3.0
4 b 2.0
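As a runnable sketch of the map approach, the snippet below writes the result to col4, the column name used in the question's desired output. It assumes the values in col1 are unique, since duplicate keys would make the lookup ambiguous.

```python
import pandas as pd

df1 = pd.DataFrame({"col1": ["a", "b", "c", "d"], "col2": [1, 2, 3, 4]})
df2 = pd.DataFrame({"col3": ["a", "x", "a", "c", "b"]})

# Build a lookup Series: index = key (col1), values = col2.
lookup = df1.set_index("col1")["col2"]

# map() yields NaN for keys missing from the lookup ("x"), which makes the
# result float; fillna(0) followed by astype(int) restores integers.
df2["col4"] = df2["col3"].map(lookup).fillna(0).astype(int)
print(df2)
```

The same fillna(0).astype(int) step can be applied to the merged column if integer output is wanted from the merge variant too.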

Replace NA values with the corresponding values from another DataFrame

How can I replace NA values in df1
df1:
ID col1 col2 col3 col4
A NaN NaN NaN NaN
B 0 0 1 2
C NaN NaN NaN NaN
with the values at the same positions in the other dataframe (so existing non-NaN values are not overwritten)
df2:
ID col1 col2 col3 col4
A 1 2 1 11
B 2 2 4 8
C 0 0 NaN NaN
So result is
ID col1 col2 col3 col4
A 1 2 1 11
B 0 0 1 2
C 0 0 NaN NaN
If I understand correctly, and ID is the index in both DataFrames, use:
df = df1.fillna(df2)
Or:
df = df1.combine_first(df2)
print (df)
col1 col2 col3 col4
ID
A 1.0 2.0 1.0 11.0
B 0.0 0.0 1.0 2.0
C 0.0 0.0 NaN NaN
If ID is a column:
df = df1.set_index('ID').fillna(df2.set_index('ID'))
#alternative
#df = df1.set_index('ID').combine_first(df2.set_index('ID'))
import pandas as pd
rows, columns = df1.shape
for i in range(rows):
    for j in range(columns):
        # NaN never compares equal to itself, so test with pd.isna()
        if pd.isna(df1.iloc[i, j]):
            df1.iloc[i, j] = df2.iloc[i, j]
This works as long as df1 and df2 have the same shape and the same row and column order, because cells are matched by position.
It also assumes that the missing values in df1 are real NaN values; if they are stored as strings such as 'NaN', they will not be detected.
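One detail worth spelling out about detecting missing cells: NaN does not compare equal to anything, including itself, so an equality test never finds it. The sketch below shows this and a vectorized equivalent of the cell-by-cell loop. The frames are made-up stand-ins for the question's data, with ID assumed to be the index.

```python
import numpy as np
import pandas as pd

# Hypothetical frames shaped like the question's data, with ID as the index.
ids = pd.Index(["A", "B", "C"], name="ID")
df1 = pd.DataFrame(
    {"col1": [np.nan, 0, np.nan], "col2": [np.nan, 0, np.nan], "col3": [np.nan, 1, np.nan]},
    index=ids,
)
df2 = pd.DataFrame(
    {"col1": [1, 2, 0], "col2": [2, 2, 0], "col3": [1, 4, np.nan]},
    index=ids,
)

# NaN never compares equal to anything, including itself.
nan_equals_nan = np.nan == np.nan
print(nan_equals_nan)

# Keep df1 where it has a value, otherwise take the value from df2.
filled = df1.where(df1.notna(), df2)
print(filled)
```

Cells that are missing in both frames (col3 for ID C here) stay NaN.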

how to assign to slice of slice in pandas

I have a pandas dataframe df as shown.
col1 col2
0 NaN a
1 2 b
2 NaN c
3 NaN d
4 5 e
5 6 f
I want to find the first NaN value in col1 and assign a new value to it. I've tried both of the following methods, but neither works.
df.loc[df['col1'].isna(), 'col1'][0] = 1
df.loc[df['col1'].isna(), 'col1'].iloc[0] = 1
Neither shows an error or warning, but when I check the original dataframe, the value has not changed.
What is the correct way to do this?
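You can use .fillna() with the limit=1 parameter:
# assign the result back; inplace=True on a column selection may not update df in recent pandas versions
df['col1'] = df['col1'].fillna(1, limit=1)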
print(df)
Prints:
col1 col2
0 1.0 a
1 2.0 b
2 NaN c
3 NaN d
4 5.0 e
5 6.0 f
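As for why the attempts in the question leave df unchanged: df.loc[mask, 'col1'] most likely returns a new object, so the following [0] = 1 or .iloc[0] = 1 writes into that temporary rather than into df. A sketch of an alternative that does the whole assignment in one .loc call (the names mask and first_nan_label are made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"col1": [np.nan, 2, np.nan, np.nan, 5, 6], "col2": list("abcdef")}
)

mask = df["col1"].isna()
if mask.any():
    # idxmax() on a boolean Series returns the label of the first True.
    first_nan_label = mask.idxmax()
    # A single .loc call with row label and column name writes to df itself.
    df.loc[first_nan_label, "col1"] = 1
print(df)
```

The mask.any() guard matters because idxmax() would otherwise return the first label even when there is no NaN at all.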

Pandas divide two dataframe with different sizes

I have a dataframe df1 as:
col1 col2 Val1 Val2
A g 4 6
A d 3 8
B h 5 10
B p 7 14
I have another dataframe df2 as:
col1 Val1 Val2
A 2 3
B 1 4
I want to divide Val1 and Val2 in df1 by the corresponding columns in df2, matching on col1, so that row A from df2 divides both A rows of df1.
My expected output of df1.div(df2) is as follows:
col1 col2 Val1 Val2
A g 2 2
A d 1.5 2
B h 5 2.5
B p 7 3.5
Convert col1 and col2 to a MultiIndex, convert col1 in the second DataFrame to its index, and then use DataFrame.div:
df = df1.set_index(['col1', 'col2']).div(df2.set_index('col1')).reset_index()
#alternative that specifies the index level explicitly
#df = df1.set_index(['col1', 'col2']).div(df2.set_index('col1'), level=0).reset_index()
print (df)
col1 col2 Val1 Val2
0 A g 2.0 2.000000
1 A d 1.5 2.666667
2 B h 5.0 2.500000
3 B p 7.0 3.500000
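For reference, here is a self-contained sketch of the level-based variant, using the data from the question. The names left, right and result are chosen for illustration.

```python
import pandas as pd

df1 = pd.DataFrame(
    {"col1": ["A", "A", "B", "B"], "col2": ["g", "d", "h", "p"],
     "Val1": [4, 3, 5, 7], "Val2": [6, 8, 10, 14]}
)
df2 = pd.DataFrame({"col1": ["A", "B"], "Val1": [2, 1], "Val2": [3, 4]})

left = df1.set_index(["col1", "col2"])   # MultiIndex (col1, col2)
right = df2.set_index("col1")            # single-level index col1

# level=0 refers to col1: each row of `right` is broadcast across all rows
# of `left` that share the same col1 value.
result = left.div(right, level=0).reset_index()
print(result)
```

Moving col2 into the index also keeps the non-numeric column out of the division.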
I think there is a slight mistake in your example: for column Val2, second row, 8/3 should be 2.67, not 2. So the final output of df1.div(df2) should be:
col1 col2 Val1 Val2
0 A g 2.0 2.000000
1 A d 1.5 2.666667
2 B h 5.0 2.500000
3 B p 7.0 3.500000
Anyway, here is a possible solution.
First, construct the two DataFrames:
import pandas as pd
df1 = pd.DataFrame(data={'col1':['A','A','B','B'], 'col2': ['g','d','h','p'], 'Val1': [4,3,5,7], 'Val2': [6,8,10,14]}, columns=['col1','col2','Val1','Val2'])
df2 = pd.DataFrame(data={'col1':['A','B'], 'Val1': [2,1], 'Val2': [3,4]}, columns=['col1','Val1','Val2'])
print (df1)
print (df2)
Output:
>>>
col1 col2 Val1 Val2
0 A g 4 6
1 A d 3 8
2 B h 5 10
3 B p 7 14
col1 Val1 Val2
0 A 2 3
1 B 1 4
Now we can do an INNER JOIN of df1 and df2 on col1 (if you are not familiar with SQL joins, have a look at this: sql-join). In pandas, joins are done with the merge() method:
## join df1, df2
merged_df = pd.merge(left=df1, right=df2, how='inner', on='col1')
print (merged_df)
Output:
>>>
col1 col2 Val1_x Val2_x Val1_y Val2_y
0 A g 4 6 2 3
1 A d 3 8 2 3
2 B h 5 10 1 4
3 B p 7 14 1 4
Now that we have got the corresponding columns of df1 and df2, we can simply compute the division and delete the redundant columns:
# Val1 = Val1_x/Val1_y, Val2 = Val2_x/Val2_y
merged_df['Val1'] = merged_df['Val1_x']/merged_df['Val1_y']
merged_df['Val2'] = merged_df['Val2_x']/merged_df['Val2_y']
# delete the cols: Val1_x,Val1_y,Val2_x,Val2_y
merged_df.drop(columns=['Val1_x', 'Val1_y', 'Val2_x', 'Val2_y'], inplace=True)
print (merged_df)
Final Output:
col1 col2 Val1 Val2
0 A g 2.0 2.000000
1 A d 1.5 2.666667
2 B h 5.0 2.500000
3 B p 7.0 3.500000
I hope this solves your question :)
You can use the pandas.merge() function to execute a database-like join between dataframes, then use the result to divide column values:
# merge against col1 so we get a merged index
merged = pd.merge(df1[["col1"]], df2)
df1[["Val1", "Val2"]] = df1[["Val1", "Val2"]].div(merged[["Val1", "Val2"]])
This produces:
col1 col2 Val1 Val2
0 A g 2.0 2.000000
1 A d 1.5 2.666667
2 B h 5.0 2.500000
3 B p 7.0 3.500000
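One caveat with this merge-then-divide approach: the merged frame gets a fresh 0..n-1 index, so the label-aligned div only lines up when df1 also has the default index. A hedged sketch of a positional variant follows. The non-default index on df1 is a made-up example, and it assumes col1 is unique in df2, so the left merge returns exactly one row per row of df1.

```python
import pandas as pd

# Hypothetical variant of df1 with a non-default index (10..13).
df1 = pd.DataFrame(
    {"col1": ["A", "A", "B", "B"], "col2": ["g", "d", "h", "p"],
     "Val1": [4, 3, 5, 7], "Val2": [6, 8, 10, 14]},
    index=[10, 11, 12, 13],
)
df2 = pd.DataFrame({"col1": ["A", "B"], "Val1": [2, 1], "Val2": [3, 4]})

cols = ["Val1", "Val2"]
# A left merge keeps the rows in df1's order, but the result is re-indexed 0..n-1.
merged = pd.merge(df1[["col1"]], df2, how="left", on="col1")

# to_numpy() drops the index, so the division is positional, not label-aligned.
df1[cols] = df1[cols].div(merged[cols].to_numpy())
print(df1)
```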

Merging with pandas while keeping NaNs at bottom

Let's say I have 3 dataframes, each with a single column. In each df, there are slightly more rows than in the previous one. For example:
df1 = col1
1 a
2 b
3 c
df2 = col2
1 x
2 y
3 z
4 w
5 q
df3 = col3
1 A
2 B
3 C
4 D
5 E
6 F
7 G
and I want to get exactly this:
res = col1 col2 col3
1 a x A
2 b y B
3 c z C
4 - w D
5 - q E
6 - - F
7 - - G
That is, I want the rows to stay in the order in which they are added, so NaNs (-) are kept in the bottom.
I tried this:
import pandas as pd
total = pd.DataFrame()
total = pd.merge(total,df1,how='outer',left_index=True,right_index=True)
total = pd.merge(total,df2,how='outer',left_index=True,right_index=True)
total = pd.merge(total,df3,how='outer',left_index=True,right_index=True)
but I keep getting the table in a seemingly random order. Stuff like:
res = col1 col2 col3
1 a x A
4 - w D
3 c z C
5 - q E
2 b y B
7 - - G
6 - - F
How can I force the final df to take the desired form?
Thanks!
Use concat and pass axis=1 to concatenate column-wise:
In [203]:
pd.concat([df1,df2,df3], axis=1)
Out[203]:
col1 col2 col3
1 a x A
2 b y B
3 c z C
4 NaN w D
5 NaN q E
6 NaN NaN F
7 NaN NaN G
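A self-contained version of the same call, with the three frames rebuilt from the question. It assumes integer indices starting at 1, as the question's listing suggests.

```python
import pandas as pd

df1 = pd.DataFrame({"col1": list("abc")}, index=[1, 2, 3])
df2 = pd.DataFrame({"col2": list("xyzwq")}, index=[1, 2, 3, 4, 5])
df3 = pd.DataFrame({"col3": list("ABCDEFG")}, index=[1, 2, 3, 4, 5, 6, 7])

# axis=1 places the frames side by side and aligns rows on the index;
# rows missing from a shorter frame are filled with NaN.
res = pd.concat([df1, df2, df3], axis=1)
print(res)
```

If the row order still comes out unexpected on other data, calling res.sort_index() afterwards is one way to restore index order.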
