Pandas left merge but overwrite with right data - python

I would like to merge two dataframes; df2 might have more columns and will always have exactly one row. I would like the data from the df2 row to overwrite the matching row in df, joining on column a.
import pandas as pd

df = pd.DataFrame({'a': {0: 0, 1: 1, 2: 2}, 'b': {0: 3, 1: 4, 2: 5}})
df2 = pd.DataFrame({'a': {0: 1}, 'b': {0: 90}, 'c': {0: 76}})
>>> df
   a  b
0  0  3
1  1  4
2  2  5
>>> df2
   a   b   c
0  1  90  76
The desired output:
   a   b    c
0  0   3  NaN
1  1  90   76
2  2   5  NaN
I have tried a left merge, but this creates two b columns (b_x and b_y):
>>> pd.merge(df, df2, how='left', on='a')
   a  b_x   b_y     c
0  0    3   NaN   NaN
1  1    4  90.0  76.0
2  2    5   NaN   NaN

You can use df.combine_first here:
df2.set_index("a").combine_first(df.set_index("a")).reset_index()
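combine_first aligns on the index, which is why both frames are set to index a first: values from df2 win wherever they exist, and everything else falls back to df. As a quick check (b is upcast to float because the alignment introduces NaN):

print(df2.set_index("a").combine_first(df.set_index("a")).reset_index())
   a     b     c
0  0   3.0   NaN
1  1  90.0  76.0
2  2   5.0   NaN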
Or with merge:
out = df.merge(df2, on=['a'], how='left')
right = out.loc[:, out.columns.str.endswith("_y")]
# take the right-hand value wherever it exists, else keep the left-hand one
out.loc[:, out.columns.str.endswith("_x")] = right.where(
    right.notna(), out.loc[:, out.columns.str.endswith("_x")].to_numpy()
).to_numpy()
# collapse b_x/b_y back into a single b column
# (grouping on axis=1 is deprecated in recent pandas versions)
out = out.groupby(out.columns.str.split("_").str[0], axis=1).first()
print(out)

   a     b     c
0  0   3.0   NaN
1  1  90.0  76.0
2  2   5.0   NaN


How can I merge two dataframes that have the same columns but different row values? [duplicate]

I'm trying to put together two dataframes that have the same columns and number of rows, but one of them has NaN in some rows where the other doesn't.
This example uses 2 DataFrames, but I have to do this with around 50 and merge them all into one.
DF1:
   id    b    c
0   1   15    1
1   2  nan  nan
2   3    2    3
3   4  nan  nan
DF2:
   id    b    c
0   1  nan  nan
1   2   26    6
2   3  nan  nan
3   4   60    3
Desired output:
   id   b  c
0   1  15  1
1   2  26  6
2   3   2  3
3   4  60  3
If you have
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.nan, index=[0, 1], columns=[0, 1])
df2 = pd.DataFrame([[0, np.nan]], index=[0, 1], columns=[0, 1])
df3 = pd.DataFrame([[np.nan, 1]], index=[0, 1], columns=[0, 1])
Then you can update df1 in place:
for df in [df2, df3]:
    df1.update(df)
print(df1)

     0    1
0  0.0  1.0
1  0.0  1.0
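The same idea scales to the ~50 dataframes from the question: fold them together one by one. A minimal sketch, assuming the frames share the same index and columns and are collected in a list dfs:

from functools import reduce

# dfs = [df1, df2, ..., df50], all with the same shape and labels;
# combine_first keeps existing values and fills the NaNs from the next frame
merged = reduce(lambda acc, nxt: acc.combine_first(nxt), dfs)

Unlike update, which mutates in place, combine_first returns a new frame, so the originals are left untouched.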

How to fill NAs with median of means of 2-column groupby in pandas?

Working with pandas, I have a dataframe with two hierarchies A and B, where B can be NaN, and I want to fill some NaNs in D in a particular way:
In the example below, A has "B-subgroups" where there are no values at all for D (e.g. (1, 1)), while A also has values for D in other subgroups (e.g. (1, 3)).
Now I want to get the mean of each subgroup (120, 90 and 75 for A==1), find the median of these means (90 for A==1) and use this median to fill NaNs in the other subgroups of A==1.
Groups like A==2, where there are only NaNs for D, should not be filled.
Groups like A==3, where there are some values for D but only rows with B being NaN have NaN in D, should not be filled if possible (I intend to fill these later with the mean of all values of D of their whole A groups).
Example df:
import numpy as np
import pandas as pd

d = {'A': [1, 1, 1, 1, 1, 1, 1, 2, 3, 3, 3],
     'B': [1, 2, 3, 3, 4, 5, 6, 1, 1, np.nan, np.nan],
     'D': [np.nan, np.nan, 120, 120, 90, 75, np.nan, np.nan, 60, 50, np.nan]}
df = pd.DataFrame(data=d)
A    B    D
1    1    NaN
1    2    NaN
1    3    120
1    3    120
1    4    90
1    5    75
1    6    NaN
2    1    NaN
3    1    60
3    NaN  50
3    NaN  NaN
Expected result:
A    B    D
1    1    90
1    2    90
1    3    120
1    3    120
1    4    90
1    5    75
1    6    90
2    1    NaN
3    1    60
3    NaN  50
3    NaN  NaN
With df.groupby(['A', 'B'])['D'].mean().groupby(['A']).agg('median') (or .median()) I seem to get the right values, but using
df['D'] = df['D'].fillna(
    df.groupby(['A', 'B'])['D'].mean().groupby(['A']).agg('median')
)
does not seem to change any values in D.
Any help is greatly appreciated, I've been stuck on this for a while and cannot find any solution anywhere.
Your first step is correct. The reason your fillna call does nothing is that fillna aligns on the index: your medians Series is indexed by A, while df['D'] uses the row index 0 through 10, so no labels ever match. We therefore use Series.map to map the correct median to each row through column A.
Finally we use np.where to conditionally fill in column D only where B is not NaN:
medians = df.groupby(['A', 'B'])['D'].mean().groupby(['A']).agg('median')

df['D'] = np.where(df['B'].notna(),                       # if B is not NaN
                   df['D'].fillna(df['A'].map(medians)),  # fill in the median
                   df['D'])                               # else keep column D

    A     B       D
0   1  1.00   90.00
1   1  2.00   90.00
2   1  3.00  120.00
3   1  3.00  120.00
4   1  4.00   90.00
5   1  5.00   75.00
6   1  6.00   90.00
7   2  1.00     nan
8   3  1.00   60.00
9   3   nan   50.00
10  3   nan     nan
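For the later step the asker mentions (filling rows like A == 3, B == NaN with the mean of D over the whole A group), a minimal sketch of one way to do it:

# fill the remaining NaNs with each A group's overall mean of D;
# all-NaN groups such as A == 2 have a NaN mean and therefore stay NaN
df['D'] = df['D'].fillna(df.groupby('A')['D'].transform('mean'))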

When a dataframe has duplicate columns, the fillna function does not seem to work correctly with the dict parameter

I find that after using pd.concat() to concatenate two dataframes that share a column name, df.fillna() no longer works correctly with the dict parameter specifying which value to use for each column.
I don't know why. Is something wrong with my understanding?
import numpy as np
import pandas as pd

a1 = pd.DataFrame({'a': [1, 2, 3]})
a2 = pd.DataFrame({'a': [1, 2, 3]})
b = pd.DataFrame({'b': [np.nan, 20, 30]})
c = pd.DataFrame({'c': [40, np.nan, 60]})

x = pd.concat([a1, a2, b, c], axis=1)
print(x)

x = x.fillna({'b': 10, 'c': 50})
print(x)
Initial dataframe:
   a  a     b     c
0  1  1   NaN  40.0
1  2  2  20.0   NaN
2  3  3  30.0  60.0
The data is unchanged after df.fillna():
   a  a     b     c
0  1  1   NaN  40.0
1  2  2  20.0   NaN
2  3  3  30.0  60.0
As mentioned in the comments, there's a problem assigning values to a dataframe in the presence of duplicate column names.
However, you can use this workaround:
for col, val in {'b': 10, 'c': 50}.items():
    new_col = x[col].fillna(val)
    idx = int(x.columns.get_loc(col))
    x = x.drop(col, axis=1)
    x.insert(loc=idx, column=col, value=new_col)
print(x)
result:
   a  a     b     c
0  1  1  10.0  40.0
1  2  2  20.0  50.0
2  3  3  30.0  60.0
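A shorter sketch of the same workaround fills by integer position instead, assuming the labels being filled (here b and c) are themselves unique:

for col, val in {'b': 10, 'c': 50}.items():
    pos = x.columns.get_loc(col)  # integer position of the unique label
    x.iloc[:, pos] = x.iloc[:, pos].fillna(val)
print(x)

Positional assignment with iloc is unaffected by the duplicate 'a' labels elsewhere in the frame.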

Python convert specific dataframe columns to integer

I have a dataframe of 8 columns and I would like to convert the last six columns to integer. The dataframe also contains NaN values, and I don't want to remove them.
       a  b     c     d     e     f    g    h
0   john  1   NaN   2.0   2.0  42.0  3.0  NaN
1  david  2  28.0  52.0  15.0   NaN  2.0  NaN
2  kevin  3   1.0   NaN   1.0  10.0  1.0  5.0
Any ideas?
Thank you.
Thanks to MaxU, I'm adding this option with NaN = -1.
Reason: NaN is a float value and can't coexist with integers in a plain integer column, so you either keep NaN and floats, or treat -1 as NaN:
http://pandas.pydata.org/pandas-docs/version/0.20/generated/pandas.to_numeric.html
import pandas as pd
import numpy as np

df = pd.DataFrame.from_dict({'a': {0: 'john', 1: 'david', 2: 'kevin'},
                             'b': {0: 1, 1: 2, 2: 3},
                             'c': {0: np.nan, 1: 28.0, 2: 1.0},
                             'd': {0: 2.0, 1: 52.0, 2: np.nan},
                             'e': {0: 2.0, 1: 15.0, 2: 1.0},
                             'f': {0: 42.0, 1: np.nan, 2: 10.0},
                             'g': {0: 3.0, 1: 2.0, 2: 1.0},
                             'h': {0: np.nan, 1: np.nan, 2: 5.0}})

df.iloc[:, -6:] = df.iloc[:, -6:].fillna(-1)
df.iloc[:, -6:] = df.iloc[:, -6:].apply(pd.to_numeric, downcast='integer')
df
       a  b   c   d   e   f  g  h
0   john  1  -1   2   2  42  3 -1
1  david  2  28  52  15  -1  2 -1
2  kevin  3   1  -1   1  10  1  5
Thanks @AntonvBR for the downcast='integer' hint:
In [29]: df.iloc[:, -6:] = df.iloc[:, -6:].apply(pd.to_numeric, errors='coerce', downcast='integer')

In [30]: df
Out[30]:
       a  b     c     d   e     f  g    h
0   john  1   NaN   2.0   2  42.0  3  NaN
1  david  2  28.0  52.0  15   NaN  2  NaN
2  kevin  3   1.0   NaN   1  10.0  1  5.0

In [31]: df.dtypes
Out[31]:
a     object
b      int64
c    float64
d    float64
e       int8
f    float64
g       int8
h    float64
dtype: object
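As a third option, assuming a reasonably recent pandas (1.0 or later): the nullable integer dtype Int64 can hold integers and missing values (<NA>) in the same column, so neither the -1 sentinel nor the float fallback is needed. A minimal sketch:

# convert the last six columns to the nullable Int64 extension dtype;
# NaN becomes pandas.NA and the integer values stay integers
df = df.astype({col: 'Int64' for col in df.columns[-6:]})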

filter DataFrame using dictionary

I am new to pandas and Python.
I want to use a dictionary to filter a DataFrame:
import pandas as pd
from pandas import DataFrame
df = DataFrame({'A': [1, 2, 3, 3, 3, 3], 'B': ['a', 'b', 'f', 'c', 'e', 'c'], 'D':[0,0,0,0,0,0]})
my_filter = {'A':[3], 'B':['c']}
When I call
df[df.isin(my_filter)]
I get
     A    B    D
0  NaN  NaN  NaN
1  NaN  NaN  NaN
2  3.0  NaN  NaN
3  3.0    c  NaN
4  3.0  NaN  NaN
5  3.0    c  NaN
What I want is:
     A  B  D
3  3.0  c  0
5  3.0  c  0
I don't want to add "D" to the dictionary; I want the rows that have the proper values in the A and B columns.
You can sum the True values across columns and then compare with 2:
print (df.isin(my_filter).sum(1) == 2)
0    False
1    False
2    False
3     True
4    False
5     True
dtype: bool

print (df[df.isin(my_filter).sum(1) == 2])
   A  B  D
3  3  c  0
5  3  c  0
Another solution: first select only columns A and B, then use all to check that both are True in each row:
print (df[df[['A','B']].isin(my_filter).all(1)])
   A  B  D
3  3  c  0
5  3  c  0
Thank you MaxU for a more flexible solution:
print (df[df.isin(my_filter).sum(1) == len(my_filter.keys())])
   A  B  D
3  3  c  0
5  3  c  0
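An equivalent sketch that builds the mask column by column, so only the dictionary's keys are ever tested (the per-column isin masks are combined with a logical AND):

import numpy as np

# one boolean mask per key in the dict, AND-ed together row-wise
mask = np.logical_and.reduce([df[col].isin(vals) for col, vals in my_filter.items()])
print(df[mask])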
