Merge with overwrite values from left dataframe in pandas - python

I have two DataFrames C and D as follows:
C
    A  B
0  AB  1
1  CD  2
2  EF  3

D
    A  B
1  CD  4
2  GH  5
I have to merge the two dataframes so that values from the right dataframe overwrite the matching values in the left one. The rest of the rows should not change.
Output
    A  B
0  AB  1
1  CD  4
2  EF  3
3  GH  5
The order of the rows must not change, i.e. CD should remain at index 1. I tried an outer merge, which handles the index but duplicates the columns instead of overwriting them:
>>> pd.merge(C, D, how='outer', on='A')
    A  B_x  B_y
0  AB  1.0  NaN
1  CD  2.0  4.0
2  EF  3.0  NaN
3  GH  NaN  5.0
Basically, B_y should replace the values in B_x (only where values exist).
I am using Python 3.7.
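For reference, the example frames from the question can be reproduced with:

```python
import pandas as pd

# C is the left frame, D the right one (D's rows sit at index 1 and 2,
# matching the display above)
C = pd.DataFrame({'A': ['AB', 'CD', 'EF'], 'B': [1, 2, 3]})
D = pd.DataFrame({'A': ['CD', 'GH'], 'B': [4, 5]}, index=[1, 2])
```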

You will have to replace the rows in place to override the values; this differs from a plain drop_duplicates, which would change the ordering of the rows.
combine_dfs takes pkey as an argument: the main column on which the merge should happen.
import pandas as pd

def update_df_row(row=None, col_name="", df=pd.DataFrame(), pkey=""):
    # Return the matching row from df, or the original row if there is no match.
    try:
        match_index = df.loc[df[pkey] == col_name].index[0]
        row = df.loc[match_index]
    except IndexError:
        pass
    return row

def combine_dfs(parent_df, child_df, pkey):
    # Overwrite matching rows in parent_df with rows from child_df,
    # then append the remaining new rows from child_df.
    filtered_child_df = child_df[child_df[pkey].isin(parent_df[pkey])]
    parent_df[parent_df[pkey].isin(child_df[pkey])] = parent_df[
        parent_df[pkey].isin(child_df[pkey])].apply(
        lambda row: update_df_row(row, row[pkey], filtered_child_df, pkey), axis=1)
    parent_df = pd.concat([parent_df, child_df]).drop_duplicates([pkey])
    return parent_df.reset_index(drop=True)
The output of the above code snippet will be:
    A  B
0  AB  1
1  CD  4
2  EF  3
3  GH  5

Use:
df = pd.merge(C, D, how='outer', on='A', suffixes=('_', ''))
# filter the suffixed column names
new_cols = df.columns[df.columns.str.endswith('_')]
# remove the last char to recover the original names
orig_cols = new_cols.str[:-1]
# dictionary for rename
d = dict(zip(new_cols, orig_cols))
# replace NaNs in the columns by values from the appended suffixed columns
df[orig_cols] = df[orig_cols].combine_first(df[new_cols].rename(columns=d))
# remove the appended columns
df = df.drop(new_cols, axis=1)
print (df)
    A    B
0  AB  1.0
1  CD  4.0
2  EF  3.0
3  GH  5.0
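The same result can also be sketched with a single groupby, as an alternative to the merge above (a hedged sketch, not part of the answer): rows from D are concatenated after C, so taking the last value per key keeps D's updates, and sort=False preserves first-appearance order.

```python
import pandas as pd

C = pd.DataFrame({'A': ['AB', 'CD', 'EF'], 'B': [1, 2, 3]})
D = pd.DataFrame({'A': ['CD', 'GH'], 'B': [4, 5]})

# .last() keeps D's value for duplicated keys (CD);
# sort=False keeps the order in which each key first appears
out = pd.concat([C, D]).groupby('A', as_index=False, sort=False).last()
```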

If it's acceptable to assume that column A is in alphabetical order:
C = pd.DataFrame({"A": ["AB", "CD", "EF"], "B": [1, 2, 3]})
D = pd.DataFrame({"A": ["CD", "GH"], "B": [4, 5]})
df_merge = pd.concat([C,D]).drop_duplicates('A', keep='last').sort_values(by=['A']).reset_index(drop=True)
df_merge
    A  B
0  AB  1
1  CD  4
2  EF  3
3  GH  5
Edit
This will do the job if the order in which each category appears in the original dataframes is most important:
C = pd.DataFrame({"A": ["AB", "CD", "EF"], "B": [1, 2, 3]})
D = pd.DataFrame({"A": ["CD", "GH"], "B": [4, 5]})
df_merge = pd.concat([C,D]).drop_duplicates('A', keep='last')
df_merge['A'] = pd.Categorical(df_merge['A'], pd.concat([C.A, D.A]).drop_duplicates())
df_merge.sort_values(by=['A'], inplace=True)
df_merge.reset_index(drop=True, inplace=True)
df_merge

You can use update; note that it aligns on the index. In your case it would be:
C.update(D)
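Since update aligns on the index rather than on column A, for the question's data A has to become the index first, and the keys that only exist in D still need to be appended. A minimal sketch of that idea (variable names are illustrative):

```python
import pandas as pd

C = pd.DataFrame({'A': ['AB', 'CD', 'EF'], 'B': [1, 2, 3]})
D = pd.DataFrame({'A': ['CD', 'GH'], 'B': [4, 5]})

c = C.set_index('A')
d = D.set_index('A')
c.update(d)  # overwrite overlapping keys (CD) in place
# append keys that only exist in D (GH), then restore 'A' as a column
out = pd.concat([c, d[~d.index.isin(c.index)]]).reset_index()
```

This keeps the original row order and appends GH at the end, matching the expected output.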

Related

openpyxl concat three dataframes horizontally

I am using openpyxl to edit three dataframes, df1, df2, df3 (if necessary, we can also regard them as three independent Excel files):
import pandas as pd
data1 = [[1, 1],[1,1]]
df1 = pd.DataFrame(data1, index = ['I1a','I1b'], columns=['v1a', 'v1b'])
df1.index.name='I1'
data2 = [[2, 2,2,2],[2,2,2,2],[2,2,2,2],[2,2,2,2]]
df2 = pd.DataFrame(data2, index = ['I2a','I2b','I2c','I2d'], columns=['v2a','v2b','v2c','v2d'])
df2.index.name='I2'
data3 = [['a', 'b',3,3],['a','c',3,3],['b','c',3,3],['c','d',3,3]]
df3 = pd.DataFrame(data3, columns=['v3a','v3b','v3c','v3d'])
df3 = df3.groupby(['v3a','v3b']).first()
Here df3 has a MultiIndex. How can I concat them into one Excel file horizontally (each dataframe starting at the same line), as follows:
Here we regard the index as a column, and for the MultiIndex we keep the first level hidden.
Update
IIUC:
>>> pd.concat([df1.reset_index(), df2.reset_index(), df3.reset_index()], axis=1)
    I1  v1a  v1b   I2  v2a  v2b  v2c  v2d v3a v3b  v3c  v3d
0  I1a  1.0  1.0  I2a    2    2    2    2   a   b    3    3
1  I1b  1.0  1.0  I2b    2    2    2    2   a   c    3    3
2  NaN  NaN  NaN  I2c    2    2    2    2   b   c    3    3
3  NaN  NaN  NaN  I2d    2    2    2    2   c   d    3    3
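A self-contained run of the concat above, rebuilding the three frames from the question:

```python
import pandas as pd

df1 = pd.DataFrame([[1, 1], [1, 1]], index=['I1a', 'I1b'], columns=['v1a', 'v1b'])
df1.index.name = 'I1'
df2 = pd.DataFrame([[2] * 4] * 4, index=['I2a', 'I2b', 'I2c', 'I2d'],
                   columns=['v2a', 'v2b', 'v2c', 'v2d'])
df2.index.name = 'I2'
df3 = pd.DataFrame([['a', 'b', 3, 3], ['a', 'c', 3, 3],
                    ['b', 'c', 3, 3], ['c', 'd', 3, 3]],
                   columns=['v3a', 'v3b', 'v3c', 'v3d']).groupby(['v3a', 'v3b']).first()

# reset_index turns each index into a regular column so the frames
# line up row-by-row when concatenated horizontally
out = pd.concat([df1.reset_index(), df2.reset_index(), df3.reset_index()], axis=1)
```

From there, out.to_excel(...) writes everything as one sheet; hiding the first MultiIndex level as in the screenshots would need extra openpyxl styling.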
Old answer
Assuming you know the start row, you can use pandas to remove extra columns:
import pandas as pd
df = pd.read_excel('input.xlsx', header=0, skiprows=3).dropna(how='all', axis=1)
df.to_excel('output.xlsx', index=False)

How to subtract two dataframe columns in a dask dataframe?

I have two dataframes, df1 and df2.
In df1 I have 4 columns (A, B, C, D) and two rows; in df2 I have the same 4 columns and two rows.
Now I want to subtract the two dataframes, like df1['A'] - df2['A'] and so on, but I don't know how to do it.
Just do the subtraction, but keep the indexes in mind. For example, say df1 and df2 have the same columns but different indexes:
import numpy as np
import pandas as pd
import dask.dataframe as dd

df1 = dd.from_array(np.arange(8).reshape(2, 4), columns=['A', 'B', 'C', 'D'])
df2 = dd.from_pandas(pd.DataFrame(
    np.arange(8).reshape(2, 4),
    columns=['A', 'B', 'C', 'D'],
    index=[1, 2]
), npartitions=1)
Then:
(df1 - df2).compute()
# A B C D
# 0 NaN NaN NaN NaN
# 1 4.0 4.0 4.0 4.0
# 2 NaN NaN NaN NaN
On the other hand, let's match df2's index to df1's and subtract:
df2 = df2.assign(idx=1)
df2 = df2.set_index(df2.idx.cumsum() - 1)
df2 = df2.drop(columns=['idx'])
(df1 - df2).compute()
# A B C D
# 0 0 0 0 0
# 1 0 0 0 0
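Since dask mirrors pandas alignment semantics, the same positional trick can be sketched in plain pandas by resetting df2's index (an illustrative sketch, not the dask code above):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(8).reshape(2, 4), columns=['A', 'B', 'C', 'D'])
df2 = pd.DataFrame(np.arange(8).reshape(2, 4), columns=['A', 'B', 'C', 'D'],
                   index=[1, 2])

# align by position rather than by index label
diff = df1 - df2.reset_index(drop=True)
```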

Outer merge between pandas dataframes, imputing NA from the preceding row

I have two dataframes containing the same columns:
df1 = pd.DataFrame({'a': [1, 2, 3, 4, 5],
                    'b': [2, 3, 4, 5, 6]})
df2 = pd.DataFrame({'a': [1, 3, 4],
                    'b': [2, 4, 5]})
I want df2 to have the same number of rows as df1. Any values of a present in df1 but missing from df2 should be added, with the corresponding value of b taken from the preceding row.
In other words, I want df2 to look like this:
df2 = pd.DataFrame({'a': [1, 2, 3, 4, 5],
                    'b': [2, 2, 4, 5, 5]})
EDIT: I'm looking for an answer that is independent of the number of columns
Use DataFrame.merge with only column a from df1, then forward fill the missing values:
df = df1[['a']].merge(df2, how='left', on='a').ffill()
print (df)
   a    b
0  1  2.0
1  2  2.0
2  3  4.0
3  4  5.0
4  5  5.0
Or use merge_asof:
df = pd.merge_asof(df1[['a']], df2, on='a')
print (df)
   a  b
0  1  2
1  2  2
2  3  4
3  4  5
4  5  5
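A self-contained check that both approaches give the question's expected result (note that merge_asof requires the key column a to be sorted):

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [2, 3, 4, 5, 6]})
df2 = pd.DataFrame({'a': [1, 3, 4], 'b': [2, 4, 5]})

# left-merge on 'a', then forward fill the gaps
left = df1[['a']].merge(df2, how='left', on='a').ffill()
# merge_asof matches each 'a' to the last df2 row with a key <= that value
asof = pd.merge_asof(df1[['a']], df2, on='a')
```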

Adding a new column to a pandas DataFrame when the new column is longer than the index

I'm having trouble adding a new column to a pandas dataframe when the new column is longer than the index.
The data looks like this:
import pandas as pd
df = pd.DataFrame(
    {
        "bar": ["A", "B", "C"],
        "zoo": [1, 2, 3],
    })
So, you see, the length of this df's index is 3.
Next I want to add a new column, in one of the two ways below:
df["new_col"] = [1,2,3,4]
This raises an error: Length of values does not match length of index.
Or:
df["new_col"] = pd.Series([1,2,3,4])
This silently keeps only the values [1, 2, 3] in df (the new column cannot extend past the existing index).
Now, what I want is this:
Is there a better way?
Looking forward to your answers, thanks!
Use DataFrame.join with a renamed Series and a right join:
#if not default index
#df = df.reset_index(drop=True)
df = df.join(pd.Series([1,2,3,4]).rename('new_col'), how='right')
print (df)
   bar  zoo  new_col
0    A  1.0        1
1    B  2.0        2
2    C  3.0        3
3  NaN  NaN        4
Another idea is to reindex df by the new Series' index:
s = pd.Series([1,2,3,4])
df = df.reindex(s.index)
df["new_col"] = s
print (df)
   bar  zoo  new_col
0    A  1.0        1
1    B  2.0        2
2    C  3.0        3
3  NaN  NaN        4
Or the same with assign:
s = pd.Series([1,2,3,4])
df = df.reindex(s.index).assign(new_col=s)
Another option is pd.concat along axis=1:
df = pd.DataFrame(
    {
        "bar": ["A", "B", "C"],
        "zoo": [1, 2, 3],
    })
new_col = pd.Series([1,2,3,4])
df = pd.concat([df, new_col], axis=1)
print(df)
   bar  zoo  0
0    A  1.0  1
1    B  2.0  2
2    C  3.0  3
3  NaN  NaN  4
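Naming the Series avoids the 0 column header in the concat output above (a small variation on the same idea):

```python
import pandas as pd

df = pd.DataFrame({"bar": ["A", "B", "C"], "zoo": [1, 2, 3]})
new_col = pd.Series([1, 2, 3, 4], name='new_col')

# concat aligns on the index, extending df with a NaN row for index 3
df = pd.concat([df, new_col], axis=1)
```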

How can I join two dataframes with update in some rows, using Pandas?

I'm new to pandas and I would like to know how I can join two files and update existing rows, taking a specific column into account. The files have thousands of rows. For example:
Df_1:
A  B  C  D
1  2  5  4
2  2  6  8
9  2  2  1
Now, my table 2 has exactly the same columns. I want to join the two tables, replacing rows that appear in both tables but where column C was changed/updated, and adding the new rows that only exist in the second table (df_2). For example:
Df_2:
A  B  C  D
2  2  7  8
9  2  3  1
3  4  6  7
1  2  3  4
So the result I want is the union of the two tables, with a few rows updated in that specific column, like this:
Df_result:
A  B  C  D
1  2  5  4
2  2  7  8
9  2  3  1
3  4  6  7
1  2  3  4
How can I do this with the merge or concatenate function? Or is there another way to get the result I want?
Thank you!
You need at least one column as a reference key, i.e. a way to know which rows need to change in the update.
Assume that in your case it is "A" and "B".
import pandas as pd
ref = ['A','B']
df_result = pd.concat([df_1, df_2], ignore_index = True)
df_result = df_result.drop_duplicates(subset=ref, keep='last')
Here is a complete example.
d = {'col1': [1, 2, 3], 'col2': ["a", "b", "c"], 'col3': ["aa", "bb", "cc"]}
df1 = pd.DataFrame(data=d)
d = {'col1': [1, 4, 5], 'col2': ["a", "d", "f"], 'col3': ["dd","ee", "ff"]}
df2 = pd.DataFrame(data=d)
df_result = pd.concat([df1, df2], ignore_index=True)
df_result = df_result.drop_duplicates(subset=['col1','col2'], keep='last')
df_result
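Running the example above, the (1, 'a') row from df1 is replaced by df2's version and the new keys are appended:

```python
import pandas as pd

d1 = {'col1': [1, 2, 3], 'col2': ["a", "b", "c"], 'col3': ["aa", "bb", "cc"]}
df1 = pd.DataFrame(data=d1)
d2 = {'col1': [1, 4, 5], 'col2': ["a", "d", "f"], 'col3': ["dd", "ee", "ff"]}
df2 = pd.DataFrame(data=d2)

df_result = pd.concat([df1, df2], ignore_index=True)
# keep='last' keeps df2's row for the duplicated key (1, 'a')
df_result = df_result.drop_duplicates(subset=['col1', 'col2'], keep='last')
```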
