Pandas replace columns by merging another dataframe - python

I have a dataframe df1 that looks like this:
id A B
0 1 10 5
1 1 11 6
2 2 10 7
3 2 11 8
And another dataframe df2:
id A
0 1 3
1 2 4
Now I want to replace A column in df1 with the value of A in df2 based on id, so the result should look like this:
id A B
0 1 3 5
1 1 3 6
2 2 4 7
3 2 4 8
One way is to drop column A from df1 first and then merge df2 into df1 on id, like df1 = df1.drop(['A'], axis=1).merge(df2, how='left', on='id'), but if df2 has around 10 columns this gets unwieldy. Is there a more elegant way to do this?

Here is one way to do it, using DataFrame.update. It aligns the two frames on their index, so id has to be set as the index of both first:
df1.set_index('id', inplace=True)
df2.set_index('id', inplace=True)
df1.update(df2)
df1['A'] = df1['A'].astype(int) # value by default was of type float
df1.reset_index()
id A B
0 1 3 5
1 1 3 6
2 2 4 7
3 2 4 8

Merge just the id column of df1 with df2, and then use combine_first to fill in the remaining columns from the original DataFrame:
df = df1[['id']].merge(df2).combine_first(df1)
print(df)
Output:
A B id
0 3 5 1
1 3 6 1
2 4 7 2
3 4 8 2
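For the case where df2 carries many replacement columns, the drop-and-merge idea from the question can be generalized so that the columns to drop are computed rather than listed by hand. This is only a sketch under the assumption that id is the single key column and that every non-key column of df2 should overwrite the column of the same name in df1; the variable name shared is made up for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({"id": [1, 1, 2, 2], "A": [10, 11, 10, 11], "B": [5, 6, 7, 8]})
df2 = pd.DataFrame({"id": [1, 2], "A": [3, 4]})

# Non-key columns of df2 that also exist in df1: these get replaced.
shared = df2.columns.difference(["id"]).intersection(df1.columns)

result = (
    df1.drop(columns=shared)            # remove the stale columns
       .merge(df2, how="left", on="id")  # bring in the replacements
       [df1.columns]                     # restore df1's original column order
)
print(result)
```

Selecting df1.columns at the end keeps the original column order, but it would also discard any column that exists only in df2, so drop that step if such columns should be kept.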

Related

How to set column headers to the first row in Pandas dataframe?

How do I set the column header of a dataframe to the first row of a dataframe and reset the column names?
# Creation of dataframe
df = pd.DataFrame({"A": ["1", "4", "7"],
"B": ["2", "5", "8"],
"C": ['3','6','9']})
# df:
A B C
0 1 2 3
1 4 5 6
2 7 8 9
Desired Outcome:
0 1 2
0 A B C
1 1 2 3
2 4 5 6
3 7 8 9
Use concat with Index.to_frame and a transpose to build a one-row DataFrame from the column names, then set the column names with range as a last step:
df = pd.concat([df.columns.to_frame().T, df], ignore_index=True)
df.columns = range(len(df.columns))
print (df)
0 1 2
0 A B C
1 1 2 3
2 4 5 6
3 7 8 9
Or use DataFrame.set_axis for a chained-method solution:
df = (pd.concat([df.columns.to_frame().T, df], ignore_index=True)
.set_axis(range(len(df.columns)), axis=1))
What you want to do is similar to reset_index but on the other axis. Unfortunately, there is no axis parameter in reset_index.
But, you can cheat a bit and apply a double transposition to handle the columns as index temporarily:
df.T.reset_index().T.reset_index(drop=True)
output:
0 1 2
0 A B C
1 1 2 3
2 4 5 6
3 7 8 9
You can use np.vstack on a list of column names and the DataFrame to create an array with one extra row; then cast it into pd.DataFrame:
out = pd.DataFrame(np.vstack([df.columns, df]))
Output:
0 1 2
0 A B C
1 1 2 3
2 4 5 6
3 7 8 9
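One thing worth seeing in action is what happens when the data are numeric rather than strings as in the question: once the labels are pushed into the first row, each column mixes text and numbers and is no longer an integer column. The sketch below wraps the concat approach in a helper; the name header_to_first_row is hypothetical and not part of pandas.

```python
import pandas as pd

def header_to_first_row(df):
    """Hypothetical helper: move the column labels into row 0."""
    header = df.columns.to_frame().T             # one-row frame holding the labels
    out = pd.concat([header, df], ignore_index=True)
    out.columns = range(out.shape[1])            # replace labels with 0..n-1
    return out

df = pd.DataFrame({"A": [1, 4, 7], "B": [2, 5, 8]})  # numeric data this time
out = header_to_first_row(df)
print(out)
print(out.dtypes)  # the columns no longer have an integer dtype
```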

Add all column values repeated of one data frame to other in pandas

Having two data frames:
df1 = pd.DataFrame({'a':[1,2,3],'b':[4,5,6]})
a b
0 1 4
1 2 5
2 3 6
df2 = pd.DataFrame({'c':[7],'d':[8]})
c d
0 7 8
The goal is to add all df2 column values to df1, repeated and create the following result. It is assumed that both data frames do not share any column names.
a b c d
0 1 4 7 8
1 2 5 7 8
2 3 6 7 8
If the column names are strings, you can use DataFrame.assign and unpack the Series created by selecting the first row of df2:
df = df1.assign(**df2.iloc[0])
print (df)
a b c d
0 1 4 7 8
1 2 5 7 8
2 3 6 7 8
Another idea is to repeat the values along df1.index with DataFrame.reindex and then use DataFrame.join (this works here because the first index value of df2 is the same as the first index value of df1.index):
df = df1.join(df2.reindex(df1.index, method='ffill'))
print (df)
a b c d
0 1 4 7 8
1 2 5 7 8
2 3 6 7 8
If the original df1 has no missing values, you can forward-fill the missing values as a last step, but note that the dtypes then change to float (thanks to Dishin H Goyan):
df = df1.join(df2).ffill()
print (df)
a b c d
0 1 4 7.0 8.0
1 2 5 7.0 8.0
2 3 6 7.0 8.0
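A further option, not mentioned in the answers above, is a cross join, which pairs every row of df1 with every row of df2. This is a sketch that assumes pandas 1.2 or newer, where merge accepts how="cross"; with a single-row df2 it repeats that row for each row of df1.

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
df2 = pd.DataFrame({"c": [7], "d": [8]})

# Cartesian product of the rows: 3 rows x 1 row = 3 rows.
result = df1.merge(df2, how="cross")
print(result)
```

Unlike the join-and-ffill variant, no missing values are introduced along the way, so the integer columns stay integers. If df2 had more than one row, the result would contain every combination of rows, which may or may not be what is wanted.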

Select Columns of a DataFrame based on another DataFrame

I am trying to select a subset of a DataFrame based on the columns of another DataFrame.
The DataFrames look like this:
a b c d
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
3 12 13 14 15
a b
0 0 1
1 2 3
2 4 5
3 6 7
4 8 9
I want to get all rows of the first Dataframe for the columns which are included in both DataFrames. My result should look like this:
a b
0 0 1
1 4 5
2 8 9
3 12 13
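You can use pd.Index.intersection. The & operator used to be syntactic sugar for it, but that set-operation behavior was deprecated and then removed in pandas 2.0, so call the method explicitly:
intersection_cols = df1.columns.intersection(df2.columns)
res = df1[intersection_cols]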
import pandas as pd
data1=[[0,1,2,3,],[4,5,6,7],[8,9,10,11],[12,13,14,15]]
data2=[[0,1],[2,3],[4,5],[6,7],[8,9]]
df1 = pd.DataFrame(data=data1,columns=['a','b','c','d'])
df2 = pd.DataFrame(data=data2,columns=['a','b'])
df1[df1.columns.intersection(df2.columns)]
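Another way to express the same selection is DataFrame.filter with its items argument, which keeps only the requested labels that actually exist in the frame. The sketch below uses the data from the question; the extra label "z" is made up purely to show that a label missing from df1 is skipped rather than raising a KeyError.

```python
import pandas as pd

df1 = pd.DataFrame(
    [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]],
    columns=["a", "b", "c", "d"],
)
df2 = pd.DataFrame([[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]], columns=["a", "b"])

# Keep the columns of df1 whose labels also appear in df2.
res = df1.filter(items=df2.columns)
print(res)

# A label that df1 does not have ("z", hypothetical) is silently ignored.
partial = df1.filter(items=["a", "z"])
print(partial)
```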

Pandas merge on aggregated columns

Let's say I create a DataFrame:
import pandas as pd
df = pd.DataFrame({"a": [1,2,3,13,15], "b": [4,5,6,6,6], "c": ["wish", "you","were", "here", "here"]})
Like so:
a b c
0 1 4 wish
1 2 5 you
2 3 6 were
3 13 6 here
4 15 6 here
... and then group and aggregate by a couple columns ...
gb = df.groupby(['b','c']).agg({"a": lambda x: x.nunique()})
Yielding the following result:
a
b c
4 wish 1
5 you 1
6 here 2
were 1
Is it possible to merge df with the newly aggregated table gb such that I create a new column in df, containing the corresponding values from gb? Like this:
a b c nc
0 1 4 wish 1
1 2 5 you 1
2 3 6 were 1
3 13 6 here 2
4 15 6 here 2
I tried doing the simplest thing:
df.merge(gb, on=['b','c'])
But this gives the error:
KeyError: 'b'
Which makes sense because the grouped table has a Multi-index and b is not a column. So my question is two-fold:
Can I transform the multi-index of the gb DataFrame back into columns (so that it has the b and c column)?
Can I merge df with gb on the column names?
Whenever you want to add an aggregated column from a groupby operation back to the df, you should use transform; it produces a Series whose index is aligned with your original df:
In [4]:
df['nc'] = df.groupby(['b','c'])['a'].transform(pd.Series.nunique)
df
Out[4]:
a b c nc
0 1 4 wish 1
1 2 5 you 1
2 3 6 were 1
3 13 6 here 2
4 15 6 here 2
There is no need to reset the index or perform an additional merge.
There's a simple way of doing this using reset_index().
df.merge(gb.reset_index(), on=['b','c'])
gives you
a_x b c a_y
0 1 4 wish 1
1 2 5 you 1
2 3 6 were 1
3 13 6 here 2
4 15 6 here 2
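If a merge is still preferred, for example because the aggregated table is needed on its own, the a_x / a_y suffixes can be avoided by naming the aggregated column up front. This sketch assumes pandas 0.25 or newer for the named-aggregation syntax; counts and via_transform are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1, 2, 3, 13, 15],
    "b": [4, 5, 6, 6, 6],
    "c": ["wish", "you", "were", "here", "here"],
})

# as_index=False keeps b and c as ordinary columns, so no reset_index is needed;
# the named aggregation calls the result "nc" instead of reusing the name "a".
counts = df.groupby(["b", "c"], as_index=False).agg(nc=("a", "nunique"))

merged = df.merge(counts, on=["b", "c"], how="left")
print(merged)

# The transform route gives the same column without any merge.
via_transform = df.groupby(["b", "c"])["a"].transform("nunique")
```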

Why is the DataFrame generated by a merge operation 3x5 rather than 3x3?

I followed the instructions at http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html, but I am confused when the merge columns do not have the same label, for example when column 0 in d3 corresponds to column 1 in d4 (as in the merge below).
In [92]: d4
Out[92]:
0 1
0 9 1
1 11 3
2 1 2
In [93]: d3
Out[93]:
0 1
0 2 3
1 1 9
2 3 9
In [94]: d3.merge(d4, how='left', left_on=0, right_on=1)
Out[94]:
0 0_x 1_x 0_y 1_y
0 2 2 3 1 2
1 1 1 9 9 1
2 3 3 9 11 3
I think the result should be
0 1 2
0 2 3 1
1 1 9 9
2 3 9 11
Edit 1:
Why does the following merge create exactly a 3x3 DataFrame, while the former creates a 3x5 DataFrame?
In [164]: d1
Out[164]:
0 1
0 1 10
1 2 5
2 3 7
In [165]: d2
Out[165]:
0 1
0 1 5
1 2 6
2 3 8
In [162]: d1.merge(d2, on=[0])
Out[162]:
0 1_x 1_y
0 1 10 5
1 2 5 6
2 3 7 8
In your first merge you are merging the lhs on column '0' and the rhs on column '1'. Because the two key columns have different names, merge keeps both of them instead of collapsing them into one, and because both frames use the labels '0' and '1', the clashing columns get the _x and _y suffixes.
In the second example you merge on column '0' in both frames, so a single key column is kept; however, you still have a clash of column names for '1', so it has to create the additional columns with suffixes.
I think your confusion stems from the expectation that, because you've specified the columns to merge on, it will use these columns like an index and match the other columns against these rows. It will not. It will only do this if you set these columns as the index:
In [23]:
merged = d4.set_index(keys=[1]).merge(d3.set_index(keys=[0]), left_index=True, right_index=True, how='left')
merged.index.names=['2']
merged.reset_index()
Out[23]:
2 0 1
0 1 9 9
1 3 11 9
2 2 1 3
[3 rows x 3 columns]
So I'm setting the index on these columns and setting the left_index and right_index params to True.
However, we then have to recover the index as a column. The first issue is that the index name clashes with an existing column name, so we rename it.
We can then call reset_index to recover the values.
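Another way to reach the 3x3 result the question asks for is to relabel the right-hand frame before merging, so that the key column shares a name with the left-hand key and the remaining column gets a fresh label. This is a sketch rather than the approach used above; the new label 2 is chosen only to match the desired output in the question.

```python
import pandas as pd

d3 = pd.DataFrame([[2, 3], [1, 9], [3, 9]])   # columns 0 and 1
d4 = pd.DataFrame([[9, 1], [11, 3], [1, 2]])  # columns 0 and 1

# In d4, column 1 is the key and column 0 is the payload:
# relabel the key to 0 (to match d3) and the payload to 2 (to avoid a clash).
right = d4.rename(columns={1: 0, 0: 2})

result = d3.merge(right, how="left", on=0)
print(result)
```

Since the key now has the same label on both sides, only one key column is kept, and since no other labels clash, no suffixes are added.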
