Merge dataframes based on column values with duplicated rows - python

I want to merge two dataframes based on equal column values. The problem is that one of my columns has duplicated row values, which cannot be dropped since they are correlated with other columns. Here's an example of my two dataframes:
Essentially, I want to merge these two dataframes based on equal values of the FromPatchID (df1) and Id (df2) columns, in order to get something like this:
FromPatchID ToPatchID ... Id MMM LB
1 1 ... 1 26.67 27.67
1 2 ... 1 26.67 27.67
1 3 ... 1 26.67 27.67
2 1 ... 2 26.50 27.50
3 1 ... 3 26.63 27.63
I already tried a simple merge with df_merged = pd.merge(df1, df2, on=['FromPatchID','Id']), but I got a KeyError telling me to check for duplicates in the FromPatchID column.

The key columns have different names in the two frames, so pass them with left_on and right_on instead of on (on expects every listed column to exist in both frames). Also specify how='right' to use only keys from the right frame.
df_merged = pd.merge(df1, df2, left_on='FromPatchID', right_on='Id', how='right')
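A minimal sketch of that call, using small stand-in frames rebuilt from the expected output in the question (the original frames are not shown, so the construction below is an assumption). Because Id is unique in df2, every duplicated FromPatchID row simply picks up the same MMM and LB values:

```python
import pandas as pd

# Stand-in data reconstructed from the expected output (assumed, not the asker's real frames)
df1 = pd.DataFrame({"FromPatchID": [1, 1, 1, 2, 3],
                    "ToPatchID":   [1, 2, 3, 1, 1]})
df2 = pd.DataFrame({"Id":  [1, 2, 3],
                    "MMM": [26.67, 26.50, 26.63],
                    "LB":  [27.67, 27.50, 27.63]})

# The key columns are named differently, so each side is named explicitly
df_merged = pd.merge(df1, df2, left_on="FromPatchID", right_on="Id", how="right")
print(df_merged)
```

The duplicated keys in df1 are not a problem here; a many-to-one merge just repeats the matching df2 row.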

Related

How to use Python Pandas to sort data frame to match files and invoice amounts

I have a data frame with 4 columns. I need columns 1 and 2 (new_df_1 and bill_df_1) to stay unchanged. I want to reorder columns 3 and 4 (new_File_Number_Data and new_invoice_total) so that they line up with columns 1 and 2, and where there is no match, fill in 'missing'.
new_df_1 bill_df_1 new_File_Number_Data new_invoice_total
0 1-08912-000218-033 25.0 1-08915-000041-054 134.50
1 1-08915-000041-054 163.0 001-0464-01589-061 148.50
2 001-0464-01589-061 166.7 004-3001-00080-532 54.00
3 004-3001-00080-532 74.0 missing missing
You can't sort only some columns of a dataframe and not others. It sounds like you need to separate the columns into two different dataframes and then merge them so that they are matched as you want. You can then fill the missing values with the string 'missing'. For example:
df1 = df[['new_df_1', 'bill_df_1']]
df2 = df[['new_File_Number_Data', 'new_invoice_total']]
new_df = pd.merge(df1, df2, how='left', left_on='new_df_1', right_on='new_File_Number_Data').fillna('missing')
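Here is a rough, self-contained version of the same idea run on the sample rows from the question. It assumes the 'missing' cells in the sample are really empty values (None/NaN) before filling, and the names left, right and aligned are just illustrative:

```python
import pandas as pd

# Sample rows from the question; the padding row is assumed to be empty (None)
df = pd.DataFrame({
    "new_df_1": ["1-08912-000218-033", "1-08915-000041-054",
                 "001-0464-01589-061", "004-3001-00080-532"],
    "bill_df_1": [25.0, 163.0, 166.7, 74.0],
    "new_File_Number_Data": ["1-08915-000041-054", "001-0464-01589-061",
                             "004-3001-00080-532", None],
    "new_invoice_total": [134.50, 148.50, 54.00, None],
})

left = df[["new_df_1", "bill_df_1"]]
# Drop the empty padding row so it cannot take part in the match
right = df[["new_File_Number_Data", "new_invoice_total"]].dropna()

# A left merge keeps every row of the fixed columns, in their original order
aligned = left.merge(right, how="left",
                     left_on="new_df_1", right_on="new_File_Number_Data")
aligned = aligned.fillna("missing")
print(aligned)
```

Note that filling a numeric column with the string 'missing' turns it into an object column, so it is best done as the last step.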

merging two dataframes while moving column positions [duplicate]

I have a dataframe called df1 that is:
0
103773708 68.50
103773718 57.01
103773730 30.80
103773739 67.62
I have another one called df2 that is:
0
103773739 37.02
103773708 30.25
103773730 15.50
103773718 60.54
105496332 20.00
I'm wondering how I would get them to combine to end up looking like df3:
0 1
103773708 30.25 68.50
103773718 60.54 57.01
103773730 15.50 30.80
103773739 37.02 67.62
105496332 20.00 00.00
As you can see sometimes the index position is not the same, so it has to append the data to the same index. The goal is to append column 0 from df1, into df2 while pushing column 0 in df2 over one.
result = df2.join(df1.rename(columns={0:1})).fillna(0)
Simply merge on index, and then relabel the columns:
df = pd.merge(df2, df1, left_index=True, right_index=True, how='outer')
df.columns = [0,1]
df = df.fillna(0)
df1.columns = ['1'] # Rename the column from '0' to '1'. I assume names as strings.
df=df2.join(df1).fillna(0) # Join by default is LEFT
df
0 1
103773739 37.02 67.62
103773708 30.25 68.50
103773730 15.50 30.80
103773718 60.54 57.01
105496332 20.00 0.00
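For reference, a small runnable sketch of the index-aligned join using the numbers from the question. The frames are rebuilt by hand here, with integer column labels assumed, and the final sort_index() is only there to reproduce the row order of df3:

```python
import pandas as pd

df1 = pd.DataFrame({0: [68.50, 57.01, 30.80, 67.62]},
                   index=[103773708, 103773718, 103773730, 103773739])
df2 = pd.DataFrame({0: [37.02, 30.25, 15.50, 60.54, 20.00]},
                   index=[103773739, 103773708, 103773730, 103773718, 105496332])

# Rename df1's column to 1 so the labels do not clash, then align on the index.
# Starting from df2 keeps the row that only exists in df2 (105496332).
df3 = df2.join(df1.rename(columns={0: 1}), how="left").fillna(0).sort_index()
print(df3)
```

Rows are matched by index label, not by position, so the different ordering of the two frames does not matter.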

join two pandas dataframe using a specific column

I am new to pandas and I am trying to join two dataframes based on the equality of one specific column. For example, suppose that I have the following:
df1
A B C
1 2 3
2 2 2
df2
A B C
5 6 7
2 8 9
Both dataframes have the same columns and the value of only one column (say A) might be equal. What I want as output is this:
df3
A B C B C
2 8 9 2 2
The values for column 'A' are unique in both dataframes.
Thanks
pd.concat([df1.set_index('A'),df2.set_index('A')], axis=1, join='inner')
If you wish to maintain column A as a non-index, then:
pd.concat([df1.set_index('A'),df2.set_index('A')], axis=1, join='inner').reset_index()
Alternatively, you could just do:
df3 = df1.merge(df2, on='A', how='inner', suffixes=('_1', '_2'))
The suffixes then let you keep track of which frame each value came from.
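Both approaches side by side on the example frames from the question, as a quick sketch (the names via_concat and via_merge are just illustrative):

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2], "B": [2, 2], "C": [3, 2]})
df2 = pd.DataFrame({"A": [5, 2], "B": [6, 8], "C": [7, 9]})

# concat on the shared index: keeps the original column names, so B and C repeat
via_concat = pd.concat([df1.set_index("A"), df2.set_index("A")],
                       axis=1, join="inner").reset_index()

# merge on the column: duplicate names get a suffix per source frame
via_merge = df1.merge(df2, on="A", how="inner", suffixes=("_1", "_2"))

print(via_concat)
print(via_merge)
```

Both keep only A == 2, the single value present in both frames. Swap df1 and df2 if you want the df2 values first, as in the output shown in the question.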

Join dataframe with different indices

Please consider the following dataframe with daily dates as its index:
date_range = pd.date_range(start_date, end_date)
df1 = pd.DataFrame(index=date_range, columns=['A', 'B'])
Now I have a second dataframe df2, where df2.index is a subset of df1.index.
I want to join the data from df2 into df1, and for the missing indices I want to have NaN.
In a second step I want to replace the NaN with the last available data, like this:
2004-03-28 5
2004-03-30 NaN
2004-03-31 NaN
2004-04-01 7
should become
2004-03-28 5
2004-03-30 5
2004-03-31 5
2004-04-01 7
many thanks for your help
Assuming that you have a common index and just a single column that is named the same in both dataframes, first merge:
df1 = df1.merge(df2, how='left')
Now fill the missing values using ffill, which means forward fill:
df1 = df1.ffill()
In the situation where the columns are not named the same you can either rename the columns:
df2.rename(columns={'old_name':'new_name'}, inplace=True)
or specify the columns from both the left and right hand side to merge with:
df1.merge(df2, left_on='left_col', right_on='right_col', how='left')
To match on the index instead of on columns, set left_index=True and right_index=True.
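Since the question is about joining on a date index, here is a small sketch of the index-based route with a forward fill. The dates and the column name value are made up for illustration, and df1 is left without columns to keep the example short:

```python
import pandas as pd

# Daily index for df1; df2 only has data for the first and last day (assumed sample data)
dates = pd.date_range("2004-03-28", "2004-04-01")
df1 = pd.DataFrame(index=dates)
df2 = pd.DataFrame({"value": [5, 7]}, index=dates[[0, -1]])

# Left join on the index keeps every date of df1; dates absent from df2 become NaN
joined = df1.join(df2, how="left")

# Forward fill carries the last available value into the gaps
filled = joined.ffill()
print(filled)
```

DataFrame.join matches on the index by default, which is why no key columns are needed here.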

Combining DataFrames without Nans

I have two dataframes. One maps values to IDs. The other one has multiple entries of these IDs. I want a dataframe that keeps the rows of the second one, with the values from the first assigned to the respective IDs.
df1 =
Val1 Val2 Val3
x 1000 2 0
y 2000 3 9
z 3000 1 8
df2=
foo ID bar
0 something y a
1 nothing y b
2 everything x c
3 who z d
result=
foo ID bar Val1 Val2 Val3
0 something y a 2000 3 9
1 nothing y b 2000 3 9
2 everything x c 1000 2 0
3 who z d 3000 1 8
I've tried merge and join (obviously incorrectly) but I am getting a bunch of NaNs when I do that. It appears that I am getting NaNs on every alternate ID.
I have also tried indexing both DFs by ID but that didn't seem to help either. I am obviously missing something that I am guessing is a core functionality but I can't get my head around it.
merge and join could both get you the result DataFrame you want. Since one of your DataFrames is indexed (by ID) and the other has just an integer index, merge is the logical choice.
Merge:
# use ID as the column to join on in df2 and the index of df1
result = df2.merge(df1, left_on="ID", right_index=True, how="inner")
Join:
df2.set_index("ID", inplace=True) # index df2 in place so you can use join, which merges by index by default
result = df2.join(df1, how="inner") # join df1 by index
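A self-contained run of the merge variant on the frames from the question, as a sketch. It uses how="left" rather than "inner" so that every row of df2 is kept in its original order; that is a choice made here, not something the question requires:

```python
import pandas as pd

df1 = pd.DataFrame({"Val1": [1000, 2000, 3000],
                    "Val2": [2, 3, 1],
                    "Val3": [0, 9, 8]},
                   index=["x", "y", "z"])
df2 = pd.DataFrame({"foo": ["something", "nothing", "everything", "who"],
                    "ID":  ["y", "y", "x", "z"],
                    "bar": ["a", "b", "c", "d"]})

# Match df2's ID column against df1's index; repeated IDs reuse the same df1 row
result = df2.merge(df1, left_on="ID", right_index=True, how="left")
print(result)
```

If NaNs still show up with a left merge, the usual cause is that the key values do not match exactly (for example stray whitespace or a different dtype). That is only a guess here, since the original data is not shown.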
