Combine two dataframes where column values match - python

I have two dataframes containing similar columns:
   ID prop1
1  UUU  &&&
2  III  ***
3  OOO  )))
4  PPP  %%%
and
   ID prop2
1  UUU  1234
2  WWW  4567
3  III  7890
5  EEE  0123
6  OOO  3456
7  RRR  6789
8  PPP  9012
I need to merge these two dataframes where the IDs match, and add the prop2 column to the original.
   ID prop1 prop2
1  UUU  &&&  1234
2  III  ***  7890
3  OOO  )))  3456
4  PPP  %%%  9012
I've tried every combination of merge, join, concat, for loops, iterrows, etc. It will either fail to merge, lose the index, or outright drop the column values.

You can use pd.merge():
pd.merge(df1, df2, on='ID')
Output:
    ID prop1 prop2
0  UUU   &&&  1234
1  III   ***  7890
2  OOO   )))  3456
3  PPP   %%%  9012
You can also use df.merge() as follows:
df1.merge(df2, on='ID')
Same result.
The default for .merge(), whether called as pd.merge() or df.merge(), is how='inner', so you are already doing an inner join even without specifying the how= parameter.
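As a runnable sketch, rebuilding the two frames from the question's sample data (indices 1-4 and 1-8 as shown above):

```python
import pandas as pd

# Rebuild the question's two frames with their original indices.
df1 = pd.DataFrame({'ID': ['UUU', 'III', 'OOO', 'PPP'],
                    'prop1': ['&&&', '***', ')))', '%%%']},
                   index=[1, 2, 3, 4])
df2 = pd.DataFrame({'ID': ['UUU', 'WWW', 'III', 'EEE', 'OOO', 'RRR', 'PPP'],
                    'prop2': [1234, 4567, 7890, 123, 3456, 6789, 9012]},
                   index=[1, 2, 3, 5, 6, 7, 8])

# how='inner' is the default: only IDs present in both frames survive,
# and the result gets a fresh 0-based index.
merged = pd.merge(df1, df2, on='ID')
print(merged)
```

Note that the fresh 0-based index on the result is exactly what the "more complex scenario" below works around.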
More complex scenario:
If you need the more complicated behaviour of keeping df1's index 1, 2, 3, 4 instead of 0, 1, 2, 3, you can reset the index before the merge and then set the index back using the interim 'index' column produced by reset_index():
df1.reset_index().merge(df2, on='ID').set_index('index')
Output:
ID prop1 prop2
index
1 UUU &&& 1234
2 III *** 7890
3 OOO ))) 3456
4 PPP %%% 9012
Now the original index of df1 (1, 2, 3, 4) is kept.
Optionally, if you don't want the axis label 'index' to appear above the row index, you can add rename_axis() as follows:
df1.reset_index().merge(df2, on='ID').set_index('index').rename_axis(index=None)
Output:
ID prop1 prop2
1 UUU &&& 1234
2 III *** 7890
3 OOO ))) 3456
4 PPP %%% 9012
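Put together, the index-preserving variant runs like this (frames rebuilt from the question's data):

```python
import pandas as pd

df1 = pd.DataFrame({'ID': ['UUU', 'III', 'OOO', 'PPP'],
                    'prop1': ['&&&', '***', ')))', '%%%']},
                   index=[1, 2, 3, 4])
df2 = pd.DataFrame({'ID': ['UUU', 'WWW', 'III', 'EEE', 'OOO', 'RRR', 'PPP'],
                    'prop2': [1234, 4567, 7890, 123, 3456, 6789, 9012]})

# reset_index() turns df1's index into an 'index' column so the merge keeps
# it; set_index('index') restores it and rename_axis(index=None) hides the
# leftover 'index' label.
out = (df1.reset_index()
          .merge(df2, on='ID')
          .set_index('index')
          .rename_axis(index=None))
print(out)
```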

You can also use .map to add the prop2 values to your original dataframe, where the ID column values match.
df1['prop2'] = df1['ID'].map(dict(df2[['ID', 'prop2']].to_numpy()))
Should there be any IDs in your original dataframe that aren't also in the second one (and so don't have a prop2 value to bring across), you can fill those holes by appending .fillna() with the value of your choice.
df1['prop2'] = df1['ID'].map(dict(df2[['ID', 'prop2']].to_numpy())).fillna(your_fill_value_here)
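A runnable sketch of the .map approach on the question's data. It builds the ID-to-prop2 mapping as a Series via set_index, which is equivalent to the dict(...to_numpy()) trick; the 0 passed to fillna is an arbitrary fill value for illustration:

```python
import pandas as pd

df1 = pd.DataFrame({'ID': ['UUU', 'III', 'OOO', 'PPP'],
                    'prop1': ['&&&', '***', ')))', '%%%']})
df2 = pd.DataFrame({'ID': ['UUU', 'WWW', 'III', 'EEE', 'OOO', 'RRR', 'PPP'],
                    'prop2': [1234, 4567, 7890, 123, 3456, 6789, 9012]})

# Same mapping as dict(df2[['ID', 'prop2']].to_numpy()), built as a Series.
lookup = df2.set_index('ID')['prop2']
df1['prop2'] = df1['ID'].map(lookup).fillna(0)  # 0 is an arbitrary fill value
print(df1)
```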

Related

Filter record from one data frame based on column values in second data frame in python

I have two DataFrames, df1 and df2.
df1 is the original dataset and df2 was made from df1 after some manipulation.
df1 has a column 'log', and df2 has two columns, 'log1' and 'log2'.
The values in columns 'log1' and 'log2' are contained in column 'log' of df1.
df2 sample below
date id log1 log2
1 uu1q (2,4) (3,5)
1 uu1q (2,4) (7,6)
1 uu1q (3,5) (7,6)
5 u25a (4,7) (3,9)
5 uu25a (1,9) (3,9)
6 ua3b7 (1,1) (2,2)
6 ua3b7 (1,1) (3,3)
6 ua3b7 (2,2) (3,3)
df1 column sample with data below
date id log name col1 col2
1 uu1q (2,4) xyz 1123 qqq
1 uu1q (3,5) aas 2132 wew
1 uu1q (7,6) wqas 2567 uuo
5 u25a (4,7) enj 666 ttt
5 fff (0,0) ddd 0 lll
Now I want to fetch/filter all the records from df1 based on the column values of each row in df2, i.e. compare 'date', 'id', and 'log1' or 'log2' in df2 with 'date', 'id', and 'log' in df1.
NOTE: the values of columns 'log1' and 'log2' are contained in the single column 'log'.
IIUC, you're looking for a chained isin:
out = df1[df1['date'].isin(df2['date'])
          & df1['id'].isin(df2['id'])
          & (df1['log'].isin(df2['log1']) | df1['log'].isin(df2['log2']))]
Output:
date id log name col1 col2
0 1 uu1q (2,4) xyz 1123 qqq
1 1 uu1q (3,5) aas 2132 wew
2 1 uu1q (7,6) wqas 2567 uuo
3 5 u25a (4,7) enj 666 ttt
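Rebuilt from the samples above, the chained-isin filter can be sketched end to end as:

```python
import pandas as pd

df2 = pd.DataFrame({'date': [1, 1, 1, 5, 5, 6, 6, 6],
                    'id': ['uu1q', 'uu1q', 'uu1q', 'u25a', 'uu25a',
                           'ua3b7', 'ua3b7', 'ua3b7'],
                    'log1': ['(2,4)', '(2,4)', '(3,5)', '(4,7)', '(1,9)',
                             '(1,1)', '(1,1)', '(2,2)'],
                    'log2': ['(3,5)', '(7,6)', '(7,6)', '(3,9)', '(3,9)',
                             '(2,2)', '(3,3)', '(3,3)']})
df1 = pd.DataFrame({'date': [1, 1, 1, 5, 5],
                    'id': ['uu1q', 'uu1q', 'uu1q', 'u25a', 'fff'],
                    'log': ['(2,4)', '(3,5)', '(7,6)', '(4,7)', '(0,0)'],
                    'name': ['xyz', 'aas', 'wqas', 'enj', 'ddd'],
                    'col1': [1123, 2132, 2567, 666, 0],
                    'col2': ['qqq', 'wew', 'uuo', 'ttt', 'lll']})

# Keep df1 rows whose date and id occur in df2 and whose log appears
# in either log1 or log2; the 'fff' row fails the id test and is dropped.
out = df1[df1['date'].isin(df2['date'])
          & df1['id'].isin(df2['id'])
          & (df1['log'].isin(df2['log1']) | df1['log'].isin(df2['log2']))]
print(out)
```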
Use DataFrame.melt to stack the log1 and log2 columns into a single log column, then filter with an inner join in DataFrame.merge:
df = (df2.melt(['date', 'id'], value_name='log')
         .drop('variable', axis=1)
         .drop_duplicates()
         .merge(df1))
print(df)
date id log name col1 col2
0 1 uu1q (2,4) xyz 1123 qqq
1 1 uu1q (3,5) aas 2132 wew
2 5 u25a (4,7) enj 666 ttt
3 1 uu1q (7,6) wqas 2567 uuo
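The melt variant can be run on the same rebuilt frames as a sketch:

```python
import pandas as pd

df2 = pd.DataFrame({'date': [1, 1, 1, 5, 5, 6, 6, 6],
                    'id': ['uu1q', 'uu1q', 'uu1q', 'u25a', 'uu25a',
                           'ua3b7', 'ua3b7', 'ua3b7'],
                    'log1': ['(2,4)', '(2,4)', '(3,5)', '(4,7)', '(1,9)',
                             '(1,1)', '(1,1)', '(2,2)'],
                    'log2': ['(3,5)', '(7,6)', '(7,6)', '(3,9)', '(3,9)',
                             '(2,2)', '(3,3)', '(3,3)']})
df1 = pd.DataFrame({'date': [1, 1, 1, 5, 5],
                    'id': ['uu1q', 'uu1q', 'uu1q', 'u25a', 'fff'],
                    'log': ['(2,4)', '(3,5)', '(7,6)', '(4,7)', '(0,0)'],
                    'name': ['xyz', 'aas', 'wqas', 'enj', 'ddd'],
                    'col1': [1123, 2132, 2567, 666, 0],
                    'col2': ['qqq', 'wew', 'uuo', 'ttt', 'lll']})

# melt stacks log1/log2 into one 'log' column keyed by (date, id);
# drop_duplicates removes repeated pairs, and the inner merge (on the shared
# date/id/log columns) does the actual filtering.
out = (df2.melt(['date', 'id'], value_name='log')
          .drop('variable', axis=1)
          .drop_duplicates()
          .merge(df1))
print(out)
```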

Concatenating data from two files

There are two files opened with pandas. Where the first columns of the two files share a common part, I want to paste the data from the second column of the second file into the matching rows of the first file. And if there is no match, I want to write NaN. Is there a way to do this?
File1
0 1
0 JCW 574
1 MBM 4212
2 COP 7424
3 KVI 4242
4 ECX 424
File2
0 1
0 G=COP d4ssd5vwe2e2
1 G=DDD dfd23e1rv515j5o
2 G=FEW cwdsuve615cdldl
3 G=JCW io55i5i55j8rrrg5f3r
4 G=RRR c84sdw5e5vwldk455
5 G=ECX j4ut84mnh54t65y
File1 (desired result)
0 1 2
0 JCW 574 io55i5i55j8rrrg5f3r
1 MBM 4212 NaN
2 COP 7424 d4ssd5vwe2e2
3 KVI 4242 NaN
4 ECX 424 j4ut84mnh54t65y
Use Series.str.extract to create a new Series of values matched against df1[0], then merge with a left join in DataFrame.merge:
df1 = pd.read_csv(file1)
df2 = pd.read_csv(file2)
s = df2[0].str.extract(f'({"|".join(df1[0])})', expand=False)
df = df1.merge(df2[[1]], how='left', left_on=0, right_on=s)
df.columns = np.arange(len(df.columns))
print(df)
0 1 2
0 JCW 574 io55i5i55j8rrrg5f3r
1 MBM 4212 NaN
2 COP 7424 d4ssd5vwe2e2
3 KVI 4242 NaN
4 ECX 424 j4ut84mnh54t65y
Or, if you need to match the last 3 characters of column df1[0], use:
s = df2[0].str.extract(f'({"|".join(df1[0].str[-3:])})', expand=False)
df = df1.merge(df2[[1]], how='left', left_on=0, right_on=s)
df.columns = np.arange(len(df.columns))
print(df)
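A self-contained sketch of the extract-then-merge idea, with the two files typed in directly instead of read from disk, and the extracted key made an explicit column before the join (an equivalent formulation of the answer's code):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({0: ['JCW', 'MBM', 'COP', 'KVI', 'ECX'],
                    1: [574, 4212, 7424, 4242, 424]})
df2 = pd.DataFrame({0: ['G=COP', 'G=DDD', 'G=FEW', 'G=JCW', 'G=RRR', 'G=ECX'],
                    1: ['d4ssd5vwe2e2', 'dfd23e1rv515j5o', 'cwdsuve615cdldl',
                        'io55i5i55j8rrrg5f3r', 'c84sdw5e5vwldk455',
                        'j4ut84mnh54t65y']})

# Extract whichever df1 key each df2 row contains, then left-join on it;
# unmatched keys (MBM, KVI) get NaN.
s = df2[0].str.extract(f'({"|".join(df1[0])})', expand=False)
df = (df1.merge(df2[[1]].assign(key=s), how='left', left_on=0, right_on='key')
         .drop(columns='key'))
df.columns = np.arange(len(df.columns))
print(df)
```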
Have a look at the concat function of pandas using join='outer' (https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html).
The approach involves reindexing each of your dataframes to use the column that is now called "0" as the index, and then joining the two dataframes on their indices.
Also, can I suggest that you do not paste an image of your dataframes, but post the data in a form that other people can use to test their suggestions.
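A minimal sketch of that index-based join on the same data. It assumes the common part is whatever follows the 'G=' prefix in File2, which is stripped to form a shared key:

```python
import pandas as pd

df1 = pd.DataFrame({0: ['JCW', 'MBM', 'COP', 'KVI', 'ECX'],
                    1: [574, 4212, 7424, 4242, 424]})
df2 = pd.DataFrame({0: ['G=COP', 'G=DDD', 'G=FEW', 'G=JCW', 'G=RRR', 'G=ECX'],
                    1: ['d4ssd5vwe2e2', 'dfd23e1rv515j5o', 'cwdsuve615cdldl',
                        'io55i5i55j8rrrg5f3r', 'c84sdw5e5vwldk455',
                        'j4ut84mnh54t65y']})

# Strip the 'G=' prefix so both frames share a key, index both frames on it,
# and join; df1 rows without a partner (MBM, KVI) get NaN automatically.
right = df2.set_index(df2[0].str.replace('G=', '', regex=False))[[1]]
right.columns = [2]
out = df1.set_index(0).join(right)
print(out)
```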

Returning the rows based on specific value without column name

I know how to return the rows based on specific text by specifying the column name like below.
import pandas as pd
data = {'id':['1', '2', '3','4'],
'City1':['abc','def','abc','khj'],
'City2':['JH','abc','abc','yuu'],
'City2':['JRR','ytu','rr','abc']}
df = pd.DataFrame(data)
df.loc[df['City1']== 'abc']
and output is -
id City1 City2
0 1 abc JRR
2 3 abc rr
But what I need is: my specific value 'abc' can be in any column, and I need to return the rows that contain the text 'abc' without giving a column name. Is there any way? I need output as below:
id City1 City2
0 1 abc JRR
1 3 abc rr
2 4 khj abc
You can use any with axis=1 to apply the test across all columns and get the expected result:
>>> df[(df == 'abc').any(axis=1)]
id City1 City2
0 1 abc JRR
2 3 abc rr
3 4 khj abc
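A runnable sketch on the question's data. Note the question's dict repeats the 'City2' key, so only the last list survives; the frame below reflects that, matching the output shown:

```python
import pandas as pd

df = pd.DataFrame({'id': ['1', '2', '3', '4'],
                   'City1': ['abc', 'def', 'abc', 'khj'],
                   'City2': ['JRR', 'ytu', 'rr', 'abc']})

# (df == 'abc') gives an elementwise boolean frame; any(axis=1) keeps rows
# where 'abc' appears in at least one column.
out = df[(df == 'abc').any(axis=1)]
print(out)
```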

pandas function to fill missing values from other dataframe based on matching column?

So I have two dataframes: one where certain columns are filled in and one where others are filled in but some from the previous df are missing. Both share some common non-empty columns.
DF1:
FirstName Uid JoinDate BirthDate
Bob 1 20160628 NaN
Charlie 3 20160627 NaN
DF2:
FirstName Uid JoinDate BirthDate
Bob 1 NaN 19910524
Alice 2 NaN 19950403
Result:
FirstName Uid JoinDate BirthDate
Bob 1 20160628 19910524
Alice 2 NaN 19950403
Charlie 3 20160627 NaN
Assuming that these rows do not share index positions in their respective dataframes, is there a way that I can fill the missing values in DF1 with values from DF2 where the rows match on a certain column (in this example Uid)?
Also, is there a way to create a new entry in DF1 from DF2 if there isn't a match on that column (e.g. Uid) without removing rows in DF1 that don't match any rows in DF2?
EDIT: I updated the dataframes to add non-matching results in both dataframes that I need in the result df. I also updated my last question to reflect that.
UPDATE: you can do it by setting the proper indices and finally resetting the index of the joined DF:
In [14]: df1.set_index('FirstName').combine_first(df2.set_index('FirstName')).reset_index()
Out[14]:
FirstName Uid JoinDate BirthDate
0 Alice 2.0 NaN 19950403.0
1 Bob 1.0 20160628.0 19910524.0
2 Charlie 3.0 20160627.0 NaN
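As a runnable sketch with the edited frames from the question:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'FirstName': ['Bob', 'Charlie'],
                    'Uid': [1, 3],
                    'JoinDate': [20160628, 20160627],
                    'BirthDate': [np.nan, np.nan]})
df2 = pd.DataFrame({'FirstName': ['Bob', 'Alice'],
                    'Uid': [1, 2],
                    'JoinDate': [np.nan, np.nan],
                    'BirthDate': [19910524, 19950403]})

# Indexing on FirstName aligns Bob with Bob; combine_first patches df1's
# NaNs from df2 and keeps rows unique to either frame (Alice, Charlie).
out = (df1.set_index('FirstName')
          .combine_first(df2.set_index('FirstName'))
          .reset_index())
print(out)
```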
Try this (note that combine_first aligns on the index, here the default integer positions, so it only patches rows that line up):
In [113]: df2.combine_first(df1)
Out[113]:
FirstName Uid JoinDate BirthDate
0 Bob 1 20160628.0 19910524
1 Alice 2 NaN 19950403

Adding A Specific Column from a Pandas Dataframe to Another Pandas Dataframe

I am trying to add a column to a pandas dataframe (df1) that has a unique identifier ('id') column from another dataframe (df2) that has the same unique identifier ('sameid'). I have tried merge, but I need to only add one specific column ('addthiscolumn') not all of the columns. What is the best way to do this?
print df1
'id' 'column1'
0 aaa randomdata1
1 aab randomdata2
2 aac randomdata3
3 aad randomdata4
print df2
'sameid' 'irrelevant' 'addthiscolumn'
0 aaa irre1 1234
1 aab irre2 2345
2 aac irre3 3456
3 aad irre4 4567
4 aae irre5 5678
5 aad irre6 6789
Desired Result
print df1
'id' 'column1' 'addthiscolumn'
0 aaa randomdata1 1234
1 aab randomdata2 2345
2 aac randomdata3 3456
3 aad randomdata4 4567
Because you just want to merge a single column, you can select as follows:
df1.merge(df2[['sameid', 'addthiscolumn']], left_on='id', right_on='sameid')
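A runnable sketch on the question's data. Since df2 lists 'aad' twice, a plain inner merge would return both matches; deduplicating df2 on 'sameid' first keeps one row per id, matching the desired result, and dropping the redundant 'sameid' column afterwards tidies up:

```python
import pandas as pd

df1 = pd.DataFrame({'id': ['aaa', 'aab', 'aac', 'aad'],
                    'column1': ['randomdata1', 'randomdata2',
                                'randomdata3', 'randomdata4']})
df2 = pd.DataFrame({'sameid': ['aaa', 'aab', 'aac', 'aad', 'aae', 'aad'],
                    'irrelevant': ['irre1', 'irre2', 'irre3',
                                   'irre4', 'irre5', 'irre6'],
                    'addthiscolumn': [1234, 2345, 3456, 4567, 5678, 6789]})

# Select just the key and the wanted column; drop duplicate keys so each id
# matches at most one row, then drop the redundant key column afterwards.
out = (df1.merge(df2[['sameid', 'addthiscolumn']].drop_duplicates('sameid'),
                 left_on='id', right_on='sameid')
          .drop(columns='sameid'))
print(out)
```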
