Merging data frames - python

In Python, I have a df that looks like this:
Name ID
Anna 1
Polly 1
Sarah 2
Max 3
Kate 3
Ally 3
Steve 3
And a df that looks like this
Name ID
Dan 1
Hallie 2
Cam 2
Lacy 2
Ryan 3
Colt 4
Tia 4
How can I merge the dfs so that the ID column looks like this:
Name ID
Anna 1
Polly 1
Sarah 2
Max 3
Kate 3
Ally 3
Steve 3
Dan 4
Hallie 5
Cam 5
Lacy 5
Ryan 6
Colt 7
Tia 7
This is just a minimal reproducible example; my actual data set has thousands of values. I'm basically merging data frames and want the IDs in numerical order (continuing from the previous data frame) instead of restarting from one each time. I know that I can reset the index if ID is a unique identifier, but in this case more than one person can have the same ID. So how can I account for that?

From the example you provided, the final dataframe can be obtained by adding the maximum ID in the first df to every ID in the second df, and then concatenating them. To explain this better:
Name df2 final_df
Dan 1 4
The value 4 in final_df is obtained as 1 + (the maximum ID in df1, i.e. 3), and the same offset applies to every entry of the second dataframe.
Code:
import pandas as pd

df = pd.DataFrame({'Name': ['Anna', 'Polly', 'Sarah', 'Max', 'Kate', 'Ally', 'Steve'],
                   'ID': [1, 1, 2, 3, 3, 3, 3]})
df1 = pd.DataFrame({'Name': ['Dan', 'Hallie', 'Cam', 'Lacy', 'Ryan', 'Colt', 'Tia'],
                    'ID': [1, 2, 2, 2, 3, 4, 4]})

# Offset the second frame's IDs by the maximum ID of the first frame,
# then concatenate; ignore_index rebuilds a clean 0..n-1 index.
max_id = df['ID'].max()
df1['ID'] = df1['ID'] + max_id
final_df = pd.concat([df, df1], ignore_index=True)
print(final_df)
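If you ever need to chain more than two frames, the same offset trick can be applied iteratively. A minimal sketch under that assumption (the helper name concat_with_continued_ids is made up here):
import pandas as pd

def concat_with_continued_ids(frames):
    """Concatenate frames so the ID numbering continues across them."""
    out, offset = [], 0
    for f in frames:
        f = f.copy()
        f['ID'] = f['ID'] + offset   # shift this frame's IDs past the previous ones
        offset = f['ID'].max()       # the next frame starts after the current maximum
        out.append(f)
    return pd.concat(out, ignore_index=True)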

Related

Join two Pandas dataframes, sampling from the smaller dataframe

I have two dataframes that look as follows:
import pandas as pd
import io
train_data="""input_example,user_id
example0.npy, jane
example1.npy, bob
example4.npy, alice
example5.npy, jane
example3.npy, bob
example2.npy, bob
"""
user_data="""user_data,user_id
data_jane0.npy, jane
data_jane1.npy, jane
data_bob0.npy, bob
data_bob1.npy, bob
data_alice0.npy, alice
data_alice1.npy, alice
data_alice2.npy, alice
"""
train_df = pd.read_csv(io.StringIO(train_data), sep=",")
user_df = pd.read_csv(io.StringIO(user_data), sep=",")
Suppose that the train_df table is many thousands of entries long, i.e., there are 1000s of unique "exampleN.npy" files. I was wondering if there was a straightforward way to merge the train_df and user_df tables where each row of the joined table matches on the key user_id but is subsampled from user_df.
Here is one example of a resulting dataframe (I'm trying to do uniform sampling, so theoretically, there are infinite possible result dataframes):
>>> result_df
input_example user_data user_id
0 example0.npy data_jane0.npy jane
1 example1.npy data_bob1.npy bob
2 example4.npy data_alice0.npy alice
3 example5.npy data_jane1.npy jane
4 example3.npy data_bob0.npy bob
5 example2.npy data_bob0.npy bob
That is, the user_data column is filled with a random choice of filename based on the corresponding user_id.
I know one could write this using some multi-line for-loop query-based approach, but perhaps there is a faster way using built-in Pandas functions, e.g., "sample", "merge", "join", or "combine".
You can sample by groups in user_df and then join that with train_df.
e.g.,
# this samples by fraction so each data is equally likely
user_df = user_df.groupby("user_id").sample(frac=0.5, replace=True)
user_data user_id
6 data_alice2.npy alice
4 data_alice0.npy alice
3 data_bob1.npy bob
0 data_jane0.npy jane
or
# this will sample 2 samples per group
user_df = user_df.groupby("user_id").sample(n=2, replace=True)
user_data user_id
6 data_alice2.npy alice
4 data_alice0.npy alice
2 data_bob0.npy bob
2 data_bob0.npy bob
0 data_jane0.npy jane
1 data_jane1.npy jane
Join
pd.merge(train_df, user_df)
I don't know if it is possible to sample during the merge without first merging both, but the following avoids a multi-line for loop:
merged = (train_df.merge(user_df, on="user_id", how="left")
                  .groupby("input_example", as_index=False)
                  .apply(lambda x: x.sample(1))
                  .reset_index(drop=True))
This merges the two on "user_id", keeping only keys that appear in the left frame; groups by "input_example", assuming these are all unique (otherwise one could group on both columns of train_df); takes a sample of size 1 per group; and resets the index.
Sampling second, after the merge, means that rows with the same user_id will not necessarily receive the same user_data (whereas sampling user_df first would give every output row with a given user_id the same user_data, as sketched below).
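If you instead want every row with the same user_id to get the same user_data file, a minimal sketch of that variant (sample one row per user first, then merge):
# One file per user, reused for every matching train row
sampled = user_df.groupby("user_id").sample(n=1)
consistent = train_df.merge(sampled, on="user_id", how="left")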
I think I figured out a solution myself; it's a one-liner, but conceptually it's the same as what @Rawson suggested. First, I do a left merge, which results in a table with many duplicates. Then I shuffle all the rows to give it randomness. Finally, I drop the duplicates. If I add sort_index, the resulting table has the same ordering as the original table.
I'm able to use the random_state kwarg to switch up which user_data file is used. See here:
>>> train_df.merge(user_df, on='user_id', how='left').sample(frac=1, random_state=0).drop_duplicates('input_example').sort_index()
input_example user_id user_data
1 example0.npy jane data_jane1.npy
2 example1.npy bob data_bob0.npy
6 example4.npy alice data_alice2.npy
8 example5.npy jane data_jane1.npy
10 example3.npy bob data_bob1.npy
11 example2.npy bob data_bob0.npy
>>> train_df.merge(user_df, on='user_id', how='left').sample(frac=1, random_state=1).drop_duplicates('input_example').sort_index()
input_example user_id user_data
1 example0.npy jane data_jane1.npy
2 example1.npy bob data_bob0.npy
4 example4.npy alice data_alice0.npy
7 example5.npy jane data_jane0.npy
10 example3.npy bob data_bob1.npy
12 example2.npy bob data_bob1.npy

How to find out the difference between two dataframes irrespective of index? [duplicate]

I have two data frames df1 and df2, where df2 is a subset of df1. How do I get a new data frame (df3) which is the difference between the two data frames?
In other words, a data frame that has all the rows/columns in df1 that are not in df2?
By using drop_duplicates
pd.concat([df1,df2]).drop_duplicates(keep=False)
Update:
The above method only works for data frames that don't already contain duplicates themselves. For example:
df1=pd.DataFrame({'A':[1,2,3,3],'B':[2,3,4,4]})
df2=pd.DataFrame({'A':[1],'B':[2]})
It will produce the wrong output shown below.
Wrong output:
pd.concat([df1, df2]).drop_duplicates(keep=False)
Out[655]:
A B
1 2 3
Correct Output
Out[656]:
A B
1 2 3
2 3 4
3 3 4
How to achieve that?
Method 1: Using isin with tuple
df1[~df1.apply(tuple,1).isin(df2.apply(tuple,1))]
Out[657]:
A B
1 2 3
2 3 4
3 3 4
Method 2: merge with indicator
df1.merge(df2,indicator = True, how='left').loc[lambda x : x['_merge']!='both']
Out[421]:
A B _merge
1 2 3 left_only
2 3 4 left_only
3 3 4 left_only
For rows, try this, where Name is the shared key column (it can be a list for multiple common columns, or you can specify left_on and right_on):
m = df1.merge(df2, on='Name', how='outer', suffixes=['', '_'], indicator=True)
The indicator=True setting is useful as it adds a column called _merge that categorizes every row into one of three possible kinds: "left_only", "right_only" or "both".
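For instance, you can then split the merged frame by change type (a small illustration using the m defined above):
# Rows present only in df1, only in df2, or in both
only_in_df1 = m[m['_merge'] == 'left_only']
only_in_df2 = m[m['_merge'] == 'right_only']
in_both = m[m['_merge'] == 'both']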
For columns, try this:
set(df1.columns).symmetric_difference(df2.columns)
The accepted answer's Method 1 will not work for data frames with NaNs inside, because np.nan != np.nan. I am not sure if this is the best way, but it can be avoided by
df1[~df1.astype(str).apply(tuple, 1).isin(df2.astype(str).apply(tuple, 1))]
It's slower because it needs to cast the data to string, but thanks to that casting the string representations of NaN compare equal.
Let's go through the code. First we cast the values to string and apply the tuple function to each row.
df1.astype(str).apply(tuple, 1)
df2.astype(str).apply(tuple, 1)
Thanks to that, we get a pd.Series of tuples, where each tuple contains a whole row from df1/df2.
Then we apply the isin method on df1's tuples to check whether each one "is in" df2.
The result is a pd.Series of bools: True if a tuple from df1 is in df2. Finally, we negate the result with the ~ sign and use it to filter df1. Long story short, we get only those rows from df1 that are not in df2.
To make it more readable, we may write it as:
df1_str_tuples = df1.astype(str).apply(tuple, 1)
df2_str_tuples = df2.astype(str).apply(tuple, 1)
df1_values_in_df2_filter = df1_str_tuples.isin(df2_str_tuples)
df1_values_not_in_df2 = df1[~df1_values_in_df2_filter]
import pandas as pd
# given
df1 = pd.DataFrame({'Name':['John','Mike','Smith','Wale','Marry','Tom','Menda','Bolt','Yuswa',],
'Age':[23,45,12,34,27,44,28,39,40]})
df2 = pd.DataFrame({'Name':['John','Smith','Wale','Tom','Menda','Yuswa',],
'Age':[23,12,34,44,28,40]})
# find elements in df1 that are not in df2
# (note: each column is checked independently, so a Name and an Age that
# appear in different rows of df2 still count as a match here)
df_1notin2 = df1[~(df1['Name'].isin(df2['Name']) & df1['Age'].isin(df2['Age']))].reset_index(drop=True)
# output:
print('df1\n', df1)
print('df2\n', df2)
print('df_1notin2\n', df_1notin2)
# df1
# Age Name
# 0 23 John
# 1 45 Mike
# 2 12 Smith
# 3 34 Wale
# 4 27 Marry
# 5 44 Tom
# 6 28 Menda
# 7 39 Bolt
# 8 40 Yuswa
# df2
# Age Name
# 0 23 John
# 1 12 Smith
# 2 34 Wale
# 3 44 Tom
# 4 28 Menda
# 5 40 Yuswa
# df_1notin2
# Age Name
# 0 45 Mike
# 1 27 Marry
# 2 39 Bolt
Perhaps a simpler one-liner, with identical or different column names. It worked even when df2['Name2'] contained duplicate values.
newDf = (df1.set_index('Name1')
            .drop(df2['Name2'], errors='ignore')
            .reset_index(drop=False))
Edit 2: I figured out a new solution without the need of setting an index:
newdf = pd.concat([df1, df2]).drop_duplicates(keep=False)
Okay, I found that the highest-voted answer already contains what I figured out. Yes, we can only use this code on the condition that there are no duplicates in either of the two dfs.
I have a tricky method. First we set 'Name' as the index of the two dataframes given by the question. Since we have the same 'Name' values in the two dfs, we can just drop the 'smaller' df's index from the 'bigger' df.
Here is the code.
df1.set_index('Name',inplace=True)
df2.set_index('Name',inplace=True)
newdf=df1.drop(df2.index)
Pandas now offers a new API to do a data frame diff: pandas.DataFrame.compare. Note that it can only compare identically-labeled dataframes, i.e. same shape with identical row and column labels.
df.compare(df2)
col1 col3
self other self other
0 a c NaN NaN
2 NaN NaN 3.0 4.0
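If the labels differ, one workaround is to align the frames first; a minimal sketch (df2_aligned is an illustrative name):
# Reindex df2 onto df's labels; cells absent from df2 become NaN
# and therefore show up as differences in the comparison.
df2_aligned = df2.reindex(index=df.index, columns=df.columns)
df.compare(df2_aligned)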
In addition to the accepted answer, I would like to propose one more, wider solution that can find a 2D set difference of two dataframes with any index/columns (they might not coincide for both dataframes). The method also allows you to set the tolerance for float elements in the comparison (it uses np.isclose).
import numpy as np
import pandas as pd
def get_dataframe_setdiff2d(df_new: pd.DataFrame,
                            df_old: pd.DataFrame,
                            rtol=1e-03, atol=1e-05) -> pd.DataFrame:
    """Returns set difference of two pandas DataFrames"""
    union_index = np.union1d(df_new.index, df_old.index)
    union_columns = np.union1d(df_new.columns, df_old.columns)

    new = df_new.reindex(index=union_index, columns=union_columns)
    old = df_old.reindex(index=union_index, columns=union_columns)

    mask_diff = ~np.isclose(new, old, rtol, atol)
    df_bool = pd.DataFrame(mask_diff, union_index, union_columns)

    df_diff = pd.concat([new[df_bool].stack(),
                         old[df_bool].stack()], axis=1)
    df_diff.columns = ["New", "Old"]
    return df_diff
Example:
In [1]
df1 = pd.DataFrame({'A':[2,1,2],'C':[2,1,2]})
df2 = pd.DataFrame({'A':[1,1],'B':[1,1]})
print("df1:\n", df1, "\n")
print("df2:\n", df2, "\n")
diff = get_dataframe_setdiff2d(df1, df2)
print("diff:\n", diff, "\n")
Out [1]
df1:
A C
0 2 2
1 1 1
2 2 2
df2:
A B
0 1 1
1 1 1
diff:
New Old
0 A 2.0 1.0
B NaN 1.0
C 2.0 NaN
1 B NaN 1.0
C 1.0 NaN
2 A 2.0 NaN
C 2.0 NaN
As mentioned here,
df1[~df1.apply(tuple,1).isin(df2.apply(tuple,1))]
is the correct solution, but it will produce wrong output if
df1=pd.DataFrame({'A':[1],'B':[2]})
df2=pd.DataFrame({'A':[1,2,3,3],'B':[2,3,4,4]})
In that case the above solution gives an empty DataFrame; instead, you should use the concat method after removing the duplicates from each dataframe.
Use concat with drop_duplicates:
df1=df1.drop_duplicates(keep="first")
df2=df2.drop_duplicates(keep="first")
pd.concat([df1,df2]).drop_duplicates(keep=False)
I had issues handling duplicates when there were duplicates on one side and at least one on the other side, so I used collections.Counter to do a better diff that compares row counts. A row is reported only when the two sides have different counts of it; if both sides have the same count, it is not returned.
from collections import Counter

import pandas as pd

def diff(df1, df2, on=None):
    """
    :param on: same as pandas.df.merge(on) (a list of columns)
    """
    on = on if on else df1.columns
    df1on = df1[on]
    df2on = df2[on]
    c1 = Counter(df1on.apply(tuple, 'columns'))
    c2 = Counter(df2on.apply(tuple, 'columns'))
    c1c2 = c1 - c2  # rows over-represented in df1
    c2c1 = c2 - c1  # rows over-represented in df2
    df1ondf2on = pd.DataFrame(list(c1c2.elements()), columns=on)
    df2ondf1on = pd.DataFrame(list(c2c1.elements()), columns=on)
    df1df2 = df1.merge(df1ondf2on).drop_duplicates(subset=on)
    df2df1 = df2.merge(df2ondf1on).drop_duplicates(subset=on)
    return pd.concat([df1df2, df2df1])
> df1 = pd.DataFrame({'a': [1, 1, 3, 4, 4]})
> df2 = pd.DataFrame({'a': [1, 2, 3, 4, 4]})
> diff(df1, df2)
a
0 1
0 2
There is a new method in pandas, DataFrame.compare, that compares two dataframes and returns which values changed in each column for the data records.
Example
First Dataframe
Id Customer Status Date
1 ABC Good Mar 2023
2 BAC Good Feb 2024
3 CBA Bad Apr 2022
Second Dataframe
Id Customer Status Date
1 ABC Bad Mar 2023
2 BAC Good Feb 2024
5 CBA Good Apr 2024
Comparing Dataframes
print("Dataframe difference -- \n")
print(df1.compare(df2))
print("Dataframe difference keeping equal values -- \n")
print(df1.compare(df2, keep_equal=True))
print("Dataframe difference keeping same shape -- \n")
print(df1.compare(df2, keep_shape=True))
print("Dataframe difference keeping same shape and equal values -- \n")
print(df1.compare(df2, keep_shape=True, keep_equal=True))
Result
Dataframe difference --
Id Status Date
self other self other self other
0 NaN NaN Good Bad NaN NaN
2 3.0 5.0 Bad Good Apr 2022 Apr 2024
Dataframe difference keeping equal values --
Id Status Date
self other self other self other
0 1 1 Good Bad Mar 2023 Mar 2023
2 3 5 Bad Good Apr 2022 Apr 2024
Dataframe difference keeping same shape --
Id Customer Status Date
self other self other self other self other
0 NaN NaN NaN NaN Good Bad NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN NaN
2 3.0 5.0 NaN NaN Bad Good Apr 2022 Apr 2024
Dataframe difference keeping same shape and equal values --
Id Customer Status Date
self other self other self other self other
0 1 1 ABC ABC Good Bad Mar 2023 Mar 2023
1 2 2 BAC BAC Good Good Feb 2024 Feb 2024
2 3 5 CBA CBA Bad Good Apr 2022 Apr 2024
A slight variation of the nice @liangli solution that does not require changing the index of the existing dataframes (the join needs a suffix because both frames have overlapping columns):
newdf = df1.drop(df1.join(df2.set_index('Name'), on='Name', how='inner', rsuffix='_2').index)
Finding the difference by index, assuming one dataframe is a subset of the other and that the indexes carry forward when subsetting (below, df2 is a subset of df1):
df1.loc[set(df1.index).symmetric_difference(set(df2.index))].dropna()
# Example
import numpy as np

df1 = pd.DataFrame({"gender": np.random.choice(['m','f'], size=5),
                    "subject": np.random.choice(["bio","phy","chem"], size=5)},
                   index=[1,2,3,4,5])
df2 = df1.loc[[1,3,5]]
df1
gender subject
1 f bio
2 m chem
3 f phy
4 m bio
5 f bio
df2
gender subject
1 f bio
3 f phy
5 f bio
df3 = df1.loc[set(df1.index).symmetric_difference(set(df2.index))].dropna()
df3
gender subject
2 m chem
4 m bio
Defining our dataframes:
df1 = pd.DataFrame({
'Name':
['John','Mike','Smith','Wale','Marry','Tom','Menda','Bolt','Yuswa'],
'Age':
[23,45,12,34,27,44,28,39,40]
})
df2 = df1[df1.Name.isin(['John','Smith','Wale','Tom','Menda','Yuswa'])]
df1
Name Age
0 John 23
1 Mike 45
2 Smith 12
3 Wale 34
4 Marry 27
5 Tom 44
6 Menda 28
7 Bolt 39
8 Yuswa 40
df2
Name Age
0 John 23
2 Smith 12
3 Wale 34
5 Tom 44
6 Menda 28
8 Yuswa 40
The difference between the two would be:
df1[~df1.isin(df2)].dropna()
Name Age
1 Mike 45.0
4 Marry 27.0
7 Bolt 39.0
Where:
df1.isin(df2) marks the values in df1 that are also in df2 (element-wise, aligned on index and columns).
~ (element-wise logical NOT) in front of the expression negates the results, so we get the elements in df1 that are NOT in df2, i.e. the difference between the two.
.dropna() drops the rows with NaN, presenting the desired output.
Note: this only works if len(df1) >= len(df2). If df2 is longer than df1 you can reverse the expression: df2[~df2.isin(df1)].dropna()
I found that the deepdiff library is a wonderful tool that also extends well to dataframes when a different level of detail is required or ordering matters. You can experiment with diffing to_dict('records'), to_numpy(), and other exports:
import pandas as pd
from deepdiff import DeepDiff
df1 = pd.DataFrame({
'Name':
['John','Mike','Smith','Wale','Marry','Tom','Menda','Bolt','Yuswa'],
'Age':
[23,45,12,34,27,44,28,39,40]
})
df2 = df1[df1.Name.isin(['John','Smith','Wale','Tom','Menda','Yuswa'])]
DeepDiff(df1.to_dict(), df2.to_dict())
# {'dictionary_item_removed': [root['Name'][1], root['Name'][4], root['Name'][7], root['Age'][1], root['Age'][4], root['Age'][7]]}
Symmetric Difference
If you are interested in the rows that are only in one of the dataframes, but not in both, you are looking for the symmetric difference:
pd.concat([df1,df2]).drop_duplicates(keep=False)
⚠️ Only works if neither dataframe contains any duplicates.
Set Difference / Relational Algebra Difference
If you are interested in the relational algebra difference / set difference, i.e. df1-df2 or df1\df2:
pd.concat([df1,df2,df2]).drop_duplicates(keep=False)
⚠️ Only works if neither dataframe contains any duplicates.
Another possible solution is to use numpy broadcasting (this assumes import numpy as np and that both frames share the same columns in the same order):
df1[np.all(~np.all(df1.values == df2.values[:, None], axis=2), axis=0)]
Output:
Name Age
1 Mike 45
4 Marry 27
7 Bolt 39
Using a lambda function, you can filter for the rows with _merge value "left_only" to get all the rows in df1 that are missing from df2:
df3 = df1.merge(df2, how='outer', indicator=True).loc[lambda x: x['_merge'] == 'left_only']
Try this one:
df_new = df1.merge(df2, how='outer', indicator=True).query('_merge == "left_only"').drop(columns='_merge')
It will result in a new dataframe with the differences: the values that exist in df1 but not in df2.

Check if row value in a dataframe exists in another dataframe using loop for reconciliation

I am looking to develop some generic logic that will allow me to perform reconciliation between 2 datasets.
I have 2 dataframes, and I want to loop through every row value in df1 and check if it exists in df2. If it does exist, I want to create a new column 'Match' in df1 with the value 'Yes'; if it does not, I want to append the missing values to a separate df which I will print to csv.
Example datasets:
df1:
ID Name Age
1 Adam 45
2 Bill 44
3 Claire 23
df2:
ID Name Age
1 Adam 45
2 Bill 44
3 Claire 23
4 Bob 40
5 Chris 21
The column names in the 2 dataframes I've used here are just for reference. But essentially I want to check if the row (1, Adam, 45) in df1 exists in df2.
The output for df3 would look like this:
df3:
ID Name Age
4 Bob 40
5 Chris 21
The updated df1 would look like this:
df1:
ID Name Age Match
1 Adam 45 Yes
2 Bill 44 Yes
3 Claire 23 Yes
To be clear, I understand that this can be done using a merge or isin, but I would like a fluid solution that can be used for any dataset.
I appreciate this might be a big ask as I haven't provided much of a guideline, but any help with this would be great!
Thanks!
You need to use merge here and utilize the indicator=True feature:
df_all = df1.merge(df2, on=['ID'], how='outer', indicator=True)

# rows found only in df2 -> the missing entries
df3 = (df_all[df_all['_merge'] == 'right_only']
       .drop(columns=['Name_x', 'Age_x'])
       .rename(columns={'Name_y': 'Name', 'Age_y': 'Age'})[['ID', 'Name', 'Age']])

# rows found in both -> the matched entries
df2 = (df_all[df_all['_merge'] == 'both']
       .drop(columns=['Name_x', 'Age_x'])
       .rename(columns={'Name_y': 'Name', 'Age_y': 'Age'})[['ID', 'Name', 'Age']])

print(df3)
print(df2)
df3:
ID Name Age
3 4 Bob 40
4 5 Chris 21
df2:
ID Name Age
0 1 Adam 45
1 2 Bill 44
2 3 Claire 23
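The question also asked to flag matches directly on df1 with a 'Match' column; a minimal sketch building on the df_all frame above (the 'No' label for non-matches is an assumption, as the question only showed 'Yes'):
# Mark each row of df1 whose ID was found in both frames
matched_ids = df_all.loc[df_all['_merge'] == 'both', 'ID']
df1['Match'] = df1['ID'].isin(matched_ids).map({True: 'Yes', False: 'No'})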

Rearranging data frame table Pandas Python

I have the following kind of data frame.
Id Name Exam Result Exam Result
1 Bob Maths 10 Physics 9
2 Mar ML 8 Chemistry 10
What I would like is to remove the duplicate columns and add their values to the corresponding rows, something like below:
Id Name Exam Result
1 Bob Maths 10
1 Bob Physics 9
2 Mar ML 8
2 Mar Chemistry 10
Is there any way to do this in Python?
Any help is appreciated!
First create a MultiIndex from the first columns, which are not duplicated, using DataFrame.set_index. Then build a MultiIndex in the columns whose second level is a counter of the duplicated names, via GroupBy.cumcount working on a Series (hence Index.to_series). Last, reshape with DataFrame.stack and use DataFrame.reset_index to remove the helper level and turn the remaining MultiIndex back into columns:
df = df.set_index(['Id','Name'])
s = df.columns.to_series()
df.columns = [s, s.groupby(s).cumcount()]
df = df.stack().reset_index(level=2, drop=True).reset_index()
print (df)
Id Name Exam Result
0 1 Bob Maths 10
1 1 Bob Physics 9
2 2 Mar ML 8
3 2 Mar Chemistry 10
This is an alternative using pandas melt:
#flip table into long format
(df.melt(['Id','Name'])
#sort by Id so that result follows immediately after Exam
.sort_values('Id')
#create new column on rows that have result in the variable column
.assign(Result=lambda x: x.loc[x['variable']=="Result",'value'])
.bfill()
#get rid of rows that contain 'result' in variable column
.query('variable != "Result"')
.drop(['variable'],axis=1)
.rename(columns={'value':'Exam'})
)
Id Name Exam Result
0 1 Bob Maths 10
4 1 Bob Physics 9
1 2 Mar ML 8
5 2 Mar Chemistry 10
Alternatively, just for fun:
df = df.set_index(['Id','Name'])
#get boolean of duplicated columns
dupes = df.columns.duplicated()
#concatenate first columns and their duplicates
pd.concat([df.loc[:, ~dupes],
           df.loc[:, dupes]]).sort_index()

pandas function to fill missing values from other dataframe based on matching column?

So I have two dataframes: one where certain columns are filled in, and one where other columns are filled in but some from the previous df are missing. Both share some common non-empty columns.
DF1:
FirstName Uid JoinDate BirthDate
Bob 1 20160628 NaN
Charlie 3 20160627 NaN
DF2:
FirstName Uid JoinDate BirthDate
Bob 1 NaN 19910524
Alice 2 NaN 19950403
Result:
FirstName Uid JoinDate BirthDate
Bob 1 20160628 19910524
Alice 2 NaN 19950403
Charlie 3 20160627 NaN
Assuming that these rows do not share index positions in their respective dataframes, is there a way that I can fill the missing values in DF1 with values from DF2 where the rows match on a certain column (in this example Uid)?
Also, is there a way to create a new entry in DF1 from DF2 if there isn't a match on that column (e.g. Uid) without removing rows in DF1 that don't match any rows in DF2?
EDIT: I updated the dataframes to add non-matching results in both dataframes that I need in the result df. I also updated my last question to reflect that.
UPDATE: you can do it by setting the proper indices and finally resetting the index of the joined DF:
In [14]: df1.set_index('FirstName').combine_first(df2.set_index('FirstName')).reset_index()
Out[14]:
FirstName Uid JoinDate BirthDate
0 Alice 2.0 NaN 19950403.0
1 Bob 1.0 20160628.0 19910524.0
2 Charlie 3.0 20160627.0 NaN
try this:
In [113]: df2.combine_first(df1)
Out[113]:
FirstName Uid JoinDate BirthDate
0 Bob 1 20160628.0 19910524
1 Alice 2 NaN 19950403
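Note that both answers above align on FirstName or on index position, while the question asks to match on a column such as Uid; a minimal sketch of that variant:
# Align both frames on Uid so rows match by ID rather than by position
result = (df1.set_index('Uid')
             .combine_first(df2.set_index('Uid'))
             .reset_index())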
