pandas merge on columns with different names and avoid duplicates [duplicate] - python

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 4 years ago.
How can I merge two pandas DataFrames on two columns with different names and keep one of the columns?
df1 = pd.DataFrame({'UserName': [1,2,3], 'Col1':['a','b','c']})
df2 = pd.DataFrame({'UserID': [1,2,3], 'Col2':['d','e','f']})
pd.merge(df1, df2, left_on='UserName', right_on='UserID')
This produces a DataFrame that contains both key columns, UserName and UserID.
Since I am merging on UserName and UserID, the two columns hold the same values, so I want the result to keep only one of them. Is there a clean way to do this?
The only ways I can think of are either renaming the columns to match before the merge, or dropping one of them after the merge. It would be nice if pandas dropped one of them automatically, or if I could do something like
pd.merge(df1, df2, left_on='UserName', right_on='UserID', keep_column='left')

How about setting UserID as the index of the second data frame and then joining on that index?
pd.merge(df1, df2.set_index('UserID'), left_on='UserName', right_index=True)
#   Col1  UserName Col2
# 0    a         1    d
# 1    b         2    e
# 2    c         3    f

There is no particularly clean way to do this. pandas keeps both columns by design, because in the more general cases (left, right, or outer joins) the two key columns can carry different information: a row with no match on one side has a missing value in that side's key column. Don't try to over-engineer your merge line; be explicit, as you suggest.
Solution 1:
# rename by name rather than by position, so the result does not depend on df2's column order
pd.merge(df1, df2.rename(columns={'UserID': 'UserName'}), on='UserName')
Out[67]:
  Col1  UserName Col2
0    a         1    d
1    b         2    e
2    c         3    f
Solution 2:
pd.merge(df1, df2, left_on='UserName', right_on='UserID').drop('UserID', axis=1)
Out[71]:
  Col1  UserName Col2
0    a         1    d
1    b         2    e
2    c         3    f
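For comparison, here is a minimal, self-contained sketch that runs the three approaches discussed above (rename then merge, merge then drop, and merging against an index) on the question's sample frames. The variable names renamed, dropped, and indexed are illustrative only and do not appear in the original posts.

```python
import pandas as pd

df1 = pd.DataFrame({'UserName': [1, 2, 3], 'Col1': ['a', 'b', 'c']})
df2 = pd.DataFrame({'UserID': [1, 2, 3], 'Col2': ['d', 'e', 'f']})

# Rename the right-hand key so both frames share a single key column.
renamed = pd.merge(df1, df2.rename(columns={'UserID': 'UserName'}), on='UserName')

# Merge on the differently named keys, then drop the redundant key column.
dropped = pd.merge(df1, df2, left_on='UserName', right_on='UserID').drop(columns='UserID')

# Move the right-hand key into the index and merge against that index.
indexed = pd.merge(df1, df2.set_index('UserID'), left_on='UserName', right_index=True)

for frame in (renamed, dropped, indexed):
    print(frame, end='\n\n')
```

None of these calls modifies df1 or df2 in place, so the approaches can be tried side by side on the same inputs.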

Convert two dataframes to numpy arrays for pairwise comparison [duplicate]

This question already has answers here:
set difference for pandas
(12 answers)
Closed 2 years ago.
I have two incredibly large dataframes, df1 and df2. Their sizes are below:
print(df1.shape) #444500 x 3062
print(df2.shape) #254232 x 3062
I know that each value of df2 appears in df1, and what I am looking to do is build a third dataframe that is the difference of the two, meaning, all of the rows that appear in df1 that do not appear in df2.
I have tried using the below method from this question:
df3 = (pd.merge(df2,df1, indicator=True, how='outer')
.query('_merge=="left_only"').drop('_merge', axis=1))
But I keep getting a MemoryError when I run it.
So I am now trying to do the following instead:
Loop through each row of df1
Check whether that row appears in df2
If it does, skip it
If not, add it to a list
What matters to me is row equality: two rows match only if all of their elements match pairwise, in order. For example
[1,2,3]
[1,2,3]
is a match, while:
[1,2,3]
[1,3,2]
is not a match
I am now trying:
for i in notebook.tqdm(range(svm_data.shape[0])):
    real_row = np.asarray(real_data.iloc[[i]].to_numpy())
    synthetic_row = np.asarray(svm_data.iloc[[i]].to_numpy())
    if (np.array_equal(real_row, synthetic_row)):
        continue
    else:
        list_of_rows.append(list(synthetic_row))
    gc.collect()
But for some reason, this is not finding the values in the rows themselves, so I am clearly still doing something wrong.
Note, I also tried:
df3 = df1[~df1.isin(df2)].dropna(how='all')
but that yielded incorrect results.
How can I (in a memory-efficient way) find all of the rows in one of my dataframes that do not appear in the other?
DATA
df1:
1,0,0.0,0,0,0,0,0,0.0,2
1,0,0.0,0,0,0,0,0,0.0,2
1,0,0.0,0,0,0,0,0,0.0,4
1,0,0.0,0,0,0,0,0,0.0,2
1,0,0.0,0,0,0,0,0,0.0,8
1,0,0.0,0,0,0,0,0,0.0,8
1,0,0.0,0,0,0,0,0,0.0,8
1,0,0.0,0,0,0,0,0,0.0,4
1,0,0.0,0,0,0,0,0,0.0,4
1,0,0.0,0,0,0,0,0,0.0,2
df2:
1,0,0.0,0,0,0,0,0,0.0,2
1,0,0.0,0,0,0,0,0,0.0,3
1,0,0.0,0,0,0,0,0,0.0,4
1,0,0.0,0,0,0,0,0,2.0,2
1,0,0.0,0,0,0,0,0,0.0,8
1,0,0.0,0,0,1,0,0,0.0,8
1,0,0.0,0,0,0,0,0,0.0,8
1,0,0.0,0,0,0,0,0,0.0,4
1,0,0.0,0,0,0,0,0,0.0,4
1,0,0.0,5,0,0,0,0,0.0,4
Let's try concat and groupby to identify duplicate rows:
# sample data
df1 = pd.DataFrame([[1,2,3],[1,2,3],[4,5,6],[7,8,9]])
df2 = pd.DataFrame([[4,5,6],[7,8,9]])
s = (pd.concat((df1,df2), keys=(1,2))
.groupby(list(df1.columns))
.ngroup()
)
# `s.loc[1]` corresponds to rows in df1
# `s.loc[2]` corresponds to rows in df2
df1_in_df2 = s.loc[1].isin(s.loc[2])
df1[df1_in_df2]
Output:
   0  1  2
2  4  5  6
3  7  8  9
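The snippet above selects the rows of df1 that also occur in df2. Since the question asks for the opposite set, a small sketch of the same idea with the mask negated may help. The name df1_only is illustrative, and this sketch makes no claim about memory use on frames of the size described in the question.

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2, 3], [1, 2, 3], [4, 5, 6], [7, 8, 9]])
df2 = pd.DataFrame([[4, 5, 6], [7, 8, 9]])

# Give every distinct row across both frames a group number.
s = (pd.concat((df1, df2), keys=(1, 2))
       .groupby(list(df1.columns))
       .ngroup())

# s.loc[1] holds the group numbers of df1's rows, s.loc[2] those of df2's rows.
df1_in_df2 = s.loc[1].isin(s.loc[2])

# Negate the mask to keep only the rows of df1 that never occur in df2.
df1_only = df1[~df1_in_df2]
print(df1_only)
```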
Update: another option is to merge on the de-duplicated df2:
df1.merge(df2.drop_duplicates(), on=list(df1.columns), indicator=True, how='left')
Output (you should be able to guess which rows you need from there):
   0  1  2     _merge
0  1  2  3  left_only
1  1  2  3  left_only
2  4  5  6       both
3  7  8  9       both
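To make the last step explicit, the rows that exist only in df1 are the ones flagged left_only. A minimal sketch on the same sample data follows; the names merged and df1_only are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2, 3], [1, 2, 3], [4, 5, 6], [7, 8, 9]])
df2 = pd.DataFrame([[4, 5, 6], [7, 8, 9]])

# Left merge against the de-duplicated df2, so each df1 row matches at most once.
merged = df1.merge(df2.drop_duplicates(), on=list(df1.columns),
                   indicator=True, how='left')

# Keep the rows found only on the left side, then drop the helper column.
df1_only = merged.loc[merged['_merge'] == 'left_only'].drop(columns='_merge')
print(df1_only)
```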

Will passing ignore_index=True to pd.concat preserve index succession within dataframes that I'm concatenating?

I have two dataframes:
df1 =
value
0 a
1 b
2 c
df2 =
value
0 d
1 e
I need to concatenate them across index, but I have to preserve the index of the first dataframe and continue it in the second dataframe, like this:
result =
value
0 a
1 b
2 c
3 d
4 e
My guess is that pd.concat([df1, df2], ignore_index=True) will do the job. However, I'm worried that for large dataframes the order of the rows may be changed and I'll end up with something like this (first two rows changed indices):
result =
value
0 b
1 a
2 c
3 d
4 e
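So my question is: does pd.concat with ignore_index=True preserve the row order within the dataframes being concatenated, or is there any randomness in the index assignment?
In my experience, pd.concat concatenates the rows in the order in which the DataFrames are passed to it and keeps the row order within each DataFrame. ignore_index=True only replaces the index labels with 0, 1, ..., n-1.
The sort argument does not affect row order here; it only controls whether the columns (the non-concatenation axis) are sorted when they are not aligned. Passing sort=False avoids that column sorting: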
pd.concat([df1, df2], axis=0, sort=False, ignore_index=True)
value
0 a
1 b
2 c
3 d
4 e
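As a quick check of the row order, here is a small sketch that uses deliberately non-sequential index labels. These labels are an assumption made for illustration and are not part of the question's data.

```python
import pandas as pd

# Index labels are intentionally out of order to show they do not affect row order.
df1 = pd.DataFrame({'value': ['a', 'b', 'c']}, index=[10, 5, 7])
df2 = pd.DataFrame({'value': ['d', 'e']}, index=[3, 1])

# Rows are stacked in the order given; only the labels are replaced.
result = pd.concat([df1, df2], ignore_index=True)
print(result)
```

The rows come out in the order a, b, c, d, e with a fresh 0 to 4 index, regardless of the original labels.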

join two pandas dataframe using a specific column

I am new to pandas and I am trying to join two dataframes based on the equality of one specific column. For example, suppose that I have the following:
df1
A B C
1 2 3
2 2 2
df2
A B C
5 6 7
2 8 9
Both dataframes have the same columns and the value of only one column (say A) might be equal. What I want as output is this:
df3
A B C B C
2 8 9 2 2
The values for column 'A' are unique in both dataframes.
Thanks
pd.concat([df1.set_index('A'),df2.set_index('A')], axis=1, join='inner')
If you wish to maintain column A as a non-index, then:
pd.concat([df1.set_index('A'),df2.set_index('A')], axis=1, join='inner').reset_index()
Alternatively, you could just do:
df3 = df1.merge(df2, on='A', how='inner', suffixes=('_1', '_2'))
And then you can keep track of each value's origin
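A runnable sketch of both approaches on the question's sample data is shown here. The names via_concat and via_merge are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2], 'B': [2, 2], 'C': [3, 2]})
df2 = pd.DataFrame({'A': [5, 2], 'B': [6, 8], 'C': [7, 9]})

# Align on A via the index; the result has duplicate column names B and C.
via_concat = pd.concat([df1.set_index('A'), df2.set_index('A')],
                       axis=1, join='inner').reset_index()

# Merge on A; the suffixes record which frame each column came from.
via_merge = df1.merge(df2, on='A', how='inner', suffixes=('_1', '_2'))

print(via_concat)
print(via_merge)
```

Both versions put df1's columns first. To get df2's values first, as in the desired output in the question, pass the frames in the opposite order.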

Combine two Pandas dataframes with the same index [duplicate]

This question already has an answer here:
What are the 'levels', 'keys', and names arguments for in Pandas' concat function?
(1 answer)
Closed 4 years ago.
I have two dataframes with the same index but different columns. How do I combine them into one with the same index but containing all the columns?
I have:
A
1 10
2 11
B
1 20
2 21
and I need the following output:
A B
1 10 20
2 11 21
pandas.concat([df1, df2], axis=1)
You've got a few options depending on how complex the dataframe is:
Option 1:
df1.join(df2, how='outer')
Option 2:
pd.merge(df1, df2, left_index=True, right_index=True, how='outer')
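The following minimal sketch builds the two frames from the question and applies all three suggestions, assuming the index labels 1 and 2 shown there. The result names are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [10, 11]}, index=[1, 2])
df2 = pd.DataFrame({'B': [20, 21]}, index=[1, 2])

# Column-wise concatenation aligns the frames on their index.
via_concat = pd.concat([df1, df2], axis=1)

# join aligns on the index by default.
via_join = df1.join(df2, how='outer')

# merge can be told to use the index on both sides.
via_merge = pd.merge(df1, df2, left_index=True, right_index=True, how='outer')

print(via_concat)
```

When the two indexes are identical, as they are here, all three calls give the same table.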

Combining DataFrames without Nans

I have two DataFrames. One maps IDs to values. The other has multiple entries for these IDs. I want a DataFrame that contains the rows of the second one, with the values from the first assigned to the respective IDs.
df1 =
   Val1  Val2  Val3
x  1000     2     0
y  2000     3     9
z  3000     1     8
df2 =
          foo ID bar
0   something  y   a
1     nothing  y   b
2  everything  x   c
3         who  z   d
result =
          foo ID bar  Val1  Val2  Val3
0   something  y   a  2000     3     9
1     nothing  y   b  2000     3     9
2  everything  x   c  1000     2     0
3         who  z   d  3000     1     8
I've tried merge and join (obviously incorrectly) but I am getting a bunch of NaNs when I do that. It appears that I am getting NaNs on every alternate ID.
I have also tried indexing both DFs by ID but that didn't seem to help either. I am obviously missing something that I am guessing is a core functionality but I can't get my head around it.
merge and join could both get you the result DataFrame you want. Since one of your DataFrames is indexed by ID and the other has just an integer index, merge is the logical choice.
Merge:
# use ID as the column to join on in df2 and the index of df1
result = df2.merge(df1, left_on="ID", right_index=True, how="inner")
Join:
df2.set_index("ID", inplace=True) # index df2 in place so you can use join, which merges by index by default
result = df2.join(df1, how="inner") # join df1 by index
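Here is a self-contained sketch of the merge approach on the question's data. It also shows a variant that uses join with its on parameter, which matches a column of the calling frame against the other frame's index and so keeps ID as a regular column. The names via_merge and via_join are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame(
    {'Val1': [1000, 2000, 3000], 'Val2': [2, 3, 1], 'Val3': [0, 9, 8]},
    index=['x', 'y', 'z'],
)
df2 = pd.DataFrame({
    'foo': ['something', 'nothing', 'everything', 'who'],
    'ID': ['y', 'y', 'x', 'z'],
    'bar': ['a', 'b', 'c', 'd'],
})

# Match df2's ID column against df1's index.
via_merge = df2.merge(df1, left_on='ID', right_index=True, how='inner')

# Same lookup with join: `on` names the column in df2 to match to df1's index.
via_join = df2.join(df1, on='ID', how='inner')

print(via_merge)
```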
