Find where three separate DataFrames overlap and create a new DataFrame - python

I have three separate DataFrames. Each DataFrame has the same columns - ['Email', 'Rating']. There are duplicate row values across all three DataFrames for the column Email. I'm trying to find the emails that appear in all three DataFrames and then create a new DataFrame from those rows. So far I have saved all three DataFrames to a list, dfs = [df1, df2, df3], and then concatenated them with df = pd.concat(dfs). I tried using groupby from there, but to no avail. Any help would be greatly appreciated.

You want to do a merge. Similar to a join in SQL, you can do an inner merge and treat the email like a foreign key. Here are the docs: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html
It would look something like this:
in_common = pd.merge(df1, df2, on=['Email'], how='inner')
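Since the question involves three DataFrames rather than two, a minimal sketch of extending this (assuming they are named df1, df2, and df3 as in the question, each with an 'Email' column) is to chain two inner merges:
import pandas as pd

# first keep the emails common to df1 and df2, suffixing the two Rating columns
common = pd.merge(df1, df2, on='Email', how='inner', suffixes=('_1', '_2'))
# then keep only the emails that also appear in df3
common = pd.merge(common, df3, on='Email', how='inner')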

You could try using .isin from pandas, e.g.:
df1[df1['Email'].isin(df2['Email'])]
This retrieves the rows where the value in the Email column appears in both DataFrames.
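To cover all three DataFrames from the question, a small sketch (assuming they are named df1, df2, and df3) chains two .isin masks:
# keep the rows of df1 whose Email also appears in df2 and in df3
mask = df1['Email'].isin(df2['Email']) & df1['Email'].isin(df3['Email'])
in_all_three = df1[mask]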
Another idea is to try an inner merge.
Good luck, and post your code next time.

Related

Merging Two Dataframes stacks rows instead of merging into one

I'm attempting to merge two dataframes using two columns as keys: "Date" and "Instrument"
Here is my code:
merge_df = pd.merge(df1, df2, how='outer', left_on=['Date', 'Instrument'], right_on=['Date', 'Instrument'])
df1:
df2:
You'll notice that the row in each dataframe has the same instrument and date value: AEA000201011 & 2008-01-31.
The merged dataframe is stacking the two rows instead of combining them:
merged_df:
I have ensured that the key column dtypes match in both dataframes:
df1:
df2:
Any advice would be much appreciated!
I wish I could use the comment section.
Even though you've probably already tried, have you tried using "left" or "right" instead of "outer"?
Or check the values directly, like:
df1["Instrument"].iloc[0] == df2["Instrument"].iloc[0]
Maybe they have some invisible characters in them. If that's the case, you can try using str.strip().
Nothing else comes to mind.
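A minimal sketch of that cleanup, assuming the string key is 'Instrument' as in the question:
import pandas as pd

# strip stray whitespace / invisible characters from the string key column,
# then retry the merge on both keys
df1['Instrument'] = df1['Instrument'].str.strip()
df2['Instrument'] = df2['Instrument'].str.strip()
merge_df = pd.merge(df1, df2, how='outer', on=['Date', 'Instrument'])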

Is it possible to find common values in two dataframes using Python?

I have a dataframe df1 containing the e-mails of people who downloaded a certain e-book, and another dataframe df2 with the e-mails of people who downloaded a different e-book.
I want to find the people that downloaded both e-books, or the common values between df1 and df2, using Python.
Is it possible to do that? How?
This has already been discussed; see the link below:
Find the common values in columns in Pandas dataframe
Assuming the two data frames are df1 and df2, each with an email column, you can do the following:
intersected_df = pd.merge(df1, df2, how='inner')
This data frame will contain the rows whose emails appear in both df1 and df2.
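A hedged refinement, assuming the shared column is literally named 'Email': pass the key explicitly so the merge doesn't silently join on every column the two frames happen to share, and drop repeated matches:
import pandas as pd

intersected_df = pd.merge(df1, df2, on='Email', how='inner').drop_duplicates(subset='Email')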
Dump the emails from df1 into a set, in order to avoid duplicates.
Dump the emails from df2 into a set, for the same reason.
Find the intersection of these two sets, as such:
set1 = set(df1.Emails)
set2 = set(df2.Emails)
common = set1.intersection(set2)
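If you then want those rows back as a DataFrame rather than a bare set, one small follow-up (still assuming the column is named Emails) is to filter df1 with the intersection:
# rows of df1 whose e-mail also appears in df2
common_rows = df1[df1['Emails'].isin(common)]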
I believe you should merge the two dataframes:
merged = pd.merge(df1, df2, how='inner', on=['e-mails'])
and then drop the Nan values:
merged.dropna(inplace=True)

How to inner join in pandas as in SQL - stuck in the problem below

I have two DataFrames, the first named "df" and the second "topwud".
df
topwud
When I join these two dataframes with an inner join, using BOMCPNO and PRTNO as the join columns, like
second_level = pd.merge(df, top_wud, left_on='BOMCPNO', right_on='PRTNO', how='inner').drop_duplicates()
Then I got this data frame
Result
I don't want the common column coming out as PRTNO_x and PRTNO_y; I want to keep only PRTNO_x in my result dataframe, under its default name "PRTNO".
Kindly help me :)
Try this:
pd.merge(df1, top_wud, on=['BOMCPNO', 'PRTNO'])
What this will do, though, is return only the rows where BOMCPNO and PRTNO exist in both dataframes, as the default merge type is an inner merge.
So what you could do is compare this merged df's size with your first one and see if they are the same; if so, you could do a merge on both columns, or just drop/rename the _x/_y suffixed columns (see the sketch after this answer).
I would spend some time, though, determining whether these values are indeed the same and exist in both dataframes, in which case you may wish to perform an outer merge:
pd.merge(df1, df2, on=['A', 'B'], how='outer')
Then what you could do is then drop duplicate rows (and possibly any NaN rows) and that should give you a clean merged dataframe.
merged_df.drop_duplicates(subset=['BOMCPNO', 'PRTNO'], inplace=True)
Also try other types of join, as I don't know exactly what you want; I think it's a left join you need.
Check whether this solves your problem.
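To end up with a single PRTNO column, as the question asks, one hedged cleanup (using the merge from the question, where the suffixed columns come out as PRTNO_x and PRTNO_y) is to drop the right-hand copy and rename the survivor back to its default name:
import pandas as pd

second_level = pd.merge(df, top_wud, left_on='BOMCPNO', right_on='PRTNO',
                        how='inner').drop_duplicates()

# drop the right-hand key copy and restore the default column name
second_level = (second_level
                .drop(columns=['PRTNO_y'])
                .rename(columns={'PRTNO_x': 'PRTNO'}))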

How to compare two dataframes and create a new one for those entries which are the same across two columns in the same row

I have been trying to compare two dataframes, creating a new dataframe for the entries which have the same values in two columns. I thought I had cracked it, but the code I have now just looks at the two columns of interest and, if a string is found anywhere in a column, considers it a match. I need the two strings to match on the same row across both columns. A sample of the code follows.
#produce table with common items
vto_in_jeff = df_vto[(df_vto['source'].isin(df_jeff['source']) & df_vto['target'].isin(df_jeff['target']))].dropna().reset_index(drop=True)
#vto_in_jeff.index = vto_in_jeff.index + 1
vto_in_jeff['compare'] = 'Common_terms'
print(vto_in_jeff)
vto_in_jeff.to_csv(output_path+'vto_in_'+f+'.csv', index=False)
So this code produces a table listing the rows which contain both the source and target strings, but the source and target strings don't necessarily appear in the same row. Can anyone help me compare specifically row by row?
You can use the pandas merge method:
result = pd.merge(df1, df2, on='key')
Here are more details:
https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html#brief-primer-on-merge-methods-relational-algebra
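Applied to the frames in the question, a minimal sketch (assuming df_vto and df_jeff both have 'source' and 'target' columns) merges on both columns at once, so a match has to occur on the same row:
import pandas as pd

# inner merge on both columns: a row survives only if the same
# (source, target) pair appears in df_vto and in df_jeff
vto_in_jeff = pd.merge(df_vto, df_jeff, on=['source', 'target'], how='inner')
vto_in_jeff['compare'] = 'Common_terms'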

Create a dataframe by discarding intersections of two dataframes (Pandas)

Does anyone know of an efficient way to create a new dataframe based off of two dataframes in Python/Pandas?
What I am trying to do is check whether a value from df1 is in df2, and if it is, not add that row to df3. I am working with student IDs, and if a student ID from df1 is in df2, I do not want to include it in the new dataframe, df3.
So does anybody know an efficient way to do this? I have googled and looked on SO, but found nothing that works so far.
Assuming the column is called ID.
df3 = df1[~df1["ID"].isin(df2["ID"])].copy()
If both dataframes have the same length, you can also use:
print(df1.loc[df1['ID'] != df2['ID']])
and assign it to a third dataframe.
