Create a dataframe by discarding intersections of two dataframes (Pandas) - python

Does anyone know of an efficient way to create a new dataframe based off of two dataframes in Python/Pandas?
What I am trying to do is check whether a value from df1 is in df2 and, if it is, leave that row out of df3. I am working with student IDs: if a student ID from df1 is in df2, I do not want to include it in the new dataframe, df3.
So does anybody know an efficient way to do this? I have googled and looked on SO, but found nothing that works so far.

Assuming the column is called ID.
df3 = df1[~df1["ID"].isin(df2["ID"])].copy()
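
A minimal sketch of the isin approach on made-up data (the frames, the "name" column and all values are illustrative assumptions; only the "ID" column name comes from the answer). It also shows what should be an equivalent anti-join using merge with indicator=True, in case more than one key column is needed later.

```python
import pandas as pd

# Hypothetical sample data; only the "ID" column name comes from the answer.
df1 = pd.DataFrame({"ID": [1, 2, 3, 4], "name": ["Ann", "Bob", "Cy", "Di"]})
df2 = pd.DataFrame({"ID": [2, 4, 5]})

# Keep the rows of df1 whose ID does not occur anywhere in df2.
df3 = df1[~df1["ID"].isin(df2["ID"])].copy()

# Anti-join alternative: left merge, then keep rows found only on the left.
# drop_duplicates stops repeated IDs in df2 from duplicating df1 rows.
merged = df1.merge(
    df2[["ID"]].drop_duplicates(), on="ID", how="left", indicator=True
)
df3_alt = merged.loc[merged["_merge"] == "left_only", df1.columns]
df3_alt = df3_alt.reset_index(drop=True)

print(df3)
```

Both df3 and df3_alt should contain only the students whose ID is absent from df2. Note that isin keeps the original index of df1, while the merge route produces a fresh one.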

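If both dataframes have the same length and identically labelled indexes, you can also compare the ID columns row by row:
print(df1.loc[df1['ID'] != df2['ID']])
Note that this compares the values position by position rather than testing membership, so it only gives the same result as isin when the rows of the two dataframes line up. Assign the result to a third dataframe.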

Related

How to get rows from one dataframe based on another dataframe

I just edited the question as maybe I didn't make myself clear.
I have two dataframes (MR and DT)
The column 'A' in dataframe DT is a subset of the column 'A' in dataframe MR. The two dataframes are similar (not equal) only in this ID column; the rest of the columns are different, as is the number of rows.
How can I get the rows from MR whose 'ID' values also appear in DT['ID'], knowing that values in 'ID' can appear several times in the same column?
DT has 1538 rows and MR has 2060 rows.
I tried some lines proposed here: https://stackoverflow.com/questions/28901683/pandas-get-rows-which-are-not-in-other-dataframe but I got bizarre results, as I don't fully understand the methods proposed there (and the goal is a little different).
Thanks!
Take a look at the pandas.Series.isin() method. In your case you'd want to use something like:
matching_id = MR.ID.isin(DT.ID) # This returns a boolean Series of whether values match or not
# Now filter your dataframe to keep only matching rows
new_df = MR.loc[matching_id, :]
Or if you want to just get a new dataframe of combined records for the same ID you need to use merge():
new_df = pd.merge(MR, DT, on='ID')
This will create a new dataframe with columns from both original dfs but only where ID is the same.
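
Since the question mentions that IDs can repeat, here is a small sketch of how the two suggestions differ in that case. The frames below are hypothetical stand-ins for MR and DT, and the mr_val and dt_val columns are invented for illustration.

```python
import pandas as pd

# Hypothetical stand-ins for MR and DT; IDs may repeat, as in the question.
MR = pd.DataFrame({"ID": [10, 10, 20, 30], "mr_val": ["a", "b", "c", "d"]})
DT = pd.DataFrame({"ID": [10, 30, 30], "dt_val": [1.0, 2.0, 3.0]})

# isin keeps each MR row at most once, however often its ID occurs in DT.
matching_id = MR["ID"].isin(DT["ID"])
filtered = MR.loc[matching_id, :]

# merge pairs every matching MR row with every matching DT row,
# so repeated IDs multiply the number of output rows.
combined = pd.merge(MR, DT, on="ID")

print(len(filtered), len(combined))
```

With this data the isin filter returns three rows of MR, while the merge returns four combined rows, because ID 30 occurs twice in DT. If only the rows of MR are wanted, isin is probably the safer choice.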

Finding non-matching rows between two dataframes

I have a scenario where I want to find non-matching rows between two dataframes. Both dataframes will have around 30 columns and an id column that uniquely identifies each record/row. So, I want to check whether a row in df1 is different from the corresponding row in df2. df1 is an updated dataframe and df2 is the previous version.
I have tried pd.concat([df1, df2]).drop_duplicates(keep=False), but it just combines both dataframes. Is there a way to do this? I would really appreciate the help.
The sample data looks like this for both dfs.
id user_id type status
There will be a total of 39 columns, which may have NULL values in them.
Thanks.
P.S. df2 will always be a subset of df1.
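If df1 and df2 have the same shape and the same row and column labels, you can compare them cell by cell with this code:
df3 = pd.DataFrame(np.where(df1 == df2, True, False), columns=df1.columns)
The output shows False for every cell whose values do not match. Note that NaN never compares equal to NaN, so cells that are NULL in both dataframes also show False.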
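
Because the question says df2 is a subset of df1 (so the shapes differ) and that columns may contain NULLs, a merge-based sketch may fit better. It is only one possible approach, not the answerer's method. The data is invented; only the column names id, user_id, type and status come from the question. It relies on the documented pandas behaviour that missing values in merge keys are matched against each other.

```python
import numpy as np
import pandas as pd

# Hypothetical data using the column names from the question.
df1 = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "user_id": [100, 101, 102, 103],
    "type": ["a", "b", "c", "d"],
    "status": [1.0, np.nan, 3.0, 4.0],
})
# Previous version: id 4 did not exist yet and id 3 had another status.
df2 = pd.DataFrame({
    "id": [1, 2, 3],
    "user_id": [100, 101, 102],
    "type": ["a", "b", "c"],
    "status": [1.0, np.nan, 9.0],
})

# Left merge on all shared columns; rows of df1 without an identical
# partner in df2 are flagged "left_only".
flagged = df1.merge(df2, how="left", indicator=True)
changed_or_new = flagged.loc[flagged["_merge"] == "left_only"]
changed_or_new = changed_or_new.drop(columns="_merge")

print(changed_or_new["id"].tolist())
```

Rows that are new in df1 and rows whose values changed both end up in changed_or_new. Checking changed_or_new["id"].isin(df2["id"]) would separate the changed rows from the new ones.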

Find where three separate DataFrames overlap and create a new DataFrame

I have three separate DataFrames. Each DataFrame has the same columns - ['Email', 'Rating']. There are duplicate row values in all three DataFrames for the column Email. I'm trying to find the emails that appear in all three DataFrames and then create a new DataFrame based on those rows. So far I have saved all three DataFrames to a list like this: dfs = [df1, df2, df3], and then concatenated them using df = pd.concat(dfs). I tried using groupby from here, but to no avail. Any help would be greatly appreciated.
You want to do a merge. Similar to a join in SQL, you can do an inner merge and treat the email like a foreign key. Here are the docs: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html
It would look something like this:
in_common = pd.merge(df1, df2, on=['Email'], how='inner')
You could try using .isin from pandas, e.g.:
df[df['Email'].isin(df2['Email'])]
This would retrieve row entries where the values for the column email are the same in the two dataframes.
Another idea is to try an inner merge.
Good luck, and post code next time.
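
The snippets above only compare two frames at a time. A possible way to extend the isin idea to all three is sketched below; the sample emails and ratings are invented, and only the column names and the dfs list come from the question.

```python
import pandas as pd

# Hypothetical frames with the columns named in the question.
df1 = pd.DataFrame({"Email": ["a@x.com", "b@x.com", "a@x.com"],
                    "Rating": [1, 2, 3]})
df2 = pd.DataFrame({"Email": ["a@x.com", "c@x.com"], "Rating": [4, 5]})
df3 = pd.DataFrame({"Email": ["a@x.com", "b@x.com"], "Rating": [6, 7]})
dfs = [df1, df2, df3]

# Emails present in every frame: intersect the per-frame sets.
common = set.intersection(*(set(d["Email"]) for d in dfs))

# Keep the rows, from all three frames, whose Email is in that set.
df = pd.concat(dfs, ignore_index=True)
in_all_three = df[df["Email"].isin(common)]

print(in_all_three)
```

Working with sets sidesteps the row multiplication that chained inner merges would cause when an email is duplicated within a frame.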

How to map two dataframes keeping the value same for one dataframe

I'm trying to write a script for a few ETL transformations. I have 34 fixed columns (df1), to which I have to map the column names of different input files containing different columns (df2).
df1(Standard Columns):
df2:
I have tried df.merge but that does not seem to solve my problem.
The expected result is for the columns in the input file df2 to be mapped to the same column names as in df1, in the same order as they appear in df2, with their original values intact.
Expected Result :
Any help will be greatly appreciated!
One way to do this is to add an intermediate step that maps the column names.
For instance:
df2 = df2.rename(columns={'Department Code': 'Field 1 Dept Number', 'Column2': '2_column', .....})
And then you can merge the two dataframes on the columns of interest.
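
A small sketch of the renaming step, under stated assumptions: only the 'Department Code' to 'Field 1 Dept Number' pair comes from the answer, while the other column names, the mapping dictionary and the values are hypothetical.

```python
import pandas as pd

# Hypothetical standard column list (the question has 34; two shown here).
standard_columns = ["Field 1 Dept Number", "Field 2 Amount"]

# Hypothetical input file with differently named columns.
df2 = pd.DataFrame({"Amt": [9.5, 3.0], "Department Code": ["D1", "D2"]})

# Mapping from input names to standard names; "Amt" is an invented example.
column_map = {
    "Department Code": "Field 1 Dept Number",
    "Amt": "Field 2 Amount",
}

# rename returns a new frame; values and column order are left untouched.
renamed = df2.rename(columns=column_map)

# Optional: reorder to the standard layout. Standard columns missing from
# the input would appear as all-NaN columns.
standardized = renamed.reindex(columns=standard_columns)

print(list(renamed.columns))
```

If the original column order of df2 must be kept, as the question asks, renamed is already the result and the reindex step can be skipped.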

Pandas merge DataFrames based on index/column combination

I have two DataFrames that I want to merge. I have read about merging on multiple columns, and preserving the index when merging. My problem needs to cater for both, and I am having difficulty figuring out the best way to do this.
The first DataFrame looks like this
and the second looks like this
I want to merge these based on the Date and the ID. In the first DataFrame the Date is the index and the ID is a column; in the second DataFrame both Date and ID are part of a MultiIndex.
Essentially, as a result I want a DataFrame that looks like DataFrame 2 with an additional column for the Events from DataFrame 1.
I'd suggest resetting the index (reset_index) and then merging the DataFrames, as you've read. Then you can set the index (set_index) to reproduce your desired MultiIndex.
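
The steps above could look roughly like this. The two frames are hypothetical reconstructions, since the originals are not shown; only the names Date, ID and Events come from the question, and the Value column is invented.

```python
import pandas as pd

# Hypothetical first frame: Date is the index, ID is a column.
events = pd.DataFrame(
    {"ID": [1, 2], "Events": ["x", "y"]},
    index=pd.Index(pd.to_datetime(["2020-01-01", "2020-01-02"]), name="Date"),
)

# Hypothetical second frame: Date and ID together form a MultiIndex.
values = pd.DataFrame(
    {"Value": [10, 20, 30]},
    index=pd.MultiIndex.from_arrays(
        [pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-02"]), [1, 2, 3]],
        names=["Date", "ID"],
    ),
)

# Turn both indexes into ordinary columns, merge on the two keys,
# then rebuild the MultiIndex of the second frame.
merged = (
    values.reset_index()
    .merge(events.reset_index(), on=["Date", "ID"], how="left")
    .set_index(["Date", "ID"])
)

print(merged)
```

A left merge keeps every row of the second frame; rows without a matching event get NaN in the Events column.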
