I have two dataframes: df1 and df2.
I want to eliminate all occurrences of df2 rows in df1.
Basically, this is the set difference operator but for dataframes.
My ask is very similar to this question, with one major variation: it's possible that df1 has no rows in common with df2 at all. In that case, if we concat the two dataframes and then drop the duplicates, it still doesn't eliminate the df2 occurrences from df1. In fact, it adds df2's rows to the result.
The question is also similar to this one, except that I want the operation applied to rows.
Example:
Case 1:
df1:
A,B,C,D
E,F,G,H
df2:
E,F,G,H
Then, df1-df2:
A,B,C,D
Case 2:
df1:
A,B,C,D
df2:
E,F,G,H
Then, df1 - df2:
A,B,C,D
Put simply, I am looking for a way to do df1 - df2 (remove df2's rows wherever they appear in df1). How should this be done?
Set difference will work here: np.setdiff1d returns the unique values in ar1 that are not in ar2. Note that it flattens its inputs, so it compares individual values rather than whole rows.
np.setdiff1d(df1, df2)
Or, to get the result in the form of a DataFrame:
pd.DataFrame([np.setdiff1d(df1, df2)])
Or try an isin-based filter. Note that df1.isin(df2) aligns on labels and compares element-wise, so for a row-wise difference it is safer to compare whole rows as tuples:
df1[~df1.apply(tuple, axis=1).isin(df2.apply(tuple, axis=1))]
A,B,C,D
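A row-wise set difference can also be sketched with merge's indicator flag; this is a minimal sketch whose positional columns and sample data mirror the example above:

import pandas as pd

df1 = pd.DataFrame([list("ABCD"), list("EFGH")])
df2 = pd.DataFrame([list("EFGH")])

# Left-merge on all shared columns; '_merge' labels each row's origin
merged = df1.merge(df2, how='left', indicator=True)

# Keep only the rows that exist in df1 alone, then drop the helper column
result = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')
print(result)  # A,B,C,D (Case 2 works too, since nothing matches)

If df2 can contain duplicate rows, de-duplicate it first (df2.drop_duplicates()) so the merge does not multiply matching rows.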
Related
I have 2 dataframes. One has a bunch of columns including f_uuid. The other dataframe has 2 columns, f_uuid and i_uuid.
The first dataframe may contain some f_uuid values that the second one doesn't, and vice versa.
I want the first dataframe to have a new column i_uuid (from the second dataframe) populated with the appropriate values for the matching f_uuid in that first dataframe.
How would I achieve this?
df1 = pd.merge(df1,
               df2,
               on='f_uuid')
If you want to keep all f_uuid from df1 (e.g. those not available in df2), you may run
df1 = pd.merge(df1,
               df2,
               on='f_uuid',
               how='left')
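For instance, a quick check of the left-join behaviour (a sketch with illustrative data; f_uuid values missing from df2 come back with NaN in i_uuid):

import pandas as pd

df1 = pd.DataFrame({"f_uuid": ["a", "b", "c"], "other": [1, 2, 3]})
df2 = pd.DataFrame({"f_uuid": ["a", "c", "d"], "i_uuid": ["x", "y", "z"]})

# 'b' has no match in df2, so its i_uuid comes back as NaN;
# 'd' exists only in df2 and is dropped by the left join
out = pd.merge(df1, df2, on='f_uuid', how='left')
print(out)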
I think what you're looking for is a merge: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html?highlight=merge#pandas.DataFrame.merge
In your case, that would look like:
bunch_of_col_df.merge(other_df, on="f_uuid")
I have 2 Dataframes (df1 and df2). One column common to both dataframes is 'Conct', which is a concatenation of multiple columns. The goal is to find whether ANY string in df1 exists in df2.
I tried to use merge and isin, but they all look for exact matches.
Example:
Data in df1:
Conct
ABC_IronMan_x_nmc
xyz
Data in df2:
Conct
OPT_IronMan_b_efd
GGH
In this example I want to get only those rows in df2 that match "IronMan" from df1.
Assuming we can split the strings in df1 on '_', and we are just looking for rows that contain ANY of these strings, then:
df2[df2['Conct'].apply(lambda x: any([any([string in x for string in row]) for row in df1['Conct'].str.split('_')]))]
However, this will also match very short df1 substrings such as 'x', which is likely far too broad; in that case you should redefine your goal.
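An alternative sketch that avoids the nested loops: build one regex alternation from the df1 tokens (this assumes the '_'-separated pieces are the substrings of interest, and filters out very short tokens such as 'x'):

import re
import pandas as pd

df1 = pd.DataFrame({"Conct": ["ABC_IronMan_x_nmc", "xyz"]})
df2 = pd.DataFrame({"Conct": ["OPT_IronMan_b_efd", "GGH"]})

# Collect the '_'-separated tokens from df1, dropping very short ones
tokens = {t for row in df1["Conct"].str.split("_") for t in row if len(t) > 2}

# Build a single alternation pattern and keep df2 rows containing any token
pattern = "|".join(re.escape(t) for t in tokens)
print(df2[df2["Conct"].str.contains(pattern)])  # the 'IronMan' row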
Given 2 data frames like the linked example, I need to add the "index income" from df2 to df1. I need to search df2 by the df1 combined key and, if there is a match, return the value into a new column in df1. There is not an equal number of instances in df1 and df2: there are about 700 rows in df1 and 1000 rows in df2.
I was able to do this in Excel with a VLOOKUP, but I am trying to do it in Python now.
This should solve your issue:
df1.merge(df2, how='left', on='combined_key')
This (left join) will give you all the records of df1 and matching records from df2.
https://www.geeksforgeeks.org/how-to-do-a-vlookup-in-python-using-pandas/
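A fuller sketch of the VLOOKUP equivalent (the column names 'combined_key' and 'index_income' are illustrative stand-ins for your real ones):

import pandas as pd

# Restrict df2 to the key plus the value being looked up, then left-merge:
# every df1 row is kept, and unmatched keys get NaN in 'index_income'
lookup = df2[['combined_key', 'index_income']]
df1 = df1.merge(lookup, how='left', on='combined_key')

Restricting df2 first keeps the join from dragging its other columns into df1.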
Here is an answer using joins. I modified my df2 to include only the useful columns, then used a pandas left join.
Left_join = pd.merge(df,
                     zip_df,
                     on='State County',
                     how='left')
I have a scenario where I want to find non-matching rows between two dataframes. Both dataframes will have around 30 columns and an id column that uniquely identifies each record/row. So, I want to check whether a row in df1 is different from the one in df2. df1 is an updated dataframe and df2 is the previous version.
I have tried the approach pd.concat([df1, df2]).drop_duplicates(keep=False), but it just combines both dataframes. Is there a way to do this? I would really appreciate the help.
The sample data looks like this for both dfs.
id user_id type status
There will be 39 columns in total, which may have NULL values in them.
Thanks.
P.S. df2 will always be a subset of df1.
If your df1 and df2 have the same shape, you can easily compare them with this code (with numpy imported as np):
df3 = pd.DataFrame(np.where(df1 == df2, True, False), columns=df1.columns)
You will see the boolean output False for every cell value that does not match. Be aware that NaN never compares equal to NaN, so cells that are NULL in both frames will also come out as False.
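Since the id column uniquely identifies each row and df2 is stated to be a subset of df1, here is a sketch that aligns both frames on id and reports only the cells that differ (it uses DataFrame.compare, available in pandas >= 1.1; the 'id' column name is taken from the sample):

import pandas as pd

# Index both frames by the unique id; df2 is a subset of df1,
# so restrict df1 to the ids present in df2 before comparing
new = df1.set_index('id')
old = df2.set_index('id')

# compare() keeps only the differing cells and treats NaN == NaN as equal
diff = new.loc[old.index].compare(old)
changed_ids = diff.index  # ids whose rows changed between versions

An empty diff means no row changed between the two versions.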
I have two pandas dataframes, df1 and df2, with both equal number of rows. df2 has 11 rows which contain NaN values. I know how to drop the empty rows in df2, by applying:
df2.dropna(subset=['HIGH'], inplace=True)
But now I want to delete these same rows from df1 (the rows with the same row numbers that have been deleted from df2). I tried the following but this does not seem to work.
df1.drop(df2[df2['HIGH'] == 'NaN'].index, inplace=False)
Any other suggestions?
You can get all the rows that contain NaN values with:
is_NaN = df2.isnull()
row_has_NaN = is_NaN.any(axis=1)
rows_with_NaN = df2[row_has_NaN]
After that you can delete the rows with NaN (as you did in the question). Then take every index from rows_with_NaN and drop it from df1 as well (df1 should have the same index, as you said).
I hope this is correct! (No test done)
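Putting those steps together, a minimal sketch (it assumes df1 and df2 share the same index, as stated in the question):

import pandas as pd

# Capture the indices of df2 rows that contain any NaN, before dropping them
nan_index = df2[df2.isnull().any(axis=1)].index

# Drop the same rows from both frames so they stay aligned
df1 = df1.drop(nan_index)
df2 = df2.drop(nan_index)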