How to find the difference between two pandas DataFrames [duplicate] - python

This question already has answers here:
Find difference between two data frames
(19 answers)
Closed 1 year ago.
I have two DataFrames, and both have a name column. I want to make a new DataFrame containing the names that dataframeA has and dataframeB doesn't have.
dataframeA
id name
1 aaa
2 bbbb
3 cccc
4 gggg
dataframeB
id name
1 ddd
2 aaa
3 gggg
new dataframe
id name
1 bbbb
2 cccc

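If I understand correctly, you can merge the two dataframes. Note that the default inner merge keeps only the names present in both frames, so it returns the common rows rather than the difference: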
import pandas as pd
merged_df = pd.merge(dataframe_a, dataframe_b, on='name')

You can use reduce from functools, or you can use isin, to create a new_df that only contains values in dfA that are also present in dfB.
Approach 1 using reduce:
from functools import reduce #import package
li = [dfA, dfB] #create list of dataframes
new_df = reduce(lambda left,right: pd.merge(left,right,on='name'), li) #reduce list
Approach 2 using isin:
new_df = dfA[dfA['name'].isin(dfB['name'])]

One way you could do this is to use Python's set operations.
This converts the name columns to sets, takes their difference, and creates a new dataframe from the result. Sets are unordered, so the row order of the result is not guaranteed.
dataframe = pd.DataFrame(data = {
'name': list(set(dataframeA['name'].tolist()) - set(dataframeB['name'].tolist()))
})
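The answers above mostly return the names common to both frames. A minimal sketch of the difference the question asks for, using the sample data from the question, could look like the following. The variable names only_in_a, merged and only_in_a_merge are made up for illustration. Both options keep the original id values; the renumbered ids in the expected output would need an extra step.

```python
import pandas as pd

dataframeA = pd.DataFrame({"id": [1, 2, 3, 4],
                           "name": ["aaa", "bbbb", "cccc", "gggg"]})
dataframeB = pd.DataFrame({"id": [1, 2, 3],
                           "name": ["ddd", "aaa", "gggg"]})

# Option 1: keep the rows of A whose name does NOT appear in B
only_in_a = dataframeA[~dataframeA["name"].isin(dataframeB["name"])]

# Option 2: left merge with indicator=True, then keep rows found only in A.
# Only the name column of B is merged, so A's id column is left untouched.
merged = dataframeA.merge(dataframeB[["name"]].drop_duplicates(),
                          on="name", how="left", indicator=True)
only_in_a_merge = merged.loc[merged["_merge"] == "left_only", ["id", "name"]]

print(only_in_a)
print(only_in_a_merge)
```

The isin version is usually the shorter choice for a single key column; the merge version extends more naturally to several key columns.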

Related

Merge two dataframes with different sizes [duplicate]

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 2 years ago.
I have two dataframes with different sizes and I want to merge them.
It's like an "update" to a dataframe column based on another dataframe with different size.
This is an example input:
dataframe 1
CODUSU Situação TIPO1
0 1AB P A0
1 2C3 C B1
2 3AB P C1
dataframe 2
CODUSU Situação ABC
0 1AB A 3
1 3AB A 4
My output should be like this:
dataframe 3
CODUSU Situação TIPO1
0 1AB A A0
1 2C3 C B1
2 3AB A C1
PS: I did it with a loop, but I think there should be a better and easier way to do it!
I read this content: pandas merging 101 and wrote this code:
df3=df1.merge(df2, on=['CODUSU'], how='left', indicator=False)
df3['Situação'] = np.where((df3['Situação_x'] == 'P') & (df3['Situação_y'] == 'A') , df3['Situação_y'] , df3['Situação_x'])
df3=df3.drop(columns=['Situação_x', 'Situação_y','ABC'])
df3 = df3[['CODUSU','Situação','TIPO1']]
And Voilà, df3 is exactly what I needed!
Thanks, everyone!
PS: I already found my answer, is there a better place to answer my own question?
df1.merge(df2,how='left', left_on='CODUSU', right_on='CODUSU')
This should do the trick.
Also worth noting: if you don't want the resulting data frame to contain the column ABC, use df2.drop(columns="ABC") instead of just df2. (A bare df2.drop("ABC") looks for a row labelled ABC and raises a KeyError.)
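As an alternative to merging and cleaning up the suffixed columns, here is a minimal sketch that updates the Situação column by lookup. It assumes that each CODUSU appears at most once in df2, and it overwrites the status unconditionally whenever a match exists, which is a simplification of the P-to-A condition in the asker's own code. The name updated is made up for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({"CODUSU": ["1AB", "2C3", "3AB"],
                    "Situação": ["P", "C", "P"],
                    "TIPO1": ["A0", "B1", "C1"]})
df2 = pd.DataFrame({"CODUSU": ["1AB", "3AB"],
                    "Situação": ["A", "A"],
                    "ABC": [3, 4]})

# Look up the new status for each CODUSU; codes missing from df2 give NaN.
# Assumption: CODUSU values in df2 are unique.
updated = df1["CODUSU"].map(df2.set_index("CODUSU")["Situação"])

# Keep the old status wherever df2 had no matching row
df3 = df1.copy()
df3["Situação"] = updated.fillna(df1["Situação"])

print(df3)
```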

Concatenate multiple pandas groupby outputs

I would like to make multiple .groupby() operations on different subsets of a given dataset and bind them all together. For example:
import pandas as pd
df = pd.DataFrame({"ID":[1,1,2,2,2,3],"Subset":[1,1,2,2,2,3],"Value":[5,7,4,1,7,8]})
print(df)
ID Subset Value
0 1 1 5
1 1 1 7
2 2 2 4
3 2 2 1
4 2 2 7
5 3 3 8
I would then like to concatenate the following objects and store the result in a pandas data frame:
gr1 = df[df["Subset"] == 1].groupby(["ID","Subset"]).mean()
gr2 = df[df["Subset"] == 2].groupby(["ID","Subset"]).mean()
# Why do gr1 and gr2 have column names in different rows?
I realize that df.groupby(["ID","Subset"]).mean() would give me the concatenated object I'm looking for. Just bear with me, this is a reduced example of what I'm actually dealing with.
I think the solution could be to transform gr1 and gr2 to pandas data frames and then concatenate them like I normally would.
In essence, my questions are the following:
How do I convert a groupby result to a data frame object?
In case this can be done without transforming the series to data frames, how do you bind two groupby results together and then transform that to a pandas data frame?
PS: I come from an R background, so to me it's odd to group a data frame by something and have the output return as a different type of object (series or multi index data frame). This is part of my question too: why does .groupby return a series? What kind of series is this? How come a series can have multiple columns and an index?
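The result in your example is a pandas DataFrame whose index is a MultiIndex built from the grouping keys (ID and Subset), which is why the column names appear on different rows when it is printed. To get a dataframe with the grouping keys as ordinary columns when applying a single aggregation function, you can use the following. Note the inclusion of as_index=False.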
>>> gr1 = df[df["Subset"] == 1].groupby(["ID","Subset"], as_index=False).mean()
>>> gr1
ID Subset Value
0 1 1 6.0
This however won't work if you wish to aggregate multiple functions like here. If you wish to avoid using df.groupby(["ID","Subset"]).mean(), then you can use the following for your example.
>>> gr1 = df[df["Subset"] == 1].groupby(["ID","Subset"], as_index=False).mean()
>>> gr2 = df[df["Subset"] == 2].groupby(["ID","Subset"], as_index=False).mean()
>>> pd.concat([gr1, gr2]).reset_index(drop=True)
ID Subset Value
0 1 1 6.0
1 2 2 4.0
If you're only concerned with dealing with a specific subset of rows, the following could be applicable, since it removes the necessity to concatenate results.
>>> values = [1,2]
>>> df[df['Subset'].isin(values)].groupby(["ID","Subset"], as_index=False).mean()
ID Subset Value
0 1 1 6.0
1 2 2 4.0
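To make the return type concrete, here is a small sketch using the data from the question. It checks what groupby(...).mean() returns and shows that the two results can be concatenated first and flattened afterwards with reset_index(). The name combined is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"ID": [1, 1, 2, 2, 2, 3],
                   "Subset": [1, 1, 2, 2, 2, 3],
                   "Value": [5, 7, 4, 1, 7, 8]})

gr1 = df[df["Subset"] == 1].groupby(["ID", "Subset"]).mean()
gr2 = df[df["Subset"] == 2].groupby(["ID", "Subset"]).mean()

# Each result is already a DataFrame; only its index is a MultiIndex
print(type(gr1), type(gr1.index))

# Concatenate the grouped results, then turn the index levels
# (ID, Subset) back into ordinary columns
combined = pd.concat([gr1, gr2]).reset_index()
print(combined)
```

Calling reset_index() once at the end gives the same flat layout as passing as_index=False to each groupby call.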

How to identify several matching column values to select row and assign value from another table to new column with python [duplicate]

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 3 years ago.
I'm trying to use Python to match the values of 3 columns in one dataframe against another dataframe and get the values of another column from the matching dataframe. How do I loop through my dataframe to select the columns to match with the other dataframe and extract the values from the column I want? The matching condition is that columns a, b and c have the same values.
This is my dataframe [df1]:
This is the other dataframe [df2]:
This is the result I want to achieve[df3]:
Thanks.
It should be enough with this:
df3 = df2.merge(df1, on=['a','b','c'], how='inner')
What you actually want to do is join or merge the two dataframes on the matching columns. Choose the how argument according to the result you need:
new_df = df2.merge(df1, on=['a','b','c'], how='inner')
# use how='inner' when you want intersection
# use how='outer' when you want union
You could also do something like the following instead of merge:
dft = pd.concat([df1, df2.iloc[:, :3]])
df = df2.loc[dft[dft.duplicated()].index]
print(df)
a b c d
0 fra chi nga can
1 wal bra rsa usa
4 ita arg sen jam
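The tables from the question did not survive, so the frames below are hypothetical stand-ins that reuse values from the printed output above. Under that assumption, a minimal sketch of matching on three columns without any explicit loop might be:

```python
import pandas as pd

# Hypothetical data: the original df1 and df2 are not shown in the question
df1 = pd.DataFrame({"a": ["fra", "wal", "ita", "ger"],
                    "b": ["chi", "bra", "arg", "mex"],
                    "c": ["nga", "rsa", "sen", "egy"]})
df2 = pd.DataFrame({"a": ["fra", "wal", "esp", "ita"],
                    "b": ["chi", "bra", "per", "arg"],
                    "c": ["nga", "rsa", "mar", "sen"],
                    "d": ["can", "usa", "cub", "jam"]})

# Inner merge: keep only rows where a, b and c all match
matched = df1.merge(df2, on=["a", "b", "c"], how="inner")

# Left merge: keep every row of df1; d is NaN where nothing matched
with_gaps = df1.merge(df2, on=["a", "b", "c"], how="left")

print(matched)
print(with_gaps)
```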

How to delete an undesirable row from pandas dataframe [duplicate]

This question already has an answer here:
Deleting DataFrame row in Pandas where column value in list
(1 answer)
Closed 3 years ago.
I have a pandas dataframe, for example:
id column1 column2
1 aaa mmm
2 bbb nnn
3 ccc ooo
4 ddd ppp
5 eee qqq
I have a list that contains some values from column1:
[bbb],[ddd],[eee]
I need Python code to delete from the dataframe all rows whose column1 value is in the list.
PS: my dataframe contains 280,000 rows, so I need fast code.
Thanks
You can use isin and its negation (~):
df[~df.column1.isin(['bbb','ddd', 'eee'])]
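Try this, where values is your list of values to remove (avoid naming it list, which shadows the built-in):
df = df.loc[~df['column1'].isin(values), :]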
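Put together with the sample data from the question, a runnable sketch of the isin approach could look like this. The names to_remove and filtered are made up for illustration, and the list is assumed to be a flat list of strings.

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3, 4, 5],
                   "column1": ["aaa", "bbb", "ccc", "ddd", "eee"],
                   "column2": ["mmm", "nnn", "ooo", "ppp", "qqq"]})

# Assumed shape of the list from the question: a flat list of strings
to_remove = ["bbb", "ddd", "eee"]

# Boolean mask: True where column1 is in the list; ~ inverts it
filtered = df[~df["column1"].isin(to_remove)].reset_index(drop=True)

print(filtered)
```

Because isin is vectorised, it avoids a Python-level loop over the rows.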

Python Pandas DataFrame: Rename all Column Names via Map [duplicate]

I would like to go through all the columns in a dataframe and rename (or map) columns if they contain certain strings.
For example: rename all columns that contain 'agriculture' with the string 'agri'
I'm thinking about using rename and str.contains, but I can't figure out how to combine them to achieve what I want.
You can use str.replace to process the columns first, and then re-assign the new columns back to the DataFrame:
import pandas as pd
df = pd.DataFrame({'A_agriculture': [1,2,3],
'B_agriculture': [11,22,33],
'C': [4,5,6]})
df.columns = df.columns.str.replace('agriculture', 'agri')
print(df)
Output:
A_agri B_agri C
0 1 11 4
1 2 22 5
2 3 33 6
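Since the question mentions rename and str.contains, here is a sketch of how those two could be combined, as an alternative to reassigning df.columns. The names renamed, mapping and renamed_via_map are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A_agriculture": [1, 2, 3],
                   "B_agriculture": [11, 22, 33],
                   "C": [4, 5, 6]})

# rename accepts a function that is applied to every column name
renamed = df.rename(columns=lambda name: name.replace("agriculture", "agri"))

# Or build an explicit mapping only for columns containing the substring
matching = df.columns[df.columns.str.contains("agriculture")]
mapping = {name: name.replace("agriculture", "agri") for name in matching}
renamed_via_map = df.rename(columns=mapping)

print(renamed.columns.tolist())
print(renamed_via_map.columns.tolist())
```

Unlike assigning to df.columns, rename returns a new DataFrame and leaves df unchanged.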
