Comparing pandas DataFrames of unequal length [duplicate] - python

This question already has answers here:
Finding the difference between two dataframes having duplicate records
(2 answers)
Closed 1 year ago.
I have two Dataframes as below:
'''
df1 =
  emp_id emp_name e_city
       1      Joe  Acity
       2     Nick  Bcity
       3      Sam  Ccity
       4     John  Dcity
       5     Mike  Ecity

df2 =
  emp_id emp_name e_city
       2     Nick  Bcity
       2     Nick  Bcity
       3      Sam  Ccity
       4     John  Dcity
'''
Please note that df2 has a duplicate row and the lengths of the two DataFrames are not equal.
My use case is to find the mismatches or differences between these two DataFrames.
Expected output: a row that occurs only once in one DataFrame but twice in the other should be reported as a difference, along with the other mismatched values:
df3 =
  emp_id emp_name e_city
       1      Joe  Acity
       2     Nick  Bcity
       5     Mike  Ecity
I tried the methods below, but none were fruitful.
I cannot use df.compare since the two DataFrames are not of equal length.
I tried df.merge, but it does not flag the duplicated row as a mismatch/difference.
I tried concat followed by compare; that was not successful either.
Can someone please help me with this? Thanks in advance.

First you can count the duplicates with groupby:
df2 = df2.groupby(['emp_id','emp_name']).size().reset_index(name='count')
and then merge the counts back with the original dataframe, as in the sketch below.
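A minimal runnable sketch of that idea, grouping on all three columns so that e_city mismatches are caught too (the helper names cols, c1, c2, and both are illustrative):

import pandas as pd

df1 = pd.DataFrame({'emp_id': [1, 2, 3, 4, 5],
                    'emp_name': ['Joe', 'Nick', 'Sam', 'John', 'Mike'],
                    'e_city': ['Acity', 'Bcity', 'Ccity', 'Dcity', 'Ecity']})
df2 = pd.DataFrame({'emp_id': [2, 2, 3, 4],
                    'emp_name': ['Nick', 'Nick', 'Sam', 'John'],
                    'e_city': ['Bcity', 'Bcity', 'Ccity', 'Dcity']})

cols = ['emp_id', 'emp_name', 'e_city']

# Count how often each full row occurs in each DataFrame.
c1 = df1.groupby(cols).size().reset_index(name='count_1')
c2 = df2.groupby(cols).size().reset_index(name='count_2')

# Outer-merge the counts; a row missing from one side gets a count of 0.
both = c1.merge(c2, on=cols, how='outer').fillna(0)

# A row is a difference whenever its occurrence counts disagree.
df3 = both.loc[both['count_1'] != both['count_2'], cols].reset_index(drop=True)
print(df3)

This reports the row that appears once in df1 but twice in df2, as well as the rows that appear in only one of the DataFrames.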


Is there a pandas function to add in value of a column based on the other dataframe? [duplicate]

This question already has answers here:
Pandas: how to merge two dataframes on a column by keeping the information of the first one?
(4 answers)
Closed 6 months ago.
I would like to add a column to a pandas dataframe based on the value from another dataframe. Here are Table 1 and Table 2. I would like to update the Duration in Table 1 based on the value from Table 2. E.g., row 1 in Table 1 is Potato, so the Duration should be updated to 30 based on the value from Table 2.
Table 1
  Crops    Entry Time  Duration
  Potato   2022-03-01         0
  Cabbage  2022-03-02         0
  Tomato   2022-03-03         0
  Potato   2022-03-0          0

Table 2
  Crops    Duration
  Potato         30
  Cabbage        20
  Tomato         25
Thanks.
Just use the merge method:
df = df1.merge(df2, on='Crops', how='left')
Before doing that, I suggest dropping the Duration column from the first dataframe (df1).
The parameter on defines which column you want to merge on (also called the 'key'), and how='left' returns a dataframe with the length of the first dataframe. Using 'left' ensures that crops in df1 that are not present in df2 are not deleted.
Google 'difference between inner, left, right and outer join'.
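For reference, a minimal sketch of the drop-then-merge approach with data mirroring the question (the truncated date '2022-03-0' is kept verbatim from the question):

import pandas as pd

table1 = pd.DataFrame({'Crops': ['Potato', 'Cabbage', 'Tomato', 'Potato'],
                       'Entry Time': ['2022-03-01', '2022-03-02',
                                      '2022-03-03', '2022-03-0'],
                       'Duration': [0, 0, 0, 0]})
table2 = pd.DataFrame({'Crops': ['Potato', 'Cabbage', 'Tomato'],
                       'Duration': [30, 20, 25]})

# Drop the placeholder Duration, then pull the real values in from table2.
result = table1.drop(columns='Duration').merge(table2, on='Crops', how='left')
print(result)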

Dropping rows, where a dynamic number of integer columns only contain 0's

I have the following problem with this example dataframe:
Name    Planet    Number Column #1  Number Column #2
John    Earth                    2                 0
Peter   Terra                    0                 0
Anna    Mars                     5                 4
Robert  Knowhere                 0                 1
Here, I want to remove only the rows in which all of the numbers are 0. In this case that is the second row. So my dataframe has to become like this:
Name    Planet    Number Column #1  Number Column #2
John    Earth                    2                 0
Anna    Mars                     5                 4
Robert  Knowhere                 0                 1
For this, I have a solution:
new_df = old_df.loc[(old_df['Number Column #1'] > 0) | (old_df['Number Column #2'] > 0)]
This works; however, I have another problem: depending on the request, my dataframe will dynamically have a different number of number columns. For example:
Name    Planet    Number Column #1  Number Column #2  Number Column #3
John    Earth                    2                 0                 1
Peter   Terra                    0                 0                 0
Anna    Mars                     5                 4                 2
Robert  Knowhere                 0                 1                 1
This is the problematic part, as I am not sure how to adjust my code to work with dynamic columns. I've tried multiple things from StackOverflow and the pandas documentation; however, most examples only work for dataframes in which all columns are numeric. Pandas then treats the non-zero comparison as booleans, and a simple solution like this works:
new_df = df[(df != 0).any(axis=1)]
In my case, however, the text columns, which are always the same, are the problematic ones. Does anyone have an idea for a solution here? Thanks a lot in advance!
P.S. I have the names of the number columns available beforehand in the code as a list, for example:
my_num_columns = ["Number Column #1", "Number Column #2", "Number Column #3"]
# my pandas logic...
IIUC:
You can use select_dtypes() to pick the int and float columns, check your condition on them, and filter the dataframe:
df = df.loc[~df.select_dtypes(['int','float']).eq(0).all(axis=1)]
# OR
df = df.loc[df.select_dtypes(['int','float']).ne(0).any(axis=1)]
Note: if needed you can also include 'bool' columns and typecast them to float before checking the condition:
df = df.loc[df.select_dtypes(['int','float','bool']).astype(float).ne(0).any(axis=1)]
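Since the question says the number-column names are already available as a list, here is a sketch that drives the same filter from that list instead of select_dtypes() (data mirrors the question):

import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Peter', 'Anna', 'Robert'],
                   'Planet': ['Earth', 'Terra', 'Mars', 'Knowhere'],
                   'Number Column #1': [2, 0, 5, 0],
                   'Number Column #2': [0, 0, 4, 1],
                   'Number Column #3': [1, 0, 2, 1]})
my_num_columns = ['Number Column #1', 'Number Column #2', 'Number Column #3']

# Keep a row if any of the listed number columns is non-zero.
new_df = df.loc[df[my_num_columns].ne(0).any(axis=1)]
print(new_df)  # Peter's all-zero row is dropped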

How to sum a list of pandas dataframes with respect to a given column

I have a list of pandas dataframes with two columns, basically a class and a value:
df1:
Name  Count
Bob      10
John     20

df2:
Name  Count
Mike     30
Bob      40
The same "Name" can appear in several dataframes, or in only one, and the list contains over 100 dataframes. But within each dataframe all "Names" are unique.
What I need is to iterate over all the dataframes and create one big one that contains every "Name" together with the total sum of its "Count" across all the dataframes, like:
result:
Name  Count
Bob      50
John     20
Mike     30
Bob's data is summed; the others are not, as they are only present once. Is there an efficient way to do this when there are many dataframes?
Do pd.concat, then groupby:
df = pd.concat(dfs) # where dfs is a list of dataframes
Then you can do:
gp = df.groupby(['Name'])['Count'].sum()
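A runnable end-to-end sketch of this answer, with df1/df2 mirroring the question (a real list would hold 100+ dataframes):

import pandas as pd

df1 = pd.DataFrame({'Name': ['Bob', 'John'], 'Count': [10, 20]})
df2 = pd.DataFrame({'Name': ['Mike', 'Bob'], 'Count': [30, 40]})
dfs = [df1, df2]

# One concat plus one groupby scales well even for a long list of frames.
result = (pd.concat(dfs, ignore_index=True)
            .groupby('Name', as_index=False)['Count']
            .sum())
print(result)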
You can do the following (if some names are contained in only one dataframe, use fill_value=0 to still provide a value):
df1.set_index('Name').add(df2.set_index('Name'), fill_value=0).reset_index()
>>> Name Count
0 Bob 50.0
1 John 20.0
2 Mike 30.0
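With more than two dataframes, the same add/fill_value idea can be folded over the whole list with functools.reduce; a sketch (though the concat + groupby approach above is usually the more efficient route):

import functools
import pandas as pd

dfs = [pd.DataFrame({'Name': ['Bob', 'John'], 'Count': [10, 20]}),
       pd.DataFrame({'Name': ['Mike', 'Bob'], 'Count': [30, 40]})]

# fill_value=0 keeps names that appear in only some of the frames.
total = functools.reduce(
    lambda a, b: a.add(b, fill_value=0),
    (df.set_index('Name') for df in dfs),
).reset_index()
print(total)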

Merge two dataframes with different sizes [duplicate]

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 2 years ago.
I have two dataframes with different sizes and I want to merge them.
It's like an "update" to a dataframe column based on another dataframe with a different size.
This is an example input:
dataframe 1
CODUSU Situação TIPO1
0 1AB P A0
1 2C3 C B1
2 3AB P C1
dataframe 2
CODUSU Situação ABC
0 1AB A 3
1 3AB A 4
My output should be like this:
dataframe 3
CODUSU Situação TIPO1
0 1AB A A0
1 2C3 C B1
2 3AB A C1
PS: I did it with a loop, but I think there should be a better and easier way to do it!
I read this content: pandas merging 101 and wrote this code:
df3=df1.merge(df2, on=['CODUSU'], how='left', indicator=False)
df3['Situação'] = np.where((df3['Situação_x'] == 'P') & (df3['Situação_y'] == 'A') , df3['Situação_y'] , df3['Situação_x'])
df3=df3.drop(columns=['Situação_x', 'Situação_y','ABC'])
df3 = df3[['CODUSU','Situação','TIPO1']]
And Voilà, df3 is exactly what I needed!
Thanks, everyone!
PS: I already found my answer; is there a better place to post an answer to my own question?
df1.merge(df2,how='left', left_on='CODUSU', right_on='CODUSU')
This should do the trick.
Also, it is worth noting that if you want the resulting dataframe not to contain the ABC column, you'd pass df2.drop(columns='ABC') instead of just df2.
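Putting the pieces together, a minimal sketch of the asker's own solution with illustrative data, using fillna in place of np.where to take df2's value where one exists:

import pandas as pd

df1 = pd.DataFrame({'CODUSU': ['1AB', '2C3', '3AB'],
                    'Situação': ['P', 'C', 'P'],
                    'TIPO1': ['A0', 'B1', 'C1']})
df2 = pd.DataFrame({'CODUSU': ['1AB', '3AB'],
                    'Situação': ['A', 'A'],
                    'ABC': [3, 4]})

df3 = df1.merge(df2.drop(columns='ABC'), on='CODUSU', how='left')
# Prefer the updated value from df2; fall back to df1's original.
df3['Situação'] = df3['Situação_y'].fillna(df3['Situação_x'])
df3 = df3.drop(columns=['Situação_x', 'Situação_y'])[['CODUSU', 'Situação', 'TIPO1']]
print(df3)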

Merge Disjoint Columns in Pandas [duplicate]

This question already has answers here:
How to remove nan value while combining two column in Panda Data frame?
(5 answers)
Closed 4 years ago.
I have a pretty simple pandas question about merging two series. I have two series in a dataframe that are similar to this:
Column1 Column2
0 Abc NaN
1 NaN Abc
2 Abc NaN
3 NaN Abc
4 NaN Abc
The answer will probably end up being a really simple .merge() or .concat() command, but I'm trying to get a result like this:
Column1
0 Abc
1 Abc
2 Abc
3 Abc
4 Abc
The idea is that for each row, there is a string of data in either Column1, Column2, but never both. I did about 10 minutes of looking for answers on StackOverflow as well as Google, but I couldn't find a similar question that cleanly applied to what I was looking to do.
I realize that a lot of this question just stems from my ignorance on the three functions that Pandas has to stick series and dataframes together. Any help is very much appreciated. Thank you!
You can just use pd.Series.fillna:
df['Column1'] = df['Column1'].fillna(df['Column2'])
Merge or concat are not appropriate here; they are used primarily for combining dataframes or series based on labels.
Use groupby with first:
df.groupby(df.columns.str[:-1], axis=1).first()
Out[294]:
Column
0 Abc
1 Abc
2 Abc
3 Abc
4 Abc
Or:
ndf = pd.DataFrame({'Column1': df.fillna('').sum(1)})
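For completeness, a runnable sketch of the accepted fillna idea; combine_first is an equivalent built-in that fills the NaNs of one series from another (the data mirrors the question):

import pandas as pd

df = pd.DataFrame({'Column1': ['Abc', None, 'Abc', None, None],
                   'Column2': [None, 'Abc', None, 'Abc', 'Abc']})

# Fill the gaps in Column1 with the corresponding values from Column2.
df['Column1'] = df['Column1'].combine_first(df['Column2'])
result = df[['Column1']]
print(result)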
