I apologize in advance if this has been covered; I could not find anything quite like it. This is my first programming job (I was previously in software QA), and I've been beating my head against a wall on this.
I have two dataframes; one [df2] is very large (14.6 million rows), and I am iterating through it in chunks. I want to compare a column of the same name in each dataframe, and where the values are equal, pull a second column over from the larger frame.
i.e.
if df1['tag'] == df2['tag']:
    df1['new column'] = df2['plate']
I attempted a merge but this didn't output what I expected.
df3 = pd.merge(df1, df2, on='tag', how='left')
I hope I did an okay job explaining this.
[Edit:] I should also mention that df1 and df2 both have many additional columns that I do not want to touch or change. Is it possible to compare just the single column in each of the two dataframes, and output the third, additional column?
You may try an inner merge. Inner merge df1 with df2; you will then get plates only for the common rows, and you can rename the new column in df1 as you need.
df1 = df1.merge(df2, on='tag', how='inner')
df1['new column'] = df1['plate']
del df1['plate']
I hope this works.
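For illustration, here is a minimal runnable sketch of that approach; the column names follow the question, but the data values are invented:
import pandas as pd

df1 = pd.DataFrame({'tag': ['a', 'b', 'c'], 'other': [1, 2, 3]})
df2 = pd.DataFrame({'tag': ['b', 'c', 'd'], 'plate': ['P2', 'P3', 'P4']})

# inner merge keeps only tags present in both frames
out = df1.merge(df2[['tag', 'plate']], on='tag', how='inner')
out = out.rename(columns={'plate': 'new column'})
print(out)
#   tag  other new column
# 0   b      2         P2
# 1   c      3         P3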
As smci mentioned, this is a perfect time to use join/merge. If you're looking to preserve df1, a left join is what you want. So you were on the right path:
df1 = pd.merge(df1,
               df2[['tag', 'plate']],
               on='tag', how='left')
df1 = df1.rename({'plate': 'new column'}, axis='columns')
The merge only compares the tag columns, so the other columns don't matter; they are carried along untouched. It brings over the plate column from df2, and the rename then gives it whatever name you want for your new column.
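As a quick check with made-up data, a left merge keeps every row of df1 and fills NaN where df2 has no matching tag:
import pandas as pd

df1 = pd.DataFrame({'tag': ['a', 'b'], 'other': [1, 2]})
df2 = pd.DataFrame({'tag': ['b', 'c'], 'plate': ['P2', 'P3']})

out = (pd.merge(df1, df2[['tag', 'plate']], on='tag', how='left')
         .rename({'plate': 'new column'}, axis='columns'))
print(out)
#   tag  other new column
# 0   a      1        NaN
# 1   b      2         P2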
This is totally a case for join/merge. You want df2, the frame that carries the 'plate' column, on the left:
df2.join(df1.set_index('tag'), on='tag', ...)
You only misunderstood the type of join/merge you want to make:
how : {‘left’, ‘right’, ‘outer’, ‘inner’}, default: ‘left’
'how'='left' join would create an (unwanted) entry for all rows of the LHS df2. That's not quite what you want (if df2 contained other tag values not seen in df1, you'd also get entries for them).
'how'='inner' would form the intersection of df2 and df1 on the 'on'='tag' field, i.e. you only get entries for tag values that appear in both frames.
So:
df3 = df2.join(df1.set_index('tag'), on='tag', how='inner')
# then reference df3['plate']
or if you only want the 'plate' column in df3 (or some other selection of columns), you can directly do:
df2.join(df1.set_index('tag'), on='tag', how='inner')['plate']
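A minimal sketch with made-up values, showing the difference in row counts. Note that join() matches the calling frame's 'on' column against the other frame's index, which is why df1 is indexed by 'tag' first:
import pandas as pd

df2 = pd.DataFrame({'tag': ['a', 'b', 'x'], 'plate': ['P1', 'P2', 'P9']})
df1 = pd.DataFrame({'tag': ['a', 'b'], 'other': [1, 2]})

left = df2.join(df1.set_index('tag'), on='tag', how='left')    # 3 rows, NaN 'other' for 'x'
inner = df2.join(df1.set_index('tag'), on='tag', how='inner')  # 2 rows: tags in both frames
print(inner)
#   tag plate  other
# 0   a    P1      1
# 1   b    P2      2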
I have two dataframes, df1 and df2, and I know that df2 is a subset of df1. What I am trying to do is find the set difference between df1 and df2, such that df1 has only entries that are different from those in df2. To accomplish this, I first used pandas.util.hash_pandas_object on each of the dataframes, and then found the set difference between the two hashed columns.
df1['hash'] = pd.util.hash_pandas_object(df1, index=False)
df2['hash'] = pd.util.hash_pandas_object(df2, index=False)
df1 = df1.loc[~df1['hash'].isin(df2['hash'])]
This results in df1 remaining the same size; that is, none of the hash values matched. However, when I use a lambda function, df1 is reduced by the expected amount.
df1['hash'] = df1.apply(lambda x: hash(tuple(x)), axis=1)
df2['hash'] = df2.apply(lambda x: hash(tuple(x)), axis=1)
df1 = df1.loc[~df1['hash'].isin(df2['hash'])]
The problem with the second approach is that it takes an extremely long time to execute (df1 has about 3 million rows). Am I just misunderstanding how to use pandas.util.hash_pandas_object?
The difference is that in the first case you are hashing the complete dataframe, while in the second case you are hashing each individual row.
If your goal is to remove the duplicate rows, you can achieve this faster using a left/right merge with the indicator option and then dropping the rows that are not unique to the original dataframe.
df_merged = df1.merge(df2, how='left', on=list_columns, indicator=True)  # list_columns = the columns to compare on
df_merged = df_merged[df_merged['_merge'] == 'left_only']  # keep only the unmatched rows
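As a hedged, self-contained sketch of that set difference (the column names here are invented):
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})
df2 = pd.DataFrame({'a': [2, 3], 'b': ['y', 'z']})  # subset of df1

list_columns = list(df1.columns)  # compare on every column
df_merged = df1.merge(df2, how='left', on=list_columns, indicator=True)
df1_only = df_merged[df_merged['_merge'] == 'left_only'].drop(columns='_merge')
print(df1_only)  # just the row (1, 'x'), which is absent from df2
One caveat: if df2 can contain duplicate rows, de-duplicate it first, otherwise the merge can multiply matching rows.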
I have a dataframe df1, a second dataframe df2, and a desired result dataframe (all shown as images of sample tables).
Dataframe df1 & df2 contains a large number of columns and data but here I am showing sample data. My goal is to compare Customer and ID column of df1 with Customer and Part Number of df2. Comparison is to find mismatch of data of df1['Customer'] and df1['ID'] with df2['Customer'] and df2['Part Number']. Finally storing mismatch data in another dataframe df3. For example: Customer(rishab) with ID(89ab) is present in df1 but not in df2.Thus Customer, Order#, and Part are stored in df3.
I am using the isin() method to find mismatches between df1 and df2 on a single column, but I am not able to do it for a comparison across two columns.
df3 = df1[~df1['ID'].isin(df2['Part Number'].values)]
# here I can only find mismatches based on the single ID column, but I want to include Customer as well
I could use a loop, but the data is very large (the time complexity would increase), and I am sure this can be a one-liner. I have also tried merge but was not able to produce the exact output.
So, how do I produce this exact output? I am not able to use isin() on two columns, and I suspect isin() cannot be used across two columns at once.
The easiest way to achieve this is:
df3 = df1.merge(df2, left_on=['Customer', 'ID'],
                right_on=['Customer', 'Part Number'],
                how='left', indicator=True)
df3.reset_index(inplace=True)
df3 = df3[df3['_merge'] == 'left_only']
Here you first do a left join on those column pairs and pass indicator=True, which adds a _merge column recording which side each row came from; you then keep only the left_only rows.
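Using toy data modeled on the question's example (only the rishab/89ab row comes from the question; the other values are invented):
import pandas as pd

df1 = pd.DataFrame({'Customer': ['rishab', 'john'],
                    'ID': ['89ab', '12cd'],
                    'Order#': [101, 102]})
df2 = pd.DataFrame({'Customer': ['john'],
                    'Part Number': ['12cd']})

df3 = df1.merge(df2, left_on=['Customer', 'ID'],
                right_on=['Customer', 'Part Number'],
                how='left', indicator=True)
df3 = df3[df3['_merge'] == 'left_only']
print(df3[['Customer', 'ID', 'Order#']])
#   Customer    ID  Order#
# 0   rishab  89ab     101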
You can also try an outer join to get the non-matching rows, something like df3 = df1.merge(df2, left_on=['Customer', 'ID'], right_on=['Customer', 'Part Number'], how='outer').
I'm a little new to Python. I am trying to merge two dataframes that have similar columns; the second dataframe has one different column, which I need to append into the new dataframe.
Detailed view of dataframes
Code used:
df3 = pd.merge(df, df1[['Id', 'Value_data']], on='Id')
df3 = pd.merge(df, df1[['Id', 'Value_data']], on='Id', how='outer')
I am getting the output CSV as:
Unnamed: 0 Id_x Number_x Class_x Section_x Place_x Name_x Executed_Date_x Version_x Value PartDateTime_x Cycles_x Id_y Number_y Class_y Section_y Place_y Name_y Executed_Date_y Version_y Value_data PartDateTime_y Cycles_y
whereas I don't want the _x and _y suffixes; I want the output to be:
Id Number Class Section Place Name Executed_Date Version Value Value_data PartDateTime Cycles
If I use df2 = pd.concat([df, df1], axis=0, ignore_index=True), then I get values in the format below in all columns except Value_data, which ends up as an empty column:
Id Number Class Section Place Name Executed_Date Version Value Value_data PartDateTime Cycles
Please help me with a solution for this. Thanks for your time.
I think the easiest path is to make a temporary df, let's call it df_temp2, which is a copy of df_2 with the differing column renamed, and then append it to df_1:
df_temp2 = df_2.copy()
df_temp2.columns = ['..', '..', ...., 'value', ....]  # same names as df_1's columns, with the differing column renamed
then
df_total = df_1.append(df_temp2)
This gives you a total DataFrame with all the rows of df_1 and df_2. The append() method supports a few arguments; check the docs for more details.
--- Added ---
One other possible approach is to use the pd.concat() function, which can work in the same way as the .append() method, like this:
result = pd.concat([df_1, df_temp2])
In your case the two approaches would lead to similar performance. You can consider append() a method written on top of pd.concat(), applied to a DataFrame itself.
Full docs about concat() are here: pd.concat() docs
Hope this was helpful.
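One note: DataFrame.append() was deprecated in pandas 1.4 and removed in 2.0, so on current pandas the concat() form is the one that still works. A self-contained sketch (column names taken from the question, values invented):
import pandas as pd

df_1 = pd.DataFrame({'Id': [1, 2], 'Value': [10, 20]})
df_2 = pd.DataFrame({'Id': [3, 4], 'Value_data': [30, 40]})

# rename the differing column so both frames line up, then stack them
df_temp2 = df_2.rename(columns={'Value_data': 'Value'})
df_total = pd.concat([df_1, df_temp2], ignore_index=True)
print(df_total)
#    Id  Value
# 0   1     10
# 1   2     20
# 2   3     30
# 3   4     40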
import pandas as pd

df = pd.read_csv('C:/Users/output_2.csv')
df1 = pd.read_csv('C:/Users/output_1.csv')
df1_temp = df1[['Id', 'Cycles', 'Value_data']].copy()
df3 = pd.merge(df, df1_temp, on=['Id', 'Cycles'], how='inner')
df3 = df3.drop(columns='Unnamed: 0')
df3.to_csv('C:/Users/output.csv')
This worked
I have n dataframes formed by downloading data from Firestore; the number of dataframes depends on the number of unique values of a variable.
Coming to the question: I want to concatenate these dataframes into one final dataframe, but I want to ignore the empty ones. How can I do this?
For example, if I have df1, df2, df3, df4 and df3 is empty, I want to concatenate df1, df2, and df4.
I would do something like this, using the .empty attribute:
def concat(*args):
    return pd.concat([x for x in args if not x.empty])

df = concat(*[df1, df2, df3, df4])
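For example (repeating the helper so the sketch is self-contained):
import pandas as pd

def concat(*args):
    return pd.concat([x for x in args if not x.empty])

df1 = pd.DataFrame({'a': [1]})
df2 = pd.DataFrame({'a': [2]})
df3 = pd.DataFrame()              # empty: gets skipped
df4 = pd.DataFrame({'a': [3]})

print(concat(df1, df2, df3, df4))
#    a
# 0  1
# 0  2
# 0  3
Note that pd.concat raises a ValueError ("No objects to concatenate") if every frame is empty, and that the original indices are kept; pass ignore_index=True inside the helper if you want a fresh RangeIndex.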
I'm trying to join two dataframes in pandas to have the following behavior: I want to join on a specified column, but have it so redundant columns are not added to the dataframe. This is analogous to combine_first except combine_first does not seem to take an index column optional argument. Example:
# combine df1 and df2 based on "id" column
df1 = pandas.merge(df1, df2, how="outer", on=["id"])
The problem with the above is that columns common to df1/df2 aside from "id" will be added twice (with _x/_y suffixes) to df1. How can I do something like:
# Do outer join from df2 to df1, matching items by "id" but not adding
# columns that are redundant (df1 takes precedence if the values disagree)
df1.combine_first(df2, on=["id"])
How can this be done?
If you are trying to merge columns from df2 into df1 while excluding any redundant columns, the following should work.
df1.set_index("id", inplace=True)
df2.set_index("id", inplace=True)
# .loc with Index.difference() replaces the old .ix and column subtraction
df3 = df1.merge(df2.loc[:, df2.columns.difference(df1.columns)],
                left_index=True, right_index=True, how="outer")
However, this obviously will not update any values in df1 with values from df2, since it only brings in the non-redundant columns. But since you said df1 takes precedence when values disagree, perhaps this will do the trick?
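If you do also want df2's values to fill gaps in the shared columns (with df1 winning wherever it has a value), a hedged alternative is combine_first after aligning both frames on "id" via the index. A minimal sketch with invented data:
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2], 'x': [10.0, None], 'y': ['a', 'b']})
df2 = pd.DataFrame({'id': [2, 3], 'x': [99.0, 30.0], 'z': ['q', 'r']})

# align on 'id', let df1 win wherever it is non-null, outer-join the rest
df3 = df1.set_index('id').combine_first(df2.set_index('id')).reset_index()
print(df3)
#    id     x    y    z
# 0   1  10.0    a  NaN
# 1   2  99.0    b    q
# 2   3  30.0  NaN    r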