Compare multiple column values together using pandas - Python

I know I can do something like the below if we are checking only two columns against each other:
df['flag'] = df['a_id'].isin(df['b_id'])
where df is a DataFrame, and a_id and b_id are two of its columns. It returns True or False per row based on the match. But I need to compare multiple columns together.
For example, suppose there are a_id, a_region, a_ip, b_id, b_region and b_ip columns. I want to compare them like below:
a_key = df['a_id'] + df['a_region'] + df['a_ip']
b_key = df['b_id'] + df['b_region'] + df['b_ip']
df['flag'] = a_key.isin(b_key)
Somehow the above code always returns False. The output should be like below:
The first row's flag will be True because there is a match:
a_key becomes 2a10, which matches the last row of b_key (2a10).

You were going in the right direction, just use:
a_key = df['a_id'].astype(str) + df['a_region'] + df['a_ip'].astype(str)
b_key = df['b_id'].astype(str) + df['b_region'] + df['b_ip'].astype(str)
a_key.isin(b_key)
This gives the following results:
0 True
1 False
2 False
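The casts matter because + between an int column and a str column raises a TypeError (and all-numeric columns would be summed arithmetically rather than concatenated). A runnable sketch, with the frame reconstructed from the sample output shown in a later answer (the values are assumed):
import pandas as pd
df = pd.DataFrame({'a_id': [2, 22222, 33333],
                   'a_region': ['a', 'bcccc', 'acccc'],
                   'a_ip': [10, 10000, 120000],
                   'b_id': [3222222, 43333, 2],
                   'b_region': ['sssss', 'ddddd', 'a'],
                   'b_ip': [22222, 11111, 10]})
# Cast the numeric parts to str so + concatenates instead of failing
a_key = df['a_id'].astype(str) + df['a_region'] + df['a_ip'].astype(str)
b_key = df['b_id'].astype(str) + df['b_region'] + df['b_ip'].astype(str)
df['flag'] = a_key.isin(b_key)  # row 0: '2a10' appears in b_key, so True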

You can use isin with a DataFrame as values, but as per the docs:
If values is a DataFrame, then both the index and column labels must
match
So this should work:
# Removing the prefixes from column names
df_a = df[['a_id', 'a_region', 'a_ip']].rename(columns=lambda x: x[2:])
df_b = df[['b_id', 'b_region', 'b_ip']].rename(columns=lambda x: x[2:])
# Find rows where all values are in the other
matched = df_a.isin(df_b).all(axis=1)
# Get actual rows with boolean indexing
df_a.loc[matched]
# ... or add boolean flag to dataframe
df['flag'] = matched

Here's one approach using DataFrame.merge, pandas.concat and a test for duplicated values. The self-merge keeps only the rows whose (a_id, a_region, a_ip) triple appears as some row's (b_id, b_region, b_ip) triple, and duplicated then flags those original rows:
df_merged = df.merge(df,
                     left_on=['a_id', 'a_region', 'a_ip'],
                     right_on=['b_id', 'b_region', 'b_ip'],
                     suffixes=('', '_y'))
df['flag'] = pd.concat([df, df_merged[df.columns]]).duplicated(keep=False)[:len(df)].values
[out]
    a_id a_region    a_ip     b_id b_region   b_ip   flag
0      2        a      10  3222222    sssss  22222   True
1  22222    bcccc   10000    43333    ddddd  11111  False
2  33333    acccc  120000        2        a     10  False
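A simpler variant of the same idea is a left join with merge's indicator flag; this is an alternative sketch under the question's column names, not the original answer:
# '_merge' == 'both' marks rows whose a-triple exists among the b-triples
right = df[['b_id', 'b_region', 'b_ip']].drop_duplicates()
df_flagged = df.merge(right,
                      left_on=['a_id', 'a_region', 'a_ip'],
                      right_on=['b_id', 'b_region', 'b_ip'],
                      how='left', indicator=True)
df['flag'] = df_flagged['_merge'].eq('both').to_numpy()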


ValueError: Can only compare identically-labeled DataFrame objects [duplicate]

I'm using Pandas to compare the outputs of two files loaded into two data frames (uat, prod):
...
uat = uat[['Customer Number','Product']]
prod = prod[['Customer Number','Product']]
print(uat['Customer Number'] == prod['Customer Number'])
print(uat['Product'] == prod['Product'])
print(uat == prod)
The first two match exactly:
74357 True
74356 True
Name: Customer Number, dtype: bool
74357 True
74356 True
Name: Product, dtype: bool
For the third print, I get an error:
Can only compare identically-labeled DataFrame objects. If the first two compared fine, what's wrong with the 3rd?
Thanks
Here's a small example to demonstrate this (before pandas 0.19 it applied only to DataFrames, not Series; since 0.19 it applies to both):
In [1]: df1 = pd.DataFrame([[1, 2], [3, 4]])
In [2]: df2 = pd.DataFrame([[3, 4], [1, 2]], index=[1, 0])
In [3]: df1 == df2
Exception: Can only compare identically-labeled DataFrame objects
One solution is to sort the index first (Note: some functions require sorted indexes):
In [4]: df2.sort_index(inplace=True)
In [5]: df1 == df2
Out[5]:
0 1
0 True True
1 True True
Note: == is also sensitive to the order of columns, so you may have to use sort_index(axis=1):
In [11]: df1.sort_index().sort_index(axis=1) == df2.sort_index().sort_index(axis=1)
Out[11]:
0 1
0 True True
1 True True
Note: This can still raise (if the index/columns aren't identically labelled after sorting).
You can also try dropping the index if it is not needed for the comparison:
print(df1.reset_index(drop=True) == df2.reset_index(drop=True))
I have used this same technique in a unit test like so:
from pandas.testing import assert_frame_equal  # pandas.util.testing in older versions
assert_frame_equal(actual.reset_index(drop=True), expected.reset_index(drop=True))
At the time this question was asked there wasn't another function in pandas to test equality, but one has since been added: DataFrame.equals.
You use it like this:
df1.equals(df2)
Some differences from == are:
You don't get the error described in the question
It returns a simple boolean.
NaN values in the same location are considered equal
The two DataFrames need to have the same dtypes to be considered equal; see this Stack Overflow question
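For example, a small sketch of the NaN difference (values assumed):
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'col': [1.0, np.nan]})
df2 = pd.DataFrame({'col': [1.0, np.nan]})
df1.equals(df2)           # True: NaNs in matching locations count as equal
(df1 == df2).all().all()  # False: NaN == NaN is element-wise False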
EDIT:
As pointed out in @paperskilltrees' answer, index alignment is important. Apart from the solution provided there, another option is to sort the indexes of the DataFrames before comparing them. For df1 that would be df1.sort_index(inplace=True).
When you compare two DataFrames, you must ensure that the number of records in the first DataFrame matches the number of records in the second. In our example, each of the two DataFrames had 4 records, with 4 products and 4 prices.
If, for example, one of the DataFrames had 5 products, while the other DataFrame had 4 products, and you tried to run the comparison, you would get the following error:
ValueError: Can only compare identically-labeled Series objects
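A minimal sketch of that failure mode (values assumed):
import pandas as pd
s1 = pd.Series([1, 2, 3, 4, 5])  # 5 rows
s2 = pd.Series([1, 2, 3, 4])     # 4 rows, so the indexes differ
s1 == s2  # ValueError: Can only compare identically-labeled Series objects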
This should work:
import pandas as pd
import numpy as np
firstProductSet = {'Product1': ['Computer', 'Phone', 'Printer', 'Desk'],
                   'Price1': [1200, 800, 200, 350]}
df1 = pd.DataFrame(firstProductSet, columns=['Product1', 'Price1'])
secondProductSet = {'Product2': ['Computer', 'Phone', 'Printer', 'Desk'],
                    'Price2': [900, 800, 300, 350]}
df2 = pd.DataFrame(secondProductSet, columns=['Product2', 'Price2'])
df1['Price2'] = df2['Price2'] #add the Price2 column from df2 to df1
df1['pricesMatch?'] = np.where(df1['Price1'] == df2['Price2'], 'True', 'False') #create new column in df1 to check if prices match
df1['priceDiff?'] = np.where(df1['Price1'] == df2['Price2'], 0, df1['Price1'] - df2['Price2']) #create new column in df1 for price diff
print (df1)
example from https://datatofish.com/compare-values-dataframes/
Flyingdutchman's answer is great but wrong: it uses DataFrame.equals, which will return False in your case.
Instead, you want to use DataFrame.eq, which will return True.
It seems that DataFrame.equals ignores the dataframe's index, while DataFrame.eq uses dataframes' indexes for alignment and then compares the aligned values. This is an occasion to quote the central gotcha of Pandas:
Here is a basic tenet to keep in mind: data alignment is intrinsic. The link between labels and data will not be broken unless done so explicitly by you.
As we can see in the following examples, data alignment is neither broken nor enforced unless explicitly requested, so we have three different situations.
No explicit instruction given as to the alignment: ==, aka DataFrame.__eq__:
In [1]: import pandas as pd
In [2]: df1 = pd.DataFrame(index=[0, 1, 2], data={'col1':list('abc')})
In [3]: df2 = pd.DataFrame(index=[2, 0, 1], data={'col1':list('cab')})
In [4]: df1 == df2
---------------------------------------------------------------------------
...
ValueError: Can only compare identically-labeled DataFrame objects
Alignment is explicitly broken: DataFrame.equals, DataFrame.values, DataFrame.reset_index():
In [5]: df1.equals(df2)
Out[5]: False
In [9]: df1.values == df2.values
Out[9]:
array([[False],
       [False],
       [False]])
In [10]: (df1.values == df2.values).all().all()
Out[10]: False
Alignment is explicitly enforced: DataFrame.eq, DataFrame.sort_index():
In [6]: df1.eq(df2)
Out[6]:
col1
0 True
1 True
2 True
In [8]: df1.eq(df2).all().all()
Out[8]: True
My answer is as of pandas version 1.0.3.
Here I am showing a complete example of how to handle this error by padding the shorter frame with rows of zeros. Your DataFrames can come from CSV files or any other source.
import pandas as pd
import numpy as np
# df1 with 9 rows
df1 = pd.DataFrame({'Name': ['John', 'Mike', 'Smith', 'Wale', 'Marry', 'Tom', 'Menda', 'Bolt', 'Yuswa'],
                    'Age': [23, 45, 12, 34, 27, 44, 28, 39, 40]})
# df2 with 8 rows
df2 = pd.DataFrame({'Name': ['John', 'Mike', 'Wale', 'Marry', 'Tom', 'Menda', 'Bolt', 'Yuswa'],
                    'Age': [25, 45, 14, 34, 26, 44, 29, 42]})
# get lengths of df1 and df2
df1_len = len(df1)
df2_len = len(df2)
diff = df1_len - df2_len
rows_to_be_added1 = rows_to_be_added2 = 0
# rows_to_be_added1 = np.zeros(diff)
if diff < 0:
    rows_to_be_added1 = abs(diff)
else:
    rows_to_be_added2 = diff
# add empty rows to df1 (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
if rows_to_be_added1 > 0:
    df1 = pd.concat([df1, pd.DataFrame(np.zeros((rows_to_be_added1, len(df1.columns))), columns=df1.columns)])
# add empty rows to df2
if rows_to_be_added2 > 0:
    df2 = pd.concat([df2, pd.DataFrame(np.zeros((rows_to_be_added2, len(df2.columns))), columns=df2.columns)])
# at this point we have two dataframes with the same number of rows, and maybe different indexes
# drop the indexes of both, so we can compare the dataframes and other operations like update etc.
df2.reset_index(drop=True, inplace=True)
df1.reset_index(drop=True, inplace=True)
# add a new column to df1
df1['New_age'] = None
# compare the Age column of df1 and df2, and update the New_age column of df1 with the Age column of df2 if they match, else None
df1['New_age'] = np.where(df1['Age'] == df2['Age'], df2['Age'], None)
# drop rows where Name is 0.0
df2 = df2.drop(df2[df2['Name'] == 0.0].index)
# now we don't get the error ValueError: Can only compare identically-labeled Series objects
I found where the error was coming from in my case: the column names list was accidentally enclosed in another list.
Consider the following example:
column_names = ['warrior', 'eat', 'ok', 'monkeys']
df_good = pd.DataFrame(np.ones(shape=(6, 4)), columns=column_names)
df_good['ok'] < df_good['monkeys']
>>> 0 False
1 False
2 False
3 False
4 False
5 False
df_bad = pd.DataFrame(np.ones(shape=(6, 4)), columns=[column_names])
df_bad['ok'] < df_bad['monkeys']
>>> ValueError: Can only compare identically-labeled DataFrame objects
And the thing is, you cannot visually distinguish the bad DataFrame from the good one.
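You can, however, tell them apart by inspecting the columns attribute, as in this sketch (built on the frames above):
print(df_good.columns)     # Index(['warrior', 'eat', 'ok', 'monkeys'], dtype='object')
print(df_bad.columns)      # a MultiIndex, because the names were wrapped in another list
print(type(df_bad['ok']))  # DataFrame rather than Series, hence the comparison error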
In my case I simply passed the columns parameter explicitly when creating the DataFrame, because the data from one SQL query came with column names while the data from the other did not.

How can I add pandas "match" based on column list values and value in additional column?

I have a dataframe that contains a column with a list of identifiers called Multiple_IDs and a column called ID. Now I would like to create an additional column called Match that tells whether an ID is contained in the Multiple_IDs column or not. The output should be an additional column called Match that contains True or False values. Here is some sample input data:
data = {'ID': [2128441, 2128447, 2128500],
        'Multiple_IDs': ["2128442, 2128443, 2128444, 2128441", "2128446, 2128447", "2128503, 2128508"]}
df = pd.DataFrame(data)
The column has the datatype "object".
The desired output would then be this, according to the input data provided above.
I know I can achieve this using explode and then comparing the values (a sketch of that route appears after the first answer below), but I am wondering if there is something more elegant?
Use the in statement, if it is possible to test without separating each ID:
df['Match'] = [str(x) in y for x, y in df[['ID','Multiple_IDs']].to_numpy()]
print (df)
        ID                        Multiple_IDs  Match
0  2128441  2128442, 2128443, 2128444, 2128441   True
1  2128447                    2128446, 2128447   True
2  2128500                    2128503, 2128508  False
Or:
df['Match'] = df.apply(lambda x: str(x['ID']) in x['Multiple_IDs'], axis=1)
print (df)
        ID                        Multiple_IDs  Match
0  2128441  2128442, 2128443, 2128444, 2128441   True
1  2128447                    2128446, 2128447   True
2  2128500                    2128503, 2128508  False
Another idea is to match by the split values:
df['Match'] = [str(x) in y.split(', ') for x, y in df[['ID','Multiple_IDs']].to_numpy()]
df['Match'] = df.apply(lambda x: str(x['ID']) in x['Multiple_IDs'].split(', '), axis=1)
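For comparison, the explode route mentioned in the question might look like this sketch (using the df from the sample data above):
# One ID per row, compare, then collapse back to one flag per original row
exploded = df.assign(Multiple_IDs=df['Multiple_IDs'].str.split(', ')).explode('Multiple_IDs')
df['Match'] = (exploded['Multiple_IDs'] == exploded['ID'].astype(str)).groupby(level=0).any()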
What I would do:
s = pd.DataFrame(df.Multiple_IDs.str.split(', ').tolist(), index=df.index).eq(df.ID.astype(str), axis=0).any(axis=1)
Out[10]:
0 True
1 True
2 False
dtype: bool
df['Match']=s

Checking if the values from the pandas dataframe column exist in another column. isin method not working

I need to check if the values from column A contain the values from column B.
I tried using the isin() method:
import pandas as pd
df = pd.DataFrame({'A': ['filePath_en_GB_LU_file', 'filePath_en_US_US_file', 'filePath_en_GB_PL_file'],
                   'B': ['_LU_', '_US_', '_GB_']})
df['isInCheck'] = df.A.isin(df.B)
For some reason it's not working.
It returns only False values, whereas for the first two rows it should return True.
What am I missing here?
I think you need DataFrame.apply; note that the last row is also a match:
df['isInCheck'] = df.apply(lambda x: x.B in x.A, axis=1)
print (df)
                        A     B  isInCheck
0  filePath_en_GB_LU_file  _LU_       True
1  filePath_en_US_US_file  _US_       True
2  filePath_en_GB_PL_file  _GB_       True
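A list comprehension over zip is a possible alternative sketch (the same row-wise substring test without apply; not from the original answers):
df['isInCheck'] = [b in a for a, b in zip(df['A'], df['B'])]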
Try to use an apply:
df['isInCheck'] = df.apply(lambda r: r['B'] in r['A'], axis=1)
This will check row-wise. If you want to check whether multiple elements are present, maybe you should create a column for each one of them:
for e in df['B'].unique():
    df[f'has_"{e}"'] = df.apply(lambda r: e in r['A'], axis=1)
print(df)
                        A     B  has_"_LU_"  has_"_US_"  has_"_GB_"
0  filePath_en_GB_LU_file  _LU_        True       False        True
1  filePath_en_US_US_file  _US_       False        True       False
2  filePath_en_GB_PL_file  _GB_       False       False        True

How can I check for the existence of the word 'worm' in a dataset containing different names?

I have a dataset which has 1854 rows and 106 columns. In its third column there are values like "Worm.Win32.Zwr.c" (and other malware names). I want to check whether a word like 'worm' appears in any row and, if so, insert 1 in the target column of the same row:
for rows in malware_data:
    if ('worm' in malware_data[3]):
        malware_data.loc[rows]['target'] = 1
    else:
        malware_data.loc[rows]['target'] = 0
You can do this in several ways:
1) By creating a boolean mask that marks which rows of the third column contain the word 'worm':
mask = df[third_column].str.lower().str.contains('worm')
df.loc[mask, 'target'] = 1
df.loc[~mask, 'target'] = 0
Instead of .str.lower().str.contains('worm') you can use .str.contains('(?i)worm').
If you do not know the name of your third column, you could use:
third_column = df.columns[2]
2) By applying a function along the third column of the DataFrame, as @ArunPrabhath suggested:
df['target'] = df[third_column].apply(lambda x: int('worm' in x.lower()))
malware_data['target'] = malware_data[3].apply(lambda row: 1 if ('worm' in row) else 0)

Drop column that starts with

I have a data frame that has multiple columns, example:
   Prod_A  Prod_B  Prod_C  State  Region
1       1       0       1      1       1
I would like to drop all columns that start with Prod_ (I can't select or drop by name because the data frame has 200 variables).
Is it possible to do this ?
Thank you
Use startswith to build the mask, then delete the columns with loc and boolean indexing:
df = df.loc[:, ~df.columns.str.startswith('Prod')]
print (df)
   State  Region
1      1       1
First, select all columns to be deleted:
unwanted = df.columns[df.columns.str.startswith('Prod_')]
Then, drop them all:
df.drop(unwanted, axis=1, inplace=True)
We can also use a negative regex:
In [269]: df.filter(regex=r'^(?!Prod_).*$')
Out[269]:
   State  Region
1      1       1
Drop all rows where the path column starts with /var:
df = df[~df['path'].map(lambda x: (str(x).startswith('/var')))]
This can be further simplified to:
df = df[~df['path'].str.startswith('/var')]
map + lambda offers more flexibility by letting you handle the raw cell values, not just strings. In the example below, rows are removed when they start with /var or are empty (nan, None, etc.):
df = df[~df['path'].map(lambda x: (str(x).startswith('/var') or not x))]
Drop all rows where the path column starts with /var or /tmp (you can also pass a tuple to startswith):
df = df[~df['path'].map(lambda x: (str(x).startswith(('/var', '/tmp'))))]
The tilde ~ is used for negation; if you instead wanted to keep all rows starting with /var, just remove the ~.
