Splitting a dataframe based on condition - python

I am trying to split my dataframe into two based on medical_plan_id: rows where it is empty go into df1, and rows where it is not empty into df2.
df1 = df_with_medicalplanid[df_with_medicalplanid['medical_plan_id'] == ""]
df2 = df_with_medicalplanid[df_with_medicalplanid['medical_plan_id'] is not ""]
The code below works, but if there are no empty fields, it raises TypeError("invalid type comparison"):
df1 = df_with_medicalplanid[df_with_medicalplanid['medical_plan_id'] == ""]
How to handle such situation?
My df_with_medicalplanid looks like below:
  wellthie_issuer_identifier  ... medical_plan_id
0                   UHC99806  ...            None
1                   UHC99806  ...            None

Use ==, not is, to test equality
Likewise, use != instead of is not for inequality.
is has a special meaning in Python. It returns True if two variables point to the same object, while == checks if the objects referred to by the variables are equal. See also Is there a difference between == and is in Python?.
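As a quick illustration (a minimal sketch; note that is can coincidentally return True for small interned strings, so never rely on it for value comparison):
a = 'hello world'
b = ''.join(['hello', ' ', 'world'])
print(a == b)  # True: the two values are equal
print(a is b)  # False here: two distinct string objects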
Don't repeat mask calculations
The Boolean masks you are creating are the most expensive part of your logic. It's also logic you want to avoid repeating manually, as your first and second masks are inverses of each other. You can therefore use the bitwise inverse ~ ("tilde"), also accessible via operator.invert, to negate an existing mask.
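A minimal sketch of both spellings of mask negation:
import operator
import pandas as pd

mask = pd.Series([True, False, True])
print(~mask)                  # negation via the tilde operator
print(operator.invert(mask))  # the same negation via operator.invert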
Empty strings are different to null values
Equality versus empty strings can be tested via == '', but equality versus null values requires a specialized method: pd.Series.isnull. This is because null values in the NumPy arrays backing Pandas are represented by np.nan, and np.nan != np.nan by design.
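A quick demonstration of both points (a minimal sketch):
import numpy as np
import pandas as pd

print(np.nan == np.nan)  # False: NaN is not equal to itself, by IEEE 754 design

s = pd.Series(['', None, 'x'])
print(s == '')     # True, False, False: only the empty string matches
print(s.isnull())  # False, True, False: only the null value matches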
If you want to replace empty strings with null values, you can do so:
df['medical_plan_id'] = df['medical_plan_id'].replace('', np.nan)
Conceptually, it makes sense for missing values to be null (np.nan) rather than empty strings. But the opposite of the above process, i.e. converting null values to empty strings, is also possible:
df['medical_plan_id'] = df['medical_plan_id'].fillna('')
If the difference matters, you need to know your data and apply the appropriate logic.
Semi-final solution
Assuming you do indeed have null values, calculate a single Boolean mask and its inverse:
mask = df['medical_plan_id'].isnull()
df1 = df[mask]
df2 = df[~mask]
Final solution: avoid extra variables
Creating additional variables is something you should, as a programmer, look to avoid. In this case, there's no need to create two new variables: you can use GroupBy with dict to get a dictionary of dataframes, with False (== 0) and True (== 1) keys corresponding to your masks:
dfs = dict(tuple(df.groupby(df['medical_plan_id'].isnull())))
Then dfs[0] represents df2 and dfs[1] represents df1 (see also this related answer). As a variant of the above, you can forgo the dictionary construction and use Pandas GroupBy methods:
dfs = df.groupby(df['medical_plan_id'].isnull())
dfs.get_group(0) # equivalent to dfs[0] from dict solution
dfs.get_group(1) # equivalent to dfs[1] from dict solution
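Note that the group keys here are actual booleans; 0 and 1 only work because False == 0 and True == 1 in Python. The explicit spelling, which is arguably clearer, is:
dfs.get_group(False)  # non-null medical_plan_id, i.e. df2
dfs.get_group(True)   # null medical_plan_id, i.e. df1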
Example
Putting all the above in action:
import numpy as np
import pandas as pd

df = pd.DataFrame({'medical_plan_id': [np.nan, '', 2134, 4325, 6543, '', np.nan],
                   'values': [1, 2, 3, 4, 5, 6, 7]})
df['medical_plan_id'] = df['medical_plan_id'].replace('', np.nan)
dfs = dict(tuple(df.groupby(df['medical_plan_id'].isnull())))
print(dfs[0], dfs[1], sep='\n'*2)
   medical_plan_id  values
2           2134.0       3
3           4325.0       4
4           6543.0       5

   medical_plan_id  values
0              NaN       1
1              NaN       2
5              NaN       6
6              NaN       7

Another variant is to unpack df.groupby, which returns an iterator of tuples (the first item being the group key and the second being the dataframe).
Like this for instance:
cond = df_with_medicalplanid['medical_plan_id'] == ''
(_, df1), (_, df2) = df_with_medicalplanid.groupby(cond)
In Python, _ is conventionally used for variables you are not interested in keeping. I have split the code over two lines for readability.
Full example
import pandas as pd

df_with_medicalplanid = pd.DataFrame({
    'medical_plan_id': ['214212', '', '12251', '12421', ''],
    'value': 1
})
cond = df_with_medicalplanid['medical_plan_id'] == ''
(_, df1), (_, df2) = df_with_medicalplanid.groupby(cond)
print(df1)
Returns:
  medical_plan_id  value
0          214212      1
2           12251      1
3           12421      1


Lazy evaluate Pandas dataframe filters

I'm observing a behavior that's weird to me. Can anyone tell me how I can define a filter once and re-use it throughout my code?
>>> df = pd.DataFrame([1,2,3], columns=['A'])
>>> my_filter = df.A == 2
>>> df.loc[1] = 5
>>> df[my_filter]
   A
1  5
I expect my_filter to return an empty result, since none of the values in column A equal 2 any more.
I'm thinking about making a function that returns the filter and re-using that, but is there a more pythonic (as well as pandaic) way of doing this?
def get_my_filter(df):
    return df.A == 2

df[get_my_filter(df)]
# change df
df[get_my_filter(df)]
Masks are not dynamic: they capture the dataframe's values at the moment you define them. So if you still need to change the dataframe's values, you should swap lines 2 and 3, defining the mask only after the modification. That would work.
You applied the filter in the first place, so it was evaluated against the original data; changing a value in the row afterwards won't help.
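A minimal sketch of the function-based approach the question proposes, which recomputes the mask on each call (names taken from the question):
import pandas as pd

def get_my_filter(df):
    # build the mask fresh from the dataframe's current contents
    return df.A == 2

df = pd.DataFrame([1, 2, 3], columns=['A'])
df.loc[1] = 5
print(df[get_my_filter(df)])  # empty: no value in A equals 2 any more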
df = pd.DataFrame([1,2,3], columns=['A'])
my_filter = df.A == 2
print(my_filter)
'''
0    False
1     True
2    False
Name: A, dtype: bool
'''
As you can see, it returns a Series. If you change the data after this point, the mask will not update, because it represents the first version of the df. But you can define the filter as a string instead, and achieve what you want by evaluating that string with the eval() function:
df = pd.DataFrame([1,2,3], columns=['A'])
my_filter = 'df.A == 2'
df.loc[1] = 5
df[eval(my_filter)]
'''
Out[205]:
Empty DataFrame
Columns: [A]
Index: []
'''
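A variant of the same deferred-evaluation idea, using DataFrame.query instead of eval (a sketch; note that query expressions reference column names directly rather than df.A):
import pandas as pd

df = pd.DataFrame([1, 2, 3], columns=['A'])
my_filter = 'A == 2'  # evaluated only when query() is called
df.loc[1] = 5
print(df.query(my_filter))  # Empty DataFrame: the mask reflects the update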

ValueError: Can only compare identically-labeled DataFrame objects [duplicate]

I'm using Pandas to compare the outputs of two files loaded into two data frames (uat, prod):
...
uat = uat[['Customer Number','Product']]
prod = prod[['Customer Number','Product']]
print uat['Customer Number'] == prod['Customer Number']
print uat['Product'] == prod['Product']
print uat == prod
The first two match exactly:
74357    True
74356    True
Name: Customer Number, dtype: bool

74357    True
74356    True
Name: Product, dtype: bool
For the third print, I get an error:
Can only compare identically-labeled DataFrame objects
If the first two compared fine, what's wrong with the third?
Thanks
Here's a small example to demonstrate this (this only applied to DataFrames, not Series, until Pandas 0.19, where it applies to both):
In [1]: df1 = pd.DataFrame([[1, 2], [3, 4]])
In [2]: df2 = pd.DataFrame([[3, 4], [1, 2]], index=[1, 0])
In [3]: df1 == df2
Exception: Can only compare identically-labeled DataFrame objects
One solution is to sort the index first (Note: some functions require sorted indexes):
In [4]: df2.sort_index(inplace=True)
In [5]: df1 == df2
Out[5]:
      0     1
0  True  True
1  True  True
Note: == is also sensitive to the order of columns, so you may have to use sort_index(axis=1):
In [11]: df1.sort_index().sort_index(axis=1) == df2.sort_index().sort_index(axis=1)
Out[11]:
      0     1
0  True  True
1  True  True
Note: This can still raise (if the index/columns aren't identically labelled after sorting).
You can also try dropping the index column if it is not needed to compare:
print(df1.reset_index(drop=True) == df2.reset_index(drop=True))
I have used this same technique in a unit test like so:
from pandas.util.testing import assert_frame_equal  # moved to pandas.testing in modern versions
assert_frame_equal(actual.reset_index(drop=True), expected.reset_index(drop=True))
At the time this question was asked there wasn't another function in Pandas to test equality, but one has since been added: pandas.equals.
You use it like this:
df1.equals(df2)
Some differences from == are (see the sketch after this list):
You don't get the error described in the question
It returns a simple boolean.
NaN values in the same location are considered equal
Two DataFrames need to have the same dtypes to be considered equal; see this Stack Overflow question.
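A minimal sketch of the first three differences, assuming identically-labeled frames so that == does not raise:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'a': [1.0, np.nan]})
df2 = pd.DataFrame({'a': [1.0, np.nan]})

print(df1.equals(df2))           # True: a single boolean, and NaNs compare equal
print((df1 == df2).all().all())  # False: == treats NaN != NaN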
EDIT:
As pointed out in paperskilltrees' answer, index alignment is important. Apart from the solution provided there, another option is to sort the index of the DataFrames before comparing them. For df1 that would be df1.sort_index(inplace=True).
When you compare two DataFrames, you must ensure that the number of records in the first DataFrame matches the number of records in the second DataFrame. In our example, each of the two DataFrames had 4 records, with 4 products and 4 prices.
If, for example, one of the DataFrames had 5 products, while the other DataFrame had 4 products, and you tried to run the comparison, you would get the following error:
ValueError: Can only compare identically-labeled Series objects
This should work:
import pandas as pd
import numpy as np

firstProductSet = {'Product1': ['Computer', 'Phone', 'Printer', 'Desk'],
                   'Price1': [1200, 800, 200, 350]}
df1 = pd.DataFrame(firstProductSet, columns=['Product1', 'Price1'])

secondProductSet = {'Product2': ['Computer', 'Phone', 'Printer', 'Desk'],
                    'Price2': [900, 800, 300, 350]}
df2 = pd.DataFrame(secondProductSet, columns=['Product2', 'Price2'])

df1['Price2'] = df2['Price2']  # add the Price2 column from df2 to df1
df1['pricesMatch?'] = np.where(df1['Price1'] == df2['Price2'], 'True', 'False')  # new column flagging matching prices
df1['priceDiff?'] = np.where(df1['Price1'] == df2['Price2'], 0, df1['Price1'] - df2['Price2'])  # new column with the price difference
print(df1)
example from https://datatofish.com/compare-values-dataframes/
Flyingdutchman's answer is great but wrong: it uses DataFrame.equals, which will return False in your case.
Instead, you want to use DataFrame.eq, which will return True.
It seems that DataFrame.equals ignores the dataframe's index, while DataFrame.eq uses dataframes' indexes for alignment and then compares the aligned values. This is an occasion to quote the central gotcha of Pandas:
Here is a basic tenet to keep in mind: data alignment is intrinsic. The link between labels and data will not be broken unless done so explicitly by you.
As we can see in the following examples, data alignment is neither broken nor enforced unless explicitly requested, so we have three different situations.
No explicit instruction given as to the alignment: == (aka DataFrame.__eq__).
In [1]: import pandas as pd
In [2]: df1 = pd.DataFrame(index=[0, 1, 2], data={'col1':list('abc')})
In [3]: df2 = pd.DataFrame(index=[2, 0, 1], data={'col1':list('cab')})
In [4]: df1 == df2
---------------------------------------------------------------------------
...
ValueError: Can only compare identically-labeled DataFrame objects
Alignment is explicitly broken: DataFrame.equals, DataFrame.values, DataFrame.reset_index().
In [5]: df1.equals(df2)
Out[5]: False
In [9]: df1.values == df2.values
Out[9]:
array([[False],
[False],
[False]])
In [10]: (df1.values == df2.values).all().all()
Out[10]: False
Alignment is explicitly enforced: DataFrame.eq, DataFrame.sort_index().
In [6]: df1.eq(df2)
Out[6]:
col1
0 True
1 True
2 True
In [8]: df1.eq(df2).all().all()
Out[8]: True
My answer is as of pandas version 1.0.3.
Here I am showing a complete example of how to handle this error. I have padded the shorter dataframe with rows of zeros. Your dataframes can come from CSV or any other source.
import pandas as pd
import numpy as np

# df1 with 9 rows
df1 = pd.DataFrame({'Name': ['John', 'Mike', 'Smith', 'Wale', 'Marry', 'Tom', 'Menda', 'Bolt', 'Yuswa'],
                    'Age': [23, 45, 12, 34, 27, 44, 28, 39, 40]})

# df2 with 8 rows
df2 = pd.DataFrame({'Name': ['John', 'Mike', 'Wale', 'Marry', 'Tom', 'Menda', 'Bolt', 'Yuswa'],
                    'Age': [25, 45, 14, 34, 26, 44, 29, 42]})

# get lengths of df1 and df2
df1_len = len(df1)
df2_len = len(df2)
diff = df1_len - df2_len
rows_to_be_added1 = rows_to_be_added2 = 0

if diff < 0:
    rows_to_be_added1 = abs(diff)
else:
    rows_to_be_added2 = diff

# add empty rows to df1
if rows_to_be_added1 > 0:
    df1 = df1.append(pd.DataFrame(np.zeros((rows_to_be_added1, len(df1.columns))), columns=df1.columns))

# add empty rows to df2
if rows_to_be_added2 > 0:
    df2 = df2.append(pd.DataFrame(np.zeros((rows_to_be_added2, len(df2.columns))), columns=df2.columns))

# at this point we have two dataframes with the same number of rows, and maybe different indexes
# drop the indexes of both, so we can compare the dataframes and do other operations like update etc.
df2.reset_index(drop=True, inplace=True)
df1.reset_index(drop=True, inplace=True)

# add a new column to df1
df1['New_age'] = None
# compare the Age columns of df1 and df2; fill New_age with df2's Age where they match, else None
df1['New_age'] = np.where(df1['Age'] == df2['Age'], df2['Age'], None)

# drop the zero-padded rows where Name is 0.0
df2 = df2.drop(df2[df2['Name'] == 0.0].index)

# now we don't get the error: ValueError: Can only compare identically-labeled Series objects
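Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; a sketch of the same zero-padding step written with pd.concat instead (values are illustrative):
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'Name': ['John'], 'Age': [23]})
rows_to_be_added1 = 2  # pad df1 with two rows of zeros, as above

padding = pd.DataFrame(np.zeros((rows_to_be_added1, len(df1.columns))),
                       columns=df1.columns)
df1 = pd.concat([df1, padding], ignore_index=True)
print(df1)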
I found where the error was coming from in my case: the column-names list was accidentally enclosed in another list.
Consider the following example:
import numpy as np
import pandas as pd

column_names = ['warrior', 'eat', 'ok', 'monkeys']
df_good = pd.DataFrame(np.ones(shape=(6, 4)), columns=column_names)
df_good['ok'] < df_good['monkeys']
>>> 0    False
    1    False
    2    False
    3    False
    4    False
    5    False
df_bad = pd.DataFrame(np.ones(shape=(6, 4)), columns=[column_names])
df_bad['ok'] < df_bad['monkeys']
>>> ValueError: Can only compare identically-labeled DataFrame objects
And the thing is, you cannot visually distinguish the bad DataFrame from the good one just by printing it.
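You can distinguish them programmatically, though: the nested list produces a MultiIndex of columns, which a quick inspection of df.columns reveals (a minimal sketch):
import numpy as np
import pandas as pd

column_names = ['warrior', 'eat', 'ok', 'monkeys']
df_good = pd.DataFrame(np.ones(shape=(6, 4)), columns=column_names)
df_bad = pd.DataFrame(np.ones(shape=(6, 4)), columns=[column_names])

print(type(df_good.columns))  # a plain Index
print(type(df_bad.columns))   # a MultiIndex: the source of the comparison error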
In my case, I just passed the columns parameter explicitly when creating the dataframe, because the data from one SQL query came with column names and the data from the other did not.

Python: inconsistent handling of IF statement in loop

I have a dataframe df containing conditions and values.
import pandas as pd
df = pd.DataFrame({'COND': ['X', 'X', 'X', 'Y', 'Y', 'Y'], 'VALUE': [1, 2, 3, 1, 2, 3]})
Therefore df looks like:
COND  VALUE
   X      1
   X      2
   X      3
   Y      1
   Y      2
   Y      3
I'm using a loop to subset df according to COND, and write separate text files containing the values for each condition:
conditions = {'X', 'Y'}
for condition in conditions:
    df2 = df[df['COND'].isin([condition])][['VALUE']]
    df2.to_csv(condition + '_values.txt', header=False, index=False)
The end result is two text files, X_values.txt and Y_values.txt, both of which contain 1 2 3. Up until this point everything is working as expected.
I would like to further subset df for one condition only. For example, perhaps I want all values from condition Y, but ONLY values < 3 from condition X. In this scenario, X_values.txt should contain 1 2 and Y_values.txt should contain 1 2 3. I tried implementing this with an IF statement:
conditions = {'X', 'Y'}
for condition in conditions:
    if condition == 'X':
        df = df[df['VALUE'] < 3]
    df2 = df[df['COND'].isin([condition])][['VALUE']]
    df2.to_csv(condition + '_values.txt', header=False, index=False)
Here is where the inconsistency occurs. The above code works fine (i.e. X_values.txt contains 1 2, and Y_values.txt 1 2 3, as intended), but when I use if condition == 'Y' instead of if condition == 'X', it breaks, and both text files only contain 1 2.
In other words, if I specify the first element of conditions in the IF statement then it works as intended, however if I specify the second element then it breaks and applies the < 3 subset to values from both conditions.
What is going on here and how can I resolve it?
Thanks!
The problem you are encountering arises because you are overwriting df inside the loop.
conditions = {'X', 'Y'}
for condition in conditions:
    if condition == 'X':
        df = df[df['VALUE'] < 3]  # <-- HERE'S YOUR ISSUE
    df2 = df[df['COND'].isin([condition])][['VALUE']]
    df2.to_csv(condition + '_values.txt', header=False, index=False)
What slightly surprised me is that when you are looping over the set conditions you get condition = 'Y' first, then condition = 'X'. But as a set is an unordered collection (i.e. it doesn't claim to have an inherent order of its elements), this ought not to be too disturbing: python is just reading out the elements in the most internally convenient way.
You could use conditions = ['X', 'Y'] to loop over a list (an ordered collection) instead. Then it will do X first, then Y. However, if you do that you will get the same bug but in reverse (i.e. it works for if condition == 'Y' but not if condition == 'X').
This is because after the loop runs once, df has been reassigned to the subset of the original df that only contains values less than three. That's why you get only the values 1 and 2 in both files if the if condition statement triggers on the first pass through the loop.
Now for the fix:
conditions = ['X', 'Y']
for condition in conditions:
    csv_name = f"{condition}_values.txt"
    if condition == 'X':
        df_filter = f"VALUE < 3 & COND == '{condition}'"
    else:
        df_filter = f"COND == '{condition}'"
    df.query(df_filter).VALUE.to_csv(csv_name, header=False, index=False)
Here I've introduced the DataFrame.query method, which is typically more concise than trying to create a Boolean series to use as a mask as you were doing.
The f-string syntax only works on python 3.6+, if you're on a lower version then modify as appropriate (e.g. df_filter = "COND == '{}'".format(condition))
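If you would rather keep the original mask-based style, the key point is to filter into a new variable instead of rebinding df (a sketch under the question's setup):
import pandas as pd

df = pd.DataFrame({'COND': ['X', 'X', 'X', 'Y', 'Y', 'Y'],
                   'VALUE': [1, 2, 3, 1, 2, 3]})

for condition in ['X', 'Y']:  # a list, for deterministic order
    subset = df[df['COND'] == condition]
    if condition == 'X':
        subset = subset[subset['VALUE'] < 3]  # df itself is untouched
    subset['VALUE'].to_csv(f'{condition}_values.txt', header=False, index=False)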
We can write the conditions to a dict, then use map to filter the df before groupby:
cond = {'X': 2, 'Y': 3}  # per-condition cap on VALUE
subdf = df[df['VALUE'] <= df.COND.map(cond)]  # <=, so X keeps 1 2 and Y keeps 1 2 3
for x, y in subdf.groupby('COND'):
    y.to_csv(x + '_values.txt')
df = pd.DataFrame({'COND': ['X', 'X', 'X', 'Y', 'Y', 'Y'], 'VALUE': [1, 2, 3, 1, 2, 3]})
conditions = df.COND
for condition in conditions:
    print(condition)
    df2 = df[df['COND'].isin([condition])][['VALUE']]
    df2.to_csv(condition + '_values.txt', header=False, index=False)

for condition in conditions:
    if condition == 'X':
        df = df[df['VALUE'] < 3]
    df2 = df[df['COND'].isin([condition])][['VALUE']]
    df2.to_csv(condition + '_values.txt', header=False, index=False)
You didn't define the variable conditions, so it gave you an error. Try doing:
conditions = df.COND
before the for loop.

pandas lookup with long and nested conditions

I want to perform a lookup in a dataframe via pandas. But it would be built from a series of nested if-else statements, similar to what is outlined in Pandas dataframe add a field based on multiple if statements.
But I want to use up to 13 different variables, which seems likely to descend into chaos pretty quickly. Is there some notation or other nice feature that allows me to specify such long and nested conditions in pandas?
So far np.where() http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.where.html might be my best option.
Is there a shortcut if I would only match for equality in all conditions?
Am I forced to write out each conditional filter? Or could I just have a single expression that chooses the (single) lookup value to produce?
Edit: ideally I would not want to match
df.loc[df['column_name'] == some_value]
for each value, i.e. 13 * the number of categorical levels (let's assume 7) would be a lot of different values; especially if chained combinations of conditions like df.loc[df['first'] == some_value][df['second'] == otherValue1] occur.
Edit: a minimal example
df = pd.DataFrame({'ageGroup': [1, 2, 2, 1],
                   'first2DigitsOfPostcode': ['12', '23', '12', '12'],
                   'valueOfProduct': ['low', 'medum', 'high', 'low'],
                   'lookup_join_value': ['foo', 'bar', 'foo', 'baz']})
defines the lookup table, which was generated by a SQL query grouping by all the columns and aggregating the values (so, due to the Cartesian product, all value combinations should be represented in the lookup table).
A new record could look like
new_values = pd.DataFrame({'ageGroup': [1],
                           'first2DigitsOfPostcode': ['12'],
                           'valueOfProduct': ['low']})
How can I automate the lookup over all the conditions, assuming every condition requires a match by equality (if that makes it easier)?
I found
pd.lookup (Vectorized look-up of values in Pandas dataframe), which seems to work for a single column / condition.
Maybe a merge could be a solution? Python Pandas: DataFrame as a Lookup Table, but that does not really produce the desired lookup result.
Edit 2:
The second answer seems to be pretty interesting. But
mask = df.drop('lookup_join_value', axis=1).isin(new_values)
print(mask)
print(df[mask])
print(df[mask]['lookup_join_value'])
will unfortunately just return NaN for the lookup value.
Now that I better know what you're after, a dataframe merge is likely a much better choice:
In:  df.merge(new_values, how='inner')
Out:
   ageGroup first2DigitsOfPostcode lookup_join_value valueOfProduct
0         1                     12               foo            low
1         1                     12               baz            low
Certainly shorter than the other answer I gave! I'll leave the old one though in case it inspires someone else.
I think df.isin() is along the lines of what you're looking for.
Using your example df, and these two:
exists = pd.DataFrame({'ageGroup': [1],
                       'first2DigitsOfPostcode': ['12'],
                       'valueOfProduct': 'low'})

new = pd.DataFrame({'ageGroup': [1],
                    'first2DigitsOfPostcode': ['12'],
                    'valueOfProduct': 'high'})
Then you can check to see what values match, if all, or just some:
df.isin(exists.values[0])
Out[46]:
   ageGroup  first2DigitsOfPostcode  valueOfProduct
0      True                    True            True
1     False                   False           False
2     False                    True           False
3      True                    True            True

df.isin(new.values[0])
Out[47]:
   ageGroup  first2DigitsOfPostcode  valueOfProduct
0      True                    True           False
1     False                   False           False
2     False                    True            True
3      True                    True           False
If your "query" wasn't a dataframe but instead a list it wouldn't need the ".values[0]" bit. The problem with a dictionary is it tries to match the index as well.
It's not clear to me from your question exactly what you want returned, but you could then subset based on whether all (or some) of the rows are the same:
# Returns rows where every column matches
df[df.isin(exists.values[0]).values.all(axis=1)]

# Returns rows where the first two columns match
matches = df.isin(new.values[0]).values
df[[item == [True, True, False] for item in matches.tolist()]]
...There might be a smarter way to write that last one.
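One arguably smarter spelling of that last kind of subset is to keep everything as a Boolean DataFrame and reduce with all(axis=1), instead of comparing Python lists (a sketch, reusing the question's df with a plain list of lookup values):
import pandas as pd

df = pd.DataFrame({'ageGroup': [1, 2, 2, 1],
                   'first2DigitsOfPostcode': ['12', '23', '12', '12'],
                   'valueOfProduct': ['low', 'medum', 'high', 'low']})
lookup_values = [1, '12', 'low']  # membership is order-independent

# rows where every column's value appears among the lookup values
print(df[df.isin(lookup_values).all(axis=1)])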
df.query is an option if you can write the query as an expression using the column names:
So you can do:
query_string = 'some long (but valid) boolean query'
example from pandas:
>>> from numpy.random import randn
>>> from pandas import DataFrame
>>> df = DataFrame(randn(10, 2), columns=list('ab'))
>>> df.query('a > b')
# similar to this
>>> df[df.a > df.b]
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.query.html

Pandas "Can only compare identically-labeled DataFrame objects" error

I'm using Pandas to compare the outputs of two files loaded into two data frames (uat, prod):
...
uat = uat[['Customer Number','Product']]
prod = prod[['Customer Number','Product']]
print uat['Customer Number'] == prod['Customer Number']
print uat['Product'] == prod['Product']
print uat == prod
The first two match exactly:
74357 True
74356 True
Name: Customer Number, dtype: bool
74357 True
74356 True
Name: Product, dtype: bool
For the third print, I get an error:
Can only compare identically-labeled DataFrame objects. If the first two compared fine, what's wrong with the 3rd?
Thanks
Here's a small example to demonstrate this (which only applied to DataFrames, not Series, until Pandas 0.19 where it applies to both):
In [1]: df1 = pd.DataFrame([[1, 2], [3, 4]])
In [2]: df2 = pd.DataFrame([[3, 4], [1, 2]], index=[1, 0])
In [3]: df1 == df2
Exception: Can only compare identically-labeled DataFrame objects
One solution is to sort the index first (Note: some functions require sorted indexes):
In [4]: df2.sort_index(inplace=True)
In [5]: df1 == df2
Out[5]:
0 1
0 True True
1 True True
Note: == is also sensitive to the order of columns, so you may have to use sort_index(axis=1):
In [11]: df1.sort_index().sort_index(axis=1) == df2.sort_index().sort_index(axis=1)
Out[11]:
0 1
0 True True
1 True True
Note: This can still raise (if the index/columns aren't identically labelled after sorting).
You can also try dropping the index column if it is not needed to compare:
print(df1.reset_index(drop=True) == df2.reset_index(drop=True))
I have used this same technique in a unit test like so:
from pandas.util.testing import assert_frame_equal
assert_frame_equal(actual.reset_index(drop=True), expected.reset_index(drop=True))
At the time when this question was asked there wasn't another function in Pandas to test equality, but it has been added a while ago: pandas.equals
You use it like this:
df1.equals(df2)
Some differenes to == are:
You don't get the error described in the question
It returns a simple boolean.
NaN values in the same location are considered equal
2 DataFrames need to have the same dtype to be considered equal, see this stackoverflow question
EDIT:
As pointed out in #paperskilltrees answer index alignment is important. Apart from the solution provided there another option is to sort the index of the DataFrames before comparing the DataFrames. For df1 that would be df1.sort_index(inplace=True).
When you compare two DataFrames, you must ensure that the number of records in the first DataFrame matches with the number of records in the second DataFrame. In our example, each of the two DataFrames had 4 records, with 4 products and 4 prices.
If, for example, one of the DataFrames had 5 products, while the other DataFrame had 4 products, and you tried to run the comparison, you would get the following error:
ValueError: Can only compare identically-labeled Series objects
this should work
import pandas as pd
import numpy as np
firstProductSet = {'Product1': ['Computer','Phone','Printer','Desk'],
'Price1': [1200,800,200,350]
}
df1 = pd.DataFrame(firstProductSet,columns= ['Product1', 'Price1'])
secondProductSet = {'Product2': ['Computer','Phone','Printer','Desk'],
'Price2': [900,800,300,350]
}
df2 = pd.DataFrame(secondProductSet,columns= ['Product2', 'Price2'])
df1['Price2'] = df2['Price2'] #add the Price2 column from df2 to df1
df1['pricesMatch?'] = np.where(df1['Price1'] == df2['Price2'], 'True', 'False') #create new column in df1 to check if prices match
df1['priceDiff?'] = np.where(df1['Price1'] == df2['Price2'], 0, df1['Price1'] - df2['Price2']) #create new column in df1 for price diff
print (df1)
example from https://datatofish.com/compare-values-dataframes/
Flyingdutchman's answer is great but wrong: it uses DataFrame.equals, which will return False in your case.
Instead, you want to use DataFrame.eq, which will return True.
It seems that DataFrame.equals ignores the dataframe's index, while DataFrame.eq uses dataframes' indexes for alignment and then compares the aligned values. This is an occasion to quote the central gotcha of Pandas:
Here is a basic tenet to keep in mind: data alignment is intrinsic. The link between labels and data will not be broken unless done so explicitly by you.
As we can see in the following examples, the data alignment is neither broken, nor enforced, unless explicitly requested. So we have three different situations.
No explicit instruction given, as to the alignment: == aka DataFrame.__eq__,
In [1]: import pandas as pd
In [2]: df1 = pd.DataFrame(index=[0, 1, 2], data={'col1':list('abc')})
In [3]: df2 = pd.DataFrame(index=[2, 0, 1], data={'col1':list('cab')})
In [4]: df1 == df2
---------------------------------------------------------------------------
...
ValueError: Can only compare identically-labeled DataFrame objects
Alignment is explicitly broken: DataFrame.equals, DataFrame.values, DataFrame.reset_index(),
In [5]: df1.equals(df2)
Out[5]: False
In [9]: df1.values == df2.values
Out[9]:
array([[False],
[False],
[False]])
In [10]: (df1.values == df2.values).all().all()
Out[10]: False
Alignment is explicitly enforced: DataFrame.eq, DataFrame.sort_index(),
In [6]: df1.eq(df2)
Out[6]:
col1
0 True
1 True
2 True
In [8]: df1.eq(df2).all().all()
Out[8]: True
My answer is as of pandas version 1.0.3.
Here I am showing a complete example of how to handle this error. I have added rows with zeros. You can have your dataframes from csv or any other source.
import pandas as pd
import numpy as np
# df1 with 9 rows
df1 = pd.DataFrame({'Name':['John','Mike','Smith','Wale','Marry','Tom','Menda','Bolt','Yuswa',],
'Age':[23,45,12,34,27,44,28,39,40]})
# df2 with 8 rows
df2 = pd.DataFrame({'Name':['John','Mike','Wale','Marry','Tom','Menda','Bolt','Yuswa',],
'Age':[25,45,14,34,26,44,29,42]})
# get lengths of df1 and df2
df1_len = len(df1)
df2_len = len(df2)
diff = df1_len - df2_len
rows_to_be_added1 = rows_to_be_added2 = 0
# rows_to_be_added1 = np.zeros(diff)
if diff < 0:
rows_to_be_added1 = abs(diff)
else:
rows_to_be_added2 = diff
# add empty rows to df1
if rows_to_be_added1 > 0:
df1 = df1.append(pd.DataFrame(np.zeros((rows_to_be_added1,len(df1.columns))),columns=df1.columns))
# add empty rows to df2
if rows_to_be_added2 > 0:
df2 = df2.append(pd.DataFrame(np.zeros((rows_to_be_added2,len(df2.columns))),columns=df2.columns))
# at this point we have two dataframes with the same number of rows, and maybe different indexes
# drop the indexes of both, so we can compare the dataframes and other operations like update etc.
df2.reset_index(drop=True, inplace=True)
df1.reset_index(drop=True, inplace=True)
# add a new column to df1
df1['New_age'] = None
# compare the Age column of df1 and df2, and update the New_age column of df1 with the Age column of df2 if they match, else None
df1['New_age'] = np.where(df1['Age'] == df2['Age'], df2['Age'], None)
# drop rows where Name is 0.0
df2 = df2.drop(df2[df2['Name'] == 0.0].index)
# now we don't get the error ValueError: Can only compare identically-labeled Series objects
I found where the error is coming from in my case:
The problem was that column names list was accidentally enclosed in another list.
Consider following example:
column_names=['warrior','eat','ok','monkeys']
df_good = pd.DataFrame(np.ones(shape=(6,4)),columns=column_names)
df_good['ok'] < df_good['monkeys']
>>> 0 False
1 False
2 False
3 False
4 False
5 False
df_bad = pd.DataFrame(np.ones(shape=(6,4)),columns=[column_names])
df_bad ['ok'] < df_bad ['monkeys']
>>> ValueError: Can only compare identically-labeled DataFrame objects
And the thing is you cannot visually distinguish the bad DataFrame from good.
In my case i just write directly param columns in creating dataframe, because data from one sql-query was with names, and without in other

Categories