If I pass a dictionary to the isin method of a DataFrame, columns that are omitted from the dictionary are treated as all-False by default.
df = pd.DataFrame()
df['foo'] = pd.Series([0, 0, 1, 1, 5, 5], dtype='category')
df['bar'] = pd.Series([4, 4, 2, 2, 1, 1], dtype='category')
values = {'foo' : [0, 1]}
print(df.isin(values))
Out[1]:
foo bar
0 True False
1 True False
2 True False
3 True False
4 False False
5 False False
This is annoying* because if I have a dataframe with many columns and I only want to impose conditions on a subset of them, I still have to list all names of the other columns with all their possible values. Is there an elegant way to avoid that?
*I later want to select the rows in which a condition holds, row_mask = df.isin(values).all(axis=1) so I would like all columns for which no condition is imposed to be True.
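One way around this, as a minimal sketch using the sample frame above: apply isin only to the columns that appear in the dictionary, so columns without a condition never enter the row mask in the first place:

```python
import pandas as pd

df = pd.DataFrame()
df['foo'] = pd.Series([0, 0, 1, 1, 5, 5], dtype='category')
df['bar'] = pd.Series([4, 4, 2, 2, 1, 1], dtype='category')
values = {'foo': [0, 1]}

# Restrict the check to the keys of `values`; columns without a
# condition are excluded instead of defaulting to False.
row_mask = df[list(values)].isin(values).all(axis=1)
print(row_mask.tolist())  # [True, True, True, True, False, False]
```

Because the untested columns are dropped before the all(axis=1), there is no need to enumerate their possible values.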
I have two lists of pandas DataFrames, as follows:
import pandas as pd
import numpy as np
list_one = [pd.DataFrame({'sent_a.1': [0, 3, 2, 1], 'sent_a.2': [0, 1, 4, 0], 'sent_b.3': [0, 6, 0, 8],'sent_b.4': [1, 1, 8, 6],'ID':['id_1','id_1','id_1','id_1']}),
pd.DataFrame({'sent_a.1': [0, 3], 'sent_a.2': [0, 2], 'sent_b.3': [0, 6],'sent_b.4': [1, 1],'ID':['id_2','id_2']})]
list_two = [pd.DataFrame({'sent_a.1': [0, 5], 'sent_a.2': [0, 1], 'sent_b.3': [0, 6],'sent_b.4': [1, 1],'ID':['id_2','id_2']}),
pd.DataFrame({'sent_a.1': [0, 5, 3, 1], 'sent_a.2': [0, 2, 3, 1], 'sent_b.3': [0, 6, 6, 8],'sent_b.4': [1, 5, 8, 5],'ID':['id_1','id_1','id_1','id_1']})]
I would like to compare the DataFrames in these two lists: where the values are the same, replace the value with 'True', and where they differ, set it to 'False', saving the result in a different list of pandas DataFrames. I have done the following:
for dfs in list_one:
    for dfs2 in list_two:
        g = np.where(dfs == dfs2, 'True', 'False')
        print(g)
but I get the error,
ValueError: Can only compare identically-labeled DataFrame objects
how can I sort values in these two lists, based on the values from column 'ID'?
Edit
I would like the DataFrames that have the same value in column 'ID' to be compared, meaning that DataFrames with 'ID' == 'id_1' are compared with one another and DataFrames with 'ID' == 'id_2' with each other (not a cross comparison).
so the desired output is:
output = [ sent_a.1 sent_a.2 sent_b.3 sent_b.4 ID
0 True True True True id_1
1 False False True False id_1
2 False False False True id_1
3 False False True True id_1,
sent_a.1 sent_a.2 sent_b.3 sent_b.4 ID
0 True True True True id_2
1 True True False False id_2]
Based on your current example
For your first question:
how can I sort values in these two lists, based on the values from column 'ID'?
list_one = sorted(list_one, key=lambda x: x['ID'].unique()[0][3:], reverse=False)
list_two = sorted(list_two, key=lambda x: x['ID'].unique()[0][3:], reverse=False)
The ValueError: Can only compare identically-labeled DataFrame objects error is raised when the DataFrames have differently ordered index values or different shapes.
First way of comparison:
for dfs in list_one:
    for dfs2 in list_two:
        if dfs.shape == dfs2.shape:
            g = np.where(dfs == dfs2, 'True', 'False')
            print(g)
Second way:
I would like the dataframes that have the same value for column 'ID' to be compared
for dfs in list_one:
    for dfs2 in list_two:
        if (dfs['ID'].unique()[0] == dfs2['ID'].unique()[0]) and (dfs.shape == dfs2.shape):
            g = np.where(dfs == dfs2, 'True', 'False')
            print(g)
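Putting the sort and the comparison together, here is one sketch that pairs the frames by ID (it assumes each frame carries a single ID and that ID-matched frames have the same shape, as in the sample):

```python
import pandas as pd

list_one = [pd.DataFrame({'sent_a.1': [0, 3, 2, 1], 'sent_a.2': [0, 1, 4, 0],
                          'sent_b.3': [0, 6, 0, 8], 'sent_b.4': [1, 1, 8, 6],
                          'ID': ['id_1'] * 4}),
            pd.DataFrame({'sent_a.1': [0, 3], 'sent_a.2': [0, 2],
                          'sent_b.3': [0, 6], 'sent_b.4': [1, 1],
                          'ID': ['id_2'] * 2})]
list_two = [pd.DataFrame({'sent_a.1': [0, 5], 'sent_a.2': [0, 1],
                          'sent_b.3': [0, 6], 'sent_b.4': [1, 1],
                          'ID': ['id_2'] * 2}),
            pd.DataFrame({'sent_a.1': [0, 5, 3, 1], 'sent_a.2': [0, 2, 3, 1],
                          'sent_b.3': [0, 6, 6, 8], 'sent_b.4': [1, 5, 8, 5],
                          'ID': ['id_1'] * 4})]

# Pair the frames by their ID value rather than by list position.
by_id_one = {df['ID'].iat[0]: df for df in list_one}
by_id_two = {df['ID'].iat[0]: df for df in list_two}

output = []
for key in sorted(by_id_one):
    a, b = by_id_one[key], by_id_two[key]
    value_cols = [c for c in a.columns if c != 'ID']
    # Element-wise equality on the underlying arrays sidesteps the
    # "identically-labeled" requirement of DataFrame == DataFrame.
    cmp = pd.DataFrame(a[value_cols].to_numpy() == b[value_cols].to_numpy(),
                       columns=value_cols)
    cmp['ID'] = a['ID'].to_numpy()
    output.append(cmp)

print(output[0])
```

Comparing the raw arrays also avoids the index-alignment error entirely, since no label matching takes place.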
I have two DataFrames; df1 has Id and sendDate, and df2 has Id and actDate. The two DataFrames are not the same shape: df2 is a lookup table, and there may be multiple instances of an Id.
ex.
df1 = pd.DataFrame({"Id": [1, 1, 2, 3, 2],
"sendDate": ["2019-09-24", "2020-09-11", "2018-01-06", "2018-01-06", "2019-09-24"]})
df2 = pd.DataFrame({"Id": [1, 2, 2],
"actDate": ["2019-09-24", "2019-09-24", "2020-09-11"]})
I want to add a boolean True/False in df1 to find when df1.Id == df2.Id and df1.sendDate == df2.actDate.
Expected output would add a column to df1:
df1 = pd.DataFrame({"Id": [1, 1, 2, 3, 2],
"sendDate": ["2019-09-24", "2020-09-11", "2018-01-06", "2018-01-06", "2019-09-24"],
"Match?": [True, False, False, False, True]})
I'm new to python from R, so please let me know what other info you may need.
Use isin and boolean indexing
import pandas as pd
df1 = pd.DataFrame({"Id": [1, 1, 2, 3, 2],
"sendDate": ["2019-09-24", "2020-09-11",
"2018-01-06", "2018-01-06",
"2019-09-24"]})
df2 = pd.DataFrame({"Id": [1, 2, 2],
"actDate": ["2019-09-24", "2019-09-24", "2020-09-11"]})
df1['Match'] = (df1['Id'].isin(df2['Id'])) & (df1['sendDate'].isin(df2['actDate']))
print(df1)
Output:
Id sendDate Match
0 1 2019-09-24 True
1 1 2020-09-11 True
2 2 2018-01-06 False
3 3 2018-01-06 False
4 2 2019-09-24 True
The .isin() approaches will find values where the ID and date entries don't necessarily appear together (e.g. Id=1 and date=2020-09-11 in your example). You can check for both by doing a .merge() and checking when df2's date field is not null:
df1['match'] = df1.merge(df2, how='left', left_on=['Id', 'sendDate'], right_on=['Id', 'actDate'])['actDate'].notnull()
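Spelled out with the sample data, the merge-based check looks like this (a sketch; it relies on the (Id, actDate) pairs in df2 being unique, so the left merge adds no rows and the result aligns with df1's index):

```python
import pandas as pd

df1 = pd.DataFrame({"Id": [1, 1, 2, 3, 2],
                    "sendDate": ["2019-09-24", "2020-09-11",
                                 "2018-01-06", "2018-01-06", "2019-09-24"]})
df2 = pd.DataFrame({"Id": [1, 2, 2],
                    "actDate": ["2019-09-24", "2019-09-24", "2020-09-11"]})

# A row matches only when the Id and the date occur together in df2.
df1['match'] = df1.merge(df2, how='left',
                         left_on=['Id', 'sendDate'],
                         right_on=['Id', 'actDate'])['actDate'].notnull()
print(df1['match'].tolist())  # [True, False, False, False, True]
```

Note that row 1 (Id=1, 2020-09-11) correctly comes out False here, matching the expected output, whereas the two independent isin checks mark it True.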
A vectorized approach via NumPy (note that np.where(mask, True, False) is equivalent to the boolean mask itself):
import numpy as np
df1['Match'] = np.where((df1['Id'].isin(df2['Id'])) & (df1['sendDate'].isin(df2['actDate'])), True, False)
You can use .isin():
df1['id_bool'] = df1.Id.isin(df2.Id)
df1['date_bool'] = df1.sendDate.isin(df2.actDate)
Check out the documentation for isin().
Let df_1 and df_2 be:
In [1]: import pandas as pd
...: df_1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
...: df_2 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
In [2]: df_1
Out[2]:
a b
0 1 4
1 2 5
2 3 6
We add a row r to df_1:
In [3]: r = pd.DataFrame({'a': ['x'], 'b': ['y']})
...: df_1 = df_1.append(r, ignore_index=True)
In [4]: df_1
Out[4]:
a b
0 1 4
1 2 5
2 3 6
3 x y
We now remove the added row from df_1 and get the original df_1 back again:
In [5]: df_1 = pd.concat([df_1, r]).drop_duplicates(keep=False)
In [6]: df_1
Out[6]:
a b
0 1 4
1 2 5
2 3 6
In [7]: df_2
Out[7]:
a b
0 1 4
1 2 5
2 3 6
While df_1 and df_2 are identical, equals() returns False.
In [8]: df_1.equals(df_2)
Out[8]: False
I did research on SO but could not find a related question.
Am I doing something wrong? How do I get the correct result in this case?
(df_1 == df_2).all().all() returns True, but it is not suitable for the case where df_1 and df_2 have different lengths.
This again is a subtle one, well done for spotting it.
import pandas as pd
df_1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df_2 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
r = pd.DataFrame({'a': ['x'], 'b': ['y']})
df_1 = df_1.append(r, ignore_index=True)
df_1 = pd.concat([df_1, r]).drop_duplicates(keep=False)
df_1.equals(df_2)
from pandas.testing import assert_frame_equal
assert_frame_equal(df_1, df_2)
Now we can see the issue as the assert fails.
AssertionError: Attributes of DataFrame.iloc[:, 0] (column name="a") are different
Attribute "dtype" are different
[left]: object
[right]: int64
Because you added strings to integer columns, the integer columns were upcast to object, and they remain object even after the string row is removed. This is why equals fails as well.
Use pandas.testing.assert_frame_equal(df_1, df_2, check_dtype=True), which will also check if the dtypes are the same.
(It will pick up in this case that your dtypes changed from int64 to object (string) when you appended, then deleted, a string row; pandas does not automatically coerce the dtype back down to a less expansive dtype.)
AssertionError: Attributes of DataFrame.iloc[:, 0] (column name="a") are different
Attribute "dtype" are different
[left]: object
[right]: int64
As per df.equals docs:
This function allows two Series or DataFrames to be compared against each other to see if they have the same shape and elements. NaNs in the same location are considered equal. The column headers do not need to have the same type, but the elements within the columns must be the same dtype.
So, df.equals will return True only when the elements have the same values and the dtypes are also the same.
When you add and then delete the row from df_1, the dtypes change from int64 to object, hence it returns False.
Explanation with your example:
In [1028]: df_1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
In [1029]: df_2 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
In [1031]: df_1.dtypes
Out[1031]:
a int64
b int64
dtype: object
In [1032]: df_2.dtypes
Out[1032]:
a int64
b int64
dtype: object
So, if you see above, dtypes of both dfs are same, hence below condition returns True:
In [1030]: df_1.equals(df_2)
Out[1030]: True
Now after you add and remove the row:
In [1033]: r = pd.DataFrame({'a': ['x'], 'b': ['y']})
In [1034]: df_1 = df_1.append(r, ignore_index=True)
In [1036]: df_1 = pd.concat([df_1, r]).drop_duplicates(keep=False)
In [1038]: df_1.dtypes
Out[1038]:
a object
b object
dtype: object
dtype has changed to object, hence below condition returns False:
In [1039]: df_1.equals(df_2)
Out[1039]: False
If you still want it to return True, you need to change the dtypes back to int:
In [1042]: df_1 = df_1.astype(int)
In [1044]: df_1.equals(df_2)
Out[1044]: True
Based on the comments of the others, in this case one can do:
from pandas.testing import assert_frame_equal

identical_df = True
try:
    assert_frame_equal(df_1, df_2, check_dtype=False)
except AssertionError:
    identical_df = False
Good morning,
I have a dataframe that has only values of True and False, and I want to get the row indices where the value True exists.
I tried this:
[i for i in df_str[df_str.columns.values] if i== True]
But this returns an empty list.
How can I do this?
Here's a way to do that. I'm using synthetic data for the sake of demonstration.
df = pd.DataFrame({"a": np.random.choice([True, False], 10),
"b": np.random.choice([True, False], 10)})
print(df)
# a b
# 0 False True
# 1 True False
# 2 False True
# 3 True True
# 4 False False
# 5 True False
# 6 False True
# 7 True False
# 8 True False
# 9 True True
# 'a' and 'b' are the columns you'd like to search
df[df[["a", "b"]].sum(axis=1) > 0].index.to_list()
# ==> [0, 1, 2, 3, 5, 6, 7, 8, 9]
Here is a solution:
# for single column
df.index[df['col_name'] == True].tolist()
#for multiple columns
df[df[["a", "b"]].sum(axis=1) > 0].index.to_list()
The best way to get the indices where True is present (for the provided sample) is to use any. The following code will give you all the indices where any value in a particular row is True.
df=pd.DataFrame({"A":[True, False, False, True],"B":[True, True, False, False]})
indices=df[df.any(axis=1)].index
Expected Output
Int64Index([0, 1, 3], dtype='int64')
I want to do row comparisons by group based on a condition across 2 columns. This condition
is: (col1(i)-col1(j))*(col2(i)-col2(j)) <= 0, where we are comparing every row i with row j in columns col1 and col2. If the condition is satisfied for all row comparisons in the group, then set true for that group, else false.
data = {'group':['A', 'A', 'A', 'B', 'B', 'B'],
'col1':[1, 2, 3, 2, 3, 1], 'col2':[4, 3, 2, 2, 3, 1]}
df = pd.DataFrame(data)
df
with output
A True
B False
You can use shift to compare each row with the next one within its group, together with groupby + all to check whether every comparison in the group holds. Rows at a group boundary (where the next row belongs to a different group, including the last row) have no successor to compare against, so they are treated as satisfied:
cond = ((df['col1'] - df['col1'].shift(-1)) * (df['col2'] - df['col2'].shift(-1)) <= 0) | (df['group'] != df['group'].shift(-1))
cond.groupby(df['group']).all()
group
A True
B False
dtype: bool
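Note that the shift-based check only compares adjacent rows. If the condition must hold for every pair of rows i, j within a group, one way (a sketch, not part of the answer above) is to form all pairwise differences per group with np.subtract.outer:

```python
import numpy as np
import pandas as pd

data = {'group': ['A', 'A', 'A', 'B', 'B', 'B'],
        'col1': [1, 2, 3, 2, 3, 1],
        'col2': [4, 3, 2, 2, 3, 1]}
df = pd.DataFrame(data)

def all_pairs_hold(g):
    # Outer differences give col1(i) - col1(j) and col2(i) - col2(j)
    # for every pair of rows (i, j) in the group at once.
    d1 = np.subtract.outer(g['col1'].to_numpy(), g['col1'].to_numpy())
    d2 = np.subtract.outer(g['col2'].to_numpy(), g['col2'].to_numpy())
    return bool((d1 * d2 <= 0).all())

result = df.groupby('group')[['col1', 'col2']].apply(all_pairs_hold)
print(result.to_dict())  # {'A': True, 'B': False}
```

For the sample data the adjacent-row and all-pairs checks agree, but in general they can differ once a group has three or more rows.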