I have two data frames; df1 has Id and sendDate, and df2 has Id and actDate. The two DataFrames are not the same shape: df2 is a lookup table, and an Id may appear multiple times.
ex.
df1 = pd.DataFrame({"Id": [1, 1, 2, 3, 2],
                    "sendDate": ["2019-09-24", "2020-09-11", "2018-01-06", "2018-01-06", "2019-09-24"]})
df2 = pd.DataFrame({"Id": [1, 2, 2],
                    "actDate": ["2019-09-24", "2019-09-24", "2020-09-11"]})
I want to add a boolean column to df1 that is True when df1.Id == df2.Id and df1.sendDate == df2.actDate.
Expected output would add a column to df1:
df1 = pd.DataFrame({"Id": [1, 1, 2, 3, 2],
                    "sendDate": ["2019-09-24", "2020-09-11", "2018-01-06", "2018-01-06", "2019-09-24"],
                    "Match?": [True, False, False, False, True]})
I'm new to python from R, so please let me know what other info you may need.
Use isin and combine the boolean masks:
import pandas as pd
df1 = pd.DataFrame({"Id": [1, 1, 2, 3, 2],
                    "sendDate": ["2019-09-24", "2020-09-11",
                                 "2018-01-06", "2018-01-06",
                                 "2019-09-24"]})
df2 = pd.DataFrame({"Id": [1, 2, 2],
                    "actDate": ["2019-09-24", "2019-09-24", "2020-09-11"]})
df1['Match'] = (df1['Id'].isin(df2['Id'])) & (df1['sendDate'].isin(df2['actDate']))
print(df1)
Output:
   Id    sendDate  Match
0   1  2019-09-24   True
1   1  2020-09-11   True
2   2  2018-01-06  False
3   3  2018-01-06  False
4   2  2019-09-24   True
The .isin() approaches will find values where the ID and date entries don't necessarily appear together (e.g. Id=1 and date=2020-09-11 in your example). You can check for both by doing a .merge() and checking when df2's date field is not null:
df1['match'] = df1.merge(df2, how='left', left_on=['Id', 'sendDate'], right_on=['Id', 'actDate'])['actDate'].notnull()
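Run end to end on the sample frames from the question, this merge-based check reproduces the expected output (a minimal sketch; it assumes the (Id, actDate) pairs in df2 are unique, so the left merge does not add rows):
import pandas as pd

df1 = pd.DataFrame({"Id": [1, 1, 2, 3, 2],
                    "sendDate": ["2019-09-24", "2020-09-11", "2018-01-06", "2018-01-06", "2019-09-24"]})
df2 = pd.DataFrame({"Id": [1, 2, 2],
                    "actDate": ["2019-09-24", "2019-09-24", "2020-09-11"]})

# a non-null actDate after the left merge marks a row whose Id AND date
# appear together in df2
df1['match'] = df1.merge(df2, how='left',
                         left_on=['Id', 'sendDate'],
                         right_on=['Id', 'actDate'])['actDate'].notnull()
print(df1['match'].tolist())  # [True, False, False, False, True]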
A vectorized approach via NumPy:
import numpy as np
df1['Match'] = np.where((df1['Id'].isin(df2['Id'])) & (df1['sendDate'].isin(df2['actDate'])), True, False)
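As a side note, the mask inside np.where is already a boolean Series, so assigning it directly (as in the first answer) gives the same result; the np.where wrapper is optional here.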
You can use .isin():
df1['id_bool'] = df1.Id.isin(df2.Id)
df1['date_bool'] = df1.sendDate.isin(df2.actDate)
Check out the pandas.Series.isin documentation.
I have a DataFrame as follows:
import pandas as pd
df = pd.DataFrame({'Target': [0, 1, 2],
                   'Source': [1, 0, 3],
                   'Count': [1, 1, 1]})
I need to count how many times each Source and Target pair occurs, where (1, 0) and (0, 1) are treated as the same pair, so that pair's count is 2.
I need to do this several times as I have 79 nodes in total. Any help will be much appreciated.
import pandas as pd

# instantiate without the 'Count' column to start over
In [1]: df = pd.DataFrame({'Target': [0, 1, 2],
   ...:                    'Source': [1, 0, 3]})

In [2]: df
Out[2]:
   Target  Source
0       0       1
1       1       0
2       2       3
Counting pairs regardless of order is possible by converting to a numpy.ndarray and sorting each row so that unordered duplicates become identical:
In [3]: array = df.values
In [4]: array.sort(axis=1)
In [5]: array
Out[5]:
array([[0, 1],
       [0, 1],
       [2, 3]])
And then turn it back to a DataFrame to perform .value_counts():
In [6]: df_sorted = pd.DataFrame(array, columns=['value1', 'value2'])
In [7]: df_sorted.value_counts()
Out[7]:
value1  value2
0       1         2
2       3         1
dtype: int64
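The same result in one chained expression (a sketch on the same df; note that np.sort returns a sorted copy rather than sorting in place, and DataFrame.value_counts needs pandas 1.1+):
import numpy as np
import pandas as pd

df = pd.DataFrame({'Target': [0, 1, 2],
                   'Source': [1, 0, 3]})

# sort each row so unordered duplicates line up, then count identical rows
pair_counts = pd.DataFrame(np.sort(df[['Target', 'Source']].values, axis=1),
                           columns=['value1', 'value2']).value_counts()
print(pair_counts)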
I tried to select a single value of the column class from each group of my DataFrame after performing groupby on the columns first_register and second_register, but it did not seem to work.
Suppose I have a dataframe like this:
import numpy as np
import pandas as pd

df = pd.DataFrame({'class': [1, 1, 1, 2, 2, 2, 0, 0, 1],
                   'first_register': ["70/20", "70/20", "70/20", "71/20", "71/20", "71/20", np.nan, np.nan, np.nan],
                   'second_register': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, "72/20", "72/20", "73/20"]})
Here is what I tried, which did not work at all:
group_by_df = df.groupby(["first_register", "second_register"])
label_class = group_by_df["class"].unique()
print(label_class)
How can I select/access each single class label from each group of dataframe?
The desired output can be an ordered list like this to represent each class of each group from the first group to the final group:
label_class = [1, 2, 0, 1]
Use dropna=False so that groups whose keys contain NaN are kept (by default, groupby drops NaN group keys):
group_by_df = df.groupby(["first_register", "second_register"], dropna=False)
label_class = group_by_df["class"].unique()
first_register  second_register
70/20           NaN                [1]
71/20           NaN                [2]
NaN             72/20              [0]
                73/20              [1]
Name: class, dtype: object
If you know each group has exactly one unique class, or you just want the first or the last value:
label_class = group_by_df["class"].first()
Or:
label_class = group_by_df["class"].last()
Use GroupBy.first:
out = df.groupby(["first_register", "second_register"], dropna=False)["class"].first()
print (out)
first_register  second_register
70/20           NaN                1
71/20           NaN                2
NaN             72/20              0
                73/20              1
Name: class, dtype: int64
label_class = out.tolist()
print (label_class)
[1, 2, 0, 1]
Let df_1 and df_2 be:
In [1]: import pandas as pd
...: df_1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
...: df_2 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
In [2]: df_1
Out[2]:
a b
0 1 4
1 2 5
2 3 6
We add a row r to df_1:
In [3]: r = pd.DataFrame({'a': ['x'], 'b': ['y']})
   ...: df_1 = pd.concat([df_1, r], ignore_index=True)
In [4]: df_1
Out[4]:
a b
0 1 4
1 2 5
2 3 6
3 x y
We now remove the added row from df_1 and get the original df_1 back again:
In [5]: df_1 = pd.concat([df_1, r]).drop_duplicates(keep=False)
In [6]: df_1
Out[6]:
a b
0 1 4
1 2 5
2 3 6
In [7]: df_2
Out[7]:
a b
0 1 4
1 2 5
2 3 6
While df_1 and df_2 look identical, equals() returns False.
In [8]: df_1.equals(df_2)
Out[8]: False
I did research on SO but could not find a related question.
Am I doing something wrong? How do I get the correct result in this case?
(df_1 == df_2).all().all() returns True, but it is not suitable when df_1 and df_2 have different lengths (elementwise comparison raises for differently-shaped frames).
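For reference, a shape-guarded version of that elementwise check might look like this (frames_match is a hypothetical helper name; it assumes identical row and column labels and still ignores dtypes):
def frames_match(a, b):
    # compare shapes first: == raises on differently-shaped frames
    return a.shape == b.shape and bool((a == b).all().all())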
This again is a subtle one, well done for spotting it.
import pandas as pd
df_1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df_2 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
r = pd.DataFrame({'a': ['x'], 'b': ['y']})
df_1 = pd.concat([df_1, r], ignore_index=True)  # df.append was removed in pandas 2.0
df_1 = pd.concat([df_1, r]).drop_duplicates(keep=False)
df_1.equals(df_2)
from pandas.testing import assert_frame_equal  # pandas.util.testing is deprecated
assert_frame_equal(df_1, df_2)
Now we can see the issue as the assert fails.
AssertionError: Attributes of DataFrame.iloc[:, 0] (column name="a") are different
Attribute "dtype" are different
[left]: object
[right]: int64
Because strings were appended to integer columns, the integer columns were upcast to object dtype; this is why equals fails as well.
Use pandas.testing.assert_frame_equal(df_1, df_2, check_dtype=True), which will also check if the dtypes are the same.
(It will pick up in this case that your dtypes changed from int to object (string) when you appended, then deleted, a string row; pandas did not automatically coerce the dtype back down to a less expansive one.)
AssertionError: Attributes of DataFrame.iloc[:, 0] (column name="a") are different
Attribute "dtype" are different
[left]: object
[right]: int64
As per df.equals docs:
This function allows two Series or DataFrames to be compared against each other to see if they have the same shape and elements. NaNs in the same location are considered equal. The column headers do not need to have the same type, but the elements within the columns must be the same dtype.
So, df.equals will return True only when the elements have the same values and the dtypes are also the same.
When you add and delete the row from df_1, the dtypes changes from int to object, hence it returns False.
Explanation with your example:
In [1028]: df_1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
In [1029]: df_2 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
In [1031]: df_1.dtypes
Out[1031]:
a int64
b int64
dtype: object
In [1032]: df_2.dtypes
Out[1032]:
a int64
b int64
dtype: object
As seen above, the dtypes of both DataFrames are the same, hence the condition below returns True:
In [1030]: df_1.equals(df_2)
Out[1030]: True
Now after you add and remove the row:
In [1033]: r = pd.DataFrame({'a': ['x'], 'b': ['y']})
In [1034]: df_1 = pd.concat([df_1, r], ignore_index=True)
In [1036]: df_1 = pd.concat([df_1, r]).drop_duplicates(keep=False)
In [1038]: df_1.dtypes
Out[1038]:
a object
b object
dtype: object
The dtype has changed to object, hence the condition below returns False:
In [1039]: df_1.equals(df_2)
Out[1039]: False
If you still want it to return True, you need to change the dtypes back to int:
In [1042]: df_1 = df_1.astype(int)
In [1044]: df_1.equals(df_2)
Out[1044]: True
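More generally, instead of hard-coding int, you can realign df_1's dtypes to df_2's (a small sketch; astype accepts a column-to-dtype mapping):
df_1 = df_1.astype(df_2.dtypes.to_dict())  # cast each column back to df_2's dtypes
df_1.equals(df_2)                          # True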
Based on the comments of the others, in this case one can do:
from pandas.testing import assert_frame_equal
identical_df = True
try:
    assert_frame_equal(df_1, df_2, check_dtype=False)
except AssertionError:
    identical_df = False
For a pandas DataFrame of:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': [1, 1, 2, 1],
    'anomaly_score': [5, 10, 8, 100],
    'match_level_0': [np.nan, 1, 1, 1],
    'match_level_1': [np.nan, np.nan, 1, 1],
    'match_level_2': [np.nan, 1, 1, 1],
})
display(df)
df = df.groupby(['id', 'match_level_0']).agg(['mean', 'sum'])
I want to take the n largest values per group.
df.columns = ['__'.join(col).strip() for col in df.columns.values]
df.groupby(['id'])['anomaly_score__mean'].nlargest(2)
This works, but requires flattening the MultiIndex columns.
Instead I want to directly use,
df.groupby(['id'])[('anomaly_score', 'mean')].nlargest(2)
But this fails with the key not being found.
Interestingly, it works just fine when not grouping:
df[('anomaly_score', 'mean')].nlargest(2)
Grouping the Series by the first level of the MultiIndex works for me, though it seems like a bug that it does not work like your solution:
print (df[('anomaly_score', 'mean')].groupby(level=0).nlargest(2))
id  match_level_0
1   1.0              55
2   1.0               8
Name: (anomaly_score, mean), dtype: int64
print (df[('anomaly_score', 'mean')].groupby(level='id').nlargest(2))
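An apply-based variant also works, keeping the tuple column selection inside each group (a sketch; apply adds an extra index level to the result):
out = df.groupby(level='id').apply(lambda g: g[('anomaly_score', 'mean')].nlargest(2))
print(out)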
I'm trying to drop all columns from a df that start with any of a list of strings. I needed to copy these columns to their own dfs, and now want to drop them from a copy of the main df to make it easier to analyze.
df.columns = ["AAA1234", "AAA5678", "BBB1234", "BBB5678", "CCC123", "DDD123"...]
I ran some code that gave me DataFrames with these columns:
aaa.columns = ["AAA1234", "AAA5678"]
bbb.columns = ["BBB1234", "BBB5678"]
I did get the final df that I wanted, but my code felt rather clunky:
droplist_cols = [aaa, bbb]
droplist = []
for x in droplist_cols:
    for col in x.columns:
        droplist.append(col)
df1 = df.drop(labels=droplist, axis=1)
Columns of final df:
df1.columns = ["CCC123", "DDD123"...]
Is there a better way to do this?
--Edit for sample data--
df = pd.DataFrame([[1, 2, 3, 4, 5],
                   [1, 3, 4, 2, 1],
                   [4, 6, 9, 8, 3],
                   [1, 3, 4, 2, 1],
                   [3, 2, 5, 7, 1]],
                  columns=["AAA1234", "AAA5678", "BBB1234", "BBB5678", "CCC123"])
Desired result:
CCC123
0 5
1 1
2 3
3 1
4 1
IIUC. Let's begin with a DataFrame:
df = pd.DataFrame({"A": [0]})
Modify the DataFrame to include your columns:
df2 = df.reindex(columns=["AAA1234", "AAA5678", "BBB1234", "BBB5678", "CCC123", "DDD123"], fill_value=0)
Drop all columns starting with A:
df3 = df2.loc[:, ~df2.columns.str.startswith('A')]
If you need to drop columns starting with, say, A or B:
df3 = df2.loc[:, ~(df2.columns.str.startswith('A') | df2.columns.str.startswith('B'))]
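Alternatively, using the sample df from the edit, you can build the drop list with a regex via df.filter (a sketch; the ^(AAA|BBB) pattern is an assumption generalizing the example prefixes):
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4, 5], [1, 3, 4, 2, 1], [4, 6, 9, 8, 3],
                   [1, 3, 4, 2, 1], [3, 2, 5, 7, 1]],
                  columns=["AAA1234", "AAA5678", "BBB1234", "BBB5678", "CCC123"])

# df.filter(regex=...) selects the matching columns; drop them to keep the rest
df1 = df.drop(columns=df.filter(regex=r'^(AAA|BBB)').columns)
print(df1)  # only CCC123 remains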