How to compare columns in pandas dataframes with NaN values - python

I need to merge two dataframes with identical columns (_id (str), id (int), logistic_flow (str), last_delivery_date (datetime), delivery_scheme (str)) and find the rows where any of those columns disagree:
df_orders_greenplum['delivery_scheme'] = df_orders_greenplum['delivery_scheme'].astype('str')
df_orders_greenplum['logistic_flow'] = df_orders_greenplum['logistic_flow'].astype('str')
df_orders_mongodb = df_orders_mongodb.astype(df_orders_greenplum.dtypes)
df_orders_mongodb.index = df_orders_mongodb.index.map(str)
df_orders_greenplum.index = df_orders_greenplum.index.map(str)
merged_df = df_orders_mongodb.merge(df_orders_greenplum, how='left',
                                    on=['_id'],
                                    suffixes=('_source', '_dest'))
merged_df = merged_df.replace(np.nan, None)
orders_with_discrepancies = merged_df.loc[
    (merged_df.delivery_scheme_source != merged_df.delivery_scheme_dest)
    | (merged_df.logistic_flow_source != merged_df.logistic_flow_dest)
    | (merged_df.id_source != merged_df.id_dest)
    | (merged_df.last_delivery_date_source != merged_df.last_delivery_date_dest)
]
And I get the following result (rows that are in fact identical in every column):
The rows are completely identical and should not appear in the output of the merge and filter, yet they do. I think the issue is the None values: when I drop the (merged_df.logistic_flow_source != merged_df.logistic_flow_dest) condition, the output becomes empty. What is the right way to compare None values?
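One common way to handle this (a sketch, not a verified drop-in fix for your data): make the inequality itself NaN-aware so that two missing values count as equal, instead of relying on replace(np.nan, None). The helper name ne_nan_safe is mine; merged_df is the frame built above.

import pandas as pd

def ne_nan_safe(a: pd.Series, b: pd.Series) -> pd.Series:
    """Elementwise 'not equal' that treats two missing values (NaN/NaT/None) as equal."""
    return (a != b) & ~(a.isna() & b.isna())

mask = (
    ne_nan_safe(merged_df.delivery_scheme_source, merged_df.delivery_scheme_dest)
    | ne_nan_safe(merged_df.logistic_flow_source, merged_df.logistic_flow_dest)
    | ne_nan_safe(merged_df.id_source, merged_df.id_dest)
    | ne_nan_safe(merged_df.last_delivery_date_source, merged_df.last_delivery_date_dest)
)
orders_with_discrepancies = merged_df.loc[mask]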

Related

How to compare two columns in a grouped pandas dataframe?

I am unable to compare two columns inside a grouped pandas dataframe.
I used the groupby method to group the rows by two columns.
I am required to get the list of fields where the predicted value does not match the actual value.
file_name | page_no | field_name | value | predicted_value | actual_value
--------------------------------------------------------------------------
A         | 1       | a          | 1     | zx              | zx
A         | 2       | b          | 0     | xt              | xi
B         | 1       | a          | 1     | qw              | qw
B         | 2       | b          | 0     | xr              | xe
desired output:
b
Because b is the only field that is causing the mismatch between the two columns
The following is my code:
groups = df1.groupby(['file_name', 'page_no'])
a = pd.DataFrame(columns=['file_name', 'page_no', 'value'])
for name, group in groups:
    lst = []
    if (group[group['predicted_value']] != group[group['actual_value']]):
        lst = lst.append(group[group['field_name']])
    print(lst)
Here, I'm trying to store the non-matching field names in a list, but I am getting a key error:
KeyError: "None of [Index(['A', '1234'])] are in the [columns]"
Here is a solution that tests the columns outside the groups:
df1 = df[df['predicted_value'] != df['actual_value']]
s = df.loc[df['predicted_value'] != df['actual_value'], 'field_name']
L = s.tolist()
Does this solve your problem?
# Create a new dataframe retrieving only the non-matching rows
df1 = df[df['predicted_value'] != df['actual_value']]
# Store the 'field_name' column as a list
lst = list(df1['field_name'])
print(lst)
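For completeness, a minimal runnable version using the sample data from the question (a sketch; the unique() call is my addition to collapse the repeated field names down to the single 'b' in the desired output):

import pandas as pd

df = pd.DataFrame({
    'file_name': ['A', 'A', 'B', 'B'],
    'page_no': [1, 2, 1, 2],
    'field_name': ['a', 'b', 'a', 'b'],
    'value': [1, 0, 1, 0],
    'predicted_value': ['zx', 'xt', 'qw', 'xr'],
    'actual_value': ['zx', 'xi', 'qw', 'xe'],
})

# Field names where the prediction disagrees with the actual value
mismatched = df.loc[df['predicted_value'] != df['actual_value'], 'field_name']
print(mismatched.unique().tolist())   # ['b']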

Is there a way to vectorize this function or improve its efficiency

This loop is intended to match subjects in df2 to subjects in df1 at a 1:4 ratio. The key here is randomly selecting subjects while avoiding redundancy: no subject should be matched twice. df1 has a few thousand subjects, whereas df2 has over one million. Every subject in df1 will be matched to four subjects in df2; df2 subjects that aren't matched will be left out. Does anyone have ideas for improving its efficiency? An approach that also conserves RAM would be ideal. Thanks.
for x in range(4):  # 1:4 matching
    for index, row in df1.iterrows():
        temp = df2.loc[(df2['matched'] != 1) & (df2['race_ethnicity'] == row['race_ethnicity']) & (df2['age'] == row['age']) & ((df2['date1'] > row['date2']) | (df2['date1'].isna()))]
        a = temp.sample()
        a['matched_subject'] = row['subject_id']
        a['matched'] = '1'
        a['possible_matches'] = len(temp)
It can be simplified to this, but I'd prefer to keep the 'possible_matches' column for diagnostics.
for x in range(4):  # 4 because 1:4 matching
    for index, row in df1.iterrows():
        a = df2.loc[(df2['matched_subject'] == '') & (df2['race_ethnicity'] == row['race_ethnicity']) & (df2['age'] == row['age']) & ((df2['date1'] > row['date2']) | (df2['date1'] == 0))].sample()
        a['matched_subject'] = row['pid']
Clarifications: All rows are unique in both DataFrames, representing a list of subjects that will be compared in a survival analysis. date1 is the datetime of the outcome-variable event and is present for a fraction of subjects in both DataFrames. date2 is the datetime of the independent-variable event, present for all df1 subjects and no df2 subjects. Sample inputs include:
subject_id (numeric)
race_ethnicity (str, 3 categories)
age (numeric)
matched (binary, indicates whether a df2 subject has been matched to one in df1)
matched_subject (numeric, the subject_id of the matched subject)
date1 (datetime; outcome variable for survival models, present for some subjects and not others in both dataframes)
date2 (datetime of the event that is the independent variable. Everyone in df1 has a date2, nobody in df2 has a date2. A df1 subject's date2 is their index date for survival analysis, and it also serves as the index date for their matched subjects, who have no date2 variable.)
We want to match four subjects in df2 for every one in df1. date2 in df1 will be the index date for the matched subjects in df2, hence the condition that ensures date1 (the outcome event) in df2 does not occur before date2 (the index event) in df1.
Here is an example of two df2 subjects matched to a df1 subject, along with an unmatched df1 subject and an unmatched df2 subject. The real df2 is large enough to match all df1 subjects with room to spare.
df1:
subject_id | race_ethnicity     | age | date1                                                                   | date2      | matched | matched_subject                                            | possible_matches
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
3a3r796e   | Non-Hispanic white | 55  | (can be present in df1 or df2; if present, must be after date2 in df1) | 2012-01-01 | 1       | 3a3r796e (matching is based on df1, so these are the same) | (only important for df2; not used in the analysis, just a diagnostic value)
1234abcd   | Non-Hispanic black | 58  | 2017-01-01                                                              | 2016-01-01 | 0       |                                                            |
df2:
subject_id | race_ethnicity     | age | date1      | date2                                     | matched | matched_subject | possible_matches
----------------------------------------------------------------------------------------------------------------------------------------------
5c69a756   | Non-Hispanic white | 55  | 2015-01-01 | (cannot be present in df2, by definition) | 1       | 3a3r796e        | 571
7as89f75   | Non-Hispanic white | 55  |            |                                           | 1       | 3a3r796e        | 571
6376asef   | Hispanic           | 42  | 2010-01-01 |                                           | 0       |                 |
First let's define some helper functions:
def generate_data(df1_len, df2_len, seed=42):
    """Generate random data to help test different algorithms"""
    np.random.seed(seed)
    d2 = np.random.randint(0, 3000, size=df1_len)
    df1 = pd.DataFrame({
        'subject_id': np.arange(df1_len),
        'race_ethnicity': np.random.choice(list('ABC'), df1_len),
        'age': np.random.randint(18, 100, df1_len),
        'date2': pd.Timestamp('2000-01-01') + pd.to_timedelta(d2, unit='D')
    })
    # date1 is only present for a random fraction of df2 subjects; pad the rest with NaN (NaT)
    d1 = np.random.randint(0, 3000, size=int(df2_len * np.random.rand()))
    d1 = np.hstack([d1, np.repeat(np.nan, df2_len - len(d1))])
    df2 = pd.DataFrame({
        'subject_id': np.arange(df2_len),
        'race_ethnicity': np.random.choice(list('ABC'), df2_len),
        'age': np.random.randint(18, 100, df2_len),
        'date1': pd.Timestamp('2000-01-01') + pd.to_timedelta(d1, unit='D')
    })
    return df1, df2
def verify(df1, df2):
    """Verify that df1 and df2 are matched according to the predefined rules"""
    tmp = df1.merge(df2, how='left', left_on='subject_id', right_on='matched_subject', suffixes=('_1', '_2'))
    assert (tmp['race_ethnicity_1'] == tmp['race_ethnicity_2']).all(), 'race_ethnicity does not match'
    assert (tmp['age_1'] == tmp['age_2']).all(), 'age does not match'
    assert ((tmp['date1'] > tmp['date2']) | tmp['date1'].isna()).all(), 'date1 must be NaT or greater than date2'
    assert tmp.groupby('matched_subject').size().eq(4).all(), 'Invalid match ratio'
    print('All is good')
The original solution
Allow me to make some changes in the interest of clarity. This version runs in ~28 seconds on my Mac:
df1, df2 = generate_data(500, 100_000)
df2['matched'] = False
df2['matched_subject'] = None
df2['possible_matches'] = None
for x in range(4):  # 1:4 matching
    for index, row in df1.iterrows():
        cond = (
            (df2['matched'] != 1) &
            (df2['race_ethnicity'] == row['race_ethnicity']) &
            (df2['age'] == row['age']) &
            ((df2['date1'] > row['date2']) | df2['date1'].isna())
        )
        temp = df2.loc[cond]
        if temp.empty:
            continue
        idx = temp.sample().index
        df2.loc[idx, 'matched_subject'] = row['subject_id']
        df2.loc[idx, 'matched'] = True
        df2.loc[idx, 'possible_matches'] = len(temp)
An improved version
By taking out the outer loop (for _ in range(4)), you can improve performance almost 4 times. The code below executed in 7s:
df1, df2 = generate_data(5000, 1_000_000)
df2['matched'] = False
df2['matched_subject'] = None
df2['possible_matches'] = None
for index, row in df1.iterrows():
    cond = (
        (df2['matched'] != 1) &
        (df2['race_ethnicity'] == row['race_ethnicity']) &
        (df2['age'] == row['age']) &
        ((df2['date1'] > row['date2']) | df2['date1'].isna())
    )
    temp = df2.loc[cond]
    if temp.empty:
        continue
    idx = temp.sample(4).index
    df2.loc[idx, 'matched_subject'] = row['subject_id']
    df2.loc[idx, 'matched'] = True
    df2.loc[idx, 'possible_matches'] = len(temp)
A further improved version
Taking the idea that working on multiple rows at once is faster than working on one at a time, we can loop over groups of rows with similar characteristics rather than over individual rows. This code runs in 600ms, or ~46x faster than the original version:
df1, df2 = generate_data(500, 100_000)

# Shuffle df2 so the matches will be random
df2 = df2.sample(frac=1)

# A dictionary to hold the result. Its keys are indexes in df2 and its
# values are indexes in df1
matches = {}

# We loop by group instead of by individual row
grouped1 = df1.groupby(['race_ethnicity', 'age', 'date2'])
grouped2 = df2.groupby(['race_ethnicity', 'age'])

for (race_ethnicity, age, date2), subset1 in grouped1:
    # Get all rows from df2 that have the same `race_ethnicity` and `age`
    subset2 = grouped2.get_group((race_ethnicity, age))

    # pd.Series is slow. Switch to np.array for speed
    index2 = subset2.index.to_numpy()
    date1 = subset2['date1'].to_numpy()

    # Since all rows in subset1 and subset2 already share the same
    # `race_ethnicity` and `age`, we only need to filter for two things:
    # 1. The relationship between `date1` and `date2`; and
    # 2. That the row in `df2` has NOT been matched before
    cond = (
        (np.isnat(date1) | (date1 > date2))  # np.isnat flags NaT in the datetime64 array
        & np.isin(index2, list(matches.keys()), invert=True)
    )

    # The match ratio
    index1 = np.repeat(subset1.index.to_numpy(), 4)

    # There is no way to know in advance how many rows in `subset2` will meet
    # the matching criteria:
    # * Ideally: cond.sum() == len(index1), i.e. 4 rows in `subset2` for every
    #   row in `subset1`
    # * If there are more matches than we need: we take the first
    #   `4 * len(subset1)` rows
    # * If there are not enough matches: e.g. 6 rows in `subset2` for 2 rows in
    #   `subset1`, some rows in `subset1` will have to accept < 4 matches
    n = min(cond.sum(), len(index1))
    matches.update({
        key: value for key, value in zip(index2[cond][:n], index1[:n])
    })

tmp = pd.DataFrame({
    'index2': list(matches.keys()),
    'index1': list(matches.values())
})
df2 = (
    df2.merge(tmp, how='left', left_index=True, right_on='index2')
       .merge(df1['subject_id'].to_frame('matched_subject'), how='left', left_on='index1', right_index=True)
       .drop(columns=['index1', 'index2'])
)
You can verify the solution:
verify(df1, df2)
# Output: All is good

How to dynamically match rows from two pandas dataframes

I have a large dataframe of urls and a smaller 2nd dataframe that contains columns of strings which I want to use to merge the two dataframes together. Data from the 2nd df will be used to populate the larger 1st df.
The matching strings can contain * wildcards (and more than one), but the order of the segments still matters; so "path/*path2" would match "exsample.com/eg_path/extrapath2.html" but not "exsample.com/eg_path2/path/test.html". How can I use the strings in the 2nd dataframe to merge the two dataframes together? There can be more than one matching string in the 2nd dataframe.
import pandas as pd

urls = {'url': ['https://stackoverflow.com/questions/56318782/', 'https://www.google.com/',
                'https://en.wikipedia.org/wiki/Python_(programming_language)', 'https://stackoverflow.com/questions/'],
        'hits': [1000, 500, 300, 7]}
metadata = {'group': ['group1', 'group2'],
            'matching_string_1': ['google', 'wikipedia*Python_'],
            'matching_string_2': ['stackoverflow*questions*56318782', '']}
result = {'url': ['https://stackoverflow.com/questions/56318782/', 'https://www.google.com/',
                  'https://en.wikipedia.org/wiki/Python_(programming_language)', 'https://stackoverflow.com/questions/'],
          'hits': [1000, 500, 300, 7],
          'group': ['group2', 'group1', 'group1', '']}

df1 = pd.DataFrame(urls)
df2 = pd.DataFrame(metadata)
what_I_am_after = pd.DataFrame(result)
Not very robust but gives the correct answer for my example.
import pandas as pd

urls = {'url': ['https://stackoverflow.com/questions/56318782/', 'https://www.google.com/',
                'https://en.wikipedia.org/wiki/Python_(programming_language)', 'https://stackoverflow.com/questions/'],
        'hits': [1000, 500, 300, 7]}
metadata = {'group': ['group1', 'group2'],
            'matching_string_1': ['google', 'wikipedia*Python_'],
            'matching_string_2': ['stackoverflow*questions*56318782', '']}
result = {'url': ['https://stackoverflow.com/questions/56318782/', 'https://www.google.com/',
                  'https://en.wikipedia.org/wiki/Python_(programming_language)', 'https://stackoverflow.com/questions/'],
          'hits': [1000, 500, 300, 7],
          'group': ['group2', 'group1', 'group1', '']}

df1 = pd.DataFrame(urls)
df2 = pd.DataFrame(metadata)

results = pd.DataFrame(columns=['url', 'hits', 'group'])
for index, row in df2.iterrows():
    for x in row[1:]:
        group = x.split('*')
        rx = "".join([str(x) + ".*" if len(x) > 0 else '' for x in group])
        if rx == "":
            continue
        filter = df1['url'].str.contains(rx, na=False, regex=True)
        if filter.any():
            temp = df1[filter]
            temp['group'] = row[0]
            results = results.append(temp)
d3 = df1.merge(results, how='outer', on=['url', 'hits'])
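A slightly more defensive way to build the regex (my sketch, not part of the original answer) is to escape each literal segment before joining with .*, so URL characters such as ( or ? are not treated as regex syntax. It reuses df1 and df2 from the setup above; the helper name wildcard_to_regex is mine:

import re

def wildcard_to_regex(pattern: str) -> str:
    """Turn 'a*b*c' into a regex requiring a, b and c in order, escaping each literal segment."""
    parts = [re.escape(p) for p in pattern.split('*') if p]
    return '.*'.join(parts)

df1['group'] = ''
for _, row in df2.iterrows():
    for pattern in row.iloc[1:]:          # the matching_string_* columns
        if not pattern:
            continue
        rx = wildcard_to_regex(pattern)
        hit = df1['url'].str.contains(rx, na=False, regex=True)
        df1.loc[hit, 'group'] = row['group']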

Speed Up Pandas DataFrame Groupby Apply

I have the following code that I found on another post here (and modified slightly). It works great and the output is just as I expect, however I am wondering if anyone has suggestions on speed improvements. I am comparing two dataframes with about 93,000 rows and 110 columns. It takes about 20 minutes for the groupby to complete. I have tried to think of ways to speed it up but haven't come across anything. I am trying to think of something now, before my data sizes increase in the future. I am also open to other ways of doing this!
### Function that is called to check values in the dataframe groupby
def report_diff(x):
    return 'SAME' if x[0] == x[1] else '{} | {}'.format(*x)
    # return '' if x[0] == x[1] else '{} | {}'.format(*x)

print("Concatenating CSV and XML data together...")
### Concat the dataframes together
df_all = pd.concat(
    [df_csv, df_xml],
    axis='columns',
    keys=['df_csv', 'df_xml'],
    join='outer',
)
print("Done")

print("Swapping column levels...")
### Display keys at the top of each column
df_final = df_all.swaplevel(axis='columns')[df_xml.columns[0:]]
print("Done")

df_final = df_final.fillna('None')

print("Grouping data and checking for matches...")
### Apply report_diff function to each row
df_excel = df_final.groupby(level=0, axis=1).apply(lambda frame: frame.apply(report_diff, axis=1))
You can use np.where and check where df_csv[df_xml.columns] is equal to df_xml: where it is True the value is 'SAME', otherwise join the values of both dataframes the way you do now.
SETUP
df_csv = pd.DataFrame({'a': range(4), 'b': [0, 0, 1, 1], 'c': list('abcd')})
df_xml = pd.DataFrame({'b': [0, 2, 3, 1], 'c': list('bbce')})
METHOD
df_excel = pd.DataFrame(np.where(df_csv[df_xml.columns] == df_xml,  # find where equal
                                 'SAME',                            # True
                                 df_csv[df_xml.columns].astype(str) + ' | ' + df_xml.astype(str)),  # False
                        columns=df_xml.columns,
                        index=df_xml.index)
print(df_excel)
b c
0 SAME a | b
1 0 | 2 SAME
2 1 | 3 SAME
3 SAME d | e
Which is the same result that I got with your method.
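One caveat worth adding (my note, not part of the original answer): with a plain == comparison, two missing values do not compare equal, so cells where both frames have NaN would come out as "nan | nan" rather than SAME. Filling the missing values with a sentinel first, as the question's own code does with fillna('None'), keeps them reading as SAME. A sketch, reusing df_csv, df_xml, pd and np from the setup above:

# Fill missing values with a sentinel before comparing, so NaN vs NaN counts as SAME
left = df_csv[df_xml.columns].fillna('None')
right = df_xml.fillna('None')

df_excel = pd.DataFrame(np.where(left == right,
                                 'SAME',
                                 left.astype(str) + ' | ' + right.astype(str)),
                        columns=df_xml.columns,
                        index=df_xml.index)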

Diff between two dataframes in pandas

I have two dataframes both of which have the same basic schema. (4 date fields, a couple of string fields, and 4-5 float fields). Call them df1 and df2.
What I want to do is basically get a "diff" of the two - where I get back all rows that are not shared between the two dataframes (not in the set intersection). Note, the two dataframes need not be the same length.
I tried using pandas.merge(how='outer') but I was not sure what column to pass in as the 'key' as there really isn't one and the various combinations I tried were not working. It is possible that df1 or df2 has two (or more) rows that are identical.
What is a good way to do this in pandas/Python?
Try this:
diff_df = pd.merge(df1, df2, how='outer', indicator='Exist')
diff_df = diff_df.loc[diff_df['Exist'] != 'both']
You will have a dataframe of all rows that don't exist in both df1 and df2.
IIUC:
You can use pd.Index.symmetric_difference
pd.concat([df1, df2]).loc[
    df1.index.symmetric_difference(df2.index)
]
You can use this function; the output is an ordered dict of 6 dataframes which you can write to Excel for further analysis.
'df1' and 'df2' refer to your input dataframes.
'uid' refers to the column or combination of columns that make up the unique key (e.g. 'Fruits').
'dedupe' (default=True) drops duplicates in df1 and df2 (refer to Step 4 in the comments).
'labels' (default=('df1', 'df2')) allows you to name the input dataframes. If a unique key exists in both dataframes but has different values in one or more columns, it is usually important to see those rows; they are placed one on top of the other and labelled with the dataframe name so you know which dataframe each row came from.
'drop' can take a list of columns to exclude from the comparison.
Here goes:
df1 = pd.DataFrame([['apple', '1'], ['banana', 2], ['coconut', 3]], columns=['Fruits', 'Quantity'])
df2 = pd.DataFrame([['apple', '1'], ['banana', 3], ['durian', 4]], columns=['Fruits', 'Quantity'])
dict1 = diff_func(df1, df2, 'Fruits')

In [10]: dict1['df1_only']
Out[10]:
    Fruits  Quantity
1  coconut         3

In [11]: dict1['df2_only']
Out[11]:
   Fruits  Quantity
3  durian         4

In [12]: dict1['Diff']
Out[12]:
   Fruits  Quantity df1 or df2
0  banana         2        df1
1  banana         3        df2

In [13]: dict1['Merge']
Out[13]:
  Fruits  Quantity
0  apple         1
Here is the code:
import pandas as pd
from collections import OrderedDict as od

def diff_func(df1, df2, uid, dedupe=True, labels=('df1', 'df2'), drop=[]):
    dict_df = {labels[0]: df1, labels[1]: df2}
    col1 = df1.columns.values.tolist()
    col2 = df2.columns.values.tolist()

    # There could be columns known to be different, hence allow the user to pass them as a list to be dropped.
    if drop:
        print('Ignoring columns {} in comparison.'.format(', '.join(drop)))
        col1 = list(filter(lambda x: x not in drop, col1))
        col2 = list(filter(lambda x: x not in drop, col2))
        df1 = df1[col1]
        df2 = df2[col2]

    # Step 1 - Check if no. of columns are the same:
    len_lr = len(col1), len(col2)
    assert len_lr[0] == len_lr[1], \
        'Cannot compare frames with different number of columns: {}.'.format(len_lr)

    # Step 2a - Check if the set of column headers are the same
    #           (order doesn't matter)
    assert set(col1) == set(col2), \
        'Left column headers are different from right column headers.' \
        + '\n   Left orphans: {}'.format(list(set(col1) - set(col2))) \
        + '\n   Right orphans: {}'.format(list(set(col2) - set(col1)))

    # Step 2b - Check if the column headers are in the same order
    if col1 != col2:
        print('[Note] Reordering right Dataframe...')
        df2 = df2[col1]

    # Step 3 - Check datatypes are the same [Order is important]
    if set((df1.dtypes == df2.dtypes).tolist()) - {True}:
        print('dtypes are not the same.')
        df_dtypes = pd.DataFrame({labels[0]: df1.dtypes, labels[1]: df2.dtypes, 'Diff': (df1.dtypes == df2.dtypes)})
        df_dtypes = df_dtypes[df_dtypes['Diff'] == False][[labels[0], labels[1], 'Diff']]
        print(df_dtypes)
    else:
        print('DataType check: Passed')

    # Step 4 - Check for duplicate rows
    if dedupe:
        for key, df in dict_df.items():
            if df.shape[0] != df.drop_duplicates().shape[0]:
                print(key + ': Duplicates exist, they will be dropped.')
                dict_df[key] = df.drop_duplicates()

    # Step 5 - Check for duplicate uids.
    if type(uid) == str or type(uid) == list:
        print('Uniqueness check: {}'.format(uid))
        for key, df in dict_df.items():
            count_uid = df.shape[0]
            count_uid_unique = df[uid].drop_duplicates().shape[0]
            var = [0, 1][count_uid_unique == df.shape[0]]  # <-- Round off to the nearest integer if it is 100%
            pct = round(100 * count_uid_unique / df.shape[0], var)
            print('{}: {} out of {} are unique ({}%).'.format(key, count_uid_unique, count_uid, pct))

    # Checks complete, begin merge. '''Remember to dedupe, provide labels for common_no_match'''
    dict_result = od()
    df_merge = pd.merge(df1, df2, on=col1, how='inner')
    if not df_merge.shape[0]:
        print('Error: Merged DataFrame is empty.')
    else:
        dict_result[labels[0]] = df1
        dict_result[labels[1]] = df2
        dict_result['Merge'] = df_merge
        if type(uid) == str:
            uid = [uid]

        if type(uid) == list:
            df1_only = df1.append(df_merge).reset_index(drop=True)
            df1_only['Duplicated'] = df1_only.duplicated(keep=False)  # keep=False marks all duplicates as True
            df1_only = df1_only[df1_only['Duplicated'] == False]
            df2_only = df2.append(df_merge).reset_index(drop=True)
            df2_only['Duplicated'] = df2_only.duplicated(keep=False)
            df2_only = df2_only[df2_only['Duplicated'] == False]

            label = labels[0] + ' or ' + labels[1]
            df_lc = df1_only.copy()
            df_lc[label] = labels[0]
            df_rc = df2_only.copy()
            df_rc[label] = labels[1]

            df_c = df_lc.append(df_rc).reset_index(drop=True)
            df_c['Duplicated'] = df_c.duplicated(subset=uid, keep=False)
            df_c1 = df_c[df_c['Duplicated'] == True]
            df_c1 = df_c1.drop('Duplicated', axis=1)
            df_uc = df_c[df_c['Duplicated'] == False]

            df_uc_left = df_uc[df_uc[label] == labels[0]]
            df_uc_right = df_uc[df_uc[label] == labels[1]]

            dict_result[labels[0] + '_only'] = df_uc_left.drop(['Duplicated', label], axis=1)
            dict_result[labels[1] + '_only'] = df_uc_right.drop(['Duplicated', label], axis=1)
            dict_result['Diff'] = df_c1.sort_values(uid).reset_index(drop=True)

    return dict_result
Set df2.columns = df1.columns
Now, set every column as the index: df1 = df1.set_index(df1.columns.tolist()), and similarly for df2.
You can now do df1.index.difference(df2.index) and df2.index.difference(df1.index); the two results are the rows that are unique to each dataframe.
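A runnable sketch of this recipe (the example frames are my own, borrowing the fruit data used earlier in this thread):

import pandas as pd

df1 = pd.DataFrame({'Fruits': ['apple', 'banana', 'coconut'], 'Quantity': [1, 2, 3]})
df2 = pd.DataFrame({'Fruits': ['apple', 'banana', 'durian'], 'Quantity': [1, 3, 4]})

df2.columns = df1.columns

# Use every column as the index so whole rows can be compared
df1_idx = df1.set_index(df1.columns.tolist())
df2_idx = df2.set_index(df2.columns.tolist())

only_in_df1 = df1_idx.index.difference(df2_idx.index)   # rows present only in df1
only_in_df2 = df2_idx.index.difference(df1_idx.index)   # rows present only in df2

print(only_in_df1.tolist())   # [('banana', 2), ('coconut', 3)]
print(only_in_df2.tolist())   # [('banana', 3), ('durian', 4)]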
With
left_df.merge(df, left_on=left_df.columns.tolist(), right_on=df.columns.tolist(), how='outer')
you can get the outer join result.
Similarly, you can get the inner join result, and then take the difference between the two, which is what you want.
