Speed Up Pandas DataFrame Groupby Apply - python

I have the following code that I found on another post here (and modified slightly). It works great and the output is just as I expect, but I am wondering if anyone has suggestions on speed improvements. I am comparing two dataframes with about 93,000 rows and 110 columns, and it takes about 20 minutes for the groupby to complete. I have tried to think of ways to speed it up but haven't come across anything, and I want to get ahead of this before my data sizes increase. I am also open to other ways of doing this!
### Function that is called to check values in dataframe groupby
def report_diff(x):
    return 'SAME' if x[0] == x[1] else '{} | {}'.format(*x)
    # return '' if x[0] == x[1] else '{} | {}'.format(*x)

print("Concatenating CSV and XML data together...")
### Concat the dataframes together
df_all = pd.concat(
    [df_csv, df_xml],
    axis='columns',
    keys=['df_csv', 'df_xml'],
    join='outer',
)
print("Done")

print("Swapping column levels...")
### Display keys at the top of each column
df_final = df_all.swaplevel(axis='columns')[df_xml.columns[0:]]
print("Done")

df_final = df_final.fillna('None')

print("Grouping data and checking for matches...")
### Apply report_diff function to each row
df_excel = df_final.groupby(level=0, axis=1).apply(lambda frame: frame.apply(report_diff, axis=1))

You can use np.where and check where df_csv[df_xml.columns] equals df_xml: where it is True the value is 'SAME', otherwise join the values of both dataframes as you already do.
SETUP
df_csv = pd.DataFrame({'a':range(4),'b':[0,0,1,1],'c':list('abcd')})
df_xml = pd.DataFrame({'b':[0,2,3,1],'c':list('bbce')})
METHOD
df_excel = pd.DataFrame(
    np.where(df_csv[df_xml.columns] == df_xml,                                  # find where equal
             'SAME',                                                            # True -> 'SAME'
             df_csv[df_xml.columns].astype(str) + ' | ' + df_xml.astype(str)),  # False -> joined values
    columns=df_xml.columns,
    index=df_xml.index)
print (df_excel)
       b      c
0   SAME  a | b
1  0 | 2   SAME
2  1 | 3   SAME
3   SAME  d | e
Which is the same result that I got with your method.
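For a rough sense of the speedup, here is a hypothetical timing harness (the frame sizes, the 10% change rate, and the vectorised helper name are all invented for illustration; absolute numbers will vary by machine):

import numpy as np
import pandas as pd
from timeit import timeit

# Build two toy frames of strings where roughly 10% of cells differ.
df_xml = pd.DataFrame(np.random.randint(0, 5, (10_000, 20))).astype(str)
df_csv = df_xml.mask(np.random.rand(10_000, 20) < 0.1, 'changed')

def vectorised():
    return pd.DataFrame(
        np.where(df_csv[df_xml.columns] == df_xml,
                 'SAME',
                 df_csv[df_xml.columns].astype(str) + ' | ' + df_xml.astype(str)),
        columns=df_xml.columns,
        index=df_xml.index)

# Ten runs of the vectorised version; the row-wise groupby/apply original
# takes far longer on the same data because it calls Python code per cell.
print(timeit(vectorised, number=10))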

Related

How to compare columns in pandas dataframes with NaN values

I need to merge 2 dataframes with identical columns: _id (str), id (int), logistic_flow (str), last_delivery_date (datetime), delivery_scheme (str), and find rows with discrepancies in these columns:
df_orders_greenplum['delivery_scheme'] = df_orders_greenplum['delivery_scheme'].astype('str')
df_orders_greenplum['logistic_flow'] = df_orders_greenplum['logistic_flow'].astype('str')
df_orders_mongodb = df_orders_mongodb.astype(df_orders_greenplum.dtypes)
df_orders_mongodb.index = df_orders_mongodb.index.map(str)
df_orders_greenplum.index = df_orders_greenplum.index.map(str)
merged_df = df_orders_mongodb.merge(df_orders_greenplum, how='left',
                                    on=['_id'],
                                    suffixes=('_source', '_dest'))
merged_df = merged_df.replace(np.nan, None)
orders_with_discrepancies = merged_df.loc[
    (merged_df.delivery_scheme_source != merged_df.delivery_scheme_dest)
    | (merged_df.logistic_flow_source != merged_df.logistic_flow_dest)
    | (merged_df.id_source != merged_df.id_dest)
    | (merged_df.last_delivery_date_source != merged_df.last_delivery_date_dest)
]
And I get the following result: rows that are absolutely identical still end up in the output of the merge and filter. I think the issue is in the None values (when I dropped (merged_df.logistic_flow_source != merged_df.logistic_flow_dest), the output became empty). What is the right way to compare None values?
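This is the usual culprit: NaN/None never compare equal with !=, so any row where both sides are missing is always flagged as a discrepancy. A minimal sketch of one fix, assuming the column names from the snippet above and a sentinel token that never occurs in the real data:

SENTINEL = '__MISSING__'  # assumption: this token never appears in the data

def differs(left, right):
    # fillna also replaces None in object columns, so "both missing" compares equal
    return left.fillna(SENTINEL) != right.fillna(SENTINEL)

orders_with_discrepancies = merged_df.loc[
    differs(merged_df.delivery_scheme_source, merged_df.delivery_scheme_dest)
    | differs(merged_df.logistic_flow_source, merged_df.logistic_flow_dest)
    | differs(merged_df.id_source, merged_df.id_dest)
    | differs(merged_df.last_delivery_date_source, merged_df.last_delivery_date_dest)
]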

Looping lambda function across multiple panda columns

I am struggling to loop a lambda function across multiple columns.
samp = pd.DataFrame({'ID': ['1', '2', '3'], 'A': ['1C22', '3X35', '2C77'],
                     'B': ['1C35', '2C88', '3X99'], 'C': ['3X56', '2C73', '1X91']})
Essentially, I am trying to add three columns to this dataframe with a 1 if there is a 'C' in the string and a 0 if not (i.e. an 'X').
This function works fine when I apply it as a lambda function to each column individually, but I'm applying it to 40 different columns and the code is (I'm assuming) unnecessarily clunky:
def is_correct(str):
    correct = len(re.findall('C', str))
    return correct

samp.A_correct = samp.A.apply(lambda x: is_correct(x))
samp.B_correct = samp.B.apply(lambda x: is_correct(x))
samp.C_correct = samp.C.apply(lambda x: is_correct(x))
I'm confident there is a way to loop this, but I have been unsuccessful thus far.
You can iterate over the columns:
import pandas as pd
import re

df = pd.DataFrame({'ID': ['1', '2', '3'], 'A': ['1C22', '3X35', '2C77'],
                   'B': ['1C35', '2C88', '3X99'], 'C': ['3X56', '2C73', '1X91']})

def is_correct(s):  # avoid shadowing the built-in str
    return len(re.findall('C', s))

# Skip 'ID' so only the answer columns are scored.
for col in df.columns.drop('ID'):
    df[col + '_correct'] = df[col].apply(is_correct)
Let's try apply and join:
samp.join(samp[['A', 'B', 'C']].add_suffix('_correct')
                               .apply(lambda x: x.str.contains('C'))
                               .astype(int)
          )
Output:
  ID     A     B     C  A_correct  B_correct  C_correct
0  1  1C22  1C35  3X56          1          1          0
1  2  3X35  2C88  2C73          0          1          1
2  3  2C77  3X99  1X91          1          0          0
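If a cell could contain more than one 'C', a str.count variant matches the original is_correct (which counts occurrences) more closely than the boolean str.contains; a small sketch using the samp frame from the question:

# str.count replicates len(re.findall('C', s)) per cell
counts = samp[['A', 'B', 'C']].apply(lambda col: col.str.count('C')).add_suffix('_correct')
samp = samp.join(counts)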

How to compare columns of two dataframes and have consequences when they match in Python Pandas

I am trying to have Python Pandas compare two dataframes with each other. In Dataframe 1, I have two columns (AC-Cat and Origin). I am trying to compare the AC-Cat column with the contents of Dataframe 2. If a match is found between one of the columns of Dataframe 2 and the value of Dataframe 1 being studied, I want Pandas to copy the header of the Dataframe 2 column in which the match is found to a new column in Dataframe 1.
DF1:
f = {'AC-Cat': pd.Series(['B737', 'A320', 'MD11']),
     'Origin': pd.Series(['AJD', 'JFK', 'LRO'])}
Flight_df = pd.DataFrame(f)
DF2:
w = {'CAT-C': pd.Series(['DC85', 'IL76', 'MD11', 'TU22', 'TU95']),
     'CAT-D': pd.Series(['A320', 'A321', 'AN12', 'B736', 'B737'])}
WCat_df = pd.DataFrame(w)
I imported pandas as pd and numpy as np and tried to define a function to compare these columns.
def get_wake_cat(AC_cat):
    try:
        Wcat = [WCat_df.columns.values[0]][WCat_df.iloc[:,1]==AC_cat].values[0]
    except:
        Wcat = np.NAN
    return Wcat

Flight_df.loc[:,'CAT'] = Flight_df.loc[:,'AC-Cat'].apply(lambda CT: get_wake_cat(CT))
However, the function does not result in the desired outputs. For example: Take the B737 AC-Cat value. I want Python Pandas to then find this value in DF2 in the column CAT-D and copy this header to the new column of DF 1. This does not happen. Can someone help me find out why my code is not giving the desired results?
Not pretty, but I think I got it working. Part of the error was that the function did not receive WCat_df. I also split the indexing into two steps:
def get_wake_cat(AC_cat, WCat_df):
    try:
        d = WCat_df[WCat_df.columns.values][WCat_df.iloc[:] == AC_cat]
        Wcat = d.columns[(d == AC_cat).any()][0]
    except:
        Wcat = np.nan
    return Wcat
Then you need to change your next line to:
Flight_df.loc[:,'CAT'] = Flight_df.loc[:,'AC-Cat'].apply(lambda CT: get_wake_cat(CT, WCat_df))
  AC-Cat Origin    CAT
0   B737    AJD  CAT-D
1   A320    JFK  CAT-D
2   MD11    LRO  CAT-C
Hope that solves the problem
This will give you 2 new columns with the name(s) of the match(es) found:
Flight_df['CAT1'] = Flight_df['AC-Cat'].map(lambda x: 'CAT-C' if x in list(WCat_df['CAT-C']) else '')
Flight_df['CAT2'] = Flight_df['AC-Cat'].map(lambda x: 'CAT-D' if x in list(WCat_df['CAT-D']) else '')
Flight_df.loc[Flight_df['CAT1'] == '', 'CAT1'] = Flight_df['CAT2']
Flight_df.loc[Flight_df['CAT1'] == Flight_df['CAT2'], 'CAT2'] = ''
IIUC, you can do a stack and merge:
final = (Flight_df.merge(WCat_df.stack().reset_index(1, name='AC-Cat'),
                         on='AC-Cat', how='left')
                  .rename(columns={'level_1': 'New'}))
print(final)
Or with melt:
final = Flight_df.merge(WCat_df.melt(var_name='New', value_name='AC-Cat'),
                        on='AC-Cat', how='left')
  AC-Cat Origin    New
0   B737    AJD  CAT-D
1   A320    JFK  CAT-D
2   MD11    LRO  CAT-C

Positional string-formatting on pandas DataFrame

I'm using python to automatise some processes at work. My final product has to be in excel format (formulas have to be there, and everything has to be traceable), so I work on a pandas DataFrame and then export the result to a .xlsx.
What I want to do is to create a pandas DataFrame that looks like this:
  ID                          Price                       Quantity   Total
0  A  =VLOOKUP(A2;'Sheet2'!A:J;6;0)  =VLOOKUP(A2;'Sheet2'!A:J;7;0)  =B2*C2
1  B  =VLOOKUP(A3;'Sheet2'!A:J;6;0)  =VLOOKUP(A3;'Sheet2'!A:J;7;0)  =B3*C3
2  C  =VLOOKUP(A4;'Sheet2'!A:J;6;0)  =VLOOKUP(A4;'Sheet2'!A:J;7;0)  =B4*C4
3  D  =VLOOKUP(A5;'Sheet2'!A:J;6;0)  =VLOOKUP(A5;'Sheet2'!A:J;7;0)  =B5*C5
4  E  =VLOOKUP(A6;'Sheet2'!A:J;6;0)  =VLOOKUP(A6;'Sheet2'!A:J;7;0)  =B6*C6
As you can see, the first row's formulas reference A2, B2 and C2; the second row references A3, B3 and C3; row n references A(n+2), B(n+2) and C(n+2). The DataFrame has about 3,000 rows.
I want to generate this dataframe with a few lines of code, but I haven't got the expected result. I thought positional string formatting would do it:
df = pd.DataFrame()
df['temp'] = range(3000)
df['Price'] = """=VLOOKUP(A{0};'Sheet2'!A:J;6;0)""" .format(df.index + 2)
df['Quantity'] = """=VLOOKUP(A{0};'Sheet2'!A:J;7;0)""" .format(df.index + 2)
df['Total'] = """=B{0}*C{0}""" .format(df.index + 2)
df.drop('temp', axis=1, inplace=True)
Unfortunately it doesn't work. It returns something like this:
"=VLOOKUP(ARangeIndex(start=2, stop=3002, step=1);'Sheet2'!A:J;6;0)"
Does anyone have any suggestion on how to do this?
Thanks!
Try vectorised string concatenation:
df = pd.DataFrame(index=range(2000)) # no need for temp here, btw
idx = (df.index + 2).astype(str)
df['Price'] = "=VLOOKUP(A" + idx + ";'Sheet2'!A:J;6;0)"
A similar process follows for the remainder of your columns:
df['Quantity'] = "=VLOOKUP(A" + idx + ";'Sheet2'!A:J;7;0)"
df['Total'] = '=B' + idx + '*C' + idx
df.head()
                           Price                       Quantity   Total
0  =VLOOKUP(A2;'Sheet2'!A:J;6;0)  =VLOOKUP(A2;'Sheet2'!A:J;7;0)  =B2*C2
1  =VLOOKUP(A3;'Sheet2'!A:J;6;0)  =VLOOKUP(A3;'Sheet2'!A:J;7;0)  =B3*C3
2  =VLOOKUP(A4;'Sheet2'!A:J;6;0)  =VLOOKUP(A4;'Sheet2'!A:J;7;0)  =B4*C4
3  =VLOOKUP(A5;'Sheet2'!A:J;6;0)  =VLOOKUP(A5;'Sheet2'!A:J;7;0)  =B5*C5
4  =VLOOKUP(A6;'Sheet2'!A:J;6;0)  =VLOOKUP(A6;'Sheet2'!A:J;7;0)  =B6*C6
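To land these formulas in a live .xlsx (as the question requires), here is a minimal export sketch; the file name 'output.xlsx' and sheet name 'Sheet1' are placeholders, and it assumes the openpyxl engine is installed. pandas writes the formula strings verbatim and Excel evaluates them when the file is opened:

# Write the formula strings to an xlsx; Excel computes them on open.
with pd.ExcelWriter('output.xlsx', engine='openpyxl') as writer:
    df.to_excel(writer, sheet_name='Sheet1', index=False)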

Diff between two dataframes in pandas

I have two dataframes both of which have the same basic schema. (4 date fields, a couple of string fields, and 4-5 float fields). Call them df1 and df2.
What I want to do is basically get a "diff" of the two - where I get back all rows that are not shared between the two dataframes (not in the set intersection). Note, the two dataframes need not be the same length.
I tried using pandas.merge(how='outer') but I was not sure what column to pass in as the 'key' as there really isn't one and the various combinations I tried were not working. It is possible that df1 or df2 has two (or more) rows that are identical.
What is a good way to do this in pandas/Python?
Try this:
diff_df = pd.merge(df1, df2, how='outer', indicator='Exist')
diff_df = diff_df.loc[diff_df['Exist'] != 'both']
You will have a dataframe of all rows that don't exist on both df1 and df2.
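A quick illustration with invented toy frames (the column names are placeholders):

import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})
df2 = pd.DataFrame({'a': [2, 3, 4], 'b': ['y', 'z', 'w']})

# Rows found in both frames get Exist == 'both' and are filtered out.
diff_df = pd.merge(df1, df2, how='outer', indicator='Exist')
print(diff_df.loc[diff_df['Exist'] != 'both'])
#    a  b      Exist
# 0  1  x  left_only
# 3  4  w  right_only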
IIUC:
You can use pd.Index.symmetric_difference
pd.concat([df1, df2]).loc[
df1.index.symmetric_difference(df2.index)
]
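Note that this relies on the index carrying identifying information; with the default RangeIndex it compares row positions, not row contents.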
You can use this function; the output is an ordered dict of six dataframes which you can write to Excel for further analysis.
'df1' and 'df2' refer to your input dataframes.
'uid' refers to the column or combination of columns that makes up the unique key (e.g. 'Fruits').
'dedupe' (default=True) drops duplicates in df1 and df2 (refer to Step 4 in the comments).
'labels' (default=('df1', 'df2')) allows you to name the input dataframes. If a unique key exists in both dataframes but has different values in one or more columns, it is usually important to see these rows, so they are stacked one on top of the other and labelled with the name of the dataframe they belong to.
'drop' can take a list of columns to be excluded from consideration when computing the difference.
Here goes:
df1 = pd.DataFrame([['apple', '1'], ['banana', 2], ['coconut', 3]], columns=['Fruits', 'Quantity'])
df2 = pd.DataFrame([['apple', '1'], ['banana', 3], ['durian', 4]], columns=['Fruits', 'Quantity'])
dict1 = diff_func(df1, df2, 'Fruits')
In [10]: dict1['df1_only']
Out[10]:
    Fruits Quantity
1  coconut        3

In [11]: dict1['df2_only']
Out[11]:
   Fruits Quantity
3  durian        4

In [12]: dict1['Diff']
Out[12]:
   Fruits Quantity df1 or df2
0  banana        2        df1
1  banana        3        df2

In [13]: dict1['Merge']
Out[13]:
  Fruits Quantity
0  apple        1
Here is the code:
import pandas as pd
from collections import OrderedDict as od
def diff_func(df1, df2, uid, dedupe=True, labels=('df1', 'df2'), drop=[]):
    dict_df = {labels[0]: df1, labels[1]: df2}
    col1 = df1.columns.values.tolist()
    col2 = df2.columns.values.tolist()

    # There could be columns known to be different, hence allow the user to
    # pass them as a list to be dropped.
    if drop:
        print('Ignoring columns {} in comparison.'.format(', '.join(drop)))
        col1 = list(filter(lambda x: x not in drop, col1))
        col2 = list(filter(lambda x: x not in drop, col2))
        df1 = df1[col1]
        df2 = df2[col2]

    # Step 1 - Check if no. of columns are the same:
    len_lr = len(col1), len(col2)
    assert len_lr[0] == len_lr[1], \
        'Cannot compare frames with different number of columns: {}.'.format(len_lr)

    # Step 2a - Check if the set of column headers is the same
    # (order doesn't matter)
    assert set(col1) == set(col2), \
        'Left column headers are different from right column headers.' \
        + '\n Left orphans: {}'.format(list(set(col1) - set(col2))) \
        + '\n Right orphans: {}'.format(list(set(col2) - set(col1)))

    # Step 2b - Check if the column headers are in the same order
    if col1 != col2:
        print('[Note] Reordering right DataFrame...')
        df2 = df2[col1]

    # Step 3 - Check the datatypes are the same [order is important]
    if set((df1.dtypes == df2.dtypes).tolist()) - {True}:
        print('dtypes are not the same.')
        df_dtypes = pd.DataFrame({labels[0]: df1.dtypes, labels[1]: df2.dtypes,
                                  'Diff': (df1.dtypes == df2.dtypes)})
        df_dtypes = df_dtypes[df_dtypes['Diff'] == False][[labels[0], labels[1], 'Diff']]
        print(df_dtypes)
    else:
        print('DataType check: Passed')

    # Step 4 - Check for duplicate rows
    if dedupe:
        for key, df in dict_df.items():
            if df.shape[0] != df.drop_duplicates().shape[0]:
                print(key + ': Duplicates exist, they will be dropped.')
                dict_df[key] = df.drop_duplicates()

    # Step 5 - Check for duplicate uids.
    if type(uid) == str or type(uid) == list:
        print('Uniqueness check: {}'.format(uid))
        for key, df in dict_df.items():
            count_uid = df.shape[0]
            count_uid_unique = df[uid].drop_duplicates().shape[0]
            var = [0, 1][count_uid_unique == df.shape[0]]  # <-- round off to the nearest integer if it is 100%
            pct = round(100 * count_uid_unique / df.shape[0], var)
            print('{}: {} out of {} are unique ({}%).'.format(key, count_uid_unique, count_uid, pct))

    # Checks complete, begin merge. Remember to dedupe; provide labels for common_no_match.
    dict_result = od()
    df_merge = pd.merge(df1, df2, on=col1, how='inner')
    if not df_merge.shape[0]:
        print('Error: Merged DataFrame is empty.')
    else:
        dict_result[labels[0]] = df1
        dict_result[labels[1]] = df2
        dict_result['Merge'] = df_merge
        if type(uid) == str:
            uid = [uid]
        if type(uid) == list:
            # pd.concat replaces the removed DataFrame.append here; appending
            # the merge result and dropping everything duplicated leaves the
            # rows unique to each frame.
            df1_only = pd.concat([df1, df_merge]).reset_index(drop=True)
            df1_only['Duplicated'] = df1_only.duplicated(keep=False)  # keep=False marks all duplicates as True
            df1_only = df1_only[df1_only['Duplicated'] == False]
            df2_only = pd.concat([df2, df_merge]).reset_index(drop=True)
            df2_only['Duplicated'] = df2_only.duplicated(keep=False)
            df2_only = df2_only[df2_only['Duplicated'] == False]
            label = labels[0] + ' or ' + labels[1]
            df_lc = df1_only.copy()
            df_lc[label] = labels[0]
            df_rc = df2_only.copy()
            df_rc[label] = labels[1]
            df_c = pd.concat([df_lc, df_rc]).reset_index(drop=True)
            df_c['Duplicated'] = df_c.duplicated(subset=uid, keep=False)
            df_c1 = df_c[df_c['Duplicated'] == True]
            df_c1 = df_c1.drop('Duplicated', axis=1)
            df_uc = df_c[df_c['Duplicated'] == False]
            df_uc_left = df_uc[df_uc[label] == labels[0]]
            df_uc_right = df_uc[df_uc[label] == labels[1]]
            dict_result[labels[0] + '_only'] = df_uc_left.drop(['Duplicated', label], axis=1)
            dict_result[labels[1] + '_only'] = df_uc_right.drop(['Duplicated', label], axis=1)
            dict_result['Diff'] = df_c1.sort_values(uid).reset_index(drop=True)
    return dict_result
Set df2.columns = df1.columns.
Now, set every column as the index: df1 = df1.set_index(df1.columns.tolist()), and similarly for df2.
You can now do df1.index.difference(df2.index) and df2.index.difference(df1.index); the two results are the rows unique to each dataframe.
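A minimal sketch of that recipe, using invented toy frames:

import pandas as pd

df1 = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})
df2 = pd.DataFrame({'a': [2, 3], 'b': ['y', 'z']})
df2.columns = df1.columns

# Every column becomes an index level, so each row is a tuple of its values.
i1 = df1.set_index(df1.columns.tolist()).index
i2 = df2.set_index(df2.columns.tolist()).index
print(i1.difference(i2))  # rows only in df1: [(1, 'x')]
print(i2.difference(i1))  # rows only in df2: [(3, 'z')]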
With
left_df.merge(df, left_on=left_df.columns.tolist(), right_on=df.columns.tolist(), how='outer')
you can get the outer join result. Similarly, you can get the inner join result, and then diff the two to get what you want.
