How do you compare the values in two different dataframe groupby results? - python

I have two different dataframes populated with different name sets. For example:
t1 = pd.DataFrame(['Abe, John Doe', 'Smith, Brian', 'Lin, Sam', 'Lin, Greg'], columns=['t1'])
t2 = pd.DataFrame(['Abe, John', 'Smith, Brian', 'Lin, Sam', 'Lu, John'], columns=['t2'])
I need to find the intersection between the two data sets. My solution was to split by comma, and then groupby last name. Then I'll be able to compare last names, and then see if the first names of t2 are contained within t1. ['Lu, John'] is the only one that should be returned in above example.
What I need help on is how to compare values within two different dataframes that are grouped by a common column. Is there a way to intersect the results of a groupby for two different dataframes and then compare the values within each key value pair? I need to extract the names in t2 that are not in t1.
I have an idea that it should look something like this:
for last in t1:
    print(t2.get_group(last))  # BUT run a compare between the two different lists here
Only problem is if the last name doesn't exist in the second groupby, it throws an error, so I can't even proceed to the next step mentioned by the comment, of comparing the values in the groups (first names).

This isn't pandas-specific, but Python has a built-in set class with an intersection operation. Here's the documentation: https://docs.python.org/3/library/stdtypes.html?highlight=set#set
It works like so:
set1 = set(my_list_of_elements)
set2 = set(my_other_list_of_elements)
intersecting_elements = set1 & set2
It's hard to tell if this is what you are looking for, though. Please update the question with a minimal, complete, and verifiable example, as the comments say, for a more accurate answer.
Update - based on comment
for last in t1:
    try:
        t2_last_group = t2.get_group(last)
        # perform compare here
    except KeyError:
        # get_group raises KeyError when the last name is not in t2
        pass
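As an alternative to catching the error, the two ideas in this answer can be combined: intersect the group keys first, then only look up keys that exist on both sides. The sketch below is an illustration only; the small `t1`/`t2` frames with `last` and `first` columns are made-up stand-ins for the already-split name data.

```python
import pandas as pd

# Hypothetical, already-split name data (not the asker's real frames).
t1 = pd.DataFrame({'last': ['Abe', 'Lin', 'Lin'], 'first': ['John Doe', 'Sam', 'Greg']})
t2 = pd.DataFrame({'last': ['Abe', 'Lu'], 'first': ['John', 'John']})

t1_groups = t1.groupby('last')
t2_groups = t2.groupby('last')

# .groups maps each group key to its row labels, so its keys can be intersected as sets.
shared = sorted(set(t1_groups.groups) & set(t2_groups.groups))

for last in shared:
    # Both lookups are safe here because the key exists in both groupby results.
    first_t1 = t1_groups.get_group(last)['first'].tolist()
    first_t2 = t2_groups.get_group(last)['first'].tolist()
    print(last, first_t1, first_t2)
```

Keys that appear on only one side can be found the same way with a set difference instead of an intersection.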

I ended up figuring this out. str.contains(...) returns one True/False per row, so I added .any() to reduce that to a single value. To handle a last name that cannot be found in the second groupby, I wrapped the lookup in a try/except. The solution is outlined below.
t1final = []
for index, row in t1.iterrows():
    t1lastname = row['last']
    t1firstname = row['first']
    try:
        x = t2groupby.get_group(t1lastname)
        # "not" negates the single boolean from .any(); "~" is meant for element-wise masks
        if not x['first'].str.contains(t1firstname, case=False).any():
            t1final.append(t1lastname + ', ' + t1firstname)
    except KeyError:
        # the last name does not exist in t2 at all
        t1final.append(t1lastname + ', ' + t1firstname)
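For comparison, here is a loop-free sketch of the check described in the question (names in t2 whose first name is not contained in any t1 first name with the same last name). It is not the asker's method: it swaps the groupby lookup for a left merge on the last name, and the `pairs`, `matched`, and `missing` names are introduced only for this illustration.

```python
import pandas as pd

t1 = pd.DataFrame(['Abe, John Doe', 'Smith, Brian', 'Lin, Sam', 'Lin, Greg'], columns=['t1'])
t2 = pd.DataFrame(['Abe, John', 'Smith, Brian', 'Lin, Sam', 'Lu, John'], columns=['t2'])

# Split "Last, First" into two columns.
t1[['last', 'first']] = t1['t1'].str.split(', ', n=1, expand=True)
t2[['last', 'first']] = t2['t2'].str.split(', ', n=1, expand=True)

# Pair every t2 row with every t1 row that shares its last name.
# Last names missing from t1 get NaN in the t1 columns.
pairs = t2.reset_index().merge(t1, on='last', how='left', suffixes=('_t2', '_t1'))

# A pair matches when the t2 first name occurs inside the t1 first name (case-insensitive).
pairs['matched'] = [
    isinstance(f1, str) and f2.lower() in f1.lower()
    for f1, f2 in zip(pairs['first_t1'], pairs['first_t2'])
]

# A t2 row is covered if at least one of its pairs matched.
matched = pairs.groupby('index')['matched'].any()
missing = t2.loc[~matched.reindex(t2.index, fill_value=False), 't2'].tolist()
print(missing)
```

On the sample data this leaves only the name whose last name never appears in t1.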

Related

Pandas comparing list across column values

I have a certain business requirement for which I am having trouble in implementing a faster solution (current solution takes 3 hrs per iteration)
Eg: Say I have a df
and there's a list :
l = [[a,b,c],[d,e,f]]
To do:
Compare all the list values across customer and check if they exist or not
If they exist then find the corresponding min and max date1
Currently the pseudo working code I have is :
for each customer:
    group by customer and add column having code column into a list
    for each list value:
        check if particular list value exists (in case check if [a,b,c] exists in first loop)
        if exists:
            check for min date by group etc
This multiple for loop is taking too long to execute since I have 100k+ customers.
Any way to further improve this? I already eliminated one for loop reducing time from 10hrs to 3
l = [['a','b','c'],['d','e','f']]
First, flatten your list:
import itertools
l = list(itertools.chain.from_iterable(l))
Then use a boolean mask to keep only the rows whose code appears in the list:
newdf = df[df['code'].isin(l)]
Finally, aggregate with groupby(). Use one of the two options below, not both:
# Option 1: group by 'code' only
result = newdf.groupby('code').agg(max_date1=('date1','max'),min_date1=('date1','min'))
# Option 2: group by customerid and code
result = newdf.groupby(['customerid','code']).agg(max_date1=('date1','max'),min_date1=('date1','min'))
Now if you print result you will get your desired output.
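The steps above can be tried end to end on a small frame. The question's dataframe is not shown, so the `df` below is invented for illustration; only the column names `customerid`, `code`, and `date1` come from the answer's code.

```python
import itertools
import pandas as pd

# Made-up sample data; the real dataframe is not shown in the question.
df = pd.DataFrame({
    'customerid': [1, 1, 1, 2, 2],
    'code': ['a', 'b', 'x', 'a', 'd'],
    'date1': pd.to_datetime(
        ['2021-01-05', '2021-02-01', '2021-03-01', '2021-01-10', '2021-04-01']
    ),
})
l = [['a', 'b', 'c'], ['d', 'e', 'f']]

# Flatten the nested list into one list of codes.
codes = list(itertools.chain.from_iterable(l))

# Keep only rows whose code is in the flattened list ('x' is dropped).
filtered = df[df['code'].isin(codes)]

# One row per (customerid, code) with the earliest and latest date1.
result = filtered.groupby(['customerid', 'code']).agg(
    max_date1=('date1', 'max'),
    min_date1=('date1', 'min'),
)
print(result)
```

Because the filtering and aggregation are vectorized, no Python-level loop over customers is needed.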
I slightly modified my approach.
Instead of looping through each customer (I have 100k+ customers)
I looped through each list :
checked if customers were present or not and then looped through filtered customers
This reduced the time by a couple of hours.
Thanks again for your help

Remove duplicated cell content using python?

I filtered the rows with duplicated Order IDs and joined the products of each order with commas using the code below. I don't understand why the products in the Join_Dup column are repeated.
dd = sales_all[sales_all['Order ID'].duplicated(keep=False)]
dd['Join_Dup'] = dd.groupby('Order ID')['Product'].transform(lambda x: ','.join(x))
print(dd.head())
dd = dd[['Order ID','Join_Dup']].drop_duplicates()
dd
Order ID Join_Dup
0 176558 USB-C Charging Cable,USB-C Charging Cable,USB-...
2 176559 Bose SoundSport Headphones,Bose SoundSport Hea...
3 176560 Google Phone,Wired Headphones,Google Phone,Wir...
5 176561 Wired Headphones,Wired Headphones,Wired Headph...
... ... ...
186846 259354 iPhone,iPhone,iPhone,iPhone,iPhone,iPhone
186847 259355 iPhone,iPhone,iPhone,iPhone,iPhone,iPhone
186848 259356 34in Ultrawide Monitor,34in Ultrawide Monitor,...
186849 259357 USB-C Charging Cable,USB-C Charging Cable,USB-...
[178437 rows x 2 columns]
I need to remove the duplicates from the cell in each row. Can someone please help?
IIUC, let's try to prevent the duplicates in the groupby transform statement:
dd['Join_No_Dup'] = dd.groupby('Order ID')['Product'].transform(lambda x: ','.join(set(x)))
Edit: disregard the second part of the answer. I will delete that portion if it ends up not being useful.
Per your comment, you want unique product strings for each Order ID. You can get that in a single step:
dd = (
    sales_all.groupby(['Order ID', 'Product'])['some_other_column']
    .size().rename('quantity').reset_index()
)
Now you have unique rows of OrderID/Product with the count of repeated products (or quantity, as in a regular invoice). You can work with that or you can groupby to form a list of products:
orders = dd.groupby('Order ID').Product.apply(list)
---apply vs transform---
Please note that if you use .transform as in your question you will invariably get a result with the same shape as the dataframe/series being grouped (i.e. grouping will be reversed and you will end up with the same number of rows, thus creating duplicates). The function .apply will pass the groups of your groupby to the same function, any function, but will not broadcast back to the original shape (it will return only one row per group).
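A tiny example makes the shape difference concrete. The `sales` frame below is invented for illustration and only mirrors the `Order ID` and `Product` columns from the question.

```python
import pandas as pd

# Made-up orders: order 1 has two products, order 2 has one.
sales = pd.DataFrame({
    'Order ID': [1, 1, 2],
    'Product': ['iPhone', 'Wired Headphones', 'USB-C Charging Cable'],
})
grouped = sales.groupby('Order ID')['Product']

# transform broadcasts the joined string back to every original row (3 rows).
per_row = grouped.transform(lambda x: ','.join(x))

# apply returns one joined string per group (2 rows, indexed by Order ID).
per_group = grouped.apply(','.join)

print(per_row)
print(per_group)
```

The transform result repeats the same string for both rows of order 1, which is where the apparent duplication comes from.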
Old Answer
So you are removing ALL Order IDs that appear in multiple rows (if ID 14 appears in two rows you discard both rows). This makes the groupby in the next line redundant, as every grouped ID will have just one line.
Ok, now that's out of the way. Then presumably each row in Product contains a list which you are joining with a lambda. This step would be a little faster with a pandas native function.
dd['Join_Dup'] = dd.Product.str.join(', ')
# perhaps choose a better name for the column, once you remove duplicates it will not mean much (does 'Join_Products' work?)
Now to handle duplicates. You didn't actually need to join in the last step if all you wanted was to remove dups. Pandas can handle lists as well. But the part you were missing is the subset parameter.
dd = dd[['Order ID', 'Join_Dup']].drop_duplicates(subset='Join_Dup')

Tactic for comparing dataframes when column names are different and sequence is unknown

I need to compare two DataFrames at a time to find out if the values match or not. One DataFrame is from an Excel workbook and the other is from a SQL query. The problem is that not only might the columns be out of sequence, but the column headers might have different names as well. This prevents me from simply getting the Excel column headers and using those to rearrange the columns in the SQL DataFrame. In addition, I will be doing this across several tabs in an Excel workbook and against different queries. Not only do the column names differ from Excel to SQL, but they may also differ from Excel to Excel and SQL to SQL.
I did create a solution, but not only is it very choppy, but I'm concerned it will begin to take up a considerable amount of memory to run.
The solution entails using lists in a list. If the excel value is in the same list as the SQL value they are considered a match and the function will return the final order that the SQL DataFrame must change to in order to match the same order that the Excel DataFrame is using. In case I missed some possibilities and the newly created order list has a different length than what is needed, I simply return the original SQL list of headers in the original order.
The example below is barely a fraction of what I will actually be working with. The actual number of variations and column names are much higher than the example below. Any suggestions anyone has on how to improve this function, or offer a better solution to this problem, would be appreciated.
Here is an example:
#Example data
exceltab1 = {'ColA':[1,2,3],
'ColB':[3,4,1],
'ColC':[4,1,2]}
exceltab2 = {'cColumn':[10,15,17],
'aColumn':[5,7,8],
'bColumn':[9,8,7]}
sqltab1 = {'Col/A':[1,2,3],
'Col/C':[4,1,2],
'Col/B':[3,4,1]}
sqltab2 = {'col_banana':[9,8,7],
'col_apple':[5,7,8],
'col_carrot':[10,15,17]}
#Code
import pandas as pd
ec1 = pd.DataFrame(exceltab1)
ec2 = pd.DataFrame(exceltab2)
sq1 = pd.DataFrame(sqltab1)
sq2 = pd.DataFrame(sqltab2)
#This will fail because the columns are out of order
result1 = (ec1.values == sq1.values).all()
def translate(excel_headers, sql_headers):
    # Each inner list holds the known aliases of one logical column
    translator = [["ColA", "aColumn", "Col/A", "col_apple"],
                  ["ColB", "bColumn", "Col/B", "col_banana"],
                  ["ColC", "cColumn", "Col/C", "col_carrot"]]
    order = []
    for i in range(len(excel_headers)):
        for group in translator:  # renamed from "list" to avoid shadowing the builtin
            for item in sql_headers:
                if excel_headers[i] in group and item in group:
                    order.append(item)
                    break
    # Fall back to the original SQL order if some header could not be translated
    if len(order) != len(sql_headers):
        return sql_headers
    else:
        return order
sq1 =sq1[translate(list(ec1.columns), list(sq1.columns))]
#This will pass because the columns now line up
result2 = (ec1.values == sq1.values).all()
print(f"Result 1: {result1} , Result 2: {result2}")
Result:
Result 1: False , Result 2: True
No code, but an algorithm.
We have a set of columns A and another B. We can compare a column from A and another from B and see if they're equal. We do that for all combinations of columns.
This can be seen as a bipartite graph where there are two groups of vertices A and B (one vertex for each column), and an edge exists between two vertices if those two columns are equal. Then the problem of translating column names is equivalent to finding a perfect matching in this bipartite graph.
An algorithm to do this with is Hopcroft-Karp, which has a Python implementation here. It finds maximum matchings, so you still have to check whether it found a perfect matching (that is, each column from A has an associated column from B).
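To make the matching idea concrete, here is a minimal sketch. It does not implement Hopcroft-Karp; it uses the simpler augmenting-path method (often called Kuhn's algorithm), which finds the same maximum matching but is slower on large graphs. The function name `match_columns` is made up for this example, and it assumes the rows of both frames are already in the same order.

```python
import pandas as pd

def match_columns(left, right):
    """Map each column of `left` to a distinct equal-valued column of `right`, or return None."""
    # Edges of the bipartite graph: right-hand columns whose values equal left[a].
    candidates = {
        a: [b for b in right.columns if left[a].tolist() == right[b].tolist()]
        for a in left.columns
    }
    owner = {}  # right column -> left column currently matched to it

    def try_assign(a, seen):
        for b in candidates[a]:
            if b in seen:
                continue
            seen.add(b)
            # Take b if it is free, or if its current owner can be moved elsewhere.
            if b not in owner or try_assign(owner[b], seen):
                owner[b] = a
                return True
        return False

    for a in left.columns:
        if not try_assign(a, set()):
            return None  # some column has no partner: no perfect matching
    return {a: b for b, a in owner.items()}

ec1 = pd.DataFrame({'ColA': [1, 2, 3], 'ColB': [3, 4, 1], 'ColC': [4, 1, 2]})
sq1 = pd.DataFrame({'Col/A': [1, 2, 3], 'Col/C': [4, 1, 2], 'Col/B': [3, 4, 1]})

mapping = match_columns(ec1, sq1)
print(mapping)
```

If a mapping is returned, `sq1[[mapping[c] for c in ec1.columns]]` puts the SQL columns in the Excel order; a `None` result means the frames differ in at least one column.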

Pandas: How to check if any of a list in a dataframe column is present in a range in another dataframe?

I'm trying to compare two bioinformatic DataFrames (one with transcription start and end genomic locations, and one with expression data). I need to check if any of a list of locations in one DataFrame is present within ranges defined by the start and end locations in the other DataFrame, returning rows/ids where they match.
I have tried a number of built-in methods (.isin, .where, .query), but usually get stuck because the lists are unhashable. I've also tried a nested for loop with iterrows and itertuples, which is exceedingly slow (my actual datasets have thousands of entries).
tss_df = pd.DataFrame(data={'id':['gene1','gene2'],
'locs':[[21,23],[34,39]]})
exp_df = pd.DataFrame(data={'gene':['geneA','geneB'],
'start': [15,31], 'end': [25,42]})
I'm looking to find that the row with id 'gene1' in tss_df has locations (locs) that match 'geneA' in exp_df.
The output would be something like:
output = pd.DataFrame(data={'id':['gene1','gene2'],
'locs': [[21,23],[34,39]],
'match': ['geneA','geneB']})
Edit: Based on a comment below, I tried playing with merge_asof:
pd.merge_asof(tss_df,exp_df,left_on='locs',right_on='start')
This gave me an incompatible merge keys error, I suspect because I'm comparing a list to an integer, so I split out the first value of each list in locs:
# .str[0] takes the first element of every list; tss_df['locs'][0] would return the whole list of row 0
tss_df['loc1'] = tss_df['locs'].str[0].astype('int64')
pd.merge_asof(tss_df,exp_df,left_on='loc1',right_on='start')
This appears to have worked for my test data, but I'll need to try it with my actual data!
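The merge_asof idea can be extended to check every location in each list, not only the first one, and to confirm that the location also falls before the interval end. The sketch below is one possible way to do that; it assumes the intervals in exp_df do not overlap (merge_asof returns at most one interval per location), and the names `long_df`, `hits`, and `match` are introduced only for this illustration.

```python
import pandas as pd

tss_df = pd.DataFrame({'id': ['gene1', 'gene2'], 'locs': [[21, 23], [34, 39]]})
exp_df = pd.DataFrame({'gene': ['geneA', 'geneB'], 'start': [15, 31], 'end': [25, 42]})

# One row per (id, location) so that each location is a plain integer.
long_df = tss_df.explode('locs').rename(columns={'locs': 'loc'})
long_df['loc'] = long_df['loc'].astype('int64')

# merge_asof needs both sides sorted on their keys; for each location it
# picks the interval with the largest start that is <= the location.
hits = pd.merge_asof(
    long_df.sort_values('loc'),
    exp_df.sort_values('start'),
    left_on='loc',
    right_on='start',
)

# Keep only locations that also lie at or before the interval end.
hits = hits[hits['loc'] <= hits['end']]

# Collapse back to one row per id, listing the matching gene(s).
match = hits.groupby('id')['gene'].agg(lambda g: ','.join(sorted(set(g))))
output = tss_df.merge(match.rename('match'), left_on='id', right_index=True, how='left')
print(output)
```

Ids with no location inside any interval end up with NaN in the match column because of the final left merge.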

python pandas how to merge/join two tables based on substring?

Let's say I have two dataframes, and the column names for both are:
table 1 columns:
[ShipNumber, TrackNumber, Comment, ShipDate, Quantity, Weight]
table 2 columns:
[ShipNumber, TrackNumber, AmountReceived]
I want to merge the two tables when either 'ShipNumber' or 'TrackNumber' from table 2 can be found in 'Comment' from table 1.
Also, I'll explain why
merged = pd.merge(df1,df2,how='left',left_on='Comment',right_on='ShipNumber')
does not work in this case.
"Comment" column is a block of texts that can contain anything, so I cannot do an exact match like tab2.ShipNumber == tab1.Comment, because tab2.ShipNumber or tab2.TrackNumber can be found as a substring in tab1.Comment.
The desired output table should have all the unique columns from two tables:
output table column names:
[ShipNumber, TrackNumber, Comment, ShipDate, Quantity, Weight, AmountReceived]
I hope my question makes sense...
Any help is really really appreciated!
note
The ultimate goal is to merge two sets with (shipnumber==shipnumber |tracknumber == tracknumber | shipnumber in comments | tracknumber in comments), but I've created two subsets for the first two conditions, and now I'm working on the 3rd and 4th conditions.
Why not do something like this:
def merge_function(row):
    # Look for a df2 row whose ShipNumber or TrackNumber occurs in this Comment
    for _, df2_row in df2.iterrows():
        if (str(df2_row['ShipNumber']) in row['Comment']
                or str(df2_row['TrackNumber']) in row['Comment']):
            row['AmountReceived'] = df2_row['AmountReceived']
            break
    return row
df1['AmountReceived'] = 0  # default when nothing matches
new_df = df1.apply(merge_function, axis=1)
Here is an example based on some made up data. Ignore the complete nonsense I've put in the dataframes, I was just typing in random stuff to get a sample df to play with.
import pandas as pd
import re
x = pd.DataFrame({'Location': ['Chicago','Houston','Los Angeles','Boston','NYC','blah'],
'Comments': ['chicago is winter','la is summer','boston is winter','dallas is spring','NYC is spring','seattle foo'],
'Dir': ['N','S','E','W','S','E']})
y = pd.DataFrame({'Location': ['Miami','Dallas'],
'Season': ['Spring','Fall']})
def findval(row):
    comment, location, season = map(lambda x: str(x).lower(), row)
    return location in comment or season in comment
merged = pd.concat([x,y])
merged['Helper'] = merged[['Comments','Location','Season']].apply(findval,axis=1)
print(merged)
filtered = merged[merged['Helper'] == True]
print(filtered)
Rather than joining, you can concatenate the dataframes and then create a helper column that records whether the string in one column is found in another. Once you have that helper column, just keep the rows where it is True.
You could index the comments field using a library like Whoosh and then do a text search for each shipment number that you want to search by.
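For modest table sizes, another option is a cross join followed by a substring filter. The sketch below is only an illustration under stated assumptions: the sample rows are invented, df1 is reduced to `Comment` and `Quantity` to avoid column-name clashes, the numbers are stored as strings, and `how='cross'` requires pandas 1.2 or later.

```python
import pandas as pd

# Made-up sample rows; only the column names follow the question.
df1 = pd.DataFrame({
    'Comment': ['see shipment S100 for details', 'tracking T9 delayed', 'no reference'],
    'Quantity': [1, 2, 3],
})
df2 = pd.DataFrame({
    'ShipNumber': ['S100', 'S200'],
    'TrackNumber': ['T1', 'T9'],
    'AmountReceived': [10.0, 20.0],
})

# Pair every comment with every df2 row; the result has len(df1) * len(df2) rows.
pairs = df1.reset_index().merge(df2, how='cross')

# Keep pairs where either number occurs as a substring of the comment.
found = [
    ship in comment or track in comment
    for comment, ship, track in zip(pairs['Comment'], pairs['ShipNumber'], pairs['TrackNumber'])
]
matches = pairs[found].set_index('index')[['ShipNumber', 'TrackNumber', 'AmountReceived']]

# Left-join back on the original row labels so unmatched comments are kept with NaN.
merged = df1.join(matches, how='left')
print(merged)
```

Because the cross join grows multiplicatively, this is best suited to cases where at least one of the tables is small; a text index is the better fit for large comment sets.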
