Merging dataframes with an "in" operator-like approach - python

I want to merge two dataframes. One is a "hit" dataframe containing a number of protein IDs. The second is a database dataframe, which contains all known proteins and their functions. See the examples below.
The goal is to associate my protein hits with the respective row in the database dataframe. What complicates this is that some of my hits have multiple proteins in one row ("Protein_3A; Protein_3B" below). Is there a way to merge these dataframes using an "in"-like operation so that the row for protein 3 is matched as shown in the desired dataframe below, even though only one of the subtypes (Protein_3B) is present in the database dataframe?
Example starting dataframes
hit_df = pd.DataFrame({"Hits": ["Protein_1", "Protein_2", "Protein_3A; Protein_3B", "Protein_8", "Protein_5"]})
database_df = pd.DataFrame({"Proteins": ["Protein_1", "Protein_2", "Protein_3B", "Protein_4", "Protein_9", "Protein10"], "Function": ["FuncX", "FuncY", "FuncZ", "Unknown", "Unknown", "FuncA"]})
Desired result dataframe
matched_results_df = pd.DataFrame({"Hits":["Protein_1", "Protein_2", "Protein_3B"], "Function":["FuncX", "FuncY", "FuncZ"]})

First, split each entry on the "; " separator so that each protein gets its own row:
hit_df = hit_df["Hits"].str.split("; ", expand=False).explode().to_frame()
print(hit_df)
         Hits
0   Protein_1
1   Protein_2
2  Protein_3A
2  Protein_3B
3   Protein_8
4   Protein_5
Then use the isin function to keep only the matching rows:
matched_results_df = database_df[database_df['Proteins'].isin(hit_df['Hits'])]
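Note that this keeps the database_df column names (Proteins, Function). If you want the exact Hits/Function layout of the desired matched_results_df, a merge on the exploded hits gives the same matches; a minimal sketch reusing hit_df and database_df from above:
# Inner merge keeps only the exploded hits that exist in the database; the
# helper "Proteins" column is then dropped so only "Hits" and "Function" remain.
matched_results_df = (hit_df.merge(database_df, left_on="Hits",
                                   right_on="Proteins", how="inner")
                            .drop(columns="Proteins"))
print(matched_results_df)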

How can we match DataFrames using loops in Python?

I am trying to write Python code to match two tables, puit_1 and puit_2, in order to retrieve the identifier id from table puit_2 by creating a new id column in table puit_1 whenever the name in table puit_1 equals the name in table puit_2.
Below is the loop:
puits_sig['id_credo'] = pd.Series([])
for j in range(len(puits_credo['noms'])):
    id_pcredo = []
    for i in range(len(puits_sig['sigle_puits'])):
        if puits_credo['noms'][j] == puits_credo['noms'][j]:
            if str(puits_sig['sigle_puits'][i]) in puits_credo['noms'][j]:
                id_pcredo.append(puits_credo['ID_Objet'][j])
    print(id_pcredo)
    puits_sig['id_credo'] = id_pcredo
ValueError: length of values 1 does not match length of index 1212
This code gives me an error that I couldn't solve (for information, I'm a beginner in programming).
Any help? Below are some extracts from the tables.
Extract from table 1
sigle_puit  categorie
BNY2D       ACTIF
BRM2        INACTIF
Extract from table 2
Nom    ID Object
BLY23  89231
BRM1   12175
What I believe you're looking for is an inner merge. This creates a new dataframe containing the 'name' column that both puit_1 and puit_2 have in common, plus all the other columns from both dataframes, keeping only the rows whose names appear in both.
merged_df = puit_1.merge(puit_2, how='inner', on=['name'])
See this question for more:
Merge DataFrames with Matching Values From Two Different Columns - Pandas
EDIT: If you want to keep rows whose name exists in only one of the dataframes, use how='outer' in the merge operation.
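In this particular question, though, the two tables do not share a column name (the key appears to be sigle_puits in one table and noms in the other, going by the question's loop), so on='name' will not work as written. A minimal sketch of the same merge with left_on/right_on; the table and column names are assumptions taken from the question's code:
# Merge on differently named key columns; how='inner' keeps only the rows
# whose key value appears in both tables. The names below come from the
# question's loop and may need adjusting to the real data.
merged_df = puits_sig.merge(puits_credo, how='inner',
                            left_on='sigle_puits', right_on='noms')
Note that an exact merge only matches identical values; if, as the "in" check in the loop suggests, the sigle is only a substring of the name, an exact merge will not catch those rows.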

Max Value in a Data Frame with Groupby using idmax()

I have a dataframe that has 10 columns.
I used this code to filter to the rows I want: basically, the rows where the Revision Date is on or before the cutoff date (a declared variable) and the Job Title is in the provided list.
aggregate = df.loc[(df['RevisionDate'] <= cutoff_date) & (df['JobTitle'].isin(['Production Control Clerk', 'Customer Service Representative III', 'Data Entry Operator I', 'Accounting Clerk II', 'General Clerk III', 'Technical Instructor']))]
Then, I need to group them by the column WD (there are multiple of these), and then by Job Title (again, multiple of these). So I did that by:
aggregate1 = aggregate.groupby(['WD','JobTitle'])
This produces a grouped object that still holds the required rows and all 10 columns.
Then, from this smaller dataframe, I need to pull out only the rows with the highest (max) Revision Number.
aggregate1 = aggregate.max('RevisionNumber')
However, this last step produces a dataframe, but with only 3 of the columns: WD, Job Title and Revision Number. I need ALL 10 of the columns.
Based on other questions I've seen posted here, I have tried to use idmax():
df2 = aggregate.loc[aggregate.groupby(['WD','JobTitle'])['RevisionNumber'].idmax()]
but I get this error:
AttributeError: 'SeriesGroupBy' object has no attribute 'idmax'
What am I doing wrong?
If you sort first, you can take the top row of each group
aggregate.sort_values(by='RevisionNumber', ascending=False).groupby(['WD','JobTitle']).head(1)
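For reference, the AttributeError in the question comes from the spelling: the pandas method is idxmax(), not idmax(). A minimal sketch of that route, which also keeps all 10 columns:
# idxmax() returns, for each (WD, JobTitle) group, the index label of the row
# with the largest RevisionNumber; .loc then pulls back the full rows.
idx = aggregate.groupby(['WD', 'JobTitle'])['RevisionNumber'].idxmax()
df2 = aggregate.loc[idx]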

Iterate through two dataframes and create a dictionary mapping strings from one dataframe (keys) to the strings in the second dataframe that contain them as substrings (values)

I have two dataframes. One is very large, with over 4 million rows of data, while the other has about 26k. I'm trying to create a dictionary whose keys are the strings of the smaller dataframe. That dataframe (df1) contains substrings or incomplete names, and the larger dataframe (df2) contains full names/strings. I want to check whether each substring from df1 appears in the strings in df2 and then build my dict.
No matter what I try, my code takes too long, and I keep looking for faster ways to iterate through the dataframes.
org_dict = {}
for rowi in df1.itertuples():
    part = rowi.part_name
    full_list = []
    for rowj in df2.itertuples():
        if part in rowj.full_name:
            full_list.append(rowj.full_name)
    org_dict[part] = full_list
Am I missing a break or is there a faster way to iterate through really large dataframes of way over 1 million rows?
Sample data:
df1
   part_name
0        aaa
1         bb
2        856
3       cool
4        man
5         a0
df2
    full_name
0   aaa35688d
1     coolbbd
2     8564578
3     coolaaa
4  man4857684
5      a03567
expected output:
{'aaa':['aaa35688d','coolaaa'],
'bb':['coolbbd'],
'856':['8564578']
...}
etc
The issue here is that nested for loops scale very badly as the data grows larger. Luckily, pandas lets us replace the inner loop with a vectorised string operation over df2.
I can't properly test without having access to a sample of your data, but I believe this does the trick and performs much faster:
org_dict = {substr: df2.full_name[df2.full_name.str.contains(substr)].tolist() for substr in df1.part_name}
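One caveat worth noting: str.contains treats its pattern as a regular expression by default, so part names containing regex metacharacters (., +, parentheses, and so on) could match incorrectly or raise an error. Passing regex=False makes it a plain substring test; the same comprehension with that flag:
# Same idea as above, but regex=False forces a literal substring match.
org_dict = {substr: df2.full_name[df2.full_name.str.contains(substr, regex=False)].tolist()
            for substr in df1.part_name}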

Tactic for comparing dataframes when column names are different and sequence is unknown

I need to compare two DataFrames at a time to find out whether the values match or not. One DataFrame comes from an Excel workbook and the other from a SQL query. The problem is that not only might the columns be out of sequence, but the column headers might have different names as well. This prevents me from simply getting the Excel column headers and using those to rearrange the columns in the SQL DataFrame. In addition, I will be doing this across several tabs in an Excel workbook and against different queries. Not only do the column names differ from Excel to SQL, they may also differ from Excel to Excel and SQL to SQL.
I did create a solution, but it is very choppy, and I'm concerned it will take up a considerable amount of memory to run.
The solution entails using lists within a list. If the Excel value is in the same list as the SQL value, they are considered a match, and the function returns the final column order that the SQL DataFrame must take in order to match the column order of the Excel DataFrame. In case I missed some possibilities and the newly created order list ends up a different length than needed, I simply return the SQL headers in their original order.
The example below is barely a fraction of what I will actually be working with. The actual number of variations and column names are much higher than the example below. Any suggestions anyone has on how to improve this function, or offer a better solution to this problem, would be appreciated.
Here is an example:
# Example data
exceltab1 = {'ColA': [1, 2, 3],
             'ColB': [3, 4, 1],
             'ColC': [4, 1, 2]}
exceltab2 = {'cColumn': [10, 15, 17],
             'aColumn': [5, 7, 8],
             'bColumn': [9, 8, 7]}
sqltab1 = {'Col/A': [1, 2, 3],
           'Col/C': [4, 1, 2],
           'Col/B': [3, 4, 1]}
sqltab2 = {'col_banana': [9, 8, 7],
           'col_apple': [5, 7, 8],
           'col_carrot': [10, 15, 17]}

# Code
import pandas as pd

ec1 = pd.DataFrame(exceltab1)
ec2 = pd.DataFrame(exceltab2)
sq1 = pd.DataFrame(sqltab1)
sq2 = pd.DataFrame(sqltab2)

# This will fail because the columns are out of order
result1 = (ec1.values == sq1.values).all()
def translate(excel_headers, sql_headers):
    translator = [["ColA", "aColumn", "Col/A", "col_apple"],
                  ["ColB", "bColumn", "Col/B", "col_banana"],
                  ["ColC", "cColumn", "Col/C", "col_carrot"]]
    order = []
    for i in range(len(excel_headers)):
        for group in translator:
            for item in sql_headers:
                if excel_headers[i] in group and item in group:
                    order.append(item)
                    break
    if len(order) != len(sql_headers):
        return sql_headers
    else:
        return order

sq1 = sq1[translate(list(ec1.columns), list(sq1.columns))]
#This will pass because the columns now line up
result2 = (ec1.values == sq1.values).all()
print(f"Result 1: {result1} , Result 2: {result2}")
Result:
Result 1: False , Result 2: True
No code, but an algorithm.
We have a set of columns A and another B. We can compare a column from A and another from B and see if they're equal. We do that for all combinations of columns.
This can be seen as a bipartite graph where there are two groups of vertices A and B (one vertex for each column), and an edge exists between two vertices if those two columns are equal. Then the problem of translating column names is equivalent to finding a perfect matching in this bipartite graph.
An algorithm for this is Hopcroft-Karp, which has a Python implementation here. It finds maximum matchings, so you still have to check whether the matching it found is perfect (that is, each column from A has an associated column from B).
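As a rough sketch of that idea (using networkx's built-in Hopcroft-Karp routine rather than the implementation linked above, so treat the details as an assumption): build one vertex per column on each side, add an edge whenever two columns hold equal values, run the matching, and check that it is perfect.
# Sketch: map Excel columns to equal-valued SQL columns via bipartite matching.
import networkx as nx
from networkx.algorithms import bipartite

def match_columns(excel_df, sql_df):
    G = nx.Graph()
    excel_nodes = [('excel', c) for c in excel_df.columns]
    sql_nodes = [('sql', c) for c in sql_df.columns]
    G.add_nodes_from(excel_nodes, bipartite=0)
    G.add_nodes_from(sql_nodes, bipartite=1)
    for ec in excel_df.columns:
        for sc in sql_df.columns:
            # An edge means "these two columns hold the same values" (index ignored).
            if excel_df[ec].reset_index(drop=True).equals(sql_df[sc].reset_index(drop=True)):
                G.add_edge(('excel', ec), ('sql', sc))
    matching = bipartite.hopcroft_karp_matching(G, top_nodes=excel_nodes)
    mapping = {ec: matching[('excel', ec)][1]
               for ec in excel_df.columns if ('excel', ec) in matching}
    # A perfect matching pairs every Excel column with some SQL column; if
    # several columns are identical, one of the possible pairings is returned.
    return mapping if len(mapping) == len(excel_df.columns) else None
For the example data above, match_columns(ec1, sq1) returns {'ColA': 'Col/A', 'ColB': 'Col/B', 'ColC': 'Col/C'}, which can then be used to rename or reorder the SQL columns.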

Create Multiple New Columns Based on Pipe-Delimited Column in Pandas

I have a pandas dataframe with a pipe delimited column with an arbitrary number of elements, called Parts. The number of elements in these pipe-strings varies from 0 to over 10. The number of unique elements contained in all pipe-strings is not much smaller than the number of rows (which makes it impossible for me to manually specify all of them while creating new columns).
For each row, I want to create a new column that acts as an indicator variable for each element of the pipe-delimited list. For instance, a row with
... Parts ...
... 12|34|56 ...
should be transformed to
... Part_12  Part_34  Part_56 ...
...       1        1        1 ...
Because there are a lot of unique parts, these columns are obviously going to be sparse: mostly zeros, since each row only contains a small fraction of the unique parts.
I haven't found any approach that doesn't require manually specifying the columns (for instance, Pandas Dataframe: split column into multiple columns, right-align inconsistent cell entries).
I've also looked at pandas' melt, but I don't think that's the appropriate tool.
The way I know how to solve it would be to pipe the raw CSV to another python script and deal with it on a char-by-char basis, but I need to work within my existing script since I will be processing hundreds of CSVs in this manner.
Here's a better illustration of the data:
ID    YEAR  AMT      PARTZ
1202  2007  99.34
9321  1988  1012.99  2031|8942
2342  2012  381.22   1939|8321|Amx3
You can use get_dummies and add_prefix:
df.Parts.str.get_dummies().add_prefix('Part_')
Output:
   Part_12  Part_34  Part_56
0        1        1        1
Edit, in response to a comment, for counting duplicates:
df = pd.DataFrame({'Parts':['12|34|56|12']}, index=[0])
pd.get_dummies(df.Parts.str.split('|',expand=True).stack()).sum(level=0).add_prefix('Part_')
Output:
   Part_12  Part_34  Part_56
0        2        1        1
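On the question's own illustration the column is PARTZ, and the default separator for str.get_dummies is already '|'; a sketch of joining the indicator columns back onto the original frame (on recent pandas versions, the sum(level=0) above is written .groupby(level=0).sum() instead):
# Build one 0/1 column per unique part from the pipe-delimited PARTZ column and
# attach them to the original dataframe; rows where PARTZ is missing (NaN) get
# all zeros. The column name PARTZ comes from the question's sample data.
dummies = df['PARTZ'].str.get_dummies().add_prefix('Part_')
df = df.join(dummies)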
