Hi all,
I have two dataframes and I need to check whether the values from the first match those in the second, only for one specific column in each, and save the matching values in a new list. This is what I did, but it is taking quite a lot of time and I was wondering if there is a more efficient way. The lists look like the ones in the image above, taken from two different tables.
matching_words_sup = []
for x in df_bd_names['Building_Name']:
    for y in df_sup['Source_String']:
        if x == y:
            matching_words_sup.append(x)
Thanks
Let's create both dataframes:
import pandas as pd

df1 = pd.DataFrame({
    'Building_Name': ['Exces', 'Excs', 'Exec', 'Executer', 'Executor']
})
df2 = pd.DataFrame({
    'Source_String': ['Executer', 'Executor', 'Executor Of', 'Executor For', 'Exeutor']
})
Perform an inner merge between the dataframes and convert the first column to a list:
pd.merge(df1, df2, left_on='Building_Name', right_on='Source_String', how='inner')['Building_Name'].tolist()
Output:
['Executer', 'Executor']
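If you only need the matching values and not a merged dataframe, a vectorized membership check with isin is another option; a minimal sketch, assuming the same df1 and df2 as above:
# Keep only the Building_Name values that also appear in Source_String
matching_words_sup = df1.loc[
    df1['Building_Name'].isin(df2['Source_String']), 'Building_Name'
].tolist()
print(matching_words_sup)  # ['Executer', 'Executor']
This avoids the quadratic nested loop, since isin uses a hash-based lookup.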
Another option is to wrap the comparison in a small helper class and use np.intersect1d (the class name below is a placeholder, since the original snippet only showed the methods):
import numpy as np
import pandas as pd

class DFComparer:
    def __init__(self, df1, df2):
        self.df1 = df1
        self.df2 = df2

    def compareDFsEffectively(self):
        # Compare the underlying arrays and keep only the common values
        np1 = self.df1.to_numpy()
        np2 = self.df2.to_numpy()
        np_new = np.intersect1d(np1, np2)
        print(np_new)
        df_new = pd.DataFrame(np_new)
        print(df_new)
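For example, with the df1 and df2 defined earlier (a usage sketch, still using the placeholder class name):
comparer = DFComparer(df1, df2)
comparer.compareDFsEffectively()  # should print the common values, e.g. ['Executer' 'Executor']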
I have two dataframes
df1
IMPACT Rank
HIGH 1
MODERATE 2
LOW 3
MODIFIER 4
df2['Annotation']
Annotation
A|intron_variant|MODIFIER|PERM1|ENSG00000187642|Transcript|ENST00000341290|protein_coding||2/4||||||||||-1||HGNC|HGNC:28208||||,A|missense_variant|MODERATE|PERM1|ENSG00000187642|Transcript|ENST00000433179|protein_coding|1/3||||72|72|24|E/D|gaG/gaT|||-1||HGNC|HGNC:28208|YES|CCDS76083.1|deleterious(0)|probably_damaging(0.999),A|upstream_gene_variant|MODIFIER|PERM1|ENSG00000187642|Transcript|ENST00000479361|retained_intron|||||||||||4317|-1||HGNC|HGNC:28208||||
A|intron_variant|MODIFIER|PERM1|ENSG00000187642|Transcript|ENST00000341290|protein_coding||2/4||||||||||-1||HGNC|HGNC:28208||||,A|missense_variant|HIGH|PERM1|ENSG00000187642|Transcript|ENST00000433179|protein_coding|1/3||||72|72|24|E/D|gaG/gaT|||-1||HGNC|HGNC:28208|YES|CCDS76083.1|deleterious(0)|probably_damaging(0.999),A|upstream_gene_variant|MODIFIER|PERM1|ENSG00000187642|Transcript|ENST00000479361|retained_intron|||||||||||4317|-1||HGNC|HGNC:28208||||
A|intron_variant|MODIFIER|PERM1|ENSG00000187642|Transcript|ENST00000341290|protein_coding||2/4||||||||||-1||HGNC|HGNC:28208||||,A|missense_variant|LOW|PERM1|ENSG00000187642|Transcript|ENST00000433179|protein_coding|1/3||||72|72|24|E/D|gaG/gaT|||-1||HGNC|HGNC:28208|YES|CCDS76083.1|deleterious(0)|probably_damaging(0.999),A|upstream_gene_variant|MODIFIER|PERM1|ENSG00000187642|Transcript|ENST00000479361|retained_intron|||||||||||4317|-1||HGNC|HGNC:28208||||
There are multiple annotations separated by commas. I want to keep only one annotation per row, chosen according to the Rank in df1 (the lowest Rank wins).
My expected output will be:
df['RANKED']
RANKED
A|missense_variant|MODERATE|PERM1|ENSG00000187642|Transcript|ENST00000433179|protein_coding|1/3||||72|72|24|E/D|gaG/gaT|||-1||HGNC|HGNC:28208|YES|CCDS76083.1|deleterious(0)|probably_damaging(0.999)
A|missense_variant|HIGH|PERM1|ENSG00000187642|Transcript|ENST00000433179|protein_coding|1/3||||72|72|24|E/D|gaG/gaT|||-1||HGNC|HGNC:28208|YES|CCDS76083.1|deleterious(0)|probably_damaging(0.999)
A|missense_variant|LOW|PERM1|ENSG00000187642|Transcript|ENST00000433179|protein_coding|1/3||||72|72|24|E/D|gaG/gaT|||-1||HGNC|HGNC:28208|YES|CCDS76083.1|deleterious(0)|probably_damaging(0.999)
I tried the following code to generate the output, but it did not give me the expected result:
d = df1.set_index('IMPACT')['Rank'].to_dict()
max1 = df1['Rank'].max() + 1

def f(x):
    d1 = {y: d.get(y, max1) for y in x for y in x.split(',')}
    return min(d1, key=d1.get)

df2['RANKED'] = df2['Annotation'].apply(f)
Any help appreciated.
TL;DR
df2['RANKED'] = df2['Annotation'].str.split(',')
df2 = df2.explode(column='RANKED')
df2['IMPACT'] = df2["RANKED"].str.findall(r"|".join(df1['IMPACT'])).apply("".join)
df_merge = df2.merge(df1, how='left', on='IMPACT')
df_final = df_merge.loc[df_merge.groupby(['Annotation'])['Rank'].idxmin().sort_values()].drop(columns=['Annotation', 'IMPACT'])
Step-by-step
First, you define your dataframes:
import pandas as pd

df1 = pd.DataFrame({'IMPACT': ['HIGH', 'MODERATE', 'LOW', 'MODIFIER'], 'Rank': [1, 2, 3, 4]})
df2 = pd.DataFrame({
'Annotation':[
'A|intron_variant|MODIFIER|PERM1|ENSG00000187642|Transcript|ENST00000341290|protein_coding||2/4||||||||||-1||HGNC|HGNC:28208||||,A|missense_variant|MODERATE|PERM1|ENSG00000187642|Transcript|ENST00000433179|protein_coding|1/3||||72|72|24|E/D|gaG/gaT|||-1||HGNC|HGNC:28208|YES|CCDS76083.1|deleterious(0)|probably_damaging(0.999),A|upstream_gene_variant|MODIFIER|PERM1|ENSG00000187642|Transcript|ENST00000479361|retained_intron|||||||||||4317|-1||HGNC|HGNC:28208||||',
'A|intron_variant|MODIFIER|PERM1|ENSG00000187642|Transcript|ENST00000341290|protein_coding||2/4||||||||||-1||HGNC|HGNC:28208||||,A|missense_variant|HIGH|PERM1|ENSG00000187642|Transcript|ENST00000433179|protein_coding|1/3||||72|72|24|E/D|gaG/gaT|||-1||HGNC|HGNC:28208|YES|CCDS76083.1|deleterious(0)|probably_damaging(0.999),A|upstream_gene_variant|MODIFIER|PERM1|ENSG00000187642|Transcript|ENST00000479361|retained_intron|||||||||||4317|-1||HGNC|HGNC:28208||||',
'A|intron_variant|MODIFIER|PERM1|ENSG00000187642|Transcript|ENST00000341290|protein_coding||2/4||||||||||-1||HGNC|HGNC:28208||||,A|missense_variant|LOW|PERM1|ENSG00000187642|Transcript|ENST00000433179|protein_coding|1/3||||72|72|24|E/D|gaG/gaT|||-1||HGNC|HGNC:28208|YES|CCDS76083.1|deleterious(0)|probably_damaging(0.999),A|upstream_gene_variant|MODIFIER|PERM1|ENSG00000187642|Transcript|ENST00000479361|retained_intron|||||||||||4317|-1||HGNC|HGNC:28208||||']
})
Now here is the tricky part. You create a column holding the comma-split list of the original Annotation string, and then explode that column so each individual annotation gets its own row, with the original string repeated alongside it.
df2['RANKED'] = df2['Annotation'].str.split(',')
df2 = df2.explode(column='RANKED')
Next, you extract the IMPACT word from each RANKED value.
df2['IMPACT'] = df2["RANKED"].str.findall(r"|".join(df1['IMPACT'])).apply("".join)
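For instance, on one of the split strings this gives (a quick check of the pattern built from df1):
import re
pattern = r"|".join(df1['IMPACT'])  # 'HIGH|MODERATE|LOW|MODIFIER'
re.findall(pattern, 'A|missense_variant|MODERATE|PERM1|...')  # ['MODERATE']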
Then, you merge df1 and df2 to get the rank of each RANKED.
df_merge = df2.merge(df1, how='left', on='IMPACT')
Finally, this is the easy part where you discard everything you do not want in the final dataframe. This can be done via groupby.
df_final = df_merge.loc[df_merge.groupby(['Annotation'])['Rank'].idxmin().sort_values()].drop(columns=['Annotation', 'IMPACT'])
RANKED Rank
A|missense_variant|MODERATE|PERM1|ENSG00000187... 2
A|missense_variant|HIGH|PERM1|ENSG00000187642|... 1
A|missense_variant|LOW|PERM1|ENSG00000187642|T... 3
Or, by dropping duplicates:
df_final = df_merge.sort_values(['Annotation', 'Rank'], ascending=[False,True]).drop_duplicates(subset=['Annotation']).drop(columns=['Annotation', 'IMPACT'])
I want to map two dataframes in pandas. In DF1 I have
df1
my second dataframe looks like
df2
I want to merge the two dataframes and get something like this
merged DF
On the basis of the 1 occurring in DF1, it should be replaced by the corresponding value after merging.
So far I have tried:
mergedDF = pd.merge(df1,df2, on=companies)
Seems like you need the .idxmax() method.
merged = df1.merge(df2, on='Company')
merged['values'] = merged[[x for x in merged.columns if x != 'Company']].idxmax(axis=1)
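Since the original dataframes are only shown as screenshots, here is a minimal sketch with made-up data illustrating the idea (column names and values are assumptions, not the real data):
import pandas as pd

# Hypothetical DF1: one flag column per category, with a single 1 per row
df1 = pd.DataFrame({'Company': ['A', 'B', 'C'],
                    'low': [1, 0, 0],
                    'mid': [0, 1, 0],
                    'high': [0, 0, 1]})
# Hypothetical DF2: the same companies
df2 = pd.DataFrame({'Company': ['A', 'B', 'C']})

merged = df1.merge(df2, on='Company')
# For each row, take the name of the column that holds the 1
merged['values'] = merged[[x for x in merged.columns if x != 'Company']].idxmax(axis=1)
print(merged[['Company', 'values']])
#   Company values
# 0       A    low
# 1       B    mid
# 2       C   high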
I have 4 dataframe objects with 1 row and X columns that I would want to join, here's a screenshot of them:
I would want them to become one big row.
Thanks to anybody who helps!
You could use pd.concat as below:
import pandas as pd

df1 = pd.DataFrame(columns=list('ABC'))
df1.loc[0] = [1,1.23,'Hello']
df2 = pd.DataFrame(columns=list('DEF'))
df2.loc[0] = [2,2.23,'Hello1']
df3 = pd.DataFrame(columns=list('GHI'))
df3.loc[0] = [3,3.23,'Hello3']
df4 = pd.DataFrame(columns=list('JKL'))
df4.loc[0] = [4,4.23,'Hello4']
pd.concat([df1,df2,df3,df4],axis=1)
I have a dictionary of dataframes. Each of these dataframes has a column 'defrost_temperature'. What I want to do is make one new dataframe that collects all those columns, maintaining them as separate columns.
This is what I am doing right now:
merged_defrosts = pd.DataFrame()
for key in df_dict.keys():
    merged_defrosts[key] = df_dict[key]["defrost_temperature"]
But unfortunately, only the first column is filled correctly. The other columns are filled with NaN as shown in the screenshot
The different defrosts are not necessarily the same length (the fourth dataframe is 108 rows, the others are 109 rows).
You can try pd.merge on the index of the larger dataframe.
df_result = pd.DataFrame()
for i, df in enumerate(df_dict.values()):
    s1, s2 = f'_{i}', f'_{i+1}'
    m1, m2 = df_result.shape[0], df.shape[0]
    if m1 == 0:
        df_result = df
    elif m1 >= m2:
        df_result = df_result.merge(df, how='left', left_index=True, right_index=True, suffixes=(s1, s2))
    else:
        df_result = df.merge(df_result, how='left', left_index=True, right_index=True, suffixes=(s2, s1))
This does create undesired column names, though, which you can rename manually afterwards.
You could try to concat the dataframes horizontally after making the common column the index:
merged_defrosts = pd.concat(
    [df.set_index("defrost_temperature") for df in df_dict.values()], axis=1
).reset_index()
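If the NaNs in the original attempt come from index misalignment (the dataframes have different lengths and different indices), another option is to reset each column's index before concatenating side by side; a minimal sketch, assuming df_dict as described in the question:
import pandas as pd

# One column per dataframe, aligned on a fresh 0..n-1 index; the shorter one simply ends in NaN
merged_defrosts = pd.concat(
    {key: df["defrost_temperature"].reset_index(drop=True) for key, df in df_dict.items()},
    axis=1,
)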
I have two dataframes:
import pandas as pd
data = [['138249','Cat']
,['103669','Cat']
,['191826','Cat']
,['196655','Cat']
,['103669','Cat']
,['116780','Dog']
,['184831','Dog']
,['196655','Dog']
,['114333','Dog']
,['123757','Dog']]
df1 = pd.DataFrame(data, columns = ['Hash','Name'])
print(df1)
data2 = [
'138249',
'103669',
'191826',
'196655',
'116780',
'184831',
'114333',
'123757',]
df2 = pd.DataFrame(data2, columns = ['Hash'])
I want to write code that will take each item in the second dataframe, scan the leftmost values in the first dataframe, and then return all matching values from the first dataframe into a single cell in the second dataframe.
Here's the result I am aiming for:
Here's what I have tried:
#attempt one: use groupby to squish up the dataset. No results
past = df1.groupby('Hash')
print(past)
#attempt two: use merge. Result: empty dataframe
past1 = pd.merge(df1, df2, right_index=True, left_on='Hash')
print(past1)
#attempt three: use pivot. Result: not the right format.
past2 = df1.pivot(index = None, columns = 'Hash', values = 'Name')
print(past2)
I can do this in Excel with the VBA code here, but that code crashes when I apply it to my real dataset (likely because it is too big: approximately 30,000 rows long).
IIUC, first aggregate the names per Hash with a string join on df1, then reindex using df2:
df1.groupby('Hash')['Name'].agg(','.join).reindex(df2.Hash).reset_index()
Hash Name
0 138249 Cat
1 103669 Cat,Cat
2 191826 Cat
3 196655 Cat,Dog
4 116780 Dog
5 184831 Dog
6 114333 Dog
7 123757 Dog