Retrieve multiple lookup values in large dataset? - python

I have two dataframes:
import pandas as pd
data = [['138249','Cat']
,['103669','Cat']
,['191826','Cat']
,['196655','Cat']
,['103669','Cat']
,['116780','Dog']
,['184831','Dog']
,['196655','Dog']
,['114333','Dog']
,['123757','Dog']]
df1 = pd.DataFrame(data, columns = ['Hash','Name'])
print(df1)
data2 = [
'138249',
'103669',
'191826',
'196655',
'116780',
'184831',
'114333',
'123757',]
df2 = pd.DataFrame(data2, columns = ['Hash'])
I want to write code that takes each item in the second dataframe, scans the leftmost column of the first dataframe, and returns all matching values from the first dataframe into a single cell in the second dataframe.
Here's the result I am aiming for:
Here's what I have tried:
#attempt one: use groupby to squish up the dataset. No results
past = df1.groupby('Hash')
print(past)
#attempt two: use merge. Result: empty dataframe
past1 = pd.merge(df1, df2, right_index=True, left_on='Hash')
print(past1)
#attempt three: use pivot. Result: not the right format.
past2 = df1.pivot(index = None, columns = 'Hash', values = 'Name')
print(past2)
I can do this in Excel with VBA, but that code crashes when I apply it to my real dataset (likely because it is too big: approximately 30,000 rows).

IIUC, first aggregate Name per Hash in df1 with agg and join, then reindex using df2:
df1.groupby('Hash')['Name'].agg(','.join).reindex(df2.Hash).reset_index()
Hash Name
0 138249 Cat
1 103669 Cat,Cat
2 191826 Cat
3 196655 Cat,Dog
4 116780 Dog
5 184831 Dog
6 114333 Dog
7 123757 Dog
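If you prefer a merge-based version of the same idea, here is a minimal self-contained sketch: aggregate first, then left-merge onto df2 so its row order is preserved. (The as_index=False form is just an alternative to the reindex in the accepted answer.)

```python
import pandas as pd

df1 = pd.DataFrame([['138249','Cat'],['103669','Cat'],['191826','Cat'],
                    ['196655','Cat'],['103669','Cat'],['116780','Dog'],
                    ['184831','Dog'],['196655','Dog'],['114333','Dog'],
                    ['123757','Dog']], columns=['Hash','Name'])
df2 = pd.DataFrame(['138249','103669','191826','196655',
                    '116780','184831','114333','123757'], columns=['Hash'])

# collapse df1 to one row per Hash, joining all Name values with a comma
agg = df1.groupby('Hash', as_index=False)['Name'].agg(','.join)
# left-merge keeps df2's row order and marks missing hashes as NaN
out = df2.merge(agg, on='Hash', how='left')
print(out)
```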

Related

How to extract the data from a column with priority order given in another file?

I have two dataframes
df1
IMPACT Rank
HIGH 1
MODERATE 2
LOW 3
MODIFIER 4
df2['Annotation']
Annotation
A|intron_variant|MODIFIER|PERM1|ENSG00000187642|Transcript|ENST00000341290|protein_coding||2/4||||||||||-1||HGNC|HGNC:28208||||,A|missense_variant|MODERATE|PERM1|ENSG00000187642|Transcript|ENST00000433179|protein_coding|1/3||||72|72|24|E/D|gaG/gaT|||-1||HGNC|HGNC:28208|YES|CCDS76083.1|deleterious(0)|probably_damaging(0.999),A|upstream_gene_variant|MODIFIER|PERM1|ENSG00000187642|Transcript|ENST00000479361|retained_intron|||||||||||4317|-1||HGNC|HGNC:28208||||
A|intron_variant|MODIFIER|PERM1|ENSG00000187642|Transcript|ENST00000341290|protein_coding||2/4||||||||||-1||HGNC|HGNC:28208||||,A|missense_variant|HIGH|PERM1|ENSG00000187642|Transcript|ENST00000433179|protein_coding|1/3||||72|72|24|E/D|gaG/gaT|||-1||HGNC|HGNC:28208|YES|CCDS76083.1|deleterious(0)|probably_damaging(0.999),A|upstream_gene_variant|MODIFIER|PERM1|ENSG00000187642|Transcript|ENST00000479361|retained_intron|||||||||||4317|-1||HGNC|HGNC:28208||||
A|intron_variant|MODIFIER|PERM1|ENSG00000187642|Transcript|ENST00000341290|protein_coding||2/4||||||||||-1||HGNC|HGNC:28208||||,A|missense_variant|LOW|PERM1|ENSG00000187642|Transcript|ENST00000433179|protein_coding|1/3||||72|72|24|E/D|gaG/gaT|||-1||HGNC|HGNC:28208|YES|CCDS76083.1|deleterious(0)|probably_damaging(0.999),A|upstream_gene_variant|MODIFIER|PERM1|ENSG00000187642|Transcript|ENST00000479361|retained_intron|||||||||||4317|-1||HGNC|HGNC:28208||||
There are multiple annotations separated by commas; I want to keep only one annotation per row, chosen by the Rank in df1.
My expected output will be:
df['RANKED']
RANKED
A|missense_variant|MODERATE|PERM1|ENSG00000187642|Transcript|ENST00000433179|protein_coding|1/3||||72|72|24|E/D|gaG/gaT|||-1||HGNC|HGNC:28208|YES|CCDS76083.1|deleterious(0)|probably_damaging(0.999)
A|missense_variant|HIGH|PERM1|ENSG00000187642|Transcript|ENST00000433179|protein_coding|1/3||||72|72|24|E/D|gaG/gaT|||-1||HGNC|HGNC:28208|YES|CCDS76083.1|deleterious(0)|probably_damaging(0.999)
A|missense_variant|LOW|PERM1|ENSG00000187642|Transcript|ENST00000433179|protein_coding|1/3||||72|72|24|E/D|gaG/gaT|||-1||HGNC|HGNC:28208|YES|CCDS76083.1|deleterious(0)|probably_damaging(0.999)
I tried the following code to generate the output, but it did not give me the expected result:
d = df1.set_index('IMPACT')['Rank'].to_dict()
max1 = df1['Rank'].max()+1
def f(x):
    d1 = {y: d.get(y, max1) for y in x for y in x.split(',')}
    return min(d1, key=d1.get)
df2['RANKED'] = df2['Annotation'].apply(f)
Any help appreciated..
TL;DR
df2['RANKED'] = df2['Annotation'].str.split(',')
df2 = df2.explode(column='RANKED')
df2['IMPACT'] = df2["RANKED"].str.findall(r"|".join(df1['IMPACT'])).apply("".join)
df_merge = df2.merge(df1, how='left', on='IMPACT')
df_final = df_merge.loc[df_merge.groupby(['Annotation'])['Rank'].idxmin().sort_values()].drop(columns=['Annotation', 'IMPACT'])
Step-by-step
First you define your dataframes
df1 = pd.DataFrame({'IMPACT':['HIGH', 'MODERATE', 'LOW', 'MODIFIER'], 'Rank':[1,2,3,4]})
df2 = pd.DataFrame({
'Annotation':[
'A|intron_variant|MODIFIER|PERM1|ENSG00000187642|Transcript|ENST00000341290|protein_coding||2/4||||||||||-1||HGNC|HGNC:28208||||,A|missense_variant|MODERATE|PERM1|ENSG00000187642|Transcript|ENST00000433179|protein_coding|1/3||||72|72|24|E/D|gaG/gaT|||-1||HGNC|HGNC:28208|YES|CCDS76083.1|deleterious(0)|probably_damaging(0.999),A|upstream_gene_variant|MODIFIER|PERM1|ENSG00000187642|Transcript|ENST00000479361|retained_intron|||||||||||4317|-1||HGNC|HGNC:28208||||',
'A|intron_variant|MODIFIER|PERM1|ENSG00000187642|Transcript|ENST00000341290|protein_coding||2/4||||||||||-1||HGNC|HGNC:28208||||,A|missense_variant|HIGH|PERM1|ENSG00000187642|Transcript|ENST00000433179|protein_coding|1/3||||72|72|24|E/D|gaG/gaT|||-1||HGNC|HGNC:28208|YES|CCDS76083.1|deleterious(0)|probably_damaging(0.999),A|upstream_gene_variant|MODIFIER|PERM1|ENSG00000187642|Transcript|ENST00000479361|retained_intron|||||||||||4317|-1||HGNC|HGNC:28208||||',
'A|intron_variant|MODIFIER|PERM1|ENSG00000187642|Transcript|ENST00000341290|protein_coding||2/4||||||||||-1||HGNC|HGNC:28208||||,A|missense_variant|LOW|PERM1|ENSG00000187642|Transcript|ENST00000433179|protein_coding|1/3||||72|72|24|E/D|gaG/gaT|||-1||HGNC|HGNC:28208|YES|CCDS76083.1|deleterious(0)|probably_damaging(0.999),A|upstream_gene_variant|MODIFIER|PERM1|ENSG00000187642|Transcript|ENST00000479361|retained_intron|||||||||||4317|-1||HGNC|HGNC:28208||||']
})
Now here is the tricky part. Create a column containing the comma-split list of the original Annotation string, then explode it so each annotation gets its own row, with the original string repeated alongside.
df2['RANKED'] = df2['Annotation'].str.split(',')
df2 = df2.explode(column='RANKED')
Next, you extract the IMPACT word from each RANKED column.
df2['IMPACT'] = df2["RANKED"].str.findall(r"|".join(df1['IMPACT'])).apply("".join)
Then, you merge df1 and df2 to get the rank of each RANKED.
df_merge = df2.merge(df1, how='left', on='IMPACT')
Finally, this is the easy part where you discard everything you do not want in the final dataframe. This can be done via groupby.
df_final = df_merge.loc[df_merge.groupby(['Annotation'])['Rank'].idxmin().sort_values()].drop(columns=['Annotation', 'IMPACT'])
RANKED Rank
A|missense_variant|MODERATE|PERM1|ENSG00000187... 2
A|missense_variant|HIGH|PERM1|ENSG00000187642|... 1
A|missense_variant|LOW|PERM1|ENSG00000187642|T... 3
Or, by dropping duplicates:
df_final = df_merge.sort_values(['Annotation', 'Rank'], ascending=[False,True]).drop_duplicates(subset=['Annotation']).drop(columns=['Annotation', 'IMPACT'])
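For completeness, the question's apply-based attempt can also be repaired: its dict comprehension ranked whole annotation strings (which never appear in the rank dict) instead of the impact field inside each one. A per-row sketch, assuming the impact word always appears as a |-separated field:

```python
import pandas as pd

df1 = pd.DataFrame({'IMPACT': ['HIGH', 'MODERATE', 'LOW', 'MODIFIER'],
                    'Rank': [1, 2, 3, 4]})
d = df1.set_index('IMPACT')['Rank'].to_dict()
default = df1['Rank'].max() + 1  # annotations with no known impact sort last

def best_annotation(cell):
    # rank each comma-separated annotation by the impact field it contains,
    # then return the annotation with the smallest (best) rank
    def rank(ann):
        return min((d[f] for f in ann.split('|') if f in d), default=default)
    return min(cell.split(','), key=rank)

# toy cell with two annotations; HIGH (rank 1) beats MODIFIER (rank 4)
print(best_annotation('A|x|MODIFIER|gene,A|y|HIGH|gene'))  # → A|y|HIGH|gene
```

This can then be applied with df2['RANKED'] = df2['Annotation'].apply(best_annotation).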

Copy matching value from one df to another given multiple conditions

I have two dataframes. The first, df1, has a non-unique ID and a timestamp value in ms. The other, df2, has the non-unique ID, a separate unique ID, a start time and an end time (both in ms).
I need to get the correct unique ID for each row in df1 from df2. I would do this by...
match each non-unique ID in df1 to the relevant series of rows in df2
of those rows, find the one with the start and end range that contains the timestamp in df1
get the unique ID from the resulting row and copy it to a new column in df1
I don't think I can use pd.merge since I need to compare the df1 timestamp to two different columns in df2. I would think df.apply is my answer, but I can't figure it out.
Here is some dummy code:
df1_dict = {
'nonunique_id': ['abc','def','ghi','jkl'],
'timestamp': [164.3,2071.2,1001.7,846.4]
}
df2_dict = {
'nonunique_id': ['abc','abc','def','def','ghi','ghi','jkl','jkl'],
'unique_id': ['a162c1','md85k','dk102','l394j','dj4n5','s092k','dh567','57ghed0'],
'time_start': [160,167,2065,2089,1000,1010,840,876],
'time_end': [166,170,2088,3000,1009,1023,875,880]
}
df1 = pd.DataFrame(data=df1_dict)
df2 = pd.DataFrame(data=df2_dict)
And here is a manual test...
df2['unique_id'][(df2['nonunique_id'].eq('abc')) & (df2['time_start']<=164.3) & (df2['time_end']>=164.3)]
...which returns the expected output (the relevant unique ID from df2):
0 a162c1
Name: unique_id, dtype: object
I'd like a function that can apply the above manual test automatically, and copy the results to a new column in df1.
I tried this...
def unique_id_fetcher(nonunique_id, timestamp):
    cond_1 = df2['nonunique_id'].eq(nonunique_id)
    cond_2 = df2['time_start'] <= timestamp
    cond_3 = df2['time_end'] >= timestamp
    unique_id = df2['unique_id'][(cond_1) & (cond_2) & (cond_3)]
    return unique_id
df1['unique_id'] = df1.apply(unique_id_fetcher(df1['nonunique_id'],df1['timestamp']))
...but that results in:
ValueError: Can only compare identically-labeled Series objects
(Edited for clarity)
IIUC, you can do a cartesian product of both dataframes with a merge and apply your logic, then create a dict and map the values back onto df1 using nonunique_id as the key.
df1['key'] = 'var'
df2['key'] = 'var'
df3 = pd.merge(df1,df2,on=['key','nonunique_id'],how='outer')
df4 = df3.loc[
(df3["timestamp"] >= df3["time_start"]) & (df3["timestamp"] <= df3["time_end"])
]
d = dict(zip(df4['nonunique_id'],df4['unique_id']))
df1['unique_id'] = df1['nonunique_id'].map(d)
print(df1.drop('key',axis=1))
nonunique_id timestamp unique_id
0 abc 164.3 a162c1
1 def 2071.2 dk102
2 ghi 1001.7 dj4n5
3 jkl 846.4 dh567
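Since nonunique_id exists in both frames, the dummy key column isn't strictly needed; a sketch of the same idea using a direct many-to-many merge and Series.between:

```python
import pandas as pd

df1 = pd.DataFrame({'nonunique_id': ['abc', 'def', 'ghi', 'jkl'],
                    'timestamp': [164.3, 2071.2, 1001.7, 846.4]})
df2 = pd.DataFrame({'nonunique_id': ['abc','abc','def','def','ghi','ghi','jkl','jkl'],
                    'unique_id': ['a162c1','md85k','dk102','l394j','dj4n5','s092k','dh567','57ghed0'],
                    'time_start': [160, 167, 2065, 2089, 1000, 1010, 840, 876],
                    'time_end': [166, 170, 2088, 3000, 1009, 1023, 875, 880]})

merged = df1.merge(df2, on='nonunique_id')  # many-to-many join on the shared id
inside = merged['timestamp'].between(merged['time_start'], merged['time_end'])
df1 = df1.merge(merged.loc[inside, ['nonunique_id', 'unique_id']],
                on='nonunique_id', how='left')
print(df1)
```

Note this assumes at most one interval matches per row; overlapping intervals would duplicate rows in the final merge.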

add two pandas dataframe columns which differs by only suffix parameter for e.g., "A_x", "A_y" and rename these two columns addition with "A"

How can I add two pandas dataframe columns that differ only by suffix, e.g. "A_x" and "A_y", and name the resulting sum column "A"?
For example, I have data with column pairs such as CT_1_x and CT_1_y (screenshot omitted). The summed columns must be renamed without the suffix, i.e. to CT_1, CT_2, etc.
Use:
df = pd.DataFrame([np.arange(6)], columns=['a','s','CT_1_x','CT_1_y','CT_2_x','CT_2_y'])
print (df)
a s CT_1_x CT_1_y CT_2_x CT_2_y
0 0 1 2 3 4 5
df = df.set_index(['a','s']).groupby(lambda x: x.rsplit('_', 1)[0], axis=1).sum().reset_index()
print (df)
a s CT_1 CT_2
0 0 1 5 9
To add the two columns
df['A'] = df['A_x'] + df['A_y']
and if you want to remove the original columns (note that drop returns a new dataframe):
df = df.drop(columns=['A_x', 'A_y'])
If you have too many such columns (col2sum = ['A_1', 'A_2', ...]) to type by hand, the best way would be to melt the df into long form:
dfm = pd.melt(df, id_vars = ???, value_vars = col2sum)
and go from there (e.g. groupby).
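As a variation on the first answer (the axis=1 form of groupby has since been deprecated in pandas), the same pairwise sum can be sketched by grouping the transposed value columns by their base name:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([np.arange(6)], columns=['a', 's', 'CT_1_x', 'CT_1_y', 'CT_2_x', 'CT_2_y'])

# sum every group of columns that share a name once the _x/_y suffix is stripped;
# transposing lets us use a plain row-wise groupby instead of the deprecated axis=1
value_cols = [c for c in df.columns if c.endswith(('_x', '_y'))]
summed = df[value_cols].T.groupby(lambda c: c.rsplit('_', 1)[0]).sum().T
out = pd.concat([df[['a', 's']], summed], axis=1)
print(out)  # columns a, s, CT_1, CT_2 with CT_1 = 2+3 = 5 and CT_2 = 4+5 = 9
```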

Iterate to find the repeat values in Pandas dataframe

Windows 10, Python 3.6
I have a dataframe df
df=pd.DataFrame({'name':['boo', 'foo', 'too', 'boo', 'roo', 'too'],
'zip':['30004', '02895', '02895', '30750', '02895', '02895']})
I want to find the repeated records that have the same 'name' and 'zip', and record how many times they repeat. The ideal output is:
name repeat zip
0 too 1 02895
Because my real dataframe has far more than six rows, I need an iterative method. I appreciate any tips.
I believe you need to group by all columns and use GroupBy.size:
#create DataFrame from online source
#df = pd.read_csv('someonline.csv')
#df = pd.read_html('someurl')[0]
#L = []
#for x in iterator:
#in loop added data to list
# L.append(x)
##created DataFrame from constructor
#df = pd.DataFrame(L)
df = df.groupby(df.columns.tolist()).size().reset_index(name='repeat')
#if need specify columns
#df = df.groupby(['name','zip']).size().reset_index(name='repeat')
print (df)
name zip repeat
0 boo 30004 1
1 boo 30750 1
2 foo 02895 1
3 roo 02895 1
4 too 02895 2
Pandas has a handy .duplicated() method that can help you identify duplicates.
df.duplicated()
By passing the duplicate vector into a selection you can get the duplicate record:
df[df.duplicated()]
You can get the number of duplicated rows by using .sum():
df.duplicated().sum()
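To reproduce the question's ideal output exactly (one row per duplicated pair, with repeat counting occurrences beyond the first), the two answers can be combined; a sketch, assuming "repeat" means extra occurrences rather than the total count:

```python
import pandas as pd

df = pd.DataFrame({'name': ['boo', 'foo', 'too', 'boo', 'roo', 'too'],
                   'zip': ['30004', '02895', '02895', '30750', '02895', '02895']})

counts = df.groupby(['name', 'zip']).size().reset_index(name='repeat')
counts['repeat'] -= 1                          # count repeats beyond the first occurrence
result = counts[counts['repeat'] > 0].reset_index(drop=True)
print(result)  # one row: name='too', zip='02895', repeat=1
```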

How to sum columns from three different dataframes with a common key

I am reading in an excel spreadsheet about schools with three sheets as follows.
import sys
import pandas as pd
inputfile = sys.argv[1]
xl = pd.ExcelFile(inputfile)
print(xl.sheet_names)
df1 = xl.parse(xl.sheet_names[0], skiprows=14)
df2 = xl.parse(xl.sheet_names[1], skiprows=14)
df3 = xl.parse(xl.sheet_names[2], skiprows=14)
df1.columns = [chr(65+i) for i in range(len(df1.columns))]
df2.columns = df1.columns
df3.columns = df1.columns
The unique id for each school is in column 'D' in each of the three dataframes. I would like to make a new dataframe which has two columns. The first is the sum of column 'G' from df1, df2, df3 and the second is the sum of column 'K' from df1, df2, df3. In other words, I think I need the following steps.
Filter rows for which unique column 'D' ids actually exist in all three dataframes. If the school doesn't appear in all three sheets then I discard it.
For each remaining row (school), add up the values in column 'G' in the three dataframes.
Do the same for column 'K'.
I am new to pandas but how should I do this? Somehow the unique ids have to be used in steps 2 and 3 to make sure the values that are added correspond to the same school.
Attempted solution
df1 = df1.set_index('D')
df2 = df2.set_index('D')
df3 = df3.set_index('D')
df1['SumK']= df1['K'] + df2['K'] + df3['K']
df1['SumG']= df1['G'] + df2['G'] + df3['G']
After concatenating the dataframes, count the rows per 'D' value: since each school appears at most once per sheet, a count of 3 means the id exists in all three dataframes. Use that to filter the concatenated dataframe, then sum whichever columns you need, e.g.:
df = pd.concat([df1, df2, df3])
counts = df.groupby('D').size()
criteria = df.D.isin(counts[counts == 3].index)
df[criteria].groupby('D')[['G', 'K']].sum()
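The question's attempted set_index approach also works once the frames are first restricted to the common ids; a self-contained sketch with toy stand-in data (the real columns come from the spreadsheet):

```python
import pandas as pd

# toy stand-ins for the three parsed sheets; 'D' is the school id
df1 = pd.DataFrame({'D': ['s1', 's2', 's3'], 'G': [1, 2, 3], 'K': [10, 20, 30]})
df2 = pd.DataFrame({'D': ['s1', 's2'], 'G': [4, 5], 'K': [40, 50]})
df3 = pd.DataFrame({'D': ['s2', 's1'], 'G': [6, 7], 'K': [60, 70]})

s1, s2, s3 = (d.set_index('D') for d in (df1, df2, df3))
common = s1.index.intersection(s2.index).intersection(s3.index)  # ids in all three
# label-aligned addition over the common index, regardless of row order per sheet
out = pd.DataFrame({
    'SumG': s1.loc[common, 'G'] + s2.loc[common, 'G'] + s3.loc[common, 'G'],
    'SumK': s1.loc[common, 'K'] + s2.loc[common, 'K'] + s3.loc[common, 'K'],
})
print(out)  # s1: SumG=1+4+7=12, SumK=120; s2: SumG=13, SumK=130; s3 dropped
```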
