Copy matching value from one df to another given multiple conditions - python

I have two dataframes. The first, df1, has a non-unique ID and a timestamp value in ms. The other, df2, has the non-unique ID, a separate unique ID, a start time and an end time (both in ms).
I need to get the correct unique ID for each row in df1 from df2. I would do this by:
1. matching each non-unique ID in df1 to the relevant series of rows in df2
2. of those rows, finding the one whose start/end range contains the timestamp in df1
3. taking the unique ID from the resulting row and copying it to a new column in df1
I don't think I can use pd.merge since I need to compare the df1 timestamp to two different columns in df2. I would think df.apply is my answer, but I can't figure it out.
Here is some dummy code:
import pandas as pd

df1_dict = {
    'nonunique_id': ['abc', 'def', 'ghi', 'jkl'],
    'timestamp': [164.3, 2071.2, 1001.7, 846.4]
}
df2_dict = {
    'nonunique_id': ['abc', 'abc', 'def', 'def', 'ghi', 'ghi', 'jkl', 'jkl'],
    'unique_id': ['a162c1', 'md85k', 'dk102', 'l394j', 'dj4n5', 's092k', 'dh567', '57ghed0'],
    'time_start': [160, 167, 2065, 2089, 1000, 1010, 840, 876],
    'time_end': [166, 170, 2088, 3000, 1009, 1023, 875, 880]
}
df1 = pd.DataFrame(data=df1_dict)
df2 = pd.DataFrame(data=df2_dict)
And here is a manual test...
df2['unique_id'][(df2['nonunique_id'].eq('abc')) & (df2['time_start']<=164.3) & (df2['time_end']>=164.3)]
...which returns the expected output (the relevant unique ID from df2):
0 a162c1
Name: unique_id, dtype: object
I'd like a function that can apply the above manual test automatically, and copy the results to a new column in df1.
I tried this...
def unique_id_fetcher(nonunique_id, timestamp):
    cond_1 = df2['nonunique_id'].eq(nonunique_id)
    cond_2 = df2['time_start'] <= timestamp
    cond_3 = df2['time_end'] >= timestamp
    unique_id = df2['unique_id'][(cond_1) & (cond_2) & (cond_3)]
    return unique_id
df1['unique_id'] = df1.apply(unique_id_fetcher(df1['nonunique_id'],df1['timestamp']))
...but that results in:
ValueError: Can only compare identically-labeled Series objects
(Edited for clarity)
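For reference, the ValueError above comes from calling unique_id_fetcher once with whole Series as arguments, rather than letting apply invoke it row by row. A minimal row-wise sketch (my rewrite, assuming each timestamp falls inside at most one range in df2):
def unique_id_fetcher(row):
    matches = df2['unique_id'][
        df2['nonunique_id'].eq(row['nonunique_id'])
        & (df2['time_start'] <= row['timestamp'])
        & (df2['time_end'] >= row['timestamp'])
    ]
    # Return the single matching ID, or None when no range contains the timestamp
    return matches.iloc[0] if not matches.empty else None

df1['unique_id'] = df1.apply(unique_id_fetcher, axis=1)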

IIUC,
you can do a cartesian product of both dataframes via a merge, then apply your logic;
then you create a dict and map the values back onto your df1 using nonunique_id as the key.
df1['key'] = 'var'
df2['key'] = 'var'
df3 = pd.merge(df1,df2,on=['key','nonunique_id'],how='outer')
df4 = df3.loc[
(df3["timestamp"] >= df3["time_start"]) & (df3["timestamp"] <= df3["time_end"])
]
d = dict(zip(df4['nonunique_id'],df4['unique_id']))
df1['unique_id'] = df1['nonunique_id'].map(d)
print(df1.drop('key',axis=1))
nonunique_id timestamp unique_id
0 abc 164.3 a162c1
1 def 2071.2 dk102
2 ghi 1001.7 dj4n5
3 jkl 846.4 dh567
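As a side note (not part of the original answer): starting from the original df1/df2, before the key column is added, the dummy key is redundant, because merging on nonunique_id alone already produces the per-ID cartesian product. A shorter sketch of the same idea:
df3 = pd.merge(df1, df2, on='nonunique_id', how='inner')  # one row per (df1 row, candidate range)
df4 = df3[(df3['timestamp'] >= df3['time_start']) & (df3['timestamp'] <= df3['time_end'])]
df1['unique_id'] = df1['nonunique_id'].map(dict(zip(df4['nonunique_id'], df4['unique_id'])))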

Related

creating a new column in a dataframe based on 4 other dataframes

Imagine we have 4 dataframes
df1(35000, 20)
df2(12000, 21)
df3(323, 18)
df4(220, 6)
Here is where it gets tricky:
df4 was created by a merge of df3 and df2 based on 1 column.
It took 3 columns from df3 and 3 columns from df2. (that is why it has 6 cols in total)
What I want is the following: I wish to create an extra column in df1 and insert specific values for the rows that have the same value in a specific column in df1 and df3. For this reason I have done the following:
df1['new col'] = df1['Name'].isin(df3['Name'])
Now my new column is filled with True/False values indicating whether the value in the Name column is the same for both df1 and df3. So far so good, but I want to fill this new column with the values of a specific column from df2. I tried the following:
df1['new col'] = df1['Name'].map({True:df2['Address'],False:'no address inserted'})
However, it inserts all the address values from df2 in that cell instead of only the one value that is needed. Any ideas?
I also tried the following
merged = df2.merge(df4, how='left', left_on='Name', right_on='First Name', indicator=True)
df1['Code'] = np.where(merged['_merge'] == 'both', merged['Address'], 'n.a.')
but I get the following error
Length of values (1210) does not match length of index (35653)
Merge using how='left' and then fill the missing values with fillna:
merged = df2.merge(df4, how='left', left_on='Name', right_on='First Name', indicator=True)
merged[address_column] = merged[address_column].fillna('n.a.')  # address_column is the name (or list of names) of the column(s) whose NaNs you want to replace
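A minimal self-contained sketch of that pattern (the Name / First Name / Address columns and values here are hypothetical stand-ins for the real data):
import pandas as pd

df2 = pd.DataFrame({'Name': ['Ann', 'Bob', 'Eve']})
df4 = pd.DataFrame({'First Name': ['Ann', 'Eve'], 'Address': ['12 Oak St', '9 Elm St']})

# Left merge keeps every row of df2; rows without a match get NaN addresses
merged = df2.merge(df4, how='left', left_on='Name', right_on='First Name', indicator=True)
merged['Address'] = merged['Address'].fillna('n.a.')
print(merged[['Name', 'Address', '_merge']])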

How to extract the data from a column with priority order given in another file?

I have two dataframes
df1
IMPACT Rank
HIGH 1
MODERATE 2
LOW 3
MODIFIER 4
df2['Annotation']
Annotation
A|intron_variant|MODIFIER|PERM1|ENSG00000187642|Transcript|ENST00000341290|protein_coding||2/4||||||||||-1||HGNC|HGNC:28208||||,A|missense_variant|MODERATE|PERM1|ENSG00000187642|Transcript|ENST00000433179|protein_coding|1/3||||72|72|24|E/D|gaG/gaT|||-1||HGNC|HGNC:28208|YES|CCDS76083.1|deleterious(0)|probably_damaging(0.999),A|upstream_gene_variant|MODIFIER|PERM1|ENSG00000187642|Transcript|ENST00000479361|retained_intron|||||||||||4317|-1||HGNC|HGNC:28208||||
A|intron_variant|MODIFIER|PERM1|ENSG00000187642|Transcript|ENST00000341290|protein_coding||2/4||||||||||-1||HGNC|HGNC:28208||||,A|missense_variant|HIGH|PERM1|ENSG00000187642|Transcript|ENST00000433179|protein_coding|1/3||||72|72|24|E/D|gaG/gaT|||-1||HGNC|HGNC:28208|YES|CCDS76083.1|deleterious(0)|probably_damaging(0.999),A|upstream_gene_variant|MODIFIER|PERM1|ENSG00000187642|Transcript|ENST00000479361|retained_intron|||||||||||4317|-1||HGNC|HGNC:28208||||
A|intron_variant|MODIFIER|PERM1|ENSG00000187642|Transcript|ENST00000341290|protein_coding||2/4||||||||||-1||HGNC|HGNC:28208||||,A|missense_variant|LOW|PERM1|ENSG00000187642|Transcript|ENST00000433179|protein_coding|1/3||||72|72|24|E/D|gaG/gaT|||-1||HGNC|HGNC:28208|YES|CCDS76083.1|deleterious(0)|probably_damaging(0.999),A|upstream_gene_variant|MODIFIER|PERM1|ENSG00000187642|Transcript|ENST00000479361|retained_intron|||||||||||4317|-1||HGNC|HGNC:28208||||
There are multiple annotations separated by , (comma); I want to keep only one annotation per row, chosen by the Rank in df1.
My expected output will be:
df['RANKED']
RANKED
A|missense_variant|MODERATE|PERM1|ENSG00000187642|Transcript|ENST00000433179|protein_coding|1/3||||72|72|24|E/D|gaG/gaT|||-1||HGNC|HGNC:28208|YES|CCDS76083.1|deleterious(0)|probably_damaging(0.999)
A|missense_variant|HIGH|PERM1|ENSG00000187642|Transcript|ENST00000433179|protein_coding|1/3||||72|72|24|E/D|gaG/gaT|||-1||HGNC|HGNC:28208|YES|CCDS76083.1|deleterious(0)|probably_damaging(0.999)
A|missense_variant|LOW|PERM1|ENSG00000187642|Transcript|ENST00000433179|protein_coding|1/3||||72|72|24|E/D|gaG/gaT|||-1||HGNC|HGNC:28208|YES|CCDS76083.1|deleterious(0)|probably_damaging(0.999)
I tried the following code to generate the output, but it did not give me the expected result:
d = df1.set_index('IMPACT')['Rank'].to_dict()
max1 = df1['Rank'].max() + 1
def f(x):
    d1 = {y: d.get(y, max1) for y in x for y in x.split(',')}
    return min(d1, key=d1.get)
df2['RANKED'] = df2['Annotation'].apply(f)
Any help appreciated.
TL;DR
df2['RANKED'] = df2['Annotation'].str.split(',')
df2 = df2.explode(column='RANKED')
df2['IMPACT'] = df2["RANKED"].str.findall(r"|".join(df1['IMPACT'])).apply("".join)
df_merge = df2.merge(df1, how='left', on='IMPACT')
df_final = df_merge.loc[df_merge.groupby(['Annotation'])['Rank'].idxmin().sort_values()].drop(columns=['Annotation', 'IMPACT'])
Step-by-step
First you define your dataframes:
import pandas as pd

df1 = pd.DataFrame({'IMPACT': ['HIGH', 'MODERATE', 'LOW', 'MODIFIER'], 'Rank': [1, 2, 3, 4]})
df2 = pd.DataFrame({
'Annotation':[
'A|intron_variant|MODIFIER|PERM1|ENSG00000187642|Transcript|ENST00000341290|protein_coding||2/4||||||||||-1||HGNC|HGNC:28208||||,A|missense_variant|MODERATE|PERM1|ENSG00000187642|Transcript|ENST00000433179|protein_coding|1/3||||72|72|24|E/D|gaG/gaT|||-1||HGNC|HGNC:28208|YES|CCDS76083.1|deleterious(0)|probably_damaging(0.999),A|upstream_gene_variant|MODIFIER|PERM1|ENSG00000187642|Transcript|ENST00000479361|retained_intron|||||||||||4317|-1||HGNC|HGNC:28208||||',
'A|intron_variant|MODIFIER|PERM1|ENSG00000187642|Transcript|ENST00000341290|protein_coding||2/4||||||||||-1||HGNC|HGNC:28208||||,A|missense_variant|HIGH|PERM1|ENSG00000187642|Transcript|ENST00000433179|protein_coding|1/3||||72|72|24|E/D|gaG/gaT|||-1||HGNC|HGNC:28208|YES|CCDS76083.1|deleterious(0)|probably_damaging(0.999),A|upstream_gene_variant|MODIFIER|PERM1|ENSG00000187642|Transcript|ENST00000479361|retained_intron|||||||||||4317|-1||HGNC|HGNC:28208||||',
'A|intron_variant|MODIFIER|PERM1|ENSG00000187642|Transcript|ENST00000341290|protein_coding||2/4||||||||||-1||HGNC|HGNC:28208||||,A|missense_variant|LOW|PERM1|ENSG00000187642|Transcript|ENST00000433179|protein_coding|1/3||||72|72|24|E/D|gaG/gaT|||-1||HGNC|HGNC:28208|YES|CCDS76083.1|deleterious(0)|probably_damaging(0.999),A|upstream_gene_variant|MODIFIER|PERM1|ENSG00000187642|Transcript|ENST00000479361|retained_intron|||||||||||4317|-1||HGNC|HGNC:28208||||']
})
Now here is the tricky part. Split the original Annotation column on commas to get a column of lists, then explode that column so each individual annotation gets its own row, with the original string repeated alongside it.
df2['RANKED'] = df2['Annotation'].str.split(',')
df2 = df2.explode(column='RANKED')
Next, you extract the IMPACT word from each RANKED value.
df2['IMPACT'] = df2["RANKED"].str.findall(r"|".join(df1['IMPACT'])).apply("".join)
Then, you merge df1 and df2 to get the rank of each RANKED.
df_merge = df2.merge(df1, how='left', on='IMPACT')
Finally, the easy part: keep only the top-ranked row per original Annotation and discard the helper columns. This can be done via groupby and idxmin.
df_final = df_merge.loc[df_merge.groupby(['Annotation'])['Rank'].idxmin().sort_values()].drop(columns=['Annotation', 'IMPACT'])
RANKED Rank
A|missense_variant|MODERATE|PERM1|ENSG00000187... 2
A|missense_variant|HIGH|PERM1|ENSG00000187642|... 1
A|missense_variant|LOW|PERM1|ENSG00000187642|T... 3
Or by dropping duplicates:
df_final = df_merge.sort_values(['Annotation', 'Rank'], ascending=[False,True]).drop_duplicates(subset=['Annotation']).drop(columns=['Annotation', 'IMPACT'])
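One caveat worth flagging (my note, not from the answer): findall plus join would concatenate two impact words if an annotation ever contained more than one; str.extract takes just the first match and may read more clearly:
pattern = f"({'|'.join(df1['IMPACT'])})"  # e.g. (HIGH|MODERATE|LOW|MODIFIER)
df2['IMPACT'] = df2['RANKED'].str.extract(pattern, expand=False)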

Alter value in pandas dataframe based on corresponding value in another dataframe

I have two pandas dataframes.
The first (df1) has two columns: 'Country' (string) and 'Population' (int). Each row consists of a different country and its corresponding population (~200 rows).
The second (df2) also has two columns: 'Country' (string) and 'Value' (int). Each country appears a variable number of times in random order with a corresponding value (thousands of rows).
I want to divide each value in df2['Value'] by the corresponding population of that row's country.
My attempt (assume there's a list called countries containing all countries in these dataframes):
for country in countries:
    val = df2.loc[df2['Country'] == country]['Values']      # All values corresponding to country
    pop = df1.loc[df1['Country'] == country]['Population']  # Population corresponding to country
    df2.loc[df2['Country'] == country]['Values'] = val / pop
Is there a better way to do this? Perhaps a solution that doesn't involve a for-loop?
Thanks
Try the following:
# Assuming that there are the same countries in both df
df3 = pd.merge(df2, df1, how='inner', on='Country')
df3["Values2"] = df3["Values"] / df3["Population"]
An alternative implementation would be to join the two tables before applying the division operator (note that join aligns on the other frame's index, so df1 has to be indexed by Country first). Something along the lines of:
df2 = df2.join(df1.set_index('Country'), on='Country', how='left')
df2['Values'] = df2['Values'] / df2['Population']
You can use merge for that:
df3 = df2.merge(df1, on='Country') # maybe you want to use how='left'
df3['Div'] = df3['Values'] / df3['Population']
You can read more about merge in the docs
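A further option, not in the answers above: map the population onto df2 by country and divide directly, skipping the merge entirely:
pop = df1.set_index('Country')['Population']
df2['Values'] = df2['Values'] / df2['Country'].map(pop)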

Retrieve multiple lookup values in large dataset?

I have two dataframes:
import pandas as pd
data = [['138249','Cat']
,['103669','Cat']
,['191826','Cat']
,['196655','Cat']
,['103669','Cat']
,['116780','Dog']
,['184831','Dog']
,['196655','Dog']
,['114333','Dog']
,['123757','Dog']]
df1 = pd.DataFrame(data, columns = ['Hash','Name'])
print(df1)
data2 = [
'138249',
'103669',
'191826',
'196655',
'116780',
'184831',
'114333',
'123757',]
df2 = pd.DataFrame(data2, columns = ['Hash'])
I want to write code that will take each item in the second dataframe, scan the leftmost values in the first dataframe, and return all matching values from the first dataframe into a single cell in the second dataframe.
Here's the result I am aiming for (shown in the original post as a screenshot: one row per Hash with all matching Names collected in a single cell).
Here's what I have tried:
#attempt one: use groupby to squish up the dataset. No results
past = df1.groupby('Hash')
print(past)
#attempt two: use merge. Result: empty dataframe
past1 = pd.merge(df1, df2, right_index=True, left_on='Hash')
print(past1)
#attempt three: use pivot. Result: not the right format.
past2 = df1.pivot(index = None, columns = 'Hash', values = 'Name')
print(past2)
I can do this in Excel with the VBA code here, but that code crashes when I apply it to my real dataset (likely because it is too big: approximately 30,000 rows).
IIUC, first aggregate the Names per Hash in df1 with agg(','.join), then reindex using df2:
df1.groupby('Hash')['Name'].agg(','.join).reindex(df2.Hash).reset_index()
Hash Name
0 138249 Cat
1 103669 Cat,Cat
2 191826 Cat
3 196655 Cat,Dog
4 116780 Dog
5 184831 Dog
6 114333 Dog
7 123757 Dog
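One caveat (my addition, not in the original answer): any Hash in df2 that never appears in df1 comes back as NaN after the reindex; a fillna covers that:
result = df1.groupby('Hash')['Name'].agg(','.join).reindex(df2.Hash).fillna('').reset_index()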

Pandas: Add row depending on index

I just want to create a dataframe (df3) that keeps updating itself, adding rows from other dataframes (df1, df2) based on an index ("ID").
When adding a new dataframe, if an overlapping index is found, update the data; if it is not found, append the data under the new index.
df1 = pd.DataFrame({"Proj. Num" :["A"],'ID':[000],'DATA':["NO_DATA"]})
df1 = df1.set_index(["ID"])
df2 = pd.DataFrame({"Proj. Num" :["B"],'ID':[100],'DATA':["OK"], })
df2 = df2.set_index(["ID"])
df3 = pd.DataFrame({"Proj. Num" :["B"],'ID':[100],'DATA':["NO_OK"], })
df3 = df3.set_index(["ID"])
#df3 = pd.concat([df1,df2, df3]) #Concat,merge,join???
df3
I have tried concat with verify_integrity=False but it just gives an error, and I think there is a simpler, nicer way to do it.
Solution with concat + Index.duplicated to build a boolean mask, then filter by boolean indexing:
df3 = pd.concat([df1, df2, df3])
df3 = df3[~df3.index.duplicated()]
print (df3)
DATA Proj. Num
ID
0 NO_DATA A
100 OK B
Another solution, from a comment (thank you):
df3 = pd.concat([df3,df1])
df3 = df3[~df3.index.duplicated(keep='last')]
print (df3)
DATA Proj. Num
ID
100 NO_OK B
0 NO_DATA A
You can concatenate all the dataframes along the index, group by index, and decide which element of each group sharing the same index to keep.
From your question it looks like you want to keep the last (most updated) element with a given index. The order in which you pass the dataframes to pd.concat therefore matters.
For a list of other methods, see here.
res = pd.concat([df1, df2, df3], axis = 0)
res.groupby(res.index).last()
Which gives:
DATA Proj. Num
ID
0 NO_DATA A
100 NO_OK B
#update existing rows
df3.update(df1)
#append new rows
df3 = pd.concat([df3,df1[~df1.index.isin(df3.index)]])
#update existing rows
df3.update(df2)
#append new rows
df3 = pd.concat([df3,df2[~df2.index.isin(df3.index)]])
Out[2438]:
DATA Proj. Num
ID
100 OK B
0 NO_DATA A
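The update-then-append pattern above generalizes into a small helper; a sketch (the upsert name is mine, not from the answer):
def upsert(base, new):
    # Overwrite values for indices already present, then append the genuinely new rows
    base = base.copy()
    base.update(new)
    return pd.concat([base, new[~new.index.isin(base.index)]])

df3 = upsert(upsert(df3, df1), df2)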
