Join big dataframes based on partial string-match between columns - python

Two DataFrames have gene and isoform names that are not formatted the same way. I'd like to do a join and add the df2 columns gene and isoform for all partial string matches between the isoform (df2) and the Name (df1). df2 is a key for the isoforms/genes, where a gene can have many isoforms. In df1, which is basically the output of a gene-quantification tool (SALMON), the Name field contains both the gene and the isoform. I can't use a regex since isoforms have variable suffixes, such as ".", "_", "-", and many others.
Another important piece of information is that each df1["Name"] cell has a unique isoform.
Piece of dfs to merge:
import pandas as pd
df1 = pd.DataFrame({'Name': {0: 'AT1G01010;AT1G01010.1;Isoseq::Chr1:3616-5846', 1: 'AT1G01010;AT1G01010_2;Isoseq::Chr1:3630-5894', 2: 'AT1G01010;AT1G01010.3;Isoseq::Chr1:3635-5849', 3: 'AT1G01020;AT1G01020.11;Isoseq::Chr1:6803-8713', 4: 'AT1G01020;AT1G01020.13;Isoseq::Chr1:6811-8713'}, 'Length': {0: 2230, 1: 2264, 2: 2214, 3: 1910, 4: 1902}, 'EffectiveLength': {0: 1980.0, 1: 2014.0, 2: 1964.0, 3: 1660.0, 4: 1652.0}, 'TPM': {0: 2.997776, 1: 1.58178, 2: 0.0, 3: 4.317311, 4: 0.0}, 'NumReads': {0: 154.876, 1: 83.124, 2: 0.0, 3: 187.0, 4: 0.0}})
df2 = pd.DataFrame({'gene': {0: 'AT1G01010', 14: 'AT1G01010', 30: 'AT1G01010', 46: 'AT1G01020', 62: 'AT1G01020', 80: 'AT1G01020', 100: 'AT1G01020', 116: 'AT1G01020', 138: 'AT1G01020', 156: 'AT1G01020'}, 'isoform': {0: 'AT1G01010.1', 14: 'AT1G01010_2', 30: 'AT1G01010.3', 46: 'AT1G01020.1', 62: 'AT1G01020.10', 80: 'AT1G01020.11', 100: 'AT1G01020.12', 116: 'AT1G01020.13', 138: 'AT1G01020.14', 156: 'AT1G01020.15'}})
display(df1)
display(df2)
Desired output:
df3 = pd.DataFrame({'gene': {0: 'AT1G01010', 1:"AT1G01010", 2:"AT1G01010", 3:"AT1G01020", 4:"AT1G01020"},'isoform': {0: 'AT1G01010.1',1:"AT1G01010_2", 2:"AT1G01010.3", 3:"AT1G01020.11", 4:"AT1G01020.13"}, 'Length': {0: 2230, 1: 2264, 2: 2214, 3: 1910, 4: 1902}, 'EffectiveLength': {0: 1980.0, 1: 2014.0, 2: 1964.0, 3: 1660.0, 4: 1652.0}, 'TPM': {0: 2.997776, 1: 1.58178, 2: 0.0, 3: 4.317311, 4: 0.0}, 'NumReads': {0: 154.876, 1: 83.124, 2: 0.0, 3: 187.0, 4: 0.0}})
# The "Name" column from df1 is no longer needed (the idea is to replace it with gene and isoform).
display(df3)
Real dfs size:
df1 = 143646 rows × 5 columns
df2 = 169499 rows × 2 columns
(since df1 may not have all the isoforms detected, it's always smaller than df2)
I tried some answers I found online, but since these dfs are so large, many of them need 50 GB of RAM or so...
Already checked: Merge Dataframes Based on Partial Substrings Match, Join to Dataframes based on partial string matches in python, Join dataframes based on partial string-match between columns
Thanks for the help!
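One possible approach, sketched under an assumption the question does not confirm: every Name value follows the gene;isoform;location layout seen in the sample rows. If that holds, the isoform can be cut out of Name with a plain split and joined to df2 with an exact merge, which avoids both regex and a row-by-row substring comparison (the part that tends to blow up memory). The trimmed-down frames below are illustrative only.

```python
import pandas as pd

# Reduced versions of the question's frames (fewer rows and columns).
df1 = pd.DataFrame({
    "Name": [
        "AT1G01010;AT1G01010.1;Isoseq::Chr1:3616-5846",
        "AT1G01010;AT1G01010_2;Isoseq::Chr1:3630-5894",
        "AT1G01020;AT1G01020.11;Isoseq::Chr1:6803-8713",
    ],
    "TPM": [2.997776, 1.58178, 4.317311],
})
df2 = pd.DataFrame({
    "gene": ["AT1G01010", "AT1G01010", "AT1G01020", "AT1G01020"],
    "isoform": ["AT1G01010.1", "AT1G01010_2", "AT1G01020.1", "AT1G01020.11"],
})

# ASSUMPTION: Name is always "gene;isoform;rest", so field 1 is the isoform.
parts = df1["Name"].str.split(";", n=2, expand=True)

# Replace Name with the extracted isoform, then join on exact equality.
keyed = df1.drop(columns="Name").assign(isoform=parts[1])
merged = df2.merge(keyed, on="isoform", how="inner")
print(merged)
```

Because the join key is compared for equality, AT1G01020.1 does not accidentally match AT1G01020.11, which a substring search would do. If some Name values deviate from the assumed layout, those rows simply drop out of the inner merge and would need separate handling.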

Related

how to aggregate columns based on the value of others

If I had a dataframe such as this, how would I create aggregates such as min, max and mean for each Port for each given year?
df1 = pd.DataFrame({'Year': {0: 2019, 1: 2019, 2: 2019, 3: 2019, 4:2019},'Port': {0: 'NORTH SHIELDS', 1: 'NORTH SHIELDS', 2: 'NORTH SHIELDS', 3: 'NORTH SHIELDS', 4: 'NORTH SHIELDS'},'Vessel capacity units': {0: 760.5, 1: 760.5, 2: 760.5, 3: 760.5, 4: 760.5},'Engine power': {0: 790.0, 1: 790.0, 2: 790.0, 3: 790.0, 4: 790.0},'Registered tonnage': {0: 516.0, 1: 516.0, 2: 516.0, 3: 516.0, 4: 516.0},'Overall length': {0: 45.0, 1: 45.0, 2: 45.0, 3: 45.0, 4: 45.0},'Value(£)': {0: 2675.81, 1: 62.98, 2: 9.67, 3: 527.02, 4: 2079.0}, 'Landed Weight (tonnes)': {0: 0.978,1: 0.0135, 2: 0.001, 3: 0.3198, 4: 3.832}})
df1
IIUC
df1.groupby(['Port', 'Year'])['<WHATEVER COLUMN HERE>'].agg(['count', 'min', 'max', 'mean']) # groups by 'Port' and 'Year', then computes count, min, max and mean for the chosen column
Without any kind of background information this question is tricky. Would you want it for every year or just some given years?
To extract min/max/mean etc is quite straightforward. I assume that you have some kind of datafile and have extracted a df from there:
file = 'my-data.csv' # the data file
df = pd.read_csv(file)
VALUE_I_WANT_TO_EXTRACT = df['Column name']
Then for each port you can extract the min/max/mean data like this:
# group the selected column by port and print one statistic per group
for port, values in VALUE_I_WANT_TO_EXTRACT.groupby(df['Port']):
    print(port, values.min())
But, as I said, without any specific knowledge about the problem it is hard to provide a solution.
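As a concrete illustration of the groupby suggestion, here is a minimal sketch on a reduced frame. The column names come from the question; the 2020 row is an assumption added only so that more than one group appears (the question's sample has 2019 only).

```python
import pandas as pd

# Reduced frame; the 2020 row is hypothetical and only there to show a second group.
df = pd.DataFrame({
    "Year": [2019, 2019, 2019, 2020],
    "Port": ["NORTH SHIELDS"] * 4,
    "Landed Weight (tonnes)": [0.978, 0.0135, 0.001, 0.3198],
})

# One row per (Port, Year) pair, one column per requested statistic.
stats = (
    df.groupby(["Port", "Year"])["Landed Weight (tonnes)"]
      .agg(["count", "min", "max", "mean"])
)
print(stats)
```

The result is indexed by a (Port, Year) MultiIndex; calling stats.reset_index() turns the two keys back into ordinary columns if that is easier to work with.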

sort the order of dataframes in a list of dataframes based on a value in each dataframe

I have a list of dataframes and I want to sort the order in which they appear in the list.
Each dataframe has the same structure, as shown below:
df1 = pd.DataFrame.from_dict({'Ch1': {0: -28, 1: -36, 2: -39, 3: -16}, 'Ch2': {0: 543, 1: 547, 2: 559, 3: 561}, 'Ch3': {0: -126, 1: -131, 2: -147, 3: -149}, 'time': {0: '2022-02-10 16.37.25.502', 1: '2022-02-10 16.37.25.502', 2: '2022-02-10 16.37.25.502', 3: '2022-02-10 16.37.25.502'}})
df2 = pd.DataFrame.from_dict({'Ch1': {0: 81, 1: 70, 2: 70, 3: 75}, 'Ch2': {0: 570, 1: 559, 2: 554, 3: 565}, 'Ch3': {0: -103, 1: -120, 2: -131, 3: -122}, 'time': {0: '2022-02-11 05.29.28.116', 1: '2022-02-11 05.29.28.116', 2: '2022-02-11 05.29.28.116', 3: '2022-02-11 05.29.28.116'}})
df3 = pd.DataFrame.from_dict({'Ch1': {0: -887, 1: -887, 2: -890, 3: -898}, 'Ch2': {0: 1307, 1: 1292, 2: 1301, 3: 1307}, 'Ch3': {0: 59, 1: 61, 2: 57, 3: 55}, 'time': {0: '2022-02-08 01.12.54.578', 1: '2022-02-08 01.12.54.578', 2: '2022-02-08 01.12.54.578', 3: '2022-02-08 01.12.54.578'}})
df_list = [df1,df2,df3]
The value in the "time" column does not change from row to row within the same dataframe.
I want the dataframes in the list sorted by time (first to last) so that further processing can match them up with other data.
My attempt thus far:
for i in df_list:
    b = pd.to_datetime(i['time'].iloc[0]) #grab the first cell that contains the time stamp
    b = b.sort_values(by('time'))
It returns the following error:
ValueError: ('Unknown string format:', '2022-02-05 08.03.09.794')
I would expect the dataframes to appear in the list with df3 first, df1 second and df2 last. The time column will be dropped for other operations, so I would like the dataframes to be in time order before that happens.
Any help, suggestions or alternative approaches are greatly appreciated.
If you want to sort the rows of each dataframe, you need to provide the exact format of your datetime, and you should sort in place:
for d in df_list:
    d['time'] = pd.to_datetime(d['time'], format='%Y-%m-%d %H.%M.%S.%f')
    d.sort_values(by='time', inplace=True)
Or, if you want to sort the dataframes in the list, which is completely different, use:
df_list.sort(key=lambda d: d['time'].iloc[0])
You should be able to sort using the string due to your particular format (assuming YYYY-MM-DD).
To ensure sorting on datetime (for example if the format was MM-DD-YYYY):
df_list.sort(key=lambda d: pd.to_datetime(d['time'].iloc[0], format='%Y-%m-%d %H.%M.%S.%f'))
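A small self-contained sketch of the list-sorting variant, using the three timestamps from the question. The helper make_df and the shortened Ch1 data are assumptions made to keep the example brief. It uses sorted() rather than list.sort() so that the original list is left untouched.

```python
import pandas as pd

FMT = "%Y-%m-%d %H.%M.%S.%f"  # dots, not colons, separate the time fields

def make_df(ch1, stamp):
    # Hypothetical helper: builds a small frame with a constant time column.
    return pd.DataFrame({"Ch1": ch1, "time": [stamp] * len(ch1)})

df1 = make_df([-28, -36], "2022-02-10 16.37.25.502")
df2 = make_df([81, 70], "2022-02-11 05.29.28.116")
df3 = make_df([-887, -887], "2022-02-08 01.12.54.578")
df_list = [df1, df2, df3]

# Sort the frames themselves by the parsed timestamp in their first row.
df_sorted = sorted(
    df_list,
    key=lambda d: pd.to_datetime(d["time"].iloc[0], format=FMT),
)
first_stamps = [d["time"].iloc[0] for d in df_sorted]
print(first_stamps)
```

Since the sort happens before any column is removed, the time column can be dropped from each frame afterwards without losing the ordering.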

add a column to a dataset using a formula

let me rephrase my question:
I have the following dataset:
data = {
'globalId': {0: 4388064, 1: 4388200, 2: 4399344, 3: 4400638, 4: 4401765, 5: 4401831},
'publicatieDatum': {0: '2018-07-31', 1: '2018-09-24', 2: '2018-08-02', 3: '2018-08-04', 4: '2018-08-05', 5: '2018-08-06'},
'postcode': {0: '1774PG', 1: '7481LK', 2: '1068MS', 3: '5628EN', 4: '7731TV', 5: '5971CR'},
'koopPrijs': {0: 139000.0, 1: 209000.0, 2: 267500.0, 3: 349000.0, 4: 495000.0, 5: 162500.0}
}
df = pd.DataFrame(data)
print(df)
Now, I want to add a column called Gemeente.
This can be retrieved using the following formula:
nomi.query_postal_code(["postcode"])
The postcode above should be the four digits at the start of each value in the postcode column.
I have 2 questions:
How can I add code that calculates the gemeente for all rows in the above dataframe, based on the 'postcode', as specified above?
How can this code be written so that it only selects the first 4 digits in the postcode column?
Apologies and thanks!
Try (slicing each postcode down to its first four characters before the lookup):
df["Gemeente"] = df.apply(lambda x: nomi.query_postal_code(x["postcode"][:4]), axis=1)
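To show the two steps separately (take the first four characters, then look each one up), here is a sketch that does not depend on whatever library nomi comes from. The function lookup_gemeente and its table are hypothetical stand-ins for nomi.query_postal_code, and the municipality names are placeholders rather than real data.

```python
import pandas as pd

df = pd.DataFrame({"postcode": ["1774PG", "7481LK", "1068MS"]})

# HYPOTHETICAL stand-in for nomi.query_postal_code; the values are placeholders.
GEMEENTE_BY_DIGITS = {
    "1774": "gemeente_A",
    "7481": "gemeente_B",
    "1068": "gemeente_C",
}

def lookup_gemeente(digits: str) -> str:
    # Return the placeholder municipality for a 4-digit code, or "unknown".
    return GEMEENTE_BY_DIGITS.get(digits, "unknown")

# Step 1: vectorised slice of the first four characters of every postcode.
df["postcode4"] = df["postcode"].str[:4]

# Step 2: apply the lookup to each 4-digit code.
df["Gemeente"] = df["postcode4"].map(lookup_gemeente)
print(df)
```

With the real lookup, it is worth checking what a single call returns first: if it returns a whole record rather than one string, the municipality field would need to be selected from that record before assigning the column.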

Transforming a Dataframe with duplicate data in python

I would like to transform the below dataframe to concatenate duplicate data into a single row. For example:
data_dict={'FromTo_U': {0: 'L->R', 1: 'L->R', 2: 'S->I'},
'GeneName': {0: 'EGFR', 1: 'EGFR', 2: 'EGFR'},
'MutationAA_C': {0: 'p.L858R', 1: 'p.L858R', 2: 'p.S768I'},
'MutationDescription': {0: 'Substitution - Missense',
1: 'Substitution - Missense',
2: 'Substitution - Missense'},
'PubMed': {0: '22523351', 1: '23915069', 2: '26862733'},
'VariantID': {0: 'COSM12979', 1: 'COSM12979', 2: 'COSM18486'},
'VariantPos_U': {0: '858', 1: '858', 2: '768'},
'VariantSource': {0: 'COSMIC', 1: 'COSMIC', 2: 'COSMIC'}}
df1=pd.DataFrame(data_dict)
The transformed dataframe should be:
data_dict_t={'FromTo_U': {0: 'L->R', 2: 'S->I'},
'GeneName': {0: 'EGFR', 2: 'EGFR'},
'MutationAA_C': {0: 'p.L858R', 2: 'p.S768I'},
'MutationDescription': {0: 'Substitution - Missense',2: 'Substitution - Missense'},
'PubMed': {0: '22523351,23915069', 2: '26862733'},
'VariantID': {0: 'COSM12979', 2: 'COSM18486'},
'VariantPos_U': {0: '858', 2: '768'},
'VariantSource': {0: 'COSMIC', 2: 'COSMIC'}}
I want to merge the two rows of df1 only if the PubMed IDs are different and the rest of the columns have the same data. Thanks in advance!
Use groupby + agg with str.join as the aggfunc.
c = df1.columns.difference(['PubMed']).tolist()
df1.groupby(c, as_index=False).PubMed.agg(','.join)
  FromTo_U GeneName MutationAA_C      MutationDescription  VariantID  \
0     L->R     EGFR      p.L858R  Substitution - Missense  COSM12979
1     S->I     EGFR      p.S768I  Substitution - Missense  COSM18486

  VariantPos_U VariantSource             PubMed
0          858        COSMIC  22523351,23915069
1          768        COSMIC           26862733
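The same groupby + agg idea, as a runnable sketch on a reduced version of the question's frame (only three of the original columns are kept, which is a simplification for brevity):

```python
import pandas as pd

df1 = pd.DataFrame({
    "GeneName": ["EGFR", "EGFR", "EGFR"],
    "VariantID": ["COSM12979", "COSM12979", "COSM18486"],
    "PubMed": ["22523351", "23915069", "26862733"],
})

# Group on every column except PubMed, so rows merge only when all of
# the remaining columns are identical.
key_cols = df1.columns.difference(["PubMed"]).tolist()

# Within each group, join the PubMed strings with a comma.
merged = df1.groupby(key_cols, as_index=False)["PubMed"].agg(",".join)
print(merged)
```

Note that ",".join requires the PubMed values to be strings, as they are in the question; numeric IDs would have to be converted with astype(str) first.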

Split list in Pandas dataframe column into multiple columns

I am working with movie data and have a dataframe column for movie genre. Currently the column contains a list of movie genres for each movie (as most movies are assigned to multiple genres), but for the purpose of this analysis, I would like to parse the list and create a new dataframe column for each genre. So instead of having genre=['Drama','Thriller'] for a given movie, I would have two columns, something like genre1='Drama' and genre2='Thriller'.
Here is a snippet of my data:
{'color': {0: [u'Color::(Technicolor)'],
1: [u'Color::(Technicolor)'],
2: [u'Color::(Technicolor)'],
3: [u'Color::(Technicolor)'],
4: [u'Black and White']},
'country': {0: [u'USA'],
1: [u'USA'],
2: [u'USA'],
3: [u'USA', u'UK'],
4: [u'USA']},
'genre': {0: [u'Crime', u'Drama'],
1: [u'Crime', u'Drama'],
2: [u'Crime', u'Drama'],
3: [u'Action', u'Crime', u'Drama', u'Thriller'],
4: [u'Crime', u'Drama']},
'language': {0: [u'English'],
1: [u'English', u'Italian', u'Latin'],
2: [u'English', u'Italian', u'Spanish', u'Latin', u'Sicilian'],
3: [u'English', u'Mandarin'],
4: [u'English']},
'rating': {0: 9.3, 1: 9.2, 2: 9.0, 3: 9.0, 4: 8.9},
'runtime': {0: [u'142'],
1: [u'175'],
2: [u'202', u'220::(The Godfather Trilogy 1901-1980 VHS Special Edition)'],
3: [u'152'],
4: [u'96']},
'title': {0: u'The Shawshank Redemption',
1: u'The Godfather',
2: u'The Godfather: Part II',
3: u'The Dark Knight',
4: u'12 Angry Men'},
'votes': {0: 1793199, 1: 1224249, 2: 842044, 3: 1774083, 4: 484061},
'year': {0: 1994, 1: 1972, 2: 1974, 3: 2008, 4: 1957}}
Any help would be greatly appreciated! Thanks!
I think you need the DataFrame constructor with add_prefix, followed by concat to join the result back to the original:
df1 = pd.DataFrame(df.genre.values.tolist()).add_prefix('genre_')
df = pd.concat([df.drop('genre',axis=1), df1], axis=1)
Timings:
df = pd.DataFrame(d)
print (df)
#5000 rows
df = pd.concat([df]*1000).reset_index(drop=True)
In [394]: %timeit (pd.concat([df.drop('genre',axis=1), pd.DataFrame(df.genre.values.tolist()).add_prefix('genre_')], axis=1))
100 loops, best of 3: 3.4 ms per loop
In [395]: %timeit (pd.concat([df.drop(['genre'],axis=1),df['genre'].apply(pd.Series).rename(columns={0:'genre_0',1:'genre_1',2:'genre_2',3:'genre_3'})],axis=1))
1 loop, best of 3: 757 ms per loop
This should work for you:
pd.concat([df.drop(['genre'],axis=1),df['genre'].apply(pd.Series).rename(columns={0:'genre_0',1:'genre_1',2:'genre_2',3:'genre_3'})],axis=1)
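For reference, a self-contained sketch of the constructor-based approach on two of the movies from the question. Passing index=df.index is an addition here (not in the answers above) that keeps the new columns aligned if the frame does not have a default 0..n-1 index.

```python
import pandas as pd

df = pd.DataFrame({
    "title": ["The Godfather", "The Dark Knight"],
    "genre": [["Crime", "Drama"], ["Action", "Crime", "Drama", "Thriller"]],
})

# Expand each list into its own row of columns; shorter lists are padded
# with missing values. add_prefix names the columns genre_0, genre_1, ...
genres = pd.DataFrame(df["genre"].tolist(), index=df.index).add_prefix("genre_")

# Swap the list column for the expanded columns.
wide = pd.concat([df.drop(columns="genre"), genres], axis=1)
print(wide)
```

The number of genre_ columns equals the length of the longest list, so movies with fewer genres end up with missing values in the trailing columns.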
