Transforming a DataFrame with duplicate data in Python

I would like to transform the DataFrame below so that duplicate data is concatenated into a single row. For example:
import pandas as pd
data_dict={'FromTo_U': {0: 'L->R', 1: 'L->R', 2: 'S->I'},
'GeneName': {0: 'EGFR', 1: 'EGFR', 2: 'EGFR'},
'MutationAA_C': {0: 'p.L858R', 1: 'p.L858R', 2: 'p.S768I'},
'MutationDescription': {0: 'Substitution - Missense',
1: 'Substitution - Missense',
2: 'Substitution - Missense'},
'PubMed': {0: '22523351', 1: '23915069', 2: '26862733'},
'VariantID': {0: 'COSM12979', 1: 'COSM12979', 2: 'COSM18486'},
'VariantPos_U': {0: '858', 1: '858', 2: '768'},
'VariantSource': {0: 'COSMIC', 1: 'COSMIC', 2: 'COSMIC'}}
df1=pd.DataFrame(data_dict)
The transformed DataFrame should be:
data_dict_t={'FromTo_U': {0: 'L->R', 2: 'S->I'},
'GeneName': {0: 'EGFR', 2: 'EGFR'},
'MutationAA_C': {0: 'p.L858R', 2: 'p.S768I'},
'MutationDescription': {0: 'Substitution - Missense',2: 'Substitution - Missense'},
'PubMed': {0: '22523351,23915069', 2: '26862733'},
'VariantID': {0: 'COSM12979', 2: 'COSM18486'},
'VariantPos_U': {0: '858', 2: '768'},
'VariantSource': {0: 'COSMIC', 2: 'COSMIC'}}
I want to merge rows of df1 only if their PubMed IDs differ and all the other columns hold the same data. Thanks in advance!

Use groupby + agg with str.join as the aggfunc.
c = df1.columns.difference(['PubMed']).tolist()
df1.groupby(c, as_index=False).PubMed.agg(','.join)
   FromTo_U  GeneName  MutationAA_C      MutationDescription  VariantID  \
0      L->R      EGFR       p.L858R  Substitution - Missense  COSM12979
1      S->I      EGFR       p.S768I  Substitution - Missense  COSM18486

   VariantPos_U  VariantSource             PubMed
0           858         COSMIC  22523351,23915069
1           768         COSMIC           26862733
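
For reference, a self-contained version of the groupby + agg idea might look like the sketch below. The small df1 here is a trimmed-down stand-in for the question's data (only three of its columns), and the names group_cols and merged are illustrative, not part of the original answer.

```python
import pandas as pd

# Trimmed-down stand-in for the question's df1 (assumption: fewer columns).
df1 = pd.DataFrame({
    'GeneName': ['EGFR', 'EGFR', 'EGFR'],
    'VariantID': ['COSM12979', 'COSM12979', 'COSM18486'],
    'PubMed': ['22523351', '23915069', '26862733'],
})

# Group on every column except PubMed ...
group_cols = df1.columns.difference(['PubMed']).tolist()

# ... and join the PubMed IDs of each group into one comma-separated string.
merged = df1.groupby(group_cols, as_index=False)['PubMed'].agg(','.join)
print(merged)
```

Two things to keep in mind: ','.join only works if PubMed holds strings, and groupby drops rows whose key columns contain NaN by default (dropna=False, available from pandas 1.1, keeps them).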

Related

How to make this to data frame?

I am using Python and trying to convert this to a DataFrame, but the dictionaries have different lengths.
Do you have any ideas? The number of keys present (0-6 in total) differs from row to row.
0 {1: 0.14428478, 3: 0.3088169, 5: 0.54362816}
1 {0: 0.41822478, 2: 0.081520624, 3: 0.40189278,...
2 {3: 0.9927109}
3 {0: 0.07826376, 3: 0.9162877}
4 {0: 0.022929467, 1: 0.0127365505, 2: 0.8355256...
...
59834 {1: 0.93473625, 5: 0.055679787}
59835 {1: 0.72145665, 3: 0.022041071, 5: 0.25396}
59836 {0: 0.01922486, 1: 0.019249884, 2: 0.5345934, ...
59837 {0: 0.014184893, 1: 0.23436697, 2: 0.58155864,...
59838 {0: 0.013977169, 1: 0.24653174, 2: 0.60093427,...
I would like to get the Python code for this.
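
One possible approach, sketched under the assumption that the data is a pandas Series whose values are dicts: build the DataFrame from the list of dicts, so that every key becomes a column and missing keys become NaN. The name s is hypothetical, and the three rows are the complete (non-truncated) rows 0, 2 and 3 shown above.

```python
import pandas as pd

# Three complete rows taken from the question; `s` is a hypothetical name.
s = pd.Series([
    {1: 0.14428478, 3: 0.3088169, 5: 0.54362816},
    {3: 0.9927109},
    {0: 0.07826376, 3: 0.9162877},
])

# One column per key; keys missing from a row become NaN.
df = pd.DataFrame(s.tolist(), index=s.index)

# Order the columns by key and (optionally) treat missing keys as 0.
df = df.reindex(columns=sorted(df.columns)).fillna(0.0)
print(df)
```

Whether filling with 0 is appropriate depends on what the values mean; dropping the fillna call keeps the gaps as NaN.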

Join big dataframes based on partial string-match between columns

Two DataFrames have gene and isoform names that are not formatted the same way. I'd like to do a join and add the df2 columns gene and isoform for all partial string matches between the isoform (df2) and the Name (df1). df2 is a key for the isoforms/genes, where a gene can have many isoforms. In df1, which is basically the output of a gene-quantification tool (SALMON), the Name field contains both the gene and the isoform. I can't use regex since isoforms have variable suffixes, such as ".", "_", "-", and many others.
Another important piece of information is that each df1["Name"] cell has a unique isoform.
Piece of dfs to merge:
import pandas as pd
df1 = pd.DataFrame({'Name': {0: 'AT1G01010;AT1G01010.1;Isoseq::Chr1:3616-5846', 1: 'AT1G01010;AT1G01010_2;Isoseq::Chr1:3630-5894', 2: 'AT1G01010;AT1G01010.3;Isoseq::Chr1:3635-5849', 3: 'AT1G01020;AT1G01020.11;Isoseq::Chr1:6803-8713', 4: 'AT1G01020;AT1G01020.13;Isoseq::Chr1:6811-8713'}, 'Length': {0: 2230, 1: 2264, 2: 2214, 3: 1910, 4: 1902}, 'EffectiveLength': {0: 1980.0, 1: 2014.0, 2: 1964.0, 3: 1660.0, 4: 1652.0}, 'TPM': {0: 2.997776, 1: 1.58178, 2: 0.0, 3: 4.317311, 4: 0.0}, 'NumReads': {0: 154.876, 1: 83.124, 2: 0.0, 3: 187.0, 4: 0.0}})
df2 = pd.DataFrame({'gene': {0: 'AT1G01010', 14: 'AT1G01010', 30: 'AT1G01010', 46: 'AT1G01020', 62: 'AT1G01020', 80: 'AT1G01020', 100: 'AT1G01020', 116: 'AT1G01020', 138: 'AT1G01020', 156: 'AT1G01020'}, 'isoform': {0: 'AT1G01010.1', 14: 'AT1G01010_2', 30: 'AT1G01010.3', 46: 'AT1G01020.1', 62: 'AT1G01020.10', 80: 'AT1G01020.11', 100: 'AT1G01020.12', 116: 'AT1G01020.13', 138: 'AT1G01020.14', 156: 'AT1G01020.15'}})
display(df1)
display(df2)
Desired output:
df3 = pd.DataFrame({'gene': {0: 'AT1G01010', 1:"AT1G01010", 2:"AT1G01010", 3:"AT1G01020", 4:"AT1G01020"},'isoform': {0: 'AT1G01010.1',1:"AT1G01010_2", 2:"AT1G01010.3", 3:"AT1G01020.11", 4:"AT1G01020.13"}, 'Length': {0: 2230, 1: 2264, 2: 2214, 3: 1910, 4: 1902}, 'EffectiveLength': {0: 1980.0, 1: 2014.0, 2: 1964.0, 3: 1660.0, 4: 1652.0}, 'TPM': {0: 2.997776, 1: 1.58178, 2: 0.0, 3: 4.317311, 4: 0.0}, 'NumReads': {0: 154.876, 1: 83.124, 2: 0.0, 3: 187.0, 4: 0.0}})
#"Name" column from df1 is not necessary anymore. (the idea is to replace it for gene and isoform)
display(df3)
Real dfs size:
df1 = 143646 rows × 5 columns
df2 = 169499 rows × 2 columns
(since df1 may not have all the isoforms detected, it's always smaller than df2)
I tried some answers I found online, but since these DataFrames are huge, many of them need 50 GB of RAM or so...
Already checked: Merge Dataframes Based on Partial Substrings Match, Join to Dataframes based on partial string matches in python, Join dataframes based on partial string-match between columns
Thanks for the help!
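
A possible way around the partial matching, sketched under one strong assumption: in every df1 Name the isoform is the second ';'-separated field, as it is in all five sample rows. If that holds for the real data, the isoform can be extracted once and joined by exact match, which needs far less memory than comparing every pair of strings. The frames below are shortened versions of the sample data.

```python
import pandas as pd

# Shortened versions of the question's sample frames.
df1 = pd.DataFrame({
    'Name': ['AT1G01010;AT1G01010.1;Isoseq::Chr1:3616-5846',
             'AT1G01010;AT1G01010_2;Isoseq::Chr1:3630-5894',
             'AT1G01020;AT1G01020.11;Isoseq::Chr1:6803-8713'],
    'TPM': [2.997776, 1.58178, 4.317311],
})
df2 = pd.DataFrame({
    'gene': ['AT1G01010', 'AT1G01010', 'AT1G01020', 'AT1G01020'],
    'isoform': ['AT1G01010.1', 'AT1G01010_2', 'AT1G01020.1', 'AT1G01020.11'],
})

# Assumption: the isoform is always the second ';'-separated field of Name.
df1['isoform'] = df1['Name'].str.split(';').str[1]

# Exact-match join on the extracted isoform; Name is no longer needed.
df3 = df2.merge(df1.drop(columns='Name'), on='isoform', how='inner')
print(df3)
```

An exact match also avoids a pitfall of substring matching: 'AT1G01020.1' is a substring of 'AT1G01020.11', so a naive partial match would pair the wrong isoform.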

python dataframe to dictionary with multiple columns in keys and values

I am working on an optimization problem and need to create indexing to build a mixed-integer mathematical model. I am using python dictionaries for the task. Below is a sample of my dataset. Full dataset is expected to have about 400K rows if that matters.
# sample input data
pd.DataFrame.from_dict({'origin': {0: 'perris', 1: 'perris', 2: 'perris', 3: 'perris', 4: 'perris'},
'dest': {0: 'alexandria', 1: 'alexandria', 2: 'alexandria', 3: 'alexandria', 4: 'alexandria'},
'product': {0: 'bike', 1: 'bike', 2: 'bike', 3: 'bike', 4: 'bike'},
'lead_time': {0: 4, 1: 4, 2: 4, 3: 4, 4: 4}, 'build_time': {0: 2, 1: 2, 2: 2, 3: 2, 4: 2},
'ship_date': {0: '02/25/2022', 1: '02/26/2022', 2: '02/27/2022', 3: '02/28/2022', 4: '03/01/2022'},
'ship_day': {0: 5, 1: 6, 2: 7, 3: 1, 4: 2},
'truck_in': {0: '03/01/2022', 1: '03/02/2022', 2: '03/03/2022', 3: '03/04/2022', 4: '03/07/2022'},
'product_in': {0: '03/03/2022', 1: '03/04/2022', 2: '03/05/2022', 3: '03/06/2022', 4: '03/09/2022'}})
The data frame looks like this -
I am looking to generate a dictionary from each row of this dataframe where the keys and values are tuples made of multiple column values. The output would look like this -
(origin, dest, product, ship_date): (origin, dest, product, truck_in)
# for example, first two rows will become a dictionary key-value pair like
{('perris', 'alexandria', 'bike', '2/25/2022'): ('perris', 'alexandria', 'bike', '3/1/2022'),
('perris', 'alexandria', 'bike', '2/26/2022'): ('perris', 'alexandria', 'bike', '3/2/2022')}
I am very new to python and couldn't figure out how to do this. Any help is appreciated. Thanks!

You can loop through the DataFrame.
Assuming your DataFrame is called "df" this gives you the dict.
result_dict = {}
for idx, row in df.iterrows():
    result_dict[(row.origin, row.dest, row['product'], row.ship_date)] = (
        row.origin, row.dest, row['product'], row.truck_in)
Since looping through 400k rows will take some time, have a look at tqdm (https://tqdm.github.io/) to get a progress bar with a time estimate that quickly tells you if the approach works for your dataset.
Also, note that 400K dictionary entries may take up a lot of memory so you may try to estimate if the dict fits your memory.
Another way, which uses more memory but is faster, is to do it in pandas.
Create a new column with the value for the dictionary:
df['value'] = df.apply(lambda x: (x.origin, x.dest, x['product'], x.truck_in), axis=1)
Then set the index and convert to dict
df.set_index(['origin','dest','product','ship_date'])['value'].to_dict()
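
As a further alternative to the loop and the apply-based version, the tuples can be built with plain zip over the columns, which avoids both iterrows and the temporary 'value' column. This is only a sketch: it uses two rows of the question's data, and shared is an illustrative name.

```python
import pandas as pd

# Two rows from the question's sample data, reduced to the needed columns.
df = pd.DataFrame({
    'origin': ['perris', 'perris'],
    'dest': ['alexandria', 'alexandria'],
    'product': ['bike', 'bike'],
    'ship_date': ['02/25/2022', '02/26/2022'],
    'truck_in': ['03/01/2022', '03/02/2022'],
})

# Columns that appear in both the key tuple and the value tuple.
shared = [df['origin'], df['dest'], df['product']]

# zip over columns yields one tuple per row.
keys = zip(*shared, df['ship_date'])
values = zip(*shared, df['truck_in'])

result_dict = dict(zip(keys, values))
print(result_dict)
```

As with any dictionary, rows that produce the same key overwrite each other, so the key columns should identify a row uniquely.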

The approach below splits the initial dataframe into two dataframes that will be the source of the keys and values in the dictionary. These are then converted to arrays in order to get away from working with dataframes as soon as possible. The arrays are converted to tuples and zipped together to create the key:value pairs.
import pandas as pd
import numpy as np
df = pd.DataFrame.from_dict(
{'origin': {0: 'perris', 1: 'perris', 2: 'perris', 3: 'perris', 4: 'perris'},
'dest': {0: 'alexandria', 1: 'alexandria', 2: 'alexandria', 3: 'alexandria', 4: 'alexandria'},
'product': {0: 'bike', 1: 'bike', 2: 'bike', 3: 'bike', 4: 'bike'},
'lead_time': {0: 4, 1: 4, 2: 4, 3: 4, 4: 4}, 'build_time': {0: 2, 1: 2, 2: 2, 3: 2, 4: 2},
'ship_date': {0: '02/25/2022', 1: '02/26/2022', 2: '02/27/2022', 3: '02/28/2022', 4: '03/01/2022'},
'ship_day': {0: 5, 1: 6, 2: 7, 3: 1, 4: 2},
'truck_in': {0: '03/01/2022', 1: '03/02/2022', 2: '03/03/2022', 3: '03/04/2022', 4: '03/07/2022'},
'product_in': {0: '03/03/2022', 1: '03/04/2022', 2: '03/05/2022', 3: '03/06/2022', 4: '03/09/2022'}}
)
#display(df)
#desired output: (origin, dest, product, ship_date): (origin, dest, product, truck_in)
# Slice df into key and value columns and convert each slice to a record array.
# index=False keeps the row index out of the records, so 'origin' does not have
# to be set as the index first (this also avoids a SettingWithCopyWarning).
keys_array = df[['origin', 'dest', 'product', 'ship_date']].to_records(index=False)
values_array = df[['origin', 'dest', 'product', 'truck_in']].to_records(index=False)
# Convert every record to a plain tuple.
keys_tuple = tuple(map(tuple, keys_array))
values_tuple = tuple(map(tuple, values_array))
# Pair each key with its value and build the dictionary.
dict2 = dict(zip(keys_tuple, values_tuple))
print(dict2)

Convert dataframe to dictionary but not to take column name as keys in python

I have the DataFrame given below. I want to convert it into a dictionary, but I don't want the column names as keys.
data = {'0':[0.039169993,0.023344912], '1':[0.17865846,0.01093025],'2':[0.039170124,0.023344917], '3':[0.17865846,0.01093025],'4':[0.039170124,0.023344917]}
df= pd.DataFrame(data)
0.0 1.0 2.0 3.0 4.0
0 0.039169993 0.17865846 0.039170124 0.17865846 0.039170124
1 0.023344912 0.01093025 0.023344917 0.01093025 0.023344917
**Desired Result**:
{{0: 0.039169993, 1:0.023344912},
{0: 0.17865846, 1:0.01093025},
{0: 0.039170124, 1:0.023344917},
{0: 0.17865846, 1:0.01093025},
{0:0.039170124, 1:0.023344917}}
MyAttempt:
df.to_dict()
{'0': {0: 0.039169993, 1: 0.023344912},
'1': {0: 0.17865846, 1: 0.01093025},
'2': {0: 0.039170124, 1: 0.023344917},
'3': {0: 0.17865846, 1: 0.01093025},
'4': {0: 0.039170124, 1: 0.023344917}}
I don't want the column names as keys. Is this possible?

You can use transpose (or .T) followed by .to_dict(orient='records') to obtain the desired output:
df.T.to_dict(orient='records')

The desired result has the format of a set of dictionaries, but you cannot have a set of dictionaries because dictionaries are not hashable. You could, however, use a list.
import pandas as pd
data = {'0': [0.039169993, 0.023344912], '1': [0.17865846, 0.01093025], '2': [0.039170124, 0.023344917],
'3': [0.17865846, 0.01093025], '4': [0.039170124, 0.023344917]}
df = pd.DataFrame(data)
result = list(df.to_dict().values())
print(result)
Output (the order of the dictionaries may differ; from Python 3.7 on they follow the column order):
[{0: 0.039170124, 1: 0.023344917}, {0: 0.039169993, 1: 0.023344912}, {0: 0.17865846, 1: 0.01093025}, {0: 0.17865846, 1: 0.01093025}, {0: 0.039170124, 1: 0.023344917}]
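
The hashability point can be checked directly. The sketch below uses a two-column version of the question's data (an assumption made to keep it short) and shows that building a set from the dictionaries fails with a TypeError, while the list works.

```python
import pandas as pd

# Two-column version of the question's data, to keep the example short.
df = pd.DataFrame({'0': [0.039169993, 0.023344912],
                   '1': [0.17865846, 0.01093025]})

# One dict per column, collected in a list.
column_dicts = list(df.to_dict().values())

# A set needs hashable elements, and dicts are not hashable.
error = None
try:
    set(column_dicts)
except TypeError as exc:
    error = str(exc)
    print('set() failed:', error)

print(column_dicts)
```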

You can use this:
df.T.to_dict(orient='records')
[{0: 0.039169993, 1: 0.023344911999999999},
{0: 0.17865845999999999, 1: 0.010930250000000001},
{0: 0.039170124000000001, 1: 0.023344917},
{0: 0.17865845999999999, 1: 0.010930250000000001},
{0: 0.039170124000000001, 1: 0.023344917}]

Apply function across pandas dataframe columns

This seems to have been similarly answered, but I can't get it to work.
I have a pandas DataFrame that looks like sig_vars below. This df has a VAF and a Background column. I would like to use the ztest function from statsmodels to assign a p-value to a new p-value column.
The p-value is calculated something like this for each row:
from statsmodels.stats.weightstats import ztest
p_value = ztest(sig_vars.Background,value=sig_vars.VAF)[1]
I have tried something like this, but I can't quite get it to work:
def calc(x):
    return ztest(x.Background, value=x.VAF.astype(float))[1]
sig_vars.dropna().assign(pval = lambda x: calc(x)).head()
It seems strange to me that this works just fine however:
def calc(x):
    return ztest([0.0001,0.0002,0.0001], value=x.VAF.astype(float))[1]
sig_vars.dropna().assign(pval = lambda x: calc(x)).head()
Here is my DataFrame sig_vars:
import pandas as pd
from numpy import nan
sig_vars = pd.DataFrame({'AO': {0: 4.0, 1: 16.0, 2: 12.0, 3: 19.0, 4: 2.0},
'Background': {0: nan,
1: [0.00018832391713747646, 0.0002114408734430263, 0.000247843759294141],
2: nan,
3: [0.00023965141612200435,
0.00018864365214110544,
0.00036566589684372596,
0.0005452562704471102],
4: [0.00017349063150589867]},
'Change': {0: 'T>A', 1: 'T>C', 2: 'T>A', 3: 'T>C', 4: 'C>A'},
'Chrom': {0: 'chr1', 1: 'chr1', 2: 'chr1', 3: 'chr1', 4: 'chr1'},
'ConvChange': {0: 'T>A', 1: 'T>C', 2: 'T>A', 3: 'T>C', 4: 'C>A'},
'DP': {0: 16945.0, 1: 16945.0, 2: 16969.0, 3: 16969.0, 4: 16969.0},
'Downstream': {0: 'NaN', 1: 'NaN', 2: 'NaN', 3: 'NaN', 4: 'NaN'},
'Gene': {0: 'TIIIa', 1: 'TIIIa', 2: 'TIIIa', 3: 'TIIIa', 4: 'TIIIa'},
'ID': {0: '86.fastq/onlyProbedRegions.vcf',
1: '86.fastq/onlyProbedRegions.vcf',
2: '86.fastq/onlyProbedRegions.vcf',
3: '86.fastq/onlyProbedRegions.vcf',
4: '86.fastq/onlyProbedRegions.vcf'},
'Individual': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1},
'IntEx': {0: 'TIII', 1: 'TIII', 2: 'TIII', 3: 'TIII', 4: 'TIII'},
'Loc': {0: 115227854, 1: 115227854, 2: 115227855, 3: 115227855, 4: 115227856},
'Upstream': {0: 'NaN', 1: 'NaN', 2: 'NaN', 3: 'NaN', 4: 'NaN'},
'VAF': {0: 0.00023605783416937148,
1: 0.0009442313366774859,
2: 0.0007071719017031057,
3: 0.0011196888443632507,
4: 0.00011786198361718427},
'Var': {0: 'A', 1: 'C', 2: 'A', 3: 'C', 4: 'A'},
'WT': {0: 'T', 1: 'T', 2: 'T', 3: 'T', 4: 'C'}})

Try this:
def calc(x):
    return ztest(x['Background'], value=float(x['VAF']))[1]
sig_vars['pval'] = sig_vars.dropna().apply(calc, axis=1)
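
The difference from the attempt in the question is what the function receives. A callable passed to assign gets the whole DataFrame, so x.Background is the entire column of lists; apply with axis=1 calls the function once per row, so x['Background'] is a single list. The sketch below illustrates this with made-up numbers and a hypothetical row_stat function that stands in for ztest(...)[1], so that it runs without statsmodels.

```python
import statistics

import pandas as pd

# Made-up data: one list of background values per row, one row missing.
df = pd.DataFrame({
    'Background': [[1.0, 2.0, 3.0], None, [4.0, 6.0]],
    'VAF': [2.5, 1.0, 5.0],
})

def row_stat(row):
    # Hypothetical stand-in for ztest(...)[1]: distance between VAF and
    # the mean of this row's Background list.
    return abs(statistics.mean(row['Background']) - float(row['VAF']))

# axis=1 passes one row at a time; rows removed by dropna() end up as NaN
# in the new column because the assignment aligns on the index.
df['stat'] = df.dropna().apply(row_stat, axis=1)
print(df)
```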
