I have 2 dataframes, something like this:
data1 = pd.DataFrame({'transaction_id': {0: 'abc', 1: 'bcd', 2: 'efg'},
                      'store_number': {0: '1048', 1: '1048', 2: '1048'},
                      'activity_code': {0: 'deposit-check',
                                        1: 'deposit-check',
                                        2: 'deposit-check'},
                      'amount': {0: 10, 1: 11, 2: 12}})
data2 = pd.DataFrame({'transaction_id': {0: 'pqr', 1: 'qrs', 2: 'rst'},
                      'store_number': {0: '1048', 1: '1048', 2: '1048'},
                      'activity_code': {0: 'deposit-check',
                                        1: 'deposit-check',
                                        2: 'deposit-check'},
                      'amount': {0: 100, 1: 200, 2: 300}})
with more rows.
I want to take multiple subsets from each dataset and do a comparison of the total amount in each.
For example, take out 2 rows from data1 and data2:
data1_subset1 = pd.DataFrame({'transaction_id': {0: 'abc', 1: 'bcd'},
                              'store_number': {0: '1048', 1: '1048'},
                              'activity_code': {0: 'deposit-check',
                                                1: 'deposit-check'},
                              'amount': {0: 10, 1: 11}})
data1_subset2 = pd.DataFrame({'transaction_id': {0: 'abc', 2: 'efg'},
                              'store_number': {0: '1048', 2: '1048'},
                              'activity_code': {0: 'deposit-check',
                                                2: 'deposit-check'},
                              'amount': {0: 10, 2: 12}})
and so on till I have all possible 2 row combinations of data1.
data2_subset1 = pd.DataFrame({'transaction_id': {0: 'pqr', 1: 'qrs'},
                              'store_number': {0: '1048', 1: '1048'},
                              'activity_code': {0: 'deposit-check',
                                                1: 'deposit-check'},
                              'amount': {0: 100, 1: 200}})
data2_subset2 = pd.DataFrame({'transaction_id': {0: 'pqr', 2: 'rst'},
                              'store_number': {0: '1048', 2: '1048'},
                              'activity_code': {0: 'deposit-check',
                                                2: 'deposit-check'},
                              'amount': {0: 100, 2: 300}})
and so on till I have all possible 2 row combinations of data2.
Now, for each pair of subsets, say data1_subset1 vs data2_subset1, I would like to check whether the store_number and activity_code match (using an inner join) and then compute the difference between the total amount of data1_subset1 and data2_subset1.
Further, I would also like to extend this to all possible size combinations. In the example above we compared all 2-row combinations, but I would like to extend this to 2-row combinations vs 3-row combinations, 2 rows vs 4, 3 vs 5, and so on until all the possibilities are checked.
Is there an efficient way of doing this in Python / pandas? The first approach I had in mind was just a nested loop using indexes.
Use itertools.combinations:
from itertools import combinations
for comb in combinations(data1.index, r=2):
    print(f'combination {comb}')
    print(data1.loc[list(comb)])
As a function:
def subset(df, r=2):
    for comb in combinations(df.index, r=r):
        yield df.loc[list(comb)]

for df in subset(data1, r=2):
    print(df)
output:
combination (0, 1)
transaction_id store_number activity_code amount
0 abc 1048 deposit-check 10
1 bcd 1048 deposit-check 11
combination (0, 2)
transaction_id store_number activity_code amount
0 abc 1048 deposit-check 10
2 efg 1048 deposit-check 12
combination (1, 2)
transaction_id store_number activity_code amount
1 bcd 1048 deposit-check 11
2 efg 1048 deposit-check 12
If you want more rows in each combination, change the r=2 parameter to the number of rows you want.
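To get all the way to the comparison described in the question, a minimal sketch (building on the subset generator above, and assuming data1 and data2 are the full frames from the question) could look like this. Note that the number of subset pairs grows combinatorially, so this is only practical for small frames:

from itertools import combinations
import pandas as pd

def subsets(df, r):
    """Yield every r-row subset of df as a DataFrame."""
    for comb in combinations(df.index, r=r):
        yield df.loc[list(comb)]

results = []
# compare every r1-row subset of data1 with every r2-row subset of data2
for r1 in range(2, len(data1) + 1):
    for r2 in range(2, len(data2) + 1):
        for s1 in subsets(data1, r1):
            for s2 in subsets(data2, r2):
                # inner join on the key columns; skip pairs that share no keys
                if s1.merge(s2, on=['store_number', 'activity_code']).empty:
                    continue
                results.append({
                    'data1_rows': tuple(s1.index),
                    'data2_rows': tuple(s2.index),
                    'amount_diff': s1['amount'].sum() - s2['amount'].sum(),
                })

comparison = pd.DataFrame(results)
print(comparison)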
Related
I am using Python and I am trying to convert this to a DataFrame, but the lengths of the dictionaries are different. Do you have any ideas? The number of keys (0-6 in total) present differs in each row.
0 {1: 0.14428478, 3: 0.3088169, 5: 0.54362816}
1 {0: 0.41822478, 2: 0.081520624, 3: 0.40189278,...
2 {3: 0.9927109}
3 {0: 0.07826376, 3: 0.9162877}
4 {0: 0.022929467, 1: 0.0127365505, 2: 0.8355256...
...
59834 {1: 0.93473625, 5: 0.055679787}
59835 {1: 0.72145665, 3: 0.022041071, 5: 0.25396}
59836 {0: 0.01922486, 1: 0.019249884, 2: 0.5345934, ...
59837 {0: 0.014184893, 1: 0.23436697, 2: 0.58155864,...
59838 {0: 0.013977169, 1: 0.24653174, 2: 0.60093427,...
I would like to get the Python code for this.
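A minimal sketch of one way to do this, assuming the dicts shown above live in a pandas Series named s (a hypothetical name; it could equally be a column of an existing DataFrame):

import pandas as pd

# a few of the dicts from the question, stored in a Series
s = pd.Series([
    {1: 0.14428478, 3: 0.3088169, 5: 0.54362816},
    {3: 0.9927109},
    {0: 0.07826376, 3: 0.9162877},
])

# each dict becomes one row; the union of keys (0-6) becomes the columns,
# with NaN where a row has no value for that key
df = pd.DataFrame(s.tolist(), index=s.index).sort_index(axis=1)
print(df)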
I am working on an optimization problem and need to create indexing to build a mixed-integer mathematical model. I am using python dictionaries for the task. Below is a sample of my dataset. Full dataset is expected to have about 400K rows if that matters.
# sample input data
import pandas as pd

df = pd.DataFrame.from_dict({'origin': {0: 'perris', 1: 'perris', 2: 'perris', 3: 'perris', 4: 'perris'},
'dest': {0: 'alexandria', 1: 'alexandria', 2: 'alexandria', 3: 'alexandria', 4: 'alexandria'},
'product': {0: 'bike', 1: 'bike', 2: 'bike', 3: 'bike', 4: 'bike'},
'lead_time': {0: 4, 1: 4, 2: 4, 3: 4, 4: 4}, 'build_time': {0: 2, 1: 2, 2: 2, 3: 2, 4: 2},
'ship_date': {0: '02/25/2022', 1: '02/26/2022', 2: '02/27/2022', 3: '02/28/2022', 4: '03/01/2022'},
'ship_day': {0: 5, 1: 6, 2: 7, 3: 1, 4: 2},
'truck_in': {0: '03/01/2022', 1: '03/02/2022', 2: '03/03/2022', 3: '03/04/2022', 4: '03/07/2022'},
'product_in': {0: '03/03/2022', 1: '03/04/2022', 2: '03/05/2022', 3: '03/06/2022', 4: '03/09/2022'}})
I am looking to generate a dictionary from each row of this dataframe where the keys and values are tuples made of multiple column values. The output would look like this -
(origin, dest, product, ship_date): (origin, dest, product, truck_in)
# for example, first two rows will become a dictionary key-value pair like
{('perris', 'alexandria', 'bike', '2/25/2022'): ('perris', 'alexandria', 'bike', '3/1/2022'),
('perris', 'alexandria', 'bike', '2/26/2022'): ('perris', 'alexandria', 'bike', '3/2/2022')}
I am very new to python and couldn't figure out how to do this. Any help is appreciated. Thanks!
You can loop through the DataFrame.
Assuming your DataFrame is called "df" this gives you the dict.
result_dict = {}
for idx, row in df.iterrows():
    result_dict[(row.origin, row.dest, row['product'], row.ship_date)] = (
        row.origin, row.dest, row['product'], row.truck_in)
Since looping through 400k rows will take some time, have a look at tqdm (https://tqdm.github.io/) to get a progress bar with a time estimate that quickly tells you if the approach works for your dataset.
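For example, wrapping the loop above in tqdm could look like this (a sketch; tqdm needs to be installed separately, e.g. with pip install tqdm):

from tqdm import tqdm

result_dict = {}
for idx, row in tqdm(df.iterrows(), total=len(df)):
    result_dict[(row.origin, row.dest, row['product'], row.ship_date)] = (
        row.origin, row.dest, row['product'], row.truck_in)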
Also, note that 400K dictionary entries may take up a lot of memory, so you may want to estimate whether the dict fits in memory.
Another way, which wastes memory but is faster, is to do it in pandas.
Create a new column with the value for the dictionary
df['value'] = df.apply(lambda x: (x.origin, x.dest, x['product'], x.truck_in), axis=1)
Then set the index and convert to dict
df.set_index(['origin','dest','product','ship_date'])['value'].to_dict()
The approach below splits the initial dataframe into two dataframes that will be the source of the keys and values in the dictionary. These are then converted to arrays in order to get away from working with dataframes as soon as possible. The arrays are converted to tuples and zipped together to create the key:value pairs.
import pandas as pd
import numpy as np
df = pd.DataFrame.from_dict(
{'origin': {0: 'perris', 1: 'perris', 2: 'perris', 3: 'perris', 4: 'perris'},
'dest': {0: 'alexandria', 1: 'alexandria', 2: 'alexandria', 3: 'alexandria', 4: 'alexandria'},
'product': {0: 'bike', 1: 'bike', 2: 'bike', 3: 'bike', 4: 'bike'},
'lead_time': {0: 4, 1: 4, 2: 4, 3: 4, 4: 4}, 'build_time': {0: 2, 1: 2, 2: 2, 3: 2, 4: 2},
'ship_date': {0: '02/25/2022', 1: '02/26/2022', 2: '02/27/2022', 3: '02/28/2022', 4: '03/01/2022'},
'ship_day': {0: 5, 1: 6, 2: 7, 3: 1, 4: 2},
'truck_in': {0: '03/01/2022', 1: '03/02/2022', 2: '03/03/2022', 3: '03/04/2022', 4: '03/07/2022'},
'product_in': {0: '03/03/2022', 1: '03/04/2022', 2: '03/05/2022', 3: '03/06/2022', 4: '03/09/2022'}}
)
#display(df)
#desired output: (origin, dest, product, ship_date): (origin, dest, product, truck_in)
#slice df to key/value chunks
#list to array
ship = df[['origin','dest', 'product', 'ship_date']]
ship.set_index('origin', inplace = True)
keys_array=ship.to_records()
truck = df[['origin', 'dest', 'product', 'truck_in']]
truck.set_index('origin', inplace = True)
values_array = truck.to_records()
#array_of_tuples = map(tuple, an_array)
keys_map = map(tuple, keys_array)
values_map = map(tuple, values_array)
#tuple_of_tuples = tuple(array_of_tuples)
keys_tuple = tuple(keys_map)
values_tuple = tuple(values_map)
zipp = zip(keys_tuple, values_tuple)
dict2 = dict(zipp)
print(dict2)
I have a dataset like this:
{'SYMBOL': {0: 'BAF180', 1: 'ACTL6A', 2: 'DMAP1', 3: 'C1orf149', 4: 'YEATS4'}, 'Gene Name(s)': {0: ';PB1;BAF180;MGC156155;MGC156156;PBRM1;', 1: ';ACTL6A;ACTL6;BAF53A;MGC5382;', 2: ';DMAP1;DKFZp686L09142;DNMAP1;DNMTAP1;FLJ11543;KIAA1425;EAF2;SWC4;', 3: ';FLJ11730;CDABP0189;C1orf149;NY-SAR-91;RP3-423B22.2;Eaf6;', 4: ';YEATS4;4930573H17Rik;B230215M10Rik;GAS41;NUBI-1;YAF9;'}, 'Description': {0: 'polybromo 1', 1: 'BAF complex 53 kDa subunit|BAF53|BRG1-associated factor|actin-related protein|hArpN beta; actin-like 6A', 2: 'DNA methyltransferase 1 associated protein 1; DNMT1 associated protein 1', 3: 'hypothetical protein LOC64769|sarcoma antigen NY-SAR-91; chromosome 1 open reading frame 149', 4: 'NuMA binding protein 1|glioma-amplified sequence-41; YEATS domain containing 4'}, 'G.O. PROCESS': {0: 'Transcription', 1: 'Transcription', 2: 'Transcription', 3: 'Transcription', 4: 'Transcription'}, 'TurboSEQUESTScore': {0: 70.29, 1: 80.29, 2: 34.18, 3: 30.32, 4: 40.18}, 'Coverage %': {0: 6.7, 1: 28.0, 2: 10.7, 3: 24.2, 4: 21.1}, 'KD': {0: 183572.3, 1: 47430.4, 2: 52959.9, 3: 21501.9, 4: 26482.7}, 'Genebank Accession no': {0: 30794372, 1: 4757718, 2: 13123776, 3: 29164895, 4: 5729838}, 'MS/MS Peptide no.': {0: '9 (9 0 0 0 0)', 1: '9 (9 0 0 0 0)', 2: '4 (3 0 0 1 0)', 3: '3 (3 0 0 0 0)', 4: '4 (4 0 0 0 0)'}}
I want to detect and remove outliers in the column TurboSEQUESTScore, using 3 standard deviations as the threshold for outliers. How can I go about it? This is what I have tried.
The name of the dataframe is rename_df.
import numpy as np
from scipy import stats

z_scores = stats.zscore(rename_df['TurboSEQUESTScore'])
abs_z_scores = np.abs(z_scores)
filtered_entries = (abs_z_scores < 3).all(axis=None)
I can't seem to solve this properly.
You were approaching it correctly; you just needed to pass the boolean mask abs_z_scores < 3 to your dataframe, i.e. rename_df[abs_z_scores < 3], to get the desired dataframe, and then store it in any variable of your choice.
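Spelled out step by step, that fix to your snippet would look something like this (a sketch keeping your variable names):

import numpy as np
from scipy import stats

z_scores = stats.zscore(rename_df['TurboSEQUESTScore'])
abs_z_scores = np.abs(z_scores)
# boolean mask keeps rows within 3 standard deviations of the mean
filtered_rename_df = rename_df[abs_z_scores < 3]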
This will do the job in one line and is more readable:
import numpy as np
from scipy import stats
filtered_rename_df = rename_df[(np.abs(stats.zscore(rename_df["TurboSEQUESTScore"])) < 3)]
You'll get a new dataframe named filtered_rename_df with the outliers (absolute z-score >= 3) removed.
My code is below. I apply pd.to_numeric to the columns that are supposed to be int or float but come in as object. Can this be written in a more pandas-like way, for example by applying np.where?
if df.dtypes.all() == 'object':
    df = df.apply(pd.to_numeric, errors='coerce').fillna(df)
else:
    df = df
A simple one-liner is assign with select_dtypes, which will reassign the existing columns:
df.assign(**df.select_dtypes('O').apply(pd.to_numeric,errors='coerce').fillna(df))
np.where:
df[:] = np.where(df.dtypes == 'object',
                 df.apply(pd.to_numeric, errors='coerce').fillna(df), df)
Example (check the Price column):
d = {'CusID': {0: 1, 1: 2, 2: 3},
'Name': {0: 'Paul', 1: 'Mark', 2: 'Bill'},
'Shop': {0: 'Pascal', 1: 'Casio', 2: 'Nike'},
'Price': {0: '24000', 1: 'a', 2: '900'}}
df = pd.DataFrame(d)
print(df)
CusID Name Shop Price
0 1 Paul Pascal 24000
1 2 Mark Casio a
2 3 Bill Nike 900
df.to_dict()
{'CusID': {0: 1, 1: 2, 2: 3},
'Name': {0: 'Paul', 1: 'Mark', 2: 'Bill'},
'Shop': {0: 'Pascal', 1: 'Casio', 2: 'Nike'},
'Price': {0: '24000', 1: 'a', 2: '900'}}
(df.assign(**df.select_dtypes('O').apply(pd.to_numeric,errors='coerce')
.fillna(df)).to_dict())
{'CusID': {0: 1, 1: 2, 2: 3},
'Name': {0: 'Paul', 1: 'Mark', 2: 'Bill'},
'Shop': {0: 'Pascal', 1: 'Casio', 2: 'Nike'},
'Price': {0: 24000.0, 1: 'a', 2: 900.0}}
The equivalent of your if/else is df.mask:
df_out = df.mask(df.dtypes == 'O',
                 df.apply(pd.to_numeric, errors='coerce').fillna(df))
I would like to transform the below dataframe to concatenate duplicate data into a single row. For example:
data_dict={'FromTo_U': {0: 'L->R', 1: 'L->R', 2: 'S->I'},
'GeneName': {0: 'EGFR', 1: 'EGFR', 2: 'EGFR'},
'MutationAA_C': {0: 'p.L858R', 1: 'p.L858R', 2: 'p.S768I'},
'MutationDescription': {0: 'Substitution - Missense',
1: 'Substitution - Missense',
2: 'Substitution - Missense'},
'PubMed': {0: '22523351', 1: '23915069', 2: '26862733'},
'VariantID': {0: 'COSM12979', 1: 'COSM12979', 2: 'COSM18486'},
'VariantPos_U': {0: '858', 1: '858', 2: '768'},
'VariantSource': {0: 'COSMIC', 1: 'COSMIC', 2: 'COSMIC'}}
df1=pd.DataFrame(data_dict)
The transformed dataframe should be:
data_dict_t={'FromTo_U': {0: 'L->R', 2: 'S->I'},
'GeneName': {0: 'EGFR', 2: 'EGFR'},
'MutationAA_C': {0: 'p.L858R', 2: 'p.S768I'},
'MutationDescription': {0: 'Substitution - Missense',2: 'Substitution - Missense'},
'PubMed': {0: '22523351,23915069', 2: '26862733'},
'VariantID': {0: 'COSM12979', 2: 'COSM18486'},
'VariantPos_U': {0: '858', 2: '768'},
'VariantSource': {0: 'COSMIC', 2: 'COSMIC'}}
I want to merge two rows of df1 only if the PubMed IDs are different and the rest of the columns have the same data. Thanks in advance!
Use groupby + agg with str.join as the aggfunc.
c = df1.columns.difference(['PubMed']).tolist()
df1.groupby(c, as_index=False).PubMed.agg(','.join)
FromTo_U GeneName MutationAA_C MutationDescription VariantID \
0 L->R EGFR p.L858R Substitution - Missense COSM12979
1 S->I EGFR p.S768I Substitution - Missense COSM18486
VariantPos_U VariantSource PubMed
0 858 COSMIC 22523351,23915069
1 768 COSMIC 26862733
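One caveat on the "only if PubMed IDs are different" condition: the groupby above joins whatever PubMed values fall in a group, so identical IDs would be repeated in the joined string. If that can occur in your data, a possible variant (an assumption on my part, not part of the original answer) is to drop duplicates inside the aggregation:

c = df1.columns.difference(['PubMed']).tolist()
# join only the distinct PubMed IDs within each group
out = df1.groupby(c, as_index=False).PubMed.agg(
    lambda s: ','.join(s.drop_duplicates()))
print(out)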