How to detect and remove outliers in a dataframe - Python

I have a dataset like this:
{'SYMBOL': {0: 'BAF180', 1: 'ACTL6A', 2: 'DMAP1', 3: 'C1orf149', 4: 'YEATS4'}, 'Gene Name(s)': {0: ';PB1;BAF180;MGC156155;MGC156156;PBRM1;', 1: ';ACTL6A;ACTL6;BAF53A;MGC5382;', 2: ';DMAP1;DKFZp686L09142;DNMAP1;DNMTAP1;FLJ11543;KIAA1425;EAF2;SWC4;', 3: ';FLJ11730;CDABP0189;C1orf149;NY-SAR-91;RP3-423B22.2;Eaf6;', 4: ';YEATS4;4930573H17Rik;B230215M10Rik;GAS41;NUBI-1;YAF9;'}, 'Description': {0: 'polybromo 1', 1: 'BAF complex 53 kDa subunit|BAF53|BRG1-associated factor|actin-related protein|hArpN beta; actin-like 6A', 2: 'DNA methyltransferase 1 associated protein 1; DNMT1 associated protein 1', 3: 'hypothetical protein LOC64769|sarcoma antigen NY-SAR-91; chromosome 1 open reading frame 149', 4: 'NuMA binding protein 1|glioma-amplified sequence-41; YEATS domain containing 4'}, 'G.O. PROCESS': {0: 'Transcription', 1: 'Transcription', 2: 'Transcription', 3: 'Transcription', 4: 'Transcription'}, 'TurboSEQUESTScore': {0: 70.29, 1: 80.29, 2: 34.18, 3: 30.32, 4: 40.18}, 'Coverage %': {0: 6.7, 1: 28.0, 2: 10.7, 3: 24.2, 4: 21.1}, 'KD': {0: 183572.3, 1: 47430.4, 2: 52959.9, 3: 21501.9, 4: 26482.7}, 'Genebank Accession no': {0: 30794372, 1: 4757718, 2: 13123776, 3: 29164895, 4: 5729838}, 'MS/MS Peptide no.': {0: '9 (9 0 0 0 0)', 1: '9 (9 0 0 0 0)', 2: '4 (3 0 0 1 0)', 3: '3 (3 0 0 0 0)', 4: '4 (4 0 0 0 0)'}}
I want to detect and remove outliers in the TurboSEQUESTScore column, using 3 standard deviations as the threshold for outliers. How can I go about it? This is what I have tried.
The dataframe is named rename_df:
z_scores = stats.zscore(rename_df['TurboSEQUESTScore'])
abs_z_scores = np.abs(z_scores)
filtered_entries = (abs_z_scores < 3).all(axis=None)
I can't seem to solve this properly.

You were approaching it correctly; you just needed to pass the boolean mask abs_z_scores < 3 to your dataframe, i.e., rename_df[abs_z_scores < 3], to get the desired dataframe, and then store it in a variable of your choice. (Your call to .all(axis=None) collapsed the whole mask into a single boolean, which is why nothing got filtered.)
This will do the job in one line and is more readable:
import numpy as np
from scipy import stats
filtered_rename_df = rename_df[(np.abs(stats.zscore(rename_df["TurboSEQUESTScore"])) < 3)]
You'll get a new dataframe named filtered_rename_df with the outliers removed, i.e., only the rows whose absolute z-score is below 3.
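If you prefer to phrase the threshold directly as "3 times the standard deviation", as worded in the question, a minimal equivalent sketch (note that scipy.stats.zscore uses ddof=0 by default, while pandas' .std() defaults to ddof=1, hence the explicit argument):
col = rename_df['TurboSEQUESTScore']
mean, std = col.mean(), col.std(ddof=0)  # ddof=0 matches scipy.stats.zscore
# keep only the rows within mean ± 3 standard deviations
filtered_rename_df = rename_df[(col - mean).abs() < 3 * std]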

Related

How to make this into a data frame?

I am using Python and I am trying to turn this into a dataframe, but the dictionaries have different lengths.
Do you have any ideas? The number of keys (0-6 in total) present differs from row to row.
0 {1: 0.14428478, 3: 0.3088169, 5: 0.54362816}
1 {0: 0.41822478, 2: 0.081520624, 3: 0.40189278,...
2 {3: 0.9927109}
3 {0: 0.07826376, 3: 0.9162877}
4 {0: 0.022929467, 1: 0.0127365505, 2: 0.8355256...
...
59834 {1: 0.93473625, 5: 0.055679787}
59835 {1: 0.72145665, 3: 0.022041071, 5: 0.25396}
59836 {0: 0.01922486, 1: 0.019249884, 2: 0.5345934, ...
59837 {0: 0.014184893, 1: 0.23436697, 2: 0.58155864,...
59838 {0: 0.013977169, 1: 0.24653174, 2: 0.60093427,...
I would appreciate any Python code for this.
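A minimal sketch, assuming the data shown above is a pandas Series of dicts (called s here, a hypothetical name): the DataFrame constructor aligns the dicts on their keys, so keys absent from a row simply become NaN.
import pandas as pd

# hypothetical reconstruction of the first rows shown above
s = pd.Series([
    {1: 0.14428478, 3: 0.3088169, 5: 0.54362816},
    {0: 0.41822478, 2: 0.081520624, 3: 0.40189278},
    {3: 0.9927109},
])

# columns are the union of all keys; missing keys become NaN
df = pd.DataFrame(s.tolist(), index=s.index)
print(df)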

How to form multiple subsets of dataframes and compare and contrast

I have 2 dataframes, something like this:
import pandas as pd

data1 = pd.DataFrame({'transaction_id': {0: 'abc', 1: 'bcd', 2: 'efg'},
                      'store_number': {0: '1048', 1: '1048', 2: '1048'},
                      'activity_code': {0: 'deposit-check',
                                        1: 'deposit-check',
                                        2: 'deposit-check'},
                      'amount': {0: 10, 1: 11, 2: 12}})
data2 = pd.DataFrame({'transaction_id': {0: 'pqr', 1: 'qrs', 2: 'rst'},
                      'store_number': {0: '1048', 1: '1048', 2: '1048'},
                      'activity_code': {0: 'deposit-check',
                                        1: 'deposit-check',
                                        2: 'deposit-check'},
                      'amount': {0: 100, 1: 200, 2: 300}})
with more rows.
I want to take multiple subsets from each dataset and do a comparison of the total amount in each.
For example, take out 2 rows from data1 and data2:
data1_subset1 = pd.DataFrame({'transaction_id': {0: 'abc', 1: 'bcd'},
                              'store_number': {0: '1048', 1: '1048'},
                              'activity_code': {0: 'deposit-check',
                                                1: 'deposit-check'},
                              'amount': {0: 10, 1: 11}})
data1_subset2 = pd.DataFrame({'transaction_id': {0: 'abc', 2: 'efg'},
                              'store_number': {0: '1048', 2: '1048'},
                              'activity_code': {0: 'deposit-check',
                                                2: 'deposit-check'},
                              'amount': {0: 10, 2: 12}})
and so on till I have all possible 2 row combinations of data1.
data2_subset1 = pd.DataFrame({'transaction_id': {0: 'pqr', 1: 'qrs'},
                              'store_number': {0: '1048', 1: '1048'},
                              'activity_code': {0: 'deposit-check',
                                                1: 'deposit-check'},
                              'amount': {0: 100, 1: 200}})
data2_subset2 = pd.DataFrame({'transaction_id': {0: 'pqr', 2: 'rst'},
                              'store_number': {0: '1048', 2: '1048'},
                              'activity_code': {0: 'deposit-check',
                                                2: 'deposit-check'},
                              'amount': {0: 100, 2: 300}})
and so on till I have all possible 2 row combinations of data2.
Now for each pair of these subsets, say data1_subset1 vs data2_subset1, I would like to check whether store_number and activity_code match using an inner join, and then compute the difference between the total amount of data1_subset1 and that of data2_subset1.
Further, I would like to extend this to all possible size combinations. In the above example we compared all 2-row combinations, but I would also like to compare 2-row combinations vs 3-row combinations, 2 rows vs 4, 3 vs 5, and so on, until all the possibilities are checked.
Is there an efficient way of doing this in Python/pandas? The first approach I had in mind was just a nested loop over indexes.
Use itertools.combinations:
from itertools import combinations

for comb in combinations(data1.index, r=2):
    print(f'combination {comb}')
    print(data1.loc[list(comb)])
As a function:
def subset(df, r=2):
    for comb in combinations(df.index, r=r):
        yield df.loc[list(comb)]

for df in subset(data1, r=2):
    print(df)
output:
combination (0, 1)
transaction_id store_number activity_code amount
0 abc 1048 deposit-check 10
1 bcd 1048 deposit-check 11
combination (0, 2)
transaction_id store_number activity_code amount
0 abc 1048 deposit-check 10
2 efg 1048 deposit-check 12
combination (1, 2)
transaction_id store_number activity_code amount
1 bcd 1048 deposit-check 11
2 efg 1048 deposit-check 12
If you want more rows per combination, change the r=2 parameter to the desired number of rows.
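That covers generating the subsets. For the comparison step the question describes, a hedged sketch (compare_subsets is a hypothetical helper: it inner-joins each pair of subsets on store_number and activity_code, as the question specifies, and reports the difference of their amount totals):
from itertools import combinations

def compare_subsets(df1, df2, r1=2, r2=2):
    for idx1 in combinations(df1.index, r=r1):
        s1 = df1.loc[list(idx1)]
        for idx2 in combinations(df2.index, r=r2):
            s2 = df2.loc[list(idx2)]
            # inner join on the key columns; an empty result means no match
            if not s1.merge(s2, on=['store_number', 'activity_code']).empty:
                yield idx1, idx2, s1['amount'].sum() - s2['amount'].sum()

for idx1, idx2, diff in compare_subsets(data1, data2, r1=2, r2=2):
    print(idx1, idx2, diff)
Be aware that the number of subset pairs grows combinatorially with the frame sizes and the r values, so this brute-force approach is only practical for small frames.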

Cannot manipulate Dataframe to calculate zscore using simple looping

I'm following a DataCamp course on "efficient data manipulation" in pandas. In their videos, by way of example, they demonstrate the naive method of looping over the dataframe to calculate the z-score.
I have found this particular course strange, with what seem to be errors in the code; I'm wondering if it was written for an older version of Python, but it is more likely just me not getting it.
The Dataframe is basically something like this:
df = pd.DataFrame({'total_bill': {0: 16.99, 1: 10.34, 2: 21.01, 3: 23.68, 4: 24.59, 5: 25.29, 6: 8.77, 7: 26.88, 8: 15.04, 9: 14.78}, 'tip': {0: 1.01, 1: 1.66, 2: 3.5, 3: 3.31, 4: 3.61, 5: 4.71, 6: 2.0, 7: 3.12, 8: 1.96, 9: 3.23}, 'sex': {0: 'Female', 1: 'Male', 2: 'Male', 3: 'Male', 4: 'Female', 5: 'Male', 6: 'Male', 7: 'Male', 8: 'Male', 9: 'Male'}, 'smoker': {0: 'No', 1: 'No', 2: 'No', 3: 'No', 4: 'No', 5: 'No', 6: 'No', 7: 'No', 8: 'No', 9: 'No'}, 'day': {0: 'Sun', 1: 'Sun', 2: 'Sun', 3: 'Sun', 4: 'Sun', 5: 'Sun', 6: 'Sun', 7: 'Sun', 8: 'Sun', 9: 'Sun'}, 'time': {0: 'Dinner', 1: 'Dinner', 2: 'Dinner', 3: 'Dinner', 4: 'Dinner', 5: 'Dinner', 6: 'Dinner', 7: 'Dinner', 8: 'Dinner', 9: 'Dinner'}, 'size': {0: 2, 1: 3, 2: 3, 3: 2, 4: 4, 5: 4, 6: 2, 7: 4, 8: 2, 9: 2}})
So the code on the slides is as follows:
mean_female = df.groupby("sex").mean()["total_bill"]["Female"]
mean_male = df.groupby("sex").mean()["total_bill"]["Male"]
std_female = df.groupby("sex").std()["total_bill"]["Female"]
std_male = df.groupby("sex").std()["total_bill"]["Male"]
Followed by this...
for i in range(len(df)):
    if df.iloc[i, 2] == "Female":
        df.iloc[i][0] = (df.iloc[i, 0] - mean_female) / std_female
    else:
        df.iloc[i][0] = (df.iloc[i, 0] - mean_male) / std_male
When I run the code (which is from DataCamp, not mine) I get the usual SettingWithCopyWarning about setting a value on a copy of a slice, but, more importantly, NOTHING happens to the dataframe.
I assume the objective is to have something like this:
zscore = lambda x: (x - x.mean()) / x.std()
dfsex = df.groupby('sex')
dfzscore = dfsex['total_bill'].transform(zscore)
dfzscore
I'm a little confused so any help figuring this out is much appreciated.
Cheers!
.iloc[i,0] should be used instead of .iloc[i][0]. The dataframe will be updated correctly after fixing this bug. Evidence:
df
Out[58]:
total_bill tip sex smoker day time size
0 -0.707107 1.01 Female No Sun Dinner 2
1 -1.138059 1.66 Male No Sun Dinner 3
2 0.402209 3.50 Male No Sun Dinner 3
3 0.787637 3.31 Male No Sun Dinner 2
4 0.707107 3.61 Female No Sun Dinner 4
5 1.020048 4.71 Male No Sun Dinner 4
6 -1.364696 2.00 Male No Sun Dinner 2
7 1.249573 3.12 Male No Sun Dinner 4
8 -0.459590 1.96 Male No Sun Dinner 2
9 -0.497122 3.23 Male No Sun Dinner 2
Explanation: let's take a close look at df.iloc[i][0]. The first step, df.iloc[i], returns an intermediate Series; because the frame has mixed dtypes, that Series is a copy of the row, not a view into df. The second step, [0] = ..., then writes into that copy, so df itself is never updated.
In short, all indices must go inside a single .iloc[] (or, arguably better, .iat[] for scalar access) for the assignment to land in the original dataframe.
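For completeness, a minimal sketch of the corrected loop using .iat, with the same logic as the course code and only the indexing changed (df, mean_female, etc. are the variables defined above):
for i in range(len(df)):
    if df.iat[i, 2] == "Female":   # column 2 is 'sex'
        df.iat[i, 0] = (df.iat[i, 0] - mean_female) / std_female
    else:
        df.iat[i, 0] = (df.iat[i, 0] - mean_male) / std_male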
Use:
df = df.assign(total_bill=lambda x: np.where(x['sex'] == 'Female',
                                             (x['total_bill'] - mean_female) / std_female,
                                             (x['total_bill'] - mean_male) / std_male))
Instead of:
for i in range(len(df)):
    if df.iloc[i, 2] == "Female":
        df.iloc[i][0] = (df.iloc[i, 0] - mean_female) / std_female
    else:
        df.iloc[i][0] = (df.iloc[i, 0] - mean_male) / std_male
np.where operates on whole Series at once and is much faster than a for loop. Note that .assign returns a new dataframe, so the result has to be assigned back, as shown above.

Transforming a Dataframe with duplicate data in Python

I would like to transform the below dataframe to concatenate duplicate data into a single row. For example:
data_dict={'FromTo_U': {0: 'L->R', 1: 'L->R', 2: 'S->I'},
'GeneName': {0: 'EGFR', 1: 'EGFR', 2: 'EGFR'},
'MutationAA_C': {0: 'p.L858R', 1: 'p.L858R', 2: 'p.S768I'},
'MutationDescription': {0: 'Substitution - Missense',
1: 'Substitution - Missense',
2: 'Substitution - Missense'},
'PubMed': {0: '22523351', 1: '23915069', 2: '26862733'},
'VariantID': {0: 'COSM12979', 1: 'COSM12979', 2: 'COSM18486'},
'VariantPos_U': {0: '858', 1: '858', 2: '768'},
'VariantSource': {0: 'COSMIC', 1: 'COSMIC', 2: 'COSMIC'}}
df1=pd.DataFrame(data_dict)
The transformed dataframe should be:
data_dict_t={'FromTo_U': {0: 'L->R', 2: 'S->I'},
'GeneName': {0: 'EGFR', 2: 'EGFR'},
'MutationAA_C': {0: 'p.L858R', 2: 'p.S768I'},
'MutationDescription': {0: 'Substitution - Missense',2: 'Substitution - Missense'},
'PubMed': {0: '22523351,23915069', 2: '26862733'},
'VariantID': {0: 'COSM12979', 2: 'COSM18486'},
'VariantPos_U': {0: '858', 2: '768'},
'VariantSource': {0: 'COSMIC', 2: 'COSMIC'}}
I want to merge two rows of df1 only if their PubMed IDs differ and the rest of the columns contain the same data. Thanks in advance!
Use groupby + agg with str.join as the aggfunc.
c = df1.columns.difference(['PubMed']).tolist()
df1.groupby(c, as_index=False).PubMed.agg(','.join)
FromTo_U GeneName MutationAA_C MutationDescription VariantID \
0 L->R EGFR p.L858R Substitution - Missense COSM12979
1 S->I EGFR p.S768I Substitution - Missense COSM18486
VariantPos_U VariantSource PubMed
0 858 COSMIC 22523351,23915069
1 768 COSMIC 26862733
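If you want the result back in df1's original column order (the grouped output above moves PubMed to the end), a short sketch, with out as a hypothetical name for the result:
out = df1.groupby(c, as_index=False).PubMed.agg(','.join)
out = out[df1.columns]  # restore the original column order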

Split list in Pandas dataframe column into multiple columns

I am working with movie data and have a dataframe column for movie genre. Currently the column contains a list of movie genres for each movie (as most movies are assigned to multiple genres), but for the purpose of this analysis, I would like to parse the list and create a new dataframe column for each genre. So instead of having genre=['Drama','Thriller'] for a given movie, I would have two columns, something like genre1='Drama' and genre2='Thriller'.
Here is a snippet of my data:
{'color': {0: [u'Color::(Technicolor)'],
1: [u'Color::(Technicolor)'],
2: [u'Color::(Technicolor)'],
3: [u'Color::(Technicolor)'],
4: [u'Black and White']},
'country': {0: [u'USA'],
1: [u'USA'],
2: [u'USA'],
3: [u'USA', u'UK'],
4: [u'USA']},
'genre': {0: [u'Crime', u'Drama'],
1: [u'Crime', u'Drama'],
2: [u'Crime', u'Drama'],
3: [u'Action', u'Crime', u'Drama', u'Thriller'],
4: [u'Crime', u'Drama']},
'language': {0: [u'English'],
1: [u'English', u'Italian', u'Latin'],
2: [u'English', u'Italian', u'Spanish', u'Latin', u'Sicilian'],
3: [u'English', u'Mandarin'],
4: [u'English']},
'rating': {0: 9.3, 1: 9.2, 2: 9.0, 3: 9.0, 4: 8.9},
'runtime': {0: [u'142'],
1: [u'175'],
2: [u'202', u'220::(The Godfather Trilogy 1901-1980 VHS Special Edition)'],
3: [u'152'],
4: [u'96']},
'title': {0: u'The Shawshank Redemption',
1: u'The Godfather',
2: u'The Godfather: Part II',
3: u'The Dark Knight',
4: u'12 Angry Men'},
'votes': {0: 1793199, 1: 1224249, 2: 842044, 3: 1774083, 4: 484061},
'year': {0: 1994, 1: 1972, 2: 1974, 3: 2008, 4: 1957}}
Any help would be greatly appreciated! Thanks!
I think you need the DataFrame constructor with add_prefix, and finally concat back to the original:
df1 = pd.DataFrame(df.genre.values.tolist()).add_prefix('genre_')
df = pd.concat([df.drop('genre',axis=1), df1], axis=1)
Timings:
df = pd.DataFrame(d)
print (df)
#5000 rows
df = pd.concat([df]*1000).reset_index(drop=True)
In [394]: %timeit (pd.concat([df.drop('genre',axis=1), pd.DataFrame(df.genre.values.tolist()).add_prefix('genre_')], axis=1))
100 loops, best of 3: 3.4 ms per loop
In [395]: %timeit (pd.concat([df.drop(['genre'],axis=1),df['genre'].apply(pd.Series).rename(columns={0:'genre_0',1:'genre_1',2:'genre_2',3:'genre_3'})],axis=1))
1 loop, best of 3: 757 ms per loop
This should work for you:
pd.concat([df.drop(['genre'],axis=1),df['genre'].apply(pd.Series).rename(columns={0:'genre_0',1:'genre_1',2:'genre_2',3:'genre_3'})],axis=1)
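With either approach, a movie that has fewer genres than the longest list ends up with missing values in the trailing genre columns, because the wide frame is padded to the widest row. A small illustration with a hypothetical two-movie Series:
import pandas as pd

genres = pd.Series([['Crime', 'Drama'],
                    ['Action', 'Crime', 'Drama', 'Thriller']])
# the first movie gets missing values in genre_2 and genre_3
print(pd.DataFrame(genres.tolist()).add_prefix('genre_'))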
