How to delete outliers of a specific column

I'd like to delete some outliers from my dataframe:
Product   Brand    Year  calcium_100g  phosphorus_100g  iron_100g  magnesium_100g
Poduct A  Brand A  2020  8             50               NaN        NaN
Poduct B  Brand A  2021  54            -1               NaN        17
Poduct C  Brand C  2020  NaN           NaN              NaN        NaN
Poduct D  Brand C  2018  NaN           50               80         NaN
Poduct E  Brand E  2019  123           50               NaN        27
The outliers I'd like to delete are values above 100 or below 0 in the columns whose names end with "_100g" (-1 and 123 in this case).
I found a way to select the columns ending with "_100g":
Columns100g = list(data.filter(like='_100g', axis=1).columns)
But at this point I can't find a way to delete the outliers.

I would suggest using the drop() method, which accepts the index labels of the rows to remove. Use lt(0) and gt(100), combined with | (or), to flag the out-of-range values, then any(axis=1) to get True for every row where at least one of the selected columns is flagged:
# Columns whose names contain '_100g'
c = df.filter(like='_100g').columns
# Rows where any selected column is below 0 or above 100
mask = (df[c].lt(0) | df[c].gt(100)).any(axis=1)
# Drop those rows by index
new = df.drop(df[mask].index)
Prints back:
print(new)
Product Brand Year ... phosphorus_100g iron_100g magnesium_100g
0 Poduct A Brand A 2020 ... 50.0 NaN NaN
2 Poduct C Brand C 2020 ... NaN NaN NaN
3 Poduct D Brand C 2018 ... 50.0 80.0 NaN
[3 rows x 7 columns]
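For anyone who wants to run this end to end, here is a minimal sketch that rebuilds the sample table from the question and keeps the in-range rows with a boolean mask instead of drop(). The frame construction and the names cols, out_of_range and clean are my own; only the 0 and 100 thresholds come from the question.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question (spelling of "Poduct" kept as posted).
df = pd.DataFrame({
    "Product": ["Poduct A", "Poduct B", "Poduct C", "Poduct D", "Poduct E"],
    "Brand": ["Brand A", "Brand A", "Brand C", "Brand C", "Brand E"],
    "Year": [2020, 2021, 2020, 2018, 2019],
    "calcium_100g": [8, 54, np.nan, np.nan, 123],
    "phosphorus_100g": [50, -1, np.nan, 50, 50],
    "iron_100g": [np.nan, np.nan, np.nan, 80, np.nan],
    "magnesium_100g": [np.nan, 17, np.nan, np.nan, 27],
})

# Columns whose names contain "_100g".
cols = df.filter(like="_100g").columns

# True for rows where at least one selected column is < 0 or > 100.
# Comparisons with NaN are False, so missing values never flag a row.
out_of_range = (df[cols].lt(0) | df[cols].gt(100)).any(axis=1)

# Keep only the rows that were not flagged.
clean = df[~out_of_range]
print(clean)
```

Inverting the mask gives the same rows as dropping by index, and it also behaves sensibly if the index happens to contain duplicate labels.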

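Using your Columns100g, you can use a for loop to filter on several columns. Note that fillna(0) makes NaN count as in range, so rows with missing values are kept:
for col in Columns100g:
    data = data[(data[col].fillna(0) >= 0) & (data[col].fillna(0) <= 100)]
Edit: if you would rather replace the outlier values with NaN than drop the rows, you can simply do (with numpy imported as np):
data[(data[Columns100g] < 0) | (data[Columns100g] > 100)] = np.nan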
Output:
Product Brand Year calcium_100g phosphorus_100g iron_100g magnesium_100g
0 Poduct A Brand A 2020 8.0 50.0 NaN NaN
1 Poduct B Brand A 2021 54.0 NaN NaN 17.0
2 Poduct C Brand C 2020 NaN NaN NaN NaN
3 Poduct D Brand C 2018 NaN 50.0 80.0 NaN
4 Poduct E Brand E 2019 NaN 50.0 NaN 27.0
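The same replacement can also be written with DataFrame.where, which keeps values where a condition holds and puts NaN everywhere else. This is only a sketch on a cut-down, made-up frame (three rows, two "_100g" columns), not the exact data above.

```python
import pandas as pd

# Small illustrative frame; the values are chosen to include one outlier per column.
data = pd.DataFrame({
    "Year": [2020, 2021, 2019],
    "calcium_100g": [8.0, 54.0, 123.0],
    "phosphorus_100g": [50.0, -1.0, 50.0],
})

cols = data.filter(like="_100g").columns

# Keep values in the closed range [0, 100]; anything else becomes NaN.
# Columns outside `cols` (here, Year) are not touched.
in_range = data[cols].ge(0) & data[cols].le(100)
data[cols] = data[cols].where(in_range)
print(data)
```

Because the assignment is limited to the selected columns, unrelated numeric columns such as Year cannot be blanked by accident.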

Pandas has a function named drop that you can use alongside the filter code you wrote.
# Deleting columns
# Delete the "_100" column from the dataframe
data = data.drop("_100", axis=1)
# Alternatively, delete columns using the columns parameter of drop
data = data.drop(columns="_100")
# Delete the "_100" column from the dataframe in place
# Note that the original 'data' object is changed when inplace=True
data.drop("_100", axis=1, inplace=True)
# Delete multiple columns from the dataframe
data = data.drop(["Y2001", "Y2002", "Y2003"], axis=1)
If you replace the names I used ("_100") with the result of your filter, you are good to go. Note that this removes whole columns rather than individual outlier values.
You can also read more about the drop function at the link below:
https://www.shanelynn.ie/using-pandas-dataframe-creating-editing-viewing-data-in-python/

Related

How to groupby and count since the first non nan value?

I have this df:
CODE TMAX
0 000130 NaN
1 000130 NaN
2 000130 32.0
3 000130 32.2
4 000130 NaN
5 158328 NaN
6 158328 8.8
7 158328 NaN
8 158328 NaN
9 158328 9.2
... ... ...
I want to count the number of non-NaN values and the number of NaN values in the 'TMAX' column, but counting only from the first non-NaN value onward, and per CODE.
Expected result for code 000130: 2 non-NaN values and 1 NaN value.
Expected result for code 158328: 2 non-NaN values and 2 NaN values.
The same goes for the other codes.
How can I do this?
Thanks in advance.

If you need the CODEs too, build a mask with GroupBy.cummax and count the values with crosstab:
m = df.TMAX.notna()
s = m[m.groupby(df['CODE']).cummax()]
df1 = pd.crosstab(df['CODE'], s).rename(columns={True:'non NaNs',False:'NaNs'})
print (df1)
TMAX NaNs non NaNs
CODE
130 1 2
158328 2 2
If you also need to explicitly filter the CODE column by the mask:
m = df.TMAX.notna()
mask = m.groupby(df['CODE']).cummax()
df1 = pd.crosstab(df.loc[mask, 'CODE'], m[mask]).rename(columns={True:'non NaNs',False:'NaNs'})
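To see why the cummax mask works, here is a small self-contained sketch on the ten rows shown in the question. The CODE values are stored as strings here (an assumption, so the leading zeros survive), and the names valid, started, trimmed and counts are mine.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "CODE": ["000130"] * 5 + ["158328"] * 5,
    "TMAX": [np.nan, np.nan, 32.0, 32.2, np.nan,
             np.nan, 8.8, np.nan, np.nan, 9.2],
})

valid = df["TMAX"].notna()

# Within each CODE, the running maximum of a boolean stays False until the
# first True and is True from then on, so it marks every row from the first
# non-NaN value onward. astype(bool) is only a defensive cast.
started = valid.groupby(df["CODE"]).cummax().astype(bool)

# Drop the leading NaN rows of every group, then count per CODE.
trimmed = df[started]
counts = pd.DataFrame({
    "NaNs": trimmed["TMAX"].isna().groupby(trimmed["CODE"]).sum(),
    "non NaNs": trimmed["TMAX"].notna().groupby(trimmed["CODE"]).sum(),
})
print(counts)
```

A CODE whose TMAX values are all NaN never gets a True in the mask, so it simply does not appear in the result.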

Use first_valid_index to find the index of the first non-NaN value and slice from there. Then use isna to create a boolean mask and count the values.
def countNaN(s):
    return (
        s.loc[s.first_valid_index():]
        .isna()
        .value_counts()
        .rename({True: 'NaN', False: 'notNaN'})
    )
df.groupby('CODE')['TMAX'].apply(countNaN)
Output
CODE
000130 notNaN 2
NaN 1
158328 notNaN 2
NaN 2

Pivoting A Column in Pandas based on groups and dynamic column names

I have a data frame in Python pandas as follows (the first two columns, mygroup1 and mygroup2, are the groupby columns):
df =
mygroup1  mygroup2  tname  #dt   #num  #vek
a         p         alpha  may   6     a
b         q         alpha  june  8     b
c         r         beta   may   9     c
d         s         beta   june  11    d
I want to pivot the table on the values in the tname column. The new column names should join each tname value with the names of the other columns (#dt, #num and #vek):
mygroup1  mygroup2  alpha#dt  alpha#num  alpha#vek  beta#dt  beta#num  beta#vek
a         p         may       6          a          nan      nan       nan
b         q         june      8          b          nan      nan       nan
c         r         nan       nan        nan        may      9         c
d         s         nan       nan        nan        june     11        d
I am trying to do this with a pandas pivot table, but I am not able to get the format above, which is what I really want. I would appreciate any help.

You can do:
new_df = df.set_index(['mygroup1','mygroup2','tname']).unstack('tname')
new_df.columns = [f'{y}{x}' for x,y in new_df.columns]
new_df = new_df.sort_index(axis=1).reset_index()
Output:
mygroup1 mygroup2 alpha#dt alpha#num alpha#vek beta#dt beta#num beta#vek
0 a p may 6.0 a NaN NaN NaN
1 b q june 8.0 b NaN NaN NaN
2 c r NaN NaN NaN may 9.0 c
3 d s NaN NaN NaN june 11.0 d
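For completeness, here is a runnable sketch of the same reshaping on the sample data, written with DataFrame.pivot instead of set_index/unstack. It assumes pandas 1.1 or later, where pivot accepts a list for index; the name wide is mine.

```python
import pandas as pd

df = pd.DataFrame({
    "mygroup1": list("abcd"),
    "mygroup2": list("pqrs"),
    "tname": ["alpha", "alpha", "beta", "beta"],
    "#dt": ["may", "june", "may", "june"],
    "#num": [6, 8, 9, 11],
    "#vek": list("abcd"),
})

# With no `values` given, every remaining column is pivoted, and the columns
# become a two-level MultiIndex of (original column name, tname value).
wide = df.pivot(index=["mygroup1", "mygroup2"], columns="tname")

# Flatten the two levels into single labels such as "alpha#dt".
wide.columns = [f"{tname}{field}" for field, tname in wide.columns]

# Sort so all alpha columns come before all beta columns, then restore
# the group columns as ordinary columns.
wide = wide.sort_index(axis=1).reset_index()
print(wide)
```

As in the output above, the numeric column turns into floats because the missing combinations are filled with NaN.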

Form single Row from all rows with corresponding values in pandas

I have dataframe as follows:
2017 2018
A B C A B C
0 12 NaN NaN 98 NaN NaN
1 NaN 23 NaN NaN 65 NaN
2 NaN NaN 45 NaN NaN 43
I want to convert this data frame into:
2017 2018
A B C A B C
0 12 23 45 98 65 43

First back fill the missing values, then select the first row with double [] to get a one-row DataFrame:
df = df.bfill().iloc[[0]]
# alternative
# df = df.ffill().iloc[[-1]]
print (df)
2017 2018
A B C A B C
0 12.0 23.0 45.0 98.0 65.0 43.0

One could sum along the columns:
import pandas as pd
import numpy as np
# Create DataFrame:
tmp = np.hstack((np.diag([12., 23., 42.]), np.diag([98., 65., 43.])))
tmp[tmp == 0] = np.nan
df = pd.DataFrame(tmp)
# Sum:
df2 = pd.DataFrame(df.sum(axis=0)).T
Resulting in:
0 1 2 3 4 5
0 12.0 23.0 42.0 98.0 65.0 43.0
This is convenient because DataFrame.sum ignores NaN by default. A couple of notes:
One loses the column names in this approach.
All-NaN columns will return 0 in the result.
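Both caveats can be worked around, at least in a sketch like the one below: sum(min_count=1) returns NaN rather than 0 for a column with no valid values, and to_frame().T keeps the column labels. The frame mirrors the one in the question; the names columns and row are mine. Summing only gives the intended result when each column holds at most one non-NaN value, as it does here.

```python
import numpy as np
import pandas as pd

# Two-level columns as in the question: years on top, A/B/C below.
columns = pd.MultiIndex.from_product([[2017, 2018], ["A", "B", "C"]])
df = pd.DataFrame(
    [[12, np.nan, np.nan, 98, np.nan, np.nan],
     [np.nan, 23, np.nan, np.nan, 65, np.nan],
     [np.nan, np.nan, 45, np.nan, np.nan, 43]],
    columns=columns,
)

# min_count=1: a column needs at least one valid value to produce a number,
# otherwise the sum is NaN instead of 0.
# to_frame().T: turn the resulting Series back into a one-row DataFrame
# whose columns are the original two-level labels.
row = df.sum(min_count=1).to_frame().T
print(row)
```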

Update column with NaN with mean of filtered rows

I have the following DataFrame
VOTES CITY
24 A
22 A
20 B
NaN A
NaN A
30 B
NaN C
I need to fill the NaN values with the mean of the values where CITY is 'A' or 'C'.
The code I tried only updated the first row in VOTES; all the rest were set to NaN.
train['VOTES'][((train['VOTES'].isna()) & (train['CITY'].isin(['A','C'])))]=train['VOTES'].loc[((~train['VOTES'].isna()) & (train['CITY'].isin(['A','C'])))].astype(int).mean(axis=0)
After running this, every value in 'VOTES' is NaN except the record at index 0. The mean is calculated correctly, though.

Use Series.fillna only on the filtered rows, with the mean of those filtered rows:
train['VOTES_EN']=train['VOTES'].astype(str).str.extract(r'(-?\d+\.?\d*)').astype(float)
m= train['CITY'].isin(['A','C'])
mean = train.loc[m,'VOTES_EN'].mean()
train.loc[m,'VOTES_EN']=train.loc[m,'VOTES_EN'].fillna(mean)
train['VOTES_EN'] = train['VOTES_EN'].astype(int)
print (train)
VOTES CITY VOTES_EN
0 24.0 A 24
1 22.0 A 22
2 20.0 B 20
3 NaN A 23
4 NaN A 23
5 30.0 B 30
6 NaN C 23
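If VOTES is already numeric, the string extraction step is not needed. Below is a minimal sketch under that assumption, using a single .loc assignment; the names in_scope and mean_votes are mine. It avoids the chained indexing used in the question's attempt (train['VOTES'][...] = ...), which pandas documents as unreliable, though I can't confirm that this was the actual cause of the NaNs there.

```python
import numpy as np
import pandas as pd

train = pd.DataFrame({
    "VOTES": [24, 22, 20, np.nan, np.nan, 30, np.nan],
    "CITY": ["A", "A", "B", "A", "A", "B", "C"],
})

# Rows that belong to city A or C.
in_scope = train["CITY"].isin(["A", "C"])

# Series.mean skips NaN by default, so this is the mean of 24 and 22.
mean_votes = train.loc[in_scope, "VOTES"].mean()

# Fill only the missing values inside the selected cities, in one .loc call.
train.loc[in_scope & train["VOTES"].isna(), "VOTES"] = mean_votes
print(train)
```

Rows for city B are left alone, including any NaN they might contain.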

How do I pivot a pandas DataFrame and then add hierarchical columns?

Can someone please help me understand the steps to convert a Python pandas DataFrame that is in record form (data set A), into one that is pivoted with nested columns (as shown in data set B)?
For this question the underlying schema has the following rules:
Each ProjectID appears once
Each ProjectID is associated to a single PM
Each ProjectID is associated to a single Category
Multiple ProjectIDs can be associated with a single Category
Multiple ProjectIDs can be associated with a single PM
Input Data Set A
df_A = pd.DataFrame({'ProjectID':[1,2,3,4,5,6,7,8],
'PM':['Bob','Jill','Jack','Jack','Jill','Amy','Jill','Jack'],
'Category':['Category A','Category B','Category C','Category B','Category A','Category D','Category B','Category B'],
'Comments':['Justification 1','Justification 2','Justification 3','Justification 4','Justification 5','Justification 6','Justification 7','Justification 8'],
'Score':[10,7,10,5,15,10,0,2]})
Desired Output
Notice above the addition of a nested index across the columns. Also notice that 'Comments' and 'Score' both appear at the same level beneath 'ProjectID'. Finally see how the desired output does NOT aggregate any data, but groups/merges the category data into one row per category value.
I have tried so far:
df_A.set_index(['Category','ProjectID'],append=True).unstack() - This would only work if I first create a nested index of ['Category','ProjectID'] and ADD that to the original numerical index created with a standard dataframe, however it repeats each instance of a Category/ProjectID match as its own row (because of the original index).
df_A.groupby() - I wasn't able to use this because it appears to force aggregation of some sort in order to get all of the values of a single category on a single row.
df_A.pivot('Category','ProjectID',values='Comments') - I can perform a pivot to avoid unwanted aggregation and it starts to look similar to my intended output, but can only see the 'Comments' field and also cannot set nested columns this way. I receive an error when trying to set values=['Comments','Score'] in the pivot statement.
I think the answer is somewhere between pivot, unstack, set_index, or groupby, but I don't know how to complete the pivot, and then add the appropriate nested column index.
I'd appreciate any thoughts you all have.
Question updated based on Mr. T's comments. Thank you.

I think this is what you are looking for:
pd.DataFrame(df_A.set_index(['PM', 'ProjectID', 'Category']).sort_index().stack()).T.stack(2)
Out[4]:
PM Amy Bob ... Jill
ProjectID 6 1 ... 5 7
Comments Score Comments Score ... Comments Score Comments Score
Category ...
0 Category A NaN NaN Justification 1 10 ... Justification 5 15 NaN NaN
Category B NaN NaN NaN NaN ... NaN NaN Justification 7 0
Category C NaN NaN NaN NaN ... NaN NaN NaN NaN
Category D Justification 6 10 NaN NaN ... NaN NaN NaN NaN
[4 rows x 16 columns]
EDIT:
To select rows by category you should get rid of the row index 0 by adding .xs():
In [3]: df_A_transformed = pd.DataFrame(df_A.set_index(['PM', 'ProjectID', 'Category']).sort_index().stack()).T.stack(2).xs(0)
In [4]: df_A_transformed
Out[4]:
PM Amy Bob ... Jill
ProjectID 6 1 ... 5 7
Comments Score Comments Score ... Comments Score Comments Score
Category ...
Category A NaN NaN Justification 1 10 ... Justification 5 15 NaN NaN
Category B NaN NaN NaN NaN ... NaN NaN Justification 7 0
Category C NaN NaN NaN NaN ... NaN NaN NaN NaN
Category D Justification 6 10 NaN NaN ... NaN NaN NaN NaN
[4 rows x 16 columns]
In [5]: df_A_transformed.loc['Category B']
Out[5]:
PM ProjectID
Amy 6 Comments NaN
Score NaN
Bob 1 Comments NaN
Score NaN
Jack 3 Comments NaN
Score NaN
4 Comments Justification 4
Score 5
8 Comments Justification 8
Score 2
Jill 2 Comments Justification 2
Score 7
5 Comments NaN
Score NaN
7 Comments Justification 7
Score 0
Name: Category B, dtype: object
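A possibly more direct route to the same layout is to put Category, PM and ProjectID in the index and unstack the last two. This is only a sketch of that alternative, not the method used above; the name wide is mine, and it relies on each ProjectID appearing once, as the question states.

```python
import pandas as pd

df_A = pd.DataFrame({
    "ProjectID": [1, 2, 3, 4, 5, 6, 7, 8],
    "PM": ["Bob", "Jill", "Jack", "Jack", "Jill", "Amy", "Jill", "Jack"],
    "Category": ["Category A", "Category B", "Category C", "Category B",
                 "Category A", "Category D", "Category B", "Category B"],
    "Comments": [f"Justification {i}" for i in range(1, 9)],
    "Score": [10, 7, 10, 5, 15, 10, 0, 2],
})

# One row per Category; PM and ProjectID move into the column index.
wide = (
    df_A.set_index(["Category", "PM", "ProjectID"])[["Comments", "Score"]]
    .unstack(["PM", "ProjectID"])
)

# Columns come out as (field, PM, ProjectID). Move the field (Comments/Score)
# to the innermost level and sort so each project's two fields sit together.
wide = wide.reorder_levels([1, 2, 0], axis=1).sort_index(axis=1)

print(wide.loc["Category B", "Jack"])
```

No aggregation happens here: unstack only rearranges values, and it raises an error if the index contains duplicate entries.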
