Remove outliers from aggregated DataFrame (Python)

My original dataframe looks like this (only the first rows):
  categories  id  products
0          A   1         a
1          B   1         a
2          C   1         a
3          A   1         b
4          B   1         b
5          A   2         c
6          B   2         c
I aggregated it with the following code:
df2 = df.groupby('id').products.nunique().reset_index().merge(
    pd.crosstab(df.id, df.categories).reset_index())
The resulting dataframe then looks like this (I also added an outlier row from my DF):
   id  products  A  B   C
0   1         2  2  2   1
1   2         1  1  1   0
2   3        50  1  1  30
Now I am trying to remove the outliers in my new DF:
#remove outliers
del df2['id']
df2 = df2.loc[df2['products']<=20,[str(i) for i in df2.columns]]
What I then get is:
   products    A    B    C
0         2  NaN  NaN  NaN
1         1  NaN  NaN  NaN
It removes the outliers, but why do I only get NaNs in the category columns now?

The likely culprit is the column selector [str(i) for i in df2.columns]: it converts every label to a string, and if the labels coming out of the crosstab are not plain strings (e.g. numeric or categorical labels in your real data), the stringified names no longer match any existing column, so .loc fills those missing columns with NaN (newer pandas versions raise a KeyError instead). Just filter the rows and keep all columns:
df2 = df2.loc[df2['products'] <= 20]
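For reference, here is a minimal, self-contained sketch of the whole flow, reconstructing the sample frame from the question (the outlier row with id 3 is not in the sample data, so the filter simply keeps both rows here):

import pandas as pd

# reconstruct the sample frame from the question
df = pd.DataFrame({'categories': ['A', 'B', 'C', 'A', 'B', 'A', 'B'],
                   'id':         [1, 1, 1, 1, 1, 2, 2],
                   'products':   ['a', 'a', 'a', 'b', 'b', 'c', 'c']})

# aggregate: distinct products per id, plus category counts from the crosstab
df2 = (df.groupby('id').products.nunique().reset_index()
         .merge(pd.crosstab(df.id, df.categories).reset_index()))

# drop the id column and keep only the non-outlier rows; no column list is needed,
# so the category columns A, B, C keep their values
del df2['id']
df2 = df2.loc[df2['products'] <= 20]
print(df2)
#    products  A  B  C
# 0         2  2  2  1
# 1         1  1  1  0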


Fill cell containing NaN with average of value before and after considering groupby

I would like to fill missing values in a pandas dataframe with the average of the cells directly before and after the missing value considering that there are different IDs.
maskedid test value
1 A 4
1 B NaN
1 C 5
2 A 5
2 B NaN
2 B 2
expected DF
maskedid test value
1 A 4
1 B 4.5
1 C 5
2 A 5
2 B 3.5
2 B 2
Try to interpolate:
df['value'] = df['value'].interpolate()
And by group:
df['value'] = df.groupby('maskedid')['value'].apply(pd.Series.interpolate)
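A self-contained sketch with the sample data from the question; interpolating per maskedid group keeps values from bleeding across IDs (transform is used here instead of apply, an equivalent way to write the result back in this case):

import numpy as np
import pandas as pd

# sample data from the question
df = pd.DataFrame({'maskedid': [1, 1, 1, 2, 2, 2],
                   'test':     ['A', 'B', 'C', 'A', 'B', 'B'],
                   'value':    [4, np.nan, 5, 5, np.nan, 2]})

# linear interpolation within each maskedid group
df['value'] = df.groupby('maskedid')['value'].transform(pd.Series.interpolate)
print(df)
#    maskedid test  value
# 0         1    A    4.0
# 1         1    B    4.5
# 2         1    C    5.0
# 3         2    A    5.0
# 4         2    B    3.5
# 5         2    B    2.0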

Replace specific row-wise duplicate cells in selected columns without dropping rows

How can I replace specific row-wise duplicate cells in selected columns without dropping rows (preferably without looping through the rows)?
Basically, I want to keep the first value and replace the remaining duplicates in a row with NAN.
For example:
df_example = pd.DataFrame({'A':['a' , 'b', 'c'], 'B':['a', 'f', 'c'],'C':[1,2,3]})
df_example.head()
Original:
A B C
0 a a 1
1 b f 2
2 c c 3
Expected output:
A B C
0 a nan 1
1 b f 2
2 c nan 3
A bit more complicated example is as follows:
Original:
A B C D
0 a 1 a 1
1 b 2 f 5
2 c 3 c 3
Expected output:
A B C D
0 a 1 nan nan
1 b 2 f 5
2 c 3 nan nan
Use DataFrame.mask with Series.duplicated applied per row via DataFrame.apply:
df_example = df_example.mask(df_example.apply(lambda x: x.duplicated(), axis=1))
print (df_example)
A B C
0 a NaN 1
1 b f 2
2 c NaN 3
With new data:
df_example = df_example.mask(df_example.apply(lambda x: x.duplicated(), axis=1))
print (df_example)
A B C D
0 a 1 NaN NaN
1 b 2 f 5.0
2 c 3 NaN NaN
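The second example frame isn't constructed in the question; here is a sketch of how it might be built (column values assumed from the output shown above) and masked the same way:

import pandas as pd

# assumed construction of the "more complicated" example
df_example = pd.DataFrame({'A': ['a', 'b', 'c'],
                           'B': [1, 2, 3],
                           'C': ['a', 'f', 'c'],
                           'D': [1, 5, 3]})

# mask every cell that repeats an earlier value in the same row
df_example = df_example.mask(df_example.apply(lambda x: x.duplicated(), axis=1))
print(df_example)
#    A  B    C    D
# 0  a  1  NaN  NaN
# 1  b  2    f  5.0
# 2  c  3  NaN  NaN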

How could I replace null values in a group?

I created this dataframe and calculated the price gap I was looking for, but the problem is that some flats have the same price, so I get a price difference of 0. How could I replace the value 0 with the difference to the next lower price of the same group?
for example:
neighborhood: a, bed: 1, bath: 1, price: 5
neighborhood: a, bed: 1, bath: 1, price: 5
neighborhood: a, bed: 1, bath: 1, price: 3
neighborhood: a, bed: 1, bath: 1, price: 2
I get price differences of 0, 2, 1, nan, but I'm looking for 2, 2, 1, nan (in short, I don't want to compare two flats with the same price).
Thanks in advance and good day.
import pandas as pd
import numpy as np

data = [
    [1,'a',1,1,5],[2,'a',1,1,5],[3,'a',1,1,4],[4,'a',1,1,2],[5,'b',1,2,6],[6,'b',1,2,6],[7,'b',1,2,3]
]
df = pd.DataFrame(data, columns=['id','neighborhoodname', 'beds', 'baths', 'price'])
df['difference_price'] = (df.dropna()
                            .sort_values('price', ascending=False)
                            .groupby(['neighborhoodname','beds','baths'])['price'].diff(-1))
I think you can first remove duplicates across all of the columns used for the groupby plus price, create the new column in the filtered data, and finally merge it back to the original with a left join:
df1 = (df.dropna()
         .sort_values('price', ascending=False)
         .drop_duplicates(['neighborhoodname','beds','baths','price']))
df1['difference_price'] = df1.groupby(['neighborhoodname','beds','baths'])['price'].diff(-1)
df = df.merge(df1[['neighborhoodname','beds','baths','price','difference_price']], how='left')
print(df)
id neighborhoodname beds baths price difference_price
0 1 a 1 1 5 1.0
1 2 a 1 1 5 1.0
2 3 a 1 1 4 2.0
3 4 a 1 1 2 NaN
4 5 b 1 2 6 3.0
5 6 b 1 2 6 3.0
6 7 b 1 2 3 NaN
Or you can use a lambda function that back-fills the 0 values per group, which avoids wrong output for one-row groups (no data is moved in from another group):
df['difference_price'] = (df.sort_values('price', ascending=False)
                            .groupby(['neighborhoodname','beds','baths'])['price']
                            .apply(lambda x: x.diff(-1).replace(0, np.nan).bfill()))
print (df)
id neighborhoodname beds baths price difference_price
0 1 a 1 1 5 1.0
1 2 a 1 1 5 1.0
2 3 a 1 1 4 2.0
3 4 a 1 1 2 NaN
4 5 b 1 2 6 3.0
5 6 b 1 2 6 3.0
6 7 b 1 2 3 NaN
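To make the back-fill step concrete, this is what it does to one group's prices in isolation (a small sketch using the group 'a' prices from the data above, already sorted descending):

import numpy as np
import pandas as pd

prices = pd.Series([5, 5, 4, 2])
step = prices.diff(-1)                  # [0.0, 1.0, 2.0, NaN]
step = step.replace(0, np.nan).bfill()  # [1.0, 1.0, 2.0, NaN]
print(step.tolist())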

How to add a nested column to a 3D Pandas DataFrame?

New to Pandas, not very sure how the 3D DataFrame works. My dataframe, called 'new', looks like this:
  unique     cat    numerical
       a  b    c  d         e  f
0      0  1    2  3         4  5
1      0  1    2  3         4  5
I want to insert column 'z' so that it ends up like this:
  unique        cat    numerical
       a  b  z    c  d         e  f
0      0  1  9    2  3         4  5
1      0  1  9    2  3         4  5
I successfully made a new column after slicing 'unique' out of my dataframe. Doing this:
new_column = new.loc[:,'unique'].assign(z=pd.Series([9,9]).values)
Gets me this:
a b z
0 0 1 9
1 0 1 9
However I have no idea how to put it back into the dataframe. I tried:
new['unique'] = new_column
But I've since found out that it just tries to replace all the values in all the rows and columns found under 'unique', like this:
new['unique'] = 'a'
Gets:
  unique     cat    numerical
       a  b    c  d         e  f
0      a  a    2  3         4  5
1      a  a    2  3         4  5
And using .loc gets this instead:
  unique       cat    numerical
       a    b    c  d         e  f
0    NaN  NaN    2  3         4  5
1    NaN  NaN    2  3         4  5
Here's my full code:
import pandas as pd
import numpy as np
data=[[0,1,2,3,4,5],[0,1,2,3,4,5]]
datatypes=np.array(['unique','unique','cat','cat','numerical','numerical'])
columnnames=np.array(['a','b','c','d','e','f'])
new = pd.DataFrame(data=data, columns=pd.MultiIndex.from_tuples(zip(datatypes,columnnames)))
print('new: ')
print(new)
new_column = new.loc[:,'unique'].assign(z=pd.Series([9,9]).values)
print('\nnew column:')
print(new_column)
new.loc[:,'unique'] = new_column
print('\nattempt 1:')
print(new)
new['unique'] = new_column
print('\nattempt 2:')
print(new)
One way to do this:
# Create your new multiindexed column:
new['unique','z'] = 9
# Re-order your columns in your desired order:
new = new[['unique', 'cat', 'numerical']]
>>> new
  unique        cat    numerical
       a  b  z    c  d         e  f
0      0  1  9    2  3         4  5
1      0  1  9    2  3         4  5
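If you want to reuse the new_column frame already computed in the question, the same idea works by assigning its z column directly (a sketch under that assumption):
# assign the z values computed earlier under the 'unique' group, then regroup the columns
new['unique', 'z'] = new_column['z']
new = new[['unique', 'cat', 'numerical']]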

How to extract values of one dataframe with values of other dataframe in pandas?

Suppose that you create the following pandas data frames:
In[1]: print df1.to_string()
ID value
0 1 a
1 2 b
2 3 c
3 4 d
In[2]: print df2.to_string()
Id_a Id_b
0 1 2
1 4 2
2 2 1
3 3 3
4 4 4
5 2 2
How can I create a frame df_ids_to_values with the following values:
In[2]: print df_ids_to_values.to_string()
value_a value_b
0 a b
1 d b
2 b a
3 c c
4 d d
5 b b
In other words, I would like to replace the ids of df2 with the corresponding values in df1. I have tried doing this with a for loop, but it is very slow, and I am hoping that there is a function in pandas that allows me to do this operation efficiently.
Thanks for your help...
Start by setting an index on df1
df1 = df1.set_index('ID')
then join the two columns
df = df2.join(df1, on='Id_a')
df = df.rename(columns = {'value' : 'value_a'})
df = df.join(df1, on='Id_b')
df = df.rename(columns = {'value' : 'value_b'})
result:
> df
Id_a Id_b value_a value_b
0 1 2 a b
1 4 2 d b
2 2 1 b a
3 3 3 c c
4 4 4 d d
5 2 2 b b
[6 rows x 4 columns]
(and you get to your expected output with df[['value_a','value_b']])
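An alternative sketch of the same lookup using Series.map, with the frames rebuilt from the question's data; mapping each id column through df1 indexed by ID avoids the two joins:

import pandas as pd

# rebuild the frames shown in the question
df1 = pd.DataFrame({'ID': [1, 2, 3, 4], 'value': ['a', 'b', 'c', 'd']})
df2 = pd.DataFrame({'Id_a': [1, 4, 2, 3, 4, 2], 'Id_b': [2, 2, 1, 3, 4, 2]})

# lookup Series keyed by ID, then map each id column through it
lookup = df1.set_index('ID')['value']
df_ids_to_values = pd.DataFrame({'value_a': df2['Id_a'].map(lookup),
                                 'value_b': df2['Id_b'].map(lookup)})
print(df_ids_to_values)
#   value_a value_b
# 0       a       b
# 1       d       b
# 2       b       a
# 3       c       c
# 4       d       d
# 5       b       b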
