How to replace 0 values with mean based on groupby - python

I have a dataframe with two features: gps_height (numeric) and region (categorical).
The gps_height contains a lot of 0 values, which are missing values in this case. I want to fill the 0 values with the mean of the corresponding region.
My reasoning is as follows:
1. Drop the zero values and take the mean values of gps_height, grouped by region
df[df.gps_height != 0].groupby(['region']).mean()
But how do I replace the zero values in my dataframe with those mean values?
Sample data:
gps_height region
0 1390 Iringa
1 1400 Mara
2 0 Iringa
3 250 Iringa
...

Use:
import pandas as pd
import numpy as np

df = pd.DataFrame({'region': list('aaabbbccc'),
                   'gps_height': [2,3,0,3,4,5,1,0,0]})
print (df)
region gps_height
0 a 2
1 a 3
2 a 0
3 b 3
4 b 4
5 b 5
6 c 1
7 c 0
8 c 0
Replace 0 with missing values, then fill the NaNs via fillna with per-group means from GroupBy.transform (GroupBy.mean skips NaN, so the zeros do not distort the group means):
df['gps_height'] = df['gps_height'].replace(0, np.nan)
df['gps_height'] = df['gps_height'].fillna(df.groupby('region')['gps_height'].transform('mean'))
print (df)
region gps_height
0 a 2.0
1 a 3.0
2 a 2.5
3 b 3.0
4 b 4.0
5 b 5.0
6 c 1.0
7 c 1.0
8 c 1.0
Or filter out the 0 values, aggregate the means, and map them onto the 0 rows:
m = df['gps_height'] != 0
s = df[m].groupby('region')['gps_height'].mean()
df.loc[~m, 'gps_height'] = df['region'].map(s)
#alternative
#df['gps_height'] = np.where(~m, df['region'].map(s), df['gps_height'])
print (df)
region gps_height
0 a 2.0
1 a 3.0
2 a 2.5
3 b 3.0
4 b 4.0
5 b 5.0
6 c 1.0
7 c 1.0
8 c 1.0
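The same fill can also be written without mutating the column twice; a minimal sketch using Series.mask, assuming the df above:
s = df['gps_height'].mask(df['gps_height'].eq(0))  # turn the 0s into NaN
df['gps_height'] = s.fillna(s.groupby(df['region']).transform('mean'))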

I ended up facing the same problem that #ahbon raised: what if there is more than one column to group by? This was the closest question I could find to my problem. After a serious struggle, I came to a solution.
As far as I know it is not an elegant/orthodox one (there are probably pandas-specific functions that do similar things more directly), so I'd appreciate some feedback.
There it goes:
import pandas as pd
import random
random.seed(123)
df = pd.DataFrame({"A":list('a'*4+'b'*4+'c'*4+'d'*4),
"B":list('xy'*8),
"C":random.sample(range(17), 16)})
print(df)
A B C
0 a x 1
1 a y 8
2 a x 16
3 a y 12
4 b x 6
5 b y 4
6 b x 14
7 b y 0
8 c x 13
9 c y 5
10 c x 2
11 c y 9
12 d x 10
13 d y 11
14 d x 3
15 d y 15
First get the indices of the non-zero rows, retrieve that data, and take the mean by group.
idx = list(df[df["C"] != 0].index)
data_to_group = df.iloc[idx,]
grouped_data = pd.DataFrame(data_to_group.groupby(["A", "B"])["C"].mean())
And now the tricky part. Here is where I get the impression that there could be a more elegant solution:
Stack, unstack and reset the index.
Then merge with the subset of rows in df where C is 0, dropping C from the first frame and keeping C from the second.
Finally, update df with this zero-free subset.
grouped_data = grouped_data.stack().unstack().reset_index()
zero_rows = df[df.C == 0]
zero_rows_replaced = pd.merge(left=zero_rows, right=grouped_data,
                              how="left", on=["A", "B"],
                              suffixes=('_x', '')).drop('C_x', axis=1)
zero_rows_replaced = zero_rows_replaced.set_index(zero_rows.index.copy())
df.update(zero_rows_replaced)
print(df)
A B C
0 a x 1
1 a y 8
2 a x 16
3 a y 12
4 b x 6
5 b y 4
6 b x 14
7 b y 4
8 c x 13
9 c y 5
10 c x 2
11 c y 9
12 d x 10
13 d y 11
14 d x 3
15 d y 15
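For what it's worth, the transform approach from the first answer seems to generalize directly to several group keys, since groupby accepts a list of columns. A sketch on the same df (numpy imported as np is assumed):
import numpy as np

df['C'] = df['C'].replace(0, np.nan)
df['C'] = df['C'].fillna(df.groupby(['A', 'B'])['C'].transform('mean'))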

Related

I can't find the min value (which is > 0) in each row in selected columns df[df[col]>0]

This is my data, and I want to find the min value of the selected columns (a, b, c, d) in each row, then calculate the difference between that and dd. I need to ignore 0s in the rows; for example, in the first row I need to find 8.
need to ignore 0 in rows
Then just replace it with NaN; consider the following simple example:
import numpy as np
import pandas as pd
df = pd.DataFrame({"A":[1,2,0],"B":[3,5,7],"C":[7,0,7]})
df.replace(0,np.nan).apply(min)
df["minvalue"] = df.replace(0,np.nan).apply("min",axis=1)
print(df)
gives output
A B C minvalue
0 1 3 7 1.0
1 2 5 0 2.0
2 0 7 7 7.0
You can use pandas apply with axis=1: select the columns ['a','b','c','d'] of each row (a Series), replace 0 with +inf, and take the min. At the end, compute the difference between that min and the column 'dd'.
import numpy as np
df['min_dd'] = df.apply(lambda row: min(row[['a','b','c','d']].replace(0, np.inf)) - row['dd'], axis=1)
print(df)
a b c d dd min_dd
0 0 15 0 8 6 2.0 # min_without_zero : 8 , dd : 6 -> 8-6=2
1 2 0 5 3 2 0.0 # min_without_zero : 2 , dd : 2 -> 2-2=0
2 5 3 3 0 2 1.0 # 3 - 2
3 0 2 3 4 2 0.0 # 2 - 2
You can try
cols = ['a','b','c','d']
df['res'] = df[cols][df[cols].ne(0)].min(axis=1) - df['dd']
print(df)
a b c d dd res
0 0 15 0 8 6 2.0
1 2 0 5 3 2 0.0
2 5 3 3 0 2 1.0
3 0 2 3 4 2 0.0
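The boolean indexing df[cols][df[cols].ne(0)] works because it turns the zeros into NaN, which min(axis=1) then skips. DataFrame.where spells the same idea out explicitly; a sketch on the same df:
df['res'] = df[cols].where(df[cols].ne(0)).min(axis=1) - df['dd']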

Python dataframe rank each column based on row values

I have a data frame, and I want to rank each column based on its row values.
Ex:
xdf = pd.DataFrame({'A':[10,20,30],'B':[5,30,20],'C':[15,3,8]})
xdf =
A B C
0 10 5 15
1 20 30 3
2 30 20 8
Expected result:
xdf =
A B C Rk_1 Rk_2 Rk_3
0 10 5 15 C A B
1 20 30 3 B A C
2 30 20 8 A B C
OR
xdf =
A B C A_Rk B_Rk C_Rk
0 10 5 15 2 3 1
1 20 30 3 2 1 3
2 30 20 8 1 2 3
Why I need this:
I want to track the trend of each column and how it is changing. I would like to show this by the plot. Maybe a bar plot showing how many times A got Rank1, 2, 3, etc.
My approach:
xdf[['Rk_1','Rk_2','Rk_3']] = ""
for i in range(len(xdf)):
    xdf.loc[i, ['Rk_1','Rk_2','Rk_3']] = dict(sorted(dict(xdf[['A','B','C']].loc[i]).items(),
                                                     reverse=True, key=lambda item: item[1])).keys()
Present output:
A B C Rk_1 Rk_2 Rk_3
0 10 5 15 C A B
1 20 30 3 B A C
2 30 20 8 A B C
I am iterating through each row, converting its columns into a dictionary, sorting by value, and then extracting the keys (column names). Is there a better approach? My actual data frame has 10000 rows and 12 columns to be ranked; I just ran it and it took around 2 minutes.
You should be able to get your desired dataframe by using:
ranked = xdf.join(xdf.rank(ascending=False, method='first', axis=1), rsuffix='_rank')
This'll give you:
A B C A_rank B_rank C_rank
0 10 5 15 2.0 3.0 1.0
1 20 30 3 2.0 1.0 3.0
2 30 20 8 1.0 2.0 3.0
Then do whatever you need to do plotting-wise.
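For the bar plot the asker mentions, one possible sketch (assuming matplotlib is available) is to count how often each column achieved each rank and plot the counts:
ranks = xdf[['A', 'B', 'C']].rank(ascending=False, method='first', axis=1)
counts = ranks.apply(pd.Series.value_counts).fillna(0)  # index = rank, columns = A/B/C
counts.plot.bar()  # one group of bars per rank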

Python/Pandas: Use lookup DataFrame + function to replace specific/null values in DataFrame

Say I have an incomplete dataset in a Pandas DataFrame such as:
incData = pd.DataFrame({'comp': ['A']*3 + ['B']*5 + ['C']*4,
                        'x': [1,2,3] + [1,2,3,4,5] + [1,2,3,4],
                        'y': [3,None,7] + [1,4,7,None,None] + [4,None,2,1]})
And also a DataFrame with fitting parameters that I could use to fill holes:
fitTable = pd.DataFrame({'slope': [2,3,-1],
                         'intercept': [1,-2,5]},
                        index=['A','B','C'])
I would like to achieve the following using y=x*slope+intercept for the None entries only:
comp x y
0 A 1 3.0
1 A 2 5.0
2 A 3 7.0
3 B 1 1.0
4 B 2 4.0
5 B 3 7.0
6 B 4 10.0
7 B 5 13.0
8 C 1 4.0
9 C 2 3.0
10 C 3 2.0
11 C 4 1.0
One way I envisioned is by using join and drop:
incData = incData.join(fitTable,on='comp')
incData.loc[incData['y'].isnull(),'y'] = incData[incData['y'].isnull()]['x']*\
incData[incData['y'].isnull()]['slope']+\
incData[incData['y'].isnull()]['intercept']
incData.drop(['slope','intercept'], axis=1, inplace=True)
However, that does not seem very efficient, because it adds and then removes columns. It seems that I am making this too complicated; am I overlooking a simpler, more direct solution? Something more like this non-functional code:
incData.loc[incData['y'].isnull(),'y'] = incData[incData['y'].isnull()]['x']*\
fitTable[incData[incData['y'].isnull()]['comp']]['slope']+\
fitTable[incData[incData['y'].isnull()]['comp']]['intercept']
I am pretty new to Pandas, so I sometimes get a bit mixed up with the strict indexing rules...
You can use map on the column 'comp', once you build a mask of the null values in 'y', like:
mask = incData['y'].isna()
incData.loc[mask, 'y'] = incData.loc[mask, 'x']*\
incData.loc[mask,'comp'].map(fitTable['slope']) +\
incData.loc[mask,'comp'].map(fitTable['intercept'])
And your non-functional code, I guess, would become something like:
incData.loc[mask,'y'] = incData.loc[mask, 'x']*\
fitTable.loc[incData.loc[mask, 'comp'],'slope'].to_numpy()+\
fitTable.loc[incData.loc[mask, 'comp'],'intercept'].to_numpy()
IIUC:
incData.loc[pd.isna(incData['y']), 'y'] = incData[pd.isna(incData['y'])].apply(
    lambda row: row['x'] * fitTable.loc[row['comp'], 'slope'] + fitTable.loc[row['comp'], 'intercept'],
    axis=1)
incData
comp x y
0 A 1 3.0
1 A 2 5.0
2 A 3 7.0
3 B 1 1.0
4 B 2 4.0
5 B 3 7.0
6 B 4 10.0
7 B 5 13.0
8 C 1 4.0
9 C 2 3.0
10 C 3 2.0
11 C 4 1.0
merge is another option
# merge two dataframe together on comp
m = incData.merge(fitTable, left_on='comp', right_index=True)
# y = mx+b
m['y'] = m['x']*m['slope']+m['intercept']
comp x y slope intercept
0 A 1 3 2 1
1 A 2 5 2 1
2 A 3 7 2 1
3 B 1 1 3 -2
4 B 2 4 3 -2
5 B 3 7 3 -2
6 B 4 10 3 -2
7 B 5 13 3 -2
8 C 1 4 -1 5
9 C 2 3 -1 5
10 C 3 2 -1 5
11 C 4 1 -1 5
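Note that the merge version recomputes y for every row; here that is harmless because the fit reproduces the known values exactly, but to fill only the missing entries you could combine it with fillna (a sketch; with right_index=True the merge keeps incData's index, so the Series align):
# hypothetical variant: keep the original y where present, fit only the gaps
m['y'] = incData['y'].fillna(m['x'] * m['slope'] + m['intercept'])
m = m.drop(columns=['slope', 'intercept'])  # back to the original columns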

Keep all cells above given value in pandas DataFrame

I would like to discard all cells that contain a value below a given value. So not only the rows or the columns that do, but all individual cells.
I tried the code below, where all values in each cell should be at least 3. It doesn't work.
df[(df >= 3).any(axis=1)]
Example
import pandas as pd
my_dict = {'A':[1,5,6,2],'B':[9,9,1,2],'C':[1,1,3,5]}
df = pd.DataFrame(my_dict)
df
A B C
0 1 9 1
1 5 9 1
2 6 1 3
3 2 2 5
I want to keep only the cells that are at least 3.
If you want "all values in each cell should be at least 3"
df[df < 3] = 3
df
A B C
0 3 9 3
1 5 9 3
2 6 3 3
3 3 3 5
If you want "to keep only the cells that are at least 3"
df = df[df >= 3]
df
A B C
0 NaN 9.0 NaN
1 5.0 9.0 NaN
2 6.0 NaN 3.0
3 3.0 3.0 5.0
You can check whether each value is >= 3, then drop all rows that contain a NaN:
df[df >= 3].dropna()
DEMO:
import pandas as pd
my_dict = {'A':[1,5,6,3],'B':[9,9,1,3],'C':[1,1,3,5]}
df = pd.DataFrame(my_dict)
df
A B C
0 1 9 1
1 5 9 1
2 6 1 3
3 3 3 5
df = df[df >= 3].dropna().reset_index(drop=True)
df
A B C
0 3.0 3.0 5.0
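For reference, DataFrame.where expresses both readings explicitly: it keeps the cells where the condition holds and sets the rest to NaN, or to a replacement value. A small sketch on the same df:
df.where(df >= 3)           # NaN where the value is below 3
df.where(df >= 3, other=3)  # replace the small values with 3, like df.clip(lower=3)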

Setting the value of one column based on three conditions of another column

I have a DataFrame with 2 columns, a and b, and I would like to populate a third column, c based on the following three conditions:
if a.diff() > 0 then c = b.shift() + b
elif a.diff() < 0 then c = b.shift() - b
elif a.diff() == 0 then c = b.shift()
What is a Pythonic, one-liner way of doing this?
Example:
a b c
0 2 10 NaN
1 3 16 26
2 1 12 4
3 1 18 12
4 3 11 29
5 1 13 -2
Use numpy.select, caching the shifted and diffed Series for better performance and readability:
import numpy as np

diff = df.a.diff()
shifted = df.b.shift()
df['c'] = np.select([diff > 0, diff < 0], [shifted + df.b, shifted - df.b], default=shifted)
print (df)
a b c
0 2 10 NaN
1 3 16 26.0
2 1 12 4.0
3 1 18 12.0
4 3 11 29.0
5 1 13 -2.0
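Since the three branches differ only in the sign applied to b, np.sign offers an equivalent one-liner (a sketch, not from the original answer; np.sign propagates the NaN from the first diff, and a zero diff contributes 0*b, leaving just the shifted value):
df['c'] = df.b.shift() + np.sign(df.a.diff()) * df.b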
