I have a DataFrame df as shown below:
df:
player goals_oct goals_nov
messi 2 4
neymar 2 NaN
ronaldo NaN 3
salah NaN NaN
levenoski 2 2
I would like to calculate the average goals scored by each player: the mean of goals_oct and goals_nov when both values are available, otherwise whichever value is available, and NaN if neither is available.
Expected output:
player goals_oct goals_nov avg_goals
messi 2 4 3
neymar 2 NaN 2
ronaldo NaN 3 3
salah NaN NaN NaN
levenoski 2 0 1
I tried the code below, but it did not work:
conditions_g = [(df['goals_oct'].isnull() and df['goals_nov'].notnull()),
(df['goals_oct'].notnull() and df['goals_nov'].isnull())]
choices_g = [df['goals_nov'], df['goals_oct']]
df['avg_goals']=np.select(conditions_g, choices_g, default=(df['goals_oct']+df['goals_nov'])/2)
Simply use mean(axis=1). It will skip NaNs:
columns = df.columns[1:] # all columns except the first
df['avg_goal'] = df[columns].mean(axis=1)
Output:
>>> df
player goals_oct goals_nov avg_goal
0 messi 2.0 4.0 3.0
1 neymar 2.0 NaN 2.0
2 ronaldo NaN 3.0 3.0
3 salah NaN NaN NaN
4 levenoski 2.0 2.0 2.0
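As for why the np.select attempt in the question fails: Python's `and` tries to reduce each whole Series to a single True/False, which pandas refuses to do (it raises a ValueError about an ambiguous truth value). Element-wise conditions need `&` instead. Here is a minimal sketch of the corrected version; the way the sample frame is built is my assumption, since the question only shows it printed.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt by hand from the question (construction is assumed)
df = pd.DataFrame({
    "player": ["messi", "neymar", "ronaldo", "salah", "levenoski"],
    "goals_oct": [2, 2, np.nan, np.nan, 2],
    "goals_nov": [4, np.nan, 3, np.nan, 2],
})

# Use & (element-wise AND) instead of the Python keyword `and`
conditions_g = [
    df["goals_oct"].isnull() & df["goals_nov"].notnull(),
    df["goals_oct"].notnull() & df["goals_nov"].isnull(),
]
choices_g = [df["goals_nov"], df["goals_oct"]]

# When both values are missing, the default is NaN + NaN, which stays NaN
default_g = (df["goals_oct"] + df["goals_nov"]) / 2
df["avg_goals"] = np.select(conditions_g, choices_g, default=default_g)

# The row-wise mean gives the same numbers with far less code
row_mean = df[["goals_oct", "goals_nov"]].mean(axis=1)
```

Both approaches should agree on this data, so mean(axis=1) is probably the one to keep.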
Try this; it should work:
df["avg_goals"] = np.where(df.goals_oct.isnull(),
                           np.where(df.goals_nov.isnull(), np.nan, df.goals_nov),
                           np.where(df.goals_nov.isnull(), df.goals_oct, (df.goals_oct + df.goals_nov) / 2))
If you want to treat 0 as a missing value, convert 0 to np.nan first and then apply the statement above.
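If zeros should count as missing, one possible way to do the conversion is DataFrame.replace before averaging. This is only a sketch: the three-row frame below is hypothetical data made up for illustration, not the frame from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one row with a 0, one row with 0 and NaN
df = pd.DataFrame({
    "player": ["messi", "levenoski", "salah"],
    "goals_oct": [2.0, 2.0, 0.0],
    "goals_nov": [4.0, 0.0, np.nan],
})

cols = ["goals_oct", "goals_nov"]

# Turn 0 into NaN, then take the row-wise mean (NaN is skipped by default)
df["avg_goals"] = df[cols].replace(0, np.nan).mean(axis=1)
```

Here the row with 2 and 0 averages to 2.0 rather than 1.0, and the row with only 0 and NaN ends up as NaN.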
I have some data in list form, such as:
Q=[2,3,4,5,6,7,8,9,10,11,12] #values
M=[11,0,1,2,3,4,5,6,7,8,9] #months
Y=[2010,2011,2011,2011,2011,2011,2011,2011,2011,2011,2011] #years
I want to get a DataFrame with one row per year and one column per month, placing the values of Q at the positions given by M and Y.
So far I have tried a couple of things; my current code is as follows:
def save_data(data_list,year_info,month_info):
    #how many datapoints
    n_data=len(data_list)
    #how many years
    y0=year_info[0]
    yf=year_info[n_data-1]
    n_years=yf-y0+1
    #creating the list i want to fill out
    df_list=[[math.nan]*12]*n_years
    ind=0
    for y in range(n_years):
        for m in range(12):
            if ind<len(data_list):
                if year_info[ind]-y0==y and month_info[ind]==m:
                    df_list[y][m]=data_list[ind]
                    ind+=1
    df=pd.DataFrame(df_list)
    return df
I get this output:
     0    1    2    3    4    5    6    7    8    9   10   11
0    3    4    5    6    7    8    9   10   11   12  nan    2
1    3    4    5    6    7    8    9   10   11   12  nan    2
And I want to get:
     0    1    2    3    4    5    6    7    8    9   10   11
0  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan    2
1    3    4    5    6    7    8    9   10   11   12  nan  nan
I have tried a bunch of different things, but so far nothing has worked. Is there a more straightforward way of doing this? My code seems to be overwriting values in a strange way; for instance, I do not know why there is a 2 in the last position of the second row, since that is the first value of my list.
Thanks in advance!
Try pivot:
(pd.DataFrame({'Y':Y,'M':M,'Q':Q})
   .pivot(index='Y', columns='M', values='Q')
)
Output:
M      0    1    2    3    4    5    6     7     8     9    11
Y
2010  NaN  NaN  NaN  NaN  NaN  NaN  NaN   NaN   NaN   NaN  2.0
2011  3.0  4.0  5.0  6.0  7.0  8.0  9.0  10.0  11.0  12.0  NaN
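Two things are worth adding. First, the rows in the question look identical because `[[math.nan]*12]*n_years` repeats a reference to the same inner list, so writing into one "row" writes into all of them. Second, pivot only creates columns for months that actually appear in M, which is why month 10 is missing above. The sketch below shows both points; using reindex(columns=range(12)) to force all twelve month columns is my suggestion rather than part of the original answer.

```python
import math
import pandas as pd

Q = [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]   # values
M = [11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]     # months
Y = [2010] + [2011] * 10                   # years

# Multiplying the outer list copies references, not the inner list
aliased = [[math.nan] * 12] * 2
aliased[0][11] = 2            # also changes aliased[1][11]

# A list comprehension builds a separate inner list for each row
independent = [[math.nan] * 12 for _ in range(2)]
independent[0][11] = 2        # independent[1][11] stays nan

# pivot places each Q at (year, month); reindex adds the absent month 10
table = (
    pd.DataFrame({"Y": Y, "M": M, "Q": Q})
    .pivot(index="Y", columns="M", values="Q")
    .reindex(columns=range(12))
)
```

With the reindex step, `table` should have one row per year and all twelve month columns, with NaN wherever no value was given.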
I created this DataFrame and need to group my data into categories with the same city, number of beds, and number of baths, and sort the elements of each group by price in descending order.
Secondly, I need to find the difference between each price and the next-ranked price within the same group.
For example, given a group like this:
1 bed, 1 bath, Madrid, 10
1 bed, 1 bath, Madrid, 8
1 bed, 1 bath, Madrid, 5
1 bed, 1 bath, Madrid, 1
I should get 2, 3, 4...
I tried some code, but the result seems far from what I expect:
data=[[1,'paris',1,2,'10'],[2,'madrid',2,2,8],[3,'madrid',2,2,11],[4,'paris',1,2,6],[5,'paris',1,2,5],[6,'madrid',2,1,7],[7,'paris',2,1,7],[8,'paris',2,1,7],[9,'madrid',1,4],[10,'paris',2,1,3],[11,'madrid',2,2,7],[12,'paris',2,3,12],[13,'madrid',2,3,7],[14,'madrid',1,1,3],[15,'paris',1,1,3],[16,'madrid',1,1,4],[17,'paris',1,1,5]]
df=pd.DataFrame(data, columns=['id','city','beds','baths','price'])
df
df['gap'] = df.sort_values('price',ascending=False).groupby(['city','beds','baths'])['price'].diff()
print (df)
Many thanks in advance.
I would use pd.to_numeric with errors='coerce' to convert the strings in the price column to numbers. I would then calculate the difference, ignoring rooms whose price is unknown (using DataFrame.dropna). Finally, I show the result both in the original row order and sorted:
df['price']=pd.to_numeric(df['price'],errors = 'coerce')
df['difference_price'] = ( df.dropna()
                             .sort_values('price',ascending=False)
                             .groupby(['city','beds','baths'])['price'].diff(-1) )
or using GroupBy.shift:
df['difference_price'] = df['price'].sub( df.dropna()
                                            .sort_values('price',ascending=False)
                                            .groupby(['city','beds','baths'])
                                            .price
                                            .shift(-1) )
Display the result:
print(df,'\n'*3,'Sorted DatFrame: ')
print(df.sort_values(['city','beds','baths','price'],ascending = [True,True,True,False]))
Output:
id city beds baths price difference_price
0 1 paris 1 2 10.0 4.0
1 2 madrid 2 2 8.0 1.0
2 3 madrid 2 2 11.0 3.0
3 4 paris 1 2 6.0 1.0
4 5 paris 1 2 5.0 NaN
5 6 madrid 2 1 7.0 NaN
6 7 paris 2 1 7.0 0.0
7 8 paris 2 1 7.0 4.0
8 9 madrid 1 4 NaN NaN
9 10 paris 2 1 3.0 NaN
10 11 madrid 2 2 7.0 NaN
11 12 paris 2 3 12.0 NaN
12 13 madrid 2 3 7.0 NaN
13 14 madrid 1 1 3.0 NaN
14 15 paris 1 1 3.0 NaN
15 16 madrid 1 1 4.0 1.0
16 17 paris 1 1 5.0 2.0
Sorted DatFrame:
id city beds baths price difference_price
15 16 madrid 1 1 4.0 1.0
13 14 madrid 1 1 3.0 NaN
8 9 madrid 1 4 NaN NaN
5 6 madrid 2 1 7.0 NaN
2 3 madrid 2 2 11.0 3.0
1 2 madrid 2 2 8.0 1.0
10 11 madrid 2 2 7.0 NaN
12 13 madrid 2 3 7.0 NaN
16 17 paris 1 1 5.0 2.0
14 15 paris 1 1 3.0 NaN
0 1 paris 1 2 10.0 4.0
3 4 paris 1 2 6.0 1.0
4 5 paris 1 2 5.0 NaN
6 7 paris 2 1 7.0 0.0
7 8 paris 2 1 7.0 4.0
9 10 paris 2 1 3.0 NaN
11 12 paris 2 3 12.0 NaN
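To make the first step more concrete, here is a small sketch of what pd.to_numeric with errors='coerce' does to a mixed column like price. The Series below is a hypothetical stand-in: '10' and None mirror the kinds of values in the question's data, while 'unknown' is an invented entry added only to show the coercion.

```python
import pandas as pd

# Hypothetical mixed column: numeric string, int, missing value, junk text
prices = pd.Series(["10", 8, None, "unknown"], dtype=object)

# Numeric strings are parsed; anything unparseable becomes NaN
clean = pd.to_numeric(prices, errors="coerce")
```

After this, `clean` is a float column, so sorting and diff behave numerically, and dropna can discard the rows that could not be converted.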
If I understand correctly, by:
group my data into category with the same number of beds, city, baths and sort(descending)
you mean that all rows that do not meet this condition (that is, where beds and baths differ) should be deleted? Under that assumption, here is my code for your problem:
import numpy as np
import pandas as pd
data=[[1,'paris',1,2,'10'],[2,'madrid',2,2,8],[3,'madrid',2,2,11],[4,'paris',1,2,6],[5,'paris',1,2,5],[6,'madrid',2,1,7],[7,'paris',2,1,7],[8,'paris',2,1,7],[9,'madrid',1,4],[10,'paris',2,1,3],[11,'madrid',2,2,7],[12,'paris',2,3,12],[13,'madrid',2,3,7],[14,'madrid',1,1,3],[15,'paris',1,1,3],[16,'madrid',1,1,4],[17,'paris',1,1,5]]
df=pd.DataFrame(data, columns=['id','city','beds','baths','price'])
df_new = df[df['beds'] == df['baths']]
df_new = df_new.sort_values(['city','price'],ascending=[False,False]).reset_index(drop=True)
df_new['diff_price'] = df_new.groupby(['city','beds','baths'])['price'].diff(-1)
print(df_new)
Output:
id city beds baths price diff_price
0 17 paris 1 1 5 NaN
1 15 paris 1 1 3 -2
2 3 madrid 2 2 11 NaN
3 2 madrid 2 2 8 -3
4 11 madrid 2 2 7 -1
5 16 madrid 1 1 4 NaN
6 14 madrid 1 1 3 -1