Calculate value in Column based on two other columns in a DataFrame - python

I am trying to calculate the mean value of the two classes Math and Physics and store it in a separate column Mean. Furthermore, I only want to calculate the mean for rows where both classes have a grade; the filter for that is working. The only thing that is not working is calculating the Mean value: for the missing ones it works, but somehow it sets the rows where I already have a value to NaN, which is weird. The data looks like the following:
Date        School  Math  Physics  Mean  Flag
01.01.2020  ABC     3     4              1
01.03.2020  ABC     2     3              1
01.05.2020  ABC     2     1        1.5   2
01.07.2020  ABC     2     1              1
01.08.2020  ABC     2     1        1.5   2
01.04.2020  ABC     2                    3
01.06.2020  ABC           1              3
My code looks like the following:
import pandas as pd
path = 'School_grades.xlsx'
df = pd.read_excel(path)
df_copy = df.copy(deep=True)
df_copy['Date'] = pd.to_datetime(df_copy.Date)
df_copy = df_copy[(df_copy["Flag"] != 3)]
df_copy['Mean'] = ((df_copy['Math'] + df_copy['Physics'])/2).where(df_copy['Flag'] == 1)
print(df_copy)
My code produces the following, where the rows that already had a Mean are set to NaN:
Date School Math Physics Mean Flag
0 2020-01-01 ABC 3.0 4.0 3.5 1
1 2020-01-03 ABC 2.0 3.0 2.5 1
2 2020-01-05 ABC 2.0 1.0 NaN 2
3 2020-01-07 ABC 2.0 1.0 1.5 1
4 2020-01-08 ABC 2.0 1.0 NaN 2
But would rather expect something like this:
Date School Math Physics Mean Flag
0 2020-01-01 ABC 3.0 4.0 3.5 1
1 2020-01-03 ABC 2.0 3.0 2.5 1
2 2020-01-05 ABC 2.0 1.0 1.5 2
3 2020-01-07 ABC 2.0 1.0 1.5 1
4 2020-01-08 ABC 2.0 1.0 1.5 2

Your .where() call has no "else" argument, but it returns a value for each row of the dataframe. This means it only returns values where your condition is True and missing values where it is False, essentially throwing your previous results away.
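A minimal toy example of this behaviour (made-up values):
import pandas as pd
s = pd.Series([1, 2, 3])
print(s.where(s > 1))     # [NaN, 2.0, 3.0] - rows failing the condition become NaN
print(s.where(s > 1, 0))  # [0, 2, 3] - the second argument fills the failing rows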
There are multiple ways to solve this. One of them uses the numpy library.
np.where() takes a series of True/False values: where True, it uses the second argument; where False, the third. Here we insert the previous Mean values:
import numpy as np
df_copy['Mean'] = np.where(df_copy['Flag'] == 1, ((df_copy['Math'] + df_copy['Physics'])/2), df_copy['Mean'])

You forgot to add the other parameter in pandas.where():
>> df_copy['Mean'] = ((df_copy['Math'] + df_copy['Physics'])/2).where(df_copy['Flag'] == 1,df_copy['Mean'])
>> print(df_copy)
Date School Math Physics Mean Flag
0 01.01.2020 ABC 3.0 4.0 3.5 1
1 01.03.2020 ABC 2.0 3.0 2.5 1
2 01.05.2020 ABC 2.0 1.0 1.5 2
3 01.07.2020 ABC 2.0 1.0 1.5 1
4 01.08.2020 ABC 2.0 1.0 1.5 2
5 01.04.2020 ABC 2.0 NaN NaN 3
6 01.06.2020 ABC NaN 1.0 NaN 3
Use pandas.DataFrame.mean to calculate the average
df_copy['Mean'] = df_copy[['Math','Physics']].mean(axis=1).where(df_copy.Flag == 1,df_copy['Mean'])
You can also use numpy.where:
import numpy as np
df_copy['Mean'] = np.where(df_copy.Flag == 1,df_copy[['Math','Physics']].mean(axis=1),df_copy['Mean'])
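A .loc-based assignment also works, writing the mean only into the Flag == 1 rows and leaving the rest untouched (a sketch using the same column names):
mask = df_copy['Flag'] == 1
df_copy.loc[mask, 'Mean'] = df_copy.loc[mask, ['Math', 'Physics']].mean(axis=1)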

Related

How to populate NaN values based on conditions from two other columns using Pandas?

I have a dataframe that looks something like this:
ID  hiqual  Wave
1   1.0     g
1   NaN     i
1   NaN     k
2   1.0     g
2   NaN     i
2   NaN     k
3   1.0     g
3   NaN     i
4   5.0     g
4   NaN     i
This is a long-format dataframe and I have my hiqual variable for my first measurement wave (g). I would like to populate the NaN values for the subsequent measurement waves (i and k) with the same value given in wave g for each ID.
I tried using fillna() but I am not sure how to provide the two conditions of ID and Wave and how to populate based on them. I would be grateful for any help/suggestions on this.
The exact expected output is unclear, but I think you might want:
m = df['hiqual'].isna()
df.loc[m, 'hiqual'] = df['hiqual'].ffill()
If your dataframe is already ordered by the ID and Wave columns, you can simply forward fill the values:
>>> df.sort_values(['ID', 'Wave']).ffill()
ID hiqual Wave
0 1 1.0 g
1 1 1.0 i
2 1 1.0 k
3 2 1.0 g
4 2 1.0 i
5 2 1.0 k
6 3 1.0 g
7 3 1.0 i
8 4 5.0 g
9 4 5.0 i
You can also explicitly use the g values:
g_vals = df[df['Wave']=='g'].set_index('ID')['hiqual']
df['hiqual'] = df['hiqual'].fillna(df['ID'].map(g_vals))
print(df)
print(g_vals)
# Output
ID hiqual Wave
0 1 1.0 g
1 1 1.0 i
2 1 1.0 k
3 2 1.0 g
4 2 1.0 i
5 2 1.0 k
6 3 1.0 g
7 3 1.0 i
8 4 5.0 g
9 4 5.0 i
# g_vals
ID
1 1.0
2 1.0
3 1.0
4 5.0
Name: hiqual, dtype: float64
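A groupby-based alternative, not from the answers above but a sketch that assumes the non-NaN value always comes from the g row of each ID:
# broadcast the first non-NaN hiqual of each ID group to all rows of that ID
df['hiqual'] = df.groupby('ID')['hiqual'].transform('first')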

Applying a function to chunks of the Dataframe

I have a DataFrame (df) (for instance, this simplified version):
A B
0 2.0 3.0
1 3.0 4.0
and generated 20 bootstrap resamples that are all now in the same df but differ in the Resample Nr.
A B
0 1 0 2.0 3.0
1 1 1 3.0 4.0
2 2 1 3.0 4.0
3 2 1 3.0 4.0
.. ..
.. ..
39 20 0 2.0 3.0
40 20 0 2.0 3.0
Now I want to apply a certain function to each Resample Nr. Say:
C = sum(df['A'] * df['B']) / sum(df['B'] ** 2)
The output would look like this:
A B C
0 1 0 2.0 3.0 Calculated Value X1
1 1 1 3.0 4.0 Calculated Value X1
2 2 1 3.0 4.0 Calculated Value X2
3 2 1 3.0 4.0 Calculated Value X2
.. ..
.. ..
39 20 0 2.0 3.0 Calculated Value X20
40 20 0 2.0 3.0 Calculated Value X20
So there are 20 different new values.
I know there is the df.iloc command where I can specify my row selection, df.iloc[row, column], but I would like to find a command where I don't have to repeat the code for the 20 samples.
My goal is to find a command that identifies the Resample Nr. automatically and then calculates the function for each Resample Nr.
How can I do this?
Thank you!
Use DataFrame.assign to create two new columns x and y that correspond to df['A'] * df['B'] and df['B']**2, then use DataFrame.groupby on Resample Nr. (or level=1) and transform using sum:
s = df.assign(x=df['A'].mul(df['B']), y=df['B']**2)\
      .groupby(level=1)[['x', 'y']].transform('sum')
df['C'] = s['x'].div(s['y'])
Result:
A B C
0 1 0 2.0 3.0 0.720000
1 1 1 3.0 4.0 0.720000
2 2 1 3.0 4.0 0.750000
3 2 1 3.0 4.0 0.750000
39 20 0 2.0 3.0 0.666667
40 20 0 2.0 3.0 0.666667
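If the Resample Nr. lives in an ordinary column rather than an index level, a groupby/apply version of the same computation might look like this (a sketch; the column names 'Resample', 'A', 'B' are assumptions):
import pandas as pd

df = pd.DataFrame({
    'Resample': [1, 1, 2, 2],
    'A': [2.0, 3.0, 3.0, 3.0],
    'B': [3.0, 4.0, 4.0, 4.0],
})

# C = sum(A*B) / sum(B**2), computed once per resample
def c_value(g):
    return (g['A'] * g['B']).sum() / (g['B'] ** 2).sum()

# map each row's Resample Nr. to its group's C value
df['C'] = df['Resample'].map(df.groupby('Resample')[['A', 'B']].apply(c_value))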

Pandas: How to replace values of Nan in column based on another column?

Given that I have a dataset as below:
import math
import numpy as np
import pandas as pd

data = {
    "A": [math.nan, math.nan, 1, math.nan, 2, math.nan, 3, 5],
    "B": np.random.randint(1, 5, size=8)
}
dt = pd.DataFrame(data)
My desired output is: if column A has a NaN, then take the value of column B in the same row, multiply it by 2, and replace the NaN with the result. So, given that, below is my dataset:
A B
NaN 1
NaN 1
1.0 3
NaN 2
2.0 3
NaN 1
3.0 1
5.0 3
My desired output is:
A B
2 1
2 1
1 3
4 2
2 3
2 1
3 1
5 3
My current solution, which does not work, is as below:
dt[pd.isna(dt["A"])]["A"] = dt[pd.isna(dt["A"])]["B"].apply( lambda x:2*x )
print(dt)
In your case, use fillna. Your original attempt fails because chained indexing like dt[mask]["A"] = ... assigns to a temporary copy rather than to the original frame:
dt['A'] = dt['A'].fillna(dt['B'] * 2)
dt
A B
0 2.0 1
1 2.0 1
2 1.0 3
3 4.0 2
4 2.0 3
5 2.0 1
6 3.0 1
7 5.0 3
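An equivalent numpy.where version (same idea, a sketch):
import numpy as np
# keep A where it is present, otherwise take 2*B
dt['A'] = np.where(dt['A'].isna(), dt['B'] * 2, dt['A'])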

pd.NamedAgg overwrites previous columns values

This is the dataframe I used.
token name ltp change
0 12345.0 abc 2.0 NaN
1 12345.0 abc 5.0 1.500000
2 12345.0 abc 3.0 -0.400000
3 12345.0 abc 9.0 2.000000
4 12345.0 abc 5.0 -0.444444
5 12345.0 abc 16.0 2.200000
6 6789.0 xyz 1.0 NaN
7 6789.0 xyz 5.0 4.000000
8 6789.0 xyz 3.0 -0.400000
9 6789.0 xyz 13.0 3.333333
10 6789.0 xyz 9.0 -0.307692
11 6789.0 xyz 20.0 1.222222
While trying to solve this question, I encountered this weird behaviour of pd.NamedAgg.
#Worked as intended
df.groupby('name').agg(pos=pd.NamedAgg(column='change', aggfunc=lambda x: x.gt(0).sum()),
                       neg=pd.NamedAgg(column='change', aggfunc=lambda x: x.lt(0).sum()))
# Output
pos neg
name
abc 3.0 2.0
xyz 3.0 2.0
When doing it over specific column
df.groupby('name')['change'].agg(pos=pd.NamedAgg(column='change', aggfunc=lambda x: x.gt(0).sum()),
                                 neg=pd.NamedAgg(column='change', aggfunc=lambda x: x.lt(0).sum()))
#Output
pos neg
name
abc 2.0 2.0
xyz 2.0 2.0
pos columns values are over-written with neg column values.
Another example below:
df.groupby('name')['change'].agg(pos=pd.NamedAgg(column='change', aggfunc=lambda x: x.gt(0).sum()),
                                 neg=pd.NamedAgg(column='change', aggfunc=lambda x: x.sum()))
#Output
pos neg
name
abc 4.855556 4.855556
xyz 7.847863 7.847863
Even weirder results:
df.groupby('name')['change'].agg(pos=pd.NamedAgg(column='change', aggfunc=lambda x: x.gt(0).sum()),
                                 neg=pd.NamedAgg(column='change', aggfunc=lambda x: x.sum()),
                                 max=pd.NamedAgg(column='ltp', aggfunc='max'))
# I'm aggregating the Series 'change' but passed column='ltp', which should
# raise KeyError: "Column 'ltp' does not exist!" - yet it produces results as follows:
pos neg max
name
abc 4.855556 4.855556 2.2
xyz 7.847863 7.847863 4.0
The same problem occurs when using it with a pd.Series:
s = pd.Series([1,1,2,2,3,3,4,5])
s.groupby(s.values).agg(one = pd.NamedAgg(column='new',aggfunc='sum'))
one
1 2
2 4
3 6
4 4
5 5
Shouldn't it raise a KeyError?
Some more weird results: the values of the one column are not over-written when we use different column names.
s.groupby(s.values).agg(one=pd.NamedAgg(column='anything', aggfunc='sum'),
                        second=pd.NamedAgg(column='something', aggfunc='max'))
one second
1 2 1
2 4 2
3 6 3
4 4 4
5 5 5
Values are over-written when we use the same column name in pd.NamedAgg:
s.groupby(s.values).agg(one=pd.NamedAgg(column='weird', aggfunc='sum'),
                        second=pd.NamedAgg(column='weird', aggfunc='max'))
one second # Values of column `one` are over-written
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
My pandas version
pd.__version__
# '1.0.3'
From the pandas documentation:
Named aggregation is also valid for Series groupby aggregations. In this case there’s no column selection, so the values are just the functions.
In [82]: animals.groupby("kind").height.agg(
....: min_height='min',
....: max_height='max',
....: )
....:
Out[82]:
min_height max_height
kind
cat 9.1 9.5
dog 6.0 34.0
But I couldn't find out why using it with a column produces weird results.
UPDATE: A bug report was filed by jezrael in GitHub issue #34380.
EDIT: This is a bug confirmed by pandas-dev, and it has been resolved in PR #30858, "BUG: aggregations were getting overwritten if they had the same name".
If there are specified columns after the groupby, use the solution described in this paragraph of the documentation:
Named aggregation is also valid for Series groupby aggregations. In this case there's no column selection, so the values are just the functions.
df = df.groupby('name')['change'].agg(pos=lambda x: x.gt(0).sum(),
                                      neg=lambda x: x.lt(0).sum())
print (df)
pos neg
name
abc 3.0 2.0
xyz 3.0 2.0
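Equivalently, on the frame-level groupby, the plain tuple form of named aggregation sidesteps the Series pitfall (same data, same results):
df.groupby('name').agg(pos=('change', lambda x: x.gt(0).sum()),
                       neg=('change', lambda x: x.lt(0).sum()))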
As for why using it with a column produces weird results: I think it is a bug; instead of producing wrong output it should raise an error.

How to find the cumsum of a column?

dict={"asset":["S3","S2","E4","E1","A6","A8"],
"Rank":[1,2,3,4,5,6],"number_of_attributes":[2,1,2,2,1,1],
"number_of_cards":[1,2,2,1,2," "],"cards_plus1":[2,3,3,2,3," "]}
dframe=pd.DataFrame(dict,index=[1,2,3,4,5,6],
columns=["asset","Rank","number_of_attributes","number_of_cards","cards_plus1"])
I want to compute the cumulative sum of the column "cards_plus1". How can I do this?
The output of the cumsum column should be this:
0
2
5
8
10
13
I want to start with zero instead of 2. I want this output: cards_plus1_cumsum = 0, 2, 5, 8, 10, 13.
We can just pad a zero before the sums:
import numpy as np
dframe["cumsum"] = np.pad(dframe["cards_plus1"][:-1].cumsum(), (1, 0), 'constant')
Try this. First, replace the blank values with NaN:
import pandas as pd
import numpy as np

data = {"asset": ["S3","S2","E4","E1","A6","A8"],
        "Rank": [1,2,3,4,5,6],
        "number_of_attributes": [2,1,2,2,1,1],
        "number_of_cards": [1,2,2,1,2," "],
        "cards_plus1": [2,3,3,2,3," "]}
dframe = pd.DataFrame(data, index=[1,2,3,4,5,6],
                      columns=["asset","Rank","number_of_attributes","number_of_cards","cards_plus1"])

## replace blank values with NaN
dframe.replace(r'^\s*$', np.nan, regex=True, inplace=True)
print(dframe)
>>> asset Rank number_of_attributes number_of_cards cards_plus1
1 S3 1 2 1.0 2.0
2 S2 2 1 2.0 3.0
3 E4 3 2 2.0 3.0
4 E1 4 2 1.0 2.0
5 A6 5 1 2.0 3.0
6 A8 6 1 NaN NaN
Now the data type of the cards_plus1 column is object - change it to numeric:
### convert data type of the cards_plus1 to numeric
dframe['cards_plus1'] = pd.to_numeric(dframe['cards_plus1'])
Now calculate the cumulative sum:
### now we can calculate cumsum
dframe['cards_plus1_cumsum'] = dframe['cards_plus1'].cumsum()
print(dframe)
>>>
asset Rank number_of_attributes number_of_cards cards_plus1 \
1 S3 1 2 1.0 2.0
2 S2 2 1 2.0 3.0
3 E4 3 2 2.0 3.0
4 E1 4 2 1.0 2.0
5 A6 5 1 2.0 3.0
6 A8 6 1 NaN NaN
cards_plus1_cumsum
1 2.0
2 5.0
3 8.0
4 10.0
5 13.0
6 NaN
Instead of replacing the blank values with NaN, you can replace them with zero, depending on what you want. Hope this helped.
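For the zero variant, the same replace call works (a sketch; the blank row then contributes 0 to the cumulative sum instead of NaN):
dframe.replace(r'^\s*$', 0, regex=True, inplace=True)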
