How to calculate a groupby mean and variance in a pandas DataFrame? - python

I have a DataFrame and I want to calculate the mean and the variance for each row for each person. There is also a Date column, and chronological order must be respected when calculating the mean and the variance; the dataframe is already sorted by date. The dates are just the number of days after the earliest date. The mean for a person's earliest date is simply the value in the Points column, and the variance should be NaN or 0. Then, for the second date, the mean should be the mean of the Points values for that date and the previous one, and so on for later dates. Here is my code to generate the dataframe:
import pandas as pd
import numpy as np
data=[["Al",0, 12],["Bob",2, 10],["Carl",5, 12],["Al",5, 5],["Bob",9, 2]
,["Al",22, 4],["Bob",22, 16],["Carl",33, 2],["Al",45, 7],["Bob",68, 4]
,["Al",72, 11],["Bob",79, 5]]
df= pd.DataFrame(data, columns=["Name", "Date", "Points"])
print(df)
Name Date Points
0 Al 0 12
1 Bob 2 10
2 Carl 5 12
3 Al 5 5
4 Bob 9 2
5 Al 22 4
6 Bob 22 16
7 Carl 33 2
8 Al 45 7
9 Bob 68 4
10 Al 72 11
11 Bob 79 5
Here is my code to obtain the mean and the variance:
df['Mean'] = df.apply(
lambda x: df[(df.Name == x.Name) & (df.Date < x.Date)].Points.mean(),
axis=1)
df['Variance'] = df.apply(
lambda x: df[(df.Name == x.Name)& (df.Date < x.Date)].Points.var(),
axis=1)
However, the mean is shifted by one row and the variance by two rows. The dataframe obtained when sorted by Name and Date is:
Name Date Points Mean Variance
0 Al 0 12 NaN NaN
3 Al 5 5 12.000000 NaN
5 Al 22 4 8.50000 24.500000
8 Al 45 7 7.000000 19.000000
10 Al 72 11 7.000000 12.666667
1 Bob 2 10 NaN NaN
4 Bob 9 2 10.000000 NaN
6 Bob 22 16 6.000000 32.000000
9 Bob 68 4 9.333333 49.333333
11 Bob 79 5 8.000000 40.000000
2 Carl 5 12 NaN NaN
7 Carl 33 2 12.000000 NaN
Instead, the dataframe should be as below:
Name Date Points Mean Variance
0 Al 0 12 12 NaN
3 Al 5 5 8.5 24.5
5 Al 22 4 7 19
8 Al 45 7 7 12.67
10 Al 72 11 7.8 ...
1 Bob 2 10 10 NaN
4 Bob 9 2 6 32
6 Bob 22 16 9.33 49.33
9 Bob 68 4 8 40
11 Bob 79 5 7.4 ...
2 Carl 5 12 12 NaN
7 Carl 33 2 7 50
What should I change?
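One possible fix, offered as a sketch rather than a definitive answer: the filter `df.Date < x.Date` leaves out the current row, which is why every result lags behind; `<=` would include it. The row-by-row `apply` can also be replaced by a per-person expanding window, as below. This assumes the frame is already sorted by Date within each Name (as the question states) and that the sample variance (pandas' default, `ddof=1`) is the one wanted.

```python
import pandas as pd

data = [["Al", 0, 12], ["Bob", 2, 10], ["Carl", 5, 12], ["Al", 5, 5], ["Bob", 9, 2],
        ["Al", 22, 4], ["Bob", 22, 16], ["Carl", 33, 2], ["Al", 45, 7], ["Bob", 68, 4],
        ["Al", 72, 11], ["Bob", 79, 5]]
df = pd.DataFrame(data, columns=["Name", "Date", "Points"])

# Expanding window per person: each row sees its own value and all earlier ones
grouped = df.groupby("Name")["Points"]

# groupby(...).expanding() returns a (Name, original index) MultiIndex;
# dropping the Name level lets the result align with df's own index
df["Mean"] = grouped.expanding().mean().droplevel(0)
df["Variance"] = grouped.expanding().var().droplevel(0)

print(df.sort_values(["Name", "Date"]))
```

The first row of each person keeps its own Points value as the mean and gets NaN as the variance, because a sample variance needs at least two observations.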

Related

How to pivot a DataFrame creating new columns, considering the max item repeated

I have the following pd.DataFrame:
Index ID Name Date Days
1 1 Josh 5-1-20 10
2 1 Josh 9-1-20 10
3 1 Josh 19-1-20 6
4 2 Mike 1-1-20 10
5 3 George 1-4-20 10
6 4 Rose 1-2-20 10
7 4 Rose 11-5-20 5
8 5 Mark 1-9-20 10
9 6 Joe 1-4-21 10
10 7 Jill 1-1-21 10
I need to build a DataFrame where the ID is not repeated. To do that, I want to create new columns (Date and Days), sized for the ID with the most repetitions (3 in this case).
The desired output is the following DataFrame:
Index ID Name Date 1 Date 2 Date 3 Days1 Days2 Days3
1 1 Josh 5-1-20 9-1-20 19-1-20 10 10 6
2 2 Mike 1-1-20 10
3 3 George 1-4-20 10
4 4 Rose 1-2-20 11-5-20 10 5
5 5 Mark 1-9-20 10
6 6 Joe 1-4-21 10
7 7 Jill 1-1-21 10
Try:
df_out = df.set_index(['ID','Name',df.groupby('ID').cumcount()+1]).unstack()
df_out.columns = [f'{i} {j}' for i, j in df_out.columns]
df_out.fillna('').reset_index()
Output:
ID Name Index 1 Index 2 Index 3 Date 1 Date 2 Date 3 Days 1 Days 2 Days 3
0 1 Josh 1.0 2.0 3.0 5-1-20 9-1-20 19-1-20 10.0 10.0 6.0
1 2 Mike 4.0 1-1-20 10.0
2 3 George 5.0 1-4-20 10.0
3 4 Rose 6.0 7.0 1-2-20 11-5-20 10.0 5.0
4 5 Mark 8.0 1-9-20 10.0
5 6 Joe 9.0 1-4-21 10.0
6 7 Jill 10.0 1-1-21 10.0
Here is a solution using pivot with a helper column:
df2 = (df
.assign(col=df.groupby('ID').cumcount().add(1).astype(str))
.pivot(index=['ID','Name'], columns='col', values=['Date', 'Days'])
.fillna('')
)
df2.columns = df2.columns.map('_'.join)
df2.reset_index()
Output:
ID Name Date_1 Date_2 Date_3 Days_1 Days_2 Days_3
0 1 Josh 5-1-20 9-1-20 19-1-20 10 10 6
1 2 Mike 1-1-20 10
2 3 George 1-4-20 10
3 4 Rose 1-2-20 11-5-20 10 5
4 5 Mark 1-9-20 10
5 6 Joe 1-4-21 10
6 7 Jill 1-1-21 10
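Both answers hinge on `groupby('ID').cumcount()`, which numbers the rows within each ID; that counter becomes the suffix of the new columns. The sketch below shows the counter on a shortened version of the sample data. Leaving out the question's `Index` column is an assumption made here so that no `Index 1`, `Index 2`, ... columns appear in the result.

```python
import pandas as pd

# Shortened sample from the question, without the 'Index' column
df = pd.DataFrame({
    "ID": [1, 1, 1, 2, 4, 4],
    "Name": ["Josh", "Josh", "Josh", "Mike", "Rose", "Rose"],
    "Date": ["5-1-20", "9-1-20", "19-1-20", "1-1-20", "1-2-20", "11-5-20"],
    "Days": [10, 10, 6, 10, 10, 5],
})

# 1 for the first row of an ID, 2 for the second, and so on
occurrence = df.groupby("ID").cumcount() + 1

# Move the counter into the index, then unstack it into columns
wide = df.set_index(["ID", "Name", occurrence]).unstack()
wide.columns = [f"{name} {n}" for name, n in wide.columns]
wide = wide.reset_index()
print(wide)
```

The number of new columns follows the largest counter value, so an ID with more repetitions simply produces more `Date n` / `Days n` columns.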

How to use each vector entry to fill NAN's of a separate groups in a dataframe

Say I have a vector valsHR which looks like this:
valsHR=[78.8, 82.3, 91.0]
And I have a dataframe mainData:
Age Patient HR
21 1 NaN
21 1 NaN
21 1 NaN
30 2 NaN
30 2 NaN
24 3 NaN
24 3 NaN
24 3 NaN
I want to fill the NaNs so that the first value in valsHR will only fill in the NaNs for patient 1, the second will fill the NaNs for patient 2 and the third will fill in for patient 3.
So far I've tried using this:
mainData['HR'] = mainData['HR'].fillna(valsHR)
but it fills all the NaNs with the first value in the vector.
I've also tried this:
mainData['HR'] = mainData.groupby('Patient').fillna(valsHR)
but it fills the NaNs with values that aren't in the valsHR vector at all.
I was wondering if anyone knew a way to do this?
Create a dictionary keyed by the Patient values that have missing HR, map it onto the Patient column, and replace only the missing values:
print (df)
Age Patient HR
0 21 1 NaN
1 21 1 NaN
2 21 1 NaN
3 30 2 100.0 <- value is not replaced
4 30 2 NaN
5 24 3 NaN
6 24 3 NaN
7 24 3 NaN
p = df.loc[df.HR.isna(), 'Patient'].unique()
valsHR = [78.8, 82.3, 91.0]
df['HR'] = df['HR'].fillna(df['Patient'].map(dict(zip(p, valsHR))))
print (df)
Age Patient HR
0 21 1 78.8
1 21 1 78.8
2 21 1 78.8
3 30 2 100.0
4 30 2 82.3
5 24 3 91.0
6 24 3 91.0
7 24 3 91.0
If some groups have no NaNs:
print (df)
Age Patient HR
0 21 1 NaN
1 21 1 NaN
2 21 1 NaN
3 30 2 100.0 <- group 2 is not replaced
4 30 2 100.0 <- group 2 is not replaced
5 24 3 NaN
6 24 3 NaN
7 24 3 NaN
p = df.loc[df.HR.isna(), 'Patient'].unique()
valsHR = [78.8, 82.3, 91.0]
df['HR'] = df['HR'].fillna(df['Patient'].map(dict(zip(p, valsHR))))
print (df)
Age Patient HR
0 21 1 78.8
1 21 1 78.8
2 21 1 78.8
3 30 2 100.0
4 30 2 100.0
5 24 3 82.3
6 24 3 82.3
7 24 3 82.3
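Note that in the second case patient 3 receives 82.3, the second entry of valsHR, because `p` only lists patients that still have NaNs. If each entry of valsHR is instead meant to belong to a fixed patient, a sketch like the one below keeps that pairing. The assumption, not stated in the question, is that `valsHR[i]` belongs to the i-th distinct patient in order of appearance.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Age": [21, 21, 21, 30, 30, 24, 24, 24],
    "Patient": [1, 1, 1, 2, 2, 3, 3, 3],
    "HR": [np.nan, np.nan, np.nan, 100.0, 100.0, np.nan, np.nan, np.nan],
})
valsHR = [78.8, 82.3, 91.0]

# Assumption: one value per distinct patient, in order of first appearance
mapping = dict(zip(df["Patient"].unique(), valsHR))

# Existing HR values are kept; only NaNs take the mapped value
df["HR"] = df["HR"].fillna(df["Patient"].map(mapping))
print(df)
```

Here patient 2 keeps its recorded 100.0 and patient 3 is filled with 91.0 rather than 82.3.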
If all the NaNs should be replaced, it is simply a mapping:
import pandas as pd
from io import StringIO
valsHR=[78.8, 82.3, 91.0]
vals = {i:k for i,k in enumerate(valsHR, 1)}
df = pd.read_csv(StringIO("""Age Patient
21 1
21 1
21 1
30 2
30 2
24 3
24 3
24 3"""), sep=r"\s+")
df["HR"] = df["Patient"].map(vals)
>>> df
Age Patient HR
0 21 1 78.8
1 21 1 78.8
2 21 1 78.8
3 30 2 82.3
4 30 2 82.3
5 24 3 91.0
6 24 3 91.0
7 24 3 91.0

Add the values of several columns when the number of columns exceeds 3 - Pandas

I have a pandas dataframe with several columns of dates, invoice numbers and bill amounts. I would like to add the amounts of all invoices after the 3rd one to the 3rd one, and replace that invoice's number with 1111.
Here is an example:
ID customer  Bill1  Date 1      ID Bill 1  Bill2  Date 2      ID Bill 2  Bill3  Date3       ID Bill 3  Bill4  Date 4      ID Bill 4  Bill5  Date 5      ID Bill 5
4            6      2000-10-04  1          45     2000-11-05  2          51     1999-12-05  3          23     2001-11-23  6          76     2011-08-19  12
6            8      2016-05-03  7          39     2017-08-09  8          38     2018-07-14  17         21     2009-05-04  9          Nan    Nan         Nan
12           14     2016-11-16  10         73     2017-05-04  15         Nan    Nan         Nan        Nan    Nan         Nan        Nan    Nan         Nan
And I would like to get this:
ID customer  Bill1  Date 1      ID Bill 1  Bill2  Date 2      ID Bill 2  Bill3  Date3       ID Bill 3
4            6      2000-10-04  1          45     2000-11-05  2          150    1999-12-05  1111
6            8      2016-05-03  7          39     2017-08-09  8          59     2018-07-14  1111
12           14     2016-11-16  10         73     2017-05-04  15         Nan    Nan         Nan
This example is only a sample of my data; I may have many more than 5 bills per customer.
Thanks for your help
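With a little data manipulation, you should be able to do it like this:
import numpy as np
import pandas as pd

# Turn the 'Nan' placeholder strings into real missing values
df = df.replace('Nan', np.nan)
idx_col_bill3 = 7      # position of the 'Bill3' column
step = 3               # each bill spans 3 columns: amount, date, ID
idx_col_bill3_id = 10  # keep the columns up to and including 'ID Bill 3'
cols = df.columns
# Sum Bill3, Bill4, Bill5, ... row by row; missing values are skipped
bills = df[cols[idx_col_bill3::step]].apply(pd.to_numeric).sum(axis=1)
# A row with no bill from the 3rd one onwards sums to 0; restore NaN there
bills = bills.replace(0, np.nan)
df = df[cols[:idx_col_bill3_id]].copy()
df['Bill3'] = bills
# Relabel the merged invoice wherever a 3rd invoice exists
df.loc[df['ID Bill 3'].notna(), 'ID Bill 3'] = 1111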
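To check the idea end to end, here is a self-contained sketch on the sample data from the question. It differs slightly from the answer above: it uses `sum(..., min_count=1)` so that a row with no bills at all stays NaN without a separate replace step. The column positions (7 for `Bill3`, 10 columns kept) are assumptions tied to this exact layout.

```python
import numpy as np
import pandas as pd

columns = ["ID customer",
           "Bill1", "Date 1", "ID Bill 1",
           "Bill2", "Date 2", "ID Bill 2",
           "Bill3", "Date3", "ID Bill 3",
           "Bill4", "Date 4", "ID Bill 4",
           "Bill5", "Date 5", "ID Bill 5"]
rows = [
    [4, 6, "2000-10-04", 1, 45, "2000-11-05", 2, 51, "1999-12-05", 3,
     23, "2001-11-23", 6, 76, "2011-08-19", 12],
    [6, 8, "2016-05-03", 7, 39, "2017-08-09", 8, 38, "2018-07-14", 17,
     21, "2009-05-04", 9, "Nan", "Nan", "Nan"],
    [12, 14, "2016-11-16", 10, 73, "2017-05-04", 15] + ["Nan"] * 9,
]
df = pd.DataFrame(rows, columns=columns).replace("Nan", np.nan)

keep = 10                     # columns up to and including 'ID Bill 3'
bill_cols = df.columns[7::3]  # Bill3, Bill4, Bill5, ... (every 3rd column)

# min_count=1 keeps the total as NaN when every bill in the row is missing
total = df[bill_cols].apply(pd.to_numeric).sum(axis=1, min_count=1)

out = df.iloc[:, :keep].copy()
out["Bill3"] = total
# Only rows that actually have a 3rd invoice get the 1111 label
out.loc[out["ID Bill 3"].notna(), "ID Bill 3"] = 1111
print(out)
```

Because the slice `7::3` runs to the end of the columns, the same code covers frames with more than five bills, provided each bill still occupies three consecutive columns.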

How to move every element in a column by n range in a dataframe using python?

I have a dataframe df that looks like below:
No A B value
1 23 36 1
2 45 23 1
3 34 12 2
4 22 76 NaN
...
I would like to shift each value in the "value" column down by 2 rows more than the previous one, so that consecutive values end up 3 rows apart. The "value" in the first row should not be shifted.
I have already tried the normal shift, which directly shifts everything by 2.
df['value']=df['value'].shift(2)
I expect the result below:
No A B value
1 23 36 1
2 45 23 Nan
3 34 12 Nan
4 22 76 1
5 10 12 Nan
6 34 2 Nan
7 21 11 2
...
In your case, build a Series whose index is spaced 3 apart and let pandas align it on assignment:
df['Newvalue']=pd.Series(df.value.values,index=np.arange(len(df))*3)
df
Out[41]:
No A B value Newvalue
0 1 23 36 1.0 1.0
1 2 45 23 1.0 NaN
2 3 34 12 2.0 NaN
3 4 22 76 NaN 1.0
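The answer works through index alignment: the new Series places the i-th value at label 3*i, and assigning it to a column keeps only the labels that exist in df. A minimal sketch with seven rows follows. The A and B values for rows 5 to 7 come from the expected output in the question; the NaN "value" entries for those rows are an assumption, since the question elides them. The frame is assumed to have the default 0..n-1 index.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "No": [1, 2, 3, 4, 5, 6, 7],
    "A": [23, 45, 34, 22, 10, 34, 21],
    "B": [36, 23, 12, 76, 12, 2, 11],
    # Rows 5-7 are assumed to be NaN; the question does not show them
    "value": [1, 1, 2, np.nan, np.nan, np.nan, np.nan],
})

# The i-th value is relabelled to index 3*i: 0, 3, 6, 9, ...
spread = pd.Series(df["value"].to_numpy(), index=np.arange(len(df)) * 3)

# Assignment aligns on the index, so labels beyond the last row are dropped
df["Newvalue"] = spread
print(df)
```

Values whose new label falls past the end of the frame are lost, so the frame needs roughly three times as many rows as there are values to keep.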

how to groupby hour in a pandas multiindex

I have a pandas DataFrame with a two-level MultiIndex: a Date level and a Gender level. It looks like this:
Division North South West East
Date Gender
2016-05-16 19:00:00 F 0 2 3 3
M 12 15 12 12
2016-05-16 20:00:00 F 12 9 11 11
M 10 13 8 9
2016-05-16 21:00:00 F 9 4 7 1
M 5 1 12 10
Now, if I want to find the average values for each hour, I know I can do:
df.groupby(df.index.hour).mean()
but this does not seem to work with a MultiIndex. I found that I could reach the Date level like this:
df.groupby(df.index.get_level_values('Date').hour).mean()
which averages over the 24 hours in a day, but I lose track of the Gender level.
So my question is: how can I find the average hourly values for each Division by Gender?
I think you can pass the name of the MultiIndex level alongside the hours; this needs pandas 0.20.1+:
df1 = df.groupby([df.index.get_level_values('Date').hour,'Gender']).mean()
print (df1)
North South West East
Date Gender
19 F 0 2 3 3
M 12 15 12 12
20 F 12 9 11 11
M 10 13 8 9
21 F 9 4 7 1
M 5 1 12 10
Another solution:
df1 = df.groupby([df.index.get_level_values('Date').hour,
df.index.get_level_values('Gender')]).mean()
print (df1)
North South West East
Date Gender
19 F 0 2 3 3
M 12 15 12 12
20 F 12 9 11 11
M 10 13 8 9
21 F 9 4 7 1
M 5 1 12 10
Or simply create columns from MultiIndex:
df = df.reset_index()
# numeric_only=True keeps the leftover datetime 'Date' column out of the result
df1 = df.groupby([df['Date'].dt.hour, 'Gender']).mean(numeric_only=True)
print (df1)
North South West East
Date Gender
19 F 0 2 3 3
M 12 15 12 12
20 F 12 9 11 11
M 10 13 8 9
21 F 9 4 7 1
M 5 1 12 10
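The sample above covers a single day, so each hour has only one row per gender and the averages equal the inputs. The sketch below adds a hypothetical second day (invented here purely for illustration, with only two of the Division columns) to show the values actually being averaged per hour and Gender.

```python
import pandas as pd

# Hypothetical data: the same hour on two different days
dates = pd.to_datetime(["2016-05-16 19:00", "2016-05-17 19:00"])
index = pd.MultiIndex.from_product([dates, ["F", "M"]], names=["Date", "Gender"])
df = pd.DataFrame(
    {"North": [0, 12, 4, 8], "South": [2, 15, 6, 5]},
    index=index,
)

# Hour of day taken from the Date level; 'Gender' is picked up by level name
hours = df.index.get_level_values("Date").hour
hourly = df.groupby([hours, "Gender"]).mean()
print(hourly)
```

The result is indexed by (hour, Gender), so `hourly.loc[(19, "F")]` returns the 19:00 averages for F across both days.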
