Pandas GroupBy and Calculate Z-Score [duplicate] - python

This question already has an answer here: adding a grouped-by zscore column to a pandas dataframe.
So I have a dataframe that looks like this:
test = pd.DataFrame([[1, 10, 14], [1, 12, 14], [1, 20, 12], [1, 25, 12], [2, 18, 12], [2, 30, 14], [2, 4, 12], [2, 10, 14]], columns=['A', 'B', 'C'])
A B C
0 1 10 14
1 1 12 14
2 1 20 12
3 1 25 12
4 2 18 12
5 2 30 14
6 2 4 12
7 2 10 14
My goal is to get the z-scores of column B relative to their groups, where the groups are defined by columns A and C. I know I can calculate the mean and standard deviation of each group:
test.groupby(['A', 'C']).mean()
B
A C
1 12 22.5
14 11.0
2 12 11.0
14 20.0
test.groupby(['A', 'C']).std()
B
A C
1 12 3.535534
14 1.414214
2 12 9.899495
14 14.142136
Now for every item in column B I want to calculate its z-score based off of these means and standard deviations, so the first result would be (10 - 11) / 1.41. I feel like there has to be a way to do this without too much complexity, but I've been stuck on how to proceed. Let me know if anyone can point me in the right direction or if I need to clarify anything!

Do it with transform:
Mean = test.groupby(['A', 'C']).B.transform('mean')
Std = test.groupby(['A', 'C']).B.transform('std')
Then:
(test.B - Mean) / Std
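Putting the two transform calls together as a self-contained sketch (the column name B_zscore is just illustrative):
import pandas as pd

test = pd.DataFrame([[1, 10, 14], [1, 12, 14], [1, 20, 12], [1, 25, 12],
                     [2, 18, 12], [2, 30, 14], [2, 4, 12], [2, 10, 14]],
                    columns=['A', 'B', 'C'])

# Group-wise mean and std, broadcast back to the original shape via transform
grouped = test.groupby(['A', 'C'])['B']
test['B_zscore'] = (test['B'] - grouped.transform('mean')) / grouped.transform('std')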
Or use the zscore function from scipy in one step:
from scipy.stats import zscore
test.groupby(['A', 'C']).B.transform(lambda x: zscore(x, ddof=1))
Out[140]:
0 -0.707107
1 0.707107
2 -0.707107
3 0.707107
4 0.707107
5 0.707107
6 -0.707107
7 -0.707107
Name: B, dtype: float64
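If you want to keep the result as a new column on the dataframe (a sketch; the column name B_zscore is just illustrative):
test['B_zscore'] = test.groupby(['A', 'C']).B.transform(lambda x: zscore(x, ddof=1))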
OK, to show that my numbers tie out:
(test.B - Mean) / Std == test.groupby(['A', 'C']).B.transform(lambda x: zscore(x, ddof=1))
Out[148]:
0 True
1 True
2 True
3 True
4 True
5 True
6 True
7 True
Name: B, dtype: bool

Related

Reshape data frame, so the index column values become the columns

I want to reshape the data so that the values in the index column become the columns.
My Data frame:
Gender_Male Gender_Female Location_london Location_North Location_South
Cat
V 5 4 4 2 3
W 15 12 12 7 8
X 11 15 16 4 6
Y 22 18 21 9 9
Z 8 7 7 4 4
Desired Data frame:
Is there an easy way to do this? I also have 9 other categorical variables in my data set in addition to the Gender and Location variables. I have only included two variables to keep the example simple.
Code to create the example dataframe:
df1 = pd.DataFrame({
    'Cat': ['V', 'W', 'X', 'Y', 'Z'],
    'Gender_Male': [5, 15, 11, 22, 8],
    'Gender_Female': [4, 12, 15, 18, 7],
    'Location_london': [4, 12, 16, 21, 7],
    'Location_North': [2, 7, 4, 9, 4],
    'Location_South': [3, 8, 6, 9, 4]
}).set_index('Cat')
df1
You can transpose the dataframe and then split and set the new index:
Transpose
dft = df1.T
print(dft)
Cat V W X Y Z
Gender_Male 5 15 11 22 8
Gender_Female 4 12 15 18 7
Location_london 4 12 16 21 7
Location_North 2 7 4 9 4
Location_South 3 8 6 9 4
Split and set the new index
dft.index = dft.index.str.split('_', expand=True)
dft.columns.name = None
print(dft)
V W X Y Z
Gender Male 5 15 11 22 8
Female 4 12 15 18 7
Location london 4 12 16 21 7
North 2 7 4 9 4
South 3 8 6 9 4

Set a new column based on values from previous rows from different column

I am searching for an efficient way to set a new column based on values from previous rows from different columns. Imagine you have this DataFrame:
pd.DataFrame([[0, 22], [1, 15], [2, 18], [3, 9], [4, 10], [6, 11], [8, 12]],
columns=['days', 'quantity'])
days quantity
0 0 22
1 1 15
2 2 18
3 3 9
4 4 10
5 6 11
6 8 12
Now, I want to have a third column 'quantity_3days_ago', like this:
days quantity quantity_3days_ago
0 0 22 NaN
1 1 15 NaN
2 2 18 NaN
3 3 9 22
4 4 10 15
5 6 11 9
6 8 12 10
So I need to use the 'days' column to check what the 'quantity' column says for 3 days ago. If there is no exact value in the 'days' column, I want 'quantity_3days_ago' to be the value of the row before. See the last row as an example: 8 - 3 would be 5, and since there is no row with days equal to 5, I would take the 'quantity' value of the row with days equal to 4 for 'quantity_3days_ago'. I hope this is understandable. I tried using rolling windows and shifting, but I wasn't able to get the desired result. I guess it would probably be possible with a loop over the whole DataFrame, but this would be rather inefficient. I wonder if this can be done in one line. Thanks for your help!
We can do a reindex (to fill in the missing days) before the shift:
rng = range(df.days.iloc[0], df.days.iloc[-1] + 1)
df['new'] = df.days.map(df.set_index('days').reindex(rng, method='ffill')['quantity'].shift(3))
df
Out[125]:
days quantity new
0 0 22 NaN
1 1 15 NaN
2 2 18 NaN
3 3 9 22.0
4 4 10 15.0
5 6 11 9.0
6 8 12 10.0
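An alternative sketch uses pd.merge_asof (assuming days is sorted, as in the example; the names lookup and quantity_3days_ago are just illustrative): shift the lookup key forward by 3 days, then take the most recent available row for each day.
import pandas as pd

df = pd.DataFrame([[0, 22], [1, 15], [2, 18], [3, 9], [4, 10], [6, 11], [8, 12]],
                  columns=['days', 'quantity'])

# Build a lookup table whose key is "the day this quantity becomes 3 days old"
lookup = df.rename(columns={'quantity': 'quantity_3days_ago'})
lookup['days'] = lookup['days'] + 3

# For each day, take the most recent lookup row whose key is <= that day
out = pd.merge_asof(df, lookup, on='days', direction='backward')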

find indices where df value is in bin range of other df

I am trying to create a new column df2["v2"] in a dataframe filled with the values from a different dataframe df1["v1"].
The first dataframe holds values from measurement 1, which are measured at the times stored in df1["T1"]. The second dataframe should now store the values from measurement 1, but with a different time sampling. In the real-world task the time sampling is not evenly spaced (nor monotonically increasing, at least by default).
df1 = pd.DataFrame({"T1": [0, 5, 10, 15], "v1":[0, 1, 2, 3]})
df2 = pd.DataFrame({"T2": np.arange(0, 15)})
A stupid way of doing this could be:
df2["v2"] = pd.Series()
for n in range(df1["T1"].size-1):
t1 = df1["T1"].iloc[n]
t2 = df1["T1"].iloc[n+1]
mask = (t1 <= df2["T2"]) & (df2["T2"] < t2)
df2["v2"].loc[mask]= df1["v1"].iloc[n]
The resulting dataframe should look like this:
T2 v2
0 0 0.0
1 1 0.0
2 2 0.0
3 3 0.0
4 4 0.0
5 5 1.0
6 6 1.0
7 7 1.0
8 8 1.0
9 9 1.0
10 10 2.0
11 11 2.0
12 12 2.0
13 13 2.0
14 14 2.0
What's the fastest/most elegant way of achieving the same?
Here is one way of solving the problem with pd.cut:
bins = pd.cut(df1['T1'], df1['T1'], right=False)
mapping = df1[:-1].set_index(bins[:-1])['v1']
df2['v2'] = df2['T2'].map(mapping)
Details:
Categorize the values in column T1 into discrete intervals characterised by the column T1 itself:
>>> bins
0 [0.0, 5.0)
1 [5.0, 10.0)
2 [10.0, 15.0)
3 NaN
Name: T1, dtype: category
Categories (3, interval[int64]): [[0, 5) < [5, 10) < [10, 15)]
Create a mapping series:
>>> mapping
T1
[0, 5) 0
[5, 10) 1
[10, 15) 2
Name: v1, dtype: int64
Map the values in column T2 with the help of the above mapping series:
>>> df2
T2 v2
0 0 0
1 1 0
2 2 0
3 3 0
4 4 0
5 5 1
6 6 1
7 7 1
8 8 1
9 9 1
10 10 2
11 11 2
12 12 2
13 13 2
14 14 2
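An alternative sketch with np.searchsorted (assuming T1 is sorted and every T2 is at least as large as the first T1, as in the example): for each T2, take the v1 of the last T1 that is not greater than it.
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"T1": [0, 5, 10, 15], "v1": [0, 1, 2, 3]})
df2 = pd.DataFrame({"T2": np.arange(0, 15)})

# Position of the last T1 <= T2 for every T2, then pick the corresponding v1
idx = np.searchsorted(df1["T1"].to_numpy(), df2["T2"].to_numpy(), side="right") - 1
df2["v2"] = df1["v1"].to_numpy()[idx]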

Is there an opposite function of pandas.DataFrame.droplevel (like keeplevel)?

Is there an opposite function of pandas.DataFrame.droplevel where I can keep some levels of the multi-level index/columns using either the level name or index?
Example:
df = pd.DataFrame([
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16]
], columns=['a', 'b', 'c', 'd']).set_index(['a', 'b', 'c']).T
a 1 5 9 13
b 2 6 10 14
c 3 7 11 15
d 4 8 12 16
Both the following commands can return the following dataframe:
df.droplevel(['a','b'], axis=1)
df.droplevel([0, 1], axis=1)
c 3 7 11 15
d 4 8 12 16
I am looking for a "keeplevel" command such that both the following commands can return the following dataframe:
df.keeplevel(['a','b'], axis=1)
df.keeplevel([0, 1], axis=1)
a 1 5 9 13
b 2 6 10 14
d 4 8 12 16
There is no keeplevel because it would be redundant: in a closed and well-defined set, once you define what you want to drop, you automatically define what you want to keep.
You can get the difference between what you have and what droplevel returns:
def keeplevel(df, levels, axis=1):
    return df.droplevel(df.axes[axis].droplevel(levels).names, axis=axis)
>>> keeplevel(df, [0, 1])
a 1 5 9 13
b 2 6 10 14
d 4 8 12 16
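The same helper should also work with level names rather than positions (not shown in the original answer, but MultiIndex.droplevel accepts names as well), e.g. keeplevel(df, ['a', 'b']) returns the same frame as above.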
Using set to find the difference:
df.droplevel(list(set(df.columns.names) - set(['a', 'b'])), axis=1)
Out[134]:
a 1 5 9 13
b 2 6 10 14
d 4 8 12 16
You can modify the Index objects directly, which should be fast. Note that this will even modify the frame in place.
def keep_level(df, keep, axis):
    idx = pd.MultiIndex.from_arrays([df.axes[axis].get_level_values(x) for x in keep])
    # Note: newer pandas removed set_axis(..., inplace=True); assign the result instead there
    df.set_axis(idx, axis=axis, inplace=True)
    return df
keep_level(df.copy(), ['a', 'b'], 1) # Copy to not modify original for illustration
#a 1 5 9 13
#b 2 6 10 14
#d 4 8 12 16
keep_level(df.copy(), [0, 1], 1)
#a 1 5 9 13
#b 2 6 10 14
#d 4 8 12 16

Pandas groupby aggregate to new columns

I have a DataFrame that looks something like this:
A B C D
1 10 22 14
1 12 20 37
1 11 8 18
1 10 10 6
2 11 13 4
2 12 10 12
3 14 0 5
and a function that looks something like this (NOTE: it's actually doing something more complex that can't be easily separated into three independent calls, but I'm simplifying for clarity):
def myfunc(g):
    return min(g), mean(g), max(g)
I want to use groupby on A with myfunc to get an output on columns B and C (ignoring D) something like this:
B C
min mean max min mean max
A
1 10 10.75 12 8 15.0 22
2 11 11.50 12 10 11.5 13
3 14 14.00 14 0 0.0 0
I can do the following:
df2.groupby('A')[['B','C']].agg(
    {
        'min': lambda g: myfunc(g)[0],
        'mean': lambda g: myfunc(g)[1],
        'max': lambda g: myfunc(g)[2]
    })
But then—aside from this being ugly and calling myfunc multiple times—I end up with
max mean min
B C B C B C
A
1 12 22 10.75 15.0 10 8
2 12 13 11.50 11.5 11 10
3 14 0 14.00 0.0 14 0
I can use .swaplevel(axis=1) to swap the column levels, but even then B and C are in multiple duplicated columns, and with the multiple function calls it feels like barking up the wrong tree.
If you arrange for myfunc to return a DataFrame whose columns are those of the group (here B and C) and whose row index is ['min', 'mean', 'max'], then you could use groupby/apply to call the function (once per group) and concatenate the results as desired:
import numpy as np
import pandas as pd
def myfunc(g):
    result = pd.DataFrame({'min': np.min(g),
                           'mean': np.mean(g),
                           'max': np.max(g)}).T
    return result

df = pd.DataFrame({'A': [1, 1, 1, 1, 2, 2, 3],
                   'B': [10, 12, 11, 10, 11, 12, 14],
                   'C': [22, 20, 8, 10, 13, 10, 0],
                   'D': [14, 37, 18, 6, 4, 12, 5]})
result = df.groupby('A')[['B','C']].apply(myfunc)
result = result.unstack(level=-1)
print(result)
prints
B C
max mean min max mean min
A
1 12.0 10.75 10.0 22.0 15.0 8.0
2 12.0 11.50 11.0 13.0 11.5 10.0
3 14.0 14.00 14.0 0.0 0.0 0.0
For others who may run across this and who do not need a custom function, note
that it behooves you to always use builtin aggregators (below, specified by the
strings 'min', 'mean' and 'max') if possible. They perform better than
custom Python functions. Happily, in this toy problem, it produces the desired result:
In [99]: df.groupby('A')[['B','C']].agg(['min','mean','max'])
Out[99]:
B C
min mean max min mean max
A
1 10 10.75 12 8 15.0 22
2 11 11.50 12 10 11.5 13
3 14 14.00 14 0 0.0 0
Something like this might work:
aggregated = df2.groupby('A')[['B', 'C']].agg(['min', 'mean', 'max'])
Then you could use swaplevel to swap the column levels around if needed, and sort them back into order:
aggregated.columns = aggregated.columns.swaplevel(0, 1)
aggregated = aggregated.sort_index(axis=1, level=0)
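On newer pandas versions (0.25+), named aggregation is another option if flat column names are acceptable instead of the two-level header (a sketch, not from the original answers; the result column names are just illustrative):
out = df.groupby('A').agg(
    B_min=('B', 'min'), B_mean=('B', 'mean'), B_max=('B', 'max'),
    C_min=('C', 'min'), C_mean=('C', 'mean'), C_max=('C', 'max'),
)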
