Standardize variable by group - why is the mean always zero? - python

I have the following data:
import pandas as pd

df = pd.DataFrame({'sound': ['A', 'B', 'B', 'A', 'B', 'A'],
                   'score': [10, 5, 6, 7, 11, 1]})
print(df)
  sound  score
0     A     10
1     B      5
2     B      6
3     A      7
4     B     11
5     A      1
If I standardize (i.e. Z score) the score variable, I get the following values. The mean of the new z column is basically 0, with SD of 1, both of which are expected for a standardized variable:
df['z'] = (df['score'] - df['score'].mean())/df['score'].std()
print(df)
print('Mean: {}'.format(df['z'].mean()))
print('SD: {}'.format(df['z'].std()))
  sound  score         z
0     A     10  0.922139
1     B      5 -0.461069
2     B      6 -0.184428
3     A      7  0.092214
4     B     11  1.198781
5     A      1 -1.567636
Mean: -7.401486830834377e-17
SD: 1.0
However, what I'm actually interested in is calculating Z scores based on group membership (sound). For example, if a score is from sound A, then convert that value to a Z score using the mean and SD of sound A values only. Likewise, sound B Z scores will only use mean and SD from sound B. This will obviously produce different values compared to regular Z score calculation:
df['zg'] = df.groupby('sound')['score'].transform(lambda x: (x - x.mean()) / x.std())
print(df)
print('Mean: {}'.format(df['zg'].mean()))
print('SD: {}'.format(df['zg'].std()))
  sound  score         z        zg
0     A     10  0.922139  0.872872
1     B      5 -0.461069 -0.725866
2     B      6 -0.184428 -0.414781
3     A      7  0.092214  0.218218
4     B     11  1.198781  1.140647
5     A      1 -1.567636 -1.091089
Mean: 3.700743415417188e-17
SD: 0.894427190999916
My question is: why is the mean of the group-based standardized values (zg) also basically equal to 0? Is this expected behaviour or is there an error in my calculation somewhere?
The z scores make sense because standardizing within a variable essentially forces the mean to 0. But the zg values are calculated using different means and SDs for each sound group, so I'm not sure why the mean of that new variable has also been set to 0.
The only situation where I can see this happening is if the sum of the positive values exactly cancels the sum of the negative values, so that they average out to 0. This happens in a regular Z score calculation, but I'm surprised that it also happens when operating across multiple groups like this...

I think it makes perfect sense. If E[abc | def] is the expectation of abc given def, then in df['zg']:
m1 = E['zg' | sound = 'A'] = (0.872872 + 0.218218 -1.091089)/3 ~ 0
m2 = E['zg' | sound = 'B'] = (-0.725866 - 0.414781 + 1.140647)/3 ~ 0
and
E['zg'] = (m1+m2)/2 = (0.872872 + 0.218218 - 1.091089 - 0.725866 - 0.414781 + 1.140647)/6 ~ 0
(the two groups have the same size here, so the overall mean is simply the average of the two group means)

Yes, this is expected behavior.
In fancy words, using the Law of Iterated Expectations,
E[X] = E[ E[X | Y] ]
And specifically, if the groups Y are finite and thus countable,
E[X] = sum_j E[X | Y = Y_j] * P(Y = Y_j)
where the Y_j range over your set G of possible groups.
However, by construction, every E[X | Y_j] is 0 for all values of Y in your set G of possible groups.
Thus, the total average will also be zero.
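A quick empirical check (reusing df and the zg column from the question) makes the same point:

# each sound group was standardized against its own mean and SD,
# so within each group the mean of zg is ~0 and the SD is exactly 1;
# the overall mean is then an average of per-group means that are all ~0
print(df.groupby('sound')['zg'].agg(['mean', 'std']))

As a side note, the overall SD of 0.894 in the question is also expected: each group of 3 contributes a within-group sum of squares of 2, so the pooled sample variance is (2 + 2) / (6 - 1) = 0.8, whose square root is about 0.894.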

Related

Getting conditioning values from a distribution

I am trying to capture the point at which a kernel density estimate becomes almost 0 at the end of its tail. My approach is to evaluate the kernel function on a timeline from -120 to 120, compute the percentage change of the kernel values, and then apply an arbitrary rule: after 10 consecutive negative changes, once the kernel value is almost 0, declare that point as the start of the end of the curve.
Illustration example of the point of the kernel function I want to obtain.
In this illustration the final value I would like to obtain is around 300.
My dataframe looks like this (these are not the same example values as above):
df
id event_time
1 2
1 3
1 3
1 5
1 9
1 10
2 1
2 1
2 2
2 2
2 5
2 5
# my try
import numpy as np
from scipy import stats

def find_value(df):
    if df.shape[0] == 1:
        return df.iloc[0].event_time
    kernel = stats.gaussian_kde(df['event_time'])
    time = list(range(-120, 120))
    a = kernel(time)
    b = np.diff(a) / a[:-1] * 100
So far I have a, which represents the Y axis of the graph, and b, which represents the change in a. I did this to implement the logic described at the beginning, but I don't know how to code the rest of it. After writing the function I was thinking of using a groupby and an apply, roughly as sketched below.
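For what it's worth, here is one hedged sketch of how the described rule could be finished inside find_value; the threshold names n_consec and eps_zero are made up for illustration and would need tuning:

import numpy as np
import pandas as pd
from scipy import stats

def find_value(df, n_consec=10, eps_zero=1e-4):
    # single observation: nothing to estimate, return it directly
    if df.shape[0] == 1:
        return df.iloc[0].event_time
    kernel = stats.gaussian_kde(df['event_time'])
    time = np.arange(-120, 120)
    a = kernel(time)                  # density on the grid (Y axis)
    b = np.diff(a) / a[:-1] * 100     # percentage change of the density
    run = 0
    for i, change in enumerate(b):
        run = run + 1 if change < 0 else 0
        # a run of n_consec negative changes while the density is already ~0
        if run >= n_consec and a[i + 1] < eps_zero:
            return time[i + 1]
    return time[-1]                   # fallback: end of the grid

# applied per id, as suggested at the end of the question:
# result = df.groupby('id').apply(find_value)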

How to include NULL values as zero to variance calculation in Python?

I have a problem calculating variance with "hidden" NULL (zero) values. Usually that wouldn't be a problem, because a NULL value is not a value, but in my case it is essential to include those NULLs as zeros in the variance calculation. So I have a DataFrame that looks like this:
TableA:
A X Y
1 1 30
1 2 20
2 1 15
2 2 20
2 3 20
3 1 30
3 2 35
Then I need to get variance for each different X value and I do this:
TableA.groupby(['X']).agg({'Y':'var'})
But the answer is not what I need, since I would need the variance calculation to also include the NULL (zero) value of Y for X=3 when A=1 and A=3.
What my dataset should look like to get the needed variance results:
A X Y
1 1 30
1 2 20
1 3 0
2 1 15
2 2 20
2 3 20
3 1 30
3 2 35
3 3 0
So I need the variance to take into account that every X should have a Y value for A = 1, 2 and 3, and when there is no value of Y for a certain combination it should be 0. Could you help me with this? How should I change my TableA dataframe to be able to do this, or is there another way?
Desired output for TableA should be like this:
X Y
1 75.000000
2 75.000000
3 133.333333
Compute the variance directly, but divide by the number of different possibilities for A
import numpy as np

# number of different possibilities for A: three in your example, adjust as needed
a_choices = len(TableA['A'].unique())

def variance_with_missing(vals):
    # mean over all a_choices slots, counting the missing ones as 0
    mean_with_missing = np.sum(vals) / a_choices
    ss_present = np.sum((vals - mean_with_missing)**2)
    # each missing slot contributes (0 - mean)**2 to the sum of squares
    ss_missing = (a_choices - len(vals)) * mean_with_missing**2
    return (ss_present + ss_missing) / (a_choices - 1)

TableA.groupby(['X']).agg({'Y': variance_with_missing})
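For reference (this check is not part of the original answer), building TableA from the sample values and running the aggregation with variance_with_missing as defined above reproduces the desired output:

import numpy as np
import pandas as pd

TableA = pd.DataFrame({'A': [1, 1, 2, 2, 2, 3, 3],
                       'X': [1, 2, 1, 2, 3, 1, 2],
                       'Y': [30, 20, 15, 20, 20, 30, 35]})
a_choices = len(TableA['A'].unique())  # 3

print(TableA.groupby(['X']).agg({'Y': variance_with_missing}))
#             Y
# X
# 1   75.000000
# 2   75.000000
# 3  133.333333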
The approach of the solution below is to append the missing (A, X) combinations with Y=0. It's a little messy, but I hope this helps.
import numpy as np
import pandas as pd

TableA = pd.DataFrame({'A': [1, 1, 2, 2, 2, 3, 3],
                       'X': [1, 2, 1, 2, 3, 1, 2],
                       'Y': [30, 20, 15, 20, 20, 30, 35]})
TableA['A'] = TableA['A'].astype(int)

#### Create rows for the non-existing combinations and fill Y with 0 ####
for i in range(1, TableA.X.max() + 1):
    for j in TableA.A.unique():
        if TableA[(TableA.X == i) & (TableA.A == j)].empty:
            # note: DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
            TableA = TableA.append(pd.DataFrame({'A': [j], 'X': [i], 'Y': [0]}),
                                   ignore_index=True)

TableA.groupby('X').agg({'Y': np.var})
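A more compact alternative along the same lines (a sketch, not part of the original answers): build the full (A, X) grid with a MultiIndex, fill the missing Y values with 0, then take the variance per X. It assumes TableA as originally constructed, with unique (A, X) pairs:

import pandas as pd

# full grid of every A combined with every X
full_index = pd.MultiIndex.from_product(
    [TableA['A'].unique(), range(1, TableA['X'].max() + 1)],
    names=['A', 'X'])

padded = (TableA.set_index(['A', 'X'])
                .reindex(full_index, fill_value=0)   # missing combinations get Y=0
                .reset_index())

print(padded.groupby('X')['Y'].var())
# X
# 1     75.000000
# 2     75.000000
# 3    133.333333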

Can we use a pandas data frame to calculate the next value using a previous value? A good example would be the Fibonacci numbers

So I understand we can use a pandas DataFrame to do vectorized operations on cells, like:
df = pd.DataFrame([a, b, c])
df * 3
would equal something like :
0 a*3
1 b*3
2 c*3
but could we use a pandas DataFrame to, say, calculate the Fibonacci sequence?
I am asking this because for the Fibonacci sequence the next number depends on the previous two numbers (F_n = F_(n-1) + F_(n-2)). I am not exactly interested in the Fibonacci sequence itself, and more interested in knowing if we can do something like:
df = pd.DataFrame([a,b,c])
df.apply( some_func )
0 x1 a
1 x2 b
2 x3 c
where x1 would be calculated from a,b,c (I know this is possible), x2 would be calculated from x1 and x3 would be calculated from x2
the Fibonacci example would just be something like :
df = pd.DataFrame()
df.apply(fib(n, df))
0 0
1 1
2 1
3 2
4 3
5 5
.
.
.
n-1 F(n-1) + F(n-2)
You need to iterate through the rows and access the previous rows' data via DataFrame.loc. For example, with n = 6:
df = pd.DataFrame()
for i in range(0, 6):
    df.loc[i, 'f'] = i if i in [0, 1] else df.loc[i - 1, 'f'] + df.loc[i - 2, 'f']
df
f
0 0.0
1 1.0
2 1.0
3 2.0
4 3.0
5 5.0
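Since each value only depends on two plain scalars, a common alternative (a sketch, not from the original answer; fib_frame is just an illustrative name) is to build the sequence in an ordinary Python list first and hand the finished list to pandas, which avoids the relatively slow repeated .loc writes:

import pandas as pd

def fib_frame(n):
    # build the sequence outside pandas, then wrap it in a DataFrame
    vals = [0, 1]
    while len(vals) < n:
        vals.append(vals[-1] + vals[-2])
    return pd.DataFrame({'f': vals[:n]})

print(fib_frame(6))
#    f
# 0  0
# 1  1
# 2  1
# 3  2
# 4  3
# 5  5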

Discard points with X,Y coordinate close to eachother in Dataframe

I have the following dataframe (it is actually several hundred MB long):
X Y Size
0 10 20 5
1 11 21 2
2 9 35 1
3 8 7 7
4 9 19 2
I want to discard any X, Y point that has a Euclidean distance of less than delta=3 from any other X, Y point in the dataframe. In those cases I want to keep only the row with the larger Size.
In this example the intended result would be:
X Y Size
0 10 20 5
2 9 35 1
3 8 7 7
As the question is stated, it is not clear how the desired algorithm should deal with the chaining of distances (A close to B and B close to C, but A not close to C).
If chaining is allowed, one solution is to cluster the dataset using a density-based clustering algorithm such as DBSCAN.
You just need to set the neighbourhood radius eps to delta and the min_samples parameter to 1 to allow isolated points as clusters. Then, you can find in each group which point has the maximum size.
from sklearn.cluster import DBSCAN
X = df[['X', 'Y']]
db = DBSCAN(eps=3, min_samples=1).fit(X)
df['grp'] = db.labels_
df_new = df.loc[df.groupby('grp').idxmax()['Size']]
print(df_new)
>>>
X Y Size grp
0 10 20 5 0
2 9 35 1 1
3 8 7 7 2
You can use the script below and also try improving it.
# get all pairwise euclidean distances using sklearn;
# this creates an array of distances;
# then collect the index pairs in df whose euclidean distance is less than 3
from sklearn.metrics.pairwise import euclidean_distances
Z = df[['X', 'Y']]
euc = euclidean_distances(Z, Z)
idx = [(i, j) for i in range(len(euc)-1) for j in range(i+1, len(euc)) if euc[i, j] < 3]

# collect all indices of df involved in a pair with euc dist < 3 and find the max Size among them;
# keep every row of df NOT involved in such a pair, plus the row with the max Size;
# combine both parts into a new frame called df_new
from itertools import chain
df_idx = list(set(chain(*idx)))
df2 = df.iloc[df_idx]
idx_max = df2[df2['Size'] == df2['Size'].max()].index.tolist()
df_new = pd.concat([df.loc[~df.index.isin(df_idx)], df2.loc[idx_max]])
df_new
df_new
Result:
X Y Size
2 9 35 1
3 8 7 7
0 10 20 5
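If chaining should not be allowed, one hedged sketch (not from the original answers; drop_close_points is a made-up helper name) is a greedy pass in descending Size order using scipy's cKDTree: a point is kept only if no already-kept point lies within delta of it.

import pandas as pd
from scipy.spatial import cKDTree

def drop_close_points(df, delta=3):
    # visit points from largest to smallest Size
    order = df.sort_values('Size', ascending=False)
    tree = cKDTree(order[['X', 'Y']].values)
    kept_labels, kept_pos = [], set()
    for pos, (label, row) in enumerate(order.iterrows()):
        # positions (within `order`) of all points within delta of this one
        neighbours = tree.query_ball_point([row.X, row.Y], r=delta)
        if not any(n in kept_pos for n in neighbours if n != pos):
            kept_labels.append(label)
            kept_pos.add(pos)
    return df.loc[sorted(kept_labels)]

# drop_close_points(df) keeps rows 0, 2 and 3 for the sample data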

Python Data Frame: cumulative sum of column until condition is reached and return the index

I am new to Python and am currently facing an issue I can't solve. I really hope you can help me out. English is not my native language, so I am sorry if I am not able to express myself properly.
Say I have a simple data frame with two columns:
index Num_Albums Num_authors
0 10 4
1 1 5
2 4 4
3 7 1000
4 1 44
5 3 8
Num_Albums_tot = sum(Num_Albums) = 30
I need to do a cumulative sum of the data in Num_Albums until a certain condition is reached. Register the index at which the condition is achieved and get the correspondent value from Num_authors.
Example:
cumulative sum of Num_Albums until the sum equals 50% ± 1/15 of 30 (--> 15±2):
10 = 15±2? No, then continue;
10+1 =15±2? No, then continue
10+1+4 = 15±2? Yes, stop.
Condition reached at index 2. Then get Num_Authors at that index: Num_Authors(2)=4
I would like to see if there's a function already implemented in pandas, before I start thinking how to do it with a while/for loop....
[I would like to specify the column from which I want to retrieve the value at the relevant index (this comes in handy when I have e.g. 4 columns and I want to sum the elements in column 1, and once the condition is achieved, get the corresponding value in column 2; then do the same with columns 3 and 4).]
Opt - 1:
You could compute the cumulative sum using cumsum. Then use np.isclose with its built-in tolerance parameter to check whether the values in this series lie within the specified threshold of 15 +/- 2. This returns a boolean array.
Through np.flatnonzero, return the ordinal values of the indices for which the True condition holds. We select the first instance of a True value.
Finally, use .iloc to retrieve value of the column name you require based on the index computed earlier.
val = np.flatnonzero(np.isclose(df.Num_Albums.cumsum().values, 15, atol=2))[0]
df['Num_authors'].iloc[val] # for faster access, use .iat
4
When performing np.isclose on the series later converted to an array:
np.isclose(df.Num_Albums.cumsum().values, 15, atol=2)
array([False, False, True, False, False, False], dtype=bool)
Opt - 2:
Use pd.Index.get_loc on the cumsum calculated series which also supports a tolerance parameter on the nearest method.
val = pd.Index(df.Num_Albums.cumsum()).get_loc(15, 'nearest', tolerance=2)
df.get_value(val, 'Num_authors')
4
Opt - 3:
Use idxmax to find the first index of a True value for the boolean mask created after sub and abs operations on the cumsum series:
df.get_value(df.Num_Albums.cumsum().sub(15).abs().le(2).idxmax(), 'Num_authors')
4
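One caveat for newer pandas versions (not part of the original answer): DataFrame.get_value has since been removed, and in recent releases Index.get_loc no longer accepts the method/tolerance arguments, so Opt-2 would need Index.get_indexer or a manual nearest lookup. The final value retrieval itself can be done with .at or .iloc instead, assuming df and the imports from Opt-1, for example:

# Opt-3, updated: .at does the single label-based lookup that get_value used to do
df.at[df.Num_Albums.cumsum().sub(15).abs().le(2).idxmax(), 'Num_authors']

# Opt-1's positional result can likewise be read without get_value
val = np.flatnonzero(np.isclose(df.Num_Albums.cumsum().values, 15, atol=2))[0]
df['Num_authors'].iloc[val]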
I think you can directly add a column with the cumulative sum as:
In [3]: df
Out[3]:
index Num_Albums Num_authors
0 0 10 4
1 1 1 5
2 2 4 4
3 3 7 1000
4 4 1 44
5 5 3 8
In [4]: df['cumsum'] = df['Num_Albums'].cumsum()
In [5]: df
Out[5]:
index Num_Albums Num_authors cumsum
0 0 10 4 10
1 1 1 5 11
2 2 4 4 15
3 3 7 1000 22
4 4 1 44 23
5 5 3 8 26
And then apply the condition you want on the cumsum column. For instance you can use where to get the full row according to the filter. Setting the tolerance tol:
In [18]: tol = 2
In [19]: cond = df.where((df['cumsum']>=15-tol)&(df['cumsum']<=15+tol)).dropna()
In [20]: cond
Out[20]:
index Num_Albums Num_authors cumsum
2 2.0 4.0 4.0 15.0
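If only the Num_authors value itself is needed, it can then be read off cond directly (this step is not in the original answer):
In [21]: cond['Num_authors'].iloc[0]
Out[21]: 4.0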
This could even be done with the following code:
def your_function(df):
    total = 0
    index = -1
    for i in df['Num_Albums'].tolist():
        total += i
        index += 1
        if total == your_condition:  # plug in your stopping condition here
            return (index, df.loc[df.Num_Albums == i, 'Num_authors'])
This would actually return a tuple of your index and the corresponding value of Num_authors as soon as the "your condition" is reached.
or it could even be returned as an array:
def your_function(df):
    total = 0
    index = -1
    for i in df['Num_Albums'].tolist():
        total += i
        index += 1
        if total == your_condition:  # plug in your stopping condition here
            return df.loc[df.Num_Albums == i, 'Num_authors'].index.values
I was not able to figure out the exact condition at which the cumulative sum should stop, so I left it as your_condition in the code!
I am also new, so I hope this helps!
