I am trying to find the most valuable features by applying feature selection methods to my dataset. I'm using the SelectKBest function for now. I can generate the score values and sort them as I want, but I don't understand exactly how this score value is calculated. I know that, in theory, a higher score means a more valuable feature, but I need a mathematical formula or a worked example to understand the score in depth.
bestfeatures = SelectKBest(score_func=chi2, k=10)
fit = bestfeatures.fit(dataValues, dataTargetEncoded)
feat_importances = pd.Series(fit.scores_, index=dataValues.columns)
topFeatures = feat_importances.nlargest(50).copy().index.values
print("TOP 50 Features (Best to worst) :\n")
print(topFeatures)
Thank you in advance
Say you have one feature and a target with 3 possible values
import numpy as np
import scipy.special
from sklearn.preprocessing import LabelBinarizer
from sklearn.feature_selection import SelectKBest, chi2

X = np.array([3.4, 3.4, 3. , 2.8, 2.7, 2.9, 3.3, 3. , 3.8, 2.5])
y = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])
X y
0 3.4 0
1 3.4 0
2 3.0 0
3 2.8 1
4 2.7 1
5 2.9 1
6 3.3 2
7 3.0 2
8 3.8 2
9 2.5 2
First we binarize the target
y = LabelBinarizer().fit_transform(y)
X y1 y2 y3
0 3.4 1 0 0
1 3.4 1 0 0
2 3.0 1 0 0
3 2.8 0 1 0
4 2.7 0 1 0
5 2.9 0 1 0
6 3.3 0 0 1
7 3.0 0 0 1
8 3.8 0 0 1
9 2.5 0 0 1
Then take the dot product of the binarized target and the feature, i.e. sum the feature values within each class:
observed = y.T.dot(X)
>>> observed
array([ 9.8, 8.4, 12.6])
Next, take the total sum of the feature values and the class frequencies:
feature_count = X.sum(axis=0).reshape(1, -1)
class_prob = y.mean(axis=0).reshape(1, -1)
>>> class_prob, feature_count
(array([[0.3, 0.3, 0.4]]), array([[30.8]]))
Now, as in the first step, we take a dot product to get the expected counts:
expected = np.dot(class_prob.T, feature_count)
>>> expected
array([[ 9.24],[ 9.24],[12.32]])
Finally we calculate a chi^2 value:
chi2_stat = ((observed.reshape(-1,1) - expected) ** 2 / expected).sum(axis=0)
>>> chi2_stat
array([0.11666667])
We now have a chi^2 value and need to judge how extreme it is. For that we use a chi^2 distribution with (number of classes - 1) degrees of freedom and compute the area from our chi^2 value to infinity, i.e. the probability of getting a chi^2 value at least as extreme as the one we observed. This is the p-value (using the chi-square survival function from scipy):
p = scipy.special.chdtrc(3 - 1, chi2_stat)
>>> p
array([0.94333545])
Compare with SelectKBest:
s = SelectKBest(chi2, k=1)
s.fit(X.reshape(-1,1),y)
>>> s.scores_, s.pvalues_
(array([0.11666667]), [0.943335449873492])
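For reference, here is the whole calculation wrapped into one helper, using the imports from the top of this answer. This is only a sketch of what sklearn.feature_selection.chi2 does for non-negative features (the name chi2_manual is mine), assuming y holds the original, non-binarized class labels:

def chi2_manual(X, y):
    # One indicator column per class; for a binary target, add the complementary
    # column (this mirrors what sklearn does internally).
    Y = LabelBinarizer().fit_transform(y)
    if Y.shape[1] == 1:
        Y = np.hstack([1 - Y, Y])
    observed = Y.T.dot(X)                          # per-class sums of each feature
    feature_count = X.sum(axis=0).reshape(1, -1)   # total sum of each feature
    class_prob = Y.mean(axis=0).reshape(-1, 1)     # relative class frequencies
    expected = class_prob.dot(feature_count)       # outer product, as above
    stat = ((observed - expected) ** 2 / expected).sum(axis=0)
    pval = scipy.special.chdtrc(Y.shape[1] - 1, stat)  # survival function, (n_classes - 1) dof
    return stat, pval

For the toy data above, chi2_manual(X.reshape(-1, 1), np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])) returns (array([0.11666667]), array([0.94333545])), matching SelectKBest.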
As part of my assignment I am building a logistic regression model, but I am getting the error "Perfect separation detected, results not available" while building it.
**X_train:**
year amt_spnt rank
1 -1.723034 -0.418500 0.272727
2 0.716660 2.088507 -0.636364
3 1.174102 -0.558333 -1.545455
4 -0.503187 -1.297451 1.181818
5 1.326583 -0.628250 -1.545455
**y_train:**
1 0
2 1
3 1
4 0
5 1
Name: result, dtype: int64
**Logistic model code:**
import statsmodels.api as sm
logm1 = sm.GLM(y_train, sm.add_constant(X_train), family=sm.families.Binomial())
logm1.fit().summary()
**Dataset before and after scaling (image for evidence):**
[![Evidence][1]][1]
[1]: https://i.stack.imgur.com/cTncA.png
This is a model specification issue: because of the perfect separation, your model cannot converge. Perfect separation means there is one (or more) variable among your independent variables that can perfectly distinguish dependent variable = 0 from dependent variable = 1. See the following example:
Y 0 0 0 0 0 0 1 1 1 1
X 1 2 3 4 4 4 5 6 7 8
If X <= 4, Y = 0
If X > 4, Y = 1
The short answer to your question is to find such a variable among your independent variables and remove it from your model.
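For example, a quick way to flag candidates (a sketch using the X_train and y_train from the question; the helper name find_separating_columns is my own, and it only catches single-variable threshold separation like the toy example above):

def find_separating_columns(X_train, y_train):
    # Flag columns whose value ranges for y == 0 and y == 1 do not overlap,
    # i.e. columns that separate the two classes perfectly on their own.
    separating = []
    for col in X_train.columns:
        vals0 = X_train.loc[y_train == 0, col]
        vals1 = X_train.loc[y_train == 1, col]
        if vals0.max() < vals1.min() or vals1.max() < vals0.min():
            separating.append(col)
    return separating

print(find_separating_columns(X_train, y_train))

Alternatively, a penalized fit (for example sklearn's LogisticRegression, which applies L2 regularization by default) will still converge on separable data, at the cost of shrunken coefficients.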
I want to classify cracks by their depths.
To do that, I store the following features in a DataFrame:
WindowsDf = pd.DataFrame(dataForWindowsDf, columns=['IsCrack', 'CheckTypeEncode', 'DepthCrack',
                                                    'WindowOfInterest'])
# dataForWindowsDf is a list built iteratively from csv files;
# the WindowsDf data frame is then built from that list.
So my target column is 'DepthCrack' and the other columns form the feature vector.
WindowOfInterest is a column of 2-D lists (lists of points): each entry is a graph representing a test based on electro-magnetic waves returned from a surface, as a function of time:
[[0.9561600000000001, 0.10913097635410397], [0.95621,0.1100000]...]
The problem I am facing is how to train a model using a column of 2-D lists (I tried to pass it in as-is and it didn't work).
What way do you suggest to deal with this problem?
I thought about extracting features from the 2-D list to get one-dimensional features (integral, etc.).
You might transform this one feature into two: WindowOfInterest can become
WindowOfInterest_x1 and WindowOfInterest_x2
For example, starting from your DataFrame:
>>> import pandas as pd
>>> df = pd.DataFrame({'IsCrack': [1, 1, 1, 1, 1],
... 'CheckTypeEncode': [0, 1, 0, 0, 0],
... 'DepthCrack': [0.4, 0.2, 1.4, 0.7, 0.1],
... 'WindowOfInterest': [[0.9561600000000001, 0.10913097635410397], [0.95621,0.1100000], [0.459561, 0.635410397], [0.4495621,0.32], [0.621,0.2432]]},
... index = [0, 1, 2, 3, 4])
>>> df
IsCrack CheckTypeEncode DepthCrack WindowOfInterest
0 1 0 0.4 [0.9561600000000001, 0.10913097635410397]
1 1 1 0.2 [0.95621, 0.11]
2 1 0 1.4 [0.459561, 0.635410397]
3 1 0 0.7 [0.4495621, 0.32]
4 1 0 0.1 [0.621, 0.2432]
We can split the list like so:
>>> df[['WindowOfInterest_x1','WindowOfInterest_x2']] = pd.DataFrame(df['WindowOfInterest'].tolist(), index=df.index)
>>> df
IsCrack CheckTypeEncode DepthCrack WindowOfInterest WindowOfInterest_x1 WindowOfInterest_x2
0 1 0 0.4 [0.9561600000000001, 0.10913097635410397] 0.956160 0.109131
1 1 1 0.2 [0.95621, 0.11] 0.956210 0.110000
2 1 0 1.4 [0.459561, 0.635410397] 0.459561 0.635410
3 1 0 0.7 [0.4495621, 0.32] 0.449562 0.320000
4 1 0 0.1 [0.621, 0.2432] 0.621000 0.243200
To finish, we can drop the WindowOfInterest column:
>>> df = df.drop(['WindowOfInterest'], axis=1)
>>> df
IsCrack CheckTypeEncode DepthCrack WindowOfInterest_x1 WindowOfInterest_x2
0 1 0 0.4 0.956160 0.109131
1 1 1 0.2 0.956210 0.110000
2 1 0 1.4 0.459561 0.635410
3 1 0 0.7 0.449562 0.320000
4 1 0 0.1 0.621000 0.243200
Now you can pass WindowOfInterest_x1 and WindowOfInterest_x2 as features to your model.
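If, as in the question, WindowOfInterest actually holds a whole curve (many [time, amplitude] points) rather than a single pair, another option, building on the integral idea from the question, is to collapse each window into a few scalar summary features. This is only a sketch; the helper window_features and the win_* column names are my own:

import numpy as np
import pandas as pd

def window_features(window):
    pts = np.asarray(window, dtype=float)             # shape (n_points, 2): [time, amplitude]
    t, a = pts[:, 0], pts[:, 1]
    area = np.sum(np.diff(t) * (a[:-1] + a[1:]) / 2)  # trapezoidal-rule area under the curve
    return pd.Series({'win_integral': area,
                      'win_max': a.max(),
                      'win_mean': a.mean(),
                      'win_duration': t[-1] - t[0]})

features = WindowsDf['WindowOfInterest'].apply(window_features)
WindowsDf = pd.concat([WindowsDf.drop(columns='WindowOfInterest'), features], axis=1)

Each resulting column is one-dimensional, so it can be fed to any standard model.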
I have these two data frames in Python and I'm trying to calculate the Manhattan distance (and later on the Euclidean distance), but I'm stuck on the Manhattan distance and can't figure out what is going wrong.
Here is what I have tried so far:
import pandas as pd

ratings = pd.read_csv("toy_ratings.csv", sep=",")
person1 = ratings[ratings['Person'] == 1]['Rating']
person2 = ratings[ratings['Person'] == 2]['Rating']
ratings.head()
Person Movie Rating
0 1 11 2.5
1 1 12 3.5
2 1 15 2.5
3 3 14 3.5
4 2 12 3.5
Here is the data inside person1 and person2:
print("*****person1*****")
print(person1)
*****person1*****
0 2.5
1 3.5
2 2.5
5 3.0
22 3.5
23 3.0
36 5.0
print("*****person2*****")
print(person2)
*****person2*****
4 3.5
6 3.0
8 1.5
9 5.0
11 3.0
24 3.5
This is the function I tried to build, without any luck:
def ManhattanDist(person1, person2):
    distance = 0
    for rating in person1:
        if rating in person2:
            distance += abs(person1[rating] - person2[rating])
        return distance
The thing is that the function gives 0 back, which is not correct; when I debug I can see that it never enters the second loop. How can I check that both rows have a value, and loop over them?
I think the function should give back (= return) the distance in any case: either the distance is zero as initialized, or it is something else. So the function should look like:
def ManhattanDist(person1, person2):
    distance = 0
    for rating in person1:
        if rating in person2:
            distance += abs(person1[rating] - person2[rating])
    return distance
I think the distance should be built from two vectors of the same length (at least I cannot imagine anything else). If this is the case you can do (without your function):
import numpy as np
p1 = np.array(person1)
p2 = np.array(person2)
#--- scalar product as similarity indicator
dist1 = np.dot(p1,p2)
#--- Euclidean distance
dist2 = np.linalg.norm(p1-p2)
#--- Manhattan distance
dist3 = np.sum(np.abs(p1-p2))
Your function is returning 1 value ... It should (I guess) return a list of values.
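That said, the two Series in the question are indexed by row number rather than by movie, which is why the membership test never succeeds. A sketch of what I think you are after (the helper name manhattan_dist is my own): match the ratings on the Movie column of the original ratings frame, so that only movies rated by both persons contribute:

def manhattan_dist(ratings, a, b):
    ra = ratings.loc[ratings['Person'] == a, ['Movie', 'Rating']]
    rb = ratings.loc[ratings['Person'] == b, ['Movie', 'Rating']]
    both = ra.merge(rb, on='Movie', suffixes=('_a', '_b'))  # movies rated by both persons
    return (both['Rating_a'] - both['Rating_b']).abs().sum()

print(manhattan_dist(ratings, 1, 2))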
I have the following data:
df = pd.DataFrame({'sound': ['A', 'B', 'B', 'A', 'B', 'A'],
'score': [10, 5, 6, 7, 11, 1]})
print(df)
sound score
0 A 10
1 B 5
2 B 6
3 A 7
4 B 11
5 A 1
If I standardize (i.e. Z score) the score variable, I get the following values. The mean of the new z column is basically 0, with SD of 1, both of which are expected for a standardized variable:
df['z'] = (df['score'] - df['score'].mean())/df['score'].std()
print(df)
print('Mean: {}'.format(df['z'].mean()))
print('SD: {}'.format(df['z'].std()))
sound score z
0 A 10 0.922139
1 B 5 -0.461069
2 B 6 -0.184428
3 A 7 0.092214
4 B 11 1.198781
5 A 1 -1.567636
Mean: -7.401486830834377e-17
SD: 1.0
However, what I'm actually interested in is calculating Z scores based on group membership (sound). For example, if a score is from sound A, then that value is converted to a Z score using the mean and SD of the sound A values only. Likewise, sound B Z scores only use the mean and SD from sound B. This obviously produces different values compared to the regular Z score calculation:
df['zg'] = df.groupby('sound')['score'].transform(lambda x: (x - x.mean()) / x.std())
print(df)
print('Mean: {}'.format(df['zg'].mean()))
print('SD: {}'.format(df['zg'].std()))
sound score z zg
0 A 10 0.922139 0.872872
1 B 5 -0.461069 -0.725866
2 B 6 -0.184428 -0.414781
3 A 7 0.092214 0.218218
4 B 11 1.198781 1.140647
5 A 1 -1.567636 -1.091089
Mean: 3.700743415417188e-17
SD: 0.894427190999916
My question is: why is the mean of the group-based standardized values (zg) also basically equal to 0? Is this expected behaviour or is there an error in my calculation somewhere?
The z scores make sense because standardizing within a variable essentially forces the mean to 0. But the zg values are calculated using different means and SDs for each sound group, so I'm not sure why the mean of that new variable has also been set to 0.
The only situation where I can see this happening is if the sum of values > 0 is equal to the sum of values < 0, which when averaged would cancel out to 0. This happens in a regular Z score calculation, but I'm surprised that it also happens when operating across multiple groups like this...
I think it makes perfect sense. Writing E[abc | def] for the expectation of abc given def, within df['zg'] we have
m1 = E[zg | sound = 'A'] = (0.872872 + 0.218218 - 1.091089)/3 ~ 0
m2 = E[zg | sound = 'B'] = (-0.725866 - 0.414781 + 1.140647)/3 ~ 0
and, since the two groups happen to be the same size,
E[zg] = (m1 + m2)/2 = (0.872872 + 0.218218 - 1.091089 - 0.725866 - 0.414781 + 1.140647)/6 ~ 0
Yes, this is expected behavior.
In fancy words, this is the Law of Iterated Expectations:
E[X] = E[ E[X | Y] ]
Specifically, if the groups Y are finite (and thus countable),
E[X] = Σ_j E[X | Y = y_j] * P(Y = y_j),
where the y_j are the possible group values. However, by construction, every E[X | Y = y_j] is 0 for all values y_j in your set G of possible groups.
Thus, the total average will also be zero.
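A quick numerical check on the DataFrame from the question makes the same point: each group's zg mean is zero (up to floating point), and the overall mean is just a weighted average of those zeros:

print(df.groupby('sound')['zg'].mean())  # roughly 0 for both sound A and sound B
print(df['zg'].mean())                   # hence roughly 0 overall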
I have a DataFrame and an input text file of activity values. The DataFrame is produced via pandas. I want to find the regression coefficient of each term using the following formula:
Y = C1a*X1a + C1b*X1b + ... + C2a*X2a + C2b*X2b + ... + C0,
where Y is the activity, Cna is the regression coefficient for residue choice a at position n, Xna is the dummy variable (xna = 1 or 0) coding the presence or absence of residue choice a at position n, and C0 is the mean value of the activity.
My DataFrame looks like:
2u 2s 4r 4n 4m 7h 7v
0 1 1 0 0 0 1
0 1 0 1 0 0 1
1 0 0 1 0 1 0
1 0 0 0 1 1 0
1 0 1 0 0 1 0
Here 1 and 0 represent the presence and absence of a residue, respectively.
Using MLR (multiple linear regression), how can I find the regression coefficient of each residue, i.e. 2u, 2s, 4r, 4n, 4m, 7h, 7v?
C1a represents the regression coefficient of residue a at the 1st position (here 1a is 2u, 1b is 2s, 2a is 4r, ...), and X1a represents the dummy value, i.e. 0 or 1, corresponding to 1a.
The activity file contains the following data:
6.5
5.9
5.7
6.4
5.2
So the first equation will look like
6.5 = C1a*0 + C1b*1 + C2a*1 + C2b*0 + C2c*0 + C3a*0 + C3b*1 + C0
…
Can I get the regression coefficients using numpy? Please help me; all suggestions will be appreciated.
Let A be your DataFrame (you can get it as a plain numpy array; read it in with np.loadtxt if it's a CSV) and y be your activity file (again, a numpy array), then use np.linalg.lstsq:
DF = """0 1 1 0 0 0 1
0 1 0 1 0 0 1
1 0 0 1 0 1 0
1 0 0 0 1 1 0
1 0 1 0 0 1 0"""
res = """6.5, 5.9, 5.7, 6.4, 5.2"""
A = np.fromstring ( DF, sep=" " ).reshape((5,7))
y = np.fromstring(res, sep=" ")
(x, res, rango, svals ) = np.linalg.lstsq(A, y )
print x
# 2.115625, 2.490625, 1.24375 , 1.19375 , 2.16875 , 2.115625, 2.490625
print np.sum(A.dot(x)**2) # Sum of squared residuals:
# 177.24750000000003
print A.dot(x) # Print predicition
# 6.225, 6.175, 5.425, 6.4 , 5.475
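Note that the model in the question also includes an intercept C0, which the solution above leaves out. A minimal way to add it (my addition, not part of the original answer) is to append a column of ones. Because the dummy columns for each position already sum to one, the augmented matrix is rank-deficient, but lstsq still returns the minimum-norm solution:

A1 = np.column_stack([A, np.ones(len(A))])   # last column carries the intercept C0
coef, *_ = np.linalg.lstsq(A1, y, rcond=None)
print(coef)                                  # last entry plays the role of C0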