Removing features based on variance - python

I am creating a model using an advanced regression house-price dataset. It has 37 numerical features. I want to perform feature selection by removing features with zero or very low variance. I used VarianceThreshold, but it didn't remove any features.
variances = df.var()
for col in variances.index:
    print(col, variances[col])
Output:
MSSubClass 1789.338306402389
MSZoning 589.7491687482642
LotFrontage 99625649.6503417
LotArea 1.9126794482991696
Street 1.2383223637883065
LotShape 912.2154126019891
LandContour 426.2328222558135
Utilities 32784.971167885175
LotConfig 208025.46846873628
LandSlope 26023.90777883106
Neighborhood 195246.40617940607
Condition1 192462.36170908928
Condition2 149450.07920371392
BldgType 190557.0753373038
HouseStyle 2364.204048090632
OverallQual 276129.63336259616
OverallCond 0.2692682171124828
YearBuilt 0.05700282610532444
YearRemodAdd 0.30350822011698775
RoofStyle 0.25289370651694854
RoofMatl 0.6654938173077709
Exterior1st 0.048548921667120055
Exterior2nd 2.6419033490756916
MasVnrType 0.41559474964087506
MasVnrArea 609.5825091487371
ExterQual 0.5584797243373708
ExterCond 45712.51022890529
Foundation 15709.813369543657
BsmtQual 4389.861203488976
BsmtCond 3735.5503258002063
BsmtExposure 859.5058709756354
BsmtFinType1 3108.889358915411
BsmtFinSF1 1614.215993315013
BsmtFinType2 246138.0553972849
BsmtFinSF2 7.309594674528473
BsmtUnfSF 1.763836649234308
TotalBsmtSF 6311111264.297451
These are the features and their variances. Based on this, what should my threshold value be?

First of all, each feature is on a different scale, so you cannot compare the variances to each other. One technique you can use is scaling, for example MinMaxScaler (not StandardScaler). This will make the variances comparable, and you can then choose a low threshold value like 0.05 or 0.03. But the right threshold really depends on the data: try evaluating models for different thresholds and compare the results. Once you have scaled your data, the threshold will usually be between 0 and 1.
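For illustration, here is a minimal sketch of this approach, assuming the numeric DataFrame df from the question (the 0.03 threshold is just an example to tune):
import pandas as pd
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import MinMaxScaler

# Scale every feature to [0, 1] so the variances become comparable
X = df.select_dtypes("number")
X_scaled = pd.DataFrame(MinMaxScaler().fit_transform(X), columns=X.columns)

# Drop features whose scaled variance falls below the chosen threshold
selector = VarianceThreshold(threshold=0.03)
selector.fit(X_scaled)
kept = X.columns[selector.get_support()]
print("Kept features:", list(kept))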
If you want a deeper understanding of VarianceThreshold, check out this post.

Related

Regression analysis for linear regression

I have a regression model where my target variable (days) is quantitative and ranges between 2 and 30. My RMSE is 2.5, and all the other X variables (nominal) are categorical, so I have dummy-encoded them.
I want to know what would be a good value of RMSE. I want to get it within 1-1.5 or even lower, but I am not sure what I should do to achieve that.
Note: I have already tried feature selection and removing features with low importance.
Any ideas would be appreciated.
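For reference, a minimal sketch of the setup described above on made-up data (the column names category and days are placeholders, not from the original post): dummy-encode the nominal predictor, fit a linear regression, and compute RMSE.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Toy data standing in for the real dataset
df = pd.DataFrame({
    "category": np.random.choice(["A", "B", "C"], size=200),
    "days": np.random.randint(2, 31, size=200),
})

X = pd.get_dummies(df[["category"]], drop_first=True)   # dummy encoding
y = df["days"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LinearRegression().fit(X_train, y_train)

rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
print("RMSE:", rmse)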
If your x values are categorical, then it does not necessarily make sense to bind them to a uniform grid. Who's to say categories A and B should be spaced apart the same as B and C? Assuming that they are will only lead to a misleading representation of your results.
Since your choice of scale is the unknown, you would be better off, in terms of visualisation, setting your uniform x grid as the day number and then seeing where the categories would fall on the y scale if given a linear relationship.
RMS error doesn't come into it at all if you don't have quantitative data for x and y.

Increase feature importance

I am working on a classification problem. I have around 1000 features, and the target variable has 2 classes. All 1000 features take values 1 or 0. I am trying to find feature importance, but my feature importance values vary from 0.0 to 0.003. I am not sure if such low values are meaningful.
Is there a way I can increase feature importance?
# Variable importance
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(min_samples_split=10, random_state=1)
rf.fit(X, Y)
print("Features sorted by their score:")
# Zip the rounded importances with the column names of X and sort, highest first
a = sorted(zip(map(lambda x: round(x, 3), rf.feature_importances_), X), reverse=True)
print(a)
I would really appreciate any help! Thanks
Since you only have two target classes, you can perform an unequal-variance t-test, which has been useful for finding important features in a binary classification task when all other feature-ranking methods have failed me. You can implement this using the scipy.stats.ttest_ind function. It is basically a statistical test that checks whether two distributions are different: if the returned p-value is less than 0.05, they can be assumed to be different distributions. To apply it to each feature, follow these steps (a sketch is given below):
Extract all predictor values for class 1 and class 2 respectively.
Run ttest_ind on these two distributions, specifying that their variances are unequal, and make sure it's a two-tailed t-test.
If the p-value is less than 0.05, this feature is important.
Alternatively, you can do this for all your features and use the p-value as the measure of feature importance: the lower the p-value, the higher the importance of a feature.
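Here is a minimal sketch of that procedure, assuming a DataFrame X of 0/1 features and a binary target Y as in the question:
import pandas as pd
from scipy.stats import ttest_ind

p_values = {}
for col in X.columns:
    group_0 = X.loc[Y == 0, col]    # predictor values for one class
    group_1 = X.loc[Y == 1, col]    # predictor values for the other class
    # Welch's (unequal-variance) t-test; two-tailed by default
    _, p = ttest_ind(group_0, group_1, equal_var=False)
    p_values[col] = p

# Lower p-value = more important feature
ranking = pd.Series(p_values).sort_values()
print(ranking.head(10))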
Cheers!

How do I denormalize the sklearn diabetes dataset?

There is a nice example of linear regression in sklearn using a diabetes dataset.
I copied the notebook version and played with it a bit in Jupyterlab. Of course, it works just like the example. But I wondered what I was really seeing.
There is a chart with unlabeled axes.
I wondered what the label (dependent variable) was.
I wondered which of the 10 independent variables was being used.
So I played around with the nice features provided by ipython/jupyter:
diabetes.DESCR
Diabetes dataset
================
Notes
-----
Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of
n = 442 diabetes patients, as well as the response of interest, a
quantitative measure of disease progression one year after baseline.
Data Set Characteristics:
:Number of Instances: 442
:Number of Attributes: First 10 columns are numeric predictive values
:Target: Column 11 is a quantitative measure of disease progression one year after baseline
:Attributes:
:Age:
:Sex:
:Body mass index:
:Average blood pressure:
:S1:
:S2:
:S3:
:S4:
:S5:
:S6:
Note: Each of these 10 feature variables have been mean centered and scaled by the standard
deviation times `n_samples` (i.e. the sum of squares of each column totals 1).
Source URL:
http://www4.stat.ncsu.edu/~boos/var.select/diabetes.html
For more information see:
Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani (2004)
"Least Angle Regression," Annals of Statistics (with discussion), 407-499.
(http://web.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf)
From the Source URL, we are led to the original raw data, which is a tab-separated, unnormalized copy of the data. It also further explains what the "S" features are in the problem domain.
Interestingly, sex was coded as one of [1, 2], with only a guess as to what the values meant.
But my real question is whether there is a way within sklearn to denormalize the data.
Is there a way to denormalize the coefficients and intercept so that one could express the fit algebraically?
Or is this just a demonstration of linear regression?
There is no way to denormalize data without any information about the data prior to normalization. However, note that the sklearn.preprocessing classes MinMaxScaler, StandardScaler, etc. do include inverse_transform methods (example), so if the fitted scaler were also provided in the example, it would be easy to do. As it stands, as you say, this is just a regression demonstration.
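For what it's worth, a minimal sketch of inverse_transform on toy data (not the diabetes example itself), showing how a fitted scaler can recover the original scale:
import numpy as np
from sklearn.preprocessing import StandardScaler

X_raw = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])   # toy data

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_raw)        # normalized copy
X_back = scaler.inverse_transform(X_scaled)   # recovers X_raw (up to float error)

print(np.allclose(X_raw, X_back))             # True
Recent scikit-learn releases also let you load the unscaled diabetes data directly via load_diabetes(scaled=False), if that option is available in your version.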

Standardize range of data based on one of the dataframe's column

I have a pandas DataFrame with a column passengers whose range may vary greatly depending on the function creating the DataFrame.
The other columns are often of more or less constant ranges (they're economy indicators).
segments.head(2);
            passengers       gdp  gdp_per_capita  inflation  unemployment  population
Month
2002-01-01       11688  4461.087       31634.953    150.847        14.418      339.59
2002-02-01        9049  4142.153       29321.702    204.132        14.738      343.32
My most valuable data is the number of passengers, so I do not want to transform it. However, the differences in scale of the other measures, which I want to use as predictors, make it difficult to track the variations (sometimes in the tens of thousands, sometimes in decimals).
How could I standardize the range of all my columns to be consistent with the mean(passengers)?
There are different ways you can approach this problem: you can write and apply a manual transformation function, or you can use a pre-existing one, such as sklearn.preprocessing.StandardScaler.
StandardScaler will "standardize features by removing the mean and scaling to unit variance". You can hence shift the mean and adjust the unit variance according to your needs.
However, it looks to me like you are going to build a predictive model on that data. If so, the best approach is to test all hypotheses and keep what works best. My advice is (see the sketch after this list):
Remove skew from passengers (if present). Log and log1p are the most common transforms, but depending on your data other transforms might be better. You should test arbitrary functions as well (inverse, or 1/(X+1) for example) and use the best transform (skew closest to 0).
Test both scaled and non-scaled features. If the data is skewed, test both with and without the transform, as above.
If outliers are present, test both with and without them (outliers converted to borderline values / outliers converted to np.nan). Make a boolean feature column identifying outliers for each feature, and test whether it's valuable information or just noise to the model.
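A minimal sketch of the first two points, assuming the segments DataFrame from the question (the choice of log1p and of which columns to scale is just an example):
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

predictors = ["gdp", "gdp_per_capita", "inflation", "unemployment", "population"]

# Optionally reduce skew in the target (compare models with and without this)
passengers_log = np.log1p(segments["passengers"])

# Standardize the predictors so their ranges are comparable
scaled = pd.DataFrame(
    StandardScaler().fit_transform(segments[predictors]),
    columns=predictors,
    index=segments.index,
)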
Hope that helps,

Use K-means to learn features in Python

Question
I implemented a K-means algorithm in Python. First I apply PCA and whitening to the input data. Then I use k-means to successfully extract k centroids from the data.
How can I use those centroids to understand the "features" learnt? Are the centroids already the features (it doesn't seem like that to me), or do I need to combine them with the input data again?
Because of some answers: K-means is not "just" a method for clustering; it is a vector quantization method. That said, the goal of k-means here is to describe a dataset with a reduced number of feature vectors, so there are strong analogies to methods like Sparse Filtering/Learning regarding the potential outcome.
Code Example
from scipy.cluster.vq import kmeans, vq

# Perform K-means on the pre-processed (PCA + whitened) data
centroids, _ = kmeans(matrix_pca_whitened, 1000)
# Assign each data vector to its nearest centroid
idx, _ = vq(song_matrix_pca, centroids)
The clusters produced by the K-means algorithm separate your input space into K regions. When you have new data, you can tell which region it belongs to, and thus classify it.
The centroids are just a property of these clusters.
You can have a look at the scikit-learn documentation if you are unsure, and at the algorithm cheat-sheet map to make sure you choose the right algorithm.
This is sort of a circular question: "understanding" requires knowing something about the features outside of the k-means process. All that k-means does is identify k groups of physical proximity. It says "there are clumps of stuff in these k places, and here's how all the points choose the nearest one."
What this means in terms of the features is up to the data scientist, rather than any deeper meaning that k-means can ascribe. The variance of each group may tell you a little about how tightly those points are clustered. Do remember that k-means also chooses starting points at random; an unfortunate choice can easily give a sub-optimal description of the space.
A centroid is basically the "mean" of the cluster. If you can derive some deeper understanding from the distribution of centroids, great -- but that depends on the data and features, rather than on any significant meaning coming from k-means itself.
Is that the level of answer you need?
The centroids are in fact the features learnt. Since k-means is a method of vector quantization, we look up which cluster each observation belongs to, and that observation is therefore best described by that cluster's feature vector (centroid).
If an observation was, for example, split into 10 patches beforehand, the observation can consist of at most 10 feature vectors.
Example:
Method: K-means with k=10
Dataset: 20 observations divided into 2 patches each = 40 data vectors
We now perform K-means on this patched dataset and get the nearest centroid per patch. We can then create a vector of length 10 (=k) for each of the 20 observations, and if patch 1 belongs to centroid 5 and patch 2 belongs to centroid 9, the vector would look like: 0 - 0 - 0 - 0 - 1 - 0 - 0 - 0 - 1 - 0.
This means that this observation consists of the centroids/features 5 and 9. You could also use the distance between patch and centroid instead of this hard assignment; a sketch is below.
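A minimal sketch of this hard-assignment encoding (all names and sizes are illustrative, not from the original post):
import numpy as np
from scipy.cluster.vq import kmeans, vq

k = 10
n_observations = 20
patches_per_obs = 2

# 40 patch vectors of dimension 5; consecutive rows belong to the same observation
patches = np.random.rand(n_observations * patches_per_obs, 5)

centroids, _ = kmeans(patches, k)
assignments, _ = vq(patches, centroids)     # nearest centroid per patch

# One k-dimensional indicator vector per observation: 1 for every centroid
# that at least one of its patches was assigned to, 0 otherwise.
features = np.zeros((n_observations, k))
for obs in range(n_observations):
    start = obs * patches_per_obs
    for c in assignments[start:start + patches_per_obs]:
        features[obs, c] = 1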
