I have a pandas DataFrame with a column passengers whose range may vary greatly depending on the function that creates the DataFrame.
The other columns are often more or less of constant ranges (they're economy indicators).
segments.head(2)
            passengers       gdp  gdp_per_capita  inflation  unemployment  population
Month
2002-01-01       11688  4461.087       31634.953    150.847        14.418      339.59
2002-02-01        9049  4142.153       29321.702    204.132        14.738      343.32
My most valuable data is the number of passengers, so I do not want to transform it. However, the differences of scale of the other measures, which I want to use as predictors, make it difficult to track the variations (sometimes in tens of thousands, sometimes in decimals).
How could I standardize the range of all my columns to be consistent with the mean(passengers)?
There are different ways to approach this problem: you can write and apply a manual transformation function, or you can use an existing one, such as sklearn.preprocessing.StandardScaler.
StandardScaler will "Standardize features by removing the mean and scaling to unit variance". You can then shift the mean and rescale the variance to whatever values you need.
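As a minimal sketch of that idea, the predictors can be standardized and then shifted and stretched onto the mean and standard deviation of the untouched passengers column. The third row of the toy frame is made up so that the standard deviations are well defined, and matching the spread of passengers (not only its mean) is an assumption about what "consistent" should mean here.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frame: the first two rows come from segments.head(2) in the question,
# the third row is hypothetical and only there to make the example meaningful.
segments = pd.DataFrame(
    {
        "passengers": [11688, 9049, 10200],
        "gdp": [4461.087, 4142.153, 4300.0],
        "unemployment": [14.418, 14.738, 14.5],
    },
    index=pd.to_datetime(["2002-01-01", "2002-02-01", "2002-03-01"]),
)

predictors = segments.columns.drop("passengers")

# StandardScaler gives every predictor mean 0 and (population) std 1.
z = StandardScaler().fit_transform(segments[predictors])

# Move the z-scores onto the scale of the target, which itself stays untouched.
target_mean = segments["passengers"].mean()
target_std = segments["passengers"].std(ddof=0)

rescaled = segments.copy()
rescaled[predictors] = z * target_std + target_mean
print(rescaled)
```

After this step every predictor has the same mean and spread as passengers, so their variations can be read on one axis.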
However, it looks to me like you are going to build a predictive model on this data. If so, the best approach is to test every hypothesis and keep what works best. My advice is:
Remove skew from passengers (if present). Log and log1p are the most common transforms, but depending on your data other transforms might work better. You should test arbitrary functions as well (the inverse, or 1/(X+1), for example) and use the best transform (the one whose skew is closest to 0).
Test both scaled and unscaled features. If the data is skewed, test both with and without the transform above.
If outliers are present, test both with and without them (outliers converted to borderline values, or outliers converted to np.nan). Make a boolean feature column identifying outliers for each feature, and test whether it is valuable information or just noise to the model.
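The skew and outlier checks above could be sketched roughly as follows. The passenger values are hypothetical, and the 1.5 * IQR fence used to define an outlier is an assumption; any other outlier rule can be swapped in.

```python
import numpy as np
import pandas as pd

# Hypothetical, right-skewed passenger counts (illustration only).
passengers = pd.Series(
    [900, 1100, 1200, 1300, 1500, 1800, 2500, 4000, 9000, 30000], dtype=float
)

# Candidate transforms to compare.
candidates = {
    "identity": lambda x: x,
    "log": np.log,
    "log1p": np.log1p,
    "inverse": lambda x: 1.0 / (x + 1.0),
}

# Skewness of each transformed series; the one closest to 0 wins.
skews = {name: fn(passengers).skew() for name, fn in candidates.items()}
best = min(skews, key=lambda name: abs(skews[name]))
print(skews, best)

# Outliers by the 1.5 * IQR rule (an assumed definition).
q1, q3 = passengers.quantile([0.25, 0.75])
low, high = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)

is_outlier = (passengers < low) | (passengers > high)    # boolean feature column
clipped = passengers.clip(lower=low, upper=high)         # converted to borderline values
as_nan = passengers.where(~is_outlier)                   # converted to np.nan
```

Each variant (clipped, as_nan, with or without is_outlier) is then just another candidate feature set to evaluate against the model.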
Hope that helps,
I am creating a model using an advanced regression house price dataset. It has 37 numerical features. I want to make a feature selection by removing features with zero or very low variance. I used Variance Threshold, and it didn't remove any features.
for i in range(0, len(list(df.var()))):
    print(df.columns[i], df.var()[i])
Output-
MSSubClass 1789.338306402389
MSZoning 589.7491687482642
LotFrontage 99625649.6503417
LotArea 1.9126794482991696
Street 1.2383223637883065
LotShape 912.2154126019891
LandContour 426.2328222558135
Utilities 32784.971167885175
LotConfig 208025.46846873628
LandSlope 26023.90777883106
Neighborhood 195246.40617940607
Condition1 192462.36170908928
Condition2 149450.07920371392
BldgType 190557.0753373038
HouseStyle 2364.204048090632
OverallQual 276129.63336259616
OverallCond 0.2692682171124828
YearBuilt 0.05700282610532444
YearRemodAdd 0.30350822011698775
RoofStyle 0.25289370651694854
RoofMatl 0.6654938173077709
Exterior1st 0.048548921667120055
Exterior2nd 2.6419033490756916
MasVnrType 0.41559474964087506
MasVnrArea 609.5825091487371
ExterQual 0.5584797243373708
ExterCond 45712.51022890529
Foundation 15709.813369543657
BsmtQual 4389.861203488976
BsmtCond 3735.5503258002063
BsmtExposure 859.5058709756354
BsmtFinType1 3108.889358915411
BsmtFinSF1 1614.215993315013
BsmtFinType2 246138.0553972849
BsmtFinSF2 7.309594674528473
BsmtUnfSF 1.763836649234308
TotalBsmtSF 6311111264.297451
These are the features and their variances. Based on this, what should my threshold value be?
First of all, each feature is on a different scale, so you cannot compare the variances with each other. One technique is to rescale the features first with MinMaxScaler (not StandardScaler, which would force every variance to 1). This makes the variances comparable, and you can then choose a low threshold value such as 0.05 or 0.03. The right threshold really depends on the data, though, so try evaluating models for different thresholds and compare the results. Once the data is scaled to [0, 1], every variance lies between 0 and 0.25, so the threshold will be a small number in that range.
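A minimal sketch of this approach is shown below. The frame is a hypothetical stand-in for the house-price data (only the column names are borrowed from it), and 0.05 is just one of the thresholds mentioned above.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import MinMaxScaler

# Hypothetical stand-in for the house-price frame (40 rows).
n = 40
df = pd.DataFrame(
    {
        "LotArea": np.linspace(5000, 15000, n),      # spread evenly over its range
        "OverallQual": np.tile([4, 5, 6, 7, 8], 8),  # five levels, equally frequent
        "Street": [0] + [1] * (n - 1),               # almost constant
    }
)

# Bring every feature to [0, 1] so that the variances become comparable.
scaled = pd.DataFrame(MinMaxScaler().fit_transform(df), columns=df.columns)
print(scaled.var(ddof=0))

# Drop the features whose scaled variance does not exceed the threshold.
selector = VarianceThreshold(threshold=0.05)
selector.fit(scaled)
kept = list(scaled.columns[selector.get_support()])
print(kept)
```

In this toy frame only the almost constant Street column falls below the threshold; on real data the threshold should be tuned by comparing model results, as described above.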
If you want a deeper understanding of VarianceThreshold, check out this post.
I have a regression model whose target variable (days) is quantitative, with values ranging from 2 to 30. My RMSE is 2.5, and all the other X variables are categorical (nominal), so I have dummy encoded them.
I want to know what a good value of RMSE would be. I would like to get it down to 1-1.5 or even lower, but I do not know what I should do to achieve that.
Note: I have already tried feature selection and removing features with less importance.
Any ideas would be appreciated.
If your x values are categorical, then it does not necessarily make much sense to bind them to a uniform grid. Who's to say categories A and B should be spaced the same distance apart as B and C? Assuming that they are will only lead to an incorrect representation of your results.
Since the spacing of the categories is the unknown here, you would be better off, in terms of visualisation, using the day number as your uniform x grid and then seeing where the categories fall on the y scale, assuming a linear relationship.
RMS Error doesn't come into it at all if you don't have quantitative data for x and y.
There is a nice example of linear regression in sklearn using a diabetes dataset.
I copied the notebook version and played with it a bit in Jupyterlab. Of course, it works just like the example. But I wondered what I was really seeing.
There is a chart with unlabeled axes.
I wondered what the label (dependent variable) was.
I wondered which of the 10 independent variables was being used.
So I played around with the nice features provided by ipython/jupyter:
diabetes.DESCR
Diabetes dataset
================
Notes
-----
Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of
n = 442 diabetes patients, as well as the response of interest, a
quantitative measure of disease progression one year after baseline.
Data Set Characteristics:
:Number of Instances: 442
:Number of Attributes: First 10 columns are numeric predictive values
:Target: Column 11 is a quantitative measure of disease progression one year after baseline
:Attributes:
:Age:
:Sex:
:Body mass index:
:Average blood pressure:
:S1:
:S2:
:S3:
:S4:
:S5:
:S6:
Note: Each of these 10 feature variables have been mean centered and scaled by the standard
deviation times `n_samples` (i.e. the sum of squares of each column totals 1).
Source URL:
http://www4.stat.ncsu.edu/~boos/var.select/diabetes.html
For more information see:
Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani (2004)
"Least Angle Regression," Annals of Statistics (with discussion), 407-499.
(http://web.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf)'
From the Source URL, we are led to the original raw data which is a tab-separated unnormalized copy of the data. It also further explains what the "S" features were in the problem domain.
Interestingly, sex was one of [1,2] with a guess as to what they meant.
But my real question is whether there is a way within sklearn to denormalize the data.
Is there a way to denormalize the coefficients and intercept so that one could express the fit algebraically?
Or is this just a demonstration of linear regression?
There is no way to denormalize data without some information about the data prior to normalization (for example, each column's original mean and scale). However, note that the sklearn.preprocessing classes MinMaxScaler, StandardScaler, etc. do include inverse_transform methods, so if the fitted scaler were also provided with the example, it would be easy to do. As it stands, as you say, this is just a regression demonstration.
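When the scaler is available, both the data and the fitted coefficients can be mapped back. The sketch below uses hypothetical raw data with a known linear relationship and a StandardScaler; it does not apply to the bundled diabetes data, whose scaler is not shipped with the dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical raw data following y = 3*x0 - 2*x1 + 5 exactly.
rng = np.random.default_rng(0)
X_raw = rng.uniform([20, 100], [80, 200], size=(50, 2))
y = 3.0 * X_raw[:, 0] - 2.0 * X_raw[:, 1] + 5.0

scaler = StandardScaler().fit(X_raw)
X_scaled = scaler.transform(X_raw)

# inverse_transform recovers the original values from the scaled ones.
X_back = scaler.inverse_transform(X_scaled)

# Fit on the scaled features, as the sklearn example does.
model = LinearRegression().fit(X_scaled, y)

# With z = (x - mean) / scale, the fit y = coef . z + intercept becomes
# y = (coef / scale) . x + (intercept - sum(coef * mean / scale)).
coef_raw = model.coef_ / scaler.scale_
intercept_raw = model.intercept_ - np.sum(model.coef_ * scaler.mean_ / scaler.scale_)
print(coef_raw, intercept_raw)
```

The recovered coef_raw and intercept_raw express the fit algebraically in the original units of the features.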
I have a DataFrame which includes heights. The data cannot go below zero, so it is not normally distributed, and that is why I cannot use the standard deviation: the 68-95-99.7 rule fails in my case. Here are my data, mean and SD.
0.77132064
0.02075195
0.63364823
0.74880388
0.49850701
0.22479665
0.19806286
0.76053071
0.16911084
0.08833981
Mean: 0.41138725956196015
Std: 0.2860541519582141
If I go 2 std below the mean, as you can see, the number becomes negative.
mean - 2 x std = 0.41138725956196015 - 2 x 0.2860541519582141 = -0.160721044354468
I have tried using percentiles and am not satisfied with them, to be honest. How can I apply Chebyshev's inequality to this problem? Here is what I did so far:
np.polynomial.Chebyshev(df['Heights'])
But this returns numbers, not an SD level I can measure. Or do you think Chebyshev is the best choice in my case?
Expected solution:
I am expecting to get a range, e.g. a 75% chance that the next height will be between 0.40 and 0.43.
EDIT1: Added histogram
To be more clear, I have added my real data's histogram
EDIT2: Some values from real data
Mean: 0.007041500928135767
Percentile 50: 0.0052000000000000934
Percentile 90: 0.015500000000000047
Std: 0.0063790857035425025
Var: 4.06873389299246e-05
Thanks a lot
You seem to be confusing two ideas from the same mathematician, Chebyshev. These ideas are not the same.
Chebyshev's inequality states a fact that is true for any probability distribution with a finite mean and variance. For two standard deviations, it states that at least three-fourths of the data items will lie within two standard deviations of the mean. As you state, for normal distributions about 19/20 of the items will lie in that interval, but Chebyshev's inequality is an absolute lower bound that holds whatever the shape of the distribution. The fact that your data values are never negative does not change the truth of the inequality; it just makes the actual proportion of values in the interval even larger, so the inequality holds with room to spare.
Chebyshev polynomials do not involve statistics; they are simply a series (or two series) of polynomials, commonly used in calculating approximations for computer functions. That is what np.polynomial.Chebyshev deals with, so it does not seem useful to you at all.
So calculate Chebyshev's inequality yourself. There is no need for a special function for that, since it is so easy (this is Python 3 code):
def Chebyshev_inequality(num_std_deviations):
    return 1 - 1 / num_std_deviations**2
You can change that to handle the case where num_std_deviations <= 1 (where the bound is zero or negative and therefore says nothing), but the idea is obvious.
In your particular case, the inequality says that at least 3/4, or 75%, of the data items will lie within 2 standard deviations of the mean, that is, above 0.41138725956196015 - 2 * 0.2860541519582141 and below 0.41138725956196015 + 2 * 0.2860541519582141 (note the different signs), which simplifies to the interval
[-0.16072104435446805, 0.9834955634783884]
In your data, 100% of your data values are in that interval, so Chebyshev's inequality was correct (of course).
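A quick way to check this is to compare the guaranteed fraction with the fraction actually observed in the sample. The sketch below uses the ten values from the question; the function name chebyshev_lower_bound and the choice of k values are illustrative.

```python
import numpy as np

# The ten sample values from the question.
heights = np.array([
    0.77132064, 0.02075195, 0.63364823, 0.74880388, 0.49850701,
    0.22479665, 0.19806286, 0.76053071, 0.16911084, 0.08833981,
])

def chebyshev_lower_bound(k):
    """Guaranteed minimum fraction within k standard deviations of the mean.

    For k <= 1 the inequality gives no information, so 0 is returned.
    """
    return 1.0 - 1.0 / k**2 if k > 1 else 0.0

mean = heights.mean()
std = heights.std()  # population std (ddof=0), which matches the Std in the question

results = {}
for k in (1.5, 2, 3):
    observed = np.mean(np.abs(heights - mean) <= k * std)
    results[k] = (chebyshev_lower_bound(k), observed)
    print(f"k={k}: guaranteed >= {results[k][0]:.3f}, observed {observed:.3f}")
```

The observed fraction is never below the guaranteed one, which is all the inequality promises.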
Now, if your goal is to predict or estimate where a certain percentile is, Chebyshev's inequality does not help much. It is an absolute lower bound, so it gives only one limit for a percentile. For example, from the interval above we know that at most 25% of the values can lie outside it, so the 25th percentile is at or above -0.16072104435446805 and the 75th percentile is at or below 0.9834955634783884. (The inequality does not say how that 25% is split between the two tails, so it cannot be read as 12.5% on each side.) Those facts are true but are probably not what you want. If you want an estimate that is closer to the actual percentile, this is not the way to go. The 68-95-99.7 rule is an estimate: the actual locations may be higher or lower, but if the distribution is normal then the estimate will not be far off. Chebyshev's inequality does not give that kind of estimate.
If you want to estimate the 12.5'th and 87.5'th percentiles (showing where 75 percent of all the population will fall) you should calculate those percentiles of your sample and use those values. If you don't know more details about the kind of distribution you have, I don't see any better way. There are reasons why normal distributions are so popular!
It sounds like you want the boundaries for the middle 75% of your data.
The middle 75% of the data is between the 12.5th percentile and the 87.5th percentile, so you can use the quantile function to get the values at the locations:
[df['Heights'].quantile(0.5 - 0.75/2), df['Heights'].quantile(0.5 + 0.75/2)]
#[0.09843618875, 0.75906485625]
As per the Quora thread "What does it mean when the standard deviation is higher than the mean? What does that tell you about the data?", SD is a measure of "spread" and the mean is a measure of "position". As you can see, these are more or less independent things. Even when all your samples are positive, nothing in the way the SD is calculated keeps 2 or 3 SDs (or, for strongly skewed data, even a single SD) from exceeding the mean.
So, basically, SD being roughly equal to the mean means that your data are all over the place.
Now, a random variable that's strictly positive indeed cannot be normally distributed. But for a rough estimation, seeing that you still have a bell shape, we can pretend it is and still use SD as a rough measure of the spread (though, since 2 and 3 SD can go into negatives, they lack any physical meaning here whatsoever and so are unusable for the sake of our pretense):
E.g. to get a rough prediction of grass growth, you can still take the mean and apply whatever growth model you're using to it -- that will get the new, prospective mean. Then applying the same to mean±SD will give an idea of the new SD.
This is very rough, of course. But to get any better, you'll have to somehow check which distribution you're dealing with and use its peak and spread characteristics instead of mean and SD. And in any case, your prediction will not be any better than your growth model -- studies of which are anything but conclusive judging by e.g. https://onlinelibrary.wiley.com/doi/pdf/10.1111/j.1365-3040.2005.01490.x (not a single formula there).
I have a DataFrame in Python and I need to preprocess my data. Which is the best method to preprocess it, knowing that some variables have a huge scale and others do not? The data does not have a huge deviance either. I tried the preprocessing.scale function and it works, but I am not sure at all whether it is the best way to prepare the data for machine learning algorithms.
There are various techniques for data preprocessing; you can refer to the ideas in sklearn.preprocessing as potential guidelines to follow.
http://scikit-learn.org/stable/modules/preprocessing.html
Preprocessing is coupled to the data you are studying, but in general you could explore:
Assess missing values by computing their percentage per column
Compute the variance and remove variables with near-zero variance
Assess the inter-variable correlation to detect redundancy
You can compute these scores easily in pandas as follows:
import logging

import pandas as pd

data_file = "your_input_data_file.csv"
data = pd.read_csv(data_file, delimiter="|")

# Variance and correlation are only defined for numeric columns.
numeric = data.select_dtypes(include="number")

# Variance of each numeric feature, as a two-column frame.
variance = numeric.var().to_frame("variance")
variance["feature_names"] = variance.index
variance = variance.reset_index(drop=True)[["feature_names", "variance"]]
logging.debug("exporting variance to csv file")
variance.to_csv(data_file + "_variance.csv", sep="|", index=False)

# Fraction of missing values in each column.
missing_values_percentage = (data.isnull().sum() / data.shape[0]).to_frame("missing_values_percentage")
missing_values_percentage["feature_names"] = missing_values_percentage.index
missing_values_percentage = missing_values_percentage.reset_index(drop=True)[
    ["feature_names", "missing_values_percentage"]
]
logging.debug("exporting missing values to csv file")
missing_values_percentage.to_csv(data_file + "_missing_values.csv", sep="|", index=False)

# Pairwise correlation between the numeric features.
correlation = numeric.corr()
correlation.to_csv(data_file + "_correlation.csv", sep="|")
The above would generate three files holding respectively, the variance, missing values percentage and correlation results.
Refer to this blog article for a hands on tutorial.
Always split your data into a training set and a test set so that you can detect overfitting.
If some of your features have a large scale and others do not, you should standardize the data. Make sure to fit the scaler on the training set only and then apply it to the test set, so that no information leaks from the test data.
You also have to look for missing data and replace or remove it.
If less than 0.5% of the data in a column is missing you can use dropna; otherwise you have to replace it with something (zero, the mean, the previous value, ...).
You should also check for outliers, for example with a boxplot.
Outliers are points that differ significantly from the other data in the same group, and they can affect your predictions.
It is best to check for multicollinearity as well.
If some features are correlated with each other, you have multicollinearity, which can lead to wrong predictions from your model.
Finally, some of your columns might be categorical, and these should be converted to numerical values before you use the data.
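Several of these steps can be combined in one scikit-learn preprocessing object, as in the rough sketch below. The column names and values are hypothetical, and mean imputation is only one of the replacement strategies mentioned above.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical data: two numeric features (one with a gap), one categorical.
data = pd.DataFrame(
    {
        "income": [52000, 61000, None, 48000, 75000, 58000, 66000, 43000],
        "age": [34, 45, 29, 51, 38, 42, 36, 27],
        "city": ["A", "B", "A", "C", "B", "A", "C", "B"],
        "target": [1.2, 2.3, 0.9, 1.8, 2.9, 2.0, 2.4, 0.7],
    }
)
X, y = data.drop(columns="target"), data["target"]

# Split first, so that nothing is learned from the test rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Numeric columns: fill gaps with the mean, then standardize.
numeric = Pipeline(
    [("impute", SimpleImputer(strategy="mean")), ("scale", StandardScaler())]
)

# Categorical columns: one-hot encode, ignoring categories unseen in training.
preprocess = ColumnTransformer(
    [
        ("num", numeric, ["income", "age"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Fit on the training set only, then reuse the fitted statistics on the test set.
X_train_prepared = preprocess.fit_transform(X_train)
X_test_prepared = preprocess.transform(X_test)
print(X_train_prepared.shape, X_test_prepared.shape)
```

Because the imputer, scaler and encoder are fitted on X_train alone, the test set is transformed with training statistics and stays a fair check of the model.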