Thinking about a problem: should you standardize two predictors that are already in the same unit (say kilograms) but may have different ranges? The model is KNN.
I think you should, because otherwise the model will give the predictor with the larger range more weight when calculating distances.
It is better to standardize the data even though the predictors are on the same scale. KNN relies on a distance calculation (typically Euclidean), and without standardization the predictor with the larger spread contributes most of that distance. Standardizing puts both predictors on a comparable footing, so scaling is generally preferred for distance-based models.
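To make the point above concrete, here is a minimal sketch in plain Python. The data are made up (two hypothetical predictors, both in kilograms, with very different spreads), and the helper names are mine, not from any library. It only shows that the nearest neighbour of a query point can change once both columns are z-scored.

```python
import math
from statistics import mean, pstdev

# Hypothetical data, both columns in kilograms: (body_weight_kg, backpack_weight_kg)
points = [(71.0, 9.0), (78.0, 2.5), (50.0, 5.0), (100.0, 1.0), (90.0, 10.0)]
query = (70.0, 2.0)

def euclidean(a, b):
    """Plain Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def standardize(rows, reference):
    """Z-score each column of `rows` using the column means/stds of `reference`."""
    cols = list(zip(*reference))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) for c in cols]
    return [tuple((v - m) / s for v, m, s in zip(row, mus, sds)) for row in rows]

def nearest_index(candidates, target):
    """Index of the candidate closest to `target`."""
    return min(range(len(candidates)), key=lambda i: euclidean(candidates[i], target))

# Raw distances are dominated by the wide-ranging body weight column.
raw_nearest = nearest_index(points, query)

# After standardizing (query scaled with the same statistics), both columns count.
scaled_points = standardize(points, points)
scaled_query = standardize([query], points)[0]
scaled_nearest = nearest_index(scaled_points, scaled_query)

print("nearest (raw):", points[raw_nearest])
print("nearest (standardized):", points[scaled_nearest])
```

With the raw values the closest point is the one with a similar body weight even though its backpack weight is very different; after standardizing, the closest point is the one that is reasonably similar on both predictors.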
I recently started working in machine learning using Python. Today I'm working on a dataset where I would like to apply dimension reduction and then fit my model to evaluate the score. This dataset has 30 features.
I started with a simple algorithm, Logistic Regression, but before applying it I want to do a PCA.
To determine the best number of components I used GridSearchCV, tuning only the C parameter of the logistic regression and the number of components of the PCA.
The result I got is that the more components I use for my PCA, the better the precision score. For my example with n_components=30 I get a precision score of 0.81.
The problem is that I thought PCA was used for dimension reduction (i.e. working with fewer features) and that it could help increase the score. Is there something I do not understand?
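import pandas as pd
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# X_train and y_train are assumed to be defined earlier
pca = PCA()
logistic = LogisticRegression()
pipe = Pipeline(steps=[('pca', pca), ('logistic', logistic)])
# Grid over the number of PCA components and the regularization strength C
param_grid = {
    'pca__n_components': [5, 10, 15, 20, 25, 30],
    'logistic__C': [0.01, 0.1, 1, 10, 100]
}
search = GridSearchCV(pipe, param_grid, cv=5, n_jobs=-1, scoring='precision')  # fix adding a tuple scoring
search.fit(X_train, y_train)
print("Best parameter (CV score=%0.3f):" % search.best_score_)
print(search.best_params_)
results = pd.DataFrame(search.cv_results_)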
output : Best parameter (CV score=0.881):
{'logistic__C': 0.01, 'pca__n_components': 30}
Thanks in advance for your reply
EDIT: I add this screenshot for more information on the score with number of components
In general, when you do dimension reduction, you lose some information. It is not surprising then that you get a higher score with the full set of PCA features. Working with fewer features can indeed help increase the score, but not necessarily. There are also other good reasons for using PCA for dimension reduction. Here are the main advantages of PCA:
PCA is one good technique for dimension reduction (with its own limitations) in the sense that it concentrates the variance of the dataset in the first dimensions of the computed new space. Hence, dropping the last features is done at a minimal cost in terms of information carried by the dataset (under certain hypotheses). Using PCA for dimension reduction mitigates the risk of overfitting by limiting the number of features, while losing a minimal amount of information. In this sense, fewer features can increase the score by avoiding overfitting, but that is not always true.
Dimension reduction with PCA can also be useful when working with noisy data. PCA will not directly eliminate the noise, but the first few features will have a higher signal-to-noise ratio since the variance of the dataset is concentrated there. The last features may then be dominated by noise and can be dropped.
Since PCA projects the dataset onto a new orthonormal basis, the new features are all linearly uncorrelated with each other (which is weaker than being independent, unless the data are Gaussian). Many machine learning algorithms perform better with uncorrelated inputs.
Of course, PCA should not be used in every case, as it has its own hypotheses and limitations. Here are what I consider the main ones (non-exhaustive):
PCA is sensitive to the scaling of the variables. As an example, if you have a temperature column in your dataset, you will get a different transformation depending on whether you use Celsius or Fahrenheit as the unit, because their scales are different. When the variables have different scales, PCA is a bit arbitrary. This can be corrected by scaling all variables to unit variance, but at the cost of modifying (compressing or expanding) the fluctuations of the variables in all dimensions.
PCA captures linear correlations between the features but fails to capture non-linear correlations.
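The scaling sensitivity mentioned above can be illustrated with a small sketch. The numbers are invented, and `principal_axis_angle` is a helper I wrote for the 2-D case only: it uses the closed-form orientation of the leading eigenvector of a 2x2 covariance matrix. It is a toy illustration, not how a PCA library computes components.

```python
import math
from statistics import mean, pstdev

def principal_axis_angle(xs, ys):
    """Angle in degrees of the first principal axis of 2-D data.

    For the covariance matrix [[a, b], [b, c]] the leading eigenvector
    makes an angle theta with the x-axis where tan(2*theta) = 2b / (a - c).
    """
    mx, my = mean(xs), mean(ys)
    n = len(xs)
    var_x = sum((x - mx) ** 2 for x in xs) / n
    var_y = sum((y - my) ** 2 for y in ys) / n
    cov_xy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    return math.degrees(0.5 * math.atan2(2 * cov_xy, var_x - var_y))

def zscore(values):
    """Scale a column to zero mean and unit variance."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

# Hypothetical measurements: temperature and some second feature.
temps_c = [10.0, 15.0, 20.0, 25.0, 30.0]
other = [3.0, 4.5, 4.0, 6.5, 7.0]
temps_f = [c * 9 / 5 + 32 for c in temps_c]  # same temperatures in Fahrenheit

angle_c = principal_axis_angle(temps_c, other)
angle_f = principal_axis_angle(temps_f, other)

# After scaling both columns to unit variance the unit no longer matters.
angle_std_c = principal_axis_angle(zscore(temps_c), zscore(other))
angle_std_f = principal_axis_angle(zscore(temps_f), zscore(other))

print("first axis, Celsius:    %.2f deg" % angle_c)
print("first axis, Fahrenheit: %.2f deg" % angle_f)
print("first axis, standardized (C / F): %.2f / %.2f deg" % (angle_std_c, angle_std_f))
```

The first principal axis points in a different direction for the Celsius and Fahrenheit versions of the same data, and the difference disappears once both columns are standardized.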
What would be interesting in your case would be to compare the score obtained with and without the PCA transformation. You would see then if there is a benefit in using it.
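A rough sketch of that comparison, assuming scikit-learn is available. The dataset here is synthetic (`make_classification` with 30 features), not the asker's data, and the choice of 10 components and of a `StandardScaler` step is mine, purely for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real 30-feature dataset (assumption).
X, y = make_classification(n_samples=400, n_features=30, n_informative=8,
                           random_state=0)

# Same classifier, with and without the PCA step.
baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
with_pca = make_pipeline(StandardScaler(), PCA(n_components=10),
                         LogisticRegression(max_iter=1000))

# 5-fold cross-validated precision for both pipelines.
baseline_scores = cross_val_score(baseline, X, y, cv=5, scoring="precision")
pca_scores = cross_val_score(with_pca, X, y, cv=5, scoring="precision")

print("no PCA:   %.3f +/- %.3f" % (baseline_scores.mean(), baseline_scores.std()))
print("with PCA: %.3f +/- %.3f" % (pca_scores.mean(), pca_scores.std()))
```

Looking at the spread across folds as well as the mean helps judge whether a difference between the two pipelines is meaningful.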
Last but not least, your plot shows an interesting thing. The gain in the score between 20 and 30 features is very low (1% ?). You can wonder whether it is worth keeping ten additional features for this very low gain. Indeed, keeping more features increases the risk of having a model with a lower ability to generalize. Cross-validation already mitigates this risk, but there is no guarantee that when you apply the model to unseen data, that data will have the exact same properties as your training dataset.
I'm currently trying to train a GP regression model in GPflow which will predict precipitation values given some meteorological inputs. I'm using a Linear+RBF+WhiteNoise kernel, which seems appropriate given the set of predictors I'm using.
My problem at the moment is that when I get the model to predict new values, it has a tendency to predict negative precipitation - see attached figure.
How can I "enforce" physical constraints when building the model? The training data doesn't contain any negative precipitation values, but it does contain a lot of values close to zero, which I assume means the GPR model isn't learning the "precipitation must be >=0" constraint very well.
If there's a way of explicitly enforcing a constraint like this it'd be perfect, but I'm not sure how that would work. Would this require a different optimization algorithm? Or is it possible to somehow build this constraint into the kernel structure?
This is more of a question for CrossValidated ... A Gaussian process is essentially a distribution over functions with Gaussian marginals: the predictive distribution of f(x) at any point is by construction a Gaussian, not constrained. E.g. if you have lots of observations close to zero, your model expects that something just below zero must also be very likely.
If your observations are strictly positive, you could use a different likelihood, e.g. Exponential (gpflow.likelihoods.Exponential) or, for data confined to the interval (0, 1), Beta (gpflow.likelihoods.Beta). Note that model.predict_y() always returns mean and variance, and for non-Gaussian likelihoods the variance may not actually be what you want. In practice, you're more likely to care about quantiles (e.g. a 10%-90% interval); there is an open issue on the GPflow GitHub that relates to this. Which likelihood you use is part of your modelling choice, and depends on your data.
The simplest practical answer to your problem is to consider modelling the log-precipitation: if your original dataset is X and Y (with Y > 0 for all entries), compute logY = np.log(Y) and create your GP model e.g. using gpflow.models.GPR((X, logY), kernel). You then predict logY at test points, and can then convert it back from log-precipitation into precipitation space. (This is equivalent to a LogNormal likelihood, which isn't currently implemented in GPflow, though this would be straightforward.)
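For the conversion back from log space, here is a small sketch in plain Python. It assumes the GP gives you a Gaussian predictive mean and variance for logY at a test point; the numbers passed in below are placeholders, not model output, and `lognormal_summary` is a hypothetical helper name. It uses the standard log-normal identities: the median is exp(mean), the mean is exp(mean + variance / 2), and quantiles are obtained by exponentiating the Gaussian quantiles.

```python
import math
from statistics import NormalDist

def lognormal_summary(log_mean, log_var, lower=0.10, upper=0.90):
    """Turn a Gaussian prediction in log space into summaries in the original space."""
    sd = math.sqrt(log_var)
    std_normal = NormalDist()  # standard normal, used for quantiles
    return {
        "median": math.exp(log_mean),
        "mean": math.exp(log_mean + 0.5 * log_var),
        "lower": math.exp(log_mean + std_normal.inv_cdf(lower) * sd),
        "upper": math.exp(log_mean + std_normal.inv_cdf(upper) * sd),
    }

# Placeholder values standing in for the predicted mean/variance of logY.
summary = lognormal_summary(log_mean=-1.2, log_var=0.8)
for name, value in summary.items():
    print("%-6s %.4f" % (name, value))
```

Every quantity is positive by construction. Note that simply exponentiating the predicted mean gives the median of the precipitation, not its mean.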
I have about 8000 features measuring a two level response variable i.e. output can belong to class 1 or 0.
The 8000 features consist of about 3000 features with 0-1 values and about 5000 features which are basically words from text data and their tf-idf scores.
I am building a linear SVM model on this to predict my output variable and am getting decent results: accuracy, recall and precision are around 60-70%.
I am looking for help with the following:
Standardization: do the 0-1 values need to be standardized? Do tf-idf scores need to be standardized even if I use sublinear_tf=True?
Dimension reduction: I have tried f_classif using the SelectPercentile function of sklearn so far. Can any other dimension reduction techniques be suggested? I have gone through the sklearn dimension reduction page, which also talks about chi2, but that isn't giving me good results. Can PCA be applied if the data is a mix of 0-1 columns and tf-idf score columns?
Remove collinearity: how can I remove highly correlated independent variables?
I am fairly new to python and machine learning, so any help would be appreciated.
(edited to include additional questions)
1 - I would centre and scale your variables for a linear model. I don't know if it's strictly necessary for SVMs, but if I recall correctly, distance-based models work better if the variables are in the same ranges. I don't think there's any harm in doing this anyway (vs. unscaled/uncentred). Someone may correct me - I don't do much by way of text analysis.
2 - (original answer) Could you try fitting a random forest model, then inspecting the importance scores (discarding the features with low importance)? With so many features I'd worry about memory issues, but if your machine can handle it...
Another good approach here would be to use ridge/lasso logistic regression. This by its very nature is good at identifying (and discarding) redundant variables, and can help with your question 3 (correlated variables).
Appreciate you're new to this, but both these models above are good at getting around correlation / non-significant variables, so you may want to use these on the way to finalising an SVM.
3 - There's no magic bullet that I know of. The above may help. I predominantly use R, and within that there's a package called Boruta which is good for this step. There may be a Python equivalent?
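As one simple option for question 3 (this is not what the answer above proposes, just a common heuristic), you can drop a feature when its absolute Pearson correlation with an already-kept feature exceeds a threshold. Below is a minimal sketch in plain Python; the column names, values and the 0.9 threshold are made up for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # constant column: treat as uncorrelated
    return sxy / math.sqrt(sxx * syy)

def drop_correlated(columns, threshold=0.9):
    """Greedy filter: keep a column only if |r| with every kept column is below threshold."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical feature columns.
columns = {
    "f1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "f2": [2.1, 3.9, 6.2, 8.0, 9.9],  # almost exactly 2 * f1
    "f3": [5.0, 1.0, 4.0, 2.0, 3.0],
}
kept = drop_correlated(columns)
print(kept)
```

The result depends on the order of the columns (the first of a correlated pair is the one kept), and the pairwise comparison grows quadratically with the number of features, so with about 8000 columns a vectorised correlation matrix would be more practical than these loops.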
I am a beginner with machine learning. I want to use time series linear regression to extract confidence intervals for my dataset. I don't need to use the linear regression as a classifier. Firstly, what is the difference between the two cases? Secondly, in Python, are there different ways to implement them?
The main difference is that a classifier computes a probability for a label, while a regression computes a quantitative output.
Generally, a classifier is used to compute the probability of a label, and a regression is used to compute a quantity. For instance, if you want to compute the price of a flat from some criteria you will use a regression; if you want to compute a label (luxurious, modest, ...) for the same flat from the same criteria you will use a classifier.
That said, using a regression-like model to compute a threshold that separates the observed labels is also a common technique. That is the case for a linear SVM, which computes a boundary between labels, called the decision boundary. Be aware that the main drawback of a linear model is that it is linear: the boundary separating the labels will necessarily be a straight line (a hyperplane in higher dimensions). Sometimes that is good enough, sometimes it is not.
Logistic regression is an exception: despite its name, it is used as a classifier, because what it computes is the probability of a label. Its name is misleading.
For regression, when you want to compute a quantitative output, you can use a confidence interval to get an idea of the error. In classification there is no confidence interval in that sense; even with a linear SVM it would be nonsensical. You can use the decision function, but it is difficult to interpret in practice, or you can use the predicted labels, count how often they are wrong, and compute an error ratio. There is a plethora of such ratios to choose from depending on your problem, and frankly it is the subject of a whole book.
Anyway, if you're working with a time series, as far as I know your goal is to obtain a quantitative output, so you do not need a classifier, as you said. As for extracting the interval, it depends entirely on the object you used to compute the regression in Python, meaning it depends on the attributes available on that object, and therefore on the library too. So it would be much easier to answer if you indicated which libraries and objects you are using.
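Since the library is not specified, here is a library-free sketch of what such an interval can look like for a simple linear trend fitted by ordinary least squares. The data are invented and `linear_fit_with_interval` is a hypothetical helper. It makes two simplifying assumptions that may not hold for a real time series: the errors are independent with constant variance, and a normal quantile is used in place of the Student t quantile (reasonable only for larger samples). Strictly speaking it computes a prediction interval for a new observation, not a confidence interval for the mean.

```python
import math
from statistics import NormalDist, mean

def linear_fit_with_interval(ts, ys, t_new, level=0.95):
    """Fit y = intercept + slope * t and return (prediction, lower, upper) at t_new."""
    n = len(ts)
    t_bar, y_bar = mean(ts), mean(ys)
    s_tt = sum((t - t_bar) ** 2 for t in ts)
    slope = sum((t - t_bar) * (y - y_bar) for t, y in zip(ts, ys)) / s_tt
    intercept = y_bar - slope * t_bar

    # Residual standard deviation with n - 2 degrees of freedom.
    sse = sum((y - (intercept + slope * t)) ** 2 for t, y in zip(ts, ys))
    resid_sd = math.sqrt(sse / (n - 2))

    y_hat = intercept + slope * t_new
    z = NormalDist().inv_cdf(0.5 + level / 2)  # normal approximation to the t quantile
    half_width = z * resid_sd * math.sqrt(1 + 1 / n + (t_new - t_bar) ** 2 / s_tt)
    return y_hat, y_hat - half_width, y_hat + half_width

# Hypothetical series observed at time steps 0..9.
ts = list(range(10))
ys = [2.1, 2.9, 4.2, 4.8, 6.1, 7.2, 7.8, 9.1, 9.9, 11.2]

y_hat, low, high = linear_fit_with_interval(ts, ys, t_new=11)
print("prediction at t=11: %.2f (approx. 95%% interval %.2f to %.2f)" % (y_hat, low, high))
```

Statistics libraries typically expose this kind of interval directly on the fitted model object, which is why knowing the library matters for a precise answer.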
I am new to the concept of scaling a feature in Machine Learning, I read that scaling will be useful when one feature range is very high when compared to other features. But if I choose to scale the training data then:
Can I just scale that one feature that has high range?
If I scale the entire X of train data then do I need to also scale the y of train data and entire test data?
Yes, you can scale just the one feature that has a high range, but make sure that no other feature has a high range as well. If one exists and has not been scaled, that feature will make the algorithm overlook the contributions of the scaled features and will affect the result (output value) with even a slight change in its value. It is recommended (but not compulsory) to scale all the features in the training set.
You do not need to scale the Y of the training data, as the algorithm or model will set the parameter values to get the least cost (error), that is k{Y(output)-Y(original)}, anyway. But if Xtrain was scaled, then the test set feature values (Xtest) need to be scaled too, using the training mean and variance, before feeding them to the model (scale Ytest only if Ytrain was scaled). The model has been trained on data with the scaled range and hasn't seen the test data before, so if a test feature value lies far outside the corresponding feature range in the training data, the model will output a wrong prediction for that test example.
Yes, you can scale a single feature. You can interpret scaling as a means of giving the same importance to each feature. For instance, imagine you have data about people and you describe your examples via two features: height and weight. If you measure height in meters and weight in kilograms, a k-Nearest Neighbours classifier when computing the distance between two examples is likely to make its decisions solely based on the weight. In that case, you can scale one of the features to the same range of the other. Commonly, we scale all the features to the same range (e.g. 0 - 1). In addition, remember that all the values you use to scale your training data must be used to scale the test data.
As for the dependent variable y you do not need to scale it.
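The point about reusing the training statistics can be sketched in a few lines of plain Python. The height/weight values are invented, and `fit_scaler` / `transform` are illustrative helper names, not a library API. The scaling parameters are computed on the training rows only and then applied unchanged to the test rows.

```python
from statistics import mean, pstdev

def fit_scaler(rows):
    """Compute per-column mean and standard deviation from the training rows."""
    cols = list(zip(*rows))
    return [mean(c) for c in cols], [pstdev(c) for c in cols]

def transform(rows, means, stds):
    """Standardize rows using previously computed means and stds."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

# Hypothetical examples: [height in meters, weight in kilograms]
X_train = [[1.60, 55.0], [1.75, 80.0], [1.82, 95.0], [1.68, 62.0]]
X_test = [[1.70, 70.0], [1.90, 110.0]]

means, stds = fit_scaler(X_train)               # statistics from training data only
X_train_scaled = transform(X_train, means, stds)
X_test_scaled = transform(X_test, means, stds)  # same statistics reused for the test data

print(X_train_scaled)
print(X_test_scaled)
```

The test rows are not guaranteed to end up with zero mean or unit variance, and that is expected: they are expressed in the same coordinates the model was trained on. y is left untouched.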