How do you know if your dataset suffers from high-dimensionality problems?

There seem to be many techniques for reducing dimensionality (PCA, SVD, etc.) in order to escape the curse of dimensionality. But how do you know whether your dataset in fact suffers from high-dimensionality problems? Is there a best practice, like visualizations, or can one even use KNN to find out?
I have a dataset with 99 features and 1 continuous label (price) and 30 000 instances.

The curse of dimensionality refers to the relationship between feature dimension and data size: as the number of features grows, the amount of data needed to successfully model your problem grows exponentially.
The problem really arises when your data has to grow exponentially, because you then have to think about how to handle it properly (storage and computation power needed).
So we usually experiment to figure out the right dimensionality for the problem (for example using cross-validation) and select only those features. Also keep in mind that using lots of features comes with a high risk of overfitting.
You can use either feature selection or feature extraction for dimensionality reduction: LASSO can be used for feature selection, and PCA or LDA for feature extraction.
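If it helps, here is a minimal sketch comparing the two routes on synthetic data shaped like yours (99 features; fewer rows than your 30 000 just to keep the sketch quick). The Ridge model, the 20 PCA components and the R2 scoring are illustrative assumptions, not recommendations.

from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the 99-feature price dataset.
X, y = make_regression(n_samples=5000, n_features=99, noise=10.0, random_state=0)

# Feature selection: keep only the features to which LASSO assigns non-zero weight.
selected = make_pipeline(SelectFromModel(LassoCV(cv=5)), Ridge())

# Feature extraction: project onto the first 20 principal components
# (20 is an arbitrary choice; tune it, e.g. with cross-validation).
extracted = make_pipeline(PCA(n_components=20), Ridge())

for name, model in [("lasso-selected", selected), ("pca-extracted", extracted)]:
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(name, scores.mean())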

Related

Worse results when training on entire dataset

After finalizing the architecture of my model I decided to train the model on the entire dataset by setting validation_split = 0 in fit(). I thought this would improve the results based on these sources:
What is validation data used for in a Keras Sequential model?
Your model doesn't "see" your validation set and isn't in any way trained on it
https://machinelearningmastery.com/train-final-machine-learning-model/
What about the cross-validation models or the train-test datasets?
They’ve been discarded. They are no longer needed.
They have served their purpose to help you choose a procedure to finalize.
However, I got worse results without the validation set (compared to validation_split = 0.2), leaving all other parameters the same.
Is there an explanation for this? Or was it just by chance that my model happened to perform better on the fixed test data when a part of the training data was excluded (and used as validation)?
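For concreteness, here is a minimal sketch of the two setups being compared. The architecture, data shapes and epoch count are placeholders, not my actual model.

import numpy as np
from tensorflow import keras

X_train = np.random.rand(1000, 20)   # placeholder data
y_train = np.random.rand(1000)

def build_model():
    model = keras.Sequential([
        keras.layers.Dense(64, activation="relu", input_shape=(20,)),
        keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

# Run used during model selection: 20% of the training data held out for validation.
build_model().fit(X_train, y_train, validation_split=0.2, epochs=50, verbose=0)

# Final run on the entire training set: no validation hold-out.
build_model().fit(X_train, y_train, validation_split=0.0, epochs=50, verbose=0)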
Well, that's really a very good question that touches on a lot of machine-learning concepts, especially the bias-variance tradeoff.
As #CrazyBarzillian hinted in the comments, more data might be leading to over-fitting, and yes, we would need more information about your data to come to a definite answer. But in a broader way, I would like to explain a few points that might help you understand why this happened.
EXPLANATION
Whenever your data has a larger number of features, your model learns a very complex equation to fit it. In short, the model is too complicated for the amount of data we have. This situation, known as high variance, leads to over-fitting. We know we are facing a high-variance issue when the training error is much lower than the test error. High-variance problems can be addressed by reducing the number of features (for example by applying PCA or removing outliers) or by increasing the number of data points, that is, adding more data.
Sometimes you have too few features in your data, and the model therefore learns a very simple equation. This is known as high bias. In this case, adding more data won't help; less data will do the job, and adding more features is what helps.
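A small sketch of that diagnostic, using assumed toy data and an intentionally flexible model: compare training error with held-out error, and read a large gap as high variance and two similarly high errors as high bias.

from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=2000, n_features=50, noise=20.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# An unconstrained tree is prone to over-fitting, which makes the gap easy to see.
model = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)
train_err = mean_absolute_error(y_tr, model.predict(X_tr))
test_err = mean_absolute_error(y_te, model.predict(X_te))

# Training error near zero with a much larger test error points to high variance;
# both errors similarly high would point to high bias.
print(f"train MAE: {train_err:.1f}   test MAE: {test_err:.1f}")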
MY ASSUMPTION
I guess your model is suffering from high bias if it performs worse when more data is added. But to check whether the statement "adding more data leads to poor results" is actually true in your case, you can do the following things:
play with some hyperparameters
try other machine learning models
instead of accuracy scores, look at R2 scores or mean absolute error for regression, or F1, precision and recall for classification (see the short metrics sketch below)
If, after doing these things, you still find that more data leads to poorer results, then you can be fairly sure it is high bias and can either increase the number of features or reduce the data.
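As referenced in the list above, a short sketch of those scoring functions in scikit-learn, with made-up toy predictions just to show the calls:

from sklearn.metrics import (r2_score, mean_absolute_error,
                             f1_score, precision_score, recall_score)

# Regression example
y_true_reg = [3.0, 5.5, 7.2, 9.1]
y_pred_reg = [2.8, 6.0, 7.0, 8.5]
print("R2 :", r2_score(y_true_reg, y_pred_reg))
print("MAE:", mean_absolute_error(y_true_reg, y_pred_reg))

# Classification example
y_true_clf = [0, 1, 1, 0, 1]
y_pred_clf = [0, 1, 0, 0, 1]
print("F1       :", f1_score(y_true_clf, y_pred_clf))
print("Precision:", precision_score(y_true_clf, y_pred_clf))
print("Recall   :", recall_score(y_true_clf, y_pred_clf))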
SOLUTION
By reducing the data, I mean using less data but better data. Better data means, for example, that if you are doing a classification problem with three classes (A, B and C), the data points are balanced across the three classes. Your data should be balanced: if it is unbalanced, say class A has a high number of samples while classes B and C have only 3-4 samples each, you can apply random sampling techniques to overcome it (a short sketch of these steps follows the list below).
How to make BETTER DATA
Balance the data
Remove outliers
Scale (Normalize) the data
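A minimal sketch of those three steps on assumed toy data: random over-sampling with sklearn.utils.resample for balancing, a simple z-score cut for outliers, and StandardScaler for scaling. The column names and thresholds are illustrative assumptions.

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.utils import resample

df = pd.DataFrame({
    "x1": np.random.randn(100),
    "x2": np.random.randn(100) * 5,
    "label": ["A"] * 90 + ["B"] * 6 + ["C"] * 4,   # imbalanced classes
})
features = ["x1", "x2"]

# 1. Balance: up-sample each class to the size of the majority class.
majority_size = df["label"].value_counts().max()
balanced = pd.concat([
    resample(group, replace=True, n_samples=majority_size, random_state=0)
    for _, group in df.groupby("label")
])

# 2. Remove outliers: drop rows more than 3 standard deviations from the mean.
z = (balanced[features] - balanced[features].mean()) / balanced[features].std()
balanced = balanced[(z.abs() < 3).all(axis=1)].copy()

# 3. Scale: standardise the features to zero mean and unit variance.
balanced[features] = StandardScaler().fit_transform(balanced[features])
print(balanced["label"].value_counts())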
CONCLUSION
It is a myth that more data always leads to a good model. More than quantity, the quality of the data matters; data should have both quantity and quality. This game of maintaining both quality and quantity is closely tied to the bias-variance tradeoff.

Why can my perceptron not separate perfectly a number of points that are less than the number of features?

I am quite new to machine learning and decided that a good way to start getting some experience would be to play around with some real databases and the Python scikit-learn library. I used Haberman's survival data, a binary classification task, which can be found at https://archive.ics.uci.edu/ml/datasets/Haberman%27s+Survival. I trained a few perceptrons using this data. At some point, I decided to demonstrate the concept of overfitting, so I mapped all 306 data points, of 3 features each, to a very high dimension, generating all terms up to and including the 11th degree. That is a vast 364 features (which is more than the 306 data points). Yet, when I trained the model, I did not achieve zero in-sample error. I figured the reason must be that some points coincide but have different labels, so I removed duplicate data points, but again I could not achieve zero in-sample error. Here is the interesting part of my code using the scikit-learn methods:
from sklearn.linear_model import Perceptron
from sklearn import preprocessing

perceptron = Perceptron()
polynomial = preprocessing.PolynomialFeatures(11)  # all polynomial terms up to degree 11
perceptron.fit(polynomial.fit_transform(X), Y)
print(perceptron.score(polynomial.fit_transform(X), Y))  # in-sample accuracy
And the output I got was a mere 0.7, an accuracy score far from the 1 (100%) I expected. What am I missing?
You only have polynomial terms up to degree 11. If you want to be guaranteed to hit every point, you need almost as many, if not more, polynomial degrees than data points, because each additional degree allows the graph to bend again.
Having a bunch of features of the same degree can't really increase your complexity in the way you expect. If your function is first degree, for example, you can't expect it to be anything other than linear, regardless of how many like terms it has.
So while you may have more features than data points, since you don't have more polynomial degrees than data points, most of your features are effectively tweaking the same weights.

How to choose parameters for svm in sklearn

I'm trying to use SVM from sklearn for a classification problem. I have a highly sparse dataset with more than 50K rows and binary outputs.
The problem is that I don't quite know how to choose the parameters efficiently, mainly the kernel, gamma and C.
For the kernels, for example, am I supposed to try all of them and just keep the one that gives the most satisfying results, or is there something about the data that we can look at beforehand to choose the kernel?
Same goes for C and gamma.
Thanks !
Yes, this is mostly a matter of experimentation -- especially as you've told us very little about your data set: separability, linearity, density, connectivity, ... all the characteristics that affect classification algorithms.
Try the linear and Gaussian kernels for starters. If linear doesn't work well and Gaussian does, then try the other kernels.
Once you've found the best one or two kernels, play with the cost (C) and gamma parameters. C is the "slack" penalty: a lower cost gives the classifier permission to make a certain proportion of raw classification errors as a trade-off for other benefits (a wider margin, a simpler partition function, etc.), while gamma controls how far the influence of a single training example reaches when you use the RBF kernel.
I haven't yet had an application that got more than trivial benefit from altering the cost.
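If it helps, here is a minimal sketch of that experimentation with GridSearchCV; the toy data and grid values are assumptions and starting points, not recommendations for your 50K-row dataset.

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, n_features=50, random_state=0)

param_grid = [
    {"kernel": ["linear"], "C": [0.1, 1, 10]},
    {"kernel": ["rbf"], "C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.1]},
]
search = GridSearchCV(SVC(), param_grid, cv=5, scoring="f1", n_jobs=-1)
search.fit(X, y)
print(search.best_params_, search.best_score_)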

Are there feature selection algorithms that can be applied to categorical data inputs?

I am training a neural network which has 10 or so categorical inputs. After one-hot encoding these categorical inputs I end up feeding around 500 inputs into the network.
I would love to be able to ascertain the importance of each of my categorical inputs. Scikit-learn has numerous feature importance algorithms; however, can any of these be applied to categorical data inputs? All of the examples use numerical inputs.
I could apply these methods to the one-hot encoded inputs, but how would I extract the meaning after applying to binarised inputs? How does one go about judging feature importance on categorical inputs?
Using feature selection algorithms on a one-hot encoding might be misleading because of the relations between the encoded features. For example, if you encode a feature of n values into n binary features and n-1 of them are selected, the last one is not needed.
Since the number of your features is quite low (~10), feature selection will not help you much, since you will probably only be able to drop a few of them without losing too much information.
You wrote that one-hot encoding turns the 10 features into about 500, meaning that each feature has about 50 values. In that case you might be more interested in discretisation algorithms, which manipulate the values themselves. If there is an implied order on the values, you can use algorithms for continuous features. Another option is simply to omit rare values, or values without a strong correlation to the concept.
In case you do use feature selection, most algorithms will work on categorical data, but you should beware of corner cases. For example, mutual information, suggested by #Igor Raush, is an excellent measure. However, features with many values tend to have higher entropy than features with fewer values, which in turn can lead to higher mutual information and a bias towards features with many values. A way to cope with this is to normalise by dividing the mutual information by the feature's entropy.
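A rough sketch of that normalisation on made-up arrays; mutual_info_score and scipy's entropy both use natural logarithms, so the ratio is consistent.

import numpy as np
from scipy.stats import entropy
from sklearn.metrics import mutual_info_score

feature = np.random.choice(["a", "b", "c", "d"], size=1000)
target = np.random.choice([0, 1], size=1000)

mi = mutual_info_score(feature, target)
_, counts = np.unique(feature, return_counts=True)
feature_entropy = entropy(counts / counts.sum())   # entropy of the feature's value distribution

normalized_mi = mi / feature_entropy if feature_entropy > 0 else 0.0
print(f"MI={mi:.4f}  H(feature)={feature_entropy:.4f}  normalized={normalized_mi:.4f}")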
Another family of feature selection algorithms that might help you are the wrappers. They delegate the evaluation to the classification algorithm itself, and are therefore indifferent to the representation as long as the classification algorithm can cope with it.
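And a sketch of the wrapper idea using scikit-learn's SequentialFeatureSelector (available from scikit-learn 0.24 onwards); the toy data and the logistic regression estimator are illustrative assumptions.

from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=10, n_informative=4, random_state=0)

# Forward selection: score candidate feature subsets with the classifier itself.
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=4,
    direction="forward",
    cv=5,
)
selector.fit(X, y)
print("selected feature indices:", selector.get_support(indices=True))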

How to calculate probability(confidence) of SVM classification for small data set?

Use case:
I have a small dataset with about 3-10 samples in each class. I am using sklearn's SVC with an RBF kernel to classify them.
I need the confidence of the prediction along with the predicted class. I used predict_proba method of SVC.
I was getting weird results with that. I searched a bit and found out that it makes sense only for larger datasets.
Found this question on stack Scikit-learn predict_proba gives wrong answers.
The author of that question verified this by multiplying, i.e. duplicating, the dataset.
My questions:
1) If I multiply my dataset by, let's say, 100, so that each sample appears 100 times, it increases the "correctness" of predict_proba. What side effects will it have? Overfitting?
2) Is there any other way I can calculate the confidence of the classifier? Like distance from the hyperplanes?
3) For this small sample size, is SVM a recommended algorithm or should I choose something else?
First of all: Your data set seems very small for any practical purposes. That being said, let's see what we can do.
SVMs are mainly popular in high-dimensional settings; it is currently unclear whether that applies to your project. They build separating planes on a handful of (or even single) supporting instances, and are often outperformed by neural nets in situations with large training sets. A priori, they might not be your worst choice.
Oversampling your data will do little for an SVM-based approach. SVMs are built on the notion of support vectors, which are basically the outliers of a class that define what is in the class and what is not. Oversampling will not construct new support vectors (I am assuming you are already using the training set as the test set).
Plain oversampling in this scenario will also not give you any new information on confidence, other than artifacts created by unbalanced oversampling, since the instances will be exact copies and no distribution change will occur. You might be able to find some information by using SMOTE (Synthetic Minority Oversampling Technique): you basically generate synthetic instances based on the ones you have. In theory this provides new instances that are not exact copies of the ones you have, and that might thus fall a little outside the normal classification. Note: by definition, all these examples will lie between the original examples in your sample space. That does not mean they will lie between them in your projected SVM space, so you may end up learning effects that aren't really true.
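For illustration, a minimal SMOTE sketch with the imbalanced-learn package (pip install imbalanced-learn); the toy data and k_neighbors value are assumptions, and k_neighbors has to stay below the minority class size.

import numpy as np
from imblearn.over_sampling import SMOTE

X = np.random.randn(20, 5)
y = np.array([0] * 12 + [1] * 8)   # small, imbalanced toy set

# Generate synthetic minority samples by interpolating between neighbours.
X_res, y_res = SMOTE(k_neighbors=3, random_state=0).fit_resample(X, y)
print(X_res.shape, np.bincount(y_res))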
Lastly, you can estimate confidence with the distance to the hyperplane. Please see: https://stats.stackexchange.com/questions/55072/svm-confidence-according-to-distance-from-hyperline
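A rough sketch of that idea with sklearn's SVC on made-up data; decision_function returns a signed score whose magnitude grows with distance from the boundary.

import numpy as np
from sklearn.svm import SVC

X = np.random.randn(20, 4)
y = np.array([0] * 10 + [1] * 10)

clf = SVC(kernel="rbf", gamma="scale").fit(X, y)
distances = clf.decision_function(X)   # signed distance-like score per sample

# Larger |distance| means the sample lies further from the boundary,
# i.e. the classifier is "more certain" about it.
for pred, d in zip(clf.predict(X), distances):
    print(pred, round(float(d), 3))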
