I'm using K-means clustering and I have no idea about the true labels of the data. After applying PCA I ended up with 4 clusters. However, the clusters seem to be imbalanced.
I was wondering how I can fix this class imbalance problem in an unsupervised learning task?
Related
I have implemented code for analysing k-means clustering and hierarchical clustering on the student performance dataset below, but I am having trouble visualising the plots for the clusters.
Since this is a multi-classification dataset, PCA does not work on it, and I am not aware of an alternative method or a workaround for it.
Dataset link:
https://archive.ics.uci.edu/ml/datasets/Student+Performance
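If the obstacle is the dataset's many categorical columns, one commonly used workaround is to one-hot encode them first, after which PCA applies for a 2D cluster plot. A minimal sketch, assuming the UCI file student-mat.csv with ';' separators (adjust if your copy differs):

    # Sketch: one-hot encode categoricals so PCA can project to 2D for plotting.
    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    df = pd.read_csv("student-mat.csv", sep=";")   # file from the UCI archive
    X = pd.get_dummies(df)                         # one-hot encode categorical columns
    X = StandardScaler().fit_transform(X)          # scale before PCA / k-means

    labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
    X2 = PCA(n_components=2).fit_transform(X)      # 2D projection for plotting

    plt.scatter(X2[:, 0], X2[:, 1], c=labels, s=10)
    plt.xlabel("PC 1")
    plt.ylabel("PC 2")
    plt.title("k-means clusters in PCA space")
    plt.show()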
Let's assume we're dealing with continuous features and responses. We fit a linear regression model (let's say first order) and after cross-validation we get a reasonably good r^2 (let's say r^2 = 0.8).
Why do we go for other ML algorithms? I've read some research papers in which the authors tried different ML algorithms, taking the simple linear model as a baseline for comparison. In these papers the linear model outperformed the other algorithms. What I have difficulty understanding is: why do we go for other ML algorithms then? Why can't we just be satisfied with the linear model, especially in the specific case where the other algorithms perform poorly?
The other question is: what do they gain from presenting the other algorithms in their research papers if those algorithms performed poorly?
The best model for solving predictive problems with continuous output is the regression model, especially if you do it with a neural network (polynomial or linear) with hyperparameter tuning based on the problem.
Other ML algorithms such as decision trees or SVMs have classification as their main goal; on paper they can do regression as well, but in fact they cannot predict values outside the range they saw in training.
Still, in research people always try to find a better way to predict values than plain regression, just as in the classification world we started with logistic regression -> decision trees, and now we have SVMs, ensemble models and deep learning.
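As a small illustration of the extrapolation point, on synthetic data: a tree-based regressor cannot predict beyond the range of target values it saw in training, while a linear model follows the fitted trend.

    # Sketch: trees cannot extrapolate; linear models can.
    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    X_train = rng.uniform(0, 10, size=(200, 1))
    y_train = 3.0 * X_train.ravel() + rng.normal(0, 1, 200)  # linear ground truth

    X_new = np.array([[20.0]])  # far outside the training range
    lin = LinearRegression().fit(X_train, y_train)
    tree = DecisionTreeRegressor(max_depth=5, random_state=0).fit(X_train, y_train)

    print(lin.predict(X_new))   # ~60: follows the fitted trend
    print(tree.predict(X_new))  # ~30: capped near max(y_train)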
I think the answer is: because you never know.
"especially in the specific case where other algorithms perform poorly?"
You only know they performed poorly because someone tried those models. It's always worth trying various models.
I understand Random Forest models can be used for both classification and regression. Are there more specific criteria to determine when a random forest model would perform better than common regression models (linear, lasso, etc.) for estimating values, or than logistic regression for classification?
A random forest model is built from a bunch of decision trees; it is a supervised ensemble learning algorithm that reduces the over-fitting issue found in individual decision trees.
A well-known result in machine learning is that no single model outperforms all other models on every problem, and hence it is always recommended to try out different models before settling on one.
With that said, there are preferences in model selection when dealing with data of different natures. Each model makes intrinsic assumptions about the data, and the model whose assumptions are most aligned with the data generally works best. For instance, a logistic model is suitable for categorical outcomes separated by a smooth linear decision boundary, whereas a random forest does not assume such a boundary. Hence, the nature of your data makes a difference in your choice of model, and it is always good to try several before reaching a conclusion.
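As a minimal sketch of "try them all": scoring a linear decision boundary (logistic regression) against a random forest with the same cross-validation split; the breast cancer dataset here is just a stand-in for your own data.

    # Sketch: compare two model families under identical cross-validation.
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_breast_cancer(return_X_y=True)

    models = [
        ("logistic regression", make_pipeline(StandardScaler(),
                                              LogisticRegression(max_iter=1000))),
        ("random forest", RandomForestClassifier(n_estimators=200, random_state=0)),
    ]
    for name, model in models:
        scores = cross_val_score(model, X, y, cv=5)
        print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")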
I worked with a dataset which contains 2 classes (95%, 5%).
The features of these 2 classes have almost the same distribution.
Question: how can I classify these 2 classes, and how can I explain the principle the model uses to classify the test set?
The overlapping feature distributions can certainly happen, but you need a more detailed exploratory analysis than simple one-dimensional feature distributions. I suggest having a look at some 3D plots. Here are some links about EDA:
https://www.kaggle.com/dejavu23/titanic-eda-to-ml-beginner
https://www.kaggle.com/dejavu23/house-prices-eda-to-ml-beginner
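As an example of the 3D-plot suggestion, a minimal sketch on synthetic imbalanced data (a placeholder for your own features), coloring three features by class to look for separation that one-dimensional distributions hide:

    # Sketch: a 3D scatter of three features, colored by class.
    import matplotlib.pyplot as plt
    from sklearn.datasets import make_classification

    X, y = make_classification(n_samples=1000, n_features=3, n_informative=3,
                               n_redundant=0, weights=[0.95, 0.05], random_state=0)

    fig = plt.figure()
    ax = fig.add_subplot(projection="3d")
    ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=y, s=8)
    ax.set_xlabel("feature 1")
    ax.set_ylabel("feature 2")
    ax.set_zlabel("feature 3")
    plt.show()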
Regarding classification models, I would suggest using decision-tree-based models, such as Random Forest or Gradient Tree Boosting.
The idea behind a decision tree is to partition the feature space and make the same prediction within each part of it. You can plot decision trees using some packages (see the sketch after the links below), and this helps in understanding the principles behind the model. You can read more about all these models in this nice book:
http://www-bcf.usc.edu/~gareth/ISL/
Links to packages:
https://lightgbm.readthedocs.io/en/latest/
https://scikit-learn.org/stable/modules/tree.html
https://scikit-learn.org/stable/modules/ensemble.html
You can read about decision tree visualization:
https://medium.com/@rnbrown/creating-and-visualizing-decision-trees-with-python-f8e8fa394176
https://www.kaggle.com/willkoehrsen/visualize-a-decision-tree-w-python-scikit-learn
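For instance, a minimal sketch of fitting and plotting a small tree with scikit-learn's plot_tree; class_weight="balanced" is one common way to make the 5% class matter during fitting (the synthetic data here is only a placeholder for your own):

    # Sketch: fit a shallow, class-weighted tree and plot its split rules.
    import matplotlib.pyplot as plt
    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier, plot_tree

    X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)

    clf = DecisionTreeClassifier(max_depth=3, class_weight="balanced",
                                 random_state=0).fit(X, y)

    plot_tree(clf, filled=True)  # each node shows its split rule and class counts
    plt.show()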
I am using the KMeans clustering algorithm from the scikit-learn library, and the dimensionality of my data is 169, which is why I am unable to visualize the clustering result.
Is there any way to measure the performance of the algorithm?
Secondly, I have the labels of the data and I want to test the learned model on the test dataset, but I am not sure whether the labels the KMeans algorithm gave to the clusters coincide with the labels I have.
There are ways of visualizing high-dimensional data. You can sample some dimensions, use PCA components, MDS, t-SNE, parallel coordinates, and many more.
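For example, a minimal sketch of the projection idea, using random data as a stand-in for your 169-dimensional matrix: reduce to 2D with PCA and t-SNE and color the points by cluster assignment.

    # Sketch: project 169-D data to 2D with PCA and t-SNE, colored by cluster.
    import matplotlib.pyplot as plt
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA
    from sklearn.manifold import TSNE

    X = np.random.rand(500, 169)  # stand-in for your 169-dimensional data
    labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    for ax, X2, title in [
        (ax1, PCA(n_components=2).fit_transform(X), "PCA"),
        (ax2, TSNE(n_components=2, random_state=0).fit_transform(X), "t-SNE"),
    ]:
        ax.scatter(X2[:, 0], X2[:, 1], c=labels, s=8)
        ax.set_title(title)
    plt.show()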
Even the Wikipedia article on clustering has a section on evaluation, covering supervised as well as unsupervised evaluation. But the results of such evaluation can be very misleading...
Bear in mind that if you have labeled data, supervised methods should always outperform unsupervised methods that do not have the labels: unsupervised methods don't know what to look for, and there is little reason to believe that a clustering happens to align with your labels. In particular, on most data there will be many reasonable clusterings that capture different aspects of your data.
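A minimal sketch of both evaluation routes, on synthetic blobs as a stand-in for your 169-dimensional data: the silhouette score needs no labels at all, and the adjusted Rand index compares two partitions regardless of how the cluster IDs are numbered, which sidesteps the question of whether the k-means label numbers coincide with yours.

    # Sketch: unsupervised (silhouette) and supervised (ARI) cluster evaluation.
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.metrics import adjusted_rand_score, silhouette_score

    X, y_true = make_blobs(n_samples=600, n_features=169, centers=4, random_state=0)
    y_pred = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

    print("silhouette (no labels needed):", silhouette_score(X, y_pred))
    print("adjusted Rand (vs. given labels):", adjusted_rand_score(y_true, y_pred))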