The difference between feature importance and feature weights in XGBoost - python

I am trying to analyze the output of an XGBoost classifier. I haven't been able to find a proper explanation of the difference between the feature weights and the feature importance chart.
Here is a sample screenshot (not from my dataset but the same analysis I am running).
I would appreciate explanations or references to where I can find one.
Thanks in advance
Screenshot

Related

How to train a machine learning model in python including several target variables

I am trying to build a machine learning model in Python. I used PyTorch and sklearn to make the model. My model is a bit complicated: I have one input feature but several target variables. My target variables are values making up a curve, and I used each value of the curve as a different feature. The uploaded figure shows five different curves.
I used algorithms like DecisionTreeRegressor and RandomForestRegressor to fit the single input variable to the several target variables, but the trained model does not extrapolate well: it can generate a series of data, but not very accurately. Does anyone know of a model in Python that suits this problem? I tried hyperparameter tuning using GridSearchCV, but it did not help.
I appreciate your help and feedback in advance.
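One likely reason for the poor extrapolation: tree-based regressors predict averages of training targets, so outside the training range their output stays flat. Here is a minimal sketch using synthetic data as a stand-in for the curves in the question (the data, the curve shape y = x * t, and all variable names are assumptions):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the real data (an assumption): one input x and
# a 20-point curve y(t) = x * t as the target variables.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 20)
x_train = rng.uniform(0.0, 5.0, size=(200, 1))
y_train = x_train * t  # shape (200, 20): one column per point on the curve

# RandomForestRegressor accepts a 2-D target, so no wrapper is needed.
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(x_train, y_train)

inside = model.predict([[2.5]])      # within the training range [0, 5]
outside_a = model.predict([[10.0]])  # beyond the training range
outside_b = model.predict([[20.0]])  # even further beyond
```

Here `outside_a` and `outside_b` are identical, because both inputs fall past the last split in every tree. If extrapolation matters, a model with an explicit functional form may be worth trying instead.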

Code and dataset(small size) for image clustering

Can anyone please provide code and a dataset for unsupervised image clustering? There are no resources available on the internet regarding image clustering and its implementation.
If you are looking for a tutorial with a dataset and Python code examples, here are some:
Keras & Sklearn for binary (cat or dog) clustering.
https://towardsdatascience.com/image-clustering-using-k-means-4a78478d2b83
Combining CNN and K-Means for multilabel clustering. (Data from Kaggle). At the end you can find all the code.
https://towardsdatascience.com/how-to-cluster-images-based-on-visual-similarity-cd6e7209fe34
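For a quick start without downloading anything, here is a minimal sketch that clusters the small digits image set bundled with scikit-learn. Using raw pixels as features and choosing 10 clusters are simplifying assumptions; the articles above extract CNN features first.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits

# Small 8x8 grayscale digit images that ship with scikit-learn.
digits = load_digits()

# Flatten each image into a 64-value vector and scale pixels to [0, 1].
X = digits.images.reshape(len(digits.images), -1) / 16.0

# Group the images into 10 clusters without using the labels.
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)
```

`cluster_ids[i]` is the cluster assigned to image `i`; the ids are arbitrary and need not match the actual digit.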

How to classify unlabelled data?

I am new to machine learning. I am trying to build a classifier that classifies text as having a URL or not having a URL. The data is not labelled; I just have textual data. I don't know how to proceed with it. Any help or examples are appreciated.
Since it's text, you can use the bag-of-words technique to create vectors.
You can use cosine similarity to cluster texts of a common type.
Using the clusters as labels, this gives you a labeled training set.
Then use a classifier, the choice of which depends on the number of clusters.
If you have two clusters, a binary classifier like logistic regression would work.
If you have multiple classes, you need to train a model based on multinomial logistic regression
or train multiple logistic models using the One-vs-Rest technique.
Lastly, you can test your model using k-fold cross-validation.
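The steps above can be sketched roughly as follows. This is only an illustration under assumptions: the example texts are made up, k-means on L2-normalized count vectors stands in for clustering by cosine similarity, and nothing guarantees that the resulting clusters line up with "has a URL" versus "no URL".

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import normalize

# Made-up example texts (an assumption; replace with your own data).
texts = [
    "read the docs at http://example.com/docs",
    "my blog is at http://example.org/blog",
    "download it from https://example.com/files",
    "see https://example.net for the schedule",
    "the slides are at http://example.org/slides",
    "sign up at https://example.com/signup",
    "the weather is lovely today",
    "I had pasta for dinner last night",
    "we are going hiking this weekend",
    "this book was hard to put down",
    "the meeting was moved to Friday",
    "my cat sleeps all day long",
]

# 1. Bag of words: one row per text, one column per word.
vectors = CountVectorizer().fit_transform(texts)

# 2. Cluster. For unit-length rows, squared Euclidean distance equals
#    2 - 2 * cosine similarity, so k-means is used here as a stand-in.
unit_vectors = normalize(vectors)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
pseudo_labels = kmeans.fit_predict(unit_vectors)

# 3. Treat the cluster ids as labels for a binary classifier.
clf = LogisticRegression(max_iter=1000)

# 4. k-fold cross-validation (k=3 because the toy set is tiny).
scores = cross_val_score(clf, vectors, pseudo_labels, cv=3)
```

Note that the scores only measure how well the classifier reproduces the cluster assignments, not how well it detects URLs.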
You cannot train a classifier with unlabeled data. You need labeled examples. There are services that will label it for you, but it might be simpler for you to do it by hand (I assume you can go through one per minute).
Stack Overflow is for programming; this question would be better suited to, say, Cross Validated. Maybe they'll have better suggestions than I do.
After you've labeled the data, there's a lot of info on the web on this subject - for example, this blog is a good place to start if you already have some grip on the issue.
Good luck!

Software for Image classification

Currently I am working on a project to classify a given set of test images into one of 5 predefined categories. I implemented logistic regression with a feature vector of 240 features for each image and trained it using 100 images per category. The training accuracy I achieved was ~98% for each category, whereas on a validation set of 500 images (100 images per category), only ~57% of the images were correctly classified.
Please suggest a few libraries/tools (preferably based on neural networks) that I can use to attain higher accuracy.
I tried using a Java-based tool, Neuroph (neuroph.sourceforge.net), on Windows, but it didn't run as expected.
Edit: The feature vectors were already provided for the project. I am also looking for a better feature extraction tool for images.
You can get help from this paper Image Classification
In my opinion, SVM is relatively better than logistic regression when it comes to multi-class response problems. We use it in e-commerce product classification, where there are thousands of response levels and thousands of features.
Based on your tags, I assume you would like a Python package; scikit-learn has good classification routines: scikit-learn.org.
I have had good success using the WEKA tools. You need to isolate the feature set that you are interested in and then apply a classifier from this library. The examples are very clear. http://weka.wikispaces.com
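To make the SVM and scikit-learn suggestions above concrete, here is a minimal sketch. The data is synthetic (an assumption, shaped like the question: 500 samples, 240 features, 5 classes), and the kernel and C value are untuned defaults rather than recommendations. Cross-validation is used because the gap between ~98% training accuracy and ~57% validation accuracy points to overfitting, which training accuracy alone cannot reveal.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the provided feature vectors (an assumption).
X, y = make_classification(
    n_samples=500, n_features=240, n_informative=20,
    n_classes=5, random_state=0,
)

# Scale the features, then fit an RBF-kernel SVM.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))

# 5-fold cross-validation: one held-out accuracy estimate per fold.
scores = cross_val_score(model, X, y, cv=5)
mean_accuracy = scores.mean()
```

With real data, replace `X` and `y` with your feature matrix and category labels; the pipeline keeps the scaling inside each fold so the held-out data does not leak into training.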

Prepare data for text classification using Scikit Learn SVM

I'm trying to apply SVM from scikit-learn to classify the tweets I collected.
There will be two categories; call them A and B.
For now, I have all the tweets categorized in two text files, 'A.txt' and 'B.txt'.
However, I'm not sure what type of input data the scikit-learn SVM expects.
I have a dictionary with the labels (A and B) as its keys and a dictionary of features (unigrams) and their frequencies as values.
Sorry, I'm really new to machine learning and not sure what I should do to get the SVM to work.
And I found that SVM is using numpy.ndarray as the type of its data input. Do I need to create one based on my own data?
Should it be something like this?
Labels   features   frequency
A        'book'     54
B        'movies'   32
Any help is appreciated.
Have a look at the documentation on text feature extraction.
Also have a look at the text classification example.
There is also a tutorial here:
http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
In particular, don't focus too much on SVM models (especially not sklearn.svm.SVC, which is more interesting for kernel models and hence not for text classification): a simple Perceptron, LogisticRegression, or Bernoulli naive Bayes model might work as well while being much faster to train.
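As a rough sketch of what the feature extraction step looks like: the classifier expects one row per tweet and one column per unigram, not one row per (label, word) pair. The example tweets below are made up (an assumption); in practice you would read one tweet per line from 'A.txt' and 'B.txt'.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up tweets standing in for the contents of A.txt and B.txt.
tweets_a = [
    "I loved this book",
    "great book, could not put it down",
    "reading a new novel tonight",
]
tweets_b = [
    "that movie was amazing",
    "watching movies all weekend",
    "the new film has a great cast",
]

texts = tweets_a + tweets_b
labels = ["A"] * len(tweets_a) + ["B"] * len(tweets_b)

# Turn the raw text into a sparse matrix of unigram counts:
# shape (number of tweets, number of distinct unigrams).
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

# Fit a simple linear model on the counts.
clf = LogisticRegression(max_iter=1000)
clf.fit(X, labels)

# New text must go through the same fitted vectorizer before predicting.
prediction = clf.predict(vectorizer.transform(["a book I enjoyed"]))
```

The vectorizer builds the frequency table for you, so there is no need to construct the numpy array by hand.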
