I am new to machine learning. I am trying to build a classifier that classifies text as having a URL or not having a URL. The data is not labelled; I just have textual data. I don't know how to proceed with it. Any help or examples are appreciated.
Since it's text, you can use the bag-of-words technique to create vectors.
You can use cosine similarity to cluster texts of a similar type.
Then use a classifier; which one to use depends on the number of clusters.
This way you have a labeled training set.
If you have two clusters, a binary classifier like logistic regression will work.
If you have multiple classes, you can train a multinomial logistic regression model,
or train multiple logistic models using the one-vs-rest technique.
Lastly, you can test your model using k-fold cross-validation.
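A minimal sketch of that pipeline with scikit-learn, using a tiny toy corpus as a stand-in for your own unlabelled texts. One caveat: the cluster ids are only pseudo-labels, so you should spot-check that the clusters actually line up with "has URL" vs "no URL" before trusting the classifier.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans
    from sklearn.linear_model import LogisticRegression

    # Toy data; in practice, load your own unlabelled texts here.
    texts = [
        "check out http://example.com for details",
        "great article at https://news.example.org today",
        "download the file from http://files.example.net",
        "see http://blog.example.com/post/1 for more",
        "had a lovely walk in the park today",
        "the meeting is rescheduled to friday",
        "thanks for the birthday wishes everyone",
        "what a great game last night",
    ]

    # Step 1: bag-of-words / TF-IDF vectors. Rows are L2-normalised, so
    # Euclidean k-means on them behaves much like cosine-similarity clustering.
    X = TfidfVectorizer().fit_transform(texts)

    # Step 2: cluster into two groups and treat the cluster ids as pseudo-labels.
    pseudo_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

    # Step 3: train a binary classifier on the pseudo-labelled data.
    clf = LogisticRegression(max_iter=1000).fit(X, pseudo_labels)

    # With a real-sized corpus you would evaluate with k-fold cross-validation,
    # e.g. cross_val_score(clf, X, pseudo_labels, cv=5).
    print(clf.predict(X[:2]), clf.predict(X[-2:]))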
You cannot train a classifier with unlabeled data. You need labeled examples. There are services that will label it for you, but it might be simpler for you to do it by hand (I assume you can go through one per minute).
Stack Overflow is for programming; this question would be better suited to, say, Cross Validated. Maybe they'll have better suggestions than I do.
After you've labeled the data, there's a lot of info on the web on this subject - for example, this blog is a good place to start if you already have some grip on the issue.
Good luck!
Related
I have a large training set of words labeled pos and neg to classify texts. I used TextBlob (according to this tutorial) to classify texts. While it works fairly well, it can be very slow for a large training set (e.g. 8k words).
I would like to try doing this with scikit-learn, but I'm not sure where to start. What would the above tutorial look like in scikit-learn? I'd also like the training set to include weights for certain words: some should pretty much guarantee that a particular text is classified as "positive", while others should guarantee that it's classified as "negative". And lastly, is there a way to indicate that certain parts of the analyzed text are more valuable than others?
Any pointers to existing tutorials or docs appreciated!
There is an excellent chapter on this topic in Sebastian Raschka's Python Machine Learning book and the code can be found here: https://github.com/rasbt/python-machine-learning-book/blob/master/code/ch08/ch08.ipynb.
He does sentiment analysis (what you are trying to do) on an IMDB dataset. His data is not as clean as yours - from the looks of it - so he needs to do a bit more pre-processing work. Your problem can be solved with these steps:
1. Create numerical features by vectorizing your text: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html
2. Train/test split: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
3. Train and test your favourite model, e.g.: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
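A rough sketch of those three steps (not the book's exact code), with a handful of made-up pos/neg examples standing in for your real training set:

    from sklearn.feature_extraction.text import HashingVectorizer
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression

    # Made-up corpus; replace with your own labelled texts.
    texts = ["I love this", "what a great movie", "really enjoyable stuff",
             "this was awful", "terrible and boring", "worst film ever"]
    labels = ["pos", "pos", "pos", "neg", "neg", "neg"]

    # 1. Vectorize the text. HashingVectorizer keeps memory bounded even
    #    for a very large vocabulary.
    X = HashingVectorizer(n_features=2**16).transform(texts)

    # 2. Train/test split.
    X_train, X_test, y_train, y_test = train_test_split(
        X, labels, test_size=0.33, random_state=0, stratify=labels)

    # 3. Train and test a classifier.
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print(clf.score(X_test, y_test))

This doesn't cover the per-word weights you mention; that would need a custom vectorizer or post-processing of the feature matrix.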
There are many ways to do this, such as TF-IDF (term frequency - inverse document frequency), a count vectorizer, latent semantic analysis (LSA), latent Dirichlet allocation (LDA), and Word2Vec.
Of the methods mentioned above, Word2Vec generally gives the best results. You can use the pre-trained Word2Vec model from Google, available at:
https://github.com/mmihaltz/word2vec-GoogleNews-vectors
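If you go that route, here is a small sketch using gensim (an assumption on my part, it isn't mentioned above) to load those pre-trained vectors; the file name is the one shipped in that repository and must be downloaded first (it is several GB).

    import numpy as np
    from gensim.models import KeyedVectors

    # Assumes the GoogleNews vectors from the repository above have been
    # downloaded to the working directory.
    wv = KeyedVectors.load_word2vec_format(
        "GoogleNews-vectors-negative300.bin.gz", binary=True)

    print(wv["movie"].shape)                 # a 300-dimensional vector
    print(wv.most_similar("movie", topn=3))  # semantically similar words

    # One simple way to get a document vector: average the vectors of the
    # words the document contains.
    def doc_vector(text):
        words = [w for w in text.lower().split() if w in wv]
        return (np.mean([wv[w] for w in words], axis=0)
                if words else np.zeros(wv.vector_size))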
I am trying to identify phonemes in voices using a training database of known ones.
I'm wondering if there is a way of identifying common features within my training sample and using that to classify a new one.
It seems like there are two paths:
1. Give the process raw/normalised data and have it return similar ones
2. Extract certain metrics such as pitch, formants, etc. and compare them to the training set
My interest is the first!
Any recommendations on machine learning or regression methods/algorithms?
Since you tagged Python, I highly recommend looking into scikit-learn, an excellent Python library for machine learning. Their docs are very thorough and should give you a good crash course in machine-learning algorithms and implementation (including classification, regression, clustering, etc.).
Your points 1 and 2 are not really different: 1) is the end result of a classification problem, and 2) describes the features you feed to the classifier. What you need is a good classifier (SVM, decision trees, hierarchical classifiers, etc.) and a good set of features (the pitch, formants, etc. that you mentioned).
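To make that concrete, here is a bare-bones scikit-learn sketch; the feature matrix is random placeholder data standing in for whatever pitch/formant/MFCC features you extract per labelled phoneme:

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import train_test_split

    # Placeholder data: one row per labelled phoneme sample, one column per
    # extracted feature (pitch, formants, MFCCs, ...). Replace with real features.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 13))        # 200 samples, 13 features
    y = rng.integers(0, 4, size=200)      # 4 phoneme classes (placeholder labels)

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    clf = SVC(kernel="rbf").fit(X_train, y_train)
    print(clf.score(X_test, y_test))      # near chance here, since the data is random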
Currently I am working on a project to classify a given set of test images into one of 5 predefined categories. I implemented logistic regression with a feature vector of 240 features per image and trained it using 100 images per category. The learning accuracy I achieved was ~98% for each category, whereas when tested on a validation set of 500 images (100 images per category), only ~57% of the images were classified correctly.
Please suggest a few libraries/tools I can use (preferably based on neural networks) in order to attain higher accuracy.
I tried using a Java-based tool, Neuroph (neuroph.sourceforge.net), on Windows, but it didn't run as expected.
Edit: The feature vectors were already provided for the project. I am also looking for a better feature-extraction tool for images.
You can get help from this paper: Image Classification.
In my opinion, SVM is relatively better than logistic regression for multi-class problems. We use it for e-commerce product classification, where there are thousands of response levels and thousands of features.
Based on your tags, I assume you would like a Python package; scikit-learn has good classification routines: scikit-learn.org.
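For example, here is a small neural-network baseline using scikit-learn's MLPClassifier, shown on random placeholder data with the shapes from the question. The large gap between your training and validation accuracy suggests overfitting, so the regularisation term (alpha) and cross-validation matter more than the particular library.

    import numpy as np
    from sklearn.neural_network import MLPClassifier
    from sklearn.preprocessing import StandardScaler
    from sklearn.pipeline import make_pipeline
    from sklearn.model_selection import cross_val_score

    # Placeholder data in the shapes described: 500 images, 240 features, 5 classes.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 240))
    y = np.repeat(np.arange(5), 100)

    # Feature scaling plus a small, L2-regularised network.
    model = make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(64,), alpha=1e-2,
                      max_iter=1000, random_state=0),
    )
    print(cross_val_score(model, X, y, cv=5))   # ~0.2 here, since the data is random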
I have had good success using the WEKA tools. You need to isolate the feature set you are interested in and then apply a classifier from that library. The examples are very clear. http://weka.wikispaces.com
I'm trying to apply SVM from scikit-learn to classify the tweets I collected.
So there will be two categories; call them A and B.
For now, I have all the tweets categorized into two text files, 'A.txt' and 'B.txt'.
However, I'm not sure what kind of data input the scikit-learn SVM expects.
I have a dictionary with labels (A and B) as its keys and a dictionary of features (unigrams) and their frequencies as values.
Sorry, I'm really new to machine learning and not sure what I should do to get the SVM to work.
I found that the SVM uses numpy.ndarray as its data input type. Do I need to create one from my own data?
Should it be something like this?
Label    Feature     Frequency
A        'book'      54
B        'movies'    32
Any help is appreciated.
Have a look at the documentation on text feature extraction.
Also have a look at the text classification example.
There is also a tutorial here:
http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
In particular, don't focus too much on SVM models (especially not sklearn.svm.SVC, which is mainly interesting for kernel methods and hence not a natural fit for text classification): a simple Perceptron, LogisticRegression, or Bernoulli naive Bayes model might work just as well while being much faster to train.
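A small end-to-end sketch, assuming A.txt and B.txt each contain one tweet per line. The vectorizer produces the sparse numeric matrix the estimators expect, so there is no need to build numpy arrays by hand:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    # Assumes one tweet per line in each file.
    def load(path):
        with open(path, encoding="utf-8") as f:
            return [line.strip() for line in f if line.strip()]

    tweets_a = load("A.txt")
    tweets_b = load("B.txt")

    texts = tweets_a + tweets_b
    labels = ["A"] * len(tweets_a) + ["B"] * len(tweets_b)

    # Unigram features; the result is a scipy sparse matrix, which
    # scikit-learn estimators accept directly.
    X = TfidfVectorizer().fit_transform(texts)

    clf = LogisticRegression(max_iter=1000)
    print(cross_val_score(clf, X, labels, cv=5))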
I am working on a classification task related to written text and I wonder how important it is to perform some kind of "feature selection" procedure in order to improve the classification results.
I am using a number of features (around 40) related to the subject, but I am not sure whether all the features are really relevant, or in which combinations. I am experimenting with SVM (scikit-learn) and LDAC (mlpy).
If I have a mix of relevant and irrelevant features, I assume I will get poor classification results. Should I perform a "feature selection" procedure before classification?
scikit-learn has an RFE procedure (which can be driven by a tree-based estimator) that is able to rank the features. Is it meaningful to rank the features with a tree-based RFE, choose the most important ones, and then perform the actual classification with SVM (non-linear) or LDAC? Or should I implement some kind of wrapper method using the same classifier to rank the features (trying to classify with different groups of features would be very time-consuming)?
Just try it and see if it improves the classification score as measured with cross-validation. Also, before trying RFE, I would try less CPU-intensive schemes such as univariate chi2 feature selection.
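A rough sketch of that comparison on placeholder data (40 features, as in the question); chi2 needs non-negative inputs, hence the MinMaxScaler, and the k=20 cut-off is arbitrary:

    import numpy as np
    from sklearn.feature_selection import SelectKBest, chi2
    from sklearn.preprocessing import MinMaxScaler
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import SVC
    from sklearn.model_selection import cross_val_score

    # Placeholder data: 200 samples, 40 features, binary target.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 40))
    y = rng.integers(0, 2, size=200)

    # chi2 requires non-negative features, so rescale to [0, 1] first.
    with_selection = make_pipeline(MinMaxScaler(), SelectKBest(chi2, k=20), SVC())
    without_selection = make_pipeline(MinMaxScaler(), SVC())

    print("with selection:   ", cross_val_score(with_selection, X, y, cv=5).mean())
    print("without selection:", cross_val_score(without_selection, X, y, cv=5).mean())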
Having 40 features is not too bad. Some machine-learning methods are impeded by irrelevant features, but many are quite robust to them (e.g. naive Bayes, SVM, decision trees). You probably don't need to do feature selection unless you decide to add many more features.
It's not a bad idea to throw away useless features, but don't waste your own mental time on trying that out unless you have a particular motivation to.