Algorithm for recognizing similar data?

I've been given a YouTube trending dataset and asked to build a predictive model that outputs the probability of a video trending, with at least 60% accuracy.
I have the title, channel, thumbnail_link, views, likes, dislikes, comments, date, ...
I've done some analysis and, go figure, the important columns are
category and tags (a "|"-separated list).
The problem is that every video in the dataset has trended, so I can't fit a classifier to predict a trending yes/no column, and I can't use a regression algorithm without changing the goal to something like "predict how liked it will be".
So it sounds like what I'm looking for is a clustering algorithm. I've looked into KMeans, but as far as I can tell it won't do the trick.
I'm thinking I could compare videos one by one on the categories and tags they contain and score each by the popularity of those, or write a distance-based similarity function, but the implication is that I should use scikit-learn.

This sounds like a one-class classification problem. Some options are:
Fit a representative distribution to the data, then for a new observation (video) check how likely it is to have come from that distribution.
Fit a classifier that essentially finds the boundaries of the data, then for a new observation report how far inside or outside the boundary it is, for example sklearn.svm.OneClassSVM.
Fit cluster centers, or find archetypal examples, then for a new observation report how far it is from the cluster center compared to an average observation in the training data.
Just some ideas, there are certainly other approaches. :)
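As a minimal sketch of the boundary-based option above, the following fits sklearn.svm.OneClassSVM on trending-only examples and scores new observations. The feature matrix is synthetic and hypothetical (the question gives no numeric encoding of category and tags), and the nu and gamma settings are illustrative assumptions rather than tuned values. Note that decision_function returns a signed distance-like score, not a calibrated probability.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Hypothetical numeric features for videos that all trended
# (for example, encoded category and tag popularity scores).
X_trending = rng.normal(loc=0.0, scale=1.0, size=(500, 4))

# nu is an upper bound on the fraction of training points treated as outliers.
model = make_pipeline(
    StandardScaler(),
    OneClassSVM(kernel="rbf", gamma="scale", nu=0.05),
)
model.fit(X_trending)

X_new = np.array([
    [0.1, -0.2, 0.0, 0.3],  # resembles the training data
    [8.0, 8.0, 8.0, 8.0],   # far from anything seen in training
])

# Positive scores lie inside the learned boundary, negative scores outside.
scores = model.decision_function(X_new)
labels = model.predict(X_new)  # +1 = inlier, -1 = outlier
print(scores, labels)
```

Turning these scores into something probability-like would need an extra calibration step, which this sketch does not attempt.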

Related

Classifier for time based data to binary label

I have access to a dataframe of 100 people and how they performed on a certain motion test. The frame contains about 25,000 rows per person, since each person's performance is recorded (approximately) every centisecond (10^-2 s). We want to use this data to predict a binary y-label: whether or not someone has a motor problem.
Neural networks trained on the means and variances of certain columns per person classified about 72% of the data correctly.
A naive Bayes classifier on the means and variances of certain columns per person classified about 80% correctly.
Since this is time-based data ("performance on this test through time"), it was suggested that we use recurrent neural networks. I've looked into this, and I find that they are mostly used to predict future events, i.e. the events happening in the next centiseconds.
The question is: is it generally feasible to use RNNs on (in a way time-based) data like this to predict a binary label? If not, what is?

Yes, it definitely is feasible and also very common. Search for any document classification task (e.g. sentiment analysis) for examples of this kind of task.
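To illustrate the "many-to-one" setup this refers to, here is a rough NumPy sketch of a simple (Elman) RNN forward pass that reads a whole sequence and emits a single probability from the final hidden state. The weights are random and untrained, and the layer sizes and sequence length are hypothetical; in practice the weights would be learned with a deep learning library, and very long sequences such as 25,000 steps would probably need windowing or downsampling (an assumption, not something stated in the thread).

```python
import numpy as np

def rnn_classify(sequence, W_xh, W_hh, b_h, w_out, b_out):
    """Run a simple RNN over one sequence and return P(label = 1)."""
    h = np.zeros(W_hh.shape[0])
    for x_t in sequence:  # one update per time sample
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
    logit = w_out @ h + b_out  # only the final hidden state feeds the output
    return 1.0 / (1.0 + np.exp(-logit))

rng = np.random.default_rng(0)
n_features, n_hidden = 3, 8  # hypothetical sizes

# Untrained random weights, used only to show the shapes involved.
W_xh = rng.normal(scale=0.1, size=(n_hidden, n_features))
W_hh = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
b_h = np.zeros(n_hidden)
w_out = rng.normal(scale=0.1, size=n_hidden)
b_out = 0.0

# One person's recording: 250 time steps of 3 measurements each.
sequence = rng.normal(size=(250, n_features))
p = rnn_classify(sequence, W_xh, W_hh, b_h, w_out, b_out)
print(p)
```

The same structure is what sentiment classifiers use: a sequence goes in, and one label comes out at the end.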

Which classification model do you suggest for predicting a credit score?

I have a data set that contains information about whether medium-budget companies can get loans. It covers approximately 38,000 different companies and whether each will receive a loan. Based on this data, I'm trying to estimate each company's credit score. What would you suggest?

Do you have credit scores? Without labeled data I think you might consider reformulating the problem.
If you do, then you can use any number of regression algorithms, from OLS all the way up to an ANN. Rather than look for the "one true" algorithm, many projects use TPOT or grid search as part of model selection.
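A minimal sketch of the grid search idea, assuming labeled credit scores are available: GridSearchCV tries each candidate setting with cross-validation and keeps the best one. The data here is synthetic, and Ridge with this particular alpha grid is just an illustrative choice, not a recommendation for credit scoring.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)

# Synthetic stand-in for company features and known credit scores.
X = rng.normal(size=(200, 5))
true_coef = np.array([3.0, -2.0, 0.0, 1.5, 0.5])
y = X @ true_coef + rng.normal(scale=0.1, size=200)

search = GridSearchCV(
    Ridge(),
    param_grid={"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]},
    scoring="r2",
    cv=5,  # 5-fold cross-validation for every alpha
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

The same pattern works with any scikit-learn regressor by swapping the estimator and the parameter grid.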

How to explain clustering results?

Say I have a high dimensional dataset which I assume to be well separable by some kind of clustering algorithm. And I run the algorithm and end up with my clusters.
Is there any way (preferably not "hacky" or some kind of heuristic) to explain "what features and thresholds were important in making members of cluster A (for example) part of cluster A?"
I have tried looking at cluster centroids but this gets tedious with a high dimensional dataset.
I have also tried fitting a decision tree to my clusters and then looking at the tree to determine which decision path most of the members of a given cluster follow. I have also tried fitting an SVM to my clusters and then using LIME on the closest samples to the centroids in order to get an idea of what features were important in classifying near the centroids.
However, both of these latter two approaches require the use of supervised learning in an unsupervised setting and feel "hacky" to me, whereas I'd like something more grounded.

Have you tried using PCA or some other dimensionality reduction technique and checking whether the clusters still hold? Sometimes relationships still exist in lower dimensions (caveat: it doesn't always help one's understanding of the data). There is a good article about visualizing MNIST data: http://colah.github.io/posts/2014-10-Visualizing-MNIST/. I hope this helps a bit.
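One way to try this check, sketched under assumptions: cluster in the full space, cluster again after projecting to two principal components, and compare the two labelings with the adjusted Rand index. The data is synthetic (two groups that differ only in the first three of 20 features), and KMeans is used only because it is simple; substitute whatever algorithm produced the original clusters.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)

# Synthetic data: two groups in 20 dimensions, separated in features 0-2 only.
centers = np.zeros((2, 20))
centers[1, :3] = 6.0
X = np.vstack([rng.normal(loc=c, size=(100, 20)) for c in centers])

labels_full = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

pca = PCA(n_components=2).fit(X)
X_2d = pca.transform(X)
labels_2d = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_2d)

# 1.0 means the two clusterings agree perfectly (up to label names).
agreement = adjusted_rand_score(labels_full, labels_2d)

# Features with the largest absolute loadings on the first component.
top_features = np.argsort(np.abs(pca.components_[0]))[::-1][:3]
print(agreement, top_features)
```

If the clusters survive the projection, the component loadings give a first hint about which original features drive the separation.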

Do not treat the clustering algorithm as a black box.
Yes, k-means uses centroids. But most algorithms for high-dimensional data don't (and don't use k-means!). Instead, they will often select some features, projections, subspaces, manifolds, etc. So look at what information the actual clustering algorithm provides!

Formatting and combining word frequency with other data machine learning python

I'm new to machine learning algorithms. I read the scikit-learn website and other SO posts extensively, which led me to build my first machine learning models using RandomForestClassifier and LinearSVC.
I'm working on medical notes. Each patient stay is associated (or not) with a code corresponding to a complication (bleeding, infection, heart attack...).
Using the notes, fitted and transformed with CountVectorizer and TfidfTransformer, I can accurately predict most of the codes. However, I'd like to add more data to my training dataset: length of stay, number of operations, title of operations, ICU stay duration, etc.
After searching the web and SO, I ended up adding all continuous/binary/scaled values to my word frequency array.
E.g. [0,0,0.34,0,0.45,0, 2, 45] (the last 2 numbers are the added data, whereas the previous ones come from CountVectorizer and tfidf.fit_transform(train_set)).
However, this seems to me to be a crude way to combine data, and a huge number of words could mask the other data.
I tried to structure my data like [[0,0,0.34,0,0.45,0],[2],[45]], but it doesn't work.
I searched the web but found no real clue, even though I'm probably not the first one facing this issue... :p
Thanks for your help.
Edit:
Thanks for your detailed and valuable answer; I really appreciate it. However, what exactly is the 0-1 range: is it the predict_proba value (http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier.predict)? I understood the score to be the accuracy of the prediction model. Then, once you have a prediction for each variable, do you average them all? Finally, I'm working with multiple outputs, but I guess that's not a problem since I can get a prediction for each output. (By the way, predict_proba(X) gives me an array like [array([[0., 1.]]), array([[0.2, 0.8]]), ...] with a random forest classifier. I guess one of the numbers is the probability of the output, but I haven't explored this yet!)

Your first solution of just appending to the list is the correct solution. However, you should think about what this implies. If you have 100 words and add two additional features, each specific word will get the same "weight" as the added features, i.e. your added features won't be treated very strongly in the model. Additionally, you're saying that the last feature, with a value of 45, is 100x the value of the feature fourth from the end (0.45).
One common way to get around that is to use an ensemble model. Instead of adding those features to your list of words and predicting, first build a prediction model just using the words. That prediction will be in the range 0-1 and will capture the "sentiment" of the article. Then, scale your other variables (minmax scaler, normal distribution, etc.). Finally, combine the score from the words with the last two scaled variables and run another prediction on a list like this [.86,.2,.65]. In this way, you have transformed all of the words to a sentiment score, which you can use as a feature.
Hope that helps.
EDIT PER YOUR UPDATE ABOVE
Yes, in this instance you could use the predict_proba, but really if everything is scaled correctly, and you are using 1/0 as your targets for a class you don't need the predict_proba. The idea is to take the prediction from the words and combine it with the other variables. You do not average the predictions, you make a prediction from the predictions! This is called ensemble learning. Train another model with the output of your predictions as the features. Here is a flow of what you need to do.
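The two-stage approach described above could be sketched roughly as follows. The toy notes, the two numeric columns (standing in for length of stay and number of operations), and the use of LogisticRegression for both stages are assumptions for illustration. Out-of-fold probabilities from cross_val_predict are used for the first stage so that the second model is not trained on scores the text model produced for its own training rows; that detail is an addition, not something stated in the answer.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data: 6 notes with a complication code, 6 without.
notes = [
    "bleeding after surgery transfusion required",
    "postoperative bleeding hematoma drained",
    "heavy bleeding transfusion given",
    "bleeding from wound site controlled",
    "hematoma and bleeding noted",
    "transfusion for ongoing bleeding",
    "uneventful recovery discharged home",
    "routine follow up no complaints",
    "recovery uneventful wound clean",
    "discharged home in good condition",
    "no complaints routine recovery",
    "wound clean discharged routine",
]
extra = np.array(
    [[9, 3], [12, 2], [10, 4], [8, 2], [11, 3], [14, 4],
     [2, 1], [3, 1], [2, 1], [4, 1], [3, 1], [2, 1]],
    dtype=float,
)
y = np.array([1] * 6 + [0] * 6)

# Stage 1: text-only model; its probability is the "sentiment" score.
text_model = make_pipeline(TfidfVectorizer(), LogisticRegression())
text_score = cross_val_predict(
    text_model, notes, y, cv=3, method="predict_proba"
)[:, 1]

# Stage 2: combine the text score with the scaled numeric features.
scaler = MinMaxScaler().fit(extra)
X_stage2 = np.column_stack([text_score, scaler.transform(extra)])
final_model = LogisticRegression().fit(X_stage2, y)

# Scoring a new stay: refit the text model on all training notes first.
text_model.fit(notes, y)
new_notes = ["bleeding and hematoma after operation"]
new_extra = np.array([[10, 3]], dtype=float)
new_X = np.column_stack(
    [text_model.predict_proba(new_notes)[:, 1], scaler.transform(new_extra)]
)
new_prob = final_model.predict_proba(new_X)[:, 1]
print(new_prob)
```

Each row fed to the second model has the shape [text score, scaled feature, scaled feature], which matches the [.86,.2,.65] example in the answer.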

Thanks for your time and your detailed answer. I think I get it. In short:
Make a prediction based on the words, and for each bag of words in the training set (t1), pull out a "sentiment".
Create a new array for each training-set row with the sentiment and the other values -> new training set (t2).
Make a prediction based on t2.
Apply the previous steps to the test set.
One more question, though!
What is the "sentiment" value? For each bag of words, I have a sparse matrix (CountVectorizer + TF-IDF). So how do you calculate the sentiment? Do you run each row of the test set against the rest of the test set? And is the sentiment the clf.predict(X) value?

Classifying Documents into Categories

I've got about 300k documents stored in a Postgres database that are tagged with topic categories (there are about 150 categories in total). I have another 150k documents that don't yet have categories. I'm trying to find the best way to programmatically categorize them.
I've been exploring NLTK and its Naive Bayes Classifier. Seems like a good starting point (if you can suggest a better classification algorithm for this task, I'm all ears).
My problem is that I don't have enough RAM to train the NaiveBayesClassifier on all 150 categories/300k documents at once (training on 5 categories used 8GB). Furthermore, the accuracy of the classifier seems to drop as I train on more categories (90% accuracy with 2 categories, 81% with 5, 61% with 10).
Should I just train a classifier on 5 categories at a time, and run all 150k documents through the classifier to see if there are matches? It seems like this would work, except that there would be a lot of false positives where documents that don't really match any of the categories get shoe-horned into one by the classifier just because it's the best match available... Is there a way to have a "none of the above" option for the classifier just in case the document doesn't fit into any of the categories?
Here is my test class http://gist.github.com/451880

You should start by converting your documents into TF-log(1 + IDF) vectors. Term frequencies are sparse, so you should use a Python dict with terms as keys and counts as values, then divide by the total count to get the global frequencies.
Another solution is to use abs(hash(term)), for instance, as positive integer keys. Then you can use scipy.sparse vectors, which are handier and more efficient for linear algebra operations than Python dicts.
Also build the 150 frequency vectors by averaging the frequencies of all the labeled documents belonging to the same category. Then, for a new document to label, you can compute the cosine similarity between the document vector and each category vector and choose the most similar category as the label for your document.
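The dict-based centroid approach just described might look like the sketch below. The tiny corpus and category names are hypothetical, whitespace tokenization is a simplification, and log(1 + N/df) is only one possible reading of the "log(1 + IDF)" weighting, so treat that formula as an assumption.

```python
import math
from collections import Counter, defaultdict

def tf(text):
    """Term frequencies of one document as a sparse dict."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}

def idf_weights(docs):
    """Assumed weighting: log(1 + N / document frequency) per term."""
    df = Counter(term for doc in docs for term in set(doc.lower().split()))
    return {term: math.log(1 + len(docs) / d) for term, d in df.items()}

def weighted(tf_vec, idf):
    # Terms never seen in the labeled corpus get weight 0.
    return {t: f * idf.get(t, 0.0) for t, f in tf_vec.items()}

def cosine(a, b):
    dot = sum(v * b.get(t, 0.0) for t, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def category_centroids(labeled_docs, idf):
    """Average the weighted vectors of all documents in each category."""
    sums = defaultdict(lambda: defaultdict(float))
    sizes = Counter()
    for text, label in labeled_docs:
        sizes[label] += 1
        for term, w in weighted(tf(text), idf).items():
            sums[label][term] += w
    return {
        label: {t: w / sizes[label] for t, w in vec.items()}
        for label, vec in sums.items()
    }

# Hypothetical labeled corpus.
labeled = [
    ("the striker scored a late goal", "sports"),
    ("the team won the league match", "sports"),
    ("parliament passed the budget bill", "politics"),
    ("the minister announced an election", "politics"),
]
idf = idf_weights([text for text, _ in labeled])
centroids = category_centroids(labeled, idf)

def classify(text):
    vec = weighted(tf(text), idf)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))

print(classify("a late goal decided the match"))
```

Because cosine returns a similarity score per category, a minimum-similarity cutoff could also serve as a rough "none of the above" option, though choosing that cutoff is left open here.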
If this is not good enough, then you should try training a logistic regression model with an L1 penalty, as explained in this example from scikit-learn (this is a wrapper for liblinear, as explained by @ephes). The vectors used to train your logistic regression model should be the previously introduced TF-log(1 + IDF) vectors to get good performance (precision and recall). The scikit-learn library offers a sklearn.metrics module with routines to compute those scores for a given model and a given dataset.
For larger datasets, you should try Vowpal Wabbit, which is probably the fastest rabbit on earth for large-scale document classification problems (but it has no easy-to-use Python wrappers, AFAIK).

How big (in number of words) are your documents? Memory consumption at 150K training docs should not be an issue.
Naive Bayes is a good choice, especially when you have many categories with only a few training examples or very noisy training data. But in general, linear Support Vector Machines perform much better.
Is your problem multiclass (a document belongs to exactly one category) or multilabel (a document belongs to one or more categories)?
Accuracy is a poor choice for judging classifier performance. You should rather use precision vs. recall, the precision-recall breakeven point (PRBP), F1, or AUC, and look at the precision vs. recall curve, where recall (x) is plotted against precision (y) as you vary your confidence threshold (whether a document belongs to a category or not). Usually you would build one binary classifier per category (positive training examples of one category vs. all other training examples that don't belong to your current category). You'll have to choose an optimal confidence threshold per category. If you want to combine those per-category measures into a global performance measure, you'll have to micro-average (sum up all true positives, false positives, false negatives and true negatives and calculate combined scores) or macro-average (calculate the score per category and then average those scores over all categories).
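To make the micro vs. macro distinction concrete, here is a small worked example in plain Python. The per-category counts are made up for illustration; the point is that macro-averaging weights every category equally, while micro-averaging is dominated by the large categories.

```python
def f1(tp, fp, fn):
    """F1 score from true positives, false positives and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical (tp, fp, fn) counts from one binary classifier per category.
counts = {
    "sports": (90, 10, 10),
    "politics": (40, 10, 20),
    "rare_topic": (1, 1, 8),
}

# Macro: score each category separately, then average the scores.
macro_f1 = sum(f1(*c) for c in counts.values()) / len(counts)

# Micro: pool the raw counts over all categories, then score once.
tp, fp, fn = (sum(c[i] for c in counts.values()) for i in range(3))
micro_f1 = f1(tp, fp, fn)

print(round(macro_f1, 3), round(micro_f1, 3))
```

Here the poorly handled rare category pulls the macro score well below the micro score, which is why the choice of average matters when category sizes are very uneven.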
We have a corpus of tens of millions of documents, millions of training examples and thousands of categories (multilabel). Since we face serious training time problems (the number of documents that are new, updated, or deleted per day is quite high), we use a modified version of liblinear. But for smaller problems, using one of the Python wrappers around liblinear (liblinear2scipy or scikit-learn) should work fine.

Is there a way to have a "none of the above" option for the classifier just in case the document doesn't fit into any of the categories?
You might get this effect simply by having a "none of the above" pseudo-category trained each time. If the max you can train is 5 categories (though I'm not sure why it's eating up quite so much RAM), train 4 actual categories from their actual 2K docs each, and a "none of the above" one with its 2K documents taken randomly from all the other 146 categories (about 13-14 from each if you want the "stratified sampling" approach, which may be sounder).
Still, this feels like a bit of a kludge, and you might be better off with a completely different approach: find a multi-dimensional doc measure that separates your 300K pre-tagged docs into 150 reasonably separable clusters, then just assign each of the other yet-untagged docs to the appropriate cluster as thus determined. I don't think NLTK has anything directly available to support this kind of thing, but, hey, NLTK's been growing so fast that I may well have missed something... ;-)
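For the pseudo-category idea described first, a training set could be assembled along these lines. The function name, the "none_of_the_above" label, and the toy document store are all hypothetical; the quota logic spreads the pseudo-category's documents as evenly as possible over the remaining categories, which is the "stratified sampling" variant.

```python
import random
from collections import Counter

def build_training_set(docs_by_category, target_categories, per_class, seed=0):
    """Sample per_class docs for each target category plus a stratified
    'none_of_the_above' class drawn from all remaining categories."""
    rng = random.Random(seed)
    training = []

    for cat in target_categories:
        docs = docs_by_category[cat]
        for doc in rng.sample(docs, min(per_class, len(docs))):
            training.append((doc, cat))

    others = [c for c in docs_by_category if c not in target_categories]
    if others:
        # Spread the quota evenly; the first `extra` categories get one more.
        base, extra = divmod(per_class, len(others))
        for i, cat in enumerate(others):
            quota = base + (1 if i < extra else 0)
            docs = docs_by_category[cat]
            for doc in rng.sample(docs, min(quota, len(docs))):
                training.append((doc, "none_of_the_above"))
    return training

# Hypothetical store: 10 categories with 50 documents each.
docs_by_category = {
    f"cat_{i}": [f"doc_{i}_{j}" for j in range(50)] for i in range(10)
}
targets = ["cat_0", "cat_1", "cat_2", "cat_3"]

training = build_training_set(docs_by_category, targets, per_class=20)
label_counts = Counter(label for _, label in training)
print(label_counts)
```

With 2K documents per class and 146 remaining categories, the same divmod split yields 13 or 14 documents from each, in line with the figures in the answer.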
