Related
I'm currently trying to use HDBSCAN to cluster movie data. The goal is to cluster similar movies together (based on movie info like keywords, genres, actor names, etc) and then apply LDA to each cluster and get the representative topics. However, I'm having a hard time evaluating the results (apart from visual analysis, which is not great as the data grows). With LDA, although it's hard to evaluate it, i've been using the coherence measure. However, does anyone have any idea on how to evaluate the clusters made by HDBSCAN? I haven't been able to find much info on it, so if anyone has any idea, I'd very much appreciate!
HDBSCAN implements Density-Based Clustering Validation in the method called relative_validity. It will allow you to compare one clustering, obtained with a given set of hyperparameters, to another one.
In general, read about cluster analysis and cluster validation.
Here's a good discussion about this with the author of the HDBSCAN library.
Its the same problem everywhere in unsupervised learning.
It is unsupervised, you are trying to discover something new and interesting. There is no way for the computer to decide whether something is actually interesting or new. It can decide and trivial cases when the prior knowledge is coded in machine processable form already, and you can compute some heuristics values as a proxy for interestingness. But such measures (including density-based measures such as DBCV are actually in no way better to judge this than the clustering algorithm itself is choosing the "best" solution).
But in the end, there is no way around manually looking at the data, and doing the next steps - try to put into use what you learned of the data. Supposedly you are not invory tower academic just doing this because of trying to make up yet another useless method... So use it, don't fake using it.
You can try the clusteval library. This library helps your to find the optimal number of clusters in your dataset, also for hdbscan. When you have the cluster labels, you can start enrichment analysis using hnet.
pip install clusteval
pip install hnet
Example:
# Import library
from clusteval import clusteval
# Set the method
ce = clusteval(method='hdbscan')
# Evaluate
results = ce.fit(X)
# Make plot of the evaluation
ce.plot()
# Make scatter plot using the first two coordinates.
ce.scatter(X)
So at this point you have the optimal detected cluster labels and now you may want to know whether there is association between any of the clusters with a (group of) feature(s) in your meta-data. The idea is to compute for each cluster label how often it is seen for a particular class in your meta-data. This can be defined with a P-value. The lower the P-value (below alpha=0.05), the less likely it happened by random chance.
results is a dict and contains the optimal cluster labels in the key labx. With hnet we can compute the enrichment very easily. More information can be found here: https://erdogant.github.io/hnet
# Import library
import hnet
# Get labels
clusterlabels = results['labx']
# Compute the enrichment of the cluster labels with the dataframe df
enrich_results = hnet.enrichment(df, clusterlabels)
When we look at the enrich_results, there is a column with category_label. These are the metadata variables of the dataframe df that we gave as an input. The second columns: P stands for P-value, which is the computed significance of the catagory_label with the target variable y. In this case, target variable y are are the cluster labels clusterlabels.
The target labels in y can be significantly enriched more then once. This means that certain y are enriched for multiple variables in the dataframe. This can occur because we may need to better estimate the cluster labels or its a mixed group or something else.
More information about cluster enrichment can be found here:
https://erdogant.github.io/hnet/pages/html/Use%20Cases.html#cluster-enrichment
I am new to machine learning and I am wondering whether it would be possible to use my available biological data for clustering. I want to find out whether a group of DNA sequences can be clustered into two groups, efficient and not efficient.
I have five sets, each containing about 480 short sequences (lets call them samples). Each set is having an effect with different strength:
Set1 - Very good effect
Set2 - Good effect
Set3 - Minor effect
Set4 - Very minor effect
Set5 - No effect
Each sample has some features, e.g. free energy,starting with a specific nucleotide...
Now my question is whether I can find out which type of sample in my sets are playing a role for the effect of the whole set. My only assumption is that in set1 I have more efficient samples then in set5 (either none or very few). A very simple (not realistic) result could be, all samples which start with nucleotide 'A' end end with nucleotide 'C' are causing the effect.
Is it possible to use machine learning to find out?
Thanks!
That definitely sounds like a problem where machine learning could give good results. I recommend that you look into scikit-learn, a powerful and easy to use toolkit for machine learning in Python. There are many introductory examples and tutorials available.
For your use case, I would say that random forests could give good results, although it's hard to say without knowing more about the structure of the data. They are available in the class RandomForestClassifier in sklearn. Again, there are many tutorials and examples to be found.
Since your training data is unlabeled, you may want to look into unsupervised learning methods. A simple class of such methods are clustering algorithms. In sklearn, you can find, for instance, k-means clustering along other such algorithms. The idea would be to let the algorithm split your data into different cluster and see if there is any correlation between cluster membership and observed effect.
It is unclear from your description what the 5 sets (what sound like labels) correspond to, but I will assume that you are essentially asking about feature learning: you would like to know which features to choose to best predict what set a given sequence is from. Determining this from scratch is an open problem in machine learning and there are many possible approaches depending on the particulars of your situation.
You can select a set of features (just by making logical guesses) and calculate them for all sequences, then perform PCA on all the vectors you have generated. PCA will give you the linear combination of features that accounts for the most variability in your data which is useful in designing meaningful features.
I am collecting a lot of really interesting data points as users come to my Python web service. For example, I have their current city, state, country, user-agent, etc. What I'd like to be able to do is run these through some type of machine learning system / algorithm (maybe a Bayesian classifier?), with the eventual goal of getting e-mail notifications when something out-of-the-ordinary occurs (anomaly detection). For example, Jane Doe has only ever logged in from USA on Chrome. So if she suddenly logs into my web service from the Ukraine on Firefox, I want to see that as a highly 'unusual' event and fire off a notification.
I am using CouchDB (specifically with Cloudant) already, and I see people often saying here and there online that Cloudant / CouchDB is perfect for this sort of thing (big data analysis). However I am at a complete loss for where to start. I have not found much in terms of documentation regarding relatively simple tracking of outlying events for a web service, let alone storing previously 'learned' data using CouchDB. I see several dedicated systems for doing this type of data crunching (PredictionIO comes to mind), but I can't help but feel that they are overkill given the nature of CouchDB in the first place.
Any insight would be much appreciated. Thanks!
You're correct in assuming that this is a problem ideally suited to Machine Learning, and scikit-learn.org is my preferred library for these types of problems. Don't worry about specifics - (couchdb cloudant) for now, lets get your problem into a state where it can be solved.
If we can assume that variations in log-in details (time, location, user-agent etc.) for a given user are low, then any large variation from this would trigger your alert. This is where the 'outlier' detection that #Robert McGibbon suggested comes into play.
For example, squeeze each log-in detail into one dimension, and the create a log-in detail vector for each user (there is significant room for improving this digest of log-in information);
log-in time (modulo 24 hrs)
location (maybe an array of integer locations, each integer representing a different country)
user-agent (a similar array of integer user-agents)
and so on. Every time a user logs in, create this detail array and store it. Once you have accumulated a large set of test data you can try running some ML routines.
So, we have a user and a set of log-in data corresponding to successful log-ins (a training set). We can now train a Support Vector Machine to recognise this users log-in pattern:
from sklearn import svm
# training data [[11.0, 2, 2], [11.3, 2, 2] ... etc]
train_data = my_training_data()
# create and fit the model
clf = svm.OneClassSVM()
clf.fit(train_data)
and then, every time a new log-in even occurs, create a single log-in detail array and pass that past the SVM
if clf.predict(log_in_data) < 0:
fire_alert_event()
else:
# log-in is not dissimilar to previous attempts
print('log in ok')
if the SVM finds the new data point to be significantly different from it's training set then it will fire the alarm.
My Two Pence. Once you've got hold of a good training set, there are many more ML techniques that may be better suited to your task (they may be faster, more accurate etc) but creating your training sets and then training the routines would be the most significant challenge.
There are many exciting things to try! If you know you have bad log-in attempts, you can add these to the training sets by using a more complex SVM which you train with good and bad log-ins. Instead of using an array of disparate 'location' values, you could find the Euclidean different log-ins and use that! This sounds like great fun, good luck!
I also thought the approach using svm.OneClassSVM from sklearn was going to produce a good outlier detector. However, I put together some representative data based upon the example in the question and it simply could not detect an outlier. I swept the nu and gamma parameters from .01 to .99 and found no satisfactory SVM predictor.
My theory is that because the samples have categorical data (cities, states, countries, web browsers) the SVM algorithm is not the right approach. (I did, BTW, first convert the data into binary feature vectors with the DictVectorizer.fit_transform method).
I believe #sullivanmatt is on the right track when he suggests using a Bayesian classifier. Bayesian classifiers are used for supervised learning but, at least on the surface, this problem was cast as an unsupervised learning problem, ie we don't know a priori which observations are normal and which are outliers.
Because the outliers you want to detect are very rare in the stream of web site visits, I believe you could train the Bayesian classifier by labeling every observation in your training set as a positive/normal observation. The classifier should predict that true normal observations have higher probability simply because the majority of the observations really are normal. A true outlier should stand out as receiving a low predicted probability.
If you're trying to investigate on anomalies of user behaviours during the time, I'd recommend you to look at time-series anomaly detectors. With this approach you'll be able to statistically/automatically figure out new, potentially suspicious, emerging patters and abnormal events.
http://www.autonlab.org/tutorials/biosurv.html and http://web.engr.oregonstate.edu/~wong/workshops/icml2006/slides/agarwal.ppt
explain some techniques based on machine learning. In this case you can use scikit-learn.org, a very powerful Python library that contains tons of ML algos.
I am currently working on a project, a simple sentiment analyzer such that there will be 2 and 3 classes in separate cases. I am using a corpus that is pretty rich in the means of unique words (around 200.000). I used bag-of-words method for feature selection and to reduce the number of unique features, an elimination is done due to a threshold value of frequency of occurrence. The final set of features includes around 20.000 features, which is actually a 90% decrease, but not enough for intended accuracy of test-prediction. I am using LibSVM and SVM-light in turn for training and prediction (both linear and RBF kernel) and also Python and Bash in general.
The highest accuracy observed so far is around 75% and I need at least 90%. This is the case for binary classification. For multi-class training, the accuracy falls to ~60%. I need at least 90% at both cases and can not figure how to increase it: via optimizing training parameters or via optimizing feature selection?
I have read articles about feature selection in text classification and what I found is that three different methods are used, which have actually a clear correlation among each other. These methods are as follows:
Frequency approach of bag-of-words (BOW)
Information Gain (IG)
X^2 Statistic (CHI)
The first method is already the one I use, but I use it very simply and need guidance for a better use of it in order to obtain high enough accuracy. I am also lacking knowledge about practical implementations of IG and CHI and looking for any help to guide me in that way.
Thanks a lot, and if you need any additional info for help, just let me know.
#larsmans: Frequency Threshold: I am looking for the occurrences of unique words in examples, such that if a word is occurring in different examples frequently enough, it is included in the feature set as a unique feature.
#TheManWithNoName: First of all thanks for your effort in explaining the general concerns of document classification. I examined and experimented all the methods you bring forward and others. I found Proportional Difference (PD) method the best for feature selection, where features are uni-grams and Term Presence (TP) for the weighting (I didn't understand why you tagged Term-Frequency-Inverse-Document-Frequency (TF-IDF) as an indexing method, I rather consider it as a feature weighting approach). Pre-processing is also an important aspect for this task as you mentioned. I used certain types of string elimination for refining the data as well as morphological parsing and stemming. Also note that I am working on Turkish, which has different characteristics compared to English. Finally, I managed to reach ~88% accuracy (f-measure) for binary classification and ~84% for multi-class. These values are solid proofs of the success of the model I used. This is what I have done so far. Now working on clustering and reduction models, have tried LDA and LSI and moving on to moVMF and maybe spherical models (LDA + moVMF), which seems to work better on corpus those have objective nature, like news corpus. If you have any information and guidance on these issues, I will appreciate. I need info especially to setup an interface (python oriented, open-source) between feature space dimension reduction methods (LDA, LSI, moVMF etc.) and clustering methods (k-means, hierarchical etc.).
This is probably a bit late to the table, but...
As Bee points out and you are already aware, the use of SVM as a classifier is wasted if you have already lost the information in the stages prior to classification. However, the process of text classification requires much more that just a couple of stages and each stage has significant effects on the result. Therefore, before looking into more complicated feature selection measures there are a number of much simpler possibilities that will typically require much lower resource consumption.
Do you pre-process the documents before performing tokensiation/representation into the bag-of-words format? Simply removing stop words or punctuation may improve accuracy considerably.
Have you considered altering your bag-of-words representation to use, for example, word pairs or n-grams instead? You may find that you have more dimensions to begin with but that they condense down a lot further and contain more useful information.
Its also worth noting that dimension reduction is feature selection/feature extraction. The difference is that feature selection reduces the dimensions in a univariate manner, i.e. it removes terms on an individual basis as they currently appear without altering them, whereas feature extraction (which I think Ben Allison is referring to) is multivaritate, combining one or more single terms together to produce higher orthangonal terms that (hopefully) contain more information and reduce the feature space.
Regarding your use of document frequency, are you merely using the probability/percentage of documents that contain a term or are you using the term densities found within the documents? If category one has only 10 douments and they each contain a term once, then category one is indeed associated with the document. However, if category two has only 10 documents that each contain the same term a hundred times each, then obviously category two has a much higher relation to that term than category one. If term densities are not taken into account this information is lost and the fewer categories you have the more impact this loss with have. On a similar note, it is not always prudent to only retain terms that have high frequencies, as they may not actually be providing any useful information. For example if a term appears a hundred times in every document, then it is considered a noise term and, while it looks important, there is no practical value in keeping it in your feature set.
Also how do you index the data, are you using the Vector Space Model with simple boolean indexing or a more complicated measure such as TF-IDF? Considering the low number of categories in your scenario a more complex measure will be beneficial as they can account for term importance for each category in relation to its importance throughout the entire dataset.
Personally I would experiment with some of the above possibilities first and then consider tweaking the feature selection/extraction with a (or a combination of) complex equations if you need an additional performance boost.
Additional
Based on the new information, it sounds as though you are on the right track and 84%+ accuracy (F1 or BEP - precision and recall based for multi-class problems) is generally considered very good for most datasets. It might be that you have successfully acquired all information rich features from the data already, or that a few are still being pruned.
Having said that, something that can be used as a predictor of how good aggressive dimension reduction may be for a particular dataset is 'Outlier Count' analysis, which uses the decline of Information Gain in outlying features to determine how likely it is that information will be lost during feature selection. You can use it on the raw and/or processed data to give an estimate of how aggressively you should aim to prune features (or unprune them as the case may be). A paper describing it can be found here:
Paper with Outlier Count information
With regards to describing TF-IDF as an indexing method, you are correct in it being a feature weighting measure, but I consider it to be used mostly as part of the indexing process (though it can also be used for dimension reduction). The reasoning for this is that some measures are better aimed toward feature selection/extraction, while others are preferable for feature weighting specifically in your document vectors (i.e. the indexed data). This is generally due to dimension reduction measures being determined on a per category basis, whereas index weighting measures tend to be more document orientated to give superior vector representation.
In respect to LDA, LSI and moVMF, I'm afraid I have too little experience of them to provide any guidance. Unfortunately I've also not worked with Turkish datasets or the python language.
I would recommend dimensionality reduction instead of feature selection. Consider either singular value decomposition, principal component analysis, or even better considering it's tailored for bag-of-words representations, Latent Dirichlet Allocation. This will allow you to notionally retain representations that include all words, but to collapse them to fewer dimensions by exploiting similarity (or even synonymy-type) relations between them.
All these methods have fairly standard implementations that you can get access to and run---if you let us know which language you're using, I or someone else will be able to point you in the right direction.
There's a python library for feature selection
TextFeatureSelection. This library provides discriminatory power in the form of score for each word token, bigram, trigram etc.
Those who are aware of feature selection methods in machine learning, it is based on filter method and provides ML engineers required tools to improve the classification accuracy in their NLP and deep learning models. It has 4 methods namely Chi-square, Mutual information, Proportional difference and Information gain to help select words as features before being fed into machine learning classifiers.
from TextFeatureSelection import TextFeatureSelection
#Multiclass classification problem
input_doc_list=['i am very happy','i just had an awesome weekend','this is a very difficult terrain to trek. i wish i stayed back at home.','i just had lunch','Do you want chips?']
target=['Positive','Positive','Negative','Neutral','Neutral']
fsOBJ=TextFeatureSelection(target=target,input_doc_list=input_doc_list)
result_df=fsOBJ.getScore()
print(result_df)
#Binary classification
input_doc_list=['i am content with this location','i am having the time of my life','you cannot learn machine learning without linear algebra','i want to go to mars']
target=[1,1,0,1]
fsOBJ=TextFeatureSelection(target=target,input_doc_list=input_doc_list)
result_df=fsOBJ.getScore()
print(result_df)
Edit:
It now has genetic algorithm for feature selection as well.
from TextFeatureSelection import TextFeatureSelectionGA
#Input documents: doc_list
#Input labels: label_list
getGAobj=TextFeatureSelectionGA(percentage_of_token=60)
best_vocabulary=getGAobj.getGeneticFeatures(doc_list=doc_list,label_list=label_list)
Edit2
There is another method nowTextFeatureSelectionEnsemble, which combines feature selection while ensembling. It does feature selection for base models through document frequency thresholds. At ensemble layer, it uses genetic algorithm to identify best combination of base models and keeps only those.
from TextFeatureSelection import TextFeatureSelectionEnsemble
imdb_data=pd.read_csv('../input/IMDB Dataset.csv')
le = LabelEncoder()
imdb_data['labels'] = le.fit_transform(imdb_data['sentiment'].values)
#convert raw text and labels to python list
doc_list=imdb_data['review'].tolist()
label_list=imdb_data['labels'].tolist()
#Initialize parameter for TextFeatureSelectionEnsemble and start training
gaObj=TextFeatureSelectionEnsemble(doc_list,label_list,n_crossvalidation=2,pickle_path='/home/user/folder/',average='micro',base_model_list=['LogisticRegression','RandomForestClassifier','ExtraTreesClassifier','KNeighborsClassifier'])
best_columns=gaObj.doTFSE()`
Check the project for details: https://pypi.org/project/TextFeatureSelection/
Linear svm is recommended for high dimensional features. Based on my experience the ultimate limitation of SVM accuracy depends on the positive and negative "features". You can do a grid search (or in the case of linear svm you can just search for the best cost value) to find the optimal parameters for maximum accuracy, but in the end you are limited by the separability of your feature-sets. The fact that you are not getting 90% means that you still have some work to do finding better features to describe your members of the classes.
I'm sure this is way too late to be of use to the poster, but perhaps it will be useful to someone else. The chi-squared approach to feature reduction is pretty simple to implement. Assuming BoW binary classification into classes C1 and C2, for each feature f in candidate_features calculate the freq of f in C1; calculate total words C1; repeat calculations for C2; Calculate a chi-sqaure determine filter candidate_features based on whether p-value is below a certain threshold (e.g. p < 0.05). A tutorial using Python and nltk can been seen here: http://streamhacker.com/2010/06/16/text-classification-sentiment-analysis-eliminate-low-information-features/ (though if I remember correctly, I believe the author incorrectly applies this technique to his test data, which biases the reported results).
I'm planning of implementing a document ranker which uses neural networks. How can one rate a document by taking in to consideration the ratings of similar articles?. Any good python libraries for doing this?. Can anyone recommend a good book for AI, with python code.
EDIT
I'm planning to make a recommendation engine which would make recommendations from similar users as well as using the data clustered using tags. User would be given chance to vote for articles. There will be about hundred thousand articles. Documents would be clustered based on their tags. Given a keyword articles would be fetched based on their tags and passed through a neural network for ranking.
The problem you are trying to solve is called "collaborative filtering".
Neural Networks
One state-of-the-art neural network method is Deep Belief Networks and Restricted Boltzman Machines. For a fast python implementation for a GPU (CUDA) see here. Another option is PyBrain.
Academic papers on your specific problem:
This is probably the state-of-the-art of neural networks and collaborative filtering (of movies):
Salakhutdinov, R., Mnih, A. Hinton, G, Restricted Boltzman
Machines for Collaborative Filtering, To appear in
Proceedings of the 24th International Conference on
Machine Learning 2007.
PDF
A Hopfield network implemented in Python:
Huang, Z. and Chen, H. and Zeng, D. Applying associative retrieval techniques to alleviate the sparsity problem in collaborative filtering.
ACM Transactions on Information Systems (TOIS), 22, 1,116--142, 2004, ACM. PDF
A thesis on collaborative filtering with Restricted Boltzman Machines (they say Python is not practical for the job):
G. Louppe. Collaborative filtering: Scalable
approaches using restricted Boltzmann machines.
Master's thesis, Universite de Liege, 2010.
PDF
Neural networks are not currently the state-of-the-art in collaborative filtering. And they are not the simplest, wide-spread solutions. Regarding your comment about the reason for using NNs being having too little data, neural networks don't have an inherent advantage/disadvantage in that case. Therefore, you might want to consider simpler Machine Learning approaches.
Other Machine Learning Techniques
The best methods today mix k-Nearest Neighbors and Matrix Factorization.
If you are locked on Python, take a look at pysuggest (a Python wrapper for the SUGGEST recommendation engine) and PyRSVD (primarily aimed at applications in collaborative filtering, in particular the Netflix competition).
If you are open to try other open source technologies look at: Open Source collaborative filtering frameworks and http://www.infoanarchy.org/en/Collaborative_Filtering.
Packages
If you're not committed to neural networks, I've had good luck with SVM, and k-means clustering might also be helpful. Both of these are provided by Milk. It also does Stepwise Discriminant Analysis for feature selection, which will definitely be useful to you if you're trying to find similar documents by topic.
God help you if you choose this route, but the ROOT framework has a powerful machine learning package called TMVA that provides a large number of classification methods, including SVM, NN, and Boosted Decision Trees (also possibly a good option). I haven't used it, but pyROOT provides python bindings to ROOT functionality. To be fair, when I first used ROOT I had no C++ knowledge and was in over my head conceptually too, so this might actually be amazing for you. ROOT has a HUGE number of data processing tools.
(NB: I've also written a fairly accurate document language identifier using chi-squared feature selection and cosine matching. Obviously your problem is harder, but consider that you might not need very hefty tools for it.)
Storage vs Processing
You mention in your question that:
...articles would be fetched based on their tags and passed through a neural network for ranking.
Just as another NB, one thing you should know about machine learning is that processes like training and evaluating tend to take a while. You should probably consider ranking all documents for each tag only once (assuming you know all the tags) and storing the results. For machine learning generally, it's much better to use more storage than more processing.
Now to your specific case. You don't say how many tags you have, so let's assume you have 1000, for roundness. If you store the results of your ranking for each doc on each tag, that gives you 100 million floats to store. That's a lot of data, and calculating them all will take a while, but retrieving them is very fast. If instead you recalculate the ranking for each document on demand, you have to do 1000 passes of it, one for each tag. Depending on the kind of operations you're doing and the size of your docs, that could take a few seconds to a few minutes. If the process is simple enough that you can wait for your code to do several of these evaluations on demand without getting bored, then go for it, but you should time this process before making any design decisions / writing code you won't want to use.
Good luck!
If I understand correctly, your task is something related to Collaborative filtering. There are many possible approaches to this problem; I suggest you follow the wikipedia page to have an overview of the main approaches you can choose.
For your project work I can suggest looking at Python based intro to Neural Networks with a simple BackProp NN implementation and a classification example. This is not "the" solution, but perhaps you can build your system out of that example without the need for a bigger framework.
You might want to check out PyBrain.
The FANN library also looks promising.
I am not really sure if a neural networks are the best way to solve this. I think Euclidean Distance Score or Pearson Correlation Score combined with item or user based filtering would be a good start.
An excellent book on the topic is: Programming Collective Intelligence from Toby Segaran