I have a dataset that is equivalent to a dict of 5 million key-value pairs, both strings.
Each key is unique, but there are only a couple hundred distinct values.
Keys are not natural words but technical references. The values are "families", grouping similar technical references. Similar is meant in the sense of "having similar regex", "including similar characters", or some sort of pattern.
Example of key-values:
ADSF33344 : G1112
AWDX45603 : G1112
D99991111222 : X3334
E98881188393 : X3334
A30-00005-01 : B0007
B45-00234-07A : B0007
F50-01120-06 : B0007
The final goal is to feed an algorithm with a list of new references (never seen before) and the algorithm would return a suggested family for each reference, ideally together with a percentage of confidence, based on what it learned from the dataset.
The suggested family can only come from the existing families found in the dataset. No need to "invent" a new family name.
I'm not familiar with machine learning so I don't really know where to start. I saw some solutions through Sklearn or TextBlob and I understand that I'm looking for a classifier algorithm but every tutorial is oriented toward analysis of large texts.
Somehow, I can't figure out how to handle my problem, although it seems to be a "simpler" problem than analysing newspaper articles in natural language...
Could you point me to sources or tutorials that could help me?
Make a training dataset, and train a classifier. Most classifiers work on the values of a set of features that you define yourself. (The kind of features depends on the classifier; in some cases they are numeric quantities, in other cases true/false, in others they can take several discrete values.) You provide the features and the classifier decides how important each feature is, and how to interpret their combinations.
By way of a tutorial you can look at chapter 6 of the NLTK book. The example task, the classification of names into male and female, is structurally very close to yours: Based on the form of short strings (names), classify them into categories (genders).
You will translate each part number into a dictionary of features. Since you don't show us the real data, nobody can give you concrete suggestions, but you should definitely make general-purpose features as in the book, and in addition you should make a feature out of every clue, strong or weak, that you are aware of. If supplier IDs differ in length, make a length feature. If the presence (or number or position) of hyphens is a clue, make that into a feature. If some suppliers' parts use a lot of zeros, ditto. Then make additional features for anything else that might be useful, e.g. "first three letters". Once you have a working system, experiment with different feature sets and different classifier engines and algorithms, until you get acceptable performance.
To get good results with new data, don't forget to split up your training data into training, testing and evaluation subsets. You could use all this with any classifier, but the NLTK's Naive Bayes classifier is pretty quick to train so you could start with that. (Note that the features can be discrete values, e.g. first_letter can be the actual letter; you don't need to stick to boolean features.)
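For concreteness, here's a minimal sketch of that approach with NLTK. The feature names are just illustrative guesses, and the tiny labeled_parts dict stands in for your real 5-million-entry mapping; adapt both to the clues in your actual references.
import nltk

def part_features(ref):
    # Illustrative features only -- adapt these to whatever clues your real references contain.
    return {
        'length': len(ref),
        'has_hyphen': '-' in ref,
        'first_letter': ref[0],
        'first_three': ref[:3],
        'digit_count': sum(c.isdigit() for c in ref),
    }

# Stand-in for your real dict of 5 million {reference: family} pairs.
labeled_parts = {'ADSF33344': 'G1112', 'AWDX45603': 'G1112',
                 'D99991111222': 'X3334', 'E98881188393': 'X3334',
                 'A30-00005-01': 'B0007', 'B45-00234-07A': 'B0007', 'F50-01120-06': 'B0007'}

featuresets = [(part_features(ref), family) for ref, family in labeled_parts.items()]
split = len(featuresets) // 5                      # hold roughly 20% out for testing
test_set, train_set = featuresets[:split], featuresets[split:]

classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

dist = classifier.prob_classify(part_features('AXQZ12345'))   # a new, unseen reference
print(dist.max(), dist.prob(dist.max()))                      # suggested family and its probability
The probability from prob_classify gives you the "percentage of confidence" you asked about, at least in the Naive Bayes sense.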
Related
I have a list of merchant categories:
[
'General Contractors–Residential and Commercial',
'Air Conditioning, Heating and Plumbing Contractors',
'Electrical Contractors',
....,
'Insulation, Masonry, Plastering, Stonework and Tile Setting Contractors'
]
I want to exclude merchants from my dataframe if df['merchant_category'].str.contains() matches any of these merchant categories.
However, I cannot guarantee that the value in my dataframe has the same long name as in the list of merchant categories. It could be that my dataframe value is just 'air conditioning'.
As such, df = df[~df['merchant_category'].isin(list_of_merchant_category)] will not work.
If you can collect a long list of positive examples (categories you definitely want to keep), & negative examples (categories you definitely want to exclude), you could try to train a text classifier on that data.
It would then be able to look at new texts and make a reasonable guess as to whether you want them included or excluded, based on their similarity to your examples.
So, as you're working in Python, I suggest you look for online tutorials and examples of "binary text classification" using Scikit-Learn.
While there's a bewildering variety of possible approaches to both representing/vectorizing your text, and then learning to make classifications from those vectors, you may have success with some very simple ones commonly used in intro examples. For example, you could represent your textual categories with bag-of-words and/or character-n-gram (word-fragment) representations. Then try NaiveBayes or SVC classifiers (and others if you need to experiment for possibly-better results).
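For example, a rough scikit-learn sketch with hypothetical example strings and keep/exclude labels (not your real data) might look like this:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical labelled examples: 1 = exclude (contractor-like), 0 = keep.
texts = ['Electrical Contractors', 'air conditioning', 'Plastering and Tile Setting Contractors',
         'Organic Grocery Stores', 'Book Shops and Newsstands']
labels = [1, 1, 1, 0, 0]

model = make_pipeline(
    TfidfVectorizer(analyzer='char_wb', ngram_range=(3, 5)),  # character n-grams cope with partial names
    MultinomialNB(),
)
model.fit(texts, labels)

new_texts = ['plumbing contractors', 'book store']
print(model.predict(new_texts))         # predicted 0/1 labels
print(model.predict_proba(new_texts))   # per-class probabilities, usable as a rough confidence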
Some of these will even report a sort of 'confidence' in their predictions - so you could potentially accept the strong predictions, but highlight the weak predictions for human review. When a human then looks at, and definitively rules on, a new 'category' string – because it was highlighted as an iffy prediction, or noticed as an error – you can then improve the overall system by:
adding that to the known set that are automatically included/excluded based on an exact literal comparison
re-training the system, so that it has a better chance at getting other new similar strings correct
(I know this is a very high-level answer, but once you've worked through some attempts based on other intro tutorials, and hit issues with your data, you'll be able to ask more specific questions here on SO to get over any specific issues.)
I want to classify the tweets into predefined categories (like: sports, health, and 10 more). If I had labeled data, I would be able to do the classification by training Naive Bayes or SVM. As described in http://cucis.ece.northwestern.edu/publications/pdf/LeePal11.pdf
But I cannot figure out a way to do this with unlabeled data. One possibility could be using Expectation-Maximization, generating clusters, and labeling those clusters. But as said earlier, I have a predefined set of classes, so clustering won't be as good.
Can anyone guide me on what techniques I should follow? Any help is appreciated.
Alright, from what I can understand, I think there are multiple ways to approach this case.
There will be trade-offs and the accuracy rate may vary, because of the well-known fact and observation:
Each single tweet is distinct!
(unless you are extracting data from the Twitter streaming API based on tags and other keywords). Please define the source of the data and how you are extracting it. I am assuming you're just getting general tweets, which can be about anything.
What you can do is generate a dictionary for each class you have
(e.g. Music => pop, jazz, rap, instruments, ...)
which will contain relevant words for that class. You can use NLTK for Python or Stanford NLP for other languages.
You can start with extracting (a short WordNet sketch follows this list):
Synonyms
Hyponyms
Hypernyms
Meronyms
Holonyms
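Here is a minimal sketch of pulling those relations out of WordNet with NLTK, using a hypothetical seed word 'music':
from nltk.corpus import wordnet as wn   # requires nltk.download('wordnet') on first use

seed = 'music'
related = set()
for syn in wn.synsets(seed):
    related.update(lemma.name() for lemma in syn.lemmas())                  # synonyms
    for rel in syn.hyponyms() + syn.hypernyms() + syn.part_meronyms() + syn.member_holonyms():
        related.update(lemma.name() for lemma in rel.lemmas())              # related concepts
print(sorted(related))   # candidate dictionary entries for a 'Music' class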
Go see these NLP Lexical semantics slides; they will surely clarify some of the concepts.
Once you have dictionaries for each class, cross-compare them with the tweets you have. Label each tweet with the class it is most similar to (you can rank classes according to the number of occurrences of words from their dictionaries). This will make your tweets labeled like the others.
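As an illustration only, the dictionary-overlap labelling could be sketched like this (the class dictionaries here are made up):
# Made-up class dictionaries; in practice these would come from the WordNet step above.
class_dicts = {
    'Music':  {'pop', 'jazz', 'rap', 'guitar', 'album'},
    'Sports': {'football', 'goal', 'match', 'league', 'score'},
}

def label_tweet(tweet):
    tokens = set(tweet.lower().split())
    scores = {cls: len(tokens & words) for cls, words in class_dicts.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None   # None = no dictionary word matched

print(label_tweet('what a goal in the league match tonight'))   # -> 'Sports'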
Now the question is accuracy! That depends on the data and the versatility of your classes. This may be overkill, but it may come close to what you want.
Furthermore, you can label some set of tweets this way and use cosine similarity to cross-identify other tweets. This will help with the optimization part. But then again, it's up to you, as you know what trade-offs you can bear.
The real struggle will be the machine learning part and how you manage that.
Actually, this seems like a typical use case for semi-supervised learning. There are plenty of methods of use here, including clustering with constraints (where you force the model to cluster samples from the same class together) and transductive learning (where you try to extrapolate a model from labeled samples onto the distribution of unlabeled ones).
You could also simply cluster the data as #Shoaib suggested, but then you will have to come up with a heuristic approach for dealing with clusters with mixed labeling. Furthermore, obviously, solving an optimization problem not related to the task (labeling) will not be as good as actually using this knowledge.
You can use clustering for that task. For that you have to label some examples for each class first. Then using these labeled examples, you can identify the class of each cluster easily.
I’m trying to figure out which direction to take my Python NLP project in, and I’d be very grateful to the SO community for any advice.
Problem:
Let's say I have 100 .txt files that contain the minutes of 100 meetings held by a decision-making body. I also have 100 .txt files of corresponding meeting outcomes, which contain the resolutions passed by this body. The outcomes fall into one of seven categories:
1 – take no action
2 – take soft action
3 – take stronger action
4 – take strongest action
5 – cancel soft action previously taken
6 – cancel stronger action previously taken
7 – cancel strongest action previously taken
Alternatively, this can be presented on a scale from -3 to +3, with 0 signifying no action, +1 signifying soft action, -1 signifying cancellation of soft action previously taken, and so on.
Based on the text of the inputs, I’m interested in predicting which of these seven outcomes will occur.
I’m thinking of treating this as a form of sentiment analysis, since the decision to take a certain kind of action is basically a sentiment. However, all the sentiment analysis examples I’ve found have focused on positive/negative dichotomies, sometimes adding in neutral sentiment as a category. I haven’t found any examples with more than 3 possible classifications of outcomes – not sure whether this is because I haven’t looked in the right places, because it just isn’t really an approach of interest for whatever reason, or because this approach is a silly idea for some reason of which I’m not yet quite sure.
Question 1. Should I be approaching this as a form of sentiment analysis, or is there some other approach that would work better? Should I instead treat this as a kind of categorization matter, similar to classifying news articles by topic and training the model to recognize the "topic" (outcome)?
Corpus:
I understand that I will need to build a corpus for training/test data, and it looks like I have two immediately evident options:
1 – hand-code a CSV file for training data that would contain some key phrases from each input text and list the value of the corresponding outcome on a 7-point scale, similar to what’s been done here: http://help.sentiment140.com/for-students
2 – use the approach Pang and Lee used (http://www.cs.cornell.edu/people/pabo/movie-review-data/) and put each of my .txt files of inputs into one of seven folders based on outcomes, since the outcomes (what kind of action was taken) are known based on historical data.
The downside to the first option is that it would be very subjective – I would determine which keywords/phrases I think are the most important to include, and I may not necessarily be the best arbiter. The downside to the second option is that it might have less predictive power because the texts are pretty long, contain lots of extraneous words/phrases, and are often stylistically similar (policy speeches tend to use policy words). I looked at Pang and Lee’s data, though, and it seems like that may not be a huge problem, since the reviews they’re using are also not very varied in terms of style. I’m leaning towards the Pang and Lee approach, but I’m not sure if it would even work with more than two types of outcomes.
Question 2. Am I correct in assuming that these are my two general options for building the corpus? Am I missing some other (better) option?
Question 3. Given all of the above, which classifier should I be using? I’m thinking maximum entropy would work best; I’ve also looked into random forests, but I have no experience with the latter and really have no idea what I’m doing (yet) when it comes to them.
Thank you very much in advance :)
Question 1 - The most straightforward way to think of this is as a text classification task (sentiment analysis is one kind of text classification task, but by no means the only one).
Alternatively, as you point out, you could consider your data as existing on a continuum ranging from -3 (cancel strongest action previously taken) to +3 (take strongest action), with 0 (take no action) in the middle. In this case you could treat the outcome as a continuous variable with a natural ordering. If so, then you could treat this as a regression problem rather than a classification problem. It's hard to know whether this is a sensible thing to do without knowing more about the data. If you suspect you will have a number of words/phrases that will be very probable at one end of the scale (-3) and very improbable at the other (+3), or vice versa, then regression may make sense. On the other hand, if the relevant words/phrases are associated with strong emotion and are likely to appear at either end of the scale but not in the middle, then you may be better off treating it as classification. It also depends on how you want to evaluate your results. If your algorithm predicts that a document is a -2 and it's actually a -3, will it be penalized less than if it had predicted +3? If so, it might be better to treat this as a regression task.
Question 2. "Am I correct in assuming that these are my two general options for building the corpus? Am I missing some other (better) option?"
Note that the set of documents (the .txt files of meeting minutes and corresponding outcomes) is your corpus -- the typical thing to do is randomly select 20% or so to be set aside as test data and use the remaining 80% as training data. The two general options you consider above are options for selecting the set of features that your classification or regression algorithm should attend to.
You correctly identify the upsides and downsides of the two most obvious approaches for coming up with features (hand-picking your own vs. Pang & Lee's approach of just using unigrams (words) as features).
Personally I'd also lean towards this latter approach, given that it's notoriously hard for humans to predict which phrases will be useful for classification--although there's no reason why you couldn't combine the two, having your initial set of features include all words plus whatever phrases you think might be particularly relevant. As you point out, there will be a lot of extraneous words, so it may help to throw out words that are very infrequent, or that don't differ enough in frequency between classes to provide any discriminative power. Approaches for reducing an initial set of features are known as "feature selection" techniques - one common method is mentioned here. Or see this paper for a more comprehensive list.
You could also consider features like the percent of high-valence words, high-arousal words, or high-dominance words, using the dataset here (click Supplementary Material and download the zip).
Depending on how much effort you want to put into this project, another common thing to do is to try a whole bunch of approaches and see which works best. Of course, you can't test which approach works best using data in the test set--that would be cheating and would run the risk of overfitting to the test data. But you can set aside a small part of your training set as 'validation data' (i.e. a mini-test set that you use for testing different approaches). Given that you don't have that much training data (80 documents or so), you could consider using cross validation.
Question 3 - The best way is probably to try different approaches and pick whatever works best in cross-validation. But if I had to pick one or two, I personally have found that k-nearest neighbor classification (with low k) or SVMs often work well for this kind of thing. A reasonable approach (sketched in code after this list) might be:
having your initial features be all unigrams (words) + phrases that you think might be predictive after you look at some training data;
applying a feature selection technique to trim down your feature set;
applying any algorithm that can deal with high-dimensional/text features, such as those in http://www.csc.kth.se/utbildning/kth/kurser/DD2475/ir10/forelasningar/Lecture9_4.pdf (lots of good tips in that pdf!), or those that achieved decent performance in the Pang & Lee paper.
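A rough scikit-learn sketch of that pipeline could look like the following; the vectorizer settings and the percentile are arbitrary placeholders, and `texts`, `outcomes` and `new_texts` are assumed to hold your documents, their seven-way labels, and unseen documents.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectPercentile, chi2
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# texts = list of minutes documents, outcomes = their seven-way labels, new_texts = unseen documents.
model = make_pipeline(
    CountVectorizer(ngram_range=(1, 2), min_df=2),   # unigrams + bigrams, dropping very rare terms
    SelectPercentile(chi2, percentile=10),           # simple feature selection step
    LinearSVC(),                                     # handles multi-class via one-vs-rest
)
model.fit(texts, outcomes)
print(model.predict(new_texts))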
Other possibilities are discussed in http://nlp.stanford.edu/IR-book/pdf/13bayes.pdf . Often the specific algorithm matters less than the features that go into it. Frankly it sounds like a very difficult sort of classification task, so it's possible that nothing will work very well.
If you decide to treat it as a regression rather than a classification task, you could go with k nearest neighbors regression ( http://www.saedsayad.com/k_nearest_neighbors_reg.htm ) or ridge regression.
Random forests often do not work well with large numbers of dependent features (words), though they may work well if you end up deciding to go with a smaller number of features (for example, a set of words/phrases you manually select, plus % of high-valence words and % of high-arousal words).
I am new to machine learning and I am wondering whether it would be possible to use my available biological data for clustering. I want to find out whether a group of DNA sequences can be clustered into two groups, efficient and not efficient.
I have five sets, each containing about 480 short sequences (let's call them samples). Each set has an effect of a different strength:
Set1 - Very good effect
Set2 - Good effect
Set3 - Minor effect
Set4 - Very minor effect
Set5 - No effect
Each sample has some features, e.g. free energy, starting with a specific nucleotide, ...
Now my question is whether I can find out which types of samples in my sets are playing a role in the effect of the whole set. My only assumption is that in Set1 I have more efficient samples than in Set5 (which has either none or very few). A very simple (not realistic) result could be: all samples which start with nucleotide 'A' and end with nucleotide 'C' are causing the effect.
Is it possible to use machine learning to find out?
Thanks!
That definitely sounds like a problem where machine learning could give good results. I recommend that you look into scikit-learn, a powerful and easy to use toolkit for machine learning in Python. There are many introductory examples and tutorials available.
For your use case, I would say that random forests could give good results, although it's hard to say without knowing more about the structure of the data. They are available in the class RandomForestClassifier in sklearn. Again, there are many tutorials and examples to be found.
Since your training data is unlabeled, you may want to look into unsupervised learning methods. A simple class of such methods are clustering algorithms. In sklearn, you can find, for instance, k-means clustering along with other such algorithms. The idea would be to let the algorithm split your data into different clusters and see if there is any correlation between cluster membership and the observed effect.
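As a minimal sketch, assuming you have already built a numeric feature matrix X (one row per sequence) and a parallel list `effect` holding each sequence's set/effect label:
from sklearn.cluster import KMeans
import pandas as pd

# X: numeric feature matrix, one row per sequence (free energy, encoded starting nucleotide, ...).
# effect: the effect label of the set each sequence belongs to.
kmeans = KMeans(n_clusters=2, random_state=0)   # two clusters: candidate 'efficient' vs 'not efficient'
cluster_ids = kmeans.fit_predict(X)

# Cross-tabulate cluster membership against the effect labels to look for a correlation.
print(pd.crosstab(cluster_ids, effect))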
It is unclear from your description what the 5 sets (which sound like labels) correspond to, but I will assume that you are essentially asking about feature learning: you would like to know which features to choose to best predict what set a given sequence is from. Determining this from scratch is an open problem in machine learning and there are many possible approaches depending on the particulars of your situation.
You can select a set of features (just by making logical guesses) and calculate them for all sequences, then perform PCA on all the vectors you have generated. PCA will give you the linear combinations of features that account for the most variability in your data, which is useful in designing meaningful features.
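A minimal sketch with scikit-learn, assuming X is a numeric (samples x features) matrix built from your hand-picked features:
from sklearn.decomposition import PCA

# X: (n_sequences x n_features) matrix of the features you guessed.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(pca.explained_variance_ratio_)   # how much variability each component captures
print(pca.components_)                 # how strongly each original feature loads onto each component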
I am currently working on a project: a simple sentiment analyzer, such that there will be 2 and 3 classes in separate cases. I am using a corpus that is pretty rich in terms of unique words (around 200,000). I used the bag-of-words method for feature selection, and to reduce the number of unique features, features are eliminated based on a threshold value of frequency of occurrence. The final set of features includes around 20,000 features, which is actually a 90% decrease, but not enough for the intended accuracy of test prediction. I am using LibSVM and SVM-light in turn for training and prediction (both linear and RBF kernels), and also Python and Bash in general.
The highest accuracy observed so far is around 75%, and I need at least 90%. This is the case for binary classification. For multi-class training, the accuracy falls to ~60%. I need at least 90% in both cases and cannot figure out how to increase it: via optimizing training parameters or via optimizing feature selection?
I have read articles about feature selection in text classification and what I found is that three different methods are used, which have actually a clear correlation among each other. These methods are as follows:
Frequency approach of bag-of-words (BOW)
Information Gain (IG)
X^2 Statistic (CHI)
The first method is already the one I use, but I use it very simply and need guidance for a better use of it in order to obtain high enough accuracy. I am also lacking knowledge about practical implementations of IG and CHI and looking for any help to guide me in that way.
Thanks a lot, and if you need any additional info for help, just let me know.
#larsmans: Frequency Threshold: I am looking for the occurrences of unique words in examples, such that if a word is occurring in different examples frequently enough, it is included in the feature set as a unique feature.
#TheManWithNoName: First of all, thanks for your effort in explaining the general concerns of document classification. I examined and experimented with all the methods you bring forward, and others. I found the Proportional Difference (PD) method the best for feature selection, where features are uni-grams and Term Presence (TP) is used for the weighting (I didn't understand why you tagged Term Frequency-Inverse Document Frequency (TF-IDF) as an indexing method; I rather consider it a feature weighting approach). Pre-processing is also an important aspect of this task, as you mentioned. I used certain types of string elimination for refining the data, as well as morphological parsing and stemming. Also note that I am working on Turkish, which has different characteristics compared to English. Finally, I managed to reach ~88% accuracy (F-measure) for binary classification and ~84% for multi-class. These values are solid proof of the success of the model I used.
This is what I have done so far. I am now working on clustering and reduction models; I have tried LDA and LSI and am moving on to moVMF and maybe spherical models (LDA + moVMF), which seem to work better on corpora that have an objective nature, like news corpora. If you have any information and guidance on these issues, I would appreciate it. I especially need info on setting up an interface (Python-oriented, open source) between feature space dimension reduction methods (LDA, LSI, moVMF, etc.) and clustering methods (k-means, hierarchical, etc.).
This is probably a bit late to the table, but...
As Bee points out and you are already aware, the use of SVM as a classifier is wasted if you have already lost the information in the stages prior to classification. However, the process of text classification requires much more than just a couple of stages, and each stage has significant effects on the result. Therefore, before looking into more complicated feature selection measures, there are a number of much simpler possibilities that will typically require much lower resource consumption.
Do you pre-process the documents before performing tokenisation/representation into the bag-of-words format? Simply removing stop words or punctuation may improve accuracy considerably.
Have you considered altering your bag-of-words representation to use, for example, word pairs or n-grams instead? You may find that you have more dimensions to begin with but that they condense down a lot further and contain more useful information.
It's also worth noting that dimension reduction covers both feature selection and feature extraction. The difference is that feature selection reduces the dimensions in a univariate manner, i.e. it removes terms on an individual basis as they currently appear without altering them, whereas feature extraction (which I think Ben Allison is referring to) is multivariate, combining one or more single terms together to produce higher orthogonal terms that (hopefully) contain more information and reduce the feature space.
Regarding your use of document frequency, are you merely using the probability/percentage of documents that contain a term, or are you using the term densities found within the documents? If category one has only 10 documents and they each contain a term once, then category one is indeed associated with the term. However, if category two has only 10 documents that each contain the same term a hundred times, then obviously category two has a much higher relation to that term than category one. If term densities are not taken into account, this information is lost, and the fewer categories you have, the more impact this loss will have. On a similar note, it is not always prudent to only retain terms that have high frequencies, as they may not actually be providing any useful information. For example, if a term appears a hundred times in every document, then it is considered a noise term and, while it looks important, there is no practical value in keeping it in your feature set.
Also, how do you index the data? Are you using the Vector Space Model with simple boolean indexing, or a more complicated measure such as TF-IDF? Considering the low number of categories in your scenario, a more complex measure will be beneficial, as it can account for term importance for each category in relation to its importance throughout the entire dataset.
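To make the contrast concrete, here is a small illustrative scikit-learn sketch with made-up documents (your own pipeline may of course build the vectors differently):
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ['the pump failed twice', 'the pump worked fine', 'failed failed failed pump']   # made-up documents

boolean_X = CountVectorizer(binary=True).fit_transform(docs)       # simple boolean (term presence) indexing
tfidf_X = TfidfVectorizer(sublinear_tf=True).fit_transform(docs)   # weights terms by density and rarity
print(boolean_X.toarray())
print(tfidf_X.toarray())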
Personally I would experiment with some of the above possibilities first and then consider tweaking the feature selection/extraction with a (or a combination of) complex equations if you need an additional performance boost.
Additional
Based on the new information, it sounds as though you are on the right track and 84%+ accuracy (F1 or BEP - precision and recall based for multi-class problems) is generally considered very good for most datasets. It might be that you have successfully acquired all information rich features from the data already, or that a few are still being pruned.
Having said that, something that can be used as a predictor of how good aggressive dimension reduction may be for a particular dataset is 'Outlier Count' analysis, which uses the decline of Information Gain in outlying features to determine how likely it is that information will be lost during feature selection. You can use it on the raw and/or processed data to give an estimate of how aggressively you should aim to prune features (or unprune them as the case may be). A paper describing it can be found here:
Paper with Outlier Count information
With regards to describing TF-IDF as an indexing method, you are correct in it being a feature weighting measure, but I consider it to be used mostly as part of the indexing process (though it can also be used for dimension reduction). The reasoning for this is that some measures are better aimed toward feature selection/extraction, while others are preferable for feature weighting specifically in your document vectors (i.e. the indexed data). This is generally due to dimension reduction measures being determined on a per category basis, whereas index weighting measures tend to be more document orientated to give superior vector representation.
In respect to LDA, LSI and moVMF, I'm afraid I have too little experience of them to provide any guidance. Unfortunately I've also not worked with Turkish datasets or the python language.
I would recommend dimensionality reduction instead of feature selection. Consider either singular value decomposition, principal component analysis, or even better considering it's tailored for bag-of-words representations, Latent Dirichlet Allocation. This will allow you to notionally retain representations that include all words, but to collapse them to fewer dimensions by exploiting similarity (or even synonymy-type) relations between them.
All these methods have fairly standard implementations that you can get access to and run---if you let us know which language you're using, I or someone else will be able to point you in the right direction.
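Since you mentioned Python, here's a rough scikit-learn sketch of that kind of reduction; the component counts are arbitrary placeholders and `docs` is assumed to be your list of raw documents.
from sklearn.decomposition import LatentDirichletAllocation, TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

# docs: your list of (already pre-processed) documents.
counts = CountVectorizer(max_features=20000).fit_transform(docs)

lsa_X = TruncatedSVD(n_components=200).fit_transform(counts)              # SVD/LSA-style reduction
lda_X = LatentDirichletAllocation(n_components=50).fit_transform(counts)  # topic-based (LDA) reduction
print(lsa_X.shape, lda_X.shape)   # dense, low-dimensional vectors you can feed to the SVM instead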
There's a Python library for feature selection called TextFeatureSelection. This library provides discriminatory power in the form of a score for each word token, bigram, trigram, etc.
For those who are aware of feature selection methods in machine learning: it is based on the filter method and provides ML engineers with the tools required to improve the classification accuracy of their NLP and deep learning models. It has 4 methods, namely Chi-square, Mutual information, Proportional difference and Information gain, to help select words as features before they are fed into machine learning classifiers.
from TextFeatureSelection import TextFeatureSelection
#Multiclass classification problem
input_doc_list=['i am very happy','i just had an awesome weekend','this is a very difficult terrain to trek. i wish i stayed back at home.','i just had lunch','Do you want chips?']
target=['Positive','Positive','Negative','Neutral','Neutral']
fsOBJ=TextFeatureSelection(target=target,input_doc_list=input_doc_list)
result_df=fsOBJ.getScore()
print(result_df)
#Binary classification
input_doc_list=['i am content with this location','i am having the time of my life','you cannot learn machine learning without linear algebra','i want to go to mars']
target=[1,1,0,1]
fsOBJ=TextFeatureSelection(target=target,input_doc_list=input_doc_list)
result_df=fsOBJ.getScore()
print(result_df)
Edit:
It now has a genetic algorithm for feature selection as well.
from TextFeatureSelection import TextFeatureSelectionGA
#Input documents: doc_list
#Input labels: label_list
getGAobj=TextFeatureSelectionGA(percentage_of_token=60)
best_vocabulary=getGAobj.getGeneticFeatures(doc_list=doc_list,label_list=label_list)
Edit 2:
There is now another method, TextFeatureSelectionEnsemble, which combines feature selection with ensembling. It does feature selection for base models through document frequency thresholds. At the ensemble layer, it uses a genetic algorithm to identify the best combination of base models and keeps only those.
from TextFeatureSelection import TextFeatureSelectionEnsemble
from sklearn.preprocessing import LabelEncoder
import pandas as pd

imdb_data=pd.read_csv('../input/IMDB Dataset.csv')
le = LabelEncoder()
imdb_data['labels'] = le.fit_transform(imdb_data['sentiment'].values)
#convert raw text and labels to python list
doc_list=imdb_data['review'].tolist()
label_list=imdb_data['labels'].tolist()
#Initialize parameter for TextFeatureSelectionEnsemble and start training
gaObj=TextFeatureSelectionEnsemble(doc_list,label_list,n_crossvalidation=2,pickle_path='/home/user/folder/',average='micro',base_model_list=['LogisticRegression','RandomForestClassifier','ExtraTreesClassifier','KNeighborsClassifier'])
best_columns=gaObj.doTFSE()
Check the project for details: https://pypi.org/project/TextFeatureSelection/
Linear SVM is recommended for high-dimensional features. Based on my experience, the ultimate limitation of SVM accuracy depends on the positive and negative "features". You can do a grid search (or, in the case of linear SVM, you can just search for the best cost value) to find the optimal parameters for maximum accuracy, but in the end you are limited by the separability of your feature sets. The fact that you are not getting 90% means that you still have some work to do finding better features to describe your members of the classes.
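As an illustration, the cost-value search might be sketched like this with scikit-learn's LinearSVC (X_train and y_train are assumed to be your feature matrix and labels; the C grid is arbitrary):
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# X_train: your (sparse) document-feature matrix, y_train: the class labels.
grid = GridSearchCV(LinearSVC(), {'C': [0.01, 0.1, 1, 10, 100]}, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)   # best cost value and its cross-validated accuracy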
I'm sure this is way too late to be of use to the poster, but perhaps it will be useful to someone else. The chi-squared approach to feature reduction is pretty simple to implement. Assuming BoW binary classification into classes C1 and C2, for each feature f in candidate_features, calculate the frequency of f in C1 and the total word count of C1; repeat the calculations for C2; calculate a chi-square statistic and filter candidate_features based on whether the p-value is below a certain threshold (e.g. p < 0.05). A tutorial using Python and nltk can be seen here: http://streamhacker.com/2010/06/16/text-classification-sentiment-analysis-eliminate-low-information-features/ (though if I remember correctly, I believe the author incorrectly applies this technique to his test data, which biases the reported results).
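For anyone who prefers not to hand-roll the counts, here is a small illustrative sketch of the same idea using scikit-learn's built-in chi2 (`docs` and `labels` are assumed to hold your documents and their C1/C2 labels):
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import chi2

# docs: the training documents, labels: their C1/C2 class labels.
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(docs)
scores, p_values = chi2(X, labels)

keep = np.array(vectorizer.get_feature_names_out())[p_values < 0.05]   # features passing the threshold
print(keep)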