Location of the words in text


The NLTK package for Python has a dispersion plot function, which shows the location of chosen words in a text. Is there any numeric measure of such dispersion that can be calculated in Python? E.g. I want to measure whether the word "money" is spread throughout the text or rather concentrated in one chapter.

I believe there are multiple metrics that can be used to give a quantitative measure of what you are defining as informativeness of a word over a body of text.
Methodology
Since you mention chapter and text as the levels you wish to evaluate, the basic methodology would be the same:
Break a given text into chapters
Evaluate model on chapter and text level
Compare evaluation on chapter and text level
If the comparison is over a threshold you could claim it is meaningful or informative. Other metrics on the two levels could be used depending on the model.
Models
There are a few models that can be used.
Raw counts
Raw counts of words could be used on chapter and text levels. A threshold of percentage could be used to determine a topic as representative of the text.
For example, if num_word_per_chapter/num_all_words_per_chapter > threshold and/or num_word_per_text/num_all_words_text > threshold then you could claim it is representative. This might be a good baseline. It is essentially a bag-of-words like technique.
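As a rough sketch of this baseline (the chapter texts, the whitespace tokenisation, and the 0.1 threshold are all made-up assumptions, not part of any library), the per-chapter and whole-text ratios could be computed like this:

```python
from collections import Counter

def relative_frequencies(chapters, word):
    """Return the share of tokens equal to `word` in each chapter and overall."""
    per_chapter = []
    total_hits = 0
    total_tokens = 0
    for chapter in chapters:
        tokens = chapter.lower().split()  # naive whitespace tokenisation
        hits = Counter(tokens)[word]
        per_chapter.append(hits / len(tokens) if tokens else 0.0)
        total_hits += hits
        total_tokens += len(tokens)
    overall = total_hits / total_tokens if total_tokens else 0.0
    return per_chapter, overall

# Toy "chapters", invented for illustration
chapters = [
    "money money is the root of all money",
    "the sea was calm and the sky was clear",
    "they walked home without a word",
]
per_chapter, overall = relative_frequencies(chapters, "money")
threshold = 0.1  # arbitrary cut-off, to be tuned on real data
concentrated_in = [i for i, share in enumerate(per_chapter) if share > threshold]
print(per_chapter, overall, concentrated_in)
```

A word whose share exceeds the threshold in only one chapter, as here, would count as concentrated rather than spread out.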
Vector Space Models
Vector space models are used in Information Retrieval and Distributional Semantics. They usually use sparse vectors of counts or TF-IDF weights. Two vectors are compared with cosine similarity. Closer vectors have smaller angles and are considered "more alike".
You could create chapter-term matrices and average cosine similarity metrics for a text body. If the average_cos_sim > threshold you could claim it is more informative of the topic.
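A minimal sketch of that idea, using raw count vectors rather than TF-IDF and toy chapters invented for illustration (the names `cosine_similarity` and `average_cos_sim` are just local helpers here, not library functions):

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors (Counters)."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Toy chapters; each becomes one row of a chapter-term matrix
chapters = [
    "money buys bread and money buys time",
    "time and money are both scarce",
    "the cat sat on the mat",
]
vectors = [Counter(chapter.split()) for chapter in chapters]

# Average similarity over all distinct chapter pairs
pairs = [(i, j) for i in range(len(vectors)) for j in range(i + 1, len(vectors))]
average_cos_sim = sum(cosine_similarity(vectors[i], vectors[j]) for i, j in pairs) / len(pairs)
print(average_cos_sim)
```

In practice you would probably swap the count vectors for TF-IDF weights, but the comparison step stays the same.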
Examples and Difficulties
Here is a good example of VSM with NLTK. This may be a good place to start for a few tests.
The difficulties I foresee are:
Chapter Splitting
Finding Informative Threshold
I can't give you a more practical code based answer at this time, but I hope this gives you some options to start with.

Related

Sentences embedding using word2vec

I'd like to compare the difference among the same word mentioned in different sentences, for example "travel".
What I would like to do is:
Take the sentences mentioning the term "travel" as plain text;
In each sentence, replace 'travel' with travel_sent_x.
Train a word2vec model on these sentences.
Calculate the distance between travel_sent1, travel_sent2, and other relabelled mentions of "travel"
So each sentence's "travel" gets its own vector, which is used for comparison.
I know that word2vec requires much more than a few sentences to train reliable vectors. The official page recommends datasets containing billions of words, but my dataset is nowhere near that size (I have thousands of words).
I was trying to test the model with the following few sentences:
Sentences
Hawaii makes a move to boost domestic travel and support local tourism
Honolulu makes a move to boost travel and support local tourism
Hawaii wants tourists to return so much it's offering to pay for half of their travel expenses
My approach to build the vectors has been:
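from gensim.models import Word2Vec
# df is a pandas DataFrame holding the sentences above in a 'Sentences' column
# Word2Vec expects tokenised sentences (lists of words), not raw strings
vocab = [sentence.split() for sentence in df['Sentences']]
# gensim 3.x signature; in gensim 4.x `size` is named `vector_size`
model = Word2Vec(sentences=vocab, size=100, window=10, min_count=3, workers=4, sg=0)
# Word2Vec has no `vectorize` method, so this last step does not work as written
df['Sentences'].apply(model.vectorize)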
However I do not know how to visualise the results to see their similarity and get some useful insight.
Any help and advice will be welcome.
Update: I would like to use the Principal Component Analysis algorithm to visualise the embeddings in 3-dimensional space. I know how to do this for individual words, but I do not know how to do it for sentences.

Note that word2vec is not inherently a method for modeling sentences, only words. So there's no single, official way to use word2vec to represent sentences.
One quick & crude approach is to create a vector for a sentence (or other multi-word text) by averaging all the word-vectors together. It's fast, it's better-than-nothing, and does ok on some simple (broadly-topical) tasks - but isn't going to capture the full meaning of a text very well, especially any meaning which is dependent on grammar, polysemy, or sophisticated contextual hints.
Still, you could use it to get a fixed-size vector per short text, and calculate pairwise similarities/distances between those vectors, and feed the results into dimensionality-reduction algorithms for visualization or other purposes.
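To make the averaging approach concrete, here is a small sketch. The 3-dimensional `word_vectors` table is entirely hypothetical (a trained model would supply real vectors), and `sentence_vector` and `cosine` are illustrative helper names:

```python
import math

# Hypothetical 3-dimensional word vectors; a real model would supply these
word_vectors = {
    "hawaii":   [0.9, 0.1, 0.0],
    "honolulu": [0.8, 0.2, 0.0],
    "travel":   [0.1, 0.9, 0.1],
    "tourism":  [0.2, 0.8, 0.2],
    "expenses": [0.0, 0.2, 0.9],
}

def sentence_vector(sentence):
    """Average the vectors of the known words in a sentence (None if none are known)."""
    known = [word_vectors[w] for w in sentence.lower().split() if w in word_vectors]
    if not known:
        return None
    dims = len(known[0])
    return [sum(vec[d] for vec in known) / len(known) for d in range(dims)]

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

s1 = sentence_vector("Hawaii travel tourism")
s2 = sentence_vector("Honolulu travel tourism")
s3 = sentence_vector("travel expenses")
print(cosine(s1, s2), cosine(s1, s3))
```

The resulting fixed-size vectors (or the pairwise similarities between them) are what you would then pass to PCA, t-SNE, or a similar projection.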
Other algorithms actually create vectors for longer texts. A shallow algorithm very closely related to word2vec is 'paragraph vectors', available in Gensim as the Doc2Vec class. But it's still not very sophisticated, and still not grammar-aware. A number of deeper-network text models like BERT, ELMo, & others may be possibilities.
Word2vec & related algorithms are very data-hungry: all of their beneficial qualities arise from the tug-of-war between many varied usage examples for the same word. So if you have a toy-sized dataset, you won't get a set of vectors with useful interrelationships.
But also, rare words in your larger dataset won't get good vectors. It is typical in training to discard, as if they weren't even there, words that appear below some min_count frequency - because not only would their vectors be poor, from just one or a few idiosyncratic sample uses, but because there are many such underrepresented words in total, keeping them around tends to make other word-vectors worse, too. They're noise.
So, your proposed idea of taking individual instances of travel & replacing them with single-appearance tokens is not very likely to give interesting results. Lowering your min_count to 1 will get you vectors for each variant - but they'll be of far worse (& more-random) quality than your other word-vectors, having received comparatively little training attention compared to other words, and each being fully influenced by just their few surrounding words (rather than the entire range of all surrounding contexts that could all help contribute to the useful positioning of a unified travel token).
(You might be able to offset these problems, a little, by (1) retaining the original version of the sentence, so you still get a travel vector; (2) repeating your token-mangled sentences several times, & shuffling them to appear throughout the corpus, to somewhat simulate more real occurrences of your synthetic contexts. But without real variety, most of the problems of such single-context vectors will remain.)
Another possible way to compare travel_sent_A, travel_sent_B, etc would be to ignore the exact vector for travel or travel_sent_X entirely, but instead compile a summary vector for the word's surrounding N words. For example, if you have 100 examples of the word travel, create 100 vectors that each summarise (for instance, by averaging) the N words around travel. These vectors might show some vague clusters/neighborhoods, especially in the case of a word with very-different alternate meanings. (Some research adapting word2vec to account for polysemy uses this sort of context vector approach to influence/choose among alternate word-senses.)
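One way this context-window idea might look in code is sketched below. It assumes averaging as the summary, a window of 2 words on each side, and a made-up 2-dimensional vector table; `context_vectors` is a hypothetical helper, not a Gensim function:

```python
def context_vectors(tokenised_sentences, target, word_vectors, window=2):
    """For each occurrence of `target`, average the vectors of up to `window`
    known words on either side (the target itself is excluded)."""
    results = []
    for tokens in tokenised_sentences:
        for i, token in enumerate(tokens):
            if token != target:
                continue
            neighbours = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            known = [word_vectors[w] for w in neighbours if w in word_vectors]
            if known:
                dims = len(known[0])
                results.append([sum(v[d] for v in known) / len(known) for d in range(dims)])
    return results

# Hypothetical 2-dimensional vectors, for illustration only
word_vectors = {"domestic": [1.0, 0.0], "support": [0.8, 0.2],
                "their": [0.0, 1.0], "expenses": [0.2, 0.8]}
sentences = [
    "boost domestic travel and support".split(),
    "half of their travel expenses".split(),
]
vectors = context_vectors(sentences, "travel", word_vectors)
print(vectors)
```

Each occurrence of the target word then has its own context vector, which can be clustered or compared pairwise.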
You might also find this research on modeling words as drawing from alternate 'atoms' of discourse interesting: Linear algebraic structure of word meanings
To the extent you have short headline-like texts, and only word-vectors (without the data or algorithms to do deeper modeling), you may also want to look into the "Word Mover's Distance" calculation for comparing texts. Rather than reducing a single text to a single vector, it models it as a "bag of word-vectors". Then, it defines a distance as a cost-to-transform one bag to another bag. (More similar words are easier to transform into each other than less-similar words, so expressions that are very similar, with just a few synonyms replaced, report as quite close.)
It can be quite expensive to calculate on longer texts, but may work well for short phrases and small sets of headlines/tweets/etc. It's available on the Gensim KeyedVector classes as wmdistance(). An example of the kinds of correlations it may be useful in discovering is in this article: Navigating themes in restaurant reviews with Word Mover’s Distance

If you are interested in comparing sentences, Word2Vec is not the best choice. It has been shown that using it to create sentence embeddings produces results inferior to those of a dedicated sentence-embedding algorithm. If your dataset is not huge, you can't train a new embedding space on your own data, which forces you to use a pre-trained embedding for the sentences. Luckily, there are enough of those nowadays. I believe that Universal Sentence Encoder (by Google) will suit your needs best.
Once you get vector representations for your sentences you can go two ways:
create a matrix of pairwise comparisons and visualize it as a heatmap. This representation is useful when you have some prior knowledge about how close the sentences are and you want to check your hypothesis. You can even try it online.
run t-SNE on the vector representations. This will create a 2D projection of the sentences that tries to keep similar sentences close together. It often presents such data much better than PCA. Then you can easily find the neighbors of a given sentence.
You can learn more from this and this

Interesting take on the word2vec model. You can use t-SNE embeddings of the vectors to reduce the dimensionality to 3 and visualise them using any plotting library such as matplotlib or dash. I also find this tool helpful when visualising word embeddings: https://projector.tensorflow.org/
The idea of learning different word embeddings for words in different contexts is the premise of ELMo (https://allennlp.org/elmo), but you will require a huge training set to train it. Luckily, if your application is not very specific, you can use pre-trained models.

using LDA for dimension reduction / clustering

I'm currently facing a text mining problem where my goal is to identify clusters within a corpus of short texts.
The idea is, that these clusters represent some kind of technical/domain specific content which all members of the respective cluster have in common.
The final evaluation of the clustering has to be made from a domain-knowledge-based perspective.
I worked myself through a bunch of different approaches.
Topic modeling with lda seems a good one to start with.
So each of my documents is represented through a mixture of different topics (which are based on the co-occurrence of single words or n-grams).
My first idea was to use the resulting topics as clusters/groups to group my documents.
But one single document can consist of different topics, so I'm not sure whether this is a good idea.
Furthermore, as LDA does not use a distance measure to calculate its topics, I'm lacking some kind of metric to evaluate my LDA-based clusters. Because I have no ground truth, I'm limited to evaluation methods that do not need one. I used the silhouette score to evaluate my clusters, but while this metric is based on distances, LDA is not, so I'm not sure whether it actually makes sense.
My second thought was to use the lda results as a preprocessing step for dimension reduction.
On these new input vectors I could apply distance-based clustering methods like agglomerative clustering, k-means, or DBSCAN.
I also found some posts and papers which pointed to self-organizing maps to solve that kind of problem. Is this approach worth following, compared to the methods described above?
Is it a reasonable approach to use lda topics as clusters or as a preprocessing step?
What are metrics to evaluate non-distance approaches like LDA?
Are there any other approaches which I should take into account?

TfidfVectorizer - Normalisation bias

I want to make sure I understand what the attributes use_idf and sublinear_tf do in the TfidfVectorizer object. I've been researching this for a few days. I am trying to classify documents of varied length and currently use tf-idf for feature selection.
I believe that when use_idf=True the algorithm normalises the bias against the inherent issue (with TF) where a term that is X times more frequent shouldn't be X times as important.
It does this using the tf*idf formula. Then sublinear_tf=True applies 1+log(tf), such that it normalises the bias against lengthy documents vs short documents.
I am dealing with an inherent bias towards lengthy documents (most belong to one class). Does this normalisation really diminish the bias?
How can I make sure the length of the documents in the corpus is not integrated into the model?
I'm trying to verify that the normalisation is being applied in the model. I am trying to extract the normalised vectors of the corpus, so I assumed I could just sum up each row of the TfidfVectorizer matrix. However, the sums are greater than 1; I thought a normalised corpus would transform all documents to a range between 0 and 1.
from sklearn.feature_extraction.text import TfidfVectorizer

# stopwords, tokenizer and X_train are defined elsewhere
vect = TfidfVectorizer(max_features=20000, strip_accents='unicode',
                       stop_words=stopwords, analyzer='word', use_idf=True,
                       tokenizer=tokenizer, ngram_range=(1, 2),
                       sublinear_tf=True, norm='l2')
tfidf = vect.fit_transform(X_train)
# sum each L2-normalised document row
vect_sum = tfidf.sum(axis=1)

Neither use_idf nor sublinear_tf deals with document length. And actually your explanation for use_idf, "where a term that is X times more frequent shouldn't be X times as important", fits better as a description of sublinear_tf, since sublinear_tf makes the tf-idf score grow logarithmically with the term frequency.
use_idf means to use Inverse Document Frequency, so that terms that appear in most documents (i.e., bad indicators) are weighted less than terms that appear less frequently overall but only in specific documents (i.e., good indicators).
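As a small numeric illustration of the two settings: the sketch below reimplements, by hand, the 1 + ln(tf) scaling and the smoothed idf formula ln((1 + n) / (1 + df)) + 1 that the scikit-learn documentation gives as its default. The helper names are made up and this is not the library's own code.

```python
import math

def sublinear_tf(tf):
    """1 + ln(tf) for tf > 0, else 0 (the scaling applied by sublinear_tf=True)."""
    return 1.0 + math.log(tf) if tf > 0 else 0.0

def smoothed_idf(n_documents, document_frequency):
    """ln((1 + n) / (1 + df)) + 1, the smoothed idf scikit-learn documents as its default."""
    return math.log((1 + n_documents) / (1 + document_frequency)) + 1.0

# A term occurring 100 times is weighted about 5.6x a single occurrence, not 100x
print(sublinear_tf(1), sublinear_tf(100))
# In a 5-document corpus, a term found in every document gets idf 1.0,
# while a term found in a single document gets about 2.1
print(smoothed_idf(5, 5), smoothed_idf(5, 1))
```

Neither quantity depends on how long a document is, which is why length bias has to be handled separately.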
To reduce document length bias, you use normalization (the norm parameter of TfidfVectorizer), which scales each term's tf-idf score relative to the document's total score (dividing by the sum of absolute values for norm='l1', or by the square root of the sum of squares for norm='l2').
By default, TfidfVectorizer already uses norm='l2', though, so I'm not sure what is causing the problem you are facing. Perhaps those longer documents do indeed contain similar words as well? Also, classification often depends a lot on the data, so I can't say much more here to solve your problem.
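One thing that may explain the row sums greater than 1 in the question: with L2 normalisation it is the sum of squares of each row that equals 1, not the plain sum. A tiny hand-rolled sketch with made-up scores (these helpers are illustrative, not scikit-learn functions):

```python
import math

def l1_normalise(vector):
    """Divide by the sum of absolute values, so the entries sum to 1."""
    total = sum(abs(x) for x in vector)
    return [x / total for x in vector] if total else list(vector)

def l2_normalise(vector):
    """Divide by the Euclidean length, so the squared entries sum to 1."""
    length = math.sqrt(sum(x * x for x in vector))
    return [x / length for x in vector] if length else list(vector)

scores = [3.0, 4.0]          # made-up tf-idf scores for one document
l1 = l1_normalise(scores)    # [3/7, 4/7]
l2 = l2_normalise(scores)    # [0.6, 0.8]
print(sum(l1), sum(l2), sum(x * x for x in l2))
```

For the sparse matrix in the question, the analogous check would presumably be `tfidf.multiply(tfidf).sum(axis=1)`, which should give values close to 1 for every non-empty row.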
References:
TfidfVectorizer documentation
Wikipedia
Stanford Book

use_idf=True (the default) introduces a global component alongside the term frequency component (the local component: the individual article). When looking at the similarity of two texts, instead of just counting the terms that each of them has and comparing them, introducing the idf helps to categorise these terms as relevant or not. According to Zipf's law, the "frequency of any word is inversely proportional to its rank". That is, the most common word will appear twice as many times as the second most common word, three times as many as the third most common word, etc. Even after removing stop words, all words are subject to Zipf's law.
In this sense, imagine you have 5 articles describing the topic of automobiles. In this example the word "auto" is likely to appear in all 5 texts, and therefore will not be a unique identifier of a single text. On the other hand, if only one article describes auto "insurance" while another describes auto "mechanics", these two words ("mechanics" and "insurance") will be unique identifiers of their respective texts. By using the idf, words that appear in fewer texts ("mechanics" and "insurance", for example) will receive a higher weight. Therefore using an idf does not tackle the bias generated by the length of an article, since it is, again, a measure of a global component. If you want to reduce the bias generated by length then, as you said, using sublinear_tf=True will be a way to solve it, since you are transforming the local component (each article).
Hope it helps.

Feature space reduction for tag prediction

I am writing an ML module (python) to predict tags for a stackoverflow question (title + body). My corpus is of around 5 million questions with title, body and tags for each. I'm splitting this 3:2 for training and testing. I'm plagued by the curse of dimensionality.
Work Done
Pre-processing: markup removal, stopword removal, special character removal and a few bits and pieces. Store into MySQL. This almost halves the size of the test data.
ngram association: for each unigram and bigram in the title and the body of each question, I maintain a list of the associated tags. Store into redis. This results in about a million unique unigrams and 20 million unique bigrams, each with a corresponding list of tag frequencies. Ex.
"continuous integration": {"ci":42, "jenkins":15, "windows":1, "django":1, ....}
Note: There are 2 problems here: a) Not all unigrams and bigrams are important and, b) not all tags associated with a ngram are important, although this doesn't mean that tags with frequency 1 are all equivalent or can be haphazardly removed. The number of tags associated with a given ngram easily runs into the thousands - most of them unrelated and irrelevant.
tfidf: to aid in selecting which ngrams to keep, I calculated the tfidf score for the entire corpus for each unigram and bigram and stored the corresponding idf values with associated tags. Ex.
"continuous integration": {"ci":42, "jenkins":15, ...., "__idf__":7.2123}
The tfidf scores are stored in a document x feature sparse.csr_matrix, and I'm not sure how I can leverage that at the moment. (It is generated by fit_transform().)
Questions
How can I use this processed data to reduce the size of my feature set? I've read about SVD and PCA but the examples always talk about a set of documents and a vocabulary. I'm not sure where the tags from my set can come in. Also, the way my data is stored (redis + sparse matrix), it is difficult to use an already implemented module (sklearn, nltk etc) for this task.
Once the feature set is reduced, the way I have planned to use it is as follows:
Preprocess the test data.
Find the unigrams and bigrams.
For the ones stored in redis, find the corresponding best-k tags
Apply some kind of weight for the title and body text
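The planned lookup could be sketched roughly as follows. Everything here is a hypothetical stand-in: the in-memory `ngram_tags` dict replaces the redis store, and the scoring rule (idf times the tag's relative frequency, with titles weighted double) is just one arbitrary choice among many:

```python
from collections import Counter

# Hypothetical stand-in for the redis store: ngram -> tag frequencies plus idf
ngram_tags = {
    "continuous integration": {"ci": 42, "jenkins": 15, "windows": 1, "__idf__": 7.2},
    "jenkins": {"jenkins": 30, "ci": 10, "__idf__": 5.0},
}

def predict_tags(title_ngrams, body_ngrams, k=2, title_weight=2.0, body_weight=1.0):
    """Score tags by idf-weighted relative frequency, with title ngrams counting more."""
    scores = Counter()
    for ngrams, weight in ((title_ngrams, title_weight), (body_ngrams, body_weight)):
        for ngram in ngrams:
            entry = ngram_tags.get(ngram)
            if entry is None:
                continue  # ngram was pruned or never seen in training
            idf = entry["__idf__"]
            tag_counts = {t: c for t, c in entry.items() if t != "__idf__"}
            total = sum(tag_counts.values())
            for tag, count in tag_counts.items():
                scores[tag] += weight * idf * count / total
    return [tag for tag, _ in scores.most_common(k)]

print(predict_tags(["continuous integration"], ["jenkins", "unknown ngram"]))
```

Using relative rather than absolute tag frequencies keeps very common ngrams from dominating the score purely through volume.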
Apart from this I might also search for exact known tag matches in the document. E.g., if "ruby-on-rails" occurs in the title/body then there's a high probability that it's also a relevant tag.
Also, for tags predicted with a high probability, I might leverage a tag graph (an undirected graph with tags frequently occurring together having weighted edges between them) to predict more tags.
Are there any suggestions on how to improve upon this? Can a classifier come in handy?
Footnote
I've a 16-core, 16GB RAM machine. The redis-server (which I'll move to a different machine) is stored in RAM and is ~10GB. All the tasks mentioned above (apart from tfidf) are done in parallel using ipython clusters.

Use the public Api of Dandelion, this is a demo.
It extracts concepts from a text, so, in order to reduce dimensionality, you could use those concepts instead of the bag-of-words paradigm.

A baseline statistical approach would treat this as a classification problem. Features are bags-of-words processed by a maximum entropy classifier like Mallet http://mallet.cs.umass.edu/classification.php. Maxent (aka logistic regression) is good at handling large feature spaces. Take the probability associated with each tag (i.e., the class labels) and choose some decision threshold that gives you a precision/recall tradeoff that works for your project. Some of the Mallet documentation even mentions topic classification, which is very similar to what you are trying to do.
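The thresholding step might look something like this sketch. The probabilities and true tags are invented for illustration; in practice the probabilities would come from whatever maxent model you train:

```python
def select_tags(tag_probabilities, threshold):
    """Keep every tag whose predicted probability reaches the threshold."""
    return {tag for tag, p in tag_probabilities.items() if p >= threshold}

def precision_recall(predicted, actual):
    """Precision and recall of a predicted tag set against the true tag set."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall

# Made-up classifier output for one question, and its true tags
probabilities = {"python": 0.9, "nlp": 0.6, "django": 0.3, "java": 0.05}
actual = {"python", "nlp", "nltk"}

# Raising the threshold trades recall for precision
for threshold in (0.2, 0.5, 0.8):
    predicted = select_tags(probabilities, threshold)
    print(threshold, sorted(predicted), precision_recall(predicted, actual))
```

On real data you would average precision and recall over a held-out set and pick the threshold from that curve.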
The open questions are how well Mallet handles the size of your data (which isn't that big) and whether this particular tool is a non-starter with the technology stack you mentioned. You might be able to train offline (dump the redis database to a text file in Mallet's feature format) and run the Mallet-learned model in Python. Evaluating a maxent model is simple. If you want to stay in Python and have this be more automated, there are Python-based maxent implementations in NLTK and probably in scikit-learn. This approach is not at all state-of-the-art, but it'll work okay and be a decent baseline with which to compare more complicated methods.

Feature Selection and Reduction for Text Classification

I am currently working on a project, a simple sentiment analyzer, with 2 classes in one case and 3 in another. I am using a corpus that is pretty rich in terms of unique words (around 200,000). I used the bag-of-words method for feature selection, and to reduce the number of unique features I eliminate words below a threshold frequency of occurrence. The final set includes around 20,000 features, which is actually a 90% decrease, but not enough for the intended test-prediction accuracy. I am using LibSVM and SVM-light in turn for training and prediction (both linear and RBF kernels), and also Python and Bash in general.
The highest accuracy observed so far is around 75% and I need at least 90%. This is the case for binary classification. For multi-class training, the accuracy falls to ~60%. I need at least 90% in both cases and cannot figure out how to increase it: by optimizing training parameters or by optimizing feature selection?
I have read articles about feature selection in text classification and what I found is that three different methods are used, which have actually a clear correlation among each other. These methods are as follows:
Frequency approach of bag-of-words (BOW)
Information Gain (IG)
X^2 Statistic (CHI)
The first method is already the one I use, but I use it very simply and need guidance for a better use of it in order to obtain high enough accuracy. I am also lacking knowledge about practical implementations of IG and CHI and looking for any help to guide me in that way.
Thanks a lot, and if you need any additional info for help, just let me know.
@larsmans: Frequency Threshold: I am looking at the occurrences of unique words in examples, such that if a word occurs frequently enough across different examples, it is included in the feature set as a unique feature.
@TheManWithNoName: First of all, thanks for your effort in explaining the general concerns of document classification. I examined and experimented with all the methods you brought forward, and others. I found the Proportional Difference (PD) method the best for feature selection, where features are uni-grams, and Term Presence (TP) the best for weighting (I didn't understand why you tagged Term-Frequency-Inverse-Document-Frequency (TF-IDF) as an indexing method; I rather consider it a feature weighting approach). Pre-processing is also an important aspect of this task, as you mentioned. I used certain types of string elimination for refining the data, as well as morphological parsing and stemming. Also note that I am working on Turkish, which has different characteristics compared to English. Finally, I managed to reach ~88% accuracy (f-measure) for binary classification and ~84% for multi-class. These values are solid evidence of the success of the model I used. This is what I have done so far. I am now working on clustering and reduction models; I have tried LDA and LSI and am moving on to moVMF and maybe spherical models (LDA + moVMF), which seem to work better on corpora of an objective nature, like news corpora. If you have any information and guidance on these issues, I will appreciate it. I especially need information on setting up an interface (python oriented, open-source) between feature space dimension reduction methods (LDA, LSI, moVMF etc.) and clustering methods (k-means, hierarchical etc.).

This is probably a bit late to the table, but...
As Bee points out and you are already aware, the use of SVM as a classifier is wasted if you have already lost the information in the stages prior to classification. However, the process of text classification requires much more than just a couple of stages, and each stage has significant effects on the result. Therefore, before looking into more complicated feature selection measures, there are a number of much simpler possibilities that will typically require much lower resource consumption.
Do you pre-process the documents before performing tokenisation/representation into the bag-of-words format? Simply removing stop words or punctuation may improve accuracy considerably.
Have you considered altering your bag-of-words representation to use, for example, word pairs or n-grams instead? You may find that you have more dimensions to begin with but that they condense down a lot further and contain more useful information.
It's also worth noting that dimension reduction covers both feature selection and feature extraction. The difference is that feature selection reduces the dimensions in a univariate manner, i.e. it removes terms on an individual basis as they currently appear without altering them, whereas feature extraction (which I think Ben Allison is referring to) is multivariate, combining one or more single terms together to produce higher-order, orthogonal terms that (hopefully) contain more information and reduce the feature space.
Regarding your use of document frequency, are you merely using the probability/percentage of documents that contain a term, or are you using the term densities found within the documents? If category one has only 10 documents and they each contain a term once, then category one is indeed associated with the term. However, if category two has only 10 documents that each contain the same term a hundred times, then obviously category two has a much stronger relation to that term than category one. If term densities are not taken into account this information is lost, and the fewer categories you have, the more impact this loss will have. On a similar note, it is not always prudent to retain only terms that have high frequencies, as they may not actually be providing any useful information. For example, if a term appears a hundred times in every document, then it is considered a noise term and, while it looks important, there is no practical value in keeping it in your feature set.
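To illustrate the difference between the two measures with the 10-document example above, here is a minimal sketch. The documents are synthetic, the 200-token length is an arbitrary assumption, and both helper functions are illustrative only:

```python
def document_frequency(documents, term):
    """Fraction of documents that contain the term at least once."""
    return sum(term in doc for doc in documents) / len(documents)

def term_density(documents, term):
    """Occurrences of the term divided by all tokens in the category."""
    total_tokens = sum(len(doc) for doc in documents)
    return sum(doc.count(term) for doc in documents) / total_tokens

# Two synthetic categories of 10 tokenised documents, 200 tokens each
category_one = [["engine"] + ["filler"] * 199 for _ in range(10)]          # term once per document
category_two = [["engine"] * 100 + ["filler"] * 100 for _ in range(10)]    # term 100 times per document

# Document frequency cannot tell the categories apart; term density can
print(document_frequency(category_one, "engine"), document_frequency(category_two, "engine"))
print(term_density(category_one, "engine"), term_density(category_two, "engine"))
```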
Also how do you index the data, are you using the Vector Space Model with simple boolean indexing or a more complicated measure such as TF-IDF? Considering the low number of categories in your scenario a more complex measure will be beneficial as they can account for term importance for each category in relation to its importance throughout the entire dataset.
Personally I would experiment with some of the above possibilities first and then consider tweaking the feature selection/extraction with a (or a combination of) complex equations if you need an additional performance boost.
Additional
Based on the new information, it sounds as though you are on the right track and 84%+ accuracy (F1 or BEP - precision and recall based for multi-class problems) is generally considered very good for most datasets. It might be that you have successfully acquired all information rich features from the data already, or that a few are still being pruned.
Having said that, something that can be used as a predictor of how good aggressive dimension reduction may be for a particular dataset is 'Outlier Count' analysis, which uses the decline of Information Gain in outlying features to determine how likely it is that information will be lost during feature selection. You can use it on the raw and/or processed data to give an estimate of how aggressively you should aim to prune features (or unprune them as the case may be). A paper describing it can be found here:
Paper with Outlier Count information
With regards to describing TF-IDF as an indexing method, you are correct in it being a feature weighting measure, but I consider it to be used mostly as part of the indexing process (though it can also be used for dimension reduction). The reasoning for this is that some measures are better aimed toward feature selection/extraction, while others are preferable for feature weighting specifically in your document vectors (i.e. the indexed data). This is generally due to dimension reduction measures being determined on a per category basis, whereas index weighting measures tend to be more document orientated to give superior vector representation.
In respect to LDA, LSI and moVMF, I'm afraid I have too little experience of them to provide any guidance. Unfortunately I've also not worked with Turkish datasets or the python language.

I would recommend dimensionality reduction instead of feature selection. Consider either singular value decomposition, principal component analysis, or even better considering it's tailored for bag-of-words representations, Latent Dirichlet Allocation. This will allow you to notionally retain representations that include all words, but to collapse them to fewer dimensions by exploiting similarity (or even synonymy-type) relations between them.
All these methods have fairly standard implementations that you can get access to and run---if you let us know which language you're using, I or someone else will be able to point you in the right direction.

There's a Python library for feature selection: TextFeatureSelection. This library provides discriminatory power in the form of a score for each word token, bigram, trigram etc.
For those familiar with feature selection methods in machine learning: it is based on the filter method and gives ML engineers the tools to improve classification accuracy in their NLP and deep learning models. It has 4 methods, namely Chi-square, Mutual information, Proportional difference and Information gain, to help select words as features before they are fed into machine learning classifiers.
from TextFeatureSelection import TextFeatureSelection
#Multiclass classification problem
input_doc_list=['i am very happy','i just had an awesome weekend','this is a very difficult terrain to trek. i wish i stayed back at home.','i just had lunch','Do you want chips?']
target=['Positive','Positive','Negative','Neutral','Neutral']
fsOBJ=TextFeatureSelection(target=target,input_doc_list=input_doc_list)
result_df=fsOBJ.getScore()
print(result_df)
#Binary classification
input_doc_list=['i am content with this location','i am having the time of my life','you cannot learn machine learning without linear algebra','i want to go to mars']
target=[1,1,0,1]
fsOBJ=TextFeatureSelection(target=target,input_doc_list=input_doc_list)
result_df=fsOBJ.getScore()
print(result_df)
Edit:
It now has genetic algorithm for feature selection as well.
from TextFeatureSelection import TextFeatureSelectionGA
#Input documents: doc_list
#Input labels: label_list
getGAobj=TextFeatureSelectionGA(percentage_of_token=60)
best_vocabulary=getGAobj.getGeneticFeatures(doc_list=doc_list,label_list=label_list)
Edit2
There is another method now, TextFeatureSelectionEnsemble, which combines feature selection with ensembling. It does feature selection for base models through document frequency thresholds. At the ensemble layer, it uses a genetic algorithm to identify the best combination of base models and keeps only those.
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from TextFeatureSelection import TextFeatureSelectionEnsemble
imdb_data=pd.read_csv('../input/IMDB Dataset.csv')
le = LabelEncoder()
imdb_data['labels'] = le.fit_transform(imdb_data['sentiment'].values)
#convert raw text and labels to python list
doc_list=imdb_data['review'].tolist()
label_list=imdb_data['labels'].tolist()
#Initialize parameter for TextFeatureSelectionEnsemble and start training
gaObj=TextFeatureSelectionEnsemble(doc_list,label_list,n_crossvalidation=2,pickle_path='/home/user/folder/',average='micro',base_model_list=['LogisticRegression','RandomForestClassifier','ExtraTreesClassifier','KNeighborsClassifier'])
best_columns=gaObj.doTFSE()
Check the project for details: https://pypi.org/project/TextFeatureSelection/

Linear svm is recommended for high dimensional features. Based on my experience the ultimate limitation of SVM accuracy depends on the positive and negative "features". You can do a grid search (or in the case of linear svm you can just search for the best cost value) to find the optimal parameters for maximum accuracy, but in the end you are limited by the separability of your feature-sets. The fact that you are not getting 90% means that you still have some work to do finding better features to describe your members of the classes.

I'm sure this is way too late to be of use to the poster, but perhaps it will be useful to someone else. The chi-squared approach to feature reduction is pretty simple to implement. Assuming BoW binary classification into classes C1 and C2, for each feature f in candidate_features: calculate the frequency of f in C1; calculate the total words in C1; repeat the calculations for C2; calculate a chi-squared statistic; then filter candidate_features based on whether the p-value is below a certain threshold (e.g. p < 0.05). A tutorial using Python and nltk can be seen here: http://streamhacker.com/2010/06/16/text-classification-sentiment-analysis-eliminate-low-information-features/ (though if I remember correctly, I believe the author incorrectly applies this technique to his test data, which biases the reported results).
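A minimal sketch of those steps, assuming a 2x2 contingency table (feature vs. all other words, C1 vs. C2) and made-up counts. Instead of computing a p-value, it compares the statistic with 3.841, the chi-squared critical value for 1 degree of freedom at p = 0.05:

```python
def chi_squared(freq_c1, total_c1, freq_c2, total_c2):
    """Pearson chi-squared statistic for the 2x2 table
    (feature vs. other words) x (class C1 vs. class C2)."""
    observed = [
        [freq_c1, freq_c2],
        [total_c1 - freq_c1, total_c2 - freq_c2],
    ]
    grand_total = total_c1 + total_c2
    row_totals = [sum(row) for row in observed]
    col_totals = [total_c1, total_c2]
    statistic = 0.0
    for r in range(2):
        for c in range(2):
            expected = row_totals[r] * col_totals[c] / grand_total
            if expected > 0:
                statistic += (observed[r][c] - expected) ** 2 / expected
    return statistic

CRITICAL_VALUE = 3.841  # chi-squared, 1 degree of freedom, p = 0.05

# Made-up counts: feature -> (occurrences in C1, occurrences in C2)
candidate_features = {"awesome": (30, 5), "the": (100, 100)}
total_c1, total_c2 = 1000, 1000
selected = [f for f, (f1, f2) in candidate_features.items()
            if chi_squared(f1, total_c1, f2, total_c2) > CRITICAL_VALUE]
print(selected)
```

To avoid the bias mentioned above, the counts should come from the training data only, with the selected feature list then applied unchanged to the test data.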
