Understanding results of word2vec gensim for finding substitutes [closed] - python

I have implemented the word2vec model on transaction data (link) of a single category.
My goal is to find substitutable items from the data.
The model is giving results, but I want to make sure that it is giving results based on customers' historical purchase data (considering context) and not just on content (semantic data). The idea is similar to a recommendation system.
I have implemented this using the gensim library, where I passed the data (products) in the form of a list of lists.
E.g.
[['BLUE BELL ICE CREAM GOLD RIM', 'TILLAMK CHOC CHIP CK DOUGH IC'],
 ['TALENTI SICILIAN PISTACHIO GEL', 'TALENTI BLK RASP CHOC CHIP GEL'],
 ['BREYERS HOME MADE VAN ICE CREAM',
  'BREYERS HOME MADE VAN ICE CREAM',
  'BREYERS COFF ICE CREAM']]
Here, each sublist is the past one year of purchase history of a single customer.
# train word2vec model
from gensim.models import Word2Vec

model = Word2Vec(window=5, sg=0,
                 alpha=0.03, min_alpha=0.0007,
                 seed=14)
model.build_vocab(purchases_train, progress_per=200)
model.train(purchases_train, total_examples=model.corpus_count,
            epochs=10, report_delay=1)
# extract all vectors
import numpy as np

X = []
words = list(model.wv.index_to_key)
for word in words:
    x = model.wv.get_vector(word)
    X.append(x)
Y = np.array(X)
Y.shape  # (vocabulary size, vector dimensionality)
def similar_products(v, n=3):
    # extract the most similar products for the input vector
    # (skip the first hit, which is the query product itself)
    ms = model.wv.similar_by_vector(v, topn=n + 1)[1:]
    # extract the name and similarity score of each similar product
    new_ms = []
    for j in ms:
        pair = (products_dict[j[0]][0], j[1])  # products_dict maps product id -> description
        new_ms.append(pair)
    return new_ms
similar_products(model.wv['BLUE BELL ICE CREAM GOLD RIM'])
Results:
[('BLUE BELL ICE CREAM BROWN RIM', 0.7322707772254944),
('BLUE BELL ICE CREAM LIGHT', 0.4575043022632599),
('BLUE BELL ICE CREAM NSA', 0.3731085956096649)]
To get an intuitive understanding of word2vec and how it arrives at its results, I created a dummy dataset in which I wanted to find substitutes for 'FOODCLUB VAN IC PAIL'.
If two products appear in the same basket multiple times, then they are substitutes.
Looking at the data, the first substitute should be 'FOODCLUB CHOC CHIP IC PAIL', but the results I obtained are:
[('FOODCLUB NEAPOLITAN IC PAIL', 0.042492810636758804),
('FOODCLUB COOKIES CREAM ICE CREAM', -0.04012278839945793),
('FOODCLUB NEW YORK VAN IC PAIL', -0.040678512305021286)]
Can anyone help me understand the intuitive working of the word2vec model in gensim? Will each product be treated as a word and each customer's purchase list as a sentence?
Why are my results so absurd on the dummy dataset? How can I improve them?
Which hyperparameters play a significant role for this model? Is negative sampling required?

You may not get a very good intuitive understanding of usual word2vec behavior using these sorts of product-baskets as training data. The algorithm was originally developed for natural-language texts, where texts are runs of tokens whose frequencies, & co-occurrences, follow certain indicative patterns.
People certainly do use word2vec on runs-of-tokens that aren't natural language - like product baskets, or logs-of-actions, etc – but to the extent such tokens have very-different patterns, it's possible extra preprocessing or tuning will be necessary, or useful results will be harder to get.
As just a few ways customer-purchases might be different from real language, depending on what your "pseudo-texts" actually represent:
the ordering within a text might be an artifact of how you created the data-dump rather than anything meaningful
the nearest-neighbors to each token within the window may or may not be significant, compared to more distant tokens
customer ordering patterns might in general not be as reflective of shades-of-relationships as words-in-natural-language text
So it's not automatic that word2vec will give interesting results here, for recommendations.
That's especially the case with small datasets, or tiny dummy datasets. Word2vec requires lots of varied data to pack elements into interesting relative positions in a high-dimensional space. Even small demos usually have a vocabulary (count of unique tokens) of tens-of-thousands, with training texts that provide varied usage examples of every token dozens of times.
Without that, the model never learns anything interesting/generalizable. That's especially the case if trying to create a many-dimensions model (say the default vector_size=100) with a tiny vocabulary (just dozens of unique tokens) with few usage examples per token. And it only gets worse if tokens appear fewer than the default min_count=5 times – when they're ignored entirely. So don't expect anything interesting to come from your dummy data, at all.
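If you want to see what that frequency cutoff is doing to your data, here's a minimal sketch (assuming Gensim 4.x and the purchases_train list-of-lists from the question):

from gensim.models import Word2Vec

# Build only the vocabulary, to see which tokens survive the cutoff.
# min_count=5 is the default; rarer tokens are dropped entirely.
probe = Word2Vec(vector_size=100, window=5, min_count=5)
probe.build_vocab(purchases_train)

print("surviving vocabulary size:", len(probe.wv.index_to_key))
for token in probe.wv.index_to_key[:10]:
    print(token, probe.wv.get_vecattr(token, "count"))

On a tiny dummy dataset, you'll likely see the surviving vocabulary collapse to almost nothing – which explains the absurd results before any hyperparameter tuning comes into play.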
If you want to develop an intuition, I'd try some tutorials & other goals with real natural language text 1st, with a variety of datasets & parameters, to get a sense of what has what kind of effects on result usefulness – & only after that try to adapt word2vec to other data.
Negative-sampling is the default, & works well with typical datasets, especially as they grow large (where negative-sampling suffers less of a performance hit than hierarchical-softmax with large vocabularies). But a toggle between those two modes is unlikely to cause giant changes in quality unless there are other problems.
Sufficient data, of the right kind, is the key – & then tweaking parameters may nudge end-result usefulness in a better direction, or shift it to be better for certain purposes.
But more specific parameter tips are only possible with clearer goals, once some baseline is working.

Related

Sentences embedding using word2vec

I'd like to compare the difference among the same word mentioned in different sentences, for example "travel".
What I would like to do is:
Take the sentences mentioning the term "travel" as plain text;
In each sentence, replace 'travel' with travel_sent_x.
Train a word2vec model on these sentences.
Calculate the distance between travel_sent1, travel_sent2, and other relabelled mentions of "travel"
So each sentence's "travel" gets its own vector, which is used for comparison.
I know that word2vec requires much more than several sentences to train reliable vectors. The official page recommends datasets of billions of words, but I have nowhere near that number in my dataset (I have thousands of words).
I was trying to test the model with the following few sentences:
Sentences
Hawaii makes a move to boost domestic travel and support local tourism
Honolulu makes a move to boost travel and support local tourism
Hawaii wants tourists to return so much it's offering to pay for half of their travel expenses
My approach to building the vectors has been:
from gensim.models import Word2Vec

# tokenize each sentence into a list of words
vocab = [s.lower().split() for s in df['Sentences']]
model = Word2Vec(sentences=vocab, vector_size=100, window=10, min_count=3, workers=4, sg=0)

# look up the vector for an individual word, e.g.
model.wv['travel']
However I do not know how to visualise the results to see their similarity and get some useful insight.
Any help and advice will be welcome.
Update: I would use the Principal Component Analysis algorithm to visualise embeddings in 3-dimensional space. I know how to do this for each individual word, but I do not know how to do it for sentences.
Note that word2vec is not inherently a method for modeling sentences, only words. So there's no single, official way to use word2vec to represent sentences.
One quick & crude approach is to create a vector for a sentence (or other multi-word text) by averaging all the word-vectors together. It's fast, it's better-than-nothing, and does ok on some simple (broadly-topical) tasks - but isn't going to capture the full meaning of a text very well, especially any meaning which is dependent on grammar, polysemy, or sophisticated contextual hints.
Still, you could use it to get a fixed-size vector per short text, and calculate pairwise similarities/distances between those vectors, and feed the results into dimensionality-reduction algorithms for visualization or other purposes.
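A minimal sketch of that averaging approach, assuming a trained Gensim Word2Vec model and a df['Sentences'] column like yours:

import numpy as np

def average_vector(tokens, wv):
    # average the vectors of the tokens the model actually knows;
    # returns None if none survived training (e.g. all below min_count)
    vecs = [wv[t] for t in tokens if t in wv]
    return np.mean(vecs, axis=0) if vecs else None

sent_vecs = [average_vector(s.lower().split(), model.wv) for s in df['Sentences']]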
Other algorithms actually create vectors for longer texts. A shallow algorithm very closely related to word2vec is 'paragraph vectors', available in Gensim as the Doc2Vec class. But it's still not very sophisticated, and still not grammar-aware. A number of deeper-network text models like BERT, ELMo, & others may be possibilities.
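For comparison, a minimal Doc2Vec sketch on the same tokenised sentences (parameter values here are illustrative, not tuned):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [TaggedDocument(words=s.lower().split(), tags=[i])
        for i, s in enumerate(df['Sentences'])]
d2v = Doc2Vec(documents=docs, vector_size=50, min_count=2, epochs=40)

# vector for the first sentence, and its nearest neighbours among the rest
print(d2v.dv.most_similar([d2v.dv[0]], topn=3))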
Word2vec & related algorithms are very data-hungry: all of their beneficial qualities arise from the tug-of-war between many varied usage examples for the same word. So if you have a toy-sized dataset, you won't get a set of vectors with useful interrelationships.
But also, rare words in your larger dataset won't get good vectors. It is typical in training to discard, as if they weren't even there, words that appear below some min_count frequency - because not only would their vectors be poor, from just one or a few idiosyncratic sample uses, but because there are many such underrepresented words in total, keeping them around tends to make other word-vectors worse, too. They're noise.
So, your proposed idea of taking individual instances of travel & replacing them with single-appearance tokens is not very likely to give interesting results. Lowering your min_count to 1 will get you vectors for each variant - but they'll be of far worse (& more-random) quality than your other word-vectors, having received comparatively little training attention compared to other words, and each being fully influenced by just their few surrounding words (rather than the entire range of all surrounding contexts that could all help contribute to the useful positioning of a unified travel token).
(You might be able to offset these problems, a little, by (1) retaining the original version of the sentence, so you still get a travel vector; (2) repeating your token-mangled sentences several times, & shuffling them to appear throughout the corpus, to somewhat simulate more real occurrences of your synthetic contexts. But without real variety, most of the problems of such single-context vectors will remain.)
Another possible way to compare travel_sent_A, travel_sent_B, etc would be to ignore the exact vector for travel or travel_sent_X entirely, but instead compile a summary vector for the word's surrounding N words. For example if you have 100 examples of the word travel, create 100 vectors that are each an average of the N words around travel. These vectors might show some vague clusters/neighborhoods, especially in the case of a word with very-different alternate meanings. (Some research adapting word2vec to account for polysemy uses this sort of context vector approach to influence/choose among alternate word-senses.)
You might also find this research on modeling words as drawing from alternate 'atoms' of discourse interesting: Linear algebraic structure of word meanings
To the extent you have short headline-like texts, and only word-vectors (without the data or algorithms to do deeper modeling), you may also want to look into the "Word Mover's Distance" calculation for comparing texts. Rather than reducing a single text to a single vector, it models it as a "bag of word-vectors". Then, it defines a distance as a cost-to-transform one bag to another bag. (More similar words are easier to transform into each other than less-similar words, so expressions that are very similar, with just a few synonyms replaced, report as quite close.)
It can be quite expensive to calculate on longer texts, but may work well for short phrases and small sets of headlines/tweets/etc. It's available on the Gensim KeyedVectors classes as wmdistance(). An example of the kinds of correlations it may be useful in discovering is in this article: Navigating themes in restaurant reviews with Word Mover's Distance
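Usage is essentially a one-liner on the trained model (tokenised inputs; recent Gensim versions need the POT package installed; smaller distance means more similar):

s1 = "hawaii makes a move to boost domestic travel".split()
s2 = "honolulu makes a move to boost travel".split()
print(model.wv.wmdistance(s1, s2))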
If you are interested in comparing sentences, Word2Vec is not the best choice. It has been shown that using it to create sentence embeddings produces inferior results compared to a dedicated sentence-embedding algorithm. If your dataset is not huge, you can't create (train a new) embedding space using your own data. This forces you to use a pre-trained embedding for the sentences. Luckily, there are enough of those nowadays. I believe that the Universal Sentence Encoder (by Google) will suit your needs best.
Once you get vector representations for your sentences you can go two ways:
create a matrix of pairwise comparisons and visualize it as a heatmap. This representation is useful when you have some prior knowledge about how close the sentences are and you want to check your hypothesis. You can even try it online.
run t-SNE on the vector representations. This will create a 2D projection of the sentences that preserves their relative distances. It presents data much better than PCA. Then you can easily find the neighbors of a given sentence (see the sketch below).
You can learn more from this and this
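A sketch covering both suggestions, assuming sent_vecs is an array of sentence vectors (from averaging, Doc2Vec, or the Universal Sentence Encoder), one row per sentence:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sklearn.metrics.pairwise import cosine_similarity

X = np.asarray(sent_vecs)

# 1) pairwise-similarity heatmap
plt.imshow(cosine_similarity(X), cmap="viridis")
plt.colorbar()
plt.show()

# 2) 2D t-SNE projection; perplexity must be smaller than the sentence count
proj = TSNE(n_components=2, perplexity=min(5, len(X) - 1),
            init="pca", random_state=0).fit_transform(X)
plt.scatter(proj[:, 0], proj[:, 1])
plt.show()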
Interesting take on the word2vec model. You can use t-SNE embeddings of the vectors, reduce the dimensionality to 3, and visualise them using any plotting library such as matplotlib or dash. I also find this tool helpful for visualising word embeddings: https://projector.tensorflow.org/
The idea of learning different word embeddings for words in different contexts is the premise of ELMo (https://allennlp.org/elmo), but you will require a huge training set to train it. Luckily, if your application is not very specific, you can use pre-trained models.

Word-sense disambiguation based on sets of words using pre-trained embeddings

I am interested in identifying the WordNet synset IDs for each word in a set of tags.
The words in the set provide the context for the word sense disambiguation, such as:
{mole, skin}
{mole, grass, fur}
{mole, chemistry}
{bank, river, river bank}
{bank, money, building}
I know of the lesk algorithm and libraries, such as pywsd, which is based on 10+ year old tech (which may still be cutting edge -- that is my question).
Are there better performing algorithms by now that make sense of pre-trained embeddings, like GloVe, and maybe the distances of these embeddings to each other?
Are there ready-to-use implementations of such WSD algorithms?
I know this question is close to the danger zone of asking for subjective preferences - as in this 5-year old thread. But I am not asking for an overview of options or the best software for a problem.
Transfer learning, particularly models like Allen AI's ELMo, OpenAI's GPT, and Google's BERT, allowed researchers to smash multiple benchmarks with minimal task-specific fine-tuning and provided the rest of the NLP community with pretrained models that could easily (with less data and less compute time) be fine-tuned and implemented to produce state-of-the-art results.
These representations will help you accurately retrieve results matching the customer's intent and contextual meaning, even if there's no keyword or phrase overlap.
To start off, embeddings are simply (moderately) low dimensional representations of a point in a higher dimensional vector space.
By translating a word to an embedding it becomes possible to model the semantic importance of a word in a numeric form and thus perform mathematical operations on it.
When the word2vec model first made this possible, it was an amazing breakthrough. From there, many more advanced models surfaced which not only captured a static semantic meaning but also a contextualized meaning. For instance, consider the two sentences below:
I like apples.
I like Apple macbooks.
Note that the word apple has a different semantic meaning in each sentence. Now with a contextualized language model, the embedding of the word apple would have a different vector representation which makes it even more powerful for NLP tasks.
Contextual embeddings like BERT offer an advantage over models like Word2Vec: while each word has a fixed representation under Word2Vec regardless of the context within which the word appears, BERT produces word representations that are dynamically informed by the words around them.
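A minimal sketch of extracting such contextual vectors with the Hugging Face transformers package (the model name is one common choice, not prescribed by anything above; note that wordpiece splitting can complicate the token lookup for rarer words):

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

def contextual_vector(sentence, target):
    # embedding of `target` in context; assumes it survives as a single wordpiece
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = bert(**enc).last_hidden_state[0]
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
    return hidden[tokens.index(target)]

v1 = contextual_vector("I like apples.", "apples")
v2 = contextual_vector("I like Apple macbooks.", "apple")
print(torch.cosine_similarity(v1, v2, dim=0).item())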

First time working with Word2Vec, try to cluster users based on their skill set

For my thesis I have to analyse the skills of candidates. I have to cluster the users and compare their skillsets. The information is classified, so I made a random database which has the same structure, so I can show how my data is built.
import random
import numpy as np
import pandas as pd

listOfSkills = ["Dutch", "Java OSGI", "XML Transformation Query",
                "Java Enterprise Edition", "Functional Design", "Scrum",
                "Python", "JavaScript", "Ruby", "Java", "SQL",
                "Data Analytics", "Machine Learning", "Deep Learning", "English"]
n = 5
test_skillset = []
for i in range(5):
    # each user gets a random sample of n skills, joined into one string
    result = random.sample(listOfSkills, n)
    test_skillset.append(", ".join(result))

test_id_ = np.arange(0, len(test_skillset)).tolist()
test_dict = {'id': test_id_,
             'skillset': test_skillset}
test_df = pd.DataFrame(test_dict)
After running this code I get a DataFrame which looks like this:
id, skillset
0, "Java, ruby, ..."
1, "Java, ruby, ..."
2, "Java, ruby, ..."
This is the same for the database I got.
The list of skills contains some of the skills I found in the database. There are also more users in the database, who have more skills.
I am quite new to machine learning and to using Word2Vec models. I tried some things, but almost all the time I don't get a result that gives me extra information. Some of the skills have long names, which may mess up the model. Or I did something wrong.
One of my goals is to cluster the users and find similarities between each skill set.
My final goal is to match the vectors of the skill set with vacancies to check how good of a match a user can be with an open vacancy. But first I need to know if I can find similarities between the users.
So my questions are:
How can I use Word2Vec to find individual similarities between skills?
How can I use Word2Vec to cluster the users and find similar skillsets?
Sorry if my question is a bit vague, English isn't my native language and Python is a bit new to me.
I am open to clarify things if needed.
Word2vec was originally unveiled as an algorithm trained on long, real natural language texts, which include many subtly-varied examples of word usage, in original contexts.
You seem to be applying it to a smaller controlled-vocabulary of known-skills, using training-data that isn't full natural language communication – just lists.
Word2vec & similar algorithms can sometimes offer interesting results on such not-quite-real-language corpora, but can require more tinkering with the training data, and parameters further from the usual defaults for natural language texts, for better results.
In particular, if you are using a randomly-generated corpus – especially one generated by uniform sampling from a tiny list of just 15 'words'! – you shouldn't expect the word2vec algorithm to do anything useful. There's no language-like pattern of relative word-meanings in such an artificial corpus. (And, any tiny hints-of-correlation any one run might show would be noise from your random sample, totally unlike the gradations of meaning real languages have, and real training texts would show.)
(There are probably other errors, as well, in your "tried some stuff" experiments that didn't yield useful results – but this use of a random training set is just the most obvious problem, in what you've shown.)
To get useful "skills vectors" you'll need lots of realistic data, and to adjust things like the size/options of your training to match the limits of what you have. (As just one example, it's nonsensical to try to train even 20-dimensional 'dense embedding' vectors, like those from word2vec, for a vocabulary that's less than 20 tokens long – & you'd probably need 400+ unique tokens to make 20-dimensional vectors start making sense.)
With the right data, you should start to see meaningful relationships between such skills-vectors – with related skills nearer each other than unrelated skills, and even the directions-of-differences suggestive of certain human-describable aspects of skills (like "more enterprisey", or "more abstract mathy"). But you can't even really eyeball those results for sanity/improvement unless it's realistic data, with real relationships, which you can evaluate using your domain knowledge.
You might then be able to try alternate ways of composing those into per-candidate summaries (such as averaging the skills-vectors together), or just use some composite measure-of-distance which relies on word-vectors (like say "Word Mover's Distance"), in order to try candidate-level clustering.
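A sketch of that averaging-then-clustering idea, assuming a Word2Vec model trained on realistic skill lists and the test_df layout from the question (the cluster count here is arbitrary):

import numpy as np
from sklearn.cluster import KMeans

def skillset_vector(skills, wv):
    # mean of the vectors for the skills the model knows
    vecs = [wv[s] for s in skills if s in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

user_vecs = np.array([skillset_vector(row.split(", "), model.wv)
                      for row in test_df['skillset']])
test_df['cluster'] = KMeans(n_clusters=3, n_init=10).fit_predict(user_vecs)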
Good luck!

Identifying multiple categories and associated sentiment within text

If you have a corpus of text, how can you identify all the categories (from a list of pre-defined categories) and the associated sentiment (positive/negative writing) with it?
I will be doing this in Python but at this stage I am not necessarily looking for a language specific solution.
Let's look at this question with an example to try and clarify what I am asking.
If I have a whole corpus of reviews for products e.g.:
Microsoft's Xbox One offers impressive graphics and a solid list of exclusive 2015 titles. The Microsoft console currently edges ahead of the PS4 with a better selection of media apps. The console's fall-2015 dashboard update is a noticeable improvement. The console has backward compatibility with around 100 Xbox 360 titles, and that list is poised to grow. The Xbox One's new interface is still more convoluted than the PS4's. In general, the PS4 delivers slightly better installation times, graphics and performance on cross-platform games. The Xbox One also lags behind the PS4 in its selection of indie games. The Kinect's legacy is still a blemish. While the PS4 remains our overall preferred choice in the game console race, the Xbox One's significant course corrections and solid exclusives make it a compelling alternative.
And I have a list of pre-defined categories e.g. :
Graphics
Game Play
Game Selection
Apps
Performance
Irrelevant/Other
I could take my big corpus of reviews and break them down by sentence. For each sentence in my training data I can hand tag them with the appropriate categories. The problem is that there could be various categories in 1 sentence.
If it was 1 category per sentence then any classification algorithm from scikit-learn would do the trick. When working with multi-classes I could use something like multi-label classification.
Adding in the sentiment is the trickier part. Identifying sentiment in a sentence is a fairly simple task, but if there is a mix of sentiments on different labels, it becomes more difficult.
The example sentence "The Xbox One has a good selection of games but the performance is worse than the PS4". We can identify two of our pre-defined categories (game selection, performance) but we have positive sentiment towards game selection and a negative sentiment towards performance.
What would be a way to identify all categories in text (from our pre-defined list) with their associated sentiment?
One simple method is to break your training set into minimal sentences using a parser and use that as the input for labelling and sentiment classification.
Your example sentence:
The Xbox One has a good selection of games but the performance is worse than the PS4
Using the Stanford Parser, take S tags that don't have child S tags (and thus are minimal sentences) and put the tokens back together. For the above sentence that would give you these:
The Xbox One has a good selection of games
the performance is worse than the PS4
Sentiment within an S tag should be consistent most of the time. If sentences like The XBox has good games and terrible graphics are common in your dataset you may need to break it down to NP tags but that seems unlikely.
Regarding labelling, as you mentioned any multi-label classification method should work.
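A minimal multi-label sketch with scikit-learn, assuming each minimal sentence has been hand-tagged with zero or more of the pre-defined categories (the two toy examples here are just your sentence split from above):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

sentences = ["The Xbox One has a good selection of games",
             "the performance is worse than the PS4"]
labels = [["Game Selection"], ["Performance"]]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)

clf = make_pipeline(TfidfVectorizer(),
                    OneVsRestClassifier(LogisticRegression(max_iter=1000)))
clf.fit(sentences, Y)
print(mlb.inverse_transform(clf.predict(["The graphics are worse than the PS4"])))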
For more sophisticated methods, there's a lot of research on joint topic-sentiment models - a search for "topic sentiment model" turns up a lot of papers and code. Here's sample training data from a paper introducing a Hidden Topic Sentiment Model that looks right up your alley. Note how in the first sentence with labels there are two topics.
Hope that helps!
The only approach I could think of consists of a set of steps.
1) Use some library to extract entities from text and their relationships. For example, check this article:
http://www.nltk.org/book/ch07.html
By parsing each text you may figure out which entities you have in each text and which chunks of text are related to the entity.
2) Use NLTK's sentiment extraction to analyze the chunks specifically related to each entity and obtain their sentiment. That gives you the sentiment of each entity (a sketch using NLTK's bundled VADER analyzer follows below).
3) After that you need to come up with a way to map the entities you find in text to what you call 'topics'. Unfortunately, I don't see a way to automate this, since you are clearly not defining topics conventionally, through word frequency (as in topic-modelling algorithms such as LDA or NMF).
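A minimal VADER sketch for step 2 (requires a one-time resource download):

import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")  # one-time download
sia = SentimentIntensityAnalyzer()
print(sia.polarity_scores("The Xbox One has a good selection of games"))
# -> dict with 'neg', 'neu', 'pos' and an overall 'compound' score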

NLP with Python - how to build a corpus, which classifier to use?

I’m trying to figure out which direction to take my Python NLP project in, and I’d be very grateful to the SO community for any advice.
Problem:
Let’s say I have 100 .txt files that contain the minutes of 100 meetings held by a decisionmaking body. I also have 100 .txt files of corresponding meeting outcomes, which contain the resolutions passed by this body. The outcomes fall into one of seven categories – 1 – take no action, 2 – take soft action, 3 – take stronger action, 4 – take strongest action, 5 – cancel soft action previously taken, 6 – cancel stronger action previously taken, 7 – cancel strongest action previously taken. Alternatively, this can be presented on a scale from -3 to +3, with 0 signifying no action, +1 signifying soft action, -1 signifying cancellation of soft action previously taken, and so on.
Based on the text of the inputs, I’m interested in predicting which of these seven outcomes will occur.
I’m thinking of treating this as a form of sentiment analysis, since the decision to take a certain kind of action is basically a sentiment. However, all the sentiment analysis examples I’ve found have focused on positive/negative dichotomies, sometimes adding in neutral sentiment as a category. I haven’t found any examples with more than 3 possible classifications of outcomes – not sure whether this is because I haven’t looked in the right places, because it just isn’t really an approach of interest for whatever reason, or because this approach is a silly idea for some reason of which I’m not yet quite sure.
Question 1. Should I be approaching this as a form of sentiment analysis, or is there some other approach that would work better? Should I instead treat this as a kind of categorization matter, similar to classifying news articles by topic and training the model to recognize the "topic" (outcome)?
Corpus:
I understand that I will need to build a corpus for training/test data, and it looks like I have two immediately evident options:
1 – hand-code a CSV file for training data that would contain some key phrases from each input text and list the value of the corresponding outcome on a 7-point scale, similar to what’s been done here: http://help.sentiment140.com/for-students
2 – use the approach Pang and Lee used (http://www.cs.cornell.edu/people/pabo/movie-review-data/) and put each of my .txt files of inputs into one of seven folders based on outcomes, since the outcomes (what kind of action was taken) are known based on historical data.
The downside to the first option is that it would be very subjective – I would determine which keywords/phrases I think are the most important to include, and I may not necessarily be the best arbiter. The downside to the second option is that it might have less predictive power because the texts are pretty long, contain lots of extraneous words/phrases, and are often stylistically similar (policy speeches tend to use policy words). I looked at Pang and Lee’s data, though, and it seems like that may not be a huge problem, since the reviews they’re using are also not very varied in terms of style. I’m leaning towards the Pang and Lee approach, but I’m not sure if it would even work with more than two types of outcomes.
Question 2. Am I correct in assuming that these are my two general options for building the corpus? Am I missing some other (better) option?
Question 3. Given all of the above, which classifier should I be using? I’m thinking maximum entropy would work best; I’ve also looked into random forests, but I have no experience with the latter and really have no idea what I’m doing (yet) when it comes to them.
Thank you very much in advance :)
Question 1 - The most straightforward way to think of this is as a text classification task (sentiment analysis is one kind of text classification task, but by no means the only one).
Alternatively, as you point out, you could consider your data as existing on a continuum ranging from -3 (cancel strongest action previously taken) to +3 (take strongest action), with 0 (take no action) in the middle. In this case you could treat the outcome as a continuous variable with a natural ordering. If so, then you could treat this as a regression problem rather than a classification problem. It's hard to know whether this is a sensible thing to do without knowing more about the data. If you suspect you will have a number of words/phrases that will be very probable at one end of the scale (-3) and very improbable at the other (+3), or vice versa, then regression may make sense. On the other hand, if the relevant words/phrases are associated with strong emotion and are likely to appear at either end of the scale but not in the middle, then you may be better off treating it as classification. It also depends on how you want to evaluate your results. If your algorithm predicts that a document is a -2 and it's actually a -3, will it be penalized less than if it had predicted +3? If so, it might be better to treat this as a regression task.
Question 2. "Am I correct in assuming that these are my two general options for building the corpus? Am I missing some other (better) option?"
Note that the set of documents (the .txt files of meeting minutes and corresponding outcomes) is your corpus -- the typical thing to do is randomly select 20% or so to be set aside as test data and use the remaining 80% as training data. The two general options you consider above are options for selecting the set of features that your classification or regression algorithm should attend to.
You correctly identify the upsides and downsides of the two most obvious approaches for coming up with features (hand-picking your own vs. Pang & Lee's approach of just using unigrams (words) as features).
Personally I'd also lean towards this latter approach, given that it's notoriously hard for humans to predict which phrases will be useful for classification--although there's no reason why you couldn't combine the two, having your initial set of features include all words plus whatever phrases you think might be particularly relevant. As you point out, there will be a lot of extraneous words, so it may help to throw out words that are very infrequent, or that don't differ enough in frequency between classes to provide any discriminative power. Approaches for reducing an initial set of features are known as "feature selection" techniques - one common method is mentioned here. Or see this paper for a more comprehensive list.
You could also consider features like the percent of high-valence words, high-arousal words, or high-dominance words, using the dataset here (click Supplementary Material and download the zip).
Depending on how much effort you want to put into this project, another common thing to do is to try a whole bunch of approaches and see which works best. Of course, you can't test which approach works best using data in the test set--that would be cheating and would run the risk of overfitting to the test data. But you can set aside a small part of your training set as 'validation data' (i.e. a mini-test set that you use for testing different approaches). Given that you don't have that much training data (80 documents or so), you could consider using cross validation.
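A sketch of that validation setup, assuming texts and outcomes hold your 100 documents and their seven labels (the classifier choice is just a placeholder):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

pipe = make_pipeline(TfidfVectorizer(), LinearSVC())
print(cross_val_score(pipe, texts, outcomes, cv=5).mean())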
Question 3 - The best way is probably to try different approaches and pick whatever works best in cross-validation. But if I had to pick one or two, I personally have found that k-nearest neighbor classification (with low k) or SVMs often work well for this kind of thing. A reasonable approach (sketched in code after this list) might be:
having your initial features be all unigrams (words) + phrases that you think might be predictive after you look at some training data;
applying a feature selection technique to trim down your feature set;
applying any algorithm that can deal with high-dimensional/text features, such as those in http://www.csc.kth.se/utbildning/kth/kurser/DD2475/ir10/forelasningar/Lecture9_4.pdf (lots of good tips in that pdf!), or those that achieved decent performance in the Pang & Lee paper.
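A hedged sketch of that three-step pipeline (unigram/bigram features, chi-squared feature selection, then a classifier that copes with high-dimensional text; train_texts and train_labels are assumed to be your training split):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

pipe = make_pipeline(CountVectorizer(ngram_range=(1, 2)),
                     SelectKBest(chi2, k=500),
                     KNeighborsClassifier(n_neighbors=3))
pipe.fit(train_texts, train_labels)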
Other possibilities are discussed in http://nlp.stanford.edu/IR-book/pdf/13bayes.pdf . Often the specific algorithm matters less than the features that go into it. Frankly it sounds like a very difficult sort of classification task, so it's possible that nothing will work very well.
If you decide to treat it as a regression rather than a classification task, you could go with k nearest neighbors regression ( http://www.saedsayad.com/k_nearest_neighbors_reg.htm ) or ridge regression.
Random forests often do not work well with large numbers of dependent features (words), though they may work well if you end up deciding to go with a smaller number of features (for example, a set of words/phrases you manually select, plus % of high-valence words and % of high-arousal words).
