Identifying important words and phrases in text - python

I have text stored in a python string.
What I Want
To identify key words in that text.
To identify N-grams in that text (ideally more than just bigrams and trigrams).
Keep in mind...
The text might be small (e.g. tweet sized)
The text might be medium-sized (e.g. news article sized)
The text might be large (e.g. book or chapter sized)
What I Have
I'm already using nltk to break the corpus into tokens and remove stopwords:
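import nltk
# split on any run of characters that are neither word characters nor apostrophes
tokenizer = nltk.tokenize.RegexpTokenizer(r"[^\w']+", gaps=True)
# tokenize
tokens = tokenizer.tokenize(text)
# remove stopwords (build the set once instead of reloading the list for every token)
stopwords = set(nltk.corpus.stopwords.words('english'))
tokens = [w for w in tokens if w not in stopwords]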
I'm aware of BigramCollocationFinder and TrigramCollocationFinder, which do exactly what I'm looking for in those two cases.
The Question
I need advice for n-grams of higher order, improving the kinds of results that come from BCF and TCF, and advice on the best way to identify the most unique individual key words.
Many thanks!

As for the best way to identify the most unique individual key words, tf-idf is the standard measure. So you need either to integrate a search engine or to build a simple custom inverted index that is updated dynamically and holds term frequencies and document frequencies, so that tf-idf can be calculated efficiently and on the fly.
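A minimal sketch of the custom-index idea, assuming a small in-memory collection of already tokenized documents. The names `build_index` and `tfidf_scores` are illustrative, and the weighting used here (raw term frequency times log(N / df)) is only one of several common tf-idf variants:

```python
import math
from collections import Counter, defaultdict

def build_index(docs):
    """Return per-document term frequencies and collection-wide document frequencies."""
    term_freqs = [Counter(tokens) for tokens in docs]
    doc_freq = defaultdict(int)
    for tf in term_freqs:
        for term in tf:
            doc_freq[term] += 1
    return term_freqs, doc_freq

def tfidf_scores(doc_id, term_freqs, doc_freq):
    """Score each term of one document by tf * log(N / df)."""
    n_docs = len(term_freqs)
    return {
        term: count * math.log(n_docs / doc_freq[term])
        for term, count in term_freqs[doc_id].items()
    }

# Toy collection, made up for illustration
docs = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "log"],
    ["the", "cat", "chased", "the", "dog"],
]
term_freqs, doc_freq = build_index(docs)
scores = tfidf_scores(0, term_freqs, doc_freq)
top_terms = sorted(scores, key=scores.get, reverse=True)[:3]
```

A term that occurs in every document, such as "the" here, scores zero. This is also why tf-idf needs a background collection: a single tweet-sized text on its own gives no document frequencies to compare against.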
As for your N-grams, why not write a custom parser that slides a window of length N over the tokens and identifies, say, the most frequent N-grams? Just keep every N-gram as a key in a dictionary whose value is either its frequency or a score (for example, one based on the tf-idf of its individual terms).
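The window approach could look roughly like this. It is a sketch that counts raw frequencies only; the helper names `ngrams` and `most_frequent_ngrams` are made up for the example, and a tf-idf based score could be substituted for the count:

```python
from collections import Counter

def ngrams(tokens, n):
    """Yield every run of n consecutive tokens as a tuple."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])

def most_frequent_ngrams(tokens, n, top=5):
    """Return the `top` most common n-grams with their counts."""
    return Counter(ngrams(tokens, n)).most_common(top)

tokens = "new york is big and new york is busy".split()
top_trigrams = most_frequent_ngrams(tokens, 3, top=1)
```

Because `n` is a parameter, the same function covers 4-grams, 5-grams and so on; for inputs shorter than `n` it simply yields nothing.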

Related

words not available in corpus for Word2Vec training

I am totally new to Word2Vec. I want to find the cosine similarity between word pairs in my data. My code is as follows:
import pandas as pd
from gensim.models import Word2Vec
model = Word2Vec(corpus_file="corpus.txt", sg=0, window =7, size=100, min_count=10, iter=4)
vocabulary = list(model.wv.vocab)
data=pd.read_csv("experiment.csv")
cos_similarity = model.wv.similarity(data['word 1'], data['word 2'])
The problem is that some words in the "word 1" and "word 2" columns of my "experiment.csv" file are not present in the corpus file ("corpus.txt"), so this error is returned:
"word 'setosa' not in vocabulary"
What should I do to handle words that are not present in my input corpus? I want to assign the zero vector to words in my experiment that are not present in the input corpus, but I am stuck on how to do it.
Any ideas for my problems?
It's really easy to give unknown words the origin (all 'zero') vector:
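import numpy as np
word = data['word 1'].iloc[0]  # a single word, e.g. the first row of the column
if word in model.wv:
    vec = model.wv[word]
else:
    vec = np.zeros(100)  # same dimensionality as the trained vectors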
But this is unlikely to be what you want. The zero vector has zero length, so its cosine similarity with any other vector is undefined (the calculation divides by zero).
It's often better to simply ignore unknown words. If they were so rare that your training data didn't have enough of them to create a vector, they can't contribute much to other analyses.
If they're still important, the best approach is to get more data, with realistic usage contexts, so they get meaningful vectors.
Another alternative is to use an algorithm, such as the word2vec variant FastText, which can always synthesize a guess-vector for any out-of-vocabulary (OOV) word, based on the training data. It does this by learning word-vectors for word-fragments (character n-grams), then assembling a vector for a new unknown word from those fragments. It's often better than random, because unknown words are often typos or variants of known words with which they share a lot of segments. But it's still not great, and for really odd strings, essentially returns a random vector.
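To make the fragment idea concrete, here is a rough sketch of character n-gram extraction. It assumes the convention described in the FastText paper of wrapping each word in `<` and `>` boundary markers. The function name `char_ngrams` and the fixed n = 3 are illustrative only; this is not FastText's actual implementation.

```python
def char_ngrams(word, n=3):
    """Return the character n-grams of a word wrapped in boundary markers."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]

known = set(char_ngrams("where"))   # a word seen in training
typo = set(char_ngrams("wheer"))    # an unseen misspelling
shared = known & typo               # fragments the two have in common
```

The misspelling shares some fragments with the known word, so a vector assembled from fragment vectors has at least some signal to work with. A string that shares no fragments with anything in the training data has none.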
Another tactic I've seen used, but wouldn't personally recommend, is to replace many of the words that would otherwise be ignored, such as those with fewer than min_count occurrences, with a single plug token, say '<OOV>'. That synthetic token then becomes a quite common word, but gets an almost entirely meaningless vector: a random, low-magnitude one. (The prevalence of this fake word and its noise vector in training will tend to make the surrounding words' vectors worse or slower to train, compared to simply eliding the low-frequency words.) But then, when dealing with later unknown words, you can use this same '<OOV>' pseudoword's vector as a not-too-harmful stand-in.
But again: it's almost always better to do some combination of (a) getting more data, (b) ignoring rare words, and (c) using an algorithm like FastText, which can synthesize better-than-nothing vectors, than to collapse all unknown words to a single nonsense vector.
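As a sketch of the "ignore unknown words" option applied to the word pairs in the question: the helper below returns NaN for any pair containing an out-of-vocabulary word instead of raising an error. It is written against a plain dict of vectors (`vectors`, a made-up stand-in) so that it runs without a trained model; gensim's `model.wv` supports the same `in` test and `[]` lookup.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def safe_similarity(vectors, w1, w2):
    """Cosine similarity, or NaN if either word has no vector."""
    if w1 not in vectors or w2 not in vectors:
        return float("nan")
    return cosine(vectors[w1], vectors[w2])

# Toy 2-dimensional vectors, made up for illustration
vectors = {"iris": [1.0, 0.0], "flower": [1.0, 1.0]}
pairs = [("iris", "flower"), ("iris", "setosa")]
sims = [safe_similarity(vectors, a, b) for a, b in pairs]
```

NaN rows can then be dropped or reported separately, which keeps unknown words out of any averages rather than letting a placeholder vector distort them.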

Faster ways to POS tag using Python

I have ~2 million rows of text. Each row can be one or multiple sentences. I need to POS tag the entire corpus. The corpus is a list of strings, for example:
corpus = ["I am awesome. I really am.", "The earth is round.", \
"What is the name of our planet? Is it Earth?"]
The corpus has around 2 million strings, which I'm reading out of a database.
My ideal code for POS tagging looks like this, where I tokenise sentences, then tokenise words, and then POS tag:
from nltk import word_tokenize, sent_tokenize
from nltk.tag.perceptron import PerceptronTagger
tagger = PerceptronTagger()
for item in corpus:
    for sentence in sent_tokenize(item):
        tags = tagger.tag(word_tokenize(sentence))
This, however, is extremely slow. I read about pos_tag_sents() but I'm guessing that's going to take forever to do 2 million data points in one shot. Is there any other, faster way of doing this? I'm looking for a way to speed this up at least 2-3x. My main objective is to capture major word forms (nouns, verbs, question words, etc.), so I'm open to other POS taggers, provided they can speed up the process by 2-3x.

Cross-Lingual Word Sense Disambiguation

I am a beginner in computer programming and I am completing an essay on Parallel Corpora in Word Sense Disambiguation.
Basically, I intend to show that substituting a sense for a word translation simplifies the process of identifying the meaning of ambiguous words. I have already word-aligned my parallel corpus (EUROPARL English-Spanish) with GIZA++, but I don't know what to do with the output files. My intention is to build a classifier to calculate the probability of a translation word given the contextual features of the tokens which surround the ambiguous word in the source text.
So, my question is: how do you extract instances of an ambiguous word from a parallel corpus WITH its aligned translation?
I have tried various scripts in Python, but these rest on the assumption that 1) the English and Spanish texts are in separate corpora and 2) the English and Spanish sentences share the same indexes, which obviously does not work.
e.g.
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
w_lemma = WordNetLemmatizer()
def ambigu_word2(document, document2):
    words = ['letter']
    for i, sentence in enumerate(document):
        for item in word_tokenize(sentence):
            if w_lemma.lemmatize(item) in words:
                # print the English sentence and the Spanish sentence at the same index
                print(sentence, document2[i])
ambigu_word2(raw1, raw2)
I would be really grateful if you could provide any guidance on this matter.
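One possible shape for the extraction step, sketched under a strong assumption: the alignment output has already been converted into tuples of (source tokens, target tokens, links), where each link is a zero-based (source index, target index) pair. That representation and the function name `aligned_instances` are hypothetical, and parsing the GIZA++ files into it is not shown here.

```python
def aligned_instances(pairs, target_word, window=2):
    """Yield (translation, context) for each occurrence of target_word."""
    for src, tgt, links in pairs:
        for i, token in enumerate(src):
            if token.lower() != target_word:
                continue
            # target-side tokens linked to this source position
            translation = [tgt[j] for s, j in sorted(links) if s == i]
            # up to `window` source tokens on each side, excluding the word itself
            context = src[max(0, i - window):i] + src[i + 1:i + 1 + window]
            yield " ".join(translation), context

# Toy aligned sentence pairs, made up for illustration
pairs = [
    (["I", "sent", "a", "letter", "yesterday"],
     ["Envié", "una", "carta", "ayer"],
     [(1, 0), (2, 1), (3, 2), (4, 3)]),
    (["The", "letter", "A", "comes", "first"],
     ["La", "letra", "A", "va", "primero"],
     [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)]),
]
instances = list(aligned_instances(pairs, "letter"))
```

Each instance pairs the aligned translation (the label a classifier would predict) with the surrounding source tokens (the contextual features). The sketch matches lowercased surface forms only; lemmatizing the tokens first would also catch inflected forms.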

Statistical sentence suggestion model like spell checking

Spell-checking models already exist that suggest correct spellings based on a corpus of correctly spelled words. Can the granularity be raised from characters to words, so that we get phrase suggestions? That is, if an incorrect phrase is entered, the model should suggest the nearest correct phrase from a corpus of valid phrases on which it was trained.
Are there any Python libraries that already provide this functionality? If not, how should I proceed, given an existing large gold-standard phrase corpus, to get statistically relevant suggestions?
Note: this differs from a spell checker in that a spell checker's alphabet is finite, whereas in a phrase corrector each symbol of the "alphabet" is itself a word, so the alphabet is theoretically infinite. We can, however, limit the vocabulary to the words in a phrase bank.
What you want to build is an N-gram model, which consists of computing the probability of each word following a sequence of the previous n-1 words.
You can use NLTK text corpora to train your model, or you can tokenize your own corpus with nltk.sent_tokenize(text) and nltk.word_tokenize(sentence).
You can consider 2-gram (Markov model):
What is the probability for "kitten" to follow "cute"?
...or 3-gram:
What is the probability for "kitten" to follow "the cute"?
etc.
Obviously, training an (n+1)-gram model is costlier than training an n-gram model.
Instead of considering words, you can consider the pair (word, pos), where pos is the part-of-speech tag (you can get the tags using nltk.pos_tag(tokens)).
You can also try to consider the lemmas instead of the words.
Here are some interesting lectures about N-gram modelling:
Introduction to N-grams
Estimating N-gram Probabilities
This is a short, simple, unoptimized code example (2-gram):
from collections import defaultdict
import nltk
import math
ngram = defaultdict(lambda: defaultdict(int))
corpus = "The cat is cute. He jumps and he is happy."
for sentence in nltk.sent_tokenize(corpus):
    # build a list (not a lazy map object) so that it can be sliced below
    tokens = [t.lower() for t in nltk.word_tokenize(sentence)]
    for token, next_token in zip(tokens, tokens[1:]):
        ngram[token][next_token] += 1
# convert the raw counts to log10 conditional probabilities
for token in ngram:
    total = math.log10(sum(ngram[token].values()))
    ngram[token] = {nxt: math.log10(v) - total for nxt, v in ngram[token].items()}
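The same counting scheme can drive a simple suggestion function. The sketch below rebuilds the bigram counts with a plain whitespace split (an assumption made to keep it dependency-free; a real tokenizer such as NLTK's would do better) and returns the most probable next words. The names `train_bigrams` and `suggest_next` and the toy phrase bank are made up for the example.

```python
from collections import Counter, defaultdict

def train_bigrams(sentences):
    """Count, for each word, which words follow it."""
    followers = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for token, next_token in zip(tokens, tokens[1:]):
            followers[token][next_token] += 1
    return followers

def suggest_next(followers, word, k=3):
    """Return up to k (next_word, probability) pairs, most probable first."""
    counts = followers.get(word.lower())
    if not counts:
        return []  # word never seen with a follower
    total = sum(counts.values())
    return [(nxt, c / total) for nxt, c in counts.most_common(k)]

phrases = ["the cute kitten", "the cute puppy", "a cute kitten"]
model = train_bigrams(phrases)
suggestions = suggest_next(model, "cute")
```

Here "kitten" follows "cute" in two of the three phrases, so it is ranked first with probability 2/3. Candidate phrases could be ranked the same way, by summing the log-probabilities of their bigrams.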

How to find polysemy words from input query?

I am working on a polysemy disambiguation project, and for that I am trying to find polysemous words in an input query. The way I am doing it is:
#!/usr/bin/env python3
from nltk.corpus import stopwords
from nltk.corpus import wordnet as wn
stop = stopwords.words('english')
print("enter input query")
string = input()
str1 = [i for i in string.split() if i not in stop]
a = list()
for w in str1:
    # a word with more than one WordNet synset is treated as polysemous
    if len(wn.synsets(w)) > 1:
        a.append(w)
Here the list a will contain the polysemous words.
But with this method almost every word is treated as polysemous.
E.g. if my input query is "milk is white in colour", it stores ('milk', 'white', 'colour') as polysemous words.
WordNet is known to be very fine-grained, and it sometimes distinguishes between senses so subtly different that you and I might consider them the same. There have been attempts to make WordNet coarser; search for "Automatic of a coarse grained WordNet". I am not sure whether the results of that paper are available for download, but you can always contact the authors.
Alternatively, change your working definition of polysemy. If the most frequent sense of a word accounts for more than 80% of its uses in a large corpus, then the word is not polysemous. You will have to obtain frequency counts for the different senses of as many words as possible. Start your research here and here.
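The frequency-based definition can be sketched as a small filter. The sense counts below are invented purely for illustration, and `is_polysemous` is a hypothetical helper; with NLTK, `Lemma.count()` on WordNet lemmas is one possible source of real counts, though how well those counts cover your vocabulary is something to check.

```python
def is_polysemous(sense_counts, threshold=0.8):
    """True if no single sense accounts for more than `threshold` of the uses."""
    total = sum(sense_counts.values())
    if total == 0 or len(sense_counts) < 2:
        return False  # no usage data, or only one sense
    return max(sense_counts.values()) / total <= threshold

# Made-up sense frequencies, for illustration only
milk = {"milk.n.01": 95, "milk.v.01": 5}
bank = {"bank.n.01": 55, "bank.n.02": 45}
results = {"milk": is_polysemous(milk), "bank": is_polysemous(bank)}
```

With these numbers "milk" is dominated by one sense (95%) and is filtered out, while "bank" stays. The 80% threshold is the one suggested above and can be tuned.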
