Python Count Number of Phrases in Text

I have a list of product reviews/descriptions in excel and I am trying to classify them using Python based on words that appear in the reviews.
I import both the reviews, and a list of words that would indicate the product falling into a certain classification, into Python using Pandas and then count the number of occurrences of the classification words.
This all works fine for single classification words e.g. 'computer' but I am struggling to make it work for phrases e.g. 'laptop case'.
I have looked through a few answers but none were successful for me, including:
using just text.count(['laptop case', 'laptop bag']) as per the answer here: Counting phrase frequency in Python 3.3.2. That does not work because the text needs to be split into words first (and I don't think text.count accepts a list anyway).
Other answers I have found only look at the occurrence of a single word. Is there something I can do to count words and phrases that does not involve splitting the body of text into individual words?
The code I currently have (that works for individual terms) is:
for i in df1.index:
    descriptions = df1['detaileddescription'][i]
    if type(descriptions) is str:
        descriptions = descriptions.split()
        pool.append(sum(map(descriptions.count, df2['laptop_bag'])))
    else:
        pool.append(0)

print(pool)

You're on the right track! As you pointed out, splitting the text into single words only lets you find occurrences of single words. To find phrases of length n you should split the text into overlapping chunks of n words, which are called n-grams.
To do that, check out the NLTK package:
from nltk import ngrams

sentence = 'I have a laptop case and a laptop bag'
n = 2
bigrams = ngrams(sentence.split(), n)
for gram in bigrams:
    print(gram)
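To connect this back to the counting task in the question, here is a minimal sketch (the sample text and phrase list are illustrative, not taken from the question's data):
from nltk import ngrams

phrases = ['laptop case', 'laptop bag']
text = 'This laptop case fits a 15 inch laptop and doubles as a laptop bag'

# join each 2-gram back into a string, then count matches against the phrase list
bigram_strings = [' '.join(gram) for gram in ngrams(text.split(), 2)]
count = sum(bigram_strings.count(phrase) for phrase in phrases)
print(count)  # 2
The same idea drops into the question's loop in place of sum(map(descriptions.count, df2['laptop_bag'])).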

Sklearn's CountVectorizer is the standard way to do this:
from sklearn.feature_extraction import text
vectorizer = text.CountVectorizer()
vec = vectorizer.fit_transform(descriptions)
And if you want to see the counts as a dict:
count_dict = {k: v for k, v in zip(vectorizer.get_feature_names(), vec.toarray()[0]) if v > 0}
print(count_dict)
The default is unigrams; you can count bigrams or higher-order n-grams with the ngram_range parameter, as sketched below.
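For example, a minimal sketch that counts two-word phrases directly (the documents here are illustrative):
from sklearn.feature_extraction import text

docs = ['I bought a laptop case and a laptop bag', 'the laptop bag was too small']

# ngram_range=(2, 2) counts two-word phrases; (1, 2) would count both single words and pairs
vectorizer = text.CountVectorizer(ngram_range=(2, 2))
vec = vectorizer.fit_transform(docs)

count_dict = {k: v for k, v in zip(vectorizer.get_feature_names(), vec.toarray()[0]) if v > 0}
print(count_dict)  # includes 'laptop case' and 'laptop bag' as features
Note that newer scikit-learn versions rename get_feature_names() to get_feature_names_out().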

Related

Find most common multi words in an input file in Python

Say I have a text file. I can find the most frequent single words easily using Counter. However, I would also like to find multi-word phrases like "tax year", "fly fishing", "u.s. capitol", etc.: words that occur together the most.
import re
from collections import Counter

with open('full.txt') as f:
    passage = f.read()

words = re.findall(r'\w+', passage)
cap_words = [word for word in words]
word_counts = Counter(cap_words)
for k, v in word_counts.most_common():
    print(k, v)
This is what I have currently; however, it only finds single words. How do I find multi-word phrases?
What you're looking for is a way to count bigrams (strings containing two words).
The nltk library is great for doing lots of language-related tasks, and you can use Counter from collections for all your counting-related activities!
import nltk
from nltk import bigrams
from collections import Counter

tokens = nltk.word_tokenize(passage)
print(Counter(bigrams(tokens)))
What you call multiwords (there is no such thing) are actually called bigrams. You can get a list of bigrams from a list of words by zipping it with itself, shifted by one:
bigrams = [f"{x} {y}" for x, y in zip(words, words[1:])]
P.S. NLTK would indeed be a better tool for getting bigrams.
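Continuing that snippet, counting the joined bigrams with Counter (already imported in the question) gives the "most common multi words" directly:
from collections import Counter

bigram_counts = Counter(f"{x} {y}" for x, y in zip(words, words[1:]))
for phrase, count in bigram_counts.most_common(10):
    print(phrase, count)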

scikit-learn TfidfVectorizer ignoring certain words

I'm trying TfidfVectorizer on a sentence taken from the Wikipedia page about the History of Portugal. However, I noticed that the TfidfVec.fit_transform method is ignoring certain words. Here's the sentence I tried with:
sentence = "The oldest human fossil is the skull discovered in the Cave of Aroeira in Almonda."
TfidfVec = TfidfVectorizer()
tfidf = TfidfVec.fit_transform([sentence])
cols = [words[idx] for idx in tfidf.indices]
matrix = tfidf.todense()
pd.DataFrame(matrix,columns = cols,index=["Tf-Idf"])
output of the dataframe:
Essentially, it is ignoring the words "Aroeira" and "Almonda".
But I don't want it to ignore those words, so what should I do? I can't find anywhere in the documentation where this is discussed.
Another question: why is the word "the" repeated? Shouldn't the algorithm consider just one "the" and compute its tf-idf?
tfidf.indices are just indexes into TfidfVectorizer's feature names.
Using those indexes to pick words out of the original sentence is a mistake.
You should get the column names for your df from TfidfVec.get_feature_names(), as sketched below.
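For example, a sketch of the corrected lookup (same sentence as in the question; imports are spelled out so it runs on its own):
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

sentence = "The oldest human fossil is the skull discovered in the Cave of Aroeira in Almonda."
TfidfVec = TfidfVectorizer()
tfidf = TfidfVec.fit_transform([sentence])

# column names come from the fitted vocabulary, not from indexing the raw sentence
cols = TfidfVec.get_feature_names()
print(pd.DataFrame(tfidf.todense(), columns=cols, index=["Tf-Idf"]))
With this, "aroeira" and "almonda" show up as columns (lowercased, since the vectorizer lowercases by default).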
The output is giving you two "the" entries because "the" appears twice in the sentence. The entire sentence is encoded, and you're getting values for each of the indices. The reason the other two words are not appearing is that they are rare words; you can make them appear by reducing the threshold.
Refer to min_df and max_features:
http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
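For reference, a sketch of how those parameters are passed (the values here are illustrative, not a recommendation):
from sklearn.feature_extraction.text import TfidfVectorizer

# min_df=2 drops terms that appear in fewer than 2 documents;
# max_features=1000 keeps only the 1000 terms with the highest corpus frequency
TfidfVec = TfidfVectorizer(min_df=2, max_features=1000)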

Count number of texts in which a word occurs

I am building a word frequency, and relative frequency, list(s) for a collection of text files. Having discovered, by hand, that a couple of texts can overly influence the frequency of a word, one of the things I want to be able to do is count the number of texts in which a word occurs. It strikes me that there are two ways to do this:
First, compile a word frequency dictionary (as below -- and I'm not using NLTK's FreqDist because this code actually runs more quickly, but if FreqDist has that functionality built in and I just didn't know it, I'll take it):
import nltk

tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')
freq_dic = {}
for text in ftexts:
    words = tokenizer.tokenize(text)
    for word in words:
        # form dictionary
        try:
            freq_dic[word] += 1
        except KeyError:
            freq_dic[word] = 1
From there, I assume I'll need to write another loop that uses the keys above as keywords:
# This is just scratch code
for text in ftexts:
    while True:
        if keyword not in line:
            continue
        else:
            break
    count = count + 1
And then I'll find some way to mesh these two dictionaries into a tuple or, possibly, a pandas dataframe by word, such that:
word1, frequency, # of texts in which it occurs
word2, frequency, # of texts in which it occurs
The other thing that occurred to me as I was writing this question was to use scikit-learn's term frequency matrix and then count the rows in which a word occurs. Is that possible?
ADDED TO CLARIFY:
Imagine three sentences:
["I need to keep count of the children.",
"If you want to know what the count is, just ask."
"There is nothing here but chickens, chickens, chickens."]
"count" occurs 2x but is in two different texts; "chickens" occurs three times, but is in only one text. What I want is a report that looks like this:
WORD, FREQ, TEXTS
count, 2, 2
chicken, 3, 1
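A minimal sketch of the counting idea described above, built from the three sentences in the clarification (the variable names are illustrative; Counter plus a per-text set does the work):
from collections import Counter
from nltk.tokenize import RegexpTokenizer

ftexts = ["I need to keep count of the children.",
          "If you want to know what the count is, just ask.",
          "There is nothing here but chickens, chickens, chickens."]

tokenizer = RegexpTokenizer(r'\w+')
term_freq = Counter()   # total occurrences across all texts
doc_freq = Counter()    # number of texts each word appears in

for text in ftexts:
    words = [w.lower() for w in tokenizer.tokenize(text)]
    term_freq.update(words)
    doc_freq.update(set(words))   # a set, so each text counts a word at most once

for word in ('count', 'chickens'):
    print(word, term_freq[word], doc_freq[word])
# count 2 2
# chickens 3 1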

Understanding LDA / topic modelling -- too much topic overlap

I'm new to topic modelling / Latent Dirichlet Allocation and have trouble understanding how I can apply the concept to my dataset (or whether it's the correct approach).
I have a small number of literary texts (novels) and would like to extract some general topics using LDA.
I'm using the gensim module in Python along with some nltk features. For a test I've split up my original texts (just 6) into 30 chunks with 1000 words each. Then I converted the chunks into document-term matrices and ran the algorithm. This is the code (although I think it doesn't matter for the question):
import gensim

# chunks is a 30x1000 words matrix
dictionary = gensim.corpora.dictionary.Dictionary(chunks)
corpus = [dictionary.doc2bow(chunk) for chunk in chunks]
lda = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary,
                                      num_topics=10)
topics = lda.show_topics(5, 5)
However, the result is completely different from any example I've seen, in that the topics are full of meaningless words that can be found in all the source documents, e.g. "I", "he", "said", "like", ... For example:
[(2, '0.009*"I" + 0.007*"\'s" + 0.007*"The" + 0.005*"would" + 0.004*"He"'),
(8, '0.012*"I" + 0.010*"He" + 0.008*"\'s" + 0.006*"n\'t" + 0.005*"The"'),
(9, '0.022*"I" + 0.014*"\'s" + 0.009*"``" + 0.007*"\'\'" + 0.007*"like"'),
(7, '0.010*"\'s" + 0.009*"I" + 0.006*"He" + 0.005*"The" + 0.005*"said"'),
(1, '0.009*"I" + 0.009*"\'s" + 0.007*"n\'t" + 0.007*"The" + 0.006*"He"')]
I don't quite understand why that happens, or why it doesn't happen with the examples I've seen. How do I get the LDA model to find more distinctive topics with less overlap? Is it a matter of filtering out more common words first? How can I adjust how many times the model runs? Is the number of original texts too small?
LDA is extremely dependent on the words used in a corpus and how frequently they show up. The words you are seeing are all stopwords - meaningless words that are the most frequent words in a language, e.g. "the", "I", "a", "if", "for", "said", etc. Since these words are the most frequent, they will negatively impact the model.
I would use the nltk stopword corpus to filter out these words:
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
Then make sure your text does not contain any of the words in the stop_words list (by whatever pre-processing method you are using) - an example is below:
text = text.split() # split words by space and convert to list
text = [word for word in text if word not in stop_words]
text = ' '.join(text) # join the words in the text to make it a continuous string again
You may also want to remove punctuation and other characters ("/", "-", etc.); to do that, use regular expressions:
import re
remove_punctuation_regex = re.compile(r"[^A-Za-z ]") # regex for all characters that are NOT A-Z, a-z and space " "
text = re.sub(remove_punctuation_regex, "", text) # sub all non alphabetical characters with empty string ""
Finally, you may also want to filter on most frequent or least frequent words in your corpus, which you can do using nltk:
from nltk import FreqDist
all_words = text.split() # list of all the words in your corpus
fdist = FreqDist(all_words) # a frequency distribution of words (word count over the corpus)
k = 10000 # say you want to see the top 10,000 words
top_k_words, _ = zip(*fdist.most_common(k)) # unzip the words and word count tuples
print(top_k_words) # print the words and inspect them to see which ones you want to keep and which ones you want to disregard
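If you then decide to keep only those frequent words, a small follow-on sketch (the whitelist filtering is my addition, not part of the steps above; chunks is the 30x1000 matrix from the question):
# keep only whitelisted words in each chunk before building the gensim dictionary
top_k_set = set(top_k_words)   # set membership tests are much faster than list lookups
chunks = [[word for word in chunk if word in top_k_set] for chunk in chunks]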
That should get rid of the stopwords and extra characters, but still leaves the vast problem of topic modelling (which I won't try to explain here but will leave some tips and links for).
Assuming you know a little bit about topic modelling, let's start. LDA is a bag-of-words model, meaning word order doesn't matter. The model assigns a topic distribution (over a predetermined number of topics K) to each document, and a word distribution to each topic. A very insightful high-level video explains this here. If you want to see more of the mathematics, but still at an accessible level, check out this video. The more documents the better, and usually longer documents (with more words) also fare better with LDA - this paper shows that LDA doesn't perform well with short texts (less than ~20 words). K is up to you to choose, and really depends on your corpus of documents (how large it is, what different topics it covers, etc.). Usually a good value of K is between 100 and 300, but again this really depends on your corpus.
LDA has two hyperparameters, alpha and beta (alpha and eta in gensim): a higher alpha means each text will be represented by more topics (so naturally a lower alpha means each text will be represented by fewer topics). A high eta means each topic is represented by more words, and a low eta means each topic is represented by fewer words - so with a low eta you would get less "overlap" between topics. A sketch of setting these in gensim follows.
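As a concrete sketch, gensim's LdaModel accepts these directly (the values below are illustrative starting points, not tuned recommendations; passes controls how many times the model sweeps over the corpus, which relates to the "how many times the model runs" question):
import gensim

lda = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary,
                                      num_topics=10,
                                      alpha=0.1,    # lower alpha: fewer topics per document
                                      eta=0.01,     # lower eta: fewer words per topic, less overlap
                                      passes=10)    # more passes over a small corpus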
There are many insights you could gain using LDA:
What are the topics in a corpus (naming topics may not matter to your application, but if it does this can be done by inspecting the words in a topic as you have done above)
What words contribute most to a topic
What documents in the corpus are most similar (using a similarity metric)
Hope this has helped. I was new to LDA a few months ago but I've quickly gotten up to speed using stackoverflow and youtube!

Extracting collocates for a given word from a text corpus - Python

I am trying to find out how to extract the collocates of a specific word out of a text. As in: what are the words that make a statistically significant collocation with, e.g., the word "hobbit" in the entire text corpus? I am expecting a result similar to a list of words (collocates) or maybe tuples (my word + its collocate).
I know how to make bi- and tri-grams using nltk, and also how to select only the bi- or trigrams that contain my word of interest. I am using the following code (adapted from this StackOverflow question).
import nltk
from nltk.collocations import *

corpus = nltk.Text(text)  # "text" is a list of tokens
trigram_measures = nltk.collocations.TrigramAssocMeasures()
tri_finder = TrigramCollocationFinder.from_words(corpus)

# Only trigrams that appear 3+ times
tri_finder.apply_freq_filter(3)

# Only the ones containing my word
my_filter = lambda *w: 'Hobbit' not in w
tri_finder.apply_ngram_filter(my_filter)

print(tri_finder.nbest(trigram_measures.likelihood_ratio, 20))
This works fine and gives me a list of trigrams (one element of which is my word), each with their log-likelihood value. But I don't really want to select words only from a list of trigrams. I would like to make all possible n-gram combinations in a window of my choice (for example, all words in a window of 3 left and 3 right from my word - that would mean a 7-gram), and then check which of those n-gram words has a statistically relevant frequency paired with my word of interest. I would like to take the log-likelihood value for that.
My idea would be:
1) Calculate all n-gram combinations of different sizes containing my word (not necessarily using nltk, unless it allows calculating units larger than trigrams, but I haven't found that option),
2) Compute the log-likelihood value for each of the words composing my n-grams, and somehow compare it against the frequency of the n-gram they appear in (?). Here is where I get lost a bit... I am not experienced in this and I don't know how to think through this step.
Does anyone have suggestions on how I should approach this?
And assuming I use the pool of trigrams provided by nltk for now: does anyone have ideas on how to proceed from there to get a list of the most relevant words near my search word?
Thank you
Interesting problem...
Related to 1): take a look at this thread, which has several different nice solutions for making n-grams, for example:
from nltk import ngrams

sentence = 'this is a foo bar sentences and i want to ngramize it'
n = 6
sixgrams = ngrams(sentence.split(), n)
for grams in sixgrams:
    print(grams)
The other way could be to use gensim's Phrases:
from gensim import models
from gensim.models import Phrases

# doc is your tokenized corpus (an iterable of token lists)
phrases = Phrases(doc, min_count=2)
bigram = models.phrases.Phraser(phrases)

phrases = Phrases(bigram[doc], min_count=2)
trigram = models.phrases.Phraser(phrases)

phrases = Phrases(trigram[doc], min_count=2)
quadgram = models.phrases.Phraser(phrases)
... (you could continue indefinitely)
min_count sets the minimum number of times a word/bigram must appear in the corpus before it is considered.
Related to 2): it's somewhat tricky to calculate log-likelihood for more than two variables, since you have to account for all the permutations. Have a look at this thesis, in which the author proposes a solution (page 26 contains a good explanation).
However, in addition to the log-likelihood function, there is the PMI (Pointwise Mutual Information) metric, which compares the co-occurrence of a pair of words against their individual frequencies in the text. PMI is easy to understand and calculate, and you can use it for each pair of words. A sketch combining a windowed bigram finder with PMI scoring is below.
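As a concrete sketch of both points, nltk's BigramCollocationFinder can be built with a window_size, which treats any pair of words within that window as a co-occurrence, and its results can be ranked by PMI (corpus and the 'Hobbit' filter are reused from the question; window_size=7 mirrors the 3-left/3-right idea):
import nltk
from nltk.collocations import *

bigram_measures = nltk.collocations.BigramAssocMeasures()

# corpus is the same token list as in the question
bi_finder = BigramCollocationFinder.from_words(corpus, window_size=7)
bi_finder.apply_freq_filter(3)
bi_finder.apply_ngram_filter(lambda *w: 'Hobbit' not in w)

# rank candidate collocates of 'Hobbit' by PMI (likelihood_ratio also works, as above)
print(bi_finder.nbest(bigram_measures.pmi, 20))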
