I have to write a script that will give me all content words in descending order of frequency.
I need the 10 most frequent content words, so I not only need to make a list of the 10 most frequent words of my corpus, I also need to filter out the function words (and, or, any punctuation...).
What I have so far is the following:
fileids = corpus.fileids()
text = corpus.words(fileids)
wlist = []
ftable = nltk.FreqDist(text)
wlist.append(ftable.keys())
This gives me a very neat list of all words in descending order of frequency, but how do I filter the function words out?
Thank you.
You want to filter out a set of words (stopwords). Taking the core idea from this SO answer, you need to introduce a couple of lines into your code.
Just after
fileids = corpus.fileids()
text = corpus.words(fileids)
Add the following lines to create a list of stopwords and filter them out of your text:
#get a list of the stopwords
stp = nltk.corpus.stopwords.words('english')
#from your text of words, keep only the ones NOT in stp
filtered_text = [w for w in text if w not in stp]
Now, continue as you would
wlist = []
ftable = nltk.FreqDist(filtered_text)
wlist.append(ftable.keys())
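If you specifically need the 10 most frequent content words, here is a minimal sketch that also drops punctuation tokens (assuming the same corpus/text objects as above; FreqDist.most_common returns the words with their counts):
import nltk

stp = set(nltk.corpus.stopwords.words('english'))

# keep only alphabetic tokens that are not stopwords
content_words = [w for w in text if w.isalpha() and w.lower() not in stp]

ftable = nltk.FreqDist(content_words)
print(ftable.most_common(10))  # the 10 most frequent content words with counts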
Hope that helps.
I am using Yake (Yet Another Keyword Extractor) to extract keywords from a dataframe.
I want to extract only bigrams and trigrams, but Yake only allows setting a maximum ngram size, not a minimum size. How would you remove the unigrams?
Example df.head(0):
Text:
'oui , yes , i mumbled , the linguistic transition now in limbo .'
Keywords:
'[('oui', 0.04491197687864554),
('linguistic transition', 0.09700399286574239),
('mumbled', 0.15831692877998726)]'
I want to remove 'oui', 'mumbled' and their scores from the keywords column.
Thank you for your time!
If your problem is that the keywords list contains some unigrams, you can simply apply a filter that ignores keywords without spaces and builds a new list. I'll give you an example:
keywords_without_unigrams = []
for kw in keywords:
    if ' ' in kw[0]:
        keywords_without_unigrams.append(kw)

for kw in keywords_without_unigrams:
    print(kw)
If you need to handle the unigram case from Yake, just pass the output through a filter that adds an n-gram to the result list only if there is a space in the first element of its tuple, or if the str.split() of that element yields more than one sub-element. If you're using a function and applying it to the dataframe, include this step in that function (see the sketch below).
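For instance, a minimal sketch of that approach; the column names 'Text' and 'Keywords' and the extractor settings are assumptions based on your example:
import yake

kw_extractor = yake.KeywordExtractor(n=3)  # max ngram size of 3 (assumed setting)

def extract_multiword_keywords(text):
    keywords = kw_extractor.extract_keywords(text)
    # keep only bigrams and trigrams, i.e. keywords containing at least one space
    return [kw for kw in keywords if ' ' in kw[0]]

df['Keywords'] = df['Text'].apply(extract_multiword_keywords)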
I am trying to extract separated multi-word matches from a Python list, using two different lists as the query strings. My list of sentences is
lst = ['we have the terrible HIV epidemic that takes down the life expectancy of the African ','and I take the regions down here','The poorest are down']
lst_verb = ['take','go','wake']
lst_prep = ['down','up','in']
import re

output = []
item = 'down'
p = re.compile(r'(?:\w+\s+){1,20}' + item)

for i in lst:
    output.append(p.findall(i))

for item in output:
    print(item)
With this I am able to extract words from the list. However, I only want to extract separated multi-words, i.e. it should only extract the match from the sentence "and I take the regions down here".
Furthermore, I want to use the words from lst_verb and lst_prep as the query strings,
for example
re.findall(r \lst_verb+'*.\b'+ \lst_prep)
Thank you for your answer.
You can use a regex like
(?is)^(?=.*\b(take)\b)(?=.*?\b(go)\b)(?=.*\b(where)\b)(?=.*\b(wake)\b).*
to match multiple words, as in your example.
Use a function to create the regex string from the verbs and prepositions.
Hope this helps.
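For instance, a minimal sketch of building the pattern from the two lists; this variant requires a verb from lst_verb followed later in the same sentence by a preposition from lst_prep, with at least one word in between, so only the second sentence matches:
import re

lst = ['we have the terrible HIV epidemic that takes down the life expectancy of the African ',
       'and I take the regions down here',
       'The poorest are down']
lst_verb = ['take', 'go', 'wake']
lst_prep = ['down', 'up', 'in']

# verb, then one or more intervening words, then a preposition
pattern = re.compile(
    r'\b(?:' + '|'.join(map(re.escape, lst_verb)) + r')\b'
    r'(?:\s+\w+)+\s+'
    r'(?:' + '|'.join(map(re.escape, lst_prep)) + r')\b'
)

for sentence in lst:
    match = pattern.search(sentence)
    if match:
        print(sentence, '->', match.group(0))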
I am building a word frequency, and relative frequency, list(s) for a collection of text files. Having discovered, by hand, that a couple of texts can overly influence the frequency of a word, one of the things I want to be able to do is count the number of texts in which a word occurs. It strikes me that there are two ways to do this:
First, to compile a word frequency dictionary (as below; I'm not using the NLTK FreqDist because this code actually runs more quickly, but if FreqDist has the above functionality built in and I just didn't know it, I'll take it):
import nltk

tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')
freq_dic = {}
for text in ftexts:
    words = tokenizer.tokenize(text)
    for word in words:
        # build the frequency dictionary
        try:
            freq_dic[word] += 1
        except KeyError:
            freq_dic[word] = 1
From there, I assume I'll need to write another loop that uses the keys above as keywords:
# This is just scratch code
for text in ftexts:
    if keyword not in text:
        continue
    count = count + 1
And then I'll find some way to mesh these two dictionaries into a tuple or, possibly, a pandas dataframe by word, such that:
word1, frequency, # of texts in which it occurs
word2, frequency, # of texts in which it occurs
The other thing that occurred to me as I was writing this question was to use scikit-learn's term-frequency matrix and then count the rows in which a word occurs. Is that possible?
ADDED TO CLARIFY:
Imagine three sentences:
["I need to keep count of the children.",
"If you want to know what the count is, just ask."
"There is nothing here but chickens, chickens, chickens."]
"count" occurs 2x but is in two different texts; "chickens" occurs three times, but is in only one text. What I want is a report that looks like this:
WORD, FREQ, TEXTS
count, 2, 2
chicken, 3, 1
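A minimal sketch of the two counts described above, using collections.Counter (assuming ftexts is the list of sentences):
import nltk
from collections import Counter

tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')

word_freq = Counter()  # total occurrences of each word
text_freq = Counter()  # number of texts each word occurs in

for text in ftexts:
    words = tokenizer.tokenize(text.lower())
    word_freq.update(words)
    text_freq.update(set(words))  # count each word at most once per text

print('WORD, FREQ, TEXTS')
for word, freq in word_freq.most_common():
    print(word, freq, text_freq[word])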
I have a list of product reviews/descriptions in Excel and I am trying to classify them using Python, based on words that appear in the reviews.
I import both the reviews, and a list of words that would indicate the product falling into a certain classification, into Python using Pandas and then count the number of occurrences of the classification words.
This all works fine for single classification words, e.g. 'computer', but I am struggling to make it work for phrases, e.g. 'laptop case'.
I have looked through a few answers, but none were successful for me, including:
using just text.count(['laptop case', 'laptop bag']), as per the answer here: Counting phrase frequency in Python 3.3.2, but because you need to split the text up, that does not work (and I think text.count does not work for lists either?)
Other answers I have found only look at the occurrence of a single word. Is there something I can do to count words and phrases that does not involve splitting the body of text into individual words?
The code I currently have (that works for individual terms) is:
for i in df1.index:
    descriptions = df1['detaileddescription'][i]
    if type(descriptions) is str:
        descriptions = descriptions.split()
        pool.append(sum(map(descriptions.count, df2['laptop_bag'])))
    else:
        pool.append(0)
print(pool)
You're on the right track! You're currently splitting the text into single words, which facilitates finding occurrences of single words, as you pointed out. To find phrases of length n, you should split the text into chunks of length n, which are called n-grams.
To do that, check out the NLTK package:
from nltk import ngrams

sentence = 'I have a laptop case and a laptop bag'
n = 2
bigrams = ngrams(sentence.split(), n)

for gram in bigrams:
    print(gram)
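From there, a minimal sketch of counting your phrases against the bigrams, reusing df2['laptop_bag'] as the list of phrases from your code:
# join each bigram back into a phrase string, then count matches
phrases = [' '.join(gram) for gram in ngrams(sentence.split(), 2)]
count = sum(phrases.count(p) for p in df2['laptop_bag'])
print(count)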
Sklearn's CountVectorizer is the standard way
from sklearn.feature_extraction import text
vectorizer = text.CountVectorizer()
vec = vectorizer.fit_transform(descriptions)
And if you want to see the counts as a dict:
count_dict = {k:v for k,v in zip(vectorizer.get_feature_names(), vec.toarray()[0]) if v>0}
print (count_dict)
The default is unigrams; you can use bigrams or higher n-grams with the ngram_range parameter.
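For example, a sketch counting two-word phrases such as 'laptop case', reusing the names above:
# count bigrams only, e.g. 'laptop case' and 'laptop bag'
vectorizer = text.CountVectorizer(ngram_range=(2, 2))
vec = vectorizer.fit_transform(descriptions)
print(dict(zip(vectorizer.get_feature_names(), vec.toarray()[0])))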
I am trying to find out how to extract the collocates of a specific word out of a text. As in: what are the words that make a statistically significant collocation with, e.g., the word "hobbit" in the entire text corpus? I am expecting a result similar to a list of words (collocates) or maybe tuples (my word + its collocate).
I know how to make bi- and trigrams using nltk, and also how to select only the bi- or trigrams that contain my word of interest. I am using the following code (adapted from this StackOverflow question):
import nltk
from nltk.collocations import *
corpus = nltk.Text(text) # "text" is a list of tokens
trigram_measures = nltk.collocations.TrigramAssocMeasures()
tri_finder = TrigramCollocationFinder.from_words(corpus)
# Only trigrams that appear 3+ times
tri_finder.apply_freq_filter(3)
# Only the ones containing my word
my_filter = lambda *w: 'Hobbit' not in w
tri_finder.apply_ngram_filter(my_filter)
print(tri_finder.nbest(trigram_measures.likelihood_ratio, 20))
This works fine and gives me a list of trigrams (one element of which is my word), each with its log-likelihood value. But I don't really want to select words only from a list of trigrams. I would like to make all possible n-gram combinations in a window of my choice (for example, all words within a window of 3 to the left and 3 to the right of my word, which would mean a 7-gram), and then check which of those n-gram words has a statistically relevant frequency paired with my word of interest. I would like to use the log-likelihood value for that.
My idea would be:
1) Calculate all n-gram combinations of different sizes containing my word (not necessarily using nltk, unless it allows calculating units larger than trigrams, but I haven't found that option),
2) Compute the log-likelihood value for each of the words composing my n-grams, and somehow compare it against the frequency of the n-gram they appear in (?). Here is where I get lost a bit... I am not experienced in this and I don't know how to think through this step.
Does anyone have suggestions on how I should do this?
And assuming I use the pool of trigrams provided by nltk for now: does anyone have ideas on how to proceed from there to get a list of the most relevant words near my search word?
Thank you.
Interesting problem ...
Related to 1), take a look at this thread... different nice solutions to make ngrams. Basically I like this one:
from nltk import ngrams

sentence = 'this is a foo bar sentences and i want to ngramize it'
n = 6
sixgrams = ngrams(sentence.split(), n)

for grams in sixgrams:
    print(grams)
The other way could be gensim's Phrases:
from gensim import models
from gensim.models import Phrases

# doc: your corpus as a list of tokenized sentences
phrases = Phrases(doc, min_count=2)
bigram = models.phrases.Phraser(phrases)
phrases = Phrases(bigram[doc], min_count=2)
trigram = models.phrases.Phraser(phrases)
phrases = Phrases(trigram[doc], min_count=2)
quadgram = models.phrases.Phraser(phrases)
... (you could continue indefinitely)
min_count controls the minimum frequency a word must have in the corpus for it to be considered.
Related to 2), it's somewhat tricky to calculate log-likelihood for more than two variables, since you have to account for all the permutations. Look at this thesis, in which the author proposes a solution (page 26 contains a good explanation).
However, in addition to the log-likelihood function, there is the PMI (Pointwise Mutual Information) metric, which calculates the co-occurrence of a pair of words divided by their individual frequencies in the text. PMI is easy to understand and calculate, and you could use it for each pair of words.
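For instance, a minimal sketch using nltk's bigram finder with a 7-token window around your word (assuming corpus is the token list from your question); swapping pmi for likelihood_ratio gives the log-likelihood ranking instead:
import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

bigram_measures = BigramAssocMeasures()

# pairs of words that co-occur within a window of 7 tokens (3 left + word + 3 right)
finder = BigramCollocationFinder.from_words(corpus, window_size=7)
finder.apply_freq_filter(3)

# keep only pairs that contain the word of interest
finder.apply_ngram_filter(lambda *w: 'Hobbit' not in w)

# the 20 pairs with the highest PMI scores
print(finder.score_ngrams(bigram_measures.pmi)[:20])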