Find most common multi words in an input file in Python - python

Say I have a text file, I can find the most frequent words easily using Counter. However, I would also like to find multi words like "tax year, fly fishing, u.s. capitol, etc.". Words that occur together the most.
import re
from collections import Counter
with open('full.txt') as f:
passage = f.read()
words = re.findall(r'\w+', passage)
cap_words = [word for word in words]
word_counts = Counter(cap_words)
for k, v in word_counts.most_common():
print(k, v)
I have this currently, however, this only find one word. How do I find multiple words?

What you're looking for is a way to count bigrams (strings containing two words).
The nltk library is great for doing lots of language related tasks, and you can use Counter from collections for all your counting-related activities!
import nltk
from nltk import bigrams
from collections import Counter
tokens = nltk.word_tokenize(passage)
print(Counter(bigrams(tokens))

What you call mutliwords (there is no such thing) is actually called bigrams. You can get a list of bigrams from a list of words by zipping it with itself with a displacement:
bigrams = [f"{x} {y}" for x,y, in zip(words, words[1:])]
P.S. NLTK would be indeed a better tool to get bigrams.

Related

Find NLTK Wordnet Synsets for combined items in a list

I am new to NLTK. I want to use nltk to extract hyponyms for a given list of words, specifically, for some combined words
my code:
import nltk
from nltk.corpus import wordnet as wn
list = ["real_time", 'Big_data', "Healthcare",
'Fuzzy_logic', 'Computer_vision']
def get_synset(a_list):
synset_list = []
for word in a_list:
a = wn.synsets(word)[:1] #The index is to ensure each word gets assigned 1st synset only
synset_list.append(a)
return synset_list
lst_synsets = get_synset(list)
lst_synsets
Here is the output:
[[Synset('real_time.n.01')],
[],
[Synset('healthcare.n.01')],
[Synset('fuzzy_logic.n.01')],
[]]
How could I find NLTK Wordnet Synsets for combined items? if no, any suggestion to use one of these methods for combined terms?

Is there a way to reverse stem in python nltk?

I have a list of stems in NLTK/python and want to get the possible words that create that stem.
Is there a way to take a stem and get a list of words that will stem to it in python?
To the best of my knowledge the answer is No, and depending on the stemmer it might be difficult to come up with an exhaustive search for reverting the effect of the stemming rules and the results would be mostly invalid words by any standard. E.g for Porter stemmer:
from nltk.stem.porter import *
stemmer = PorterStemmer()
stemmer.stem('grabfuled')
# results in "grab"
So, a reverse function would generate "grabfuled" as one of the valid words as "-ed" and "-ful" suffixes are removed consecutively in the stemming process.
However, given a valid lexicon, you can do the following which is independent of the stemming method:
from nltk.stem.porter import *
from collections import defaultdict
vocab = set(['grab', 'grabbing', 'grabbed', 'run', 'running', 'eat'])
# Here porter stemmer, but can be any other stemmer too
stemmer = PorterStemmer()
d = defaultdict(set)
for v in vocab:
d[stemmer.stem(v)].add(v)
print(d)
# defaultdict(<class 'set'>, {'grab': {'grab', 'grabbing', 'grabbed'}, 'eat': {'eat'}, 'run': {'run', 'running'}})
Now we have a dictionary that maps stems to the valid words that can generate them. For any stem we can do the following:
print(d['grab'])
# {'grab', 'grabbed', 'grabbing'}
For building the vocabulary you can tokenize a corpus or use nltk's built-in dictionary of English words.

How to find words relevances in a single document?

I want to find the relevance of some words (like economy, technology) in a single document.
The document has around 30 pages, the idea is to extract all text and determine words relevances for this document.
I know that TF-IDF is used in a group of document, but is it possible to use TF-IDF to solve this problem? If not, how can I do this in Python?
Using NLTK and one of its builtin corpora, you can make some estimates on how "relevant" a word is:
from collections import Counter
from math import log
from nltk import word_tokenize
from nltk.corpus import brown
toks = word_tokenize(open('document.txt').read().lower())
tf = Counter(toks)
freqs = Counter(w.lower() for w in brown.words())
n = len(brown.words())
for word in tf:
tf[word] *= log(n / (freqs[word] + 1))**2
for word, score in tf.most_common(10):
print('%8.2f %s' % (score, word))
Change document.txt to the name of your document and the script will output the ten most "relevant" words in it.

Python Count Number of Phrases in Text

I have a list of product reviews/descriptions in excel and I am trying to classify them using Python based on words that appear in the reviews.
I import both the reviews, and a list of words that would indicate the product falling into a certain classification, into Python using Pandas and then count the number of occurrences of the classification words.
This all works fine for single classification words e.g. 'computer' but I am struggling to make it work for phrases e.g. 'laptop case'.
I have look through a few answers but none were successful for me including:
using just text.count(['laptop case', 'laptop bag']) as per the answer here: Counting phrase frequency in Python 3.3.2 but because you need to split the text up that does not work (and I think maybe text.count does not work for lists either?)
Other answers I have found only look at the occurrence of a single word. Is there something I can do to count words and phrases that does not involve the splitting of the body of text into individual words?
The code I currently have (that works for individual terms) is:
for i in df1.index:
descriptions = df1['detaileddescription'][i]
if type(descriptions) is str:
descriptions = descriptions.split()
pool.append(sum(map(descriptions.count, df2['laptop_bag'])))
else:
pool.append(0)
print(pool)
You're on the right track! You're currently splitting into single words, which facilitates finding occurrences of single words as you pointed out. To find phrases of length n you should split the text into chunks of length n, which are called n-grams.
To do that, check out the NLTK package:
from nltk import ngrams
sentence = 'I have a laptop case and a laptop bag'
n = 2
bigrams = ngrams(sentence.split(), n)
for gram in bigrams:
print(gram)
Sklearn's CountVectorizer is the standard way
from sklearn.feature_extraction import text
vectorizer = text.CountVectorizer()
vec = vectorizer.fit_transform(descriptions)
And if you want to see the counts as a dict:
count_dict = {k:v for k,v in zip(vectorizer.get_feature_names(), vec.toarray()[0]) if v>0}
print (count_dict)
The default is unigrams, you can use bigrams or higher ngrams with the ngram_range parameter

How to find collocations in text, python

How do you find collocations in text?
A collocation is a sequence of words that occurs together unusually often.
python has built-in func bigrams that returns word pairs.
>>> bigrams(['more', 'is', 'said', 'than', 'done'])
[('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]
>>>
What's left is to find bigrams that occur more often based on the frequency of individual words. Any ideas how to put it in the code?
Try NLTK. You will mostly be interested in nltk.collocations.BigramCollocationFinder, but here is a quick demonstration to show you how to get started:
>>> import nltk
>>> def tokenize(sentences):
... for sent in nltk.sent_tokenize(sentences.lower()):
... for word in nltk.word_tokenize(sent):
... yield word
...
>>> nltk.Text(tkn for tkn in tokenize('mary had a little lamb.'))
<Text: mary had a little lamb ....>
>>> text = nltk.Text(tkn for tkn in tokenize('mary had a little lamb.'))
There are none in this small segment, but here goes:
>>> text.collocations(num=20)
Building collocations list
Here is some code that takes a list of lowercase words and returns a list of all bigrams with their respective counts, starting with the highest count. Don't use this code for large lists.
from itertools import izip
words = ["more", "is", "said", "than", "done", "is", "said"]
words_iter = iter(words)
next(words_iter, None)
count = {}
for bigram in izip(words, words_iter):
count[bigram] = count.get(bigram, 0) + 1
print sorted(((c, b) for b, c in count.iteritems()), reverse=True)
(words_iter is introduced to avoid copying the whole list of words as you would do in izip(words, words[1:])
import itertools
from collections import Counter
words = ['more', 'is', 'said', 'than', 'done']
nextword = iter(words)
next(nextword)
freq=Counter(zip(words,nextword))
print(freq)
A collocation is a sequence of tokens that are better treated as a single token when parsing e.g. "red herring" has a meaning that can't be derived from its components. Deriving a useful set of collocations from a corpus involves ranking the n-grams by some statistic (n-gram frequency, mutual information, log-likelihood, etc) followed by judicious manual editing.
Points that you appear to be ignoring:
(1) the corpus must be rather large ... attempting to get collocations from one sentence as you appear to suggest is pointless.
(2) n can be greater than 2 ... e.g. analysing texts written about 20th century Chinese history will throw up "significant" bigrams like "Mao Tse" and "Tse Tung".
What are you actually trying to achieve? What code have you written so far?
Agree with Tim McNamara on using nltk and problems with the unicode. However, I like the text class a lot - there is a hack that you can use to get the collocations as list , i discovered it looking at the source code . Apparently whenever you invoke the collocations method it saves it as a class variable!
import nltk
def tokenize(sentences):
for sent in nltk.sent_tokenize(sentences.lower()):
for word in nltk.word_tokenize(sent):
yield word
text = nltk.Text(tkn for tkn in tokenize('mary had a little lamb.'))
text.collocations(num=20)
collocations = [" ".join(el) for el in list(text._collocations)]
enjoy !

Categories