Here are two examples: one that works, derived from https://www.nltk.org/book/ch02.html, and another that does not.
The first example plots single-word frequencies, here for ['america', 'citizen']. The second is a modified (evidently incorrect) version that attempts to plot the frequency of the bigram ['american citizen']. I would like to plot n-gram frequencies, such as for a bigram like ['american citizen'].
Plot Example 1
Plot Example 2 - failed
import nltk
from nltk.book import *
import matplotlib.pyplot as plt
from nltk.corpus import inaugural
inaugural.fileids()
plt.ion() # turns interactive mode on
[fileid[:4] for fileid in inaugural.fileids()]
############- this works ####
cfd = nltk.ConditionalFreqDist(
    (target, fileid[:4])
    for fileid in inaugural.fileids()
    for w in inaugural.words(fileid)
    for target in ['america', 'citizen']
    if w.lower().startswith(target))
ax = plt.axes()
cfd.plot()
############- this does not work ####
cfd = nltk.ConditionalFreqDist(
    (target, fileid[:4])
    for fileid in inaugural.fileids()
    for w in inaugural.words(fileid)
    for target in ['american citizen']
    if w.lower().startswith(target))
ax = plt.axes()
cfd.plot()
It seems to me that you are trying to find 'american citizen', a collocation of two words, by looking among single words. This is bound to fail, because no single token ever starts with 'american citizen'. You have to check for such a bigram among pairs of consecutive words, and for that you need to zip the list of words with a copy of itself shifted by one word.
The key difference from your code is below (you can add more collocations, as pairs of words, to the list in the last for clause):
def zip2(lst):
    ilst = iter(lst)
    _ = next(ilst)  # drop the first element
    return zip(lst, ilst)
cfd = nltk.ConditionalFreqDist(
    (t1 + ' ' + t2, fileid[:4])
    for fileid in inaugural.fileids()
    for w1, w2 in zip2(inaugural.words(fileid))
    for t1, t2 in [('american', 'citizen',)]
    if w1.lower().startswith(t1) and w2.lower().startswith(t2)
)
ax = plt.axes()
cfd.plot()
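The same shifting idea extends from bigrams to n-grams of any length. The following is a minimal sketch using only the standard library, not NLTK; the helper names ngrams and count_targets and the toy docs dictionary are made up for illustration and stand in for the inaugural corpus.

```python
from collections import Counter, defaultdict

def ngrams(tokens, n):
    """Yield consecutive n-token tuples by zipping n shifted copies of the list."""
    return zip(*(tokens[i:] for i in range(n)))

def count_targets(docs, targets):
    """Count, per document label, how often each target n-gram occurs.

    docs maps a label (e.g. a year) to a list of tokens; targets is a list
    of tuples of word prefixes, matched case-insensitively with startswith.
    """
    counts = defaultdict(Counter)
    for label, tokens in docs.items():
        lowered = [t.lower() for t in tokens]
        for target in targets:
            for gram in ngrams(lowered, len(target)):
                if all(w.startswith(t) for w, t in zip(gram, target)):
                    counts[" ".join(target)][label] += 1
    return counts

# Hypothetical toy corpus standing in for inaugural.words(fileid)
docs = {
    "1789": "Every American citizen and all American citizens agree".split(),
    "1793": "An American citizen spoke of America".split(),
}
result = count_targets(docs, [("american", "citizen"), ("america",)])
for target, per_label in result.items():
    print(target, dict(per_label))
```

Each (target, label) count here plays the role of one (condition, sample) pair in the ConditionalFreqDist above.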
Related
How do I measure or find the Zipf distribution? For example, I have a corpus of English words. I need to find the Zipf distribution and then plot a graph of it, but I am stuck at the first step, which is finding the Zipf distribution.
Edit: From the frequency count of each word, it is clear that the corpus obeys Zipf's law. But my aim is to plot a Zipf distribution graph, and I have no idea how to calculate the data for it.
I don't pretend to understand statistics. However, based on reading the SciPy site, here is a naive attempt in Python.
Build Data
First we get our data. For example, we download the National Library of Medicine MeSH (Medical Subject Headings) ASCII file d2016.bin (28 MB).
Next, we open the file and read it into a string.
open_file = open('d2016.bin', 'r')
file_to_string = open_file.read()
Next we locate individual words in the file and separate out words.
words = re.findall(r'(\b[A-Za-z][a-z]{2,9}\b)', file_to_string)
Finally we prepare a dict with unique words as keys and word counts as values.
frequency = {}
for word in words:
    count = frequency.get(word, 0)
    frequency[word] = count + 1
Build zipf distribution data
For speed, we limit the data to 1000 words.
n = 1000
frequency = dict(list(frequency.items())[:n])  # dict views are not sliceable in Python 3
After that we take the frequency values, convert them to a NumPy array, and plot their histogram together with the Zipf probability mass function, following the example in the numpy.random.zipf documentation.
The distribution parameter must be greater than 1; we use a = 2. as an example.
For visibility, we only plot values below 50.
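a = 2.  # distribution parameter, must be > 1
s = np.array(list(frequency.values()))
count, bins, ignored = plt.hist(s[s<50], 50, density=True)  # density replaces the removed normed argument
x = np.arange(1., 50.)
y = x**(-a) / special.zeta(a)  # Zipf pmf; zetac(a) would be zeta(a) - 1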
And finally plot the data.
Putting All Together
import re
import matplotlib.pyplot as plt
from scipy import special
import numpy as np
# Get our corpus of medical words
frequency = {}
open_file = open('d2016.bin', 'r')
file_to_string = open_file.read()
open_file.close()
words = re.findall(r'(\b[A-Za-z][a-z]{2,9}\b)', file_to_string)
# build dict of words based on frequency
for word in words:
    count = frequency.get(word, 0)
    frequency[word] = count + 1
# limit words to 1000 (dict views are not sliceable in Python 3)
n = 1000
frequency = dict(list(frequency.items())[:n])
# convert value of frequency to numpy array
s = np.array(list(frequency.values()))
# Calculate zipf and plot the data
a = 2.  # distribution parameter
count, bins, ignored = plt.hist(s[s<50], 50, density=True)  # density replaces the removed normed argument
x = np.arange(1., 50.)
y = x**(-a) / special.zeta(a)  # Zipf pmf; zetac(a) would be zeta(a) - 1
plt.plot(x, y/max(y), linewidth=2, color='r')
plt.show()
Plot
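On the question of how to calculate the data for the graph: one common reading of a Zipf plot is rank against frequency. The sketch below shows that calculation with the standard library only. The function name rank_frequency and the sample sentence are made up for illustration, and this is a different view from the histogram used in the answer above.

```python
import re
from collections import Counter

def rank_frequency(text):
    """Return (rank, word, count) rows sorted by descending count."""
    words = re.findall(r"[a-z]+", text.lower())
    ordered = Counter(words).most_common()
    return [(rank, word, count)
            for rank, (word, count) in enumerate(ordered, start=1)]

# Hypothetical sample text; a real corpus would be read from a file
sample = "the cat and the dog and the bird"
rows = rank_frequency(sample)
for rank, word, count in rows:
    # Under Zipf's law, rank * count stays roughly constant across rows
    print(rank, word, count, rank * count)
```

Plotting the rank column against the count column on log-log axes gives the usual Zipf graph, which is close to a straight line when the law holds.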
I have a file (let's say corpus.txt) of around 700 lines, each line containing numbers separated by -. For example:
86-55-267-99-121-72-336-89-211
59-127-245-343-75-245-245
First I need to read the data from the file, find the frequency of each number, measure the Zipf distribution of these numbers, and then plot the distribution. I have done the first two parts of the task. I am stuck on drawing the Zipf distribution.
I know that numpy.random.zipf(a, size=None) should be used for this, but I am finding it extremely hard to use. Any pointers or code snippets would be extremely helpful.
Code:
# Counts the frequency of each number in the file
def calculateFrequency(fileDir):
    frequency = {}
    for line in fileDir:
        line = line.strip().split('-')
        for i in line:
            frequency.setdefault(i, 0)
            frequency[i] += 1
    return frequency
fileDir = open("corpus.txt")
frequency = calculateFrequency(fileDir)
fileDir.close()
print(frequency)
## TODO: Measure and draw zipf distribution
As stated, numpy.random.zipf(a, size=None) draws samples from a Zipf distribution with the specified parameter a > 1; it does not plot anything by itself.
However, since your question was about the difficulty of using numpy.random.zipf, here is a naive attempt based on the example discussed on the SciPy documentation site for zipf.
Below is a simulated corpus.txt with 10 lines of 10 random numbers each. Numbers may repeat within and across lines, to simulate recurrence.
16-45-3-21-16-34-30-45-5-28
11-40-22-10-40-48-22-23-22-6
40-5-33-31-46-42-47-5-27-14
5-38-12-22-19-1-11-35-40-24
20-11-24-10-9-24-20-50-21-4
1-25-22-13-32-14-1-21-19-2
25-36-18-4-28-13-29-14-13-13
37-6-36-50-21-17-3-32-47-28
31-20-8-1-13-24-24-16-33-47
26-17-39-16-2-6-15-6-40-46
Working Code
import csv
import matplotlib.pyplot as plt
from scipy import special
import numpy as np
# Read '-' separated corpus data and get its frequency in a dict
frequency = {}
with open('corpus.txt', newline='') as csvfile:  # text mode for the csv module in Python 3
    reader = csv.reader(csvfile, delimiter='-', quotechar='|')
    for line in reader:
        for word in line:
            count = frequency.get(word, 0)
            frequency[word] = count + 1
# define zipf distribution parameter
a = 2.
# get list of values from frequency and convert to numpy array
s = np.array(list(frequency.values()))
# Display the histogram of the samples, along with the probability mass function:
count, bins, ignored = plt.hist(s, 50, density=True)  # density replaces the removed normed argument
x = np.arange(1., 50.)
y = x**(-a) / special.zeta(a)  # Zipf pmf; zetac(a) would be zeta(a) - 1
plt.plot(x, y/max(y), linewidth=2, color='r')
plt.show()
Plot of histogram of the samples, along with the probability density function
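To make clearer what the zipf sampler itself does, here is a small sketch that draws samples and compares their empirical shares with the theoretical Zipf probabilities. It assumes NumPy's Generator API (np.random.default_rng); the seed, the sample size, and the variable names are arbitrary choices for illustration.

```python
import numpy as np
from collections import Counter

a = 2.0  # distribution parameter, must be > 1
rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable
samples = rng.zipf(a, size=1000)  # integers >= 1

# Empirical share of each value among the samples
counts = Counter(samples.tolist())
empirical = {k: counts[k] / len(samples) for k in range(1, 6)}

# Theoretical Zipf pmf for the same values: k**(-a) / zeta(a),
# using the closed form zeta(2) = pi**2 / 6 for a = 2
zeta_2 = np.pi ** 2 / 6
theoretical = {k: k ** (-a) / zeta_2 for k in range(1, 6)}

for k in range(1, 6):
    print(k, round(empirical[k], 3), round(theoretical[k], 3))
```

The sampler only generates synthetic data. To check whether your own counts look Zipf-like, compare them against the theoretical curve as the answer does, rather than sampling.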
Gensim has a tutorial saying how to, given a document/query string, say what other documents are most similar to it, in descending order:
http://radimrehurek.com/gensim/tut3.html
It can also display the topics associated with an entire model:
How to print the LDA topics models from gensim? Python
But how do you find what topics are associated with a given document/query string? Ideally with some numeric similarity metric for each topic? I haven't been able to find anything on that.
If you want to find the topic distribution of unseen documents, you need to convert the document of interest into a bag-of-words representation:
from gensim import utils, models
from gensim.corpora import Dictionary
lda = models.LdaModel.load('saved_lda.model') # load saved model
dictionary = Dictionary.load('saved_dictionary.dict') # load saved dict
text = ' '
with open('document', 'r') as inp:  # convert file to string
    for line in inp:
        text += line + ' '
tkn_doc = utils.simple_preprocess(text) # filter & tokenize words
doc_bow = dictionary.doc2bow(tkn_doc) # use dictionary to create bow
doc_vec = lda[doc_bow] # this is the topic probability distribution for the document of interest
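From this code you get a sparse vector: a list of (topic_id, probability) pairs, where the topic ids run from 0 to n-1 and each probability is the weight of that topic for the words in the document. Topics whose probability falls below the model's minimum_probability threshold are omitted.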
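Because topics can be missing from the sparse result, it is often convenient to expand it to a fixed-length list before plotting or comparing documents. This is a minimal sketch; the helper to_dense and the example doc_vec values are made up for illustration and are not gensim output.

```python
def to_dense(doc_vec, num_topics):
    """Expand sparse (topic_id, probability) pairs into a fixed-length list."""
    dense = [0.0] * num_topics
    for topic_id, prob in doc_vec:
        dense[topic_id] = prob
    return dense

# Hypothetical result of lda[doc_bow] for a model with 5 topics
doc_vec = [(0, 0.62), (3, 0.35)]
dense = to_dense(doc_vec, num_topics=5)
print(dense)
```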
You can visualize the distribution by creating a bar graph using matplotlib.
import numpy as np
import matplotlib.pyplot as plt
output_path = 'topic_distribution.png'  # placeholder; choose your own path
y_axis = []
x_axis = []
for topic_id, prob in doc_vec:  # doc_vec holds (topic_id, probability) pairs
    x_axis.append(topic_id + 1)
    y_axis.append(prob)
width = 1
plt.bar(x_axis, y_axis, width, align='center', color='r')
plt.xlabel('Topics')
plt.ylabel('Probability')
plt.title('Topic Distribution for doc')
plt.xticks(np.arange(2, max(x_axis) + 1, 2), rotation='vertical', fontsize=7)
plt.subplots_adjust(bottom=0.2)
plt.ylim([0, np.max(y_axis) + .01])
plt.xlim([0, max(x_axis) + 1])  # topic ids may be non-contiguous, so use the largest id
plt.savefig(output_path)
plt.close()
If you want to see the top-n terms in each topic, you can print them, for example with the model's show_topic method. Referencing the graph, you can then look up the printed terms and determine how the document was interpreted by the model.
You can also find the distance between two document probability distribution vectors by using measures such as Hellinger distance, Euclidean distance, Jensen-Shannon distance, etc.
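As an illustration of two of these measures, here is a plain-Python sketch that works on dense probability lists of equal length. The function names and the example distributions are made up for illustration; gensim and SciPy offer their own implementations, which are not used here.

```python
import math

def hellinger(p, q):
    """Hellinger distance between two discrete distributions (0 = identical, 1 = disjoint)."""
    total = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(total / 2)

def jensen_shannon(p, q):
    """Jensen-Shannon distance: square root of the divergence, log base 2."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing to the divergence
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    # max() guards against a tiny negative value from rounding
    return math.sqrt(max(0.0, (kl(p, m) + kl(q, m)) / 2))

# Hypothetical topic distributions for two documents over three topics
doc_a = [0.7, 0.2, 0.1]
doc_b = [0.1, 0.2, 0.7]
print(hellinger(doc_a, doc_b), jensen_shannon(doc_a, doc_b))
```

Both measures are symmetric and bounded between 0 and 1 in this form, which makes them easy to compare across document pairs.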
I've generated a 100D word2vec model using my domain text corpus, merging common phrases, for example (good bye => good_bye). Then I've extracted 1000 vectors of desired words.
So I have a NumPy array of 1000 vectors, like so:
[[-0.050378,0.855622,1.107467,0.456601,...[100 dimensions],
[-0.040378,0.755622,1.107467,0.456601,...[100 dimensions],
...
...[1000 Vectors]
]
And words array like so:
["hello","hi","bye","good_bye"...1000]
I ran K-Means on my data, and the results I got made sense:
import numpy as np
from sklearn.cluster import KMeans
X = np.array(words_vectors)
kmeans = KMeans(n_clusters=20, random_state=0).fit(X)
for idx, l in enumerate(kmeans.labels_):
    print(l, words[idx])
--- Output ---
0 hello
0 hi
1 bye
1 good_bye
0 = greeting 1 = farewell
However, some words made me think that hierarchical clustering is more suitable for the task. I've tried using AgglomerativeClustering. Unfortunately, for this Python newbie, things got complicated and I got lost.
How can I cluster my vectors, so the output would be a dendrogram, more or less, like the one found on this wiki page?
I had the same problem until now!
I kept finding your post whenever I searched online (keywords: hierarchy clustering on word2vec),
so I wanted to give you a possibly valid solution.
sentences = ['hi', 'hello', 'hi hello', 'goodbye', 'bye', 'goodbye bye']
sentences_split = [s.lower().split(' ') for s in sentences]
import gensim
model = gensim.models.Word2Vec(sentences_split, min_count=2)
from matplotlib import pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
# gensim >= 4 names; older versions use model.wv.syn0 and model.wv.index2word
l = linkage(model.wv.vectors, method='complete', metric='seuclidean')
# calculate full dendrogram
plt.figure(figsize=(25, 10))
plt.title('Hierarchical Clustering Dendrogram')
plt.ylabel('word')
plt.xlabel('distance')
dendrogram(
    l,
    leaf_rotation=90.,  # rotates the leaf labels
    leaf_font_size=16.,  # font size for the leaf labels
    orientation='left',
    leaf_label_func=lambda v: str(model.wv.index_to_key[v])
)
plt.show()
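If the vectors have already been extracted into an array, as in the question, the word2vec model is not needed for the clustering step. The sketch below assumes a words list and a matching words_vectors array; the tiny 3-dimensional values are made up so the example stays small, and Ward linkage is just one reasonable choice of method.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

# Hypothetical stand-ins for the 1000 x 100 array and word list in the question
words = ["hello", "hi", "bye", "good_bye"]
words_vectors = np.array([
    [1.0, 0.9, 0.0],
    [0.9, 1.0, 0.1],
    [0.0, 0.1, 1.0],
    [0.1, 0.0, 0.9],
])

# Ward linkage on Euclidean distances builds the merge tree bottom-up
Z = linkage(words_vectors, method="ward")

# no_plot=True returns the layout without drawing; drop it and call
# matplotlib's plt.show() to render the dendrogram
tree = dendrogram(Z, labels=words, orientation="left", no_plot=True)
print(tree["ivl"])  # leaf labels in dendrogram order

# Cutting the tree into two flat clusters gives K-Means-style labels
flat = fcluster(Z, t=2, criterion="maxclust")
for label, word in zip(flat, words):
    print(label, word)
```

Passing labels=words puts the words on the leaves directly, so no leaf_label_func is needed.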
I have a list of paragraphs, and I want to plot a Zipf distribution of their combined text.
My code is below:
from itertools import *
from pylab import *
from collections import Counter
import matplotlib.pyplot as plt
paragraphs = " ".join(targeted_paragraphs)
# count tokens over the combined text (iterating the joined string would yield single characters)
frequency = Counter(paragraphs.split())
counts = array(list(frequency.values()))
tokens = list(frequency.keys())
ranks = arange(1, len(counts)+1)
indices = argsort(-counts)
frequencies = counts[indices]
loglog(ranks, frequencies, marker=".")
title("Zipf plot for Combined Article Paragraphs")
xlabel("Frequency Rank of Token")
ylabel("Absolute Frequency of Token")
grid(True)
for n in list(logspace(-0.5, log10(len(counts)-1), 20).astype(int)):
    dummy = text(ranks[n], frequencies[n], " " + tokens[indices[n]],
                 verticalalignment="bottom",
                 horizontalalignment="left")
Purpose: I am trying to draw a fitted line on this graph and assign its value to a variable, but I do not know how to add it. Any help with both of these issues would be much appreciated.
I know it's been a while since this question was asked. However, I came across a possible solution to this problem on the SciPy site.
I thought I would post it here in case anyone else needs it.
I didn't have the paragraph data, so here is a whipped-up dict called frequency that has simulated occurrence counts as its values.
We then get its values and convert them to a NumPy array, and define the Zipf distribution parameter, which has to be > 1.
Finally, we display the histogram of the samples along with the probability mass function.
Working Code:
import random
import matplotlib.pyplot as plt
from scipy import special
import numpy as np
# Generate sample dict with random values to simulate paragraph data
frequency = {}
for i in range(50):
    frequency[i] = random.randint(1, 50)
counts = frequency.values()
tokens = frequency.keys()
# Convert counts of values to numpy array
s = np.array(list(counts))
# define zipf distribution parameter. Has to be >1
a = 2.
# Display the histogram of the samples,
# along with the probability mass function
count, bins, ignored = plt.hist(s, 50, density=True)  # density replaces the removed normed argument
plt.title("Zipf plot for Combined Article Paragraphs")
x = np.arange(1., 50.)
plt.xlabel("Frequency Rank of Token")
y = x**(-a) / special.zeta(a)  # Zipf pmf; zetac(a) would be zeta(a) - 1
plt.ylabel("Absolute Frequency of Token")
plt.plot(x, y/max(y), linewidth=2, color='r')
plt.show()
Plot
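For the fitted line asked about in the question, one simple option is a least-squares straight line in log-log space. That is a different technique from the histogram overlay above, and only a rough estimate of the exponent. The sketch below uses made-up counts that follow an exact power law, so the fitted values are easy to check; ranks and frequencies stand in for the arrays from the question's code.

```python
import numpy as np

# Hypothetical counts that follow an exact power law, count = 1000 / rank
ranks = np.arange(1, 51)
frequencies = 1000.0 / ranks

# Fit log10(frequency) = slope * log10(rank) + intercept
slope, intercept = np.polyfit(np.log10(ranks), np.log10(frequencies), 1)

# Values of the fitted line at each rank, ready to plot on log-log axes
fitted = 10 ** (intercept + slope * np.log10(ranks))
print(slope, intercept)
```

The slope and intercept are the values to keep in variables. Drawing fitted against ranks on the same log-log axes as the data adds the line to the plot.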