Zipf Distribution: How do I measure the Zipf distribution - python

How do I measure or find the Zipf distribution? For example, I have a corpus of English words. How do I find the Zipf distribution? I need to find the Zipf distribution and then plot a graph of it, but I am stuck on the first step, which is finding the Zipf distribution.
Edit: From the frequency count of each word, it is clear that it obeys Zipf's law. But my aim is to plot a Zipf distribution graph, and I have no idea how to calculate the data for the distribution graph.
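One common way to get from word counts to a Zipf graph is a rank-frequency plot on log-log axes. A minimal sketch, assuming a dict called frequency that maps each word to its count:
import numpy as np
import matplotlib.pyplot as plt

counts = np.array(sorted(frequency.values(), reverse=True))
ranks = np.arange(1, len(counts) + 1)  # rank 1 = most frequent word
plt.loglog(ranks, counts, marker='.')
plt.xlabel('Rank')
plt.ylabel('Frequency')
plt.show()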

I don't pretend to understand statistics. However, based on reading the scipy site, here is a naive attempt in Python.
Build Data
First we get our data. For example, we download the National Library of Medicine MeSH (Medical Subject Headings) ASCII file d2016.bin (28 MB).
Next, we open the file and read it into a string.
with open('d2016.bin', 'r') as open_file:
    file_to_string = open_file.read()
Next, we use a regular expression to extract individual words from the string.
words = re.findall(r'(\b[A-Za-z][a-z]{2,9}\b)', file_to_string)  # words of 3 to 10 letters
Finally, we build a dict with each unique word as key and its count as value.
frequency = {}
for word in words:
    count = frequency.get(word, 0)
    frequency[word] = count + 1
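The same counting can be done in one line with collections.Counter, which is a dict subclass and works as a drop-in replacement here:
from collections import Counter

frequency = Counter(words)  # maps each word to its count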
Build Zipf distribution data
For speed, we limit the data to 1000 words.
n = 1000
frequency = dict(list(frequency.items())[:n])  # items() is not sliceable in Python 3; convert to a list first
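Note that this slice keeps the first 1000 distinct words encountered, not the most frequent ones. If the intent is the 1000 most common words, sort by count first; a small sketch using itemgetter:
from operator import itemgetter

# keep the 1000 most frequent words instead of the first 1000 seen
frequency = dict(sorted(frequency.items(), key=itemgetter(1), reverse=True)[:n])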
After that, we take the count values, convert them to a numpy array, and compare them against the Zipf probability density function.
We use distribution parameter a = 2. as a sample value, since it must be greater than 1.
For visibility, we limit the plot to 50 sample points.
a = 2.  # distribution parameter, must be > 1
s = np.array(list(frequency.values()))
count, bins, ignored = plt.hist(s[s < 50], 50, density=True)  # normed= is deprecated; use density=
x = np.arange(1., 50.)
y = x**(-a) / special.zetac(a)
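The histogram above compares the observed counts against the Zipf density. To actually draw random samples with numpy.random.zipf, as mentioned above, a minimal sketch:
import numpy as np

a = 2.  # distribution parameter, must be > 1
samples = np.random.zipf(a, size=1000)  # 1000 integer samples, all >= 1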
And finally plot the data.
Putting It All Together
import re
from operator import itemgetter
import matplotlib.pyplot as plt
from scipy import special
import numpy as np
#Get our corpus of medical words
frequency = {}
with open('d2016.bin', 'r') as open_file:
    file_to_string = open_file.read()
words = re.findall(r'(\b[A-Za-z][a-z]{2,9}\b)', file_to_string)
#build dict of word frequencies
for word in words:
    count = frequency.get(word, 0)
    frequency[word] = count + 1
#limit words to 1000
n = 1000
frequency = dict(list(frequency.items())[:n])  # items() is not sliceable in Python 3
#convert the frequency values to a numpy array
s = np.array(list(frequency.values()))
#Calculate zipf and plot the data
a = 2. # distribution parameter, must be > 1
count, bins, ignored = plt.hist(s[s < 50], 50, density=True)
x = np.arange(1., 50.)
y = x**(-a) / special.zetac(a)
plt.plot(x, y/max(y), linewidth=2, color='r')
plt.show()
Plot


How do I generate 1000 data points?

I am a bit confused, since I am still learning Python.
My question is: how can I generate 1000 data points for a noisy S-curve and then save them to a .txt file?
You may consider using the random module to generate a large list of random values:
import random

COUNT = 1000  # number of data points
UPPER_BOUND = 100  # upper end of the domain (randint includes both bounds)
LOWER_BOUND = 0

data_points = []
for _ in range(COUNT):
    data_points.append(random.randint(LOWER_BOUND, UPPER_BOUND))
To save this to a text file, open the file in "w" mode and write the values as text:
with open("filename.txt", "w") as f:
    f.write("\n".join(str(point) for point in data_points))  # write() expects a string, so join the values, one per line
The with statement removes the need to call close() on the file after it is used.
You can use scipy.stats.logistic for the "S-shaped" curve and numpy.random.uniform for the noise:
import numpy as np
from scipy.stats import logistic
N = 1000
x = np.linspace(-10,10, num=N)
noise = np.random.uniform(0, 0.1, size=N)
points = logistic.cdf(x)+noise
np.savetxt('points.txt', points)
content of points.txt (first lines):
5.163273718724530059e-02
2.404908177729772611e-02
7.221953948290879555e-02
3.023476195714707923e-02
4.972362503720893084e-02
8.986980537557204274e-02
9.878733026764449643e-02
9.584209234526251675e-02
7.709992266714442433e-02
1.367468690439026940e-02
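If the saved points need to be read back later, numpy.loadtxt is the counterpart of savetxt:
import numpy as np

points = np.loadtxt('points.txt')  # 1-D array of the saved values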
What the data looks like:
import matplotlib.pyplot as plt
plt.plot(x, points)
plt.show()

NLTK nltk.ConditionalFreqDist - Plot ngrams

Here are two examples: one that works, derived from https://www.nltk.org/book/ch02.html,
and another that does not. The first example plots single-word frequencies, here ['america', 'citizen']. The second is a modified version (evidently incorrect) that attempts to plot frequencies of the bigram ['america citizen']. I would like to plot ngram frequencies, such as for a bigram like ['america citizen'].
Plot Example 1
Plot Example 2 - failed
import nltk
from nltk.book import *
import matplotlib.pyplot as plt
from nltk.corpus import inaugural
inaugural.fileids()
plt.ion() # turns interactive mode on
[fileid[:4] for fileid in inaugural.fileids()]
############- this works ####
cfd = nltk.ConditionalFreqDist(
    (target, fileid[:4])
    for fileid in inaugural.fileids()
    for w in inaugural.words(fileid)
    for target in ['america', 'citizen']
    if w.lower().startswith(target))
ax = plt.axes()
cfd.plot()
############- this does not work ####
cfd = nltk.ConditionalFreqDist(
    (target, fileid[:4])
    for fileid in inaugural.fileids()
    for w in inaugural.words(fileid)
    for target in ['america citizen']
    if w.lower().startswith(target))
ax = plt.axes()
cfd.plot()
It seems to me that you are looking for 'american citizen', a collocation made of two words, among single words. This is bound to fail. You would have to check for such a bigram among pairs of consecutive words, and for that you need to zip the list of words with a copy of itself shifted by one word.
The key difference in your code (you can add more collocations, as pairs of words, to the list in the last for):
def zip2(lst):
    ilst = iter(lst)
    _ = next(ilst)  # drop the first element
    return zip(lst, ilst)
cfd = nltk.ConditionalFreqDist(
    (t1 + ' ' + t2, fileid[:4])
    for fileid in inaugural.fileids()
    for w1, w2 in zip2(inaugural.words(fileid))
    for t1, t2 in [('american', 'citizen',)]
    if w1.lower().startswith(t1) and w2.lower().startswith(t2)
)
ax = plt.axes()
cfd.plot()
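As a side note, NLTK ships a helper that yields the same consecutive pairs as the hand-rolled zip2, so the following sketch using nltk.bigrams should be equivalent:
import nltk
from nltk.corpus import inaugural

cfd = nltk.ConditionalFreqDist(
    (t1 + ' ' + t2, fileid[:4])
    for fileid in inaugural.fileids()
    for w1, w2 in nltk.bigrams(inaugural.words(fileid))
    for t1, t2 in [('american', 'citizen')]
    if w1.lower().startswith(t1) and w2.lower().startswith(t2)
)
cfd.plot()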

Error plotting a frequency histogram from CSV data

I am working with a CSV file using the pandas module in Python 3. The CSV file consists of 5 columns: job title, company name, job description, number of reviews, and job location. I want to plot a frequency histogram where I pick only the jobs containing the words "mechanical engineer" and find the frequencies of the 5 most frequent locations for the "mechanical engineer" job.
So, I defined a variable engloc which stores all the "mechanical engineer" jobs.
engloc=df[df.position.str.contains('mechanical engineer|mechanical engineering', flags=re.IGNORECASE, regex=True)].location
and made a histogram plot with matplotlib using code I found online:
x = np.random.normal(size = 1000)
plt.hist(engloc, bins=50)
plt.gca().set(title='Frequency Histogram ', ylabel='Frequency');
but it came out like this
How can I plot a proper frequency histogram that uses only the 5 most frequent locations for jobs containing the words "mechanical engineer", instead of putting all of the locations on the graph?
This is a sample from the CSV file.
Something along the following lines should help you with numerical data:
import numpy as np
import matplotlib.pyplot as plt

counts_, bins_ = np.histogram(engloc.values)
# keep only the bins whose count is at least 5
mask = counts_ >= 5
plt.hist(bins_[:-1][mask], bins_, weights=counts_[mask])
For string data, try:
from collections import Counter

coords, counts = list(zip(*Counter(engloc.values).most_common(5)))
plt.bar(coords, counts)
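Since engloc is already a pandas Series, Series.value_counts gives the five most frequent locations directly; a sketch assuming the engloc variable from the question:
import matplotlib.pyplot as plt

# count each location and keep the five most frequent
top5 = engloc.value_counts().head(5)
top5.plot(kind='bar')
plt.ylabel('Frequency')
plt.show()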

Zipf Distribution: How do I measure the Zipf distribution using Python / NumPy

I have a file (let's say corpus.txt) of around 700 lines, each line containing numbers separated by -. For example:
86-55-267-99-121-72-336-89-211
59-127-245-343-75-245-245
First I need to read the data from the file, find the frequency of each number, measure the Zipf distribution of these numbers, and then plot the distribution. I have done the first two parts of the task, but I am stuck on drawing the Zipf distribution.
I know that numpy.random.zipf(a, size=None) should be used for this, but I am finding it extremely hard to use. Any pointers or code snippets would be extremely helpful.
Code:
# Count the frequency of each number in the file
def calculateFrequency(fileDir):
    frequency = {}
    for line in fileDir:
        line = line.strip().split('-')
        for i in line:
            frequency.setdefault(i, 0)
            frequency[i] += 1
    return frequency
fileDir = open("corpus.txt")
frequency = calculateFrequency(fileDir)
fileDir.close()
print(frequency)
## TODO: Measure and draw zipf distribution
As stated, numpy.random.zipf(a, size=None) draws samples from a Zipf distribution with the specified parameter a > 1.
However, since your question was about the difficulty of using the numpy.random.zipf method, here is a naive attempt, as discussed on the scipy zipf documentation site.
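For reference, scipy also exposes the distribution itself as scipy.stats.zipf, so the probability mass function can be evaluated without the manual zeta arithmetic used below; a minimal sketch:
import numpy as np
from scipy.stats import zipf

a = 2.  # distribution parameter, must be > 1
k = np.arange(1, 50)
pmf = zipf.pmf(k, a)  # P(X = k) for k = 1 .. 49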
Below is a simulated corpus.txt with 10 random numbers per line. Some values recur across lines to simulate the recurrence found in real data.
16-45-3-21-16-34-30-45-5-28
11-40-22-10-40-48-22-23-22-6
40-5-33-31-46-42-47-5-27-14
5-38-12-22-19-1-11-35-40-24
20-11-24-10-9-24-20-50-21-4
1-25-22-13-32-14-1-21-19-2
25-36-18-4-28-13-29-14-13-13
37-6-36-50-21-17-3-32-47-28
31-20-8-1-13-24-24-16-33-47
26-17-39-16-2-6-15-6-40-46
Working Code
import csv
from operator import itemgetter
import matplotlib.pyplot as plt
from scipy import special
import numpy as np
#Read '-' separated corpus data and count the frequency of each number in a dict
frequency = {}
with open('corpus.txt', 'r', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter='-', quotechar='|')
    for line in reader:
        for word in line:
            count = frequency.get(word, 0)
            frequency[word] = count + 1
#define zipf distribution parameter
a = 2.
#get the list of counts from frequency and convert to a numpy array
s = np.array(list(frequency.values()))
# Display the histogram of the samples, along with the probability density function:
count, bins, ignored = plt.hist(s, 50, density=True)  # normed= is deprecated; use density=
x = np.arange(1., 50.)
y = x**(-a) / special.zetac(a)
plt.plot(x, y/max(y), linewidth=2, color='r')
plt.show()
Plot of histogram of the samples, along with the probability density function

Constructing Zipf Distribution with matplotlib, FITTED-LINE

I have a list of paragraphs, and I want to plot a Zipf distribution for their combined text.
My code is below:
from itertools import *
from pylab import *
from collections import Counter
import matplotlib.pyplot as plt
paragraphs = " ".join(targeted_paragraphs)
for paragraph in paragraphs:
frequency = Counter(paragraph.split())
counts = array(frequency.values())
tokens = frequency.keys()
ranks = arange(1, len(counts)+1)
indices = argsort(-counts)
frequencies = counts[indices]
loglog(ranks, frequencies, marker=".")
title("Zipf plot for Combined Article Paragraphs")
xlabel("Frequency Rank of Token")
ylabel("Absolute Frequency of Token")
grid(True)
for n in list(logspace(-0.5, log10(len(counts)-1), 20).astype(int)):
dummy = text(ranks[n], frequencies[n], " " + tokens[indices[n]],
verticalalignment="bottom",
horizontalalignment="left")
PURPOSE: I am attempting to draw a fitted line in this graph and assign its value to a variable. However, I do not know how to add that. Any help would be much appreciated for both of these issues.
I know it's been a while since this question was asked. However, I came across a possible solution for this problem on the scipy site.
I thought I would post it here in case anyone else needs it.
I didn't have the paragraph info, so here is a whipped-up dict called frequency that has paragraph occurrence counts as its values.
We then get its values, convert them to a numpy array, and define the Zipf distribution parameter, which has to be > 1.
Finally, we display the histogram of the samples, along with the probability density function.
Working Code:
import random
import matplotlib.pyplot as plt
from scipy import special
import numpy as np
#Generate sample dict with random value to simulate paragraph data
frequency = {}
for i in range(50):
    frequency[i] = random.randint(1, 50)
counts = frequency.values()
tokens = frequency.keys()
#Convert the counts to a numpy array
s = np.array(list(counts))
#define zipf distribution parameter. Has to be >1
a = 2.
# Display the histogram of the samples,
#along with the probability density function
count, bins, ignored = plt.hist(s, 50, density=True)  # normed= is deprecated; use density=
plt.title("Zipf plot for Combined Article Paragraphs")
x = np.arange(1., 50.)
plt.xlabel("Frequency Rank of Token")
y = x**(-a) / special.zetac(a)
plt.ylabel("Absolute Frequency of Token")
plt.plot(x, y/max(y), linewidth=2, color='r')
plt.show()
Plot
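As for the fitted line the question asks about: a common approach is a least-squares fit in log-log space with numpy.polyfit; a sketch, assuming the ranks and frequencies arrays from the question's code:
import numpy as np
import matplotlib.pyplot as plt

# fit log(frequency) = slope * log(rank) + intercept
slope, intercept = np.polyfit(np.log(ranks), np.log(frequencies), 1)
fitted = np.exp(intercept) * ranks ** slope  # back to linear scale

plt.loglog(ranks, frequencies, marker='.', linestyle='none', label='observed')
plt.loglog(ranks, fitted, color='r', label='fit, slope %.2f' % slope)
plt.legend()
plt.show()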
