Zipf Distribution: How do I measure Zipf Distribution using Python / Numpy - python

I have a file (let's say corpus.txt) of around 700 lines, each line containing numbers separated by -. For example:
86-55-267-99-121-72-336-89-211
59-127-245-343-75-245-245
First I need to read the data from the file, find the frequency of each number, measure the Zipf distribution of these numbers, and then plot the distribution. I have done the first two parts of the task. I am stuck on drawing the Zipf distribution.
I know that numpy.random.zipf(a, size=None) should be used for this. But I am finding it extremely hard to use it. Any pointers or code snippet would be extremely helpful.
Code:
# Counts frequency as per given n
def calculateFrequency(fileDir):
    frequency = {}
    for line in fileDir:
        line = line.strip().split('-')
        for i in line:
            frequency.setdefault(i, 0)
            frequency[i] += 1
    return frequency

fileDir = open("corpus.txt")
frequency = calculateFrequency(fileDir)
fileDir.close()
print(frequency)

## TODO: Measure and draw zipf distribution

As stated, numpy.random.zipf(a, size=None) draws samples from a Zipf distribution with the specified parameter a > 1.
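For reference, here is a minimal sketch (not from the original answer) showing numpy.random.zipf on its own; the parameter value 2. is just an illustrative choice:

import numpy as np

a = 2.                              # distribution parameter, must be > 1
samples = np.random.zipf(a, 1000)   # draw 1000 samples from a Zipf distribution
print(samples[:10])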
However, since your question was about difficulty in using the numpy.random.zipf method, here is a naive attempt along the lines of the scipy zipf documentation.
Below is a simulated corpus.txt with 10 random numbers per line. Values may repeat across lines to simulate recurrence.
16-45-3-21-16-34-30-45-5-28
11-40-22-10-40-48-22-23-22-6
40-5-33-31-46-42-47-5-27-14
5-38-12-22-19-1-11-35-40-24
20-11-24-10-9-24-20-50-21-4
1-25-22-13-32-14-1-21-19-2
25-36-18-4-28-13-29-14-13-13
37-6-36-50-21-17-3-32-47-28
31-20-8-1-13-24-24-16-33-47
26-17-39-16-2-6-15-6-40-46
Working Code
import csv
import matplotlib.pyplot as plt
from scipy import special
import numpy as np

# Read '-' separated corpus data and get its frequency in a dict
frequency = {}
with open('corpus.txt', 'r', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter='-', quotechar='|')
    for line in reader:
        for word in line:
            count = frequency.get(word, 0)
            frequency[word] = count + 1

# Define the zipf distribution parameter (must be > 1)
a = 2.

# Get the values from frequency and convert them to a numpy array
s = np.array(list(frequency.values()))

# Display the histogram of the samples, along with the probability density function
count, bins, ignored = plt.hist(s, 50, density=True)
x = np.arange(1., 50.)
y = x**(-a) / special.zetac(a)
plt.plot(x, y/max(y), linewidth=2, color='r')
plt.show()
Plot of histogram of the samples, along with the probability density function
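As an alternative view (a sketch, not part of the answer above), the frequencies can also be shown as a log-log rank-frequency plot, which is the classic way of eyeballing Zipf behaviour; this assumes the frequency dict built by the code above:

import numpy as np
import matplotlib.pyplot as plt

# Sort observed frequencies in descending order and rank them 1, 2, 3, ...
counts = np.array(sorted(frequency.values(), reverse=True))
ranks = np.arange(1, len(counts) + 1)

# A Zipf-like distribution is roughly a straight line on log-log axes
plt.loglog(ranks, counts, marker='.')
plt.xlabel("Rank")
plt.ylabel("Frequency")
plt.title("Rank-frequency plot")
plt.show()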

Related

How do I generate 1000 data points?

I am a bit confused since I am trying to learn Python.
My question is: how can I generate 1000 data points for a noisy S-curve and then save them to a .txt file?
You may consider using the random module to generate a large list of random values
import random

COUNT = 1000       # Number of data points
UPPER_BOUND = 100  # Upper bound of the domain (random.randint includes both endpoints)
LOWER_BOUND = 0

data_points = []
for _ in range(COUNT):
    data_points.append(random.randint(LOWER_BOUND, UPPER_BOUND))
To save this to a text file, use open() with mode "w" to write into a file:
with open("filename.txt", "w") as f:
    f.write("\n".join(str(p) for p in data_points))
The use of the with statement removes the need to call close() on the file after it is used.
You can use scipy.stats.logistic for the "S-shaped" curve and numpy.random.uniform for the noise:
import numpy as np
from scipy.stats import logistic
N = 1000
x = np.linspace(-10,10, num=N)
noise = np.random.uniform(0, 0.1, size=N)
points = logistic.cdf(x)+noise
np.savetxt('points.txt', points)
content of points.txt (first lines):
5.163273718724530059e-02
2.404908177729772611e-02
7.221953948290879555e-02
3.023476195714707923e-02
4.972362503720893084e-02
8.986980537557204274e-02
9.878733026764449643e-02
9.584209234526251675e-02
7.709992266714442433e-02
1.367468690439026940e-02
What the data looks like:
import matplotlib.pyplot as plt
plt.plot(x, points)

Zipf Distribution: How do I measure Zipf Distribution

How do I measure or find the Zipf distribution? For example, I have a corpus of English words. How do I find the Zipf distribution? I need to find the Zipf distribution and then plot a graph of it. But I am stuck on the first step, which is finding the Zipf distribution.
Edit: From the frequency count of each word, it is clear that it obeys Zipf's law. But my aim is to plot a Zipf distribution graph, and I have no idea how to calculate the data for the distribution graph.
I don't pretend to understand statistics. However, based on reading the scipy site, here is a naive attempt in Python.
Build Data
First we get our data. For example, we download the National Library of Medicine MeSH (Medical Subject Heading) ASCII file d2016.bin (28 MB).
Next, we open the file and read it into a string.
open_file = open('d2016.bin', 'r')
file_to_string = open_file.read()
Next we locate the individual words in the file with a regular expression.
words = re.findall(r'(\b[A-Za-z][a-z]{2,9}\b)', file_to_string)
Finally we prepare a dict with unique words as keys and word counts as values.
frequency = {}
for word in words:
    count = frequency.get(word, 0)
    frequency[word] = count + 1
Build zipf distribution data
For speed, we limit the data to 1000 words.
n = 1000
frequency = dict(list(frequency.items())[:n])
After that we take the frequency values, convert them to a numpy array, and compare their histogram against the theoretical Zipf probability density (computed with scipy.special.zetac).
The distribution parameter is set to a = 2. as an example; it needs to be greater than 1.
For visibility, we limit the plotted data to values below 50.
s = np.array(list(frequency.values()))
a = 2.
count, bins, ignored = plt.hist(s[s < 50], 50, density=True)
x = np.arange(1., 50.)
y = x**(-a) / special.zetac(a)
And finally plot the data.
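These are the same two plotting lines used in the complete listing below:

plt.plot(x, y/max(y), linewidth=2, color='r')
plt.show()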
Putting All Together
import re
import matplotlib.pyplot as plt
from scipy import special
import numpy as np

# Get our corpus of medical words
frequency = {}
open_file = open('d2016.bin', 'r')
file_to_string = open_file.read()
words = re.findall(r'(\b[A-Za-z][a-z]{2,9}\b)', file_to_string)

# Build a dict of word frequencies
for word in words:
    count = frequency.get(word, 0)
    frequency[word] = count + 1

# Limit the data to 1000 words
n = 1000
frequency = dict(list(frequency.items())[:n])

# Convert the values of frequency to a numpy array
s = np.array(list(frequency.values()))

# Calculate zipf and plot the data
a = 2.  # distribution parameter
count, bins, ignored = plt.hist(s[s < 50], 50, density=True)
x = np.arange(1., 50.)
y = x**(-a) / special.zetac(a)
plt.plot(x, y/max(y), linewidth=2, color='r')
plt.show()
Plot

Fitting a single gaussian to 'noisy' data yields a poor fit in some cases

I have some noisy data that can contain between 0 and n Gaussian shapes. I am trying to implement an algorithm that takes the highest data points and fits a Gaussian to them, as per the following 'scheme':
New attempt, steps:
1. Fit a spline through all data points
2. Get the first derivative of the spline function
3. Get both data points (left/right) where f'(x) = 0 around the data point with maximum intensity
4. Fit a Gaussian through the data points returned from step 3
4a. Plot the Gaussian (stopping at the baseline) in the PDF
5. Calculate the area under the Gaussian curve
6. Calculate the area under the raw data points
7. Calculate the percentage of the total area explained by the Gaussian area
I have implemented this concept using the following code (minimal working example):
#! /usr/bin/env python
from scipy.interpolate import InterpolatedUnivariateSpline
from scipy.optimize import curve_fit
from scipy.signal import argrelextrema
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
data = [(9.60380153195,187214),(9.62028167623,181023),(9.63676350256,174588),(9.65324602212,169389),(9.66972824591,166921),(9.68621215187,167597),(9.70269675106,170838),(9.71918105436,175816),(9.73566703995,181552),(9.75215371878,186978),(9.76864010158,191718),(9.78512816681,194473),(9.80161692526,194169),(9.81810538757,191203),(9.83459553243,186603),(9.85108637051,180273),(9.86757691233,171996),(9.88406913682,163653),(9.90056205454,156032),(9.91705467586,149928),(9.93354897998,145410),(9.95004397733,141818),(9.96653867816,139042),(9.98303506191,137546),(9.99953213889,138724)]
data2 = [(9.60476933166,163571),(9.62125990879,156662),(9.63775225872,150535),(9.65424539203,146960),(9.67073831905,146794),(9.68723301904,149326),(9.70372850238,152616),(9.72022377931,155420),(9.73672082933,156151),(9.75321866271,154633),(9.76971628954,151549),(9.78621568961,148298),(9.80271587303,146333),(9.81921584976,146734),(9.83571759987,150351),(9.85222013334,156612),(9.86872245996,164192),(9.88522656011,171199),(9.90173144362,175697),(9.91823612015,176867),(9.93474257034,175029),(9.95124980389,171762),(9.96775683032,168449),(9.98426563055,165026)]
def gaussFunction(x, *p):
    """ TODO
    """
    A, mu, sigma = p
    return A*np.exp(-(x-mu)**2/(2.*sigma**2))

def quantify(data):
    """ TODO
    """
    backGround = 105000  # Normally this is dynamically determined but this value is fine for testing on the provided data
    time, intensity = zip(*data)
    x_data = np.array(time)
    y_data = np.array(intensity)
    newX = np.linspace(x_data[0], x_data[-1], 2500*(x_data[-1]-x_data[0]))
    f = InterpolatedUnivariateSpline(x_data, y_data)
    fPrime = f.derivative()
    newY = f(newX)
    newPrimeY = fPrime(newX)
    maxm = argrelextrema(newPrimeY, np.greater)
    minm = argrelextrema(newPrimeY, np.less)
    breaks = maxm[0].tolist() + minm[0].tolist()
    maxPoint = 0
    for index, j in enumerate(breaks):
        try:
            if max(newY[breaks[index]:breaks[index+1]]) > maxPoint:
                maxPoint = max(newY[breaks[index]:breaks[index+1]])
                xData = newX[breaks[index]:breaks[index+1]]
                yData = [x - backGround for x in newY[breaks[index]:breaks[index+1]]]
        except:
            pass
    # Gaussian fit on main points
    newGaussX = np.linspace(x_data[0], x_data[-1], 2500*(x_data[-1]-x_data[0]))
    p0 = [np.max(yData), xData[np.argmax(yData)], 0.1]
    try:
        coeff, var_matrix = curve_fit(gaussFunction, xData, yData, p0)
        newGaussY = gaussFunction(newGaussX, *coeff)
        newGaussY = [x + backGround for x in newGaussY]
        # Generate plot for visual confirmation
        fig = plt.figure()
        ax = fig.add_subplot(111)
        plt.plot(x_data, y_data, 'b*')
        plt.plot((newX[0], newX[-1]), (backGround, backGround), 'red')
        plt.plot(newX, newY, color='blue', linestyle='dashed')
        plt.plot(newGaussX, newGaussY, color='green', linestyle='dashed')
        plt.title("Test")
        plt.xlabel("rt [m]")
        plt.ylabel("intensity [au]")
        plt.savefig("Test.pdf", bbox_inches="tight")
        plt.close(fig)
    except:
        pass

# Call the test
#quantify(data)
quantify(data2)
where normally the background (red line in the pictures below) is dynamically determined, but for the sake of this example I have set it to a fixed number. The problem I have is that for some data it works really well:
Corresponding f'(x):
However, for some other data it fails horrendously:
Corresponding f'(x):
Therefore, I would like to hear some suggestions or ideas on why this happens and on potential approaches to fix it. I have included the data that is shown in the picture below (in case anyone wants to try it):
The error lay in the following bit:
breaks = maxm[0].tolist() + minm[0].tolist()
for index,j in enumerate(breaks):
The breaks list now contains both the maxima and minima, but they are not sorted by time, so for the poor fit it yields the data points 9.78, 9.62 and 9.86.
The program would then examine data from 9.78 to 9.62 and from 9.62 to 9.86, which meant that 9.62 to 9.86 contained the highest-intensity data point, yielding the fit shown in the second graph.
The fix was rather simple: just sort the breaks before looping over them, as follows:
breaks = maxm[0].tolist() + minm[0].tolist()
breaks = sorted(breaks)
for index,j in enumerate(breaks):
The program then yielded a fit more closely resembling what I would expect:

Constructing Zipf Distribution with matplotlib, FITTED-LINE

I have a list of paragraphs, and I want to run a Zipf distribution on their combination.
My code is below:
from itertools import *
from pylab import *
from collections import Counter
import matplotlib.pyplot as plt

paragraphs = " ".join(targeted_paragraphs)
for paragraph in paragraphs:
    frequency = Counter(paragraph.split())
counts = array(frequency.values())
tokens = frequency.keys()
ranks = arange(1, len(counts)+1)
indices = argsort(-counts)
frequencies = counts[indices]
loglog(ranks, frequencies, marker=".")
title("Zipf plot for Combined Article Paragraphs")
xlabel("Frequency Rank of Token")
ylabel("Absolute Frequency of Token")
grid(True)
for n in list(logspace(-0.5, log10(len(counts)-1), 20).astype(int)):
    dummy = text(ranks[n], frequencies[n], " " + tokens[indices[n]],
                 verticalalignment="bottom",
                 horizontalalignment="left")
PURPOSE: I am attempting to draw a "fitted line" in this graph and assign its value to a variable. However, I do not know how to add that. Any help would be much appreciated for both of these issues.
I know it's been a while since this question was asked. However, I came across a possible solution to this problem at the scipy site.
I thought I would post it here in case anyone else needs it.
I didn't have the paragraph info, so here is a quickly assembled dict called frequency that has paragraph occurrence counts as its values.
We then get its values and convert them to a numpy array, and define the Zipf distribution parameter, which has to be > 1.
Finally, we display the histogram of the samples along with the probability density function.
Working Code:
import random
import matplotlib.pyplot as plt
from scipy import special
import numpy as np

# Generate a sample dict with random values to simulate paragraph data
frequency = {}
for i, j in enumerate(range(50)):
    frequency[i] = random.randint(1, 50)

counts = frequency.values()
tokens = frequency.keys()

# Convert the counts to a numpy array
s = np.array(list(counts))

# Define the zipf distribution parameter. Has to be > 1
a = 2.

# Display the histogram of the samples,
# along with the probability density function
count, bins, ignored = plt.hist(s, 50, density=True)
plt.title("Zipf plot for Combined Article Paragraphs")
x = np.arange(1., 50.)
plt.xlabel("Frequency Rank of Token")
y = x**(-a) / special.zetac(a)
plt.ylabel("Absolute Frequency of Token")
plt.plot(x, y/max(y), linewidth=2, color='r')
plt.show()
Plot
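For the fitted-line part of the question, one possible sketch (not from the answer above) is to fit a straight line to the log-log rank-frequency data with numpy.polyfit; the slope variable then holds the fitted Zipf exponent. The ranks and frequencies arrays below are stand-ins for the ones computed in the question's code:

import numpy as np
import matplotlib.pyplot as plt

# Stand-in rank/frequency data (replace with the arrays from the question's code)
ranks = np.arange(1, 51)
frequencies = np.sort(np.random.zipf(2., 50))[::-1]

# Fit log(frequency) = slope * log(rank) + intercept
slope, intercept = np.polyfit(np.log(ranks), np.log(frequencies), 1)

plt.loglog(ranks, frequencies, marker=".", linestyle="none")
plt.loglog(ranks, np.exp(intercept) * ranks**slope, color="r", label="fitted line")
plt.legend()
plt.show()
print("fitted exponent:", slope)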

Method for avoiding random number repetition - python

I am using the random number routines in Python in the following code in order to create a noise signal.
res = 10
# Add noise to each X bin across the signal
X = np.arange(-600, 600, res)
for i in range(10000):
    noise = [random.uniform(-2, 2) for i in xrange(len(X))]
    # custom module to save output of X and noise to .fits file
    wp.save_fits('test10000', X, noise)
plt.plot(V, I)
plt.show()
In this example I am generating 10,000 'noise.fits' files, which I then wish to co-add together in order to show the expected 1/sqrt(N) dependence of the stacked noise root-mean-square (rms) as a function of the number of objects co-added.
My problem is that the rms follows this dependence up until ~1000 objects, at which point it deviates upwards, suggesting that the random number generator is starting to repeat itself.
Is there a routine or way to structure the code that will avoid or minimise this repetition? (Ideally with the numbers as floats between a maximum value > 1 and a minimum value < -1.)
Here is the output of the co-adding code, with the code itself pasted at the bottom for reference.
If I use random.random() instead, the result is worse.
Here is my code which adds the noise signal files together, averaging over the number of objects.
import os
import numpy as np
from astropy.io import fits
import matplotlib.pyplot as plt
import glob

rms_arr = []
#vel_w_arr = []

filelist = glob.glob('/Users/thbrown/Documents/HI_stacking/mockcat/testing/test10000/M*.fits')
filelist.sort()

for i in (filelist[:]):
    print(i)
    # open an existing FITS file
    hdulist = fits.open(str(i))
    # assuming the first extension is the table, we assign the data to a record array
    tbdata = hdulist[1].data
    #index = np.arange(len(filelist))
    # Access the signal column
    noise = tbdata.field(1)
    # access the vel column
    X = tbdata.field(0)
    if i == filelist[0]:
        stack = np.zeros(len(noise))
        tot_rms = 0
    #print len(stack)
    # sum the signal in the loop
    stack = (stack + noise)
    rms = np.std(stack)
    rms_arr = np.append(rms_arr, rms)

numgal = np.arange(1, np.size(filelist)+1)
avg_rms = rms_arr / numgal
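To check the expected 1/sqrt(N) behaviour mentioned above, one possible sketch (not part of the original code) is to overlay avg_rms with a 1/sqrt(N) curve scaled by the rms of a single noise realisation; this assumes avg_rms, rms_arr and numgal from the code above:

# rms of a single realisation, estimated from the first (unstacked) file
sigma_single = rms_arr[0]

plt.loglog(numgal, avg_rms, label="measured")
plt.loglog(numgal, sigma_single / np.sqrt(numgal), linestyle="--", label="1/sqrt(N) expectation")
plt.xlabel("Number of co-added objects N")
plt.ylabel("rms of averaged signal")
plt.legend()
plt.show()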
