Constructing a Zipf distribution with matplotlib, with a fitted line - Python

I have a list of paragraphs, and I want to run a Zipf distribution on their combined text.
My code is below:
from pylab import *
from collections import Counter
import matplotlib.pyplot as plt

# Combine all paragraphs into a single text and count word frequencies
paragraphs = " ".join(targeted_paragraphs)
frequency = Counter(paragraphs.split())
counts = array(list(frequency.values()))
tokens = list(frequency.keys())
ranks = arange(1, len(counts) + 1)
indices = argsort(-counts)
frequencies = counts[indices]

loglog(ranks, frequencies, marker=".")
title("Zipf plot for Combined Article Paragraphs")
xlabel("Frequency Rank of Token")
ylabel("Absolute Frequency of Token")
grid(True)

# Label a selection of points, evenly spaced in log space
for n in logspace(-0.5, log10(len(counts) - 1), 20).astype(int):
    text(ranks[n], frequencies[n], " " + tokens[indices[n]],
         verticalalignment="bottom",
         horizontalalignment="left")
Purpose: I am trying to draw a fitted line on this graph and assign its value to a variable, but I do not know how to add that. Any help with either of these issues would be much appreciated.
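One common way to get the fitted-line part (a sketch, not from the original post; it assumes the ranks and frequencies arrays computed above): fit a straight line to the log-log data with numpy.polyfit and keep its slope in a variable. For a Zipf law the slope is close to -1.
import numpy as np
# Fit a line to log(rank) vs log(frequency); the slope estimates the Zipf exponent
slope, intercept = np.polyfit(np.log10(ranks), np.log10(frequencies), 1)
fitted = 10 ** (intercept + slope * np.log10(ranks))
loglog(ranks, fitted, linestyle="--", label="fit, slope=%.2f" % slope)
legend()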

I know it's been a while since this question was asked, but I came across a possible solution for this problem on the SciPy site, and I thought I would post it here in case anyone else needs it.
I didn't have the paragraph info, so here is a quickly whipped-up dict called frequency that has paragraph occurrence counts as its values.
We then take its values and convert them to a numpy array, define the Zipf distribution parameter (which has to be > 1), and finally display the histogram of the samples along with the probability density function.
Working Code:
import random
import matplotlib.pyplot as plt
from scipy import special
import numpy as np

# Generate a sample dict with random values to simulate paragraph data
frequency = {}
for i in range(50):
    frequency[i] = random.randint(1, 50)

counts = list(frequency.values())
tokens = list(frequency.keys())

# Convert the counts to a numpy array
s = np.array(counts)

# Define the zipf distribution parameter; it has to be > 1
a = 2.

# Display the histogram of the samples,
# along with the probability density function
count, bins, ignored = plt.hist(s, 50, density=True)
plt.title("Zipf plot for Combined Article Paragraphs")
plt.xlabel("Frequency Rank of Token")
plt.ylabel("Absolute Frequency of Token")
x = np.arange(1., 50.)
y = x**(-a) / special.zetac(a)
plt.plot(x, y / max(y), linewidth=2, color='r')
plt.show()
[Plot: histogram of the samples with the Zipf probability density function overlaid]

Related

How to use the values of the find_peaks function in Python

I have to analyse a PPG signal. I found something to find the peaks, but I can't use the values of the heights: they are stored in something like a dictionary of arrays, and I don't know how to extract the values from it. I tried using dict.values(), but that didn't work.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from scipy.signal import savgol_filter
data = pd.read_excel('test_heartpy.xlsx')
arr = np.array(data)
time = arr[1:,0] # time in s
ECG = arr[1:,1] # ECG
PPG = arr[1:,2] # PPG
filtered = savgol_filter(PPG, 251, 3)
plt.plot(time, filtered)
plt.xlabel('Time (in s)')
plt.ylabel('PPG')
plt.grid('on')
The PPG signal looks like this. To search for the peaks I used:
# searching peaks
from scipy.signal import find_peaks
peaks, heights_peak_0 = find_peaks(PPG, height=0.2)
heights_peak = heights_peak_0.values()
plt.plot(PPG)
plt.plot(peaks, np.asarray(PPG)[peaks], "x")
plt.plot(np.zeros_like(PPG), "--", color="gray")
plt.title("PPG peaks")
plt.show()
print(heights_peak_0)
print(heights_peak)
print(peaks)
Printing:
{'peak_heights': array([0.4822998 , 0.4710083 , 0.43884277, 0.46728516, 0.47094727,
0.44702148, 0.43029785, 0.44146729, 0.43933105, 0.41400146,
0.45318604, 0.44335938])}
dict_values([array([0.4822998 , 0.4710083 , 0.43884277, 0.46728516, 0.47094727,
0.44702148, 0.43029785, 0.44146729, 0.43933105, 0.41400146,
0.45318604, 0.44335938])])
[787 2513 4181 5773 7402 9057 10601 12194 13948 15768 17518 19335]
Signal with highlighted peaks looks like this.
heights_peak_0 is the properties dict returned by scipy.signal.find_peaks.
You can find more information about what is returned in the scipy.signal.find_peaks documentation.
You can extract the array containing all the heights of the peaks with heights_peak_0["peak_heights"].
# The following will give you an array with the values of the peaks:
heights_peak_0['peak_heights']
# peaks contains the indices where the find_peaks function found peaks in the
# original signal, so you can also get the peak values this way:
PPG[peaks]
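Applied to the question's own variables, that means (a small sketch reusing the PPG array and the height=0.2 threshold from above):
peaks, properties = find_peaks(PPG, height=0.2)
heights_peak = properties["peak_heights"]  # array of peak heights
print(heights_peak)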
According to the docs, the find_peaks() function returns a tuple consisting of the peak indices themselves and a properties dict. As you are only interested in the peak values, you can simply ignore the second element of the tuple and use only the first one.
Assuming you want the 'coordinates' of your peaks, you can then combine the peak heights (y-values) with their positions (x-values) like so (based on the first code snippet given in the docs):
import matplotlib.pyplot as plt
from scipy.misc import electrocardiogram
from scipy.signal import find_peaks
x = electrocardiogram()[2000:4000]
peaks, _ = find_peaks(x, distance=150)
peaks_x_values = peaks
peaks_y_values = x[peaks]
peak_coordinates = list(zip(peaks_x_values, peaks_y_values))
print(peak_coordinates)
plt.plot(x)
plt.plot(peaks_x_values, peaks_y_values, "x")
plt.show()
Printing:
[(65, 0.705), (251, 1.155), (431, 1.705), (608, 1.96), (779, 1.925), (956, 2.09), (1125, 1.745), (1292, 1.37), (1456, 1.2), (1614, 0.81), (1776, 0.665), (1948, 0.665)]

Probability density function in SciPy behaves differently than expected

I am trying to plot a normal distribution curve using Python. First I did it manually by using the normal probability density function, and then I found there's an existing function pdf in SciPy's stats module. However, the results I get are quite different.
Below is the example that I tried:
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
mean = 5
std_dev = 2
num_dist = 50
# Draw random samples from a normal (Gaussian) distribution
normalDist_dataset = np.random.normal(mean, std_dev, num_dist)
# Sort these values.
normalDist_dataset = sorted(normalDist_dataset)
# Create the bins and histogram
plt.figure(figsize=(15,7))
count, bins, ignored = plt.hist(normalDist_dataset, num_dist, density=True)
new_mean = np.mean(normalDist_dataset)
new_std = np.std(normalDist_dataset)
normal_curve1 = stats.norm.pdf(normalDist_dataset, new_mean, new_std)
normal_curve2 = (1/(new_std *np.sqrt(2*np.pi))) * (np.exp(-(bins - new_mean)**2 / (2 * new_std**2)))
plt.plot(normalDist_dataset, normal_curve1, linewidth=4, linestyle='dashed')
plt.plot(bins, normal_curve2, linewidth=4, color='y')
The result shows that the two curves I get are very different from each other.
My guess is that it has something to do with the bins, or that pdf behaves differently than the usual formula. I have used the same (new) mean and standard deviation for both plots. So, how do I change my code to match what stats.norm.pdf is doing?
I don't know yet which curve is correct.
The plot function simply connects the dots with line segments, and your bins do not contain enough dots to show a smooth curve. A possible solution:
....
normal_curve1 = stats.norm.pdf(normalDist_dataset, new_mean, new_std)
bins = normalDist_dataset # Add this line
normal_curve2 = (1/(new_std *np.sqrt(2*np.pi))) * (np.exp(-(bins - new_mean)**2 / (2 * new_std**2)))
....
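Alternatively (a sketch not in the original answer, reusing normalDist_dataset, new_mean and new_std from the question): evaluate the analytic PDF on a dense grid, so the curve is smooth regardless of how many bins the histogram has.
# A dense x-grid makes the analytic curve smooth independently of the bins
x = np.linspace(min(normalDist_dataset), max(normalDist_dataset), 500)
smooth_curve = (1 / (new_std * np.sqrt(2 * np.pi))) * np.exp(-(x - new_mean)**2 / (2 * new_std**2))
plt.plot(x, smooth_curve, linewidth=4, color='g')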

Python: Plot a histogram given the counts (frequencies) and the bins

To illustrate my problem I prepared an example:
First, I have two arrays 'a' and 'b' and I'm interested in their distribution:
import numpy as np
import matplotlib.pyplot as plt
a = np.array([1,2,2,2,2,4,8,1,9,5,3,1,2,9])
b = np.array([5,9,9,2,3,9,3,6,8,4,2,7,8,8])
n1,bin1,pat1 = plt.hist(a,np.arange(1,10,2),histtype='step')
n2,bin2,pat2 = plt.hist(b,np.arange(1,10,2), histtype='step')
plt.show()
This code gives me a histogram with two 'curves'. Now I want to subtract one 'curve' from the other, and by this I mean that I do this for each bin separately:
n3 = n2-n1
I don't need negative counts, so:
for i in range(0, len(n2)):
    if n3[i] < 0:
        n3[i] = 0
    else:
        continue
The new histogram curve should be plotted over the same range as the previous ones, and it should have the same number of bins. So I have the number of bins and their positions (the same as for the other curves; please refer to the block above), and the frequency or counts (n3) that every bin should have. Do you have any ideas of how I can do this with the data that I have?
You can use a step function to plot n3 = n2 - n1. The only issue is that you need to provide one more value; otherwise the last value is not shown nicely. You also need to use the where="post" option of the step function.
import numpy as np
import matplotlib.pyplot as plt
a = np.array([1,2,2,2,2,4,8,1,9,5,3,1,2,9])
b = np.array([5,9,9,2,3,9,3,6,8,4,2,7,8,8])
n1,bin1,pat1 = plt.hist(a,np.arange(1,10,2),histtype='step')
n2,bin2,pat2 = plt.hist(b,np.arange(1,10,2), histtype='step')
n3=n2-n1
n3[n3<0] = 0
plt.step(np.arange(1,10,2),np.append(n3,[n3[-1]]), where='post', lw=3 )
plt.show()
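Another idiom worth knowing (not from the original answer; it reuses bin1 and n3 from above): plt.hist can redraw a histogram from precomputed counts by passing the left bin edges as data and the counts as weights.
# Redraw the subtracted histogram from its counts and bin edges
plt.hist(bin1[:-1], bins=bin1, weights=n3, histtype='step')
plt.show()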

Zipf Distribution: How do I measure Zipf Distribution using Python / Numpy

I have a file (let's say corpus.txt) of around 700 lines, each line containing numbers separated by '-'. For example:
86-55-267-99-121-72-336-89-211
59-127-245-343-75-245-245
First I need to read the data from the file, find the frequency of each number, measure the Zipf distribution of these numbers, and then plot the distribution. I have done the first two parts of the task; I am stuck on drawing the Zipf distribution.
I know that numpy.random.zipf(a, size=None) should be used for this, but I am finding it extremely hard to use. Any pointers or code snippets would be extremely helpful.
Code:
# Count the frequency of each number in the file
def calculateFrequency(fileDir):
    frequency = {}
    for line in fileDir:
        line = line.strip().split('-')
        for i in line:
            frequency.setdefault(i, 0)
            frequency[i] += 1
    return frequency
fileDir = open("corpus.txt")
frequency = calculateFrequency(fileDir)
fileDir.close()
print(frequency)
## TODO: Measure and draw zipf distribution
As stated, numpy.random.zipf(a, size=None) draws samples from a Zipf distribution with the specified parameter a > 1.
However, since your question was about the difficulty of using the numpy.random.zipf method, here is a naive attempt, as discussed on the SciPy zipf documentation site.
Below is a simulated corpus.txt that has 10 random numbers per line. Lines may share values with other lines, to simulate recurrence.
16-45-3-21-16-34-30-45-5-28
11-40-22-10-40-48-22-23-22-6
40-5-33-31-46-42-47-5-27-14
5-38-12-22-19-1-11-35-40-24
20-11-24-10-9-24-20-50-21-4
1-25-22-13-32-14-1-21-19-2
25-36-18-4-28-13-29-14-13-13
37-6-36-50-21-17-3-32-47-28
31-20-8-1-13-24-24-16-33-47
26-17-39-16-2-6-15-6-40-46
Working Code
import csv
import matplotlib.pyplot as plt
from scipy import special
import numpy as np

# Read the '-'-separated corpus data and count each number's frequency in a dict
frequency = {}
with open('corpus.txt', 'r', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter='-', quotechar='|')
    for line in reader:
        for word in line:
            count = frequency.get(word, 0)
            frequency[word] = count + 1

# Define the zipf distribution parameter (must be > 1)
a = 2.

# Get the values from frequency and convert them to a numpy array
s = np.array(list(frequency.values()))

# Display the histogram of the samples, along with the probability density function:
count, bins, ignored = plt.hist(s, 50, density=True)
x = np.arange(1., 50.)
y = x**(-a) / special.zetac(a)
plt.plot(x, y / max(y), linewidth=2, color='r')
plt.show()
[Plot: histogram of the samples, along with the probability density function]
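For completeness, since the question asks about numpy.random.zipf itself, here is a minimal sketch of drawing samples with it (the parameter a = 2. is carried over from above, an assumption rather than a value fitted to the corpus):
# Draw 1000 samples from a Zipf distribution with parameter a
samples = np.random.zipf(a, 1000)
# Clip the long tail so the histogram stays readable
plt.hist(samples[samples < 50], bins=50, density=True)
plt.show()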

Is there any solution for better fitting a beta prime distribution to data than using SciPy?

I was trying to fit a beta prime distribution to my data using Python. As there's scipy.stats.betaprime.fit, I tried this:
import numpy as np
import scipy.stats as sts
import matplotlib.pyplot as plt

N = 5000
nb_bin = 100
a = 12; b = 106; scale = 36; loc = -a/(b-1)*scale

# Draw samples from the known distribution, then fit
y = sts.betaprime.rvs(a, b, loc, scale, N)
a_hat, b_hat, loc_hat, scale_hat = sts.betaprime.fit(y)
print('Estimated parameters: \n a=%.2f, b=%.2f, loc=%.2f, scale=%.2f' % (a_hat, b_hat, loc_hat, scale_hat))

# Compare the true and estimated PDFs on the histogram bins
plt.figure()
count, bins, ignored = plt.hist(y, nb_bin, density=True)
pdf_ini = sts.betaprime.pdf(bins, a, b, loc, scale)
pdf_est = sts.betaprime.pdf(bins, a_hat, b_hat, loc_hat, scale_hat)
plt.plot(bins, pdf_ini, 'g', linewidth=2.0, label='ini')
plt.grid()
plt.plot(bins, pdf_est, 'y', linewidth=2.0, label='est')
plt.legend()
plt.show()
It shows me this result:
Estimated parameters:
a=9935.34, b=10846.64, loc=-90.63, scale=98.93
which is quite different from the original parameters, as the figure of the PDFs shows:
If I give the real values of loc and scale as input to the fit function, the estimation result is much better. Has anyone worked on this already, or got a better solution?
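One approach that often helps here (a sketch based on scipy.stats' generic fit options, not a verified fix for this exact case): hold loc and scale fixed with the floc and fscale keyword arguments, so that only the shape parameters a and b are estimated.
# Fix loc and scale at their known values; estimate only a and b
a_hat, b_hat, loc_hat, scale_hat = sts.betaprime.fit(y, floc=loc, fscale=scale)
print('a=%.2f, b=%.2f' % (a_hat, b_hat))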
