Fit a distribution to a Counter in scipy - python

I have a collections.Counter object with a count of the occurrences of different values like this:
1:193260
2:51794
3:19112
4:9250
5:6486
How can I fit a probability distribution to this data in scipy? scipy.stats.expon.fit() seems to want a list of numbers. It seems wasteful to create a list with 193260 [1]s, 51794 [2]s, etc. Is there a more elegant or efficient way?

It looks like scipy.stats.expon.fit is basically a small wrapper over scipy.optimize.minimize, where it first creates a function to compute neg-log-likelihood, and then uses scipy.optimize.minimize to fit the pdf parameters.
So, I think what you need to do here is write your own function that computes the neg-log-likelihood of the counter object, and then call scipy.optimize.minimize yourself.
More specifically, scipy defines the expon 'scale' parameter here
http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.expon.html
So, the pdf is:
pdf(x) = (1/scale) * exp(-x/scale)
So, taking the logarithm of the pdf we get:
log_pdf(x) = -log(scale) - x/scale
Therefore, the negative log-likelihood of everything in your counter object would be:
def neg_log_likelihood(scale):
    total = 0.0
    for x, count in counter.items():
        total += (math.log(scale) + x / scale) * count
    return total
Here is a program to try this out.
import scipy.stats
import scipy.optimize
import math
import collections

def fit1(counter):
    def neg_log_likelihood(params):
        scale = params[0]  # scipy.optimize.minimize passes a 1-element array
        total = 0.0
        for x, count in counter.items():
            total += (math.log(scale) + x / scale) * count
        return total
    optimize_result = scipy.optimize.minimize(neg_log_likelihood, [1.0])
    if not optimize_result.success:
        raise Exception(optimize_result.message)
    return optimize_result.x[0]

def fit2(counter):
    data = []
    # Create a list where each key is repeated as many times
    # as the value of the counter.
    for x, count in counter.items():
        data += [x] * count
    fit_result = scipy.stats.expon.fit(data, floc=0)
    return fit_result[-1]

def test():
    c = collections.Counter()
    c[1] = 193260
    c[2] = 51794
    c[3] = 19112
    c[4] = 9250
    c[5] = 6486
    print("fit1 'scale' is %f " % fit1(c))
    print("fit2 'scale' is %f " % fit2(c))

test()
Here is the output:
fit1 'scale' is 1.513437
fit2 'scale' is 1.513438
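Incidentally, for an exponential distribution with loc fixed at 0, the maximum-likelihood estimate of 'scale' has a closed form: it is just the mean of the data, which can be computed straight from the Counter as a count-weighted mean (a minimal sketch, reusing the counter c from the test above):
def fit3(counter):
    # MLE of the exponential 'scale' (with loc = 0) is the sample mean,
    # computed here as a count-weighted mean without expanding the data.
    total = sum(x * count for x, count in counter.items())
    n = sum(counter.values())
    return total / n
fit3(c) gives ~1.513437, matching fit1 and fit2 above.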

Related

Python: getting a count in a sequence of numbers, then dividing it by the length

Okay, so basically I have a list (seq) of 19 letters of DNA.
"CGGTACAATCGATTTAGAG"
I am looking to get the right code to count 'A','T','G','C'.
I have tried:
dna_count = seq.count (I have done this for each letter)
Then I used: dna_fraction = dna_count/len(seq)
print(dna_fraction * 100)
This results in an error: dna_count is not defined.
I also need to incorporate round() to 2 d.p. of the percentage outcome, and return this.
Code (from comment):
def percentBases(dnaStrand):
    seq = "CGGTACAATCGATTTAGAG"
    dna_count = seq.count("A") + seq.count("T") + seq.count("G") + seq.count("C")
    dna_fraction = dna_count / len(seq)
    print(dna_fraction * 100)
    rawPerCent = 100/seq.count
    percentC = round(rawPercent, 2)
    return (percentC, percentG, percentA, percentT)
Error message (from comment):
dna_fraction = dna_count / len(seq)
NameError: name 'dna_count' is not defined
You can use Counter():
from collections import Counter

seq = "CGGTACAATCGATTTAGAG"
for element, count in Counter(seq).items():  # use Counter(seq).most_common() for sorted output
    print(f"{element}: {count / len(seq):.2%}")
Or you can do it in one line:
print(*(f"{e}: {c / len(seq):.2%}" for e, c in Counter(seq).items()), sep="\n")
# print("\n".join(f"{e}: {c / len(seq):.2%}" for e, c in Counter(seq).items()))
seq = "CGGTACAATCGATTTAGAG"
# use set to find every unique letter
for dna in set(seq):
# count total number for each DNA in sequence
dna_count = dna_count.count(dna)
# divide by total
dna_fraction = dna_count/len(seq)
# round by 2
dna_fraction = round(dna_fraction, 2)
# print percentage of dna in sequence
print(dna, dna_fraction * 100)
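To also cover the requirement of rounding to 2 d.p. and returning the result, a minimal sketch (reusing the percentBases name from the question; the base order in the returned tuple is just an assumption) could be:
from collections import Counter

def percentBases(dnaStrand):
    # Percentage of each base, rounded to 2 decimal places
    counts = Counter(dnaStrand)
    n = len(dnaStrand)
    return tuple(round(100 * counts[base] / n, 2) for base in "CGAT")

print(percentBases("CGGTACAATCGATTTAGAG"))  # (15.79, 26.32, 31.58, 26.32)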

Evaluating a simple function with a given step and outputting the max value and argument of the function

I am writing a program which evaluates a function over the interval 0:9 with a step size of 0.005. This requires 1800 calculations, plus a way to find the max value of the function and the x argument that produced it.
What would be the recommended way, and which loops should I use, to calculate the function 1800 times (9/0.005), find its max value, and output the argument value that was used in the calculation of that maximum?
My idea was that there should be 2 lists generated, one for the range/interval (1800 items) and the other for the calculated values (also 1800). I would then find the max in the 'calculated' list and the related x argument in the other list, using the list index or some other method.
from operator import itemgetter
import math

myfile = open("result.txt", "w")
data = []
step = 0.005
rng = 9
lim = rng/step
print(lim)
xs = [x * step for x in range(rng)]
lim_int = int(lim)
print(xs)
for i in range(lim_int):
    num = itemgetter(i)(xs)
    x = math.sin(num) * math.exp(-num/100)
    print(i, x)
    data.append(x)
for i in range(rng):
    text = str(i)
    text2 = str(data[i])
    print(text, text2)
    myfile.write(text + ' ' + text2 + '\n')
i = 1
while i < rng:
    i = i+1
    num2 = itemgetter(i)(xs)
    v = math.sin(num2) * math.exp(-num2/100)
    if v == max(data):
        arg = num2
        break
print('largest function value', max(data))
print('function argument value used', arg)
myfile.close()
NumPy is the widely used, performant package for this:
import numpy as np
x = np.arange(0, 9, 0.005)
f = np.sin(x)*np.exp(-x/100)
print("max is: ", np.max(f))
print("index of max is: ", np.argmax(f))
output:
max is: 0.98446367206362
index of max is: 312
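Since the question also asks for the argument value that was used (not just the index), it can be recovered by indexing x with the argmax (a small follow-up to the snippet above):
# x value corresponding to the maximum of f
print("argument of max is: ", x[np.argmax(f)])  # 0.005 * 312 = 1.56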
If for some reason you want a native python solution (without using list methods max and index), you can do something like this:
import math

step = 0.005
rng = 9
lim = int(rng/step)
x = [x_i*step for x_i in range(lim + 1)]
f = [math.exp(-x_i/100)*math.sin(x_i) for x_i in x]

max_ind = 0
f_max = f[max_ind]
for j, f_x in enumerate(f):
    if f_x > f_max:
        f_max = f_x
        max_ind = j
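Here too, the x argument that produced the maximum is recovered from the index (reusing the names from the snippet above):
print(f_max, x[max_ind])  # maximum value and the x argument that produced it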

Finding correlation coefficient from 2 lists

I am working on a project with many functions that operate on a couple of lists of data. I've already separated the lists, and I have defined some functions which I know work correctly, namely a mean function and a standard deviation function. My issue is that when testing my lists I get a correct mean and a correct standard deviation, but an incorrect correlation coefficient. Could my math be off here? I need to find the correlation coefficient with only Python's standard library.
MY CODE:
def correlCo(someList1, someList2):
    # First establish the means and standard deviations for both lists.
    xMean = mean(someList1)
    yMean = mean(someList2)
    xStandDev = standDev(someList1)
    yStandDev = standDev(someList2)
    zList1 = []
    zList2 = []
    # Create 2 new lists taking (a[i] - a's mean) / standard deviation of a
    for x in someList1:
        z1 = (float(x) - xMean) / xStandDev
        zList1.append(z1)
    for y in someList2:
        z2 = (float(y) - yMean) / yStandDev
        zList2.append(z2)
    # Mapping out the lists to be float values instead of strings
    zList1 = list(map(float, zList1))
    zList2 = list(map(float, zList2))
    # Multiplying each value from the lists
    zFinal = [a*b for a, b in zip(zList1, zList2)]
    totalZ = 0
    # Taking the sum of all the products
    for a in zFinal:
        totalZ += a
    # Finally calculating the correlation coefficient
    r = (1/(len(someList1) - 1)) * totalZ
    return r
SAMPLE RUN:
I have a list of [1,2,3,4,4,8] and [3,3,4,5,8,9]
I expect the correct answer of r = 0.8848, but get r = .203727
EDIT: To include the mean and standard deviation functions I have made.
def mean(someList):
    total = 0
    for a in someList:
        total += float(a)
    mean = total/len(someList)
    return mean

def standDev(someList):
    newList = []
    sdTotal = 0
    listMean = mean(someList)
    for a in someList:
        newNum = (float(a) - listMean)**2
        newList.append(newNum)
    for z in newList:
        sdTotal += float(z)
    standardDeviation = sdTotal/(len(newList))
    return standardDeviation
The Pearson correlation can be calculated with numpy's corrcoef.
import numpy
numpy.corrcoef(list1, list2)[0, 1]
Pearson Correlation Coefficient
Code (modified)
def mean(someList):
    total = 0
    for a in someList:
        total += float(a)
    mean = total/len(someList)
    return mean

def standDev(someList):
    listMean = mean(someList)
    dev = 0.0
    for i in range(len(someList)):
        dev += (someList[i]-listMean)**2
    dev = dev**(1/2.0)
    return dev

def correlCo(someList1, someList2):
    # First establish the means and standard deviations for both lists.
    xMean = mean(someList1)
    yMean = mean(someList2)
    xStandDev = standDev(someList1)
    yStandDev = standDev(someList2)
    # r numerator
    rNum = 0.0
    for i in range(len(someList1)):
        rNum += (someList1[i]-xMean)*(someList2[i]-yMean)
    # r denominator
    rDen = xStandDev * yStandDev
    r = rNum/rDen
    return r

print(correlCo([1,2,3,4,4,8], [3,3,4,5,8,9]))
Output
0.884782972876
Normally, according to the standard deviation formula, shouldn't you divide dev by the sample size (the length of the list) before taking the square root?
I mean:
dev += ((someList[i]-listMean)**2)/len(someList)
Your standard deviation is wrong. You forgot to take the square root.
You are actually returning the variance, not the standard deviation, from that function.
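For completeness, a minimal sketch of a corrected standDev using the sample standard deviation (dividing by n - 1 before taking the square root, which is what the 1/(n-1) factor in the original correlCo expects):
import math

def standDev(someList):
    # Sample standard deviation: divide by (n - 1), then take the square root
    listMean = mean(someList)
    sdTotal = sum((float(a) - listMean) ** 2 for a in someList)
    return math.sqrt(sdTotal / (len(someList) - 1))

With this standDev, the original correlCo([1,2,3,4,4,8], [3,3,4,5,8,9]) returns approximately 0.8848, as expected.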

Python: Fourth Order Runge-Kutta Method

from math import sin
from numpy import arange
from pylab import plot, xlabel, ylabel, show

def answer():
    print('Part a:')
    print(low(x, t))
    print('First Graph')
    print('')

def low(x, t):
    return 1/RC * (V_in - V_out)

a = 0.0
b = 10.0
N = 1000
h = (b-a)/N
RC = 0.01
V_out = 0.0

tpoints = arange(a, b, h)
xpoints = []
x = 0.0
for t in tpoints:
    xpoints.append(x)
    k1 = h*f(x, t)
    k2 = h*f(x+0.5*k1, t+0.5*h)
    k3 = h*f(x+0.5*k2, t+0.5*h)
    k4 = h*f(x+k3, t+h)
    x += (k1+2*k2+2*k3+k4)/6

plot(tpoints, xpoints)
xlabel("t")
ylabel("x(t)")
show()
So I have the fourth-order Runge-Kutta method coded, but the part I'm trying to fit in is where the problem says V_in(t) = 1 if [2t] is even or -1 if [2t] is odd.
Also, I'm not sure whether I'm supposed to return this equation:
return 1/RC * (V_in - V_out)
Here is the problem:
Problem 8.1
It would be greatly appreciated if you could help me out!
So I have the fourth-order Runge-Kutta method coded, but the part I'm trying to fit in is where the problem says V_in(t) = 1 if [2t] is even or -1 if [2t] is odd.
You are treating V_in as a constant. The problem says that it's a function. So one solution is to make it a function! It's a very simple function to write:
import math

def dV_out_dt(V_out, t):
    return (V_in(t) - V_out)/RC

def V_in(t):
    if math.floor(2.0*t) % 2 == 0:
        return 1
    else:
        return -1
You don't need or want that if statement in the definition of V_in(t). A branch inside of a loop is expensive, and this function will be called many times from inside a loop. There's a simple way to avoid that if statement.
def V_in(t):
    return 1 - 2*(math.floor(2.0*t) % 2)
This function is so small that you can fold it into the derivative function:
def dV_out_dt(V_out, t):
    return ((1 - 2*(math.floor(2.0*t) % 2)) - V_out)/RC
The function should look something like this:
from math import floor

def f(x, t):
    V_out = x
    n = floor(2*t)
    V_in = -1 if n % 2 == 1 else 1
    return 1/RC * (V_in - V_out)
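Putting the pieces together, a rough sketch of the whole thing (reusing the constants and the RK4 loop from the question, with the square-wave V_in folded into f as the answers suggest) might look like:
import math
from numpy import arange
from pylab import plot, xlabel, ylabel, show

RC = 0.01
a = 0.0
b = 10.0
N = 1000
h = (b - a)/N

def f(x, t):
    # x plays the role of V_out; V_in(t) is the square wave from the problem
    V_in = 1 if math.floor(2*t) % 2 == 0 else -1
    return (V_in - x)/RC

tpoints = arange(a, b, h)
xpoints = []
x = 0.0
for t in tpoints:
    xpoints.append(x)
    k1 = h*f(x, t)
    k2 = h*f(x + 0.5*k1, t + 0.5*h)
    k3 = h*f(x + 0.5*k2, t + 0.5*h)
    k4 = h*f(x + k3, t + h)
    x += (k1 + 2*k2 + 2*k3 + k4)/6

plot(tpoints, xpoints)
xlabel("t")
ylabel("x(t)")
show()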

calculate equivalent width using python code

I have this Fortran program to compute the equivalent width of spectral lines.
I hope to find help writing Python code that does the same algorithm (the input file contains two columns: wavelength and flux).
      PARAMETER (N=195)   ! N is the number of data
      IMPLICIT DOUBLE PRECISION (A-H,O-Z)
      DIMENSION X(N),Y(N)
      OPEN(1,FILE='halpha.dat')
      DO 10 I=1,N
        READ(1,*)X(I),Y(I)
        WRITE(*,*)X(I),Y(I)
10    CONTINUE
      CALL WIDTH(X,Y,N,SUM)
      WRITE(*,*)SUM
      END
c-----------------------------------------
      SUBROUTINE WIDTH(X,Y,N,SUM)
      PARAMETER (NBOD=20000)
      IMPLICIT DOUBLE PRECISION (A-H,O-Z)
      DIMENSION X(NBOD),Y(NBOD)
      SUM=0.D0
      DO I=2,N
        SUM=SUM+(X(I-1)-X(I))*((1.-Y(I-1))+(1.-Y(I)))
C       WRITE(*,*)SUM
      END DO
      SUM=0.5*dabs(SUM)
      RETURN
      END
Here's a fairly literal translation:
def main():
    N = 195  # number of data pairs
    x, y = [0 for i in range(N)], [0 for i in range(N)]
    with open('halpha.dat') as f:
        for i in range(N):
            x[i], y[i] = map(float, f.readline().split())
            print(x[i], y[i])
    sum = width(x, y, N)
    print(sum)

def width(x, y, N):
    sum = 0.0
    for i in range(1, N):
        sum = sum + (x[i-1] - x[i]) * ((1. - y[i-1]) + (1. - y[i]))
    sum = 0.5*abs(sum)
    return sum

main()
However this would be a more idiomatic translation:
from math import fsum  # more accurate floating-point sum of a series of terms

def main():
    with open('halpha.dat') as f:  # Read file into a list of tuples.
        pairs = [tuple(float(word) for word in line.split()) for line in f]
    for pair in pairs:
        print('{}, {}'.format(*pair))
    print('{}'.format(width(pairs)))

def width(pairs):
    def term(prev, curr):
        return (prev[0] - curr[0]) * ((1. - prev[1]) + (1. - curr[1]))
    return 0.5 * abs(fsum(term(*pairs[i-1:i+1]) for i in range(1, len(pairs))))

main()
I would suggest that a more natural way to do this in Python is to focus on the properties of the spectrum itself, and use your parameters in astropy's specutils.
In particular, the equivalent_width details are here. For more general info on specutils, specutils.analysis, and its packages, follow these links:
specutils top level
and
specutils.analysis
To use this package you need to create a Spectrum1D object, the first component of which will be your wavelength axis and the second will be the flux. You can find details of how to create a Spectrum1D object by following the link in the analysis page (at the end of the third line of first paragraph).
It's a very powerful approach and has been developed by astronomers for astronomers.
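As a rough illustration, a minimal sketch of that approach (assuming specutils and astropy are installed, that halpha.dat holds continuum-normalized flux as the Fortran code assumes, and that the wavelengths are in Angstroms; check the specutils docs for the exact unit requirements):
import numpy as np
import astropy.units as u
from specutils import Spectrum1D
from specutils.analysis import equivalent_width

# Load the two-column file: wavelength and (continuum-normalized) flux
wavelength, flux = np.loadtxt('halpha.dat', unpack=True)

spectrum = Spectrum1D(spectral_axis=wavelength * u.AA,
                      flux=flux * u.dimensionless_unscaled)

# equivalent_width assumes a continuum level of 1 by default
print(equivalent_width(spectrum))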
