Using SciPy curve_fit to predict post final score - python

I have a post, and I need to predict its final score as closely as I can.
Apparently using curve_fit should do the trick, although I don't really understand how to use it.
I have two known values that I collect 2 minutes after the post is published.
Those are the comment count, referred to as n_comments, and the vote count, referred to as n_votes.
After an hour, I check the post again and get the final_score (sum of all votes) value, which is what I want to predict.
I've looked at different examples online, but they all use multiple data points (I have just 2). Also, my initial data point contains more information (n_votes and n_comments), since I've found that without one of them you cannot accurately predict the score.
To use curve_fit you need a function. Mine looks like this:
def func(datapoint, k, t, s):
    return ((datapoint[0]*k + datapoint[1]*t) * 60 * datapoint[2]) * s
And a sample datapoint looks like this:
[n_votes, n_comments, hour]
This is the broken mess of my attempt, and the result doesn't look right at all.
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
initial_votes_list = [3, 1, 2, 1, 0]
initial_comment_list = [0, 3, 0, 1, 64]
final_score_list = [26,12,13,14,229]
# Those lists contain data about multiple posts; I want to predict one at a time, passing the parameters to the next.
def func(x, k, t, s):
    return ((x[0]*k + x[1]*t) * 60 * x[2]) * s
x = np.array([3, 0, 1])
y = np.array([26, 0, 2])
#X = [[a,b,c] for a,b,c in zip(i_votes_list,i_comment_list,[i for i in range(len(i_votes_list))])]
popt, pcov = curve_fit(func, x, y)
plt.plot(x, [1, func(x, *popt), 2], 'g--',
         label='fit: a=%5.3f, b=%5.3f, c=%5.3f' % tuple(popt))
plt.xlabel('Time')
plt.ylabel('Score')
plt.legend()
plt.show()
The plot should display the initial/final score and the current prediction.
I have some doubts regarding the function too. Initially it looked like this:
(votes_per_minute + n_comments) * 60 * hour
But I replaced votes_per_minute with just votes. Considering that I collect this data after 2 minutes, and that I have a parameter there, I'd say it's not too bad, but I don't really know.
Again, who guarantees that this is the best function possible? It would be nice to have the function discovered automatically, but I think this is ML territory...
EDIT:
Regarding the measurements: I can take as many as I want (every 15/30/60 s), although they have to be collected while the post is <= 3 minutes old.

Disclaimer: This is just a suggestion on how you may approach this problem. There might be better alternatives.
I think it might be helpful to take into consideration the relationship between elapsed time since posting and the final score. The following curve, from [OC] Upvotes over time for a reddit post, models the behavior of the final score (total upvote count) over time:
The curve relies on the fact that once a post is online, you expect a roughly linear rise in upvotes that slowly converges/stabilizes around a maximum (from there on the slope is gentle/flat).
Moreover, we know that the number of votes/comments usually grows with time. The relationship between these quantities can be treated as a series; I chose to model it as a geometric progression (you can use an arithmetic one if you see it works better). Also, keep in mind that you are counting some elements twice: some users commented and upvoted, so you counted them twice, and some can comment multiple times but upvote only once. I chose to assume that only 70% (in code p = 0.7) of the users are unique commenters and that users who both commented and upvoted represent 60% (in code e = 1 - 0.6 = 0.4) of the total number of users (commenters and upvoters). The result of these assumptions:
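Written out (reconstructed roughly from func in the code below, since the original expressions were given as images), the two parts are approximately score_time ≈ a * exp(1 - b / t^d), the saturating time curve, and score_counts ≈ q^t * e * (n_votes + p * n_comments), the vote/comment term; the final score is taken as their average.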
So we have two equations to model the score, and you can combine them by taking their average. In code this would look like the following:
import warnings
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
from mpl_toolkits.mplot3d import axes3d
# filter warnings
warnings.filterwarnings("ignore")
class Cfit:
    def __init__(self, votes, comments, scores, fit_size):
        self.votes = votes
        self.comments = comments
        self.scores = scores
        self.time = 60  # prediction time
        self.fit_size = fit_size
        self.popt = []

    def func(self, x, a, d, q):
        # fixed assumptions from above: p = share of unique commenters,
        # e = share left after removing users counted twice, b = constant
        e = 0.4
        b = 1
        p = 0.7
        return (a * np.exp(1 - (b / self.time**d)) + q**self.time * e * (x + p * self.comments[:len(x)])) / 2

    def fit_then_predict(self):
        popt, pcov = curve_fit(self.func, self.votes[:self.fit_size], self.scores[:self.fit_size])
        return popt, pcov
# init
init_votes = np.array([3, 1, 2, 1, 0])
init_comments = np.array([0, 3, 0, 1, 64])
final_scores = np.array([26, 12, 13, 14, 229])
# fit and predict
cfit = Cfit(init_votes, init_comments, final_scores, 15)
popt, pcov = cfit.fit_then_predict()
# plot expectations
fig = plt.figure(figsize = (15,15))
ax1 = fig.add_subplot(2,3,(1,3), projection='3d')
ax1.scatter(init_votes, init_comments, final_scores, color='g', marker='o', label='expected')
ax1.scatter(init_votes, init_comments, cfit.func(init_votes, *popt), color='r', marker='o', label='predicted')
# axis
ax1.set_xlabel('init votes count')
ax1.set_ylabel('init comments count')
ax1.set_zlabel('final score')
ax1.set_title('final score = f(init votes count, init comments count)')
plt.legend()
# evaluation: diff = expected - prediction
diff = abs(final_scores - cfit.func(init_votes, *popt))
ax2 = fig.add_subplot(2,3,4)
ax2.plot(init_votes, diff, 'ro', label='fit: a=%5.3f, d=%5.3f, q=%5.3f' % tuple(popt))
ax2.grid('on')
ax2.set_xlabel('init votes count')
ax2.set_ylabel('|expected-predicted|')
ax2.set_title('|expected-predicted| = f(init votes count)')
# plot expected and predictions as f(init-votes)
ax3 = fig.add_subplot(2,3,5)
ax3.plot(init_votes, final_scores, 'gx', label='fit: a=%5.3f, d=%5.3f, q=%5.3f' % tuple(popt))
ax3.plot(init_votes, cfit.func(init_votes, *popt), 'rx', label='fit: a=%5.3f, d=%5.3f, q=%5.3f' % tuple(popt))
ax3.set_xlabel('init votes count')
ax3.set_ylabel('final score')
ax3.set_title('final score = f(init votes count)')
ax3.grid('on')
# plot expected and predictions as f(init-comments)
ax4 = fig.add_subplot(2,3,6)
ax4.plot(init_comments, final_scores, 'gx', label='fit: a=%5.3f, d=%5.3f, q=%5.3f' % tuple(popt))
ax4.plot(init_comments, cfit.func(init_votes, *popt), 'rx', label='fit: a=%5.3f, d=%5.3f, q=%5.3f' % tuple(popt))
ax4.set_xlabel('init comments count')
ax4.set_ylabel('final score')
ax4.set_title('final score = f(init comments count)')
ax4.grid('on')
plt.show()
The output of the previous code is the following:
Obviously the provided data set is too small to evaluate any approach, so it is up to you to test this further.
The main idea here is that you assume your data follow a certain function/behavior (described in func), but you give it certain degrees of freedom (your parameters a, d, q), and using curve_fit you try to approximate the combination of these variables that best maps your input data to your output data. Once you have the parameters returned by curve_fit (popt in the code), you just run your function with those parameters, for example like this (add this section at the end of the previous code):
# a function similar to func, to predict the score for given initial values
def score(votes_count, comments_count, popt):
    e, b, p = 0.4, 1, 0.7
    a, d, q = popt[0], popt[1], popt[2]
    t = 60
    return (a * np.exp(1 - (b / t**d)) + q**t * e * (votes_count + p * comments_count)) / 2
print("score for init-votes = 2 & init-comments = 0 is ", score(2, 0, popt))
Output:
score for init-votes = 2 & init-comments = 0 is 14.000150386210994
You can see that this output is close to the correct value of 13; hopefully, with more data, you can get better/more accurate approximations of your parameters and consequently better "predictions".
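A side note for the original question (this is just a minimal sketch of my own, not part of the answer above): curve_fit also accepts a 2-D xdata, so several features per post can be passed at once by stacking them into rows and unpacking them inside the model function. For example, with a simple (assumed) linear model:

import numpy as np
from scipy.optimize import curve_fit

init_votes = np.array([3, 1, 2, 1, 0], dtype=float)
init_comments = np.array([0, 3, 0, 1, 64], dtype=float)
final_scores = np.array([26, 12, 13, 14, 229], dtype=float)

def linear_model(x, k, t):
    votes, comments = x  # x is a (2, M) array: one row per feature
    return k * votes + t * comments

xdata = np.vstack([init_votes, init_comments])  # shape (2, 5)
popt, pcov = curve_fit(linear_model, xdata, final_scores)
print(popt)  # fitted k and t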

Related

How to find the optimized or correct peaks

I have the following graph.
I am using Python's scipy.signal.find_peaks to find the peaks, but I am not sure how to do that. I did the following:
per = np.percentile(x,[70])
peaks_control = findPeaks(x, per[0])
where x is the signal
array([1.07541259e+09, 1.13851049e+09, 1.19241492e+09, 1.23706527e+09,
1.27240093e+09, 1.29836131e+09, 1.31217483e+09, 1.32037296e+09,
1.31908858e+09, 1.30896503e+09, 1.29216550e+09, 1.26958042e+09,
1.24561632e+09, 1.21202121e+09, 1.16869371e+09, 1.11054499e+09,
1.04006154e+09, 9.65663403e+08, 8.87706760e+08, 8.09340093e+08,
7.37568765e+08, 6.79736364e+08, 6.38576457e+08, 6.06062937e+08,
5.80650350e+08, 5.55089744e+08, 5.36334499e+08, 5.20236597e+08,
5.06529837e+08, 4.91825175e+08, 4.77937063e+08, 4.65475058e+08,
4.56520513e+08, 4.48393240e+08, 4.41944988e+08, 4.34822844e+08,
4.33688578e+08, 4.33451049e+08, 4.36256177e+08, 4.33553613e+08,
4.29191142e+08, 4.28492541e+08, 4.24465967e+08, 4.20074825e+08,
4.19935897e+08, 4.16652681e+08, 4.12419580e+08, 4.11747552e+08,
4.08801166e+08, 4.02351981e+08, 3.99620513e+08, 3.98716550e+08,
3.46023077e+08, 3.53969464e+08, 4.17131235e+08, 5.19363869e+08,
6.50956410e+08, 8.01530303e+08, 9.50162937e+08, 1.08249790e+09,
1.18242378e+09, 1.22732168e+09, 1.20123077e+09, 1.21067599e+09,
1.21556410e+09, 1.21272261e+09, 1.20310023e+09, 1.18692774e+09,
1.16694033e+09, 1.14330117e+09, 1.11635338e+09, 1.07947529e+09,
1.03222145e+09, 9.73427972e+08, 9.08558974e+08, 8.39966200e+08,
7.70457343e+08, 7.04976224e+08, 6.49436131e+08, 6.02085548e+08,
5.68915385e+08, 5.41638928e+08, 5.18758741e+08, 5.01973660e+08,
4.88766667e+08, 4.77643823e+08, 4.65681818e+08, 4.56193240e+08,
4.46851515e+08, 4.36135198e+08, 4.32282984e+08, 4.27913520e+08,
4.23408625e+08, 4.24119580e+08, 4.22399068e+08, 4.22415385e+08,
4.20193939e+08, 4.17638462e+08, 4.14822378e+08, 4.10636364e+08,
4.08388345e+08, 4.04844522e+08, 4.00571562e+08, 4.00841026e+08,
4.00764802e+08, 4.00432867e+08, 4.00336364e+08, 4.00724709e+08,
4.03048019e+08, 3.57437995e+08, 3.62371096e+08, 4.16658741e+08,
5.10148019e+08, 6.31750117e+08, 7.65175991e+08, 8.96832168e+08,
1.01666597e+09, 1.10373263e+09, 1.14380816e+09, 1.11629790e+09,
1.12228904e+09, 1.12378788e+09, 1.11974825e+09, 1.10812774e+09,
1.09125035e+09, 1.07033566e+09, 1.04667389e+09, 1.02016830e+09,
9.86036830e+08, 9.42176457e+08, 8.88900233e+08, 8.27962005e+08,
7.64362238e+08, 7.00755245e+08, 6.42390909e+08, 5.92395338e+08,
5.52426107e+08, 5.26319114e+08, 5.03317249e+08, 4.85524942e+08,
4.70421911e+08, 4.59389510e+08, 4.51644988e+08, 4.46288578e+08,
4.41076923e+08, 4.37533566e+08, 4.31993007e+08, 4.28625641e+08,
4.25406294e+08, 4.21161538e+08, 4.19049650e+08, 4.16719347e+08,
4.13124242e+08, 4.08404429e+08, 4.06154545e+08, 4.03386014e+08,
4.00980420e+08, 3.99442657e+08, 3.97792075e+08, 3.95606527e+08,
3.97922378e+08, 3.98345221e+08, 3.96253613e+08, 3.95703030e+08,
3.96108392e+08, 3.67136830e+08, 3.58382051e+08, 3.95844289e+08,
4.70853846e+08, 5.76629837e+08, 6.97682284e+08, 8.21169930e+08,
9.32588112e+08, 1.01885804e+09, 1.06315152e+09, 1.05128159e+09,
1.03944545e+09, 1.03769580e+09, 1.03132145e+09, 1.02008601e+09,
1.00327389e+09, 9.85387646e+08, 9.66403030e+08, 9.44620746e+08,
9.18596737e+08, 8.82269697e+08, 8.37750816e+08, 7.84877156e+08,
7.27590443e+08, 6.70183217e+08, 6.14567832e+08, 5.67404895e+08,
5.30862471e+08, 5.03108625e+08, 4.84348718e+08, 4.68116550e+08,
4.55809907e+08, 4.46616783e+08, 4.39725175e+08, 4.34323077e+08])
The peaks I get are adjacent to each other, as I can see there is a little bump at the second, third and fourth peak sites.
How should I calculate the peaks and ignore such adjacent ones? To calculate the width, prominence, etc., I need to find the peaks first. If I knew them already I might be able to put in some threshold.
As you asked in the comments, I'll provide an example. Please note that this is only an example; exploratory data analysis is always needed to choose the best way to reach your goal.
So, let's create some noisy data
import numpy as np
from scipy.signal import find_peaks, periodogram
import matplotlib.pyplot as plt
size = 100
a = np.linspace(1, .5, size)
x = np.linspace(0, 50, size)
np.random.seed(0)
y = a * np.sin(x) + np.random.normal(0, .1, size) + 5
Now, let's try to find the peaks with find_peaks from scipy.signal
peaks = find_peaks(y)[0]
plt.plot(x, y)
plt.plot(x[peaks], y[peaks], marker='o', ls='none')
plt.show()
As you can see, there are some "wrong" peaks. We need to set the distance argument in find_peaks (see the documentation).
Let us suppose that we don't know the distance between the peaks. In this case, we can see that the data are periodic, so we can find the period with a periodogram and use the period as the distance in find_peaks.
_f, _p = periodogram(y, nfft=2**6)
# calculate the sample rate of x
sample_rate = 1 / np.median(np.diff(x))
periods = 1 / _f[1:] / sample_rate
density = _p[1:] / _p[1:].max()
max_density_idx = density.argmax()
period = periods[max_density_idx]
plt.semilogx(periods, density)
plt.scatter(period, density[max_density_idx], color='r')
plt.title(f"period {period:.2f}")
plt.show()
Now we can use the period as distance argument in find_peaks
peaks = find_peaks(y, distance=period)[0]
plt.plot(x, y)
plt.plot(x[peaks], y[peaks], marker='o', ls='none')
plt.show()
update
In your specific case, it's a little bit different.
Define the signal (I'll call variables X and Y)
Y = np.array([1.07541259e+09, 1.13851049e+09, 1.19241492e+09, 1.23706527e+09, 1.27240093e+09, 1.29836131e+09, 1.31217483e+09, 1.32037296e+09, 1.31908858e+09, 1.30896503e+09, 1.29216550e+09, 1.26958042e+09, 1.24561632e+09, 1.21202121e+09, 1.16869371e+09, 1.11054499e+09, 1.04006154e+09, 9.65663403e+08, 8.87706760e+08, 8.09340093e+08, 7.37568765e+08, 6.79736364e+08, 6.38576457e+08, 6.06062937e+08, 5.80650350e+08, 5.55089744e+08, 5.36334499e+08, 5.20236597e+08, 5.06529837e+08, 4.91825175e+08, 4.77937063e+08, 4.65475058e+08, 4.56520513e+08, 4.48393240e+08, 4.41944988e+08, 4.34822844e+08, 4.33688578e+08, 4.33451049e+08, 4.36256177e+08, 4.33553613e+08, 4.29191142e+08, 4.28492541e+08, 4.24465967e+08, 4.20074825e+08, 4.19935897e+08, 4.16652681e+08, 4.12419580e+08, 4.11747552e+08, 4.08801166e+08, 4.02351981e+08, 3.99620513e+08, 3.98716550e+08, 3.46023077e+08, 3.53969464e+08, 4.17131235e+08, 5.19363869e+08, 6.50956410e+08, 8.01530303e+08, 9.50162937e+08, 1.08249790e+09, 1.18242378e+09, 1.22732168e+09, 1.20123077e+09, 1.21067599e+09, 1.21556410e+09, 1.21272261e+09, 1.20310023e+09, 1.18692774e+09, 1.16694033e+09, 1.14330117e+09, 1.11635338e+09, 1.07947529e+09, 1.03222145e+09, 9.73427972e+08, 9.08558974e+08, 8.39966200e+08, 7.70457343e+08, 7.04976224e+08, 6.49436131e+08, 6.02085548e+08, 5.68915385e+08, 5.41638928e+08, 5.18758741e+08, 5.01973660e+08, 4.88766667e+08, 4.77643823e+08, 4.65681818e+08, 4.56193240e+08, 4.46851515e+08, 4.36135198e+08, 4.32282984e+08, 4.27913520e+08, 4.23408625e+08, 4.24119580e+08, 4.22399068e+08, 4.22415385e+08, 4.20193939e+08, 4.17638462e+08, 4.14822378e+08, 4.10636364e+08, 4.08388345e+08, 4.04844522e+08, 4.00571562e+08, 4.00841026e+08, 4.00764802e+08, 4.00432867e+08, 4.00336364e+08, 4.00724709e+08, 4.03048019e+08, 3.57437995e+08, 3.62371096e+08, 4.16658741e+08, 5.10148019e+08, 6.31750117e+08, 7.65175991e+08, 8.96832168e+08, 1.01666597e+09, 1.10373263e+09, 1.14380816e+09, 1.11629790e+09, 1.12228904e+09, 1.12378788e+09, 1.11974825e+09, 1.10812774e+09, 1.09125035e+09, 1.07033566e+09, 1.04667389e+09, 1.02016830e+09, 9.86036830e+08, 9.42176457e+08, 8.88900233e+08, 8.27962005e+08, 7.64362238e+08, 7.00755245e+08, 6.42390909e+08, 5.92395338e+08, 5.52426107e+08, 5.26319114e+08, 5.03317249e+08, 4.85524942e+08, 4.70421911e+08, 4.59389510e+08, 4.51644988e+08, 4.46288578e+08, 4.41076923e+08, 4.37533566e+08, 4.31993007e+08, 4.28625641e+08, 4.25406294e+08, 4.21161538e+08, 4.19049650e+08, 4.16719347e+08, 4.13124242e+08, 4.08404429e+08, 4.06154545e+08, 4.03386014e+08, 4.00980420e+08, 3.99442657e+08, 3.97792075e+08, 3.95606527e+08, 3.97922378e+08, 3.98345221e+08, 3.96253613e+08, 3.95703030e+08, 3.96108392e+08, 3.67136830e+08, 3.58382051e+08, 3.95844289e+08, 4.70853846e+08, 5.76629837e+08, 6.97682284e+08, 8.21169930e+08, 9.32588112e+08, 1.01885804e+09, 1.06315152e+09, 1.05128159e+09, 1.03944545e+09, 1.03769580e+09, 1.03132145e+09, 1.02008601e+09, 1.00327389e+09, 9.85387646e+08, 9.66403030e+08, 9.44620746e+08, 9.18596737e+08, 8.82269697e+08, 8.37750816e+08, 7.84877156e+08, 7.27590443e+08, 6.70183217e+08, 6.14567832e+08, 5.67404895e+08, 5.30862471e+08, 5.03108625e+08, 4.84348718e+08, 4.68116550e+08, 4.55809907e+08, 4.46616783e+08, 4.39725175e+08, 4.34323077e+08])
X = np.arange(Y.size)
Since Y.size is 200 and in your plot there are 200 secs, I assume that the sample rate is 1 sec.
If we search for the peaks with default distance we find a lot of unwanted peaks
peaks = find_peaks(Y)[0]
plt.plot(X, Y)
plt.plot(X[peaks], Y[peaks], marker='o', ls='none')
plt.show()
Let's do a periodogram
_f, _p = periodogram(Y, nfft=2**12)
# the sample rate of your signal
sample_rate = 1
periods = 1 / _f[1:] / sample_rate
density = _p[1:] / _p[1:].max()
max_density_idx = density.argmax()
period = periods[max_density_idx]
p_peaks_idx = find_peaks(density)[0]
plt.semilogx(periods, density)
plt.scatter(period, density[max_density_idx], color='r')
period_peaks = []
for p_peak in p_peaks_idx:
    if density[p_peak] < .1:
        continue
    period_peaks.append(periods[p_peak])
    plt.scatter(periods[p_peak], density[p_peak])
    plt.text(periods[p_peak], density[p_peak], f"{periods[p_peak]:.1f} ", ha='right', va='center')
plt.title('periodogram')
plt.show()
We found two main periods
period_peaks
[56.888888888888886, 28.444444444444443]
If we use the higher density period (56.9, the fundamental, or 1st harmonic) we miss a peak
peaks = find_peaks(Y, distance=period_peaks[0])[0]
plt.plot(X, Y)
plt.plot(X[peaks], Y[peaks], marker='o', ls='none')
plt.show()
This could be because:
you have too few observations
the periodicity is not constant
If we empirically subtract a quantity (say 10) from the period, we find all the peaks
peaks = find_peaks(Y, distance=period_peaks[0] - 10)[0]
plt.plot(X, Y)
plt.plot(X[peaks], Y[peaks], marker='o', ls='none')
plt.show()
So we got peaks at
X[peaks]
array([ 7, 61, 118, 174])
Taking the difference, we see they're not regular (at this sample rate and with these few observations):
np.diff(X[peaks])
array([54, 57, 56])
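Since the question also asks about width and prominence: once the peak indices are known, scipy.signal can compute both directly. A small add-on sketch (reusing Y and peaks from above, not part of the original answer):

from scipy.signal import peak_prominences, peak_widths

prominences = peak_prominences(Y, peaks)[0]        # prominence of each detected peak
widths = peak_widths(Y, peaks, rel_height=0.5)[0]  # width of each peak at half its prominence
print(prominences)
print(widths)

Alternatively, find_peaks itself accepts prominence= and width= arguments, which is another way to suppress the small adjacent bumps.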

Calculating errors in polynomial coefficients

I have currently written some code that generates a polynomial to interpolate several data sets that I have. I now want to calculate the error in the polynomial coefficients but am unsure how to go about this.
My current code is below:
import numpy.polynomial.polynomial as poly
import matplotlib.pyplot as plt
import numpy as np
f = [0,5,16,18.5,30,50]
a = [1.41,1.43,0.72,0.78,0.8,0.86]
b = [3.80e-5,5.40e-5,5.14e-5,5.16e-5,8.5e-5,1.58e-4]
c = [1.6e-2,1.54e-2,10.523e-2,14.589e-2,11.1e-2,5.66e-2]
f_new = np.linspace(f[0], f[5], num=len(f)*10)
coefs_a = poly.polyfit(f, a, 3)
coefs_b = poly.polyfit(f, b, 2)
coefs_c = poly.polyfit(f, c, 2)
ffit_a = poly.polyval(f_new, coefs_a)
ffit_b = poly.polyval(f_new, coefs_b)
ffit_c = poly.polyval(f_new, coefs_c)
print(coefs_a)
print(coefs_b)
print(coefs_c)
plt.plot(f,a)
plt.show()
plt.plot(f_new, ffit_a)
plt.title('a')
plt.xlabel('Frequency (Hz)')
plt.ylabel('Coefficient value')
plt.show()
plt.plot(f,b)
plt.show()
plt.plot(f_new, ffit_b)
plt.title('b')
plt.xlabel('Frequency (Hz)')
plt.ylabel('Coefficient value)')
plt.show()
plt.plot(f,c)
plt.show()
plt.plot(f_new, ffit_c)
plt.title('c')
plt.xlabel('Frequency (Hz)')
plt.ylabel('Coefficient value')
plt.show()
So currently I generate coefficient values, and therefore polynomials, for the quantities named 'a', 'b' and 'c', and I now want to get the errors in these coefficients so I can calculate an overall error for each of the 'a', 'b' and 'c' quantities.
You can get the sum of squared residuals (a measure of error) for each fit if you set full=True.
For example, in your code, to get the residuals for coefs_a:
coefs_a, res_a = poly.polyfit(f, a, 3, full=True)
print(res_a[0]) # res_a will be a list with the 1st entry being the sum of residuals
Docs for numpy.polynomial.polynomial.polyfit
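If you want uncertainties on the coefficients themselves rather than just the residual sum, one possible approach (a sketch of my own, not part of the answer above) is the older np.polyfit interface, which can also return a covariance matrix for the fit; the square roots of its diagonal give a 1-sigma error for each coefficient:

import numpy as np

f = [0, 5, 16, 18.5, 30, 50]
a = [1.41, 1.43, 0.72, 0.78, 0.8, 0.86]

coefs_a, cov_a = np.polyfit(f, a, 3, cov=True)  # note: np.polyfit returns the highest-degree coefficient first
errors_a = np.sqrt(np.diag(cov_a))              # 1-sigma uncertainty for each coefficient
print(coefs_a)
print(errors_a)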

Change the scale of the graph image

I am trying to generate a graph and save an image of it in Python. Although the "plotting" of the values seems OK and I can get my picture, the scale of the graph is badly shifted.
If you compare the correct graph from the tutorial example with my bad graph generated from a different dataset, the curves are cut off at the bottom too early: the y-axis should start just above the highest values, and I should also see the curves for the highest x-values (in my case around 10^3).
Honestly, I think the problem is the scale of the y-axis, but I do not actually know which parameters I should change to fix it. I tried to play with some numbers (see the script below), but without any good results.
This is the code for calculation and generation of the graph image:
import numpy as np
hic_data = load_hic_data_from_reads('/home/besy/Hi-C/MOREX/TCC35_parsedV2/TCC35_V2_interaction_filtered.tsv', resolution=100000)
min_diff = 1
max_diff = 500
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(12, 12))
for cnum, c in enumerate(hic_data.chromosomes):
    if c in ['ChrUn']:
        continue
    dist_intr = []
    for diff in xrange(min_diff, min((max_diff, 1 + hic_data.chromosomes[c]))):
        beg, end = hic_data.section_pos[c]
        dist_intr.append([])
        for i in xrange(beg, end - diff):
            dist_intr[-1].append(hic_data[i, i + diff])
    mean_intrp = []
    for d in dist_intr:
        if len(d):
            mean_intrp.append(float(np.nansum(d)) / len(d))
        else:
            mean_intrp.append(0.0)
    xp, yp = range(min_diff, max_diff), mean_intrp
    x = []
    y = []
    for k in xrange(len(xp)):
        if yp[k]:
            x.append(xp[k])
            y.append(yp[k])
    l = plt.plot(x, y, '-', label=c, alpha=0.8)
    plt.hlines(mean_intrp[2], 3, 5.25 + np.exp(cnum / 4.3), color=l[0].get_color(),
               linestyle='--', alpha=0.5)
    plt.text(5.25 + np.exp(cnum / 4.3), mean_intrp[2], c, color=l[0].get_color())
    plt.plot(3, mean_intrp[2], '+', color=l[0].get_color())
plt.xscale('log')
plt.yscale('log')
plt.ylabel('number of interactions')
plt.xlabel('Distance between bins (in 100 kb bins)')
plt.grid()
plt.ylim(2, 250)
_ = plt.xlim(1, 110)
fig.savefig('/home/besy/Hi-C/MOREX/TCC35_V2_results/filtered/TCC35_V2_decay.png', dpi=fig.dpi)
I think the problem is in the scale: I need the y-axis to start from 10^-1 (0.1). In order to change this I tried:
min_diff = 0.1
.
.
.
dist_intr = []
for diff in xrange(min_diff, min((max_diff, 0.1 + hic_data.chromosomes[c]))):
.
.
.
plt.ylim((0.1, 20))
But these values return: "integer argument expected, got float".
I also tried to play with the max_diff, plt.ylim and plt.xlim parameters a little bit, but nothing changed much.
I would like to ask which parameter(s) I need to change, and how, to generate an image of the correctly scaled graph. Thank you in advance.

Attaching data from matrix to another

I have a problem with my data.
Some info for you:
I have two channels of data: Bx, the red one ('Canal 1'), and By, the blue one ('Canal 3'); both of them contain 266336 records. Both measurements are taken over 300 seconds. As a result of my plot I get a y-axis with the correct unit, which is picotesla, but the x-axis gives me the number of samples instead of time. Look:
plt.plot(Bx, label='Canal 1', color='r', linewidth=0.1, linestyle="-")
plt.plot(By, label='Canal 3', color='b', linewidth=0.1, linestyle="-")
About my code: I have managed to create a matrix which defines time:
dt = float(300)/266336
Fs = 1/dt
t = [0,300,dt*1e3]
My data matrix looks like this:
a = np.amin(data.data)
Bx = data.data[0,]
By = data.data[1,]
I know that of those 266336 records, 887.78409 occur in every second. But how do I do this? How do I tell Python that every second is occupied by 887.78409 samples?
UPDATE!
Using this code:
N = len(Bx)
time = np.linspace(0, 300, N)
plt.plot(time, Bx, ...)
I get this:
Looks like all you need to define your time is: np.linspace(0,300,266336).
This divides the [0, 300] interval into 266336 equal 'steps'.
N = len(Bx)
time = np.linspace(0, 300, N)
plt.plot(time, Bx, ...)
[mcve]:
import numpy as np
import matplotlib.pyplot as plt
Bx = np.random.rand(266336)
N = len(Bx)
time = np.linspace(0, 300, N) # or 300000
plt.figure(1).clf()
plt.plot(time, Bx)
If this (alone) doesn't work, then I'm clueless as to why, because it works for me. If it does work but your script doesn't, then find out what else you are doing in your script that messes up your figure display...
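A nearly equivalent alternative (my own small sketch, not from the answer above) is to build the time axis from the sampling interval the question already computed, dt = 300/266336 seconds per sample (roughly 887.8 samples per second):

import numpy as np

N = 266336
dt = 300.0 / N            # seconds per sample
time = np.arange(N) * dt  # 0, dt, 2*dt, ..., 300 - dt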

Mixture model fitting (Bimodal?) in SciPy using truncated normals. Python 3

I'm trying to follow this as an example but can't seem to adapt it to work with my dataset since I need truncated normals:
https://stackoverflow.com/questions/35990467/fit-two-gaussians-to-a-histogram-from-one-set-of-data-python#=
I have a dataset that is definitely a mixture of 2 truncated normals. The minimum value in the domain is 0 and the maximum is 1. I want to create an object that I can fit to optimize the parameters and get the likelihood of a sequence of numbers being drawn from that distribution. One option may be to just use the KDE model and use the pdf to get the likelihood. However, I want the exact means and standard deviations of the 2 distributions. I guess I could split the data in half and then model the 2 normals separately, but I also want to learn how to use optimize in SciPy. I'm just starting to experiment with this type of statistical analysis, so my apologies if this seems naive.
I'm not sure how to get a pdf this way that can integrate to 1 and have a domain constrained between 0 and 1.
import requests
from ast import literal_eval
from scipy import optimize, stats
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
# Actual Data
u = np.asarray(literal_eval(requests.get("https://pastebin.com/raw/hP5VJ9vr").text))
# u.size ==> 6000
u.min(), u.max()
# (1.3628525454666037e-08, 0.99973136607553781)
# Distribution
with plt.style.context("seaborn-white"):
    fig, ax = plt.subplots()
    sns.kdeplot(u, color="black", ax=ax)
    ax.axvline(0, linestyle=":", color="red")
    ax.axvline(1, linestyle=":", color="red")
kde = stats.gaussian_kde(u)
# KDE Model
def truncated_gaussian_lower(x, mu, sigma, A):
    return np.clip(A*np.exp(-(x-mu)**2/2/sigma**2), a_min=0, a_max=None)
def truncated_gaussian_upper(x, mu, sigma, A):
    return np.clip(A*np.exp(-(x-mu)**2/2/sigma**2), a_min=None, a_max=1)
def mixture_model(x, mu1, sigma1, A1, mu2, sigma2, A2):
    return truncated_gaussian_lower(x, mu1, sigma1, A1) + truncated_gaussian_upper(x, mu2, sigma2, A2)
kde = stats.gaussian_kde(u)
# Estimates:      mu  sigma  A
estimates = [0.1, 1, 3,
             0.9, 1, 1]
params, cov = optimize.curve_fit(mixture_model, u, kde.pdf(u), estimates)
# ---------------------------------------------------------------------------
# RuntimeError Traceback (most recent call last)
# <ipython-input-265-b2efb2ca0e0a> in <module>()
# 32 estimates= [0.1, 1, 3,
# 33 0.9, 1, 1]
# ---> 34 params,cov= optimize.curve_fit(mixture_model,u,kde.pdf(u),estimates )
# /Users/mu/anaconda/lib/python3.6/site-packages/scipy/optimize/minpack.py in curve_fit(f, xdata, ydata, p0, sigma, absolute_sigma, check_finite, bounds, method, jac, **kwargs)
# 738 cost = np.sum(infodict['fvec'] ** 2)
# 739 if ier not in [1, 2, 3, 4]:
# --> 740 raise RuntimeError("Optimal parameters not found: " + errmsg)
# 741 else:
# 742 # Rename maxfev (leastsq) to max_nfev (least_squares), if specified.
# RuntimeError: Optimal parameters not found: Number of calls to function has reached maxfev = 1400.
In response to @Uvar's very helpful explanation below: I am trying to test the integral from 0 to 1 to see if it equals 1, but I'm getting 0.3. I think I'm missing a crucial step in the logic:
# KDE Model
from scipy import integrate  # needed for integrate.quad below

def truncated_gaussian(x, mu, sigma, A):
    return A*np.exp(-(x-mu)**2/2/sigma**2)

def mixture_model(x, mu1, sigma1, A1, mu2, sigma2, A2):
    if type(x) == np.ndarray:
        norm_probas = truncated_gaussian(x, mu1, sigma1, A1) + truncated_gaussian(x, mu2, sigma2, A2)
        mask_lower = x < 0
        mask_upper = x > 1
        mask_floor = (mask_lower.astype(int) + mask_upper.astype(int)) > 1
        norm_probas[mask_floor] = 0
        return norm_probas
    else:
        if (x < 0) or (x > 1):
            return 0
        return truncated_gaussian_lower(x, mu1, sigma1, A1) + truncated_gaussian_upper(x, mu2, sigma2, A2)

kde = stats.gaussian_kde(u, bw_method=2e-2)
# # Estimates:    mu  sigma  A
estimates = [0.1, 1, 3,
             0.9, 1, 1]
params, cov = optimize.curve_fit(mixture_model, u, kde.pdf(u)/integrate.quad(kde, 0, 1)[0], estimates, maxfev=5000)
# params
# array([ 9.89751700e-01, 1.92831695e-02, 7.84324114e+00,
# 3.73623345e-03, 1.07754038e-02, 3.79238972e+01])
# Test the integral from 0 - 1
x = np.linspace(0,1,1000)
with plt.style.context("seaborn-white"):
    fig, ax = plt.subplots()
    ax.plot(x, kde(x), color="black", label="Data")
    ax.plot(x, mixture_model(x, *params), color="red", label="Model")
    ax.legend()
# Integrating from 0 to 1
integrate.quad(lambda x: mixture_model(x, *params), 0,1)[0]
# 0.3026863969781809
It seems you are misspecifying the fitting procedure.
You are trying to fit the kde.pdf(u) while constraining half-bounds.
foo = kde.pdf(u)
min(foo)
Out[329]: 0.22903365654960098
max(foo)
Out[330]: 4.0119283429320332
As you can see, the probability density function of u is not constrained to [0,1].
As such, just deleting the clipping action will result in an exact fit.
def truncated_gaussian_lower(x, mu, sigma, A):
    return A * np.exp((-(x-mu)**2)/(2*sigma**2))
def truncated_gaussian_upper(x, mu, sigma, A):
    return A * np.exp((-(x-mu)**2)/(2*sigma**2))
def mixture_model(x, mu1, sigma1, A1, mu2, sigma2, A2):
    return truncated_gaussian_lower(x, mu1, sigma1, A1) + truncated_gaussian_upper(x, mu2, sigma2, A2)
estimates = [0.15, 1, 3,
             0.95, 1, 1]
params, cov = optimize.curve_fit(f=mixture_model, xdata=u, ydata=kde.pdf(u), p0=estimates)
params
Out[327]:
array([ 0.00672248,  0.07462657,  4.01188383,  0.98006841,  0.07654998,
        1.30569665])
y3 = mixture_model(u, params[0], params[1], params[2], params[3], params[4], params[5])
plt.plot(kde.pdf(u)+0.1) #add offset for visual inspection purpose
plt.plot(y3)
So, let's now say I change what I am plotting to:
plt.figure(); plt.plot(u,y3,'.')
Because, indeed:
np.allclose(y3, kde(u), atol=1e-2)
>>True
You can edit the mixture model a bit to be 0 outside of the domain [0, 1]:
def mixture_model(x, mu1, sigma1, A1, mu2, sigma2, A2):
    if (x < 0) or (x > 1):
        return 0
    return truncated_gaussian_lower(x, mu1, sigma1, A1) + truncated_gaussian_upper(x, mu2, sigma2, A2)
Doing so, however, will lose the option of instantly evaluating the function over an array of x. So, for the sake of argument, I will leave it out for now.
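(A small vectorized variant, my own sketch rather than part of this answer: np.where keeps array evaluation while still zeroing the density outside [0, 1].)

def mixture_model_clipped(x, mu1, sigma1, A1, mu2, sigma2, A2):
    x = np.asarray(x, dtype=float)
    vals = (truncated_gaussian_lower(x, mu1, sigma1, A1)
            + truncated_gaussian_upper(x, mu2, sigma2, A2))
    # zero out everything outside the [0, 1] domain, elementwise
    return np.where((x < 0) | (x > 1), 0.0, vals)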
Anyway, we want our integral to sum up to 1 in the domain [0, 1], and one way to do this (feel free to play around with the bandwidth estimator in stats.gaussian_kde as well) is to divide the probability density estimate by its integral over the domain. Take care that optimize.curve_fit only allows 1400 function evaluations in this implementation, so the initial parameter estimates matter.
from scipy import integrate
sum_prob = integrate.quad(kde, 0 , 1)[0]
y = kde(u)/sum_prob
# Estimates: mu sigma A
estimates = [0.15, 1, 5,
             0.95, 0.5, 3]
params, cov = optimize.curve_fit(f=mixture_model, xdata=u, ydata=y, p0=estimates)
>> array([  6.72247814e-03,   7.46265651e-02,   7.23699661e+00,
            9.80068414e-01,   7.65499825e-02,   2.35533297e+00])
y3 = mixture_model(np.arange(0, 1, 0.001), params[0], params[1], params[2],
                   params[3], params[4], params[5])
with plt.style.context("seaborn-white"):
    fig, ax = plt.subplots()
    sns.kdeplot(u, color="black", ax=ax)
    ax.axvline(0, linestyle=":", color="red")
    ax.axvline(1, linestyle=":", color="red")
    plt.plot(np.arange(0, 1, 0.001), y3)  # the red line is now your custom pdf with area-under-curve = 0.998 in the domain
To check the area under the curve, I used this hacky solution of redefining mixture_model:
def mixture_model(x):
    mu1 = params[0]; sigma1 = params[1]; A1 = params[2]; mu2 = params[3]; sigma2 = params[4]; A2 = params[5]
    return truncated_gaussian_lower(x, mu1, sigma1, A1) + truncated_gaussian_upper(x, mu2, sigma2, A2)
from scipy import integrate
integrated_value, error = integrate.quad(mixture_model, 0, 1) #0 lower bound, 1 upper bound
>>(0.9978588016186962, 5.222293368393178e-14)
Or doing the integral a second way:
import sympy
x = sympy.symbols('x', real=True, nonnegative=True)
foo = sympy.integrate(params[2]*sympy.exp((-(x-params[0])**2)/(2*params[1]**2))+params[5]*sympy.exp((-(x-params[3])**2)/(2*params[4]**2)),(x,0,1), manual=True)
foo.doit()
>>0.562981541724715*sqrt(pi) #this evaluates to 0.9978588016186956
And actually doing it your way as described in your edited question:
def mixture_model(x, mu1, sigma1, A1, mu2, sigma2, A2):
    return truncated_gaussian_lower(x, mu1, sigma1, A1) + truncated_gaussian_upper(x, mu2, sigma2, A2)
integrate.quad(lambda x: mixture_model(x, *params), 0,1)[0]
>>0.9978588016186962
If I set my bandwidth to your level (2e-2), the evaluation indeed comes down to 0.92, which is a worse result than the 0.998 we had earlier, but that is still significantly different from the 0.3 you report, which is something I cannot recreate even while copying your code snippets. Do you perhaps accidentally redefine functions/variables somewhere?
