Wrong scipy fit even with good initial guess - python

The model to fit is the function
def func(x, b):
    return b*np.exp(-b*x)*(1.0 + b*x)/4.0
I know that b=0.1 is a good guess for my data:
0 0.1932332495855138
1 0.0283534527253836
2 0.0188036856033853
3 0.0567007258167565
4 0.0704161703188139
5 0.0276463443409273
6 0.0144509808494943
7 0.0188027609145469
8 0.0049573500626925
9 0.0064589075683206
10 0.0118522499082115
11 0.0087201376939245
12 0.0055855004231049
13 0.0110355379801288
14 0.0024829496736532
15 0.0050982312687186
16 0.0041032075307342
17 0.0063991465281368
18 0.0047195530453669
19 0.0028479431829209
20 0.0177577032522473
21 0.0082863863356967
22 0.0057720347102372
23 0.0053694769677398
24 0.017408417311084
25 0.0023307847797263
26 0.0014090741613788
27 0.0019007144648791
28 0.0043599058193019
29 0.004435997067249
30 0.0015569027316533
31 0.0016127575928092
32 0.00120222948697
33 0.0006851723909766
34 0.0014497504163
35 0.0014245210449107
36 0.0011375555693977
37 0.0007939973846594
38 0.0005707034948325
39 0.0007890519641431
40 0.0006274139241806
41 0.0005899624312505
42 0.0003989619799181
43 0.0002212632688891
44 0.0001465605806698
45 0.000188075040325
46 0.0002779076010181
47 0.0002941294723591
48 0.0001690581072228
49 0.0001448055157076
50 0.0002734759385405
51 0.0003228484365634
52 0.0002120441778252
53 0.0002383276583408
54 0.0002156310534404
55 0.0004499244488764
56 0.0001408465706883
57 0.000135998586104
58 0.00028706917157
59 0.0001788548683777
But whether I set p0=0.1 or p0=1.0, curve_fit returns almost the same fitted parameter in both cases: popt=[0.42992594] and popt=[0.42994105] respectively. Why doesn't curve_fit work in this case?
popt, pcov = curve_fit(func, xdata, ydata, p0=[0.1])

There's nothing too mysterious going on here. 0.4299... is just a better fit to the data, in the least-squares sense.
With b = 0.1, the first few points are not well fit at all. Least-squares heavily weights outliers, so the optimizer tries very hard to fit those better, even if it means doing slightly worse at other points. In other words, "most" points are fit "pretty well", and there is a very high penalty for fitting any point very badly (that's the "square" in least-squares).
Below is a plot of the data (blue) and your model function with b = 0.1 and b = 0.4299 in orange and green respectively. The value returned by curve_fit is better subjectively and objectively. Computing the MSE to the data in both cases gives about 0.18 using b = 0.1, and 0.13 using b = 0.4299.
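To check a claim like this yourself, compare the sum of squared residuals for each candidate b, since that is the quantity least squares minimizes. The sketch below is only illustrative: the helper sse is a hypothetical name, and xdata/ydata are synthetic stand-ins generated from the model, not the question's data. Load the two columns from the question into xdata and ydata to reproduce the comparison.

```python
import numpy as np

def func(x, b):
    # Model from the question.
    return b * np.exp(-b * x) * (1.0 + b * x) / 4.0

def sse(b, xdata, ydata):
    # Sum of squared residuals: the quantity least squares minimizes.
    residuals = ydata - func(xdata, b)
    return float(np.sum(residuals ** 2))

# Synthetic stand-in data (an assumption, not the question's data):
# the model evaluated at b = 0.4 on x = 0..59.
xdata = np.arange(60, dtype=float)
ydata = func(xdata, 0.4)

for b in (0.1, 0.4):
    print("b = %.1f  SSE = %.6g" % (b, sse(b, xdata, ydata)))
```

Whichever b gives the smaller value is the better fit in the least-squares sense, regardless of how the curves look by eye on the tail of the data.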


Calculate a prediction interval for a dataset Python

I have the following table:
0 59.98797
1 61.89383
2 61.08403
3 61.00661
4 62.64753
5 62.18118
6 60.74520
7 57.83964
8 62.09705
9 57.07985
10 58.62777
11 60.02589
12 58.74948
13 59.14136
14 58.37719
15 58.27401
16 59.67806
17 58.62855
18 58.45272
19 57.62186
20 58.64749
21 58.88152
22 54.80138
23 59.57697
24 60.26713
25 60.96022
26 55.59813
27 60.32104
28 57.95403
29 58.90658
30 53.72838
31 57.03986
32 58.14056
33 53.62257
34 57.08174
35 57.26881
36 48.80800
37 56.90632
38 59.08444
39 57.36432
The values are percentages.
I'd like to build a probability distribution from these percentages in order to get a prediction interval (say 95%) within which a new observation of this percentage would be expected to fall.
I initially did the following, but on testing with my sample data I remembered that a confidence interval captures the mean, not a new observation.
import scipy.stats as st
import numpy as np
# Get data in a list
lst = list(percDone['perc'])
# create 95% confidence interval
st.t.interval(alpha=0.95, df=len(lst)-1,
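For a single new observation rather than the mean, one common textbook approach widens the t-based interval by a factor sqrt(1 + 1/n). A minimal sketch, assuming the values are independent draws from a normal distribution (which may not hold if the percentages drift over time); the function name prediction_interval and the toy input are hypothetical:

```python
import numpy as np
import scipy.stats as st

def prediction_interval(values, level=0.95):
    # Interval expected to contain one new observation, assuming the
    # data are independent draws from a normal distribution.
    x = np.asarray(values, dtype=float)
    n = x.size
    mean = x.mean()
    s = x.std(ddof=1)                       # sample standard deviation
    t_crit = st.t.ppf((1.0 + level) / 2.0, df=n - 1)
    half_width = t_crit * s * np.sqrt(1.0 + 1.0 / n)
    return mean - half_width, mean + half_width

# Toy input, only to show the call.
lo, hi = prediction_interval([1.0, 2.0, 3.0, 4.0, 5.0])
print(lo, hi)
```

The confidence interval for the mean uses sqrt(1/n) in place of sqrt(1 + 1/n), which is why it is much narrower and does not describe where a new observation will land.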

Plotting of dot points based on np.where condition

I have a lot of data points in a CSV file that I am trying to visualize. I would like to read the CSV and look at the "result" column; where the value in that column is positive (I was trying to use np.where for this), I would like to plot the corresponding A B C D E F G parameters, with the parameter value on the y-axis and the parameter name on the x-axis (something like a dot/scatter plot). All the values should go on the same graph. Furthermore, if there are more than 20 matching points, I would like to plot only the first 20.
An example of the type of dataset is below. (Mine contains around 12000 rows)
A B C D E F G result
23 -54 36 27 98 39 80 -0.86
14 44 -16 47 28 29 26 1.65
67 84 26 67 -88 29 10 0.5
-45 14 76 37 68 59 90 0
24 34 56 27 38 79 48 -1.65
Any help in guiding for this would be appreciated !
From your question I assume that your data is a pandas dataframe. In this case you can do the selection with pandas and use its built-in plotting function:
df.loc[df.result>0, df.columns[:-1]].T.plot(ls='', marker='o')
If you want to plot the first 20 rows only, just add [:20] (or better .iloc[:20]) to df.loc.
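The selection step can be sketched on its own, using the five sample rows from the question. The helper name select_positive and its parameters are hypothetical; the plotting call is left as a comment so the snippet runs without matplotlib.

```python
import pandas as pd

def select_positive(df, result_col="result", max_rows=20):
    # Keep rows whose result is strictly positive, drop the result
    # column, and cap the selection at max_rows rows.
    params = df.loc[df[result_col] > 0].drop(columns=result_col)
    return params.iloc[:max_rows]

# The five sample rows from the question.
df = pd.DataFrame(
    [[23, -54, 36, 27, 98, 39, 80, -0.86],
     [14, 44, -16, 47, 28, 29, 26, 1.65],
     [67, 84, 26, 67, -88, 29, 10, 0.5],
     [-45, 14, 76, 37, 68, 59, 90, 0],
     [24, 34, 56, 27, 38, 79, 48, -1.65]],
    columns=list("ABCDEFG") + ["result"],
)

selected = select_positive(df)
print(selected)
# With matplotlib installed, one series per selected row, markers only:
# selected.T.plot(ls='', marker='o')
```

Applying .iloc after the boolean filter means the cap counts matching rows, not the first 20 rows of the file.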

get rid of tiny fraction in bar plot scale (Pandas/Python)

When I plot a bar plot (of histogram counts) using pd.cut, I get a funny (and very annoying!) 0.001 added to the leftmost axis label, making it start from -1.001 instead of -1. How can I get rid of this? (Please see the figure.)
My code is:
out_i = pd.cut(df, bins=np.arange(-1,1.2,0.2), include_lowest=True)
out_i.value_counts(sort=False).plot.bar(rot=45, figsize=(6,6))
with df:
0 -0.402203
1 -0.019031
2 -0.979292
3 -0.701221
4 -0.267261
5 -0.563602
7 -0.454961
8 0.632456
9 -0.843081
10 -0.629253
11 -0.946188
12 -0.628178
13 -0.776933
14 -0.717091
15 -0.392144
16 -0.799408
17 -0.897951
18 0.255321
19 -0.641854
20 -0.356393
21 -0.507321
22 -0.698238
23 -0.985097
25 -0.661444
26 -0.751593
27 -0.437505
28 -0.413451
29 -0.798745
30 -0.736440
31 -0.672727
32 -0.807688
33 -0.087085
34 -0.393203
35 -0.979730
36 -0.902951
37 -0.454231
38 -0.561951
39 -0.388580
40 -0.706501
41 -0.408248
42 -0.377235
43 -0.283110
44 -0.517428
45 -0.949603
46 -0.268667
47 -0.376199
48 -0.472293
49 -0.211781
50 -0.921520
51 -0.345870
53 -0.542487
55 -0.597996
In case it is acceptable to chop off the decimal points of the intervals, generate a custom list of interval labels and set this as the xticklabels of the plot:
out_i = pd.cut(df['a'], bins=np.arange(-1,1.2,0.2), include_lowest=True)
intervals = out_i.cat.categories
labels = ['(%.1f, %.1f]' % (int(interval.left*100)/100, interval.right) for interval in intervals]
ax = out_i.value_counts(sort=False).plot.bar(rot=45, figsize=(6,6))
ax.set_xticklabels(labels)
Which results in the following plot:
Note: this will always output a half-closed interval (a,b]. It can be improved by making the brackets dynamic as per the parameters of pd.cut.
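One way the brackets could be made dynamic is to read the closed side from the IntervalIndex that pd.cut produces. This is only a sketch: interval_labels is a hypothetical helper, the sample values are made up, and np.linspace is used for the bin edges instead of np.arange as an assumption to keep the edges well defined.

```python
import numpy as np
import pandas as pd

def interval_labels(intervals, include_lowest=False, fmt="%.1f"):
    # Build one label per interval, choosing brackets from the closed
    # side reported by the IntervalIndex.
    left_br = "[" if intervals.closed in ("left", "both") else "("
    right_br = "]" if intervals.closed in ("right", "both") else ")"
    labels = []
    for k, iv in enumerate(intervals):
        # With include_lowest=True, pd.cut nudges the first left edge
        # down so the lowest value is included; show it as closed.
        lb = "[" if (k == 0 and include_lowest) else left_br
        labels.append(lb + fmt % iv.left + ", " + fmt % iv.right + right_br)
    return labels

values = pd.Series([-1.0, -0.75, -0.3, 0.1, 0.6])   # made-up sample
bins = np.linspace(-1, 1, 11)
out = pd.cut(values, bins=bins, include_lowest=True)
labels = interval_labels(out.cat.categories, include_lowest=True)
print(labels[:2])
```

Formatting with one decimal place also hides the -1.001 edge, so no manual truncation is needed in this variant.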

Python/Pandas Select Columns based on Best Value Distribution

I have a dataframe (df) in pandas/python with columns ['Product','OrderDate','Sales'].
I noticed that for some rows the values have a better distribution (as in a histogram) than for others. By "better" I mean that the values are more spread out, so the shape looks wider than for other rows.
If I want to pick, from 700+ products, those with the most spread-out values, is there an easy way to do that in pandas/python?
Thanks in advance.
Caveat: I'm not a stats expert, but scipy has a number of tests you can run on your data to check whether it could be considered normally (Gaussian) distributed.
Here I create 2 series: one is simply a linear range and the other is a random normal sample with mean set to 50 and standard deviation set to 25.
In [48]:
import numpy as np
import pandas as pd
import scipy.stats as stats
df = pd.DataFrame({'linear':np.arange(100), 'normal':np.random.normal(50, 25, 100)})
linear normal
0 0 66.565374
1 1 63.453899
2 2 65.736406
3 3 65.848908
4 4 56.916032
5 5 93.870682
6 6 89.513998
7 7 9.949555
8 8 9.727099
9 9 47.072785
10 10 62.849321
11 11 33.263309
12 12 42.168484
13 13 38.488933
14 14 51.833459
15 15 54.911915
16 16 62.372709
17 17 96.928452
18 18 65.333546
19 19 26.341462
20 20 41.692790
21 21 22.852561
22 22 15.799415
23 23 50.600141
24 24 14.234088
25 25 72.428607
26 26 45.872601
27 27 80.783253
28 28 29.561586
29 29 51.261099
.. ... ...
70 70 32.826052
71 71 35.413106
72 72 49.415386
73 73 28.998378
74 74 32.237667
75 75 86.622402
76 76 105.098296
77 77 53.176413
78 78 -7.954881
79 79 60.313761
80 80 42.739641
81 81 56.667834
82 82 68.046688
83 83 72.189683
84 84 67.125708
85 85 24.798553
86 86 58.845761
87 87 54.559792
88 88 93.116777
89 89 30.209895
90 90 80.952444
91 91 57.895433
92 92 47.392336
93 93 13.136111
94 94 26.624532
95 95 53.461421
96 96 28.782809
97 97 16.342756
98 98 64.768579
99 99 68.410021
[100 rows x 2 columns]
From this page, there are a number of tests we can use that are combined to form the normaltest, namely the skewtest and kurtosistest. I cannot explain these, but you can see that the p-value is poor for the linear series and relatively closer to 1 for the normal data:
In [49]:
print('linear skewtest teststat = %6.3f pvalue = %6.4f' % stats.skewtest(df['linear']))
print('normal skewtest teststat = %6.3f pvalue = %6.4f' % stats.skewtest(df['normal']))
print('linear kurtosis teststat = %6.3f pvalue = %6.4f' % stats.kurtosistest(df['linear']))
print('normal kurtosis teststat = %6.3f pvalue = %6.4f' % stats.kurtosistest(df['normal']))
print('linear normaltest teststat = %6.3f pvalue = %6.4f' % stats.normaltest(df['linear']))
print('normal normaltest teststat = %6.3f pvalue = %6.4f' % stats.normaltest(df['normal']))
linear skewtest teststat = 1.022 pvalue = 0.3070
normal skewtest teststat = -0.170 pvalue = 0.8652
linear kurtosis teststat = -5.799 pvalue = 0.0000
normal kurtosis teststat = -1.113 pvalue = 0.2656
linear normaltest teststat = 34.674 pvalue = 0.0000
normal normaltest teststat = 1.268 pvalue = 0.5304
From the scipy site:
When testing for normality of a small sample of t-distributed
observations and a large sample of normal distributed observation,
then in neither case can we reject the null hypothesis that the sample
comes from a normal distribution. In the first case this is because
the test is not powerful enough to distinguish a t and a normally
distributed random variable in a small sample.
So you'll have to try the above and see if it fits with what you want, hope this helps.
Sure. What you'd like to do here is find the 700 entries with the largest standard deviation.
pandas.DataFrame.std() will return the standard deviation for an axis, and then you just need to keep track of the entries with the highest corresponding values.
Large Standard Deviation vs. Small Standard Deviation
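As a rough sketch of that idea, the spread can be computed per product with groupby and the widest ones picked with nlargest. The column names come from the question; the function name most_spread_products and the sample rows are made up for illustration.

```python
import pandas as pd

def most_spread_products(df, n=10):
    # Standard deviation of Sales within each Product, largest first.
    spread = df.groupby("Product")["Sales"].std()
    return spread.nlargest(n)

# Made-up rows, only to show the call.
df = pd.DataFrame({
    "Product": ["a", "a", "a", "b", "b", "b", "c", "c", "c"],
    "Sales":   [10, 50, 90, 40, 50, 60, 49, 50, 51],
})
top = most_spread_products(df, n=2)
print(top)
```

Standard deviation depends on the scale of the sales figures, so products with very different sales volumes may need a relative measure instead; that choice is outside what the answer above states.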

Python computing error

I'm using the mpmath library to compute the following sequence.
Consider the sequence u_0, u_1, u_2, ... defined by:
u_0 = 3/2 = 1.5
u_1 = 5/3 = 1.6666666…
u_{n+1} = 2003 - 6002/u_n + 4000/(u_n u_{n-1})
The sequence converges to 2, but because of rounding problems it seems to converge to 2000.
n     Calculated value    Rounded-off exact value
2     1.800001            1.800000000
3     1.890000            1.888888889
4     3.116924            1.941176471
5     756.3870306         1.969696970
6     1996.761549         1.984615385
7     1999.996781         1.992248062
8     1999.999997         1.996108949
9     2000.000000         1.998050682
10    2000.000000         1.999024390
My code:
from mpmath import *
mp.dps = 50
u = [mpf(3.0/2.0), mpf(5.0/3.0)]
for i in range(2, 11):
    u.append(2003 - 6002/u[i-1] + 4000/(u[i-1]*u[i-2]))
    print(u[i])
My bad results:
I tried some other functions (fdiv…) and tried changing the precision: same bad result.
What's wrong with this code?
How can I change my code to get the value 2.0 with the formula
u_{n+1} = 2003 - 6002/u_n + 4000/(u_n u_{n-1})
Using the decimal module, you can see the series also has a solution converging at 2000:
from decimal import Decimal, getcontext
getcontext().prec = 100
u0=Decimal(3) / Decimal(2)
u1=Decimal(5) / Decimal(3)
u=[u0, u1]
for i in range(100):
    un1 = 2003 - 6002/u[-1] + 4000/(u[-1]*u[-2])
    u.append(un1)
    print(un1)
The recurrence relation has multiple fixed points (one at 2 and the other at 2000):
>>> u = [Decimal(2), Decimal(2)]
>>> 2003 - 6002/u[-1] + 4000/(u[-1]*u[-2])
>>> u = [Decimal(2000), Decimal(2000)]
>>> 2003 - 6002/u[-1] + 4000/(u[-1]*u[-2])
The solution at 2 is an unstable fixed-point. The attractive fixed-point is at 2000.
The convergence gets very close to two and when the round-off causes the value to slightly exceed two, that difference gets amplified again and again until hitting 2000.
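The effect of round-off on this repelling fixed point can be seen by running the same recurrence once with exact fractions and once with ordinary floats. This is a small illustrative sketch; the helper name iterate and the step count of 40 are arbitrary choices, not part of the original answer.

```python
from fractions import Fraction

def iterate(u0, u1, steps):
    # Apply u_{n+1} = 2003 - 6002/u_n + 4000/(u_n*u_{n-1}) `steps` times.
    prev, cur = u0, u1
    for _ in range(steps):
        prev, cur = cur, 2003 - 6002 / cur + 4000 / (cur * prev)
    return cur

exact = iterate(Fraction(3, 2), Fraction(5, 3), 40)   # no rounding at all
rounded = iterate(1.5, 5.0 / 3.0, 40)                 # 53-bit floats

print(float(exact))   # close to 2
print(rounded)        # close to 2000
```

The only difference between the two runs is the number type, which is consistent with the explanation that any rounding error is amplified until the iteration lands on the attracting fixed point.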
Your (non-linear) recurrence sequence has three fixed points: 1, 2 and 2000. The values 1 and 2 are close to each other compared to 2000, which is usually an indication of unstable fixed points because they are "almost" double roots.
You need to do some maths to delay the divergence. Let v(n) be a side sequence:
v(n) = (1+2^n) u(n)
The following holds true:
v(n+1) = (1+2^(n+1)) * (2003 v(n) v(n-1) - 6002 (1+2^n) v(n-1) + 4000 (1+2^n)(1+2^(n-1))) / (v(n) v(n-1))
You can then simply compute v(n) and deduce u(n) from u(n) = v(n)/(1+2^n):
#!/usr/bin/env python
from mpmath import *
mp.dps = 50
v0 = mpf(3)
v1 = mpf(5)
v = [v0, v1]
u = [v0/2, v1/3]
for i in range(2, 25):
    vn1 = (1+2**i) * (2003*v[i-1]*v[i-2] \
                      - 6002*(1+2**(i-1))*v[i-2] \
                      + 4000*(1+2**(i-1))*(1+2**(i-2))) \
          / (v[i-1]*v[i-2])
    v.append(vn1)
    # recover u(n) from v(n) = (1+2^n) u(n)
    u.append(vn1 / (1+2**i))
print(u)
And the result:
Note that this will still diverge eventually. In order to really converge, you need to compute v(n) with arbitrary precision. But this is now a lot easier since all the values are integers.
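Since the v(n) values are integers for these starting values, one way to read that remark is to carry out the whole computation with Python's unbounded integers, so no rounding ever occurs. The sketch below is an interpretation, not the answerer's code; v_sequence is a hypothetical name, and the assert inside the loop checks that each division really is exact.

```python
from fractions import Fraction

def v_sequence(count):
    # v(n) = (1 + 2^n) * u(n), computed with Python integers only.
    v = [3, 5]
    for i in range(2, count):
        a, b = 1 + 2 ** (i - 1), 1 + 2 ** (i - 2)
        numerator = (1 + 2 ** i) * (2003 * v[i - 1] * v[i - 2]
                                    - 6002 * a * v[i - 2]
                                    + 4000 * a * b)
        quotient, remainder = divmod(numerator, v[i - 1] * v[i - 2])
        assert remainder == 0   # division is exact for these start values
        v.append(quotient)
    return v

v = v_sequence(60)
u = [Fraction(vn, 1 + 2 ** n) for n, vn in enumerate(v)]
print(float(u[-1]))
```

With exact integers the sequence stays on the branch that tends to 2 for as many terms as you care to compute.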
You calculate your initial values to 53 bits of precision and then assign that rounded value to the high-precision mpf variable. You should use u0=mpf(3)/mpf(2) and u1=mpf(5)/mpf(3). You'll stay close to 2 for a few more iterations, but you'll still end up converging to 2000. This is due to rounding error. One alternative is to compute with fractions. I used gmpy, and the following code converges to 2.
from __future__ import print_function
import gmpy
u = [gmpy.mpq(3,2), gmpy.mpq(5,3)]
for i in range(2,300):
    temp = (2003 - 6002/u[-1] + 4000/(u[-1]*u[-2]))
    u.append(temp)
for i in u: print(gmpy.mpf(i,300))
If you compute with infinite precision then you get 2 otherwise you get 2000:
import itertools
from fractions import Fraction

def series(u0=Fraction(3, 2), u1=Fraction(5, 3)):
    yield u0
    yield u1
    while u0 != u1:
        un = 2003 - 6002/u1 + 4000/(u1*u0)
        yield un
        u1, u0 = un, u1

for i, u in enumerate(itertools.islice(series(), 100)):
    err = (2-u)/2 # relative error
    print("%d\t%.2g" % (i, err))
0 0.25
1 0.17
2 0.1
3 0.056
4 0.029
5 0.015
6 0.0077
7 0.0039
8 0.0019
9 0.00097
10 0.00049
11 0.00024
12 0.00012
13 6.1e-05
14 3.1e-05
15 1.5e-05
16 7.6e-06
17 3.8e-06
18 1.9e-06
19 9.5e-07
20 4.8e-07
21 2.4e-07
22 1.2e-07
23 6e-08
24 3e-08
25 1.5e-08
26 7.5e-09
27 3.7e-09
28 1.9e-09
29 9.3e-10
30 4.7e-10
31 2.3e-10
32 1.2e-10
33 5.8e-11
34 2.9e-11
35 1.5e-11
36 7.3e-12
37 3.6e-12
38 1.8e-12
39 9.1e-13
40 4.5e-13
41 2.3e-13
42 1.1e-13
43 5.7e-14
44 2.8e-14
45 1.4e-14
46 7.1e-15
47 3.6e-15
48 1.8e-15
49 8.9e-16
50 4.4e-16
51 2.2e-16
52 1.1e-16
53 5.6e-17
54 2.8e-17
55 1.4e-17
56 6.9e-18
57 3.5e-18
58 1.7e-18
59 8.7e-19
60 4.3e-19
61 2.2e-19
62 1.1e-19
63 5.4e-20
64 2.7e-20
65 1.4e-20
66 6.8e-21
67 3.4e-21
68 1.7e-21
69 8.5e-22
70 4.2e-22
71 2.1e-22
72 1.1e-22
73 5.3e-23
74 2.6e-23
75 1.3e-23
76 6.6e-24
77 3.3e-24
78 1.7e-24
79 8.3e-25
80 4.1e-25
81 2.1e-25
82 1e-25
83 5.2e-26
84 2.6e-26
85 1.3e-26
86 6.5e-27
87 3.2e-27
88 1.6e-27
89 8.1e-28
90 4e-28
91 2e-28
92 1e-28
93 5e-29
94 2.5e-29
95 1.3e-29
96 6.3e-30
97 3.2e-30
98 1.6e-30
99 7.9e-31
Well, as casevh said, I just wrapped the initial terms in mpf in my code:
and the values approach the correct limit 2.0 for 16 steps before diverging again (see below).
So, even with a good Python library for arbitrary-precision floating-point arithmetic and only basic operations, the result can become totally wrong, and it is not an algorithmic, mathematical or recurrence problem as I sometimes read.
So it is necessary to remain watchful and critical! (I'm very wary of the mpmath.lerchphi(z, s, a) function ;-)
2 1.8000000000000000000000000000000000000000000000022
3 1.8888888888888888888888888888888888888888888913205
4 1.9411764705882352941176470588235294117647084569125
5 1.9696969696969696969696969696969696969723495083846
6 1.9846153846153846153846153846153846180779422496889
7 1.992248062015503875968992248062018218070968279944
8 1.9961089494163424124513618677070049064461141667961
9 1.998050682261208576998050684991268132991329645551
10 1.9990243902439024390243929766241359876402781522945
11 1.9995119570522205954151303455889283862002420414092
12 1.9997559189650964147435086295745928366095548127257
13 1.9998779445868451615169464386495752584786229236677
14 1.9999389685715481608370784691478769380770569091713
15 1.9999694860884747554701272066241108169217231319376
16 1.9999874767910784720428384947047783821702386000249
17 2.0027277350948824117795762659330557916802871427763
18 4.7316350177463946015607576536159982430500337286276
19 1156.6278675611076227796014310764287933259776352198
20 1998.5416721291457644804673979070312813731252347786
21 1999.998540608689366669273522363692463645090555294
22 1999.9999985406079725746311606572627439743947878652
The exact solution to your recurrence relation (with initial values u_0 = 3/2, u_1 = 5/3) is easily verified to be
u_n = (2^(n+1) + 1) / (2^n + 1). (*)
The problem you're seeing is that although the solution is such that
lim_{n -> oo} u_n = 2,
this limit is a repelling fixed point of your recurrence relation. That is, any departure from the correct values of u_{n-1}, u_{n-2}, for some n, will result in further values diverging from the correct limit. Consequently, unless your implementation of the recurrence relation represents every u_n value exactly, it can be expected to eventually diverge from the correct limit and converge to the incorrect value of 2000, which just happens to be the only attracting fixed point of your recurrence relation.
(*) In fact, u_n = (2^(n+1) + 1) / (2^n + 1) is the solution to any recurrence relation of the form
u_n = C + (7 - 3C)/u_{n-1} + (2C - 6)/(u_{n-1} u_{n-2})
with the same initial values as given above, where C is an arbitrary constant. If I haven't made a mistake finding the roots of the characteristic polynomial, this will have the set of fixed points {1, 2, C - 3}\{0}. The limit 2 can be either a repelling fixed point or an attracting fixed point, depending on the value of C. E.g., for C = 2003 the set of fixed points is {1, 2, 2000} with 2 being a repellor, whereas for C = 3 the fixed points are {1, 2} with 2 being an attractor.
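The claim that the closed form solves the whole family of recurrences can be spot-checked with exact rational arithmetic. This is a sketch for a few sample values of C (3, 10 and 2003 are arbitrary picks), not a proof; the function names are hypothetical.

```python
from fractions import Fraction

def closed_form(n):
    # u_n = (2^(n+1) + 1) / (2^n + 1)
    return Fraction(2 ** (n + 1) + 1, 2 ** n + 1)

def recurrence(C, count):
    # u_n = C + (7 - 3C)/u_{n-1} + (2C - 6)/(u_{n-1} u_{n-2}), exact.
    u = [Fraction(3, 2), Fraction(5, 3)]
    while len(u) < count:
        u.append(C + (7 - 3 * C) / u[-1] + (2 * C - 6) / (u[-1] * u[-2]))
    return u

for C in (3, 10, 2003):
    u = recurrence(C, 30)
    print(C, all(u[n] == closed_form(n) for n in range(30)))
```

Because Fraction never rounds, the check is independent of whether 2 is attracting or repelling for the chosen C.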
