logarithmic scale in Python makes the yscale disappear

I am trying to plot these values on a logarithmic scale.
I set the xscale to logarithmic and it works, but once I also change the yscale to logarithmic, the plot comes out empty:
1 1.3983122938704392e-24
2 1.2825051378519808e-24
3 5.5230816485933455e-25
4 1.6920941401196186e-25
5 4.2785540109054585e-26
6 9.573610853721047e-27
7 1.9686256356892187e-27
8 3.805287950338823e-28
9 7.016068861317399e-29
10 1.2462784636632535e-29
11 2.1480378257342412e-30
12 3.6112402884779174e-31
13 5.945542463839695e-32
14 9.615996905704895e-33
15 1.5315555893729047e-33
16 2.4069535816407215e-34
17 3.7385510098146993e-35
18 5.7467571042464316e-36
19 8.75215314794321e-37
20 1.3218825579234095e-37
21 1.981570594514771e-38
22 2.950315675351239e-39
23 4.365487327673689e-40
24 6.422908401651751e-41
25 9.400843198676917e-42
26 1.3693535952403045e-42
27 1.985800585870847e-43
28 2.8679212082869596e-44
29 4.126063379717258e-45
30 5.9149958576112206e-46
31 8.451319914553198e-47
32 1.2037551958686832e-47
33 1.7095383153201586e-48
34 2.4211547577096363e-49
35 3.4200992669649083e-50
36 4.819376551228057e-51
37 6.775436972956835e-52
38 9.504553221721265e-53
39 1.3305264659826585e-53
using this code:
plt.scatter(A, B)
plt.xscale('log')
plt.ylim('log')
plt.show()

I think it is because you used 'ylim' instead of 'yscale' on the third line of your code.
You can try this:
plt.scatter(A, B)
plt.xscale('log')
plt.yscale('log')
plt.show()
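For completeness, here is a self-contained sketch of the corrected calls using matplotlib's object-oriented interface; the arrays below are a synthetic stand-in for the values listed in the question, not the real data.
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in for the question's data: x = 1..39, y spanning many
# orders of magnitude (the real values are listed in the question).
A = np.arange(1, 40)
B = 1e-24 * np.exp(-0.7 * (A - 1))

fig, ax = plt.subplots()
ax.scatter(A, B)
ax.set_xscale('log')
ax.set_yscale('log')   # set_yscale, not set_ylim
plt.show()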


Calculate a prediction interval for a dataset Python

I have the following table:
perc
0 59.98797
1 61.89383
2 61.08403
3 61.00661
4 62.64753
5 62.18118
6 60.74520
7 57.83964
8 62.09705
9 57.07985
10 58.62777
11 60.02589
12 58.74948
13 59.14136
14 58.37719
15 58.27401
16 59.67806
17 58.62855
18 58.45272
19 57.62186
20 58.64749
21 58.88152
22 54.80138
23 59.57697
24 60.26713
25 60.96022
26 55.59813
27 60.32104
28 57.95403
29 58.90658
30 53.72838
31 57.03986
32 58.14056
33 53.62257
34 57.08174
35 57.26881
36 48.80800
37 56.90632
38 59.08444
39 57.36432
consisting of various percentages.
I'm interested in creating a probability distribution based on these percentages so that I can come up with a prediction interval (say 95%) within which we would expect a new observation of this percentage to fall.
I initially did the following, but upon testing it with my sample data I remembered that confidence intervals capture the mean, not a new observation.
import scipy.stats as st
import numpy as np

# Get data in a list
lst = list(percDone['perc'])

# create 95% confidence interval
st.t.interval(alpha=0.95, df=len(lst)-1,
              loc=np.mean(lst),
              scale=st.sem(lst))
Thanks!
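For reference, here is a hedged sketch of one common adjustment: under a rough normality assumption, a prediction interval for a single new observation widens the usual t interval by a factor of sqrt(1 + 1/n). This is only a sketch against the percDone['perc'] column above, not a definitive answer.
import numpy as np
import scipy.stats as st

# Sketch: t-based 95% prediction interval for one new observation,
# assuming the 'perc' values are approximately normal.
lst = np.asarray(percDone['perc'])   # data frame from the question
n = len(lst)
mean = lst.mean()
sd = lst.std(ddof=1)                 # sample standard deviation

t_crit = st.t.ppf(0.975, df=n - 1)
half_width = t_crit * sd * np.sqrt(1 + 1/n)   # note the extra "1 +" vs. a CI
print(mean - half_width, mean + half_width)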

how to sequentially assign two numbers in an array?

I am trying to pair numbers diagonally in a matrix according to a certain procedure.
First, the first number in the penultimate row is paired with the second number in the last row; then the first number in the row above is paired with the second number in the penultimate row, and so on. This sequence is shown in the examples below. The matrix does not always have the same size.
Example
a=np.array([[11,12,13],
[21,22,23],
[31,32,33]])
required output:
21 32
11 22
11 33
22 33
12 23
or
a=np.array([[11,12,13,14],
[21,22,23,24],
[31,32,33,34],
[41,42,43,44]])
required output:
31 42
21 32
21 43
32 43
11 22
11 33
11 44
22 33
22 44
12 23
12 34
23 34
13 24
Is it possible?
Here's an iterative solution, assuming a square matrix. Modifying this for non-square matrices shouldn't be hard.
import numpy as np

a = np.array([[11,12,13,14],
              [21,22,23,24],
              [31,32,33,34],
              [41,42,43,44]])

w, h = a.shape

# diagonals starting in the first column, working upwards from the
# second-to-last row (this includes the main diagonal)
for y0 in range(1, h):
    y = h - y0 - 1
    for x in range(h - y - 1):
        print(a[y + x, x], a[y + x + 1, x + 1])

# diagonals starting in the first row, to the right of the main diagonal
for x in range(1, w - 1):
    for y in range(w - x - 1):
        print(a[y, x + y], a[y + 1, x + y + 1])
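If non-adjacent pairs along a diagonal are also wanted (the expected output above includes pairs such as 21 43 and 11 33), a hedged alternative sketch is to walk each diagonal with np.diagonal and take every two-element combination:
from itertools import combinations
import numpy as np

a = np.array([[11, 12, 13, 14],
              [21, 22, 23, 24],
              [31, 32, 33, 34],
              [41, 42, 43, 44]])

h, w = a.shape
# offsets run from the lower-left diagonal of length 2 up to the upper-right one
for offset in range(-(h - 2), w - 1):
    diag = np.diagonal(a, offset=offset)
    for p, q in combinations(diag, 2):
        print(p, q)
For the 4x4 example this also prints 33 44, which the expected output above omits, so take it as a sketch of the idea rather than an exact reproduction.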

Wrong scipy fit even with good initial guess

The model to fit is the equation
def func(x, b):
return b*np.exp(-b*x)*(1.0 + b*x)/4.0
I know that b = 0.1 is a good guess for my data:
0 0.1932332495855138
1 0.0283534527253836
2 0.0188036856033853
3 0.0567007258167565
4 0.0704161703188139
5 0.0276463443409273
6 0.0144509808494943
7 0.0188027609145469
8 0.0049573500626925
9 0.0064589075683206
10 0.0118522499082115
11 0.0087201376939245
12 0.0055855004231049
13 0.0110355379801288
14 0.0024829496736532
15 0.0050982312687186
16 0.0041032075307342
17 0.0063991465281368
18 0.0047195530453669
19 0.0028479431829209
20 0.0177577032522473
21 0.0082863863356967
22 0.0057720347102372
23 0.0053694769677398
24 0.017408417311084
25 0.0023307847797263
26 0.0014090741613788
27 0.0019007144648791
28 0.0043599058193019
29 0.004435997067249
30 0.0015569027316533
31 0.0016127575928092
32 0.00120222948697
33 0.0006851723909766
34 0.0014497504163
35 0.0014245210449107
36 0.0011375555693977
37 0.0007939973846594
38 0.0005707034948325
39 0.0007890519641431
40 0.0006274139241806
41 0.0005899624312505
42 0.0003989619799181
43 0.0002212632688891
44 0.0001465605806698
45 0.000188075040325
46 0.0002779076010181
47 0.0002941294723591
48 0.0001690581072228
49 0.0001448055157076
50 0.0002734759385405
51 0.0003228484365634
52 0.0002120441778252
53 0.0002383276583408
54 0.0002156310534404
55 0.0004499244488764
56 0.0001408465706883
57 0.000135998586104
58 0.00028706917157
59 0.0001788548683777
It doesn't matter whether I set p0=0.1 or p0=1.0; in both cases Python reports the fitted parameter as popt=[0.42992594] and popt=[0.42994105] respectively, which is almost the same value. Why doesn't the curve_fit function work in this case?
popt, pcov = curve_fit(func, xdata, ydata, p0=[0.1])
There's nothing too mysterious going on here. 0.4299... is just a better fit to the data, in the least-squares sense.
With b = 0.1, the first few points are not well fit at all. Least-squares heavily weights outliers, so the optimizer tries very hard to fit those better, even if it means doing slightly worse at other points. In other words, "most" points are fit "pretty well", and there is a very high penalty for fitting any point very badly (that's the "square" in least-squares).
Below is a plot of the data (blue) and your model function with b = 0.1 and b = 0.4299 in orange and green respectively. The value returned by curve_fit is better subjectively and objectively. Computing the MSE to the data in both cases gives about 0.18 using b = 0.1, and 0.13 using b = 0.4299.
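To see this numerically, here is a minimal sketch comparing the sum of squared residuals for b = 0.1 and for the value curve_fit returns; it assumes the two columns above have been saved to a hypothetical text file data.txt (x in the first column, y in the second).
import numpy as np
from scipy.optimize import curve_fit

def func(x, b):
    return b * np.exp(-b * x) * (1.0 + b * x) / 4.0

# hypothetical file holding the x/y columns listed in the question
xdata, ydata = np.loadtxt("data.txt", unpack=True)

popt, _ = curve_fit(func, xdata, ydata, p0=[0.1])

for b in (0.1, popt[0]):
    sse = np.sum((ydata - func(xdata, b)) ** 2)
    print("b = %.4f, sum of squared residuals = %.4g" % (b, sse))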

Given a discrete distribution, how do I round a number to the closest value in that distribution?

What I ultimately want to do is round the expected value of a discrete random variable distribution to a valid number in the distribution. For example, if I am drawing evenly from the numbers [1, 5, 6], the expected value is 4, but I want to return the closest number to that (i.e., 5).
import numpy as np
from scipy.stats import rv_discrete

xk = (1, 5, 6)
pk = np.ones(len(xk)) / len(xk)
custom = rv_discrete(name='custom', values=(xk, pk))
print(custom.expect())
# 4.0

def round_discrete(discrete_rv_dist, val):
    # do something here
    return answer

print(round_discrete(custom, custom.expect()))
# 5.0
I don't know a priori what distribution will be used (i.e., it might not be integers, and it might be an unbounded distribution), so I'm really struggling to think of an algorithm that is sufficiently generic. Edit: I just learned that rv_discrete doesn't work on non-integer xk values.
As to why I want to do this, I'm putting together a Monte Carlo simulation and want a "nominal" value for each distribution. I think the expected value is the most physically appropriate choice, rather than the mode or median. I might have values in the downstream simulation that have to be one of several discrete choices, so passing a value that is not within that set is not acceptable.
If there's already a nice way to do this in Python that would be great, otherwise I can interpret math into code.
Here is R code that I think will do what you want, using Poisson data to illustrate:
set.seed(322)
x = rpois(100, 7) # 100 obs from POIS(7)
a = mean(x); a
[1] 7.16 # so 7 is the value we want
d = min(abs(x-a)); d # min distance btw a and actual Pois val
[1] 0.16
u = unique(x); u # unique Pois values observed
[1] 7 5 4 10 2 9 8 6 11 3 13 14 12 15
v = u[abs(u-a)==d]; v # unique val closest to a
[1] 7
Hope you can translate it to Python.
Another run:
set.seed(323)
x = rpois(100, 20)
a = mean(x); a
[1] 20.32
d = min(abs(x-a)); d
[1] 0.32
u = unique(x)
v = u[abs(u-a)==d]; v
[1] 20
x
[1] 17 16 20 23 23 20 19 23 21 19 21 20 22 25 13 15 19 19 14 27 19 30 17 19 23
[26] 16 23 26 33 16 11 23 14 21 24 12 18 20 20 19 26 12 22 24 20 22 17 23 11 19
[51] 19 26 17 17 11 17 23 21 26 13 18 28 22 14 17 25 28 24 16 15 25 26 22 15 23
[76] 27 19 21 17 23 21 24 23 22 23 18 25 14 24 25 19 19 21 22 16 28 18 11 25 23
u
[1] 17 16 20 23 19 21 22 25 13 15 14 27 30 26 33 11 24 12 18 28
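A rough Python translation of the same idea (a sketch, not a drop-in answer): take the unique observed values and pick the one closest to the mean.
import numpy as np

rng = np.random.default_rng(322)   # arbitrary seed, not equivalent to R's
x = rng.poisson(7, size=100)       # 100 obs from Pois(7)

a = x.mean()                       # target value (here, the sample mean)
u = np.unique(x)                   # unique observed values
v = u[np.argmin(np.abs(u - a))]    # unique value closest to a
print(a, v)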
Figured it out, and tested that it works. If I plug my value X into the cdf, then I can plug that probability P = cdf(X) into the ppf. The values at ppf(P ± epsilon) will give me the closest values in the set to X.
Or more geometrically: for a discrete pmf, the point (X, P) lies on a horizontal portion of the corresponding cdf. When you invert the cdf, (P, X) is on a vertical section of the ppf. Taking P ± eps gives you the two nearest flat portions of the ppf connected to that vertical jump, which correspond to the valid values X1, X2. You can then do a simple difference to figure out which is closer to your target value.
import numpy as np
eps = np.finfo(float).eps
ev = custom.expect()
p = custom.cdf(ev)
ev_candidates = custom.ppf([p - eps, p, p + eps])
ev_candidates_distance = abs(ev_candidates - ev)
ev_closest = ev_candidates[np.argmin(ev_candidates_distance)]
print(ev_closest)
# 5.0
Terms:
pmf - probability mass function
cdf - cumulative distribution function (cumulative sum of the pmf)
ppf - percentage point function (inverse of the cdf)
eps - epsilon (smallest possible increment)
Would the function ceil from the math library help? For example:
from math import ceil
print(float(ceil(3.333333333333333)))

get rid of tiny fraction in bar plot scale (Pandas/Python)

When I try to plot a bar plot (of histogram counts) using pd.cut, I get a funny (and very annoying!) 0.001 added to the axis on the left, making it start from -1.001 instead of -1. The question is how to get rid of this (please see the figure).
My code is:
out_i = pd.cut(df, bins=np.arange(-1,1.2,0.2), include_lowest=True)
out_i.value_counts(sort=False).plot.bar(rot=45, figsize=(6,6))
plt.tight_layout()
with df:
a
0 -0.402203
1 -0.019031
2 -0.979292
3 -0.701221
4 -0.267261
5 -0.563602
7 -0.454961
8 0.632456
9 -0.843081
10 -0.629253
11 -0.946188
12 -0.628178
13 -0.776933
14 -0.717091
15 -0.392144
16 -0.799408
17 -0.897951
18 0.255321
19 -0.641854
20 -0.356393
21 -0.507321
22 -0.698238
23 -0.985097
25 -0.661444
26 -0.751593
27 -0.437505
28 -0.413451
29 -0.798745
30 -0.736440
31 -0.672727
32 -0.807688
33 -0.087085
34 -0.393203
35 -0.979730
36 -0.902951
37 -0.454231
38 -0.561951
39 -0.388580
40 -0.706501
41 -0.408248
42 -0.377235
43 -0.283110
44 -0.517428
45 -0.949603
46 -0.268667
47 -0.376199
48 -0.472293
49 -0.211781
50 -0.921520
51 -0.345870
53 -0.542487
55 -0.597996
If it is acceptable to chop off the extra decimals of the interval edges, generate a custom list of interval labels and set it as the xticklabels of the plot:
out_i = pd.cut(df['a'], bins=np.arange(-1,1.2,0.2), include_lowest=True)
intervals = out_i.cat.categories
labels = ['(%.1f, %.1f]' % (int(interval.left*100)/100, interval.right) for interval in intervals]
ax = out_i.value_counts(sort=False).plot.bar(rot=45, figsize=(6,6))
ax.set_xticklabels(labels)
plt.tight_layout()
Which results in the following plot:
Note: this will always output a half-closed interval (a,b]. It can be improved by making the brackets dynamic as per the parameters of pd.cut.
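A hedged variant of the same idea is to build the labels directly from the bin edges passed to pd.cut, so the adjusted -1.001 edge never appears; this sketch assumes df from the question.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

bins = np.arange(-1, 1.2, 0.2)
out_i = pd.cut(df['a'], bins=bins, include_lowest=True)

# labels from the requested edges, not from the adjusted categories
labels = ['(%.1f, %.1f]' % (lo, hi) for lo, hi in zip(bins[:-1], bins[1:])]

ax = out_i.value_counts(sort=False).plot.bar(rot=45, figsize=(6, 6))
ax.set_xticklabels(labels)
plt.tight_layout()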
