Python computing error

I’m using the mpmath library to compute the following sequence.
Consider the sequence u0, u1, u2, … defined by:
u0 = 3/2 = 1.5
u1 = 5/3 = 1.6666666…
u(n+1) = 2003 - 6002/u(n) + 4000/(u(n) u(n-1))
The sequence converges to 2, but because of rounding problems it seems to converge to 2000.
n    Calculated value    Exact value (rounded)
2    1.800001            1.800000000
3    1.890000            1.888888889
4    3.116924            1.941176471
5    756.3870306         1.969696970
6    1996.761549         1.984615385
7    1999.996781         1.992248062
8    1999.999997         1.996108949
9    2000.000000         1.998050682
10   2000.000000         1.999024390
My code:
from mpmath import *
mp.dps = 50
u0=mpf(3/2.0)
u1=mpf(5/3.0)
u=[]
u.append(u0)
u.append(u1)
for i in range(2, 11):
    un1=(2003-6002/u[i-1]+(mpf(4000)/mpf((u[i-1]*u[i-2]))))
    u.append(un1)
print u
My bad results:
[mpf('1.5'),
mpf('1.6666666666666667406815349750104360282421112060546875'),
mpf('1.8000000000000888711326751945268011597589466120961647'),
mpf('1.8888888889876302386905492787148253684796100079942617'),
mpf('1.9411765751351638992775070422559330255517747908588059'),
mpf('1.9698046831709839591526211645628191427874374792786951'),
mpf('2.093979191783975876606205176530675127058752077926479'),
mpf('106.44733511712489354422046139349654833300787666477228'),
mpf('1964.5606972399290690749220686397494349501387742896911'),
mpf('1999.9639916238009625032390578545797067344576357100626'),
mpf('1999.9999640260895343960004614025893194430187653900418')]
I tried some other functions (fdiv…) and changing the precision: same bad result.
What’s wrong with this code?
Question:
How can I change my code to find the value 2.0 with the formula
u(n+1) = 2003 - 6002/u(n) + 4000/(u(n) u(n-1))?
Thanks

Using the decimal module, you can see that the recurrence also has solutions converging to 2000:
from decimal import Decimal, getcontext
getcontext().prec = 100
u0=Decimal(3) / Decimal(2)
u1=Decimal(5) / Decimal(3)
u=[u0, u1]
for i in range(100):
    un1 = 2003 - 6002/u[-1] + 4000/(u[-1]*u[-2])
    u.append(un1)
    print un1
The recurrence relation has multiple fixed points (one at 2 and the other at 2000):
>>> u = [Decimal(2), Decimal(2)]
>>> 2003 - 6002/u[-1] + 4000/(u[-1]*u[-2])
Decimal('2')
>>> u = [Decimal(2000), Decimal(2000)]
>>> 2003 - 6002/u[-1] + 4000/(u[-1]*u[-2])
Decimal('2000.000')
The solution at 2 is an unstable fixed point. The attracting fixed point is at 2000.
The iteration gets very close to 2, but once round-off pushes the value slightly away from 2, that difference gets amplified again and again until the sequence settles at 2000.
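To see that amplification concretely, here is a small demonstration of my own (not part of the original answer): start exactly at the fixed point 2, nudge one term by 1e-30, and iterate.
from decimal import Decimal, getcontext
getcontext().prec = 50
# Start at the fixed point 2, with a tiny perturbation on the second term.
u = [Decimal(2), Decimal(2) + Decimal("1e-30")]
for _ in range(25):
    u.append(2003 - 6002/u[-1] + 4000/(u[-1]*u[-2]))
print(u[-1])  # the 1e-30 perturbation is amplified until the value settles near 2000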

Your (non-linear) recurrence sequence has three fixed points: 1, 2 and 2000. The values 1 and 2 are close to each other compared to 2000, which is usually an indication of unstable fixed points because they are "almost" double roots.
You need to do some maths in order to delay the divergence. Let v(n) be an auxiliary sequence:
v(n) = (1 + 2^n) u(n)
The following holds:
v(n+1) = (1 + 2^(n+1)) * (2003 v(n) v(n-1) - 6002 (1 + 2^n) v(n-1) + 4000 (1 + 2^n)(1 + 2^(n-1))) / (v(n) v(n-1))
You can then simply compute v(n) and deduce u(n) from u(n) = v(n)/(1+2^n):
#!/usr/bin/env python
from mpmath import *
mp.dps = 50
v0 = mpf(3)
v1 = mpf(5)
v=[]
v.append(v0)
v.append(v1)
u=[]
u.append(v[0]/2)
u.append(v[1]/3)
for i in range(2, 25):
    vn1 = (1 + 2**i) * (2003*v[i-1]*v[i-2]
                        - 6002*(1 + 2**(i-1))*v[i-2]
                        + 4000*(1 + 2**(i-1))*(1 + 2**(i-2))) / (v[i-1]*v[i-2])
    v.append(vn1)
    u.append(vn1/(1 + 2**i))
print u
And the result:
[mpf('1.5'),
mpf('1.6666666666666666666666666666666666666666666666666676'),
mpf('1.8000000000000000000000000000000000000000000000000005'),
mpf('1.8888888888888888888888888888888888888888888888888892'),
mpf('1.9411764705882352941176470588235294117647058823529413'),
mpf('1.969696969696969696969696969696969696969696969696969'),
mpf('1.9846153846153846153846153846153846153846153846153847'),
mpf('1.992248062015503875968992248062015503875968992248062'),
mpf('1.9961089494163424124513618677042801556420233463035019'),
mpf('1.9980506822612085769980506822612085769980506822612089'),
mpf('1.9990243902439024390243902439024390243902439024390251'),
mpf('1.9995119570522205954123962908735968765251342118106393'),
mpf('1.99975591896509641200878691725652916768367097876495'),
mpf('1.9998779445868424264616135725619431221774685707311133'),
mpf('1.9999389685688129386634116570033567287152883735123589'),
mpf('1.9999694833531691537733833806341359211449845890933504'),
mpf('1.9999847414437645909944001098616048949448403192089965'),
mpf('1.9999923706636759668276456631037666033431751771913355'),
...
Note that this will still diverge eventually. In order to really converge, you need to compute v(n) exactly. But this is now much easier since all the v(n) values are integers.
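For example, here is a minimal sketch of my own (not part of the original answer) that runs the same v-recurrence with exact rational arithmetic from the standard fractions module; the v(n) then stay exactly at the integers 2^(n+1) + 1, and u(n) keeps approaching 2:
from fractions import Fraction
v = [Fraction(3), Fraction(5)]           # v(0) = 3, v(1) = 5
u = [v[0]/2, v[1]/3]                     # u(n) = v(n) / (1 + 2^n)
for i in range(2, 25):
    num = (2003*v[i-1]*v[i-2]
           - 6002*(1 + 2**(i-1))*v[i-2]
           + 4000*(1 + 2**(i-1))*(1 + 2**(i-2)))
    vn1 = (1 + 2**i) * num / (v[i-1]*v[i-2])
    v.append(vn1)
    u.append(vn1/(1 + 2**i))
print([int(x) for x in v[:6]])   # [3, 5, 9, 17, 33, 65], i.e. 2^(n+1) + 1
print(float(u[-1]))              # tends to 2 from below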

You calculate your initial values to 53 bits of precision and then assign that rounded value to the high-precision mpf variable. You should use u0=mpf(3)/mpf(2) and u1=mpf(5)/mpf(3). You'll stay close to 2 for a few more iterations, but you'll still end up converging to 2000. This is due to rounding error. One alternative is to compute with fractions. I used gmpy and the following code converges to 2.
from __future__ import print_function
import gmpy
u = [gmpy.mpq(3,2), gmpy.mpq(5,3)]
for i in range(2, 300):
    temp = 2003 - 6002/u[-1] + 4000/(u[-1]*u[-2])
    u.append(temp)
for i in u: print(gmpy.mpf(i, 300))

If you compute with infinite precision then you get 2 otherwise you get 2000:
import itertools
from fractions import Fraction
def series(u0=Fraction(3, 2), u1=Fraction(5, 3)):
    yield u0
    yield u1
    while u0 != u1:
        un = 2003 - 6002/u1 + 4000/(u1*u0)
        yield un
        u1, u0 = un, u1

for i, u in enumerate(itertools.islice(series(), 100)):
    err = (2-u)/2  # relative error
    print("%d\t%.2g" % (i, err))
Output
0 0.25
1 0.17
2 0.1
3 0.056
4 0.029
5 0.015
6 0.0077
7 0.0039
8 0.0019
9 0.00097
10 0.00049
11 0.00024
12 0.00012
13 6.1e-05
14 3.1e-05
15 1.5e-05
16 7.6e-06
17 3.8e-06
18 1.9e-06
19 9.5e-07
20 4.8e-07
21 2.4e-07
22 1.2e-07
23 6e-08
24 3e-08
25 1.5e-08
26 7.5e-09
27 3.7e-09
28 1.9e-09
29 9.3e-10
30 4.7e-10
31 2.3e-10
32 1.2e-10
33 5.8e-11
34 2.9e-11
35 1.5e-11
36 7.3e-12
37 3.6e-12
38 1.8e-12
39 9.1e-13
40 4.5e-13
41 2.3e-13
42 1.1e-13
43 5.7e-14
44 2.8e-14
45 1.4e-14
46 7.1e-15
47 3.6e-15
48 1.8e-15
49 8.9e-16
50 4.4e-16
51 2.2e-16
52 1.1e-16
53 5.6e-17
54 2.8e-17
55 1.4e-17
56 6.9e-18
57 3.5e-18
58 1.7e-18
59 8.7e-19
60 4.3e-19
61 2.2e-19
62 1.1e-19
63 5.4e-20
64 2.7e-20
65 1.4e-20
66 6.8e-21
67 3.4e-21
68 1.7e-21
69 8.5e-22
70 4.2e-22
71 2.1e-22
72 1.1e-22
73 5.3e-23
74 2.6e-23
75 1.3e-23
76 6.6e-24
77 3.3e-24
78 1.7e-24
79 8.3e-25
80 4.1e-25
81 2.1e-25
82 1e-25
83 5.2e-26
84 2.6e-26
85 1.3e-26
86 6.5e-27
87 3.2e-27
88 1.6e-27
89 8.1e-28
90 4e-28
91 2e-28
92 1e-28
93 5e-29
94 2.5e-29
95 1.3e-29
96 6.3e-30
97 3.2e-30
98 1.6e-30
99 7.9e-31

Well, as casevh said, I just wrapped the initial terms of my code with the mpf function:
u0=mpf(3)/mpf(2)
u1=mpf(5)/mpf(3)
and the values stay close to the correct value 2.0 for about 16 steps before diverging again (see below).
So, even with a good Python library for arbitrary-precision floating-point arithmetic and some basic operations, the result can become totally wrong; it is not an algorithmic, mathematical or recurrence problem, as I have sometimes read.
So it is necessary to remain watchful and critical! (I'm now a bit wary of the mpmath.lerchphi(z, s, a) function ;-)
n    Value
2    1.8000000000000000000000000000000000000000000000022
3    1.8888888888888888888888888888888888888888888913205
4    1.9411764705882352941176470588235294117647084569125
5    1.9696969696969696969696969696969696969723495083846
6    1.9846153846153846153846153846153846180779422496889
7    1.992248062015503875968992248062018218070968279944
8    1.9961089494163424124513618677070049064461141667961
9    1.998050682261208576998050684991268132991329645551
10   1.9990243902439024390243929766241359876402781522945
11   1.9995119570522205954151303455889283862002420414092
12   1.9997559189650964147435086295745928366095548127257
13   1.9998779445868451615169464386495752584786229236677
14   1.9999389685715481608370784691478769380770569091713
15   1.9999694860884747554701272066241108169217231319376
16   1.9999874767910784720428384947047783821702386000249
17   2.0027277350948824117795762659330557916802871427763
18   4.7316350177463946015607576536159982430500337286276
19   1156.6278675611076227796014310764287933259776352198
20   1998.5416721291457644804673979070312813731252347786
21   1999.998540608689366669273522363692463645090555294
22   1999.9999985406079725746311606572627439743947878652

The exact solution to your recurrence relation (with initial values u_0 = 3/2, u_1 = 5/3) is easily verified to be
u_n = (2^(n+1) + 1) / (2^n + 1). (*)
The problem you're seeing is that although the solution is such that
lim_{n -> oo} u_n = 2,
this limit is a repelling fixed point of your recurrence relation. That is, any departure from the correct values of u_{n-1}, u_{n-2}, for some n, will result in further values diverging from the correct limit. Consequently, unless your implementation of the recurrence relation represents every u_n value exactly, it can be expected to exhibit eventual divergence from the correct limit, converging to the incorrect value of 2000 that just happens to be the only attracting fixed point of your recurrence relation.
(*) In fact, u_n = (2^(n+1) + 1) / (2^n + 1) is the solution to any recurrence relation of the form
u_n = C + (7 - 3C)/u_{n-1} + (2C - 6)/(u_{n-1} u_{n-2})
with the same initial values as given above, where C is an arbitrary constant. If I haven't made a mistake finding the roots of the characteristic polynomial, this will have the set of fixed points {1, 2, C - 3}\{0}. The limit 2 can be either a repelling fixed point or an attracting fixed point, depending on the value of C. E.g., for C = 2003 the set of fixed points is {1, 2, 2000} with 2 being a repellor, whereas for C = 3 the fixed points are {1, 2} with 2 being an attractor.
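As a quick check of the closed form, here is a small sketch of my own (not part of the answer) that compares it, term by term and with exact arithmetic, against the original recurrence:
from fractions import Fraction
u = [Fraction(3, 2), Fraction(5, 3)]
for n in range(2, 40):
    u.append(2003 - 6002/u[-1] + 4000/(u[-1]*u[-2]))
    assert u[-1] == Fraction(2**(n+1) + 1, 2**n + 1)
print("closed form verified up to n = 39")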

Related

MATLAB translation to Python for a simple calculation

One more time I need your help.
To introduce the problem, I have this MATLAB code:
x=[0 1 3 4 5 6 7 8]
y=[9 10 11 12 13 14 15 16]
x=x(:)
y=y(:)
X=[x.^2, x.*y,y.^2,x,y]
a=sum(X)/(X'*X)
X=
0 0 81 0 9
1 10 100 1 10
9 33 121 3 11
16 48 144 4 12
25 65 169 5 13
36 84 196 6 14
49 105 225 7 15
64 128 256 8 16
a =
-0.0139 0.0278 -0.0139 -0.2361 0.2361
Consider the MATLAB code above to be absolutely correct.
I translated this to:
import numpy as np
x=np.array([0,1,3,4,5,6,7,8])
y=np.array([9,10,11,12,13,14,15,16])
X=np.array([x*x,x*y,y*y,x,y]).T
a=np.sum(X)/np.dot(X.T,X)  # line with the problem
X is the same,
but I get a (5,5) matrix for a.
The problem comes from the multiplication between X.T and X, I think. I tried np.matmul, np.dot, transpose and .T, and I don't know why I can't get a (1,5) or (5,1) vector... What is wrong is the translation between those two languages in the calculation of a.
Any suggestions?
The division of such two matrices in MATLAB:
s = sum(X)
XX = (X'*X)
a = s / XX
is solving for t the linear system: XX * t = s.
To achieve the same in Python/NumPy, just use np.linalg.solve() (making sure to use np.sum() with the correct axis parameter to mimic the behavior of MATLAB's sum(), as indicated in the comments and in @AnderBiguri's answer):
x=np.array([0,1,3,4,5,6,7,8])
y=np.array([9,10,11,12,13,14,15,16])
X=np.array([x*x,x*y,y*y,x,y]).T
s = np.sum(X, 0)
XX = np.dot(X.T, X)
a = np.linalg.solve(XX, s)
print(a)
# [-0.01388889 0.02777778 -0.01388889 -0.23611111 0.23611111]
The issue is sum.
In MATLAB, sum by default sums over the first axis. In NumPy, np.sum sums all the values by default.
a=np.sum(X, axis=0)/np.dot(X.T,X)
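As a quick illustration of that difference (a minimal example of mine, not from the original answer):
import numpy as np
X = np.arange(6).reshape(3, 2)
print(np.sum(X))           # 15 -> NumPy default: sum of all values
print(np.sum(X, axis=0))   # [6 9] -> column sums, like MATLAB's sum(X)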

Wrong scipy fit even with good initial guess

The model to fit is the equation
def func(x, b):
    return b*np.exp(-b*x)*(1.0 + b*x)/4.0
I know that b=0.1 is a good guess for my data:
0 0.1932332495855138
1 0.0283534527253836
2 0.0188036856033853
3 0.0567007258167565
4 0.0704161703188139
5 0.0276463443409273
6 0.0144509808494943
7 0.0188027609145469
8 0.0049573500626925
9 0.0064589075683206
10 0.0118522499082115
11 0.0087201376939245
12 0.0055855004231049
13 0.0110355379801288
14 0.0024829496736532
15 0.0050982312687186
16 0.0041032075307342
17 0.0063991465281368
18 0.0047195530453669
19 0.0028479431829209
20 0.0177577032522473
21 0.0082863863356967
22 0.0057720347102372
23 0.0053694769677398
24 0.017408417311084
25 0.0023307847797263
26 0.0014090741613788
27 0.0019007144648791
28 0.0043599058193019
29 0.004435997067249
30 0.0015569027316533
31 0.0016127575928092
32 0.00120222948697
33 0.0006851723909766
34 0.0014497504163
35 0.0014245210449107
36 0.0011375555693977
37 0.0007939973846594
38 0.0005707034948325
39 0.0007890519641431
40 0.0006274139241806
41 0.0005899624312505
42 0.0003989619799181
43 0.0002212632688891
44 0.0001465605806698
45 0.000188075040325
46 0.0002779076010181
47 0.0002941294723591
48 0.0001690581072228
49 0.0001448055157076
50 0.0002734759385405
51 0.0003228484365634
52 0.0002120441778252
53 0.0002383276583408
54 0.0002156310534404
55 0.0004499244488764
56 0.0001408465706883
57 0.000135998586104
58 0.00028706917157
59 0.0001788548683777
But it doesn't matter whether I set p0=0.1 or p0=1.0; in both cases Python gives the fitted parameter as popt=[0.42992594] and popt=[0.42994105] respectively, which is almost the same value. Why doesn't the curve_fit function work in this case?
popt, pcov = curve_fit(func, xdata, ydata, p0=[0.1])
There's nothing too mysterious going on here. 0.4299... is just a better fit to the data, in the least-squares sense.
With b = 0.1, the first few points are not well fit at all. Least-squares heavily weights outliers, so the optimizer tries very hard to fit those better, even if it means doing slightly worse at other points. In other words, "most" points are fit "pretty well", and there is a very high penalty for fitting any point very badly (that's the "square" in least-squares).
Below is a plot of the data (blue) and your model function with b = 0.1 and b = 0.4299 in orange and green respectively. The value returned by curve_fit is better subjectively and objectively. Computing the MSE to the data in both cases gives about 0.18 using b = 0.1, and 0.13 using b = 0.4299.
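If you want to reproduce that comparison yourself, something along these lines should work; this is a sketch of mine (not from the original answer), the helper name is made up, and xdata/ydata are assumed to hold the table above:
import numpy as np

def func(x, b):
    return b*np.exp(-b*x)*(1.0 + b*x)/4.0

def residual_sum_of_squares(b, xdata, ydata):
    # This is (up to weighting) the quantity curve_fit minimizes.
    return np.sum((func(xdata, b) - ydata)**2)

# xdata = np.arange(60); ydata = np.array([...values from the table above...])
# print(residual_sum_of_squares(0.1, xdata, ydata))     # larger
# print(residual_sum_of_squares(0.4299, xdata, ydata))  # smaller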

Given a discrete distribution, how do I round a number to the closest value in that distribution?

What I ultimately want to do is round the expected value of a discrete random variable distribution to a valid number in the distribution. For example, if I am drawing evenly from the numbers [1, 5, 6], the expected value is 4, but I want to return the closest number to that (i.e., 5).
from scipy.stats import *
xk = (1, 5, 6)
pk = np.ones(len(xk))/len(xk)
custom = rv_discrete(name='custom', values=(xk, pk))
print(custom.expect())
# 4.0
def round_discrete(discrete_rv_dist, val):
# do something here
return answer
print(round_discrete(custom, custom.expect()))
# 5.0
I don't know a priori what distribution will be used (i.e. it might not be integers, it might be an unbounded distribution), so I'm really struggling to think of an algorithm that is sufficiently generic. Edit: I just learned that rv_discrete doesn't work on non-integer xk values.
As to why I want to do this, I'm putting together a Monte Carlo simulation and want a "nominal" value for each distribution. I think that the EV is the most physically appropriate, rather than the mode or median. I might have values in the downstream simulation that have to be one of several discrete choices, so passing a value that is not within that set is not acceptable.
If there's already a nice way to do this in Python that would be great, otherwise I can interpret math into code.
Here is R code that I think will do what you want, using Poisson data to illustrate:
set.seed(322)
x = rpois(100, 7) # 100 obs from POIS(7)
a = mean(x); a
[1] 7.16 # so 7 is the value we want
d = min(abs(x-a)); d # min distance btw a and actual Pois val
[1] 0.16
u = unique(x); u # unique Pois values observed
[1] 7 5 4 10 2 9 8 6 11 3 13 14 12 15
v = u[abs(u-a)==d]; v # unique val closest to a
[1] 7
Hope you can translate it to Python.
Another run:
set.seed(323)
x = rpois(100, 20)
a = mean(x); a
[1] 20.32
d = min(abs(x-a)); d
[1] 0.32
u = unique(x)
v = u[abs(u-a)==d]; v
[1] 20
x
[1] 17 16 20 23 23 20 19 23 21 19 21 20 22 25 13 15 19 19 14 27 19 30 17 19 23
[26] 16 23 26 33 16 11 23 14 21 24 12 18 20 20 19 26 12 22 24 20 22 17 23 11 19
[51] 19 26 17 17 11 17 23 21 26 13 18 28 22 14 17 25 28 24 16 15 25 26 22 15 23
[76] 27 19 21 17 23 21 24 23 22 23 18 25 14 24 25 19 19 21 22 16 28 18 11 25 23
u
[1] 17 16 20 23 19 21 22 25 13 15 14 27 30 26 33 11 24 12 18 28
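A possible Python/NumPy translation of the same idea (a sketch of mine, not from the answer; it won't reproduce R's exact random draws, and it picks a single closest value instead of all ties):
import numpy as np
rng = np.random.default_rng(322)
x = rng.poisson(7, 100)          # 100 obs from Pois(7)
a = x.mean()                     # target value, e.g. the expected value
u = np.unique(x)                 # unique observed values
v = u[np.argmin(np.abs(u - a))]  # unique value closest to a
print(a, v)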
Figured it out, and tested it working. If I plug my value X into the cdf, then I can plug that probability P = cdf(X) into the ppf. The values at ppf(P +- epsilon) will give me the closest values in the set to X.
Or more geometrically, for a discrete pmf, the point (X,P) will lie on a horizontal portion of the corresponding cdf. When you invert the cdf, (P,X) is now on a vertical section of the ppf. Taking P +- eps will give you the 2 nearest flat portions of the ppf connected to that vertical jump, which correspond to the valid values X1, X2. You can then do a simple difference to figure out which is closer to your target value.
import numpy as np
eps = np.finfo(float).eps
ev = custom.expect()
p = custom.cdf(ev)
ev_candidates = custom.ppf([p - eps, p, p + eps])
ev_candidates_distance = abs(ev_candidates - ev)
ev_closest = ev_candidates[np.argmin(ev_candidates_distance)]
print(ev_closest)
# 5.0
Terms:
pmf - probability mass function
cdf - cumulative distribution function (cumulative sum of the pmf)
ppf - percentage point function (inverse of the cdf)
eps - epsilon (smallest possible increment)
Would the function ceil from the math library help? For example:
from math import ceil
print(float(ceil(3.333333333333333)))

Portfolio Selection in Python with constraints from a fixed set

I am working on a project where I am trying to select the optimal subset of players from a set of 125 players (example below)
The constraints are:
a) Number of players = 3
b) Sum of prices <= 30
The optimization function is Max(Sum of Votes)
Player Vote Price
William Smith 0.67 8.6
Robert Thompson 0.31 6.7
Joseph Robinson 0.61 6.2
Richard Johnson 0.88 4.3
Richard Hall 0.28 9.7
I looked at the scipy optimize package, but I can't find a way to constrain the universe to this subset. Can anyone point me to a library that would do that?
Thanks
The problem is well suited to be formulated as mathematical program and can be solved with different Optimization libraries.
It is known as the exact k-item knapsack problem.
You can use the Package PuLP for example. It has interfaces to different optimization software packages, but comes bundled with a free solver.
pip install pulp
Free solvers are often way slower than commercial ones, but I think PuLP should be able to solve reasonably large versions of your problem with its standard solver.
Your problem can be solved with PuLP as follows:
from pulp import *
# Data input
players = ["William Smith", "Robert Thompson", "Joseph Robinson", "Richard Johnson", "Richard Hall"]
vote = [0.67, 0.31, 0.61, 0.88, 0.28]
price = [8.6, 6.7, 6.2, 4.3, 9.7]
P = range(len(players))
# Declare problem instance, maximization problem
prob = LpProblem("Portfolio", LpMaximize)
# Declare decision variable x, which is 1 if a
# player is part of the portfolio and 0 else
x = LpVariable.matrix("x", list(P), 0, 1, LpInteger)
# Objective function -> Maximize votes
prob += sum(vote[p] * x[p] for p in P)
# Constraint definition
prob += sum(x[p] for p in P) == 3
prob += sum(price[p] * x[p] for p in P) <= 30
# Start solving the problem instance
prob.solve()
# Extract solution
portfolio = [players[p] for p in P if x[p].varValue]
print(portfolio)
The runtime to draw 3 players from 125 with the same random data as used by Brad Solomon is 0.5 seconds on my machine.
Your problem is a discrete optimization task because of constraint a). You should introduce discrete variables to represent taken/not-taken players. Consider the following MiniZinc pseudocode:
array[players_num] of var bool: taken_players;
array[players_num] of float: votes;
array[players_num] of float: prices;
constraint sum (taken_players * prices) <= 30;
constraint sum (taken_players) = 3;
solve maximize sum (taken_players * votes);
As far as I know, you can't use scipy to solve such problems (e.g. this).
You can solve your problem in these ways:
You can generate Minizinc problem in Python and solve it by calling external solver. It seems to be more scalable and robust.
You can use simulated annealing
Mixed integer approach
The second option seems to be simpler for you. But, personally, I prefer the first one: it allows you to introduce a wide range of constraints, and the problem formulation feels more natural and clear.
@CaptainTrunky is correct, scipy.minimize will not work here.
Here is an awfully crappy workaround using itertools; please ignore it if one of the other methods has worked. Consider that drawing 3 players from 125 creates 317,750 combinations, n!/((n - k)! * k!). Runtime of the main loop is ~6 minutes.
from itertools import combinations
import numpy as np
import pandas as pd
from pandas import DataFrame

df = DataFrame({'Player' : np.arange(0, 125),
                'Vote' : 10 * np.random.random(125),
                'Price' : np.random.randint(1, 10, 125)
                })
df
Out[109]:
Player Price Vote
0 0 4 7.52425
1 1 6 3.62207
2 2 9 4.69236
3 3 4 5.24461
4 4 4 5.41303
.. ... ... ...
120 120 9 8.48551
121 121 8 9.95126
122 122 8 6.29137
123 123 8 1.07988
124 124 4 2.02374
players = df.Player.values
idx = pd.MultiIndex.from_tuples([i for i in combinations(players, 3)])
votes = []
prices = []
for i in combinations(players, 3):
    vote = df[df.Player.isin(i)].sum()['Vote']
    price = df[df.Player.isin(i)].sum()['Price']
    votes.append(vote); prices.append(price)
result = DataFrame({'Price' : prices, 'Vote' : votes}, index=idx)
# The index below is (first player, second player, third player)
result[result.Price <= 30].sort_values('Vote', ascending=False)
Out[128]:
Price Vote
63 87 121 25.0 29.75051
64 121 20.0 29.62626
64 87 121 19.0 29.61032
63 64 87 20.0 29.56665
65 121 24.0 29.54248
... ...
18 22 78 12.0 1.06352
23 103 20.0 1.02450
22 23 103 20.0 1.00835
18 22 103 15.0 0.98461
23 14.0 0.98372

Python/Pandas Select Columns based on Best Value Distribution

I have a dataframe (df) in pandas/Python with ['Product', 'OrderDate', 'Sales'].
I noticed that for some products the values have a better distribution (as in a histogram) than for others. By "better" I mean the shape is more spread out, i.e. the spread of values makes the shape look wider than for other products.
If I want to pick, from say 700+ products, those with more spread-out values, is there a way to do that easily in pandas/Python?
Thanks in advance.
The caveat here is that I'm not a stats expert, but basically scipy has a number of tests you can conduct on your data to test whether it could be considered a normal (Gaussian) distribution.
Here I create 2 series: one is simply a linear range, and the other is a random normal sample with mean 50 and standard deviation 25.
In [48]:
import numpy as np
import pandas as pd
import scipy.stats as stats
df = pd.DataFrame({'linear': np.arange(100), 'normal': np.random.normal(50, 25, 100)})
df
Out[48]:
linear normal
0 0 66.565374
1 1 63.453899
2 2 65.736406
3 3 65.848908
4 4 56.916032
5 5 93.870682
6 6 89.513998
7 7 9.949555
8 8 9.727099
9 9 47.072785
10 10 62.849321
11 11 33.263309
12 12 42.168484
13 13 38.488933
14 14 51.833459
15 15 54.911915
16 16 62.372709
17 17 96.928452
18 18 65.333546
19 19 26.341462
20 20 41.692790
21 21 22.852561
22 22 15.799415
23 23 50.600141
24 24 14.234088
25 25 72.428607
26 26 45.872601
27 27 80.783253
28 28 29.561586
29 29 51.261099
.. ... ...
70 70 32.826052
71 71 35.413106
72 72 49.415386
73 73 28.998378
74 74 32.237667
75 75 86.622402
76 76 105.098296
77 77 53.176413
78 78 -7.954881
79 79 60.313761
80 80 42.739641
81 81 56.667834
82 82 68.046688
83 83 72.189683
84 84 67.125708
85 85 24.798553
86 86 58.845761
87 87 54.559792
88 88 93.116777
89 89 30.209895
90 90 80.952444
91 91 57.895433
92 92 47.392336
93 93 13.136111
94 94 26.624532
95 95 53.461421
96 96 28.782809
97 97 16.342756
98 98 64.768579
99 99 68.410021
[100 rows x 2 columns]
From this page there are a number of tests we can use, which are combined to form the normaltest, namely the skewtest and kurtosistest. I cannot explain these, but you can see that the p-value is poor for the linear series and is relatively closer to 1 for the normally distributed data:
In [49]:
print('linear skewtest teststat = %6.3f pvalue = %6.4f' % stats.skewtest(df['linear']))
print('normal skewtest teststat = %6.3f pvalue = %6.4f' % stats.skewtest(df['normal']))
print('linear kurtosis teststat = %6.3f pvalue = %6.4f' % stats.kurtosistest(df['linear']))
print('normal kurtosis teststat = %6.3f pvalue = %6.4f' % stats.kurtosistest(df['normal']))
print('linear normaltest teststat = %6.3f pvalue = %6.4f' % stats.normaltest(df['linear']))
print('normal normaltest teststat = %6.3f pvalue = %6.4f' % stats.normaltest(df['normal']))
linear skewtest teststat = 1.022 pvalue = 0.3070
normal skewtest teststat = -0.170 pvalue = 0.8652
linear kurtosis teststat = -5.799 pvalue = 0.0000
normal kurtosis teststat = -1.113 pvalue = 0.2656
linear normaltest teststat = 34.674 pvalue = 0.0000
normal normaltest teststat = 1.268 pvalue = 0.5304
From the scipy site:
When testing for normality of a small sample of t-distributed
observations and a large sample of normal distributed observation,
then in neither case can we reject the null hypothesis that the sample
comes from a normal distribution. In the first case this is because
the test is not powerful enough to distinguish a t and a normally
distributed random variable in a small sample.
So you'll have to try the above and see if it fits with what you want, hope this helps.
Sure. What you'd like to do here is find the 700 entries with the largest standard deviation.
pandas.DataFrame.std() will return the standard deviation for an axis, and then you just need to keep track of the entries with the highest corresponding values; a small sketch is shown below.
Large Standard Deviation vs. Small Standard Deviation
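For the original question (ranking products by how spread out their sales are), a minimal sketch along those lines might look like this; the column names are taken from the question, the example data is made up:
import pandas as pd
df = pd.DataFrame({
    'Product':   ['A', 'A', 'A', 'B', 'B', 'B'],
    'OrderDate': pd.date_range('2020-01-01', periods=6, freq='D'),
    'Sales':     [10, 11, 9, 5, 50, 95],
})
# Standard deviation of Sales per product, most spread out first.
spread = df.groupby('Product')['Sales'].std().sort_values(ascending=False)
print(spread.head(700))   # keep the 700 most spread-out products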
