Fitting dictionary into normal distribution curve - python

Here is the dictionary:
l = {31.2: 1, 35.1: 4, 39.0: 13, 42.9: 33, 46.8: 115, 50.7: 271, 54.6: 363, 58.5: 381, 62.4: 379, 66.3: 370, 70.2: 256, 74.1: 47, 78.0: 2}
This means that 31.2 occurred once, 35.1 occurred 4 times, and so on.
I tried:
fig, ax = plt.subplots(1, 1)
ax.scatter(l.keys(), l.values())
ax.set_xlabel('Key')
ax.set_ylabel('Length of value')
Also, I found the mean and std by:
np.mean([k for k in l.keys()])
np.std([k for k in l.keys()])
Is this the way to find the mean and std for this data? I doubt it, because it does not take into account the number of occurrences of each value. I want to see the normal curve over this data. Also, is there a way to know how often a value occurs? For example, if I extend the curve to touch 0 on the x axis, I want to know how many data points are involved for an occurrence of 0 (it can also be a probability).

Here is a way to draw a normal (Gaussian) curve fitted to the data:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
l = {31.2: 1, 35.1: 4, 39.0: 13, 42.9: 33, 46.8: 115, 50.7: 271, 54.6: 363, 58.5: 381, 62.4: 379, 66.3: 370, 70.2: 256, 74.1: 47, 78.0: 2}
# expand the dictionary into a flat list of samples (each key repeated by its count)
l_list = [k for k, v in l.items() for _ in range(v)]
fig, ax = plt.subplots(1, 1)
ax.scatter(l.keys(), l.values())
ax.set_xlabel('Key')
ax.set_ylabel('Length of value')
mu = np.mean(l_list)
sigma = np.std(l_list)
u = np.linspace(mu - 4 * sigma, mu + 4 * sigma, 100)
ax2 = ax.twinx()
ax2.plot(u, stats.norm.pdf(u, mu, sigma), color='crimson')
ax2.set_ylabel('normal curve')
plt.show()
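If you'd rather not expand the dictionary into a flat list, the same mean and std can be computed directly from the counts as weights. A minimal sketch, using the dictionary l from above:
import numpy as np
keys = np.array(list(l.keys()))
counts = np.array(list(l.values()))
mu = np.average(keys, weights=counts)  # weighted mean
sigma = np.sqrt(np.average((keys - mu) ** 2, weights=counts))  # weighted population std
This matches np.mean(l_list) and np.std(l_list) above, since np.std also uses the population (ddof=0) definition by default.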

Here's a way to get the mean and std:
l = {31.2: 1, 35.1: 4, 39.0: 13, 42.9: 33, 46.8: 115, 50.7: 271, 54.6: 363, 58.5: 381, 62.4: 379, 66.3: 370, 70.2: 256, 74.1: 47, 78.0: 2}
ll = [[k] * v for k, v in l.items()]
flat_list = [item for sublist in ll for item in sublist]
np.mean(flat_list), np.std(flat_list)
which prints (59.559194630872476, 7.528353520785996).
You could do a histogram with np.histogram(flat_list) to evaluate the frequency of each occurrence.
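As for the expected number of occurrences at a value like 0: once mu and sigma are known, the fitted normal provides a density, and density times bin width times sample size approximates a count. A rough sketch, assuming the bin width of 3.9 (the spacing of the dictionary keys) and using flat_list from this snippet together with the mu and sigma computed in the first one:
import scipy.stats as stats
counts, edges = np.histogram(flat_list, bins=len(l))  # frequency per bin
density_at_0 = stats.norm.pdf(0, mu, sigma)  # density of the fitted normal at x = 0
expected_at_0 = density_at_0 * 3.9 * len(flat_list)  # essentially zero: 0 lies ~8 sigma below the mean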

Related

Plotting points from a sympy Segment3D definition; extracting half of each dictionary value in a plain format like "(1,2,3)"

I've defined points and lines via the sympy library and stored them in dictionaries. Now I'm trying to plot the points and lines from the dictionaries via a for loop, but I get an error.
How can I take just one of the two points making up each dictionary value, in a plain format like "(0,1,2)", without the "Segment3D" and "Point3D" wrappers?
Error:
Traceback (most recent call last):
File "C:/Users/.../2D.py", line 81, in
l_length=math.hypot(pA[0]-pE[0], pA[1]-pE[1], pA[2]-pE[2])
TypeError: 'Segment3D' object is not subscriptable
from sympy import Point3D, Line3D, Segment3D, Eq, solve_linear_system, Matrix, Basic
from math import *
from sympy.interactive import printing
printing.init_printing(use_latex=True)
from numpy import linalg
from sympy.solvers import solve
import numpy as np
import sympy as sp
from matplotlib import pyplot as plt
import math
from itertools import islice
# define the points of the geometry
points = {'p0': Point3D(0, 0, 0),
          'p1': Point3D(0, 1, 1),
          'p2': Point3D(1.707, 1, 1),
          'p3': Point3D(3.414, 1, 1),
          'p4': Point3D(1.707, 2.707, 4)}
locals().update(points)
lines = {"l1": Segment3D(p1, p4),
         "l2": Segment3D(p2, p4),
         "l3": Segment3D(p3, p4)}
locals().update(lines)
plt.rcParams["figure.figsize"] = [10, 7]
plt.rcParams["figure.autolayout"] = True
fig = plt.figure('1')
ax = plt.axes(projection="3d")
ax.text(p1[0], p1[1]+0.3, p1[2]+.2, 'P1')
ax.text(p2[0], p2[1]+0.3, p2[2]+.2, 'P2')
ax.text(p3[0], p3[1]+0.3, p3[2]+.2, 'P3')
ax.text(p4[0], p4[1]+0.3, p4[2]+.2, 'P4')
# plot the coordinate axes
P0 = Point3D(0, 0, 0)
X = Point3D(5, 0, 0)
Y = Point3D(0, 5, 0)
Z = Point3D(0, 0, 5)
lX, lY, lZ = Line3D(P0, X), Line3D(P0, Y), Line3D(P0, Z)
x, y, z = [P0[0], X[0]], [P0[1], X[1]], [P0[2], X[2]]
ax.plot(x, y, z, color='blue')
x, y, z = [P0[0], Y[0]], [P0[1], Y[1]], [P0[2], Y[2]]
ax.plot(x, y, z, color='blue')
x, y, z = [P0[0], Z[0]], [P0[1], Z[1]], [P0[2], Z[2]]
ax.plot(x, y, z, color='blue')
ax.set_xlabel('x-axis')
ax.set_ylabel('y-axis')
ax.set_zlabel('z-axis')
print(lines.keys())
e1x = cos(l1.angle_between(lX))
e1y = cos(l1.angle_between(lY))
e1z = cos(l1.angle_between(lZ))
e1 = np.array([e1x, e1y, e1z])
e2x = cos(l2.angle_between(lX))
e2y = cos(l2.angle_between(lY))
e2z = cos(l2.angle_between(lZ))
e2 = np.array([e2x, e2y, e2z])
e3x = cos(l3.angle_between(lX))
e3y = cos(l3.angle_between(lY))
e3z = cos(l3.angle_between(lZ))
e3 = np.array([e3x, e3y, e3z])
for key, value in lines.items():
    print("two points of the line:", list(lines.values())[0])
    pE, pA = (list(lines.values())[0]), (list(lines.values())[0])
    l_length = math.hypot(pA[0]-pE[0], pA[1]-pE[1], pA[2]-pE[2])
    print('Line ', key, 'length', l_length)
    x, y, z = [pA[0], pE[0]], [pA[1], pE[1]], [pA[2], pE[2]]
    ax.scatter(x, y, z, c='red', s=100)
    ax.text((pE[0]/2+pA[0]/2)*1.1, (pE[1]/2+pA[1]/2)*1.1, pE[2]/2+pA[2]/2+0.1, key)
    # ax.text(pE[0] * 1.2, pE[1], pE[2] * 1.1, 'p')
    ax.plot(x, y, z, color='black')
    plt.pause(0.3)
    plt.tight_layout()
    print('*************************')
plt.show()
Then this is the error I'm getting.
C:\Users\gsger\anaconda3\envs\pythonProject\python.exe C:/Users/gsger/AppData/Local/Packages/PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0/LocalCache/local-packages/Python38/site-packages/scipy/optimize/2D.py
dict_keys(['l1', 'l2', 'l3'])
Traceback (most recent call last):
File "C:/Users/.../2D.py", line 78, in <module>
l_length=math.hypot(pA[0]-pE[0], pA[1]-pE[1], pA[2]-pE[2])
TypeError: 'Segment3D' object is not subscriptable
two points of the line: Segment3D(Point3D(0, 1, 1), Point3D(1707/1000, 2707/1000, 4))
Process finished with exit code 1
I think Point allowing you to use indices led you astray. Instead of indexing, try using the attributes for what you are interested in:
>>> Point(1, 2).x
1
>>> Point(1, 2).args # all SymPy objects know what their args are
(1, 2)
>>> s = Segment((1, 2), (3, 4))
>>> s.p1.args
(1, 2)
>>> s.p2.args
(3, 4)
Note, too, that you define pA and pE to be the same thing; I am guessing you mean them to be the two points defining the segment. So I would suggest iterating over key, seg to remind yourself that the value is a segment, and then using attributes within your loop:
for key, seg in lines.items():
    print("two points of the line:", seg)
    pE, pA = seg.p1, seg.p2
    l_length = seg.length.n()
    print('Line ', key, 'length', l_length)
    x, y, z = [list(i) for i in zip(pA, pE)]
    ax.scatter(x, y, z, c='red', s=100)
    m = seg.midpoint
    ax.text(m.x*1.1, m.y*1.1, m.z + 0.1, key)
    # ax.text(pE.x*1.2, pE.y, pE.z*1.1, 'p')
    ax.plot(x, y, z, color='black')
    plt.pause(0.3)
    plt.tight_layout()
    print('*************************')
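As for the other part of the question, getting one of the two points of each dictionary value in a plain format like "(0,1,2)": the coordinates in a Point3D's args are SymPy numbers, so a small helper (hypothetical, not part of SymPy) can convert them to plain floats:
# hypothetical helper: turn a SymPy Point3D into a plain float tuple,
# e.g. Point3D(1707/1000, 2707/1000, 4) -> (1.707, 2.707, 4.0)
def as_tuple(p):
    return tuple(float(c) for c in p.args)

for key, seg in lines.items():
    print(key, as_tuple(seg.p1))  # just the first endpoint, i.e. "half" of each value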

Number format in Python

I want the legend of the plot to show the value from a list, but what I get is the element index, not the value itself. I don't know how to fix it. I'm referring to the plt.plot line. Thanks for the help.
import matplotlib.pyplot as plt
import numpy as np

x = np.random.random(1000)
y = np.random.random(1000)
n = len(x)

d_ij = []
for i in range(n):
    for j in range(i+1, n):
        a = np.sqrt((x[i]-x[j])**2 + (y[i]-y[j])**2)
        d_ij.append(a)

epsilon = np.linspace(0.01, 1, num=10)
sigma = np.linspace(0.01, 1, num=10)

def lj_pot(epsi, sig, d):
    result = []
    for i in range(len(d)):
        a = 4*epsi*((sig/d[i])**12 - (sig/d[i])**6)
        result.append(a)
    return result

for i in range(len(epsilon)):
    for j in range(len(sigma)):
        a = epsilon[i]
        b = sigma[j]
        plt.cla()
        plt.ylim([-1.5, 1.5])
        plt.xlim([0, 2])
        plt.plot(sorted(d_ij), lj_pot(epsilon[i], sigma[j], sorted(d_ij)), label='epsilon = %d, sigma = %d' % (a, b))
        plt.legend()
        plt.savefig("epsilon_%d_sigma_%d.png" % (i, j))
plt.show()
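For what it's worth, the root cause is the %d conversion in the label: %d formats its argument as an integer, so a float like 0.34 is silently truncated to 0, which is why the labels look like indices. A quick demonstration:
a, b = 0.34, 0.67
print('epsilon = %d, sigma = %d' % (a, b))      # prints: epsilon = 0, sigma = 0
print('epsilon = %.2f, sigma = %.2f' % (a, b))  # prints: epsilon = 0.34, sigma = 0.67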
Your code is a bit unpythonic, so I tried to clean it up to the best of my knowledge. numpy.random.random and numpy.random.uniform(0, 1) are basically the same; however, the latter also allows you to pass the shape of the return array that you would like to have, in this case an array with 1000 rows and two columns ((1000, 2)). I then use some magic to assign the two columns of the return array to x and y in the same line, respectively.
numpy.hypot does as the name suggests and calculates the hypotenuse from x and y. It can also do that element-wise for arrays of the same size, saving you the for loops, which you should try to avoid in Python since they are pretty slow.
You used plt for all your plotting, which is fine as long as you only have one figure, but I would recommend to be as explicit as possible, according to one of Python's key notions:
explicit is better than implicit.
I recommend you read through this guide, in particular the section called 'Stateful Versus Stateless Approaches'. I changed your commands accordingly.
It is also very unpythonic to loop over items of a list using the index of the item in the list like you did (for i in range(len(list)): item = list[i]). You can just reference the item directly (for item in list:).
Lastly I changed your formatted strings to the more convenient f-strings. Have a read here.
import matplotlib.pyplot as plt
import numpy as np

def pot(epsi, sig, d):
    result = 4*epsi*((sig/d)**12 - (sig/d)**6)
    return result

# I am not sure why you would create the independent variable this way,
# maybe you are simulating something. In that case, the code below is
# simpler than your version and should achieve the same.
# x, y = zip(*np.random.uniform(0, 1, (1000, 2)))
# d = np.array(sorted(np.hypot(x, y)))
# If you only want to plot your pot function then creating the value range
# like this is just fine.
d = np.linspace(0.001, 1, 1000)

epsilons = sigmas = np.linspace(0.01, 1, num=10)

fig, ax = plt.subplots()
ax.set_xlim([0, 2])
ax.set_ylim([-1.5, 1.5])

line = None
for epsilon in epsilons:
    for sigma in sigmas:
        if line is None:
            line = ax.plot(
                d, pot(epsilon, sigma, d),
                label=f'epsilon = {epsilon}, sigma = {sigma}'
            )[0]
            fig.legend()
        else:
            line.set_data(d, pot(epsilon, sigma, d))
        # plt.savefig(f"epsilon_{epsilon}_sigma_{sigma}.png")
fig.show()

How to uniformly resample a non-uniform signal using SciPy?

I have an (x, y) signal with a non-uniform sample rate in x. (The sample rate is roughly proportional to 1/x.) I attempted to re-sample it uniformly using scipy.signal's resample function. From what I understand from the documentation, I could pass it the following arguments:
scipy.signal.resample(array_of_y_values, number_of_sample_points, array_of_x_values)
and it would return the array of
[[resampled_y_values],[new_sample_points]]
I'd expect it to return uniformly sampled data of roughly identical form to the original, with the same minimal and maximal x values. But it doesn't:
# nu_data = [[x1, x2, ..., xn], [y1, y2, ..., yn]]
# with x values in ascending order
length = len(nu_data[0])
resampled = sg.resample(nu_data[1], length, nu_data[0])
uniform_data = np.array([resampled[1], resampled[0]])
plt.plot(nu_data[0], nu_data[1], uniform_data[0], uniform_data[1])
plt.show()
blue: nu_data, orange: uniform_data
It doesn't look unaltered, and the x scale has been resized too. If I try to fix the range by constructing the desired uniform x values myself and using them instead, the distortion remains:
length = len(nu_data[0])
resampled = sg.resample(nu_data[1], length, nu_data[0])
delta = (nu_data[0,-1] - nu_data[0,0]) / length
new_samplepoints = np.arange(nu_data[0,0], nu_data[0,-1], delta)
uniform_data = np.array([new_samplepoints, resampled[0]])
plt.plot(nu_data[0], nu_data[1], uniform_data[0], uniform_data[1])
plt.show()
What is the proper way to re-sample my data uniformly, if not this?
scipy.signal.resample is FFT-based, so it assumes the input is uniformly sampled (and periodic); that is why it distorts non-uniformly sampled data. Interpolation is the usual way to resample such a signal. Please look at this rough solution:
import matplotlib.pyplot as plt
from scipy import interpolate
import numpy as np
x = np.array([0.001, 0.002, 0.005, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1, 2, 5, 10, 20])
y = np.exp(-x/3.0)
flinear = interpolate.interp1d(x, y)
fcubic = interpolate.interp1d(x, y, kind='cubic')
xnew = np.arange(0.001, 20, 1)
ylinear = flinear(xnew)
ycubic = fcubic(xnew)
plt.plot(x, y, 'X', xnew, ylinear, 'x', xnew, ycubic, 'o')
plt.show()
That is a slightly updated example from the scipy page. If you execute it, you should see something like this:
Blue crosses are the initial function, your signal with its non-uniform sampling distribution. And there are two results: orange x markers representing linear interpolation, and green dots for cubic interpolation. The question is which option you prefer. Personally I don't like either of them; that is why I usually take 4 points and interpolate between them, then the next points, to get cubic interpolation without those strange bumps. That is much more work, and I can't see doing it with scipy, so it will be slow. That is why I asked about the size of the data.
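If the overshoot of plain cubic interpolation is the main objection, scipy also ships a shape-preserving cubic (PCHIP) that avoids those bumps without the manual four-point work. A sketch on the same data, assuming PchipInterpolator's monotone behavior is acceptable for the signal:
from scipy.interpolate import PchipInterpolator
import numpy as np
x = np.array([0.001, 0.002, 0.005, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1, 2, 5, 10, 20])
y = np.exp(-x/3.0)
xnew = np.linspace(x[0], x[-1], 200)  # uniform grid over the original range
ynew = PchipInterpolator(x, y)(xnew)  # monotone cubic: no overshoot between points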

Python boxplot showing means and confidence intervals

How can I create a boxplot like the one below, in Python? I want to depict means and confidence bounds only (rather than proportions of IQRs, as in matplotlib boxplot).
I don't have any version constraints, and if your answer has some package dependency that's OK too. Thanks!
Use errorbar instead. Here is a minimal example:
import matplotlib.pyplot as plt
x = [2, 4, 3]
y = [1, 3, 5]
errors = [0.5, 0.25, 0.75]
plt.figure()
plt.errorbar(x, y, xerr=errors, fmt = 'o', color = 'k')
plt.yticks((0, 1, 3, 5, 6), ('', 'x3', 'x2', 'x1',''))
Note that boxplot is not the right approach; the conf_intervals parameter only controls the placement of the notches on the boxes (and we don't want boxes anyway, let alone notched boxes). There is no way to customize the whiskers except as a function of IQR.
Thanks to America, I propose a way to automate this kind of graph a little bit.
Below is an example of code generating 20 arrays from a normal distribution with mean=0.25 and std=0.1.
I used the formula W = t * s / sqrt(n) to calculate the margin of error of the confidence interval, with t the critical value from the t distribution (see scipy.stats.t), s the sample standard deviation, and n the number of values in an array.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

list_samples = list()  # making a list of arrays
for i in range(20):
    list_samples.append(np.random.normal(loc=0.25, scale=0.1, size=20))

def W_array(array, conf=0.95):  # function that returns W based on the array provided
    t = stats.t(df=len(array) - 1).ppf((1 + conf) / 2)
    W = t * np.std(array, ddof=1) / np.sqrt(len(array))
    return W  # the error

W_list = list()
mean_list = list()
for i in range(len(list_samples)):
    W_list.append(W_array(list_samples[i]))  # makes a list of W for each array
    mean_list.append(np.mean(list_samples[i]))  # same for the means to plot

plt.errorbar(x=mean_list, y=range(len(list_samples)), xerr=W_list, fmt='o', color='k')
plt.axvline(.25, ls='--')  # this is only to demonstrate that ~95%
                           # of the 95% CIs contain the actual mean
plt.yticks([])
plt.show()
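Equivalently, scipy can compute the whole interval in one call rather than via the W formula. A sketch on a single array, assuming the same 95% level:
arr = np.random.normal(loc=0.25, scale=0.1, size=20)
lo, hi = stats.t.interval(0.95, df=len(arr) - 1,
                          loc=np.mean(arr), scale=stats.sem(arr))
# hi - np.mean(arr) equals W_array(arr) from above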

numpy/scipy equivalent of R ecdf(x)(x) function?

What is the equivalent of R's ecdf(x)(x) function in Python, in either numpy or scipy? Is ecdf(x)(x) basically the same as:
import numpy as np
def ecdf(x):
    # normalize X to sum to 1
    x = x / np.sum(x)
    return np.cumsum(x)
or is something else required?
EDIT how can one control the number of bins used by ecdf?
The OP's implementation of ecdf is wrong: you are not supposed to cumsum() the values. So not ys = np.cumsum(x)/np.sum(x), but ys = np.cumsum(1 for _ in x)/float(len(x)), or better, ys = np.arange(1, len(x)+1)/float(len(x)).
You either go with statmodels's ECDF if you are OK with that extra dependency or provide your own implementation. See below:
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.distributions.empirical_distribution import ECDF
%matplotlib inline
grades = (93.5,93,60.8,94.5,82,87.5,91.5,99.5,86,93.5,92.5,78,76,69,94.5,
89.5,92.8,78,65.5,98,98.5,92.3,95.5,76,91,95,61)
def ecdf_wrong(x):
    xs = np.sort(x)  # need to be sorted
    ys = np.cumsum(xs)/np.sum(xs)  # normalize so sum == 1
    return (xs, ys)

def ecdf(x):
    xs = np.sort(x)
    ys = np.arange(1, len(xs)+1)/float(len(xs))
    return xs, ys
xs, ys = ecdf_wrong(grades)
plt.plot(xs, ys, label="wrong cumsum")
xs, ys = ecdf(grades)
plt.plot(xs, ys, label="handwritten", marker=">", markerfacecolor='none')
cdf = ECDF(grades)
plt.plot(cdf.x, cdf.y, label="statmodels", marker="<", markerfacecolor='none')
plt.legend()
plt.show()
Try these links:
statsmodels.ECDF
ECDF in python without step function?
Example code
import numpy as np
from statsmodels.distributions.empirical_distribution import ECDF
import matplotlib.pyplot as plt
data = np.random.normal(0,5, size=2000)
ecdf = ECDF(data)
plt.plot(ecdf.x,ecdf.y)
The ecdf function in R returns the empirical cumulative distribution function, so the exact equivalent would rather be:
def ecdf(x):
    x = np.sort(x)
    n = len(x)
    def _ecdf(v):
        # side='right' because we want Pr(x <= v)
        return np.searchsorted(x, v, side='right') / n
    return _ecdf
np.random.seed(42)
X = np.random.normal(size=10_000)
Fn = ecdf(X)
Fn([3, 2, 1]) - Fn([-3, -2, -1])
## array([0.9972, 0.9533, 0.682 ])
As shown, it gives the correct 68–95–99.7% probabilities for normal distribution.
This author has a very nice example of a user-written ECDF function: John Stachurski's Python lectures. His lecture series is geared towards graduate students in computational economics; however they are my go-to resource for anyone learning general scientific computing in Python.
Edit: This is a year old now, but I thought I'd still answer the "Edit" part of your question, in case you (or others) still find it useful.
There really aren't any "bins" with ECDFs as there are with histograms. If G is your empirical distribution function formed using data vector Z, G(x) is literally the number of occurrences of Z <= x, divided by len(Z). This requires no "binning" to determine. Thus there is a sense in which the ECDF retains all possible information about a dataset (since it must retain the entire dataset for calculations), whereas a histogram actually loses some information about the dataset by binning. I much prefer to work with ecdfs vs histograms when possible, for this reason.
Fun bonus: if you need to create a small-footprint ECDF-like object from very large streaming data, you should look into this "Data Skeletons" paper by McDermott et al.
data <- c(10, 20, 50, 40, 40, 30, 60, 70, 80, 90)
# Define a function to compute the ECDF
ecdf_func <- function(data) {
  Length <- length(data)
  sorted <- sort(data)
  ecdf <- rep(0, Length)
  for (i in 1:Length) {
    ecdf[i] <- sum(sorted <= data[i]) / Length
  }
  return(ecdf)
}
ecdf <- ecdf_func(data)
print(ecdf)
Output:
[1] 0.1 0.2 0.6 0.5 0.5 0.3 0.7 0.8 0.9 1.0
# With stats library
library(stats)
ecdf_fun <- ecdf(data)
ecdf_ <- ecdf_fun(data)
print(ecdf_)
Output:
[1] 0.1 0.2 0.6 0.5 0.5 0.3 0.7 0.8 0.9 1.0
