I am trying to create a function that takes the name of a NumPy distribution as an argument and plots a histogram of random data points drawn from that distribution.
It is okay if it only works for NumPy distributions that require a single argument. I am just really stuck trying to build the np.random.<distribution>() call...
# import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
#Define a function (Fnc) that produces random numpy distributions (dstr)
#Fnc args: [npy dstr name as lst of str], [num of data pts]
def get_rand_dstr(dstr_name):
    npdstr = dstr_name
    dstr = np.random.npdstr(user.input("How many datapoints?"))
    #here pass each dstr from dstr_name through for loop
    #for loop will prompt user for required args of dstr (nbr of desired datapoints)
    return plt.hist(df)
get_rand_dstr('chisquare')
Try this code; it might help you:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
def get_rand_dstr(dstr_name):
    # npdstr = dstr_name
    # Build the call as a string; this part must be adapted per distribution,
    # because the number of arguments differs between distributions.
    dstr = 'np.random.{}({})'.format(dstr_name, input("How many datapoints?"))
    print(dstr)
    df = eval(dstr)
    print(df)
    # dstr1 = np.random.chisquare(int(input("How many datapoints?")))
    # print(dstr1)
    return plt.hist(df)
# get_rand_dstr('geometric')
get_rand_dstr('chisquare')
The accepted answer is incorrect and does not work. The problem is that NumPy's random distributions take different required arguments, so it's a little fiddly to pass size to all of them because it's a kwarg. (That's why the example in the accepted solution returns the wrong number of samples — only 1, not the 5 that were requested. That's because the first argument for chisquare is df, not size.)
It's common to want to invoke functions by name. Besides not working, the accepted answer uses eval(), which is a commonly suggested solution to this problem. It is generally considered a bad idea, though, mainly because eval() executes whatever string it is given, so user input can end up running arbitrary code.
A better way to achieve what you want is to define a dictionary that maps strings representing the names of functions to the functions themselves. For example:
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
DISTRIBUTIONS = {
'standard_cauchy': np.random.standard_cauchy,
'standard_exponential': np.random.standard_exponential,
'standard_normal': np.random.standard_normal,
'chisquare': lambda size: np.random.chisquare(df=1, size=size),
}
def get_rand_dstr(dstr_name):
    npdstr = DISTRIBUTIONS[dstr_name]
    size = int(input("How many datapoints?"))
    dstr = npdstr(size=size)
    return plt.hist(dstr)
get_rand_dstr('chisquare')
This works fine — for the functions I made keys for. You could make more — there are 35 I think — but the problem is that they don't all have the same API. In other words, you can't call them all just with size as an argument. For example, np.random.chisquare() requires the parameter df or 'degrees of freedom'. Other functions require other things. You could make assumptions about those things and wrap all of the function calls (like I did, above, for chisquare)... if that's what you want to do?
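One way to wrap the differing APIs without eval() is sketched below. It is only an illustration: the DEFAULT_PARAMS table, its default values and the sample() helper are assumptions made for this example, and it uses the newer np.random.default_rng() Generator rather than the legacy np.random functions shown above.

```python
import numpy as np

# Assumed default parameters, chosen only for illustration.
DEFAULT_PARAMS = {
    "standard_normal": {},
    "standard_exponential": {},
    "chisquare": {"df": 1},
    "poisson": {"lam": 3.0},
    "binomial": {"n": 10, "p": 0.5},
}


def sample(dstr_name, size, seed=None, **overrides):
    """Draw `size` samples from the named distribution."""
    if dstr_name not in DEFAULT_PARAMS:
        raise KeyError(f"unsupported distribution: {dstr_name!r}")
    rng = np.random.default_rng(seed)
    # Caller-supplied keyword arguments replace the assumed defaults.
    params = {**DEFAULT_PARAMS[dstr_name], **overrides}
    # Look the method up by name on the Generator; no eval() needed.
    return getattr(rng, dstr_name)(size=size, **params)


data = sample("chisquare", 5, seed=0)
print(len(data))  # 5
```

Because size is always passed as a keyword, the number of samples matches what was requested regardless of which other parameters a distribution needs. The result can be handed to plt.hist() as before.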
I am trying to figure out how to implement a function in Python that generates arrays like this (please ignore the fact that the same number is repeated):
[[0101111010, 0101111010,0101111010, 0101111010,0101111010,0101111010,0101111010],
[0101111010,0101111010,0101111010,0101111010,0101111010,0101111010,0101111010]]
I wrote this code, but I don't know if it's the best approach:
import numpy as np
import random
sol_per_pop = 20
num_genes = 10
pop_size = (sol_per_pop,num_genes)
population = np.random.randint(2, size=pop_size)
print(population)
I don't want to use strings. I want to find the best solution. Thank you!
I don't really see what this would be useful for. But it might still be fun.
import random
int("{:b}".format(random.randint(0,1<<64)))
Or:
import random
r = random.randint(0,1<<64)
sum(10**i*((r>>i)&1) for i in range(64))
Or, if we must use numpy:
import numpy as np
significance = 10**np.cumsum(np.ones(64))
np.sum(significance*np.random.default_rng().integers(0, 2, 64))
Yet another idea:
import numpy as np
rng = np.random.default_rng()
significance = 10**np.cumsum(np.ones(64))
result = np.zeros(64)
result[rng.choice(range(64), rng.integers(0, 64))] = 1
np.sum(significance*result)
As you can see, there are many approaches to solve the same problem.
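For the population matrix in the question itself, a minimal sketch with the Generator API might look like the following. The seed value and the bit-packing step are assumptions added for illustration; the packing shows one way to get a single number per row without going through strings.

```python
import numpy as np

sol_per_pop = 20
num_genes = 10

rng = np.random.default_rng(seed=42)  # seed chosen arbitrarily
# One row per solution, one 0/1 entry per gene.
population = rng.integers(0, 2, size=(sol_per_pop, num_genes))

# Optional: pack each row of bits into one integer (most significant bit first).
weights = 2 ** np.arange(num_genes - 1, -1, -1)
values = population @ weights

print(population.shape)  # (20, 10)
```

Keeping the genes as a 2D array of 0s and 1s is usually more convenient than decimal numbers that merely look binary, since each gene can then be indexed and flipped directly.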
I have two examples to help you understand what I mean.
Example1 works:
import pandas as pd
import numpy as np
x_grid = np.linspace(-3, 3, 1000)
df = pd.read_excel('somefile.xlsx').dropna()
I called the method dropna() on the instance of a DataFrame object when creating it.
Example2 does not work:
from statsmodels.nonparametric.kde import KDEUnivariate
kde = KDEUnivariate(df).fit().evaluate(x_grid)
To make it work I need to create the instance of the Class first like this:
kde = KDEUnivariate(df)
And then call the methods one at a time
kde.fit()
grid = kde.evaluate(x_grid)
What is the logic behind this?
Thank you for any help!
When you try to do:
import pandas as pd
import numpy as np
x_grid = np.linspace(-3, 3, 1000)
df = pd.read_excel('somefile.xlsx').dropna()
from statsmodels.nonparametric.kde import KDEUnivariate
kde = KDEUnivariate(df).fit().evaluate(x_grid)
then you are actually calling evaluate() on the return value of the fit() method, which is None (of type NoneType).
It is the same as if you did this:
kde = KDEUnivariate(df)
kde = kde.fit()
grid = kde.evaluate(x_grid)
But you don't want this.
You want an instantiated, then fitted, KDEUnivariate object.
Then evaluate it.
That is why the appropriate way to call it is as follows:
kde = KDEUnivariate(df)
kde.fit()
grid = kde.evaluate(x_grid)
In this situation the KDEUnivariate() instance's evaluate() method works with the KDEUnivariate() instance itself, and with its fitted parameters, not with the return value of the instance's fit() method.
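The general rule is that a chain like a().b().c() only works when each call returns an object that has the next method. The sketch below uses two hypothetical toy classes (not statsmodels code) to show the difference between a fit() that returns nothing and one that returns self.

```python
class FitReturnsNone:
    """Toy estimator whose fit() mutates the object and returns None."""

    def __init__(self, data):
        self.data = list(data)
        self.mean = None

    def fit(self):
        self.mean = sum(self.data) / len(self.data)
        # No return statement, so the call evaluates to None.

    def evaluate(self, x):
        return x - self.mean


class FitReturnsSelf(FitReturnsNone):
    """Same estimator, but fit() returns the instance so calls can be chained."""

    def fit(self):
        super().fit()
        return self


# Calling the methods one at a time works for both classes.
model = FitReturnsNone([1, 2, 3])
model.fit()
print(model.evaluate(5))  # 3.0

# Chaining fails when fit() returns None ...
try:
    FitReturnsNone([1, 2, 3]).fit().evaluate(5)
except AttributeError as exc:
    chained_error = exc

# ... and works when fit() returns self.
chained_ok = FitReturnsSelf([1, 2, 3]).fit().evaluate(5)
```

This is also why the first example in the question works: dropna() returns a new DataFrame, so there is an object to assign or to keep calling methods on.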
I am using the Python version of the Shogun Toolbox.
I want to use the LinearTimeMMD, which accepts data under the streaming interface CStreamingFeatures. I have the data in the form of two RealFeatures objects: feat_p and feat_q. These work just fine with the QuadraticTimeMMD.
In order to use it with the LinearTimeMMD, I need to create StreamingFeatures objects from these. In this case, these would be StreamingRealFeatures, as far as I know.
My first approach was using this:
gen_p, gen_q = StreamingRealFeatures(feat_p), StreamingRealFeatures(feat_q)
This however does not seem to work: The LinearTimeMMD delivers warnings and an unrealistic result (growing constantly with the number of samples) and calling gen_p.get_dim_feature_space() returns -1. Also, if I try calling gen_p.get_streamed_features(100) this results in a Memory Access Error.
I tried another approach using StreamingFileFromFeatures:
streamFile_p = sg.StreamingFileFromRealFeatures()
streamFile_p.set_features(feat_p)
streamFile_q = sg.StreamingFileFromRealFeatures()
streamFile_q.set_features(feat_q)
gen_p = StreamingRealFeatures(streamFile_p, False, 100)
gen_q = StreamingRealFeatures(streamFile_q, False, 100)
But this results in the same situation with the same described problems.
It seems that in both cases, the contents of the RealFeatures object handed to the StreamingRealFeatures object cannot be accessed.
What am I doing wrong?
EDIT: I was asked for a small working example to show the error:
import os
SHOGUN_DATA_DIR=os.getenv('SHOGUN_DATA_DIR', '../../../data')
import shogun as sg
from shogun import StreamingRealFeatures
import numpy as np
from matplotlib import pyplot as plt
from scipy.stats import laplace, norm
def sample_gaussian_vs_laplace(n=220, mu=0.0, sigma2=1, b=np.sqrt(0.5)):
    # sample from both distributions
    X=norm.rvs(size=n)*np.sqrt(sigma2)+mu
    Y=laplace.rvs(size=n, loc=mu, scale=b)
    return X,Y
# Main Script
mu=0.0
sigma2=1
b=np.sqrt(0.5)
n=220
X,Y=sample_gaussian_vs_laplace(n, mu, sigma2, b)
# turn data into Shogun representation (columns vectors)
feat_p=sg.RealFeatures(X.reshape(1,len(X)))
feat_q=sg.RealFeatures(Y.reshape(1,len(Y)))
gen_p, gen_q = StreamingRealFeatures(feat_p), StreamingRealFeatures(feat_q)
print("Dimensions: ", gen_p.get_dim_feature_space())
print("Number of features: ", gen_p.get_num_features())
print("Number of vectors: ", gen_p.get_num_vectors())
test_features = gen_p.get_streamed_features(1)
print("success")
EDIT 2: The Output of the working example:
Dimensions: -1
Number of features: -1
Number of vectors: 1
Segmentation fault (core dumped)
EDIT 3: Additional Code with LinearTimeMMD using the RealFeatures directly.
mmd = sg.LinearTimeMMD()
kernel = sg.GaussianKernel(10, 1)
mmd.set_kernel(kernel)
mmd.set_p(feat_p)
mmd.set_q(feat_q)
mmd.set_num_samples_p(1000)
mmd.set_num_samples_q(1000)
alpha = 0.05
# Code taken from notebook example on
# http://www.shogun-toolbox.org/notebook/latest/mmd_two_sample_testing.html
# Location on page: In[16]
block_size=100
mmd.set_num_blocks_per_burst(block_size)
# compute an unbiased estimate in linear time
statistic=mmd.compute_statistic()
print("MMD_l[X,Y]^2=%.2f" % statistic)
EDIT 4: Additional code sample showing the growing mmd problem:
import os
SHOGUN_DATA_DIR=os.getenv('SHOGUN_DATA_DIR', '../../../data')
import shogun as sg
from shogun import StreamingRealFeatures
import numpy as np
from matplotlib import pyplot as plt
def mmd(n):
    X = [(1.0,i) for i in range(n)]
    Y = [(2.0,i) for i in range(n)]
    X = np.array(X)
    Y = np.array(Y)
    # turn data into Shogun representation (columns vectors)
    feat_p=sg.RealFeatures(X.reshape(2, len(X)))
    feat_q=sg.RealFeatures(Y.reshape(2, len(Y)))
    mmd = sg.LinearTimeMMD()
    kernel = sg.GaussianKernel(10, 1)
    mmd.set_kernel(kernel)
    mmd.set_p(feat_p)
    mmd.set_q(feat_q)
    mmd.set_num_samples_p(100)
    mmd.set_num_samples_q(100)
    alpha = 0.05
    block_size=100
    mmd.set_num_blocks_per_burst(block_size)
    # compute an unbiased estimate in linear time
    statistic=mmd.compute_statistic()
    print("N =", n)
    print("MMD_l[X,Y]^2=%.2f" % statistic)
    print()
for n in [1000, 10000, 15000, 20000, 25000, 30000]:
    mmd(n)
Output:
N = 1000
MMD_l[X,Y]^2=-12.69
N = 10000
MMD_l[X,Y]^2=-40.14
N = 15000
MMD_l[X,Y]^2=-49.16
N = 20000
MMD_l[X,Y]^2=-56.77
N = 25000
MMD_l[X,Y]^2=-63.47
N = 30000
MMD_l[X,Y]^2=-69.52
For some reason, the Python environment on my machine is broken, so I can't give a snippet in Python. But let me point to a working example in C++ which attempts to address the issues (https://gist.github.com/lambday/983830beb0afeb38b9447fd91a143e67).
I think the easiest way is to create a StreamingRealFeatures instance directly from a RealFeatures instance (like you tried the first time). Check the test1() and test2() methods in the gist, which show the equivalence of using RealFeatures and StreamingRealFeatures in the use case in question. The reason you were getting weird results when streaming directly is that, in order to start the streaming process, we need to call the start_parser method of the StreamingRealFeatures class. We handle these technicalities internally inside the MMD classes, but when using the streaming features directly, we need to invoke it separately (see the test3() method in my attached example).
Please note that the compute_statistic() method doesn't return MMD directly, but rather returns \frac{n_x\times n_y}{n_x+n_y}\times MMD^2 (as mentioned in the doc http://shogun.ml/api/latest/classshogun_1_1CMMD.html). With that in mind, maybe the results you are getting for varying number of samples make sense.
Hope it helps.
I intend for part of a program I'm writing to automatically generate Gaussian distributions of various statistics over multiple raw text sources. However, I'm having some issues generating the graphs as per the guide at:
python pylab plot normal distribution
The general gist of the plot code is as follows.
import numpy as np
import matplotlib.mlab as mlab
import matplotlib.pyplot as pyplot
meanAverage = 222.89219487179491 # typical value calculated beforehand
standardDeviation = 3.8857889432054091 # typical value calculated beforehand
x = np.linspace(-3,3,100)
pyplot.plot(x,mlab.normpdf(x,meanAverage,standardDeviation))
pyplot.show()
All it does is produce a rather flat looking and useless y = 0 line!
Can anyone see what the problem is here?
Cheers.
According to the documentation of matplotlib.mlab.normpdf, this function is deprecated and you should use scipy.stats.norm.pdf instead.
Deprecated since version 2.2: scipy.stats.norm.pdf
And because your distribution mean is about 222, the x range should be centered there, for example four standard deviations on either side of the mean, instead of -3 to 3.
So your code will look like:
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as pyplot
meanAverage = 222.89219487179491 # typical value calculated beforehand
standardDeviation = 3.8857889432054091 # typical value calculated beforehand
# cover four standard deviations on either side of the mean
x = np.linspace(meanAverage - 4 * standardDeviation, meanAverage + 4 * standardDeviation, 100)
pyplot.plot(x, norm.pdf(x, meanAverage, standardDeviation))
pyplot.show()
It looks like you made a few small but significant errors. You are either choosing your x vector wrong or you swapped your stddev and mean. Since your mean is at 222, you probably want your x vector in this area, maybe something like 150 to 300. This way you get all the good stuff. Right now you are looking at -3 to 3, which is far out in the tail of the distribution. Hope that helps.
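A quick numeric check makes the flat line easy to see. The sketch below writes the normal density out by hand with NumPy (so it does not depend on mlab or SciPy) and compares the two x ranges. The helper name normal_pdf is an assumption for this example.

```python
import numpy as np


def normal_pdf(x, mean, sd):
    """Density of a normal distribution, written out explicitly."""
    z = (np.asarray(x, dtype=float) - mean) / sd
    return np.exp(-0.5 * z**2) / (sd * np.sqrt(2.0 * np.pi))


mean_average = 222.89219487179491
standard_deviation = 3.8857889432054091

x_original = np.linspace(-3, 3, 100)     # range used in the question
x_centered = np.linspace(150, 300, 100)  # range suggested above

peak_original = normal_pdf(x_original, mean_average, standard_deviation).max()
peak_centered = normal_pdf(x_centered, mean_average, standard_deviation).max()

print(peak_original)  # 0.0, the density underflows this far from the mean
print(peak_centered)  # close to 1 / (sd * sqrt(2 * pi))
```

The original range sits more than 50 standard deviations below the mean, so every plotted value is numerically zero, which is exactly the y = 0 line in the plot.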
I see that for the *args, where you pass meanAverage and standardDeviation, the documentation expects:
mu : a numdims array of means of a
sigma : a numdims array of standard deviation of a
Does this help?
I have been unable to find this function in any of the standard packages, so I wrote the one below. Before throwing it toward the Cheeseshop, however, does anyone know of an already published version? Alternatively, please suggest any improvements. Thanks.
def fivenum(v):
    """Returns Tukey's five number summary (minimum, lower-hinge, median, upper-hinge, maximum) for the input vector, a list or array of numbers based on 1.5 times the interquartile distance"""
    import numpy as np
    from scipy.stats import scoreatpercentile
    try:
        np.sum(v)
    except TypeError:
        print('Error: you must provide a list or array of only numbers')
    q1 = scoreatpercentile(v,25)
    q3 = scoreatpercentile(v,75)
    iqd = q3-q1
    md = np.median(v)
    whisker = 1.5*iqd
    return np.min(v), md-whisker, md, md+whisker, np.max(v),
pandas Series and DataFrame have a describe method, which is similar to R's summary:
In [3]: import numpy as np
In [4]: import pandas as pd
In [5]: s = pd.Series(np.random.rand(100))
In [6]: s.describe()
Out[6]:
count 100.000000
mean 0.540376
std 0.296250
min 0.002514
25% 0.268722
50% 0.593436
75% 0.831067
max 0.991971
NaNs are handled correctly.
I would get rid of these two things:
import numpy as np
from scipy.stats import scoreatpercentile
You should be importing at the module level. This means that users will be aware of missing dependencies as soon as they import your module, rather than when they call the function.
try:
    sum(v)
except TypeError:
    print('Error: you must provide a list or array of only numbers')
Several problems with this:
Don't type check in Python. Document what the function takes.
How do you know callers will see this? They might not be running at a console, and even if they are, they might not want your error message interfering with their output.
Don't type check in Python.
If you do want to raise some sort of exception for invalid data (not type checking), either let an existing exception propagate, or wrap it in your own exception type.
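Applied to the function from the question, those suggestions could look roughly like the sketch below. The FiveNumError name is hypothetical, np.percentile is swapped in for scoreatpercentile to keep the example NumPy-only, and the calculation itself is left as in the question.

```python
import numpy as np  # module-level import, as suggested above


class FiveNumError(ValueError):
    """Hypothetical exception type for input that cannot be summarized."""


def fivenum(v):
    """Return (minimum, median - whisker, median, median + whisker, maximum).

    `v` is a sequence of numbers; whisker is 1.5 times the interquartile distance.
    """
    try:
        data = np.asarray(v, dtype=float)
    except (TypeError, ValueError) as exc:
        # Wrap the underlying error instead of printing a message.
        raise FiveNumError("expected a sequence of numbers") from exc
    if data.size == 0:
        raise FiveNumError("expected at least one number")
    q1, md, q3 = np.percentile(data, [25, 50, 75])
    whisker = 1.5 * (q3 - q1)
    return data.min(), md - whisker, md, md + whisker, data.max()


print(fivenum([1, 2, 3, 4, 5]))
```

Callers who care can catch FiveNumError (or plain ValueError), and nothing is written to the console on their behalf.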
In case anybody ever needs a version that works with NaN in the data, here is my modification. I didn't want to change the original poster's answer, to avoid confusion.
import numpy as np
from scipy.stats import scoreatpercentile
def fivenum(v):
    """Returns Tukey's five number summary (minimum, lower-hinge, median, upper-hinge, maximum) for the input vector, a list or array of numbers based on 1.5 times the interquartile distance"""
    try:
        np.sum(v)
    except TypeError:
        print('Error: you must provide a list or array of only numbers')
    v = np.asarray(v, dtype=float)  # the boolean masking below needs an array
    valid = v[~np.isnan(v)]  # drop NaN before computing the quartiles
    q1 = scoreatpercentile(valid, 25)
    q3 = scoreatpercentile(valid, 75)
    iqd = q3-q1
    # np.nanmedian replaces scipy.stats.nanmedian, which newer SciPy releases no longer provide
    md = np.nanmedian(v)
    whisker = 1.5*iqd
    return np.nanmin(v), md-whisker, md, md+whisker, np.nanmax(v),
Try this:
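import numpy as np
from scipy.stats import scoreatpercentile
data = np.random.randn(5)
# median and whisker as defined in the question
md = np.median(data)
whisker = 1.5 * (scoreatpercentile(data, 75) - scoreatpercentile(data, 25))
print((min(data), md-whisker, md, md+whisker, max(data)))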
I am new to Python, but the return value is calculated incorrectly: it should be max(min(v), q1-whisker) for the lower bound and min(max(v), q3+whisker) for the upper bound. That is how it's done in R (the summary() function), and that's what shows up on the boxplots in matplotlib.pyplot and in R.
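The bounds described in this comment can be written down directly. This is a minimal sketch of that formula only; the function name whisker_bounds is an assumption, and np.percentile with its default linear interpolation stands in for scoreatpercentile.

```python
import numpy as np


def whisker_bounds(v):
    """Lower and upper bounds as given in the comment above."""
    data = np.asarray(v, dtype=float)
    q1, q3 = np.percentile(data, [25, 75])
    whisker = 1.5 * (q3 - q1)
    lower = max(data.min(), q1 - whisker)
    upper = min(data.max(), q3 + whisker)
    return lower, upper


# Quartiles are 2 and 4, so whisker = 3; the outlier 100 is clipped to 7.
print(whisker_bounds([1, 2, 3, 4, 100]))
```

The bounds are anchored at the quartiles rather than at the median, so they never extend past the actual minimum and maximum of the data.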
Minimal, but it gets the job done. :)
import numpy as np
[round(np.percentile(results[:,4], i), 1) for i in [1, 2, 5, 10, 25, 50]]
import numpy as np
np_array = np.random.random(100)  # example data
np.percentile(np_array, [0, 25, 50, 75, 100])
Percentile selection can be configured with the interpolation argument (renamed to method in NumPy 1.22), which is linear by default.
import pandas as pd
def fivenum(x):
    series=pd.Series(x)
    mi = series.min()
    q1 = series.quantile(q=0.25, interpolation='nearest')
    me = series.median()
    q3 = series.quantile(q=0.75, interpolation='nearest')
    ma = series.max()
    return pd.Series([mi, q1, me, q3, ma], index=['min', 'q1', 'median', 'q3', 'max'])