I have some data and I can fit a gamma distribution to it using, for example, this code taken from Fitting a gamma distribution with (python) Scipy.
import scipy.stats as ss
import scipy as sp
Generate some gamma data:
alpha=5
loc=100.5
beta=22
data=ss.gamma.rvs(alpha,loc=loc,scale=beta,size=10000)
print(data)
# [ 202.36035683 297.23906376 249.53831795 ..., 271.85204096 180.75026301
# 364.60240242]
Here we fit the data to the gamma distribution:
fit_alpha,fit_loc,fit_beta=ss.gamma.fit(data)
print(fit_alpha,fit_loc,fit_beta)
# (5.0833692504230008, 100.08697963283467, 21.739518937816108)
print(alpha,loc,beta)
# (5, 100.5, 22)
I can also fit an exponential distribution to the same data. I would, however, like to do a likelihood ratio test. To do this I don't just need to fit the distributions; I also need to get the likelihood back. How can you do that in Python?
You can compute the log-likelihood of the data by calling the logpdf method of stats.gamma and then summing the resulting array.
The first bit of code is from your example:
In [63]: import scipy.stats as ss
In [64]: np.random.seed(123)
In [65]: alpha = 5
In [66]: loc = 100.5
In [67]: beta = 22
In [68]: data = ss.gamma.rvs(alpha, loc=loc, scale=beta, size=10000)
In [70]: data
Out[70]:
array([ 159.73200869, 258.23458137, 178.0504184 , ..., 281.91672824,
164.77152977, 145.83445141])
In [71]: fit_alpha, fit_loc, fit_beta = ss.gamma.fit(data)
In [72]: fit_alpha, fit_loc, fit_beta
Out[72]: (4.9953385276512883, 101.24295938462399, 21.992307537192605)
Here's how to compute the log-likelihood:
In [73]: loglh = ss.gamma.logpdf(data, fit_alpha, fit_loc, fit_beta).sum()
In [74]: loglh
Out[74]: -52437.410641032831
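To go one step further toward the likelihood ratio test mentioned in the question, here is a minimal sketch (my own addition, not part of the fitting example above). Since the exponential is a gamma with shape fixed at 1, the models are nested and differ by one free parameter, so the standard chi-square approximation can be applied:
import numpy as np
import scipy.stats as ss

np.random.seed(123)
data = ss.gamma.rvs(5, loc=100.5, scale=22, size=10000)

# log-likelihood under the fitted gamma (3 free parameters: a, loc, scale)
g_params = ss.gamma.fit(data)
loglh_gamma = ss.gamma.logpdf(data, *g_params).sum()

# log-likelihood under the fitted exponential (2 free parameters: loc, scale)
e_params = ss.expon.fit(data)
loglh_expon = ss.expon.logpdf(data, *e_params).sum()

# likelihood ratio statistic; 1 degree of freedom (3 - 2 parameters)
lr_stat = 2 * (loglh_gamma - loglh_expon)
p_value = ss.chi2.sf(lr_stat, df=1)
print(lr_stat, p_value)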
How do I create a function that loops through a numpy matrix to z-scale each data point, returning the standardized data, just like sklearn.preprocessing.StandardScaler does? I have got this far with no success. Can somebody help me with this?
import numpy as np

def stand_scaler(data):
    mean = np.mean(data, axis=0)
    std = np.std(data, axis=0)
    for i in range(len(data)):
        data[i] = (data[i] - mean)/std
    return data

stand_scaler(data)
You shouldn't need a for-loop for this; numpy's array operations are intended for exactly this case. For a one dimensional array it's straightforward:
In [1]: import numpy as np
In [2]: x = np.random.normal(size=10)
In [3]: nx = (x - x.mean()) / x.std()
In [4]: x
Out[4]:
array([ 0.52700345, -0.57358563, -0.16925383, 2.14401554, 1.05223331,
0.72659482, 1.06816826, 0.31194848, 0.04004589, 1.09046925])
In [5]: nx
Out[5]:
array([-0.12859083, -1.62209992, -1.0734181 , 2.06570881, 0.58415071,
0.14225641, 0.60577458, -0.42042233, -0.78939654, 0.63603721])
In [6]: nx.mean()
Out[6]: 5.551115123125783e-17
In [7]: nx.std()
Out[7]: 1.0000000000000002
For higher dimensions, you can choose an axis to work over and scale using numpy's broadcasting; in the following example, imagine each column is a different variable:
In [8]: y = np.array([10,1]) * np.random.normal(size=(5,2)) - np.array([5,-10])
In [9]: ny = (y - y.mean(axis=0)) / y.std(axis=0)
In [10]: ny
Out[10]:
array([[ 0.78076062, -0.26971997],
[-1.59591909, -1.2409338 ],
[-0.55740483, -0.81901609],
[ 1.22978416, 1.12697814],
[ 0.14277914, 1.20269171]])
In [11]: ny.mean(axis=0), ny.std(axis=0)
Out[11]: (array([-3.33066907e-17, 8.43769499e-16]), array([1., 1.]))
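Putting this together, a broadcasting version of the stand_scaler function from the question might look like the following sketch (the keepdims trick is my own choice so the same function works along either axis):
import numpy as np

def stand_scaler(data, axis=0):
    """Z-scale data along the given axis using broadcasting (no Python loop)."""
    mean = data.mean(axis=axis, keepdims=True)
    std = data.std(axis=axis, keepdims=True)
    return (data - mean) / std

# example: standardize each column of a 2-D array
data = np.random.normal(size=(5, 2))
scaled = stand_scaler(data)
print(scaled.mean(axis=0))  # ~ 0 for each column
print(scaled.std(axis=0))   # ~ 1 for each column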
I'm having a problem with the simplest example of linear regression: the output coefficients are zero. What am I doing wrong? Thanks for the help.
import sklearn.linear_model as lm
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
x = [25,50,75,100]
y = [10.5,17,23.25,29]
pred = [27,41,22,33]
df = pd.DataFrame({'x':x, 'y':y, 'pred':pred})
x = df['x'].values.reshape(1,-1)
y = df['y'].values.reshape(1,-1)
pred = df['pred'].values.reshape(1,-1)
plt.scatter(x,y,color='black')
clf = lm.LinearRegression(fit_intercept =True)
clf.fit(x,y)
m=clf.coef_[0]
b=clf.intercept_
print("slope=",m, "intercept=",b)
Output:
slope= [ 0. 0. 0. 0.] intercept= [ 10.5 17. 23.25 29. ]
Think it through for a second. The fact that multiple coefficients are returned suggests the model thinks you have multiple features. Since it's a simple regression, the problem lies in the shape of your input data: your original reshaping made the class think you had 4 variables and only one observation per variable.
Try something like this:
import sklearn.linear_model as lm
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
x = np.array([25,99,75,100, 3, 4, 6, 80])[..., np.newaxis]
y = np.array([10.5,17,23.25,29, 1, 2, 33, 4])[..., np.newaxis]
clf = lm.LinearRegression()
clf.fit(x,y)
clf.coef_
Output:
array([[ 0.09399429]])
As @jrjames83 has already explained in his answer, after reshaping (.reshape(1,-1)) you were feeding a data set containing one sample (row) and four features (columns):
In [103]: x.shape
Out[103]: (1, 4)
most probably you wanted to reshape it this way:
In [104]: x = df['x'].values.reshape(-1, 1)
In [105]: x.shape
Out[105]: (4, 1)
so that you would have four samples and one feature...
alternatively you could pass DataFrame columns to your model as follows (no need to pollute your memory with additional variables):
In [98]: clf = lm.LinearRegression(fit_intercept =True)
In [99]: clf.fit(df[['x']],df['y'])
Out[99]: LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
In [100]: clf.coef_
Out[100]: array([0.247])
In [101]: clf.intercept_
Out[101]: 4.5
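For reference, here is a sketch of how the original script from the question could be fixed, using reshape(-1, 1) only for x, since scikit-learn expects a 2-D feature matrix and a 1-D target (the plotting lines are kept only to mirror the original code):
import sklearn.linear_model as lm
import pandas as pd
import matplotlib.pyplot as plt

x = [25, 50, 75, 100]
y = [10.5, 17, 23.25, 29]
df = pd.DataFrame({'x': x, 'y': y})

X = df['x'].values.reshape(-1, 1)   # 4 samples, 1 feature
clf = lm.LinearRegression(fit_intercept=True)
clf.fit(X, df['y'].values)

print("slope=", clf.coef_[0], "intercept=", clf.intercept_)
# slope ~= 0.247, intercept ~= 4.5 (matching the values above)

plt.scatter(df['x'], df['y'], color='black')
plt.plot(df['x'], clf.predict(X))
plt.show()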
Is there a method that I can call to create a random orthonormal matrix in Python, possibly using numpy? Or is there a way to create an orthonormal matrix using multiple numpy methods? Thanks.
Version 0.18 of scipy has scipy.stats.ortho_group and scipy.stats.special_ortho_group. The pull request where they were added is https://github.com/scipy/scipy/pull/5622
For example,
In [24]: from scipy.stats import ortho_group # Requires version 0.18 of scipy
In [25]: m = ortho_group.rvs(dim=3)
In [26]: m
Out[26]:
array([[-0.23939017, 0.58743526, -0.77305379],
[ 0.81921268, -0.30515101, -0.48556508],
[-0.52113619, -0.74953498, -0.40818426]])
In [27]: np.set_printoptions(suppress=True)
In [28]: m.dot(m.T)
Out[28]:
array([[ 1., 0., -0.],
[ 0., 1., 0.],
[-0., 0., 1.]])
You can obtain a random n x n orthogonal matrix Q, (uniformly distributed over the manifold of n x n orthogonal matrices) by performing a QR factorization of an n x n matrix with elements i.i.d. Gaussian random variables of mean 0 and variance 1. Here is an example:
import numpy as np
from scipy.linalg import qr
n = 3
H = np.random.randn(n, n)
Q, R = qr(H)
print (Q.dot(Q.T))
[[ 1.00000000e+00 -2.77555756e-17 2.49800181e-16]
[ -2.77555756e-17 1.00000000e+00 -1.38777878e-17]
[ 2.49800181e-16 -1.38777878e-17 1.00000000e+00]]
EDIT: (Revisiting this answer after the comment by @g g.) The claim above, that the QR decomposition of a Gaussian matrix provides an orthogonal matrix uniformly distributed over the (so-called) Stiefel manifold, is suggested by Theorems 2.3.18-19 of this reference. Note that the statement of the result suggests a "QR-like" decomposition, however with the triangular matrix R having positive diagonal elements.
Apparently, the qr function of scipy (and numpy) does not guarantee positive diagonal elements for R, and the corresponding Q is actually not uniformly distributed. This has been observed in this monograph, Sec. 4.6 (the discussion refers to MATLAB, but I guess both MATLAB and scipy use the same LAPACK routines). It is suggested there that the matrix Q provided by qr be modified by post-multiplying it with a random unitary diagonal matrix.
Below I reproduce the experiment from the above reference, plotting the empirical distribution (histogram) of the phases of the eigenvalues of the "direct" Q matrix provided by qr, as well as of the "modified" version; it is seen that the modified version does indeed have a uniform eigenvalue phase distribution, as would be expected from a uniformly distributed orthogonal matrix.
import numpy as np
import matplotlib.pyplot as plt
from scipy.linalg import qr, eigvals
from seaborn import distplot

n = 50
repeats = 10000

angles = []
angles_modified = []
for rp in range(repeats):
    H = np.random.randn(n, n)
    Q, R = qr(H)
    angles.append(np.angle(eigvals(Q)))
    # post-multiply Q by a random unitary diagonal matrix
    Q_modified = Q @ np.diag(np.exp(1j * np.pi * 2 * np.random.rand(n)))
    angles_modified.append(np.angle(eigvals(Q_modified)))

fig, ax = plt.subplots(1, 2, figsize=(10, 3))
distplot(np.asarray(angles).flatten(), kde=False,
         hist_kws=dict(edgecolor="k", linewidth=2), ax=ax[0])
ax[0].set(xlabel='phase', title='direct')
distplot(np.asarray(angles_modified).flatten(), kde=False,
         hist_kws=dict(edgecolor="k", linewidth=2), ax=ax[1])
ax[1].set(xlabel='phase', title='modified');
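As an aside (my own addition, following the standard construction rather than the code above): for a real orthogonal matrix, an equivalent deterministic fix is to absorb the signs of R's diagonal into Q, which amounts to requiring R to have a positive diagonal and also yields a Haar-distributed Q. A minimal sketch:
import numpy as np
from scipy.linalg import qr

def haar_orthogonal(n):
    """Random n x n orthogonal matrix, uniform with respect to Haar measure (sketch)."""
    H = np.random.randn(n, n)
    Q, R = qr(H)
    # flip the columns of Q corresponding to negative diagonal entries of R;
    # this is equivalent to Q @ np.diag(np.sign(np.diag(R)))
    return Q * np.sign(np.diag(R))

Q = haar_orthogonal(4)
print(np.allclose(Q @ Q.T, np.eye(4)))  # True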
This is the rvs method pulled from https://github.com/scipy/scipy/pull/5622/files, with minimal changes - just enough to run as a stand-alone numpy function.
import numpy as np

def rvs(dim=3):
    random_state = np.random
    H = np.eye(dim)
    D = np.ones((dim,))
    for n in range(1, dim):
        x = random_state.normal(size=(dim-n+1,))
        D[n-1] = np.sign(x[0])
        x[0] -= D[n-1]*np.sqrt((x*x).sum())
        # Householder transformation
        Hx = (np.eye(dim-n+1) - 2.*np.outer(x, x)/(x*x).sum())
        mat = np.eye(dim)
        mat[n-1:, n-1:] = Hx
        H = np.dot(H, mat)
    # Fix the last sign such that the determinant is 1
    D[-1] = (-1)**(1-(dim % 2))*D.prod()
    # Equivalent to np.dot(np.diag(D), H) but faster, apparently
    H = (D*H.T).T
    return H
It matches Warren's test, https://stackoverflow.com/a/38426572/901925
An easy way to create a semi-orthogonal matrix of any shape (n x m):
import numpy as np
n, m = 3, 5
H = np.random.rand(n, m)
u, s, vh = np.linalg.svd(H, full_matrices=False)
mat = u @ vh
print(mat @ mat.T)  # -> eye(n)
Note that if n > m, you would instead get mat.T @ mat = eye(m).
from scipy.stats import special_ortho_group
num_dim=3
x = special_ortho_group.rvs(num_dim)
See the scipy documentation for special_ortho_group.
If you want a non-square matrix with orthonormal column vectors, you can create a square one with any of the methods mentioned above and drop some columns.
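For example (a small sketch of the column-dropping idea):
import numpy as np
from scipy.stats import ortho_group

# a 5 x 3 matrix with orthonormal columns, obtained by generating a random
# 5 x 5 orthogonal matrix and keeping only the first three columns
Q = ortho_group.rvs(dim=5)[:, :3]
print(np.allclose(Q.T @ Q, np.eye(3)))  # True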
Numpy also has qr factorization. https://numpy.org/doc/stable/reference/generated/numpy.linalg.qr.html
import numpy as np
a = np.random.rand(3, 3)
q, r = np.linalg.qr(a)
q @ q.T
# array([[ 1.00000000e+00, 8.83206468e-17, 2.69154044e-16],
# [ 8.83206468e-17, 1.00000000e+00, -1.30466244e-16],
# [ 2.69154044e-16, -1.30466244e-16, 1.00000000e+00]])
I have an array of size MxN and I would like to compute the entropy of each row. What would be the fastest way to do so?
scipy.special.entr computes -x*log(x) for each element in an array. After calling that, you can sum the rows.
Here's an example. First, create an array p of positive values whose rows sum to 1:
In [23]: np.random.seed(123)
In [24]: x = np.random.rand(3, 10)
In [25]: p = x/x.sum(axis=1, keepdims=True)
In [26]: p
Out[26]:
array([[ 0.12798052, 0.05257987, 0.04168536, 0.1013075 , 0.13220688,
0.07774843, 0.18022149, 0.1258417 , 0.08837421, 0.07205402],
[ 0.08313743, 0.17661773, 0.1062474 , 0.01445742, 0.09642919,
0.17878489, 0.04420998, 0.0425045 , 0.12877228, 0.1288392 ],
[ 0.11793032, 0.15790292, 0.13467074, 0.11358463, 0.13429674,
0.06003561, 0.06725376, 0.0424324 , 0.05459921, 0.11729367]])
In [27]: p.shape
Out[27]: (3, 10)
In [28]: p.sum(axis=1)
Out[28]: array([ 1., 1., 1.])
Now compute the entropy of each row. entr uses the natural logarithm, so to get the base-2 log, divide the result by log(2).
In [29]: from scipy.special import entr
In [30]: entr(p).sum(axis=1)
Out[30]: array([ 2.22208731, 2.14586635, 2.22486581])
In [31]: entr(p).sum(axis=1)/np.log(2)
Out[31]: array([ 3.20579434, 3.09583074, 3.20980287])
If you don't want the dependency on scipy, you can use the explicit formula:
In [32]: (-p*np.log2(p)).sum(axis=1)
Out[32]: array([ 3.20579434, 3.09583074, 3.20980287])
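As a side note (my own addition, and an assumption about your scipy version): newer scipy releases also provide scipy.stats.entropy with base and axis keywords (the axis argument needs scipy >= 1.4), so the row-wise base-2 entropy can be written in one call:
from scipy import stats

# p is the array of row-wise probabilities from above;
# this should reproduce the same three values
print(stats.entropy(p, base=2, axis=1))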
As @Warren pointed out, it's unclear from your question whether you are starting out from an array of probabilities, or from the raw samples themselves. In my answer I've assumed the latter, in which case the main bottleneck will be computing the bin counts over each row.
Assuming that each vector of samples is relatively long, the fastest way to do this will probably be to use np.bincount:
import numpy as np
def entropy(x):
    """
    x is assumed to be an (nsignals, nsamples) array containing integers between
    0 and n_unique_vals
    """
    x = np.atleast_2d(x)
    nrows, ncols = x.shape
    nbins = x.max() + 1
    # count the number of occurrences for each unique integer between 0 and x.max()
    # in each row of x
    counts = np.vstack([np.bincount(row, minlength=nbins) for row in x])
    # divide by number of columns to get the probability of each unique value
    p = counts / float(ncols)
    # compute Shannon entropy in bits
    return -np.sum(p * np.log2(p), axis=1)
Although Warren's method of computing the entropies from the probability values using entr is slightly faster than using the explicit formula, in practice this is likely to represent a tiny fraction of the total runtime compared to the time taken to compute the bin counts.
Test correctness for a single row:
vals = np.arange(3)
prob = np.array([0.1, 0.7, 0.2])
row = np.random.choice(vals, p=prob, size=1000000)
print("theoretical H(x): %.6f, empirical H(x): %.6f" %
(-np.sum(prob * np.log2(prob)), entropy(row)[0]))
# theoretical H(x): 1.156780, empirical H(x): 1.157532
Test speed:
In [1]: %%timeit x = np.random.choice(vals, p=prob, size=(1000, 10000))
....: entropy(x)
....:
10 loops, best of 3: 34.6 ms per loop
If your data don't consist of integer indices between 0 and the number of unique values, you can convert them into this format using np.unique:
y = np.random.choice([2.5, 3.14, 42], p=prob, size=(1000, 10000))
unq, x = np.unique(y, return_inverse=True)
x.shape = y.shape
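The integer-coded array x can then be passed straight to the entropy function defined above, for example:
# x now holds integer codes 0, 1, 2 with the same shape as y,
# so the bincount-based entropy() above applies directly
row_entropies = entropy(x)
print(row_entropies[:5])  # per-row entropies in bits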
I am doing what I thought would be a simple regression on my data, however something is wrong. I use csv2rec to read my data, but when I print the regression parameters m and b I get nan nan.
In case you want to preview the csv file here is some of it:
"Oxide","ooh","oh",
"MoO",3.06,0.01,
"IrO",2.79,-0.23,
What I want is a regression on the two columns: x = a.oh and y = a.ooh.
Here is the script I am using
import matplotlib
import matplotlib.mlab as mlab
import matplotlib.pyplot as plt
from pylab import polyfit
a = mlab.csv2rec('rutilecsv.csv')
fig = plt.figure()
ax = fig.add_subplot(111)
ax.set_xlabel('E_OH / eV', fontsize=12)
ax.set_ylabel('E_OOH / eV', fontsize=12)
(m, b) = polyfit(a.oh, a.ooh, 1)
print m, b
ax.plot(a.oh, a.ooh, 'go')
plt.axis([-2, 3, 1, 6])
plt.show()
Okay, just to put this to bed, this is exactly the symptom you'd get if there were missing data:
"Oxide","ooh","oh",
"MoO",3.06,0.01,
"IrO",2.79,-0.23,
"ZZ",2.79,,
results in
In [7]: a.ooh
Out[7]: array([ 3.06, 2.79, 2.79])
In [8]: a.oh
Out[8]: array([ 0.01, -0.23, nan])
In [9]: polyfit(a.oh, a.ooh, 1)
Out[9]: array([ nan, nan])
If you want to ignore the missing data, you can simply pass polyfit only the points where both values exist:
In [15]: good_data = ~(numpy.isnan(a.oh) | numpy.isnan(a.ooh))
In [16]: good_data
Out[16]: array([ True, True, False], dtype=bool)
In [17]: a.oh[good_data]
Out[17]: array([ 0.01, -0.23])
In [18]: a.ooh[good_data]
Out[18]: array([ 3.06, 2.79])
In [19]: polyfit(a.oh[good_data], a.ooh[good_data], 1)
Out[19]: array([ 1.125 , 3.04875])
Two things to check:
Are the values converted properly?
Try a['oh'] and a['ooh'] to access the vectors.
And maybe use the names option to specify the column names when reading the file in.
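If csv2rec keeps causing trouble (it has been deprecated and removed in recent matplotlib versions), a sketch of the same workflow with pandas, reusing the NaN-masking idea from the answer above, might look like this (assuming the rutilecsv.csv layout shown in the question):
import numpy as np
import pandas as pd

a = pd.read_csv('rutilecsv.csv', usecols=['Oxide', 'ooh', 'oh'])

# keep only the rows where both columns have values, then fit as before
good = a['oh'].notna() & a['ooh'].notna()
m, b = np.polyfit(a.loc[good, 'oh'], a.loc[good, 'ooh'], 1)
print(m, b)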