I ran into a problem with the simplest example of linear regression: the output coefficients are all zero. What am I doing wrong? Thanks for the help.
import sklearn.linear_model as lm
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
x = [25,50,75,100]
y = [10.5,17,23.25,29]
pred = [27,41,22,33]
df = pd.DataFrame({'x':x, 'y':y, 'pred':pred})
x = df['x'].values.reshape(1,-1)
y = df['y'].values.reshape(1,-1)
pred = df['pred'].values.reshape(1,-1)
plt.scatter(x,y,color='black')
clf = lm.LinearRegression(fit_intercept =True)
clf.fit(x,y)
m=clf.coef_[0]
b=clf.intercept_
print("slope=",m, "intercept=",b)
Output:
slope= [ 0. 0. 0. 0.] intercept= [ 10.5 17. 23.25 29. ]
Think it through for a second. The fact that multiple coefficients are returned suggests the model thinks you have multiple features. Since this is a simple (single-feature) regression, the problem lies in the shape of your input data: your reshaping made the class think you had 4 features and only one sample.
Try something like this:
import sklearn.linear_model as lm
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
x = np.array([25,99,75,100, 3, 4, 6, 80])[..., np.newaxis]
y = np.array([10.5,17,23.25,29, 1, 2, 33, 4])[..., np.newaxis]
clf = lm.LinearRegression()
clf.fit(x,y)
clf.coef_
Output:
array([[ 0.09399429]])
As @jrjames83 has already explained in his answer, after reshaping (.reshape(1,-1)) you were feeding a data set containing one sample (row) and four features (columns):
In [103]: x.shape
Out[103]: (1, 4)
most probably you wanted to reshape it this way:
In [104]: x = df['x'].values.reshape(-1, 1)
In [105]: x.shape
Out[105]: (4, 1)
so that you would have four samples and one feature...
alternatively you could pass DataFrame columns to your model as follows (no need to pollute your memory with additional variables):
In [98]: clf = lm.LinearRegression(fit_intercept =True)
In [99]: clf.fit(df[['x']],df['y'])
Out[99]: LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
In [100]: clf.coef_
Out[100]: array([0.247])
In [101]: clf.intercept_
Out[101]: 4.5
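Putting it together, here is a minimal sketch of the original example with the corrected reshape (four samples, one feature); the printed values should match the coefficient and intercept shown above:
import sklearn.linear_model as lm
import pandas as pd

df = pd.DataFrame({'x': [25, 50, 75, 100], 'y': [10.5, 17, 23.25, 29]})
x = df['x'].values.reshape(-1, 1)   # shape (4, 1): four samples, one feature
y = df['y'].values                  # shape (4,): one target per sample

clf = lm.LinearRegression(fit_intercept=True)
clf.fit(x, y)
print("slope=", clf.coef_[0], "intercept=", clf.intercept_)
# slope ≈ 0.247, intercept ≈ 4.5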
Related
Given a 2D array, I would like to normalize it into the range 0-1.
I know this can be achieved as below:
import numpy as np
from sklearn.preprocessing import normalize,MinMaxScaler
np.random.seed(0)
t_feat=4
t_epoch=3
t_wind=2
result = [np.random.rand(t_epoch, t_feat) for _ in range(t_wind)]
wdw_epoch_feat=np.array(result)
matrix=wdw_epoch_feat[:,:,0]
xmax, xmin = matrix.max(), matrix.min()
x_norm = (matrix - xmin)/(xmax - xmin)
which produces
[[0.55153917 0.42094786 0.98439526], [0.57160496 0. 1. ]]
However, I cannot get the same result using the MinMaxScaler of sklearn
scaler = MinMaxScaler()
x_norm = scaler.fit_transform(matrix)
which produces
[[0. 1. 0.], [1. 0. 1.]]
I'd appreciate any thoughts.
You are normalizing the entire matrix at once. MinMaxScaler is designed for machine-learning feature matrices, so it scales each column (feature) independently. To reproduce your result you need to turn the 2D array into a single column first. I show this below and get your values back in the first column:
import numpy as np
from sklearn.preprocessing import normalize, MinMaxScaler
np.random.seed(0)
t_feat=4
t_epoch=3
t_wind=2
result = [np.random.rand(t_epoch, t_feat) for _ in range(t_wind)]
wdw_epoch_feat=np.array(result)
matrix=wdw_epoch_feat[:,:,0]
xmax, xmin = matrix.max(), matrix.min()
x_norm = (matrix - xmin)/(xmax - xmin)
matrix = np.array([matrix.flatten(), np.random.rand(len(matrix.flatten()))]).T
scaler = MinMaxScaler()
test = scaler.fit_transform(matrix)
print(test)
-------------------------------------------
[[0.55153917 0. ]
[0.42094786 0.63123194]
[0.98439526 0.03034732]
[0.57160496 1. ]
[0. 0.48835502]
[1. 0.35865137]]
When you use MinMaxScaler for machine learning, it scales each column (feature) independently. If you instead want a single min-max scaling over the whole matrix, a simple trick is to reshape your data into one column, apply the transform, and reshape it back to the original shape:
import numpy as np
from sklearn.preprocessing import MinMaxScaler
X = np.array([[-1, 2], [-0.5, 6]])
scaler = MinMaxScaler()
X_one_column = X.reshape([-1,1])
result_one_column = scaler.fit_transform(X_one_column)
result = result_one_column.reshape(X.shape)
print(result)
[[ 0. 0.42857143]
[ 0.07142857 1. ]]
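As a quick sanity check (a small sketch, not part of the original answer), this reshaped result matches the manual whole-matrix formula used in the question:
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[-1, 2], [-0.5, 6]])
manual = (X - X.min()) / (X.max() - X.min())          # scale over the whole matrix

scaler = MinMaxScaler()
scaled = scaler.fit_transform(X.reshape(-1, 1)).reshape(X.shape)
print(np.allclose(manual, scaled))                    # True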
I am preprocessing my data to make this work:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X, Y)
I am struggling to reshape my numpy.ndarray.
At this point, for Y I have:
Y
array([array([[52593.4410802]]), array([[52593.4410802]])], dtype=object)
Y.shape
(2,)
type(Y)
<class 'numpy.ndarray'>
And for X, I have:
X
array([array([[34.07824204],
[33.36032467],
[24.61158084],
...,
[34.62648953],
[34.49591937],
[34.40951467]]),
array([[ 4.50136316],
[ 7.46307729],
[17.07135805],
...,
[57.98715047],
[54.5733181 ],
[50.13691107]])], dtype=object)
X.shape
(2,)
type(X)
<class 'numpy.ndarray'>
I would like to take my X and transform it so that each inner array becomes a column/feature (the idea of a transpose). So each value would become a feature, something like this idea:
X[0][0]
array([34.07824204])
X[0][1]
array([33.36032467])
# Pseudo-code idea:
# X_new = [0][0],[0][1],...
# X_new = append(X_new,[1][0],[1][1]...)
What I have tried:
nsamples, nx, ny = X.shape
d2_train_dataset = X.reshape((nsamples,nx*ny))
Also, I tried to reshape and transpose, but it does not give what I need:
X
array([array([[34.07824204],
[33.36032467],
[24.61158084],
...,
[34.62648953],
[34.49591937],
[34.40951467]]),
array([[ 4.50136316],
[ 7.46307729],
[17.07135805],
...,
[57.98715047],
[54.5733181 ],
[50.13691107]])], dtype=object)
X.T
array([array([[34.07824204],
[33.36032467],
[24.61158084],
...,
[34.62648953],
[34.49591937],
[34.40951467]]),
array([[ 4.50136316],
[ 7.46307729],
[17.07135805],
...,
[57.98715047],
[54.5733181 ],
[50.13691107]])], dtype=object)
As suggested in one of the comments, I tried, without success (the output is the same as the input):
X.flatten()
array([array([[34.07824204],
[33.36032467],
[24.61158084],
...,
[34.62648953],
[34.49591937],
[34.40951467]]),
array([[ 4.50136316],
[ 7.46307729],
[17.07135805],
...,
[57.98715047],
[54.5733181 ],
[50.13691107]])], dtype=object)
As far as I can tell from Y, your labels are continuous, not discrete. Your data suggest that you need a regression model, but you are trying to fit a classifier (logistic regression). As a regression algorithm you could use linear regression, Support Vector Regression, or any other regression model.
Before reshaping, get rid of your arrays in arrays.
You can do this easily with numpy.stack. For instance
import numpy
from numpy import array
Y = array([array([[52593.4410802]]), array([[52593.4410802]])], dtype=object)
Y = numpy.stack(Y)
print(Y.shape)
print(Y)
gives:
(2, 1, 1)
[[[52593.4410802]]
[[52593.4410802]]]
From this, you can reshape to what you need.
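For X you can do something similar. A minimal sketch (assuming both inner arrays have the same length; the values are shortened from the question) that turns each inner array into one feature column, ready for a regression model as suggested above:
import numpy as np

# recreate the object array of arrays from the question (shortened values)
a = np.array([[34.07824204], [33.36032467], [24.61158084]])
b = np.array([[ 4.50136316], [ 7.46307729], [17.07135805]])
X = np.empty(2, dtype=object)
X[0], X[1] = a, b                # X.shape == (2,), as in the question

X = np.hstack(list(X))           # each inner array becomes one column
print(X.shape)                   # (3, 2): n samples, 2 features, usable with model.fit(X, y)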
I would like to calculate the geometric mean of some data that contains NaNs. How can I do it?
I know how to calculate the mean while ignoring NaNs; we can use the following code:
import numpy as np
M = np.nanmean(data, axis=2)
So how do I do this with the geometric mean?
You could use the following identity (I only found it in the German Wikipedia, but there are probably other sources as well):

$$\left(\prod_{i=1}^{n} x_i\right)^{1/n} = a^{\frac{1}{n}\sum_{i=1}^{n}\log_a x_i}$$

It can be derived by applying the logarithm rules to the normal definition of the geometric mean. The base $a$ can be chosen arbitrarily, so you could use np.log (and np.exp as the inverse operation):
import numpy as np
def nangmean(arr, axis=None):
    arr = np.asarray(arr)
    inverse_valids = 1. / np.sum(~np.isnan(arr), axis=axis)  # could be a problem for all-nan-axis
    rhs = inverse_valids * np.nansum(np.log(arr), axis=axis)
    return np.exp(rhs)
And it seems to work:
>>> l = [[1, 2, 3], [1, np.nan, 3], [np.nan, 2, np.nan]]
>>> nangmean(l)
1.8171205928321397
>>> nangmean(l, axis=1)
array([ 1.81712059, 1.73205081, 2. ])
>>> nangmean(l, axis=0)
array([ 1., 2., 3.])
np.nanprod was also added in NumPy 1.10, so you could also use the normal definition directly:
import numpy as np
def nangmean(arr, axis=None):
    arr = np.asarray(arr)
    valids = np.sum(~np.isnan(arr), axis=axis)
    prod = np.nanprod(arr, axis=axis)
    return np.power(prod, 1. / valids)
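A quick check (a sketch, assuming SciPy is available) that the function matches scipy.stats.gmean when there are no NaNs and still gives a sensible result with NaNs:
import numpy as np
from scipy import stats

clean = np.array([1.0, 2.0, 3.0])
print(nangmean(clean), stats.gmean(clean))   # both about 1.81712059

with_nan = np.array([1.0, np.nan, 3.0])
print(nangmean(with_nan))                    # about 1.73205081 (sqrt of 1*3)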
I am working on a polynomial train-test fit problem and want to convert a list object into a numpy array of shape (4, 100) (i.e., 4 rows, 100 columns).
I have the following code:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
import numpy as np
np.random.seed(0)
n = 15
x = np.linspace(0,10,n) + np.random.randn(n)/5
y = np.sin(x)+x/6 + np.random.randn(n)/10
X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=0)
results = []
pred_data = np.linspace(0,10,100)
degree = [1,3,6,9]
y_train1 = y_train.reshape(-1,1)
for i in degree:
    poly = PolynomialFeatures(degree=i)
    pred_poly1 = poly.fit_transform(pred_data[:,np.newaxis])
    X_F1_poly = poly.fit_transform(X_train[:,np.newaxis])
    linreg = LinearRegression().fit(X_F1_poly, y_train1)
    pred = linreg.predict(pred_poly1)
    results.append(pred)
dataArray = np.array(results).reshape(4, 100)
return dataArray
The code works and returns an array of shape (4, 100), but the output looks like it has 100 rows and 4 columns, and once I remove the ".reshape(4, 100)" part from the np.array call, the shape of the output becomes (4, 100, 1). (I apologize for my ignorance: what does the 1 in (4, 100, 1) stand for?)
I guess there's something wrong with how I build the list that I couldn't figure out at the moment. Could anyone point out the error in my code, or recommend how to convert/reshape the output array into the desired (4, 100) format?
Thank you.
Let's run a simplified version of your code, leaving out the details of what the sklearn polynomial fit is doing:
In [248]: results = []
...: pred_data = np.linspace(0,10,100)
...: degree = [1,3,6,9]
...:
In [249]: for i in degree:
...: results.append(pred_data[:,np.newaxis])
...:
In [250]: len(results)
Out[250]: 4
In [251]: results[0].shape
Out[251]: (100, 1)
In [252]: arr = np.array(results)
In [253]: arr.shape
Out[253]: (4, 100, 1)
pred_data is (100,) (by linspace construction). newaxis makes it (100,1). Do something with it and collect the result 4 times, and you have a list of four (100,1) arrays. Join those into one array and you get a 3D (4,100,1) array.
The display of arr starts as:
array([[[ 0. ],
[ 0.1010101 ],
[ 0.2020202 ],
...
[ 9.7979798 ],
[ 9.8989899 ],
[ 10. ]]])
The inner elements are [...], consistent with that last size 1 dimension.
I can remove the last dimension in various ways
arr.reshape(4,100)
arr[:,:,0]
np.squeeze(arr)
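A small self-contained check (using a dummy array in place of the sklearn predictions, so this is just a sketch) that all three give the same (4, 100) result:
import numpy as np

arr = np.linspace(0, 10, 100)[np.newaxis, :, np.newaxis].repeat(4, axis=0)
print(arr.shape)                                              # (4, 100, 1)

a = arr.reshape(4, 100)
b = arr[:, :, 0]
c = np.squeeze(arr)
print(a.shape, np.array_equal(a, b), np.array_equal(a, c))    # (4, 100) True True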
I don't know enough of the sklearn code to know whether you really need pred_data[:,np.newaxis]. I have seen shapes like (#samples, #features) in other sklearn questions. So a shape like (100,1) might be correct if you have 100 samples and 1 feature.
I have some data and I can fit a gamma distribution to it using, for example, the code taken from Fitting a gamma distribution with (python) Scipy.
import scipy.stats as ss
import scipy as sp
Generate some gamma data:
alpha=5
loc=100.5
beta=22
data=ss.gamma.rvs(alpha,loc=loc,scale=beta,size=10000)
print(data)
# [ 202.36035683 297.23906376 249.53831795 ..., 271.85204096 180.75026301
# 364.60240242]
Here we fit the data to the gamma distribution:
fit_alpha,fit_loc,fit_beta=ss.gamma.fit(data)
print(fit_alpha,fit_loc,fit_beta)
# (5.0833692504230008, 100.08697963283467, 21.739518937816108)
print(alpha,loc,beta)
# (5, 100.5, 22)
I can also fit an exponential distribution to the same data. However, I would like to do a likelihood ratio test. To do this I don't just need to fit the distributions, I also need to return the likelihood. How can I do that in Python?
You can compute the log-likelihood of the data by calling the logpdf method of scipy.stats.gamma and summing the resulting array.
The first bit of code is from your example:
In [63]: import scipy.stats as ss
In [64]: np.random.seed(123)
In [65]: alpha = 5
In [66]: loc = 100.5
In [67]: beta = 22
In [68]: data = ss.gamma.rvs(alpha, loc=loc, scale=beta, size=10000)
In [70]: data
Out[70]:
array([ 159.73200869, 258.23458137, 178.0504184 , ..., 281.91672824,
164.77152977, 145.83445141])
In [71]: fit_alpha, fit_loc, fit_beta = ss.gamma.fit(data)
In [72]: fit_alpha, fit_loc, fit_beta
Out[72]: (4.9953385276512883, 101.24295938462399, 21.992307537192605)
Here's how to compute the log-likelihood:
In [73]: loglh = ss.gamma.logpdf(data, fit_alpha, fit_loc, fit_beta).sum()
In [74]: loglh
Out[74]: -52437.410641032831
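To finish the likelihood ratio test, you can fit the exponential the same way and compare the two log-likelihoods. A sketch (the exponential is the gamma with its shape parameter fixed at 1, so there is one degree of freedom here, under the usual asymptotic assumptions):
import scipy.stats as ss
from scipy.stats import chi2

# 'data' and 'loglh' (the gamma log-likelihood) come from the session above
fit_loc_e, fit_scale_e = ss.expon.fit(data)
loglh_expon = ss.expon.logpdf(data, fit_loc_e, fit_scale_e).sum()

lr_stat = 2 * (loglh - loglh_expon)    # likelihood ratio statistic
p_value = chi2.sf(lr_stat, df=1)       # asymptotic chi-squared(1) p-value
print(lr_stat, p_value)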