I am using scikit-learn linear regression with a single variable to predict y from x. The input argument is a float. How can I transform the float into a numpy array to predict the output?
import matplotlib.pyplot as plt
import pandas
import numpy as np
from sklearn import linear_model
import sys
colnames = ['charge_time', 'running_time']
data = pandas.read_csv('trainingdata.txt', names=colnames)
data = data[data.running_time < 8]
x = np.array(list(data.charge_time))
y = np.array(list(data.running_time))
clf = linear_model.LinearRegression() # Creating a linear regression model
clf.fit(x[:, np.newaxis], y) # Fitting the x and y arrays as the training set
data = float(sys.stdin.readline()) # Input is a float, e.g. 4.8
print(clf.predict(data[:, np.newaxis])) # As per my understanding, the parameter should be a 1-D array
First of all, a suggestion not directly related to your question:
You don't need to do x = np.array(list(data.charge_time)); you can directly call x = np.array(data.charge_time) or, even better, x = data.charge_time.values, which directly returns the underlying ndarray.
The np.newaxis on x, on the other hand, does serve a purpose: fit expects X as a 2-D array of shape (n_samples, n_features), and x[:, np.newaxis] turns the 1-d array into a single-feature column of that shape.
Regarding your question, predict expects an array-like parameter: that can be a list, a numpy array, or similar, with the same 2-D shape used for fitting.
So you should be able to just do data = np.array([float(sys.stdin.readline())]). Putting the float value in a list ([]) is needed because without it numpy would create a 0-d array (i.e. a single value, which is not sliceable) instead of a 1-d array.
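Putting the pieces together, a minimal sketch (clf is the fitted model from the question):
import sys
import numpy as np

# Wrapping the float in a list yields a 1-d array of shape (1,);
# adding an axis gives the (n_samples, n_features) shape predict expects.
data = np.array([float(sys.stdin.readline())])
print(clf.predict(data[:, np.newaxis]))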
I want to write a function that takes two arguments, a vector y and a 2-D matrix X (both numpy arrays; assume the length of y equals the number of rows of X), and returns the regression summary of an OLS regression of y on X, treating each column of X as an independent categorical variable. I want to use only numpy and statsmodels for this.
For example, if I just want to regress y on X, it should be easily achieved as follows,
import statsmodels.api as sm
def f(y, X):
    return sm.OLS(y, sm.add_constant(X), missing="drop").fit()
But how can I do the same regression (y on X) with similar code, the only difference being that each column in X is treated as categorical instead of continuous?
Thanks for your time and help!
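One possible approach, sketched under the question's numpy-and-statsmodels constraint (f_categorical and the encoding scheme are illustrative, not from the original post): dummy-encode each column by hand, dropping one level per column so the design matrix is not collinear with the constant.
import numpy as np
import statsmodels.api as sm

def f_categorical(y, X):
    # Dummy-encode every column of X; drop the first level of each
    # column to avoid perfect collinearity with the constant term.
    parts = []
    for j in range(X.shape[1]):
        levels = np.unique(X[:, j])
        dummies = (X[:, j][:, None] == levels[None, :]).astype(float)
        parts.append(dummies[:, 1:])
    X_dummy = np.hstack(parts)
    return sm.OLS(y, sm.add_constant(X_dummy), missing="drop").fit()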
I have a task to create a 30x40 feature matrix with random integers between 1 & 100:
import numpy as np
matrix= np.random.randint(1,100,size=(30,40))
Next I need to rescale the elements in the matrix to the range 5-10:
from sklearn import preprocessing
scaler = preprocessing.MinMaxScaler()
scaler.fit (5,10)
matrix1 = scaler.fit_transform(matrix)
Which gives me this error:
ValueError: Expected 2D array, got scalar array instead:
array=5.0.
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample
I've tried reshaping the data:
matrix.reshape(-1,1)
but I get the same error.
I think you need to define the feature range when you create an instance of MinMaxScaler like this:
scaler = preprocessing.MinMaxScaler(feature_range=(5, 10))
And then you could fit and transform the data like this:
matrix1 = scaler.fit_transform(matrix)
The last line is a short form for:
scaler.fit(matrix)
matrix1 = scaler.transform(matrix)
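Putting it all together, a minimal runnable sketch (note that MinMaxScaler rescales each column independently):
import numpy as np
from sklearn import preprocessing

matrix = np.random.randint(1, 100, size=(30, 40))
scaler = preprocessing.MinMaxScaler(feature_range=(5, 10))
matrix1 = scaler.fit_transform(matrix)

# Every column's minimum maps to 5 and its maximum to 10.
print(matrix1.min(), matrix1.max()) # 5.0 10.0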
I am using the sklearn decision tree for some classification, and I have two types of data: categorical and continuous. I used pd.get_dummies for my categorical values and ended up with over 90 features, which is, of course, quite a lot.
The thing is that I then iterate over the max_features parameter to get the best score for my model, and having more than 20 features is too time-consuming. So I thought that sklearn could use sparse matrices for my categorical features, instead of 70 columns of 0s and 1s.
The question is: can sklearn use a mix of sparse matrices and regular arrays, or not? If yes, how do I do that? Currently I get the error: setting an array element with a sequence
Here is some code to illustrate the idea. df_with_dummies is what I currently use, but I hope there is a way to use df_with_sparse:
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
a = np.random.randn(10,3)
b = np.random.random((10,1))
df = pd.DataFrame(a, columns = "A B C".split())
df['temp'] = b
df['dum1'] = np.where(df.temp < 0.5, 1, 0)
df['dum2'] = np.where(df.temp >= 0.5, 1, 0)
del df['temp']
df_with_dummies = df.copy()
a = df[['dum1', 'dum2']]
dums = csr_matrix(a)
df['dums'] = dums
df_with_sparse = df.copy()
When you do:
df['dums'] = dums
dums, being a sparse matrix, is not handled correctly by the pandas DataFrame and gets broadcast to each row. pandas does not complain about it because it treats the sparse matrix as a non-array object.
That means each element in the df['dums'] column points to the whole sparse matrix dums. So essentially each array element is being set with a sequence, hence the error setting an array element with a sequence when it is processed by scikit-learn estimators.
Instead, you can concatenate the dense columns and the sparse matrix with scipy.sparse.hstack:
from scipy.sparse import hstack
df_with_sparse = hstack([df[['A', 'B', 'C']].values, dums])
Now you can pass this on to scikit-learn.
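The result of hstack is itself a scipy sparse matrix (COO format by default), and estimators such as DecisionTreeClassifier accept sparse input directly; a small sketch, with y as hypothetical labels just for illustration:
import numpy as np
from sklearn.tree import DecisionTreeClassifier

y = np.random.randint(0, 2, size=10) # hypothetical labels for the 10 rows
clf = DecisionTreeClassifier()
clf.fit(df_with_sparse, y) # the sparse matrix is converted internally as needed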
I have a simple piece of code, given below, which normalizes an array row-wise.
import numpy as np
from sklearn import preprocessing
X = np.asarray([[-1, 2, 1],
                [4, 1, 2]], dtype=float)
X_normalized = preprocessing.normalize(X, norm='l2')
Can you please help me convert X_normalized back to X again?
You cannot recover X from nothing more than the normalized version. Consider the trivial case of several data sets, each with 2 different elements:
[3, 4]
[-18, 20]
[0, 0.0001]
Each of these standardizes (zero mean, unit variance) to the same data set:
[-1, 1]
The same many-to-one behavior holds for the L2 normalization in your code: [3, 4] and [6, 8], for example, both normalize to [0.6, 0.8]. The mapping is not a bijection, it's many-to-one, and therefore not uniquely invertible.
However, you can recover the original set with a couple of simple techniques:
Keep the original data set intact (yes, it's that easy).
Store the normalization parameters. For standardization that means the mean and standard deviation (or its square, the variance), which give the linear equation mapping each original element to its normalized counterpart; it's trivial to invert that equation. For the L2 normalization in your code, it means the per-row norms.
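A minimal sketch of that round trip for the L2 case in the question:
import numpy as np
from sklearn import preprocessing

X = np.asarray([[-1, 2, 1],
                [4, 1, 2]], dtype=float)

# Store the per-row L2 norms before normalizing...
norms = np.linalg.norm(X, axis=1, keepdims=True)
X_normalized = preprocessing.normalize(X, norm='l2')

# ...then invert by multiplying them back in.
X_recovered = X_normalized * norms
print(np.allclose(X, X_recovered)) # True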
All the scalers in https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing have an inverse_transform method designed for exactly this.
For example, to scale and un-scale your DataFrame with MinMaxScaler you could do:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaled = scaler.fit_transform(df)
unscaled = scaler.inverse_transform(scaled)
Just bear in mind that transform (and fit_transform as well) returns a numpy.ndarray, not a pandas.DataFrame.
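If you need a DataFrame back, you can rewrap the result yourself, reusing df's labels (a sketch, assuming df is the original DataFrame):
import pandas as pd

unscaled_df = pd.DataFrame(unscaled, columns=df.columns, index=df.index)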
Reference: https://stackoverflow.com/questions/43382716/how-can-i-cleanly-normalize-data-and-then-unnormalize-it-later/43383700
I'm trying to do some symbolic calculations using indexing of a symbolic variable.
import numpy as np
import theano
import theano.tensor as T
X = T.matrix('X')
y = T.matrix('y')
z = T.dot(T.dot(X, y[0]), y[1]).norm(L=2)
callable_function = theano.function([y, X], z)
callable_function(np.array([np.array([[3],[5]]), np.array([[4,1]])]), np.array([1,2]))
And I'm getting
AttributeError: ('Bad input argument to theano function with name "C:/Users/LIKAN/PycharmProjects/deepEEG/test.py:17" at index 0(0-based)', "'float' object has no attribute 'dtype'")
How do I use symbolic variable indexing correctly?
You declare both y and X as matrices, but your inputs to the compiled Theano function are an object array and a vector.
np.array([np.array([[3],[5]]),np.array([[4,1]])]) is an object array because it is constructed as an array of numpy arrays. Note that np.array([np.array([[3],[5]]),np.array([[4,1]])]).dtype == object.
To create a matrix, just pass a multi-dimensional array (or nested list) to the numpy array constructor; you don't even need numpy arrays, plain Python lists work. Since your second argument (the one for X) is a vector, I've assumed the input value is correct and the symbolic variable declaration is what's wrong. With these changes, the following code runs:
import numpy as np
import theano
import theano.tensor as T
X = T.vector('X')
y = T.matrix('y')
z = T.dot(T.dot(X, y[0]), y[1]).norm(L=2)
callable_function = theano.function([y, X], z)
print(callable_function([[3, 5], [4, 1]], [1, 2]))
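With these inputs, T.dot(X, y[0]) evaluates to 13, T.dot(13, y[1]) to [52, 13], and the printed value is the L2 norm sqrt(52**2 + 13**2) = sqrt(2873), roughly 53.6.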