np.expm1 doesn't return desired value - python

I'm currently trying the following concept:
I applied np.log1p() to the independent variables and to the dependent variable (price).
Assuming X = independent variables and Y = dependent variable, I train_test_split X & Y
Then I trained the LinearRegression(), Ridge(), Lasso(), and ElasticNet() models
Given that the labels I used to train the model were also log1p(Y), I'm assuming the model predictions are also log values?
If the predictions are log values, how come np.expm1 doesn't return a value that is on a similar scale?
Linear regression code, for reference:
import os
import glob
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
from scipy.stats import skew
from scipy import stats
from scipy.stats import norm
from sklearn.model_selection import cross_val_score, train_test_split, GridSearchCV, ShuffleSplit
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
df_num = pd.DataFrame(np.random.randint(0,100,size=(10000, 4)), columns=list('ABCD'))
df_cat = pd.DataFrame(np.random.randint(0,2,size=(10000, 2)), columns=['cat1', 'cat2'])
price = pd.DataFrame(np.random.randint(0,100,size=(10000, 1)), columns=['price'])
y = price
skewness = df_num.apply(lambda x: skew(x))
skewness = skewness[abs(skewness) > 0.5]
skewed_features = skewness.index
df_num[skewed_features] = np.log1p(df_num[skewed_features])
y = np.log1p(y)
train = pd.concat([df_num, df_cat], axis = 1)
X_train, X_test, y_train, y_test = train_test_split(train, y, test_size = 0.3, random_state = 0)
lr_clf = LinearRegression()
lr_clf.fit(X_train, y_train)
def predict_price(A, B, C, D, cat1):
    cat1_index = np.where(train.columns == cat1)[0][0]
    x = np.zeros(len(train.columns))
    x[0] = np.log1p(A)
    x[1] = np.log1p(B)
    x[2] = np.log1p(C)
    x[3] = np.log1p(D)
    if cat1_index >= 0:
        x[cat1_index] = 1
    return np.expm1(lr_clf.predict([x])[0])
predict_price(20, 30, 15, 55, 'cat2')
EDIT1: I tried to recreate an example from scratch, but I can't seem to replicate the issue I'm running into. The issue with my real data is that:
predictions work totally fine if I DON'T log-normalize the inputs when training and DON'T log-normalize the inputs when predicting;
HOWEVER, when I DO log-normalize when training, log-normalize the inputs when predicting, and np.expm1 the prediction, the value is totally off.
Please let me know if there is anything I can explain more clearly.
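One thing worth checking, based on the code above: during training, np.log1p is applied only to the columns selected into skewed_features, but predict_price applies np.log1p to all four numeric inputs unconditionally. If the real data contains numeric columns that were not skewed (and therefore were not log-transformed at training time), the model sees raw values during training and log values at prediction time, which would produce exactly this symptom: predictions fine without the transform, far off with it. A minimal sketch of a predict function that mirrors the training transform (the helper name is mine; it reuses train, skewed_features, and lr_clf from above):
def predict_price_consistent(A, B, C, D, cat1):
    # Build the input row in training-column order
    x = pd.Series(0.0, index=train.columns)
    x['A'], x['B'], x['C'], x['D'] = A, B, C, D
    # log1p only the columns that were actually log-transformed in training
    x[skewed_features] = np.log1p(x[skewed_features])
    x[cat1] = 1
    # The target was trained as log1p(price), so invert with expm1
    return np.expm1(lr_clf.predict([x.values])[0])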

Related

Linear Regression - how to predict the estimated relative performance?

Paul needs a laptop that is fast enough, and one of the main parameters of a computer that he must focus on is the CPU. In this project we need to forecast the performance of a CPU, which is characterized in terms of cycle time, memory capacity, and so on.
It is a linear regression problem, and you should predict the Estimated Relative Performance column.
I am new to Python. Could anybody help me with the code for this task?
CSV file (on Google Drive)
This is what I have done, but I probably did not understand the case.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
data = pd.read_csv("Computer_Hardware.csv")
data
data.describe()
y = data["Machine Cycle Time in nanoseconds"]
x1 = data["Estimated Relative Performance"]
plt.scatter(x1,y)
plt.xlabel("Estimated Relative Performance", fontsize = 20)
plt.ylabel("Machine Cycle Time in nanoseconds", fontsize = 20)
plt.show()
x = sm.add_constant(x1)
results = sm.OLS(y,x).fit()
results.summary()
In any fitted statsmodels model you can extract the predicted values with the predict() method and then add them to your frame:
data['predicted'] = results.predict()
Maybe your model needs more work; for now it only uses one variable, and you may get a better prediction with another model that uses more variables. At the moment the fitted model is just:
y = b0 + b1 * x1
According to the text, "... CPU which is characterized in terms of cycle time and memory capacity and so on" is the problem statement.
A proposal would be to extend your model using the statsmodels formula API. In your case I like to remove all spaces from the column names first.
# Rename columns without spaces
old_columns = data.columns
new_columns = [col.replace(' ', '_') for col in old_columns]
data = data.rename(columns={old:new for old, new in zip(old_columns, new_columns)})
# Fit a model using more variables
import statsmodels.formula.api as sm2
formula = ('Estimated_Relative_Performance ~ ',
           'Machine_Cycle_Time_in_nanoseconds + ',
           'Maximum_Main_Memory_in_kilobytes + ',
           'Cache_Memory_in_Kilobytes + ',
           'Maximum_Channels_in_Units')
formula = ' '.join(formula)
print(formula)
results2 = sm2.ols(formula, data).fit()
results2.summary()
data['predicted2'] = results2.predict()
Keras is one of the easiest libraries to use if you don't have much experience in Python. This is a good tutorial: https://machinelearningmastery.com/tutorial-first-neural-network-python-keras/
I studied some blogs on the topic and came up with this code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
raw_data = pd.read_csv("Computer_Hardware.csv")
x = raw_data[['Machine Cycle Time in nanoseconds',
              'Minimum Main Memory in Kilobytes', 'Maximum Main Memory in kilobytes',
              'Cache Memory in Kilobytes', 'Minimum Channels in Units',
              'Maximum Channels in Units', 'Published Relative Performance']]
y = raw_data['Estimated Relative Performance']
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3)
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(x_train, y_train)
print(model.coef_)
print(model.intercept_)
pd.DataFrame(model.coef_, x.columns, columns = ['Coeff'])
predictions = model.predict(x_test)
plt.hist(y_test - predictions)
from sklearn import metrics
metrics.mean_absolute_error(y_test, predictions)
metrics.mean_squared_error(y_test, predictions)
np.sqrt(metrics.mean_squared_error(y_test, predictions))

Create a dummy variable in a loop

I'm working on a dataset composed of 22 columns and 129 rows.
I'm using Support Vector Machine to predict my dependent variable.
To do this, I converted the variable into a dummy that takes the values 0 and 1:
df['dummy_medianrat'] = df['median_rating'].apply(lambda x: 1 if x < 13 else 0)
Now, my question is: I want to generate this dummy in a loop, for example:
df['dummy_medianrat'] = df['median_rating'].apply(lambda x: 1 if x < 12 else 0)
df['dummy_medianrat'] = df['median_rating'].apply(lambda x: 1 if x < 5 else 0)
df['dummy_medianrat'] = df['median_rating'].apply(lambda x: 1 if x < 8 else 0)
and so on. I want to test my variable with different cutoffs (<12, <5, <8) and let the SVM be tested on each of them.
Full code:
import pandas as pd # pandas is used to load and manipulate data and for One-Hot Encoding
import numpy as np # data manipulation
import matplotlib.pyplot as plt # matplotlib is for drawing graphs
import matplotlib.colors as colors
from sklearn.utils import resample # downsample the dataset
from sklearn.model_selection import train_test_split # split data into training and testing sets
from sklearn import preprocessing # scale and center data
from sklearn.svm import SVC # this will make a support vector machine for classification
from sklearn.model_selection import GridSearchCV # this will do cross validation
from sklearn.metrics import confusion_matrix # this creates a confusion matrix
from sklearn.metrics import plot_confusion_matrix # draws a confusion matrix
from sklearn.decomposition import PCA # to perform PCA to plot the data
from sklearn import svm, datasets
datafile = (r'C:\Users\gpont\PycharmProjects\pythonProject2\data\Map\databaseCDP0.csv')
df = pd.read_csv(datafile, skiprows = 0, sep=';')
df['dummy_medianrat'] = df['median_rating'].apply(lambda x: 1 if x < 13 else 0)
#Splitting data in two datasets
df_lowr = df[df['dummy_medianrat'] == 1]
df_higr = df[df['dummy_medianrat'] == 0]
df_downsample = pd.concat([df_lowr, df_higr])
len(df_downsample)
X = df_downsample.drop('dummy_medianrat', axis=1).copy()
X.head()
y = df_downsample['dummy_medianrat'].copy()
y.head()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42,
                                                    test_size=0.25)
scaler = preprocessing.StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
X_train.shape
X_test.shape
#Build A Preliminary Support Vector Machine
#We don't need to scale y_train because it is 0/1 (binary classification)
clf_svm = SVC(random_state=42)
clf_svm.fit(X_train_scaled, y_train)
titles_options = [("Confusion matrix, without normalization", None),
                  ("Normalized confusion matrix", 'true')]
for title, normalize in titles_options:
    disp = plot_confusion_matrix(clf_svm, X_test_scaled, y_test,
                                 display_labels=["Did not default", "Defaulted"],
                                 cmap=plt.cm.Blues,
                                 normalize=normalize)
    disp.ax_.set_title(title)
    print(title)
    print(disp.confusion_matrix)
After creating dummies with different cutoff values, I want to generate two confusion matrices (normalized and not) for each dummy created in the loop.
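A minimal sketch of one way to do this, reusing the code above and wrapping the dummy creation, fit, and plotting in an outer loop over the cutoffs (the cutoff list here is an assumption; note also that plot_confusion_matrix has been removed in newer scikit-learn versions in favour of ConfusionMatrixDisplay.from_estimator):
for cutoff in [5, 8, 12, 13]:
    # Rebuild the dummy for this cutoff and re-split the data
    df['dummy_medianrat'] = df['median_rating'].apply(lambda x: 1 if x < cutoff else 0)
    # Dropping median_rating too, so the raw rating does not leak the target
    X = df.drop(['dummy_medianrat', 'median_rating'], axis=1).copy()
    y = df['dummy_medianrat'].copy()
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42,
                                                        test_size=0.25)
    scaler = preprocessing.StandardScaler().fit(X_train)
    clf_svm = SVC(random_state=42)
    clf_svm.fit(scaler.transform(X_train), y_train)
    # Two confusion matrices (raw and normalized) per cutoff
    for title, normalize in [("cutoff < %d, without normalization" % cutoff, None),
                             ("cutoff < %d, normalized" % cutoff, 'true')]:
        disp = plot_confusion_matrix(clf_svm, scaler.transform(X_test), y_test,
                                     cmap=plt.cm.Blues, normalize=normalize)
        disp.ax_.set_title(title)
        print(title)
        print(disp.confusion_matrix)
plt.show()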

How do I predict values using .predict() function in statsmodels in Python?

I have typed in the following lines of code:
# import relevant statistical packages
import numpy as np
import pandas as pd
import statsmodels.api as sm
import pylab as pl
import sklearn.linear_model as skl
import sklearn.metrics as metrics
from sklearn.model_selection import train_test_split
# import data
url = "/<...>/Smarket.csv" # relative url within my computer
Smarket = pd.read_csv(url, index_col = 'SlNo')
X3 = Smarket[['Lag1', 'Lag2', 'Lag3', 'Lag4', 'Lag5', 'Volume']]
Y3 = Smarket['Direction']
X_train, X_test, y_train, y_test = train_test_split(X3, Y3, test_size=0.2016)
data_1 = pd.concat([pd.DataFrame(y_train), X_train], axis = 1)
model_1 = sm.formula.glm(formula = 'y_train~X_train', data = data_1, family= sm.families.Binomial()).fit()
X_new = model_1.predict(X_test)
It is the last line of code where I receive the following error:
PatsyError: Number of rows mismatch between data argument and X_train (252 versus 998)
y_train~X_train
^^^^^^^
I am just unable to understand why I am getting this error. I guess it might be because of a mismatch in the number of rows between X_test and X_train. How do I need to change my code to get the predicted values?
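The likely cause: in the formula 'y_train~X_train', patsy captures the Python variables y_train and X_train themselves (998 rows), so when predict(X_test) later supplies 252 rows the row counts clash. A sketch of one way around it, assuming Direction is encoded as 0/1 and the columns are named as above, is to write the formula in terms of column names and pass a single frame as data:
# Build one frame whose columns the formula can refer to
data_1 = pd.concat([y_train, X_train], axis=1)
formula = 'Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume'
model_1 = sm.formula.glm(formula=formula, data=data_1,
                         family=sm.families.Binomial()).fit()
# predict() can now look those same column names up in X_test
X_new = model_1.predict(X_test)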

import csv to OrderedDict and predict using regression

I built a regression model to predict energy (1 column) from 5 variables (5 columns). I used my experimental data to train and fit the model, and it works with a good score.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv('new.csv')
X = data.drop(['E'], axis=1)
y = data['E']
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in newer scikit-learn
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5,
                                                    random_state=2)
from sklearn import ensemble
clf1 = ensemble.GradientBoostingRegressor(n_estimators=400, max_depth=5,
                                          min_samples_split=2, loss='ls',
                                          learning_rate=0.1)
clf1.fit(X_train, y_train)
clf1.score(X_test, y_test)
But now I want to load a new csv file containing new data for the 5 variables mentioned into an OrderedDict and use the model to predict energy. With the code below I manually insert a row and it predicts energy correctly:
from collections import OrderedDict
new_data = OrderedDict([('H', 48.52512), ('A', 169.8379), ('P', 55.52512),
                        ('R', 3.058758), ('Q', 2038.055)])
new_data = pd.Series(new_data)
data = new_data.values.reshape(1, -1)
clf1.predict(data)
But I can't do this row by row for huge datasets and need to import a csv file. I tried the code below but can't figure it out:
data_2 = pd.read_csv('new2.csv')
X_new = OrderedDict(data_2)
new_data = pd.Series(X_new)
data = new_data.values.reshape(1, -1)
clf1.predict(data)
but it gives me: ValueError: setting an array element with a sequence.
Can anyone help me?
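A sketch of what likely goes wrong and a fix: reshape(1, -1) is meant for a single row, but here new_data is a Series whose elements are whole columns, so NumPy fails trying to pack sequences into single array elements. Since predict() accepts a 2D array or DataFrame directly, the OrderedDict detour is unnecessary, assuming new2.csv contains the same 5 feature columns used for training:
data_2 = pd.read_csv('new2.csv')
# Reorder the columns to match the training frame, then predict all rows at once
predictions = clf1.predict(data_2[X.columns])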

How to find the best degree of polynomials?

I'm new to Machine Learning and am currently stuck on this.
First I used linear regression to fit the training set but got a very large RMSE. Then I tried using polynomial regression to reduce the bias.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error
poly_features = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly_features.fit_transform(X)
poly_reg = LinearRegression()
poly_reg.fit(X_poly, y)
poly_predict = poly_reg.predict(X_poly)
poly_mse = mean_squared_error(y, poly_predict)  # compare predictions against y, not X
poly_rmse = np.sqrt(poly_mse)
poly_rmse
Then I got a slightly better result than with linear regression, and as I continued to set degree = 3/4/5, the result kept getting better. But it might be overfitting somewhat as the degree increases.
The best degree of polynomial should be the degree that generates the lowest RMSE on a cross-validation set. But I don't have any idea how to achieve that. Should I use GridSearchCV, or is there another method?
I would much appreciate it if you could help me with this.
You should provide the data for X/y next time, or something dummy; it'll be faster and get you a specific solution. For now I've created a dummy equation of the form y = X**4 + X**3 + X + 1.
There are many ways you can improve on this, but a quick iteration to find the best degree is simply to fit your data on each degree and pick the degree with the best performance (e.g., the lowest RMSE).
You can also play with how you decide to hold out your train/test/validation data.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
X = np.arange(100).reshape(100, 1)
y = X**4 + X**3 + X + 1
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
rmses = []
degrees = np.arange(1, 10)
min_rmse, min_deg = 1e10, 0
for deg in degrees:
    # Train features
    poly_features = PolynomialFeatures(degree=deg, include_bias=False)
    x_poly_train = poly_features.fit_transform(x_train)
    # Linear regression
    poly_reg = LinearRegression()
    poly_reg.fit(x_poly_train, y_train)
    # Compare with test data (transform, not fit_transform, on the test set)
    x_poly_test = poly_features.transform(x_test)
    poly_predict = poly_reg.predict(x_poly_test)
    poly_mse = mean_squared_error(y_test, poly_predict)
    poly_rmse = np.sqrt(poly_mse)
    rmses.append(poly_rmse)
    # Cross-validation of degree
    if min_rmse > poly_rmse:
        min_rmse = poly_rmse
        min_deg = deg
# Plot and present results
print('Best degree {} with RMSE {}'.format(min_deg, min_rmse))
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(degrees, rmses)
ax.set_yscale('log')
ax.set_xlabel('Degree')
ax.set_ylabel('RMSE')
This will print:
Best degree 4 with RMSE 1.27689038706e-08
Alternatively, you could also build a new class that carries out Polynomial fitting, and pass that to GridSearchCV with a set of parameters.
In my opinion, the best way to find an optimal curve fitting degree or in general a fitting model is to use the GridSearchCV module from the scikit-learn library.
Here is an example of how to use this library:
Firstly let us define a method to sample random data:
def make_data(N, err=1.0, rseed=1):
    rng = np.random.RandomState(rseed)
    X = rng.rand(N, 1) ** 2
    y = 1. / (X.ravel() + 0.3)
    if err > 0:
        y += err * rng.randn(N)
    return X, y
Build a pipeline:
def PolynomialRegression(degree=2, **kwargs):
    return make_pipeline(PolynomialFeatures(degree), LinearRegression(**kwargs))
Create the data and a vector (X_test) for testing and visualisation purposes:
X, y = make_data(200)
X_test = np.linspace(-0.1, 1.1, 200)[:, None]
Define the GridSearchCV parameters:
param_grid = {'polynomialfeatures__degree': np.arange(20),
              'linearregression__fit_intercept': [True, False],
              'linearregression__normalize': [True, False]}
grid = GridSearchCV(PolynomialRegression(), param_grid, cv=7)
grid.fit(X, y)
Get the best parameters from our model:
model = grid.best_estimator_
model
Pipeline(memory=None,
         steps=[('polynomialfeatures', PolynomialFeatures(degree=4, include_bias=True, interaction_only=False)),
                ('linearregression', LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False))])
Fit the model with the X and y data and use the vector to predict the values:
y_test = model.fit(X, y).predict(X_test)
Visualize the result:
plt.scatter(X, y)
plt.plot(X_test.ravel(), y_test, 'r')
The best fit result
The full code snippet:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import GridSearchCV
def make_data(N, err=1.0, rseed=1):
    rng = np.random.RandomState(rseed)
    X = rng.rand(N, 1) ** 2
    y = 1. / (X.ravel() + 0.3)
    if err > 0:
        y += err * rng.randn(N)
    return X, y
def PolynomialRegression(degree=2, **kwargs):
    return make_pipeline(PolynomialFeatures(degree), LinearRegression(**kwargs))
X, y = make_data(200)
X_test = np.linspace(-0.1, 1.1, 200)[:, None]
param_grid = {'polynomialfeatures__degree': np.arange(20),
              'linearregression__fit_intercept': [True, False],
              'linearregression__normalize': [True, False]}
grid = GridSearchCV(PolynomialRegression(), param_grid, cv=7)
grid.fit(X, y)
model = grid.best_estimator_
y_test = model.fit(X, y).predict(X_test)
plt.scatter(X, y)
plt.plot(X_test.ravel(), y_test, 'r')
This is really where Bayesian model selection comes in: it gives you the most likely model given both model complexity and data fit. The quick answer is to use the BIC (Bayesian information criterion):
k = number of variables in the model
n = number of observations
sse = sum(residuals**2)
BIC = n*ln(sse/n) + k*ln(n)
Computing this for each candidate degree and choosing the model with the lowest BIC (or AIC, etc.) will give you the best model.
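For concreteness, a minimal sketch of that criterion in code, reusing the X and y from the dummy examples above and assuming Gaussian residuals (as the formula does); lower BIC is better:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

def bic(y_true, y_pred, k):
    # BIC = n*ln(sse/n) + k*ln(n) for Gaussian residuals
    n = len(y_true)
    sse = np.sum((np.asarray(y_true) - np.asarray(y_pred)) ** 2)
    return n * np.log(sse / n) + k * np.log(n)

best_deg, best_bic = None, np.inf
for deg in range(1, 10):
    poly = PolynomialFeatures(degree=deg, include_bias=False)
    X_poly = poly.fit_transform(X)
    reg = LinearRegression().fit(X_poly, y)
    # k counts the polynomial coefficients plus the intercept
    score = bic(y, reg.predict(X_poly), k=X_poly.shape[1] + 1)
    if score < best_bic:
        best_deg, best_bic = deg, score
print('Lowest BIC at degree', best_deg)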
