I'm new to Python and machine learning, so my question may be trivial.
I typed the code below in a Jupyter Notebook:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# X and y are assumed to have been defined in earlier cells of the notebook
poly_reg = PolynomialFeatures(degree=2)
X_poly = poly_reg.fit_transform(X)
X_poly[:5]
lin_reg = LinearRegression()
lin_reg.fit(X_poly, y)
plt.scatter(X, y)
plt.plot(X, lin_reg.predict(poly_reg.fit_transform(X)))
plt.show()
Then I deleted the two lines below:
lin_reg = LinearRegression()
lin_reg.fit(X_poly, y)
But the graph and the regression are still generated normally.
So is that code not essential?
ChatGPT said that "without the training and fitting of the linear regression model, the predicted line would not be accurate and would not reflect the relationship between the input and target data."
But to me, the resulting graph and regression look accurate ... even
lin_reg.predict(poly_reg.fit_transform(X[[2]]))
still works.
Are the lines
lin_reg = LinearRegression()
lin_reg.fit(X_poly, y)
meaningless? Or does something go wrong when I delete them?
P.S. Please let me know if the way I'm asking this question is not appropriate.
Until you restart the runtime environment, your fitted model is still in memory. You are addressing the model that was fitted before you deleted those lines, so there will be no difference in the output. Once you restart the runtime environment, you will get an error: NameError: name 'lin_reg' is not defined.
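A minimal sketch of what is going on (assuming X_poly and y are defined as in the question):

from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(X_poly, y)           # the fitted estimator now lives in the kernel's memory

# Deleting these two lines from the notebook cell does not delete the object:
# the name lin_reg still points to the fitted model until the kernel restarts.
lin_reg.predict(X_poly[:1])      # still works within the same session

# After restarting the kernel and re-running only the plotting/prediction code:
# NameError: name 'lin_reg' is not defined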
I'm trying to write an integration test that uses the descriptive statistics (.describe().to_list()) of the results of a model prediction (model.predict(X)). However, even though I've set np.random.seed(###), the descriptive statistics are different after running the tests in the console vs. in the environment created by PyCharm.
Here's an MRE for a local run:
from sklearn.linear_model import ElasticNet
from sklearn.datasets import make_regression
import numpy as np
import pandas as pd
np.random.seed(42)
X, y = make_regression(n_features=2, random_state=42)
regr = ElasticNet(random_state=42)
regr.fit(X, y)
pred = regr.predict(X)
# Theory: this result should be the same as the result produced inside the test class
pd.Series(pred).describe().to_list()
And an example test-file:
from unittest import TestCase
from sklearn.linear_model import ElasticNet
from sklearn.datasets import make_regression
import numpy as np
import pandas as pd
np.random.seed(42)
class TestPD(TestCase):
    def testExpectedPrediction(self):
        np.random.seed(42)
        X, y = make_regression(n_features=2, random_state=42)
        regr = ElasticNet(random_state=42)
        regr.fit(X, y)
        pred = pd.Series(regr.predict(X))
        for i in pred.describe().to_list():
            print(i)
            # here we would have a self.assertTrue/assertEqual for each element
What appears to happen is that when I run this test in the Python console I get one result, but when I run it using PyCharm's unit tests for the folder I get another. Importantly, in PyCharm the project interpreter is used to create the environment for the console, which ought to be the same as the test environment. This leads me to believe that I'm missing something about the way random_state is passed along. My expectation is that, given that I have set a seed, the results would be reproducible. But that doesn't appear to be the case, and I would like to understand:
Why aren't they equal?
What can I do to make them equal?
I haven't been able to find a lot of best practices with respect to testing against expected model results. So commentary in that regard would also be helpful.
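For what it's worth, here is a minimal sketch (not the author's test) showing that this particular pipeline is driven entirely by the explicit random_state arguments, and how a tolerance-based comparison could be used instead of exact equality:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet


def run_once():
    # All randomness here is controlled by the explicit random_state arguments,
    # so np.random.seed(...) should have no effect on this pipeline.
    X, y = make_regression(n_features=2, random_state=42)
    regr = ElasticNet(random_state=42).fit(X, y)
    return regr.predict(X)


# Two runs in the same process should be bit-identical; across machines or
# BLAS builds, small float differences are possible, so a tolerance is safer.
print(np.array_equal(run_once(), run_once()))            # expected: True
print(np.allclose(run_once(), run_once(), rtol=1e-12))   # tolerance-based check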
I'm running linear regressions with statsmodels, and because I tend to distrust my results I also ran the same regression with scipy. The underlying dataset has about 80,000 observations. Unfortunately, I cannot provide the data for you to reproduce the errors.
I run two rounds of regressions: first a simple OLS, then a simple OLS with standardized variables.
Surprisingly, the results differ a lot. While R² and the p-value seem to be the same, the coefficients, intercept and standard error are all over the place. Interestingly, after standardizing, the results align more closely; there is then only a slight difference in the constant, which I am happy to attribute to rounding issues.
The exact numbers can be found in the appended screenshots.
Any idea where these differences come from and why they disappear after standardizing? What did I do wrong? Do I have to be extra worried, since I run most of my regressions with sklearn (I only switched to statsmodels because I needed some p-values), and even more differences may occur?
Thanks for your help! If you need any additional information, feel free to ask. Code and screenshots are provided below.
My code in short looks like this:
# package import
import numpy as np
from scipy.stats import linregress
from scipy.stats.mstats import zscore
import statsmodels.api as sma

# train_IV, train_DV and train_DV_norm come from the 80,000-observation dataset
# described above (not shown here)

# adding constant
train_IV_cons = sma.add_constant(train_IV)

# run regression
(coefficients, intercept, rvalue, pvalue, stderr) = linregress(train_IV[:, 0], train_DV)
print(coefficients, intercept, rvalue, pvalue, stderr)
est = sma.OLS(train_DV, train_IV_cons[:, [0, 1]])
model_results = est.fit()
print(model_results.summary())

# normalize variables
train_IV_norm = train_IV
train_IV_norm[:, 0] = np.array(zscore(train_IV_norm[:, 0]))
train_IV_norm_cons = sma.add_constant(train_IV_norm)

# run regressions
(coefficients, intercept, rvalue, pvalue, stderr) = linregress(train_IV_norm[:, 0], train_DV_norm)
print(coefficients, intercept, rvalue, pvalue, stderr)
est = sma.OLS(train_DV_norm, train_IV_norm_cons[:, [0, 1]])
model_results = est.fit()
print(model_results.summary())
First regression (not standardized data):
Second regression (standardized data):
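As a sanity check on the general behaviour (synthetic data, not the author's dataset), here is a minimal sketch showing that the two libraries agree when the same single predictor and an explicit constant are used:

import numpy as np
from scipy.stats import linregress
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=1000)

# scipy: simple regression of y on x
res = linregress(x, y)
print(res.slope, res.intercept)

# statsmodels: the same model, with the constant added explicitly
X = sm.add_constant(x)
ols = sm.OLS(y, X).fit()
print(ols.params)  # [intercept, slope], should match scipy up to rounding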
In my project I am trying to separate the different classifiers I have implemented into separate files (in order to make the code less messy).
Unfortunately, when I import the file containing the RandomForestClassifier, the program neither finishes execution nor shows an error. The code is provided here:
from sklearn import cross_validation
from sklearn.grid_search import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
def train_rfc(X, y):
    n_estimators = [100]
    min_samples_split = [2]
    min_samples_leaf = [1]
    bootstrap = [True]
    parameters = {'n_estimators': n_estimators, 'min_samples_leaf': min_samples_leaf,
                  'min_samples_split': min_samples_split}
    clf = GridSearchCV(RandomForestClassifier(verbose=1, n_jobs=-1), cv=4, param_grid=parameters)
    print "=="
    clf.fit(X, y)
    print "++"
    return clf

# corpus and data are built elsewhere in the project (not shown)
rfc_clf = train_rfc(corpus, data["target"])
print ("Accuracy of RF on CV sets :{}".format(rfc_clf.best_score_))
As far as I can tell, the problem is most likely somewhere in the RandomForest part, in the clf.fit(X, y) call to be exact, since the program does not do anything as soon as it reaches this point.
I had no problem running the other, similarly implemented SVM classifier that also used GridSearchCV, and was also inside of a function.
I would really appreciate any help on that.
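One thing worth checking (an assumption on my part, not a confirmed diagnosis): the snippet trains the model at import time, and with n_jobs=-1 the worker processes may re-import the module. Guarding the top-level call keeps the import itself side-effect-free, e.g.:

# sketch: train_rfc, corpus and data are the names from the snippet above
if __name__ == "__main__":
    rfc_clf = train_rfc(corpus, data["target"])
    print ("Accuracy of RF on CV sets :{}".format(rfc_clf.best_score_))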
I don't have much knowledge of Python, but I have to crack this to complete an assessment.
Question:
Run the following code to load the required libraries and create the data set to fit the model.
from sklearn.datasets import load_boston
import pandas as pd
boston = load_boston()
dataset = pd.DataFrame(boston.data, columns=boston.feature_names)
dataset['target'] = boston.target
print(dataset.head())
I have to perform the following steps to complete this scenario.
For the Boston dataset loaded in the above code snippet, perform linear regression.
Use the target variable as the dependent variable.
Use the RM variable as the independent variable.
Fit a single linear regression model using the statsmodels package in Python.
Import the statsmodels packages appropriately in your code.
Upon fitting the model, identify the coefficients.
Finally, print the model summary in your code.
You can write your code using vim app.py .
Press i for insert mode.
Press esc and then :wq to save and quit the editor.
Please help me understand how to get this completed. Your valuable comments are much appreciated.
Thanks in advance.
from sklearn.datasets import load_boston
import pandas as pd
boston = load_boston()
dataset = pd.DataFrame(boston.data, columns=boston.feature_names)
dataset['target'] = boston.target
print(dataset.head())
import statsmodels.api as sm

X = dataset["RM"]
y = dataset['target']
X = sm.add_constant(X)  # add the intercept term
model = sm.OLS(y, X).fit()
predictions = model.predict(X)
print(model.summary())
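If the assessment also expects the coefficients to be reported explicitly (an assumption on my part), they can be read from the fitted result; model.params is a pandas Series indexed by 'const' and 'RM':

# intercept and slope for RM from the fitted OLS model
print(model.params)
print("Intercept:", model.params["const"])
print("RM coefficient:", model.params["RM"])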
The following script runs fine on my machine with n_samples=1000, but dies (no error, it just stops working) with n_samples=10000. This only happens with the Anaconda Python distribution (numpy 1.8.1); it is fine with Enthought's (numpy 1.9.2). Any ideas what could be causing this?
from sklearn.linear_model import LogisticRegression
from sklearn.grid_search import GridSearchCV
from sklearn.metrics.scorer import log_loss_scorer
from sklearn.cross_validation import KFold
from sklearn import datasets
import numpy as np
X, y = datasets.make_classification(n_samples=10000, n_features=50,
                                    n_informative=35, n_redundant=10,
                                    random_state=1984)
lr = LogisticRegression(random_state=1984)
param_grid = {'C': np.logspace(-1, 2, 4, base=2)}
kf = KFold(n=y.size, n_folds=5, shuffle=True, random_state=1984)
gs = GridSearchCV(estimator=lr, param_grid=param_grid, scoring=log_loss_scorer,
                  cv=kf, verbose=100, n_jobs=-1)
gs.fit(X, y)
Note: I'm using sklearn 0.16.1 in both distributions and am using OS X.
I've noticed that upgrading to numpy version 1.9.2 with the Enthought distribution (by updating it manually) breaks the grid search. I haven't had any luck downgrading the Anaconda numpy version to 1.8.1, though.
Are you on Windows? If so, you need to protect the code with
if __name__ == "__main__":
    do_stuff()
Otherwise multiprocessing will not work.
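Applied to the script in the question, that would look roughly like this (a sketch that keeps the original imports and parameters):

from sklearn.linear_model import LogisticRegression
from sklearn.grid_search import GridSearchCV
from sklearn.metrics.scorer import log_loss_scorer
from sklearn.cross_validation import KFold
from sklearn import datasets
import numpy as np


def main():
    X, y = datasets.make_classification(n_samples=10000, n_features=50,
                                        n_informative=35, n_redundant=10,
                                        random_state=1984)
    lr = LogisticRegression(random_state=1984)
    param_grid = {'C': np.logspace(-1, 2, 4, base=2)}
    kf = KFold(n=y.size, n_folds=5, shuffle=True, random_state=1984)
    gs = GridSearchCV(estimator=lr, param_grid=param_grid, scoring=log_loss_scorer,
                      cv=kf, verbose=100, n_jobs=-1)
    gs.fit(X, y)


if __name__ == "__main__":
    # Without this guard, the worker processes spawned for n_jobs=-1 re-import
    # the module on Windows and re-run the top-level code, which can hang.
    main()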
Per Andreas's comment, the problem seems to be with multithreading in the linear algebra library. I solved it with the following command in the terminal:
export VECLIB_MAXIMUM_THREADS=1
My (weak) understanding is that this limits the linear algebra library's use of multiple threads and lets multiprocessing handle parallelism as it wants.
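If you would rather set it from Python than from the shell, the variable has to be in place before NumPy is imported; a minimal sketch:

import os

# Must be set before numpy (or anything that imports numpy) is loaded,
# otherwise the Accelerate/vecLib thread limit has no effect.
os.environ["VECLIB_MAXIMUM_THREADS"] = "1"

import numpy as np  # imported only after the environment variable is in place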