I am working through a multiple regression problem with this walkthrough, specifically the code that starts at the section "Treating categorical variables with One-hot-encoding" at https://towardsdatascience.com/what-makes-a-movie-hit-a-jackpot-learning-from-data-with-multiple-linear-regression-339f6c1a7022
I ran the code up to this point, but it fails on (X).
Actual code:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
# LabelEncoder for a number of columns
class MultiColumnLabelEncoder:
    def __init__(self, columns=None):
        self.columns = columns  # list of columns to encode

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        '''
        Transforms columns of X specified in self.columns using
        LabelEncoder(). If no columns specified, transforms all
        columns in X.
        '''
        output = X.copy()
        if self.columns is not None:
            for col in self.columns:
                output[col] = LabelEncoder().fit_transform(output[col])
        else:
            for colname, col in output.iteritems():
                output[colname] = LabelEncoder().fit_transform(col)
        return output

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)
le = MultiColumnLabelEncoder()
X_train_le = le.fit_transform(X)
Here is the error that I get:
Traceback (most recent call last):
File "<ipython-input-63-581cea150670>", line 34, in <module>
X_train_le = le.fit_transform(X)
NameError: name 'X' is not defined
Your code can't work because you left out about 40 lines of code that she wrote before that snippet. She defines X earlier. The code can be obtained from GitHub:
#importing the libraries
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
%matplotlib inline
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE
from sklearn.linear_model import RidgeCV, LassoCV, Ridge, Lasso
import statsmodels.api as sm
import pyreadr
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.metrics import explained_variance_score
from sklearn import metrics
from sklearn.preprocessing import StandardScaler
result = pyreadr.read_r('Movies.RData')# also works for Rds
print(result.keys())
df = pd.DataFrame(result['movies'], columns=result['movies'].keys() )
df.shape
df.shape[0]
df.set_index("title", inplace=True) #setting the index name
df_1 = df.loc[:, ['imdb_rating', 'genre', 'runtime', 'best_pic_nom',
                  'top200_box', 'director', 'actor1']]
#Let's also check the column-wise distribution of null values
print(df_1.isnull().values.sum())
print(df_1.isnull().sum())
#Dropping missing values from my dataset
df_1.dropna(how='any', inplace=True)
print(df_1.isnull().values.sum()) #checking for missing values after the dropna()
#Splitting for 2 matrices: independent variables used for prediction and dependent variables (that is predicted)
X = df_1.drop(["imdb_rating", 'runtime'], axis = 1) #Feature Matrix
y = df_1["imdb_rating"] #Dependent Variables
Related
Here I create the output x in the first method (null_checking) and want to use that same output (x) as the input for the second method (variance) within the class. I have tried a lot but could not get it to work.
#import dependencies
from optparse import Values
from re import X
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant
import pandas as pd
from .prac import checking_null
#import datasets
datasets = pd.read_csv('Car_sales.csv')
datasets
features = ['Fuel_efficiency', 'Power_perf_factor', 'Engine_size', 'Horsepower', 'Fuel_capacity', 'Curb_weight']
features
class extraction():
    def __init__(self, datasets, features):
        self.features = features
        self.datasets = datasets

    #checking and removing null rows present in the dataframe
    def null_checking(self):
        datasets1 = self.datasets[self.features]
        print(datasets1)
        for items, x in enumerate(datasets1):
            if items == 'True':
                continue
            x = datasets1.dropna(axis=0)
            print(x)

    #calculating variance inflation factors
    def variance(self):
        x.inner_display()
        #we need intercept for calculating variance inflation factor
        x['intercept'] = 1
        print(x)
        #Making the new dataframe
        df = pd.DataFrame()
        df['variables'] = x.columns
        df['VIF'] = [variance_inflation_factor(x, i) for i in range(x.shape[1])]
        print(df)
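One way to make the cleaned frame produced by null_checking() available to variance() is to store it on the instance (or return it) rather than relying on the loop variable x. A minimal sketch under that assumption, reusing the names from the snippet above:
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant
import pandas as pd

class Extraction:
    def __init__(self, datasets, features):
        self.datasets = datasets
        self.features = features

    def null_checking(self):
        # keep only the selected features and drop rows with missing values
        self.x = self.datasets[self.features].dropna(axis=0)
        return self.x

    def variance(self):
        # reuse the frame produced by null_checking(), computing it if needed
        x = self.x if hasattr(self, 'x') else self.null_checking()
        x = add_constant(x)  # adds the intercept column needed for VIF
        vif = pd.DataFrame()
        vif['variables'] = x.columns
        vif['VIF'] = [variance_inflation_factor(x.values, i) for i in range(x.shape[1])]
        return vif

ext = Extraction(datasets, features)
print(ext.variance())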
This is my code; I do not know where the error is.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels
from sklearn.preprocessing import OrdinalEncoder,OneHotEncoder
Light = ['On','Off']
Watering = ['Low','High']
# create combinations for all parameters
experiments = [(x,y) for x in Light for y in Watering]
exp_df = pd.DataFrame(experiments,columns=['A','B'])
print(exp_df)
OHE_model = OneHotEncoder(handle_unknown = 'ignore')
enc = OrdinalEncoder(categories=[['A','B'],['Low','High']])
encoded_df = pd.DataFrame(enc.fit_transform(exp_df[['A','B']]),columns=['A','B'])
#define the experiments order which must be random
encoded_df['exp_order'] = np.random.choice(np.arange(4),4,replace=False)
encoded_df['outcome'] = [25,37,55,65]
and this is the error
ValueError: Found unknown categories ['Off', 'On'] in column 0 during fit
The error means that the data contains 'On' and 'Off' where the encoder expected 'A' and 'B': the first list passed to categories= must hold the actual category values of column A, not the column name. Use this for your enc variable instead:
enc = OrdinalEncoder(categories=[['On','Off'],['Low','High']])
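For completeness, the corrected line slots straight back into the snippet above; the order given in categories= determines the integer codes (here 'On' -> 0.0, 'Off' -> 1.0 and 'Low' -> 0.0, 'High' -> 1.0):
enc = OrdinalEncoder(categories=[['On', 'Off'], ['Low', 'High']])
encoded_df = pd.DataFrame(enc.fit_transform(exp_df[['A', 'B']]), columns=['A', 'B'])
print(encoded_df)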
Using DBSCAN, I am iterating through a CSV with x, y, z, and id data. I would like to generate a new CSV for every combination of eps and min_samples within a set range.
Each output CSV should also include the number of points in the cluster, the eps value, and the min_samples used as columns, along with the original x, y, z, and id data. The code below does this, but only creates a CSV for the last cluster calculated.
%matplotlib notebook
from interp import *
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics import silhouette_score
import itertools
points_file = "examples/example-1/fcr.csv"
eps_values = np.arange(0.1,0.5,0.1)
min_samples = np.arange(5,20)
dbscan_params = list(itertools.product(eps_values, min_samples))
cluster_n = []
epsvalues = []
min_samp = []
for p in dbscan_params:
    dbscan_cluster = DBSCAN(eps=p[0],
                            min_samples=p[1]).fit(points)
    epsvalues.append(p[0])
    min_samp.append(p[1]), cluster_n.append(
        len(np.unique(dbscan_cluster.labels_)))
df = pd.read_csv(points_file, header=None, names=["x", "y", "z", "id"])
df["cluster"] = clustering.labels_
df["eps"] = eps
df["min_sample"] = min_sam
csv_name = f'fcr_{eps}e{min_sam}m.csv'
df.to_csv(csv_name, index=False)
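One way to get a CSV for every combination is to do the labelling and the write inside the loop, keyed on the current eps/min_samples pair. A minimal sketch, assuming points should be the x, y, z coordinates loaded from points_file (the file-name pattern is just an example):
import itertools
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN

points_file = "examples/example-1/fcr.csv"
df_base = pd.read_csv(points_file, header=None, names=["x", "y", "z", "id"])
points = df_base[["x", "y", "z"]].values  # assumption: cluster on the coordinates only

eps_values = np.arange(0.1, 0.5, 0.1)
min_samples = np.arange(5, 20)

for eps, min_sam in itertools.product(eps_values, min_samples):
    clustering = DBSCAN(eps=eps, min_samples=min_sam).fit(points)
    df = df_base.copy()
    df["cluster"] = clustering.labels_
    df["n_clusters"] = len(np.unique(clustering.labels_))  # distinct labels, including noise (-1) if present
    df["eps"] = eps
    df["min_sample"] = min_sam
    df.to_csv(f"fcr_{eps:.1f}e{min_sam}m.csv", index=False)  # one file per (eps, min_samples) pair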
The following code gives me this error: ValueError: Found array with 0 sample(s) (shape=(0, 3)) while a minimum of 1 is required.
The error is produced in the line where prediction is invoked. I assume there is something wrong with the shape of the dataframe obs_to_pred; I checked the shape, which is (1046, 3).
What do you recommend so I can fix this and run the prediction?
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import statsmodels.api as sm
from patsy import dmatrices
from sklearn.linear_model import LogisticRegression
import scipy.stats as stats
from sklearn import linear_model
# Import Titanic Data
train_loc = 'C:/Users/Young/Desktop/Kaggle/Titanic/train.csv'
test_loc = 'C:/Users/Young/Desktop/Kaggle/Titanic/test.csv'
train = pd.read_csv(train_loc)
test = pd.read_csv(test_loc)
# Predict Missing Age Values Based on Factors Pclass, SibSp, and Parch.
# In the function, combine train and test data.
def regressionPred(traindata, testdata):
    allobs = pd.concat([traindata, testdata])
    allobs = allobs[~allobs.Age.isnull()]
    y = allobs.Age
    y, X = dmatrices('y ~ Pclass + SibSp + Parch', data=allobs, return_type='dataframe')
    mod = sm.OLS(y, X)
    res = mod.fit()
    predictors = ['Pclass', 'SibSp', 'Parch']
    regr = linear_model.LinearRegression()
    regr.fit(allobs.ix[:, predictors], y)
    obs_to_pred = allobs[allobs.Age.isnull()].ix[:, predictors]
    prediction = regr.predict(obs_to_pred)  # Error Produced in This Line ***
    return res.summary(), prediction
regressionPred(train,test)
In case you may want to look at the dataset, the link will take you there: https://www.kaggle.com/c/titanic/data
In the line
allobs = allobs[~allobs.Age.isnull()]
you redefine allobs as only the cases with no NaN in the Age column.
Later, with:
obs_to_pred = allobs[allobs.Age.isnull()].ix[:,predictors]
you have no cases left to predict on: every allobs.Age.isnull() evaluates to False, so obs_to_pred is empty. Hence the error:
array with 0 sample(s) (shape=(0, 3)) while a minimum of 1 is required.
Check the logic of what you want to do with your predictions.
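If the intent is to fit on the rows that have Age and predict the rows that do not, keep the full concatenated frame around and split it once. A minimal sketch of that logic (using .loc and only the scikit-learn part of the function):
import pandas as pd
from sklearn import linear_model

def regressionPred(traindata, testdata):
    allobs = pd.concat([traindata, testdata])
    predictors = ['Pclass', 'SibSp', 'Parch']

    known = allobs[~allobs.Age.isnull()]    # rows used to fit the model
    unknown = allobs[allobs.Age.isnull()]   # rows whose Age we want to predict

    regr = linear_model.LinearRegression()
    regr.fit(known.loc[:, predictors], known.Age)

    return regr.predict(unknown.loc[:, predictors])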
Here is a sample ADF test in Python to check for cointegration between two series. However, the final result gives only a single numeric value for cointegration. How do I get the historical results of cointegration?
Taken from http://www.leinenbock.com/adf-test-in-python/
import numpy as np
import statsmodels.api as stat
import statsmodels.tsa.stattools as ts
x = np.random.normal(0,1, 1000)
y = np.random.normal(0,1, 1000)
def cointegration_test(y, x):
    result = stat.OLS(y, x).fit()
    return ts.adfuller(result.resid)
I assume you want to test for expanding cointegration? Note that you should use sm.tsa.coint to test for cointegration. You could test for a historical cointegrating relationship between realgdp and realdpi using pandas like so:
import pandas as pd
import statsmodels.api as sm
data = sm.datasets.macrodata.load_pandas().data
def rolling_coint(x, y):
    yy = y[:len(x)]
    # returns only the p-value
    return sm.tsa.coint(x, yy)[1]

historical_coint = pd.expanding_apply(data.realgdp, rolling_coint,
                                      min_periods=36,
                                      args=(data.realdpi,))
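On current pandas versions pd.expanding_apply is no longer available, so the same expanding computation can be written as a plain loop. A minimal sketch, keeping the 36-observation minimum window from above:
import pandas as pd
import statsmodels.api as sm

data = sm.datasets.macrodata.load_pandas().data

min_periods = 36
pvalues = []
for end in range(min_periods, len(data) + 1):
    # p-value of the cointegration test on the expanding window data[:end]
    pvalues.append(sm.tsa.coint(data.realgdp.iloc[:end], data.realdpi.iloc[:end])[1])

historical_coint = pd.Series(pvalues, index=data.index[min_periods - 1:])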