When fitting a GLM in H2O (H2O_cluster_version: 3.32.0.5) with
lambda_search = True, nlambdas = 20, and lambda_min_ratio = 0.0001,
my team and I receive 24 lambdas in our regularization path. The last 4 lambdas in the path are repeats of the first 4, the largest values.
Here is a reproducible example:
import pandas as pd
import numpy as np
import tweedie
import scipy
import os
import sys
import time
import h2o
from h2o.estimators.glm import H2OGeneralizedLinearEstimator
from h2o.grid.grid_search import H2OGridSearch
# site-specific helpers that start and connect to the H2O cluster
sys.path.append(h2odir)            # h2odir: local path to the init helper (defined elsewhere)
from h2o_auto_init import h2o_auto_init
os.system(h2oshellscript)          # h2oshellscript: shell script that launches H2O (defined elsewhere)
time.sleep(10)
h2o_auto_init()
#sample data
resp = np.random.choice(range(0,100),size=1000)
pred1 = np.random.choice(range(40,50),size=1000)
pred2 = np.random.choice(range(20,30),size=1000)
pred3 = np.random.choice([1,2,3,4,5],size=1000)
weight = np.random.choice([1,1,1,.9,.37],size=1000)
folds = np.random.choice([1,2,3,4,5],size=1000)
data = pd.DataFrame({'resp': resp, 'pred1':pred1,'pred2':pred2,'pred3':pred3,'weight':weight,'fold_column':folds})
predictors = ['pred1','pred2','pred3']
# convert pandas df to h2oframe
H2Odata = h2o.H2OFrame(data, column_names=data.columns.tolist())
# set up model
model = H2OGeneralizedLinearEstimator(
family="tweedie",
tweedie_link_power = 0,
tweedie_variance_power = 1.7,
lambda_search=True,
early_stopping = False,
lambda_min_ratio = 0.0001,
nlambdas=20,
alpha=.5,
standardize = True,
weights_column='weight',
solver = 'IRLSM',
#beta_constraints = constraints,
keep_cross_validation_models = True,
keep_cross_validation_predictions = True,
keep_cross_validation_fold_assignment=True
)
# Train the model with training and validation data
model.train(
x=predictors,
y='resp',
training_frame=H2Odata,
fold_column = 'fold_column'
)
# get full regularization paths
# list of regularization paths, one per cross-validation model
cv_models = model.cross_validation_models()
regpath_h2o_cv = []
for cv_model in cv_models:
    regpath_h2o_cv.append(H2OGeneralizedLinearEstimator.getGLMRegularizationPath(cv_model))
H2OGeneralizedLinearEstimator.getGLMRegularizationPath(cv_models[0])['lambdas']
When I run this, there is an extra lambda, a repeat of the first lambda.
Can anyone provide guidance on why H2O is providing more lambdas than requested, and especially repeated lambdas?
Does this mean it is fitting unnecessary models?
Our real use case is on very large data, and any time we can save by avoiding unnecessary modeling will be helpful.
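For what it's worth, one way to check whether the repeated lambdas correspond to genuinely distinct fits is to compare the coefficient sets stored in the regularization path. A minimal sketch, assuming the dict returned by getGLMRegularizationPath exposes 'lambdas' and 'coefficients' entries:
# Compare repeated lambda values against the coefficients stored for them
regpath = H2OGeneralizedLinearEstimator.getGLMRegularizationPath(model)
lambdas = regpath['lambdas']
coefs = regpath['coefficients']
first_seen = {}
for i, lam in enumerate(lambdas):
    if lam in first_seen:
        j = first_seen[lam]
        print(f"lambda {lam:.6g}: index {i} repeats index {j}; "
              f"identical coefficients: {coefs[i] == coefs[j]}")
    else:
        first_seen[lam] = i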
I am trying to run t-distributed Stochastic Neighbor Embedding (t-SNE) in Jupyter but keep running into this error:
ValueError: could not convert string to float: '<Null>'
Code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler
import seaborn as sns
# Reading the data using pandas
df = pd.read_csv(r"E:\Field data\Output\Pixel values7.csv")
# print the first few rows of df
print(df.head(9))
# save the labels into a variable l.
l = df['label']
# Drop the label column and store the pixel data in d.
d = df.drop("label", axis = 1)
I get the error in the code after this line:
# Data-preprocessing: Standardizing the data
standardized_data = StandardScaler().fit_transform(d)
print(standardized_data.shape)
# TSNE
# Picking the top 1000 points as TSNE
# takes a lot of time for 15K points
data_1000 = standardized_data[0:1000, :]
labels_1000 = l[0:1000]
model = TSNE(n_components = 2, random_state = 0)
# configuring the parameters
# the number of components = 2
# default perplexity = 30
# default learning rate = 200
# default maximum number of iterations
# for the optimization = 1000
tsne_data = model.fit_transform(data_1000)
# creating a new data frame which
# helps us in plotting the result data
tsne_data = np.vstack((tsne_data.T, labels_1000)).T
tsne_df = pd.DataFrame(data = tsne_data,
                       columns = ("Dim_1", "Dim_2", "label"))
# Plotting the result of tsne
sns.FacetGrid(tsne_df, hue = "label", size = 6).map(
    plt.scatter, 'Dim_1', 'Dim_2').add_legend()
plt.show()
I got this code from a tutorial; I am not an expert in Python, so I would appreciate some help. I am trying to run this program on my data but always get the error
ValueError: could not convert string to float: '<Null>'
If there is other code for t-distributed Stochastic Neighbor Embedding (t-SNE) that avoids this, please let me know.
My data looks like this: [screenshot of the data in the original post]
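Given the error message, the CSV most likely uses the literal string '<Null>' as a missing-value marker, so StandardScaler receives strings it cannot convert to numbers. A minimal sketch of one way to handle this (assuming '<Null>' is the only placeholder, and reusing the path from the question):
# Treat '<Null>' as missing at read time, then drop incomplete rows before scaling.
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv(r"E:\Field data\Output\Pixel values7.csv", na_values=['<Null>'])
l = df['label']
d = df.drop("label", axis=1)

# Drop rows with missing pixel values (imputation would be an alternative)
mask = d.notna().all(axis=1)
d, l = d[mask], l[mask]

standardized_data = StandardScaler().fit_transform(d)
print(standardized_data.shape)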
I am trying to build a collaborative recommendation system with the code below. I am new to deep learning, and I am stuck on an error when I try to train the model on a CSV dataset (a screenshot of the error is in the original post). Can anyone please help me understand what's happening? I would really appreciate it.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Import the surprise packages
from surprise import Dataset
from surprise import Reader
from surprise.prediction_algorithms.matrix_factorization import SVD as FunkSVD
# Import GridSearchCV for algorithm tuning
from surprise.model_selection import GridSearchCV
# Import train_test_split
from surprise.model_selection import train_test_split
# Read in the prepared dataframe from the user_cleanup notebook
user_df = pd.read_csv('user_clean.csv')
user_df.head()
# Merge with the games dataframe (games_df, prepared in another notebook) on appid
df = user_df.merge(games_df, on='appid')
df = df.drop('name', axis=1)
df.head()
# Let's take a look at one of the most prominent users in the dataset, user 24469287
df[df['user_id'] == 24469287]
# Let's find this user's favorite games using the 1-5 rating scale
print(f"Shape:{df[(df['user_id'] == 24469287) & (df['rating_5'] == 5)].shape}")
display(df[(df['user_id'] == 24469287) & (df['rating_5'] == 5)])
# Prepare the dataframes for the surprise package
# Dataframe needs to contain 3 columns: user id, item id, and rating
# For the 1-10 scale
rating_10_df = df.filter(['user_id','appid','rating_10'])
rating_10_df = rating_10_df.sort_values(by=['user_id','appid'])
# And the 1-5 scale
rating_5_df = df.filter(['user_id','appid','rating_5'])
rating_5_df = rating_5_df.sort_values(by=['user_id','appid'])
# Confirm dataframe is set up properly (user, item, rating)
rating_10_df.head()
# initialize the reader with the 0-10 rating scale
my_reader = Reader(rating_scale=(0,10))
# load the dataframe with the reader
md = Dataset.load_from_df(rating_10_df, my_reader)
%%time
# Set the parameter grid for optimization
param_grid = {
    # Number of latent factors. More factors can give better results, but can also lead to overfitting
    'n_factors': [50, 100, 150],
    # Number of epochs, i.e. the number of iterations the algorithm will run
    'n_epochs': [10, 20, 50],
    # Learning rate. Larger values give faster learning; smaller values give more accurate learning.
    'lr_all': [0.005, 0.1],
    'biased': [False]
}
# Set GridSearchCV with 5 fold cross-validation using the FunkSVD
GS = GridSearchCV(FunkSVD, param_grid, measures=['rmse','mae','fcp'], cv=5)
# Fit the model to the data
GS.fit(md)
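Once the search finishes, the best scores and parameter combinations can be read off per measure, and the best configuration can be refit on the full dataset. A short sketch using the documented GridSearchCV attributes from surprise:
# Best cross-validated score and the parameters that achieved it
print(GS.best_score['rmse'])
print(GS.best_params['rmse'])

# Refit the best-scoring configuration on the full dataset
best_svd = GS.best_estimator['rmse']
trainset = md.build_full_trainset()
best_svd.fit(trainset)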
I am estimating a VAR(1) model in statsmodels (the sample code is from the statsmodels user guide).
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.tsa.api import VAR
from statsmodels.tsa.base.datetools import dates_from_str
# prepare the data
mdata = sm.datasets.macrodata.load_pandas().data
dates = mdata[['year', 'quarter']].astype(int).astype(str)
quarterly = dates["year"] + "Q" + dates["quarter"]
quarterly = dates_from_str(quarterly)
mdata = mdata[['realgdp','realcons','realinv']]
mdata.index = pd.DatetimeIndex(quarterly)
data = np.log(mdata).diff().dropna()
# make a VAR model
model = VAR(data)
results = model.fit(1)
I want to compute the variance of the VAR model (click here for an explanation). Is there an attribute or property of the VARResults object that can give the variance directly?
I have found the answer:
results.acf(0)
The acf() method of the VARResults object computes the theoretical autocovariance function of the VAR model; at lag 0 this is the covariance (variance) matrix of the process.
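As a concrete sketch (assuming acf returns an array of shape (nlags + 1, k, k)), the lag-0 slice gives the stationary covariance matrix directly:
# Lag-0 autocovariance = covariance (variance) matrix of the VAR process
sigma = results.acf(0)[0]
print(sigma)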
Suppose I fit a model on the dataset dataset1 using SARIMAX from statsmodels.tsa.statespace.sarimax - is it possible to then use this fit to make predictions on another dataset dataset2?
Namely, consider the following:
from statsmodels.tsa.statespace.sarimax import SARIMAX
import pandas as pd
import numpy as np
# generate example data
n=90
idx = pd.PeriodIndex(pd.date_range(start = '2015-01-02',end='2015-04-01',freq='D'))
dat = np.sin(np.linspace(0,12*np.pi,n)) + np.random.randn(n)/10
dataset1 = pd.Series(dat, index = idx)
# fit model
fit = SARIMAX(dataset1, order = (1,0,1)).fit()
# make 30 day forecast on dataset1
fit.forecast(30)
How would I go about using fit to make a prediction on dataset2?
dat = np.sin(np.linspace(0,12*np.pi,n)) + np.random.randn(n)/10
dataset2 = pd.Series(dat, index = idx)
Ideally, it would be something as simple as fit(dataset2).forecast(30), but that clearly isn't the case.
I know I can extract the estimated parameters via fit.params, but short of going through that tedious process, is there a built-in way, or a hack, to reuse the existing fit instance?
You can use the apply results method:
from statsmodels.tsa.statespace.sarimax import SARIMAX
import pandas as pd
import numpy as np
# generate example data
n=90
idx = pd.PeriodIndex(pd.date_range(start = '2015-01-02',end='2015-04-01',freq='D'))
dat = np.sin(np.linspace(0,12*np.pi,n)) + np.random.randn(n)/10
dataset1 = pd.Series(dat, index = idx)
# fit model
fit = SARIMAX(dataset1, order = (1,0,1)).fit()
# make 30 day forecast on dataset1
fit.forecast(30)
# ------------------------------------
# get the new dataset
dat = np.sin(np.linspace(0,12*np.pi,n)) + np.random.randn(n)/10
dataset2 = pd.Series(dat, index = idx)
# apply the parameters from `fit` to the new dataset
fit2 = fit.apply(dataset2)
# make 30 day forecast on dataset2
fit2.forecast(30)
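Note that apply reuses the parameters estimated on dataset1 without re-estimating them on dataset2, so it is cheap even for a large new series; the related append and extend results methods are for adding new observations that continue the original sample instead.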
I am using the LassoCV() model for feature selection. It is giving me this warning and not selecting any features:
C:\Users\xyz\Anaconda3\lib\site-packages\sklearn\feature_selection\base.py:80: UserWarning: No features were selected: either the data is too noisy or the selection test too strict.
The code is given below. The data is at https://www.kaggle.com/jtrofe/beer-recipes/downloads/recipeData.csv/3
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV
# dataset URL = https://www.kaggle.com/jtrofe/beer-recipes/downloads/recipeData.csv/3
dataframe = pd.read_csv('Brewer Friend Beer Recipes.csv', encoding = 'latin')
# Encoding the non-numeric columns
def encoding_data(dataframe):
    if dataframe.dtype == 'object':
        return LabelEncoder().fit_transform(dataframe.astype(str))
    else:
        return dataframe

# Feature selection using the selected target feature
def feature_selection(raw_dataframe, target_feature_list):
    output_list = []
    # preprocessing: convert categorical data into numeric data
    dataframe = raw_dataframe.apply(encoding_data)
    column_list = dataframe.columns.tolist()
    dataframe = dataframe.dropna()
    for target in target_feature_list:
        target_feature = target
        x = dataframe.drop(columns=[target_feature])
        y = dataframe[target_feature].values
        # Lasso feature selection
        estimator = LassoCV(cv=3, n_alphas=1)
        featureselection = SelectFromModel(estimator)
        featureselection.fit(x, y)
        features = featureselection.transform(x)
        feature_list = x.columns[featureselection.get_support()]
        features = ', '.join(feature_list)
        l = (target, features)
        output_list.append(l)
    output_df = pd.DataFrame(output_list, columns=['Name', 'Selected Features'])
    print('\nThe Feature Selection is done with the respective target feature(s)')
    return output_df

print(feature_selection(dataframe, ['BrewMethod']))
I am getting this warning and no features are selected:
C:\Users\xyz\Anaconda3\lib\site-packages\sklearn\feature_selection\base.py:80: UserWarning: No features were selected: either the data is too noisy or the selection test too strict.
Any idea how to rectify this?
If no features have been selected, you can gradually decrease lambda (or, in scikit-learn's case, alpha). This reduces the penalization and will usually return some nonzero coefficients.
It is quite unusual that no coefficients are selected at all, so you should also check the correlations in your data; you may have a lot of collinearity.
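As a hedged sketch of that suggestion, applied to the code in the question (x and y as built inside feature_selection): let LassoCV search a full path of alphas instead of n_alphas = 1, and lower the SelectFromModel threshold so small but nonzero coefficients survive.
# Search many alphas; cross-validation then picks the best one, which is
# usually much smaller than the single alpha tried with n_alphas=1.
estimator = LassoCV(cv=3, n_alphas=100)

# threshold=1e-5 keeps any coefficient that is not effectively zero
featureselection = SelectFromModel(estimator, threshold=1e-5)
featureselection.fit(x, y)
print(x.columns[featureselection.get_support()].tolist())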