I am currently following along with a Machine Learning Full Course sponsored by Simplilearn to get a better understanding of regression, and am running into this error:
TypeError: __init__() got an unexpected keyword argument 'categorical_features'
From this code:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
companies = pd.read_csv('Companies_1000.csv')
X = companies.iloc[:, :-1].values
X = companies.iloc[:, :4].values
companies.head()
cmap = sns.cm.rocket_r
sns.heatmap(companies.corr(), cmap = cmap)
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder = LabelEncoder()
X[:, 3] = labelencoder.fit_transform(X[:, 3])
onehotencoder = OneHotEncoder(categorical_features = [3])
X = onehotencoder.fit_transform(X).toarray()
print(X)
This is the csv file: https://raw.githubusercontent.com/boosuro/profit_estimation_of_companies/master/1000_Companies.csv
The video does not get the same error as me, so I assume the code is outdated; however, after crawling through the sklearn docs, I have come up empty-handed for a solution. I am using Python 3. If you want to check out exactly the info and code that's happening in the video, here it is:
https://www.youtube.com/watch?v=9f-GarcDY58
My error appears around the 47:25 mark. Thank you for checking this out, and thanks for your answers.
The error is due to the following line:
onehotencoder = OneHotEncoder(categorical_features = [3])
There is no parameter named "categorical_features"; it was removed in newer versions of sklearn. Instead there is "categories", where you can pass a list of categories. By default "categories" is set to "auto", which means it will automatically determine the categories from the training data.
So you need not pass anything to the OneHotEncoder() constructor; just leave it like this.
Change the line as below:
onehotencoder = OneHotEncoder()
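If you still need to encode only column 3 (what the removed categorical_features=[3] argument used to select), here is a minimal sketch using ColumnTransformer, assuming X is the array from the question:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
# Encode only column 3 and pass the remaining columns through unchanged;
# this mirrors the old categorical_features=[3] behaviour
ct = ColumnTransformer([('state', OneHotEncoder(), [3])], remainder='passthrough')
X = ct.fit_transform(X)
Note that recent versions of OneHotEncoder accept string categories directly, so the LabelEncoder step is no longer required.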
I am trying to learn about decision trees and ended up finding an article about them. The goal of the article is to decide whether a flower is an iris flower or not, but I seem to run into some errors that I hope somebody has the answer to. I get two errors like the following:
iris: Bunch iris: inner_f Instance of 'tuple' has no 'target' member
and
iris: Bunch iris: inner_f Instance of 'tuple' has no 'data' member
I get these errors at the x = iris.data line and at the y = iris.target line.
Here is the code:
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split
#load iris data
iris = datasets.load_iris()
x = iris.data
y = iris.target
d = [{"sepal_length":row[0],
"sepal_width":row[1],
"petal_length":row[2],
"petal_width":row[3]} for row in x]
df = pd.DataFrame(d) # construct dataframe
df["types"] = y # assign types
df = df.sample(frac=1.0) # random shuffle rows
df.head()
Does anybody know why I get these errors?
Your error message indicates that the problematic value iris is a tuple, which doesn't have the attributes you're referencing. Check the documentation for the tools you're using; they should explain how to unpack datasets.load_iris() into the objects you need.
I would not filter warnings in most cases, as you get useful information from the warnings.
So, the sklearn datasets format is a Bunch, which is a specialized container object that works like a dictionary. You can access it with dot notation, e.g. iris.data, or dictionary notation, e.g. iris['data']. Here, it is unclear what the error is on your machine, as I (like other commenters) had no problem accessing iris.data or iris['data'] in Python 3.8.5.
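As a minimal sketch of both access styles (and, as a guess at where a tuple could come from, the return_X_y=True flag, which makes load_iris return a plain (data, target) tuple with no .data or .target attributes):
from sklearn import datasets
iris = datasets.load_iris()
print(iris.data.shape)       # attribute access on the Bunch
print(iris['target'].shape)  # equivalent dictionary access
# return_X_y=True yields a plain tuple instead of a Bunch
X, y = datasets.load_iris(return_X_y=True)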
I wanted to point out a couple of ways to improve your approach:
(1) It is unclear why you need to construct a dataframe, as you can get the samples you need directly by calling train_test_split on the numpy arrays, or by taking a random sample of indices from them directly.
(2) Your method for constructing the dataframe is more complex than it needs to be.
import pandas as pd
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
# load iris data
iris = datasets.load_iris()
# train test split
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target)
# random shuffle of data/target indices
rng = np.random.default_rng()
rng_size = iris.data.shape[0]
idx_sample = rng.choice(np.arange(rng_size), size=rng_size, replace=False)
# simpler way to create dataframe
# concatenate along the columns (axis 1)
# then set the column names in one place
df = pd.concat([pd.DataFrame(iris.data), pd.DataFrame(iris.target)], axis=1)
df.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "types"]
I use sklearn to impute some time-series which include NaN values. At the moment, I use the following:
from sklearn.preprocessing import Imputer
imp = Imputer(missing_values='NaN', strategy='mean')
signals = imp.fit_transform(array)
in which array is a numpy array of shape n_points x n_time_steps. It works fine, but I get a deprecation warning which suggests I should use SimpleImputer from sklearn.impute. Hence I replaced those lines with the following:
from sklearn.impute import SimpleImputer
imp = SimpleImputer(missing_values='NaN', strategy='mean')
signals = imp.fit_transform(array)
but I get the following error on the last line:
ValueError: 'X' and 'missing_values' types are expected to be both
numerical. Got X.dtype=float32 and type(missing_values)=<class 'str'>.
If anybody has any idea what the cause of this error is, I'd be glad if you let me know. I am using Python 3.6.7 with sklearn 0.20.1. Thanks!
If array contains missing values represented as np.nan, you should use np.nan as the argument to the constructor of SimpleImputer. That's the default argument, so this works:
imp = SimpleImputer(strategy='mean')
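For completeness, a minimal runnable sketch of the fix, with a toy array standing in for your data (since np.nan is the default, missing_values can be omitted entirely):
import numpy as np
from sklearn.impute import SimpleImputer
array = np.array([[1.0, np.nan], [3.0, 4.0]], dtype=np.float32)  # toy stand-in
# pass np.nan (the default), not the string 'NaN'
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
signals = imp.fit_transform(array)
print(signals)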
I am trying to evaluate an sklearn predictor which I have made over a larger-than-memory dask array of inputs. I have read over the parallel post-fit documentation https://dask-ml.readthedocs.io/en/latest/modules/generated/dask_ml.wrappers.ParallelPostFit.html and am still having some problems. The following code illustrates the kind of issue that I am running into:
from dask.base import tokenize
import numpy as np
import dask.array as da
from dask.array import Array
from sklearn.linear_model import LinearRegression
from dask_ml.wrappers import ParallelPostFit
"""
for stack overflow question
"""
x = np.linspace(0,100,100,dtype=np.int32)
y = np.linspace(0,100,100,dtype=np.int32)
z = np.linspace(0,100,100,dtype=np.int32)
Y = np.random.normal(size=(100,))
X = np.stack([x,y,z],axis=1)
reg = LinearRegression().fit(X,Y)
#now try to compute on dask arrays over the whole space
x= da.linspace(0,100,100,chunks=(10,)).astype(np.int32)
y= da.linspace(0,100,100,chunks=(10,)).astype(np.int32)
z= da.linspace(0,100,100,chunks=(10,)).astype(np.int32)
x,y,z = da.meshgrid(x,y,z,sparse=False,indexing='ij')
stacked = da.stack([x.flatten(),y.flatten(),z.flatten()],axis=1)
clf = ParallelPostFit(estimator=reg)
clf.predict(stacked)
Executing clf.predict throws a ValueError:
Can't drop an axis with more than 1 block. Please use atop instead.
which I don't understand how to correct.
Thank you for any help.
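One possible workaround, offered as an untested sketch: da.stack along axis 1 leaves the feature axis split into three single-column blocks, and the error complains about dropping an axis with more than one block, so rechunking that axis into a single block before predicting may help:
# untested sketch: collapse the feature axis (axis 1) into one block
stacked = stacked.rechunk({1: -1})
result = clf.predict(stacked)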
Hopefully I'm reading this wrong, but the XGBoost library documentation notes that you can extract feature importances via the feature_importances_ attribute, much like sklearn's random forest.
However, for some reason, I keep getting this error: AttributeError: 'XGBClassifier' object has no attribute 'feature_importances_'
My code snippet is below:
from sklearn import datasets
import xgboost as xg
iris = datasets.load_iris()
X = iris.data
Y = iris.target
Y = iris.target[ Y < 2] # arbitrarily removing class 2 so it can be 0 and 1
X = X[range(1,len(Y)+1)] # cutting the dataframe to match the rows in Y
xgb = xg.XGBClassifier()
fit = xgb.fit(X, Y)
fit.feature_importances_
It seems that you can compute feature importance using the Booster object by calling the get_fscore method. The only reason I'm using XGBClassifier over Booster is that it can be wrapped in a sklearn pipeline. Any thoughts on feature extraction? Is anyone else experiencing this?
As the comments indicate, I suspect your issue is a versioning one. However if you do not want to/can't update, then the following function should work for you.
def get_xgb_imp(xgb, feat_names):
    from numpy import array
    imp_vals = xgb.booster().get_fscore()
    imp_dict = {feat_names[i]: float(imp_vals.get('f'+str(i), 0.)) for i in range(len(feat_names))}
    total = array(imp_dict.values()).sum()
    return {k: v/total for k, v in imp_dict.items()}
>>> import numpy as np
>>> from xgboost import XGBClassifier
>>>
>>> feat_names = ['var1','var2','var3','var4','var5']
>>> np.random.seed(1)
>>> X = np.random.rand(100,5)
>>> y = np.random.rand(100).round()
>>> xgb = XGBClassifier(n_estimators=10)
>>> xgb = xgb.fit(X,y)
>>>
>>> get_xgb_imp(xgb,feat_names)
{'var5': 0.0, 'var4': 0.20408163265306123, 'var1': 0.34693877551020408, 'var3': 0.22448979591836735, 'var2': 0.22448979591836735}
For xgboost, if you use xgb.fit(), then you can use the following method to get feature importance.
import pandas as pd

xgb_model = xgb.fit(x, y)
xgb_fea_imp = pd.DataFrame(list(xgb_model.get_booster().get_fscore().items()),
                           columns=['feature', 'importance']).sort_values('importance', ascending=False)
print('', xgb_fea_imp)
xgb_fea_imp.to_csv('xgb_fea_imp.csv')

from xgboost import plot_importance
plot_importance(xgb_model)
I found out the answer. It appears that version 0.4a30 does not have the feature_importances_ attribute. Therefore, if you install the xgboost package using pip install xgboost you will be unable to conduct feature extraction from the XGBClassifier object; you can refer to #David's answer if you want a workaround.
However, what I did is build it from source by cloning the repo and running . ./build.sh, which installs version 0.4, where the feature_importances_ attribute works.
Hope this helps others!
Get Feature Importance as a sorted data frame
import pandas as pd
import numpy as np
def get_xgb_imp(xgb, feat_names):
    imp_vals = xgb.booster().get_fscore()
    feats_imp = pd.DataFrame(imp_vals, index=np.arange(2)).T
    feats_imp.iloc[:, 0] = feats_imp.index
    feats_imp.columns = ['feature', 'importance']
    feats_imp.sort_values('importance', inplace=True, ascending=False)
    feats_imp.reset_index(drop=True, inplace=True)
    return feats_imp
feature_importance_df = get_xgb_imp(xgb, feat_names)
For those having the same problem as Luís Bianchin, "TypeError: 'str' object is not callable", I found a solution (that works for me at least) here.
In short, I found modifying David's code from
imp_vals = xgb.booster().get_fscore()
to
imp_vals = xgb.get_fscore()
worked for me.
For more detail I would recommend visiting the link above.
Big thanks to David and ianozsvald
You can also use the built-in plot_importance function:
from xgboost import XGBClassifier, plot_importance
fit = XGBClassifier().fit(X,Y)
plot_importance(fit)
The alternative to built-in feature importance can be:
permutation-based importance from scikit-learn (permutation_importance method)
importance with Shapley values (shap package)
I really like the shap package because it provides additional plots. Examples:
Importance Plot
Summary Plot
Dependence Plot
You can read about alternative ways to compute feature importance in Xgboost in this blog post of mine.
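As an illustrative sketch of the permutation-based option (permutation_importance is available in sklearn.inspection from scikit-learn 0.22 on):
from sklearn.datasets import load_iris
from sklearn.inspection import permutation_importance
from xgboost import XGBClassifier
X, y = load_iris(return_X_y=True)
model = XGBClassifier().fit(X, y)
# shuffle each feature in turn and measure the drop in score
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
print(result.importances_mean)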
An update of the accepted answer since it no longer works:
def get_xgb_imp(xgb_model, feat_names):
    imp_vals = xgb_model.get_fscore()
    imp_dict = {feat: float(imp_vals.get(feat, 0.)) for feat in feat_names}
    total = sum(list(imp_dict.values()))
    return {k: round(v/total, 5) for k, v in imp_dict.items()}
It seems like the API keeps changing. For xgboost version 1.0.2, just changing from imp_vals = xgb.booster().get_fscore() to imp_vals = xgb.get_booster().get_fscore() in #David's answer does the trick. The updated code is:
from numpy import array

def get_xgb_imp(xgb, feat_names):
    imp_vals = xgb.get_booster().get_fscore()
    imp_dict = {feat_names[i]: float(imp_vals.get('f'+str(i), 0.)) for i in range(len(feat_names))}
    # list() needed in Python 3, where dict.values() is a view
    total = array(list(imp_dict.values())).sum()
    return {k: v/total for k, v in imp_dict.items()}
I used the following code to get the feature importance. I also used DictVectorizer() in the pipeline for one-hot encoding. If you use
v = DictVectorizer()
X_to_dict = X.to_dict("records")
X_transformed = v.fit_transform(X_to_dict)
feature_names = v.get_feature_names()
best_model.get_booster().feature_names = feature_names
xgb.plot_importance(best_model.get_booster())
you can obtain the f_score plot. But I wanted to plot the feature importance against the feature names, so I modified it further:
import matplotlib.pyplot as plt

f, ax = plt.subplots(figsize=(10, 30))
plt.barh(feature_names, best_model.feature_importances_)
plt.xticks(rotation=90)
plt.show()
I have been successfully playing around with replicating one of the sklearn tutorials using the iris dataset in PyCharm using Python 2.7. However, when trying to repeat this with my own data I have been encountering an issue. I have been importing data from a .csv file using 'np.genfromtxt', but for some reason I keep getting a single column output for X_r2 (see below), when I should get a 2 column output. I have therefore replaced my data with some randomly generated variables to post onto SO, and I am still getting the same issue.
I have included the 'problem' code below, and I would be interested to know what I have done wrong. I have extensively used the debugging features in PyCharm to check that the type and shape of my variables are similar to the original sklearn example, but it did not help me with the problem. Any help or suggestions would be appreciated.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
y = np.random.randint(2, size=500)
X = np.random.randint(1, high=1000, size=(500, 6))
target_names = np.array([['XX'], ['YY']])
lda = LinearDiscriminantAnalysis(n_components=2)
X_r2 = lda.fit(X, y).transform(X)
The array y in the example you posted has values of 0, 1 and 2, while yours only has values of 0 and 1. LDA can produce at most n_classes - 1 discriminant components, so with a binary target transform returns a single column. This change achieves what you want:
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
y = np.random.randint(3, size=500)
X = np.random.randint(1, high=1000, size=(500, 6))
target_names = np.array([['XX'], ['YY']])
lda = LinearDiscriminantAnalysis(n_components=2)
X_r2 = lda.fit(X, y).transform(X)
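Alternatively, if you want to keep your binary target, request a single component, since that is all two classes can yield; a small sketch:
# with two classes there is only one discriminant axis
lda = LinearDiscriminantAnalysis(n_components=1)
X_r1 = lda.fit(X, y).transform(X)  # shape (500, 1)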