I am running this code
import pandas as np
import numpy as np
from sklearn import cluster
from sklearn.cluster import KMeans
model = cluster.KMeans(n_clusters=4, random_state=10)
Then I put that through a dataframe I am working on and that includes the columns age and income, which is the clusters I am working on,
model.fit(df[['income', 'age']]
And so far it works well until I run the following bit, which aims at creating a column with the label of the cluster each data point belongs to.
df['cluster'] = model.labels_df.head()
And this is the error code I get:
AttributeError: 'KMeans' object has no attribute 'labels_df'
Any suggestions?
The attribute to access the labels of the model is: model.labels_
Use:
df['cluster'] = model.labels_
By typing model.labels_df.head() you request the head of model.labels_df that does not exist.
I believe you have mistyped it and you need:
df['cluster'] = model.labels_
df.head()
Related
I am working on a binary classification using random forest algorithm
Currently, am trying to explain the model predictions using SHAP values.
So, I referred this useful post here and tried the below.
from shap import TreeExplainer, Explanation
from shap.plots import waterfall
sv = explainer(ord_test_t)
exp = Explanation(sv.values[:,:,1],
sv.base_values[:,1],
data=ord_test_t.values,
feature_names=ord_test_t.columns)
idx = 20
waterfall(exp[idx])
I like the above approach as it allows to display the feature values along with waterfall plot. So, I wish to use this approach
However, this doesn't help me get the waterfall for a specific row in ord_test_t (test data).
For example, let's consider that ord_test_t.Index.tolist() returns 3,5,8,9 etc...
Now, I want to plot the waterfall plot for ord_test_t.iloc[[9]] but when I pass exp[9], it just gets the 9th row but not the index named as 9.
When I try exp.iloc[[9]] it throws error as explanation object doesnt have iloc.
Can help me with this please?
My suggestion is as following:
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from shap import TreeExplainer, Explanation
from shap.plots import waterfall
import shap
print(shap.__version__)
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
idx = 9
model = RandomForestClassifier(max_depth=5, n_estimators=100).fit(X, y)
explainer = TreeExplainer(model)
sv = explainer(X.loc[[idx]]) # corrected, pass the row of interest as df
exp = Explanation(
sv.values[:, :, 1], # class to explain
sv.base_values[:, 1],
data=X.loc[[idx]].values, # corrected, pass the row of interest as df
feature_names=X.columns,
)
waterfall(exp[0]) # pretend you have only 1 data point which is 0th
0.40.0
Proof:
model.predict_proba(X.loc[[idx]]) # corrected
array([[0.95752656, 0.04247344]])
i am trying to learn about decision trees and I ended up finding a article about decision trees. The goal of the article is to decide if a flower is a iris flower or not but i seem to run into some errors that i hope somebody got the answer to i get two errors like the following:
iris: Bunch iris: inner_f Instance of 'tuple' has no 'target' member
and
iris: Bunch iris: inner_f Instance of 'tuple' has no 'data' member
i get these errors at the x = iris.data line and at the y = iris.target line.
Here is the code:
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split
#load iris data
iris = datasets.load_iris()
x = iris.data
y = iris.target
d = [{"sepal_length":row[0],
"sepal_width":row[1],
"petal_length":row[2],
"petal_width":row[3]} for row in x]
df = pd.DataFrame(d) # construct dataframe
df["types"] = y # assign types
df = df.sample(frac=1.0) # random shuffle rows
df.head()
Is there anybody that knows why i get these errors?
Your error message indicates that the problematic value iris is a tuple, which doesn't have the attributes you're referencing. Check the documentation for the tools you're using; they should explain how to unpack datasets.load_iris() into the objects you need.
I would not filter warnings in most cases, as you get useful information from the warnings.
So, the sklearn datasets format is a Bunch, which is a specialized container object that works like a dictionary. You can access it with dot notation, e.g. iris.data or dictionary notation, e.g. iris['data']. Here, it is unclear what the error is on your machine, as I (like other commenters) had no problem accessing iris.data or iris['data'] in python 3.8.5.
I wanted to let you know a couple of places to improve your approach:
(1) It is unclear why you need to construct a dataframe as you can get the samples you need directly from calling train_test_split on the concatenated numpy arrays or you can get a random sample of indices from the numpy arrays directly.
(2) Your method for constructing the dataframe is more complex than it needs to be.
import pandas as pd
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
# load iris data
iris = datasets.load_iris()
# train test split
X_train, y_train, X_test, y_test = train_test_split(iris.data, iris.target)
# random shuffle of data/target indices
rng = np.random.default_rng()
rng_size = iris.data.shape[0]
idx_sample = rng.choice(np.arange(rng_size), size=rng_size, replace=False)
# simpler way to create dataframe
# concatenate along the columns (axis 1)
# then set the column names in one place
df = pd.concat([pd.DataFrame(iris.data), pd.DataFrame(iris.target)], axis=1)
df.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "types"]
i am trying to learn about decision trees and I found a tutorial about the subject. The goal of the tutorial is to decide if a flower is a special type of flower called iris but i seem to run into some errors that i hope somebody got the answer to i get two errors like the following:
iris: Bunch
iris: inner_f
Instance of 'tuple' has no 'target' member
and
iris: Bunch
iris: inner_f
Instance of 'tuple' has no 'data' member
here is the code:
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split
#load iris data
iris = datasets.load_iris()
x = iris.data
y = iris.target
d = [{"sepal_length":row[0],
"sepal_width":row[1],
"petal_length":row[2],
"petal_width":row[3]} for row in x]
df = pd.DataFrame(d) # construct dataframe
df["types"] = y # assign types
df = df.sample(frac=1.0) # random shuffle rows
df.head()
Is there anybody that knows why i get these errors?
import matplotlib.pyplot as plt
from sklearn.datasets import load_boston
%matplotlib inline
boston = load_boston()
print(boston.keys())
When I type this I get the output:
dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename'])
so I know that feature_names is an attribute. However, when I type
boston.columns = boston.feature_names
the ouput comes as
'DataFrame' object has no attribute 'feature_names'
To convert boston sklearn dataset to pandas Dataframe use:
df = pd.DataFrame(boston.data,columns=boston.feature_names)
df['target'] = pd.Series(boston.target)
I had something similar. Also with scikitlearn to make a random forest with this tutorial:
https://www.datacamp.com/tutorial/random-forests-classifier-python
I stumbled upon this line of code:
import pandas as pd
feature_imp = pd.Series(clf.feature_importances_,**index=iris.feature_names**).sort_values(ascending=False)
feature_imp
I got an error from the bold part (between the **).
Thanks to the suggestions of #anky and #David Meu I tried:
feature_imp = pd.Series(clf.feature_importances_, index = dfNAN.columns.values).sort_values(ascending=False)
that results in the error:
ValueError: Length of values (4) does not match length of index (5)
so I tried:
feature_imp = pd.Series(clf.feature_importances_, index = dfNAN.columns.values[:4]).sort_values(ascending=False)
which works!
I have a dataset which consists of 13 columns and about 10million rows. Part of my project is to use isolation forest, elliptic envelope and K-mean in order to detect and remove outliers. I'm trying to use K-mean but everytime I run the code, nothing happens to the csv file, am I doing something wrong?
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
df = pd.read_csv('C:\\Users\\ali97\\Desktop\\Project\\Database\\5-FINAL2\\Final After Simple Filtering.csv')
KMEAN = KMeans( n_clusters=100)
df['anomaly'] = KMEAN.fit_predict(df)
df = df[df['anomaly'] != -1]
del df['anomaly']
df.to_csv('C:\\Users\\ali97\\Desktop\\Project\\Database\\K TEST.csv', index=False)
Thank you.