I am trying to learn about decision trees and found an article on the topic. The goal of the article is to decide whether a flower is an iris or not, but I run into two errors that I hope somebody can explain:
iris: Bunch
iris: inner_f
Instance of 'tuple' has no 'target' member
and
iris: Bunch
iris: inner_f
Instance of 'tuple' has no 'data' member
I get these errors at the x = iris.data line and at the y = iris.target line.
Here is the code:
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split
#load iris data
iris = datasets.load_iris()
x = iris.data
y = iris.target
d = [{"sepal_length":row[0],
"sepal_width":row[1],
"petal_length":row[2],
"petal_width":row[3]} for row in x]
df = pd.DataFrame(d) # construct dataframe
df["types"] = y # assign types
df = df.sample(frac=1.0) # random shuffle rows
df.head()
Does anybody know why I get these errors?
Your error message indicates that the problematic value iris is a tuple, which doesn't have the attributes you're referencing. Check the documentation for the tools you're using; they should explain how to unpack datasets.load_iris() into the objects you need.
I would not filter warnings in most cases, as you get useful information from the warnings.
The sklearn datasets are returned as a Bunch, which is a specialized container object that works like a dictionary. You can access it with dot notation, e.g. iris.data, or with dictionary notation, e.g. iris['data']. It is unclear what causes the error on your machine, as I (like other commenters) had no problem accessing iris.data or iris['data'] in Python 3.8.5.
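A minimal sketch of the two access styles, and of one way a tuple can appear. It assumes scikit-learn is installed. The guess that the message comes from a linter (its wording differs from a runtime AttributeError) is an assumption, not something confirmed in this thread; the variable names are illustrative.

```python
from sklearn import datasets

# Default call: returns a Bunch (a dict subclass with attribute access)
iris = datasets.load_iris()
print(type(iris))

X_attr = iris.data     # dot notation
X_key = iris["data"]   # dictionary notation, same underlying array

# With return_X_y=True the same function returns a plain (data, target) tuple,
# and a tuple has no .data or .target attribute
as_tuple = datasets.load_iris(return_X_y=True)
print(type(as_tuple))

# The tuple form is meant to be unpacked
X, y = as_tuple
print(X.shape, y.shape)
```

If the code runs without raising an exception and only the editor flags the two lines, the program itself is likely fine.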
I wanted to let you know a couple of places to improve your approach:
(1) It is unclear why you need to construct a dataframe as you can get the samples you need directly from calling train_test_split on the concatenated numpy arrays or you can get a random sample of indices from the numpy arrays directly.
(2) Your method for constructing the dataframe is more complex than it needs to be.
import pandas as pd
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
# load iris data
iris = datasets.load_iris()
# train test split
# train_test_split returns the splits in the order X_train, X_test, y_train, y_test
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target)
# random shuffle of data/target indices
rng = np.random.default_rng()
rng_size = iris.data.shape[0]
idx_sample = rng.choice(np.arange(rng_size), size=rng_size, replace=False)
# simpler way to create dataframe
# concatenate along the columns (axis 1)
# then set the column names in one place
df = pd.concat([pd.DataFrame(iris.data), pd.DataFrame(iris.target)], axis=1)
df.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "types"]
Related
I am trying to define X and Y from a dataset that is already defined; I ran some analysis on it without any problem.
But when I define X and Y, I get the error message "NameError: name 'MyNewDataSet' is not defined".
The dataset name is "MyNewDataSet".
Do I need to define a new dataset before assigning the values to X and Y? Or what should I do?
Here is how I define X and Y from my dataframe:
#Create X
X = MyNewDataSet.drop(['y'], axis=1)
For X I am using multiple columns so all I do is remove the column I will be using for my y variable.
#Create y
y = MyNewDataSet['y']
Here I create y by assigning the column y to the variable.
If this does not work please share some of your code. That way it might be easier for us to visualize your problem, but I hope this can help.
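As a rough illustration of both points, the sketch below builds a small stand-in DataFrame (the feature columns and values are hypothetical, not the asker's data), splits it into X and y as described above, and shows that a NameError is simply what Python raises when a name has not been assigned in the current session.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset; "y" is the target column
MyNewDataSet = pd.DataFrame({
    "feature_a": [1.0, 2.0, 3.0],
    "feature_b": [0.5, 0.1, 0.9],
    "y": [10, 20, 30],
})

# Features: every column except the target
X = MyNewDataSet.drop(["y"], axis=1)
# Target: the single column "y"
y = MyNewDataSet["y"]

# Using a name that was never assigned raises NameError
try:
    UndefinedDataSet.head()
except NameError as err:
    message = str(err)

print(message)
```

In a notebook this usually means the cell that creates the DataFrame was not run (or failed) before the cell that uses it.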
Here is my code to define my new dataset:
###KNN imputation to fill missing values.
import numpy as np
from numpy import isnan
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=5)
imputer.fit(MyDataSet)
MyNewDataSet = pd.DataFrame(imputer.transform(MyDataSet), columns = MyDataSet.columns)
MyNewDataSet.set_index(MyDataSet.index, inplace= True)
MyNewDataSet = MyNewDataSet.astype(MyDataSet.dtypes.to_dict())
MyNewDataSet
And here is my code to assign values to X and Y:
### Split the dataset into a set of features (X) and Target variable (y)
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import HuberRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
y=MyNewDataSet["Target"]
x=MyNewDataSet.drop("Target", axis=1)
I have found a piece of sklearn code that runs the example Iris dataset and that I find relatively straightforward, but how do I create my own dataset in a similar format?
iris.data - contains the data measurements of three types of flower
iris.target - contains the labels of three types of flower
e.g. rather than analysing the three types of flower in the Iris Dataset, I would like to make my own datasets that follow this format and that I can pass through the code.
example_sports.data - contains the data measurements of three types of sports players
example_sports.target - contains the labels of three types of sport
from sklearn.datasets import load_iris #load inbuilt dataset from sklearn
iris = load_iris() #assign variable name iris to inbuilt dataset
iris.data # contains the four numeric variables, sepal length, sepal width, petal length, petal width
print(iris.data) #printing the measurements of the Iris Dataset below
iris.target # relates to the different species shown as 0,1,2 for the three different
# species of Iris, a categorical variable, basically a label
print(iris.target)
The full code can be found at https://www.youtube.com/watch?v=asW8tp1qiFQ
sklearn datasets are stored in Bunch which is basically just a type of dict. Sklearn data and targets are basically just NumPy arrays and can be fed into the fit() method of the sklearn estimator you are interested in. But if you want to make your own data as a Bunch, you can do something like the following:
from sklearn.utils import Bunch
import numpy as np
example_sports = Bunch(data = np.array([[1.0,1.2,2.1],[3.0,2.3,1.0],[4.5,3.4,0.5]]), target = np.array([3,2,1]))
print(example_sports.data)
print(example_sports.target)
Naturally, you can read your own custom lists into the data and target entries of the Bunch. Pandas is a good tool if you have the data in Excel/CSV files.
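A possible sketch of the CSV route, assuming scikit-learn and pandas are installed. The column names and values are made up for illustration, and the CSV is read from an in-memory string so the example is self-contained; with a real file you would pass its path to pd.read_csv instead.

```python
import io

import pandas as pd
from sklearn.utils import Bunch

# Hypothetical CSV content: two measurements plus a text label per player
csv_text = """height_cm,weight_kg,sport
198,95,basketball
180,75,tennis
185,90,football
201,100,basketball
"""

df = pd.read_csv(io.StringIO(csv_text))

# Turn the text labels into integer codes, numbered in order of first appearance
codes, names = pd.factorize(df["sport"])

feature_columns = ["height_cm", "weight_kg"]
example_sports = Bunch(
    data=df[feature_columns].to_numpy(dtype=float),
    target=codes,
    feature_names=feature_columns,
    target_names=names.to_numpy(),
)

print(example_sports.data)
print(example_sports.target)
print(example_sports.target_names)
```

The extra feature_names and target_names entries mirror keys that the built-in iris Bunch also carries; they are optional here.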
Try using the type() function whenever you are stuck. In this case it shows you that the object is a Bunch. You can then look up the documentation of that class on the web to understand how to use it.
The following will help you.
import numpy as np
import pandas as pd
from sklearn.utils import Bunch
b = Bunch(a=1, b="textt", c=pd.Series(np.arange(5)), d=np.asarray([0, 8, 9]))
b.c
I have imported values into python from a PostgreSQL DB.
data = cur.fetchall()
The list looks like this:
[('Ending Crowds', 85, Decimal('50.49')), ('Salute Apollo', 73, Decimal('319.93'))][0]
I need to pass 85 as X and Decimal('50.49') as Y to a LinearRegression model.
Then I imported the packages and the class:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
I provide the data and perform linear regression:
X = data.iloc[:, 1].values.reshape(-1, 1)
Y = data.iloc[:, 2].values.reshape(-1, 1)
linear_regressor = LinearRegression() # create object for the class
linear_regressor.fit(X, Y) # perform linear regression
I am getting the error:
AttributeError: 'list' object has no attribute 'iloc'
I am a beginner in Python and started just 2 days ago, but I need to do linear regression in Python for a project at my job. I think iloc can't be used on a list object, but I am not able to figure out how to pass the X and Y values to linear_regressor. All the examples of linear regression I find online use .CSV files. Please help me out.
No, you can't use .iloc on a list; it is for DataFrames.
Convert the list into a DataFrame and then use .iloc.
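A minimal sketch of that conversion, keeping the .iloc calls from the question. It assumes pandas and scikit-learn are installed; the rows are the two shown in the question, and the column names are placeholders chosen for illustration.

```python
from decimal import Decimal

import pandas as pd
from sklearn.linear_model import LinearRegression

# Rows shaped like the cur.fetchall() result shown in the question
rows = [
    ("Ending Crowds", 85, Decimal("50.49")),
    ("Salute Apollo", 73, Decimal("319.93")),
]

# Placeholder column names; a list of tuples maps directly onto DataFrame rows
data = pd.DataFrame(rows, columns=["title", "x", "y"])

# .iloc now works; dtype=float turns the Decimal objects into plain floats
X = data.iloc[:, 1].to_numpy(dtype=float).reshape(-1, 1)
Y = data.iloc[:, 2].to_numpy(dtype=float).reshape(-1, 1)

linear_regressor = LinearRegression()
linear_regressor.fit(X, Y)
print(linear_regressor.coef_, linear_regressor.intercept_)
```

Converting to float explicitly avoids passing an object-typed column of Decimal values to the estimator.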
Your solution is below. Please accept it if it is correct, as it's my first answer on Stack Overflow.
import pandas as pd
from decimal import Decimal
from sklearn.linear_model import LinearRegression
# The trailing "[0]" in your list would select only the first row, so it is dropped here
temp = [('Ending Crowds', 85, Decimal('50.49')), ('Salute Apollo', 73, Decimal('319.93'))]
# Build the DataFrame directly from the list of tuples
data = pd.DataFrame(temp, columns=["no_use", "X", "Y"])
# Convert to float so the Decimal objects become plain numbers
X = data['X'].astype(float).values.reshape(-1, 1)
Y = data['Y'].astype(float).values.reshape(-1, 1)
print(X, Y)
linear_regressor = LinearRegression()  # create object for the class
linear_regressor.fit(X, Y)  # perform linear regression
I am trying to run a speed comparison test of Python vs R and am struggling with one issue: LinearRegression in sklearn with categorical variables.
Code R:
# Start the clock!
ptm <- proc.time()
ptm
test_data = read.csv("clean_hold.out.csv")
# Regression Model
model_liner = lm(test_data$HH_F ~ ., data = test_data)
# Stop the clock
new_ptm <- proc.time() - ptm
Code Python:
import pandas as pd
import time
from sklearn.linear_model import LinearRegression
from sklearn.feature_extraction import DictVectorizer
start = time.time()
test_data = pd.read_csv("./clean_hold.out.csv")
x_train = [col for col in test_data.columns[1:] if col != 'HH_F']
y_train = ['HH_F']
model_linear = LinearRegression(normalize=False)
model_linear.fit(test_data[x_train], test_data[y_train])
but it does not work for me:
return X.astype(np.float32 if X.dtype == np.int32 else np.float64)
ValueError: could not convert string to float: Bee True
I tried another approach:
test_data = pd.read_csv("./clean_hold.out.csv").to_dict()
v = DictVectorizer(sparse=False)
X = v.fit_transform(test_data)
However, I got another error:
File
"C:\Anaconda32\lib\site-packages\sklearn\feature_extraction\dict_vectorizer.py",
line 258, in transform
Xa[i, vocab[f]] = dtype(v) TypeError: float() argument must be a string or a number
I don't understand how to resolve these issues in Python ...
Example of data:
http://screencast.com/t/hYyyu7nU9hQm
I had to do some encoding before calling fit.
There are several classes that can be used:
LabelEncoder: turns each distinct string into an incremental integer value
OneHotEncoder: uses one-of-K encoding to turn each distinct string into its own binary (0/1) column
I wanted a scalable solution but didn't get any answer. I selected OneHotEncoder, which binarizes all the strings. It is quite effective, but if you have many different strings the matrix grows very quickly and needs a lot of memory.
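To make the difference between the two encoders concrete, here is a small sketch on a toy column. It assumes scikit-learn and pandas are installed; the column name and values are hypothetical stand-ins for a string column of the real CSV.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical string column standing in for the categorical data
df = pd.DataFrame({"insect": ["Bee", "Wasp", "Bee", "Ant"]})

# LabelEncoder: one integer per distinct string (classes are sorted)
labels = LabelEncoder().fit_transform(df["insect"])

# OneHotEncoder: one 0/1 column per distinct string.
# The result is a sparse matrix by default; toarray() is only for display.
encoder = OneHotEncoder()
one_hot_sparse = encoder.fit_transform(df[["insect"]])
one_hot = one_hot_sparse.toarray()

print(labels)
print(encoder.categories_[0])
print(one_hot)
```

Keeping the encoder output sparse, rather than converting it to a dense array, is one way to limit the memory growth mentioned above when there are many distinct strings. Integer codes from LabelEncoder imply an ordering between categories, which a linear model will treat as meaningful, so one-hot columns are usually the safer choice for features.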