Feature Selection

Feature Selection - python

I tried to do recursive feature selection in scikit learn with following code.
from sklearn import datasets, svm
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.feature_selection import RFE
import numpy as np
input_file_iris = "/home/anuradha/Project/NSL_KDD_master/Modified/iris.csv"
dataset = np.loadtxt(input_file_iris, delimiter=",")
X = dataset[:,0:4]
y = dataset[:,4]
estimator= svm.OneClassSVM(nu=0.1, kernel="rbf", gamma=0.1)
selector = RFE(estimator,3, step=1)
selector = selector.fit(X,y)
But it gives following error
Traceback (most recent call last):
File "/home/anuradha/PycharmProjects/LearnPython/Scikit-learn/univariate.py", line 30, in <module>
File "/usr/local/lib/python2.7/dist-packages/sklearn/feature_selection/rfe.py", line 131, in fit
return self._fit(X, y)
File "/usr/local/lib/python2.7/dist-packages/sklearn/feature_selection/rfe.py", line 182, in _fit
raise RuntimeError('The classifier does not expose '
RuntimeError: The classifier does not expose "coef_" or
"feature_importances_" attributes
Please some one can help me to solve this or guide me to another solution

Change your kernel to linear and your code would work.
Besides, svm.OneClassSVM is used for unsupervised outlier detection. Are you sure that you want to use it as estimator? Or perhaps you want to use svm.SVC(). Look the following link for documentation.
http://scikit-learn.org/stable/modules/generated/sklearn.svm.OneClassSVM.html
Lastly, iris data set is already available in sklearn. You have imported the sklearn.datasets. So you can simply load iris as:
iris = datasets.load_iris()
X = iris.data
y = iris.target

Related

Scikit-learn - What am I predicting?

My aim is to predict between five and six numbers in an array, based on csv data with six columns. The below script is supposed to predict only one number, from an array of 5. I assumed I could work my way up to the entire 5 or 6 from there, but I might be wrong about that.
Mre:
import csv
import numpy as np
import pandas as pd
from math import sqrt
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.preprocessing import StandardScaler
df = pd.read_csv('subdata.csv')
ft = [9,8,15,4,6]
fintest = np.array(ft)
def train():
df.astype(np.float64)
df.drop(['One'], axis = 1)
X = df
y = X['One']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123)
scaler = StandardScaler()
train_scaled = scaler.fit_transform(X_train)
test_scaled = scaler.transform(X_test)
tree_model = DecisionTreeRegressor()
rf_model = RandomForestRegressor()
tree_model.fit(train_scaled, y_train)
rf_model.fit(train_scaled, y_train)
rfp = rf_model.predict(fintest.reshape(1, -1))
tmp = tree_model.predict(fintest.reshape(1, -1))
print(rfp)
print(tmp)
train()
Could you please clarify, what I am asking this script to predict in the final rfp and tmp lines?
My data looks like this:
The script as is currently gives an error:
Traceback (most recent call last):
File "C:\Users\conra\Desktop\Code\lotto\pie.py", line 43, in <module>
train()
File "C:\Users\conra\Desktop\Code\lotto\pie.py", line 37, in train
rfp = rf_model.predict(fintest.reshape(1, -1))
File "C:\Users\conra\AppData\Local\Programs\Python\Python39\lib\site-packages\sklearn\ensemble\_forest.py", line 784, in predict
X = self._validate_X_predict(X)
File "C:\Users\conra\AppData\Local\Programs\Python\Python39\lib\site-packages\sklearn\ensemble\_forest.py", line 422, in _validate_X_predict
return self.estimators_[0]._validate_X_predict(X, check_input=True)
File "C:\Users\conra\AppData\Local\Programs\Python\Python39\lib\site-packages\sklearn\tree\_classes.py", line 402, in _validate_X_predict
X = self._validate_data(X, dtype=DTYPE, accept_sparse="csr",
File "C:\Users\conra\AppData\Local\Programs\Python\Python39\lib\site-packages\sklearn\base.py", line 437, in _validate_data
self._check_n_features(X, reset=reset)
File "C:\Users\conra\AppData\Local\Programs\Python\Python39\lib\site-packages\sklearn\base.py", line 365, in _check_n_features
raise ValueError(
ValueError: X has 5 features, but DecisionTreeRegressor is expecting 6 features as input.
By adding a sixth digit to the ft array I can get around this error and receive wildly inaccurate outputs, that appear to have no correlation with the data whatsoever. For example, by setting variable ft to [9,8,15,4,6,2] which is the first row in the csv file, and setting X and y to use the 'Four' label; I get an output of [37.22] and [37.].
My other questions will probably be answered by my first. But here they are:
Could you also please clarify why I need to pass an array of 6?
And why are my predictions so close together (all ~35), no matter what array I pass for the prediction?

The way you defined your X is wrong. It is containing 6 features.
Your y is contained in your X in the way you defined it :
X = df #6 features
y = X['One'] #1 feature
I think what you wanted to do was something like this :
X = df[['Two', 'Three', 'Four', 'Five', 'Zero']]
y = df['One']

It depends on your data, and like I saw your data is an example without context, so actually you are trying to train your data to predict the 'One' column using two different models, that doesn't make sense to me.
The error is because you give to X the dataframe without column 'One' and after you are asking for the column 'One' to variable Y, Y=X['One'].

RandomForest score method ValueError

I am trying to find the score of a given data set with respect to some training data. I have written the following code:
from sklearn.ensemble import RandomForestClassifier
import numpy as np
randomForest = RandomForestClassifier(n_estimators = 200)
li_train1 = [[1,2,3,4,5,6,7,8,9],[1,2,3,4,5,6,7,8,9]]
li_train2 = [[1,2,3,4,5,6,7,8,9],[1,2,3,4,5,6,7,8,9]]
li_text1 = [[10,20,30,40,50,60,70,80,90], [10,20,30,40,50,60,70,80,90]]
li_text2 = [[1,2,3,4,5,6,7,8,9],[1,2,3,4,5,6,7,8,9]]
randomForest.fit(li_train1, li_train2)
output = randomForest.score(li_train1, li_text1)
On compiling and trying to run the program I am getting the error:
Traceback (most recent call last):
File "trial.py", line 16, in <module>
output = randomForest.score(li_train1, li_text1)
File "/usr/local/lib/python2.7/dist-packages/sklearn/base.py", line 349, in score
return accuracy_score(y, self.predict(X), sample_weight=sample_weight)
File "/usr/local/lib/python2.7/dist-packages/sklearn/metrics/classification.py", line 172, in accuracy_score
y_type, y_true, y_pred = _check_targets(y_true, y_pred)
File "/usr/local/lib/python2.7/dist-packages/sklearn/metrics/classification.py", line 89, in _check_targets
raise ValueError("{0} is not supported".format(y_type))
ValueError: multiclass-multioutput is not supported
On checking the documentation related to the score method it says:
score(X, y, sample_weight=None)
X : array-like, shape = (n_samples, n_features)
Test samples.
y : array-like, shape = (n_samples) or (n_samples, n_outputs)
True labels for X.
Both X and y in my case are arrays, 2d arrays.
I also went through this question but I couldn't understand where am I going wrong.
EDIT
So as per the answer and the comments that follow, I have edited the program as follows:
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
import numpy as np
randomForest = RandomForestClassifier(n_estimators = 200)
mlb = MultiLabelBinarizer()
li_train1 = [[1,2,3,4,5,6,7,8,9],[1,2,3,4,5,6,7,8,9]]
li_train2 = [[1,2,3,4,5,6,7,8,9],[1,2,3,4,5,6,7,8,9]]
li_text1 = [100,200]
li_text2 = [[1,2,3,4,5,6,7,8,9],[1,2,3,4,5,6,7,8,9]]
randomForest.fit(li_train1, li_train2)
output = randomForest.score(li_train1, li_text1)
After this edit I am getting the error:
Traceback (most recent call last):
File "trial.py", line 19, in <module>
output = randomForest.score(li_train1, li_text1)
File "/usr/local/lib/python2.7/dist-packages/sklearn/base.py", line 349, in score
return accuracy_score(y, self.predict(X), sample_weight=sample_weight)
File "/usr/local/lib/python2.7/dist-packages/sklearn/metrics/classification.py", line 172, in accuracy_score
y_type, y_true, y_pred = _check_targets(y_true, y_pred)
File "/usr/local/lib/python2.7/dist-packages/sklearn/metrics/classification.py", line 82, in _check_targets
"".format(type_true, type_pred))
ValueError: Can't handle mix of binary and multiclass-multioutput

According to the documentation:
Warning: At present, no metric in sklearn.metrics supports the multioutput-multiclass classification task.
The score method invokes sklearn's accuracy metric but this isn't supported for the multi-class, multi-output classification problem you've defined.
It's not clear from your question if you really intend to solve a multi-class, multi-output problem. If that's not your intention, then you should restructure your input arrays.
If on the other hand you really want to solve this kind of problem, you'll simply need to define your own scoring function.
UPDATE
Since you are not solving a multi-class, multi-label problem you should restructure your data so that it looks something like this:
from sklearn.ensemble import RandomForestClassifier
# training data
X = [
[1,2,3,4,5,6,7,8,9],
[1,2,3,4,5,6,7,8,9]
]
y = [0,1]
# fit the model
randomForest.fit(X,y)
# test data
Xtest = [
[1,2,0,4,5,6,0,8,9],
[1,1,3,1,5,0,7,8,9]
]
ytest = [0,1]
output = randomForest.score(Xtest,ytest)
print(output) # 0.5

Error in using Non Linear SVM in Scikit-Learn

I have a code to try to use Non Linear SVM (RBF kernel).
raw_data1 = open("/Users/prateek/Desktop/Programs/ML/Dataset.csv")
raw_data2 = open("/Users/prateek/Desktop/Programs/ML/Result.csv")
dataset1 = np.loadtxt(raw_data1,delimiter=",")
result1 = np.loadtxt(raw_data2,delimiter=",")
clf = svm.NuSVC(kernel='rbf')
clf.fit(dataset1,result1)
However, when I try to fit, I get the error
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/prateek/Desktop/Programs/ML/lib/python2.7/site-packages/sklearn/svm/base.py", line 193, in fit
fit(X, y, sample_weight, solver_type, kernel, random_seed=seed)
File "/Users/prateek/Desktop/Programs/ML/lib/python2.7/site-packages/sklearn/svm/base.py", line 251, in _dense_fit
max_iter=self.max_iter, random_seed=random_seed)
File "sklearn/svm/libsvm.pyx", line 187, in sklearn.svm.libsvm.fit (sklearn/svm/libsvm.c:2098)
ValueError: specified nu is infeasible
Link for Results.csv
Link for dataset
What is the reason for such an error?

The nu parameter is, as pointed out in the documentation, "An upper bound on the fraction of training errors and a lower bound of the fraction of support vectors".
So, whenever you try to fit your data and this bound cannot be satisfied, optimization problem becomes infeasible. Therefore your error.
As a matter of fact, I looped from 1. to 0.1 (decreasing in decimal units) and still got the error, then just tried with 0.01 and no complaints arose. But of course, you should check the results of fitting your model with that value, check if accuracy is acceptable on predictions.
Update: actually I was curious and splitted your dataset to validate, output was 69% accuracy (also I think your training set might be very little)
Just for reproducibility purposes, here, the quick test I performed:
from sklearn import svm
import numpy as np
from sklearn.cross_validation import train_test_split
from sklearn.metrics import accuracy_score
raw_data1 = open("Dataset.csv")
raw_data2 = open("Result.csv")
dataset1 = np.loadtxt(raw_data1,delimiter=",")
result1 = np.loadtxt(raw_data2,delimiter=",")
clf = svm.NuSVC(kernel='rbf',nu=0.01)
X_train, X_test, y_train, y_test = train_test_split(dataset1,result1, test_size=0.25, random_state=42)
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)
accuracy_score(y_test, y_pred, normalize=True, sample_weight=None)

Errors with Python Classification and Regression Trees

I'm learning how to use decision trees in python. I modified an example to import a csv file instead of using the iris dataset from this site:
http://machinelearningmastery.com/get-your-hands-dirty-with-scikit-learn-now/
Code:
import numpy as np
import urllib
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn import datasets
from sklearn import metrics
# URL for the Pima Indians Diabetes dataset (UCI Machine Learning Repository)
url = "http://goo.gl/j0Rvxq"
# download the file
raw_data = urllib.urlopen(url)
# load the CSV file as a numpy matrix
dataset = np.loadtxt(raw_data, delimiter=",")
#print(dataset.shape)
# separate the data from the target attributes
X = dataset[:,0:7]
y = dataset[:,8]
# fit a CART model to the data
model = DecisionTreeClassifier()
model.fit(dataset.data, dataset.target)
print model
Error:
Traceback (most recent call last):
File "DatasetTest2.py", line 24, in <module>
model.fit(dataset.data, dataset.target)
AttributeError: 'numpy.ndarray' object has no attribute 'target'
I am not sure why this error is occuring. If I use the iris data set from the example it works just fine. Eventually, I need to be able to perform decision trees on csv files.
I've also tried the following code that also results in the same error:
# Import Python Modules
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn import datasets
from sklearn import metrics
import pandas as pd
import numpy as np
#Import Data
raw_data = pd.read_csv("DataTest1.csv")
dataset = raw_data.as_matrix()
#print dataset.shape
#print dataset
# separate the data from the target attributes
X = dataset[:,[2,3,4,7,10]]
y = dataset[:,[1]]
#print X
# fit a CART model to the data
model = DecisionTreeClassifier()
model.fit(dataset.data, dataset.target)
print model

The dataset object that is imported in that example is not a plain table of data. It is a special object that is set up with attributes like data and target so that it can be used as shown in the example. If you have your own data, you will need to decide what to use as data and target. From your example, it looks like you want to do model.fit(X, y).

How avoid error "TypeError: invalid data type for einsum" in Python

I try to load CSV file to numpy-array and use the array in LogisticRegression etc. Now, I am struggling with error is shown below:
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
dataset = pd.read_csv('../Bookie_test.csv').values
X = dataset[1:, 32:34]
y = dataset[1:, 14]
# normalize the data attributes
normalized_X = preprocessing.normalize(X)
# standardize the data attributes
standardized_X = preprocessing.scale(X)
model = LogisticRegression()
model.fit(X, y)
print(model)
# make predictions
expected = y
predicted = model.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))
I got an error:
> C:\Anaconda32\lib\site-packages\sklearn\utils\validation.py:332:
> UserWarning: The normalize function assumes floating point values as
> input, got object "got %s" % (estimator, X.dtype)) Traceback (most
> recent call last): File
> "X:/test3.py", line 23, in
> <module>
> normalized_X = preprocessing.normalize(X) File "C:\Anaconda32\lib\site-packages\sklearn\preprocessing\data.py", line
> 553, in normalize
> norms = row_norms(X) File "C:\Anaconda32\lib\site-packages\sklearn\utils\extmath.py", line 65,
> in row_norms
> norms = np.einsum('ij,ij->i', X, X) TypeError: invalid data type for einsum
I am new in Python and don't like transformation:
Load CSV to Pandas
Convert Pandas to NumPy
Use NumPy in LogisticRegression
Are there any simple approach, like:
Load to Pandas
Use Pandas Dataframes in ML methods?

Regarding the main question, thanks to Evert for advises I will check.
Regarding #2: I found great tutorial http://www.markhneedham.com/blog/2013/11/09/python-making-scikit-learn-and-pandas-play-nice/
and achieved desired result with pandas + sklearn

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Feature Selection - python

Related

Scikit-learn - What am I predicting?

RandomForest score method ValueError

Error in using Non Linear SVM in Scikit-Learn

Errors with Python Classification and Regression Trees

How avoid error "TypeError: invalid data type for einsum" in Python

Categories

Resources