Create train and test variables from loaded arff file - python

I want to perform multilabel classification. I have a dataset in ARFF format which I load, but I don't know how to convert the imported data into X and y vectors so that I can apply sklearn's train_test_split.
How can I get X and y?
import scipy.io.arff
import pandas as pd
from sklearn.model_selection import train_test_split

data, meta = scipy.io.arff.loadarff('../yeast-train.arff')
df = pd.DataFrame(data)
#Get X, y
X, y = ??? <---
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

OK. It's multilabel data in which the features are in the columns Att1, Att2, Att3, ..., Att20 and the targets are in the columns Class1, Class2, ..., Class14.
So you need to use those columns to get X and y. Do it like this:
# Fill the .... with all other column names
feature_cols = ['Att1', 'Att2', 'Att3', 'Att4', 'Att5' .... 'Att20']
target_cols = ['Class1', 'Class2', 'Class3', 'Class4', .... 'Class14']
X, y = df[feature_cols], df[target_cols]
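If you'd rather not write all 34 column names out by hand, here is a small sketch (assuming the columns really are named with the Att/Class prefixes shown above) that builds the lists from df.columns:
# Build the column lists programmatically instead of typing them out
feature_cols = [c for c in df.columns if c.startswith('Att')]
target_cols = [c for c in df.columns if c.startswith('Class')]
X, y = df[feature_cols], df[target_cols]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)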


Iterative split of multilabel classification dataset in pandas dataframe

I have a dataset that contains a text column with string values and multiple columns with the value 1 or 0 (classified or not). I want to use skmultilearn to split this data with an even label distribution, but I get this error:
KeyError: 'key of type tuple not found and not a MultiIndex'
And this is my code:
import pandas as pd
from skmultilearn.model_selection import iterative_train_test_split
y = pd.read_csv("dataset.csv")
x = y.pop("text")
x_train, x_test, y_train, y_test = iterative_train_test_split(x, y, test_size=0.1)
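A likely cause (this is an assumption, since only the error message is shown): skmultilearn's iterative_train_test_split indexes its inputs as matrices, so passing a pandas DataFrame/Series triggers the tuple-key error. A minimal sketch of that fix, converting to NumPy first:
# Sketch: convert pandas objects to 2-D NumPy arrays before the split;
# note that skmultilearn returns (X_train, y_train, X_test, y_test).
x_np = x.to_numpy().reshape(-1, 1)  # text column as a 2-D matrix
y_np = y.to_numpy()
x_train, y_train, x_test, y_test = iterative_train_test_split(x_np, y_np, test_size=0.1)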
Here is what worked for me (this is a 98/1/1 split):
import os
import pandas as pd
from iterstrat.ml_stratifiers import MultilabelStratifiedShuffleSplit


def main():
    # load dataset
    y = pd.read_csv("dataset.csv")
    x = y.pop("text")

    # save tag names to reuse them later for creating pandas DataFrames
    tag_names = y.columns

    # Data has to be in ndarray format
    y = y.to_numpy()
    x = x.to_numpy()

    # split to train / test
    msss = MultilabelStratifiedShuffleSplit(n_splits=2, test_size=0.02, random_state=42)
    for train_index, test_index in msss.split(x, y):
        x_train, x_test_temp = x[train_index], x[test_index]
        y_train, y_test_temp = y[train_index], y[test_index]

    # make some memory space
    del x
    del y

    # split to test / validation
    msss = MultilabelStratifiedShuffleSplit(n_splits=2, test_size=0.5, random_state=42)
    for test_index, val_index in msss.split(x_test_temp, y_test_temp):
        x_test, x_val = x_test_temp[test_index], x_test_temp[val_index]
        y_test, y_val = y_test_temp[test_index], y_test_temp[val_index]

    # train dataset
    df_train = pd.DataFrame(data=y_train, columns=tag_names)
    df_train.insert(0, "snippet", x_train)

    # validation dataset
    df_val = pd.DataFrame(data=y_val, columns=tag_names)
    df_val.insert(0, "snippet", x_val)

    # test dataset
    df_test = pd.DataFrame(data=y_test, columns=tag_names)
    df_test.insert(0, "snippet", x_test)


if __name__ == "__main__":
    main()

Using StandardScaler on specific column in Pipeline and concatenate to original data

I have a dataframe with 4 numeric columns and I am trying to scale only one of them using StandardScaler in a Pipeline. I used the code below to scale and transform my column.
num_feat = ['Quantity']
num_trans = Pipeline([('scale', StandardScaler())])
preprocessor = ColumnTransformer(transformers = ['num', num_trans, num_feat])
pipe = Pipeline([('preproc', preprocessor),
('rf', RandomForestRegressor(random_state = 0))
])
After doing this I am splitting my data and training my model as below.
y = df1['target']
x = df1.drop(['target','ID'], axis = 1)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2)
pipe.fit(x_train, y_train)
This gives me the error ValueError: not enough values to unpack (expected 3, got 1). I understand this could be because of the other 3 numeric columns in my dataframe. So how do I concatenate the scaled data back to the rest of my dataframe and train my model on the whole data? Or is there a better way to do this?
Add parentheses when initialising the transformer, so that ('num', num_trans, num_feat) is passed as a tuple inside the list:
preprocessor = ColumnTransformer(transformers = [('num', num_trans, num_feat)],remainder='passthrough')
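For completeness, a sketch of the corrected setup with the imports it needs (variable and column names taken from the question); remainder='passthrough' keeps the other numeric columns in the data instead of dropping them:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

num_feat = ['Quantity']
num_trans = Pipeline([('scale', StandardScaler())])
preprocessor = ColumnTransformer(transformers=[('num', num_trans, num_feat)],
                                 remainder='passthrough')
pipe = Pipeline([('preproc', preprocessor),
                 ('rf', RandomForestRegressor(random_state=0))])

y = df1['target']
x = df1.drop(['target', 'ID'], axis=1)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
pipe.fit(x_train, y_train)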

Shuffle and split 2 numpy arrays so as to maintain their ordering with respect to each other

I have 2 numpy arrays X and Y, with shape X: [4750, 224, 224, 3] and Y: [4750,1].
X is the training dataset and Y is the correct output label for each entry.
I want to split the data into train and test sets so that I can validate my machine learning model. Therefore, I want to split them randomly in a way that keeps them aligned, i.e. every row of X still has its corresponding label in Y after the split.
How can I achieve this?
This is how I would do it:
import numpy as np

def split(x, y, train_ratio=0.7):
    x_size = x.shape[0]
    train_size = int(x_size * train_ratio)
    test_size = x_size - train_size
    # randomly pick which rows go into the training set
    train_indices = np.random.choice(x_size, size=train_size, replace=False)
    mask = np.zeros(x_size, dtype=bool)
    mask[train_indices] = True
    # the same boolean mask is applied to x and y, so rows stay aligned
    x_train, y_train = x[mask], y[mask]
    x_test, y_test = x[~mask], y[~mask]
    return (x_train, y_train), (x_test, y_test)
I simply choose the required number of indices I need (randomly) for my train set, remaining will be for the test set.
Then use a mask to select the train and test samples.
You can also use scikit-learn's train_test_split to split your data using just 2 lines of code:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33)
sklearn.model_selection.train_test_split is a good choice!
But to craft one of your own:
import numpy as np

def my_train_test_split(X, Y, train_ratio=0.8):
    """return X_train, Y_train, X_test, Y_test"""
    n = X.shape[0]
    split = int(n * train_ratio)
    # shuffle the row indices once, then apply the same order to X and Y
    index = np.arange(n)
    np.random.shuffle(index)
    return X[index[:split]], Y[index[:split]], X[index[split:]], Y[index[split:]]
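A quick sanity check of either helper, sketched here with small random data, is to split and confirm that the shapes (and row/label pairing) line up:
# Sketch: verify that rows and labels stay aligned after the split
X = np.random.rand(10, 4)
Y = np.arange(10).reshape(-1, 1)
X_tr, Y_tr, X_te, Y_te = my_train_test_split(X, Y, train_ratio=0.8)
print(X_tr.shape, Y_tr.shape, X_te.shape, Y_te.shape)  # (8, 4) (8, 1) (2, 4) (2, 1)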

How to operate multidimensional features in SVM or use multidimensional features to train model?

If I have this input:
"a1,b1,c1,d1;A1,B1,C1,D1;α1,β1,γ1,θ1;Label1"
"... ... "
"an,bn,cn,dn;An,Bn,Cn,Dn;αn,βn,γn,θn;Labelx"
Array expression:
[
[[a1,b1,c1,d1],[A1,B1,C1,D1],[α1,β1,γ1,θ1],[Label1]],
... ... ... ...
[[an,bn,cn,dn],[An,Bn,Cn,Dn],[αn,βn,γn,θn],[Labelx]]
]
Instance:
[... ... ... ...
[[58.32,453.65,980.50,540.23],[774.40,428.79,1101.96,719.79],[503.70,624.76,1128.00,1064.26],[1]],
[[0,0,0,0],[871.05,478.17,1109.37,698.36],[868.63,647.56,1189.92,1040.80],[1]],
[[169.34,43.41,324.46,187.96],[50.24,37.84,342.39,515.21],[0,0,0,0],[0]]]
Like this:
There are 3 rectangles, and the label indicates whether they intersect, contain each other, or have some other relation.
I want to use 3 or N such features to train a model with an SVM.
So far I have only studied the "python Iris SVM" example code. What should I do?
This is my attempt:
from sklearn import svm
import numpy as np
import matplotlib as mpl
from sklearn.model_selection import train_test_split

def label_type(s):
    it = {b'Situation_1': 0, b'Situation_2': 1, b'Unknown': 2}
    return it[s]

path = 'C:/Users/SEARECLUSE/Desktop/MNIST_DATASET/temp_test.data'
data = np.loadtxt(path, dtype=list, delimiter=';', converters={3: label_type})
x, y = np.split((data), (3,), axis=1)
x = x[:, :3]
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=1, train_size=0.6)
clf = svm.SVC(C=0.8, kernel='rbf', gamma=20, decision_function_shape='ovr')
clf.fit(x_train, y_train.ravel())
Report Error:
Line: clf.fit(x_train, y_train.ravel())
ValueError: could not convert string to float:
If I try to convert the data:
x, y = np.split(float(data), (3,), axis=1)
Report Error:
Line: x, y = np.split(float(data), (3,), axis=1)
TypeError: only length-1 arrays can be converted to Python scalars
scikit-learn's SVC expects each sample to be a single flat feature vector rather than a nested array of arrays, so I suggest you flatten your input features:
x, y = np.split((data), (3,), axis=1)
x = x[:, :3]
# flatten the features
x = np.reshape(x,(len(x),-1))
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=1, train_size=0.6)
clf = svm.SVC(C=0.8, kernel='rbf', gamma=20, decision_function_shape='ovr')
clf.fit(x_train, y_train.ravel())
I have a few questions before I attempt an answer:
Q1. What kind of data are you using to train the SVM model? Is it image data? If so, is it RGB data? From the way you describe your data it seems you intend to do image classification with an SVM. Correct me if I am wrong.
Assumption
Let's say you have image data. Then convert it to grayscale, and convert the entire dataset into a NumPy array (check the numpy module for how to do that).
Once your data is a NumPy array you can apply your model.
Let me know if that helps.
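If the data really are images, here is a minimal sketch of the grayscale-then-NumPy step described above (the file names are purely hypothetical):
# Sketch: load images, convert to grayscale, stack into a NumPy array,
# and flatten each image into one feature vector per row
# (assumes all images have the same dimensions)
import numpy as np
from PIL import Image

paths = ['rect_001.png', 'rect_002.png']  # hypothetical file names
gray = [np.asarray(Image.open(p).convert('L')) for p in paths]
X = np.stack(gray).reshape(len(gray), -1)  # shape: (n_images, height * width)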

how to drop predictable value (y) in python random forest model

I ran a random forest model in python to look at feature importances. However, the value to be predicted (y) is not actually dropped from the features, so it ends up being used as an input and appears to account for over 98% of the importance.
The code is as below:
import pandas as pd

temp = pd.read_csv('temp_data.csv', sep=',', engine='python')
temp['y'] = temp['temp_actual']
y = temp['y'].values
temp = temp.drop(['y'],axis=1)
#X = temp.loc[:,:]
x= temp.values
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)
Please help me correct the code. Thanks!
In your code you made a copy of the target feature to column y by using the code
temp['y'] = temp['temp_actual']
Then you set y as the values in that column
y = temp['y'].values
You then dropped the column y from the data frame with the following code
temp = temp.drop(['y'],axis=1)
Now if you look at the columns of the dataframe temp, you can see that y is not present but temp_actual still is.
You have to remove that column from the dataframe as well; to do that you can use either of the following:
del temp['temp_actual']
OR
temp = temp.drop(['temp_actual'], axis=1)
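Putting it together, a short sketch of the corrected preparation, using the same file and column names as the question:
import pandas as pd
from sklearn.model_selection import train_test_split

# Sketch: take y from the target column, then drop it from the features
# so it cannot leak into the model as an input
temp = pd.read_csv('temp_data.csv', sep=',', engine='python')
y = temp['temp_actual'].values
x = temp.drop(['temp_actual'], axis=1).values

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)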
