StratifiedShuffleSplit reporting multiple args for n_iter - python

I'm trying to use scikit-learn's StratifiedShuffleSplit to make a single split of my dataset that preserves class sample ratios.
from sklearn.datasets import load_files
from sklearn.model_selection import StratifiedShuffleSplit
dataset = load_files('reviews/aggregated/')
split = StratifiedShuffleSplit(dataset.target, n_iter=1, test_size=0.2)
train_idx, test_idx = next(iter(split))
train_X, train_y = dataset.data[train_idx], dataset.target[train_idx]
test_X, test_y = dataset.data[test_idx], dataset.target[test_idx]
This gives me the below error:
TypeError: __init__() got multiple values for keyword argument 'n_iter'
But I'm clearly only passing a single value for it. Is StratifiedShuffleSplit somehow incompatible with datasets? The docs don't seem to have an answer.

It turns out the documentation was outdated. Looking at the docstrings, I found that in the new sklearn.model_selection API the labels are no longer passed to the constructor; they go to the split() method instead, and the number of re-shuffling iterations is now called n_splits (it was n_iter in the old sklearn.cross_validation module):
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2)
train_idx, test_idx = next(sss.split(dataset.data, dataset.target))
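Putting it together with the question's code, here is a minimal sketch (the path comes from the question; the conversion of dataset.data to a NumPy array is my addition, since load_files returns .data as a plain Python list that can't be indexed with index arrays directly):
import numpy as np
from sklearn.datasets import load_files
from sklearn.model_selection import StratifiedShuffleSplit

dataset = load_files('reviews/aggregated/')
X = np.array(dataset.data, dtype=object)  # .data is a list of byte strings; convert so index arrays work
y = dataset.target

sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(sss.split(X, y))

train_X, train_y = X[train_idx], y[train_idx]
test_X, test_y = X[test_idx], y[test_idx]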

Related

sklearn DecisionTreeClassifier with CountVectorizer and additional predictor

I have built a text classification model with sklearn's DecisionTreeClassifier and would like to add another predictor. My data is in a pandas dataframe with columns labeled 'Impression' (text), 'Volume' (floats), and 'Cancer' (label). I've been using only Impression to predict Cancer but would like to use Impression and Volume to predict Cancer instead.
My code previously that ran without issue:
X_train, X_test, y_train, y_test = train_test_split(data['Impression'], data['Cancer'], test_size=0.2)
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)
dt = DecisionTreeClassifier(class_weight='balanced', max_depth=6, min_samples_leaf=3, max_leaf_nodes=20)
dt.fit(X_train, y_train)
y_pred = dt.predict(X_test)
I've tried a few different ways to add the Volume predictor:
1) Only fit_transform the Impressions
X_train, X_test, y_train, y_test = train_test_split(data[['Impression', 'Volume']], data['Cancer'], test_size=0.2)
vectorizer = CountVectorizer()
X_train['Impression'] = vectorizer.fit_transform(X_train['Impression'])
X_test = vectorizer.transform(X_test)
dt = DecisionTreeClassifier(class_weight='balanced', max_depth=6, min_samples_leaf=3, max_leaf_nodes=20)
dt.fit(X_train, y_train)
y_pred = dt.predict(X_test)
This throws the error
TypeError: float() argument must be a string or a number, not 'csr_matrix'
...
ValueError: setting an array element with a sequence.
2) Call fit_transform on both Impressions and Volumes. Same code as above except for fit_transform line:
X_train = vectorizer.fit_transform(X_train)
This of course throws the error:
ValueError: Number of labels=1800 does not match number of samples=2
...
X_train.shape
(2, 2)
y_train.shape
(1800,)
I'm pretty sure method #1 is the right way to go but I haven't been able to find any tutorials or solutions for how I can add the float predictor to this text classification model.
Any help would be appreciated!
ColumnTransformer() solves exactly this problem. Instead of manually appending the other columns to the output of CountVectorizer, set the remainder param to 'passthrough' in ColumnTransformer.
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
from sklearn import set_config

set_config(print_changed_only=True, display='diagram')

data = pd.DataFrame({'Impression': ['this is the first text',
                                    'second one goes like this',
                                    'third one is very short',
                                    'This is the final statement'],
                     'Volume': [123, 1, 2, 123],
                     'Cancer': [1, 0, 0, 1]})

X_train, X_test, y_train, y_test = train_test_split(
    data[['Impression', 'Volume']], data['Cancer'], test_size=0.5)

# CountVectorizer is applied to the 'Impression' column only;
# remainder='passthrough' keeps 'Volume' as an extra numeric feature
ct = make_column_transformer(
    (CountVectorizer(), 'Impression'), remainder='passthrough')

pipeline = make_pipeline(ct, DecisionTreeClassifier())
pipeline.fit(X_train, y_train)
pipeline.score(X_test, y_test)
Use version 0.23.0 or later to see the visual representation of pipeline objects (the display param in set_config).
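A brief usage note (my addition): once fitted, the same pipeline can be reused for prediction on new data with the same two columns; the ColumnTransformer takes care of vectorizing the text column again:
new_data = pd.DataFrame({'Impression': ['another short report'],
                         'Volume': [50]})
pipeline.predict(new_data)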
You can use hstack to combine the two features.
from scipy.sparse import hstack
import numpy as np

X_train = vectorizer.fit_transform(X_train)
# hstack takes a sequence of blocks; the Volume column must be 2-D and
# aligned with the same rows that went into X_train
X_train_new = hstack((X_train, np.array(data['Volume']).reshape(-1, 1)))
Now your new training matrix contains both features. And if I may advise, use TfidfVectorizer instead of CountVectorizer, since tf-idf weights words by how informative they are in each document/Impression, while CountVectorizer only counts occurrences, so a word like "the" ends up with higher weight than the words that really matter.
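A minimal end-to-end sketch of that suggestion (hedged: the column names come from the question; the row alignment and reshaping details are my assumptions, and both train and test rows must keep their Volume values aligned with the vectorized text):
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_train, X_test, y_train, y_test = train_test_split(
    data[['Impression', 'Volume']], data['Cancer'], test_size=0.2)

vectorizer = TfidfVectorizer()
X_train_text = vectorizer.fit_transform(X_train['Impression'])
X_test_text = vectorizer.transform(X_test['Impression'])

# append Volume as one extra numeric column next to the text features
X_train_all = hstack((X_train_text, X_train['Volume'].to_numpy().reshape(-1, 1)))
X_test_all = hstack((X_test_text, X_test['Volume'].to_numpy().reshape(-1, 1)))

dt = DecisionTreeClassifier(class_weight='balanced', max_depth=6,
                            min_samples_leaf=3, max_leaf_nodes=20)
dt.fit(X_train_all, y_train)
y_pred = dt.predict(X_test_all)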

TypeError: __init__() got an unexpected keyword argument 'cv'

I want to use SVM with LeaveOneOut cross-validation (Loocv). The code is given below:
from sklearn.svm import SVC
from sklearn.model_selection import LeaveOneOut, train_test_split
import numpy as np
import pandas as pd
iRec = 'KSBPSSM_6_DCT_MIXED_49_937_937_1874_SMOTTMK.csv'
D = pd.read_csv(iRec, header=None) # Using pandas
X = D.iloc[:, :-1].values
y = D.iloc[:, -1].values
from sklearn.utils import shuffle
X, y = shuffle(X, y) # Avoiding bias
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75, test_size=0.25)
tpot = SVC(kernel='rbf', C=2.123, gamma=0.0039, cv=LeaveOneOut(), probability=True)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export('tpot_pipeline_' + str(index) + '.py')
When I run the code, I receive the following error:
Traceback (most recent call last):
  File "E:/PhD Folder/PhD research/DNA-binding Proteins literature papers/Effective DNA binding protein prediction by using key features via Chou’s general PseAAC_Code_dataset_10_10_2018/DNA_Binding-master/SVM_jackknife_test.py", line 18, in <module>
    tpot = SVC(kernel='rbf', C=2.123, gamma=0.0039, cv=LeaveOneOut(), probability=True,)
TypeError: __init__() got an unexpected keyword argument 'cv'
Can anybody help me?
First of all, take a look at the SVC documentation and the cross-validation documentation (sklearn).
SVC() does not take a cv parameter; as a matter of fact, estimators themselves don't take cross-validation into account. CV is used to check performance and prevent overfitting.
The example used in the cross-validation documentation is actually built around SVC.
In your case you could use cross_val_score as follows:
from sklearn.model_selection import cross_val_score

tpot = SVC(kernel='rbf', C=2.123, gamma=0.0039, probability=True)
scores = cross_val_score(tpot, X_test, y_test, cv=LeaveOneOut())
print(scores.mean())
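For a fully self-contained illustration of the same idea, here is a sketch with a synthetic dataset (my addition, since the original CSV isn't available; LeaveOneOut fits one model per sample, so keep the sample count small):
from sklearn.datasets import make_classification
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.svm import SVC

# small synthetic stand-in for the real data
X, y = make_classification(n_samples=60, n_features=10, random_state=0)

clf = SVC(kernel='rbf', C=2.123, gamma=0.0039)
scores = cross_val_score(clf, X, y, cv=LeaveOneOut())  # 60 fits, one held-out sample each
print(scores.mean())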

TypeError: train_test_split() got an unexpected keyword argument 'test_size'

I'm trying to find the best feature set using a random forest approach.
I need to split the dataset into train and test sets; here is my code:
from sklearn.model_selection import train_test_split

def train_test_split(x, y):
    # split data train 70 % and test 30 %
    x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.3, random_state=42)
    # normalization
    x_train_N = (x_train - x_train.mean()) / (x_train.max() - x_train.min())
    x_test_N = (x_test - x_test.mean()) / (x_test.max() - x_test.min())

train_test_split(data, data_y)
The parameters data, data_y are being passed correctly.
But I'm getting the error shown in the title, and I couldn't figure out why.
You are using the same function name in your code as the one imported from sklearn.model_selection; changing your function name will do the job.
Something like this:
from sklearn.model_selection import train_test_split

def my_train_test_split(x, y):
    # split data train 70 % and test 30 %
    x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.3, random_state=42)
    # normalization
    x_train_N = (x_train - x_train.mean()) / (x_train.max() - x_train.min())
    x_test_N = (x_test - x_test.mean()) / (x_test.max() - x_test.min())

my_train_test_split(data, data_y)
Explanation: Python does not pick between same-named functions based on argument types (there is no real overloading); a def statement simply rebinds the name, so your train_test_split replaces the imported one, and the call inside the function ends up calling your own two-argument function, which doesn't accept those keyword arguments. Renaming your function is the simplest fix.
Another solution is to import sklearn's train_test_split under a different name, which resolves the conflict between your function and the imported one:
from sklearn.model_selection import train_test_split as sklearn_train_test_split
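A minimal sketch of that variant (my addition; with the alias in place you can even keep your original function name, because the call inside the function now goes through the alias):
from sklearn.model_selection import train_test_split as sklearn_train_test_split

def train_test_split(x, y):
    # split, then normalize; the alias guarantees we call sklearn's function here
    x_train, x_test, y_train, y_test = sklearn_train_test_split(x, y, train_size=0.3, random_state=42)
    x_train_N = (x_train - x_train.mean()) / (x_train.max() - x_train.min())
    x_test_N = (x_test - x_test.mean()) / (x_test.max() - x_test.min())
    return x_train_N, x_test_N, y_train, y_test

x_train_N, x_test_N, y_train, y_test = train_test_split(data, data_y)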

Error code with Preprocessor Scaling?

I'm using KNN and wanted to experiment with different normalizers (Normalizer(), MinMaxScaler(), StandardScaler(), etc.).
I have loaded the data into a variable called X:
X = pd.read_csv('C:/Users/rmahesh/documents/parkinson.csv')
After doing some data wrangling, I try to run this code:
from sklearn import preprocessing
from sklearn.decomposition import PCA
T = preprocessing.Normalizer().fit(X)
from sklearn.cross_validation import train_test_split
T_train, T_test, y_train, y_test = train_test_split(T, y, test_size = 0.3, random_state = 7)
from sklearn.svm import SVC
model = SVC()
model = model.fit(T_train, y_train)
score = model.score(T_test, y_test)
print(score)
The specific error code I am getting is this:
TypeError: Singleton array array(Normalizer(copy=True, norm='l2'), dtype=object) cannot be considered a valid collection.
The line where the error appears is:
T_train, T_test, y_train, y_test = train_test_split(T, y, test_size = 0.3, random_state = 7)
Any help would be greatly appreciated!
You're fitting your normalizer and then treating the fitted object itself as if it were the transformed data. Replace
T = preprocessing.Normalizer().fit(X)
with
T = preprocessing.Normalizer().fit_transform(X)
so that the actual output of the normalization is used instead; .fit() only returns the Normalizer object itself.
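As a side note (my addition, a hedged sketch assuming X and y are the feature matrix and labels from your wrangling step): the more conventional pattern is to split first and fit the scaler on the training portion only, so no information from the test rows leaks into the scaling; a Pipeline does this bookkeeping for you:
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer
from sklearn.svm import SVC

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=7)

# the pipeline fits Normalizer on the training data and re-applies it at predict time
model = make_pipeline(Normalizer(), SVC())
model.fit(X_train, y_train)
print(model.score(X_test, y_test))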

Specifying tree_method param for XGBoost in Python

I'm working on a predictive model using XGBoost (latest version on PyPI: 0.6) in Python, and have been developing it by training on about half of my data. Now that I have my final model, I trained it on all my data, but got this message, which I've never seen before:
Tree method is automatically selected to be 'approx' for faster speed. to use old behavior(exact greedy algorithm on single machine), set tree_method to 'exact'
As a reproducible example, the following code also produces that message on my machine:
import numpy as np
import xgboost as xgb
rows = 10**7
cols = 20
X = np.random.randint(0, 100, (rows, cols))
y = np.random.randint(0,2, size=rows)
clf = xgb.XGBClassifier(max_depth=5)
clf.fit(X,y)
I've tried setting tree_method to 'exact' in both the initialization and fit() steps of my model, but each throws errors:
import xgboost as xgb
clf = xgb.XGBClassifier(tree_method = 'exact')
clf
> __init__() got an unexpected keyword argument 'tree_method'
my_pipeline.fit(X_train, Y_train, clf__tree_method='exact')
> self._final_estimator.fit(Xt, y, **fit_params) TypeError: fit() got an
> unexpected keyword argument 'tree_method'
How can I specify tree_method='exact' with XGBoost in Python?
According to the XGBoost parameter documentation, this is because the default for tree_method is "auto". The "auto" setting is data-dependent: for "small-to-medium" data, it will use the "exact" approach and for "very-large" datasets, it will use "approximate". When you started to use your whole training set (instead of 50%), you must have crossed the training-size threshold that changes the auto-value for tree_method. It's unclear from the docs how many observations are required to reach that threshold, but it seems that it's somewhere between 5 and 10 million rows (since you have rows = 10**7).
I don't know if the tree_method argument is exposed in the XGBoost Python module (it sounds like it's not, so maybe file a bug report?), but tree_method is exposed in the R API.
The docs describe why you see that warning message: tree_method is still not exposed in the scikit-learn API for xgboost.
Hence the code example below uses the native xgb.train API, where tree_method can be set through the params dict.
import xgboost as xgb
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits(n_class=2)
X = digits['data']
y = digits['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

dtrain = xgb.DMatrix(X_train, y_train)
dtest = xgb.DMatrix(X_test, y_test)

# tree_method is set through the params dict of the native API
param = {'objective': 'binary:logistic',
         'tree_method': 'hist',
         'grow_policy': 'lossguide',
         'eval_metric': 'auc'}

res = {}
bst = xgb.train(param, dtrain, 10, [(dtrain, 'train'), (dtest, 'test')], evals_result=res)
You can use the GPU from the sklearn API in XGBoost, like this:
import xgboost
xgb = xgboost.XGBClassifier(n_estimators=200, tree_method='gpu_hist', predictor='gpu_predictor')
xgb.fit(X_train, y_train)
You can use different tree methods; refer to the documentation to choose the most appropriate method for your needs.
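For example, a hedged sketch of a CPU-only variant (assuming a reasonably recent xgboost release, where the scikit-learn wrapper accepts tree_method directly; X_train and y_train as above):
import xgboost
# 'hist' is the CPU histogram-based method; 'exact' and 'approx' are also accepted
clf = xgboost.XGBClassifier(n_estimators=200, tree_method='hist')
clf.fit(X_train, y_train)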
