I am using the python interface for XGBoost for building models. I have a dataset that I am reading in using xgb.DMatrix(data_path). I need to split this data into train and test (and validation, if required). But most of the implementations I have seen are of the form
dtrain = xgb.DMatrix('')
dtest = xgb.DMatrix('')
I couldn't find a way to read in the dataset once and then split it into train, test (and validation) sets.
Furthermore, is it possible to perform stratified sampling while splitting into train and test?
I need to know this because I have slightly larger datasets and currently I am reading it in using spark, splitting them up, storing on disk and then reading from there. Is there a way I can do it without having to go through Pyspark and reading from the hdfs?
I would use sklearn's train_test_split, which also has a stratify parameter, and then put the results into dtrain and dtest.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
import xgboost as xgb
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)
See implementation here: A Simple XGBoost Tutorial Using the Iris Dataset.
You can always read in data from HDF5 files using pandas (see pandas.HDFStore), then do the splitting using sklearn (either a simple random or a stratified train/test split; see the stratify parameter of train_test_split). You can then feed the pandas DataFrame objects directly into the sklearn API of xgboost, or convert them into xgboost.DMatrix and use those in the native training API.
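If a validation set is needed as well, one option is to call train_test_split twice. The following is only a minimal sketch: the synthetic data from make_classification stands in for the real dataset (an assumption), and the 60/20/20 proportions are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the real dataset (assumption).
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.8, 0.2], random_state=42)

# First cut: hold out 20% of the rows as the test set, stratified on the label.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Second cut: 25% of the remaining 80% becomes validation (20% of the total).
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42, stratify=y_rest)

print(len(X_train), len(X_val), len(X_test))
# Share of the positive class in each part; stratification keeps them close.
print(y_train.mean(), y_val.mean(), y_test.mean())
```

Each of the three parts can then be wrapped with xgb.DMatrix(X_part, label=y_part) in the same way as dtrain and dtest above.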
Related
I am trying to understand how scikit-learn pipelines work. I have some dummy data and I am trying to fit a Random Forest model to iris data. Here is some code
from sklearn import datasets
from sklearn import svm
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
import sklearn.externals
import joblib
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
# Load the Iris dataset
iris = datasets.load_iris()
Divide data into train and test and create a pipeline with 2 steps
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target,random_state=0)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
pipeline = Pipeline([('feature_selection', SelectKBest(chi2, k=2)), ('classification', RandomForestClassifier()) ])
print(type(pipeline))
(112, 4) (38, 4) (112,) (38,)
<class 'sklearn.pipeline.Pipeline'>
But when i execute pipeline.fit_transform(X_train, y_train) , I get an error saying AttributeError: 'RandomForestClassifier' object has no attribute 'transform'
However, pipeline.fit(X_train, y_train) works fine.
In a normal case scenario, without any pipeline code, what i have usually done is taken a ML model and applied fit_transform() on my training dataset and transform on my unseen dataset for generating predictions.
How do I set something similar using pipelines in sklearn. I want to SAVE my pipeline and then perform scoring by LOADING it back? Can I do it using pickle?
Another thing is regarding the RF model itself. I can get model summary using RF model methods but I dont see any methods in my pipeline where I can print model summary using pipeline.
But when i execute pipeline.fit_transform(X_train, y_train), I get an error saying AttributeError: 'RandomForestClassifier' object has no attribute 'transform'
Indeed, RandomForestClassifier does not transform data because it is a model, not a transformer. Pipelines implement either transform or predict (and its variants) depending on whether the last estimator is a transformer or a model.
So, generally, you'll want to call just pipeline.fit(X_train, y_train), then in testing or production you'll call pipeline.predict(X_test) (or predict_proba, or ...), which internally will transform with the first step(s) and predict with the last step.
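As a rough, self-contained sketch of that workflow, using the same steps as in the question (the random_state on the classifier is my addition, only there to make the run repeatable):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0)

pipeline = Pipeline([
    ("feature_selection", SelectKBest(chi2, k=2)),
    ("classification", RandomForestClassifier(random_state=0)),
])

# fit() fits the selector, transforms X_train, then fits the classifier.
pipeline.fit(X_train, y_train)

# predict() and friends transform X_test with the fitted selector first.
y_pred = pipeline.predict(X_test)
y_proba = pipeline.predict_proba(X_test)
accuracy = pipeline.score(X_test, y_test)
print(y_pred.shape, y_proba.shape, accuracy)
```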
How do I set something similar using pipelines in sklearn. I want to SAVE my pipeline and then perform scoring by LOADING it back? Can I do it using pickle?
Yes; see sklearn Model Persistence for more details and recommendations.
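A minimal sketch of the save/load round trip with pickle, assuming the pipeline from the question; the file name is hypothetical and the file is written to a temporary directory. The usual caveats apply: load the file with the same scikit-learn version that wrote it, and never unpickle data you don't trust.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0)

pipeline = Pipeline([
    ("feature_selection", SelectKBest(chi2, k=2)),
    ("classification", RandomForestClassifier(random_state=0)),
]).fit(X_train, y_train)

# Hypothetical file name inside a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "pipeline.pkl")

# SAVE the whole fitted pipeline (selector and classifier together).
with open(path, "wb") as f:
    pickle.dump(pipeline, f)

# LOAD it back and score with the restored object.
with open(path, "rb") as f:
    restored = pickle.load(f)

same = np.array_equal(pipeline.predict(X_test), restored.predict(X_test))
print(same)
```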
Another thing is regarding the RF model itself. I can get model summary using RF model methods but I dont see any methods in my pipeline where I can print model summary using pipeline.
You can access individual steps of a pipeline in a few ways (see "Accessing steps" in the sklearn Pipeline documentation):
pipeline.named_steps.classification
pipeline['classification']
pipeline[-1]
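For instance, once a step has been pulled out, its usual attributes are available. This is just a sketch built on the pipeline from the question; which attributes count as a "model summary" is up to you.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0)

pipeline = Pipeline([
    ("feature_selection", SelectKBest(chi2, k=2)),
    ("classification", RandomForestClassifier(random_state=0)),
]).fit(X_train, y_train)

# Pull the fitted steps out of the pipeline by name.
clf = pipeline.named_steps["classification"]
selector = pipeline["feature_selection"]

# Boolean mask over the 4 input features: True for the 2 that were kept.
selected = selector.get_support()
# One importance value per feature that reached the forest.
importances = clf.feature_importances_

print(selected)
print(importances)
print(clf.get_params())
```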
I have a numpy array of data generated over ongoing time by a simulation. Based on this, I'm using tensorflow and keras to train a neural network, and my question refers to this line of code in my model:
model.fit(X1, Y1, epochs=1000, batch_size=100, verbose=1, shuffle=True, validation_split=0.2)
After reading the Keras documentation, I found that the validation set (in this case 20% of the original data) is sliced from the end of the data. Since I'm generating data over ongoing time, I obviously don't want the last part to be sliced off, because it would not be representative for validation. I'd rather have the validation data chosen randomly from the whole data set. For this purpose, I am currently shuffling my whole data set (inputs and outputs for the ANN in unison) before training to get random validation data.
I feel like I don't want to ruin the time component in my data which is why I'm searching for a solution to just choose the validation set randomly without having to shuffle the whole data set. Also, I'd like to know what you guys think of not shuffling time continuous data. Again, I'm not asking about the nature of the validation split, I just want to know how to modify the manner of how the validation data is selected.
As you mentioned, Keras simply takes the last x samples of the dataset, so if you want to keep using it, you need to shuffle your dataset in advance.
Or, you can simply use the sklearn train_test_split() method:
x_train, x_valid, y_train, y_valid = sklearn.model_selection.train_test_split(x, y, test_size=0.2)
This method has an argument named "shuffle", which determines whether to shuffle the data before the split (it is set to True by default).
However, if y holds class labels, a better split can be obtained with the "stratify" argument, which gives the validation and training sets a similar distribution of labels:
x_train, x_test, y_train, y_test = train_test_split(x, y,
test_size=0.2,
random_state=0,
stratify=y)
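If the goal is to pick the validation samples at random while leaving the remaining training samples in their original time order, one possible sketch is to draw random indices with numpy and pass the held-out part to Keras through the validation_data argument of model.fit. The arrays below are made-up stand-ins for the simulation data (an assumption).

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the simulation output (assumption): 1000 time-ordered samples.
X1 = np.arange(1000, dtype=float).reshape(-1, 1)
Y1 = 2.0 * X1[:, 0]

# Draw 20% of the indices at random, without replacement.
n_val = len(X1) // 5
val_idx = np.sort(rng.choice(len(X1), size=n_val, replace=False))

# Everything that was not drawn stays in the training set, in original order.
train_mask = np.ones(len(X1), dtype=bool)
train_mask[val_idx] = False

X_train, Y_train = X1[train_mask], Y1[train_mask]
X_val, Y_val = X1[val_idx], Y1[val_idx]

# The split can then replace validation_split in the fit call, e.g.:
# model.fit(X_train, Y_train, validation_data=(X_val, Y_val),
#           epochs=1000, batch_size=100, shuffle=False)
```

Keep in mind that with time-continuous data, randomly chosen validation samples sit right next to training samples in time, so the validation score may look better than it would on genuinely unseen time ranges.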
I am working on an ML project (a binary classification problem) and was able to successfully run a few scikit-learn classifiers (RF, MLP, Extra Trees).
My question: I now have predict_proba results, which I have converted into a pandas DataFrame, and I would like to combine them with my original test data, which I will later export to CSV. I need to show this to my management as the final result of my ML project. The issue is that I adopted the following approach:
First standardized the whole data (using StandardScaler)
Then encoded the data using One-Hot encoding.
Then using Train_test_split, split the standardized and encoded data into two parts
How can I now get my original test data back (without standardization and one-hot encoding) with the column names intact?
Usually it's done a bit differently: we split the data set in its original format, before doing any preprocessing operations.
The preprocessing operations are fitted and executed on the training data set (X_train), and the result is passed to the estimator.
Then the same preprocessing operations, with the parameters learned from the training data, are applied to the test data set (X_test) in order to score your model on the unseen subset of data.
In practice, this is often done using the Pipeline class:
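from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
# assumes df holds numeric feature columns plus a 'label' column
X_train, X_test, y_train, y_test = \
    train_test_split(df.drop(columns='label'), df['label'], test_size=0.25)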
pipeline = Pipeline([
('scaler',StandardScaler()),
('clf', LogisticRegression())
])
pipeline.fit(X_train, y_train)
predicted = pipeline.predict(X_test)
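Because the split happens before any preprocessing, X_test still holds the original values and column names, so the probabilities can simply be attached to it. Below is a small sketch of that idea; the DataFrame, its column names and the ColumnTransformer layout are all made up for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Small made-up dataset (assumption): one numeric and one categorical column.
df = pd.DataFrame({
    "age": [22, 35, 58, 41, 29, 63, 47, 33, 51, 26, 38, 60],
    "city": ["A", "B", "A", "C", "B", "C", "A", "B", "C", "A", "B", "C"],
    "label": [0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1],
})

# Split the ORIGINAL data first; X_test keeps raw values and column names.
X_train, X_test, y_train, y_test = train_test_split(
    df[["age", "city"]], df["label"],
    test_size=0.25, random_state=0, stratify=df["label"])

# Scaling and one-hot encoding live inside the pipeline.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])
pipeline = Pipeline([("preprocess", preprocess),
                     ("clf", LogisticRegression())])
pipeline.fit(X_train, y_train)

# Attach the predicted probability of class 1 to the untouched test rows.
report = X_test.copy()
report["proba_class_1"] = pipeline.predict_proba(X_test)[:, 1]
print(report)
```

The resulting frame can then be exported with report.to_csv("results.csv") (the file name is just an example).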
I think the title is self-explanatory, but to ask it in detail: sklearn's train_test_split() method works like this: X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.3, stratify = Y). This means the method splits the data with a 0.3 : 0.7 (test : train) ratio and tries to keep the percentage of each label equal in both parts. Is there a Keras equivalent of this?
Now there is, using the Dataset class (tf.data.Dataset). I'm running keras-2.2.4-tf along with the new tensorflow release.
Basically, load all the data into a Dataset using something like tf.data.Dataset.from_tensor_slices. Then split the data into new datasets for training and validation. For example, shuffle all the records in the dataset, then use the first 400 as validation and the rest as training. Note that this gives a random split, not a stratified one.
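# reshuffle_each_iteration=False keeps the shuffled order fixed, so the
# take/skip split below stays the same every time the datasets are iterated
ds = ds_in.shuffle(buffer_size=rec_count, reshuffle_each_iteration=False)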
ds_train = ds.skip(400)
ds_validate = ds.take(400)
An instance of the Dataset class is a natural container to pass around to Keras models. I copied the concept from a tensorflow or keras training example but can't seem to find it again.
The canned datasets loaded with the load_data method return numpy.ndarray objects, so they are a little different, but they can easily be converted to a Dataset. I suspect this hasn't been done because so much existing code would break.
Unfortunately, the answer (despite our wish) is no! There are some existing datasets, like MNIST, which can be loaded directly:
(X_train, y_train), (X_test, y_test) = mnist.load_data()
Loading the data already split like this gives false hope that a general method exists, but unfortunately there isn't one here, though you may be interested in using the Keras wrappers for scikit-learn.
There is a very similar question on DataScience SE.
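If pulling in sklearn is not an option, a stratified split is not hard to write by hand with numpy. The function below is a hand-rolled sketch, not a Keras API; the name stratified_split and its parameters are my own. The resulting arrays can be passed to model.fit like any others.

```python
import numpy as np


def stratified_split(y, test_size=0.3, seed=0):
    """Return (train_idx, test_idx) with roughly equal class proportions."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    train_idx, test_idx = [], []
    for cls in np.unique(y):
        # Shuffle the positions of this class, then cut off the test share.
        idx = rng.permutation(np.flatnonzero(y == cls))
        n_test = int(round(test_size * len(idx)))
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return np.array(train_idx), np.array(test_idx)


# Toy data (assumption): 70 samples of class 0 and 30 of class 1.
Y = np.array([0] * 70 + [1] * 30)
X = np.arange(100).reshape(-1, 1)

train_idx, test_idx = stratified_split(Y, test_size=0.3)
X_train, X_test = X[train_idx], X[test_idx]
Y_train, Y_test = Y[train_idx], Y[test_idx]

print(len(train_idx), len(test_idx), Y_train.mean(), Y_test.mean())
```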
I am trying to build decision trees and regression trees with Python. I am using scikit-learn, but am open to alternatives.
What I don't understand about this library is whether a training and a validation subset can be provided, so that the library builds the model on the training subset, tests it on the validation subset, and stops splitting based on some rule (typically when additional splits don't improve performance on the validation subset, which prevents overfitting).
For example, this is what the JMP software does (http://www.jmp.com/support/help/Validation_2.shtml#1016975).
I found no mention of how to use a validation subset in the official website (http://scikit-learn.org/stable/modules/tree.html), nor on the internet.
Any help would be most welcome! Thanks!
There is a fairly rich set of cross validation routines and examples in the scikit learn cross validation section of the userguide.
Note that lots of progress seems to have been made in cross validation between SK-Learn version 0.14 and 0.15, so I recommend upgrading to 0.15 if you haven't already.
As jme notes in his comment, some of the cross validation capability has also been incorporated into the grid search and pipeline capabilities of SK-learn.
For completeness in answering your question, here is a simple example, but more advanced k-fold, leave-one-out, leave-p-out, shuffle-split, etc. are all available:
import numpy as np
# sklearn.cross_validation was removed in scikit-learn 0.20; train_test_split now lives in sklearn.model_selection
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn import svm
iris = datasets.load_iris()
iris.data.shape, iris.target.shape
((150, 4), (150,))
X_train, X_test, y_train, y_test = train_test_split(iris.data,
iris.target,
test_size=0.4,
random_state=0)
X_train.shape, y_train.shape
((90, 4), (90,))
X_test.shape, y_test.shape
((60, 4), (60,))
clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
clf.score(X_test, y_test)
0.96...
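For the k-fold variant mentioned above, a short sketch with a recent scikit-learn (where these helpers live in sklearn.model_selection) could look like this; the choice of 5 folds and the random_state are arbitrary.

```python
from sklearn import datasets, svm
from sklearn.model_selection import KFold, cross_val_score

iris = datasets.load_iris()
clf = svm.SVC(kernel="linear", C=1)

# 5 folds; shuffle because the iris rows are sorted by class.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# One accuracy score per fold: train on 4 folds, score on the remaining one.
scores = cross_val_score(clf, iris.data, iris.target, cv=cv)
print(scores, scores.mean())
```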
I hope this helps... Good luck!
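On the original point about a validation subset limiting tree growth: as far as I know, fit() on the scikit-learn tree estimators does not accept a validation set, but a similar effect can be approximated by trying several tree sizes and keeping the one that scores best on a held-out validation subset. A rough sketch, with max_depth as the growth limit and an arbitrary search range of 1 to 10:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

best_depth, best_score = None, -1.0
for depth in range(1, 11):
    # Grow a tree limited to this depth on the training subset only.
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    # Judge it on the validation subset; keep the first depth that wins.
    score = tree.score(X_val, y_val)
    if score > best_score:
        best_depth, best_score = depth, score

print(best_depth, best_score)
```

Since only strictly better scores replace the current best, ties go to the shallower tree. The same loop works for DecisionTreeRegressor, and parameters such as min_samples_leaf could be searched in the same way.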