Training SGDRegressor on a dataset in chunks

Training SGDRegressor on a dataset in chunks - python

For a machine learning task I need to deal with quite large data sets. As a result, I cannot fit the entire data set at once in my algorithm. I am looking for a way to train my algorithm in parts on the data set, simply feeding new chunks won't work since my algorithm will just refit and not won't take the previous examples into account. Is there a method with which I can feed my algorithm new data, while "remembering" the previous data seen before?
Edit: The algorithm I use is the SGDRegressor from scikit-learn.
The code:
train = pd.read_csv(os.path.join(dir,"Train.csv"),chunksize = 5000)
labels = pd.read_csv(os.path.join(dir,"Labels.csv"),chunksize = 5000)
algo = SGDRegressor(n_iter = 75)
print("looping for chunks in train")
for chunk in train:
algo.fit(train,labels)

You can use partial_fit to feed parts of training data to SGDRegressor.
See sample code in examples.

Related

Correct use of LinearSVC

I am trying to implement a machine learning algorithm which detects irregular ecg signals. I extracted some features, but I am not sure how to manage a correct input for the classifier.
I have 20k different ecg signals, each signal has 1000 values. They are all labeld as correct or incorrect.
I choose e.g. the two features heart_rate and xposition_of_3_highest_peaks, but how to feed them into the classifier?
Following you can see my attempt, but everytime I add a second feature the score decreases. Why?
clf = svm.SVC()
#[64,70,48,89...74,58]
X_train_heartRate = StandardScaler().fit_transform(fe.get_avg_heart_rate(X_train))
X_test_heartRate = StandardScaler().fit_transform(fe.get_avg_heart_rate(X_test))
#[[23,56,89],[24,45,78],...,[21,58,90]]
X_train_3_peaks = StandardScaler().fit_transform(fe.get_intervalls(X_train))
X_test_3_peaks = StandardScaler().fit_transform(fe.get_intervalls(X_test))
X_tr = np.concatenate((X_train_heartRate,X_train_3_peaks),axis =1)
X_te = np.concatenate((X_test_heartRate,X_test_3_peaks),axis =1)
clf.fit(X_tr, Y_train)
print("Prediction:", clf.predict(X_te))
print("real Solution:", Y_test)
print(clf.score(X_te,Y_test))
I am not sure if the StandardScaler().fit_transform is necessary or if the np.concatenate is correct? Maybe there is even a better classifier for this use case?
Sorry I am a complete beginner, please be kind :)

When you are doing any transformations for pre-processing, you must use the same process from the training data and apply it to the validation / test data. However, this process must use the same statistics from the training data, because you are assuming that the validation / test data also come from this same distribution. Therefore, you need to create an object to store the transformations of the training data, then apply it to the training and test data equally. Your decreased performance is because you are not applying the right statistics to both training and validation / test correctly. You are scaling both datasets using separate means and standard deviations, which can cause out-of-distribution predictions if your sample size isn't large enough.
Therefore, call fit_transform on the training data, then just transform on the validation / test data. fit_transform will simultaneously find the parameters of the scaling for each column, then apply it to the input data and return the transformed data to you. transform assumes an already fit scaler, such as what was done in fit_transform and applies the scaling accordingly. I sometimes like to separate the operations and do a separate fit on the training data, then transform on the training and validation/test data after. This is a common source of confusion for new practitioners. You also need to save the scaler object so you can apply this to your validation / test data later.
clf = svm.SVC()
#[64,70,48,89...74,58]
heartRate_scaler = StandardScaler()
X_train_heartRate = heartRate_scaler.fit_transform(fe.get_avg_heart_rate(X_train))
X_test_heartRate = heartRate_scaler.transform(fe.get_avg_heart_rate(X_test))
#[[23,56,89],[24,45,78],...,[21,58,90]]
three_peaks_scalar = StandardScaler()
X_train_3_peaks = three_peaks_scalar.fit_transform(fe.get_intervalls(X_train))
X_test_3_peaks = three_peaks_scalar.transform(fe.get_intervalls(X_test))
X_tr = np.concatenate((X_train_heartRate,X_train_3_peaks),axis =1)
X_te = np.concatenate((X_test_heartRate,X_test_3_peaks),axis =1)
clf.fit(X_tr, Y_train)
print("Prediction:", clf.predict(X_te))
print("real Solution:", Y_test)
print(clf.score(X_te,Y_test))
Take note that you can concatenate the features you want first, then apply the StandardScaler after the fact because the method applies the standardization to each feature/column independently. The above method of scaling the different sets of features and concatenating them after is no different than concatenating the features first, then scaling after.
Minor Note
I forgot to ask about the fe object. What is that doing under the hood? Does it use the training data in any way to get you features? You must make sure that this object operates on the statistics of your training data and test data, not separately. What I mentioned about ensuring that the pre-processing must match between training and validation/test, the statistics must also match in this fe object as well. I assume this either uses the training data's statistics to both sets of data, or it is an independent transformation that is agnostic. Either way, you haven't specified what this is doing under the hood, but I will assume the happy path.
Possible Improvement
Consider using a decision tree-based algorithm like a Random Forest Classifier that does not require scaling of the input features, as the job is to partition the feature space of your data into N-dimensional hypercubes, with N being the number of features in your dataset (if N=2, this would be a 2D rectangle, N=3 a 3D rectangle, etc). Depending on how your data is distributed, tree-based algorithms can do better and are the first things to try in Kaggle competitions.

Averaging linear separators obtained from SVM

For research purposes, I find myself needing to traing SVM via SGD on a large DS (that is, a large number of examples). This makes using scikit-learn's implementation (SGDClassifier) problematic, as it requires loading the entire DS at once.
The algorithm I am familiar with uses n step of SGD to obtain n different separators w_i, and then averages them (specifics can be seen in slide 12 of https://www.cse.huji.ac.il/~shais/Lectures2014/lecture8.pdf).
This made me think that maybe I can use scikit-learn to train multiple such classifiers and then take the average of the resulting linear separators (assume no bias).
Is this a reasonable line of thinking, or does scikit-learn's implementation not fall under my logic?
Edit: I am well aware of the alternatives for training SVM in different ways, but this is for a specific research purpose. I would just like to know if this line of thinking is possible with scikit-learn's implementation, or if you are aware of an alternative that will allow me to train SVM using SGD without loading an entire DS to memory.

SGDClassifier have a partial_fit method, and one of the primary objectives of partial_fit method is to scale sklearn models to large-scale datasets. Using this, you can load a part of the dataset into RAM, feed it to SGD, and keep repeating this unless full dataset is used.
In code below, I use KFold mainly to imitate loading chunk of dataset.
class GD_SVM(BaseEstimator, ClassifierMixin):
def __init__(self):
self.sgd = SGDClassifier(loss='hinge',random_state=42,fit_intercept=True,l1_ratio=0,tol=.001)
def fit(self,X,y):
cv = KFold(n_splits=10,random_state=42,shuffle=True)
for _,test_id in cv.split(X,y):
xt,yt = X[test_id],y[test_id]
self.sgd = self.sgd.partial_fit(xt,yt,classes=np.unique(y))
def predict(self,X):
return self.sgd.predict(X)
To test this against regular (linear) SVM:
X,y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X) #For simplicity, Pipeline is better choice
cv = RepeatedStratifiedKFold(n_splits=5,n_repeats=5,random_state=43)
sgd = GD_SVM()
svm = LinearSVC(loss='hinge',max_iter=1,random_state=42,
C=1.0,fit_intercept=True,tol=.001)
r = cross_val_score(sgd,X,y,cv=cv) #cross_val_score(svm,X,y,cv=cv)
print(r.mean())
This returned 95% accuracy for above GD_SVM, and 96% for SVM. In Digits dataset SVM had 93% accuracy, while GD_SVM had 91%. While these performances are broadly similar, as these measurements show, please note that they are not identical. This is expected, since these algorithms use pretty different optimization algorithms, but I think careful tuning of hyper-parameter would reduce the gap.

Based on the concern of loading all of the data in memory, if you have access to more compute resources, you may want to use PySpark's SVM implementation: https://spark.apache.org/docs/2.2.0/ml-classification-regression.html#linear-support-vector-machine, as that Spark is built for large scale data processing. I don't know if averaging the separators from multiple Scikit-Learn models would work as expected; there isn't a clean way to instantiate a new model with new separators, based on the documentation (https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html), so it would probably have to be implemented as an ensemble approach.

If you insist on using the whole DS for training instead of sampling (btw that is what the slides describe) and you do not care about performance, I would train n classifiers, and then select only their support vectors and retrain final version on those support vectors only. This way you effectively dismiss most of the data and concentrate only on the points that are important for the classification.

Testing new data on a trained LGBM model

I am a newbie to ML, and trying to replicate a price optimization solution available at https://www.kaggle.com/tunguz/more-effective-ridge-lgbm-script-lb-0-44823?source=post_page-------
I followed the same code as given, and then trying to test it on a new data. However, it is not predicting the price correctly at all. I am making sure I save the trained model/vectors, load it fresh and transform the new data as per the model requirements, similar to as done to the training set.
The issue is, if my new data is exactly same as my Test dataset (600k + rows) used during testing the model, then it is returning me exact correct results as during test prediction. But if I use only, example, first 10 rows of it, then it is not matching the existing results at all, even though I am transforming the features through saved vectors.
#below is while training the model
cvname = CountVectorizer(min_df=NAME_MIN_DF)
X_name = cvname.fit_transform(merge['name'])
pickle.dump(cvname, open("namevector.pkl", "wb"))
.
.
.
.
#after completing the training, and loading the new data
handle_missing_inplace(mytest)
cutting(mytest)
to_categorical(mytest)
cv1 = pickle.load(open("namevector.pkl", "rb"))
X_name1 = cv1.transform(mytest['name'])
cv2 = pickle.load(open("categoryvector.pkl", "rb"))
X_category1 = cv2.transform(mytest['category_name'])
tv1 = pickle.load(open("descriptionvector.pkl", "rb"))
X_description1 = tv1.transform(mytest['item_description'])
lb1 = pickle.load(open("brandvector.pkl", "rb"))
X_brand1 = lb1.transform(mytest['brand_name'])
t1 = pd.get_dummies(mytest[['item_condition_id', 'shipping']],sparse=True)
X_dummies1 = csr_matrix(t1.values.astype('int64'))
sparse_merge1 = hstack((X_dummies1, X_description1, X_brand1, X_category1, X_name1)).tocsr()
X_test1 = sparse_merge1
my_pred = pkl_bst1.predict(X_test1)
mysubmission['price'] = np.expm1(my_pred)
Can anyone please let me know what am I missing? The model worked fine on train and test dataset, but not on new data, or even small subset of Test dataset.

It is usually called overfitting. Or perhaps it is underfitting. As any other ML algorithm, LGBM is susceptible to both.
Meaning the model does very well on training and test data, but perform poorly on new data. The model is not generalizing well, it is just memorizing the training data.
There are some suggestions here on how to deal with overfitting on LGBM in particular, but there is a lot of information about the issue in general you should take the time to read. Google is the usual starting point.
Collecting more data is sometimes the way to deal with the problem. Hundred of thousands, millions. Machine learning is a data hungry business.
You will have to tweak some model parameters and do a lot of training until your predictions start to improve, if ever. It is called parameter tuning.
That's the tough side of ML.
Don't get discouraged though.

SciKit LogisticRegression failing to predict accurately

I have a set of generated data describing web connections in CSV that looks like this:
conn_duration,conn_destination,response_size,response_code,is_malicious
1564,130,279,532,302,0
1024,200,627,1032,307,0
2940,130,456,3101,201,1
Full CSV here
The class indicates which ones are of interest based on duration, destination_id and response code.
I think LogisticRegression would be a good fit here but the results I'm getting aren't great. On the generated dataset I've got 750 rows with a 0 class and 150 with a 1.
This is how I'm manipulating and providing the data:
names = ['conn_duration', 'conn_destination', 'response_size', 'response_code', 'is_malicious']
dataframe = pandas.read_csv(path, names=names)
array = dataframe.values
# separate array into input and output components
X = array[:,0:4]
y = array[:,4]
scaler = Normalizer().fit(X)
normalizedX = scaler.transform(X)
# summarize transformed data
numpy.set_printoptions(precision=3)
print(normalizedX[0:5,:])
model = LogisticRegression()
model.fit(X, y)
# Two test bits of data, expect the first to be predicted 1 and the second to be 0
Xnew = [[[3492, 150, 750, 200]], [[3492, 120, 901, 200]]]
for conn in Xnew:
# make a prediction
ynew = model.predict(conn)
print("X=%s, Predicted=%s" % (conn[0], ynew[0]))
The criteria for a malicious bit of traffic is that the response code is 200, conn_destination is 150, and the response size is greater than 500.
I'm getting reasonable prediction but wonder if LogisticRegression is the right algorithm to be using?
TIA!

If the code is working, but you aren't sure what algorithm to use, I would recommend trying an SVM, random forest, etc. Use the GridSearchCV module to determine which algorithm gives the best performance.

Since there's a simple rule to classify the traffic, as "response code is 200, conn_destination is 150, and the response size is greater than 500", you don't actually need a model to solve it. Don't overkill a simple problem.
For studying purposes it's ok, but the model should get very close to 100% because it should learn this rule.
Anyway, conn_destination and response_code are categorical data, if you directly normalize it the algorith will understand 200 closer to 201 then to 300, but they categories not numbers.
Here's a reference of some ways to threat categorical data: Linear regression analysis with string/categorical features (variables)?

I would try XGBoost (Extreme Gradient Boosted Trees). In large datasets SVM is computationally costly and I specially like Random Forests when you have a highly imbalanced dataset.
Logistic regression can be part of a Neural Network, if you want to develop something more accurate and sophisticated, like tuning hyperparameters, avoiding overfitting and increasing generalization properties. You can also do that in XGBoost, by pruning trees.
XGBoost and Neural Networks would be my choices for a classification problem. but the whole thing is bigger than that. It's not about choosing an algorithm, but understand how it works, what is going on under the hood and HOW you can adjust it in a way you can accurately predict classes.
Also, data preparation, variable selection, outlier detection, descriptive statistics are very important for the quality and accuracy of your model.

Sklearn-GMM on large datasets

I have a large data-set (I can't fit entire data on memory). I want to fit a GMM on this data set.
Can I use GMM.fit() (sklearn.mixture.GMM) repeatedly on mini batch of data ??

There is no reason to fit it repeatedly.
Just randomly sample as many data points as you think your machine can compute in a reasonable time. If variation is not very high, the random sample will have approximately the same distribution as the full dataset.
randomly_sampled = np.random.choice(full_dataset, size=10000, replace=False)
#If data does not fit in memory you can find a way to randomly sample when you read it
GMM.fit(randomly_sampled)
And the use
GMM.predict(full_dataset)
# Again you can fit one by one or batch by batch if you cannot read it in memory
on the rest to classify them.

fit will always forget previous data in scikit-learn. For incremental fitting, there is the partial_fit function. Unfortunately, GMM doesn't have a partial_fit (yet), so you can't do that.

As Andreas Mueller mentioned, GMM doesn't have partial_fit yet which will allow you to train the model in an iterative fashion. But you can make use of warm_start by setting it's value to True when you create the GMM object. This allows you to iterate over batches of data and continue training the model from where you left it in the last iteration.

I think you can set the init_para to empty string '' when you create the GMM object, then you might be able to train the whole data set.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Training SGDRegressor on a dataset in chunks - python

You can use partial_fit to feed parts of training data to SGDRegressor. See sample code in examples.

Related

Correct use of LinearSVC

Averaging linear separators obtained from SVM

Testing new data on a trained LGBM model

SciKit LogisticRegression failing to predict accurately

Sklearn-GMM on large datasets

Categories

Resources