I have an array with the shape 57159x924 which I will use as training data. 896 of these 924 columns are features and the remaining ones are labels. I want to use logistic regression on this, but when I call the fit method of LogisticRegression I get a memory error. I guess this is because it's too much data for my computer's memory to handle. Is there any way to get around this problem?
The code I want to use is
lr = LogisticRegression(random_state=1)
lr.fit(train_set, train_label)
lr.predict_proba(x_test)
And the following is the error
line 21, in main
lr.fit(train_set, train_label)
....
return array(a, dtype, copy=False, order=order)
MemoryError
You haven't given enough details to really understand the problem or give a definite answer, but here are a few options I hope will help:
The amount of memory available to your Python process might be configurable, depending on how and where you run it.
Training over all the data at once raises OOM problems in many contexts, which is why the common practice is to use SGD (stochastic gradient descent): train over batches, i.e. introduce only a subset of the data at each iteration and obtain a global optimization solution in a stochastic sense. If I'm guessing correctly, you're using sklearn.linear_model.LogisticRegression, which has different "solvers". Maybe the saga solver will handle your situation better (a sketch follows this list).
There are other implementations out there, and some of them definitely have batching options built in, in a highly configurable way. And if worst comes to worst, implementing a logistic-regression model yourself is fairly simple, and then batching is easy as pie.
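For the solver suggestion, here is a minimal sketch of what it could look like, assuming the train_set, train_label and x_test arrays from the question; whether it actually avoids the MemoryError depends on where the copy blows up:

from sklearn.linear_model import LogisticRegression

# 'saga' is a stochastic solver, as suggested above; the data is still passed
# in one piece, so this only helps if the solver (not the array copy) was the
# memory bottleneck.
lr = LogisticRegression(random_state=1, solver='saga', max_iter=1000)
lr.fit(train_set, train_label)
probs = lr.predict_proba(x_test)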
Edit (due to discussion in comments):
Here's a practical way to go about it, with a very simple (and easy) example:
from sklearn.linear_model import SGDClassifier
import numpy as np
import random
X1 = np.random.multivariate_normal(mean=[10, 5], cov = np.diag([3, 8]), size=1000) # diagonal covariance for simplicity
Y1 = np.zeros((1000, 1))
X2 = np.random.multivariate_normal(mean=[-4, 55], cov = np.diag([5, 1]), size=1000) # diagonal covariance for simplicity
Y2 = np.ones((1000, 1))
X = np.vstack([X1, X2])
Y = np.vstack([Y1, Y2]).reshape([2000,])
sgd = SGDClassifier(loss='log', warm_start=True) # as mentioned in answer. note that shuffle is defaulted to True.
sgd.partial_fit(X, Y, classes = [0, 1]) # first time you need to say what your classes are
for k in range(1000):
    batch_indexs = random.sample(range(2000), 20)
    sgd.partial_fit(X[batch_indexs, :], Y[batch_indexs])
In practice you should be looking at the loss and accuracy and using a suitable while loop instead of the fixed for loop, but that much is left for the reader ;-)
Note that you can control more than I've shown (like the number of iterations etc.), so you should read the documentation of SGDClassifier properly.
Another thing to note is that there are different batching practices. I just took a random subset at every iteration, but some prefer to make sure every point in the data is seen an equal number of times (e.g. shuffle the data and then take batches in order of the shuffled indexes), as in the sketch below.
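For instance, a sketch of that second practice, reusing the X, Y and sgd objects defined above (the epoch count and batch size are arbitrary):

import numpy as np

n, batch_size = 2000, 20
for epoch in range(10):
    order = np.random.permutation(n)            # reshuffle once per epoch
    for start in range(0, n, batch_size):
        idx = order[start:start + batch_size]   # walk the shuffled data in order
        sgd.partial_fit(X[idx, :], Y[idx])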
Related
I'm using pytorch to implement a simple linear regression model.
The code works perfectly for randomly created datasets, but when it comes to the dataset I wanted to train, it gives significantly wrong results.
Here is the code:
x = torch.linspace(1,100,steps=100)
learn_rate = 0.000001
x_train = x[:100]
x_test = x[100:]
y_train = data[:100]
y_test = data[100:]
# y_train = -0.01*x_train + torch.randn(100)*10 #Code for generating random data.
w = torch.rand(1,requires_grad=True)
b= torch.rand(1,requires_grad=True)
for i in range(1000):
    loss = torch.mean((y_train - (w*x_train + b))**2)
    if i % 100 == 0:
        print(loss)
    loss.backward()
    w.data.add_(-w.grad.data*learn_rate)
    b.data.add_(-b.grad.data*learn_rate)
    w.grad.data.zero_()
    b.grad.data.zero_()
The result it gives makes no sense.
However, when I use a randomly generated dataset, it works perfectly.
The two datasets actually look similar, and I am not sure of the reason for the inaccuracy of this model.
Code for plotting data:
plt.plot(x_train.numpy(),y_train.numpy())
plt.plot(x_train.numpy(),(w*x_train+b).data.numpy())
plt.show()
--
Now the problem seems to be that the weight converges much faster than the bias. At the current learning rate, the bias will not converge to its optimal value; however, if I increase the learning rate just a little, the weight simply diverges. I would have to set two different learning rates.
However, I'm wondering whether setting different learning rates is the best solution for a simple model like this, because I've found that not many models actually use different learning rates for different parameters.
Your code seems to be correct, but your model converges more slowly when there is a large bias in your data (because it has to update the bias parameter many times before it reaches the correct value).
You could try running it for more iterations or increasing the learning rate.
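If you do want to experiment with the two-learning-rate idea from the question, here is a minimal sketch using torch.optim parameter groups; the data is the randomly generated stand-in from the commented-out line, and both learning rates are arbitrary:

import torch

x_train = torch.linspace(1, 100, steps=100)
y_train = -0.01 * x_train + torch.randn(100) * 10  # stand-in data, as in the question's commented line

w = torch.rand(1, requires_grad=True)
b = torch.rand(1, requires_grad=True)

# one parameter group per tensor, each with its own learning rate
optimizer = torch.optim.SGD([
    {'params': [w], 'lr': 1e-6},
    {'params': [b], 'lr': 1e-2},
])

for i in range(1000):
    loss = torch.mean((y_train - (w * x_train + b)) ** 2)
    if i % 100 == 0:
        print(loss.item())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()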
In this simple toy example, the network learns the XOR operation:
import tensorflow as tf
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score
model = tf.keras.Sequential(layers=[
    tf.keras.layers.Input(shape=(2,)),
    tf.keras.layers.Dense(4, activation='relu'),
    tf.keras.layers.Dense(1)
])
model.compile(
    loss=tf.keras.losses.binary_crossentropy,
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.01)
)
x_train = np.random.uniform(-1, 1, (10000, 2))
tmp = x_train > 0
y_train = (tmp[:, 0] ^ tmp[:, 1])
model.fit(x=x_train, y=y_train, epochs=10)
x_test = np.random.uniform(-1, 1, (1000, 2))
tmp = x_test > 0
y_test = (tmp[:, 0] ^ tmp[:, 1])
prediction = model.predict(x_test) > 0.5
print(f'Accuracy: {accuracy_score(y_pred=prediction, y_true=y_test)}')
print(f'recall: {recall_score(y_pred=prediction, y_true=y_test)}')
print(f'precision: {precision_score(y_pred=prediction, y_true=y_test)}')
This example can also be found in the tensorflow playground.
When the initial loss is below 3, this quickly converges (in 2-3 epochs). But sometimes the initial conditions lead to a loss of ~7, in which case it never converges (not even after 1000 epochs).
It's easy to tell right after the first epoch whether it's going to work or not, but this makes searching for hyperparameters very difficult, since you never know whether training converged by chance thanks to the initial conditions, or whether the hyperparameters are the cause.
Is there a way to make this network less dependent on the initial conditions? A different optimizer? Some optimizer hyperparameter? Weight regularization?
I've tried changing these, but didn't get consistent improvements.
In the playground example, it never gets stuck at this kind of high loss.
Edit: If you make the training long enough, it may jump back to a loss of ~7 even after settling on a good solution with loss < 0.03.
Theoretically, there's no way to be 100% sure whether it's the hyperparameters or the initial configuration. You'll need to implement something for the case when training diverges.
Practically though, you could:
Train multiple times, and incorporate how often the network converges into your strategy for selecting the best hyperparameters.
Find some ranges for which you feel the model is consistent.
Incorporate the initialization of your weights into your hyperparameter optimization. Right now they are randomly initialized and are the cause of the problem. There are a number of ways to do this; try playing around with different initializers, but there is no one best initializer for every ML problem.
Just fix the initial conditions: fix the random seed that tensorflow uses for initialization with tf.random.set_seed. Of course that will affect your results a lot, so I don't think it's really what you want. You could then claim that a network performs well because of its architecture, but that would only be true for that specific random seed, not for all of them.
According to this blog, adding batchnorm should make the network less sensitive to the initialisation approach; a minimal sketch of this idea is below.
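For example, a hedged sketch of the same toy model with a BatchNormalization layer and an explicit kernel initializer, two of the knobs mentioned above (the 'he_normal' choice is just an illustration, not a recommendation from the blog):

import tensorflow as tf

model = tf.keras.Sequential(layers=[
    tf.keras.layers.Input(shape=(2,)),
    tf.keras.layers.Dense(4, activation='relu', kernel_initializer='he_normal'),
    tf.keras.layers.BatchNormalization(),  # intended to reduce sensitivity to initialization
    tf.keras.layers.Dense(1)
])
model.compile(
    loss=tf.keras.losses.binary_crossentropy,
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.01)
)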
Context to what I'm trying to achieve:
I have a problem regarding image classification using scikit. I have CIFAR-10 data: 10000 training images and 1000 testing images. Each set of test/train images is stored in a test/train npy file as a 4-d matrix (height, width, rgb, sample), and I also have the test/train labels. I have a computeFeatures method that uses the Histogram of Oriented Gradients method to represent image-domain features as a vector. I am iterating this method over both the training and testing data so that I can create an array of features that can be used later to classify the images; I did this with a for loop over i, storing the results in a numpy array. I must then continue to apply PCA/LDA and do image classification with SVC, CNN etc. (any method of image classification).
import numpy as np
import skimage.feature
from sklearn.decomposition import PCA
trnImages = np.load('trnImage.npy')
tstImages = np.load('tstImage.npy')
trnLabels = np.load('trnLabel.npy')
tstLabels = np.load('tstLabel.npy')
from sklearn.svm import SVC
def computeFeatures(image):
    hog_feature, hog_as_image = skimage.feature.hog(image, visualize=True, block_norm='L2-Hys')
    return hog_feature

trnArray = np.zeros([10000, 324])
tstArray = np.zeros([1000, 324])

for i in range(0, 10000):
    trnFeatures = computeFeatures(trnImages[:, :, :, i])
    trnArray[i, :] = trnFeatures

for i in range(0, 1000):
    tstFeatures = computeFeatures(tstImages[:, :, :, i])
    tstArray[i, :] = tstFeatures
pca = PCA(n_components = 2)
trnModel = pca.fit_transform(trnArray)
pca = PCA(n_components = 2)
tstModel = pca.fit_transform(tstArray)
# Divide the dataset into the two sets.
test_data = tstModel
test_labels = tstLabels
train_data = trnModel
train_labels = trnLabels
C = 1
model = SVC(kernel='linear', C=C)
model.fit(train_data, train_labels.ravel())
y_pred = model.predict(test_data)
accuracy = np.sum(np.equal(test_labels, y_pred)) / test_labels.shape[0]
print('Percentage accuracy on testing set is: {0:.2f}%'.format(accuracy))
Accuracy prints out as 100%. I'm pretty sure this is wrong, but I'm not sure why.
First of all,
pca = PCA(n_components = 2)
tstModel = pca.fit_transform(tstArray)
this is wrong. You have to use:
tstModel = pca.transform(tstArray)
Secondly, how did you select the number of PCA components? Why 2? Why not 25 or 100? Two principal components may be too few for images. Also, as far as I can tell, the datasets are not scaled prior to PCA.
Just for interest, check the balance of classes.
Regarding 'shall we use PCA before SVM or not': it highly depends on the data, so try both cases and then decide. SVC may be pretty slow to train, so PCA (or another dimensionality reduction technique) may speed it up a little; a sketch of the corrected PCA/SVM flow follows.
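For instance, a sketch of that corrected flow using a pipeline, assuming the trnArray, tstArray and label arrays from the question (n_components=100 is just a placeholder to experiment with):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC

# scale and fit PCA on the training features only; the same fitted
# transforms are then reused on the test features inside predict()
model = make_pipeline(StandardScaler(), PCA(n_components=100), SVC(kernel='linear', C=1))
model.fit(trnArray, trnLabels.ravel())
y_pred = model.predict(tstArray)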
The immediate concern in this sort of situation is that the model is over-fitted. Any professional reviewer would immediately return this to the investigator. In this case, I suspect it is a result of the statistical approach used.
I don't work with images, but I would question why PCA is being stacked onto SVM. In plain terms, you are using two successive methods that reduce/collapse a high-dimensional space, which would very likely lead to a foregone outcome. If you collapse the high-dimensional space once, why repeat it?
The PCA is standard for images, but should be followed by something very simple such as K-means.
The other approach instead of PCA is, of course, NMF, and I would recommend it if you feel PCA is not providing the resolution sought (a brief sketch is at the end of this answer).
Otherwise the calculation looks fine.
accuracy = np.sum(np.equal(test_labels, y_pred)) / test_labels.shape[0]
On second thought, the accuracy figure might not be a matter of over-fitting IF (that's an emphatic 'IF') test_labels contained predictions of the images (of which ~50% are incorrect).
I'm just guessing that this is what the "test_labels" data is, however, and we have no idea how those predictions were derived. So I'm not sure there's enough information to answer the question.
BTW, could someone explain "shape[0]", please? Is it needed?
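If you want to try the NMF route mentioned above, here is a hedged sketch (NMF needs non-negative inputs, which HOG features are; n_components and max_iter are placeholders):

from sklearn.decomposition import NMF

# fit the factorization on the training features only, then reuse it
nmf = NMF(n_components=50, max_iter=500)
trnModel = nmf.fit_transform(trnArray)
tstModel = nmf.transform(tstArray)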
One obvious problem with your approach is that you apply PCA in a rather peculiar way. You should typically estimate only one transform, on the training data, and then use it to transform any evaluation set as well.
This way you kind of implement SVM with a whitening batch-norm, which sounds cool but is rather unusual, so it would need much care. For example, this way you cannot classify a single sample. Still, it may work as an unsupervised adaptation technique.
Apart from that, it's hard to tell without access to your data. Are you sure that the test and train sets are disjoint?
I am new to Tensorflow and am working through the regression examples given here: tensorflow tutorials. Specifically, I am working on the third one, "polynomial_regression.py".
I followed the linear regression example fine, and have now moved on to the polynomial regression.
However, I wanted to try substituting another set of data instead of that made up in the example. I did this by exchanging
xs = np.asarray([3.3,4.4,5.5,6.71,6.93,4.168,9.779,6.182,7.59,2.167,
7.042,10.791,5.313,7.997,5.654,9.27,3.1], dtype=np.float32)
ys = np.asarray([1.7,2.76,-2.09,3.19,1.9,1.573,3.366,2.596,2.53,1.221,
2.827,-3.465,1.65,-2.1004,2.42,2.94,1.3], dtype=np.float32)
n_observations = xs.shape[0]
for
n_observations = 100
xs = np.linspace(-3, 3, n_observations)
ys = np.tan(xs) + np.random.uniform(-0.5, 0.5, n_observations)
I.e. the second block is what was given in the example, and I wanted to try to run the same training with the new xs, ys and n_observations. These were the only lines I changed. I also tried changing the dtype of the arrays to float64, but this did not change the output.
The output I am getting (from print(training_cost)) is just a repeated nan. When I switch back to the original data, the network runs fine and generates a fitted function.
Thank you for any ideas!
NaNs can be caused by many things, usually some form of numerical instability. Lowering the learning rate or using a more stable optimizer are good things to try. One likely culprit here is input scale: your substituted xs reach values around 10, while the original example uses inputs in [-3, 3], so the polynomial terms and their gradients are much larger and can blow up the updates; a sketch of scaling the inputs first is below.
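If it is a scale issue, one hedged thing to try is standardizing the substituted inputs before training, for example:

import numpy as np

xs = np.asarray([3.3, 4.4, 5.5, 6.71, 6.93, 4.168, 9.779, 6.182, 7.59, 2.167,
                 7.042, 10.791, 5.313, 7.997, 5.654, 9.27, 3.1], dtype=np.float32)
xs = (xs - xs.mean()) / xs.std()  # zero mean, unit variance, comparable to the original [-3, 3] range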
Since sklearn's LogisticRegression (LR for short) has no direct method for solving weighted LR, I switched to SGDClassifier (SGD).
In my experiment I generate data following a logistic model with parameters intercept=0 and beta=2, and run LR and SGD to estimate them.
To compare the two, I set the same penalty parameter (my original idea was to set the penalty to 0, but since it can't be set to 0, I give a big C for LR and a small alpha for SGD).
From what I see, LR does the estimation well, but it's difficult to set the parameters for SGD. The main problem is the choice of eta0 and of learning_rate: 'constant' (too slow), 'optimal' or 'invscaling'.
My idea is to watch the loss function: if it keeps going down, increase n_iter; if it goes down too slowly, increase eta0. But:
1. How do I get the value of the loss function at each epoch? I can see it by setting verbose to 1, but I don't know how to retrieve the value programmatically (maybe with partial_fit? see the sketch after the code below).
2. Is there a more intelligent (automatic) way to do this? If not, I have to restart the training process many times, and it gets even more complicated if I use cross-validation.
Thank you for any advice. If I was not clear, please let me know.
import random
import numpy as np
from sklearn.linear_model import LogisticRegression,SGDClassifier
def simule_logistic(n):
    beta = 0.2
    x = []
    seuil = []
    for i in range(n):
        x.append(random.normalvariate(1, 2))
        seuil.append(random.uniform(0, 1))
    x = np.array(x)
    seuil = np.array(seuil)
    p = 1.0 / (1 + np.exp(-x * beta))
    y = []
    for i in range(n):
        if p[i] < seuil[i]:
            y.append(0)
        else:
            y.append(1)
    y = np.array(y)
    return x, p, y

if __name__ == '__main__':
    n = 100000
    x, p, y = simule_logistic(n)
    x = x.reshape((n, 1))
    print(x.shape)
    print(y.shape)
    l = LogisticRegression(C=1000000, penalty='l1')
    l.fit(x, y)
    sgd = SGDClassifier(n_iter=100, n_jobs=1, loss='log', alpha=1.0/1000000, l1_ratio=1, learning_rate='optimal', eta0=0.01)
    print(sgd)
    sgd.fit(x, y)
    # regression results
    print('l', l.coef_)
    print(l.intercept_)
    print('sgd', sgd.coef_)
    print(sgd.intercept_)
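For point 1, a hedged sketch of tracking the loss per epoch with partial_fit and sklearn.metrics.log_loss, reusing the SGDClassifier import and the x, y produced by simule_logistic above (the epoch count and stopping tolerance are arbitrary):

from sklearn.metrics import log_loss

sgd = SGDClassifier(loss='log', alpha=1.0 / 1000000, l1_ratio=1,
                    learning_rate='optimal', eta0=0.01)
losses = []
for epoch in range(100):
    sgd.partial_fit(x, y, classes=[0, 1])             # one pass over the data
    losses.append(log_loss(y, sgd.predict_proba(x)))  # loss after this epoch
    if epoch > 0 and abs(losses[-2] - losses[-1]) < 1e-6:
        break                                         # crude convergence check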