I am trying to write a prediction function that predicts on large text data (so it has to run in batches), but the predict function is a bit slow, so I'm wondering what I can do to improve its runtime.
What I have so far:
import torch
from tqdm import tqdm

def get_embeddings(model, data_loader, device):
    model.eval()  # eval mode
    with torch.no_grad():
        embeddings = torch.tensor([], dtype=torch.float64, device=device)  # initialize empty tensor
        for _, d in enumerate(tqdm(data_loader), 0):
            # model inputs
            input_ids = d["input_ids"].to(device)
            attention_mask = d["attention_mask"].to(device)
            # model outputs
            embeddings = torch.cat((embeddings, model.predict(
                input_ids,
                attention_mask=attention_mask
            )))  # concat predictions
    return embeddings.cpu().numpy()  # convert to numpy array
I think converting from GPU to CPU takes time, so I decided to initialize an empty tensor, concatenate all the predictions on the GPU, and only convert to a numpy array at the very end. But I am unsure whether the tensor concatenation itself will actually be slower.
So I am wondering if there are any better approaches or best practices when it comes to prediction.
A few things caught my attention in your code:
You are using float64. GPUs are considerably slower with it than float32 and float16.
You might be forcing a conversion from float32 to float64, which can be slow too.
You are copying the attention mask to the GPU for every input. Instead, create and/or keep it there.
Make sure all your operations are done in batches to use the full parallelism of the GPU.
Concatenating is probably unnecessary: GPU->CPU transfers are mostly bandwidth-limited, so it doesn't matter much whether you move one big tensor or many small ones (up to a point). Secondly, the repeated allocations caused by cat can be slow.
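Putting those points together, a minimal sketch could collect per-batch results in a Python list and concatenate once at the end (this assumes, as in your code, that model.predict returns a float tensor on the GPU):

import numpy as np
import torch
from tqdm import tqdm

def get_embeddings(model, data_loader, device):
    model.eval()
    chunks = []  # collect per-batch outputs; one cat/copy at the end
    with torch.no_grad():
        for d in tqdm(data_loader):
            input_ids = d["input_ids"].to(device, non_blocking=True)
            attention_mask = d["attention_mask"].to(device, non_blocking=True)
            out = model.predict(input_ids, attention_mask=attention_mask)
            chunks.append(out.float().cpu())  # stay in float32, free GPU memory per batch
    return torch.cat(chunks).numpy()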
Related
I have a pretrained Keras model whose output has dimension [n, 4000] (it performs classification over 4000 classes).
I need to make a prediction on the test data (300K observations).
But when I call model.predict(X_train), I get an out-of-memory error, because I don't have enough RAM to store a matrix with shape (300K, 4000).
Therefore, it would be logical to convert the model output to a sparse matrix.
But wrapping the predict call in the scipy function sparse.csr_matrix does not work (sparse.csr_matrix(model.predict(X_train))), because it first allocates RAM for the dense prediction and only then converts it into a sparse matrix.
I could also predict on individual batches of the test data and then convert them in a for loop.
But that seems like a suboptimal and resource-consuming approach.
Please advise: are there any other methods for converting the model output into a sparse matrix?
Isn't there a batch_size parameter in predict()?
If I understand correctly, n is the number of samples, right?
Assume that your system RAM is enough to hold the entire data, but the VRAM is not.
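For reference, the batch-wise route mentioned in the question (together with the batch_size argument noted above) can be kept sparse chunk by chunk with scipy. This is only a sketch: the threshold is a hypothetical cut-off, since csr only saves memory when most entries are (or are made) exactly zero.

import numpy as np
from scipy import sparse

def predict_sparse(model, X, batch_size=1024, threshold=1e-4):
    blocks = []
    for start in range(0, X.shape[0], batch_size):
        # predict one chunk at a time so only one dense chunk lives in RAM
        probs = model.predict(X[start:start + batch_size], batch_size=batch_size)
        probs[probs < threshold] = 0.0           # zero out tiny probabilities
        blocks.append(sparse.csr_matrix(probs))  # dense chunk can be freed after this
    return sparse.vstack(blocks)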
I have a model trained in Keras and saved as a .h5 file. The model was trained with single-precision floating point values using the TensorFlow backend. Now I want to implement a hardware accelerator that performs the convolution operation on a Xilinx FPGA. However, before I decide on the fixed-point bit width to use on the FPGA, I need to evaluate the model accuracy by quantizing the weights to 8- or 16-bit numbers. I came across TensorFlow's quantization tools, but I am not sure how to take the weights from each layer, quantize them, and store them in a list of numpy arrays. After all layers are quantized, I want to set the model's weights to the newly formed quantized weights. Could someone help me do this?
This is what I have tried so far to reduce precision from float32 to float16. Please let me know if this is the correct approach.
for i in range(len(w_orginal)):
    temp_shape = w_orginal[i].shape
    print('Shape of index ' + str(i) + ' array is:')
    print(temp_shape)
    temp_array = w_orginal[i]
    temp_array_flat = w_orginal[i].flatten()
    for j in range(len(temp_array)):
        temp_array_flat[j] = temp_array_flat[j].astype(np.float16)
    temp_array_flat = temp_array_flat.reshape(temp_shape)
    w_fp_16_test.append(temp_array_flat)
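For the float16 experiment specifically, the element-wise loop can be replaced by one vectorized cast per weight array. This is only a sketch, assuming w_orginal came from model.get_weights() as in your code:

import numpy as np

# Cast every weight array to float16 in one step per array.
w_fp16 = [w.astype(np.float16) for w in w_orginal]

# Keras layers still expect float32, so cast back before loading;
# the round trip through float16 is what introduces the quantization error.
model.set_weights([w.astype(np.float32) for w in w_fp16])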
Sorry, I'm not familiar with TensorFlow, so I can't give you code, but maybe my experience quantizing a Caffe model will be useful.
If I understand you correctly, you have a TensorFlow model (float32) that you want to quantize to int8 and save in a numpy.array.
Firstly, you should read all the weights for each layer; they might be a Python list or numpy.array or something else, it doesn't matter.
Then, the quantization algorithm will influence the accuracy significantly, so you must choose the best one for your model. However, these algorithms share the same core -- scale. All you need to do is scale all the weights into the range -127 to 127 (int8), like a scale layer without bias, and record the scale factor.
Meanwhile, if you want to implement it on an FPGA, the data should be quantized too. Here we have a new problem -- the product of two int8 values is an int16, which obviously overflows int8.
To solve this, we introduce a new parameter -- shift -- to shift the int16 result back to int8. Note that the shift parameter won't be a constant 8: for example, 0 * 0 = 0 needs no shift at all.
The last thing to think about is that if the net is too deep, a layer's result can overflow because of unreasonable scale parameters, so we can't quantize each layer in isolation without considering the others.
After the whole net has run on the FPGA, if you want to dequantize int8 back to float32, just use the last scale parameter (of the final result) to do the multiplication/division (depending on how you define scale).
This is a basic quantization algorithm; others, like tf.quantization, may give higher accuracy. Once you have the quantized model, you can save it in whatever format you like; that is not the hard part.
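As a concrete illustration of the scale idea above (not the exact Caffe procedure, just a per-tensor symmetric sketch):

import numpy as np

def quantize_int8(w):
    # Symmetric per-tensor quantization: map [-max|w|, max|w|] onto [-127, 127].
    scale = 127.0 / np.max(np.abs(w))          # record this factor per layer
    q = np.clip(np.round(w * scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Recover an approximate float32 tensor from the int8 weights.
    return q.astype(np.float32) / scale

w = np.random.randn(64, 3, 3, 3).astype(np.float32)  # a made-up conv kernel
q, scale = quantize_int8(w)
w_back = dequantize(q, scale)
print(np.max(np.abs(w - w_back)))                    # per-weight quantization error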
P.S. Why numpy? A bin file is best for an FPGA, isn't it?
Also, do you have any ideas about implementing softmax on an FPGA? I'm confused about it...
I have trained a 3D convnet using mxnet. I saved the network architecture and parameters with the intention of testing more data with it to check its performance. Since I am not training, I do not want to draw batches from the dataset. How do I get the network to read the entire dataset as input? Passing the dataset object to the network directly only gives a 4D tensor, whereas the network expects 5D. Right now I am using the DataLoader but setting the batch size to the entire dataset, and I feel like there is a more efficient way to do this.
DataLoader requires either a batch_size or a BatchSampler. In theory, you could write a BatchSampler that fetches the entire dataset as one batch, though I don't think you'll see a significant performance gain once your batch size is already very large. Additionally, using batches is beneficial if you have more than one worker - have you considered using num_workers > 0 to take advantage of parallel processing?
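A minimal sketch of the two options, assuming a Gluon dataset called test_dataset (yielding data/label pairs) and a network net that expects the 5D batched input:

from mxnet.gluon.data import DataLoader

# Option 1: one giant batch - the DataLoader adds the batch axis, giving the 5D input.
full_loader = DataLoader(test_dataset, batch_size=len(test_dataset))

# Option 2: ordinary batches with parallel data loading, usually the better trade-off.
batch_loader = DataLoader(test_dataset, batch_size=64, num_workers=4)

for data, label in batch_loader:
    out = net(data)  # predictions for this batch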
I have some embedding_vectors and I need to compute the following new_embeddings:
new_embeddings = tf.nn.embedding_lookup_sparse(
    params=embedding_vectors,
    sp_ids=some_ids,
    sp_weights=None,
)
The problem is that some_ids is really big and remarkably sparse, but it is constant for a given 2-D data tensor. My pipeline computes its indices, values and shape, which I feed directly into the some_ids sparse_placeholder in the training loop.
Unfortunately it is very slow. It seems that in every training step some_ids is converted to a dense tensor, which seems really unnecessary and strange. Am I right about this conversion, and is there any alternative to embedding_lookup_sparse?
I find tf.sparse_tensor_dense_matmul() is much faster than tf.nn.embedding_lookup_sparse().
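Roughly, the lookup with a sum combiner can be expressed as one sparse-dense matmul against a [batch, vocab] selection matrix. This is only a sketch with made-up shapes and TF1-style session code; the exact equivalence depends on the combiner and weights you need.

import numpy as np
import tensorflow as tf  # TF1-style graph API, matching the question

vocab_size, emb_dim, batch_size = 5, 3, 2
embedding_vectors = tf.constant(np.random.rand(vocab_size, emb_dim), dtype=tf.float32)

# Row i of this sparse [batch, vocab] matrix selects the ids of example i;
# values of 1.0 reproduce a plain "sum" combiner (use other values for weighting).
selection = tf.SparseTensor(
    indices=[[0, 1], [0, 3], [1, 4]],
    values=[1.0, 1.0, 1.0],
    dense_shape=[batch_size, vocab_size],
)

# One sparse-dense matmul instead of the per-step lookup.
new_embeddings = tf.sparse_tensor_dense_matmul(selection, embedding_vectors)

with tf.Session() as sess:
    print(sess.run(new_embeddings))  # shape (2, 3)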
I've got a model in Keras that I need to train, but this model invariably blows up my little 8GB of memory and freezes my computer.
I've gone down to training a single sample at a time (batch size = 1) and it still blows up.
Please assume my model has no mistakes or bugs and this question is not about "what is wrong with my model". (Yes, smaller models work ok with the same data, but aren't good enough for the task).
How can I split my model in two and train each part separately, but propagating the gradients between them?
Is this possible? (There is no restriction on using Theano or TensorFlow.)
Using CPU only, no GPU.
You can do this, but it will push your training time toward lengths that make the results useful only to future generations.
Let's consider what we have in memory when training with a batch size of 1 (assuming you've only read that one sample into memory):
1) that sample
2) the weights of your model
3) the activations of each layer # your model stores these for backpropagation
All of this is needed for training. However, you could, theoretically, do a forward pass on the first half of the model, dump its weights and activations to disk, load the second half of the model, do a forward pass and then a backward pass on it, dump those weights and activations to disk, load back the weights and activations of the first half, and complete the backward pass on it. This process could be split up further, down to one layer at a time.
OTOH, this is akin to what swap space does, without you having to think about it. If you want a slightly less optimized version of this (and optimization is clearly moot at this point), you can just increase your swap space to 500GB and call it a day.
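For what it's worth, the gradient hand-off between the two halves looks like the following in-memory sketch. It is written with PyTorch autograd purely to illustrate the staging; the same idea applies with Theano or TensorFlow gradients, plus the disk round-trips described above.

import torch

# Hypothetical two halves of a network; in practice each half would be
# loaded from / saved to disk between the stages described above.
half_a = torch.nn.Linear(100, 50)
half_b = torch.nn.Linear(50, 10)

x = torch.randn(8, 100)
target = torch.randint(0, 10, (8,))

# Stage 1: forward through the first half, then cut the graph.
h = half_a(x)
h_detached = h.detach().requires_grad_(True)   # this is what you would dump/reload

# Stage 2: forward + backward through the second half only.
loss = torch.nn.functional.cross_entropy(half_b(h_detached), target)
loss.backward()                                # fills half_b grads and h_detached.grad

# Stage 3: propagate the saved gradient back through the first half.
h.backward(h_detached.grad)                    # fills half_a grads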