TensorFlow embedding_lookup_sparse optimization

I have some embedding_vectors and I need to use the following new_embeddings:
new_embeddings = tf.nn.embedding_lookup_sparse(
    params=embedding_vectors,
    sp_ids=some_ids,
    sp_weights=None,
)
The problem is that some_ids is a really big and remarkably sparse 2-D tensor, although it is constant for the given data. My pipeline evaluates its indices, values and shape, which I feed directly to the sparse_placeholder for some_ids in the training loop.
Unfortunately it is very slow. It seems that in every training step some_ids is converted to a dense tensor, which seems unnecessary and strange. Am I right about this conversion, and is there any alternative to embedding_lookup_sparse?

I found that tf.sparse_tensor_dense_matmul() is much faster than tf.nn.embedding_lookup_sparse().
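The equivalence behind that observation can be sketched in NumPy rather than TensorFlow, so the snippet runs without a session. It assumes the lookup uses a sum combiner and no weights; the names ids_per_row and embedding_vectors and the toy sizes are made up for illustration. Under those assumptions, gathering and summing the embedding rows for each example gives the same result as multiplying a 0/1 indicator matrix by the embedding matrix, which is the kind of product tf.sparse_tensor_dense_matmul computes from a sparse operand.

```python
import numpy as np

rng = np.random.default_rng(0)
embedding_vectors = rng.normal(size=(6, 3))   # 6 ids, embedding size 3 (toy sizes)
ids_per_row = [[0, 2], [5], [1, 3, 4]]        # hypothetical ids for 3 examples

# Lookup-style result: gather the rows for each example and sum them.
gathered = np.stack([embedding_vectors[ids].sum(axis=0) for ids in ids_per_row])

# Matmul-style result: build the (examples x ids) indicator matrix once,
# then multiply it by the embedding matrix.
indicator = np.zeros((len(ids_per_row), embedding_vectors.shape[0]))
for row, ids in enumerate(ids_per_row):
    indicator[row, ids] = 1.0
product = indicator @ embedding_vectors

print(np.allclose(gathered, product))
```

A mean combiner would additionally need each indicator row divided by its number of ids. Because some_ids is constant in the question, the sparse matrix would only have to be built once.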


Most efficient way to make predictions on large data?

I am trying to write a prediction function that predicts on large text data (so it has to work batch by batch). But the predict function is a bit slow, so I wonder what I can do to speed it up.
What I have so far:
import torch
from tqdm import tqdm

def get_embeddings(model, data_loader, device):
    model.eval()  # eval mode
    with torch.no_grad():
        embeddings = torch.tensor([], dtype=torch.float64, device=device)  # initialize empty tensor
        for _, d in enumerate(tqdm(data_loader), 0):
            # model inputs
            input_ids = d["input_ids"].to(device)
            attention_mask = d["attention_mask"].to(device)
            # model outputs
            embeddings = torch.cat((embeddings, model.predict(
                input_ids,
                attention_mask=attention_mask
            )))  # concat predictions
    return embeddings.cpu().numpy()  # convert to numpy array
I think converting from GPU to CPU takes time, so I decided to initialize an empty tensor first and concatenate all the predictions, then convert the result to a numpy array at the very end. But I am unsure whether the tensor concatenation is actually slower.
So I am wondering whether there are any best practices when it comes to prediction.
A few things caught my attention in your code:
You are using float64. GPUs are considerably slower with it than float32 and float16.
You might be forcing a conversion from float32 to float64, which can be slow too.
You are copying the attention mask to the GPU for every input. Instead, create and/or keep it there.
Make sure all your operations are done in batches to use the full parallelism of the GPU.
Concatenating is probably useless: GPU->CPU transfers are mostly bandwidth-limited, so it doesn't matter much if it's one big tensor or many small ones (up to a point). Secondly, allocations (caused by cat) can be slow.
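One way to avoid the repeated allocations mentioned in the last point is to keep each batch's output in a Python list and join everything once at the end. The sketch below uses NumPy arrays and a made-up fake_predict function in place of the real model and data loader, so it only illustrates the accumulation pattern, not GPU behaviour.

```python
import numpy as np

def fake_predict(batch):
    # Hypothetical stand-in for model.predict: one 4-dim float32 vector per input.
    return np.tile(batch.sum(axis=1, keepdims=True), (1, 4)).astype(np.float32)

def get_embeddings(batches):
    outputs = []                      # one array per batch, nothing is copied yet
    for batch in batches:
        outputs.append(fake_predict(batch))
    return np.concatenate(outputs)    # a single allocation and copy at the end

batches = [np.ones((8, 5)), np.ones((8, 5)), np.ones((3, 5))]  # last batch is smaller
embeddings = get_embeddings(batches)
print(embeddings.shape, embeddings.dtype)
```

With PyTorch the same pattern would be a list of tensors joined by a single torch.cat call after the loop.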

What are symbolic tensors, and why do they throw "use `steps_per_epoch` argument" error?

Note: I already solved my issue, but I'm posting the question in case others have it too and because I don't understand how I solved it.
I was building a Named Entity Classifier (sequence labelling model) in Keras with Tensorflow backend. When I tried to fit the model, I got this error (which, amazingly, returns only 4 Google results):
"If your data is in the form of symbolic tensors, you should specify the `steps_per_epoch` argument (instead of the batch_size argument, because symbolic tensors are expected to produce batches of input data)."
This stackoverflow post discussed the issue, and someone suggested to the op:
one of your data tensors that is being used by Fit() is a symbolic tensor. The one hot label function returns a symbolic tensor. Try something like:
label_onehot = tf.Session().run(K.one_hot(label, 5))
Then I read on this (not related) site:
The Wolfram System also has powerful algorithms to manipulate algebraic combinations of expressions representing [...] arrays. These expressions are called symbolic arrays or symbolic tensors.
These two sources made me think symbolic arrays (at least in TensorFlow) might be something more like arrays of functions that are yet to be evaluated, rather than actual values.
So, using %whos to view all my variables, I saw that my X and Y data were tensors (rather than arrays, like I normally use for my models). The data/info column had quite a complicated description for them, but I lost it once I solved my issue and I can't work out how to get back to the state where I was getting the error.
In any case, I know I solved the problem by changing my data pre-processing so that the X and y data (i.e. X_train and y_train) were of type <class 'numpy.ndarray'> and of dimensions (num sents, max len) for X_train and (num_sents, max len, 1) for y_train (the 1 is necessary because my final layer expects 3D input). Now the model works fine. But I'm still wondering, what are these symbolic tensors and how/why is using steps per epoch instead of batch size supposed to help? I tried that too initially but had no luck.
This can be solved by using the eval() or numpy() function of your tensors.
Check:
How can I convert a tensor into a numpy array in TensorFlow?

Convert Keras model output into sparse matrix without forloop

I have a pretrained Keras model whose output has dimension [n, 4000] (it classifies into 4000 classes).
I need to make a prediction on the test data (300K observations).
But when I call model.predict(X_train), I get an out-of-memory error, because I don't have enough RAM to store a matrix of shape (300K, 4000).
Therefore, it would be logical to convert the model output to a sparse matrix.
But wrapping the predict call in the scipy function sparse.csr_matrix does not work (sparse.csr_matrix(model.predict(X_train))), because it first allocates space in RAM for the prediction and only then converts it into a sparse matrix.
I can also make a prediction on a specific batch of test data and then convert it in a for loop.
But it seems to me that this is not optimal and is very resource-consuming.
Please advise: are there any other methods for converting the model output into a sparse matrix?
Isn't there a batch_size parameter in predict()?
If I understand correctly, n is the number of samples, right?
Assume that your system RAM is enough to hold the entire data but the VRAM is not.
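If system RAM is the limit, as in the question, predicting in chunks and converting each chunk before moving on keeps only one dense block in memory at a time. The sketch below assumes that small scores can be dropped (the threshold value is an arbitrary choice, since raw class scores are rarely exactly zero) and uses a made-up fake_predict in place of model.predict. The sizes are shrunk so it runs quickly.

```python
import numpy as np
from scipy import sparse

def fake_predict(x, n_classes=50):
    # Hypothetical stand-in for model.predict: softmax over random-projection scores.
    rng = np.random.default_rng(0)
    logits = x @ rng.normal(size=(x.shape[1], n_classes)) * 5.0
    logits -= logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)

def predict_sparse(x, chunk_size=100, threshold=0.01):
    chunks = []
    for start in range(0, x.shape[0], chunk_size):
        dense = fake_predict(x[start:start + chunk_size])  # only one chunk is dense
        dense[dense < threshold] = 0.0                      # assumption: tiny scores are not needed
        chunks.append(sparse.csr_matrix(dense))
    return sparse.vstack(chunks, format="csr")

X = np.random.default_rng(1).normal(size=(250, 8))
result = predict_sparse(X)
print(result.shape, result.nnz)
```

The loop runs once per chunk rather than once per row, so its overhead is likely small next to the prediction itself.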

Tensor Flow passing a tensor to optimizer minimize function trains better

I am encountering something a bit strange (to me) in tensorflow and was hoping someone could shed some light on the situation.
I have a simple neural network that processes images. The cost function I am minimizing is the simple MSE.
At first I implemented the following:
cost = tf.square(DECONV - Y)
which I then passed to my optimizer as follows:
optimizer = tf.train.RMSPropOptimizer(learning_rate).minimize(cost)
I was able to obtain great results with this implementation. However, as I tried to implement a regularizer, I realized that I wasn't passing a scalar value to the optimizer.minimize() but in fact passing a tensor of shape [batch, dim_x, dim_y].
I changed my implementation to the following:
cost = tf.losses.mean_squared_error(Y, DECONV)
as well as many variations of this like:
cost = tf.reduce_mean(tf.square(tf.subtract(DECONV, Y)))
etc.
My issue is that with these new implementations of the MSE I am not able to even come close to the results I obtained using the original "wrong" implementation.
Is the original way a valid way to train? If so, how can I implement regularizers? If not, what am I doing wrong with the new implementations? Why can't I replicate the results?
Can you clarify what you mean by
I was able to achieve greater result [..]
I assume that you have another metric than cost - this time an actual scalar, which enables you to compare the models trained by each method.
Also, have you tried adjusting the learning rate for the second method? I ask because my intuition is that when you ask tensorflow to minimize a tensor (which has no mathematical meaning as far as I know), it minimizes the scalar obtained by summing over all the axes of the tensor. This is how tf.gradients works, and it is the reason why I think this is happening. So if you multiply the learning rate by batch*dim_x*dim_y in the second method, you might get the same behavior as in the first method.
Even if this works, I don't think passing a tensor to the minimize function is a good idea - minimization of a d-dimensional value has no meaning as you have no order rule in such spaces.
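The scaling argument can be checked numerically. For squared error, the gradient of the summed loss with respect to each prediction is 2(d - y), while the gradient of the mean is the same quantity divided by the number of elements. The NumPy sketch below uses made-up arrays deconv and y of shape [batch, dim_x, dim_y] and confirms one entry of the analytic gradient with a finite-difference estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
deconv = rng.normal(size=(4, 3, 3))   # stand-in for the network output
y = rng.normal(size=(4, 3, 3))        # stand-in for the targets

# Analytic gradients of the two scalar losses with respect to deconv.
grad_sum = 2.0 * (deconv - y)                  # gradient of sum((deconv - y)**2)
grad_mean = 2.0 * (deconv - y) / deconv.size   # gradient of mean((deconv - y)**2)

# Finite-difference check of one entry of the summed-loss gradient.
eps = 1e-6
bumped = deconv.copy()
bumped[0, 0, 0] += eps
numeric = (np.sum((bumped - y) ** 2) - np.sum((deconv - y) ** 2)) / eps

print(np.allclose(grad_sum, grad_mean * deconv.size))
print(abs(numeric - grad_sum[0, 0, 0]) < 1e-4)
```

The two gradients differ only by the constant factor batch*dim_x*dim_y (36 in this toy case), which is the factor the learning-rate suggestion tries to compensate for.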

Tensorflow flatten vs numpy flatten function effect on machine learning training

I am getting started with deep learning using Keras and TensorFlow, and I am stuck on a doubt at the very first stage. I use tf.contrib.layers.flatten (API 1.8) to flatten an image (which could be multichannel as well).
How is this different from using the flatten function from numpy?
How does this affect the training? I can see that tf.contrib.layers.flatten takes longer than numpy flatten. Is it doing something more?
This is a very close question but here the accepted answer includes Theano and does not solve my doubts exactly.
Example:
Let's say I have training data of shape (10000,2,96,96). Now I need the output to have shape (10000,18432). I can do this using tensorflow flatten or by using numpy reshape, like
X_reshaped = X_train.reshape(*X_train.shape[:1], -1)
What difference does it make in training, and which is the best practice?
The biggest difference between numpy's flatten (ndarray.flatten) and tf.layers.flatten (or tf.contrib.layers.flatten) is that numpy operations are applicable only to static nd arrays, while tensorflow operations can work with dynamic tensors. Dynamic in this case means that the exact shape will be known only at runtime (either training or testing).
So my recommendation is pretty simple:
If the input data is a static numpy array, e.g. in pre-processing, use numpy's flatten. This avoids unnecessary overhead and returns a numpy array as well.
If the data is already a tensor, use any of the flatten ops provided by tensorflow. Between those, tf.layers.flatten is better choice since tf.layers API is more stable than tf.contrib.*.
Use numpy directly on your data, without participation of a neural network. This is for preprocessing and postprocessing only
Use TF or Keras layers inside models if this operation is needed for some reason in the model. This will assure model connectivity and proper backpropagation
Models are symbolic graphs meant to create Neural Networks that can be trained. There will be a proper connection and backpropagation will work properly when you have a graph connected from input to output.
If you don't intend to create a network, don't use a TF layer. If your goal is just to flatten an array, you don't need a neural network.
Now if inside a model you need to change the format of the data without losing connection and backpropagation, then go for the flatten layer.
The flatten function in numpy does a complete array flattening, meaning that you end up with a single axis of data (1 dimension only).
For example,
import numpy as np
a = np.arange(20).reshape((5,4))
print(a)
print(a.flatten().shape)
In the previous example, you end up with a 1-d array of 20 elements.
In tensorflow, the flatten layer (tf.layers.flatten) preserves the batch axis (axis 0). In the previous example, with tensorflow, you would still have a shape of (5,4).
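The difference is easier to see on a batch with more than two axes. The NumPy sketch below contrasts a full flatten with a reshape that keeps axis 0, which mirrors the shape behaviour described above for the flatten layer. The array batch is a small made-up stand-in for the (10000, 2, 96, 96) training data from the question.

```python
import numpy as np

batch = np.zeros((10, 2, 96, 96))               # 10 samples instead of 10000

fully_flat = batch.flatten()                    # collapses every axis, including the batch axis
per_sample = batch.reshape(batch.shape[0], -1)  # keeps axis 0, flattens the rest

print(fully_flat.shape)   # (184320,)
print(per_sample.shape)   # (10, 18432)
```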
In any case, there is no effect on training if you use flatten in an equivalent way. However, you should avoid using numpy when working with tensorflow, since almost all numpy operations have their tensorflow counterparts. Tensorflow and numpy rely on different runtime libraries and combining both could be runtime inefficient.
Moreover, avoid using contrib package layers, when they already exist in the main package (use tf.layers.flatten instead of tf.contrib.layers.flatten).
For a more general performance comparison between numpy and tensorflow, have a look at this question: Tensorflow vs. Numpy Performance
Difference
When you use tensorflow flatten, it gets added as an operation (op) in the graph. It can operate only on tensors. Numpy on the other hand works on actual numpy arrays. The usage is completely different.
Usage
You would use tensorflow op if this is an operation in the training process such as resizing before feeding to the next layer.
You would use numpy op when you want to operate on actual value at that time, like reshaping for calculating accuracy at the end of training step.
So if you had a task of
tensor A -> reshape -> matrix_mul
If you use tensorflow for reshape, you can directly run the matrix_mul from session.
If you use numpy however, you'd have to run the operation in two stages (two session calls).
You calculate tensor A
You reshape it in numpy.
Run the matrix_mul by "feeding" in reshaped array.
Performance
I haven't benchmarked anything, but I'd say that for a standalone reshape operation numpy would be faster (ignoring GPU), while in a process where reshape is an intermediate op, tensorflow should be faster.
