RuntimeError: stack expects a non-empty TensorList - python

I am trying to create embeddings to use for a word-matching technique, but I get the following error:
Traceback (most recent call last)
/var/folders/k1/jt1nfyks4cx689d50f5mtg0w0000gp/T/ipykernel_1349/3490519318.py in <module>
53 #Compute embedding for both lists
54
---> 55 embeddings1 = model.encode(fifteen_percent_list, convert_to_tensor=True)
56
57
/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/sentence_transformers/SentenceTransformer.py in encode(self, sentences, batch_size, show_progress_bar, output_value, convert_to_numpy, convert_to_tensor, device, normalize_embeddings)
185
186 if convert_to_tensor:
--> 187 all_embeddings = torch.stack(all_embeddings)
188 elif convert_to_numpy:
189 all_embeddings = np.asarray([emb.numpy() for emb in all_embeddings])
RuntimeError: stack expects a non-empty TensorList
I do not understand why this happens, since my second embedding (embeddings2) goes through just fine without any errors.
Here is some of the code if that helps:
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('distilbert-base-nli-stsb-mean-tokens')
fifteen_percent_list = list(fifteen_percent)
#Compute embedding for both lists
embeddings1 = model.encode(fifteen_percent_list, convert_to_tensor=True)
# try on a smaller set of 10k, as it takes too long to run on full set of queries
rest_of_queries_list = list(set(rest_of_queries))[:10000]
embeddings2 = model.encode(rest_of_queries_list, convert_to_tensor=True)
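torch.stack raises "stack expects a non-empty TensorList" when it is handed an empty list, so the most likely cause is that fifteen_percent_list ends up empty while rest_of_queries_list does not. A minimal check, reusing the variables from the snippet above (a sketch, not a confirmed fix):
# model.encode builds one tensor per input sentence and then stacks them,
# so an empty input list leads straight to torch.stack([]) and this RuntimeError.
print(len(fifteen_percent_list), len(rest_of_queries_list))
if not fifteen_percent_list:
    raise ValueError("fifteen_percent_list is empty - nothing to encode")
embeddings1 = model.encode(fifteen_percent_list, convert_to_tensor=True)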

Related

How to link netcdf as shared libraries during conda installation of esmpy?

I installed esmpy using conda install -c conda-forge esmpy but am unable to get it to create a mesh from an existing Netcdf file using something like this:
mesh = ESMF.Mesh(filename='myfile.nc', filetype=ESMF.FileFormat.ESMFMESH)
My input file is the output from the CAM-SE global atmospheric model at ne120 resolution. This model returns unstructured output. I get an error message saying that Netcdf should be built as a shared library. I know the Netcdf libraries exist because I use xarray all the time to read and process them. But how does one link those libraries during the installation step for esmpy using conda? Do I need to build esmpy from source to be able to do this?
Update: I have added the full traceback below. I installed netcdf from conda-forge within the conda environment I was using, and it appears that was not the source of the error, since the error message remains unchanged. I am now wondering whether my mesh-creation call can work at all when fed a CAM-SE output file directly: the file does not really carry any information about the number of elements, etc. Also, should I rename the dimensions to some common form expected by ESMF? How will ESMF know which dimension represents the number of nodes, and so on? Here is the list of dimensions from one of the output files (followed by the traceback):
dimensions:
time = UNLIMITED ; // (292 currently)
lev = 30 ;
ilev = 31 ;
ncol = 777602 ;
nbnd = 2 ;
chars = 8 ;
string1 = 1 ;
----------Traceback------------------------
ArgumentError Traceback (most recent call last)
Input In [13], in <cell line: 5>()
----> 5 grid = ESMF.Mesh(filename='B1850.ne120_t12.cam.h2.0338-01-01-21600.nc',
filetype=ESMF.FileFormat.ESMFMESH)
File /software/conda/envs/dask_Jul23_2022/lib/python3.10/site-packages/ESMF/util/decorators.py:81, in initialize.<locals>.new_func(*args, **kwargs)
78 from ESMF.api import esmpymanager
80 esmp = esmpymanager.Manager(debug = False)
---> 81 return func(*args, **kwargs)
File /software/conda/envs/dask_Jul23_2022/lib/python3.10/site-packages/ESMF/api/mesh.py:198, in Mesh.__init__(self, parametric_dim, spatial_dim, coord_sys, filename, filetype, convert_to_dual, add_user_area, meshname, mask_flag, varname)
195 self._coord_sys = coord_sys
196 else:
197 # call into ctypes layer
--> 198 self._struct = ESMP_MeshCreateFromFile(filename, filetype,
199 convert_to_dual,
200 add_user_area, meshname,
201 mask_flag, varname)
202 # get the sizes
203 self._size[node] = ESMP_MeshGetLocalNodeCount(self)
File /software/conda/envs/dask_Jul23_2022/lib/python3.10/site-packages/ESMF/util/decorators.py:93, in netcdf.<locals>.new_func(*args, **kwargs)
90 from ESMF.api.constants import _ESMF_NETCDF
92 if _ESMF_NETCDF:
---> 93 return func(*args, **kwargs)
94 else:
95 raise NetCDFMissing("This function requires ESMF to have been built with NetCDF.")
File /software/conda/envs/dask_Jul23_2022/lib/python3.10/site-packages/ESMF/interface/cbindings.py:1218, in ESMP_MeshCreateFromFile(filename, fileTypeFlag, convertToDual, addUserArea, meshname, maskFlag, varname)
1197 """
1198 Preconditions: ESMP has been initialized.\n
1199 Postconditions: An ESMP_Mesh has been created.\n
(...)
1215 string (optional) :: varname\n
1216 """
1217 lrc = ct.c_int(0)
-> 1218 mesh = _ESMF.ESMC_MeshCreateFromFile(filename, fileTypeFlag,
1219 convertToDual, addUserArea,
1220 meshname, maskFlag, varname,
1221 ct.byref(lrc))
1222 rc = lrc.value
1223 if rc != constants._ESMP_SUCCESS:
ArgumentError: argument 1: <class 'AttributeError'>: 'list' object has no attribute 'encode'
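The final ArgumentError suggests that the first argument to ESMC_MeshCreateFromFile (the filename) reached the ctypes layer as a list rather than an encodable string, and the decorator shown in the traceback only calls into this path when ESMF reports it was built with NetCDF. Two quick checks along those lines (a sketch, not a confirmed fix):
import ESMF
from ESMF.api.constants import _ESMF_NETCDF
# The decorator in ESMF/util/decorators.py checks this flag before calling into
# the NetCDF code path, so it tells you whether this conda build has NetCDF support.
print(_ESMF_NETCDF)
# filename must be a single string, not a list of strings
mesh = ESMF.Mesh(filename='B1850.ne120_t12.cam.h2.0338-01-01-21600.nc',
                 filetype=ESMF.FileFormat.ESMFMESH)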

How to write a proper dataset_fn in tff.simulation.FilePerUserClientData?

I'm currently implementing federated learning using tff.
Because the dataset is very large, we split it into many npy files, and I'm currently putting the dataset together using tff.simulation.FilePerUserClientData.
This is what I'm trying to do
client_ids_to_files = dict()
for i in range(len(train_filepaths)):
    client_ids_to_files[str(i)] = train_filepaths[i]

def dataset_fn(filepath):
    print(filepath)
    dataSample = np.load(filepath)
    label = filepath[:-4].strip().split('_')[-1]
    return tf.data.Dataset.from_tensor_slices((dataSample, label))

train_filePerClient = tff.simulation.FilePerUserClientData(client_ids_to_files, dataset_fn)
However, it doesn't seem to work: the filepath passed to the callback function is a tensor with dtype string. Its value is: Tensor("hash_table_Lookup/LookupTableFindV2:0", shape=(), dtype=string)
Instead of containing a path from client_ids_to_files, the tensor seems to contain something else entirely. Am I doing something wrong? How can I write a proper dataset_fn for tff.simulation.FilePerUserClientData using npy files?
EDIT:
Here is the error log. The error itself is not really related to the question I'm asking, but it shows which functions are being called:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-46-e61ddbe06cdb> in <module>
22 return tf.data.Dataset.from_tensor_slices(filepath)
23
---> 24 train_filePerClient = tff.simulation.FilePerUserClientData(client_ids_to_files,dataset_fn)
25
~/fasttext-venv/lib/python3.6/site-packages/tensorflow_federated/python/simulation/file_per_user_client_data.py in __init__(self, client_ids_to_files, dataset_fn)
52 return dataset_fn(client_ids_to_files[client_id])
53
---> 54 @computations.tf_computation(tf.string)
55 def dataset_computation(client_id):
56 client_ids_to_path = tf.lookup.StaticHashTable(
~/fasttext-venv/lib/python3.6/site-packages/tensorflow_federated/python/core/impl/wrappers/computation_wrapper.py in __call__(self, tff_internal_types, *args)
405 parameter_type)
406 args, kwargs = unpack_arguments_fn(next(wrapped_fn_generator))
--> 407 result = fn_to_wrap(*args, **kwargs)
408 if result is None:
409 raise ComputationReturnedNoneError(fn_to_wrap)
~/fasttext-venv/lib/python3.6/site-packages/tensorflow_federated/python/simulation/file_per_user_client_data.py in dataset_computation(client_id)
59 list(client_ids_to_files.values())), '')
60 client_path = client_ids_to_path.lookup(client_id)
---> 61 return dataset_fn(client_path)
62
63 self._create_tf_dataset_fn = create_dataset_for_filename_fn
<ipython-input-46-e61ddbe06cdb> in dataset_fn(filepath)
17 filepath = tf.print(filepath)
18 print(filepath)
---> 19 dataSample = np.load(filepath)
20 print(dataSample)
21 label = filepath[:-4].strip().split('_')[-1]
~/fasttext-venv/lib/python3.6/site-packages/numpy/lib/npyio.py in load(file, mmap_mode, allow_pickle, fix_imports, encoding)
426 own_fid = False
427 else:
--> 428 fid = open(os_fspath(file), "rb")
429 own_fid = True
430
TypeError: expected str, bytes or os.PathLike object, not Operation
The problem is that dataset_fn must be serializable as a tf.Graph. This is required because TFF uses TensorFlow graphs to execute logic on remote machines.
In this case, np.load is not serializable to a graph operation. It looks like numpy is used to load from disk into memory, and then tf.data.Dataset.from_tensor_slices is used to create a dataset from an in-memory object. It may be possible to save the files in a different format and use a native tf.data.Dataset operation to load from disk, rather than using Python. Some options could be tf.data.TFRecordDataset, tf.data.TextLineDataset, or tf.data.experimental.SqlDataset.
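A hedged sketch of the TFRecord route, assuming each npy file holds a 2-D float array and the label is encoded in the filename as in the question (FEATURE_DIM, the one-off conversion step, and client_ids_to_tfrecord_files are illustrative, not part of the original code):
import numpy as np
import tensorflow as tf
import tensorflow_federated as tff

# One-off conversion in plain Python, outside any TF graph: rewrite each client's
# npy file as a TFRecord of tf.train.Example records.
def convert_npy_to_tfrecord(npy_path, tfrecord_path):
    data = np.load(npy_path)
    label = int(npy_path[:-4].strip().split('_')[-1])
    with tf.io.TFRecordWriter(tfrecord_path) as writer:
        for row in data:
            example = tf.train.Example(features=tf.train.Features(feature={
                'x': tf.train.Feature(float_list=tf.train.FloatList(value=row.ravel())),
                'y': tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
            }))
            writer.write(example.SerializeToString())

FEATURE_DIM = 128  # assumed width of each sample

# dataset_fn built only from tf.data / tf.io ops, so TFF can trace it into a graph.
def dataset_fn(tfrecord_path):
    feature_spec = {
        'x': tf.io.FixedLenFeature([FEATURE_DIM], tf.float32),
        'y': tf.io.FixedLenFeature([1], tf.int64),
    }
    def parse(serialized):
        parsed = tf.io.parse_single_example(serialized, feature_spec)
        return parsed['x'], parsed['y']
    return tf.data.TFRecordDataset(tfrecord_path).map(parse)

# client_ids_to_tfrecord_files is assumed to map client ids to the converted TFRecord paths.
train_filePerClient = tff.simulation.FilePerUserClientData(client_ids_to_tfrecord_files, dataset_fn)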

How to fix 'KeyError: dtype('float32')' in LDAvis

I use the LDAvis library to visualize my LDA topics. It worked fine before, but I get this error when I download the saved model files from Sagemaker to my local computer. I don't know why this happens. Does it relate to Sagemaker?
If I train and save the model locally and then run the LDAvis library, it works fine.
KeyError Traceback (most recent call last)
in ()
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pyLDAvis\gensim.py in prepare(topic_model, corpus, dictionary, doc_topic_dist, **kwargs)
116 See pyLDAvis.prepare for **kwargs.
117 """
--> 118 opts = fp.merge(_extract_data(topic_model, corpus, dictionary, doc_topic_dist), kwargs)
119 return vis_prepare(**opts)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pyLDAvis\gensim.py in _extract_data(topic_model, corpus, dictionary, doc_topic_dists)
46 gamma = topic_model.inference(corpus)
47 else:
---> 48 gamma, _ = topic_model.inference(corpus)
49 doc_topic_dists = gamma / gamma.sum(axis=1)[:, None]
50 else:
~\AppData\Local\Continuum\anaconda3\lib\site-packages\gensim\models\ldamodel.py in inference(self, chunk, collect_sstats)
665 # phinorm is the normalizer.
666 # TODO treat zeros explicitly, instead of adding epsilon?
--> 667 eps = DTYPE_TO_EPS[self.dtype]
668 phinorm = np.dot(expElogthetad, expElogbetad) + eps
669
KeyError: dtype('float32')
I know this is late, but I just fixed a similar problem by updating my gensim library from 3.4 to the current version, which for me is 3.8.
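Since the failure only appears with a model trained in one environment (Sagemaker) and loaded in another, a quick first check is to compare the gensim versions and the loaded model's dtype in both places; DTYPE_TO_EPS in the traceback is keyed on that dtype. A minimal sketch, with a hypothetical filename for the downloaded model:
import gensim
from gensim.models import LdaModel

# Run this both on Sagemaker and locally; mismatched versions are a common culprit.
print(gensim.__version__)

lda = LdaModel.load('downloaded_lda.model')  # hypothetical path to the downloaded model
# This attribute is the key looked up in DTYPE_TO_EPS inside ldamodel.inference.
print(getattr(lda, 'dtype', None))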

How to derive weights for bucketized_column in tf.estimator.LinearRegressor in tensorflow?

I am studying the Google Machine Learning Crash Course.
I am having trouble with the "Feature Crosses" chapter.
https://developers.google.com/machine-learning/crash-course/feature-crosses/programming-exercise
I tried to get the weight of cross feature from linear_regressor.
# here I change _ to linear_model
linear_model = train_model(
    learning_rate=1.0,
    steps=500,
    batch_size=100,
    feature_columns=construct_feature_columns(),
    training_examples=training_examples,
    training_targets=training_targets,
    validation_examples=validation_examples,
    validation_targets=validation_targets)

Weight_bucketized_longitude = linear_model.get_variable_value('linear/linear_model/bucketized_longitude/weights')
print(Weight_bucketized_longitude)
However, I got the error message below:
Error Message:
NotFoundError: Key linear/linear_model/bucketized_longitude/weights not found in checkpoint
It looks like the path is wrong.
The path works for numeric_column, but it doesn’t for bucketized_column.
Could you help to indicate the correct path?
Thanks.
EDIT:
I tried Geeocode's method, but I still got an error message.
Weight_bucketized_longitude= linear_model.get_variable_value('linear/linear_model/bucketized_longitude/weights')
AttributeError Traceback (most recent call last)
in ()
----> 1 Weight_bucketized_longitude = linear_model.get_variable_value(["linear", "linear_model", "bucketized_longitude", "weights"])
/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.pyc in get_variable_value(self, name)
252 _check_checkpoint_available(self.model_dir)
253 with context.graph_mode():
--> 254 return training.load_variable(self.model_dir, name)
255
256 def get_variable_names(self):
/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/checkpoint_utils.pyc in load_variable(ckpt_dir_or_file, name)
77 """
78 # TODO(b/29227106): Fix this in the right place and remove this.
---> 79 if name.endswith(":0"):
80 name = name[:-2]
81 reader = load_checkpoint(ckpt_dir_or_file)
AttributeError: 'list' object has no attribute 'endswith'
The problem is that linear_model.get_variable_value() has to be passed a list of strings with the variable's name. From the documentation:
get_variable_value
get_variable_value(name)
Returns value of the variable given by name.
Args: name: string or a list of string, name of the tensor.
Returns: Numpy array - value of the tensor.
Raises: ValueError: If the Estimator has not produced a checkpoint yet.
Thus your code should change as follows:
Weight_bucketized_longitude = linear_model.get_variable_value(["linear", "linear_model", "bucketized_longitude", "weights"])
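Independent of how the name is passed, listing the keys the checkpoint actually contains makes it easy to spot the right one; get_variable_names() is part of the same Estimator API. A short sketch, assuming linear_model is the trained regressor from above:
# List every variable key stored in the checkpoint, then keep the ones
# that belong to the bucketized longitude column.
names = linear_model.get_variable_names()
longitude_keys = [n for n in names if 'bucketized_longitude' in n]
print(longitude_keys)

# Read the weights with whichever key actually shows up in the list above.
if longitude_keys:
    weights = linear_model.get_variable_value(longitude_keys[0])
    print(weights)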

TypeError: ufunc 'add' did not contain a loop with signature matching types

I am creating a bag-of-words representation of each sentence. I then take the words that occur in the sentence and look them up in the file "vectors.txt" to get their embedding vectors. After getting a vector for each word that occurs in the sentence, I take the average of those vectors. This is my code:
import nltk
import numpy as np
from nltk import FreqDist
from nltk.corpus import brown
news = brown.words(categories='news')
news_sents = brown.sents(categories='news')
fdist = FreqDist(w.lower() for w in news)
vocabulary = [word for word, _ in fdist.most_common(10)]
num_sents = len(news_sents)
def averageEmbeddings(sentenceTokens, embeddingLookupTable):
    listOfEmb = []
    for token in sentenceTokens:
        embedding = embeddingLookupTable[token]
        listOfEmb.append(embedding)
    return sum(np.asarray(listOfEmb)) / float(len(listOfEmb))

embeddingVectors = {}
with open("D:\\Embedding\\vectors.txt") as file:
    for line in file:
        (key, *val) = line.split()
        embeddingVectors[key] = val

for i in range(num_sents):
    features = {}
    for word in vocabulary:
        features[word] = int(word in news_sents[i])
    print(features)
    print(list(features.values()))
    sentenceTokens = []
    for key, value in features.items():
        if value == 1:
            sentenceTokens.append(key)
    sentenceTokens.remove(".")
    print(sentenceTokens)
    print(averageEmbeddings(sentenceTokens, embeddingVectors))
    print(features.keys())
Not sure why, but I get this error:
TypeError Traceback (most recent call last)
<ipython-input-4-643ccd012438> in <module>()
39 sentenceTokens.remove(".")
40 print(sentenceTokens)
---> 41 print(averageEmbeddings(sentenceTokens, embeddingVectors))
42
43 print(features.keys())
<ipython-input-4-643ccd012438> in averageEmbeddings(sentenceTokens, embeddingLookupTable)
18 listOfEmb.append(embedding)
19
---> 20 return sum(np.asarray(listOfEmb)) / float(len(listOfEmb))
21
22 embeddingVectors = {}
TypeError: ufunc 'add' did not contain a loop with signature matching types dtype('<U9') dtype('<U9') dtype('<U9')
P.S. The embedding vectors look like this:
the 0.011384 0.010512 -0.008450 -0.007628 0.000360 -0.010121 0.004674 -0.000076
of 0.002954 0.004546 0.005513 -0.004026 0.002296 -0.016979 -0.011469 -0.009159
and 0.004691 -0.012989 -0.003122 0.004786 -0.002907 0.000526 -0.006146 -0.003058
one 0.014722 -0.000810 0.003737 -0.001110 -0.011229 0.001577 -0.007403 -0.005355
in -0.001046 -0.008302 0.010973 0.009608 0.009494 -0.008253 0.001744 0.003263
After using np.sum I get this error:
TypeError Traceback (most recent call last)
<ipython-input-13-8a7edbb9d946> in <module>()
40 sentenceTokens.remove(".")
41 print(sentenceTokens)
---> 42 print(averageEmbeddings(sentenceTokens, embeddingVectors))
43
44 print(features.keys())
<ipython-input-13-8a7edbb9d946> in averageEmbeddings(sentenceTokens, embeddingLookupTable)
18 listOfEmb.append(embedding)
19
---> 20 return np.sum(np.asarray(listOfEmb)) / float(len(listOfEmb))
21
22 embeddingVectors = {}
C:\Anaconda3\lib\site-packages\numpy\core\fromnumeric.py in sum(a, axis, dtype, out, keepdims)
1829 else:
1830 return _methods._sum(a, axis=axis, dtype=dtype,
-> 1831 out=out, keepdims=keepdims)
1832
1833
C:\Anaconda3\lib\site-packages\numpy\core\_methods.py in _sum(a, axis, dtype, out, keepdims)
30
31 def _sum(a, axis=None, dtype=None, out=None, keepdims=False):
---> 32 return umr_sum(a, axis, dtype, out, keepdims)
33
34 def _prod(a, axis=None, dtype=None, out=None, keepdims=False):
TypeError: cannot perform reduce with flexible type
You have a numpy array of strings, not floats. That is what is meant by dtype('<U9'): a little-endian unicode string of up to 9 characters.
try:
return sum(np.asarray(listOfEmb, dtype=float)) / float(len(listOfEmb))
However, note that each entry in listOfEmb is itself a list of string values, so if you want to skip numpy entirely you have to convert and average each dimension yourself:
return [sum(float(x) for x in dims) / len(listOfEmb) for dims in zip(*listOfEmb)]
Or, if you're set on using numpy, average along the first axis so you get one value per embedding dimension:
return np.asarray(listOfEmb, dtype=float).mean(axis=0)
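To see why the original call fails, note that the values read from vectors.txt are strings, so the array numpy builds has a string dtype; converting at construction time gives the element-wise average the question is after. A small check with two made-up word vectors:
import numpy as np

# Strings parsed from vectors.txt produce a '<U…' (unicode string) array,
# which numpy cannot sum or reduce.
listOfEmb = [['0.011384', '0.010512'], ['0.002954', '0.004546']]
print(np.asarray(listOfEmb).dtype)          # dtype('<U8')

# Converting to float first gives a numeric array, and mean(axis=0) averages
# dimension by dimension, which is the intended sentence embedding.
avg = np.asarray(listOfEmb, dtype=float).mean(axis=0)
print(avg)                                  # [0.007169 0.007529]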
