I'm trying to download the MNIST dataset from OpenML using the openml library.
I'm working in a Jupyter notebook because I don't want to re-download the same dataset every time.
The problem is that after running the following code, I get an error:
from openml.datasets import get_dataset

mnist = get_dataset(554)
x, y, p, q = mnist.get_data(
    dataset_format="dataframe", target=mnist.default_target_attribute
)
I'm pasting the whole error message below; the problem occurs when I try to assign the result of .get_data to x, y, p and q.
The Anaconda environment I'm running this in is called Oceanic.
---------------------------------------------------------------------------
OSError Traceback (most recent call last)
File ~\anaconda3\envs\Oceanic\lib\site-packages\openml\datasets\dataset.py:491, in OpenMLDataset._cache_compressed_file_from_file(self, data_file)
490 try:
--> 491 data = pd.read_parquet(data_file)
492 except Exception as e:
File ~\anaconda3\envs\Oceanic\lib\site-packages\pandas\io\parquet.py:493, in read_parquet(path, engine, columns, storage_options, use_nullable_dtypes, **kwargs)
491 impl = get_engine(engine)
--> 493 return impl.read(
494 path,
495 columns=columns,
496 storage_options=storage_options,
497 use_nullable_dtypes=use_nullable_dtypes,
498 **kwargs,
499 )
File ~\anaconda3\envs\Oceanic\lib\site-packages\pandas\io\parquet.py:240, in PyArrowImpl.read(self, path, columns, use_nullable_dtypes, storage_options, **kwargs)
239 try:
--> 240 result = self.api.parquet.read_table(
241 path_or_handle, columns=columns, **kwargs
242 ).to_pandas(**to_pandas_kwargs)
243 if manager == "array":
File ~\anaconda3\envs\Oceanic\lib\site-packages\pyarrow\parquet.py:1731, in read_table(source, columns, use_threads, metadata, use_pandas_metadata, memory_map, read_dictionary, filesystem, filters, buffer_size, partitioning, use_legacy_dataset, ignore_prefixes)
1727 dataset = ParquetFile(
1728 source, metadata=metadata, read_dictionary=read_dictionary,
1729 memory_map=memory_map, buffer_size=buffer_size)
-> 1731 return dataset.read(columns=columns, use_threads=use_threads,
1732 use_pandas_metadata=use_pandas_metadata)
1734 if ignore_prefixes is not None:
File ~\anaconda3\envs\Oceanic\lib\site-packages\pyarrow\parquet.py:1608, in _ParquetDatasetV2.read(self, columns, use_threads, use_pandas_metadata)
1606 use_threads = False
-> 1608 table = self._dataset.to_table(
1609 columns=columns, filter=self._filter_expression,
1610 use_threads=use_threads
1611 )
1613 # if use_pandas_metadata, restore the pandas metadata (which gets
1614 # lost if doing a specific `columns` selection in to_table)
File ~\anaconda3\envs\Oceanic\lib\site-packages\pyarrow\_dataset.pyx:458, in pyarrow._dataset.Dataset.to_table()
File ~\anaconda3\envs\Oceanic\lib\site-packages\pyarrow\_dataset.pyx:2889, in pyarrow._dataset.Scanner.to_table()
File ~\anaconda3\envs\Oceanic\lib\site-packages\pyarrow\error.pxi:141, in pyarrow.lib.pyarrow_internal_check_status()
File ~\anaconda3\envs\Oceanic\lib\site-packages\pyarrow\error.pxi:112, in pyarrow.lib.check_status()
OSError: NotImplemented: Support for codec 'snappy' not built
The above exception was the direct cause of the following exception:
Exception Traceback (most recent call last)
Input In [10], in <cell line: 1>()
----> 1 x, y, p, q = mnist.get_data(
2 dataset_format="dataframe", target=mnist.default_target_attribute
3 )
File ~\anaconda3\envs\Oceanic\lib\site-packages\openml\datasets\dataset.py:698, in OpenMLDataset.get_data(self, target, include_row_id, include_ignore_attribute, dataset_format)
658 def get_data(
659 self,
660 target: Optional[Union[List[str], str]] = None,
(...)
668 List[str],
669 ]:
670 """ Returns dataset content as dataframes or sparse matrices.
671
672 Parameters
(...)
696 List of attribute names.
697 """
--> 698 data, categorical, attribute_names = self._load_data()
700 to_exclude = []
701 if not include_row_id and self.row_id_attribute is not None:
File ~\anaconda3\envs\Oceanic\lib\site-packages\openml\datasets\dataset.py:531, in OpenMLDataset._load_data(self)
528 self._download_data()
530 file_to_load = self.data_file if self.parquet_file is None else self.parquet_file
--> 531 return self._cache_compressed_file_from_file(file_to_load)
533 # helper variable to help identify where errors occur
534 fpath = self.data_feather_file if self.cache_format == "feather" else self.data_pickle_file
File ~\anaconda3\envs\Oceanic\lib\site-packages\openml\datasets\dataset.py:493, in OpenMLDataset._cache_compressed_file_from_file(self, data_file)
491 data = pd.read_parquet(data_file)
492 except Exception as e:
--> 493 raise Exception(f"File: {data_file}") from e
495 categorical = [data[c].dtype.name == "category" for c in data.columns]
496 attribute_names = list(data.columns)
Exception: File: C:\Users\Irving\.openml\org\openml\www\datasets\554\dataset.pq
Now, I've run the same code in PyCharm and it works just fine: the dataframes are assigned correctly and I can display them. I've got no idea why this isn't working in Jupyter, and I would like to know why, because I would prefer to work in Jupyter notebooks.
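Since the traceback points at pyarrow's snappy codec not being built in the Oceanic environment, one way to compare the two setups (a minimal diagnostic sketch, not a fix; it assumes pyarrow is importable in both interpreters) is to run the following in the Jupyter kernel and in the PyCharm interpreter and compare the output:
import sys
import pyarrow as pa

print(sys.executable)                    # which Python interpreter is actually running
print(pa.__version__)                    # which pyarrow build gets picked up
print(pa.Codec.is_available("snappy"))   # False would match the "codec 'snappy' not built" error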
Any help is appreciated, thanks in advance.
I am using TuriCreate to create a model to classify human activity, but I get an error when I try to run the activity_classifier.create(...) method.
Code
This is what I did:
Load all data:
train_sf = tc.SFrame("data/cleaned_train_sframe")
valid_sf = tc.SFrame("data/cleaned_valid_sframe")
test_sf = tc.SFrame("data/cleaned_test_sframe")
Dividing the SFrame randomly into two smaller SFrames:
train, valid = tc.activity_classifier.util.random_split_by_session(train_sf, session_id='sessionId', fraction=0.9)
Trying to build and train my model:
model = tc.activity_classifier.create(dataset=train_sf,
                                      session_id='sessionId',
                                      target='activity',
                                      features=["rotX", "rotY", "rotZ", "accelX", "accelY", "accelZ"],
                                      prediction_window=50,
                                      validation_set=valid_sf,
                                      max_iterations=20)
Error
The third step raises the following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Input In [34], in <cell line: 1>()
----> 1 model = tc.activity_classifier.create(dataset=train_sf,
2 session_id='sessionId',
3 target='activity',
4 features=["rotX", "rotY", "rotZ", "accelX", "accelY", "accelZ"],
5 prediction_window=50,
6 validation_set=valid_sf,
7 max_iterations=20)
File ~/Desktop/PFG/lib/python3.8/site-packages/turicreate/toolkits/activity_classifier/_activity_classifier.py:200, in create(dataset, session_id, target, features, prediction_window, validation_set, max_iterations, batch_size, verbose, random_seed)
197 options["_show_loss"] = False
198 options["random_seed"] = random_seed
--> 200 model.train(dataset, target, session_id, validation_set, options)
201 return ActivityClassifier(model_proxy=model, name=name)
File ~/Desktop/PFG/lib/python3.8/site-packages/turicreate/extensions.py:305, in _ToolkitClass.__getattr__.<locals>.<lambda>(*args, **kwargs)
302 return _wrap_function_return(self._tkclass.get_property(name))
303 elif name in self._functions:
304 # is it a function?
--> 305 ret = lambda *args, **kwargs: self.__run_class_function(name, args, kwargs)
306 ret.__doc__ = (
307 "Name: " + name + "\nParameters: " + str(self._functions[name]) + "\n"
308 )
309 try:
File ~/Desktop/PFG/lib/python3.8/site-packages/turicreate/extensions.py:290, in _ToolkitClass.__run_class_function(self, fnname, args, kwargs)
288 # unwrap it
289 try:
--> 290 ret = self._tkclass.call_function(fnname, argument_dict)
291 except RuntimeError as exc:
292 # Expose C++ exceptions using ToolkitError.
293 raise _ToolkitError(exc)
File cy_model.pyx:35, in turicreate._cython.cy_model.UnityModel.call_function()
File cy_model.pyx:40, in turicreate._cython.cy_model.UnityModel.call_function()
ValueError: stod: no conversion
Does anyone know what the problem could be?
You can get past this issue by setting validation_set to None.
This does mean that you have no validation set during training, but at least you can create your model.
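Using the create call from the question, the workaround looks like this (a sketch that assumes the same SFrames and column names as above):
model = tc.activity_classifier.create(dataset=train_sf,
                                      session_id='sessionId',
                                      target='activity',
                                      features=["rotX", "rotY", "rotZ", "accelX", "accelY", "accelZ"],
                                      prediction_window=50,
                                      validation_set=None,   # workaround: train without a validation set
                                      max_iterations=20)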
I am more or less following this example to integrate the Ray Tune hyperparameter tuning library with the Hugging Face transformers library, using my own dataset.
Here is my script:
import ray
from ray import tune
from ray.tune import CLIReporter
from ray.tune.examples.pbt_transformers.utils import download_data, \
    build_compute_metrics_fn
from ray.tune.schedulers import PopulationBasedTraining
from transformers import glue_tasks_num_labels, AutoConfig, \
    AutoModelForSequenceClassification, AutoTokenizer, Trainer, TrainingArguments


def get_model():
    # tokenizer = AutoTokenizer.from_pretrained(model_name, additional_special_tokens = ['[CHARACTER]'])
    model = ElectraForSequenceClassification.from_pretrained('google/electra-small-discriminator', num_labels=2)
    model.resize_token_embeddings(len(tokenizer))
    return model


from sklearn.metrics import accuracy_score, precision_recall_fscore_support


def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='weighted')
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }


training_args = TrainingArguments(
    "electra_hp_tune",
    report_to="wandb",
    learning_rate=2e-5,              # config
    do_train=True,
    do_eval=True,
    evaluation_strategy="epoch",
    load_best_model_at_end=True,
    num_train_epochs=2,              # config
    per_device_train_batch_size=16,  # config
    per_device_eval_batch_size=16,   # config
    warmup_steps=0,
    weight_decay=0.1,                # config
    logging_dir="./logs",
)

trainer = Trainer(
    model_init=get_model,
    args=training_args,
    train_dataset=chunked_encoded_dataset['train'],
    eval_dataset=chunked_encoded_dataset['validation'],
    compute_metrics=compute_metrics
)

tune_config = {
    "per_device_train_batch_size": 32,
    "per_device_eval_batch_size": 32,
    "num_train_epochs": tune.choice([2, 3, 4, 5])
}

scheduler = PopulationBasedTraining(
    time_attr="training_iteration",
    metric="eval_acc",
    mode="max",
    perturbation_interval=1,
    hyperparam_mutations={
        "weight_decay": tune.uniform(0.0, 0.3),
        "learning_rate": tune.uniform(1e-5, 2.5e-5),
        "per_device_train_batch_size": [16, 32, 64],
    })

reporter = CLIReporter(
    parameter_columns={
        "weight_decay": "w_decay",
        "learning_rate": "lr",
        "per_device_train_batch_size": "train_bs/gpu",
        "num_train_epochs": "num_epochs"
    },
    metric_columns=[
        "eval_f1", "eval_loss", "epoch", "training_iteration"
    ])

from ray.tune.integration.wandb import WandbLogger

trainer.hyperparameter_search(
    hp_space=lambda _: tune_config,
    backend="ray",
    n_trials=10,
    scheduler=scheduler,
    keep_checkpoints_num=1,
    checkpoint_score_attr="training_iteration",
    progress_reporter=reporter,
    name="tune_transformer_gr")
The last function call (to trainer.hyperparameter_search) is when the error is raised. The error message is:
AttributeError: module 'pickle' has no attribute 'PickleBuffer'
And here is the full stack trace:
AttributeError                            Traceback (most recent call last)
in ()
      8     checkpoint_score_attr="training_iteration",
      9     progress_reporter=reporter,
---> 10     name="tune_transformer_gr")

14 frames

/usr/local/lib/python3.7/dist-packages/transformers/trainer.py in hyperparameter_search(self, hp_space, compute_objective, n_trials, direction, backend, hp_name, **kwargs)
   1666
   1667         run_hp_search = run_hp_search_optuna if backend == HPSearchBackend.OPTUNA else run_hp_search_ray
-> 1668         best_run = run_hp_search(self, n_trials, direction, **kwargs)
   1669
   1670         self.hp_search_backend = None

/usr/local/lib/python3.7/dist-packages/transformers/integrations.py in run_hp_search_ray(trainer, n_trials, direction, **kwargs)
    231
    232     analysis = ray.tune.run(
--> 233         ray.tune.with_parameters(_objective, local_trainer=trainer),
    234         config=trainer.hp_space(None),
    235         num_samples=n_trials,

/usr/local/lib/python3.7/dist-packages/ray/tune/utils/trainable.py in with_parameters(trainable, **kwargs)
    294     prefix = f"{str(trainable)}_"
    295     for k, v in kwargs.items():
--> 296         parameter_registry.put(prefix + k, v)
    297
    298     trainable_name = getattr(trainable, "__name__", "tune_with_parameters")

/usr/local/lib/python3.7/dist-packages/ray/tune/registry.py in put(self, k, v)
    160         self.to_flush[k] = v
    161         if ray.is_initialized():
--> 162             self.flush()
    163
    164     def get(self, k):

/usr/local/lib/python3.7/dist-packages/ray/tune/registry.py in flush(self)
    169     def flush(self):
    170         for k, v in self.to_flush.items():
--> 171             self.references[k] = ray.put(v)
    172         self.to_flush.clear()
    173

/usr/local/lib/python3.7/dist-packages/ray/_private/client_mode_hook.py in wrapper(*args, **kwargs)
     45         if client_mode_should_convert():
     46             return getattr(ray, func.__name__)(*args, **kwargs)
---> 47         return func(*args, **kwargs)
     48
     49     return wrapper

/usr/local/lib/python3.7/dist-packages/ray/worker.py in put(value)
   1512     with profiling.profile("ray.put"):
   1513         try:
-> 1514             object_ref = worker.put_object(value)
   1515         except ObjectStoreFullError:
   1516             logger.info(

/usr/local/lib/python3.7/dist-packages/ray/worker.py in put_object(self, value, object_ref)
    259                 "inserting with an ObjectRef")
    260
--> 261         serialized_value = self.get_serialization_context().serialize(value)
    262         # This must be the first place that we construct this python
    263         # ObjectRef because an entry with 0 local references is created when

/usr/local/lib/python3.7/dist-packages/ray/serialization.py in serialize(self, value)
    322             return RawSerializedObject(value)
    323         else:
--> 324             return self._serialize_to_msgpack(value)

/usr/local/lib/python3.7/dist-packages/ray/serialization.py in _serialize_to_msgpack(self, value)
    302             metadata = ray_constants.OBJECT_METADATA_TYPE_PYTHON
    303             pickle5_serialized_object = \
--> 304                 self._serialize_to_pickle5(metadata, python_objects)
    305         else:
    306             pickle5_serialized_object = None

/usr/local/lib/python3.7/dist-packages/ray/serialization.py in _serialize_to_pickle5(self, metadata, value)
    262         except Exception as e:
    263             self.get_and_clear_contained_object_refs()
--> 264             raise e
    265         finally:
    266             self.set_out_of_band_serialization()

/usr/local/lib/python3.7/dist-packages/ray/serialization.py in _serialize_to_pickle5(self, metadata, value)
    259             self.set_in_band_serialization()
    260             inband = pickle.dumps(
--> 261                 value, protocol=5, buffer_callback=writer.buffer_callback)
    262         except Exception as e:
    263             self.get_and_clear_contained_object_refs()

/usr/local/lib/python3.7/dist-packages/ray/cloudpickle/cloudpickle_fast.py in dumps(obj, protocol, buffer_callback)
     71             file, protocol=protocol, buffer_callback=buffer_callback
     72         )
---> 73         cp.dump(obj)
     74         return file.getvalue()
     75

/usr/local/lib/python3.7/dist-packages/ray/cloudpickle/cloudpickle_fast.py in dump(self, obj)
    578     def dump(self, obj):
    579         try:
--> 580             return Pickler.dump(self, obj)
    581         except RuntimeError as e:
    582             if "recursion" in e.args[0]:

/usr/local/lib/python3.7/dist-packages/pyarrow/io.pxi in pyarrow.lib.Buffer.__reduce_ex__()

AttributeError: module 'pickle' has no attribute 'PickleBuffer'
My environment set-up:
I am using Google Colab
Platform: Linux-5.4.109+-x86_64-with-Ubuntu-18.04-bionic
Python version: 3.7.10
Transformers version: 4.6.1
Ray version: 1.3.0
What I have tried:
Updating pickle
Installing and importing pickle5 as pickle
Making sure that I did not have a Python file named 'pickle' in my working directory
Where is this bug coming from and how can I resolve it?
I had the same error when trying to use pickle.dump(); for me, downgrading pickle5 from version 0.0.11 to 0.0.10 fixed it.
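For example (a sketch, assuming pip manages the packages in your environment):
pip install pickle5==0.0.10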
I also encountered this error on Google Colab while running a Ray Tune hyperparameter search with the Hugging Face transformers library.
This helped me:
!pip install pickle5
Then
import pickle5 as pickle
After the first run there will be the pickle warning asking you to restart the notebook, and the same error. After a second “Restart and run all”, the Ray Tune hyperparameter search begins.
Not a "real" solution but at least a workaround. For me this issue was occurring on Python 3.7. Switching to Python 3.8 solved the issue.
I deployed a TensorFlow saved_model using the following code:
model_path = 's3://' + sagemaker_session.default_bucket() + '/model/model.tar.gz'

from sagemaker.tensorflow.serving import Model
model = Model(model_data=model_path, role=role)
predictor = model.deploy(initial_instance_count=1, instance_type='ml.c5.xlarge')
The model takes images of dimension (1, 48, 48, 1).
Immediately after, when I try to make a prediction using the following code:
predictor.predict(preprocessed_faces_emo.tolist())
I get the following error, and I don't understand what the problem is. I am running this code from within SageMaker, with Python version 3.7 and TensorFlow version 1.14.0:
---------------------------------------------------------------------------
ModelError                                Traceback (most recent call last)
<ipython-input-37-4dc04dc0679c> in <module>()
----> 1 predictor.predict(preprocessed_faces_emo.tolist())

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/predictor.py in predict(self, data, initial_args)
    105
    106         request_args = self._create_request_args(data, initial_args)
--> 107         response = self.sagemaker_session.sagemaker_runtime_client.invoke_endpoint(**request_args)
    108         return self._handle_response(response)
    109

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/botocore/client.py in _api_call(self, *args, **kwargs)
    355                     "%s() only accepts keyword arguments." % py_operation_name)
    356             # The "self" in this scope is referring to the BaseClient.
--> 357             return self._make_api_call(operation_name, kwargs)
    358
    359         _api_call.__name__ = str(py_operation_name)

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/botocore/client.py in _make_api_call(self, operation_name, api_params)
    659             error_code = parsed_response.get("Error", {}).get("Code")
    660             error_class = self.exceptions.from_code(error_code)
--> 661             raise error_class(parsed_response, operation_name)
    662         else:
    663             return parsed_response

ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (502) from model with message "<html>
<head><title>502 Bad Gateway</title></head>
<body>
<center><h1>502 Bad Gateway</h1></center>
<hr><center>nginx/1.16.1</center>
</body>
</html>
". See https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#logEventViewer:group=/aws/sagemaker/Endpoints/sagemaker-tensorflow-serving-2020-01-13-13-43-12-354 in account 970351559819 for more information.
I'm trying to read my pretrained doc2vec model:
from gensim.models import Doc2Vec
model = Doc2Vec.load('/path/to/pretrained/model')
However, an error appears during the loading process. Could anyone suggest how to deal with this? Here is the error:
AttributeError                            Traceback (most recent call last)
<ipython-input-9-819b254ac835> in <module>()
----> 1 model = Doc2Vec.load('/path/to/pretrained/model')

/opt/jupyter-notebook/.local/lib/python2.7/site-packages/gensim/models/word2vec.pyc in load(cls, *args, **kwargs)
   1682     @classmethod
   1683     def load(cls, *args, **kwargs):
-> 1684         model = super(Word2Vec, cls).load(*args, **kwargs)
   1685         # update older models
   1686         if hasattr(model, 'table'):

/opt/jupyter-notebook/.local/lib/python2.7/site-packages/gensim/utils.pyc in load(cls, fname, mmap)
    246         compress, subname = SaveLoad._adapt_by_suffix(fname)
    247
--> 248         obj = unpickle(fname)
    249         obj._load_specials(fname, mmap, compress, subname)
    250         return obj

/opt/jupyter-notebook/.local/lib/python2.7/site-packages/gensim/utils.pyc in unpickle(fname)
    909     with smart_open(fname) as f:
    910         # Because of loading from S3 load can't be used (missing readline in smart_open)
--> 911         return _pickle.loads(f.read())
    912
    913

AttributeError: 'module' object has no attribute 'defaultdict'
As noted in the comments on the question, this was likely related to an issue in gensim that was fixed in the 0.13.4 release.
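So the usual remedy is to upgrade gensim to 0.13.4 or later in the environment that loads the model, for example (assuming pip):
pip install --upgrade gensim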