I've been studying time series forecasting, and I'm trying to learn how to use gluon-ts with Python.
Here is the source code of gluon-ts:
https://github.com/awslabs/gluon-ts/
I tried to use the LSTNetEstimator module, but it raises the error below. I also found that almost all of the gluon-ts discussions are about DeepAR.
Can anyone help?
GluonTSDataError: Input for field "target" does not have the requireddimension (field: target, ndim observed: 1, expected ndim: 2)
I think it has something to do with my custom dataset, which is a dataframe of shape [320000, 3] with columns ['time', 'Power_Cdp', 'Power_Dp'].
Here is the training set:
from gluonts.dataset.common import ListDataset
from gluonts.model.lstnet import LSTNetEstimator
from gluonts.mx.trainer import Trainer
training_data = ListDataset(
    [{"start": LSTNet_df.index[0], "target": LSTNet_df['Power_Cdp'][:-10000]}],
    freq="15min")
estimator = LSTNetEstimator(freq="15min", prediction_length=24*4, context_length=24*4,
                            num_series=48*4, skip_size=72*4, ar_window=24*4, channels=32,
                            trainer=Trainer(epochs=10))
predictor = estimator.train(training_data=training_data)  # Error
The full traceback is below.
GluonTSDataError Traceback (most recent call last)
<ipython-input-311-ad48fdc20df5> in <module>
7 num_series=48*4, skip_size=72*4, ar_window=24*4, channels=32,
8 trainer=Trainer(epochs=10))
----> 9 predictor = estimator.train(training_data=training_data)
~/.local/lib/python3.8/site-packages/gluonts/mx/model/estimator.py in train(self, training_data, validation_data, num_workers, num_prefetch, shuffle_buffer_length, cache_data, **kwargs)
192 **kwargs,
193 ) -> Predictor:
--> 194 return self.train_model(
195 training_data=training_data,
196 validation_data=validation_data,
~/.local/lib/python3.8/site-packages/gluonts/mx/model/estimator.py in train_model(self, training_data, validation_data, num_workers, num_prefetch, shuffle_buffer_length, cache_data)
145 transformed_training_data = transformation.apply(training_data)
146
--> 147 training_data_loader = self.create_training_data_loader(
148 transformed_training_data
149 if not cache_data
~/.local/lib/python3.8/site-packages/gluonts/model/lstnet/_estimator.py in create_training_data_loader(self, data, **kwargs)
216 ) -> DataLoader:
217 input_names = get_hybrid_forward_input_names(LSTNetTrain)
--> 218 with env._let(max_idle_transforms=maybe_len(data) or 0):
219 instance_splitter = self._create_instance_splitter("training")
220 return TrainDataLoader(
~/.local/lib/python3.8/site-packages/gluonts/itertools.py in maybe_len(obj)
21 def maybe_len(obj) -> Optional[int]:
22 try:
---> 23 return len(obj)
24 except (NotImplementedError, AttributeError):
25 return None
~/.local/lib/python3.8/site-packages/gluonts/transform/_base.py in __len__(self)
99 # NOTE this is unsafe when transformations are run with is_train = True
100 # since some transformations may not be deterministic (instance splitter)
--> 101 return sum(1 for _ in self)
102
103 def __iter__(self) -> Iterator[DataEntry]:
~/.local/lib/python3.8/site-packages/gluonts/transform/_base.py in <genexpr>(.0)
99 # NOTE this is unsafe when transformations are run with is_train = True
100 # since some transformations may not be deterministic (instance splitter)
--> 101 return sum(1 for _ in self)
102
103 def __iter__(self) -> Iterator[DataEntry]:
~/.local/lib/python3.8/site-packages/gluonts/transform/_base.py in __iter__(self)
102
103 def __iter__(self) -> Iterator[DataEntry]:
--> 104 yield from self.transformation(
105 self.base_dataset, is_train=self.is_train
106 )
~/.local/lib/python3.8/site-packages/gluonts/transform/_base.py in __call__(self, data_it, is_train)
122 self, data_it: Iterable[DataEntry], is_train: bool
123 ) -> Iterator:
--> 124 for data_entry in data_it:
125 try:
126 yield self.map_transform(data_entry.copy(), is_train)
~/.local/lib/python3.8/site-packages/gluonts/transform/_base.py in __call__(self, data_it, is_train)
126 yield self.map_transform(data_entry.copy(), is_train)
127 except Exception as e:
--> 128 raise e
129
130 #abc.abstractmethod
~/.local/lib/python3.8/site-packages/gluonts/transform/_base.py in __call__(self, data_it, is_train)
124 for data_entry in data_it:
125 try:
--> 126 yield self.map_transform(data_entry.copy(), is_train)
127 except Exception as e:
128 raise e
~/.local/lib/python3.8/site-packages/gluonts/transform/_base.py in map_transform(self, data, is_train)
139
140 def map_transform(self, data: DataEntry, is_train: bool) -> DataEntry:
--> 141 return self.transform(data)
142
143 #abc.abstractmethod
~/.local/lib/python3.8/site-packages/gluonts/transform/convert.py in transform(self, data)
127 value = np.asarray(data[self.field], dtype=self.dtype)
128
--> 129 assert_data_error(
130 value.ndim == self.expected_ndim,
131 'Input for field "{self.field}" does not have the required'
~/.local/lib/python3.8/site-packages/gluonts/exceptions.py in assert_data_error(condition, message, *args, **kwargs)
114 exception message.
115 """
--> 116 assert_gluonts(GluonTSDataError, condition, message, *args, **kwargs)
~/.local/lib/python3.8/site-packages/gluonts/exceptions.py in assert_gluonts(exception_class, condition, message, *args, **kwargs)
93 """
94 if not condition:
---> 95 raise exception_class(message.format(*args, **kwargs))
96
97
GluonTSDataError: Input for field "target" does not have the requireddimension (field: target, ndim observed: 1, expected ndim: 2)
@darth baba I tried again recently, and this time I used DeepVAR.
Here is what I got.
I reset "target": ..., stepped into my code, and found
gluonts.exceptions.GluonTSDataError: Array 'target' has bad shape - expected 1 dimensions, got 2.
This is because I set "target": train_df[['Power_Cdp', 'Power_Active_Fan']], which is reported as an error at ./gluonts/dataset/common.py +385.
I found self.req_ndim != value.ndim, which suggested that I had passed a target with the wrong shape, so I reset the input to "target": train_df.index[:] and that problem went away.
But then it reports another error,
gluonts.exceptions.GluonTSDataError: Input for field "target" does not have the requireddimension (field: target, ndim observed: 1, expected ndim: 2)
To be honest, this one confused me a lot, since the gluonts code seems quite complicated.
I checked again and found that it raises the error at ./gluonts/transform/convert.py +129.
It seems that expected_ndim does not equal value.ndim in this case.
However, after manually rewriting expected_ndim, training eventually ran.
I have no idea whether this modification is right.
Although avg_epoch_loss is decreasing, the final forecast is not as good as expected.
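For reference, here is a hedged sketch (not a verified fix) of how a 2-D target could be built from the two power columns in the original dataframe, since the error asks for ndim 2; the one_dim_target flag on ListDataset is the switch that allows a multivariate target, and the column names are taken from the question above.
import numpy as np
from gluonts.dataset.common import ListDataset

# Sketch only: LSTNet models a multivariate target, so the "target" field
# is expected to be 2-D with shape (num_series, length).
target_2d = np.vstack([
    LSTNet_df['Power_Cdp'].to_numpy(),
    LSTNet_df['Power_Dp'].to_numpy(),
])  # shape (2, len(LSTNet_df))

training_data = ListDataset(
    [{"start": LSTNet_df.index[0], "target": target_2d[:, :-10000]}],
    freq="15min",
    one_dim_target=False,  # accept a 2-D target instead of the default 1-D
)
If the target is shaped this way, num_series in LSTNetEstimator would presumably need to match the first dimension of the array (here 2).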
Related
I have a dataframe in which some columns are target time series that I need to scale, sometimes as a flattened 1-D array and sometimes as 2-D, so I am defining a custom wrapper that extends scikit-learn scalers with optional flattening as a preprocessing step. transform and inverse_transform are not posted for simplicity.
import copy
from sklearn.base import BaseEstimator, TransformerMixin

class ScalerOptFlatten(TransformerMixin, BaseEstimator):
    def __init__(self, instance, flatten, **kwargs):
        # super().__init__(**kwargs)
        self.scaler = instance
        self.flatten = flatten

    def get_params(self, deep=True):
        cp = copy.copy(self.scaler)
        cp.__class__ = type(self.scaler)
        params = type(self.scaler).get_params(cp, deep)
        return params

    def fit(self, X, y=None):
        if self.flatten:
            scale_in = X.reshape(-1).reshape(-1, 1)
        else:
            scale_in = X
        if y is None:
            self.scaler.fit(scale_in)
        else:
            self.scaler.fit(scale_in, y)
        return self
It works when used on its own:
scaler_transformer = ScalerOptFlatten(instance=StandardScaler(), flatten=True)
scaler_transformer.fit(df_unstack)
But it fails when stacked in a ColumnTransformer:
preprocessor = ColumnTransformer(
    transformers=[
        ("ts", scaler_transformer, target_feature_names)
    ]
)
preprocessor.fit(df_unstack)
With
---------------------------------------------------------------------------
Empty Traceback (most recent call last)
~/anaconda3/envs/tf_2_2_0/lib/python3.8/site-packages/joblib/parallel.py in dispatch_one_batch(self, iterator)
826 try:
--> 827 tasks = self._ready_batches.get(block=False)
828 except queue.Empty:
~/anaconda3/envs/tf_2_2_0/lib/python3.8/queue.py in get(self, block, timeout)
166 if not self._qsize():
--> 167 raise Empty
168 elif timeout is None:
Empty:
During handling of the above exception, another exception occurred:
TypeError Traceback (most recent call last)
<ipython-input-494-0eb4b591d8a6> in <module>
----> 1 ts_scaler_step.fit(df_unstack)
~/anaconda3/envs/tf_2_2_0/lib/python3.8/site-packages/sklearn/compose/_column_transformer.py in fit(self, X, y)
492 # we use fit_transform to make sure to set sparse_output_ (for which we
493 # need the transformed data) to have consistent output type in predict
--> 494 self.fit_transform(X, y=y)
495 return self
496
~/anaconda3/envs/tf_2_2_0/lib/python3.8/site-packages/sklearn/compose/_column_transformer.py in fit_transform(self, X, y)
529 self._validate_remainder(X)
530
--> 531 result = self._fit_transform(X, y, _fit_transform_one)
532
533 if not result:
~/anaconda3/envs/tf_2_2_0/lib/python3.8/site-packages/sklearn/compose/_column_transformer.py in _fit_transform(self, X, y, func, fitted)
456 self._iter(fitted=fitted, replace_strings=True))
457 try:
--> 458 return Parallel(n_jobs=self.n_jobs)(
459 delayed(func)(
460 transformer=clone(trans) if not fitted else trans,
~/anaconda3/envs/tf_2_2_0/lib/python3.8/site-packages/joblib/parallel.py in __call__(self, iterable)
1046 # remaining jobs.
1047 self._iterating = False
-> 1048 if self.dispatch_one_batch(iterator):
1049 self._iterating = self._original_iterator is not None
1050
~/anaconda3/envs/tf_2_2_0/lib/python3.8/site-packages/joblib/parallel.py in dispatch_one_batch(self, iterator)
836 big_batch_size = batch_size * n_jobs
837
--> 838 islice = list(itertools.islice(iterator, big_batch_size))
839 if len(islice) == 0:
840 return False
~/anaconda3/envs/tf_2_2_0/lib/python3.8/site-packages/sklearn/compose/_column_transformer.py in <genexpr>(.0)
458 return Parallel(n_jobs=self.n_jobs)(
459 delayed(func)(
--> 460 transformer=clone(trans) if not fitted else trans,
461 X=_safe_indexing(X, column, axis=1),
462 y=y,
~/anaconda3/envs/tf_2_2_0/lib/python3.8/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs)
70 FutureWarning)
71 kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 72 return f(**kwargs)
73 return inner_f
74
~/anaconda3/envs/tf_2_2_0/lib/python3.8/site-packages/sklearn/base.py in clone(estimator, safe)
86 for name, param in new_object_params.items():
87 new_object_params[name] = clone(param, safe=False)
---> 88 new_object = klass(**new_object_params)
89 params_set = new_object.get_params(deep=False)
90
TypeError: __init__() missing 2 required positional arguments: 'instance' and 'flatten'
I suspect that the error is in the get_params method.
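For context, a hedged sketch of why clone() trips here and how get_params could be written instead: scikit-learn's clone() calls get_params() and then re-instantiates the class with those keyword arguments, so the wrapper needs to report its own constructor arguments (instance and flatten) rather than the inner scaler's.
def get_params(self, deep=True):
    # Sketch only: report the wrapper's own constructor args so that
    # ScalerOptFlatten(**self.get_params(deep=False)) can be rebuilt by clone().
    params = {"instance": self.scaler, "flatten": self.flatten}
    if deep:
        # Optionally expose the wrapped scaler's params with the usual "name__param" prefix.
        for key, value in self.scaler.get_params(deep=True).items():
            params["instance__" + key] = value
    return params
Alternatively, storing the constructor arguments under their own names in __init__ (self.instance = instance, self.flatten = flatten) would let BaseEstimator's default get_params handle this without an override.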
I'm running Jupyter Lab on Windows, and fastai.vision.utils.verify_images(fns) is giving me problems because it calls fastcore.parallel.parallel with the default n_workers=8. There are many ways around it, but I was trying to figure out a code block I could drop into any notebook so that all underlying calls to parallel run with n_workers=1.
I tried the following cell:
import fastcore
import sys
_fastcore = fastcore
_parallel = lambda *args, **kwargs: fastcore.parallel.parallel(*args, **kwargs, n_workers=1)
_fastcore.parallel.parallel = _parallel
sys.modules['fastcore'] = _fastcore
fastcore.parallel.parallel
which prints
<function __main__.<lambda>(*args, **kwargs)>
but when I run verify_images it still fails as if the patch never happened:
---------------------------------------------------------------------------
BrokenProcessPool Traceback (most recent call last)
<ipython-input-37-f1773f2c9e62> in <module>
3 # from mock import patch
4 # with patch('fastcore.parallel.parallel') as _parallel:
----> 5 failed = verify_images(fns)
6 # failed = L(fns[i] for i,o in enumerate(_parallel(verify_image, fns)) if not o)
7 failed
~\anaconda3\lib\site-packages\fastai\vision\utils.py in verify_images(fns)
59 def verify_images(fns):
60 "Find images in `fns` that can't be opened"
---> 61 return L(fns[i] for i,o in enumerate(parallel(verify_image, fns)) if not o)
62
63 # Cell
~\anaconda3\lib\site-packages\fastcore\parallel.py in parallel(f, items, n_workers, total, progress, pause, threadpool, timeout, chunksize, *args, **kwargs)
121 if total is None: total = len(items)
122 r = progress_bar(r, total=total, leave=False)
--> 123 return L(r)
124
125 # Cell
~\anaconda3\lib\site-packages\fastcore\foundation.py in __call__(cls, x, *args, **kwargs)
95 def __call__(cls, x=None, *args, **kwargs):
96 if not args and not kwargs and x is not None and isinstance(x,cls): return x
---> 97 return super().__call__(x, *args, **kwargs)
98
99 # Cell
~\anaconda3\lib\site-packages\fastcore\foundation.py in __init__(self, items, use_list, match, *rest)
103 def __init__(self, items=None, *rest, use_list=False, match=None):
104 if (use_list is not None) or not is_array(items):
--> 105 items = listify(items, *rest, use_list=use_list, match=match)
106 super().__init__(items)
107
~\anaconda3\lib\site-packages\fastcore\basics.py in listify(o, use_list, match, *rest)
54 elif isinstance(o, list): res = o
55 elif isinstance(o, str) or is_array(o): res = [o]
---> 56 elif is_iter(o): res = list(o)
57 else: res = [o]
58 if match is not None:
~\anaconda3\lib\concurrent\futures\process.py in _chain_from_iterable_of_lists(iterable)
482 careful not to keep references to yielded objects.
483 """
--> 484 for element in iterable:
485 element.reverse()
486 while element:
~\anaconda3\lib\concurrent\futures\_base.py in result_iterator()
609 # Careful not to keep a reference to the popped future
610 if timeout is None:
--> 611 yield fs.pop().result()
612 else:
613 yield fs.pop().result(end_time - time.monotonic())
~\anaconda3\lib\concurrent\futures\_base.py in result(self, timeout)
437 raise CancelledError()
438 elif self._state == FINISHED:
--> 439 return self.__get_result()
440 else:
441 raise TimeoutError()
~\anaconda3\lib\concurrent\futures\_base.py in __get_result(self)
386 def __get_result(self):
387 if self._exception:
--> 388 raise self._exception
389 else:
390 return self._result
BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
I suspect it has to do with fastai.vision.utils using * imports for fastcore. Is there a way to achieve what I want?
Since the parallel function has already been imported into the fastai.vision.utils module, the correct way is to monkeypatch that module rather than fastcore.parallel:
... # your code for custom `parallel` function goes here
import fastai.vision.utils
fastai.vision.utils.parallel = _parallel # assign your custom function here
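Putting it together, a minimal sketch of the whole patch cell (the wrapper keeps the _parallel name from the question; fastcore's parallel signature is taken from the traceback above):
from fastcore.parallel import parallel as _orig_parallel

def _parallel(f, items, *args, **kwargs):
    kwargs["n_workers"] = 1  # force single-worker execution
    return _orig_parallel(f, items, *args, **kwargs)

import fastai.vision.utils
fastai.vision.utils.parallel = _parallel  # patch the name verify_images actually resolves
Subsequent calls to verify_images(fns) will then run with n_workers=1.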
The example is fully reproducible. Here is the full notebook (which also downloads the data): https://github.com/ageron/handson-ml2/blob/master/02_end_to_end_machine_learning_project.ipynb
After this part in the notebook above:
full_pipeline_with_predictor = Pipeline([
    ("preparation", full_pipeline),
    ("linear", LinearRegression())
])
full_pipeline_with_predictor.fit(housing, housing_labels)
full_pipeline_with_predictor.predict(some_data)
I am trying to get predictions on the test set with this code:
X_test_prepared = full_pipeline.transform(X_test)
final_predictions = full_pipeline_with_predictor.predict(X_test_prepared)
But I am receiving this error:
C:\Users\Alex\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\compose\_column_transformer.py:430: FutureWarning: Given feature/column names or counts do not match the ones for the data given during fit. This will fail from v0.24.
FutureWarning)
---------------------------------------------------------------------------
Empty Traceback (most recent call last)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\joblib\parallel.py in dispatch_one_batch(self, iterator)
796 try:
--> 797 tasks = self._ready_batches.get(block=False)
798 except queue.Empty:
~\AppData\Local\Continuum\anaconda3\lib\queue.py in get(self, block, timeout)
166 if not self._qsize():
--> 167 raise Empty
168 elif timeout is None:
Empty:
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
<ipython-input-141-dc87b1c9e658> in <module>
5
6 X_test_prepared = full_pipeline.transform(X_test)
----> 7 final_predictions = full_pipeline_with_predictor.predict(X_test_prepared)
8
9 final_mse = mean_squared_error(y_test, final_predictions)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\utils\metaestimators.py in <lambda>(*args, **kwargs)
114
115 # lambda, but not partial, allows help() to work with update_wrapper
--> 116 out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
117 # update the docstring of the returned function
118 update_wrapper(out, self.fn)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\pipeline.py in predict(self, X, **predict_params)
417 Xt = X
418 for _, name, transform in self._iter(with_final=False):
--> 419 Xt = transform.transform(Xt)
420 return self.steps[-1][-1].predict(Xt, **predict_params)
421
~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\compose\_column_transformer.py in transform(self, X)
586
587 self._validate_features(X.shape[1], X_feature_names)
--> 588 Xs = self._fit_transform(X, None, _transform_one, fitted=True)
589 self._validate_output(Xs)
590
~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\compose\_column_transformer.py in _fit_transform(self, X, y, func, fitted)
455 message=self._log_message(name, idx, len(transformers)))
456 for idx, (name, trans, column, weight) in enumerate(
--> 457 self._iter(fitted=fitted, replace_strings=True), 1))
458 except ValueError as e:
459 if "Expected 2D array, got 1D array instead" in str(e):
~\AppData\Local\Continuum\anaconda3\lib\site-packages\joblib\parallel.py in __call__(self, iterable)
1002 # remaining jobs.
1003 self._iterating = False
-> 1004 if self.dispatch_one_batch(iterator):
1005 self._iterating = self._original_iterator is not None
1006
~\AppData\Local\Continuum\anaconda3\lib\site-packages\joblib\parallel.py in dispatch_one_batch(self, iterator)
806 big_batch_size = batch_size * n_jobs
807
--> 808 islice = list(itertools.islice(iterator, big_batch_size))
809 if len(islice) == 0:
810 return False
~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\compose\_column_transformer.py in <genexpr>(.0)
454 message_clsname='ColumnTransformer',
455 message=self._log_message(name, idx, len(transformers)))
--> 456 for idx, (name, trans, column, weight) in enumerate(
457 self._iter(fitted=fitted, replace_strings=True), 1))
458 except ValueError as e:
~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\utils\__init__.py in _safe_indexing(X, indices, axis)
404 if axis == 1 and indices_dtype == 'str' and not hasattr(X, 'loc'):
405 raise ValueError(
--> 406 "Specifying the columns using strings is only supported for "
407 "pandas DataFrames"
408 )
ValueError: Specifying the columns using strings is only supported for pandas DataFrames
Question: How can I correct that error? And why does that error happen?
Since your final pipeline:
full_pipeline_with_predictor = Pipeline([
    ("preparation", full_pipeline),
    ("linear", LinearRegression())
])
already contains the full_pipeline, you should not "prepare" your X_test again; doing so "prepares" X_test twice, which is wrong. So your code should simply be
final_predictions = full_pipeline_with_predictor.predict(X_test)
exactly as it is for getting predictions for some_data, i.e.
full_pipeline_with_predictor.predict(some_data)
where you correctly do not "prepare" some_data before feeding it into the final pipeline.
The whole point of using pipelines is exactly this: to avoid running the (possibly several) preparation steps separately, by wrapping all of them into a single pipeline instead. You correctly apply this when predicting some_data, but you seem to have forgotten it in the next step, when you try to predict X_test.
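For completeness, a small sketch of the test-set evaluation under this scheme (variable names follow the traceback in the question):
import numpy as np
from sklearn.metrics import mean_squared_error

# Predict directly on the raw test features; the pipeline handles the preparation step.
final_predictions = full_pipeline_with_predictor.predict(X_test)
final_rmse = np.sqrt(mean_squared_error(y_test, final_predictions))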
I'm getting the below error when I try to target encode a categorical column.
"TypeError: Categorical cannot perform the operation mean"
When I run similar code in a Jupyter notebook it works fine, but when I run it as part of a Python file it errors out with the above message.
I know this sounds a bit odd, but I am not able to understand what's going on in the background.
Error:
TypeError Traceback (most recent call last)
<ipython-input-4-9ba53c6b7375> in <module>()
----> 1 tpmodeller.initialize()
~/SageMaker/tp/tp_kvr.py in initialize(self)
127
128 # Target Encode Te_cat_col Features
--> 129 df_cat_te = target_encode_bin(self.train[self.Te_cat_col], self.train['vol_trmnt_in_4_quarters'])
130 self.train = pd.concat([self.train, df_cat_te], axis=1)
131
~/SageMaker/tp/tp_kvr.py in target_encode_bin(df_te, target)
366 te = TargetEncoder(smoothing = 1, min_samples_leaf = 5, handle_unknown='ignore')
--> 367 df_te = te.fit_transform(df_te, target)
368 #
369 # Binning and then placing it in {col}_bin feature
~/anaconda3/envs/python3/lib/python3.6/site-packages/category_encoders/target_encoder.py in fit_transform(self, X, y, **fit_params)
249 transform(X)
250
--> 251 return self.fit(X, y, **fit_params).transform(X, y)
252
~/anaconda3/envs/python3/lib/python3.6/site-packages/category_encoders/target_encoder.py in fit(self, X, y, **kwargs)
138 self.ordinal_encoder = self.ordinal_encoder.fit(X)
139 X_ordinal = self.ordinal_encoder.transform(X)
--> 140 self.mapping = self.fit_target_encoding(X_ordinal, y)
141
142 X_temp = self.transform(X, override_return_df=True)
~/anaconda3/envs/python3/lib/python3.6/site-packages/category_encoders/target_encoder.py in fit_target_encoding(self, X, y)
164 values = switch.get('mapping')
165
--> 166 prior = self._mean = y.mean()
167
168 stats = y.groupby(X[col]).agg(['count', 'mean'])
~/anaconda3/envs/python3/lib/python3.6/site-packages/pandas/core/generic.py in stat_func(self, axis, skipna, level, numeric_only, **kwargs)
10954 skipna=skipna)
10955 return self._reduce(f, name, axis=axis, skipna=skipna,
> 10956 numeric_only=numeric_only)
10957
10958 return set_function_name(stat_func, name, cls)
~/anaconda3/envs/python3/lib/python3.6/site-packages/pandas/core/series.py in _reduce(self, op, name, axis, skipna, numeric_only, filter_type, **kwds)
3614 # TODO deprecate numeric_only argument for Categorical and use
3615 # skipna as well, see GH25303
-> 3616 return delegate._reduce(name, numeric_only=numeric_only, **kwds)
3617 elif isinstance(delegate, ExtensionArray):
3618 # dispatch to ExtensionArray interface
~/anaconda3/envs/python3/lib/python3.6/site-packages/pandas/core/arrays/categorical.py in _reduce(self, name, axis, **kwargs)
2170 if func is None:
2171 msg = 'Categorical cannot perform the operation {op}'
-> 2172 raise TypeError(msg.format(op=name))
2173 return func(**kwargs)
2174
TypeError: Categorical cannot perform the operation mean
Your target feature needs to be converted to a numeric type.
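For example, a hedged sketch (the column name comes from the traceback above; df stands for the training dataframe, and the exact cast depends on what the categorical labels hold):
import pandas as pd

# Cast the categorical target to a numeric dtype before target encoding.
df['vol_trmnt_in_4_quarters'] = pd.to_numeric(
    df['vol_trmnt_in_4_quarters'].astype(str), errors='coerce'
)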
I have a notebook on Google Colab that fails with the following error:
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/fastai/basic_train.py in fit(epochs, model, loss_func, opt, data, callbacks, metrics)
93 exception = e
---> 94 raise e
95 finally: cb_handler.on_train_end(exception)
/usr/local/lib/python3.6/dist-packages/fastai/basic_train.py in fit(epochs, model, loss_func, opt, data, callbacks, metrics)
83 xb, yb = cb_handler.on_batch_begin(xb, yb)
---> 84 loss = loss_batch(model, xb, yb, loss_func, opt, cb_handler)
85 if cb_handler.on_batch_end(loss): break
/usr/local/lib/python3.6/dist-packages/fastai/basic_train.py in loss_batch(model, xb, yb, loss_func, opt, cb_handler)
24 if opt is not None:
---> 25 loss = cb_handler.on_backward_begin(loss)
26 loss.backward()
/usr/local/lib/python3.6/dist-packages/fastai/callback.py in on_backward_begin(self, loss)
223 for cb in self.callbacks:
--> 224 a = cb.on_backward_begin(**self.state_dict)
225 if a is not None: self.state_dict['last_loss'] = a
/usr/local/lib/python3.6/dist-packages/fastai/basic_train.py in on_backward_begin(self, smooth_loss, **kwargs)
266 if self.pbar is not None and hasattr(self.pbar,'child'):
--> 267 self.pbar.child.comment = f'{smooth_loss:.4f}'
268
/usr/local/lib/python3.6/dist-packages/torch/tensor.py in __format__(self, format_spec)
377 if self.dim() == 0:
--> 378 return self.item().__format__(format_spec)
379 return object.__format__(self, format_spec)
RuntimeError: CUDA error: device-side assert triggered
During handling of the above exception, another exception occurred:
RuntimeError Traceback (most recent call last)
<ipython-input-33-dd390b1c8108> in <module>()
----> 1 lr_find(learn)
2 learn.recorder.plot()
/usr/local/lib/python3.6/dist-packages/fastai/train.py in lr_find(learn, start_lr, end_lr, num_it, stop_div, **kwargs)
26 cb = LRFinder(learn, start_lr, end_lr, num_it, stop_div)
27 a = int(np.ceil(num_it/len(learn.data.train_dl)))
---> 28 learn.fit(a, start_lr, callbacks=[cb], **kwargs)
29
30 def to_fp16(learn:Learner, loss_scale:float=512., flat_master:bool=False)->Learner:
/usr/local/lib/python3.6/dist-packages/fastai/basic_train.py in fit(self, epochs, lr, wd, callbacks)
160 callbacks = [cb(self) for cb in self.callback_fns] + listify(callbacks)
161 fit(epochs, self.model, self.loss_func, opt=self.opt, data=self.data, metrics=self.metrics,
--> 162 callbacks=self.callbacks+callbacks)
163
164 def create_opt(self, lr:Floats, wd:Floats=0.)->None:
/usr/local/lib/python3.6/dist-packages/fastai/basic_train.py in fit(epochs, model, loss_func, opt, data, callbacks, metrics)
93 exception = e
94 raise e
---> 95 finally: cb_handler.on_train_end(exception)
96
97 loss_func_name2activ = {'cross_entropy_loss': partial(F.softmax, dim=1), 'nll_loss': torch.exp, 'poisson_nll_loss': torch.exp,
/usr/local/lib/python3.6/dist-packages/fastai/callback.py in on_train_end(self, exception)
254 def on_train_end(self, exception:Union[bool,Exception])->None:
255 "Handle end of training, `exception` is an `Exception` or False if no exceptions during training."
--> 256 self('train_end', exception=exception)
257
258 class AverageMetric(Callback):
/usr/local/lib/python3.6/dist-packages/fastai/callback.py in __call__(self, cb_name, call_mets, **kwargs)
185 "Call through to all of the `CallbakHandler` functions."
186 if call_mets: [getattr(met, f'on_{cb_name}')(**self.state_dict, **kwargs) for met in self.metrics]
--> 187 return [getattr(cb, f'on_{cb_name}')(**self.state_dict, **kwargs) for cb in self.callbacks]
188
189 def on_train_begin(self, epochs:int, pbar:PBar, metrics:MetricFuncList)->None:
/usr/local/lib/python3.6/dist-packages/fastai/callback.py in <listcomp>(.0)
185 "Call through to all of the `CallbakHandler` functions."
186 if call_mets: [getattr(met, f'on_{cb_name}')(**self.state_dict, **kwargs) for met in self.metrics]
--> 187 return [getattr(cb, f'on_{cb_name}')(**self.state_dict, **kwargs) for cb in self.callbacks]
188
189 def on_train_begin(self, epochs:int, pbar:PBar, metrics:MetricFuncList)->None:
/usr/local/lib/python3.6/dist-packages/fastai/callbacks/lr_finder.py in on_train_end(self, **kwargs)
45 # restore the valid_dl we turned of on `__init__`
46 self.data.valid_dl = self.valid_dl
---> 47 self.learn.load('tmp')
48 if hasattr(self.learn.model, 'reset'): self.learn.model.reset()
49 print('LR Finder complete, type {learner_name}.recorder.plot() to see the graph.')
/usr/local/lib/python3.6/dist-packages/fastai/basic_train.py in load(self, name, device)
202 "Load model `name` from `self.model_dir` using `device`, defaulting to `self.data.device`."
203 if device is None: device = self.data.device
--> 204 self.model.load_state_dict(torch.load(self.path/self.model_dir/f'{name}.pth', map_location=device))
205 return self
206
/usr/local/lib/python3.6/dist-packages/torch/serialization.py in load(f, map_location, pickle_module)
356 f = open(f, 'rb')
357 try:
--> 358 return _load(f, map_location, pickle_module)
359 finally:
360 if new_fd:
/usr/local/lib/python3.6/dist-packages/torch/serialization.py in _load(f, map_location, pickle_module)
527 unpickler = pickle_module.Unpickler(f)
528 unpickler.persistent_load = persistent_load
--> 529 result = unpickler.load()
530
531 deserialized_storage_keys = pickle_module.load(f)
/usr/local/lib/python3.6/dist-packages/torch/serialization.py in persistent_load(saved_id)
493 if root_key not in deserialized_objects:
494 deserialized_objects[root_key] = restore_location(
--> 495 data_type(size), location)
496 storage = deserialized_objects[root_key]
497 if view_metadata is not None:
/usr/local/lib/python3.6/dist-packages/torch/serialization.py in restore_location(storage, location)
376 elif isinstance(map_location, torch.device):
377 def restore_location(storage, location):
--> 378 return default_restore_location(storage, str(map_location))
379 else:
380 def restore_location(storage, location):
/usr/local/lib/python3.6/dist-packages/torch/serialization.py in default_restore_location(storage, location)
102 def default_restore_location(storage, location):
103 for _, _, fn in _package_registry:
--> 104 result = fn(storage, location)
105 if result is not None:
106 return result
/usr/local/lib/python3.6/dist-packages/torch/serialization.py in _cuda_deserialize(obj, location)
84 'to an existing device.'.format(
85 device, torch.cuda.device_count()))
---> 86 return obj.cuda(device)
87
88
/usr/local/lib/python3.6/dist-packages/torch/_utils.py in _cuda(self, device, non_blocking, **kwargs)
74 else:
75 new_type = getattr(torch.cuda, self.__class__.__name__)
---> 76 return new_type(self.size()).copy_(self, non_blocking)
77
78
RuntimeError: cuda runtime error (59) : device-side assert triggered at /pytorch/aten/src/THC/generic/THCTensorCopy.cpp:20
There is no information about the real cause. I tried to get a proper stack trace by forcing CUDA to run synchronously (as suggested here), using a cell like this:
!export CUDA_LAUNCH_BLOCKING=1
But this does not seem to work; I still get the same error.
Is there another way that works with Google Colab?
Be sure that your target values run from zero to the number of classes minus 1. For example, if you have 100 classes, your targets should be in the range 0 to 99.
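A small sketch of that remapping, assuming the labels live in an array y (a hypothetical name, not from the question):
import numpy as np

# Map arbitrary class labels (e.g. 1..100) to the contiguous range 0..num_classes-1.
classes = np.unique(y)
class_to_idx = {c: i for i, c in enumerate(classes)}
y_zero_based = np.array([class_to_idx[c] for c in y])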
Running !export FOO=blah in a notebook is usually not useful, because ! runs the command in a sub-shell, so the effect of the statement is gone by the time the ! returns.
You might have more success by storing your python code in a file and then executing that file in a subshell:
In one cell:
%%writefile foo.py
[...your code...]
In the next cell:
!export CUDA_LAUNCH_BLOCKING=1; python3 foo.py
(or s/python3/python2/ if you're writing py2)
Switch the Hardware Accelerator type to "None" under Runtime -> Change Runtime Type. This should give you a more meaningful error message.
The proper way to set environment variables in Google Colab is to use the os module:
import os
os.environ['CUDA_LAUNCH_BLOCKING'] = "1"
Using the os module lets you set whatever environment variables you need. Setting CUDA_LAUNCH_BLOCKING this way enables proper CUDA tracebacks in Google Colab.