Google Colab RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED - python

Yesterday and today, running the same Python notebooks that I have been running for the past few months, I am getting the error
/usr/local/lib/python3.6/dist-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables)
97 Variable._execution_engine.run_backward(
98 tensors, grad_tensors, retain_graph, create_graph,
---> 99 allow_unreachable=True) # allow_unreachable flag
100
101
RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED
The point in the code where this error occurs seems to be random, since it changes from run to run. From what I have searched, it looks like a compatibility issue.
Also, if I rerun the cell, I might get another error, which is:
/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py in __next__(self)
346 data = self._dataset_fetcher.fetch(index) # may raise StopIteration
347 if self._pin_memory:
--> 348 data = _utils.pin_memory.pin_memory(data)
349 return data
350
/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/pin_memory.py in pin_memory(data)
53 return type(data)(*(pin_memory(sample) for sample in data))
54 elif isinstance(data, container_abcs.Sequence):
---> 55 return [pin_memory(sample) for sample in data]
56 elif hasattr(data, "pin_memory"):
57 return data.pin_memory()
/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/pin_memory.py in <listcomp>(.0)
53 return type(data)(*(pin_memory(sample) for sample in data))
54 elif isinstance(data, container_abcs.Sequence):
---> 55 return [pin_memory(sample) for sample in data]
56 elif hasattr(data, "pin_memory"):
57 return data.pin_memory()
/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/pin_memory.py in pin_memory(data)
45 def pin_memory(data):
46 if isinstance(data, torch.Tensor):
---> 47 return data.pin_memory()
48 elif isinstance(data, string_classes):
49 return data
RuntimeError: cuda runtime error (700) : an illegal memory access was encountered at /pytorch/aten/src/THC/THCCachingHostAllocator.cpp:278
Does anyone else have the same problem? Did anyone solve it, and if so, how?

Finally, I solved the problem.
Somewhere in my code I use a CrossEntropyLoss function with an ignore_index parameter, ignore_index=my_ignore_index. By mistake I had set my_ignore_index = -1, which is not a valid value for my data; -1 never appears in my data values. Setting it correctly solved the problem. This fixed the "... an illegal memory access was encou..." error.
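For illustration only (this is not the original code; the class count and the ignore value are assumptions), a minimal sketch of how ignore_index is meant to line up with the target labels:

import torch
import torch.nn as nn

# Hypothetical 3-class example; positions labelled IGNORE_LABEL are skipped by the loss.
IGNORE_LABEL = -100                               # must be a value that really marks "ignore" in your targets
criterion = nn.CrossEntropyLoss(ignore_index=IGNORE_LABEL)

logits = torch.randn(4, 3, requires_grad=True)    # (batch, num_classes)
targets = torch.tensor([0, 2, IGNORE_LABEL, 1])   # all other labels must lie in [0, num_classes)
loss = criterion(logits, targets)
loss.backward()                                   # on GPU, out-of-range labels here are what trigger the CUDA assert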
The other thing that I did, and which helped solve the problem, was to use a newer version of anaconda3. This fixed the CUDNN_STATUS_NOT_INITIALIZED error.
I hope that helps.

Related

How to write a proper dataset_fn in tff.simulation.FilePerUserClientData?

I'm currently implementing federated learning using tff.
Because the dataset is very large, we split it into many npy files, and I'm currently putting the dataset together using tff.simulation.FilePerUserClientData.
This is what I'm trying to do:
client_ids_to_files = dict()
for i in range(len(train_filepaths)):
    client_ids_to_files[str(i)] = train_filepaths[i]

def dataset_fn(filepath):
    print(filepath)
    dataSample = np.load(filepath)
    label = filepath[:-4].strip().split('_')[-1]
    return tf.data.Dataset.from_tensor_slices((dataSample, label))

train_filePerClient = tff.simulation.FilePerUserClientData(client_ids_to_files, dataset_fn)
However, it doesn't seem to work: the filepath passed to the callback function is a tensor with dtype string. The value of filepath is: Tensor("hash_table_Lookup/LookupTableFindV2:0", shape=(), dtype=string)
Instead of containing a path from client_ids_to_files, the tensor seems to contain something that looks like an error message. Am I doing something wrong? How can I write a proper dataset_fn for tff.simulation.FilePerUserClientData using npy files?
EDIT:
Here is the error log. The error itself is not directly related to my question, but it shows which functions are being called:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-46-e61ddbe06cdb> in <module>
22 return tf.data.Dataset.from_tensor_slices(filepath)
23
---> 24 train_filePerClient = tff.simulation.FilePerUserClientData(client_ids_to_files,dataset_fn)
25
~/fasttext-venv/lib/python3.6/site-packages/tensorflow_federated/python/simulation/file_per_user_client_data.py in __init__(self, client_ids_to_files, dataset_fn)
52 return dataset_fn(client_ids_to_files[client_id])
53
---> 54 @computations.tf_computation(tf.string)
55 def dataset_computation(client_id):
56 client_ids_to_path = tf.lookup.StaticHashTable(
~/fasttext-venv/lib/python3.6/site-packages/tensorflow_federated/python/core/impl/wrappers/computation_wrapper.py in __call__(self, tff_internal_types, *args)
405 parameter_type)
406 args, kwargs = unpack_arguments_fn(next(wrapped_fn_generator))
--> 407 result = fn_to_wrap(*args, **kwargs)
408 if result is None:
409 raise ComputationReturnedNoneError(fn_to_wrap)
~/fasttext-venv/lib/python3.6/site-packages/tensorflow_federated/python/simulation/file_per_user_client_data.py in dataset_computation(client_id)
59 list(client_ids_to_files.values())), '')
60 client_path = client_ids_to_path.lookup(client_id)
---> 61 return dataset_fn(client_path)
62
63 self._create_tf_dataset_fn = create_dataset_for_filename_fn
<ipython-input-46-e61ddbe06cdb> in dataset_fn(filepath)
17 filepath = tf.print(filepath)
18 print(filepath)
---> 19 dataSample = np.load(filepath)
20 print(dataSample)
21 label = filepath[:-4].strip().split('_')[-1]
~/fasttext-venv/lib/python3.6/site-packages/numpy/lib/npyio.py in load(file, mmap_mode, allow_pickle, fix_imports, encoding)
426 own_fid = False
427 else:
--> 428 fid = open(os_fspath(file), "rb")
429 own_fid = True
430
TypeError: expected str, bytes or os.PathLike object, not Operation
The problem is that dataset_fn must be serializable as a tf.Graph. This is required because TFF uses TensorFlow graphs to execute logic on remote machines.
In this case, np.load is not serializable to a graph operation. It looks like numpy is used to load the file from disk into memory, and then tf.data.Dataset.from_tensor_slices is used to create a dataset from an in-memory object. It may be possible to save the file in a different format and use a native tf.data.Dataset operation to load it from disk, rather than using Python. Some options could be tf.data.TFRecordDataset, tf.data.TextLineDataset, or tf.data.experimental.SqlDataset.
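As a rough sketch of that idea (not from the original answer; it assumes each .npy file holds a 2-D float array with a known per-row feature size, and that each client's integer label is known at conversion time):

import numpy as np
import tensorflow as tf

FEATURE_DIM = 300  # assumed per-row feature size

# Offline, in plain Python: convert one client's .npy file into a TFRecord file.
def npy_to_tfrecord(npy_path, tfrecord_path, label):
    data = np.load(npy_path)
    with tf.io.TFRecordWriter(tfrecord_path) as writer:
        for row in data:
            example = tf.train.Example(features=tf.train.Features(feature={
                'x': tf.train.Feature(float_list=tf.train.FloatList(value=row.ravel())),
                'y': tf.train.Feature(int64_list=tf.train.Int64List(value=[int(label)])),
            }))
            writer.write(example.SerializeToString())

# dataset_fn built only from tf.data / tf.io ops, so TFF can serialize it into a TensorFlow graph.
def dataset_fn(tfrecord_path):
    feature_spec = {
        'x': tf.io.FixedLenFeature([FEATURE_DIM], tf.float32),
        'y': tf.io.FixedLenFeature([1], tf.int64),
    }
    def parse(serialized):
        parsed = tf.io.parse_single_example(serialized, feature_spec)
        return parsed['x'], parsed['y']
    return tf.data.TFRecordDataset(tfrecord_path).map(parse)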

“.. in pandas..html.py .. self.fmt.col_space.items()”: AttributeError: 'NoneType' object has no attribute 'items'

I'm trying to find the error in a word similarity calculation.
def word_similarity_error_analysis(eval_df):
    eval_df['distance_rank'] = _normalized_ranking(eval_df['distance'])
    eval_df['score_rank'] = _normalized_ranking(eval_df['score'])
    eval_df['error'] = abs(eval_df['distance_rank'] - eval_df['score_rank'])
    return eval_df.sort_values('error')

def _normalized_ranking(series):
    ranks = series.rank(method='dense')
    return ranks / ranks.sum()

word_similarity_error_analysis(eval_df).head()
And I'm getting this error:
AttributeError Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/IPython/core/formatters.py in __call__(self, obj)
336 method = get_real_method(obj, self.print_method)
337 if method is not None:
--> 338 return method()
339 return None
340 else:
2 frames
/usr/local/lib/python3.6/dist-packages/pandas/core/frame.py in _repr_html_(self)
732 return buf.getvalue()
733
--> 734 max_rows = get_option("display.max_rows")
735 min_rows = get_option("display.min_rows")
736 max_cols = get_option("display.max_columns")
/usr/local/lib/python3.6/dist-packages/pandas/io/formats/format.py in to_html(self, buf, encoding, classes, notebook, border)
980 Whether the generated HTML is for IPython Notebook.
981 border : int
--> 982 A ``border=border`` attribute is included in the opening
983 ``<table>`` tag. Default ``pd.options.display.html.border``.
984 """
/usr/local/lib/python3.6/dist-packages/pandas/io/formats/html.py in __init__(self, formatter, classes, border)
57 self.col_space = {
58 column: f"{value}px" if isinstance(value, int) else value
---> 59 for column, value in self.fmt.col_space.items()
60 }
61
AttributeError: 'NoneType' object has no attribute 'items'
word1 word2 score ... distance_rank score_rank error
1041 hummingbird pelican -32.0 ... 0.000243 0.000244 2.434543e-07
2315 lily pigs -13.0 ... 0.000488 0.000487 4.016842e-07
2951 bucket girls -4.0 ... 0.000602 0.000603 4.151568e-07
150 night sunset -43.0 ... 0.000102 0.000103 6.520315e-07
2062 oak petals -17.0 ... 0.000435 0.000436 7.162632e-07
[5 rows x 7 columns]
I've seen many people face the same type of error, but I can't find a suitable solution. What's wrong with this code?
Also, how is a result being produced after the error? The exact line where the error occurs isn't visible.
(I'm using Google Colab and pandas version 1.0.5, in case you need it.)
I'm encountering the same issue on an AWS SageMaker notebook (Python 3.6), with relatively standard pandas code. I have a growing suspicion that this is related to a Python/pandas version mismatch, because the same code on a local notebook with Python 3.7 does not reproduce the exception.
I have yet to validate that this is definitely the issue, because the environments are not quite identical, but try running the same code with the same data on Python > 3.6.
Up(down)grading to pandas==1.3.3 fixed this for me when running under Jupyter on AWS SageMaker as of February 2022 - from the notebook:
!pip install pandas==1.3.3
Pandas 1.3.5 didn't work for me.
I don't know whether it's a version mismatch or a regression, but it's finally solved for me at least! :)
Had the same problem, also using Jupyter with pandas. print(df) worked fine, but evaluating df on its own only returned an exception. After updating the conda packages, the problem disappeared.
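For completeness, a small sketch of the two workarounds mentioned above, reusing the names from the question (pin the pandas version, or bypass the failing HTML repr):

# Pin pandas to a version that renders correctly in this environment:
#   !pip install pandas==1.3.3

# Or sidestep the HTML repr (the code path that fails in html.py) and print plain text instead:
result = word_similarity_error_analysis(eval_df)
print(result.head().to_string())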

File load error: not enough storage available with 1.7TB storage free

I'm using the following code to load my files in NIfTI format in Python.
import nibabel as nib

img_arr = []
for i in range(len(datadir)):
    img = nib.load(datadir[i])
    img_data = img.get_fdata()
    img_arr.append(img_data)
    img.uncache()
A small number of images works fine, but if I try to load more images, I get the following error:
OSError Traceback (most recent call last)
<ipython-input-55-f982811019c9> in <module>()
10 #img = nilearn.image.smooth_img(datadir[i],fwhm = 3) #Smoothing filter for preprocessing (necessary?)
11 img = nib.load(datadir[i])
---> 12 img_data = img.get_fdata()
13 img_arr.append(img_data)
14 img.uncache()
~\AppData\Roaming\Python\Python36\site-packages\nibabel\dataobj_images.py in get_fdata(self, caching, dtype)
346 if self._fdata_cache.dtype.type == dtype.type:
347 return self._fdata_cache
--> 348 data = np.asanyarray(self._dataobj).astype(dtype, copy=False)
349 if caching == 'fill':
350 self._fdata_cache = data
~\AppData\Roaming\Python\Python36\site-packages\numpy\core\_asarray.py in asanyarray(a, dtype, order)
136
137 """
--> 138 return array(a, dtype, copy=False, order=order, subok=True)
139
140
~\AppData\Roaming\Python\Python36\site-packages\nibabel\arrayproxy.py in __array__(self)
353 def __array__(self):
354 # Read array and scale
--> 355 raw_data = self.get_unscaled()
356 return apply_read_scaling(raw_data, self._slope, self._inter)
357
~\AppData\Roaming\Python\Python36\site-packages\nibabel\arrayproxy.py in get_unscaled(self)
348 offset=self._offset,
349 order=self.order,
--> 350 mmap=self._mmap)
351 return raw_data
352
~\AppData\Roaming\Python\Python36\site-packages\nibabel\volumeutils.py in array_from_file(shape, in_dtype, infile, offset, order, mmap)
507 shape=shape,
508 order=order,
--> 509 offset=offset)
510 # The error raised by memmap, for different file types, has
511 # changed in different incarnations of the numpy routine
~\AppData\Roaming\Python\Python36\site-packages\numpy\core\memmap.py in __new__(subtype, filename, dtype, mode, offset, shape, order)
262 bytes -= start
263 array_offset = offset - start
--> 264 mm = mmap.mmap(fid.fileno(), bytes, access=acc, offset=start)
265
266 self = ndarray.__new__(subtype, shape, dtype=descr, buffer=mm,
OSError: [WinError 8] Not enough storage is available to process this command
I thought that img.uncache() would delete the image from memory so it wouldn't take up too much storage, while still letting me work with the image array. Adding it to the code didn't change anything, though.
Does anyone know how I can fix this? The computer I'm working on has a 24-core 2.6 GHz CPU, more than 52 GB of memory, and the working directory has over 1.7 TB of free storage. I'm trying to load around 1500 MRI images from the ADNI database.
Any suggestions are much appreciated.
This error is not caused by the 1.7 TB hard drive filling up; it's caused by running out of memory, i.e. RAM. It's important to understand how those two things differ.
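A rough back-of-the-envelope check makes the difference concrete (the volume shape below is an assumption, since the question doesn't state it):

# get_fdata() returns float64, so every voxel costs 8 bytes once loaded.
voxels_per_volume = 256 * 256 * 170      # a typical structural MRI grid (assumed)
bytes_per_volume = voxels_per_volume * 8
total_gb = 1500 * bytes_per_volume / 1e9
print(round(total_gb))                   # ~134 GB, far more than the 52 GB of RAM available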
uncache() does not remove an item from memory completely, as documented here, but that link also contains more memory-saving tips.
If you want to remove an object from memory completely, you can use the Garbage Collector interface, like so:
import nibabel as nib
import gc

img_arr = []
for i in range(len(datadir)):
    img = nib.load(datadir[i])
    img_data = img.get_fdata()
    img_arr.append(img_data)
    img.uncache()
    # Delete the img object and free the memory
    del img
    gc.collect()
That should help reduce the amount of memory you are using.
How to fix "not enough storage available.."?
Try to do these steps:
Press the Windows + R key at the same time on your keyboard, then type Regedit.exe in the Run window and click on OK.
Then Unfold HKEY_LOCAL_MACHINE, then SYSTEM, then CurrentControlSet, then services, then LanmanServer, then Parameters.
Locate IRPStackSize (If found skip to step 5), If it does not exist then right-click the right Window and choose New > Dword Value (32)
Now type IRPStackSize under the name, then hit enter.
Right-click IRPStackSize and click on Modify, then set any value higher then 15 but lower than 50 and click OK
Restart your system and try to repeat the same action as you did when the error occurred.
Or :
Set the following registry key HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management\LargeSystemCache to value "1"
Set the following registry
HKLM\SYSTEM\CurrentControlSet\Services\LanmanServer\Parameters\Size to value "3"
Another way to save memory in nibabel:
Besides the uncache() method, you can also use:
- the array proxy instead of get_fdata()
- the caching keyword argument to get_fdata()
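A minimal sketch of those two options, reusing datadir from the question (the slicing pattern is only an illustration):

import numpy as np
import nibabel as nib

img = nib.load(datadir[0])

# 1) Array proxy: slice img.dataobj so only the needed part of the file is read,
#    instead of materialising the whole volume with get_fdata().
first_frame = np.asarray(img.dataobj[..., 0])

# 2) caching keyword: 'unchanged' returns the data without filling the internal
#    cache, so the full array is not kept alive inside the image object.
data = img.get_fdata(caching='unchanged')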

How to fix 'KeyError: dtype('float32')' in pyLDAvis

I use the pyLDAvis library to visualize my LDA topics. It worked fine before, but it gives me this error when I download the saved model files from SageMaker to my local computer. I don't know why this happens. Is it related to SageMaker?
If I run everything locally, save the model locally, and then run pyLDAvis, it works fine.
KeyError Traceback (most recent call last)
in ()
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pyLDAvis\gensim.py in prepare(topic_model, corpus, dictionary, doc_topic_dist, **kwargs)
116 See pyLDAvis.prepare for **kwargs.
117 """
--> 118 opts = fp.merge(_extract_data(topic_model, corpus, dictionary, doc_topic_dist), kwargs)
119 return vis_prepare(**opts)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pyLDAvis\gensim.py in _extract_data(topic_model, corpus, dictionary, doc_topic_dists)
46 gamma = topic_model.inference(corpus)
47 else:
---> 48 gamma, _ = topic_model.inference(corpus)
49 doc_topic_dists = gamma / gamma.sum(axis=1)[:, None]
50 else:
~\AppData\Local\Continuum\anaconda3\lib\site-packages\gensim\models\ldamodel.py in inference(self, chunk, collect_sstats)
665 # phinorm is the normalizer.
666 # TODO treat zeros explicitly, instead of adding epsilon?
--> 667 eps = DTYPE_TO_EPS[self.dtype]
668 phinorm = np.dot(expElogthetad, expElogbetad) + eps
669
KeyError: dtype('float32')
I know this is late, but I just fixed a similar problem by updating my gensim library from 3.4 to the current version, which for me is 3.8.
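A minimal sketch of that fix (the model path is a placeholder; it assumes the model was saved with gensim's own save()):

# In the environment that raises the KeyError:
#   pip install --upgrade gensim
from gensim.models import LdaModel

lda = LdaModel.load('path/to/lda_model')  # placeholder path
print(lda.dtype)                          # after the upgrade this should be a dtype the inference code knows, e.g. float32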

Why does numba not work with this nested function?

I reported the described bug here: https://github.com/numba/numba/issues/3095, if anyone is interested in the solution to the problem.
I am trying to precompile a minimization running on 3D time series data with numba. As a first step, I wanted to define a cost function, but this already fails. Here is my code:
import numpy as np
from numba import jit

@jit
def tester(axis, data):
    def lineCost(pars):
        A = pars[0]
        B = pars[1]
        return np.sum((A*axis + B - data)**2)
    return lineCost([axis, data])

tester(1, 2)
This yields a "Not implemented" error:
~/.local/lib/python3.5/site-packages/numba/lowering.py in lower(self)
171 if self.generator_info is None:
172 self.genlower = None
--> 173 self.lower_normal_function(self.fndesc)
174 else:
175 self.genlower = self.GeneratorLower(self)
~/.local/lib/python3.5/site-packages/numba/lowering.py in lower_normal_function(self, fndesc)
212 # Init argument values
213 self.extract_function_arguments()
--> 214 entry_block_tail = self.lower_function_body()
215
216 # Close tail of entry block
~/.local/lib/python3.5/site-packages/numba/lowering.py in lower_function_body(self)
237 bb = self.blkmap[offset]
238 self.builder.position_at_end(bb)
--> 239 self.lower_block(block)
240
241 self.post_lower()
~/.local/lib/python3.5/site-packages/numba/lowering.py in lower_block(self, block)
252 with new_error_context('lowering "{inst}" at {loc}', inst=inst,
253 loc=self.loc, errcls_=defaulterrcls):
--> 254 self.lower_inst(inst)
255
256 def create_cpython_wrapper(self, release_gil=False):
/usr/lib/python3.5/contextlib.py in __exit__(self, type, value, traceback)
75 value = type()
76 try:
---> 77 self.gen.throw(type, value, traceback)
78 raise RuntimeError("generator didn't stop after throw()")
79 except StopIteration as exc:
~/.local/lib/python3.5/site-packages/numba/errors.py in new_error_context(fmt_, *args, **kwargs)
583 from numba import config
584 tb = sys.exc_info()[2] if config.FULL_TRACEBACKS else None
--> 585 six.reraise(type(newerr), newerr, tb)
586
587
~/.local/lib/python3.5/site-packages/numba/six.py in reraise(tp, value, tb)
657 if value.__traceback__ is not tb:
658 raise value.with_traceback(tb)
--> 659 raise value
660
661 else:
LoweringError: Failed at object (object mode backend)
make_function(closure=$0.3, defaults=None, name=$const0.5, code=<code object lineCost at 0x7fd7ada3b810, file "<ipython-input-59-ef6835d3b147>", line 3>)
File "<ipython-input-59-ef6835d3b147>", line 3:
def tester(axis,data):
def lineCost(pars):
^
[1] During: lowering "$0.6 = make_function(closure=$0.3, defaults=None, name=$const0.5, code=<code object lineCost at 0x7fd7ada3b810, file "<ipython-input-59-ef6835d3b147>", line 3>)" at <ipython-input-59-ef6835d3b147> (3)
-------------------------------------------------------------------------------
This should not have happened, a problem has occurred in Numba's internals.
Please report the error message and traceback, along with a minimal reproducer
at: https://github.com/numba/numba/issues/new
If more help is needed please feel free to speak to the Numba core developers
directly at: https://gitter.im/numba/numba
Thanks in advance for your help in improving Numba!
Could you help me understand which part of the code causes problems for numba? That would be a very big help. Thank you!
Best,
Malte
Avoid global variables (data and axis are not local to lineCost), avoid functions that contain functions, and avoid lists ([axis, data]).
Working example
import numpy as np
from numba import jit

@jit
def lineCost(axis, data):
    return np.sum((axis*axis + data - data)**2)

@jit
def tester(axis, data):
    return lineCost(axis, data)

tester(1, 2)
Most of these things should work in the latest release, but relying on the newest features, which often contain bugs or unsupported corner cases, isn't recommended.
Even if this did work, it wouldn't surprise me much if the performance were less than expected.
Actually, it seems this was a bug that was fixed in the newest release :)
https://github.com/numba/numba/issues/3095
