I am trying to scan BigTable data where some rows are 'dirty', but the scan fails depending on how it is constructed, raising (serialization?) InvalidChunk exceptions.
The code is as follows:
from google.cloud import bigtable
from google.cloud import happybase
client = bigtable.Client(project=project_id, admin=True)
instance = client.instance(instance_id)
connection = happybase.Connection(instance=instance)
table = connection.table(table_name)
for key, row in table.scan(limit=5000):  # BOOM!
    pass
Leaving out some columns, limiting the scan to fewer rows, or specifying the start and stop keys allows the scan to succeed.
I cannot tell from the stack trace which values are problematic: it varies across columns, and the scan simply fails. This makes it difficult to clean the data at the source.
When I use the Python debugger, I see that the chunk (which is of type google.bigtable.v2.bigtable_pb2.CellChunk) has no value (it is NULL / undefined):
ipdb> pp chunk.value
b''
ipdb> chunk.value_size
0
I can confirm this with the HBase shell, using the row key I got from self._row.row_key.
So the question becomes: how can a BigTable scan filter out columns which have an undefined / empty / null value?
I get the same problem from both Google Cloud APIs that return generators which internally stream data as chunks over gRPC:
google.cloud.happybase.table.Table# scan()
google.cloud.bigtable.table.Table# read_rows().consume_all()
The abbreviated stack trace is as follows:
---------------------------------------------------------------------------
InvalidChunk Traceback (most recent call last)
<ipython-input-48-922c8127f43b> in <module>()
1 row_gen = table.scan(limit=n)
2 rows = []
----> 3 for kvp in row_gen:
4 pass
.../site-packages/google/cloud/happybase/table.py in scan(self, row_start, row_stop, row_prefix, columns, timestamp, include_timestamp, limit, **kwargs)
391 while True:
392 try:
--> 393 partial_rows_data.consume_next()
394 for row_key in sorted(rows_dict):
395 curr_row_data = rows_dict.pop(row_key)
.../site-packages/google/cloud/bigtable/row_data.py in consume_next(self)
273 for chunk in response.chunks:
274
--> 275 self._validate_chunk(chunk)
276
277 if chunk.reset_row:
.../site-packages/google/cloud/bigtable/row_data.py in _validate_chunk(self, chunk)
388 self._validate_chunk_new_row(chunk)
389 if self.state == self.ROW_IN_PROGRESS:
--> 390 self._validate_chunk_row_in_progress(chunk)
391 if self.state == self.CELL_IN_PROGRESS:
392 self._validate_chunk_cell_in_progress(chunk)
.../site-packages/google/cloud/bigtable/row_data.py in _validate_chunk_row_in_progress(self, chunk)
368 self._validate_chunk_status(chunk)
369 if not chunk.HasField('commit_row') and not chunk.reset_row:
--> 370 _raise_if(not chunk.timestamp_micros or not chunk.value)
371 _raise_if(chunk.row_key and
372 chunk.row_key != self._row.row_key)
.../site-packages/google/cloud/bigtable/row_data.py in _raise_if(predicate, *args)
439 """Helper for validation methods."""
440 if predicate:
--> 441 raise InvalidChunk(*args)
InvalidChunk:
Can you show me how to scan BigTable from Python, ignoring / logging dirty rows that raise InvalidChunk?
(try ... except won't work around the generator, which is in the Google Cloud API row_data PartialRowsData class.)
Also, can you show me code to stream a table scan in chunks in BigTable?
HappyBase batch_size and scan_batching don't seem to be supported.
This was likely due to this bug: https://github.com/googleapis/google-cloud-python/issues/2980
The bug has been fixed, so this should no longer be an issue.
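If you are stuck on an affected version, one workaround is to drop empty-valued cells on the server side and page through the table in key ranges, so one bad chunk does not kill the whole scan. This is only a minimal sketch using the native client rather than HappyBase; the filter choice, batch size, and the b'\x00' resume trick are my assumptions, not the library's prescribed fix:
from google.cloud import bigtable
from google.cloud.bigtable import row_filters

client = bigtable.Client(project=project_id, admin=True)
instance = client.instance(instance_id)
table = instance.table(table_name)

# Server-side filter: keep only cells whose value is at least one byte long,
# i.e. drop the undefined / empty values that trigger InvalidChunk.
non_empty = row_filters.ValueRegexFilter(b'.+')

start_key = None
while True:
    batch = table.read_rows(start_key=start_key, limit=1000, filter_=non_empty)
    batch.consume_all()
    if not batch.rows:
        break
    for row_key in sorted(batch.rows):
        row_data = batch.rows[row_key]  # process row_data.cells here
    # Resume strictly after the last key seen in this batch.
    start_key = sorted(batch.rows)[-1] + b'\x00'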
I'm trying to analyze the error in a word-similarity calculation.
def word_similarity_error_analysis(eval_df):
    eval_df['distance_rank'] = _normalized_ranking(eval_df['distance'])
    eval_df['score_rank'] = _normalized_ranking(eval_df['score'])
    eval_df['error'] = abs(eval_df['distance_rank'] - eval_df['score_rank'])
    return eval_df.sort_values('error')

def _normalized_ranking(series):
    ranks = series.rank(method='dense')
    return ranks / ranks.sum()

word_similarity_error_analysis(eval_df).head()
And I'm getting this error:
AttributeError Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/IPython/core/formatters.py in __call__(self, obj)
336 method = get_real_method(obj, self.print_method)
337 if method is not None:
--> 338 return method()
339 return None
340 else:
2 frames
/usr/local/lib/python3.6/dist-packages/pandas/core/frame.py in _repr_html_(self)
732 return buf.getvalue()
733
--> 734 max_rows = get_option("display.max_rows")
735 min_rows = get_option("display.min_rows")
736 max_cols = get_option("display.max_columns")
/usr/local/lib/python3.6/dist-packages/pandas/io/formats/format.py in to_html(self, buf, encoding, classes, notebook, border)
980 Whether the generated HTML is for IPython Notebook.
981 border : int
--> 982 A ``border=border`` attribute is included in the opening
983 ``<table>`` tag. Default ``pd.options.display.html.border``.
984 """
/usr/local/lib/python3.6/dist-packages/pandas/io/formats/html.py in __init__(self, formatter, classes, border)
57 self.col_space = {
58 column: f"{value}px" if isinstance(value, int) else value
---> 59 for column, value in self.fmt.col_space.items()
60 }
61
AttributeError: 'NoneType' object has no attribute 'items'
word1 word2 score ... distance_rank score_rank error
1041 hummingbird pelican -32.0 ... 0.000243 0.000244 2.434543e-07
2315 lily pigs -13.0 ... 0.000488 0.000487 4.016842e-07
2951 bucket girls -4.0 ... 0.000602 0.000603 4.151568e-07
150 night sunset -43.0 ... 0.000102 0.000103 6.520315e-07
2062 oak petals -17.0 ... 0.000435 0.000436 7.162632e-07
[5 rows x 7 columns]
I have seen many people face the same type of error, but I can't find a suitable solution. What's wrong with this code?
Also, how is a result being produced after the error? The exact line where the error occurs isn't visible in the traceback.
(I'm using Google Colab and pandas version 1.0.5, in case you need it.)
I'm encountering the same issue on an AWS SageMaker notebook (Python 3.6), with relatively standard pandas code. I have a growing suspicion that this is related to a Python/pandas version mismatch, because the same code on a local notebook with Python 3.7 does not reproduce the exception.
I have yet to confirm that this is definitely the cause because the environments are not quite identical, but try running the same code with the same data on Python > 3.6. A quick way to compare the two environments is sketched below.
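A minimal sketch for comparing the environments is to print the interpreter and pandas versions in both notebooks (nothing here is specific to SageMaker):
import sys
import pandas as pd

# Print the Python interpreter and pandas versions so the two environments
# (SageMaker vs. local) can be compared side by side.
print(sys.version)
print(pd.__version__)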
Up(down)grading to pandas==1.3.3 fixed this for me when running under Jupyter on AWS SageMaker as of February 2022. From the notebook:
!pip install pandas==1.3.3
Pandas 1.3.5 didn't work for me.
I don't know whether it's a version mismatch or a regression, but it's finally solved, for me at least! :)
I had the same problem, also under Jupyter with pandas: print(df) worked fine, but evaluating df on its own only returned an exception. After updating the conda packages the problem disappeared.
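As a stopgap while the environment is being sorted out, you can bypass the HTML renderer where the AttributeError is raised and fall back to the plain-text representation. A minimal sketch, reusing the names from the question:
# Avoid DataFrame._repr_html_ (the code path that raises the AttributeError)
# by printing the plain-text representation instead of letting the notebook
# render the DataFrame as HTML.
print(word_similarity_error_analysis(eval_df).head().to_string())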
I'm using the following code to load my files in NIfTI format in Python.
import nibabel as nib

img_arr = []
for i in range(len(datadir)):
    img = nib.load(datadir[i])
    img_data = img.get_fdata()
    img_arr.append(img_data)
    img.uncache()
A small amount of images works fine, but if I want to load more images, I get the following error:
OSError Traceback (most recent call last)
<ipython-input-55-f982811019c9> in <module>()
10 #img = nilearn.image.smooth_img(datadir[i],fwhm = 3) #Smoothing filter for preprocessing (necessary?)
11 img = nib.load(datadir[i])
---> 12 img_data = img.get_fdata()
13 img_arr.append(img_data)
14 img.uncache()
~\AppData\Roaming\Python\Python36\site-packages\nibabel\dataobj_images.py in get_fdata(self, caching, dtype)
346 if self._fdata_cache.dtype.type == dtype.type:
347 return self._fdata_cache
--> 348 data = np.asanyarray(self._dataobj).astype(dtype, copy=False)
349 if caching == 'fill':
350 self._fdata_cache = data
~\AppData\Roaming\Python\Python36\site-packages\numpy\core\_asarray.py in asanyarray(a, dtype, order)
136
137 """
--> 138 return array(a, dtype, copy=False, order=order, subok=True)
139
140
~\AppData\Roaming\Python\Python36\site-packages\nibabel\arrayproxy.py in __array__(self)
353 def __array__(self):
354 # Read array and scale
--> 355 raw_data = self.get_unscaled()
356 return apply_read_scaling(raw_data, self._slope, self._inter)
357
~\AppData\Roaming\Python\Python36\site-packages\nibabel\arrayproxy.py in get_unscaled(self)
348 offset=self._offset,
349 order=self.order,
--> 350 mmap=self._mmap)
351 return raw_data
352
~\AppData\Roaming\Python\Python36\site-packages\nibabel\volumeutils.py in array_from_file(shape, in_dtype, infile, offset, order, mmap)
507 shape=shape,
508 order=order,
--> 509 offset=offset)
510 # The error raised by memmap, for different file types, has
511 # changed in different incarnations of the numpy routine
~\AppData\Roaming\Python\Python36\site-packages\numpy\core\memmap.py in __new__(subtype, filename, dtype, mode, offset, shape, order)
262 bytes -= start
263 array_offset = offset - start
--> 264 mm = mmap.mmap(fid.fileno(), bytes, access=acc, offset=start)
265
266 self = ndarray.__new__(subtype, shape, dtype=descr, buffer=mm,
OSError: [WinError 8] Not enough storage is available to process this command
I thought that img.uncache() would remove the image from memory so it wouldn't take up too much space while still letting me work with the image array, but adding it to the code didn't change anything.
Does anyone know how I can fix this? The computer I'm working on has a 24-core 2.6 GHz CPU, more than 52 GB of memory, and the working directory has over 1.7 TB of free storage. I'm trying to load around 1500 MRI images from the ADNI database.
Any suggestions are much appreciated.
This error is not caused by the 1.7 TB hard drive filling up; it's because you're running out of memory, i.e. RAM. It's important to understand how those two things differ.
uncache() does not remove an item from memory completely, as documented here, but that link also contains more memory-saving tips.
If you want to remove an object from memory completely, you can use the Garbage Collector interface, like so:
import nibabel as nib
import gc

img_arr = []
for i in range(len(datadir)):
    img = nib.load(datadir[i])
    img_data = img.get_fdata()
    img_arr.append(img_data)
    img.uncache()
    # Delete the img object and free the memory
    del img
    gc.collect()
That should help reduce the amount of memory you are using.
How to fix "not enough storage available.."?
Try these steps:
1. Press the Windows + R keys at the same time, type Regedit.exe in the Run window, and click OK.
2. Expand HKEY_LOCAL_MACHINE > SYSTEM > CurrentControlSet > Services > LanmanServer > Parameters.
3. Locate IRPStackSize (if it exists, skip to step 5). If it does not exist, right-click in the right-hand pane and choose New > DWORD (32-bit) Value.
4. Type IRPStackSize as the name, then press Enter.
5. Right-click IRPStackSize, click Modify, set any value higher than 15 but lower than 50, and click OK (a Python sketch of this registry edit is shown after the alternative below).
6. Restart your system and repeat the action that triggered the error.
Or:
Set the registry key HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management\LargeSystemCache to the value 1.
Set the registry key HKLM\SYSTEM\CurrentControlSet\Services\LanmanServer\Parameters\Size to the value 3.
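For reference, here is a minimal Python sketch of the IRPStackSize edit from step 5. It assumes the script runs with administrator rights on Windows, and the value 20 is an arbitrary choice within the suggested 15-50 range:
import winreg

key_path = r"SYSTEM\CurrentControlSet\Services\LanmanServer\Parameters"
# Open the LanmanServer parameters key with write access (requires admin rights).
with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, key_path, 0, winreg.KEY_SET_VALUE) as key:
    # Create or update the IRPStackSize DWORD; 20 lies within the 15-50 range above.
    winreg.SetValueEx(key, "IRPStackSize", 0, winreg.REG_DWORD, 20)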
Another way to save memory with nibabel:
Besides the uncache() method, there are other ways to save memory; you can use:
the array proxy instead of get_fdata()
the caching keyword to get_fdata()
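A minimal sketch of those two options; the file name is a placeholder:
import numpy as np
import nibabel as nib

img = nib.load('example_volume.nii')  # placeholder path

# Option 1: the array proxy. Slicing img.dataobj reads only the requested
# part of the file rather than materialising the full floating-point array.
first_volume = np.asarray(img.dataobj[..., 0])

# Option 2: the caching keyword. 'unchanged' avoids filling the internal
# cache, so the array can be garbage-collected as soon as you drop it.
data = img.get_fdata(caching='unchanged')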
I use the pyLDAvis library to visualize my LDA topics. It worked fine before, but I get this error when I download the saved model files from SageMaker to my local computer. I don't know why this happens. Does it relate to SageMaker?
If I run locally and save the model locally, then run pyLDAvis, it works fine.
KeyError Traceback (most recent call last)
in ()
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pyLDAvis\gensim.py in prepare(topic_model, corpus, dictionary, doc_topic_dist, **kwargs)
116 See pyLDAvis.prepare for **kwargs.
117 """
--> 118 opts = fp.merge(_extract_data(topic_model, corpus, dictionary, doc_topic_dist), kwargs)
119 return vis_prepare(**opts)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pyLDAvis\gensim.py in _extract_data(topic_model, corpus, dictionary, doc_topic_dists)
46 gamma = topic_model.inference(corpus)
47 else:
---> 48 gamma, _ = topic_model.inference(corpus)
49 doc_topic_dists = gamma / gamma.sum(axis=1)[:, None]
50 else:
~\AppData\Local\Continuum\anaconda3\lib\site-packages\gensim\models\ldamodel.py in inference(self, chunk, collect_sstats)
665 # phinorm is the normalizer.
666 # TODO treat zeros explicitly, instead of adding epsilon?
--> 667 eps = DTYPE_TO_EPS[self.dtype]
668 phinorm = np.dot(expElogthetad, expElogbetad) + eps
669
KeyError: dtype('float32')
I know this is late, but I just fixed a similar problem by updating my gensim library from 3.4 to the current version, which for me is 3.8.
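A minimal sketch of that fix from inside a notebook; the exact target version is whatever is current for you:
# Upgrade gensim in the notebook environment (restart the kernel afterwards
# so the new version is actually imported before re-loading the saved model).
!pip install --upgrade gensim

# After restarting, confirm the version:
import gensim
print(gensim.__version__)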
I want to see the implementation of the conjugate function used in NumPy, so I tried the following:
import numpy as np
import inspect
inspect.getsource(np.conjugate)
However, I received the following error message stating that <ufunc 'conjugate'> is not a module, class, method, function, traceback, frame, or code object. Can someone explain why?
In [8]: inspect.getsource(np.conjugate)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-8-821ecfb71e08> in <module>()
----> 1 inspect.getsource(np.conj)
/Users/duanlx/anaconda/python.app/Contents/lib/python2.7/inspect.pyc in getsource(object)
699 or code object. The source code is returned as a single string. An
700 IOError is raised if the source code cannot be retrieved."""
--> 701 lines, lnum = getsourcelines(object)
702 return string.join(lines, '')
703
/Users/duanlx/anaconda/python.app/Contents/lib/python2.7/inspect.pyc in getsourcelines(object)
688 original source file the first line of code was found. An IOError is
689 raised if the source code cannot be retrieved."""
--> 690 lines, lnum = findsource(object)
691
692 if ismodule(object): return lines, 0
/Users/duanlx/anaconda/lib/python2.7/site-packages/IPython/core/ultratb.pyc in findsource(object)
149 FIXED version with which we monkeypatch the stdlib to work around a bug."""
150
--> 151 file = getsourcefile(object) or getfile(object)
152 # If the object is a frame, then trying to get the globals dict from its
153 # module won't work. Instead, the frame object itself has the globals
/Users/duanlx/anaconda/python.app/Contents/lib/python2.7/inspect.pyc in getsourcefile(object)
442 Return None if no way can be identified to get the source.
443 """
--> 444 filename = getfile(object)
445 if string.lower(filename[-4:]) in ('.pyc', '.pyo'):
446 filename = filename[:-4] + '.py'
/Users/duanlx/anaconda/python.app/Contents/lib/python2.7/inspect.pyc in getfile(object)
418 return object.co_filename
419 raise TypeError('{!r} is not a module, class, method, '
--> 420 'function, traceback, frame, or code object'.format(object))
421
422 ModuleInfo = namedtuple('ModuleInfo', 'name suffix mode module_type')
TypeError: <ufunc 'conjugate'> is not a module, class, method, function, traceback, frame, or code object
Thanks!
NumPy is written in C for speed: np.conjugate is a ufunc implemented in compiled C code, and inspect.getsource can only show the source of functions that are defined in Python.
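A minimal sketch of the distinction; np.atleast_1d is just an arbitrary example of a NumPy function that is defined in a Python source file:
import inspect
import numpy as np

print(type(np.conjugate))                  # <class 'numpy.ufunc'>: compiled, no Python source
print(isinstance(np.conjugate, np.ufunc))  # True

# getsource works for NumPy functions that live in .py files:
print(inspect.getsource(np.atleast_1d)[:200])
To read the implementation of the conjugate ufunc itself, you have to browse NumPy's C sources in the repository rather than use inspect.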
I'm really close to completing a large piece of code, but the final segment seems to be failing and I don't know why. What I'm trying to do here is take an image array, compare it to a different image array, and wherever the initial image array equals 1, mask that portion out in the second image array. However, I'm getting a strange error:
Code:
import pyfits as pf

maskimg = 'omask' + str(inimgs)[5:16] + '.fits'
newmaskimg = pf.getdata(maskimg)
oimg = pf.getdata(inimgs)
for i in range(newmaskimg.shape[0]):
    for j in range(newmaskimg.shape[1]):
        if newmaskimg[i, j] == 1:
            oimg[i, j] = 0
pf.writeto('newestmask' + str(inimgs)[5:16] + '.fits', newmaskimg)
Error:
/home/vidur/se_files/fetch_swarp10.py in objmask(inimgs, inwhts, thresh1, thresh2, tfdel, xceng, yceng, outdir, tmpdir)
122 if newmaskimg[i,j]==1:
123 oimg[i,j]=0
--> 124 pf.writeto('newestmask'+str(inimgs)[5:16]+'.fits',newmaskimg)
125
126
/usr/local/lib/python2.7/dist-packages/pyfits/convenience.pyc in writeto(filename, data, header, output_verify, clobber, checksum)
396 hdu = PrimaryHDU(data, header=header)
397 hdu.writeto(filename, clobber=clobber, output_verify=output_verify,
--> 398 checksum=checksum)
399
400
/usr/local/lib/python2.7/dist-packages/pyfits/hdu/base.pyc in writeto(self, name, output_verify, clobber, checksum)
348 hdulist = HDUList([self])
349 hdulist.writeto(name, output_verify, clobber=clobber,
--> 350 checksum=checksum)
351
352 def _get_raw_data(self, shape, code, offset):
/usr/local/lib/python2.7/dist-packages/pyfits/hdu/hdulist.pyc in writeto(self, fileobj, output_verify, clobber, checksum)
651 os.remove(filename)
652 else:
--> 653 raise IOError("File '%s' already exists." % filename)
654 elif (hasattr(fileobj, 'len') and fileobj.len > 0):
655 if clobber:
IOError: File 'newestmaskPHOTOf105w0.fits' already exists.
If you don't care about overwriting the existing file, pyfits.writeto accepts a clobber argument to automatically overwrite existing files (it will still output a warning):
pyfits.writeto(..., clobber=True)
As an aside, let me be very emphatic that the code you posted above is very much not the right way to use Numpy. The loop in your code can be written in one line and will be orders of magnitude faster. For example, one of many possibilities is to write it like this:
oimg[newmaskimg == 1] = 0
Yes, add clobber=True. I've used this in my code before and it works just fine. Or simply run sudo rm path/to/file to get rid of the existing files so you can run it again.
I had the same issue, and as it turns out the clobber argument still works but won't be supported in future versions of Astropy.
The overwrite argument does the same thing and doesn't produce the warning message.
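Putting the two answers together, here is a minimal sketch using astropy.io.fits (the successor to pyfits); the file names are placeholders:
from astropy.io import fits

# Placeholder file names; substitute the real mask and image paths.
newmaskimg = fits.getdata('omask_example.fits')
oimg = fits.getdata('image_example.fits')

# Vectorized masking: zero out every pixel where the mask equals 1.
oimg[newmaskimg == 1] = 0

# overwrite=True replaces the deprecated clobber=True and avoids the
# "File already exists" error when the script is re-run.
fits.writeto('newestmask_example.fits', oimg, overwrite=True)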