I saved some arrays using numpy.savez_compressed(). One of the arrays is gigantic: it has shape (120000, 7680) and dtype float32.
Trying to load the array gave me the error below (message caught using IPython).
It seems like this is a NumPy limitation:
Numpy: apparent memory error
What are other ways to save such a huge array? (I had problems with cPickle as well)
In [5]: t=numpy.load('humongous.npz')
In [6]: humg = (t['arr_0.npy'])
/usr/lib/python2.7/dist-packages/numpy/lib/npyio.pyc in __getitem__(self, key)
229 if bytes.startswith(format.MAGIC_PREFIX):
230 value = BytesIO(bytes)
--> 231 return format.read_array(value)
232 else:
233 return bytes
/usr/lib/python2.7/dist-packages/numpy/lib/format.pyc in read_array(fp)
456 # way.
457 # XXX: we can probably chunk this to avoid the memory hit.
--> 458 data = fp.read(int(count * dtype.itemsize))
459 array = numpy.fromstring(data, dtype=dtype, count=count)
460
SystemError: error return without exception set
System: Ubuntu 12.04 64 bit, Python 2.7, numpy 1.6.1
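One commonly suggested workaround (a sketch, not from this thread) is to store the array uncompressed with numpy.save and memory-map it on load; mmap_mode only works for plain .npy files, not for compressed .npz archives:
import numpy as np

# Sketch: save uncompressed, then memory-map on load so the whole
# (120000, 7680) float32 array never has to sit in RAM at once.
a = np.zeros((120000, 7680), dtype=np.float32)  # placeholder data
np.save('humongous.npy', a)

arr = np.load('humongous.npy', mmap_mode='r')   # read-only memory map
print(arr[:10, :5])                             # only the touched pages are read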
Related
I'm using the following code to load my files in NIfTI format in Python.
import nibabel as nib
img_arr = []
for i in range(len(datadir)):
    img = nib.load(datadir[i])
    img_data = img.get_fdata()
    img_arr.append(img_data)
    img.uncache()
A small number of images works fine, but if I try to load more images, I get the following error:
OSError Traceback (most recent call last)
<ipython-input-55-f982811019c9> in <module>()
10 #img = nilearn.image.smooth_img(datadir[i],fwhm = 3) #Smoothing filter for preprocessing (necessary?)
11 img = nib.load(datadir[i])
---> 12 img_data = img.get_fdata()
13 img_arr.append(img_data)
14 img.uncache()
~\AppData\Roaming\Python\Python36\site-packages\nibabel\dataobj_images.py in get_fdata(self, caching, dtype)
346 if self._fdata_cache.dtype.type == dtype.type:
347 return self._fdata_cache
--> 348 data = np.asanyarray(self._dataobj).astype(dtype, copy=False)
349 if caching == 'fill':
350 self._fdata_cache = data
~\AppData\Roaming\Python\Python36\site-packages\numpy\core\_asarray.py in asanyarray(a, dtype, order)
136
137 """
--> 138 return array(a, dtype, copy=False, order=order, subok=True)
139
140
~\AppData\Roaming\Python\Python36\site-packages\nibabel\arrayproxy.py in __array__(self)
353 def __array__(self):
354 # Read array and scale
--> 355 raw_data = self.get_unscaled()
356 return apply_read_scaling(raw_data, self._slope, self._inter)
357
~\AppData\Roaming\Python\Python36\site-packages\nibabel\arrayproxy.py in get_unscaled(self)
348 offset=self._offset,
349 order=self.order,
--> 350 mmap=self._mmap)
351 return raw_data
352
~\AppData\Roaming\Python\Python36\site-packages\nibabel\volumeutils.py in array_from_file(shape, in_dtype, infile, offset, order, mmap)
507 shape=shape,
508 order=order,
--> 509 offset=offset)
510 # The error raised by memmap, for different file types, has
511 # changed in different incarnations of the numpy routine
~\AppData\Roaming\Python\Python36\site-packages\numpy\core\memmap.py in __new__(subtype, filename, dtype, mode, offset, shape, order)
262 bytes -= start
263 array_offset = offset - start
--> 264 mm = mmap.mmap(fid.fileno(), bytes, access=acc, offset=start)
265
266 self = ndarray.__new__(subtype, shape, dtype=descr, buffer=mm,
OSError: [WinError 8] Not enough storage is available to process this command
I thought that img.uncache() would delete the image from memory so it wouldn't take up too much memory while still letting me work with the image array. Adding it to the code didn't change anything, though.
Does anyone know how I can fix this? The computer I'm working on has a 24-core 2.6 GHz CPU, more than 52 GB of memory, and the working directory has over 1.7 TB of free storage. I'm trying to load around 1500 MRI images from the ADNI database.
Any suggestions are much appreciated.
This error is not caused by the 1.7 TB hard drive filling up; it's caused by running out of memory, i.e. RAM. It's important to understand how those two things differ.
uncache() does not remove an item from memory completely, as documented here, but that link also contains more memory-saving tips.
If you want to remove an object from memory completely, you can use the Garbage Collector interface, like so:
import nibabel as nib
import gc

img_arr = []
for i in range(len(datadir)):
    img = nib.load(datadir[i])
    img_data = img.get_fdata()
    img_arr.append(img_data)
    img.uncache()
    # Delete the img object and free the memory
    del img
    gc.collect()
That should help reduce the amount of memory you are using.
How to fix "Not enough storage is available to process this command"?
Try these steps:
1. Press the Windows + R keys at the same time, type Regedit.exe in the Run window, and click OK.
2. Expand HKEY_LOCAL_MACHINE, then SYSTEM, then CurrentControlSet, then services, then LanmanServer, then Parameters.
3. Locate IRPStackSize (if found, skip to step 5). If it does not exist, right-click in the right-hand pane and choose New > DWORD (32-bit) Value.
4. Type IRPStackSize as the name, then hit Enter.
5. Right-click IRPStackSize, click Modify, set any value higher than 15 but lower than 50, and click OK.
6. Restart your system and repeat the action you performed when the error occurred.
Or:
Set the registry key HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management\LargeSystemCache to the value "1".
Set the registry key HKLM\SYSTEM\CurrentControlSet\Services\LanmanServer\Parameters\Size to the value "3".
Another way to save memory in nibabel:
There are other ways to save memory alongside the uncache() method; you can use:
The array proxy instead of get_fdata()
The caching keyword to get_fdata()
A rough sketch of both is shown below.
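For instance, assuming a NIfTI file called example.nii.gz (a placeholder name), both options could look roughly like this:
import nibabel as nib

img = nib.load('example.nii.gz')  # placeholder file name

# Array proxy: read just one slice from disk instead of the whole volume
slab = img.dataobj[..., 0]

# Or ask get_fdata() not to keep its own full-size cache of the array
data = img.get_fdata(caching='unchanged')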
I am loading a train.csv file to fit it with a RandomForestClassifier.
Loading and processing the .csv file works fine; I am able to play around with my DataFrame.
When I try:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=150, min_samples_split=2, n_jobs=-1)
rf.fit(train, target)
I get this:
ValueError: could not convert string to float: 'D'
I have tried:
train=train.astype(float)
Replacing all 'D' with another value.
train.convert_objects(convert_numeric=True)
But the issue still persists.
I also tried printing all the ValueErrors from my csv file, but cannot find a reference to 'D'.
This is my trace:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-20-9d8e309c06b6> in <module>()
----> 1 rf.fit(train, target)
\Anaconda3\lib\site-packages\sklearn\ensemble\forest.py in fit(self, X, y, sample_weight)
222
223 # Convert data
--> 224 X, = check_arrays(X, dtype=DTYPE, sparse_format="dense")
225
226 # Remap output
\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_arrays(*arrays, **options)
279 array = np.ascontiguousarray(array, dtype=dtype)
280 else:
--> 281 array = np.asarray(array, dtype=dtype)
282 if not allow_nans:
283 _assert_all_finite(array)
\Anaconda3\lib\site-packages\numpy\core\numeric.py in asarray(a, dtype, order)
460
461 """
--> 462 return array(a, dtype, copy=False, order=order)
463
464 def asanyarray(a, dtype=None, order=None):
ValueError: could not convert string to float: 'D'
How should I approach this problem?
Since RandomForestClassifier is not (as far as I could find) a library included with Python itself, it's difficult to know exactly what's going on in your case. However, what's really happening is that at some point you're trying to convert the string 'D' into a float.
I can reproduce your error by doing:
float('D')
Now, to be able to debug this problem, I recommend that you catch the exception:
try:
    rf.fit(train, target)
except ValueError as e:
    print(e)
    # do something clever with train and target, like pprint them or something
Then you can look into what's really going on. I couldn't find much about that random forest classifier except for this link, which might help:
https://www.npmjs.com/package/random-forest-classifier
You should explore and clean your data. You probably have a 'D' somewhere in your data that your code tries to convert to a float. Printing the traceback within a try-except block is a good idea.
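For example, assuming train is a pandas DataFrame (a sketch, not part of the original answer), you can coerce each column to numeric and report the values that fail to convert:
import pandas as pd

# Sketch: find, per column, the cells that cannot be parsed as numbers.
for col in train.columns:
    coerced = pd.to_numeric(train[col], errors='coerce')
    bad = train[col][coerced.isna() & train[col].notna()]
    if not bad.empty:
        print(col, bad.unique())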
I'm really close to completing a large piece of code, but the final segment of it seems to be failing and I don't know why. What I'm trying to do here is take an image array, compare it to a different image array, and wherever the initial image array equals 1, mask out that portion in the second image array. However, I'm getting a strange error:
Code:
maskimg='omask'+str(inimgs)[5:16]+'.fits'
newmaskimg=pf.getdata(maskimg)
oimg=pf.getdata(inimgs)
for i in range(newmaskimg.shape[0]):
    for j in range(newmaskimg.shape[1]):
        if newmaskimg[i,j]==1:
            oimg[i,j]=0
pf.writeto('newestmask'+str(inimgs)[5:16]+'.fits',newmaskimg)
Error:
/home/vidur/se_files/fetch_swarp10.py in objmask(inimgs, inwhts, thresh1, thresh2, tfdel, xceng, yceng, outdir, tmpdir)
122 if newmaskimg[i,j]==1:
123 oimg[i,j]=0
--> 124 pf.writeto('newestmask'+str(inimgs)[5:16]+'.fits',newmaskimg)
125
126
/usr/local/lib/python2.7/dist-packages/pyfits/convenience.pyc in writeto(filename, data, header, output_verify, clobber, checksum)
396 hdu = PrimaryHDU(data, header=header)
397 hdu.writeto(filename, clobber=clobber, output_verify=output_verify,
--> 398 checksum=checksum)
399
400
/usr/local/lib/python2.7/dist-packages/pyfits/hdu/base.pyc in writeto(self, name, output_verify, clobber, checksum)
348 hdulist = HDUList([self])
349 hdulist.writeto(name, output_verify, clobber=clobber,
--> 350 checksum=checksum)
351
352 def _get_raw_data(self, shape, code, offset):
/usr/local/lib/python2.7/dist-packages/pyfits/hdu/hdulist.pyc in writeto(self, fileobj, output_verify, clobber, checksum)
651 os.remove(filename)
652 else:
--> 653 raise IOError("File '%s' already exists." % filename)
654 elif (hasattr(fileobj, 'len') and fileobj.len > 0):
655 if clobber:
IOError: File 'newestmaskPHOTOf105w0.fits' already exists.
If you don't care about overwriting the existing file, pyfits.writeto accepts a clobber argument to automatically overwrite existing files (it will still output a warning):
pyfits.writeto(..., clobber=True)
As an aside, let me be very emphatic that the code you posted above is very much not the right way to use Numpy. The loop in your code can be written in one line and will be orders of magnitude faster. For example, one of many possibilities is to write it like this:
oimg[newmaskimg == 1] = 0
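Putting both points together, a rough sketch of the whole block could look like this (assuming, as in your code, that pf is pyfits and that the masked image oimg is what you intend to write out):
maskimg = 'omask' + str(inimgs)[5:16] + '.fits'
newmaskimg = pf.getdata(maskimg)
oimg = pf.getdata(inimgs)

# Vectorized masking: zero out every pixel where the mask equals 1
oimg[newmaskimg == 1] = 0

# clobber=True overwrites the output file if it already exists
pf.writeto('newestmask' + str(inimgs)[5:16] + '.fits', oimg, clobber=True)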
Yes, add clobber=True. I've used this in my code before and it works just fine. Or simply run sudo rm path/to/file to get rid of the existing files so you can run the script again.
I had the same issue, and as it turns out, the clobber argument still works but won't be supported in future versions of Astropy.
The overwrite argument does the same thing and doesn't put out the warning message.
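With a recent Astropy, that would look roughly like this (a sketch; the filename and array come from the question's context):
from astropy.io import fits

# overwrite=True replaces an existing file instead of raising an error
fits.writeto('newestmask'+str(inimgs)[5:16]+'.fits', oimg, overwrite=True)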
Using Python/Numpy, I'm trying to import a file; however, the script returns an error that I believe is a memory error:
In [1]: import numpy as np
In [2]: npzfile = np.load('cuda400x400x2000.npz')
In [3]: U = npzfile['U']
---------------------------------------------------------------------------
SystemError Traceback (most recent call last)
<ipython-input-3-0539104595dc> in <module>()
----> 1 U = npzfile['U']
/usr/lib/pymodules/python2.7/numpy/lib/npyio.pyc in __getitem__(self, key)
232 if bytes.startswith(format.MAGIC_PREFIX):
233 value = BytesIO(bytes)
--> 234 return format.read_array(value)
235 else:
236 return bytes
/usr/lib/pymodules/python2.7/numpy/lib/format.pyc in read_array(fp)
456 # way.
457 # XXX: we can probably chunk this to avoid the memory hit.
--> 458 data = fp.read(int(count * dtype.itemsize))
459 array = numpy.fromstring(data, dtype=dtype, count=count)
460
SystemError: error return without exception set
If properly loaded, U will contain 400*400*2000 doubles, so that's about 2.5 GB. It seems the system has enough memory available:
bogeholm#bananabot ~/Desktop $ free -m
total used free shared buffers cached
Mem: 7956 3375 4581 0 35 1511
-/+ buffers/cache: 1827 6128
Swap: 16383 0 16383
Is this a memory issue? Can it be fixed in any way other than buying more RAM? The box is Linux Mint DE with Python 2.7.3rc2 and Numpy 1.6.2.
Cheers,
\T
I would like to use the dct function from scipy.fftpack with an array of numpy float64. However, it seems it is only implemented for np.float32. Is there any quick workaround I could use to get this done? I looked into it quickly but I am not sure of all the dependencies, so before messing everything up I thought I'd ask for tips here!
The only thing I have found so far about this is this link: http://mail.scipy.org/pipermail/scipy-svn/2010-September/004197.html
Thanks in advance.
Here is the ValueError it raises:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-12-f09567c28e37> in <module>()
----> 1 scipy.fftpack.dct(c[100])
/usr/local/Cellar/python/2.7.3/lib/python2.7/site-packages/scipy/fftpack/realtransforms.pyc in dct(x, type, n, axis, norm, overwrite_x)
118 raise NotImplementedError(
119 "Orthonormalization not yet supported for DCT-I")
--> 120 return _dct(x, type, n, axis, normalize=norm, overwrite_x=overwrite_x)
121
122 def idct(x, type=2, n=None, axis=-1, norm=None, overwrite_x=0):
/usr/local/Cellar/python/2.7.3/lib/python2.7/site-packages/scipy/fftpack/realtransforms.pyc in _dct(x, type, n, axis, overwrite_x, normalize)
215 raise ValueError("Type %d not understood" % type)
216 else:
--> 217 raise ValueError("dtype %s not supported" % tmp.dtype)
218
219 if normalize:
ValueError: dtype >f8 not supported
The problem is not the double precision; double precision is of course supported. The problem is that you have a little-endian computer and (maybe because you are loading data from a file?) have big-endian data; note the > in "dtype >f8 not supported". It seems you will simply have to cast it to native double yourself. If you know it's double precision, you probably just want to convert everything to your native order once:
c = c.astype(float)
Though I guess you could also check c.dtype.byteorder, which I think should be '=', and switch if it isn't... something along the lines of:
if c.dtype.byteorder != '=':
    c = c.astype(c.dtype.newbyteorder('='))
This should also work if you happen to have single precision or integers...
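As a rough end-to-end sketch (the array here is made up; with the scipy.fftpack version from the question, passing the big-endian data directly would trigger the same ValueError):
import numpy as np
from scipy.fftpack import dct

c = np.arange(8, dtype='>f8')  # big-endian float64, e.g. as read from a file

# Convert to native byte order before calling dct
if c.dtype.byteorder != '=':
    c = c.astype(c.dtype.newbyteorder('='))

print(dct(c))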