Python: Saving / loading large array using numpy

I have saved a large array of complex numbers using python,
numpy.save(file_name, eval(variable_name))
that worked without any trouble. However, loading,
variable_name=numpy.load(file_name)
yields the following error,
ValueError: total size of new array must be unchanged
Using Python 2.7.9 (64-bit); the file is 1.19 GB.

There is no problem with the size of your array; you likely didn't open your file in the right way. Try this:
import numpy as np

with open(file_name, "rb") as file_:
    variable_name = np.load(file_)
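For completeness, a minimal round-trip sketch under the same assumption (binary file handles on both ends; the file name is just illustrative):
import numpy as np

arr = np.arange(10, dtype=np.complex128)  # stand-in for the large complex array

# Save and load through explicit binary file handles.
with open("my_array.npy", "wb") as f:
    np.save(f, arr)

with open("my_array.npy", "rb") as f:
    loaded = np.load(f)

print(np.array_equal(arr, loaded))  # True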

Alternatively you can use pickle:
import pickle

# Saving:
data_file = open('filename.bi', 'wb')   # pickle requires a binary-mode file
pickle.dump(your_data, data_file)
data_file.close()

# Loading:
data_file = open('filename.bi', 'rb')
data = pickle.load(data_file)
data_file.close()

Related

Saving multiple Numpy arrays to a Numpy binary file (Python)

I want to save multiple large numpy arrays to a numpy binary file to prevent my code from crashing, but it seems like the file keeps getting overwritten when I add an array: the last array saved is what allarrays is set to when save.npy is opened and read. Here is my code:
with open('save.npy', 'wb') as f:
    for num in range(500):
        array = np.random.rand(100,400)
        np.save(f, array)

with open('save.npy', 'rb') as f:
    allarrays = np.load(f)
If the file existed before, I want it to be overwritten if the code is rerun. That's why I chose 'wb' instead of 'ab'.
alist = []
with open('save.npy', 'rb') as f:
    alist.append(np.load(f))
When you load, you have to collect all the loads in a list or something similar; np.load only loads one array, starting at the current file position.
You can try memory mapping to disk.
# merge arrays using a memory-mapped file
mm = np.memmap("mmap.bin", dtype='float32', mode='w+', shape=(500,100,400))
for num in range(500):
    mm[num] = np.random.rand(100,400)

# save the final array to an npy file
with open('save.npy', 'wb') as f:
    np.save(f, mm[:])
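Loading the merged file back should then give one array of shape (500, 100, 400) rather than 500 separate arrays:
import numpy as np

allarrays = np.load('save.npy')
print(allarrays.shape)  # (500, 100, 400)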
I ran into this problem as well, and solved it in a not-very-neat way, but perhaps it's useful for others. It's inspired by hpaulj's approach, which is incomplete (i.e., it doesn't load all of the data). Perhaps this is not how one is supposed to solve this problem to begin with... but anyhow, read on.
I had saved my data using a similar procedure as the OP,
# Saving the data in a for-loop
with open(savefilename, 'wb') as f:
    for datafilename in list_of_datafiles:
        # Do the processing
        data_to_save = ...
        np.save(f, data_to_save)
And ran into the problem that calling np.load() only loaded one of the saved arrays, none of the rest. However, I knew that the data was in principle contained in the *.npy file, given that the file size was growing during the saving loop. What was required was to simply loop over the content of the file while calling the load command repeatedly. As I didn't quite know how many arrays were contained in the file, I simply looped over the loading until it failed. It's hacky, but it works.
# Loading the data in a for-loop
data_to_read = []
with open(savefilename, 'rb') as f:
    while True:
        try:
            data_to_read.append( np.load(f) )
        except:
            print("all data has been read!")
            break
Then you can call, e.g., len(data_to_read) to see how many of the arrays are contained in it. Calling, e.g., data_to_read[0] gives you the first saved array, etc.

How to read more efficiently in Python?

I am trying to read a file with 27000+ lines, and reading it takes too long: 30 minutes or more. Just to clarify, I am running it in a Coursera external Jupyter notebook, so I don't think it is a system limitation.
with open(filename) as training_file:
    # Your code starts here
    file = training_file.read()
    lines = file.split('\n')
    images = []
    labels = []
    images = np.array(images)
    labels = np.array(labels)
    c = 0
    for line in lines[1:]:
        row = line.split(',')
        labels = np.append(labels, row[0])
        images = np.append(images, np.array_split(row[1:], 28))
        c += 1
        print(c)
    images = images.astype(np.float64)
    labels = labels.astype(np.float64)
    # Your code ends here
return images, labels
Use the built-in numpy functions for reading a CSV (fromfile, genfromtxt etc) rather than rolling your own; they're written in C and much faster than doing the same thing in Python.
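For instance, a rough sketch with np.loadtxt, assuming the file is comma-separated with a header row and the label in the first column, as in the code above:
import numpy as np

# Parse the whole CSV in one call; skiprows=1 skips the header line.
data = np.loadtxt(filename, delimiter=',', skiprows=1)

labels = data[:, 0]
images = data[:, 1:].reshape(-1, 28, 28)  # 784 pixel columns -> 28x28 images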
Are you sure that it takes too much time because of file reading? Comment out the numpy code and run only the file-reading part. In my opinion, numpy.append is the slowest part. Have a look at this: NumPy append vs Python append.
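A sketch of that idea applied to the loop above: collect rows in plain Python lists and convert to arrays once at the end (variable names follow the question, and lines comes from the earlier read):
import numpy as np

labels = []
images = []
for line in lines[1:]:
    row = line.split(',')
    labels.append(row[0])
    images.append(np.array_split(row[1:], 28))

# A single conversion at the end instead of repeated np.append calls.
labels = np.array(labels, dtype=np.float64)
images = np.array(images, dtype=np.float64)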
You can save on memory by reading the file line by line with a for loop:
with open("filename") as f:
    for line in f:
        <your code>
But as mentioned in other comments, there are CSV tools that will be much faster: see the csv module or numpy.
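A minimal csv-module sketch under the same assumptions (header row, label first, 784 pixel columns per line):
import csv
import numpy as np

labels = []
images = []
with open(filename) as f:
    reader = csv.reader(f)
    next(reader)                                   # skip the header row
    for row in reader:
        labels.append(float(row[0]))
        images.append([float(x) for x in row[1:]])

labels = np.array(labels)
images = np.array(images).reshape(-1, 28, 28)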

Compressing numpy float arrays

I have a large numpy float array (~4k x 16k, float64) that I want to store on disk. I am trying to understand the differences between the following compression approaches:
1) Use np.save - Save in .npy format and zip this using GZIP (like in one of the answers to Compress numpy arrays efficiently)
import gzip
import numpy

f = gzip.GzipFile("my_file.npy.gz", "w")
numpy.save(f, my_array)
f.close()
I get equivalent file sizes if I do the following as well
numpy.save('my_file',my_array)
check_call(['gzip', os.getcwd()+'/my_file.npy'])
2) Write the array into a binary file using tofile(). Close the file and zip this generated binary file using GZIP.
f = open("my_file","wb")
my_array.tofile(f)
f.close()
with open('my_file', 'rb') as f_in:
with gzip.open('my_file.gz', 'wb') as f_out:
shutil.copyfileobj(f_in, f_out)
The above is a workaround for the following code, which does not achieve any compression. This is expected according to the GzipFile docs.
f = gzip.GzipFile("my_file_1.gz", "w")
my_array.tofile(f)
f.close()
Here is my question: the file size using 1) is about 6 times smaller than that using 2). From what I understand of the .npy format, it is exactly the same as a raw binary file except for a small header that preserves the dtype and array shape. I don't see any reason why the file sizes should differ so drastically.
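One way to sanity-check that understanding is to load both files back and compare the payloads directly (a quick sketch, assuming both files were written from the same my_array before compression):
import numpy as np

a_npy = np.load("my_file.npy")                     # .npy keeps dtype and shape
a_raw = np.fromfile("my_file", dtype=a_npy.dtype)  # tofile() output is flat raw bytes

print(a_npy.dtype, a_npy.shape)
print(np.array_equal(a_npy.ravel(), a_raw))        # True if the payloads really match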

HDF5 core driver (H5FD_CORE): loading selected dataset(s)

Currently, I load HDF5 data in python via h5py and read a dataset into memory.
f = h5py.File('myfile.h5', 'r')
dset = f['mydataset'][:]
This works, but if 'mydataset' is the only dataset in myfile.h5, then the following is more efficient:
f = h5py.File('myfile.h5', 'r', driver='core')
dset = f['mydataset'][:]
I believe this is because the 'core' driver memory maps the entire file, which is an optimised way of loading data into memory.
My question is: is it possible to use 'core' driver on selected dataset(s)? In other words, on loading the file I only wish to memory map selected datasets and/or groups. I have a file with many datasets and I would like to load each one into memory sequentially. I cannot load them all at once, since on aggregate they won't fit in memory.
I understand one alternative is to split my single HDF5 file with many datasets into many HDF5 files with one dataset each. However, I am hoping there might be a more elegant solution, possibly using h5py low-level API.
Update: Even if what I am asking is not possible, can someone explain why using driver='core' has substantially better performance when reading in a whole dataset? Is reading the only dataset of an HDF5 file into memory very different from memory mapping it via core driver?
I guess it is the same problem as if you read the file by looping over an arbitrary axis without setting a proper chunk cache size.
If you are reading it with the core driver, it is guaranteed that the whole file is read sequentially from disk, and everything else (decompressing, converting chunked data to contiguous data, ...) is done completely in RAM.
I used the simplest form of fancy slicing example from here https://stackoverflow.com/a/48405220/4045774 to write the data.
import h5py as h5
import time
import numpy as np
import h5py_cache as h5c

def Reading():
    File_Name_HDF5 = 'Test.h5'

    t1 = time.time()
    f = h5.File(File_Name_HDF5, 'r', driver='core')
    dset = f['Test'][:]
    f.close()
    print(time.time() - t1)

    t1 = time.time()
    f = h5c.File(File_Name_HDF5, 'r', chunk_cache_mem_size=1024**2*500)
    dset = f['Test'][:]
    f.close()
    print(time.time() - t1)

    t1 = time.time()
    f = h5.File(File_Name_HDF5, 'r')
    dset = f['Test'][:]
    print(time.time() - t1)
    f.close()

if __name__ == "__main__":
    Reading()
This gives on my machine 2.38 s (core driver), 2.29 s (with a 500 MB chunk cache) and 4.29 s (with the default chunk cache of 1 MB).
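If you would rather avoid the extra h5py_cache dependency, h5py 2.9+ lets you set the chunk cache directly when opening the file; a sketch, worth checking the rdcc_* parameters against your h5py version:
import h5py as h5

# Roughly equivalent to the h5py_cache call above: a 500 MB raw-data chunk cache.
f = h5.File('Test.h5', 'r', rdcc_nbytes=500 * 1024**2)
dset = f['Test'][:]
f.close()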

Unable to load CIFAR-10 dataset: Invalid load key '\x1f'

I'm currently playing around with some neural networks in TensorFlow - I decided to try working with the CIFAR-10 dataset. I downloaded the "CIFAR-10 python" dataset from the website: https://www.cs.toronto.edu/~kriz/cifar.html.
In Python, I also tried directly copying the code that is provided to load the data:
def unpickle(file):
    import pickle
    with open(file, 'rb') as fo:
        dict = pickle.load(fo, encoding='bytes')
    return dict
However, when I run this, I end up with the following error: _pickle.UnpicklingError: invalid load key, '\x1f'. I've also tried opening the file using the gzip module (with gzip.open(file, 'rb') as fo:), but this didn't work either.
Is the dataset simply bad, or is this an issue with my code? If the dataset is bad, where can I obtain the proper dataset for CIFAR-10?
Extract your *.gz file and use this code:
from six.moves import cPickle
f = open("path/data_batch_1", 'rb')
datadict = cPickle.load(f,encoding='latin1')
f.close()
X = datadict["data"]
Y = datadict['labels']
Just extract your tar.gz file; you will get a folder containing data_batch_1, data_batch_2, ...
After that, just use the code provided to load the data into your project:
def unpickle(file):
    import pickle
    with open(file, 'rb') as fo:
        dict = pickle.load(fo, encoding='bytes')
    return dict
dict = unpickle('data_batch_1')
It seems that you need to unzip the *.gz file and then untar the *.tar file to get a folder of data batches. Afterwards you can apply pickle.load() on these batches.
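For example, a short sketch using the standard tarfile module (assuming the downloaded archive is named cifar-10-python.tar.gz):
import tarfile

# Extracting yields cifar-10-batches-py/ with data_batch_1 ... data_batch_5 and test_batch.
with tarfile.open("cifar-10-python.tar.gz", "r:gz") as tar:
    tar.extractall()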
I was facing the same problem using Jupyter (VS Code) and Python 3.8/3.7. I tried to edit the source cifar.py / cifar10.py, but without success.
The solution for me was to run these two lines of code in a separate, normal .py file:
from tensorflow.keras.datasets import cifar10
cifar10.load_data()
After that it worked fine in Jupyter.
Try this:
import pickle
import _pickle as cPickle
import gzip

with gzip.open(path_of_your_cpickle_file, 'rb') as f:
    var = cPickle.load(f)
Or try it this way:
import pickle
import gzip

with gzip.open(path, "rb") as f:
    loaded = pickle.load(f, encoding='bytes')
It works for me.
