I need to create a very large (~30 GB) bytearray, but when I create it I get a MemoryError because there is not enough RAM to store it. Question: is it possible to create an object in Python that has the same properties (mutability and the ability to access an arbitrary offset) but does not take up space in memory while it is empty? I only need to fill it at arbitrary places with a small amount of data.
You probably want to use NumPy's memmap. This lets you work with a numpy array of any dtype and any number of dimensions that is backed by a file on disk; a byte array is just a 1-D array with dtype uint8. You can read and write arbitrary subsections of the array without loading the whole thing into RAM.
Note that the sections you read or write stay cached in memory as long as you keep the object open. The memmap object has a flush() method to push changes to disk, but NumPy provides no way to explicitly drop the cached segments, so if memory becomes an issue you can close/delete the object and reopen it at an appropriate interval.
You use a numpy memmap object the same way you would a normal numpy array, e.g. slicing, numpy functions, etc.
https://numpy.org/doc/stable/reference/generated/numpy.memmap.html
An example from the docs is copied here; there are more examples in the docs linked above.
import numpy as np
data = np.arange(12, dtype='float32')
data.resize((3, 4))

# This example uses a temporary file so that doctest doesn't write files to
# your directory. You would use a 'normal' filename.
from tempfile import mkdtemp
import os.path as path
filename = path.join(mkdtemp(), 'newfile.dat')

# Create a memmap with dtype and shape that matches our data:
fp = np.memmap(filename, dtype='float32', mode='w+', shape=(3, 4))
fp
memmap([[0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.]], dtype=float32)

# Write data to memmap array:
fp[:] = data[:]
fp
memmap([[ 0.,  1.,  2.,  3.],
        [ 4.,  5.,  6.,  7.],
        [ 8.,  9., 10., 11.]], dtype=float32)
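Applied to the question itself, here is a minimal sketch (the file name, size, and offsets below are placeholders) of a disk-backed ~30 GB byte array that is filled at arbitrary offsets with small pieces of data:

import numpy as np

# mode='w+' creates the file; on most filesystems it is created sparse,
# so unwritten regions cost neither RAM nor disk space up front.
size = 30 * 1024**3                               # ~30 GB (placeholder)
buf = np.memmap('big.bytes', dtype=np.uint8, mode='w+', shape=(size,))

# Write a small chunk at an arbitrary offset, just like a bytearray.
offset = 12_345_678_901                           # arbitrary position (placeholder)
buf[offset:offset + 4] = np.frombuffer(b'\xde\xad\xbe\xef', dtype=np.uint8)

# Read it back.
print(buf[offset:offset + 4].tobytes())           # b'\xde\xad\xbe\xef'

# Flush changes and drop the object so the OS can release cached pages;
# reopen later with mode='r+' so the existing file is not overwritten.
buf.flush()
del buf
buf = np.memmap('big.bytes', dtype=np.uint8, mode='r+', shape=(size,))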
OpenAI has published a set of Machine Learning/Reinforcement Learning environments called 'OpenAI Gym'. Some of the environments are image based, and as such can have a very large memory footprint when used with algorithms that store hundreds of thousands or millions of frames' worth of environment observations.
While poking around in their reference implementation of DeepQ Learning I found a pair of classes, LazyFrameStack and LazyFrames that claim to "ensure that common frames between the observations are only stored once... to optimize memory usage which can be huge for DQN's 1M frames replay buffers."
In the reference implementation, the DeepQ agent gets frames stacked together in groups of four, which are then put into the replay buffer. Having looked at the implementation of both classes, it's not obvious to me how these save memory -- if anything, because LazyFrames is basically a container object around a set of four numpy arrays, shouldn't a LazyFrame have a larger memory footprint?
In Python, variables and containers hold references to objects rather than copies of them. That means that even though a LazyFrame object might wrap a list of extremely big numpy arrays, the LazyFrame object itself is small, since it only stores references to the np.ndarrays. In other words, you can think of a LazyFrame as pointing to the np.ndarray data, not actually storing its own copy of each individual array.
import numpy as np

a = np.ones((2, 3))
b = np.ones((2, 3))
X = [a, b]
print(X)
>>> [array([[1., 1., 1.],
            [1., 1., 1.]]),
     array([[1., 1., 1.],
            [1., 1., 1.]])]

X_stacked = np.stack(X)
print(X_stacked)
>>> array([[[1., 1., 1.],
            [1., 1., 1.]],
           [[1., 1., 1.],
            [1., 1., 1.]]])

a[0] = 2
print(X)
>>> [array([[2., 2., 2.],
            [1., 1., 1.]]),
     array([[1., 1., 1.],
            [1., 1., 1.]])]

print(X_stacked)
>>> array([[[1., 1., 1.],
            [1., 1., 1.]],
           [[1., 1., 1.],
            [1., 1., 1.]]])
As you can see, X (which is a list of arrays) stores only references to a and b, so when we do a[0] = 2 the change shows up when printing X. But once you stack the arrays, np.stack allocates a new array that holds its own copy of all the data.
To address your "how does it save memory" question a bit more directly, here's an example.
import sys
a = np.random.randn(210, 160, 3)
b = np.random.randn(210, 160, 3)
X = [a,b]
X_stacked = np.stack(X)
print(sys.getsizeof(X))
>>> 80
print(sys.getsizeof(X_stacked))
>>> 1612944
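To connect this back to LazyFrames: below is a rough sketch of a LazyFrames-like container (simplified; the class name and details are illustrative, not the reference implementation). It only stores references to per-step frames, so overlapping stacks in the replay buffer share the underlying arrays, and the dense copy is built only when an observation is actually used.

import numpy as np

class LazyFramesSketch:
    """Holds references to existing frame arrays instead of copying them;
    the stacked dense array is only materialized on demand."""

    def __init__(self, frames):
        self._frames = frames              # list of references, no copy made

    def asarray(self):
        # The expensive allocation/copy happens only here, when needed.
        return np.stack(self._frames)

# Four consecutive observations share three frames, so a replay buffer of
# LazyFramesSketch objects stores each frame only once.
frames = [np.zeros((84, 84), dtype=np.uint8) for _ in range(5)]
obs_t   = LazyFramesSketch(frames[0:4])
obs_tp1 = LazyFramesSketch(frames[1:5])    # reuses frames[1:4] by reference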
I use numpy.save and numpy.load to read/write large datasets in my project. I realized that numpy.save does not support an append mode. For instance (Python 3):
import numpy as np

n = 5
dim = 5
for _ in range(3):
    Matrix = np.random.choice(np.arange(10, 40, dim), size=(n, dim))
    np.save('myfile', Matrix)

M1 = np.load('myfile.npy', mmap_mode='r')[1:7].copy()
print(M1)
Loading a specific portion of the data with the slice [1:7] does not give the expected result, because np.save does not append: each call overwrites the file. I found this answer, but it looks strange (it uses file(filename, 'a'); what is file?). Is there a clever workaround to achieve this without using additional lists?
The npy file format doesn't work that way. An npy file encodes a single array, with a header specifying shape, dtype, and other metadata. You can see the npy file format spec in the NumPy docs.
Support for appending data was not a design goal of the npy format. Even if you managed to get numpy.save to append to an existing file instead of overwriting the contents, the result wouldn't be a valid npy file. Producing a valid npy file with additional data would require rewriting the header, and since this could require resizing the header, it could shift the data and require the whole file to be rewritten.
NumPy comes with no tools to append data to existing npy files, beyond reading the data into memory, building a new array, and writing the new array to a file. If you want to save more data, consider writing a new file, or pick a different file format.
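As one example of "pick a different file format": the npz container stores several named arrays in a single file. You still rewrite the whole file to add arrays, but each array is only read when its key is accessed. A minimal sketch, with placeholder names:

import numpy as np

a = np.arange(10)
b = np.zeros(10)

# Save several named arrays into one .npz container.
np.savez('bundle.npz', first=a, second=b)

# np.load on an .npz returns a lazy, dict-like archive; each array is read
# from disk only when its key is accessed.
with np.load('bundle.npz') as archive:
    print(archive.files)         # ['first', 'second']
    print(archive['second'])     # [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]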
In Python3 repeated save and load to the same open file works:
In [113]: f = open('test.npy', 'wb')
In [114]: np.save(f, np.arange(10))
In [115]: np.save(f, np.zeros(10))
In [116]: np.save(f, np.ones(10))
In [117]: f.close()
In [118]: f = open('test.npy', 'rb')
In [119]: for _ in range(3):
     ...:     print(np.load(f))
     ...:
[0 1 2 3 4 5 6 7 8 9]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
In [120]: np.load(f)
OSError: Failed to interpret file <_io.BufferedReader name='test.npy'> as a pickle
Each save writes a self-contained block of data to the file, consisting of a header block and an image of the data buffer. The header block records, among other things, the shape and dtype, which determine the length of the data buffer.
Each load reads one such header block and then the known number of data bytes.
As far as I know this is not documented, but it has been demonstrated in previous SO questions, and it is also evident from the save and load code.
Note these are separate arrays, both on saving and loading. But we can concatenate the loaded arrays into one array if the dimensions are compatible.
In [122]: f = open('test.npy', 'rb')
In [123]: np.stack([np.load(f) for _ in range(3)])
Out[123]:
array([[0., 1., 2., 3., 4., 5., 6., 7., 8., 9.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]])
In [124]: f.close()
Previous SO questions on this topic:
Append multiple numpy files to one big numpy file in python
loading arrays saved using numpy.save in append mode
The file builtin was removed in Python 3. Though I won't guarantee that it works, the Python 3 equivalent of the code in the link in your question would be
with open('myfile.npy', 'ab') as f_handle:
    np.save(f_handle, Matrix)
This should then append Matrix to 'myfile.npy'.
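Putting the two previous answers together, here is a small sketch (the file name is a placeholder) that appends arrays with repeated np.save calls and reads them back until the file is exhausted; the exact exception raised at end-of-file has varied between NumPy versions, hence the broad except clause:

import numpy as np

# Append: each np.save call writes one self-contained header + data block.
with open('stacked.npy', 'ab') as f:
    for _ in range(3):
        Matrix = np.random.choice(np.arange(10, 40, 5), size=(5, 5))
        np.save(f, Matrix)

# Read back: keep calling np.load on the same handle until no blocks remain.
arrays = []
with open('stacked.npy', 'rb') as f:
    while True:
        try:
            arrays.append(np.load(f))
        except (EOFError, OSError, ValueError):
            break

data = np.concatenate(arrays)    # one (15, 5) array, since the shapes match
print(data.shape)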
I'm following this repository (https://github.com/gitlimlab/SSGAN-Tensorflow) and trying to use my own dataset. As mentioned there:
Store your data as an h5py file datasets/YOUR_DATASET/data.hy and each data point contains
'image': has shape [h, w, c], where c is the number of channels (grayscale images: 1, color images: 3)
'label': represented as a one-hot vector
I could not find anything that helps with creating a file with the same extension (data.hy), but I tried to follow the main h5py tutorial:
import h5py
f = h5py.File("dataset.hy", "w")
dataset = f.create_dataset("default", shape=(3,10)) #I have ten classes
but to check that the initialization is correct, I printed dataset[0], which gave the following output:
In [7]: dataset.shape
Out[7]: (3, 10)
In [8]: dataset[0]
Out[8]: array([ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], dtype=float32)
This obviously means that I did not shape the dataset correctly, but I don't know how to fix it. I know that h5py follows the same shaping rules as numpy, but I'm not sure how to fix it here.
EDIT:
What I want to do is fix the shape of the dataset so that each data point has two fields, each holding a 1-D vector with a different number of elements, e.g.
[[h,w,c],[0,1,2,3,4,5,6,7,8,9]]
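For what it's worth, one common way to lay out per-point 'image' and 'label' entries with h5py is to give each data point its own group. This is only a guess at what that repository expects; the sizes and names below are illustrative:

import h5py
import numpy as np

n_points, h, w, c, n_classes = 3, 32, 32, 3, 10     # illustrative sizes

with h5py.File('data.hy', 'w') as f:
    for i in range(n_points):
        grp = f.create_group(str(i))                # one group per data point
        grp.create_dataset('image', data=np.zeros((h, w, c), dtype=np.uint8))
        one_hot = np.zeros(n_classes, dtype=np.float32)
        one_hot[i % n_classes] = 1.0
        grp.create_dataset('label', data=one_hot)

# Reading one data point back:
with h5py.File('data.hy', 'r') as f:
    print(f['0']['image'].shape)    # (32, 32, 3)
    print(f['0']['label'][...])     # the one-hot vector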
I need to write a program that collects different datasets and unites them. For this I have to read in a comma-separated matrix: each row represents an instance (in this case proteins) and each column represents an attribute of the instances. If an instance has an attribute, it is represented by a 1, otherwise 0. The matrix looks like the example given below, but much larger, with 35000 instances and hundreds of attributes.
Proteins,Attribute 1,Attribute 2,Attribute 3,Attribute 4
Protein 1,1,1,1,0
Protein 2,0,1,0,1
Protein 3,1,0,0,0
Protein 4,1,1,1,0
Protein 5,0,0,0,0
Protein 6,1,1,1,1
I need a way to store the matrix before writing it into a new file together with other information about the instances. I thought of using numpy arrays, since I would like to be able to select and check single columns. I tried to use numpy.empty to create an array of the given size, but it seems that you have to preselect the length of the strings and cannot change them afterwards.
Is there a better way to deal with such data? I also thought of dictionaries of lists, but then I cannot select single columns.
You can use numpy.loadtxt, for example:
import numpy as np
a = np.loadtxt(filename, delimiter=',',usecols=(1,2,3,4),
skiprows=1, dtype=float)
Which will result in something like:
#array([[ 1., 1., 1., 0.],
# [ 0., 1., 0., 1.],
# [ 1., 0., 0., 0.],
# [ 1., 1., 1., 0.],
# [ 0., 0., 0., 0.],
# [ 1., 1., 1., 1.]])
Or, using a structured array:
a = np.loadtxt('stack.txt', delimiter=',',usecols=(1,2,3,4),
skiprows=1, dtype=[('Attribute 1', float),
('Attribute 2', float),
('Attribute 3', float),
('Attribute 4', float)])
from where you can get each field like:
a['Attribute 1']
#array([ 1., 0., 1., 1., 0., 1.])
Take a look at pandas.
pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
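For example, a minimal sketch of reading the matrix shown above with pandas (the file name is a placeholder):

import pandas as pd

# Read the comma-separated matrix, using the protein names as the row index.
df = pd.read_csv('proteins.csv', index_col='Proteins')

print(df['Attribute 1'])                            # select a single column
print(df.loc['Protein 3'])                          # select a single instance
print(df[df['Attribute 2'] == 1].index.tolist())    # proteins with Attribute 2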
You could use genfromtxt instead:
data = np.genfromtxt('file.txt', delimiter=',', names=True, dtype=None)
With names=True this reads the header row and creates a structured array of your table, so columns can be accessed by name (spaces in the header are replaced with underscores, e.g. data['Attribute_1']).
Suppose I have a NxN matrix M (lil_matrix or csr_matrix) from scipy.sparse, and I want to make it (N+1)xN where M_modified[i,j] = M[i,j] for 0 <= i < N (and all j) and M[N,j] = 0 for all j. Basically, I want to add a row of zeros to the bottom of M and preserve the remainder of the matrix. Is there a way to do this without copying the data?
Scipy doesn't have a way to do this without copying the data but you can do it yourself by changing the attributes that define the sparse matrix.
There are 4 attributes that make up the csr_matrix:
data: An array containing the actual values in the matrix
indices: An array containing the column index corresponding to each value in data
indptr: An array in which entry i is the index into data of the first value of row i (so row i's values are data[indptr[i]:indptr[i+1]]). If a row is empty, its entry is the same as the previous one.
shape: A tuple containing the shape of the matrix
If you are simply adding a row of zeros to the bottom all you have to do is change the shape and indptr for your matrix.
import numpy as np
from scipy.sparse import csr_matrix

x = np.ones((3, 5))
x = csr_matrix(x)
x.toarray()
>> array([[ 1.,  1.,  1.,  1.,  1.],
          [ 1.,  1.,  1.,  1.,  1.],
          [ 1.,  1.,  1.,  1.,  1.]])

# reshape is not implemented for csr_matrix, but you can cheat and do it yourself.
x._shape = (4, 5)

# Update indptr to let it know we added a row with nothing in it, by appending
# the last value in indptr to the end.
# Note that you are still copying the indptr array.
x.indptr = np.hstack((x.indptr, x.indptr[-1]))

x.toarray()
>> array([[ 1.,  1.,  1.,  1.,  1.],
          [ 1.,  1.,  1.,  1.,  1.],
          [ 1.,  1.,  1.,  1.,  1.],
          [ 0.,  0.,  0.,  0.,  0.]])
Here is a function to handle the more general case of vstacking any 2 csr_matrices. You still end up copying the underlying numpy arrays but it is still significantly faster than the scipy vstack method.
def csr_vappend(a, b):
    """Takes in 2 csr_matrices and appends the second one to the bottom of the first one.
    Much faster than scipy.sparse.vstack but assumes the type to be csr and overwrites
    the first matrix instead of copying it. The data, indices, and indptr still get copied."""
    a.data = np.hstack((a.data, b.data))
    a.indices = np.hstack((a.indices, b.indices))
    a.indptr = np.hstack((a.indptr, (b.indptr + a.nnz)[1:]))
    a._shape = (a.shape[0] + b.shape[0], b.shape[1])
    return a
Not sure if you're still looking for a solution, but others may want to look into scipy.sparse.hstack and vstack - http://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.hstack.html. You can define a csr_matrix for the single additional row of zeros and then vstack it with the original matrix.
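A minimal sketch of that vstack approach (it does copy the data, consistent with the other answers):

import numpy as np
from scipy.sparse import csr_matrix, vstack

M = csr_matrix(np.ones((3, 5)))

# An empty 1 x N csr_matrix represents the extra row of zeros.
zero_row = csr_matrix((1, M.shape[1]))

M_modified = vstack([M, zero_row], format='csr')
print(M_modified.shape)             # (4, 5)
print(M_modified.toarray()[-1])     # [0. 0. 0. 0. 0.]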
I don't think that there is any way to really escape from doing the copying. Both of those types of sparse matrices store their data as Numpy arrays (in the data and indices attributes for csr and in the data and rows attributes for lil) internally and Numpy arrays can't be extended.
Update with more information:
LIL does stand for LInked List, but the current implementation doesn't quite live up to the name. The Numpy arrays used for data and rows both have dtype object, and each element of these arrays is actually a Python list (an empty list when all values in a row are zero). Python lists aren't exactly linked lists, but they are close enough and, frankly, a better choice given their O(1) look-up. Personally, I don't immediately see the point of using a Numpy array of objects here rather than just a Python list. You could fairly easily change the current lil implementation to use Python lists instead, which would allow you to add a row without copying the whole matrix.
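To see this structure for yourself, a quick sketch that inspects the rows and data attributes of a lil_matrix:

from scipy.sparse import lil_matrix

m = lil_matrix((3, 4))
m[0, 1] = 5.0
m[2, 3] = 7.0

# Both attributes are 1-D object arrays containing one Python list per row.
print(m.rows)    # [list([1]) list([]) list([3])]     (column indices per row)
print(m.data)    # [list([5.0]) list([]) list([7.0])] (stored values per row)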