Is it possible to np.concatenate memory-mapped files? - python

I saved a couple of numpy arrays with np.save(), and put together they're quite huge.
Is it possible to load them all as memory-mapped files, and then concatenate and slice through all of them without ever loading anything into memory?

Using numpy.concatenate apparently loads the arrays into memory. To avoid this you can easily create a third memmap array in a new file and copy into it the values from the arrays you wish to concatenate. More efficiently, you can also append new arrays to an already existing file on disk.
In either case you must choose the right order for the array (row-major or column-major).
The following examples illustrate how to concatenate along axis 0 and axis 1.
1) concatenate along axis=0
a = np.memmap('a.array', dtype='float64', mode='w+', shape=( 5000,1000)) # 38.1MB
a[:,:] = 111
b = np.memmap('b.array', dtype='float64', mode='w+', shape=(15000,1000)) # 114 MB
b[:,:] = 222
You can define a third array reading the same file as the first array to be concatenated (here a) in mode r+ (read and write), but with the shape of the final array you want to achieve after concatenation, like:
c = np.memmap('a.array', dtype='float64', mode='r+', shape=(20000,1000), order='C')
c[5000:,:] = b
Concatenating along axis=0 does not require passing order='C', because that is already the default order.
2) concatenate along axis=1
a = np.memmap('a.array', dtype='float64', mode='w+', shape=(5000,3000)) # 114 MB
a[:,:] = 111
b = np.memmap('b.array', dtype='float64', mode='w+', shape=(5000,1000)) # 38.1MB
b[:,:] = 222
The arrays saved on disk are actually flattened, so if you create c with mode='r+' and shape=(5000,4000) without changing the array order, the first 1000 elements from the second row of a will end up in the first row of c. You can easily avoid this by passing order='F' (column-major) to memmap:
c = np.memmap('a.array', dtype='float64', mode='r+',shape=(5000,4000), order='F')
c[:, 3000:] = b
Here you end up with an updated file 'a.array' containing the concatenation result. You may repeat this process to concatenate more arrays, two at a time.
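The first approach mentioned at the top, writing both source arrays into a brand-new memmap file instead of growing an existing one, could look like the following sketch (the file names, shapes and chunk size are illustrative assumptions):
import numpy as np

# Illustrative sources; in practice these would be your existing .array files.
left = np.memmap('left.array', dtype='float64', mode='w+', shape=(5000, 1000))
left[:] = 111
right = np.memmap('right.array', dtype='float64', mode='w+', shape=(15000, 1000))
right[:] = 222

merged = np.memmap('merged.array', dtype='float64', mode='w+',
                   shape=(left.shape[0] + right.shape[0], left.shape[1]))

# Copy row blocks one chunk at a time so only a small slice is in RAM at once.
chunk = 1000
for start in range(0, left.shape[0], chunk):
    merged[start:start + chunk] = left[start:start + chunk]
for start in range(0, right.shape[0], chunk):
    merged[left.shape[0] + start:left.shape[0] + start + chunk] = right[start:start + chunk]
merged.flush()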
Related questions:
Working with big data in python and numpy, not enough ram, how to save partial results on disc?

This is perhaps just an alternative solution, but I also had a single multidimensional array spread over multiple files which I only wanted to read. I solved this issue with dask concatenation.
import numpy as np
import dask.array as da
a = np.memmap('a.array', dtype='float64', mode='r', shape=( 5000,1000))
b = np.memmap('b.array', dtype='float64', mode='r', shape=(15000,1000))
c = da.concatenate([a, b], axis=0)
This way one avoids the hacky additional file handle. The dask array can then be sliced and worked with almost like any numpy array, and when it comes time to calculate a result one calls compute.
Note that there are two caveats:
in-place re-assignment (e.g. c[::2] = 0) is not possible, so creative solutions are necessary in those cases.
this also means the original files can no longer be updated through the dask array. To save results out, dask's store method should be used; it can again write into a memmapped array.
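A small sketch of that workflow, reusing a.array and b.array from the first example above (the output file name c.array and the slice taken are just illustrations):
import numpy as np
import dask.array as da

a = np.memmap('a.array', dtype='float64', mode='r', shape=(5000, 1000))
b = np.memmap('b.array', dtype='float64', mode='r', shape=(15000, 1000))
c = da.concatenate([a, b], axis=0)

# Slicing and reductions stay lazy; nothing is read until compute() is called.
col_mean = c[::2, :10].mean(axis=0).compute()

# Write results out into a fresh memmap instead of mutating the dask array.
out = np.memmap('c.array', dtype='float64', mode='w+', shape=c.shape)
da.store(c, out)
out.flush()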

If you use order='F', it leads to another problem: the next time you load the file it will be quite a mess, even if you pass order='F' again. So my solution is below; I have tested it a lot and it works fine.
fp = your old memmap...        # the existing memmapped array
shape = fp.shape
data = your ndarray...         # the array to append along the last axis
data_shape = data.shape
concat_shape = data_shape[:-1] + (data_shape[-1] + shape[-1],)
print('concat shape: {}'.format(concat_shape))
# mode='w+' creates the new, larger file; use 'r+' only if it already exists with this size
new_fp = np.memmap(new_file_name, dtype='float32', mode='w+', shape=concat_shape)
if len(concat_shape) == 1:
    new_fp[:shape[0]] = fp[:]
    new_fp[shape[0]:] = data[:]
elif len(concat_shape) == 2:
    new_fp[:, :shape[-1]] = fp[:]
    new_fp[:, shape[-1]:] = data[:]
elif len(concat_shape) == 3:
    new_fp[:, :, :shape[-1]] = fp[:]
    new_fp[:, :, shape[-1]:] = data[:]
fp = new_fp
fp.flush()
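For concreteness, here is a hypothetical 2-D usage of the snippet above; the file names, shapes and values are made up for illustration:
import numpy as np

fp = np.memmap('old.array', dtype='float32', mode='w+', shape=(100, 300))  # "your old memmap"
fp[:] = 1.0
data = np.full((100, 50), 2.0, dtype='float32')                            # "your ndarray"
new_file_name = 'new.array'

shape = fp.shape
concat_shape = data.shape[:-1] + (data.shape[-1] + shape[-1],)             # (100, 350)
new_fp = np.memmap(new_file_name, dtype='float32', mode='w+', shape=concat_shape)
new_fp[:, :shape[-1]] = fp[:]
new_fp[:, shape[-1]:] = data[:]
new_fp.flush()
print(new_fp.shape)  # (100, 350)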

Related

give a key value to an np array

I have an array that is composed of multiple np arrays. I want to give every array a key and convert it to an HDF5 file
arr = np.concatenate((Hsp_data, Hsp_rdiff, PosC44_WKS, PosX_WKS, PosY_WKS, PosZ_WKS,
RMS_Acc_HSp, RMS_Acc_Rev, RMS_Schall, Rev_M, Rev_rdiff, X_rdiff, Z_I, Z_rdiff, time), axis=1)
d1 = np.random.random(size=(7501, 15))
hf = h5py.File('data.hdf5', 'w')
hf.create_dataset('arr', data=d1)
hf.close()
hf = h5py.File('data.hdf5', 'r+')
print(hf.key)
This is what I have done so far, and I get this error: AttributeError: 'File' object has no attribute 'key'.
I want the final answer to be like this when printing the keys
<KeysViewHDF5 ['Hsp_M', 'Hsp_rdiff', 'PosC44_WKS', 'PosX_WKS', 'PosY_WKS', 'PosZ_WKS', 'RMS_Acc_HSp', 'RMS_Acc_Rev', 'RMS_Schall', 'Rev_M', 'Rev_rdiff', 'X_rdiff', 'Z_I', 'Z_rdiff']>
any ideas?
You/we need a clearer idea of how the original .mat is laid out. In h5py, the file is viewed as a nested set of groups, which are dict-like. Hence the use of keys(). At the ends of that nesting are datasets, which can be loaded (or saved from) as numpy arrays. The datasets/arrays don't have keys; it's the file and groups that have those.
Creating your file:
In [69]: import h5py
In [70]: d1 = np.random.random(size=(7501, 15))
...: hf = h5py.File('data.hdf5', 'w')
...: hf.create_dataset('arr', data=d1)
...: hf.close()
Reading it:
In [71]: hf = h5py.File('data.hdf5', 'r+')
In [72]: hf.keys()
Out[72]: <KeysViewHDF5 ['arr']>
In [73]: hf['arr']
Out[73]: <HDF5 dataset "arr": shape (7501, 15), type "<f8">
In [75]: arr = hf['arr'][:]
In [76]: arr.shape
Out[76]: (7501, 15)
'arr' is the name of the dataset that we created at the start. In this case there's no group; just the one dataset. [75] loads the dataset to an array which I called arr, but that name could be anything (like the original d1).
Arrays and datasets may have a compound dtype, which has named fields. I don't know if MATLAB uses those or not.
Without knowledge of the group and dataset layout in the original .mat, it's hard to help you. And when looking at datasets, pay particular attention to shape and dtype.
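If the goal is for keys() to list one entry per original array (as in your desired output), the usual pattern is to create one dataset per array rather than concatenating them. A minimal sketch, using a few of the array names from your concatenate call as stand-ins (the shapes are assumptions):
import h5py
import numpy as np

# Hypothetical stand-ins; replace with your real arrays.
arrays = {'Hsp_data': np.random.random((7501, 1)),
          'Hsp_rdiff': np.random.random((7501, 1)),
          'PosC44_WKS': np.random.random((7501, 1))}

with h5py.File('data.hdf5', 'w') as hf:
    for name, arr in arrays.items():
        hf.create_dataset(name, data=arr)   # one dataset (one key) per array

with h5py.File('data.hdf5', 'r') as hf:
    print(hf.keys())   # <KeysViewHDF5 ['Hsp_data', 'Hsp_rdiff', 'PosC44_WKS']>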

Is there a way to extend a PyTables EArray in the second dimension?

I have a 2D array that can grow to larger sizes than I'm able to fit in memory, so I'm trying to store it in an h5 file using PyTables. The number of rows is known beforehand but the length of each row is not known and is variable between rows. After some research, I thought something along these lines would work, where I can set the extendable dimension as the second dimension.
filename = os.path.join(tempfile.mkdtemp(), 'example.h5')
h5_file = open_file(filename, mode="w", title="Example Extendable Array")
h5_group = h5_file.create_group("/", "example_on_dim_2")
e_array = h5_file.create_earray(h5_group, "example", Int32Atom(shape=()), (100, 0)) # Assume num of rows is 100
# Add some item to index 2
print(e_array[2]) # should print an empty array
e_array[2] = np.append(e_array[2], 5) # add the value 5 to row 2
print(e_array[2]) # should print [5], currently printing empty array
I'm not sure if it's possible to add elements in this way (I might have misunderstood the way earrays work), but any help would be greatly appreciated!
You're close...but have a small misunderstanding of some of the arguments and behavior. When you create the EArray with shape=(100, 0), you don't have any data...just an object designated to have 100 rows that can add columns. You also need to use e_array.append() to add data, not np.append(). And if you are going to create a very large array, consider defining the expectedrows= parameter for improved performance as the EArray grows.
Take a look at this code.
import tables as tb
import numpy as np
filename = 'example.h5'
with tb.File(filename, mode="w", title="Example Extendable Array") as h5_file :
h5_group = h5_file.create_group("/", "example_on_dim_2")
# Assume num of rows is 100
#e_array = h5_file.create_earray(h5_group, "example", Int32Atom(shape=()), (100, 0))
e_array = h5_file.create_earray(h5_group, "example", atom=tb.IntAtom(), shape=(100, 0))
print (e_array.shape)
e_array.append(np.arange(100,dtype=int).reshape(100,1)) # append a column of values
print (e_array.shape)
print(e_array[2]) # prints [2]
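A possible follow-up, assuming the file created above (tables is already imported as tb): reopening it in append mode and calling append() again keeps growing the second (extendable) dimension.
with tb.open_file('example.h5', mode='a') as h5_file:
    e_array = h5_file.root.example_on_dim_2.example
    e_array.append(np.zeros((100, 2), dtype=int))  # add two more columns
    print(e_array.shape)  # (100, 3)
    print(e_array[2])     # [2 0 0]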
Here is an example showing how to create a VLArray (Variable Length). It is similar to the EArray example above, and follows the example from the Pytables doc (link in comment above). However, although a VLArray supports variable length rows, it does not have a mechanism to add items to an existing row (AFAIK).
import tables as tb
import numpy as np
filename = 'example_vlarray.h5'
with tb.File(filename, mode="w", title="Example Variable Length Array") as h5_file :
h5_group = h5_file.create_group("/", "vl_example")
vlarray = h5_file.create_vlarray(h5_group, "example", tb.IntAtom(), "ragged array of ints",)
# Append some (variable length) rows:
vlarray.append(np.array([0]))
vlarray.append(np.array([1, 2]))
vlarray.append([3, 4, 5])
vlarray.append([6, 7, 8, 9])
# Now, read it through an iterator:
print('-->', vlarray.title)
for x in vlarray:
print('%s[%d]--> %s' % (vlarray.name, vlarray.nrow, x))

Convert a list of numpy arrays to a 5D numpy array

I have a database of 7000 objects (list_of_objects); each one of these files contains a numpy array of size 10x5x50x50x3. I would like to create a 5D numpy array of size 70000x5x50x50x3. I tried to do so using two for-loops. My sample code:
fnl_lst = []
for object in list_of_objects:
    my_array = read_array(object)  # size 10x5x50x50x3
    for ind in my_array:
        fnl_lst.append(ind)
fnl_lst = np.asarray(fnl_lst)  # fnl_lst.shape -> (70000,)
That code ends up producing a nested (object) numpy array that contains 70000 arrays, each of size 5x50x50x3. However, I would like instead to build a 5D array of size 70000x5x50x50x3. How can I do that?
fnl_lst = np.stack([ind for obj in list_of_objects for ind in read_array(obj)])
or, just append to the existing code:
fnl_lst = np.stack(fnl_lst)
UPD: per hpaulj's comment, if my_array is indeed 10x5x50x50x3, this might be enough:
fnl_lst = np.stack([read_array(obj) for obj in list_of_objects])
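Note that np.stack on whole 10x5x50x50x3 blocks gives a 6-D result of shape (7000, 10, 5, 50, 50, 3); if the target really is the 5-D shape 70000x5x50x50x3, concatenating along the first axis should do it. A small sketch, assuming read_array and list_of_objects as in the question:
fnl_arr = np.concatenate([read_array(obj) for obj in list_of_objects], axis=0)
print(fnl_arr.shape)  # (70000, 5, 50, 50, 3)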

Pandas/numpy array filling

I've a Pandas dataframe, read from csv, that contains X and Y coordinates and a value that I need to put in a matrix and save to a text file. So, I created a numpy array with max(X) by max(Y) extent.
I've this file:
fid,x,y,agblongo_tch_alive
2368458,1,1,45.0126083457747
2368459,1,2,44.8996854102889
2368460,2,2,45.8565022933761
2358154,3,1,22.6352522929758
2358155,3,3,23.1935887499899
And I need this one:
45.01 44.89 -9999.00
-9999.00 45.85 -9999.00
22.63 -9999.00 23.19
To do that, I'm using a loop like this:
for row in data.iterrows():
    p[int(row[1][2]), int(row[1][1])] = row[1][3]
and then I save it to disk using np.array2string. It works.
As the original csv has 68 M lines, it's taking a lot of time to process, so I wonder if there's another, more pythonic and faster, way to do that.
Assuming the columns of your df are 'x', 'y', 'value', you can use advanced indexing
>>> x, y, value = data['x'].values, data['y'].values, data['value'].values
>>> result = np.zeros((y.max()+1, x.max()+1), value.dtype)
>>> result[y, x] = value
This will, however, not work properly if coordinates are not unique.
In that case it is safer (but slower) to use add.at:
>>> result = np.zeros((y.max()+1, x.max()+1), value.dtype)
>>> np.add.at(result, (y, x), value)
Alternatively, you can create a sparse matrix since your data happen to be in sparse coo format. Using the '.A' property you can then convert that to a normal (dense) array as needed:
>>> from scipy import sparse
>>> spM = sparse.coo_matrix((value, (y, x)), (y.max()+1, x.max()+1))
>>> (spM.A == result).all()
True
Update: if the fillvalue is not zero the above must be modified.
Method 1: replace second line with (remember this should only be used if coordinates are unique):
>>> result = np.full((y.max()+1, x.max()+1), fillvalue, value.dtype)
Method 2: does not work
Method 3: after creating spM do
>>> spM.sum_duplicates()
>>> assert spM.has_canonical_format
>>> spM.data -= fillvalue
>>> result2 = spM.A + fillvalue
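Putting it together for this question (fill value -9999 and a text file as output, using np.savetxt rather than np.array2string to write the result), here is a sketch; the file names are placeholders, the coordinates are assumed to be 1-based as in the sample csv, and depending on whether rows should correspond to x or y you may need to swap the two index arrays:
import numpy as np
import pandas as pd

data = pd.read_csv('input.csv')              # columns: fid, x, y, agblongo_tch_alive
x = data['x'].values - 1                     # shift 1-based coordinates to 0-based
y = data['y'].values - 1
value = data['agblongo_tch_alive'].values

result = np.full((y.max() + 1, x.max() + 1), -9999.0, value.dtype)
result[y, x] = value                         # assumes unique coordinate pairs

np.savetxt('output.txt', result, fmt='%.2f')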

Numpy set dtype=None, cannot splice columns and set dtype=object cannot set dtype.names

I am running Python 2.6. I have the following example where I am trying to concatenate the date and time string columns from a csv file. Based on the dtype I set (None vs object), I am seeing some differences in behavior that I cannot explain; see Question 1 and 2 at the end of the post. The exception returned is not too descriptive, and the dtype documentation doesn't mention any specific behavior to expect when dtype is set to object.
Here is the snippet:
#! /usr/bin/python
import numpy as np
# simulate a csv file
from StringIO import StringIO
data = StringIO("""
Title
Date,Time,Speed
,,(m/s)
2012-04-01,00:10, 85
2012-04-02,00:20, 86
2012-04-03,00:30, 87
""".strip())
# (Fail) case 1: dtype=None splicing a column fails
next(data) # eat away the title line
header = [item.strip() for item in next(data).split(',')] # get the headers
arr1 = np.genfromtxt(data, dtype=None, delimiter=',',skiprows=1)# skiprows=1 for the row with units
arr1.dtype.names = header # assign the header to names
# so we can do y=arr['Speed']
y1 = arr1['Speed']
# Q1 IndexError: invalid index
#a1 = arr1[:,0]
#print a1
# EDIT1:
print "arr1.shape "
print arr1.shape # (3,)
# Fails as expected TypeError: unsupported operand type(s) for +: 'numpy.ndarray' and 'numpy.ndarray'
# z1 = arr1['Date'] + arr1['Time']
# This can be worked around by specifying dtype=object, which leads to case 2
data.seek(0) # resets
# (Fail) case 2: dtype=object assign header fails
next(data) # eat away the title line
header = [item.strip() for item in next(data).split(',')] # get the headers
arr2 = np.genfromtxt(data, dtype=object, delimiter=',',skiprows=1) # skiprows=1 for the row with units
# Q2 ValueError: there are no fields define
#arr2.dtype.names = header # assign the header to names. so we can use it to do indexing
# ie y=arr['Speed']
# y2 = arr['Date'] + arr['Time'] # column headings were assigned previously by arr.dtype.names = header
data.seek(0) # resets
# (Good) case 3: dtype=object but don't assign headers
next(data) # eat away the title line
header = [item.strip() for item in next(data).split(',')] # get the headers
arr3 = np.genfromtxt(data, dtype=object, delimiter=',',skiprows=1) # skiprows=1 for the row with units
y3 = arr3[:,0] + arr3[:,1] # slice the columns
print y3
# case 4: dtype=None, all data are ints, array dimension 2-D
# simulate a csv file
from StringIO import StringIO
data2 = StringIO("""
Title
Date,Time,Speed
,,(m/s)
45,46,85
12,13,86
50,46,87
""".strip())
next(data2) # eat away the title line
header = [item.strip() for item in next(data2).split(',')] # get the headers
arr4 = np.genfromtxt(data2, dtype=None, delimiter=',',skiprows=1)# skiprows=1 for the row with units
#arr4.dtype.names = header # Value error
print "arr4.shape "
print arr4.shape # (3,3)
data2.seek(0) # resets
Question 1: At comment Q1, why can I not slice a column, when dtype=None?
This could be avoided if
a) arr1 = np.genfromtxt... was initialized with dtype=object like case 3, or
b) arr1.dtype.names = ... was commented out to avoid the ValueError in case 2
Question 2: At comment Q2, why can I not set the dtype.names when dtype=object?
EDIT1:
Added a case 4 that shows that the array is 2-D if the values in the simulated csv file are all ints instead. One can slice the columns, but assigning the dtype.names would still fail.
Updated the term 'splice' to 'slice'.
Question 1
This is indexing, not 'splicing', and you can't index into the columns of data for exactly the same reason I explained to you before in my answer to Question 7 here. Look at arr1.shape - it is (3,), i.e. arr1 is 1D, not 2D. There are no columns for you to index into.
Now look at the shape of arr2 - you'll see that it's (3,3). Why is this? If you do specify dtype=desired_type, np.genfromtxt will treat every delimited part of your input string the same (i.e. as desired_type), and it will give you an ordinary, non-structured numpy array back.
I'm not quite sure what you wanted to do with this line:
z1 = arr1['Date'] + arr1['Time']
Did you mean to concatenate the date and time strings together like this: '2012-04-01 00:10'? You could do it like this:
z1 = [d + ' ' + t for d,t in zip(arr1['Date'],arr1['Time'])]
It depends what you want to do with the output (this will give you a list of strings, not a numpy array).
I should point out that, as of version 1.7, Numpy has core array types that support datetime functionality. This would allow you to do much more useful things like computing time deltas etc.
dts = np.array(z1,dtype=np.datetime64)
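For instance, subtracting two datetime64 values yields a timedelta64; a tiny sketch using timestamps from the sample data:
import numpy as np
dts = np.array(['2012-04-01 00:10', '2012-04-02 00:20'], dtype=np.datetime64)
print(dts[1] - dts[0])   # 1450 minutes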
Edit:
If you want to plot timeseries data, you can use matplotlib.dates.strpdate2num to convert your strings to matplotlib datenums, then use plot_date():
from matplotlib import dates
from matplotlib import pyplot as pp
# convert date and time strings to matplotlib datenums
dtconv = dates.strpdate2num('%Y-%m-%d%H:%M')
datenums = [dtconv(d+t) for d,t in zip(arr1['Date'],arr1['Time'])]
# use plot_date to plot timeseries
pp.plot_date(datenums,arr1['Speed'],'-ob')
You should also take a look at Pandas, which has some nice tools for visualising timeseries data.
Question 2
You can't set the names of arr2 because it is not a structured array (see above).
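As an aside, rather than assigning dtype.names after the fact, you can ask genfromtxt for a structured array up front by passing names= and skipping the header and units lines. A minimal sketch (note that skip_header is the newer spelling of skiprows, and from io import StringIO becomes StringIO.StringIO on Python 2):
import numpy as np
from io import StringIO   # StringIO.StringIO on Python 2

text = u"""Date,Time,Speed
,,(m/s)
2012-04-01,00:10, 85
2012-04-02,00:20, 86
2012-04-03,00:30, 87
"""
# skip_header=2 skips the header and units lines; names= supplies the field names
arr = np.genfromtxt(StringIO(text), dtype=None, delimiter=',',
                    skip_header=2, names=['Date', 'Time', 'Speed'])
print(arr.dtype.names)   # ('Date', 'Time', 'Speed')
print(arr['Speed'])      # [85 86 87]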
