Shuffling multiple HDF5 datasets in-place - python

I have multiple HDF5 datasets saved in the same file, my_file.h5. These datasets have different dimensions, but the same number of observations in the first dimension:
features.shape = (1000000, 24, 7, 1)
labels.shape = (1000000,)
info.shape = (1000000, 4)
It is important that the info/label data is correctly connected to each set of features and I therefore want to shuffle these datasets with an identical seed. Furthermore, I would like to shuffle these without ever loading them fully into memory. Is that possible using numpy and h5py?

Shuffling arrays on disk will be time consuming, as it means that you have to allocate new arrays in the HDF5 file and then copy all the rows in a different order. You can iterate over rows (or use chunks of rows) if you want to avoid loading all the data into memory at once with PyTables or h5py.
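If you do decide to write shuffled copies to disk, a minimal sketch of that chunk-by-chunk copy could look like this (the file and dataset names follow the question; the '_shuffled' suffix and the block size of 10,000 are arbitrary choices of mine):
import numpy as np
import h5py

perm = np.random.permutation(1000000)   # one permutation shared by all datasets
block = 10000

with h5py.File('my_file.h5', 'r+') as f:
    for name in ('features', 'labels', 'info'):
        src = f[name]
        dst = f.create_dataset(name + '_shuffled', shape=src.shape, dtype=src.dtype)
        for start in range(0, src.shape[0], block):
            idx = perm[start:start + block]
            order = np.argsort(idx)            # h5py fancy indexing needs increasing indices
            rows = src[idx[order]]             # one chunked read from disk
            dst[start:start + len(idx)] = rows[np.argsort(order)]  # restore the shuffled order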
An alternative approach could be to keep your data as it is and simply map new row numbers to old row numbers in a separate array (which you can keep fully loaded in RAM, since it will be only 4MB with your array sizes). For instance, to shuffle a numpy array x,
x = np.random.rand(5)
idx_map = np.arange(x.shape[0])
np.random.shuffle(idx_map)
Then you can use advanced numpy indexing to access your shuffled data,
x[idx_map[2]]  # equivalent to x_shuffled[2]
x[idx_map]     # equivalent to x_shuffled[:], etc.
This will also work with arrays saved to HDF5. Of course there will be some overhead compared to writing the shuffled arrays to disk, but it could be sufficient depending on your use case.
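One caveat when applying idx_map to h5py datasets: h5py's fancy indexing wants the selection in increasing order, so a hedged sketch for reading one shuffled batch is to sort the batch indices, read once, and then undo the sort (the dataset name comes from the question; the batch size of 256 is a placeholder):
import numpy as np
import h5py

idx_map = np.arange(1000000)
np.random.shuffle(idx_map)

with h5py.File('my_file.h5', 'r') as f:
    batch = idx_map[:256]                      # shuffled row numbers for this batch
    order = np.argsort(batch)                  # h5py fancy indexing wants increasing indices
    rows = f['features'][batch[order], ...]    # a single read from disk
    rows = rows[np.argsort(order)]             # put the rows back into shuffled order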

Shuffling arrays like this in numpy is straightforward:
Create the large shuffling index (shuffle np.arange(1000000)) and index the arrays
features = features[I, ...]
labels = labels[I]
info = info[I, :]
This isn't an in-place operation. labels[I] is a copy of labels, not a slice or view.
An alternative
features[I,...] = features
looks on the surface like it is an inplace operation. I doubt that it is, down in the C code. It has to be buffered, because the I values are not guaranteed to be unique. In fact there is a special ufunc .at method for unbuffered operations.
But look at what h5py says about this same sort of 'fancy indexing':
http://docs.h5py.org/en/latest/high/dataset.html#fancy-indexing
labels[I] selection is implemented, but with restrictions.
List selections may not be empty
Selection coordinates must be given in increasing order
Duplicate selections are ignored
Very long lists (> 1000 elements) may produce poor performance
Your shuffled I is, by definition, not in increasing order. And it is very large.
Also, I don't see anything about using this fancy indexing on the left-hand side, labels[I] = ....

import numpy as np
import h5py
data = h5py.File('original.h5py', 'r')
with h5py.File('output.h5py', 'w') as out:
    indexes = np.arange(data['some_dataset_in_original'].shape[0])
    np.random.shuffle(indexes)
    for key in data.keys():
        print(key)
        feed = np.take(data[key], indexes, axis=0)
        out.create_dataset(key, data=feed)

Related

Converting 2D numpy array to 3D array without looping

I have a 2D array of shape (t*40,6) which I want to convert into a 3D array of shape (t,40,5) for the LSTM's input data layer. A description of the desired conversion is shown in the figure below. Here, F1..5 are the 5 input features, T1...40 are the time steps for the LSTM, and C1...t are the various training examples. Basically, for each unique "Ct", I want a "T X F" 2D array and to concatenate them all along the 3rd dimension. I do not mind losing the value of "Ct" as long as each Ct is in a different dimension.
I have the following code to do this by looping over each unique Ct, and appending the "T X F" 2D arrays in 3rd dimension.
# load 2d data
data = pd.read_csv('LSTMTrainingData.csv')
trainX = []
# loop over each unique ct and append the 2D subset in the 3rd dimension
for index, ct in enumerate(data.ct.unique()):
    trainX.append(data[data['ct'] == ct].iloc[:, 1:])
However, there are over 1,800,000 such Ct's so this makes it quite slow to loop over each unique Ct. Looking for suggestions on doing this operation faster.
EDIT:
data_3d = array.reshape(t,40,6)
trainX = data_3d[:,:,1:]
This is the solution for the original question posted.
Updating the question with an additional problem: the T1...40 time steps can have at most 40 steps, but there can be fewer than 40 as well. The remaining values out of the 40 available slots can be 'np.nan'.
Since not all Ct have the same length, you have no choice but to rebuild a new block.
But using data[data['ct'] == ct] can be O(n²), so it's a bad way to do it.
Here is a solution using Panel. cumcount renumbers the lines within each Ct:
import pandas as pd
from numpy.random import randint

t = 5
CFt = randint(0, t, (40*t, 6)).astype(float)  # 2D data
df = pd.DataFrame(CFt)
df2 = df.set_index([df[0], df.groupby(0).cumcount()]).sort_index()
df3 = df2.to_panel()
This automatically fills missing data with NaN. But it warns:
DeprecationWarning:
Panel is deprecated and will be removed in a future version.
The recommended way to represent these types of 3-dimensional data are with a MultiIndex on a DataFrame, via the Panel.to_frame() method
So perhaps working with df2 is the recommended way to manage your data.
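If you would rather avoid Panel altogether (since it is deprecated), a hedged numpy-only sketch of the same NaN-padded 3D construction, assuming the original data DataFrame with a 'ct' column followed by the feature columns and at most 40 steps per Ct:
import numpy as np
import pandas as pd

codes, uniques = pd.factorize(data['ct'])        # integer id for each Ct
step = data.groupby('ct').cumcount().values      # position of each row within its Ct
n_feat = data.shape[1] - 1                       # feature columns after 'ct'

trainX = np.full((len(uniques), 40, n_feat), np.nan)   # (examples, time steps, features)
trainX[codes, step, :] = data.iloc[:, 1:].values       # scatter rows into place; gaps stay NaN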

Construct sparse matrix on disk on the fly in Python

I'm currently doing some memory-intensive text processing, for which I have to construct a sparse matrix of float32s with dimensions of ~ (2M, 5M). I'm constructing this matrix column by column while reading a corpus of 5M documents. For this purpose I use a sparse dok_matrix data structure from SciPy. However, when arriving at the 500,000th document, my memory is full (approx. 30GB is used) and the program crashes. What I eventually want to do is perform a dimensionality reduction algorithm on the matrix using sklearn, but, as said, it is impossible to hold and construct the entire matrix in memory. I've looked into numpy.memmap, as sklearn supports this, and tried to memmap some of the underlying numpy data structures of the SciPy sparse matrix, but I could not get it to work.
It is impossible for me to save the entire matrix in a dense format, since this would require 40TB of disk space. So I think that HDF5 and PyTables are not an option for me (?).
My question is now: how can I construct a sparse matrix on the fly, but writing directly to disk instead of memory, and such that I can use it afterwards in sklearn?
Thanks!
We've come across similar problems in the field of single cell genomics data dealing with large sparse datasets on disk. I'll show you a small simple example of how I would deal with this. My assumptions are that you're very memory constrained, and probably can't fit multiple copies of the sparse matrix into memory at once. This will work even if you can't fit one entire copy.
I would construct an on disk sparse CSC matrix column by column. A sparse csc matrix uses 3 underlying arrays:
data: the values stored in the matrix
indices: the row index for each value in the matrix
indptr: an array of length n_cols + 1, which divides indices and data by which column they belong to.
As an explanatory example, the values for column i are stored in the range indptr[i]:indptr[i+1] of data. Similarly, the row indices for these values can be found by indices[indptr[i]:indptr[i+1]].
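As a tiny illustration of those three arrays (an added example, not part of the original answer), this is what scipy reports for a small matrix:
import numpy as np
from scipy import sparse

# A 3 x 2 example matrix:
# [[1, 0],
#  [0, 2],
#  [3, 0]]
m = sparse.csc_matrix(np.array([[1, 0], [0, 2], [3, 0]]))
print(m.data)     # [1 3 2]  -> values, stored column by column
print(m.indices)  # [0 2 1]  -> row index of each value
print(m.indptr)   # [0 2 3]  -> column i occupies data[indptr[i]:indptr[i+1]]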
To simulate your data generating process (parsing a document, I assume) I'll define a function process_document which returns the values for indices and data for the relevant document.
import numpy as np
import h5py
from scipy import sparse
from tqdm import tqdm # For monitoring the writing process
from typing import Tuple, Union # Just for argument annotation
def process_document():
    """
    Simulate processing a document. Results in a sparse vector representation.
    """
    n_items = np.random.negative_binomial(2, .0001)
    indices = np.random.choice(2_000_000, n_items, replace=False)
    indices.sort()
    data = np.random.random(n_items).astype(np.float32)
    return indices, data

def data_generator(n):
    """Iterator which yields simulated data."""
    for i in range(n):
        yield process_document()
Now I'll create a group in an hdf5 file which will store the constituent arrays of a sparse matrix.
def make_sparse_csc_group(f: Union[h5py.File, h5py.Group], groupname: str, shape: Tuple[int, int]):
    """
    Create a group in an hdf5 file that can store a CSC sparse matrix.
    """
    g = f.create_group(groupname)
    g.attrs["shape"] = shape
    g.create_dataset("indices", shape=(1,), dtype=np.int64, chunks=True, maxshape=(None,))
    g["indptr"] = np.zeros(shape[1] + 1, dtype=int)  # We want this to have a zero for the first value
    g.create_dataset("data", shape=(1,), dtype=np.float32, chunks=True, maxshape=(None,))
    return g
And finally a function for reading this group as a sparse matrix (this one is pretty simple).
def read_sparse_csc_group(g: Union[h5py.File, h5py.Group]):
    return sparse.csc_matrix((g["data"], g["indices"], g["indptr"]), shape=g.attrs["shape"])
Now we'll create the on-disk sparse matrix and write one column at a time to it (I'm using fewer columns since this can be kinda slow).
N_COLS = 10
def make_disk_matrix(f, groupname, data_iter, shape):
    group = make_sparse_csc_group(f, groupname, shape)
    indptr = group["indptr"]
    data = group["data"]
    indices = group["indices"]
    n_total = 0
    for doc_num, (cur_indices, cur_data) in enumerate(tqdm(data_iter)):
        n_cur = len(cur_indices)
        n_prev = n_total
        n_total += n_cur
        indices.resize((n_total,))
        data.resize((n_total,))
        indices[n_prev:] = cur_indices
        data[n_prev:] = cur_data
        indptr[doc_num + 1] = n_total

# Writing
with h5py.File("data.h5", "w") as f:
    make_disk_matrix(f, "mtx", data_generator(10), (2_000_000, 10))

# Reading
with h5py.File("data.h5", "r") as f:
    mtx = read_sparse_csc_group(f["mtx"])
Again this is considering a very memory constrained situation, where you might not be able to fit the entire sparse matrix in memory when creating it. A much faster way to do this, if you can handle the entire sparse matrix plus at least one copy, would be to not bother with the on disk storage (similar to other suggestions). However, using a slight modification of this code should give you better performance:
def make_memory_mtx(data_iter, shape):
    indices_list = []
    data_list = []
    indptr = np.zeros(shape[1] + 1, dtype=int)
    n_total = 0
    for doc_num, (cur_indices, cur_data) in enumerate(data_iter):
        n_cur = len(cur_indices)
        n_prev = n_total
        n_total += n_cur
        indices_list.append(cur_indices)
        data_list.append(cur_data)
        indptr[doc_num + 1] = n_total
    indices = np.concatenate(indices_list)
    data = np.concatenate(data_list)
    return sparse.csc_matrix((data, indices, indptr), shape=shape)

mtx = make_memory_mtx(data_generator(10), shape=(2_000_000, 10))
This should be fairly fast, since it only copies the data once, when you concatenate the arrays. The other currently posted solutions reallocate the arrays as you process, making many copies of large arrays.
It would be great if you could provide a minimal working code. I can't see if your matrix gets too big by construction (1) or just because you have too much data (2). If you don't really care about building this matrix yourself, you can directly look at my remark 2.
For problem (1), in the example code below, I made a wrapper class that builds a csr_matrix chunk by chunk. The idea is to keep adding (row, column, data) lists until a buffer limit is reached (see remark 1), and only then actually update the matrix. At that point the data in memory shrinks, since the csr_matrix constructor sums entries that share the same (row, column) coordinates. This lets you construct the sparse matrix quickly (much faster than creating a sparse matrix for each row) and avoids memory errors caused by the redundancy of (row, column) pairs when a word appears several times in a document.
import numpy as np
import scipy.sparse

class SparseMatrixBuilder():
    def __init__(self, shape, build_size_limit):
        self.sparse_matrix = scipy.sparse.csr_matrix(shape)
        self.shape = shape
        self.build_size_limit = build_size_limit
        self.data_temp = []
        self.col_indices_temp = []
        self.row_indices_temp = []

    def add(self, data, col_indices, row_indices):
        self.data_temp.append(data)
        self.col_indices_temp.append(col_indices)
        self.row_indices_temp.append(row_indices)
        if len(self.data_temp) == self.build_size_limit:
            # scipy's constructor expects (data, (row_indices, col_indices))
            self.sparse_matrix += scipy.sparse.csr_matrix(
                (np.concatenate(self.data_temp),
                 (np.concatenate(self.row_indices_temp),
                  np.concatenate(self.col_indices_temp))),
                shape=self.shape
            )
            self.data_temp = []
            self.col_indices_temp = []
            self.row_indices_temp = []

    def get_matrix(self):
        if self.data_temp:  # flush anything still buffered
            self.sparse_matrix += scipy.sparse.csr_matrix(
                (np.concatenate(self.data_temp),
                 (np.concatenate(self.row_indices_temp),
                  np.concatenate(self.col_indices_temp))),
                shape=self.shape
            )
            self.data_temp = []
            self.col_indices_temp = []
            self.row_indices_temp = []
        return self.sparse_matrix
For problem (2), you can easily extend this class by adding a save method that stores the matrix on disk once the limit (or a second limit) is reached. As such, you'll end up with multiple chunks of sparse matrices on disk. Then you'll need a dimensionality reduction algorithm that can handle chunked matrices (see remark 2).
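A hedged sketch of such a save method, not part of the original class: it subclasses SparseMatrixBuilder and dumps each finished chunk with scipy.sparse.save_npz (the class name, file-name pattern and flush policy are my own choices):
import scipy.sparse

class ChunkedSparseMatrixBuilder(SparseMatrixBuilder):
    """Hypothetical extension: write each finished chunk to disk and start over."""
    def __init__(self, shape, build_size_limit, prefix="chunk"):
        super().__init__(shape, build_size_limit)
        self.prefix = prefix
        self.n_chunks = 0

    def save_chunk(self):
        # flush the buffers, write '<prefix>_<k>.npz', then reset the in-memory matrix
        scipy.sparse.save_npz(f"{self.prefix}_{self.n_chunks}.npz", self.get_matrix())
        self.n_chunks += 1
        self.sparse_matrix = scipy.sparse.csr_matrix(self.shape)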
remark 1: the buffer limit here is not really well defined. It would be better to check the actual size of the numpy arrays data_temp, col_indices_temp and row_indices_temp against the RAM available on the machine (which is quite easy to automate with python).
remark 2: gensim is a python library that has the great advantage of using chunked files for building NLP models. So you could build a dictionary, construct a sparse matrix and reduce its dimension with that library, without needing much RAM.
I'm assuming that all your data can fit in memory using a more memory-friendly sparse matrix format such as COO. If it does not, there is almost no hope you will be able to proceed with sklearn, even by using mmap. Indeed sklearn will likely create subsequent objects with memory requirements of the same order of magnitude as your input.
Scipy's dok_matrix is actually a subclass of the vanilla dict. It stores the data using individual Python objects and tons of pointers, so it is not memory efficient. The most compact representation is the coo_matrix format. You can incrementally build the data required to create a COO matrix by pre-allocating arrays for the coordinates (rows and cols) and the data, and eventually grow these buffers if your initial guess was wrong.
import numpy
from scipy.sparse import coo_matrix

def get_coo_from_iter(iterable, n_data_hint=1<<20, idx_dtype='uint32', data_dtype='float32'):
    counter = 0
    rows = numpy.empty(n_data_hint, dtype=idx_dtype)
    cols = numpy.empty(n_data_hint, dtype=idx_dtype)
    data = numpy.empty(n_data_hint, dtype=data_dtype)
    for row, col, value in iterable:
        if counter >= n_data_hint:
            n_data_hint *= 2
            rows, cols, data = _reallocate(rows, cols, data, n_data_hint)
        rows[counter] = row
        cols[counter] = col
        data[counter] = value
        counter += 1
    rows = rows[:counter]
    cols = cols[:counter]
    data = data[:counter]
    return coo_matrix((data, (rows, cols)))

def _reallocate(rows, cols, data, n):
    new_rows = numpy.empty(n, dtype=rows.dtype)
    new_cols = numpy.empty(n, dtype=cols.dtype)
    new_data = numpy.empty(n, dtype=data.dtype)
    new_rows[:rows.size] = rows
    new_cols[:cols.size] = cols
    new_data[:data.size] = data
    return new_rows, new_cols, new_data
which you can test with randomly-generated data like this:
def get_random_data(n, max_row=2000, max_col=5000):
    for _ in range(n):
        row = numpy.random.choice(max_row)
        col = numpy.random.choice(max_col)
        val = numpy.random.randn()
        yield row, col, val
# test when initial hint is good
coo = get_coo_from_iter(get_random_data(10000), n_data_hint=10000)
print(coo.shape)
# or to test when initial hint was too tiny
coo = get_coo_from_iter(get_random_data(10000), n_data_hint=1111)
print(coo.shape)
Once you have your COO matrix, you may want to convert to CSR using coo.tocsr(). The CSR matrices are more optimized for common operations such as dot product.
CSR requires a bit more memory when some rows were originally empty, because it stores pointers for all rows, even empty ones.
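For completeness, the conversion itself is a one-liner; as a side effect, duplicate (row, col) entries in the COO data are summed during the conversion:
csr = coo.tocsr()   # duplicate coordinates are summed here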
Look here: at the end he explains how to store and read a sparse matrix directly to an HDF5 file.
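The approach linked there typically amounts to storing the CSR matrix's data, indices and indptr arrays plus its shape in an HDF5 group; here is a hedged sketch (the group and function names are my own):
import h5py
from scipy import sparse

def save_csr(h5file, group_name, m):
    """Store a CSR matrix as its data/indices/indptr arrays plus a shape attribute."""
    g = h5file.create_group(group_name)
    g.create_dataset("data", data=m.data)
    g.create_dataset("indices", data=m.indices)
    g.create_dataset("indptr", data=m.indptr)
    g.attrs["shape"] = m.shape

def load_csr(h5file, group_name):
    g = h5file[group_name]
    return sparse.csr_matrix((g["data"][:], g["indices"][:], g["indptr"][:]),
                             shape=tuple(g.attrs["shape"]))

with h5py.File("sparse.h5", "w") as f:
    save_csr(f, "X", sparse.random(100, 50, density=0.1, format="csr"))
with h5py.File("sparse.h5", "r") as f:
    X = load_csr(f, "X")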

pandas' memory usage for list of SparseSeries

I'm trying to create a list of SparseSeries from a sparse numpy matrix. Creating the lil_matrix is fast and does not consume a lot of memory (in reality my dimensions are more on the order of millions, i.e. 15 million samples and 4 million features). I have read a previous topic on this, but that solution as well seems to eat up all my memory, freezing my computer. On the surface it looks like the pandas SparseSeries is not really sparse, or am I doing something wrong? The ultimate goal is to create a SparseDataFrame from this (just like in the other topic I referred to).
from scipy.sparse import lil_matrix, csr_matrix
from numpy import random
import pandas as pd

nsamples = 10**5
nfeatures = 10**4

rm = lil_matrix((nsamples, nfeatures))
for i in xrange(nsamples):
    index = random.randint(0, nfeatures, size=4)
    rm[i, index] = 1

l = []
for i in xrange(nsamples):
    l.append(pd.Series(rm[i, :].toarray().ravel()).to_sparse(fill_value=0))
Since your goal is a sparse dataframe, I skipped the Series stage and went straight to a dataframe. I only had the patience to do this on a smaller array size:
nsamples = 10**3
nfeatures = 10**2
Creation of rm is the same, but instead of loading into a list, I do this:
df = pd.DataFrame(rm[0, :].toarray().ravel()).to_sparse(0)
for i in xrange(1, nsamples):
    df[i] = rm[i, :].toarray().ravel()
This is unfortunately much slower to run than what you have, but the result is a dataframe, not a list. I played around with this a little and, as best I can tell, there is no fast way to build a large, sparse dataframe (even one full of zeros) column by column, rather than all at once (which would not be memory efficient). All of the examples in the documentation that I could find start with a dense structure and then convert to sparse in one step.
In any event, this way should be fairly memory efficient by compressing one column at a time such that you never have the full array/dataframe uncompressed at the same time. The resulting dataframe is definitely sparse:
In [39]: type(df)
Out[39]: pandas.sparse.frame.SparseDataFrame
and definitely saves space (almost 25x compression):
In [40]: df.memory_usage().sum()
Out[40]: 31528
In [41]: df.to_dense().memory_usage().sum()
Out[41]: 800000
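As a side note, newer pandas versions (0.25+) can build the sparse frame in one step directly from the scipy matrix, which avoids the column-by-column loop entirely; a hedged sketch using the question's rm:
import pandas as pd

sdf = pd.DataFrame.sparse.from_spmatrix(rm.tocsr())   # one-shot, stays sparse throughout
print(sdf.sparse.density)                             # fraction of stored (non-fill) cells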

doing better than numpy's in1d mask function: ordered arrays?

This operation needs to be as fast as possible, since the actual arrays contain millions of elements. This is a simple version of the problem.
So, I have a random array of unique integers (normally millions of elements).
totalIDs = [5,4,3,1,2,9,7,6,8 ...]
I have another array (normally tens of thousands of elements) of unique integers from which I can create a mask.
subsampleIDs1 = [5,1,9]
subsampleIDs2 = [3,7,8]
subsampleIDs3 = [2,6,9]
...
I can use numpy to do
mask = np.in1d(totalIDs,subsampleIDs,assume_unique=True)
I can then extract the information I want of another array using the mask (say column 0 contains the one I want).
variable = allvariables[mask][:,0]
Now given that the IDs are unique in both arrays, is there any way to speed this up significantly. It takes a long time to construct the mask for a few thousand points (subsampleIDs) matching against millions of IDs (totalIDs).
I thought of going through it once and writing out a binary file of an index (to speed up future searches).
for i in range(0,3):
    mask = np.in1d(totalIDs,subsampleIDs,assume_unique=True)
    index[mask] = i
where X is in subsampleIDsX. Then I can just do:
for i in range(0,3):
    if index[i] == i:
        rowmatch = i
        break

variable = allvariables[rowmatch:len(subsampleIDs),0]
right? But this is also slow because there is a conditional in the loop to find when it first matches. Is there a faster way to find when a number first appears in an ordered array so the conditional doesn't slow the loop?
I suggest you use a DataFrame in Pandas. The index of the DataFrame is totalIDs, and you can select subsampleIDs by: df.ix[subsampleIDs].
Create some test data first:
import numpy as np
N = 2000000
M = 5000
totalIDs = np.random.randint(0, 10000000, N)
totalIDs = np.unique(totalIDs)
np.random.shuffle(totalIDs)
v1 = np.random.rand(len(totalIDs))
v2 = np.random.rand(len(totalIDs))
subsampleIDs = np.random.choice(totalIDs, M)
subsampleIDs = np.unique(subsampleIDs)
np.random.shuffle(subsampleIDs)
Then convert your data into a DataFrame:
import pandas as pd
df = pd.DataFrame(data = {"v1":v1, "v2":v2}, index=totalIDs)
df.ix[subsampleIDs]
DataFrame uses a hashtable to map the index to its location; it's very fast.
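Note that .ix has since been removed from pandas; on current versions the same lookup on the df and subsampleIDs defined above would be .loc (or .reindex if some IDs might be missing from the index):
# Modern-pandas equivalent of df.ix[subsampleIDs]:
df.loc[subsampleIDs]       # raises KeyError if any ID is missing from the index
df.reindex(subsampleIDs)   # returns NaN rows for missing IDs instead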
Often this kind of indexing is best performed using a DB (with proper column-indexing).
Another idea is to sort totalIDs once, as a preprocessing stage, and implement your own version of in1d, which avoids sorting everything. The numpy implementation of in1d (at least in the version that I have installed) is fairly simple, and should be easy to copy and modify.
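A hedged sketch of that idea (my own implementation of the suggestion, not numpy's): sort totalIDs once up front, then answer each subsample query with np.searchsorted:
import numpy as np

total = np.asarray(totalIDs)        # from the question; millions of unique ints
order = np.argsort(total)           # preprocessing: sort once
sorted_ids = total[order]

def in1d_sorted(subsample):
    """Boolean mask over totalIDs marking members of subsample; O(M log N) per call."""
    sub = np.asarray(subsample)
    pos = np.searchsorted(sorted_ids, sub)
    pos = np.clip(pos, 0, len(sorted_ids) - 1)   # ids larger than the max would index out of range
    hit = sorted_ids[pos] == sub                 # keep exact matches only
    mask = np.zeros(len(sorted_ids), dtype=bool)
    mask[order[pos[hit]]] = True                 # map sorted positions back to original ones
    return mask

mask = in1d_sorted(subsampleIDs1)   # comparable to np.in1d(totalIDs, subsampleIDs1, assume_unique=True)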
EDIT:
Or, even better, use bucket sort (or radix sort). That should give you O(N+M), N being the size of totalIDs and M the size of subsampleIDs (times a constant you can play with by changing the number of buckets). Here too, you can split totalIDs into buckets only once, which gives you a nifty O(N+M1+M2+...).
Unfortunately, I'm not aware of a numpy implementation, but I did find this: http://en.wikipedia.org/wiki/Radix_sort#Example_in_Python

calculating means of many matrices in numpy

I have many csv files which each contain roughly identical matrices. Each matrix is 11 columns by either 5 or 6 rows. The columns are variables and the rows are test conditions. Some of the matrices do not contain data for the last test condition, which is why there are 5 rows in some matrices and six rows in other matrices.
My application is in python 2.6 using numpy and scipy.
My question is this:
How can I most efficiently create a summary matrix that contains the means of each cell across all of the identical matrices?
The summary matrix would have the same structure as all of the other matrices, except that the value in each cell in the summary matrix would be the mean of the values stored in the identical cell across all of the other matrices. If one matrix does not contain data for the last test condition, I want to make sure that its contents are not treated as zeros when the averaging is done. In other words, I want the means of all the non-zero values.
Can anyone show me a brief, flexible way of organizing this code so that it does everything I want to do with as little code as possible and also remain as flexible as possible in case I want to re-use this later with other data structures?
I know how to pull all the csv files in and how to write output. I just don't know the most efficient way to structure flow of data in the script, including whether to use python arrays or numpy arrays, and how to structure the operations, etc.
I have tried coding this in a number of different ways, but they all seem to be rather code intensive and inflexible if I later want to use this code for other data structures.
You could use masked arrays. Say N is the number of csv files. You can store all your data in a masked array A, of shape (N,11,6).
from numpy import *
A = ma.zeros((N,11,6))
A.mask = zeros_like(A) # fills the mask with zeros: nothing is masked
A.mask = (A.data == 0) # another way of masking: mask all data equal to zero
A.mask[0,0,0] = True # mask a value
A[1,2,3] = 12. # fill a value: like a usual array
Then, the mean values along the first axis, taking masked values into account, are given by:
mean(A, axis=0) # the returned shape is (11,6)
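Putting it together, a hedged sketch of the whole flow; the glob pattern, the comma delimiter, the absence of header rows, and the (N, 11, 6) layout are assumptions consistent with the description above:
import glob
import numpy as np

files = sorted(glob.glob("matrix_*.csv"))        # hypothetical file pattern
A = np.ma.masked_all((len(files), 11, 6))        # start with everything masked

for k, fname in enumerate(files):
    m = np.loadtxt(fname, delimiter=",")         # shape (5, 11) or (6, 11)
    A[k, :, :m.shape[0]] = m.T                   # assigning unmasks only the cells present

summary = A.mean(axis=0)                         # shape (11, 6); missing rows are ignored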
