Vectorization - how to append to an array without a for loop - Python

I have the following code:
import numpy as np

x = range(100)
M = len(x)
sample = np.zeros((M, 41632))
for i in range(M):
    lista = np.load('sample' + str(i) + '.npy')
    for j in range(41632):
        sample[i, j] = np.array(lista[j])
    print i
to create an array made of sample_i numpy arrays.
sample0, sample1, sample2, etc. are numpy arrays and my expected output is an Mx41632 array like this:
sample = [[sample0],[sample1],[sample2],...]
How can I make this operation more compact and faster, without the for loops? M can reach 1 million.
Or, how can I append to my sample array if the starting index is, for example, 1000 instead of 0?
Thanks in advance

Initial load
You can make your code a lot faster by avoiding the inner loop and not initialising sample to zeros.
x = range(100)
M = len(x)
sample = np.empty((M, 41632))
for i in range(M):
    sample[i, :] = np.load('sample' + str(i) + '.npy')
In my tests this took the reading code from 3 seconds to 60 milliseconds!
Adding rows
In general it is very slow to change the size of a numpy array. You can append a row once you have loaded the data in this way:
sample = np.insert(sample, len(sample), newrow, axis=0)
but this is almost never what you want to do, because it is so slow.
Better storage: HDF5
Also if M is very large you will probably start running out of memory.
I recommend that you have a look at PyTables which will allow you to store your sample results in one HDF5 file and manipulate the data without loading it into memory. This will in general be a lot faster than the .npy files you are using now.
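For example, a minimal sketch of the PyTables approach (untested against your data; the file and node names are my own choice):
import numpy as np
import tables

M = 100  # as in the question

# Write: an extendable array on disk, appended one row at a time
with tables.open_file('samples.h5', mode='w') as f:
    earray = f.create_earray(f.root, 'sample',
                             atom=tables.Float64Atom(),
                             shape=(0, 41632))   # 0 marks the extendable dimension
    for i in range(M):
        row = np.load('sample' + str(i) + '.npy')
        earray.append(row[None, :])              # append a single row

# Read: slice without loading the whole array into memory
with tables.open_file('samples.h5', mode='r') as f:
    chunk = f.root.sample[10:20, :]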

It is quite simple with numpy. Consider this example:
import numpy as np
l = [[1,2,3],[4,5,6],[7,8,9],[10,11,12]]
#create an array with 4 rows and 3 columns
arr = np.zeros([4,3])
arr[:,:] = l
You can also insert rows or columns separately:
#insert the first row
arr[0,:] = l[0]
You just have to make sure the dimensions are the same.

Related

How do I save an N x M array/list using Pandas?

I have an N x M numpy array / list. I want to save this matrix into a .csv file using Pandas. Unfortunately I don't know a priori the values of M and N, which can be large. I am interested in Pandas because I find it manageable in terms of accessing data columns.
Let's start with this MWE:
import numpy as np
import pandas as pd
N,M = np.random.randint(10,100, size = 2)
A = np.random.randint(10, size = (N,M))
columns = []
for i in range(len(A[0, :])):
    columns.append("column_{}".format(i))
I cannot do something like pd.append(), i.e. appending columns with new additional indices via a for loop.
Is there a way to save A into a .csv file?
Following the comment of Quang Hoang, there are 2 possibilities:
pd.DataFrame(A).to_csv('yourfile.csv').
np.save("yourfile.npy",A) and then A = np.load("yourfile.npy").
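For completeness, here is a small sketch tying the first option to the MWE above (the column names come from the question's loop; index=False is optional and just drops the default row index):
import numpy as np
import pandas as pd

N, M = np.random.randint(10, 100, size=2)
A = np.random.randint(10, size=(N, M))
columns = ["column_{}".format(i) for i in range(M)]

pd.DataFrame(A, columns=columns).to_csv('yourfile.csv', index=False)

# and to get it back:
df = pd.read_csv('yourfile.csv')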

Calculate partitioned sum efficiently with CuPy or NumPy

I have a very long array* of length L (let's call it values) that I want to sum over, and a sorted 1D array of the same length L that contains N integers with which to partition the original array – let's call this array labels.
What I'm currently doing is this (module being cupy or numpy):
result = module.empty(N)
for i in range(N):
result[i] = values[labels == i].sum()
But this can't be the most efficient way of doing it (it should be possible to get rid of the for loop, but how?). Since labels is sorted, I could easily determine the break points and use those indices as start/stop points, but I don't see how this solves the for loop problem.
Note that I would like to avoid creating an array of size NxL along the way, if possible, since L is very large.
I'm working in cupy, but any numpy solution is welcome too and could probably be ported. Within cupy, it seems this would be a case for a ReductionKernel, but I don't quite see how to do it.
* in my case, values is 1D, but I assume the solution wouldn't depend on this
You are describing a groupby sum aggregation. You could write a CuPy RawKernel for this, but it would be much easier to use the existing groupby aggregations implemented in cuDF, the GPU dataframe library. They can interoperate without requiring you to copy the data. If you call .values on the resulting cuDF Series, it will give you a CuPy array.
If you went back to the CPU, you could do the same thing with pandas.
import cupy as cp
import pandas as pd
N = 100
values = cp.random.randint(0, N, 1000)
labels = cp.sort(cp.random.randint(0, N, 1000))
L = len(values)
result = cp.empty(N)
for i in range(N):
    result[i] = values[labels == i].sum()
result[:5]
array([547., 454., 402., 601., 668.])
import cudf
df = cudf.DataFrame({"values": values, "labels": labels})
df.groupby(["labels"])["values"].sum().values[:5]
array([547, 454, 402, 601, 668])
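For reference, the CPU version with pandas mentioned above would look roughly like this (my own sketch, same synthetic data but with plain NumPy arrays):
import numpy as np
import pandas as pd

N = 100
values = np.random.randint(0, N, 1000)
labels = np.sort(np.random.randint(0, N, 1000))

# group by the label array directly; labels that never occur are simply absent
result = pd.Series(values).groupby(labels).sum().values
print(result[:5])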
Here is a solution which, instead of an N x L array, uses an N x <max partition size in labels> array (which should not be large, if the disparity between the partition sizes is not too high):
Reshape the array into a 2-D array with one partition per row. Since the row length equals the size of the largest partition, fill the unavailable positions with zeros (they don't affect the sum). This uses @Divakar's solution given here.
def jagged_to_regular(a, parts):
    lens = np.ediff1d(parts, to_begin=parts[0])
    mask = lens[:, None] > np.arange(lens.max())
    out = np.zeros(mask.shape, dtype=a.dtype)
    out[mask] = a
    return out
parts = np.searchsorted(labels, np.arange(N), side='right')  # cumulative end index of each of the N partitions
parts_stack = jagged_to_regular(values, parts)
Sum along axis 1:
result = np.sum(parts_stack, axis = 1)
In case you'd like a CuPy implementation, there's no direct CuPy alternative to numpy.ediff1d in jagged_to_regular. In that case, you can substitute the statement with numpy.diff like so:
lens = np.insert(np.diff(parts), 0, parts[0])
and then continue to use CuPy as a drop-in replacement for numpy.
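Putting it together, here is a minimal NumPy end-to-end sketch (reusing jagged_to_regular from above; the searchsorted line mirrors the break-point computation, and the verification loop is the one from the question):
import numpy as np

N = 100
values = np.random.randint(0, N, 1000).astype(np.float64)
labels = np.sort(np.random.randint(0, N, 1000))

parts = np.searchsorted(labels, np.arange(N), side='right')  # cumulative end of each partition
parts_stack = jagged_to_regular(values, parts)
result = np.sum(parts_stack, axis=1)

# cross-check against the original loop from the question
expected = np.array([values[labels == i].sum() for i in range(N)])
assert np.allclose(result, expected)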

Construct sparse matrix on disk on the fly in Python

I'm currently doing some memory-intensive text processing, for which I have to construct a sparse matrix of float32s with dimensions of ~ (2M, 5M). I'm constructing this matrix column by column while reading a corpus of 5M documents. For this purpose I use a sparse dok_matrix data structure from SciPy. However, when arriving at the 500,000th document, my memory is full (approx. 30GB is used) and the program crashes. What I eventually want to do is perform a dimensionality reduction algorithm on the matrix using sklearn, but, as said, it is impossible to hold and construct the entire matrix in memory. I've looked into numpy.memmap, as sklearn supports it, and tried to memmap some of the underlying numpy data structures of the SciPy sparse matrix, but I could not get this to work.
It is impossible for me to save the entire matrix in a dense format, since this would require 40TB of disk space. So I think that HDF5 and PyTables are not an option for me (?).
My question is now: how can I construct a sparse matrix on the fly, writing directly to disk instead of keeping it in memory, in such a way that I can use it afterwards in sklearn?
Thanks!
We've come across similar problems in the field of single cell genomics data dealing with large sparse datasets on disk. I'll show you a small simple example of how I would deal with this. My assumptions are that you're very memory constrained, and probably can't fit multiple copies of the sparse matrix into memory at once. This will work even if you can't fit one entire copy.
I would construct an on disk sparse CSC matrix column by column. A sparse csc matrix uses 3 underlying arrays:
data: the values stored in the matrix
indices: the row index for each value in the matrix
indptr: an array of length n_cols + 1, which divides indices and data by which column they belong to.
As an explanatory example, the values for column i are stored in the range indptr[i]:indptr[i+1] of data. Similarly, the row indices for these values can be found by indices[indptr[i]:indptr[i+1]].
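As a tiny illustration of that layout (my own toy example, not part of the construction below):
import numpy as np
from scipy import sparse

m = sparse.csc_matrix(np.array([[1, 0, 2],
                                [0, 3, 0],
                                [4, 0, 5]]))
i = 2  # column index
print(m.data[m.indptr[i]:m.indptr[i + 1]])     # values in column 2 -> [2. 5.]
print(m.indices[m.indptr[i]:m.indptr[i + 1]])  # their row indices  -> [0 2]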
To simulate your data generating process (parsing a document, I assume) I'll define a function process_document which returns the values for indices and data for the relevant document.
import numpy as np
import h5py
from scipy import sparse
from tqdm import tqdm # For monitoring the writing process
from typing import Tuple, Union # Just for argument annotation
def process_document():
    """
    Simulate processing a document. Results in a sparse vector representation.
    """
    n_items = np.random.negative_binomial(2, .0001)
    indices = np.random.choice(2_000_000, n_items, replace=False)
    indices.sort()
    data = np.random.random(n_items).astype(np.float32)
    return indices, data

def data_generator(n):
    """Iterator which yields simulated data."""
    for i in range(n):
        yield process_document()
Now I'll create a group in an HDF5 file which will store the constituent arrays of a sparse matrix.
def make_sparse_csc_group(f: Union[h5py.File, h5py.Group], groupname: str, shape: Tuple[int, int]):
    """
    Create a group in an hdf5 file that can store a CSC sparse matrix.
    """
    g = f.create_group(groupname)
    g.attrs["shape"] = shape
    g.create_dataset("indices", shape=(1,), dtype=np.int64, chunks=True, maxshape=(None,))
    g["indptr"] = np.zeros(shape[1] + 1, dtype=int)  # we want this to have a zero for the first value
    g.create_dataset("data", shape=(1,), dtype=np.float32, chunks=True, maxshape=(None,))
    return g
And finally a function for reading this group as a sparse matrix (this one is pretty simple).
def read_sparse_csc_group(g: Union[h5py.File, h5py.Group]):
    return sparse.csc_matrix((g["data"], g["indices"], g["indptr"]), shape=g.attrs["shape"])
Now we'll create the on-disk sparse matrix and write one column at a time to it (I'm using fewer columns since this can be kinda slow).
N_COLS = 10

def make_disk_matrix(f, groupname, data_iter, shape):
    group = make_sparse_csc_group(f, groupname, shape)
    indptr = group["indptr"]
    data = group["data"]
    indices = group["indices"]
    n_total = 0
    for doc_num, (cur_indices, cur_data) in enumerate(tqdm(data_iter)):
        n_cur = len(cur_indices)
        n_prev = n_total
        n_total += n_cur
        indices.resize((n_total,))
        data.resize((n_total,))
        indices[n_prev:] = cur_indices
        data[n_prev:] = cur_data
        indptr[doc_num + 1] = n_total

# Writing
with h5py.File("data.h5", "w") as f:
    make_disk_matrix(f, "mtx", data_generator(10), (2_000_000, 10))

# Reading
with h5py.File("data.h5", "r") as f:
    mtx = read_sparse_csc_group(f["mtx"])
Again, this assumes a very memory-constrained situation, where you might not be able to fit the entire sparse matrix in memory while creating it. A much faster way to do this, if you can handle the entire sparse matrix plus at least one copy, would be to not bother with the on-disk storage at all (similar to other suggestions). However, a slight modification of this code should give you better performance:
def make_memory_mtx(data_iter, shape):
    indices_list = []
    data_list = []
    indptr = np.zeros(shape[1] + 1, dtype=int)
    n_total = 0
    for doc_num, (cur_indices, cur_data) in enumerate(data_iter):
        n_cur = len(cur_indices)
        n_prev = n_total
        n_total += n_cur
        indices_list.append(cur_indices)
        data_list.append(cur_data)
        indptr[doc_num + 1] = n_total
    indices = np.concatenate(indices_list)
    data = np.concatenate(data_list)
    return sparse.csc_matrix((data, indices, indptr), shape=shape)

mtx = make_memory_mtx(data_generator(10), shape=(2_000_000, 10))
This should be fairly fast, since it only copies the data once, when you concatenate the arrays. Other currently posted solutions reallocate the arrays as you go, making many copies of large arrays.
It would be great if you could provide a minimal working example. I can't tell whether your matrix gets too big by construction (1) or just because you have too much data (2). If you don't really care about building this matrix yourself, you can go directly to my remark 2.
For problem (1), in the example code below, I made a wrapper class to build a csr_matrix chunk by chunk. The idea is to accumulate (row, column, data) lists until a buffer limit (see remark 1) is reached, and only then actually update the matrix. When the limit is reached, it also reduces the data in memory, since the csr_matrix constructor sums data that share the same (row, column) pair. This lets you construct the sparse matrix quickly (much faster than creating a sparse matrix per row) and avoids memory errors due to the redundancy of the (row, column) pairs when a word appears several times in a document.
import numpy as np
import scipy.sparse
class SparseMatrixBuilder():
    def __init__(self, shape, build_size_limit):
        self.sparse_matrix = scipy.sparse.csr_matrix(shape)
        self.shape = shape
        self.build_size_limit = build_size_limit
        self.data_temp = []
        self.col_indices_temp = []
        self.row_indices_temp = []

    def add(self, data, col_indices, row_indices):
        self.data_temp.append(data)
        self.col_indices_temp.append(col_indices)
        self.row_indices_temp.append(row_indices)
        if len(self.data_temp) == self.build_size_limit:
            # csr_matrix expects (row, col) coordinate order; duplicate coordinates are summed
            self.sparse_matrix += scipy.sparse.csr_matrix(
                (np.concatenate(self.data_temp),
                 (np.concatenate(self.row_indices_temp),
                  np.concatenate(self.col_indices_temp))),
                shape=self.shape
            )
            self.data_temp = []
            self.col_indices_temp = []
            self.row_indices_temp = []

    def get_matrix(self):
        self.sparse_matrix += scipy.sparse.csr_matrix(
            (np.concatenate(self.data_temp),
             (np.concatenate(self.row_indices_temp),
              np.concatenate(self.col_indices_temp))),
            shape=self.shape
        )
        self.data_temp = []
        self.col_indices_temp = []
        self.row_indices_temp = []
        return self.sparse_matrix
For problem (2), you can easily extend this class by adding a save method that stores the matrix on disk once the limit (or a second limit) is reached. As such, you'll end up with multiple chunks of sparse matrices on disk. Then you'll need a dimensionality reduction algorithm that can handle chunked matrices (see remark 2).
remark 1: the buffer limit here is not really well defined. It would be better to check the actual size of the numpy arrays data_temp, col_indices_temp and row_indices_temp against the RAM available on the machine (which is quite easy to automate with Python).
remark 2: gensim is a Python library that has the great advantage of using chunked files for building NLP models. So you could build a dictionary, construct a sparse matrix and reduce its dimensionality with that library, without needing much RAM.
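As a sketch of the save method mentioned for problem (2) above (my own illustration building on the class; the second limit, the method names and the file naming are assumptions, and it relies on scipy.sparse.save_npz):
import scipy.sparse

class ChunkedSparseMatrixBuilder(SparseMatrixBuilder):
    """Variant that flushes the accumulated matrix to disk once it is 'full'."""
    def __init__(self, shape, build_size_limit, save_nnz_limit, chunk_prefix="chunk"):
        SparseMatrixBuilder.__init__(self, shape, build_size_limit)
        self.save_nnz_limit = save_nnz_limit   # the "second limit" from the text
        self.chunk_prefix = chunk_prefix
        self.n_chunks = 0

    def add(self, data, col_indices, row_indices):
        SparseMatrixBuilder.add(self, data, col_indices, row_indices)
        if self.sparse_matrix.nnz >= self.save_nnz_limit:
            self.save_chunk()

    def save_chunk(self):
        """Write the current chunk to disk and start an empty one."""
        scipy.sparse.save_npz("{}_{}.npz".format(self.chunk_prefix, self.n_chunks),
                              self.sparse_matrix)
        self.n_chunks += 1
        self.sparse_matrix = scipy.sparse.csr_matrix(self.shape)
The chunks written this way are full-shape partial matrices whose sum is the complete matrix; they can be loaded back with scipy.sparse.load_npz.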
I'm assuming that all your data can fit in memory using a more memory-friendly sparse matrix format such as COO. If it does not, there is almost no hope you will be able to proceed with sklearn, even by using mmap. Indeed sklearn will likely create subsequent objects with memory requirements of the same order of magnitude as your input.
Scipy's dok_matrix is actually a subclass of the vanilla Python dict. It stores the data using individual Python objects and tons of pointers, so it is not memory efficient. The most compact representation is the coo_matrix format. You can incrementally build the data required to create a COO matrix by pre-allocating arrays for the coordinates (rows and cols) and the data, and eventually growing these buffers if your initial guess was wrong.
import numpy
from scipy.sparse import coo_matrix

def get_coo_from_iter(iterable, n_data_hint=1<<20, idx_dtype='uint32', data_dtype='float32'):
    counter = 0
    rows = numpy.empty(n_data_hint, dtype=idx_dtype)
    cols = numpy.empty(n_data_hint, dtype=idx_dtype)
    data = numpy.empty(n_data_hint, dtype=data_dtype)
    for row, col, value in iterable:
        if counter >= n_data_hint:
            n_data_hint *= 2
            rows, cols, data = _reallocate(rows, cols, data, n_data_hint)
        rows[counter] = row
        cols[counter] = col
        data[counter] = value
        counter += 1
    rows = rows[:counter]
    cols = cols[:counter]
    data = data[:counter]
    return coo_matrix((data, (rows, cols)))

def _reallocate(rows, cols, data, n):
    new_rows = numpy.empty(n, dtype=rows.dtype)
    new_cols = numpy.empty(n, dtype=cols.dtype)
    new_data = numpy.empty(n, dtype=data.dtype)
    new_rows[:rows.size] = rows
    new_cols[:cols.size] = cols
    new_data[:data.size] = data
    return new_rows, new_cols, new_data
which you can test with randomly-generated data like this:
def get_random_data(n, max_row=2000, max_col=5000):
    for _ in range(n):
        row = numpy.random.choice(max_row)
        col = numpy.random.choice(max_col)
        val = numpy.random.randn()
        yield row, col, val

# test when initial hint is good
coo = get_coo_from_iter(get_random_data(10000), n_data_hint=10000)
print(coo.shape)

# or to test when initial hint was too tiny
coo = get_coo_from_iter(get_random_data(10000), n_data_hint=1111)
print(coo.shape)
Once you have your COO matrix, you may want to convert it to CSR with coo.tocsr(). CSR matrices are better optimized for common operations such as dot products. CSR requires a bit more memory when some rows were originally empty, because it stores pointers for all rows, even empty ones.
Look here; at the end he explains how to store and read a sparse matrix directly to/from an HDF5 file.
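Since the link itself isn't reproduced here, the idea is roughly the following (a minimal h5py sketch; the dataset and function names are my own):
import numpy as np
import h5py
from scipy import sparse

def save_csr_to_hdf5(path, m):
    # store the three underlying CSR arrays plus the shape
    m = m.tocsr()
    with h5py.File(path, "w") as f:
        f.create_dataset("data", data=m.data)
        f.create_dataset("indices", data=m.indices)
        f.create_dataset("indptr", data=m.indptr)
        f.attrs["shape"] = m.shape

def load_csr_from_hdf5(path):
    with h5py.File(path, "r") as f:
        return sparse.csr_matrix(
            (f["data"][...], f["indices"][...], f["indptr"][...]),
            shape=tuple(f.attrs["shape"]))

m = sparse.random(1000, 500, density=0.01, format="csr", dtype=np.float32)
save_csr_to_hdf5("matrix.h5", m)
m2 = load_csr_from_hdf5("matrix.h5")
assert (m - m2).nnz == 0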

pandas' memory usage for list of SparseSeries

I'm trying to create a list of SparseSeries from a sparse numpy matrix. Creating the lil_matrix is fast and does not consume a lot of memory (in reality my dimensions are more in the order of millions, i.e. 15 million samples and 4 million features). I have read a previous topic on this. But that solution as well seems to eat up all my memory, freezing my computer. On the surface it looks like the pandas SparseSeries is not really sparse, or am I doing something wrong? The ultimate goal is to create a SparseDataFrame from this (just like in the other topic I referred to).
from scipy.sparse import lil_matrix, csr_matrix
from numpy import random
import pandas as pd
nsamples = 10**5
nfeatures = 10**4
rm = lil_matrix((nsamples, nfeatures))
for i in xrange(nsamples):
    index = random.randint(0, nfeatures, size=4)
    rm[i, index] = 1

l = []
for i in xrange(nsamples):
    l.append(pd.Series(rm[i, :].toarray().ravel()).to_sparse(fill_value=0))
Since your goal is a sparse dataframe, I skipped the Series stage and went straight to a dataframe. I only had the patience to do this on a smaller array size:
nsamples = 10**3
nfeatures = 10**2
The creation of rm is the same, but instead of loading into a list, I do this:
df = pd.DataFrame(rm[0, :].toarray().ravel()).to_sparse(0)
for i in xrange(1, nsamples):
    df[i] = rm[i, :].toarray().ravel()
This is unfortunately much slower to run than what you have, but the result is a dataframe rather than a list. I played around with this a little, and as best I can tell there is no fast way to build a large, sparse dataframe (even one full of zeros) column by column, rather than all at once (which would not be memory efficient). All of the examples in the documentation that I could find start with a dense structure and then convert to sparse in one step.
In any event, this way should be fairly memory efficient, because it compresses one column at a time, so you never have the full array/dataframe uncompressed at the same time. The resulting dataframe is definitely sparse:
In [39]: type(df)
Out[39]: pandas.sparse.frame.SparseDataFrame
and definitely saves space (almost 25x compression):
In [40]: df.memory_usage().sum()
Out[40]: 31528
In [41]: df.to_dense().memory_usage().sum()
Out[41]: 800000
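For what it's worth, SparseSeries and SparseDataFrame were later removed from pandas (0.25+) in favour of sparse extension dtypes; a rough modern equivalent of the above, as I understand it, would be:
import numpy as np
import pandas as pd
from scipy.sparse import random as sparse_random

rm = sparse_random(10**5, 10**4, density=4e-4, format="csr", dtype=np.float64)
df = pd.DataFrame.sparse.from_spmatrix(rm)   # wraps the matrix without densifying

print(df.sparse.density)           # fraction of explicitly stored values
print(df.memory_usage().sum())     # far smaller than the dense equivalent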

doing better than numpy's in1d mask function: ordered arrays?

This operation needs to be as fast as possible, as the actual arrays contain millions of elements. This is a simplified version of the problem.
So, I have a random array of unique integers (normally millions of elements).
totalIDs = [5,4,3,1,2,9,7,6,8 ...]
I have other arrays (normally tens of thousands of elements each) of unique integers from which I can create a mask.
subsampleIDs1 = [5,1,9]
subsampleIDs2 = [3,7,8]
subsampleIDs3 = [2,6,9]
...
I can use numpy to do
mask = np.in1d(totalIDs,subsampleIDs,assume_unique=True)
I can then extract the information I want of another array using the mask (say column 0 contains the one I want).
variable = allvariables[mask][:,0]
Now, given that the IDs are unique in both arrays, is there any way to speed this up significantly? It takes a long time to construct the mask for a few thousand points (subsampleIDs) matched against millions of IDs (totalIDs).
I thought of going through it once and writing out a binary file of an index (to speed up future searches).
for i in range(0, 3):
    mask = np.in1d(totalIDs, subsampleIDs, assume_unique=True)
    index[mask] = i
where the mask in iteration i is built from the corresponding subsampleIDsX. Then I can just do:
for i in range(0, 3):
    if index[i] == i:
        rowmatch = i
        break

variable = allvariables[rowmatch:len(subsampleIDs), 0]
right? But this is also slow, because there is a conditional in the loop to find where it first matches. Is there a faster way to find where a number first appears in an ordered array, so the conditional doesn't slow the loop?
I suggest you use a DataFrame in Pandas. The index of the DataFrame is totalIDs, and you can select subsampleIDs by: df.ix[subsampleIDs].
Create some test data first:
import numpy as np
N = 2000000
M = 5000
totalIDs = np.random.randint(0, 10000000, N)
totalIDs = np.unique(totalIDs)
np.random.shuffle(totalIDs)
v1 = np.random.rand(len(totalIDs))
v2 = np.random.rand(len(totalIDs))
subsampleIDs = np.random.choice(totalIDs, M)
subsampleIDs = np.unique(subsampleIDs)
np.random.shuffle(subsampleIDs)
Then convert your data into a DataFrame:
import pandas as pd
df = pd.DataFrame(data = {"v1":v1, "v2":v2}, index=totalIDs)
df.ix[subsampleIDs]
A DataFrame uses a hashtable to map the index to its location, so it's very fast.
Often this kind of indexing is best performed using a DB (with proper column-indexing).
Another idea is to sort totalIDs once, as a preprocessing stage, and implement your own version of in1d, which avoids sorting everything. The numpy implementation of in1d (at least in the version that I have installed) is fairly simple, and should be easy to copy and modify.
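For example, a rough sketch of that idea (the helper name in1d_sorted is mine; it assumes both arrays contain unique integers, as stated):
import numpy as np

# one-time preprocessing: sort totalIDs and keep the permutation,
# so positions in the sorted array can be mapped back to the original rows
order = np.argsort(totalIDs)
sorted_ids = totalIDs[order]

def in1d_sorted(sorted_ids, order, subsampleIDs):
    pos = np.searchsorted(sorted_ids, subsampleIDs)
    pos = np.clip(pos, 0, len(sorted_ids) - 1)
    found = sorted_ids[pos] == subsampleIDs
    mask = np.zeros(len(sorted_ids), dtype=bool)
    mask[order[pos[found]]] = True   # map sorted positions back to original indices
    return mask

mask = in1d_sorted(sorted_ids, order, subsampleIDs1)
variable = allvariables[mask][:, 0]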
EDIT:
Or, even better, use bucket sort (or radix sort). That should give you O(N+M), N being the size of totalIDs and M the size of sampleIDs (times a constant you can play with by changing the number of buckets). Here too, you can split totalIDs into buckets only once, which gives you a nifty O(N+M1+M2+...).
Unfortunately, I'm not aware of a numpy implementation, but I did find this: http://en.wikipedia.org/wiki/Radix_sort#Example_in_Python
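Conceptually, the bucketing could look like this (a rough, unoptimized illustration of my own; the point is only that totalIDs is split by value range once, so each lookup then scans a single small bucket):
import numpy as np

n_buckets = 4096
bucket_width = totalIDs.max() // n_buckets + 1

# one-time preprocessing: distribute (ID, original position) pairs into buckets
buckets = [[] for _ in range(n_buckets)]
for pos, tid in enumerate(totalIDs):
    buckets[tid // bucket_width].append((tid, pos))

def mask_from_subsample(subsampleIDs):
    mask = np.zeros(len(totalIDs), dtype=bool)
    for sid in subsampleIDs:
        for tid, pos in buckets[sid // bucket_width]:
            if tid == sid:
                mask[pos] = True
                break
    return mask

mask = mask_from_subsample(subsampleIDs1)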
