Saving (and averaging) Large Sparse Numpy Array - python

I have written a code that generates a large 3d numpy array of data observations (floats). The dimensions are (33,000 x 2016 x 53), which corresponds to (#obs.locations x 5min_intervals_perweek x weeks_in_leapyear). It is very sparse (about 1.5% of entries are filled).
Currently I do this by calling:
my3Darray = np.zeros(33000,2016,53)
or
my3Darray = np.empty(33000,2016,53)
My loop then indexes into the array one entry at a time and updates 1.5% with floats (this part is actually very fast). I then need to:
Save each 2D (33000 x 2016) slice as a CSV or other 'general format' data file
Take the mean over the 3rd dimension (so I should get a 33000 x 2016 matrix)
I have tried saving with:
for slice_2d_week_i in xrange(nweeks):
weekfile = str(slice_2d_week_i)
np.savetxt(weekfile, my3Darray[:,:,slice_2d_week_i], delimiter=",")
However, this is extremely slow and the empty entries in the output show up as
0.000000000000000000e+00
which makes the file sizes huge.
Is there a more efficient way to save (possibly leaving blanks for entries that were never updated?) Is there a better way to allocate the array besides np.zeros or np.empty? And how can I take the mean over the 3rd dimension while ignoring non-updated entries ( mean(my3Darray,3) does not ignore the 0 entries ).

You can save in one of numpy's binary formats, here's one I use: np.savez.
You can average with np.sum(a, axis=2) / np.sum(a != 0, axis=2). Keep in mind that this will still give you NaN's when there are zeros in the denominator.

Related

Converting 2D numpy array to 3D array without looping

I have a 2D array of shape (t*40,6) which I want to convert into a 3D array of shape (t,40,5) for the LSTM's input data layer. The description on how the conversion is desired in shown in the figure below. Here, F1..5 are the 5 input features, T1...40 are the time steps for LSTM and C1...t are the various training examples. Basically, for each unique "Ct", I want a "T X F" 2D array, and concatenate all along the 3rd dimension. I do not mind losing the value of "Ct" as long as each Ct is in a different dimension.
I have the following code to do this by looping over each unique Ct, and appending the "T X F" 2D arrays in 3rd dimension.
# load 2d data
data = pd.read_csv('LSTMTrainingData.csv')
trainX = []
# loop over each unique ct and append the 2D subset in the 3rd dimension
for index, ct in enumerate(data.ct.unique()):
trainX.append(data[data['ct'] == ct].iloc[:, 1:])
However, there are over 1,800,000 such Ct's so this makes it quite slow to loop over each unique Ct. Looking for suggestions on doing this operation faster.
EDIT:
data_3d = array.reshape(t,40,6)
trainX = data_3d[:,:,1:]
This is the solution for the original question posted.
Updating the question with an additional problem: the T1...40 time steps can have the highest number of steps = 40, but it could be less than 40 as well. The rest of the values can be 'np.nan' out of the 40 slots available.
Since all Ct have not the same length , you have no other choice than rebuild a new block.
But use of data[data['ct'] == ct] can be O(n²) so it's a bad way to do it.
Here a solution using Panel . cumcount renumber each Ct line :
t=5
CFt=randint(0,t,(40*t,6)).astype(float) # 2D data
df= pd.DataFrame(CFt)
df2=df.set_index([df[0],df.groupby(0).cumcount()]).sort_index()
df3=df2.to_panel()
This automatically fills missing data with Nan. But It warns :
DeprecationWarning:
Panel is deprecated and will be removed in a future version.
The recommended way to represent these types of 3-dimensional data are with a MultiIndex on a DataFrame, via the Panel.to_frame() method
So perhaps working with df2 is the recommended way to manage your data.

Remove every second row of array multiple times

So I have several .txt files with over +80.000 rows of data.
As such this might not be much for Python, however, I need to use this data in R where I need a certain package. And over there it takes around 30 sec to load one file - and I have 1200 of these files.
However, the data in these files are rather dense. There is no need to have such small steps, i.e. I want to remove some in order to make the file smaller.
What I'm using now is as follows:
np.delete(np.array(data_lines), np.arange(1, np.array(data_lines).size, 2))
I make it start at the row index 1, and the remove every second row of the data_lines array containing the +80.000 lines of data. However, as you can see, this only reduces the rows with 1/2. And I probably need at least a 1/10 reduction. So in principle I could probably just do some kind of loop to do this, but I was wondering if there was a smarter way to achieve it ?
a = np.array(data_lines)[::10]
Takes every tenth row of data. No data is copied, the slicing works with view objects.
You should use slicing. In my example array, the values in each row are identical to the row index (0,1,...,79999). I cut out every 10 rows of my 80000 x 1 np array (the number of columns doesn't matter... this would work on an array with more than 1 column). If you want to slice it differently, here's more info on slicing https://docs.scipy.org/doc/numpy-1.13.0/reference/arrays.indexing.html
import numpy as np
data_lines = np.arange(0,80000).reshape((80000,1))
#
data_lines = data_lines.reshape((80000,1))
data_lines_subset = data_lines[::10]
##data_lines_subset
## array([[ 0],
# [ 10],
# [ 20],
# ...,
# [79970],
# [79980],
# [79990]])
So in your case, if your data_lines array isn't already a np array:
data_lines_subset = np.array(data_lines)[::10]

sampling 2-D numpy array using another array

I have a pretty big numpy matrix (2-D array with more than 1000 * 1000 cells), and another 2-D array of indexes in the following form: [[x1,y1],[x2,y2],...,[xn,yn]], which is also quite large (n > 1000). I want to extract all the cells in the matrix that their (x,y) coordinates appear in the array as efficient as possible, i.e. without loops. If the array was an array of tuples I could just to
cells = matrix[array]
and get what I want, but the array is not in that format, and I couldn't find an efficient way to convert it to the desired form...
You can make your array into a tuple of arrays like this:
tuple(array.T)
This matches the output style of np.where(), which can be indexed.
cells=matrix[tuple(array.T)]
you can also do standard numpy array slicing and get Divakar's answer in the comments:
cells=matrix[array[:,0],array[:,1]]

Shuffling multiple HDF5 datasets in-place

I have multiple HDF5 datasets saved in the same file, my_file.h5. These datasets have different dimensions, but the same number of observations in the first dimension:
features.shape = (1000000, 24, 7, 1)
labels.shape = (1000000)
info.shape = (1000000, 4)
It is important that the info/label data is correctly connected to each set of features and I therefore want to shuffle these datasets with an identical seed. Furthermore, I would like to shuffle these without ever loading them fully into memory. Is that possible using numpy and h5py?
Shuffling arrays on disk will be time consuming, as it means that you have allocate new arrays in the hdf5 file, then copy all the rows in a different order. You can iterate over rows (or use chunks of rows), if you want to avoid loading all the data at once into memory with PyTables or h5py.
An alternative approach could be to keep your data as it is and simply to map new row numbers to old row numbers in a separate array (that you can keep fully loaded in RAM, since it will be only 4MB with your array sizes). For instance, to shuffle a numpy array x,
x = np.random.rand(5)
idx_map = numpy.arange(x.shape[0])
numpy.random.shuffle(idx_map)
Then you can use advanced numpy indexing to access your shuffled data,
x[idx_map[2]] # equivalent to x_shuffled[2]
x[idx_map] # equivament to x_shuffled[:], etc.
this will work also with arrays saved to hdf5. Of course there would be some overhead, as compared to writing shuffled arrays on disk, but it could be sufficient depending on your use-case.
Shuffling arrays like this in numpy is straight forward
Create the large suffling index (shuffle np.arange(1000000)) and index the arrays
features = features[I, ...]
labels = labels[I]
info = info[I, :]
This isn't an inplace operation. labels[I] is a copy of labels, not a slice or view.
An alternative
features[I,...] = features
looks on the surface like it is an inplace operation. I doubt that it is, down in the C code. It has to be buffered, because the I values are not guaranteed to be unique. In fact there is a special ufunc .at method for unbuffered operations.
But look at what h5py says about this same sort of 'fancy indexing':
http://docs.h5py.org/en/latest/high/dataset.html#fancy-indexing
labels[I] selection is implemented, but with restrictions.
List selections may not be empty
Selection coordinates must be given in increasing order
Duplicate selections are ignored
Very long lists (> 1000 elements) may produce poor performance
Your shuffled I is, by definition not in increasing order. And it is very large.
Also I don't see anything about using this fancy indexing on the left handside, labels[I] = ....
import numpy as np
import h5py
data = h5py.File('original.h5py', 'r')
with h5py.File('output.h5py', 'w') as out:
indexes = np.arange(data['some_dataset_in_original'].shape[0])
np.random.shuffle(indexes)
for key in data.keys():
print(key)
feed = np.take(data[key], indexes, axis=0)
out.create_dataset(key, data=feed)

calculating means of many matrices in numpy

I have many csv files which each contain roughly identical matrices. Each matrix is 11 columns by either 5 or 6 rows. The columns are variables and the rows are test conditions. Some of the matrices do not contain data for the last test condition, which is why there are 5 rows in some matrices and six rows in other matrices.
My application is in python 2.6 using numpy and sciepy.
My question is this:
How can I most efficiently create a summary matrix that contains the means of each cell across all of the identical matrices?
The summary matrix would have the same structure as all of the other matrices, except that the value in each cell in the summary matrix would be the mean of the values stored in the identical cell across all of the other matrices. If one matrix does not contain data for the last test condition, I want to make sure that its contents are not treated as zeros when the averaging is done. In other words, I want the means of all the non-zero values.
Can anyone show me a brief, flexible way of organizing this code so that it does everything I want to do with as little code as possible and also remain as flexible as possible in case I want to re-use this later with other data structures?
I know how to pull all the csv files in and how to write output. I just don't know the most efficient way to structure flow of data in the script, including whether to use python arrays or numpy arrays, and how to structure the operations, etc.
I have tried coding this in a number of different ways, but they all seem to be rather code intensive and inflexible if I later want to use this code for other data structures.
You could use masked arrays. Say N is the number of csv files. You can store all your data in a masked array A, of shape (N,11,6).
from numpy import *
A = ma.zeros((N,11,6))
A.mask = zeros_like(A) # fills the mask with zeros: nothing is masked
A.mask = (A.data == 0) # another way of masking: mask all data equal to zero
A.mask[0,0,0] = True # mask a value
A[1,2,3] = 12. # fill a value: like an usual array
Then, the mean values along first axis, and taking into account masked values, are given by:
mean(A, axis=0) # the returned shape is (11,6)

Categories