For loops with Dask arrays and/or h5py - python

I have a time series with over a hundred million rows of data. I am trying to reshape it to include a time window. My sample data is of shape (79499, 9) and, with a window of 10, I am trying to reshape it to (79489, 10, 9). The following for loop works fine in numpy.
def munge(data, backprop_window):
    result = []
    for index in range(len(data) - backprop_window):
        result.append(data[index: index + backprop_window])
    return np.array(result)

X_train = munge(X_train, backprop_window)
I have tried a few variations with dask, but all of them seem to hang without giving any error messages, including this one:
import h5py
import dask.array as da
f1 = h5py.File("data.hdf5", "w")
X_train = f1.create_dataset('X_train', data=X_train, dtype='float32')
x = da.from_array(X_train, chunks=(10000, X_train.shape[1]))
result = x.compute(munge(x, backprop_window))
Any wise thoughts appreciated.

This doesn't necessarily solve your dask issue, but as a much faster alternative to munge, you could instead use numpy's stride_tricks to create a rolling view into your data (based on example here).
def munge_strides(data, backprop_window):
    """ take a rolling view into array by manipulating strides """
    from numpy.lib.stride_tricks import as_strided
    new_shape = (data.shape[0] - backprop_window,
                 backprop_window,
                 data.shape[1])
    new_strides = (data.strides[0], data.strides[0], data.strides[1])
    return as_strided(data, shape=new_shape, strides=new_strides)
X_train = np.arange(100).reshape(20, 5)
np.array_equal(munge(X_train, backprop_window=3),
               munge_strides(X_train, backprop_window=3))
Out[112]: True
as_strided needs to be used very carefully; it is an 'advanced' feature, and incorrect parameters can easily lead you into segfaults (see the docstring).
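If your NumPy is 1.20 or newer, np.lib.stride_tricks.sliding_window_view wraps the same trick with bounds checking, which avoids the segfault risk. A minimal sketch (window length and axis handling are assumptions based on the munge semantics above):
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

data = np.arange(100, dtype='float32').reshape(20, 5)
backprop_window = 3

# Windows along axis 0; the window axis is appended last, so move it next to
# the sample axis and drop the final window to match munge's output shape.
w = sliding_window_view(data, backprop_window, axis=0)   # (18, 5, 3)
w = np.moveaxis(w, -1, 1)[:-1]                           # (17, 3, 5)
assert np.array_equal(w, munge(data, backprop_window))   # munge as defined in the question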

Related

How to efficiently index a numpy array based on varying start and stop indexes per row

I have a 2D numpy array whose rows are time series of a feature, on which I'm training a neural network. For generalisation purposes, I would like to subset these time series at random points. I'd also like them to have a minimum subset length. However, the network requires fixed-length time series, so I need to pre-pad the resulting subsets with zeroes.
Currently I'm doing it with the code below, which includes a nasty for loop, because I don't know how to use fancy indexing for this particular problem. As this piece of code is part of the network's data generator, it needs to be fast to keep pace with the data-hungry GPU. Does anyone know a NumPy way of doing this without the for loop?
import numpy as np
import matplotlib.pyplot as plt
# Amount of time series to consider
batchsize = 25
# Original length of the time series
timesteps = 150
# As an example, fill the 2D array with sine function time series
sinefunction = np.expand_dims(np.sin(np.arange(timesteps)), axis=0)
originalarray = np.repeat(sinefunction, batchsize, axis=0)
# Now the real thing, we want:
# - to start the time series at a random moment (between 0 and maxstart)
# - to end the time series at a random moment
# - however with a minimum length of the resulting subset time series (minlength)
maxstart = 50
minlength = 75
# get random starts
randomstarts = np.random.choice(np.arange(0, maxstart), size=batchsize)
# get random stops
randomstops = np.random.choice(np.arange(maxstart + minlength, timesteps), size=batchsize)
# determine the resulting random sizes of the subset time series
randomsizes = randomstops - randomstarts
# finally create a new 2D array with all the randomly subset time series, however pre-padded with zeros
# THIS IS THE FOR LOOP WE SHOULD TRY TO AVOID
cutarray = np.zeros_like(originalarray)
for i in range(batchsize):
    cutarray[i, -randomsizes[i]:] = originalarray[i, randomstarts[i]:randomstops[i]]
To show what goes in and out of the function:
# Show that it worked
f, ax = plt.subplots(2, 1)
ax[0].imshow(originalarray)
ax[0].set_title('original array')
ax[1].imshow(cutarray)
ax[1].set_title('zero-padded subset array')
Approach #1 : Views-based
We can leverage scikit-image's view_as_windows (which is based on np.lib.stride_tricks.as_strided) to get sliding windowed views into a zero-padded version of the input and assign into a zero-padded version of the output. All that padding is needed for a vectorized solution on account of the ragged nature of the subsets. The upside is that working on views is efficient in memory and performance.
The implementation would look something like this:
from skimage.util.shape import view_as_windows
n = randomsizes.max()
max_extent = randomstarts.max()+n
padlen = max_extent - originalarray.shape[1]
p = np.zeros((originalarray.shape[0], padlen), dtype=originalarray.dtype)
a = np.hstack((originalarray, p))
w = view_as_windows(a, (1, n))[..., 0, :]
out_vals = w[np.arange(len(randomstarts)), randomstarts]
out_starts = originalarray.shape[1] - randomsizes
out_extensions_max = out_starts.max() + n
out = np.zeros((originalarray.shape[0], out_extensions_max), dtype=originalarray.dtype)
w2 = view_as_windows(out, (1, n))[..., 0, :]
w2[np.arange(len(out_starts)), out_starts] = out_vals
cutarray_out = out[:, :originalarray.shape[1]]
Approach #2 : With masking
cutarray_out = np.zeros_like(originalarray)
r = np.arange(originalarray.shape[1])
m = (randomstarts[:, None] <= r) & (randomstops[:, None] > r)
s = originalarray.shape[1] - randomsizes
m2 = s[:, None] <= r
cutarray_out[m2] = originalarray[m]
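Either approach can be sanity-checked against the loop-based result from the question (assuming the question's setup code has been run), since both select exactly the same elements:
# Both vectorized results should match the loop-based output exactly.
assert np.array_equal(cutarray_out, cutarray)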

How should I modify the test data for SVM method to be able to use the `precomputed` kernel function without error?

I am using sklearn.svm.SVR for a regression task in which I want to use my customized kernel method. Here are the dataset samples and the code:
index  density      speed      label
    0       14  58.844020  77.179139
    1       29  67.624946  78.367394
    2       44  77.679100  79.143744
    3       59  79.361877  70.048869
    4       74  72.529289  74.499239
  ...  and so on
from sklearn import svm
import pandas as pd
import numpy as np
density = np.random.randint(0,100, size=(3000, 1))
speed = np.random.randint(20,80, size=(3000, 1)) + np.random.random(size=(3000, 1))
label = np.random.randint(20,80, size=(3000, 1)) + np.random.random(size=(3000, 1))
d = np.hstack((density, speed, label))
data = pd.DataFrame(d, columns=['density', 'speed', 'label'])
data.density = data.density.astype(dtype=np.int32)
def my_kernel(X, Y):
    return np.dot(X, X.T)
svr = svm.SVR(kernel=my_kernel)
x = data[['density', 'speed']].iloc[:2000]
y = data['label'].iloc[:2000]
x_t = data[['density', 'speed']].iloc[2000:3000]
y_t = data['label'].iloc[2000:3000]
svr.fit(x,y)
y_preds = svr.predict(x_t)
The problem happens in the last line, svr.predict, which says:
X.shape[1] = 1000 should be equal to 2000, the number of samples at training time
I searched the web for a way to deal with the problem, but many similar questions (like {1}, {2}, {3}) were left unanswered.
Actually, I had used SVM methods with rbf, sigmoid, etc. before and the code worked just fine, but this was my first time using a customized kernel, and I suspected that it must be the reason this error happened.
So after a little research and reading the documentation, I found out that when using precomputed kernels, the matrix passed to SVR.predict() must have shape [n_samples_test, n_samples_train].
I wonder how to modify the test data x_t in order to get predictions so that everything works with no problem, like when we don't use customized kernels.
If possible, please describe the reason why the input to svm.predict for a precomputed kernel differs from that for the other kernels.
I really hope the unanswered questions related to this issue can be answered as well.
The problem is in your kernel function; it doesn't do the job.
As the documentation https://scikit-learn.org/stable/modules/svm.html#using-python-functions-as-kernels says, "Your kernel must take as arguments two matrices of shape (n_samples_1, n_features), (n_samples_2, n_features) and return a kernel matrix of shape (n_samples_1, n_samples_2)." The sample kernel on the same page satisfies this criterion:
def my_kernel(X, Y):
    return np.dot(X, Y.T)
In your function the second argument of dot is X.T, so the output has shape (n_samples_1, n_samples_1), which is not what is expected.
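For example, a minimal sketch plugging the corrected kernel into the question's setup (x, y, x_t and imports as defined in the question above):
def my_kernel(X, Y):
    return np.dot(X, Y.T)

svr = svm.SVR(kernel=my_kernel)
svr.fit(x, y)               # x: (2000, 2), y: (2000,)
y_preds = svr.predict(x_t)  # x_t: (1000, 2) -> 1000 predictions, no shape error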
The shape mismatch means the test data and the training data do not have compatible shapes; always think in terms of matrix or array shapes in NumPy. If you are doing any arithmetic operation you need compatible shapes, which is why we check array.shape.
You can modify shapes to get the [n_samples_test, n_samples_train] form, but it's not the best idea. array.shape, reshape and resize are what you would use for that.
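For completeness, here is a minimal sketch of the precomputed route the question title asks about (using the question's x, y, x_t); with kernel='precomputed' you pass Gram matrices directly, which is where the [n_samples_test, n_samples_train] shape requirement comes from:
# Precomputed-kernel sketch: build the Gram matrices explicitly.
svr_pre = svm.SVR(kernel='precomputed')
K_train = np.dot(x.values, x.values.T)    # (2000, 2000): train vs. train
K_test = np.dot(x_t.values, x.values.T)   # (1000, 2000): test vs. train
svr_pre.fit(K_train, y)
y_preds_pre = svr_pre.predict(K_test)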

Reshaping error in multivariate normal function with Numpy - Python

I have this data (c4), and I want to run 4-fold cross-validation on this matrix.
The way I'm splitting the data is as follows:
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.model_selection import KFold
import math
c4 = np.array([
[5,10,14,18,22,19,21,18,18,19,19,18,15,15,12,4,4,4,3,3,3,3,3,3,3,3,3,3,3,1],
[6,9,11,12,10,10,13,16,18,21,20,19,8,5,4,4,4,4,4,4,4,4,4,4,3,3,3,3,3,3],
[4,8,12,17,18,21,21,21,17,16,15,13,7,8,8,7,7,4,4,4,3,3,3,3,4,4,3,3,3,2],
[3,7,12,17,19,20,22,20,20,19,19,18,17,16,16,15,14,13,12,9,4,4,4,3,3,3,3,3,2,1],
[2,5,8,10,10,11,11,10,13,17,19,20,22,22,20,16,15,15,13,11,8,3,3,3,3,3,3,3,2,1],
[4,8,10,11,10,15,15,17,18,19,18,20,18,17,15,13,12,7,4,4,4,4,4,4,4,4,3,3,3,2],
[2,8,12,15,18,20,19,20,21,21,23,19,19,16,16,16,14,12,10,7,7,7,7,6,3,3,3,3,2,1],
[2,13,17,18,21,22,20,18,18,17,17,15,13,11,8,8,4,4,4,4,4,4,4,4,4,4,4,4,3,1],
[6,6,9,14,15,18,20,20,22,20,16,16,15,11,8,8,8,5,4,4,4,4,4,4,4,5,5,5,5,4],
[8,13,16,20,20,20,19,17,17,17,17,15,14,13,10,6,3,3,3,4,4,4,3,3,4,3,3,3,2,2],
[5,9,17,18,19,18,17,16,14,13,12,12,11,10,4,4,4,3,3,3,3,3,3,3,4,4,3,3,3,3],
[4,6,8,11,16,17,18,20,16,17,16,17,17,16,14,12,12,10,9,9,8,8,6,4,3,3,3,2,2,2] ])
kf = KFold(n_splits=4)
for train_index, test_index in kf.split(c4):
    X_train, X_test = c4[train_index], c4[test_index]
    X_train_mean = np.mean(X_train)
    X_train_cov = np.cov(X_train.T)
    v = multivariate_normal(X_train_mean, X_train_cov)
    res = v.pdf(X_test)
    print(res)
but it didn't work for me, even though the splitting loop works well on a small sample of data.
The error message that I got:
ValueError: cannot reshape array of size 900 into shape (1,1)
Note: the length of all rows is equal.
Thanks in advance.
You are taking the mean of the entire matrix X_train when you do np.mean(X_train). What you should do is take the mean across the sample axis, i.e. if your features are across columns and samples are across rows, then replace np.mean(X_train) with np.mean(X_train, axis=0). This should solve the error.
Including this line in the above code makes it work. Basically, np.mean(c4[test_index], axis=0) will give you a 1 x 30 mean vector instead of a scalar mean.
from scipy.stats import multivariate_normal as mvn
v = mvn(np.mean(c4[test_index], axis=0), X_train_cov + np.eye(30))
I had to add an identity matrix because I was getting a singular matrix error. However, that has to do with how c4 is defined and nothing to do with this code. Note that to avoid the singularity, you typically add a very small value on the diagonal and not an identity matrix. This is just for illustration.
What is multivariate_normal? If it is from scipy.stats, then per the doc you must do
multivariate_normal.pdf(X_test, np.mean(X_train, axis=0), X_train_cov)
The doc is here.
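Putting the pieces together, a minimal corrected sketch of the loop (assumptions: the mean is taken across samples as the first answer suggests, and an identity matrix is added to the covariance, as in the second answer, to avoid the singular-matrix error):
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.model_selection import KFold

kf = KFold(n_splits=4)
for train_index, test_index in kf.split(c4):
    X_train, X_test = c4[train_index], c4[test_index]
    X_train_mean = np.mean(X_train, axis=0)       # length-30 mean vector
    X_train_cov = np.cov(X_train.T) + np.eye(30)  # regularized covariance
    v = multivariate_normal(X_train_mean, X_train_cov)
    print(v.pdf(X_test))                          # one density value per test row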

Python efficient summation in large 2D array

My task is fairly simple: I have a large 2D matrix, containing only zeros and ones. For each position in this matrix, I want to sum all pixels in a window around this position. The problem is that the matrix has the shape (166667, 17668) and window sizes range from (333, 333) to (5333, 5333). So far I have only tried on a subset of the data. The code I arrived at:
out_arr = np.zeros(in_arr.shape)
in_arr = np.pad(in_arr, windowsize//2, mode='reflect')
for y in range(out_arr.shape[0]):
    for x in range(out_arr.shape[1]):
        out_arr[y, x] = np.sum(in_arr[y:y+windowsize, x:x+windowsize])
Obviously, this takes a long time. But in my case it was faster than a rolling window approach using numpy.stride_tricks.as_strided, as described here. I tried compiling it using cython, without effect.
What would be your suggestions to speed this up, apart from parallelizing?
I have a Nvidia Titan X at hand. Is there a way to benefit from that?
(e.g. using cupy)
For windowed summation convolution is actually overkill since a simple O(n) solution exists:
import numpy as np
from scipy.signal import convolve
def winsum(in_arr, windowsize):
    in_arr = np.pad(in_arr, windowsize//2+1, mode='reflect')[:-1, :-1]
    in_arr[0] = 0
    in_arr[:, 0] = 0
    ps = in_arr.cumsum(0).cumsum(1)
    return ps[windowsize:, windowsize:] + ps[:-windowsize, :-windowsize] \
        - ps[windowsize:, :-windowsize] - ps[:-windowsize, windowsize:]
This is already fast, but you can save even more because ps, calculated once for the largest window size, could be reused for all smaller window sizes.
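A minimal sketch of that reuse idea (an assumption on my part rather than code from the answer: it uses zero padding at the borders instead of the reflect padding above, purely to keep the offset arithmetic simple):
def make_ps(in_arr, max_window):
    # Prefix-sum table padded once for the largest window, with an extra
    # leading zero row/column so the four-corner lookup below works.
    pad = max_window // 2
    padded = np.pad(in_arr, pad + 1)[:-1, :-1]
    return padded.cumsum(0).cumsum(1), pad

def winsum_shared(ps, pad, windowsize, out_shape):
    # Windowed sums for any windowsize <= max_window from the shared table.
    o = pad - windowsize // 2          # shift into the larger padding
    h, w = out_shape
    hi_r, lo_r = slice(o + windowsize, o + windowsize + h), slice(o, o + h)
    hi_c, lo_c = slice(o + windowsize, o + windowsize + w), slice(o, o + w)
    return ps[hi_r, hi_c] + ps[lo_r, lo_c] - ps[hi_r, lo_c] - ps[lo_r, hi_c]

arr = np.random.random((200, 200))
ps, pad = make_ps(arr, max_window=41)
s41 = winsum_shared(ps, pad, 41, arr.shape)   # largest window
s13 = winsum_shared(ps, pad, 13, arr.shape)   # smaller window, same table
assert np.isclose(s13[100, 100], arr[94:107, 94:107].sum())  # interior check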
However, there is one potential drawback, which is the very large numbers that may arise from summing everything like that. A numerically more sound version eliminates this problem by taking the differences first. Downside: the extra saving from sharing ps is no longer available.
def winsum_safe(in_arr, windowsize):
    in_arr = np.pad(in_arr, windowsize//2, mode='reflect')
    in_arr[windowsize:] -= in_arr[:-windowsize]
    in_arr[:, windowsize:] -= in_arr[:, :-windowsize]
    return in_arr.cumsum(0)[windowsize-1:].cumsum(1)[:, windowsize-1:]
For reference, here is the closest competitor, which is FFT-based convolution. You need an up-to-date version of scipy for this to work efficiently. On older versions, use fftconvolve instead of convolve.
def winsumc(in_arr, windowsize):
    in_arr = np.pad(in_arr, windowsize//2, mode='reflect')
    kernel = np.ones((windowsize, windowsize), in_arr.dtype)
    return convolve(in_arr, kernel, 'valid')
The next one is to simulate scipy's old - and excruciatingly slow - behavior.
def winsum_nofft(in_arr, windowsize):
    in_arr = np.pad(in_arr, windowsize//2, mode='reflect')
    kernel = np.ones((windowsize, windowsize), in_arr.dtype)
    return convolve(in_arr, kernel, 'valid', method='direct')
Testing and benchmarking:
data = np.random.random((1000, 1000))
assert np.allclose(winsum(data, 333), winsumc(data, 333))
assert np.allclose(winsum(data, 333), winsum_safe(data, 333))
kwds = dict(globals=globals(), number=10)
from timeit import timeit
from time import perf_counter
print('data 1000x1000, window 333x333')
print('cumsum: ', timeit('winsum(data, 333)', **kwds)*100, 'ms')
print('cumsum safe: ', timeit('winsum_safe(data, 333)', **kwds)*100, 'ms')
print('fftconv: ', timeit('winsumc(data, 333)', **kwds)*100, 'ms')
t = perf_counter()
res = winsum_nofft(data, 99) # 333 just takes too long
t = perf_counter() - t
assert np.allclose(winsum(data, 99), res)
print('data 1000x1000, window 99x99')
print('conv: ', t*1000, 'ms')
Sample output:
data 1000x1000, window 333x333
cumsum: 70.33260859316215 ms
cumsum safe: 59.98647050000727 ms
fftconv: 298.60571819590405 ms
data 1000x1000, window 99x99
conv: 135224.8261970235 ms
@Divakar pointed out in the comments that you can use conv2d, and he is right. Here is an example:
import numpy as np
from scipy import signal
data = np.random.rand(5,5) # you original data that you want to sum
kernel = np.ones((2,2)) # square matrix of your dimensions, filled with ones
output = signal.convolve2d(data,kernel,mode='same') # the convolution
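If you want boundary handling closer to the question's reflect padding, convolve2d also accepts boundary='symm' (symmetric padding, a close but not identical cousin of np.pad's 'reflect'); for example:
output_symm = signal.convolve2d(data, kernel, mode='same', boundary='symm')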

Joblib parallel write to "shared" numpy sparse matrix

I'm trying to compute the number of shared neighbors for each node of a very big graph (~1m nodes). Using joblib, I'm trying to run it in parallel, but I'm worried about parallel writes to the sparse matrix that is supposed to hold all the data. Will this piece of code produce consistent results?
vNum = 1259084
NN_Matrix = csc_matrix((vNum, vNum), dtype=np.int8)
def nn_calc_parallel(node_id=None):
    i, j = np.unravel_index(node_id, (1259084, 1259084))
    NN_Matrix[i, j] = len(np.intersect1d(nx.neighbors(G, i), nx.neighbors(G, j)))
num_cores = multiprocessing.cpu_count()
result = Parallel(n_jobs=num_cores)(delayed(nn_calc_parallel)(i) for i in xrange(vNum**2))
If not, can you help me to solve this?
I needed to do the same work; in my case it was fine to merge the matrices together into one matrix, which you can do this way:
from scipy.sparse import vstack
matrixes = Parallel(n_jobs=-3)(delayed(nn_calc_parallel)(x) for x in documents)
matrix = vstack(matrixes)
n_jobs=-3 means use all CPUs except 2; otherwise it might throw some memory errors.
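A related pattern that avoids writing to a shared global entirely, sketched here under my own assumptions (a hypothetical worker over chunks of (i, j) pairs, not code from either post): each worker returns its triples and the parent builds the sparse matrix once at the end.
from joblib import Parallel, delayed
from scipy.sparse import coo_matrix

def shared_neighbors(G, i, j):
    # Number of common neighbors of nodes i and j.
    return len(set(G.neighbors(i)) & set(G.neighbors(j)))

def worker(G, pairs):
    # Each worker handles a chunk of (i, j) pairs and returns triples.
    return [(i, j, shared_neighbors(G, i, j)) for i, j in pairs]

def build_matrix(G, pair_chunks, n_nodes, n_jobs=-1):
    chunks = Parallel(n_jobs=n_jobs)(delayed(worker)(G, p) for p in pair_chunks)
    rows, cols, vals = zip(*(t for chunk in chunks for t in chunk))
    return coo_matrix((vals, (rows, cols)), shape=(n_nodes, n_nodes)).tocsc()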
