So I have this code:
class distKmeans(beam.DoFn):
    # i will do an init function to add the kmeans parameters
    def __init__(self, n_clusters, rseed=2):
        self.n_clusters = n_clusters
        self.rseed = rseed
        self.centers = None

    # The function "process" implements the main functionality of the K-means algorithm
    def process(self, element):
        if self.centers is None:
            rng = np.random.RandomState(self.rseed)
            # we use len instead of shape because element is a PCOLLECTION
            i = rng.permutation(element.shape[0])[:self.n_clusters]
            self.centers = element[i]

        # b1. Calculate the closest center μ to xi
        labels = pairwise_distances_argmin(element, self.centers)

        # b2. Update the center
        new_centers = np.array([element[labels == i].mean(0)
                                for i in range(self.n_clusters)])

        # c.
        if np.all(self.centers == new_centers):
            return

        self.centers = new_centers
        yield self.centers, labels


with beam.Pipeline() as pipeline:
    mydata = pipeline | beam.Create(X)
    mydata = mydata | beam.ParDo(distKmeans(3))
    mydata | "write" >> beam.io.WriteToText("sample_data/output.txt")
As I'm trying to create a distributed k-means with Apache Beam, my data was generated using this code:
n_samples=200
n_features=2
X, y = make_blobs(n_samples=n_samples,centers=3, n_features=n_features)
data = np.c_[X,y]
plt.scatter(data[:, 0], data[:, 1], s=50);
and then X is:
X = data[['X1','X2']].to_numpy()
X = X[1:]
Its shape is (200, 2).
The code seems correct, but I always get the following error even though my data is a 2D array:
Expected 2D array, got 1D array instead:
array=[-6.03120913 11.30181549].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample. [while running '[54]: ParDo(distKmeans)']
and this error occurs at this line:
labels = pairwise_distances_argmin(element, self.centers)
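For context, here is a minimal sketch (not part of the original question) of what beam.Create does with a 2-D numpy array: it iterates over the array and emits one row per element, so process receives a single 1-D point of shape (2,), which matches the array shown in the error message.
import numpy as np
import apache_beam as beam

X = np.random.randn(200, 2)   # stand-in for the (200, 2) blobs data above

with beam.Pipeline() as p:
    (p
     | beam.Create(X)                           # emits one row of X per element
     | beam.Map(lambda element: element.shape)  # each element has shape (2,)
     | "print" >> beam.Map(print))              # prints (2,) once per row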
I try to run a 9x9 pixel kernel across a large satellite image with a custom filter. One satellite scene is ~40 GB, and to fit it into my RAM I'm using xarray's option to chunk my dataset with dask.
My filter includes a check whether the kernel is complete (i.e. no missing data at the edge of the image). In that case a NaN is returned to prevent a potential bias (and I don't really care about the edges). I now realized that this introduces NaNs not only at the edges of the image (expected behaviour), but also along the edges of each chunk, because the chunks don't overlap. dask provides options to create chunks with an overlap, but are there any comparable capabilities in xarray? I found this issue, but it doesn't seem like there has been any progress in this regard.
Some sample code (shortened version of my original code):
import numpy as np
import numba
import math
import xarray as xr


#numba.jit("f4[:,:](f4[:,:],i4)", nopython = True)
def water_anomaly_filter(input_arr, window_size = 9):
    # check if window size is odd
    if window_size%2 == 0:
        raise ValueError("Window size must be odd!")

    # prepare an output array with NaNs and the same dtype as the input
    output_arr = np.zeros_like(input_arr)
    output_arr[:] = np.nan

    # calculate how many pixels in x and y direction around the center pixel
    # are in the kernel
    pix_dist = math.floor(window_size/2-0.5)

    # create a dummy weight matrix
    weights = np.ones((window_size, window_size))

    # get the shape of the input array
    xn, yn = input_arr.shape

    # iterate over the x axis
    for x in range(xn):
        # determine limits of the kernel in x direction
        xmin = max(0, x - pix_dist)
        xmax = min(xn, x + pix_dist+1)

        # iterate over the y axis
        for y in range(yn):
            # determine limits of the kernel in y direction
            ymin = max(0, y - pix_dist)
            ymax = min(yn, y + pix_dist+1)

            # extract data values inside the kernel
            kernel = input_arr[xmin:xmax, ymin:ymax]

            # if the kernel is complete (i.e. not at image edge...) and it
            # is not all NaN
            if kernel.shape == weights.shape and not np.isnan(kernel).all():
                # apply the filter. In this example simply keep the original
                # value
                output_arr[x, y] = input_arr[x, y]

    return output_arr

def run_water_anomaly_filter_xr(xds, var_prefix = "band",
                                window_size = 9):
    variables = [x for x in list(xds.variables) if x.startswith(var_prefix)]

    for var in variables[:2]:
        xds[var].values = water_anomaly_filter(xds[var].values,
                                               window_size = window_size)

    return xds

def create_test_nc():
    data = np.random.randn(1000, 1000).astype(np.float32)

    rows = np.arange(54, 55, 0.001)
    cols = np.arange(10, 11, 0.001)

    ds = xr.Dataset(
        data_vars=dict(
            band_1=(["x", "y"], data)
        ),
        coords=dict(
            lon=(["x"], rows),
            lat=(["y"], cols),
        ),
        attrs=dict(description="Testdata"),
    )

    ds.to_netcdf("test.nc")

if __name__ == "__main__":

    # if required, create test data
    create_test_nc()

    # import data
    with xr.open_dataset("test.nc",
                         chunks = {"x": 50,
                                   "y": 50},
                         ) as xds:

        xds_2 = xr.map_blocks(run_water_anomaly_filter_xr,
                              xds,
                              template = xds).compute()

        xds_2["band_1"][:200, :200].plot()
This yields a plot of band_1 in which you can clearly see the rows and columns of NaNs along the edges of each chunk.
I'm happy about any suggestions. I would love to get the overlapping chunks (or any other solution) within xarray, but I'm also open to other solutions.
You can use Dask's map_overlap as follows:
import dask.array

arr = dask.array.map_overlap(
    water_anomaly_filter, xds.band_1.data, dtype='f4', depth=4, window_size=9
).compute()
da = xr.DataArray(arr, dims=xds.band_1.dims, coords=xds.band_1.coords)
Note that you will likely want to tune depth and window_size for your specific application.
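For instance (a sketch, assuming the 9x9 kernel from the question), an overlap of half the window on each side is enough for the kernel to be complete at internal chunk boundaries:
import dask.array

window_size = 9
depth = window_size // 2   # 4 pixels of halo on every internal chunk edge

arr = dask.array.map_overlap(
    water_anomaly_filter, xds.band_1.data,
    dtype='f4', depth=depth, window_size=window_size
).compute()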
I have a data frame -> data with the shape (10000, 257). I need to preprocess this dataframe so that I can use it in an LSTM, which requires a 3-dimensional input (nrows, ntimesteps, nfeatures). I am working with the code snippet that is provided here:
def univariate_processing(variable, window):
    import numpy as np
    # create empty 2D matrix from variable
    V = np.empty((len(variable)-window+1, window))
    # take each row/time window
    for i in range(V.shape[0]):
        V[i,:] = variable[i : i+window]
    V = V.astype(np.float32) # set common data type
    return V
def RNN_regprep(df, y, len_input, len_pred): #, test_size):
    # create 3D matrix for multivariate input
    X = np.empty((df.shape[0]-len_input+1, len_input, df.shape[1]))
    # Iterate univariate preprocessing on all variables - store them in X
    for i in range(df.shape[1]):
        X[ : , : , i ] = univariate_processing(df[:,i], len_input)
    # create 2D matrix of y sequences
    y = y.reshape((-1,)) # reshape to 1D if needed
    Y = univariate_processing(y, len_pred)
    ## Trim dataframes as explained
    X = X[ :-(len_pred + 1) , : , : ]
    Y = Y[len_input:-1 , :]
    # Set common datatype
    X = X.astype(np.float32)
    Y = Y.astype(np.float32)
    return X, Y
X, y = RNN_regprep(data, label, len_input=200, len_pred=1)
While running this, the following error is obtained:
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 28.9 GiB for an array with shape (10000, 200, 257) and data type float64
I do understand that this is an issue with the memory available on my server. Is there anything I can change in my code to avoid this memory error, or at least reduce the memory consumption?
This is what windowed views are for. Using my recipe here:
var = np.random.rand(10000,257)
w = window_nd(var, 200, axis = 0)
Now you have a windowed view over var:
w.shape
Out[]: (9801, 200, 257)
But, importantly, it's using the exact same data as var, just looking into it in a windowed way:
w.__array_interface__['data'] #This is the memory's starting address
Out[]: (1448954720320, False)
var.__array_interface__['data']
Out[]: (1448954720320, False)
np.shares_memory(var, w)
Out[]: True
w.base.base.base is var #(lots of rearranging views in the background)
Out[]: True
So you can do:
def univariate_processing(variable, window):
    return window_nd(variable, window, axis = 0)
That should significantly reduce your memory allocation, no "magic" required :)
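If you'd rather not copy the linked recipe, numpy's built-in sliding_window_view gives a comparable read-only view (a sketch, assuming numpy >= 1.20, where the function was added):
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

var = np.random.rand(10000, 257)            # same toy array as above
w = sliding_window_view(var, 200, axis=0)   # shape (9801, 257, 200), no copy
w = np.moveaxis(w, -1, 1)                   # reorder axes to (9801, 200, 257)

print(w.shape)                   # (9801, 200, 257)
print(np.shares_memory(var, w))  # True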
You can also try
from skimage.util import view_as_windows
w = np.squeeze(view_as_windows(var, (200, 1)))
which does almost the same thing. In this case, your answer would be:
def univariate_processing(variable, window):
    from skimage.util import view_as_windows
    window = (window,) + (1,)*(len(variable.shape)-1)
    return np.squeeze(view_as_windows(variable, window))
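For instance, applied to a single column the way RNN_regprep calls it (a sketch using the var array from above):
col = var[:, 0]                      # one feature column, length 10000
V = univariate_processing(col, 200)  # the view_as_windows version above
print(V.shape)                       # (9801, 200)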
I get a ValueError: Found input variables with inconsistent numbers of samples: [20000, 1] when I run the following, even though the row counts of x and y are correct. I load the RCV1 dataset, get the indices of the categories with the top x documents, create a list of tuples with an equal number of randomly selected positives and negatives for each category, and then finally attempt to run a logistic regression on one of the categories.
import numpy as np
import sklearn.datasets
from sklearn import model_selection, preprocessing
from sklearn.linear_model import LogisticRegression
from matplotlib import pyplot as plt
from scipy import sparse
rcv1 = sklearn.datasets.fetch_rcv1()
def get_top_cat_indices(target_matrix, num_cats):
    cat_counts = target_matrix.sum(axis=0)
    #cat_counts = cat_counts.reshape((1,103)).tolist()[0]
    cat_counts = cat_counts.reshape((103,))
    #b = sorted(cat_counts, reverse=True)
    ind_temp = np.argsort(cat_counts)[::-1].tolist()[0]
    ind = [ind_temp[i] for i in range(5)]
    return ind
def prepare_data(x, y, top_cat_indices, sample_size):
    res_lst = []
    for i in top_cat_indices:
        # get column of indices with relevant cat
        temp = y.tocsc()[:, i]
        # all docs with labeled category
        cat_present = x.tocsr()[np.where(temp.sum(axis=1)>0)[0],:]
        # all docs other than labelled category
        cat_notpresent = x.tocsr()[np.where(temp.sum(axis=1)==0)[0],:]
        # get indices equal to 1/2 of sample size
        idx_cat = np.random.randint(cat_present.shape[0], size=int(sample_size/2))
        idx_nocat = np.random.randint(cat_notpresent.shape[0], size=int(sample_size/2))
        # concatenate the ids
        sampled_x_pos = cat_present.tocsr()[idx_cat,:]
        sampled_x_neg = cat_notpresent.tocsr()[idx_nocat,:]
        sampled_x = sparse.vstack((sampled_x_pos, sampled_x_neg))
        sampled_y_pos = temp.tocsr()[idx_cat,:]
        sampled_y_neg = temp.tocsr()[idx_nocat,:]
        sampled_y = sparse.vstack((sampled_y_pos, sampled_y_neg))
        res_lst.append((sampled_x, sampled_y))
    return res_lst
ind = get_top_cat_indices(rcv1.target, 5)
test_res = prepare_data(train_x, train_y, ind, 20000)
x, y = test_res[0]
print(x.shape)
print(y.shape)
LogisticRegression().fit(x, y)
Could it be an issue with the sparse matrices, or a problem with dimensionality (there are 20K samples and 47K features)?
When I run your code, I get the following error:
AttributeError: 'bool' object has no attribute 'any'
That's because y for LogisticRegression needs to be a numpy array. So I changed the last line to:
LogisticRegression().fit(x, y.A.flatten())
Then I get the following error:
ValueError: This solver needs samples of at least 2 classes in the data, but the data contains only one class: 0
This is because your sampling code has a bug. You need to subset the y array to the rows having that category before applying the sampling indices. See the code below:
def prepare_data(x, y, top_cat_indices, sample_size):
    res_lst = []
    for i in top_cat_indices:
        # get column of indices with relevant cat
        temp = y.tocsc()[:, i]
        # all docs with labeled category
        c1 = np.where(temp.sum(axis=1)>0)[0]
        c2 = np.where(temp.sum(axis=1)==0)[0]
        cat_present = x.tocsr()[c1,:]
        # all docs other than labelled category
        cat_notpresent = x.tocsr()[c2,:]
        # get indices equal to 1/2 of sample size
        idx_cat = np.random.randint(cat_present.shape[0], size=int(sample_size/2))
        idx_nocat = np.random.randint(cat_notpresent.shape[0], size=int(sample_size/2))
        # concatenate the ids
        sampled_x_pos = cat_present.tocsr()[idx_cat,:]
        sampled_x_neg = cat_notpresent.tocsr()[idx_nocat,:]
        sampled_x = sparse.vstack((sampled_x_pos, sampled_x_neg))

        sampled_y_pos = temp.tocsr()[c1][idx_cat,:]
        print(sampled_y_pos.nnz)
        sampled_y_neg = temp.tocsr()[c2][idx_nocat,:]
        print(sampled_y_neg.nnz)
        sampled_y = sparse.vstack((sampled_y_pos, sampled_y_neg))

        res_lst.append((sampled_x, sampled_y))
    return res_lst
Now everything works like a charm.
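For completeness, a minimal end-to-end run with the fixed prepare_data could look like this (a sketch: train_x and train_y are not defined in the question, so rcv1.data and rcv1.target are assumed here, and max_iter is raised only to help convergence):
ind = get_top_cat_indices(rcv1.target, 5)
test_res = prepare_data(rcv1.data, rcv1.target, ind, 20000)
x, y = test_res[0]

print(x.shape, y.shape)   # (20000, 47236) and (20000, 1)
LogisticRegression(max_iter=1000).fit(x, y.A.flatten())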
My goal is to allow my TensorFlow Dataset pipeline to accept inputs of nearly arbitrary size, which will be converted into uniformly sized samples (known at 'compile' time) that outnumber the original inputs. Thus I have a py_func (similar to 1 in the idea of mapping one to many) which aims to return a dataset for use in flat_map:
def split_fn(x, y):
    """ Splits X into a number of subsamples, each labeled y"""
    full_width = x.shape[1]
    full_height = x.shape[0]
    print(full_width)
    print(full_height)

    slice_width = SLICE_WIDTH
    slice_height = SLICE_HEIGHT

    # The splits created by these offsets cover the complete input image
    offsets1 = [[x,0] for x in range(0, full_width-slice_width, slice_width)]
    if full_width % slice_width != 0:
        offsets1.append([full_width-slice_width, 0])

    # The splits from these offsets are random, intended for data augmentation
    offsets2 = [[x,0] for x in random.sample(range(0, full_width-slice_width), 5)]

    # Combine the two lists of offsets
    offsets = offsets1 + offsets2

    image = x.reshape(1, full_height, full_width, 1)

    # This creates a list of the slices corresponding to the offsets
    ts = list(map(lambda offset: tf.image.crop_to_bounding_box(image,
                                                               offset[1],
                                                               offset[0],
                                                               slice_height,
                                                               slice_width),
                  offsets))

    # Create and concatenate a dataset for each of the samples
    datasets = map(lambda d: tf.data.Dataset.from_tensors((d, y)), ts)
    ds = reduce((lambda x, y: x.concatenate(y)), datasets)

    return ds
However, where I define offsets1, I get:
TypeError: __index__ returned non-int (type NoneType)
I've tried to fix this by wrapping it in a py_func which returns a dataset:
dataset = dataset.flat_map(
    lambda image, label: tuple(tf.py_func(
        split_fn, [image, label], [tf.data.Dataset])))
However, I can't seem to get this to work correctly:
TypeError: Expected DataType for argument 'Tout' not <class 'tensorflow.python.data.ops.dataset_ops.Dataset'>.
What can I do to get this to work?
Thank you
I'm trying to vectorize some code with numpy so that I can run it using multiprocessing, but I can't understand how numpy.apply_along_axis works. This is an example of the code, vectorized using map:
import numpy
from scipy import sparse
import multiprocessing
from matplotlib import pyplot
# first I build a matrix of some x positions vs time data in a sparse format
matrix = numpy.random.randint(2, size = 100).astype(float).reshape(10,10)
x = numpy.nonzero(matrix)[0]
times = numpy.nonzero(matrix)[1]
weights = numpy.random.rand(x.size)

# then I define an array of y positions
nStepsY = 5
y = numpy.arange(1,nStepsY+1)

# now I build an image using x-y-times coordinates and x-times weights
def mapIt(ithStep):
    ncolumns = 80
    image = numpy.zeros(ncolumns)
    yTimed = y[ithStep]*times
    positions = (numpy.round(x-yTimed)+50).astype(int)
    values = numpy.bincount(positions,weights)
    values = values[numpy.nonzero(values)]
    positions = numpy.unique(positions)
    image[positions] = values
    return image
image = list(map(mapIt, range(nStepsY)))
image = numpy.array(image)
a = pyplot.imshow(image, aspect = 10)
Here is the output plot:
I tried to use numpy.apply_along_axis, but this function allows me to iterate only along the rows of image, while I need to iterate along the ithStep index too. E.g.:
# now I build an image using x-y-times coordinates and x-times weights
nrows = nStepsY
ncolumns = 80
matrix = numpy.zeros(nrows*ncolumns).reshape(nrows,ncolumns)

def applyIt(image):
    image = numpy.zeros(ncolumns)
    yTimed = y[ithStep]*times
    positions = (numpy.round(x-yTimed)+50).astype(int)
    values = numpy.bincount(positions,weights)
    values = values[numpy.nonzero(values)]
    positions = numpy.unique(positions)
    image[positions] = values
    return image
imageApplied = numpy.apply_along_axis(applyIt,1,matrix)
a = pyplot.imshow(imageApplied, aspect = 10)
It obviously returns only the first row nrows times, since nothing iterates over ithStep:
And here is the wrong plot:
Is there a way to iterate over an index, or to use an index while numpy.apply_along_axis iterates?
Here is the code with only matrix operations: it's quite a bit faster than map or apply_along_axis, but it uses much more memory.
(In this function I use a trick with scipy.sparse, which works more intuitively than numpy arrays when you try to sum numbers onto the same element.)
def fullmatrix(nRows, nColumns):
    y = numpy.arange(1,nStepsY+1)
    image = numpy.zeros((nRows, nColumns))
    yTimed = numpy.outer(y,times)
    x3d = numpy.outer(numpy.ones(nStepsY),x)
    weights3d = numpy.outer(numpy.ones(nStepsY),weights)
    y3d = numpy.outer(y,numpy.ones(x.size))
    positions = (numpy.round(x3d-yTimed)+50).astype(int)
    matrix = sparse.coo_matrix((numpy.ravel(weights3d), (numpy.ravel(y3d), numpy.ravel(positions)))).todense()
    return matrix

image = fullmatrix(nStepsY, 80)
a = pyplot.imshow(image, aspect = 10)
This way is simpler and very fast! Thank you so much.
nStepsY = 5
nRows = nStepsY
nColumns = 80
y = numpy.arange(1,nStepsY+1)
image = numpy.zeros((nRows, nColumns))
fakeRow = numpy.zeros(positions.size)

def itermatrix(ithStep):
    yTimed = y[ithStep]*times
    positions = (numpy.round(x-yTimed)+50).astype(int)
    matrix = sparse.coo_matrix((weights, (fakeRow, positions))).todense()
    matrix = numpy.ravel(matrix)
    missColumns = (nColumns-matrix.size)
    zeros = numpy.zeros(missColumns)
    matrix = numpy.concatenate((matrix, zeros))
    return matrix

for i in numpy.arange(nStepsY):
    image[i] = itermatrix(i)

# or, without initialization of image:
imageMapped = list(map(itermatrix, range(nStepsY)))
imageMapped = numpy.array(imageMapped)
It feels like attempting to use map or apply_along_axis is obscuring the essentially iterative nature of the problem.
I rewrote your code as an explicit loop on y:
nStepsY = 5
y = numpy.arange(1,nStepsY+1)

image = numpy.zeros((nStepsY, 80))
for i, yi in enumerate(y):
    yTimed = yi*times
    positions = (numpy.round(x-yTimed)+50).astype(int)
    values = numpy.bincount(positions,weights)
    values = values[numpy.nonzero(values)]
    positions = numpy.unique(positions)
    image[i, positions] = values

a = pyplot.imshow(image, aspect = 10)
pyplot.show()
Looking at the code, I think I could calculate positions for all y values, making a (y.shape[0], times.shape[0]) array. But the rest, the bincount and unique, still have to work row by row.
apply_along_axis, when working with a 2d array and axis=1, essentially does:
res = np.zeros_like(arr)
for i in range(arr.shape[0]):
    res[i,:] = func1d(arr[i,:])
If the input array has more dimensions it constructs a more elaborate indexing object [i,j,k,:]. And it can handle cases where func1d returns a different size array than the input. But in any case it is just a generalized iteration tool.
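A small runnable check of that equivalence, with a toy func1d (not from the question):
import numpy as np

arr = np.arange(12.).reshape(3, 4)
func1d = lambda row: row - row.mean()   # any 1-D -> 1-D function

res = np.zeros_like(arr)
for i in range(arr.shape[0]):
    res[i,:] = func1d(arr[i,:])

print(np.array_equal(res, np.apply_along_axis(func1d, 1, arr)))  # True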
Moving the initial positions creation outside the loop:
yTimed = y[:,None]*times
positions = (numpy.round(x-yTimed)+50).astype(int)
image = numpy.zeros((positions.shape[0], 80))
for i, pos in enumerate(positions):
    values = numpy.bincount(pos,weights)
    values = values[numpy.nonzero(values)]
    pos = numpy.unique(pos)
    image[i, pos] = values
Now I can cast this as an apply_along_axis problem, with an applyIt that takes a positions vector (with all the yTimed information) rather than a blank image vector.
def applyIt(pos, size, weights):
    acolumn = numpy.zeros(size)
    values = numpy.bincount(pos,weights)
    values = values[numpy.nonzero(values)]
    pos = numpy.unique(pos)
    acolumn[pos] = values
    return acolumn

image = numpy.apply_along_axis(applyIt, 1, positions, 80, weights)
Timing wise, I expect it to be a bit slower than my explicit iteration. It has to do more setup work, including a test call applyIt(positions[0,:],...) to determine the size of its return array (i.e. image has a different shape than positions).
def csrmatrix(y, times, x, weights):
    yTimed = numpy.outer(y,times)
    n = y.shape[0]
    x3d = numpy.outer(numpy.ones(n),x)
    weights3d = numpy.outer(numpy.ones(n),weights)
    y3d = numpy.outer(y,numpy.ones(x.size))
    positions = (numpy.round(x3d-yTimed)+50).astype(int)
    #print(y.shape, weights3d.shape, y3d.shape, positions.shape)
    matrix = sparse.csr_matrix((numpy.ravel(weights3d), (numpy.ravel(y3d), numpy.ravel(positions))))
    #print(repr(matrix))
    return matrix
# one call
image = csrmatrix(y, times, x, weights)
# iterative call
alist = []
for yi in numpy.arange(1,nStepsY+1):
    alist.append(csrmatrix(numpy.array([yi]), times, x, weights))
def mystack(alist):
    # concatenate without offset
    row, col, data = [],[],[]
    for A in alist:
        A = A.tocoo()
        row.extend(A.row)
        col.extend(A.col)
        data.extend(A.data)
    print(len(row),len(col),len(data))
    return sparse.csr_matrix((data, (row, col)))
vimage = mystack(alist)
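As a quick sanity check (a sketch), the stacked result should agree with the single call, since both constructions sum duplicate (row, col) entries when converting to CSR:
print(image.shape, vimage.shape)                          # should match
print(numpy.allclose(image.toarray(), vimage.toarray()))  # True if they agree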