I have two data series that are slightly shifted relative to each other. Both contain NaN values that need to be respected, and I would like to align the series automatically. My idea is to use cross-correlation on NumPy arrays. The code below is extremely slow and I would like to speed it up, but as a non-expert in Python I don't see any possibilities for improvement.
The idea is to have a baseline array and a target array. The code calculates the offset of each target position relative to the baseline in a windowed fashion: for each window, it determines how far the data point has to be shifted for an optimal alignment. The first point that can be aligned is at window_size//2 and the last at baseline.size-window_size//2.
import numpy as np
import numpy.ma as ma
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

# NOTE: shift_numpy(arr, lag) is a helper that is not shown in the question.

window_size = 50
N = 100
randN = 10  # number of NaN values per series
baseline = np.random.rand(N,)
target = np.random.rand(N,)
# place randN NaNs at random positions in each series
mask=np.zeros(N,dtype=bool)
mask[:randN] = True
np.random.shuffle(mask)
baseline[mask] = np.nan
np.random.shuffle(mask)
target[mask] = np.nan
stacked = np.column_stack((baseline,target))
stacked_windows = sliding_window_view(stacked, (window_size,2))
offset_np = np.zeros([stacked.shape[0], ])
offset_np[:] = np.nan
for idx in range(stacked_windows.shape[0]):
    window = stacked_windows[idx]
    baseline_window_np = window.reshape(window_size,2)[:,0]
    target_window_np = window.reshape(window_size,2)[:,1]
    # mask the NaNs so that they are ignored by ma.corrcoef
    baseline_window_masked = ma.masked_invalid(baseline_window_np)
    target_window_masked = ma.masked_invalid(target_window_np)
    # correlation coefficient for every lag in the window
    cc_np = np.zeros([window_size, ])
    cc_np[:] = np.nan
    for lag in range(-int(window_size//2),int(window_size//2)):
        masked_tmp = ma.masked_invalid(shift_numpy(target_window_masked, lag))
        cc_np[lag+int(window_size//2)] = ma.corrcoef(baseline_window_masked,masked_tmp)[0,1]
    if not np.isnan(cc_np).all():
        offset_np[window_size//2+idx] = np.floor(window_size//2)-np.argmax(cc_np)
result_np = np.column_stack((stacked, offset_np))
result_df = df = pd.DataFrame(result_np, columns = ['baseline','target','offset'])
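For readers who want to try the per-window step in isolation, here is a minimal NumPy-only sketch of a NaN-aware lag search. It is not the asker's code: shift_with_nan stands in for the shift_numpy helper that is not shown, best_lag and min_overlap are hypothetical names, and the sign convention of the returned lag is an assumption that may differ from the offset column above.

```python
import numpy as np

def shift_with_nan(arr, lag):
    """Shift arr by lag samples and fill the vacated positions with NaN.

    Hypothetical stand-in for the shift_numpy helper from the question.
    """
    out = np.full(arr.shape, np.nan)
    if lag > 0:
        out[lag:] = arr[:-lag]
    elif lag < 0:
        out[:lag] = arr[-lag:]
    else:
        out[:] = arr
    return out

def best_lag(baseline, target, max_lag, min_overlap=3):
    """Return the lag in [-max_lag, max_lag] with the highest Pearson correlation.

    Only positions where both series are non-NaN after shifting are used.
    Returns None if no lag has at least min_overlap valid pairs.
    """
    best, best_cc = None, -np.inf
    for lag in range(-max_lag, max_lag + 1):
        shifted = shift_with_nan(target, lag)
        valid = ~np.isnan(baseline) & ~np.isnan(shifted)
        if valid.sum() < min_overlap:
            continue
        cc = np.corrcoef(baseline[valid], shifted[valid])[0, 1]
        if cc > best_cc:  # a NaN correlation never passes this test
            best, best_cc = lag, cc
    return best

rng = np.random.default_rng(1)
signal = rng.random(200)
target = shift_with_nan(signal, -3)   # target runs 3 samples ahead of signal
target[[10, 50, 120]] = np.nan
base = signal.copy()
base[[5, 77]] = np.nan
lag = best_lag(base, target, max_lag=10)
```

In this toy case, shifting target by lag samples lines it up with base again. The sketch only illustrates the NaN handling; it does nothing about the speed of the outer window loop.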
I am trying to run a 9x9 pixel kernel with a custom filter across a large satellite image. One satellite scene is about 40 GB, so to fit it into my RAM I use xarray's option to chunk the dataset with dask.
My filter includes a check whether the kernel is complete (i.e. not missing data at the edge of the image). If the kernel is incomplete, a NaN is returned to prevent a potential bias (and I don't really care about the edges). I now realize that this introduces NaNs not only at the edges of the image (expected behaviour) but also along the edges of each chunk, because the chunks don't overlap. dask provides options to create chunks with an overlap, but are there any comparable capabilities in xarray? I found this issue, but it doesn't seem like there has been any progress in this regard.
Some sample code (shortened version of my original code):
import numpy as np
import numba
import math
import xarray as xr

#numba.jit("f4[:,:](f4[:,:],i4)", nopython = True)
def water_anomaly_filter(input_arr, window_size = 9):
    # check if window size is odd
    if window_size%2 == 0:
        raise ValueError("Window size must be odd!")
    # prepare an output array with NaNs and the same dtype as the input
    output_arr = np.zeros_like(input_arr)
    output_arr[:] = np.nan
    # calculate how many pixels in x and y direction around the center pixel
    # are in the kernel
    pix_dist = math.floor(window_size/2-0.5)
    # create a dummy weight matrix
    weights = np.ones((window_size, window_size))
    # get the shape of the input array
    xn,yn = input_arr.shape
    # iterate over the x axis
    for x in range(xn):
        # determine limits of the kernel in x direction
        xmin = max(0, x - pix_dist)
        xmax = min(xn, x + pix_dist+1)
        # iterate over the y axis
        for y in range(yn):
            # determine limits of the kernel in y direction
            ymin = max(0, y - pix_dist)
            ymax = min(yn, y + pix_dist+1)
            # extract data values inside the kernel
            kernel = input_arr[xmin:xmax, ymin:ymax]
            # if the kernel is complete (i.e. not at image edge...) and it
            # is not all NaN
            if kernel.shape == weights.shape and not np.isnan(kernel).all():
                # apply the filter. In this example simply keep the original
                # value
                output_arr[x,y] = input_arr[x,y]
    return output_arr
def run_water_anomaly_filter_xr(xds, var_prefix = "band",
                                window_size = 9):
    variables = [x for x in list(xds.variables) if x.startswith(var_prefix)]
    for var in variables[:2]:
        xds[var].values = water_anomaly_filter(xds[var].values,
                                               window_size = window_size)
    return xds

def create_test_nc():
    data = np.random.randn(1000, 1000).astype(np.float32)
    rows = np.arange(54, 55, 0.001)
    cols = np.arange(10, 11, 0.001)
    ds = xr.Dataset(
        data_vars=dict(
            band_1=(["x", "y"], data)
        ),
        coords=dict(
            lon=(["x"], rows),
            lat=(["y"], cols),
        ),
        attrs=dict(description="Testdata"),
    )
    ds.to_netcdf("test.nc")

if __name__ == "__main__":
    # if required, create test data
    create_test_nc()
    # import data
    with xr.open_dataset("test.nc",
                         chunks = {"x": 50,
                                   "y": 50},
                         ) as xds:
        xds_2 = xr.map_blocks(run_water_anomaly_filter_xr,
                              xds,
                              template = xds).compute()
    xds_2["band_1"][:200,:200].plot()
This yields:
[Image: plot of xds_2["band_1"][:200,:200]]
You can clearly see the rows and columns of NaNs along the edges of each chunk.
I'm happy for any suggestions. I would love to get the overlapping chunks (or any other solution) within xarray, but I'm also open for other solutions.
You can use Dask's map_overlap as follows:
import dask.array

arr = dask.array.map_overlap(
    water_anomaly_filter, xds.band_1.data, dtype='f4', depth=4, window_size=9
).compute()
da = xr.DataArray(arr, dims=xds.band_1.dims, coords=xds.band_1.coords)
Note that you will likely want to tune depth and window_size for your specific application.
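To see why depth=4 pairs with window_size=9 (4 is 9 // 2), here is a NumPy-only sketch of what an overlapping-chunk scheme does conceptually. It is not dask's implementation: needs_full_window is a simplified stand-in for the filter in the question, and filter_in_chunks and chunk are hypothetical names used only for illustration.

```python
import numpy as np

def needs_full_window(arr, window_size=9):
    """Toy filter: keep a pixel only if its full window fits inside arr."""
    half = window_size // 2
    out = np.full(arr.shape, np.nan, dtype=arr.dtype)
    if arr.shape[0] > 2 * half and arr.shape[1] > 2 * half:
        out[half:-half, half:-half] = arr[half:-half, half:-half]
    return out

def filter_in_chunks(arr, chunk, window_size=9):
    """Apply the toy filter chunk by chunk, giving each chunk a halo."""
    half = window_size // 2  # halo width, the analogue of depth
    out = np.full(arr.shape, np.nan, dtype=arr.dtype)
    n_rows, n_cols = arr.shape
    for r0 in range(0, n_rows, chunk):
        for c0 in range(0, n_cols, chunk):
            r1, c1 = min(r0 + chunk, n_rows), min(c0 + chunk, n_cols)
            # Extend the chunk by the halo, clipped at the image border.
            rs, cs = max(r0 - half, 0), max(c0 - half, 0)
            re, ce = min(r1 + half, n_rows), min(c1 + half, n_cols)
            block = needs_full_window(arr[rs:re, cs:ce], window_size)
            # Trim the halo again and store only the chunk's own pixels.
            out[r0:r1, c0:c1] = block[r0 - rs:r1 - rs, c0 - cs:c1 - cs]
    return out

data = np.random.rand(40, 40).astype(np.float32)
whole = needs_full_window(data)              # filter applied to the full image
chunked = filter_in_chunks(data, chunk=10)   # same filter, chunk by chunk
```

With a halo of window_size // 2 pixels, the chunked result has NaNs only along the outer image border, the same as the unchunked result. A smaller halo would reintroduce NaN stripes at the chunk edges.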
I implemented an algorithm that uses opencv kmeans to quantize the unique brightness values present in a greyscale image. Quantizing the unique values helped avoid biases towards image backgrounds which are typically all the same value.
However, I struggled to find a way to utilize this data to quantize a given input image.
I implemented a very naive solution, but it is unusably slow for the required input sizes (4000x4000):
for x in range(W):
    for y in range(H):
        center_id = np.argmin([(arr[y,x]-center)**2 for center in centers])
        ret_labels2D[y,x] = sortorder.index(center_id)
        ret_qimg[y,x] = centers[center_id]
Basically, I am simply adjusting each pixel to the predefined level with the minimum squared error.
Is there any way to do this faster? I was trying to process an image of size 4000x4000 and this implementation was completely unusable.
Full code:
import numpy as np
import cv2

def unique_quantize(arr, K, eps = 0.05, max_iter = 100, max_tries = 20):
    """@param arr: 2D numpy array of floats"""
    H, W = arr.shape
    # cluster only the unique values so that large uniform areas do not dominate
    unique_values = np.squeeze(np.unique(arr.copy()))
    unique_values = np.array(unique_values, float)
    if unique_values.ndim == 0:
        unique_values = np.array([unique_values],float)
    unique_values = np.ravel(unique_values)
    unique_values = np.expand_dims(unique_values,1)
    Z = unique_values.astype(np.float32)
    criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER,max_iter,eps)
    compactness, labels, centers = cv2.kmeans(Z,K,None,criteria,max_tries,cv2.KMEANS_RANDOM_CENTERS)
    labels = np.ravel(np.squeeze(labels))
    centers = np.ravel(np.squeeze(centers))
    sortorder = list(np.argsort(centers)) # old index --> index to sortorder
    ret_center = centers[sortorder]
    ret_labels2D = np.zeros((H,W),int)
    ret_qimg = np.zeros((H,W),float)
    # slow part: one Python-level nearest-center search per pixel
    for x in range(W):
        for y in range(H):
            center_id = np.argmin([(arr[y,x]-center)**2 for center in centers])
            ret_labels2D[y,x] = sortorder.index(center_id)
            ret_qimg[y,x] = centers[center_id]
    return ret_center, ret_labels2D, ret_qimg
EDIT: I looked at the input file again. The size was actually 12000x12000.
As your image is grayscale (presumably 8 bits), a lookup table will be an efficient solution. It suffices to map all 256 gray levels to their nearest center once, then use the result as a conversion table. Even a 16-bit range (65536 entries) would be significantly accelerated.
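A minimal sketch of the lookup-table idea, assuming the image is stored with an unsigned integer dtype such as uint8 or uint16 (the question's function takes floats, so a conversion would be needed first). The function name quantize_with_lut and the sample values are made up for illustration.

```python
import numpy as np

def quantize_with_lut(img, centers):
    """Map every pixel of an unsigned-integer image to its nearest center."""
    centers = np.asarray(centers, dtype=float)
    # All gray levels the dtype can hold, e.g. 0..255 for uint8.
    levels = np.arange(np.iinfo(img.dtype).max + 1, dtype=float)
    # Nearest center for each possible gray level, computed once.
    nearest = np.argmin((levels[:, None] - centers[None, :]) ** 2, axis=1)
    labels = nearest[img]  # fancy indexing applies the table to every pixel
    return labels, centers[labels]

img = np.array([[0, 10, 200], [255, 120, 130]], dtype=np.uint8)
centers = [5.0, 125.0, 250.0]
labels, qimg = quantize_with_lut(img, centers)
```

The per-pixel work is reduced to one array lookup, so the cost of the nearest-center search depends on the number of gray levels rather than on the image size.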
I recently thought of a much better answer. This code is not extensively tested, but it worked for the use case in my project.
I made use of obscure fancy-indexing techniques in order to keep the entire algorithm contained within numpy functions.
def unique_quantize(arr, K, eps = 0.05, max_iter = 100, max_tries = 20):
    """@param arr: 2D numpy array of floats"""
    H, W = arr.shape
    unique_values = np.squeeze(np.unique(arr.copy()))
    unique_values = np.array(unique_values, float)
    if unique_values.ndim == 0:
        unique_values = np.array([unique_values],float)
    unique_values = np.ravel(unique_values)
    unique_values = np.expand_dims(unique_values,1)
    Z = unique_values.astype(np.float32)
    criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER,max_iter,eps)
    compactness, labels, centers = cv2.kmeans(Z,K,None,criteria,max_tries,cv2.KMEANS_RANDOM_CENTERS)
    labels = np.ravel(np.squeeze(labels))
    centers = np.ravel(np.squeeze(centers))
    sortorder = np.argsort(centers) # old index --> index to sortorder
    inverse_sortorder = np.array([list(sortorder).index(i) for i in range(len(centers))],int)
    ret_center = centers[sortorder]
    ret_labels2D = np.zeros((H,W),int)
    ret_qimg = np.zeros((H,W),float)
    errors = [np.power((arr-center),2) for center in centers]
    errors = np.array(errors,float)
    classification = np.squeeze(np.argmin(errors,axis=0))
    ret_labels2D = inverse_sortorder[classification]
    ret_qimg = centers[classification]
    return np.array(ret_center,float), np.array(ret_labels2D,int), np.array(ret_qimg,float)
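The errors array above holds K full-size copies of the image, which can get large for a 12000x12000 input. As a possible alternative, here is a sketch that exploits the fact that the centers are one-dimensional: once they are sorted, the midpoints between neighbouring centers are the decision boundaries, and np.searchsorted assigns labels without building the K copies. The function name nearest_center_sorted is hypothetical, and values that fall exactly on a midpoint may be labelled differently than with argmin.

```python
import numpy as np

def nearest_center_sorted(arr, centers):
    """Label each element of arr with the index of its nearest center.

    Labels refer to the centers sorted in ascending order.
    """
    sorted_centers = np.sort(np.asarray(centers, dtype=float))
    # Midpoints between neighbouring centers are the decision boundaries.
    boundaries = (sorted_centers[:-1] + sorted_centers[1:]) / 2.0
    labels = np.searchsorted(boundaries, arr)  # same shape as arr
    return sorted_centers, labels, sorted_centers[labels]

rng = np.random.default_rng(0)
arr = rng.random((20, 30))
centers = np.array([0.8, 0.1, 0.5])
sorted_centers, labels2d, qimg = nearest_center_sorted(arr, centers)
```

The three return values correspond to ret_center, ret_labels2D and ret_qimg in the function above.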
I have a pandas data frame and want to calculate some features based on short_window, long_window and bins values. For each row, I take the slice df_long = df.loc[row:long_window+row] and compute the features from it, then move one row forward: in the first iteration (row=0) the slice is df.loc[0:50+0], for row=1 it is df.loc[1:50+1], and so on.
from numpy.random import seed
from numpy.random import randint
import pandas as pd
from joblib import Parallel, delayed

bins = 12
short_window = 10
long_window = 50
# seed random number generator
seed(1)
price = pd.DataFrame({
    'DATE_TIME': pd.date_range('2012-01-01', '2012-02-01', freq='30min'),
    'value': randint(2, 20, 1489),
    'amount': randint(50, 200, 1489)
})

def vap(row, df, short_window, long_window, bins):
    df_long = df.loc[row:long_window+row]
    df_short = df_long.tail(short_window)
    binning = pd.cut(df_long['value'], bins, retbins=True)[1]
    group_months = pd.DataFrame(df_short['amount'].groupby(pd.cut(df_short['value'], binning)).sum())
    return group_months['amount'].tolist(), df.loc[long_window + row + 1, 'DATE_TIME']

def feature_extraction(data, short_window, long_window, bins):
    # Vap feature extraction
    ls = [f"feature{row + 1}" for row in range(bins)]
    amount, date = zip(*Parallel(n_jobs=4)(delayed(vap)(i, data, short_window, long_window, bins)
                                           for i in range(0, data.shape[0] - long_window - 1)))
    temp = pd.DataFrame(date, columns=['DATE_TIME'])
    temp[ls] = pd.DataFrame(amount, index=temp.index)
    data = data.merge(temp, on='DATE_TIME', how='outer')
    return data

df = feature_extraction(price, short_window, long_window, bins)
I tried to run it in parallel in order to save time, but due to the dimensions of my data it still takes a long time to finish.
Is there any way to change this iterative process (df_long = df.loc[row:long_window+row]) in order to reduce the computational cost? I was wondering if there is any way to use pandas.rolling but I am not sure how to use it in this case.
Any help would be much appreciated!
Thank you
This is a first attempt at speeding up the calculation. I checked the first 100 rows and found that the binning variable was always the same, so I wrote an efficient algorithm with fixed bins. When I ran it on the whole data, however, I found about 100 rows out of 1489 with a different binning variable, so the solution below deviates from the original result in about 100 rows.
Benchmarking:
My fast function: 28 ms
My precise function: 388 ms
Original function: 12200 ms
So a speed-up of roughly 430 times for the fast function and roughly 30 times for the precise function.
Fast function code:
import numpy as np

def feature_extraction2(data, short_window, long_window, bins):
    ls = [f"feature{row + 1}" for row in range(bins)]
    # fixed bin edges, computed once from the value range [2, 19]
    binning = pd.cut([2,19], bins, retbins=True)[1]
    bin_group = np.digitize(data['value'], binning, right=True)
    l_sum = []
    for i in range(1, bins+1):
        # rolling sum of the amounts that fall into bin i
        sum1 = ((bin_group == i)*data['amount']).rolling(short_window).sum()
        l_sum.append(sum1)
    ar_sum = np.array(l_sum).T
    # shift the features so that they line up with the original output rows
    ar_shifted = np.empty_like(ar_sum)
    ar_shifted[:long_window+1,:] = np.nan
    ar_shifted[long_window+1:,:] = ar_sum[long_window:-1,:]
    temp = pd.DataFrame(ar_shifted, columns = ls)
    data = pd.concat([data,temp], axis = 1, sort = False)
    return data
Precise function:
data = price.copy()
# Vap feature extraction
ls = [f"feature{row + 1}" for row in range(bins)]
norm_volume = []
date = []
for i in range(0, data.shape[0] - long_window - 1):
    row = i
    df = data
    df_long = df.loc[row:long_window+row]
    df_short = df_long.tail(short_window)
    binning = pd.cut(df_long['value'], bins, retbins=True)[1]
    group_months = df_short['amount'].groupby(pd.cut(df_short['value'], binning)).sum().values
    x,y = group_months, df.loc[long_window + row + 1, 'DATE_TIME']
    norm_volume.append(x)
    date.append(y)
temp = pd.DataFrame(date, columns=['DATE_TIME'])
temp[ls] = pd.DataFrame(norm_volume, index=temp.index)
data = data.merge(temp, on='DATE_TIME', how='outer')
I'm running a loop that appends values to an empty dataframe defined outside of the loop. However, when the loop is done, the dataframe remains empty. I'm not sure what's going on. The goal is to find the power value that results in the lowest sum of squared residuals.
Example code below:
import numpy as np
import pandas as pd
import tweedie

power_list = np.arange(1.3, 2, .01)
mean = 353.77
std = 17298.24
size = 860310
# draw `size` samples (x is not defined yet, so len(x) cannot be used here)
x = tweedie.tweedie(mu = mean, p = 1.5, phi = 50).rvs(size)
variance = 299228898.89
sum_ssr_df = pd.DataFrame(columns = ['power', 'dispersion', 'ssr'])
for i in power_list:
    power = i
    phi = variance/(mean**power)
    tvs = tweedie.tweedie(mu = mean, p = power, phi = phi).rvs(len(x))
    sort_tvs = np.sort(tvs)
    df = pd.DataFrame([x, sort_tvs]).transpose()
    df.columns = ['actual', 'random']
    df['residual'] = df['actual'] - df['random']
    ssr = df['residual']**2
    sum_ssr = np.sum(ssr)
    df_i = pd.DataFrame([i, phi, sum_ssr])
    df_i = df_i.transpose()
    df_i.columns = ['power', 'dispersion', 'ssr']
    sum_ssr_df.append(df_i)
sum_ssr_df[sum_ssr_df['ssr'] == sum_ssr_df['ssr'].min()]
What exactly am I doing incorrectly?
This code isn't as efficient as it could be, as noted by ALollz. When you append, it basically creates a new dataframe in memory (I'm oversimplifying here).
The error in your code is:
sum_ssr_df.append(df_i)
should be:
sum_ssr_df = sum_ssr_df.append(df_i)
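Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions the line above raises an AttributeError. A common pattern that also avoids the repeated copying is to collect plain rows in a list and build the dataframe once after the loop. The sketch below uses a placeholder formula for ssr and a constant dispersion; both are assumptions for illustration, not the tweedie computation from the question.

```python
import pandas as pd

rows = []
for power in (1.3, 1.4, 1.5):
    ssr = (power - 1.4) ** 2  # placeholder for the real sum of squared residuals
    rows.append({"power": power, "dispersion": 1.0, "ssr": ssr})

# Build the dataframe once, after the loop has finished.
sum_ssr_df = pd.DataFrame(rows, columns=["power", "dispersion", "ssr"])

# Row with the smallest sum of squared residuals.
best = sum_ssr_df.loc[sum_ssr_df["ssr"].idxmin()]
```

If per-iteration dataframes are really needed, collecting them in a list and calling pd.concat once at the end follows the same idea.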
I'm trying to vectorize some code with numpy so that I can run it using multiprocessing, but I can't understand how numpy.apply_along_axis works. This is an example of the code, vectorized using map:
import numpy
from scipy import sparse
import multiprocessing
from matplotlib import pyplot

#first i build a matrix of some x positions vs time datas in a sparse format
matrix = numpy.random.randint(2, size = 100).astype(float).reshape(10,10)
x = numpy.nonzero(matrix)[0]
times = numpy.nonzero(matrix)[1]
weights = numpy.random.rand(x.size)
#then i define an array of y positions
nStepsY = 5
y = numpy.arange(1,nStepsY+1)

#now i build an image using x-y-times coordinates and x-times weights
def mapIt(ithStep):
    ncolumns = 80
    image = numpy.zeros(ncolumns)
    yTimed = y[ithStep]*times
    positions = (numpy.round(x-yTimed)+50).astype(int)
    values = numpy.bincount(positions,weights)
    values = values[numpy.nonzero(values)]
    positions = numpy.unique(positions)
    image[positions] = values
    return image

image = list(map(mapIt, range(nStepsY)))
image = numpy.array(image)
a = pyplot.imshow(image, aspect = 10)
Here is the output plot:
I tried to use numpy.apply_along_axis, but this function only lets me iterate along the rows of image, while I need to iterate along the ithStep index too. E.g.:
#now i build an image using x-y-times coordinates and x-times weights
nrows = nStepsY
ncolumns = 80
matrix = numpy.zeros(nrows*ncolumns).reshape(nrows,ncolumns)

def applyIt(image):
    image = numpy.zeros(ncolumns)
    yTimed = y[ithStep]*times
    positions = (numpy.round(x-yTimed)+50).astype(int)
    values = numpy.bincount(positions,weights)
    values = values[numpy.nonzero(values)]
    positions = numpy.unique(positions)
    image[positions] = values
    return image

imageApplied = numpy.apply_along_axis(applyIt,1,matrix)
a = pyplot.imshow(imageApplied, aspect = 10)
It obviously returns only the first row nrows times, since nothing iterates ithStep:
And here is the wrong plot
Is there a way to iterate an index, or to use an index while numpy.apply_along_axis iterates?
Here is the code using only matrix operations: it is considerably faster than map or apply_along_axis, but it uses a lot of memory.
(In this function I use a trick with scipy.sparse, which works more intuitively than numpy arrays when you try to sum several numbers onto the same element.)
def fullmatrix(nRows, nColumns):
    y = numpy.arange(1,nStepsY+1)
    image = numpy.zeros((nRows, nColumns))
    yTimed = numpy.outer(y,times)
    x3d = numpy.outer(numpy.ones(nStepsY),x)
    weights3d = numpy.outer(numpy.ones(nStepsY),weights)
    y3d = numpy.outer(y,numpy.ones(x.size))
    positions = (numpy.round(x3d-yTimed)+50).astype(int)
    matrix = sparse.coo_matrix((numpy.ravel(weights3d), (numpy.ravel(y3d), numpy.ravel(positions)))).todense()
    return matrix

image = fullmatrix(nStepsY, 80)
a = pyplot.imshow(image, aspect = 10)
This way is simpler and very fast! Thank you so much.
nStepsY = 5
nRows = nStepsY
nColumns = 80
y = numpy.arange(1,nStepsY+1)
image = numpy.zeros((nRows, nColumns))
# one row index (0) per point; positions is not defined yet here, and
# positions.size equals x.size, so x.size is used instead
fakeRow = numpy.zeros(x.size, dtype=int)

def itermatrix(ithStep):
    yTimed = y[ithStep]*times
    positions = (numpy.round(x-yTimed)+50).astype(int)
    matrix = sparse.coo_matrix((weights, (fakeRow, positions))).todense()
    matrix = numpy.ravel(matrix)
    # pad with zeros up to the full row length
    missColumns = (nColumns-matrix.size)
    zeros = numpy.zeros(missColumns)
    matrix = numpy.concatenate((matrix, zeros))
    return matrix

for i in numpy.arange(nStepsY):
    image[i] = itermatrix(i)

#or, without initialization of image:
imageMapped = list(map(itermatrix, range(nStepsY)))
imageMapped = numpy.array(imageMapped)
It feels like attempting to use map or apply_along_axis is obscuring the essentially iterative nature of the problem.
I rewrote your code as an explicit loop on y:
nStepsY = 5
y = numpy.arange(1,nStepsY+1)
image = numpy.zeros((nStepsY, 80))
for i, yi in enumerate(y):
    yTimed = yi*times
    positions = (numpy.round(x-yTimed)+50).astype(int)
    values = numpy.bincount(positions,weights)
    values = values[numpy.nonzero(values)]
    positions = numpy.unique(positions)
    image[i, positions] = values
a = pyplot.imshow(image, aspect = 10)
pyplot.show()
Looking at the code, I think I could calculate positions for all y values making a (y.shape[0],times.shape[0]) array. But the rest, the bincount and unique still have to work row by row.
apply_along_axis, when working with a 2d array and axis=1, essentially does:
res = np.zeros_like(arr)
for i in range(arr.shape[0]):
    res[i,:] = func1d(arr[i,:])
If the input array has more dimensions it constructs a more elaborate indexing object [i,j,k,:]. And it can handle cases where func1d returns a different size array than the input. But in any case it is just a generalized iteration tool.
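A small runnable check of that equivalence, using a made-up func1d (a cumulative sum) and a toy array. It covers only the simple 2d, axis=1 case where the output has the same shape as the input.

```python
import numpy as np

def func1d(row):
    # Example 1-D function: cumulative sum of one row.
    return np.cumsum(row)

arr = np.arange(12, dtype=float).reshape(3, 4)

# Explicit loop over the rows.
res = np.zeros_like(arr)
for i in range(arr.shape[0]):
    res[i, :] = func1d(arr[i, :])

# The same computation through apply_along_axis.
applied = np.apply_along_axis(func1d, 1, arr)
```

Both results are identical; apply_along_axis does not vectorize func1d, it only hides the loop.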
Moving the initial positions creation outside the loop:
yTimed = y[:,None]*times
positions = (numpy.round(x-yTimed)+50).astype(int)
image = numpy.zeros((positions.shape[0], 80))
for i, pos in enumerate(positions):
    values = numpy.bincount(pos,weights)
    values = values[numpy.nonzero(values)]
    pos = numpy.unique(pos)
    image[i, pos] = values
Now I can cast this as an apply_along_axis problem, with an applyIt that takes a positions vector (with all the yTimed information) rather than blank image vector.
def applyIt(pos, size, weights):
    acolumn = numpy.zeros(size)
    values = numpy.bincount(pos,weights)
    values = values[numpy.nonzero(values)]
    pos = numpy.unique(pos)
    acolumn[pos] = values
    return acolumn

image = numpy.apply_along_axis(applyIt, 1, positions, 80, weights)
Timing-wise I expect it's a bit slower than my explicit iteration. It has to do more setup work, including a test call applyIt(positions[0,:],...) to determine the size of its return array (i.e. image has a different shape than positions).
def csrmatrix(y, times, x, weights):
    yTimed = numpy.outer(y,times)
    n=y.shape[0]
    x3d = numpy.outer(numpy.ones(n),x)
    weights3d = numpy.outer(numpy.ones(n),weights)
    y3d = numpy.outer(y,numpy.ones(x.size))
    positions = (numpy.round(x3d-yTimed)+50).astype(int)
    #print(y.shape, weights3d.shape, y3d.shape, positions.shape)
    matrix = sparse.csr_matrix((numpy.ravel(weights3d), (numpy.ravel(y3d), numpy.ravel(positions))))
    #print(repr(matrix))
    return matrix

# one call
image = csrmatrix(y, times, x, weights)

# iterative call
alist = []
for yi in numpy.arange(1,nStepsY+1):
    alist.append(csrmatrix(numpy.array([yi]), times, x, weights))

def mystack(alist):
    # concatenate without offset
    row, col, data = [],[],[]
    for A in alist:
        A = A.tocoo()
        row.extend(A.row)
        col.extend(A.col)
        data.extend(A.data)
    print(len(row),len(col),len(data))
    return sparse.csr_matrix((data, (row, col)))

vimage = mystack(alist)
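If a dense result is acceptable, the summing of duplicate entries that the sparse constructors provide can also be obtained with np.add.at. The sketch below is one possible variant, not the code from this thread: build_image, n_columns and offset are hypothetical names, and it assumes that every computed position falls inside [0, n_columns), since a negative index would silently wrap around.

```python
import numpy as np

def build_image(x, times, weights, y, n_columns=80, offset=50):
    """Accumulate weights into one image row per y value, without a Python loop."""
    # One row of column positions per y value, shape (len(y), len(x)).
    positions = (np.round(x - y[:, None] * times) + offset).astype(int)
    # Row index for every entry of positions, in the same row-major order.
    rows = np.repeat(np.arange(len(y)), positions.shape[1])
    image = np.zeros((len(y), n_columns))
    # np.add.at accumulates repeated (row, column) pairs instead of overwriting.
    np.add.at(image, (rows, positions.ravel()), np.tile(weights, len(y)))
    return image

rng = np.random.default_rng(0)
matrix = rng.integers(0, 2, size=(10, 10))
x, times = np.nonzero(matrix)
weights = rng.random(x.size)
y = np.arange(1, 6)
image = build_image(x, times, weights, y)
```

Unlike the sparse versions, the row index here is the position of each value in y (starting at 0) rather than the y value itself, so the image has exactly len(y) rows.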